    lecture-23-slides.pdf
    Carnegie Mellon University
    Database Systems
    Distributed OLTP Databases

ADMINISTRIVIA

Project #4 is due Sunday, April 20th @ 11
$\longrightarrow$ Recitation: Friday, April 11th in GHC 4303 from 3-4 PM
HW6 is due Sunday, April 20, 2025 @ 11
Final Exam is on Monday, April 28, 2025, from 05-08
$\longrightarrow$ Early exam will not be offered. Do not make travel plans.
This course is recruiting TAs for the next semester
$\longrightarrow$ Apply at: https://www.ugrad.cs.cmu.edu/ta/F25/

    ADMINISTRIVIA

Class on Monday, April 21: Review Session
$\rightarrow$ Come to class prepared with your questions. What material do you want me to go over again?
Class on Wednesday, April 23: Guest Lecture
$\rightarrow$ Real-world applications of Gen AI and Databases
$\rightarrow$ Speaker: Sailesh Krishnamurthy, Google

    UPCOMING DATABASE TALKS

MariaDB (DB Seminar)
$\rightarrow$ Monday, April 14 @ 4
$\rightarrow$ MariaDB's New Query Optimizer
$\rightarrow$ Speaker: Michael Widenius
$\rightarrow$ https://cmu.zoom.us/j/93441451665
Gel (DB Seminar)
$\rightarrow$ Monday, April 21 @ 4
$\rightarrow$ EdgeQL with Gel
$\rightarrow$ Speaker: Michael Sullivan
$\rightarrow$ https://cmu.zoom.us/j/93441451665

    LAST CLASS

System Architectures
$\rightarrow$ Shared-Everything, Shared-Disk, Shared-Nothing
Partitioning/Sharding
$\rightarrow$ Hash, Range, Round Robin
Transaction Coordination
$\rightarrow$ Centralized vs. Decentralized

    OLTP VS. OLAP

    On-line Transaction Processing (OLTP):

$\longrightarrow$ Short-lived read/write txns.
$\longrightarrow$ Small footprint.
$\longrightarrow$ Repetitive operations.

    On-line Analytical Processing (OLAP):

$\longrightarrow$ Long-running, read-only queries.
$\longrightarrow$ Complex joins.
$\longrightarrow$ Exploratory queries.

    DECENTRALIZED COORDINATOR

    Partitions


    OBSERVATION

    Recall that our goal is to have multiple physical nodes appear as a single logical DBMS.
We have not discussed how to ensure that all nodes agree to commit a txn and then to make sure it does commit if the DBMS decides it should.
$\rightarrow$ What happens if a node fails?
$\rightarrow$ What happens if messages show up late?
$\rightarrow$ What happens if the system does not wait for every node to agree to commit?

    IMPORTANT ASSUMPTION

We will assume that all nodes in a distributed DBMS are well-behaved and under the same administrative domain.
$\rightarrow$ If we tell a node to commit a txn, then it will commit the txn (if there is no failure).
If you do not trust the other nodes in a distributed DBMS, then you need to use a Byzantine Fault Tolerant protocol for txns (blockchain).
$\rightarrow$ Blockchains are not good for high-throughput workloads.


    TODAY'S AGENDA

Replication
Atomic Commit Protocols
Consistency Issues (CAP / PACELC)

    REPLICATION

The DBMS can replicate a database across redundant nodes to increase availability.
$\longrightarrow$ Partitioned vs. Non-Partitioned
$\longrightarrow$ Shared-Nothing vs. Shared-Disk
Design Decisions:
$\longrightarrow$ Replica Configuration
$\longrightarrow$ Propagation Scheme
$\longrightarrow$ Propagation Timing
$\longrightarrow$ Update Method

    REPLICA CONFIGURATIONS

    Approach #1: Primary-Replica

$\longrightarrow$ All updates go to a designated primary for each object.
$\longrightarrow$ The primary propagates updates to its replicas by shipping logs.
$\longrightarrow$ Read-only txns may be allowed to access replicas.
$\longrightarrow$ If the primary goes down, then hold an election to select a new primary.

    Approach #2: Multi-Primary

$\longrightarrow$ Txns can update data objects at any replica.
$\longrightarrow$ Replicas must synchronize with each other using an atomic commit protocol.
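As an illustration of the Primary-Replica configuration above, here is a minimal Python sketch of a primary shipping log records to its replicas. All names (`Primary`, `Replica`, plain-integer LSNs) are invented for this sketch, not from any real DBMS:

```python
# Hypothetical sketch of primary-replica replication via log shipping.
# The primary assigns each update a log sequence number (LSN) and ships
# the record to every replica, which applies records in LSN order.

class Replica:
    def __init__(self):
        self.data = {}
        self.applied_lsn = 0

    def apply(self, lsn, key, value):
        # Replicas apply log records strictly in LSN order.
        assert lsn == self.applied_lsn + 1
        self.data[key] = value
        self.applied_lsn = lsn

class Primary:
    def __init__(self, replicas):
        self.data = {}
        self.next_lsn = 1
        self.replicas = replicas

    def write(self, key, value):
        # All updates go to the primary, which ships the log record
        # to each replica (synchronous propagation shown here).
        lsn = self.next_lsn
        self.next_lsn += 1
        self.data[key] = value
        for r in self.replicas:
            r.apply(lsn, key, value)
        return lsn

replicas = [Replica(), Replica()]
primary = Primary(replicas)
primary.write("k1", "v1")
primary.write("k2", "v2")
# Every replica now holds the same data as the primary.
```

If the primary fails, any replica that has applied the full log prefix can be elected as the new primary.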


    K-SAFETY

K-safety is a threshold for determining the fault tolerance of the replicated database.
    The value $K$ represents the number of replicas per data object that must always be available.
    If the number of replicas goes below this threshold, then the DBMS halts execution and takes itself offline.
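The K-safety rule above fits in a few lines. This is a hypothetical illustration (the function name and data layout are invented):

```python
def is_available(replica_counts, k):
    """replica_counts maps each data object to its number of live replicas.
    Under the K-safety rule, the DBMS stays online only while every object
    still has at least k available replicas."""
    return all(count >= k for count in replica_counts.values())

# Three objects with K = 2: every object has at least 2 live replicas.
print(is_available({"a": 3, "b": 2, "c": 3}, k=2))  # True
# Object 'b' drops to 1 replica: below the threshold, so the DBMS
# halts execution and takes itself offline.
print(is_available({"a": 3, "b": 1, "c": 3}, k=2))  # False
```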

    PROPAGATION SCHEME

When a txn commits on a replicated database, the DBMS decides whether it must wait for that txn's changes to propagate to other nodes before it can send the acknowledgement to the application.
Propagation levels:
$\rightarrow$ Synchronous (Strong Consistency)
$\rightarrow$ Asynchronous (Eventual Consistency)

    PROPAGATION SCHEME

    Approach #1: Synchronous

    → The primary sends updates to replicas and then waits for them to acknowledge that they fully applied (i.e., logged) the changes.


    Approach #2: Asynchronous

    → The primary immediately returns the acknowledgement to the client without waiting for replicas to apply the changes.
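The trade-off between the two propagation schemes can be sketched as follows. `PropagatingPrimary` and `LogReplica` are invented names for illustration; no real DBMS API is implied:

```python
class LogReplica:
    """Toy replica that appends shipped log records."""
    def __init__(self):
        self.log_records = []

    def log(self, record):
        self.log_records.append(record)
        return True   # acknowledge that the change was logged

class PropagatingPrimary:
    """Toy primary illustrating synchronous vs. asynchronous propagation."""
    def __init__(self, replicas, synchronous=True):
        self.replicas = replicas
        self.synchronous = synchronous
        self.backlog = []   # records not yet shipped (async mode only)

    def commit(self, record):
        if self.synchronous:
            # Synchronous: wait until every replica acknowledges that it
            # logged the change before acking the client (strong consistency).
            assert all(r.log(record) for r in self.replicas)
        else:
            # Asynchronous: ack the client immediately and ship the log
            # later (eventual consistency -- replicas may briefly lag).
            self.backlog.append(record)
        return "ACK"

    def propagate(self):
        # Background shipping used by the asynchronous scheme.
        for record in self.backlog:
            for r in self.replicas:
                r.log(record)
        self.backlog.clear()

rs = [LogReplica(), LogReplica()]
p = PropagatingPrimary(rs, synchronous=False)
p.commit("w1")     # client sees the ACK before any replica logged "w1"
p.propagate()      # replicas catch up later
```

The asynchronous mode returns the acknowledgement while `backlog` is still non-empty, which is exactly the window where a primary failure can lose acknowledged writes.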


    PROPAGATION TIMING

Approach #1: Continuous

→ The DBMS sends log messages immediately as it generates them.
→ Also need to send a commit/abort message.

Approach #2: On Commit

→ The DBMS only sends the log messages for a txn to the replicas once the txn commits.
→ Do not waste time sending log records for aborted txns.

    ACTIVE VS. PASSIVE

    Approach #1: Active-Active

$\rightarrow$ A txn executes at each replica independently.
$\rightarrow$ Need to check at the end whether the txn ends up with the same result at each replica.

    Approach #2: Active-Passive

$\rightarrow$ Each txn executes at a single location and propagates the changes to the replica.
$\rightarrow$ Can either do physical or logical replication.
$\rightarrow$ Not the same as Primary-Replica vs. Multi-Primary.

    OBSERVATION

    If only one node decides whether a txn is allowed to commit, then making that decision is easy.
Life is much harder when multiple nodes are allowed to decide:
$\rightarrow$ What if multiple nodes need to agree a txn is allowed to commit?
$\rightarrow$ What if a primary node goes down and the system needs to choose a new primary?

    ATOMIC COMMIT PROTOCOL

Coordinating the commit order of txns across nodes in a distributed DBMS.
$\longrightarrow$ Commit Order = State Machine
$\longrightarrow$ It does not matter whether the database's contents are replicated or partitioned.

Examples:

$\longrightarrow$ Two-Phase Commit (1970s)
$\longrightarrow$ Three-Phase Commit (1983)
$\longrightarrow$ Viewstamped Replication (1988)
$\longrightarrow$ Paxos (1989)
$\longrightarrow$ ZAB (2008?)
$\longrightarrow$ Raft (2013)


    ATOMIC COMMIT PROTOCOL

    Resource Managers (RMs)

$\rightarrow$ Execute on different nodes.
$\rightarrow$ Coordinate to decide the fate of a txn.

    Properties of the Commit Protocol

$\rightarrow$ Stability: Once the fate is decided, it cannot be changed.
$\rightarrow$ Consistency: All RMs end up in the same state.

    Assumes Liveness:

$\rightarrow$ There is some way of progressing forward.
$\rightarrow$ Enough nodes are alive and connected for the duration of the protocol.

    TWO-PHASE COMMIT (SUCCESS)


    TWO-PHASE COMMIT (ABORT)


    TWO-PHASE COMMIT

Each node records the inbound/outbound messages and the outcome of each phase in a non-volatile storage log.
On recovery, examine the log for 2PC messages:
$\rightarrow$ If the local txn is in the prepared state, contact the coordinator.
$\rightarrow$ If the local txn is not in the prepared state, abort it.
$\rightarrow$ If the local txn was committing and the node is the coordinator, send the COMMIT message to the other nodes.
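A minimal sketch of the two phases, assuming an in-memory stand-in for the logged state (all class and function names are invented for illustration):

```python
# Two-phase commit sketch: Phase 1 asks every participant to prepare;
# Phase 2 broadcasts COMMIT only if all voted yes, otherwise ABORT.
# Real systems force each state transition to a durable log first.

class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "INIT"

    def prepare(self):
        # A yes vote is a promise: the participant logs a PREPARE record
        # and must be able to commit even after a crash.
        if self.can_commit:
            self.state = "PREPARED"
            return True
        self.state = "ABORTED"
        return False

    def finish(self, commit):
        self.state = "COMMITTED" if commit else "ABORTED"

def two_phase_commit(participants):
    # Phase 1: collect votes from every participant.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: broadcast the decision. The coordinator logs the decision
    # before sending, so on recovery it can re-send it to participants.
    for p in participants:
        p.finish(decision)
    return "COMMIT" if decision else "ABORT"

print(two_phase_commit([Participant(), Participant(), Participant()]))
# prints COMMIT
print(two_phase_commit([Participant(), Participant(can_commit=False)]))
# prints ABORT
```

The blocking problem is visible here: a participant stuck in `PREPARED` cannot unilaterally move to `COMMITTED` or `ABORTED` if the coordinator disappears between the two phases.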


    TWO-PHASE COMMIT FAILURES

    What happens if the coordinator crashes?

$\rightarrow$ Participants must decide what to do after a timeout (this only applies if the participants know of all other participants).
$\rightarrow$ System is not available during this time.

    What happens if the participant crashes?

$\rightarrow$ The coordinator assumes that the participant responded with an abort if it has not sent an acknowledgement yet.
$\rightarrow$ Again, nodes use a timeout to determine whether a participant is dead.

    2PC OPTIMIZATIONS

    Early Prepare Voting (Rare)

$\rightarrow$ If you send a query/request to a remote node that you know will be the last one to execute in this txn, then that node will also return its vote for the prepare phase with the query result.

    Early Ack After Prepare (Common)

    $\rightarrow$ If all nodes vote to commit a txn, the coordinator can send the client an acknowledgement that their txn was successful before the commit phase finishes.

    EARLY ACKNOWLEDGEMENT


    PAXOS

    Consensus protocol where a coordinator proposes an outcome (e.g., commit or abort) and then the participants vote on whether that outcome should succeed.
    Does not block if a majority of participants are available and has provably minimal message delays in the best case.
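A single-decree Paxos round can be sketched as follows. This is an illustrative simplification that omits networking, retries, and crash recovery; all names are invented:

```python
# Single-decree Paxos sketch. Acceptors promise to ignore proposals with
# lower ballots than they have already seen; a value is chosen once a
# majority of acceptors accepts it.

class Acceptor:
    def __init__(self):
        self.promised = 0          # highest ballot promised so far
        self.accepted = (0, None)  # (ballot, value) last accepted

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return self.accepted   # report any previously accepted value
        return None                # reject stale ballots

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    # Phase 1 (Prepare): gather promises from a majority.
    promises = [a.prepare(ballot) for a in acceptors]
    granted = [p for p in promises if p is not None]
    if len(granted) <= len(acceptors) // 2:
        return None  # no majority: caller must retry with a higher ballot
    # If some acceptor already accepted a value, we must propose that one;
    # this is what makes a decided outcome stable.
    prev_ballot, prev_value = max(granted, key=lambda p: p[0])
    if prev_value is not None:
        value = prev_value
    # Phase 2 (Accept): the value is chosen once a majority accepts.
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, ballot=1, value="COMMIT"))  # prints COMMIT
# A later proposer with a higher ballot learns the already-chosen value:
print(propose(acceptors, ballot=2, value="ABORT"))   # prints COMMIT
```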

The Part-Time Parliament

LESLIE LAMPORT, Digital Equipment Corporation
Recent archaeological discoveries on the island of Paxos reveal that the parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their frequent forays from the chamber and the forgetfulness of their messengers. The Paxon parliament's protocol provides a new way of implementing the state-machine approach to the design of distributed systems.
This submission was recently discovered behind a filing cabinet in the TOCS editorial office. Despite its age, the editor-in-chief felt that it was worth publishing. Because the author is currently doing field work in the Greek isles and cannot be reached, I was asked to prepare it for publication.
The author appears to be an archaeologist with only a passing interest in computer science. This is unfortunate; even though the obscure ancient Paxon civilization he describes is of little interest to most computer scientists, its legislative system is an excellent model for how to implement a distributed computer system in an asynchronous environment. Indeed, some of the refinements the Paxons made to their protocol appear to be unknown in the systems literature.
The author does give a brief discussion of the Paxon Parliament's relevance to distributed computing in Section 4. Computer scientists will probably want to read that section first. Even before that, they might want to read the explanation of the algorithm for computer scientists by Lampson [1996]. The algorithm is also described more formally by De Prisco et al. [1997]. I have added further comments on the relation between the ancient protocols and more recent work at the end of Section 4.
Keith Marzullo, University of California, San Diego

    Consensus on Transaction Commit

    JIM GRAY and LESLIE LAMPORT Microsoft Research
The distributed transaction commit problem requires reaching agreement on whether a transaction is committed or aborted. The classic Two-Phase Commit protocol blocks if the coordinator fails. Fault-tolerant consensus algorithms also reach agreement, but do not block whenever any majority of the processes are working. The Paxos Commit algorithm runs a Paxos consensus algorithm on the commit/abort decision of each participant to obtain a transaction commit protocol that uses $2F + 1$ coordinators and makes progress if at least $F + 1$ of them are working properly. Paxos Commit has the same stable-storage write delay, and can be implemented to have the same message delay in the fault-free case, as Two-Phase Commit, but it uses more messages. The classic Two-Phase Commit algorithm is obtained as the special $F = 0$ case of the Paxos Commit algorithm.
Categories and Subject Descriptors: D.4.1 [Operating Systems]: Process Management—Concurrency; D.4.5 [Operating Systems]: Reliability—Fault-tolerance; D.4.7 [Operating Systems]: Organization and Design—Distributed systems
General Terms: Algorithms, Reliability
Additional Key Words and Phrases: Consensus, Paxos, two-phase commit

    1. INTRODUCTION

A distributed transaction consists of a number of operations, performed at multiple sites, terminated by a request to commit or abort the transaction. The sites then use a transaction commit protocol to decide whether the transaction is committed or aborted. The transaction can be committed only if all sites are willing to commit it. Achieving this all-or-nothing atomicity property in a distributed system is not trivial. The requirements for transaction commit are stated precisely in Section 2.
The classic transaction commit protocol is Two-Phase Commit [Gray 1978], described in Section 3. It uses a single coordinator to reach agreement. The failure of that coordinator can cause the protocol to block, with no process knowing the outcome, until the coordinator is repaired. In Section 4, we use the Paxos consensus algorithm [Lamport 1989] to obtain a transaction commit protocol.
    ACM Transactions on Database Systems, Vol. 31, No. 2, March 2006, Pages 133- 160.

    PAXOS


    MULTI-PAXOS

If the system elects a single leader that oversees proposing changes for some period, then it can skip the Propose phase.
$\rightarrow$ Fall back to full Paxos whenever there is a failure.
The system periodically renews the leader (known as a lease) using another Paxos round.
$\rightarrow$ Nodes must exchange log entries during leader election to make sure that everyone is up-to-date.

    2PC VS. PAXOS VS. RAFT

    Two-Phase Commit

    $\rightarrow$ Blocks if coordinator fails after the prepare message is sent, until coordinator recovers.

    Paxos

$\rightarrow$ Non-blocking if a majority of participants are alive, provided there is a sufficiently long period without further failures.

Raft

$\rightarrow$ Similar to Paxos but with fewer node types.
$\rightarrow$ Only nodes with the most up-to-date log can become leaders.

    CAP THEOREM

Proposed in the late 1990s: it is impossible for a distributed database to always be:
→ Consistent
→ Always Available
→ Network Partition Tolerant
A DBMS chooses whether it provides Consistency or Availability during a network partition.

    CONSISTENCY


    AVAILABILITY


    PARTITION TOLERANCE


    Choice #1: Halt the System

$\rightarrow$ Stop accepting updates in any partition that does not have a majority of the nodes.

Choice #2: Allow Split, Reconcile Changes

$\rightarrow$ Allow each side of the partition to keep accepting updates.
$\rightarrow$ Upon reconnection, perform reconciliation to determine the "correct" version of any updated record.
$\rightarrow$ Server-side: Last Update Wins
$\rightarrow$ Client-side: Vector Clocks
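Client-side reconciliation with vector clocks can be sketched as follows (a hypothetical illustration; the record format and function names are invented):

```python
# Vector-clock reconciliation sketch. Each version of a record carries a
# clock mapping node -> per-node update counter. Clock A "happens before"
# clock B if every entry of A is <= the matching entry of B and at least
# one entry is strictly smaller; otherwise the writes are concurrent.

def happens_before(a, b):
    keys = set(a) | set(b)
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))

def reconcile(v1, v2):
    """Return the causally newer version, or both if the writes were
    concurrent (the client must then merge them itself)."""
    (_, c1), (_, c2) = v1, v2
    if happens_before(c1, c2):
        return [v2]
    if happens_before(c2, c1):
        return [v1]
    return [v1, v2]   # concurrent updates: a true conflict

# Nodes n1 and n2 both updated the record during the partition:
a = ("cart=[milk]",      {"n1": 2, "n2": 1})
b = ("cart=[milk,eggs]", {"n1": 1, "n2": 2})
print(reconcile(a, b))   # both versions survive; the client merges them
```

Last-update-wins reconciliation would instead pick one side by timestamp, silently discarding the other write; vector clocks push that decision to the client.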


    PACELC THEOREM

Extension to CAP proposed in 2010 to include consistency vs. latency trade-offs:
$\longrightarrow$ Partition Tolerant
$\longrightarrow$ Always Available
$\longrightarrow$ Consistent
$\longrightarrow$ Else, choose during normal operations:
$\longrightarrow$ Latency
$\longrightarrow$ Consistency

    LATENCY VS. CONSISTENCY

Replica (us-west)
Replica (eu-east)


    CONCLUSION

Maintaining transactional consistency across multiple nodes is hard. Bad things will happen.
$\rightarrow$ Don't let the "unwashed masses" go without txns!
2PC / Paxos / Raft are the most common protocols to ensure correctness in a distributed DBMS.
More info (and humiliation):
$\rightarrow$ Kyle Kingsbury's Jepsen Project

    Spanner: Google's Globally-Distributed Database

James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford
    Google, Inc.

    Abstract

Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.

    1 Introduction

Spanner is a scalable, globally-distributed database designed, built, and deployed at Google. At the highest level of abstraction, it is a database that shards data across many sets of Paxos [21] state machines in datacenters spread all over the world. Replication is used for global availability and geographic locality; clients automatically failover between replicas. Spanner automatically reshards data across machines as the amount of data or the number of servers changes, and it automatically migrates data across machines (even across datacenters) to balance load and in response to failures. Applications can use Spanner for high availability, even in the face of wide-area natural disasters, by replicating their data within or even across continents. Our initial customer was F1 [35], a rewrite of Google's advertising backend. F1 uses five replicas spread across the United States. Most other applications will probably replicate their data across 3 to 5 datacenters in one geographic region, but with relatively independent failure modes. That is, most applications choose lower latency over higher availability, as long as they can survive 1 or 2 datacenter failures.
Published in the Proceedings of OSDI 2012
Spanner's main focus is managing cross-datacenter replicated data, but we have also spent a great deal of time in designing and implementing important database features on top of our distributed-systems infrastructure. Even though many projects happily use Bigtable [9], we have also consistently received complaints from users that Bigtable can be difficult to use for some kinds of applications: those that have complex, evolving schemas, or those that want strong consistency in the presence of wide-area replication. (Similar claims have been made by other authors [37].) Many applications at Google have chosen to use Megastore [5] because of its semi-relational data model and support for synchronous replication, despite its relatively poor write throughput. As a consequence, Spanner has evolved from a Bigtable-like versioned key-value store into a temporal multi-version database. Data is stored in schematized semi-relational tables; data is versioned, and each version is automatically timestamped with its commit time; old versions of data are subject to configurable garbage-collection policies; and applications can read data at old timestamps. Spanner supports general-purpose transactions, and provides a SQL-based query language.
    As a globally- distributed database, Spanner provides several interesting features. First, the replication configurations for data can be dynamically controlled at a fine grain by applications. Applications can specify constraints to control which datacenters contain which data, how far data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance). Data can also be dynamically and transparently moved between datacenters by the system to balance resource usage across datacenters. Second, Spanner has two features that are difficult to implement in a distributed database: it
    was in part built to address this failing. Some authors have claimed that general two- phase commit is too expensive to support, because of the performance or availability problems that it brings [9, 10, 19]. We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. Running two- phase commit over Paxos mitigates the availability problems.
    The application data model is layered on top of the directory- bucketed key- value mappings supported by the


    NEXT CLASS

    Distributed OLAP Systems