Carnegie Mellon University
Database Systems
Concurrency Control Theory
ADMINISTRIVIA
Project #3 is due Sunday March 30th @ 11
$\rightarrow$ Recitation: slides, recording
COURSE STATUS
A DBMS's concurrency control and recovery components permeate throughout the design of its entire architecture.
Query Planning
Concurrency Control
Operator Execution
Access Methods
Recovery
Buffer Pool Manager
Disk Manager
MOTIVATION EXAMPLE #1
Application Logic
Read(A); Check(A > $25); Pay($25); A = A - $25; Write(A);
MOTIVATION EXAMPLE #2
Application Logic
Read(A);
Check(A > $25);
Pay($25);
A = A - $25;
Write(A);
STRAWMAN SYSTEM
Execute each txn one-by-one (i.e., in serial order) as they arrive at the DBMS.
$\rightarrow$ One and only one txn can be running simultaneously in the DBMS.
Before a txn starts, copy the entire database to a new file and make all changes to that file.
$\rightarrow$ If the txn completes successfully, overwrite the original file with the new one.
$\rightarrow$ If the txn fails, just remove the dirty copy.
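The copy-and-swap scheme above can be sketched in a few lines of Python; the function name `run_txn` and the `.txn` scratch-file suffix are invented for illustration, not part of any real system.

```python
import os
import shutil

def run_txn(db_path, txn):
    """Strawman atomicity: copy the whole database file, run the txn
    against the copy, and swap the copy in only if the txn succeeds."""
    dirty = db_path + ".txn"           # scratch copy (name is invented)
    shutil.copyfile(db_path, dirty)    # copy the entire database
    try:
        txn(dirty)                     # txn makes all changes to the copy
    except Exception:
        os.remove(dirty)               # txn failed: discard the dirty copy
        raise
    os.replace(dirty, db_path)         # txn succeeded: overwrite original

# Usage: create a tiny "database" file, then run one txn against it.
with open("db.txt", "w") as f:
    f.write("A=1000\n")

def deposit(path):
    with open(path, "a") as f:
        f.write("B=1000\n")

run_txn("db.txt", deposit)
```

Note that `os.replace` makes the final swap a single atomic rename, which is what makes the commit point well-defined in this sketch.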
PROBLEM STATEMENT
A (potentially) better approach is to allow concurrent execution of independent transactions.
Why do we want that?
$\rightarrow$ Better utilization/throughput $\rightarrow$ Decreased response times to users.
But we also would like:
$\rightarrow$ Correctness $\rightarrow$ Fairness
PROBLEM STATEMENT
Arbitrary interleaving of operations can lead to: $\rightarrow$ Temporary Inconsistency (ok, unavoidable) $\rightarrow$ Permanent Inconsistency (bad!)
The DBMS is only concerned about what data is read/written from/to the database. $\rightarrow$ Changes to the "outside world" are beyond the scope of the DBMS.
We need formal correctness criteria to determine whether an interleaving is valid.
FORMAL DEFINITIONS
Database: A fixed set of named data objects (e.g., A, B, C, ...).
$\rightarrow$ We do not need to define what these objects are now. $\rightarrow$ We will discuss how to handle inserts/deletes next week.
Transaction: A sequence of read and write operations (e.g., R(A), W(B), ...). $\rightarrow$ The DBMS's abstract view of a user program. $\rightarrow$ A new txn starts with the BEGIN command. $\rightarrow$ The txn stops with either COMMIT or ROLLBACK.
CORRECTNESS CRITERIA: ACID
Atomicity
All actions in txn happen, or none happen. "All or nothing..."
Consistency
If each txn is consistent and the DB starts consistent, then it ends up consistent. "It looks correct to me..."
Isolation
Execution of one txn is isolated from that of other txns. "All by myself..."
Durability
If a txn commits, its effects persist. "I will survive..."
TODAY'S AGENDA
Atomicity
Consistency
Isolation
Durability
DB Flash Talk: ClickHouse
ATOMICITY OF TRANSACTIONS
Two possible outcomes of executing a txn:
$\rightarrow$ Commit after completing all its actions. $\rightarrow$ Abort (or be aborted by the DBMS) after executing some actions.
DBMS guarantees that txn are atomic.
$\rightarrow$ From user's point of view: txn always either executes all its actions or executes no actions at all.
MECHANISMS FOR ENSURING ATOMICITY
Approach #1: Logging
$\rightarrow$ DBMS logs all actions so that it can undo the actions of aborted transactions. $\rightarrow$ Maintain undo records both in memory and on disk. $\rightarrow$ Think of this like the black box in airplanes...
Logging is used by almost every DBMS. $\rightarrow$ Audit Trail $\rightarrow$ Efficiency Reasons
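As a toy illustration of undo logging, here is an in-memory key-value store that records an undo record before every write, so an aborted txn's changes can be rolled back. All names (`LoggedStore`, etc.) are invented for this sketch; real DBMSs log to disk and handle much more.

```python
class LoggedStore:
    """Toy key-value store with an undo log for the active txn."""
    def __init__(self):
        self.data = {}
        self.undo = []                    # undo records for the active txn

    def write(self, key, value):
        # Log the old value (None if the key did not exist) before
        # overwriting it.
        self.undo.append((key, self.data.get(key)))
        self.data[key] = value

    def commit(self):
        self.undo.clear()                 # changes become permanent

    def abort(self):
        # Apply undo records in reverse order to restore the old state.
        for key, old in reversed(self.undo):
            if old is None:
                del self.data[key]
            else:
                self.data[key] = old
        self.undo.clear()

store = LoggedStore()
store.write("A", 1000)
store.commit()
store.write("A", 900)    # uncommitted change...
store.abort()            # ...rolled back
print(store.data["A"])   # → 1000
```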
MECHANISMS FOR ENSURING ATOMICITY
Approach #2: Shadow Paging
$\rightarrow$ DBMS makes copies of pages and txns make changes to those copies. Only when the txn commits is the page made visible to others. $\rightarrow$ Originally from IBM System R.
Few systems do this: $\rightarrow$ CouchDB $\rightarrow$ Tokyo Cabinet $\rightarrow$ LMDB (OpenLDAP)
CONSISTENCY
The database accurately models the real world.
$\rightarrow$ SQL has methods to specify integrity constraints (e.g., key definitions, CHECK and ADD CONSTRAINT) and the DBMS will enforce them.
$\rightarrow$ Application must define these constraints.
$\rightarrow$ DBMS ensures that all ICs are true before and after the transaction ends.
A note on Eventual Consistency.
$\rightarrow$ A committed transaction may see inconsistent results (e.g., may not see the updates of an older committed txn). $\rightarrow$ Difficult for developers to reason about such semantics. $\rightarrow$ The trend is to move away from such models.
ISOLATION OF TRANSACTIONS
Users submit txns, and each txn executes as if it were running by itself. $\rightarrow$ Easier programming model to reason about.
But the DBMS achieves concurrency by interleaving the actions (reads/writes of DB objects) of txns.
We need a way to interleave txns but still make it appear as if they ran one-at-a-time.
MECHANISMS FOR ENSURING ISOLATION
A concurrency control protocol is how the DBMS decides the proper interleaving of operations from multiple transactions.
Two categories of protocols: $\rightarrow$ Pessimistic: Don't let problems arise in the first place. $\rightarrow$ Optimistic: Assume conflicts are rare; deal with them after they happen.
EXAMPLE
Assume at first A and B each have $1000.
$T_1$ transfers $100 from A's account to B's account.
$T_2$ credits both accounts with 6% interest.
What are the possible outcomes of running $T_1$ and $T_2$ ?
Many! But A+B should be: $\rightarrow$ $\$2000 \times 1.06 = \$2120$
There is no guarantee that $T_1$ will execute before $T_2$ or vice-versa, if both are submitted together.
But the net effect must be equivalent to these two transactions running serially in some order.
EXAMPLE
Legal outcomes:
$\rightarrow$ A=954, B=1166 (A+B = $2120)
$\rightarrow$ A=960, B=1160 (A+B = $2120)
The outcome depends on whether $T_1$ executes before $T_2$ or vice versa.
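The two serial orders can be verified by direct computation (a minimal sketch; `t1`/`t2` are just the lecture's transfer and interest transactions written as functions):

```python
def t1(a, b):
    """T1: transfer $100 from A's account to B's."""
    return a - 100, b + 100

def t2(a, b):
    """T2: credit both accounts with 6% interest."""
    return a * 1.06, b * 1.06

# Serial order T1 then T2: interest is applied after the transfer.
a, b = t2(*t1(1000, 1000))
print(round(a, 2), round(b, 2), round(a + b, 2))   # 954.0 1166.0 2120.0

# Serial order T2 then T1: interest is applied before the transfer.
a, b = t1(*t2(1000, 1000))
print(round(a, 2), round(b, 2), round(a + b, 2))   # 960.0 1160.0 2120.0
```

Either order preserves the invariant A+B = $2120; only an invalid interleaving breaks it.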
SERIAL EXECUTION EXAMPLE
Schedule
INTERLEAVING TRANSACTIONS
We interleave txns to maximize concurrency. $\rightarrow$ Slow disk/network I/O. $\rightarrow$ Multi-core CPUs.
When one txn stalls because of a resource (e.g., page fault), another txn can continue executing and make forward progress.
INTERLEAVING EXAMPLE (GOOD)
Schedule
| T1       | T2       |
|----------|----------|
| BEGIN    |          |
| A=A-100  |          |
|          | BEGIN    |
|          | A=A*1.06 |
| B=B+100  |          |
| COMMIT   |          |
|          | B=B*1.06 |
|          | COMMIT   |
A=954, B=1166
INTERLEAVING EXAMPLE (BAD)
Expected (serial) outcomes: A=954, B=1166 or A=960, B=1160
Schedule
| T1       | T2       |
|----------|----------|
| BEGIN    |          |
| A=A-100  |          |
|          | BEGIN    |
|          | A=A*1.06 |
|          | B=B*1.06 |
|          | COMMIT   |
| B=B+100  |          |
| COMMIT   |          |
A=954, B=1160
A+B=$2114
DBMS View
| T1     | T2     |
|--------|--------|
| BEGIN  |        |
| R(A)   |        |
| W(A)   |        |
|        | BEGIN  |
|        | R(A)   |
|        | W(A)   |
|        | R(B)   |
|        | W(B)   |
|        | COMMIT |
| R(B)   |        |
| W(B)   |        |
| COMMIT |        |
How do we judge whether a schedule is correct?
If the schedule is equivalent to some serial execution.
FORMAL PROPERTIES OF SCHEDULES
Serial Schedule
$\rightarrow$ A schedule that does not interleave the actions of different transactions.
Equivalent Schedules
$\rightarrow$ For any database state, the effect of executing the first schedule is identical to the effect of executing the second schedule.
FORMAL PROPERTIES OF SCHEDULES
Serializable Schedule
$\rightarrow$ A schedule that is equivalent to some serial execution of the transactions. $\rightarrow$ If each transaction preserves consistency, every serializable schedule preserves consistency.
Serializability is a less intuitive notion of correctness than txn initiation time or commit order, but it gives the DBMS more flexibility in scheduling operations. $\rightarrow$ More flexibility means better parallelism.
CONFLICTING OPERATIONS
We need a formal notion of equivalence that can be implemented efficiently based on the notion of "conflicting" operations.
Two operations conflict if:
$\rightarrow$ They are by different transactions, $\rightarrow$ They are on the same object and one of them is a write.
Interleaved Execution Anomalies
$\rightarrow$ Unrepeatable Read (Read-Write)
$\rightarrow$ Dirty Read (Write-Read)
$\rightarrow$ Lost Update (Write-Write)
$\rightarrow$ Phantom Reads (Scan-Write) [Lecture #17]
$\rightarrow$ Write-Skew (Read-Write) [Lecture #19]
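The conflict test falls directly out of the definition above. In this sketch, an operation is modeled as a hypothetical (txn, kind, object) triple; the encoding is an assumption, not the lecture's notation.

```python
def conflicts(op1, op2):
    """Two operations conflict iff they come from different txns, touch
    the same object, and at least one of them is a write."""
    txn1, kind1, obj1 = op1
    txn2, kind2, obj2 = op2
    return txn1 != txn2 and obj1 == obj2 and "W" in (kind1, kind2)

print(conflicts(("T1", "R", "A"), ("T2", "W", "A")))   # True  (read-write)
print(conflicts(("T1", "W", "A"), ("T2", "W", "A")))   # True  (write-write)
print(conflicts(("T1", "R", "A"), ("T2", "R", "A")))   # False (two reads never conflict)
print(conflicts(("T1", "W", "A"), ("T1", "W", "A")))   # False (same txn)
```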
READ-WRITE CONFLICTS
Unrepeatable Read: Txn gets different values when reading the same object multiple times.
WRITE-READ CONFLICTS
Dirty Read: One txn reads data written by another txn that has not committed yet.
WRITE-WRITE CONFLICTS
Lost Update: One txn overwrites uncommitted data from another uncommitted txn.
FORMAL PROPERTIES OF SCHEDULES
Given these conflicts, we can now understand what it means for a schedule to be serializable. $\rightarrow$ This is a way to check whether schedules are correct. $\rightarrow$ This is not how to generate a correct schedule.
There are different levels of serializability: $\rightarrow$ Conflict Serializability (most DBMSs try to support this) $\rightarrow$ View Serializability (no DBMS can do this)
CONFLICT SERIALIZABLE SCHEDULES
Two schedules are conflict equivalent iff:
$\rightarrow$ They involve the same actions of the same transactions. $\rightarrow$ Every pair of conflicting actions is ordered the same way.
Schedule S is conflict serializable if:
$\rightarrow$ S is conflict equivalent to some serial schedule. $\rightarrow$ Intuition: You can transform S into a serial schedule by swapping consecutive non-conflicting operations of different transactions.
DEPENDENCY GRAPHS
One node per txn.
Edge from $T_i$ to $T_j$ if:
$\rightarrow$ An operation $O_i$ of $T_i$ conflicts with an operation $O_j$ of $T_j$, and
$\rightarrow$ $O_i$ appears earlier in the schedule than $O_j$.
Also known as a precedence graph. A schedule is conflict serializable iff its dependency graph is acyclic.
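A minimal sketch of the acyclicity test: build the precedence graph from the conflict rule, then run a depth-first search for cycles. The (txn, "R"/"W", object) schedule encoding and the helper name are assumptions of this sketch.

```python
from collections import defaultdict

def is_conflict_serializable(schedule):
    """schedule: list of (txn, "R"/"W", object) triples in execution order.
    Builds the precedence graph and reports whether it is acyclic."""
    edges = defaultdict(set)
    for i, (ti, kind_i, obj_i) in enumerate(schedule):
        for tj, kind_j, obj_j in schedule[i + 1:]:
            # Conflicting ops of different txns: edge from the earlier txn.
            if ti != tj and obj_i == obj_j and "W" in (kind_i, kind_j):
                edges[ti].add(tj)

    # DFS cycle detection: GRAY marks nodes on the current path.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def has_cycle(t):
        color[t] = GRAY
        for u in edges[t]:
            if color[u] == GRAY or (color[u] == WHITE and has_cycle(u)):
                return True
        color[t] = BLACK
        return False

    return not any(color[t] == WHITE and has_cycle(t)
                   for t in {t for t, _, _ in schedule})

# The "bad" interleaving: T2 does all its work in the middle of T1,
# creating edges T1 -> T2 (on A) and T2 -> T1 (on B), i.e., a cycle.
bad = [("T1", "R", "A"), ("T1", "W", "A"),
       ("T2", "R", "A"), ("T2", "W", "A"),
       ("T2", "R", "B"), ("T2", "W", "B"),
       ("T1", "R", "B"), ("T1", "W", "B")]

# The "good" interleaving: every conflict is ordered T1 before T2.
good = [("T1", "R", "A"), ("T1", "W", "A"),
        ("T2", "R", "A"), ("T2", "W", "A"),
        ("T1", "R", "B"), ("T1", "W", "B"),
        ("T2", "R", "B"), ("T2", "W", "B")]

print(is_conflict_serializable(bad))    # False
print(is_conflict_serializable(good))   # True
```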
EXAMPLE #1
Schedule
Dependency Graph
EXAMPLE #2 - THREE TRANSACTIONS
Schedule
Dependency Graph
Is this equivalent to a serial execution?
Yes $(T_2, T_1, T_3)$
$\rightarrow$ Notice that $T_3$ should go after $T_2$, although it starts before it!
EXAMPLE #3 - INCONSISTENT ANALYSIS
Schedule
Dependency Graph
Is it possible to modify only the application logic so that schedule produces a "correct" result but is still not conflict serializable?
VIEW SERIALIZABILITY
Alternative (broader) notion of serializability.
Schedules $\mathsf{S}_1$ and $\mathsf{S}_2$ are view equivalent if:
$\rightarrow$ If $\mathsf{T}_1$ reads initial value of A in $\mathsf{S}_1$ , then $\mathsf{T}_1$ also reads initial value of A in $\mathsf{S}_2$ . $\rightarrow$ If $\mathsf{T}_1$ reads value of A written by $\mathsf{T}_2$ in $\mathsf{S}_1$ , then $\mathsf{T}_1$ also reads value of A written by $\mathsf{T}_2$ in $\mathsf{S}_2$ . $\rightarrow$ If $\mathsf{T}_1$ writes final value of A in $\mathsf{S}_1$ , then $\mathsf{T}_1$ also writes final value of A in $\mathsf{S}_2$ .
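These three conditions can be checked mechanically. The sketch below uses an invented (txn, "R"/"W", object) encoding and invented helper names; the example schedules mirror the classic blind-write case, which is view equivalent to the serial order $T_1, T_2, T_3$ even though it is not conflict serializable.

```python
def view_facts(schedule):
    """schedule: list of (txn, "R"/"W", object) triples in execution order.
    Returns the three facts that define view equivalence, per object:
    who reads the initial value, which write each read observes, and
    who performs the final write."""
    last_writer = {}       # object -> txn that wrote it most recently
    initial_reads = set()  # (txn, object): reads of the initial value
    reads_from = set()     # (reader, writer, object) triples
    final_writes = {}      # object -> txn performing the final write
    for txn, kind, obj in schedule:
        if kind == "R":
            if obj in last_writer:
                reads_from.add((txn, last_writer[obj], obj))
            else:
                initial_reads.add((txn, obj))
        else:
            last_writer[obj] = txn
            final_writes[obj] = txn
    return initial_reads, reads_from, final_writes

def view_equivalent(s1, s2):
    return view_facts(s1) == view_facts(s2)

# Blind-write example: T2 and T3 write A without reading it, so T3's
# final write hides the out-of-order intermediate writes.
interleaved = [("T1", "R", "A"), ("T2", "W", "A"),
               ("T1", "W", "A"), ("T3", "W", "A")]
serial      = [("T1", "R", "A"), ("T1", "W", "A"),
               ("T2", "W", "A"), ("T3", "W", "A")]
print(view_equivalent(interleaved, serial))   # True
```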
SERIALIZABILITY
View Serializability allows for (slightly) more schedules than Conflict Serializability does. $\longrightarrow$ But it is difficult to enforce efficiently.
Neither definition allows all schedules that you would consider "serializable." $\longrightarrow$ This is because they do not understand the meanings of the operations or the data (recall Example #3). $\longrightarrow$ In practice, Conflict Serializability is what systems support because it can be enforced efficiently.
UNIVERSE OF SCHEDULES
All Schedules $\supset$ View Serializable $\supset$ Conflict Serializable $\supset$ Serial
TRANSACTION DURABILITY
All the changes of committed transactions should be persistent. $\longrightarrow$ No torn updates. $\longrightarrow$ No changes from failed transactions.
The DBMS can use either logging or shadow paging to ensure that all changes are durable.
CORRECTNESS CRITERIA: ACID
Atomicity (Redo/Undo Mechanism)
All actions in txn happen, or none happen. "All or nothing..."
Consistency (Integrity Constraints)
If each txn is consistent and the DB starts consistent, then it ends up consistent. "It looks correct to me..."
Isolation (Concurrency Control)
Execution of one txn is isolated from that of other txns. "All by myself..."
Durability (Redo/Undo Mechanism)
If a txn commits, its effects persist. "I will survive..."
CONCLUSION
Concurrency control and recovery are among the most important functions provided by a DBMS.
Concurrency control is automatic
$\longrightarrow$ System automatically inserts lock/unlock requests and schedules actions of different txns. $\longrightarrow$ Ensures that resulting execution is equivalent to executing the txns one after the other in some order.
Just like with "NoSQL," there was a "who needs transactions?" phase. That has passed. $\longrightarrow$ SQL and transactions are good and necessary!
[Shown: "Spanner: Google's Globally-Distributed Database," James C. Corbett, Jeffrey Dean, et al., Google, Inc. Published in the Proceedings of OSDI 2012.]
NEXT CLASS
Two-Phase Locking
Isolation Levels