Carnegie Mellon University
Database Systems
Multi- Version
Concurrency Control
ADMINISTRIVIA
Project #4 is out; due Sunday April 20th @ 11
$\rightarrow$ Recitation: Friday, April 11th in GHC 4303 from 3
- 4
PM
Final Exam is on Monday, April 28, 2025, from 05
- 08
$\rightarrow$ Early exam will not be offered. Do not make travel plans.
No OH on April 7th (I have to teach another class)
UPCOMING DATABASE TALKS
StarRocks (DB Seminar)
$\longrightarrow$ Monday, March 31 @ 4
$\longrightarrow$ StarRocks Query Optimizer $\longrightarrow$ Speaker: Kaisen Kang $\longrightarrow$
https://cmu.zoom.us/j/93441451665
StarRocks
Oracle (DB Seminar)
$\longrightarrow$ Tuesday, April 1 @ noon $\longrightarrow$ Evolving Transactions in Oracle 23ai: New Advances in Concurrency and Consistency $\longrightarrow$ Speakers: Kishy Kumar and Akshay S. Kulkarni $\longrightarrow$
https://cmu.zoom.us/my/jignesh
ORACLE
LAST CLASS
We discussed concurrency control protocols for generating conflict serializable schedules without needing to know what queries a txn will execute.
The two- phase locking (2PL) protocol requires txns to acquire locks on database objects before they are allowed to access them.
MULTI-VERSION CONCURRENCY CONTROL
The DBMS maintains multiple physical versions of a single logical object in the database:
$\longrightarrow$ When a txn writes to an object, the DBMS creates a new version of that object. $\longrightarrow$ When a txn reads an object, it reads the newest version that existed when the txn started.
MVCC HISTORY
Protocol was first proposed in 1978 MIT PhD dissertation.
First implementations was Rdb/VMS and InterBase at DEC in early 1980s. $\longrightarrow$ Both were by Jim Starkey, co- founder of Nuodb. $\longrightarrow$ DEC Rdb/VMS is now "Oracle Rdb". $\longrightarrow$ InterBase was open- sourced as Firebird.
Rdb/VMS
MVCC HISTORY
Protocol was first proposed in 1978 MIT PhD dissertation.
First implementations was Rdb/VMS and InterBase at DEC in early 1980s. $\longrightarrow$ Both were by Jim Starkey, co- founder of. NuoDB. $\longrightarrow$ DEC Rdb/VMS is now "Oracle Rdb". $\longrightarrow$ InterBase was open- sourced as Firebird.
MULTI-VERSION CONCURRENCY CONTROL
Writers do not block readers. Readers do not block writers.
Read- only txns can read a consistent snapshot without acquiring locks.
$\longrightarrow$ Use timestamps to determine visibility. $\longrightarrow$ MVCC naturally supports Snapshot Isolation (SI).
Multi- versioning without garbage collection allows the DBMS to support time- travel queries.
MVCC - EXAMPLE #1
Schedule
Database
MVCC - EXAMPLE #1
MVCC - EXAMPLE #1
Schedule
Database
MVCC - EXAMPLE #1
Schedule
Database
MVCC - EXAMPLE #1
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | - | 123 |
| | | |
| | | |
MVCC - EXAMPLE #1
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | - | 123 |
| | | |
| | | |
MVCC - EXAMPLE #1
Schedule
Database
MVCC - EXAMPLE #1
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | - | 123 |
| | | |
| | | |
MVCC - EXAMPLE #1
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | - | 123 |
| | | |
| | | |
MVCC - EXAMPLE #1
Schedule
Database
MVCC - EXAMPLE #1
Schedule
Database
MVCC - EXAMPLE #1
Schedule
Database
Txn Status Table
Table (html):
| txnid | timestamp | status |
| T1 | 1 | Active |
| T2 | 2 | Active |
| | |
MVCC - EXAMPLE #1
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | 2 | 123 |
| A1 | 2 | - | 456 |
| | | |
Txn Status Table
Table (html):
| txnid | timestamp | status |
| T1 | 1 | Active |
| T2 | 2 | Active |
| | |
MVCC - EXAMPLE #1
MVCC - EXAMPLE #1
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | 2 | 123 |
| A1 | 2 | - | 456 |
| | | |
Txn Status Table
Table (html):
| txnid | timestamp | status |
| T1 | 1 | Active |
| T2 | 2 | Active |
| | |
MVCC - EXAMPLE #2
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | - | 123 |
| | | |
| | | |
Txn Status Table
Table (html):
| txnid | timestamp | status |
| T1 | 1 | Active |
| | |
| | |
MVCC - EXAMPLE #2
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | - | 123 |
| | | |
| | | |
Txn Status Table
Table (html):
| txnid | timestamp | status |
| T1 | 1 | Active |
| | |
| | |
MVCC - EXAMPLE #2
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | - | 123 |
| | | |
| | | |
Txn Status Table
Table (html):
| txnid | timestamp | status |
| T1 | 1 | Active |
| | |
| | |
MVCC - EXAMPLE #2
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | - | 123 |
| | | |
| | | |
Txn Status Table
Table (html):
| txnid | timestamp | status |
| T1 | 1 | Active |
| | |
| | |
MVCC - EXAMPLE #2
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | - | 123 |
| A1 | 1 | - | 456 |
| | | |
Txn Status Table
Table (html):
| txnid | timestamp | status |
| T1 | 1 | Active |
| | |
| | |
MVCC - EXAMPLE #2
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | - | 123 |
| A1 | 1 | - | 456 |
| | | |
Txn Status Table
Table (html):
| txnid | timestamp | status |
| T1 | 1 | Active |
| | |
| | |
MVCC - EXAMPLE #2
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | 1 | 123 |
| A1 | 1 | - | 456 |
| | | |
Txn Status Table
Table (html):
| txnid | timestamp | status |
| T1 | 1 | Active |
| | |
| | |
MVCC - EXAMPLE #2
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | 1 | 123 |
| A1 | 1 | - | 456 |
| | | |
Txn Status Table
Table (html):
| txnid | timestamp | status |
| T1 | 1 | Active |
| | |
| | |
MVCC - EXAMPLE #2
MVCC - EXAMPLE #2
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | 1 | 123 |
| A1 | 1 | - | 456 |
| | | |
Txn Status Table
Table (html):
| txnid | timestamp | status |
| T1 | 1 | Active |
| T2 | 2 | Active |
| | |
MVCC - EXAMPLE #2
Schedule
MVCC - EXAMPLE #2
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | 1 | 123 |
| A1 | 1 | - | 456 |
| | | |
Txn Status Table
Table (html):
| txnid | timestamp | status |
| T1 | 1 | Active |
| T2 | 2 | Active |
| | |
MVCC - EXAMPLE #2
MVCC - EXAMPLE #2
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | 1 | 123 |
| A1 | 1 | - | 456 |
| | | |
Txn Status Table
Table (html):
| txnid | timestamp | status |
| T1 | 1 | Active |
| T2 | 2 | Active |
| | |
MVCC - EXAMPLE #2
Schedule
Database
Table (html):
| begin-ts | end-ts | value |
| A0 | 0 | 1 | 123 |
| A1 | 1 | - | 456 |
| | | |
Txn Status Table
Table (html):
| txnid timestamp | status | |
| T1 | 1 | Committed |
| T2 | 2 | Active |
| | |
MVCC - EXAMPLE #2
SNAPSHOT ISOLATION (SI)
When a txn starts, it sees a consistent snapshot of the database that existed when that the txn started. $\rightarrow$ No torn writes from active txns. $\rightarrow$ If two txns update the same object, then first writer wins.
SI is susceptible to the Write Skew Anomaly.
WRITE SKEW ANOMALY
WRITE SKEW ANOMALY
Txn #1 Change all white marbles to black.
Txn #2 Change all black marbles to white.
WRITE SKEW ANOMALY
WRITE SKEW ANOMALY
WRITE SKEW ANOMALY
WRITE SKEW ANOMALY
WRITE SKEW ANOMALY
WRITE SKEW ANOMALY
MULTI-VERSION CONCURRENCY CONTROL
MVCC is more than just a concurrency control protocol. It completely affects how the DBMS manages transactions and the database.
MULTI-VERSION CONCURRENCY CONTROL
MVCC is more than just a concurrency control protocol. It completely affects how the DBMS manages transactions and the database.
MVCC DESIGN DECISIONS
Concurrence Control ProtocolVersion StorageGarbage CollectionIndex ManagementDeletes
CONCURRENCY CONTROL PROTOCOL
Approach #1: Timestamp Ordering
→ Assign txns timestamps that determine serial order.
Approach #2: Optimistic Concurrency Control
→ Three- phase protocol from last class. → Use private workspace for new versions.
Approach #3: Two-Phase Locking
→ TXns acquire appropriate lock on physical version before they can read/write a logical tuple.
VERSION STORAGE
The DBMS uses the tuples' pointer field to create a version chain per logical tuple.
→ This allows the DBMS to find the version that is visible to a particular txn at runtime. → Indexes always point to the "head" of the chain.
Different storage schemes determine where/what to store for each version.
VERSION STORAGE
Approach #1: Append- Only Storage $\rightarrow$ New versions are appended to the same table space.
Approach #2: Time- Travel Storage $\rightarrow$ Old versions are copied to separate table space.
Approach #3: Delta Storage $\rightarrow$ The original values of the modified attributes are copied into a separate delta record space.
VERSION STORAGE
Approach #1: Append-Only Storage
→ New versions are appended to the same table space.
Do This!
Approach #2: Time-Travel Storage
→ Old versions are copied to separate table space.
Approach #3: Delta Storage
→ The original values of the modified attributes are copied into a separate delta record space.
APPEND-ONLY STORAGE
All the physical versions of a logical tuple are stored in the same table space. The versions are inter- mixed.
On every update, append a new version of the tuple into an empty space in the table.
Main Table
Table (html):
| value | pointer |
| A0 | $111 | |
| A1 | $222 | 0 |
| B1 | $10 | 0 |
| | |
APPEND-ONLY STORAGE
All the physical versions of a logical tuple are stored in the same table space. The versions are inter- mixed.
On every update, append a new version of the tuple into an empty space in the table.
Main Table
APPEND-ONLY STORAGE
All the physical versions of a logical tuple are stored in the same table space. The versions are inter- mixed.
On every update, append a new version of the tuple into an empty space in the table.
Main Table
APPEND-ONLY STORAGE
All the physical versions of a logical tuple are stored in the same table space. The versions are inter- mixed.
On every update, append a new version of the tuple into an empty space in the table.
Main Table
Table (html):
| value | pointer |
| A0 | $111 | |
| A1 | $222 | 0 |
| B1 | $10 | 0 |
| A2 | $333 | 0 |
APPEND-ONLY STORAGE
All the physical versions of a logical tuple are stored in the same table space. The versions are inter- mixed.
On every update, append a new version of the tuple into an empty space in the table.
Main Table
APPEND-ONLY STORAGE
All the physical versions of a logical tuple are stored in the same table space. The versions are inter- mixed.
On every update, append a new version of the tuple into an empty space in the table.
Main Table
VERSION CHAIN ORDERING
Approach #1: Oldest-to-Newest (O2N)
$\rightarrow$ Append new version to end of the chain. $\rightarrow$ Must traverse chain on look- ups.
Approach #2: Newest-to-Oldest (N2O)
$\rightarrow$ Must update index pointers for every new version. $\rightarrow$ Do not have to traverse chain on look- ups. $\rightarrow$ This is typically the better approach because most txns only care about the newest version.
TIME-TRAVEL STORAGE
Main Table
Time-Travel Table
Table (html):
TIME-TRAVEL STORAGE
Main Table
Time-Travel Table
On every update, copy the current version to the time- travel table. Update pointers.
TIME-TRAVEL STORAGE
Main Table
Time-Travel Table
On every update, copy the current version to the time- travel table. Update pointers.
TIME-TRAVEL STORAGE
Main Table
Time-Travel Table
On every update, copy the current version to the time- travel table. Update pointers.
TIME-TRAVEL STORAGE
Main Table
Time-Travel Table
TIME-TRAVEL STORAGE
Main Table
Time-Travel Table
TIME-TRAVEL STORAGE
Main Table
Time-Travel Table
TIME-TRAVEL STORAGE
Main Table
Time-Travel Table
DELTA STORAGE
Main Table
Delta Storage Segment
Table (html):
DELTA STORAGE
Main Table
Delta Storage Segment
Table (html):
On every update, copy only the column values that were modified to the delta storage and overwrite the master version.
DELTA STORAGE
Main Table
Delta Storage Segment
On every update, copy only the column values that were modified to the delta storage and overwrite the master version.
DELTA STORAGE
Main Table
On every update, copy only the column values that were modified to the delta storage and overwrite the master version.
Delta Storage Segment
Table (html):
| delta | pointer |
| A1 | (VALUE→$111) | 0 |
DELTA STORAGE
Main Table
Delta Storage Segment
On every update, copy only the column values that were modified to the delta storage and overwrite the master version.
DELTA STORAGE
Main Table
Delta Storage Segment
On every update, copy only the column values that were modified to the delta storage and overwrite the master version.
DELTA STORAGE
Main Table
Delta Storage Segment
On every update, copy only the column values that were modified to the delta storage and overwrite the master version.
DELTA STORAGE
Main Table
Delta Storage Segment
On every update, copy only the column values that were modified to the delta storage and overwrite the master version.
DELTA STORAGE
Main Table
Delta Storage Segment
GARBAGE COLLECTION
The DBMS needs to remove reclaimable physical versions from the database over time. $\rightarrow$ No active txn in the DBMS can "see" that version (SI). $\rightarrow$ The version was created by an aborted txn.
Two additional design decisions: $\rightarrow$ How to look for expired versions? $\rightarrow$ How to decide when it is safe to reclaim memory?
GARBAGE COLLECTION
The DBMS needs to remove reclaimable physical versions from the database over time. $\rightarrow$ No active txn in the DBMS can "see" that version (SI). $\rightarrow$ The version was created by an aborted txn.
Two additional design decisions: $\rightarrow$ How to look for expired versions? $\rightarrow$ How to decide when it is safe to reclaim memory?
GARBAGE COLLECTION
Approach #1: Tuple-level
→ Find old versions by examining tuples directly. → Background Vacuuming vs. Cooperative Cleaning
Approach #2: Transaction-level
→ Txns keep track of their old versions so the DBMS does not have to scan tuples to determine visibility.
TUPLE-LEVEL GC
Txn #1
Tid=12
Txn #2
Tid=25
Table (html):
| begin-ts | end-ts |
| A100 | 1 | 9 |
| B100 | 1 | 9 |
| B101 | 10 | 20 |
TUPLE-LEVEL GC
Txn #1
Tid=12
Txn #2
Tid=25
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
Table (html):
| begin-ts | end-ts |
| A100 | 1 | 9 |
| B100 | 1 | 9 |
| B101 | 10 | 20 |
TUPLE-LEVEL GC
Txn #1
Tid=12
Txn #2
Tid=25
Vacuum
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
Table (html):
| begin-ts | end-ts |
| A100 | 1 | 9 |
| B100 | 1 | 9 |
| B101 | 10 | 20 |
TUPLE-LEVEL GC
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
Table (html):
| begin-ts | end-ts |
| A100 | 1 | 9 |
| B100 | 1 | 9 |
| B101 | 10 | 20 |
TUPLE-LEVEL GC
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
TUPLE-LEVEL GC
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
TUPLE-LEVEL GC
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
TUPLE-LEVEL GC
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
Table (html):
| Dirty Block BitMap | begin-ts | end-ts |
| | |
| | |
| B101 | 10 | 20 |
TUPLE-LEVEL GC
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
Table (html):
TUPLE-LEVEL GC
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
Table (html):
TUPLE-LEVEL GC
Txn #1
Tid=12
Txn #2
Tid=25
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
TUPLE-LEVEL GC
Txn #1
Tid=12
Txn #2
Tid=25
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
Cooperative Cleaning:
Worker threads identify reclaimable versions as they traverse version chain. Only works with O2N.
TUPLE-LEVEL GC
Txn #1
Tid=12
Txn #2
Tid=25
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
Cooperative Cleaning:
Worker threads identify reclaimable versions as they traverse version chain. Only works with O2N.
TUPLE-LEVEL GC
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
Cooperative Cleaning:
Worker threads identify reclaimable versions as they traverse version chain. Only works with O2N.
TUPLE-LEVEL GC
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
Cooperative Cleaning:
Worker threads identify reclaimable versions as they traverse version chain. Only works with O2N.
TUPLE-LEVEL GC
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
Cooperative Cleaning:
Worker threads identify reclaimable versions as they traverse version chain. Only works with O2N.
TUPLE-LEVEL GC
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
Cooperative Cleaning:
Worker threads identify reclaimable versions as they traverse version chain. Only works with O2N.
TUPLE-LEVEL GC
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
Cooperative Cleaning:
Worker threads identify reclaimable versions as they traverse version chain. Only works with O2N.
TUPLE-LEVEL GC
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
Cooperative Cleaning:
Worker threads identify reclaimable versions as they traverse version chain. Only works with O2N.
TUPLE-LEVEL GC
Txn #1
Tid=12
Txn #2
Tid=25
Background Vacuuming:
Separate thread(s) periodically scan the table and look for reclaimable versions. Works with any storage.
Cooperative Cleaning:
Worker threads identify reclaimable versions as they traverse version chain. Only works with O2N.
TRANSACTION-LEVEL GC
Each txn keeps track of its read/write set. On commit/abort, the txn provides this information to a centralized vacuum worker.
The DBMS periodically determines when all versions created by a finished txn are no longer visible.
TRANSACTION-LEVEL GC
Txn #1
BEGIN @ 10
Table (html):
| begin-ts | end-ts | value | |
| A2 | 1 | ∞ | - |
| B6 | 8 | ∞ | - |
| | | |
| | | |
TRANSACTION-LEVEL GC
Txn #1
BEGIN @ 10
Table (html):
| begin-ts | end-ts | value |
| A2 | 1 | ∞ | - |
| B6 | 8 | ∞ | - |
| | | |
| | | |
TRANSACTION-LEVEL GC
Txn #1
BEGIN @ 10
Table (html):
| begin-ts | end-ts | value | |
| A2 | 1 | ∞ | - |
| B6 | 8 | ∞ | - |
| | | |
| | | |
TRANSACTION-LEVEL GC
Txn #1
BEGIN @ 10
Table (html):
| begin-ts | end-ts | value |
| A2 | 1 | 10 | - |
| B6 | 8 | ∞ | - |
| A3 | 10 | ∞ | - |
| | | |
TRANSACTION-LEVEL GC
Txn #1
BEGIN @ 10
Old Versions
A2
Table (html):
| begin-ts | end-ts | value |
| A2 | 1 | 10 | - |
| B6 | 8 | ∞ | - |
| A3 | 10 | ∞ | - |
| | | |
TRANSACTION-LEVEL GC
TRANSACTION-LEVEL GC
Txn #1
BEGIN @ 10
Old Versions
A2
Table (html):
| begin-ts | end-ts | value |
| A2 | 1 | 10 | - |
| B6 | 8 | ∞ | - |
| A3 | 10 | ∞ | - |
| | | |
TRANSACTION-LEVEL GC
Txn #1
BEGIN @ 10
Old Versions
A2
Table (html):
| begin-ts | end-ts | value |
| A2 | 1 | 10 | - |
| B6 | 8 | 10 | - |
| A3 | 10 | ∞ | - |
| B7 | 10 | ∞ | - |
TRANSACTION-LEVEL GC
TRANSACTION-LEVEL GC
Txn #1
BEGIN @ 10 COMMIT @ 15
Old Versions
Table (html):
| begin-ts | end-ts | value |
| A2 | 1 | 10 | - |
| B6 | 8 | 10 | - |
| A3 | 10 | ∞ | - |
| B7 | 10 | ∞ | - |
TRANSACTION-LEVEL GC
Txn #1
BEGIN @ 10 COMMIT @ 15
Old Versions
[ImageCaption: Vacuum]
Table (html):
| begin-ts | end-ts | value |
| A2 | 1 | 10 | - |
| B6 | 8 | 10 | - |
| A3 | 10 | ∞ | - |
| B7 | 10 | ∞ | - |
INDEX MANAGEMENT
Primary key indexes point to version chain head. $\rightarrow$ How often the DBMS must update the pkey index depends on whether the system creates new versions when a tuple is updated. $\rightarrow$ If a txn updates a tuple's pkey attribute(s), then this is treated as a DELETE followed by an INSERT.
Secondary indexes are more complicated...
Primary key in → How often th on whether ti is updated. → If a txn updat treated as a D
Secondary in
ARCHITECTURE WHY UBER ENGINEERING SWITCHED FROM POSTGRES TO MYSQL
JULY 26, 2016 BY EVAN KLITZKE
SECONDARY INDEXES
Approach #1: Logical Pointers
→ Use a fixed identifier per tuple that does not change. $\rightarrow$ Requires an extra indirection layer. $\rightarrow$ Primary Key vs. Tuple Id
Approach #2: Physical Pointers
→ Use the physical address to the version chain head.
INDEX POINTERS: APPEND-ONLY
PRIMARY INDEX
SECONDARY INDEX
INDEX POINTERS: APPEND-ONLY
PRIMARY INDEX
SECONDARY INDEX
INDEX POINTERS: APPEND-ONLY
GET(A)
PRIMARY INDEX
SECONDARY INDEX
INDEX POINTERS: APPEND-ONLY
INDEX POINTERS: APPEND-ONLY
PRIMARY INDEX
SECONDARY INDEX
INDEX POINTERS: APPEND-ONLY
GET(A)
PRIMARY INDEX
SECONDARY INDEX
INDEX POINTERS: APPEND-ONLY
INDEX POINTERS: APPEND-ONLY
INDEX POINTERS: APPEND-ONLY
INDEX POINTERS: APPEND-ONLY
GET(A)
PRIMARY INDEX
SECONDARY INDEX
INDEX POINTERS: APPEND-ONLY
INDEX POINTERS: APPEND-ONLY
GET(A)
PRIMARY INDEX
SECONDARY INDEX
MVCC INDEXES
MVCC DBMS indexes (usually) do not store version information about tuples with their keys. $\rightarrow$ Exception: Index- organized tables (e.g., MySQL)
Every index must support duplicate keys from different snapshots:
$\rightarrow$ The same key may point to different logical tuples in different snapshots.
MVCC DUPLICATE KEY PROBLEM
MVCC DUPLICATE KEY PROBLEM
Txn #1
BEGIN @ 10
MVCC DUPLICATE KEY PROBLEM
Txn #1
BEGIN @ 10
Txn #2
BEGIN @ 20
MVCC DUPLICATE KEY PROBLEM
Txn #1
BEGIN @ 10
Txn #2
BEGIN @ 20
MVCC DUPLICATE KEY PROBLEM
Txn #1
BEGIN @ 10
Txn #2
BEGIN @ 20
MVCC DUPLICATE KEY PROBLEM
Txn #1
BEGIN @ 10
Txn #2
BEGIN @ 20 COMMIT @ 25
MVCC DUPLICATE KEY PROBLEM
Txn #1
BEGIN @ 10
Index
Txn #2
BEGIN @ 20 COMMIT @ 25
MVCC DUPLICATE KEY PROBLEM
Txn #1
BEGIN @ 10
Txn #2
BEGIN @ 20 COMMIT @ 25
Txn #3
BEGIN @ 30
MVCC DUPLICATE KEY PROBLEM
Txn #1
BEGIN @ 10
Index
Txn #2
BEGIN @ 20 COMMIT @ 25
Txn #3
BEGIN @ 30
MVCC DUPLICATE KEY PROBLEM
Txn #1
BEGIN @ 10
Index
Txn #2
BEGIN @ 20 COMMIT @ 25
Txn #3
BEGIN @ 30
MVCC INDEXES
Each index's underlying data structure must support the storage of non- unique keys.
Use additional execution logic to perform conditional inserts for pkey / unique indexes. $\rightarrow$ Atomically check whether the key exists and then insert.
Workers may get back multiple entries for a single fetch. They then must follow the pointers to find the proper physical version.
MVCC DELETES
The DBMS physically deletes a tuple from the database only when all versions of a logically deleted tuple are not visible.
$\rightarrow$ If a tuple is deleted, then there cannot be a new version of that tuple after the newest version. $\rightarrow$ No write- write conflicts / first- writer wins
We need a way to denote that tuple has been logically delete at some point in time.
MVCC DELETES
Approach #1: Deleted Flag
$\rightarrow$ Maintain a flag to indicate that the logical tuple has been deleted after the newest physical version. $\rightarrow$ Can either be in tuple header or a separate column.
Approach #2: Tombstone Tuple
$\rightarrow$ Create an empty physical version to indicate that a logical tuple is deleted. $\rightarrow$ Use a separate pool for tombstone tuples with only a special bit pattern in version chain pointer to reduce the storage overhead.
MVCC IMPLEMENTATIONS
Table (html):
| Protocol | Version Storage | Garbage Collection | Indexes |
| Oracle | MV2PL | Delta | Vacuum | Logical |
| Postgres | MV-2PL/MV-TO | Append-Only | Vacuum | Physical |
| MySQL-InnoDB | MV-2PL | Delta | Vacuum | Logical |
| HYRISE | MV-OCC | Append-Only | - | Physical |
| Hekaton | MV-OCC | Append-Only | Cooperative | Physical |
| MemSQL (2015) | MV-OCC | Append-Only | Vacuum | Physical |
| SAP HANA | MV-2PL | Time-travel | Hybrid | Logical |
| NuoDB | MV-2PL | Append-Only | Vacuum | Logical |
| HyPer | MV-OCC | Delta | Txn-level | Logical |
| CockroachDB | MV-2PL | Delta (LSM) | Compaction | Logical |
CONCLUSION
MVCC is the widely used scheme in DBMSs. Even systems that do not support multi- statement txns (e.g., NoSQL) use it.
NEXT CLASS
Logging and recovery!