Carnegie Mellon University
Database Systems Database Recovery
ADMINISTRIVIA
Project #4 is due Sunday April 20th @ 11
$\longrightarrow$ Recitation: Friday, April 11
th in GHC 4303 from 3
- 4
PM
Final Exam is on Monday, April 28, 2025, from 05
- 08
$\longrightarrow$ Early exam will not be offered. Do not make travel plans.
No OH today (I have to teach another class)
UPCOMING DATABASE TALKS
0xide (DB Seminar) $\longrightarrow$ Monday, April 7 @ 4
$\longrightarrow$ OxQL: Oximeter Query Language $\longrightarrow$ Speaker: Ben Naecker $\longrightarrow$
https://cmu.zoom.us/j/93441451665
Oxide
MariaDB (DB Seminar) $\longrightarrow$ Monday, April 14 @ 4
$\longrightarrow$ MariaDB's New Query Optimizer $\longrightarrow$ Speaker: Michael Widenius $\longrightarrow$
https://cmu.zoom.us/j/93441451665
CRASH RECOVERY
Recovery algorithms are techniques to ensure database consistency, transaction atomicity, and durability despite failures.
Recovery algorithms have two parts:
$\rightarrow$ Actions during normal txn processing to ensure that the DBMS can recover from a failure. $\rightarrow$ Actions after a failure to recover the database to a state that ensures atomicity, consistency, and durability.
CRASH RECOVERY
Recovery algorithms are techniques to ensure database consistency, transaction atomicity, and durability despite failures.
Recovery algorithms have two parts:
$\rightarrow$ Actions during normal txn processing to ensure that the DBMS can recover from a failure.
$\rightarrow$ Actions after a failure to recover the database to a state that ensures atomicity, consistency, and durability.
Today
CRASH RECOVERY OVERVIEW
STEAL + NO-FORCE
Atomicity: Txns may abort/fail. Durability: Changes of committed. txns should survive system failure.
Schedule
Table (html):
| T1 | T2 | T3 |
| BEGIN
W(A)
COMMIT | | |
| BEGIN
W(B)
ABORT | |
CRASH RECOVERY OVERVIEW
STEAL + NO-FORCE
Atomicity: Txns may abort/fail. Durability: Changes of committed. txns should survive system failure.
Schedule
CRASH RECOVERY OVERVIEW
STEAL + NO-FORCE
Atomicity: Txns may abort/fail. Durability: Changes of committed. txns should survive system failure.
Desired behavior after the DBMS restarts (i.e., the contents of volatile memory are lost): $\rightarrow \mathbf{T}_1$ should be durable. $\rightarrow \mathbf{T}_2 + \mathbf{T}_3$ should be aborted.
Schedule
ARIES
Algorithms for Recovery and Isolation Exploiting Semantics
Developed at IBM Research in early 1990s for the DB2 DBMS.
Not all systems implement ARIES exactly as defined in this paper but they're close enough.
ARIES: A Transaction Recovery Method Supporting Fine Granularity Locking and Partial Rollbacks Using Write- Ahead Logging
G. MOHAN
IBM Almaden Research Center and DON HADERLE IBM Santa Teresa Laboratory
and
BRUCE LINDSAY, HAMID PIRAHESH and PETER SCHWARZ IBM Almaden Research Center
In this paper we present a simple and efficient method, called ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), which supports partial rollbacks of transactions, fine granularity (e.g., record blocking and recovery using write- ahead logging (WAL). We introduce the paradigm of repeating history to redo all missing updates before performing the rollbacks of the loser transactions during restart after a system failure. ARIES uses a log sequence number in each page to correlate the state of a page with respect to logged updates of that page. All updates of a transaction are logged, including those performed during rollbacks. By appropriate chaining of the log records written during rollbacks to those written during forward progress, a bounded amount of logs rolling is ensured during rollbacks even in the face of repeated failures during restart or of nested rollbacks. We deal with a variety of features that are very important in building and operating an ARIES. We deal with transaction processing system. ARIES supports fuzzy checkpoints, selective and deferred restart, fuzzy image copies, media recovery, and high concurrency lock modes (e.g., increment/decrement) which exploit the semantics of the operations and require the ability to perform operation logging. ARIES is flexible with respect to the kinds of buffer management policies that can be implemented. It supports objects of varying length efficiency. By enabling parallelism during restart, page- oriented redo, and join- lock, the application can be extended to handle the creation of ARIES. ARIES supports B Paradigms for logging and recovery, which were based on the shadow page technique, need to be changed in the context of WAL. We compare ARIES to the WAL- based recovery methods of
ARIES - MAIN IDEAS
Write-Ahead Logging:
$\rightarrow$ Flush WAL record(s) changes to disk before a database object is written to disk. $\rightarrow$ Must use STEAL + NO- FORCE buffer pool policies.
Repeating History During Redo:
$\rightarrow$ On DBMS restart, retrace actions and restore database to exact state before crash.
Logging Changes During Undo:
$\rightarrow$ Record undo actions to log to ensure action is not repeated in the event of repeated failures.
TODAY'S AGENDA
Log Sequence Numbers Normal Commit & Abort Operations Fuzzy Checkpointing Recovery Algorithm
WAL RECORDS
We need to extend our log record format from last class to include additional info.
Every log record includes a globally unique log sequence number (LSN).
$\rightarrow$ LSNs represent the physical order that txns make changes to the database.
Various components in the system keep track of LSNs that pertain to them...
WAL BOOKKEEPING
Log Sequence Number (LSN). $\longrightarrow$ Unique and monotonically increasing.
WAL BOOKKEEPING
Log Sequence Number (LSN). $\longrightarrow$ Unique and monotonically increasing.
Each data page contains a pageLSN. $\longrightarrow$ The LSN of the most recent log record that updated the page.
WAL BOOKKEEPING
Log Sequence Number (LSN). $\longrightarrow$ Unique and monotonically increasing.
Each data page contains a pageLSN. $\longrightarrow$ The LSN of the most recent log record that updated the page.
WAL BOOKKEEPING
Log Sequence Number (LSN). $\longrightarrow$ Unique and monotonically increasing.
Each data page contains a pageLSN. $\longrightarrow$ The LSN of the most recent log record that updated the page.
WAL BOOKKEEPING
Log Sequence Number (LSN). $\rightarrow$ Unique and monotonically increasing.
Each data page contains a pageLSN. $\rightarrow$ The LSN of the most recent log record that updated the page.
System keeps track of flushedLSN. $\rightarrow$ The max LSN flushed so far.
WAL BOOKKEEPING
Log Sequence Number (LSN). $\longrightarrow$ Unique and monotonically increasing.
Each data page contains a pageLSN. $\longrightarrow$ The LSN of the most recent log record that updated the page.
System keeps track of flushedLSN. $\longrightarrow$ The max LSN flushed so far.
WAL BOOKKEEPING
Log Sequence Number (LSN). $\rightarrow$ Unique and monotonically increasing.
Each data page contains a pageLSN. $\rightarrow$ The LSN of the most recent log record that updated the page.
System keeps track of flushedLSN. $\rightarrow$ The max LSN flushed so far.
WAL: Before a $\mathrm{page}_{\mathrm{x}}$ is written, pageLSNx ≤ flushedLSN
LOG SEQUENCE NUMBERS
Table (html):
| Name | Location | Definition |
| fflushedLSN | Memory | Last LSN in log on disk |
| pageLSN | pagex | Newest update to pagex |
| recLSN | DPT† | Oldest update to pagex since it was last flushed |
| lastLSN | ATT* | Latest record of txn T1 |
| MasterRecord | Disk | LSN of latest checkpoint |
† DPT = Dirty Page Table.
- ATT = Active Transaction Table.
WRITING LOG RECORDS
WRITING LOG RECORDS
WRITING LOG RECORDS
WRITING LOG RECORDS
WRITING LOG RECORDS
WRITING LOG RECORDS
WRITING LOG RECORDS
WRITING LOG RECORDS
WRITING LOG RECORDS
WRITING LOG RECORDS
All log records have an LSN.
Update the pageLSN every time a txn modifies a record in the page.
Update the flushedLSN in memory every time the DBMS writes the WAL buffer to disk.
NORMAL EXECUTION
Each txn invokes a sequence of reads and writes, followed by commit or rollback.
Assumptions in this lecture:
$\rightarrow$ All log records fit within a single page. $\rightarrow$ Disk writes are atomic. $\rightarrow$ Single- versioned tuples with Strong Strict 2PL. $\rightarrow$ STEAL + NO- FORCE buffer management with WAL.
TRANSACTION COMMIT
When a txn commits, the DBMS writes a COMMIT record to log and guarantees that all log records up to txn's COMMIT record are flushed to disk.
$\rightarrow$ Log flushes are sequential, synchronous writes to disk. $\rightarrow$ Many log records per log page.
When the commit succeeds, write a TXN- END record to log. $\rightarrow$ Indicates that no new log record for a txn will appear in the log ever again. $\rightarrow$ This does not need to be flushed immediately.
TRANSACTION COMMIT
TRANSACTION COMMIT
TRANSACTION COMMIT
TRANSACTION COMMIT
TRANSACTION COMMIT
TRANSACTION COMMIT
TRANSACTION COMMIT
TRANSACTION COMMIT
TRANSACTION ABORT
Aborting a txn is a special case of the ARIES undo operation applied to only one txn.
We need to add another field to our log records: $\rightarrow$ prevLSN: The previous LSN for the txn. $\rightarrow$ This maintains a linked- list for each txn that makes it easy to walk through its records.
TRANSACTION ABORT
TRANSACTION ABORT
TRANSACTION ABORT
TRANSACTION ABORT
TRANSACTION ABORT
COMPENSATION LOG RECORDS
A CLR describes the actions taken to undo the actions of a previous update record.
It has all the fields of an update log record plus the undoNextLSN pointer (i.e., the next- to- be- undone LSN).
CLRs are added to log records but the DBMS does not wait for them to be flushed before notifying the application that the txn aborted.
TRANSACTION ABORT - CLR EXAMPLE
Table (html):
| LSN | prevLSN | TxnId | Type | Object | Before | After | UndoNextLSN |
| 001 | nil | T1 | BEGIN | - | - | - | - |
| 002 | 001 | T1 | UPDATE | A | 30 | 40 | - |
| : |
| 011 | 002 | T1 | ABORT | - | - | - | - |
TRANSACTION ABORT - CLR EXAMPLE
Table (html):
| LSN | prevLSN | TxnId | Type | Object | Before | After | UndoNextLSN |
| 001 | nil | T1 | BEGIN | - | - | - | - |
| 002 | 001 | T1 | UPDATE | A | 30 | 40 | - |
| : | | | | | | | |
| 011 | 002 | T1 | ABORT | - | - | - | - |
| : | | | | | | | |
| 026 | 011 | T1 | CLR-002 | A | 40 | 30 | 001 |
TRANSACTION ABORT - CLR EXAMPLE
TRANSACTION ABORT - CLR EXAMPLE
Table (html):
| LSN | prevLSN | TxnId | Type | Object | Before | After | UndoNextLSN |
| 001 | nil | T1 | BEGIN | - | - | - | - |
| 002 | 001 | T1 | UPDATE | A | 30 | 40 | - |
| : | | | | | | | |
| 011 | 002 | T1 | ABORT | - | - | - | - |
| : | | | | | | | |
| 026 | 011 | T1 | CLR-002 | A | 40 | 30 | 001 |
TRANSACTION ABORT - CLR EXAMPLE
TRANSACTION ABORT - CLR EXAMPLE
TRANSACTION ABORT - CLR EXAMPLE
Table (html):
| LSN | prevLSN | TxnId | Type | Object | Before | After | UndoNextLSN |
| 001 | nil | T1 | BEGIN | - | - | - | - |
| 002 | 001 | T1 | UPDATE | A | 30 | 40 | - |
| : | | | | | | | |
| 011 | 002 | T1 | ABORT | - | - | - | - |
| : | | | | | | | |
| 026 | 011 | T1 | CLR-002 | A | 40 | 30 | 001 |
| 027 | 026 | T1 | TXN-END | - | - | - | nil |
ABORT ALGORITHM
First write an ABORT record to log for the txn.
Then analyze the txn's updates in reverse order. For each update record:
$\rightarrow$ Write a CLR entry to the log. $\rightarrow$ Restore old value.
Lastly, write a TXN- END record and release locks.
Notice: CLRs never need to be undone.
TODAY'S AGENDA
Log Sequence Numbers Normal Commit & Abort Operations Fuzzy Checkpointing Recovery Algorithm
NON-FUZZY CHECKPOINTS
The DBMS halts everything when it takes a checkpoint to ensure a consistent snapshot: $\begin{array}{rl} & \longrightarrow \mathrm{Haltthestartofanynewtxns}.\ & \longrightarrow \mathrm{Waituntilallactivetxnsfinishexecuting}.\ & \longrightarrow \mathrm{Flushesdirtypagesondisk}. \end{array}$
This is bad for runtime performance but makes recovery easy.
SLIGHTLY BETTER CHECKPOINTS
Pause modifying txns while the DBMS takes the checkpoint.
$\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.
SLIGHTLY BETTER CHECKPOINTS
Pause modifying txns while the DBMS takes the checkpoint.
$\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.
SLIGHTLY BETTER CHECKPOINTS
Pause modifying txns while the DBMS takes the checkpoint.
$\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.
SLIGHTLY BETTER CHECKPOINTS
Pause modifying txns while the DBMS takes the checkpoint.
$\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.
SLIGHTLY BETTER CHECKPOINTS
Pause modifying txns while the DBMS takes the checkpoint.
$\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.
SLIGHTLY BETTER CHECKPOINTS
Pause modifying txns while the DBMS takes the checkpoint.
$\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.
SLIGHTLY BETTER CHECKPOINTS
Pause modifying txns while the DBMS takes the checkpoint.
$\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.
SLIGHTLY BETTER CHECKPOINTS
Pause modifying txns while the DBMS takes the checkpoint.
$\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.
SLIGHTLY BETTER CHECKPOINTS
Pause modifying txns while the DBMS takes the checkpoint.
$\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.
We must record internal state as of the beginning of the checkpoint.
$\longrightarrow$ Active Transaction Table (ATT) $\longrightarrow$ Dirty Page Table (DPT)
ACTIVE TRANSACTION TABLE (ATT)
One entry per currently active txn. $\rightarrow$ txnId: Unique txn identifier. $\rightarrow$ status: The current "mode" of the txn. $\rightarrow$ lastLSN: Most recent LSN created by txn.
Remove entry after the TXN- END record.
Txn Status Codes:
$\rightarrow$ Running (R) $\rightarrow$ Committing (C) $\rightarrow$ Candidate for Undo (U)
DIRTY PAGE TABLE (DPT)
Keep track of which pages in the buffer pool contain changes that have not been flushed to disk.
One entry per dirty page in the buffer pool: $\rightarrow$ recLSN: The LSN of the log record that first caused the page to be dirty.
SLIGHTLY BETTER CHECKPOINTS
At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}_{22})$ .
001: 002: 003:<T1, A→P11, 100, 120> 004: 005:<T2, C→P22, 100, 120> 006:<T TXN- END 007:<CHECKPOINT ATT={T2} DPT={P22} 008: 009:<T2, A→P11, 120, 130> 010: 011:<T3, B→P33, 200, 400> 012:<CHECKPOINT ATT={T2,T3} DPT={P11,P33} 013:<T3, B→P33, 400, 600>
SLIGHTLY BETTER CHECKPOINTS
At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}_{22})$ .
001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010: 011:<T3, B>P33, 200, 400> 012: <CHECKPOINT ATT={T2, T3}, DPT={P11, P33}> 013:<T3, B>P33, 400, 600>
SLIGHTLY BETTER CHECKPOINTS
At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}_{22})$ .
001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007:<CHECKPOINT ATT={T2}, DPT={P22} 008: 009:<T2, A>P11, 120, 130> 010: 011:<T3, B>P33, 200, 400> 012:<CHECKPOINT ATT={T2, T3}, DPT={P11, P33}> 013:<T3, B>P33, 400, 600>
SLIGHTLY BETTER CHECKPOINTS
At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}{22})$ . At the second checkpoint, assuming $\mathsf{P}{22}$ was flushed, $\mathsf{T}{2}$ and $\mathsf{T}{3}$ are active and the dirty pages are $(\mathsf{P}{11}, \mathsf{P}{33})$ .
001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2,C>P22, 100, 120> 006: 007:<CHECKPOINT ATT={T2} DPT={P22} 008: 009:<T2,A>P11, 120, 130> 010: 011:<T3,B>P33, 200, 400> 012:<CHECKPOINT ATT={T2,T3} DPT={P11,P33} 013:<T3,B>P33, 400, 600>
SLIGHTLY BETTER CHECKPOINTS
At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}{22})$ . At the second checkpoint, assuming $\mathsf{P}{22}$ was flushed, $\mathsf{T}{2}$ and $\mathsf{T}{3}$ are active and the dirty pages are $(\mathsf{P}{11}, \mathsf{P}{33})$ .
001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010: 011:<T3, B>P33, 200, 400> 012: 013:<T3, B>P33, 400, 600>
SLIGHTLY BETTER CHECKPOINTS
At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}{22})$ . At the second checkpoint, assuming $\mathsf{P}{22}$ was flushed, $\mathsf{T}{2}$ and $\mathsf{T}{3}$ are active and the dirty pages are $(\mathsf{P}{11}, \mathsf{P}{33})$ .
001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2,C>P22, 100, 120> 006: 007:<CHECKPOINT ATT={T2} DPT={P22} 008: 009:<T2,A>P11, 120, 130> 010: 011:<T3,B>P33, 200, 400> 012:<CHECKPOINT ATT={T2,T3} DPT={P11,P33} 013:<T3,B>P33, 400, 000>
SLIGHTLY BETTER CHECKPOINTS
At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}_{22})$ .
At the second checkpoint, assuming $\mathsf{P}{22}$ was flushed, $\mathsf{T}{2}$ and $\mathsf{T}{3}$ are active and the dirty pages are $(\mathsf{P}{11}, \mathsf{P}_{33})$ .
This still is not ideal because the DBMS must stall txns during checkpoint...
001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2,C>P22, 100, 120> 006:<T1 TXN- END 007:<CHECKPOINT ATT={T2} DPT={P22} 008: 009:<T2,A>P11, 120, 130> 010: 011:<T3,B>P33, 200, 400> 012:<CHECKPOINT ATT={T2,T3} DPT={P11,P33} 013:<T3,B>P33, 400, 000>
FUZZY CHECKPOINTS
A fuzzy checkpoint is where the DBMS allows active txns to continue to run while the system writes the log records for checkpoint. $\longrightarrow$ No attempt to force dirty pages to disk.
New log records to track checkpoint boundaries: $\longrightarrow$ CHECKPOINT- BEGIN: Indicates start of checkpoint $\longrightarrow$ CHECKPOINT- END: Contains ATT + DPT.
FUZZY CHECKPOINTS
Assume the DBMS flushes $\mathsf{P}_{11}$ before the first checkpoint starts.
Any txn that begins after the checkpoint starts is excluded from the ATT in the CHECKPOINT- END record.
The LSN of the CHECKPOINT- BEGIN record is written to the MasterRecord when it completes.
001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2} ,$ DPT={P22}> 011: 012:<T3, B>P33, 200, 400> 013: 014:<T3, B>P33, 10, 12> 015:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2,\mathsf{T}_3} ,$ DPT={P11, P33}
FUZZY CHECKPOINTS
Assume the DBMS flushes $\mathsf{P}_{11}$ before the first checkpoint starts.
Any txn that begins after the checkpoint starts is excluded from the ATT in the CHECKPOINT- END record.
The LSN of the CHECKPOINT- BEGIN record is written to the MasterRecord when it completes.
001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2} ,$ DPT={P22}> 011: 012:<T3, B>P33, 200, 400> 013: 014:<T3, B>P33, 10, 12> 015:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2,\mathsf{T}_3} ,$ DPT={P11, P33}
FUZZY CHECKPOINTS
Assume the DBMS flushes $\mathsf{P}_{11}$ before the first checkpoint starts.
Any txn that begins after the checkpoint starts is excluded from the ATT in the CHECKPOINT- END record.
The LSN of the CHECKPOINT- BEGIN record is written to the MasterRecord when it completes.
001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2}$ DPT={P22} > 011: 012:<T3, B>P33, 200, 400> 013: 014:<T3, B>P33, 10, 12> 015:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2,\mathsf{T}_3}$ DPT={P11, P33} >
FUZZY CHECKPOINTS
Assume the DBMS flushes $\mathsf{P}_{11}$ before the first checkpoint starts.
Any txn that begins after the checkpoint starts is excluded from the ATT in the CHECKPOINT- END record.
The LSN of the CHECKPOINT- BEGIN record is written to the MasterRecord when it completes.
001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010:<CHECKPOINT- END
ATT={T2}, DPT={P22}>
011: 012:<T3, B>P33, 200, 400> 013: 014:<T3, B>P33, 10, 12> 015:<CHECKPOINT- END
ATT={T2, T3}, DPT={P11, P33}>
FUZZY CHECKPOINTS
Assume the DBMS flushes $\mathsf{P}_{11}$ before the first checkpoint starts.
Any txn that begins after the checkpoint starts is excluded from the ATT in the CHECKPOINT- END record.
The LSN of the CHECKPOINT- BEGIN record is written to the MasterRecord when it completes.
001: 002: 003:<T1, A>P11, 100, 120> 004: <T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2} ,$ DPT={P22} 011: 012:<T3, B>P33, 200, 400> 013: 014:<T3, B>P33, 10, 12> 015:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2,\mathsf{T}_3} ,$ DPT={P11, P33}
ARIES - RECOVERY PHASES
Phase #1 - Analysis
$\rightarrow$ Examine the WAL in forward direction starting at MasterRecord to identify dirty pages in the buffer pool and active txns at the time of the crash.
Phase #2 - Redo
$\rightarrow$ Repeat all actions starting from an appropriate point in the log (even txns that will abort).
Phase #3 - Undo
$\rightarrow$ Reverse the actions of txns that did not commit before the crash.
ARIES - OVERVIEW
Start from last BEGIN- CHECKPOINT found via MasterRecord.
Analysis: Figure out which txns committed or failed since checkpoint.
Redo: Repeat all actions.
Undo: Reverse effects of failed txns.
ARIES - OVERVIEW
Start from last BEGIN- CHECKPOINT found via MasterRecord.
Analysis: Figure out which txns committed or failed since checkpoint.
Redo: Repeat all actions.
Undo: Reverse effects of failed txns.
ARIES - OVERVIEW
Start from last BEGIN- CHECKPOINT found via MasterRecord.
Analysis: Figure out which txns committed or failed since checkpoint.
Redo: Repeat all actions.
Undo: Reverse effects of failed txns.
ARIES - OVERVIEW
Start from last BEGIN- CHECKPOINT found via MasterRecord.
Analysis: Figure out which txns committed or failed since checkpoint.
Redo: Repeat all actions.
Undo: Reverse effects of failed txns.
ARIES - OVERVIEW
Start from last BEGIN- CHECKPOINT found via MasterRecord.
Analysis: Figure out which txns committed or failed since checkpoint.
Redo: Repeat all actions.
Undo: Reverse effects of failed txns.
ARIES - OVERVIEW
Start from last BEGIN- CHECKPOINT found via MasterRecord.
Analysis: Figure out which txns committed or failed since checkpoint.
Redo: Repeat all actions.
Undo: Reverse effects of failed txns.
ARIES - OVERVIEW
Start from last BEGIN- CHECKPOINT found via MasterRecord.
Analysis: Figure out which txns committed or failed since checkpoint.
Redo: Repeat all actions.
Undo: Reverse effects of failed txns.
ARIES - OVERVIEW
Start from last BEGIN- CHECKPOINT found via MasterRecord.
Analysis: Figure out which txns committed or failed since checkpoint.
Redo: Repeat all actions.
Undo: Reverse effects of failed txns.
ARIES - OVERVIEW
Start from last BEGIN- CHECKPOINT found via MasterRecord.
Analysis: Figure out which txns committed or failed since checkpoint.
Redo: Repeat all actions.
Undo: Reverse effects of failed txns.
ARIES - OVERVIEW
Start from last BEGIN- CHECKPOINT found via MasterRecord.
Analysis: Figure out which txns committed or failed since checkpoint.
Redo: Repeat all actions.
Undo: Reverse effects of failed txns.
ANALYSIS PHASE
Scan log forward from last successful checkpoint. If the DBMS finds a TXN- END record, remove its corresponding txn from ATT.
All other records: $\rightarrow$ If txn not in ATT, add it with status UNDO. $\rightarrow$ On commit, change txn status to COMMIT.
For update log records: $\rightarrow$ If page P not in DPT, add P to DPT, set its recLSN=LSN.
ANALYSIS PHASE
At end of the Analysis Phase: $\rightarrow$ ATT identifies which txns were active at time of crash. $\rightarrow$ DPT identifies which dirty pages might not have made it to disk.
ANALYSIS PHASE EXAMPLE
Table (html):
ANALYSIS PHASE EXAMPLE
ANALYSIS PHASE EXAMPLE
ANALYSIS PHASE EXAMPLE
ANALYSIS PHASE EXAMPLE
ANALYSIS PHASE EXAMPLE
ANALYSIS PHASE EXAMPLE
ANALYSIS PHASE EXAMPLE
ANALYSIS PHASE EXAMPLE
ANALYSIS PHASE EXAMPLE
REDO PHASE
The goal is to repeat history to reconstruct the database state at the moment of the crash: $\rightarrow$ Reapply all updates (even aborted txns!) and redo CLRs.
There are techniques that allow the DBMS to avoid unnecessary reads/writes, but we will ignore that in this lecture...
REDO PHASE
Scan forward from the log record containing smallest recLSN in DPT.
For each update log record or CLR with a given LSN, redo the action unless:
$\rightarrow$ The affected page is not in DPT, or $\rightarrow$ The affected page is in DPT, but that log record's LSN is less than the page's recLSN. (The update was propagated to disk.) $\rightarrow$ Log record's LSN ≤ pageLSN; DBMS must fetch page from the disk to read page value.
REDO PHASE
To redo an action: $\rightarrow$ Reapply logged update. $\rightarrow$ Set pageLSN to log record's LSN. $\rightarrow$ No additional logging, no forced flushes!
At the end of Redo Phase, write TXN- END log records for all txns with status C and remove them from the ATT.
UNDO PHASE
Undo all txns that were active at the time of crash and therefore will never commit.
$\rightarrow$ These are all the txns with U status in the ATT after the Analysis Phase.
Process them in reverse LSN order using the lastLSN to speed up traversal.
$\rightarrow$ At each step, pick the largest lastLSN across all transactions in the ATT. $\rightarrow$ Traverse lastLSNs in the same order, but in reverse, for how the updates happened originally.
Write a CLR for every modification.
FULL EXAMPLE
FULL EXAMPLE
FULL EXAMPLE
FULL EXAMPLE
FULL EXAMPLE
FULL EXAMPLE
FULL EXAMPLE
FULL EXAMPLE
FULL EXAMPLE
FULL EXAMPLE
FULL EXAMPLE
FULL EXAMPLE
FULL EXAMPLE
ADDITIONAL CRASH ISSUES (1)
What does the DBMS do if it crashes during recovery in the Analysis Phase?
ADDITIONAL CRASH ISSUES (1)
What does the DBMS do if it crashes during recovery in the Analysis Phase?
$\longrightarrow$ Nothing. Just run recovery again.
ADDITIONAL CRASH ISSUES (1)
What does the DBMS do if it crashes during recovery in the Analysis Phase?
$\longrightarrow$ Nothing. Just run recovery again.
What does the DBMS do if it crashes during recovery in the Redo Phase?
ADDITIONAL CRASH ISSUES (1)
What does the DBMS do if it crashes during recovery in the Analysis Phase?
$\longrightarrow$ Nothing. Just run recovery again.
What does the DBMS do if it crashes during recovery in the Redo Phase?
$\longrightarrow$ Again nothing. Redo everything again.
ADDITIONAL CRASH ISSUES (2)
How can the DBMS improve performance during recovery in the Redo Phase?
$\rightarrow$ Assume that it is not going to crash again and flush all changes to disk asynchronously in the background.
How can the DBMS improve performance during recovery in the Undo Phase?
$\rightarrow$ Lazily rollback changes before new txns access pages. $\rightarrow$ Rewrite the application to avoid long- running txns.
CONCLUSION
Mains ideas of ARIES:
$\longrightarrow$ WAL with STEAL + NO- FORCE $\longrightarrow$ Fuzzy Checkpoints (snapshot of dirty page ids) $\longrightarrow$ Redo everything since the earliest dirty page $\longrightarrow$ Undo txns that never commit $\longrightarrow$ Write CLRs when undoing, to survive failures during restarts
Log Sequence Numbers:
$\longrightarrow$ LSNs identify log records; linked into backwards chains per transaction via prevLSN. $\longrightarrow$ pageLSN allows comparison of data page and log records.
NEXT CLASS
You now know how to build a single- node DBMS.
Let's make it even more challenging and start talking about distributed DBMSs!