CMU Database - 15445 - 2025 Spring

cmu_database

This collection of documents, "CMU Database - 15445 - 2025 Spring," provides a comprehensive overview of database systems, primarily focusing on the design, implementation, and ...

Documents Knowledge Graph

lecture-21-slides.pdf

Carnegie Mellon University

Database Systems Database Recovery

ADMINISTRIVIA

Project #4 is due Sunday April 20^th @ 11

$\longrightarrow$ Recitation: Friday, April 11^th in GHC 4303 from 3

- 4

Final Exam is on Monday, April 28, 2025, from 05

- 08

$\longrightarrow$ Early exam will not be offered. Do not make travel plans.

No OH today (I have to teach another class)

UPCOMING DATABASE TALKS

0xide (DB Seminar) $\longrightarrow$ Monday, April 7 @ 4

$\longrightarrow$ OxQL: Oximeter Query Language $\longrightarrow$ Speaker: Ben Naecker $\longrightarrow$ https://cmu.zoom.us/j/93441451665

Oxide

MariaDB (DB Seminar) $\longrightarrow$ Monday, April 14 @ 4

$\longrightarrow$ MariaDB's New Query Optimizer $\longrightarrow$ Speaker: Michael Widenius $\longrightarrow$ https://cmu.zoom.us/j/93441451665

CRASH RECOVERY

Recovery algorithms are techniques to ensure database consistency, transaction atomicity, and durability despite failures.

Recovery algorithms have two parts:

$\rightarrow$ Actions during normal txn processing to ensure that the DBMS can recover from a failure. $\rightarrow$ Actions after a failure to recover the database to a state that ensures atomicity, consistency, and durability.

CRASH RECOVERY

Recovery algorithms are techniques to ensure database consistency, transaction atomicity, and durability despite failures.

Recovery algorithms have two parts:

$\rightarrow$ Actions during normal txn processing to ensure that the DBMS can recover from a failure.

$\rightarrow$ Actions after a failure to recover the database to a state that ensures atomicity, consistency, and durability.

Today

CRASH RECOVERY OVERVIEW

STEAL + NO-FORCE

Atomicity: Txns may abort/fail. Durability: Changes of committed. txns should survive system failure.

Schedule

Table (html):

T1	T2	T3
BEGIN W(A) COMMIT
	BEGIN W(B) ABORT

CRASH RECOVERY OVERVIEW

STEAL + NO-FORCE

Atomicity: Txns may abort/fail. Durability: Changes of committed. txns should survive system failure.

Schedule

CRASH RECOVERY OVERVIEW

STEAL + NO-FORCE

Atomicity: Txns may abort/fail. Durability: Changes of committed. txns should survive system failure.

Desired behavior after the DBMS restarts (i.e., the contents of volatile memory are lost): $\rightarrow \mathbf{T}_1$ should be durable. $\rightarrow \mathbf{T}_2 + \mathbf{T}_3$ should be aborted.

Schedule

ARIES

Algorithms for Recovery and Isolation Exploiting Semantics

Developed at IBM Research in early 1990s for the DB2 DBMS.

Not all systems implement ARIES exactly as defined in this paper but they're close enough.

ARIES: A Transaction Recovery Method Supporting Fine Granularity Locking and Partial Rollbacks Using Write- Ahead Logging

G. MOHAN

IBM Almaden Research Center and DON HADERLE IBM Santa Teresa Laboratory

and

BRUCE LINDSAY, HAMID PIRAHESH and PETER SCHWARZ IBM Almaden Research Center

In this paper we present a simple and efficient method, called ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), which supports partial rollbacks of transactions, fine granularity (e.g., record blocking and recovery using write- ahead logging (WAL). We introduce the paradigm of repeating history to redo all missing updates before performing the rollbacks of the loser transactions during restart after a system failure. ARIES uses a log sequence number in each page to correlate the state of a page with respect to logged updates of that page. All updates of a transaction are logged, including those performed during rollbacks. By appropriate chaining of the log records written during rollbacks to those written during forward progress, a bounded amount of logs rolling is ensured during rollbacks even in the face of repeated failures during restart or of nested rollbacks. We deal with a variety of features that are very important in building and operating an ARIES. We deal with transaction processing system. ARIES supports fuzzy checkpoints, selective and deferred restart, fuzzy image copies, media recovery, and high concurrency lock modes (e.g., increment/decrement) which exploit the semantics of the operations and require the ability to perform operation logging. ARIES is flexible with respect to the kinds of buffer management policies that can be implemented. It supports objects of varying length efficiency. By enabling parallelism during restart, page- oriented redo, and join- lock, the application can be extended to handle the creation of ARIES. ARIES supports B Paradigms for logging and recovery, which were based on the shadow page technique, need to be changed in the context of WAL. We compare ARIES to the WAL- based recovery methods of

ARIES - MAIN IDEAS

Write-Ahead Logging:

$\rightarrow$ Flush WAL record(s) changes to disk before a database object is written to disk. $\rightarrow$ Must use STEAL + NO- FORCE buffer pool policies.

Repeating History During Redo:

$\rightarrow$ On DBMS restart, retrace actions and restore database to exact state before crash.

Logging Changes During Undo:

$\rightarrow$ Record undo actions to log to ensure action is not repeated in the event of repeated failures.

TODAY'S AGENDA

Log Sequence Numbers Normal Commit & Abort Operations Fuzzy Checkpointing Recovery Algorithm

WAL RECORDS

We need to extend our log record format from last class to include additional info.

Every log record includes a globally unique log sequence number (LSN).

$\rightarrow$ LSNs represent the physical order that txns make changes to the database.

Various components in the system keep track of LSNs that pertain to them...

WAL BOOKKEEPING

Log Sequence Number (LSN). $\longrightarrow$ Unique and monotonically increasing.

WAL BOOKKEEPING

Log Sequence Number (LSN). $\longrightarrow$ Unique and monotonically increasing.

Each data page contains a pageLSN. $\longrightarrow$ The LSN of the most recent log record that updated the page.

WAL BOOKKEEPING

Log Sequence Number (LSN). $\longrightarrow$ Unique and monotonically increasing.

Each data page contains a pageLSN. $\longrightarrow$ The LSN of the most recent log record that updated the page.

WAL BOOKKEEPING

Log Sequence Number (LSN). $\longrightarrow$ Unique and monotonically increasing.

Each data page contains a pageLSN. $\longrightarrow$ The LSN of the most recent log record that updated the page.

WAL BOOKKEEPING

Log Sequence Number (LSN). $\rightarrow$ Unique and monotonically increasing.

Each data page contains a pageLSN. $\rightarrow$ The LSN of the most recent log record that updated the page.

System keeps track of flushedLSN. $\rightarrow$ The max LSN flushed so far.

WAL BOOKKEEPING

Log Sequence Number (LSN). $\longrightarrow$ Unique and monotonically increasing.

Each data page contains a pageLSN. $\longrightarrow$ The LSN of the most recent log record that updated the page.

System keeps track of flushedLSN. $\longrightarrow$ The max LSN flushed so far.

WAL BOOKKEEPING

Log Sequence Number (LSN). $\rightarrow$ Unique and monotonically increasing.

Each data page contains a pageLSN. $\rightarrow$ The LSN of the most recent log record that updated the page.

System keeps track of flushedLSN. $\rightarrow$ The max LSN flushed so far.

WAL: Before a $\mathrm{page}_{\mathrm{x}}$ is written, pageLSNx ≤ flushedLSN

LOG SEQUENCE NUMBERS

Table (html):

Name	Location	Definition
fflushedLSN	Memory	Last LSN in log on disk
pageLSN	pagex	Newest update to pagex
recLSN	DPT†	Oldest update to pagex since it was last flushed
lastLSN	ATT*	Latest record of txn T1
MasterRecord	Disk	LSN of latest checkpoint

† DPT = Dirty Page Table.

ATT = Active Transaction Table.

WRITING LOG RECORDS

All log records have an LSN.

Update the pageLSN every time a txn modifies a record in the page.

Update the flushedLSN in memory every time the DBMS writes the WAL buffer to disk.

NORMAL EXECUTION

Each txn invokes a sequence of reads and writes, followed by commit or rollback.

Assumptions in this lecture:

$\rightarrow$ All log records fit within a single page. $\rightarrow$ Disk writes are atomic. $\rightarrow$ Single- versioned tuples with Strong Strict 2PL. $\rightarrow$ STEAL + NO- FORCE buffer management with WAL.

TRANSACTION COMMIT

When a txn commits, the DBMS writes a COMMIT record to log and guarantees that all log records up to txn's COMMIT record are flushed to disk.

$\rightarrow$ Log flushes are sequential, synchronous writes to disk. $\rightarrow$ Many log records per log page.

When the commit succeeds, write a TXN- END record to log. $\rightarrow$ Indicates that no new log record for a txn will appear in the log ever again. $\rightarrow$ This does not need to be flushed immediately.

TRANSACTION COMMIT

TRANSACTION ABORT

Aborting a txn is a special case of the ARIES undo operation applied to only one txn.

We need to add another field to our log records: $\rightarrow$ prevLSN: The previous LSN for the txn. $\rightarrow$ This maintains a linked- list for each txn that makes it easy to walk through its records.

TRANSACTION ABORT

COMPENSATION LOG RECORDS

A CLR describes the actions taken to undo the actions of a previous update record.

It has all the fields of an update log record plus the undoNextLSN pointer (i.e., the next- to- be- undone LSN).

CLRs are added to log records but the DBMS does not wait for them to be flushed before notifying the application that the txn aborted.

TRANSACTION ABORT - CLR EXAMPLE

Table (html):

LSN	prevLSN	TxnId	Type	Object	Before	After	UndoNextLSN
001	nil	T1	BEGIN	-	-	-	-
002	001	T1	UPDATE	A	30	40	-
:
011	002	T1	ABORT	-	-	-	-

TRANSACTION ABORT - CLR EXAMPLE

Table (html):

LSN	prevLSN	TxnId	Type	Object	Before	After	UndoNextLSN
001	nil	T1	BEGIN	-	-	-	-
002	001	T1	UPDATE	A	30	40	-
:
011	002	T1	ABORT	-	-	-	-
:
026	011	T1	CLR-002	A	40	30	001

TRANSACTION ABORT - CLR EXAMPLE

Table (html):

LSN	prevLSN	TxnId	Type	Object	Before	After	UndoNextLSN
001	nil	T1	BEGIN	-	-	-	-
002	001	T1	UPDATE	A	30	40	-
:
011	002	T1	ABORT	-	-	-	-
:
026	011	T1	CLR-002	A	40	30	001

TRANSACTION ABORT - CLR EXAMPLE

Table (html):

LSN	prevLSN	TxnId	Type	Object	Before	After	UndoNextLSN
001	nil	T1	BEGIN	-	-	-	-
002	001	T1	UPDATE	A	30	40	-
:
011	002	T1	ABORT	-	-	-	-
:
026	011	T1	CLR-002	A	40	30	001
027	026	T1	TXN-END	-	-	-	nil

ABORT ALGORITHM

First write an ABORT record to log for the txn.

Then analyze the txn's updates in reverse order. For each update record:

$\rightarrow$ Write a CLR entry to the log. $\rightarrow$ Restore old value.

Lastly, write a TXN- END record and release locks.

Notice: CLRs never need to be undone.

TODAY'S AGENDA

Log Sequence Numbers Normal Commit & Abort Operations Fuzzy Checkpointing Recovery Algorithm

NON-FUZZY CHECKPOINTS

The DBMS halts everything when it takes a checkpoint to ensure a consistent snapshot: $\begin{array}{rl} & \longrightarrow \mathrm{Halt~~the~~startofany~~new~~txns}.\ & \longrightarrow \mathrm{Wait~~until~~all~~active~~txns~~finish~~executing}.\ & \longrightarrow \mathrm{Flushes~~dirty~~pagesondisk}. \end{array}$

This is bad for runtime performance but makes recovery easy.

SLIGHTLY BETTER CHECKPOINTS

Pause modifying txns while the DBMS takes the checkpoint.

$\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.

SLIGHTLY BETTER CHECKPOINTS

Pause modifying txns while the DBMS takes the checkpoint.

SLIGHTLY BETTER CHECKPOINTS

Pause modifying txns while the DBMS takes the checkpoint.

SLIGHTLY BETTER CHECKPOINTS

Pause modifying txns while the DBMS takes the checkpoint.

SLIGHTLY BETTER CHECKPOINTS

Pause modifying txns while the DBMS takes the checkpoint.

SLIGHTLY BETTER CHECKPOINTS

Pause modifying txns while the DBMS takes the checkpoint.

SLIGHTLY BETTER CHECKPOINTS

Pause modifying txns while the DBMS takes the checkpoint.

SLIGHTLY BETTER CHECKPOINTS

Pause modifying txns while the DBMS takes the checkpoint.

SLIGHTLY BETTER CHECKPOINTS

Pause modifying txns while the DBMS takes the checkpoint.

We must record internal state as of the beginning of the checkpoint.

$\longrightarrow$ Active Transaction Table (ATT) $\longrightarrow$ Dirty Page Table (DPT)

ACTIVE TRANSACTION TABLE (ATT)

One entry per currently active txn. $\rightarrow$ txnId: Unique txn identifier. $\rightarrow$ status: The current "mode" of the txn. $\rightarrow$ lastLSN: Most recent LSN created by txn.

Remove entry after the TXN- END record.

Txn Status Codes:

$\rightarrow$ Running (R) $\rightarrow$ Committing (C) $\rightarrow$ Candidate for Undo (U)

DIRTY PAGE TABLE (DPT)

Keep track of which pages in the buffer pool contain changes that have not been flushed to disk.

One entry per dirty page in the buffer pool: $\rightarrow$ recLSN: The LSN of the log record that first caused the page to be dirty.

SLIGHTLY BETTER CHECKPOINTS

At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}_{22})$ .

001: 002: 003:<T1, A→P11, 100, 120> 004: 005:<T2, C→P22, 100, 120> 006:<T TXN- END 007:<CHECKPOINT ATT={T2} DPT={P22} 008: 009:<T2, A→P11, 120, 130> 010: 011:<T3, B→P33, 200, 400> 012:<CHECKPOINT ATT={T2,T3} DPT={P11,P33} 013:<T3, B→P33, 400, 600>

SLIGHTLY BETTER CHECKPOINTS

At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}_{22})$ .

001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010: 011:<T3, B>P33, 200, 400> 012: <CHECKPOINT ATT={T2, T3}, DPT={P11, P33}> 013:<T3, B>P33, 400, 600>

SLIGHTLY BETTER CHECKPOINTS

At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}_{22})$ .

001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007:<CHECKPOINT ATT={T2}, DPT={P22} 008: 009:<T2, A>P11, 120, 130> 010: 011:<T3, B>P33, 200, 400> 012:<CHECKPOINT ATT={T2, T3}, DPT={P11, P33}> 013:<T3, B>P33, 400, 600>

SLIGHTLY BETTER CHECKPOINTS

At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}{22})$ . At the second checkpoint, assuming $\mathsf{P}{22}$ was flushed, $\mathsf{T}{2}$ and $\mathsf{T}{3}$ are active and the dirty pages are $(\mathsf{P}{11}, \mathsf{P}{33})$ .

001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2,C>P22, 100, 120> 006: 007:<CHECKPOINT ATT={T2} DPT={P22} 008: 009:<T2,A>P11, 120, 130> 010: 011:<T3,B>P33, 200, 400> 012:<CHECKPOINT ATT={T2,T3} DPT={P11,P33} 013:<T3,B>P33, 400, 600>

SLIGHTLY BETTER CHECKPOINTS

001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010: 011:<T3, B>P33, 200, 400> 012: 013:<T3, B>P33, 400, 600>

SLIGHTLY BETTER CHECKPOINTS

At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}_{22})$ .

At the second checkpoint, assuming $\mathsf{P}{22}$ was flushed, $\mathsf{T}{2}$ and $\mathsf{T}{3}$ are active and the dirty pages are $(\mathsf{P}{11}, \mathsf{P}_{33})$ .

This still is not ideal because the DBMS must stall txns during checkpoint...

001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2,C>P22, 100, 120> 006:<T1 TXN- END 007:<CHECKPOINT ATT={T2} DPT={P22} 008: 009:<T2,A>P11, 120, 130> 010: 011:<T3,B>P33, 200, 400> 012:<CHECKPOINT ATT={T2,T3} DPT={P11,P33} 013:<T3,B>P33, 400, 000>

FUZZY CHECKPOINTS

A fuzzy checkpoint is where the DBMS allows active txns to continue to run while the system writes the log records for checkpoint. $\longrightarrow$ No attempt to force dirty pages to disk.

New log records to track checkpoint boundaries: $\longrightarrow$ CHECKPOINT- BEGIN: Indicates start of checkpoint $\longrightarrow$ CHECKPOINT- END: Contains ATT + DPT.

FUZZY CHECKPOINTS

Assume the DBMS flushes $\mathsf{P}_{11}$ before the first checkpoint starts.

Any txn that begins after the checkpoint starts is excluded from the ATT in the CHECKPOINT- END record.

The LSN of the CHECKPOINT- BEGIN record is written to the MasterRecord when it completes.

001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2} ,$ DPT={P22}> 011: 012:<T3, B>P33, 200, 400> 013: 014:<T3, B>P33, 10, 12> 015:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2,\mathsf{T}_3} ,$ DPT={P11, P33}

FUZZY CHECKPOINTS

Assume the DBMS flushes $\mathsf{P}_{11}$ before the first checkpoint starts.

Any txn that begins after the checkpoint starts is excluded from the ATT in the CHECKPOINT- END record.

The LSN of the CHECKPOINT- BEGIN record is written to the MasterRecord when it completes.

FUZZY CHECKPOINTS

Assume the DBMS flushes $\mathsf{P}_{11}$ before the first checkpoint starts.

Any txn that begins after the checkpoint starts is excluded from the ATT in the CHECKPOINT- END record.

The LSN of the CHECKPOINT- BEGIN record is written to the MasterRecord when it completes.

001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2}$ DPT={P22} > 011: 012:<T3, B>P33, 200, 400> 013: 014:<T3, B>P33, 10, 12> 015:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2,\mathsf{T}_3}$ DPT={P11, P33} >

FUZZY CHECKPOINTS

Assume the DBMS flushes $\mathsf{P}_{11}$ before the first checkpoint starts.

Any txn that begins after the checkpoint starts is excluded from the ATT in the CHECKPOINT- END record.

The LSN of the CHECKPOINT- BEGIN record is written to the MasterRecord when it completes.

001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010:<CHECKPOINT- END

ATT={T2}, DPT={P22}>

011: 012:<T3, B>P33, 200, 400> 013: 014:<T3, B>P33, 10, 12> 015:<CHECKPOINT- END

ATT={T2, T3}, DPT={P11, P33}>

FUZZY CHECKPOINTS

Assume the DBMS flushes $\mathsf{P}_{11}$ before the first checkpoint starts.

Any txn that begins after the checkpoint starts is excluded from the ATT in the CHECKPOINT- END record.

The LSN of the CHECKPOINT- BEGIN record is written to the MasterRecord when it completes.

001: 002: 003:<T1, A>P11, 100, 120> 004: <T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2} ,$ DPT={P22} 011: 012:<T3, B>P33, 200, 400> 013: 014:<T3, B>P33, 10, 12> 015:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2,\mathsf{T}_3} ,$ DPT={P11, P33}

ARIES - RECOVERY PHASES

Phase #1 - Analysis

$\rightarrow$ Examine the WAL in forward direction starting at MasterRecord to identify dirty pages in the buffer pool and active txns at the time of the crash.

Phase #2 - Redo

$\rightarrow$ Repeat all actions starting from an appropriate point in the log (even txns that will abort).

Phase #3 - Undo

$\rightarrow$ Reverse the actions of txns that did not commit before the crash.

ARIES - OVERVIEW