GithubDocuments
    1. Marketplace
    2. CMU Database - 15445 - 2025 Spring
    Collections
    CMU Database - 15445 - 2025 Spring
    cmu_database
    This collection of documents, "CMU Database - 15445 - 2025 Spring," provides a comprehensive overview of database systems, primarily focusing on the design, implementation, and ...
    DocumentsKnowledge Graph
    lecture-21-slides.pdf
    Carnegie Mellon University
    Database Systems Database Recovery

    ADMINISTRIVIA

    Project #4 is due Sunday April 20th @ 11
    $\longrightarrow$ Recitation: Friday, April 11th in GHC 4303 from 3
    - 4
    PM
    Final Exam is on Monday, April 28, 2025, from 05
    - 08
    $\longrightarrow$ Early exam will not be offered. Do not make travel plans.
    No OH today (I have to teach another class)

    UPCOMING DATABASE TALKS

    0xide (DB Seminar) $\longrightarrow$ Monday, April 7 @ 4
    $\longrightarrow$ OxQL: Oximeter Query Language $\longrightarrow$ Speaker: Ben Naecker $\longrightarrow$ https://cmu.zoom.us/j/93441451665

    Oxide

    MariaDB (DB Seminar) $\longrightarrow$ Monday, April 14 @ 4
    $\longrightarrow$ MariaDB's New Query Optimizer $\longrightarrow$ Speaker: Michael Widenius $\longrightarrow$ https://cmu.zoom.us/j/93441451665

    CRASH RECOVERY

    Recovery algorithms are techniques to ensure database consistency, transaction atomicity, and durability despite failures.
    Recovery algorithms have two parts:
    $\rightarrow$ Actions during normal txn processing to ensure that the DBMS can recover from a failure. $\rightarrow$ Actions after a failure to recover the database to a state that ensures atomicity, consistency, and durability.

    CRASH RECOVERY

    Recovery algorithms are techniques to ensure database consistency, transaction atomicity, and durability despite failures.
    Recovery algorithms have two parts:
    $\rightarrow$ Actions during normal txn processing to ensure that the DBMS can recover from a failure.
    $\rightarrow$ Actions after a failure to recover the database to a state that ensures atomicity, consistency, and durability.
    Today

    CRASH RECOVERY OVERVIEW

    STEAL + NO-FORCE

    Atomicity: Txns may abort/fail. Durability: Changes of committed. txns should survive system failure.

    Schedule

    Table (html):
    T1T2T3
    BEGIN W(A) COMMIT
    BEGIN W(B) ABORT

    CRASH RECOVERY OVERVIEW

    STEAL + NO-FORCE

    Atomicity: Txns may abort/fail. Durability: Changes of committed. txns should survive system failure.

    Schedule

    CRASH RECOVERY OVERVIEW

    STEAL + NO-FORCE

    Atomicity: Txns may abort/fail. Durability: Changes of committed. txns should survive system failure.
    Desired behavior after the DBMS restarts (i.e., the contents of volatile memory are lost): $\rightarrow \mathbf{T}_1$ should be durable. $\rightarrow \mathbf{T}_2 + \mathbf{T}_3$ should be aborted.

    Schedule

    ARIES

    Algorithms for Recovery and Isolation Exploiting Semantics

    Developed at IBM Research in early 1990s for the DB2 DBMS.
    Not all systems implement ARIES exactly as defined in this paper but they're close enough.
    ARIES: A Transaction Recovery Method Supporting Fine Granularity Locking and Partial Rollbacks Using Write- Ahead Logging
    G. MOHAN
    IBM Almaden Research Center and DON HADERLE IBM Santa Teresa Laboratory
    and
    BRUCE LINDSAY, HAMID PIRAHESH and PETER SCHWARZ IBM Almaden Research Center
    In this paper we present a simple and efficient method, called ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), which supports partial rollbacks of transactions, fine granularity (e.g., record blocking and recovery using write- ahead logging (WAL). We introduce the paradigm of repeating history to redo all missing updates before performing the rollbacks of the loser transactions during restart after a system failure. ARIES uses a log sequence number in each page to correlate the state of a page with respect to logged updates of that page. All updates of a transaction are logged, including those performed during rollbacks. By appropriate chaining of the log records written during rollbacks to those written during forward progress, a bounded amount of logs rolling is ensured during rollbacks even in the face of repeated failures during restart or of nested rollbacks. We deal with a variety of features that are very important in building and operating an ARIES. We deal with transaction processing system. ARIES supports fuzzy checkpoints, selective and deferred restart, fuzzy image copies, media recovery, and high concurrency lock modes (e.g., increment/decrement) which exploit the semantics of the operations and require the ability to perform operation logging. ARIES is flexible with respect to the kinds of buffer management policies that can be implemented. It supports objects of varying length efficiency. By enabling parallelism during restart, page- oriented redo, and join- lock, the application can be extended to handle the creation of ARIES. ARIES supports B Paradigms for logging and recovery, which were based on the shadow page technique, need to be changed in the context of WAL. We compare ARIES to the WAL- based recovery methods of

    ARIES - MAIN IDEAS

    Write-Ahead Logging:

    $\rightarrow$ Flush WAL record(s) changes to disk before a database object is written to disk. $\rightarrow$ Must use STEAL + NO- FORCE buffer pool policies.

    Repeating History During Redo:

    $\rightarrow$ On DBMS restart, retrace actions and restore database to exact state before crash.

    Logging Changes During Undo:

    $\rightarrow$ Record undo actions to log to ensure action is not repeated in the event of repeated failures.

    TODAY'S AGENDA

    Log Sequence Numbers Normal Commit & Abort Operations Fuzzy Checkpointing Recovery Algorithm

    WAL RECORDS

    We need to extend our log record format from last class to include additional info.
    Every log record includes a globally unique log sequence number (LSN).
    $\rightarrow$ LSNs represent the physical order that txns make changes to the database.
    Various components in the system keep track of LSNs that pertain to them...

    WAL BOOKKEEPING

    Log Sequence Number (LSN). $\longrightarrow$ Unique and monotonically increasing.

    WAL BOOKKEEPING

    Log Sequence Number (LSN). $\longrightarrow$ Unique and monotonically increasing.
    Each data page contains a pageLSN. $\longrightarrow$ The LSN of the most recent log record that updated the page.

    WAL BOOKKEEPING

    Log Sequence Number (LSN). $\longrightarrow$ Unique and monotonically increasing.
    Each data page contains a pageLSN. $\longrightarrow$ The LSN of the most recent log record that updated the page.

    WAL BOOKKEEPING

    Log Sequence Number (LSN). $\longrightarrow$ Unique and monotonically increasing.
    Each data page contains a pageLSN. $\longrightarrow$ The LSN of the most recent log record that updated the page.

    WAL BOOKKEEPING

    Log Sequence Number (LSN). $\rightarrow$ Unique and monotonically increasing.
    Each data page contains a pageLSN. $\rightarrow$ The LSN of the most recent log record that updated the page.
    System keeps track of flushedLSN. $\rightarrow$ The max LSN flushed so far.

    WAL BOOKKEEPING

    Log Sequence Number (LSN). $\longrightarrow$ Unique and monotonically increasing.
    Each data page contains a pageLSN. $\longrightarrow$ The LSN of the most recent log record that updated the page.
    System keeps track of flushedLSN. $\longrightarrow$ The max LSN flushed so far.

    WAL BOOKKEEPING

    Log Sequence Number (LSN). $\rightarrow$ Unique and monotonically increasing.
    Each data page contains a pageLSN. $\rightarrow$ The LSN of the most recent log record that updated the page.
    System keeps track of flushedLSN. $\rightarrow$ The max LSN flushed so far.
    WAL: Before a $\mathrm{page}_{\mathrm{x}}$ is written, pageLSNx ≤ flushedLSN

    LOG SEQUENCE NUMBERS

    Table (html):
    NameLocationDefinition
    fflushedLSNMemoryLast LSN in log on disk
    pageLSNpagexNewest update to pagex
    recLSNDPT†Oldest update to pagex since it was last flushed
    lastLSNATT*Latest record of txn T1
    MasterRecordDiskLSN of latest checkpoint
    † DPT = Dirty Page Table.
    • ATT = Active Transaction Table.

    WRITING LOG RECORDS

    WRITING LOG RECORDS

    WRITING LOG RECORDS

    WRITING LOG RECORDS

    WRITING LOG RECORDS

    WRITING LOG RECORDS

    WRITING LOG RECORDS

    WRITING LOG RECORDS

    WRITING LOG RECORDS

    WRITING LOG RECORDS

    All log records have an LSN.
    Update the pageLSN every time a txn modifies a record in the page.
    Update the flushedLSN in memory every time the DBMS writes the WAL buffer to disk.

    NORMAL EXECUTION

    Each txn invokes a sequence of reads and writes, followed by commit or rollback.
    Assumptions in this lecture:
    $\rightarrow$ All log records fit within a single page. $\rightarrow$ Disk writes are atomic. $\rightarrow$ Single- versioned tuples with Strong Strict 2PL. $\rightarrow$ STEAL + NO- FORCE buffer management with WAL.

    TRANSACTION COMMIT

    When a txn commits, the DBMS writes a COMMIT record to log and guarantees that all log records up to txn's COMMIT record are flushed to disk.
    $\rightarrow$ Log flushes are sequential, synchronous writes to disk. $\rightarrow$ Many log records per log page.
    When the commit succeeds, write a TXN- END record to log. $\rightarrow$ Indicates that no new log record for a txn will appear in the log ever again. $\rightarrow$ This does not need to be flushed immediately.

    TRANSACTION COMMIT

    TRANSACTION COMMIT

    TRANSACTION COMMIT

    TRANSACTION COMMIT

    TRANSACTION COMMIT

    TRANSACTION COMMIT

    TRANSACTION COMMIT

    TRANSACTION COMMIT

    TRANSACTION ABORT

    Aborting a txn is a special case of the ARIES undo operation applied to only one txn.
    We need to add another field to our log records: $\rightarrow$ prevLSN: The previous LSN for the txn. $\rightarrow$ This maintains a linked- list for each txn that makes it easy to walk through its records.

    TRANSACTION ABORT

    TRANSACTION ABORT

    TRANSACTION ABORT

    TRANSACTION ABORT

    TRANSACTION ABORT

    COMPENSATION LOG RECORDS

    A CLR describes the actions taken to undo the actions of a previous update record.
    It has all the fields of an update log record plus the undoNextLSN pointer (i.e., the next- to- be- undone LSN).
    CLRs are added to log records but the DBMS does not wait for them to be flushed before notifying the application that the txn aborted.

    TRANSACTION ABORT - CLR EXAMPLE

    Table (html):
    LSNprevLSNTxnIdTypeObjectBeforeAfterUndoNextLSN
    001nilT1BEGIN----
    002001T1UPDATEA3040-
    :
    011002T1ABORT----

    TRANSACTION ABORT - CLR EXAMPLE

    Table (html):
    LSNprevLSNTxnIdTypeObjectBeforeAfterUndoNextLSN
    001nilT1BEGIN----
    002001T1UPDATEA3040-
    :
    011002T1ABORT----
    :
    026011T1CLR-002A4030001

    TRANSACTION ABORT - CLR EXAMPLE

    TRANSACTION ABORT - CLR EXAMPLE

    Table (html):
    LSNprevLSNTxnIdTypeObjectBeforeAfterUndoNextLSN
    001nilT1BEGIN----
    002001T1UPDATEA3040-
    :
    011002T1ABORT----
    :
    026011T1CLR-002A4030001

    TRANSACTION ABORT - CLR EXAMPLE

    TRANSACTION ABORT - CLR EXAMPLE

    TRANSACTION ABORT - CLR EXAMPLE

    Table (html):
    LSNprevLSNTxnIdTypeObjectBeforeAfterUndoNextLSN
    001nilT1BEGIN----
    002001T1UPDATEA3040-
    :
    011002T1ABORT----
    :
    026011T1CLR-002A4030001
    027026T1TXN-END---nil

    ABORT ALGORITHM

    First write an ABORT record to log for the txn.
    Then analyze the txn's updates in reverse order. For each update record:
    $\rightarrow$ Write a CLR entry to the log. $\rightarrow$ Restore old value.
    Lastly, write a TXN- END record and release locks.
    Notice: CLRs never need to be undone.

    TODAY'S AGENDA

    Log Sequence Numbers Normal Commit & Abort Operations Fuzzy Checkpointing Recovery Algorithm

    NON-FUZZY CHECKPOINTS

    The DBMS halts everything when it takes a checkpoint to ensure a consistent snapshot: $\begin{array}{rl} & \longrightarrow \mathrm{Haltthestartofanynewtxns}.\ & \longrightarrow \mathrm{Waituntilallactivetxnsfinishexecuting}.\ & \longrightarrow \mathrm{Flushesdirtypagesondisk}. \end{array}$
    This is bad for runtime performance but makes recovery easy.

    SLIGHTLY BETTER CHECKPOINTS

    Pause modifying txns while the DBMS takes the checkpoint.
    $\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.

    SLIGHTLY BETTER CHECKPOINTS

    Pause modifying txns while the DBMS takes the checkpoint.
    $\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.

    SLIGHTLY BETTER CHECKPOINTS

    Pause modifying txns while the DBMS takes the checkpoint.
    $\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.

    SLIGHTLY BETTER CHECKPOINTS

    Pause modifying txns while the DBMS takes the checkpoint.
    $\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.

    SLIGHTLY BETTER CHECKPOINTS

    Pause modifying txns while the DBMS takes the checkpoint.
    $\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.

    SLIGHTLY BETTER CHECKPOINTS

    Pause modifying txns while the DBMS takes the checkpoint.
    $\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.

    SLIGHTLY BETTER CHECKPOINTS

    Pause modifying txns while the DBMS takes the checkpoint.
    $\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.

    SLIGHTLY BETTER CHECKPOINTS

    Pause modifying txns while the DBMS takes the checkpoint.
    $\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.

    SLIGHTLY BETTER CHECKPOINTS

    Pause modifying txns while the DBMS takes the checkpoint.
    $\longrightarrow$ Flushes dirty pages on disk. $\longrightarrow$ Prevent queries from acquiring write latch on table/index pages. $\longrightarrow$ Don't have to wait until all txns finish before taking the checkpoint.
    We must record internal state as of the beginning of the checkpoint.
    $\longrightarrow$ Active Transaction Table (ATT) $\longrightarrow$ Dirty Page Table (DPT)

    ACTIVE TRANSACTION TABLE (ATT)

    One entry per currently active txn. $\rightarrow$ txnId: Unique txn identifier. $\rightarrow$ status: The current "mode" of the txn. $\rightarrow$ lastLSN: Most recent LSN created by txn.
    Remove entry after the TXN- END record.

    Txn Status Codes:

    $\rightarrow$ Running (R) $\rightarrow$ Committing (C) $\rightarrow$ Candidate for Undo (U)

    DIRTY PAGE TABLE (DPT)

    Keep track of which pages in the buffer pool contain changes that have not been flushed to disk.
    One entry per dirty page in the buffer pool: $\rightarrow$ recLSN: The LSN of the log record that first caused the page to be dirty.

    SLIGHTLY BETTER CHECKPOINTS

    At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}_{22})$ .
    001: 002: 003:<T1, A→P11, 100, 120> 004: 005:<T2, C→P22, 100, 120> 006:<T TXN- END 007:<CHECKPOINT ATT={T2} DPT={P22} 008: 009:<T2, A→P11, 120, 130> 010: 011:<T3, B→P33, 200, 400> 012:<CHECKPOINT ATT={T2,T3} DPT={P11,P33} 013:<T3, B→P33, 400, 600>

    SLIGHTLY BETTER CHECKPOINTS

    At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}_{22})$ .
    001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010: 011:<T3, B>P33, 200, 400> 012: <CHECKPOINT ATT={T2, T3}, DPT={P11, P33}> 013:<T3, B>P33, 400, 600>

    SLIGHTLY BETTER CHECKPOINTS

    At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}_{22})$ .
    001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007:<CHECKPOINT ATT={T2}, DPT={P22} 008: 009:<T2, A>P11, 120, 130> 010: 011:<T3, B>P33, 200, 400> 012:<CHECKPOINT ATT={T2, T3}, DPT={P11, P33}> 013:<T3, B>P33, 400, 600>

    SLIGHTLY BETTER CHECKPOINTS

    At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}{22})$ . At the second checkpoint, assuming $\mathsf{P}{22}$ was flushed, $\mathsf{T}{2}$ and $\mathsf{T}{3}$ are active and the dirty pages are $(\mathsf{P}{11}, \mathsf{P}{33})$ .
    001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2,C>P22, 100, 120> 006: 007:<CHECKPOINT ATT={T2} DPT={P22} 008: 009:<T2,A>P11, 120, 130> 010: 011:<T3,B>P33, 200, 400> 012:<CHECKPOINT ATT={T2,T3} DPT={P11,P33} 013:<T3,B>P33, 400, 600>

    SLIGHTLY BETTER CHECKPOINTS

    At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}{22})$ . At the second checkpoint, assuming $\mathsf{P}{22}$ was flushed, $\mathsf{T}{2}$ and $\mathsf{T}{3}$ are active and the dirty pages are $(\mathsf{P}{11}, \mathsf{P}{33})$ .
    001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010: 011:<T3, B>P33, 200, 400> 012: 013:<T3, B>P33, 400, 600>

    SLIGHTLY BETTER CHECKPOINTS

    At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}{22})$ . At the second checkpoint, assuming $\mathsf{P}{22}$ was flushed, $\mathsf{T}{2}$ and $\mathsf{T}{3}$ are active and the dirty pages are $(\mathsf{P}{11}, \mathsf{P}{33})$ .
    001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2,C>P22, 100, 120> 006: 007:<CHECKPOINT ATT={T2} DPT={P22} 008: 009:<T2,A>P11, 120, 130> 010: 011:<T3,B>P33, 200, 400> 012:<CHECKPOINT ATT={T2,T3} DPT={P11,P33} 013:<T3,B>P33, 400, 000>

    SLIGHTLY BETTER CHECKPOINTS

    At the first checkpoint, assuming $\mathsf{P}{11}$ was flushed, $\mathsf{T}{2}$ is still running and there is only one dirty page $(\mathsf{P}_{22})$ .
    At the second checkpoint, assuming $\mathsf{P}{22}$ was flushed, $\mathsf{T}{2}$ and $\mathsf{T}{3}$ are active and the dirty pages are $(\mathsf{P}{11}, \mathsf{P}_{33})$ .
    This still is not ideal because the DBMS must stall txns during checkpoint...
    001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2,C>P22, 100, 120> 006:<T1 TXN- END 007:<CHECKPOINT ATT={T2} DPT={P22} 008: 009:<T2,A>P11, 120, 130> 010: 011:<T3,B>P33, 200, 400> 012:<CHECKPOINT ATT={T2,T3} DPT={P11,P33} 013:<T3,B>P33, 400, 000>

    FUZZY CHECKPOINTS

    A fuzzy checkpoint is where the DBMS allows active txns to continue to run while the system writes the log records for checkpoint. $\longrightarrow$ No attempt to force dirty pages to disk.
    New log records to track checkpoint boundaries: $\longrightarrow$ CHECKPOINT- BEGIN: Indicates start of checkpoint $\longrightarrow$ CHECKPOINT- END: Contains ATT + DPT.

    FUZZY CHECKPOINTS

    Assume the DBMS flushes $\mathsf{P}_{11}$ before the first checkpoint starts.
    Any txn that begins after the checkpoint starts is excluded from the ATT in the CHECKPOINT- END record.
    The LSN of the CHECKPOINT- BEGIN record is written to the MasterRecord when it completes.
    001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2} ,$ DPT={P22}> 011: 012:<T3, B>P33, 200, 400> 013: 014:<T3, B>P33, 10, 12> 015:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2,\mathsf{T}_3} ,$ DPT={P11, P33}

    FUZZY CHECKPOINTS

    Assume the DBMS flushes $\mathsf{P}_{11}$ before the first checkpoint starts.
    Any txn that begins after the checkpoint starts is excluded from the ATT in the CHECKPOINT- END record.
    The LSN of the CHECKPOINT- BEGIN record is written to the MasterRecord when it completes.
    001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2} ,$ DPT={P22}> 011: 012:<T3, B>P33, 200, 400> 013: 014:<T3, B>P33, 10, 12> 015:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2,\mathsf{T}_3} ,$ DPT={P11, P33}

    FUZZY CHECKPOINTS

    Assume the DBMS flushes $\mathsf{P}_{11}$ before the first checkpoint starts.
    Any txn that begins after the checkpoint starts is excluded from the ATT in the CHECKPOINT- END record.
    The LSN of the CHECKPOINT- BEGIN record is written to the MasterRecord when it completes.
    001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2}$ DPT={P22} > 011: 012:<T3, B>P33, 200, 400> 013: 014:<T3, B>P33, 10, 12> 015:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2,\mathsf{T}_3}$ DPT={P11, P33} >

    FUZZY CHECKPOINTS

    Assume the DBMS flushes $\mathsf{P}_{11}$ before the first checkpoint starts.
    Any txn that begins after the checkpoint starts is excluded from the ATT in the CHECKPOINT- END record.
    The LSN of the CHECKPOINT- BEGIN record is written to the MasterRecord when it completes.
    001: 002: 003:<T1, A>P11, 100, 120> 004: 005:<T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010:<CHECKPOINT- END
    ATT={T2}, DPT={P22}>
    011: 012:<T3, B>P33, 200, 400> 013: 014:<T3, B>P33, 10, 12> 015:<CHECKPOINT- END
    ATT={T2, T3}, DPT={P11, P33}>

    FUZZY CHECKPOINTS

    Assume the DBMS flushes $\mathsf{P}_{11}$ before the first checkpoint starts.
    Any txn that begins after the checkpoint starts is excluded from the ATT in the CHECKPOINT- END record.
    The LSN of the CHECKPOINT- BEGIN record is written to the MasterRecord when it completes.
    001: 002: 003:<T1, A>P11, 100, 120> 004: <T2, C>P22, 100, 120> 006: 007: 008: 009:<T2, A>P11, 120, 130> 010:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2} ,$ DPT={P22} 011: 012:<T3, B>P33, 200, 400> 013: 014:<T3, B>P33, 10, 12> 015:<CHECKPOINT- END $\mathsf{ATT} = {\mathsf{T}_2,\mathsf{T}_3} ,$ DPT={P11, P33}

    ARIES - RECOVERY PHASES

    Phase #1 - Analysis

    $\rightarrow$ Examine the WAL in forward direction starting at MasterRecord to identify dirty pages in the buffer pool and active txns at the time of the crash.

    Phase #2 - Redo

    $\rightarrow$ Repeat all actions starting from an appropriate point in the log (even txns that will abort).

    Phase #3 - Undo

    $\rightarrow$ Reverse the actions of txns that did not commit before the crash.

    ARIES - OVERVIEW

    Start from last BEGIN- CHECKPOINT found via MasterRecord.
    Analysis: Figure out which txns committed or failed since checkpoint.
    Redo: Repeat all actions.
    Undo: Reverse effects of failed txns.

    ARIES - OVERVIEW

    Start from last BEGIN- CHECKPOINT found via MasterRecord.
    Analysis: Figure out which txns committed or failed since checkpoint.
    Redo: Repeat all actions.
    Undo: Reverse effects of failed txns.

    ARIES - OVERVIEW

    Start from last BEGIN- CHECKPOINT found via MasterRecord.
    Analysis: Figure out which txns committed or failed since checkpoint.
    Redo: Repeat all actions.
    Undo: Reverse effects of failed txns.

    ARIES - OVERVIEW

    Start from last BEGIN- CHECKPOINT found via MasterRecord.
    Analysis: Figure out which txns committed or failed since checkpoint.
    Redo: Repeat all actions.
    Undo: Reverse effects of failed txns.

    ARIES - OVERVIEW

    Start from last BEGIN- CHECKPOINT found via MasterRecord.
    Analysis: Figure out which txns committed or failed since checkpoint.
    Redo: Repeat all actions.
    Undo: Reverse effects of failed txns.

    ARIES - OVERVIEW

    Start from last BEGIN- CHECKPOINT found via MasterRecord.
    Analysis: Figure out which txns committed or failed since checkpoint.
    Redo: Repeat all actions.
    Undo: Reverse effects of failed txns.

    ARIES - OVERVIEW

    Start from last BEGIN- CHECKPOINT found via MasterRecord.
    Analysis: Figure out which txns committed or failed since checkpoint.
    Redo: Repeat all actions.
    Undo: Reverse effects of failed txns.

    ARIES - OVERVIEW

    Start from last BEGIN- CHECKPOINT found via MasterRecord.
    Analysis: Figure out which txns committed or failed since checkpoint.
    Redo: Repeat all actions.
    Undo: Reverse effects of failed txns.

    ARIES - OVERVIEW

    Start from last BEGIN- CHECKPOINT found via MasterRecord.
    Analysis: Figure out which txns committed or failed since checkpoint.
    Redo: Repeat all actions.
    Undo: Reverse effects of failed txns.

    ARIES - OVERVIEW

    Start from last BEGIN- CHECKPOINT found via MasterRecord.
    Analysis: Figure out which txns committed or failed since checkpoint.
    Redo: Repeat all actions.
    Undo: Reverse effects of failed txns.

    ANALYSIS PHASE

    Scan log forward from last successful checkpoint. If the DBMS finds a TXN- END record, remove its corresponding txn from ATT.
    All other records: $\rightarrow$ If txn not in ATT, add it with status UNDO. $\rightarrow$ On commit, change txn status to COMMIT.
    For update log records: $\rightarrow$ If page P not in DPT, add P to DPT, set its recLSN=LSN.

    ANALYSIS PHASE

    At end of the Analysis Phase: $\rightarrow$ ATT identifies which txns were active at time of crash. $\rightarrow$ DPT identifies which dirty pages might not have made it to disk.

    ANALYSIS PHASE EXAMPLE

    Table (html):
    LSNATTDPT
    010
    020
    030
    040
    050

    ANALYSIS PHASE EXAMPLE

    ANALYSIS PHASE EXAMPLE

    ANALYSIS PHASE EXAMPLE

    ANALYSIS PHASE EXAMPLE

    ANALYSIS PHASE EXAMPLE

    ANALYSIS PHASE EXAMPLE

    ANALYSIS PHASE EXAMPLE

    ANALYSIS PHASE EXAMPLE

    ANALYSIS PHASE EXAMPLE

    REDO PHASE

    The goal is to repeat history to reconstruct the database state at the moment of the crash: $\rightarrow$ Reapply all updates (even aborted txns!) and redo CLRs.
    There are techniques that allow the DBMS to avoid unnecessary reads/writes, but we will ignore that in this lecture...

    REDO PHASE

    Scan forward from the log record containing smallest recLSN in DPT.
    For each update log record or CLR with a given LSN, redo the action unless:
    $\rightarrow$ The affected page is not in DPT, or $\rightarrow$ The affected page is in DPT, but that log record's LSN is less than the page's recLSN. (The update was propagated to disk.) $\rightarrow$ Log record's LSN ≤ pageLSN; DBMS must fetch page from the disk to read page value.

    REDO PHASE

    To redo an action: $\rightarrow$ Reapply logged update. $\rightarrow$ Set pageLSN to log record's LSN. $\rightarrow$ No additional logging, no forced flushes!
    At the end of Redo Phase, write TXN- END log records for all txns with status C and remove them from the ATT.

    UNDO PHASE

    Undo all txns that were active at the time of crash and therefore will never commit.
    $\rightarrow$ These are all the txns with U status in the ATT after the Analysis Phase.
    Process them in reverse LSN order using the lastLSN to speed up traversal.
    $\rightarrow$ At each step, pick the largest lastLSN across all transactions in the ATT. $\rightarrow$ Traverse lastLSNs in the same order, but in reverse, for how the updates happened originally.
    Write a CLR for every modification.

    FULL EXAMPLE

    FULL EXAMPLE

    FULL EXAMPLE

    FULL EXAMPLE

    FULL EXAMPLE

    FULL EXAMPLE

    FULL EXAMPLE

    FULL EXAMPLE

    FULL EXAMPLE

    FULL EXAMPLE

    FULL EXAMPLE

    FULL EXAMPLE

    FULL EXAMPLE

    ADDITIONAL CRASH ISSUES (1)

    What does the DBMS do if it crashes during recovery in the Analysis Phase?

    ADDITIONAL CRASH ISSUES (1)

    What does the DBMS do if it crashes during recovery in the Analysis Phase?
    $\longrightarrow$ Nothing. Just run recovery again.

    ADDITIONAL CRASH ISSUES (1)

    What does the DBMS do if it crashes during recovery in the Analysis Phase?
    $\longrightarrow$ Nothing. Just run recovery again.
    What does the DBMS do if it crashes during recovery in the Redo Phase?

    ADDITIONAL CRASH ISSUES (1)

    What does the DBMS do if it crashes during recovery in the Analysis Phase?
    $\longrightarrow$ Nothing. Just run recovery again.
    What does the DBMS do if it crashes during recovery in the Redo Phase?
    $\longrightarrow$ Again nothing. Redo everything again.

    ADDITIONAL CRASH ISSUES (2)

    How can the DBMS improve performance during recovery in the Redo Phase?
    $\rightarrow$ Assume that it is not going to crash again and flush all changes to disk asynchronously in the background.
    How can the DBMS improve performance during recovery in the Undo Phase?
    $\rightarrow$ Lazily rollback changes before new txns access pages. $\rightarrow$ Rewrite the application to avoid long- running txns.

    CONCLUSION

    Mains ideas of ARIES:
    $\longrightarrow$ WAL with STEAL + NO- FORCE $\longrightarrow$ Fuzzy Checkpoints (snapshot of dirty page ids) $\longrightarrow$ Redo everything since the earliest dirty page $\longrightarrow$ Undo txns that never commit $\longrightarrow$ Write CLRs when undoing, to survive failures during restarts
    Log Sequence Numbers:
    $\longrightarrow$ LSNs identify log records; linked into backwards chains per transaction via prevLSN. $\longrightarrow$ pageLSN allows comparison of data page and log records.

    NEXT CLASS

    You now know how to build a single- node DBMS.
    Let's make it even more challenging and start talking about distributed DBMSs!