Carnegie Mellon University
Database Systems
Database Logging
ADMINISTRIVIA
Project #4 is due Sunday April 20th @ 11
$\longrightarrow$ Recitation: Friday, April 11
th in GHC 4303 from 3
- 4
PM
Final Exam is on Monday, April 28, 2025, from 05
- 08
$\longrightarrow$ Early exam will not be offered. Do not make travel plans.
No OH on April 7th (I have to teach another class)
UPCOMING DATABASE TALKS
0xide (DB Seminar) $\rightarrow$ Monday, April 7 @ 4
$\rightarrow$ OxQL: Oximeter Query Language $\rightarrow$ Speaker: Ben Naecker $\rightarrow$
https://cmu.zoom.us/j/93441451665
Oxide
LAST CLASS
We discussed multi- version concurrency control (MVCC) and how it effects the design of the entire DBMS architecture.
A DBMS's concurrency control protocol gives it Atomicity + Consistency + Isolation.
We now need ensure Atomicity + Durability...
MOTIVATION
Schedule
T1 BEGIN R(A) W(A) COMMIT
Buffer Pool
MOTIVATION
Schedule
MOTIVATION
Schedule
MOTIVATION
Schedule
MOTIVATION
[ImageCaption: Schedule]
MOTIVATION
Schedule
CRASH RECOVERY
Recovery algorithms are techniques to ensure database consistency, transaction atomicity, and durability despite failures.
Recovery algorithms have two parts:
$\rightarrow$ Actions during normal txn processing to ensure that the DBMS can recover from a failure. $\rightarrow$ Actions after a failure to recover the database to a state that ensures atomicity, consistency, and durability.
CRASH RECOVERY
Recovery algorithms are techniques to ensure database consistency, transaction atomicity, and durability despite failures.
Recovery algorithms have two parts:
$\rightarrow$ Actions during normal txn processing to ensure that the DBMS can recover from a failure.
Today
$\rightarrow$ Actions after a failure to recover the database to a state that ensures atomicity, consistency, and durability.
TODAY'S AGENDA
Buffer Pool PoliciesShadow PagingWrite- Ahead LogLogging SchemesCheckpointsDB Flash Talk: Firebolt
OBSERVATION
The database's primary storage location is on non- volatile storage, but this is slower than volatile storage. Use volatile memory for faster access: $\rightarrow$ First copy target record into memory. $\rightarrow$ Perform the writes in memory. $\rightarrow$ Write dirty records back to disk.
The DBMS needs to ensure the following:
$\rightarrow$ The changes for any txn are durable once the DBMS has told somebody that it committed. $\rightarrow$ No partial changes are durable if the txn aborted.
UNDO VS. REDO
Undo: The process of removing the effects of an incomplete or aborted txn. Redo: The process of re- applying the effects of a committed txn for durability.
How the DBMS supports this functionality depends on how it manages the buffer pool ...
BUFFER POOL
Schedule
BUFFER POOL
Schedule
BUFFER POOL
Schedule
BUFFER POOL
Schedule
BUFFER POOL
Schedule
BUFFER POOL
Schedule
BUFFER POOL
Schedule
BUFFER POOL
Schedule
BUFFER POOL
Schedule
BUFFER POOL
Schedule
BUFFER POOL
Schedule
STEAL POLICY
Whether the DBMS can evict a dirty object in the buffer pool modified by an uncommitted txn and overwrite the most recent committed version of that object in non- volatile storage.
STEAL: Eviction + overwriting is allowed. NO- STEAL: Eviction + overwriting is not allowed.
FORCE POLICY
Whether the DBMS requires that all updates made by a txn are written back to non- volatile storage before the txn can commit.
FORCE: Write- back is required. NO- FORCE: Write- back is not required.
NO-STEAL + FORCE
Schedule
NO-STEAL + FORCE
Schedule
NO-STEAL + FORCE
Schedule
NO-STEAL + FORCE
[ImageCaption: Schedule]
NO-STEAL + FORCE
Schedule
NO-STEAL + FORCE
Schedule
NO-STEAL + FORCE
Schedule
NO-STEAL + FORCE
Schedule
NO-STEAL + FORCE
Schedule
NO-STEAL + FORCE
Schedule
NO-STEAL + FORCE
Schedule
NO-STEAL + FORCE
This approach is the easiest to implement:
$\rightarrow$ Never have to undo changes of an aborted txn because the changes were not written to disk. $\rightarrow$ Never have to redo changes of a committed txn because all the changes are guaranteed to be written to disk at commit time (assuming atomic hardware writes).
Previous example cannot support write sets that exceed the amount of physical memory available.
SHADOW PAGING
Instead of copying the entire database, the DBMS copies pages on write to create two versions: $\rightarrow$ Master: Contains only changes from committed txns. $\rightarrow$ Shadow: Temporary database with changes made from uncommitted txns.
To install updates when a txn commits, overwrite the root so it points to the shadow, thereby swapping the master and shadow.
Buffer Pool Policy: NO- STEAL + FORCE
SHADOW PAGING - EXAMPLE
SHADOW PAGING - EXAMPLE
SHADOW PAGING - EXAMPLE
SHADOW PAGING - EXAMPLE
SHADOW PAGING - EXAMPLE
SHADOW PAGING - EXAMPLE
SHADOW PAGING - EXAMPLE
SHADOW PAGING - EXAMPLE
SHADOW PAGING - EXAMPLE
SHADOW PAGING - EXAMPLE
SHADOW PAGING - EXAMPLE
SHADOW PAGING - EXAMPLE
SHADOW PAGING - EXAMPLE
SHADOW PAGING - EXAMPLE
SHADOW PAGING - UNDO/REDO
Supporting rollbacks and recovery is easy with shadow paging.
Undo: Remove the shadow pages. Leave the master and the DB root pointer alone.
Redo: Not needed at all.
SHADOW PAGING – DISADVANTAGES
Copying the entire page table is expensive:
$\rightarrow$ Use a page table structured like a B+tree (LMDB). $\rightarrow$ No need to copy entire tree, only need to copy paths in the tree that lead to updated leaf nodes.
Commit overhead is high:
$\rightarrow$ Flush every updated page, page table, and root. $\rightarrow$ Data gets fragmented (bad for sequential scans). $\rightarrow$ Need garbage collection. $\rightarrow$ Only supports one writer txn at a time or txns in a batch.
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
Memory
Disk
SQLite (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
Memory
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
SQLITE (PRE-2010)
When a txn modifies a page, the DBMS copies the original page to a separate journal file before overwriting master version. $\rightarrow$ Called rollback mode.
After restarting, if a journal file exists, then the DBMS restores it to undo changes from uncommitted txns.
OBSERVATION
Shadowing page requires the DBMS to perform writes to random non- contiguous pages on disk.
We need a way for the DBMS convert random writes into sequential writes...
WRITE-AHEAD LOG (WAL)
Maintain a log file separate from data files that contains the changes that txns make to database. $\longrightarrow$ Assume that the log is on stable storage. $\longrightarrow$ Log contains enough information to perform the necessary undo and redo actions to restore the database.
DBMS must write to disk the log file records that correspond to changes made to a database object before it can flush that object to disk.
Buffer Pool Policy: STEAL + NO- FORCE
WRITE-AHEAD LOG (WAL)
Maintain a log file separate from data files that contains the changes that txns make to database. $\longrightarrow$ Assume that the log is on stable storage. $\longrightarrow$ Log contains enough information to perform the necessary undo and redo actions to restore the database.
DBMS must write to disk the log file records that correspond to changes made to a database object before it can flush that object to disk.
Buffer Pool Policy: STEAL + NO- FORCE
BUFFER POOL + WAL
BUFFER POOL + WAL
Force (on every update, flush the updated page to disk) Poor response time, but enforces durability of committed txns.
BUFFER POOL + WAL
Force (on every update, flush the updated page to disk) Poor response time, but enforces durability of committed txns.
BUFFER POOL + WAL
Low throughput, but works for aborted txns.
No-Force
Concern: Crash before a page is flushed to disk. Durability? Solution: Force a summary/log @ commit. Use to REDO.
Force (on every update, flush the updated page to disk) Poor response time, but enforces durability of committed txns.
BUFFER POOL + WAL
No- Steal Low throughput, but works for aborted txns.
Steal (flush an unpinned dirty page even if the updating txn is active) Concern: A stolen+flushed page was modified by an uncommitted txn. T. If T aborts, how is atomicity enforced? Solution: Remember old value (logs). Use to UNDO.
WAL PROTOCOL
The DBMS stages all a txn's log records in volatile storage (usually backed by buffer pool).
All log records pertaining to an updated page are written to non- volatile storage before the page itself is over- written in non- volatile storage.
A txn is not considered committed until all its log records have been written to stable storage.
WAL PROTOCOL
Write a record to the log for each txn to mark its starting point.
Append a record every time a txn changes an object: $\begin{array}{rl} & \rightarrow \mathrm{TransectionId}\ & \rightarrow \mathrm{ObjecId}\ & \rightarrow \mathrm{BeforeValue(UNDO)}\ & \rightarrow \mathrm{AferValue(REDO)} \end{array}$
WAL PROTOCOL
Write a record to the log for each txn to mark its starting point.
Append a record every time a txn changes an object: $\begin{array}{rl} & \rightarrow \mathrm{TransectionId}\ & \rightarrow \mathrm{ObjecId}\ & \rightarrow \mathrm{BecoreValue(UNDO)}\ & \rightarrow \mathrm{AforeValue(REDO)} \end{array}$ Not necessary if using append- only MVCC
WAL PROTOCOL
Write a record to the log for each txn to mark its starting point.
Append a record every time a txn changes an object: $\begin{array}{rl} & \rightarrow \mathrm{Tr}\mathrm{asractionId}\ & \rightarrow \mathrm{ObjctId}\ & \rightarrow \mathrm{BeforeValue~(UNDO)}\ & \rightarrow \mathrm{AferValue(REDO)} \end{array}$ Not necessary if using append- only MVCC
When a txn finishes, the DBMS appends a record to the log. $\rightarrow$ Make sure that all log records are flushed before it returns an acknowledgement to application.
WAL - EXAMPLE
Schedule
WAL - EXAMPLE
Schedule
WAL - EXAMPLE
Schedule
WAL - EXAMPLE
Schedule
WAL - EXAMPLE
Schedule
WAL - EXAMPLE
Schedule
WAL - EXAMPLE
[ImageCaption: Schedule]
WAL - EXAMPLE
[ImageCaption: Schedule]
WAL - EXAMPLE
Schedule
WAL - EXAMPLE
Schedule
WAL - EXAMPLE
Everything we need to restore $T_{1}$ is in the log!
WAL - IMPLEMENTATION
Flushing the log buffer to disk every time a txn commits will become a bottleneck.
The DBMS can use the group commit optimization to batch multiple log flushes together to amortize overhead. $\rightarrow$ When the buffer is full, flush it to disk. $\rightarrow$ Or if there is a timeout (e.g., 5 ms).
WAL - GROUP COMMIT
Schedule
WAL - GROUP COMMIT
Schedule
WAL - GROUP COMMIT
Schedule
WAL - GROUP COMMIT
Schedule
WAL - GROUP COMMIT
Schedule
WAL - GROUP COMMIT
Schedule
WAL - GROUP COMMIT
Schedule
WAL - GROUP COMMIT
WAL - GROUP COMMIT
WAL - GROUP COMMIT
Schedule
WAL - GROUP COMMIT
Schedule
WAL - GROUP COMMIT
Schedule
WAL - GROUP COMMIT
Schedule
WAL - GROUP COMMIT
Schedule
WAL - GROUP COMMIT
Schedule
BUFFER POOL POLICIES
Almost every DBMS uses NO-FORCE + STEAL
Runtime Performance
Recovery Performance
Table (html):
| NO-STEAL | STEAL | |
| NO-FORCE | - | Fastest | |
| FORCE | Slowest | - | |
Table (html):
| NO-FORCE | NO-STEAL | STEAL |
| - | Slowest |
| FORCE | Fastest | - |
BUFFER POOL POLICIES
Almost every DBMS uses NO-FORCE + STEAL
LOGGING SCHEMES
Physical Logging
$\longrightarrow$ Record the byte- level changes made to a specific page. $\longrightarrow$ Example: git diff
Logical Logging
$\longrightarrow$ Record the high- level operations executed by txns. $\longrightarrow$ Example: UPDATE, DELETE, and INSERT queries.
Physiological Logging
$\longrightarrow$ Physical- to- a- page, logical- within- a- page. $\longrightarrow$ Hybrid approach with byte- level changes for a single tuple identified by page id + slot number. $\longrightarrow$ Does not specify organization of the page.
LOGGING SCHEMES
UPDATE foo SET val = XYZ WHERE id = 1;
LOGGING SCHEMES
UPDATE foo SET val = XYZ WHERE id = 1;
Physical
<T1, Table=X, Page=99, Offset=1024, Before=ABC, After=XYZ> <T1, Index=X_PKEY, Page=45, Offset=9, Key=(1,Record1)>
LOGGING SCHEMES
UPDATE foo SET val = XYZ WHERE id = 1;
Physical
LOGGING SCHEMES
UPDATE foo SET val = XYZ WHERE id = 1;
Physical
Logical
LOGGING SCHEMES
UPDATE foo SET val = XYZ WHERE id = 1;
Physical
Logical
Physiological
LOGGING SCHEMES
UPDATE foo SET val = XYZ WHERE id = 1;
Physical
Logical
Physiological
PHYSICAL VS. LOGICAL LOGGING
Logical logging requires less data written in each log record than physical logging.
Difficult to implement recovery with logical logging if you have concurrent txns running at lower isolation levels.
$\rightarrow$ Hard to determine which parts of the database may have been modified by a query before crash. $\rightarrow$ Recovery takes longer because DBMS re- executes every query in the log again.
CHANGE DATA CAPTURE (CDC)
Automatically propagate changes to external sources to replicate and synchronize database contents. $\longrightarrow$ Extract Transform Load (ETL) $\longrightarrow$ Some systems can do this automatically. Others require third- party tools.
Approach #1: WAL Approach #2: Triggers Approach #3: Timestamps
CHANGE DATA CAPTURE (CDC)
Automatically propagate changes to external sources to replicate and synchronize database contents. $\longrightarrow$ Extract Transform Load (ETL) $\longrightarrow$ Some systems can do this automatically. Others require third- party tools.
Approach #1: WAL
Approach #2: Triggers Approach #3: Timestamps
CHANGE DATA CAPTURE (CDC)
Automatically propagate changes to external sources to replicate and synchronize database contents.
Extract Transform Load (ETL) $\rightarrow$ Some systems can do this automatically. Others require third- party tools.
Approach #1: WAL Approach #2: Triggers Approach #3: Timestamps
CHANGE DATA CAPTURE (CDC)
Automatically propagate changes to external sources to replicate and synchronize database contents. $\longrightarrow$ Extract Transform Load (ETL) $\longrightarrow$ Some systems can do this automatically. Others require third- party tools.
Approach #1: WAL
Approach #2: Triggers Approach #3: Timestamps
CHANGE DATA CAPTURE (CDC)
Automatically propagate changes to external sources to replicate and synchronize database contents. $\longrightarrow$ Extract Transform Load (ETL) $\longrightarrow$ Some systems can do this automatically. Others require third- party tools.
Approach #1: WAL
Approach #2: Triggers
Approach #3: Timestamps
CHANGE DATA CAPTURE (CDC)
Automatically propagate changes to external sources to replicate and synchronize database contents. $\longrightarrow$ Extract Transform Load (ETL) $\longrightarrow$ Some systems can do this automatically. Others require third- party tools.
Approach #1: WAL Approach #2: Triggers Approach #3: Timestamps
CHANGE DATA CAPTURE (CDC)
Automatically propagate changes to external sources to replicate and synchronize database contents. $\longrightarrow$ Extract Transform Load (ETL) $\longrightarrow$ Some systems can do this automatically. Others require third- party tools.
Approach #1: WAL
Approach #2: Triggers Approach #3: Timestamps
OBSERVATION
The DBMS's WAL will grow forever. After a crash, the DBMS must replay the entire log, which will take a long time.
OBSERVATION
The DBMS's WAL will grow forever.
After a crash, the DBMS must replay the entire log, which will take a long time.
The DBMS periodically takes a checkpoint where it flushes all buffers out to disk. $\rightarrow$ This provides a hint on how far back it needs to replay the WAL after a crash. $\rightarrow$ Truncate the WAL up to a certain safe point in time.
CHECKPOINTS
Blocking / Consistent Checkpoint Protocol:
$\longrightarrow$ Pause all queries. $\longrightarrow$ Flush all WAL records in memory to disk. $\longrightarrow$ Flush all modified pages in the buffer pool to disk. $\longrightarrow$ Write a entry to WAL and flush to disk. $\longrightarrow$ Resume queries.
CHECKPOINTS
WAL
<T1, A, 1, 2> <T2, A, 2, 3> <T2, B, 4, 5> <T3, A, 3, 4> $\vdots$
CHECKPOINTS
WAL
<T1, A, 1, 2> <T2, A, 2, 3> <T2, B, 4, 5> <T3, A, 3, 4>
CHECKPOINTS
Use the record as the starting point for analyzing the WAL.
WAL
<T1, A, 1, 2> <T2, A, 2, 3> <T2, B, 4, 5> <T3, A, 3, 4>
CHECKPOINTS
Use the record as the starting point for analyzing the WAL. Any txn that committed before the checkpoint is ignored (T₁).
WAL
<T1, A, 1, 2> <T2, A, 2, 3> <T2, B, 4, 5> <T3, A, 3, 4>
CHECKPOINTS
Use the record as the starting point for analyzing the WAL. Any txn that committed before the checkpoint is ignored (T₁).
WAL
CHECKPOINTS
Use the record as the starting point for analyzing the WAL. Any txn that committed before the checkpoint is ignored (T₁). T₂ + T₃ did not commit before the last checkpoint.
CHECKPOINTS
Use the record as the starting point for analyzing the WAL. Any txn that committed before the checkpoint is ignored (T₁). $\mathbf{T}_2 + \mathbf{T}_3$ did not commit before the last checkpoint. $\rightarrow$ Need to redo $\mathbf{T}_2$ because it committed after checkpoint.
CHECKPOINTS
Use the record as the starting point for analyzing the WAL. Any txn that committed before the checkpoint is ignored (T1). $\mathbf{T}_2 + \mathbf{T}_3$ did not commit before the last checkpoint. $\rightarrow$ Need to redo $\mathbf{T}_2$ because it committed after checkpoint. $\rightarrow$ Need to undo $\mathbf{T}_3$ because it did not commit before the crash.
CHECKPOINTS - CHALLENGES
In this example, the DBMS must stall txns when it takes a checkpoint to ensure a consistent snapshot. $\rightarrow$ We will see how to get around this problem next class.
Scanning the log to find uncommitted txns can take a long time. $\rightarrow$ Unavoidable but we will add hints to the record to speed things up next class.
How often the DBMS should take checkpoints depends on many different factors...
CHECKPOINTS - FREQUENCY
Checkpointing too often causes the runtime performance to degrade. $\rightarrow$ System spends too much time flushing buffers.
But waiting a long time is just as bad: $\rightarrow$ The checkpoint will be large and slow. $\rightarrow$ Makes recovery time much longer.
Tunable option that depends on application recovery time requirements.
CONCLUSION
Write- Ahead Logging is (almost) always the best approach to handle loss of volatile storage. Use incremental updates (STEAL + NO- FORCE) with checkpoints.
On Recovery: undo uncommitted txns + redo committed txns.
NEXT CLASS
Better Checkpoint Protocols. Recovery with ARIES.