Carnegie Mellon University
Database Systems
Distributed Databases
ADMINISTRIVIA
Project #4 is due Sunday April 20th @ 11
$\longrightarrow$ Recitation: Friday, April 11th in GHC 4303 from 3-4 PM
HW6 is due Sunday, April 20, 2025 @ 11
Final Exam is on Monday, April 28, 2025, from 05-08
$\longrightarrow$ Early exam will not be offered. Do not make travel plans.
UPCOMING DATABASE TALKS
MariaDB (DB Seminar)
$\rightarrow$ Monday, April 14 @ 4
$\rightarrow$ MariaDB's New Query Optimizer
$\rightarrow$ Speaker: Michael Widenius
$\rightarrow$ https://cmu.zoom.us/j/93441451665
Gel (DB Seminar)
$\rightarrow$ Monday, April 21 @ 4
$\rightarrow$ EdgeQL with Gel
$\rightarrow$ Speaker: Michael Sullivan
$\rightarrow$ https://cmu.zoom.us/j/93441451665
COURSE STATUS
Databases are hard.
Distributed databases are harder.
[Figure: single-node DBMS components — Query Planning, Concurrency Control, Operator Execution, Access Methods, Recovery, Buffer Pool Manager, Disk Manager]
PARALLEL VS. DISTRIBUTED
Parallel DBMSs:
$\rightarrow$ Nodes are physically close to each other.
$\rightarrow$ Nodes connected with high-speed LAN.
$\rightarrow$ Communication cost is assumed to be small.
Distributed DBMSs:
$\rightarrow$ Nodes can be far from each other.
$\rightarrow$ Nodes connected using public network.
$\rightarrow$ Communication cost and problems cannot be ignored.
DISTRIBUTED DBMSs
Use the building blocks that we covered in single-node DBMSs to now support transaction processing and query execution in distributed environments.
$\longrightarrow$ Optimization & Planning
$\longrightarrow$ Concurrency Control
$\longrightarrow$ Logging & Recovery
TODAY'S AGENDA
System Architectures
Design Issues
Partitioning Schemes
Distributed Concurrency Control
DB Flash Talk: dbt Labs
SYSTEM ARCHITECTURE
A distributed DBMS's system architecture specifies what shared resources are directly accessible to CPUs.
This affects how CPUs coordinate with each other and where they retrieve/store objects in the database.
SHARED NOTHING
Each DBMS node has its own CPU, memory, and local disk.
Nodes only communicate with each other via network.
$\longrightarrow$ Better performance & efficiency.
$\longrightarrow$ Harder to scale capacity.
$\longrightarrow$ Harder to ensure consistency.
SHARED NOTHING
Each DBMS node has its own CPU, memory, and local disk.
Nodes only communicate with each other via network.
$\longrightarrow$ Better performance & efficiency. $\longrightarrow$ Harder to scale capacity. $\longrightarrow$ Harder to ensure consistency.
SHARED NOTHING EXAMPLE
SHARED DISK
Nodes access a single logical disk via an interconnect, but each have their own private memories.
$\longrightarrow$ Scale execution layer independently from the storage layer.
$\longrightarrow$ Nodes can still use direct attached storage as a slower/larger cache.
$\longrightarrow$ This architecture facilitates data lakes and serverless systems.
SHARED DISK EXAMPLE
[Figure sequence: Application Server, Node, shared storage layer]
SHARED MEMORY
Nodes access a common memory address space via a fast interconnect.
$\longrightarrow$ Each node has a global view of all the in-memory data structures.
$\longrightarrow$ Can still use local memory / disk for intermediate results.
This looks a lot like shared-everything. Nobody does this.
DESIGN ISSUES
How does the application find data?
Where does the application send queries?
How to execute queries on distributed data?
$\rightarrow$ Push query to data. $\rightarrow$ Pull data to query.
How do we divide the database across resources?
How does the DBMS ensure correctness? (Next Class)
DATA TRANSPARENCY
Applications should not be required to know where data is physically located in a distributed DBMS.
$\longrightarrow$ Any query that runs on a single-node DBMS should produce the same result on a distributed DBMS.
In practice, developers need to be aware of the communication costs of queries to avoid excessively "expensive" data movement.
Table (html):
| Chip → Chip (UPI) | Intra-rack | Planetary |
| ~100 ns | ~100 μs | ~100 ms |
DATABASE PARTITIONING
Split database across multiple resources:
$\rightarrow$ Disks, nodes, processors.
$\rightarrow$ Called "sharding" in NoSQL systems.
The DBMS executes query fragments on each partition and then combines the results to produce a single answer.
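As a minimal sketch of this fragment-and-combine pattern (in Python, with a hypothetical partition handle and caller-supplied combine function; not any particular DBMS's API):

from concurrent.futures import ThreadPoolExecutor

def run_fragment(partition, fragment_sql):
    # Hypothetical: pretend each partition handle can execute a query
    # fragment and return its partial result rows.
    return partition.execute(fragment_sql)

def scatter_gather(partitions, fragment_sql, combine):
    # Send the same fragment to every partition in parallel, then let
    # the caller-supplied combine() merge the partial results.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda p: run_fragment(p, fragment_sql),
                                 partitions))
    return combine(partials)

# Example: a distributed COUNT(*) is the sum of the per-partition counts.
# total = scatter_gather(partitions, "SELECT COUNT(*) FROM foo",
#                        lambda parts: sum(rows[0][0] for rows in parts))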
NAIVE TABLE PARTITIONING
Assign an entire table to a single node.
Assumes that each node has enough storage space for an entire table.
Ideal if queries never join data across tables stored on different nodes and access patterns are uniform.
Ideal Query:
SELECT * FROM table1
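As a toy illustration of the routing this implies (node names hypothetical), naive table partitioning is just a static table-to-node lookup:

# Naive table partitioning: every table lives entirely on one node.
TABLE_TO_NODE = {
    "table1": "node-1",
    "table2": "node-2",
    "table3": "node-3",
}

def route_query(table_name):
    # Return the node that stores the whole table.
    return TABLE_TO_NODE[table_name]

# route_query("table1") -> "node-1": the ideal query above touches only
# that node. A join between table1 and table2 would have to move data
# between node-1 and node-2.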
VERTICAL PARTITIONING
Split a table's attributes into separate partitions.
Must store tuple information to reconstruct the original record.
CREATE TABLE foo (
  attr1 INT,
  attr2 INT,
  attr3 INT,
  attr4 TEXT
);
Table (html):
| Tuple#1 | attr1 | attr2 | attr3 | attr4 |
| Tuple#2 | attr1 | attr2 | attr3 | attr4 |
| Tuple#3 | attr1 | attr2 | attr3 | attr4 |
| Tuple#4 | attr1 | attr2 | attr3 | attr4 |
Table (html):
| Tuple#1 | attr1 | attr2 | attr3 |
| Tuple#2 | attr1 | attr2 | attr3 |
| Tuple#3 | attr1 | attr2 | attr3 |
| Tuple#4 | attr1 | attr2 | attr3 |
[TableCaption: Partition #1]
Table (html):
| Tuple#1 | attr4 |
| Tuple#2 | attr4 |
| Tuple#3 | attr4 |
| Tuple#4 | attr4 |
[TableCaption: Partition #2]
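A minimal Python sketch of the split shown above (record layout is illustrative): both partitions are keyed by the tuple id so the original record can be reconstructed.

# Vertical partitioning sketch: attr1-attr3 go to partition #1,
# attr4 goes to partition #2, both keyed by tuple id.
partition1 = {}   # tuple_id -> (attr1, attr2, attr3)
partition2 = {}   # tuple_id -> attr4

def insert(tuple_id, attr1, attr2, attr3, attr4):
    partition1[tuple_id] = (attr1, attr2, attr3)
    partition2[tuple_id] = attr4

def reconstruct(tuple_id):
    # Rebuild the original record by looking up both partitions.
    a1, a2, a3 = partition1[tuple_id]
    return (a1, a2, a3, partition2[tuple_id])

insert(1, 10, 20, 30, "xxx")
assert reconstruct(1) == (10, 20, 30, "xxx")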
HORIZONTAL PARTITIONING
Split a table's tuples into disjoint subsets based on some partitioning key and scheme.
$\longrightarrow$ Choose column(s) that divide the database equally in terms of size, load, or usage.
Partitioning Schemes:
$\longrightarrow$ Hashing
$\longrightarrow$ Ranges
$\longrightarrow$ Predicates
HORIZONTAL PARTITIONING
Table
Table (html):
| 101 | a | XXX | 2022-11-29 |
| 102 | b | XXY | 2022-11-28 |
| 103 | c | XYZ | 2022-11-29 |
| 104 | d | XYX | 2022-11-27 |
| 105 | e | XYY | 2022-11-29 |
Ideal Query:
SELECT * FROM table WHERE partitionKey = ?
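A minimal sketch of the two most common schemes, hash and range partitioning on the partitioning key (the hash function, partition count, and range boundaries are illustrative, not what any specific DBMS uses):

import hashlib

NUM_PARTITIONS = 4

def hash_partition(key):
    # Hash partitioning: hash the key, take it modulo the partition count.
    digest = hashlib.sha256(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

RANGE_BOUNDS = [200, 400, 600]   # partition 0: key < 200, partition 1: key < 400, ...

def range_partition(key):
    # Range partitioning: each partition owns a contiguous key range.
    for i, upper in enumerate(RANGE_BOUNDS):
        if key < upper:
            return i
    return len(RANGE_BOUNDS)

# The ideal query above (WHERE partitionKey = ?) touches exactly one
# partition under either scheme; range scans favor range partitioning.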
SHARED-DISK PARTITIONING
SHARED-NOTHING PARTITIONING
CONSISTENT HASHING
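The consistent hashing slides are figures only; as a minimal sketch of the usual formulation (hash both node names and keys onto a ring, and assign each key to the first node clockwise from its position; node names illustrative):

import bisect
import hashlib

def ring_hash(value):
    # Map a node name or a key onto a position on the hash ring.
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        # Sorted (position, node) pairs on the ring.
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def lookup(self, key):
        # First node clockwise from the key's position (wrap around).
        i = bisect.bisect_right(self.positions, ring_hash(key))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHashRing(["node-1", "node-2", "node-3"])
owner = ring.lookup("key-42")
# Adding or removing a node only remaps the keys between that node and
# its predecessor on the ring, instead of reshuffling everything.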
SINGLE-NODE VS. DISTRIBUTED
A single-node txn only accesses data that is contained on one partition.
$\rightarrow$ The DBMS may not need to check the behavior of concurrent txns running on other nodes.
A distributed txn accesses data at one or more partitions.
$\rightarrow$ Requires expensive coordination.
TRANSACTION COORDINATION
If our DBMS supports multi-operation and distributed txns, we need a way to coordinate their execution in the system.
Two different approaches:
$\rightarrow$ Centralized: Global "traffic cop".
$\rightarrow$ Decentralized: Nodes organize themselves.
Most distributed DBMSs use a hybrid approach where they periodically elect some node to be a temporary coordinator.
CENTRALIZED COORDINATOR
[Figure sequence: Coordinator, Middleware, Partitions]
DECENTRALIZED COORDINATOR
[Figure sequence: Partitions]
OBSERVATION
We have assumed that the nodes in our distributed systems are running the same DBMS software. But organizations often run many different DBMSs in their applications.
It would be nice if we could have a single interface for all our data.
FEDERATED DATABASES
Distributed architecture that connects disparate DBMSs into a single logical system.
$\rightarrow$ Expose a single query interface that can access data at any location.
This is hard and nobody does it well:
$\rightarrow$ Different data models, query languages, limitations.
$\rightarrow$ No easy way to optimize queries.
$\rightarrow$ Lots of data copying (bad).
FEDERATED DATABASE EXAMPLE
[Figure sequence: Middleware]
DISTRIBUTED CONCURRENCY CONTROL
Need to allow multiple txns to execute simultaneously across multiple nodes.
$\rightarrow$ Many of the same protocols from single-node DBMSs can be adapted.
This is harder because of:
$\rightarrow$ Replication.
$\rightarrow$ Network Communication Overhead.
$\rightarrow$ Node Failures (Permanent + Ephemeral).
$\rightarrow$ Clock Skew.
DISTRIBUTED 2PL
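The DISTRIBUTED 2PL slides are figures; as a rough sketch under simplifying assumptions (per-partition lock managers, no deadlock detection, atomic commit handled elsewhere), a txn holds every lock it acquires at any partition until it has finished all of its work:

from collections import defaultdict
from threading import Lock

class PartitionLockManager:
    # Toy lock manager: one exclusive lock per key at this partition.
    def __init__(self):
        self.locks = defaultdict(Lock)

    def acquire(self, key):
        self.locks[key].acquire()

    def release(self, key):
        self.locks[key].release()

def run_distributed_txn(ops, lock_managers, do_work):
    # ops: list of (partition_id, key) pairs the txn reads or writes.
    held = []
    for pid, key in ops:
        lock_managers[pid].acquire(key)   # growing phase, at each partition
        held.append((pid, key))
    do_work()                             # execute the reads/writes
    for pid, key in held:
        lock_managers[pid].release(key)   # shrinking phase

# Deadlocks can now span partitions (detection or timeouts needed), and
# making the commit atomic requires a protocol such as two-phase commit.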
CONCLUSION
We have barely scratched the surface on distributed database systems...
It is hard to get this right.
NEXT CLASS
Distributed OLTP Systems
Replication
CAP Theorem
Real-World Examples