Carnegie Mellon University
Database Systems
Final Review & Systems Potpourri
ADMINISTRIVIA
Final Exam is on Monday, April 28, 2025, from 5:30pm-8:30pm.
$\longrightarrow$ Early exam will not be offered. Do not make travel plans.
$\longrightarrow$ Material: Lecture 12 - Lecture 24.
$\longrightarrow$ You can use the full 3 hours, though the exam is meant to be done in $\sim 2$ hours.
Last day to submit P4 (with late days and penalty) is April 30 @ 11pm.
Course Evals: Would like your feedback.
OFFICE HOURS
Jignesh:
$\rightarrow$ Thursday, April 24th @ noon-2pm (GHC 9103)
All other TAs will have their office hours up to and including Saturday April 26th
FINAL EXAM
Where: Scaife Hall 105 and Scaife Hall 234.
When: Monday, April 28, 2025, 5:30pm-8:30pm.
Come to Scaife Hall 105 first. Then, look at your seating assignment, which may assign you to Scaife Hall 234.
FINAL EXAM
What to bring:
$\rightarrow$ CMU ID
$\rightarrow$ Pencil + Eraser (!!!)
$\rightarrow$ Calculator (cellphone is okay)
$\rightarrow$ One 8.5x11" page of handwritten notes (double-sided)
STUFF BEFORE MID-TERM
SQL
Buffer Pool Management
Data Structures (Hash Tables, B+Trees)
Storage Models
Query Processing Models
Inter-Query Parallelism
Basic Understanding of BusTub Internals
JOIN ALGORITHMS
Join Algorithms
$\longrightarrow$ Naive Nested Loops
$\longrightarrow$ Block Nested Loops
$\longrightarrow$ Index Nested Loops
$\longrightarrow$ Sort-Merge
$\longrightarrow$ Hash Join: Simple, Partitioned, Hybrid Hash
$\longrightarrow$ Optimization using Bloom Filters
$\longrightarrow$ Cost functions
QUERY EXECUTION
Execution Models
$\rightarrow$ Iterator
$\rightarrow$ Materialized
$\rightarrow$ Vector / Batch
Plan Processing: Push vs. Pull
Access Methods
$\rightarrow$ Sequential Scan and various optimizations
$\rightarrow$ Index Scan, including multi-index scan
$\rightarrow$ Issues with update queries
Expression Evaluation
QUERY EXECUTION
Process Model
Parallel Execution
$\longrightarrow$ Inter-Query Parallelism
$\longrightarrow$ Intra-Query Parallelism (Intra-Operator): horizontal, vertical, and bushy; parallel hash join, Exchange operator
$\longrightarrow$ Intra-Query Parallelism (Inter-Operator), a.k.a. pipelined parallelism
IO Parallelism
QUERY OPTIMIZATION
Heuristics
$\longrightarrow$ Predicate Pushdown
$\longrightarrow$ Projection Pushdown
$\longrightarrow$ Nested Sub-Queries: Rewrite and Decompose
Statistics
$\longrightarrow$ Cardinality Estimation
$\longrightarrow$ Histograms
Cost-based Search
$\longrightarrow$ Bottom-Up vs. Top-Down
TRANSACTIONS
ACID
Conflict Serializability:
$\rightarrow$ How to check for correctness?
$\rightarrow$ How to check for equivalence?
View Serializability:
$\rightarrow$ Difference from conflict serializability
Isolation Levels / Anomalies
TRANSACTIONS
Two-Phase Locking
$\longrightarrow$ Strong Strict 2PL
$\longrightarrow$ Cascading Aborts Problem
$\longrightarrow$ Deadlock Detection & Prevention
Multiple Granularity Locking
$\longrightarrow$ Intention Locks
$\longrightarrow$ Understanding performance trade-offs
$\longrightarrow$ Lock Escalation (i.e., when is it allowed)
TRANSACTIONS
Optimistic Concurrency Control
$\longrightarrow$ Read Phase
$\longrightarrow$ Validation Phase (Backwards vs. Forwards)
$\longrightarrow$ Write Phase
Multi- Version Concurrency Control
$\longrightarrow$ Version Storage / Ordering
$\longrightarrow$ Garbage Collection
$\longrightarrow$ Index Maintenance
CRASH RECOVERY
Buffer Pool Policies:
$\rightarrow$ STEAL vs. NO-STEAL
$\rightarrow$ FORCE vs. NO-FORCE
Shadow Paging
Write-Ahead Logging
$\rightarrow$ How it relates to buffer pool management
$\rightarrow$ Logging Schemes (Physical vs. Logical)
CRASH RECOVERY
Checkpoints
$\rightarrow$ Non-Fuzzy vs. Fuzzy
ARIES Recovery
$\rightarrow$ Dirty Page Table (DPT)
$\rightarrow$ Active Transaction Table (ATT)
$\rightarrow$ Analysis, Redo, and Undo phases
$\rightarrow$ Log Sequence Numbers
$\rightarrow$ CLRs
DISTRIBUTED DATABASES
System Architectures
Replication Schemes
Partitioning Schemes
Two-Phase Commit
Paxos
Distributed Query Execution
Distributed Join Algorithms
Semi-Join Optimization
Cloud Architectures
TOPICS NOT ON EXAM!
Flash Talks
Seminar Talks
Details of specific database systems (e.g., Postgres)
GOOGLE SPANNER
Google's geo-replicated DBMS (>2011).
Schematized, semi-relational data model.
Decentralized shared-disk architecture.
Log-structured on-disk storage.
Concurrency Control:
$\rightarrow$ Strict 2PL + MVCC + Multi-Paxos + 2PC
$\rightarrow$ Externally consistent global write-transactions with synchronous replication.
$\rightarrow$ Lock-free read-only transactions.
SPANNER: CONCURRENCY CONTROL
MVCC + Strict 2PL with Wound-Wait Deadlock Prevention.
The DBMS ensures ordering through globally unique timestamps generated from atomic clocks and GPS devices.
Writes are buffered in the client and sent to the server at commit time.
The database is broken up into tablets (partitions):
$\rightarrow$ Use Paxos to elect the leader of each tablet group.
$\rightarrow$ Use 2PC for txns that span tablets.
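As a refresher on the wound-wait rule used here, a minimal C++ sketch (hypothetical names, not Spanner's actual code): an older transaction aborts a younger lock holder, while a younger requester waits, so no waits-for cycle can ever form.

```cpp
#include <cstdint>

// Sketch of wound-wait deadlock prevention (illustrative types only).
struct Txn {
    uint64_t ts;            // globally unique timestamp (smaller = older)
    bool aborted = false;
};

enum class LockDecision { Granted, Wait, WoundHolder };

// Called when `requester` wants a lock currently held by `holder`
// (or by nobody, if holder == nullptr).
LockDecision RequestLock(Txn &requester, Txn *holder) {
    if (holder == nullptr) {
        return LockDecision::Granted;      // no conflict
    }
    if (requester.ts < holder->ts) {
        holder->aborted = true;            // older txn "wounds" the younger holder
        return LockDecision::WoundHolder;  // requester takes the lock after the abort
    }
    return LockDecision::Wait;             // younger txn waits; no cycle can form
}
```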
SPANNER TABLES
SPANNER: TRANSACTION ORDERING
DBMS orders transactions based on physical "wall- clock" time.
$\rightarrow$ This is necessary to guarantee strict serializability.
$\rightarrow$ If $\mathsf{T}_1$ finishes before $\mathsf{T}_2$, then $\mathsf{T}_2$ should see the result of $\mathsf{T}_1$.
Each Paxos group decides in what order transactions should be committed according to the timestamps.
$\rightarrow$ If $\mathsf{T}_1$ commits at $\mathsf{time}_1$ and $\mathsf{T}_2$ starts at $\mathsf{time}_2 > \mathsf{time}_1$, then $\mathsf{T}_1$'s timestamp should be less than $\mathsf{T}_2$'s.
SPANNER: TRUETIME
The DBMS maintains a global wall-clock time across all data centers with bounded uncertainty. Timestamps are intervals, not single values.
Each data center has GPS and atomic clocks.
$\rightarrow$ These two provide fine-grained clock synchronization down to a few milliseconds.
$\rightarrow$ Every 30 seconds, there is a maximum 7 ms difference.
Multiple sync daemons per data center.
$\rightarrow$ GPS and atomic clocks can fail in various conditions.
$\rightarrow$ Sync daemons talk to each other within a data center as well as across data centers.
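The interval-based timestamps enable a "commit wait" discipline. A minimal sketch, assuming a hypothetical `TT_now()` interface that returns the uncertainty interval described above (this is an illustration, not Spanner's actual API):

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical TrueTime-style interface: now() returns an interval
// [earliest, latest] guaranteed to contain the true wall-clock time.
struct TTInterval {
    uint64_t earliest;  // microseconds
    uint64_t latest;
};
TTInterval TT_now();    // assumed to be provided by the clock daemons

// Commit-wait sketch: pick the commit timestamp at the top of the
// uncertainty interval, then wait until that timestamp is guaranteed to be
// in the past before releasing locks / making the writes visible.
uint64_t CommitWait() {
    uint64_t commit_ts = TT_now().latest;            // assign commit timestamp
    while (TT_now().earliest <= commit_ts) {         // uncertainty still overlaps
        std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
    return commit_ts;                                // safe to expose the writes
}
```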
GOOGLE BIGQUERY (2011)
Originally developed as "Dremel" in 2006 as a side-project for analyzing data artifacts generated from other tools.
$\rightarrow$ The "interactive" goal means that they want to support ad hoc queries on in-situ data files.
$\rightarrow$ Did not support joins in the first version.
Rewritten in the late 2010s to a shared-disk architecture built on top of GFS.
Released as public commercial product (BigQuery) in 2012.
BIGQUERY: OVERVIEW
Shared-Disk / Disaggregated Storage
Vectorized Query Processing
Shuffle-based Distributed Query Execution
Columnar Storage
$\rightarrow$ Zone Maps / Filters
$\rightarrow$ Dictionary + RLE Compression
$\rightarrow$ Only Allows "Search" Inverted Indexes
Hash Joins Only
Heuristic Optimizer + Adaptive Optimizations
BIGQUERY: IN-MEMORY SHUFFLE
The shuffle phases represent checkpoints in a query's lifecycle where the coordinator makes sure that all tasks are completed.
Fault Tolerance / Straggler Avoidance:
$\rightarrow$ If a worker does not produce a task's results within a deadline, the coordinator speculatively executes a redundant task.
Dynamic Resource Allocation:
$\rightarrow$ Scale up / down the number of workers for the next stage depending on the size of a stage's output.
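A rough sketch of the straggler check described above; `Task` and `LaunchRedundantCopy` are hypothetical placeholders for the coordinator's bookkeeping, not BigQuery's actual scheduler:

```cpp
#include <chrono>
#include <vector>

// Hypothetical coordinator-side straggler check: if a task has not reported
// its shuffle output within a deadline, schedule a redundant copy and keep
// whichever result arrives first.
struct Task {
    int id;
    std::chrono::steady_clock::time_point started;
    bool done = false;
    bool speculated = false;
};

void LaunchRedundantCopy(const Task &t);  // assumed scheduler hook

void CheckStragglers(std::vector<Task> &tasks, std::chrono::seconds deadline) {
    auto now = std::chrono::steady_clock::now();
    for (auto &t : tasks) {
        if (!t.done && !t.speculated && now - t.started > deadline) {
            LaunchRedundantCopy(t);   // first result wins; the other is discarded
            t.speculated = true;
        }
    }
}
```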
BIGQUERY: IN-MEMORY SHUFFLE
Distributed File System
BIGQUERY: IN-MEMORY SHUFFLE
BIGQUERY: IN-MEMORY SHUFFLE
BIGQUERY: IN-MEMORY SHUFFLE
Distributed File System
BIGQUERY: IN-MEMORY SHUFFLE
Distributed File System
BIGQUERY: IN-MEMORY SHUFFLE
BIGQUERY: IN-MEMORY SHUFFLE
BIGQUERY: IN-MEMORY SHUFFLE
BIGQUERY: IN-MEMORY SHUFFLE
BIGQUERY: DYNAMIC REPARTITIONING
BigQuery dynamically load balances and adjusts intermediate result partitioning to adapt to data skew.
The DBMS detects when a shuffle partition gets too full and then instructs workers to adjust their partitioning scheme.
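One way to picture the adjustment (a simplified sketch, not BigQuery's actual mechanism): producers hash rows into a fixed fan-out, and when the coordinator flags a partition as too full they switch to a finer fan-out for the remaining rows.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Simplified sketch: a hash-based partitioning function whose fan-out can be
// increased mid-query when the coordinator reports that a shuffle partition
// is over its size budget. Names are illustrative.
struct PartitionFn {
    size_t fanout;
    size_t operator()(const std::string &key) const {
        return std::hash<std::string>{}(key) % fanout;
    }
};

// Worker-side reaction to a "partition too full" signal: double the fan-out
// so that subsequent rows are spread across more (smaller) partitions.
PartitionFn Repartition(const PartitionFn &current) {
    return PartitionFn{current.fanout * 2};
}
```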
SNOWFLAKE (2013)
Managed OLAP DBMS written in C++.
$\rightarrow$ Shared-disk architecture with aggressive compute-side local caching.
$\rightarrow$ Written from scratch. Did not borrow components from existing systems.
$\rightarrow$ Custom SQL dialect and client-server network protocols.
The OG cloud-native data warehouse.
SNOWFLAKE: OVERVIEW
Cloud-native OLAP DBMS written in C++
Shared-Disk / Disaggregated Storage
Push-based Vectorized Query Processing
Precompiled Operator Primitives
Separate Table Data from Meta-Data
No Buffer Pool
PAX Columnar Storage
SNOWFLAKE: QUERY PROCESSING
Snowflake is a push-based vectorized engine that uses precompiled primitives for operator kernels.
$\longrightarrow$ Pre-compiles variants using C++ templates for different vector data types.
$\longrightarrow$ Only uses codegen (via LLVM) for tuple serialization/deserialization between workers.
Does not support partial query retries.
$\longrightarrow$ If a worker fails, then the entire query has to restart.
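The "precompiled primitives via C++ templates" idea looks roughly like this (an illustrative sketch, not Snowflake's code): one templated kernel is explicitly instantiated at build time for each supported vector type, so no runtime code generation is needed for common operators.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One templated kernel; instantiations are emitted into the engine binary
// ahead of time for every supported element type.
template <typename T>
void AddVectors(const std::vector<T> &a, const std::vector<T> &b,
                std::vector<T> &out) {
    out.resize(a.size());
    for (size_t i = 0; i < a.size(); i++) {
        out[i] = a[i] + b[i];   // tight, type-specialized inner loop
    }
}

// Explicit instantiations = the "precompiled primitives".
template void AddVectors<int32_t>(const std::vector<int32_t> &,
                                  const std::vector<int32_t> &,
                                  std::vector<int32_t> &);
template void AddVectors<int64_t>(const std::vector<int64_t> &,
                                  const std::vector<int64_t> &,
                                  std::vector<int64_t> &);
template void AddVectors<double>(const std::vector<double> &,
                                 const std::vector<double> &,
                                 std::vector<double> &);
```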
SNOWFLAKE: ADAPTIVE OPTIMIZATION
After determining join ordering, Snowflake's optimizer identifies aggregation operators to push down into the plan below joins.
The optimizer adds these pushed-down aggregations to the plan, but the DBMS only enables them at runtime according to statistics observed during execution.
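A hedged sketch of how such a runtime toggle might look; the thresholds and names below are made up for illustration and are not Snowflake's actual heuristics:

```cpp
#include <cstdint>

// Hypothetical runtime switch for a pushed-down ("pre") aggregation below a
// join: the operator observes how much the aggregation actually reduces its
// input and disables itself if the reduction is not worth the extra hashing.
struct AdaptivePreAgg {
    uint64_t rows_in = 0;
    uint64_t groups_out = 0;
    bool enabled = true;

    void Observe(uint64_t batch_rows, uint64_t batch_groups) {
        rows_in += batch_rows;
        groups_out += batch_groups;
        // If almost every input row produces its own group, pre-aggregation
        // is pure overhead; pass rows straight through to the join instead.
        if (enabled && rows_in > 100000 &&
            groups_out * 10 > rows_in * 9) {   // >90% distinct groups observed
            enabled = false;
        }
    }
};
```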
Source: "Aggregation Placement - An Adaptive Query Optimization for Snowflake" by Bowei Chen, Snowflake Engineering Blog, Aug 10, 2023.
SNOWFLAKE: FLEXIBLE COMPUTE
If a query plan fragment will process a large amount of data, then the DBMS can temporarily deploy additional worker nodes to accelerate its performance.
Flexible compute worker nodes write their results to storage as if they were a table.
AMAZON REDSHIFT (2014)
Amazon's flagship OLAP DBaaS.
$\longrightarrow$ Based on ParAccel's original shared-nothing architecture.
$\longrightarrow$ Switched to support disaggregated storage (S3) in 2017.
$\longrightarrow$ Added serverless deployments in 2022.
Compared to BigQuery/Spark, Redshift is a more traditional data warehouse in that it wants to control all the data.
The overarching design goal is to remove as many administration and configuration choices from users as possible.
REDSHIFT: OVERVIEW
Shared-Disk / Disaggregated Storage
Push-based Vectorized Query Processing
Transpilation Query Codegen (C++)
Precompiled Primitives
Compute-side Caching
PAX Columnar Storage
Sort-Merge + Hash Joins
Hardware Acceleration (AQUA)
Stratified Query Optimizer
REDSHIFT: COMPILATION SERVICE
Separate nodes to compile query plans using GCC and aggressive caching.
$\rightarrow$ The DBMS checks whether a compiled version of each templated fragment already exists in the customer's local cache.
$\rightarrow$ If a fragment does not exist in the local cache, then it checks a global cache shared by the entire fleet of Redshift customers.
Background workers proactively recompile plans when a new version of the DBMS is released.
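The two-level lookup can be sketched like this (hypothetical function names; the real service's interfaces are not public):

```cpp
#include <optional>
#include <string>

// Hypothetical two-level lookup for a compiled plan fragment: check the
// customer-local cache first, then the fleet-wide global cache, and only
// compile (and publish) on a miss in both.
std::optional<std::string> LocalCacheGet(const std::string &fragment_key);
std::optional<std::string> GlobalCacheGet(const std::string &fragment_key);
std::string CompileFragment(const std::string &templated_cpp);  // e.g., invoke gcc
void Publish(const std::string &fragment_key, const std::string &binary);

std::string GetCompiledFragment(const std::string &fragment_key,
                                const std::string &templated_cpp) {
    if (auto hit = LocalCacheGet(fragment_key)) return *hit;
    if (auto hit = GlobalCacheGet(fragment_key)) {
        Publish(fragment_key, *hit);     // warm the local cache
        return *hit;
    }
    std::string binary = CompileFragment(templated_cpp);
    Publish(fragment_key, binary);       // share with local + global caches
    return binary;
}
```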
REDSHIFT: HARDWARE ACCELERATION
AWS introduced the AQUA (Advanced Query Accelerator) for Redshift (Spectrum?) in 2021.
Separate compute/cache nodes that use FPGAs to evaluate predicates.
AQUA was phased out and replaced with Nitro cards on compute nodes.
DATABRICKS PHOTON (2022)
Single-threaded C++ execution engine embedded into Databricks Runtime (DBR) via JNI.
$\longrightarrow$ Overrides the existing engine when appropriate.
$\longrightarrow$ Supports both Spark's earlier SQL engine and Spark's DataFrame API.
$\longrightarrow$ Seamlessly handles the impedance mismatch between row-oriented DBR and column-oriented Photon.
Accelerates the execution of query plans over "raw / uncurated" files in a data lake.
Reference: "Photon: A Fast Query Engine for Lakehouse Systems," Databricks Inc., 2022.
PHOTON: OVERVIEW
Shared-Disk / Disaggregated Storage
Pull-based Vectorized Query Processing
Precompiled Primitives + Expression Fusion
Shuffle-based Distributed Query Execution
Sort-Merge + Hash Joins
Unified Query Optimizer + Adaptive Optimizations
PHOTON: VECTORIZED PROCESSING
Photon is a pull-based vectorized engine that uses precompiled operator kernels (primitives).
$\rightarrow$ Converts the physical plan into a list of pointers to functions that perform low-level operations on column batches.
Databricks: it is easier to build/maintain a vectorized engine than a JIT engine.
$\rightarrow$ Engineers spend more time creating specialized codepaths to get closer to JIT performance.
$\rightarrow$ With codegen, engineers write tooling and observability hooks instead of writing the engine.
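Conceptually, the "list of pointers to functions" looks like the following sketch (illustrative types, not Photon's actual classes):

```cpp
#include <functional>
#include <vector>

// Sketch of a vectorized, kernel-based pipeline: the physical plan is lowered
// to an ordered list of precompiled kernels, each of which transforms a batch
// of columns in place.
struct ColumnBatch { /* columnar data + selection vector */ };

using Kernel = std::function<void(ColumnBatch &)>;

void RunPipeline(const std::vector<Kernel> &kernels, ColumnBatch &batch) {
    for (const auto &k : kernels) {
        k(batch);   // e.g., filter, projection, hash-probe kernels in plan order
    }
}
```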
PHOTON: EXPRESSION FUSION
SELECT * FROM foo WHERE cdate BETWEEN '2024-01-01' AND '2024-04-01';
The BETWEEN predicate is first rewritten into two comparisons:
SELECT * FROM foo WHERE cdate >= '2024-01-01' AND cdate <= '2024-04-01';
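The payoff of fusion is that a single specialized kernel evaluates both bounds in one pass over the column, instead of running a ">=" primitive and a "<=" primitive with an intermediate result in between. A sketch, assuming dates are encoded as 32-bit day numbers (illustrative only, not Photon's code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Fused range-filter kernel: both bounds are checked in a single pass, with
// no intermediate bitmap between the two comparisons.
void FusedBetweenKernel(const std::vector<int32_t> &col, int32_t lo, int32_t hi,
                        std::vector<bool> &out) {
    out.resize(col.size());
    for (size_t i = 0; i < col.size(); i++) {
        out[i] = (col[i] >= lo) && (col[i] <= hi);
    }
}
```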
SPARK: PARTITION COALESCING
Spark (over-)allocates a large number of shuffle partitions for each stage.
$\longrightarrow$ The number needs to be large enough to avoid any one partition from filling up too much.
After the shuffle completes, the DBMS then combines underutilized partitions using heuristics.
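A sketch of the coalescing heuristic (the target size and grouping rule are illustrative, not Spark's exact logic): walk the small shuffle partitions in order and greedily merge neighbors until each group reaches a target size.

```cpp
#include <cstdint>
#include <vector>

// Greedily merge adjacent shuffle partitions until each merged group is
// close to a target size; leftover partitions form the final group.
std::vector<std::vector<int>> CoalescePartitions(
        const std::vector<uint64_t> &partition_bytes, uint64_t target_bytes) {
    std::vector<std::vector<int>> groups;
    std::vector<int> current;
    uint64_t current_bytes = 0;
    for (int p = 0; p < static_cast<int>(partition_bytes.size()); p++) {
        current.push_back(p);
        current_bytes += partition_bytes[p];
        if (current_bytes >= target_bytes) {   // group is big enough
            groups.push_back(current);
            current.clear();
            current_bytes = 0;
        }
    }
    if (!current.empty()) groups.push_back(current);
    return groups;
}
```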
DUCKDB (2019)
Multi-threaded embedded (in-process, serverless) DBMS that executes SQL over disparate data files.
$\longrightarrow$ PostgreSQL-like dialect with quality-of-life enhancements.
$\longrightarrow$ "SQLite for Analytics"
Provides zero-copy access to query results via Arrow to client code running in the same process.
The core DBMS is nearly all custom C++ code with little to no third-party dependencies.
$\longrightarrow$ Relies on an extensions ecosystem to expand capabilities.
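A minimal embedded-usage sketch based on DuckDB's documented C++ API (verify exact signatures against the current docs):

```cpp
#include "duckdb.hpp"

int main() {
    duckdb::DuckDB db(nullptr);       // in-memory database; runs inside this process
    duckdb::Connection con(db);       // no server to start or connect to
    auto result = con.Query("SELECT 41 + 1 AS answer");
    result->Print();                  // materialized result; Arrow export is also available
    return 0;
}
```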
George Fraser (@frasergeorgew): "My second big finding is the vast majority of queries are tiny, and virtually all queries could fit on a large single node. We maybe don't need MPP systems anymore?"
DUCKDB: OVERVIEW
Shared-Everything
Push-based Vectorized Query Processing
Precompiled Primitives
Multi-Version Concurrency Control
Morsel Parallelism + Scheduling
PAX Columnar Storage
Sort-Merge + Hash Joins
Stratified Query Optimizer
DUCKDB: PUSH-BASED PROCESSING
The system originally used pull-based vectorized query processing but found it unwieldy to extend to support more complex parallelism.
$\longrightarrow$ Cannot invoke multiple pipelines simultaneously.
Switched to a push-based query processing model in 2021. Each operator determines on its own whether it will execute in parallel, instead of relying on a centralized executor.
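An illustrative push-based pipeline interface (these are not DuckDB's actual class names): batches are pushed from a source through operators into a sink, and each operator/sink reports for itself whether it can run in parallel.

```cpp
#include <memory>
#include <vector>

struct DataChunk { /* a batch of vectors */ };

// Intermediate operator: transforms a batch pushed into it.
struct Operator {
    virtual ~Operator() = default;
    virtual void Execute(DataChunk &chunk) = 0;
    virtual bool ParallelOperator() const { return true; }
};

// Pipeline terminus: e.g., a hash-table build or a result collector.
struct Sink {
    virtual ~Sink() = default;
    virtual void Consume(DataChunk &chunk) = 0;
    virtual bool ParallelSink() const { return false; }
};

// One worker's share of a pipeline: push each morsel's batch through the
// operators and into the sink.
void RunPipelineTask(std::vector<std::unique_ptr<Operator>> &ops, Sink &sink,
                     std::vector<DataChunk> &morsels) {
    for (auto &chunk : morsels) {
        for (auto &op : ops) op->Execute(chunk);
        sink.Consume(chunk);
    }
}
```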
DUCKDB: VECTORS
Custom internal vector layout for intermediate results that is compatible with Velox.
Supports multiple vector types.
DuckDB uses a unified format to process all vector types without needing to decompress them first.
$\rightarrow$ Reduces the # of specialized primitives per vector type.
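The unified-format idea can be sketched as follows (illustrative struct, not DuckDB's real API): every vector is exposed as a data array plus a selection index, so one primitive loop handles flat, constant, and dictionary vectors alike.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative "unified" view of a vector: a data array plus a selection
// index that maps each logical row to a slot in that array.
struct UnifiedVector {
    const int32_t *data;    // underlying values
    const uint32_t *sel;    // maps row i -> index into data
    size_t count;
};

// Example primitive: the same loop works whether the vector is flat
// (sel[i] == i), constant (sel[i] == 0 and data has one value), or
// dictionary-encoded (sel points at the dictionary codes).
int64_t SumUnified(const UnifiedVector &v) {
    int64_t sum = 0;
    for (size_t i = 0; i < v.count; i++) {
        sum += v.data[v.sel[i]];
    }
    return sum;
}
```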