    Carnegie Mellon University
    Database Systems
    Final Review & Systems Potpourri

    ADMINISTRIVIA

    Final Exam is on Monday, April 28, 2025, from 5:30pm-8:30pm.
    → Early exam will not be offered. Do not make travel plans.
    → Material: Lecture 12 - Lecture 24.
    → You can use the full 3 hours, though the exam is meant to be done in ~2 hours.
    Last day to submit P4 (with late days and penalty) is April 30 @ 11pm.
    Course Evals: Would like your feedback.
    → https://cmu.smartevals.com
    → https://www.ugrad.cs.cmu.edu/ta/S25/feedback/

    OFFICE HOURS

    Jignesh:

    → Thursday April 24th @ noon-2pm (GHC 9103)
    All other TAs will hold their office hours up to and including Saturday, April 26th.

    FINAL EXAM

    Where: Scaife Hall 105 and Scaife Hall 234.
    When: Monday, April 28, 2025, 5:30pm-8:30pm.
    Come to Scaife Hall 105 first. Then, look at your seating assignment, which may assign you to Scaife Hall 234.
    https://15445.courses.cs.cmu.edu/spring2025/final-guide.html

    FINAL EXAM

    What to bring:

    → CMU ID
    → Pencil + Eraser (!!!)
    → Calculator (cellphone is okay)
    → One 8.5x11" page of handwritten notes (double-sided)

    STUFF BEFORE MID-TERM

    SQL

    Buffer Pool Management
    Data Structures (Hash Tables, B+Trees)
    Storage Models
    Query Processing Models
    Inter-Query Parallelism
    Basic Understanding of BusTub Internals

    JOIN ALGORITHMS

    Join Algorithms
    → Naive Nested Loops
    → Block Nested Loops
    → Index Nested Loops
    → Sort-Merge
    → Hash Join: Simple, Partitioned, Hybrid Hash
    → Optimization using Bloom Filters
    → Cost functions (see the worked example below)
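    Not on the slide itself, but as a quick refresher on what the cost functions look like: for block nested loop join with $M$ pages in the outer table, $N$ pages in the inner table, and $B$ buffer pages, the standard I/O cost is $M + \lceil M / (B - 2) \rceil \cdot N$ ($B-2$ pages buffer the outer, one holds the inner input, one holds the output). For example, $M = 1000$, $N = 500$, $B = 102$ gives $1000 + \lceil 1000 / 100 \rceil \cdot 500 = 6000$ I/Os, versus $M + M \cdot N = 501{,}000$ I/Os for page-at-a-time naive nested loops.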

    QUERY EXECUTION

    Execution Models
    → Iterator (sketch below)
    → Materialized
    → Vector / Batch
    Plan Processing: Push vs. Pull
    Access Methods
    → Sequential Scan and various optimizations
    → Index Scan, including multi-index scan
    → Issues with update queries
    Expression Evaluation
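
    A minimal sketch of the iterator (Volcano) execution model listed above (illustrative only, not BusTub's actual interface): every operator exposes Next() and pulls one tuple at a time from its child.

    #include <optional>
    #include <string>
    #include <vector>

    using Tuple = std::vector<std::string>;   // simplified tuple representation

    // Every operator exposes the same pull-based interface.
    struct Operator {
      virtual ~Operator() = default;
      virtual std::optional<Tuple> Next() = 0;   // one tuple per call, nullopt when exhausted
    };

    // Filter pulls from its child until a tuple satisfies the predicate.
    struct Filter : Operator {
      Operator *child;
      bool (*pred)(const Tuple &);
      Filter(Operator *c, bool (*p)(const Tuple &)) : child(c), pred(p) {}
      std::optional<Tuple> Next() override {
        while (auto t = child->Next()) {   // keep pulling until a match or exhaustion
          if (pred(*t)) return t;
        }
        return std::nullopt;
      }
    };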

    QUERY EXECUTION

    Process Model

    Parallel Execution
    → Inter-Query Parallelism
    → Intra-Query Parallelism, Intra-Operator: horizontal, vertical, and bushy; parallel hash join, exchange operator
    → Intra-Query Parallelism, Inter-Operator, a.k.a. pipelined parallelism
    IO Parallelism

    QUERY OPTIMIZATION

    Heuristics

    → Predicate Pushdown
    → Projection Pushdown
    → Nested Sub-Queries: Rewrite and Decompose
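    As a reminder of what these rewrites look like (standard relational algebra, not copied from a slide): a predicate that references only one side of a join can be pushed below the join, e.g. $\sigma_{R.a = 5}(R \bowtie_{R.id = S.id} S) = (\sigma_{R.a = 5} R) \bowtie_{R.id = S.id} S$. Projection pushdown similarly keeps, on each side of the join, only the columns that the join condition and the final output actually need.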

    Statistics

    → Cardinality Estimation
    → Histograms
    Cost-based search
    → Bottom-up vs. Top-Down
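    A worked example of the kind of estimate involved (textbook formula, not from a slide): under uniformity and independence assumptions, an equi-join on attribute $A$ is estimated as $|R \bowtie_A S| \approx \frac{|R| \cdot |S|}{\max(V(A,R),\, V(A,S))}$, where $V(A,R)$ is the number of distinct $A$ values in $R$. With $|R| = 10{,}000$, $|S| = 2{,}000$, $V(A,R) = 100$, $V(A,S) = 50$, the estimate is $10{,}000 \cdot 2{,}000 / 100 = 200{,}000$ tuples.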

    TRANSACTIONS

    ACID

    Conflict Serializability:
    → How to check for correctness?
    → How to check for equivalence?
    View Serializability
    → Difference with conflict serializability
    Isolation Levels / Anomalies
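    A small worked example of the precedence-graph check (standard material, not tied to a specific slide): for the schedule $R_1(A),\, W_2(A),\, R_2(B),\, W_1(B)$, the conflict on $A$ adds the edge $T_1 \rightarrow T_2$ and the conflict on $B$ adds $T_2 \rightarrow T_1$. The graph has a cycle, so the schedule is not conflict serializable. Two schedules are conflict equivalent exactly when they order every pair of conflicting operations the same way, i.e., they produce the same precedence graph.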

    TRANSACTIONS

    Two-Phase Locking
    → Strong Strict 2PL
    → Cascading Aborts Problem
    → Deadlock Detection & Prevention
    Multiple Granularity Locking
    → Intention Locks
    → Understanding performance trade-offs
    → Lock Escalation (i.e., when is it allowed)

    TRANSACTIONS

    Optimistic Concurrency Control
    → Read Phase
    → Validation Phase (Backwards vs. Forwards)
    → Write Phase
    Multi-Version Concurrency Control
    → Version Storage / Ordering
    → Garbage Collection
    → Index Maintenance

    CRASH RECOVERY

    Buffer Pool Policies:
    → STEAL vs. NO-STEAL
    → FORCE vs. NO-FORCE
    Shadow Paging
    Write-Ahead Logging
    → How it relates to buffer pool management
    → Logging Schemes (Physical vs. Logical)

    CRASH RECOVERY

    Checkpoints
    → Non-Fuzzy vs. Fuzzy
    ARIES Recovery
    → Dirty Page Table (DPT)
    → Active Transaction Table (ATT)
    → Analyze, Redo, Undo phases
    → Log Sequence Numbers
    → CLRs

    DISTRIBUTED DATABASES

    System Architectures
    Replication Schemes
    Partitioning Schemes
    Two-Phase Commit
    Paxos
    Distributed Query Execution
    Distributed Join Algorithms
    Semi-Join Optimization
    Cloud Architectures

    TOPICS NOT ON EXAM!

    Flash Talks
    Seminar Talks
    Details of specific database systems (e.g., Postgres)

    GOOGLE SPANNER

    Google's geo-replicated DBMS (>2011)
    Schematized, semi-relational data model.
    Decentralized shared-disk architecture.
    Log-structured on-disk storage.
    Concurrency Control:
    → Strict 2PL + MVCC + Multi-Paxos + 2PC
    → Externally consistent global write-transactions with synchronous replication.
    → Lock-free read-only transactions.

    SPANNER: CONCURRENCY CONTROL

    MVCC + Strict 2PL with Wound-Wait Deadlock Prevention
    The DBMS ensures ordering through globally unique timestamps generated from atomic clocks and GPS devices.
    Writes are buffered in the client and sent to the server at commit time.
    Database is broken up into tablets (partitions):
    → Use Paxos to elect leader in tablet group.
    → Use 2PC for txns that span tablets.
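
    A minimal sketch of the wound-wait rule mentioned above (illustrative only; Spanner's real lock manager is far more involved): an older transaction, i.e. one with a smaller timestamp, "wounds" a younger lock holder, while a younger requester simply waits.

    #include <cstdint>

    enum class Action { Wait, WoundHolder };

    // Wound-Wait deadlock prevention: priority goes to older transactions
    // (smaller start timestamps). An older requester never waits behind a
    // younger holder, so no wait-for cycles can form.
    Action OnLockConflict(uint64_t requester_ts, uint64_t holder_ts) {
      if (requester_ts < holder_ts) {
        // Requester is older: abort ("wound") the younger holder and take the lock.
        return Action::WoundHolder;
      }
      // Requester is younger: it waits for the older holder to finish.
      return Action::Wait;
    }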

    SPANNER TABLES


    SPANNER: TRANSACTION ORDERING

    DBMS orders transactions based on physical "wall-clock" time.
    → This is necessary to guarantee strict serializability.
    → If $\mathsf{T}_1$ finishes before $\mathsf{T}_2$, then $\mathsf{T}_2$ should see the result of $\mathsf{T}_1$.
    Each Paxos group decides in what order transactions should be committed according to the timestamps.
    $\rightarrow$ If $\mathsf{T}_1$ commits at $\mathsf{time}_1$ and $\mathsf{T}_2$ starts at $\mathsf{time}_2 > \mathsf{time}_1$ , then $\mathsf{T}_1$ 's timestamp should be less than $\mathsf{T}_2$ 's.

    SPANNER TRUETIME

    The DBMS maintains a global wall-clock time across all data centers with bounded uncertainty. Timestamps are intervals, not single values.


    SPANNER: TRUETIME

    Each data center has GPS and atomic clocks.
    → These two provide fine-grained clock synchronization down to a few milliseconds.
    → Every 30 seconds, there is a maximum 7 ms difference.
    Multiple sync daemons per data center:
    → GPS and atomic clocks can fail in various conditions.
    → Sync daemons talk to each other within a data center as well as across data centers.
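
    A hedged sketch of how commit wait falls out of interval timestamps (the TTInterval/TTNow names and the fixed ±4 ms bound are illustrative stand-ins, not Google's API): pick the upper bound of the current uncertainty interval as the commit timestamp, then wait until that timestamp is guaranteed to be in the past before making the write visible.

    #include <chrono>
    #include <cstdint>
    #include <thread>

    // Illustrative TrueTime-style interval clock: real time lies in [earliest, latest].
    struct TTInterval { uint64_t earliest; uint64_t latest; };

    // Stand-in clock: system time with a fixed ±4 ms uncertainty bound.
    TTInterval TTNow() {
      uint64_t now_us = std::chrono::duration_cast<std::chrono::microseconds>(
          std::chrono::system_clock::now().time_since_epoch()).count();
      const uint64_t eps_us = 4000;
      return {now_us - eps_us, now_us + eps_us};
    }

    uint64_t CommitWithWait() {
      // Choose the commit timestamp s as the upper bound of the current interval.
      uint64_t s = TTNow().latest;
      // Commit wait: do not make the write visible until s is guaranteed to be
      // in the past everywhere, i.e. until earliest > s.
      while (TTNow().earliest <= s) {
        std::this_thread::sleep_for(std::chrono::microseconds(100));
      }
      return s;   // any transaction starting now gets a strictly larger timestamp
    }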

    GOOGLE BIGQUERY (2011)

    Originally developed as "Dremel" in 2006 as a side-project for analyzing data artifacts generated from other tools.
    → The "interactive" goal means that they want to support ad hoc queries on in-situ data files.
    → Did not support joins in the first version.
    Rewritten in the late 2010s to a shared-disk architecture built on top of GFS.
    Released as a public commercial product (BigQuery) in 2012.

    BIGQUERY: OVERVIEW

    Shared-Disk / Disaggregated Storage
    Vectorized Query Processing
    Shuffle-based Distributed Query Execution
    Columnar Storage
    → Zone Maps / Filters
    → Dictionary + RLE Compression
    → Only Allows "Search" Inverted Indexes
    Hash Joins Only
    Heuristic Optimizer + Adaptive Optimizations


    BIGQUERY: IN-MEMORY SHUFFLE

    The shuffle phases represent checkpoints in a query's lifecycle where the coordinator makes sure that all tasks are completed.

    Fault Tolerance / Straggler Avoidance:

    → If a worker does not produce a task's results within a deadline, the coordinator speculatively executes a redundant task.
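
    A rough sketch of the straggler-avoidance idea (purely illustrative, not BigQuery code): if a task misses its deadline, launch a duplicate and take whichever copy finishes first.

    #include <chrono>
    #include <future>

    // If the primary task misses its deadline, speculatively launch a redundant
    // copy and return the result of whichever finishes first (simplified by polling).
    template <typename Fn>
    auto RunWithSpeculation(Fn task, std::chrono::milliseconds deadline) {
      auto primary = std::async(std::launch::async, task);
      if (primary.wait_for(deadline) == std::future_status::ready) {
        return primary.get();
      }
      auto backup = std::async(std::launch::async, task);   // speculative re-execution
      while (true) {
        if (primary.wait_for(std::chrono::milliseconds(1)) == std::future_status::ready) {
          return primary.get();
        }
        if (backup.wait_for(std::chrono::milliseconds(1)) == std::future_status::ready) {
          return backup.get();
        }
      }
    }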

    Dynamic Resource Allocation:

    → Scale up / down the number of workers for the next stage depending on the size of a stage's output.


    BIGQUERY: DYNAMIC REPARTITIONING

    BigQuery dynamically load balances and adjusts intermediate result partitioning to adapt to data skew.
    The DBMS detects when a shuffle partition gets too full and then instructs workers to adjust their partitioning scheme.
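
    A toy version of the detection step (illustrative only, not BigQuery's mechanism): the coordinator scans per-partition sizes and flags the ones that crossed a threshold so producers can switch to a finer-grained partitioning for them.

    #include <cstddef>
    #include <vector>

    // Toy skew check: given the current size of each shuffle partition, return the
    // ids of partitions that workers should start splitting with a finer hash.
    std::vector<size_t> PartitionsToSplit(const std::vector<size_t> &partition_bytes,
                                          size_t max_bytes_per_partition) {
      std::vector<size_t> to_split;
      for (size_t i = 0; i < partition_bytes.size(); ++i) {
        if (partition_bytes[i] > max_bytes_per_partition) {
          to_split.push_back(i);   // coordinator instructs producers to repartition these
        }
      }
      return to_split;
    }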


    SNOWFLAKE (2013)

    Managed OLAP DBMS written in C++.
    → Shared-disk architecture with aggressive compute-side local caching.
    → Written from scratch. Did not borrow components from existing systems.
    → Custom SQL dialect and client-server network protocols.
    The OG cloud-native data warehouse.


    SNOWFLAKE: OVERVIEW

    Cloud-native OLAP DBMS written in C++
    Shared-Disk / Disaggregated Storage
    Push-based Vectorized Query Processing
    Precompiled Operator Primitives
    Separate Table Data from Meta-Data
    No Buffer Pool
    PAX Columnar Storage

    SNOWFLAKE: QUERY PROCESSING

    Snowflake is a push-based vectorized engine that uses precompiled primitives for operator kernels.
    → Pre-compile variants using C++ templates for different vector data types (see the sketch below).
    → Only uses codegen (via LLVM) for tuple serialization/deserialization between workers.
    Does not support partial query retries.
    → If a worker fails, then the entire query has to restart.
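
    A sketch of what precompiled primitives via C++ templates can look like in general (illustrative, not Snowflake source): one templated kernel per operation is instantiated for each supported data type at build time, so no code generation is needed at query time.

    #include <cstddef>
    #include <cstdint>

    // Generic vectorized comparison kernel: out[i] = (col[i] < constant).
    // One copy is compiled ahead of time for every supported data type.
    template <typename T>
    void LessThanConst(const T *col, T constant, bool *out, size_t n) {
      for (size_t i = 0; i < n; ++i) {
        out[i] = col[i] < constant;   // tight loop the compiler can auto-vectorize
      }
    }

    // Explicit instantiations baked into the engine binary at build time.
    template void LessThanConst<int32_t>(const int32_t *, int32_t, bool *, size_t);
    template void LessThanConst<int64_t>(const int64_t *, int64_t, bool *, size_t);
    template void LessThanConst<double>(const double *, double, bool *, size_t);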

    SNOWFLAKE: ADAPTIVE OPTIMIZATION

    After determining the join ordering, Snowflake's optimizer identifies aggregation operators to push down into the plan below joins.
    The optimizer adds the downstream aggregations, but the DBMS only enables them at runtime according to statistics observed during execution.
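
    One standard way to picture this rewrite (textbook eager aggregation; Snowflake's actual transformation is more general): for a SUM aggregate, $\gamma_{R.g,\, \mathrm{SUM}(S.v)}(R \bowtie_{R.k = S.k} S) = \gamma_{R.g,\, \mathrm{SUM}(v')}\bigl(R \bowtie_{R.k = S.k} \gamma_{S.k,\, \mathrm{SUM}(S.v) \rightarrow v'}(S)\bigr)$. The inner $\gamma$ shrinks the join input, but whether it pays off depends on how much it actually reduces $S$, which is why the DBMS only switches it on at runtime based on observed statistics.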


    Aggregation Placement - An Adaptive Query Optimization for Snowflake

    Bowei Chen, published in Snowflake, Aug 10, 2023
    Snowflake's Data Cloud is backed by a data platform designed from the ground up to leverage cloud computing technology. The platform is delivered as a fully managed service, providing a user-friendly experience to run complex analytical workloads easily and efficiently without the burden of managing on-premise infrastructure. Snowflake's architecture separates the compute layer from the storage layer. Compute workloads on the same dataset can scale independently and run in isolation without interfering with each other, and compute resources could be allocated and scaled on demand within seconds. The cloud-native architecture makes Snowflake a powerful platform for data warehousing, data engineering, data science, and many other types of applications. More about Snowflake architecture can be found in Key Concepts & Architecture documentation and the Snowflake Elastic Data Warehouse research paper.

    SNOWFLAKE: FLEXIBLE COMPUTE

    If a query plan fragment will process a large amount of data, then the DBMS can temporarily deploy additional worker nodes to accelerate its performance.
    Flexible compute worker nodes write their results to storage as if it were a table.


    AMAZON REDSHIFT (2014)

    Amazon's flagship OLAP DBaaS.
    → Based on ParAccel's original shared-nothing architecture.
    → Switched to support disaggregated storage (S3) in 2017.
    → Added serverless deployments in 2022.
    Redshift is a more traditional data warehouse compared to BigQuery/Spark in that it wants to control all the data.
    The overarching design goal is to remove as many administration and configuration choices from users as possible.

    REDSHIFT: OVERVIEW

    Shared-Disk / Disaggregated Storage
    Push-based Vectorized Query Processing
    Transpilation Query Codegen (C++)
    Precompiled Primitives
    Compute-side Caching
    PAX Columnar Storage
    Sort-Merge + Hash Joins
    Hardware Acceleration (AQUA)
    Stratified Query Optimizer


    REDSHIFT: COMPILATION SERVICE

    Separate nodes compile query plans using GCC, with aggressive caching.
    → The DBMS checks whether a compiled version of each templated fragment already exists in the customer's local cache.
    → If the fragment does not exist in the local cache, then it checks a global cache shared across the entire fleet of Redshift customers.
    Background workers proactively recompile plans when a new version of the DBMS is released.

    REDSHIFT: HARDWARE ACCELERATION

    AWS introduced AQUA (Advanced Query Accelerator) for Redshift (Spectrum?) in 2021.
    Separate compute/cache nodes that use FPGAs to evaluate predicates.
    AQUA was phased out and replaced with Nitro cards on compute nodes.



    DATABRICKS PHOTON (2022)

    Single-threaded C++ execution engine embedded into Databricks Runtime (DBR) via JNI.
    → Overrides the existing engine when appropriate.
    → Supports both Spark's earlier SQL engine and Spark's DataFrame API.
    → Seamlessly handles the impedance mismatch between row-oriented DBR and column-oriented Photon.
    Accelerates execution of query plans over "raw / uncurated" files in a data lake.

    DATABRICKS PHOTON (2022)

    Photon: A Fast Query Engine for Lakehouse Systems

    Alexander Behm, Shoumik Palkar, Utkarsh Agarwal, Timothy Armstrong, David Cashman, Ankur Dave, Todd Greenstein, Shant Hovsepian, Ryan Johnson, Arvind Sai Krishnan, Paul Leventis, Ala Luszczak, Prashanth Menon, Mostafa Mokhtar, Gene Pang, Sameer Paranjpye, Greg Rahn, Bart Samwel, Tom van Bussel, Herman van Hovell, Maryann Xue, Reynold Xin, Matei Zaharia (photon-paper-authors@databricks.com), Databricks Inc.

    ABSTRACT

    Many organizations are shifting to a data management paradigm called the "Lakehouse," which implements the functionality of structured data warehouses on top of unstructured data lakes. ... from SQL to machine learning. Traditionally, for the most demanding SQL workloads, enterprises have also moved a curated subset of their data into data warehouses to get high performance, governance, and concurrency. However, this two-tier architecture is ...

    PHOTON: OVERVIEW

    Shared-Disk / Disaggregated Storage
    Pull-based Vectorized Query Processing
    Precompiled Primitives + Expression Fusion
    Shuffle-based Distributed Query Execution
    Sort-Merge + Hash Joins
    Unified Query Optimizer + Adaptive Optimizations


    PHOTON: VECTORIZED PROCESSING

    Photon is a pull-based vectorized engine that uses precompiled operator kernels (primitives).
    → Converts the physical plan into a list of pointers to functions that perform low-level operations on column batches (see the sketch below).
    Databricks: it is easier to build/maintain a vectorized engine than a JIT engine.
    → Engineers spend more time creating specialized codepaths to get closer to JIT performance.
    → With codegen, engineers write tooling and observability hooks instead of writing the engine.
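
    A minimal sketch of "a list of pointers to functions that operate on column batches" (illustrative, not Photon's internals):

    #include <cstddef>
    #include <vector>

    // A column batch: a handful of columns, each a contiguous vector of values.
    struct ColumnBatch {
      std::vector<std::vector<double>> cols;
      std::vector<bool> selected;   // selection bitmap carried between kernels
    };

    // Every precompiled kernel has the same signature and mutates the batch in place.
    using Kernel = void (*)(ColumnBatch &);

    void FilterCol0Positive(ColumnBatch &b) {
      for (size_t i = 0; i < b.selected.size(); ++i) {
        b.selected[i] = b.selected[i] && (b.cols[0][i] > 0.0);
      }
    }

    void AddCol0Col1IntoCol2(ColumnBatch &b) {
      for (size_t i = 0; i < b.selected.size(); ++i) {
        if (b.selected[i]) b.cols[2][i] = b.cols[0][i] + b.cols[1][i];
      }
    }

    // The "plan" is just an ordered list of kernel pointers applied to each batch.
    void RunPlan(const std::vector<Kernel> &plan, ColumnBatch &batch) {
      for (Kernel k : plan) k(batch);
    }

    The plan compiler's job then reduces to picking and ordering kernels; the kernels themselves are compiled ahead of time.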

    PHOTON: EXPRESSION FUSION

    SELECT * FROM foo WHERE cdate BETWEEN '2024-01-01' AND '2024-04-01';

    PHOTON: EXPRESSION FUSION

    SELECT * FROM foo WHERE cdate >= '2024-01-01' AND cdate <= '2024-04-01';
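
    One way to read the fusion behind these two queries (a sketch under the assumption that fusion here means evaluating both comparisons in a single pass; dates appear as int32 day numbers as a stand-in for the SQL date strings): instead of running a >= kernel and a <= kernel over the batch and combining the results, a single fused kernel checks the whole range predicate per value.

    #include <cstddef>
    #include <cstdint>

    // Unfused: two passes over the column, combining the bitmaps as we go.
    void GreaterEq(const int32_t *col, int32_t lo, bool *out, size_t n) {
      for (size_t i = 0; i < n; ++i) out[i] = col[i] >= lo;
    }
    void LessEq(const int32_t *col, int32_t hi, bool *out, size_t n) {
      for (size_t i = 0; i < n; ++i) out[i] = out[i] && (col[i] <= hi);
    }

    // Fused: one pass evaluates the whole BETWEEN-style range predicate.
    void BetweenFused(const int32_t *col, int32_t lo, int32_t hi, bool *out, size_t n) {
      for (size_t i = 0; i < n; ++i) out[i] = (col[i] >= lo) && (col[i] <= hi);
    }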


    SPARK: PARTITION COALESCING

    Spark (over-)allocates a large number of shuffle partitions for each stage.
    → The number needs to be large enough to avoid any one partition from filling up too much.
    After the shuffle completes, the DBMS then combines underutilized partitions using heuristics.


    DUCKDB (2019)

    Multi-threaded embedded (in-process, serverless) DBMS that executes SQL over disparate data files.
    → PostgreSQL-like dialect with quality-of-life enhancements.
    → "SQLite for Analytics"
    Provides zero-copy access to query results via Arrow to client code running in the same process.
    The core DBMS is nearly all custom C++ code with little to no third-party dependencies.
    → Relies on an extensions ecosystem to expand capabilities.

    DUCKDB (2019)

    George Fraser (@frasergeorgew): "My second big finding is the vast majority of queries are tiny, and virtually all queries could fit on a large single node. We maybe don't need MPP systems anymore?"

    DUCKDB: OVERVIEW

    Shared-Everything
    Push-based Vectorized Query Processing
    Precompiled Primitives
    Multi-Version Concurrency Control
    Morsel Parallelism + Scheduling
    PAX Columnar Storage
    Sort-Merge + Hash Joins
    Stratified Query Optimizer


    DUCKDB: PUSH-BASED PROCESSING

    The system originally used pull-based vectorized query processing but found it unwieldy to expand to support more complex parallelism.
    → Cannot invoke multiple pipelines simultaneously.
    Switched to a push-based query processing model in 2021. Each operator determines on its own whether it will execute in parallel, instead of relying on a centralized executor.
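
    A rough sketch of a push-based pipeline interface of the kind described here (the Source/PipelineOp/Sink names are illustrative, not DuckDB's actual classes): sources produce chunks, operators transform them in place, and sinks absorb them, which lets the scheduler drive multiple pipelines at once.

    #include <memory>
    #include <vector>

    struct DataChunk { std::vector<std::vector<int>> cols; };

    // Push-based pipeline pieces: data is pushed from source through operators into a sink.
    struct Source     { virtual ~Source() = default;     virtual bool GetChunk(DataChunk &out) = 0; };
    struct PipelineOp { virtual ~PipelineOp() = default;  virtual void Execute(DataChunk &chunk) = 0; };
    struct Sink       { virtual ~Sink() = default;        virtual void Consume(const DataChunk &chunk) = 0; };

    // One pipeline = source -> operators -> sink. Each operator (not a central
    // executor) can decide how much parallelism it supports for its own work.
    void RunPipeline(Source &src, std::vector<std::unique_ptr<PipelineOp>> &ops, Sink &sink) {
      DataChunk chunk;
      while (src.GetChunk(chunk)) {          // source pushes chunks downstream
        for (auto &op : ops) op->Execute(chunk);
        sink.Consume(chunk);
      }
    }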


    DUCKDB: VECTORS

    Custom internal vector layout for intermediate results that is compatible with Velox.
    Supports multiple vector types (e.g., flat, constant, dictionary, and sequence vectors).

    DUCKDB: VECTORS

    DuckDB uses a unified format to process all vector types without needing to decompress them first.
    → Reduces the number of specialized primitives per vector type (see the sketch below).
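
    A sketch of what a unified format can mean (illustrative, not DuckDB's exact layout): every vector type is exposed as a data pointer plus an optional selection/index view, so one primitive can process flat, constant, and dictionary vectors without materializing them first.

    #include <cstddef>
    #include <cstdint>

    // Unified view over any vector type: element i lives at data[sel ? sel[i] : i].
    struct UnifiedVector {
      const int64_t *data;    // underlying (possibly shared / dictionary) values
      const uint32_t *sel;    // optional indirection; nullptr means "flat" layout
      size_t count;
    };

    // A flat vector maps i -> i; a constant vector maps every i to the same slot
    // via its selection view; a dictionary vector maps i -> code[i]. All collapse
    // into the same access pattern:
    inline int64_t Get(const UnifiedVector &v, size_t i) {
      return v.data[v.sel ? v.sel[i] : i];
    }

    // One primitive works for every vector representation, no decompression needed.
    int64_t SumVector(const UnifiedVector &v) {
      int64_t total = 0;
      for (size_t i = 0; i < v.count; ++i) total += Get(v, i);
      return total;
    }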
