Carnegie Mellon University
Database Systems
Distributed OLAP Databases
ADMINISTRIVIA
Project #4 is due Sunday, April 20th @ 11:59pm
$\rightarrow$ Recitation: Friday, April 11th in GHC 4303 from 3-4pm
HW6 is due Sunday, April 20, 2025 @ 11:59pm
Final Exam is on Monday, April 28, 2025, from 5:30pm to 8:30pm.
$\rightarrow$ Early exam will not be offered. Do not make travel plans.
$\rightarrow$ Material: Lecture 12 - Lecture 24.
$\rightarrow$ You can use the full 3 hours, though the exam is meant to be done in ~2 hours.
This course is recruiting TAs for the next semester
ADMINISTRIVIA
My OH on Monday moved to 10-11am
Class on Monday, April 21: Review Session
$\rightarrow$ Come to class prepared with your questions. What material do you want me to go over again?
Class on Wednesday, April 23: Guest Lecture
$\rightarrow$ Real-world applications of Gen AI and Databases $\rightarrow$ Speaker: Sailesh Krishnamurthy, Google
UPCOMING DATABASE TALKS
Gel (DB Seminar) $\rightarrow$ Monday, April 21 @ 4
$\rightarrow$ EdgeQL with Gel $\rightarrow$ Speaker: Michael Sullivan $\rightarrow$ https://cmu.zoom.us/j/93441451665
BIFURCATED ENVIRONMENT
OLTP Databases
OLAP Database
DECISION SUPPORT SYSTEMS
Applications that serve the management, operations, and planning levels of an organization to help people make decisions about future issues and problems by analyzing historical data.
Star Schema vs. Snowflake Schema
STAR SCHEMA
SNOWFLAKE SCHEMA
STAR VS. SNOWFLAKE SCHEMA
Issue #1: Normalization
$\rightarrow$ Snowflake schemas take up less storage space. $\rightarrow$ Denormalized data models may incur integrity and consistency violations.
Issue #2: Query Complexity
$\rightarrow$ Snowflake schemas require more joins to get the data needed for a query. $\rightarrow$ Queries on star schemas will (usually) be faster.
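As a rough illustration (hypothetical tables: sales as the fact table, dim_store and dim_region as dimensions), the same question needs one extra join in the snowflake version:

-- Star schema: one join to a denormalized dimension.
SELECT d.region_name, SUM(s.amount)
  FROM sales AS s
  JOIN dim_store AS d ON s.store_id = d.store_id
 GROUP BY d.region_name;

-- Snowflake schema: the dimension is normalized into dim_store -> dim_region,
-- so the same aggregate needs an additional join.
SELECT r.region_name, SUM(s.amount)
  FROM sales AS s
  JOIN dim_store AS d ON s.store_id = d.store_id
  JOIN dim_region AS r ON d.region_id = r.region_id
 GROUP BY r.region_name;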
PROBLEM SETUP
Partitions
TODAY'S AGENDA
Execution Models
Query Planning
Distributed Join Algorithms
Cloud Systems
DISTRIBUTED QUERY EXECUTION
Executing an OLAP query in a distributed DBMS is roughly the same as on a single-node DBMS. $\rightarrow$ Query plan is a DAG of physical operators.
For each operator, the DBMS considers where input is coming from and where to send output. $\rightarrow$ Table Scans $\rightarrow$ Joins $\rightarrow$ Aggregations $\rightarrow$ Sorting
DISTRIBUTED QUERY EXECUTION
[ImageCaption: Worker Nodes]
DATA CATEGORIES
Persistent Data:
$\rightarrow$ The "source of record" for the database (e.g., tables). $\rightarrow$ Modern systems assume that these data files are immutable but can support updates by rewriting them.
Intermediate Data:
$\rightarrow$ Short-lived artifacts produced by query operators during execution and then consumed by other operators. $\rightarrow$ The amount of intermediate data that a query generates has little to no correlation to the amount of persistent data that it reads or to its execution time.
DISTRIBUTED SYSTEM ARCHITECTURE
A distributed DBMS's system architecture specifies the location of the database's data files. This affects how nodes coordinate with each other and where they retrieve/store objects in the database.
Two approaches (not mutually exclusive): $\rightarrow$ Push Query to Data $\rightarrow$ Pull Data to Query
PUSH VS. PULL
Approach #1: Push Query to Data
→ Send the query (or a portion of it) to the node that contains the data. → Perform as much filtering and processing as possible where the data resides before transmitting over the network.
Approach #2: Pull Data to Query
→ Bring the data to the node that is executing the query that needs it for processing. → This is necessary when there are no compute resources available where the database files are located.
Filtering and retrieving data using Amazon S3 Select
With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. By using Amazon S3 Select to filter this data, you can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency to retrieve this data. Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only), and server-side encrypted objects. You can specify the format of the results as either CSV or JSON, and you can determine how the records in the result are delimited. You pass SQL expressions to Amazon S3 in the request. Amazon S3 Select supports a subset of SQL. For more information about the SQL elements that are supported by Amazon S3 Select, see SQL reference for Amazon S3 Select. You can perform SQL queries using AWS SDKs, the SELECT Object Content REST API, the AWS Command Line Interface (AWS CLI), or the Amazon S3 console. The Amazon S3 console limits the amount of data returned to 40 MB.
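A minimal sketch of the kind of expression S3 Select accepts (column names are hypothetical and assume a CSV object with a header row); the SQL is passed to S3 in the request and evaluated against a single object:

-- Only the projected column for the matching rows leaves S3,
-- instead of the whole object.
SELECT s.price
  FROM S3Object s
 WHERE s.zip = '15213';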
Query Blob Contents (Microsoft Azure documentation)
Request
The Query Blob Contents request may be constructed as follows. HTTPS is recommended. Replace myaccount with the name of your storage account:
Table (html):
| POST Method Request URI | HTTP Version |
| https://myaccount.blob.core.windows.net/mycontainer/myblob?comp=query | HTTP/1.0 |
| https://myaccount.blob.core.windows.net/mycontainer/myblob?comp=query&snapshot=<datetime> | |
| https://myaccount.blob.core.windows.net/mycontainer/myblob?comp=query&versionid=<datetime> | |
PUSH QUERY TO DATA
PULL DATA TO QUERY
OBSERVATION
The data that a node receives from remote sources is cached in the buffer pool. $\longrightarrow$ This allows the DBMS to support intermediate results that are larger than the amount of memory available. $\longrightarrow$ Ephemeral pages are not persisted after a restart.
What happens to a long-running OLAP query if a node crashes during execution?
QUERY FAULT TOLERANCE
Most shared-nothing distributed OLAP DBMSs are designed to assume that nodes do not fail during query execution. $\rightarrow$ If one node fails during query execution, then the whole query fails.
The DBMS could take a snapshot of the intermediate results for a query during execution to allow it to recover if nodes fail.
QUERY FAULT TOLERANCE
SELECT * FROM R JOIN S ON R.id = S.id
[ImageCaption: Application Server and Worker Nodes]
QUERY PLANNING
All the optimizations that we talked about before are still applicable in a distributed environment. $\longrightarrow$ Predicate Pushdown $\longrightarrow$ Projection Pushdown $\longrightarrow$ Optimal Join Orderings
Distributed query optimization is even harder because it must consider the physical location of data and network transfer costs.
QUERY PLAN FRAGMENTS
Approach #1: Physical Operators
$\rightarrow$ Generate a single query plan and then break it up into partition-specific fragments. $\rightarrow$ Most systems implement this approach.
Approach #2: SQL
$\rightarrow$ Rewrite the original query into partition-specific queries. $\rightarrow$ Allows for local optimization at each node. $\rightarrow$ SingleStore + Vitess are the only systems we know that use this approach (see the rewrite sketch below).
SELECT * FROM R JOIN S ON R.id = S.id
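A sketch of what the SQL-rewrite approach could produce for the query above, assuming R and S are both hash-partitioned on id into three co-located partitions (hypothetical names R_p1..R_p3 and S_p1..S_p3):

-- Each node runs the fragment for its own partition number;
-- the coordinator concatenates the results.
SELECT * FROM R_p1 JOIN S_p1 ON R_p1.id = S_p1.id
UNION ALL
SELECT * FROM R_p2 JOIN S_p2 ON R_p2.id = S_p2.id
UNION ALL
SELECT * FROM R_p3 JOIN S_p3 ON R_p3.id = S_p3.id;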
OBSERVATION
The efficiency of a distributed join depends on the target tables' partitioning schemes.
One approach is to put entire tables on a single node and then perform the join. $\rightarrow$ You lose the parallelism of a distributed DBMS. $\rightarrow$ Costly data transfer over the network.
DISTRIBUTED JOIN ALGORITHMS
To join tables R and S, the DBMS needs to get the proper tuples on the same node.
Once the data is at the node, the DBMS then executes the same join algorithms that we discussed earlier in the semester. $\rightarrow$ Need to produce the correct answer as if all the data were located in a single-node system.
SCENARIO #1
A complete copy of one data set is replicated at every node. $\rightarrow$ Think of it as a small dimension table.
Each node joins its local data in parallel and then sends its results to a coordinating node.
SELECT * FROM R JOIN S ON R.id = S.id
SCENARIO #2
Both data sets are partitioned on the join attribute. Each node performs the join on local data and then sends its results to a coordinator node for coalescing.
SELECT * FROM R JOIN S ON R.id = S.id
SCENARIO #3 - BROADCAST JOIN
Both data sets are partitioned on different keys. If one of the data sets is small, then the DBMS "broadcasts" that data to all nodes.
SELECT * FROM R JOIN S ON R.id = S.id
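Some engines let you ask for this placement explicitly. A sketch using Spark SQL's broadcast join hint (assuming S is the small table):

-- Replicate S to every node holding a partition of R, then join locally.
SELECT /*+ BROADCAST(S) */ *
  FROM R JOIN S ON R.id = S.id;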
SCENARIO #4 - SHUFFLE JOIN
Neither data set is partitioned on the join key. The DBMS copies/re-partitions the data on-the-fly across nodes. $\longrightarrow$ The repartitioned data copy is generally deleted when the query is done.
SELECT * FROM R JOIN S ON R.id = S.id
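As a sketch, Spark SQL also exposes hints that force a shuffled join, which re-partitions both inputs on the join key before the per-node joins run:

-- Hash-partition R and S on id across the cluster, then hash-join each partition pair.
SELECT /*+ SHUFFLE_HASH(S) */ *
  FROM R JOIN S ON R.id = S.id;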
SEMI-JOIN OPTIMIZATION
Before pulling data from another node, send a semi-join filter to reduce data movement.
$\longrightarrow$ Perform the join on the bare minimum data needed to avoid unnecessary transfers. $\longrightarrow$ Could use an approximate filter (Bloom Join).
SELECT Fact.price, Dim.* FROM Fact JOIN Dim ON Fact.id = Dim.id WHERE Dim.zip = 15213
$\text{Dim}_{\text{semi}} = \Pi_{\text{id}}(\sigma_{\text{zip}=15213}(\text{Dim}))$
$\text{Fact}_{\text{small}} = \text{Fact} \bowtie \text{Dim}_{\text{semi}}$
$\text{Result} = \Pi_{\text{price}}(\text{Dim} \bowtie \text{Fact}_{\text{small}})$
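The same idea written by hand in SQL (the node placement in the comments is hypothetical):

-- On the node holding Dim: compute the semi-join filter (just the ids that can match).
--   SELECT id FROM Dim WHERE zip = 15213;
-- On the node holding Fact: apply the shipped filter so only matching rows move.
SELECT f.price, f.id
  FROM Fact AS f
 WHERE f.id IN (SELECT id FROM Dim WHERE zip = 15213);
-- Back on Dim's node: join the reduced Fact rows with Dim to produce Fact.price, Dim.*.

A Bloom Join replaces the exact id list with a compact, approximate filter; false positives are removed by the final join.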
OBSERVATION
Direct communication between compute nodes means the DBMS knows which nodes will participate in query execution ahead of time. But data skew can cause imbalances...
A better approach is to dynamically adjust compute resources on the fly as a query executes.
SHUFFLE PHASE
Redistribution of intermediate data across nodes between query plan pipelines.
$\longrightarrow$ Can repartition / rebalance data based on observed data characteristics.
Some DBMSs support standalone fault-tolerant shuffle services.
$\longrightarrow$ Example: You can replace Spark's built-in in-memory shuffle implementation with a separate, standalone service.
Google BigQuery
[ImageCaption: Shuffle Nodes and Shared-Disk]
EXCHANGE OPERATOR
Exchange Type #1 - Gather $\rightarrow$ Combine the results from multiple workers into a single output stream.
Exchange Type #2 - Distribute $\rightarrow$ Split a single input stream into multiple output streams.
Exchange Type #3 - Repartition $\rightarrow$ Shuffle multiple input streams across multiple output streams. $\rightarrow$ Some DBMSs always perform this step after every pipeline (e.g., Dremel/BigQuery).
CLOUD SYSTEMS
Vendors provide database-as-a-service (DBaaS) offerings that are managed DBMS environments.
Newer systems are starting to blur the lines between shared-nothing and shared-disk. $\rightarrow$ Example: You can do simple filtering on Amazon S3 before copying data to compute nodes.
CLOUD SYSTEMS
Approach #1: Managed DBMSs
$\rightarrow$ No significant modification to the DBMS to be "aware" that it is running in a cloud environment. $\rightarrow$ Examples: Most vendors
Approach #2: Cloud-Native DBMS
$\rightarrow$ System designed explicitly to run in a cloud environment. $\rightarrow$ Usually based on a shared-disk architecture. $\rightarrow$ Examples: Snowflake, Google BigQuery
SERVERLESS DATABASES
Rather than always maintaining compute resources for each customer, a "serverless" DBMS evicts tenants when they become idle.
DATA LAKES
Repository for storing large amounts of structured, semi-structured, and unstructured data without having to define a schema or ingest the data into proprietary internal formats.
CREATE TABLE foo (...);
INSERT INTO foo VALUES (...);
SELECT * FROM foo
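A DuckDB-style sketch of the schema-on-read alternative (the bucket path is hypothetical): query the raw Parquet files in object storage directly, with no CREATE TABLE or INSERT step, letting the engine infer the schema from the file footers at query time.

INSTALL httpfs;  -- extension for reading s3:// paths
LOAD httpfs;
SELECT user_id, COUNT(*)
  FROM read_parquet('s3://my-datalake/events/*.parquet')
 GROUP BY user_id;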
OLAP DBMS COMPONENTS
A major trend over the last decade is the breakout of OLAP DBMS components into standalone services and libraries:
$\longrightarrow$ System Catalogs $\longrightarrow$ Intermediate Representation $\longrightarrow$ Query Optimizers $\longrightarrow$ File Format / Access Libraries $\longrightarrow$ Execution Engines / Fabrics
Lots of engineering challenges to make these components interoperable + performant.
SYSTEM CATALOGS
A DBMS tracks a database's schema (tables, columns) and data files in its catalog.
$\rightarrow$ If the DBMS is on the data ingestion path, then it can maintain the catalog incrementally. $\rightarrow$ If an external process adds data files, then it also needs to update the catalog so that the DBMS is aware of them.
Notable implementations:
$\rightarrow$ HCatalog $\rightarrow$ Google Data Catalog $\rightarrow$ Amazon Glue Data Catalog $\rightarrow$ Databricks Unity $\rightarrow$ Apache Iceberg
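A Hive/Spark-SQL-style sketch (table name and location are hypothetical) of registering existing data files with the catalog without ingesting them, then re-syncing the catalog after an external process drops new partition files into the same location:

CREATE EXTERNAL TABLE sales (
  id    BIGINT,
  price DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)
STORED AS PARQUET
LOCATION 's3://my-datalake/sales/';

-- Re-scan LOCATION and register any newly discovered partitions in the catalog.
MSCK REPAIR TABLE sales;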
In case you missed it, earlier this month Databricks acquired Tabular, the company behind the open source project Iceberg, for over $1 billion. The acquisition, which was announced during Snowflake's 2024 Summit conference and amid rumors of Snowflake's interest in purchasing Tabular, caught many by surprise, especially since Databricks already offers a competing product, Delta Lake. So, what is Iceberg, how does it compare to Delta Lake, and what does the project's future look like post-acquisition?
DATA FILE FORMATS
Most DBMSs use a proprietary on-disk binary file format for their databases. $\rightarrow$ Think of the BusTub page types...
The only way to share data between systems is to convert data into a common text-based format. $\rightarrow$ Examples: CSV, JSON, XML
There are new open-source binary file formats that make it easier to access data across systems.
DATA FILE FORMATS
Apache Parquet
$\longrightarrow$ Compressed columnar storage from Cloudera/Twitter.
Apache ORC
$\longrightarrow$ Compressed columnar storage from Apache Hive.
Apache CarbonData
$\longrightarrow$ Compressed columnar storage with indexes from Huawei.
Apache Iceberg
$\longrightarrow$ Flexible data format that supports schema evolution from Netflix.
HDF5
$\longrightarrow$ Multi-dimensional arrays for scientific workloads.
Apache Arrow
$\longrightarrow$ In-memory compressed columnar storage from Pandas/Dremio.
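A DuckDB-style sketch (file names are hypothetical) of using these open formats as the interchange layer: convert a CSV file into Parquet and read it back, with no proprietary on-disk format in between.

-- Infer the CSV schema and write it out as compressed columnar Parquet.
COPY (SELECT * FROM read_csv_auto('events.csv'))
  TO 'events.parquet' (FORMAT PARQUET);

-- Any engine with a Parquet reader can now consume the same file.
SELECT COUNT(*) FROM read_parquet('events.parquet');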
An Empirical Evaluation of Columnar Storage Formats
ABSTRACT
Columnar storage is a core component of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support for open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed. In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions advantageous with modern hardware and real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. We also point out the inefficiencies in the format designs when handling common machine learning workloads and using GPUs for decoding. Our analysis identified important considerations that may guide future formats to better fit modern technology trends.
EXECUTION ENGINES
Standalone libraries for executing vectorized query operators on columnar data. $\longrightarrow$ Input is a DAG of physical operators. $\longrightarrow$ Require external scheduling and orchestration.
Notable implementations:
$\longrightarrow$ Velox $\longrightarrow$ DataFusion $\longrightarrow$ Intel OAP
CONCLUSION
The cloud has made the distributed OLAP DBMS market flourish. Lots of vendors. Lots of money.
But more money, more data, more problems...
NEXT CLASS
Final Review