Carnegie Mellon University
Database Systems
Distributed OLAP Databases
ADMINISTRIVIA
Project #4 is due Sunday, April 20th @ 11:59pm
$\rightarrow$ Recitation: Friday, April 11th in GHC 4303 from 3-4pm
HW6 is due Sunday, April 20, 2025 @ 11:59pm
Final Exam is on Monday, April 28, 2025, from 5:30pm to 8:30pm.
$\rightarrow$ Early exam will not be offered. Do not make travel plans.
$\rightarrow$ Material: Lecture 12 - Lecture 24.
$\rightarrow$ You can use the full 3 hours, though the exam is meant to be done in ~2 hours.
This course is recruiting TAs for the next semester
ADMINISTRIVIA
My OH on Monday moved to 10-11am
Class on Monday, April 21: Review Session
$\rightarrow$ Come to class prepared with your questions. What material do you want me to go over again?
Class on Wednesday, April 23: Guest Lecture
$\rightarrow$ Real-world applications of Gen AI and Databases $\rightarrow$ Speaker: Sailesh Krishnamurthy, Google
UPCOMING DATABASE TALKS
Gel (DB Seminar) $\rightarrow$ Monday, April 21 @ 4
$\rightarrow$ EdgeQL with Gel $\rightarrow$ Speaker: Michael Sullivan $\rightarrow$ https://cmu.zoom.us/j/93441451665
BIFURCATED ENVIRONMENT
OLTP Databases
OLAP Database
DECISION SUPPORT SYSTEMS
Applications that serve the management, operations, and planning levels of an organization to help people make decisions about future issues and problems by analyzing historical data.
Star Schema vs. Snowflake Schema
STAR SCHEMA
SNOWFLAKE SCHEMA
STAR VS. SNOWFLAKE SCHEMA
Issue #1: Normalization
$\rightarrow$ Snowflake schemas take up less storage space. $\rightarrow$ Denormalized data models may incur integrity and consistency violations.
Issue #2: Query Complexity
$\rightarrow$ Snowflake schemas require more joins to get the data needed for a query. $\rightarrow$ Queries on star schemas will (usually) be faster.
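As a rough illustration (hypothetical tables: sales as the fact table, dim_store and dim_region as dimensions), the same question needs one extra join in the snowflake version:

-- Star schema: one join to a denormalized dimension.
SELECT d.region_name, SUM(s.amount)
  FROM sales AS s
  JOIN dim_store AS d ON s.store_id = d.store_id
 GROUP BY d.region_name;

-- Snowflake schema: the dimension is normalized into dim_store -> dim_region,
-- so the same aggregate needs an additional join.
SELECT r.region_name, SUM(s.amount)
  FROM sales AS s
  JOIN dim_store AS d ON s.store_id = d.store_id
  JOIN dim_region AS r ON d.region_id = r.region_id
 GROUP BY r.region_name;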
PROBLEM SETUP
Partitions
TODAY'S AGENDA
Execution Models
Query Planning
Distributed Join Algorithms
Cloud Systems
DISTRIBUTED QUERY EXECUTION
Executing an OLAP query in a distributed DBMS is roughly the same as on a single-node DBMS. $\rightarrow$ Query plan is a DAG of physical operators.
For each operator, the DBMS considers where input is coming from and where to send output. $\rightarrow$ Table Scans $\rightarrow$ Joins $\rightarrow$ Aggregations $\rightarrow$ Sorting
DISTRIBUTED QUERY EXECUTION
[ImageCaption: Worker Nodes]
DATA CATEGORIES
Persistent Data:
$\rightarrow$ The "source of record" for the database (e.g., tables). $\rightarrow$ Modern systems assume that these data files are immutable but can support updates by rewriting them.
Intermediate Data:
$\rightarrow$ Short-lived artifacts produced by query operators during execution and then consumed by other operators. $\rightarrow$ The amount of intermediate data that a query generates has little to no correlation to the amount of persistent data that it reads or to its execution time.
DISTRIBUTED SYSTEM ARCHITECTURE
A distributed DBMS's system architecture specifies the location of the database's data files. This affects how nodes coordinate with each other and where they retrieve/store objects in the database.
Two approaches (not mutually exclusive): $\rightarrow$ Push Query to Data $\rightarrow$ Pull Data to Query
PUSH VS. PULL
Approach #1: Push Query to Data
→ Send the query (or a portion of it) to the node that contains the data. → Perform as much filtering and processing as possible where the data resides before transmitting over the network.
Approach #2: Pull Data to Query
→ Bring the data to the node that is executing the query that needs it for processing. → This is necessary when there are no compute resources available where the database files are located.
Filtering and retrieving data using Amazon S3 Select
With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. By using Amazon S3 Select to filter this data, you can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency to retrieve this data. Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only), and server-side encrypted objects. You can specify the format of the results as either CSV or JSON, and you can determine how the records in the result are delimited. You pass SQL expressions to Amazon S3 in the request. Amazon S3 Select supports a subset of SQL. For more information about the SQL elements that are supported by Amazon S3 Select, see SQL reference for Amazon S3 Select. You can perform SQL queries using AWS SDKs, the SELECT Object Content REST API, the AWS Command Line Interface (AWS CLI), or the Amazon S3 console. The Amazon S3 console limits the amount of data returned to 40 MB.
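A minimal sketch of the kind of expression S3 Select accepts (column names are hypothetical and assume a CSV object with a header row); the SQL is passed to S3 in the request and evaluated against a single object:

-- Only the projected column for the matching rows leaves S3,
-- instead of the whole object.
SELECT s.price
  FROM S3Object s
 WHERE s.zip = '15213';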
Query Blob Contents (Microsoft Azure documentation)
Request
The Query Blob Contents request may be constructed as follows. HTTPS is recommended. Replace myaccount with the name of your storage account:
Table (html):
| POST Method Request URI | HTTP Version |
| https://myaccount.blob.core.windows.net/mycontainer/myblob?comp=query | HTTP/1.0 |
| https://myaccount.blob.core.windows.net/mycontainer/myblob?comp=query&snapshot=<datetime> | |
| https://myaccount.blob.core.windows.net/mycontainer/myblob?comp=query&versionid=<datetime> | |
PUSH QUERY TO DATA
PULL DATA TO QUERY
OBSERVATION
The data that a node receives from remote sources is cached in the buffer pool. $\longrightarrow$ This allows the DBMS to support intermediate results that are larger than the amount of memory available. $\longrightarrow$ Ephemeral pages are not persisted after a restart.
What happens to a long-running OLAP query if a node crashes during execution?
QUERY FAULT TOLERANCE
Most shared-nothing distributed OLAP DBMSs are designed to assume that nodes do not fail during query execution. $\rightarrow$ If one node fails during query execution, then the whole query fails.
The DBMS could take a snapshot of the intermediate results for a query during execution to allow it to recover if nodes fail.
QUERY FAULT TOLERANCE
SELECT * FROM R JOIN S ON R.id = S.id
[ImageCaption: Application Server and Worker Nodes]
QUERY PLANNING
All the optimizations that we talked about before are still applicable in a distributed environment. $\longrightarrow$ Predicate Pushdown $\longrightarrow$ Projection Pushdown $\longrightarrow$ Optimal Join Orderings
Distributed query optimization is even harder because it must consider the physical location of data and network transfer costs.
QUERY PLAN FRAGMENTS
Approach #1: Physical Operators
$\rightarrow$ Generate a single query plan and then break it up into partition-specific fragments. $\rightarrow$ Most systems implement this approach.
Approach #2: SQL
$\rightarrow$ Rewrite the original query into partition-specific queries. $\rightarrow$ Allows for local optimization at each node. $\rightarrow$ SingleStore + Vitess are the only systems we know that use this approach (see the rewrite sketch below).
SELECT * FROM R JOIN S ON R.id = S.id
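A sketch of what the SQL-rewrite approach could produce for the query above, assuming R and S are both hash-partitioned on id into three co-located partitions (hypothetical names R_p1..R_p3 and S_p1..S_p3):

-- Each node runs the fragment for its own partition number;
-- the coordinator concatenates the results.
SELECT * FROM R_p1 JOIN S_p1 ON R_p1.id = S_p1.id
UNION ALL
SELECT * FROM R_p2 JOIN S_p2 ON R_p2.id = S_p2.id
UNION ALL
SELECT * FROM R_p3 JOIN S_p3 ON R_p3.id = S_p3.id;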
OBSERVATION
The efficiency of a distributed join depends on the target tables' partitioning schemes.
One approach is to put entire tables on a single node and then perform the join. $\rightarrow$ You lose the parallelism of a distributed DBMS. $\rightarrow$ Costly data transfer over the network.
DISTRIBUTED JOIN ALGORITHMS
To join tables R and S, the DBMS needs to get the proper tuples on the same node.
Once the data is at the node, the DBMS then executes the same join algorithms that we discussed earlier in the semester. $\rightarrow$ Need to produce the correct answer as if all the data were located in a single-node system.
SCENARIO #1
A complete copy of one data set is replicated at every node. $\rightarrow$ Think of it as a small dimension table.
Each node joins its local data in parallel and then sends its results to a coordinating node.
SELECT * FROM R JOIN S ON R.id = S.id
SCENARIO #2
Both data sets are partitioned on the join attribute. Each node performs the join on local data and then sends its results to a coordinator node for coalescing.
SELECT * FROM R JOIN S ON R.id = S.id
SCENARIO #3 - BROADCAST JOIN
Both data sets are partitioned on different keys. If one of the data sets is small, then the DBMS "broadcasts" that data to all nodes.
SELECT * FROM R JOIN S ON R.id = S.id
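Some engines let you ask for this placement explicitly. A sketch using Spark SQL's broadcast join hint (assuming S is the small table):

-- Replicate S to every node holding a partition of R, then join locally.
SELECT /*+ BROADCAST(S) */ *
  FROM R JOIN S ON R.id = S.id;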
SCENARIO #4 - SHUFFLE JOIN
Neither data set is partitioned on the join key. The DBMS copies/re-partitions the data on-the-fly across nodes. $\longrightarrow$ The repartitioned data copy is generally deleted when the query is done.
SELECT * FROM R JOIN S ON R.id = S.id
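As a sketch, Spark SQL also exposes hints that force a shuffled join, which re-partitions both inputs on the join key before the per-node joins run:

-- Hash-partition R and S on id across the cluster, then hash-join each partition pair.
SELECT /*+ SHUFFLE_HASH(S) */ *
  FROM R JOIN S ON R.id = S.id;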
SEMI-JOIN OPTIMIZATION
Before pulling data from another node, send a semi-join filter to reduce data movement.
$\longrightarrow$ Perform the join on the bare minimum data needed to avoid unnecessary transfers. $\longrightarrow$ Could use an approximate filter (Bloom Join).
SELECT Fact.price, Dim.* FROM Fact JOIN Dim ON Fact.id = Dim.id WHERE Dim.zip = 15213
$\text{Dim}_{\text{semi}} = \Pi_{\text{id}}(\sigma_{\text{zip}=15213}(\text{Dim}))$
$\text{Fact}_{\text{small}} = \text{Fact} \bowtie \text{Dim}_{\text{semi}}$
$\text{Result} = \Pi_{\text{price}}(\text{Dim} \bowtie \text{Fact}_{\text{small}})$
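The same idea written by hand in SQL (the node placement in the comments is hypothetical):

-- On the node holding Dim: compute the semi-join filter (just the ids that can match).
--   SELECT id FROM Dim WHERE zip = 15213;
-- On the node holding Fact: apply the shipped filter so only matching rows move.
SELECT f.price, f.id
  FROM Fact AS f
 WHERE f.id IN (SELECT id FROM Dim WHERE zip = 15213);
-- Back on Dim's node: join the reduced Fact rows with Dim to produce Fact.price, Dim.*.

A Bloom Join replaces the exact id list with a compact, approximate filter; false positives are removed by the final join.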
OBSERVATION
Direct communication between compute nodes means the DBMS knows which nodes will participate in query execution ahead of time. But data skew can cause imbalances...
A better approach is to dynamically adjust compute resources on the fly as a query executes.
SHUFFLE PHASE
Redistribution of intermediate data across nodes between query plan pipelines.
$\longrightarrow$ Can repartition / rebalance data based on observed data characteristics.
Some DBMSs support standalone fault-tolerant shuffle services.
$\longrightarrow$ Example: You can replace Spark's built-in in-memory shuffle implementation with a separate, standalone service.
Google BigQuery
[ImageCaption: Shuffle Nodes and Shared-Disk]
EXCHANGE OPERATOR
Exchange Type #1 - Gather $\rightarrow$ Combine the results from multiple workers into a single output stream.
Exchange Type #2 - Distribute $\rightarrow$ Split a single input stream into multiple output streams.
Exchange Type #3 - Repartition $\rightarrow$ Shuffle multiple input streams across multiple output streams. $\rightarrow$ Some DBMSs always perform this step after every pipeline (e.g., Dremel/BigQuery).
CLOUD SYSTEMS
Vendors provide database-as-a-service (DBaaS) offerings that are managed DBMS environments.
Newer systems are starting to blur the lines between shared-nothing and shared-disk. $\rightarrow$ Example: You can do simple filtering on Amazon S3 before copying data to compute nodes.
CLOUD SYSTEMS
Approach #1: Managed DBMSs
$\rightarrow$ No significant modification to the DBMS to be "aware" that it is running in a cloud environment. $\rightarrow$ Examples: Most vendors
Approach #2: Cloud-Native DBMS
$\rightarrow$ System designed explicitly to run in a cloud environment. $\rightarrow$ Usually based on a shared-disk architecture. $\rightarrow$ Examples: Snowflake, Google BigQuery
SERVERLESS DATABASES
Rather than always maintaining compute resources for each customer, a "serverless" DBMS evicts tenants when they become idle.
DATA LAKES
Repository for storing large amounts of structured, semi-structured, and unstructured data without having to define a schema or ingest the data into proprietary internal formats.
CREATE TABLE foo (...);
INSERT INTO foo VALUES (...);
SELECT * FROM foo
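A DuckDB-style sketch of the schema-on-read alternative (the bucket path is hypothetical): query the raw Parquet files in object storage directly, with no CREATE TABLE or INSERT step, letting the engine infer the schema from the file footers at query time.

INSTALL httpfs;  -- extension for reading s3:// paths
LOAD httpfs;
SELECT user_id, COUNT(*)
  FROM read_parquet('s3://my-datalake/events/*.parquet')
 GROUP BY user_id;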
OLAP DBMS COMPONENTS
A major trend over the last decade is the breakout of OLAP DBMS components into standalone services and libraries:
$\longrightarrow$ System Catalogs $\longrightarrow$ Intermediate Representation $\longrightarrow$ Query Optimizers $\longrightarrow$ File Format / Access Libraries $\longrightarrow$ Execution Engines / Fabrics
Lots of engineering challenges to make these components interoperable + performant.
SYSTEM CATALOGS
A DBMS tracks a database's schema (tables, columns) and data files in its catalog.
$\rightarrow$ If the DBMS is on the data ingestion path, then it can maintain the catalog incrementally. $\rightarrow$ If an external process adds data files, then it also needs to update the catalog so that the DBMS is aware of them.
Notable implementations:
$\rightarrow$ HCatalog $\rightarrow$ Google Data Catalog $\rightarrow$ Amazon Glue Data Catalog $\rightarrow$ Databricks Unity $\rightarrow$ Apache Iceberg
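A Hive/Spark-SQL-style sketch (table name and location are hypothetical) of registering existing data files with the catalog without ingesting them, then re-syncing the catalog after an external process drops new partition files into the same location:

CREATE EXTERNAL TABLE sales (
  id    BIGINT,
  price DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)
STORED AS PARQUET
LOCATION 's3://my-datalake/sales/';

-- Re-scan LOCATION and register any newly discovered partitions in the catalog.
MSCK REPAIR TABLE sales;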
In case you missed it, earlier this month Databricks acquired Tabular, the company behind the open source project Iceberg, for over $1 billion. The acquisition, which was announced during Snowflake's 2024 Summit conference and amid rumors of Snowflake's interest in purchasing Tabular, caught many by surprise, especially since Databricks already offers a competing product, Delta Lake. So, what is Iceberg, how does it compare to Delta Lake, and what does the project's future look like post-acquisition?
DATA FILE FORMATS
Most DBMSs use a proprietary on-disk binary file format for their databases. $\rightarrow$ Think of the BusTub page types...
The only way to share data between systems is to convert data into a common text-based format. $\rightarrow$ Examples: CSV, JSON, XML
There are new open-source binary file formats that make it easier to access data across systems.
DATA FILE FORMATS
Apache Parquet
$\longrightarrow$ Compressed columnar storage from Cloudera/Twitter.
Apache ORC
$\longrightarrow$ Compressed columnar storage from Apache Hive.
Apache CarbonData
$\longrightarrow$ Compressed columnar storage with indexes from Huawei.
Apache Iceberg
$\longrightarrow$ Flexible data format that supports schema evolution from Netflix.
HDF5
$\longrightarrow$ Multi-dimensional arrays for scientific workloads.
Apache Arrow
$\longrightarrow$ In-memory compressed columnar storage from Pandas/Dremio.
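A DuckDB-style sketch (file names are hypothetical) of using these open formats as the interchange layer: convert a CSV file into Parquet and read it back, with no proprietary on-disk format in between.

-- Infer the CSV schema and write it out as compressed columnar Parquet.
COPY (SELECT * FROM read_csv_auto('events.csv'))
  TO 'events.parquet' (FORMAT PARQUET);

-- Any engine with a Parquet reader can now consume the same file.
SELECT COUNT(*) FROM read_parquet('events.parquet');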
An Empirical Evaluation of Columnar Storage Formats
ABSTRACT
Columnar storage is a core component of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support for open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed. In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions advantageous with modern hardware and real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. We also point out the inefficiencies in the format designs when handling common machine learning workloads and using GPUs for decoding. Our analysis identified important considerations that may guide future formats to better fit modern technology trends.
EXECUTION ENGINES
Standalone libraries for executing vectorized query operators on columnar data. $\longrightarrow$ Input is a DAG of physical operators. $\longrightarrow$ Require external scheduling and orchestration.
Notable implementations:
$\longrightarrow$ Velox $\longrightarrow$ DataFusion $\longrightarrow$ Intel OAP
CONCLUSION
The cloud has made the distributed OLAP DBMS market flourish. Lots of vendors. Lots of money.
But more money, more data, more problems...
NEXT CLASS
Final Review