GithubDocuments
    1. Marketplace
    2. CMU Database - 15445 - 2025 Spring
    Collections
    CMU Database - 15445 - 2025 Spring
    cmu_database
    This collection of documents, "CMU Database - 15445 - 2025 Spring," provides a comprehensive overview of database systems, primarily focusing on the design, implementation, and ...
    DocumentsKnowledge Graph
    lecture-15-slides.pdf
    Carnegie Mellon University

    Database Systems

    Query Planning & Optimization

    ADMINISTRIVIA

    Project #3 is due Sunday March 30th @ 11
    $\rightarrow$ Recitation: slides, recording
    Mid- term exam:
    $\rightarrow$ Exam viewing for the next 3 OH, including today. $\rightarrow$ The last OH for exam viewing is on March 24.

    UPCOMING DATABASE TALKS

    MALLOY Mar 17 Lloyd Tabb Malloy: A Modern Open Source Language for Analyzing, Transforming, and Modeling Data
    Table (html):
    PRQLMar 24Tobias BrandtPRQL: Pipelined Relational Query Language
    StarRocksMar 31Kaisen KangStarRocks Query Optimizer
    OxideApr 7Ben NaeckerOxQL: Oximeter Query Language
    MariaDBApr 14Michael WideniusMariaDB's New Query Optimizer
    Apr 21Michael SullivanEdgeQL with Gel

    LAST CLASS

    We talked about how to design the DBMS's architecture to execute queries in parallel.
    The query plan is comprised of physical operators that specify the algorithm to invoke at each step of the plan.
    But how do we go from SQL to a query plan?

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    CMU- DB 15- 445 (645 (Spring 2025)

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    CMU- DB 15- 445(645 (Spring 2025)
    (50 + 50,000) reads + 1,000,000 writes Write temp file T1 5 tuples per page in T1

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    [ImageFootnote: CMU-DB 15-445(645 (Spring 2025)]
    1,000,000 reads + 2,000 writes (FK join, 10k tuples in temp T2)
    (50 + 50,000) reads + 1,000,000 writes Write temp file T1 5 tuples per page in T1

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    [ImageFootnote: CMU-DB 15-445(645 (Spring 2025)]
    2,000 reads + 4 writes (10K/500 = 20 emps per dept)
    1,000,000 reads + 2,000 writes (FK join, 10k tuples in temp T2)
    (50 + 50,000) reads + 1,000,000 writes Write temp file T1 5 tuples per page in T1

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    [ImageFootnote: CMU-DB 15-445(645 (Spring 2025)]

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    [ImageFootnote: CMU-DB 15-445/645 (Spring 2025)]

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    CMU- DB 15- 445 (645 (Spring 2025)

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    (50 + 50,000) reads + 2,000 writes Page Nested- Loop Join Write Temp T1 Emp Dept

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    CMU- DB 15- 445(645 (Spring 2025)
    2,000 reads + 4 writes Read temp T1, Write temp T2 $(50 + 50,000)$ reads + 2,000 writes Page Nested- Loop Join Write Temp T1 Emp

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    [ImageFootnote: CMU-DB 15-445/645 (Spring 2025)]

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    [ImageFootnote: CMU-DB 15-445/645 (Spring 2025)]

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    Materialization Model Total: 7,159 l/Os

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    Total: 7,159 l/Os

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    Total: 7,159 l/Os

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    CMU- DB 15- 445 (645 (Spring 2025)

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    CMU- DB 15- 445 (645 (Spring 2025)

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    CMU- DB 15- 445 (645 (Spring 2025)

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    CMU- DB 15- 445 (645 (Spring 2025)

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    CMU- DB 15- 445(645 (Spring 2025)

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    CMU- DB 15- 445 (645 (Spring 2025)

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    3 reads + 1 writes Access: Index(dname) Dept

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    1+3 (idx) + 20 (ptr chase) reads + 4 writes Emp.did = Dept.did Index Nested- Loop Join Emp.dname = 'Toy' 3 reads + 1 writes Access: Index(dname) Dept

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    MOTIVATION

    SELECT DISTINCT ename FROM Emp E JOIN Dept D ON E.did = D.did WHERE D.dname = 'Toy'

    Catalog

    TODAY'S AGENDA

    BackgroundHeuristic / Ruled- based OptimizationCost- based OptimizationCost Model Estimation

    ARCHITECTURE OVERVIEW

    LOGICAL VS. PHYSICAL PLANS

    The optimizer generates a mapping of a logical algebra expression to the optimal equivalent physical algebra expression.
    Physical operators define a specific execution strategy using an access path. $\rightarrow$ They can depend on the physical format of the data that they process (i.e., sorting, compression). $\rightarrow$ Not always a 1
    mapping from logical to physical.

    Annotated RA Tree a.k.a. The Physical Plan

    QUERY OPTIMIZATION (QO)

    • Identify candidate equivalent trees (logical). It is an NP-hard problem, so the space is large.
    • For each candidate, find the execution plan (physical). Estimate the cost of each plan.
    • Choose the best (physical) plan.

    QUERY OPTIMIZATION (QO)

    • Identify candidate equivalent trees (logical). It is an NP-hard problem, so the space is large.
    • For each candidate, find the execution plan (physical). Estimate the cost of each plan.
    • Choose the best (physical) plan.

    QUERY OPTIMIZATION (QO)

    • Identify candidate equivalent trees (logical). It is an NP-hard problem, so the space is large.
    • For each candidate, find the execution plan (physical). Estimate the cost of each plan.
    • Choose the best (physical) plan.

    QUERY OPTIMIZATION (QO)

    • Identify candidate equivalent trees (logical). It is an NP-hard problem, so the space is large.
    • For each candidate, find the execution plan (physical). Estimate the cost of each plan.
    • Choose the best (physical) plan.
    Practically: Choose from a subset of all possible plans.

    QUERY OPTIMIZATION

    Heuristics / Rules

    $\rightarrow$ Rewrite the query to remove (guessed) inefficiencies. $\rightarrow$ Examples: always do selections first or push down projections as early as possible. $\rightarrow$ These techniques may need to examine catalog, but they do not need to examine data.

    Cost-based Search

    $\rightarrow$ Use a model to estimate the cost of executing a plan. $\rightarrow$ Enumerate multiple equivalent plans for a query and pick the one with the lowest cost.

    QUERY OPTIMIZATION

    Heuristics / Rules

    $\rightarrow$ Rewrite the query to remove (guessed) inefficiencies. $\rightarrow$ Examples: always do selections first or push down projections as early as possible. $\rightarrow$ These techniques may need to examine catalog, but they do not need to examine data.

    Cost-based Search

    $\rightarrow$ Use a model to estimate the cost of executing a plan. $\rightarrow$ Enumerate multiple equivalent plans for a query and pick the one with the lowest cost.

    LOGICAL PLAN OPTIMIZATION

    Transform a logical plan into an equivalent logical plan using pattern matching rules.
    The goal is to increase the likelihood of enumerating the optimal plan in the search. $\rightarrow$ Many equivalence rules for relational algebra!
    Cannot compare plans because there is no cost model but can "direct" a transformation to a preferred side.

    PREDICATE PUSHDOWN

    $\pi_{\mathrm{name}}\left(\sigma_{\mathrm{dname} = '\mathrm{Toy}}(\mathrm{Dept}\ltimes \mathrm{Emp})\right)$

    PREDICATE PUSHDOWN

    $\pi_{\mathrm{name}}\left(\sigma_{\mathrm{dname} = '\mathrm{Toy}}\left(\mathrm{Dept}\ltimes \mathrm{Emp}\right)\right)$

    PREDICATE PUSHDOWN

    $\pi_{\mathrm{name}}\left(\sigma_{\mathrm{dname} = '\mathrm{Toy}}\left(\mathrm{Dept}\ltimes \mathrm{Emp}\right)\right)$

    PREDICATE PUSHDOWN

    $\pi_{\mathrm{ename}}\left(\sigma_{\mathrm{dname} = '\mathrm{Toy}}(\mathrm{Dept}\bowtie \mathrm{Emp})\right)$

    PREDICATE PUSHDOWN

    REPLACE CARTESIAN PRODUCT

    $$ \ldots \left(\sigma_{\mathrm{Dept.did} = \mathrm{Emp.did}}(\mathrm{Dept}\times \mathrm{Emp})\right) $$

    REPLACE CARTESIAN PRODUCT

    O Dept.did = Emp.did (Dept x Emp)

    REPLACE CARTESIAN PRODUCT

    PROJECTION PUSHDOWN

    Emp. did = Dept. did

    PROJECTION PUSHDOWN

    Emp.ename

    PROJECTION PUSHDOWN

    Emp.ename
    Emp.ename

    QUERY OPTIMIZATION

    Heuristics / Rules

    $\rightarrow$ Rewrite the query to remove (guessed) inefficiencies. $\rightarrow$ Examples: always do selections first or push down projections as early as possible. $\rightarrow$ These techniques may need to examine catalog, but they do not need to examine data.

    Cost-based Search

    $\rightarrow$ Use a model to estimate the cost of executing a plan. $\rightarrow$ Enumerate multiple equivalent plans for a query and pick the one with the lowest cost.

    COST-BASED QUERY OPTIMIZATION

    We will start with cost- based, bottom- up QO → a.k.a. the "classic" IBM System R optimizer
    Approach: Enumerate different plans for the query and estimate their costs.
    → Single relation. → Multiple relations. → Nested sub- queries.
    It chooses the best plan it has seen for the query after exhausting all plans or some timeout.

    SINGLE-RELATION QUERY PLANNING

    Pick the best access method.
    $\longrightarrow$ Sequential Scan $\longrightarrow$ Binary Search (clustered indexes) $\longrightarrow$ Index Scan
    Predicate evaluation ordering.
    Simple heuristics are often good enough for this.

    MULTI-RELATION QUERY PLANNING

    Approach #1: Generative / Bottom-Up

    $\rightarrow$ Start with nothing and then iteratively assemble and add building blocks to generate a query plan. $\rightarrow$ Examples: System R, Starburst

    Approach #2: Transformation / Top-Down

    $\rightarrow$ Start with the outcome that the query wants, and then transform it to equivalent alternative sub- plans to find the optimal plan that gets to that goal. $\rightarrow$ Examples: Volcano, Cascades

    BOTTOM-UP OPTIMIZATION

    Use static rules to perform initial optimization. Then use dynamic programming to determine the best join order for tables using a divide- and- conquer search method
    Examples: IBM System R, DB2, MySQL, Postgres, most open- source DBMSs.

    SYSTEM R OPTIMIZER

    Break query into blocks and generate logical operators for each block.
    For each logical operator, generate a set of physical operators that implement it.
    $\rightarrow$ All combinations of join algorithms and access paths
    Then, iteratively construct a "left- deep" join tree that minimizes the estimated amount of work to execute the plan.

    Left-Deep Tree

    Bushy Tree

    SYSTEM R OPTIMIZER

    Break query into blocks and generate logical operators for each block.
    For each logical operator, generate a set of physical operators that implement it.
    $\rightarrow$ All combinations of join algorithms and access paths
    Then, iteratively construct a "left- deep" join tree that minimizes the estimated amount of work to execute the plan.

    Left-Deep Tree

    Bushy Tree

    SYSTEM R OPTIMIZER

    SELECT ARTIST.NAME FROM ARTIST, APPEARS, ALBUM WHERE ARTIST.ID=APPEARS.ARTIST_ID AND APPEARS.ALBUM_ID=ALBUM.ID AND ALBUM.NAME="Andy's OG Remix" ORDER BY ARTIST.ID

    SYSTEM R OPTIMIZER

    SELECT ARTIST.NAME FROM ARTIST, APPEARS, ALBUM WHERE ARTIST.ID=APPEARS.ARTIST_ID AND APPEARS.ALBUM_ID=ALBUM.ID AND ALBUM.NAME="Andy's OG Remix" ORDER BY ARTIST.ID
    Step #1: Choose the best access paths to each table

    SYSTEM R OPTIMIZER

    SELECT ARTIST.NAME FROM ARTIST, APPEARS, ALBUM WHERE ARTIST.ID=APPEARS.ARTIST_ID AND APPEARS.ALBUM_ID=ALBUM.ID AND ALBUM.NAME="Andy's OG Remix" ORDER BY ARTIST.ID
    Step #1: Choose the best access paths to each table
    ARTIST: Sequential Scan APPEARS: Sequential Scan ALBUM: Index Look- up on NAME

    SYSTEM R OPTIMIZER

    SELECT ARTIST.NAME FROM ARTIST, APPEARS, ALBUM WHERE ARTIST.ID=APPEARS.ARTIST_ID AND APPEARS.ALBUM_ID=ALBUM.ID AND ALBUM.NAME="Andy's OG Remix" ORDER BY ARTIST.ID
    Step #1: Choose the best access paths to each table Step #2: Enumerate all possible join orderings for tables
    ARTIST: Sequential Scan APPEARS: Sequential Scan ALBUM: Index Look- up on NAME

    SYSTEM R OPTIMIZER

    SELECT ARTIST.NAME FROM ARTIST, APPEARS, ALBUM WHERE ARTIST.ID=APPEARS.ARTIST_ID AND APPEARS.ALBUM_ID=ALBUM.ID AND ALBUM.NAME="Andy's OG Remix" ORDER BY ARTIST.ID
    Step #1: Choose the best access paths to each table
    Step #2: Enumerate all possible join orderings for tables
    ARTIST: Sequential Scan
    APPEARS: Sequential Scan
    ALBUM: Index Look- up on NAME
    ARTIST APPEARS ALBUM APPEARS ALBUM ARTIST ALBUM APPEARS ARTIST APPEARS ARTIST ALBUM ARTIST X ALBUM APPEARS ALBUM X ARTIST APPEARS : : : :

    SYSTEM R OPTIMIZER

    SELECT ARTIST.NAME FROM ARTIST, APPEARS, ALBUM WHERE ARTIST.ID=APPEARS.ARTIST_ID AND APPEARS.ALBUM_ID=ALBUM.ID AND ALBUM.NAME="Andy's OG Remix" ORDER BY ARTIST.ID
    Step #1: Choose the best access paths to each table
    Step #2: Enumerate all possible join orderings for tables
    Step #3: Determine the join ordering with the lowest cost
    ARTIST: Sequential Scan
    APPEARS: Sequential Scan
    ALBUM: Index Look- up on NAME
    ARTIST APPEARS ALBUM APPEARS ALBUM ARTIST ALBUM APPEARS ARTIST APPEARS ARTIST ALBUM ARTIST X ALBUM APPEARS ALBUM X ARTIST APPEARS : : : :
    Logical Op Physical Op

    SYSTEM R OPTIMIZER

    ARTIST APPEARS ALBUM
    Logical Op Physical Op

    SYSTEM R OPTIMIZER

    ARTIST APPEARS ALBUM
    Logical Op Physical Op

    SYSTEM R OPTIMIZER

    ARTIST APPEARS ALBUM
    Logical Op Physical Op

    SYSTEM R OPTIMIZER

    Logical Op Physical Op

    SYSTEM R OPTIMIZER

    SYSTEM R OPTIMIZER

    SYSTEM R OPTIMIZER

    SYSTEM R OPTIMIZER

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be. Perform a branch- and- bound search to traverse the plan tree by converting logical operators into physical operators. $\rightarrow$ Keep track of global best plan during search. $\rightarrow$ Treat physical properties of data as first- class entities during planning.
    Examples: MSSQL, Greenplum, CockroachDB

    Example Logical Rules

    P1 (P2(R)) = P2 (P1(R)) (commutativity)
    P1AP2 ...APn (R) = P1(P2( ... Pn(R))) (cascading o)
    $\begin{array}{r}\prod_{\mathrm{a}1}(\mathrm{R})\equiv \prod_{\mathrm{a}1}(\prod_{\mathrm{a}2}(\dots \prod_{\mathrm{ak}}(\mathrm{R})\dots)),\mathsf{a}{\mathrm{i}}\subseteq \mathsf{a}{\mathrm{i} + 1} \end{array}$ (cascading I)
    R S S R (join commutativity)
    R (S T) = (R S) T (join associativity)
    $\sigma_{\mathrm{p}}(\mathrm{RXS})\equiv (\mathrm{R}\bowtie_{\mathrm{p}}\mathrm{S})$ , if P is a join predicate
    $\sigma_{\mathrm{p}}(\mathrm{RXS})\equiv \sigma_{\mathrm{p1}}(\sigma_{\mathrm{p2}}(\mathrm{R})\bowtie_{\mathrm{p4}}\sigma_{\mathrm{p3}}(\mathrm{S}))$ , where $\mathrm{P} = \mathrm{p}1\wedge \mathrm{p}2\wedge \mathrm{p}3\wedge \mathrm{p}4$
    $\begin{array}{r}\prod_{\mathrm{A1,A2,\ldots An}}(\sigma_{\mathrm{P}}(\mathrm{R}))\equiv \prod_{\mathrm{A1,A2,\ldots An}}(\sigma_{\mathrm{P}}(\prod_{\mathrm{A1,\ldots An,B1,\ldots BM}}\mathrm{R})), \end{array}$ where B1 ... BM are columns in P
    Logical Op Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    ARTIST $\bowtie$ APPEARS $\bowtie$ ALBUM ORDER- BY(ARTIST.ID)

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\rightarrow$ Logical $\rightarrow$ Logical
    (A, B) to JOIN(B, A) $\rightarrow$ Logical $\rightarrow$ Physical
    (A, B) to HASH_JOIN(A, B)
    ARTIST $\bowtie$ APPEARS $\bowtie$ ALBUMORDER- BY(ARTIST.ID)

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\rightarrow$ Logical $\rightarrow$ Logical
    (A,B) to JOIN(B,A) $\rightarrow$ Logical $\rightarrow$ Physical
    (A,B) to HASH_JOIN(A,B)
    Logical Op Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\rightarrow$ Logical $\rightarrow$ Logical
    (A,B) to JOIN(B,A) $\rightarrow$ Logical $\rightarrow$ Physical
    (A,B) to HASH_JOIN(A,B)
    ARTIST APPEARS ALBUM ORDER- BY(ARTIST.ID)
    Logical Op Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\rightarrow$ Logical $\rightarrow$ Logical
    (A,B) to JOIN(B,A) $\rightarrow$ Logical $\rightarrow$ Physical
    (A,B) to HASH_JOIN(A,B)
    Logical Op Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\rightarrow$ Logical $\rightarrow$ Logical
    (A,B) to JOIN(B,A) $\rightarrow$ Logical $\rightarrow$ Physical
    (A,B) to HASH_JOIN(A,B)
    Logical Op Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\rightarrow$ Logical $\rightarrow$ Logical
    (A,B) to JOIN(B,A) $\rightarrow$ Logical $\rightarrow$ Physical
    (A,B) to HASH_JOIN(A,B)
    Logical Op Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\rightarrow$ Logical $\rightarrow$ Logical
    (A,B) to JOIN(B,A) $\rightarrow$ Logical $\rightarrow$ Physical
    (A,B) to HASH_JOIN(A,B)
    Logical Op Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\rightarrow$ Logical $\rightarrow$ Logical: JOIN(A,B) to JOIN(B,A) $\rightarrow$ Logical $\rightarrow$ Physical: JOIN(A,B) to HASH_JOIN(A,B)
    Logical Op Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\rightarrow$ Logical $\rightarrow$ Logical: JOIN(A,B) to JOIN(B,A) $\rightarrow$ Logical $\rightarrow$ Physical: JOIN(A,B) to HASH_JOIN(A,B)
    Logical Op Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\rightarrow$ Logical $\rightarrow$ Logical: JOIN(A,B) to JOIN(B,A) $\rightarrow$ Logical $\rightarrow$ Physical: JOIN(A,B) to HASH_JOIN(A,B)
    Logical Op Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\rightarrow$ Logical $\rightarrow$ Logical
    (A,B) to JOIN(B,A) $\rightarrow$ Logical $\rightarrow$ Physical
    (A,B) to HASH_JOIN(A,B)
    Logical Op Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\rightarrow$ Logical $\rightarrow$ Logical: JOIN(A,B) to JOIN(B,A) $\rightarrow$ Logical $\rightarrow$ Physical: JOIN(A,B) to HASH_JOIN(A,B)
    Can create "enforcer" rules that require input to have certain properties.
    LogicalOp PhysicalOp

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\longrightarrow$ Logical $\rightarrow$ Logical: JOIN(A,B) to JOIN(B,A) $\longrightarrow$ Logical $\rightarrow$ Physical: JOIN(A,B) to HASH_JOIN(A,B)
    Can create "enforcer" rules that require input to have certain properties.

    Logical Op

    Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\longrightarrow$ Logical $\rightarrow$ Logical: JOIN(A,B) to JOIN(B,A) $\longrightarrow$ Logical $\rightarrow$ Physical: JOIN(A,B) to HASH_JOIN(A,B)
    Can create "enforcer" rules that require input to have certain properties.

    Logical Op

    Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\longrightarrow$ Logical $\rightarrow$ Logical: JOIN(A,B) to JOIN(B,A) $\longrightarrow$ Logical $\rightarrow$ Physical: JOIN(A,B) to HASH_JOIN(A,B)
    Can create "enforcer" rules that require input to have certain properties.

    Logical Op

    Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\longrightarrow$ Logical $\rightarrow$ Logical: JOIN(A,B) to JOIN(B,A) $\longrightarrow$ Logical $\rightarrow$ Physical: JOIN(A,B) to HASH_JOIN(A,B)
    Can create "enforcer" rules that require input to have certain properties.
    Logical Op Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\rightarrow$ Logical $\rightarrow$ Logical: JOIN(A,B) to JOIN(B,A) $\rightarrow$ Logical $\rightarrow$ Physical: JOIN(A,B) to HASH_JOIN(A,B)
    Can create "enforcer" rules that require input to have certain properties.
    Logical Op Physical Op

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\rightarrow$ Logical $\rightarrow$ Logical: JOIN(A,B) to JOIN(B,A) $\rightarrow$ Logical $\rightarrow$ Physical: JOIN(A,B) to HASH_JOIN(A,B)
    Can create "enforcer" rules that require input to have certain properties.
    LogicalOp PhysicalOp

    TOP-DOWN OPTIMIZATION

    Start with a logical plan of what we want the query to be.
    Invoke rules to create new nodes and traverse tree.
    $\longrightarrow$ Logical $\rightarrow$ Logical: JOIN(A,B) to JOIN(B,A) $\longrightarrow$ Logical $\rightarrow$ Physical: JOIN(A,B) to HASH_JOIN(A,B)
    Can create "enforcer" rules that require input to have certain properties.

    OBSERVATION

    Applications often execute nested queries.
    $\rightarrow$ We could optimize each block using the methods we have discussed. $\rightarrow$ However, this may be inefficient since we optimize each block separately without a global approach.
    What if we could flatten a nested query into a single block and optimize it?
    $\rightarrow$ Then, apply single- block query optimization methods. $\rightarrow$ Even if one cannot flatten to a single block, flattening to fewer blocks is still beneficial.

    NESTED SUB-QUERIES

    The DBMS treats nested sub- queries in the where clause as functions that take parameters and return a single value or set of values.
    Approach #1: Rewrite to de- correlate and/or flatten them.
    Approach #2: Decompose nested query and store results in a temporary table.

    NESTED SUB-QUERIES: REWRITE

    SELECT name FROM sailors AS S WHERE EXISTS (SELECT * FROM reserves AS R WHERE S.sid = R.sid AND R.day = '2024- 10- 25')

    NESTED SUB-QUERIES: REWRITE

    SELECT name FROM sailors AS S WHERE EXISTS (SELECT * FROM reserves AS R WHERE S.sid = R.sid AND R.day = '2024- 10- 25')

    NESTED SUB-QUERIES: REWRITE

    DECOMPOSING QUERIES

    For harder queries, the optimizer breaks up queries into blocks and then concentrates on one block at a time.
    Sub- queries are written to temporary tables that are discarded after the query finishes.

    DECOMPOSING QUERIES

    SELECT S.sid, MIN(R.day) FROM sailors S, reserves R, boats B WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red' AND S.rating = (SELECT MAX(S2. rating) FROM sailors S2) GROUP BY S.sid HAVING COUNT(*) > 1

    DECOMPOSING QUERIES

    DECOMPOSING QUERIES

    SELECT MAX(rating) FROM sailors
    SELECT S.sid, MIN(R.day)
    FROM sailors S, reserves R, boats B
    WHERE S.sid = R.sid
    AND R.bid = B.bid
    AND B.color = 'red'
    AND S.rating = (SELECT MAX(S2. rating) FROM sailors S2)
    GROUP BY S.sid HAVING COUNT(*) > 1
    Nested Block

    DECOMPOSING QUERIES

    SELECT MAX(rating) FROM sailors
    SELECT S.sid, MIN(R.day)
    FROM sailors S, reserves R, boats B
    WHERE S.sid = R.sid
    AND R.bid = B.bid
    AND B.color = 'red'
    AND S.rating = ###
    GROUP BY S.sidHAVING COUNT(*) > 1
    Nested Block

    DECOMPOSING QUERIES

    SELECT MAX(rating) FROM sailors
    SELECT S.sid, MIN(R.day)
    FROM sailors S, reserves R, boats B
    WHERE S.sid = R.sid
    AND R.bid = B.bid
    AND B.color = 'red'
    AND S.rating = ###
    GROUP BY S.sidHAVING COUNT(*) > 1

    DECOMPOSING QUERIES

    Inner Block

    SELECT MAX(rating) FROM sailors
    SELECT S.sid, MIN(R.day)
    FROM sailors S, reserves R, boats B
    WHERE S.sid = R.sid
    AND R.bid = B.bid
    AND B.color = 'red'
    AND S.rating = ###
    GROUP BY S.sidHAVING COUNT(*) > 1
    Outer Block

    EXPRESSION REWRITING

    An optimizer transforms a query's expressions (e.g., WHERE/ON clause predicates) into the minimal set of expressions.
    Implemented using if/then/else clauses or a pattern- matching rule engine.
    $\rightarrow$ Search for expressions that match a pattern. $\rightarrow$ When a match is found, rewrite the expression. $\rightarrow$ Halt if there are no more rules that match.

    EXPRESSION REWRITING

    Impossible / Unnecessary Predicates
    SELECT * FROM A WHERE 1 = 0;

    EXPRESSION REWRITING

    Impossible / Unnecessary Predicates
    SELECT * FROM A WHERE 1 = 0

    EXPRESSION REWRITING

    Impossible / Unnecessary Predicates
    SELECT * FROM A WHERE false;

    EXPRESSION REWRITING

    Impossible / Unnecessary Predicates
    SELECT * FROM A WHERE false;
    SELECT * FROM A WHERE NOW() IS NULL;

    EXPRESSION REWRITING

    Impossible / Unnecessary Predicates
    SELECT * FROM A WHERE false;
    SELECT * FROM A WHERE NOW() IS NULL;

    EXPRESSION REWRITING

    Impossible / Unnecessary Predicates
    SELECT * FROM A WHERE false;
    SELECT * FROM A WHERE false;

    EXPRESSION REWRITING

    Impossible / Unnecessary Predicates
    SELECT * FROM A WHERE false; SELECT * FROM A WHERE false;
    Merging Predicates
    SELECT * FROM A WHERE val BETWEEN 1 AND 100 OR val BETWEEN 50 AND 150;

    EXPRESSION REWRITING

    Impossible / Unnecessary Predicates
    SELECT * FROM A WHERE false; SELECT * FROM A WHERE false;
    Merging Predicates
    SELECT * FROM A WHERE val BETWEEN 1 AND 100 OR val BETWEEN 50 AND 150;

    EXPRESSION REWRITING

    Impossible / Unnecessary Predicates
    SELECT * FROM A WHERE false;
    SELECT * FROM A WHERE false;
    Merging Predicates
    SELECT * FROM A WHERE val BETWEEN 1 AND 150;

    OBSERVATION

    We have formulas for the operator algorithms (e.g. the cost formulas for hash join, sort merge join,...), but we also need to estimate the size of the output that an operator produces.

    OBSERVATION

    We have formulas for the operator algorithms (e.g. the cost formulas for hash join, sort merge join,...), but we also need to estimate the size of the output that an operator produces.

    OBSERVATION

    We have formulas for the operator algorithms (e.g. the cost formulas for hash join, sort merge join,...), but we also need to estimate the size of the output that an operator produces.
    This is hard because the output of each operators depends on its input.

    COST ESTIMATION

    The DBMS uses a cost model to predict the behavior of a query plan given a database state. $\rightarrow$ This is an internal cost that allows the DBMS to compare one plan with another.
    It is too expensive to run every possible plan to determine this information, so the DBMS need a way to derive this information.

    COST MODEL COMPONENTS

    Choice #1: Physical Costs

    $\rightarrow$ Predict CPU cycles, I/O, cache misses, RAM consumption, network messages... $\rightarrow$ Depends heavily on hardware.

    Choice #2: Logical Costs

    $\rightarrow$ Estimate output size per operator. $\rightarrow$ Independent of the operator algorithm. $\rightarrow$ Need estimations for operator result sizes.

    POSTGRES COST MODEL

    Uses a combination of CPU and I/O costs that are weighted by "magic" constant factors.
    Default settings are obviously for a disk- resident database without a lot of memory: $\rightarrow$ Processing a tuple in memory is 400x faster than reading a tuple from disk. $\rightarrow$ Sequential I/O is 4x faster than random I/O.

    19.7.2. Planner Cost Constants

    The cost variables described in this section are measured on an arbitrary scale. Only their relative values matter, hence scaling them all up or down by the same factor will result in no change in the planner's choices. By default, these cost variables are based on the cost of sequential page fetches; that is, seq_page_cost is conventionally set to 1.0 and the other cost variables are set with reference to that. But you can use a different scale if you prefer, such as actual execution times in milliseconds on a particular machine.
    Note: Unfortunately, there is no well- defined method for determining ideal values for the cost variables. They are best treated as averages over the entire mix of queries that a particular installation will receive. This means that changing them on the basis of just a few experiments is very risky.
    seq_page_cost (floating point)
    Sets the planner's estimate of the cost of a disk page fetch that is part of a series of sequential fetches. The default is 1.0. This value can be overridden for tables and indexes in a particular tablespace by setting the tablespace parameter of the same name (see ALTER TABLESPACE).
    random_page_cost (floating point)

    STATISTICS

    The DBMS stores internal statistics about tables, attributes, and indexes in its internal catalog. Different systems update them at different times.
    Manual invocations:
    $\rightarrow$ Postgres/SQLite: ANALYZE $\rightarrow$ Oracle/MySQL: ANALYZE TABLE $\rightarrow$ SQL Server: UPDATE STATISTICS $\rightarrow$ DB2: RUNSTATS

    SELECTION CARDINALITY

    The selectivity (sel) of a predicate P is the fraction of tuples that qualify.
    Equality Predicate: A=constant $\rightarrow$ sel(A=constant) = #occurences/|R|
    SELECT * FROM people WHERE age = 9

    SELECTION CARDINALITY

    The selectivity (sel) of a predicate P is the fraction of tuples that qualify.
    Equality Predicate: A=constant $\rightarrow$ sel(A=constant) = #occurences/|R|
    SELECT * FROM people WHERE age = 9
    [ImageCaption: age]

    SELECTION CARDINALITY

    The selectivity (sel) of a predicate P is the fraction of tuples that qualify.
    Equality Predicate: A=constant $\rightarrow$ sel(A=constant) = #occurrences/|R|
    SELECT × FROM people WHERE age = 9

    SELECTION CARDINALITY

    The selectivity (sel) of a predicate P is the fraction of tuples that qualify.
    Equality Predicate: A=constant $\rightarrow$ sel(A=constant) = #occurences/|R| $\rightarrow$ Example: sel(age=9) =
    SELECT × FROM people WHERE age = 9

    SELECTION CARDINALITY

    The selectivity (sel) of a predicate P is the fraction of tuples that qualify.
    Equality Predicate: A=constant $\rightarrow$ sel(A=constant) = #occurences/|R| $\rightarrow$ Example: sel(age=9) =
    SELECT × FROM people WHERE age = 9

    SELECTION CARDINALITY

    The selectivity (sel) of a predicate P is the fraction of tuples that qualify.
    Equality Predicate: A=constant $\rightarrow$ sel(A=constant) = #occurences/|R| $\rightarrow$ Example: sel(age=9) = 4/45
    SELECT × FROM people WHERE age = 9

    SELECTION CARDINALITY

    Assumption #1: Uniform Data

    $\rightarrow$ The distribution of values (except for the heavy hitters) is the same.

    Assumption #2: Independent Predicates

    $\rightarrow$ The predicates on attributes are independent

    Assumption #3: Inclusion Principle

    $\rightarrow$ The domain of join keys overlap such that each key in the inner relation will also exist in the outer table.

    CORRELATED ATTRUTES

    Consider a database of automobiles:
    $\rightarrow$ # of Makes = 10, # of Models = 100
    And the following query: $\rightarrow$ (make="Honda" AND model="Accord")
    With the independence and uniformity assumptions, the selectivity is:
    $$ \rightarrow 1 / 10\times 1 / 100 = 0.001 $$
    But since only Honda makes Accords the real selectivity is $1 / 100 = 0.01$

    STATISTICS

    Choice #1: Histograms

    $\rightarrow$ Maintain an occurrence count per value (or range of values) in a column.

    Choice #2: Sketches

    $\rightarrow$ Probabilistic data structure that gives an approximate count for a given value.

    Choice #3: Sampling

    $\rightarrow$ DBMS maintains a small subset of each table that it then uses to evaluate expressions to compute selectivity.

    HISTOGRAMS

    Our formulas are nice, but we assume that data values are uniformly distributed.

    HISTOGRAMS

    Our formulas are nice, but we assume that data values are uniformly distributed.

    HISTOGRAMS

    Our formulas are nice, but we assume that data values are uniformly distributed.

    EQUI-WIDTH HISTOGRAM

    Maintain counts for a group of values instead of each unique key. All buckets have the same width (i.e., same # of value).
    [ImageCaption: Non-Uniform Approximation]

    EQUI-WIDTH HISTOGRAM

    Maintain counts for a group of values instead of each unique key. All buckets have the same width (i.e., same # of value).
    [ImageCaption: Non-Uniform Approximation]

    EQUI-WIDTH HISTOGRAM

    Maintain counts for a group of values instead of each unique key. All buckets have the same width (i.e., same # of value).
    [ImageCaption: Equi-Width Histogram]

    EQUI-DEPTH HISTOGRAMS

    Vary the width of buckets so that the total number of occurrences for each bucket is roughly the same.

    EQUI-DEPTH HISTOGRAMS

    Vary the width of buckets so that the total number of occurrences for each bucket is roughly the same.

    EQUI-DEPTH HISTOGRAMS

    Vary the width of buckets so that the total number of occurrences for each bucket is roughly the same.

    SKETCHES

    Probabilistic data structures that generate approximate statistics about a data set.
    Cost- model can replace histograms with sketches to improve its selectivity estimate accuracy.
    Most common examples:
    $\rightarrow$ Count- Min Sketch (1988): Approximate frequency count of elements in a set. $\rightarrow$ HyperLogLog (2007): Approximate the number of distinct elements in a set.

    SAMPLING

    Modern DBMSs also collect samples from tables to estimate selectivities.
    Update samples when the underlying tables changes significantly.
    SELECT AVG(age) FROM people WHERE age > 50
    Table (html):
    idnameagestatus
    1001Obama63Rested
    1002Swift34Paid
    1003Tupac25Dead
    1004Bieber30Crunk
    1005Andy43Illin
    1006TigerKing61Jailed
    1 billion tuples

    SAMPLING

    Modern DBMSs also collect samples from tables to estimate selectivities.
    Update samples when the underlying tables changes significantly.
    SELECT AVG(age) FROM people WHERE age > 50
    Table (html):
    idnameagestatus
    1001Obama63Rested
    1002Swift34Paid
    1003Tupac25Dead
    1004Bieber30Crunk
    1005Andy43Illin
    1006TigerKing61Jailed
    1 billion tuples

    SAMPLING

    Modern DBMSs also collect samples from tables to estimate selectivities.
    Update samples when the underlying tables changes significantly.

    Table Sample

    Table (html):
    1001Obama63Rested
    1003Tupac25Dead
    1005Andy43Illin
    SELECT AVG(age) FROM people WHERE age > 50
    Table (html):
    idnameagestatus
    1001Obama63Rested
    1002Swift34Paid
    1003Tupac25Dead
    1004Bieber30Crunk
    1005Andy43Illin
    1006TigerKing61Jailed
    1 billion tuples

    SAMPLING

    Modern DBMSs also collect samples from tables to estimate selectivities.
    Update samples when the underlying tables changes significantly.

    Table Sample

    sel(age>50) =
    Table (html):
    1001Obama63Rested
    1003Tupac25Dead
    1005Andy43Illin
    SELECT AVG(age) FROM people WHERE age > 50
    Table (html):
    idnameagestatus
    1001Obama63Rested
    1002Swift34Paid
    1003Tupac25Dead
    1004Bieber30Crunk
    1005Andy43Illin
    1006TigerKing61Jailed
    1 billion tuples

    SAMPLING

    Modern DBMSs also collect samples from tables to estimate selectivities.
    Update samples when the underlying tables changes significantly.

    Table Sample

    sel(age>50) =
    Table (html):
    1001Obama63Rested
    1003Tupac25Dead
    1005Andy43Illin
    SELECT AVG(age) FROM people WHERE age > 50
    Table (html):
    idnameagestatus
    1001Obama63Rested
    1002Swift34Paid
    1003Tupac25Dead
    1004Bieber30Crunk
    1005Andy43Illin
    1006TigerKing61Jailed
    1 billion tuples

    SAMPLING

    Modern DBMSs also collect samples from tables to estimate selectivities.
    Update samples when the underlying tables changes significantly.

    Table Sample

    sel(age>50) = 1/3
    Table (html):
    1001Obama63Rested
    1003Tupac25Dead
    1005Andy43Illin
    SELECT AVG(age) FROM people WHERE age > 50
    Table (html):
    idnameagestatus
    1001Obama63Rested
    1002Swift34Paid
    1003Tupac25Dead
    1004Bieber30Crunk
    1005Andy43Illin
    1006TigerKing61Jailed
    1 billion tuples

    CONCLUSION

    Query optimization is critical for a database system. $\rightarrow$ SQL $\rightarrow$ Logical Plan $\rightarrow$ Physical Plan $\rightarrow$ Flatten queries before going to the optimization part. Expression handling is also important. $\rightarrow$ Estimate costs using models based on summarizations.
    QO enumeration can be bottom- up or top- down.

    NEXT CLASS

    Transactions! $\rightarrow$ aka the second hardest part about database systems