    lecture-08-slides.pdf
    Carnegie Mellon University
    Database Systems
    Beautiful B+Trees

    ADMINISTRIVIA

    Project #2 out later today; due Sunday March 2$^{\text{nd}}$ @ 11
    $\rightarrow$ Don't forget to do a GitHub "pull" before starting
    Homework #3 (indices and filters) will be released on Wednesday February 19$^{\text{th}}$
    Mid-term Exam on Wednesday February 26$^{\text{th}}$
    $\rightarrow$ In-class in this room
    $\rightarrow$ More info next week

    New Seminar Series: Every Tue @ 4

    Schedule

    https://cmu.zoom.us/j/93441451665 (Passcode 261758)
    Today's seminar: Larry Ellison was Right (kinda)! TypeScript Stored Procedures for the Modern Age
    Speaker: James Cowling, CTO, Convex.

    LAST CLASS

    Hash tables are important data structures that are used all throughout a DBMS.
    $\rightarrow$ Space Complexity: $O(n)$
    $\rightarrow$ Average Time Complexity: $O(1)$
    Static vs. Dynamic Hashing schemes.
    DBMSs mostly use hash tables for their internal data structures.

    INDEXES VS. FILTERS

    An index is a data structure built on a subset of a table's attributes that is organized and/or sorted for efficient retrieval of the location of specific tuples using those attributes. $\rightarrow$ Example: B+Tree
    A filter is a data structure that answers set membership queries; it tells you whether a record (likely) exists for a key but not where it is located. $\rightarrow$ Example: Bloom Filter

    TODAY'S AGENDA

    B+Tree Overview
    Design Choices
    Optimizations

    B-TREE FAMILY

    There is a specific data structure called a B-Tree.
    People also use the term to generally refer to a class of balanced tree data structures:
    $\rightarrow$ B-Tree (1970) $\rightarrow$ B+Tree (1973) $\rightarrow$ B*Tree (1977?) $\rightarrow$ B-link Tree (1981) $\rightarrow$ Bε-Tree (2003) $\rightarrow$ Bw-Tree (2013)

    ORGANIZATION AND MAINTENANCE OF LARGE ORDERED INDICES
    by R. Bayer and E. McCreight
    Mathematical and Information Sciences Report No. 20, Mathematical and Information Sciences Laboratory, Boeing Scientific Research Laboratories, July 1970


    The Ubiquitous B-Tree

    DOUGLAS COMER
    Computer Science Department, Purdue University, West Lafayette, Indiana 47907
    B-trees have become, de facto, a standard for file organization. File indexes of users, dedicated database systems, and general-purpose access methods have all been proposed and implemented using B-trees. This paper reviews B-trees and shows why they have been so successful. It discusses the major variations of the B-tree, especially the B+-tree, contrasting the relative merits and costs of each implementation. It illustrates a general purpose access method which uses a B-tree.
    Keywords and Phrases: B-tree, B*-tree, B+-tree, file organization, index

    INTRODUCTION

    The secondary storage facilities available on large computer systems allow users to store, update, and recall data from large collections of information called files. A computer must retrieve an item and place it in main memory before it can be processed. In order to make good use of the computer resources, one must organize files intelligently, making the retrieval process efficient.
    The choice of a good file organization depends on the kinds of retrieval to be performed. There are two broad classes of retrieval commands which can be illustrated by the following examples:
    Sequential: "From our employee file, prepare a list of all employees' names and addresses," and "From our employee file, extract the information about employee J. Smith".
    We can imagine a filing cabinet with three drawers of folders, one folder for each employee. The drawers might be labeled "A- G," "H- R," and "S- Z," while the folders might be labeled with the employees' last names. A sequential request requires the searcher to examine the entire file, one folder at a time. On the other hand, a random request implies that the searcher, guided by the labels on the drawers and folders, need only extract one folder.
    Associated with a large, randomly accessed file in a computer system is an index which, like the labels on the drawers and folders of the file cabinet, speeds retrieval by directing the searcher to the small part of the file containing the desired item. Figure 1 depicts a file and its index. An index may be physically integrated with the file, like the labels on employee folders or physically separate, like the labels on the drawers. Usually the index itself is a file. If the index file is large, another index may be built on top of it to speed retrieval further, and so on. The resulting hierarchy is similar to the employee file, where the topmost index consists of labels on drawers, and the next level of index consists of labels on folders.
    Natural hierarchies, like the one formed by considering last names as index entries, do not always produce the best perform…


    Efficient Locking for Concurrent Operations on B-Trees

    PHILIP L. LEHMAN Carnegie- Mellon University and S. BING YAO Purdue University
    The B-tree and its variants have been found to be highly useful (both theoretically and in practice) for storing large amounts of information, especially on secondary storage devices. We examine the problem of overcoming the inherent difficulty of concurrent operations on such structures, using a practical storage model. A single additional "link" pointer in each node allows a process to easily recover from tree modifications performed by other concurrent processes. In our solution, only a (small) constant number of nodes are locked by any update process at any given time. An informal correctness proof of our system is given.
    Key Words and Phrases: database, data structures, B-tree, index organizations, concurrent algorithms, concurrency controls, locking protocols, correctness, consistency, multiway search trees

    1. INTRODUCTION

    The B-tree [2] and its variants have been widely used in recent years as a data structure for storing large files of information, especially on secondary storage devices [7]. The guaranteed small (average) search, insertion, and deletion time for these structures makes them quite appealing for database applications. A topic of current interest in database design is the construction of databases that can be manipulated concurrently and correctly by several processes. In this paper, we consider a simple variant of the B-tree (actually of the B*-tree, proposed by W. J. Lippin [10]) especially well suited for use in a concurrent database system.


    B-TREE FAMILY

    postgres/src/backend/access/nbtree/README:
    "This directory contains a correct implementation of Lehman and Yao's high-concurrency B-tree management algorithm (P. Lehman and S. Yao, Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions on Database Systems, Vol 6, No. 4, December 1981, pp 650-670). We also use a simplified version of the deletion logic described in Lanin and Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm, Proceedings of 1986 Fall Joint Computer Conference, pp 380-389). … The basic Lehman & Yao Algorithm …"

    B+TREE

    A B+Tree is a self-balancing, ordered m-way tree that supports searches, sequential access, insertions, and deletions in $O(\log_m n)$, where m is the tree fanout.
    $\rightarrow$ It is perfectly balanced (i.e., every leaf node is at the same depth in the tree).
    $\rightarrow$ Every node other than the root is at least half-full: $\lceil m/2 \rceil - 1 \leq$ #keys $\leq m - 1$.
    $\rightarrow$ Every inner node with k keys has k+1 non-null children.
    $\rightarrow$ Optimized for reading/writing large data blocks.
    Some real-world implementations relax these properties, but we will ignore that for now...

    B+TREE EXAMPLE

    NODES

    Every B+Tree node is comprised of an array of key/value pairs.
    $\rightarrow$ The keys are derived from the index's target attribute(s).
    $\rightarrow$ The values differ based on whether the node is an inner node or a leaf node.
    The arrays are (usually) kept in sorted key order.
    All NULL keys are stored in either the first or the last leaf nodes.

    B+TREE LEAF NODES

    A leaf node begins with a small header, followed by key/value pairs kept in sorted key order:

    Level | Slots | Prev | Next
    K1:V1 | K2:V2 | K3:V3 | K4:V4 | K5:V5 | ...

    $\rightarrow$ The Prev and Next pointers link each leaf to its siblings, forming a doubly-linked list across the leaf level. A struct sketch follows.
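A minimal C sketch of this layout, assuming fixed-size pages and 32-bit integer keys; the field names and LEAF_CAPACITY value are illustrative, not from the slides.

```c
#include <stdint.h>

#define LEAF_CAPACITY 64   /* assumption: fanout chosen so a node fills one page */

/* One fixed-size B+Tree leaf page, mirroring the slide's layout:
 * a small header (level, slot count, prev/next sibling pointers)
 * followed by key/value pairs kept in sorted key order. */
typedef struct {
    uint16_t level;               /* 0 for leaf nodes */
    uint16_t num_slots;           /* number of occupied key/value slots */
    uint32_t prev_page_id;        /* left sibling (for reverse scans) */
    uint32_t next_page_id;        /* right sibling (for range scans) */
    int32_t  keys[LEAF_CAPACITY]; /* sorted keys */
    uint64_t vals[LEAF_CAPACITY]; /* e.g. record IDs: (page id, slot) packed */
} LeafNode;
```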

    LEAF NODE VALUES

    Approach #1: Record IDs
    $\rightarrow$ A pointer to the location of the tuple to which the index entry corresponds.
    $\rightarrow$ Most common implementation.
    $\rightarrow$ Examples: PostgreSQL, Microsoft SQL Server, IBM DB2, Oracle.

    Approach #2: Tuple Data
    $\rightarrow$ Index-Organized Storage (Lecture #04)
    $\rightarrow$ Primary Key Index: Leaf nodes store the contents of the tuple.
    $\rightarrow$ Secondary Indexes: Leaf nodes store tuples' primary key as their values.
    $\rightarrow$ Examples: MySQL, Oracle.

    B-TREE VS. B+TREE

    The original B-Tree from 1970 stored keys and values in all nodes in the tree. $\rightarrow$ More space-efficient, since each key only appears once in the tree.
    A B+Tree only stores values in leaf nodes. Inner nodes only guide the search process.

    B+TREE - INSERT

    1. Find correct leaf node L.
    2. Insert new entry into L in sorted order.
    3. If L has enough space, done!
    4. Otherwise, split L's keys into L and a new node L2.
    $\rightarrow$ Redistribute entries evenly, copy up middle key.
    $\rightarrow$ Insert index entry pointing to L2 into parent of L.
    To split an inner node, redistribute entries evenly, but push up the middle key. A sketch of the leaf-split step appears below.
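A minimal sketch of step 4 for a leaf, assuming the LeafNode layout from the earlier snippet (restated here so the code stands alone); page ids are supplied by the caller.

```c
#include <stdint.h>
#include <string.h>

#define LEAF_CAPACITY 64
typedef struct {                       /* LeafNode from the earlier sketch */
    uint16_t level, num_slots;
    uint32_t prev_page_id, next_page_id;
    int32_t  keys[LEAF_CAPACITY];
    uint64_t vals[LEAF_CAPACITY];
} LeafNode;

/* Split overfull leaf L into L and a fresh node L2: redistribute entries
 * evenly and return the "copy up" key (the smallest key in L2), which the
 * caller inserts into L's parent next to a pointer to L2. */
static int32_t leaf_split(LeafNode *L, uint32_t L_pid,
                          LeafNode *L2, uint32_t L2_pid) {
    int mid   = L->num_slots / 2;
    int moved = L->num_slots - mid;

    memcpy(L2->keys, &L->keys[mid], moved * sizeof L->keys[0]);
    memcpy(L2->vals, &L->vals[mid], moved * sizeof L->vals[0]);
    L2->num_slots = (uint16_t)moved;
    L->num_slots  = (uint16_t)mid;

    /* Stitch the sibling chain: L <-> L2 <-> old right sibling. */
    L2->level        = 0;
    L2->next_page_id = L->next_page_id;
    L2->prev_page_id = L_pid;
    L->next_page_id  = L2_pid;

    return L2->keys[0];   /* leaf splits *copy* the middle key up */
}
```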

    B+TREE – INSERT EXAMPLE (1)

    Insert 6
    Insert 8

    B+TREE – INSERT EXAMPLE (2)

    Insert 17
    Note: New Example/Tree.

    B+TREE – INSERT EXAMPLE (3)

    Insert 17
    Insert 16
    $\rightarrow$ No space in the node where the new key "belongs".
    $\rightarrow$ New node! Shuffle keys from the node that triggered the split.
    $\rightarrow$ We want to create a key/pointer pair for the new node, but cannot insert it into the root node, which is full, so the root must split as well.

    B+TREE - DELETE

    1. Start at root, find leaf L where entry belongs.
    2. Remove the entry.
    3. If L is at least half-full, done!
    4. If L is under-filled (fewer than the minimum number of entries):
    $\rightarrow$ Try to re-distribute, borrowing from a sibling (adjacent node with the same parent as L).
    $\rightarrow$ If re-distribution fails, merge L and the sibling.
    5. If a merge occurred, must delete the entry (pointing to L or sibling) from the parent of L.
    A sketch of the borrow-or-merge step appears below.
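A minimal sketch of step 4, assuming the LeafNode layout from the earlier snippets (restated) and that `sib` is L's right sibling under the same parent; parent maintenance is left to the caller as described.

```c
#include <stdint.h>
#include <string.h>

#define LEAF_CAPACITY 64
typedef struct {                       /* LeafNode from the earlier sketch */
    uint16_t level, num_slots;
    uint32_t prev_page_id, next_page_id;
    int32_t  keys[LEAF_CAPACITY];
    uint64_t vals[LEAF_CAPACITY];
} LeafNode;

/* Returns 1 if a merge happened (caller deletes sib's separator entry from
 * the parent); returns 0 after a redistribution (caller updates that
 * separator to sib->keys[0]). */
static int leaf_underflow(LeafNode *L, LeafNode *sib) {
    const int min_slots = LEAF_CAPACITY / 2;
    if (L->num_slots >= min_slots) return 0;            /* still half-full */

    if (sib->num_slots > min_slots) {                   /* borrow smallest */
        L->keys[L->num_slots] = sib->keys[0];
        L->vals[L->num_slots] = sib->vals[0];
        L->num_slots++;
        sib->num_slots--;
        memmove(sib->keys, sib->keys + 1, sib->num_slots * sizeof sib->keys[0]);
        memmove(sib->vals, sib->vals + 1, sib->num_slots * sizeof sib->vals[0]);
        return 0;
    }

    /* Redistribution failed: merge sib into L and unlink sib. */
    memcpy(L->keys + L->num_slots, sib->keys, sib->num_slots * sizeof sib->keys[0]);
    memcpy(L->vals + L->num_slots, sib->vals, sib->num_slots * sizeof sib->vals[0]);
    L->num_slots += sib->num_slots;
    L->next_page_id = sib->next_page_id;
    return 1;
}
```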

    B+TREE - DELETE EXAMPLE (1)

    Delete 6

    B+TREE - DELETE EXAMPLE (2)

    Note: New Example/Tree.
    Need to update parent node!

    B+TREE - DELETE EXAMPLE (3)

    Delete 15
    Delete 19
    $\rightarrow$ Under-filled! No "rich" sibling nodes to borrow from, so merge with a sibling.
    $\rightarrow$ The parent node is now under-filled as well! Pull down its separator key.

    COMPOSITE INDEX

    A composite index is one whose key is comprised of two or more attributes.
    $\rightarrow$ Example: Index on <a,b,c>

    CREATE INDEX my_idx ON xxx (a, b DESC, c NULLS FIRST);

    The DBMS can use a B+Tree index if the query provides a "prefix" of the composite key.
    $\rightarrow$ Supported: (a=1 AND b=2 AND c=3)
    $\rightarrow$ Supported: (a=1 AND b=2)
    $\rightarrow$ Rarely Supported: (b=2), (c=3)
    A comparator sketch follows this list.
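Why a prefix works: composite keys sort lexicographically, so comparing only the attributes the query binds still yields one contiguous key range. A sketch with a hypothetical 3-attribute key; DESC and NULLS FIRST ordering from the example DDL is ignored here for simplicity.

```c
#include <stdint.h>

typedef struct { int32_t a, b, c; } CompositeKey;  /* hypothetical <a,b,c> key */

/* Compare only the first `nbound` attributes (1..3). Queries binding a
 * prefix of the key use the same index by comparing fewer attributes. */
static int composite_cmp(const CompositeKey *x, const CompositeKey *y,
                         int nbound) {
    if (nbound >= 1 && x->a != y->a) return x->a < y->a ? -1 : 1;
    if (nbound >= 2 && x->b != y->b) return x->b < y->b ? -1 : 1;
    if (nbound >= 3 && x->c != y->c) return x->c < y->c ? -1 : 1;
    return 0;   /* equal on all bound attributes -> inside the range */
}
```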

    SELECTION CONDITIONS

    Find Key=(1,2)
    Find Key=(1,*)
    Find Key=(*,1)
    The sketch below shows why the first two map to a single range scan while the last one cannot use the index's sort order.
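A sketch over a flat sorted array standing in for the leaf level: with keys sorted on (a, b), all entries with a=1 are contiguous, so Find Key=(1,*) seeks once and scans a range, while Find Key=(*,1) has no usable prefix and must check every entry.

```c
#include <stdio.h>

typedef struct { int a, b; } Key;

static void find_a_eq(const Key *keys, int n, int a) {
    int i = 0;
    while (i < n && keys[i].a < a) i++;      /* seek (binary search in practice) */
    for (; i < n && keys[i].a == a; i++)     /* contiguous range scan */
        printf("match (%d,%d)\n", keys[i].a, keys[i].b);
}

static void find_b_eq(const Key *keys, int n, int b) {
    for (int i = 0; i < n; i++)              /* full scan: no usable prefix */
        if (keys[i].b == b)
            printf("match (%d,%d)\n", keys[i].a, keys[i].b);
}

int main(void) {
    Key keys[] = {{1,1},{1,2},{1,3},{2,1},{2,2},{3,1},{3,2}};
    int n = sizeof keys / sizeof keys[0];
    find_a_eq(keys, n, 1);   /* Find Key=(1,*) */
    find_b_eq(keys, n, 1);   /* Find Key=(*,1) */
    return 0;
}
```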

    B+TREE - DUPLICATE KEYS

    Approach #1: Append Record ID

    → Add the tuple's unique Record ID as part of the key to ensure that all keys are unique. → The DBMS can still use partial keys to find tuples.

    Approach #2: Overflow Leaf Nodes

    → Allow leaf nodes to spill into overflow nodes that contain the duplicate keys. → This is more complex to maintain and modify.

    B+TREE - APPEND RECORD ID

    Insert 6 becomes Insert <6, (Page, Slot)>, so every key stored in the tree is unique.

    B+TREE - OVERFLOW LEAF NODES

    Insert 6
    Insert 7
    Insert 6 $\rightarrow$ the duplicate key spills into an overflow node linked from the full leaf.

    CLUSTERED INDEXES

    The table is stored in the sort order specified by the primary key.
    $\rightarrow$ Can be either heap- or index- organized storage.
    Some DBMSs always use a clustered index. $\rightarrow$ If a table does not contain a primary key, the DBMS will automatically make a hidden primary key.
    Other DBMSs cannot use them at all.

    CLUSTERED B+TREE

    Traverse to the left-most leaf page and then retrieve tuples from all leaf pages.
    This will always be better than sorting data for each query.

    INDEX SCAN PAGE SORTING

    Retrieving tuples in the order they appear in a non-clustered index is inefficient due to redundant reads.
    A better approach is to find all the tuples that the query needs and then sort them based on their page ID.
    The DBMS then retrieves each page once, as in the sketch below.
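A minimal sketch of the sort step; RecordId is a hypothetical (page id, slot) pair and fetch_page a hypothetical buffer-pool call.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t page_id; uint16_t slot; } RecordId;

static int by_page(const void *p, const void *q) {
    const RecordId *x = p, *y = q;
    return (x->page_id > y->page_id) - (x->page_id < y->page_id);
}

/* Sort the record IDs produced by the index scan by page id before
 * fetching, so matches on the same page are read back-to-back and each
 * page costs (at most) one I/O instead of one per match. */
static void fetch_sorted(RecordId *rids, size_t n) {
    qsort(rids, n, sizeof rids[0], by_page);
    for (size_t i = 0; i < n; i++) {
        /* fetch_page(rids[i].page_id); consecutive duplicates now hit the
         * same buffered page (hypothetical buffer-pool API) */
    }
}
```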

    B+TREE DESIGN CHOICES

    Node Size
    Merge Threshold
    Variable-Length Keys
    Intra-Node Search

    NODE SIZE

    The slower the storage device, the larger the optimal node size for a B+Tree.
    $\rightarrow$ HDD: ~1MB
    $\rightarrow$ SSD: ~10KB
    $\rightarrow$ In-Memory: ~512B
    Optimal sizes can vary depending on the workload.
    $\rightarrow$ Leaf Node Scans vs. Root-to-Leaf Traversals

    MERGE THRESHOLD

    Some DBMSs do not always merge nodes when they are half full.
    $\rightarrow$ Average occupancy rate for B+Tree nodes is 69%.
    Delaying a merge operation may reduce the amount of reorganization.
    It may also be better to let under-filled nodes exist and then periodically rebuild the entire tree.
    This is why PostgreSQL calls its B+Tree a "non-balanced" B+Tree (nbtree).

    VARIABLE-LENGTH KEYS

    Approach #1: Pointers

    $\rightarrow$ Store the keys as pointers to the tuple's attribute. $\rightarrow$ Also called T-Trees (in-memory DBMSs)

    Approach #2: Variable-Length Nodes

    $\rightarrow$ The size of each node in the index can vary. $\rightarrow$ Requires careful memory management.

    Approach #3: Padding

    $\rightarrow$ Always pad the key to be the max length of the key type.

    Approach #4: Key Map / Indirection

    $\rightarrow$ Embed an array of pointers that map to the key + value list within the node.

    INTRA-NODE SEARCH

    Approach #1: Linear
    $\rightarrow$ Scan node keys from beginning to end.
    $\rightarrow$ Use SIMD to vectorize comparisons, e.g., _mm_cmpeq_epi32_mask(a, b).
    $\rightarrow$ Example: Find Key=8 in a node with keys [4, 5, 6, 7, 8, 9, 10].
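A sketch of the vectorized scan. The slide's _mm_cmpeq_epi32_mask is an AVX-512 intrinsic; this version uses the older SSE2 equivalents (_mm_cmpeq_epi32 plus a movemask) and the GCC/Clang builtin __builtin_ctz, so it compiles on any x86-64 toolchain — those substitutions are mine, not the slide's.

```c
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* Compare four keys per step against the probe key.
 * Returns the index of `key` in the node's key array, or -1. */
static int simd_linear_search(const int32_t *keys, int n, int32_t key) {
    __m128i probe = _mm_set1_epi32(key);            /* [key, key, key, key] */
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i block = _mm_loadu_si128((const __m128i *)(keys + i));
        __m128i eq    = _mm_cmpeq_epi32(block, probe);
        int mask = _mm_movemask_ps(_mm_castsi128_ps(eq)); /* one bit per lane */
        if (mask) return i + __builtin_ctz(mask);
    }
    for (; i < n; i++)                              /* scalar tail */
        if (keys[i] == key) return i;
    return -1;
}
```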

    Approach #2: Binary
    $\rightarrow$ Jump to middle key, pivot left/right depending on comparison.
    $\rightarrow$ Example: Find Key=8.
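What Approach #2 looks like in code: a standard lower-bound binary search over the node's sorted keys (the function name is my own).

```c
#include <stdint.h>

/* Returns the index of the first key >= probe (a lower bound). On a hit,
 * keys[result] == key; in an inner node the result doubles as the child
 * slot to follow. */
static int node_lower_bound(const int32_t *keys, int n, int32_t key) {
    int lo = 0, hi = n;                /* search window [lo, hi) */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;  /* avoids overflow of lo + hi */
        if (keys[mid] < key) lo = mid + 1;
        else                 hi = mid;
    }
    return lo;                         /* n means "past the end" */
}
```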

    Approach #3: Interpolation
    $\rightarrow$ Approximate location of desired key based on known distribution of keys.
    $\rightarrow$ Example: Find Key=8 in keys [4, 5, 6, 7, 8, 9, 10]: Offset = (8-4)*7/(10-4) = 4.
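A sketch of one interpolation probe under a roughly-uniform-keys assumption, using an (n-1)-scaled variant of the slide's offset formula; production implementations (like the SIP/TIP algorithms in the paper excerpted next) re-interpolate instead of falling back to a scan around the guess.

```c
#include <stdint.h>

static int interpolation_search(const int32_t *keys, int n, int32_t key) {
    if (n == 0 || key < keys[0] || key > keys[n - 1]) return -1;
    int64_t span = (int64_t)keys[n - 1] - keys[0];
    /* e.g. keys 4..10, key 8: pos = (8-4)*(7-1)/(10-4) = 4 */
    int pos = span ? (int)((int64_t)(key - keys[0]) * (n - 1) / span) : 0;
    while (pos > 0     && keys[pos] > key) pos--;   /* guessed too far right */
    while (pos < n - 1 && keys[pos] < key) pos++;   /* guessed too far left  */
    return keys[pos] == key ? pos : -1;
}
```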

    Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search?

    Peter Van Sandt, Yannis Chronis, Jignesh M. Patel
    Department of Computer Sciences, University of Wisconsin-Madison
    {van-sandt,chronis,jignesh}@cs.wisc.edu

    ABSTRACT

    In this paper, we focus on the problem of searching sorted, in-memory datasets. This is a key data operation, and Binary Search is the de facto algorithm that is used in practice. We consider an alternative, namely Interpolation Search, which can take advantage of hardware trends by using complex calculations to save memory accesses. Historically, Interpolation Search was found to underperform compared to other search algorithms in this setting, despite its superior asymptotic complexity. Also, Interpolation Search is known to perform poorly on non-uniform data. To address these issues, we introduce SIP (Slope Reuse Interpolation), an optimized implementation of Interpolation Search, and TIP (Three Point Interpolation), a new search algorithm that uses linear fractions to interpolate on non-uniform distributions. We evaluate these two algorithms against a similarly optimized Binary Search method using a variety of real and synthetic datasets. We show that SIP is up to 4 times faster on uniformly distributed data and TIP is 2-3 times faster on non-uniformly distributed data in some cases. We also design a meta-algorithm to switch between these different methods to automate picking the higher performing search algorithm, which depends on factors like data distribution.

    CCS CONCEPTS

    $\cdot$ Information systems $\rightarrow$ Point lookups; Main memory engines.

    KEYWORDS

    In- memory search; Interpolation Search; Binary Search

    ACM Reference Format:

    Peter Van Sandt, Yannis Chronis, Jignesh M. Patel. 2019. Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search? In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19), June 30-July 5, 2019, Amsterdam, Netherlands. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3299869.3300075
    [Figure 1: Speed comparison of representative processor and main memory technologies [27]. Processor performance is measured in FLOPs; main memory performance is measured as peak FLOPs to sustained memory bandwidth (GFLOP/sec)/(Words/sec) and as peak FLOPs per idle memory latency (GFLOP/sec)*sec. On the conventional von Neumann architectural path, main memory is poised to become (relatively) slower compared to the speed of computing inside processors.]

    1 INTRODUCTION

    Searching in-memory, sorted datasets is a fundamental data operation [23]. Today, Binary Search is the de facto search method that is used in practice, as it is efficient and asymptotically optimal in the worst case. Binary Search is a primitive in many popular data systems and frameworks (e.g., LevelDB [25] and Pandas [30]).
    Designing algorithms around hardware trends can yield significant performance gains. A key technological trend is the diverging CPU and memory speeds, which is illustrated in Figure 1. This trend favors algorithms that can use more computation to reduce memory accesses [4, 6, 16, 21, 27, 38]. The focus of this paper is on exploring the impact of this trend…

    OPTIMIZATIONS

    Prefix Compression
    Deduplication
    Suffix Truncation
    Pointer Swizzling
    Bulk Insert
    Buffered Updates
    Many more...

    PREFIX COMPRESSION

    Sorted keys in the same leaf node are likely to have the same prefix.
    Instead of storing the entire key each time, extract the common prefix and store only the unique suffix for each key.
    $\rightarrow$ Many variations.
    $\rightarrow$ Example: robbed, robbing, robot $\rightarrow$ prefix "rob" + suffixes "bed", "bing", "ot" (see the sketch below).
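A sketch of just the prefix computation for one leaf's string keys, using the slide's example.

```c
#include <stdio.h>
#include <string.h>

/* Longest common prefix of a leaf's sorted keys; everything after it is
 * stored per key as the unique suffix. */
static size_t common_prefix_len(const char *const *keys, size_t n) {
    if (n == 0) return 0;
    size_t len = strlen(keys[0]);
    for (size_t i = 1; i < n; i++) {
        size_t j = 0;
        while (j < len && keys[i][j] == keys[0][j]) j++;
        len = j;                                   /* prefix can only shrink */
    }
    return len;
}

int main(void) {
    const char *keys[] = {"robbed", "robbing", "robot"};
    size_t plen = common_prefix_len(keys, 3);
    printf("prefix: %.*s\n", (int)plen, keys[0]);  /* "rob" */
    for (int i = 0; i < 3; i++)
        printf("suffix: %s\n", keys[i] + plen);    /* bed, bing, ot */
    return 0;
}
```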

    DEDUPLICATION

    Non-unique indexes can end up storing multiple copies of the same key in leaf nodes.
    The leaf node can store the key once and then maintain a "posting list" of tuples with that key (similar to what we discussed for hash tables).
    $\rightarrow$ Example: the entries [K1:V1, K1:V2, K1:V3, K2:V4] become K1:[V1, V2, V3], K2:[V4].

    SUFFIX TRUNCATION

    The keys in the inner nodes are only used to "direct traffic".
    $\rightarrow$ We don't need the entire key.
    Store the minimum prefix that is needed to correctly route probes into the index (see the sketch below).
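A sketch for string keys: a separator only has to fall between the largest key of the left child and the smallest key of the right child, so the shortest prefix of the right key that differs from the left key suffices. The example keys are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Length of the shortest prefix of right_min that still compares strictly
 * greater than left_max (first differing position, inclusive). */
static size_t separator_len(const char *left_max, const char *right_min) {
    size_t i = 0;
    while (left_max[i] && left_max[i] == right_min[i]) i++;
    return i + 1;
}

int main(void) {
    /* left child ends at "abcdefghijk", right begins at "abczzzzzzzz":
     * the truncated separator "abcz" routes probes correctly. */
    size_t n = separator_len("abcdefghijk", "abczzzzzzzz");
    printf("separator: %.*s\n", (int)n, "abczzzzzzzz");
    return 0;
}
```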

    POINTER SWIZZLING

    Nodes use page ids to reference other nodes in the index. The DBMS must get the memory location from the page table during traversal.
    If a page is pinned in the buffer pool, then we can store raw pointers instead of page ids. This avoids address lookups from the page table (see the sketch below).
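One common way to implement this: a child reference is either a page id or, once the page is pinned, a raw pointer, with a spare tag bit distinguishing the two. A sketch; fix_page is a hypothetical buffer-pool call and the tag-bit encoding is an assumption.

```c
#include <stdint.h>

typedef uint64_t ChildRef;                  /* page id OR tagged raw pointer */
#define SWIZZLED_BIT ((uint64_t)1 << 63)    /* unused by canonical addresses */

extern void *fix_page(uint64_t page_id);    /* hypothetical buffer-pool API */

/* Replace a page-id reference with the pinned page's memory address. */
static ChildRef swizzle(void *node_in_memory) {
    return (uint64_t)(uintptr_t)node_in_memory | SWIZZLED_BIT;
}

static void *resolve(ChildRef ref) {
    if (ref & SWIZZLED_BIT)                 /* fast path: raw pointer */
        return (void *)(uintptr_t)(ref & ~SWIZZLED_BIT);
    return fix_page(ref);                   /* slow path: page-table lookup */
}
```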

    BULK INSERT

    The fastest way to build a new B+Tree for an existing table is to first sort the keys and then build the index from the bottom up.
    Keys: 3, 7, 9, 13, 6, 1
    Sorted Keys: 1, 3, 6, 7, 9, 13
    A sketch of the sort-and-pack pass follows.
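A sketch of the first pass, assuming the LeafNode layout from the earlier snippets (restated): sort once, then slice the sorted run into leaves left to right; inner levels are then built the same way over the leaves' first keys.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LEAF_CAPACITY 64
typedef struct {                       /* LeafNode from the earlier sketch */
    uint16_t level, num_slots;
    uint32_t prev_page_id, next_page_id;
    int32_t  keys[LEAF_CAPACITY];
    uint64_t vals[LEAF_CAPACITY];
} LeafNode;

static int cmp_i32(const void *p, const void *q) {
    int32_t a = *(const int32_t *)p, b = *(const int32_t *)q;
    return (a > b) - (a < b);
}

/* Sort, then pack full leaves in order; returns the number of leaves.
 * Values, sibling links, and the inner levels are elided. */
static int bulk_build_leaves(int32_t *keys, int n, LeafNode *leaves) {
    qsort(keys, n, sizeof keys[0], cmp_i32);   /* 3,7,9,13,6,1 -> 1,3,6,... */
    int nleaves = 0;
    for (int i = 0; i < n; i += LEAF_CAPACITY) {
        LeafNode *leaf = &leaves[nleaves++];
        int count = (n - i < LEAF_CAPACITY) ? n - i : LEAF_CAPACITY;
        memcpy(leaf->keys, keys + i, count * sizeof keys[0]);
        leaf->num_slots = (uint16_t)count;
        leaf->level = 0;
    }
    return nleaves;   /* next: build inner nodes over leaves[0..nleaves) */
}
```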

    OBSERVATION

    Modifying a B+Tree is expensive when the DBMS has to split/merge nodes. $\rightarrow$ The worst case is when the DBMS reorganizes the entire tree. $\rightarrow$ The worker that causes a split/merge is responsible for doing the work.
    What if there was a way to delay updates and then apply multiple changes together in a batch?

    WRITE-OPTIMIZED B+TREE

    Instead of immediately applying updates, store changes to key/value entries in log buffers at inner nodes.
    $\rightarrow$ aka Fractal Trees / Bε-Trees.
    Updates cascade down to lower nodes incrementally when the buffers get full.
    $\rightarrow$ Used in systems such as Tokutek and SplinterDB.
    $\rightarrow$ Example operations: Insert 7, Delete 10, Find 10, Insert 40.
    Note that a point lookup (e.g., Find 10) must also check the buffers along its root-to-leaf path, since a buffered Delete 10 may shadow an entry still present in a leaf. A sketch of the buffering step follows.
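A minimal sketch of the buffering step only; the message types, buffer size, and flush_to_children are assumptions, and the inner node's separator keys and child pointers are elided.

```c
#include <stdint.h>

typedef enum { MSG_INSERT, MSG_DELETE } MsgKind;
typedef struct { MsgKind kind; int32_t key; uint64_t value; } Msg;

#define BUF_CAPACITY 16

struct InnerNode {
    int num_msgs;
    Msg buffer[BUF_CAPACITY];   /* pending updates for this subtree */
    /* ... separator keys and child pointers elided ... */
};

/* Push all buffered messages one level down in a batch, partitioned among
 * the children this subtree routes to (hypothetical, recursive). */
extern void flush_to_children(struct InnerNode *node);

/* An update is absorbed at the node in O(1); it only cascades toward the
 * leaves when the buffer fills, amortizing the write cost. */
static void buffered_update(struct InnerNode *node, Msg m) {
    if (node->num_msgs == BUF_CAPACITY)
        flush_to_children(node);            /* empties node->buffer */
    node->buffer[node->num_msgs++] = m;
}
```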

    CONCLUSION

    The venerable B+Tree is (almost) always a good choice for your DBMS.

    NEXT CLASS

    Bloom Filters
    Tries / Radix Trees / Patricia Trees
    Skip Lists
    Inverted Indexes
    Vector Indexes