    lecture-08-slides.pdf
    Carnegie Mellon University
    Database Systems
    Beautiful B+Trees

    ADMINISTRIVIA

    Project #2 out later today; due Sunday March 2$^{\text{nd}}$ @ 11
    $\rightarrow$ Don't forget to do a GitHub "pull" before starting
    Homework #3 (indices and filters) will be released on Wednesday February 19$^{\text{th}}$
    Mid-term Exam on Wednesday February 26$^{\text{th}}$
    $\rightarrow$ In-class in this room
    $\rightarrow$ More info next week

    New Seminar Series: Every Tue @ 4

    Schedule

    https://cmu.zoom.us/j/93441451665 (Passcode 261758)
    Today's seminar: Larry Ellison was Right (kinda)! TypeScript Stored Procedures for the Modern Age
    Speaker: James Cowling, CTO, Convex.

    LAST CLASS

    Hash tables are important data structures that are used all throughout a DBMS.
    $\rightarrow$ Space Complexity: $O(n)$
    $\rightarrow$ Average Time Complexity: $O(1)$
    Static vs. Dynamic Hashing schemes.
    DBMSs mostly use hash tables for their internal data structures.

    INDEXES VS. FILTERS

    An index is a data structure built on a subset of a table's attributes that is organized and/or sorted for efficient retrieval of the location of specific tuples using those attributes. $\rightarrow$ Example: B+Tree
    A filter is a data structure that answers set membership queries; it tells you whether a record (likely) exists for a key but not where it is located. $\rightarrow$ Example: Bloom Filter

    TODAY'S AGENDA

    B+Tree Overview
    Design Choices
    Optimizations

    B-TREE FAMILY

    There is a specific data structure called a B-Tree.
    People also use the term to generally refer to a class of balanced tree data structures:
    $\rightarrow$ B-Tree (1970) $\rightarrow$ B+Tree (1973) $\rightarrow$ B*Tree (1977?) $\rightarrow$ B-link Tree (1981) $\rightarrow$ Bε-Tree (2003) $\rightarrow$ Bw-Tree (2013)

    ORGANIZATION AND MAINTENANCE OF LARGE ORDERED INDICES
    by R. Bayer and E. McCreight
    Mathematical and Information Sciences Report No. 20, Mathematical and Information Sciences Laboratory, Boeing Scientific Research Laboratories, July 1970


    The Ubiquitous B-Tree

    DOUGLAS COMER
    Computer Science Department, Purdue University, West Lafayette, Indiana 47907
    B-trees have become, de facto, a standard for file organization. File indexes of users, dedicated database systems, and general-purpose access methods have all been proposed and implemented using B-trees. This paper reviews B-trees and shows why they have been so successful. It discusses the major variations of the B-tree, especially the B+-tree, contrasting the relative merits and costs of each implementation. It illustrates a general purpose access method which uses a B-tree.
    Keywords and Phrases: B-tree, B*-tree, B+-tree, file organization, index

    INTRODUCTION

    The secondary storage facilities available on large computer systems allow users to store, update, and recall data from large collections of information called files. A computer must retrieve an item and place it in main memory before it can be processed. In order to make good use of the computer resources, one must organize files intelligently, making the retrieval process efficient.
    The choice of a good file organization depends on the kinds of retrieval to be performed. There are two broad classes of retrieval commands which can be illustrated by the following examples:
    Sequential: "From our employee file, prepare a list of all employees' names and addresses," and "From our employee file, extract the information about employee J. Smith".
    We can imagine a filing cabinet with three drawers of folders, one folder for each employee. The drawers might be labeled "A- G," "H- R," and "S- Z," while the folders might be labeled with the employees' last names. A sequential request requires the searcher to examine the entire file, one folder at a time. On the other hand, a random request implies that the searcher, guided by the labels on the drawers and folders, need only extract one folder.
    Associated with a large, randomly accessed file in a computer system is an index which, like the labels on the drawers and folders of the file cabinet, speeds retrieval by directing the searcher to the small part of the file containing the desired item. Figure 1 depicts a file and its index. An index may be physically integrated with the file, like the labels on employee folders or physically separate, like the labels on the drawers. Usually the index itself is a file. If the index file is large, another index may be built on top of it to speed retrieval further, and so on. The resulting hierarchy is similar to the employee file, where the topmost index consists of labels on drawers, and the next level of index consists of labels on folders.
    Natural hierarchies, like the one formed by considering last names as index entries, do not always produce the best perform…


    Efficient Locking for Concurrent Operations on B-Trees

    PHILIP L. LEHMAN Carnegie- Mellon University and S. BING YAO Purdue University
    The B-tree and its variants have been found to be highly useful (both theoretically and in practice) for storing large amounts of information, especially on secondary storage devices. We examine the problem of overcoming the inherent difficulty of concurrent operations on such structures, using a practical storage model. A single additional "link" pointer in each node allows a process to easily recover from tree modifications performed by other concurrent processes. In our solution, only a (small) constant number of nodes are locked by any update process at any given time. An informal correctness proof of our system is given.
    Key Words and Phrases: database, data structures, B-tree, index organizations, concurrent algorithms, concurrency controls, locking protocols, correctness, consistency, multiway search trees

    1. INTRODUCTION

    The B-tree [2] and its variants have been widely used in recent years as a data structure for storing large files of information, especially on secondary storage devices [7]. The guaranteed small (average) search, insertion, and deletion time for these structures makes them quite appealing for database applications. A topic of current interest in database design is the construction of databases that can be manipulated concurrently and correctly by several processes. In this paper, we consider a simple variant of the B-tree (actually of the B*-tree, proposed by W. J. Lippin [10]) especially well suited for use in a concurrent database system.


    B-TREE FAMILY

    postgres/src/backend/access/nbtree/README:
    "This directory contains a correct implementation of Lehman and Yao's high-concurrency B-tree management algorithm (P. Lehman and S. Yao, Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions on Database Systems, Vol 6, No. 4, December 1981, pp 650-670). We also use a simplified version of the deletion logic described in Lanin and Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm, Proceedings of 1986 Fall Joint Computer Conference, pp 380-389). … The basic Lehman & Yao Algorithm …"

    B+TREE

    A B+Tree is a self-balancing, ordered m-way tree that supports searches, sequential access, insertions, and deletions in $O(\log_m n)$, where m is the tree fanout.
    $\rightarrow$ It is perfectly balanced (i.e., every leaf node is at the same depth in the tree).
    $\rightarrow$ Every node other than the root is at least half-full: $\lceil m/2 \rceil - 1 \leq$ #keys $\leq m - 1$.
    $\rightarrow$ Every inner node with k keys has k+1 non-null children.
    $\rightarrow$ Optimized for reading/writing large data blocks.
    Some real-world implementations relax these properties, but we will ignore that for now...

    B+TREE EXAMPLE

    NODES

    Every B+Tree node is comprised of an array of key/value pairs.
    $\rightarrow$ The keys are derived from the index's target attribute(s).
    $\rightarrow$ The values differ based on whether the node is an inner node or a leaf node.
    The arrays are (usually) kept in sorted key order.
    All NULL keys are stored in either the first or the last leaf nodes.

    B+TREE LEAF NODES

    A leaf node begins with a small header, followed by key/value pairs kept in sorted key order:

    Level | Slots | Prev | Next
    K1:V1 | K2:V2 | K3:V3 | K4:V4 | K5:V5 | ...

    $\rightarrow$ The Prev and Next pointers link each leaf to its siblings, forming a doubly-linked list across the leaf level. A struct sketch follows.
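A minimal C sketch of this layout, assuming fixed-size pages and 32-bit integer keys; the field names and LEAF_CAPACITY value are illustrative, not from the slides.

```c
#include <stdint.h>

#define LEAF_CAPACITY 64   /* assumption: fanout chosen so a node fills one page */

/* One fixed-size B+Tree leaf page, mirroring the slide's layout:
 * a small header (level, slot count, prev/next sibling pointers)
 * followed by key/value pairs kept in sorted key order. */
typedef struct {
    uint16_t level;               /* 0 for leaf nodes */
    uint16_t num_slots;           /* number of occupied key/value slots */
    uint32_t prev_page_id;        /* left sibling (for reverse scans) */
    uint32_t next_page_id;        /* right sibling (for range scans) */
    int32_t  keys[LEAF_CAPACITY]; /* sorted keys */
    uint64_t vals[LEAF_CAPACITY]; /* e.g. record IDs: (page id, slot) packed */
} LeafNode;
```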

    LEAF NODE VALUES

    Approach #1: Record IDs
    $\rightarrow$ A pointer to the location of the tuple to which the index entry corresponds.
    $\rightarrow$ Most common implementation.
    $\rightarrow$ Examples: PostgreSQL, Microsoft SQL Server, IBM DB2, Oracle.

    Approach #2: Tuple Data
    $\rightarrow$ Index-Organized Storage (Lecture #04)
    $\rightarrow$ Primary Key Index: Leaf nodes store the contents of the tuple.
    $\rightarrow$ Secondary Indexes: Leaf nodes store tuples' primary key as their values.
    $\rightarrow$ Examples: MySQL, Oracle.

    B-TREE VS. B+TREE

    The original B-Tree from 1970 stored keys and values in all nodes in the tree. $\rightarrow$ More space-efficient, since each key only appears once in the tree.
    A B+Tree only stores values in leaf nodes. Inner nodes only guide the search process.

    B+TREE - INSERT

    1. Find correct leaf node L.
    2. Insert new entry into L in sorted order.
    3. If L has enough space, done!
    4. Otherwise, split L's keys into L and a new node L2.
    $\rightarrow$ Redistribute entries evenly, copy up middle key.
    $\rightarrow$ Insert index entry pointing to L2 into parent of L.
    To split an inner node, redistribute entries evenly, but push up the middle key. A sketch of the leaf-split step appears below.
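A minimal sketch of step 4 for a leaf, assuming the LeafNode layout from the earlier snippet (restated here so the code stands alone); page ids are supplied by the caller.

```c
#include <stdint.h>
#include <string.h>

#define LEAF_CAPACITY 64
typedef struct {                       /* LeafNode from the earlier sketch */
    uint16_t level, num_slots;
    uint32_t prev_page_id, next_page_id;
    int32_t  keys[LEAF_CAPACITY];
    uint64_t vals[LEAF_CAPACITY];
} LeafNode;

/* Split overfull leaf L into L and a fresh node L2: redistribute entries
 * evenly and return the "copy up" key (the smallest key in L2), which the
 * caller inserts into L's parent next to a pointer to L2. */
static int32_t leaf_split(LeafNode *L, uint32_t L_pid,
                          LeafNode *L2, uint32_t L2_pid) {
    int mid   = L->num_slots / 2;
    int moved = L->num_slots - mid;

    memcpy(L2->keys, &L->keys[mid], moved * sizeof L->keys[0]);
    memcpy(L2->vals, &L->vals[mid], moved * sizeof L->vals[0]);
    L2->num_slots = (uint16_t)moved;
    L->num_slots  = (uint16_t)mid;

    /* Stitch the sibling chain: L <-> L2 <-> old right sibling. */
    L2->level        = 0;
    L2->next_page_id = L->next_page_id;
    L2->prev_page_id = L_pid;
    L->next_page_id  = L2_pid;

    return L2->keys[0];   /* leaf splits *copy* the middle key up */
}
```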

    B+TREE – INSERT EXAMPLE (1)

    Insert 6
    Insert 8

    B+TREE – INSERT EXAMPLE (2)

    Insert 17
    Note: New Example/Tree.

    B+TREE – INSERT EXAMPLE (3)

    Insert 17
    Insert 16
    $\rightarrow$ No space in the node where the new key "belongs".
    $\rightarrow$ New node! Shuffle keys from the node that triggered the split.
    $\rightarrow$ We want to create a key/pointer pair for the new node, but cannot insert it into the root node, which is full, so the root must split as well.

    B+TREE - DELETE

    1. Start at root, find leaf L where entry belongs.
    2. Remove the entry.
    3. If L is at least half-full, done!
    4. If L is under-filled (fewer than the minimum number of entries):
    $\rightarrow$ Try to re-distribute, borrowing from a sibling (adjacent node with the same parent as L).
    $\rightarrow$ If re-distribution fails, merge L and the sibling.
    5. If a merge occurred, must delete the entry (pointing to L or sibling) from the parent of L.
    A sketch of the borrow-or-merge step appears below.
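A minimal sketch of step 4, assuming the LeafNode layout from the earlier snippets (restated) and that `sib` is L's right sibling under the same parent; parent maintenance is left to the caller as described.

```c
#include <stdint.h>
#include <string.h>

#define LEAF_CAPACITY 64
typedef struct {                       /* LeafNode from the earlier sketch */
    uint16_t level, num_slots;
    uint32_t prev_page_id, next_page_id;
    int32_t  keys[LEAF_CAPACITY];
    uint64_t vals[LEAF_CAPACITY];
} LeafNode;

/* Returns 1 if a merge happened (caller deletes sib's separator entry from
 * the parent); returns 0 after a redistribution (caller updates that
 * separator to sib->keys[0]). */
static int leaf_underflow(LeafNode *L, LeafNode *sib) {
    const int min_slots = LEAF_CAPACITY / 2;
    if (L->num_slots >= min_slots) return 0;            /* still half-full */

    if (sib->num_slots > min_slots) {                   /* borrow smallest */
        L->keys[L->num_slots] = sib->keys[0];
        L->vals[L->num_slots] = sib->vals[0];
        L->num_slots++;
        sib->num_slots--;
        memmove(sib->keys, sib->keys + 1, sib->num_slots * sizeof sib->keys[0]);
        memmove(sib->vals, sib->vals + 1, sib->num_slots * sizeof sib->vals[0]);
        return 0;
    }

    /* Redistribution failed: merge sib into L and unlink sib. */
    memcpy(L->keys + L->num_slots, sib->keys, sib->num_slots * sizeof sib->keys[0]);
    memcpy(L->vals + L->num_slots, sib->vals, sib->num_slots * sizeof sib->vals[0]);
    L->num_slots += sib->num_slots;
    L->next_page_id = sib->next_page_id;
    return 1;
}
```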

    B+TREE - DELETE EXAMPLE (1)

    Delete 6

    B+TREE - DELETE EXAMPLE (2)

    Note: New Example/Tree.
    Need to update parent node!

    B+TREE - DELETE EXAMPLE (3)

    Delete 15
    Delete 19
    $\rightarrow$ Under-filled! No "rich" sibling nodes to borrow from, so merge with a sibling.
    $\rightarrow$ The parent node is now under-filled as well! Pull down its separator key.

    COMPOSITE INDEX

    A composite index is one whose key is comprised of two or more attributes.
    $\rightarrow$ Example: Index on <a,b,c>

    CREATE INDEX my_idx ON xxx (a, b DESC, c NULLS FIRST);

    The DBMS can use a B+Tree index if the query provides a "prefix" of the composite key.
    $\rightarrow$ Supported: (a=1 AND b=2 AND c=3)
    $\rightarrow$ Supported: (a=1 AND b=2)
    $\rightarrow$ Rarely Supported: (b=2), (c=3)
    A comparator sketch follows this list.
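Why a prefix works: composite keys sort lexicographically, so comparing only the attributes the query binds still yields one contiguous key range. A sketch with a hypothetical 3-attribute key; DESC and NULLS FIRST ordering from the example DDL is ignored here for simplicity.

```c
#include <stdint.h>

typedef struct { int32_t a, b, c; } CompositeKey;  /* hypothetical <a,b,c> key */

/* Compare only the first `nbound` attributes (1..3). Queries binding a
 * prefix of the key use the same index by comparing fewer attributes. */
static int composite_cmp(const CompositeKey *x, const CompositeKey *y,
                         int nbound) {
    if (nbound >= 1 && x->a != y->a) return x->a < y->a ? -1 : 1;
    if (nbound >= 2 && x->b != y->b) return x->b < y->b ? -1 : 1;
    if (nbound >= 3 && x->c != y->c) return x->c < y->c ? -1 : 1;
    return 0;   /* equal on all bound attributes -> inside the range */
}
```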

    SELECTION CONDITIONS

    Find Key=(1,2)
    Find Key=(1,*)
    Find Key=(*,1)
    The sketch below shows why the first two map to a single range scan while the last one cannot use the index's sort order.
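A sketch over a flat sorted array standing in for the leaf level: with keys sorted on (a, b), all entries with a=1 are contiguous, so Find Key=(1,*) seeks once and scans a range, while Find Key=(*,1) has no usable prefix and must check every entry.

```c
#include <stdio.h>

typedef struct { int a, b; } Key;

static void find_a_eq(const Key *keys, int n, int a) {
    int i = 0;
    while (i < n && keys[i].a < a) i++;      /* seek (binary search in practice) */
    for (; i < n && keys[i].a == a; i++)     /* contiguous range scan */
        printf("match (%d,%d)\n", keys[i].a, keys[i].b);
}

static void find_b_eq(const Key *keys, int n, int b) {
    for (int i = 0; i < n; i++)              /* full scan: no usable prefix */
        if (keys[i].b == b)
            printf("match (%d,%d)\n", keys[i].a, keys[i].b);
}

int main(void) {
    Key keys[] = {{1,1},{1,2},{1,3},{2,1},{2,2},{3,1},{3,2}};
    int n = sizeof keys / sizeof keys[0];
    find_a_eq(keys, n, 1);   /* Find Key=(1,*) */
    find_b_eq(keys, n, 1);   /* Find Key=(*,1) */
    return 0;
}
```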

    B+TREE - DUPLICATE KEYS

    Approach #1: Append Record ID

    → Add the tuple's unique Record ID as part of the key to ensure that all keys are unique. → The DBMS can still use partial keys to find tuples.

    Approach #2: Overflow Leaf Nodes

    → Allow leaf nodes to spill into overflow nodes that contain the duplicate keys. → This is more complex to maintain and modify.

    B+TREE - APPEND RECORD ID

    Insert 6 becomes Insert <6, (Page, Slot)>, so every key stored in the tree is unique.

    B+TREE - OVERFLOW LEAF NODES

    Insert 6
    Insert 7
    Insert 6 $\rightarrow$ the duplicate key spills into an overflow node linked from the full leaf.

    CLUSTERED INDEXES

    The table is stored in the sort order specified by the primary key.
    $\rightarrow$ Can be either heap- or index- organized storage.
    Some DBMSs always use a clustered index. $\rightarrow$ If a table does not contain a primary key, the DBMS will automatically make a hidden primary key.
    Other DBMSs cannot use them at all.

    CLUSTERED B+TREE

    Traverse to the left-most leaf page and then retrieve tuples from all leaf pages.
    This will always be better than sorting data for each query.

    INDEX SCAN PAGE SORTING

    Retrieving tuples in the order they appear in a non-clustered index is inefficient due to redundant reads.
    A better approach is to find all the tuples that the query needs and then sort them based on their page ID.
    The DBMS then retrieves each page once, as in the sketch below.
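A minimal sketch of the sort step; RecordId is a hypothetical (page id, slot) pair and fetch_page a hypothetical buffer-pool call.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t page_id; uint16_t slot; } RecordId;

static int by_page(const void *p, const void *q) {
    const RecordId *x = p, *y = q;
    return (x->page_id > y->page_id) - (x->page_id < y->page_id);
}

/* Sort the record IDs produced by the index scan by page id before
 * fetching, so matches on the same page are read back-to-back and each
 * page costs (at most) one I/O instead of one per match. */
static void fetch_sorted(RecordId *rids, size_t n) {
    qsort(rids, n, sizeof rids[0], by_page);
    for (size_t i = 0; i < n; i++) {
        /* fetch_page(rids[i].page_id); consecutive duplicates now hit the
         * same buffered page (hypothetical buffer-pool API) */
    }
}
```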

    B+TREE DESIGN CHOICES

    Node Size
    Merge Threshold
    Variable-Length Keys
    Intra-Node Search

    NODE SIZE

    The slower the storage device, the larger the optimal node size for a B+Tree.
    $\rightarrow$ HDD: ~1MB
    $\rightarrow$ SSD: ~10KB
    $\rightarrow$ In-Memory: ~512B
    Optimal sizes can vary depending on the workload.
    $\rightarrow$ Leaf Node Scans vs. Root-to-Leaf Traversals

    MERGE THRESHOLD

    Some DBMSs do not always merge nodes when they are half full.
    $\rightarrow$ Average occupancy rate for B+Tree nodes is 69%.
    Delaying a merge operation may reduce the amount of reorganization.
    It may also be better to let under-filled nodes exist and then periodically rebuild the entire tree.
    This is why PostgreSQL calls its B+Tree a "non-balanced" B+Tree (nbtree).

    VARIABLE-LENGTH KEYS

    Approach #1: Pointers

    $\rightarrow$ Store the keys as pointers to the tuple's attribute. $\rightarrow$ Also called T-Trees (in-memory DBMSs)

    Approach #2: Variable-Length Nodes

    $\rightarrow$ The size of each node in the index can vary. $\rightarrow$ Requires careful memory management.

    Approach #3: Padding

    $\rightarrow$ Always pad the key to be the max length of the key type.

    Approach #4: Key Map / Indirection

    $\rightarrow$ Embed an array of pointers that map to the key + value list within the node.

    INTRA-NODE SEARCH

    Approach #1: Linear
    $\rightarrow$ Scan node keys from beginning to end.
    $\rightarrow$ Use SIMD to vectorize comparisons, e.g., _mm_cmpeq_epi32_mask(a, b).
    $\rightarrow$ Example: Find Key=8 in a node with keys [4, 5, 6, 7, 8, 9, 10].
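A sketch of the vectorized scan. The slide's _mm_cmpeq_epi32_mask is an AVX-512 intrinsic; this version uses the older SSE2 equivalents (_mm_cmpeq_epi32 plus a movemask) and the GCC/Clang builtin __builtin_ctz, so it compiles on any x86-64 toolchain — those substitutions are mine, not the slide's.

```c
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* Compare four keys per step against the probe key.
 * Returns the index of `key` in the node's key array, or -1. */
static int simd_linear_search(const int32_t *keys, int n, int32_t key) {
    __m128i probe = _mm_set1_epi32(key);            /* [key, key, key, key] */
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i block = _mm_loadu_si128((const __m128i *)(keys + i));
        __m128i eq    = _mm_cmpeq_epi32(block, probe);
        int mask = _mm_movemask_ps(_mm_castsi128_ps(eq)); /* one bit per lane */
        if (mask) return i + __builtin_ctz(mask);
    }
    for (; i < n; i++)                              /* scalar tail */
        if (keys[i] == key) return i;
    return -1;
}
```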

    Approach #2: Binary
    $\rightarrow$ Jump to middle key, pivot left/right depending on comparison.
    $\rightarrow$ Example: Find Key=8.
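What Approach #2 looks like in code: a standard lower-bound binary search over the node's sorted keys (the function name is my own).

```c
#include <stdint.h>

/* Returns the index of the first key >= probe (a lower bound). On a hit,
 * keys[result] == key; in an inner node the result doubles as the child
 * slot to follow. */
static int node_lower_bound(const int32_t *keys, int n, int32_t key) {
    int lo = 0, hi = n;                /* search window [lo, hi) */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;  /* avoids overflow of lo + hi */
        if (keys[mid] < key) lo = mid + 1;
        else                 hi = mid;
    }
    return lo;                         /* n means "past the end" */
}
```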

    Approach #3: Interpolation
    $\rightarrow$ Approximate location of desired key based on known distribution of keys.
    $\rightarrow$ Example: Find Key=8 in keys [4, 5, 6, 7, 8, 9, 10]: Offset = (8-4)*7/(10-4) = 4.
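A sketch of one interpolation probe under a roughly-uniform-keys assumption, using an (n-1)-scaled variant of the slide's offset formula; production implementations (like the SIP/TIP algorithms in the paper excerpted next) re-interpolate instead of falling back to a scan around the guess.

```c
#include <stdint.h>

static int interpolation_search(const int32_t *keys, int n, int32_t key) {
    if (n == 0 || key < keys[0] || key > keys[n - 1]) return -1;
    int64_t span = (int64_t)keys[n - 1] - keys[0];
    /* e.g. keys 4..10, key 8: pos = (8-4)*(7-1)/(10-4) = 4 */
    int pos = span ? (int)((int64_t)(key - keys[0]) * (n - 1) / span) : 0;
    while (pos > 0     && keys[pos] > key) pos--;   /* guessed too far right */
    while (pos < n - 1 && keys[pos] < key) pos++;   /* guessed too far left  */
    return keys[pos] == key ? pos : -1;
}
```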

    Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search?

    Peter Van Sandt, Yannis Chronis, Jignesh M. Patel
    Department of Computer Sciences, University of Wisconsin-Madison
    {van-sandt,chronis,jignesh}@cs.wisc.edu

    ABSTRACT

    In this paper, we focus on the problem of searching sorted, in-memory datasets. This is a key data operation, and Binary Search is the de facto algorithm that is used in practice. We consider an alternative, namely Interpolation Search, which can take advantage of hardware trends by using complex calculations to save memory accesses. Historically, Interpolation Search was found to underperform compared to other search algorithms in this setting, despite its superior asymptotic complexity. Also, Interpolation Search is known to perform poorly on non-uniform data. To address these issues, we introduce SIP (Slope Reuse Interpolation), an optimized implementation of Interpolation Search, and TIP (Three Point Interpolation), a new search algorithm that uses linear fractions to interpolate on non-uniform distributions. We evaluate these two algorithms against a similarly optimized Binary Search method using a variety of real and synthetic datasets. We show that SIP is up to 4 times faster on uniformly distributed data and TIP is 2-3 times faster on non-uniformly distributed data in some cases. We also design a meta-algorithm to switch between these different methods to automate picking the higher performing search algorithm, which depends on factors like data distribution.

    CCS CONCEPTS

    $\cdot$ Information systems $\rightarrow$ Point lookups; Main memory engines.

    KEYWORDS

    In- memory search; Interpolation Search; Binary Search

    ACM Reference Format:

    Peter Van Sandt, Yannis Chronis, Jignesh M. Patel. 2019. Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search? In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19), June 30-July 5, 2019, Amsterdam, Netherlands. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3299869.3300075
    [Figure 1: Speed comparison of representative processor and main memory technologies [27]. Processor performance is measured in FLOPs; main memory performance is measured as peak FLOPs to sustained memory bandwidth (GFLOP/sec)/(Words/sec) and as peak FLOPs per idle memory latency (GFLOP/sec)*sec. On the conventional von Neumann architectural path, main memory is poised to become (relatively) slower compared to the speed of computing inside processors.]

    1 INTRODUCTION

    Searching in-memory, sorted datasets is a fundamental data operation [23]. Today, Binary Search is the de facto search method that is used in practice, as it is efficient and asymptotically optimal in the worst case. Binary Search is a primitive in many popular data systems and frameworks (e.g., LevelDB [25] and Pandas [30]).
    Designing algorithms around hardware trends can yield significant performance gains. A key technological trend is the diverging CPU and memory speeds, which is illustrated in Figure 1. This trend favors algorithms that can use more computation to reduce memory accesses [4, 6, 16, 21, 27, 38]. The focus of this paper is on exploring the impact of this trend…

    OPTIMIZATIONS

    Prefix Compression
    Deduplication
    Suffix Truncation
    Pointer Swizzling
    Bulk Insert
    Buffered Updates
    Many more...

    PREFIX COMPRESSION

    Sorted keys in the same leaf node are likely to have the same prefix.
    Instead of storing the entire key each time, extract the common prefix and store only the unique suffix for each key.
    $\rightarrow$ Many variations.
    $\rightarrow$ Example: robbed, robbing, robot $\rightarrow$ prefix "rob" + suffixes "bed", "bing", "ot" (see the sketch below).
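A sketch of just the prefix computation for one leaf's string keys, using the slide's example.

```c
#include <stdio.h>
#include <string.h>

/* Longest common prefix of a leaf's sorted keys; everything after it is
 * stored per key as the unique suffix. */
static size_t common_prefix_len(const char *const *keys, size_t n) {
    if (n == 0) return 0;
    size_t len = strlen(keys[0]);
    for (size_t i = 1; i < n; i++) {
        size_t j = 0;
        while (j < len && keys[i][j] == keys[0][j]) j++;
        len = j;                                   /* prefix can only shrink */
    }
    return len;
}

int main(void) {
    const char *keys[] = {"robbed", "robbing", "robot"};
    size_t plen = common_prefix_len(keys, 3);
    printf("prefix: %.*s\n", (int)plen, keys[0]);  /* "rob" */
    for (int i = 0; i < 3; i++)
        printf("suffix: %s\n", keys[i] + plen);    /* bed, bing, ot */
    return 0;
}
```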

    DEDUPLICATION

    Non-unique indexes can end up storing multiple copies of the same key in leaf nodes.
    The leaf node can store the key once and then maintain a "posting list" of tuples with that key (similar to what we discussed for hash tables).
    $\rightarrow$ Example: the entries [K1:V1, K1:V2, K1:V3, K2:V4] become K1:[V1, V2, V3], K2:[V4].

    SUFFIX TRUNCATION

    The keys in the inner nodes are only used to "direct traffic".
    $\rightarrow$ We don't need the entire key.
    Store the minimum prefix that is needed to correctly route probes into the index (see the sketch below).
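A sketch for string keys: a separator only has to fall between the largest key of the left child and the smallest key of the right child, so the shortest prefix of the right key that differs from the left key suffices. The example keys are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Length of the shortest prefix of right_min that still compares strictly
 * greater than left_max (first differing position, inclusive). */
static size_t separator_len(const char *left_max, const char *right_min) {
    size_t i = 0;
    while (left_max[i] && left_max[i] == right_min[i]) i++;
    return i + 1;
}

int main(void) {
    /* left child ends at "abcdefghijk", right begins at "abczzzzzzzz":
     * the truncated separator "abcz" routes probes correctly. */
    size_t n = separator_len("abcdefghijk", "abczzzzzzzz");
    printf("separator: %.*s\n", (int)n, "abczzzzzzzz");
    return 0;
}
```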

    POINTER SWIZZLING

    Nodes use page ids to reference other nodes in the index. The DBMS must get the memory location from the page table during traversal.
    If a page is pinned in the buffer pool, then we can store raw pointers instead of page ids. This avoids address lookups from the page table (see the sketch below).
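One common way to implement this: a child reference is either a page id or, once the page is pinned, a raw pointer, with a spare tag bit distinguishing the two. A sketch; fix_page is a hypothetical buffer-pool call and the tag-bit encoding is an assumption.

```c
#include <stdint.h>

typedef uint64_t ChildRef;                  /* page id OR tagged raw pointer */
#define SWIZZLED_BIT ((uint64_t)1 << 63)    /* unused by canonical addresses */

extern void *fix_page(uint64_t page_id);    /* hypothetical buffer-pool API */

/* Replace a page-id reference with the pinned page's memory address. */
static ChildRef swizzle(void *node_in_memory) {
    return (uint64_t)(uintptr_t)node_in_memory | SWIZZLED_BIT;
}

static void *resolve(ChildRef ref) {
    if (ref & SWIZZLED_BIT)                 /* fast path: raw pointer */
        return (void *)(uintptr_t)(ref & ~SWIZZLED_BIT);
    return fix_page(ref);                   /* slow path: page-table lookup */
}
```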

    BULK INSERT

    The fastest way to build a new B+Tree for an existing table is to first sort the keys and then build the index from the bottom up.
    Keys: 3, 7, 9, 13, 6, 1
    Sorted Keys: 1, 3, 6, 7, 9, 13
    A sketch of the sort-and-pack pass follows.
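A sketch of the first pass, assuming the LeafNode layout from the earlier snippets (restated): sort once, then slice the sorted run into leaves left to right; inner levels are then built the same way over the leaves' first keys.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LEAF_CAPACITY 64
typedef struct {                       /* LeafNode from the earlier sketch */
    uint16_t level, num_slots;
    uint32_t prev_page_id, next_page_id;
    int32_t  keys[LEAF_CAPACITY];
    uint64_t vals[LEAF_CAPACITY];
} LeafNode;

static int cmp_i32(const void *p, const void *q) {
    int32_t a = *(const int32_t *)p, b = *(const int32_t *)q;
    return (a > b) - (a < b);
}

/* Sort, then pack full leaves in order; returns the number of leaves.
 * Values, sibling links, and the inner levels are elided. */
static int bulk_build_leaves(int32_t *keys, int n, LeafNode *leaves) {
    qsort(keys, n, sizeof keys[0], cmp_i32);   /* 3,7,9,13,6,1 -> 1,3,6,... */
    int nleaves = 0;
    for (int i = 0; i < n; i += LEAF_CAPACITY) {
        LeafNode *leaf = &leaves[nleaves++];
        int count = (n - i < LEAF_CAPACITY) ? n - i : LEAF_CAPACITY;
        memcpy(leaf->keys, keys + i, count * sizeof keys[0]);
        leaf->num_slots = (uint16_t)count;
        leaf->level = 0;
    }
    return nleaves;   /* next: build inner nodes over leaves[0..nleaves) */
}
```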

    OBSERVATION

    Modifying a B+Tree is expensive when the DBMS has to split/merge nodes. $\rightarrow$ The worst case is when the DBMS reorganizes the entire tree. $\rightarrow$ The worker that causes a split/merge is responsible for doing the work.
    What if there was a way to delay updates and then apply multiple changes together in a batch?

    WRITE-OPTIMIZED B+TREE

    Instead of immediately applying updates, store changes to key/value entries in log buffers at inner nodes.
    $\rightarrow$ aka Fractal Trees / Bε-Trees.
    Updates cascade down to lower nodes incrementally when the buffers get full.
    $\rightarrow$ Used in systems such as Tokutek and SplinterDB.
    $\rightarrow$ Example operations: Insert 7, Delete 10, Find 10, Insert 40.
    Note that a point lookup (e.g., Find 10) must also check the buffers along its root-to-leaf path, since a buffered Delete 10 may shadow an entry still present in a leaf. A sketch of the buffering step follows.
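A minimal sketch of the buffering step only; the message types, buffer size, and flush_to_children are assumptions, and the inner node's separator keys and child pointers are elided.

```c
#include <stdint.h>

typedef enum { MSG_INSERT, MSG_DELETE } MsgKind;
typedef struct { MsgKind kind; int32_t key; uint64_t value; } Msg;

#define BUF_CAPACITY 16

struct InnerNode {
    int num_msgs;
    Msg buffer[BUF_CAPACITY];   /* pending updates for this subtree */
    /* ... separator keys and child pointers elided ... */
};

/* Push all buffered messages one level down in a batch, partitioned among
 * the children this subtree routes to (hypothetical, recursive). */
extern void flush_to_children(struct InnerNode *node);

/* An update is absorbed at the node in O(1); it only cascades toward the
 * leaves when the buffer fills, amortizing the write cost. */
static void buffered_update(struct InnerNode *node, Msg m) {
    if (node->num_msgs == BUF_CAPACITY)
        flush_to_children(node);            /* empties node->buffer */
    node->buffer[node->num_msgs++] = m;
}
```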

    CONCLUSION

    The venerable B+Tree is (almost) always a good choice for your DBMS.

    NEXT CLASS

    Bloom Filters
    Tries / Radix Trees / Patricia Trees
    Skip Lists
    Inverted Indexes
    Vector Indexes