CMU Database - 15445 - 2025 Spring


This collection of documents, "CMU Database - 15445 - 2025 Spring," provides a comprehensive overview of database systems, primarily focusing on the design, implementation, and ...


lecture-07-slides.pdf

Carnegie Mellon University Database Systems Hash Tables

ADMINISTRIVIA

Project #1 is due on February 9th @ 11

Homework #2 is due on February 9th @ 11

COURSE OUTLINE

We are now going to talk about how to support the DBMS's execution engine to read/write data from pages.

Two types of data structures:
$\rightarrow$ Hash Tables (Unordered)
$\rightarrow$ Trees (Ordered)

Query Planning

Operator Execution

Access Methods

Buffer Pool Manager

Disk Manager

TODAY'S AGENDA

Background
Hash Functions
Static Hashing Schemes
Dynamic Hashing Schemes
DB Flash Talk: DataStax

DATA STRUCTURES

Internal Meta-data
Core Data Storage
Temporary Data Structures
Table Indexes

DESIGN DECISIONS

Data Organization

$\rightarrow$ How we lay out the data structure in memory/pages and what information to store to support efficient access.

Concurrency

$\rightarrow$ How to enable multiple threads to access the data structure at the same time without causing problems.

HASH TABLES

A hash table implements an unordered associative array that maps keys to values.

It uses a hash function to compute an offset into this array for a given key, from which the desired value can be found.

Space Complexity: O(n)
Time Complexity:
$\rightarrow$ Average: O(1) $\leftarrow$ Databases care about constants!
$\rightarrow$ Worst: O(n)

STATIC HASH TABLE

Allocate a giant array that has one slot for every element you need to store.

To find an entry, mod the key by the number of elements to find the offset in the array.

hash(key) % N
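The static table above can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the class name `StaticHashTable` and its methods are hypothetical, and Python's built-in `hash()` stands in for the hash function. It deliberately has no collision handling, which is exactly what the assumptions on the next slide gloss over.

```python
class StaticHashTable:
    """Fixed-size table: one slot per element, no collision handling."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.slots = [None] * num_slots  # each slot holds one (key, value) pair

    def _offset(self, key):
        # hash(key) % N from the slide; hash() is a stand-in hash function.
        return hash(key) % self.num_slots

    def put(self, key, value):
        # Overwrites whatever is in the slot (assumes no collisions).
        self.slots[self._offset(key)] = (key, value)

    def get(self, key):
        entry = self.slots[self._offset(key)]
        if entry is not None and entry[0] == key:
            return entry[1]
        return None


table = StaticHashTable(num_slots=8)
for k in range(8):  # small non-negative ints hash to themselves in CPython
    table.put(k, f"value-{k}")
```

With eight slots and the keys 0 through 7, every key gets its own slot, so each `get` is a single array access.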

UNREALISTIC ASSUMPTIONS

Assumption #1: Number of elements is known ahead of time and fixed.

Assumption #2: Each key is unique.

Assumption #3: Perfect hash function guarantees no collisions.
$\rightarrow$ If key1 ≠ key2, then hash(key1) ≠ hash(key2)

hash(key) % N
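A small worked example shows why Assumption #3 rarely holds once the offset is reduced modulo N. The slot count and keys below are illustrative choices, and Python's built-in `hash()` is used only because integer keys hash to themselves in CPython, which keeps the arithmetic visible.

```python
N = 8  # number of slots (illustrative)

def offset(key):
    # hash(key) % N; for small non-negative ints, hash(key) == key in CPython
    return hash(key) % N

keys = [3, 11, 19, 4]
offsets = {k: offset(k) for k in keys}

# Number of keys that had to share a slot with an earlier key.
collisions = len(keys) - len(set(offsets.values()))
```

Keys 3, 11, and 19 are distinct, yet all three reduce to offset 3, so a table without a collision-handling scheme would keep only one of them.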

HASH TABLE

Design Decision #1: Hash Function

$\rightarrow$ How to map a large key space into a smaller domain.
$\rightarrow$ Trade-off between being fast vs. collision rate.

Design Decision #2: Hashing Scheme

$\rightarrow$ How to handle key collisions after hashing.
$\rightarrow$ Trade-off between allocating a large hash table vs. additional instructions to get/put keys.

HASH FUNCTIONS

For any input key, return an integer representation of that key.

$\rightarrow$ Converts arbitrary byte array into a fixed-length code.

We want something that is fast and has a low collision rate.

We do not want to use a cryptographic hash function for DBMS hash tables (e.g., SHA-2). Its security properties cost extra cycles that an internal hash table does not need.
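To make "arbitrary byte array in, fixed-length code out" concrete, here is a sketch using 64-bit FNV-1a. FNV-1a does not appear in these slides; it is chosen only because it is a short, well-known non-cryptographic hash that fits in a few lines. A real DBMS would more likely call an optimized library function.

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3              # standard 64-bit FNV prime
MASK_64 = 0xFFFFFFFFFFFFFFFF           # keep the result within 64 bits


def fnv1a_64(data: bytes) -> int:
    """Map a byte array of any length to a 64-bit integer code."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte                      # mix in one input byte
        h = (h * FNV_PRIME) & MASK_64  # multiply and truncate to 64 bits
    return h


code = fnv1a_64(b"some variable-length key")
slot = code % 16  # reduce the fixed-length code to an offset in a 16-slot table
```

The loop does one XOR and one multiply per input byte, which is the kind of cost profile the slide asks for: cheap per key, with no security guarantees attempted.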

HASH FUNCTIONS

CRC-64 (1975)

$\rightarrow$ Used in networking for error detection.

MurmurHash (2008)

$\rightarrow$ Designed as a fast, general-purpose hash function.

Google CityHash (2011)

$\rightarrow$ Designed to be faster for short keys (<64 bytes).

Facebook XXHash (2012)

$\rightarrow$ From the creator of zstd compression.

Google FarmHash (2014)

$\rightarrow$ Newer version of CityHash with better collision rates.

State-of-the-art: Facebook XXHash (2012)

HASH FUNCTIONS

SMHasher

Hash function	MiB/sec	cycl./hash	cycl./map	size	Quality problems
donothing32	11149460.06	4.00	-	13	bad seed 0, test NOP
donothing64	11787676.42	4.00	-	13	bad seed 0, test NOP
donothing128	11745060.76	4.06	-	13	bad seed 0, test NOP
NOP_OAAT_read64	11372846.37	14.00	-	47	test NOP
BadHash	769.94	73.97	-	47	bad seed 0, test FAIL
sumhash	10699.57	29.53	-	363	bad seed 0, test FAIL
sumhash32	42877.79	23.12	-	863	UB, test FAIL
multiply_shift	8026.77	26.05	226.80 (8)	345	bad seeds & 0xfffffff0, fails most tests
pair_multiply_shift	3716.95	40.22	186.34 (3)	609	fails most tests
crc32	383.12	134.21	257.50 (11)	422	insecure, 8590x collisions, distrib, PerlinNoise
md5_32	350.53	644.31	894.12 (10)	4419

Summary

I added some SSE assisted hashes and fast intel/arm CRC32-C, AES and SHA HW variants. See also the old https://github.com/aappleby/smhasher/wiki, the improved, but unmaintained fork https://github.com/demerphq/smhasher, and the new improved version SMHasher3 https://gitlab.com/fwojcik/smhasher3. So the fastest hash functions on x86_64 without quality problems are:

- rapidhash (an improved wyhash)
- xxh3low
- wyhash
- umash (even universal!)
- ahash64
- t1ha2_atonce
- komihash
- FarmHash (not portable, too machine specific: 64 vs 32bit, old gcc, ...)
- halftime_hash128
- Spooky32
- pengyhash
- nmhash32
- mx3
- MUM/mir (different results on 32/64-bit archs, lots of bad seeds to filter out)
- fasthash32

STATIC HASHING SCHEMES

Approach #1: Linear Probe Hashing
Approach #2: Cuckoo Hashing

There are several other schemes covered in the Advanced DB course:

$\longrightarrow$ Robin Hood Hashing
$\longrightarrow$ Hopscotch Hashing
$\longrightarrow$ Swiss Tables

Linear probe hashing and cuckoo hashing are both open addressing schemes.

LINEAR PROBE HASHING

Single giant table of fixed-length slots.

Resolve collisions by linearly searching for the next free slot in the table.

$\rightarrow$ To determine whether an element is present, hash to a location in the table and scan for it.
$\rightarrow$ Store keys in table to know when to stop scanning.
$\rightarrow$ Insertions and deletions are generalizations of lookups.

The table's load factor determines when it is becoming too full and should be resized.
$\rightarrow$ Allocate a new table twice as large and rehash entries.
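The lookup, insert, and resize behavior described above can be sketched as follows. The class name `LinearProbeHashTable`, the default capacity, and the 0.5 load-factor threshold are assumptions made for illustration, and Python's built-in `hash()` stands in for the hash function. Deletions are left out because they need extra care that this sketch does not attempt.

```python
class LinearProbeHashTable:
    def __init__(self, capacity=8, max_load_factor=0.5):
        self.capacity = capacity
        self.max_load_factor = max_load_factor  # illustrative threshold
        self.size = 0
        self.slots = [None] * capacity          # each slot: (key, value) or None

    def _probe(self, key):
        """Yield slot indexes starting at hash(key) % N, wrapping around."""
        start = hash(key) % self.capacity
        for step in range(self.capacity):
            yield (start + step) % self.capacity

    def get(self, key):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is None:        # empty slot: the key cannot be further along
                return None
            if entry[0] == key:      # stored key tells us when to stop scanning
                return entry[1]
        return None

    def put(self, key, value):
        # Resize before the table gets too full.
        if (self.size + 1) / self.capacity > self.max_load_factor:
            self._resize()
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is None:        # next free slot: insert here
                self.slots[i] = (key, value)
                self.size += 1
                return
            if entry[0] == key:      # key already present: overwrite the value
                self.slots[i] = (key, value)
                return

    def _resize(self):
        """Allocate a table twice as large and rehash every entry."""
        old_entries = [e for e in self.slots if e is not None]
        self.capacity *= 2
        self.slots = [None] * self.capacity
        self.size = 0
        for key, value in old_entries:
            self.put(key, value)


t = LinearProbeHashTable(capacity=8)
t.put(3, "A")    # 3 % 8 == 3, lands in slot 3
t.put(11, "B")   # 11 % 8 == 3, slot 3 is taken, moves to slot 4
t.put(19, "C")   # 19 % 8 == 3, slots 3 and 4 are taken, moves to slot 5
```

Because the three keys share offset 3, a lookup for 19 starts at slot 3 and scans forward until it finds the matching key, while a lookup for an absent key stops at the first empty slot.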

LINEAR PROBE HASHING

[Figure: linear probing example with keys A, B, C, D, E, F and slot offset hash(key) % N]

HASH TABLE – KEY/VALUE ENTRIES

Fixed-length Key/Values:

$\rightarrow$ Store inline within the hash table pages.
$\rightarrow$ Optional: Store the key's hash with the key for faster comparisons.

Variable-length Key/Values:

$\rightarrow$ Insert key/value data in a separate private temporary table.
$\rightarrow$ Store the hash as the key and use the record id pointing to its corresponding entry in the temporary table as the value.

[Figure: fixed-length entries stored inline in a hash table page]

hash	key	value
hash	key	value
hash	key	value

[Figure: variable-length entries, where the hash table page holds (hash, RecordId) pairs pointing into a Temp Table Page]

hash	RecordId
hash	RecordId
hash	RecordId

Temp Table Page

key | value
key | value
key | value
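The variable-length layout can be sketched as below. All names here (`slots`, `temp_table`, `put`, `get`) are hypothetical, a Python list stands in for the temporary table with the list index acting as the record id, and linear probing is assumed as the hashing scheme. The sketch also shows the optional trick from the fixed-length case: compare the stored hash first, and only compare full keys when the hashes match. It does not handle duplicate keys or resizing.

```python
NUM_SLOTS = 8
slots = [None] * NUM_SLOTS  # each slot: (key_hash, record_id), both fixed-length
temp_table = []             # stand-in for the temp table; record id = list index


def key_hash(key: str) -> int:
    return hash(key)  # stand-in hash function


def put(key: str, value: str) -> None:
    h = key_hash(key)
    temp_table.append((key, value))   # variable-length data lives out of line
    record_id = len(temp_table) - 1
    for step in range(NUM_SLOTS):
        i = (h + step) % NUM_SLOTS
        if slots[i] is None:
            slots[i] = (h, record_id)
            return
    raise RuntimeError("hash table is full")


def get(key: str):
    h = key_hash(key)
    for step in range(NUM_SLOTS):
        i = (h + step) % NUM_SLOTS
        if slots[i] is None:
            return None
        stored_hash, record_id = slots[i]
        if stored_hash == h:                      # cheap integer comparison first
            stored_key, value = temp_table[record_id]
            if stored_key == key:                 # full key comparison on hash match
                return value
    return None


put("alice", "x")
put("a much longer variable-length key", "y")
```

Every slot stays the same size no matter how long the keys are, at the cost of one extra hop into the temporary table on each hash match.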

LINEAR PROBE HASHING - DELETES

[Figure: linear probing delete example with keys A, B, C, D, E, F and slot offset hash(key) % N]

LINEAR PROBE HASHING - DELETES