Carnegie Mellon University
Database Systems
Bloom Filters, Tries, Skip Lists, Inverted Indexes, Vector Indexes
ADMINISTRIVIA
Project #2 out; due Sunday March 2 $^{\text{nd}}$ @ 11
$\rightarrow$ Don't forget to do a GitHub "pull" before starting $\rightarrow$ Recitation on Wednesday Feb. 19 4:00- 5
pm, GHC 5117
Homework #3 (indices and filters) due Sunday Feb 23 $^{\text{th}}$ @11
Mid- term Exam on Wednesday Feb 26 $^{\text{th}}$ $\rightarrow$ In- class in this room
INDEXES VS. FILTERS
An index data structure of a subset of a table's attributes that are organized and/or sorted to the location of specific tuples using those attributes. $\longrightarrow$ Example: B+Tree
A filter is a data structure that answers set membership queries; it tells you whether a key (likely) exists in a set but not where it is located. $\longrightarrow$ Example: Bloom Filter
TODAY'S AGENDA
Bloom Filters Skip Lists Tries / Radix Trees Inverted Indexes Vector Indexes DB Flash Talk: Weaviate
BLOOM FILTERS
Probabilistic data structure (bitmap) that answers set membership queries.
→ False negatives will never occur. → False positives can sometimes occur. → See Bloom Filter Calculator.
Insert(x):
→ Use $k$ hash functions to set bits in the filter to 1.
Lookup(x):
→ Check whether the bits are 1 for each hash function.
BLOOM FILTERS
Insert 'RZA'
Bloom Filter
Table (html):
BLOOM FILTERS
Insert 'RZA'
Bloom Filter
$hash_{1}(RZA^{\prime}) = 2222% 8 = 6$ $hash_{2}(RZA^{\prime}) = 4444% 8 = 4$
BLOOM FILTERS
Insert 'RZA'
Bloom Filter
BLOOM FILTERS
Insert 'RZA'
Insert 'GZA'
Bloom Filter
hash('GZA') = 5555 % 8 = 3 hash('GZA') = 7777 % 8 = 1
BLOOM FILTERS
Insert 'RZA'
Insert 'GZA'
Bloom Filter
BLOOM FILTERS
Insert 'RZA' Insert 'GZA' Lookup 'RZA'
Bloom Filter
BLOOM FILTERS
Insert 'RZA'
Insert 'GZA'
Lookup 'RZA' → TRUE
Bloom Filter
BLOOM FILTERS
Insert 'RZA'
Insert 'GZA'
Lookup 'RZA' → TRUE
Lookup 'Raekwon'
Bloom Filter
BLOOM FILTERS
Insert 'RZA'
Insert 'GZA'
Lookup 'RZA' → TRUE
Lookup 'Raekwon'
Bloom Filter
$hash_{1}(Raekwon^{\prime}) = 3333% 8 = 5$ $hash_{2}(Raekwon^{\prime}) = 8899% 8 = 3$
BLOOM FILTERS
Insert 'RZA'
Insert 'GZA'
Lookup 'RZA' → TRUE
Lookup 'Raekwon'
Bloom Filter
BLOOM FILTERS
Insert 'RZA'
Insert 'GZA'
Lookup 'RZA' → TRUE
Lookup 'Raekwon' → FALSE
Bloom Filter
BLOOM FILTERS
Insert 'RZA'
Insert 'GZA'
Lookup 'RZA' → TRUE
Lookup 'Raekwon' → FALSE
Bloom Filter
BLOOM FILTERS
Insert 'RZA'
Insert 'GZA'
Lookup 'RZA' → TRUE
Lookup 'Raekwon' → FALSE
Lookup 'ODB'
Bloom Filter
$hash_{1}(^{\prime}ODB^{\prime}) = 6699% 8 =$ 3 $hash_{2}(^{\prime}ODB^{\prime}) = 9966% 8 =$ 6
BLOOM FILTERS
Insert 'RZA'
Insert 'GZA'
Lookup 'RZA' → TRUE
Lookup 'Raekwon' → FALSE
Lookup 'ODB'
Bloom Filter
BLOOM FILTERS
Insert 'RZA'
Insert 'GZA'
Lookup 'RZA' → TRUE
Lookup 'Raekwon' → FALSE
Lookup 'ODB' → TRUE
Bloom Filter
BLOOM FILTERS
Insert 'RZA'
Insert 'GZA'
Lookup 'RZA' → TRUE
Lookup 'Raekwon' → FALSE
Lookup 'ODB' → TRUE
Bloom Filter
OTHER FILTERS
Counting Bloom Filter
$\rightarrow$ Supports dynamically adding and removing keys. $\rightarrow$ Uses integers instead of bits to count the number of occurrences of a key in a set.
Cuckoo Filter
$\rightarrow$ Also supports dynamically adding and removing keys. $\rightarrow$ Uses a Cuckoo Hash Table but stores fingerprints instead of full keys.
Succinct Range Filter (SuRF)
$\rightarrow$ Immutable compact trie that supports approximate exact matches and range filtering.
OTHER FILTERS
Counting Bloom Filter
$\rightarrow$ Supports dynamically adding and removing keys. $\rightarrow$ Uses integers instead of bits to count the number of occurrences of a key in a set.
Cuckoo Filter
$\rightarrow$ Also supports dynamically adding and removing keys. $\rightarrow$ Uses a Cuckoo Hash Table but stores fingerprints instead of full keys.
Succinct Range Filter (SuRF)
$\rightarrow$ Immutable compact trie that supports approximate exact matches and range filtering.
OTHER
OBSERVATION
The easiest way to implement a dynamic order- preserving index is to use a sorted linked list. All operations have to linear search. $\rightarrow$ Average Cost: O(n)
OBSERVATION
The easiest way to implement a dynamic order- preserving index is to use a sorted linked list. All operations have to linear search. $\rightarrow$ Average Cost: O(n)
OBSERVATION
The easiest way to implement a dynamic order- preserving index is to use a sorted linked list. All operations have to linear search. $\rightarrow$ Average Cost: O(n)
OBSERVATION
The easiest way to implement a dynamic order- preserving index is to use a sorted linked list. All operations have to linear search. $\rightarrow$ Average Cost: O(n)
SKIP LISTS
Multiple levels of linked lists with extra pointers to skip over entries. $\rightarrow 1^{\mathrm{st}}$ level is a sorted list of all keys. $\rightarrow 2^{\mathrm{nd}}$ level links every other key $\rightarrow 3^{\mathrm{rd}}$ level links every fourth key $\rightarrow$ Each level has $\frac{1}{2}$ the keys of one below it
Maintains keys in sorted order without requiring global rebalancing. $\rightarrow$ Approximate O(log n) search times.
Mostly for in- memory data structures. $\rightarrow$ Example: LSM MemTable
Skip Lists: A Probabilistic Alternative to Balanced Trees
Skip lists are a data structure that can be used in place of balanced trees. Skip lists use probabilistic balancing rather than strictly enforced balancing and as a result the algorithms for insertion and deletion in skip lists are much simpler and significantly faster than equivalent algorithms for balanced trees.
William Pugh
Binary trees can be used for representing distinct data types such as dictionaries and ordered lists. They work well when in the elements are inserted in a random order. Some elements of operations, such as inserting elements into the order of the degenerate data structures that give your performance a hit, it was possible to randomly insert the items of items that were, those would work well with high probability for a no- per sequence. In most cases quents must be involved in the list, so randomly permuting the input is impractical. Below are some algorithms to arrange the tree as operations are performed to maintain certain balance conditions and assure good performance. Skip lists are a probabilistic alternative to balanced trees. Skip lists are balanced by consulting a random number per- enter. Although skip lists have worst- case performance, no input sequence consistently produces the worst case performance (such as it very unlikely) a skip list data structure will be significantly quicker (e.g., for a dictionary of more than 250 elements, the chance that a search will take more than 3 times the expected time is less than one in a million). Skip lists have balance properties similar to that of search trees built by random insertions, yet do not require insertions to be random. Balancing a data structure probabilistically is easier to explicitly maintaining the balance. For many applications, skip lists are a natural representation than trees. Although skip list almost always has a skip structure, the algorithmic element to implement and provides siglo- cant constant factor speed improvements over balanced tree and self- adjusting tree algorithms. Skip lists are also very space efficient. They can easily be configured to require average of 1/1 points per element (or even less) and do not require balance or priority information to be stored with each skip list. We might need to examine every node of the list when searching a linked list (Figure 1a). If the list is stored in sorted order and every other node of the list also has a pointer to the node two- ahead in the list (Figure 1b), we have to estimate the more than 1/1 nodes (where n is the length of the list).
SingleStore WIREDTIGER
C. Couchbase
SKIP LISTS: INSERT
Insert $K_{5}$
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: INSERT
Insert $K_{5}$
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: INSERT
Insert $K_{5}$
SKIP LISTS: INSERT
Insert K5
Levels
Flip a coin to decide how many levels to add the new key into.
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: INSERT
Insert K5
SKIP LISTS: SEARCH
Find K3
Levels
SKIP LISTS: SEARCH
Find K3
SKIP LISTS: SEARCH
Find K3
SKIP LISTS: SEARCH
Find K3
SKIP LISTS: SEARCH
Find K3
Levels
SKIP LISTS: SEARCH
Find K3
Levels
SKIP LISTS: DELETE
First logically remove a key from the index by setting a flag to tell threads to ignore.
Then physically remove the key once we know that no other thread is holding the reference.
SKIP LISTS: DELETE
Delete $K_{5}$
SKIP LISTS: DELETE
Delete $K_{5}$
SKIP LISTS: DELETE
Delete $K_{5}$
SKIP LISTS: DELETE
Delete $K_{5}$
SKIP LISTS: DELETE
Delete $K_{5}$
SKIP LISTS: DELETE
Delete $K_{5}$
SKIP LISTS: DELETE
Delete $K_{5}$
SKIP LISTS: DELETE
Delete $K_{5}$
SKIP LISTS: DELETE
Delete $K_{5}$
SKIP LISTS
Advantages:
$\rightarrow$ May use less memory than a B+ Tree, if you do not include reverse pointers. $\rightarrow$ Insertions and deletions do not require rebalancing.
Disadvantages:
$\rightarrow$ Not disk/cache friendly because they do not optimize locality of references. $\rightarrow$ Reverse search is non- trivial.
OBSERVATION
The inner node keys in a B+Tree cannot tell you whether a key exists in the index. You must always traverse to the leaf node.
This means that you could have (at least) one buffer pool page miss per level in the tree just to find out a key does not exist.
TRIE INDEX
Use a digital representation of keys to examine prefixes one- by- one.
→ aka Digital Search Tree, Prefix Tree.
Shape depends on keys and lengths.
→ Does not depend on existing keys or insertion order.
→ Does not require rebalancing operations.
All operations have O(k) complexity where k is the length of the key.
→ Path to a leaf node represents a key.
→ Keys are stored implicitly and can be reconstructed from paths.
Keys: HELLO, HAT, HAVE
TRIE INDEX
Use a digital representation of keys to examine prefixes one- by- one.
→ aka Digital Search Tree, Prefix Tree.
Shape depends on keys and lengths.
→ Does not depend on existing keys or insertion order.
→ Does not require rebalancing operations.
All operations have O(k) complexity where k is the length of the key.
→ Path to a leaf node represents a key.
→ Keys are stored implicitly and can be reconstructed from paths.
Keys: HELLO, HAT, HAVE
TRIE INDEX
Use a digital representation of keys to examine prefixes one- by- one.
→ aka Digital Search Tree, Prefix Tree.
Shape depends on keys and lengths.
→ Does not depend on existing keys or insertion order.
→ Does not require rebalancing operations.
All operations have O(k) complexity where k is the length of the key.
→ Path to a leaf node represents a key.
→ Keys are stored implicitly and can be reconstructed from paths.
Keys: HELLO, HAT, HAVE
TRIE INDEX
Use a digital representation of keys to examine prefixes one- by- one.
→ aka Digital Search Tree, Prefix Tree.
Shape depends on keys and lengths.
→ Does not depend on existing keys or insertion order.
→ Does not require rebalancing operations.
All operations have O(k) complexity where k is the length of the key.
→ Path to a leaf node represents a key.
→ Keys are stored implicitly and can be reconstructed from paths.
Keys: HELLO, HAT, HAVE
TRIE INDEX
Use a digital representation of keys to examine prefixes one- by- one.
→ aka Digital Search Tree, Prefix Tree.
Shape depends on keys and lengths.
→ Does not depend on existing keys or insertion order.
→ Does not require rebalancing operations.
All operations have O(k) complexity where k is the length of the key.
→ Path to a leaf node represents a key.
→ Keys are stored implicitly and can be reconstructed from paths.
Keys: HELLO, HAT, HAVE
TRIE KEY SPAN
The span of a trie level is the number of bits that each partial key / digit represents. $\rightarrow$ If the digit exists in the corpus, then store a pointer to the next level in the trie branch. Otherwise, store null.
This determines the fan- out of each node and the physical height of the tree. $\rightarrow$ n- way Trie = Fan- Out of n
TRIE KEY SPAN
1-bit Span Trie
Keys: K10, K25, K31
TRIE KEY SPAN
1-bit Span Trie
Keys: K10, K25, K31
K10→ 00000000 00001010 K25→ 00000000 00011001 K31→ 00000000 00011111
TRIE KEY SPAN
1-bit Span Trie
Keys: K10, K25, K31
K10→ 00000000 00001010 K25→ 00000000 00011001 K31→ 00000000 00011111
TRIE KEY SPAN
1-bit Span Trie
Keys: K10, K25, K31
TRIE KEY SPAN
1-bit Span Trie
Keys: K10, K25, K31
TRIE KEY SPAN
1-bit Span Trie
Keys: K10, K25, K31
K10→ 00000000 00001010 K25→ 00000000 00011001 K31→ 00000000 00011111
TRIE KEY SPAN
1-bit Span Trie
Keys: K10, K25, K31
TRIE KEY SPAN
1-bit Span Trie
Keys: K10, K25, K31
K10→ 00000000 00001010 K25→ 00000000 00011001 K31→ 00000000 00011111
TRIE KEY SPAN
1-bit Span Trie
Keys: K10, K25, K31
K10→ 00000000 00001010 K25→ 00000000 00011001 K31→ 00000000 00011111
TRIE KEY SPAN
1-bit Span Trie
Keys: K10, K25, K31
K10→ 00000000 00001010 K25→ 00000000 00011001 K31→ 00000000 00011111
TRIE KEY SPAN
1-bit Span Trie
Keys: K10, K25, K31
K10→ 00000000 00001010 K25→ 00000000 00011001 K31→ 00000000 00011111
RADIX TREE
1-bit Span Radix Tree
Vertically compressed trie that compacts nodes with a single child. $\rightarrow$ Also known as Patricia Trie.
Can produce false positives, so the DBMS always checks the original tuple to see whether a key matches.
RADIX TREE
1-bit Span Radix Tree
Vertically compressed trie that compacts nodes with a single child. $\rightarrow$ Also known as Patricia Trie.
Can produce false positives, so the DBMS always checks the original tuple to see whether a key matches.
RADIX TREE: MODIFICATIONS
RADIX TREE: MODIFICATIONS
Insert HAIR
RADIX TREE: MODIFICATIONS
Insert HAIR
RADIX TREE: MODIFICATIONS
Insert HAIR Delete HAT
RADIX TREE: MODIFICATIONS
Insert HAIR Delete HAT
RADIX TREE: MODIFICATIONS
Insert HAIR Delete HAT
RADIX TREE: MODIFICATIONS
Insert HAIRDelete HATDelete HAVE
RADIX TREE: MODIFICATIONS
Insert HAIRDelete HATDelete HAVE
RADIX TREE: MODIFICATIONS
Insert HAIRDelete HATDelete HAVE
RADIX TREE: MODIFICATIONS
Insert HAIR Delete HAT Delete HAVE
OBSERVATION
The indexes that we've discussed are useful for "point" and "range" queries: $\longrightarrow$ Find all customers in the 15217 zipcode. $\longrightarrow$ Find all orders between June 2024 and September 2024.
They are not good at keyword searches: $\longrightarrow$ Example: Find all Wikipedia articles that contain the word "Pavlo"
OBSERVATION
The indexes that we've discussed are useful for "point" and "range" queries: $\rightarrow$ Find all customers in the 15217 zipcode. $\rightarrow$ Find all orders between June 2024 and September 2024.
They are not good at keyword searches:
$\rightarrow$ Example: Find all Wikipedia articles that contain the word "Pavlo"
OBSERVATION
The indexes that we've discussed are useful for "point" and "range" queries: $\longrightarrow$ Find all customers in the 15217 zipcode. $\longrightarrow$ Find all orders between June 2024 and September 2024.
They are not good at keyword searches:
$\longrightarrow$ Example: Find all Wikipedia articles that contain the word "Pavlo"
revisions(id,content,..)
id content
Table (html):
| 11 | Wu-Tang Clan is an American hip hop musical collective formed in Staten Island, New York City, in 1992... |
| 22 | Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. The institution was established in 1900 by Andrew Carnegie... |
| 33 | In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software... |
| 44 | Andrew Pavlo, best known as Andy Pavlo, is an associate professor of Computer Science at Carnegie Mellon University. He conducts research on database... |
OBSERVATION
The indexes that we've discussed are useful for "point" and "range" queries: $\longrightarrow$ Find all customers in the 15217 zipcode. $\longrightarrow$ Find all orders between June 2024 and September 2024.
They are not good at keyword searches:
$\longrightarrow$ Example: Find all Wikipedia articles that contain the word "Pavlo"
revisions(id,content,..)
id content
Table (html):
| 11 | Wu-Tang Clan is an American hip hop musical collective formed in Staten Island, New York City, in 1992... |
| 22 | Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. The institution was established in 1900 by Andrew Carnegie... |
| 33 | In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software... |
| 44 | Andrew Pavlo, best known as Andy Pavlo, is an associate professor of Computer Science at Carnegie Mellon University. He conducts research on database... |
CREATE INDEX idx_rev_cntnt ON revisions (content);
SELECT pageID FROM revisions WHERE content LIKE '%Pavlo%';
INVERTED INDEX
An inverted index stores a mapping of terms to records that contain those terms in the target attribute.
$\longrightarrow$ Sometimes called a full- text search index. $\longrightarrow$ Originally called a concordance (1200s).
Many major DBMSs support these natively. But there are also specialized DBMSs and libraries.
revisions(id,content,..)
id content
Table (html):
| 11 | Wu-Tang Clan is an American hip hop musical collective formed in Staten Island, New York City, in 1992... |
| 22 | Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. The institution was established in 1900 by Andrew Carnegie... |
| 33 | In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software... |
| 44 | Andrew Pavlo, best known as Andy Pavlo, is an associate professor of Computer Science at Carnegie Mellon University. He conducts research on database... |
INVERTED INDEX
An inverted index stores a mapping of terms to records that contain those terms in the target attribute.
$\longrightarrow$ Sometimes called a full- text search index. $\longrightarrow$ Originally called a concordance (1200s).
Many major DBMSs support these natively. But there are also specialized DBMSs and libraries.
revisions(id,content,..)
id content
Table (html):
| 11 | Wu-Tang Clan is an American hip hop musical collective formed in Staten Island, New York City, in 1992... |
| 22 | Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. The institution was established in 1900 by Andrew Carnegie... |
| 33 | In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software... |
| 44 | Andrew Pavlo, best known as Andy Pavlo, is an associate professor of Computer Science at Carnegie Mellon University. He conducts research on database... |
Dictionary Posting Lists
INVERTED INDEX
An inverted index stores a mapping of terms to records that contain those terms in the target attribute.
$\longrightarrow$ Sometimes called a full- text search index. $\longrightarrow$ Originally called a concordance (1200s).
Many major DBMSs support these natively. But there are also specialized DBMSs and libraries.
revisions(id,content,..)
id content
Table (html):
| 11 | Wu-Tang Clan is an American hip hop musical collective formed in Staten Island, New York City, in 1992... |
| 22 | Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. The institution was established in 1900 by Andrew Carnegie... |
| 33 | In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software... |
| 44 | Andrew Pavlo, best known as Andy Pavlo, is an associate professor of Computer Science at Carnegie Mellon University. He conducts research on database... |
Dictionary Posting Lists
INVERTED INDEX
An inverted index stores a mapping of terms to records that contain those terms in the target attribute.
$\longrightarrow$ Sometimes called a full- text search index. $\longrightarrow$ Originally called a concordance (1200s).
Many major DBMSs support these natively. But there are also specialized DBMSs and libraries.
revisions(id,content,..)
id content
Table (html):
| 11 | Wu-Tang Clan is an American hip hop musical collective formed in Staten Island, New York City, in 1992... |
| 22 | Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. The institution was established in 1900 by Andrew Carnegie... |
| 33 | In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software... |
| 44 | Andrew Pavlo, best known as Andy Pavlo, is an associate professor of Computer Science at Carnegie Mellon University. He conducts research on database... |
INVERTED INDEX: LUCENE
Uses a Finite State Transducer for determining offset of terms in dictionary.
Incrementally create dictionary segments and then merge them in the background.
$\longrightarrow$ Uses compression methods we previously discussed (e.g., delta, bit packing). $\longrightarrow$ Also supports precomputed aggregations for terms and occurrences.
Dictionary
INVERTED INDEX: LUCENE
Uses a Finite State Transducer for determining offset of terms in dictionary.
Incrementally create dictionary segments and then merge them in the background.
$\rightarrow$ Uses compression methods we previously discussed (e.g., delta, bit packing). $\rightarrow$ Also supports precomputed aggregations for terms and occurrences.
Dictionary
INVERTED INDEX: LUCENE
Uses a Finite State Transducer for determining offset of terms in dictionary.
Incrementally create dictionary segments and then merge them in the background.
$\rightarrow$ Uses compression methods we previously discussed (e.g., delta, bit packing). $\rightarrow$ Also supports precomputed aggregations for terms and occurrences.
[ImageCaption: Offset= 0]
INVERTED INDEX: LUCENE
Uses a Finite State Transducer for determining offset of terms in dictionary.
Incrementally create dictionary segments and then merge them in the background.
$\rightarrow$ Uses compression methods we previously discussed (e.g., delta, bit packing). $\rightarrow$ Also supports precomputed aggregations for terms and occurrences.
[ImageCaption: Offset= 2]
INVERTED INDEX: LUCENE
Uses a Finite State Transducer for determining offset of terms in dictionary.
Incrementally create dictionary segments and then merge them in the background.
$\rightarrow$ Uses compression methods we previously discussed (e.g., delta, bit packing). $\rightarrow$ Also supports precomputed aggregations for terms and occurrences.
[ImageCaption: Offset= 3]
INVERTED INDEX: LUCENE
Uses a Finite State Transducer for determining offset of terms in dictionary.
Incrementally create dictionary segments and then merge them in the background.
$\rightarrow$ Uses compression methods we previously discussed (e.g., delta, bit packing). $\rightarrow$ Also supports precomputed aggregations for terms and occurrences.
[ImageCaption: Offset= 3]
INVERTED INDEX: POSTGRESQL
PostgreSQL's Generalized Inverted Index (GIN) uses a B+Tree for the term dictionary that map to a posting list data structure.
Posting list contents varies depending on number of records per term:
$\rightarrow$ Few: Sorted list of record ids. $\rightarrow$ Many: Another B+Tree of record ids.
Uses a separate "pending list" log to avoid incremental updates.
INVERTED INDEX: POSTGRESQL
PostgreSQL's Generalized Inverted Index (GIN) uses a B+Tree for the term dictionary that map to a posting list data structure.
Posting list contents varies depending on number of records per term: $\rightarrow$ Few: Sorted list of record ids. $\rightarrow$ Many: Another B+Tree of record ids.
Uses a separate "pending list" log to avoid incremental updates.
INVERTED INDEX: POSTGRESQL
PostgreSQL's Generalized Inverted Index (GIN) uses a B+Tree for the term dictionary that map to a posting list data structure.
Posting list contents varies depending on number of records per term: $\rightarrow$ Few: Sorted list of record ids. $\rightarrow$ Many: Another B+Tree of record ids.
Uses a separate "pending list" log to avoid incremental updates.
INVERTED INDEX: POSTGRESQL
PostgreSQL's Generalized Inverted Index (GIN) uses a B+Tree for the term dictionary that map to a posting list data structure.
Posting list contents varies depending on number of records per term: $\rightarrow$ Few: Sorted list of record ids. $\rightarrow$ Many: Another B+Tree of record ids.
Uses a separate "pending list" log to avoid incremental updates.
INVERTED INDEX: POSTGRESQL
PostgreSQL's Generalized Inverted Index (GIN) uses a B+Tree for the term dictionary that map to a posting list data structure.
Posting list contents varies depending on number of records per term: $\rightarrow$ Few: Sorted list of record ids. $\rightarrow$ Many: Another B+Tree of record ids.
Uses a separate "pending list" log to avoid incremental updates.
INVERTED INDEX: POSTGRESQL
PostgreSQL's Generalized Inverted Index (GIN) uses a B+Tree for the term dictionary that map to a posting list data structure.
Posting list contents varies depending on number of records per term: $\rightarrow$ Few: Sorted list of record ids. $\rightarrow$ Many: Another B+Tree of record ids.
Uses a separate "pending list" log to avoid incremental updates.
OBSERVATION
Inverted indexes search data based on its contents. $\rightarrow$ There is a little magic to tweak terms based on linguistic models. $\rightarrow$ Example: Normalization ("Wu- Tang" matches "Wu Tang").
Instead of searching for records containing exact keywords (e.g., "Wu- Tang"), an application may want search for records that are related to topics (e.g., "hip- hop groups with songs about slinging").
VECTOR INDEXES
Specialized data structures to perform nearest- neighbor searches on embeddings. $\longrightarrow$ An embedding is an array of floating point numbers. $\longrightarrow$ May also need to filter data before / after vector searches.
The correctness of a query depends on whether the result "feels right".
VECTOR INDEXES: INVERTED FILE
Partition vectors into smaller groups using a clustering algorithm.
To find a match, use same clustering algorithm to map into a group, then scan that group's vectors.
$\longrightarrow$ Also check nearby groups to improve accuracy.
Preprocess / quantize vectors to reduce dimensionality.
Example: IVFFlat
VECTOR INDEXES: INVERTED FILE
Partition vectors into smaller groups using a clustering algorithm.
To find a match, use same clustering algorithm to map into a group, then scan that group's vectors.
$\rightarrow$ Also check nearby groups to improve accuracy.
Preprocess / quantize vectors to reduce dimensionality.
Example: IVFFlat
VECTOR INDEXES: INVERTED FILE
Partition vectors into smaller groups using a clustering algorithm.
To find a match, use same clustering algorithm to map into a group, then scan that group's vectors.
$\rightarrow$ Also check nearby groups to improve accuracy.
Preprocess / quantize vectors to reduce dimensionality.
Example: IVFFlat
VECTOR INDEXES: INVERTED FILE
Partition vectors into smaller groups using a clustering algorithm.
To find a match, use same clustering algorithm to map into a group, then scan that group's vectors.
$\rightarrow$ Also check nearby groups to improve accuracy.
Preprocess / quantize vectors to reduce dimensionality.
Example: IVFFlat
VECTOR INDEXES: NAVIGABLE SMALL WORLDS
Build a graph where each node represents a vector and it has edges to its n nearest neighbors. $\rightarrow$ Can use multiple levels of graphs (HNSW)
To find a match for a given vector, enter the graph and then greedily choose the next edge that moves closer to that vector.
Example: Faiss, hnswlib
VECTOR INDEXES: NAVIGABLE SMALL WORLDS
Build a graph where each node represents a vector and it has edges to its n nearest neighbors. $\rightarrow$ Can use multiple levels of graphs (HNSW)
To find a match for a given vector, enter the graph and then greedily choose the next edge that moves closer to that vector.
Example: Faiss, hnswlib
CONCLUSION
We will see filters again this semester. B+Trees are still the way to go for tree indexes. Inverted indexes are covered in CMU 11- 442.
We did not discuss geo- spatial tree indexes: $\rightarrow$ Examples: R- Tree, Quad- Tree, KD- Tree $\rightarrow$ This is covered in CMU 15- 826.
NEXT CLASS
How to make indexes thread- safe!