GithubDocuments
    1. Marketplace
    2. CMU Database - 15445 - 2025 Spring
    Collections
    CMU Database - 15445 - 2025 Spring
    cmu_database
    This collection of documents, "CMU Database - 15445 - 2025 Spring," provides a comprehensive overview of database systems, primarily focusing on the design, implementation, and ...
    DocumentsKnowledge Graph
    lecture-09-slides.pdf
    Carnegie Mellon University

    Database Systems

    Bloom Filters, Tries, Skip Lists, Inverted Indexes, Vector Indexes

    ADMINISTRIVIA

    Project #2 out; due Sunday March 2 $^{\text{nd}}$ @ 11
    $\rightarrow$ Don't forget to do a GitHub "pull" before starting $\rightarrow$ Recitation on Wednesday Feb. 19 4:00- 5
    pm, GHC 5117
    Homework #3 (indices and filters) due Sunday Feb 23 $^{\text{th}}$ @11
    Mid- term Exam on Wednesday Feb 26 $^{\text{th}}$ $\rightarrow$ In- class in this room

    INDEXES VS. FILTERS

    An index data structure of a subset of a table's attributes that are organized and/or sorted to the location of specific tuples using those attributes. $\longrightarrow$ Example: B+Tree
    A filter is a data structure that answers set membership queries; it tells you whether a key (likely) exists in a set but not where it is located. $\longrightarrow$ Example: Bloom Filter

    TODAY'S AGENDA

    Bloom Filters Skip Lists Tries / Radix Trees Inverted Indexes Vector Indexes DB Flash Talk: Weaviate

    BLOOM FILTERS

    Probabilistic data structure (bitmap) that answers set membership queries.
    → False negatives will never occur. → False positives can sometimes occur. → See Bloom Filter Calculator.

    Insert(x):

    → Use $k$ hash functions to set bits in the filter to 1.

    Lookup(x):

    → Check whether the bits are 1 for each hash function.

    BLOOM FILTERS

    Insert 'RZA'

    Bloom Filter

    Table (html):
    01234567
    00000000

    BLOOM FILTERS

    Insert 'RZA'

    Bloom Filter

    $hash_{1}(RZA^{\prime}) = 2222% 8 = 6$ $hash_{2}(RZA^{\prime}) = 4444% 8 = 4$

    BLOOM FILTERS

    Insert 'RZA'
    Bloom Filter

    BLOOM FILTERS

    Insert 'RZA'
    Insert 'GZA'

    Bloom Filter

    hash('GZA') = 5555 % 8 = 3 hash('GZA') = 7777 % 8 = 1

    BLOOM FILTERS

    Insert 'RZA'
    Insert 'GZA'

    Bloom Filter

    BLOOM FILTERS

    Insert 'RZA' Insert 'GZA' Lookup 'RZA'

    Bloom Filter

    BLOOM FILTERS

    Insert 'RZA'
    Insert 'GZA'
    Lookup 'RZA' → TRUE

    Bloom Filter

    BLOOM FILTERS

    Insert 'RZA'
    Insert 'GZA'
    Lookup 'RZA' → TRUE
    Lookup 'Raekwon'

    Bloom Filter

    BLOOM FILTERS

    Insert 'RZA'
    Insert 'GZA'
    Lookup 'RZA' → TRUE
    Lookup 'Raekwon'

    Bloom Filter

    $hash_{1}(Raekwon^{\prime}) = 3333% 8 = 5$ $hash_{2}(Raekwon^{\prime}) = 8899% 8 = 3$

    BLOOM FILTERS

    Insert 'RZA'
    Insert 'GZA'
    Lookup 'RZA' → TRUE
    Lookup 'Raekwon'

    Bloom Filter

    BLOOM FILTERS

    Insert 'RZA'
    Insert 'GZA'
    Lookup 'RZA' → TRUE
    Lookup 'Raekwon' → FALSE

    Bloom Filter

    BLOOM FILTERS

    Insert 'RZA'
    Insert 'GZA'
    Lookup 'RZA' → TRUE
    Lookup 'Raekwon' → FALSE

    Bloom Filter

    BLOOM FILTERS

    Insert 'RZA'
    Insert 'GZA'
    Lookup 'RZA' → TRUE
    Lookup 'Raekwon' → FALSE
    Lookup 'ODB'

    Bloom Filter

    $hash_{1}(^{\prime}ODB^{\prime}) = 6699% 8 =$ 3 $hash_{2}(^{\prime}ODB^{\prime}) = 9966% 8 =$ 6

    BLOOM FILTERS

    Insert 'RZA'
    Insert 'GZA'
    Lookup 'RZA' → TRUE
    Lookup 'Raekwon' → FALSE
    Lookup 'ODB'

    Bloom Filter

    BLOOM FILTERS

    Insert 'RZA'
    Insert 'GZA'
    Lookup 'RZA' → TRUE
    Lookup 'Raekwon' → FALSE
    Lookup 'ODB' → TRUE

    Bloom Filter

    BLOOM FILTERS

    Insert 'RZA'
    Insert 'GZA'
    Lookup 'RZA' → TRUE
    Lookup 'Raekwon' → FALSE
    Lookup 'ODB' → TRUE

    Bloom Filter

    Bloom filter calculator: https://hur.st/bloomfilter/

    OTHER FILTERS

    Counting Bloom Filter

    $\rightarrow$ Supports dynamically adding and removing keys. $\rightarrow$ Uses integers instead of bits to count the number of occurrences of a key in a set.

    Cuckoo Filter

    $\rightarrow$ Also supports dynamically adding and removing keys. $\rightarrow$ Uses a Cuckoo Hash Table but stores fingerprints instead of full keys.

    Succinct Range Filter (SuRF)

    $\rightarrow$ Immutable compact trie that supports approximate exact matches and range filtering.

    OTHER FILTERS

    Counting Bloom Filter

    $\rightarrow$ Supports dynamically adding and removing keys. $\rightarrow$ Uses integers instead of bits to count the number of occurrences of a key in a set.

    Cuckoo Filter

    $\rightarrow$ Also supports dynamically adding and removing keys. $\rightarrow$ Uses a Cuckoo Hash Table but stores fingerprints instead of full keys.

    Succinct Range Filter (SuRF)

    $\rightarrow$ Immutable compact trie that supports approximate exact matches and range filtering.

    OTHER

    OBSERVATION

    The easiest way to implement a dynamic order- preserving index is to use a sorted linked list. All operations have to linear search. $\rightarrow$ Average Cost: O(n)

    OBSERVATION

    The easiest way to implement a dynamic order- preserving index is to use a sorted linked list. All operations have to linear search. $\rightarrow$ Average Cost: O(n)

    OBSERVATION

    The easiest way to implement a dynamic order- preserving index is to use a sorted linked list. All operations have to linear search. $\rightarrow$ Average Cost: O(n)

    OBSERVATION

    The easiest way to implement a dynamic order- preserving index is to use a sorted linked list. All operations have to linear search. $\rightarrow$ Average Cost: O(n)

    SKIP LISTS

    Multiple levels of linked lists with extra pointers to skip over entries. $\rightarrow 1^{\mathrm{st}}$ level is a sorted list of all keys. $\rightarrow 2^{\mathrm{nd}}$ level links every other key $\rightarrow 3^{\mathrm{rd}}$ level links every fourth key $\rightarrow$ Each level has $\frac{1}{2}$ the keys of one below it
    Maintains keys in sorted order without requiring global rebalancing. $\rightarrow$ Approximate O(log n) search times.
    Mostly for in- memory data structures. $\rightarrow$ Example: LSM MemTable

    Skip Lists: A Probabilistic Alternative to Balanced Trees

    Skip lists are a data structure that can be used in place of balanced trees. Skip lists use probabilistic balancing rather than strictly enforced balancing and as a result the algorithms for insertion and deletion in skip lists are much simpler and significantly faster than equivalent algorithms for balanced trees.

    William Pugh

    Binary trees can be used for representing distinct data types such as dictionaries and ordered lists. They work well when in the elements are inserted in a random order. Some elements of operations, such as inserting elements into the order of the degenerate data structures that give your performance a hit, it was possible to randomly insert the items of items that were, those would work well with high probability for a no- per sequence. In most cases quents must be involved in the list, so randomly permuting the input is impractical. Below are some algorithms to arrange the tree as operations are performed to maintain certain balance conditions and assure good performance. Skip lists are a probabilistic alternative to balanced trees. Skip lists are balanced by consulting a random number per- enter. Although skip lists have worst- case performance, no input sequence consistently produces the worst case performance (such as it very unlikely) a skip list data structure will be significantly quicker (e.g., for a dictionary of more than 250 elements, the chance that a search will take more than 3 times the expected time is less than one in a million). Skip lists have balance properties similar to that of search trees built by random insertions, yet do not require insertions to be random. Balancing a data structure probabilistically is easier to explicitly maintaining the balance. For many applications, skip lists are a natural representation than trees. Although skip list almost always has a skip structure, the algorithmic element to implement and provides siglo- cant constant factor speed improvements over balanced tree and self- adjusting tree algorithms. Skip lists are also very space efficient. They can easily be configured to require average of 1/1 points per element (or even less) and do not require balance or priority information to be stored with each skip list. We might need to examine every node of the list when searching a linked list (Figure 1a). If the list is stored in sorted order and every other node of the list also has a pointer to the node two- ahead in the list (Figure 1b), we have to estimate the more than 1/1 nodes (where n is the length of the list).

    SingleStore WIREDTIGER

    C. Couchbase

    SKIP LISTS: INSERT

    Insert $K_{5}$

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: INSERT

    Insert $K_{5}$

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: INSERT

    Insert $K_{5}$

    SKIP LISTS: INSERT

    Insert K5
    Levels
    Flip a coin to decide how many levels to add the new key into.

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: INSERT

    Insert K5

    SKIP LISTS: SEARCH

    Find K3
    Levels

    SKIP LISTS: SEARCH

    Find K3

    SKIP LISTS: SEARCH

    Find K3

    SKIP LISTS: SEARCH

    Find K3

    SKIP LISTS: SEARCH

    Find K3
    Levels

    SKIP LISTS: SEARCH

    Find K3
    Levels

    SKIP LISTS: DELETE

    First logically remove a key from the index by setting a flag to tell threads to ignore.
    Then physically remove the key once we know that no other thread is holding the reference.

    SKIP LISTS: DELETE

    Delete $K_{5}$

    SKIP LISTS: DELETE

    Delete $K_{5}$

    SKIP LISTS: DELETE

    Delete $K_{5}$

    SKIP LISTS: DELETE

    Delete $K_{5}$

    SKIP LISTS: DELETE

    Delete $K_{5}$

    SKIP LISTS: DELETE

    Delete $K_{5}$

    SKIP LISTS: DELETE

    Delete $K_{5}$

    SKIP LISTS: DELETE

    Delete $K_{5}$

    SKIP LISTS: DELETE

    Delete $K_{5}$

    SKIP LISTS

    Advantages:

    $\rightarrow$ May use less memory than a B+ Tree, if you do not include reverse pointers. $\rightarrow$ Insertions and deletions do not require rebalancing.

    Disadvantages:

    $\rightarrow$ Not disk/cache friendly because they do not optimize locality of references. $\rightarrow$ Reverse search is non- trivial.

    OBSERVATION

    The inner node keys in a B+Tree cannot tell you whether a key exists in the index. You must always traverse to the leaf node.
    This means that you could have (at least) one buffer pool page miss per level in the tree just to find out a key does not exist.

    TRIE INDEX

    Use a digital representation of keys to examine prefixes one- by- one.
    → aka Digital Search Tree, Prefix Tree.
    Shape depends on keys and lengths.
    → Does not depend on existing keys or insertion order.
    → Does not require rebalancing operations.
    All operations have O(k) complexity where k is the length of the key.
    → Path to a leaf node represents a key.
    → Keys are stored implicitly and can be reconstructed from paths.
    Keys: HELLO, HAT, HAVE

    TRIE INDEX

    Use a digital representation of keys to examine prefixes one- by- one.
    → aka Digital Search Tree, Prefix Tree.
    Shape depends on keys and lengths.
    → Does not depend on existing keys or insertion order.
    → Does not require rebalancing operations.
    All operations have O(k) complexity where k is the length of the key.
    → Path to a leaf node represents a key.
    → Keys are stored implicitly and can be reconstructed from paths.
    Keys: HELLO, HAT, HAVE

    TRIE INDEX

    Use a digital representation of keys to examine prefixes one- by- one.
    → aka Digital Search Tree, Prefix Tree.
    Shape depends on keys and lengths.
    → Does not depend on existing keys or insertion order.
    → Does not require rebalancing operations.
    All operations have O(k) complexity where k is the length of the key.
    → Path to a leaf node represents a key.
    → Keys are stored implicitly and can be reconstructed from paths.
    Keys: HELLO, HAT, HAVE

    TRIE INDEX

    Use a digital representation of keys to examine prefixes one- by- one.
    → aka Digital Search Tree, Prefix Tree.
    Shape depends on keys and lengths.
    → Does not depend on existing keys or insertion order.
    → Does not require rebalancing operations.
    All operations have O(k) complexity where k is the length of the key.
    → Path to a leaf node represents a key.
    → Keys are stored implicitly and can be reconstructed from paths.
    Keys: HELLO, HAT, HAVE

    TRIE INDEX

    Use a digital representation of keys to examine prefixes one- by- one.
    → aka Digital Search Tree, Prefix Tree.
    Shape depends on keys and lengths.
    → Does not depend on existing keys or insertion order.
    → Does not require rebalancing operations.
    All operations have O(k) complexity where k is the length of the key.
    → Path to a leaf node represents a key.
    → Keys are stored implicitly and can be reconstructed from paths.
    Keys: HELLO, HAT, HAVE

    TRIE KEY SPAN

    The span of a trie level is the number of bits that each partial key / digit represents. $\rightarrow$ If the digit exists in the corpus, then store a pointer to the next level in the trie branch. Otherwise, store null.
    This determines the fan- out of each node and the physical height of the tree. $\rightarrow$ n- way Trie = Fan- Out of n

    TRIE KEY SPAN

    1-bit Span Trie

    Keys: K10, K25, K31

    TRIE KEY SPAN

    1-bit Span Trie

    Keys: K10, K25, K31
    K10→ 00000000 00001010 K25→ 00000000 00011001 K31→ 00000000 00011111

    TRIE KEY SPAN

    1-bit Span Trie

    Keys: K10, K25, K31
    K10→ 00000000 00001010 K25→ 00000000 00011001 K31→ 00000000 00011111

    TRIE KEY SPAN

    1-bit Span Trie

    Keys: K10, K25, K31

    TRIE KEY SPAN

    1-bit Span Trie

    Keys: K10, K25, K31

    TRIE KEY SPAN

    1-bit Span Trie

    Keys: K10, K25, K31
    K10→ 00000000 00001010 K25→ 00000000 00011001 K31→ 00000000 00011111

    TRIE KEY SPAN

    1-bit Span Trie

    Keys: K10, K25, K31

    TRIE KEY SPAN

    1-bit Span Trie

    Keys: K10, K25, K31
    K10→ 00000000 00001010 K25→ 00000000 00011001 K31→ 00000000 00011111

    TRIE KEY SPAN

    1-bit Span Trie

    Keys: K10, K25, K31
    K10→ 00000000 00001010 K25→ 00000000 00011001 K31→ 00000000 00011111

    TRIE KEY SPAN

    1-bit Span Trie

    Keys: K10, K25, K31
    K10→ 00000000 00001010 K25→ 00000000 00011001 K31→ 00000000 00011111

    TRIE KEY SPAN

    1-bit Span Trie

    Keys: K10, K25, K31
    K10→ 00000000 00001010 K25→ 00000000 00011001 K31→ 00000000 00011111

    RADIX TREE

    1-bit Span Radix Tree

    Vertically compressed trie that compacts nodes with a single child. $\rightarrow$ Also known as Patricia Trie.
    Can produce false positives, so the DBMS always checks the original tuple to see whether a key matches.

    RADIX TREE

    1-bit Span Radix Tree

    Vertically compressed trie that compacts nodes with a single child. $\rightarrow$ Also known as Patricia Trie.
    Can produce false positives, so the DBMS always checks the original tuple to see whether a key matches.

    RADIX TREE: MODIFICATIONS

    RADIX TREE: MODIFICATIONS

    Insert HAIR

    RADIX TREE: MODIFICATIONS

    Insert HAIR

    RADIX TREE: MODIFICATIONS

    Insert HAIR Delete HAT

    RADIX TREE: MODIFICATIONS

    Insert HAIR Delete HAT

    RADIX TREE: MODIFICATIONS

    Insert HAIR Delete HAT

    RADIX TREE: MODIFICATIONS

    Insert HAIRDelete HATDelete HAVE

    RADIX TREE: MODIFICATIONS

    Insert HAIRDelete HATDelete HAVE

    RADIX TREE: MODIFICATIONS

    Insert HAIRDelete HATDelete HAVE

    RADIX TREE: MODIFICATIONS

    Insert HAIR Delete HAT Delete HAVE

    OBSERVATION

    The indexes that we've discussed are useful for "point" and "range" queries: $\longrightarrow$ Find all customers in the 15217 zipcode. $\longrightarrow$ Find all orders between June 2024 and September 2024.
    They are not good at keyword searches: $\longrightarrow$ Example: Find all Wikipedia articles that contain the word "Pavlo"

    OBSERVATION

    The indexes that we've discussed are useful for "point" and "range" queries: $\rightarrow$ Find all customers in the 15217 zipcode. $\rightarrow$ Find all orders between June 2024 and September 2024.
    They are not good at keyword searches:
    $\rightarrow$ Example: Find all Wikipedia articles that contain the word "Pavlo"

    OBSERVATION

    The indexes that we've discussed are useful for "point" and "range" queries: $\longrightarrow$ Find all customers in the 15217 zipcode. $\longrightarrow$ Find all orders between June 2024 and September 2024.
    They are not good at keyword searches:
    $\longrightarrow$ Example: Find all Wikipedia articles that contain the word "Pavlo"

    revisions(id,content,..)

    id content

    Table (html):
    11Wu-Tang Clan is an American hip hop musical collective formed in Staten Island, New York City, in 1992...
    22Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. The institution was established in 1900 by Andrew Carnegie...
    33In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software...
    44Andrew Pavlo, best known as Andy Pavlo, is an associate professor of Computer Science at Carnegie Mellon University. He conducts research on database...

    OBSERVATION

    The indexes that we've discussed are useful for "point" and "range" queries: $\longrightarrow$ Find all customers in the 15217 zipcode. $\longrightarrow$ Find all orders between June 2024 and September 2024.
    They are not good at keyword searches:
    $\longrightarrow$ Example: Find all Wikipedia articles that contain the word "Pavlo"

    revisions(id,content,..)

    id content

    Table (html):
    11Wu-Tang Clan is an American hip hop musical collective formed in Staten Island, New York City, in 1992...
    22Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. The institution was established in 1900 by Andrew Carnegie...
    33In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software...
    44Andrew Pavlo, best known as Andy Pavlo, is an associate professor of Computer Science at Carnegie Mellon University. He conducts research on database...
    CREATE INDEX idx_rev_cntnt ON revisions (content);
    SELECT pageID FROM revisions WHERE content LIKE '%Pavlo%';

    INVERTED INDEX

    An inverted index stores a mapping of terms to records that contain those terms in the target attribute.
    $\longrightarrow$ Sometimes called a full- text search index. $\longrightarrow$ Originally called a concordance (1200s).
    Many major DBMSs support these natively. But there are also specialized DBMSs and libraries.

    revisions(id,content,..)

    id content

    Table (html):
    11Wu-Tang Clan is an American hip hop musical collective formed in Staten Island, New York City, in 1992...
    22Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. The institution was established in 1900 by Andrew Carnegie...
    33In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software...
    44Andrew Pavlo, best known as Andy Pavlo, is an associate professor of Computer Science at Carnegie Mellon University. He conducts research on database...

    INVERTED INDEX

    An inverted index stores a mapping of terms to records that contain those terms in the target attribute.
    $\longrightarrow$ Sometimes called a full- text search index. $\longrightarrow$ Originally called a concordance (1200s).
    Many major DBMSs support these natively. But there are also specialized DBMSs and libraries.

    revisions(id,content,..)

    id content

    Table (html):
    11Wu-Tang Clan is an American hip hop musical collective formed in Staten Island, New York City, in 1992...
    22Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. The institution was established in 1900 by Andrew Carnegie...
    33In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software...
    44Andrew Pavlo, best known as Andy Pavlo, is an associate professor of Computer Science at Carnegie Mellon University. He conducts research on database...

    Dictionary Posting Lists

    INVERTED INDEX

    An inverted index stores a mapping of terms to records that contain those terms in the target attribute.
    $\longrightarrow$ Sometimes called a full- text search index. $\longrightarrow$ Originally called a concordance (1200s).
    Many major DBMSs support these natively. But there are also specialized DBMSs and libraries.

    revisions(id,content,..)

    id content

    Table (html):
    11Wu-Tang Clan is an American hip hop musical collective formed in Staten Island, New York City, in 1992...
    22Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. The institution was established in 1900 by Andrew Carnegie...
    33In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software...
    44Andrew Pavlo, best known as Andy Pavlo, is an associate professor of Computer Science at Carnegie Mellon University. He conducts research on database...

    Dictionary Posting Lists

    INVERTED INDEX

    An inverted index stores a mapping of terms to records that contain those terms in the target attribute.
    $\longrightarrow$ Sometimes called a full- text search index. $\longrightarrow$ Originally called a concordance (1200s).
    Many major DBMSs support these natively. But there are also specialized DBMSs and libraries.

    revisions(id,content,..)

    id content

    Table (html):
    11Wu-Tang Clan is an American hip hop musical collective formed in Staten Island, New York City, in 1992...
    22Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. The institution was established in 1900 by Andrew Carnegie...
    33In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software...
    44Andrew Pavlo, best known as Andy Pavlo, is an associate professor of Computer Science at Carnegie Mellon University. He conducts research on database...

    INVERTED INDEX: LUCENE

    Uses a Finite State Transducer for determining offset of terms in dictionary.
    Incrementally create dictionary segments and then merge them in the background.
    $\longrightarrow$ Uses compression methods we previously discussed (e.g., delta, bit packing). $\longrightarrow$ Also supports precomputed aggregations for terms and occurrences.

    Dictionary

    INVERTED INDEX: LUCENE

    Uses a Finite State Transducer for determining offset of terms in dictionary.
    Incrementally create dictionary segments and then merge them in the background.
    $\rightarrow$ Uses compression methods we previously discussed (e.g., delta, bit packing). $\rightarrow$ Also supports precomputed aggregations for terms and occurrences.

    Dictionary

    INVERTED INDEX: LUCENE

    Uses a Finite State Transducer for determining offset of terms in dictionary.
    Incrementally create dictionary segments and then merge them in the background.
    $\rightarrow$ Uses compression methods we previously discussed (e.g., delta, bit packing). $\rightarrow$ Also supports precomputed aggregations for terms and occurrences.
    [ImageCaption: Offset= 0]

    INVERTED INDEX: LUCENE

    Uses a Finite State Transducer for determining offset of terms in dictionary.
    Incrementally create dictionary segments and then merge them in the background.
    $\rightarrow$ Uses compression methods we previously discussed (e.g., delta, bit packing). $\rightarrow$ Also supports precomputed aggregations for terms and occurrences.
    [ImageCaption: Offset= 2]

    INVERTED INDEX: LUCENE

    Uses a Finite State Transducer for determining offset of terms in dictionary.
    Incrementally create dictionary segments and then merge them in the background.
    $\rightarrow$ Uses compression methods we previously discussed (e.g., delta, bit packing). $\rightarrow$ Also supports precomputed aggregations for terms and occurrences.
    [ImageCaption: Offset= 3]

    INVERTED INDEX: LUCENE

    Uses a Finite State Transducer for determining offset of terms in dictionary.
    Incrementally create dictionary segments and then merge them in the background.
    $\rightarrow$ Uses compression methods we previously discussed (e.g., delta, bit packing). $\rightarrow$ Also supports precomputed aggregations for terms and occurrences.
    [ImageCaption: Offset= 3]

    INVERTED INDEX: POSTGRESQL

    PostgreSQL's Generalized Inverted Index (GIN) uses a B+Tree for the term dictionary that map to a posting list data structure.
    Posting list contents varies depending on number of records per term:
    $\rightarrow$ Few: Sorted list of record ids. $\rightarrow$ Many: Another B+Tree of record ids.
    Uses a separate "pending list" log to avoid incremental updates.

    INVERTED INDEX: POSTGRESQL

    PostgreSQL's Generalized Inverted Index (GIN) uses a B+Tree for the term dictionary that map to a posting list data structure.
    Posting list contents varies depending on number of records per term: $\rightarrow$ Few: Sorted list of record ids. $\rightarrow$ Many: Another B+Tree of record ids.
    Uses a separate "pending list" log to avoid incremental updates.

    INVERTED INDEX: POSTGRESQL

    PostgreSQL's Generalized Inverted Index (GIN) uses a B+Tree for the term dictionary that map to a posting list data structure.
    Posting list contents varies depending on number of records per term: $\rightarrow$ Few: Sorted list of record ids. $\rightarrow$ Many: Another B+Tree of record ids.
    Uses a separate "pending list" log to avoid incremental updates.

    INVERTED INDEX: POSTGRESQL

    PostgreSQL's Generalized Inverted Index (GIN) uses a B+Tree for the term dictionary that map to a posting list data structure.
    Posting list contents varies depending on number of records per term: $\rightarrow$ Few: Sorted list of record ids. $\rightarrow$ Many: Another B+Tree of record ids.
    Uses a separate "pending list" log to avoid incremental updates.

    INVERTED INDEX: POSTGRESQL

    PostgreSQL's Generalized Inverted Index (GIN) uses a B+Tree for the term dictionary that map to a posting list data structure.
    Posting list contents varies depending on number of records per term: $\rightarrow$ Few: Sorted list of record ids. $\rightarrow$ Many: Another B+Tree of record ids.
    Uses a separate "pending list" log to avoid incremental updates.

    INVERTED INDEX: POSTGRESQL

    PostgreSQL's Generalized Inverted Index (GIN) uses a B+Tree for the term dictionary that map to a posting list data structure.
    Posting list contents varies depending on number of records per term: $\rightarrow$ Few: Sorted list of record ids. $\rightarrow$ Many: Another B+Tree of record ids.
    Uses a separate "pending list" log to avoid incremental updates.

    OBSERVATION

    Inverted indexes search data based on its contents. $\rightarrow$ There is a little magic to tweak terms based on linguistic models. $\rightarrow$ Example: Normalization ("Wu- Tang" matches "Wu Tang").
    Instead of searching for records containing exact keywords (e.g., "Wu- Tang"), an application may want search for records that are related to topics (e.g., "hip- hop groups with songs about slinging").

    VECTOR INDEXES

    Specialized data structures to perform nearest- neighbor searches on embeddings. $\longrightarrow$ An embedding is an array of floating point numbers. $\longrightarrow$ May also need to filter data before / after vector searches.
    The correctness of a query depends on whether the result "feels right".

    VECTOR INDEXES: INVERTED FILE

    Partition vectors into smaller groups using a clustering algorithm.
    To find a match, use same clustering algorithm to map into a group, then scan that group's vectors.
    $\longrightarrow$ Also check nearby groups to improve accuracy.
    Preprocess / quantize vectors to reduce dimensionality.
    Example: IVFFlat

    VECTOR INDEXES: INVERTED FILE

    Partition vectors into smaller groups using a clustering algorithm.
    To find a match, use same clustering algorithm to map into a group, then scan that group's vectors.
    $\rightarrow$ Also check nearby groups to improve accuracy.
    Preprocess / quantize vectors to reduce dimensionality.
    Example: IVFFlat

    VECTOR INDEXES: INVERTED FILE

    Partition vectors into smaller groups using a clustering algorithm.
    To find a match, use same clustering algorithm to map into a group, then scan that group's vectors.
    $\rightarrow$ Also check nearby groups to improve accuracy.
    Preprocess / quantize vectors to reduce dimensionality.
    Example: IVFFlat

    VECTOR INDEXES: INVERTED FILE

    Partition vectors into smaller groups using a clustering algorithm.
    To find a match, use same clustering algorithm to map into a group, then scan that group's vectors.
    $\rightarrow$ Also check nearby groups to improve accuracy.
    Preprocess / quantize vectors to reduce dimensionality.
    Example: IVFFlat

    VECTOR INDEXES: NAVIGABLE SMALL WORLDS

    Build a graph where each node represents a vector and it has edges to its n nearest neighbors. $\rightarrow$ Can use multiple levels of graphs (HNSW)
    To find a match for a given vector, enter the graph and then greedily choose the next edge that moves closer to that vector.
    Example: Faiss, hnswlib

    VECTOR INDEXES: NAVIGABLE SMALL WORLDS

    Build a graph where each node represents a vector and it has edges to its n nearest neighbors. $\rightarrow$ Can use multiple levels of graphs (HNSW)
    To find a match for a given vector, enter the graph and then greedily choose the next edge that moves closer to that vector.
    Example: Faiss, hnswlib

    CONCLUSION

    We will see filters again this semester. B+Trees are still the way to go for tree indexes. Inverted indexes are covered in CMU 11- 442.
    We did not discuss geo- spatial tree indexes: $\rightarrow$ Examples: R- Tree, Quad- Tree, KD- Tree $\rightarrow$ This is covered in CMU 15- 826.

    NEXT CLASS

    How to make indexes thread- safe!