    lecture-05-slides.pdf
    Carnegie Mellon University

    Database Systems

    Database Storage: Tuple Organization

    ADMINISTRIVIA

Homework #1 is due January 29th @ 11
Project #1 is due on February 9th @ 11
Project recitation on Monday, February 3rd, from 5-6pm in GHC 4303.

    PREVIOUSLY

We presented a disk-oriented architecture where the DBMS assumes that the primary storage location of the database is on non-volatile disk.
We then discussed a page-oriented storage scheme for organizing tuples across heap files.

    SLOTTED PAGES

The most common layout scheme is called slotted pages.
The slot array maps "slots" to the tuples' starting position offsets.
The header keeps track of:
→ The # of used slots
→ The offset of the starting location of the last slot used.
Fixed- and Var-length Tuple Data
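As a rough illustration of the layout above, here is a minimal sketch of a slotted page: a header tracking the slot count and the last-used offset, a slot array growing from the front, and tuple data growing from the back. All names and the 4KB page size are assumptions for illustration, not any real DBMS's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

constexpr size_t PAGE_SIZE = 4096;

struct SlottedPage {
    std::vector<uint8_t> data = std::vector<uint8_t>(PAGE_SIZE, 0);
    uint16_t num_slots = 0;            // header: # of used slots
    uint16_t last_offset = PAGE_SIZE;  // header: offset of last slot used
    std::vector<uint16_t> slots;       // slot array: slot # -> tuple offset

    // Append a tuple's bytes at the back of the page; returns its slot number.
    int Insert(const std::string& tuple) {
        last_offset -= tuple.size();
        std::memcpy(&data[last_offset], tuple.data(), tuple.size());
        slots.push_back(last_offset);
        return num_slots++;
    }

    // Follow the slot array to find the tuple's starting position.
    std::string Get(int slot, size_t len) const {
        return std::string(reinterpret_cast<const char*>(&data[slots[slot]]), len);
    }
};
```

Because the slot array only stores offsets, tuples can be moved or compacted within the page without changing their slot numbers.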


    RECORD IDS

The DBMS assigns each logical tuple a unique record identifier that represents its physical location in the database.
→ File Id, Page Id, Slot #
→ Most DBMSs do not store ids in the tuple.
→ SQLite uses ROWID as the true primary key and stores them as a hidden attribute.
Applications should never rely on these IDs to mean anything.
PostgreSQL: CTID (6-bytes)
SQLite: ROWID (8-bytes)
Microsoft SQL Server: %%physloc%% (8-bytes)
Oracle: ROWID (10-bytes)

    TUPLE-ORIENTED STORAGE

    Insert a new tuple:

→ Check page directory to find a page with a free slot.
→ Retrieve the page from disk (if not in memory).
→ Check slot array to find empty space in page that will fit.

    Update an existing tuple using its record id:

→ Check page directory to find the location of the page.
→ Retrieve the page from disk (if not in memory).
→ Find the offset in the page using the slot array.
→ If the new data fits, overwrite the existing data. Otherwise, mark the existing tuple as deleted and insert the new version in a different page.
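The insert and update flows above can be sketched as follows. This is purely illustrative: pages are simplified to fixed-size slot arrays of strings, the "page directory" is a map from page id to free-slot count, and none of the names correspond to a real DBMS API.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct RecordId { int page_id; int slot; };

struct Heap {
    static constexpr int SLOTS_PER_PAGE = 2;
    std::vector<std::vector<std::string>> pages;  // page -> slot -> tuple ("" = empty)
    std::map<int, int> directory;                 // page directory: page id -> free slots

    RecordId Insert(const std::string& tuple) {
        // Check the page directory for a page with a free slot,
        // then find empty space in that page.
        for (auto& [pid, free] : directory) {
            if (free > 0) {
                for (int s = 0; s < SLOTS_PER_PAGE; s++) {
                    if (pages[pid][s].empty()) {
                        pages[pid][s] = tuple;
                        free--;
                        return {pid, s};
                    }
                }
            }
        }
        // No page has room: allocate a new page.
        int pid = static_cast<int>(pages.size());
        pages.push_back(std::vector<std::string>(SLOTS_PER_PAGE));
        pages[pid][0] = tuple;
        directory[pid] = SLOTS_PER_PAGE - 1;
        return {pid, 0};
    }

    RecordId Update(RecordId rid, const std::string& tuple, size_t max_len) {
        if (tuple.size() <= max_len) {        // new data fits: overwrite in place
            pages[rid.page_id][rid.slot] = tuple;
            return rid;
        }
        RecordId moved = Insert(tuple);       // insert new version in a different page
        pages[rid.page_id][rid.slot].clear(); // mark old tuple as deleted
        directory[rid.page_id]++;
        return moved;
    }
};
```

Note that an oversized update changes the tuple's record id, which is one reason applications should not rely on record ids meaning anything.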

    TUPLE-ORIENTED STORAGE

Problem #1: Fragmentation
→ Pages are not fully utilized (unusable space, empty slots).
Problem #2: Useless Disk I/O
→ The DBMS must fetch an entire page to update one tuple.
Problem #3: Random Disk I/O
→ The worst-case scenario when updating multiple tuples is that each tuple is on a separate page.
What if the DBMS cannot overwrite data in pages and can only create new pages?
→ Examples: Some object stores, HDFS, Google Colossus

    TODAY'S AGENDA

Log-Structured Storage
Index-Organized Storage
Data Representation

    LOG-STRUCTURED STORAGE

Instead of storing tuples in pages and updating them in-place, the DBMS maintains a log that records changes to tuples.
→ Each log entry represents a tuple PUT/DELETE operation.
→ Originally proposed as log-structured merge trees (LSM Trees) in 1996.
The DBMS applies changes to an in-memory data structure (MemTable) and then writes out the changes sequentially to disk (SSTable).


Key-value storage that appends log records on disk to represent changes to tuples (PUT, DELETE).
→ Each log record must contain the tuple's unique identifier.
→ Put records contain the tuple contents.
→ Deletes mark the tuple as deleted.
As the application makes changes to the database, the DBMS appends log records to the end of the file without checking previous log records.

    SSTable
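The MemTable/SSTable flow can be sketched minimally as below, assuming a `std::map` as the in-memory structure, a sorted vector as the on-disk run, and a missing optional value as a DELETE tombstone. This is an illustration of the idea, not how RocksDB or any real engine is implemented.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct LSMStore {
    std::map<std::string, std::optional<std::string>> memtable;
    std::vector<std::pair<std::string, std::optional<std::string>>> sstable;

    // Writes go to the in-memory MemTable; nullopt marks a delete.
    void Put(const std::string& k, const std::string& v) { memtable[k] = v; }
    void Delete(const std::string& k) { memtable[k] = std::nullopt; }

    // Write the MemTable out sequentially as a sorted run (SSTable).
    void Flush() {
        sstable.assign(memtable.begin(), memtable.end());
        memtable.clear();
    }

    // Reads check the newest data (MemTable) first, then the flushed run.
    std::optional<std::string> Get(const std::string& k) const {
        if (auto it = memtable.find(k); it != memtable.end()) return it->second;
        for (const auto& [key, val] : sstable)
            if (key == k) return val;
        return std::nullopt;
    }
};
```

Because the MemTable is already key-sorted, the flush is a single sequential write, which is where log-structured storage gets its write performance.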

    LOG-STRUCTURED COMPACTION

Periodically compact SSTables to reduce wasted space and speed up reads.
→ Only keep the "latest" values for each key using a sort-merge algorithm.
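A sort-merge compaction of two runs can be sketched as follows, under the simplifying assumption that an SSTable is a key-sorted vector of (key, value) pairs and a missing optional value is a DELETE tombstone:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::optional<std::string>>;
using SSTable = std::vector<Entry>;

// Merge two sorted runs, keeping only the "latest" value per key.
// Entries from the newer run shadow the older run; tombstones are dropped.
SSTable Compact(const SSTable& newer, const SSTable& older) {
    SSTable out;
    size_t i = 0, j = 0;
    auto emit = [&](const Entry& e) {
        if (e.second.has_value()) out.push_back(e);  // drop deletes
    };
    while (i < newer.size() && j < older.size()) {
        if (newer[i].first < older[j].first)      emit(newer[i++]);
        else if (older[j].first < newer[i].first) emit(older[j++]);
        else { emit(newer[i++]); j++; }           // same key: newer wins
    }
    while (i < newer.size()) emit(newer[i++]);
    while (j < older.size()) emit(older[j++]);
    return out;
}
```

The merge is linear in the total number of entries because both inputs are already sorted, but rewriting every live entry is exactly the write amplification discussed next.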


    DISCUSSION

    Log- structured storage managers are more common today than in previous decades.
    $\rightarrow$ This is partly due to the proliferation of RocksDB.
    What are some downsides of this approach?
→ Write amplification.
→ Compaction is expensive.

    OBSERVATION

The two table storage approaches we've discussed so far rely on indexes to find individual tuples.
→ Such indexes are necessary because the tables are inherently unsorted.
    But what if the DBMS could keep tuples sorted automatically using an index?

    INDEX-ORGANIZED STORAGE

The DBMS stores a table's tuples as the value of an index data structure.
→ Still uses a page layout that looks like a slotted page.
→ Tuples are typically sorted in the page based on key.
A B+Tree pays maintenance costs upfront, whereas LSMs pay for it later.
Examples: MySQL, Oracle, SQL Server


    TUPLE STORAGE

A tuple is essentially a sequence of bytes prefixed with a header that contains meta-data about it.
    It is the job of the DBMS to interpret those bytes into attribute types and values.
    The DBMS's catalogs contain the schema information about tables that the system uses to figure out the tuple's layout.

    DATA LAYOUT

CREATE TABLE foo (
    id INT PRIMARY KEY,
    value BIGINT
);

unsigned char[]


header | id | value

reinterpret_cast<int32_t*>(address)
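A sketch of reading and writing such a byte array for `foo(id INT, value BIGINT)`, assuming a 1-byte header and a packed layout. `std::memcpy` is used instead of the slide's `reinterpret_cast` because a pointer cast is only well-defined when the address is suitably aligned for the target type; all offsets here are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr size_t HEADER = 1;  // assumed 1-byte tuple header

void WriteTuple(unsigned char* buf, int32_t id, int64_t value) {
    std::memcpy(buf + HEADER, &id, sizeof(id));
    std::memcpy(buf + HEADER + sizeof(id), &value, sizeof(value));
}

// The catalog tells the DBMS the schema, so it knows which bytes to
// interpret as which attribute type.
int32_t ReadId(const unsigned char* buf) {
    int32_t id;
    std::memcpy(&id, buf + HEADER, sizeof(id));  // int32 at offset 1
    return id;
}

int64_t ReadValue(const unsigned char* buf) {
    int64_t value;
    std::memcpy(&value, buf + HEADER + sizeof(int32_t), sizeof(value));
    return value;
}
```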

    WORD-ALIGNED TUPLES

All attributes in a tuple must be word-aligned to enable the CPU to access them without any unexpected behavior or additional work.
CREATE TABLE foo (
    id INT PRIMARY KEY,
    cdate TIMESTAMP,
    color CHAR(2),
    zipcode INT
);

unsigned char[]


    WORD-ALIGNMENT: PADDING

Add empty bits after attributes to ensure that the tuple is word-aligned. Essentially, round up the storage size of types to the next largest word size.

    WORD-ALIGNMENT: REORDERING

    Switch the order of attributes in the tuples' physical layout to make sure they are aligned.
    → May still have to use padding to fill remaining space.
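The effect of padding vs. reordering is easy to see with `sizeof` on ordinary C++ structs. The fields here are a deliberately simple hypothetical (not the `foo` table above), and the exact sizes assume a typical 64-bit ABI where `int64_t` is 8-byte aligned:

```cpp
#include <cassert>
#include <cstdint>

struct Padded {        // declaration order forces the compiler to pad
    char    a;         // 1 byte + 7 bytes padding (align b to 8)
    int64_t b;         // 8 bytes
    char    c;         // 1 byte + 7 bytes tail padding
};                     // typically 24 bytes

struct Reordered {     // largest-first order needs far less padding
    int64_t b;         // 8 bytes
    char    a;         // 1 byte
    char    c;         // 1 byte + 6 bytes tail padding
};                     // typically 16 bytes
```

The same attributes consume 8 fewer bytes per tuple after reordering, even though some tail padding remains.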


    DATA REPRESENTATION

INTEGER/BIGINT/SMALLINT/TINYINT
→ Same as in C/C++.
FLOAT/REAL vs. NUMERIC/DECIMAL
→ IEEE-754 Standard / Fixed-point Decimals.
VARCHAR/VARBINARY/TEXT/BLOB
→ Header with length, followed by data bytes OR pointer to another page/offset with data.
→ Need to worry about collations / sorting.
TIME/DATE/TIMESTAMP/INTERVAL
→ 32/64-bit integer of (micro/milli)seconds since the Unix epoch (January 1st, 1970).


    VARIABLE PRECISION NUMBERS

Inexact, variable-precision numeric type that uses the "native" C/C++ types.
Stored directly as specified by IEEE-754.
→ Examples: FLOAT, REAL/DOUBLE
These types are typically faster than fixed-precision numbers because CPU ISAs (Xeon, Arm) have instructions/registers to support them.
But they do not guarantee exact values...

    VARIABLE PRECISION NUMBERS

    Rounding Example

#include <stdio.h>
int main(int argc, char* argv[]) {
    float x = 0.1;
    float y = 0.2;
    printf("x+y = %f\n", x + y);
    printf("0.3 = %f\n", 0.3);
    return 0;
}

    Output

x+y = 0.300000
0.3 = 0.300000

VARIABLE PRECISION NUMBERS

Rounding Example

Printing the same values with more digits of precision (e.g., %.20f) reveals the rounding error:

Output

x+y = 0.30000001192092895508
0.3 = 0.29999999999999998890

    FIXED PRECISION NUMBERS

Numeric data types with (potentially) arbitrary precision and scale. Used when rounding errors are unacceptable.
→ Examples: NUMERIC, DECIMAL
Many different implementations.
→ Example: Store in an exact, variable-length binary representation with additional meta-data.
→ Can be less expensive if the DBMS does not provide arbitrary precision (e.g., decimal point can be in a different position per value).
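The "less expensive" variant can be sketched as a scaled integer: the decimal point is in the same position for every value, so arithmetic stays exact integer math. The name `Decimal2` and the fixed scale of 2 are illustrative assumptions, not any DBMS's actual representation.

```cpp
#include <cassert>
#include <cstdint>

// Fixed-precision decimal with a fixed scale of two digits after the
// decimal point, stored as the value multiplied by 100.
struct Decimal2 {
    int64_t scaled;  // value * 100

    static Decimal2 FromParts(int64_t whole, int64_t hundredths) {
        return {whole * 100 + hundredths};
    }
    // Addition of same-scale values is plain integer addition: exact.
    Decimal2 operator+(Decimal2 o) const { return {scaled + o.scaled}; }
    bool operator==(Decimal2 o) const { return scaled == o.scaled; }
};
```

Unlike the float example above, 0.10 + 0.20 here is exactly 0.30, because no binary rounding ever occurs.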

    POSTGRES: NUMERIC

typedef unsigned char NumericDigit;
typedef struct {
    int ndigits;           // # of digits
    int weight;            // weight of first digit
    int scale;             // scale factor
    int sign;              // positive / negative / NaN
    NumericDigit *digits;  // digit storage
} numeric;


    NULL DATA TYPES

    Choice #1: Null Column Bitmap Header

→ Store a bitmap in a centralized header that specifies which attributes are null.
→ This is the most common approach in row-stores.

Choice #2: Special Values

→ Designate a placeholder value to represent NULL for a data type (e.g., INT32_MIN).
→ More common in column-stores.

Choice #3: Per Attribute Null Flag

→ Store a flag that marks that a value is null.
→ Must use more space than just a single bit because this messes up word alignment.
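Choice #1 can be sketched with a one-byte bitmap in the tuple header, one bit per attribute, under the assumption that the table has at most 8 attributes; the names here are illustrative.

```cpp
#include <cassert>
#include <cstdint>

struct TupleHeader {
    uint8_t null_bitmap = 0;  // bit i set => attribute i is NULL

    void SetNull(int attr)   { null_bitmap |= static_cast<uint8_t>(1u << attr); }
    void ClearNull(int attr) { null_bitmap &= static_cast<uint8_t>(~(1u << attr)); }
    bool IsNull(int attr) const { return (null_bitmap >> attr) & 1u; }
};
```

One header byte covers all attributes, whereas per-attribute flags (Choice #3) would cost at least a byte per attribute once alignment is accounted for.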


[Slide shows the first page of: Xinyu Zeng, Ruijun Meng, Andrew Pavlo, Wes McKinney, Huanchen Zhang. "NULLS! Revisiting Null Representation in Modern Columnar Formats." DaMoN'24, June 10, 2024. https://doi.org/10.1145/3662010.3663452 — comparing the Compact layout (Parquet/ORC, non-null values packed contiguously with a separate null bitmap) against the Placeholder layout (Arrow/DuckDB, nulls stored in place as placeholder values).]

    LARGE VALUES

Most DBMSs do not allow a tuple to exceed the size of a single page.
To store values that are larger than a page, the DBMS uses separate overflow storage pages.
→ Postgres: TOAST (>2KB)
→ MySQL: Overflow (>1/2 size of page)
→ SQL Server: Overflow (>size of page)
Lots of potential optimizations:
→ Overflow Compression, German Strings
CREATE TABLE foo (
    id INT PRIMARY KEY,
    data INT,
    contents TEXT
);

Tuple layout: Header | INT | INT | size | location → Overflow Page: VARCHAR DATA
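The overflow scheme can be sketched as below: when the value exceeds a threshold, the tuple keeps only a (size, location) reference and the bytes live on a separate overflow page. The 2048-byte threshold loosely mirrors the Postgres TOAST figure above; everything else (names, the vector-of-pages model) is illustrative.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct OverflowStore {
    static constexpr size_t THRESHOLD = 2048;
    std::vector<std::string> overflow_pages;

    // What the tuple stores for a variable-length attribute: either the
    // bytes inline, or a (size, location) pointer into overflow storage.
    struct Field {
        bool overflowed;
        size_t size;
        size_t location;
        std::string inline_data;
    };

    Field Store(const std::string& contents) {
        if (contents.size() <= THRESHOLD)
            return {false, contents.size(), 0, contents};
        overflow_pages.push_back(contents);  // write bytes to an overflow page
        return {true, contents.size(), overflow_pages.size() - 1, ""};
    }

    std::string Load(const Field& f) const {
        return f.overflowed ? overflow_pages[f.location] : f.inline_data;
    }
};
```

Reading an overflowed value costs an extra page fetch, which is why systems only spill values past a threshold rather than always.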


    EXTERNAL VALUE STORAGE

Some systems allow you to store a large value in an external file. Treated as a BLOB type.
→ Oracle: BFILE data type
→ Microsoft: FILESTREAM data type
The DBMS cannot manipulate the contents of an external file.
→ No durability protections.
→ No transaction protections.

Tuple: Header | a | b | c | d | e → External File: Data


[Slide shows the first page of: Russell Sears, Catharine van Ingen, Jim Gray. "To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?" Microsoft Research Technical Report MSR-TR-2006-45, April 2006. Comparing NTFS and SQL Server 2005, the study finds BLOBs smaller than ~256KB are handled more efficiently by the database and BLOBs larger than ~1MB by the filesystem, with storage fragmentation being the main determinant of the break-even point.]

    SYSTEM CATALOGS

A DBMS stores meta-data about databases in its internal catalogs.
→ Tables, columns, indexes, views
→ Users, permissions
→ Internal statistics
Almost every DBMS stores the database's catalog inside itself (i.e., as tables).
→ Wrap object abstraction around tuples.
→ Specialized code for "bootstrapping" catalog tables.

    SYSTEM CATALOGS

You can query the DBMS's internal INFORMATION_SCHEMA catalog to get info about the database.
→ ANSI standard set of read-only views that provide info about all the tables, views, columns, and procedures in a database.
DBMSs also have non-standard shortcuts to retrieve this information.

    ACCESSING TABLE SCHEMA

    List all the tables in the current database:

    ACCESSING TABLE SCHEMA

List all the columns in the student table:
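The queries themselves did not survive extraction from these two slides. Standard INFORMATION_SCHEMA queries for each look like the following (illustrative; the exact columns available vary by DBMS):

```sql
-- List all the tables in the current database:
SELECT * FROM INFORMATION_SCHEMA.TABLES;

-- List all the columns in the student table:
SELECT * FROM INFORMATION_SCHEMA.COLUMNS
 WHERE table_name = 'student';
```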

    SCHEMA CHANGES

ADD COLUMN:

→ NSM: Copy tuples into a new region in memory.
→ DSM: Just create the new column segment on disk.

DROP COLUMN:

→ NSM #1: Copy tuples into a new region of memory.
→ NSM #2: Mark the column as "deprecated" and clean up later.
→ DSM: Just drop the column and free memory.

CHANGE COLUMN:

→ Check whether the conversion is allowed to happen. Depends on default values.

    INDEXES

CREATE INDEX:

→ Scan the entire table and populate the index.
→ Have to record changes made by txns that modified the table while another txn was building the index.
→ When the scan completes, lock the table and resolve changes that were missed after the scan started.

DROP INDEX:

→ Just drop the index logically from the catalog.
→ It only becomes "invisible" when the txn that dropped it commits. All existing txns will still have to update it.

    CONCLUSION

Log-structured storage is an alternative approach to the tuple-oriented architecture.
→ Ideal for write-heavy workloads because it maximizes sequential disk I/O.
    The storage manager is not entirely independent from the rest of the DBMS.

    NEXT CLASS

    Breaking your preconceived notion that a DBMS stores everything as rows...