19 minute read
Notice a tyop typo? Please submit an issue or open a PR.
The physical database design and the resulting efficiency are critical when building database applications. We may have done an outstanding job designing and implementing the database itself and the applications that run on top of it, but if they are too slow to be useful, then it doesn't matter. Let's look at some tools we have to physically design the database, organize the data, index it, and get to it quickly.
There are two kinds of storage, or memory, in a computer. The bulk of data and applications are stored in secondary storage, which is usually a physical disk. Random access memory (RAM), also known as main memory, stores the currently running programs and their data. Applications run by the CPU can only query and update data stored in main memory. A bus allows us to transport data between disk and main memory.
Main memory is volatile and expensive, needing constant power to maintain its state. Relative to secondary memory, main memory can hold much less data and is faster to access. Secondary memory is permanent storage that is cheaper to maintain since it does not need power to keep its state. Relative to main memory, it can hold much more data and is slower to access.
Only a tiny fraction of a real database will fit in main memory. As a result, we must have auxiliary programs, such as buffer management programs, that enable us to transport data between main and secondary memory. People who talk about main memory databases are not talking about databases. They are talking about toys. Going forward, we will assume that the vast majority of the database will be stored on disk and only a tiny fraction will be in main memory at any given point in time.
Let's briefly discuss why we should care about how long it takes to access data in main memory and on disk. Main memory access typically takes thirty nanoseconds, or seconds. In comparison, disk access takes ten milliseconds, or seconds. In other words, we can perform main memory accesses for every disk access. This difference is so significant that we actually ignore main memory access time when performing cost computations and only count disk access time, also known as the I/O cost.
Let's use a phone book as a metaphor to illustrate the difference between main memory access and disk access. We can think about main memory access as processing the currently open phone book page. Assume that it takes one minute to complete. Disk access, on the other hand, corresponds to flipping the phone book to the next page available for processing. How long would that take? Given that the difference between main memory access and disk access is five orders of magnitude, it would take two hundred days to open the book to the next page. Once that page is open, we can read it in one minute.
To understand how data resides in secondary storage, let's first understand the structure of a disk. A disk consists of several circular plates known as platters, which all sit on the same spindle and rotate together. Read-write heads access the top and bottom of each platter, and an actuator connects all the read-write heads and moves them together in a synchronized fashion, rotating to move them closer to or further from the center of the platter. On each platter surface, what passes under the read-write head in that position is called a track. Each platter, therefore, exposes two tracks. The collection of tracks passing under the read-write heads is known as a cylinder.
Each platter surface is split into sectors. A sector is the smallest physical data unit, 512 bytes, that can be transported from disk to main memory. A block comprises multiple sectors and, depending on what type of data we need to store on disk, typically contains 4kB, 8kB, or 16kB of data, corresponding to 8, 16, or 32 sectors, respectively. Typically, blocks, not sectors, are the unit of information transferred between the disk and main memory.
A higher-end disk pack might have four platters with eight read-write heads. This disk pack might have a 600MB/s transfer rate, running at 10,000 RPMs, with a latency between three and four milliseconds. It could have 150,000 tracks per surface, with 1000kB per track and a total disk capacity of 1200GB. Most systems have a default block size of 4kB, but some major systems default to 8kB.
Let's now look at how we store records on a block on a disk. First, how do we
determine the size of a record? Consider the
RegularUser relation, which has
Hometown attributes. We
previously defined the
Hometown attributes as type
Sex fits into a single character, occupying one byte, and the
datetime type used for
BirthDate occupies eight bytes.
Adding these numbers together, we get a record size of 159 bytes. Let's lay this out in a block:
Sexoccupies byte fifty-one
BirthDateoccupies bytes fifty-two through fifty-nine.
CurrentCityoccupies bytes sixty through 109
Hometownoccupies bytes 110 through 159
At byte 160, we are ready to write the
It is possible to store several records on a single block. Let's assume that the block size is 4kb and that we have the full space available. In reality, this is not true, as the system will reserve some space for metadata about data formats and next pointers, but we can ignore that for now.
Typically we only run with blocks partially full, so let's also assume that we only fill blocks to 80% capacity. We might not want to completely fill a block with data because overflowing a block on an insert attempt comes with extra processing overhead. 80% of 4kB is 3200 bytes, and with a record size of 159 bytes, we can fit about twenty records per block with a little extra space to spare.
There are two ways of thinking about what happens to the space at the end of a block that can't fit a complete record. In a spanned representation, we fill the block with as much data as possible, splitting the record across multiple blocks if needed. In an unspanned representation, we ignore the tiny amount of wasted space at the end of the block and place the entire record on a new block instead.
Typically, if multiple records will fit on a block, most database systems run with unspanned representation simply to avoid the processing overhead necessary to break up records. Of course, if we have record sizes larger than block sizes, we don't have a choice; in those instances, we must use a spanned representation.
Files consist of multiple blocks connected by address pointers. The blocks may not physically sit next to each other on disk, but we can traverse them via these pointers. Within a 32-bit computer architecture, we can assume that block pointers occupy four bytes.
Suppose we have four million
RegularUser records. If we know that we can fit
twenty records in a block, then we need 200,000 blocks (four million divided by
twenty). With a block size of 4kB, our resulting file size is 800MB.
To compute the time it takes to transport data from disk to main memory, let's make some assumptions about how long certain operations take:
Adding these values gives us a range of 5-12ms, so let's conservatively say that a particular page fault takes 10ms.
We can avoid paying multiple seek and rotational costs by collecting multiple blocks of data in a single page fault. This strategy intuitively makes sense: if we need one block of data for our application, it stands to reason that we may need some adjacent blocks as well. Bulk transfers or extent transfersmove a large number of blocks - say, 250 - in one shot without paying additional seek time and rotational delay. Once we have found the first block in the collection, we only incur transfer time costs for the remaining blocks.
Indeed, the savings are significant. Transferring 250 blocks individually at 10ms per block takes 2.5 seconds. Let's consider the savings in seeking time and rotational delay. Transferring 250 blocks in an extent transfer takes 260 ms: 250 transfers at 1 ms plus one seek time and one rotational delay of 10 ms.
One of the problems with bulk transfers is that we will need more buffer space to hold the extra data. In timeshare systems, a favorite buffer management strategy is the least recently used (LRU) strategy. The idea is that when we run out of buffer space and need to free up space for the new data coming in from disk to main memory, we overwrite what has been least recently used. The philosophy here is that the least recently used data is also the least likely to be needed next.
It is not always the case that the LRU strategy works well from the point of view of a database management system. It is excellent for merge joins where two relations are sorted on the join fields, and we merge them on matching keys. However, LRU strategies are very inefficient for nested loop joins. In that case, we might instead use a most recently used (MRU) strategy.
The first disk organization strategy we will examine is a heap, which is an unsorted data file. When we search for a record in a heap, we have no way of knowing on which block that record may sit, so we must scan the data pages to find the record. Sometimes, we are lucky, and the data we need sits on the first block. Other times, it sits on the last block. Given a file of data blocks, we can expect to examine data blocks on average. With four million records spanning 200000 data blocks, equals 100000 blocks. If each data block takes 10 ms to bring into memory, then searching for a record in this heap will take 16.6 minutes on average.
Searching for data in a heap is a slow process. Let's sort the data in the file and see how that improves lookup time. Even with sorted data, we can still use our naive strategy of searching block by block. Again, sometimes the record we are looking for will be on the first block. Sometimes it will be on the last block. On average, as for unsorted data, this strategy requires us to look at about half of the blocks of data, which we already calculated will take 16.6 minutes on average.
Now let's leverage the fact that we have sorted data. What if we had some metadata that would allow us to find the approximate middle block in a data file? We could look at that middle block and then determine whether what we are searching for is to the left or the right of that block. Repeating this process, we can divide the search space in two on each iteration. This process is called binary search and only works with sorted data.
The amount of time it takes to perform a binary search through blocks is on the order of . After accesses, we will end up with a single block containing the data element of interest if the element is in the database. Given 200000 data blocks corresponding to four million records, we have to perform block accesses. Since each block access takes 10 ms, the binary search takes 0.18s. This search is much more efficient than the linear search, which took about 5500 times longer.
For both search algorithms - binary and linear search - remember that the search cost depends on the size of the search space, the number of blocks, , in the case. Since linear search scales linearly with , and binary search scales logarithmically, binary search is the more efficient algorithm for any meaningful .
In addition to sorting the data and leveraging that ordering when we search, we might want to consider ways of reducing the search space itself. Indices help us accomplish this reduction.
Let's remember the structure of our data file. We have records sorted by key values laid out in blocks. To build a sparse primary index, we iterate through the data blocks and append the key value of the first record in the block and a pointer to that block to the current index block. Since we are picking up the key values in sorted order, the index is in sorted order by default.
Since the index blocks only contain key values and block pointers, we can
reference many data blocks from a single index block. The ratio of index blocks
to data blocks is called fanout. What's the fanout in this case? We know
that a block is 4kB and that we only fill blocks to 80% capacity, so we have
3.2kB available. The primary key,
Given a fanout of sixty, how many index blocks do we need? If we start with 200,000 data blocks and can point to sixty data blocks per index block, then we need 200,000 divided by sixty index blocks, equaling 3,334 index blocks.
Instead of building a sparse index, we could have built a dense index. To build a dense primary index, we iterate through the data blocks and append the key value of each record in the block and a pointer to that block to the current index block. Again, since we are picking up the key values in sorted order, the index is in sorted order by default.
The fanout is still sixty, but, in this case, we have one index entry per record instead of one entry per data block. If we divide four million records by sixty index entries per index block, we need 66,666 index blocks to index every key.
One advantage of maintaining a dense index is that the system can answer some queries solely by examining the index without ever having to touch the data blocks. For example, we can use the index to find the maximum and minimum key values. Additionally, we can use dense indices for range queries, such as finding all emails that start with a particular letter.
Let's take a look at some lookup times. Whereas previously, we used to indicate the number of data blocks, let's use to indicate the number of index blocks. The cost of performing a binary search in the index is . Once we identify the data block containing the record of interest, we still have to access that data block, so the total cost is .
For the sparse index containing 3,334 index blocks, we need to perform lookups. At 10 ms per lookup, this search takes 0.13 seconds. Notice that this value is lower than the 0.18 seconds we needed to perform a binary search directly on the data blocks.
The dense index is larger, so we can expect the cost to be higher. With 66,666 index blocks, we need to perform lookups. At 10 ms per lookup, this search takes 0.17 seconds. While this value is greater than the lookup time associated with the sparse index, it's still lower than the 0.18 seconds we needed to search without an index.
Remember that we are only talking about lookup costs here. If we need to perform any updates, the total cost includes the lookup cost and the cost to write data back to disk after it has been changed. In simple situations, we typically add an extra disk access cost for each write. Note that changes that affect the index incur additional access costs.
We build secondary indices when we need to index records on some non-key value. We follow the same process when building a secondary index as when building a dense index, iterating through the data blocks and appending to the current index block the attribute value of each record in the block and a pointer to that block.
When we examine a newly created secondary index, we see that the attribute values are unsorted. Remember that the original data file is sorted by key values but is unsorted for all non-key attribute values. To leverage binary search, we must sort the secondary index after creating it. Since the ordering of attribute values in the index does not match the ordering in the data file, all secondary indices must be dense. A similar consequence of this mismatch is that while dense indices are normally well-suited for range queries, secondary indices are only suited for point queries, despite being dense.
The cost to perform a binary search in the secondary index is still , where is the number of index blocks. With 66,666 index blocks, we need to perform lookups. At 10 ms per lookup, this search takes 0.17 seconds. While this value is greater than the lookup time associated with the sparse index, it's still lower than the 0.18 seconds we needed to search without an index.
Notice that we implicitly made the simplifying assumption that the attribute from which we build the index has no duplicate values. If values for this attribute are not unique, we must consider how to associate these values with a collection of block pointers locating the matching records. Of course, we are allowed to make this assumption when building primary indices because keys must be unique.
We have seen several good ideas so far for increasing how efficiently we find records of interest. For example, we saw that sorting our data and then leveraging that ordering when searching improves search times. Additionally, we saw that reducing the size of the search space by building indices is a smart way to locate data quickly.
If one index improves efficiency, then why don't we index the index and index that index too? We can repeat this process until we have an index that fits entirely on one block, allowing us to read the whole index into memory by paying just one access cost.
Searching through this multilevel index works as follows. We pull the top-level index block into memory. Among the sixty key values in that block, we find the value equal to or just less than the one we are seeking. We follow the corresponding block pointer to the index block in the next level of the tree and repeat this process, traversing down the tree until we hit a leaf node. This leaf node is the data block where the record of interest resides.
Searching for data using multilevel indices is still logarithmic in . However, the base is not two but rather is equal to the fanout. The consequence of changing the base means we no longer divide the search space by two on each iteration but rather by the fanout. Given 3,334 sparse index blocks, with a fanout of sixty, we need to perform accesses to pull the record of interest into memory. At 10 ms per access, we only need 0.03 seconds. Contrast this with the thirteen lookups (and 0.13 seconds) we needed to perform a binary search on the sparse index.
The more indices we place on top of a primary index, the more sensitive we are to overflow effects. Suppose we insert a new record at the data level, and the record overflows its block. We now need to split this block into two to accommodate the additional data. The question is: will this split impact the first level index, and if it does, how far up the tree will the impact propagate? It's important to keep the ripple effect small, as updating the tree comes with a cost, and that's exactly why we don't fill blocks to 100% capacity.
The most popular implementation for multilevel indices, found in many relational database products, is a data structure known as a B+ tree. In a B+ tree, all the data resides in the leaf nodes, and all other nodes are index nodes. A B+ tree remains balanced after insertion, deletion, and update operations; the distance from the root to the leaves remains roughly the same for all leaves over time. Balance is critical since the search performance quickly degrades if the tree becomes unbalanced. Another feature of a B+ tree is that it is rare that an overflow at a lower level will propagate up the tree more than one or two levels, if at all.
Let's look at a completely different approach to primary data organization
called static hashing. In static hashing, we map key values from a key space
into buckets using a hash function,
h can map key
values to buckets 0 through
M-1. For example,
h might map the key
"email@example.com" to bucket 103. Associated with that bucket may be a block
address, and records whose keys map to that bucket will be written to the block
at that address.
A good hash function has the following characteristics. First, it must distribute the values in the key value space uniformly over the buckets. On a related note, it's also important that a hash function avoids collisions. Collisions occur when distinct keys are mapped to the same bucket too many times. A good hash function should also fill the buckets as much as possible. The bucket directory can be large enough that a portion may be stored on disk and, therefore, be subject to being ported between disk and memory. A full directory ensures we can transport the maximum amount of data on each transfer.
Before we insert any records, we have an empty bucket directory and a hash
h. Associated with each bucket number is some space to hold a block
address. When we insert a new record, we hash its key with
h to get the bucket
number, locate or allocate the data block associated with that bucket, and then
write the record to that block. Remember that several records may fit on each
block, so we might not need to allocate a new block for every record we insert.
If a data block is full, and a new record's key hashes to the bucket entry associated with that data block, we must allocate a new data block to continue to store records associated with that bucket. The pointer to that data block will reside at the end of the just-filled data block.
Once we have inserted all the data and built the hash table, we want to use it to find a record with a particular key value. We input the key value to the hash function and receive the bucket entry as output. Next, we look up the block address associated with that bucket in the directory, and then we access that block to retrieve the record if it is in the database.
It takes one block access to bring the bucket directory page we need into memory, followed by one block access to bring the data block associated with the bucket into memory. With an access cost of 10 ms, the total cost for these two accesses is 20 ms. The bucket directory may be small enough to partially or fully stay in memory. If that's the case, accessing a record requires one block access at 10ms.
All the different lookup times we have seen thus far have depended on the data or index size. What's excellent about static hashing is that it provides constant time data access. Beware, though, as we insert more data into a fixed-size bucket directory, we will build increasingly longer chains of data blocks. The records in these blocks are unordered, so as these chains become longer, the search begins to resemble a slow linear heap search. Thankfully, dynamic hash functions that allow the address space to expand to keep the data access cost constant exist.
Also, remember that an important job of the hash function is to allocate the records we insert across the buckets uniformly. Even with an appropriately large bucket directory, a nonuniform hash function will create some chains of data blocks that are much longer than others, which degrades the search performance.
OMSCS Notes is made with in NYC by Matt Schlenker.