SD321 (Fall 2024)

Storage and Indexing

Throughout the semester we have taken on the challenge of Database Design, covering the range of topics from Requirements analysis, conceptual and logical design, schema refinement and normalization, and we have reached the final topic of Physical Tuning. In these final discussions it is time to review how the database stores information to disk, and how we can work to make our databases run faster, speed = money in the real world!

Disks and Files

MySQL, and other Database Management Systems, need effecient methods in which to store your data. At a basic level you can imagine a file as a collection of records. The data is stored and retrieved in units called disk blocks or pages, and unlike the speeds provided by the host system's RAM, the time to retrieve a disk page varies depending upon the location of the information on disk. Knowing this, the relative placement of pages on disk can have a major impact on DBMS performance. The time needed for this Input/Output (I/O) time will be the largest factor in calculating the time for queries.

Let's consider the following classes of queries:

Equality Queries

SELECT * FROM Product WHERE BarCode = 10002121;

Range Queries

SELECT * FROM Product WHERE Price BETWEEN 5 AND 15;

Assuming 200,000 rows in the Product table (20,000 pages on disk), for either case we will need indexes to allow for fast access to the data.

Indexes

An index on a file speeds up selections on search key columns, and any subset of columns of a table can be the search key for an index on the table! The indexing method that we choose should be targeted to the class of queries we wish to speed up processing (either equality or range queries).

Hashing (Equality)

For queries where checking for items that are exactly equal to a value, we will want to use an index based on hashing. This provides for constant search time.

B+ Trees (Ranges)

For queries where searching for items within a specified ranges are used often enough to warrant an index, we will typically use B+ trees. This achieves \(O(log_d N)\) search time where the fan out is \(d\) (typically approx 150) and the number of data entries is \(N\).

With trees, when we perform an insert, update, or delete, we need to find the data entry in a leaf and change it, which may lead to a need to change the parents: change sometimes bubbles up the tree.

Clustering

If the order of the rows on the hard disk is the same as the order of the data entries, this is known as a clustered index. A file can be clustered on (at most) one search key, and as a result the speed of retrieving data records through the index varies greatly based on whether or not the index is clustered. The following diagram will hopefully illustrate the importance of clustering the data on the key that will be used most frequently.

Creating an Index in MySQL

The syntax within MySQL to create an index has a format similar to other MySQL structures

CREATE [UNIQUE] INDEX index_name
  [USING index_type]
  ON table_name (col1, ...)

Where index_type is either BTREE or HASH. Example:

CREATE INDEX I_ItemPrice
  USING BTREE
  ON Product (Price);

Calculations

It is useful to understand and calculate the time necessary to read data from the database. Lets walk through an example.

Here are a few of our settings and the spead of our example drive:

Row Size (in storage): 250 bytes
Page Size (in storage): 4096 bytes
Read time (per page): 20msec
Number of rows in our database: 2,500,000

And these are a few of the necessary formulas we should consider:

Number of rows per page: \( floor(pageSize / rowSize) \)
Number of pages needed: \( ceiling(totalRows / rowsPerPage) \)
Time to read all rows sequentially: \( numPages * readTimePerPage \)
Worst Case Scenario, Time to read all rows randomly: \( numRows * readTimePerPage \)

The largest variable in the information will be the Read time (per page) when retrieving data. When working with spinning disk drives (HDD), the primary factors are:

Seek Time - The time it takes for the read/write head to move to the correct track on the disk, typically 4-9 ms.
Rotational Latency - The time waiting for the desired sector to rotate under the read/write head, for a 7200 RPM drive this would be approximately 5ms
Data Trasfer Time - The actual time to read the data once the head is in position, in the rage of 0.5-1 ms.

If we transitioned from spinning drives to solid state drives we would have significant improvements in time. Why? No moving parts. Lets look at the primary factors with SSDs:

Controller Processing Time - The time taken by the SSD's controller to process the read command. 10-30 µs.
NAND Flass Access Latency - The inherent latency of accessing data stored in the NAND flash memory. 30-100 µs.
Data Trasfer Time - The time required to transfer the data from the SSD to the system memory 30-100 µs.
Queueing and System Overhead - Additional delays due to command queueing, operating system processing, and potential contention with other I/O operations. 20-50 µs.

Considerations

When we are adding indexes to our database, there are a few key questions we should consider:

What indexes should we create?
Which tables should have indexes?
What column(s) should be the search key?
Should we build several indexes?
For each index what kind should it be
- HASH or BTREE?
- Clustered?

One of the most important considerations is determining the columns that you will want to use for creating the index. Columns used within WHERE clauses are excellent candidates for index keys.

A query asking for an exact match suggests using a HASH
A query looking for items within a range suggests a TREE

Try to choose indexes that benefit as many queries as possible, but remember you only get one clustered index per table!

Problems

Consider a disk with average I/O time of 20msec and a page size of 1024 bytes, The table has 200,000 rows each consisting of 100 bytes with no row spanning two pages.
- How many pages are needed store the table?
- What is the time to read all rows sequentially?
- What is the worst case time to read all rows in some random order?
With the following SQL:
```
SELECT * FROM Product WHERE Price BETWEEN 5 AND 10
```
- Is a HASH index useful in this case? Why / Why Not?
- Compute the time needed to evaluate the query that 20% of the data satisfies the condition, the disk has an average I/O time of 20 msec, the page size is 1024 bytes, and the table has 200,000 rows of 100 bytes each, no row spans 2 pages.