SOLUTION - IT360 Lab 14 - DB Performance - SOLUTION

 

 

This lab should get you familiarized with indexing and query efficiency.

 

Create a file called yourlastname_Lab14.txt that will contain all your answers for this lab. Write your name at the top of the file.

 

 

Exercise 1: Consider a disk with average I/O time of 15 milliseconds per page and page size = 4096 bytes. For all of the following questions, the time needed to answer a query is considered to be the number of pages accessed * average I/O time per page.

 

 

Consider a table Products(BarCode, PName, Price) with 200,000 rows of 100 bytes each. The table is stored on disk such that no row spans 2 pages.

 

    1. Compute the number of pages needed to store the entire table on disk

Solution:

Nb rows per page = floor(page size/ row size) = floor(4096/100) = 40

Nb pages to store all rows = ceiling(nb rows/nb rows per page) = ceiling(200000/40) = 5000 pages

 

    1. Compute the approximate time  needed to evaluate the following query, assuming no indexes are used:

SELECT * FROM Products

 

Solution:

If there are no indexes, all pages for the tables need to be read. Time is

Nb pages * time to read one page = 5000 pages *15msec/page= 75000msec = 75 sec

 

    1. Consider the following query:

SELECT * FROM Products WHERE Price between 10 and 100

Assume that a B+tree index exists on Price, and 75% of the products satisfy the condition in the WHERE clause (Price between 10 and 100). For the following questions, you can ignore the cost of retrieving the index entries. The only cost of the operation comes from retrieving the rows (after the “pointers” are found in the index).

 

 

c1. Assume that the B+tree index is not clustered. Compute the approximate time (in seconds) needed to evaluate the query using the unclustered B+tree index.

 

Solution:

We ignore the cost of retrieving the index entries, and the main cost of the operation comes from retrieving the rows, after the “pointers” (data entries in the index) are found. Since the index is unclustered, and the memory size is limited, for every row that matches the condition we might need to bring a new page in memory to read it.

 

Cost           =nb rows matching the condition * time to read one page

                  =0.75*200000 * 15msec =  2,250,000 msec = 2,250 sec

 

Solution 2:

If we don’t ignore the cost of finding the matching rows using the index, the cost to evaluate the query using an unclustered B+ tree is:

 

(height of index -1 + nb leaf pages containing matching data entries + nb rows matching the query) * time to access one page

 

Assume that Price column has a double data type and needs 8 bytes of storage. Since a page has 4096 bytes, we can fit at most 512 entries in a page. However, since the pages are not kept full (usually pages are 67% full), we can assume there are around 400 entries per page, so we have a round number to work with.

Nb leaf pages = ceiling (200,000/400) = 500 leaf pages

A reasonable fanout for the B+tree is 150, so the height for the B+tree would be ceil(log150500) = 2

 

Time to evaluate the query =

(2 -1 + 0.75*500 + 0.75*200000)* 15msec = 2255640 msec = 2,255.640 sec

 

c2. Assume now that the B+tree index is clustered. Compute the approximate time (in seconds) needed to evaluate the query using the clustered index.

 

Solution 1:

We ignore the cost of retrieving the index entries, and the main cost of the operation comes from retrieving the rows, after the “pointers” (data entries in the index) are found. Since the index is clustered rows that are “close” in the index order, will be close to each other on the disk. We can therefore assume that if 40 rows matches the query, all of them will be on the same page.

 

Cost=nb rows matching the condition/nb rows per page * time to read one page

      =0.75*200000/40 * 15msec =  56250msec = 56.250 sec

 

Solution 2:

If we don’t ignore the cost of finding the matching rows using the index, the cost to evaluate the query using a clustered B+ tree is:

 

(height of index -1 + nb leaf pages containing matching data entries + nb rows matching the query/nb rows per page) * time to access one page

 

As before, assume that Price column has a double data type and needs 8 bytes of storage. Since a page has 4096 bytes, we can fit at most 512 entries in a page. However, since the pages are not kept full (usually pages are 67% full), we can assume there are around 400 entries per page, so we have a round number to work with.

Nb leaf pages = ceiling (200,000/400) = 500 leaf pages

A reasonable fanout for the B+tree is 150, so the height for the B+tree would be ceil(log150500) = 2

 

Time to evaluate the query =

(2 -1 + 0.75*500 + 0.75*200000/40)* 15msec = 61890 msec = 61.890 sec

 

c3. Compute the approximate time (in seconds) needed to evaluate the query without using any index.

 

Solution:

If there are no indexes, all pages for the tables need to be read, even if some of the rows do not satisfy the condition. Time is

Nb pages * time to read one page = 5000*15= 75000msec = 75 sec

 

Important note: Sometimes it is better to just scan the entire table (75 sec in the example), instead of using an unclustered index (~2255sec in the example)!!!! Which strategy is better depends on the selectivity of the condition (percentage of rows that satisfy the condition in the query). The database query evaluation engine will estimate which way of evaluating the query is faster, and execute the faster plan.

 

Exercise 2:

 

·       Consider a table with the following schema

StudentsEnroll(StudentID, CourseID, ACYearEnd, Semester).

 

 

Consider the query:

 

SELECT CourseID, Count(*) AS NumStudents

FROM StudentsEnroll

WHERE ACYearEnd = 2013 and Semester = 'SPRING'

GROUP BY CourseID;

 

 

a) What index (if any) would you create to improve the efficiency of the query?  Justify your answer.

 

Solution:

An index on (AcYearEnd, Semester) would be useful to find the rows satisfying the query condition, especially if the condition is selective.

A clustered index on CourseID (hash or B+tree) might be useful for grouping, if the condition is not very selective.

A B+tree index on (AcYearEnd, Semester, CourseID) or (Semester, AcYearEnd, CourseID) would probably be best. The index can be unclustered, since all the columns needed for the query are part of the index, so the query processor can do an index-only scan (actual table rows don’t need to be retrieved from the disk).

 

 

b) Write the SQL statement to create the above index in MySQL. Execute the statement in MySQL.

 

Solution:

CREATE INDEX I_SemesterOffering

USING BTREE

ON StudentsEnroll(ACYearEnd, Semester, CourseID)