SOLUTION - IT360 Lab
14 - DB Performance - SOLUTION
This lab should get you familiarized with indexing and query efficiency.
Create a file called yourlastname_Lab14.txt that will contain all your answers for this lab. Write your name at the top of the file.
Exercise 1: Consider a disk with average I/O time of 15 milliseconds per page and page size = 4096 bytes. For all of the following questions, the time needed to answer a query is considered to be the number of pages accessed * average I/O time per page.
Consider a table Products(BarCode, PName, Price) with 200,000 rows of 100 bytes each. The table is stored on disk such that no row spans 2 pages.
Solution:
Nb rows per page = floor(page size/ row size) = floor(4096/100) = 40
Nb pages to store all
rows = ceiling(nb rows/nb rows per page) = ceiling(200000/40) = 5000 pages
SELECT * FROM Products
Solution:
If there are no indexes, all pages for the
tables need to be read. Time is
Nb pages * time to
read one page = 5000 pages *15msec/page= 75000msec = 75 sec
SELECT * FROM Products WHERE Price between 10 and 100
Assume that a B+tree index exists on Price, and 75% of the products satisfy the condition in the WHERE clause (Price between 10 and 100). For the following questions, you can ignore the cost of retrieving the index entries. The only cost of the operation comes from retrieving the rows (after the “pointers” are found in the index).
c1. Assume that the B+tree index is not clustered. Compute the approximate time (in seconds) needed to evaluate the query using the unclustered B+tree index.
Solution:
We ignore the cost of retrieving the index
entries, and the main cost of the operation comes from retrieving the rows,
after the “pointers” (data entries in the index) are found. Since the index is unclustered, and the memory size is limited, for every row
that matches the condition we might need to bring a new page in memory to read
it.
Cost =nb rows matching the condition * time to read one page
=0.75*200000
* 15msec = 2,250,000
msec = 2,250 sec
Solution 2:
If we don’t ignore the cost of finding the
matching rows using the index, the cost to evaluate the query using an unclustered B+ tree is:
(height of index -1 + nb leaf pages containing matching data entries + nb rows matching the query) * time
to access one page
Assume that Price column has a double data type
and needs 8 bytes of storage. Since a page has 4096 bytes, we can fit at most
512 entries in a page. However, since the pages are not kept full (usually
pages are 67% full), we can assume there are around 400 entries per page, so we
have a round number to work with.
Nb leaf pages =
ceiling (200,000/400) = 500 leaf pages
A reasonable fanout
for the B+tree is 150, so the height for the B+tree would be ceil(log150500)
= 2
Time to evaluate the query =
(2 -1 + 0.75*500 +
0.75*200000)* 15msec = 2255640 msec = 2,255.640 sec
c2. Assume now that the B+tree index is clustered. Compute the approximate time (in
seconds) needed to evaluate the query using the clustered index.
Solution 1:
We ignore the cost of retrieving the index entries,
and the main cost of the operation comes from retrieving the rows, after the
“pointers” (data entries in the index) are found. Since the index is clustered
rows that are “close” in the index order, will be close to each other on the
disk. We can therefore assume that if 40 rows matches
the query, all of them will be on the same page.
Cost=nb rows matching
the condition/nb rows per page * time to read one
page
=0.75*200000/40
* 15msec = 56250msec
= 56.250 sec
Solution 2:
If we don’t ignore the cost of finding the
matching rows using the index, the cost to evaluate the query using a clustered
B+ tree is:
(height of index -1 + nb leaf pages containing matching data entries + nb rows matching the query/nb
rows per page) * time to access one page
As before, assume that Price column has a
double data type and needs 8 bytes of storage. Since a page has 4096 bytes, we
can fit at most 512 entries in a page. However, since the pages are not kept
full (usually pages are 67% full), we can assume there are around 400 entries
per page, so we have a round number to work with.
Nb leaf pages =
ceiling (200,000/400) = 500 leaf pages
A reasonable fanout
for the B+tree is 150, so the height for the B+tree would be ceil(log150500)
= 2
Time to evaluate the query =
(2 -1 + 0.75*500 +
0.75*200000/40)* 15msec = 61890 msec = 61.890 sec
c3. Compute the approximate time (in seconds)
needed to evaluate the query without using any index.
Solution:
If there are no indexes, all pages for the
tables need to be read, even if some of the rows do not satisfy the condition.
Time is
Nb pages * time to
read one page = 5000*15= 75000msec = 75 sec
Important
note: Sometimes it is better to just scan the entire table (75 sec in the
example), instead of using an unclustered index (~2255sec
in the example)!!!! Which strategy is better depends on the selectivity of the
condition (percentage of rows that satisfy the condition in the query). The
database query evaluation engine will estimate which way of evaluating the
query is faster, and execute the faster plan.
Exercise 2:
· Consider a table with the following schema
StudentsEnroll(StudentID, CourseID, ACYearEnd, Semester).
Consider the query:
SELECT CourseID, Count(*) AS NumStudents
FROM StudentsEnroll
WHERE ACYearEnd = 2013 and Semester = 'SPRING'
GROUP BY CourseID;
a) What index (if any) would you create to improve the efficiency of the query? Justify your answer.
Solution:
An index on (AcYearEnd,
Semester) would be useful to find the rows satisfying the query condition,
especially if the condition is selective.
A clustered index on CourseID
(hash or B+tree) might be useful for grouping, if the
condition is not very selective.
A B+tree index on (AcYearEnd, Semester, CourseID) or (Semester, AcYearEnd, CourseID) would probably be best. The index can be unclustered, since all the columns needed for the query are part of the index, so the query processor can do an index-only scan (actual table rows don’t need to be retrieved from the disk).
b) Write the SQL statement to create the above index in MySQL. Execute the statement in MySQL.
Solution:
CREATE INDEX I_SemesterOffering
USING BTREE
ON StudentsEnroll(ACYearEnd, Semester, CourseID)