IT462: Data Mining Lab
Due date: Friday, March 6, 2009, before class
Lab Description and Requirements
For this lab, you will implement a data mining algorithm and try it on a sample data set. Specifically, you will implement the Apriori algorithm to find frequent itemsets. The algorithm is described in the textbook (Chapter 26) and in "Fast Algorithms for Mining Association Rules" by Rakesh Agrawal and Ramakrishnan Srikant. In your implementation, you can assume the following:
- Transaction data is stored in a flat file (not a database).
- All itemsets fit in memory, so you can use a hash table or array to count the itemsets.
- You can use any algorithm you want to generate the candidate large itemsets.
- The input for your algorithm should be:
  - a file containing the transaction data in a specific format: one transaction per line, where each line contains the transaction id followed by the list of items bought in that transaction, with fields separated by commas
  - the minimum support specified by the user
- The output for your algorithm should be a file containing the list of frequent itemsets (itemsets with support greater than or equal to the user-specified minimum support).
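The requirements above can be sketched roughly as follows. This is only an illustrative outline in Python, not a reference solution: the function names parse_transactions and apriori are hypothetical, the minimum support is assumed to be given as a fraction of all transactions, and candidate generation uses a simple join-and-prune step (any other method is equally acceptable). The item names in the usage example are made up.

```python
from collections import defaultdict
from itertools import combinations


def parse_transactions(lines):
    """Parse lines of the form 'tid,item1,item2,...' into a list of item sets."""
    transactions = []
    for line in lines:
        fields = [f.strip() for f in line.strip().split(",")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        # The first field is the transaction id; the rest are items.
        transactions.append(frozenset(f for f in fields[1:] if f))
    return transactions


def apriori(transactions, min_support):
    """Return {itemset: support} for itemsets with support >= min_support.

    min_support is assumed to be a fraction between 0 and 1.
    """
    n = len(transactions)

    # Pass 1: count individual items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join: combine frequent (k-1)-itemsets whose union has exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count how many transactions contain each surviving candidate.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
        result.update(frequent)
        k += 1
    return result


if __name__ == "__main__":
    sample = ["1,bread,milk", "2,bread,beer", "3,bread,milk,beer", "4,milk"]
    found = apriori(parse_transactions(sample), 0.5)
    for itemset, support in sorted(found.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
        print(sorted(itemset), support)
```

In a complete program the lines would be read from the input file and the itemsets written to the results file; the in-memory list here just keeps the sketch short.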
You can use any programming language to implement the algorithm. Make sure your code is easy to use and well documented.
I've created a sample file where the products are identified by name, and one where the products are identified by numbers. You can use either one. Test your program by finding the frequent itemsets for different min support values, but turn in the results for min support of 15 percent.
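When comparing against a percentage threshold, it can help to convert it once into a minimum transaction count. The sketch below assumes that "min support of 15 percent" means an itemset must appear in at least 15 percent of all transactions, and that the percentage is a whole number; the helper name min_support_count is hypothetical.

```python
def min_support_count(num_transactions, min_support_percent):
    """Smallest number of transactions that meets the given support percentage.

    Computes the ceiling of num_transactions * percent / 100 using integer
    arithmetic only, so there are no floating-point rounding surprises.
    Assumes min_support_percent is an integer.
    """
    return -(-num_transactions * min_support_percent // 100)


if __name__ == "__main__":
    for n in (10, 20, 200):
        print(n, "transactions ->", min_support_count(n, 15), "needed at 15 percent")
```

An itemset is then frequent when its count is greater than or equal to this value.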
Turn In (Due date: Friday, March 6, 2009, before class)
Electronic
Upload your code and the results file (the frequent itemsets for a min support of 15 percent on the sample file) to the Lab 5 assignment on Blackboard.
Hard-copy
- Completed assignment coversheet. Your comments will help us improve this course.
- A hard copy of the same files you uploaded: your code and the results file obtained by executing your code with the sample file and a min support of 15 percent as inputs.