This is the archived website of SI 486H from the Spring 2016 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Problem 100

File fingerprinter

Due: May 3
Points: 5

Write a program that reads in a (possibly large) file and computes a fingerprint value (sequence of small integers) for it. Specifically, your program should be able to read any file, and then it should print out a series of numbers that uniquely identify that file.

You can use any of the ideas that we talked about in class - or even some that we didn't - to do this fingerprinting, as long as:

  • You actually develop the code. You can't just find a file fingerprinting algorithm online and submit it for credit.
  • Your fingerprints should be sensitive to the smallest changes in the file. If I change just one character in the whole file, the fingerprint should also change.
  • Your fingerprints should be reasonably fast. In particular, I should be able to run your fingerprinting algorithm on files that are multiple megabytes in length, and it should return the fingerprint in less than 30 seconds or so.
  • Your fingerprints must be consistent! You will want to use some random values to construct the fingerprinting function, but these should be chosen once and hard-coded into your program, so that the fingerprints it computes are the same every time.

Here's an overview of one way you could do this:

  1. Read in the file one byte at a time. Treat each byte as the coefficient of a big polynomial. You want to evaluate that polynomial at a few randomly-chosen points.
  2. A straightforward evaluation of the polynomial would produce a really ridiculously huge number. You might consider doing the evaluation modulo a prime \(p\). If you choose \(p\) to be a prime number less than \(2^{16}\), then you can do the whole thing using regular ints in C++ or Java.
  3. The randomly-chosen points from (1) and the prime \(p\) from (2) should be chosen once and then hard-coded into your program, so that they're the same every time.
  4. If you do this cleverly, you can avoid having to read in the entire file into memory and holding it in a big array. Saving on memory in this way will make your fingerprinter much faster for huge files, but it will require some clever coding.

Be sure to describe your program clearly in comments. Submit it according to the normal instructions.