Files

A file on a computer is simply a sequence of bytes, nothing more, nothing less. What imbues the file with meaning is how we (or a Program) choose to interpret those bytes. This is actually the same with language. The string "bow" has no intrinsic meaning. It's up to us to decide whether it means something you shoot with, something you put in a little girl's hair, bending over in a polite manner, or the forward part of a ship's hull. If each byte of a file corresponds to a printable ASCII character, we could choose to interpret the file as containing text, so such files are called text files. However, many text files have a more specific meaning. What you're reading right now is contained within a plain text file of a special type: an HTML file.
If you right-click in this browser window and select "view source" you can actually see the plain text file that produced this web page.
Many files are not plain text, but even though such files don't immediately mean something to us, there's usually some Program out there that understands how to interpret those bytes in a meaningful way.
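To make the "same bytes, different meanings" idea concrete, here is a minimal Python sketch. The four byte values and the helper name looks_like_text are made up for illustration; they are not part of the course material.

```python
# Hypothetical four-byte "file"; the values are chosen only for illustration.
data = bytes([0x48, 0x69, 0x21, 0x0A])


def looks_like_text(raw: bytes) -> bool:
    """Return True if every byte is printable ASCII or common whitespace."""
    allowed_whitespace = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return
    return all(32 <= b <= 126 or b in allowed_whitespace for b in raw)


# Three interpretations of exactly the same bytes.
as_text = data.decode("ascii")             # each byte read as an ASCII character
as_numbers = list(data)                    # each byte read as a number from 0 to 255
as_integer = int.from_bytes(data, "big")   # all four bytes read as one big-endian integer

print(repr(as_text))
print(as_numbers)
print(as_integer)
print(looks_like_text(data), looks_like_text(b"\x89PNG"))
```

None of these readings is the "true" one. Which is correct depends entirely on the format the file is supposed to follow.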

The rules that define how the bytes of a particular file are supposed to be interpreted are called a file format. We described the format of a plain text file in the previous paragraph. You might have heard of .jpg files (JPEG files). JPEG is a file format for images, and any file whose bytes conform to the JPG rules can be viewed as an image with the proper Program. So usually to use a file you need to know what kind it is (i.e. what format it follows) and what Program(s) to use to operate on that kind of file. Here are some common formats:

One of the most important file types is one you might not have thought of: a Program. A Program is a regular old file whose bytes can be interpreted by the computer as instructions to be executed.

The Role of File Extensions

Windows by default hides the file extension from you. For this course (and life) you really want to see this information. Follow these instructions to turn off the extension hiding, so you actually know the true names of files.
  1. Under the Windows Start button, click on File Explorer.
  2. Under the 'View' menu, make sure the 'File Name Extensions' box is checked.
  3. You may also want to show hidden files and folders: check the box for 'Hidden Items' too.
Why is hiding file extensions bad?
Well, aside from making it hard for you to know what a file's real name is, hidden file extensions can be used to trick people. Check out #1 on this list of ways to trick users into executing malicious programs.

Filenames typically (by tradition) end in a '.' followed by three letters — like fact.jpg ends in .jpg. This last part of the filename is called the file extension. The operating system (Windows) and many programs trust the extension to tell them the file type, and thus choose, for example, what Program to use when opening the file. However, this trust is misplaced. The extension does not tell you the file type reliably. Try this:

  1. Right-click on this link and choose 'Save link as'. Under This PC, save the link to the desktop.
  2. Go to Start --> File Explorer --> Desktop and find the file CSL.png (which is an image). Right-click on it and choose rename to change its name to CSL.doc. You will have to scroll all the way to the right to rename the .png portion to .doc (the .doc extension is for MS Word). Answer yes to the "are you sure you want to change it" dialog box. Notice how the icon changed. Windows thinks this is a Word document now.
  3. Double-click on CSL.doc to open it up. Windows will try to open it with Word. What happens?
The moral of the story is that extensions can lie! The only real way to know the type of a file is to examine its bytes and see what format it's in. Usually, the first few bytes tell you the type reliably, as you saw in the activity above.
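The renaming experiment can be mirrored in a few lines of Python. This is only a sketch: the names sniff_type and extension_lies and both tables are illustrative assumptions that cover a handful of formats. It also assumes ".doc" means the legacy MS Word format; newer .docx files are ZIP archives and start with different bytes.

```python
from pathlib import PurePath
from typing import Optional

# A few well-known leading byte sequences; real tools know many more.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"%PDF", "pdf"),
    (b"PK\x03\x04", "zip"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy MS Office files such as .doc
]

# What each extension claims the content is (illustrative, not exhaustive).
CLAIMED_BY_EXTENSION = {
    ".png": "png", ".jpg": "jpeg", ".jpeg": "jpeg",
    ".pdf": "pdf", ".zip": "zip", ".doc": "ole2",
}


def sniff_type(data: bytes) -> Optional[str]:
    """Guess the file type from its first bytes, or None if unrecognized."""
    for signature, kind in SIGNATURES:
        if data.startswith(signature):
            return kind
    return None


def extension_lies(filename: str, data: bytes) -> bool:
    """True when the extension's claim and the leading bytes disagree."""
    claimed = CLAIMED_BY_EXTENSION.get(PurePath(filename).suffix.lower())
    actual = sniff_type(data)
    return claimed is not None and actual is not None and claimed != actual


# Stand-in for the start of a PNG image: the 8-byte PNG signature plus padding.
png_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
print(sniff_type(png_bytes))
print(extension_lies("CSL.png", png_bytes))
print(extension_lies("CSL.doc", png_bytes))
```

The bytes never change when the file is renamed, so the sniffed type stays the same while the extension's claim flips.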

Here's a common example of playing games with file extensions. The mail server here at USNA won't let you send a zip file. Any .zip attachment just mysteriously disappears. In fact, the server only looks at the file name, not at the bytes that make up the file. So you can simply rename the file, say changing foo.zip into foo.piz, and then attach it. The file will be sent, no problem, and the recipient merely needs to change the extension back to .zip after saving it. So, don't believe what file extensions tell you!

How many bytes does it take ...

File headers

Because files have a format, there must be some parts of the format that are the same for all files of the same type. For instance, something has to be common to all PDF files. One common feature across file formats is a header. A file header is a short sequence of data at the head, or beginning, of the file data. This can readily be recognized when viewing a file in a hex editor. In general you really need a hex editor, not just a text editor, because for many non-text files the header contains some non-printable characters. For instance, a PDF file always has a header of %PDF, which in hex is 25 50 44 46. Below are some common headers (in hex). Open a few files on your computer and see if you can corroborate this information. Headers are many different lengths, and one particular file type may have multiple valid headers. Remember, it's easy to lie with extensions, but quite hard to lie with headers!
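If no hex editor is at hand, a short Python sketch can show the first bytes of a file in the same style. The function names (read_header, hex_header, ascii_column) are made up for this example, and any path passed to read_header is whatever file you choose to inspect.

```python
def read_header(path: str, count: int = 8) -> bytes:
    """Read the first `count` bytes of the file at `path` in binary mode."""
    with open(path, "rb") as f:
        return f.read(count)


def hex_header(data: bytes, count: int = 8) -> str:
    """Format the first `count` bytes as two-digit hex values, like a hex editor."""
    return " ".join(f"{b:02X}" for b in data[:count])


def ascii_column(data: bytes, count: int = 8) -> str:
    """Show printable ASCII bytes as characters and everything else as '.'."""
    return "".join(chr(b) if 32 <= b <= 126 else "." for b in data[:count])


# In-memory stand-ins for the start of a PDF file and a PNG file.
pdf_start = b"%PDF-1.7\n"
png_start = b"\x89PNG\r\n\x1a\n"

print(hex_header(pdf_start, 4), ascii_column(pdf_start, 4))
print(hex_header(png_start), ascii_column(png_start))
```

To try it on a real file, call hex_header(read_header(path)) with a path of your own. The dots in the second column are the non-printable bytes that a plain text editor cannot display sensibly.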

Follow this link to an activity that should help you to understand: 1. that files really are just a bunch of bits/bytes, 2. that changing the bits in a file changes what happens when the file is opened with the appropriate program, and 3. that since many file formats have rules about what bytes a file starts with, you can often determine the type of a file by examining the first few bytes. We'll see that this can be important!

"GIFAR" Files

You can play games with file format rules, sometimes for unsavory purposes. One interesting example is the gifar. Basically, we can create a single file (sequence of bytes) that satisfies the formatting rules both for an image format and for an "archive" format. For instance, we can create a single file that is both a valid .jpg image file and a valid Java .jar (a file that's intended to be processed by the Program "java"). The gist is that the first part of the file is the JPG image, and the second part is the Java jar file. This works because a JPG file must have a JPG header as its first several bytes, and must have a JPG footer indicating the end of the image data, but not necessarily the end of the file. There can be more bytes after the JPG footer, and JPG viewers typically ignore them. Meanwhile, Java processes a jar file starting with the bytes at the back end of the file. These bytes act as a sort of "table of contents" that tells Java how far back toward the start of the file to jump for other pieces of Java-specific data. The table of contents in a gifar never points Java back as far as the JPG footer, so Java never looks at the image bytes.
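The layout described here can be demonstrated harmlessly in Python, because a jar file is a ZIP archive and Python's zipfile module also reads the "table of contents" from the end of the file. The sketch below is an illustration under stated assumptions, not a real gifar: the "image" is a placeholder that merely starts and ends with the JPG header and footer bytes, the archive holds one harmless text member, and the name make_two_format_file is made up.

```python
import io
import zipfile

JPG_HEADER = b"\xff\xd8\xff"  # bytes every JPG file starts with
JPG_FOOTER = b"\xff\xd9"      # marker for the end of the image data


def make_two_format_file(image_bytes: bytes, member_name: str, member_text: str) -> bytes:
    """Return image_bytes followed by a ZIP archive holding one text member."""
    buffer = io.BytesIO()
    buffer.write(image_bytes)
    # Mode "a" on data that is not yet a ZIP archive appends a new archive
    # after the existing bytes, so the archive's offsets skip the image part.
    with zipfile.ZipFile(buffer, "a") as archive:
        archive.writestr(member_name, member_text)
    return buffer.getvalue()


# Placeholder, not a viewable picture: only the header and footer bytes are real.
fake_image = JPG_HEADER + b"placeholder image data" + JPG_FOOTER
combined = make_two_format_file(fake_image, "hello.txt", "harmless text")

# A check that reads only the first bytes sees a JPG ...
print(combined.startswith(JPG_HEADER))
# ... while a reader that starts from the end of the file sees an archive.
print(zipfile.is_zipfile(io.BytesIO(combined)))
with zipfile.ZipFile(io.BytesIO(combined)) as archive:
    print(archive.namelist())
```

Both checks succeed on the same bytes, which is why looking only at the start of an uploaded file may not be enough to decide what it is.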

You might ask, "What's the point?" Java jar files can instruct the Java Program to do seriously bad things to your computer; they can really be evil. JPG image files, on the other hand, are pretty benign. So websites that allow users to post content will often allow JPG image files to be posted, but definitely not Java jar files. What the bad guys figured out is that by posting a gifar, they could post files to these websites that the websites thought were innocuous JPG image files (and so would allow to be posted), but which were also malicious Java jar files.