"Cyber Space" is a name for this global information system
consisting of many familiar pieces — like web sites,
distributed video games, email — and many less familiar
pieces as well, that work behind the scenes. All of these
systems do little more than process, store and retrieve
digital data. So to even begin to understand Cyber
Space, we need to get a handle on what "digital data" is all
about. That's what this lesson will do.
Submarine Radio Communications
Ballistic missile submarines
remain undetected beneath the ocean's surface awaiting the order
to launch their missiles. The launch order must be sent via radio
transmission, but sea water blocks those radio waves typically used
with satellites or for long-range radio because of their high frequencies.
For submarines, very low frequency (VLF) radio waves must be used (3-30kHz)
to penetrate the ocean and reach the submarine's VLF antenna.
Communicating with submarines while they are completely submerged comes at a
cost: VLF radio waves have a severely limited capacity for
carrying data.
VLF data transmission rates are around 300 bps. Compare that with a
data transmission rate of 10 Mbps for a 4G wireless phone. Your smart
phone is 33,000 times faster than VLF! In other words, it would take
2 hours and 47 minutes to download one MP3 song using submarine VLF
communications, where it would only take 0.3 seconds using your 4G
phone.
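The arithmetic behind these figures can be checked with a short Python sketch. The song size of 375,750 bytes is an assumption, back-computed from the 2 hours 47 minutes figure (the lesson does not state it), and the function name is purely illustrative.

```python
VLF_BPS = 300            # VLF rate quoted above, in bits per second
PHONE_BPS = 10_000_000   # 4G rate quoted above (10 Mbps), in bits per second
SONG_BYTES = 375_750     # assumed MP3 size; not stated in the lesson

def download_seconds(size_bytes: int, bits_per_second: int) -> float:
    """Seconds needed to move size_bytes over a link of the given speed."""
    return size_bytes * 8 / bits_per_second  # 8 bits per byte

speedup = PHONE_BPS / VLF_BPS
vlf_seconds = download_seconds(SONG_BYTES, VLF_BPS)
phone_seconds = download_seconds(SONG_BYTES, PHONE_BPS)

hours, remainder = divmod(int(vlf_seconds), 3600)
print(f"phone is about {speedup:,.0f} times faster")
print(f"VLF: {hours} h {remainder // 60} min, phone: {phone_seconds:.1f} s")
```

Note that the rates are given in bits per second while file sizes are usually given in bytes, which is why the factor of 8 appears.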
(Image courtesy of
Jim Hawkins)
This is a picture of the VLF antenna array that used to be at
Greenbury Point. The three small antennas you see today are
all that's left. The rest were pulled down in the late 90's.
Bits and Bytes
Digital data consists solely of 0's and 1's. An individual 0 or
1 value is called a bit. So to represent a piece of
information, you need to be able to express that information as
a sequence of 0's and 1's. For the remainder of this lesson,
we'll explore how this is done for many different kinds of
information. First, however, there's a practical issue to take
care of. All computational devices group bits into chunks of
eight, and that's usually the smallest unit of data they
actually operate on. An 8-bit chunk is called a byte.
The difference between bit and byte is really important.
A computer is typically capable of storing and processing an
immense number of bits and bytes. So we often speak of
kilo, mega, giga and tera bytes or bits. What do those mean?
Normally kilo means thousand, mega means million, giga means
billion, and tera means trillion, and that's approximately
true in the context of digital data, but not exactly. In the
context of digital data:
kilo = $2^{10}$ ≈ 1,000
mega = $2^{20}$ ≈ 1,000,000
giga = $2^{30}$ ≈ 1,000,000,000
tera = $2^{40}$ ≈ 1,000,000,000,000
... so "megabyte" means $2^{20}$ bytes,
which is $8 \times 2^{20} = 2^{23}$
bits. Finally, you often see these abbreviated as K=kilo,
M=mega, G=giga, T=tera and b=bit and B=byte. So, Gb means
"gigabit" whereas GB means "gigabyte", which is eight times as
many bits. In practice, it's not always easy to know whether the
"decimal" or "binary" interpretation of "kilo", "mega" etc. is
meant, especially in marketing material.
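As a rough illustration of the prefixes and abbreviations above, the following Python sketch computes how many bits a prefixed unit holds and how far the binary meaning of each prefix drifts from the decimal one. The dictionary and function names are made up for this example.

```python
BINARY_PREFIX = {"kilo": 2**10, "mega": 2**20, "giga": 2**30, "tera": 2**40}
DECIMAL_PREFIX = {"kilo": 10**3, "mega": 10**6, "giga": 10**9, "tera": 10**12}
BITS_PER_BYTE = 8

def bits_in(prefix: str, unit: str) -> int:
    """Bits in one prefixed unit; unit is 'b' for bit or 'B' for byte."""
    per_unit = BITS_PER_BYTE if unit == "B" else 1
    return BINARY_PREFIX[prefix] * per_unit

# A megabyte is 2^20 bytes, i.e. 2^23 bits.
print(bits_in("mega", "B") == 2**23)

# The gap between the two interpretations grows with the prefix.
for name in BINARY_PREFIX:
    gap = BINARY_PREFIX[name] / DECIMAL_PREFIX[name] - 1
    print(f"{name}: binary value is {gap:.1%} larger than decimal")
```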
"There are 10 kinds of people: those who
know binary, and those who don't."
On the face of it, it's pretty amazing that all information can
be somehow expressed as sequences of bits. Actually though,
it's all possible because numbers can be expressed as
sequences of bits.
A number expressed as a sequence of 0's and 1's is called a
binary number, and the idea is no different from how we use
sequences of decimal digits to represent numbers. Recall how
that works: When we write 467 we mean
$4 \times 10^2 + 6 \times 10^1 + 7 \times 10^0$.
Now, in a binary number we only allow bits as digits, and
instead of powers of 10, we have powers of 2. So in binary,
1101 means
$1 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0$
which is 13 in decimal.
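The same positional rule can be written as a small Python sketch. The function name is illustrative; Python's built-in int(bits, 2) does the same job.

```python
def binary_to_decimal(bits: str) -> int:
    """Add up bit * 2^position, counting positions from the right."""
    total = 0
    for position, bit in enumerate(reversed(bits)):
        total += int(bit) * 2**position
    return total

print(binary_to_decimal("1101"))  # 13
```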
Numbers of any size can be represented by sequences of 0's and
1's, though larger numbers require longer sequences.
In fact, it's easy to compute how many
bits you need to represent a number of a specific size. With k
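bits, you can represent any number from 0 up to and including $2^k - 1$.
To represent a positive integer N, you need $1 + \lfloor \log_2 N \rfloor$ bits
(the logarithm is rounded down). For example, 13 needs 1 + 3 = 4 bits: 1101.
In a byte, i.e. eight bits, we can represent numbers up to
$2^8 - 1 = 256 - 1 = 255$.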
Because of the importance of bytes, we will concentrate on being
able to write numbers as 8-bit sequences, and being able to
interpret an 8-bit sequence as a number. The smallest number we
can represent in 8-bits is 0, which is the byte 00000000. The
largest is 255 which, in binary, is 11111111. Of course,
anything in between is possible as well.
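A minimal Python sketch of these ideas follows: counting the bits a number needs, and converting between a number and its 8-bit form. The function names are invented for illustration; bit_length and format are standard Python.

```python
def bits_needed(n: int) -> int:
    """Bits needed to write a positive integer n in binary,
    which equals 1 + floor(log2(n))."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return n.bit_length()

def to_byte(n: int) -> str:
    """Write n as exactly eight bits, padding with leading zeros."""
    if not 0 <= n <= 255:
        raise ValueError("a byte holds only 0 through 255")
    return format(n, "08b")

def from_byte(bits: str) -> int:
    """Interpret an 8-bit string as a number."""
    return int(bits, 2)

print(bits_needed(255), bits_needed(256))  # 8 9
print(to_byte(0), to_byte(255))            # 00000000 11111111
```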
Videos showing how to convert from binary
to decimal and back
Hexadecimal
Bytes are all-important in computing,
and after a while it becomes cumbersome
to write out all eight bits of a byte.
So we often write out bytes as two hexadecimal digits.
Hexadecimal is actually the base 16 number system, but for our
purposes that is irrelevant. The important point is that it
gives us a concise representation for bytes, since each hex
digit represents a 4-bit pattern. Thus two hex-digits represent
an 8-bit pattern, i.e. a byte. The following table gives the
mapping between the hex digits (0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f)
and 4-bit patterns.
hex digit   4-bit pattern
0           0000
1           0001
2           0010
3           0011
4           0100
5           0101
6           0110
7           0111
8           1000
9           1001
a           1010
b           1011
c           1100
d           1101
e           1110
f           1111
Using this table, you should be able to convert 3cf6 into a byte
sequence, and convert the byte 01101110 into two hex digits.
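The table lookup can also be sketched in Python. The example values below are deliberately different from the exercise above, and the function names are illustrative.

```python
HEX_DIGITS = "0123456789abcdef"

def hex_to_bits(hex_string: str) -> str:
    """Replace each hex digit with its 4-bit pattern."""
    return "".join(format(HEX_DIGITS.index(d), "04b") for d in hex_string.lower())

def bits_to_hex(bits: str) -> str:
    """Replace each group of four bits with its hex digit."""
    if len(bits) % 4 != 0:
        raise ValueError("expected a multiple of 4 bits")
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(HEX_DIGITS[int(group, 2)] for group in groups)

print(hex_to_bits("a5"))        # 10100101
print(bits_to_hex("00011111"))  # 1f
```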
ASCII Encoding and Plain Text
Other than numbers, the most fundamental data is text. The
method for representing text digitally (i.e. as bits and bytes)
depends on the alphabet the text uses, of course. However, in
the cyber world, English is the base language and everything
else is an add-on. Convenient for us, eh? Plain text is
represented using one byte (i.e. one number in the range 0-255,
although in reality we only use 0-127)
for each character, where the characters allowed and the byte
values (i.e. numbers) they correspond to are given by
the ASCII Table.
So, for example, the letter a has ASCII value 97
which is byte 01100001. ASCII values 32-126 are
the printable characters, and any sequence of bytes
consisting solely of them is considered to be plain text.
We might allow the additional values
9 ← tab, 10 ← newline, 13 ← carriage return,
which provide limited formatting.
You can actually enter ASCII values into the address bar in
your browser, although you have to write them
in hexadecimal notation rather than decimal or
binary. (Hexadecimal is a base 16 (rather than 10 or 2) number
system, whose digits are 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f.)
For example, c has ASCII value 99 which is 63 in
hex, so a c can be written in the address bar as
%63. Thus, entering
%63nn.com in your browser's address bar
gets you to cnn.com! BTW: Firefox might have this turned off
by default, since there are security implications
with this.
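The %63 notation can be reproduced in Python. This sketch uses unquote from the standard urllib.parse module to decode such an address; the helper percent_encode_char is a made-up name for this example.

```python
from urllib.parse import unquote

def percent_encode_char(ch: str) -> str:
    """Write one character as '%' followed by its ASCII value in hex."""
    return f"%{ord(ch):02x}"

print(percent_encode_char("c"))  # %63
print(unquote("%63nn.com"))      # cnn.com
```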
A sequence of characters is called a string, and what
we've just seen is that ASCII gives us a way to encode strings
as sequences of bits (or, if you prefer, bytes).
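As a small sketch of this encoding in Python: the first function turns a string into its ASCII bytes written as bits, and the second checks whether a byte sequence counts as plain text under the rule given above (values 32-126, plus tab, newline and carriage return). Both function names are illustrative.

```python
def string_to_bits(text: str) -> list[str]:
    """One 8-bit pattern per character; fails on non-ASCII characters."""
    return [format(byte, "08b") for byte in text.encode("ascii")]

def is_plain_text(data: bytes) -> bool:
    """True if every byte is printable ASCII, tab, newline or carriage return."""
    allowed = set(range(32, 127)) | {9, 10, 13}
    return all(byte in allowed for byte in data)

print(string_to_bits("a"))              # ['01100001']
print(is_plain_text(b"hello\n"))        # True
print(is_plain_text(bytes([255, 216]))) # False
```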
Files
A file on a computer is simply a sequence of bytes,
nothing more, nothing less.
What imbues the file with meaning is how we (or a program)
choose to interpret those bytes. This is actually the same with
language. The string bow has no intrinsic meaning. It's
up to us to decide whether it means something you shoot with,
put in a little girl's hair, or whether it means bending over in
a polite manner.
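To make the point concrete, here is a short Python sketch that reads the same three bytes in three different ways. Nothing about the bytes themselves says which reading is the "right" one.

```python
data = bytes([0x62, 0x6f, 0x77])  # three bytes: 62 6f 77 in hex

as_text = data.decode("ascii")           # read as ASCII characters
as_number = int.from_bytes(data, "big")  # read as one 24-bit number
as_values = list(data)                   # read as three small numbers

print(as_text)    # bow
print(as_number)  # 6451063
print(as_values)  # [98, 111, 119]
```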
If each byte of a file corresponds to a printable ASCII
character, we could choose to interpret the file as containing
text, so such files are called text files. However,
many text files have a more specific meaning. What you're
reading right now is contained within a plain text file of a
special type: an HTML file.
If you right-click in this browser window and select "view
source" you can actually see the plain text file that produced
this pretty page.
Many files are not plain text, but even though such files don't
immediately mean something to us, there's usually some program
out there that understands how to interpret those bytes in a
meaningful way.
The rules that define how the bytes of a particular file are
supposed to be interpreted are called a file format.
We described the format of a plain text file earlier, in the
discussion of ASCII. You might have heard of .jpg files (JPEG files).
JPEG is a file format for images, and any file whose bytes
conform to the JPEG rules can be viewed as an image with the
proper program. So usually to use a file you need to know
what kind it is (i.e. what format it follows) and what program(s)
to use to operate on that kind of file. Here are some common
formats:
- plain text: contains text of course; open with Notepad or Notepad++; typical extension is .txt
- JPG: image file; open with Windows Photo Viewer, Photoshop, etc.; typical extension is .jpg or .JPG
- ZIP file: contains a bundled collection of files and folders
- PDF file: Portable Document Format; open with Adobe Acrobat Reader; typical extension is .pdf
- mp3 files: MPEG Layer 3 file; audio file using a specific compression algorithm; can be opened in many players, including iTunes and WinAmp; extension is typically .mp3
One of the most important file types is one you might not have
thought of: a program. A program is a regular old file
whose bytes can be interpreted by the physical computer as
instructions to be executed.
The Role of File Extensions
Filenames typically (by tradition) end in a '.' followed by
three letters, the way fact.jpg ends
in .jpg. This last part of the filename is called
the file extension.
Windows by default hides the file extension from you.
For this course (and life) you really want to see this
information. Follow these instructions to turn off the
extension hiding, so you actually know the true names of
files.
1. Under the Windows Start button, click on Computer.
2. Under the 'Organize' menu, select 'Folder and search options'.
3. Click on the 'View' tab.
4. Uncheck the box for 'Hide extensions for known file types'.
The Operating System (Windows) and
many programs trust the extension to tell them the file type,
and thus choose, for example, what program to use when opening
the file. However, this trust is misplaced. The extension does
not tell you the file type reliably. Try this:
1. Right-click on this link and choose to
save the link to the desktop.
2. Go to Start->Documents->Desktop and find the file
CSL.png (which is an image). Right-click on it and change its
name to CSL.doc (the .doc extension is for MS Word).
Notice how the icon changed. Windows thinks this is a Word
file now.
3. Double-click on CSL.doc to open it up. Windows will try to
open it with Word. What happens?
The moral of the story here is that extensions can lie! The
only real way to know the type of the file is to examine its
bytes and see what format it's in. Usually, the first few bytes
tell you the type reliably as you saw in the above activity.
Here's a common example of playing games with file
extensions. The mail server here at USNA won't let you send a
zip file. Any .zip attachment just mysteriously disappears.
In fact, the server only looks at the file name, not at the
bytes that make up the file. So you can simply rename the
file, say changing foo.zip into foo.piz, and then attach it.
The file will be sent, no problem, and the recipient merely
needs to change the extension back to .zip when saving it.
So: don't believe what file extensions tell you!
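The renaming trick works because a rename touches only the file's name, never its bytes. A minimal Python sketch of that fact, run in a temporary folder, is below. The function name is made up, and the four bytes 50 4b 03 04 are the usual start of a ZIP file (an assumption brought in for the example; this lesson does not list them).

```python
import tempfile
from pathlib import Path

def bytes_survive_rename(old_name: str, new_name: str, content: bytes) -> bool:
    """Write content under old_name, rename it, and compare the bytes."""
    with tempfile.TemporaryDirectory() as folder:
        original = Path(folder) / old_name
        renamed = Path(folder) / new_name
        original.write_bytes(content)
        original.rename(renamed)  # changes the name only
        return renamed.read_bytes() == content

zip_like = bytes.fromhex("504b0304") + b"rest of the archive"
print(bytes_survive_rename("foo.zip", "foo.piz", zip_like))  # True
```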
How many bytes does it take ...
to store the complete works of Shakespeare in plain text? 5,590,193 bytes
to store the complete Harry Potter series in plain text? 6,272,550 bytes
to store Beethoven's 9th Symphony? 62,948,072 bytes
to store a two-hour long High-Def movie? 4,294,967,296 bytes
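To get a feel for these sizes, the byte counts can be restated with the binary prefixes introduced earlier. This is a small Python sketch; the function name and the one-decimal rounding are choices made for the example.

```python
UNITS = [("GB", 2**30), ("MB", 2**20), ("KB", 2**10)]

def readable_size(num_bytes: int) -> str:
    """Express a byte count using the largest binary prefix that fits."""
    for label, size in UNITS:
        if num_bytes >= size:
            return f"{num_bytes / size:.1f} {label}"
    return f"{num_bytes} B"

for count in (5_590_193, 6_272_550, 62_948_072, 4_294_967_296):
    print(f"{count:>13,} bytes is about {readable_size(count)}")
```

The movie figure is exactly $2^{32}$ bytes, which is why it comes out as a round 4.0 GB.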
File headers
Because files have a format, there must be some parts of the format that
are the same for all files of the same type. For instance, something
has to be common to all pdf files. One common feature across file
formats is a header. A header is a short sequence of data at the head,
or beginning, of the file data. This can readily be recognized when
viewing a file in a hex editor. In general you really need a hex
editor, not just a text editor, because for many non-text files
the header contains some non-printable characters.
For instance, a pdf file always has a
header of %PDF, which in hex is 25 50 44 46.
Below are some common headers (in hex).
jpg: ff d8 ff e0 00 10
avi: 52 49 46 46 (this is actually
printable as RIFF)
doc: d0 cf 11 e0 a1 b1 1a e1 00 00
Open a few files on your computer and see if you can corroborate this
information. Headers are many different lengths, and one particular
file type may have multiple valid headers.
Remember, it's easy to lie with extensions, but quite hard to lie
with headers!
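Checking a header is easy to sketch in Python. The table below holds only the headers quoted in this section, so it is far from complete, and the function names are illustrative.

```python
from typing import Optional

# Headers quoted above, written as hex strings.
SIGNATURES = {
    "pdf": bytes.fromhex("25504446"),
    "jpg": bytes.fromhex("ffd8ffe00010"),
    "avi": bytes.fromhex("52494646"),
    "doc": bytes.fromhex("d0cf11e0a1b11ae10000"),
}

def sniff_type(data: bytes) -> Optional[str]:
    """Return the first type whose header matches the start of data."""
    for file_type, header in SIGNATURES.items():
        if data.startswith(header):
            return file_type
    return None

def sniff_file(path: str) -> Optional[str]:
    """Read just the first few bytes of a file and check them."""
    with open(path, "rb") as handle:
        return sniff_type(handle.read(16))

print(sniff_type(b"%PDF-1.4 and so on"))  # pdf
print(sniff_type(b"just some text"))      # None
```

Because one file type may have several valid headers, a real tool would keep more than one signature per type.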
Follow this link to an activity that should
help you to understand a) that files really are just a bunch
of bits/bytes, b) that changing the bits in a file changes what
happens when the file is opened with the appropriate program, and
c) that since many file formats have rules about what bytes a file
starts with, you can often determine the type of a file by
examining the first few bytes. We'll see that this can be
important!
"GIFAR" Files
You can play games with file format rules, sometimes for
unsavory purposes.
One interesting example is the "gifar" (the name combines GIF, an
image format, with JAR). Basically, we can create a
single file (sequence of bytes) that satisfies the formatting rules
both for an image format and for an "archive" format. For
instance, we can create a single file that is a valid .jpg image
file and a valid Java .jar (a file that's intended to be processed by
the program "java").
The gist is that the first part of the file is the jpg image, and the
second part is the Java jar file.
This works because a jpg file must have a jpg
header as its first several bytes, and must have a jpg footer
indicating the end of the image data, but not necessarily the end of
the file. There can be more bytes after the jpg footer, but any jpg
viewer simply ignores them. Meanwhile, Java processes a jar file
starting with the bytes at the back end of the file. These
bytes act as a sort of "table of contents" that tells Java how far
back toward the front of the file to jump for other pieces of Java-specific data.
The table of contents in a gifar never tells Java to look as far
back as the jpg footer or anything before it.
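The front-versus-back idea can be tried out harmlessly in Python. A jar file is a ZIP archive, and Python's standard zipfile module also locates the table of contents from the end of the file, so it stands in for Java here (that substitution is an assumption of this sketch). The "image" is just the jpg header from this lesson, some placeholder bytes, and ff d9, the JPEG end-of-image marker; it is not a real picture, and the archive holds only a text file.

```python
import io
import zipfile

JPG_HEADER = bytes.fromhex("ffd8ffe00010")
JPG_FOOTER = bytes.fromhex("ffd9")  # JPEG end-of-image marker

def build_polyglot() -> bytes:
    """Placeholder image bytes followed by a small ZIP archive."""
    fake_image = JPG_HEADER + b"placeholder, not real image data" + JPG_FOOTER
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("hello.txt", "harmless text")
    return fake_image + archive.getvalue()

combined = build_polyglot()

# Read from the front, the bytes start like a jpg ...
print(combined.startswith(JPG_HEADER))

# ... read from the back, the same bytes are a usable archive.
with zipfile.ZipFile(io.BytesIO(combined)) as zf:
    names = zf.namelist()
    contents = zf.read("hello.txt")
print(names, contents)
```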
You might ask: "what's the point?"
Java jar files can instruct the Java program to do seriously bad
things to your computer; they can really be evil. Jpg image
files, on the other hand, are pretty benign. So websites that allow
users to post content will often allow jpg image files to be posted,
but definitely not Java jar files. What the bad guys figured out
is that by posting a gifar, they could post files to these websites
that the websites thought were innocuous jpg image files (and so
would allow to be posted), but which
were also malicious Java jar files.