
Class 2: The Unix philosophy and "the shell" - intro to bash


Reading
Much of what's in the notes below is covered by pp. 255-263 and pp. 277-291 in A Practical Guide to Linux. Pay special attention to the Keyword Variables section starting on p.283.

Homework
Print out the Homework and answer the questions on that paper.


Programs vs. processes
You are already familiar with the standard Unix utility more, which prints the contents of the files you list on the command line, one after the other, a screenful at a time, so you can page through everything. In fact, more is a program like any other, and if you enter "which more" in the shell, you'll find the exact location of the executable file that is the program more.
A program is simply an executable file.
Now, if we open three terminal windows on our Unix machine, and enter "more /etc/passwd" into the first and "more /usr/include/math.h" into the second, we'll see that we have two mores executing simultaneously on the machine. Since there's only one program more, we need a name for these executing instances of programs, of which there may be several of the same program at the same time.
An executing instance of a program is called a process.
There is another standard Unix utility ps that lists the processes currently executing on a machine. If in the third terminal window you type
ps -u yourusername
	
you'll see all the processes belonging to you, and included in that list ought to be the two more processes we still have executing in the first and second windows. We'll talk a lot about processes in this course, but for now it suffices to know that a process is an executing instance of a program, and that each process has its own argv, stdin, stdout and stderr.

These four things that every process has --- argv, stdin, stdout, stderr --- are really important. You'll start to see why today, but as the semester rolls along you'll see it more and more.

What is "the shell"?
The shell is an interpreter (like a calculator) that processes commands. For the most part, those commands just tell the shell to execute a program which, as we now know, means creating a process and making sure that its argv and standard in/out/error are all set up correctly. When you give the command
$ rm tmp.txt junk1 junk2
the shell starts a new process (usually we say it spawns or forks a process) that is an executing instance of /bin/rm, setting argv like this:
argv[0] = "rm"
argv[1] = "tmp.txt"
argv[2] = "junk1"
argv[3] = "junk2"
... tying standard in to the keyboard (which is irrelevant here), standard out to the screen (which is irrelevant here), and standard error to the screen -- which means that if there are errors, like file tmp.txt not existing, the error messages get printed on the screen. The shell waits for that process to finish executing before prompting the user for another command. When you give the command
$ emacs &
the shell spawns a new process that is an executing instance of emacs and, because we followed the command with an "&", immediately prompts us for another command, without waiting for the emacs process to terminate. So, you see, the shell is really just a program that makes it easy to create new processes.
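The effect of "&" is easy to see for yourself; here sleep stands in for emacs (a sketch --- any long-running program would do):

```shell
# & puts sleep in the background, so the shell doesn't wait for it:
sleep 2 &
echo "shell is ready for the next command"   # printed immediately
wait                                         # block until the background sleep finishes
echo "background process done"
```

The first echo runs right away; without the "&", the shell would sit for two seconds before running it.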

The shell is just another program and, in fact, though we speak of "the shell", there are many different shells from which to choose:

$ cat /etc/shells  ← This file may not exist on your machine
/bin/sh                         Bourne shell
/bin/csh			C shell
/bin/ksh			Korn shell
/bin/jsh			job control shell
/bin/rsh			remote shell
/bin/tcsh			TENEX C shell
/bin/bash			Bourne Again shell
Each one of these is a different shell (command interpreter) offering different twists on the basic shell functionality. We're using bash. Others may use a different shell. Many of the faculty use tcsh.

The Unix Philosophy
The Unix philosophy starts with this: small programs that do one thing (and do it well!), that read and write barebones text (no prompting for data, no cute formatting or messages), where input comes from stdin, output goes to stdout, and error messages go to stderr. Looking at a simple C++ program --- prog1.cpp --- we can see quite clearly what parts of the program correspond to standard in, standard out, standard error, return codes, etc.
Assuming this program is compiled into something called prog1:
bear[~]> prog1
3 2 7 9
ctrl-d ← this simulates "end of file" from the keyboard
5.25
		
Hopefully you see that this program simply returns the average of the numbers it reads in. Notice that it follows the "Unix Philosophy" of a small program that does one thing: no prompting for input, no formatting of output, no cute messages.
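In fact, prog1's job is so classically Unix that the standard utility awk can do the same thing (a sketch, not the course's prog1 --- but it reads stdin and writes stdout in exactly the same spirit):

```shell
# Average the whitespace-separated numbers on stdin, like prog1 does.
printf '3 2 7 9\n' | awk '{ for (i = 1; i <= NF; i++) { s += $i; n++ } }
                          END { print s / n }'
# prints 5.25
```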

The Unix Philosophy then continues like this: small tools are tied together by the user to do whatever job needs doing. Let's see how this can work. Suppose I want to know the average size (in bytes) of all the files in the current directory. A simple ls -l writes (to stdout) all the file names and their sizes ... and a whole bunch of extra information! We only want the file sizes. The file sizes all appear within the 30th through 40th characters on each line of the ls -l output. So we want to cut out characters 30-40 of each line. There's an app for that, or rather there's a Unix utility for that: it's called cut. In particular, cut -c30-40 reads from stdin and for each line writes (to stdout) only the characters in positions 30-40. So, to get just the file sizes, we'd like to run ls -l and have its stdout go into cut's stdin rather than to the screen. Unix provides a nice mechanism for doing exactly that: piping the stdout from one program to the stdin of another. It's called a "pipe", and the "|" symbol on the command line denotes a pipe. So

ls -l | cut -c30-40
	  
is like issuing two separate command-lines, but they are tied together by piping the stdout of the first to the stdin of the second. This one idea lets users easily combine utilities in an infinite number of ways to solve new problems. Even better, our super-simple-program plays nice in this world!

bear[~]> ls -l
total 606
-rw-------   1 wcbrown  scs        21233 Jan 13  2010 Class.html
-rwx------   1 wcbrown  scs         5820 Jan  8 11:43 ev*
-rw-------   1 wcbrown  scs          281 Jan  8 11:43 ev.c
-rw-------   1 wcbrown  scs         1830 Jan  8 11:43 ev.c.html
-rw-------   1 wcbrown  scs         3144 Jan  8 11:43 nflqb2008.txt
-rwx------   1 wcbrown  scs        11052 Jan 12 09:14 prog1*
-rwx------   1 wcbrown  scs       249168 Jan 13 09:07 prog1annotated.png*
-rw-------   1 wcbrown  scs          400 Jan 12 09:13 prog1.cpp
-rw-------   1 wcbrown  scs           30 Jan  8 11:43 wwww
bear[~]> ls -l | cut -c30-40 

      21233
       5820
        281
       1830
       3144
      11052
     249168
        400
         30
bear[~]> ls -l | cut -c30-40 | prog1
32550.9

So, now we know what the average file size is in my directory! It's important to understand this command line. Normally, what you enter on the command line is a request to execute a program (i.e. create a process). This is actually three such commands all issued on the same line, with the "|" separating them. Since each process (as we know) has argv, stdin, stdout, stderr, we can describe what the command line "ls -l | cut -c30-40 | prog1" means like this: ls runs with argv = [ls, -l] and its stdout tied to cut's stdin; cut runs with argv = [cut, -c30-40], its stdin tied to ls's stdout and its stdout tied to prog1's stdin; and prog1 runs with its stdin tied to cut's stdout and its stdout tied to the screen. All three processes have their stderr tied to the screen.

Notice how the regular output of ls is never seen (why?) but error messages are (why?).
bear[~]> ls -z | cut -c30-40 | ./prog1
ls: illegal option -- z
usage: ls -aAbcCdeEfFghHilLmnopqrRsStuxvV1@/[c | v]%[atime | crtime | ctime | mtime | all] [files]
Error!  No data entered!
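The answer to both "why?" questions is the same: the pipe carries only stdout, while stderr still points at the terminal. You can verify the two streams really are separate using 2>, which redirects stderr to a file (a peek ahead --- redirection is described in the next section):

```shell
# stdout and stderr are separate streams: > captures one, 2> the other.
# (|| true just ignores ls's failure so the sketch runs cleanly.)
ls /no/such/file > out.txt 2> err.txt || true
cat out.txt        # empty -- ls listed nothing
cat err.txt        # ls's complaint about /no/such/file
```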
  

Another crucial part of the Unix Philosophy is that modifying the behavior of utilities is done with command-line options (the argv elements) rather than via stdin. Without this separation -- data comes in via stdin, behavior modification is done via command-line options -- the whole concept of combining tools via pipes breaks down. For instance, we used the command-line option -c30-40 with cut. If that modification had to be done via stdin we'd be stuck, because stdin is coming from ls, not the keyboard, and ls isn't going to write out -c30-40. Moreover, what if ls did write out stuff like that, but we didn't want it interpreted as a request to change cut's behavior?

Special shell symbols
There are some symbols that have special meaning in the shell interpreter's language (Note: the following examples use the wc utility which, with the -l option, returns the number of lines in a file):

; separate a sequence of commands
> cd ; ls
| "pipe" - connects stdout of one program to stdin of another
> cat /etc/passwd | wc -l
< > file redirection
> redirects stdout to a file, < redirects stdin to come from a file. If file foo already exists, > foo overwrites it; >> foo appends new text to the end of the existing file.
> cat /etc/passwd > foo.txt
> ls
foo.txt
> wc -l < foo.txt
218	    
$? gives value "returned" by the last shell program run.
Every program returns a value (the return statement in main for C/C++ programs), and $? lets you see what it was for the previously executed program. Typically, a non-zero return value is an error indicator.
& run program in the background
The & acts as a separator like ; with the difference that the program before the & is run "in the background", meaning that the shell immediately retakes control of terminal input and prompts the user for another command.
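A quick tour that exercises most of the symbols above in one sitting (foo.txt is a made-up file name):

```shell
printf 'a\nb\nc\n' > foo.txt          # > sends printf's stdout into foo.txt
wc -l < foo.txt                       # < feeds foo.txt to wc's stdin: prints 3
grep b foo.txt ; echo done            # ; runs the two commands in sequence
grep zzz foo.txt && echo "found zzz"  # no match, so grep returns non-zero
echo $?                               # prints 1, the value grep "returned"
```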

Some useful Unix utilities and shell commands
echo	built in to the shell, copies its arguments to stdout

	$ echo foo bar
	foo bar
	$ echo *
	foo foo.c
	$ echo "The rain in Spain stays maaaainly on the plain" > eliza
	$ ls
	foo foo.c eliza 

cat	concatenate and display

	cat [options] [file1 ...]

	$ cat -n foo.c
	1  // foo.c
	2  int main()
	3  {
	4    return 80;
	5  }

	$ cat foo.c eliza
	// foo.c
	int main()
	{
	  return 80;
	}
	The rain in Spain stays maaaainly on the plain 

	$ cat eliza foo.c eliza foo.c eliza > junk
	$ ls
	foo foo.c eliza junk

head	display the first few lines of a file

	head [-number | -n number] [file1 ...]

	$ head -3 foo.c junk
	==> foo.c <==
	// foo.c
	int main()
	{

	==> junk <==
	The rain in Spain stays maaaainly on the plain
	// foo.c
	int main()

tail    display the last few lines of the file

        This is like the reverse of head.  So 

        tail -3 foo.txt    ← writes the last 3 lines of foo.txt

        tail +3 foo.txt    ← writes every line from the 3rd to the last (modern systems spell this tail -n +3)

        tail -r foo.txt    ← writes foo.txt in reverse
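        head and tail compose nicely via a pipe; for example, pulling out the middle of a file (nums.txt is a made-up file):

```shell
printf '1\n2\n3\n4\n5\n6\n' > nums.txt
head -5 nums.txt | tail -3      # lines 3 through 5: prints 3, 4, 5
```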

grep	search a file for a pattern
	('horizontal' cuts through a file)

	(many options!)

	$ grep -n Spain junk
        1:The rain in Spain stays maaaainly on the plain
        7:The rain in Spain stays maaaainly on the plain
        13:The rain in Spain stays maaaainly on the plain

        grep -v pattern file ← lines that DO NOT match pattern

tr	translate characters

	tr [options] str1 str2	(loosely: see the man page)

		-d delete all occurrences of characters in str1
		-s replace repeated occurrences of characters in str1
		   with a single character

	$ cat eliza | tr -d a
	The rin in Spin stys minly on the plin
	$ cat eliza | tr -s a
	The rain in Spain stays mainly on the plain
	$ cat eliza | tr -s a | tr '[a-z]' '[A-Z]' > caps
	$ cat caps
	THE RAIN IN SPAIN STAYS MAINLY ON THE PLAIN
	$ cat caps | tr " " ":" > delim
	$ cat delim
	THE:RAIN:IN:SPAIN:STAYS:MAINLY:ON:THE:PLAIN

cut	cut out selected fields of each line of a file
	('vertical' cuts through a file)

	cut [options] [file ...] 

	$ head -1 nflqb2008.txt
	Name   Team    G       QBRat ... etc
	$ grep -v Name nflqb2008.txt | cut -f1 > names
	$ grep -v Name nflqb2008.txt | cut -f2 > teams
	$ grep -v Name nflqb2008.txt | cut -f4 > ratings

paste	merge lines of files

	paste [-s] [-d list] file ...

	$ paste names teams ratings > stats

sort	sort files

	$ sort stats
	$ sort -k2 stats
	$ sort -r -k2 stats
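	The cut/paste/sort commands above fit together into one workflow; here is a self-contained miniature with made-up data (cut -f splits on tabs, its default delimiter):

```shell
printf 'Name\tTeam\tQBRat\n'   > qb.txt
printf 'Brees\tNO\t96.2\n'    >> qb.txt
printf 'Manning\tIND\t95.0\n' >> qb.txt
grep -v Name qb.txt | cut -f1 > names      # just the names
grep -v Name qb.txt | cut -f3 > ratings    # just the ratings
paste names ratings | sort -r -k2          # best rating first
```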

wc      counts lines, characters and words in a file or stdin.
  
        $ wc -l nflqb2008.txt
          34 nflqb2008.txt

Shell command history
Arrow keys (UP/DN) recall previous commands. The history size can be set by the user (more on that later).

history		print a numbered list of past commands
!!		execute the most recent command (! is called a bang)
!number		execute the given numbered command from the history list
!string		execute the most recent command starting with 'string'

Shell variables and environment variables
The shell allows variables to be defined. In fact, the history size mentioned in the previous section is controlled by a shell variable (HISTSIZE in bash). All variables are referenced this way: the $ prefixed to the variable name gets the variable's value, e.g. $foo. Optionally, you can wrap the variable's name in ${ }, e.g. ${foo}. Note: in programming language circles, the "$" is referred to as a sigil.

Variables are given values with =, with no spaces allowed on either side of the =.

bear[~]> foo=twain
bear[~]> echo foo ← Oops, forgot the $
foo
bear[~]> echo $foo
twain
bear[~]> echo $foobar ← Undefined variable

bear[~]>
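The ${ } form matters when a variable name butts up against other text (animal is a made-up variable):

```shell
animal=fox
echo "$animal hunt"        # fox hunt
echo "${animal}es jump"    # foxes jump: the braces end the name at 'animal'
echo "$animales jump"      # the shell looks up the undefined variable $animales,
                           # so only " jump" comes out
```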

Every process has a list of environment variables, the values of which can be read and written by the process. Normal shell variables are "local" by default, meaning they're defined in the current shell session, but will not exist as environment variables in processes that are spawned by the shell. However, if you export a variable, it becomes global --- meaning that the variable will be an environment variable in the shell and in all processes spawned by the shell.

bear[~]> foo=twain
bear[~]> export foo ← Can also combine as export foo=twain
There are several very important environment variables that are defined when you log in, including HOSTNAME, the name of the computer you're on; USER, your user name; and, most importantly, PATH.
bear[~]> echo $HOSTNAME
mich301csdbrownu
bear[~]> echo $USER
wcbrown
bear[~]> echo $PATH
/opt/csw/bin:/usr/local/bin:/usr/bin:/bin:.

The PATH variable is the key to how the shell determines what program to execute when you enter a command line and don't give a full or relative path to the program you want, but rather just the program's name. For example, when you enter emacs at the command prompt (assuming the PATH above) the shell looks for a file emacs in /opt/csw/bin, doesn't find one, then tries /usr/local/bin, finds a file called emacs, and executes it. If I have a program called foo in the current working directory (and it appears nowhere else) and I enter foo at the command prompt, the shell will execute that foo file, because "." is in the PATH and it means "current directory". If "." were not in the path, I'd have to give a path to foo, so I'd write ./foo instead. If you want to know where in PATH the shell finds a certain command, use the which command:

bear[1] [~/]> which emacs
/usr/local/bin/emacs
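The PATH search can even be simulated by hand, which makes the mechanism concrete (a sketch; the real shell also caches lookups):

```shell
# Walk PATH's colon-separated directories looking for an executable 'ls',
# just as the shell does when you type a bare command name.
name=ls
old_IFS=$IFS
IFS=:                          # split $PATH on colons
for dir in $PATH; do
  if [ -x "$dir/$name" ]; then
    echo "$dir/$name"          # the same answer which would give
    break
  fi
done
IFS=$old_IFS
```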

Putting "." in the PATH is considered a security risk: if someone plants a malicious program named like a common command (say, ls) in a directory, you'll run it just by cd-ing there and typing that command.