Unix Regular Expression Tutorial

Regular expressions are a powerful tool. We'll look at regular expressions primarily with the Unix egrep utility, but keep in mind that regular expressions are used everywhere: they're the backbone of Perl programming, for example, and a principal component of any tool that helps build interpreters or compilers.

The egrep utility takes a regular expression and a file and prints out all lines in the file that contain a string that can be generated by the regular expression. So, from the command line you type:

egrep 'regexp' filename

The regular expression regexp, which we'll discuss in a second, is usually surrounded by ' ' to keep your shell from interpreting and modifying anything. If you are interested in a more complete (though possibly incomprehensible) description of regular expressions in Unix, you can type man -s5 regexp at the command prompt.

The simplest example
The simplest regular expression is just a literal string, so let's look at an example like this:

FILE: data
therainin
spainfallsmainly
ontheplain

COMMAND AND RESULT
[~/]> egrep 'fall' data
spainfallsmainly
[~/]> 
	    

What's going on here? Well, the regular expression is fall (i.e. f concatenated with a concatenated with ...) and egrep printed out the only line that contained an occurrence of that regular expression. Of course, matching a literal string like this is not too interesting, though often quite powerful.
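
Because the match can occur anywhere in a line, a short literal may select several lines at once. The following is a minimal sketch, not part of the original examples. It assumes grep -E, which accepts the same expressions as egrep, and a hypothetical helper function named sample that pipes the three lines of the file data in on standard input (grep reads standard input when no file name is given).

```shell
# Hypothetical helper: prints the same three lines as the file "data".
sample() { printf '%s\n' 'therainin' 'spainfallsmainly' 'ontheplain'; }

# "ain" occurs inside rain, spain, and plain, so every line is printed.
sample | grep -E 'ain'

# -c prints the number of matching lines instead of the lines themselves.
sample | grep -E -c 'ain'
```

The second command is handy when you only want to know how many lines match.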

Choices in regular expressions: the "|"
What if we wanted the lines that contain fall or on? In regular expressions, the | represents choice: fall|on. Be careful not to put in whitespace, because it matters: a space inside the expression is a literal character that must appear in the line. I'll say it again: whitespace matters!!!!

FILE: data
therainin
spainfallsmainly
ontheplain

COMMAND AND RESULT
[~/]> egrep 'fall|on' data
spainfallsmainly
ontheplain
[~/]> 

WHITESPACE MISTAKE
[~/]> egrep 'fall | on' data
[~/]> 
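
To see exactly what the extra spaces do, it helps to use input that actually contains spaces. This is a hedged sketch with made-up sample lines. It assumes grep -E (which accepts the same expressions as egrep) and a hypothetical helper function sample that supplies the input on standard input.

```shell
# Hypothetical sample input, piped in on standard input.
sample() { printf '%s\n' 'rain fall here' 'rainfall' 'come on in'; }

# No spaces in the pattern: any line containing "fall" or "on" matches,
# so all three lines are printed.
sample | grep -E 'fall|on'

# Spaces in the pattern: only "fall " (trailing space) or " on" (leading
# space) match, so "rainfall" is no longer printed.
sample | grep -E 'fall | on'
```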
	    

Parentheses
You can put ()'s around any regular expression for the usual grouping purposes.
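
Grouping matters most in combination with |, because without parentheses the choice covers everything on either side of the bar. A small sketch with made-up sample words, assuming grep -E (which accepts the same expressions as egrep) and a hypothetical helper function sample:

```shell
# Hypothetical sample input, piped in on standard input.
sample() { printf '%s\n' 'gray' 'grey' 'grab' 'they'; }

# Grouped: "gr", then "a" or "e", then "y". Prints gray and grey.
sample | grep -E 'gr(a|e)y'

# Ungrouped: the choice is between all of "gra" and all of "ey",
# so grab and they are printed as well.
sample | grep -E 'gra|ey'
```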

Getting at special characters
What if I want to use the ( character or the | character in a regular expression? I've got trouble, because they already mean something! The solution (as always) is to put a \ in front of the character.

FILE: data
{ x in Y | x + 3 < 12 }
{ all even x's }
(Y x Y)
))){{{

COMMAND AND RESULT
[~/]> egrep '\|' data
{ x in Y | x + 3 < 12 }
[~/]> 

COMMAND AND RESULT
[~/]> egrep '\||\(' data
{ x in Y | x + 3 < 12 }
(Y x Y)
[~/]> 
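
The same trick works for ) and for escaped characters sitting next to ordinary ones. A brief sketch on the same four lines, assuming grep -E (which accepts the same expressions as egrep) and a hypothetical helper function sample that stands in for the file data:

```shell
# Hypothetical helper: prints the same four lines as the file "data".
sample() {
  printf '%s\n' '{ x in Y | x + 3 < 12 }' "{ all even x's }" '(Y x Y)' '))){{{'
}

# Two escaped closing parentheses in a row: only the last line has "))".
sample | grep -E '\)\)'

# An escaped ( followed by the ordinary character Y.
sample | grep -E '\(Y'
```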
	    

+ and *
If you follow a character or a parenthesized expression with a *, it matches zero or more occurrences of the expression. A + matches one or more. Note that zero occurrences is the empty string, which every line contains, so (00|11)* in the example matches every line, even the empty one.

FILE: data
001100110011

0101010

COMMAND AND RESULT
[~/]> egrep '(00|11)*' data
001100110011

0101010
[~/]> 

COMMAND AND RESULT
[~/]> egrep '(00|11)+' data
001100110011
[~/]> 
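
The difference between * and + is easiest to see when the repeated item is a single character between two fixed ones. A minimal sketch with made-up sample lines, assuming grep -E (which accepts the same expressions as egrep) and a hypothetical helper function sample:

```shell
# Hypothetical sample input, piped in on standard input.
sample() { printf '%s\n' 'ac' 'abc' 'abbbc' 'abd'; }

# b* allows zero b's, so "ac" matches along with abc and abbbc.
sample | grep -E 'ab*c'

# b+ needs at least one b, so "ac" drops out.
sample | grep -E 'ab+c'

# A starred expression on its own can match the empty string,
# so every line counts as a match (-c prints the count of matching lines).
sample | grep -E -c 'z*'
```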
	    

Concatenation
Unix regular expressions allow for concatenation in the usual way: by simply writing regular expressions next to one another.
FILE: data
prevent   a10
affix     bx
postorder b10
postal    a
prefix    a00

COMMAND AND RESULT
[~/]> egrep '(pre|post)(order|fix)' data
postorder b10
prefix    a00
[~/]> 

COMMAND AND RESULT
[~/]> egrep '(a|b)(0|1)+' data
prevent   a10
postorder b10
prefix    a00
[~/]> 
	    

Matching the line rather than any substring in the line
Sometimes you only want egrep to print lines that completely match a given regular expression rather than lines that contain some substring that matches the expression. For example, suppose we want to print out the lines that contain only 0's. We might type egrep '0+'. Unfortunately, this would print out a line like 11001101, since it has "00" as a substring, which matches 0+. However, Unix regular expressions include ^, which matches the beginning of a line, and $, which matches the end of a line. Thus, egrep '^0+$' matches only a line of one or more zeros.
FILE: data
0x000000
(1100110)
0000000
1010000
(00)
0011111

COMMAND AND RESULT
[~/]> egrep '^0+' data
0x000000
0000000
0011111
[~/]> 

COMMAND AND RESULT
[~/]> egrep '0+$' data
0x000000
0000000
1010000
[~/]> 

COMMAND AND RESULT
[~/]> egrep '^0+$' data
0000000
[~/]> 
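
Whole-line matching can also be requested from the command line rather than in the expression. The sketch below is an addition to the original examples. It assumes grep -E (which accepts the same expressions as egrep), the standard grep option -x, and a hypothetical helper function sample that stands in for the file data.

```shell
# Hypothetical helper: prints the same six lines as the file "data".
sample() {
  printf '%s\n' '0x000000' '(1100110)' '0000000' '1010000' '(00)' '0011111'
}

# Anchored at both ends: the whole line must be one or more zeros.
sample | grep -E '^0+$'

# -x asks grep to accept only whole-line matches; same result as above.
sample | grep -E -x '0+'

# Anchors combine with escaped characters: a whole line of zeros
# wrapped in literal parentheses. Prints (00).
sample | grep -E '^\(0+\)$'
```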
	  

Other fun and useful features
There are some other fun and useful features in egrep's regular expressions.


Prof Christopher W Brown
Last modified: Fri Oct 3 11:22:00 EDT 2003