Unix Regular Expression Tutorial

Regular expressions are a powerful tool. We'll look at regular expressions primarily with the Unix egrep utility, but keep in mind that regular expressions are used everywhere: they're the backbone of Perl programming, for example, and a principal component of any tool that helps build interpreters or compilers.

The egrep utility takes a regular expression and a file and prints out all lines in the file that contain a string that can be generated by the regular expression. So, from the command line you type:

egrep 'regexp' filename

The regular expression regexp, which we'll discuss in a second, is usually surrounded by ' ' to keep your shell from interpreting and modifying anything. If you are interested in a more complete (though possibly incomprehensible) description of regular expressions in Unix, you can type man -s5 regexp at the command prompt.

The simplest example

The simplest regular expression is just a literal string, so let's look at an example like this:

FILE: data

    therainin
    spainfallsmainly
    ontheplain

COMMAND AND RESULT

    [~/]> egrep 'fall' data
    spainfallsmainly
    [~/]>

What's going on here? Well, the regular expression is fall (i.e. f concatenated with a concatenated with ...) and egrep printed out the only line that contained an occurrence of that regular expression. Of course, matching a literal string like this is not too interesting, though often quite powerful.
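To try this yourself, a minimal sketch along these lines recreates the data file and runs the same search. The scratch directory and the variable names are assumptions made for illustration, and `grep -E` is used as the POSIX spelling of `egrep`, since some newer grep releases flag `egrep` as obsolescent.

```shell
# Recreate the sample file in a scratch directory (hypothetical location).
dir=$(mktemp -d)
printf '%s\n' therainin spainfallsmainly ontheplain > "$dir/data"

# Search for the literal string "fall"; grep -E behaves like egrep.
literal_hits=$(grep -E 'fall' "$dir/data")
printf '%s\n' "$literal_hits"   # prints: spainfallsmainly

rm -rf "$dir"
```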

Choices in regular expressions: the "|"

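What if we wanted the lines that contain fall or on? In regular expressions, the | represents choice: fall|on. Be careful not to put in whitespace, because it matters: a space inside the quotes is part of the pattern, so 'fall | on' looks for "fall" followed by a space or "on" preceded by a space. I'll say it again: whitespace matters!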

FILE: data

    therainin
    spainfallsmainly
    ontheplain

COMMAND AND RESULT

    [~/]> egrep 'fall|on' data
    spainfallsmainly
    ontheplain
    [~/]>

WHITESPACE MISTAKE

    [~/]> egrep 'fall | on' data
    [~/]>
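The two commands can be compared side by side in a short script. This is only a sketch: the scratch directory and variable names are assumptions, and `grep -E` stands in for `egrep`.

```shell
dir=$(mktemp -d)
printf '%s\n' therainin spainfallsmainly ontheplain > "$dir/data"

# Either alternative may match: lines containing "fall" or "on".
choice_hits=$(grep -E 'fall|on' "$dir/data")

# With spaces, the alternatives become "fall " and " on". No line contains
# either, and grep exits with status 1 on no match, hence the "|| true".
spaced_hits=$(grep -E 'fall | on' "$dir/data" || true)

printf 'without spaces:\n%s\n' "$choice_hits"
printf 'with spaces: [%s]\n' "$spaced_hits"   # prints: with spaces: []

rm -rf "$dir"
```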

Parentheses

You can put ()'s around any regular expression for the usual grouping purposes.
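Grouping matters because | binds more loosely than concatenation. The following sketch (the file contents are borrowed from the earlier example; the directory and variable names are assumptions, and `grep -E` stands in for `egrep`) shows the same characters with and without parentheses:

```shell
dir=$(mktemp -d)
printf '%s\n' therainin spainfallsmainly ontheplain > "$dir/data"

# Grouped: an "n" followed by either "f" or "t".
grouped_hits=$(grep -E 'n(f|t)' "$dir/data")

# Ungrouped: either the string "nf" or a lone "t" anywhere in the line.
ungrouped_hits=$(grep -E 'nf|t' "$dir/data")

printf 'grouped:\n%s\n' "$grouped_hits"       # spainfallsmainly, ontheplain
printf 'ungrouped:\n%s\n' "$ungrouped_hits"   # all three lines

rm -rf "$dir"
```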

Getting at special characters

What if I want to use the ( character or the | character in a regular expression? I've got trouble, because they already mean something! The solution (as always) is to put a \ in front of the character.

FILE: data

    { x in Y | x + 3 < 12 }
    { all even x's }
    (Y x Y)
    ))){{{

COMMAND AND RESULT

    [~/]> egrep '\|' data
    { x in Y | x + 3 < 12 }
    [~/]>

COMMAND AND RESULT

    [~/]> egrep '\||\(' data
    { x in Y | x + 3 < 12 }
    (Y x Y)
    [~/]>
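As a runnable sketch of the same idea, the script below stores each pattern in a variable before passing it to the search, and adds one more escaped character, \), for comparison. The directory, the variable names, and the extra \) pattern are assumptions added for illustration; `grep -E` stands in for `egrep`.

```shell
dir=$(mktemp -d)
cat > "$dir/data" <<'EOF'
{ x in Y | x + 3 < 12 }
{ all even x's }
(Y x Y)
))){{{
EOF

# A backslash turns each special character into a literal one.
pipe='\|'
pipe_or_open='\||\('
close='\)'

pipe_hits=$(grep -E "$pipe" "$dir/data")
pipe_or_open_hits=$(grep -E "$pipe_or_open" "$dir/data")
close_hits=$(grep -E "$close" "$dir/data")

printf '%s\n' "$pipe_hits"           # the set-builder line only
printf '%s\n' "$pipe_or_open_hits"   # the set-builder line and (Y x Y)
printf '%s\n' "$close_hits"          # (Y x Y) and ))){{{

rm -rf "$dir"
```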

+ and *

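If you follow a character or a parenthesized expression with a *, it matches zero or more occurrences of the expression. A + matches one or more. Because * accepts zero occurrences, a pattern such as (00|11)* also matches the empty string, so egrep prints every line; the + version prints only lines that contain at least one 00 or 11.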

FILE: data

    001100110011
    0101010

COMMAND AND RESULT

    [~/]> egrep '(00|11)*' data
    001100110011
    0101010
    [~/]>

COMMAND AND RESULT

    [~/]> egrep '(00|11)+' data
    001100110011
    [~/]>
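A quick way to see the difference between * and + is to add a line that contains no 0 or 1 at all. In this sketch the extra line abc, the scratch directory, and the variable names are assumptions; `grep -E` stands in for `egrep`, and -c asks for a count of matching lines instead of the lines themselves.

```shell
dir=$(mktemp -d)
printf '%s\n' 001100110011 0101010 abc > "$dir/data"

# "*" allows zero repetitions, so the empty string matches and every line counts.
star_count=$(grep -E -c '(00|11)*' "$dir/data")

# "+" needs at least one "00" or "11" somewhere in the line.
plus_hits=$(grep -E '(00|11)+' "$dir/data")

printf 'lines matching *: %s\n' "$star_count"   # prints: lines matching *: 3
printf 'lines matching +: %s\n' "$plus_hits"    # prints: lines matching +: 001100110011

rm -rf "$dir"
```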

Concatenation

Unix regular expressions allow for concatenation in the usual way: by simply writing regular expressions next to one another.

FILE: data

    prevent a10
    affix bx
    postorder b10
    postal a
    prefix a00

COMMAND AND RESULT

    [~/]> egrep '(pre|post)(order|fix)' data
    postorder b10
    prefix a00
    [~/]>

COMMAND AND RESULT

    [~/]> egrep '(a|b)(0|1)+' data
    prevent a10
    postorder b10
    prefix a00
    [~/]>
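Since concatenation is just writing one expression after another, a pattern can be assembled from smaller pieces. The sketch below does this with two shell variables; the variable names and the scratch directory are assumptions, and `grep -E` stands in for `egrep`.

```shell
dir=$(mktemp -d)
printf '%s\n' 'prevent a10' 'affix bx' 'postorder b10' 'postal a' 'prefix a00' > "$dir/data"

# Two smaller expressions...
front='(pre|post)'
back='(order|fix)'

# ...written next to one another form the concatenated expression.
word_hits=$(grep -E "$front$back" "$dir/data")
printf '%s\n' "$word_hits"   # postorder b10, prefix a00

rm -rf "$dir"
```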

Matching the line rather than any substring in the line

Sometimes you only want egrep to print lines that completely match a given regular expression rather than lines that contain some substring that matches the expression. For example, suppose we want to print out the lines that contain only 0's. We might type egrep '0+'. Unfortunately, this would print out a line like 11001101, since it has "00" as a substring, which matches 0+. However, Unix regular expressions include ^, which matches the beginning of a line, and $, which matches the end of a line. Thus, egrep '^0+$' matches only a line of one or more zeros.

FILE: data

    0x000000
    (1100110)
    0000000
    1010000
    (00)
    0011111

COMMAND AND RESULT

    [~/]> egrep '^0+' data
    0x000000
    0000000
    0011111
    [~/]>

COMMAND AND RESULT

    [~/]> egrep '0+$' data
    0x000000
    0000000
    1010000
    [~/]>

COMMAND AND RESULT

    [~/]> egrep '^0+$' data
    0000000
    [~/]>
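The 11001101 case mentioned above can be checked directly. In this sketch the three sample lines, the scratch directory, and the variable names are assumptions; `grep -E` stands in for `egrep`.

```shell
dir=$(mktemp -d)
printf '%s\n' 11001101 0000 '(00)' > "$dir/data"

# Unanchored: any line with a run of zeros somewhere in it is printed.
unanchored_hits=$(grep -E '0+' "$dir/data")

# Anchored at both ends: the whole line must consist of zeros.
anchored_hits=$(grep -E '^0+$' "$dir/data")

printf 'unanchored:\n%s\n' "$unanchored_hits"   # all three lines
printf 'anchored:\n%s\n' "$anchored_hits"       # 0000

rm -rf "$dir"
```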

Other fun and useful features

There are some other fun and useful features in egrep's regular expressions.

You can specify ranges of characters. For example, to match a digit you put [0-9]. To match any uppercase letter you put [A-Z]. To match a letter (either case) or a digit, you'd use [0-9]|[A-Z]|[a-z], or more compactly [0-9A-Za-z].
You can specify zero-or-one occurrences of something with the ? character. For example, if you want to specify that a + or - sign may be there, but may not, you'd write (\+|-)?. The + needs a backslash because it is a special character.
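Putting ranges, ?, and the line anchors together gives a pattern for optionally signed integers. This is a sketch under assumptions: the sample lines, the scratch directory, and the variable names are made up for illustration, and `grep -E` stands in for `egrep`.

```shell
dir=$(mktemp -d)
printf '%s\n' +12 -7 42 x9 +-3 > "$dir/data"

# Optional sign, then one or more digits, and nothing else on the line.
signed_int='^(\+|-)?[0-9]+$'

signed_hits=$(grep -E "$signed_int" "$dir/data")
printf '%s\n' "$signed_hits"   # +12, -7, 42

rm -rf "$dir"
```

The lines x9 and +-3 are rejected: the first starts with a character that is neither a sign nor a digit, and the second has two signs where at most one is allowed.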

Prof Christopher W Brown

Last modified: Fri Oct 3 11:22:00 EDT 2003
