| ex0.py | input0.txt | run of program |
$ ./ex0.py < input0.txt
There were 2 blank lines!
|
Regular expressions in Python are provided by the
re module, which you need to "import" before
using. The re module has a number of useful functions. We will
start with re.search(regex,str),
which searches the input string str for a substring
matching the regular expression regex.
What follows is a simple Python program that uses regular
expressions to determine whether a line contains either
b0b or b1b.
| ex1.py | run of program |
$ ./ex1.py
Blah,b0bfa
MATCH!
Foob10b0x
NO MATCH!
how about b 1 b ?
NO MATCH!
howaboutb1b?
MATCH!
|
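The listing of ex1.py itself is not reproduced here, so the following is only a minimal sketch of a program with the behavior shown in the run. The function name classify and the pattern b(0|1)b are assumptions, and the sample lines are hard-coded instead of being read from standard input so the sketch is self-contained.

```python
import re

def classify(line):
    """Return "MATCH!" if line contains b0b or b1b, else "NO MATCH!"."""
    # re.search returns a match object on success and None on failure,
    # so its result can be used directly as the if-condition.
    if re.search(r'b(0|1)b', line):
        return "MATCH!"
    return "NO MATCH!"

# Sample lines taken from the run of ex1.py.
for line in ["Blah,b0bfa", "Foob10b0x", "how about b 1 b ?", "howaboutb1b?"]:
    print(line)
    print(classify(line))
```

A version that reads real input would loop over sys.stdin instead of the hard-coded list.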
Note1:
A Python string prefaced with r is a "raw
string", meaning that a backslash is kept as an ordinary
character instead of starting an "escape". So
'x\ny' is an x followed by a newline followed by
y, but r'x\ny' is an x followed by a backslash
followed by n followed by y. It's easiest to write your
regular expressions with raw strings for reasons that will
become clear later.
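To see the difference for yourself, here is a small sketch comparing the two strings from Note1. The variable names plain and raw are just for illustration.

```python
plain = 'x\ny'    # \n is an escape: x, newline, y
raw = r'x\ny'     # raw string: x, backslash, n, y

print(len(plain))        # 3
print(len(raw))          # 4
print(raw == 'x\\ny')    # True: same as writing the backslash twice
```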
Note2:
If re.search( ) successfully finds a match, it returns a
"match object". We'll talk about that later. If it doesn't
find a match, re.search( ) returns None,
which is Python's version of "null". And if you try to use a
reference in a boolean context, like an
if-condition, None counts as false and a match
object counts as true. That's why that if-statement
in the program works!
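A quick way to check what Note2 says is to look at the two possible results directly. This is only an illustrative sketch; the names hit and miss and the pattern b(0|1)b are assumptions.

```python
import re

hit = re.search(r'b(0|1)b', 'howaboutb1b?')        # a match object
miss = re.search(r'b(0|1)b', 'how about b 1 b ?')  # None

print(miss is None)   # True
print(bool(hit))      # True
print(bool(miss))     # False
```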
Is "axorq8,e3k0ka1b0bkca4bb2b" a
match? Why/Why not?
Suppose we want not only to match b(0|1)b, but also
to know what matched the (0|1) part -- i.e. did
we get a zero or a one in there? The "match object",
i.e. mres in our program,
is like an array storing
the string that matched a parenthesized sub-part of a regular
expression at index i, where the left parenthesis around
the subpart is the ith left parenthesis of the whole
regular expression. In the above, there is only one open paren, so
mres[1] is the string that matched the
(0|1).
| ex2.py | run of program |
$ ./ex2.py
theb0bdog
MATCH! Bit was 0
likes dog b1scuits
NO MATCH!
before b1bbles
MATCH! Bit was 1
|
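The ex2.py listing is not reproduced here. A minimal sketch that would produce the run shown, assuming the pattern b(0|1)b and a hypothetical helper called describe, might look like this. Indexing a match object as mres[1] needs Python 3.6 or later; mres.group(1) does the same thing.

```python
import re

def describe(line):
    """Report whether line contains b0b or b1b, and which bit it was."""
    mres = re.search(r'b(0|1)b', line)
    if mres:
        # mres[1] is whatever matched the first (and only) parenthesized part.
        return "MATCH! Bit was " + mres[1]
    return "NO MATCH!"

# Sample lines taken from the run of ex2.py.
for line in ["theb0bdog", "likes dog b1scuits", "before b1bbles"]:
    print(line)
    print(describe(line))
```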
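When a regular expression has several parenthesized parts, count
left parentheses to work out which i to use in mres[i].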
In the following example, we have two parenthesized
subexpressions, each matching a different part of the string.
| ex3.py | run of program |
$ ./ex3.py
we match <26A> of course
MATCH! Bit was (26,A)
but not <25AB>, naturally
NO MATCH!
|
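The ex3.py listing is not reproduced here either, so the pattern below is a guess that is merely consistent with the run: a "<", one or more digits, a single A or B, then a ">". Treat both the pattern and the helper name describe as assumptions.

```python
import re

def describe(line):
    """Report the number and letter found inside <...>, if any."""
    # Assumed pattern: the first paren pair captures the digits,
    # the second captures the letter.
    mres = re.search(r'<(\d+)(A|B)>', line)
    if mres:
        return "MATCH! Bit was (" + mres[1] + "," + mres[2] + ")"
    return "NO MATCH!"

# Sample lines taken from the run of ex3.py.
for line in ["we match <26A> of course", "but not <25AB>, naturally"]:
    print(line)
    print(describe(line))
```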
Copy this program and try it on input like this:
The secret key <456 B> is
hidden <B678> in <007B> this
document somewhere 56A but you'll never
find <22B it.
Sometimes parens are nested and things get complicated. For
example,
consider the expression E = x(letter(a|b)+|number(0|1)+)y.
This is matched by strings like xletteraaby and xnumber1010y.
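In the first case, mres[1] = "letteraab",
mres[2] = "b", and mres[3] = None. In the second case,
mres[1] = "number1010",
mres[2] = None, and
mres[3] = "0". Do you see why? A group under a + is
overwritten on every repetition, so it ends up holding only the
last repetition, and a group that took no part in the match
is None.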
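You can check the indices by running E through re.search. Note what Python reports: a group that sits under a + holds only its last repetition, and a group that was not used in the match is None. The second expression, E_wide, is an illustrative variant (not from the notes) that wraps each repeated part in an extra pair of parens so the whole run is captured.

```python
import re

E = r'x(letter(a|b)+|number(0|1)+)y'
E_wide = r'x(letter((a|b)+)|number((0|1)+))y'   # extra parens around each run

for s in ["xletteraaby", "xnumber1010y"]:
    mres = re.search(E, s)
    print(s, "->", mres[1], mres[2], mres[3])
# xletteraaby -> letteraab b None
# xnumber1010y -> number1010 None 0

wide = re.search(E_wide, "xletteraaby")
print(wide[2])   # aab
```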
Note: It's okay to put redundant parentheses around
parts of your regular expression just so you can use a
mres[i] to refer back to whatever matched them.
\d this matches a digit, so the regular
expression r'(a|b)\d' matches
a0,b0,a1,b1,a2,b2,...,a9,b9.
? this matches 0 or 1 occurrence of whatever
preceded it. So, for example, the regular expression
r'1?\d' matches all the numbers from zero up to
19.
[x-y] Ranges (as for
egrep). For example, the regular expression
r'[A-D]|F' matches the valid letter grades,
A, B, C, D and F.
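Here is a small sketch for trying out these three pieces. It uses re.fullmatch, which only succeeds when the whole string matches (re.search would also accept a match buried inside a longer string). The helper name whole is made up for this example.

```python
import re

def whole(regex, s):
    """True if regex matches all of s, not just a substring of it."""
    return re.fullmatch(regex, s) is not None

print([s for s in ["a0", "b9", "c3", "a"] if whole(r'(a|b)\d', s)])
# ['a0', 'b9']
print([s for s in ["0", "7", "19", "20", "119"] if whole(r'1?\d', s)])
# ['0', '7', '19']
print([s for s in "ABCDEF" if whole(r'[A-D]|F', s)])
# ['A', 'B', 'C', 'D', 'F']
```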
r'<(.*)>'
and mres[1] would give you whatever came between the < >s.
\d matches a single digit),
possibly followed by one capital letter. For example,
SI204, SI486A and SI311 are all "SI" computer science courses. On the other hand,
IS221, SI32, and SI496r are not.
If you give your program the input file
This semester SI433 is running for the last time take it while you can. EM300 is running as always, but don't try taking it out of sequence. We got two OSI2 people, one teaching SI204 (or is it IC210?), the other teaching IT310 as well as SI283. I don't think there is a SI496A running this Spring. The SI 2008 course offerings look like they're nailed down, though.
You should get output:
433 204 283 496A
<b>" and
"</b>".
There <b>is</b> a time to play and there <b>is not</b> a reason to panic!
your program should produce output
is is not
Welcome to <b>Joes</b> home of <b>FREE REFILLS</b>!!!
What you'll probably get is:
Joes</b> home of <b>FREE REFILLS
What happened? Well, after finding the first <b> it decided to match with the last </b>. By default, when you use a * or a + Python will try to make the match as long as possible. Here we want it to make the match as short as possible. If you use *? or +? in place of * and + you get shortest possible matches instead. Modify your Problem 2 solution so that you get
Joes
on the above input. In other words, you get the first occurrence of something delimited by <b> and </b>.
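The longest-versus-shortest behavior is easy to see on a different kind of delimiter. The sketch below uses square brackets and a made-up sample string so that it illustrates * versus *? without being the Problem 3 solution.

```python
import re

s = "[one] and [two]"

greedy = re.search(r'\[(.*)\]', s)    # * grabs as much as possible
lazy = re.search(r'\[(.*?)\]', s)     # *? grabs as little as possible

print(greedy[1])   # one] and [two
print(lazy[1])     # one
```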
ugly like this:
curl https://www.usna.edu/Users/cs/wcbrown/courses/F11SI340/classes/L17/input1 > ugly
It is html source code that has a boatload of meeting times in it. Write a program that will print out the meeting times it contains. If the script prints out a couple of extras ... so be it. If your program is
lab.py, run it
like this:
./lab.py < ugly
That'll save you copying and pasting this big ugly file! Hint: If you want between m and n occurrences of something in a regular expression, {m,n} does the trick. It's like a limited * or +. So, for example, if you wanted a number between 10 and 9999 you might use the regular expression
[1-9]\d{1,3}.
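As a quick check of the {m,n} hint, the sketch below tests a few candidate strings against [1-9]\d{1,3}. It uses re.fullmatch so the whole string has to match; the helper name in_range and the test strings are made up for illustration.

```python
import re

def in_range(s):
    """True if s is written as a number from 10 to 9999 with no leading zero."""
    # One digit 1-9, followed by between 1 and 3 more digits.
    return re.fullmatch(r'[1-9]\d{1,3}', s) is not None

print([s for s in ["9", "10", "0123", "9999", "10000"] if in_range(s)])
# ['10', '9999']
```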
(1|1\-|1\s)? # Possible 1, 1- or 1 and a space
((\d\d\d)|\((\d\d\d)\)) # Area code (possibly in parens)
(\-|\s)? # Space or dash or nothing
(\d\d\d) # Block one
(\-|\s)? # Space or dash or nothing
(\d\d\d\d) # Block two |
my number is +1(410)677-8220.
the number 4106778220 doesn't exist.
try dialing 410 677-8220 and see
don't call (410) 677 8220 please
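A commented, multi-line regular expression like the one above can be used directly in Python by compiling it with the re.VERBOSE flag, which ignores whitespace and # comments inside the pattern. The sketch below is one way to do that; the names PHONE and extract_phone are made up, and the first line of the pattern is written with \s (whitespace) after the 1, which is an assumption about what was intended.

```python
import re

PHONE = re.compile(r'''
    (1|1\-|1\s)?              # Possible 1, 1- or 1 and a space (assumed \s)
    ((\d\d\d)|\((\d\d\d)\))   # Area code (possibly in parens)
    (\-|\s)?                  # Space or dash or nothing
    (\d\d\d)                  # Block one
    (\-|\s)?                  # Space or dash or nothing
    (\d\d\d\d)                # Block two
''', re.VERBOSE)

def extract_phone(line):
    """Return (area code, block one, block two), or None if no number is found."""
    mres = PHONE.search(line)
    if mres is None:
        return None
    # Counting left parens: 3 is the bare area code, 4 is the one in parens.
    area = mres[3] or mres[4]
    return (area, mres[6], mres[8])

for line in ["my number is +1(410)677-8220.",
             "the number 4106778220 doesn't exist.",
             "try dialing 410 677-8220 and see",
             "don't call (410) 677 8220 please"]:
    print(extract_phone(line))
```

Only one of groups 3 and 4 takes part in any given match, so the other is None and the "or" picks whichever one is set.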