Python Regular Expressions

This post should serve as a basic guide for Python regular expressions.

It is recommended to learn Python basics before you learn Python regular expressions.

Regular expressions are a language unto themselves, and they can be used from many different programming languages. They offer cryptic yet very succinct ways of expressing solutions to programming problems. For some, regular expressions will come very naturally, but others will prefer the step-by-step method of writing code. You do not have to know regular expressions, but if you do, you may find them quite fun to work with.

[Image: Wk2_RegExpDef]

You might want to bookmark a character guide for Python regular expressions.

You can see from the guide that regular expressions are a language of characters. Certain characters have a special meaning, similar to how Python reserved words have special meaning. Shown below is the module you need to import if you want to make use of Python regular expressions in your program.

[Image: Wk2_RegExpMod]

Consider these two example lines of text from the mbox-short.txt file.

X-DSPAM-Result:
X-Plane is behind schedule:

Now consider the following code:

import re

# Print every line that starts with 'X', then any characters, then a ':'
lines = open('mbox-short.txt')
for line in lines:
    line = line.rstrip()
    if re.search('^X.*:', line):
        print(line)

The “if” statement will catch lines that start with (‘^’) an ‘X’, followed by any character (‘.’) repeated zero or more times (‘*’), followed by a ‘:’. The ‘X’ and ‘:’ are ordinary characters that match themselves; the ‘^’, ‘.’, and ‘*’ have special meaning. This “if” statement catches both of the example lines of text written above.
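To try the pattern without the mbox-short.txt file, here is a minimal sketch that applies it to the two example lines directly. The third line in the list is a made-up line, added only to show a non-match.

```python
import re

# The two example lines from the text, plus one hypothetical
# line that should not match (it does not start with 'X').
sample_lines = [
    'X-DSPAM-Result:',
    'X-Plane is behind schedule:',
    'From: someone',
]

# Keep only the lines that start with 'X' and contain a ':' later on.
matched = [line for line in sample_lines if re.search('^X.*:', line)]
print(matched)
```

Both ‘X’ lines end up in the list, and the made-up ‘From:’ line is left out.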

Suppose you do not want to catch a line that has whitespace between the ‘X-’ and the ‘:’ (as the second example line does). You would modify the regular expression as follows:

for line in lines:
    line = line.rstrip()
    # A raw string (r'...') keeps the backslash in \S intact
    if re.search(r'^X-\S+:', line):
        print(line)

Now your “if” statement will only match lines that start with ‘X-’, followed by one or more (‘+’) non-whitespace characters (‘\S’), followed by a ‘:’.
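As a quick check, the stricter pattern can be run against the same two example lines. This is only a sketch; the variable names are chosen for illustration.

```python
import re

sample_lines = ['X-DSPAM-Result:', 'X-Plane is behind schedule:']

# '\S+' cannot cross the space after 'X-Plane', so the second line
# never reaches a ':' and is rejected.
pattern = r'^X-\S+:'
matched = [line for line in sample_lines if re.search(pattern, line)]
print(matched)
```

Only the first line survives, because the second one has whitespace before its colon.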

Matching and Extracting Data with Python Regular Expressions

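The function “re.search()” returns a match object if the regular expression finds a match, and None if it does not. A match object counts as true and None counts as false, which is why “re.search()” can be used directly in an “if” statement.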

Use “re.findall()” if you want matching strings to be extracted.
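The difference between the two functions is easy to see side by side. The sketch below reuses the first example line from earlier; the second pattern, ‘^Y’, is just a made-up pattern that cannot match.

```python
import re

text = 'X-DSPAM-Result:'

# re.search() gives back a match object on success, or None on failure.
found = re.search(r'^X-\S+:', text)
missing = re.search(r'^Y', text)
print(found is not None)   # a match object was returned
print(missing)             # None, because no match was found

# The matched text can be pulled out of the match object with group().
print(found.group())

# re.findall() skips the match object and returns a list of strings.
extracted = re.findall(r'^X-\S+:', text)
print(extracted)
```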

Consider these four lines of code below, in the Python interpreter.

[Image: Wk2_RegExpMatchExtract]

The ‘[0-9]’ matches any single digit, and the ‘+’ means one or more of them in a row, so ‘[0-9]+’ matches each run of consecutive digits. The variable ‘y’ therefore holds a Python list of the matches found in the string ‘x’, which is passed as the second parameter. So, “re.findall()” extracts matching data and returns a list.
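Here is a small sketch along the same lines. The sample string is made up for illustration, so it may differ from the one used in the screenshot.

```python
import re

# Hypothetical sample text containing a few numbers.
x = 'My 2 favorite numbers are 19 and 42'

# Each run of consecutive digits becomes one item in the list.
y = re.findall('[0-9]+', x)
print(y)

# When nothing matches, findall() returns an empty list.
nothing = re.findall('[0-9]+', 'no digits here')
print(nothing)
```

Note that the items in the list are strings, not integers, so they need to be converted with int() before doing arithmetic on them.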

It is important to know that the repeat characters ‘*’ and ‘+’ match the largest possible string by default. For example:

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using the :']

Did you notice? The “re.findall()” did not stop at ‘From:’, because that is not the largest possible matching string. This concept is referred to as greedy matching, because it returns the largest possible match. If you wanted to stop at the first colon, then you would need to use non-greedy matching:

>>> y = re.findall('^F.+?:', x)

Adding the ‘?’ after the ‘+’ makes the match stop as early as possible, so this regular expression returns the shortest matching string, ‘From:’.
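To compare the two behaviors directly, this short sketch runs both patterns on the same string from the example above.

```python
import re

x = 'From: Using the : character'

# Greedy: '.+' grabs as much as it can, so the match ends at the LAST colon.
greedy = re.findall('^F.+:', x)

# Non-greedy: '.+?' grabs as little as it can, so the match ends at the FIRST colon.
non_greedy = re.findall('^F.+?:', x)

print(greedy)
print(non_greedy)
```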