Learn Python Series (#23) - Handling Regular Expressions Part 1
Repository
https://github.com/python/cpython
What Will I Learn?
- In this part (1, of a subseries within the Learn Python Series) you will learn about using regular expressions in Python via the re module, with which you can write patterns in order to match them within input strings;
- in order not to overwhelm beginner programmers, in Part 1 you'll only be using fixed string patterns;
- how to use the re functions match(), search(), findall(), finditer(), sub(), split() and compile();
- as well as how to use a few handy attributes of a Match object.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.6) distribution, such as (for example) the Anaconda Distribution;
- The ambition to learn Python programming.
Difficulty
- Basic
Curriculum (of the Learn Python Series):
- Learn Python Series - Intro
- Learn Python Series (#2) - Handling Strings Part 1
- Learn Python Series (#3) - Handling Strings Part 2
- Learn Python Series (#4) - Round-Up #1
- Learn Python Series (#5) - Handling Lists Part 1
- Learn Python Series (#6) - Handling Lists Part 2
- Learn Python Series (#7) - Handling Dictionaries
- Learn Python Series (#8) - Handling Tuples
- Learn Python Series (#9) - Using Import
- Learn Python Series (#10) - Matplotlib Part 1
- Learn Python Series (#11) - NumPy Part 1
- Learn Python Series (#12) - Handling Files
- Learn Python Series (#13) - Mini Project - Developing a Web Crawler Part 1
- Learn Python Series (#14) - Mini Project - Developing a Web Crawler Part 2
- Learn Python Series (#15) - Handling JSON
- Learn Python Series (#16) - Mini Project - Developing a Web Crawler Part 3
- Learn Python Series (#17) - Roundup #2 - Combining and analyzing any-to-any multi-currency historical data
- Learn Python Series (#18) - PyMongo Part 1
- Learn Python Series (#19) - PyMongo Part 2
- Learn Python Series (#20) - PyMongo Part 3
- Learn Python Series (#21) - Handling Dates and Time Part 1
- Learn Python Series (#22) - Handling Dates and Time Part 2
Proof of Work Done
Supplemental source code, including tutorial itself (iPython):
https://github.com/realScipio/learn-python-series/blob/master/regex-01.ipynb
Learn Python Series (#23) - Handling Regular Expressions Part 1
This subseries of the Learn Python Series focuses on working with regular expressions, often abbreviated to regex. Regular expressions are text-matching patterns. They are very powerful, but their syntax is quite formal and, for beginners, quite difficult to comprehend.
You can look at regular expressions as a specialized programming language squeezed inside Python. It empowers you to define "rules" you want to match within a larger string of data. For example: "Is there a phone number on a certain web page?", or "Is this a valid email address?".
When a lot of text processing is involved, regular expressions are your friend (as long as you know how to use regexes, that is!). When you're developing an application containing lots of code, using regular expressions to find and / or edit parts of the code itself is pretty helpful and common as well. Sys admins dealing with log files find regular expressions useful too. And when developing a textual search engine, you're obviously dealing with big volumes of textual data, where regular expressions are again useful and common.
Nota bene 1: a word of advice to those new to handling regular expressions in general, not just in Python: regexes can be pretty overwhelming at first sight, but don't feel intimidated by them! I will try to explain everything in the right order, as I always do, and feel free to ask me questions in the comment sections if something is unclear. I'm happy to help!
Nota bene 2: in order to keep things easy-to-follow for beginner programmers, I decided to present the information in this regex subseries in a very specific order. That means: instead of beginning with the "regex language" itself, explaining meta characters, pattern syntax, escaping, grouping, sets, etc. right off the bat (which would risk beginning programmers feeling dazzled and completely overwhelmed by the info presented), I'm first explaining the basic mechanisms (the functions, methods and attributes that are used most often) of the re module.
Therefore, in this Handling Regular Expressions Part 1 episode, I will only be using simple, fixed, literal character matching. Starting from Part 2 I'll be going over the "nitty gritty" of the regex language itself.
Having said that: let's begin!
Basic (fixed) substring matching (non-regex) via in
In the Learn Python Series (#3) - Handling Strings Part 2 episode I first introduced using the in operator. The in operator is not a part of the re module in Python (for using regular expressions), but in can be used for exact substring matching nonetheless.
For example:
substring = "simply"
string = "This is simply an example string"
if substring in string:
    print("The word 'simply' is in there all right!")
else:
    print("Nope! No 'simply' found!")
The word 'simply' is in there all right!
The in operator just checks if an exact substring (in this case "simply") is contained in a larger string.
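To make "exact" a bit more concrete, here is a minimal sketch showing that in compares characters literally, so letter case matters. The variable names found_lower and found_upper are only illustrative and not part of the original example.

```python
# `in` compares characters exactly, so letter case matters
string = "This is simply an example string"

found_lower = "simply" in string   # the exact substring is present
found_upper = "Simply" in string   # a capital S does not occur in the string

print(found_lower)   # True
print(found_upper)   # False
```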
Finding patterns in text via regular expressions
A regular expression is - in essence - a search pattern described by a sequence of characters with a formal syntax. The patterns are not just (like finding an exact substring in a larger string, as mentioned above) matched "AS-IS", but rather interpreted as instructions. The pattern is then compared to an input string and in case of a match, the matching subset is returned. Regular expressions can (of course) include literal text matching, but also repetition and composition, as well as other rules.
Importing the re module
In Python, the functionality for regular expressions is made available by importing the re module.
import re
The match() and the search() functions, returning a Match object
I think the most frequent use case for regular expressions is finding / searching for patterns in text (the input string). The re module comes with two primitive functions for matching patterns in strings: the match() function and the search() function.
match() only checks for a match at the beginning of the input string, whereas search() tries to match the pattern anywhere inside the string (including at the beginning).
You pass a pattern string and an input string as arguments. When the pattern is found, both functions return a Match object for the first occurrence; when it is not found, they return None.
# Using the `match()` function
pattern = "I"
string = "I still remember how it all changed"
match = re.match(pattern, string)
print(match)
<_sre.SRE_Match object; span=(0, 1), match='I'>
# Using the `search()` function
pattern = "for"
string = "Hello! Is it me you're looking for?"
match = re.search(pattern, string)
print(match)
<_sre.SRE_Match object; span=(31, 34), match='for'>
# The `match()` function only looks at the beginning of the string,
# and returns `None`
# The `search()` function looks at the entire string,
# and returns a Match object.
pattern = "all"
string = "We all stand together!"
match_match = re.match(pattern, string)
match_search = re.search(pattern, string)
print(match_match)
print(match_search)
None
<_sre.SRE_Match object; span=(3, 6), match='all'>
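Because both functions return None when nothing is found, it is usually wise to check the result before calling methods on it. The sketch below shows one way to do that; the helper name describe() is hypothetical and only serves this illustration.

```python
import re

def describe(pattern, string):
    """Return a short message about where `pattern` was first found (hypothetical helper)."""
    match = re.search(pattern, string)
    if match is None:
        # No Match object was returned, so there is nothing to inspect
        return "'{}' not found".format(pattern)
    return "'{}' found at index {}".format(pattern, match.start())

print(describe("all", "We all stand together!"))    # 'all' found at index 3
print(describe("none", "We all stand together!"))   # 'none' not found
```

Calling a method such as start() directly on a None result would raise an AttributeError, which is why the check comes first.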
In order to get more info from the returned Match object, we can use its two methods start() and end(), which return the start & end indexes in the input string where the pattern was found. The Match object also exposes the input string (via its string attribute), as well as the pattern that was used (via re.pattern).
print(match_search.start())
print(match_search.end())
print(match_search.string)
print(match_search.re.pattern)
3
6
We all stand together!
all
Nota bene: please remember that the end index is exclusive: it points just past the last matched character, which is exactly what slicing expects!
I.e.:
print(match_search.string[match_search.start():match_search.end()])
all
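As a small addition to the attributes above, a Match object also has a group() method that returns the matched text directly, and a span() method that returns the start and end index together as a tuple. Neither is used in the original example; this is just a minimal sketch reusing the same pattern and string.

```python
import re

match = re.search("all", "We all stand together!")

if match is not None:
    print(match.group())   # all     (the matched text itself)
    print(match.span())    # (3, 6)  (start and end index as one tuple)
```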
The finditer() function, for returning multiple Match objects
It could well be that more than one substring matches a regex pattern within an input string. The finditer() function returns an iterator of Match objects. Please observe the following example:
pattern = "is"
string = "It is what it is, isn't it?"
for match in re.finditer(pattern, string):
    start_index = match.start()
    end_index = match.end()
    print("Start: {}, End: {}".format(start_index, end_index))
Start: 3, End: 5
Start: 14, End: 16
Start: 18, End: 20
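Since finditer() returns an iterator, you can also consume it in a list comprehension when you want all positions at once. The following is a minimal sketch using the same pattern and string; the variable name spans is only illustrative.

```python
import re

pattern = "is"
string = "It is what it is, isn't it?"

# Collect the (start, end) tuple of every match in one go
spans = [m.span() for m in re.finditer(pattern, string)]
print(spans)   # [(3, 5), (14, 16), (18, 20)]
```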
The findall() function, for returning matches as a list
In case you're not particularly interested in getting Match objects back from your regular expression, but would rather have a list of the matched strings, then use the findall() function:
pattern = "la"
string = "Ooh la la la, it's the way that we rock when we're doing our thing"
matches = re.findall(pattern, string)
print(type(matches), len(matches), "returned:", matches)
<class 'list'> 3 returned: ['la', 'la', 'la']
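Two details may be worth seeing in action: when nothing matches, findall() returns an empty list (not None), and for a fixed string pattern like this one the number of matches equals what the string method count() reports. A small sketch, with variable names chosen just for this illustration:

```python
import re

string = "Ooh la la la, it's the way that we rock when we're doing our thing"

hits = re.findall("la", string)
misses = re.findall("xyz", string)

print(hits)     # ['la', 'la', 'la']
print(misses)   # []  (an empty list, which is falsy in an `if` test)

# For a fixed pattern, the number of matches equals str.count()
same_count = len(hits) == string.count("la")
print(same_count)   # True
```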
Modifying strings via regular expressions
The sub() function, for search and replace
After finding one or more matches on a pattern, you might want to replace them with something else. This can be done using the sub() function (short for substitute). As its required arguments sub() accepts a pattern to match, a replacement string, and of course the input string to search & replace on.
Please observe the following, simple, example:
pattern = "dogs"
replacement = "cows"
string = "It's raining cats and dogs"
new_string = re.sub(pattern, replacement, string)
print(new_string)
It's raining cats and cows
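By default sub() replaces every occurrence of the pattern. It also accepts an optional count argument that limits the number of replacements. The sketch below uses a made-up input string to show the difference.

```python
import re

string = "cats and cats and cats"

replace_all = re.sub("cats", "dogs", string)             # every occurrence
replace_one = re.sub("cats", "dogs", string, count=1)    # only the first one

print(replace_all)   # dogs and dogs and dogs
print(replace_one)   # dogs and cats and cats
```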
The split() function
Again, in Learn Python Series (#3) - Handling Strings Part 2 we first talked about the split() method (on strings). The re module also has a split() function, which looks for pattern matches and splits the input string at each match into multiple list elements. Calling split() returns a list.
pattern = " "
string = "This is an example sentence"
result = re.split(pattern, string)
print(type(result), result)
<class 'list'> ['This', 'is', 'an', 'example', 'sentence']
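Like sub(), the split() function takes an optional argument to limit its work: maxsplit. And when the pattern does not occur at all, the result is a list holding the unchanged input string. A minimal sketch, with illustrative variable names:

```python
import re

string = "This is an example sentence"

limited = re.split(" ", string, maxsplit=2)   # split at the first two spaces only
no_match = re.split("x-x", string)            # the pattern never occurs

print(limited)    # ['This', 'is', 'an example sentence']
print(no_match)   # ['This is an example sentence']
```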
Compiling regular expressions
The compile() function
The re module allows for working with regular expressions in the form of strings, just like we did until now. Yet it's also possible, for faster processing, to first compile the regular expression string and convert it into a RegexObject. This can be done using the compile() function.
pattern = "comes"
string = "Here it comes again, that feeling!"
regex = re.compile(pattern)
print(type(regex), regex)
<class '_sre.SRE_Pattern'> re.compile('comes')
Since we've now compiled the (very simple) regular expression, we can use - for example - the search() method of the RegexObject, which only requires one argument: the input string.
match = regex.search(string)
print(match)
<_sre.SRE_Match object; span=(8, 13), match='comes'>
Nota bene: precompiling a regular expression can be faster, because the compile step is done once, when the application (your Python script containing the compile() call) starts, instead of - for example - at the moment the application needs to react to user input.
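A compiled pattern pays off most when it is reused many times. Besides search(), the compiled object also offers findall() and sub() methods that work like the module-level functions, minus the pattern argument. The sketch below assumes some made-up log lines purely for illustration.

```python
import re

regex = re.compile("error")   # compiled once, reused for every line below

# Hypothetical sample data, not taken from the tutorial
log_lines = [
    "startup ok",
    "disk error on sda",
    "error: timeout, retrying after error",
]

# For each line: how often the pattern occurs, and the line with matches upper-cased
results = [(len(regex.findall(line)), regex.sub("ERROR", line)) for line in log_lines]

for count, new_line in results:
    print(count, new_line)
```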
What did we learn, hopefully?
This Part 1 episode was intended as an introduction to the re module, which is built in to (almost) every Python distribution and enables you to write and execute (very) (advanced) pattern matching via regular expressions. But since regular expressions can be regarded as a "mini programming language", and understanding its syntax and mechanisms is of utmost importance in order to read, let alone write, complex regular expressions, in this episode we "only" talked about how to use re functions such as match(), search(), findall(), finditer(), split(), sub() and compile().
Hopefully the way I "introduced" regular expressions didn't scare you away from trying to learn more about them! I deliberately labeled this Part 1 episode as Difficulty: Basic, Part 2 will be Difficulty: Intermediate and from there on it's... Difficulty: Fun!!
Come and find out! See you there!
Interesting these 2 posts from @nomannomi: "it give us a new lesson I like it" but then "No dear I dont know about coding". Makes you think how sincere these comments are...
Anyway, thank you for your Python series @scipio. There are several resources available on the web but yours is really comprehensive with good examples. Much appreciated.
Indeed! ;-)
It's my book, the entirety of all episodes (well, I might alter a few things here in there in case I'd want to publish it). And I'm doing my best to explain things differently, beginner-friendly, with lots of examples, which tend to get more complex as the series evolves. In this episode (part 1 of regex) I deliberately used fixed strings as patterns, not yet digging into the regex mechanisms themselves.
But watch out for part 2! Where the real fun begins! :-)
Starting simple and building it up keeps a lot of people engaged (including myself) and will return to read the next post.
Looking forward for part 2 and if you ever publish your work, I'll buy it! :-)
Well, if you look down a comment or two, since the upvote opportunity has passed, I just gave the answer myself :P (Plus use it for Part 2 :P )
Just saw it. Nice extra example for this post and funny as well :-))
Wao great post dear about pythen your post give us a new lesson I like it dear beautifull we can learn alot of things about pythan
Maybe you could use some regular expression knowledge for parsing pythen, or pythan and substitute it with python! Can you give me the code for that as a comment? I'll upvote your answer if you're correct! :-)
No dear I dont know about coding but my younger brother can do it because it is his field .
Ok then! Mind if I use your comment to give the answer myself in Part 2 then? :-)
Hmm good I will try it .
Will you really???
Yes
Okay then!