Learn Python Series (#23) - Handling Regular Expressions Part 1



Repository

https://github.com/python/cpython

What Will I Learn?

  • In this part (Part 1 of a subseries within the Learn Python Series) you will learn how to use regular expressions in Python via the re module, with which you write patterns and match them against input strings;
  • in order not to overwhelm beginner programmers, in Part 1 you'll only be using fixed string patterns;
  • how to use the re functions match(), search(), findall(), finditer(), sub(), split() and compile();
  • as well as how to use a few handy attributes of a Match object.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.6) distribution, such as (for example) the Anaconda Distribution;
  • The ambition to learn Python programming.

Difficulty

  • Basic

Curriculum (of the Learn Python Series):

Proof of Work Done

https://github.com/realScipio

Supplemental source code, including tutorial itself (iPython):

https://github.com/realScipio/learn-python-series/blob/master/regex-01.ipynb

Learn Python Series (#23) - Handling Regular Expressions Part 1

This subseries of the Learn Python Series focuses on working with regular expressions, often abbreviated to regex. Regular expressions are text-matching patterns. They are very powerful to use, but their syntax is quite formal and, for beginners, quite difficult to comprehend.

You can look at regular expressions as a specialized programming language squeezed inside Python. It empowers you to define "rules" you want to match within a larger string of data. For example: "Is there a phone number on a certain web page?", or "Is this a valid email address?".

When a lot of text processing is involved, regular expressions are your friend (as long as you know how to use regexes, that is!). When you're developing an application containing lots of code, using regular expressions to find and / or edit parts of the code itself is pretty helpful and common as well. Sys admins dealing with log files find regular expressions useful too. And when developing a textual search engine, you're obviously dealing with big volumes of textual data, where regular expressions are again useful and common.

Nota bene 1: a word of advice to those new to handling regular expressions in general, not just in Python: regexes can be pretty overwhelming at first sight, but don't feel intimidated by them! I will try to explain everything in the right order, as I always do, and feel free to ask me questions in the comment section if something is unclear. I'm happy to help!

Nota bene 2: in order to keep things easy to follow for beginner programmers, I decided to present the information in this regex subseries in a very specific order. That means: instead of beginning with the "regex language" itself, explaining meta characters, pattern syntax, escaping, grouping, sets, etc. right off the bat (which would risk beginning programmers feeling dazzled and completely overwhelmed by the info presented), I'm first explaining the basic mechanisms (the functions, methods and attributes used most often) of the re module.
Therefore, in this Handling Regular Expressions Part 1 episode, I will only be using simple, fixed, literal character matching. Starting from Part 2 I'll be going over the "nitty gritty" of the regex language itself.

Having said that: let's begin!

Basic (fixed) substring matching (non-regex) via in

In the Learn Python Series (#3) - Handling Strings Part 2 episode I first introduced the in operator. The in operator is not a part of the re module in Python (for using regular expressions), but in can be used for exact substring matching nonetheless.

For example:

substring = "simply"
string = "This is simply an example string"

if substring in string:
    print("The word 'simply' is in there all right!")
else:
    print("Nope! No 'simply' found!")    
The word 'simply' is in there all right!

The in operator just checks whether an exact substring (in this case "simply") is contained in a larger string. It returns only True or False; it does not tell you where the substring was found.

Finding patterns in text via regular expressions

A regular expression is - in essence - a search pattern described by a sequence of characters with a formal syntax. The patterns are not just matched "AS-IS" (like finding an exact substring in a larger string, as mentioned above), but rather interpreted as instructions. The pattern is then compared to an input string and, in case of a match, the matching substring is returned. Regular expressions can (of course) include literal text matching, but also repetition and composition, as well as other rules.

Importing the re module

In Python, the functionality for regular expressions is made available by importing the re module.

import re

The match() and the search() functions, returning a Match object

I think the most frequent use case for regular expressions is finding / searching for patterns in text (the input string). The re module comes with two primitive functions for matching patterns found in strings: the match() function and the search() function.

match() only checks for a match at the beginning of the input string, whereas search() tries to match the pattern anywhere inside the string (including at the beginning).

Both functions take a pattern string and an input string as their arguments. When the pattern is found, they return a Match object for the first occurrence; when it is not found, they return None.

# Using the `match()` function

pattern = "I"
string = "I still remember how it all changed"

match = re.match(pattern, string)
print(match)
<_sre.SRE_Match object; span=(0, 1), match='I'>
# Using the `search()` function

pattern = "for"
string = "Hello! Is it me you're looking for?"

match = re.search(pattern, string)
print(match)
<_sre.SRE_Match object; span=(31, 34), match='for'>
# The `match()` function only looks at the beginning of the string,
# and returns `None`
# The `search()` function looks at the entire string,
# and returns a Match object.

pattern = "all"
string = "We all stand together!"

match_match = re.match(pattern, string)
match_search = re.search(pattern, string)
print(match_match)
print(match_search)
None
<_sre.SRE_Match object; span=(3, 6), match='all'>
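
Because both functions return None when nothing matches, calling a method such as start() directly on the result can raise an AttributeError. A minimal sketch of guarding against that follows; the helper name describe() is made up for this illustration and is not part of the re module:

```python
import re

def describe(pattern, string):
    """Report where `pattern` first occurs in `string` (hypothetical helper)."""
    match = re.search(pattern, string)
    # search() returns None when the pattern is not found,
    # so check before using the Match object.
    if match is None:
        return "'{}' not found".format(pattern)
    return "'{}' found at {}-{}".format(pattern, match.start(), match.end())

print(describe("all", "We all stand together!"))
print(describe("none", "We all stand together!"))
```

The first call reports the position of "all", the second falls into the None branch.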

In order to get more info from the returned Match object, we can use its two methods start() and end(), which return the indexes in the input string where the match starts and ends. The Match object also exposes the input string via its string attribute, and the pattern that was used via re.pattern.

print(match_search.start())
print(match_search.end())
print(match_search.string)
print(match_search.re.pattern)
3
6
We all stand together!
all

Nota bene: please remember that the end index is exclusive, exactly as in slicing: it points to the first character after the match.
I.e.:

print(match_search.string[match_search.start():match_search.end()])
all
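
As a small addition to the attributes shown above, a Match object also has a group() method, which returns the matched text directly, and a span() method, which returns the start and end indexes together as a tuple. A short sketch, reusing the same pattern and input string:

```python
import re

match = re.search("all", "We all stand together!")

matched_text = match.group()  # the text that was matched
start, end = match.span()     # same values as match.start() and match.end()

print(matched_text, start, end)
```

With group() there is no need to slice the input string by hand.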

The finditer() function, for returning multiple Match objects

It could well be that more than one substring matches a regex pattern within an input string. The finditer() function returns an iterator of Match objects. Please observe the following example:

pattern = "is"
string = "It is what it is, isn't it?"

for match in re.finditer(pattern, string):
    start_index = match.start()
    end_index = match.end()
    print("Start: {}, End: {}".format(start_index, end_index))
Start: 3, End: 5
Start: 14, End: 16
Start: 18, End: 20
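
If you'd rather collect the positions than print them one by one, the same loop can be written as a list comprehension. This is just a sketch of one way to do it, using the span() method of each Match object:

```python
import re

pattern = "is"
string = "It is what it is, isn't it?"

# Each Match object yields a (start, end) tuple via span().
spans = [match.span() for match in re.finditer(pattern, string)]
print(spans)
```

The resulting list holds one (start, end) tuple per match, in the order the matches appear in the input string.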

The findall() function, for returning matches as a list

In case you're not particularly interested in getting Match objects back from your regular expression, but would rather have a list of the matched strings returned instead, then use the findall() function:

pattern = "la"
string = "Ooh la la la, it's the way that we rock when we're doing our thing"

matches = re.findall(pattern, string)
print(type(matches), len(matches), "returned:", matches)
<class 'list'> 3 returned: ['la', 'la', 'la']
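
For a fixed pattern like this one, the length of the list returned by findall() is simply the number of occurrences. The sketch below compares it with the count() string method; note that the two only agree this neatly because the pattern here is a plain literal string without any special regex characters:

```python
import re

pattern = "la"
string = "Ooh la la la, it's the way that we rock when we're doing our thing"

regex_count = len(re.findall(pattern, string))  # count via the re module
plain_count = string.count(pattern)             # count via the str method

print(regex_count, plain_count)
```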

Modifying strings via regular expressions

The sub() function, for search and replace

After finding one or more matches on a pattern, you might want to replace them with something else. This can be done using the sub() function (short for substitute). As its required arguments, sub() accepts a pattern to match, a replacement string, and of course the input string to search & replace on. By default every match is replaced, and a new string is returned; the input string itself is not modified.

Please observe the following, simple, example:

pattern = "dogs"
replacement = "cows"
string = "It's raining cats and dogs"

new_string = re.sub(pattern, replacement, string)
print(new_string)
It's raining cats and cows
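
By default sub() replaces every occurrence of the pattern. As a short sketch of two related options: the optional count argument limits the number of replacements, and the subn() function works like sub() but also tells you how many replacements were made. The input string here is made up for the example:

```python
import re

pattern = "la"
string = "Ooh la la la"

# Replace only the first occurrence.
first_only = re.sub(pattern, "da", string, count=1)

# subn() returns a tuple: (new string, number of replacements made).
new_string, replacements = re.subn(pattern, "da", string)

print(first_only)
print(new_string, replacements)
```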

The split() function

Again in Learn Python Series (#3) - Handling Strings Part 2 we first talked about the split() method (on strings). The re module also has a split() function, which looks for a pattern match and in case it finds one (or more) splits the input string at that point into multiple list elements. A list is returned from calling split().

pattern = " "
string = "This is an example sentence"

result = re.split(pattern, string)
print(type(result), result)
<class 'list'> ['This', 'is', 'an', 'example', 'sentence']
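
With a fixed pattern such as a single space, re.split() gives the same result as the split() string method called with that same separator. A brief sketch of that comparison, together with the optional maxsplit argument, which limits how many splits are performed:

```python
import re

pattern = " "
string = "This is an example sentence"

regex_parts = re.split(pattern, string)
plain_parts = string.split(pattern)  # str method, same fixed separator

# Split at most twice; the remainder stays together as the last element.
limited = re.split(pattern, string, maxsplit=2)

print(regex_parts == plain_parts)
print(limited)
```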

Compiling regular expressions

The compile() function

The re module allows for working with regular expressions in the form of strings, just like we did until now. Yet it's also possible, for faster processing, to first compile the regular expression string and convert it into a compiled pattern object. This can be done using the compile() function.

pattern = "comes"
string = "Here it comes again, that feeling!"

regex = re.compile(pattern)
print(type(regex), regex)
<class '_sre.SRE_Pattern'> re.compile('comes')

Since we've now compiled the (very simple) regular expression, we can use - for example - the search() method of the compiled pattern object, which requires only one argument: the input string:

match = regex.search(string)
print(match)
<_sre.SRE_Match object; span=(8, 13), match='comes'>

Nota bene: precompiling a regular expression can be faster, because the compile step is done once, when the application (your Python script containing the compile() call) is started, instead of - for example - each time the application needs to react to user input.
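
To illustrate the reuse, here is a small sketch in which one compiled pattern is applied to several input strings. The list of lines is made up for the example. Besides search(), the compiled pattern object offers methods such as sub() that mirror the module-level functions, minus the pattern argument:

```python
import re

regex = re.compile("comes")  # compile once

# Hypothetical input lines, just for illustration.
lines = [
    "Here it comes again, that feeling!",
    "Nothing to see here",
    "What comes next?",
]

# Reuse the same compiled pattern for every line.
hits = [line for line in lines if regex.search(line)]
replaced = regex.sub("goes", lines[0])

print(hits)
print(replaced)
```

Keep in mind that the re module also caches recently used patterns internally, so for scripts that use only a handful of patterns the speed difference may well be small. Giving a pattern a name and reusing it tends to keep the code readable either way.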

What did we learn, hopefully?

This Part 1 episode was intended as an introduction to the re module, which is part of Python's standard library and enables you to write and execute (very) advanced pattern matching via regular expressions. But since regular expressions can be regarded as a "mini programming language", and understanding their syntax and mechanisms is of utmost importance in order to read, let alone write, complex regular expressions, in this episode we "only" talked about how to use re functions such as match(), search(), findall(), finditer(), split(), sub() and compile().

Hopefully the way I "introduced" regular expressions didn't scare you away from trying to learn more about them! I deliberately labeled this Part 1 episode as Difficulty: Basic. Part 2 will be Difficulty: Intermediate, and from there on it's... Difficulty: Fun!!

Come and find out! See you there!

Thank you for your time!


Interesting these 2 posts from @nomannomi: "it give us a new lesson I like it" but then "No dear I dont know about coding". Makes you think how sincere these comments are...

Anyway, thank you for your Python series @scipio. There are several resources available on the web but yours is really comprehensive with good examples. Much appreciated.

Indeed! ;-)

It's my book, the entirety of all episodes (well, I might alter a few things here in there in case I'd want to publish it). And I'm doing my best to explain things differently, beginner-friendly, with lots of examples, which tend to get more complex as the series evolves. In this episode (part 1 of regex) I deliberately used fixed strings as patterns, not yet digging into the regex mechanisms themselves.

But watch out for part 2! Where the real fun begins! :-)

Starting simple and building it up keeps a lot of people engaged (including myself) and will return to read the next post.

Looking forward for part 2 and if you ever publish your work, I'll buy it! :-)

Well, if you look down a comment or two, since the upvote opportunity has passed, I just gave the answer myself :P (Plus use it for Part 2 :P )

Just saw it. Nice extra example for this post and funny as well :-))

Wao great post dear about pythen your post give us a new lesson I like it dear beautifull we can learn alot of things about pythan

Maybe you could use some regular expression knowledge for parsing pythen, or pythan, and substitute it with python! Can you give me the code for that as a comment? I'll upvote your answer if you're correct! :-)

No dear I dont know about coding but my younger brother can do it because it is his field .

Ok then! Mind if I use your comment to give the answer myself in Part 2 then? :-)

import re

pattern = "pyth[ae]n"
replace = "Python"
string = """Wao great post dear about pythen
your post give us a new lesson I like it dear beautifull
we can learn alot of things about pythan"""

result = re.sub(pattern, replace, string)
print(result)
Wao great post dear about Python
your post give us a new lesson I like it dear beautifull
we can learn alot of things about Python

Hmm good I will try it .

Will you really???
