Matching and Searching Strings with regex in Python
What Will I Learn?
This tutorial covers the following topics:
- Regex concept
- Regex patterns and their matches
- Searching vs Matching
- Replacing patterns
Requirements
- A PC/laptop with any operating system, such as Linux, Mac OSX, or Windows
- Python installed
- A code editor such as Atom, Sublime Text, or the PyCharm IDE
Note: This tutorial was carried out in the PyCharm IDE on a laptop running 64-bit Ubuntu 17.1.
Difficulty
Intermediate. I recommend learning the basics of Python programming before starting this tutorial. Links to the previous tutorials are at the bottom of this tutorial.
Tutorial Contents
What is regex?
A regular expression, or regex, is a sequence of characters that defines a search pattern. We use it to search for and find strings that fit that pattern.
In Python, we usually write regular expressions as raw string literals instead of regular Python strings. Raw strings start with the 'r' prefix.
Why use raw string literals?
Because in a raw string, Python does not process backslash escape sequences: every backslash stays in the string as a literal character and reaches the regex engine unchanged. Regular expressions contain lots of backslashes, so without the raw string form we would have to escape each one by doubling it, which makes patterns hard to read.
For example, "\n" and r"\n" are not the same. "\n" is a single newline character, whereas r"\n" is two characters: a backslash followed by the letter n.
print('\t Programming Hub')
Output:
Programming Hub
Using raw string literals.
print(r'\t Programming Hub')
Output:
\t Programming Hub
In the code above, we can clearly see that '\t' is not interpreted as a tab when the raw string form is used.
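To see why this matters for regex in particular, here is a minimal sketch. The variable names and the sample text are my own, picked just for illustration. It compares the length of a regular and a raw string, and then shows what happens to the word boundary pattern \b when the 'r' prefix is left out.

```python
import re

regular = "\n"   # one character: a newline
raw = r"\n"      # two characters: a backslash followed by 'n'

print(len(regular))   # 1
print(len(raw))       # 2
print(raw == "\\n")   # True: the raw form equals the manually escaped form

text = "Learn python programming"

# Without the r prefix, "\b" becomes a backspace character, not a word boundary,
# so the pattern looks for backspaces around "python" and finds nothing.
print(re.search("\bpython\b", text))            # None

# With the r prefix, the regex engine receives \b and treats it as a word boundary.
print(re.search(r"\bpython\b", text).group())   # python
```

So the raw string is not a different kind of pattern. It is only a convenient way of writing the backslashes that the regex engine expects.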
Regular Expression Patterns and Their Matches
| Pattern | Matches |
|---|---|
| \d | Any Digit |
| \D | Any Non-digit character |
| . | Any Character |
| \. | Period |
| [abc] | Only a, b, or c |
| [^abc] | Not a, b, nor c |
| [a-z] | Characters a to z |
| [0-9] | Numbers 0 to 9 |
| \w | Any Alphanumeric character or underscore |
| \W | Any character that is not alphanumeric or underscore |
| \b | word boundary |
| \B | Not a word boundary |
| {m} | m Repetitions |
| * | Zero or more repetitions |
| + | One or more repetitions |
| ? | Zero or one repetition (optional character) |
| \s | Any Whitespace (space, tab or new line) |
| \S | Any Non-whitespace character |
| ^ | Start of a string |
| $ | End of a string |
| (…) | Capture Group |
| (a(bc)) | Capture Sub-group |
| (.*) | Capture all |
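A few of the patterns from the table can be tried out as follows. This is only a rough sketch: the sample text is made up for the example, and it uses re.findall(), which returns all non-overlapping matches as a list, alongside re.search().

```python
import re

# Hypothetical sample text, chosen only to exercise a few patterns from the table.
sample = "Order 66 shipped to Wall street 19 on 2018-01-05."

# \d+ : one or more digits in a row
numbers = re.findall(r"\d+", sample)
print(numbers)       # ['66', '19', '2018', '01', '05']

# \b[A-Z]\w* : a word that starts with a capital letter
capitalized = re.findall(r"\b[A-Z]\w*", sample)
print(capitalized)   # ['Order', 'Wall']

# (\d{4})-(\d{2})-(\d{2}) : three capture groups with exact repetition counts
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", sample)
print(date.group())    # 2018-01-05
print(date.groups())   # ('2018', '01', '05')

# (a(bc)) : the outer group captures 'abc', the nested sub-group captures 'bc'
nested = re.search(r"(a(bc))", "xabcx")
print(nested.groups())   # ('abc', 'bc')

# colou?r : the 'u' is optional
spellings = re.findall(r"colou?r", "color colour")
print(spellings)     # ['color', 'colour']
```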
re module
re is a module in Python's standard library that provides regular expression matching operations.
Matching vs Searching
Matching a string
The re module provides various methods to match a string against a specified pattern. We can use re.match() to match a regex pattern at the beginning of a string, and re.search() to search for the first occurrence of a regex pattern anywhere within the given string.
To match a string, we start by importing the re module into our program.
import re
Now we will declare a statement on which we will perform the match operation.
statement = "Learn python programming with us"
Now we will define a variable that holds the returned match object; mobj is that variable here. We call the re.match method to match the pattern against the string. It takes two compulsory arguments: the regex pattern and the string (or a variable that holds the string).
mobj = re.match( r'Learn', statement)
Now, using an if/else statement, we print the matched word. group() returns the matched text from the statement. If there is no match, re.match returns None, so the else branch runs.
if mobj:
    print("Matched word : ", mobj.group())
else:
    print("No match found!")
If we run the above code, the output will be:
Matched word : Learn
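Besides group(), a match object also tells us where the match was found. The small sketch below repeats the same match and additionally calls span(), start() and end(), which this tutorial does not use elsewhere.

```python
import re

statement = "Learn python programming with us"
mobj = re.match(r"Learn", statement)

if mobj:
    print(mobj.group())              # Learn
    print(mobj.span())               # (0, 5): start and end index of the match
    print(mobj.start(), mobj.end())  # 0 5
else:
    print("No match found!")
```

The end index is exclusive, so statement[0:5] gives back the matched word.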
Searching String
Searching is similar to matching. We use re.search() to search for a pattern in a string. We will search the previously defined statement.
sobj = re.search( r'python', statement)
if sobj:
    print("Searched word : ", sobj.group())
else:
    print("Nothing found!")
Output:
Searched word : python
Now we will match and search for the same word and compare the output:
import re
statement = "Learn python programming with us"
# re.match only looks at the beginning of the string
mobj = re.match( r'python', statement)
if mobj:
    print("Matched word : ", mobj.group())
else:
    print("No match found!")
# re.search scans the whole string
sobj = re.search( r'python', statement)
if sobj:
    print("Searched word : ", sobj.group())
else:
    print("Nothing found!")
Output:
No match found!
Searched word : python
In the code above, we matched and searched for the same word, python, but re.match couldn't find it while re.search did. This is the main difference between matching and searching. Matching looks for a match only at the beginning of the string, whereas searching scans the whole string and returns the first match it finds.
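The difference can also be put side by side with a small helper. Note that compare() is a hypothetical function written only for this illustration; it is not part of the re module.

```python
import re

def compare(pattern, text):
    """Return (match result, search result) as matched text, or None if nothing matched."""
    m = re.match(pattern, text)
    s = re.search(pattern, text)
    return (m.group() if m else None, s.group() if s else None)

statement = "Learn python programming with us"

print(compare(r"Learn", statement))    # ('Learn', 'Learn')
print(compare(r"python", statement))   # (None, 'python')

# Anchoring the pattern with ^ makes re.search look only at the start as well.
print(compare(r"^python", statement))  # (None, None)
```

Both functions agree when the word sits at the start of the string. They disagree only when the match is further in.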
Replacing patterns
Now we will replace the parts of a string that match a regex pattern. We use the re.sub method for this. This method takes three compulsory arguments: pattern, repl and string. Here repl is the replacement.
We will start by importing the re module and defining a string.
import re
address = "Wall street 19, New York"
Now we will search the string for digits and print the address with them removed.
add = re.sub(r'\d', "", address)
print("Address without digit: ", add)
The add variable holds the value of address after the digits have been removed.
Output:
Address without digit: Wall street , New York
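Notice the space left behind before the comma in the output. As a rough sketch of a few variations (the variable names are my own), we can widen the pattern to remove that space too, replace digits with another character, or limit the number of replacements with the optional count argument of re.sub.

```python
import re

address = "Wall street 19, New York"

# \s*\d+ also swallows the whitespace in front of the number
cleaned = re.sub(r"\s*\d+", "", address)
print(cleaned)    # Wall street, New York

# Replace every digit with '#' instead of removing it
masked = re.sub(r"\d", "#", address)
print(masked)     # Wall street ##, New York

# count=1 replaces only the first match
first_only = re.sub(r"\d", "#", address, count=1)
print(first_only) # Wall street #9, New York
```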
All of the code above, including the code from the previous tutorials, is available in my GitHub repo. Click here to download.
For more details please visit Python Docs.
Curriculum
Python tutorials for beginners : Part - I
Python tutorials for beginners : Part - II
Python tutorials for beginners : Part - III
Reading and writing to files in python
Posted on Utopian.io - Rewarding Open Source Contributors