Day 10 of 101 Days of DevOps: Regular Expressions
Welcome to Day 10 of 101 Days of DevOps. Today's topic is regular expressions. First we will look at regular expressions in general terms, and then at how to use them with Python.
What is a Regular Expression?
It's a pattern-matching language
OR
Regular expressions are specially encoded text strings used as patterns for matching sets of strings
OR
A sequence of characters that defines a search pattern
OR
You can think of it as a mini-language for specifying text patterns
This is what a regular expression looks like
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
- To break this down
* \d : matches a digit 0-9
* {1,3} : repeats the preceding pattern 1 to 3 times
* \. : matches a literal dot; an unescaped . is a wildcard that matches any character, so we need to escape it
The above expression matches anything shaped like an IPv4 address
eg:
192.168.0.1
127.0.0.1
Let's try the concept we have just learned with the grep command
- I have a file called myipaddress
# cat myipaddress
192.168.0.1
172.16.0.2
10.0.0.3
- To grep the IP addresses from this file
# grep -P '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}' myipaddress
192.168.0.1
172.16.0.2
10.0.0.3
Where -P means
-P, --perl-regexp
    Interpret PATTERN as a Perl regular expression. This is highly experimental and grep -P may warn of unimplemented features.
- If you try to use -E instead
-E, --extended-regexp
    Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
- Execute it
# grep -E '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}' myipaddress
#
- This doesn't return anything because extended regular expressions do not use \d to refer to digits; they use the [[:digit:]] character class (or [0-9]) instead
# grep -E '[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}' myipaddress
192.168.0.1
172.16.0.2
10.0.0.3
NOTE: The above regular expression also matches invalid IP addresses. If I add an invalid address such as
999.999.999.999
to the myipaddress file and run the regex again, it will match.
# grep -P '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}' myipaddress
999.999.999.999
- Since we are just starting our regex journey I don't want to begin with a complex regex, but a more precise regex to match an IP address is
"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
- Try the above regex; it matches only valid IPv4 addresses, where every octet is between 0 and 255. With grep it needs -P, because (?:...) is Perl syntax.
Let's use regular expressions with Python. To do that you need the re module.
import re
To find everything in a string that matches a pattern, the syntax is
match = re.findall(pattern,string)
Let's use the same example to match an IP address, but this time using the re module
import re
string = "192.168.0.1"
# Raw string (r"...") so that Python passes the backslashes to the regex engine unchanged
pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"
match = re.findall(pattern, string)
print(match)
Now save this to a file called regex_match.py and execute it. As you can see, the output is a list.
python3 regex_match.py
['192.168.0.1']
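Since findall() returns every non-overlapping match, the same call also works on text that contains several addresses. The following is a minimal sketch, not a definitive implementation: the sample text and the helper name find_valid_ips are made up for illustration, and it reuses the stricter octet pattern shown earlier to drop candidates such as 999.999.999.999.

```python
import re

# Loose pattern: four groups of 1-3 digits separated by dots.
LOOSE_IP = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

# Strict pattern: every octet must be between 0 and 255.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
STRICT_IP = re.compile(r"^(?:" + OCTET + r"\.){3}" + OCTET + r"$")


def find_valid_ips(text):
    """Return IPv4-looking strings in text whose octets are all 0-255 (hypothetical helper)."""
    candidates = LOOSE_IP.findall(text)
    return [ip for ip in candidates if STRICT_IP.match(ip)]


# Illustrative sample mirroring the myipaddress file used earlier.
sample = "192.168.0.1\n172.16.0.2\n10.0.0.3\n999.999.999.999\n"
print(LOOSE_IP.findall(sample))
print(find_valid_ips(sample))
```

The loose pattern finds all four strings, while the helper keeps only the three whose octets are in range.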
- So far I have only shown examples with digits. To match words you need to use \w, OR you can use ranges
\w, \W: ANY ONE word/non-word character. For ASCII, word characters are [a-zA-Z0-9_]
Ranges look like this: [A-Z] or [a-z]
- Let's take a simple example where I need to match a word character, a special character, and a digit. The file test contains
A-1
B-2
E-3
- To do that
# grep -P '\w\-\d' test
A-1
B-2
E-3
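The same idea can be tried from Python. This is only a sketch: the sample string mirrors the test file above with one extra, made-up line (_-4), and the second pattern swaps \w for an explicit [A-Z] range to show the difference between the two.

```python
import re

# Contents modelled on the test file, plus one hypothetical extra line.
sample = "A-1\nB-2\nE-3\n_-4\n"

# \w matches letters, digits and underscore, so "_-4" is matched too.
word_matches = re.findall(r"\w-\d", sample)
print(word_matches)

# An explicit range restricts the first character to uppercase letters.
range_matches = re.findall(r"[A-Z]-\d", sample)
print(range_matches)
```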
Whitespace
- \s, \S: ANY ONE space/non-space character. For ASCII, whitespace characters are
[ \n\r\t\f]
Some more pattern types
\w : any single word character [a-zA-Z0-9_] (letter, digit or underscore)
\d : any numeric digit [0-9]
\s : whitespace characters (space, newline, tab)
\D : any character that is NOT a numeric digit
\W : any character that is NOT a letter, digit or underscore
\S : any character that is NOT a space, tab or newline
Position anchors: these do not match a character, but a position such as start-of-line or end-of-word
- Let's say I want to search for the root user in the /etc/passwd file
# grep root /etc/passwd
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
- This isn't quite what we want: the operator line also matches, because root appears in its home directory. Anchor the pattern to the start of the line instead
# grep ^root /etc/passwd
root:x:0:0:root:/root:/bin/bash
- ^: start-of-line
- In a similar way we can use $, which denotes the end of the line
# grep "bash$" /etc/passwd
root:x:0:0:root:/root:/bin/bash
centos:x:1000:1000:Cloud User:/home/centos:/bin/bash
testuser:x:1001:1001::/home/testuser:/bin/bash
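For comparison, here is a rough Python sketch of the same two anchors. The passwd text is a hard-coded sample modelled on the output above rather than the real file, and the variable names are illustrative. Note that in Python ^ and $ only match at every line when re.MULTILINE is passed; grep works line by line, so it does not need an equivalent switch.

```python
import re

# Sample lines modelled on the /etc/passwd output above.
passwd = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "operator:x:11:0:operator:/root:/sbin/nologin\n"
    "centos:x:1000:1000:Cloud User:/home/centos:/bin/bash\n"
)

# re.MULTILINE makes ^ and $ match at the start and end of every line,
# not only at the start and end of the whole string.
starts_with_root = re.findall(r"^root.*$", passwd, re.MULTILINE)
ends_with_bash = re.findall(r"^.*bash$", passwd, re.MULTILINE)

print(starts_with_root)
print(ends_with_bash)
```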
- Let's look at a couple of common problems
Scenario 1: Delete all the blank lines in a file
# cat error_log
[Fri Apr 05 04:21:33.481424 2019] [suexec:notice] [pid 7698] AH01232: suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)

[Fri Apr 05 04:21:33.491890 2019] [auth_digest:notice] [pid 7698] AH01757: generating secret for digest authentication ...

[Fri Apr 05 04:21:33.492419 2019] [lbmethod_heartbeat:notice] [pid 7698] AH02282: No slotmem from mod_heartmonitor

- As you can see, there is a blank line after each entry
- To fix this we can use sed in combination with what we have learned today
# sed -i -r '/^\s*$/d' /var/log/httpd/error_log
- The way sed generally works
sed 's/find/replace/g' <filename>
* sed is a Unix utility that parses and transforms text
* -i : edit files in place
* -r : use extended regular expressions in the script
* /^\s*$/ : select lines that are empty or contain only whitespace
* d : signifies that we want to delete these lines
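The sed one-liner can be approximated in Python with the same ^\s*$ pattern. This is a minimal sketch under stated assumptions: the log text is a shortened, hard-coded stand-in for the error_log entries above, and it works on a string in memory rather than editing a file in place.

```python
import re

# Shortened, illustrative stand-in for the error_log contents.
log_text = (
    "[pid 7698] AH01232: suEXEC mechanism enabled\n"
    "\n"
    "[pid 7698] AH01757: generating secret for digest authentication ...\n"
    "   \n"
    "[pid 7698] AH02282: No slotmem from mod_heartmonitor\n"
)

# Keep only lines that are not empty or whitespace-only,
# i.e. lines that do not match ^\s*$.
blank_line = re.compile(r"^\s*$")
cleaned = "\n".join(
    line for line in log_text.splitlines() if not blank_line.match(line)
)
print(cleaned)
```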
Scenario 2: Look for a specific word in the file
# grep -i error error_log
[Fri Apr 05 04:21:33.494401 2019] [core:notice] [pid 7698] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND' myerror
[Fri Apr 05 04:22:05.502364 2019] [autoindex:error] [pid 7702] [client 204.14.239.17:65129] AH01276: Cannot serve directory /var/www/html/: No matching DirectoryIndex (index.html) found, and server-generated directory index forbidden by Options directive errorlog
[Fri Apr 05 04:22:29.747570 2019] [autoindex:error] [pid 7701] [client 70.42.131.189:15077] AH01276: Cannot serve directory /var/www/html/: No matching DirectoryIndex (index.html) found, and server-generated directory index forbidden by Options directive
- As you can see, I am looking for the word error, but grep returns every line that contains error anywhere, including as part of another word (i.e. errorlog and myerror)
# grep -P '\berror\b' error_log
[Fri Apr 05 04:22:05.502364 2019] [autoindex:error] [pid 7702] [client 204.14.239.17:65129] AH01276: Cannot serve directory /var/www/html/: No matching DirectoryIndex (index.html) found, and server-generated directory index forbidden by Options directive errorlog
[Fri Apr 05 04:22:29.747570 2019] [autoindex:error] [pid 7701] [client 70.42.131.189:15077] AH01276: Cannot serve directory /var/www/html/: No matching DirectoryIndex (index.html) found, and server-generated directory index forbidden by Options directive
- \b: the boundary of a word, i.e., start-of-word or end-of-word
NOTE: Word boundary syntax is not the same in every tool. In vim you need to use angle brackets, so the expression looks like this :/\<error\>
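The same whole-word search can be sketched in Python, where \b also marks a word boundary. The lines below are shortened versions of the log entries above and the variable names are illustrative.

```python
import re

# Shortened, illustrative versions of the log lines shown above.
lines = [
    "Command line: '/usr/sbin/httpd -D FOREGROUND' myerror",
    "[autoindex:error] Cannot serve directory /var/www/html/ errorlog",
    "[autoindex:error] Cannot serve directory /var/www/html/",
]

# \berror\b matches "error" only when it stands alone as a word,
# so "myerror" and "errorlog" on their own do not count.
whole_word = re.compile(r"\berror\b")
matching = [line for line in lines if whole_word.search(line)]

for line in matching:
    print(line)
```

The first line is skipped because its only occurrence of error is inside myerror; the other two match through [autoindex:error].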
Complete list
+ : 1 or more
* : 0 or more
? : 0 or 1
{k} : exactly k occurrences
{m,n} : m to n occurrences, inclusive
. : matches any character except the newline (\n)
^ : start of the string
$ : end of the string
\ : escape character
Example
# The re module has all the regular expression functions in it
>>> import re
>>> example = "Welcome to the world of Python"
>>> pattern = r'Python'
>>> match = re.search(pattern,example)
>>> print(match)
<_sre.SRE_Match object; span=(24, 30), match='Python'>
>>> if match:
...     print("found", match.group())
... else:
...     print("No match found")
...
found Python
NOTE: The r prefix marks a raw string. Regexes use backslashes heavily (\w, \d), so patterns are usually written as raw strings (r'\d') to stop Python from interpreting the backslashes itself
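To see why the raw string matters, here is a small sketch using \b, which means backspace in an ordinary Python string but word boundary in a regex. The sample text and variable names are made up for illustration.

```python
import re

# In a normal string "\b" is the backspace character (one character);
# in a raw string r"\b" is a backslash followed by "b" (two characters).
print(len("\b"), len(r"\b"))

text = "an error occurred"

# Without the r prefix the pattern contains literal backspace characters,
# which do not occur in the text.
backspace_matches = re.findall("\berror\b", text)

# With the r prefix the regex engine receives \b and treats it as a word boundary.
boundary_matches = re.findall(r"\berror\b", text)

print(backspace_matches)
print(boundary_matches)
```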
The most popular example is finding a phone number :-)
>>> import re
>>> message = "my number is 510-123-4567"
# Here we create a regex object, which defines the pattern we are looking for
>>> myregex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
# Then we try to find the pattern in the string
>>> match = myregex.search(message)
# This gives us the actual matched text
>>> print(match.group())
510-123-4567
In case we have multiple phone numbers, use findall()
>>> import re
>>> message = "my number is 510-123-4567 and my office number is 510-987-1234"
>>> myregex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
# Find every occurrence of the pattern in the string and return them as a list
>>> print(myregex.findall(message))
['510-123-4567', '510-987-1234']
Let's use groups to separate the area code from the rest of the phone number. Here parentheses have a special meaning: they mark where a group starts and where it ends.
>>> import re
>>> myregex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> match = myregex.search("My number is 510-123-4567")
>>> match
<_sre.SRE_Match object; span=(13, 25), match='510-123-4567'>
# This returns the full matching string
>>> match.group()
'510-123-4567'
# Returns only the first group (the area code)
>>> match.group(1)
'510'
# Returns the second group (the rest of the phone number)
>>> match.group(2)
'123-4567'
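When a pattern has several groups, the match object's groups() method returns all of them at once as a tuple, which can be unpacked directly. A short sketch with the same phone pattern follows; the variable names area_code and local_number are illustrative.

```python
import re

phone_regex = re.compile(r"(\d\d\d)-(\d\d\d-\d\d\d\d)")
match = phone_regex.search("My number is 510-123-4567")

if match:
    # groups() returns every captured group as a tuple, in order.
    area_code, local_number = match.groups()
    print(area_code)
    print(local_number)
```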
To match literal parentheses in a string, we need to escape them with a backslash: \( and \)
>>> myregex = re.compile(r'\(\d\d\d\)-(\d\d\d-\d\d\d\d)')
>>> match = myregex.search("My number is (510)-123-4567")
>>> match.group()
'(510)-123-4567'
The pipe character (|) matches one of several alternatives
>>> lang = re.compile(r'Pyt(hon|con|mon)')
>>> match = lang.search("Python is a wonderful language")
>>> match.group()
'Python'
>>> match = lang.search("Pytcon is a wonderful language")
>>> match.group()
'Pytcon'
>>> match = lang.search("Pytmon is a wonderful language")
>>> match.group()
'Pytmon'
If the regular expression cannot find the pattern, search() returns None. To verify that
>>> match = lang.search("Pytut is a wonderful language")
>>> match is None
True
? : zero or one time
>>> import re
# Here ho is optional: it may occur zero times or one time
>>> myexpr = re.compile(r'Pyt(ho)?n')
>>> match = myexpr.search("Python a wonderful language")
>>> match.group()
'Python'
>>> match = myexpr.search("Pytn a wonderful language")
>>> match.group()
'Pytn'
If ho appears more than once, as in the next string, the match fails
>>> match = myexpr.search("Pythohon a wonderful language")
>>> match.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> match is None
True
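Calling group() on None is what raises the AttributeError in the transcript above. One way to avoid it is to check the result of search() before using it. The helper name first_match below is hypothetical, and this is just one possible way to wrap the check.

```python
import re

myexpr = re.compile(r"Pyt(ho)?n")


def first_match(pattern, text, default=None):
    """Return the matched text, or default when the pattern is not found."""
    match = pattern.search(text)
    return match.group() if match else default


print(first_match(myexpr, "Python a wonderful language"))
print(first_match(myexpr, "Pytn a wonderful language"))
print(first_match(myexpr, "Pythohon a wonderful language"))
```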
In the same way, in our earlier phone number example we can make the area code optional
>>> myphone = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
>>> match = myphone.search("My phone number is 123-4567")
>>> match.group()
'123-4567'
"*" : zero or more times
>>> import re
>>> myexpr = re.compile(r'Pyth(on)*')
>>> match = myexpr.search("Welcome to the world of Pythononon")
>>> match.group()
'Pythononon'
"+" : must appear at least once (one or more times)
>>> myexpr = re.compile(r'Pyth(on)+')
>>> match = myexpr.search("Welcome to the world of Pyth")
>>> match.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> match = myexpr.search("Welcome to the world of Python")
>>> match.group()
'Python'
>>> match = myexpr.search("Welcome to the world of Pythonononon")
>>> match.group()
'Pythonononon'
Now if we want to match a specific number of times
>>> myregex = re.compile(r'(Re){3}')
>>> match = myregex.search("My matching string is ReReRe")
>>> match.group()
'ReReRe'
# Range of repetitions
>>> myregex = re.compile(r'(Re){3,5}')
>>> match = myregex.search("My matching string is ReReReRe")
>>> match.group()
'ReReReRe'
Regular expressions in Python are greedy by default, i.e. they try to match the longest possible string
# Instead of stopping at the minimum (the first 3 digits) it matches the first 5
>>> mydigit = re.compile(r'(\d){3,5}')
>>> match = mydigit.search('123456789')
>>> match.group()
'12345'
To do a non-greedy match, put a question mark after the curly braces; the pattern then matches the shortest string possible
>>> mydigit = re.compile(r'(\d){3,5}?')
>>> match = mydigit.search('123456789')
>>> match.group()
'123'
Let's take a look at a few more examples that involve character classes
\w : any single word character [a-zA-Z0-9_]
\d : any numeric digit [0-9]
\s : whitespace characters (space, newline, tab)
Let's say I need to match this address
>>> import re
>>> address = "123 fremont street"
>>> match = re.compile(r'\d+\s\w+\s\w+')
>>> match.findall(address)
['123 fremont street']
We can create our own character class
# Let's create our own character class that matches all lower-case vowels
>>> myregex = re.compile(r'[aeiou]')  # To match upper case as well: r'[aeiouAEIOU]'
>>> mypat = "Welcome to the world of Python"
>>> myregex.findall(mypat)
['e', 'o', 'e', 'o', 'e', 'o', 'o', 'o']
Now if we want to match two vowels in a row
>>> myregex = re.compile(r'[aeiouAEIOU]{2}')
>>> mypat = "Welcome to the world of Python ae"
>>> myregex.findall(mypat)
['ae']
Negative character class (a leading ^ inside the brackets means match everything except the listed characters, here the vowels)
>>> myregex = re.compile(r'[^aeiouAEIOU]')
>>> mypat = "Welcome to the world of Python ae"
>>> myregex.findall(mypat)
['W', 'l', 'c', 'm', ' ', 't', ' ', 't', 'h', ' ', 'w', 'r', 'l', 'd', ' ', 'f', ' ', 'P', 'y', 't', 'h', 'n', ' ']
Let's take a look at the dot (. : matches any character except the newline (\n))
>>> myregex = re.compile(r'.x')
>>> mypat = "Linux Unix Minix"
>>> myregex.findall(mypat)
['ux', 'ix', 'ix']
The dot is mostly used together with *
* : 0 or more
Now if we change our regex to include both
>>> myregex = re.compile(r'.*x')
>>> mypat = "Linux Unix Minix"
>>> myregex.findall(mypat)
['Linux Unix Minix']
NOTE
.* : always performs a greedy match (of anything except a newline)
.*? : add ? to make it non-greedy
Let's look at this with the help of an example
>>> mystr = '"Welcome to the world of Python" great language to learn"'
>>> mypat = re.compile(r'"(.*?)"')
# Because of its non-greedy nature it stops at the first closing " it encounters
>>> mypat.findall(mystr)
['Welcome to the world of Python']
But in the case of a greedy match
>>> mypat = re.compile(r'"(.*)"')
# It matches everything up to the last " in the string
>>> mypat.findall(mystr)
['Welcome to the world of Python" great language to learn']
Now, as we mentioned above, .* matches everything except a newline
>>> myexpr = "Welcome to the \n world of \n Python"
>>> print(myexpr)
Welcome to the
 world of
 Python
>>> mypat = re.compile(r'(.*)')
>>> mypat.search(myexpr)
<_sre.SRE_Match object; span=(0, 15), match='Welcome to the '>
If we want the dot to match newlines as well, pass re.DOTALL as the second argument
>>> mypat = re.compile(r'.*',re.DOTALL)
>>> mypat.search(myexpr)
<_sre.SRE_Match object; span=(0, 34), match='Welcome to the \n world of \n Python'>
The second argument to re.compile() is really useful, especially if we want to perform a case-insensitive search (re.I)
>>> import re
>>> mystr = "Why Linux Is Such An Awesome Platform"
>>> mypat = re.compile(r'[aeiou]',re.I)
>>> mypat.findall(mystr)
['i', 'u', 'I', 'u', 'A', 'A', 'e', 'o', 'e', 'a', 'o']
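Flags can also be combined with the bitwise OR operator when more than one is needed. The sketch below pairs re.I with re.M (multi-line mode) on a made-up three-line string; the text and variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample text with mixed-case prefixes.
text = "Error: disk full\nwarning: low memory\nERROR: out of inodes\n"

# re.I ignores case; re.M lets ^ and $ match at the start and end of every line.
pattern = re.compile(r"^error:.*$", re.I | re.M)
error_lines = pattern.findall(text)
print(error_lines)
```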
Some of my favorite websites to make your regular expression life easy
Regular expressions are really powerful, and in this blog we have barely scratched the surface. Let's solidify your understanding of regular expressions with the help of an assignment.
Assignment:
- Create an Apache log parser that reads a log file and finds the IP addresses. A log entry inside the file looks like this
192.168.0.1 - - [23/Apr/2017:05:54:36 -0400] "GET / HTTP/1.1" 403 3985 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
BONUS
- Count the number of times each IP address is repeated in the file
- Save the parsed output in a CSV file
- Make the script user friendly by using the argparse module
I am looking forward to you guys joining the amazing journey.
- Twitter: @lakhera2015
- Facebook: https://www.facebook.com/groups/795382630808645/
- Medium: https://medium.com/@devopslearning
- GitHub: https://github.com/100daysofdevops/100daysofdevops
- Slack: https://join.slack.com/t/100daysofdevops/shared_invite/zt-au03logz-YfDUp_FJF4rAUeDEbgWmsg
- Reddit: r/101DaysofDevops
- Meetup: https://www.meetup.com/100daysofdevops/