Regular Expressions Are Nothing to Fear


Doug Hellmann

PyATL, February 2017

What are
“Regular Expressions”?

Formal Language for
Matching Patterns in Text

Their Uses

  • Finding
  • Parsing
  • Editing

t

Does the pattern match this text?

Using re from Python

import re

s = "Does the pattern match this text?"

pattern = re.compile("t")

print(pattern.search(s))
<_sre.SRE_Match object; span=(5, 6), match='t'>
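A minimal sketch of inspecting the match object printed above. The variable name `match` is an assumption for illustration. On newer Python 3 releases the repr should read `<re.Match object; span=(5, 6), match='t'>` instead of `_sre.SRE_Match`, but the methods are the same.

```python
import re

s = "Does the pattern match this text?"
pattern = re.compile("t")

# search() returns the first match only, or None when nothing matches.
match = pattern.search(s)
if match:
    print(match.group())                  # the matched text
    print(match.span())                   # (start, end) offsets into s
    print(s[match.start():match.end()])   # slicing s gives the same text
```

Checking for None before calling methods on the result avoids an AttributeError when the pattern does not match.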

this

Does the pattern match this text?

tt

Does the pattern match this text?

t{2}

Does the pattern match this text?

t{2,3}

Does the pattern match this text?

t+

Does the pattern match this text?

t..t

Does the pattern match this text?

t.+t

Does the pattern match this text?

t.+?t

Does the pattern match this text?

t.*t

Does the pattern match this text?

t.*?t

Does the pattern match this text?
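The slides highlight the matched text visually, which is lost in plain text. As a rough sketch, the same patterns can be run with re.findall() to see what each one matches. The names `patterns` and `results` are assumptions for illustration.

```python
import re

s = "Does the pattern match this text?"

patterns = ["tt", "t{2}", "t{2,3}", "t+", "t..t",
            "t.+t", "t.+?t", "t.*t", "t.*?t"]

# Map each pattern to every non-overlapping match found in s.
results = {p: re.findall(p, s) for p in patterns}

for p in patterns:
    print("{:<8} {}".format(p, results[p]))
```

The greedy forms `t.+t` and `t.*t` run from the first "t" to the last one in a single match, while the lazy forms `t.+?t` and `t.*?t` stop at the nearest following "t" and so find several shorter matches.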

t[aeiou]

Does the pattern match this text?

t[^aeiou]

Does the pattern match this text?

t[a-zA-Z]

Does the pattern match this text?
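A short sketch of the three character-set patterns above, again using re.findall() in place of the slide highlighting. The variable names are assumptions for illustration.

```python
import re

s = "Does the pattern match this text?"

vowel_after_t = re.findall(r"t[aeiou]", s)       # "t" followed by a vowel
non_vowel_after_t = re.findall(r"t[^aeiou]", s)  # "t" followed by anything else
letter_after_t = re.findall(r"t[a-zA-Z]", s)     # "t" followed by an ASCII letter

print(vowel_after_t)
print(non_vowel_after_t)
print(letter_after_t)
```

Note that the negated set `[^aeiou]` also accepts punctuation, which is why it matches "t?" at the end of the sentence while `[a-zA-Z]` does not.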

\d{5}

Athens, 30605

\D+

Athens, 30605

\s+

Athens, 30605

\S+

Athens, 30605

\w+

Athens, 30605

\W+

Athens, 30605
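The escape codes above come in pairs, where the upper-case form matches whatever the lower-case form does not. A minimal sketch that runs each one against the sample text (the names `text` and `found` are assumptions for illustration):

```python
import re

text = "Athens, 30605"

found = {
    r"\d{5}": re.findall(r"\d{5}", text),  # exactly five digits
    r"\D+": re.findall(r"\D+", text),      # runs of non-digits
    r"\s+": re.findall(r"\s+", text),      # runs of whitespace
    r"\S+": re.findall(r"\S+", text),      # runs of non-whitespace
    r"\w+": re.findall(r"\w+", text),      # runs of word characters
    r"\W+": re.findall(r"\W+", text),      # runs of non-word characters
}

for pattern, matches in found.items():
    print("{:<6} {!r}".format(pattern, matches))
```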

^t

Does the pattern match this text?
(no match)

\?$

Does the pattern match this text?

\bt

Does the pattern match this text?

\Bt

Does the pattern match this text?
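Anchors match positions rather than characters, so it can help to look at where each match starts. A small sketch, assuming a hypothetical helper named `starts` that is not part of the talk:

```python
import re

s = "Does the pattern match this text?"


def starts(pattern, text):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]


print(starts(r"^t", s))    # empty: the text starts with "D", not "t"
print(starts(r"\?$", s))   # the "?" at the very end
print(starts(r"\bt", s))   # "t" at the start of a word
print(starts(r"\Bt", s))   # "t" inside or at the end of a word
```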

\bt(\w+)t\W

Does the pattern match this text?

(1, 1) ex

\b(t\w{2}|t..t)\W

Does the pattern match this text?

(1, 1) the


(2, 1) text
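The "(1, 1)" labels appear to mean (match number, group number). Under that assumption, a sketch that reproduces the listings above could look like this; `show_groups` is a hypothetical helper, not something from the talk.

```python
import re

s = "Does the pattern match this text?"


def show_groups(pattern, text):
    """List each captured group as '(match number, group number) text'."""
    lines = []
    for i, match in enumerate(re.finditer(pattern, text), 1):
        for j, group in enumerate(match.groups(), 1):
            lines.append("({}, {}) {}".format(i, j, group))
    return lines


print("\n".join(show_groups(r"\bt(\w+)t\W", s)))
print("\n".join(show_groups(r"\b(t\w{2}|t..t)\W", s)))
```

In the second pattern, "this" is skipped because neither alternative fits: `t\w{2}` would need a non-word character after "thi", and `t..t` would need the word to end in "t".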

((\d{1,3}\.){3}\d{1,3})

80.5.216.116 - - [14/Jan/2017:12:20:35 -0800] "GET /3/re/index.html HTTP/1.1" 200 180340 "https://pymotw.com/3/index.html" "Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0"

(1, 1) 80.5.216.116
(1, 2) 216.
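Group 2 shows only "216." because a group that repeats keeps just its last repetition. A minimal sketch of that behavior, with a non-capturing `(?:...)` variant for comparison. The shortened `line` and the variable names are assumptions for illustration.

```python
import re

# Only the start of the log line is needed for this pattern.
line = "80.5.216.116 - - [14/Jan/2017:12:20:35 -0800]"

# The inner group matches "80.", "5.", and "216." in turn; only the last is kept.
capturing = re.search(r"((\d{1,3}\.){3}\d{1,3})", line)
print(capturing.groups())

# Making the inner group non-capturing leaves just the full address.
non_capturing = re.search(r"((?:\d{1,3}\.){3}\d{1,3})", line)
print(non_capturing.groups())
```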

\[([^]]+)\]

80.5.216.116 - - [14/Jan/2017:12:20:35 -0800] "GET /3/re/index.html HTTP/1.1" 200 180340 "https://pymotw.com/3/index.html" "Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0"

(1, 1) 14/Jan/2017:12:20:35 -0800

"((\w+)\s+([^"]*)\s+HTTP/[^"]+)"

80.5.216.116 - - [14/Jan/2017:12:20:35 -0800] "GET /3/re/index.html HTTP/1.1" 200 180340 "https://pymotw.com/3/index.html" "Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0"

(1, 1) GET /3/re/index.html HTTP/1.1
(1, 2) GET
(1, 3) /3/re/index.html

^((\d{1,3}\.){3}\d{1,3})[^[]+\[([^]]+)\][^"]*"((\w+)\s+([^"]*)\s+HTTP/[^"]+)"\s+(\d+)

80.5.216.116 - - [14/Jan/2017:12:20:35 -0800] "GET /3/re/index.html HTTP/1.1" 200 180340 "https://pymotw.com/3/index.html" "Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0"

(1, 1) 80.5.216.116
(1, 2) 216.
(1, 3) 14/Jan/2017:12:20:35 -0800
(1, 4) GET /3/re/index.html HTTP/1.1
(1, 5) GET
(1, 6) /3/re/index.html
(1, 7) 200

^
((\d{1,3}\.){3}\d{1,3})  # IP Address
[^[]+
\[([^]]+)\]  # Date
[^"]*
"((\w+)\s+([^"]*)\s+HTTP/[^"]+)"  # Request
\s+
(\d+)  # Response code

80.5.216.116 - - [14/Jan/2017:12:20:35 -0800] "GET /3/re/index.html HTTP/1.1" 200 180340 "https://pymotw.com/3/index.html" "Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0"

(1, 1) 80.5.216.116
(1, 2) 216.
(1, 3) 14/Jan/2017:12:20:35 -0800
(1, 4) GET /3/re/index.html HTTP/1.1
(1, 5) GET
(1, 6) /3/re/index.html
(1, 7) 200
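The multi-line, commented layout above only works when the pattern is compiled with the re.VERBOSE flag, which makes the engine ignore whitespace and "#" comments outside character sets. A sketch under that assumption; the names `log_pattern` and `line` are illustrative.

```python
import re

log_pattern = re.compile(
    r'''
    ^
    ((\d{1,3}\.){3}\d{1,3})            # IP Address
    [^[]+
    \[([^]]+)\]                        # Date
    [^"]*
    "((\w+)\s+([^"]*)\s+HTTP/[^"]+)"   # Request
    \s+
    (\d+)                              # Response code
    ''',
    re.VERBOSE,
)

line = (
    '80.5.216.116 - - [14/Jan/2017:12:20:35 -0800] '
    '"GET /3/re/index.html HTTP/1.1" 200 180340 '
    '"https://pymotw.com/3/index.html" '
    '"Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0"'
)

match = log_pattern.search(line)
for number, value in enumerate(match.groups(), 1):
    print("(1, {}) {}".format(number, value))
```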

^
(?P<ip>(\d{1,3}\.){3}\d{1,3})
[^[]+
\[(?P<date>[^]]+)\]
[^"]*
"((?P<method>\w+)\s+
  (?P<path>[^"]*)\s+HTTP/[^"]+)"
\s+
(?P<response>\d+)

80.5.216.116 - - [14/Jan/2017:12:20:35 -0800] "GET /3/re/index.html HTTP/1.1" 200 180340 "https://pymotw.com/3/index.html" "Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0"

date      : 14/Jan/2017:12:20:35 -0800
ip        : 80.5.216.116
method    : GET
path      : /3/re/index.html
response  : 200
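Named groups are read back with groupdict(), which returns only the groups that have names. A sketch of how the listing above might be produced, assuming the pattern is compiled with re.VERBOSE and bound to the name `_pattern`:

```python
import re

_pattern = re.compile(
    r'''
    ^
    (?P<ip>(\d{1,3}\.){3}\d{1,3})
    [^[]+
    \[(?P<date>[^]]+)\]
    [^"]*
    "((?P<method>\w+)\s+
      (?P<path>[^"]*)\s+HTTP/[^"]+)"
    \s+
    (?P<response>\d+)
    ''',
    re.VERBOSE,
)

line = (
    '80.5.216.116 - - [14/Jan/2017:12:20:35 -0800] '
    '"GET /3/re/index.html HTTP/1.1" 200 180340 '
    '"https://pymotw.com/3/index.html" '
    '"Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0"'
)

parsed = _pattern.search(line).groupdict()

# Print the fields in alphabetical order, as in the listing.
for name in sorted(parsed):
    print("{:<10}: {}".format(name, parsed[name]))
```

The unnamed groups (the repeated octet and the whole request string) still exist as numbered groups, but they do not appear in the dictionary.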

Parsing with re

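# _pattern is assumed to be the named-group pattern above, compiled with re.VERBOSE
with open('ex2.log', 'r', encoding='utf-8') as log:
    for line in log:
        match = _pattern.search(line)
        if match:  # skip lines that do not parse
            print('{ip:<15}: {path}'.format(**match.groupdict()))

37.120.9.213   : /3/ipaddress/index.html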
37.120.9.213   : /3/smtplib/index.html
210.184.1.177  : /3/index.html
37.120.9.213   : /3/subprocess/index.html
37.120.9.213   : /3/statistics/index.html
194.187.170.148: /3/resource/index.html
37.120.9.213   : /3/datetime/index.html
37.120.9.213   : /3/time/index.html
180.245.166.121: /3/string/index.html
94.99.141.16   : /3/textwrap/index.html

Editing with re

import re

bold = re.compile(r'\*{2}(.*?)\*{2}')

text = 'Make this **bold**.  This **too**.'
html = bold.sub(r'<b>\1</b>', text)

print('Text:', text)
print('HTML:', html)

Text: Make this **bold**.  This **too**.
HTML: Make this <b>bold</b>.  This <b>too</b>.

Real Example

Abstract base classes are a form of interface checking more
strict than individual :func:`hasattr()` checks for particular
methods.  By defining an abstract base class, a common API
can be established for a set of subclasses.  This capability
is especially useful in situations where someone less familiar
with the source for an application is

To start, define an abstract base class to represent the API
of a set of plug-ins for saving and loading data.  Set the
metaclass for the new base class to :class:`ABCMeta`, and use
decorators to establish the public API for the class.  The
following examples use ``abc_base.py``, which contains:

Real Example

:func:`hasattr()`   -> ``hasattr()``
:class:`ABCMeta`    -> ``ABCMeta``

:func:`hasattr()`

:func:`
([^(`]+)
(?:\(\))?
`

``\1()``

:func:`hasattr()`

func = re.compile(
    r'''
    :func:
    `
    ([^(`]+)
    (?:\(\))?
    `
    ''',
    flags=re.VERBOSE | re.MULTILINE,
)
text = func.sub(r'``\1()``', text)

:class:`ABCMeta`

:
(?:class|const)
:`
([^(`]+)
(?:\(\))?
`

``\1``

Real Example

strip = re.compile(
    r'''
    :(?:class|data|const|command):
    `
    ([^(`]+)
    (?:\(\))?
    `
    ''',
    flags=re.VERBOSE | re.MULTILINE,
)
text = strip.sub(r'``\1``', text)

Abstract base classes are a form of interface checking more
strict than individual ``hasattr()`` checks for particular
methods.  By defining an abstract base class, a common API
can be established for a set of subclasses.  This capability
is especially useful in situations where someone less familiar
with the source for an application is

To start, define an abstract base class to represent the API
of a set of plug-ins for saving and loading data.  Set the
metaclass for the new base class to ``ABCMeta``, and use
decorators to establish the public API for the class.  The
following examples use ``abc_base.py``, which contains:
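For anyone who wants to try the conversion, here is a self-contained sketch that applies both substitutions to short excerpts of the text. The patterns are copied from the slides; the `convert` helper and the `:func:`len`` input are hypothetical additions for illustration.

```python
import re

func = re.compile(
    r'''
    :func:
    `
    ([^(`]+)
    (?:\(\))?
    `
    ''',
    flags=re.VERBOSE | re.MULTILINE,
)

strip = re.compile(
    r'''
    :(?:class|data|const|command):
    `
    ([^(`]+)
    (?:\(\))?
    `
    ''',
    flags=re.VERBOSE | re.MULTILINE,
)


def convert(text):
    """Rewrite :func: roles first, then strip the remaining roles."""
    text = func.sub(r'``\1()``', text)
    return strip.sub(r'``\1``', text)


print(convert('strict than individual :func:`hasattr()` checks'))
print(convert('base class to :class:`ABCMeta`, and use'))
# Hypothetical input without parentheses, to exercise the optional (?:\(\))? part.
print(convert('call :func:`len` on the result'))
```

Because the group captures only the name and the parentheses are optional, function references come out with exactly one "()" whether or not the source included it.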

Further Reading

https://doughellmann.com/presentations/regexes-fear/
http://github.com/dhellmann/presentation-regexes-fear/
https://pymotw.com/3/re/

Mastering Regular Expressions by Jeffrey Friedl