Pixellized regular expression becoming clear

Human-Readable Python Regular Expressions

(The regex classes described in this post are available in my blog examples repo.)

Because I don't work with regular expressions every day, I have to re-learn them whenever I need to write a regex pattern. I can barely write a medium-complexity pattern without a couple minutes on pythex, and complex patterns turn my brain to goop.

I know what regular expressions are, and I know their terminology, but I have a hard time translating my understanding of regular expressions into regex syntax.

I'm not alone (from CommitStrip):

CommitStrip #650

Regular expressions are ugly, complex, dense, and near-illegible, defying at least four of the aphorisms in The Zen of Python:

Beautiful is better than ugly.

Simple is better than complex.

Sparse is better than dense.

Readability counts.

So I got to thinking: What could I do to make regular expressions easier to understand?

In this post, I describe a method of formulating a regular expression pattern that highlights its composability. I also introduce a class-based strategy for building regex patterns that I think makes them easier to read, write, and maintain.

The challenge

Suppose we have a Markdown file containing fenced code blocks:

# Markdown File

```python
def print_banana():
    print('banana')
```

This is a lovely---albeit unimportant---paragraph.

```python
def print_chocolate():
    print('chocolate')
```

Suppose further that we want to extract only the Python code from that document.

While we're supposing, we suppose even further that someone's life depends on it.

Regex to the rescue! But what pattern will work?

Regex patterns in plain language

Before I start writing a regex pattern, I try to explain it to myself in plain language.

In this case:

If we find a line consisting of ```python, match the following group of one or more lines (but as few as possible), stopping when we find ```

Now that we know what we're looking for, let's break that description into manageable parts:

  1. The start of a fenced code block, which we don't want to capture
  2. One or more lines of code, which we do want to capture
  3. The end of a fenced code block, which we don't want to capture

Together, these form a grammar for our regex pattern (see the 6th definition). Now let's look at writing the pattern's syntax.

Regex patterns in not-so-plain syntax

Let's start with the completed regex pattern (thanks, pythex):

(?<=```python\n)((?:.*\n)+?)(?=```)

Unless you're a computer, the plain-language regex pattern should be easier to understand than that.
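To check that the pattern really does what the plain-language description says, here is a minimal sketch that runs it against the sample document. The `markdown_content` string is my transcription of the Markdown file above, with a newline assumed at the end of every line.

```python
import re

# The sample document from "The challenge", written out as a Python string.
markdown_content = (
    "# Markdown File\n"
    "\n"
    "```python\n"
    "def print_banana():\n"
    "    print('banana')\n"
    "```\n"
    "\n"
    "This is a lovely---albeit unimportant---paragraph.\n"
    "\n"
    "```python\n"
    "def print_chocolate():\n"
    "    print('chocolate')\n"
    "```\n"
)

pattern = r'(?<=```python\n)((?:.*\n)+?)(?=```)'

# With exactly one capturing group, findall returns that group's text
# for every match.
matches = re.findall(pattern, markdown_content)
for code in matches:
    print(code)
```

Each match is the body of one fenced block, without the fence lines themselves.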

The pattern is an eyeful, but it's an eyeful we can break into discrete parts, corresponding to the three plain-language parts above.

(1) The start of the fenced code block, which we don't want to capture:

(?<=```python\n)

(2) One or more lines of code, which we do want to capture:

((?:.*\n)+?)

(3) The end of the fenced code block, which we don't want to capture:

(?=```)

Exploding regex patterns like this is possible because regular expressions are composable; they can be stitched together. The re docs explain this:

Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression.

As a result, we could write that regular expression pattern like this:

pattern = (
    r'(?<=```python\n)' +
    r'((?:.*\n)+?)' +
    r'(?=```)'
)

Here, the pattern is broken into three sequences. Breaking down a regular expression doesn't make it more readable, but it's the first step.
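As a quick illustration of that composability, the sketch below checks that each of the three sequences is a valid regular expression on its own, and that their concatenation is too. The `parts` list is just a convenience for this example.

```python
import re

parts = [
    r'(?<=```python\n)',  # lookbehind: the opening fence
    r'((?:.*\n)+?)',      # captured body: one or more lines, as few as possible
    r'(?=```)',           # lookahead: the closing fence
]

# Each part compiles by itself (re.compile raises re.error on invalid syntax)...
for part in parts:
    re.compile(part)

# ...and so does the concatenation of all three.
pattern = ''.join(parts)
compiled = re.compile(pattern)

# Only the body is a capturing group; the lookarounds and the inner
# (?:...) group do not capture.
print(compiled.groups)
```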

Regex patterns in legible syntax

Variable-based patterns

Once a regex pattern is broken into logical sequences, naming those sequences can make a regular expression a bit easier to understand:

python_code_block_start = r'(?<=```python\n)'
python_code_body = r'((?:.*\n)+?)'
python_code_block_end = r'(?=```)'

pattern = (
    python_code_block_start +
    python_code_body +
    python_code_block_end
)

assert pattern == r'(?<=```python\n)((?:.*\n)+?)(?=```)'

Although variable names provide some context to the pattern sequences, the individual parts are still gobbledygook. The python_code_body sequence is a mess of parentheses and question marks, and if you rearrange the lookbehind ((?<=)), it looks like a bird trying to decipher a regex pattern:

  ?
<(=)

So how can we make those complex patterns easier to read (and more Zen)?

Class-based pattern groups

Python's classes are a good fit for wrapping regex groups and making them easier to write and read.

"Classes provide a means of bundling data and functionality together", which is exactly what we're looking for.

To do this, I wrote classes for regex group-like sequences: Group, LookBehind, and LookAhead.

You can find these classes in the regex_patterns.py file in my blog examples repo.

Each of these classes accepts a pattern as its first init argument and wraps that pattern in the appropriate regex syntax. They all have a __str__() method that returns a standard regex pattern string, too, so they can be used transparently with Python's re library.
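The real implementations live in the repo, so what follows is only a rough sketch of how such classes might look, not the actual code. The keyword arguments mirror the ones used in this post (`capturing`, `greedy`, `one_or_more`); the default values and the internals are my assumptions.

```python
class Group:
    """Hypothetical sketch: wrap a pattern in a regex group."""

    def __init__(self, pattern, capturing=True, greedy=True, one_or_more=False):
        self.pattern = pattern
        self.capturing = capturing
        self.greedy = greedy
        self.one_or_more = one_or_more

    def __str__(self):
        prefix = '' if self.capturing else '?:'
        # Formatting self.pattern calls str() on it, so nested
        # pattern objects are rendered recursively.
        result = f'({prefix}{self.pattern})'
        if self.one_or_more:
            # In this sketch, greedy only matters when a quantifier is present.
            result += '+' if self.greedy else '+?'
        return result


class LookBehind:
    """Hypothetical sketch: positive lookbehind, (?<=...)."""

    def __init__(self, pattern):
        self.pattern = pattern

    def __str__(self):
        return f'(?<={self.pattern})'


class LookAhead:
    """Hypothetical sketch: positive lookahead, (?=...)."""

    def __init__(self, pattern):
        self.pattern = pattern

    def __str__(self):
        return f'(?={self.pattern})'


lines = Group('.*\n', capturing=False, greedy=False, one_or_more=True)
print(repr(str(Group(lines))))
```

In this sketch, the `re` functions still need a string, so a single instance has to go through `str()` (or a helper that calls it) before being passed to something like `re.compile()`.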

Let's see how they compare to the plain-language and not-so-plain-syntax versions.

(1) The start of the fenced code block, which we don't want to capture:

python_code_block_start = LookBehind('```python\n')

(2) One or more lines of code, which we do want to capture:

one_or_more_lines = Group(
    '.*\n', capturing=False, greedy=False, one_or_more=True
)
python_code_body = Group(one_or_more_lines)

(3) The end of the fenced code block, which we don't want to capture:

python_code_block_end = LookAhead('```')

All together

To stitch together pattern pieces, I wrote a simple function that concatenates strings or pattern classes:

build_pattern(*pattern_parts)
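A function with that signature can be very small. Here is a minimal sketch, assuming every part is either a string or an object whose `__str__()` returns a regex string; the `LookAhead` stand-in is only there to make the example runnable and is not the repo's class.

```python
def build_pattern(*pattern_parts):
    """Concatenate strings or pattern objects into one regex string."""
    return ''.join(str(part) for part in pattern_parts)


class LookAhead:
    """Stand-in pattern class (an assumption for this example)."""

    def __init__(self, pattern):
        self.pattern = pattern

    def __str__(self):
        return f'(?={self.pattern})'


# Plain strings and pattern objects can be mixed freely.
pattern = build_pattern(
    r'(?<=```python\n)',
    r'((?:.*\n)+?)',
    LookAhead('```'),
)
print(pattern)
```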

Let's put it all together:

one_or_more_lines = Group(
    '.*\n', capturing=False, greedy=False, one_or_more=True
)

markdown_python_code_pattern = build_pattern(
    LookBehind('```python\n'),
    Group(one_or_more_lines),
    LookAhead('```'),
)
assert markdown_python_code_pattern == '(?<=```python\n)((?:.*\n)+?)(?=```)'

Now, we can make sense of the pattern through the level of abstraction that the classes provide.

It's clear that the pattern consists of a lookbehind, a group, and a lookahead.

The code block content pattern is easier for a human to parse, too. I don't have to remember whether the ?: or ? makes the group non-greedy, or that +? works and ?+ doesn't.

The amount of handwritten regex pattern shrank from 35 characters to 18, nearly half. That may not seem like many characters, but 17 characters is quite a lot of regex, especially if you have trouble remembering whether ^ matches the start or end of a string.
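Those counts can be double-checked with a few lines of Python. The sketch below uses raw strings so that each `\n` counts as the two characters you actually type; splitting the handwritten portion into these three fragments is my reading of the class-based version.

```python
# The full pattern, as typed by hand.
full_pattern = r'(?<=```python\n)((?:.*\n)+?)(?=```)'

# The only regex fragments still typed by hand in the class-based version.
handwritten_parts = [r'```python\n', r'.*\n', r'```']

before = len(full_pattern)
after = sum(len(part) for part in handwritten_parts)
print(before, after, before - after)  # 35 18 17
```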

Before I forget

Now that the pattern is assembled, we can use it like any other regex pattern. Let's extract the Python code from that Markdown content (at long last):

>>> import re
>>> matches = re.findall(
...     markdown_python_code_pattern, markdown_content, re.MULTILINE
... )
>>> matches
["def print_banana():\n    print('banana')\n",
 "def print_chocolate():\n    print('chocolate')\n"]

Mission accomplished.

Wrap-up

In this post, I bellyached about my inability to remember regex pattern syntax. I described a strategy of writing regular expressions by breaking them into multiple variables, which provides some descriptive value. Then, I introduced classes for regex groups, lookaheads, and lookbehinds.

Class-based patterns are a different way to write regular expressions, one that encourages building patterns from composable parts. Writing regex patterns like this helps me focus on the grammar of a pattern instead of tripping over its syntax.

The classes don't take all the clunkiness out of regex patterns, but they displace some raw regex syntax, replacing it with code that reads more like plain language—and less like punctuation soup.