Cheat sheet · No. II

Regex.

A regex is a tiny program that walks a string left to right, backtracking whenever a branch fails. Most of the cost — and most of the bugs — live in greedy quantifiers and lookarounds.

The reference

ANCHORS

^: Start of line/string
$: End of line/string
\b: Word boundary
\B: Non-boundary
\A \z: Start / end of input (unaffected by the m flag; Python's re uses \Z for the end)
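As a rough illustration of the anchors above, here is a minimal sketch using Python's standard `re` module. The sample text and variable names are made up for the example.

```python
import re

text = "cat\ncatalog\nbobcat"  # hypothetical three-line input

# ^ matches only at the start of the string unless re.MULTILINE is set.
starts_default = re.findall(r"^cat", text)
starts_multiline = re.findall(r"^cat", text, flags=re.MULTILINE)

# \b needs a word boundary on both sides, so "catalog" and "bobcat" are skipped.
whole_words = re.findall(r"\bcat\b", text)

# \A ignores re.MULTILINE: it matches only at the very start of the input.
input_start = re.findall(r"\Acat", text, flags=re.MULTILINE)

print(starts_default, starts_multiline, whole_words, input_start)
```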

CLASSES

.: Any char (except newline by default)
\d \D: Digit / non-digit
\w \W: Word / non-word
\s \S: Whitespace / non-whitespace
[abc]: Any of a, b, c
[^abc]: None of a, b, c
[a-z]: Range
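The classes above can be tried out with a short Python sketch like the following. The sample string is invented, and the results assume Python's default (Unicode-aware) `str` patterns.

```python
import re

sample = "Order 66: ship_to=Oslo\n"  # hypothetical input

digits = re.findall(r"\d+", sample)            # runs of digits
words = re.findall(r"\w+", sample)             # letters, digits, and underscore
punctuation = re.findall(r"[^\w\s]", sample)   # neither word char nor whitespace
lower_vowels = re.findall(r"[aeiou]", sample)  # a set is case-sensitive by default
one_line = re.findall(r".+", sample)           # . stops at the newline

print(digits, words, punctuation, lower_vowels, one_line)
```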

QUANTIFIERS

?: 0 or 1
*: 0 or more (greedy)
+: 1 or more (greedy)
{n}: Exactly n
{n,}: n or more
{n,m}: Between n and m, inclusive
*? +? ??: Lazy versions (match as little as possible)
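A small Python sketch of the counting quantifiers above, with invented inputs:

```python
import re

# ? makes the preceding item optional (0 or 1).
spellings = re.findall(r"colou?r", "color colour colouur")

# {4} means exactly four; \b stops it from matching inside a longer run.
years = re.findall(r"\b\d{4}\b", "1999 20245 2024")

# {2,3} means two or three, taking as many as it can.
runs = re.findall(r"a{2,3}", "a aa aaa aaaa")

print(spellings, years, runs)
```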

GROUPS

(...): Capturing group
(?:...): Non-capturing
(?P<name>...): Named group (Python; many other engines write (?<name>...))
\1 \2: Backreference to group N
|: Alternation
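The group constructs above might be used like this in Python. The date format, sentence, and unit list are placeholders chosen for the example.

```python
import re

# Named groups use Python's (?P<name>...) syntax; they are still numbered too.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date.search("released 2024-03-15")
year, month = m.group("year"), m.group(2)

# \1 must repeat exactly what group 1 captured, which finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best")

# (?:...) groups the alternation without creating a capture,
# so findall returns whole matches rather than group contents.
units = re.findall(r"\d+(?:px|em)", "10px 2em 5pt")

print(year, month, doubled.group(1), units)
```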

LOOKAROUNDS

(?=...): Positive lookahead
(?!...): Negative lookahead
(?<=...): Positive lookbehind
(?<!...): Negative lookbehind
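Lookarounds check the surrounding text without consuming it. The sketch below shows one of each kind in Python, with made-up inputs. Note that Python's `re` only accepts fixed-width lookbehinds, which is why the prefix here is a literal.

```python
import re

files = "a.txt b.py notes.txt"
prices = "USD 40, EUR 55, USD 7"
verbs = "undo redo unit test"

# Positive lookahead: a name followed by ".txt"; the extension is not in the match.
txt_names = re.findall(r"\w+(?=\.txt)", files)

# Negative lookahead: words that do not start with "un".
not_un = re.findall(r"\b(?!un)\w+", verbs)

# Positive lookbehind: digits preceded by "USD ".
usd = re.findall(r"(?<=USD )\d+", prices)

# Negative lookbehind: numbers not preceded by "USD ".
# The \b keeps the match from starting in the middle of "40".
not_usd = re.findall(r"(?<!USD )\b\d+", prices)

print(txt_names, not_un, usd, not_usd)
```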

FLAGS

i: Case-insensitive
m: Multiline (^ $ match each line)
s: Dotall (. matches newline)
x: Verbose (whitespace + comments allowed)
g: Global (find all; a flag in JavaScript and Perl, a separate find-all function in many other APIs)
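In Python the flags above are constants in the `re` module, and there is no g flag: `re.findall` and `re.finditer` return every match instead. A minimal sketch with invented inputs:

```python
import re

text = "Alpha\nbeta\nALPHA"  # hypothetical input

# i: ignore case.
any_case = re.findall(r"alpha", text, flags=re.IGNORECASE)

# m: ^ matches at the start of every line.
line_starts = re.findall(r"^\w", text, flags=re.MULTILINE)

# s: . also matches newlines, so the pattern can span lines.
span_default = re.search(r"Alpha.+ALPHA", text)
span_dotall = re.search(r"Alpha.+ALPHA", text, flags=re.DOTALL)

# x: whitespace and comments inside the pattern are ignored.
phone = re.compile(r"""
    (\d{3})   # prefix
    -         # separator
    (\d{4})   # line number
""", re.VERBOSE)
parts = phone.search("call 555-0199").groups()

print(any_case, line_starts, span_default, span_dotall is not None, parts)
```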

Field notes

Greedy by default

Quantifiers (* + {n,}) grab as much as they can, then give characters back one at a time until the rest of the pattern matches. Add ? to make them lazy (*?, +?) when a match overshoots.
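The classic overshoot looks like this in Python. The HTML fragment is only an illustration; a real parser is the better tool for real markup.

```python
import re

html = "<b>one</b> and <b>two</b>"  # hypothetical input

# Greedy: .* runs to the end of the string, then backs up to the LAST </b>.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? expands one character at a time and stops at the FIRST </b>.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)
print(lazy)
```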

Anchor or scan

Without ^ … $ the engine retries the pattern at every start position and accepts a match anywhere in the input. Anchoring both ends is faster and usually more correct when the whole string must match.
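A sketch of the difference in Python, using two hypothetical helpers (`contains_code` and `is_code`) around the same four-digit pattern. `fullmatch` anchors both ends without writing the anchors into the pattern.

```python
import re

FOUR_DIGITS = re.compile(r"\d{4}")

def contains_code(text: str) -> bool:
    """Unanchored scan: True if four digits appear anywhere in text."""
    return FOUR_DIGITS.search(text) is not None

def is_code(text: str) -> bool:
    """Anchored at both ends: True only if text is exactly four digits."""
    return FOUR_DIGITS.fullmatch(text) is not None

print(contains_code("id-12345x"))  # scan finds "1234" inside the longer run
print(is_code("12345"))            # anchored check rejects the extra digit
print(is_code("2024"))

# Python-specific caveat: $ also matches just before a trailing newline,
# so ^\d{4}$ accepts "2024\n" while fullmatch does not.
dollar_accepts_newline = re.search(r"^\d{4}$", "2024\n") is not None
```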

Catastrophic backtracking

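Nested quantifiers over overlapping patterns, like (a+)+, can hang on a long non-match, because a backtracking engine tries every way of splitting the input between the inner and outer loop. Avoid them, or reach for atomic groups or possessive quantifiers where the engine supports them.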
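The effect can be observed safely on a short input. The sketch below times a nested pattern against a flattened rewrite that matches the same text. It assumes a backtracking engine such as Python's `re`, the helper `timed_search` is hypothetical, and the timings will vary by machine. The subject is kept short on purpose, since each extra character roughly doubles the work for the nested pattern.

```python
import re
import time

risky = re.compile(r"(a+)+b")  # nested quantifiers over the same characters
safer = re.compile(r"a+b")     # matches the same text with only one way to split the run

subject = "a" * 16             # no "b", so both searches must fail

def timed_search(pattern, text):
    """Return (match, elapsed seconds) for a single search call."""
    start = time.perf_counter()
    match = pattern.search(text)
    return match, time.perf_counter() - start

risky_match, risky_seconds = timed_search(risky, subject)
safer_match, safer_seconds = timed_search(safer, subject)

print(f"risky: {risky_seconds:.4f}s  safer: {safer_seconds:.6f}s")
```

Flattening the nesting is usually the simplest fix. Atomic groups express the same intent where the engine has them.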

Unicode is not a given

\d and \w are Unicode-aware in some engines and ASCII-only in others. Test against the runtime you actually ship.
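Python is one runtime where the default depends on the pattern type: `str` patterns are Unicode-aware, and `re.ASCII` switches them to ASCII-only. A minimal sketch, with the non-ASCII characters written as escapes so the sample is unambiguous:

```python
import re

# "3", ARABIC-INDIC DIGIT THREE (U+0663), and "café" with U+00E9.
text = "3 \u0663 caf\u00e9"

unicode_digits = re.findall(r"\d", text)
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)

unicode_words = re.findall(r"\w+", text)
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(len(unicode_digits), len(ascii_digits))
print(unicode_words, ascii_words)
```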
