Cheat sheet · No. II

Regex.

A regex is a tiny program that walks a string left to right, backtracking whenever a branch fails. Most of the cost — and most of the bugs — live in greedy quantifiers and lookarounds.

Printable · One A4 page
PLATE: Regex. FIG. II: /^h(el)+o$/g applied to the characters h e l e l o · w o, with MATCH marked; callouts: ^ $ anchors, + * greedy, (?=) lookahead. One page, pinned to the wall.
The reference
ANCHORS
^
Start of line/string
$
End of line/string
\b
Word boundary
\B
Non-boundary
\A \z
Start / end of input (unaffected by the m flag; Python spells the end anchor \Z)
CLASSES
.
Any char (except newline by default)
\d \D
Digit / non-digit
\w \W
Word / non-word
\s \S
Whitespace / non-whitespace
[abc]
Any of a, b, c
[^abc]
Any one char except a, b, c
[a-z]
Any one char in the range a to z
QUANTIFIERS
?
0 or 1
*
0 or more (greedy)
+
1 or more (greedy)
{n}
Exactly n
{n,}
n or more
{n,m}
Between n and m
*? +?
Lazy versions (match as little as possible)
GROUPS
(...)
Capturing group
(?:...)
Non-capturing
(?P<name>...)
Named group (Python syntax; many other engines write (?<name>...))
\1 \2
Backreference to group N
|
Alternation
LOOKAROUNDS
(?=...)
Positive lookahead
(?!...)
Negative lookahead
(?<=...)
Positive lookbehind
(?<!...)
Negative lookbehind
FLAGS
i
Case-insensitive
m
Multiline (^ $ match each line)
s
Dotall (. matches newline)
x
Verbose (whitespace + comments allowed)
g
Global (find all matches, not just the first)
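A minimal sketch of the groups, lookarounds, and flags above, assuming Python's `re` module. The sample strings and variable names are made up for illustration. Python has no g flag; `re.findall` plays that role here.

```python
import re

# x flag (re.VERBOSE): whitespace and comments inside the pattern are ignored.
DATE = re.compile(
    r"""
    (?P<year>\d{4})  -   # named group: four-digit year
    (?P<month>\d{2}) -   # named group: two-digit month
    (?P<day>\d{2})       # named group: two-digit day
    """,
    re.VERBOSE,
)
parts = DATE.search("released 2024-03-15").groupdict()

# Backreference: \1 must repeat exactly what group 1 captured.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine").group(1)

# Lookarounds check context without consuming it.
keys = re.findall(r"\w+(?=:)", "key: value, other: thing")        # lookahead
amounts = re.findall(r"(?<=\$)\d+", "cost $40, or 35 euros")      # lookbehind
not_foobar = re.findall(r"\bfoo(?!bar)\w*", "foobar foobaz foo")  # negative lookahead

# i, m, s flags.
any_case = re.findall(r"regex", "Regex REGEX regex", re.IGNORECASE)
line_starts = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)
dot_default = re.search(r"a.b", "a\nb")            # None: . skips newline
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)     # matches across the newline
```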
Field notes
Greedy by default

Quantifiers (* + {n,}) grab as much as they can, then give back. Add ? to make them lazy (*?, +?) when a match overshoots.
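To make the overshoot concrete, here is a small sketch in Python's `re` with an illustrative HTML-like string. The negated-class variant is an extra alternative, not something this sheet prescribes; it often avoids the need for laziness altogether.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the end of the string, then backs off to the LAST '>'.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the FIRST '>' it can reach.
lazy = re.findall(r"<.+?>", html)

# Negated class: never crosses a '>' in the first place.
negated = re.findall(r"<[^>]+>", html)
```

Here `greedy` holds a single match spanning the whole string, while `lazy` and `negated` both hold the four individual tags.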

Anchor or scan

Without ^ … $ the engine retries the pattern at every start position and accepts a match anywhere in the string. Anchoring fails faster and, for validation, is usually what you meant.
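A short sketch of the difference, assuming Python's `re` and a hypothetical "exactly four digits" check. One Python-specific wrinkle: `$` also matches just before a trailing newline, so `re.fullmatch` (or `\Z`) is the stricter choice.

```python
import re

FOUR_DIGITS = r"\d{4}"

# Unanchored: scans forward and finds four digits somewhere inside.
scanned = re.search(FOUR_DIGITS, "abc12345")

# Anchored with ^ ... $: the whole string must be four digits.
anchored_bad = re.search(r"^\d{4}$", "abc12345")
anchored_ok = re.search(r"^\d{4}$", "2024")

# In Python, $ tolerates one trailing newline; fullmatch does not.
dollar_newline = re.search(r"^\d{4}$", "2024\n")
full_newline = re.fullmatch(FOUR_DIGITS, "2024\n")
```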

Catastrophic backtracking

Nested quantifiers over overlapping patterns, like (a+)+, can take exponential time on a long non-match. Avoid them, or reach for atomic groups or possessive quantifiers where the engine supports them.
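The sketch below, assuming Python's backtracking `re` engine, compares a nested-quantifier pattern with a flattened equivalent. `RISKY`, `SAFE`, and `seconds_to_reject` are illustrative names. Timings vary by machine, so none are quoted, and the input sizes are kept small on purpose.

```python
import re
import time

RISKY = re.compile(r"^(a+)+b$")  # nested quantifiers over the same character
SAFE = re.compile(r"^a+b$")      # accepts the same strings with one quantifier

def seconds_to_reject(pattern, n):
    """Time one failed match against n 'a' characters with no trailing 'b'."""
    subject = "a" * n
    start = time.perf_counter()
    result = pattern.match(subject)
    elapsed = time.perf_counter() - start
    if result is not None:
        raise ValueError("expected a non-match")
    return elapsed

# Both patterns agree on what matches...
samples = ["ab", "aaab", "b", "aaa", "aab!"]
agree = all(bool(RISKY.match(s)) == bool(SAFE.match(s)) for s in samples)

# ...but on a non-match the risky one has many more ways to split the run of
# 'a's, and a backtracking engine may try them all before giving up.
for n in (14, 16, 18):
    print(n, seconds_to_reject(RISKY, n), seconds_to_reject(SAFE, n))
```

Flattening the pattern is the portable fix. Atomic groups and possessive quantifiers are engine-dependent; Python's `re`, for example, only gained them in version 3.11.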

Unicode is not a given

\d and \w are Unicode-aware in some engines and ASCII-only in others. Test against the runtime you actually ship.
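As one data point, Python 3's `re` treats `\d` and `\w` as Unicode-aware for `str` patterns unless `re.ASCII` is passed. The sketch below shows the difference on illustrative strings; other engines may default the other way.

```python
import re

digits = "42 \u0664\u0662"        # ASCII digits, then Arabic-Indic digits
words = "na\u00efve caf\u00e9"    # "naïve café" with precomposed accents

unicode_digits = re.findall(r"\d+", digits)
ascii_digits = re.findall(r"\d+", digits, re.ASCII)

unicode_words = re.findall(r"\w+", words)
ascii_words = re.findall(r"\w+", words, re.ASCII)  # accented letters split the words
```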

Tip: hit ⌘P / Ctrl-P to save this single page as a PDF or print it for the wall.
