Cheat sheet · No. II
Regex.
In a backtracking engine, a regex is a tiny program that walks a string left to right, backing up whenever a branch fails. Most of the cost, and most of the bugs, live in greedy quantifiers and lookarounds.
The reference
ANCHORS
^- Start of line/string
$- End of line/string
\b- Word boundary
\B- Non-boundary
\A \z- Start / end of input (unaffected by multiline; Python spells the end anchor \Z)
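A minimal Python sketch of how these anchors behave on a two-line string. The sample text is made up for illustration.

```python
import re

text = "cat\nconcat cat"

# ^ without the m flag matches only at the very start of the string.
starts_default = re.findall(r"^c\w+", text)

# With re.MULTILINE, ^ also matches right after each newline.
starts_multiline = re.findall(r"^c\w+", text, flags=re.MULTILINE)

# \A ignores re.MULTILINE: it only ever matches at the start of the input.
start_of_input = re.findall(r"\Ac\w+", text, flags=re.MULTILINE)

# \b keeps "cat" from matching inside "concat".
whole_words = re.findall(r"\bcat\b", text)
```

Here starts_multiline picks up one word per line, while start_of_input still finds only the first.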
CLASSES
.- Any char (except newline by default)
\d \D- Digit / non-digit
\w \W- Word / non-word
\s \S- Whitespace / non-whitespace
[abc]- Any of a, b, c
[^abc]- Any char except a, b, c
[a-z]- Any char in the range a to z
QUANTIFIERS
?- 0 or 1
*- 0 or more (greedy)
+- 1 or more (greedy)
{n}- Exactly n
{n,}- n or more
{n,m}- Between n and m (inclusive)
*? +?- Lazy versions (match as little as possible)
GROUPS
(...)- Capturing group
(?:...)- Non-capturing
(?P<name>...)- Named (Python)
\1 \2- Backreference to group 1, group 2
|- Alternation
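A short Python sketch of the group forms above. The date format and sample strings are assumptions chosen for illustration.

```python
import re

# Named groups (Python syntax) for a hypothetical ISO-style date.
date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
parts = date_re.search("released 2024-03-15, patched later").groupdict()

# \1 must repeat exactly what group 1 captured: here, a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was was fine")

# (?:...) scopes the alternation without creating a capture,
# so findall returns whole matches rather than the group contents.
units = re.findall(r"\d+(?:px|em)", "12px 3em 7pt")
```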
LOOKAROUNDS
(?=...)- Positive lookahead
(?!...)- Negative lookahead
(?<=...)- Positive lookbehind
(?<!...)- Negative lookbehind
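Lookarounds assert what sits next to the match without consuming it. A minimal Python sketch, with invented sample strings:

```python
import re

prices = "cost: $40, discount: 15%, fee: $7"

# (?<=\$) requires a "$" just before the digits, without including it.
dollars = re.findall(r"(?<=\$)\d+", prices)

# (?=%) requires a "%" just after the digits, without consuming it.
percents = re.findall(r"\d+(?=%)", prices)

# (?<!\$) rejects numbers preceded by "$"; the leading \b stops the
# match from restarting in the middle of a number such as "40".
not_dollars = re.findall(r"\b(?<!\$)\d+\b", prices)

# (?!tmp\b) rejects file names whose extension is exactly "tmp".
keep = re.findall(r"\b\w+\.(?!tmp\b)\w+", "a.txt b.tmp c.py")
```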
FLAGS
i- Case-insensitive
m- Multiline (^ $ match each line)
s- Dotall (. matches newline)
x- Verbose (whitespace + comments allowed)
g- Global (find all; a JavaScript/Perl flag, Python uses findall or finditer instead)
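In Python the single-letter flags map to constants on the re module. A small sketch, with a made-up log string:

```python
import re

log = "Error: disk full\nerror: retrying\nok"

# i + m: case-insensitive, and ^ / $ match at every line edge.
errors = re.findall(r"^error: (.*)$", log, flags=re.IGNORECASE | re.MULTILINE)

# s: with re.DOTALL the dot crosses newlines, so .* runs to the end of input.
without_s = re.search(r"Error.*", log).group()
with_s = re.search(r"Error.*", log, flags=re.DOTALL).group()

# x: whitespace and # comments inside the pattern are ignored.
version_re = re.compile(
    r"""
    (\d+) \.   # major
    (\d+) \.   # minor
    (\d+)      # patch
    """,
    re.VERBOSE,
)
version = version_re.search("v1.22.3").groups()

# There is no g flag in Python; findall / finditer play that role.
all_numbers = re.findall(r"\d+", "v1.22.3")
```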
Field notes
Greedy by default
Quantifiers (* + {n,}) grab as much as they can, then give back one character at a time until the rest of the pattern fits. Add ? to make them lazy (*?, +?) when a match overshoots.
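The classic overshoot, sketched in Python on an invented snippet of markup:

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the end of the string, then gives back
# characters until the LAST </b> fits.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? grows one character at a time and stops at the FIRST </b>.
lazy = re.findall(r"<b>.*?</b>", html)

# A negated class often expresses the same intent with less backtracking.
negated = re.findall(r"<b>[^<]*</b>", html)
```

The greedy pattern returns one match spanning both tags; the lazy and negated-class patterns return each tag pair separately.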
Anchor or scan
Without ^ … $ the engine retries at every start position and accepts any string that merely contains a match. Anchoring is both faster and usually more correct.
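A minimal Python sketch of the difference when validating input. The input string is hypothetical.

```python
import re

user_input = "42; DROP TABLE users"

# Unanchored: search() tries each start position and is satisfied by "42".
scanned = re.search(r"\d+", user_input)

# Anchored at both ends: the whole string must be digits, so this is rejected.
anchored = re.search(r"^\d+$", user_input)

# fullmatch() anchors both ends for you.
full = re.fullmatch(r"\d+", user_input)

# One Python subtlety: $ also matches just before a trailing newline,
# while fullmatch() does not let it through.
newline_slips = re.search(r"^\d+$", "42\n") is not None
strict_rejects = re.fullmatch(r"\d+", "42\n") is None
```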
Catastrophic backtracking
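Nested quantifiers over overlapping patterns, like (a+)+, can hang on a long non-match because the engine tries every way of splitting the input between the inner and outer loops. Rewrite them so only one split is possible, or reach for atomic groups where the engine has them.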
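A rough Python sketch of the blow-up and the simplest fix, which is flattening the nesting. The helper time_non_match is a hypothetical name, and n is kept deliberately small so the nested pattern still finishes quickly.

```python
import re
import time

nested = re.compile(r"(a+)+$")   # overlapping: many ways to split a run of a's
flat = re.compile(r"a+$")        # accepts the same strings, one way to match

def time_non_match(pattern, n):
    """Return (match result, seconds) for n a's followed by a 'b'."""
    subject = "a" * n + "b"
    start = time.perf_counter()
    result = pattern.match(subject)
    return result, time.perf_counter() - start

# In a backtracking engine the nested version roughly doubles its work
# for each extra "a"; raising n much further is how it appears to hang.
nested_result, nested_secs = time_non_match(nested, 18)
flat_result, flat_secs = time_non_match(flat, 18)
```

Both patterns reject the input, but comparing nested_secs with flat_secs shows the gap. Python 3.11 and later also accept atomic groups, (?>...), and possessive quantifiers such as a++, which stop the engine from revisiting a run it has already consumed.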
Unicode is not a given
\d and \w are Unicode-aware in some engines and ASCII-only in others. Test against the runtime you actually ship.
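Python 3 is one runtime where the default is Unicode-aware for str patterns, with re.ASCII as the opt-out. A minimal sketch; the sample string, which contains Arabic-Indic digits and an accented letter, is invented:

```python
import re

# "\u0664\u0662" is the number 42 in Arabic-Indic digits; "\u00e9" is e-acute.
sample = "room 42, salle \u0664\u0662, caf\u00e9"

# Default for str patterns: \d and \w follow Unicode categories.
unicode_digits = re.findall(r"\d+", sample)
unicode_words = re.findall(r"\w+", sample)

# re.ASCII restricts \d to [0-9] and \w to [a-zA-Z0-9_].
ascii_digits = re.findall(r"\d+", sample, flags=re.ASCII)
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)
```

Under re.ASCII the Arabic-Indic number disappears and the last word is cut short at the accented letter, which is the kind of difference worth testing before shipping.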