go-fastregexp
This Go library speeds up the evaluation of regular expressions, especially on larger inputs. It works by first extracting a set of string literals that the input must contain for a match from the regular expression by means of static analysis. Secondly, the input is scanned for the presence of any such string literal. Finally, substrings of the original input surrounding those matches are then evaluated by the original regular expression.
An example use case is secret token detection.
Benchmark
The table belows show the results of running the benchmarks in matcher_bench_test.go on an Apple Macbook Pro M1 Max (14-inch, 2021, 10 cores: 8 performance and 2 efficiency). Baseline corresponds to regexp.Regexp.Find()
and Optimized to fastregexp.Matcher.Find()
. Lower is better.
Test case |
Input length till match (bytes) |
Baseline (ns/op) |
Optimized (ns/op) |
Speedup (x) |
literal prefix |
100 |
277 |
300 |
0.92 |
literal prefix |
1000 |
370 |
382 |
0.96 |
literal prefix |
10000 |
1531 |
1062 |
1.44 |
literal prefix |
100000 |
8235 |
7802 |
1.05 |
literal prefix |
1000000 |
76957 |
77113 |
0.99 |
no prefix |
100 |
2054 |
2064 |
0.99 |
no prefix |
1000 |
17861 |
4820 |
3.7 |
no prefix |
10000 |
241428 |
5544 |
43.54 |
no prefix |
100000 |
2489774 |
12183 |
204.36 |
no prefix |
1000000 |
34577379 |
81190 |
425.88 |
case folded prefix |
100 |
1961 |
2194 |
0.89 |
case folded prefix |
1000 |
17101 |
6323 |
2.7 |
case folded prefix |
10000 |
224752 |
20840 |
10.78 |
case folded prefix |
100000 |
2294908 |
166169 |
13.81 |
case folded prefix |
1000000 |
30792046 |
1616105 |
19.05 |
The "literal prefix" case is already optimized by Golang's implementation, see regexp.Regexp.LiteralPrefix
, hence the absence of significant speedups.
How it works
Golang's regexp/syntax packages parses regular expressions and compiles them into Nondeterministic Finite Automaton (NFA).
The approach chosen by regexp/syntax
to encode those NFAs as programs (syntax.Prog
) consisting of a list of instruction ([]syntax.Inst
), which can be interpreted to match a regexp against some input. Instructions capture both NFA state and transition information.
Each program defines a single start instruction (Prog.Start int
), as an index into the slice of instructions (Prog.Inst []Inst
). Similarly to how CPUs work, a symbolic program counter pc
register can be initialized to that value to drive the interpreter. Because the program represents an NFA, as opposed to a DFA, the interpreter in general has to consider multiple program counters at once.
Literal extraction needs to consider the following instructions:
InstRune1
Matches exactly one rune. Like all other InstRune*
instructions, it moves the program counter pc
to Inst.Out
if the rune matched, or aborts the interpretation of pc
if it failed to match
InstAlt
/ InstAltMatch
"Forks" the interpreter so that it has to consider two potential program points: Inst.Out
and Inst.Alt
InstMatch
The regexp matched succesfully
InstFail
Aborts the interpretation of pc
InstRune
A more complex version of InstRune1
. It can match one rune against one or more ranges of runes, and supports case folding as well
InstRuneAny
/ InstRuneAnyNotNL
Matches any rune, including or excluding the newline one
Conceptually, we would like to find the longest consecutive chain of InstRune1
instructions from the start instruction until the InstMatch
one.
Because the interpreter forks upon encountering InstAlt
and InstAltMatch
instructions, we actually extracts the longest consecutive chain of InstRune1
for every possible path. From there, the most frequently occurring literal across all paths is picked, preferring longer (ab
over c
) and smaller (a
over b
) literals to break ties. Let's consider a few examples to illustrate this:
abc
returns abc
a.b
yields literals a
(path #1) and b
(also path #1). a
is returned because its length is identical to b
but is lexically smaller. Note that .
matches any rune in the regexp
ab.c
yields literals ab
(path #1) and c
(also path #1). ab
is returned because it is longer than c
(ab|c)
yields ab
(path #1) and c
(path #2). Both literals are returned
(ab|c)d
yields literals ab
(path #1), c
(path #2) and d
(paths #1 and #2). Only d
is returned because it is common to both paths, even if ab
is longer
The idea is to prefer literals with higher entropy to reduce the number of potential preliminary matches (see below) while minimizing the number of returned literals to reduce the amount of bytes.Index()
calls done during preliminary matching.
Case folding is supported by the InstRune
instruction when Inst.Arg & syntax.FoldCase != 0
. In such cases, the returned literal will hold the smallest fold for each rune, eg. A
instead of a
. Generally, non-folded literals (ie. exact matches) are preferred over folded ones.
The number of paths grows exponentially with the number of InstAlt
and InstAltMatch
instructions. For example, a program containing n=4
instructions of type InstAlt
, the number of paths will bounded by 2^4 = 16
. A limit on the number of interpreted paths must be set to prevent Denial of Service attacks.
Preliminary matching
Preliminary matching iterates over all the previously extracted literals and performs a series of bytes.Index()
calls on slices of the input to locate every potential match. Case-folded literals are matched against the case-folded input.
Final matching
A user specified window of bytes is taken before and after each preliminary match to form slices of the input, which are passed to the regexp.
Future work
- Support non-ASCII literals
- Use Aho–Corasick algorithm to scan for multiple literals in parallel rather than calling
string.Contains()
for each literal
- Infer the maximal substring length from the regular expression itself rather than relying on hardcoded values
- Unroll rune ranges only when strictly required to reduce path explosion
- Merge overlapping substrings together to reduce the number of times the regular expression is evaluated