fastregexp

package module

v0.3.0 Latest Latest Go to latest Published: Jul 13, 2022 License: MIT Imports: 7 Imported by: 0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

gitlab.com/gitlab-org/vulnerability-research/foss/go-fastregexp

Links

Open Source Insights

README ¶

go-fastregexp

This Go library speeds up the evaluation of regular expressions, especially on larger inputs. It works by first extracting a set of string literals that the input must contain for a match from the regular expression by means of static analysis. Secondly, the input is scanned for the presence of any such string literal. Finally, substrings of the original input surrounding those matches are then evaluated by the original regular expression.

An example use case is secret token detection.

Benchmark

The table belows show the results of running the benchmarks in matcher_bench_test.go on an Apple Macbook Pro M1 Max (14-inch, 2021, 10 cores: 8 performance and 2 efficiency). Baseline corresponds to regexp.Regexp.Find() and Optimized to fastregexp.Matcher.Find(). Lower is better.

Test case	Input length till match (bytes)	Baseline (ns/op)	Optimized (ns/op)	Speedup (x)
literal prefix	100	277	300	0.92
literal prefix	1000	370	382	0.96
literal prefix	10000	1531	1062	1.44
literal prefix	100000	8235	7802	1.05
literal prefix	1000000	76957	77113	0.99
no prefix	100	2054	2064	0.99
no prefix	1000	17861	4820	3.7
no prefix	10000	241428	5544	43.54
no prefix	100000	2489774	12183	204.36
no prefix	1000000	34577379	81190	425.88
case folded prefix	100	1961	2194	0.89
case folded prefix	1000	17101	6323	2.7
case folded prefix	10000	224752	20840	10.78
case folded prefix	100000	2294908	166169	13.81
case folded prefix	1000000	30792046	1616105	19.05

The "literal prefix" case is already optimized by Golang's implementation, see regexp.Regexp.LiteralPrefix, hence the absence of significant speedups.

How it works

Literal extraction

Golang's regexp/syntax packages parses regular expressions and compiles them into Nondeterministic Finite Automaton (NFA).

The approach chosen by regexp/syntax to encode those NFAs as programs (syntax.Prog) consisting of a list of instruction ([]syntax.Inst), which can be interpreted to match a regexp against some input. Instructions capture both NFA state and transition information.

Each program defines a single start instruction (Prog.Start int), as an index into the slice of instructions (Prog.Inst []Inst). Similarly to how CPUs work, a symbolic program counter pc register can be initialized to that value to drive the interpreter. Because the program represents an NFA, as opposed to a DFA, the interpreter in general has to consider multiple program counters at once.

Literal extraction needs to consider the following instructions:

InstRune1 Matches exactly one rune. Like all other InstRune* instructions, it moves the program counter pc to Inst.Out if the rune matched, or aborts the interpretation of pc if it failed to match
InstAlt / InstAltMatch "Forks" the interpreter so that it has to consider two potential program points: Inst.Out and Inst.Alt
InstMatch The regexp matched succesfully
InstFail Aborts the interpretation of pc
InstRune A more complex version of InstRune1. It can match one rune against one or more ranges of runes, and supports case folding as well
InstRuneAny / InstRuneAnyNotNL Matches any rune, including or excluding the newline one

Conceptually, we would like to find the longest consecutive chain of InstRune1 instructions from the start instruction until the InstMatch one.

Because the interpreter forks upon encountering InstAlt and InstAltMatch instructions, we actually extracts the longest consecutive chain of InstRune1 for every possible path. From there, the most frequently occurring literal across all paths is picked, preferring longer (ab over c) and smaller (a over b) literals to break ties. Let's consider a few examples to illustrate this:

abc returns abc
a.b yields literals a (path #1) and b (also path #1). a is returned because its length is identical to b but is lexically smaller. Note that . matches any rune in the regexp
ab.c yields literals ab (path #1) and c (also path #1). ab is returned because it is longer than c
(ab|c) yields ab (path #1) and c (path #2). Both literals are returned
(ab|c)d yields literals ab (path #1), c (path #2) and d (paths #1 and #2). Only d is returned because it is common to both paths, even if ab is longer

The idea is to prefer literals with higher entropy to reduce the number of potential preliminary matches (see below) while minimizing the number of returned literals to reduce the amount of bytes.Index() calls done during preliminary matching.

Case folding is supported by the InstRune instruction when Inst.Arg & syntax.FoldCase != 0. In such cases, the returned literal will hold the smallest fold for each rune, eg. A instead of a. Generally, non-folded literals (ie. exact matches) are preferred over folded ones.

The number of paths grows exponentially with the number of InstAlt and InstAltMatch instructions. For example, a program containing n=4 instructions of type InstAlt, the number of paths will bounded by 2^4 = 16. A limit on the number of interpreted paths must be set to prevent Denial of Service attacks.

Preliminary matching

Preliminary matching iterates over all the previously extracted literals and performs a series of bytes.Index() calls on slices of the input to locate every potential match. Case-folded literals are matched against the case-folded input.

Final matching

A user specified window of bytes is taken before and after each preliminary match to form slices of the input, which are passed to the regexp.

Future work

Support non-ASCII literals
Use Aho–Corasick algorithm to scan for multiple literals in parallel rather than calling string.Contains() for each literal
Infer the maximal substring length from the regular expression itself rather than relying on hardcoded values
Unroll rune ranges only when strictly required to reduce path explosion
Merge overlapping substrings together to reduce the number of times the regular expression is evaluated

Documentation ¶

Index ¶

Constants
func FindLiterals(regex string) (lits []string, foldedLits []string, err error)
type Matcher
- func Compile(expr string) (*Matcher, error)

Constants ¶

View Source

const DefaultWindowLength = 256

Variables ¶

This section is empty.

Functions ¶

func FindLiterals ¶

func FindLiterals(regex string) (lits []string, foldedLits []string, err error)

FindLiterals from the provided regex both in literal form and case folded literal form. Returns error if unable to compile, or nil if unable to extract literals.

Types ¶

type Matcher ¶

type Matcher struct {
	// contains filtered or unexported fields
}

func Compile ¶

func Compile(expr string) (*Matcher, error)

func (*Matcher) Find ¶

func (m *Matcher) Find(input []byte) []byte

func (*Matcher) FindWindow ¶

func (m *Matcher) FindWindow(input []byte, windowLength int) []byte

func (*Matcher) Optimized ¶

func (m *Matcher) Optimized() bool

func (*Matcher) String ¶

func (m *Matcher) String() string

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL