fastregexp

package module
v0.3.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 13, 2022 License: MIT Imports: 7 Imported by: 0

README

go-fastregexp

This Go library speeds up the evaluation of regular expressions, especially on larger inputs. It works by first extracting a set of string literals that the input must contain for a match from the regular expression by means of static analysis. Secondly, the input is scanned for the presence of any such string literal. Finally, substrings of the original input surrounding those matches are then evaluated by the original regular expression.

An example use case is secret token detection.

Benchmark

The table belows show the results of running the benchmarks in matcher_bench_test.go on an Apple Macbook Pro M1 Max (14-inch, 2021, 10 cores: 8 performance and 2 efficiency). Baseline corresponds to regexp.Regexp.Find() and Optimized to fastregexp.Matcher.Find(). Lower is better.

Test case Input length till match (bytes) Baseline (ns/op) Optimized (ns/op) Speedup (x)
literal prefix 100 277 300 0.92
literal prefix 1000 370 382 0.96
literal prefix 10000 1531 1062 1.44
literal prefix 100000 8235 7802 1.05
literal prefix 1000000 76957 77113 0.99
no prefix 100 2054 2064 0.99
no prefix 1000 17861 4820 3.7
no prefix 10000 241428 5544 43.54
no prefix 100000 2489774 12183 204.36
no prefix 1000000 34577379 81190 425.88
case folded prefix 100 1961 2194 0.89
case folded prefix 1000 17101 6323 2.7
case folded prefix 10000 224752 20840 10.78
case folded prefix 100000 2294908 166169 13.81
case folded prefix 1000000 30792046 1616105 19.05

The "literal prefix" case is already optimized by Golang's implementation, see regexp.Regexp.LiteralPrefix, hence the absence of significant speedups.

How it works

Literal extraction

Golang's regexp/syntax packages parses regular expressions and compiles them into Nondeterministic Finite Automaton (NFA).

The approach chosen by regexp/syntax to encode those NFAs as programs (syntax.Prog) consisting of a list of instruction ([]syntax.Inst), which can be interpreted to match a regexp against some input. Instructions capture both NFA state and transition information.

Each program defines a single start instruction (Prog.Start int), as an index into the slice of instructions (Prog.Inst []Inst). Similarly to how CPUs work, a symbolic program counter pc register can be initialized to that value to drive the interpreter. Because the program represents an NFA, as opposed to a DFA, the interpreter in general has to consider multiple program counters at once.

Literal extraction needs to consider the following instructions:

  1. InstRune1 Matches exactly one rune. Like all other InstRune* instructions, it moves the program counter pc to Inst.Out if the rune matched, or aborts the interpretation of pc if it failed to match
  2. InstAlt / InstAltMatch "Forks" the interpreter so that it has to consider two potential program points: Inst.Out and Inst.Alt
  3. InstMatch The regexp matched succesfully
  4. InstFail Aborts the interpretation of pc
  5. InstRune A more complex version of InstRune1. It can match one rune against one or more ranges of runes, and supports case folding as well
  6. InstRuneAny / InstRuneAnyNotNL Matches any rune, including or excluding the newline one

Conceptually, we would like to find the longest consecutive chain of InstRune1 instructions from the start instruction until the InstMatch one.

Because the interpreter forks upon encountering InstAlt and InstAltMatch instructions, we actually extracts the longest consecutive chain of InstRune1 for every possible path. From there, the most frequently occurring literal across all paths is picked, preferring longer (ab over c) and smaller (a over b) literals to break ties. Let's consider a few examples to illustrate this:

  1. abc returns abc
  2. a.b yields literals a (path #1) and b (also path #1). a is returned because its length is identical to b but is lexically smaller. Note that . matches any rune in the regexp
  3. ab.c yields literals ab (path #1) and c (also path #1). ab is returned because it is longer than c
  4. (ab|c) yields ab (path #1) and c (path #2). Both literals are returned
  5. (ab|c)d yields literals ab (path #1), c (path #2) and d (paths #1 and #2). Only d is returned because it is common to both paths, even if ab is longer

The idea is to prefer literals with higher entropy to reduce the number of potential preliminary matches (see below) while minimizing the number of returned literals to reduce the amount of bytes.Index() calls done during preliminary matching.

Case folding is supported by the InstRune instruction when Inst.Arg & syntax.FoldCase != 0. In such cases, the returned literal will hold the smallest fold for each rune, eg. A instead of a. Generally, non-folded literals (ie. exact matches) are preferred over folded ones.

The number of paths grows exponentially with the number of InstAlt and InstAltMatch instructions. For example, a program containing n=4 instructions of type InstAlt, the number of paths will bounded by 2^4 = 16. A limit on the number of interpreted paths must be set to prevent Denial of Service attacks.

Preliminary matching

Preliminary matching iterates over all the previously extracted literals and performs a series of bytes.Index() calls on slices of the input to locate every potential match. Case-folded literals are matched against the case-folded input.

Final matching

A user specified window of bytes is taken before and after each preliminary match to form slices of the input, which are passed to the regexp.

Future work

  1. Support non-ASCII literals
  2. Use Aho–Corasick algorithm to scan for multiple literals in parallel rather than calling string.Contains() for each literal
  3. Infer the maximal substring length from the regular expression itself rather than relying on hardcoded values
  4. Unroll rune ranges only when strictly required to reduce path explosion
  5. Merge overlapping substrings together to reduce the number of times the regular expression is evaluated

Documentation

Index

Constants

View Source
const DefaultWindowLength = 256

Variables

This section is empty.

Functions

func FindLiterals

func FindLiterals(regex string) (lits []string, foldedLits []string, err error)

FindLiterals from the provided regex both in literal form and case folded literal form. Returns error if unable to compile, or nil if unable to extract literals.

Types

type Matcher

type Matcher struct {
	// contains filtered or unexported fields
}

func Compile

func Compile(expr string) (*Matcher, error)

func (*Matcher) Find

func (m *Matcher) Find(input []byte) []byte

func (*Matcher) FindWindow

func (m *Matcher) FindWindow(input []byte, windowLength int) []byte

func (*Matcher) Optimized

func (m *Matcher) Optimized() bool

func (*Matcher) String

func (m *Matcher) String() string

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL