ahocorasick

package module
v1.0.3 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 16, 2020 License: MIT Imports: 9 Imported by: 12

README

Aho-Corasick

Build Status Go Version Latest Tag

Implementation of the Aho-Corasick string-search algorithm in Go.

Licensed under MIT License.

Details

This implementation does not use a Double-Array Trie as in my implementation from a couple of years back.

This reduces the build time drastically, but at the cost of higher memory consumption.

The search time is still fast, and comparable to other Go implementations I have found on github that claims to be fast (see performance).

Documentation

Can be found at godoc.org.

Example Usage

Use a TrieBuilder to build a Trie:

trie := NewTrieBuilder().
    AddStrings([]string{"or", "amet"}).
    Build()

Then go and match something interesting:

matches := trie.MatchString("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
fmt.Printf("Got %d matches.\n", len(matches))

// => Got 3 matches.

What did we match?

for _, match := range matches {
    fmt.Printf("Matched pattern %d %q at position %d.\n", match.Match(),
        match.Pattern(), match.Pos())
}

// => Matched pattern 0 "or" at position 1.
// => Matched pattern 0 "or" at position 15.
// => Matched patterh 1 "amet" at position 22.

Building

You can easily load patterns from file:

builder := NewTrieBuilder()
builder.LoadPatterns("patterns.txt")
builder.LoadStrings("strings.txt")

Both functions expects a text file with one pattern per line. LoadPatterns expects the pattern to be in hexadecimal form.

Storing

Use Encode to store a Trie in gzip compressed binary format:

f, err := os.Create("trie.gz")
err := Encode(f, trie)

And Decode to load it from binary format:

f, err := os.Open("trie.gz")
trie, err := Decode(f)

Performance

Some simple benchmarking on my machine (Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz, 32 GiB RAM).

Build and search time grows quite linearly with regards to number of patterns and input text length.

Building
BenchmarkTrieBuild/100-12                    10000              0.1460 ms/op
BenchmarkTrieBuild/1000-12                    1000              2.1643 ms/op
BenchmarkTrieBuild/10000-12                    100             14.3305 ms/op
BenchmarkTrieBuild/100000-12                    10            131.2442 ms/op
Searching
BenchmarkMatchIbsen/100-12                 2000000              0.0006 ms/op
BenchmarkMatchIbsen/1000-12                 300000              0.0042 ms/op
BenchmarkMatchIbsen/10000-12                 30000              0.0436 ms/op
BenchmarkMatchIbsen/100000-12                 3000              0.4310 ms/op
Compared to Other Implementation

See aho-corasick-benchmark.

Memory Usage

As mentioned, the memory consumption will be quite high compared to a double-array trie implementation. Especially during the build phase (which currently contains a lot of object allocations).

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Encode added in v1.0.2

func Encode(w io.Writer, trie *Trie) error

Encode writes a Trie to w in gzip compressed binary format.

func MatchEqual

func MatchEqual(a, b *Match) bool

MatchEqual check whether two matches are equal (i.e. at same position, pattern and same pattern).

Types

type Match

type Match struct {
	// contains filtered or unexported fields
}

Match represents a matched pattern in the input.

func (*Match) Match

func (m *Match) Match() []byte

Match returns the pattern matched.

func (*Match) MatchString

func (m *Match) MatchString() string

MatchString returns the pattern matched as a string.

func (*Match) Pattern added in v1.0.3

func (m *Match) Pattern() int64

Pattern returns the pattern id of the match.

func (*Match) Pos

func (m *Match) Pos() int64

Pos returns the byte position of the match.

func (*Match) String

func (m *Match) String() string

type Trie

type Trie struct {
	// contains filtered or unexported fields
}

Trie represents a trie of patterns with extra links as per the Aho-Corasick algorithm.

func Decode added in v1.0.2

func Decode(r io.Reader) (*Trie, error)

Decode reads a Trie in gzip compressed binary format from r.

func (*Trie) Match

func (tr *Trie) Match(input []byte) []*Match

Match runs the Aho-Corasick string-search algorithm on a byte input.

func (*Trie) MatchFirst

func (tr *Trie) MatchFirst(input []byte) *Match

MatchFirst is the same as Match, but returns after first successful match.

func (*Trie) MatchFirstString

func (tr *Trie) MatchFirstString(input string) *Match

MatchFirstString is the same as MatchString, but returns after first successful match.

func (*Trie) MatchString

func (tr *Trie) MatchString(input string) []*Match

MatchString runs the Aho-Corasick string-search algorithm on a string input.

func (*Trie) Walk added in v1.0.1

func (tr *Trie) Walk(input []byte, fn WalkFn)

Walk runs the algorithm on a given output, calling the supplied callback function on every match. The algorithm will terminate if the callback function returns false.

type TrieBuilder

type TrieBuilder struct {
	// contains filtered or unexported fields
}

TrieBuilder is used to build Tries.

func NewTrieBuilder

func NewTrieBuilder() *TrieBuilder

NewTrieBuilder creates and initializes a new TrieBuilder.

func (*TrieBuilder) AddPattern

func (tb *TrieBuilder) AddPattern(pattern []byte) *TrieBuilder

AddPattern adds a byte pattern to the Trie under construction.

func (*TrieBuilder) AddPatterns

func (tb *TrieBuilder) AddPatterns(patterns [][]byte) *TrieBuilder

AddPatterns adds multiple byte patterns to the Trie.

func (*TrieBuilder) AddString

func (tb *TrieBuilder) AddString(pattern string) *TrieBuilder

AddString adds a string pattern to the Trie under construction.

func (*TrieBuilder) AddStrings

func (tb *TrieBuilder) AddStrings(patterns []string) *TrieBuilder

AddStrings add multiple strings to the Trie.

func (*TrieBuilder) Build

func (tb *TrieBuilder) Build() *Trie

Build constructs the Trie.

func (*TrieBuilder) LoadPatterns

func (tb *TrieBuilder) LoadPatterns(path string) error

LoadPatterns loads byte patterns from a file. Expects one pattern per line in hexadecimal form.

func (*TrieBuilder) LoadStrings

func (tb *TrieBuilder) LoadStrings(path string) error

LoadStrings loads string patterns from a file. Expects one pattern per line.

type WalkFn added in v1.0.1

type WalkFn func(end, n, pattern int64) bool

Walk calls this function on any match, giving the end position, length of the matched bytes, and the pattern number.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL