ahocorasick

package module
v0.0.0-...-9d454b2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 13, 2019 License: MIT Imports: 7 Imported by: 2

README

Aho-Corasick

⚠ Use the newer aho-corasick instead. This implementation does not scale well with number of patterns. ⚠

Build Status

Aho-Corasick string search algorithm implemented in Go.

Uses a double array trie for improved access speeds and reduced memory consumption.

Licensed under MIT License.

Documentation

Can be found here.

Usage

Use a TrieBuilder to create a Trie:

trie := NewTrieBuilder().
    AddStrings([]string{"hers", "his", "he", "she"}).
    Build()

Match something:

matches := trie.MatchString("I have never tasted a hershey bar.")
fmt.Printf("We got %d matches.\n", len(matches))

// => We got 4 matches.

Examine matches:

for _, match := range matches {
    fmt.Printf("Matched %q at offset %d.\n", match.Match(), match.Pos())
}

// => Matched "he" at offset 22.
// => Matched "hers" at offset 22.
// => Matched "she" at offset 25.
// => Matched "he" at offset 26.

For debugging you may output the trie in DOT format:

NewTrieGrapher(trie).DrawFailLinks(true).Graph("example.dot")

And convert to image, e.g.:

$ dot -Tpng -o example.png example.dot

example-trie

Building

You can use ReadStrings or ReadHex to read patterns from a file (one pattern on each line).

patterns, err := ReadStrings("patterns.txt")
if err != nil {
    log.Fatal(err)
}

trie := NewTrieBuilder().AddPatterns(patterns).Build()

Saving/Loading

Building a large trie can take some time:

chart

So you can create a trie and save to file and load it instead of recreating it each time:

err := SaveTrie(trie, "my.trie")
if err != nil {
    log.Fatal(err)
}

And later:

trie, err := LoadTrie("my.trie")
if err != nil {
    log.Fatal(err)
}

Performance

Tested on a Dell XPS (i7-6700HQ @ 2.60GHz and 16 GiB RAM).

Building
BenchmarkBuildNSF/10-8         	  200000	      11871 ns/op
BenchmarkBuildNSF/50-8         	   20000	      66552 ns/op
BenchmarkBuildNSF/100-8        	   10000	     168264 ns/op
BenchmarkBuildNSF/500-8        	    1000	    2118400 ns/op
BenchmarkBuildNSF/1000-8       	     200	    6864025 ns/op
BenchmarkBuildNSF/5000-8       	      10	  126533350 ns/op
BenchmarkBuildNSF/10000-8      	       2	  504624124 ns/op
BenchmarkBuildNSF/50000-8      	       1	11154774829 ns/op
BenchmarkBuildNSF/100000-8     	       1	45294850018 ns/op

build-chart

As you can see, building gets a bit rough above 10,000 patterns.

Matching

Using 10,000 patterns.

BenchmarkMatchIbsen/100-8         	 1000000	      1193 ns/op
BenchmarkMatchIbsen/500-8         	  300000	      5279 ns/op
BenchmarkMatchIbsen/1000-8        	  200000	      9704 ns/op
BenchmarkMatchIbsen/5000-8        	   30000	     49823 ns/op
BenchmarkMatchIbsen/10000-8       	   20000	    102436 ns/op
BenchmarkMatchIbsen/50000-8       	    3000	    490882 ns/op
BenchmarkMatchIbsen/100000-8      	    2000	    976724 ns/op

match-chard

Matching follows the input more linearly and is quite fast.

Memory usage

Haven't tested this properly, but a quick test with 10,000 patterns gave Trie with size 99830 (that is, the length of all its slices combined). This should in theory equal around 0.76MiB. I do not know the internals of golang enough to know how this is in practice.

The memory usage is (obviously) higher when actually doing matching.

Documentation

Overview

Package ahocorasick implements the Aho-Corasick string searching algorithm in Go.

The algorithm is implemented using a double array trie for increased access speed and reduced memory consumption.

The algorithm uses an alphabet size of 256, so can only be used to match byte patterns.

Index

Examples

Constants

View Source
const (
	AlphabetSize int64 = 256 // The size of the alphabet is fixed to the size of a byte.
	RootState    int64 = 0   // The root state of the trie is always 0.
	EmptyCell    int64 = -1  // Represents an unused cell.
	DefaultBase  int64 = 0   // The default base for new states.
)
View Source
const MagicNumber int32 = 0x45495254

Variables

This section is empty.

Functions

func DecodeByte

func DecodeByte(e int64) byte

func EncodeByte

func EncodeByte(b byte) int64

EncodeByte optimizes for ASCII text by shifting to 0x41 ('A'). Also adds one to avoid byte == 0.

func MatchEqual

func MatchEqual(m1, m2 *Match) bool

Check if two matches are equal.

func ReadHex

func ReadHex(path string) ([][]byte, error)

Read patterns in hex format, one pattern on each line.

func ReadStrings

func ReadStrings(path string) ([][]byte, error)

Read string patterns from a text file, one pattern on each line.

func SaveTrie

func SaveTrie(tr *Trie, path string) error

Save a Trie to file.

Types

type Match

type Match struct {
	// contains filtered or unexported fields
}

Represents a matched pattern.

func (*Match) End

func (m *Match) End() int64

Get the end position of the matched pattern.

func (*Match) Match

func (m *Match) Match() []byte

Get the matched byte pattern.

func (*Match) MatchString

func (m *Match) MatchString() string

Just to make working with strings a little more comfortable.

func (*Match) Pos

func (m *Match) Pos() int64

Get the position (offset) of the matched pattern.

func (*Match) String

func (m *Match) String() string

type Trie

type Trie struct {
	// contains filtered or unexported fields
}

Trie implementing the Aho-Corasick algorithm. Uses two arrays (base and check) for transitions (as described by Aho).

A transition in the trie from state s to state t on symbol c is described by:

base[s] + c = t
check[t] = s

Note that the symbol c is actually stored as c + 1 in this implementation for easier handling of transition on 0.

func LoadTrie

func LoadTrie(path string) (*Trie, error)

Read a Trie from file.

func (*Trie) Match

func (tr *Trie) Match(input []byte) []*Match

Run the Trie against the provided input and returns potentially matches.

func (*Trie) MatchFirst

func (tr *Trie) MatchFirst(input []byte) *Match

Same as Match, but returns immediately after the first matched pattern.

func (*Trie) MatchString

func (tr *Trie) MatchString(input string) []*Match

Helper method to make matching strings a little more comfortable.

Example
trie := NewTrieBuilder().
	AddStrings([]string{"hers", "his", "he", "she"}).
	Build()
matches := trie.MatchString("she is here")
fmt.Println(len(matches))
Output:

3

func (*Trie) MatchStringFirst

func (tr *Trie) MatchStringFirst(input string) *Match

Helper method to make matching a string a little more comfortable.

func (*Trie) NumPatterns

func (tr *Trie) NumPatterns() int64

type TrieBuilder

type TrieBuilder struct {
	// contains filtered or unexported fields
}

A TrieBuilder must be used to properly build Tries.

func NewTrieBuilder

func NewTrieBuilder() *TrieBuilder

Create and initialize a new TrieBuilder.

func (*TrieBuilder) AddPattern

func (tb *TrieBuilder) AddPattern(pattern []byte) *TrieBuilder

Add a new pattern to be built into the resulting Trie.

func (*TrieBuilder) AddPatterns

func (tb *TrieBuilder) AddPatterns(patterns [][]byte) *TrieBuilder

A helper method to make adding multiple patterns a little more comfortable.

func (*TrieBuilder) AddString

func (tb *TrieBuilder) AddString(pattern string) *TrieBuilder

A helper method to make adding a string pattern more comfortable.

func (*TrieBuilder) AddStrings

func (tb *TrieBuilder) AddStrings(patterns []string) *TrieBuilder

A helper method to make adding multiple string patterns a little more comfortable.

func (*TrieBuilder) Build

func (tb *TrieBuilder) Build() *Trie

Build the trie.

Example
builder := NewTrieBuilder()

builder.AddPattern([]byte{0x44, 0x22, 0x31, 0x52, 0x32, 0x00, 0x01, 0x01})
builder.AddStrings([]string{"hello", "world"})

trie := builder.Build()

fmt.Println(len(trie.MatchString("hello!")))
Output:

1

type TrieGrapher

type TrieGrapher struct {
	// contains filtered or unexported fields
}

A TrieGrapher is used to output a Trie in the DOT graph description language.

func NewTrieGrapher

func NewTrieGrapher(trie *Trie) *TrieGrapher

Create a new TrieGrapher.

func (tg *TrieGrapher) DrawFailLinks(b bool) *TrieGrapher

Toggle inclusion of fail links in the graph.

func (*TrieGrapher) Graph

func (tg *TrieGrapher) Graph(path string) error

Output the DOT graph to a file.

Example
trie := NewTrieBuilder().
	AddStrings([]string{"his", "hers", "he", "she"}).
	Build()

grapher := NewTrieGrapher(trie).DrawFailLinks(true)
grapher.Graph("trie.dot")

if _, err := exec.LookPath("dot"); err == nil {
	if err := exec.Command("dot", "-Tpng", "-o", "trie.png", "trie.dot").Run(); err != nil {
		fmt.Println(err)
	} else {
		fmt.Println("OK")
	}
	// Output: OK
} else {
	
Output:

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL