trie

package module
v0.0.0-...-1e91994 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 17, 2022 License: MIT Imports: 5 Imported by: 0

README

Trie

GoDoc Go Report Card

An Aho-Corasick algorithm based string-searching utility for Go. It supports tokenization, ignoring case, replacing text. So you can use it to find keywords in an article, filter sensitive words, etc.

Implementation in Java:Trie4j

Introduction

判断一个字符串是否包含另一个字符串,我们通常使用strings.Index()strings.Contains()进行判断,其底层实现基于RK、KMP、BM和Sunday等算法。如果要判断一个字符串是否包含多个字符串,比如在一篇文章找几个敏感词,继续使用上述的字符串搜索算法显然是不合适,这种场景就需要用到多模式匹配算法。

Aho–Corasick 算法是由贝尔实验室的 Alfred V. Aho 和 Margaret J. Corasick 在 1975 年发明的一种字符串搜索算法。它是一种字典匹配算法,可在输入文本中定位有限字符串集(字典)的元素。它同时匹配所有字符串。该算法的复杂性与字符串长度加上搜索文本的长度加上输出匹配的数量成线性关系。

该算法主要依靠构造一个有限状态机来实现,然后通过失配指针在查找字符串失败时进行回退,转向某前缀的其他分支,免于重复匹配前缀,提高算法效率。

Usage

FindAll
t := trie.New("雨疏", "风骤", "残酒", "卷帘人", "知否")
emits := t.FindAll("昨夜雨疏风骤,浓睡不消残酒。试问卷帘人,却道海棠依旧。知否,知否?应是绿肥红瘦。", false)
[2:4=雨疏, 4:6=风骤, 11:13=残酒, 16:19=卷帘人, 27:29=知否, 30:32=知否]
FindFirst
t := trie.New("雨疏", "风骤", "残酒", "卷帘人", "知否")
emit := t.FindFirst("昨夜雨疏风骤,浓睡不消残酒。试问卷帘人,却道海棠依旧。知否,知否?应是绿肥红瘦。", false)
2:4=雨疏
FindAll (Case Insensitive)
t := trie.New("poetry", "TRANSLATION")
emits := t.FindAll("Poetry is what gets lost in translation.", true)
[0:6=poetry, 28:39=TRANSLATION]
FindFirst (Case Insensitive)
t := trie.New("poetry", "TRANSLATION")
emit := t.FindFirst("Poetry is what gets lost in translation.", true)
0:6=poetry
Tokenize
s := "常记溪亭日暮,沉醉不知归路。兴尽晚回舟,误入藕花深处。争渡,争渡,惊起一滩鸥鹭。"
t := trie.New("溪亭", "归路", "藕花", "争渡")
emits := t.FindAll(s, false)
tokens := trie.Tokenize(emits, s)
["常记", "溪亭(2:4=溪亭)", "日暮,沉醉不知", "归路(11:13=归路)", "。兴尽晚回舟,误入", "藕花(22:24=藕花)", "深处。", "争渡(27:29=争渡)", ",", "争渡(30:32=争渡)", ",惊起一滩鸥鹭。"]
Replace
s := "我正在参加砍价,砍到0元就可以免费拿啦。亲~帮我砍一刀呗,咱们一起免费领好货。"
t := trie.New("0元", "砍一刀", "免费拿", "免费领")
emits := t.FindAll(s, false)
r1 := trie.Replace(emits, s, "*")
r2 := trie.Replace(emits, s, "@#$%^&*")
我正在参加砍价,砍到**就可以***啦。亲~帮我***呗,咱们一起***好货。
我正在参加砍价,砍到%^就可以#$%啦。亲~帮我%^&呗,咱们一起&*@好货。

License

This project is under the MIT license. See the LICENSE file for details.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Replace

func Replace(emits []*Emit, source string, replacement string) string

Types

type Emit

type Emit struct {
	Begin, End int
	Keyword    string
}

func RemoveContains

func RemoveContains(emits []*Emit) []*Emit

func RemoveOverlaps

func RemoveOverlaps(emits []*Emit) []*Emit

func (*Emit) Contains

func (e *Emit) Contains(o *Emit) bool

func (*Emit) Equals

func (e *Emit) Equals(o *Emit) bool

func (*Emit) Length

func (e *Emit) Length() int

func (*Emit) Overlaps

func (e *Emit) Overlaps(o *Emit) bool

func (*Emit) String

func (e *Emit) String() string

type Keyword

type Keyword struct {
	// contains filtered or unexported fields
}

type State

type State struct {
	// contains filtered or unexported fields
}

func (*State) AddKeyword

func (s *State) AddKeyword(keyword string)

func (*State) AddKeywords

func (s *State) AddKeywords(keywords []*Keyword)

func (*State) AddState

func (s *State) AddState(str string) *State

func (*State) GetState

func (s *State) GetState(c rune, ignoreCase bool) *State

func (*State) HasKeyword

func (s *State) HasKeyword(keyword string) bool

func (*State) NextState

func (s *State) NextState(c rune, ignoreCase bool) *State

type Token

type Token struct {
	Fragment string
	Emit     *Emit
}

func Tokenize

func Tokenize(emits []*Emit, source string) []*Token

func (*Token) IsMatch

func (t *Token) IsMatch() bool

func (*Token) String

func (t *Token) String() string

type Trie

type Trie struct {
	// contains filtered or unexported fields
}

func New

func New(keywords ...string) *Trie

func (*Trie) AddKeywords

func (t *Trie) AddKeywords(keywords ...string) *Trie

func (*Trie) FindAll

func (t *Trie) FindAll(text string, ignoreCase bool) []*Emit

func (*Trie) FindFirst

func (t *Trie) FindFirst(text string, ignoreCase bool) *Emit

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL