filter

package module
v1.2.6 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 10, 2022 License: Apache-2.0 Imports: 6 Imported by: 0

README

Golang Dirty Filter

GoDoc

基于DFA算法; 支持动态修改敏感词,同时支持特殊字符的筛选; 敏感词的存储支持内存存储及MongoDB存储。

获取

$ go get -v github.com/antlinker/go-dirtyfilter

使用

package main

import (
  "fmt"

  "github.com/antlinker/go-dirtyfilter"
  "github.com/antlinker/go-dirtyfilter/store"
)

var (
  filterText = `我是需要过滤的内容,内容为:**文@@件,需要过滤。。。`
)

func main() {
  memStore, err := store.NewMemoryStore(store.MemoryConfig{
    DataSource: []string{"文件"},
  })
  if err != nil {
    panic(err)
  }
  filterManage := filter.NewDirtyManager(memStore)
  result, err := filterManage.Filter().Filter(filterText, '*', '@')
  if err != nil {
    panic(err)
  }
  fmt.Println(result)
}

输出结果

[文件]

License

Copyright 2016.All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

DAT % git submodule add https://github.com/anknown/darts % git submodule add https://github.com/awsong/go-darts

https://github.com/colin0000007/darts-go https://zhuanlan.zhihu.com/p/113262718 https://github.com/heidawei/DoubleArrayTrie https://linux.thai.net/~thep/datrie/datrie.html https://www.co-ding.com/assets/pdf/dat.pdf

a DFA compression could be done using four linear arrays, namely default, base, next, and check. However, in a case simpler than the lexical analyzer, such as the mere trie for information retrieval, the default array could be omitted. Thus, a trie can be implemented using three arrays according to this scheme.

有限状态机可以用4个数组表示:default, base, next, and check trie树不需要默认值

The tripple-array structure is composed of:

base. Each element in base corresponds to a node of the trie. For a trie node s, base[s] is the starting index within the next and check pool (to be explained later) for the row of the node s in the transition table. next. This array, in coordination with check, provides a pool for the allocation of the sparse vectors for the rows in the trie transition table. The vector data, that is, the vector of transitions from every node, would be stored in this array. check. This array works in parallel to next. It marks the owner of every cell in next. This allows the cells next to one another to be allocated to different trie nodes. That means the sparse vectors of transitions from more than one node are allowed to be overlapped. Definition 1. For a transition from state s to t which takes character c as the input, the condition maintained in the tripple-array trie is:

c s----->t

check[base[s] + c] = s next[base[s] + c] = t

base:trie 树上的每一个节点,对于节点s,base[s]是状态转移表中next数组和check数组的开始index next:记录状态转移表的稀疏向量,所有的状态转移向量的目标节点会被存储在这里 check:记录状态转移向量的开始节点

Walking t := base[s] + c; if check[t] = s then next state := next[t] else fail endif

Construction Procedure Relocate(s : state; b : base_index) { Move base for state s to a new place beginning at b } begin foreach input character c for the state s { i.e. foreach c such that check[base[s] + c]] = s } begin check[b + c] := s; { mark owner } next[b + c] := next[base[s] + c]; { copy data } check[base[s] + c] := none { free the cell } end; base[s] := b end

Double-Array Trie The next/check pool may be able to keep in a single array of integer couples, but the base array does not grow in parallel to the pool, and is therefore usually split. next和check可以用pair存储,但是base不行

n the double-array structure, the base and next are merged, resulting in only two parallel arrays, namely, base and check.

check[base[s] + c] = s base[s] + c = t

Walking t := base[s] + c; if check[t] = s then next state := t else fail endif

Construction Procedure Relocate(s : state; b : base_index) { Move base for state s to a new place beginning at b } begin foreach input character c for the state s { i.e. foreach c such that check[base[s] + c]] = s } begin check[b + c] := s; { mark owner } base[b + c] := base[base[s] + c]; { copy data } { the node base[s] + c is to be moved to b + c; Hence, for any i for which check[i] = base[s] + c, update check[i] to b + c } foreach input character d for the node base[s] + c begin check[base[base[s] + c] + d] := b + c end; check[base[s] + c] := none { free the cell } end; base[s] := b end

by splitting non-branching suffixes into single string storages, called tail,

https://blog.csdn.net/zzran/article/details/8462002 可以这么理解dat里面,check 存储了状态转移向量,value是箭尾(从何状态而来),index是箭尾(目标状态) 节点值和弧线上的值都存储在base数组里面, https://www.cnblogs.com/zhangchaoyang/articles/4508266.html

c s----->t

check[t]=s base[s]+c=t

层次遍历来DFA的第二层

已知状态“阿”的下标是2,变量“根”、“胶”、“拉”的编号依次是4、5、6

给base[2]赋值:从小到大遍历所有的正整数,直到发现某个数正整k满足base[k+4]=base[k+5]=base[k+6]=check[k+4]=check[k+5]=check[k+6]=0。得到k=1,那么就把1赋给base[2],同时也确定了状态“阿根”、“阿胶”、“阿拉”的下标依次是k+4、k+5、k+6,即5、6、7,而且check[5]=check[6]=check[7]=2。

最后遍历一次DFA,当某个节点已经是一个词的结尾时,按下列方法修改其base值。 if(base[i]==0) base[i]=-i else base[i]=-base[i]

变量“阿”的编号是2,base[2]=1,变量“胶”的编号是5,base[2]+5=6,我们检查一下check[6]是否等于2。check[6]确实等于2,则继续看下一个状态转移。同时我们发现base[6]是负数,这说明“阿胶”已经是一个完整的词了。 继续看下一个状态转移,base[6]=-6,负数取其相反数,base[6]=6,变量“及”的编号是7,base[6]+7=13,我们检查一下check[13]是否等于6,发现不满足,则“阿胶及”不是一个词,甚至都是不是任意一个词的前缀。

Documentation

Index

Constants

View Source
const (
	// DefaultCheckInterval 敏感词检查频率(默认5秒检查一次)
	DefaultCheckInterval = time.Second * 5
)

Variables

This section is empty.

Functions

This section is empty.

Types

type DirtyFilter

type DirtyFilter interface {
	// Filter 文本过滤函数
	// excludes 表示排除指定的字符
	// 返回文本中出现的敏感词,如果敏感词不存在则返回nil
	// 如果出现异常,则返回error
	Filter(text string, excludes ...rune) ([]string, error)

	// FilterResult 文本过滤函数
	// excludes 表示排除指定的字符
	// 返回文本中出现的敏感词及出现次数,如果敏感词不存在则返回nil
	// 如果出现异常,则返回error
	FilterResult(text string, excludes ...rune) (map[string]int, map[string][]int, error)

	// FilterReader 从可读流中过滤敏感词
	// excludes 表示排除指定的字符
	// 返回可读流中出现的敏感词,如果敏感词不存在则返回nil
	// 如果出现异常,则返回error
	FilterReader(reader io.Reader, excludes ...rune) ([]string, error)

	// FilterReaderResult 从可读流中过滤敏感词
	// excludes 表示排除指定的字符
	// 返回可读流中出现的敏感词及出现次数,如果敏感词不存在则返回nil
	// 如果出现异常,则返回error
	FilterReaderResult(reader io.Reader, excludes ...rune) (map[string]int, map[string][]int, error)

	// Replace 使用字符替换文本中的敏感词
	// delim 替换的字符
	// 如果出现异常,则返回error
	Replace(text string, delim rune, excludes ...rune) (string, error)
}

DirtyFilter 提供敏感词过滤接口

func NewNodeChanFilter

func NewNodeChanFilter(text <-chan string) DirtyFilter

NewNodeChanFilter 创建节点过滤器,实现敏感词的过滤 从通道中读取敏感词数据

func NewNodeFilter

func NewNodeFilter(text []string) DirtyFilter

NewNodeFilter 创建节点过滤器,实现敏感词的过滤 从切片中读取敏感词数据

func NewNodeReaderFilter

func NewNodeReaderFilter(rd io.Reader, delim byte) DirtyFilter

NewNodeReaderFilter 创建节点过滤器,实现敏感词的过滤 从可读流中读取敏感词数据(以指定的分隔符读取数据)

type DirtyManager

type DirtyManager struct {
	// contains filtered or unexported fields
}

DirtyManager 提供敏感词的管理

func NewDirtyManager

func NewDirtyManager(store DirtyStore, checkInterval ...time.Duration) *DirtyManager

NewDirtyManager 使用敏感词存储接口创建敏感词管理的实例

func (*DirtyManager) Filter

func (dm *DirtyManager) Filter() DirtyFilter

Filter 获取敏感词过滤接口

func (*DirtyManager) Store

func (dm *DirtyManager) Store() DirtyStore

Store 获取敏感词存储接口

type DirtyStore

type DirtyStore interface {
	// Write 将敏感词写入存储区,如果写入失败则返回error
	Write(words ...string) error

	// Read 以迭代的方式读取敏感词
	Read() <-chan string

	// ReadAll 获取所有的敏感词数据,如果获取失败则返回error
	ReadAll() ([]string, error)

	// Remove 移除敏感词,如果移除失败则返回error
	Remove(words ...string) error

	// Version 数据存储版本号
	Version() uint64
}

DirtyStore 提供敏感词的读取、写入存储接口

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL