search

package module

v0.0.0-...-5e73d17 Latest Latest Go to latest Published: Feb 24, 2016 License: GPL-2.0 Imports: 15 Imported by: 6

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/aosen/search

Links

Open Source Insights

README ¶

search

Golang实现的搜索引擎

##使用过程中遇到任何问题，请联系QQ: 316052486

详细用法见: github.com/aosen/searchserver

##思维脑图

思维脑图

#几个重要结构体 ##搜索引擎初始化结构体

type EngineInitOptions struct {
	// 半角逗号分隔的字典文件，具体用法见
    //分词器接口
	Segmenter SearchSegmenter

	// 停用词文件
	StopTokenFile string

	// 分词器线程数
	NumSegmenterThreads int

	// 索引器和排序器的shard数目
	// 被检索/排序的文档会被均匀分配到各个shard中
	NumShards int

	// 索引器的信道缓冲长度
	IndexerBufferLength int

	// 索引器每个shard分配的线程数
	NumIndexerThreadsPerShard int

	// 排序器的信道缓冲长度
	RankerBufferLength int

	// 排序器每个shard分配的线程数
	NumRankerThreadsPerShard int

	// 索引器初始化选项
	IndexerInitOptions *IndexerInitOptions

	// 默认的搜索选项
	DefaultRankOptions *RankOptions

	// 是否使用持久数据库，以及数据库文件保存的目录和裂分数目
	UsePersistentStorage    bool
    //索引持久化
    //pipline接口
    SearchPipline SearchPipline
	//索引器生成方法
	CreateIndexer func() SearchIndexer
}

##建立索引结构体

type DocumentIndexData struct {
	// 文档全文（必须是UTF-8格式），用于生成待索引的关键词
	Content string

	// 文档的关键词
	// 当Content不为空的时候，优先从Content中分词得到关键词。
	// Tokens存在的意义在于绕过内置的分词器，在引擎外部
	// 进行分词和预处理。
	Tokens []TokenData

	// 文档标签（必须是UTF-8格式），比如文档的类别属性等，这些标签并不出现在文档文本中
	Labels []string

	// 文档的评分字段，可以接纳任何类型的结构体
	Fields interface{}
}

##建立索引

func (engine *Engine) IndexDocument(docId uint64, data DocumentIndexData)

##搜索请求结构体

type SearchRequest struct {
	// 搜索的短语（必须是UTF-8格式），会被分词
	// 当值为空字符串时关键词会从下面的Tokens读入
	Text string

	// 关键词（必须是UTF-8格式），当Text不为空时优先使用Text
	// 通常你不需要自己指定关键词，除非你运行自己的分词程序
	Tokens []string

	// 文档标签（必须是UTF-8格式），标签不存在文档文本中，但也属于搜索键的一种
	Labels []string

	// 当不为空时，仅从这些文档中搜索
	DocIds []uint64

	// 排序选项
	RankOptions *RankOptions

	// 超时，单位毫秒（千分之一秒）。此值小于等于零时不设超时。
	// 搜索超时的情况下仍有可能返回部分排序结果。
	Timeout int
}

##搜索查询符合条件的文档

func (engine *Engine) Search(request SearchRequest) (output SearchResponse)

##搜索引擎返回结构体

type SearchResponse struct {
	// 搜索用到的关键词
	Tokens []string

	// 搜索到的文档，已排序
	Docs []ScoredDocument

	// 搜索是否超时。超时的情况下也可能会返回部分结果
	Timeout bool
}

#开发进度

2015-01-14 增加pipline对mysql的支持
2015-01-08 目前打分器只支持BM25, 排序必须依靠BM25进行排序，接下来需要让引擎支持更多的打分规则。
2015-01-06 打分器接口话 done
2015-01-06 排序器接口话 done
2015-01-06 索引器接口话 done
2015-12-17 分词器接口化 done
2015-12-17 优化目录结构 done
2015-12-17 优化搜索效率 done
2015-12-04 增加索引存储方式： mongodb done

Documentation ¶

Index ¶

Constants
func SegmentsToSlice(segs []Segment, searchMode bool) (output []string)
func SegmentsToString(segs []Segment, searchMode bool) (output string)
func TextSliceByteLength(text []Text) (length int)
func TextSliceToString(text []Text) (output string)
func TokenToSlice(token *Token) (output []string)
func TokenToString(token *Token) (output string)
func UpdateJumper(jumper *Jumper, baseDistance float32, token *Token)
type BM25Parameters
type Dictionary
- func (self *Dictionary) AddToken(token *Token)
- func (self *Dictionary) GetMaxTokenLength() int
- func (self *Dictionary) GetNumTokens() int
- func (self *Dictionary) GetTotalFrequency() int64
- func (self *Dictionary) LookupTokens(words []Text, tokens []*Token) int
type DocumentIndex
type DocumentIndexData
type Engine
- func NewSearchEngine() *Engine
- func (engine *Engine) Close()
- func (engine *Engine) FlushIndex()
- func (engine *Engine) IndexDocument(docId uint64, data DocumentIndexData)
- func (engine *Engine) Init(options EngineInitOptions)
- func (engine *Engine) NumDocumentsIndexed() uint64
- func (engine *Engine) NumTokenIndexAdded() uint64
- func (engine *Engine) RemoveDocument(docId uint64)
- func (engine *Engine) Search(request SearchRequest) (output SearchResponse)
type EngineInitOptions
- func (options *EngineInitOptions) Init()
type IndexedDocument
type IndexerInitOptions
type Jumper
type KeywordIndex
type Node
type RankOptions
type ScoredDocument
type ScoredDocuments
- func (docs ScoredDocuments) Len() int
- func (docs ScoredDocuments) Less(i, j int) bool
- func (docs ScoredDocuments) Swap(i, j int)
type SearchIndexer
type SearchPipline
type SearchRanker
type SearchRequest
type SearchResponse
type SearchScorer
type SearchSegmenter
type Segment
- func (self *Segment) GetEnd() int
- func (self *Segment) GetStart() int
- func (self *Segment) GetToken() *Token
type StopTokens
- func (st *StopTokens) Init(stopTokenFile string)
- func (st *StopTokens) IsStopToken(token string) bool
type Text
- func SplitTextToWords(text Text) []Text
type Token
- func (token *Token) GetFrequency() int
- func (token *Token) GetPos() string
- func (token *Token) GetSegments() []*Segment
- func (token *Token) GetText() string
type TokenData

Constants ¶

View Source

const (
	// 仅存储文档的docId
	DocIdsIndex = 0

	// 存储关键词的词频，用于计算BM25
	FrequenciesIndex = 1

	// 存储关键词在文档中出现的具体字节位置（可能有多个）
	// 如果你希望得到关键词紧邻度数据，必须使用LocationsIndex类型的索引
	LocationsIndex = 2
)

这些常数定义了反向索引表存储的数据类型

View Source

const (
	NumNanosecondsInAMillisecond = 1000000
)

Variables ¶

This section is empty.

Functions ¶

func SegmentsToSlice ¶

func SegmentsToSlice(segs []Segment, searchMode bool) (output []string)

func SegmentsToString ¶

func SegmentsToString(segs []Segment, searchMode bool) (output string)

输出分词结果为字符串

有两种输出模式，以"中华人民共和国"为例

普通模式（searchMode=false）输出一个分词"中华人民共和国/ns "
搜索模式（searchMode=true） 输出普通模式的再细致切分：
    "中华/nz 人民/n 共和/nz 共和国/ns 人民共和国/nt 中华人民共和国/ns "

搜索模式主要用于给搜索引擎提供尽可能多的关键字，详情请见Token结构体的注释。

func TextSliceByteLength ¶

func TextSliceByteLength(text []Text) (length int)

返回多个字元的字节总长度

func TextSliceToString ¶

func TextSliceToString(text []Text) (output string)

将多个字元拼接一个字符串输出

func TokenToSlice ¶

func TokenToSlice(token *Token) (output []string)

func TokenToString ¶

func TokenToString(token *Token) (output string)

func UpdateJumper ¶

func UpdateJumper(jumper *Jumper, baseDistance float32, token *Token)

更新跳转信息:

当该位置从未被访问过时(jumper.minDistance为零的情况)，或者
当该位置的当前最短路径大于新的最短路径时

将当前位置的最短路径值更新为baseDistance加上新分词的概率

Types ¶

type BM25Parameters ¶

type BM25Parameters struct {
	K1 float32
	B  float32
}

默认值见engine_init_options.go

type Dictionary ¶

type Dictionary struct {
	Root           Node     // 根节点
	MaxTokenLength int      // 词典中最长的分词
	NumTokens      int      // 词典中分词数目
	Tokens         []*Token // 词典中所有的分词，方便遍历
	TotalFrequency int64    // 词典中所有分词的频率之和
}

Dictionary结构体实现了一个字串前缀树，一个分词可能出现在叶子节点也有可能出现在非叶节点

func (*Dictionary) AddToken ¶

func (self *Dictionary) AddToken(token *Token)

向词典中加入一个分词

func (*Dictionary) GetMaxTokenLength ¶

func (self *Dictionary) GetMaxTokenLength() int

词典中最长的分词

func (*Dictionary) GetNumTokens ¶

func (self *Dictionary) GetNumTokens() int

词典中分词数目

func (*Dictionary) GetTotalFrequency ¶

func (self *Dictionary) GetTotalFrequency() int64

词典中所有分词的频率之和

func (*Dictionary) LookupTokens ¶

func (self *Dictionary) LookupTokens(words []Text, tokens []*Token) int

在词典中查找和字元组words可以前缀匹配的所有分词返回值为找到的分词数

type DocumentIndex ¶

type DocumentIndex struct {
	// 文本的DocId
	DocId uint64

	// 文本的关键词长
	TokenLength float32

	// 加入的索引键
	Keywords []KeywordIndex
}

type DocumentIndexData ¶

type DocumentIndexData struct {
	// 文档全文（必须是UTF-8格式），用于生成待索引的关键词
	Content string

	// 文档的关键词
	// 当Content不为空的时候，优先从Content中分词得到关键词。
	// Tokens存在的意义在于绕过内置的分词器，在引擎外部
	// 进行分词和预处理。
	Tokens []TokenData

	// 文档标签（必须是UTF-8格式），比如文档的类别属性等，这些标签并不出现在文档文本中
	Labels []string

	// 文档的评分字段，可以接纳任何类型的结构体
	Fields interface{}
}

type Engine ¶

type Engine struct {
	// contains filtered or unexported fields
}

搜索引擎基类

func NewSearchEngine ¶

func NewSearchEngine() *Engine

func (*Engine) Close ¶

func (engine *Engine) Close()

关闭引擎

func (*Engine) FlushIndex ¶

func (engine *Engine) FlushIndex()

阻塞等待直到所有索引添加完毕

func (*Engine) IndexDocument ¶

func (engine *Engine) IndexDocument(docId uint64, data DocumentIndexData)

将文档加入索引

输入参数：

docId	标识文档编号，必须唯一
data	见DocumentIndexData注释

注意：

这个函数是线程安全的，请尽可能并发调用以提高索引速度
这个函数调用是非同步的，也就是说在函数返回时有可能文档还没有加入索引中，因此如果立刻调用Search可能无法查询到这个文档。强制刷新索引请调用FlushIndex函数。

func (*Engine) Init ¶

func (engine *Engine) Init(options EngineInitOptions)

func (*Engine) NumDocumentsIndexed ¶

func (engine *Engine) NumDocumentsIndexed() uint64

func (*Engine) NumTokenIndexAdded ¶

func (engine *Engine) NumTokenIndexAdded() uint64

func (*Engine) RemoveDocument ¶

func (engine *Engine) RemoveDocument(docId uint64)

将文档从索引中删除

输入参数：

docId	标识文档编号，必须唯一

注意：这个函数仅从排序器中删除文档的自定义评分字段，索引器不会发生变化。所以你的自定义评分字段必须能够区别评分字段为nil的情况，并将其从排序结果中删除。

func (*Engine) Search ¶

func (engine *Engine) Search(request SearchRequest) (output SearchResponse)

查找满足搜索条件的文档，此函数线程安全

type EngineInitOptions ¶

type EngineInitOptions struct {
	// 半角逗号分隔的字典文件，具体用法见
	// sego.Segmenter.LoadDictionary函数的注释
	Segmenter SearchSegmenter

	// 停用词文件
	StopTokenFile string

	// 分词器线程数
	NumSegmenterThreads int

	// 索引器和排序器的shard数目
	// 被检索/排序的文档会被均匀分配到各个shard中
	NumShards int

	// 索引器的信道缓冲长度
	IndexerBufferLength int

	// 索引器每个shard分配的线程数
	NumIndexerThreadsPerShard int

	// 排序器的信道缓冲长度
	RankerBufferLength int

	// 排序器每个shard分配的线程数
	NumRankerThreadsPerShard int

	// 索引器初始化选项
	IndexerInitOptions *IndexerInitOptions

	// 是否使用持久数据库，以及数据库文件保存的目录和裂分数目
	UsePersistentStorage bool

	//索引存储接口对接
	SearchPipline SearchPipline

	//索引器生成方法
	CreateIndexer func() SearchIndexer

	//排序器生成方法
	CreateRanker func() SearchRanker

	//打分器设置
	SearchScorer SearchScorer
}

func (*EngineInitOptions) Init ¶

func (options *EngineInitOptions) Init()

初始化EngineInitOptions，当用户未设定某个选项的值时用默认值取代

type IndexedDocument ¶

type IndexedDocument struct {
	DocId uint64

	// BM25，仅当索引类型为FrequenciesIndex或者LocationsIndex时返回有效值
	BM25 float32

	// 关键词在文档中的紧邻距离，紧邻距离的含义见computeTokenProximity的注释。
	// 仅当索引类型为LocationsIndex时返回有效值。
	TokenProximity int32

	// 紧邻距离计算得到的关键词位置，和Lookup函数输入tokens的长度一样且一一对应。
	// 仅当索引类型为LocationsIndex时返回有效值。
	TokenSnippetLocations []int

	// 关键词在文本中的具体位置。
	// 仅当索引类型为LocationsIndex时返回有效值。
	TokenLocations [][]int
}

索引器返回结果

type IndexerInitOptions ¶

type IndexerInitOptions struct {
	// 索引表的类型，见上面的常数
	IndexType int

	// BM25参数
	BM25Parameters *BM25Parameters
}

初始化索引器选项

type Jumper ¶

type Jumper struct {
	MinDistance float32
	Token       *Token
}

该结构体用于记录Viterbi算法中某字元处的向前分词跳转信息

type KeywordIndex ¶

type KeywordIndex struct {
	// 搜索键的UTF-8文本
	Text string

	// 搜索键词频
	Frequency float32

	// 搜索键在文档中的起始字节位置，按照升序排列
	Starts []int
}

反向索引项，这实际上标注了一个（搜索键，文档）对。

type Node ¶

type Node struct {
	Word     Text    // 该节点对应的字元
	Token    *Token  // 当此节点没有对应的分词时值为nil
	Children []*Node // 该字元后继的所有可能字元，当为叶子节点时为空
}

前缀树节点

type RankOptions ¶

type RankOptions struct {
	// 文档的评分规则，值为nil时使用Engine初始化时设定的规则
	SearchScorer SearchScorer

	// 默认情况下（ReverseOrder=false）按照分数从大到小排序，否则从小到大排序
	ReverseOrder bool

	// 从第几条结果开始输出
	OutputOffset int

	// 最大输出的搜索结果数，为0时无限制
	MaxOutputs int
}

排序选项

type ScoredDocument ¶

type ScoredDocument struct {
	DocId uint64

	// 文档的打分值
	// 搜索结果按照Scores的值排序，先按照第一个数排，如果相同则按照第二个数排序，依次类推。
	Scores []float32

	// 用于生成摘要的关键词在文本中的字节位置，该切片长度和SearchResponse.Tokens的长度一样
	// 只有当IndexType == LocationsIndex时不为空
	TokenSnippetLocations []int

	// 关键词出现的位置
	// 只有当IndexType == LocationsIndex时不为空
	TokenLocations [][]int
}

type ScoredDocuments ¶

type ScoredDocuments []ScoredDocument

func (ScoredDocuments) Len ¶

func (docs ScoredDocuments) Len() int

func (ScoredDocuments) Less ¶

func (docs ScoredDocuments) Less(i, j int) bool

func (ScoredDocuments) Swap ¶

func (docs ScoredDocuments) Swap(i, j int)

type SearchIndexer ¶

type SearchIndexer interface {
	// 初始化索引器
	Init(options IndexerInitOptions)
	// 向反向索引表中加入一个文档
	AddDocument(document *DocumentIndex)
	// 查找包含全部搜索键(AND操作)的文档
	// 当docIds不为nil时仅从docIds指定的文档中查找
	Lookup(tokens []string, labels []string, docIds []uint64) (docs []IndexedDocument)
}

索引器接口开发者只要实现以下接口，即可实现一个索引器

type SearchPipline ¶

type SearchPipline interface {
	//初始化存储器, shard为初始化的集合编号
	Init()
	//获取存储集合数量, 集合数量可以提高并行计算效率
	GetStorageShards() int
	//连接数据库
	Conn(shard int)
	//关闭数据库连接
	Close(shard int)
	//将数据从shard DB恢复到内存
	Recover(shard int, internalIndexDocument func(docId uint64, data DocumentIndexData)) error
	//存储索引
	Set(shard int, key, value []byte)
	//从DB删除索引
	Delete(shard int, key []byte)
}

存储器

type SearchRanker ¶

type SearchRanker interface {
	//排序起初始化
	Init()
	// 给某个文档添加评分字段
	AddScoringFields(docId uint64, fields interface{})
	// 删除某个文档的评分字段
	RemoveScoringFields(docId uint64)
	// 给文档评分并排序
	Rank(docs []IndexedDocument, options RankOptions) (outputDocs ScoredDocuments)
}

排序起接口

type SearchRequest ¶

type SearchRequest struct {
	// 搜索的短语（必须是UTF-8格式），会被分词
	// 当值为空字符串时关键词会从下面的Tokens读入
	Text string

	// 关键词（必须是UTF-8格式），当Text不为空时优先使用Text
	// 通常你不需要自己指定关键词，除非你运行自己的分词程序
	Tokens []string

	// 文档标签（必须是UTF-8格式），标签不存在文档文本中，但也属于搜索键的一种
	Labels []string

	// 当不为空时，仅从这些文档中搜索
	DocIds []uint64

	// 排序选项
	RankOptions *RankOptions

	// 超时，单位毫秒（千分之一秒）。此值小于等于零时不设超时。
	// 搜索超时的情况下仍有可能返回部分排序结果。
	Timeout int
}

type SearchResponse ¶

type SearchResponse struct {
	// 搜索用到的关键词
	Tokens []string

	// 搜索到的文档，已排序
	Docs []ScoredDocument

	// 搜索是否超时。超时的情况下也可能会返回部分结果
	Timeout bool
}

type SearchScorer ¶

type SearchScorer interface {
	// 给一个文档评分，文档排序时先用第一个分值比较，如果
	// 分值相同则转移到第二个分值，以此类推。
	// 返回空切片表明该文档应该从最终排序结果中剔除。
	Score(doc IndexedDocument, fields interface{}) []float32
}

评分规则通用接口

type SearchSegmenter ¶

type SearchSegmenter interface {
	// 返回分词器使用的词典
	Dictionary() *Dictionary
	// 从文件中载入词典
	// 可以载入多个词典文件，文件名用","分隔，排在前面的词典优先载入分词，比如
	// 	"用户词典.txt,通用词典.txt"
	// 当一个分词既出现在用户词典也出现在通用词典中，则优先使用用户词典。
	// 词典的格式为（每个分词一行）：
	//	分词文本 频率 词性
	LoadDictionary(files string)
	// 对文本分词
	// 输入参数：
	//	bytes	UTF8文本的字节数组
	// 输出：
	//	[]Segment	划分的分词
	Cut(bytes []byte, model bool) []Segment
}

分词器

type Segment ¶

type Segment struct {
	// 分词在文本中的起始字节位置
	Start int
	// 分词在文本中的结束字节位置（不包括该位置）
	End int
	// 分词信息
	Token *Token
}

文本中的一个分词

func (*Segment) GetEnd ¶

func (self *Segment) GetEnd() int

返回分词在文本中的结束字节位置（不包括该位置）

func (*Segment) GetStart ¶

func (self *Segment) GetStart() int

返回分词在文本中的起始字节位置

func (*Segment) GetToken ¶

func (self *Segment) GetToken() *Token

返回分词信息

type StopTokens ¶

type StopTokens struct {
	// contains filtered or unexported fields
}

停用词管理

func (*StopTokens) Init ¶

func (st *StopTokens) Init(stopTokenFile string)

从stopTokenFile中读入停用词，一个词一行文档索引建立时会跳过这些停用词

func (*StopTokens) IsStopToken ¶

func (st *StopTokens) IsStopToken(token string) bool

type Text ¶

type Text []byte

字串类型，可以用来表达

一个字元，比如"中"又如"国", 英文的一个字元是一个词
一个分词，比如"中国"又如"人口"
一段文字，比如"中国有十三亿人口"

func SplitTextToWords ¶

func SplitTextToWords(text Text) []Text

将文本划分成字元

type Token ¶

type Token struct {
	// 分词的字串，这实际上是个字元数组
	TextList []Text
	// 分词在语料库中的词频
	Frequency int
	// log2(总词频/该分词词频)，这相当于log2(1/p(分词))，用作动态规划中
	// 该分词的路径长度。求解prod(p(分词))的最大值相当于求解
	// sum(distance(分词))的最小值，这就是“最短路径”的来历。
	Distance float32
	// 词性标注
	Pos string
	// 该分词文本的进一步分词划分，见Segments函数注释。
	Segments []*Segment
}

一个分词

func (*Token) GetFrequency ¶

func (token *Token) GetFrequency() int

返回分词在语料库中的词频

func (*Token) GetPos ¶

func (token *Token) GetPos() string

返回分词词性标注

func (*Token) GetSegments ¶

func (token *Token) GetSegments() []*Segment

该分词文本的进一步分词划分，比如"中华人民共和国中央人民政府"这个分词有两个子分词"中华人民共和国"和"中央人民政府"。子分词也可以进一步有子分词形成一个树结构，遍历这个树就可以得到该分词的所有细致分词划分，这主要用于搜索引擎对一段文本进行全文搜索。

func (*Token) GetText ¶

func (token *Token) GetText() string

返回分词文本

type TokenData ¶

type TokenData struct {
	// 关键词的字符串
	Text string

	// 关键词的首字节在文档中出现的位置
	Locations []int
}

文档的一个关键词

Source Files ¶

View all Source files

Directories ¶

Path	Synopsis
indexer
pipeline
ranker
scorer
segmenter
utils

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL