tokenizer

package
v0.0.0-...-e36dbc7 Latest
Warning

This package is not in the latest version of its module.

Published: Apr 9, 2024 License: Apache-2.0 Imports: 6 Imported by: 0

Documentation

Overview

Package tokenizer converts text into a stream of tokens.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Hash

type Hash map[uint32]TokenRanges

Hash maps the checksum of a section of text to the token ranges covering that text.

type TokenRange

type TokenRange struct {
	Start int
	End   int
}

TokenRange indicates the range of tokens that map to a particular checksum.

func (*TokenRange) String

func (t *TokenRange) String() string
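The String method has no documentation here, so the following is only a minimal sketch of how a pointer-receiver String method lets a *TokenRange satisfy fmt.Stringer. The "[Start, End)" output format, and the reading of End as exclusive, are assumptions made for illustration and are not confirmed by this page.

```go
package main

import "fmt"

// TokenRange mirrors the documented struct: a span of token indices.
type TokenRange struct {
	Start int
	End   int
}

// String renders the range as a half-open interval.
// ASSUMPTION: the real package's output format is not documented here.
func (t *TokenRange) String() string {
	return fmt.Sprintf("[%d, %d)", t.Start, t.End)
}

func main() {
	r := &TokenRange{Start: 3, End: 7}
	// fmt picks up String automatically because *TokenRange
	// implements fmt.Stringer.
	fmt.Println(r) // prints "[3, 7)"
}
```

Because the receiver is a pointer, only *TokenRange (not TokenRange) carries the method, which matches the []*TokenRange element type used by TokenRanges.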

type TokenRanges

type TokenRanges []*TokenRange

TokenRanges is a list of pointers to TokenRange values. The chance that two different strings map to the same checksum is very small but not zero, so a Hash stores a list of ranges per checksum rather than assuming that every checksum is unique.
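To illustrate why a checksum maps to a list rather than a single range, here is a hedged sketch that redefines the documented types locally. The add and lookup helpers, the use of CRC-32, and the representation of tokens as plain strings are all assumptions for the example, not the package's API.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"strings"
)

type TokenRange struct{ Start, End int }
type TokenRanges []*TokenRange
type Hash map[uint32]TokenRanges

// add (hypothetical helper) records that tokens[start:end] spell text.
func (h Hash) add(text string, start, end int) {
	sum := crc32.ChecksumIEEE([]byte(text)) // ASSUMPTION: CRC-32 as the checksum
	h[sum] = append(h[sum], &TokenRange{Start: start, End: end})
}

// lookup (hypothetical helper) returns only those candidate ranges whose
// tokens really spell text, so a checksum collision cannot cause a false match.
func lookup(h Hash, tokens []string, text string) TokenRanges {
	var out TokenRanges
	for _, r := range h[crc32.ChecksumIEEE([]byte(text))] {
		if strings.Join(tokens[r.Start:r.End], " ") == text {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	tokens := []string{"the", "quick", "the", "quick"}
	h := Hash{}
	h.add("the quick", 0, 2)
	h.add("the quick", 2, 4)
	for _, r := range lookup(h, tokens, "the quick") {
		fmt.Println(r.Start, r.End)
	}
}
```

The same list also handles the ordinary case where identical text occurs at several positions: both ranges end up under one key.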

func (TokenRanges) CombineUnique

func (t TokenRanges) CombineUnique(other TokenRanges) TokenRanges

CombineUnique returns the combination of both token ranges with no duplicates.
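One way such a merge could work is sketched below. This is not the package's implementation: it deduplicates by comparing Start and End values (not pointer identity) and keeps first-appearance order, both of which are assumptions.

```go
package main

import "fmt"

type TokenRange struct{ Start, End int }
type TokenRanges []*TokenRange

// CombineUnique returns the ranges of t followed by those of other,
// dropping any range whose Start and End were already seen.
// ASSUMPTION: uniqueness is by value and input order is preserved.
func (t TokenRanges) CombineUnique(other TokenRanges) TokenRanges {
	seen := make(map[TokenRange]bool, len(t)+len(other))
	var out TokenRanges
	for _, list := range []TokenRanges{t, other} {
		for _, r := range list {
			if r == nil || seen[*r] {
				continue // skip nil entries and duplicates
			}
			seen[*r] = true
			out = append(out, r)
		}
	}
	return out
}

func main() {
	a := TokenRanges{{Start: 0, End: 2}, {Start: 4, End: 6}}
	b := TokenRanges{{Start: 4, End: 6}, {Start: 8, End: 10}}
	for _, r := range a.CombineUnique(b) {
		fmt.Println(r.Start, r.End)
	}
}
```

Keying the map on the struct value works because TokenRange contains only comparable fields.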

func (TokenRanges) Len

func (t TokenRanges) Len() int

func (TokenRanges) Less

func (t TokenRanges) Less(i, j int) bool

func (TokenRanges) Swap

func (t TokenRanges) Swap(i, j int)

type Tokens

type Tokens []*token

Tokens is a list of pointers to the package's unexported token type.

func Tokenize

func Tokenize(s string) (toks Tokens)

Tokenize converts a string into a stream of tokens.
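The page does not describe the token type or the splitting rules, so the following is only an illustrative sketch of what a tokenizer with this signature might do. The token fields (Text, Offset) and the rules (runs of letters and digits form one token, whitespace separates tokens, any other rune is a token of its own) are hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// token is a hypothetical stand-in for the package's unexported type.
type token struct {
	Text   string // the token's text
	Offset int    // byte offset of the token in the input
}

type Tokens []*token

// Tokenize splits s into word tokens and single-rune punctuation tokens.
// ASSUMPTION: these splitting rules are invented for the example.
func Tokenize(s string) (toks Tokens) {
	start := -1 // byte index where the current word began, -1 if none
	flush := func(end int) {
		if start >= 0 {
			toks = append(toks, &token{Text: s[start:end], Offset: start})
			start = -1
		}
	}
	for i, r := range s {
		switch {
		case unicode.IsLetter(r) || unicode.IsDigit(r):
			if start < 0 {
				start = i
			}
		case unicode.IsSpace(r):
			flush(i)
		default:
			flush(i)
			toks = append(toks, &token{Text: string(r), Offset: i})
		}
	}
	flush(len(s))
	return toks
}

func main() {
	for _, t := range Tokenize("Hello, world!") {
		fmt.Printf("%q@%d\n", t.Text, t.Offset)
	}
}
```

For the input "Hello, world!" this sketch yields four tokens: "Hello" at 0, "," at 5, "world" at 7, and "!" at 12.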

func (Tokens) GenerateHashes

func (t Tokens) GenerateHashes(h Hash, size int) ([]uint32, TokenRanges)

GenerateHashes generates hashes for substrings of length "size". Because the "stringifyTokens" call is slow, hashes are not generated for every substring: some of the smaller substrings are skipped.
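A rough sketch of windowed hashing with the same signature follows. It makes several assumptions that this page does not confirm: "size" counts tokens, tokens are simplified to plain strings, the checksum is CRC-32, windows advance by half the window size (which is how this sketch skips substrings), and each window is recorded in h.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"strings"
)

type TokenRange struct{ Start, End int }
type TokenRanges []*TokenRange
type Hash map[uint32]TokenRanges

// Tokens is simplified to strings here; the real type holds *token values.
type Tokens []string

// GenerateHashes hashes windows of size tokens, records each window in h,
// and returns the checksums with their matching ranges.
func (t Tokens) GenerateHashes(h Hash, size int) ([]uint32, TokenRanges) {
	if size <= 0 || size > len(t) {
		return nil, nil
	}
	step := size / 2 // ASSUMPTION: hypothetical stride that skips some windows
	if step < 1 {
		step = 1
	}
	var sums []uint32
	var ranges TokenRanges
	for start := 0; start+size <= len(t); start += step {
		text := strings.Join(t[start:start+size], " ")
		sum := crc32.ChecksumIEEE([]byte(text)) // ASSUMPTION: CRC-32
		r := &TokenRange{Start: start, End: start + size}
		h[sum] = append(h[sum], r)
		sums = append(sums, sum)
		ranges = append(ranges, r)
	}
	return sums, ranges
}

func main() {
	toks := Tokens{"a", "b", "c", "d", "e", "f"}
	h := Hash{}
	sums, ranges := toks.GenerateHashes(h, 4)
	for i, r := range ranges {
		fmt.Printf("%08x covers tokens [%d, %d)\n", sums[i], r.Start, r.End)
	}
}
```

With six tokens and a window of four, this stride produces two windows, [0, 4) and [2, 6), instead of the three a stride of one would give.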
