mafsa

package module
v0.2.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 24, 2020 License: MIT Imports: 11 Imported by: 1

README

MA-FSA for Go

Package mafsa implements Minimal Acyclic Finite State Automata (MA-FSA) with Minimal Perfect Hashing (MPH). Basically, it's a set of strings that lets you test for membership, do spelling correction (fuzzy matching) and autocomplete, but with higher memory efficiency than a regular trie. With MPH, you can associate each entry in the tree with data from your own application.

In this package, MA-FSA is implemented by two types:

  • BuildTree
  • MinTree

A BuildTree is used to build data from scratch. Once all the elements have been inserted, the BuildTree can be serialized into a byte slice or written to a file directly. It can then be decoded into a MinTree, which uses significantly less memory. MinTrees are read-only, but this greatly improves space efficiency.

Tutorial

Create a BuildTree and insert your items in lexicographical order. Be sure to call Finish() when you're done.

bt := mafsa.New()
bt.Insert("cities") // an error will be returned if input is < last input
bt.Insert("city")
bt.Insert("pities")
bt.Insert("pity")
bt.Finish()

The tree is now compressed to a minimum number of nodes and is ready to be saved.

err := bt.Save("filename")
if err != nil {
    log.Fatal("Could not save data to file:", err)
}

In your production application, then, you can read the file into a MinTree directly:

mt, err := mafsa.Load("filename")
if err != nil {
    log.Fatal("Could not load data from file:", err)
}

The mt variable is a *MinTree which has the same data as the original BuildTree, but without all the extra "scaffolding" that was required for adding new elements. In our testing, we were able to store over 8 million phrases (average length 24, much longer than words in a typical dictionary) in as little as 2 GB on a 64-bit system.

The package provides some basic read mechanisms.

// See if a string is a member of the set
fmt.Println("Does tree contain 'cities'?", mt.Contains("cities"))
fmt.Println("Does tree contain 'pitiful'?", mt.Contains("pitiful"))

// You can traverse down to a certain node, if it exists
fmt.Printf("'y' node is at: %p\n", mt.Traverse([]rune("city")))

// To traverse the tree and get the number of elements inserted
// before the prefix specified
node, idx := mt.IndexedTraverse([]rune("pit"))
fmt.Println("Index number for 'pit': %d", idx)

To associate entries in the tree with data from your own application, create a slice with the data in the same order as the elements were inserted into the tree:

myData := []string{
    "The plural of city",
    "Noun; a large town",
    "The state of pitying",
    "A feeling of sorrow and compassion",
}

The index number returned with IndexedTraverse() (usually minus 1) will get you to the element in your slice if you traverse directly to a final node:

node, i := mt.IndexedTraverse([]rune("pities"))
if node != nil && node.Final {
    fmt.Println(myData[i-1])
}

If you do IndexedTraverse() directly to a word in the tree, you must -1 because that method returns the number of elements in the tree before those that start with the specified prefix, which is non-inclusive with the node the method landed on.

There are many ways to apply MA-FSA with minimal perfect hashing, so the package only provides the basic utilities. Along with Traverse() and IndexedTraverse(), the edges of each node are exported so you may conduct your own traversals according to your needs.

Encoding Format

This section describes the file format used by Encoder and Decoder. You probably will never need to implement this yourself; it's already done for you in this package.

BuildTrees can be encoded as bytes and then stored on disk or decoded into a MinTree. The encoding of a BuildTree is a binary file that is composed of a sequence of words, which is a sequence of big-endian bytes. The file is inspected one word at a time.

The word length is equal to (in bytes): 1 (the flag byte) + character length (UTF-8 encoding of the unicode point) + pointer length (byte 1 in the header).

The first word contains file format information. Byte 0 is the file version (right now, 2). Byte 1 is the pointer length. In the first word specifying the file format, we assume a character length of 1 and fill up the rest of the bytes with zeros bytes.

Package Shugyousha/mafsa initializes the file with this first word entry by default:

[]byte{0x02 0x04 0x00 0x00 0x00 0x00}

This indicates that we are using version 2 of the file format and each pointer to another node is 4 bytes. This pointer size allows us to encode trees with up to 715.8 Mio. edges. The rest of the first entry are filled with zero bytes until the minimal size of a word has been reached (the pointer size takes the position of the UTF-8-character in a regular word).

Every other word in the file represents an edge (not a node). Those words consist of bytes according to this format:

[flag byte] [UTF-8-encoded character] [pointer]

The length of the character is encoded in the bits 2-4 (zero-indexed from the least significant one) of the flags, the pointer size in bytes is specified in the second byte of the initial word, and the flags part is always a single byte. This allows us to have 8 flags per edge, but currently only 5 are used.

The flag bits are:

  • 0x01 = End of Word (EOW, or final)
  • 0x02 = End of Node (EON)
  • 0x04 = lsb of character length (three bits in total)
  • 0x08 = middle bit of character length (three bits in total)
  • 0x10 = msb of character length (three bits in total)

A node essentially consists of a consecutive, variable-length sequence of words, where each word represents an edge. To encode a node, write each edge sequentially. Set the final (EOW) flag if the node it points to is a final node, and set the EON flag if it is the last edge for that node. The EON flag indicates the next word is an edge belonging to a different node. The pointer in each word should be the byte index of the start of the node being pointed to.

For example, the tree containing these words:

-- dog
-- dogs
-- hello
-- jello
-- été
-- あello

Encodes to (each line is one word, note that they are varying in size depending on the UTF-8 encoding length of the character):

 1   02 04 00 00 00 00
 2   04 64 00 00 00 27
 3   04 68 00 00 00 2d
 4   04 6a 00 00 00 33
 5   08 c3 a9 00 00 00 39
 6   0e e3 81 82 00 00 00 3f
 7   06 6f 00 00 00 45
 8   06 65 00 00 00 4b
 9   06 65 00 00 00 4b
10   06 74 00 00 00 51
11   06 65 00 00 00 4b
12   07 67 00 00 00 58
13   06 6c 00 00 00 5e
14   0b c3 a9 00 00 00 00
15   07 73 00 00 00 00
16   06 6c 00 00 00 64
17   07 6f 00 00 00 00

Or here's the hexdump (this makes it easier to follow the byte offsets):

00000000: 0204 0000 0000 0464 0000 0027 0468 0000  .......d...'.h..
00000010: 002d 046a 0000 0033 08c3 a900 0000 390e  .-.j...3......9.
00000020: e381 8200 0000 3f06 6f00 0000 4506 6500  ......?.o...E.e.
00000030: 0000 4b06 6500 0000 4b06 7400 0000 5106  ..K.e...K.t...Q.
00000040: 6500 0000 4b07 6700 0000 5806 6c00 0000  e...K.g...X.l...
00000050: 5e0b c3a9 0000 0000 0773 0000 0000 066c  ^........s.....l
00000060: 0000 0064 076f 0000 0000                 ...d.o....

Now, this tree isn't a thorough-enough example to test your implementation against, but it should illustrate the idea. First notice that the first word is the initializer word. The second word is the first edge coming off the root node, and words 3 and 4 are both edges coming off the root node. The word at 6 has the EON flag set (0x02) which indicates the end of the edges coming off that node (the root node). You'll see that edges at word indices 8 and 9 both point to the node starting at edge with word index 16 (at byte offset 0x4b, 75 in decimal). That would be the shared 'l' node after 'he' and 'je'.

If a node has no outgoing edges, the pointer bits are 0.

Documentation

Overview

Package mafsa implements Minimal Acyclic Finite State Automata (MA-FSA) in a space-optimized way as described by Dacuik, Mihov, Watson, and Watson in their paper, "Incremental Construction of Minimal Acyclic Finite-State Automata" (2000). It also implements Minimal Perfect Hashing (MPH) as described by Lucceshi and Kowaltowski in their paper, "Applications of Finite Automata Representing Large Vocabularies" (1992).

Unscientifically speaking, this package lets you store large amounts of strings (with Unicode) in memory so that membership queries, prefix lookups, and fuzzy searches are fast. And because minimal perfect hashing is included, you can associate each entry in the tree with more data used by your application. See the README or the end of this documentation for a brief tutorial.

MA-FSA structures are a specific type of Deterministic Acyclic Finite State Automaton (DAFSA) which fold equivalent state transitions into each other starting from the suffix of each entry. Typical construction algorithms involve building out the entire tree first, then minimizing the completed tree. However, the method described in the paper above allows the tree to be minimized after every word insertion, provided the insertions are performed in lexicographical order, which drastically reduces memory usage compared to regular prefix trees ("tries").

The goal of this package is to provide a simple, useful, and correct implementation of MA-FSA. Though more complex algorithms exist for removal of items and unordered insertion, these features may be outside the scope of this package.

Membership queries are on the order of O(n), where n is the length of the input string, so basically O(1). It is advisable to keep n small since long entries without much in common, especially in the beginning or end of the string, will quickly overrun the optimizations that are available. In those cases, n-gram implementations might be preferable, though these will use more CPU.

This package provides two kinds of MA-FSA implementations. One, the BuildTree, facilitates the construction of an optimized tree and allows ordered insertions. The other, MinTree, is effectively read-only but uses significantly less memory and is ideal for production environments where only reads will be occurring.

Usually your build process will be separate from your production application, which will make heavy use of reading the structure.

To use this package, create a BuildTree and insert your items in lexicographical order:

bt := mafsa.New()
bt.Insert("cities") // an error will be returned if input is < last input
bt.Insert("city")
bt.Insert("pities")
bt.Insert("pity")
bt.Finish()

The tree is now compressed to a minimum number of nodes and is ready to be saved.

err := bt.Save("filename")
if err != nil {
	log.Fatal("Could not save data to file:", err)
}

In your production application, then, you can read the file into a MinTree directly:

mt, err := mafsa.Load("filename")
if err != nil {
	log.Fatal("Could not load data from file:", err)
}

The mt variable is a *MinTree which has the same data as the original BuildTree, but without all the extra "scaffolding" that was required for adding new elements.

The package provides some basic read mechanisms.

// See if a string is a member of the set
fmt.Println("Does tree contain 'cities'?", mt.Contains("cities"))
fmt.Println("Does tree contain 'pitiful'?", mt.Contains("pitiful"))

// You can traverse down to a certain node, if it exists
fmt.Printf("'y' node is at: %p\n", mt.Traverse([]rune("city")))

// To traverse the tree and get the number of elements inserted
// before the prefix specified
_, idx := mt.IndexedTraverse([]rune("pit"))
fmt.Println("Elements inserted before entries starting with 'pit':", idx)

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type BuildTree

type BuildTree struct {
	Root *BuildTreeNode
	// contains filtered or unexported fields
}

BuildTree implements an MA-FSA that contains all the "scaffolding" (state) necessary to perform optimizations after inserting each item, which prevents the tree from growing too big.

func New

func New() *BuildTree

New constructs a new, empty MA-FSA that can be filled with data.

func (*BuildTree) Contains

func (t *BuildTree) Contains(word string) bool

Contains returns true if word is found in the tree, false otherwise.

func (*BuildTree) Finish

func (t *BuildTree) Finish()

Finish completes the optimizations on a tree. You must call Finish at least once, like immediately after all entries have been inserted.

func (*BuildTree) Insert

func (t *BuildTree) Insert(val string) error

Insert adds val to the tree and performs optimizations to minimize the number of nodes in the tree. The inserted val must be lexicographically equal to or higher than the last inserted val.

func (*BuildTree) MarshalBinary

func (t *BuildTree) MarshalBinary() ([]byte, error)

MarshalBinary encodes t into a binary format. It implements the functionality described by the encoding.BinaryMarhsaler interface. The associated encoding.BinaryUnmarshaler type is MinTree.

func (*BuildTree) Save

func (t *BuildTree) Save(filename string) error

Save encodes the MA-FSA into a binary format and writes it to a file. The tree can later be restored by calling the package's Load function.

func (*BuildTree) String

func (t *BuildTree) String() string

String returns a roughly human-readable string representing the basic structure of the tree. For debugging only. Do not use with very large trees.

func (*BuildTree) Traverse

func (t *BuildTree) Traverse(word []rune) *BuildTreeNode

Traverse follows nodes down the tree according to word and returns the ending node if there was one or nil if there wasn't one. Note that this method alone does not indicate membership; some node may still be reached even if the word is not in the structure.

type BuildTreeNode

type BuildTreeNode struct {
	Edges map[rune]*BuildTreeNode
	// contains filtered or unexported fields
}

BuildTreeNode is a node in a BuildTree that contains all the info necessary to be a part of optimizations during item insertions.

func (*BuildTreeNode) OrderedEdges

func (tn *BuildTreeNode) OrderedEdges() []rune

OrderedEdges returns the keys of the outgoing edges in lexigocraphical sorted order. It enables deterministic traversal of edges to child nodes.

type Decoder

type Decoder struct {
	// contains filtered or unexported fields
}

Decoder is a type which can decode a byte slice into a MinTree.

func (*Decoder) Decode

func (d *Decoder) Decode(data []byte) (*MinTree, error)

Decode transforms the binary serialization of a MA-FSA into a new MinTree (a read-only MA-FSA).

func (*Decoder) ReadFrom

func (d *Decoder) ReadFrom(r io.Reader) (*MinTree, error)

ReadFrom reads the binary serialization of a MA-FSA into a new MinTree (a read-only MA-FSA) from a io.Reader.

type Encoder

type Encoder struct {
	// contains filtered or unexported fields
}

Encoder is a type which can encode a BuildTree into a byte slice which can be written to a file.

func (*Encoder) Encode

func (e *Encoder) Encode(t *BuildTree) ([]byte, error)

Encode serializes a BuildTree t into a byte slice.

func (*Encoder) WriteTo

func (e *Encoder) WriteTo(wr io.Writer, t *BuildTree) error

WriteTo encodes and saves the BuildTree to a io.Writer.

type Entry

type Entry struct {
	// The actual entry that was added to the tree
	Value string

	// Represents how many entries were added to the tree before this one
	Index int
}

Entry represents a value in the tree along with its index number (minimal perfect hashing).

type MinTree

type MinTree struct {
	Root *MinTreeNode
}

MinTree implements a simple prefix tree with read-only functionality. It supports Minimal Perfect Hashing (MPH) so you can lookup entries by their index in a slice.

func Load

func Load(filename string) (*MinTree, error)

Load loads an existing MA-FSA from a file specified by filename. It returns a read-only MA-FSA, or an error if loading failed.

func (*MinTree) Contains

func (t *MinTree) Contains(word string) bool

Contains returns true if word is found in the tree, false otherwise.

func (*MinTree) IndexedTraverse

func (t *MinTree) IndexedTraverse(word []rune) (*MinTreeNode, int)

IndexedTraverse visits nodes according to word and returns the last node, if there was one, along with its index number. The index number represents how many entries, including the returned node, are in the tree before entries beginning with the specified prefix. Such an indexed traversal is slightly slower than a regular traversal since this kind of traversal has to iterate more nodes and count the numbers on each node. Note that returning a non-nil node is not indicative of membership; a node may be be returned even if the word is not in the set. If last node is nil, the returned index is -1. The algorithm is adapted directly from the paper by Lucceshi and Kowaltowski, 1992 (section 5.2), with one slight modification for correctness because of a slight difference in the way our tree is implemented.

func (*MinTree) IterateDepthFirst

func (t *MinTree) IterateDepthFirst() chan string

IterateDepthFirst returns an unbuffered channel which will be loaded with all items in the tree in lexicographical order. The caller need only range over the channel (which will be closed whan all items have been loaded).

func (*MinTree) String

func (t *MinTree) String() string

String serializes the tree roughly as a human-readable string for debugging purposes. Don't try this on large trees.

func (*MinTree) Traverse

func (t *MinTree) Traverse(word []rune) *MinTreeNode

Traverse visits nodes according to word and returns the last node, if there was one. Note that returning a non-nil node is not indicative of membership; a node may be returned even if the word is not in the set. Traverse implements a fast traversal which does not count the index number of the last node.

func (*MinTree) UnmarshalBinary

func (t *MinTree) UnmarshalBinary(data []byte) error

UnmarshalBinary decodes binary data into this MinTree. It implements the functionality described by the encoding.BinaryMarhsaler interface. The associated encoding.BinaryMarshaler type is BuildTree.

type MinTreeNode

type MinTreeNode struct {
	Edges  map[rune]*MinTreeNode
	Final  bool
	Number int
}

MinTreeNode is a node in a MinTree.

func (*MinTreeNode) OrderedEdges

func (n *MinTreeNode) OrderedEdges() []rune

OrderedEdges returns the keys of the outgoing edges in lexigocraphical sorted order. It enables deterministic traversal of edges to child nodes.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL