dawg

package module
v0.0.0-...-66057bd Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 18, 2022 License: MIT Imports: 10 Imported by: 1

README

dawg GoDoc

CircleCI Package dawg is an implemention of a Directed Acyclic Word Graph, as described on my blog at http://stevehanov.ca/blog/?id=115 It is designed to be as memory efficient as possible.

Download:

go get github.com/smhanov/dawg

Package dawg is an implemention of a Directed Acyclic Word Graph, as described on my blog at http://stevehanov.ca/blog/?id=115

A DAWG provides fast lookup of all possible prefixes of words in a dictionary, as well as the ability to get the index number of any word.

This particular implementation may be different from others because it is very memory efficient, and it also works fast with large character sets. It can deal with thousands of branches out of a single node without needing to go through each one.

The storage format is as small as possible. Bits are used instead of bytes so that no space is wasted as padding, and there are no practical limitations to the number of nodes or characters. A summary of the data format is found at the top of disk.go.

In general, to use it you first create a builder using dawg.New(). You can then add words to the Dawg. The two restrictions are that you cannot repeat a word, and they must be in strictly increasing alphabetical order.

After all the words are added, call Finish() which returns a dawg.Finder interface. You can perform queries with this interface, such as finding all prefixes of a given string which are also words, or looking up a word's index that you have previously added.

After you have called Finish() on a Builder, you may choose to write it to disk using the Save() function. The DAWG can then be opened again later using the Load() function. When opened from disk, no memory is used. The structure is accessed in-place on disk.

Benchmarks

There are some benchmarks in this project:

https://github.com/timurgarif/go-fsa-trie-bench

The library is optimized to take less memory or no-memory if accessing a file. We easily beat all the alternatives in this area, using only 520KB compared to others which take from 3.6MB to 32MB to store the same dictionary. However, the tradeoff is that the bit-level accesses cause it to take 10X as along to lookup words.

Documentation

Overview

Package dawg is an implemention of a Directed Acyclic Word Graph, as described on my blog at http://stevehanov.ca/blog/?id=115

A DAWG provides fast lookup of all possible prefixes of words in a dictionary, as well as the ability to get the index number of any word.

This particular implementation may be different from others because it is very memory efficient, and it also works fast with large character sets. It can deal with thousands of branches out of a single node without needing to go through each one.

The storage format is as small as possible. Bits are used instead of bytes so that no space is wasted as padding, and there are no practical limitations to the number of nodes or characters. A summary of the data format is found at the top of disk.go.

In general, to use it you first create a builder using dawg.New(). You can then add words to the Dawg. The two restrictions are that you cannot repeat a word, and they must be in strictly increasing alphabetical order.

After all the words are added, call Finish() which returns a dawg.Finder interface. You can perform queries with this interface, such as finding all prefixes of a given string which are also words, or looking up a word's index that you have previously added.

After you have called Finish() on a Builder, you may choose to write it to disk using the Save() function. The DAWG can then be opened again later using the Load() function. When opened from disk, no memory is used. The structure is accessed in-place on disk.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func DumpFile

func DumpFile(f io.ReaderAt)

DumpFile prints out the file

Types

type Builder

type Builder interface {
	// Add the word to the dawg
	Add(wordIn string)

	// Returns true if the word can be added.
	CanAdd(word string) bool

	// Complete the dawg and return a Finder.
	Finish() Finder
}

Builder is the interface for creating a new Dawg. Use New() to create it.

func New

func New() Builder

New creates a new dawg

Example
package main

import (
	"fmt"

	"github.com/smhanov/dawg"
)

func main() {
	dawg := dawg.New()

	dawg.Add("blip")   // index 0
	dawg.Add("cat")    // index 1
	dawg.Add("catnip") // index 2
	dawg.Add("cats")   // index 3

	finder := dawg.Finish()

	for _, result := range finder.FindAllPrefixesOf("catsup") {
		fmt.Printf("Found prefix %s, index %d\n", result.Word, result.Index)
	}

}
Output:

Found prefix cat, index 1
Found prefix cats, index 3

type EnumFn

type EnumFn = func(index int, word []rune, final bool) EnumerationResult

EnumFn is a method that you implement. It will be called with all prefixes stored in the DAWG. If final is true, the prefix represents a complete word that has been stored.

type EnumerationResult

type EnumerationResult = int

EnumerationResult is returned by the enumeration function to indicate whether indication should continue below this depth or to stop altogether

const (
	// Continue enumerating all words with this prefix
	Continue EnumerationResult = iota

	// Skip will skip all words with this prefix
	Skip

	// Stop will immediately stop enumerating words
	Stop
)

type FindResult

type FindResult struct {
	Word  string
	Index int
}

FindResult is the result of a lookup in the d. It contains both the word found, and it's index based on the order it was added.

type Finder

type Finder interface {
	// Find all prefixes of the given string
	FindAllPrefixesOf(input string) []FindResult

	// Find the index of the given string
	IndexOf(input string) int

	AtIndex(index int) (string, error)

	// Enumerate all prefixes stored in the dawg.
	Enumerate(fn EnumFn)

	// Returns the number of words
	NumAdded() int

	// Returns the number of edges
	NumEdges() int

	// Returns the number of nodes
	NumNodes() int

	// Output a human-readable description of the dawg to stdout
	Print()

	// Close the dawg that was opened with Load(). After this, it is no longer
	// accessible.
	Close() error

	// Save to a writer
	Write(w io.Writer) (int64, error)

	// Save to a file
	Save(filename string) (int64, error)
}

Finder is the interface for querying a dawg. Use either Builder.Finish() or Load() to obtain one.

func Load

func Load(filename string) (Finder, error)

Load loads the dawg from a file

func Read

func Read(f io.ReaderAt, offset int64) (Finder, error)

Read returns a finder that accesses the dawg in-place using the given io.ReaderAt

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL