triegen

package
v0.14.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 11, 2023 License: BSD-3-Clause Imports: 9 Imported by: 0

Documentation

Overview

Package triegen implements a code generator for a trie for associating unsigned integer values with UTF-8 encoded runes.

Many of the go.text packages use tries for storing per-rune information. A trie is especially useful if many of the runes have the same value. If this is the case, many blocks can be expected to be shared allowing for information on many runes to be stored in little space.

As most of the lookups are done directly on []byte slices, the tries use the UTF-8 bytes directly for the lookup. This saves a conversion from UTF-8 to runes and contributes a little bit to better performance. It also naturally provides a fast path for ASCII.

Space is also an issue. There are many code points defined in Unicode and as a result tables can get quite large. So every byte counts. The triegen package automatically chooses the smallest integer values to represent the tables. Compacters allow further compression of the trie by allowing for alternative representations of individual trie blocks.

triegen allows generating multiple tries as a single structure. This is useful when, for example, one wants to generate tries for several languages that have a lot of values in common. Some existing libraries for internationalization store all per-language data as a dynamically loadable chunk. The go.text packages are designed with the assumption that the user typically wants to compile in support for all supported languages, in line with the approach common to Go to create a single standalone binary. The multi-root trie approach can give significant storage savings in this scenario.

triegen generates both tables and code. The code is optimized to use the automatically chosen data types. The following code is generated for a Trie or multiple Tries named "foo":

  • type fooTrie The trie type.

  • func newFooTrie(x int) *fooTrie Trie constructor, where x is the index of the trie passed to Gen.

  • func (t *fooTrie) lookup(s []byte) (v uintX, sz int) The lookup method, where uintX is automatically chosen.

  • func lookupString, lookupUnsafe and lookupStringUnsafe Variants of the above.

  • var fooValues and fooIndex and any tables generated by Compacters. The core trie data.

  • var fooTrieHandles Indexes of starter blocks in case of multiple trie roots.

It is recommended that users test the generated trie by checking the returned value for every rune. Such exhaustive tests are possible as the number of runes in Unicode is limited.

Example (Build)

Example_build shows how to build a simple trie. It assigns the value 1 to 100 random runes generated by randomRunes.

package main

import (
	"fmt"
	"io"
	"math/rand"
	"unicode"

	"golang.org/x/text/internal/triegen"
)

const seed = 0x12345

var genWriter = io.Discard

func randomRunes() map[rune]uint8 {
	rnd := rand.New(rand.NewSource(seed))
	m := map[rune]uint8{}
	for len(m) < 100 {

		if r := rune(rnd.Int31n(unicode.MaxRune + 1)); []rune(string(r))[0] == r {
			m[r] = 1
		}
	}
	return m
}

func main() {
	t := triegen.NewTrie("rand")

	for r, _ := range randomRunes() {
		t.Insert(r, 1)
	}
	sz, err := t.Gen(genWriter)

	fmt.Printf("Trie size: %d bytes\n", sz)
	fmt.Printf("Error:     %v\n", err)

}
Output:

Trie size: 9280 bytes
Error:     <nil>
Example (Lookup)

Example_lookup demonstrates how to use the trie generated by Example_build.

trie := newRandTrie(0)

// The same set of runes used by Example_build.
runes := randomRunes()

// Verify the right value is returned for all runes.
for r := rune(0); r <= unicode.MaxRune; r++ {
	// Note that the return type of lookup is uint8.
	if v, _ := trie.lookupString(string(r)); v != runes[r] {
		fmt.Println("FAILURE")
		return
	}
}
fmt.Println("SUCCESS")
Output:

SUCCESS

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func Gen

func Gen(w io.Writer, name string, tries []*Trie, opts ...Option) (sz int, err error)

Gen writes Go code for a shared trie lookup structure to w for the given Tries. The generated trie type will be called nameTrie. newNameTrie(x) will return the *nameTrie for tries[x]. A value can be looked up by using one of the various lookup methods defined on nameTrie. It returns the table size of the generated trie.

Example (Build)

ExampleGen_build demonstrates the creation of multiple tries sharing common blocks. ExampleGen_lookup demonstrates how to use the generated tries.

package main

import (
	"fmt"
	"io"
	"math/rand"
	"unicode"

	"golang.org/x/text/internal/triegen"
)

const seed = 0x12345

var genWriter = io.Discard

// runeValues generates some random values for a set of interesting runes.
func runeValues() map[rune]uint64 {
	rnd := rand.New(rand.NewSource(seed))
	m := map[rune]uint64{}
	for p := 4; p <= unicode.MaxRune; p <<= 1 {
		for d := -1; d <= 1; d++ {
			m[rune(p+d)] = uint64(rnd.Int63())
		}
	}
	return m
}

func main() {
	var tries []*triegen.Trie

	rv := runeValues()
	for _, c := range []struct {
		include func(rune) bool
		name    string
	}{
		{func(r rune) bool { return true }, "all"},
		{func(r rune) bool { return r < 0x80 }, "ASCII only"},
		{func(r rune) bool { return r < 0x80 }, "ASCII only 2"},
		{func(r rune) bool { return r <= 0xFFFF }, "BMP only"},
		{func(r rune) bool { return r > 0xFFFF }, "No BMP"},
	} {
		t := triegen.NewTrie(c.name)
		tries = append(tries, t)

		for r, v := range rv {
			if c.include(r) {
				t.Insert(r, v)
			}
		}
	}
	sz, err := triegen.Gen(genWriter, "multi", tries)

	fmt.Printf("Trie size: %d bytes\n", sz)
	fmt.Printf("Error:     %v\n", err)

}
Output:

Trie size: 18250 bytes
Error:     <nil>
Example (Lookup)

ExampleGen_lookup shows how to look up values in the trie generated by ExampleGen_build.

rv := runeValues()
for i, include := range []func(rune) bool{
	func(r rune) bool { return true },        // all
	func(r rune) bool { return r < 0x80 },    // ASCII only
	func(r rune) bool { return r < 0x80 },    // ASCII only 2
	func(r rune) bool { return r <= 0xFFFF }, // BMP only
	func(r rune) bool { return r > 0xFFFF },  // No BMP
} {
	t := newMultiTrie(i)

	for r := rune(0); r <= unicode.MaxRune; r++ {
		x := uint64(0)
		if include(r) {
			x = rv[r]
		}
		// As we convert from a valid rune, we know it is safe to use
		// lookupStringUnsafe.
		if v := t.lookupStringUnsafe(string(r)); x != v {
			fmt.Println("FAILURE")
			return
		}
	}
}
fmt.Println("SUCCESS")
Output:

SUCCESS

Types

type Compacter

type Compacter interface {
	// Size returns whether the Compacter could encode the given block as well
	// as its size in case it can. len(v) is always 64.
	Size(v []uint64) (sz int, ok bool)

	// Store stores the block using the Compacter's compression method.
	// It returns a handle with which the block can be retrieved.
	// len(v) is always 64.
	Store(v []uint64) uint32

	// Print writes the data structures associated to the given store to w.
	Print(w io.Writer) error

	// Handler returns the name of a function that gets called during trie
	// lookup for blocks generated by the Compacter. The function should be of
	// the form func (n uint32, b byte) uint64, where n is the index returned by
	// the Compacter's Store method and b is the last byte of the UTF-8
	// encoding, where 0x80 <= b < 0xC0, for which to do the lookup in the
	// block.
	Handler() string
}

A Compacter generates an alternative, more space-efficient way to store a trie value block. A trie value block holds all possible values for the last byte of a UTF-8 encoded rune. Excluding ASCII characters, a trie value block always has 64 values, as a UTF-8 encoding ends with a byte in [0x80, 0xC0).

Example
package main

import (
	"fmt"
	"io"

	"golang.org/x/text/internal/triegen"
)

func main() {
	t := triegen.NewTrie("root")
	for r := rune(0); r < 10000; r += 64 {
		t.Insert(r, 0x9015BADA55^uint64(r))
	}
	sz, _ := t.Gen(io.Discard)

	fmt.Printf("Size normal:    %5d\n", sz)

	var c myCompacter
	sz, _ = t.Gen(io.Discard, triegen.Compact(&c))

	fmt.Printf("Size compacted: %5d\n", sz)

}

// A myCompacter accepts a block if only the first value is given.
type myCompacter []uint64

func (c *myCompacter) Size(values []uint64) (sz int, ok bool) {
	for _, v := range values[1:] {
		if v != 0 {
			return 0, false
		}
	}
	return 8, true // the size of a uint64
}

func (c *myCompacter) Store(v []uint64) uint32 {
	x := uint32(len(*c))
	*c = append(*c, v[0])
	return x
}

func (c *myCompacter) Print(w io.Writer) error {
	fmt.Fprintln(w, "var firstValue = []uint64{")
	for _, v := range *c {
		fmt.Fprintf(w, "\t%#x,\n", v)
	}
	fmt.Fprintln(w, "}")
	return nil
}

func (c *myCompacter) Handler() string {
	return "getFirstValue"

	// Where getFirstValue is included along with the generated code:
	// func getFirstValue(n uint32, b byte) uint64 {
	//     if b == 0x80 { // the first continuation byte
	//         return firstValue[n]
	//     }
	//     return 0
	//  }
}
Output:

Size normal:    81344
Size compacted:  3224

type Option

type Option func(b *builder) error

An Option can be passed to Gen.

func Compact

func Compact(c Compacter) Option

Compact configures the trie generator to use the given Compacter.

type Trie

type Trie struct {
	// contains filtered or unexported fields
}

A Trie represents a single root node of a trie. A builder may build several overlapping tries at once.

func NewTrie

func NewTrie(name string) *Trie

NewTrie returns a new trie root.

func (*Trie) Gen

func (t *Trie) Gen(w io.Writer, opts ...Option) (sz int, err error)

Gen is a convenience wrapper around the Gen func passing t as the only trie and uses the name passed to NewTrie. It returns the size of the generated tables.

func (*Trie) Insert

func (t *Trie) Insert(r rune, value uint64)

Insert associates value with the given rune. Insert will panic if a non-zero value is passed for an invalid rune.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL