stringbenchmarks

package
v0.0.0-...-46fb334
Published: Mar 30, 2021 License: MIT Imports: 11 Imported by: 0

Documentation

Overview

Package stringutils implements additional functions to supplement the strings package of the Go standard library.

The algorithms chosen are based on benchmarks from the stringbenchmarks module. Your mileage may vary.

The reference implementation at the start of this project was the Go 1.15.3 standard library source, .../go/1.15.3/libexec/src/strings/strings.go

For information about UTF-8 strings in Go, see https://blog.golang.org/strings.

Constants

const (
	RuneError = utf8.RuneError // '\uFFFD'       // the "error" Rune or "Unicode replacement character"
	RuneSelf  = utf8.RuneSelf  // 0x80           // characters below RuneSelf are represented as themselves in a single byte.
	MaxRune   = utf8.MaxRune   // '\U0010FFFF'   // Maximum valid Unicode code point.
	UTFMax    = utf8.UTFMax    // 4              // maximum number of bytes of a UTF-8 encoded Unicode character.

)

Numbers fundamental to the encoding.
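
As a rough illustration of how these constants are typically used, the sketch below counts runes with an ASCII fast path. It relies only on the standard unicode/utf8 package, whose constants the ones above mirror. countRunes is a hypothetical helper and is not part of this package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRunes is a hypothetical helper (not part of this package).
// Bytes below utf8.RuneSelf encode themselves, so they can be
// counted without decoding.
func countRunes(s string) int {
	n := 0
	for i := 0; i < len(s); {
		if s[i] < utf8.RuneSelf {
			i++ // ASCII fast path: one byte, one rune
		} else {
			// Multi-byte sequence of at most utf8.UTFMax bytes.
			// Invalid bytes decode as utf8.RuneError with size 1.
			_, size := utf8.DecodeRuneInString(s[i:])
			i += size
		}
		n++
	}
	return n
}

func main() {
	fmt.Println(countRunes("héllo"), utf8.UTFMax)
}
```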

const (
	TAB   = 0x09 // '\t'
	LF    = 0x0A // '\n'
	VT    = 0x0B // '\v'
	FF    = 0x0C // '\f'
	CR    = 0x0D // '\r'
	SPACE = ' '
	NBSP  = 0x00A0
	NEL   = 0x0085
)
const MaxBruteForce = 64 // x86 values
const PrimeRK = 16777619

PrimeRK is the prime base used in Rabin-Karp algorithm.

Variables

var MaxLen int = 0
var (

	// UnicodeWhiteSpaceMap provides a mapping from Unicode runes to strings
	// with descriptions of each. It is marginally slower than the bool map.
	//
	// In computer programming, whitespace is any character or series of
	// characters that represent horizontal or vertical space in typography.
	// When rendered, a whitespace character does not correspond to a visible
	// mark, but typically does occupy an area on a page. For example, the
	// common whitespace symbol SPACE (unicode: U+0020 ASCII: 32 decimal 0x20
	// hex) represents a blank space punctuation character in text, used as a
	// word divider in Western scripts.
	//
	// Reference: https://en.wikipedia.org/wiki/Whitespace_character
	UnicodeWhiteSpaceMap = map[rune]string{
		0x0009: `CHARACTER TABULATION <TAB>`,
		0x000A: `ASCII LF`,
		0x000B: `LINE TABULATION <VT>`,
		0x000C: `FORM FEED <FF>`,
		0x000D: `ASCII CR`,
		0x0020: `SPACE <SP>`,
		0x00A0: `NO-BREAK SPACE <NBSP>`,
		0x0085: `NEL; Next Line`,
		0x1680: `Ogham space mark, interword separation in Ogham text`,
		0x2000: `EN QUAD, 0x2002 is preferred`,
		0x2001: `EM QUAD, mutton quad, 0x2003 is preferred`,
		0x2002: `EN SPACE, "nut", &ensp, LaTeX: '\enspace'`,
		0x2003: `EM SPACE, "mutton", &emsp;, LaTeX: '\quad'`,
		0x2004: `THREE-PER-EM SPACE, "thick space", &emsp13;`,
		0x2005: `four-per-em space, "mid space", &emsp14;`,
		0x2006: `SIX-PER-EM SPACE, sometimes equated to U+2009`,
		0x2007: `FIGURE SPACE, width of monospaced char, &numsp;`,
		0x2008: `PUNCTUATION SPACE, width of period or comma, &puncsp;`,
		0x2009: `THIN SPACE, 1/5th em, thousands sep, &thinsp;; LaTeX: '\,'`,
		0x200A: `HAIR SPACE, &hairsp;`,
		0x2028: `LINE SEPARATOR`,
		0x2029: `PARAGRAPH SEPARATOR`,
		0x202F: `NARROW NO-BREAK SPACE`,
		0x205F: `MEDIUM MATHEMATICAL SPACE, MMSP, &MediumSpace, 4/18 em`,
		0x3000: `IDEOGRAPHIC SPACE, full width CJK character cell`,
		0xFFEF: `ZERO WIDTH NO-BREAK SPACE <ZWNBSP> (BOM), deprecated Unicode 3.2 (use U+2060)`,
	}
)

Functions

func ByteSamples

func ByteSamples() []byte

func Cutover

func Cutover(n int) int

Cutover reports the number of failures of IndexByte we should tolerate before switching over to Index. n is the number of bytes processed so far. See the bytes.Index implementation for details.
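
The body of Cutover is not shown on this page. The sketch below is modeled on the amd64 heuristic in Go's internal/bytealg package (about one tolerated failure per 8 bytes processed, plus a little slack at the start); whether this package uses the same formula is an assumption.

```go
package main

import "fmt"

// cutover is a sketch modeled on the amd64 version in Go's
// internal/bytealg. It is an assumption that this package's
// Cutover behaves the same way.
func cutover(n int) int {
	// One failure per 8 bytes processed, plus some slack to start.
	return (n + 16) / 8
}

func main() {
	for _, n := range []int{0, 8, 64, 1024} {
		fmt.Printf("after %4d bytes tolerate %d failures\n", n, cutover(n))
	}
}
```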

func DedupeWhitespace

func DedupeWhitespace(s string, ignoreNewlines bool) string

DedupeWhitespace replaces each run of duplicate whitespace in the string with a single space. If ignoreNewlines is true, \n is ignored.
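
One possible reading of this behavior is sketched below. It assumes that "ignored" means newlines are passed through unchanged and do not count as whitespace to collapse; the package's actual handling may differ. dedupeWhitespace is a hypothetical stand-in, and unicode.IsSpace is used as the whitespace test for simplicity.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// dedupeWhitespace is a hypothetical stand-in for DedupeWhitespace.
// Assumption: when ignoreNewlines is true, '\n' is copied through
// untouched instead of being collapsed.
func dedupeWhitespace(s string, ignoreNewlines bool) string {
	var b strings.Builder
	inRun := false // true while inside a run of whitespace
	for _, r := range s {
		if unicode.IsSpace(r) && !(ignoreNewlines && r == '\n') {
			if !inRun {
				b.WriteByte(' ') // emit one space per run
			}
			inRun = true
			continue
		}
		inRun = false
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", dedupeWhitespace("a  \t b\n\nc", false))
	fmt.Printf("%q\n", dedupeWhitespace("a  \t b\n\nc", true))
}
```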

func Equal

func Equal(a, b []byte) bool

Equal reports whether a and b are the same length and contain the same bytes. A nil argument is equivalent to an empty slice.
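
The nil-versus-empty rule follows naturally if the comparison is done on string views of the slices, which is how bytes.Equal in the standard library is written. The sketch below uses a local equal function for illustration; it is an assumption that this package does the same.

```go
package main

import "fmt"

// equal is an illustrative sketch, not this package's source.
// Converting to string for comparison does not allocate with the
// standard Go compiler, and a nil slice converts to "".
func equal(a, b []byte) bool {
	return string(a) == string(b)
}

func main() {
	fmt.Println(equal(nil, []byte{}))               // nil and empty compare equal
	fmt.Println(equal([]byte("go"), []byte("go")))  // same bytes
	fmt.Println(equal([]byte("go"), []byte("gop"))) // different length
}
```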

func HashStrBytes

func HashStrBytes(sep []byte) (uint32, uint32)

HashStrBytes returns the hash and the appropriate multiplicative factor for use in Rabin-Karp algorithm.

func IndexRabinKarpBytes

func IndexRabinKarpBytes(s, sep []byte) int

IndexRabinKarpBytes uses the Rabin-Karp search algorithm to return the index of the first occurrence of sep in s, or -1 if not present.
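
To show how the hash, the multiplicative factor, and the search fit together, here is a minimal sketch patterned after the Rabin-Karp code in the Go standard library. The lowercase names hashStr and indexRabinKarp are local to the example, and it is an assumption that this package's functions work the same way.

```go
package main

import (
	"bytes"
	"fmt"
)

const primeRK = 16777619 // same value as PrimeRK above

// hashStr returns the hash of sep and primeRK^len(sep), the factor
// needed to remove the oldest byte from a rolling hash.
func hashStr(sep []byte) (uint32, uint32) {
	hash := uint32(0)
	for _, c := range sep {
		hash = hash*primeRK + uint32(c)
	}
	var pow, sq uint32 = 1, primeRK
	for i := len(sep); i > 0; i >>= 1 { // exponentiation by squaring
		if i&1 != 0 {
			pow *= sq
		}
		sq *= sq
	}
	return hash, pow
}

// indexRabinKarp returns the index of the first occurrence of sep
// in s, or -1 if sep is not present.
func indexRabinKarp(s, sep []byte) int {
	n := len(sep)
	if n > len(s) {
		return -1
	}
	hashsep, pow := hashStr(sep)
	var h uint32
	for i := 0; i < n; i++ {
		h = h*primeRK + uint32(s[i])
	}
	if h == hashsep && bytes.Equal(s[:n], sep) {
		return 0
	}
	for i := n; i < len(s); {
		// Slide the window: add s[i], drop s[i-n].
		h = h*primeRK + uint32(s[i]) - pow*uint32(s[i-n])
		i++
		// Confirm with a byte comparison to rule out hash collisions.
		if h == hashsep && bytes.Equal(s[i-n:i], sep) {
			return i - n
		}
	}
	return -1
}

func main() {
	fmt.Println(indexRabinKarp([]byte("hello world"), []byte("world")))
}
```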

func IsASCIIAlpha

func IsASCIIAlpha(c byte) bool

func IsASCIIPrintable

func IsASCIIPrintable(s string) bool

IsASCIIPrintable reports whether s consists only of printable ASCII characters, that is, it contains no control characters such as tab or backspace.

func IsASCIISpace

func IsASCIISpace(c byte) bool

IsASCIISpace tests for the most common ASCII whitespace characters:

' ', '\t', '\n', '\f', '\r', '\v'

This excludes all Unicode code points above 0x007F.

The C language defines whitespace characters to be "space, horizontal tab, new-line, vertical tab, and form-feed."
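
The six-byte test above can be written in more than one way. The sketch below shows a switch-based version and a bitmask version that agree on every byte value. Both are illustrative; the names isASCIISpace and isSpaceMask are local to the example, and the bitmask form is only a guess at what a function such as IsSpaceMask might do.

```go
package main

import "fmt"

// isASCIISpace checks the six common ASCII whitespace bytes.
func isASCIISpace(c byte) bool {
	switch c {
	case ' ', '\t', '\n', '\f', '\r', '\v':
		return true
	}
	return false
}

// spaceMask has one bit set per whitespace byte. All six values
// are below 64, so they fit in a single uint64.
const spaceMask uint64 = 1<<' ' | 1<<'\t' | 1<<'\n' | 1<<'\f' | 1<<'\r' | 1<<'\v'

// isSpaceMask is a hypothetical bitmask variant. In Go, shifting a
// uint64 by 64 or more yields 0, so bytes >= 64 report false.
func isSpaceMask(c byte) bool {
	return (spaceMask>>c)&1 == 1
}

func main() {
	for _, c := range []byte{' ', '\t', 'a', 0xA0} {
		fmt.Printf("%#02x switch=%v mask=%v\n", c, isASCIISpace(c), isSpaceMask(c))
	}
}
```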

func IsAlphaNum

func IsAlphaNum(c byte) bool

IsAlphaNum reports whether the byte is an ASCII letter, number, or underscore.
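
A range-comparison version of this check might look like the sketch below. isAlphaNum is a local illustration, not this package's source.

```go
package main

import "fmt"

// isAlphaNum reports whether c is an ASCII letter, digit, or
// underscore. Illustrative sketch only.
func isAlphaNum(c byte) bool {
	return c == '_' ||
		'0' <= c && c <= '9' ||
		'a' <= c && c <= 'z' ||
		'A' <= c && c <= 'Z'
}

func main() {
	for _, c := range []byte("a_9-Z ") {
		fmt.Printf("%q %v\n", c, isAlphaNum(c))
	}
}
```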

func IsAlphaNumSwitch

func IsAlphaNumSwitch(c byte) bool

func IsAlphaNumUnder

func IsAlphaNumUnder(c byte) bool

func IsDigit

func IsDigit(c byte) bool

func IsDigitSingleOP

func IsDigitSingleOP(c byte) bool

IsDigitSingleOP uses a single comparison instead of the standard a <= c && c <= b form.

Another good example: a very common test is if (x >= 1 && x <= 9), which can be done as if ((unsigned)(x-1) <= (unsigned)(9-1)). Changing two conditional tests to one can be a big speed advantage, especially when it allows predicated execution instead of branches.

I used this for years (where justified) until I noticed about 10 years ago that compilers had started doing this transform in the optimizer, then I stopped. It is still good to know, since there are similar situations where the compiler can't make the transform for you, or if you're working on a compiler.
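
In Go the same trick works on a byte because unsigned subtraction wraps around: any c below '0' becomes a large value after c - '0' and fails the single comparison. The sketch below uses local names and checks both forms against every byte value; it is not this package's source.

```go
package main

import "fmt"

// isDigitTwoOps is the conventional two-comparison range check.
func isDigitTwoOps(c byte) bool {
	return '0' <= c && c <= '9'
}

// isDigitSingleOp folds the range check into one comparison.
// Byte arithmetic wraps modulo 256, so c < '0' yields a value
// greater than 9 after the subtraction.
func isDigitSingleOp(c byte) bool {
	return c-'0' <= 9
}

func main() {
	agree := true
	for i := 0; i < 256; i++ {
		if isDigitTwoOps(byte(i)) != isDigitSingleOp(byte(i)) {
			agree = false
		}
	}
	fmt.Println("both forms agree on all bytes:", agree)
}
```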

func IsDigitSingleOPCompare

func IsDigitSingleOPCompare(c byte) bool

IsDigitSingleOPCompare is a sample implementation used for benchmarking IsDigitSingleOP

func IsHex

func IsHex(c byte) bool

func IsSpaceMask

func IsSpaceMask(c byte) bool

func IsUnicodeWhiteSpaceMap

func IsUnicodeWhiteSpaceMap(r rune) bool

IsUnicodeWhiteSpaceMap reports whether the rune is any Unicode whitespace character, using the broadest and most complete definition.

This implementation is ~25% slower than IsASCIISpace(c byte) but tests 3.75 times more possible code points.

It is ~7% faster than unicode.IsSpace(r rune) from the standard library and covers nearly twice as many code points.

isWhiteSpaceLogicChain checks for any Unicode whitespace rune.

Included:

0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
0x2006, 0x2007, 0x2008, 0x2009, 0x200A, 0x2028,
0x2029, 0x202F, 0x205F, 0x3000, 0x1680

Related Unicode characters (property White_Space=no), not included:

0x200B, 0x200C, 0x200D, 0x2060
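
A map-based membership test of this kind can be sketched as follows. The map here holds only a small subset of the code points listed above, and both whiteSpace and isWhiteSpace are hypothetical names used for illustration.

```go
package main

import "fmt"

// whiteSpace is a deliberately partial table for illustration.
// The package's own map covers more code points.
var whiteSpace = map[rune]bool{
	0x0009: true, // CHARACTER TABULATION
	0x000A: true, // LINE FEED
	0x0020: true, // SPACE
	0x00A0: true, // NO-BREAK SPACE
	0x1680: true, // OGHAM SPACE MARK
	0x2003: true, // EM SPACE
	0x3000: true, // IDEOGRAPHIC SPACE
}

// isWhiteSpace relies on the zero value: a missing key reads as false.
func isWhiteSpace(r rune) bool {
	return whiteSpace[r]
}

func main() {
	for _, r := range []rune{' ', 0x2003, 'a', 0x200B} {
		fmt.Printf("U+%04X %v\n", r, isWhiteSpace(r))
	}
}
```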

func JoinLines

func JoinLines(list []string) string

func RuneSample

func RuneSample(c rune)

RuneSample prints a sample of various Unicode runes.

func RuneSamples

func RuneSamples() []rune

func SmallByteSamples

func SmallByteSamples() []byte

func SmallByteStringSamples

func SmallByteStringSamples() (list []string)

func SmallRuneSamples

func SmallRuneSamples() []rune

func SmallRuneStringSamples

func SmallRuneStringSamples() (list []string)

func TabIt

func TabIt(s string, n int) string

func ToLower

func ToLower(s string) string

func ToLowerByte

func ToLowerByte(c byte) byte

func ToString

func ToString(any interface{}) string

ToString implements Stringer directly as a function call with a parameter instead of a method on that parameter.
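
The page does not show how ToString converts its argument. One plausible shape, sketched here under that caveat, is to use the value's String method when it implements fmt.Stringer and fall back to fmt.Sprint otherwise. toString and celsius are hypothetical names.

```go
package main

import "fmt"

// celsius is a hypothetical type with a String method.
type celsius float64

func (c celsius) String() string {
	return fmt.Sprintf("%.1f°C", float64(c))
}

// toString is an illustrative sketch, not this package's source.
// It prefers fmt.Stringer and falls back to default formatting.
func toString(v interface{}) string {
	if s, ok := v.(fmt.Stringer); ok {
		return s.String()
	}
	return fmt.Sprint(v)
}

func main() {
	fmt.Println(toString(celsius(21.5)))
	fmt.Println(toString(42))
	fmt.Println(toString(nil))
}
```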

func ToUpper

func ToUpper(s string) string

func ToUpperByte

func ToUpperByte(c byte) byte

Types

type Any

type Any interface{}

Any is used to store data when the type cannot be determined ahead of time.

type List

type List struct {
	// contains filtered or unexported fields
}

List is a wrapper around a slice of items. It offers formatting options and convenience functions.

Example
// List.Contains()
fmt.Println(tempList.Contains(3.14))
fmt.Println(tempList.Contains(42))
// List.Len()
fmt.Println(tempList.Len())
// List.Cap()
fmt.Println(tempList.Cap())
// List.Name()
fmt.Println(tempList.Name())
// List.Add()
fmt.Println(tempList.Contains("fake"))
tempList.Add("fake")
// fmt.Println(tempList.Contains("fake"))
Output:

false
false
6
6
tempList
false

func NewList

func NewList(name string, data []Any) *List

NewList returns a new List from the given data.

Example
fmt.Println(tempList)
Output:

&{tempList [this 1 <nil> 0 3.14 9]}

func (*List) Add

func (v *List) Add(item Any)

Add adds item to the List. Duplicates are allowed.

func (*List) Cap

func (v *List) Cap() int

Cap returns the max number of elements in the List.

func (*List) Contains

func (v *List) Contains(item Any) bool

Contains reports whether the List contains item.

func (*List) Len

func (v *List) Len() int

Len returns the number of elements in the List. If the List is nil, Len() is zero.

func (*List) Name

func (v *List) Name() string

Name returns the name of the List.

func (*List) ToSet

func (v *List) ToSet() *Set

ToSet returns the underlying data as a Set.

func (*List) ToSlice

func (v *List) ToSlice() []Any

ToSlice returns the underlying data as a slice.

type Set

type Set struct {
	// contains filtered or unexported fields
}

Set is a hashable version of a list with unique items.

Example
// Set.Contains()
fmt.Println(tempSet.Contains(3.14))
fmt.Println(tempSet.Contains(42))
// Set.Len()
fmt.Println(tempSet.Len())
// Set.Cap()
fmt.Println(tempSet.Cap())
// Set.Name()
fmt.Println(tempSet.Name())
// Set.Add()
fmt.Println(tempSet.Contains("fake"))
_ = tempSet.Add("fake")
// fmt.Println(tempSet.Contains("fake"))
Output:

true
false
6
6
tempSet
false

func NewSet

func NewSet(name string, data []Any) *Set

NewSet returns a new Set from the given data.

Example
fmt.Println(tempSet)
Output:

&{tempSet map[<nil>:true 3.14:true 0:true 1:true 9:true this:true]}

func (*Set) Add

func (s *Set) Add(item Any) error

Add adds item to the Set or returns an error. Duplicates are not allowed.

func (*Set) Cap

func (s *Set) Cap() int

Cap returns the max number of elements in the Set (since cap is undefined for map types in Go).

func (*Set) Contains

func (s *Set) Contains(item Any) bool

Contains returns true if the Set contains item.

func (*Set) Len

func (s *Set) Len() int

Len returns the number of elements in the Set. If the Set is nil, Len() is zero.

func (*Set) Name

func (s *Set) Name() string

Name returns the name of the Set.

func (*Set) ToList

func (s *Set) ToList() *List

ToList returns the underlying data as a List.

func (*Set) ToSlice

func (s *Set) ToSlice() []Any

ToSlice returns the underlying data as a slice.

type SetMap

type SetMap = map[Any]bool
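
Because Set hides its fields, the sketch below is only a guess at how a set backed by a map such as SetMap could behave. sketchSet and its methods are hypothetical, and the choice to report duplicates through the error returned by Add is an assumption based on the Add documentation above.

```go
package main

import "fmt"

// sketchSet is a hypothetical set backed by map[interface{}]bool,
// the same shape as SetMap. Keys must be hashable: inserting a
// slice or map as an item would panic at run time.
type sketchSet struct {
	name string
	data map[interface{}]bool
}

func newSketchSet(name string, data []interface{}) *sketchSet {
	s := &sketchSet{name: name, data: make(map[interface{}]bool, len(data))}
	for _, v := range data {
		s.data[v] = true // duplicates in the input collapse here
	}
	return s
}

// Contains reports whether item is in the set.
func (s *sketchSet) Contains(item interface{}) bool {
	return s.data[item]
}

// Add inserts item, or returns an error if it is already present.
func (s *sketchSet) Add(item interface{}) error {
	if s.data[item] {
		return fmt.Errorf("set %q already contains %v", s.name, item)
	}
	s.data[item] = true
	return nil
}

// Len returns the number of elements; a nil set has length zero.
func (s *sketchSet) Len() int {
	if s == nil {
		return 0
	}
	return len(s.data)
}

func main() {
	s := newSketchSet("tempSet", []interface{}{"this", 1, nil, 0, 3.14, 9})
	fmt.Println(s.Contains(3.14), s.Contains(42), s.Len())
	fmt.Println(s.Add("fake"))
	fmt.Println(s.Add("fake"))
}
```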
