regex

package module
v0.0.0-...-af7e799 Latest
Warning

This package is not in the latest version of its module.

Published: May 24, 2023 License: Apache-2.0 Imports: 10 Imported by: 2

README

ICU Regular Expressions in Go

MySQL uses the ICU library to evaluate regular expressions. Go's built-in regular expressions follow a different standard than ICU, so using them to match MySQL's behavior can cause inconsistencies. Ideally such an inconsistency results in an error (prompting user intervention), but it may instead silently return unexpected results, raising no alarm while data is modified in unexpected ways.
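As a small illustration of the loud kind of inconsistency, the sketch below checks a few patterns against Go's standard regexp package only. Lookahead and backreferences are part of ICU's regular expression syntax, while Go's RE2-based engine refuses to compile them. The helper name goAccepts is made up for this example, and nothing here calls ICU or this package.

```go
package main

import (
	"fmt"
	"regexp"
)

// goAccepts reports whether Go's standard regexp package can compile pattern.
func goAccepts(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// ICU accepts each of these constructs. Go's engine rejects the
	// first two at compile time and accepts the third.
	patterns := []string{
		`a(?=b)`, // lookahead
		`(a)\1`,  // backreference
		`a+b`,    // syntax shared by both engines
	}
	for _, p := range patterns {
		fmt.Printf("%q accepted by Go regexp: %v\n", p, goAccepts(p))
	}
}
```

Cases like these fail loudly. The silent cases, where both engines accept a pattern but interpret it differently, are the ones this package is mainly meant to avoid.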

To get around this, we've implemented the necessary ICU functions by compiling them into a WebAssembly module and running that module with the wazero library. Although this approach comes with a performance penalty, it lets packages that depend on this one retain cross-compilation support, because this package does not require CGo.

Building

To make modifications to the compiled WASM module, we've included a build script. The requirements are as follows:

  • Emscripten v3.1.38
  • wasm2wat
  • wat2wasm

Other Emscripten versions may compile just fine, but they have not been tested, so we restrict compilation to the tested version. This also means that the ICU library is version 68.1, as that is the only version ported to our supported version of Emscripten.

wasm2wat and wat2wasm are used to expose the global stack variable, as not all platforms expose it.

None of the exposed functions require ICU's data, so it has been excluded to save space and memory. MySQL, although collation-aware (and in spite of what the documentation may suggest), does not make use of any collation functionality in the context of regular expressions.

Notes

Due to the high startup cost of the WASM runtime, this package enforces that every Regex object is closed before the last reference to it is dropped. If a Regex object becomes unreferenced before being closed, a panic will occur at some non-deterministic point in the future.

Documentation

Constants

This section is empty.

Variables

var (
	// ErrRegexNotYetSet is returned when attempting to use another function before the regex has been initialized.
	ErrRegexNotYetSet = errors.NewKind("SetRegexString must be called before any other function")
	// ErrMatchNotYetSet is returned when attempting to use another function before the match string has been set.
	ErrMatchNotYetSet = errors.NewKind("SetMatchString must be called as there is nothing to match against")
	// ErrInvalidRegex is returned when an invalid regex is given
	ErrInvalidRegex = errors.NewKind("the given regular expression is invalid")
)

Functions

This section is empty.

Types

type CharPtr

type CharPtr int32

type Regex

type Regex interface {
	// SetRegexString sets the regular expression (pattern) that will later be used for matching. This must be called
	// at least once before any other calls are made (except for Close).
	SetRegexString(ctx context.Context, regexStr string, flags RegexFlags) error
	// SetMatchString sets the string that we will either be matching against, or executing the replacements on. This
	// must be called after SetRegexString, but before any other calls.
	SetMatchString(ctx context.Context, matchStr string) error
	// Matches returns whether the previously-set regex matches the previously-set match string. Must call
	// SetRegexString and SetMatchString before this function.
	Matches(ctx context.Context, start int, occurrence int) (bool, error)
	// Replace returns a new string with the replacement string occupying the matched portions of the match string,
	// based on the regex. Position starts at 1, not 0. Must call SetRegexString and SetMatchString before this function.
	Replace(ctx context.Context, replacementStr string, position int, occurrence int) (string, error)
	// StringBufferSize returns the size of the string buffers, in bytes. If the string buffer is not being used, then
	// this returns zero.
	StringBufferSize() uint32
	// Close frees up the internal resources. This MUST be called, else a panic will occur at some non-deterministic time.
	Close() error
}

Regex is an interface that wraps around the ICU library, exposing ICU's regular expression functionality. A Regex must be closed once it is no longer needed.

func CreateRegex

func CreateRegex(stringBufferInBytes uint32) Regex

CreateRegex creates a Regex with a preallocated region of memory that supports strings whose size is less than or equal to the given size. Such strings skip the allocation and deallocation phases, which saves time. A size of zero forces every string to be allocated and deallocated. The buffer size applies to a single string, so twice the given amount is actually consumed (one buffer for the regex string and one for the match string). Once you are done with the Regex, you must call Close. A Regex is intended for single-threaded use only, so each thread should use its own Regex when one is needed.

type RegexFlags

type RegexFlags uint32

RegexFlags are flags to define the behavior of the regular expression. Use OR (|) to combine flags. All flag values were taken directly from ICU.

const (
	// No flags set. The default matching behavior is used.
	RegexFlags_None RegexFlags = 0

	// Enable case insensitive matching.
	RegexFlags_Case_Insensitive RegexFlags = 2

	// Allow white space and comments within patterns.
	RegexFlags_Comments RegexFlags = 4

	// If set, '.' matches line terminators; otherwise, '.' matching stops at line end.
	RegexFlags_Dot_All RegexFlags = 32

	// If set, treat the entire pattern as a literal string. Metacharacters or escape sequences in the input sequence
	// will be given no special meaning.
	//
	// The flag RegexFlags_Case_Insensitive retains its impact on matching when used in conjunction with this flag. The
	// other flags become superfluous.
	RegexFlags_Literal RegexFlags = 16

	// Control behavior of "$" and "^". If set, recognize line terminators within string, otherwise, match only at start
	// and end of input string.
	RegexFlags_Multiline RegexFlags = 8

	// Unix-only line endings. When this mode is enabled, only '\n' is recognized as a line ending in the behavior
	// of ., ^, and $.
	RegexFlags_Unix_Lines RegexFlags = 1

	// Unicode word boundaries. If set, \b uses the Unicode TR 29 definition of word boundaries. Warning: Unicode word
	// boundaries are quite different from traditional regular expression word boundaries.
	// See http://unicode.org/reports/tr29/#Word_Boundaries
	RegexFlags_Unicode_Word RegexFlags = 256

	// Error on unrecognized backslash escapes. If set, fail with an error on patterns that contain backslash-escaped
	// ASCII letters without a known special meaning. If this flag is not set, these escaped letters represent
	// themselves.
	RegexFlags_Error_On_Unknown_Escapes RegexFlags = 512
)
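Combining flags with OR can be sketched as below. The type and a few of the constants are redeclared locally, with the values listed above, purely so the example compiles on its own. The has helper is a hypothetical convenience and is not part of this package.

```go
package main

import "fmt"

// RegexFlags and the constants below are local copies of documented values.
type RegexFlags uint32

const (
	RegexFlags_None             RegexFlags = 0
	RegexFlags_Case_Insensitive RegexFlags = 2
	RegexFlags_Multiline        RegexFlags = 8
	RegexFlags_Dot_All          RegexFlags = 32
)

// has reports whether every bit in want is set in f.
func has(f, want RegexFlags) bool { return f&want == want }

func main() {
	flags := RegexFlags_Case_Insensitive | RegexFlags_Multiline
	fmt.Println(uint32(flags))                    // 10
	fmt.Println(has(flags, RegexFlags_Multiline)) // true
	fmt.Println(has(flags, RegexFlags_Dot_All))   // false
}
```

Because each flag occupies its own bit, the combined value can be passed as the flags argument of SetRegexString, and individual flags can be tested with a bitwise AND.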

type UCharPtr

type UCharPtr uint32

type UErrorCode

type UErrorCode int32

type URegularExpressionPtr

type URegularExpressionPtr uint32
