cords

package module
v0.1.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 18, 2021 License: BSD-3-Clause Imports: 5 Imported by: 1

README

Cords

Cords of text as a versatile string enhancement

Status

Work in progress, please be patient!


From a paper by Hans-J. Boehm, Russ Atkinson and Michael Plass, 1995:

What's wrong with Strings?

Programming languages such as C and traditional Pascal provide a built-in notion of strings as essentially fixed length arrays of characters. The language itself provides the array primitives for accessing such strings, plus often a collection of library routines for higher level operations such as string concatenation. Thus the implementation is essentially constrained to represent strings as contiguous arrays of characters, with or without additional space for a length, expansion room, etc. […] We desire the following characteristics:

  1. Immutable strings, i.e. strings that cannot be modified in place, should be well supported. A procedure should be able to operate on a string it was passed without danger of accidentally modifying the caller’s data structures. This becomes particularly important in the presence of concurrency, where in-place updates to strings would often have to be properly synchronized. […]
  2. Commonly occurring operations on strings should be efficient. In particular (non-destructive) concatenation of strings and non-destructive substring operations should be fast, and should not require excessive amounts of space.
  3. Common string operations should scale to long strings. There should be no practical bound on the length of strings. Performance should remain acceptable for long strings. […]
  4. It should be as easy as possible to treat some other representation of ‘sequenceof character’ (e.g. a file) as a string. Functions on strings should be maximally reusable.

Strings represented as contiguous arrays of characters, as in C or Pascal, violate most of these.

From:

Ropes: an Alternative to Strings

1995, by

hans-j. boehm, russ atkinson and michael plass

Xerox PARC, 3333 Coyote Hill Rd., Palo Alto, CA 94304, U.S.A. (email:boehm@parc.xerox.com)

Other / similar Solutions

Documentation

Overview

Package cords offers a versatile string enhancement to ease handling of texts.

Cords

Cords (or sometimes called ropes) organize fragments of immutable text internally in a tree-structure. This speeds up frequent string-operations like concatenation, especially for long strings. This package aims towards applications which have to deal with text, i.e., large amounts of organized strings.

From Wikipedia: In computer programming, a rope, or cord, is a data structure composed of smaller strings that is used to efficiently store and manipulate a very long string. For example, a text editing program may use a rope to represent the text being edited, so that operations such as insertion, deletion, and random access can be done efficiently. […] In summary, ropes are preferable when the data is large and modified often.

_________________________________________________________________________

From a paper by Hans-J. Boehm, Russ Atkinson and Michael Plass, 1995:

Ropes, an Alternative to Strings

Xerox PARC, 3333 Coyote Hill Rd., Palo Alto, CA 94304, U.S.A. (email:boehm@parc.xerox.com)

What's wrong with Strings?

Programming languages such as C […] provide a built-in notion of strings as essentially fixed length arrays of characters. The language itself provides the array primitives for accessing such strings, plus often a collection of library routines for higher level operations such as string concatenation. Thus the implementation is essentially constrained to represent strings as contiguous arrays of characters, with or without additional space for a length, expansion room, etc. […] We desire the following characteristics:

1. Immutable strings, i.e. strings that cannot be modified in place, should be well supported. A procedure should be able to operate on a string it was passed without danger of accidentally modifying the caller’s data structures. This becomes particularly important in the presence of concurrency, where in-place updates to strings would often have to be properly synchronized. […]

2. Commonly occurring operations on strings should be efficient. In particular (non-destructive) concatenation of strings and non-destructive substring operations should be fast, and should not require excessive amounts of space.

3. Common string operations should scale to long strings. There should be no practical bound on the length of strings. Performance should remain acceptable for long strings. […]

4. It should be as easy as possible to treat some other representation of ‘sequenceof character’ (e.g. a file) as a string. Functions on strings should be maximally reusable.

Strings represented as contiguous arrays of characters, as in C or Pascal, violate most of these.

_________________________________________________________________________

In Go, these points of critique are somewhat mitigated with slices. However, slices will carry only so far, and cords add a layer of convenience and stable performance characteristics on top of them. You can think of cords as fancy slices of text, with some additional functionality.

Cords may be constructed from various sources, with the simplest case being a call to

cords.FromString("Hello World")

Other possibilities are cords from text files or from HTML documents.

_________________________________________________________________________

BSD 3-Clause License

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Index

Constants

View Source
const ErrCordCompleted = CordError("forbidden to add fragements; cord has been completed")

ErrCordCompleted signals that a cord builder has already completed a cord and it's illegal to further add fragments.

View Source
const ErrIllegalArguments = CordError("illegal arguments")

ErrIllegalArguments is flagged whenever function parameters are invalid.

View Source
const ErrIllegalDelimiterPattern = CordError("illegal delimiter pattern")

ErrIllegalDelimiterPattern is flagged if a given delimiter pattern is either not compilable as a valid regular expression or if it accepts the empty string as a match.

View Source
const ErrIndexOutOfBounds = CordError("index out of bounds")

ErrIndexOutOfBounds is flagged whenever a cord position is greater than the length of the cord.

Variables

This section is empty.

Functions

func ApplyMaterializedMetric

func ApplyMaterializedMetric(cord Cord, i, j uint64, metric MaterializedMetric) (MetricValue, Cord, error)

ApplyMaterializedMetric applies a materialized metric to a (section of a) text. Returns a metric value and a cord, which manages the spans of the metric on the text.

i and j are text positions with Go slice semantics. If [i, j) does not specify a valid slice of the text, ErrIndexOutOfBounds will be returned.

func Cord2Dot

func Cord2Dot(text Cord, w io.Writer)

Cord2Dot outputs the internal structure of a Cord in Graphviz DOT format (for debugging purposes).

func Cut

func Cut(cord Cord, i, l uint64) (Cord, Cord, error)

Cut cuts out a substring [i…i+l) from a cord. It will return a new cord without the cut-out segment, plus the substring-segment, and possibly an error. If l is 0, cord is unchanged.

func Split

func Split(cord Cord, i uint64) (Cord, Cord, error)

Split splits a cord into two new (smaller) cords right before position i. Split(C,i) ⇒ split C into C1 and C2, with C1=b0,…,bi-1 and C2=bi,…,bn.

func T

func T() tracing.Trace

T traces to a global core-tracer.

Types

type Builder

type Builder struct {
	// contains filtered or unexported fields
}

Builder is for building cords by adding text fragments (leafs). The empty instance is a valid cord builder, but clients may use NewBuilder instead.

func NewBuilder

func NewBuilder() *Builder

NewBuilder creates a new and empty cord builder.

func (*Builder) Append

func (b *Builder) Append(leaf Leaf) error

Append appends a text fragement represented by a cord leaf at the end of the cord to build.

func (Builder) Cord

func (b Builder) Cord() Cord

Cord returns the cord which this builder is holding up to now. It is illegal to continue adding fragments after `Cord` has been called, but `Cord` may be called multiple times.

func (*Builder) Prepend

func (b *Builder) Prepend(leaf Leaf) error

Prepend pushes a text fragement represented by a cord leaf at the beginning of the cord to build.

func (*Builder) Reset

func (b *Builder) Reset()

Reset drops the cord building currently in progress and prepares the builder for a fresh build.

type Cord

type Cord struct {
	// contains filtered or unexported fields
}

Cord is a type for an enhanced string.

A cord internally consists of fragments of text, which are considered immutable. Fragments may be shared between cords, or versions of cords. Cords change in a concurrency-safe way, as every modifying operation on a cord will create a copy of changed parts of the cord. Thus cords are persistent data structures.

A cord created by

Cord{}

is a valid object and behaves like the empty string.

Due to their internal structure cords do have performance characteristics differing from Go strings or byte arrays.

Operation     |   Rope          |  String
--------------+-----------------+--------
Index         |   O(log n)      |   O(1)
Split         |   O(log n)      |   O(1)
Iterate       |   O(n)          |   O(n)

Concatenate   |   O(log n)      |   O(n)
Insert        |   O(log n)      |   O(n)
Delete        |   O(log n)      |   O(n)

For use cases with many editing operations on large texts, cords have stable performance and space characteristics. It's more appropriate to think of cords as a type for ‘text’ than as strings (https://mortoray.com/2014/03/17/strings-and-text-are-not-the-same/).

func Concat

func Concat(cord Cord, others ...Cord) Cord

Concat concatenates cords and returns a new cord.

func FromString

func FromString(s string) Cord

FromString creates a cord from a Go string.

func Insert

func Insert(cord Cord, c Cord, i uint64) (Cord, error)

Insert inserts a substring-cord c into cord at index i, resulting in a new cord. If i is greater than the length of cord, an out-of-bounds error is returned.

func Substr

func Substr(cord Cord, i, l uint64) (Cord, error)

Substr creates a new cord from a subset of cord, with: Substr(C,i,l) ⇒ Cord C2=bi,…,bi+l−1.

func (Cord) EachLeaf

func (cord Cord) EachLeaf(f func(Leaf, uint64) error) error

EachLeaf iterates over all leaf nodes of the cord.

func (Cord) FragmentCount

func (cord Cord) FragmentCount() int

FragmentCount returns the number of fragments this cord is internally split into. It is a read-only property, which is provided as it is sometimes helpful in scenarios for special-type cords.

func (Cord) Index

func (cord Cord) Index(i uint64) (Leaf, uint64, error)

Index returns the cord Leaf (i.e., a text fragment) which includes the byte at position i, together with an index position within the fragment's text.

func (Cord) IsVoid

func (cord Cord) IsVoid() bool

IsVoid returns true if cord is "".

func (Cord) Len

func (cord Cord) Len() uint64

Len returns the length in bytes of a cord.

func (Cord) Reader

func (cord Cord) Reader() io.Reader

Reader returns a reader for the bytes of cord.

func (Cord) Report

func (cord Cord) Report(i, l uint64) (string, error)

Report outputs a substring: Report(i,l) ⇒ outputs the string bi,…,bi+l−1.

func (Cord) String

func (cord Cord) String() string

String returns the cord as a Go string. This may be an expensive operation, as it will allocate a buffer for all the bytes of the cord and collect all fragments to a single continuous string. When working with large amounts of text, clients should probably avoid to call this. Instead they should jump to a position within the cord and report a substring or use an iterator.

type CordError

type CordError string

CordError is an error type for the cords module

func (CordError) Error

func (e CordError) Error() string

type Leaf

type Leaf interface {
	Weight() uint64                  // length of the leaf fragment in bytes
	String() string                  // produce the leaf fragment as a string
	Substring(uint64, uint64) []byte // substring [i:j]
	Split(uint64) (Leaf, Leaf)       // split into 2 leafs at position i
}

Leaf is an interface type for leafs of a cord structure. Leafs do carry fragments of text.

The default implementation uses Go strings.

type MaterializedMetric

type MaterializedMetric interface {
	Metric
	Leafs(MetricValue, bool) []Leaf
}

MaterializedMetric is a type for metrics (please refer to interface Metric) that build a concrete cord tree for a text-cord.

A materialized metric does metric calculations exactly as a simple metric. However, it additionally supports building up a tree from atomic leafs containing metric values.

There are (at least) two ways to go about building a metric tree: one to preserve a homomorph of the text fragments, essentially materializing a catamorphism, or we can re-align the leaf boundaries with the continuity-boundaries of the metric. We'll go with the latter and build up a tree which will the be somewhat decoupled from the text cord.

type Metric

type Metric interface {
	Apply(frag []byte) MetricValue
	Combine(leftSibling, rightSibling MetricValue, metric Metric) MetricValue
}

Metric is a metric to calculate on a cord. Sometimes it's helpful to find information about a (large) text by collecting metrics from fragments and assembling them. Cords naturally break up texts into smaller fragments, letting us calculate metrics by applying them to (a subset of) fragments and propagate them upwards the nodes of the cord tree.

An example of a (very simplistic) metric would be to count the number of bytes in a text. The total count is calculated by counting the bytes in every fragment and adding up intermediate sums while travelling upwards through the cord's tree.

Metric is a simple interface whith a metric function that will be applied to portions of text. Clients will have no control over size or boundaries of the fragment. Applying metrics to different fragments may be done concurrently by the calculation driver, therefore it is illegal to hold unguarded global state in a metric.

Combine requires the Metric to calculate a metric value “sum” (monoid) from two metric values. This way the metric will bubble up metric values to the root of the cord tree and therewith result in a single overall metric value for a text. Combine must be a monoid over cords.MetricValue, with a neutral element n of Apply = f("") → n, i.e. the metric value of the empty string.

However, for materialized metrics it is a bit different from plain metrics: they resemble a free monoid. This is reflected by the result of materialized metrics, which is a list of spans (organized through a cord-tree). As a corollary, Combine has an additional task for materialized metrics than it has for plain metrics. Combine has to focus on the bytes *between* the already recognized spans of both the left and right sibling, and be able to convert them to cord leafs.

type MetricValue

type MetricValue interface {
	Len() int                      // summed up length of text fragments
	Unprocessed() ([]byte, []byte) // unprocessed bytes at either end
}

MetricValue is a type returned by applying a metric to text fragments (see interface Metric). It holds information about the added length of the text fragments which this value has been calulated for, and slices of bytes at either end of the accumulated fragments which have to be reprocessed.

Fragments of text are presented to the metric function as slices of bytes, without regard to rune-, grapheme- or line-boundaries. If we want to calculate information about, say, the maximum line length in a text, we'd have to count the graphemes of fragments. Graphemes will consist, however, of an unknown number of bytes and code points, which may only be identified by reading them at the grapheme start character. If a fragment is cut in the middle of a grapheme, the metric at the first bytes of a fragment cannot reliably calculated. Therefore metrics will be calculated on substrings of fragments where conditions allow a metric application, and any unprocessed bytes at the left and right boundary will be marked for reprocessing.

When propagating metric values up the tree nodes, metric value of the left and right child node of a cord tree node will have to be combined. The combination must be able to reprocess any unprocessed bytes.

   --- denotes unprocessed bytes
   === denotes bytes already processed by the metric

(a)  |-----========--|   &   |----==============|     combine two sibling fragments

(b)  |-----========    ------     ==============|     reprocess 6 bytes in between

(c)  |-----============================|              combined intermediate fragment  or

For an approachable discussion please refer to Raph Levien's “Rope Science” series (https://xi-editor.io/docs/rope_science_01.html), or—less approachable—read up on catamorphism.

Calculating metric values is a topic where implementation characteristics of cords get somewhat visible for clients of the API. This is unfortunate, but may be mitigated by helper types provided by this package. Clients usually will either create metrics on top of pre-defined basic metrics or they may embed the MetricValueBase helper type in their MetricValues.

func ApplyMetric

func ApplyMetric(cord Cord, i, j uint64, metric Metric) (MetricValue, error)

ApplyMetric applies a metric calculation on a (section of a) text.

i and j are text positions with Go slice semantics. If [i, j) does not specify a valid slice of the text, ErrIndexOutOfBounds will be returned.

type StringLeaf

type StringLeaf string

StringLeaf is the default implementation of type Leaf. Calls to cords.FromString(…) will produce a cord with leafs of type StringLeaf.

StringLeaf is made public, because it may be of use for other constructors of cords.

func (StringLeaf) Split

func (lstr StringLeaf) Split(i uint64) (Leaf, Leaf)

Split splits a leaf at position i, resulting in 2 new leafs.

func (StringLeaf) String

func (lstr StringLeaf) String() string

func (StringLeaf) Substring

func (lstr StringLeaf) Substring(i, j uint64) []byte

Substring returns a string segment of the leaf's text fragment.

func (StringLeaf) Weight

func (lstr StringLeaf) Weight() uint64

Weight of a leaf is its string length in bytes.

Directories

Path Synopsis
Package metrics provides some pre-manufactured metrics on texts.
Package metrics provides some pre-manufactured metrics on texts.
Package styled makes styled text.
Package styled makes styled text.
formatter
Package formatter formats styled text on output devices with fixed-width fonts.
Package formatter formats styled text on output devices with fixed-width fonts.
inline
Package inline styles inline text such as HTML-spans or console-output.
Package inline styles inline text such as HTML-spans or console-output.
itemized
Package itemized helps itemizing paragraphs and prepare them for formatting.
Package itemized helps itemizing paragraphs and prepare them for formatting.
Package textfile provides a convenient and efficient API for handling content of large texts as strings.
Package textfile provides a convenient and efficient API for handling content of large texts as strings.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL