Documentation
¶
Overview ¶
Package cords offers a versatile string enhancement to ease handling of texts.
Cords ¶
Cords (or sometimes called ropes) organize fragments of immutable text internally in a tree-structure. This speeds up frequent string-operations like concatenation, especially for long strings. This package aims towards applications which have to deal with text, i.e., large amounts of organized strings.
From Wikipedia: In computer programming, a rope, or cord, is a data structure composed of smaller strings that is used to efficiently store and manipulate a very long string. For example, a text editing program may use a rope to represent the text being edited, so that operations such as insertion, deletion, and random access can be done efficiently. […] In summary, ropes are preferable when the data is large and modified often.
_________________________________________________________________________
From a paper by Hans-J. Boehm, Russ Atkinson and Michael Plass, 1995:
Ropes, an Alternative to Strings ¶
Xerox PARC, 3333 Coyote Hill Rd., Palo Alto, CA 94304, U.S.A. (email:[email protected])
What's wrong with Strings?
Programming languages such as C […] provide a built-in notion of strings as essentially fixed length arrays of characters. The language itself provides the array primitives for accessing such strings, plus often a collection of library routines for higher level operations such as string concatenation. Thus the implementation is essentially constrained to represent strings as contiguous arrays of characters, with or without additional space for a length, expansion room, etc. […] We desire the following characteristics:
1. Immutable strings, i.e. strings that cannot be modified in place, should be well supported. A procedure should be able to operate on a string it was passed without danger of accidentally modifying the caller’s data structures. This becomes particularly important in the presence of concurrency, where in-place updates to strings would often have to be properly synchronized. […]
2. Commonly occurring operations on strings should be efficient. In particular (non-destructive) concatenation of strings and non-destructive substring operations should be fast, and should not require excessive amounts of space.
3. Common string operations should scale to long strings. There should be no practical bound on the length of strings. Performance should remain acceptable for long strings. […]
4. It should be as easy as possible to treat some other representation of ‘sequenceof character’ (e.g. a file) as a string. Functions on strings should be maximally reusable.
Strings represented as contiguous arrays of characters, as in C or Pascal, violate most of these.
_________________________________________________________________________
In Go, these points of critique are somewhat mitigated with slices. However, slices will carry only so far, and cords add a layer of convenience and stable performance characteristics on top of them. You can think of cords as fancy slices of text, with some additional functionality.
Cords may be constructed from various sources, with the simplest case being a call to
cords.FromString("Hello World")
Other possibilities are cords from text files or from HTML documents.
_________________________________________________________________________
BSD 3-Clause License ¶
Copyright (c) 2020–21, Norbert Pillmayer ¶
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Index ¶
- Constants
- func ApplyMaterializedMetric(cord Cord, i, j uint64, metric MaterializedMetric) (MetricValue, Cord, error)
- func Cord2Dot(text Cord, w io.Writer)
- func Cut(cord Cord, i, l uint64) (Cord, Cord, error)
- func Split(cord Cord, i uint64) (Cord, Cord, error)
- func T() tracing.Trace
- type Builder
- type Cord
- func (cord Cord) EachLeaf(f func(Leaf, uint64) error) error
- func (cord Cord) FragmentCount() int
- func (cord Cord) Index(i uint64) (Leaf, uint64, error)
- func (cord Cord) IsVoid() bool
- func (cord Cord) Len() uint64
- func (cord Cord) Reader() io.Reader
- func (cord Cord) Report(i, l uint64) (string, error)
- func (cord Cord) String() string
- type CordError
- type Leaf
- type MaterializedMetric
- type Metric
- type MetricValue
- type StringLeaf
Constants ¶
const ErrCordCompleted = CordError("forbidden to add fragements; cord has been completed")
ErrCordCompleted signals that a cord builder has already completed a cord and it's illegal to further add fragments.
const ErrIllegalArguments = CordError("illegal arguments")
ErrIllegalArguments is flagged whenever function parameters are invalid.
const ErrIllegalDelimiterPattern = CordError("illegal delimiter pattern")
ErrIllegalDelimiterPattern is flagged if a given delimiter pattern is either not compilable as a valid regular expression or if it accepts the empty string as a match.
const ErrIndexOutOfBounds = CordError("index out of bounds")
ErrIndexOutOfBounds is flagged whenever a cord position is greater than the length of the cord.
Variables ¶
This section is empty.
Functions ¶
func ApplyMaterializedMetric ¶
func ApplyMaterializedMetric(cord Cord, i, j uint64, metric MaterializedMetric) (MetricValue, Cord, error)
ApplyMaterializedMetric applies a materialized metric to a (section of a) text. Returns a metric value and a cord, which manages the spans of the metric on the text.
i and j are text positions with Go slice semantics. If [i, j) does not specify a valid slice of the text, ErrIndexOutOfBounds will be returned.
func Cord2Dot ¶
Cord2Dot outputs the internal structure of a Cord in Graphviz DOT format (for debugging purposes).
func Cut ¶
Cut cuts out a substring [i…i+l) from a cord. It will return a new cord without the cut-out segment, plus the substring-segment, and possibly an error. If l is 0, cord is unchanged.
Types ¶
type Builder ¶
type Builder struct {
// contains filtered or unexported fields
}
Builder is for building cords by adding text fragments (leafs). The empty instance is a valid cord builder, but clients may use NewBuilder instead.
func (*Builder) Append ¶
Append appends a text fragement represented by a cord leaf at the end of the cord to build.
func (Builder) Cord ¶
Cord returns the cord which this builder is holding up to now. It is illegal to continue adding fragments after `Cord` has been called, but `Cord` may be called multiple times.
type Cord ¶
type Cord struct {
// contains filtered or unexported fields
}
Cord is a type for an enhanced string.
A cord internally consists of fragments of text, which are considered immutable. Fragments may be shared between cords, or versions of cords. Cords change in a concurrency-safe way, as every modifying operation on a cord will create a copy of changed parts of the cord. Thus cords are persistent data structures.
A cord created by
Cord{}
is a valid object and behaves like the empty string.
Due to their internal structure cords do have performance characteristics differing from Go strings or byte arrays.
Operation | Rope | String --------------+-----------------+-------- Index | O(log n) | O(1) Split | O(log n) | O(1) Iterate | O(n) | O(n) Concatenate | O(log n) | O(n) Insert | O(log n) | O(n) Delete | O(log n) | O(n)
For use cases with many editing operations on large texts, cords have stable performance and space characteristics. It's more appropriate to think of cords as a type for ‘text’ than as strings (https://mortoray.com/2014/03/17/strings-and-text-are-not-the-same/).
func Insert ¶
Insert inserts a substring-cord c into cord at index i, resulting in a new cord. If i is greater than the length of cord, an out-of-bounds error is returned.
func Substr ¶
Substr creates a new cord from a subset of cord, with: Substr(C,i,l) ⇒ Cord C2=bi,…,bi+l−1.
func (Cord) FragmentCount ¶
FragmentCount returns the number of fragments this cord is internally split into. It is a read-only property, which is provided as it is sometimes helpful in scenarios for special-type cords.
func (Cord) Index ¶
Index returns the cord Leaf (i.e., a text fragment) which includes the byte at position i, together with an index position within the fragment's text.
func (Cord) String ¶
String returns the cord as a Go string. This may be an expensive operation, as it will allocate a buffer for all the bytes of the cord and collect all fragments to a single continuous string. When working with large amounts of text, clients should probably avoid to call this. Instead they should jump to a position within the cord and report a substring or use an iterator.
type Leaf ¶
type Leaf interface { Weight() uint64 // length of the leaf fragment in bytes String() string // produce the leaf fragment as a string Substring(uint64, uint64) []byte // substring [i:j] Split(uint64) (Leaf, Leaf) // split into 2 leafs at position i }
Leaf is an interface type for leafs of a cord structure. Leafs do carry fragments of text.
The default implementation uses Go strings.
type MaterializedMetric ¶
type MaterializedMetric interface { Metric Leafs(MetricValue, bool) []Leaf }
MaterializedMetric is a type for metrics (please refer to interface Metric) that build a concrete cord tree for a text-cord.
A materialized metric does metric calculations exactly as a simple metric. However, it additionally supports building up a tree from atomic leafs containing metric values.
There are (at least) two ways to go about building a metric tree: one to preserve a homomorph of the text fragments, essentially materializing a catamorphism, or we can re-align the leaf boundaries with the continuity-boundaries of the metric. We'll go with the latter and build up a tree which will the be somewhat decoupled from the text cord.
type Metric ¶
type Metric interface { Apply(frag []byte) MetricValue Combine(leftSibling, rightSibling MetricValue, metric Metric) MetricValue }
Metric is a metric to calculate on a cord. Sometimes it's helpful to find information about a (large) text by collecting metrics from fragments and assembling them. Cords naturally break up texts into smaller fragments, letting us calculate metrics by applying them to (a subset of) fragments and propagate them upwards the nodes of the cord tree.
An example of a (very simplistic) metric would be to count the number of bytes in a text. The total count is calculated by counting the bytes in every fragment and adding up intermediate sums while travelling upwards through the cord's tree.
Metric is a simple interface whith a metric function that will be applied to portions of text. Clients will have no control over size or boundaries of the fragment. Applying metrics to different fragments may be done concurrently by the calculation driver, therefore it is illegal to hold unguarded global state in a metric.
Combine requires the Metric to calculate a metric value “sum” (monoid) from two metric values. This way the metric will bubble up metric values to the root of the cord tree and therewith result in a single overall metric value for a text. Combine must be a monoid over cords.MetricValue, with a neutral element n of Apply = f("") → n, i.e. the metric value of the empty string.
However, for materialized metrics it is a bit different from plain metrics: they resemble a free monoid. This is reflected by the result of materialized metrics, which is a list of spans (organized through a cord-tree). As a corollary, Combine has an additional task for materialized metrics than it has for plain metrics. Combine has to focus on the bytes *between* the already recognized spans of both the left and right sibling, and be able to convert them to cord leafs.
type MetricValue ¶
type MetricValue interface { Len() int // summed up length of text fragments Unprocessed() ([]byte, []byte) // unprocessed bytes at either end }
MetricValue is a type returned by applying a metric to text fragments (see interface Metric). It holds information about the added length of the text fragments which this value has been calulated for, and slices of bytes at either end of the accumulated fragments which have to be reprocessed.
Fragments of text are presented to the metric function as slices of bytes, without regard to rune-, grapheme- or line-boundaries. If we want to calculate information about, say, the maximum line length in a text, we'd have to count the graphemes of fragments. Graphemes will consist, however, of an unknown number of bytes and code points, which may only be identified by reading them at the grapheme start character. If a fragment is cut in the middle of a grapheme, the metric at the first bytes of a fragment cannot reliably calculated. Therefore metrics will be calculated on substrings of fragments where conditions allow a metric application, and any unprocessed bytes at the left and right boundary will be marked for reprocessing.
When propagating metric values up the tree nodes, metric value of the left and right child node of a cord tree node will have to be combined. The combination must be able to reprocess any unprocessed bytes.
--- denotes unprocessed bytes === denotes bytes already processed by the metric (a) |-----========--| & |----==============| combine two sibling fragments (b) |-----======== ------ ==============| reprocess 6 bytes in between (c) |-----============================| combined intermediate fragment or
For an approachable discussion please refer to Raph Levien's “Rope Science” series (https://xi-editor.io/docs/rope_science_01.html), or—less approachable—read up on catamorphism.
Calculating metric values is a topic where implementation characteristics of cords get somewhat visible for clients of the API. This is unfortunate, but may be mitigated by helper types provided by this package. Clients usually will either create metrics on top of pre-defined basic metrics or they may embed the MetricValueBase helper type in their MetricValues.
func ApplyMetric ¶
func ApplyMetric(cord Cord, i, j uint64, metric Metric) (MetricValue, error)
ApplyMetric applies a metric calculation on a (section of a) text.
i and j are text positions with Go slice semantics. If [i, j) does not specify a valid slice of the text, ErrIndexOutOfBounds will be returned.
type StringLeaf ¶
type StringLeaf string
StringLeaf is the default implementation of type Leaf. Calls to cords.FromString(…) will produce a cord with leafs of type StringLeaf.
StringLeaf is made public, because it may be of use for other constructors of cords.
func (StringLeaf) Split ¶
func (lstr StringLeaf) Split(i uint64) (Leaf, Leaf)
Split splits a leaf at position i, resulting in 2 new leafs.
func (StringLeaf) String ¶
func (lstr StringLeaf) String() string
func (StringLeaf) Substring ¶
func (lstr StringLeaf) Substring(i, j uint64) []byte
Substring returns a string segment of the leaf's text fragment.
func (StringLeaf) Weight ¶
func (lstr StringLeaf) Weight() uint64
Weight of a leaf is its string length in bytes.
Directories
¶
Path | Synopsis |
---|---|
Package metrics provides some pre-manufactured metrics on texts.
|
Package metrics provides some pre-manufactured metrics on texts. |
Package styled makes styled text.
|
Package styled makes styled text. |
formatter
Package formatter formats styled text on output devices with fixed-width fonts.
|
Package formatter formats styled text on output devices with fixed-width fonts. |
inline
Package inline styles inline text such as HTML-spans or console-output.
|
Package inline styles inline text such as HTML-spans or console-output. |
itemized
Package itemized helps itemizing paragraphs and prepare them for formatting.
|
Package itemized helps itemizing paragraphs and prepare them for formatting. |
Package textfile provides a convenient and efficient API for handling content of large texts as strings.
|
Package textfile provides a convenient and efficient API for handling content of large texts as strings. |