tokenizer

package v0.1.3
Published: Jul 16, 2015 License: MIT Imports: 6 Imported by: 0

README

tokenizer -- A text tokenizer and sentence assembler

tokenizer is a small Go package for:

  • splitting input text into quasi-atomic tokens,
  • assembling sentences from those tokens, and
  • annotating sequences of one or more consecutive tokens as words or phrases.

Rather than split the input text into sentences first and tokenize each sentence next, tokenizer tokenizes the text first and assembles sentences from the resulting tokens. For RxnMiner (the project containing this package), which processes technical text, the conventional approach (followed by most leading NLP engines) produced too many incorrect sentence breaks, leading to mis-applied annotations downstream. Hence this inverted design.

tokenizer is rule-based.

Installation

Preferred:

go get -u 'github.com/RxnWeaver/RxnMiner/tokenizer'
cd $GOPATH/src/github.com/RxnWeaver/RxnMiner
git checkout <tag>
go test -v ./...
go install ./...

where <tag> represents the most-recently tagged release.

For the adventurous:

go get -u 'github.com/RxnWeaver/RxnMiner/tokenizer'

Status

This package is already in use, and is consequently reasonably battle-tested. The repository includes tests covering over 7,000 real-life input texts.

Abbreviation handling is both English-centric and very limited. This is likely to improve in the future.

See open issues on GitHub for currently known issues and corner cases.

Usage

In most cases, instantiating a document is a good place to start. Here is a trivial example.

doc, err := tokenizer.NewDocument("MyDoc-1")
if err != nil {
    return err
}
doc.SetInput("Section-1", someText)
doc.Tokenize()
doc.AssembleSentences()

toks := doc.SectionTokens("Section-1")
for _, tok := range toks {
    fmt.Printf("%v\n", tok)
}

sents := doc.SectionSentences("Section-1")
for _, sent := range sents {
    fmt.Printf("%v\n", sent)
}

The tokens obtained by splitting the input text can, of course, be used for purposes other than sentence assembly.

Refer to the tests for a few more interesting usages, and for examples of applying annotations and extracting the resulting words and phrases.
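
As a quick sketch of that annotation flow (the offsets, the word and the entity label below are invented for illustration; the column order follows the documentation of `NewAnnotation`):

ann, err := tokenizer.NewAnnotation("MyDoc-1\tSection-1\t0\t7\taspirin\tCHEMICAL")
if err != nil {
    return err
}
// "CLS" records a class/category annotation; "POS" and "LEM" are
// the other documented kinds.
if err = doc.Annotate(ann, "CLS"); err != nil {
    return err
}

for _, w := range doc.SectionWords("Section-1") {
    fmt.Printf("%s : %s\n", w.Text(), w.Class())
}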

Documentation

Index

Constants

This section is empty.

Variables

var MayBeTermAbbrevs = map[string]struct{}{
	"etc": {},
}

MayBeTermAbbrevs lists the common abbreviations that could end with a full stop, possibly without ending the sentence. The abbrevs are in lowercase.

var MayBeTermGroupAbbrevs = map[string][]string{
	"e": {"i"},
	"g": {"e"},
}

MayBeTermGroupAbbrevs lists the common abbreviations that are compound, i.e. they involve more than one token. The table omits any intervening period. The abbrevs are in lowercase.

var NonTermAbbrevs = map[string]struct{}{
	"viz": {},
	"eg":  {},
	"ex":  {},
	"fig": {},

	"mr":   {},
	"ms":   {},
	"mrs":  {},
	"dr":   {},
	"prof": {},
}

NonTermAbbrevs lists the common abbreviations that could end with a full stop, but without ending the sentence. The abbrevs are in lowercase.
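
These tables drive the sentence assembler's decisions around full stops. A minimal sketch (the expected count assumes the non-terminating rule fires for "Dr." and "Prof."; errors are elided for brevity):

doc, _ := tokenizer.NewDocument("Abbrev-1")
doc.SetInput("S1", "Dr. Smith met Prof. Rao. They discussed the results.")
doc.Tokenize()
doc.AssembleSentences()

n, _ := doc.SectionSentenceCount("S1")
fmt.Println(n) // expected 2: the stops after "Dr" and "Prof" do not end sentences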

var TtDescriptions = map[TokenType]string{
	TokOther:        "TokOther",
	TokSpace:        "TokSpace",
	TokLetter:       "TokLetter",
	TokNumber:       "TokNumber",
	TokMayBeTerm:    "TokMayBeTerm",
	TokTerm:         "TokTerm",
	TokPause:        "TokPause",
	TokParenOpen:    "TokParenOpen",
	TokParenClose:   "TokParenClose",
	TokBracketOpen:  "TokBracketOpen",
	TokBracketClose: "TokBracketClose",
	TokBraceOpen:    "TokBraceOpen",
	TokBraceClose:   "TokBraceClose",
	TokSquote:       "TokSquote",
	TokDquote:       "TokDquote",
	TokIniQuote:     "TokIniQuote",
	TokFinQuote:     "TokFinQuote",
	TokPunct:        "TokPunct",
	TokSymbol:       "TokSymbol",
	TokMayBeWord:    "TokMayBeWord",
	TokWord:         "TokWord",
	TokSentence:     "TokSentence",
}

TtDescriptions helps in printing token types.

Functions

This section is empty.

Types

type Annotation

type Annotation struct {
	DocumentID string
	Section    string
	Begin      int
	End        int
	Entity     string
	Property   string
}

Annotation represents a curated annotation of a logical word in a text.

Each annotated word belongs to exactly one input document, and exactly one identified section within that (title, abstract, etc.). The annotation also holds information about a particular property of the word. Annotations are used for training the tools.

func NewAnnotation

func NewAnnotation(in string) (*Annotation, error)

NewAnnotation creates and initialises a new annotation for the given input word.

It expects its input to consist of six tab-separated columns. The order of the fields is:

  • document identifier,
  • section,
  • beginning index of the word in the input text,
  • corresponding ending index,
  • the word itself, and
  • entity type.

type Document

type Document struct {
	// contains filtered or unexported fields
}

Document represents the entirety of input text of one logical document -- usually a file.

It holds information about its sections, tokens in them, and the words and sentences that were recognised by other processors. In case the document has associated training annotations, it holds them as well.

func NewDocument

func NewDocument(id string) (*Document, error)

NewDocument creates and initialises a document with the given identifier.

It holds information about its sections. It also holds information about their constituent tokens, words, sentences and annotations.

func NewTechnicalDocument

func NewTechnicalDocument(id string) (*Document, error)

NewTechnicalDocument creates and initialises a document of technical nature, with the given identifier.

It holds information about its sections. It also holds information about their constituent tokens, words, sentences and annotations.

func (*Document) Annotate

func (d *Document) Annotate(a *Annotation, what string) error

Annotate records the given annotation against the applicable sequence of tokens in the appropriate section of the document.

It creates or updates a `Word` corresponding to the text in the annotation. The annotation can be for one of: (a) part of speech ("POS"), (b) lemma ("LEM") or (c) class/category ("CLS").

func (*Document) AssembleSentences added in v0.1.1

func (d *Document) AssembleSentences()

AssembleSentences builds sentences from the text tokens obtained by tokenizing the sections of the document.

func (*Document) Input

func (d *Document) Input(sec string) (string, error)

Input answers the registered input text of the given section, if one exists.

func (*Document) SectionAnnotationCount

func (d *Document) SectionAnnotationCount(sec string) (int, error)

SectionAnnotationCount answers the number of registered annotations in the given section.

func (*Document) SectionAnnotations added in v0.1.1

func (d *Document) SectionAnnotations(sec string) []*Annotation

SectionAnnotations answers registered annotations for the given section.

func (*Document) SectionSentenceCount added in v0.1.1

func (d *Document) SectionSentenceCount(sec string) (int, error)

SectionSentenceCount answers the number of assembled sentences in the given section.

func (*Document) SectionSentences added in v0.1.1

func (d *Document) SectionSentences(sec string) []*Sentence

SectionSentences answers assembled sentences in the given section.

func (*Document) SectionTokenCount

func (d *Document) SectionTokenCount(sec string) (int, error)

SectionTokenCount answers the number of recognised tokens in the given section.

func (*Document) SectionTokens added in v0.1.1

func (d *Document) SectionTokens(sec string) []*TextToken

SectionTokens answers recognised tokens in the given section.

func (*Document) SectionWordCount

func (d *Document) SectionWordCount(sec string) (int, error)

SectionWordCount answers the number of recognised words in the given section.

func (*Document) SectionWords added in v0.1.1

func (d *Document) SectionWords(sec string) []*Word

SectionWords answers recognised words in the given section.

func (*Document) SetInput

func (d *Document) SetInput(sec, input string) error

SetInput registers the input text of the given section of the document.

func (*Document) Tokenize

func (d *Document) Tokenize()

Tokenize breaks the text in the various sections of the document into quasi-atomic tokens.

These tokens can be matched against any available annotations. They can also be combined into logical words for named entity recognition and part of speech recognition purposes.

type Sentence

type Sentence struct {
	// contains filtered or unexported fields
}

Sentence represents a logical sentence.

It holds information about its text, its offsets and its constituent text tokens.

func (*Sentence) Begin

func (s *Sentence) Begin() int

func (*Sentence) BeginToken

func (s *Sentence) BeginToken() int

func (*Sentence) End

func (s *Sentence) End() int

func (*Sentence) EndToken

func (s *Sentence) EndToken() int

func (*Sentence) Text

func (s *Sentence) Text() string

func (*Sentence) Type

func (s *Sentence) Type() TokenType

type SentenceIterator

type SentenceIterator struct {
	// contains filtered or unexported fields
}

SentenceIterator helps in assembling consecutive sentences from the underlying text tokens.

func NewSentenceIterator

func NewSentenceIterator(toks []*TextToken) *SentenceIterator

NewSentenceIterator creates and initialises a sentence iterator over the given text tokens.

func NewTechnicalSentenceIterator

func NewTechnicalSentenceIterator(toks []*TextToken) *SentenceIterator

NewTechnicalSentenceIterator creates and initialises a sentence iterator in technical mode, over the given text tokens.

func (*SentenceIterator) Item

func (si *SentenceIterator) Item() *Sentence

Item answers the current sentence. This has no side effects, and can be invoked any number of times.

func (*SentenceIterator) MoveNext

func (si *SentenceIterator) MoveNext() error

MoveNext assembles the next sentence from the given input tokens.

It begins with the current running token index (which could be at the beginning of the input slice of tokens), and continues until it can logically complete a sentence. Should it not be able to complete one such, it treats all remaining input tokens as constituting a single sentence.

The return value is either `nil` (more sentences may be available) or `io.EOF` (no more sentences).
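
A typical consumption loop looks like the following sketch, which assumes the tokens come from a prior tokenization pass and that `Item` answers the sentence assembled by the preceding successful `MoveNext`:

si := tokenizer.NewSentenceIterator(toks)
for si.MoveNext() == nil {
    sent := si.Item()
    fmt.Println(sent.Text())
}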

type TextToken

type TextToken struct {
	// contains filtered or unexported fields
}

TextToken represents a piece of text extracted from a larger input. It holds information regarding its beginning and ending offsets in the input text. A text token may span the entire input.

func (*TextToken) Begin

func (tt *TextToken) Begin() int

func (*TextToken) End

func (tt *TextToken) End() int

func (*TextToken) Text

func (tt *TextToken) Text() string

func (*TextToken) Type

func (tt *TextToken) Type() TokenType

type TextTokenIterator

type TextTokenIterator struct {
	// contains filtered or unexported fields
}

TextTokenIterator helps in retrieving consecutive text tokens from an input text.

func NewTextTokenIterator

func NewTextTokenIterator(input string) *TextTokenIterator

NewTextTokenIterator creates and initialises a token iterator over the given input text.

func NewTextTokenIteratorWithOffset

func NewTextTokenIteratorWithOffset(input string, n int) *TextTokenIterator

NewTextTokenIteratorWithOffset creates and initialises a token iterator over the given input text.

It treats the given offset - rather than 0 - as the starting index from which to track all subsequent indices.

func (*TextTokenIterator) Item

func (ti *TextTokenIterator) Item() *TextToken

Item answers the current token. This has no side effects, and can be invoked any number of times.

func (*TextTokenIterator) MoveNext

func (ti *TextTokenIterator) MoveNext() error

MoveNext detects the next token in the input, should one be available.

It begins with the current running byte offset (which could be the beginning of the input string), and continues until it can logically break on a token terminator. Should it not be able to find one such, it treats all remaining runes in the input string as constituting a single token.

The return value is either `nil` (more tokens may be available) or `io.EOF` (no more tokens).
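
The token iterator is consumed the same way. A sketch over a small input (the exact token boundaries produced are up to the rules, hence not shown):

ti := tokenizer.NewTextTokenIterator("2,4-dinitrophenyl hydrazine")
for ti.MoveNext() == nil {
    tok := ti.Item()
    fmt.Printf("%d-%d %q\n", tok.Begin(), tok.End(), tok.Text())
}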

type Token

type Token interface {
	Text() string
	Begin() int
	End() int
	Type() TokenType
}

Token represents a piece of text extracted from a larger input. It holds information regarding its beginning and ending offsets in the input text. A token may span the entire input.

type TokenType

type TokenType byte

TokenType represents types that a token can have. The granularity of a token is variable: character, smallest logical unit, word, sentence, etc. Accordingly, the corresponding tokens use appropriate token types.

const (
	TokOther TokenType = iota

	TokSpace
	TokLetter
	TokNumber
	TokMayBeTerm
	TokTerm
	TokPause
	TokParenOpen
	TokParenClose
	TokBracketOpen
	TokBracketClose
	TokBraceOpen
	TokBraceClose
	TokSquote
	TokDquote
	TokIniQuote
	TokFinQuote
	TokPunct
	TokSymbol

	TokMayBeWord
	TokWord
	TokSentence
)

List of defined token types.

func RuneType added in v0.1.1

func RuneType(r rune) TokenType

RuneType answers the token type of the given rune.
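
Combined with `TtDescriptions`, this gives a quick way to inspect classifications; the outputs noted in the comment are assumptions, not documented guarantees:

for _, r := range "a9(." {
    fmt.Printf("%q : %s\n", r, tokenizer.TtDescriptions[tokenizer.RuneType(r)])
}
// presumably: 'a' → TokLetter, '9' → TokNumber,
// '(' → TokParenOpen, '.' → TokMayBeTerm or TokTerm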

type Word

type Word struct {
	// contains filtered or unexported fields
}

Word represents a token whose type is one of `TokMayBeWord` or `TokWord`, and qualifies it.

It holds information regarding the so-called IOB (Inside, Outside, Beginning) status of the token, its lemma form (in case of a word), its part of speech (in case of a word), etc.

func (*Word) Begin

func (w *Word) Begin() int

func (*Word) Class added in v0.1.1

func (w *Word) Class() string

func (*Word) End

func (w *Word) End() int

func (*Word) IOB

func (w *Word) IOB() byte

func (*Word) Lemma

func (w *Word) Lemma() string

func (*Word) POS

func (w *Word) POS() string

func (*Word) Text

func (w *Word) Text() string

func (*Word) Type

func (w *Word) Type() TokenType
