tokenizer

package v0.1.3
Published: Jul 16, 2015 License: MIT Imports: 6 Imported by: 0

README

tokenizer -- A text tokenizer and sentence assembler

tokenizer is a small Go package for:

  • splitting input text into quasi-atomic tokens,
  • assembling sentences from those tokens, and
  • annotating sequences of one or more consecutive tokens as words or phrases.

Rather than split the input text into sentences first and tokenize each sentence next, tokenizer tokenizes the text first and assembles sentences from the resulting tokens. For RxnMiner (the project containing this package), which processes technical text, the conventional approach (followed by most leading NLP engines) produced too many incorrect sentence breaks, leading to mis-applied annotations downstream. Hence this inverted design.

tokenizer is rule-based.

Installation

Preferred:

go get -u 'github.com/RxnWeaver/RxnMiner/tokenizer'
cd $GOPATH/src/github.com/RxnWeaver/RxnMiner
git checkout <tag>
go test -v ./...
go install ./...

where <tag> represents the most-recently tagged release.

For the adventurous:

go get -u 'github.com/RxnWeaver/RxnMiner/tokenizer'

Status

This package is already in use, and is consequently reasonably battle-tested. The repository includes tests covering over 7,000 real-life input texts.

Abbreviation handling is both English-centric and very limited. This is likely to improve in the future.

See open issues on GitHub for currently known issues and corner cases.

Usage

In most cases, instantiating a document is a good place to start. Here is a trivial example.

doc, err := tokenizer.NewDocument("MyDoc-1")
if err != nil {
    return err
}
doc.SetInput("Section-1", someText)
doc.Tokenize()
doc.AssembleSentences()

toks := doc.SectionTokens("Section-1")
for _, tok := range toks {
    fmt.Printf("%v\n", tok)
}

sents := doc.SectionSentences("Section-1")
for _, sent := range sents {
    fmt.Printf("%v\n", sent)
}

The tokens obtained by splitting the input text can, of course, be used for purposes other than sentence assembly.

Refer to the tests for a few more interesting usages, and for examples of applying annotations and extracting the resulting words and phrases.
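
As a quick sketch of that annotation flow (the offsets, the word and the entity label below are invented for illustration; the column order follows the documentation of `NewAnnotation`):

ann, err := tokenizer.NewAnnotation("MyDoc-1\tSection-1\t0\t7\taspirin\tCHEMICAL")
if err != nil {
    return err
}
// "CLS" records a class/category annotation; "POS" and "LEM" are
// the other documented kinds.
if err = doc.Annotate(ann, "CLS"); err != nil {
    return err
}

for _, w := range doc.SectionWords("Section-1") {
    fmt.Printf("%s : %s\n", w.Text(), w.Class())
}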

Documentation

Index

Constants

This section is empty.

Variables

var MayBeTermAbbrevs = map[string]struct{}{
	"etc": {},
}

MayBeTermAbbrevs lists the common abbreviations that could end with a full stop, possibly without ending the sentence. The abbrevs are in lowercase.

var MayBeTermGroupAbbrevs = map[string][]string{
	"e": {"i"},
	"g": {"e"},
}

MayBeTermGroupAbbrevs lists the common abbreviations that are compound, i.e. they involve more than one token. The table omits any intervening period. The abbrevs are in lowercase.

var NonTermAbbrevs = map[string]struct{}{
	"viz": {},
	"eg":  {},
	"ex":  {},
	"fig": {},

	"mr":   {},
	"ms":   {},
	"mrs":  {},
	"dr":   {},
	"prof": {},
}

NonTermAbbrevs lists the common abbreviations that could end with a full stop, but without ending the sentence. The abbrevs are in lowercase.
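
These tables drive the sentence assembler's decisions around full stops. A minimal sketch (the expected count assumes the non-terminating rule fires for "Dr." and "Prof."; errors are elided for brevity):

doc, _ := tokenizer.NewDocument("Abbrev-1")
doc.SetInput("S1", "Dr. Smith met Prof. Rao. They discussed the results.")
doc.Tokenize()
doc.AssembleSentences()

n, _ := doc.SectionSentenceCount("S1")
fmt.Println(n) // expected 2: the stops after "Dr" and "Prof" do not end sentences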

var TtDescriptions = map[TokenType]string{
	TokOther:        "TokOther",
	TokSpace:        "TokSpace",
	TokLetter:       "TokLetter",
	TokNumber:       "TokNumber",
	TokMayBeTerm:    "TokMayBeTerm",
	TokTerm:         "TokTerm",
	TokPause:        "TokPause",
	TokParenOpen:    "TokParenOpen",
	TokParenClose:   "TokParenClose",
	TokBracketOpen:  "TokBracketOpen",
	TokBracketClose: "TokBracketClose",
	TokBraceOpen:    "TokBraceOpen",
	TokBraceClose:   "TokBraceClose",
	TokSquote:       "TokSquote",
	TokDquote:       "TokDquote",
	TokIniQuote:     "TokIniQuote",
	TokFinQuote:     "TokFinQuote",
	TokPunct:        "TokPunct",
	TokSymbol:       "TokSymbol",
	TokMayBeWord:    "TokMayBeWord",
	TokWord:         "TokWord",
	TokSentence:     "TokSentence",
}

TtDescriptions helps in printing token types.

Functions

This section is empty.

Types

type Annotation

type Annotation struct {
	DocumentID string
	Section    string
	Begin      int
	End        int
	Entity     string
	Property   string
}

Annotation represents a curated annotation of a logical word in a text.

Each annotated word belongs to exactly one input document, and exactly one identified section within that (title, abstract, etc.). The annotation also holds information about a particular property of the word. Annotations are used for training the tools.

func NewAnnotation

func NewAnnotation(in string) (*Annotation, error)

NewAnnotation creates and initialises a new annotation for the given input word.

It expects its input to consist of six tab-separated columns. The order of the fields is:

  • document identifier,
  • section,
  • beginning index of the word in the input text,
  • corresponding ending index,
  • the word itself, and
  • entity type.

type Document

type Document struct {
	// contains filtered or unexported fields
}

Document represents the entirety of input text of one logical document -- usually a file.

It holds information about its sections, tokens in them, and the words and sentences that were recognised by other processors. In case the document has associated training annotations, it holds them as well.

func NewDocument

func NewDocument(id string) (*Document, error)

NewDocument creates and initialises a document with the given identifier.

It holds information about its sections. It also holds information about their constituent tokens, words, sentences and annotations.

func NewTechnicalDocument

func NewTechnicalDocument(id string) (*Document, error)

NewTechnicalDocument creates and initialises a document of technical nature, with the given identifier.

It holds information about its sections. It also holds information about their constituent tokens, words, sentences and annotations.

func (*Document) Annotate

func (d *Document) Annotate(a *Annotation, what string) error

Annotate records the given annotation against the applicable sequence of tokens in the appropriate section of the document.

It creates or updates a `Word` corresponding to the text in the annotation. The annotation can be for one of: (a) part of speech ("POS"), (b) lemma ("LEM") or (c) class/category ("CLS").

func (*Document) AssembleSentences added in v0.1.1

func (d *Document) AssembleSentences()

AssembleSentences builds sentences from the text tokens obtained by tokenizing the sections of the document.

func (*Document) Input

func (d *Document) Input(sec string) (string, error)

Input answers the registered input text of the given section, if one exists.

func (*Document) SectionAnnotationCount

func (d *Document) SectionAnnotationCount(sec string) (int, error)

SectionAnnotationCount answers the number of registered annotations in the given section.

func (*Document) SectionAnnotations added in v0.1.1

func (d *Document) SectionAnnotations(sec string) []*Annotation

SectionAnnotations answers registered annotations for the given section.

func (*Document) SectionSentenceCount added in v0.1.1

func (d *Document) SectionSentenceCount(sec string) (int, error)

SectionSentenceCount answers the number of assembled sentences in the given section.

func (*Document) SectionSentences added in v0.1.1

func (d *Document) SectionSentences(sec string) []*Sentence

SectionSentences answers assembled sentences in the given section.

func (*Document) SectionTokenCount

func (d *Document) SectionTokenCount(sec string) (int, error)

SectionTokenCount answers the number of recognised tokens in the given section.

func (*Document) SectionTokens added in v0.1.1

func (d *Document) SectionTokens(sec string) []*TextToken

SectionTokens answers recognised tokens in the given section.

func (*Document) SectionWordCount

func (d *Document) SectionWordCount(sec string) (int, error)

SectionWordCount answers the number of recognised words in the given section.

func (*Document) SectionWords added in v0.1.1

func (d *Document) SectionWords(sec string) []*Word

SectionWords answers recognised words in the given section.

func (*Document) SetInput

func (d *Document) SetInput(sec, input string) error

SetInput registers the input text of the given section of the document.

func (*Document) Tokenize

func (d *Document) Tokenize()

Tokenize breaks the text in the various sections of the document into quasi-atomic tokens.

These tokens can be matched against any available annotations. They can also be combined into logical words for named entity recognition and part of speech recognition purposes.

type Sentence

type Sentence struct {
	// contains filtered or unexported fields
}

Sentence represents a logical sentence.

It holds information about its text, its offsets and its constituent text tokens.

func (*Sentence) Begin

func (s *Sentence) Begin() int

func (*Sentence) BeginToken

func (s *Sentence) BeginToken() int

func (*Sentence) End

func (s *Sentence) End() int

func (*Sentence) EndToken

func (s *Sentence) EndToken() int

func (*Sentence) Text

func (s *Sentence) Text() string

func (*Sentence) Type

func (s *Sentence) Type() TokenType

type SentenceIterator

type SentenceIterator struct {
	// contains filtered or unexported fields
}

SentenceIterator helps in assembling consecutive sentences from the underlying text tokens.

func NewSentenceIterator

func NewSentenceIterator(toks []*TextToken) *SentenceIterator

NewSentenceIterator creates and initialises a sentence iterator over the given text tokens.

func NewTechnicalSentenceIterator

func NewTechnicalSentenceIterator(toks []*TextToken) *SentenceIterator

NewTechnicalSentenceIterator creates and initialises a sentence iterator in technical mode, over the given text tokens.

func (*SentenceIterator) Item

func (si *SentenceIterator) Item() *Sentence

Item answers the current sentence. This has no side effects, and can be invoked any number of times.

func (*SentenceIterator) MoveNext

func (si *SentenceIterator) MoveNext() error

MoveNext assembles the next sentence from the given input tokens.

It begins with the current running token index (which could be at the beginning of the input slice of tokens), and continues until it can logically complete a sentence. Should it not be able to complete one such, it treats all remaining input tokens as constituting a single sentence.

The return value is either `nil` (more sentences may be available) or `io.EOF` (no more sentences).
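
A typical consumption loop looks like the following sketch, which assumes the tokens come from a prior tokenization pass and that `Item` answers the sentence assembled by the preceding successful `MoveNext`:

si := tokenizer.NewSentenceIterator(toks)
for si.MoveNext() == nil {
    sent := si.Item()
    fmt.Println(sent.Text())
}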

type TextToken

type TextToken struct {
	// contains filtered or unexported fields
}

TextToken represents a piece of text extracted from a larger input. It holds information regarding its beginning and ending offsets in the input text. A text token may span the entire input.

func (*TextToken) Begin

func (tt *TextToken) Begin() int

func (*TextToken) End

func (tt *TextToken) End() int

func (*TextToken) Text

func (tt *TextToken) Text() string

func (*TextToken) Type

func (tt *TextToken) Type() TokenType

type TextTokenIterator

type TextTokenIterator struct {
	// contains filtered or unexported fields
}

TextTokenIterator helps in retrieving consecutive text tokens from an input text.

func NewTextTokenIterator

func NewTextTokenIterator(input string) *TextTokenIterator

NewTextTokenIterator creates and initialises a token iterator over the given input text.

func NewTextTokenIteratorWithOffset

func NewTextTokenIteratorWithOffset(input string, n int) *TextTokenIterator

NewTextTokenIteratorWithOffset creates and initialises a token iterator over the given input text.

It treats the given offset - rather than 0 - as the starting index from which to track all subsequent indices.

func (*TextTokenIterator) Item

func (ti *TextTokenIterator) Item() *TextToken

Item answers the current token. This has no side effects, and can be invoked any number of times.

func (*TextTokenIterator) MoveNext

func (ti *TextTokenIterator) MoveNext() error

MoveNext detects the next token in the input, should one be available.

It begins with the current running byte offset (which could be the beginning of the input string), and continues until it can logically break on a token terminator. Should it not be able to find one such, it treats all remaining runes in the input string as constituting a single token.

The return value is either `nil` (more tokens may be available) or `io.EOF` (no more tokens).
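
The token iterator is consumed the same way. A sketch over a small input (the exact token boundaries produced are up to the rules, hence not shown):

ti := tokenizer.NewTextTokenIterator("2,4-dinitrophenyl hydrazine")
for ti.MoveNext() == nil {
    tok := ti.Item()
    fmt.Printf("%d-%d %q\n", tok.Begin(), tok.End(), tok.Text())
}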

type Token

type Token interface {
	Text() string
	Begin() int
	End() int
	Type() TokenType
}

Token represents a piece of text extracted from a larger input. It holds information regarding its beginning and ending offsets in the input text. A token may span the entire input.

type TokenType

type TokenType byte

TokenType represents types that a token can have. The granularity of a token is variable: character, smallest logical unit, word, sentence, etc. Accordingly, the corresponding tokens use appropriate token types.

const (
	TokOther TokenType = iota

	TokSpace
	TokLetter
	TokNumber
	TokMayBeTerm
	TokTerm
	TokPause
	TokParenOpen
	TokParenClose
	TokBracketOpen
	TokBracketClose
	TokBraceOpen
	TokBraceClose
	TokSquote
	TokDquote
	TokIniQuote
	TokFinQuote
	TokPunct
	TokSymbol

	TokMayBeWord
	TokWord
	TokSentence
)

List of defined token types.

func RuneType added in v0.1.1

func RuneType(r rune) TokenType

RuneType answers the token type of the given rune.
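
Combined with `TtDescriptions`, this gives a quick way to inspect classifications; the outputs noted in the comment are assumptions, not documented guarantees:

for _, r := range "a9(." {
    fmt.Printf("%q : %s\n", r, tokenizer.TtDescriptions[tokenizer.RuneType(r)])
}
// presumably: 'a' → TokLetter, '9' → TokNumber,
// '(' → TokParenOpen, '.' → TokMayBeTerm or TokTerm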

type Word

type Word struct {
	// contains filtered or unexported fields
}

Word represents a token whose type is one of `TokMayBeWord` or `TokWord`, and qualifies it.

It holds information regarding the so-called IOB (Inside, Outside, Beginning) status of the token, its lemma form (in case of a word), its part of speech (in case of a word), etc.

func (*Word) Begin

func (w *Word) Begin() int

func (*Word) Class added in v0.1.1

func (w *Word) Class() string

func (*Word) End

func (w *Word) End() int

func (*Word) IOB

func (w *Word) IOB() byte

func (*Word) Lemma

func (w *Word) Lemma() string

func (*Word) POS

func (w *Word) POS() string

func (*Word) Text

func (w *Word) Text() string

func (*Word) Type

func (w *Word) Type() TokenType
