jargon

package module
v1.0.9 Latest
Published: Jun 12, 2022 License: MIT Imports: 7 Imported by: 2

README

Jargon

Jargon is a text pipeline, focused on recognizing variations on canonical and synonymous terms.

For example, jargon lemmatizes react, React.js, React JS and REACTJS to a canonical reactjs.
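
As a rough sketch in code (imports omitted; this assumes the Stack Overflow tag filter recognizes these variants):

stream := jargon.TokenizeString("React.js and REACTJS").Filter(stackoverflow.Tags)
s, err := stream.String()
if err != nil {
	log.Fatal(err)
}
fmt.Println(s) // roughly "reactjs and reactjs"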

Install

If you have a Go installation:

go install github.com/clipperhouse/jargon/cmd/jargon

If you’re on a Mac and have Homebrew:

brew install clipperhouse/tap/jargon

There are binaries for Mac, Windows, and Linux on the releases page.

To display usage, simply type:

jargon

Example:

curl -s https://en.wikipedia.org/wiki/Computer_programming | jargon -html -stack -lemmas -lines

CLI usage and details...

In your code

See GoDoc. Example:

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
	"github.com/clipperhouse/jargon/filters/stackoverflow"
)

func main() {
	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	stream := jargon.TokenizeString(text).Filter(stackoverflow.Tags)

	// Loop while Scan() returns true. Scan() will return false on error or end of tokens.
	for stream.Scan() {
		token := stream.Token()
		// Do stuff with token
		fmt.Print(token)
	}

	if err := stream.Err(); err != nil {
		// Because the source is I/O, errors are possible
		log.Fatal(err)
	}

	// As an iterator, a token stream is 'forward-only'; once you consume a token, you can't go back.

	// See also the convenience methods String, ToSlice, WriteTo
}

Token filters

Canonical terms (lemmas) are looked up in token filters. Several are available:

Stack Overflow technology tags

  • Ruby on Rails → ruby-on-rails
  • ObjC → objective-c

Contractions

  • Couldn’t → Could not

ASCII fold

  • café → cafe

Stem

  • Manager|management|manages → manag

To implement your own, see the Filter type.
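
For example, a custom filter can be a plain function that wraps the incoming stream. This is an illustrative sketch (dropSpaces is a made-up name, not a built-in filter; imports omitted):

// dropSpaces is a hypothetical custom filter that removes whitespace tokens.
var dropSpaces jargon.Filter = func(incoming *jargon.TokenStream) *jargon.TokenStream {
	return incoming.Where(func(t *jargon.Token) bool {
		return !t.IsSpace()
	})
}

// Use it like any other filter:
stream := jargon.TokenizeString("Ruby on Rails").Filter(stackoverflow.Tags, dropSpaces)
for stream.Scan() {
	fmt.Println(stream.Token())
}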

Performance

jargon is designed to work in constant memory, regardless of input size. It buffers input and streams tokens.

Execution time is designed to be O(n) in input size. It is I/O-bound. In your code, you control I/O (and its performance implications) via the Reader you pass to Tokenize.
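
For example, a large file can be streamed through Tokenize without loading it all into memory; a sketch (the file name is a placeholder, imports omitted):

f, err := os.Open("big-corpus.txt") // hypothetical input file
if err != nil {
	log.Fatal(err)
}
defer f.Close()

stream := jargon.Tokenize(f).Filter(stackoverflow.Tags)
count, err := stream.Count() // consumes the stream token by token, in constant memory
if err != nil {
	log.Fatal(err)
}
fmt.Println(count, "tokens")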

Tokenizer

Jargon includes a tokenizer based partially on Unicode text segmentation. It’s good for many common cases.

It preserves all tokens verbatim, including whitespace and punctuation, so the original text can be reconstructed with fidelity (“round tripped”).
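
A quick sketch of that round-trip property, using the TokenStream String method:

original := `Let’s talk about Ruby on Rails.`
roundTripped, err := jargon.TokenizeString(original).String()
if err != nil {
	log.Fatal(err)
}
fmt.Println(roundTripped == original) // true: whitespace and punctuation are preserved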

Background

When dealing with technical terms in text – say, a job listing or a resume – it’s easy to use different words for the same thing. This is acute for things like “react” where it’s not obvious what the canonical term is. Is it React or reactjs or react.js?

This presents a problem when searching for such terms. We know the above terms are synonymous but databases don’t.

A further problem is that some n-grams should be understood as a single term. We know that “Objective C” represents one technology, but databases naively see two words.

What’s it for?

  • Recognition of domain terms in text
  • NLP on unstructured data, where we want a consistent vocabulary for statistical analysis.
  • Search applications, where a query for “Ruby on Rails” is understood as a single entity rather than three unrelated words, and where “React”, “reactjs” and “react.js” are handled synonymously.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Filter added in v0.9.6

type Filter func(*TokenStream) *TokenStream

Filter processes a stream of tokens

type Token

type Token struct {
	// contains filtered or unexported fields
}

Token represents a piece of text with metadata.

func NewToken added in v0.9.6

func NewToken(s string, isLemma bool) *Token

NewToken creates a new token, and calculates whether the token is space or punct.

func (*Token) IsLemma

func (t *Token) IsLemma() bool

IsLemma indicates that the token is a lemma, i.e., a canonical term that replaced original token(s).

func (*Token) IsPunct

func (t *Token) IsPunct() bool

IsPunct indicates that the token should be considered 'breaking' of a run of words. Mostly uses Unicode's definition of punctuation, with some exceptions for our purposes.

func (*Token) IsSpace

func (t *Token) IsSpace() bool

IsSpace indicates that the token consists entirely of white space, as defined by the unicode package.

A token can be both IsPunct and IsSpace -- for example, line breaks and tabs are punctuation for our purposes.

func (*Token) String

func (t *Token) String() string

String is the string value of the token

type TokenStream added in v0.9.7

type TokenStream struct {
	// contains filtered or unexported fields
}

TokenStream represents an 'iterator' of Token, the result of a call to Tokenize or Filter. Call Next() until it returns nil.

func NewTokenStream added in v0.9.7

func NewTokenStream(next func() (*Token, error)) *TokenStream

NewTokenStream creates a new TokenStream
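
For example, a stream can be constructed over an in-memory slice of tokens; an illustrative sketch (imports omitted):

tokens := []*jargon.Token{
	jargon.NewToken("hello", false),
	jargon.NewToken(" ", false),
	jargon.NewToken("world", false),
}
i := 0
stream := jargon.NewTokenStream(func() (*jargon.Token, error) {
	if i >= len(tokens) {
		return nil, nil // a nil token signals that the stream is exhausted
	}
	t := tokens[i]
	i++
	return t, nil
})

count, err := stream.Count()
if err != nil {
	log.Fatal(err)
}
fmt.Println(count) // 3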

func Tokenize

func Tokenize(r io.Reader) *TokenStream

Tokenize tokenizes a reader into a stream of tokens. Iterate through the stream by calling Scan() or Next().

It uses several specs from Unicode Text Segmentation https://unicode.org/reports/tr29/. It's not a full implementation, but a decent approximation for many mainstream cases.

Tokenize returns all tokens (including white space), so text can be reconstructed with fidelity.

Example
package main

import (
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
)

func main() {
	// Tokenize takes an io.Reader
	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	r := strings.NewReader(text)

	tokens := jargon.Tokenize(r)

	// Tokenize returns a TokenStream. Iterate by calling Next() until nil, which
	// indicates that the iterator is exhausted.
	for {
		token, err := tokens.Next()
		if err != nil {
			// Because the source is I/O, errors are possible
			log.Fatal(err)
		}
		if token == nil {
			break
		}

		// Do stuff with token
	}

	// The TokenStream is lazily evaluated; it does the tokenization work as you call Next.
	// This is done to ensure predictable memory usage and performance. It is
	// 'forward-only', which means that once you consume a token, you can't go back.

	// Usually, Tokenize serves as input to Lemmatize
}
Output:

func TokenizeHTML

func TokenizeHTML(r io.Reader) *TokenStream

TokenizeHTML tokenizes HTML. Text nodes are tokenized using jargon.Tokenize; everything else (tags, comments) is left verbatim. It returns a TokenStream, intended to be iterated over by calling Next(), until nil. It returns all tokens (including white space), so text can be reconstructed with fidelity. Ignoring (say) whitespace is a decision for the caller.
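
A sketch of tokenizing an HTML fragment (imports omitted):

html := `<p>Ruby on Rails and <b>ASPNET MVC</b></p>`
stream := jargon.TokenizeHTML(strings.NewReader(html)).Filter(stackoverflow.Tags)
for stream.Scan() {
	fmt.Print(stream.Token()) // tags pass through verbatim; text nodes are tokenized and filtered
}
if err := stream.Err(); err != nil {
	log.Fatal(err)
}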

func TokenizeString added in v0.9.6

func TokenizeString(s string) *TokenStream

TokenizeString tokenizes a string into a stream of tokens. Iterate through the stream by calling Scan() or Next().

It returns all tokens (including white space), so text can be reconstructed with fidelity ("round tripped").

func (*TokenStream) Count added in v0.9.7

func (stream *TokenStream) Count() (int, error)

Count counts all tokens. Note that it will consume all tokens, so you will not be able to iterate further after making this call.
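
For instance, a sketch counting the tokens in a short string:

count, err := jargon.TokenizeString("Ruby on Rails").Count()
if err != nil {
	log.Fatal(err)
}
fmt.Println(count) // counts every token, including the whitespace tokens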

func (*TokenStream) Distinct added in v0.9.17

func (stream *TokenStream) Distinct() *TokenStream

Distinct returns one token for each distinct value (string); repeated values are dropped
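
A sketch, deduplicating repeated words:

stream := jargon.TokenizeString("go go gadget go").Words().Distinct()
count, err := stream.Count()
if err != nil {
	log.Fatal(err)
}
fmt.Println(count) // 2 distinct words: "go" and "gadget"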

func (*TokenStream) Err added in v0.9.7

func (stream *TokenStream) Err() error

Err returns the current error in the stream, after calling Scan

func (*TokenStream) Filter added in v0.9.7

func (stream *TokenStream) Filter(filters ...Filter) *TokenStream

Filter applies one or more filters to a token stream

Example
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
	"github.com/clipperhouse/jargon/filters/stackoverflow"
)

func main() {
	// Filter takes a token stream and attempts to find canonical versions of its terms

	// It applies one or more token filters, here the Stack Overflow tags filter
	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	r := strings.NewReader(text)

	tokens := jargon.Tokenize(r)
	filtered := tokens.Filter(stackoverflow.Tags)

	// Filter returns a TokenStream. Iterate by calling Next() until nil, which
	// indicates that the iterator is exhausted.
	for {
		token, err := filtered.Next()
		if err != nil {
			// Because the source is I/O, errors are possible
			log.Fatal(err)
		}
		if token == nil {
			break
		}

		// Do stuff with token
		if token.IsLemma() {
			fmt.Printf("found lemma: %s", token)
		}
	}
}
Output:

func (*TokenStream) Lemmas added in v0.9.7

func (stream *TokenStream) Lemmas() *TokenStream

Lemmas returns only tokens which have been 'lemmatized', or in some way modified by a token filter
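
A sketch, keeping only the lemmatized tokens:

stream := jargon.TokenizeString("Ruby on Rails").Filter(stackoverflow.Tags).Lemmas()
for stream.Scan() {
	fmt.Println(stream.Token()) // only tokens a filter replaced, e.g. "ruby-on-rails"
}
if err := stream.Err(); err != nil {
	log.Fatal(err)
}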

func (*TokenStream) Next added in v0.9.7

func (stream *TokenStream) Next() (*Token, error)

Next returns the next Token. If nil, the iterator is exhausted. Because it depends on I/O, callers should check errors.

Example
package main

import (
	"log"
	"strings"

	"github.com/clipperhouse/jargon"
)

func main() {
	// TokenStream is an iterator resulting from a call to Tokenize or Filter

	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	r := strings.NewReader(text)
	tokens := jargon.Tokenize(r)

	// Iterate by calling Next() until nil, which indicates that the iterator is exhausted.
	for {
		token, err := tokens.Next()
		if err != nil {
			// Because the source is I/O, errors are possible
			log.Fatal(err)
		}
		if token == nil {
			break
		}

		// Do stuff with token
	}

	// As an iterator, TokenStream is 'forward-only', which means that
	// once you consume a token, you can't go back.

	// See also the convenience methods String, ToSlice, WriteTo
}
Output:

func (*TokenStream) Scan added in v0.9.7

func (stream *TokenStream) Scan() bool

Scan retrieves the next token and returns true if successful. The resulting token can be retrieved using the Token() method. Scan returns false at EOF or on error. Be sure to check the Err() method.

for stream.Scan() {
	token := stream.Token()
	// do stuff with token
}
if err := stream.Err(); err != nil {
	// do something with err
}
Example
package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
	"github.com/clipperhouse/jargon/filters/stackoverflow"
)

func main() {
	// TokenStream is an iterator resulting from a call to Tokenize or Filter

	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	stream := jargon.TokenizeString(text).Filter(stackoverflow.Tags)

	// Loop while Scan() returns true. Scan() will return false on error or end of tokens.
	for stream.Scan() {
		token := stream.Token()
		// Do stuff with token
		fmt.Print(token)
	}

	if err := stream.Err(); err != nil {
		// Because the source is I/O, errors are possible
		log.Fatal(err)
	}

	// As an iterator, TokenStream is 'forward-only', which means that
	// once you consume a token, you can't go back.

	// See also the convenience methods String, ToSlice, WriteTo
}
Output:

func (*TokenStream) String added in v0.9.7

func (stream *TokenStream) String() (string, error)

func (*TokenStream) ToSlice added in v0.9.7

func (stream *TokenStream) ToSlice() ([]*Token, error)

ToSlice converts the Tokens iterator into a slice (array). Calling ToSlice will exhaust the iterator. For big files, putting everything into an array may cause memory pressure.
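
A sketch:

tokens, err := jargon.TokenizeString("Ruby on Rails").ToSlice()
if err != nil {
	log.Fatal(err)
}
for _, token := range tokens {
	fmt.Print(token)
}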

func (*TokenStream) Token added in v0.9.7

func (stream *TokenStream) Token() *Token

Token returns the current Token in the stream, after calling Scan

func (*TokenStream) Where added in v0.9.7

func (stream *TokenStream) Where(predicate func(*Token) bool) *TokenStream

Where filters a stream of Tokens, keeping only those that match a predicate
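
For example, keeping only non-punctuation tokens (a sketch; Words offers similar behavior):

stream := jargon.TokenizeString("Hello, world!").Where(func(t *jargon.Token) bool {
	return !t.IsPunct()
})
for stream.Scan() {
	fmt.Print(stream.Token())
}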

func (*TokenStream) Words added in v0.9.7

func (stream *TokenStream) Words() *TokenStream

Words returns only non-punctuation, non-space tokens
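
A sketch:

stream := jargon.TokenizeString("Hello, world!").Words()
for stream.Scan() {
	fmt.Println(stream.Token()) // "Hello", then "world"; spaces and punctuation are dropped
}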

func (*TokenStream) WriteTo added in v0.9.7

func (stream *TokenStream) WriteTo(w io.Writer) (int64, error)

WriteTo writes all token string values to w
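
A sketch, writing the filtered stream to stdout (imports omitted):

stream := jargon.TokenizeString("Ruby on Rails").Filter(stackoverflow.Tags)
if _, err := stream.WriteTo(os.Stdout); err != nil {
	log.Fatal(err)
}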

Directories

Path	Synopsis
cmd
filters
ascii	Package ascii folds Unicode characters to their ASCII equivalents where possible.
contractions	Package contractions provides a filter to expand English contractions, such as "don't" → "does not", for use with jargon
mapper	Package mapper provides a convenience builder for filters that map inputs to outputs, one-to-one
nba
stackoverflow	Package stackoverflow provides a filter for identifying technical terms in jargon
stemmer	Package stemmer offers the Snowball stemmer in several languages
stopwords	Package stopwords allows omission of words from a token stream
synonyms	Package synonyms provides a builder for filtering and replacing synonyms in a token stream
twitter	Package twitter provides filters to identify Twitter-style @handles and #hashtags, and coalesce them into single tokens
A demo of jargon for use on Google App Engine
