tokens

v0.0.0-...-465a484 Published: Mar 5, 2015 License: MIT Imports: 2 Imported by: 0

Repository

github.com/nyxtom/tokens

README ¶

tokens

Tokens is a simple NLP utility, written in Go, for tokenizing strings using common regular expressions for whitespace, words, emoticons, URLs, and more.

Installation

$ go get github.com/nyxtom/tokens

Example

package main

import (
	"fmt"
	"github.com/nyxtom/tokens"
)

func main() {
	fmt.Println(tokens.SplitNatural("hello world, this is @nyxtom!"))
}

Expressions

RepeatedPunctRegexp (repeated punctuation)
NumericRegexp (expression to test if a given string is only numeric)
CashTagRegexp ($GOOG, $ATT and various cashtags used in twitter or other places)
HashTagRegexp (#hashtags)
MentionRegexp (@mentions)
HTTPWWWRegexp (determines whether a URL is prefixed with http://, https://, and/or www.)
URLRegexp (regular expression for finding urls based on a variant of daringfireball.net/2010/07/improved_regex_for_matching_urls)
EmailRegexp
EmoticonsRegexp
EmoticonWordPunctuationRegexp
WordPunctuationRegexp

The word punctuation pattern covers many cases, including partial URLs, file paths, money, numerics, decimals, hyphenated words, abbreviations, mixed numeric/word tokens (3D), phone numbers, repeated punctuation, and any remaining non-whitespace.

LICENSE

MIT

Documentation ¶

Index ¶

Variables
func CashTag(text string) []string
func CashTagIndex(text string) [][]int
func Email(text string) []string
func EmailIndex(text string) [][]int
func Emoticon(text string) []string
func EmoticonIndex(text string) [][]int
func EmoticonWordPunct(text string) []string
func EmoticonWordPunctIndex(text string) [][]int
func Filter(s []string, fn func(string) bool) []string
func HashTag(text string) []string
func HashTagIndex(text string) [][]int
func MatchAny(text string, patterns ...*regexp.Regexp) bool
func Mention(text string) []string
func MentionIndex(text string) [][]int
func Split(text string, filters ...func(t string) [][]int) []string
func SplitIndex(text string, filters ...func(t string) [][]int) [][]int
func SplitNatural(text string) []string
func URL(text string) []string
func URLIndex(text string) [][]int
func WordPunct(text string) []string
func WordPunctIndex(text string) [][]int

Constants ¶

This section is empty.

Variables ¶

var CashTagRegexp = regexp.MustCompile("(?i)\\$([A-Za-z]+[A-Za-z0-9_]*)")

CashTagRegexp will look for cashtags

var EmailRegexp = regexp.MustCompile("(?i)[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]+\\b")

EmailRegexp will look for email expressions within text

var EmoticonWordPunctuationRegexp = regexp.MustCompile(combinedWordPunctuationPattern)

EmoticonWordPunctuationRegexp is a combined emoticons and word punctuation data tokenization pattern

var EmoticonsRegexp = regexp.MustCompile("(?i)" + emoticonsPattern)

EmoticonsRegexp is a pattern for tokenizing on various emoticons

var HTTPWWWRegexp = regexp.MustCompile("(?i)^(?:https?://){0,1}(?:www\\.){0,1}")

HTTPWWWRegexp will look for http or www prefixes
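
One possible use of this pattern is normalizing URLs by removing the scheme and the www. prefix. The sketch below assumes that use; stripPrefix is a hypothetical helper, not a function exported by the package.

```go
package main

import (
	"fmt"
	"regexp"
)

// httpWWWRe reproduces the HTTPWWWRegexp pattern listed above.
var httpWWWRe = regexp.MustCompile("(?i)^(?:https?://){0,1}(?:www\\.){0,1}")

// stripPrefix is a hypothetical helper (not part of the package) that removes
// a leading http://, https://, and/or www. from the given string.
func stripPrefix(url string) string {
	return httpWWWRe.ReplaceAllString(url, "")
}

func main() {
	for _, u := range []string{"https://www.example.com/a", "www.example.com", "example.com"} {
		fmt.Println(stripPrefix(u))
	}
}
```

Both parts of the pattern are optional, so it also matches the empty prefix of a string that has neither; a plain MatchString call therefore succeeds on any input and is not a useful test on its own.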

var HashTagRegexp = regexp.MustCompile("(?i)#([A-Za-zÀ-ÿ0-9\\-_&;]+)")

HashTagRegexp will look for hashtags

var MentionRegexp = regexp.MustCompile("(?i)@([A-Za-zÀ-ÿ0-9\\-_&;]+)")

MentionRegexp will look for mentions

var NumericRegexp = regexp.MustCompile("(?i)^\\d+\\%?")

NumericRegexp is a simple, quick pattern that matches one or more digits at the start of a string, optionally followed by a percent sign
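
As written, the pattern is anchored only at the start of the string, so a plain match also succeeds on input such as 42abc. The sketch below shows one way to check that an entire string is numeric with this pattern; isWhollyNumeric is a hypothetical helper and not part of the package.

```go
package main

import (
	"fmt"
	"regexp"
)

// numericRe reproduces the NumericRegexp pattern listed above.
var numericRe = regexp.MustCompile("(?i)^\\d+\\%?")

// isWhollyNumeric is a hypothetical helper (not part of the package) that
// reports whether the pattern's match covers the whole input string.
func isWhollyNumeric(s string) bool {
	m := numericRe.FindString(s)
	return m != "" && m == s
}

func main() {
	for _, s := range []string{"42", "15%", "42abc", "abc42"} {
		fmt.Println(s, numericRe.MatchString(s), isWhollyNumeric(s))
	}
}
```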

var RepeatedPunctRegexp = regexp.MustCompile("(?i)\\%|(?:[\\!\\?]+)|\\!+|\\.+|;+|,+|:+|\\'+|\\\"+|-+|\\?+|\\&+|\\*+|\\(+|\\)+|_+|\\++|\\/+|\\\\+")

RepeatedPunctRegexp is a simple expression for repeated punctuation patterns

var URLRegexp = regexp.MustCompile("(?i)\\b(?:(?:https?)://|www\\.|ftp\\.)(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.'\"\"]*\\)?|[A-Z0-9+&@#/%=~_'\"\"|$])")

URLRegexp will look for urls

var WordPunctuationRegexp = regexp.MustCompile("(?i)" + wordPunctuationPattern)

WordPunctuationRegexp is a popular pattern for tokenizing on various types of data

Functions ¶

func CashTag ¶

func CashTag(text string) []string

CashTag to split and return the strings of cashtags for the cashtag regex pattern

func CashTagIndex ¶

func CashTagIndex(text string) [][]int

CashTagIndex to split and return the indexes for the cashtag regex pattern

func Email ¶

func Email(text string) []string

Email to split and return the strings for the email regex pattern

func EmailIndex ¶

func EmailIndex(text string) [][]int

EmailIndex to split and return the indexes for the email regex pattern

func Emoticon ¶

func Emoticon(text string) []string

Emoticon will split and return the strings of all the found emoticons using the regex pattern for emoticons

func EmoticonIndex ¶

func EmoticonIndex(text string) [][]int

EmoticonIndex will split and return the indexes of all the found emoticons using the regex pattern for emoticons

func EmoticonWordPunct ¶

func EmoticonWordPunct(text string) []string

EmoticonWordPunct to split and return strings for the combined emoticon and word punctuation regular expression patterns

func EmoticonWordPunctIndex ¶

func EmoticonWordPunctIndex(text string) [][]int

EmoticonWordPunctIndex to split and return the indexes for the combined emoticon and word punctuation regular expression patterns

func Filter ¶

func Filter(s []string, fn func(string) bool) []string

Filter applies the given predicate function to the slice of strings and returns the filtered slice

func HashTag ¶

func HashTag(text string) []string

HashTag to split and return the string of hashtags for the hashtag regex pattern

func HashTagIndex ¶

func HashTagIndex(text string) [][]int

HashTagIndex to split and return the indexes for the hashtag regex pattern

func MatchAny ¶

func MatchAny(text string, patterns ...*regexp.Regexp) bool

MatchAny will see if the given string matches any of the given regular expressions
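
A function with this signature could be written as a short loop over the patterns. The sketch below is one such reading, not the package's source; matchAny is a hypothetical name, and the two patterns are copied from the Variables section.

```go
package main

import (
	"fmt"
	"regexp"
)

// These reproduce the HashTagRegexp and MentionRegexp patterns listed earlier.
var (
	hashTagRe = regexp.MustCompile("(?i)#([A-Za-zÀ-ÿ0-9\\-_&;]+)")
	mentionRe = regexp.MustCompile("(?i)@([A-Za-zÀ-ÿ0-9\\-_&;]+)")
)

// matchAny is a hypothetical stand-in with the same signature as MatchAny:
// it reports whether text matches at least one of the given patterns.
func matchAny(text string, patterns ...*regexp.Regexp) bool {
	for _, p := range patterns {
		if p.MatchString(text) {
			return true
		}
	}
	return false
}

func main() {
	for _, tok := range []string{"#golang", "@nyxtom", "plain"} {
		fmt.Println(tok, matchAny(tok, hashTagRe, mentionRe))
	}
}
```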

func Mention ¶

func Mention(text string) []string

Mention to split and return the string of mentions for the mention regex pattern

func MentionIndex ¶

func MentionIndex(text string) [][]int

MentionIndex to split and return the indexes for the mention regex pattern

func Split ¶

func Split(text string, filters ...func(t string) [][]int) []string

Split returns the token strings produced by passing the text through a pre-filter and then a post-filter: the pre-filter is executed first, and tokenization then continues on the surrounding text (for example, Email followed by WordPunct)

func SplitIndex ¶

func SplitIndex(text string, filters ...func(t string) [][]int) [][]int

SplitIndex returns the indices of all the tokens that pass through the pre- and post-filters. The pre-filter is executed first, and the post-filter is then executed on the text surrounding the tokens found by the pre-filter.

func SplitNatural ¶

func SplitNatural(text string) []string

SplitNatural to split and return the list of strings tokenized by all common word patterns

func URL ¶

func URL(text string) []string

URL to split and return the strings of urls using the url regex pattern

func URLIndex ¶

func URLIndex(text string) [][]int

URLIndex to split and return the indexes for the url regex pattern

func WordPunct ¶

func WordPunct(text string) []string

WordPunct to split and return all strings using the wordpunctuation expression

func WordPunctIndex ¶

func WordPunctIndex(text string) [][]int

WordPunctIndex to split and return the indexes for matches with the word punctuation regular expression

Types ¶

This section is empty.
