tokens

package module
v0.0.0-...-465a484 Latest
Published: Mar 5, 2015 License: MIT Imports: 2 Imported by: 0

README

tokens

Tokens is a simple NLP utility, written in Go, for tokenizing strings with common split regular expressions for whitespace, words, emoticons, URLs and more.

Installation

$ go get github.com/nyxtom/tokens

Example

package main

import (
	"fmt"
	"github.com/nyxtom/tokens"
)

func main() {
	fmt.Println(tokens.SplitNatural("hello world, this is @nyxtom!"))
}

Expressions

  • RepeatedPunctRegexp (repeated punctuation)
  • NumericRegexp (matches a leading run of digits with an optional trailing %; the pattern is anchored only at the start of the string)
  • CashTagRegexp ($GOOG, $ATT and other cashtags as used on Twitter and elsewhere)
  • HashTagRegexp (#hashtags)
  • MentionRegexp (@mentions)
  • HTTPWWWRegexp (matches an optional http:// or https:// scheme and/or a www. prefix at the start of a URL)
  • URLRegexp (finds URLs; based on a variant of daringfireball.net/2010/07/improved_regex_for_matching_urls)
  • EmailRegexp
  • EmoticonsRegexp
  • EmoticonWordPunctuationRegexp
  • WordPunctuationRegexp

The word punctuation pattern combines many sub-patterns, covering partial URLs, file paths, money, numerics, decimals, hyphenated words, abbreviations, mixed numeric/word tokens (3D), phone numbers, repeated punctuation, and non-whitespace runs.

LICENSE

MIT

Documentation

Constants

This section is empty.

Variables

var CashTagRegexp = regexp.MustCompile("(?i)\\$([A-Za-z]+[A-Za-z0-9_]*)")

CashTagRegexp will look for cashtags
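
As a rough illustration of what this pattern picks up, the sketch below recompiles the expression shown above with the standard regexp package instead of importing the module. findCashTags is a hypothetical helper introduced for this example, not a function of the package.

```go
package main

import (
	"fmt"
	"regexp"
)

// cashTag is copied from the CashTagRegexp declaration above.
var cashTag = regexp.MustCompile("(?i)\\$([A-Za-z]+[A-Za-z0-9_]*)")

// findCashTags is a hypothetical helper (not part of the package). It returns
// every full match, including the leading "$".
func findCashTags(text string) []string {
	return cashTag.FindAllString(text, -1)
}

func main() {
	// "$5" is skipped because the pattern requires a letter after "$".
	fmt.Println(findCashTags("Long $GOOG and $ATT, but $5 is not a cashtag"))
	// prints [$GOOG $ATT]
}
```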

var EmailRegexp = regexp.MustCompile("(?i)[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]+\\b")

EmailRegexp will look for email expressions within text
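
The following sketch shows how the pattern above can yield both the matched strings and their byte offsets, using only the standard regexp package. findEmails is a hypothetical helper; the package's own Email and EmailIndex functions may be implemented differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// email is copied from the EmailRegexp declaration above.
var email = regexp.MustCompile("(?i)[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]+\\b")

// findEmails is a hypothetical helper. It returns each match together with
// its [start, end) byte offsets in text.
func findEmails(text string) (matches []string, spans [][]int) {
	spans = email.FindAllStringIndex(text, -1)
	for _, s := range spans {
		matches = append(matches, text[s[0]:s[1]])
	}
	return matches, spans
}

func main() {
	matches, spans := findEmails("write to bob@example.com or alice@test.org.")
	fmt.Println(matches) // the trailing "." of the sentence is not part of a match
	fmt.Println(spans)
}
```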

var EmoticonWordPunctuationRegexp = regexp.MustCompile(combinedWordPunctuationPattern)

EmoticonWordPunctuationRegexp is a combined emoticons and word punctuation data tokenization pattern

var EmoticonsRegexp = regexp.MustCompile("(?i)" + emoticonsPattern)

EmoticonsRegexp is a pattern for tokenizing on various emoticons

var HTTPWWWRegexp = regexp.MustCompile("(?i)^(?:https?://){0,1}(?:www\\.){0,1}")

HTTPWWWRegexp will look for an http:// or https:// scheme and/or a www. prefix at the start of a string
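
One detail worth noting: both parts of this pattern are optional, so it matches every string, possibly with an empty match. The sketch below, which uses only the standard library and two hypothetical helpers (stripPrefix and hasPrefix), shows one way to strip the prefix and one way to test whether a non-empty prefix is present.

```go
package main

import (
	"fmt"
	"regexp"
)

// httpWWW is copied from the HTTPWWWRegexp declaration above.
var httpWWW = regexp.MustCompile("(?i)^(?:https?://){0,1}(?:www\\.){0,1}")

// stripPrefix is a hypothetical helper that removes a leading scheme and/or "www.".
func stripPrefix(rawURL string) string {
	return httpWWW.ReplaceAllString(rawURL, "")
}

// hasPrefix is a hypothetical helper. MatchString would always report true
// here, so it checks for a non-empty match instead.
func hasPrefix(rawURL string) bool {
	return httpWWW.FindString(rawURL) != ""
}

func main() {
	fmt.Println(stripPrefix("https://www.example.com/a")) // prints example.com/a
	fmt.Println(stripPrefix("example.com"))               // prints example.com
	fmt.Println(hasPrefix("WWW.example.com"))             // prints true
	fmt.Println(hasPrefix("example.com"))                 // prints false
}
```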

var HashTagRegexp = regexp.MustCompile("(?i)#([A-Za-zÀ-ÿ0-9\\-_&;]+)")

HashTagRegexp will look for hashtags

var MentionRegexp = regexp.MustCompile("(?i)@([A-Za-zÀ-ÿ0-9\\-_&;]+)")

MentionRegexp will look for mentions
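
HashTagRegexp and MentionRegexp share the same character class and differ only in the leading symbol. Each has one capture group holding the name without the # or @. A minimal sketch with the standard library follows; captures is a hypothetical helper, and whether the package's HashTag and Mention functions return the full match or the captured name is not stated here.

```go
package main

import (
	"fmt"
	"regexp"
)

// Both patterns are copied from the declarations above.
var (
	hashTag = regexp.MustCompile("(?i)#([A-Za-zÀ-ÿ0-9\\-_&;]+)")
	mention = regexp.MustCompile("(?i)@([A-Za-zÀ-ÿ0-9\\-_&;]+)")
)

// captures is a hypothetical helper that returns the first capture group of
// every match, i.e. the tag or user name without its leading symbol.
func captures(re *regexp.Regexp, text string) []string {
	var out []string
	for _, m := range re.FindAllStringSubmatch(text, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	text := "thanks @nyxtom for #nlp and #go-lang!"
	fmt.Println(captures(mention, text)) // prints [nyxtom]
	fmt.Println(captures(hashTag, text)) // prints [nlp go-lang]
}
```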

var NumericRegexp = regexp.MustCompile("(?i)^\\d+\\%?")

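NumericRegexp is a quick pattern that matches a leading run of digits, optionally followed by a percent sign. It is anchored only at the start, so it also matches strings that merely begin with digits.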
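
Because the expression above has a ^ anchor but no $ anchor, MatchString reports true for any string that starts with digits. The sketch below shows that behaviour and one possible way to test a whole string; isOnlyNumeric is a hypothetical helper, not part of the package.

```go
package main

import (
	"fmt"
	"regexp"
)

// numeric is copied from the NumericRegexp declaration above.
var numeric = regexp.MustCompile("(?i)^\\d+\\%?")

// isOnlyNumeric is a hypothetical helper. It reports whether the match
// covers the entire (non-empty) string.
func isOnlyNumeric(s string) bool {
	return s != "" && numeric.FindString(s) == s
}

func main() {
	for _, s := range []string{"42", "42%", "42abc", "abc42"} {
		fmt.Printf("%-6s prefix match: %-5v whole string: %v\n",
			s, numeric.MatchString(s), isOnlyNumeric(s))
	}
}
```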

var RepeatedPunctRegexp = regexp.MustCompile("(?i)\\%|(?:[\\!\\?]+)|\\!+|\\.+|;+|,+|:+|\\'+|\\\"+|-+|\\?+|\\&+|\\*+|\\(+|\\)+|_+|\\++|\\/+|\\\\+")

RepeatedPunctRegexp is a simple expression for repeated punctuation patterns
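
To see how runs of punctuation are grouped, the sketch below recompiles the expression above with the standard library. findPunct is a hypothetical helper for this example only.

```go
package main

import (
	"fmt"
	"regexp"
)

// repeatedPunct is copied from the RepeatedPunctRegexp declaration above.
var repeatedPunct = regexp.MustCompile("(?i)\\%|(?:[\\!\\?]+)|\\!+|\\.+|;+|,+|:+|\\'+|\\\"+|-+|\\?+|\\&+|\\*+|\\(+|\\)+|_+|\\++|\\/+|\\\\+")

// findPunct is a hypothetical helper that returns each punctuation run.
func findPunct(text string) []string {
	return repeatedPunct.FindAllString(text, -1)
}

func main() {
	// Mixed runs of "!" and "?" stay together; other symbols group only
	// with repeats of themselves.
	fmt.Printf("%q\n", findPunct("Wait... what?!! no--way"))
	// prints ["..." "?!!" "--"]
}
```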

var URLRegexp = regexp.MustCompile("(?i)\\b(?:(?:https?)://|www\\.|ftp\\.)(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.'\"\"]*\\)?|[A-Z0-9+&@#/%=~_'\"\"|$])")

URLRegexp will look for URLs
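
The final alternation in this pattern excludes characters such as . , : ! and ? from the last position of a match, so sentence punctuation that follows a URL is left out. A minimal sketch with the standard library follows; findURLs is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// urlRe is copied from the URLRegexp declaration above.
var urlRe = regexp.MustCompile("(?i)\\b(?:(?:https?)://|www\\.|ftp\\.)(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.'\"\"]*\\)?|[A-Z0-9+&@#/%=~_'\"\"|$])")

// findURLs is a hypothetical helper that returns every URL-like match.
func findURLs(text string) []string {
	return urlRe.FindAllString(text, -1)
}

func main() {
	for _, u := range findURLs("see https://golang.org/pkg/regexp/ and www.example.com.") {
		fmt.Println(u)
	}
}
```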

var WordPunctuationRegexp = regexp.MustCompile("(?i)" + wordPunctuationPattern)

WordPunctuationRegexp is a popular pattern for tokenizing on various types of data

Functions

func CashTag

func CashTag(text string) []string

CashTag to split and return the strings of cashtags for the cashtag regex pattern

func CashTagIndex

func CashTagIndex(text string) [][]int

CashTagIndex to split and return the indexes for the cashtag regex pattern

func Email

func Email(text string) []string

Email to split and return the strings for the email regex pattern

func EmailIndex

func EmailIndex(text string) [][]int

EmailIndex to split and return the indexes for the email regex pattern

func Emoticon

func Emoticon(text string) []string

Emoticon will split and return the strings of all the found emoticons using the regex pattern for emoticons

func EmoticonIndex

func EmoticonIndex(text string) [][]int

EmoticonIndex will split and return the indexes of all the found emoticons using the regex pattern for emoticons

func EmoticonWordPunct

func EmoticonWordPunct(text string) []string

EmoticonWordPunct to split and return strings for the combined emoticon and word punctuation regular expression patterns

func EmoticonWordPunctIndex

func EmoticonWordPunctIndex(text string) [][]int

EmoticonWordPunctIndex to split and return the indexes for the combined emoticon and word punctuation regular expression patterns

func Filter

func Filter(s []string, fn func(string) bool) []string

Filter applies the given predicate function fn to each element of the slice of strings s
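
The description does not say whether elements are kept or dropped when fn returns true. The sketch below assumes the conventional meaning, keeping elements for which fn returns true. filter is a hypothetical re-implementation written for illustration, so check the package source before relying on this behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// filter is a hypothetical re-implementation with the same signature as
// Filter. ASSUMPTION: it keeps the elements of s for which fn returns true.
func filter(s []string, fn func(string) bool) []string {
	var out []string
	for _, v := range s {
		if fn(v) {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	toks := []string{"hello", " ", "world", ""}
	// Keep only tokens that contain something other than whitespace.
	nonBlank := func(t string) bool { return strings.TrimSpace(t) != "" }
	fmt.Println(filter(toks, nonBlank)) // prints [hello world]
}
```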

func HashTag

func HashTag(text string) []string

HashTag to split and return the string of hashtags for the hashtag regex pattern

func HashTagIndex

func HashTagIndex(text string) [][]int

HashTagIndex to split and return the indexes for the hashtag regex pattern

func MatchAny

func MatchAny(text string, patterns ...*regexp.Regexp) bool

MatchAny will see if the given string matches any of the given regular expressions
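
A function with this signature can be sketched in a few lines. matchAny below is a hypothetical re-implementation that assumes "matches" means an unanchored search with MatchString; the two patterns are copied from the HashTagRegexp and MentionRegexp declarations.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	hashTag = regexp.MustCompile("(?i)#([A-Za-zÀ-ÿ0-9\\-_&;]+)")
	mention = regexp.MustCompile("(?i)@([A-Za-zÀ-ÿ0-9\\-_&;]+)")
)

// matchAny is a hypothetical re-implementation. ASSUMPTION: a pattern
// "matches" if it is found anywhere in text.
func matchAny(text string, patterns ...*regexp.Regexp) bool {
	for _, p := range patterns {
		if p.MatchString(text) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(matchAny("#golang", hashTag, mention)) // prints true
	fmt.Println(matchAny("@nyxtom", hashTag, mention)) // prints true
	fmt.Println(matchAny("hello", hashTag, mention))   // prints false
}
```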

func Mention

func Mention(text string) []string

Mention to split and return the string of mentions for the mention regex pattern

func MentionIndex

func MentionIndex(text string) [][]int

MentionIndex to split and return the indexes for the mention regex pattern

func Split

func Split(text string, filters ...func(t string) [][]int) []string

Split returns the token strings produced by passing the text through a pre-filter and then a post-filter. The pre-filter is executed first, and the post-filter then tokenizes the text surrounding the pre-filter's matches (for example, Email as the pre-filter and WordPunct as the post-filter)

func SplitIndex

func SplitIndex(text string, filters ...func(t string) [][]int) [][]int

SplitIndex returns the indices of all the tokens that passed through the pre- and post-filters. The pre-filter is executed first, and the post-filter is executed on the text surrounding the tokens found by the pre-filter.
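
The pre-filter/post-filter idea can be sketched as follows, assuming exactly two filters. splitIndex, emailIndex and wordPunctIndex are hypothetical names, and wordOrPunct is a simplified stand-in for the package's word punctuation pattern, whose source is not shown on this page. The package's actual implementation may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// email is copied from the EmailRegexp declaration.
	email = regexp.MustCompile("(?i)[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]+\\b")
	// wordOrPunct is a simplified stand-in (an assumption): runs of word
	// characters, or runs of other non-space characters.
	wordOrPunct = regexp.MustCompile(`\w+|[^\w\s]+`)
)

func emailIndex(t string) [][]int     { return email.FindAllStringIndex(t, -1) }
func wordPunctIndex(t string) [][]int { return wordOrPunct.FindAllStringIndex(t, -1) }

// splitIndex is a hypothetical sketch: pre claims its matches first, then
// post tokenizes the gaps before, between and after those matches.
func splitIndex(text string, pre, post func(string) [][]int) [][]int {
	var out [][]int
	last := 0
	// applyPost tokenizes text[from:to] and shifts offsets back into text.
	applyPost := func(from, to int) {
		for _, m := range post(text[from:to]) {
			out = append(out, []int{m[0] + from, m[1] + from})
		}
	}
	for _, m := range pre(text) {
		applyPost(last, m[0])
		out = append(out, []int{m[0], m[1]})
		last = m[1]
	}
	applyPost(last, len(text))
	return out
}

func main() {
	text := "mail bob@example.com now!"
	for _, span := range splitIndex(text, emailIndex, wordPunctIndex) {
		fmt.Printf("%q ", text[span[0]:span[1]])
	}
	fmt.Println()
	// prints "mail" "bob@example.com" "now" "!"
}
```

Without the pre-filter, the stand-in pattern alone would break the address into bob, @, example, . and com, which is why the more specific pattern runs first.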

func SplitNatural

func SplitNatural(text string) []string

SplitNatural to split and return the list of strings tokenized by all common word patterns

func URL

func URL(text string) []string

URL to split and return the strings of URLs using the URL regex pattern

func URLIndex

func URLIndex(text string) [][]int

URLIndex to split and return the indexes for the URL regex pattern

func WordPunct

func WordPunct(text string) []string

WordPunct to split and return all strings using the word punctuation expression

func WordPunctIndex

func WordPunctIndex(text string) [][]int

WordPunctIndex to split and return the indexes for matches with the word punctuation regular expression

Types

This section is empty.
