featuremill

package module
v0.0.0-...-38f9c56 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 17, 2018 License: Apache-2.0 Imports: 8 Imported by: 1

README

featuremill

Build Status

Featuremill is a general-purpose fast, stateless, and deterministic feature extractor writting in golang for use in machine learning.

The text feature extractor makes heavy use of a hashing vectorizer in order to rapidly generate features with very low overhead. Hash vectorizing does not require storing any state and is deterministic so it has these advantages:

  • fast - does not need to perform pre-processing or lookups because hashing vectorizer is used for generating feature IDs
  • low memory consumption for large amounts of data - no index information or state stored in memory
  • can easily be scaled horizontally - requires no stored state, so no coordination required
  • features and valuescan be added at a later date

The advantages do not come without a cost, though. Featuremill cannot apply transformations such as Inverse Document Frequency (IDF) to text features, which can be helpful to discount the impact of common text features. If this has significant impact on prediction with your ML algorithm or dataset, consider adding that support or use different tooling or approach. Other optimizations, such as number scaling can be done if you're able to provide some basic metrics on your number data from your datastore, such as the maximum value in the dataset.

Output format

libsvm format (<index_number>:<value>) A slice of them, if the feature generates multiple vectors (text, timestamp, date). This format makes this library easy to use along with hector

Supported data

  • IP - gets converted to integer representation (preserves locality) and min/max scaled
  • timestamp - represeneted as 3 seasonality vectors: minute of hour, hour of day, day of week
  • date - represented as 2 seasonality vectors: day of week, and month of year
  • text - tokenized by word and one-hot encoded using hashing vectorizer
  • numerical - log scaler (default), or min/max rescaler
  • boolean - represented as 0 or 1
  • categorical - not tokenized at all and one-hot encoded using hashing vectorizer

Documentation

Godocs and tests

godocs

Look at the included example and look at the tests for example usage, as well.

How I use featuremill

For each incoming sample being processed, I append the returned string or expanded slice to a features slice. To assemble the final sample in libsvm format, just sort and join it with the sample label:

sort.Strings(features)
sample := "0 " + strings.Join(features, " ")

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ExtractBoolean

func ExtractBoolean(field, boolean string) (string, error)

ExtractBoolean returns a 0/1 vector for the deterministic feature ID

func ExtractCategorical

func ExtractCategorical(field, category string) string

ExtractCategorical returns a vector that is a positive boolean for a deterministic categorical feature ID

func ExtractDate

func ExtractDate(field, date string) ([]string, error)

ExtractDate returns a slice of 2 scaled seasonality vectors: day of week, and month of year each with a deterministic feature id

func ExtractIP

func ExtractIP(field, addr string) (string, error)

ExtractIP returns a vector that is a scaled integer representation of IPv4 and IPv6 IPs with a deterministic feature ID

func ExtractNumerical

func ExtractNumerical(field string, num float32, min float32, max float32) string

ExtractNumerical returns a min/max scalled vector to a deterministic feature ID

func ExtractNumericalMaxMin

func ExtractNumericalMaxMin(field string, num float32, min float32, max float32) string

ExtractNumericalMaxMin returns a min/max scalled vector to a deterministic feature ID

func ExtractText

func ExtractText(text, delim string) []string

ExtractText returns a slice of "featureID:1" strings for each token in the string

func ExtractTimestamp

func ExtractTimestamp(field, timestamp string) ([]string, error)

ExtractTimestamp returns a slice of 3 scaled seasonality vectors: minute of hour, hour of day, day of week each with a deterministic feature ID

Types

This section is empty.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL