feature

package
v0.5.0
Warning

This package is not in the latest version of its module.

Published: Nov 6, 2022 License: AGPL-3.0 Imports: 5 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func HashOneHot

func HashOneHot(buf []byte, size int) []float64

func HashOneHot32 added in v0.3.0

func HashOneHot32(buf []byte, size int) []float32

func SimpleOneHot

func SimpleOneHot(value int, size int) []float64

func StringSplitMultiHot

func StringSplitMultiHot(str string, sep string, size int) []float64
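The hashing trick behind these helpers can be sketched as follows. This is a minimal illustration, not the package's implementation: the FNV-1a hash and the out-of-range handling in `simpleOneHot` are assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashOneHot sketches HashOneHot: hash the input bytes and set a 1 at
// the resulting bucket index. FNV-1a is an assumption; the package may
// use a different hash function.
func hashOneHot(buf []byte, size int) []float64 {
	out := make([]float64, size)
	h := fnv.New32a()
	h.Write(buf)
	out[int(h.Sum32())%size] = 1
	return out
}

// simpleOneHot sketches SimpleOneHot: set a 1 at the given index. This
// sketch leaves the vector all-zero for out-of-range indices; the
// package's actual behavior may differ.
func simpleOneHot(value int, size int) []float64 {
	out := make([]float64, size)
	if value >= 0 && value < size {
		out[value] = 1
	}
	return out
}

func main() {
	fmt.Println(simpleOneHot(2, 4)) // [0 0 1 0]
	fmt.Println(hashOneHot([]byte("cat"), 4))
}
```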

Types

type CountVectorizer

type CountVectorizer struct {
	Mapping   map[string]uint // word to index
	Separator string          // default space
}

CountVectorizer performs bag of words encoding of text.

Separator should not be part of any word; it is the caller's responsibility to ensure this. Words that contain the separator as a substring will be omitted.

Mapping should contain all values from 0 to N, where N is len(Mapping). It is the caller's responsibility to ensure this. If some index is greater than N or lower than 0, the code will panic. If some index is not set, that index will be skipped. If some index is set twice, that index will hold the sum of the counts of both words.

func (*CountVectorizer) FeatureNames

func (t *CountVectorizer) FeatureNames() []string

FeatureNames returns slice with produced feature names

func (*CountVectorizer) Fit

func (t *CountVectorizer) Fit(vals []string)

Fit assigns a number from 0 to N to each word in the input, where N is the number of words.

func (*CountVectorizer) NumFeatures

func (t *CountVectorizer) NumFeatures() int

NumFeatures returns the number of features produced for a single input field.

func (*CountVectorizer) Transform

func (t *CountVectorizer) Transform(v string) []float64

Transform counts how many times each word appeared in input

func (*CountVectorizer) TransformInplace

func (t *CountVectorizer) TransformInplace(dest []float64, v string)

TransformInplace counts how many times each word appeared in the input, in-place version. It is the caller's responsibility to zero out the destination. Uses a zero-allocation algorithm based on `strings.Split` semantics, exploiting the fact that a string is a slice of bytes. Works fine with UTF-8.
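The counting behavior described above can be sketched in plain Go. The mapping and separator here are illustrative, and this version allocates (unlike the in-place variant documented above):

```go
package main

import (
	"fmt"
	"strings"
)

// countVectorize sketches CountVectorizer.Transform: split the input on
// the separator and count occurrences of each mapped word. Unmapped
// words are ignored.
func countVectorize(mapping map[string]uint, sep, s string) []float64 {
	out := make([]float64, len(mapping))
	for _, w := range strings.Split(s, sep) {
		if i, ok := mapping[w]; ok {
			out[i]++
		}
	}
	return out
}

func main() {
	mapping := map[string]uint{"the": 0, "cat": 1, "sat": 2}
	fmt.Println(countVectorize(mapping, " ", "the cat the")) // [2 1 0]
}
```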

type Identity

type Identity struct{}

Identity is a transformer that returns unmodified input value

func (*Identity) Fit

func (t *Identity) Fit(_ []float64)

Fit is not used; it is here only to keep the same interface as the rest of the transformers.

func (*Identity) Transform

func (t *Identity) Transform(v float64) float64

Transform returns same value as input

type KBinsDiscretizer

type KBinsDiscretizer struct {
	QuantileScaler
}

KBinsDiscretizer based on quantile strategy

func (*KBinsDiscretizer) Fit

func (t *KBinsDiscretizer) Fit(vals []float64)

Fit fits quantile scaler

func (*KBinsDiscretizer) Transform

func (t *KBinsDiscretizer) Transform(v float64) float64

Transform finds index of matched quantile for input
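The "index of matched quantile" step can be sketched with a binary search over the fitted quantiles. The exact boundary handling in the package may differ; this is one plausible reading:

```go
package main

import (
	"fmt"
	"sort"
)

// binIndex sketches KBinsDiscretizer.Transform: return the index of the
// first fitted quantile that is >= v, as a float64 bin index.
func binIndex(quantiles []float64, v float64) float64 {
	return float64(sort.SearchFloat64s(quantiles, v))
}

func main() {
	qs := []float64{1, 2, 3, 4} // assumed already fitted and sorted
	fmt.Println(binIndex(qs, 2.5)) // 2
}
```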

type MaxAbsScaler

type MaxAbsScaler struct {
	Max float64
}

MaxAbsScaler transforms value into -1 to +1 range linearly

func (*MaxAbsScaler) Fit

func (t *MaxAbsScaler) Fit(vals []float64)

Fit finds the maximum absolute value.

func (*MaxAbsScaler) Transform

func (t *MaxAbsScaler) Transform(v float64) float64

Transform scales value into -1 to +1 range
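The scaling step is a single division by the fitted maximum absolute value. A minimal sketch (the zero-max guard is an assumption, not documented behavior):

```go
package main

import "fmt"

// maxAbsScale sketches MaxAbsScaler.Transform: divide by the fitted
// maximum absolute value, mapping inputs into [-1, +1].
func maxAbsScale(v, max float64) float64 {
	if max == 0 {
		return 0 // guard against division by zero; actual behavior may differ
	}
	return v / max
}

func main() {
	fmt.Println(maxAbsScale(-2, 4)) // -0.5
}
```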

type MinMaxScaler

type MinMaxScaler struct {
	Min float64
	Max float64
}

MinMaxScaler is a transformer that rescales value into range between min and max

func (*MinMaxScaler) Fit

func (t *MinMaxScaler) Fit(vals []float64)

Fit finds the min and max value in the input.

func (*MinMaxScaler) Transform

func (t *MinMaxScaler) Transform(v float64) float64

Transform scales value from 0 to 1 linearly
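The linear rescaling can be sketched as follows; the clamping to [0, 1] and the equal-min-max guard are assumptions, not documented behavior:

```go
package main

import "fmt"

// minMaxScale sketches MinMaxScaler.Transform: map v linearly so that
// the fitted min becomes 0 and the fitted max becomes 1.
func minMaxScale(v, min, max float64) float64 {
	if max == min {
		return 0 // degenerate range; actual behavior may differ
	}
	s := (v - min) / (max - min)
	if s < 0 {
		s = 0
	}
	if s > 1 {
		s = 1
	}
	return s
}

func main() {
	fmt.Println(minMaxScale(5, 0, 10)) // 0.5
}
```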

type OneHotEncoder

type OneHotEncoder struct {
	Mapping map[string]uint // word to index
}

OneHotEncoder encodes string value to corresponding index

Mapping should contain all values from 0 to N, where N is len(Mapping). It is the caller's responsibility to ensure this. If some index is greater than N or lower than 0, the code will panic. If some index is not set, that index will be skipped. If some index is set twice, that index may take the effect of either word.

func (*OneHotEncoder) FeatureNames

func (t *OneHotEncoder) FeatureNames() []string

FeatureNames returns names of each produced value.

func (*OneHotEncoder) Fit

func (t *OneHotEncoder) Fit(vs []string)

Fit assigns each value from the inputs a number based on its order of occurrence in the input data. Empty strings in the input are ignored.

func (*OneHotEncoder) NumFeatures

func (t *OneHotEncoder) NumFeatures() int

NumFeatures returns the number of features one field is expanded into.

func (*OneHotEncoder) Transform

func (t *OneHotEncoder) Transform(v string) []float64

Transform assigns 1 to value that is found

func (*OneHotEncoder) TransformInplace

func (t *OneHotEncoder) TransformInplace(dest []float64, v string)

TransformInplace assigns 1 to the value that is found, in place. It is the caller's responsibility to reset the destination to 0.
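The lookup-and-set behavior can be sketched in plain Go; the mapping values here are illustrative:

```go
package main

import "fmt"

// oneHot sketches OneHotEncoder.Transform: look the value up in the
// mapping and set 1 at its index; unknown values yield an all-zero
// vector in this sketch.
func oneHot(mapping map[string]uint, v string) []float64 {
	out := make([]float64, len(mapping))
	if i, ok := mapping[v]; ok {
		out[i] = 1
	}
	return out
}

func main() {
	mapping := map[string]uint{"a": 0, "b": 1, "c": 2}
	fmt.Println(oneHot(mapping, "b")) // [0 1 0]
}
```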

type OrdinalEncoder

type OrdinalEncoder struct {
	Mapping map[string]uint
}

OrdinalEncoder returns 0 for a string that is not found, otherwise the number assigned to that string.

Mapping should contain all values from 0 to N, where N is len(Mapping). It is the caller's responsibility to ensure this. If some index is greater than N or lower than 0, the code will panic. If some index is not set, that index will be skipped. If some index is set twice, that index may take the effect of either word.

func (*OrdinalEncoder) Fit

func (t *OrdinalEncoder) Fit(vals []string)

Fit assigns each word a value from 1 to N. Empty strings in the input are ignored.

func (*OrdinalEncoder) Transform

func (t *OrdinalEncoder) Transform(v string) float64

Transform returns the number assigned to the input; if it is not found, returns the zero value, which is 0.
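Because a Go map lookup for a missing key yields the zero value, the documented behavior falls out of a single lookup. A minimal sketch with an illustrative mapping:

```go
package main

import "fmt"

// ordinal sketches OrdinalEncoder.Transform: return the number assigned
// to the string. Missing keys yield the map's zero value, which is 0.
func ordinal(mapping map[string]uint, v string) float64 {
	return float64(mapping[v])
}

func main() {
	mapping := map[string]uint{"red": 1, "green": 2}
	fmt.Println(ordinal(mapping, "green")) // 2
	fmt.Println(ordinal(mapping, "blue"))  // 0
}
```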

type QuantileScaler

type QuantileScaler struct {
	Quantiles []float64
}

QuantileScaler transforms any distribution to a uniform distribution. This is done by mapping values to the quantiles they belong to.

func (*QuantileScaler) Fit

func (t *QuantileScaler) Fit(vals []float64)

Fit sets the quantile parameters based on the input. The number of quantiles is specified by the size of the Quantiles slice; if it is empty or nil, 100 is used as the default. If the input is smaller than the number of quantiles, the input length is used instead.

func (*QuantileScaler) Transform

func (t *QuantileScaler) Transform(v float64) float64

Transform changes distribution into uniform one from 0 to 1
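One plausible sketch of the uniform mapping: return the fraction of fitted quantiles at or below the input. The package's exact interpolation at the boundaries may differ:

```go
package main

import (
	"fmt"
	"sort"
)

// quantileScale sketches QuantileScaler.Transform: map v to the
// fraction of fitted quantiles below it, producing a roughly uniform
// output in [0, 1].
func quantileScale(quantiles []float64, v float64) float64 {
	i := sort.SearchFloat64s(quantiles, v)
	return float64(i) / float64(len(quantiles))
}

func main() {
	qs := []float64{1, 2, 3, 4} // assumed fitted and sorted
	fmt.Println(quantileScale(qs, 2.5)) // 0.5
}
```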

type SampleNormalizerL1

type SampleNormalizerL1 struct{}

SampleNormalizerL1 transforms features for single sample to have norm L1=1

func (*SampleNormalizerL1) Fit

func (t *SampleNormalizerL1) Fit(_ []float64)

Fit is empty, kept only to keep same interface

func (*SampleNormalizerL1) Transform

func (t *SampleNormalizerL1) Transform(vs []float64) []float64

Transform returns L1 normalized vector

func (*SampleNormalizerL1) TransformInplace

func (t *SampleNormalizerL1) TransformInplace(dest []float64, vs []float64)

TransformInplace computes the L1 normalized vector, in place.

type SampleNormalizerL2

type SampleNormalizerL2 struct{}

SampleNormalizerL2 transforms features for single sample to have norm L2=1

func (*SampleNormalizerL2) Fit

func (t *SampleNormalizerL2) Fit(_ []float64)

Fit is empty, kept only to keep same interface

func (*SampleNormalizerL2) Transform

func (t *SampleNormalizerL2) Transform(vs []float64) []float64

Transform returns L2 normalized vector

func (*SampleNormalizerL2) TransformInplace

func (t *SampleNormalizerL2) TransformInplace(dest []float64, vs []float64)

TransformInplace computes the L2 normalized vector, in place.
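The normalization math common to both normalizers is straightforward; a minimal sketch of the L2 case (the zero-norm guard is an assumption):

```go
package main

import (
	"fmt"
	"math"
)

// normalizeL2 sketches SampleNormalizerL2.Transform: divide each
// element by the vector's L2 norm so the result has norm 1. The L1
// variant is identical except the norm is the sum of absolute values.
func normalizeL2(vs []float64) []float64 {
	var norm float64
	for _, v := range vs {
		norm += v * v
	}
	norm = math.Sqrt(norm)
	out := make([]float64, len(vs))
	if norm == 0 {
		return out // zero vector stays zero; actual behavior may differ
	}
	for i, v := range vs {
		out[i] = v / norm
	}
	return out
}

func main() {
	fmt.Println(normalizeL2([]float64{3, 4})) // [0.6 0.8]
}
```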

type StandardScaler

type StandardScaler struct {
	Mean float64
	STD  float64
}

StandardScaler transforms feature into normal standard distribution.

func (*StandardScaler) Fit

func (t *StandardScaler) Fit(vals []float64)

Fit computes mean and standard deviation

func (*StandardScaler) Transform

func (t *StandardScaler) Transform(v float64) float64

Transform centers the value by the mean and scales it by the standard deviation.
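The centering and scaling above is the usual z-score; a minimal sketch (the zero-STD guard is an assumption, not documented behavior):

```go
package main

import "fmt"

// standardScale sketches StandardScaler.Transform: subtract the fitted
// mean and divide by the fitted standard deviation.
func standardScale(v, mean, std float64) float64 {
	if std == 0 {
		return 0 // guard; actual package behavior may differ
	}
	return (v - mean) / std
}

func main() {
	fmt.Println(standardScale(12, 10, 2)) // 1
}
```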

type StructTransformer

type StructTransformer struct {
	Transformers []interface{}
}

StructTransformer uses reflection to encode a struct into a feature vector. It uses struct tags to create feature transformers for each field. Since it uses reflection, there is a slight overhead for large structs, which can be seen in benchmarks. For better performance, use the codegen version for your struct; refer to the README of this repo.

func (*StructTransformer) Fit

func (s *StructTransformer) Fit(_ []interface{})

Fit will fit all field transformers

func (*StructTransformer) Transform

func (s *StructTransformer) Transform(v interface{}) []float64

Transform applies all field transformers

type TFIDFVectorizer

type TFIDFVectorizer struct {
	CountVectorizer
	DocCount     []uint // number of documents where i-th word from CountVectorizer appeared in
	NumDocuments int
	Normalizer   SampleNormalizerL2
}

TFIDFVectorizer performs tf-idf vectorization on top of count vectorization. Based on: https://scikit-learn.org/stable/modules/feature_extraction.html. Uses the non-smooth version, adding 1 to the log instead of to the denominator in idf.

DocCount should have length len(CountVectorizer.Mapping). It is the caller's responsibility to ensure this.

func (*TFIDFVectorizer) FeatureNames

func (t *TFIDFVectorizer) FeatureNames() []string

FeatureNames returns slice with produced feature names.

func (*TFIDFVectorizer) Fit

func (t *TFIDFVectorizer) Fit(vals []string)

Fit fits CountVectorizer and extra information for tf-idf computation

func (*TFIDFVectorizer) NumFeatures

func (t *TFIDFVectorizer) NumFeatures() int

NumFeatures returns number of features for single field

func (*TFIDFVectorizer) Transform

func (t *TFIDFVectorizer) Transform(v string) []float64

Transform performs tf-idf computation

func (*TFIDFVectorizer) TransformInplace

func (t *TFIDFVectorizer) TransformInplace(dest []float64, v string)

TransformInplace performs the tf-idf computation, in place. It is the caller's responsibility to zero out the destination.

Directories

Path Synopsis
emb
Package preprocessing includes scaling, centering, normalization, binarization and imputation methods.
