Documentation ¶
Index ¶
- func HashOneHot(buf []byte, size int) []float64
- func HashOneHot32(buf []byte, size int) []float32
- func SimpleOneHot(value int, size int) []float64
- func StringSplitMultiHot(str string, sep string, size int) []float64
- type CountVectorizer
- type Identity
- type KBinsDiscretizer
- type MaxAbsScaler
- type MinMaxScaler
- type OneHotEncoder
- type OrdinalEncoder
- type QuantileScaler
- type SampleNormalizerL1
- type SampleNormalizerL2
- type StandardScaler
- type StructTransformer
- type TFIDFVectorizer
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func HashOneHot ¶
func HashOneHot32 ¶ added in v0.3.0
func SimpleOneHot ¶
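The functions above are not documented in detail here, so below is a minimal sketch of the two encodings their signatures suggest: hashing-trick one-hot and plain index-based one-hot. The package's actual hash function is not shown on this page, so FNV-1a below is an assumption; all names are illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashOneHot sketches hashing-trick one-hot encoding: hash the input
// bytes and set that bucket to 1. FNV-1a is an assumption here; the
// package may use a different hash.
func hashOneHot(buf []byte, size int) []float64 {
	out := make([]float64, size)
	h := fnv.New64a()
	h.Write(buf)
	out[h.Sum64()%uint64(size)] = 1
	return out
}

// simpleOneHot sets index `value` to 1 in a vector of length `size`;
// the bounds check is a choice made for this sketch.
func simpleOneHot(value, size int) []float64 {
	out := make([]float64, size)
	if value >= 0 && value < size {
		out[value] = 1
	}
	return out
}

func main() {
	fmt.Println(simpleOneHot(2, 4)) // → [0 0 1 0]
	fmt.Println(hashOneHot([]byte("cat"), 8))
}
```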
Types ¶
type CountVectorizer ¶
type CountVectorizer struct {
	Mapping   map[string]uint // word to index
	Separator string          // default space
}
CountVectorizer performs bag of words encoding of text.
Separator should not be part of any word; it is the caller's responsibility to ensure this. Words that contain the separator as a substring will be omitted.
Mapping should contain all values from 0 to N-1, where N is len(Mapping); it is the caller's responsibility to ensure this. If some index is N or higher, or lower than 0, the code will panic. If some index is not set, that index will be skipped. If some index is set twice, that index will hold the sum of both words' counts.
func (*CountVectorizer) FeatureNames ¶
func (t *CountVectorizer) FeatureNames() []string
FeatureNames returns slice with produced feature names
func (*CountVectorizer) Fit ¶
func (t *CountVectorizer) Fit(vals []string)
Fit assigns a number from 0 to N-1 to each word in the input, where N is the number of distinct words.
func (*CountVectorizer) NumFeatures ¶
func (t *CountVectorizer) NumFeatures() int
NumFeatures returns the number of features produced for a single input field.
func (*CountVectorizer) Transform ¶
func (t *CountVectorizer) Transform(v string) []float64
Transform counts how many times each word appeared in input
func (*CountVectorizer) TransformInplace ¶
func (t *CountVectorizer) TransformInplace(dest []float64, v string)
TransformInplace counts how many times each word appeared in the input; inplace version. It is the caller's responsibility to zero out the destination. Uses a zero-allocation algorithm based on `strings.Split`, utilizing the fact that a string is a slice of bytes. Works fine with UTF-8.
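As an illustration of the counting described above (a sketch, not the package's actual implementation), Transform behaves roughly like this, where `mapping` and `sep` stand in for the struct fields:

```go
package main

import (
	"fmt"
	"strings"
)

// countTransform sketches CountVectorizer.Transform: split the input on
// the separator and count occurrences of each mapped word; unmapped
// words are ignored.
func countTransform(mapping map[string]uint, sep, v string) []float64 {
	out := make([]float64, len(mapping))
	for _, w := range strings.Split(v, sep) {
		if idx, ok := mapping[w]; ok {
			out[idx]++
		}
	}
	return out
}

func main() {
	mapping := map[string]uint{"a": 0, "b": 1}
	fmt.Println(countTransform(mapping, " ", "a b a c")) // → [2 1]
}
```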
type Identity ¶
type Identity struct{}
Identity is a transformer that returns unmodified input value
type KBinsDiscretizer ¶
type KBinsDiscretizer struct {
QuantileScaler
}
KBinsDiscretizer based on quantile strategy
func (*KBinsDiscretizer) Fit ¶
func (t *KBinsDiscretizer) Fit(vals []float64)
Fit fits quantile scaler
func (*KBinsDiscretizer) Transform ¶
func (t *KBinsDiscretizer) Transform(v float64) float64
Transform finds index of matched quantile for input
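A minimal sketch of the quantile-bin lookup described above; the exact boundary handling in the package may differ, and all names here are illustrative:

```go
package main

import "fmt"

// binIndex sketches KBinsDiscretizer.Transform: return the index of the
// first fitted quantile boundary that v does not exceed. Values above
// the last boundary fall into the last bin.
func binIndex(quantiles []float64, v float64) float64 {
	for i, q := range quantiles {
		if v <= q {
			return float64(i)
		}
	}
	return float64(len(quantiles) - 1)
}

func main() {
	quantiles := []float64{1, 2, 3} // fitted boundaries
	fmt.Println(binIndex(quantiles, 1.5)) // → 1
	fmt.Println(binIndex(quantiles, 10))  // → 2
}
```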
type MaxAbsScaler ¶
type MaxAbsScaler struct {
Max float64
}
MaxAbsScaler transforms value into -1 to +1 range linearly
func (*MaxAbsScaler) Fit ¶
func (t *MaxAbsScaler) Fit(vals []float64)
Fit finds the maximum absolute value.
func (*MaxAbsScaler) Transform ¶
func (t *MaxAbsScaler) Transform(v float64) float64
Transform scales value into -1 to +1 range
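The linear scaling described above reduces to a single division by the fitted maximum; a minimal illustrative sketch (the zero guard is a choice made for this sketch):

```go
package main

import "fmt"

// maxAbsTransform sketches MaxAbsScaler.Transform: dividing by the
// largest absolute value seen during Fit maps values into [-1, +1].
func maxAbsTransform(v, max float64) float64 {
	if max == 0 {
		return 0 // guard against an unfitted scaler
	}
	return v / max
}

func main() {
	fmt.Println(maxAbsTransform(-5, 10)) // → -0.5
}
```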
type MinMaxScaler ¶
MinMaxScaler is a transformer that rescales value into range between min and max
func (*MinMaxScaler) Fit ¶
func (t *MinMaxScaler) Fit(vals []float64)
Fit finds the min and max values in the input.
func (*MinMaxScaler) Transform ¶
func (t *MinMaxScaler) Transform(v float64) float64
Transform scales value from 0 to 1 linearly
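The rescaling formula is (v - min) / (max - min); a minimal illustrative sketch with hypothetical names:

```go
package main

import "fmt"

// minMaxTransform sketches MinMaxScaler.Transform: linearly rescale v
// from the fitted [min, max] range into [0, 1].
func minMaxTransform(v, min, max float64) float64 {
	if max == min {
		return 0 // degenerate range
	}
	return (v - min) / (max - min)
}

func main() {
	fmt.Println(minMaxTransform(5, 0, 10)) // → 0.5
}
```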
type OneHotEncoder ¶
OneHotEncoder encodes string value to corresponding index
Mapping should contain all values from 0 to N-1, where N is len(Mapping); it is the caller's responsibility to ensure this. If some index is N or higher, or lower than 0, the code will panic. If some index is not set, that index will be skipped. If some index is set twice, that index will take the effect of either of the words.
func (*OneHotEncoder) FeatureNames ¶
func (t *OneHotEncoder) FeatureNames() []string
FeatureNames returns names of each produced value.
func (*OneHotEncoder) Fit ¶
func (t *OneHotEncoder) Fit(vs []string)
Fit assigns each value from the input a number based on its order of occurrence in the input data. Empty strings in the input are ignored.
func (*OneHotEncoder) NumFeatures ¶
func (t *OneHotEncoder) NumFeatures() int
NumFeatures returns the number of features a single field is expanded into.
func (*OneHotEncoder) Transform ¶
func (t *OneHotEncoder) Transform(v string) []float64
Transform assigns 1 to value that is found
func (*OneHotEncoder) TransformInplace ¶
func (t *OneHotEncoder) TransformInplace(dest []float64, v string)
TransformInplace assigns 1 to the value that is found, inplace. It is the caller's responsibility to reset the destination to 0.
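A hedged sketch of the behavior described above: set the element at the value's mapped index to 1, with unknown values producing an all-zero vector (names are illustrative, not the package's API):

```go
package main

import "fmt"

// oneHotTransform sketches OneHotEncoder.Transform: set the vector
// element at the value's mapped index to 1; unknown values yield all
// zeros.
func oneHotTransform(mapping map[string]uint, v string) []float64 {
	out := make([]float64, len(mapping))
	if idx, ok := mapping[v]; ok {
		out[idx] = 1
	}
	return out
}

func main() {
	mapping := map[string]uint{"red": 0, "blue": 1}
	fmt.Println(oneHotTransform(mapping, "blue")) // → [0 1]
}
```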
type OrdinalEncoder ¶
OrdinalEncoder returns 0 for string that is not found, or else a number for that string
Mapping should contain all values from 0 to N-1, where N is len(Mapping); it is the caller's responsibility to ensure this. If some index is N or higher, or lower than 0, the code will panic. If some index is not set, that index will be skipped. If some index is set twice, that index will take the effect of either of the words.
func (*OrdinalEncoder) Fit ¶
func (t *OrdinalEncoder) Fit(vals []string)
Fit assigns each word a value from 1 to N. Empty strings in the input are ignored.
func (*OrdinalEncoder) Transform ¶
func (t *OrdinalEncoder) Transform(v string) float64
Transform returns the number assigned to the input; if the input is not found, it returns the zero value, which is 0.
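Since Fit assigns values starting at 1, zero can safely mean "not found". A minimal illustrative sketch (the package's Mapping field may use a different value type):

```go
package main

import "fmt"

// ordinalTransform sketches OrdinalEncoder.Transform: look up the number
// assigned during Fit; missing keys return the map's zero value, 0.
func ordinalTransform(mapping map[string]float64, v string) float64 {
	return mapping[v]
}

func main() {
	mapping := map[string]float64{"low": 1, "high": 2}
	fmt.Println(ordinalTransform(mapping, "high"))    // → 2
	fmt.Println(ordinalTransform(mapping, "unknown")) // → 0
}
```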
type QuantileScaler ¶
type QuantileScaler struct {
Quantiles []float64
}
QuantileScaler transforms any distribution to a uniform distribution. This is done by mapping values to the quantiles they belong to.
func (*QuantileScaler) Fit ¶
func (t *QuantileScaler) Fit(vals []float64)
Fit sets quantile parameters based on the input. The number of quantiles is specified by the size of the Quantiles slice; if it is empty or nil, 100 is used as the default. If the input is smaller than the number of quantiles, the input length is used instead.
func (*QuantileScaler) Transform ¶
func (t *QuantileScaler) Transform(v float64) float64
Transform changes distribution into uniform one from 0 to 1
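One plausible reading of "mapping values to quantiles": locate which fitted quantile v falls into and return its position as a fraction of the total, so outputs land in [0, 1]. This is a sketch under that assumption; the package's exact boundary and interpolation behavior may differ.

```go
package main

import "fmt"

// quantileTransform sketches QuantileScaler.Transform: find the first
// fitted quantile that v does not exceed and return its position as a
// fraction in (0, 1]; values above all quantiles map to 1.
func quantileTransform(quantiles []float64, v float64) float64 {
	n := len(quantiles)
	for i, q := range quantiles {
		if v <= q {
			return float64(i+1) / float64(n)
		}
	}
	return 1
}

func main() {
	quantiles := []float64{1, 2, 3, 4} // fitted during Fit
	fmt.Println(quantileTransform(quantiles, 2.5)) // → 0.75
}
```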
type SampleNormalizerL1 ¶
type SampleNormalizerL1 struct{}
SampleNormalizerL1 transforms features for single sample to have norm L1=1
func (*SampleNormalizerL1) Fit ¶
func (t *SampleNormalizerL1) Fit(_ []float64)
Fit does nothing; it is kept only to satisfy the common interface.
func (*SampleNormalizerL1) Transform ¶
func (t *SampleNormalizerL1) Transform(vs []float64) []float64
Transform returns L1 normalized vector
func (*SampleNormalizerL1) TransformInplace ¶
func (t *SampleNormalizerL1) TransformInplace(dest []float64, vs []float64)
TransformInplace returns L1 normalized vector, inplace
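A minimal sketch of L1 normalization as described above (illustrative, not the package's code): divide each component by the sum of absolute values.

```go
package main

import (
	"fmt"
	"math"
)

// normalizeL1 sketches SampleNormalizerL1.Transform: divide each element
// by the sum of absolute values so the result has L1 norm 1.
func normalizeL1(vs []float64) []float64 {
	sum := 0.0
	for _, v := range vs {
		sum += math.Abs(v)
	}
	out := make([]float64, len(vs))
	if sum == 0 {
		return out // all-zero input stays all-zero
	}
	for i, v := range vs {
		out[i] = v / sum
	}
	return out
}

func main() {
	fmt.Println(normalizeL1([]float64{1, 3})) // → [0.25 0.75]
}
```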
type SampleNormalizerL2 ¶
type SampleNormalizerL2 struct{}
SampleNormalizerL2 transforms features for single sample to have norm L2=1
func (*SampleNormalizerL2) Fit ¶
func (t *SampleNormalizerL2) Fit(_ []float64)
Fit does nothing; it is kept only to satisfy the common interface.
func (*SampleNormalizerL2) Transform ¶
func (t *SampleNormalizerL2) Transform(vs []float64) []float64
Transform returns L2 normalized vector
func (*SampleNormalizerL2) TransformInplace ¶
func (t *SampleNormalizerL2) TransformInplace(dest []float64, vs []float64)
TransformInplace returns L2 normalized vector, inplace
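The L2 counterpart of the sketch for SampleNormalizerL1, again illustrative rather than the package's code: divide each component by the Euclidean norm.

```go
package main

import (
	"fmt"
	"math"
)

// normalizeL2 sketches SampleNormalizerL2.Transform: divide each element
// by the Euclidean norm so the result has L2 norm 1.
func normalizeL2(vs []float64) []float64 {
	var ss float64
	for _, v := range vs {
		ss += v * v
	}
	norm := math.Sqrt(ss)
	out := make([]float64, len(vs))
	if norm == 0 {
		return out // all-zero input stays all-zero
	}
	for i, v := range vs {
		out[i] = v / norm
	}
	return out
}

func main() {
	fmt.Println(normalizeL2([]float64{3, 4})) // → [0.6 0.8]
}
```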
type StandardScaler ¶
StandardScaler transforms feature into normal standard distribution.
func (*StandardScaler) Fit ¶
func (t *StandardScaler) Fit(vals []float64)
Fit computes mean and standard deviation
func (*StandardScaler) Transform ¶
func (t *StandardScaler) Transform(v float64) float64
Transform centralizes and scales based on standard deviation and mean
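The standardization formula is (v - mean) / std; a minimal illustrative sketch (the zero guard is a choice made for this sketch):

```go
package main

import "fmt"

// standardTransform sketches StandardScaler.Transform: subtract the mean
// and divide by the standard deviation, both computed during Fit.
func standardTransform(v, mean, std float64) float64 {
	if std == 0 {
		return 0 // guard against a constant feature
	}
	return (v - mean) / std
}

func main() {
	fmt.Println(standardTransform(12, 10, 2)) // → 1
}
```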
type StructTransformer ¶
type StructTransformer struct {
Transformers []interface{}
}
StructTransformer uses reflection to encode struct into feature vector. It uses struct tags to create feature transformers for each field. Since it is using reflection, there is a slight overhead for large structs, which can be seen in benchmarks. For better performance, use codegen version for your struct, refer to README of this repo.
func (*StructTransformer) Fit ¶
func (s *StructTransformer) Fit(_ []interface{})
Fit will fit all field transformers
func (*StructTransformer) Transform ¶
func (s *StructTransformer) Transform(v interface{}) []float64
Transform applies all field transformers
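To make the reflection step concrete, here is a hedged sketch of reading struct tags the way StructTransformer's approach implies. The struct, the tag key "feature", and the tag values are all hypothetical, chosen only for illustration; they are not this package's actual tag format.

```go
package main

import (
	"fmt"
	"reflect"
)

// Employee is a hypothetical struct; the tag key "feature" is an
// illustrative stand-in for whatever tag the package actually reads.
type Employee struct {
	Age   int    `feature:"minmax"`
	Grade string `feature:"onehot"`
}

// listFeatureTags sketches the reflection step StructTransformer relies
// on: iterate struct fields and read their tags to decide which
// transformer applies to each field.
func listFeatureTags(v interface{}) []string {
	t := reflect.TypeOf(v)
	tags := make([]string, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		tags = append(tags, t.Field(i).Tag.Get("feature"))
	}
	return tags
}

func main() {
	fmt.Println(listFeatureTags(Employee{})) // → [minmax onehot]
}
```

This per-call reflection walk is what causes the overhead the doc mentions, and what the codegen version avoids by emitting the field accesses directly.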
type TFIDFVectorizer ¶
type TFIDFVectorizer struct {
	CountVectorizer
	DocCount     []uint // number of documents in which the i-th word from CountVectorizer appeared
	NumDocuments int
	Normalizer   SampleNormalizerL2
}
TFIDFVectorizer performs tf-idf vectorization on top of count vectorization. Based on: https://scikit-learn.org/stable/modules/feature_extraction.html Using non-smooth version, adding 1 to log instead of denominator in idf.
DocCount should have length equal to len(CountVectorizer.Mapping). It is the caller's responsibility to ensure this.
func (*TFIDFVectorizer) FeatureNames ¶
func (t *TFIDFVectorizer) FeatureNames() []string
FeatureNames returns slice with produced feature names.
func (*TFIDFVectorizer) Fit ¶
func (t *TFIDFVectorizer) Fit(vals []string)
Fit fits CountVectorizer and extra information for tf-idf computation
func (*TFIDFVectorizer) NumFeatures ¶
func (t *TFIDFVectorizer) NumFeatures() int
NumFeatures returns number of features for single field
func (*TFIDFVectorizer) Transform ¶
func (t *TFIDFVectorizer) Transform(v string) []float64
Transform performs tf-idf computation
func (*TFIDFVectorizer) TransformInplace ¶
func (t *TFIDFVectorizer) TransformInplace(dest []float64, v string)
TransformInplace performs tf-idf computation, inplace. It is responsibility of caller to zero-out destination.
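Putting the pieces above together, the per-term weight is tf * (log(N/df) + 1), the non-smooth idf the doc describes, followed by L2 normalization. A sketch with illustrative names, not the package's API:

```go
package main

import (
	"fmt"
	"math"
)

// tfidfTransform sketches the computation TFIDFVectorizer describes:
// weight each term count by idf = log(N/df) + 1, then L2-normalize.
// counts is the CountVectorizer output, docCount the per-word document
// frequencies, numDocs the number of fitted documents.
func tfidfTransform(counts []float64, docCount []uint, numDocs int) []float64 {
	out := make([]float64, len(counts))
	var ss float64
	for i, tf := range counts {
		if docCount[i] == 0 {
			continue // word never seen during Fit
		}
		out[i] = tf * (math.Log(float64(numDocs)/float64(docCount[i])) + 1)
		ss += out[i] * out[i]
	}
	if norm := math.Sqrt(ss); norm > 0 {
		for i := range out {
			out[i] /= norm
		}
	}
	return out
}

func main() {
	fmt.Println(tfidfTransform([]float64{2, 1, 0}, []uint{1, 2, 0}, 2))
}
```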
Source Files ¶
Directories ¶
Path | Synopsis
---|---
| Package preprocessing includes scaling, centering, normalization, binarization and imputation methods.