encoder

v0.1.2 Latest
Warning

This package is not in the latest version of its module.

Published: May 23, 2021 License: Apache-2.0 Imports: 11 Imported by: 0

README

encoder package

PRs welcome!

Categorical variable encoders

  • Ordinal
  • OneHot
  • Frequency
  • RollingFrequency
  • James-Stein (target encoder)

TODO

  • WOE
  • LOO
  • Catboost

Serializations

  • Protobuf
  • Flatbuf
  • Parquet
  • XML

open to other suggestions ...

Documentation

Index

Constants

This section is empty.

Variables

var (
	ErrBounds       = errors.New("index out of bounds")
	ErrLength       = errors.New("code length does not match encoder length")
	ErrTargetLength = errors.New("target data is not same length as categorical data")
)

Functions

This section is empty.

Types

type CatBoost

type CatBoost struct {
}

CatBoost ...

func NewCatBoost

func NewCatBoost() *CatBoost

NewCatBoost ...

func (*CatBoost) Decode

func (e *CatBoost) Decode(code int) (string, error)

Decode ...

func (*CatBoost) Encode

func (e *CatBoost) Encode(s string) int

Encode ...

type Frequency

type Frequency struct {
	// contains filtered or unexported fields
}

Frequency is a one-way encoder. Frequency codes cannot be decoded because distinct values may share the same numerical code.

func NewFrequency

func NewFrequency(values []string) *Frequency

NewFrequency will return a frequency encoder with the given values encoded.

func (*Frequency) Get

func (e *Frequency) Get(s string) (int, bool)

Get ...
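
To illustrate why a frequency encoding cannot be reversed, here is a minimal self-contained sketch. It assumes the code for a category is simply the number of times it occurs in the input, which the documentation above does not confirm; freqEncoder, newFreqEncoder, and get are hypothetical names, not the package's API.

```go
package main

import "fmt"

// freqEncoder is a hypothetical stand-in for a frequency encoder: it
// maps each category to the number of times it occurs in the input.
type freqEncoder struct {
	counts map[string]int
}

// newFreqEncoder counts every observation in values.
func newFreqEncoder(values []string) *freqEncoder {
	counts := make(map[string]int, len(values))
	for _, v := range values {
		counts[v]++
	}
	return &freqEncoder{counts: counts}
}

// get returns the count for s and whether s was seen at all.
func (e *freqEncoder) get(s string) (int, bool) {
	c, ok := e.counts[s]
	return c, ok
}

func main() {
	e := newFreqEncoder([]string{"red", "blue", "red", "green", "blue"})
	for _, s := range []string{"red", "blue", "green", "purple"} {
		code, ok := e.get(s)
		fmt.Println(s, code, ok)
	}
}
```

In this input "red" and "blue" receive the same code, so a code alone cannot identify the original string.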

type JamesSteinClassification

type JamesSteinClassification struct {
	// contains filtered or unexported fields
}

JamesSteinClassification is a one-way, target-based encoder. Its codes cannot be decoded because distinct values may share the same numerical code.

func NewJamesSteinClassification

func NewJamesSteinClassification(values []string, target []string) (*JamesSteinClassification, error)

NewJamesSteinClassification will create a JamesSteinClassification encoder.

func (*JamesSteinClassification) Codes

Codes will return the slice of codes for all of the values used in the construction of the JamesSteinClassification encoder.

func (*JamesSteinClassification) Get

func (e *JamesSteinClassification) Get(index int) (float64, error)

Get will retrieve the code for the categorical value at the given index.

type JamesSteinRegression

type JamesSteinRegression struct {
	// contains filtered or unexported fields
}

JamesSteinRegression is a one-way, target-based encoder. Its codes cannot be decoded because distinct values may share the same numerical code.

func NewJamesSteinRegression

func NewJamesSteinRegression(values []string, target []float64) (*JamesSteinRegression, error)

NewJamesSteinRegression will create a JamesSteinRegression encoder.

func (*JamesSteinRegression) Get

func (e *JamesSteinRegression) Get(s string) (float64, bool)

Get will retrieve the code for the given categorical value.
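
The documentation does not say which James-Stein estimator the package uses. As a rough sketch of the general idea, the code below shrinks each category's target mean toward the global mean with weight B = var_category / (var_category + var_global). That formula is only one common formulation and is an assumption here, as are the names meanVar and jamesSteinCodes; the package's actual results may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// meanVar returns the mean and population variance of xs.
func meanVar(xs []float64) (mean, variance float64) {
	if len(xs) == 0 {
		return 0, 0
	}
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	for _, x := range xs {
		d := x - mean
		variance += d * d
	}
	variance /= float64(len(xs))
	return mean, variance
}

// jamesSteinCodes is a hypothetical helper. For each category it
// computes (1-B)*categoryMean + B*globalMean, where
// B = categoryVar / (categoryVar + globalVar). This shrinkage weight
// is an assumption, not the package's documented formula.
func jamesSteinCodes(values []string, target []float64) (map[string]float64, error) {
	if len(values) != len(target) {
		return nil, errors.New("target data is not same length as categorical data")
	}
	groups := make(map[string][]float64)
	for i, v := range values {
		groups[v] = append(groups[v], target[i])
	}
	globalMean, globalVar := meanVar(target)
	codes := make(map[string]float64, len(groups))
	for cat, ys := range groups {
		m, v := meanVar(ys)
		b := 0.0
		if v+globalVar > 0 { // avoid 0/0 when every target is identical
			b = v / (v + globalVar)
		}
		codes[cat] = (1-b)*m + b*globalMean
	}
	return codes, nil
}

func main() {
	values := []string{"a", "a", "b", "b", "b"}
	target := []float64{1, 3, 10, 10, 10}
	codes, err := jamesSteinCodes(values, target)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("a:", codes["a"], "b:", codes["b"])
}
```

A category whose targets never vary (like "b" here) gets B = 0 and keeps its own mean, while a noisy category is pulled toward the global mean.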

type LeaveOneOut

type LeaveOneOut struct {
}

LeaveOneOut ...

func NewLeaveOneOut

func NewLeaveOneOut() *LeaveOneOut

func (*LeaveOneOut) Decode

func (e *LeaveOneOut) Decode(code int) (string, error)

Decode ...

func (*LeaveOneOut) Encode

func (e *LeaveOneOut) Encode(s string) int

Encode ...

type OneHot

type OneHot struct {
	// contains filtered or unexported fields
}

OneHot will encode string values into a unique one-hot vector (binary vector with a single 1). The empty string is ALWAYS the 0-vector. It will also allow for string values to be decoded.

func NewOneHot

func NewOneHot() *OneHot

NewOneHot will return a one-hot encoder that will set the empty string as the first dimension of every one-hot binary codeword. "Binary" here means that every value in the codeword (integer slice) will be either a 0 or a 1.

func (*OneHot) Contains

func (e *OneHot) Contains(s string) bool

Contains reports whether a string has been assigned a one-hot code.

func (*OneHot) ContainsCode

func (e *OneHot) ContainsCode(code []uint8) bool

ContainsCode reports whether a codeword is valid.

func (*OneHot) Decode

func (e *OneHot) Decode(code []uint8) (string, error)

Decode will return the string for the given binary codeword (one-hot code). If the codeword argument is longer than the encoder's codewords, then an `ErrLength` error will be returned.

func (*OneHot) Dimension

func (e *OneHot) Dimension() int

Dimension returns the current dimension of each one-hot codeword. The dimension increases with every new string that gets encoded.

func (*OneHot) Encode

func (e *OneHot) Encode(s string) []uint8

Encode will return the integer slice that represents the binary encoding of the given string argument. If the string argument does not already have a code it will generate a new codeword for the given string argument and add it to the encoder.

func (*OneHot) MarshalCSV

func (e *OneHot) MarshalCSV() ([]byte, error)

MarshalCSV ...

func (*OneHot) MarshalJSON

func (e *OneHot) MarshalJSON() ([]byte, error)

MarshalJSON ...

func (*OneHot) UnmarshalCSV

func (e *OneHot) UnmarshalCSV(data []byte) error

UnmarshalCSV ...

func (*OneHot) UnmarshalJSON

func (e *OneHot) UnmarshalJSON(data []byte) error

UnmarshalJSON ...

type Ordinal

type Ordinal struct {
	*sync.RWMutex
	// contains filtered or unexported fields
}

Ordinal will encode string values into a unique integer value. The empty string is ALWAYS the 0 value. It will also allow for string values to be decoded.

func NewOrdinal

func NewOrdinal(init bool) *Ordinal

NewOrdinal will create a new ordinal encoder. If the `init` boolean is specified as true, then the encoder will initialize with the empty string `""` encoded as the `0` value.

func (*Ordinal) Contains

func (e *Ordinal) Contains(s string) bool

Contains reports whether a string has been assigned an ordinal code.

func (*Ordinal) ContainsCode

func (e *Ordinal) ContainsCode(code int) bool

ContainsCode ...

func (*Ordinal) Decode

func (e *Ordinal) Decode(i uint64) string

Decode will return an empty string if the supplied integer argument is not a valid code.

func (*Ordinal) DecodeSlice

func (e *Ordinal) DecodeSlice(s sam.SliceInt) sam.SliceString

DecodeSlice will decode all the values in the slice of integers provided as an argument. If an integer is not a valid code, it will be decoded as the empty string.

func (*Ordinal) Encode

func (e *Ordinal) Encode(s string) uint64

Encode ...

func (*Ordinal) EncodeBytes added in v0.1.0

func (e *Ordinal) EncodeBytes(b []byte) uint64

EncodeBytes --

func (*Ordinal) EncodeSlice

func (e *Ordinal) EncodeSlice(s sam.SliceString) []uint64

EncodeSlice will encode all the values in the slice of strings provided as an argument.
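
The encode, decode, and slice behavior described above can be sketched with a mutex-guarded map. This is a hedged illustration rather than the package's code: ordinal and its methods are hypothetical names, plain []string replaces sam.SliceString, and the sketch assumes codes are assigned in first-seen order.

```go
package main

import (
	"fmt"
	"sync"
)

// ordinal is a hypothetical stand-in for a thread-safe ordinal encoder.
type ordinal struct {
	mu    sync.RWMutex
	codes map[string]uint64 // string -> code
	names []string          // code -> string
}

// newOrdinal optionally reserves code 0 for the empty string.
func newOrdinal(init bool) *ordinal {
	e := &ordinal{codes: make(map[string]uint64)}
	if init {
		e.codes[""] = 0
		e.names = append(e.names, "")
	}
	return e
}

// encode returns the existing code for s or assigns the next free one.
func (e *ordinal) encode(s string) uint64 {
	e.mu.RLock()
	code, ok := e.codes[s]
	e.mu.RUnlock()
	if ok {
		return code
	}
	e.mu.Lock()
	defer e.mu.Unlock()
	// Re-check: another goroutine may have added s between the locks.
	if code, ok := e.codes[s]; ok {
		return code
	}
	code = uint64(len(e.names))
	e.codes[s] = code
	e.names = append(e.names, s)
	return code
}

// decode returns "" when i is not a valid code.
func (e *ordinal) decode(i uint64) string {
	e.mu.RLock()
	defer e.mu.RUnlock()
	if i >= uint64(len(e.names)) {
		return ""
	}
	return e.names[i]
}

// encodeSlice encodes every element of ss in order.
func (e *ordinal) encodeSlice(ss []string) []uint64 {
	out := make([]uint64, len(ss))
	for i, s := range ss {
		out[i] = e.encode(s)
	}
	return out
}

func main() {
	e := newOrdinal(true)
	fmt.Println(e.encodeSlice([]string{"a", "b", "a", ""}))
	fmt.Println(e.decode(2), e.decode(99) == "")
}
```

One consequence of returning "" for unknown codes is that a caller cannot tell an invalid code from the code of the empty string without a separate membership check such as ContainsCode.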

func (*Ordinal) EncodeStringer added in v0.1.0

func (e *Ordinal) EncodeStringer(s fmt.Stringer) uint64

EncodeStringer --

func (*Ordinal) GobDecode added in v0.1.0

func (e *Ordinal) GobDecode(data []byte) error

GobDecode ...

func (*Ordinal) GobEncode added in v0.1.0

func (e *Ordinal) GobEncode() ([]byte, error)

GobEncode ...

func (*Ordinal) Length

func (e *Ordinal) Length() int

Length ...

func (*Ordinal) List

func (e *Ordinal) List() sam.SliceString

List ...

func (*Ordinal) MarshalCSV

func (e *Ordinal) MarshalCSV() ([]byte, error)

MarshalCSV ...

func (*Ordinal) MarshalJSON

func (e *Ordinal) MarshalJSON() ([]byte, error)

MarshalJSON ...

func (*Ordinal) UnmarshalCSV

func (e *Ordinal) UnmarshalCSV(data []byte) error

UnmarshalCSV ...

func (*Ordinal) UnmarshalJSON

func (e *Ordinal) UnmarshalJSON(data []byte) error

UnmarshalJSON ...

type RollingFrequency

type RollingFrequency struct {
	// contains filtered or unexported fields
}

RollingFrequency is a one-way encoder. RollingFrequency codes cannot be decoded because distinct values may share the same numerical code.

func NewRollingFrequency

func NewRollingFrequency(window int, values []string) *RollingFrequency

NewRollingFrequency will create a codeword for every value in the list of values, in the order of those values. The list supplied to this function should not be deduplicated: it should contain every individual observation found in the dataset/sample.

func (*RollingFrequency) Codes

func (e *RollingFrequency) Codes() sam.SliceInt

Codes will return the list of codes generated for the list of values provided in the creation of the RollingFrequency encoder.

func (*RollingFrequency) Get

func (e *RollingFrequency) Get(index int) (int, error)

Get will return the code for the given index, according to the original slice of values provided in the construction of the RollingFrequency encoder.

func (*RollingFrequency) Window

func (e *RollingFrequency) Window() int

Window will return the window used when creating the RollingFrequency encoder.
