base

package
v0.0.0-...-00e0c84
Published: Jul 15, 2022 License: MIT Imports: 7 Imported by: 45

README

Base Package

import "github.com/cdipaolo/goml/base"

This package defines common patterns (interfaces) and lets you work with data: get it into your programs and munge through it.

This package also implements optimization algorithms, which a user's own models can use by implementing easy-to-use interfaces.

functions for working with data

Documentation

Overview

Package base declares models, interfaces, and methods to be used when working with the rest of the goml library. It also includes common functions, used both by the rest of the library and for the user's convenience, for working with data, persisting it to files, and optimizing functions.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func EuclideanDistance

func EuclideanDistance(u []float64, v []float64) float64

EuclideanDistance returns the distance between two float64 vectors. NOTE that this function does not check that the vectors are the same length (to improve computation speed in, say, KNN). Make sure you pass in same-length vectors.
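Because the length check is left to the caller, a thin wrapper can add it where speed is not critical. The following is a minimal sketch, not the library's source: `euclidean` and `checkedEuclidean` are hypothetical names that re-implement the documented behaviour.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// euclidean is a hypothetical stand-in for the distance described above.
// Like the documented function, it assumes len(u) == len(v).
func euclidean(u, v []float64) float64 {
	var sum float64
	for i := range u {
		d := u[i] - v[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// checkedEuclidean adds the length check that the caller is responsible for.
func checkedEuclidean(u, v []float64) (float64, error) {
	if len(u) != len(v) {
		return 0, errors.New("vectors must have the same length")
	}
	return euclidean(u, v), nil
}

func main() {
	d, err := checkedEuclidean([]float64{0, 0}, []float64{3, 4})
	fmt.Println(d, err) // 5 <nil>
}
```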

func GaussianKernel

func GaussianKernel(sigma float64) func([]float64, []float64) float64

GaussianKernel takes in a parameter for sigma (σ) and returns a valid (Gaussian) Radial Basis Function Kernel. If the input dimensions aren't valid, the kernel will return 0.0 (as if the vectors are orthogonal).

K(x, x`) = exp(-|x - x`|^2 / (2σ^2))

https://en.wikipedia.org/wiki/Radial_basis_function_kernel

This can be used within any models that can use Kernels.

Sigma (σ) will default to 1 if given 0.0
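The formula and the two documented defaults can be sketched as follows. This is a hypothetical re-implementation for illustration (`gaussianKernel` is not the library function), written only from the description above.

```go
package main

import (
	"fmt"
	"math"
)

// gaussianKernel is a hypothetical re-implementation of the documented formula.
func gaussianKernel(sigma float64) func([]float64, []float64) float64 {
	if sigma == 0 {
		sigma = 1 // documented default when 0.0 is given
	}
	denom := 2 * sigma * sigma
	return func(x, y []float64) float64 {
		if len(x) != len(y) {
			return 0 // documented result for mismatched dimensions
		}
		var sq float64
		for i := range x {
			d := x[i] - y[i]
			sq += d * d
		}
		return math.Exp(-sq / denom)
	}
}

func main() {
	k := gaussianKernel(0)
	fmt.Println(k([]float64{1, 2}, []float64{1, 2})) // 1
	fmt.Println(k([]float64{1, 2}, []float64{1}))    // 0
}
```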

func GradientAscent

func GradientAscent(d Ascendable) error

GradientAscent operates on an Ascendable model and further optimizes the parameter vector Theta of the model, which is then used within the Predict function.

Gradient ascent applies the following update: θ[j] := θ[j] + α·∂J(θ)/∂θ[j]

where J(θ) is the cost function, α is the learning rate, and θ[j] is the j-th value in the parameter vector
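To make the update rule concrete, here is a minimal sketch under stated assumptions. The `ascendable` interface copies the method set of the Ascendable interface documented in the Types section; `parabola` and `ascend` are hypothetical, and the loop runs a fixed number of iterations rather than reproducing whatever convergence check the library performs.

```go
package main

import "fmt"

// ascendable mirrors the method set of the documented Ascendable interface.
type ascendable interface {
	LearningRate() float64
	Dj(int) (float64, error)
	Theta() []float64
	MaxIterations() int
}

// parabola is a hypothetical model with J(θ) = -(θ[0] - 3)^2, maximised at θ[0] = 3.
type parabola struct{ theta []float64 }

func (p *parabola) LearningRate() float64 { return 0.1 }
func (p *parabola) MaxIterations() int    { return 200 }
func (p *parabola) Theta() []float64      { return p.theta }
func (p *parabola) Dj(j int) (float64, error) {
	return -2 * (p.theta[j] - 3), nil
}

// ascend applies θ[j] := θ[j] + α·∂J/∂θ[j] for MaxIterations steps.
func ascend(m ascendable) error {
	theta := m.Theta()
	grad := make([]float64, len(theta))
	for it := 0; it < m.MaxIterations(); it++ {
		// Compute the whole gradient first so every θ[j] is updated
		// from the same parameter vector.
		for j := range theta {
			g, err := m.Dj(j)
			if err != nil {
				return err
			}
			grad[j] = g
		}
		for j := range theta {
			theta[j] += m.LearningRate() * grad[j]
		}
	}
	return nil
}

func main() {
	m := &parabola{theta: []float64{0}}
	if err := ascend(m); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%.4f\n", m.theta[0]) // 3.0000
}
```

The sketch relies on Theta returning a slice that shares its backing array with the model, so in-place updates are visible to Dj on the next iteration.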

func LinearKernel

func LinearKernel() func([]float64, []float64) float64

LinearKernel is the base kernel function. It will return a valid kernel for use within models that can use the Kernel Trick. The resultant kernel just takes the dot/inner product of its argument vectors.

K(x, x`) = x*x`

This is also a member of the Homogeneous Polynomial kernel family (where the degree is 1 in this case): https://en.wikipedia.org/wiki/Homogeneous_polynomial

Using this kernel is effectively the same as not using a kernel at all (for SVM and Kernel perceptron, at least.)

func LoadDataFromCSV

func LoadDataFromCSV(filepath string) ([][]float64, []float64, error)

LoadDataFromCSV takes in a path to a CSV file and loads that data into a Golang 2D array of 'X' values and a Golang 1D array of 'Y', or expected result, values.

An error is returned if there are any problems.

Expected Data Format:

  • There should be no header/text lines.
  • The 'Y' (expected value) should be the last column of each line of the CSV.

Example CSV file with 2 input parameters:

>>>>>>> BEGIN FILE
1.06,2.30,17
17.62,12.06,18.92
11.623,1.1,15.093
12.01,6,15.032
...
>>>>>>> END FILE

func LoadDataFromCSVToStream

func LoadDataFromCSVToStream(filepath string, data chan Datapoint, errors chan error)

LoadDataFromCSVToStream loads a CSV data file just like LoadDataFromCSV, but it pushes each row into a data channel as it scans. This is useful for very large CSV files, where you would want to learn (using the online model methods) as you read the data so as to minimize memory usage.

The errors channel will be passed any errors.

When the function returns, either in the case of an error, or at the end of reading, both the data stream channel and the errors channel will be closed.
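Since both channels are closed when the loader returns, a consumer can select on both until each reports closure. The sketch below illustrates that pattern with a hypothetical `produce` function standing in for the loader (it stops at the first bad row, which is an assumption about the error path) and a local `datapoint` type that mirrors the documented Datapoint struct.

```go
package main

import "fmt"

// datapoint mirrors the fields of the documented Datapoint struct.
type datapoint struct {
	X []float64
	Y []float64
}

// produce is a hypothetical stand-in for the loader: it sends rows on data,
// reports a problem on errs, and closes both channels when it returns.
func produce(rows []datapoint, data chan datapoint, errs chan error) {
	defer close(data)
	defer close(errs)
	for _, r := range rows {
		if len(r.X) == 0 {
			errs <- fmt.Errorf("row has no inputs")
			return
		}
		data <- r
	}
}

// drain reads both channels until each is closed, counting points and errors.
func drain(data chan datapoint, errs chan error) (points, failures int) {
	for data != nil || errs != nil {
		select {
		case _, ok := <-data:
			if !ok {
				data = nil // closed: a nil channel is never selected again
				continue
			}
			points++
		case _, ok := <-errs:
			if !ok {
				errs = nil
				continue
			}
			failures++
		}
	}
	return points, failures
}

func main() {
	rows := []datapoint{
		{X: []float64{1.06, 2.30}, Y: []float64{17}},
		{}, // malformed row
		{X: []float64{17.62, 12.06}, Y: []float64{18.92}},
	}
	data, errs := make(chan datapoint), make(chan error)
	go produce(rows, data, errs)
	fmt.Println(drain(data, errs)) // 1 1
}
```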

func ManhattanDistance

func ManhattanDistance(u []float64, v []float64) float64

ManhattanDistance returns the Manhattan distance between two float64 vectors. This is the sum of the absolute differences between each pair of corresponding values.

Example Points:

 .
 |
2|
 |______.
    2

Note that the Euclidean distance between these 2 points is 2*sqrt(2)=2.828. The Manhattan distance is 4.

NOTE that this function does not check that the vectors are the same length (to improve computation speed in, say, KNN). Make sure you pass in same-length vectors.
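The two example points can be checked numerically. In this sketch `manhattan` is a hypothetical re-implementation, and the points are taken to be (0, 2) and (2, 0) as read off the diagram.

```go
package main

import (
	"fmt"
	"math"
)

// manhattan is a hypothetical re-implementation: the sum of absolute
// coordinate differences. It assumes len(u) == len(v).
func manhattan(u, v []float64) float64 {
	var sum float64
	for i := range u {
		sum += math.Abs(u[i] - v[i])
	}
	return sum
}

func main() {
	a, b := []float64{0, 2}, []float64{2, 0}
	fmt.Println(manhattan(a, b))                           // 4
	fmt.Printf("%.3f\n", math.Hypot(a[0]-b[0], a[1]-b[1])) // 2.828
}
```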

func Normalize

func Normalize(x [][]float64)

Normalize takes in an array of arrays of inputs and normalizes each 'row' of data to unit vector length.

That is: x[i][j] := x[i][j] / |x[i]|
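As an illustration of the row-wise scaling, here is a minimal sketch. `normalizeRows` is hypothetical, and leaving all-zero rows untouched is an assumption made here to avoid dividing by zero; the library's handling of that case is not documented above.

```go
package main

import (
	"fmt"
	"math"
)

// normalizeRows scales each row in place to unit Euclidean length.
// Rows with zero norm are left unchanged (an assumption of this sketch).
func normalizeRows(x [][]float64) {
	for _, row := range x {
		var sq float64
		for _, v := range row {
			sq += v * v
		}
		norm := math.Sqrt(sq)
		if norm == 0 {
			continue
		}
		for j := range row {
			row[j] /= norm
		}
	}
}

func main() {
	x := [][]float64{{3, 4}, {0, 0}}
	normalizeRows(x)
	fmt.Println(x) // [[0.6 0.8] [0 0]]
}
```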

func NormalizePoint

func NormalizePoint(x []float64)

NormalizePoint is the same as Normalize, but it only operates on a single datapoint, normalizing its value to unit length.

func OnlyAsciiLetters

func OnlyAsciiLetters(r rune) bool

OnlyAsciiLetters is a transform function that will only let a-zA-Z through

func OnlyAsciiWords

func OnlyAsciiWords(r rune) bool

OnlyAsciiWords is a transform function that will only let a-zA-Z and spaces through

func OnlyAsciiWordsAndNumbers

func OnlyAsciiWordsAndNumbers(r rune) bool

OnlyAsciiWordsAndNumbers is a transform function that will only let 0-9a-zA-Z and spaces through

func OnlyLetters

func OnlyLetters(r rune) bool

OnlyLetters is a transform function that lets any unicode letter through

func OnlyWords

func OnlyWords(r rune) bool

OnlyWords is a transform function that lets any unicode letter through as well as spaces

func OnlyWordsAndNumbers

func OnlyWordsAndNumbers(r rune) bool

OnlyWordsAndNumbers is a transform function that lets any unicode letter or digit through as well as spaces
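All of the transform functions above share the signature `func(rune) bool`, so any of them can drive a rune filter. The sketch below shows one way such a predicate might be applied; how the library itself applies them is not shown here. `onlyWordsAndNumbers` and `keep` are hypothetical, and treating "spaces" as the single character ' ' is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// onlyWordsAndNumbers is a hypothetical predicate with the documented
// behaviour: Unicode letters, digits, and the space character pass.
func onlyWordsAndNumbers(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r) || r == ' '
}

// keep drops every rune for which allow returns false.
func keep(s string, allow func(rune) bool) string {
	return strings.Map(func(r rune) rune {
		if allow(r) {
			return r
		}
		return -1 // strings.Map removes runes mapped to a negative value
	}, s)
}

func main() {
	fmt.Println(keep("Héllo, wörld 42!", onlyWordsAndNumbers)) // Héllo wörld 42
}
```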

func PolynomialKernel

func PolynomialKernel(d int, constants ...float64) func([]float64, []float64) float64

PolynomialKernel takes the degree of the polynomial as its main argument, plus an optional constant (any extra arguments passed are summed to form the constant), and returns a valid kernel in the Polynomial Function Kernel family. This kernel can be used with all models that take kernels.

K(x, x`) = (x*x` + c)^d

https://en.wikipedia.org/wiki/Polynomial_kernel

Note that if no extra argument is passed (no constant) then the kernel is a Homogeneous Polynomial Kernel (as opposed to Inhomogeneous!) Also if there is no constant and d=1, then the returned kernel is the same (though less efficient) as just LinearKernel().

https://en.wikipedia.org/wiki/Homogeneous_polynomial

`d` will default to 1 if 0 is given.
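The homogeneous and inhomogeneous cases differ only in the constant, which a short sketch makes visible. `polynomialKernel` is a hypothetical re-implementation of the formula above; it assumes same-length vectors and does not reproduce any dimension checks the library may perform.

```go
package main

import (
	"fmt"
	"math"
)

// polynomialKernel is a hypothetical re-implementation of K(x, x`) = (x*x` + c)^d.
func polynomialKernel(d int, constants ...float64) func([]float64, []float64) float64 {
	if d == 0 {
		d = 1 // documented default
	}
	var c float64
	for _, v := range constants {
		c += v // extra arguments are summed into one constant
	}
	return func(x, y []float64) float64 {
		var dot float64
		for i := range x {
			dot += x[i] * y[i]
		}
		return math.Pow(dot+c, float64(d))
	}
}

func main() {
	x, y := []float64{1, 2}, []float64{3, 4}
	fmt.Println(polynomialKernel(1)(x, y))    // 11  (homogeneous, same as the dot product)
	fmt.Println(polynomialKernel(2, 1)(x, y)) // 144 (inhomogeneous: (11 + 1)^2)
}
```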

func SaveDataToCSV

func SaveDataToCSV(filepath string, x [][]float64, y []float64, highPrecision bool) error

SaveDataToCSV takes in an absolute filepath, as well as a 2D array of 'X' values and a 1D array of 'Y' (expected) values, combines them into the same format that LoadDataFromCSV reads, and saves that data to a file, returning any errors.

highPrecision is a boolean where if true the values will be stored with a 64 bit precision when converting the floats to strings. Otherwise (if it's false) it uses 32 bits.
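The effect of the 32-bit versus 64-bit choice on the written text can be seen with `strconv.FormatFloat`. This is only a sketch: `formatValue` is a hypothetical helper, and whether the library uses this exact format verb and precision is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatValue is a hypothetical helper that converts a float to a string
// at either 32-bit or 64-bit precision.
func formatValue(v float64, highPrecision bool) string {
	bits := 32
	if highPrecision {
		bits = 64
	}
	// With precision -1, FormatFloat emits the fewest digits that
	// round-trip at the requested bit size.
	return strconv.FormatFloat(v, 'g', -1, bits)
}

func main() {
	v := 1.0 / 3.0
	fmt.Println(formatValue(v, false)) // fewer digits
	fmt.Println(formatValue(v, true))  // more digits
}
```

The trade-off is file size against fidelity: values saved at 32 bits will generally not load back as exactly the same float64.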

func StochasticGradientAscent

func StochasticGradientAscent(d StochasticAscendable) error

StochasticGradientAscent operates on a StochasticAscendable model and further optimizes the parameter vector Theta of the model, which is then used within the Predict function. Stochastic gradient ascent updates the parameter vector after looking at each individual training example. It may therefore never converge to the exact optimum, and an individual step can even worsen the cost function, but it typically converges faster than batch gradient ascent (implemented as func GradientAscent(d Ascendable) error) because of that very difference.

Gradient ascent applies the following update: θ[j] := θ[j] + α·∂J(θ)/∂θ[j]

where J(θ) is the cost function, α is the learning rate, and θ[j] is the j-th value in the parameter vector

func TanhKernel

func TanhKernel(k float64, constants ...float64) func([]float64, []float64) float64

TanhKernel takes in a required Kappa modifier parameter (which defaults to 1.0 if 0.0 is given) and optional float64 args afterwards, which will be added together to create a constant term (the generally recommended use is to pass just one arg as the constant if you need it).

K(x, x`) = tanh(κx*x` + c)

https://en.wikipedia.org/wiki/Hyperbolic_function
https://en.wikipedia.org/wiki/Support_vector_machine#Nonlinear_classification

Note that c must be less than 0 (it defaults to -1.0 if the given value is >= 0) and κ must be greater than 0 for most cases, but not all, hence there is no enforced default for it.
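The two defaulting rules (κ of 0 becomes 1, a non-negative constant becomes -1) can be sketched like this. `tanhKernel` is a hypothetical re-implementation based only on the description above, and it assumes same-length vectors.

```go
package main

import (
	"fmt"
	"math"
)

// tanhKernel is a hypothetical re-implementation of K(x, x`) = tanh(κx*x` + c).
func tanhKernel(k float64, constants ...float64) func([]float64, []float64) float64 {
	if k == 0 {
		k = 1 // documented default for κ
	}
	var c float64
	for _, v := range constants {
		c += v
	}
	if c >= 0 {
		c = -1 // documented default when the constant is not negative
	}
	return func(x, y []float64) float64 {
		var dot float64
		for i := range x {
			dot += x[i] * y[i]
		}
		return math.Tanh(k*dot + c)
	}
}

func main() {
	x := []float64{1, 0}
	// κ defaults to 1 and c to -1, so this evaluates tanh(1*1 - 1) = tanh(0).
	fmt.Println(tanhKernel(0)(x, x)) // 0
}
```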

Types

type Ascendable

type Ascendable interface {
	// LearningRate returns the learning rate α
	// to be used in Gradient Descent as the
	// modifier term
	LearningRate() float64

	// Dj returns the derivative of the cost function
	// J(θ) with respect to the j-th parameter of
	// the hypothesis, θ[j]. Called as Dj(j)
	Dj(int) (float64, error)

	// Theta returns a pointer to the parameter vector
	// theta, which is 1D vector of floats
	Theta() []float64

	// MaxIterations returns the maximum number of
	// iterations to try using gradient ascent. Might
	// return after fewer if strong convergence is
	// detected, but it'll let the user set a cap.
	MaxIterations() int
}

Ascendable is an interface that can be used with batch gradient descent where the parameter vector theta is in one dimension only (so softmax regression would need its own model, for example).

type Datapoint

type Datapoint struct {
	X []float64 `json:"x"`
	Y []float64 `json:"y"`
}

Datapoint is used in some models where it is cleaner to pass data as a struct rather than as 1D and 2D arrays, as the Generalized Linear Models do, for example. X corresponds to the inputs and Y corresponds to the result of the hypothesis.

This is used with the Perceptron, for example, so data can be easily passed in channels while staying encapsulated well.

type DistanceMeasure

type DistanceMeasure func([]float64, []float64) float64

DistanceMeasure is any function that maps two vectors of float64s to a float64. It is used for vector distance calculations.

func LNorm

func LNorm(p int) DistanceMeasure

LNorm returns a DistanceMeasure of the l-p norm. L-p norms are a family that generalizes the Euclidean and Manhattan distances.

https://en.wikipedia.org/wiki/Norm_(mathematics)

(p = 1) -> Manhattan Distance
(p = 2) -> Euclidean Distance

NOTE that this function does not check that the vectors are the same length (to improve computation speed in, say, KNN). Make sure you pass in same-length vectors.
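The relationship between p and the two named distances can be checked with a small sketch. `lNorm` here is a hypothetical re-implementation returning a plain function value; it assumes p >= 1 and same-length vectors.

```go
package main

import (
	"fmt"
	"math"
)

// lNorm is a hypothetical re-implementation of the l-p distance:
// (sum_i |u[i] - v[i]|^p)^(1/p). It assumes p >= 1.
func lNorm(p int) func([]float64, []float64) float64 {
	return func(u, v []float64) float64 {
		var sum float64
		for i := range u {
			sum += math.Pow(math.Abs(u[i]-v[i]), float64(p))
		}
		return math.Pow(sum, 1/float64(p))
	}
}

func main() {
	a, b := []float64{0, 2}, []float64{2, 0}
	fmt.Println(lNorm(1)(a, b))          // 4 (Manhattan)
	fmt.Printf("%.3f\n", lNorm(2)(a, b)) // 2.828 (Euclidean)
}
```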

type Model

type Model interface {

	// The variadic argument in Predict is an
	// optional arg which (if true) tells the
	// function to first normalize the input to
	// vector unit length. Use (and only use) this
	// if you trained on normalized inputs.
	Predict([]float64, ...bool) ([]float64, error)

	// PersistToFile and RestoreFromFile both take
	// in paths (absolute paths!) to files and
	// persists the necessary data to the filepath
	// such that you can RestoreFromFile later and
	// have the same instance. Helpful when you want
	// to train a model, save it to a file, then
	// open it later for prediction
	PersistToFile(string) error
	RestoreFromFile(string) error
}

Model is an interface for models that train on a 2D array of data (called x) and an array (y) of solution data. A Model trains in a supervised manner. Predict takes in a vector of floats and returns a vector of floats as the response, along with an error if any.

type OnlineModel

type OnlineModel interface {
	Predict([]float64) ([]float64, error)

	// OnlineLearn has no outputs so you can run the data
	// within a separate goroutine! A channel of
	// errors is passed so you know when there's been
	// an error in learning, though learning will
	// just ignore the datapoint that caused the
	// error and continue on.
	//
	// Most times errors are caused when passed
	// datapoints are not of a consistent dimension.
	//
	// The function passed is a callback that is called
	// whenever the parameter vector theta is updated
	OnlineLearn(chan error, func([]float64))

	// UpdateStream updates the datastream channel
	// used in learning for the algorithm
	UpdateStream(chan Datapoint)

	// PersistToFile and RestoreFromFile both take
	// in paths (absolute paths!) to files and
	// persists the necessary data to the filepath
	// such that you can RestoreFromFile later and
	// have the same instance. Helpful when you want
	// to train a model, save it to a file, then
	// open it later for prediction
	PersistToFile(string) error
	RestoreFromFile(string) error
}

OnlineModel differs from Model in that learning can take place in a goroutine: the data is passed through a channel, and learning ends when the channel is closed.

type OnlineTextModel

type OnlineTextModel interface {
	// Predict takes a document and returns the
	// expected class found by the model
	Predict(string) uint8

	// OnlineLearn has no outputs so you can run the data
	// within a separate goroutine! A channel of
	// errors is passed so you know when there's been
	// an error in learning, though learning will
	// just ignore the datapoint that caused the
	// error and continue on.
	OnlineLearn(chan<- error)

	// UpdateStream updates the datastream channel
	// used in learning for the algorithm
	UpdateStream(chan TextDatapoint)

	// PersistToFile and RestoreFromFile both take
	// in paths (absolute paths!) to files and
	// persists the necessary data to the filepath
	// such that you can RestoreFromFile later and
	// have the same instance. Helpful when you want
	// to train a model, save it to a file, then
	// open it later for prediction
	PersistToFile(string) error
	RestoreFromFile(string) error
}

OnlineTextModel holds the interface for text classifiers. They have the regular learn & predict functions, but don't include an updating callback func in OnlineLearn because the parameter vector passed would very often be _huge_, and therefore would be a detriment to performance.

type OptimizationMethod

type OptimizationMethod string

OptimizationMethod defines a type enum which (using constants declared below) lets a user pass in an optimization method to use when creating a new model.

const (
	BatchGA      OptimizationMethod = "Batch Gradient Ascent"
	StochasticGA                    = "Stochastic Gradient Descent"
)

These constants declare the types of optimization methods you can use.

type StochasticAscendable

type StochasticAscendable interface {
	// LearningRate returns the learning rate α
	// to be used in Gradient Descent as the
	// modifier term
	LearningRate() float64

	// Examples returns the number of examples in the
	// training set the model is using
	Examples() int

	// Dij returns the derivative of the cost function
	// J(θ) with respect to the j-th parameter of
	// the hypothesis, θ[j], for the training example
	// x[i]. Called as Dij(i,j)
	Dij(int, int) (float64, error)

	// Theta returns a pointer to the parameter vector
	// theta, which is 1D vector of floats
	Theta() []float64

	// MaxIterations returns the maximum number of
	// iterations to try using gradient ascent. Might
	// return after fewer if strong convergence is
	// detected, but it'll let the user set a cap.
	MaxIterations() int
}

StochasticAscendable is an interface that can be used with stochastic gradient descent where the parameter vector theta is in one dimension only (so softmax regression would need its own model, for example).

type TextDatapoint

type TextDatapoint struct {
	X string `json:"x"`
	Y uint8  `json:"y"`
}

TextDatapoint is the data structure expected for text classification models. The passed types, therefore, are inherently different from the other structures. X is now a string (a document, usually a sentence or multiple sentences). Y is now a uint8 denoting the class, because you can't regress on text classification (at least not well/effectively).
