base

package

v0.0.0-...-00e0c84 | Published: Jul 15, 2022 | License: MIT | Imports: 7 | Imported by: 45


Repository

github.com/cdipaolo/goml


README ¶

Base Package

`import "github.com/cdipaolo/goml/base"`

This package defines common patterns (interfaces) and helps you work with data: getting it into your programs and munging through it.

This package also implements optimization algorithms, which a user's own models can use by implementing easy-to-use interfaces.

functions for working with data

func LoadDataFromCSV(filepath string) ([][]float64, []float64, error)
- takes a training set (in the format specified in the function's documentation) and returns a 2D slice of float64s holding the input features, as well as a 1D slice holding the expected result for each input.
func SaveDataToCSV(filepath string, x [][]float64, y []float64, highPrecision bool) error
- takes datasets you have in memory and saves them to disk. This can be useful if you edit data within a program and want to save a new version of it.

Documentation ¶

Overview ¶

Package base declares models, interfaces, and methods to be used when working with the rest of the goml library. It also includes common functions, used both by the rest of the library and for the user's convenience, for working with data, persisting it to files, and optimizing functions.

Index ¶

func EuclideanDistance(u []float64, v []float64) float64
func GaussianKernel(sigma float64) func([]float64, []float64) float64
func GradientAscent(d Ascendable) error
func LinearKernel() func([]float64, []float64) float64
func LoadDataFromCSV(filepath string) ([][]float64, []float64, error)
func LoadDataFromCSVToStream(filepath string, data chan Datapoint, errors chan error)
func ManhattanDistance(u []float64, v []float64) float64
func Normalize(x [][]float64)
func NormalizePoint(x []float64)
func OnlyAsciiLetters(r rune) bool
func OnlyAsciiWords(r rune) bool
func OnlyAsciiWordsAndNumbers(r rune) bool
func OnlyLetters(r rune) bool
func OnlyWords(r rune) bool
func OnlyWordsAndNumbers(r rune) bool
func PolynomialKernel(d int, constants ...float64) func([]float64, []float64) float64
func SaveDataToCSV(filepath string, x [][]float64, y []float64, highPrecision bool) error
func StochasticGradientAscent(d StochasticAscendable) error
func TanhKernel(k float64, constants ...float64) func([]float64, []float64) float64
type Ascendable
type Datapoint
type DistanceMeasure
- func LNorm(p int) DistanceMeasure
type Model
type OnlineModel
type OnlineTextModel
type OptimizationMethod
type StochasticAscendable
type TextDatapoint

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func EuclideanDistance ¶

func EuclideanDistance(u []float64, v []float64) float64

EuclideanDistance returns the distance between two float64 vectors. NOTE that this function does not check that the vectors are the same length (to improve computation speed in, say, KNN). Make sure you pass in same-length vectors.
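Because the length check is left to the caller, one option is to validate once up front and then call an unchecked distance function. The following is a minimal, self-contained sketch of that idea. The names `euclidean` and `checkedEuclidean` are hypothetical and are not part of this package.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// euclidean is a stand-in for an unchecked distance function: it assumes
// both vectors have the same length.
func euclidean(u, v []float64) float64 {
	var sum float64
	for i := range u {
		d := u[i] - v[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// checkedEuclidean is a hypothetical wrapper that validates the lengths
// before handing the vectors to the unchecked function.
func checkedEuclidean(u, v []float64) (float64, error) {
	if len(u) != len(v) {
		return 0, errors.New("vectors must have the same length")
	}
	return euclidean(u, v), nil
}

func main() {
	d, err := checkedEuclidean([]float64{0, 3}, []float64{4, 0})
	fmt.Println(d, err) // 5 <nil>
}
```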

func GaussianKernel ¶

func GaussianKernel(sigma float64) func([]float64, []float64) float64

GaussianKernel takes in a parameter for sigma (σ) and returns a valid (Gaussian) Radial Basis Function Kernel. If the input dimensions aren't valid, the kernel will return 0.0 (as if the vectors are orthogonal).

K(x, x`) = exp(-|x - x`|^2 / (2σ^2))

https://en.wikipedia.org/wiki/Radial_basis_function_kernel

This can be used within any models that can use Kernels.

Sigma (σ) defaults to 1 if 0.0 is given.
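The formula and the two fallback rules (σ of 0 becomes 1, mismatched dimensions yield 0.0) can be sketched as a closure. This is an illustrative reimplementation written from the description above, not the package's source. The name `gaussianKernel` is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// gaussianKernel returns K(x, y) = exp(-|x - y|^2 / (2σ^2)).
// A sigma of 0 falls back to 1, and vectors of different lengths yield 0.
func gaussianKernel(sigma float64) func([]float64, []float64) float64 {
	if sigma == 0 {
		sigma = 1
	}
	denom := 2 * sigma * sigma
	return func(x, y []float64) float64 {
		if len(x) != len(y) {
			return 0
		}
		var sq float64
		for i := range x {
			d := x[i] - y[i]
			sq += d * d
		}
		return math.Exp(-sq / denom)
	}
}

func main() {
	k := gaussianKernel(1)
	fmt.Println(k([]float64{1, 2}, []float64{1, 2})) // identical vectors: 1
	fmt.Println(k([]float64{1}, []float64{1, 2}))    // mismatched lengths: 0
}
```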

func GradientAscent ¶

func GradientAscent(d Ascendable) error

GradientAscent operates on an Ascendable model and further optimizes the parameter vector Theta of the model, which is then used within the Predict function.

Gradient Ascent follows the following algorithm: θ[j] := θ[j] + α·∂J(θ)/∂θ[j]

where J(θ) is the cost function, α is the learning rate, and θ[j] is the j-th value in the parameter vector
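To make the update rule concrete, here is a small sketch of a batch ascent loop driving a toy model. The interface `ascendable` copies the method set of Ascendable from this page. The `concave` model, the `ascend` function, and the fixed iteration count without a convergence check are assumptions of this sketch, not the library's implementation.

```go
package main

import "fmt"

// ascendable mirrors the method set of the Ascendable interface.
type ascendable interface {
	LearningRate() float64
	Dj(int) (float64, error)
	Theta() []float64
	MaxIterations() int
}

// concave is a hypothetical model whose objective
// J(θ) = -Σ (θ[j] - target[j])^2 peaks at θ = target.
type concave struct {
	theta, target []float64
}

func (c *concave) LearningRate() float64 { return 0.1 }
func (c *concave) MaxIterations() int    { return 200 }
func (c *concave) Theta() []float64      { return c.theta }
func (c *concave) Dj(j int) (float64, error) {
	if j < 0 || j >= len(c.theta) {
		return 0, fmt.Errorf("index %d out of range", j)
	}
	return -2 * (c.theta[j] - c.target[j]), nil
}

// ascend applies θ[j] := θ[j] + α·∂J/∂θ[j] for MaxIterations rounds.
// All partial derivatives are computed before any parameter changes.
func ascend(m ascendable) error {
	theta := m.Theta()
	grad := make([]float64, len(theta))
	for it := 0; it < m.MaxIterations(); it++ {
		for j := range theta {
			g, err := m.Dj(j)
			if err != nil {
				return err
			}
			grad[j] = g
		}
		for j := range theta {
			theta[j] += m.LearningRate() * grad[j]
		}
	}
	return nil
}

func main() {
	m := &concave{theta: []float64{0, 0}, target: []float64{3, -1}}
	if err := ascend(m); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%.3f %.3f\n", m.theta[0], m.theta[1])
}
```

The sketch relies on Theta returning a slice that shares storage with the model, so updates made inside `ascend` are visible to the model afterwards.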

func LinearKernel ¶

func LinearKernel() func([]float64, []float64) float64

LinearKernel is the base kernel function. It will return a valid kernel for use within models that can use the Kernel Trick. The resultant kernel just takes the dot/inner product of its argument vectors.

K(x, x`) = x*x`

This is also a member of the Homogeneous Polynomial kernel family (where the degree is 1 in this case): https://en.wikipedia.org/wiki/Homogeneous_polynomial

Using this kernel is effectively the same as not using a kernel at all (for SVM and Kernel perceptron, at least.)

func LoadDataFromCSV ¶

func LoadDataFromCSV(filepath string) ([][]float64, []float64, error)

LoadDataFromCSV takes in a path to a CSV file and loads that data into a 2D slice of 'X' values and a 1D slice of 'Y', or expected result, values.

Errors are returned if there are any problems.

Expected Data Format:

There should be no header/text lines.
The 'Y' (expected value) should be the last column of the CSV.

Example CSV file with 2 input parameters:

>>>>>>> BEGIN FILE
1.06,2.30,17
17.62,12.06,18.92
11.623,1.1,15.093
12.01,6,15.032
...
>>>>>>> END FILE

func LoadDataFromCSVToStream ¶

func LoadDataFromCSVToStream(filepath string, data chan Datapoint, errors chan error)

LoadDataFromCSVToStream loads a CSV data file just like LoadDataFromCSV, but it pushes each row into a data channel as it scans. This is useful for very large CSV files, where you would want to learn (using the online model methods) as you read the data so as to minimize memory usage.

The errors channel will be passed any errors.

When the function returns, either in the case of an error, or at the end of reading, both the data stream channel and the errors channel will be closed.
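Since both channels are closed when loading finishes, a consumer can keep reading until it has seen both closes. The sketch below shows one way to do that. The `produce` function is a hypothetical stand-in for the loader that follows the contract described above (report an error, then return and close both channels), and `datapoint` only mirrors the shape of Datapoint. Neither is the package's code.

```go
package main

import "fmt"

// datapoint mirrors the shape of the Datapoint struct in this package.
type datapoint struct {
	X []float64
	Y []float64
}

// produce sends rows on data, reports the first bad row on errs and stops,
// and closes both channels when it returns.
func produce(rows [][]float64, data chan datapoint, errs chan error) {
	defer close(data)
	defer close(errs)
	for _, r := range rows {
		if len(r) < 2 {
			errs <- fmt.Errorf("row too short: %v", r)
			return
		}
		data <- datapoint{X: r[:len(r)-1], Y: r[len(r)-1:]}
	}
}

// drain reads from both channels until each has been closed. A closed
// channel is set to nil so that select stops choosing it.
func drain(data chan datapoint, errs chan error) (points, failures int) {
	for data != nil || errs != nil {
		select {
		case _, ok := <-data:
			if !ok {
				data = nil
				continue
			}
			points++
		case _, ok := <-errs:
			if !ok {
				errs = nil
				continue
			}
			failures++
		}
	}
	return points, failures
}

func main() {
	data, errs := make(chan datapoint), make(chan error)
	go produce([][]float64{{1, 2, 3}, {4}, {5, 6, 7}}, data, errs)
	p, f := drain(data, errs)
	fmt.Println(p, f) // 1 1
}
```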

func ManhattanDistance ¶

func ManhattanDistance(u []float64, v []float64) float64

ManhattanDistance returns the Manhattan distance between two float64 vectors. This is the sum of the absolute differences between each pair of corresponding values.

Example Points:

 .
 |
2|
 |______.
    2

Note that the Euclidean distance between these 2 points is 2*sqrt(2) ≈ 2.828. The Manhattan distance is 4.

NOTE that this function does not check that the vectors are the same length (to improve computation speed in, say, KNN). Make sure you pass in same-length vectors.

func Normalize ¶

func Normalize(x [][]float64)

Normalize takes in a 2D slice of inputs and normalizes each 'row' of data to unit vector length. The function returns nothing, so the rows are modified in place.

That is: x[i][j] := x[i][j] / |x[i]|
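A minimal sketch of that per-row scaling follows. The name `normalizeRows` is hypothetical, and skipping all-zero rows to avoid dividing by zero is an assumption of the sketch. How the package treats that case is not stated here.

```go
package main

import (
	"fmt"
	"math"
)

// normalizeRows scales each row to unit Euclidean length in place.
// Rows of zero length are left untouched (an assumption of this sketch).
func normalizeRows(x [][]float64) {
	for _, row := range x {
		var sq float64
		for _, v := range row {
			sq += v * v
		}
		norm := math.Sqrt(sq)
		if norm == 0 {
			continue
		}
		for j := range row {
			row[j] /= norm
		}
	}
}

func main() {
	x := [][]float64{{3, 4}, {0, 0}}
	normalizeRows(x)
	fmt.Println(x) // [[0.6 0.8] [0 0]]
}
```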

func NormalizePoint ¶

func NormalizePoint(x []float64)

NormalizePoint is the same as Normalize, but it operates on a single datapoint, normalizing its value to unit length.

func OnlyAsciiLetters ¶

func OnlyAsciiLetters(r rune) bool

OnlyAsciiLetters is a transform function that will only let a-zA-Z through

func OnlyAsciiWords ¶

func OnlyAsciiWords(r rune) bool

OnlyAsciiWords is a transform function that will only let a-zA-Z, and spaces through

func OnlyAsciiWordsAndNumbers ¶

func OnlyAsciiWordsAndNumbers(r rune) bool

OnlyAsciiWordsAndNumbers is a transform function that will only let 0-9a-zA-Z, and spaces through

func OnlyLetters ¶

func OnlyLetters(r rune) bool

OnlyLetters is a transform function that lets any unicode letter through

func OnlyWords ¶

func OnlyWords(r rune) bool

OnlyWords is a transform function that lets any unicode letter through as well as spaces

func OnlyWordsAndNumbers ¶

func OnlyWordsAndNumbers(r rune) bool

OnlyWordsAndNumbers is a transform function that lets any unicode letter or digit through as well as spaces
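Rune predicates with the signature `func(rune) bool` can be plugged into standard string filters. The sketch below uses strings.Map with a hypothetical predicate, `wordsAndNumbers`, that returns true for runes to keep. Whether this package's own functions return true for runes to keep or for runes to drop is not stated on this page, so check that convention before reusing the pattern with them.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// wordsAndNumbers is a hypothetical predicate that accepts Unicode letters,
// digits, and spaces. In this sketch, true means "keep the rune".
func wordsAndNumbers(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r) || r == ' '
}

// keep returns s with every rune rejected by the predicate removed.
func keep(s string, allowed func(rune) bool) string {
	return strings.Map(func(r rune) rune {
		if allowed(r) {
			return r
		}
		return -1 // strings.Map drops runes mapped to a negative value
	}, s)
}

func main() {
	fmt.Println(keep("Héllo, wörld 42!", wordsAndNumbers)) // Héllo wörld 42
}
```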

func PolynomialKernel ¶

func PolynomialKernel(d int, constants ...float64) func([]float64, []float64) float64

PolynomialKernel takes the degree of the polynomial as its main argument, plus an optional constant (any extra arguments passed are added together to form the constant), and returns a valid kernel in the Polynomial Function Kernel family. This kernel can be used with all models that take kernels.

K(x, x`) = (x*x` + c)^d

https://en.wikipedia.org/wiki/Polynomial_kernel

Note that if no extra argument is passed (no constant), then the kernel is a Homogeneous Polynomial Kernel (as opposed to Inhomogeneous). Also, if there is no constant and d=1, then the returned kernel is the same as (though less efficient than) LinearKernel().

https://en.wikipedia.org/wiki/Homogeneous_polynomial

`d` will default to 1 if 0 is given.
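The rules above (summed constants, degree 0 treated as 1) can be sketched as follows. This is an illustrative reimplementation of the formula, not the package's source. The name `polynomialKernel` is hypothetical, and the sketch assumes both vectors have the same length.

```go
package main

import (
	"fmt"
	"math"
)

// polynomialKernel returns K(x, y) = (x·y + c)^d. Extra arguments are summed
// into c, and a degree of 0 falls back to 1.
func polynomialKernel(d int, constants ...float64) func([]float64, []float64) float64 {
	if d == 0 {
		d = 1
	}
	var c float64
	for _, v := range constants {
		c += v
	}
	return func(x, y []float64) float64 {
		var dot float64
		for i := range x {
			dot += x[i] * y[i]
		}
		return math.Pow(dot+c, float64(d))
	}
}

func main() {
	x, y := []float64{1, 2}, []float64{3, 4}
	fmt.Println(polynomialKernel(1)(x, y))    // homogeneous, d=1: dot product, 11
	fmt.Println(polynomialKernel(2, 1)(x, y)) // (11 + 1)^2 = 144
}
```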

func SaveDataToCSV ¶

func SaveDataToCSV(filepath string, x [][]float64, y []float64, highPrecision bool) error

SaveDataToCSV takes in an absolute filepath, as well as a 2D array of 'X' values and a 1D array of 'Y', or expected, values. It formats the data the same way LoadDataFromCSV expects it and saves that data to a file, returning any errors.

highPrecision is a boolean where if true the values will be stored with a 64 bit precision when converting the floats to strings. Otherwise (if it's false) it uses 32 bits.

func StochasticGradientAscent ¶

func StochasticGradientAscent(d StochasticAscendable) error

StochasticGradientAscent operates on a StochasticAscendable model and further optimizes the parameter vector Theta of the model, which is then used within the Predict function. Stochastic gradient descent updates the parameter vector after looking at each individual training example. This can result in never converging to the absolute minimum, and can even raise the cost function, but it will typically converge faster than batch gradient descent (implemented as func GradientAscent(d Ascendable) error) because of that very difference.

Gradient Ascent follows the following algorithm: θ[j] := θ[j] + α·∂J(θ)/∂θ[j], with the derivative evaluated on one training example at a time

where J(θ) is the cost function, α is the learning rate, and θ[j] is the j-th value in the parameter vector
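The per-example update can be sketched with a toy model whose optimum is the mean of its examples. The interface `stochasticAscendable` copies the method set of StochasticAscendable from this page. The `meanModel` type, the `stochasticAscend` loop, and the fixed pass count are assumptions of this sketch, not the library's implementation. With a constant learning rate the result hovers near the optimum instead of settling exactly on it, which matches the caveat above.

```go
package main

import "fmt"

// stochasticAscendable mirrors the StochasticAscendable method set.
type stochasticAscendable interface {
	LearningRate() float64
	Examples() int
	Dij(int, int) (float64, error)
	Theta() []float64
	MaxIterations() int
}

// meanModel is a hypothetical model with per-example objective
// J_i(θ) = -(θ[0] - xs[i])^2, so the summed objective peaks at the mean.
type meanModel struct {
	theta []float64
	xs    []float64
}

func (m *meanModel) LearningRate() float64 { return 0.05 }
func (m *meanModel) Examples() int         { return len(m.xs) }
func (m *meanModel) Theta() []float64      { return m.theta }
func (m *meanModel) MaxIterations() int    { return 500 }
func (m *meanModel) Dij(i, j int) (float64, error) {
	if i < 0 || i >= len(m.xs) || j != 0 {
		return 0, fmt.Errorf("index (%d, %d) out of range", i, j)
	}
	return -2 * (m.theta[0] - m.xs[i]), nil
}

// stochasticAscend updates theta after every single training example.
func stochasticAscend(m stochasticAscendable) error {
	theta := m.Theta()
	grad := make([]float64, len(theta))
	for it := 0; it < m.MaxIterations(); it++ {
		for i := 0; i < m.Examples(); i++ {
			for j := range theta {
				g, err := m.Dij(i, j)
				if err != nil {
					return err
				}
				grad[j] = g
			}
			for j := range theta {
				theta[j] += m.LearningRate() * grad[j]
			}
		}
	}
	return nil
}

func main() {
	m := &meanModel{theta: []float64{0}, xs: []float64{1, 2, 3}}
	if err := stochasticAscend(m); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%.2f\n", m.theta[0]) // close to the mean, 2, but not exactly 2
}
```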

func TanhKernel ¶

func TanhKernel(k float64, constants ...float64) func([]float64, []float64) float64

TanhKernel takes in a required Kappa modifier parameter (defaults to 1.0 if 0.0 is given) and optional float64 args afterwards, which will be added together to create a constant term (the generally recommended use is to pass just one arg as the constant if you need it).

K(x, x`) = tanh(κx*x` + c)

https://en.wikipedia.org/wiki/Hyperbolic_function https://en.wikipedia.org/wiki/Support_vector_machine#Nonlinear_classification

Note that c must be less than 0 (if c >= 0 is given, it defaults to -1.0), and κ must be greater than 0 (in most cases, but not all, hence no default).
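The parameter handling described above (κ of 0 becomes 1.0, a summed constant of 0 or more becomes -1.0) can be sketched as follows. This is an illustrative reimplementation of the formula, not the package's source. The name `tanhKernel` is hypothetical, and the sketch assumes both vectors have the same length.

```go
package main

import (
	"fmt"
	"math"
)

// tanhKernel returns K(x, y) = tanh(κ·(x·y) + c). Extra arguments are summed
// into c. A kappa of 0 falls back to 1, and c >= 0 falls back to -1.
func tanhKernel(k float64, constants ...float64) func([]float64, []float64) float64 {
	if k == 0 {
		k = 1
	}
	var c float64
	for _, v := range constants {
		c += v
	}
	if c >= 0 {
		c = -1
	}
	return func(x, y []float64) float64 {
		var dot float64
		for i := range x {
			dot += x[i] * y[i]
		}
		return math.Tanh(k*dot + c)
	}
}

func main() {
	// With κ = 1 and the default c = -1, a dot product of 1 gives tanh(0).
	fmt.Println(tanhKernel(1)([]float64{1, 0}, []float64{1, 0})) // 0
}
```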

Types ¶

type Ascendable ¶

type Ascendable interface {
	// LearningRate returns the learning rate α
	// to be used in Gradient Descent as the
	// modifier term
	LearningRate() float64

	// Dj returns the derivative of the cost function
	// J(θ) with respect to the j-th parameter of
	// the hypothesis, θ[j]. Called as Dj(j)
	Dj(int) (float64, error)

	// Theta returns a pointer to the parameter vector
	// theta, which is a 1D vector of floats
	Theta() []float64

	// MaxIterations returns the maximum number of
	// iterations to try using gradient ascent. Might
	// return after fewer if strong convergence is
	// detected, but it'll let the user set a cap.
	MaxIterations() int
}

Ascendable is an interface that can be used with batch gradient descent where the parameter vector theta is in one dimension only (so softmax regression would need its own model, for example).

type Datapoint ¶

type Datapoint struct {
	X []float64 `json:"x"`
	Y []float64 `json:"y"`
}

Datapoint is used in some models where it is cleaner to pass data as a struct rather than as the 1D and 2D arrays that the Generalized Linear Models use, for example. X corresponds to the inputs and Y corresponds to the result of the hypothesis.

This is used with the Perceptron, for example, so data can be easily passed in channels while staying encapsulated well.

type DistanceMeasure ¶

type DistanceMeasure func([]float64, []float64) float64

DistanceMeasure is any function that maps two vectors of float64s to a float64. Used for vector distance calculations

func LNorm ¶

func LNorm(p int) DistanceMeasure

LNorm returns a DistanceMeasure of the l-p norm. L norms are a generalized family of the Euclidean and Manhattan distance.

https://en.wikipedia.org/wiki/Norm_(mathematics)

(p = 1) -> Manhattan Distance
(p = 2) -> Euclidean Distance

NOTE that this function does not check that the vectors are the same length (to improve computation speed in, say, KNN). Make sure you pass in same-length vectors.
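The relationship between the l-p norm and the two named distances can be checked with a short sketch that reuses the example points from ManhattanDistance, (0, 2) and (2, 0). The function `lNorm` below is an illustrative reimplementation of the general formula (Σ |u[i] - v[i]|^p)^(1/p), not the package's source, and it assumes p >= 1 and same-length vectors.

```go
package main

import (
	"fmt"
	"math"
)

// lNorm returns a distance function for the l-p norm:
// (Σ |u[i] - v[i]|^p)^(1/p).
func lNorm(p int) func([]float64, []float64) float64 {
	return func(u, v []float64) float64 {
		var sum float64
		for i := range u {
			sum += math.Pow(math.Abs(u[i]-v[i]), float64(p))
		}
		return math.Pow(sum, 1/float64(p))
	}
}

func main() {
	u, v := []float64{0, 2}, []float64{2, 0}
	fmt.Printf("%.3f\n", lNorm(1)(u, v)) // Manhattan: 4.000
	fmt.Printf("%.3f\n", lNorm(2)(u, v)) // Euclidean: 2.828
}
```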

type Model ¶

type Model interface {

	// The variadic argument in Predict is an
	// optional arg which (if true) tells the
	// function to first normalize the input to
	// vector unit length. Use (and only use) this
	// if you trained on normalized inputs.
	Predict([]float64, ...bool) ([]float64, error)

	// PersistToFile and RestoreFromFile both take
	// in paths (absolute paths!) to files and
	// persists the necessary data to the filepath
	// such that you can RestoreFromFile later and
	// have the same instance. Helpful when you want
	// to train a model, save it to a file, then
	// open it later for prediction
	PersistToFile(string) error
	RestoreFromFile(string) error
}

Model is an interface for models that train on a 2D array of data (called x) and an array (y) of solution data. A Model trains in a supervised manner. Predict takes in a vector of floats and returns the model's response (a slice of floats) and an error, if any.

type OnlineModel ¶

type OnlineModel interface {
	Predict([]float64) ([]float64, error)

	// OnlineLearn has no outputs so you can run the data
	// within a separate goroutine! A channel of
	// errors is passed so you know when there's been
	// an error in learning, though learning will
	// just ignore the datapoint that caused the
	// error and continue on.
	//
	// Most times errors are caused when passed
	// datapoints are not of a consistent dimension.
	//
	// The function passed is a callback that is called
	// whenever the parameter vector theta is updated
	OnlineLearn(chan error, func([]float64))

	// UpdateStream updates the datastream channel
	// used in learning for the algorithm
	UpdateStream(chan Datapoint)

	// PersistToFile and RestoreFromFile both take
	// in paths (absolute paths!) to files and
	// persists the necessary data to the filepath
	// such that you can RestoreFromFile later and
	// have the same instance. Helpful when you want
	// to train a model, save it to a file, then
	// open it later for prediction
	PersistToFile(string) error
	RestoreFromFile(string) error
}

OnlineModel differs from Model in that learning can take place in a goroutine: the data is passed through a channel, and learning ends when the channel is closed.

type OnlineTextModel ¶

type OnlineTextModel interface {
	// Predict takes a document and returns the
	// expected class found by the model
	Predict(string) uint8

	// OnlineLearn has no outputs so you can run the data
	// within a separate goroutine! A channel of
	// errors is passed so you know when there's been
	// an error in learning, though learning will
	// just ignore the datapoint that caused the
	// error and continue on.
	OnlineLearn(chan<- error)

	// UpdateStream updates the datastream channel
	// used in learning for the algorithm
	UpdateStream(chan TextDatapoint)

	// PersistToFile and RestoreFromFile both take
	// in paths (absolute paths!) to files and
	// persists the necessary data to the filepath
	// such that you can RestoreFromFile later and
	// have the same instance. Helpful when you want
	// to train a model, save it to a file, then
	// open it later for prediction
	PersistToFile(string) error
	RestoreFromFile(string) error
}

OnlineTextModel holds the interface for text classifiers. They have the regular learn & predict functions, but don't include an updating callback func in OnlineLearn because the parameter vector passed would very often be _huge_, and therefore would be a detriment to performance.

type OptimizationMethod ¶

type OptimizationMethod string

OptimizationMethod defines a type enum which (using the constants declared below) lets a user pass in an optimization method to use when creating a new model.

const (
	BatchGA      OptimizationMethod = "Batch Gradient Ascent"
	StochasticGA                    = "Stochastic Gradient Descent"
)

Constants declare the types of optimization methods you can use.

type StochasticAscendable ¶

type StochasticAscendable interface {
	// LearningRate returns the learning rate α
	// to be used in Gradient Descent as the
	// modifier term
	LearningRate() float64

	// Examples returns the number of examples in the
	// training set the model is using
	Examples() int

	// Dij returns the derivative of the cost function
	// J(θ) with respect to the j-th parameter of
	// the hypothesis, θ[j], for the training example
	// x[i]. Called as Dij(i,j)
	Dij(int, int) (float64, error)

	// Theta returns a pointer to the parameter vector
	// theta, which is a 1D vector of floats
	Theta() []float64

	// MaxIterations returns the maximum number of
	// iterations to try using gradient ascent. Might
	// return after fewer if strong convergence is
	// detected, but it'll let the user set a cap.
	MaxIterations() int
}

StochasticAscendable is an interface that can be used with stochastic gradient descent where the parameter vector theta is in one dimension only (so softmax regression would need its own model, for example).

type TextDatapoint ¶

type TextDatapoint struct {
	X string `json:"x"`
	Y uint8  `json:"y"`
}

TextDatapoint is the data structure expected for text classification models. The passed types, therefore, are inherently different from the other structures. X is now a string (a document, usually a sentence or multiple sentences). Y is now a uint8 denoting the class, because you can't regress on text classification (at least not well/effectively).
