bayes

v0.5.1
Published: Sep 29, 2023 License: MIT Imports: 7 Imported by: 5

README

bayes

An implementation of a Naive Bayes classifier. More details are in the docs.

Usage

This package classifies a new entity into one of several categories (classes) according to the entity's features. The algorithm uses known data to calculate the weight of each feature for each category.

func Example() {
	// There are two jars of cookies; they are our training set.
	// Cookies can be round or star-shaped.
	// They are either plain or chocolate chip cookies.
	jar1 := ft.Class("Jar1")
	jar2 := ft.Class("Jar2")

	// Every preclassified feature set provides data for one cookie: which
	// jar holds the cookie, its kind, and its shape.
	cookie1 := ft.ClassFeatures{
		Class: jar1,
		Features: []ft.Feature{
			{Name: "kind", Value: "plain"},
			{Name: "shape", Value: "round"},
		},
	}
	cookie2 := ft.ClassFeatures{
		Class: jar1,
		Features: []ft.Feature{
			{Name: "kind", Value: "plain"},
			{Name: "shape", Value: "star"},
		},
	}
	cookie3 := ft.ClassFeatures{
		Class: jar1,
		Features: []ft.Feature{
			{Name: "kind", Value: "chocolate"},
			{Name: "shape", Value: "star"},
		},
	}
	cookie4 := ft.ClassFeatures{
		Class: jar1,
		Features: []ft.Feature{
			{Name: "kind", Value: "plain"},
			{Name: "shape", Value: "round"},
		},
	}
	cookie5 := ft.ClassFeatures{
		Class: jar1,
		Features: []ft.Feature{
			{Name: "kind", Value: "plain"},
			{Name: "shape", Value: "round"},
		},
	}
	cookie6 := ft.ClassFeatures{
		Class: jar2,
		Features: []ft.Feature{
			{Name: "kind", Value: "chocolate"},
			{Name: "shape", Value: "star"},
		},
	}
	cookie7 := ft.ClassFeatures{
		Class: jar2,
		Features: []ft.Feature{
			{Name: "kind", Value: "chocolate"},
			{Name: "shape", Value: "star"},
		},
	}
	cookie8 := ft.ClassFeatures{
		Class: jar2,
		Features: []ft.Feature{
			{Name: "kind", Value: "chocolate"},
			{Name: "shape", Value: "star"},
		},
	}

	lfs := []ft.ClassFeatures{
		cookie1, cookie2, cookie3, cookie4, cookie5, cookie6, cookie7, cookie8,
	}

	nb := bayes.New()
	nb.Train(lfs)
	oddsPrior, err := nb.PriorOdds(jar1)
	if err != nil {
		log.Println(err)
	}

	// If we get a chocolate star-shaped cookie, which jar did it most
	// likely come from?
	aCookie := []ft.Feature{
		{Name: ft.Name("kind"), Value: ft.Value("chocolate")},
		{Name: ft.Name("shape"), Value: ft.Value("star")},
	}

	res, err := nb.PosteriorOdds(aCookie)
	if err != nil {
		fmt.Println(err)
	}

	// A random cookie is more likely to come from Jar1, but a chocolate
	// star-shaped cookie is more likely to come from Jar2.
	fmt.Printf("Prior odds for Jar1 are %0.2f\n", oddsPrior)
	fmt.Printf("The cookie came from %s, with odds %0.2f\n", res.MaxClass, res.MaxOdds)
	// Output:
	// Prior odds for Jar1 are 1.67
	// The cookie came from Jar2, with odds 7.50
}

Development

Testing
go test

Other implementations:

Go, Java, Python, R, Ruby

Documentation

Overview

Package bayes implements a Naive Bayes trainer and classifier. The code is located at https://github.com/gnames/bayes

Bayes' rule calculates the probability of a hypothesis from prior knowledge about the hypothesis together with evidence that supports or diminishes that probability. Prior knowledge can dramatically influence the posterior probability of a hypothesis. For example, the assumption that an adult bird that cannot fly is a penguin is very unlikely to hold in the northern hemisphere, but very likely to hold in Antarctica. Bayes' theorem is often written as

P(H|E) = P(H) * P(E|H) / P(E)

where H is our hypothesis, E is new evidence, P(H) is the prior probability that H is true, P(E|H) is the known probability of the evidence when H is true, and P(E) is the known probability of E across all known cases. P(H|E) is the posterior probability of hypothesis H, adjusted according to the new evidence E.
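
As a minimal sketch of the formula above, the following Go snippet computes P(H|E) for the flightless-bird case. The helper name `posterior` and all the probabilities are made-up assumptions for illustration; they are not part of this package.

```go
package main

import "fmt"

// posterior applies Bayes' theorem: P(H|E) = P(H) * P(E|H) / P(E).
// It is a hypothetical helper for illustration, not part of package bayes.
func posterior(pH, pEgivenH, pE float64) float64 {
	return pH * pEgivenH / pE
}

func main() {
	// Made-up numbers: 1% of birds in a region are penguins (P(H)),
	// every penguin is flightless (P(E|H)), and 2% of all birds in the
	// region are flightless (P(E)).
	p := posterior(0.01, 1.0, 0.02)
	fmt.Printf("P(penguin | cannot fly) = %0.2f\n", p)
}
```

With these assumed inputs the posterior probability is 0.50, fifty times the 1% prior.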

Finding the probability that a hypothesis is true can be considered a classification event. Given prior knowledge and new evidence, we can assign an entity to the hypothesis that has the highest posterior probability.

Using odds instead of probabilities

It is possible to represent Bayes' theorem using odds. Odds describe how likely a hypothesis is in comparison to all other possible hypotheses.

odds = P(H) / (1 - P(H))

P(H) = odds / (1 + odds)
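
The two conversions above can be sketched in Go as follows. The function names `oddsFromProb` and `probFromOdds` are hypothetical and chosen only for this illustration.

```go
package main

import "fmt"

// oddsFromProb converts a probability into odds: P / (1 - P).
func oddsFromProb(p float64) float64 {
	return p / (1 - p)
}

// probFromOdds converts odds back into a probability: odds / (1 + odds).
func probFromOdds(odds float64) float64 {
	return odds / (1 + odds)
}

func main() {
	p := 0.75
	odds := oddsFromProb(p) // 0.75 / 0.25
	fmt.Printf("odds = %0.2f, back to probability = %0.2f\n", odds, probFromOdds(odds))
}
```

A probability of 0.75 corresponds to odds of 3 (three to one), and converting back returns 0.75. Note that a probability of exactly 1 has no finite odds.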

Using odds allows us to simplify Bayes calculations

oddsPosterior = oddsPrior * likelihood

where likelihood is

likelihood = P(E|H)/P(E|H')

P(E|H') in this case is the known probability of the evidence when H is not true. If we have several pieces of evidence that are independent of each other, posterior odds can be calculated as the product of the prior odds and the likelihoods of all the given pieces of evidence.

oddsPosterior = oddsPrior * likelihood1 * likelihood2 * likelihood3 ...
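
The product above can be sketched as a small Go function. The name `posteriorOdds` and the input values are assumptions made for this illustration, and the sketch does not reflect how the package computes its results internally.

```go
package main

import "fmt"

// posteriorOdds multiplies prior odds by the likelihood of each piece of
// evidence. The result is only meaningful under the assumption that the
// pieces of evidence are independent of each other.
func posteriorOdds(oddsPrior float64, likelihoods ...float64) float64 {
	odds := oddsPrior
	for _, l := range likelihoods {
		odds *= l
	}
	return odds
}

func main() {
	// Hypothetical values: prior odds of 0.5 and three likelihoods.
	fmt.Printf("%0.2f\n", posteriorOdds(0.5, 4, 0.5, 3))
}
```

A likelihood above 1 raises the odds, a likelihood below 1 lowers them, and a likelihood of exactly 1 leaves them unchanged.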

Each subsequent piece of evidence modifies the prior odds. If pieces of evidence are not independent (for example, a bird's inability to fly and its propensity to nest on the ground), they skew the outcome. In reality, evidence is quite often not completely independent, and this is how Naive Bayes got its name: people who apply it "naively" assume that their pieces of evidence are completely independent of each other. In practice, the Naive Bayes approach often shows good results in spite of this known fallacy.

Training and prior odds

It is quite possible that the likelihoods of evidence are representative of the data to be classified while the prior odds from training are not. As in the previous example, the evidence that a bird cannot fly supports the 'penguin' hypothesis much better in Antarctica, because the odds of meeting a penguin there are much higher than in the northern hemisphere. Therefore we provide the ability to supply a prior probability value at a classification event.

Terminology

In natural language processing, pieces of `evidence` are often called `features`. We follow the same convention in this package.

Hypotheses are often called classes. Based on the outcome, we classify an entity (in other words, assign a class to the entity). Every class receives a number of elements, or `tokens`, each with a set of features.

Example
package main

import (
	"fmt"
	"log"

	"github.com/gnames/bayes"
	ft "github.com/gnames/bayes/ent/feature"
)

func main() {
	// There are two jars of cookies; they are our training set.
	// Cookies can be round or star-shaped.
	// They are either plain or chocolate chip cookies.
	jar1 := ft.Class("Jar1")
	jar2 := ft.Class("Jar2")

	// Every preclassified feature set provides data for one cookie: which
	// jar holds the cookie, its kind, and its shape.
	cookie1 := ft.ClassFeatures{
		Class: jar1,
		Features: []ft.Feature{
			{Name: "kind", Value: "plain"},
			{Name: "shape", Value: "round"},
		},
	}
	cookie2 := ft.ClassFeatures{
		Class: jar1,
		Features: []ft.Feature{
			{Name: "kind", Value: "plain"},
			{Name: "shape", Value: "star"},
		},
	}
	cookie3 := ft.ClassFeatures{
		Class: jar1,
		Features: []ft.Feature{
			{Name: "kind", Value: "chocolate"},
			{Name: "shape", Value: "star"},
		},
	}
	cookie4 := ft.ClassFeatures{
		Class: jar1,
		Features: []ft.Feature{
			{Name: "kind", Value: "plain"},
			{Name: "shape", Value: "round"},
		},
	}
	cookie5 := ft.ClassFeatures{
		Class: jar1,
		Features: []ft.Feature{
			{Name: "kind", Value: "plain"},
			{Name: "shape", Value: "round"},
		},
	}
	cookie6 := ft.ClassFeatures{
		Class: jar2,
		Features: []ft.Feature{
			{Name: "kind", Value: "chocolate"},
			{Name: "shape", Value: "star"},
		},
	}
	cookie7 := ft.ClassFeatures{
		Class: jar2,
		Features: []ft.Feature{
			{Name: "kind", Value: "chocolate"},
			{Name: "shape", Value: "star"},
		},
	}
	cookie8 := ft.ClassFeatures{
		Class: jar2,
		Features: []ft.Feature{
			{Name: "kind", Value: "chocolate"},
			{Name: "shape", Value: "star"},
		},
	}

	lfs := []ft.ClassFeatures{
		cookie1, cookie2, cookie3, cookie4, cookie5, cookie6, cookie7, cookie8,
	}

	nb := bayes.New()
	nb.Train(lfs)
	oddsPrior, err := nb.PriorOdds(jar1)
	if err != nil {
		log.Println(err)
	}

	// If we get a chocolate star-shaped cookie, which jar did it most
	// likely come from?
	aCookie := []ft.Feature{
		{Name: ft.Name("kind"), Value: ft.Value("chocolate")},
		{Name: ft.Name("shape"), Value: ft.Value("star")},
	}

	res, err := nb.PosteriorOdds(aCookie)
	if err != nil {
		fmt.Println(err)
	}

	// A random cookie is more likely to come from Jar1, but a chocolate
	// star-shaped cookie is more likely to come from Jar2.
	fmt.Printf("Prior odds for Jar1 are %0.2f\n", oddsPrior)
	fmt.Printf("The cookie came from %s, with odds %0.2f\n", res.MaxClass, res.MaxOdds)
}
Output:

Prior odds for Jar1 are 1.67
The cookie came from Jar2, with odds 7.50
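
The numbers in this output can be reproduced by hand from the training counts. The sketch below uses only the standard library; the `cookie` type and the helpers `share`, `priorOdds`, and `likelihood` are hypothetical names for this illustration and are not the package's API. It assumes plain counting with no smoothing, which happens to match the output above, but it makes no claim about how the package is implemented.

```go
package main

import "fmt"

// cookie is a hypothetical stand-in for one preclassified training item.
type cookie struct{ jar, kind, shape string }

// trainingSet mirrors the eight cookies from the example.
var trainingSet = []cookie{
	{"Jar1", "plain", "round"},
	{"Jar1", "plain", "star"},
	{"Jar1", "chocolate", "star"},
	{"Jar1", "plain", "round"},
	{"Jar1", "plain", "round"},
	{"Jar2", "chocolate", "star"},
	{"Jar2", "chocolate", "star"},
	{"Jar2", "chocolate", "star"},
}

// share returns the fraction of cookies that satisfy match among those
// inside the jar (inside == true) or outside it (inside == false).
func share(cookies []cookie, jar string, inside bool, match func(cookie) bool) float64 {
	var total, hits float64
	for _, c := range cookies {
		if (c.jar == jar) != inside {
			continue
		}
		total++
		if match(c) {
			hits++
		}
	}
	return hits / total
}

// priorOdds is the number of cookies in the jar divided by the number of
// cookies outside it.
func priorOdds(cookies []cookie, jar string) float64 {
	var in, out float64
	for _, c := range cookies {
		if c.jar == jar {
			in++
		} else {
			out++
		}
	}
	return in / out
}

// likelihood is P(feature|jar) / P(feature|not jar).
func likelihood(cookies []cookie, jar string, match func(cookie) bool) float64 {
	return share(cookies, jar, true, match) / share(cookies, jar, false, match)
}

func isChocolate(c cookie) bool { return c.kind == "chocolate" }
func isStar(c cookie) bool      { return c.shape == "star" }

func main() {
	posterior := priorOdds(trainingSet, "Jar2") *
		likelihood(trainingSet, "Jar2", isChocolate) *
		likelihood(trainingSet, "Jar2", isStar)

	fmt.Printf("Prior odds for Jar1 are %0.2f\n", priorOdds(trainingSet, "Jar1"))
	fmt.Printf("Posterior odds for Jar2 are %0.2f\n", posterior)
}
```

Jar1 holds 5 of the 8 cookies, so its prior odds are 5/3, or about 1.67. For Jar2 the prior odds are 3/5; all Jar2 cookies are chocolate versus 1 in 5 in Jar1 (likelihood 5), and all are star-shaped versus 2 in 5 in Jar1 (likelihood 2.5), giving 0.6 * 5 * 2.5 = 7.5.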


Types

type Bayes added in v0.2.0

type Bayes interface {
	Trainer
	Serializer
	Calc
}

The Bayes interface uses the Bayes algorithm to calculate posterior and prior odds. For training it takes manually curated data packed into features, and it allows the trained data to be serialized and deserialized.

func New added in v0.2.0

func New() Bayes

New creates a new instance of a Bayes object. This object needs to get data either from training or from loading a dump of previous training data.

type Calc added in v0.2.0

type Calc interface {
	// PriorOdds method returns Odds from the training.
	PriorOdds(ft.Class) (float64, error)
	// PosteriorOdds uses a set of features to determine which class they
	// most probably belong to.
	PosteriorOdds([]ft.Feature, ...Option) (posterior.Odds, error)
	// Likelihood gives an isolated likelihood of a feature.
	Likelihood(ft.Feature, ft.Class) (float64, error)
}

Calc provides methods for calculating prior and posterior odds from new data, making it possible to classify the data according to its features.

type Option added in v0.2.0

type Option func(nb *bayes)

func OptIgnorePriorOdds added in v0.2.0

func OptIgnorePriorOdds(b bool) Option

OptIgnorePriorOdds might be needed if PriorOdds are already accounted for.

func OptPriorOdds added in v0.2.0

func OptPriorOdds(lc map[ft.Class]int) Option

OptPriorOdds allows the prior odds used in calculations to be changed dynamically. Sometimes prior odds during a classification event are very different from the ones acquired during training. If, for example, the 'real' prior odds are 100 times larger, the calculated posterior odds will be 100 times smaller than they are supposed to be.
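
The effect of mismatched prior odds can be sketched with plain arithmetic. The helper `rebase` and all the numbers below are hypothetical; the snippet does not call OptPriorOdds and only illustrates why posterior odds scale linearly with prior odds.

```go
package main

import "fmt"

// rebase replaces the prior odds baked into posterior odds with new prior
// odds. Because oddsPosterior = oddsPrior * likelihoods, dividing by the
// trained prior leaves the product of likelihoods, which is then multiplied
// by the real prior.
func rebase(oddsPosterior, trainedPrior, realPrior float64) float64 {
	return oddsPosterior / trainedPrior * realPrior
}

func main() {
	// Hypothetical values: training produced prior odds of 0.02 and
	// posterior odds of 0.5, but the real prior odds are 100 times larger.
	trainedPrior, posterior := 0.02, 0.5
	realPrior := trainedPrior * 100
	fmt.Printf("%0.2f\n", rebase(posterior, trainedPrior, realPrior))
}
```

With these assumed values the corrected posterior odds are 50, which is 100 times the uncorrected value of 0.5.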

type Serializer added in v0.2.0

type Serializer interface {
	// Inspect returns simplified, publicly accessible information that
	// is normally private to a Bayes object.
	Inspect() bayesdump.BayesDump
	// Load takes a slice of bytes that corresponds to output.Output and
	// creates a Bayes instance from it.
	Load([]byte) error
	// Dump takes the internal data of a Bayes instance, converts it to
	// object.Object and serializes it to a slice of bytes.
	Dump() ([]byte, error)
}

Serializer provides methods for dumping data from a Bayes object to a slice of bytes, and for rebuilding a Bayes object from such data.

type Trainer added in v0.2.0

type Trainer interface {
	Train([]ft.ClassFeatures)
}

The Trainer interface provides a method for training a Bayes object on data from the training set.

Directories

Path      Synopsis
ent
output    package output contains helpers to make results of odds calculation JSON friendly.
