classifier

Version: v0.5.1
Published: Jan 7, 2024 License: Apache-2.0 Imports: 8 Imported by: 0

README

classifier

General purpose text classifier (naïve bayes, k-nearest neighbors)

Installation

go get github.com/alexsuslov/classifier

Usage

Classification

Text can be supplied in two forms: as a string or as an io.Reader. For short strings, use the TrainString and ClassifyString methods. For larger sources, use the Train and Classify methods, which take an io.Reader as input.

package main

import (
	"fmt"

	"github.com/alexsuslov/classifier/naive"
)

func main() {
	classifier := naive.New()
	classifier.TrainString("The quick brown fox jumped over the lazy dog", "ham")
	classifier.TrainString("Earn a degree online", "ham")
	classifier.TrainString("Earn cash quick online", "spam")

	if classification, err := classifier.ClassifyString("Earn your masters degree online"); err == nil {
		fmt.Println("Classification => ", classification) // ham
	} else {
		fmt.Println("error: ", err)
	}
}

Contributing

  • Fork the repository
  • Create a local feature branch
  • Run gofmt
  • Bump the VERSION file using semantic versioning
  • Submit a pull request

License

Copyright 2023 n3integration@gmail.com

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func Filter

func Filter(vs chan string, filters ...Predicate) chan string

Filter removes elements from the input channel where the supplied predicate is satisfied. Filter is a Predicate aggregation.
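
The description above can be read two ways: a satisfied predicate might keep an element or drop it. The following is a minimal sketch, not the package's implementation, that assumes keep semantics (an element is forwarded only when every predicate returns true), which is how predicates named like IsWord and IsNotStopWord would naturally compose. The names predicate, keepIf, feed, and collect are hypothetical; check the package source before relying on this behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// predicate mirrors the shape of the package's Predicate type.
type predicate func(string) bool

// keepIf is a hypothetical stand-in for Filter. Assumption: a value is
// forwarded only when every predicate returns true.
func keepIf(in chan string, preds ...predicate) chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for v := range in {
			ok := true
			for _, p := range preds {
				if !p(v) {
					ok = false
					break
				}
			}
			if ok {
				out <- v
			}
		}
	}()
	return out
}

// feed sends the given words on a channel and then closes it.
func feed(words ...string) chan string {
	ch := make(chan string)
	go func() {
		defer close(ch)
		for _, w := range words {
			ch <- w
		}
	}()
	return ch
}

// collect drains a channel into a space-separated string.
func collect(ch chan string) string {
	var got []string
	for v := range ch {
		got = append(got, v)
	}
	return strings.Join(got, " ")
}

func main() {
	longEnough := func(s string) bool { return len(s) >= 2 }
	notThe := func(s string) bool { return s != "the" }
	// Prints "quick fox" under the keep-semantics assumption.
	fmt.Println(collect(keepIf(feed("the", "quick", "a", "fox"), longEnough, notThe)))
}
```

Closing the output channel when the input is exhausted lets callers range over the result without extra signalling.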

func IsNotStopWord

func IsNotStopWord(v string) bool

IsNotStopWord is the inverse function of IsStopWord

func IsStopWord

func IsStopWord(v string) bool

IsStopWord checks against a list of known English stop words and returns true if v is a stop word; false otherwise

func IsWord

func IsWord(v string) bool

IsWord is a predicate to determine if a string contains at least two characters and doesn't contain any numbers
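
As an illustration of the rule described above, here is a small sketch of a predicate with the same stated behavior. The function name looksLikeWord is hypothetical, and it assumes that "characters" means Unicode code points and "numbers" means Unicode digits; the package may interpret either differently.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// looksLikeWord is a hypothetical predicate matching the description:
// at least two characters (counted as runes) and no digits.
func looksLikeWord(v string) bool {
	if utf8.RuneCountInString(v) < 2 {
		return false
	}
	for _, r := range v {
		if unicode.IsDigit(r) {
			return false
		}
	}
	return true
}

func main() {
	for _, v := range []string{"fox", "a", "b2b", "naïve"} {
		fmt.Printf("%q => %v\n", v, looksLikeWord(v))
	}
}
```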

func LoadStopWords

func LoadStopWords(filename string) error

func Map

func Map(vs chan string, f ...Mapper) chan string

Map applies f to each element of the supplied input channel
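
A sketch of how a channel-based map stage of this shape could work. It assumes that when several mappers are supplied they are applied in order to each element; the source does not state the order. The names mapper, mapAll, and normalize are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// mapper mirrors the shape of the package's Mapper type.
type mapper func(string) string

// mapAll is a hypothetical stand-in for Map. Assumption: mappers are
// applied left to right to every element of the input channel.
func mapAll(in chan string, fs ...mapper) chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for v := range in {
			for _, f := range fs {
				v = f(v)
			}
			out <- v
		}
	}()
	return out
}

// normalize runs a list of words through mapAll and joins the results.
func normalize(words []string, fs ...mapper) string {
	in := make(chan string)
	go func() {
		defer close(in)
		for _, w := range words {
			in <- w
		}
	}()
	var got []string
	for v := range mapAll(in, fs...) {
		got = append(got, v)
	}
	return strings.Join(got, " ")
}

func main() {
	trimQuotes := func(s string) string { return strings.Trim(s, `"'`) }
	// Prints "earn cash quick".
	fmt.Println(normalize([]string{`"Earn`, "CASH", "Quick'"}, trimQuotes, strings.ToLower))
}
```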

func ScanAlphaWords

func ScanAlphaWords(data []byte, atEOF bool) (advance int, token []byte, err error)

ScanAlphaWords is a function that splits text on whitespace, punctuation, and symbols; derived from bufio.ScanWords
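
To show what a split function of this kind looks like, here is a sketch of a bufio.SplitFunc that treats every non-letter rune as a separator. It is not the package's code: scanLetters and letterWords are hypothetical names, and treating digits as separators is an assumption of this sketch.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// scanLetters is a hypothetical bufio.SplitFunc that returns runs of
// letters and skips everything else (whitespace, punctuation, symbols,
// and, by assumption, digits).
func scanLetters(data []byte, atEOF bool) (advance int, token []byte, err error) {
	// Skip leading non-letter runes.
	start := 0
	for start < len(data) {
		r, width := utf8.DecodeRune(data[start:])
		if unicode.IsLetter(r) {
			break
		}
		start += width
	}
	// Scan until the next non-letter rune, which ends the word.
	for i := start; i < len(data); {
		r, width := utf8.DecodeRune(data[i:])
		if !unicode.IsLetter(r) {
			return i + width, data[start:i], nil
		}
		i += width
	}
	// At EOF, a trailing run of letters is the final token.
	if atEOF && len(data) > start {
		return len(data), data[start:], nil
	}
	// Request more data, discarding the skipped prefix.
	return start, nil, nil
}

// letterWords splits text with scanLetters and joins the tokens.
func letterWords(text string) string {
	sc := bufio.NewScanner(strings.NewReader(text))
	sc.Split(scanLetters)
	var words []string
	for sc.Scan() {
		words = append(words, sc.Text())
	}
	return strings.Join(words, " ")
}

func main() {
	// Prints "Earn quick and easy online".
	fmt.Println(letterWords("Earn $100, quick-and-easy (online)!"))
}
```

A function with this signature can be passed to bufio.Scanner.Split directly, which is presumably also how it would be handed to the SplitFunc option.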

func WordCounts

func WordCounts(r io.Reader) (map[string]int, error)

WordCounts extracts term frequencies from a text corpus
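
The idea of term-frequency extraction can be sketched with the standard library alone. This version splits on whitespace and lowercases each token; the real WordCounts may tokenize differently or drop stop words, so treat countWords and countWordsString as hypothetical helpers rather than a description of the package.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// countWords is a hypothetical stand-in for WordCounts: it counts
// lowercase, whitespace-separated tokens read from r.
func countWords(r io.Reader) (map[string]int, error) {
	counts := make(map[string]int)
	sc := bufio.NewScanner(r)
	sc.Split(bufio.ScanWords)
	for sc.Scan() {
		counts[strings.ToLower(sc.Text())]++
	}
	return counts, sc.Err()
}

// countWordsString is a convenience wrapper for string input.
func countWordsString(s string) map[string]int {
	counts, _ := countWords(strings.NewReader(s)) // a strings.Reader does not fail
	return counts
}

func main() {
	counts, err := countWords(strings.NewReader("Earn cash quick earn CASH online"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Prints "2 2 1".
	fmt.Println(counts["earn"], counts["cash"], counts["online"])
}
```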

Types

type Classifier

type Classifier interface {
	// Train allows clients to train the classifier
	Train(io.Reader, string) error
	// TrainString allows clients to train the classifier using a string
	TrainString(string, string) error
	// Classify performs a classification on the input corpus and assumes that
	// the underlying classifier has been trained.
	Classify(io.Reader) (string, error)
	// ClassifyString performs text classification using a string
	ClassifyString(string) (string, error)
}

Classifier provides a simple interface for different text classifiers
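
To illustrate the shape of this interface, the sketch below implements the same four methods with a deliberately simple word-overlap rule. It is not naive Bayes or k-nearest neighbors and is not taken from the package; the type overlap, the local interface textClassifier, and the helper words are all hypothetical. The point is that the string methods can delegate to the io.Reader methods through strings.NewReader.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strings"
)

// textClassifier repeats the method set listed above so the sketch compiles alone.
type textClassifier interface {
	Train(io.Reader, string) error
	TrainString(string, string) error
	Classify(io.Reader) (string, error)
	ClassifyString(string) (string, error)
}

// overlap is a toy classifier: it picks the category whose training
// text shares the most words with the input.
type overlap struct {
	seen map[string]map[string]bool // category -> set of words
}

var _ textClassifier = (*overlap)(nil)

func newOverlap() *overlap { return &overlap{seen: map[string]map[string]bool{}} }

// words reads lowercase, whitespace-separated tokens from r.
func words(r io.Reader) ([]string, error) {
	sc := bufio.NewScanner(r)
	sc.Split(bufio.ScanWords)
	var out []string
	for sc.Scan() {
		out = append(out, strings.ToLower(sc.Text()))
	}
	return out, sc.Err()
}

func (o *overlap) Train(r io.Reader, category string) error {
	ws, err := words(r)
	if err != nil {
		return err
	}
	if o.seen[category] == nil {
		o.seen[category] = map[string]bool{}
	}
	for _, w := range ws {
		o.seen[category][w] = true
	}
	return nil
}

func (o *overlap) TrainString(doc, category string) error {
	return o.Train(strings.NewReader(doc), category)
}

func (o *overlap) Classify(r io.Reader) (string, error) {
	if len(o.seen) == 0 {
		return "", errors.New("classifier has not been trained")
	}
	ws, err := words(r)
	if err != nil {
		return "", err
	}
	best, bestScore := "", -1
	for category, set := range o.seen {
		score := 0
		for _, w := range ws {
			if set[w] {
				score++
			}
		}
		// Break ties by category name so the result is deterministic.
		if score > bestScore || (score == bestScore && category < best) {
			best, bestScore = category, score
		}
	}
	return best, nil
}

func (o *overlap) ClassifyString(doc string) (string, error) {
	return o.Classify(strings.NewReader(doc))
}

func main() {
	c := newOverlap()
	c.TrainString("Earn a degree online", "ham")
	c.TrainString("Earn cash quick online", "spam")
	label, err := c.ClassifyString("cash quick")
	fmt.Println(label, err) // spam <nil>
}
```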

type Mapper

type Mapper func(string) string

Mapper provides a map function

type Predicate

type Predicate func(string) bool

Predicate provides a predicate function

type StdOption

type StdOption func(*StdTokenizer)

StdOption provides configuration settings for a StdTokenizer

func BufferSize

func BufferSize(size int) StdOption

BufferSize adjusts the size of the buffered channel

func Filters

func Filters(f ...Predicate) StdOption

Filters overrides the list of predicates

func SplitFunc

func SplitFunc(fn bufio.SplitFunc) StdOption

SplitFunc overrides the default word split function, based on whitespace

func Transforms

func Transforms(m ...Mapper) StdOption

Transforms overrides the list of mappers

type StdTokenizer

type StdTokenizer struct {
	// contains filtered or unexported fields
}

StdTokenizer provides a common document tokenizer that splits a document by word boundaries

func NewTokenizer

func NewTokenizer(opts ...StdOption) *StdTokenizer

NewTokenizer initializes a new standard Tokenizer instance
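
The StdOption signature follows the functional options pattern: each option is a function that mutates the tokenizer before it is returned. The sketch below reproduces that pattern with a hypothetical tokenizer type. The field names, the default values, and the helpers withSplit, withBuffer, and tokens are inventions for illustration, since the real fields are unexported.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// tokenizer is a hypothetical stand-in for StdTokenizer.
type tokenizer struct {
	split      bufio.SplitFunc
	bufferSize int
}

// option mirrors the shape of StdOption.
type option func(*tokenizer)

func withSplit(fn bufio.SplitFunc) option { return func(t *tokenizer) { t.split = fn } }
func withBuffer(n int) option             { return func(t *tokenizer) { t.bufferSize = n } }

// newTokenizer applies the options on top of assumed defaults.
func newTokenizer(opts ...option) *tokenizer {
	t := &tokenizer{split: bufio.ScanWords, bufferSize: 16}
	for _, opt := range opts {
		opt(t)
	}
	return t
}

// tokenize streams tokens over a buffered channel and closes it at EOF.
func (t *tokenizer) tokenize(r io.Reader) chan string {
	out := make(chan string, t.bufferSize)
	go func() {
		defer close(out)
		sc := bufio.NewScanner(r)
		sc.Split(t.split)
		for sc.Scan() {
			out <- sc.Text()
		}
	}()
	return out
}

// tokens tokenizes a string and joins the result for display.
func tokens(text string, opts ...option) string {
	var got []string
	for tok := range newTokenizer(opts...).tokenize(strings.NewReader(text)) {
		got = append(got, tok)
	}
	return strings.Join(got, "|")
}

func main() {
	fmt.Println(tokens("quick fox"))                                    // quick|fox
	fmt.Println(tokens("ab", withSplit(bufio.ScanRunes), withBuffer(1))) // a|b
}
```

With this pattern, new settings can be added later without changing the constructor's signature.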

func (*StdTokenizer) Tokenize

func (t *StdTokenizer) Tokenize(r io.Reader) chan string

Tokenize splits the input into words and streams the results over the returned channel

type Tokenizer

type Tokenizer interface {
	// Tokenize breaks the provided document into a channel of tokens
	Tokenize(io.Reader) chan string
}

Tokenizer provides a common interface to tokenize documents

type WeightScheme

type WeightScheme func(term string) float64

WeightScheme provides a contract for term frequency weight schemes

func BagOfWords

func BagOfWords(doc map[string]float64) WeightScheme

BagOfWords weight scheme: counts the number of occurrences

func Binary

func Binary(doc map[string]float64) WeightScheme

Binary weight scheme: 1 if present; 0 otherwise

func LogNorm

func LogNorm(doc map[string]float64) WeightScheme

LogNorm weight scheme: returns the natural log of the number of occurrences of a term

func TermFrequency

func TermFrequency(doc map[string]float64) WeightScheme

TermFrequency weight scheme: counts the number of occurrences divided by the number of terms within a document

type WeightSchemeStrategy

type WeightSchemeStrategy func(doc map[string]float64) WeightScheme

WeightSchemeStrategy provides support for pluggable weight schemes
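
The four weight schemes listed above can be sketched as closures over a document's term counts. This is an illustration under stated assumptions, not the package's code: "number of terms" is taken to mean the sum of all counts, absent terms weigh 0 under every scheme, and LogNorm is the plain natural log of the count as described (the package may use a variant such as log(1 + count)). All lowercase names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// weightScheme mirrors the shape of the package's WeightScheme type.
type weightScheme func(term string) float64

// binary: 1 if the term is present, 0 otherwise.
func binary(doc map[string]float64) weightScheme {
	return func(term string) float64 {
		if doc[term] > 0 {
			return 1
		}
		return 0
	}
}

// bagOfWords: the raw number of occurrences.
func bagOfWords(doc map[string]float64) weightScheme {
	return func(term string) float64 { return doc[term] }
}

// termFrequency: occurrences divided by the total number of terms.
// Assumption: the total is the sum of all counts in the document.
func termFrequency(doc map[string]float64) weightScheme {
	total := 0.0
	for _, n := range doc {
		total += n
	}
	return func(term string) float64 {
		if total == 0 {
			return 0
		}
		return doc[term] / total
	}
}

// logNorm: natural log of the number of occurrences.
// Assumption: absent terms weigh 0 instead of -Inf.
func logNorm(doc map[string]float64) weightScheme {
	return func(term string) float64 {
		if doc[term] <= 0 {
			return 0
		}
		return math.Log(doc[term])
	}
}

func main() {
	doc := map[string]float64{"earn": 2, "cash": 1, "online": 1}
	// Prints "1 2 0.5 0".
	fmt.Println(binary(doc)("earn"), bagOfWords(doc)("earn"), termFrequency(doc)("earn"), logNorm(doc)("cash"))
}
```

Because each constructor has the signature func(map[string]float64) returning a scheme, any of them fits a strategy type like WeightSchemeStrategy. Under the plain log, a term that occurs once weighs the same as an absent term, which is one reason smoothed variants exist.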
