distance

package
v0.8.3
Warning

This package is not in the latest version of its module.

Published: Jul 18, 2023 License: MIT Imports: 17 Imported by: 0

Documentation


Constants

This section is empty.

Variables

var CACHEFILENAME string = filepath.Join(os.TempDir(), "fileDistance.cache")

Default cache file location. It can be changed before use; see the CLI flags in main.go.

Functions

func DistBytes

func DistBytes(x, y []byte) float64

Compute a normalized compression distance between two []byte using gzip. The result should normally be between 0.0 and 1.0, but can sometimes end up slightly above 1.0. See the attached article, especially Annex A.
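
To make the idea concrete, here is a minimal sketch of a normalized compression distance built on compress/gzip. It assumes the commonly used formula (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length; the package's actual formula and compression settings may differ. The names compressedLen and ncd are hypothetical and not part of this package.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// compressedLen returns the size in bytes of data after gzip compression.
func compressedLen(data []byte) int {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(data) // writing into a bytes.Buffer does not fail
	w.Close()     // flush the remaining data and the gzip trailer
	return buf.Len()
}

// ncd is a hypothetical helper (not this package's API). It assumes the
// usual formula: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
func ncd(x, y []byte) float64 {
	cx, cy := compressedLen(x), compressedLen(y)
	xy := append(append([]byte{}, x...), y...)
	cxy := compressedLen(xy)
	lo, hi := cx, cy
	if lo > hi {
		lo, hi = hi, lo
	}
	return float64(cxy-lo) / float64(hi)
}

func main() {
	a := []byte("the quick brown fox jumps over the lazy dog")
	b := []byte("Lorem ipsum dolor sit amet, consectetur adipiscing elit")
	fmt.Printf("same: %.3f  different: %.3f\n", ncd(a, a), ncd(a, b))
}
```

Because gzip adds a fixed header and trailer, very short inputs give noisy values, which is one reason a result can drift above 1.0.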

func DistFile

func DistFile(f1, f2 string) float64

Distance between two files, given by their path names. Useful text content will be extracted before distance computation.

func DistString

func DistString(xs, ys string) float64

Compute a normalized compression distance between two strings using gzip. The result should normally be between 0.0 and 1.0, but can sometimes end up slightly above 1.0. See the attached article, especially Annex A.

func ExtractText

func ExtractText(fname string) []byte

Extract useful content. Currently tries gzip, zlib, zip, pure XML and HTML, in that order, then removes repeated whitespace characters.
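
The "try a decoder, fall back, then normalize whitespace" approach can be sketched as follows. This is only an illustration under stated assumptions: it handles gzip alone (not zlib, zip, XML or HTML), and it assumes that repeated whitespace is replaced by a single space. The helpers gz, tryGunzip and normalize are hypothetical names, not part of this package.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"regexp"
)

// spaces matches any run of one or more whitespace characters.
var spaces = regexp.MustCompile(`\s+`)

// gz compresses data with gzip; it is only used to build demo input.
func gz(data []byte) []byte {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(data)
	w.Close()
	return buf.Bytes()
}

// tryGunzip returns the decompressed payload, or ok=false when raw
// is not a valid gzip stream.
func tryGunzip(raw []byte) (out []byte, ok bool) {
	r, err := gzip.NewReader(bytes.NewReader(raw))
	if err != nil {
		return nil, false
	}
	defer r.Close()
	out, err = io.ReadAll(r)
	if err != nil {
		return nil, false
	}
	return out, true
}

// normalize is a hypothetical stand-in for the extraction step:
// decode if possible, then collapse each whitespace run to one space.
func normalize(raw []byte) []byte {
	if out, ok := tryGunzip(raw); ok {
		raw = out
	}
	return spaces.ReplaceAll(raw, []byte(" "))
}

func main() {
	fmt.Printf("%q\n", normalize([]byte("a  \n\t b")))
	fmt.Printf("%q\n", normalize(gz([]byte("x \n\n y"))))
}
```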

func FilesInFolder

func FilesInFolder(folder string) []string

Get (recursively) all files in folder, ignoring the .git folder. File names are returned as absolute paths.
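
A recursive listing of this kind can be written with filepath.WalkDir from the standard library. The sketch below is an assumption about how such a function might work, not the package's code; listFiles and makeDemoTree are hypothetical names, and unlike FilesInFolder the sketch also returns an error.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// listFiles walks folder recursively and returns absolute paths of all
// non-directory entries, skipping any directory named ".git".
func listFiles(folder string) ([]string, error) {
	root, err := filepath.Abs(folder)
	if err != nil {
		return nil, err
	}
	var files []string
	err = filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			if d.Name() == ".git" {
				return filepath.SkipDir // do not descend into .git
			}
			return nil
		}
		files = append(files, path) // absolute, because root is absolute
		return nil
	})
	return files, err
}

// makeDemoTree builds a temporary folder holding two regular files
// and one file inside a .git directory.
func makeDemoTree() (string, error) {
	root, err := os.MkdirTemp("", "distance-demo")
	if err != nil {
		return "", err
	}
	for _, name := range []string{"a.txt", "sub/b.txt", ".git/config"} {
		p := filepath.Join(root, filepath.FromSlash(name))
		if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
			return "", err
		}
		if err := os.WriteFile(p, []byte("x"), 0o644); err != nil {
			return "", err
		}
	}
	return root, nil
}

func main() {
	root, err := makeDemoTree()
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)
	files, err := listFiles(root)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(files), "files found outside .git")
}
```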

Types

type Cache

type Cache struct {
	M map[[sha256.Size * 2]byte]float64 // should not be used directly, nor relied upon. Public only because required for ease of saving as gob.
}

Cache for file-to-file distances. The calculation is very expensive, since we check for Word, Excel, zip, etc. files, so caching makes sense. We do not key on file names but on the hashes of both files, to ensure proper handling of file name or content changes.
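
The key type of M, [sha256.Size * 2]byte, has room for two SHA-256 digests side by side. One way such a key could be built is sketched below. Sorting the two digests so that (a, b) and (b, a) share a key is an assumption made for this illustration; the package may order them differently. pairKey and makeKey are hypothetical names.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// pairKey holds two SHA-256 digests back to back.
type pairKey [sha256.Size * 2]byte

// makeKey hashes both contents and concatenates the digests.
// Assumption: the digests are sorted so the key is order-independent.
func makeKey(a, b []byte) pairKey {
	ha, hb := sha256.Sum256(a), sha256.Sum256(b)
	if bytes.Compare(ha[:], hb[:]) > 0 {
		ha, hb = hb, ha
	}
	var k pairKey
	copy(k[:sha256.Size], ha[:])
	copy(k[sha256.Size:], hb[:])
	return k
}

func main() {
	x, y := []byte("first file"), []byte("second file")
	fmt.Println("symmetric:", makeKey(x, y) == makeKey(y, x))
	fmt.Println("content-sensitive:", makeKey(x, y) != makeKey(x, []byte("edited")))
}
```

Keying on content means a renamed file still hits the cache, while an edited file under the same name does not.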

func NewCache

func NewCache() *Cache

Create a new cache. Load from previously saved cache if there is one. Not thread safe.

func (*Cache) Clear

func (c *Cache) Clear()

Clear cache in memory. Cache on file will be erased on next save.

func (*Cache) Get

func (c *Cache) Get(f1, f2 string) float64

Try to read from cache; on a cache miss, compute, store and return the result.
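
The read-through pattern described here can be sketched in a few lines. The memo type below is hypothetical and uses a string key with a caller-supplied compute function for brevity; the real Cache takes two file names and keys on file hashes. Like the real cache, this sketch is not thread safe.

```go
package main

import "fmt"

// memo is a hypothetical read-through cache, not this package's Cache.
type memo struct {
	m      map[string]float64
	misses int // number of times compute had to run
}

// get returns the cached value for key, computing and storing it on a miss.
func (c *memo) get(key string, compute func() float64) float64 {
	if v, ok := c.m[key]; ok { // reading from a nil map is safe
		return v
	}
	c.misses++
	v := compute()
	if c.m == nil {
		c.m = make(map[string]float64)
	}
	c.m[key] = v
	return v
}

func main() {
	var c memo
	slow := func() float64 { return 0.5 }
	fmt.Println(c.get("a|b", slow), c.get("a|b", slow), "misses:", c.misses)
}
```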

func (*Cache) Save

func (c *Cache) Save()

Save cache to file. Not thread safe.
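
The comment on M says the field is public so that the cache can be saved as gob; encoding/gob only encodes exported struct fields. A minimal round trip is sketched below. It encodes into an in-memory buffer rather than a file, and the cache type, encode and decode are hypothetical names for this illustration.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/gob"
	"fmt"
)

// cache mirrors the shape of the documented struct. M must be exported,
// otherwise gob would have nothing to encode.
type cache struct {
	M map[[sha256.Size * 2]byte]float64
}

// encode serializes c with gob into a byte slice.
func encode(c *cache) ([]byte, error) {
	var buf bytes.Buffer
	err := gob.NewEncoder(&buf).Encode(c)
	return buf.Bytes(), err
}

// decode rebuilds a cache from gob-encoded data.
func decode(data []byte) (*cache, error) {
	c := &cache{}
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(c)
	return c, err
}

func main() {
	var k [sha256.Size * 2]byte
	k[0] = 1
	c := &cache{M: map[[sha256.Size * 2]byte]float64{k: 0.25}}
	data, err := encode(c)
	if err != nil {
		panic(err)
	}
	back, err := decode(data)
	if err != nil {
		panic(err)
	}
	fmt.Println("entries after round trip:", len(back.M))
}
```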

func (*Cache) Size

func (c *Cache) Size() int

Number of distinct pairs of files whose distance is cached.

type Matrix

type Matrix struct {
	// contains filtered or unexported fields
}

A distance matrix, optimised for storage efficiency. The zero value can be used immediately.

func ComputeEuclid

func ComputeEuclid(vects []Vect) (mat *Matrix)

Compute the Euclidean distance matrix for vectors. Used mainly for test purposes.
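
For reference, the Euclidean distance between two vectors is the square root of the sum of squared coordinate differences. The sketch below computes it pairwise into a plain [][]float64 rather than a Matrix; euclid and pairwise are hypothetical names, and the code assumes all vectors have the same length.

```go
package main

import (
	"fmt"
	"math"
)

// vect mirrors the package's Vect alias.
type vect = []float64

// euclid returns the Euclidean distance between a and b.
// Assumption: len(a) == len(b); a shorter b would cause a panic.
func euclid(a, b vect) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// pairwise fills a full n x n table of distances between all vectors.
func pairwise(vs []vect) [][]float64 {
	out := make([][]float64, len(vs))
	for i := range vs {
		out[i] = make([]float64, len(vs))
		for j := range vs {
			out[i][j] = euclid(vs[i], vs[j])
		}
	}
	return out
}

func main() {
	d := pairwise([]vect{{0, 0}, {3, 4}, {6, 8}})
	fmt.Println(d[0][1], d[1][0], d[0][2])
}
```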

func ComputeFiles

func ComputeFiles(fnames ...string) *Matrix

Compute the distance matrix for a group of files. Computations are cached for later reuse.

func ComputeFolder

func ComputeFolder(folder string) *Matrix

Compute the distance matrix for all files in the folder.

func ComputeString

func ComputeString(ss []string) (mat *Matrix)

Compute the distance matrix between strings.

func (*Matrix) Dist

func (m *Matrix) Dist(i, j int) float64

Get distance between i and j. This is the minimum interface required by the cluster package.

func (*Matrix) Set

func (m *Matrix) Set(i, j int, d float64)

Set a distance for (i,j). It also sets the same value for (j,i). Size will increase as needed.

func (*Matrix) Size

func (m *Matrix) Size() int

Provide the current size n of the matrix (n x n). It may increase dynamically when elements are added.
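
Since Set stores the same value for (i,j) and (j,i), a symmetric matrix only needs one triangle in memory. The sketch below shows one possible layout: the strict lower triangle stored row by row in a flat slice, growing on demand. This is an assumption about what "optimised for storage efficiency" could mean, not the package's actual representation; triMatrix is a hypothetical type, its diagonal is fixed at 0, and unset entries read as 0.

```go
package main

import "fmt"

// triMatrix stores only the entries with i > j, row by row.
// The zero value is an empty matrix, ready to use.
type triMatrix struct {
	n    int
	data []float64
}

// index maps (i, j) with i > j to a position in the flat slice.
func index(i, j int) int { return i*(i-1)/2 + j }

// Set stores d for both (i,j) and (j,i), growing the matrix if needed.
func (m *triMatrix) Set(i, j int, d float64) {
	if i < j {
		i, j = j, i
	}
	if i >= m.n {
		m.n = i + 1
		need := m.n * (m.n - 1) / 2
		m.data = append(m.data, make([]float64, need-len(m.data))...)
	}
	if i == j {
		return // assumption: the diagonal is always 0
	}
	m.data[index(i, j)] = d
}

// Dist returns the stored distance, or 0 on the diagonal and out of range.
func (m *triMatrix) Dist(i, j int) float64 {
	if i < j {
		i, j = j, i
	}
	if i == j || i >= m.n {
		return 0
	}
	return m.data[index(i, j)]
}

// Size returns n for the current n x n matrix.
func (m *triMatrix) Size() int { return m.n }

func main() {
	var m triMatrix
	m.Set(0, 2, 1.5)
	fmt.Println(m.Size(), m.Dist(2, 0), m.Dist(0, 2), len(m.data))
}
```

With this layout an n x n matrix needs n(n-1)/2 values instead of n squared, and growing it only appends to the slice, so existing entries keep their positions.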

func (*Matrix) String

func (m *Matrix) String() string

String to display a readable (possibly truncated) matrix.

type Vect

type Vect = []float64
