gomtch

package module
v1.1.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Aug 11, 2021 License: BSD-3-Clause Imports: 12 Imported by: 0

README

gomtch - find text even if it doesn't want to be found

Do your users have clever ways to hide some terms from you? Sometimes it is hard to find forbidden terms when the user doesn't want it to be found.

technology Go Build Status GoDoc GoReportCard

gomtch aims to help you find tokens in real life text offering the flexibility that most out-of-the-box algorithms lack. Ever wanted to find instances of a split word in text corpora (s p l i t e d)? Most NLP algorithms require a lot of normalization what could warm the integrity of the text corpora you are working with. gomtch looks for instances of splited words making the whole process easier. Also, the classic duplicated character problem (reeeeal), gomtch takes care of that for you as well. Finally, gomtch gives you the possibility to choose how to analise a potentially dangerous text corpora by considering special characters and digits as wild cards and leaving to you to choose how much (%) of a term should be considered (ex: h4rd matches 90% with the word hard).

https://nicolasassi.medium.com/gomtch-find-text-even-if-it-doesnt-want-to-be-found-a2229aed2a88

Table of Contents

Installation

just your good old go get

$ go get github.com/nicolasassi/gomtch

(optional) To run unit tests:

$ cd $GOPATH/src/github.com/nicolasassi/gomtch
$ go test

(optional) To run benchmarks (warning: it might take some time):

$ cd $GOPATH/src/github.com/nicolasassi/gomtch
$ go test -bench=".*"

Docs

https://pkg.go.dev/github.com/nicolasassi/gomtch

API

gomtch exposes a Document interface. The porpoise of it Document is to be compared with another Document interface. Keep in mind that one Document can be compared with as many Documents as necessary. Use the Scan() in the reference Document with the Documents do be compared as arguments (ex: referenceDocument.Scan(doc1, doc2 doc3...)).

gomtch provides a variety of text normalization features. Some features already implemented are:

  • HTML parsing (remove any HTML tags and keep the text)
  • Sequential character removal (reaaal = real)
  • Upper and lower normalization
  • Unicode normalization (canção = cancao)
  • Replace any unwanted token with a regexp

The implementation of Document requires the field matchScoreFunc with has the following signature func(int, int) bool. This field is used to determine the percentage of a token that should match a token.

Note that the matching behaves differently when comparing digits, letters and special characters.

Examples

Simple document
package main

import (
    "bytes"
    "github.com/nicolasassi/gomtch"
    "log"
)

func main() {
    text := []byte("this is a text c o r p o r a")
    tokenToFind := []byte("corpora")
    corp, err := gomtch.NewDoc(bytes.NewReader(text))
    if err != nil {
        log.Fatal(err)
    }
    match, err := gomtch.NewDoc(bytes.NewReader(tokenToFind))
    if err != nil {
        log.Fatal(err)
    }
    for index, match := range corp.Scan(match) {
        log.Printf("index: %v match: %s", index, string(match))
    }
}
Playing with matching scores
package main

import (
  "bytes"
  "github.com/nicolasassi/gomtch"
  "log"
)

func main() {
  text := []byte("this is a text corp0ra")
  tokenToFind := []byte("corpora")
  corp, err := gomtch.NewDoc(bytes.NewReader(text))
  if err != nil {
    log.Fatal(err)
  }
  // this will not match because the default minimum match score of NewDoc is 100 and
  // "corp0ra" != "corpora"
  match1, err := gomtch.NewDoc(bytes.NewReader(tokenToFind))
  if err != nil {
    log.Fatal(err)
  }
  // this will match because 90% of len(tokenToFind) == +- 6. This means that there is space for
  // one not matching letter.
  match2, err := gomtch.NewDoc(bytes.NewReader(tokenToFind), gomtch.WithMinimumMatchScore(90))
  if err != nil {
    log.Fatal(err)
  }
  for index, match := range corp.Scan(match1, match2) {
    log.Printf("index: %v match: %s", index, string(match))
  }
}
Complex document and conditions
package main

import (
  "bytes"
  "github.com/nicolasassi/gomtch"
  "log"
  "regexp"
)

func main() {
    text := []byte("<p>This is REAAAAAAL WORLD example of a téxt quite h4rd to match!!<p>")
    corp, err := gomtch.NewDoc(bytes.NewReader(text),
        gomtch.WithSetLower(),
        gomtch.WithSequentialEqualCharsRemoval(),
        gomtch.WithHMTLParsing(),
        gomtch.WithReplacer(regexp.MustCompile(`[\[\]()\-.,:;{}"'!?]`), " "),
        gomtch.WithTransform(gomtch.NewASCII()))
    if err != nil {
        log.Fatal(err)
    }
    // this will match because we set that sequential equal caracters shoud removed and the
    // text should be all in lower case.
    // So REAAAAAL becames real
    match1, err := gomtch.NewDoc(bytes.NewReader([]byte("real world")))
    if err != nil {
        log.Fatal(err)
    }
    // this will match because we allow each token to have a 90% minimum match score.
    match2, err := gomtch.NewDoc(bytes.NewReader([]byte("text quite hard to match")),
        gomtch.WithMinimumMatchScore(90))
    if err != nil {
        log.Fatal(err)
    }
    for index, match := range corp.Scan(match1, match2) {
        log.Printf("index: %v match: %s", index, string(match))
    }
}

Support

There are a number of ways you can support the project:

  • Use it, star it, build something with it, spread the word!
  • Raise issues to improve the project (note: doc typos and clarifications are issues too!)
    • Please search existing issues before opening a new one - it may have already been addressed.
  • Pull requests: please discuss new code in an issue first, unless the fix is really trivial.
    • Make sure new code is tested.
    • Be mindful of existing code - PRs that break existing code have a high probability of being declined, unless it fixes a serious issue.

License

The BSD 3-Clause license, the same as the Go language.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type ASCII

type ASCII struct {
	// contains filtered or unexported fields
}

func NewASCII

func NewASCII() *ASCII

func (ASCII) Transform

func (a ASCII) Transform(s string) (string, error)

type Document

type Document struct {
	Text   string
	Tokens []string
	// contains filtered or unexported fields
}

func NewDocument

func NewDocument(text string, opts ...Option) (*Document, error)

func NewDocumentFromReader

func NewDocumentFromReader(text io.Reader, opts ...Option) (*Document, error)

func (Document) Compare

func (d Document) Compare(tokens Tokens) (bool, []rune)

func (Document) CompareRune

func (d Document) CompareRune(a, b rune) bool

CompareRune compares one rune to another and returns true if there is a match. Besides checking equality by the standard form (==) it also applies some rules to check if the compared value might be the same as the reference but is masked somehow. Numbers and numerical info must match exactly. Also if both A and B are letters both should match as well. If the compared entities is not a letter, number or numerical info and the reference is not a number or numerical info, it will match.

func (Document) IsEqual

func (d Document) IsEqual(a, b []rune) bool

IsEqual compares A and B and returns true if they probably are the same word and false otherwise. if A and B have different lengths it will return false. The A and B variables are not interchangeable as A represents the entities to be compared to B, the reference. IsEqual uses the minimumMatchScore to determine if the words are the same even if there are differences between then. Numbers, numericalInfo and letters should match exactly and the matches increase the counter for the minimumMatchScore, otherwise it returns false immediately.

func (Document) Scan

func (d Document) Scan(docs ...Documenter) Matches

func (Document) String

func (d Document) String() string

type Documenter

type Documenter interface {
	Compare(ref Tokens) (bool, []rune)
	IsEqual(a, b []rune) bool
	CompareRune(a, b rune) bool
	fmt.Stringer
}

type Mapper

type Mapper interface {
	Map() Tokens
	Items() map[string][]int
	AddIndex(v string, index int) Mapping
}

type Mapping

type Mapping map[string][]int

func NewMappingFromTokens

func NewMappingFromTokens(tokens []string) Mapping

func (Mapping) AddIndex

func (m Mapping) AddIndex(v string, index int) Mapping

func (Mapping) Items

func (m Mapping) Items() map[string][]int

func (Mapping) Map

func (m Mapping) Map() Tokens

type Matches

type Matches map[int][]rune

type Option

type Option func(*Document)

func WithConditionalMatchScore

func WithConditionalMatchScore(f func(int, int) bool) Option

func WithCustomRegexpTokenizer

func WithCustomRegexpTokenizer(t *tokenize.RegexpTokenizer) Option

func WithHMTLParsing

func WithHMTLParsing() Option

func WithMinimumMatchScore

func WithMinimumMatchScore(score int) Option

func WithReplacer

func WithReplacer(pattern *regexp.Regexp, rep string) Option

func WithSequentialEqualCharsRemoval

func WithSequentialEqualCharsRemoval() Option

func WithSetLower

func WithSetLower() Option

func WithSetUpper

func WithSetUpper() Option

func WithTransform

func WithTransform(t Transformer) Option

type Scanner

type Scanner interface {
	Scan(docs ...Documenter) Matches
}

type Tokens

type Tokens struct {
	Values map[int][]rune
	Ids    []int
}

func (Tokens) GetRunesByID

func (t Tokens) GetRunesByID(id int) []rune

type Transformer

type Transformer interface {
	Transform(s string) (string, error)
}

Directories

Path Synopsis
examples

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL