keyword

package
v0.0.0-...-efe2ce5
Published: Apr 25, 2024 License: Apache-2.0, MIT Imports: 7 Imported by: 0

Documentation

Overview

String processing helpers for fuzzy detection and normalized token matching against keyword lists.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func SlugContainsExplicitSlur

func SlugContainsExplicitSlur(raw string) string

For a small set of frequently-abused explicit slurs, checks for a permissive set of "l33t-speak" variations of the keyword. This is intended to be used with pre-processed "slugs", which are strings with all whitespace, punctuation, and other such characters removed. These could be pre-processed identifiers (like handles or record keys), or pre-processed free-form text.

If there is a match, returns a plain-text version of the slur.

This is a loose port of the 'hasExplicitSlur' function from the `@atproto/pds` TypeScript package.
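The general approach can be sketched as follows. Since the package's actual keyword list and substitution table are not shown here, this sketch uses a harmless placeholder keyword ("badword") and a hypothetical l33t-speak table; it illustrates the containment-matching technique, not the package's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// Placeholder keyword list; the real function targets a small set of
// frequently-abused explicit slurs.
var keywords = []string{"badword"}

// leet maps common "l33t-speak" substitutions back to plain letters.
// This is a hypothetical table for illustration, not the package's own.
var leet = strings.NewReplacer(
	"0", "o", "1", "l", "3", "e", "4", "a",
	"5", "s", "7", "t", "@", "a", "$", "s",
)

// slugContainsKeyword checks a pre-processed slug for l33t-speak
// variations of the keywords, returning the plain-text keyword on match
// and the empty string otherwise.
func slugContainsKeyword(slug string) string {
	norm := leet.Replace(strings.ToLower(slug))
	for _, kw := range keywords {
		if strings.Contains(norm, kw) {
			return kw
		}
	}
	return ""
}

func main() {
	fmt.Println(slugContainsKeyword("xxb4dw0rdxx")) // "badword"
	fmt.Println(slugContainsKeyword("harmless"))    // ""
}
```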

func SlugIsExplicitSlur

func SlugIsExplicitSlur(raw string) string

Variant of `SlugContainsExplicitSlur` where the entire slug must match.
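The whole-slug variant differs only in comparing for equality rather than containment. A self-contained sketch, again with a placeholder keyword and a hypothetical substitution table:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical l33t-speak substitution table for illustration.
var leet = strings.NewReplacer("0", "o", "3", "e", "4", "a")

// slugIsKeyword is the whole-slug variant: the normalized slug must
// equal the keyword exactly, not merely contain it. "badword" is a
// harmless placeholder keyword.
func slugIsKeyword(slug string) string {
	norm := leet.Replace(strings.ToLower(slug))
	if norm == "badword" {
		return "badword"
	}
	return ""
}

func main() {
	fmt.Println(slugIsKeyword("b4dw0rd"))     // "badword"
	fmt.Println(slugIsKeyword("xxb4dw0rdxx")) // "": contains, but does not equal
}
```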

func Slugify

func Slugify(orig string) string

Takes an arbitrary string (eg, an identifier or free-form text) and returns a version with all non-letter, non-digit characters removed and all letters lower-cased.
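The described behavior can be sketched with the standard library's unicode classification functions; this is a minimal illustration, not the package's implementation:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// slugify keeps only letters and digits and lower-cases the result,
// mirroring the documented behavior of Slugify.
func slugify(orig string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(orig) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(slugify("The-Handle.bsky.social")) // "thehandlebskysocial"
}
```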

func TokenInSet

func TokenInSet(tok string, set []string) bool

Helper to check a single token against a list of tokens.
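A simple linear scan is sufficient for the short keyword lists this helper is meant for; a sketch of that approach:

```go
package main

import "fmt"

// tokenInSet reports whether tok appears in set. A linear scan keeps
// the helper allocation-free and is fine for short keyword lists; a
// map[string]bool would be preferable for very large sets.
func tokenInSet(tok string, set []string) bool {
	for _, s := range set {
		if s == tok {
			return true
		}
	}
	return false
}

func main() {
	set := []string{"bsky", "social"}
	fmt.Println(tokenInSet("bsky", set))  // true
	fmt.Println(tokenInSet("other", set)) // false
}
```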

func TokenizeIdentifier

func TokenizeIdentifier(orig string) []string

Splits an identifier into tokens. Removes any single-character tokens.

For example, the-handle.bsky.social would be split into ["the", "handle", "bsky", "social"]
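The documented splitting behavior can be reproduced with strings.FieldsFunc; a sketch under the assumption that any non-letter, non-digit rune is a separator:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenizeIdentifier lower-cases the identifier, splits on any
// non-letter, non-digit rune, and drops single-character tokens,
// matching the documented behavior of TokenizeIdentifier.
func tokenizeIdentifier(orig string) []string {
	fields := strings.FieldsFunc(strings.ToLower(orig), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	var out []string
	for _, f := range fields {
		if len(f) > 1 {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	fmt.Println(tokenizeIdentifier("the-handle.bsky.social"))
	// [the handle bsky social]
}
```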

func TokenizeText

func TokenizeText(text string) []string

Splits free-form text into tokens, including lower-casing, Unicode normalization, and some Unicode folding.

The intent is for this to work similarly to an NLP tokenizer, as might be used in a fulltext search engine, and enable fast matching to a list of known tokens. It might eventually even do stemming, removing pluralization (trailing "s" for English), etc.
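A simplified, stdlib-only sketch of this tokenization: lower-case the text and split on non-letter, non-digit runes. The Unicode normalization and folding steps the real function performs (which would typically use golang.org/x/text) are omitted here to keep the example self-contained:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenizeText lower-cases free-form text and splits it on any
// non-letter, non-digit rune. This omits the Unicode normalization and
// folding that the real TokenizeText applies.
func tokenizeText(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func main() {
	fmt.Println(tokenizeText("Hello, fuzzy World!"))
	// [hello fuzzy world]
}
```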

Types

This section is empty.

Directories

Path Synopsis
cmd
