langdet

v0.0.3
Published: Feb 1, 2023 License: ISC Imports: 4 Imported by: 0

README

langdet - Language Detection for Go

Overview

Package langdet detects natural languages in text using a straightforward implementation of trigram-based text categorization. The most commonly used languages worldwide are supported out of the box, but the code is flexible enough to accept any set of languages.

Langdet first detects the writing script in order to narrow down the number of languages to test against. Some writing scripts are used by only a single language (Korean, Greek, etc.). In that case the language is returned directly without trigram analysis. Otherwise, langdet matches each language profile registered under the detected writing script against the input text and returns a result set listing the languages ordered by confidence.
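
As an illustration of how trigram profiles can be compared, here is a minimal sketch of the classic rank-order ("out-of-place") distance used in n-gram text categorization. Whether langdet uses exactly this measure is an assumption; the trigram type and the outOfPlace function are hypothetical names that are not part of the package.

```go
package main

import "fmt"

// trigram is a local stand-in for a three-rune sequence (not langdet.Trigram).
type trigram [3]rune

// outOfPlace sums, for every trigram in the text profile, how far its rank
// is from the rank of the same trigram in the language profile. A trigram
// missing from the language profile costs the maximum penalty, len(lang).
// Both profiles are assumed to be ordered from most to least frequent.
func outOfPlace(text, lang []trigram) int {
	rank := make(map[trigram]int, len(lang))
	for i, t := range lang {
		rank[t] = i
	}
	dist := 0
	for i, t := range text {
		j, ok := rank[t]
		switch {
		case !ok:
			dist += len(lang)
		case i > j:
			dist += i - j
		default:
			dist += j - i
		}
	}
	return dist
}

func main() {
	lang := []trigram{{'t', 'h', 'e'}, {'i', 'n', 'g'}, {'a', 'n', 'd'}}
	text := []trigram{{'i', 'n', 'g'}, {'t', 'h', 'e'}, {'x', 'y', 'z'}}
	// A lower distance means the text profile is closer to the language profile.
	fmt.Println(outOfPlace(text, lang))
}
```

A detector built this way would compute the distance against every candidate profile and rank the candidates from lowest to highest distance.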

Install

go get -u github.com/askeladdk/langdet

Quickstart

Use DetectLanguage to detect the language of a string. It returns the BCP 47 language tag of the language with the highest probability. If no language was detected, the function returns language.Und.

detectedLanguage := langdet.DetectLanguage(s)

Use DetectLanguageWithOptions if you need more control. DetectLanguage is a shorthand for this function using DefaultOptions. Unlike DetectLanguage, DetectLanguageWithOptions returns a slice of Results listing the probabilities of all languages using the detected writing script ordered by probability.

results := langdet.DetectLanguageWithOptions(s, langdet.DefaultOptions)

Use Options to configure the detector. Any number of writing scripts and languages can be detected by setting the Scripts and Languages fields. Use the Train function to build language profiles. Use MinConfidence and MinRelConfidence to filter languages by confidence.

// trainingSet is a string of sample text written in the new language.
myLang := langdet.Language{
    Tag:      language.Make("zz"),
    Trigrams: langdet.Train(trainingSet),
}

// Only detect the Latin script and only test these three languages.
options := langdet.Options{
    Scripts: []*unicode.RangeTable{
        unicode.Latin,
    },
    Languages: map[*unicode.RangeTable]langdet.Languages{
        unicode.Latin: {
            Languages: []langdet.Language{
                langdet.Dutch,
                langdet.French,
                myLang,
            },
        },
    },
}

results := langdet.DetectLanguageWithOptions(s, options)

Read the rest of the documentation on pkg.go.dev. It's easy-peasy!

License

Package langdet is released under the terms of the ISC license.

Documentation

Overview

Package langdet detects natural languages in text.

Index

Constants

This section is empty.

Variables

var (
	BelarusianTag    = language.Make("be")
	BosnianTag       = language.Make("bs")
	IrishTag         = language.Make("ga")
	JavaneseTag      = language.Make("jv")
	LatinTag         = language.Make("la")
	LuxembourgishTag = language.Make("lb")
	MalteseTag       = language.Make("mt")
	MyanmarTag       = language.Make("my")
	OriyaTag         = language.Make("or")
	SundaneseTag     = language.Make("su")
	TibetanTag       = language.Make("bo")
)

Tags for languages missing from golang.org/x/text/language.

var Albanian = Language{
	Tag:      language.Albanian,
	Trigrams: _AlbanianTrigrams,
}

Albanian profiles the Albanian language.

var Belarusian = Language{
	Tag:      BelarusianTag,
	Trigrams: _BelarusianTrigrams,
}

Belarusian profiles the Belarusian language.

var Bosnian = Language{
	Tag:      BosnianTag,
	Trigrams: _BosnianTrigrams,
}

Bosnian profiles the Bosnian language.

var Bulgarian = Language{
	Tag:      language.Bulgarian,
	Trigrams: _BulgarianTrigrams,
}

Bulgarian profiles the Bulgarian language.

var Croatian = Language{
	Tag:      language.Croatian,
	Trigrams: _CroatianTrigrams,
}

Croatian profiles the Croatian language.

var Czech = Language{
	Tag:      language.Czech,
	Trigrams: _CzechTrigrams,
}

Czech profiles the Czech language.

var Danish = Language{
	Tag:      language.Danish,
	Trigrams: _DanishTrigrams,
}

Danish profiles the Danish language.

var DefaultOptions = Options{
	Languages: map[*unicode.RangeTable]Languages{
		unicode.Arabic: {
			DefaultTag: language.Arabic,
		},
		unicode.Armenian: {
			DefaultTag: language.Armenian,
		},
		unicode.Bengali: {
			DefaultTag: language.Bengali,
		},
		unicode.Cyrillic: {
			Languages: []Language{
				Belarusian,
				Bulgarian,
				Macedonian,
				Russian,
				Serbian,
				Ukrainian,
			},
		},
		unicode.Devanagari: {
			DefaultTag: language.Hindi,
		},
		unicode.Ethiopic: {
			DefaultTag: language.Amharic,
		},
		unicode.Javanese: {
			DefaultTag: JavaneseTag,
		},
		unicode.Latin: {
			Languages: []Language{
				Albanian,
				Bosnian,
				Croatian,
				Czech,
				Danish,
				Dutch,
				English,
				Estonian,
				Finnish,
				French,
				German,
				Hungarian,
				Icelandic,
				Irish,
				Italian,
				Latin,
				Latvian,
				Lithuanian,
				Luxembourgish,
				Maltese,
				NorwegianBokmål,
				NorwegianNynorsk,
				Polish,
				Portuguese,
				Romanian,
				Spanish,
				Slovak,
				Slovenian,
				Swedish,
				Turkish,
			},
		},
		unicode.Georgian: {
			DefaultTag: language.Georgian,
		},
		unicode.Greek: {
			DefaultTag: language.Greek,
		},
		unicode.Gujarati: {
			DefaultTag: language.Gujarati,
		},
		unicode.Gurmukhi: {
			DefaultTag: language.Punjabi,
		},
		unicode.Han: {
			DefaultTag: language.Chinese,
		},
		unicode.Hangul: {
			DefaultTag: language.Korean,
		},
		unicode.Hebrew: {
			DefaultTag: language.Hebrew,
		},
		HiraganaKatakana: {
			DefaultTag: language.Japanese,
		},
		unicode.Kannada: {
			DefaultTag: language.Kannada,
		},
		unicode.Khmer: {
			DefaultTag: language.Khmer,
		},
		unicode.Lao: {
			DefaultTag: language.Lao,
		},
		unicode.Malayalam: {
			DefaultTag: language.Malayalam,
		},
		unicode.Myanmar: {
			DefaultTag: MyanmarTag,
		},
		unicode.Oriya: {
			DefaultTag: OriyaTag,
		},
		unicode.Sinhala: {
			DefaultTag: language.Sinhala,
		},
		unicode.Sundanese: {
			DefaultTag: SundaneseTag,
		},
		unicode.Tamil: {
			DefaultTag: language.Tamil,
		},
		unicode.Telugu: {
			DefaultTag: language.Telugu,
		},
		unicode.Thai: {
			DefaultTag: language.Thai,
		},
		unicode.Tibetan: {
			DefaultTag: TibetanTag,
		},
	},

	Scripts: []*unicode.RangeTable{
		unicode.Latin,
		unicode.Han,
		unicode.Arabic,
		unicode.Devanagari,
		unicode.Bengali,
		unicode.Cyrillic,
		HiraganaKatakana,
		unicode.Javanese,
		unicode.Hangul,
		unicode.Telugu,
		unicode.Tamil,
		unicode.Gujarati,
		unicode.Kannada,
		unicode.Myanmar,
		unicode.Malayalam,
		unicode.Thai,
		unicode.Sundanese,
		unicode.Gurmukhi,
		unicode.Lao,
		unicode.Oriya,
		unicode.Ethiopic,
		unicode.Sinhala,
		unicode.Hebrew,
		unicode.Armenian,
		unicode.Khmer,
		unicode.Greek,
		unicode.Tibetan,
		unicode.Georgian,
	},
}

DefaultOptions is a default set of options that detects the most commonly used languages worldwide.

var Dutch = Language{
	Tag:      language.Dutch,
	Trigrams: _DutchTrigrams,
}

Dutch profiles the Dutch language.

var English = Language{
	Tag:      language.English,
	Trigrams: _EnglishTrigrams,
}

English profiles the English language.

var Estonian = Language{
	Tag:      language.Estonian,
	Trigrams: _EstonianTrigrams,
}

Estonian profiles the Estonian language.

var Finnish = Language{
	Tag:      language.Finnish,
	Trigrams: _FinnishTrigrams,
}

Finnish profiles the Finnish language.

var French = Language{
	Tag:      language.French,
	Trigrams: _FrenchTrigrams,
}

French profiles the French language.

var German = Language{
	Tag:      language.German,
	Trigrams: _GermanTrigrams,
}

German profiles the German language.

var HiraganaKatakana = &unicode.RangeTable{
	R16: append(unicode.Hiragana.R16, unicode.Katakana.R16...),
	R32: append(unicode.Hiragana.R32, unicode.Katakana.R32...),
}

HiraganaKatakana is the Unicode range table covering the Japanese Hiragana and Katakana scripts.
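
Japanese text usually mixes kana with Han characters, so a combined kana table gives script detection something specific to Japanese to look for. The following sketch shows the same membership test using only the standard library; containsKana is a hypothetical helper and is not part of langdet.

```go
package main

import (
	"fmt"
	"unicode"
)

// containsKana reports whether s contains at least one rune from the
// Hiragana or Katakana scripts. unicode.In checks several range tables
// at once, which has the same effect as testing against a merged table.
func containsKana(s string) bool {
	for _, r := range s {
		if unicode.In(r, unicode.Hiragana, unicode.Katakana) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(containsKana("日本語のテキスト")) // Han mixed with kana
	fmt.Println(containsKana("中文文本"))     // Han only
}
```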

var Hungarian = Language{
	Tag:      language.Hungarian,
	Trigrams: _HungarianTrigrams,
}

Hungarian profiles the Hungarian language.

var Icelandic = Language{
	Tag:      language.Icelandic,
	Trigrams: _IcelandicTrigrams,
}

Icelandic profiles the Icelandic language.

var Irish = Language{
	Tag:      IrishTag,
	Trigrams: _IrishTrigrams,
}

Irish profiles the Irish language.

var Italian = Language{
	Tag:      language.Italian,
	Trigrams: _ItalianTrigrams,
}

Italian profiles the Italian language.

var Latin = Language{
	Tag:      LatinTag,
	Trigrams: _LatinTrigrams,
}

Latin profiles the Latin language.

var Latvian = Language{
	Tag:      language.Latvian,
	Trigrams: _LatvianTrigrams,
}

Latvian profiles the Latvian language.

var Lithuanian = Language{
	Tag:      language.Lithuanian,
	Trigrams: _LithuanianTrigrams,
}

Lithuanian profiles the Lithuanian language.

var Luxembourgish = Language{
	Tag:      LuxembourgishTag,
	Trigrams: _LuxembourgishTrigrams,
}

Luxembourgish profiles the Luxembourgish language.

var Macedonian = Language{
	Tag:      language.Macedonian,
	Trigrams: _MacedonianTrigrams,
}

Macedonian profiles the Macedonian language.

var Maltese = Language{
	Tag:      MalteseTag,
	Trigrams: _MalteseTrigrams,
}

Maltese profiles the Maltese language.

var NorwegianBokmål = Language{
	Tag:      language.Norwegian,
	Trigrams: _NorwegianBokmålTrigrams,
}

NorwegianBokmål profiles the Norwegian Bokmål language.

var NorwegianNynorsk = Language{
	Tag:      language.Norwegian,
	Trigrams: _NorwegianNynorskTrigrams,
}

NorwegianNynorsk profiles the Norwegian Nynorsk language.

var Polish = Language{
	Tag:      language.Polish,
	Trigrams: _PolishTrigrams,
}

Polish profiles the Polish language.

var Portuguese = Language{
	Tag:      language.Portuguese,
	Trigrams: _PortugueseTrigrams,
}

Portuguese profiles the Portuguese language.

var Romanian = Language{
	Tag:      language.Romanian,
	Trigrams: _RomanianTrigrams,
}

Romanian profiles the Romanian language.

var Russian = Language{
	Tag:      language.Russian,
	Trigrams: _RussianTrigrams,
}

Russian profiles the Russian language.

var Serbian = Language{
	Tag:      language.Serbian,
	Trigrams: _SerbianTrigrams,
}

Serbian profiles the Serbian language.

var Slovak = Language{
	Tag:      language.Slovak,
	Trigrams: _SlovakTrigrams,
}

Slovak profiles the Slovak language.

var Slovenian = Language{
	Tag:      language.Slovenian,
	Trigrams: _SlovenianTrigrams,
}

Slovenian profiles the Slovenian language.

var Spanish = Language{
	Tag:      language.Spanish,
	Trigrams: _SpanishTrigrams,
}

Spanish profiles the Spanish language.

var Swedish = Language{
	Tag:      language.Swedish,
	Trigrams: _SwedishTrigrams,
}

Swedish profiles the Swedish language.

var Turkish = Language{
	Tag:      language.Turkish,
	Trigrams: _TurkishTrigrams,
}

Turkish profiles the Turkish language.

var Ukrainian = Language{
	Tag:      language.Ukrainian,
	Trigrams: _UkrainianTrigrams,
}

Ukrainian profiles the Ukrainian language.

Functions

func DetectLanguage

func DetectLanguage(s string) language.Tag

DetectLanguage is a shorthand that calls DetectLanguageWithOptions with the default options and returns the best detected language.

func DetectScript

func DetectScript(s string, scripts []*unicode.RangeTable) *unicode.RangeTable

DetectScript detects the dominant writing script of s.

Types

type Language

type Language struct {
	// Tag is the BCP 47 language tag.
	Tag language.Tag

	// Trigrams is the trigrams profile created by Train.
	Trigrams []Trigram
}

Language profiles a natural language.

type Languages

type Languages struct {
	// DefaultTag is the default language tag used if Languages is empty.
	DefaultTag language.Tag

	// Languages is the set of languages sharing the same writing script.
	// If this is empty or nil, the detected language is always DefaultTag.
	Languages []Language
}

Languages is a set of languages that share the same writing script.

type Options

type Options struct {
	// Scripts is the set of writing scripts to detect.
	Scripts []*unicode.RangeTable

	// Languages maps writing systems to a set of languages.
	Languages map[*unicode.RangeTable]Languages

	// MinConfidence is the minimum confidence that must be met
	// before DetectLanguage returns the detected language.
	MinConfidence float64

	// MinRelConfidence is the minimum confidence difference
	// that must be met between detected languages.
	// Languages that do not meet the minimum are filtered from the result.
	MinRelConfidence float64
}

Options configures the language detector.

type Result

type Result struct {
	// Tag is the detected language.
	Tag language.Tag

	// Confidence is the probability that this language is correct, between 0 and 1.
	Confidence float64
}

Result holds a detected language and confidence.

func DetectLanguageWithOptions

func DetectLanguageWithOptions(s string, options Options) []Result

DetectLanguageWithOptions detects the language of s configured by options. It returns a set of candidate languages ordered by confidence level. At least one result is always returned.

type Trigram

type Trigram [3]rune

Trigram is a tuple of three Unicode runes.

func Train

func Train(s string) []Trigram

Train counts all trigrams in s and orders them by frequency.
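
To make the idea concrete, here is a minimal sketch of counting trigrams and ordering them by frequency. It is not the package's Train function: the trigram type and countTrigrams are hypothetical, the sketch does no normalization (case folding, punctuation handling), and the alphabetical tie-break is an assumption added to keep the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// trigram is a local stand-in for a three-rune sequence (not langdet.Trigram).
type trigram [3]rune

func (t trigram) String() string { return string(t[:]) }

// countTrigrams slides a three-rune window over s, counts every trigram,
// and returns the distinct trigrams from most to least frequent.
// Trigrams with equal counts are ordered alphabetically.
func countTrigrams(s string) []trigram {
	runes := []rune(s)
	counts := make(map[trigram]int)
	for i := 0; i+3 <= len(runes); i++ {
		counts[trigram{runes[i], runes[i+1], runes[i+2]}]++
	}
	ordered := make([]trigram, 0, len(counts))
	for t := range counts {
		ordered = append(ordered, t)
	}
	sort.Slice(ordered, func(i, j int) bool {
		ci, cj := counts[ordered[i]], counts[ordered[j]]
		if ci != cj {
			return ci > cj
		}
		return ordered[i].String() < ordered[j].String()
	})
	return ordered
}

func main() {
	// "banana" contains "ana" twice and "ban" and "nan" once each.
	for _, t := range countTrigrams("banana") {
		fmt.Println(t)
	}
}
```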

func (Trigram) MarshalText

func (t Trigram) MarshalText() ([]byte, error)

MarshalText implements encoding.TextMarshaler.

func (Trigram) String

func (t Trigram) String() string

String implements fmt.Stringer.

func (*Trigram) UnmarshalText

func (t *Trigram) UnmarshalText(b []byte) error

UnmarshalText implements encoding.TextUnmarshaler.

Directories

Path Synopsis
cmd