wordsegmentation

Version: v0.0.0-...-17d2028 | Published: Jan 17, 2019 | License: MIT | Imports: 3 | Imported by: 0

Repository

github.com/AntoineAugusti/wordsegmentation

README

Word segmentation

Word segmentation is the process of dividing a phrase without spaces back into its constituent parts. For example, consider a phrase like "thisisatest". Humans can immediately identify that the correct phrase should be "this is a test".

Source and credits

This package is heavily inspired by the Python module grantjenks/wordsegment.

The package is based on code from the chapter Natural Language Corpus Data by Peter Norvig from the book Beautiful Data (Segaran and Hammerbacher, 2009).

Getting started

You can grab this package with the following command:

go get gopkg.in/antoineaugusti/wordsegmentation.v0

Usage

To use the default English corpus:

package main

import (
    "fmt"

    "github.com/antoineaugusti/wordsegmentation"
    "github.com/antoineaugusti/wordsegmentation/corpus"
)

func main() {
    // Load the default English corpus, which is built from the bundled TSV files
    englishCorpus := corpus.NewEnglishCorpus()
    fmt.Println(wordsegmentation.Segment(englishCorpus, "thisisatest"))
}

Unigrams and bigrams

Information: an n-gram is a contiguous sequence of n items from a given sequence of text or speech.

This package ships with an English corpus by default that is ready to use. Data files are derived from the Google web trillion word corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
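As an illustration of the n-gram definition above, the following sketch extracts unigrams (n = 1) and bigrams (n = 2) from a list of words. The ngrams function is a hypothetical helper written for this example and is not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// ngrams returns every contiguous run of n words, joined by a space.
// It returns nil when n is not positive or exceeds the number of words.
func ngrams(words []string, n int) []string {
	if n <= 0 || n > len(words) {
		return nil
	}
	out := make([]string, 0, len(words)-n+1)
	for i := 0; i+n <= len(words); i++ {
		out = append(out, strings.Join(words[i:i+n], " "))
	}
	return out
}

func main() {
	words := strings.Fields("this is a test")
	fmt.Printf("%q\n", ngrams(words, 1)) // ["this" "is" "a" "test"]
	fmt.Printf("%q\n", ngrams(words, 2)) // ["this is" "is a" "a test"]
}
```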

Using a custom corpus

If you want to use a custom corpus, implement the Corpus interface and pass your implementation to the Segment function.

The interface is as follows:

// Corpus gives access to the bigrams, the unigrams,
// the total number of words in the corpus,
// and a function to clean a string.
type Corpus interface {
    Bigrams() *models.Bigrams
    Unigrams() *models.Unigrams
    Total() float64
    Clean(string) string
}

Take a look at the English corpus source code to help you start!
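As a rough starting point, here is a sketch of an in-memory corpus. Because the definitions of models.Unigrams and models.Bigrams are not shown here, the Unigrams and Bigrams types below are local stand-ins (plain maps), and the Corpus interface is redeclared against them. The memoryCorpus type, its constructor, the choice of Total as the sum of unigram counts, and the cleaning rule are all assumptions for illustration; a real implementation must return the package's own models types.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Unigrams and Bigrams are hypothetical stand-ins for the package's models types.
type Unigrams map[string]float64
type Bigrams map[string]float64

// Corpus mirrors the shape of the interface above, using the stand-in types.
type Corpus interface {
	Bigrams() *Bigrams
	Unigrams() *Unigrams
	Total() float64
	Clean(string) string
}

// memoryCorpus keeps its n-gram counts in memory.
type memoryCorpus struct {
	unigrams Unigrams
	bigrams  Bigrams
	total    float64
}

// newMemoryCorpus builds a corpus whose total is the sum of the unigram counts
// (an assumption made for this sketch).
func newMemoryCorpus(unigrams Unigrams, bigrams Bigrams) *memoryCorpus {
	total := 0.0
	for _, count := range unigrams {
		total += count
	}
	return &memoryCorpus{unigrams: unigrams, bigrams: bigrams, total: total}
}

func (c *memoryCorpus) Bigrams() *Bigrams   { return &c.bigrams }
func (c *memoryCorpus) Unigrams() *Unigrams { return &c.unigrams }
func (c *memoryCorpus) Total() float64      { return c.total }

// Clean lowercases the input and drops everything except letters and digits.
func (c *memoryCorpus) Clean(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			return unicode.ToLower(r)
		}
		return -1 // a negative value tells strings.Map to drop the rune
	}, s)
}

// Compile-time check that memoryCorpus satisfies the stand-in interface.
var _ Corpus = (*memoryCorpus)(nil)

func main() {
	c := newMemoryCorpus(Unigrams{"this": 3, "is": 2}, Bigrams{"this is": 1})
	fmt.Println(c.Total())                   // 5
	fmt.Println(c.Clean("This is, a TEST!")) // thisisatest
}
```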

Documentation

The documentation of this package can be found on GoDoc. The package is split into the following sub-packages:

corpus - the default English corpus
helpers - small functions to get the length of a string, remove special characters from a string, and get the minimum of two integers
models - the various objects used (Unigrams, Bigrams, Arrangement, Candidate, Possibility)
parsers - parsers to read tab-separated files into Unigrams and Bigrams
segment - the 'main' package

Documentation

Index

func Segment(corp Corpus, text string) []string
type Corpus

Functions

func Segment

func Segment(corp Corpus, text string) []string

Segment returns the list of words that is the best segmentation of the given text.

Types

type Corpus

type Corpus interface {
	Bigrams() *m.Bigrams
	Unigrams() *m.Unigrams
	Total() float64
	Clean(string) string
}

Corpus gives access to the bigrams, the unigrams, the total number of words in the corpus, and a function to clean a string.

This is the interface you will need to implement if you want to use a custom corpus.

Source Files

wordsegmentation.go

Directories

Path	Synopsis
corpus
data
english
helpers
models
parsers
