symbolset

package module
v0.0.0-...-262ae63 Latest Latest
Published: Mar 21, 2023 License: MIT Imports: 8 Imported by: 0

README

symbolset

Symbolset is a repository for handling phonetic symbol sets and mappers/converters between different symbol sets and languages. Written in Go.

I. Server installation

  1. Set up Go

    Download: https://golang.org/dl/ (1.13 or higher)
    Installation instructions: https://golang.org/doc/install

  2. Clone the source code

    $ git clone https://github.com/stts-se/symbolset.git
    $ cd symbolset

  3. Test (optional)

    symbolset$ go test ./...

  4. Pre-compile server (for faster execution times).

    symbolset$ cd server
    server$ go build .

II. Quick start: Start the server with the demo symbol sets

    server$ ./server -ss_files demo_files

III. Set up with Wikispeech symbol sets

  1. Clone Wikispeech lexdata (this might take a couple of minutes)

    $ git clone https://github.com/stts-se/wikispeech-lexdata.git

  2. Setup

    server$ bash setup.sh wikispeech-lexdata ss_files

  3. Start server

    server$ ./server -ss_files ss_files


This work was supported by the Swedish Post and Telecom Authority (PTS) through the grant "Wikispeech – en användargenererad talsyntes på Wikipedia" (2016–2017).

Documentation

Overview

Package symbolset is used to define symbol sets, such as NST-SAMPA, Wikispeech-SAMPA, and so on.

Each symbol set is defined in a .sym file including each symbol's corresponding IPA representation:

DESCRIPTION          SYMBOL   IPA	 IPA UNICODE          CATEGORY

Sample lines (Swedish Wikispeech SAMPA):

DESCRIPTION          SYMBOL   IPA	 IPA UNICODE          CATEGORY
sil                  i:       iː 	 U+0069U+02D0         Syllabic
aula                 au       a⁀ʊ	 U+0061U+2040U+028A   Syllabic
bok                  b        b  	 U+0062               NonSyllabic
forna                rn       ɳ  	 U+0273               NonSyllabic
syllable delimiter   .        .  	 U+002E               SyllableDelimiter
accent I             "        ˈ  	 U+02C8               Stress
accent II            ""       ˈ̀  	 U+02C8U+0300         Stress
secondary stress     %        ˌ  	 U+02CC               Stress

Note that the header is required on the first line. As the examples show, the IPA UNICODE value is written in the format U+<NUMBER>, one per code point, with no space between code points in a sequence.
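The IPA UNICODE column can be derived mechanically from the IPA column. The following is a minimal sketch using only the standard library; the function name ipaToUnicode is hypothetical and is not part of this package.

```go
package main

import (
	"fmt"
	"strings"
)

// ipaToUnicode renders each code point of s as U+XXXX (at least four
// uppercase hex digits) and concatenates the results without separators.
// Hypothetical helper, not part of the symbolset package.
func ipaToUnicode(s string) string {
	var b strings.Builder
	for _, r := range s {
		fmt.Fprintf(&b, "U+%04X", r)
	}
	return b.String()
}

func main() {
	// IPA strings taken from the sample lines above.
	for _, ipa := range []string{"iː", "a⁀ʊ", "ɳ"} {
		fmt.Printf("%s\t%s\n", ipa, ipaToUnicode(ipa))
	}
}
```

A helper like this can be handy for checking that the IPA and IPA UNICODE columns of a hand-edited .sym file agree.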

Each symbol set has a name, extracted from the .sym file name.

Legal categories (pre-defined in code):

Syllabic: syllabic phonemes (typically vowels and syllabic consonants)

NonSyllabic: non-syllabic phonemes (typically consonants)

Stress: stress and accent symbols (primary, secondary, tone accents, etc)

PhonemeDelimiter: phoneme delimiters (white space, empty string, etc)

SyllableDelimiter: syllable delimiters

MorphemeDelimiter: morpheme delimiters that need not align with morpheme boundaries in the decompounded orthography

CompoundDelimiter: compound delimiters that should be aligned with compound boundaries in the decompounded orthography

WordDelimiter: word delimiters

For real world examples (used for unit tests), see the test_data folder: https://github.com/stts-se/pronlex/tree/master/symbolset/test_data

Index

Constants

This section is empty.

Variables

var SymbolSetSuffix = ".sym"

SymbolSetSuffix defines the filename extension for symbol sets

Functions

func LoadSymbolSetsFromDir

func LoadSymbolSetsFromDir(dirName string) (map[string]SymbolSet, error)

LoadSymbolSetsFromDir loads all symbol sets from the specified folder (all files with the .sym extension)

Types

type IPASymbol

type IPASymbol struct {
	String  string
	Unicode string
}

IPASymbol is an IPA symbol string with its Unicode representation

type Symbol

type Symbol struct {
	String string
	Cat    SymbolCat
	Desc   string
	IPA    IPASymbol
}

Symbol represents a phoneme, stress or delimiter symbol used in transcriptions, including the IPA symbol with its Unicode representation

type SymbolCat

type SymbolCat int

SymbolCat is used to categorize transcription symbols.

const (
	// Syllabic is used for syllabic phonemes (typically vowels and syllabic consonants)
	Syllabic SymbolCat = iota

	// NonSyllabic is used for non-syllabic phonemes (typically consonants)
	NonSyllabic

	// Stress is used for stress and accent symbols (primary, secondary, tone accents, etc)
	Stress

	// PhonemeDelimiter is used for phoneme delimiters (white space, empty string, etc)
	PhonemeDelimiter

	// SyllableDelimiter is used for syllable delimiters
	SyllableDelimiter

	// MorphemeDelimiter is used for morpheme delimiters that need not align with
	// morpheme boundaries in the decompounded orthography
	MorphemeDelimiter

	// CompoundDelimiter is used for compound delimiters that should be aligned
	// with compound boundaries in the decompounded orthography
	CompoundDelimiter

	// WordDelimiter is used for word delimiters
	WordDelimiter
)

func (SymbolCat) String

func (i SymbolCat) String() string

type SymbolSet

type SymbolSet struct {
	Name    string
	Type    Type
	Symbols []Symbol

	// Phonemes: actual phonemes (syllabic and non-syllabic)
	Phonemes []Symbol

	// PhoneticSymbols: Phonemes and stress
	PhoneticSymbols []Symbol

	PhonemeRe     *regexp.Regexp
	SyllabicRe    *regexp.Regexp
	NonSyllabicRe *regexp.Regexp
	SymbolRe      *regexp.Regexp

	PhonemeDelimiter Symbol
	// contains filtered or unexported fields
}

SymbolSet is a struct for package private usage. To create a new 'SymbolSet' instance, use NewSymbolSet

func LoadSymbolSet

func LoadSymbolSet(fName string) (SymbolSet, error)

LoadSymbolSet loads a SymbolSet from file

func LoadSymbolSetWithName

func LoadSymbolSetWithName(name string, fName string) (SymbolSet, error)

LoadSymbolSetWithName loads a SymbolSet from file, and names the SymbolSet

func NewSymbolSet

func NewSymbolSet(name string, symbols []Symbol) (SymbolSet, error)

NewSymbolSet is a constructor for 'symbols' with built-in error checks

func NewSymbolSetWithTests

func NewSymbolSetWithTests(name string, symbols []Symbol, testLines []string, checkForDups bool) (SymbolSet, error)

NewSymbolSetWithTests is a constructor for 'symbols' with built-in error checks

func (SymbolSet) ContainsSymbols

func (ss SymbolSet) ContainsSymbols(trans string, symbols []Symbol) (bool, error)

ContainsSymbols checks if a transcription contains a certain phoneme symbol

func (SymbolSet) ConvertFromIPA

func (ss SymbolSet) ConvertFromIPA(trans string) (string, error)

ConvertFromIPA maps one input IPA transcription into the current symbol set

func (SymbolSet) ConvertToIPA

func (ss SymbolSet) ConvertToIPA(trans string) (string, error)

ConvertToIPA maps one input transcription string into an IPA transcription
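To illustrate the idea of mapping a transcription to IPA, here is a simplified, self-contained sketch that looks up each symbol in a table built from the sample rows in the overview. It assumes the input symbols are separated by whitespace and does no stress or accent handling; the names sampaToIPA and toIPA are hypothetical, and this is not the package's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// sampaToIPA holds a few rows from the Swedish Wikispeech SAMPA sample table.
var sampaToIPA = map[string]string{
	"i:": "iː",
	"au": "a⁀ʊ",
	"b":  "b",
	"rn": "ɳ",
	".":  ".",
	`"`:  "ˈ",
	"%":  "ˌ",
}

// toIPA maps a whitespace-delimited transcription to IPA, symbol by symbol.
// It returns an error for any symbol that is missing from the table.
func toIPA(trans string) (string, error) {
	fields := strings.Fields(trans)
	out := make([]string, 0, len(fields))
	for _, sym := range fields {
		ipa, ok := sampaToIPA[sym]
		if !ok {
			return "", fmt.Errorf("unknown symbol %q", sym)
		}
		out = append(out, ipa)
	}
	return strings.Join(out, " "), nil
}

func main() {
	res, err := toIPA(`" b i:`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(res)
}
```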

func (SymbolSet) Get

func (ss SymbolSet) Get(symbol string) (Symbol, error)

Get searches the SymbolSet for a symbol with the given string

func (SymbolSet) GetFromIPA

func (ss SymbolSet) GetFromIPA(ipa string) (Symbol, error)

GetFromIPA searches the SymbolSet for a symbol with the given IPA symbol string

func (SymbolSet) SplitIPATranscription

func (ss SymbolSet) SplitIPATranscription(input string) ([]string, error)

SplitIPATranscription splits the input IPA transcription into separate symbols

func (SymbolSet) SplitTranscription

func (ss SymbolSet) SplitTranscription(input string) ([]string, error)

SplitTranscription splits the input transcription into separate symbols

func (SymbolSet) ValidIPASymbol

func (ss SymbolSet) ValidIPASymbol(symbol string) bool

ValidIPASymbol checks if a string is a valid IPA symbol or not

func (SymbolSet) ValidSymbol

func (ss SymbolSet) ValidSymbol(symbol string) bool

ValidSymbol checks if a string is a valid symbol or not

type Type

type Type int

Type is used for accent placement, etc.

const (
	// CMU is used for the phone set used in the CMU lexicon
	CMU Type = iota

	// SAMPA is used for SAMPA transcriptions (http://www.phon.ucl.ac.uk/home/sampa/)
	SAMPA

	// IPA is used for IPA transcriptions
	IPA

	// Other is used for symbol sets not defined in the types above
	Other
)

func (Type) String

func (i Type) String() string

Directories

Path       Synopsis
converter  Package converter is used to convert between symbol sets from different languages.
mapper     Package mapper is used to map between different phonetic symbol sets, such as NST-SAMPA to Wikispeech-SAMPA, IPA to SAMPA, and so on.