Documentation ¶
Overview ¶
Package rbg2p contains utilities for rule-based, manually written grapheme-to-phoneme (g2p) rules.
Each g2p rule set is defined in a .g2p file with the following content:
- specific variables - used to define constant variables such as character set and phoneme delimiter
- variables - any variables for use in the context of the actual rules
- sylldef - definitions for dividing transcriptions into syllables
- rules - g2p rules
- filters - transcription filters applied after the rules
- tests - input/output tests
- comments
SPECIFIC VARIABLES ¶
Defines a set of constant variables, such as character set and phoneme delimiter. Please note that quotes are required around the value, since space and the empty string can be used as a value.
<NAME> "<VALUE>"
Available variables (* means required):
CHARACTER_SET* (default: none) - used to check that each character in the character set has at least one rule
PHONEME_SET (default: none) - space separated symbol set, used to validate the phonemes in the g2p rules
DEFAULT_PHONEME (default: "_") - used for unknown input (orthographic) symbols
PHONEME_DELIMITER (default: " ") - used to concatenate phonemes into a transcription
DOWNCASE_INPUT (default: true)
Examples:
CHARACTER_SET "abcdefghijklmnopqrstuvwxyzåäö"
PHONEME_SET "a au o u i y e eu p t k b d g r s f h j l v w m n S tS"
DEFAULT_PHONEME "_"
PHONEME_DELIMITER " "
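To illustrate why the quotes matter, here is a minimal standalone sketch of splitting a `<NAME> "<VALUE>"` line into its two parts. The function `parseConstLine` and its regular expression are hypothetical and not part of the rbg2p API; the package's own parser may behave differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// constLineRe matches <NAME> "<VALUE>". The value may be empty or contain
// spaces, which is why the surrounding quotes are needed.
// Hypothetical pattern, for illustration only.
var constLineRe = regexp.MustCompile(`^([A-Z_]+)\s+"(.*)"$`)

// parseConstLine returns the name and value of a constant definition line.
// ok is false if the line does not have the expected shape.
func parseConstLine(line string) (name, value string, ok bool) {
	m := constLineRe.FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	for _, line := range []string{
		`DEFAULT_PHONEME "_"`,
		`PHONEME_DELIMITER " "`,
		`PHONEME_DELIMITER ""`,
	} {
		name, value, ok := parseConstLine(line)
		fmt.Printf("%s=%q ok=%v\n", name, value, ok)
	}
}
```

Without the quotes, a single space and the empty string would be indistinguishable from trailing whitespace.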
VARIABLES ¶
Regexp variables prefixed by VAR, that can be used in the rule context and filters as exemplified below. The variable names must not contain underscore (_).
VAR <NAME> <VALUE>
Examples:
VAR VOWEL [aeyuio]
VAR AFFRICATE (tS|dZ)
VAR VOICELESS [ptksf]
SYLLDEF ¶
A set of variables prefixed by SYLLDEF, used for syllabification (not required).
SYLLDEF <NAME> "<VALUE>"
Currently, only maximum onset (MOP) syllabification can be used. Variables currently available:
TYPE (default: MOP) - currently, the only value allowed here is MOP
ONSETS - a comma separated list of valid syllable onsets (typically consonant clusters)
SYLLABIC - a space separated list of syllabic phonemes (typically vowels)
STRESS - a space separated list of stress symbols
DELIMITER - syllable delimiter symbol
Examples:
SYLLDEF TYPE MOP
SYLLDEF ONSETS "p, b, t, rt, m, n, d, rd, k, g, rn, f, v, C, rs, r, l, s, x, S, h, rl, j, s, p, r, rs p r, s p l, rs p l, s p j, rs p j, s t r, rs rt r, s k r, rs k r, s k v, rs k v, p r, p j, p l, b r, b j, b l, t r, rt r, t v, rt v, d r, rd r, d v, rd v, k r, k l, k v, k n, g r, g l, g n, f r, f l, f j, f n, v r, s p, s t, s k, s v, s l, s m, s n, n j, rs p, rs rt, rs k, rs v, rs rl, rs m, rs rn, rn j, m j"
SYLLDEF SYLLABIC "i: I u0 }: a A: u: U E: {: E { au y: Y e: e 2: 9: 2 9 o: O @ eu"
SYLLDEF STRESS "\" %"
SYLLDEF DELIMITER "."
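The idea behind maximum onset syllabification can be sketched in a few lines: of the consonants between two syllabic phonemes, the longest trailing cluster that is a valid onset starts the next syllable. The code below is a simplified standalone illustration, not the package's implementation. The function `syllabify` is hypothetical, and stress symbols are ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// syllabify inserts delim between syllables using the maximum onset
// principle. Stress symbols are not handled in this sketch.
func syllabify(phonemes []string, syllabic, onsets map[string]bool, delim string) string {
	// Collect the positions of the syllabic phonemes (nuclei).
	var nuclei []int
	for i, p := range phonemes {
		if syllabic[p] {
			nuclei = append(nuclei, i)
		}
	}
	// boundary[i] is true if a syllable boundary precedes phonemes[i].
	boundary := make(map[int]bool)
	for n := 1; n < len(nuclei); n++ {
		prev, next := nuclei[n-1], nuclei[n]
		// Default: no valid onset found, all consonants stay in the coda.
		split := next
		// Try the longest candidate onset first.
		for start := prev + 1; start < next; start++ {
			if onsets[strings.Join(phonemes[start:next], " ")] {
				split = start
				break
			}
		}
		boundary[split] = true
	}
	var out []string
	for i, p := range phonemes {
		if boundary[i] {
			out = append(out, delim)
		}
		out = append(out, p)
	}
	return strings.Join(out, " ")
}

func main() {
	syllabic := map[string]bool{"AX": true, "AA": true}
	onsets := map[string]bool{"P": true, "K": true, "S": true, "M": true, "P R": true}
	phonemes := strings.Fields("AX P R AA K S AX M AX T")
	fmt.Println(syllabify(phonemes, syllabic, onsets, "$"))
}
```

In this example "K S" is not a listed onset, so only "S" moves to the following syllable while "K" stays in the coda of the preceding one.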
RULES ¶
Grapheme to phoneme rules written in a format loosely based on phonotactic rules. The rules are ordered, and typically the rule order is of great importance.
<INPUT> -> <OUTPUT>
<INPUT> -> <OUTPUT> / <CONTEXT>
<INPUT> -> (<OUTPUT1>, <OUTPUT2>)
<INPUT> -> (<OUTPUT1>, <OUTPUT2>) / <CONTEXT>
Context:
<LEFT CONTEXT> _ <RIGHT CONTEXT>
<INPUT> is a string of one or more input characters. <OUTPUT> is a string representing the output (separated by the pre-defined phoneme delimiter, above). For empty output, i.e., when a character should not be pronounced, use the empty set symbol "∅" (U+2205).
<CONTEXT> is the context in which the <INPUT> should occur for the rule to apply. Pre-defined variables (above) can be used in the context specs. # is used for anchoring (marks the start/end of the input string).
Examples:
a -> ? a / # _
a -> a
e -> e
skt -> (s t, s k t) / _
ck -> k
b -> p / _ VOICELESS
h -> ∅ / # _
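The following standalone sketch shows one way ordered, context-sensitive rules of this kind could be applied. It assumes that the first matching rule at each position wins, and it uses Go's standard regexp package for the contexts. The `rule` type and `transcribe` function are hypothetical and much simpler than the package's own Rule and RuleSet (no output variants, no filters). The rules for "b" and "t" without context are added for the sake of the example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// rule is a simplified, hypothetical stand-in for a g2p rule.
type rule struct {
	input  string
	output string         // "" plays the role of ∅ (no output)
	left   *regexp.Regexp // matched against the text before the input; nil = any
	right  *regexp.Regexp // matched against the text after the input; nil = any
}

var (
	wordStart = regexp.MustCompile(`^$`)       // "# _": nothing precedes the input
	voiceless = regexp.MustCompile(`^[ptksf]`) // "_ VOICELESS"
)

// The order matters: more specific rules come before general ones.
var rules = []rule{
	{input: "a", output: "? a", left: wordStart}, // a -> ? a / # _
	{input: "a", output: "a"},                    // a -> a
	{input: "ck", output: "k"},                   // ck -> k
	{input: "b", output: "p", right: voiceless},  // b -> p / _ VOICELESS
	{input: "b", output: "b"},                    // b -> b (added for the sketch)
	{input: "h", output: "", left: wordStart},    // h -> ∅ / # _
	{input: "t", output: "t"},                    // t -> t (added for the sketch)
}

// transcribe applies the first matching rule at each position. Characters
// without a matching rule yield the default phoneme "_".
func transcribe(word string) string {
	var phonemes []string
	for pos := 0; pos < len(word); {
		matched := false
		for _, r := range rules {
			if !strings.HasPrefix(word[pos:], r.input) {
				continue
			}
			end := pos + len(r.input)
			if r.left != nil && !r.left.MatchString(word[:pos]) {
				continue
			}
			if r.right != nil && !r.right.MatchString(word[end:]) {
				continue
			}
			if r.output != "" {
				phonemes = append(phonemes, r.output)
			}
			pos = end
			matched = true
			break
		}
		if !matched {
			phonemes = append(phonemes, "_")
			_, size := utf8.DecodeRuneInString(word[pos:])
			pos += size
		}
	}
	return strings.Join(phonemes, " ")
}

func main() {
	for _, w := range []string{"abt", "hat", "back", "ax"} {
		fmt.Printf("%s -> %s\n", w, transcribe(w))
	}
}
```

Swapping the first two rules would make the context-free rule for "a" shadow the word-initial one, which is why the more specific rule is listed first.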
PREFILTERS ¶
Regexp replacement filters for transcriptions. The filters are applied before the g2p rules. Pre-defined variables (above) can be used in the input regexp, surrounded by curly brackets.
PREFILTER "<FROM RE>" -> "<TO STRING>"
PREFILTER "<FROM RE WITH {VARIABLENAME}>" -> "<TO STRING>"
Example:
PREFILTER "п" -> "p" // cyrillic char to latin
FILTERS ¶
Regexp replacement filters for transcriptions. The filters are applied after the g2p rules. Pre-defined variables (above) can be used in the input regexp, surrounded by curly brackets.
FILTER "<FROM RE>" -> "<TO STRING>"
FILTER "<FROM RE WITH {VARIABLENAME}>" -> "<TO STRING>"
Example:
FILTER "^" -> "\" " // place stress first in transcription
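A filter of this form amounts to variable expansion followed by a regexp replacement. The sketch below shows that idea using Go's standard regexp package. The functions `expandVars` and `applyFilter` are hypothetical names, and the package itself may expand variables and handle the replacement string differently.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// expandVars replaces each {NAME} in the pattern with the variable's value.
func expandVars(pattern string, vars map[string]string) string {
	for name, value := range vars {
		pattern = strings.ReplaceAll(pattern, "{"+name+"}", value)
	}
	return pattern
}

// applyFilter runs a single regexp replacement on a transcription.
// Note that Go's ReplaceAllString interprets $1, $name etc. in the
// replacement string; the examples here do not use that.
func applyFilter(trans, from, to string, vars map[string]string) (string, error) {
	re, err := regexp.Compile(expandVars(from, vars))
	if err != nil {
		return "", err
	}
	return re.ReplaceAllString(trans, to), nil
}

func main() {
	vars := map[string]string{"AFFRICATE": "(tS|dZ)"}

	// FILTER "^" -> "\" " : place stress first in transcription
	stressed, err := applyFilter("h i t", "^", "\" ", vars)
	fmt.Printf("%q %v\n", stressed, err)

	// A made-up filter using a variable in curly brackets
	merged, err := applyFilter("tS a dZ", "{AFFRICATE}", "T", vars)
	fmt.Printf("%q %v\n", merged, err)
}
```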
COMMENTS ¶
Comments are prefixed by // or #
TESTS ¶
Test examples prefixed by TEST:
TEST <INPUT> -> <OUTPUT>
or with variants:
TEST <INPUT> -> (<OUTPUT1>, <OUTPUT2>)
Examples:
TEST hit -> h i t
TEST kex -> (k e k s, C e k s)
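As an illustration of the two TEST forms, the following standalone sketch splits a test line into its input and its list of expected transcription variants. The function `parseTest` is hypothetical; it is not how the package necessarily parses its tests.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// parseTest splits "TEST <INPUT> -> <OUTPUT>" into the input string and the
// expected outputs. A parenthesised, comma separated output yields several
// variants; a plain output yields exactly one.
func parseTest(line string) (string, []string, error) {
	if !strings.HasPrefix(line, "TEST ") {
		return "", nil, errors.New("not a TEST line: " + line)
	}
	parts := strings.SplitN(strings.TrimPrefix(line, "TEST "), "->", 2)
	if len(parts) != 2 {
		return "", nil, errors.New("missing -> in: " + line)
	}
	input := strings.TrimSpace(parts[0])
	expected := strings.TrimSpace(parts[1])

	var variants []string
	if strings.HasPrefix(expected, "(") && strings.HasSuffix(expected, ")") {
		for _, v := range strings.Split(expected[1:len(expected)-1], ",") {
			variants = append(variants, strings.TrimSpace(v))
		}
	} else {
		variants = []string{expected}
	}
	return input, variants, nil
}

func main() {
	for _, line := range []string{
		"TEST hit -> h i t",
		"TEST kex -> (k e k s, C e k s)",
	} {
		input, variants, err := parseTest(line)
		fmt.Printf("%q %q %v\n", input, variants, err)
	}
}
```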
---
SEPARATE SYLLABIFICATION RULE FILE ¶
A .syll file for syllabification contains a subset of the items used for a proper g2p.
Example (for the CMU lexicon):
PHONEME_SET "AA AE AH AX AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG OW OY P R S SH T TH UH UW V W Y Z ZH 1 2"
PHONEME_DELIMITER " "
SYLLDEF TYPE MOP
SYLLDEF ONSETS "P, T, K, B, D, G, CH, JH, F, V, T, D, S, Z, S, Z, H, L, M, N, N, R, W, J, P R, T R, B R, G R, S T R, S P R, S K R, P L, T L, B L, G L, S T L, S P L, S K L, S P, S T, S K"
SYLLDEF SYLLABIC "AA AE AH AX AO AW AY EH ER EY IH IY OW OY UH UW"
SYLLDEF STRESS "1 2"
SYLLDEF DELIMITER "$"
SYLLDEF TEST AX P R 1 AA K S AX M AX T -> AX $ P R 1 AA K $ S AX $ M AX T
SYLLDEF TEST W 1 UH D S T R 2 IY M -> W 1 UH D $ S T R 2 IY M
For details on the .g2p file format, check docs for the root folder of this package.
For more examples (used for unit tests), see the test_data folder: https://github.com/stts-se/rbg2p/tree/master/test_data
To test a single g2p file from the command line, use cmd/g2p.
To import and use the rbg2p rule package in another go program:
package main

import (
	"fmt"
	"log"

	"github.com/stts-se/rbg2p"
)

func main() {
	// TODO: initialize g2pFile and orth
	g2pFile, orth := "", ""

	// Load rule file
	ruleSet, err := rbg2p.LoadFile(g2pFile)
	if err != nil {
		log.Fatalf("couldn't load rule file: %v", err)
	}

	// Test rule set
	// testRes is an instance of rbg2p.TestResult
	// - you can do a quick check using testRes.Failed() to find out if there were any errors
	// - you can retrieve all errors using testRes.AllErrors()
	// - you can retrieve all errors and warnings using testRes.AllMessages()
	testRes := ruleSet.Test()
	if testRes.Failed() {
		log.Fatal(testRes.AllErrors())
	}

	// Transcribe an input word
	transes, err := ruleSet.Apply(orth)
	if err != nil {
		log.Fatalf("couldn't transcribe %q: %v", orth, err)
	}
	fmt.Println(transes)
}
Index ¶
- Variables
- func Contains(slice []string, value string) bool
- type Context
- type Filter
- type MOPSyllDef
- func (def MOPSyllDef) ContainsSyllabic(phonemes []string) bool
- func (def MOPSyllDef) IncludePhonemeDelimiter() bool
- func (def MOPSyllDef) IsDefined() bool
- func (def MOPSyllDef) IsStress(symbol string) bool
- func (def MOPSyllDef) IsSyllabic(phoneme string) bool
- func (def MOPSyllDef) PhonemeDelimiter() string
- func (def MOPSyllDef) StressPlacement() StressPlacement
- func (def MOPSyllDef) SyllableDelimiter() string
- func (def MOPSyllDef) ValidSplit(left0 []string, right0 []string) bool
- type PhonemeSet
- type Prefilter
- type Regexp
- type Rule
- type RuleSet
- type StressPlacement
- type SyllDef
- type SyllTest
- type Syllabifier
- type Test
- type TestResult
Constants ¶
This section is empty.
Variables ¶
var Debug = false
Functions ¶
Types ¶
type Context ¶
type Context struct {
	// Input is the regexp as written in the input string
	Input string
	// Regexp is the input string converted to a regular expression for internal use (with variables expanded, and adapted anchoring)
	Regexp *regexp2.Regexp
}
Context in which the rule applies (left hand/right hand context specified by a regular expression)
type Filter ¶
Filter is a regexp filter for rules that cannot be expressed using the standard rule system
type MOPSyllDef ¶
type MOPSyllDef struct {
	Onsets          []string
	Syllabic        []string
	PhnDelim        string
	SyllDelim       string
	Stress          []string
	StressPlcmnt    StressPlacement
	IncludePhnDelim bool
}
MOPSyllDef is a Maximum Onset Principle implementation of the SyllDef interface
func (MOPSyllDef) ContainsSyllabic ¶
func (def MOPSyllDef) ContainsSyllabic(phonemes []string) bool
ContainsSyllabic tells if the input phoneme slice contains any syllabic phonemes (required by interface)
func (MOPSyllDef) IncludePhonemeDelimiter ¶
func (def MOPSyllDef) IncludePhonemeDelimiter() bool
IncludePhonemeDelimiter defines whether the syllable boundaries should be surrounded by the phoneme delimiter
func (MOPSyllDef) IsDefined ¶
func (def MOPSyllDef) IsDefined() bool
IsDefined is used to determine if there is a syllabifier defined or not (required by interface)
func (MOPSyllDef) IsStress ¶
func (def MOPSyllDef) IsStress(symbol string) bool
IsStress is used to check if the input symbol is a stress symbol
func (MOPSyllDef) IsSyllabic ¶
func (def MOPSyllDef) IsSyllabic(phoneme string) bool
IsSyllabic is used to check if the input phoneme is syllabic
func (MOPSyllDef) PhonemeDelimiter ¶
func (def MOPSyllDef) PhonemeDelimiter() string
PhonemeDelimiter is the string used to separate phonemes (required by interface)
func (MOPSyllDef) StressPlacement ¶
func (def MOPSyllDef) StressPlacement() StressPlacement
StressPlacement returns where in a syllable the stress symbol should be placed (required by interface)
func (MOPSyllDef) SyllableDelimiter ¶
func (def MOPSyllDef) SyllableDelimiter() string
SyllableDelimiter is the string used to separate syllables (required by interface)
func (MOPSyllDef) ValidSplit ¶
func (def MOPSyllDef) ValidSplit(left0 []string, right0 []string) bool
ValidSplit is called by Syllabifier.Syllabify to test where to put the boundaries
type PhonemeSet ¶
type PhonemeSet struct {
	Symbols                   []string
	PhnDelim                  Regexp
	SyllDelim                 Regexp
	SyllDelimIncludesPhnDelim bool
}
PhonemeSet is a package internal container for the phoneme set definition
func LoadPhonemeSetFile ¶
func LoadPhonemeSetFile(fName string, syllDelimIncludesPhnDelim bool, syllDelimiter, phnDelimiter string) (PhonemeSet, error)
LoadPhonemeSetFile loads a phoneme set definition from file (one phoneme per line, // for comments)
func NewPhonemeSet ¶
func NewPhonemeSet(symbols []string, syllDelimIncludesPhnDelim bool, syllDelimiter, phnDelimiter string) (PhonemeSet, error)
NewPhonemeSet creates a phoneme set from a slice of symbols, and a phoneme delimiter string
func (PhonemeSet) SplitTranscription ¶
func (ps PhonemeSet) SplitTranscription(trans string) ([]string, error)
SplitTranscription splits the input transcription into a slice of phonemes, based on the pre-defined phoneme delimiter
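A rough standalone sketch of the idea follows: split on the delimiter, then check each symbol against the phoneme set. Two things here are assumptions. The real PhonemeSet stores its delimiter as a Regexp, whereas this sketch uses a plain string, and whether SplitTranscription itself validates symbols is not stated above. The function `splitTranscription` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTranscription splits trans on delim and returns an error if any of
// the resulting symbols is not in the given symbol set.
func splitTranscription(trans, delim string, symbols map[string]bool) ([]string, error) {
	phonemes := strings.Split(trans, delim)
	for _, p := range phonemes {
		if !symbols[p] {
			return nil, fmt.Errorf("unknown phoneme %q in transcription %q", p, trans)
		}
	}
	return phonemes, nil
}

func main() {
	symbols := map[string]bool{"h": true, "i": true, "t": true, "tS": true}

	phonemes, err := splitTranscription("h i tS", " ", symbols)
	fmt.Printf("%q %v\n", phonemes, err)

	_, err = splitTranscription("h x t", " ", symbols)
	fmt.Println(err)
}
```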
type Rule ¶
type Rule struct {
	Input        string
	Output       []string
	LeftContext  Context
	RightContext Context
	LineNumber   int // for debugging
}
Rule is a g2p rule representation
type RuleSet ¶
type RuleSet struct {
	CharacterSet      []string
	PhonemeSet        PhonemeSet
	PhonemeDelimiter  string
	SyllableDelimiter string
	DefaultPhoneme    string
	DowncaseInput     bool
	Vars              map[string]string
	Rules             []Rule
	RulesAppliedMutex *sync.RWMutex
	RulesApplied      map[string]int // for coverage checks
	Tests             []Test
	Filters           []Filter
	Prefilters        []Prefilter
	Syllabifier       Syllabifier
	Content           string
	Debug             bool
}
RuleSet is a set of g2p rules, with variables and built-in tests
func (RuleSet) Apply ¶
Apply applies the rules to an input string, returns a slice of transcriptions. If unknown input characters are found, an error will be created, and an underscore will be appended to the transcription. Even if an error is returned, the loop will continue until the end of the input string.
func (RuleSet) Test ¶
func (rs RuleSet) Test() TestResult
Test runs the built-in tests. Returns a test result with errors and warnings, if any.
type StressPlacement ¶
type StressPlacement int
StressPlacement is used to define where in a syllable the stress should be put in an output string
const (
	// Undefined - position not defined
	Undefined StressPlacement = iota
	// FirstInSyllable -- before the syllable's first phoneme
	FirstInSyllable
	// BeforeSyllabic -- before the first syllabic phoneme
	BeforeSyllabic
	// AfterSyllabic -- after the first syllabic phoneme
	AfterSyllabic
)
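The effect of the three defined placements can be shown with a small standalone sketch. It mirrors the constants in a local, lower-case type so that it runs without importing the package. The function `placeStress` is hypothetical, and treating the undefined value like "first in syllable" is an assumption made only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// stressPlacement is a local mirror of the StressPlacement values.
type stressPlacement int

const (
	undefined stressPlacement = iota
	firstInSyllable
	beforeSyllabic
	afterSyllabic
)

// placeStress inserts the stress symbol into one syllable and returns the
// result as a space separated string.
func placeStress(syllable []string, stress string, syllabic map[string]bool, p stressPlacement) string {
	// Find the first syllabic phoneme, if any.
	nucleus := -1
	for i, ph := range syllable {
		if syllabic[ph] {
			nucleus = i
			break
		}
	}
	// Default position: before the first phoneme (assumed fallback).
	at := 0
	if nucleus >= 0 {
		switch p {
		case beforeSyllabic:
			at = nucleus
		case afterSyllabic:
			at = nucleus + 1
		}
	}
	out := make([]string, 0, len(syllable)+1)
	out = append(out, syllable[:at]...)
	out = append(out, stress)
	out = append(out, syllable[at:]...)
	return strings.Join(out, " ")
}

func main() {
	syllable := []string{"s", "t", "a", "k"}
	syllabic := map[string]bool{"a": true}
	for _, p := range []stressPlacement{firstInSyllable, beforeSyllabic, afterSyllabic} {
		fmt.Println(placeStress(syllable, "\"", syllabic, p))
	}
}
```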
type SyllDef ¶
type SyllDef interface {
	ValidSplit(left []string, right []string) bool
	ContainsSyllabic(phonemes []string) bool
	IsDefined() bool
	IsStress(symbol string) bool
	IsSyllabic(symbol string) bool
	PhonemeDelimiter() string
	StressPlacement() StressPlacement
	IncludePhonemeDelimiter() bool
	SyllableDelimiter() string
}
SyllDef is an interface for implementing custom made syllabification strategies
type Syllabifier ¶
type Syllabifier struct {
	SyllDef         SyllDef
	Tests           []SyllTest
	StressPlacement StressPlacement
	PhonemeSet      PhonemeSet
	Debug           bool
}
Syllabifier is a module to divide a transcription into syllables
func LoadSyllFile ¶
func LoadSyllFile(fName string) (Syllabifier, error)
LoadSyllFile loads a syllabifier from the specified file
func LoadSyllURL ¶
func LoadSyllURL(url string) (Syllabifier, error)
LoadSyllURL loads a syllabifier from a URL
func (Syllabifier) IsDefined ¶
func (s Syllabifier) IsDefined() bool
IsDefined is used to determine if there is a syllabifier defined or not
func (Syllabifier) SyllabifyFromPhonemes ¶
func (s Syllabifier) SyllabifyFromPhonemes(phns []string) string
SyllabifyFromPhonemes is used to divide a range of phonemes into syllables and create an output string
func (Syllabifier) SyllabifyFromString ¶
func (s Syllabifier) SyllabifyFromString(trans string) (string, error)
SyllabifyFromString is used to divide a transcription string into syllables and create an output string
func (Syllabifier) Test ¶
func (s Syllabifier) Test() TestResult
Test runs the tests defined in the syllabifier's input data or file
type TestResult ¶
TestResult is a container for test results (errors, warnings, and failed tests from tests specified in the g2p rule file)
func (TestResult) AllErrors ¶
func (tr TestResult) AllErrors() []string
AllErrors returns one single slice with all errors and test results (if any). Each message is prefixed by its type (ERROR/FAILED TESTS).
func (TestResult) AllMessages ¶
func (tr TestResult) AllMessages() []string
AllMessages returns one single slice with all errors, warnings and test results (if any). Each message is prefixed by its type (ERROR/WARNING/FAILED TESTS).
func (TestResult) Failed ¶
func (tr TestResult) Failed() bool
Failed returns true if the test result has any errors or failed tests
func (TestResult) Strings ¶
func (tr TestResult) Strings() []string
Strings returns all messages as strings