rbg2p

Version: v1.0.1
Published: May 6, 2021 License: MIT Imports: 13 Imported by: 2

README

rbg2p

Utilities for rule based, manually written, grapheme to phoneme rules

Command line tools

G2P
g2p <FLAGS> <G2P RULE FILE> <WORDS (FILES OR LIST OF WORDS)> (optional)

FLAGS:
  -column int
        only convert specified column (default: first field)
  -coverage
        run coverage check (rules applied/not applied) (default: false)
  -debug
        print extra debug info (default: false)
  -force
        print transcriptions even if errors are found (default: false)
  -help
        print help and exit
  -quiet
        inhibit warnings (default: false)
  -symbolset string
        use specified symbol set file for validating the symbols in the g2p rule set (default: none; overrides the g2p rule file's symbolset, if any)
  -test
        test g2p against input file; orth <tab> trans (default: false)
  -test:removestress
        remove stress when comparing using the -test switch (default: false)

Microservice API/server
 $ server cmd/server/g2p_files

Visit http://localhost:6771/ for info on available API calls


This work was supported by the Swedish Post and Telecom Authority (PTS) through the grant "Wikispeech – en användargenererad talsyntes på Wikipedia" (2016–2017).

Documentation

Overview

Package rbg2p contains utilities for rule based, manually written, grapheme to phoneme rules.

Each g2p rule set is defined in a .g2p file with the following content:

  • specific variables - used to define constant variables such as character set and phoneme delimiter
  • variables - any variables for use in the context of the actual rules
  • sylldef - definitions for dividing transcriptions into syllables
  • rules - g2p rules
  • filters - transcription filters applied after the rules
  • tests - input/output tests
  • comments

SPECIFIC VARIABLES

Defines a set of constant variables, such as character set and phoneme delimiter. Please note that quotes are required around the value, since space and the empty string can be used as a value.

<NAME> "<VALUE>"

Available variables (* means required):

CHARACTER_SET*     (default: none)
 - used to check that each character in the character set has at least one rule
PHONEME_SET        (default: none)
 - space separated symbol set, used to validate the phonemes in the g2p rules
DEFAULT_PHONEME    (default: "_")
 - used for unknown input (orthographic) symbols
PHONEME_DELIMITER  (default: " ")
 - used to concatenate phonemes into a transcription
DOWNCASE_INPUT     (default: true)

Examples:

CHARACTER_SET "abcdefghijklmnopqrstuvwxyzåäö"
PHONEME_SET "a au o u i y e eu p t k b d g r s f h j l v w m n S tS"
DEFAULT_PHONEME "_"
PHONEME_DELIMITER " "
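
The quoting requirement above can be illustrated with a small parser. This is only a sketch: parseSpecificVar and specificVarRE are hypothetical names, not part of rbg2p, and the sketch assumes that variable names consist of upper case letters and underscores and that nothing follows the closing quote.

```go
package main

import (
	"fmt"
	"regexp"
)

// specificVarRE matches lines of the form: <NAME> "<VALUE>"
// The value is quoted so that a space or the empty string can be used.
var specificVarRE = regexp.MustCompile(`^([A-Z_]+)\s+"(.*)"$`)

// parseSpecificVar is a hypothetical helper (not part of rbg2p).
// It returns ok=false for lines that lack the required quotes.
func parseSpecificVar(line string) (name, value string, ok bool) {
	m := specificVarRE.FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	for _, line := range []string{
		`CHARACTER_SET "abcdefghijklmnopqrstuvwxyzåäö"`,
		`PHONEME_DELIMITER " "`,
		`DEFAULT_PHONEME _`, // rejected by this sketch: quotes are missing
	} {
		name, value, ok := parseSpecificVar(line)
		fmt.Printf("name=%q value=%q ok=%v\n", name, value, ok)
	}
}
```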

VARIABLES

Regexp variables prefixed by VAR, that can be used in the rule context and filters as exemplified below. The variable names must not contain underscore (_).

VAR <NAME> <VALUE>

Examples:

VAR VOWEL [aeyuio]
VAR AFFRICATE (tS|dZ)
VAR VOICELESS [ptksf]
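
A minimal sketch of how such lines could be read and checked. parseVar is a hypothetical helper, not part of rbg2p; it assumes whitespace-separated fields and uses the standard library regexp package as a stand-in validator, whereas the package's own types reference regexp2, whose syntax may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// parseVar is a hypothetical helper (not part of rbg2p). It parses a line of
// the form "VAR <NAME> <VALUE>", rejects names containing an underscore, and
// checks that the value compiles as a regular expression.
func parseVar(line string) (string, string, error) {
	fields := strings.Fields(line)
	if len(fields) < 3 || fields[0] != "VAR" {
		return "", "", fmt.Errorf("not a VAR line: %q", line)
	}
	name, value := fields[1], strings.Join(fields[2:], " ")
	if strings.Contains(name, "_") {
		return "", "", fmt.Errorf("variable name %q contains underscore", name)
	}
	if _, err := regexp.Compile(value); err != nil {
		return "", "", fmt.Errorf("invalid regexp for %s: %v", name, err)
	}
	return name, value, nil
}

func main() {
	for _, line := range []string{
		"VAR VOWEL [aeyuio]",
		"VAR AFFRICATE (tS|dZ)",
		"VAR MY_VAR [ab]", // rejected: underscore in name
	} {
		name, value, err := parseVar(line)
		fmt.Println(name, value, err)
	}
}
```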

SYLLDEF

A set of variables prefixed by SYLLDEF, used for syllabification (not required).

SYLLDEF <NAME> "<VALUE>"

Currently, only maximum onset (MOP) syllabification can be used. Variables currently available:

TYPE    (default: MOP)
 - currently, the only value allowed here is MOP
ONSETS
 - a comma separated list of valid syllable onsets (typically consonant clusters)
SYLLABIC
 - a space separated list of syllabic phonemes (typically vowels)
STRESS
 - a space separated list of stress symbols
DELIMITER
 - syllable delimiter symbol

Examples:

SYLLDEF TYPE MOP
SYLLDEF ONSETS "p, b, t, rt, m, n, d, rd, k, g, rn, f, v, C, rs, r, l, s, x, S, h, rl, j, s, p, r, rs p r, s p l, rs p l, s p j, rs p j, s t r, rs rt r, s k r, rs k r, s k v, rs k v, p r, p j, p l, b r, b j, b l, t r, rt r, t v, rt v, d r, rd r, d v, rd v, k r, k l, k v, k n, g r, g l, g n, f r, f l, f j, f n, v r, s p, s t, s k, s v, s l, s m, s n, n j, rs p, rs rt, rs k, rs v, rs rl, rs m, rs rn, rn j, m j"
SYLLDEF SYLLABIC "i: I u0 }: a A: u: U E: {: E { au y: Y e: e 2: 9: 2 9 o: O @ eu"
SYLLDEF STRESS "\" %"
SYLLDEF DELIMITER "."

RULES

Grapheme to phoneme rules written in a format loosely based on phonotactic rules. The rules are ordered, and typically the rule order is of great importance.

<INPUT> -> <OUTPUT>
<INPUT> -> <OUTPUT> / <CONTEXT>
<INPUT> -> (<OUTPUT1>, <OUTPUT2>)
<INPUT> -> (<OUTPUT1>, <OUTPUT2>) / <CONTEXT>

Context:

<LEFT CONTEXT> _ <RIGHT CONTEXT>

<INPUT> is a string of one or more input characters. <OUTPUT> is a string representing the output (separated by the pre-defined phoneme delimiter, above). For empty output, i.e., when a character should not be pronounced, use the empty set symbol "∅" (U+2205).

<CONTEXT> is the context in which the <INPUT> should occur for the rule to apply. Pre-defined variables (above) can be used in the context specs. # is used for anchoring (marks the start/end of the input string).

Examples:

a -> ? a / # _
a -> a
e -> e
skt -> (s t, s k t) / _
ck -> k
b -> p / _ VOICELESS
h -> ∅ / # _
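
To make the role of rule order and context concrete, here is a simplified sketch of ordered rule application. It is not the rbg2p implementation: the rule type, newRule, transcribe and exampleRules are hypothetical, only single-output rules are handled, the VOICELESS variable is expanded by hand, and the standard library regexp package is used for the contexts. At each position the first rule whose input and context match is applied; unmatched characters produce the default phoneme.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// rule is a hypothetical, simplified g2p rule with a single output.
type rule struct {
	input, output string
	left, right   *regexp.Regexp // nil means no context on that side
}

// newRule compiles the contexts; # is mapped to a start/end anchor.
func newRule(input, output, left, right string) rule {
	r := rule{input: input, output: output}
	if left != "" {
		r.left = regexp.MustCompile("(?:" + strings.ReplaceAll(left, "#", "^") + ")$")
	}
	if right != "" {
		r.right = regexp.MustCompile("^(?:" + strings.ReplaceAll(right, "#", "$") + ")")
	}
	return r
}

// transcribe applies the first matching rule at each position of word.
func transcribe(rules []rule, word, delim, defaultPhoneme string) string {
	var phonemes []string
	for i := 0; i < len(word); {
		matched := false
		for _, r := range rules {
			end := i + len(r.input)
			if !strings.HasPrefix(word[i:], r.input) ||
				(r.left != nil && !r.left.MatchString(word[:i])) ||
				(r.right != nil && !r.right.MatchString(word[end:])) {
				continue
			}
			if r.output != "∅" { // empty set symbol: nothing is pronounced
				phonemes = append(phonemes, r.output)
			}
			i, matched = end, true
			break
		}
		if !matched { // unknown character: emit the default phoneme
			_, size := utf8.DecodeRuneInString(word[i:])
			phonemes = append(phonemes, defaultPhoneme)
			i += size
		}
	}
	return strings.Join(phonemes, delim)
}

// exampleRules mirrors some of the examples above, plus a few fallback rules.
func exampleRules() []rule {
	return []rule{
		newRule("a", "? a", "#", ""),
		newRule("a", "a", "", ""),
		newRule("ck", "k", "", ""),
		newRule("b", "p", "", "[ptksf]"), // VOICELESS expanded by hand
		newRule("b", "b", "", ""),
		newRule("h", "∅", "#", ""),
		newRule("h", "h", "", ""),
		newRule("t", "t", "", ""),
	}
}

func main() {
	for _, w := range []string{"abt", "hack", "ax"} {
		fmt.Printf("%s -> %s\n", w, transcribe(exampleRules(), w, " ", "_"))
	}
}
```

Moving the context-free rule for a above the anchored one would make the anchored rule unreachable, which is why rule order matters.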

PREFILTERS

Regexp replacement filters for transcriptions. The filters are applied before the g2p rules. Pre-defined variables (above) can be used in the input regexp surrounded by curly brackets.

PREFILTER "<FROM RE>" -> "<TO STRING>"
PREFILTER "<FROM RE WITH {VARIABLENAME}>" -> "<TO STRING>"

Example:

PREFILTER "п" -> "p" // cyrillic char to latin

FILTERS

Regexp replacement filters for transcriptions. The filters are applied after the g2p rules. Pre-defined variables (above) can be used in the input regexp surrounded by curly brackets.

FILTER "<FROM RE>" -> "<TO STRING>"
FILTER "<FROM RE WITH {VARIABLENAME}>" -> "<TO STRING>"

Example:

FILTER "^" -> "\" " // place stress first in transcription
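
A sketch of how a filter with curly bracket variables might be expanded and applied. expandVars and applyFilter are hypothetical helpers, not part of rbg2p; the standard library regexp package is used here as a stand-in, and the replacement string is treated as a literal.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// expandVars replaces each {NAME} in a filter regexp with the variable's value.
func expandVars(re string, vars map[string]string) string {
	for name, value := range vars {
		re = strings.ReplaceAll(re, "{"+name+"}", value)
	}
	return re
}

// applyFilter expands variables in from, compiles it, and replaces every
// match in trans with the literal string to.
func applyFilter(from, to, trans string, vars map[string]string) (string, error) {
	re, err := regexp.Compile(expandVars(from, vars))
	if err != nil {
		return "", err
	}
	return re.ReplaceAllLiteralString(trans, to), nil
}

func main() {
	vars := map[string]string{"VOWEL": "[aeyuio]"}

	// The example filter above: place stress first in transcription.
	res, err := applyFilter("^", "\" ", "h i t", vars)
	fmt.Printf("%q %v\n", res, err)

	// A made-up filter using a variable: replace a final vowel with @.
	res, err = applyFilter("{VOWEL}$", "@", "h i t a", vars)
	fmt.Printf("%q %v\n", res, err)
}
```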

COMMENTS

Comments are prefixed by // or #

TESTS

Test examples prefixed by TEST:

TEST <INPUT> -> <OUTPUT>

or with variants:

TEST <INPUT> -> (<OUTPUT1>, <OUTPUT2>)

Examples:

TEST hit -> h i t
TEST kex -> (k e k s, C e k s)
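
The two TEST forms can be read with a few string operations. The following is a sketch only; parseTest is a hypothetical helper, not part of rbg2p, and it assumes that variants are separated by commas inside a single pair of parentheses.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTest parses "TEST <INPUT> -> <OUTPUT>" or
// "TEST <INPUT> -> (<OUTPUT1>, <OUTPUT2>)" into an input and its outputs.
func parseTest(line string) (string, []string, error) {
	line = strings.TrimSpace(line)
	if !strings.HasPrefix(line, "TEST ") {
		return "", nil, fmt.Errorf("not a TEST line: %q", line)
	}
	parts := strings.SplitN(strings.TrimPrefix(line, "TEST "), "->", 2)
	if len(parts) != 2 {
		return "", nil, fmt.Errorf("missing -> in %q", line)
	}
	input, out := strings.TrimSpace(parts[0]), strings.TrimSpace(parts[1])
	if input == "" || out == "" {
		return "", nil, fmt.Errorf("empty input or output in %q", line)
	}
	var outputs []string
	if strings.HasPrefix(out, "(") && strings.HasSuffix(out, ")") {
		// Variants: strip the parentheses and split on commas.
		for _, v := range strings.Split(out[1:len(out)-1], ",") {
			outputs = append(outputs, strings.TrimSpace(v))
		}
	} else {
		outputs = []string{out}
	}
	return input, outputs, nil
}

func main() {
	for _, line := range []string{
		"TEST hit -> h i t",
		"TEST kex -> (k e k s, C e k s)",
	} {
		input, outputs, err := parseTest(line)
		fmt.Printf("%q %q %v\n", input, outputs, err)
	}
}
```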

---

SEPARATE SYLLABIFICATION RULE FILE

A .syll file for syllabification contains a subset of the items used for a proper g2p.

Example (for the CMU lexicon):

PHONEME_SET "AA AE AH AX AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG OW OY P R S SH T TH UH UW V W Y Z ZH 1 2"
PHONEME_DELIMITER " "

SYLLDEF TYPE MOP
SYLLDEF ONSETS "P, T, K, B, D, G, CH, JH, F, V, T, D, S, Z, S, Z, H, L, M, N, N, R, W, J, P R, T R, B R, G R, S T R, S P R, S K R, P L, T L, B L, G L, S T L, S P L, S K L, S P, S T, S K"
SYLLDEF SYLLABIC "AA AE AH AX AO AW AY EH ER EY IH IY OW OY UH UW"
SYLLDEF STRESS "1 2"
SYLLDEF DELIMITER "$"

SYLLDEF TEST AX P R 1 AA K S AX M AX T -> AX $ P R 1 AA K $ S AX $ M AX T
SYLLDEF TEST W 1 UH D S T R 2 IY M -> W 1 UH D $ S T R 2 IY M
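
The maximum onset principle itself can be sketched in a few lines. The code below is a simplified illustration, not the rbg2p implementation: mopDef, set, isOnset, syllabify and cmuDef are hypothetical names, and the sketch assumes that stress symbols are ignored when a consonant cluster is checked against the onset list. Between two syllabic phonemes, the boundary is placed before the longest cluster suffix that is a valid onset.

```go
package main

import (
	"fmt"
	"strings"
)

// mopDef is a hypothetical, simplified MOP syllabification definition.
type mopDef struct {
	onsets, syllabic, stress map[string]bool
	delim                    string
}

// set builds a lookup table from a list of (whitespace-trimmed) items.
func set(items []string) map[string]bool {
	m := map[string]bool{}
	for _, it := range items {
		m[strings.TrimSpace(it)] = true
	}
	return m
}

// isOnset reports whether cluster, with stress symbols removed, is a valid
// onset. An empty cluster counts as valid.
func (d mopDef) isOnset(cluster []string) bool {
	var phns []string
	for _, p := range cluster {
		if !d.stress[p] {
			phns = append(phns, p)
		}
	}
	return len(phns) == 0 || d.onsets[strings.Join(phns, " ")]
}

// syllabify inserts the syllable delimiter into a space separated transcription.
func (d mopDef) syllabify(trans string) string {
	phns := strings.Fields(trans)
	var nuclei []int // positions of syllabic phonemes
	for i, p := range phns {
		if d.syllabic[p] {
			nuclei = append(nuclei, i)
		}
	}
	boundary := map[int]bool{} // a delimiter goes before these positions
	for n := 0; n+1 < len(nuclei); n++ {
		start, end := nuclei[n]+1, nuclei[n+1]
		for s := start; s <= end; s++ { // longest candidate onset first
			if d.isOnset(phns[s:end]) {
				boundary[s] = true
				break
			}
		}
	}
	var out []string
	for i, p := range phns {
		if boundary[i] {
			out = append(out, d.delim)
		}
		out = append(out, p)
	}
	return strings.Join(out, " ")
}

// cmuDef uses the values from the CMU example above.
func cmuDef() mopDef {
	onsets := "P, T, K, B, D, G, CH, JH, F, V, T, D, S, Z, S, Z, H, L, M, N, N, R, W, J, P R, T R, B R, G R, S T R, S P R, S K R, P L, T L, B L, G L, S T L, S P L, S K L, S P, S T, S K"
	syllabic := "AA AE AH AX AO AW AY EH ER EY IH IY OW OY UH UW"
	return mopDef{
		onsets:   set(strings.Split(onsets, ",")),
		syllabic: set(strings.Fields(syllabic)),
		stress:   set([]string{"1", "2"}),
		delim:    "$",
	}
}

func main() {
	def := cmuDef()
	fmt.Println(def.syllabify("AX P R 1 AA K S AX M AX T"))
	fmt.Println(def.syllabify("W 1 UH D S T R 2 IY M"))
}
```

With these assumptions the sketch reproduces the two SYLLDEF TEST lines above; in the first, K S is not a valid onset but S is, so the boundary falls between K and S.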

For details on the .g2p file format, check docs for the root folder of this package.

For more examples (used for unit tests), see the test_data folder: https://github.com/stts-se/rbg2p/tree/master/test_data

To test a single g2p file from the command line, use cmd/g2p.

To import and use the rbg2p rule package in another go program:

package main

import (
        "fmt"
        "log"

        "github.com/stts-se/rbg2p"
)

func main() {
        // TODO: initialize g2pFile and orth
        var g2pFile, orth = "", ""

        // Load rule file
        ruleSet, err := rbg2p.LoadFile(g2pFile)
        if err != nil {
                log.Fatalf("couldn't load rule file %s: %v", g2pFile, err)
        }

        // Test rule set
        testRes := ruleSet.Test()
        // testRes is an instance of rbg2p.TestResult
        // - you can do a quick check using testRes.Failed() to find out if there were any errors
        // - you can retrieve all errors using testRes.AllErrors()
        // - you can retrieve all errors and warnings using testRes.AllMessages()
        if testRes.Failed() {
                for _, msg := range testRes.AllErrors() {
                        log.Println(msg)
                }
                log.Fatal("rule set tests failed")
        }

        // Transcribe an input word
        transes, err := ruleSet.Apply(orth)
        if err != nil {
                log.Fatalf("couldn't transcribe %s: %v", orth, err)
        }
        fmt.Println(transes)
}

Variables

var Debug = false

Functions

func Contains

func Contains(slice []string, value string) bool

Contains checks whether a slice of strings contains a specific string

Types

type Context

type Context struct {
	// Input is the regexp as written in the input string
	Input string

	// Regexp is the input string converted to a regular expression for internal use (with variables expanded, and adapted anchoring)
	Regexp *regexp2.Regexp
}

Context in which the rule applies (left hand/right hand context specified by a regular expression)

func (Context) IsDefined

func (c Context) IsDefined() bool

IsDefined returns true if the contained regexp is defined

func (Context) Matches

func (c Context) Matches(s string) (bool, error)

Matches checks if the input string matches the context rule

func (Context) String

func (c Context) String() string

String returns a string representation of the Context

type Filter

type Filter struct {
	Regexp *regexp2.Regexp
	Output string
}

Filter is a regexp filter for rules that cannot be expressed using the standard rule system

func (Filter) Apply

func (f Filter) Apply(s string) (string, error)

Apply is used to apply the filter to an input string

type MOPSyllDef

type MOPSyllDef struct {
	Onsets          []string
	Syllabic        []string
	PhnDelim        string
	SyllDelim       string
	Stress          []string
	StressPlcmnt    StressPlacement
	IncludePhnDelim bool
}

MOPSyllDef is a Maximum Onset Principle implementation of the SyllDef interface

func (MOPSyllDef) ContainsSyllabic

func (def MOPSyllDef) ContainsSyllabic(phonemes []string) bool

ContainsSyllabic tells if the input phoneme slice contains any syllabic phonemes (required by interface)

func (MOPSyllDef) IncludePhonemeDelimiter

func (def MOPSyllDef) IncludePhonemeDelimiter() bool

IncludePhonemeDelimiter defines whether the syllable boundaries should be surrounded by the phoneme delimiter

func (MOPSyllDef) IsDefined

func (def MOPSyllDef) IsDefined() bool

IsDefined is used to determine if there is a syllabifier defined or not (required by interface)

func (MOPSyllDef) IsStress

func (def MOPSyllDef) IsStress(symbol string) bool

IsStress is used to check if the input symbol is a stress symbol

func (MOPSyllDef) IsSyllabic

func (def MOPSyllDef) IsSyllabic(phoneme string) bool

IsSyllabic is used to check if the input phoneme is syllabic

func (MOPSyllDef) PhonemeDelimiter

func (def MOPSyllDef) PhonemeDelimiter() string

PhonemeDelimiter is the string used to separate phonemes (required by interface)

func (MOPSyllDef) StressPlacement

func (def MOPSyllDef) StressPlacement() StressPlacement

StressPlacement returns where in a syllable the stress should be put (required by interface)

func (MOPSyllDef) SyllableDelimiter

func (def MOPSyllDef) SyllableDelimiter() string

SyllableDelimiter is the string used to separate syllables (required by interface)

func (MOPSyllDef) ValidSplit

func (def MOPSyllDef) ValidSplit(left0 []string, right0 []string) bool

ValidSplit is called by Syllabifier.Syllabify to test where to put the boundaries

type PhonemeSet

type PhonemeSet struct {
	Symbols                   []string
	PhnDelim                  Regexp
	SyllDelim                 Regexp
	SyllDelimIncludesPhnDelim bool
}

PhonemeSet is a package internal container for the phoneme set definition

func LoadPhonemeSetFile

func LoadPhonemeSetFile(fName string, syllDelimIncludesPhnDelim bool, syllDelimiter, phnDelimiter string) (PhonemeSet, error)

LoadPhonemeSetFile loads a phoneme set definition from file (one phoneme per line, // for comments)
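
The file format described above (one phoneme per line, // for comments) could be read roughly as follows. This is a sketch, not the package's loader: readPhonemeSet and readPhonemeSetString are hypothetical helpers, and the sketch assumes that a comment may also follow a symbol on the same line and that blank lines are skipped.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readPhonemeSet reads one phoneme symbol per line, dropping // comments
// and blank lines.
func readPhonemeSet(r io.Reader) ([]string, error) {
	var symbols []string
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, "//"); i >= 0 {
			line = line[:i] // strip the comment
		}
		if line = strings.TrimSpace(line); line != "" {
			symbols = append(symbols, line)
		}
	}
	return symbols, sc.Err()
}

// readPhonemeSetString is a convenience wrapper for in-memory content.
func readPhonemeSetString(content string) ([]string, error) {
	return readPhonemeSet(strings.NewReader(content))
}

func main() {
	symbols, err := readPhonemeSetString("// vowels\na\nau\n\ntS // affricate\n")
	fmt.Printf("%q %v\n", symbols, err)
}
```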

func NewPhonemeSet

func NewPhonemeSet(symbols []string, syllDelimIncludesPhnDelim bool, syllDelimiter, phnDelimiter string) (PhonemeSet, error)

NewPhonemeSet creates a phoneme set from a slice of symbols, and a phoneme delimiter string

func (PhonemeSet) SplitTranscription

func (ps PhonemeSet) SplitTranscription(trans string) ([]string, error)

SplitTranscription splits the input transcription into a slice of phonemes, based on the pre-defined phoneme delimiter
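
A sketch of the splitting and validation idea, using a plain string delimiter instead of the Regexp fields of PhonemeSet. splitTranscription here is a hypothetical free function, not the method above, and it assumes a non-empty delimiter; with an empty delimiter a real implementation would need to match the symbols of the set against the string instead.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTranscription splits trans on delim and checks that every resulting
// phoneme is a member of symbols.
func splitTranscription(trans, delim string, symbols []string) ([]string, error) {
	known := map[string]bool{}
	for _, s := range symbols {
		known[s] = true
	}
	phns := strings.Split(trans, delim)
	for _, p := range phns {
		if !known[p] {
			return nil, fmt.Errorf("unknown phoneme %q in transcription %q", p, trans)
		}
	}
	return phns, nil
}

func main() {
	symbols := []string{"h", "i", "t", "tS"}
	fmt.Println(splitTranscription("h i tS", " ", symbols))
	fmt.Println(splitTranscription("h x", " ", symbols))
}
```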

type Prefilter

type Prefilter struct {
	Regexp *regexp2.Regexp
	Output string
}

Prefilter is a regexp filter

func (Prefilter) Apply

func (pf Prefilter) Apply(s string) (string, error)

Apply is used to apply the prefilter to an input string

type Regexp

type Regexp struct {
	RE     *regexp.Regexp
	Source string
}

type Rule

type Rule struct {
	Input        string
	Output       []string
	LeftContext  Context
	RightContext Context
	LineNumber   int // for debugging
}

Rule is a g2p rule representation

func (Rule) String

func (r Rule) String() string

String returns a string representation of the Rule

type RuleSet

type RuleSet struct {
	CharacterSet      []string
	PhonemeSet        PhonemeSet
	PhonemeDelimiter  string
	SyllableDelimiter string
	DefaultPhoneme    string
	DowncaseInput     bool
	Vars              map[string]string
	Rules             []Rule
	RulesAppliedMutex *sync.RWMutex
	RulesApplied      map[string]int // for coverage checks
	Tests             []Test
	Filters           []Filter
	Prefilters        []Prefilter
	Syllabifier       Syllabifier
	Content           string
	Debug             bool
}

RuleSet is a set of g2p rules, with variables and built-in tests

func LoadFile

func LoadFile(fName string) (RuleSet, error)

LoadFile loads a g2p rule set from the specified file

func LoadURL

func LoadURL(url string) (RuleSet, error)

LoadURL loads a g2p rule set from a URL

func (RuleSet) Apply

func (rs RuleSet) Apply(s string) ([]string, error)

Apply applies the rules to an input string, returns a slice of transcriptions. If unknown input characters are found, an error will be created, and an underscore will be appended to the transcription. Even if an error is returned, the loop will continue until the end of the input string.

func (RuleSet) Test

func (rs RuleSet) Test() TestResult

Test runs the built-in tests. Returns a test result with errors and warnings, if any.

type StressPlacement

type StressPlacement int

StressPlacement is used to define where in a syllable the stress should be put in an output string

const (
	// Undefined - position not defined
	Undefined StressPlacement = iota

	// FirstInSyllable -- before the syllable's first phoneme
	FirstInSyllable

	// BeforeSyllabic -- before the first syllabic phoneme
	BeforeSyllabic

	// AfterSyllabic -- after the first syllabic phoneme
	AfterSyllabic
)

type SyllDef

type SyllDef interface {
	ValidSplit(left []string, right []string) bool
	ContainsSyllabic(phonemes []string) bool
	IsDefined() bool
	IsStress(symbol string) bool
	IsSyllabic(symbol string) bool
	PhonemeDelimiter() string
	StressPlacement() StressPlacement
	IncludePhonemeDelimiter() bool
	SyllableDelimiter() string
}

SyllDef is an interface for implementing custom made syllabification strategies

type SyllTest

type SyllTest struct {
	Input  string
	Output string
}

SyllTest defines a rule test (input -> output)

type Syllabifier

type Syllabifier struct {
	SyllDef         SyllDef
	Tests           []SyllTest
	StressPlacement StressPlacement
	PhonemeSet      PhonemeSet
	Debug           bool
}

Syllabifier is a module to divide a transcription into syllables

func LoadSyllFile

func LoadSyllFile(fName string) (Syllabifier, error)

LoadSyllFile loads a syllabifier from the specified file

func LoadSyllURL

func LoadSyllURL(url string) (Syllabifier, error)

LoadSyllURL loads a syllabifier from a URL

func (Syllabifier) IsDefined

func (s Syllabifier) IsDefined() bool

IsDefined is used to determine if there is a syllabifier defined or not

func (Syllabifier) SyllabifyFromPhonemes

func (s Syllabifier) SyllabifyFromPhonemes(phns []string) string

SyllabifyFromPhonemes is used to divide a range of phonemes into syllables and create an output string

func (Syllabifier) SyllabifyFromString

func (s Syllabifier) SyllabifyFromString(trans string) (string, error)

SyllabifyFromString is used to divide a transcription string into syllables and create an output string

func (Syllabifier) Test

func (s Syllabifier) Test() TestResult

Test runs the tests defined in the input data or file against the syllabifier definition

type Test

type Test struct {
	Input  string
	Output []string
}

Test defines a rule test (input -> output)

type TestResult

type TestResult struct {
	Errors      []string
	Warnings    []string
	FailedTests []string
}

TestResult is a container for test results (errors, warnings, and failed tests from tests specified in the g2p rule file)

func (TestResult) AllErrors

func (tr TestResult) AllErrors() []string

AllErrors returns one single slice with all errors and test results (if any). Each message is prefixed by its type (ERROR/FAILED TESTS).

func (TestResult) AllMessages

func (tr TestResult) AllMessages() []string

AllMessages returns one single slice with all errors, warnings and test results (if any). Each message is prefixed by its type (ERROR/WARNING/FAILED TESTS).

func (TestResult) Failed

func (tr TestResult) Failed() bool

Failed returns true if the test result has any errors or failed tests
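
The relation between Failed, AllErrors and AllMessages can be sketched with a local stand-in type. testResult below mirrors the fields of TestResult but is not the package's code, and the "TYPE: message" prefix format is an assumption; only the set of prefixes (ERROR/WARNING/FAILED TESTS) comes from the documentation.

```go
package main

import "fmt"

// testResult is a local stand-in with the same fields as rbg2p.TestResult.
type testResult struct {
	Errors, Warnings, FailedTests []string
}

// failed is true for errors or failed tests; warnings alone do not fail.
func (tr testResult) failed() bool {
	return len(tr.Errors) > 0 || len(tr.FailedTests) > 0
}

// prefixed returns msgs with a type prefix (the exact format is assumed).
func prefixed(prefix string, msgs []string) []string {
	var res []string
	for _, m := range msgs {
		res = append(res, prefix+": "+m)
	}
	return res
}

// allErrors collects errors and failed tests, but not warnings.
func (tr testResult) allErrors() []string {
	return append(prefixed("ERROR", tr.Errors), prefixed("FAILED TESTS", tr.FailedTests)...)
}

// allMessages collects errors, warnings and failed tests.
func (tr testResult) allMessages() []string {
	res := prefixed("ERROR", tr.Errors)
	res = append(res, prefixed("WARNING", tr.Warnings)...)
	return append(res, prefixed("FAILED TESTS", tr.FailedTests)...)
}

func main() {
	tr := testResult{
		Warnings:    []string{"no rule for character q"},
		FailedTests: []string{"hit -> h i t"},
	}
	fmt.Println(tr.failed())
	fmt.Printf("%q\n", tr.allErrors())
	fmt.Printf("%q\n", tr.allMessages())
}
```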

func (TestResult) Strings

func (tr TestResult) Strings() []string

Strings returns all messages as strings

Directories

Path Synopsis
cmd
g2p
