hangulize

Version: v0.5.0 (not the latest version of its module)
Published: Feb 1, 2023 License: MIT Imports: 18 Imported by: 4

README

Hangulize (한글라이즈)


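(WIP: still under development; the API may change at any time!)

For a Hangul transcription system for foreign languages to be properly established, ordinary people must be able to find usage examples quickly and easily whenever they want to write a foreign word in Hangul. Deciding usage examples at regularly held meetings has its limits. The review process for loanword transcription should be automated so that the Hangul transcription appears as soon as the foreign word is entered. Where a usage example has already been decided, it should be followed; where none exists, a recommended transcription should still be shown according to each language's transcription rules. If programmers and linguists join hands to research this, it will not remain a mere fantasy.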

Brian Jongseong Park (http://iceager.egloos.com/2610028)

Hangulize is a tool that converts foreign words into Hangul.

$ go get -u github.com/hangulize/hangulize
import "github.com/hangulize/hangulize"

hangulize.Hangulize("ita", "Cappuccino")
// output: "카푸치노"

Supported languages

LANG     STAGE    ENG                      KOR
aze      draft    Azerbaijani              아제르바이잔어
bel      draft    Belarusian               벨라루스어
bul      draft    Bulgarian                불가리아어
cat      draft    Catalan                  카탈로니아어
ces      draft    Czech                    체코어
chi      draft    Chinese                  중국어
cym      draft    Welsh                    웨일스어
deu      draft    German                   독일어
ell      draft    Greek                    그리스어
epo      draft    Esperanto                에스페란토어
est      draft    Estonian                 에스토니아어
fin      draft    Finnish                  핀란드어
grc      draft    Ancient Greek            고대 그리스어
hbs      draft    Serbo-Croatian           세르보크로아트어
hun      draft    Hungarian                헝가리어
isl      draft    Icelandic                아이슬란드어
ita      draft    Italian                  이탈리아어
jpn      draft    Japanese                 일본어
jpn-ck   draft    Japanese (C.K.)          일본어(최영애-김용옥)
kat-1    draft    Georgian (1st scheme)    조지아어(제1안)
kat-2    draft    Georgian (2nd scheme)    조지아어(제2안)
lat      draft    Latin                    라틴어
lav      draft    Latvian                  라트비아어
lit      draft    Lithuanian               리투아니아어
mkd      draft    Macedonian               마케도니아어
nld      draft    Dutch                    네덜란드어
pol      draft    Polish                   폴란드어
por      draft    Portuguese               포르투갈어
por-br   draft    Brazilian Portuguese     브라질 포르투갈어
ron      draft    Romanian                 루마니아어
rus      draft    Russian                  러시아어
slk      draft    Slovak                   슬로바키아어
slv      draft    Slovenian                슬로베니아어
spa      draft    Spanish                  스페인어
sqi      draft    Albanian                 알바니아어
swe      draft    Swedish                  스웨덴어
tur      draft    Turkish                  터키어
ukr      draft    Ukrainian                우크라이나어
vie      draft    Vietnamese               베트남어
wlm      draft    Middle Welsh             웨일스어(중세)

Further reading

Authors

License

Hangulize is released under the MIT license. If you use the source code, please comply with the license terms. The full license text is available in the LICENSE file.

Documentation

Overview

Package hangulize transcribes non-Korean words into Hangul.

"Hello!" -> "헬로!"

Hangulize was inspired by Brian Jongseong Park (http://iceager.egloos.com/2610028). Based on this idea, the original Hangulize was developed in Python and went out in 2010 (https://github.com/sublee/hangulize). Since then, serving as a web application on https://hangulize.org/, it has been of great help for Korean translators.

This Go re-implementation is a reboot of Hangulize with feature improvements.

Procedure

Hangulize transcribes a word in five steps: "Normalize", "Group", "Rewrite", "Transcribe", and "Syllabify". To clarify these concepts, consider an imaginary example that transcribes the English "Hello!" into "헬로!" (English is not actually supported yet).

First, Hangulize normalizes letter cases:

"Hello!" -> "hello!"

And then, it groups letters by meaning:

"hello!" -> "hello", "!"
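The first two steps can be pictured with a small standard-library sketch. The function names `normalize` and `group` are hypothetical and are not part of the hangulize API; the real steps are driven by each spec, so this only mirrors the "Hello!" walk-through by lower-casing the word and splitting it into runs of letters and non-letters.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalize lower-cases the word. It is a stand-in for the Normalize step.
func normalize(word string) string {
	return strings.ToLower(word)
}

// group splits a word into alternating runs of letters and non-letters.
// It is a stand-in for the Group step.
func group(word string) []string {
	var chunks []string
	var cur []rune
	curIsLetter := false
	for _, r := range word {
		isLetter := unicode.IsLetter(r)
		// Close the current chunk when the character class changes.
		if len(cur) > 0 && isLetter != curIsLetter {
			chunks = append(chunks, string(cur))
			cur = cur[:0]
		}
		cur = append(cur, r)
		curIsLetter = isLetter
	}
	if len(cur) > 0 {
		chunks = append(chunks, string(cur))
	}
	return chunks
}

func main() {
	fmt.Printf("%q\n", group(normalize("Hello!"))) // ["hello" "!"]
}
```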

After that, the grouped chunks are rewritten according to rules specific to the source language. This step usually minimizes the differences between pronunciation and spelling:

"hello", "!" -> "heˈlō", "!"

And it transcribes rewritten chunks into Hangul Jamo phonemes.

"heˈlō", "!" -> "ㅎㅔ-ㄹㄹㅗ", "!"

Finally, it composes Jamo phonemes into Hangul syllabic blocks and joins all groups.

"ㅎㅔ-ㄹㄹㅗ", "!" -> "헬로!"
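The composition part of this last step rests on the standard Unicode arithmetic for Hangul syllables. The sketch below is not the package's internal jamo composer; `compose` and `index` are hypothetical helpers, and the sketch only shows how one initial, one medial, and an optional final Jamo map to a single syllabic block. How Hangulize decides where syllable boundaries fall is not covered here.

```go
package main

import "fmt"

// Jamo in the order used by the Unicode Hangul syllable formula.
var (
	initials = []rune("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
	medials  = []rune("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
	finals   = []rune("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")
)

// index returns the position of r in list, or -1 if it is absent.
func index(list []rune, r rune) int {
	for i, c := range list {
		if c == r {
			return i
		}
	}
	return -1
}

// compose builds one Hangul syllable from Jamo.
// Pass 0 as final when the syllable has no final consonant.
// It returns U+FFFD when a Jamo is not valid in its position.
func compose(initial, medial, final rune) rune {
	l := index(initials, initial)
	v := index(medials, medial)
	t := 0
	if final != 0 {
		t = index(finals, final) + 1
	}
	if l < 0 || v < 0 || (final != 0 && t == 0) {
		return '\uFFFD'
	}
	// Syllables start at U+AC00: 21 medials and 28 final slots per initial.
	return rune(0xAC00 + (l*21+v)*28 + t)
}

func main() {
	word := []rune{compose('ㅎ', 'ㅔ', 'ㄹ'), compose('ㄹ', 'ㅗ', 0)}
	fmt.Println(string(word) + "!") // 헬로!
}
```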

Extended Procedure

Some languages, such as Japanese, may require two more steps: "Transliterate" and "Localize". The former runs before the Normalize step, and the latter runs after the Syllabify step.

Japanese uses Kanji, which are ideograms. The Kana reading of Kanji is called Furigana. To get Furigana from Kanji, we need a lexical analysis based on several dictionaries. The Transliterate step guesses the phonograms from a spelling based on that lexical analysis.

"日本語" -> "ニホンゴ"

Furthermore, Japanese uses full-width characters for punctuation while Korean and European languages use half-width ones. Full-width punctuation needs to be replaced with its half-width counterpart followed by a space to produce a natural-looking Korean word. The Localize step performs this replacement.

"이마、아이니유키마스" -> "이마, 아이니유키마스"
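A minimal sketch of such a replacement, assuming only two punctuation marks need handling. The function name `localize` and the choice of marks are assumptions for illustration; the package's own Localize step may cover more cases.

```go
package main

import (
	"fmt"
	"strings"
)

// localize replaces Japanese punctuation with a half-width mark plus a space.
// Only the ideographic comma and full stop are handled in this sketch.
func localize(word string) string {
	r := strings.NewReplacer("、", ", ", "。", ". ")
	// Trim the space that a trailing punctuation mark would leave behind.
	return strings.TrimRight(r.Replace(word), " ")
}

func main() {
	fmt.Println(localize("이마、아이니유키마스")) // 이마, 아이니유키마스
}
```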

Spec

A spec is written in the HSL format, which is a configuration DSL for Hangulize 2. One spec describes one language transcription system, so we first need to describe the language:

lang:
    id      = "ita"
    codes   = "it", "ita" # ISO 639-1 and 3 codes
    english = "Italian"
    korean  = "이탈리아어"
    script  = "roman"

Then write about yourself and the stage of this spec:

config:
    author = "John Doe <john@example.com>"
    stage  = "draft"

We will soon write many patterns in the rewrite/transcribe rules, and some expressions will appear repeatedly. To avoid repeating ourselves, we can use variables and macros.

A variable is a set of letters. A variable in a pattern matches any one of its letters. The variable "foo" is referenced as "<foo>" in patterns.

vars:
    "vowels" = "a", "e", "i", "o", "u"

A macro expression is replaced with the target before parsing the patterns. "@" is the common macro for "<vowels>" variable:

macros:
    "@" = "<vowels>"
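One way to picture this expansion is plain text substitution: macros first, then variables. The sketch below is an assumption about the general idea, not the package's parser. The helper name `expand` is hypothetical, the map types mirror the documented `Spec.Macros` and `Spec.Vars` fields, and rendering a variable as a regular-expression alternation is chosen only for illustration. The braces in the pattern belong to HRE syntax and are left untouched.

```go
package main

import (
	"fmt"
	"strings"
)

// expand substitutes macros textually and then replaces each "<name>"
// reference with an alternation of the variable's letters.
// Map iteration order is random in Go, so this sketch assumes that
// macros and variables do not overlap each other.
func expand(pattern string, macros map[string]string, vars map[string][]string) string {
	for src, dst := range macros {
		pattern = strings.ReplaceAll(pattern, src, dst)
	}
	for name, letters := range vars {
		alt := "(" + strings.Join(letters, "|") + ")"
		pattern = strings.ReplaceAll(pattern, "<"+name+">", alt)
	}
	return pattern
}

func main() {
	macros := map[string]string{"@": "<vowels>"}
	vars := map[string][]string{"vowels": {"a", "e", "i", "o", "u"}}
	fmt.Println(expand("gn{@}", macros, vars)) // gn{(a|e|i|o|u)}
}
```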

Now we can write "rewrite" rules. A rule consists of a Pattern and an RPattern. The Pattern matches letters in a word, and the RPattern describes how the matched letters should be replaced. The word produced by one rule becomes the input for the next rule:

rewrite:
    "^gli$"   -> "li"
    "^gli{@}" -> "li"
    "{@}gli"  -> "li"
    "gn{@}"   -> "nJ"
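The chaining behaviour can be sketched with Go's standard `regexp` package. This is not HRE: Go regular expressions have no lookaround, so the vowel is captured and re-emitted instead, and the second and third rules are invented purely to show one rule consuming another rule's output.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule pairs a pattern with its replacement (a simplified stand-in for
// the Pattern/RPattern pair).
type rule struct {
	from *regexp.Regexp
	to   string
}

// Hypothetical rules for illustration only.
var rules = []rule{
	{regexp.MustCompile(`gn([aeiou])`), "nJ${1}"},
	{regexp.MustCompile(`cch`), "kk"},
	{regexp.MustCompile(`kk`), "k"}, // only matches output of the rule above
}

// rewrite applies every rule in order; each rule sees the previous result.
func rewrite(word string) string {
	for _, r := range rules {
		word = r.from.ReplaceAllString(word, r.to)
	}
	return word
}

func main() {
	fmt.Println(rewrite("gnocchi")) // nJoki
}
```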

Pattern is based on regular expressions but has its own custom syntax. We call it "HRE", which stands for "Hangulize-specific Regular Expression".

"transcribe" rules have exactly the same form as "rewrite" rules, but their RPatterns represent Hangul Jamo phonemes. In contrast to "rewrite", the replaced output does not become the input for the following rules:

transcribe:
    "b" -> "ㅂ"
    "d" -> "ㄷ"
    "f" -> "ㅍ"
    "g" -> "ㄱ"
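The non-chaining behaviour resembles a single-pass replacement, which `strings.NewReplacer` from the standard library happens to provide: text that has already been replaced is never scanned again. This is only an analogy under that assumption, not how the package implements the Transcribe step, and the "a"/"b" rules in `main` are invented to make the difference visible.

```go
package main

import (
	"fmt"
	"strings"
)

// transcribe applies the four consonant rules from the spec excerpt in a
// single pass; replaced output is not fed to the other rules.
func transcribe(word string) string {
	return strings.NewReplacer("b", "ㅂ", "d", "ㄷ", "f", "ㅍ", "g", "ㄱ").Replace(word)
}

func main() {
	fmt.Println(transcribe("bdfg")) // ㅂㄷㅍㄱ

	// With chained rules "a" -> "b" then "b" -> "c", the word "ab" would
	// end up as "cc". A single pass yields "bc" instead.
	fmt.Println(strings.NewReplacer("a", "b", "b", "c").Replace("ab")) // bc
}
```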

Finally, we should write expected transcription examples. They are used for unit testing. Verify your spec yourself:

test:
    "allegretto" -> "알레그레토"
    "gita"       -> "지타"
    "bisnonno"   -> "비스논노"
    "Pinocchio"  -> "피노키오"
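Checking such examples amounts to a table-driven comparison. The sketch below uses the `[][2]string` shape of the documented `Spec.Test` field, but the `failures` helper and the lookup-table transcriber are hypothetical stand-ins so that the block runs without the package. With the real library, the function passed in would wrap a Hangulizer.

```go
package main

import "fmt"

// failures runs every example through transcribe and reports mismatches.
func failures(examples [][2]string, transcribe func(string) string) []string {
	var out []string
	for _, ex := range examples {
		if got := transcribe(ex[0]); got != ex[1] {
			out = append(out, fmt.Sprintf("%q: got %q, want %q", ex[0], got, ex[1]))
		}
	}
	return out
}

func main() {
	examples := [][2]string{
		{"allegretto", "알레그레토"},
		{"gita", "지타"},
	}
	// Stand-in transcriber: a lookup table that deliberately lacks "gita".
	known := map[string]string{"allegretto": "알레그레토"}
	transcribe := func(word string) string { return known[word] }

	for _, f := range failures(examples, transcribe) {
		fmt.Println(f)
	}
}
```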
Example
package main

import (
	"fmt"

	"github.com/hangulize/hangulize"
)

func main() {
	// Person names from http://iceager.egloos.com/2610028
	catalin, _ := hangulize.Hangulize("ron", "Cătălin Moroşanu")
	fmt.Println(catalin)

	jerrel, _ := hangulize.Hangulize("nld", "Jerrel Venetiaan")
	fmt.Println(jerrel)

	vitor, _ := hangulize.Hangulize("por", "Vítor Constâncio")
	fmt.Println(vitor)
}
Output:

커털린 모로샤누
예럴 페네티안
비토르 콘스탄시우

Constants

This section is empty.

Variables

AllSteps is the array of all steps.

var ErrSpecNotFound = errors.New("spec not found")

ErrSpecNotFound occurs when the spec for the given language is not found.

var ErrTranslit = errors.New("translit error")

ErrTranslit occurs when a transliteration has failed.

var ErrTranslitNotImported = errors.New("translit not imported")

ErrTranslitNotImported occurs when the selected spec requires a Translit but it has not been imported yet.
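Callers can branch on sentinel errors like these with `errors.Is`. The sketch below is self-contained and therefore uses a local copy of `ErrSpecNotFound` and a stub in place of `hangulize.Hangulize`; both are assumptions for illustration. With the real package, the comparison would be made against `hangulize.ErrSpecNotFound`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSpecNotFound is a local copy of the documented sentinel, for illustration.
var ErrSpecNotFound = errors.New("spec not found")

// hangulizeStub stands in for hangulize.Hangulize and only knows "ita".
func hangulizeStub(lang, word string) (string, error) {
	if lang != "ita" {
		return "", fmt.Errorf("lang %q: %w", lang, ErrSpecNotFound)
	}
	return "카푸치노", nil
}

// describe maps an error to a short, user-facing message.
func describe(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrSpecNotFound):
		return "unsupported language"
	default:
		return "other failure"
	}
}

func main() {
	_, err := hangulizeStub("xxx", "Cappuccino")
	fmt.Println(describe(err)) // unsupported language
}
```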

Functions

func Hangulize

func Hangulize(lang string, word string) (string, error)

Hangulize transcribes a non-Korean word into Hangul, which is the Korean alphabet.

For example, it will transcribe "Владивосто́к" in Russian into "블라디보스토크".

It is the simplest and most useful API of this package.

Example (Cappuccino)
package main

import (
	"fmt"

	"github.com/hangulize/hangulize"
)

func main() {
	cappuccino, _ := hangulize.Hangulize("ita", "Cappuccino")
	fmt.Println(cappuccino)
}
Output:

카푸치노
Example (Nietzsche)
package main

import (
	"fmt"

	"github.com/hangulize/hangulize"
)

func main() {
	nietzsche, _ := hangulize.Hangulize("deu", "Friedrich Wilhelm Nietzsche")
	fmt.Println(nietzsche)
}
Output:

프리드리히 빌헬름 니체
Example (ShinkaiMakoto)
package main

import (
	"fmt"

	"github.com/hangulize/hangulize"
)

func main() {
	// import "github.com/hangulize/hangulize/translit"
	// translit.Install()

	shinkai, _ := hangulize.Hangulize("jpn", "新海誠")
	fmt.Println(shinkai)
}
Output:

신카이 마코토

func ListLangs

func ListLangs() []string

ListLangs returns the language name list of bundled specs. The bundled spec can be loaded by LoadSpec.

Example

Here are all supported languages.

package main

import (
	"fmt"

	"github.com/hangulize/hangulize"
)

func main() {
	for _, lang := range hangulize.ListLangs() {
		fmt.Println(lang)
	}
}
Output:

aze
bel
bul
cat
ces
chi
cym
deu
ell
epo
est
fin
grc
hbs
hun
isl
ita
jpn
jpn-ck
kat-1
kat-2
lat
lav
lit
mkd
nld
pol
por
por-br
ron
rus
slk
slv
spa
sqi
swe
tur
ukr
vie
wlm

func Translits added in v0.5.0

func Translits() map[string]Translit

Translits returns a copy of the default Translit registry.

func UnloadSpec added in v0.2.11

func UnloadSpec(lang string)

UnloadSpec flushes a cached spec to free memory.

func UnuseTranslit added in v0.5.0

func UnuseTranslit(scheme string) bool

UnuseTranslit removes an imported Translit from the default registry.

func UseTranslit added in v0.5.0

func UseTranslit(t Translit) bool

UseTranslit imports a Translit into the default registry.

Types

type Config

type Config struct {
	Authors []string
	Stage   string
}

Config keeps some configurations for a transcription specification.

type Hangulizer

type Hangulizer interface {
	Spec() *Spec

	Translits() map[string]Translit
	UseTranslit(Translit) bool
	UnuseTranslit(scheme string) bool

	Hangulize(word string) (string, error)
	HangulizeTrace(word string) (string, Traces, error)
}

Hangulizer is dedicated to a specific language. It provides the transcription logic for the underlying spec.

func New added in v0.5.0

func New(spec *Spec) Hangulizer

New creates a hangulizer for a Spec.

Example
package main

import (
	"fmt"

	"github.com/hangulize/hangulize"
)

func main() {
	spec, _ := hangulize.LoadSpec("nld")
	h := hangulize.New(spec)

	gogh, _ := h.Hangulize("Vincent van Gogh")
	fmt.Println(gogh)
}
Output:

빈센트 반고흐

type Language

type Language struct {
	ID       string    // Arbitrary, but identifiable language ID.
	Codes    [2]string // [0]: ISO 639-1 code, [1]: ISO 639-3 code
	English  string    // The language name in English.
	Korean   string    // The language name in Korean.
	Script   string
	Translit []string
}

Language identifies a natural language.

func (Language) String

func (l Language) String() string

type Rule

type Rule struct {
	ID   int
	From *hre.Pattern
	To   *hre.RPattern
}

Rule is a pair of Pattern and RPattern.

func (Rule) Replace added in v0.3.0

func (r Rule) Replace(word string) string

Replace matches the word with the Pattern and replaces with the RPattern.

func (Rule) String

func (r Rule) String() string

type Spec

type Spec struct {
	// Meta information sections
	Lang   Language
	Config Config

	// Helper setting sections
	Macros    map[string]string
	Vars      map[string][]string
	Normalize map[string][]string

	// Rewrite/Transcribe
	Rewrite    []Rule
	Transcribe []Rule

	// Test examples
	Test [][2]string

	// Source code
	Source string
	// contains filtered or unexported fields
}

Spec represents a transcription specification for a language.

func LoadSpec

func LoadSpec(lang string) (*Spec, bool)

LoadSpec finds a bundled spec by the given language name. Once it loads a spec, it will cache the spec.

func ParseSpec

func ParseSpec(r io.Reader) (*Spec, error)

ParseSpec parses a Spec from an HSL source.

func (Spec) GoString added in v0.3.1

func (s Spec) GoString() string

GoString implements GoStringer for Spec.

func (Spec) String

func (s Spec) String() string

type Step added in v0.3.0

type Step int

Step is an identifier for each step in the Hangulize procedure.

const (

	// Input step just records the beginning.
	Input Step

	// Transliterate step converts the spelling to the phonograms.
	Transliterate

	// Normalize step eliminates letter case to make the next steps work easier.
	Normalize

	// Group step associates meaningful letters.
	Group

	// Rewrite step minimizes the gap between pronunciation and spelling.
	Rewrite

	// Transcribe step determines Hangul spelling for the pronunciation.
	Transcribe

	// Syllabify step composes Jamo phonemes into Hangul syllabic blocks.
	Syllabify

	// Localize step converts foreign punctuations to fit in Korean.
	Localize
)

func (Step) String added in v0.3.0

func (s Step) String() string

type Trace

type Trace struct {
	Step Step
	Word string

	Why string

	Rule    Rule
	HasRule bool
}

Trace is emitted when a replacement occurs. It is used for tracing the internals of the Hangulize procedure.

func (Trace) String

func (t Trace) String() string

type Traces added in v0.3.0

type Traces []Trace

Traces is a slice of Trace.

func (Traces) Render added in v0.3.0

func (ts Traces) Render(w io.Writer)

Render generates a report text.

type Translit added in v0.5.0

type Translit interface {
	// Scheme returns the identifier string of a Translit.
	Scheme() string

	// Transliterate transliterates the given word.
	Transliterate(string) (string, error)
}

Translit is an interface for a transliterator. It may convert a word from one script to another script. It also may guess phonograms from the spelling based on lexical analysis.
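A toy implementation shows the shape of the interface. The interface is re-declared locally so the block runs on its own, and the `hiraToKata` type with its scheme name "hira2kata" is hypothetical. It converts Hiragana to Katakana by the fixed code point offset between the two Unicode blocks, which is a script-to-script conversion rather than the dictionary-based analysis described earlier. With the real package, such a value could presumably be registered through UseTranslit.

```go
package main

import "fmt"

// Translit mirrors the documented interface so this sketch is self-contained.
type Translit interface {
	Scheme() string
	Transliterate(string) (string, error)
}

// hiraToKata is a hypothetical transliterator from Hiragana to Katakana.
type hiraToKata struct{}

// Scheme returns an identifier invented for this example.
func (hiraToKata) Scheme() string { return "hira2kata" }

// Transliterate shifts Hiragana letters (U+3041..U+3096) by 0x60 to reach
// the matching Katakana letters and leaves every other character as is.
func (hiraToKata) Transliterate(word string) (string, error) {
	out := make([]rune, 0, len(word))
	for _, r := range word {
		if r >= 0x3041 && r <= 0x3096 {
			r += 0x60
		}
		out = append(out, r)
	}
	return string(out), nil
}

// Compile-time check that hiraToKata satisfies Translit.
var _ Translit = hiraToKata{}

func main() {
	var tl Translit = hiraToKata{}
	kana, _ := tl.Transliterate("にほんご")
	fmt.Println(tl.Scheme(), kana) // hira2kata ニホンゴ
}
```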

Directories

Path Synopsis
cmd
internal
jamo
Package jamo implements a Hangul composer.
subword
Package subword implements a word replacement with a level.
pkg
hre
Package hre provides the regular expression dialect for Hangulize called HRE.
hsl
Package hsl implements a parser for the HSL format which is used for Hangulize.
furigana
Package furigana implements the hangulize.Translit interface for Japanese Kanji.
pinyin
Package pinyin implements the hangulize.Translit interface for Chinese Hanzi.
