cedict

Version: v0.1.1
Published: Dec 23, 2018 License: BSD-3-Clause Imports: 7 Imported by: 0

README

NOTE: A friendly fork for personal use.

CEDict Parser in Go

Package cedict provides a parser / tokenizer for reading entries from the CC-CEDict Chinese dictionary project.

Installation

Assuming you have Go installed, installation is as easy as running:

    go get github.com/FSX/cedict

You will need a copy of the CEDict dictionary text file. You can download CEDict from MDBG.net. Extract the file somewhere you want to use it from, and then follow the usage instructions below.

Usage

Tokenizing is done by creating a CEDict for an io.Reader r. It is the caller's responsibility to ensure that r provides a CEDict-formatted dictionary.

    import "github.com/FSX/cedict"

    ...

    c := cedict.New(r) // r is an io.Reader to the cedict file

Given a CEDict called c, the dictionary is tokenized by repeatedly calling c.NextEntry(). Each call parses up to and including the next entry, and returns an error if no more entries are found:

    for {
        err := c.NextEntry()
        if err != nil {
            break
        }
        entry := c.Entry()
        fmt.Println(entry.Simplified, entry.Definitions[0])
    }

Call the Entry method to retrieve the current entry. A lower-level API is also available through the bufio.Scanner Scan method. This lower-level API is the recommended way to read comments from the CEDict file, should that be necessary.
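To illustrate the idea behind scanning for comments, here is a minimal sketch that uses only the standard library's bufio.Scanner rather than this package. It assumes that comment lines in the CC-CEDICT file start with "#"; the helper name commentsIn is hypothetical and not part of cedict.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// commentsIn returns the comment lines found in a CC-CEDICT formatted text,
// with the leading '#' characters and surrounding whitespace removed.
// Hypothetical helper for illustration; it is not part of the cedict package.
func commentsIn(text string) ([]string, error) {
	var comments []string
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "#") {
			continue // an entry or a blank line, not a comment
		}
		comments = append(comments, strings.TrimSpace(strings.TrimLeft(line, "#")))
	}
	return comments, sc.Err()
}

func main() {
	input := "# CC-CEDICT\n#! version=1\n一層 一层 [yi1 ceng2] /layer/\n"
	comments, err := commentsIn(input)
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	for _, c := range comments {
		fmt.Println(c)
	}
}
```

With the package itself, the equivalent would presumably be to call Scan on the CEDict and inspect its TokenType field, but the exact behaviour should be checked against the package source.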

Documentation

Full documentation can be found at https://godoc.org/github.com/FSX/cedict

Documentation

Constants

const (
	EntryToken = iota
	CommentToken
	ErrorToken
)

Variables

var NoMoreEntries error = errors.New("No more entries to read")
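A sentinel error such as NoMoreEntries lets the caller tell a normal end of input apart from a real failure. The following is a minimal, self-contained sketch of that pattern. It does not import cedict: errNoMoreEntries, errTruncated, the source interface, and fakeSource are all hypothetical stand-ins, and only the NextEntry() error method shape is taken from the package.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoMoreEntries is a local stand-in for cedict.NoMoreEntries.
var errNoMoreEntries = errors.New("No more entries to read")

// errTruncated is a hypothetical "something went wrong" error.
var errTruncated = errors.New("unexpected end of input")

// source is the single method this sketch relies on.
type source interface {
	NextEntry() error
}

// fakeSource is a hypothetical test double: it yields `remaining`
// entries and then keeps returning `final`.
type fakeSource struct {
	remaining int
	final     error
}

func (f *fakeSource) NextEntry() error {
	if f.remaining == 0 {
		return f.final
	}
	f.remaining--
	return nil
}

// drain counts entries until the source stops. Reaching the end of the
// input is reported as a nil error; any other error is passed through.
func drain(s source) (int, error) {
	n := 0
	for {
		err := s.NextEntry()
		if errors.Is(err, errNoMoreEntries) {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		n++
	}
}

func main() {
	n, err := drain(&fakeSource{remaining: 2, final: errNoMoreEntries})
	fmt.Println(n, err) // prints: 2 <nil>

	n, err = drain(&fakeSource{remaining: 1, final: errTruncated})
	fmt.Println(n, err) // prints: 1 unexpected end of input
}
```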

Functions

func ToPinyinTonemarks

func ToPinyinTonemarks(p string) string

ToPinyinTonemarks takes a CEDICT pinyin representation and returns the concatenated pinyin version with tone marks, e.g., yi1 lan3 zi5 => yīlǎnzi. This function is useful for customizing pinyin conversion for your own application. For example, if you wish to get the tone-marked pinyin of each character, you may pass in each syllable of the original word separately, as in yi1 => yī, lan3 => lǎn, zi5 => zi.
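To make the conversion concrete, here is a minimal sketch of how numbered pinyin can be turned into tone-marked pinyin. It is not the package's implementation: the names toTonemarks, markSyllable, and markIndex are hypothetical, only lowercase syllables are handled, and it assumes the usual placement rule (mark "a" or "e" if present, the "o" of "ou", otherwise the last vowel) with "u:" standing for "ü".

```go
package main

import (
	"fmt"
	"strings"
)

// toneTable maps a plain vowel to its forms for tones 1 to 4.
var toneTable = map[rune][]rune{
	'a': []rune("āáǎà"),
	'e': []rune("ēéěè"),
	'i': []rune("īíǐì"),
	'o': []rune("ōóǒò"),
	'u': []rune("ūúǔù"),
	'ü': []rune("ǖǘǚǜ"),
}

// markIndex returns the index of the vowel that carries the tone mark,
// or -1 if the syllable has no vowel.
func markIndex(body []rune) int {
	lastVowel := -1
	for i, r := range body {
		if r == 'a' || r == 'e' {
			return i
		}
		if _, ok := toneTable[r]; ok {
			lastVowel = i
		}
	}
	for i := 0; i+1 < len(body); i++ {
		if body[i] == 'o' && body[i+1] == 'u' {
			return i
		}
	}
	return lastVowel
}

// markSyllable converts one numbered syllable, e.g. "lan3" to "lǎn".
// Tone 5 (neutral) only drops the digit.
func markSyllable(s string) string {
	s = strings.ReplaceAll(s, "u:", "ü")
	if s == "" {
		return s
	}
	last := s[len(s)-1]
	if last < '1' || last > '5' {
		return s // no tone digit, leave untouched
	}
	body := []rune(s[:len(s)-1])
	tone := int(last - '0')
	idx := markIndex(body)
	if tone == 5 || idx < 0 {
		return string(body)
	}
	body[idx] = toneTable[body[idx]][tone-1]
	return string(body)
}

// toTonemarks converts space-separated syllables and concatenates them.
func toTonemarks(p string) string {
	var b strings.Builder
	for _, syl := range strings.Fields(p) {
		b.WriteString(markSyllable(syl))
	}
	return b.String()
}

func main() {
	fmt.Println(toTonemarks("yi1 lan3 zi5")) // prints: yīlǎnzi
	fmt.Println(markSyllable("lan3"))        // prints: lǎn
}
```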

Types

type CEDict

type CEDict struct {
	*bufio.Scanner
	TokenType int
	// contains filtered or unexported fields
}

CEDict is the basic tokenizer struct we use to read and parse new dictionary instances.

Example

The following example demonstrates basic usage of the package. It uses a strings.Reader as the io.Reader, where you would normally use a file opened with os.Open.

dict := `一層 一层 [yi1 ceng2] /layer/
一攬子 一揽子 [yi1 lan3 zi5] /all-inclusive/undiscriminating/`
r := io.Reader(strings.NewReader(dict))
c := New(r)
for {
	err := c.NextEntry()
	if err != nil {
		// you may also compare the error to cedict.NoMoreEntries
		// to know whether the end was reached or some other problem
		// occurred.
		break
	}
	// get current entry
	entry := c.Entry()
	// print out some fields
	fmt.Printf("%s\t(%s)\t%s\n", entry.Simplified, entry.PinyinWithTones, entry.Definitions[0])
}
Output:

一层	(yīcéng)	layer
一揽子	(yīlǎnzi)	all-inclusive

func New

func New(r io.Reader) *CEDict

New takes an io.Reader and creates a new CEDict instance.

func (*CEDict) Entry

func (c *CEDict) Entry() *Entry

Entry returns a pointer to the most recently parsed Entry struct.

func (*CEDict) NextEntry

func (c *CEDict) NextEntry() error

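NextEntry reads until the next entry token is found. Once found, it parses the token into a newly populated Entry struct, which can then be retrieved with Entry. It returns an error if no entry could be read; at the end of the input, that error can be compared to NoMoreEntries.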

type Entry

type Entry struct {
	Simplified      string
	Traditional     string
	Pinyin          string
	PinyinWithTones string
	PinyinNoTones   string
	Definitions     []string
}

Entry represents a single entry in the cedict dictionary.
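To show how one line of the dictionary file relates to these fields, here is a minimal sketch that splits a line of the form "Traditional Simplified [pinyin] /definition/definition/". It is not the package's parser: rawEntry, errMalformed, and parseLine are hypothetical names, and the tone-converted pinyin fields are left out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// rawEntry mirrors a subset of the Entry fields for this sketch only.
type rawEntry struct {
	Traditional string
	Simplified  string
	Pinyin      string
	Definitions []string
}

// errMalformed is a hypothetical error for lines that do not match the format.
var errMalformed = errors.New("malformed CC-CEDICT line")

// parseLine splits a single dictionary line into its parts.
func parseLine(line string) (rawEntry, error) {
	var e rawEntry
	line = strings.TrimSpace(line)

	// The pinyin sits between the first '[' and the next ']'.
	lb := strings.Index(line, "[")
	if lb < 0 {
		return e, errMalformed
	}
	rb := strings.Index(line[lb:], "]")
	if rb < 0 {
		return e, errMalformed
	}
	rb += lb

	// Before the bracket: traditional form, then simplified form.
	heads := strings.Fields(line[:lb])
	if len(heads) != 2 {
		return e, errMalformed
	}

	// After the bracket: definitions delimited by '/'.
	rest := strings.TrimSpace(line[rb+1:])
	if !strings.HasPrefix(rest, "/") || !strings.HasSuffix(rest, "/") {
		return e, errMalformed
	}
	defs := strings.Trim(rest, "/")
	if defs == "" {
		return e, errMalformed
	}

	e.Traditional = heads[0]
	e.Simplified = heads[1]
	e.Pinyin = line[lb+1 : rb]
	e.Definitions = strings.Split(defs, "/")
	return e, nil
}

func main() {
	e, err := parseLine("一攬子 一揽子 [yi1 lan3 zi5] /all-inclusive/undiscriminating/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(e.Simplified, e.Pinyin, len(e.Definitions))
}
```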
