vertigo

v3.0.5
Warning

This package is not in the latest version of its module.

Published: Jan 1, 2020 License: Apache-2.0 Imports: 12 Imported by: 0

README

vertigo

Vertigo is intended for parsing so-called corpus vertical files. These are essentially SGML files in which structural information is expressed by custom tags (each tag on its own line) and token information (again, each token on its own line) is expressed as tab-separated values (e.g. word[tab]lemma[tab]tag). Such a file looks like this:

<doc id="adams-restaurant_at_the" lang="en" version="00" wordcount="54066">
<div author="Adams, Douglas" title="The Restaurant at the End of the Universe" group="Core" publisher="" pubplace="" pubyear="1980" pubmonth="" origyear="" isbn="" txtype="fiction" comment="" original="Yes" srclang="en" translator="" transsex="" authsex="M" lang_var="en-GB" id="en:adams-restaurant_na_ko:0" wordcount="54066">
<p id="en:adams-restaurant_na_ko:0:1">
<s id="en:adams-restaurant_na_ko:0:1:1">
The     the     DT
Restaurant      Restaurant      NP
at      at      IN
the     the     DT
End     end     NN
of      of      IN
the     the     DT
Universe        universe        NN
</s>
</p>
<p id="en:adams-restaurant_na_ko:0:2">
<s id="en:adams-restaurant_na_ko:0:2:1">
There   there   EX
is      be      VBZ
a       a       DT
theory  theory  NN
...
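
To make the line format concrete, here is a minimal sketch of how a single line of such a file could be classified and split. It is not Vertigo's own parsing code; the names lineKind, classifyLine and splitToken are hypothetical, and the rule "a line wrapped in < and > is a tag" is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// lineKind is a hypothetical classification used only in this sketch.
type lineKind int

const (
	kindToken lineKind = iota
	kindStructOpen
	kindStructClose
)

// classifyLine decides whether a vertical-file line is a tag or a token.
// Assumption: a line of the form "</...>" closes a structure, any other
// line of the form "<...>" opens one, and everything else is a token.
func classifyLine(line string) lineKind {
	switch {
	case strings.HasPrefix(line, "</") && strings.HasSuffix(line, ">"):
		return kindStructClose
	case strings.HasPrefix(line, "<") && strings.HasSuffix(line, ">"):
		return kindStructOpen
	default:
		return kindToken
	}
}

// splitToken splits a token line into its tab-separated positional attributes.
func splitToken(line string) []string {
	return strings.Split(line, "\t")
}

func main() {
	lines := []string{
		`<s id="en:adams-restaurant_na_ko:0:1:1">`,
		"The\tthe\tDT",
		"</s>",
	}
	for _, l := range lines {
		if classifyLine(l) == kindToken {
			fmt.Println("token:", splitToken(l))
		} else {
			fmt.Println("tag:", l)
		}
	}
}
```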

Vertigo parses an input file and builds a result (via a provided LineProcessor) at the same time, using two goroutines arranged in a producer-consumer pattern.

The LineProcessor interface is the following:

type LineProcessor interface {
	ProcToken(token *Token, line int, err error)
	ProcStruct(strc *Structure, line int, err error)
	ProcStructClose(strc *StructureClose, line int, err error)
}

An example of how to configure and run the parser (with some fake functions inside) may look like this:

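package main

import (
	"log"

	"github.com/tomachalek/vertigo"
)

// MyProcessor implements vertigo.LineProcessor.
// useWordPosAttr, useFirstNonWordPosAttr, structNameIs and useStructAttr
// are placeholder functions you are expected to supply yourself.
type MyProcessor struct {
}

// ProcToken is called for each token line.
func (mp *MyProcessor) ProcToken(token *vertigo.Token, line int, err error) {
	useWordPosAttr(token.Word)
	useFirstNonWordPosAttr(token.Attrs[0])
}

// ProcStruct is called for each structure opening tag.
func (mp *MyProcessor) ProcStruct(strc *vertigo.Structure, line int, err error) {
	structNameIs(strc.Name)
	for sattr, sattrVal := range strc.Attrs {
		useStructAttr(sattr, sattrVal)
	}
}

// ProcStructClose is called for each structure closing tag.
func (mp *MyProcessor) ProcStructClose(strc *vertigo.StructureClose, line int, err error) {

}

func main() {
	pc := &vertigo.ParserConf{
		InputFilePath:         "/path/to/a/vertical/file",
		Encoding:              "utf-8",
		StructAttrAccumulator: "comb",
	}
	// The methods have pointer receivers, so a pointer must be passed.
	proc := &MyProcessor{}
	err := vertigo.ParseVerticalFile(pc, proc)
	if err != nil {
		log.Fatal(err)
	}
}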

Documentation


Constants

const (
	LineTypeToken   = "token"
	LineTypeStruct  = "struct"
	LineTypeIgnored = "ignored"

	AccumulatorTypeStack = "stack"
	AccumulatorTypeComb  = "comb"
	AccumulatorTypeNil   = "nil"

	CharsetISO8859_1   = "iso-8859-1"
	CharsetISO8859_2   = "iso-8859-2"
	CharsetISO8859_3   = "iso-8859-3"
	CharsetISO8859_4   = "iso-8859-4"
	CharsetISO8859_5   = "iso-8859-5"
	CharsetISO8859_6   = "iso-8859-6"
	CharsetISO8859_7   = "iso-8859-7"
	CharsetISO8859_8   = "iso-8859-8"
	CharsetWindows1250 = "windows-1250"
	CharsetWindows1251 = "windows-1251"
	CharsetWindows1252 = "windows-1252"
	CharsetWindows1253 = "windows-1253"
	CharsetWindows1254 = "windows-1254"
	CharsetWindows1255 = "windows-1255"
	CharsetWindows1256 = "windows-1256"
	CharsetWindows1257 = "windows-1257"
	CharsetWindows1258 = "windows-1258"
	CharsetUTF_8       = "utf-8"
)

Variables

This section is empty.

Functions

func GetCharmapByName

func GetCharmapByName(name string) (*charmap.Charmap, error)

GetCharmapByName returns the Charmap instance matching the provided encoding name. Name matching is case-insensitive (e.g. utf-8 is the same as UTF-8). See SupportedCharsets for the list of charset names.
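
The case-insensitive matching can be illustrated with a small standalone sketch. It does not call the library and does not return a Charmap; the supported slice holds only a few of the Charset* values from the constants above, and normalizeCharset is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// supported mirrors a few of the Charset* constant values listed on this
// page. It is an abbreviated list used only for this sketch.
var supported = []string{"iso-8859-2", "windows-1250", "utf-8"}

// normalizeCharset (hypothetical helper) lower-cases the provided name and
// returns it if it is one of the supported names.
func normalizeCharset(name string) (string, error) {
	lc := strings.ToLower(name)
	for _, s := range supported {
		if s == lc {
			return s, nil
		}
	}
	return "", fmt.Errorf("unsupported charset: %q", name)
}

func main() {
	for _, name := range []string{"UTF-8", "Windows-1250", "klingon"} {
		norm, err := normalizeCharset(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(name, "->", norm)
	}
}
```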

func ParseVerticalFile

func ParseVerticalFile(conf *ParserConf, lproc LineProcessor) error

ParseVerticalFile processes a corpus vertical file line by line and applies a custom LineProcessor to each line. The processing is parallelized in the sense that reading the file into lines and processing those lines run in different goroutines. To reduce overhead, the data are passed between the goroutines in chunks.
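
The chunked hand-over between the reading and the processing goroutine can be sketched as follows. This is a generic producer-consumer illustration under assumed names (readChunks, countLines) and an arbitrary chunkSize, not the library's implementation.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// chunkSize is an arbitrary value chosen for this sketch.
const chunkSize = 2

// readChunks is the producer: it scans lines and sends them to the
// consumer in slices of up to chunkSize lines, then closes the channel.
func readChunks(sc *bufio.Scanner, out chan<- []string) {
	defer close(out)
	chunk := make([]string, 0, chunkSize)
	for sc.Scan() {
		chunk = append(chunk, sc.Text())
		if len(chunk) == chunkSize {
			out <- chunk
			// Allocate a fresh slice so the consumer owns the sent one.
			chunk = make([]string, 0, chunkSize)
		}
	}
	if len(chunk) > 0 {
		out <- chunk
	}
}

// countLines is the consumer: it receives chunks until the channel is
// closed and here simply counts the lines it has seen.
func countLines(input string) int {
	ch := make(chan []string, 4)
	go readChunks(bufio.NewScanner(strings.NewReader(input)), ch)
	total := 0
	for chunk := range ch {
		total += len(chunk)
	}
	return total
}

func main() {
	fmt.Println(countLines("<s>\nThe\tthe\tDT\nEnd\tend\tNN\n</s>"))
}
```

Sending one slice per channel operation instead of one line keeps the synchronization cost per line low, which is the overhead the description above refers to.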

func ParseVerticalFileNoGoRo

func ParseVerticalFileNoGoRo(conf *ParserConf, lproc LineProcessor)

ParseVerticalFileNoGoRo exists for benchmarking purposes only.

func SupportedCharsets

func SupportedCharsets() []string

SupportedCharsets returns a list of names of character sets.

Types

type LineProcessor

type LineProcessor interface {
	ProcToken(token *Token, line int, err error)
	ProcStruct(strc *Structure, line int, err error)
	ProcStructClose(strc *StructureClose, line int, err error)
}

type ParserConf

type ParserConf struct {

	// Source vertical file (either a plain text file or a gzip one)
	InputFilePath string `json:"inputFilePath"`

	Encoding string `json:"encoding"`

	FilterArgs [][][]string `json:"filterArgs"`

	StructAttrAccumulator string `json:"structAttrAccumulator"`

	LogProgressEachNth int `json:"logProgressEachNth"`
}

ParserConf contains configuration parameters for the vertical file parser.

func LoadConfig

func LoadConfig(path string) *ParserConf

LoadConfig loads the configuration from a JSON file. In case of an error, the program panics.
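
Based on the JSON tags of ParserConf, a configuration file might look like the sampleConf string in the sketch below. The sketch decodes it with encoding/json into a local copy of the struct; parserConf and decodeConf are hypothetical names, the values are made up, and unlike LoadConfig the helper returns an error instead of panicking.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parserConf is a local copy of the ParserConf fields and JSON tags
// documented on this page.
type parserConf struct {
	InputFilePath         string       `json:"inputFilePath"`
	Encoding              string       `json:"encoding"`
	FilterArgs            [][][]string `json:"filterArgs"`
	StructAttrAccumulator string       `json:"structAttrAccumulator"`
	LogProgressEachNth    int          `json:"logProgressEachNth"`
}

// sampleConf is an illustrative configuration with made-up values.
const sampleConf = `{
  "inputFilePath": "/path/to/a/vertical/file",
  "encoding": "utf-8",
  "filterArgs": [
    [["div.author", "John Doe"]],
    [["div.title", "Unknown"], ["div.title", "Superunknown"]]
  ],
  "structAttrAccumulator": "comb",
  "logProgressEachNth": 1000000
}`

// decodeConf (hypothetical helper) parses a JSON document into parserConf.
func decodeConf(data string) (*parserConf, error) {
	var conf parserConf
	if err := json.Unmarshal([]byte(data), &conf); err != nil {
		return nil, err
	}
	return &conf, nil
}

func main() {
	conf, err := decodeConf(sampleConf)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(conf.Encoding, conf.StructAttrAccumulator, len(conf.FilterArgs))
}
```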

type Structure

type Structure struct {
	Name    string
	Attrs   map[string]string
	IsEmpty bool
}

Structure represents a structure opening tag.

type StructureClose

type StructureClose struct {
	Name string
}

StructureClose represents a structure closing tag.

type Token

type Token struct {
	Idx         int
	Word        string
	Attrs       []string
	StructAttrs map[string]string
}

Token is a representation of a parsed line. It combines the positional attributes with the structural attributes accumulated so far.
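
One way to picture "accumulated structural attributes" is a stack of currently open structures that is flattened into a single map whenever a token is read. The sketch below is only an illustration of that idea and not the library's accumulator code. The "struct.attr" key format is an assumption inferred from the filter example for MatchesFilter, and all names (structStack, openTag, closeTag, flatten) are hypothetical.

```go
package main

import "fmt"

// openStruct is one currently open structure with its attributes.
type openStruct struct {
	name  string
	attrs map[string]string
}

// structStack keeps the currently open structures, innermost last.
type structStack []openStruct

// openTag pushes a newly opened structure.
func (s *structStack) openTag(name string, attrs map[string]string) {
	*s = append(*s, openStruct{name: name, attrs: attrs})
}

// closeTag pops the innermost structure if its name matches and reports
// whether it did so.
func (s *structStack) closeTag(name string) bool {
	n := len(*s)
	if n == 0 || (*s)[n-1].name != name {
		return false
	}
	*s = (*s)[:n-1]
	return true
}

// flatten merges the attributes of all open structures into one map
// keyed as "struct.attr" (assumed key format).
func (s structStack) flatten() map[string]string {
	out := make(map[string]string)
	for _, st := range s {
		for k, v := range st.attrs {
			out[st.name+"."+k] = v
		}
	}
	return out
}

func main() {
	var st structStack
	st.openTag("doc", map[string]string{"lang": "en"})
	st.openTag("p", map[string]string{"id": "en:adams-restaurant_na_ko:0:1"})
	fmt.Println(st.flatten()) // attributes visible to tokens inside <p>
	st.closeTag("p")
	fmt.Println(st.flatten()) // attributes visible after </p>
}
```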

func (*Token) MatchesFilter

func (t *Token) MatchesFilter(filterCNF [][][]string) bool

MatchesFilter tests whether the token matches a filter in conjunctive normal form (CNF) encoded as a three-dimensional list. For example, the filter

	div.author = 'John Doe' AND (div.title = 'Unknown' OR div.title = 'Superunknown')

is encoded as:

	{
		{{"div.author", "John Doe"}},
		{{"div.title", "Unknown"}, {"div.title", "Superunknown"}},
	}
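
The evaluation of such a filter can be sketched as a plain function over a map of structural attributes. This is an illustration of CNF matching in general, not the method's actual body; matchesCNF is a hypothetical name, and the handling of empty filters and malformed pairs is an assumption.

```go
package main

import "fmt"

// matchesCNF reports whether attrs satisfies a filter in conjunctive
// normal form: every clause (outer list) must contain at least one
// [key, value] pair that matches an attribute exactly.
// Assumptions: an empty filter matches everything, an empty clause
// matches nothing, and pairs that do not have two items are ignored.
func matchesCNF(attrs map[string]string, cnf [][][]string) bool {
	for _, clause := range cnf {
		satisfied := false
		for _, pair := range clause {
			if len(pair) != 2 {
				continue
			}
			if v, found := attrs[pair[0]]; found && v == pair[1] {
				satisfied = true
				break
			}
		}
		if !satisfied {
			return false
		}
	}
	return true
}

func main() {
	filter := [][][]string{
		{{"div.author", "John Doe"}},
		{{"div.title", "Unknown"}, {"div.title", "Superunknown"}},
	}
	attrs := map[string]string{
		"div.author": "John Doe",
		"div.title":  "Superunknown",
	}
	fmt.Println(matchesCNF(attrs, filter))
}
```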

func (*Token) PosAttrByIndex

func (t *Token) PosAttrByIndex(idx int) string

PosAttrByIndex returns a positional attribute based on its original index in the vertical file.

func (*Token) WordLC

func (t *Token) WordLC() string

WordLC returns the 'word' positional attribute converted to lowercase.
