vertigo

v5.1.1
Published: Apr 16, 2024 License: Apache-2.0 Imports: 12 Imported by: 9

README

vertigo

Vertigo is a parser for so-called corpus vertical files. These are essentially SGML files in which structural information is expressed by custom tags (each tag on its own line) and token information (again, each token on its own line) is expressed as tab-separated values (e.g. word[tab]lemma[tab]tag). The parser is written in Go; the latest version is v5.

An example of a vertical file looks like this:

<doc id="adams-restaurant_at_the" lang="en" version="00" wordcount="54066">
<div author="Adams, Douglas" title="The Restaurant at the End of the Universe" group="Core" publisher="" pubplace="" pubyear="1980" pubmonth="" origyear="" isbn="" txtype="fiction" comment="" original="Yes" srclang="en" translator="" transsex="" authsex="M" lang_var="en-GB" id="en:adams-restaurant_na_ko:0" wordcount="54066">
<p id="en:adams-restaurant_na_ko:0:1">
<s id="en:adams-restaurant_na_ko:0:1:1">
The     the     DT
Restaurant      Restaurant      NP
at      at      IN
the     the     DT
End     end     NN
of      of      IN
the     the     DT
Universe        universe        NN
</s>
</p>
<p id="en:adams-restaurant_na_ko:0:2">
<s id="en:adams-restaurant_na_ko:0:2:1">
There   there   EX
is      be      VBZ
a       a       DT
theory  theory  NN
...
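
The line layout shown above can be illustrated with a small sketch. This is not Vertigo's actual parser: the classifyLine helper and its rules (an empty line is ignored, a line that starts with "<" and ends with ">" is a structure, anything else is a tab-separated token) are assumptions made for illustration only. The returned strings simply mirror the values of the package's LineType constants.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyLine is a hypothetical helper, not part of Vertigo.
// It returns "struct", "token" or "ignored"; for token lines it also
// returns the tab-separated columns (word, lemma, tag, ...).
func classifyLine(line string) (string, []string) {
	trimmed := strings.TrimSpace(line)
	switch {
	case trimmed == "":
		return "ignored", nil
	case strings.HasPrefix(trimmed, "<") && strings.HasSuffix(trimmed, ">"):
		return "struct", nil
	default:
		return "token", strings.Split(trimmed, "\t")
	}
}

func main() {
	lines := []string{
		`<s id="en:adams-restaurant_na_ko:0:1:1">`,
		"The\tthe\tDT",
		"</s>",
		"",
	}
	for _, l := range lines {
		kind, cols := classifyLine(l)
		fmt.Println(kind, cols)
	}
}
```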

Vertigo parses an input file and builds a result (via the provided LineProcessor) at the same time, using two goroutines arranged in a producer-consumer pattern. The external behavior of the parser is nevertheless synchronous: once the ParseVerticalFile call returns, parsing is complete and any additional goroutines have finished.
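
As a rough illustration of this pattern (not Vertigo's own code), the sketch below reads lines in one goroutine and handles them in another, while the enclosing function blocks until both are done. The processLines function, the channel buffer size and the handle callback are all hypothetical names and values chosen for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// processLines is a hypothetical stand-in for a synchronous parsing call.
// A producer goroutine splits the input into lines, a consumer goroutine
// passes each line to handle, and the function returns the number of
// handled lines only after both goroutines have finished.
func processLines(input string, handle func(line string, num int)) int {
	lines := make(chan string, 64) // buffer size is an arbitrary choice
	done := make(chan int)

	// Producer: read the input line by line.
	go func() {
		defer close(lines)
		sc := bufio.NewScanner(strings.NewReader(input))
		for sc.Scan() {
			lines <- sc.Text()
		}
	}()

	// Consumer: process lines in their original order.
	go func() {
		n := 0
		for line := range lines {
			n++
			handle(line, n)
		}
		done <- n
	}()

	// Blocking here is what makes the call synchronous for the caller.
	return <-done
}

func main() {
	input := "<s>\nThe\tthe\tDT\nRestaurant\tRestaurant\tNP\n</s>"
	n := processLines(input, func(line string, num int) {
		fmt.Printf("%d: %q\n", num, line)
	})
	fmt.Println("lines processed:", n)
}
```

Because the caller waits on the done channel, anything the callback has written is safe to read as soon as processLines returns.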

The LineProcessor interface is the following:

type LineProcessor interface {
	ProcToken(token *Token, line int, err error) error
	ProcStruct(strc *Structure, line int, err error) error
	ProcStructClose(strc *StructureClose, line int, err error) error
}

An example of how to configure and run the parser (with some fake functions inside) may look like this:

package main

import (
	"log"
	"github.com/tomachalek/vertigo"
)

type MyProcessor struct {
}

func (mp *MyProcessor) ProcToken(token *vertigo.Token, line int, err error) error {
	if err != nil {
		return err
	}
	useWordPosAttr(token.Word)
	useFirstNonWordPosAttr(token.Attrs[0])
	return nil
}

func (d *MyProcessor) ProcStruct(strc *vertigo.Structure, line int, err error) error {
	if err != nil {
		return err
	}
	structNameIs(strc.Name)
	for sattr, sattrVal := range strc.Attrs {
		useStructAttr(sattr, sattrVal)
	}
	return nil
}

func (d *MyProcessor) ProcStructClose(strc *vertigo.StructureClose, line int, err error) error {
	return err
}

func main() {
	pc := &vertigo.ParserConf{
		InputFilePath:         "/path/to/a/vertical/file",
		Encoding:              "utf-8",
		StructAttrAccumulator: "comb",
	}
	proc := &MyProcessor{} // the methods use pointer receivers, so pass a pointer
	err := vertigo.ParseVerticalFile(pc, proc)
	if err != nil {
		log.Fatal(err)
	}
}

Documentation

Index

Constants

const (
	LineTypeToken   = "token"
	LineTypeStruct  = "struct"
	LineTypeIgnored = "ignored"

	AccumulatorTypeStack = "stack"
	AccumulatorTypeComb  = "comb"
	AccumulatorTypeNil   = "nil"

	CharsetISO8859_1   = "iso-8859-1"
	CharsetISO8859_2   = "iso-8859-2"
	CharsetISO8859_3   = "iso-8859-3"
	CharsetISO8859_4   = "iso-8859-4"
	CharsetISO8859_5   = "iso-8859-5"
	CharsetISO8859_6   = "iso-8859-6"
	CharsetISO8859_7   = "iso-8859-7"
	CharsetISO8859_8   = "iso-8859-8"
	CharsetWindows1250 = "windows-1250"
	CharsetWindows1251 = "windows-1251"
	CharsetWindows1252 = "windows-1252"
	CharsetWindows1253 = "windows-1253"
	CharsetWindows1254 = "windows-1254"
	CharsetWindows1255 = "windows-1255"
	CharsetWindows1256 = "windows-1256"
	CharsetWindows1257 = "windows-1257"
	CharsetWindows1258 = "windows-1258"
	CharsetUTF_8       = "utf-8"
)

Variables

This section is empty.

Functions

func GetCharmapByName

func GetCharmapByName(name string) (*charmap.Charmap, error)

GetCharmapByName returns a proper Charmap instance based on provided encoding name. The name detection is case insensitive (e.g. utf-8 is the same as UTF-8). The number of supported charsets is

func ParseVerticalFile

func ParseVerticalFile(conf *ParserConf, lproc LineProcessor) error

ParseVerticalFile processes a corpus vertical file line by line and applies a custom LineProcessor to each line. The processing is parallelized in the sense that reading the file into lines and processing those lines run in different goroutines. The function as a whole still behaves synchronously, i.e. once it returns a value, the processing is finished.

func ParseVerticalFileNoGoRo

func ParseVerticalFileNoGoRo(conf *ParserConf, lproc LineProcessor)

ParseVerticalFileNoGoRo is just for benchmarking purposes

func SupportedCharsets

func SupportedCharsets() []string

SupportedCharsets returns a list of names of character sets.

Types

type LineProcessor

type LineProcessor interface {

	// ProcToken is called each time the parser encounters a positional
	// attribute. In case parsing produces an error, it is passed to the
	// function without stopping the whole process.
	// In case the function returns an error, the parser stops
	// (in the simplest case it can even be the error it receives).
	ProcToken(token *Token, line int, err error) error

	// ProcStruct is called each time the parser encounters a structure opening
	// element (e.g. <doc>). In case parsing produces an error, it is passed
	// to the function without stopping the whole process.
	// In case the function returns an error, the parser stops.
	ProcStruct(strc *Structure, line int, err error) error

	// ProcStructClose is called each time the parser encounters a structure
	// closing element (e.g. </doc>). In case parsing produces an error,
	// it is passed to the function without stopping the whole process.
	// In case the function returns an error, the parser stops.
	ProcStructClose(strc *StructureClose, line int, err error) error
}

LineProcessor describes an object able to handle Vertigo's parsing events.

type ParserConf

type ParserConf struct {

	// Source vertical file (either a plain text file or a gzip one)
	InputFilePath string `json:"inputFilePath"`

	Encoding string `json:"encoding"`

	FilterArgs [][][]string `json:"filterArgs"`

	StructAttrAccumulator string `json:"structAttrAccumulator"`

	LogProgressEachNth int `json:"logProgressEachNth"`
}

ParserConf contains configuration parameters for the vertical file parser.

func LoadConfig

func LoadConfig(path string) *ParserConf

LoadConfig loads the configuration from a JSON file. In case of an error the program exits with panic.
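
To show what such a JSON file might contain, the sketch below decodes a sample document into a local struct that copies the field names and JSON tags of ParserConf. It does not call LoadConfig or import the library; parserConf, decodeConf and the sample values (in particular the logProgressEachNth number) are illustrative assumptions. The filter reuses the example given for MatchesFilter.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parserConf mirrors the fields and JSON tags of vertigo.ParserConf so
// that the example runs without the library.
type parserConf struct {
	InputFilePath         string       `json:"inputFilePath"`
	Encoding              string       `json:"encoding"`
	FilterArgs            [][][]string `json:"filterArgs"`
	StructAttrAccumulator string       `json:"structAttrAccumulator"`
	LogProgressEachNth    int          `json:"logProgressEachNth"`
}

// sampleConf uses illustrative values only.
const sampleConf = `{
	"inputFilePath": "/path/to/a/vertical/file",
	"encoding": "utf-8",
	"filterArgs": [
		[["div.author", "John Doe"]],
		[["div.title", "Unknown"], ["div.title", "Superunknown"]]
	],
	"structAttrAccumulator": "comb",
	"logProgressEachNth": 1000000
}`

// decodeConf is a hypothetical helper that returns an error instead of
// panicking, which is easier to handle in an example.
func decodeConf(data []byte) (*parserConf, error) {
	var conf parserConf
	if err := json.Unmarshal(data, &conf); err != nil {
		return nil, err
	}
	return &conf, nil
}

func main() {
	conf, err := decodeConf([]byte(sampleConf))
	if err != nil {
		fmt.Println("failed to decode configuration:", err)
		return
	}
	fmt.Printf("%+v\n", *conf)
}
```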

type Structure

type Structure struct {

	// Name defines a name of a structure tag (e.g. 'doc' for <doc> element)
	Name string

	// Attrs stores structural attributes of the tag
	// (e.g. <doc id="foo"> produces map with a single key 'id' and value 'foo')
	Attrs map[string]string

	// IsEmpty defines a possible self-closing tag
	// if true then the structure is self-closing
	// (i.e. there is no 'close element' event following)
	IsEmpty bool
}

Structure represents a structure opening tag.

type StructureClose

type StructureClose struct {
	Name string
}

StructureClose represents a structure closing tag.

type Token

type Token struct {
	Idx         int
	Word        string
	Attrs       []string
	StructAttrs map[string]string
}

Token is a representation of a parsed line. It combines the positional attributes of the line with the structural attributes accumulated so far.

func (*Token) MatchesFilter

func (t *Token) MatchesFilter(filterCNF [][][]string) bool

MatchesFilter tests whether the token matches a filter in conjunctive normal form, encoded as a three-dimensional list. For example, the expression

div.author = 'John Doe' AND (div.title = 'Unknown' OR div.title = 'Superunknown')

is encoded as:

{ {{"div.author" "John Doe"}} {{"div.title" "Unknown"} {"div.title" "Superunknown"}} }
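
The encoding can be read as: the outer list is a conjunction (AND), each middle list is a disjunction (OR), and each innermost list is a key/value pair. The sketch below evaluates such a filter against a plain map of structural attributes. It is an assumption-based illustration rather than the library's implementation: matchesCNF is a hypothetical function, values are compared by exact string equality, and an empty filter is treated as matching everything.

```go
package main

import "fmt"

// matchesCNF is a hypothetical evaluator for a filter in conjunctive
// normal form: every clause must contain at least one {key, value} pair
// that is present in attrs.
func matchesCNF(attrs map[string]string, filterCNF [][][]string) bool {
	for _, clause := range filterCNF {
		matched := false
		for _, pair := range clause {
			if len(pair) != 2 {
				continue // ignore malformed pairs in this sketch
			}
			if v, ok := attrs[pair[0]]; ok && v == pair[1] {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	filter := [][][]string{
		{{"div.author", "John Doe"}},
		{{"div.title", "Unknown"}, {"div.title", "Superunknown"}},
	}
	attrs := map[string]string{
		"div.author": "John Doe",
		"div.title":  "Superunknown",
	}
	fmt.Println(matchesCNF(attrs, filter))
}
```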

func (*Token) PosAttrByIndex

func (t *Token) PosAttrByIndex(idx int) string

PosAttrByIndex returns a positional attribute based on its original index in the vertical file.

func (*Token) WordLC

func (t *Token) WordLC() string

WordLC returns the 'word' positional attribute converted to lowercase
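
One plausible reading of these two helpers is sketched below with a local stand-in for Token. It assumes that index 0 refers to the word column and that index n (n > 0) refers to Attrs[n-1], matching the column order of a vertical file line; the library's actual behavior, especially for out-of-range indices, may differ. posAttrByIndex and wordLC are hypothetical free functions, not the library's methods.

```go
package main

import (
	"fmt"
	"strings"
)

// Token is a local stand-in for vertigo.Token (only the fields used here).
type Token struct {
	Word  string
	Attrs []string
}

// posAttrByIndex assumes column 0 is the word and column n is Attrs[n-1].
// Returning an empty string for an out-of-range index is a choice made
// for this sketch.
func posAttrByIndex(t *Token, idx int) string {
	if idx == 0 {
		return t.Word
	}
	if idx > 0 && idx-1 < len(t.Attrs) {
		return t.Attrs[idx-1]
	}
	return ""
}

// wordLC returns the word converted to lowercase.
func wordLC(t *Token) string {
	return strings.ToLower(t.Word)
}

func main() {
	// Corresponds to the line "Restaurant[tab]Restaurant[tab]NP".
	tok := &Token{Word: "Restaurant", Attrs: []string{"Restaurant", "NP"}}
	fmt.Println(posAttrByIndex(tok, 0), posAttrByIndex(tok, 2), wordLC(tok))
}
```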
