vertigo

package module

Version: v5.1.1 (latest) Published: Apr 16, 2024 License: Apache-2.0 Imports: 12 Imported by: 9

Repository

github.com/tomachalek/vertigo

README ¶

vertigo

Vertigo is a parser for so-called corpus vertical files. These are essentially SGML files in which structural information is expressed by custom tags (each tag on its own line) and token information (again, each token on its own line) is expressed as tab-separated values (e.g. word[tab]lemma[tab]tag). The parser is written in Go; the latest major version is v5.

An example of a vertical file looks like this:

<doc id="adams-restaurant_at_the" lang="en" version="00" wordcount="54066">
<div author="Adams, Douglas" title="The Restaurant at the End of the Universe" group="Core" publisher="" pubplace="" pubyear="1980" pubmonth="" origyear="" isbn="" txtype="fiction" comment="" original="Yes" srclang="en" translator="" transsex="" authsex="M" lang_var="en-GB" id="en:adams-restaurant_na_ko:0" wordcount="54066">
<p id="en:adams-restaurant_na_ko:0:1">
<s id="en:adams-restaurant_na_ko:0:1:1">
The     the     DT
Restaurant      Restaurant      NP
at      at      IN
the     the     DT
End     end     NN
of      of      IN
the     the     DT
Universe        universe        NN
</s>
</p>
<p id="en:adams-restaurant_na_ko:0:2">
<s id="en:adams-restaurant_na_ko:0:2:1">
There   there   EX
is      be      VBZ
a       a       DT
theory  theory  NN
...
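To make the line format concrete, the following minimal sketch splits lines like those above into structure lines and token columns. It is not Vertigo's implementation: the `token` type and the functions `isStructLine` and `parseTokenLine` are hypothetical names introduced for illustration, and the tag detection is deliberately simplistic.

```go
package main

import (
	"fmt"
	"strings"
)

// token is a hypothetical stand-in for a parsed token line;
// it is not Vertigo's own Token type.
type token struct {
	Word  string   // first column of the line
	Attrs []string // remaining columns (e.g. lemma, tag)
}

// isStructLine reports whether a line looks like a structural tag.
// Simplification: any line wrapped in angle brackets counts as a tag.
func isStructLine(line string) bool {
	return strings.HasPrefix(line, "<") && strings.HasSuffix(line, ">")
}

// parseTokenLine splits a tab-separated token line into its columns.
func parseTokenLine(line string) token {
	cols := strings.Split(line, "\t")
	return token{Word: cols[0], Attrs: cols[1:]}
}

func main() {
	lines := []string{
		`<s id="en:adams-restaurant_na_ko:0:1:1">`,
		"The\tthe\tDT",
		"Restaurant\tRestaurant\tNP",
		"</s>",
	}
	for _, line := range lines {
		if isStructLine(line) {
			fmt.Println("structure:", line)
			continue
		}
		tok := parseTokenLine(line)
		fmt.Printf("token: word=%q attrs=%q\n", tok.Word, tok.Attrs)
	}
}
```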

Vertigo parses an input file and builds a result (via the provided LineProcessor) at the same time, using two goroutines arranged in a producer-consumer pattern. From the outside, however, parsing is synchronous: once the ParseVerticalFile call returns, parsing is complete and any additional goroutines have finished.

The LineProcessor interface is the following:

type LineProcessor interface {
	ProcToken(token *Token, line int, err error) error
	ProcStruct(strc *Structure, line int, err error) error
	ProcStructClose(strc *StructureClose, line int, err error) error
}

An example of how to configure and run the parser (with some fake functions inside) may look like this:

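package main

import (
	"log"

	// Assumption: under Go modules, major version 5 is imported with the /v5 suffix.
	"github.com/tomachalek/vertigo/v5"
)

// MyProcessor implements vertigo.LineProcessor.
type MyProcessor struct {
}

// ProcToken handles a single token line.
func (mp *MyProcessor) ProcToken(token *vertigo.Token, line int, err error) error {
	if err != nil {
		return err
	}
	useWordPosAttr(token.Word)
	// Attrs holds the positional attributes that follow the word.
	if len(token.Attrs) > 0 {
		useFirstNonWordPosAttr(token.Attrs[0])
	}
	return nil
}

// ProcStruct handles a structure opening tag.
func (mp *MyProcessor) ProcStruct(strc *vertigo.Structure, line int, err error) error {
	if err != nil {
		return err
	}
	structNameIs(strc.Name)
	for sattr, sattrVal := range strc.Attrs {
		useStructAttr(sattr, sattrVal)
	}
	return nil
}

// ProcStructClose handles a structure closing tag.
func (mp *MyProcessor) ProcStructClose(strc *vertigo.StructureClose, line int, err error) error {
	return err
}

// Placeholder helpers standing in for application logic.
func useWordPosAttr(word string)         {}
func useFirstNonWordPosAttr(attr string) {}
func structNameIs(name string)           {}
func useStructAttr(name, value string)   {}

func main() {
	pc := &vertigo.ParserConf{
		InputFilePath:         "/path/to/a/vertical/file",
		Encoding:              "utf-8",
		StructAttrAccumulator: "comb",
	}
	// The methods use pointer receivers, so a pointer must be passed.
	proc := &MyProcessor{}
	err := vertigo.ParseVerticalFile(pc, proc)
	if err != nil {
		log.Fatal(err)
	}
}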

Documentation ¶

Index ¶

Constants
func GetCharmapByName(name string) (*charmap.Charmap, error)
func ParseVerticalFile(conf *ParserConf, lproc LineProcessor) error
func ParseVerticalFileNoGoRo(conf *ParserConf, lproc LineProcessor)
func SupportedCharsets() []string
type LineProcessor
type ParserConf
- func LoadConfig(path string) *ParserConf
type Structure
type StructureClose
type Token

Constants ¶

const (
	LineTypeToken   = "token"
	LineTypeStruct  = "struct"
	LineTypeIgnored = "ignored"

	AccumulatorTypeStack = "stack"
	AccumulatorTypeComb  = "comb"
	AccumulatorTypeNil   = "nil"

	CharsetISO8859_1   = "iso-8859-1"
	CharsetISO8859_2   = "iso-8859-2"
	CharsetISO8859_3   = "iso-8859-3"
	CharsetISO8859_4   = "iso-8859-4"
	CharsetISO8859_5   = "iso-8859-5"
	CharsetISO8859_6   = "iso-8859-6"
	CharsetISO8859_7   = "iso-8859-7"
	CharsetISO8859_8   = "iso-8859-8"
	CharsetWindows1250 = "windows-1250"
	CharsetWindows1251 = "windows-1251"
	CharsetWindows1252 = "windows-1252"
	CharsetWindows1253 = "windows-1253"
	CharsetWindows1254 = "windows-1254"
	CharsetWindows1255 = "windows-1255"
	CharsetWindows1256 = "windows-1256"
	CharsetWindows1257 = "windows-1257"
	CharsetWindows1258 = "windows-1258"
	CharsetUTF_8       = "utf-8"
)

Variables ¶

This section is empty.

Functions ¶

func GetCharmapByName ¶

func GetCharmapByName(name string) (*charmap.Charmap, error)

GetCharmapByName returns the appropriate Charmap instance for the provided encoding name. Name matching is case-insensitive (e.g. utf-8 is the same as UTF-8). The number of supported charsets is

func ParseVerticalFile ¶

func ParseVerticalFile(conf *ParserConf, lproc LineProcessor) error

ParseVerticalFile processes a corpus vertical file line by line and applies a custom LineProcessor to each line. The processing is parallelized in the sense that reading the file into lines and processing those lines run in different goroutines. The function as a whole still behaves synchronously, i.e. once it returns, the processing is finished.

func ParseVerticalFileNoGoRo ¶

func ParseVerticalFileNoGoRo(conf *ParserConf, lproc LineProcessor)

ParseVerticalFileNoGoRo is just for benchmarking purposes

func SupportedCharsets ¶

func SupportedCharsets() []string

SupportedCharsets returns a list of names of character sets.

Types ¶

type LineProcessor ¶

type LineProcessor interface {

	// ProcToken is called each time the parser encounters a positional
	// attribute. In case parsing produces an error, it is passed to the
	// function without stopping the whole process.
	// In case the function returns an error, the parser stops
	// (in the simplest case it can even be the error it receives)
	ProcToken(token *Token, line int, err error) error

	// ProcStruct is called each time parser encounters a structure opening
	// element (e.g. <doc>). In case parsing produces an error, it is passed
	// to the function without stopping the whole process.
	// In case the function returns an error, the parser stops.
	ProcStruct(strc *Structure, line int, err error) error

	// ProcStructClose is called each time parser encounters a structure
	// closing element (e.g. </doc>). In case parsing produces an error,
	// it is passed to the function without stopping the whole process.
	// In case the function returns an error, the parser stops.
	ProcStructClose(strc *StructureClose, line int, err error) error
}

LineProcessor describes an object able to handle Vertigo's parsing events.

type ParserConf ¶

type ParserConf struct {

	// Source vertical file (either a plain text file or a gzip one)
	InputFilePath string `json:"inputFilePath"`

	Encoding string `json:"encoding"`

	FilterArgs [][][]string `json:"filterArgs"`

	StructAttrAccumulator string `json:"structAttrAccumulator"`

	LogProgressEachNth int `json:"logProgressEachNth"`
}

ParserConf contains configuration parameters for vertical file parser

func LoadConfig ¶

func LoadConfig(path string) *ParserConf

LoadConfig loads the configuration from a JSON file. In case of an error the program exits with panic.
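As a rough illustration of what such a JSON file might contain, the sketch below decodes a sample document into a local struct that copies the fields and JSON tags of ParserConf. The type `parserConf`, the function `decodeConf` and all sample values are hypothetical. It is also an assumption that `filterArgs` uses the same conjunctive-normal-form encoding that MatchesFilter accepts. Unlike LoadConfig, this sketch returns an error instead of panicking.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parserConf is a local copy of the ParserConf fields and JSON tags,
// used here only so the example runs without the library.
type parserConf struct {
	InputFilePath         string       `json:"inputFilePath"`
	Encoding              string       `json:"encoding"`
	FilterArgs            [][][]string `json:"filterArgs"`
	StructAttrAccumulator string       `json:"structAttrAccumulator"`
	LogProgressEachNth    int          `json:"logProgressEachNth"`
}

// sampleConf is an invented configuration document; all values are examples.
const sampleConf = `{
	"inputFilePath": "/path/to/a/vertical/file",
	"encoding": "utf-8",
	"filterArgs": [
		[["div.author", "John Doe"]],
		[["div.title", "Unknown"], ["div.title", "Superunknown"]]
	],
	"structAttrAccumulator": "comb",
	"logProgressEachNth": 1000000
}`

// decodeConf parses a JSON document into a parserConf value.
func decodeConf(data []byte) (*parserConf, error) {
	var conf parserConf
	if err := json.Unmarshal(data, &conf); err != nil {
		return nil, err
	}
	return &conf, nil
}

func main() {
	conf, err := decodeConf([]byte(sampleConf))
	if err != nil {
		fmt.Println("cannot decode configuration:", err)
		return
	}
	fmt.Println("input:", conf.InputFilePath)
	fmt.Println("encoding:", conf.Encoding)
	fmt.Println("filter clauses:", len(conf.FilterArgs))
}
```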

type Structure ¶

type Structure struct {

	// Name defines a name of a structure tag (e.g. 'doc' for <doc> element)
	Name string

	// Attrs store structural attributes of the tag
	// (e.g. <doc id="foo"> produces map with a single key 'id' and value 'foo')
	Attrs map[string]string

	// IsEmpty defines a possible self-closing tag
	// if true then the structure is self-closing
	// (i.e. there is no 'close element' event following)
	IsEmpty bool
}

Structure represents a structure opening tag

type StructureClose ¶

type StructureClose struct {
	Name string
}

StructureClose represents a structure closing tag

type Token ¶

type Token struct {
	Idx         int
	Word        string
	Attrs       []string
	StructAttrs map[string]string
}

Token is a representation of a parsed line. It combines the positional attributes with the structural attributes accumulated so far.

func (*Token) MatchesFilter ¶

func (t *Token) MatchesFilter(filterCNF [][][]string) bool

MatchesFilter tests whether a provided token matches a filter in conjunctive normal form encoded as a 3-d list. E.g.:

div.author = 'John Doe' AND (div.title = 'Unknown' OR div.title = 'Superunknown')

encodes as:

{ {{"div.author" "John Doe"}} {{"div.title" "Unknown"} {"div.title" "Superunknown"}} }
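The evaluation of such a filter can be sketched in a few lines. The function `matchesCNF` below is a hypothetical, self-contained illustration working on a plain attribute map and is not the library's MatchesFilter. How it treats an empty filter (matches everything) and an empty clause (matches nothing) are choices made for this sketch only.

```go
package main

import "fmt"

// matchesCNF reports whether attrs satisfies a filter in conjunctive normal
// form: the outer slice is AND-ed, each clause inside it is OR-ed, and each
// innermost slice is a {key, value} pair compared for equality.
func matchesCNF(attrs map[string]string, filterCNF [][][]string) bool {
	for _, clause := range filterCNF {
		clauseOK := false
		for _, lit := range clause {
			if len(lit) == 2 && attrs[lit[0]] == lit[1] {
				clauseOK = true
				break
			}
		}
		if !clauseOK {
			return false // one failed clause fails the whole conjunction
		}
	}
	return true
}

func main() {
	filter := [][][]string{
		{{"div.author", "John Doe"}},
		{{"div.title", "Unknown"}, {"div.title", "Superunknown"}},
	}
	attrs := map[string]string{
		"div.author": "John Doe",
		"div.title":  "Superunknown",
	}
	fmt.Println("matches:", matchesCNF(attrs, filter))
}
```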

func (*Token) PosAttrByIndex ¶

func (t *Token) PosAttrByIndex(idx int) string

PosAttrByIndex returns a positional attribute based on its original index in vertical file

func (*Token) WordLC ¶

func (t *Token) WordLC() string

WordLC returns the 'word' positional attribute converted to lowercase
