Documentation ¶
Index ¶
- Variables
- type AccumItem
- type AttrAccumulator
- type LineFilter
- type PassAllFilter
- type Status
- type TTExtractor
- func (tte *TTExtractor) GetColCounts() map[string]*ptcount.NgramCounter
- func (tte *TTExtractor) GetNumTokens() int
- func (tte *TTExtractor) ProcStruct(st *vertigo.Structure, line int, err error) error
- func (tte *TTExtractor) ProcStructClose(st *vertigo.StructureClose, line int, err error) error
- func (tte *TTExtractor) ProcToken(tk *vertigo.Token, line int, err error) error
- func (tte *TTExtractor) Run(conf *vertigo.ParserConf) error
- func (tte *TTExtractor) WordDict() *ptcount.WordDict
Constants ¶
This section is empty.
Variables ¶
var (
ErrorTooManyParsingErrors = errors.New("too many parsing errors")
)
Functions ¶
This section is empty.
Types ¶
type AttrAccumulator ¶
type AttrAccumulator interface {
	ForEachAttr(fn func(structure string, attr string, val string) bool)
	// contains filtered or unexported methods
}
AttrAccumulator specifies an object that collects the current structural attribute information as tokens are processed. Under the hood, you can think of it as a non-strict, generalized stack.
type LineFilter ¶
type LineFilter interface {
Apply(tk *vertigo.Token, attrAcc AttrAccumulator) bool
}
LineFilter allows selecting only tokens with specific accumulated structure information (e.g. doc.type='scifi' AND text.type!='meta').
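The following is a minimal sketch of a filter for the doc.type='scifi' AND text.type!='meta' example. It is not taken from the package: Token, the local AttrAccumulator interface, mapAccumulator and SciFiFilter are hypothetical stand-ins so the example compiles on its own (the real AttrAccumulator also has unexported methods, and the real token type is vertigo.Token). It also assumes that returning true from the ForEachAttr callback means "continue iterating", which the documentation does not state.

```go
package main

import "fmt"

// Token is a hypothetical stand-in for vertigo.Token.
type Token struct {
	Word string
}

// AttrAccumulator is a local stand-in exposing only the exported method
// shown in the documentation.
type AttrAccumulator interface {
	ForEachAttr(fn func(structure string, attr string, val string) bool)
}

// mapAccumulator is a toy accumulator: structure name -> attribute -> value.
type mapAccumulator map[string]map[string]string

// ForEachAttr calls fn for every stored attribute and stops early
// when fn returns false (an assumption about the callback contract).
func (m mapAccumulator) ForEachAttr(fn func(structure, attr, val string) bool) {
	for st, attrs := range m {
		for attr, val := range attrs {
			if !fn(st, attr, val) {
				return
			}
		}
	}
}

// SciFiFilter accepts tokens where doc.type == "scifi" and text.type != "meta".
type SciFiFilter struct{}

// Apply inspects the accumulated attributes and decides whether to keep the token.
func (f *SciFiFilter) Apply(tk *Token, attrAcc AttrAccumulator) bool {
	isSciFi := false
	isMeta := false
	attrAcc.ForEachAttr(func(structure, attr, val string) bool {
		switch {
		case structure == "doc" && attr == "type" && val == "scifi":
			isSciFi = true
		case structure == "text" && attr == "type" && val == "meta":
			isMeta = true
		}
		return true // assumed to mean "keep iterating"
	})
	return isSciFi && !isMeta
}

func main() {
	f := &SciFiFilter{}
	acc := mapAccumulator{
		"doc":  {"type": "scifi"},
		"text": {"type": "novel"},
	}
	fmt.Println(f.Apply(&Token{Word: "robot"}, acc))
}
```

A filter meant to be loaded as a plugin would implement Apply against the package's own types instead of these stand-ins.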
func LoadCustomFilter ¶
func LoadCustomFilter(libPath string, fn string) (LineFilter, error)
LoadCustomFilter loads a compiled .so plugin from the given path and selects the function identified by fn. If libPath does not point to an existing file, the function treats it as a path suffix and tries other locations (working directory, /usr/local/lib/gloomy).
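The fallback lookup described above can be pictured roughly as follows. This is only a sketch under assumptions: the helper names candidatePaths and resolvePluginPath are hypothetical, and the order in which the locations are tried is a guess, since the documentation lists them without stating an order. Loading the resolved file is left out here.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// candidatePaths lists the locations to try for a plugin path, in the
// assumed order: the path as given, then relative to the working
// directory, then under /usr/local/lib/gloomy.
func candidatePaths(libPath, workDir string) []string {
	return []string{
		libPath,
		filepath.Join(workDir, libPath),
		filepath.Join("/usr/local/lib/gloomy", libPath),
	}
}

// resolvePluginPath returns the first candidate that exists as a regular file.
func resolvePluginPath(libPath, workDir string) (string, error) {
	for _, p := range candidatePaths(libPath, workDir) {
		if info, err := os.Stat(p); err == nil && info.Mode().IsRegular() {
			return p, nil
		}
	}
	return "", fmt.Errorf("plugin %q not found in any known location", libPath)
}

func main() {
	wd, err := os.Getwd()
	if err != nil {
		fmt.Println("cannot determine working directory:", err)
		return
	}
	// "myfilter.so" is a placeholder file name used for illustration only.
	path, err := resolvePluginPath("myfilter.so", wd)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("would load plugin from", path)
}
```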
type PassAllFilter ¶
type PassAllFilter struct{}
PassAllFilter is the default filter which returns true for any struct-attr values.
func (*PassAllFilter) Apply ¶
func (df *PassAllFilter) Apply(tk *vertigo.Token, attrAcc AttrAccumulator) bool
Apply tests the current state of the attribute accumulator against the filter.
type Status ¶
type Status struct {
	Datetime       time.Time
	File           string
	ProcessedAtoms int
	ProcessedLines int
	Error          error
}
Status stores some basic information about vertical file processing.
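As an illustration of how Status values might be consumed, here is a small sketch that reads them from a channel and remembers the last update and the first reported error. The Status type is redeclared locally so the example is self-contained; drainStatus and errDemo are hypothetical names, and the sketch assumes the sender closes the channel when processing ends, which the documentation does not confirm.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Status mirrors the documented struct so the sketch compiles on its own.
type Status struct {
	Datetime       time.Time
	File           string
	ProcessedAtoms int
	ProcessedLines int
	Error          error
}

// errDemo is a made-up error used only to exercise the example.
var errDemo = errors.New("demo parsing error")

// drainStatus reads updates until the channel is closed and returns the
// last status seen together with the first error reported, if any.
func drainStatus(ch <-chan Status) (last Status, firstErr error) {
	for st := range ch {
		last = st
		if st.Error != nil && firstErr == nil {
			firstErr = st.Error
		}
	}
	return last, firstErr
}

func main() {
	ch := make(chan Status, 2)
	ch <- Status{Datetime: time.Now(), File: "corpus.vert", ProcessedAtoms: 100, ProcessedLines: 120}
	ch <- Status{Datetime: time.Now(), File: "corpus.vert", ProcessedAtoms: 200, ProcessedLines: 240, Error: errDemo}
	close(ch)

	last, err := drainStatus(ch)
	fmt.Printf("file=%s lines=%d atoms=%d err=%v\n", last.File, last.ProcessedLines, last.ProcessedAtoms, err)
}
```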
type TTExtractor ¶
type TTExtractor struct {
// contains filtered or unexported fields
}
TTExtractor handles writing parsed data to a sqlite3 database. Parsed values are received passively by implementing vertigo.LineProcessor.
func NewTTExtractor ¶
func NewTTExtractor(
	database db.Writer,
	conf *cnf.VTEConf,
	colgenFn colgen.AlignedColGenFn,
	statusChan chan Status,
	stopChan <-chan os.Signal,
) (*TTExtractor, error)
NewTTExtractor is a factory function that creates a properly initialized TTExtractor.
func (*TTExtractor) GetColCounts ¶
func (tte *TTExtractor) GetColCounts() map[string]*ptcount.NgramCounter
func (*TTExtractor) GetNumTokens ¶
func (tte *TTExtractor) GetNumTokens() int
func (*TTExtractor) ProcStruct ¶
func (tte *TTExtractor) ProcStruct(st *vertigo.Structure, line int, err error) error
ProcStruct is a part of vertigo.LineProcessor implementation. It is called by Vertigo parser when an opening structure tag is encountered.
func (*TTExtractor) ProcStructClose ¶
func (tte *TTExtractor) ProcStructClose(st *vertigo.StructureClose, line int, err error) error
ProcStructClose is a part of vertigo.LineProcessor implementation. It is called by Vertigo parser when a closing structure tag is encountered.
func (*TTExtractor) ProcToken ¶
func (tte *TTExtractor) ProcToken(tk *vertigo.Token, line int, err error) error
ProcToken is a part of vertigo.LineProcessor implementation. It is called by Vertigo parser when a token line is encountered.
func (*TTExtractor) Run ¶
func (tte *TTExtractor) Run(conf *vertigo.ParserConf) error
Run starts the parsing and metadata extraction process. The method expects a proper database schema to be ready (see database.go for details). The whole process runs within a transaction, which makes sqlite3 inserts a few orders of magnitude faster.
func (*TTExtractor) WordDict ¶
func (tte *TTExtractor) WordDict() *ptcount.WordDict