ntto

v0.4.2
Published: Aug 5, 2019 | License: MIT | Imports: 7 | Imported by: 0

README

ntto

Minimal n-triples toolkit. It can:

  • shrink n-triples by applying namespace abbreviations (given some rules)
  • convert n-triples to line delimited JSON (.ldj)

To list the abbreviation rules, run:

$ ntto -d

To create an abbreviated NT file from an NT file, run:

$ ntto -o OUTPUT.NT -a FILE.nt

To create an abbreviated JSON file from an NT file, run:

$ ntto -a -j FILE.nt > OUTPUT.LDJ

To create an abbreviated JSON file from an NT file while ignoring conversion errors, run:

$ ntto -a -j -i FILE.nt > OUTPUT.LDJ

To create an abbreviated JSON file from an NT file while ignoring conversion errors and using a custom RULES file, run:

$ ntto -r RULES -a -j -i FILE.nt > OUTPUT.LDJ

Installation

RPM and DEB packages can be found under releases.

With a proper Go setup, running

$ go get github.com/miku/ntto/cmd/ntto

should work as well.

Usage

$ ntto
Usage: ntto [OPTIONS] FILE
  -a    abbreviate n-triples using rules
  -c    dump constructed sed command and exit
  -cpuprofile string
        write cpu profile to file
  -d    dump rules and exit
  -i    ignore conversion errors
  -j    convert nt to json
  -n string
        string to indicate empty string replacement (default "<NULL>")
  -o string
        output file to write result to
  -r string
        path to rules file, use built-in if none given
  -v    prints current version and exits
  -w int
        parallelism measure (default 4)

Mode of operation

ntto takes a RULES file (or falls back to its built-in rules) to abbreviate common prefixes in an n-triples file. ntto does not perform the replacements itself, but delegates them to external programs such as replace or perl.

With the help of replace, ntto can shorten up to 3M lines per second. The resulting file can be as small as 50% of the size of the original file.
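To illustrate what abbreviating a single line amounts to, here is a minimal in-process sketch in Go. It is not how ntto works internally (ntto delegates to external programs such as replace or perl). The `rule` type, the `abbreviate` helper and the `shortcut:` output form are assumptions made for this illustration only.

```go
package main

import (
	"fmt"
	"strings"
)

// rule pairs a shortcut with the IRI prefix it stands for.
// Hypothetical type for this sketch, modeled on the RULES file columns.
type rule struct {
	Shortcut string
	Prefix   string
}

// abbreviate replaces every known prefix in line with "shortcut:".
// The "shortcut:" form is an assumption, not confirmed ntto output.
func abbreviate(rules []rule, line string) string {
	pairs := make([]string, 0, 2*len(rules))
	for _, r := range rules {
		pairs = append(pairs, r.Prefix, r.Shortcut+":")
	}
	// strings.NewReplacer compares old strings in argument order,
	// so a longer prefix must be listed before a shorter one it contains.
	return strings.NewReplacer(pairs...).Replace(line)
}

func main() {
	rules := []rule{
		{"dbp", "http://dbpedia.org/resource/"},
		{"rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#"},
		{"schema", "http://schema.org/"},
	}
	line := "<http://dbpedia.org/resource/Berlin> " +
		"<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> " +
		"<http://schema.org/City> ."
	fmt.Println(abbreviate(rules, line))
	// prints: <dbp:Berlin> <rdf:type> <schema:City> .
}
```

When one prefix is contained in another (for example the two `http://d-nb.info/standards/vocab/gnd/...` entries in the example rules file), the longer one has to come first for this sketch to pick it.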

Example rules file

$ cat RULES
# example rules file
dbp             http://dbpedia.org/resource/
gnd             http://d-nb.info/gnd/
dnbes           http://d-nb.info/standards/elementset/gnd#
dnbac           http://d-nb.info/standards/vocab/gnd/geographic-area-code#
dnbv            http://d-nb.info/standards/vocab/gnd/

viaf            http://viaf.org/viaf/
frbr            http://rdvocab.info/uri/schema/FRBRentitiesRDA/
rdgr            http://rdvocab.info/ElementsGr2/

# empty lines are ignored, as are comments

foaf            http://xmlns.com/foaf/0.1/
rdf             http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs            http://www.w3.org/2000/01/rdf-schema#
schema          http://schema.org/
dc              http://purl.org/dc/elements/1.1/
dcterms         http://purl.org/dc/terms/
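The format above (shortcut, whitespace, prefix, with `#` comments and blank lines ignored) can be read with a few lines of Go. This is a sketch under assumptions, not the package's `ParseRules` implementation: the `rule` type and `parseRules` function are hypothetical, and rejecting lines that do not have exactly two fields is a choice made here.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// rule is a hypothetical type for this sketch: one line of a RULES file.
type rule struct {
	Shortcut string
	Prefix   string
}

// parseRules reads "shortcut prefix" pairs, skipping blank lines and
// lines starting with '#'. Treating other malformed lines as errors is
// an assumption of this sketch.
func parseRules(s string) ([]rule, error) {
	var rules []rule
	scanner := bufio.NewScanner(strings.NewReader(s))
	for n := 1; scanner.Scan(); n++ {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) != 2 {
			return nil, fmt.Errorf("line %d: expected 2 fields, got %d", n, len(fields))
		}
		rules = append(rules, rule{Shortcut: fields[0], Prefix: fields[1]})
	}
	return rules, scanner.Err()
}

func main() {
	input := `# example rules file
dbp             http://dbpedia.org/resource/

foaf            http://xmlns.com/foaf/0.1/
`
	rules, err := parseRules(input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range rules {
		fmt.Printf("%s => %s\n", r.Shortcut, r.Prefix)
	}
}
```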

Performance data point

$ wc -l file.nt
114171541

$ time ntto -o output.nt -a file.nt
real    1m51.202s
user    1m3.626s
sys     0m13.602s

$ time ntto -a -j file.nt > output.ldj
real    15m47.872s
user    16m19.516s
sys      2m3.013s

Sometimes less is more: here, lowering the parallelism from the default 4 to 2 shortens the run. YMMV:

$ time ntto -w 2 -a -j file.nt > output.ldj
real    12m3.619s
user    15m17.422s
sys     2m14.430s

Documentation

Constants

const AppVersion = "0.4.2"

Variables

var DefaultRules = `` /* 10202-byte string literal not displayed */

Functions

func DumpRules

func DumpRules(rules []Rule) string

func PartitionRules

func PartitionRules(rules []Rule, count int) [][]Rule

PartitionRules divides the rules slice into `count` partitions.
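One plausible way to split a slice into `count` partitions is shown below. This is a sketch only: the `rule` type and `partition` function are hypothetical stand-ins, and the real PartitionRules may distribute rules differently (for example round-robin instead of contiguous chunks).

```go
package main

import "fmt"

// rule is a hypothetical stand-in for the package's Rule type.
type rule struct {
	Prefix   string
	Shortcut string
}

// partition splits rules into exactly count contiguous chunks of
// near-equal size. Chunks may be empty if len(rules) < count.
// A count below 1 is treated as 1 (an assumption of this sketch).
func partition(rules []rule, count int) [][]rule {
	if count < 1 {
		count = 1
	}
	parts := make([][]rule, 0, count)
	for i := 0; i < count; i++ {
		start := i * len(rules) / count
		end := (i + 1) * len(rules) / count
		parts = append(parts, rules[start:end])
	}
	return parts
}

func main() {
	rules := []rule{
		{"http://dbpedia.org/resource/", "dbp"},
		{"http://d-nb.info/gnd/", "gnd"},
		{"http://viaf.org/viaf/", "viaf"},
		{"http://xmlns.com/foaf/0.1/", "foaf"},
		{"http://schema.org/", "schema"},
	}
	for i, p := range partition(rules, 2) {
		fmt.Printf("partition %d: %d rules\n", i, len(p))
	}
}
```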

func Replacify

func Replacify(rules []Rule, in string) string

func ReplacifyNull

func ReplacifyNull(rules []Rule, in, null string) string

func Sedify

func Sedify(rules []Rule, p int, in string) string

Sedify turns the rules into a sed command with `in` as the input filename.

func SedifyNull

func SedifyNull(rules []Rule, p int, in, null string) string

SedifyNull turns the rules into a sed command with `in` as the input filename; `null` is presumably the placeholder string for empty replacements (compare the -n flag).

Types

type Rule

type Rule struct {
	Prefix   string
	Shortcut string
}

func ParseRules

func ParseRules(s string) ([]Rule, error)

ParseRules takes a string, parses the abbreviation rules in it and returns them as a slice.

func (Rule) String

func (r Rule) String() string

type Triple

type Triple struct {
	XMLName   xml.Name `json:"-" xml:"t"`
	Subject   string   `json:"s" xml:"s"`
	Predicate string   `json:"p" xml:"p"`
	Object    string   `json:"o" xml:"o"`
}

func ParseNTriple

func ParseNTriple(line string) (*Triple, error)

ParseNTriple is a simplistic N-Triples parser for a single line.

Directories

Path Synopsis
cmd
