package preparation
v0.0.0-...-5fcd0f1
Published: May 21, 2021 · License: GPL-3.0 · Imports: 11 · Imported by: 0

README

Preparation Module

The Preparation Module prepares datasets in N-Triples format.

It can split datasets into multiple files, or filter a dataset and output only the entries that pass the filter. For each operation a method can be chosen that alters how the split or filter is executed.

Splitting Methods

  • 1-in-n: Used for splitting datasets into training and test sets. It writes every Nth line to a new file.

  • by-type: Takes a dataset and generates 3 files. One for all items, one for all properties, and one for entries that are neither of the two. Note that this splitter assumes that all subjects come in contiguous lines. In other words, the dataset has to be grouped by the subject column.

  • by-prefix: Takes a dataset and generates 3 files. The split is made according to the prefix of the subject.

Filtering Methods

  • for-schematree: Filters out entries that are not useful for the schematree build process.

  • for-glossary: Filters out entries that are not useful for the glossary build process.

  • for-evaluation: Filters out entries that make the evaluation of a schematree slower without adding information. This is the case when many labels are given: the filter prevents the evaluation from iterating through all of the repeated label properties.

Identifying items and properties

The Wikidata dump has all subjects together, both items and properties. To identify whether a subject is an item or a property we need to check the object of a specific predicate.

Reminder: N-Triples files come in lines of subject predicate object .

  • Predicate: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
  • If item, then object is: <http://wikiba.se/ontology#Item>
  • If property, then object is: <http://wikiba.se/ontology#Property>

In previous datasets (10M.nt.gz from June 2019), items were defined with <http://wikiba.se/ontology-beta#Item> and properties with <http://www.wikidata.org/ontology#Property>, which differs from the objects used in latest-truthy.nt.gz (from July 2019).
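As an illustration, here is a minimal Go sketch of this check against the latest-truthy URLs. The classify helper is hypothetical, not part of this package, and it assumes simple whitespace-separated triples with no spaces inside literals:

package main

import (
	"fmt"
	"strings"
)

// classify reports whether an N-Triples line declares its subject to be a
// Wikidata item or a property, based on the rdf:type predicate above.
// It returns "item", "property", or "" when the line is not a type declaration.
func classify(line string) string {
	const typePred = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
	fields := strings.Fields(line)
	if len(fields) < 3 || fields[1] != typePred {
		return ""
	}
	switch fields[2] {
	case "<http://wikiba.se/ontology#Item>":
		return "item"
	case "<http://wikiba.se/ontology#Property>":
		return "property"
	}
	return ""
}

func main() {
	fmt.Println(classify("<http://www.wikidata.org/entity/Q42> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://wikiba.se/ontology#Item> ."))
	// Prints: item
}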

Another simpler, but not so pedantic, way would be to check whether the subject starts with a given prefix. This approach is hypothetical and not actually used; a sketch follows the list below.

  • For entities: <http://www.wikidata.org/entity/Q
  • For properties: <http://www.wikidata.org/entity/P
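A minimal sketch of that hypothetical prefix check (these helpers are not part of this package's API):

package preparation

import "strings"

// isItemSubject and isPropertySubject illustrate the hypothetical prefix
// check described above. They are not used by the actual implementation.
func isItemSubject(s string) bool {
	return strings.HasPrefix(s, "<http://www.wikidata.org/entity/Q")
}

func isPropertySubject(s string) bool {
	return strings.HasPrefix(s, "<http://www.wikidata.org/entity/P")
}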

Prefix mismatch on properties

Wikidata uses (at least) two different URL prefixes to refer to properties, and this creates an incompatibility in the glossary which needs to be fixed with an extra preparation step on the property dataset.

When an Item subject refers to a Property predicate, Wikidata will use <http://www.wikidata.org/prop/direct/Pxxx> to refer to the property, but when Wikidata is defining the Property (in other words, when Property is used as a subject), Wikidata will refer to it with <http://www.wikidata.org/entity/Pxxx>. Notice the mismatch between /prop/direct/ and /entity/.

Without a proper preparation step, this mismatch will cause the glossary to store all labels under the /entity/ key, while the server requests will actually try to fetch /prop/direct/ keys from the glossary, with the result that no labels are shown at all.

These two different URL prefixes are linked in the data itself through a specific predicate. An example is:

<http://www.wikidata.org/entity/Pxxx> <http://wikiba.se/ontology#directClaim> <http://www.wikidata.org/prop/direct/Pxxx> .

The current extra preparation step makes a simple prefix change, but assumes a specific URL prefix is used. It is not pedantic.

gzip -cd dataset.nt.gz | sed -r -e 's|^<http:\/\/www\.wikidata\.org\/entity\/P([^>]+)>|<http://www.wikidata.org/prop/direct/P\1>|g' | gzip > ./dataset-altered.nt.gz
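For reference, the same rewrite as a minimal Go sketch. This is hypothetical; the package does not expose this step as a function:

package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	// Read the gzip-compressed dataset, rewrite /entity/P subjects at the
	// start of each line to the /prop/direct/P form (mirroring the sed
	// command above), and write a new gzip-compressed file.
	in, err := os.Open("dataset.nt.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()
	gz, err := gzip.NewReader(in)
	if err != nil {
		log.Fatal(err)
	}
	out, err := os.Create("dataset-altered.nt.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	gw := gzip.NewWriter(out)
	defer gw.Close()

	const oldPrefix = "<http://www.wikidata.org/entity/P"
	const newPrefix = "<http://www.wikidata.org/prop/direct/P"
	sc := bufio.NewScanner(gz)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long lines
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, oldPrefix) {
			line = newPrefix + strings.TrimPrefix(line, oldPrefix)
		}
		fmt.Fprintln(gw, line)
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}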

Requirement of contiguous subject entries

Some splitters that work in blocks of entries require that all subjects have their definitions in contiguous lines. To satisfy this requirement, you can add an extra preparation step that sorts the dataset.

gzip -cd ./dataset-filtered.nt.gz | sort | gzip > dataset-filtered-sorted.nt.gz

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func SplitBySampling

func SplitBySampling(fileName string, oneInN int64) error

SplitBySampling splits a dataset file into two by taking out every Nth entry. Taken from the original splitter without modifications.

Note that this method assumes that all subjects are defined in contiguous lines.
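A minimal usage sketch (the import path is hypothetical, since the module path is not shown here):

package main

import (
	"log"

	"example.com/preparation" // hypothetical import path
)

func main() {
	// Move every 10th entry of the dataset into a separate file.
	if err := preparation.SplitBySampling("dataset.nt.gz", 10); err != nil {
		log.Fatal(err)
	}
}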

Types

type FilterStats

type FilterStats struct {
	KeptCount int
	LostCount int
}

FilterStats are the stats related to the filter operation. TODO: Maybe these types of Stats returns should also tell us where the files have been stored.

func FilterForEvaluation

func FilterForEvaluation(filePath string) (*FilterStats, error)

FilterForEvaluation creates a filtered version of a dataset to make it faster when executing the evaluation.

func FilterForGlossary

func FilterForGlossary(filePath string) (*FilterStats, error)

FilterForGlossary creates a filtered version of a dataset to make it better for usage when building glossaries.

TODO: In the future it could use a filter-in mechanism where only specific predicates are sent to the generated file, instead of filter-out, which includes all statements except the ones listed. Filter-in should use the same predicates that are used by the glossary building step, giving the user a better perception of what is actually used by the glossary. With filter-in, we know that every statement in our generated file is also used in the construction of the glossary. With filter-out there can still be many statements that are silently ignored by the building step.

func FilterForSchematree

func FilterForSchematree(filePath string) (*FilterStats, error)

FilterForSchematree creates a filtered version of a dataset to make it better for usage when building schematrees.

TODO: In the future, such hard-coded predicates should probably not exist.
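A usage sketch for the filters, which all share the same shape (hypothetical import path):

package main

import (
	"log"

	"example.com/preparation" // hypothetical import path
)

func main() {
	// Filter the dataset for the schematree build process and report
	// how many statements were kept and dropped.
	stats, err := preparation.FilterForSchematree("dataset.nt.gz")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("kept %d statements, dropped %d", stats.KeptCount, stats.LostCount)
}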

type SplitByPrefixStats

type SplitByPrefixStats struct {
	MiscCount int
	ItemCount int
	PropCount int
}

SplitByPrefixStats are the stats related to the split operation. TODO: Maybe these types of Stats returns should also tell us where the files have been stored.

func SplitByPrefix

func SplitByPrefix(filePath string) (*SplitByPrefixStats, error)

SplitByPrefix will take a dataset and decide where to send each entry based on a match on the beginning of the subject. Matches can be one of the following: item, property, other/miscellaneous.

type SplitByTypeStats

type SplitByTypeStats struct {
	MiscCount int
	ItemCount int
	PropCount int
}

SplitByTypeStats are the stats related to the split operation. TODO: Maybe these types of Stats returns should also tell us where the files have been stored.

func SplitByType

func SplitByType(filePath string) (*SplitByTypeStats, error)

SplitByType will take a dataset and generate smaller datasets for each subject type it finds. Types can be one of the following: item, property, other/miscellaneous.

func SplitByTypeInBlocks

func SplitByTypeInBlocks(filePath string) (*SplitByTypeStats, error)

SplitByTypeInBlocks is a faster implementation of SplitByType, using only a single pass, but assumes that subjects are always found in contiguous lines.

TODO: Maybe there is a need to remove the type-classifying predicates. If that happens then it should be made an optional argument.
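A usage sketch (hypothetical import path; note the contiguous-subjects requirement, which the sort step shown earlier can guarantee):

package main

import (
	"log"

	"example.com/preparation" // hypothetical import path
)

func main() {
	// The input must have contiguous subject lines, e.g. the sorted
	// dataset produced by the extra preparation step shown earlier.
	stats, err := preparation.SplitByTypeInBlocks("dataset-filtered-sorted.nt.gz")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("items: %d, properties: %d, misc: %d",
		stats.ItemCount, stats.PropCount, stats.MiscCount)
}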
