preparation

package

v0.0.0-...-05931ac | Published: Jun 16, 2020 | License: GPL-3.0 | Imports: 11 | Imported by: 0

Repository

github.com/lgleim/SchemaTreeRecommender

README ¶

Preparation Module

The Preparation Module is capable of preparing datasets in N-Triples format.

It can split a dataset into multiple files, or filter a dataset and output only the entries that pass the filter. Each operation offers several methods that determine how the split or filter is carried out.

Splitting Methods

1-in-n: Used for splitting datasets into training and test sets. It moves every Nth line into a separate file.
by-type: Takes a dataset and generates 3 files: one for all items, one for all properties, and one for entries that are neither. Note that this splitter assumes that all statements about a subject appear in contiguous lines. In other words, the dataset has to be grouped by the subject column.
by-prefix: Takes a dataset and generates 3 files. Split is made according to the prefix of the subject.

Filtering Methods

for-schematree: Filters out entries that are not useful for the schematree build process.
for-glossary: Filters out entries that are not useful for the glossary build process.
for-evaluation: Filters out entries that make the evaluation of a schematree slower without adding information. This is the case when many labels are given; removing them prevents the evaluation from iterating through all of the repeated label properties.

Identifying items and properties

The Wikidata dump contains all subjects together, both items and properties. To identify whether a subject is an item or a property, we need to check the object of a specific predicate.

As a reminder, each line of an N-Triples file has the form subject predicate object .

Predicate: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
If item, then object is: <http://wikiba.se/ontology#Item>
If property, then object is: <http://wikiba.se/ontology#Property>
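
The check described above could be sketched as follows. The function name classifyTypeTriple and the sample subject Q42 are illustrative assumptions; only the predicate and object IRIs come from the text above.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	typePredicate = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
	itemObject    = "<http://wikiba.se/ontology#Item>"
	propObject    = "<http://wikiba.se/ontology#Property>"
)

// classifyTypeTriple returns "item" or "property" when line is a type
// statement with one of the known objects, and "" for any other line.
// Hypothetical helper; the package may implement this check differently.
func classifyTypeTriple(line string) string {
	fields := strings.Fields(line)
	if len(fields) < 3 || fields[1] != typePredicate {
		return ""
	}
	// Tolerate a terminating dot glued to the object.
	switch strings.TrimSuffix(fields[2], ".") {
	case itemObject:
		return "item"
	case propObject:
		return "property"
	}
	return ""
}

func main() {
	line := "<http://www.wikidata.org/entity/Q42> " + typePredicate + " " + itemObject + " ."
	fmt.Println(classifyTypeTriple(line))
}
```

Lines with any other predicate yield an empty string, so a caller can keep scanning a subject's statements until a type statement is found.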

In previous datasets (10M.nt.gz from June 2019), items and properties were defined with <http://wikiba.se/ontology-beta#Item> and <http://www.wikidata.org/ontology#Property> respectively, which differ from the objects used in latest-truthy.nt.gz (from July 2019).

A simpler, but less rigorous, way would be to check whether the subject starts with a given prefix. This is hypothetical and not actually used.

For entities: <http://www.wikidata.org/entity/Q
For properties: <http://www.wikidata.org/entity/P
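
A minimal sketch of such a prefix check, using the two prefixes listed above. The function name classifyByPrefix and the "misc" fallback label are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	itemPrefix = "<http://www.wikidata.org/entity/Q"
	propPrefix = "<http://www.wikidata.org/entity/P"
)

// classifyByPrefix looks only at the start of the line, where the subject
// is located. Hypothetical helper for illustration.
func classifyByPrefix(line string) string {
	switch {
	case strings.HasPrefix(line, itemPrefix):
		return "item"
	case strings.HasPrefix(line, propPrefix):
		return "property"
	default:
		return "misc"
	}
}

func main() {
	fmt.Println(classifyByPrefix("<http://www.wikidata.org/entity/Q42> <p> <o> ."))
	fmt.Println(classifyByPrefix("<http://www.wikidata.org/entity/P31> <p> <o> ."))
	fmt.Println(classifyByPrefix("<http://example.org/other> <p> <o> ."))
}
```

Unlike the type-based check, this needs no knowledge of the other statements about a subject, but it depends entirely on the URL scheme staying the same.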

Prefix mismatch on properties

Wikidata uses (at least) two different URL prefixes to refer to properties. This creates an incompatibility in the glossary, which has to be fixed with an extra preparation step on the property dataset.

When an Item subject refers to a Property predicate, Wikidata will use <http://www.wikidata.org/prop/direct/Pxxx> to refer to the property, but when Wikidata is defining the Property (in other words, when Property is used as a subject), Wikidata will refer to it with <http://www.wikidata.org/entity/Pxxx>. Notice the mismatch between /prop/direct/ and /entity/.

Without a proper preparation step, this mismatch causes the glossary to store all labels under /entity/ keys, while server requests try to fetch /prop/direct/ keys from the glossary, so no labels are shown at all.

The data links these two URL prefixes through a specific predicate. An example is:

<http://www.wikidata.org/entity/Pxxx> <http://wikiba.se/ontology#directClaim> <http://www.wikidata.org/prop/direct/Pxxx> .

The current extra preparation step makes a simple prefix change, but assumes a specific URL is used. It is a plain textual rewrite of the subject and does not consult the directClaim statements.

gzip -cd dataset.nt.gz | sed -r -e 's|^<http:\/\/www\.wikidata\.org\/entity\/P([^>]+)>|<http://www.wikidata.org/prop/direct/P\1>|g' | gzip > ./dataset-altered.nt.gz
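
For readers who prefer to do this rewrite in Go, the sed expression above could be mirrored roughly like this. The function name rewritePropertySubject and the sample property P31 are illustrative assumptions; gzip handling is left out to keep the sketch short.

```go
package main

import (
	"fmt"
	"regexp"
)

// entityProp matches a property IRI in subject position, that is, at the
// very start of the line, like the ^ anchor in the sed command.
var entityProp = regexp.MustCompile(`^<http://www\.wikidata\.org/entity/P([^>]+)>`)

// rewritePropertySubject swaps the /entity/ prefix of a property subject for
// /prop/direct/ and leaves every other line unchanged.
// Hypothetical helper for illustration.
func rewritePropertySubject(line string) string {
	return entityProp.ReplaceAllString(line, "<http://www.wikidata.org/prop/direct/P${1}>")
}

func main() {
	in := `<http://www.wikidata.org/entity/P31> <http://www.w3.org/2000/01/rdf-schema#label> "instance of"@en .`
	fmt.Println(rewritePropertySubject(in))
}
```

Because the pattern is anchored at the start of the line, a property IRI that appears as an object is not touched.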

Requirement of contiguous subject entries

Some splitters work on blocks of entries and require that all statements about a subject appear in contiguous lines. To guarantee this, you can add an extra preparation step that sorts the dataset.

gzip -cd ./dataset-filtered.nt.gz | sort | gzip > dataset-filtered-sorted.nt.gz

Documentation ¶

Index ¶

func SplitBySampling(fileName string, oneInN int64) error
type FilterStats
type SplitByPrefixStats
- func SplitByPrefix(filePath string) (*SplitByPrefixStats, error)
type SplitByTypeStats
- func SplitByType(filePath string) (*SplitByTypeStats, error)
- func SplitByTypeInBlocks(filePath string) (*SplitByTypeStats, error)

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func SplitBySampling ¶

func SplitBySampling(fileName string, oneInN int64) error

SplitBySampling splits a dataset file into two by taking out every Nth entry. Taken from the original splitter without modifications.

Note that this method assumes that all subjects are defined in contiguous lines.

Types ¶

type FilterStats ¶

type FilterStats struct {
	KeptCount int
	LostCount int
}

FilterStats are the stats related to the filter operation. TODO: Maybe these types of Stats returns should also tell us where the files have been stored.

func FilterForEvaluation ¶

func FilterForEvaluation(filePath string) (*FilterStats, error)

FilterForEvaluation creates a filtered version of a dataset to make it faster when executing the evaluation.

func FilterForGlossary ¶

func FilterForGlossary(filePath string) (*FilterStats, error)

FilterForGlossary creates a filtered version of a dataset to make it better for usage when building glossaries.

todo: In the future it could use a filter-in mechanism where only specific predicates are sent to the generated file, instead of filter-out, which includes all the statements except the ones listed. Filter-in should use the same predicates that are used by the Glossary building step and gives the user a better perception of what is actually used by the glossary. With filter-in, we know that every statement in our generated file is also used in the construction of the glossary. With filter-out there can still be many statements that are silently ignored by the building step.

func FilterForSchematree ¶

func FilterForSchematree(filePath string) (*FilterStats, error)

FilterForSchematree creates a filtered version of a dataset to make it better for usage when building schematrees.

todo: In future, such hard-coded predicates should probably not exist.

type SplitByPrefixStats ¶

type SplitByPrefixStats struct {
	MiscCount int
	ItemCount int
	PropCount int
}

SplitByPrefixStats are the stats related to the split operation. TODO: Maybe these types of Stats returns should also tell us where the files have been stored.

func SplitByPrefix ¶

func SplitByPrefix(filePath string) (*SplitByPrefixStats, error)

SplitByPrefix takes a dataset and decides where to send each entry based on a match against the beginning of the subject. The possible matches are: item, property, other/miscellaneous.

type SplitByTypeStats ¶

type SplitByTypeStats struct {
	MiscCount int
	ItemCount int
	PropCount int
}

SplitByTypeStats are the stats related to the split operation. TODO: Maybe these types of Stats returns should also tell us where the files have been stored.

func SplitByType ¶

func SplitByType(filePath string) (*SplitByTypeStats, error)

SplitByType takes a dataset and generates a smaller dataset for each subject type it finds. The possible types are: item, property, other/miscellaneous.

func SplitByTypeInBlocks ¶

func SplitByTypeInBlocks(filePath string) (*SplitByTypeStats, error)

SplitByTypeInBlocks is a faster implementation of SplitByType, using only a single pass, but assumes that subjects are always found in contiguous lines.

TODO: Maybe there is a need to remove the type-classifying predicates. If that happens, then it should be made an optional argument.
