groupcover

package module
v0.1.0 Latest
Published: Jun 2, 2022 License: GPL-3.0 Imports: 7 Imported by: 0

README

groupcover

Staged deduplication.

Test drive

$ go install github.com/miku/groupcover/cmd/groupcover@latest

Or via packages.

Usage

$ groupcover < input.csv > changes.csv

Where input.csv has three or more columns:

id, group, attribute, [key, key, ...]

Items from different groups (e.g. data sources) may share an attribute value (e.g. ISBN or DOI). Depending on a preference over groups (possibly per key), a number of keys may be dropped for an entry.

The CSV file must already be sorted by attribute.

$ groupcover -h
Usage of groupcover:
  -cpuprofile string
        pprof output file
  -f int
        column to use for grouping, one-based (default 3)
  -lower
        lowercase input
  -prefs string
        space separated string of preferences (most preferred first), e.g. 'B A C'
  -verbose
        more output
  -version
        show version

Examples

$ cat fixtures/sample.csv
id-1,group-1,value-1,Leipzig,Berlin
id-2,group-2,value-1,Berlin,Dresden

This is a duplicate, but only for Berlin: id-1 and id-2 share the attribute value value-1, and both carry the key Berlin. By default, the group with the higher lexicographic value is chosen, so after deduplication Berlin stays with id-2 and is dropped from id-1:

$ groupcover < fixtures/sample.csv 2> /dev/null
id-1,group-1,value-1,Leipzig
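
The default rule above can be sketched in a few lines of Go. This is a minimal illustration, not the package's implementation: the name dropKeys is hypothetical, and the sketch assumes every record has at least three columns and that all records passed in share one attribute value.

```go
package main

import (
	"fmt"
	"strings"
)

// dropKeys (hypothetical name) applies the default rule to records that share
// one attribute value: for each key, only records of the group that sorts
// highest keep the key. Records are laid out as id, group, attribute, key...
// Only records that lost at least one key are returned.
func dropKeys(records [][]string) [][]string {
	// winner maps each key to the highest group name that carries it.
	winner := make(map[string]string)
	for _, rec := range records {
		for _, key := range rec[3:] {
			if g, ok := winner[key]; !ok || rec[1] > g {
				winner[key] = rec[1]
			}
		}
	}
	var changed [][]string
	for _, rec := range records {
		out := append([]string{}, rec[:3]...)
		for _, key := range rec[3:] {
			if winner[key] == rec[1] {
				out = append(out, key)
			}
		}
		if len(out) != len(rec) {
			changed = append(changed, out)
		}
	}
	return changed
}

func main() {
	sample := [][]string{
		{"id-1", "group-1", "value-1", "Leipzig", "Berlin"},
		{"id-2", "group-2", "value-1", "Berlin", "Dresden"},
	}
	for _, rec := range dropKeys(sample) {
		fmt.Println(strings.Join(rec, ",")) // id-1,group-1,value-1,Leipzig
	}
}
```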

Since 0.0.4, there is an experimental flag for setting preferences:

$ groupcover -prefs 'group-2 group-1' < fixtures/sample.csv 2> /dev/null
id-1,group-1,value-1,Leipzig

Override the default lexicographic order and prefer group-1 over group-2:

$ groupcover -prefs 'group-1 group-2' < fixtures/sample.csv 2> /dev/null
id-2,group-2,value-1,Dresden

Another example.

$ cat fixtures/mini.csv
1,G1,A1,K1,K2
2,G1,A2,K1,K2
3,G2,A2,K1,K2,K3
4,G3,A2,K2
5,G1,A3,K1,K2,K3
6,G2,A3,K2,K3
7,G1,,K2,K3
8,G2,,K2,K3
9,G2,A4,K2,K3
A,G2,A4,K2,K3

To sort CSV by attribute:

$ sort -t, -k3 fixtures/mini.csv
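
If the data is already being handled in Go, the same preparation step might look like the following sketch. The helper name sortByColumn is an assumption of this example. Unlike the sort(1) call above, it compares only the attribute column and keeps the input order of records with equal values.

```go
package main

import (
	"encoding/csv"
	"os"
	"sort"
)

// sortByColumn (hypothetical helper) sorts records in place by the zero-based
// column col. The sort is stable, so records with equal values keep their
// input order. Records too short to have the column sort as if it were empty.
func sortByColumn(records [][]string, col int) {
	value := func(rec []string) string {
		if col < len(rec) {
			return rec[col]
		}
		return ""
	}
	sort.SliceStable(records, func(i, j int) bool {
		return value(records[i]) < value(records[j])
	})
}

func main() {
	r := csv.NewReader(os.Stdin)
	r.FieldsPerRecord = -1 // rows may carry different numbers of keys
	records, err := r.ReadAll()
	if err != nil {
		panic(err)
	}
	sortByColumn(records, 2) // the third column holds the attribute
	w := csv.NewWriter(os.Stdout)
	if err := w.WriteAll(records); err != nil {
		panic(err)
	}
}
```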

Only the changed entries are written:

$ groupcover < fixtures/mini.csv 2> /dev/null
2,G1,A2
3,G2,A2,K1,K3
5,G1,A3,K1

Finc Index

Licensing information is available, e.g., from AILicensing, in the intermediate format. The following jq call extracts record ID, source ID, DOI and labels as CSV:

$ jq -r '[
    .["finc.record_id"],
    .["finc.source_id"],
    .["doi"],
    .["x.labels"][]?] | @csv' < <(unpigz -c /tmp/AILicensing/date-2016-11-28.ldj.gz)

"ai-48-QkVGT19fTTgzMDMxOTUzMzcwLU0tRklaVC1ET01BLVpERUUtQkVGTy1JVEVD","48",,"DE-J59"
"ai-48-QkVGT19fTTgzMDMxOTIwNjQ1LU0tRklaVC1ET01BLUJFRk8","48",,"DE-J59"
"ai-48-QkVGT19fTTgzMDMxOTE3NjQ1LU0tRklaVC1ET01BLUJFRk8","48",,"DE-J59"
...


Documentation

Overview

Copyright 2016 by Leipzig University Library, http://ub.uni-leipzig.de
                  The Finc Authors, http://finc.info
                  Martin Czygan, <martin.czygan@uni-leipzig.de>

This file is part of some open source application.

Some open source application is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Some open source application is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with some open source application. If not, see <http://www.gnu.org/licenses/>.

@license GPL-3.0+ <http://spdx.org/licenses/GPL-3.0+>

Index

Constants

This section is empty.

Variables

var (
	// Verbose output.
	Verbose = true
)

Functions

func DiscardRows added in v0.0.8

func DiscardRows(records [][]string) ([][]string, error)

DiscardRows is a rewriter that discards all rows.

func GroupRewrite

func GroupRewrite(r io.Reader, w io.Writer, attrFunc AttrFunc, rewriterFunc RewriterFunc) error

GroupRewrite reads CSV records from a given reader, extracts attribute values with attrFunc, groups subsequent records with the same attribute value and passes these groups to a rewriter. The potentially modified records are written as CSV to the given writer.
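
The grouping described here can be illustrated with a simplified, standard-library-only sketch. The lowercase names (groupRewrite, attrFunc, rewriterFunc, keepLast) belong to this example and are not the package's code; error handling and CSV options in the real function may differ.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

type attrFunc func(record []string) (string, error)
type rewriterFunc func(records [][]string) ([][]string, error)

// groupRewrite batches consecutive records that share an attribute value,
// passes each batch to rewrite and writes whatever comes back as CSV.
func groupRewrite(r io.Reader, w io.Writer, attr attrFunc, rewrite rewriterFunc) error {
	cr := csv.NewReader(r)
	cr.FieldsPerRecord = -1 // allow a varying number of key columns
	cw := csv.NewWriter(w)

	var batch [][]string
	var current string
	flush := func() error {
		if len(batch) == 0 {
			return nil
		}
		out, err := rewrite(batch)
		batch = nil
		if err != nil {
			return err
		}
		return cw.WriteAll(out) // WriteAll also flushes the writer
	}
	for {
		rec, err := cr.Read()
		if err == io.EOF {
			return flush()
		}
		if err != nil {
			return err
		}
		value, err := attr(rec)
		if err != nil {
			return err
		}
		if value != current {
			if err := flush(); err != nil {
				return err
			}
			current = value
		}
		batch = append(batch, rec)
	}
}

// keepLast runs the sketch over CSV text, grouping on the third column and
// keeping only the last record of each batch.
func keepLast(input string) (string, error) {
	third := func(rec []string) (string, error) {
		if len(rec) < 3 {
			return "", fmt.Errorf("record has %d columns, want at least 3", len(rec))
		}
		return rec[2], nil
	}
	last := func(recs [][]string) ([][]string, error) {
		return recs[len(recs)-1:], nil
	}
	var sb strings.Builder
	err := groupRewrite(strings.NewReader(input), &sb, third, last)
	return sb.String(), err
}

func main() {
	out, err := keepLast("1,G1,A1\n2,G1,A2\n3,G2,A2\n")
	if err != nil {
		panic(err)
	}
	fmt.Print(out) // prints 1,G1,A1 and 3,G2,A2 on separate lines
}
```

Because only neighbouring records are compared, an attribute value that reappears later in unsorted input would start a second, separate batch in this sketch.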

func LastRow added in v0.0.2

func LastRow(records [][]string) ([][]string, error)

LastRow is a rewriter that keeps only the last row of each group. Used as GroupRewrite(os.Stdin, os.Stdout, Column(0), LastRow), the effect is similar to uniq(1).

func LexChoice

func LexChoice(s []string) string

LexChoice chooses the key with the highest lexicographic value. If there are no choices, return the empty string.
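
A function with this documented behaviour could be written as follows. This is a sketch under the assumption that "lexicographic" means Go's byte-wise string comparison; lexChoice is this example's own name.

```go
package main

import "fmt"

// lexChoice returns the largest string in s by byte-wise comparison,
// or the empty string if s is empty.
func lexChoice(s []string) string {
	best := ""
	for _, v := range s {
		if v > best {
			best = v
		}
	}
	return best
}

func main() {
	fmt.Println(lexChoice([]string{"group-1", "group-2"})) // group-2
	fmt.Printf("%q\n", lexChoice(nil))                     // ""
}
```

Under byte-wise comparison, "group-10" sorts below "group-2", so numbered group names only order numerically if they are zero-padded.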

Types

type AttrFunc

type AttrFunc func(record []string) (string, error)

AttrFunc extracts an attribute value from a slice of strings (e.g. coming from a CSV file). Example values could be a single column, part of a column or a value spanning multiple columns.

func Column

func Column(k int) AttrFunc

Column returns an AttrFunc that yields the value of the given column (zero-indexed).

func ColumnLower added in v0.0.10

func ColumnLower(k int) AttrFunc

ColumnLower returns an AttrFunc that yields the lowercased value of the given column (zero-indexed), refs #12755.
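
Both constructors can be pictured as closures over the column index. The sketch below uses its own lowercase names, and it assumes that an out-of-range index is reported as an error; the documentation does not say how the package handles that case.

```go
package main

import (
	"fmt"
	"strings"
)

type attrFunc func(record []string) (string, error)

// column returns an attrFunc yielding the value at zero-based index k.
// Returning an error for a missing column is an assumption of this sketch.
func column(k int) attrFunc {
	return func(record []string) (string, error) {
		if k < 0 || k >= len(record) {
			return "", fmt.Errorf("column %d out of range for record of length %d", k, len(record))
		}
		return record[k], nil
	}
}

// columnLower wraps column and lowercases the extracted value.
func columnLower(k int) attrFunc {
	col := column(k)
	return func(record []string) (string, error) {
		v, err := col(record)
		return strings.ToLower(v), err
	}
}

func main() {
	rec := []string{"id-1", "group-1", "10.1000/ABC"}
	v, _ := column(2)(rec)
	l, _ := columnLower(2)(rec)
	_, err := column(7)(rec)
	fmt.Println(v, l, err != nil) // 10.1000/ABC 10.1000/abc true
}
```

The command line flag -f is one-based while these functions are zero-indexed, so the default -f 3 presumably corresponds to index 2.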

type ChoiceFunc

type ChoiceFunc func([]string) string

ChoiceFunc, presented with a list of choices, chooses one.

func ListChooser added in v0.0.8

func ListChooser(prefs []string) ChoiceFunc

ListChooser takes a preference list (most preferred first) and returns a ChoiceFunc. It panics if the given preference list is empty. If the preferences and the given options intersect, the option with the highest preference is chosen. If they do not intersect, an option is selected at random.
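
The selection rule can be sketched as below. The names are this example's own, and returning the empty string for an empty option list is an assumption; the text above only covers the non-empty case.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

type choiceFunc func([]string) string

// listChooser builds a choiceFunc from a preference list, most preferred first.
func listChooser(prefs []string) choiceFunc {
	if len(prefs) == 0 {
		panic("preference list must not be empty")
	}
	// rank maps a preference to its position; a lower rank is better.
	rank := make(map[string]int, len(prefs))
	for i, p := range prefs {
		if _, seen := rank[p]; !seen {
			rank[p] = i
		}
	}
	return func(options []string) string {
		if len(options) == 0 {
			return "" // assumption: nothing to choose from
		}
		best, bestRank := "", len(prefs)
		for _, o := range options {
			if r, ok := rank[o]; ok && r < bestRank {
				best, bestRank = o, r
			}
		}
		if bestRank < len(prefs) {
			return best
		}
		// No option is mentioned in the preferences: pick one at random.
		return options[rand.Intn(len(options))]
	}
}

func main() {
	// The -prefs flag takes a space separated string, split here with Fields.
	choose := listChooser(strings.Fields("group-1 group-2"))
	fmt.Println(choose([]string{"group-2", "group-1"})) // group-1
}
```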

type Preferences added in v0.0.8

type Preferences struct {
	Map     map[string]ChoiceFunc
	Default ChoiceFunc
}

Preferences groups many choices by key (e.g. ISIL). If there is no ChoiceFunc for a key, a default can be used.
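
How a per-key lookup with a fallback might work is shown in the sketch below. The choose method is hypothetical and not part of the documented API, and the key DE-15 and source ID 49 are invented for illustration.

```go
package main

import "fmt"

type choiceFunc func([]string) string

// preferences mirrors the documented struct layout with local names.
type preferences struct {
	Map     map[string]choiceFunc
	Default choiceFunc
}

// choose (hypothetical helper) uses the key's own choiceFunc if there is one,
// falls back to Default, and returns the empty string if neither is set.
func (p preferences) choose(key string, groups []string) string {
	if f, ok := p.Map[key]; ok && f != nil {
		return f(groups)
	}
	if p.Default != nil {
		return p.Default(groups)
	}
	return ""
}

func main() {
	first := func(groups []string) string { return groups[0] }
	last := func(groups []string) string { return groups[len(groups)-1] }
	p := preferences{
		Map:     map[string]choiceFunc{"DE-J59": first},
		Default: last,
	}
	groups := []string{"48", "49"}
	fmt.Println(p.choose("DE-J59", groups)) // 48
	fmt.Println(p.choose("DE-15", groups))  // 49
}
```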

type RewriterFunc

type RewriterFunc func(records [][]string) ([][]string, error)

RewriterFunc rewrites a list of records.

func SimpleRewriter

func SimpleRewriter(preferences Preferences) RewriterFunc

SimpleRewriter takes a preference map (which key is interested in which group) and returns a rewriter, which drops certain keys that are assigned to records from multiple groups with the same attribute value. Note: This rewriter returns only differing records.
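
To make the per-key aspect concrete, here is a simplified sketch in which the winning group is decided once per key by a caller-supplied function. All names are hypothetical and the package's actual logic may differ. The sketch assumes records of the form id, group, attribute, key... that share one attribute value.

```go
package main

import (
	"fmt"
	"strings"
)

// chooser picks the winning group for a key among the groups carrying it.
type chooser func(key string, groups []string) string

// rewrite drops each key from every record whose group did not win it and
// returns only the records that changed.
func rewrite(records [][]string, choose chooser) [][]string {
	// Collect the groups per key, then decide each key exactly once.
	groups := make(map[string][]string)
	for _, rec := range records {
		for _, key := range rec[3:] {
			groups[key] = append(groups[key], rec[1])
		}
	}
	winner := make(map[string]string, len(groups))
	for key, gs := range groups {
		winner[key] = choose(key, gs)
	}
	var changed [][]string
	for _, rec := range records {
		out := append([]string{}, rec[:3]...)
		for _, key := range rec[3:] {
			if winner[key] == rec[1] {
				out = append(out, key)
			}
		}
		if len(out) != len(rec) {
			changed = append(changed, out)
		}
	}
	return changed
}

// berlinFirst prefers group-1 for the key Berlin and otherwise falls back to
// the highest group name.
func berlinFirst(key string, groups []string) string {
	best := ""
	for _, g := range groups {
		if key == "Berlin" && g == "group-1" {
			return g
		}
		if g > best {
			best = g
		}
	}
	return best
}

func main() {
	sample := [][]string{
		{"id-1", "group-1", "value-1", "Leipzig", "Berlin"},
		{"id-2", "group-2", "value-1", "Berlin", "Dresden"},
	}
	for _, rec := range rewrite(sample, berlinFirst) {
		fmt.Println(strings.Join(rec, ",")) // id-2,group-2,value-1,Dresden
	}
}
```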

Directories

Path            Synopsis
cmd/groupindex  The groupindex tool can be applied to an intermediate schema or solr file, that is about to be indexed.
