lloyd

v0.2.3 Published: Apr 24, 2015 License: MIT Imports: 5 Imported by: 0

Repository

github.com/ubleipzig/lloyd

README ¶


Did you ever want to use uniq on a line-delimited JSON file? You've come to the right place.

Installation

Go get all utils:

$ go get github.com/miku/lloyd/cmd/...

Breaking up the problem

When working with large[1] line-delimited JSON (LDJ) files, it is inconvenient to store seen values in a set, because memory use grows linearly with the number of distinct values. Bloom filters are more space-efficient, but they allow false positives.

The traditional uniq is efficient, since it works on sorted input. The first problem therefore would be to sort a line-delimited JSON file by a key or keys.

Most Unix systems already ship sort, and GNU sort has been multicore-aware since coreutils 8.6:

As of coreutils 8.6 (2010-10-15), GNU sort already sorts in parallel to make use of several processors where available.

We can bracket the sort so that it works with LDJ files, too. First, extract the interesting value along with the document boundaries from the LDJ file. Then sort by the value. Finally, permute the original file according to the sorted boundaries.

[1] large: does not fit in memory

Step by step

We would like to filter out documents with duplicate names from the following file. The name with the highest more.syno should win.

$ cat fixtures/test.ldj
{"name": "Ann", "more": {"city": "London", "syno": 4}}
{"name": "涛", "more": {"city": "香港", "syno": 1}}
{"name": "Bob", "more": {"city": "Paris", "syno": 3}}
{"name": "Claude", "more": {"city": "Berlin", "syno": 5}}
{"name": "Diane", "more": {"city": "New York", "syno": 6}}
{"name": "Ann", "more": {"city": "Moscow", "syno": 2}}

First, extract the relevant keys.

$ lloyd-map -keys 'name, more.syno' fixtures/test.ldj
Ann 4   0   55
涛   1   55  55
Bob 3   110 54
Claude  5   164 58
Diane   6   222 59
Ann 2   281 55

And use traditional sort.

$ lloyd-map -keys 'name, more.syno' fixtures/test.ldj | sort
Ann 2   281 55
Ann 4   0   55
Bob 3   110 54
Claude  5   164 58
Diane   6   222 59
涛   1   55  55

Now a sort -u will do the job, if restricted to the first column:

$ lloyd-map -keys 'name, more.syno' fixtures/test.ldj | sort -uk1,1
Ann 4   0   55
Bob 3   110 54
Claude  5   164 58
Diane   6   222 59
涛   1   55  55

Now we only need to seek to the locations given as offset and length in the last two columns and read the corresponding records from the original file:

$ lloyd-map -keys 'name, more.syno' fixtures/test.ldj | sort -uk1,1 | cut -f3-
0   55
110 54
164 58
222 59
55  55

$ lloyd-map -keys 'name, more.syno' fixtures/test.ldj | sort -uk1,1 | cut -f3- | \
  lloyd-permute fixtures/test.ldj

{"name": "Ann", "more": {"city": "London", "syno": 4}}
{"name": "Bob", "more": {"city": "Paris", "syno": 3}}
{"name": "Claude", "more": {"city": "Berlin", "syno": 5}}
{"name": "Diane", "more": {"city": "New York", "syno": 6}}
{"name": "涛", "more": {"city": "香港", "syno": 1}}

Current limitations

The values should not contain tabs, since lloyd-map currently outputs tab-delimited rows.

Documentation ¶

Index ¶

Constants
func StringValue(key string, doc map[string]interface{}) (string, error)

Constants ¶

const AppVersion = "0.2.3"

Functions ¶

func StringValue ¶

func StringValue(key string, doc map[string]interface{}) (string, error)

StringValue returns the value for a given key in dot notation
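
To illustrate what a dot-notation lookup involves, here is a minimal sketch with the same signature. It is a hypothetical stand-in named dotValue, not the package's implementation; how StringValue formats non-string values or reports missing keys is not documented here, so the choices below are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// dotValue (hypothetical stand-in, not the package's StringValue) walks a
// decoded JSON document along a key in dot notation such as "more.syno".
// Assumption: non-string values are rendered with fmt's default formatting,
// and a missing or non-object path component is reported as an error.
func dotValue(key string, doc map[string]interface{}) (string, error) {
	var cur interface{} = doc
	for _, part := range strings.Split(key, ".") {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return "", fmt.Errorf("%q: not an object at %q", key, part)
		}
		if cur, ok = m[part]; !ok {
			return "", fmt.Errorf("%q: no such key %q", key, part)
		}
	}
	return fmt.Sprintf("%v", cur), nil
}

func main() {
	var doc map[string]interface{}
	raw := `{"name": "Ann", "more": {"city": "London", "syno": 4}}`
	if err := json.Unmarshal([]byte(raw), &doc); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, key := range []string{"name", "more.city", "more.syno"} {
		v, err := dotValue(key, doc)
		fmt.Println(key, v, err)
	}
}
```

Note that encoding/json decodes numbers into float64 by default, so very large integers would print in exponent form with this formatting; decoding with json.Decoder and UseNumber is one way to keep the original digits.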

Source Files ¶

common.go

Directories ¶

Path	Synopsis
cmd
lloyd-map	lloyd-map extracts a value per document along with its offset and length.
lloyd-permute	lloyd-permute takes a list of offset, length pairs from stdin and outputs the parts in the order in which they are read.
lloyd-uniq
