nilsimsa

package module

v0.0.0-...-84540bb Latest Latest Go to latest Published: Jun 18, 2020 License: BSD-3-Clause Imports: 4 Imported by: 0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/chikamim/nilsimsa

Links

Open Source Insights

README ¶

nilsimsa

Package nilsimsa is a Go implemenation of Nilsimsa, ported from code.google.com/p/py-nilsimsa but follows the conventions establish by the md5 package in the standard lib

The Java implementaition at https://github.com/weblyzard/nilsimsa/blob/master/src/main/java/com/weblyzard/lib/string/nilsimsa/Nilsimsa.java was also consulted.

There is a discussion about using hash to score string similarities http://stackoverflow.com/questions/4323977/string-similarity-score-hash

Examples

fmt.Println(nilsimsa.HexSum([]byte("Hello nilsimsa!")))
// Output: 436119240183882801210e002e1cb00122a20d11b4268ab8001a51190c08084b

h1 := nilsimsa.HexSum([]byte("Hello nilsimsa!"))
h2 := nilsimsa.HexSum([]byte("Hello world!"))
h3 := nilsimsa.HexSum([]byte("Nobody Expects the Spanish Inquisition"))

fmt.Println(nilsimsa.DiffHexScore(h1, h1))
fmt.Println(nilsimsa.DiffHexScore(h1, h2))
fmt.Println(nilsimsa.DiffHexScore(h1, h3))
// Output:
// 1
// 0.4094488188976378
// 0.14960629921259844

Documentation ¶

Overview ¶

Package nilsimsa is a Go implemenation of Nilsimsa, ported from code.google.com/p/py-nilsimsa but follows the conventions establish by the md5 package in the standard lib

The Java implementaition at https://github.com/weblyzard/nilsimsa/ blob/master/src/main/java/com/weblyzard/lib/string/nilsimsa/Nilsimsa.java was also consulted.

There is a discussion about using hash to score string similarities http://stackoverflow.com/questions/4323977/string-similarity-score-hash

From http://en.wikipedia.org/wiki/Nilsimsa_Hash

Nilsimsa is an anti-spam focused locality-sensitive hashing algorithm originally proposed the cmeclax remailer operator in 2001[1] and then reviewed by Damiani et al. in their 2004 paper titled, "An Open Digest-based Technique for Spam Detection".[2]

The goal of Nilsimsa is to generate a hash digest of an email message such that the digests of two similar messages are similar to each other. In comparison with cryptographic hash functions such as SHA-1 or MD5, making a small modification to a document does not substantially change the resulting hash of the document. The paper suggests that the Nilsimsa satisfies three requirements:

The digest identifying each message should not vary significantly (sic) for changes that can be produced automatically.
The encoding must be robust against intentional attacks.
The encoding should support an extremely low risk of false positives.

Subsequent testing[3] on a range of file types identified the Nilsimsa hash as having a significantly higher false positive rate when compared to other similarity digest schemes such as TLSH, Ssdeep and Sdhash.

Nilsimsa similarity matching was taken in consideration by Jesse Kornblum when developing the fuzzy hashing in 2006,[4] that used the algorithms of spamsum by Andrew Tridgell (2002).[5]

References:

[1] http://web.archive.org/web/20050707005338/

http://ixazon.dynip.com/~cmeclax/nilsimsa-0.2.4.tar.gz

[2] http://spdp.di.unimi.it/papers/pdcs04.pdf [3] https://www.academia.edu/7833902/TLSH_-A_Locality_Sensitive_Hash [4] http://jessekornblum.livejournal.com/242493.html [5] http://dfrws.org/2006/proceedings/12-Kornblum.pdf

From http://blog.semanticlab.net/tag/nilsimsa/

An Open Digest-based Technique for Spam Detection ¶

Damiani, E. et al., 2004. An Open Digest-based Technique for Spam Detection. In in Proceedings of the 2004 International Workshop on Security in Parallel and Distributed Systems. pp. 15-17.

Summary ¶

This paper discusses the Nilsimsa open digest hash algorithm which is frequently used for Spam detection. The authors describe the computation of the 32-byte code, discuss different attack scenarios and measures to counter them.

Digest computation

Slide a five character window through the input text and compute all eight possible tri-grams for each window (e.g. "igram" yields "igr", "gra", "ram", "ira", "iam", "grm", ...)
Hash these trigrams using a hash function h() which maps every tri-gram to one of 256 accumulators and increment the corresponding accumulator. Nilsimsa uses the Trans53 hash function for hashing.
At the end of the process described below, compute the expected value of the accumulators and set the bits which correspond to each accumulator either to (1) if it exceeds this threshold or (o) otherwise.

Similarity computation ¶

The Nilsimsa similarity is computed based on the bitwise difference between two Nilsimsa hashes. Documents are considered similar if they exceed a pre-defined similarity value.

>24 similar bits - conflict probability: 1.35E-4 (suggestions by Nilsimsa's original designer)
>54 similar bits - conflict probability: 7.39E-12 (suggested by the article's authors)

Attacks ¶

Spammers can apply multiple techniques to prevent Nilsimsa from detecting duplicates:

Random addition: requires >300% of additional text to prevent detection.
Thesaurus substitutions: require >20% of replaced text
Perceptive substitution (security > s3cur1ty): requires >15% of the text to be altered
Aimed attacks (i.e. attacks which specifically target Nilsimsa): ~10% of the text needs to be altered

Aimed attacks manipulate Nilsimsa's accumulators by adding words which introduce new tri-grams that specifically alter the hash value. Although these attacks are relatively effective, they can be easily circumvented by computing the Nilsimsa hash twice with different hash functions.

Index ¶

Constants
func BitsDiff(n1, n2 *[Size]byte) byte
func BitsDiffHex(n1, n2 string) (byte, error)
func BitsDiffSlice(n1, n2 []byte) byte
func ComputeDigest(acc *[256]int, count int) [Size]byte
func DiffHexScore(n1, n2 string) float64
func DiffScore(n1, n2 *[Size]byte) float64
func HexSum(data []byte) string
func New() hash.Hash
func Sum(data []byte) [Size]byte
func Update(chunk []byte, count int, c0, c1, c2, c3 byte, acc *[256]int) (int, byte, byte, byte, byte)

Constants ¶

View Source

const BlockSize = 8

The blocksize of Nilsimsa in bytes (not sure what values is best...?)

View Source

const Size = 32

The size of an Nilsimsa hash in bytes.

Variables ¶

This section is empty.

Functions ¶

func BitsDiff ¶

func BitsDiff(n1, n2 *[Size]byte) byte

BitsDiff compares two Nilsimsa digest arrays and return the number of bits that differ.

func BitsDiffHex ¶

func BitsDiffHex(n1, n2 string) (byte, error)

BitsDiffHex compares two Nilsimsa digests hex strings and return the number of bits that differ

func BitsDiffSlice ¶

func BitsDiffSlice(n1, n2 []byte) byte

BitsDiffSlice compares two Nilsimsa digests slices and return the number of bits that differ.

func ComputeDigest ¶

func ComputeDigest(acc *[256]int, count int) [Size]byte

ComputeDigest uses a threshold (mean of the accumulator) to computes the Nilsimsa digest

func DiffHexScore ¶

func DiffHexScore(n1, n2 string) float64

DiffHexScore return the 0-1.0 similarity score

func DiffScore ¶

func DiffScore(n1, n2 *[Size]byte) float64

DiffScore return the 0-1.0 similarity score

func HexSum ¶

func HexSum(data []byte) string

HexSum returns the Nilsimsa digest of the data as a hex string

func New ¶

func New() hash.Hash

New create a new Nilsimsa hash diget

func Sum ¶

func Sum(data []byte) [Size]byte

Sum returns the Nilsimsa digest of the data. Like Sum() for MD5, it returns [Size]byte by value rather than a slice. This way the return value does not need to be allocated on the head so there is no garbage collection later.

To use the result as a slice when using it as as a function parameter, simply re-slice it using the digist[:] syntax See http://blog.golang.org/go-slices-usage-and-internals and search for: "slicing" an existing slice or array

func Update ¶

func Update(chunk []byte, count int, c0, c1, c2, c3 byte,
	acc *[256]int) (int, byte, byte, byte, byte)

Update the Nilsimsa accumulator with the data contained in chunk using a 5 bytes sliding window.

In general it is easier to just use Sum(data []byte)

Types ¶

This section is empty.

Source Files ¶

View all Source files

nilsimsa.go

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL