imageid

Version: v0.0.0-...-0b5b0ed
Published: Oct 6, 2015 License: Apache-2.0 Imports: 23 Imported by: 0

README

imageid: Similar images indexing service

This tool indexes a large number (millions) of images and partitions them into disjoint groups of similar images. Each group is identified by a canonical URL, which is the URL of one of the images in the group.

Image URL                     Canonical URL
http://abc.def/umbrella1.png  http://abc.def/umbrella1.png
http://asd.fgh/horse.png      http://asd.fgh/horse.png
http://xyz.tld/u.jpg          http://abc.def/umbrella1.png

Images are hashed using a combination of dhash (difference hash) and phash (perceptual hash), resulting in a 128-bit hash. Similarity detection is not flawless; for example, a cropped version of an image will be detected as a completely different image.
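To give an idea of what a difference hash measures, here is a minimal sketch of a generic 64-bit dhash over a 9x8 grayscale grid. It is not the imhash implementation: the function name dHash64, the grid size and the bit order are assumptions made for illustration, and the phash half of the 128-bit hash is not shown.

```go
package main

import "fmt"

// dHash64 is a hypothetical helper, not part of imageid. It computes a
// 64-bit difference hash from an image already scaled down to 8 rows of
// 9 grayscale pixels. Each bit records whether a pixel is brighter than
// its right-hand neighbour.
func dHash64(gray [8][9]uint8) uint64 {
	var h uint64
	for y := 0; y < 8; y++ {
		for x := 0; x < 8; x++ {
			h <<= 1
			if gray[y][x] > gray[y][x+1] {
				h |= 1
			}
		}
	}
	return h
}

func main() {
	// Synthetic test image: brightness falls from left to right, so every
	// pixel is brighter than its right-hand neighbour.
	var img [8][9]uint8
	for y := range img {
		for x := range img[y] {
			img[y][x] = uint8(255 - 20*x)
		}
	}
	fmt.Printf("%016x\n", dHash64(img)) // prints "ffffffffffffffff"
}
```

Because only the relative brightness of neighbouring pixels is recorded, uniform changes in brightness or small compression artifacts tend to leave most bits unchanged, while cropping shifts the whole grid and changes many of them.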

Each image is processed using the following pipeline:

  • If the image URL is already indexed, do nothing.
  • Download the image.
  • Calculate the MD5 of the image. If the MD5 is already indexed, the new URL is assigned to the similarity group of that MD5.
  • Calculate hash and search for similar hashes in the database (distance <= 8).
    • If no similar hashes are found, store the hash and MD5 associated with the URL, in a new similarity group.
    • If a similar hash is found, assign the new URL and MD5 to the similarity group. The new hash is not stored.
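The pipeline above can be sketched with three in-memory maps. This is an illustrative sketch only: the Index type, its Add method and the linear scan over stored hashes are assumptions, and the caller supplies the image bytes and the 128-bit hash instead of downloading and hashing the image.

```go
package main

import (
	"crypto/md5"
	"fmt"
	"math/bits"
)

// Hash is a 128-bit image hash stored as two 64-bit halves (assumed layout).
type Hash [2]uint64

const maxDistance = 8 // similarity threshold taken from the pipeline description

func distance(a, b Hash) int {
	return bits.OnesCount64(a[0]^b[0]) + bits.OnesCount64(a[1]^b[1])
}

// Index is a hypothetical in-memory stand-in for the three tables.
type Index struct {
	urls   map[string]string
	md5s   map[[md5.Size]byte]string
	hashes map[Hash]string
}

func NewIndex() *Index {
	return &Index{
		urls:   map[string]string{},
		md5s:   map[[md5.Size]byte]string{},
		hashes: map[Hash]string{},
	}
}

// Add runs the pipeline for one image and returns its canonical URL.
func (ix *Index) Add(url string, body []byte, h Hash) string {
	if c, ok := ix.urls[url]; ok {
		return c // URL already indexed: do nothing
	}
	sum := md5.Sum(body)
	if c, ok := ix.md5s[sum]; ok {
		ix.urls[url] = c // identical bytes: join the existing group
		return c
	}
	for stored, c := range ix.hashes {
		if distance(stored, h) <= maxDistance {
			ix.urls[url] = c // similar image: the new hash is not stored
			ix.md5s[sum] = c
			return c
		}
	}
	// No similar hash: this URL starts a new similarity group.
	ix.urls[url], ix.md5s[sum], ix.hashes[h] = url, url, url
	return url
}

func main() {
	ix := NewIndex()
	a := Hash{0x80a486d2cadcd480, 0x3c531129c8f4af4f}
	b := Hash{0x808486d2cadcd480, 0xbc531129c8f0bf0f} // 5 bits away from a
	c := Hash{0x8216715a5295080c, 0x877da82f450307db}
	fmt.Println(ix.Add("http://example.com/a.png", []byte("image A"), a))
	fmt.Println(ix.Add("http://example.com/b.png", []byte("image B"), b))
	fmt.Println(ix.Add("http://example.com/c.png", []byte("image C"), c))
}
```

In this run the second image joins the group of the first, and the third starts its own group. If several stored hashes fall within the threshold, the map iteration order decides which group is chosen in this sketch.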

The service uses goroutines to make full use of the available bandwidth and CPU:

  • 10 goroutines for downloading images (configurable with the IMAGEID_DOWNLOAD_WORKERS environment variable).
  • runtime.NumCPU() goroutines for calculating hashes, searching and indexing (configurable with the IMAGEID_PROCESS_WORKERS environment variable).
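A two-stage worker pool of this shape can be sketched as follows. The helper names parseWorkers and runPool are hypothetical, and falling back to the default when a variable is unset or invalid is an assumption, not documented behaviour of imageid.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"sync"
)

// parseWorkers returns the worker count encoded in value, or def when the
// value is empty, not a number, or less than 1 (assumed fallback policy).
func parseWorkers(value string, def int) int {
	n, err := strconv.Atoi(value)
	if err != nil || n < 1 {
		return def
	}
	return n
}

// runPool starts n goroutines that apply work to every item read from in.
// The returned channel is closed once all workers have finished.
func runPool(n int, in <-chan string, work func(string) string) <-chan string {
	out := make(chan string)
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			for item := range in {
				out <- work(item)
			}
		}()
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	downloaders := parseWorkers(os.Getenv("IMAGEID_DOWNLOAD_WORKERS"), 10)
	processors := parseWorkers(os.Getenv("IMAGEID_PROCESS_WORKERS"), runtime.NumCPU())

	urls := make(chan string)
	go func() {
		defer close(urls)
		for _, u := range []string{"a.png", "b.png", "c.png"} {
			urls <- u
		}
	}()

	// Stage 1 stands in for downloading, stage 2 for hashing and indexing.
	downloaded := runPool(downloaders, urls, func(u string) string { return "downloaded " + u })
	processed := runPool(processors, downloaded, func(s string) string { return "processed " + s })
	for r := range processed {
		fmt.Println(r)
	}
}
```

Keeping the two stages separate lets the download stage use many goroutines that mostly wait on the network, while the CPU-bound stage stays close to the number of cores.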

Database

The database backend is MySQL, unless the IMAGEID_NULL_STORE environment variable is defined, in which case the image index is stored in main memory.

The database uses 3 tables that mimic a key-value store:

Table   Key (k)      Value (v)
urls    <md5(url)>   <canonical-url>
md5s    <md5(img)>   <canonical-url>
hashes  <hash>       <canonical-url>

  • <url>: URL of an image
  • <canonical-url>: URL of the first similar image that was indexed.
  • <md5>: MD5 of the image or URL
  • <hash>: dhash+phash of the image

An additional table, similar_log, stores the log of similar images found.
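The key-value layout of the three main tables can be illustrated with a small in-memory stand-in. The memStore type and its Get and Set methods are hypothetical and do not reproduce the kvstore.KeyValueStore interface; encoding keys as hexadecimal strings is also an assumption.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// Table names, matching the package constants TableURLs, TableMD5s and
// TableHashes.
const (
	tableURLs   = "urls"
	tableMD5s   = "md5s"
	tableHashes = "hashes"
)

// memStore is a hypothetical in-memory store: table name -> key -> value.
type memStore map[string]map[string]string

func (s memStore) Set(table, k, v string) {
	if s[table] == nil {
		s[table] = map[string]string{}
	}
	s[table][k] = v
}

func (s memStore) Get(table, k string) (string, bool) {
	v, ok := s[table][k]
	return v, ok
}

// md5Hex returns the hexadecimal MD5 digest of data (assumed key encoding).
func md5Hex(data []byte) string {
	sum := md5.Sum(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	store := memStore{}
	canonical := "http://abc.def/umbrella1.png"

	// First image of a group: all three tables point at its own URL.
	store.Set(tableURLs, md5Hex([]byte(canonical)), canonical)
	store.Set(tableMD5s, md5Hex([]byte("placeholder image bytes")), canonical)
	store.Set(tableHashes, "80a486d2cadcd4803c531129c8f4af4f", canonical)

	// A similar image found later only adds urls and md5s entries.
	similar := "http://xyz.tld/u.jpg"
	store.Set(tableURLs, md5Hex([]byte(similar)), canonical)

	v, ok := store.Get(tableURLs, md5Hex([]byte(similar)))
	fmt.Println(v, ok) // prints "http://abc.def/umbrella1.png true"
}
```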

Algorithm

Similar hashes are searched for using a metric tree, with Hamming distance as the metric between hashes. At startup, the complete set of hash keys is read from the database and the metric tree is built in main memory.

Installing / Executing

A Dockerfile is provided, which you can either use to run the server as-is, or to extend, or simply to use as install instructions.

When the Docker container is started, the code is compiled and installed, and then imageid-server is run.

Two scripts are provided to build and run the container: scripts/run-dev-mysql.sh and scripts/run-dev-standalone.sh. The mysql variant runs the mariadb Docker image and uses it as the database backend, for persistence.

imageid-server is the main executable. See the imageid/server package documentation for a description of the available HTTP endpoints.

The log is sent to stdout/stderr.

Usage example

Let's start the server in standalone mode (no database):

$ scripts/run-dev-standalone.sh
++ docker build -t imageid .
...
Successfully built e0e4d021c520
2015/07/01 19:52:02 [INFO] HTTP server listening at port :8080
2015/07/01 19:52:02 [INFO] Initializing DB...

Once the server is running, we can feed some images using a POST request (in another window):

$ curl -X POST 'http://localhost:8080/process?url=https://www.google.com.ar/images/srpr/logo11w.png'
"Added https://www.google.com.ar/images/srpr/logo11w.png"

In the server log window you will see:

2015/07/01 19:59:57 [DEBUG] Processing https://www.google.com.ar/images/srpr/logo11w.png
2015/07/01 19:59:57 [DEBUG] Calculated hash https://www.google.com.ar/images/srpr/logo11w.png: 8216715a5295080c877da82f450307db
2015/07/01 19:59:57 [DEBUG] New hash node: https://www.google.com.ar/images/srpr/logo11w.png

Let's feed two similar images:

$ curl -X POST 'http://localhost:8080/process?url=http://i.imgur.com/JeYm857.png'
"Added http://i.imgur.com/JeYm857.png"
$ curl -X POST 'http://localhost:8080/process?url=http://elplanc.net/wp-content/uploads/2013/11/logocajaazul.png'
"Added http://elplanc.net/wp-content/uploads/2013/11/logocajaazul.png"

Here is the log output:

2015/07/01 20:28:44 [DEBUG] Processing http://i.imgur.com/JeYm857.png
2015/07/01 20:28:44 [DEBUG] Calculated hash http://i.imgur.com/JeYm857.png: 80a486d2cadcd4803c531129c8f4af4f
2015/07/01 20:28:44 [DEBUG] New hash node: http://i.imgur.com/JeYm857.png
2015/07/01 20:29:23 [DEBUG] Processing http://elplanc.net/wp-content/uploads/2013/11/logocajaazul.png
2015/07/01 20:29:23 [DEBUG] Calculated hash http://elplanc.net/wp-content/uploads/2013/11/logocajaazul.png: 808486d2cadcd480bc531129c8f0bf0f
2015/07/01 20:29:23 [DEBUG] hash distance 5: http://elplanc.net/wp-content/uploads/2013/11/logocajaazul.png http://i.imgur.com/JeYm857.png
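The distance reported in the last log line can be reproduced from the two logged hashes. The following sketch assumes the logged value is the hexadecimal form of the 128-bit hash; hexDistance is a hypothetical helper, not part of the package.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"math/bits"
)

// hexDistance returns the Hamming distance (number of differing bits)
// between two hex-encoded hashes of equal length.
func hexDistance(a, b string) (int, error) {
	x, err := hex.DecodeString(a)
	if err != nil {
		return 0, err
	}
	y, err := hex.DecodeString(b)
	if err != nil {
		return 0, err
	}
	if len(x) != len(y) {
		return 0, errors.New("hashes have different lengths")
	}
	d := 0
	for i := range x {
		d += bits.OnesCount8(x[i] ^ y[i])
	}
	return d, nil
}

func main() {
	d, err := hexDistance(
		"80a486d2cadcd4803c531129c8f4af4f", // http://i.imgur.com/JeYm857.png
		"808486d2cadcd480bc531129c8f0bf0f", // logocajaazul.png
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(d, d <= 8) // prints "5 true"
}
```

Since 5 is within the threshold of 8 used by the pipeline, the second image is assigned to the group of the first.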

Let's query the canonical URLs:

$ curl 'http://localhost:8080/canonical?url=https://www.google.com.ar/images/srpr/logo11w.png'
{"Canonical":"https://www.google.com.ar/images/srpr/logo11w.png"}
$ curl 'http://localhost:8080/canonical?url=http://i.imgur.com/JeYm857.png'
{"Canonical":"http://i.imgur.com/JeYm857.png"}
$ curl 'http://localhost:8080/canonical?url=http://elplanc.net/wp-content/uploads/2013/11/logocajaazul.png'
{"Canonical":"http://i.imgur.com/JeYm857.png"}
$ curl 'http://localhost:8080/canonical?url=http://unknown.url'
{"Canonical":""}
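The same query can be issued from Go. The sketch below assumes only what the curl session shows: a /canonical endpoint taking a url parameter and returning a JSON object with a Canonical field. To stay self-contained it talks to a stub server that imitates those responses; queryCanonical and newStubServer are hypothetical names, and against a real instance the base URL would be http://localhost:8080.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// canonicalResponse mirrors the JSON shown in the curl output.
type canonicalResponse struct {
	Canonical string
}

// queryCanonical asks the server at base for the canonical URL of imageURL.
// An empty result means the image URL is not indexed.
func queryCanonical(base, imageURL string) (string, error) {
	resp, err := http.Get(base + "/canonical?url=" + url.QueryEscape(imageURL))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	var out canonicalResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Canonical, nil
}

// newStubServer imitates the responses from the session above. It is a
// stand-in for imageid-server, not its implementation.
func newStubServer() *httptest.Server {
	known := map[string]string{
		"http://i.imgur.com/JeYm857.png":                                 "http://i.imgur.com/JeYm857.png",
		"http://elplanc.net/wp-content/uploads/2013/11/logocajaazul.png": "http://i.imgur.com/JeYm857.png",
	}
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		c := known[r.URL.Query().Get("url")]
		_ = json.NewEncoder(w).Encode(canonicalResponse{Canonical: c})
	}))
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	for _, u := range []string{
		"http://elplanc.net/wp-content/uploads/2013/11/logocajaazul.png",
		"http://unknown.url",
	} {
		c, err := queryCanonical(srv.URL, u)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %q\n", u, c)
	}
}
```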

Documentation

Overview

Package imageid provides tools to detect similar images in a large collection.

Constants

const TableHashes = "hashes"
const TableMD5s = "md5s"
const TableURLs = "urls"

Variables

var Shutdown = make(chan int)

Functions

func Process

func Process(db *DB, workerID string, filename string, url string, similarChan chan<- *URLPair)

func ProcessUrls

func ProcessUrls(
	db *DB,
	urlsQueue <-chan string,
	similarChan chan<- *URLPair,
)

func ReadUrls

func ReadUrls(reader func(chan<- string)) chan string

func ShuttingDown

func ShuttingDown() bool

Types

type DB

type DB struct {
	Store kvstore.KeyValueStore
	// contains filtered or unexported fields
}

func OpenDB

func OpenDB() *DB

func (*DB) Close

func (db *DB) Close()

func (*DB) Init

func (db *DB) Init()

func (*DB) Query

func (db *DB) Query(url string) string

type URLPair

type URLPair struct{ URL, Canonical string }

Directories

Path Synopsis
cmd
imageid-cli
imageid-cli processes and queries images from the command line.
imageid-server
imageid-server provides access to imageid tools using HTTP endpoints.
imhash
imhash computes the 128-bit hash (dhash+phash) of an image.
