idmatch

v1.0.1-0...-c326910
Published: Nov 12, 2019 License: GPL-3.0 Imports: 39 Imported by: 0

README

Identity Matching source{d} Extension

Match different identities of the same person using 🤖. Extension for source{d}.

Overview

People use different e-mails and names (aka identities) when they commit their work to git. E-mails can be corporate, personal, or special ones such as users.noreply.github.com. Names may include a surname or not, contain typos, or be missing entirely. Thus, to get precise information about a developer, it is necessary to gather all of their identities and separate them from those of other people. That's what we call Identity Matching.

(Figure: identity graph)

How To Use

Right now no pre-built binaries are available. Please refer to the How to build section to build an executable from source code.

Run match-identities --help to see all the parameters that you can configure.

There are two use cases supported for match-identities.

  1. With gitbase
  2. Without gitbase

In both cases, the output identity table is saved as a Parquet file. Read more in the Output format section.

Use with gitbase

match-identities is designed to be used with gitbase. First, make sure you have a gitbase instance running with all the repositories you are going to analyze. Please refer to the gitbase documentation for more information.

Usage example:

match-identities --output matched_identities.parquet

The gitbase connection and credentials can be configured with the --host, --port, --user and --password flags.

The following gitbase SQL query returns the identity of each commit author:

SELECT DISTINCT repository_id, commit_author_name, commit_author_email
FROM commits;

If you want to cache the gitbase output, use the --cache flag. After the identities are fetched from gitbase, the matching process runs. Read the Science section to learn more.

Use without gitbase

If you run match-identities with the --cache option enabled, you get a CSV file with the cached gitbase output. If you already have a list of identities, you can also run match-identities without gitbase at all: create a CSV file with the columns repo, email and name, then pass its path to the --cache parameter.
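As an illustration, a CSV file of that shape could be produced with Go's standard encoding/csv package. This is only a sketch: the Identity type and the IdentitiesCSV helper are hypothetical names, and the presence of a header row and the exact column order are assumptions, so compare against a cache file written by match-identities itself before relying on it.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// Identity is a hypothetical holder for one row of the cache file.
type Identity struct {
	Repo, Email, Name string
}

// IdentitiesCSV renders the identities as CSV text.
// Assumption: the file starts with a "repo,email,name" header row.
func IdentitiesCSV(ids []Identity) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	if err := w.Write([]string{"repo", "email", "name"}); err != nil {
		return "", err
	}
	for _, id := range ids {
		if err := w.Write([]string{id.Repo, id.Email, id.Name}); err != nil {
			return "", err
		}
	}
	w.Flush()
	return sb.String(), w.Error()
}

func main() {
	out, err := IdentitiesCSV([]Identity{
		{Repo: "bob/bobs-project", Email: "bob@gmail.com", Name: "bob"},
		{Repo: "alice/site", Email: "alice@gmail.com", Name: "alice"},
	})
	if err != nil {
		panic(err)
	}
	// Redirect the output to a file and pass that file to --cache.
	fmt.Print(out)
}
```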

Usage example:

match-identities \
    --cache path/to/csv/file.csv \
    --output matched_identities.parquet

Output format

Once the algorithm finishes merging identities, you get a table with 4 columns:

  1. id (int64) -- unique identifier of the person with the corresponding identity.
  2. email (utf8) -- e-mail of the identity.
  3. name (utf8) -- name of the identity.
  4. repo (utf8) -- repository of the commit.

The columns email, name and repo may contain empty values, which mean that the row places no constraint on that field. For example, consider this output identity table:

id,email,name,repo
1,alice@gmail.com,"",""
1,"",alice,""
2,bob@gmail.com,"",""
2,"",bob,""
2,bob@inbox.com,"",""
2,"",no-name,bob/bobs-project

There are two developers. Let's call them Alice (id 1) and Bob (id 2). When we analyze a commit with alice@gmail.com as the author email, the author is Alice; the repository and author name are ignored, since the author email is the most reliable way to define an identity. Likewise, when we analyze a commit with alice as the author name, the author is Alice for any combination of email and repository. The same holds for Bob, although he uses two different email addresses, bob@gmail.com and bob@inbox.com. If we come across a commit with the author name no-name in the bob/bobs-project repository, it is Bob's.
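The lookup described above can be sketched in Go as follows. The Row type, the Resolve function and the exampleTable variable are hypothetical names introduced for illustration and are not part of this package; trying email-constrained rows first is an assumption based on the statement that the email is the most reliable signal.

```go
package main

import "fmt"

// Row mirrors one line of the output identity table.
// An empty Email, Name or Repo means "no constraint" on that field.
type Row struct {
	ID    int64
	Email string
	Name  string
	Repo  string
}

// matches reports whether every non-empty field of the row equals the commit's value.
func (r Row) matches(email, name, repo string) bool {
	return (r.Email == "" || r.Email == email) &&
		(r.Name == "" || r.Name == name) &&
		(r.Repo == "" || r.Repo == repo)
}

// Resolve returns the person id for a commit signature, or false if no row applies.
// Rows that constrain the email are tried before rows that do not.
func Resolve(table []Row, email, name, repo string) (int64, bool) {
	for _, wantEmail := range []bool{true, false} {
		for _, r := range table {
			if (r.Email != "") == wantEmail && r.matches(email, name, repo) {
				return r.ID, true
			}
		}
	}
	return 0, false
}

// exampleTable is the identity table from the README example.
var exampleTable = []Row{
	{1, "alice@gmail.com", "", ""},
	{1, "", "alice", ""},
	{2, "bob@gmail.com", "", ""},
	{2, "", "bob", ""},
	{2, "bob@inbox.com", "", ""},
	{2, "", "no-name", "bob/bobs-project"},
}

func main() {
	id, ok := Resolve(exampleTable, "bob@inbox.com", "Robert", "some/repo")
	fmt.Println(id, ok)
	id, ok = Resolve(exampleTable, "", "no-name", "other/repo")
	fmt.Println(id, ok)
}
```

A commit whose name is no-name but whose repository is not bob/bobs-project stays unresolved, because the only row mentioning that name also constrains the repository.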

Convert parquet to CSV

It is possible to convert the output parquet file to CSV using the python script in the research directory:

python3 ./research/parquet2csv.py matched_identities.parquet

The result will be saved as matched_identities.csv. Please note that pyspark must be installed.

External matching option

If the organization uses GitHub, GitLab or Bitbucket, it is possible to use their APIs to match identities by email. In that case, two columns are added and filled for every email in the table: the external id provider and the external id itself.

How to build

git clone https://github.com/src-d/identity-matching
cd identity-matching
make build

You'll see two directories with Linux and macOS binaries inside the build directory.

Science

There are two stages to match identities. The first is the precomputation which is run once on the whole dataset and remains unchanged during the subsequent steps. The second is the matching itself.

  1. Precomputation:
    1. Gather 2 lists of the most popular names and emails (by frequencies) on the whole dataset.
    2. Gather 2 lists of emails and names that will be ignored (aka blacklists) on the whole dataset. They are non-human identities and usually related to CI, bots, etc.
  2. Analysis:
    1. Gather the list of triplets {email, name, repository} from all the commits using gitbase.
    2. Remove any triplet whose name or email belongs to the blacklists.
    3. Merge identities with the same e-mail if it doesn't belong to the list of popular emails created in 1.1.
    4. Merge identities with the same name if it doesn't belong to the list of popular names created in 1.1. When the name does belong to this list, we replace it with the tuple (name, repository), so that popular names are merged only within the same repository.
    5. Save the resulting identity table in the desired output format.
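
The analysis steps 2.2 to 2.4 can be sketched with a union-find structure. This is a simplified illustration under stated assumptions, not the package's actual implementation: Triplet, set and Match are hypothetical names, the blacklists and popularity lists are passed in as plain sets, and empty names or emails are skipped when merging.

```go
package main

import "fmt"

// Triplet is one {email, name, repository} signature taken from a commit.
type Triplet struct{ Email, Name, Repo string }

// set is a minimal string set.
type set map[string]struct{}

func (s set) has(k string) bool { _, ok := s[k]; return ok }

// find returns the representative of element i, halving paths on the way.
func find(parent []int, i int) int {
	for parent[i] != i {
		parent[i] = parent[parent[i]]
		i = parent[i]
	}
	return i
}

// Match drops blacklisted triplets and groups the rest.
// It returns the kept triplets and, for each of them, the index of its group representative.
func Match(in []Triplet, badEmails, badNames, popEmails, popNames set) ([]Triplet, []int) {
	var kept []Triplet
	for _, t := range in { // step 2.2: remove blacklisted identities
		if !badEmails.has(t.Email) && !badNames.has(t.Name) {
			kept = append(kept, t)
		}
	}
	parent := make([]int, len(kept))
	for i := range parent {
		parent[i] = i
	}
	first := map[string]int{} // merge key -> first triplet that carried it
	link := func(key string, i int) {
		if j, ok := first[key]; ok {
			parent[find(parent, i)] = find(parent, j)
		} else {
			first[key] = i
		}
	}
	for i, t := range kept {
		if t.Email != "" && !popEmails.has(t.Email) { // step 2.3
			link("email\x00"+t.Email, i)
		}
		switch { // step 2.4
		case t.Name == "":
		case popNames.has(t.Name):
			link("namerepo\x00"+t.Name+"\x00"+t.Repo, i)
		default:
			link("name\x00"+t.Name, i)
		}
	}
	groups := make([]int, len(kept))
	for i := range kept {
		groups[i] = find(parent, i)
	}
	return kept, groups
}

func main() {
	in := []Triplet{
		{"alice@gmail.com", "alice", "a/x"},
		{"alice@corp.com", "alice", "a/y"},
		{"bob@gmail.com", "admin", "b/x"},
		{"carol@gmail.com", "admin", "c/x"},
		{"ci@bot.io", "ci-bot", "a/x"},
		{"bob@gmail.com", "bob", "b/y"},
	}
	kept, groups := Match(in, set{"ci@bot.io": {}}, set{}, set{}, set{"admin": {}})
	for i, t := range kept {
		fmt.Println(groups[i], t)
	}
}
```

In this example the two alice signatures merge by name, the two bob@gmail.com signatures merge by email, and the two admin signatures stay apart because admin is treated as a popular name and the repositories differ.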

(Figure: identity matching diagram)

There is a Design Document (or a Blueprint, or whatever else you are used to calling project documentation) which goes into more detail.

Contributions

...are welcome! See CONTRIBUTING and code of conduct.

License

GPL 3.0, see LICENSE. Y u no Apache/MIT? Read here.

Documentation

Functions

func Dir

func Dir(useLocal bool, name string) http.FileSystem

Dir returns an http.FileSystem for the embedded assets under a given prefix dir. If useLocal is true, the contents of the local filesystem are used instead.

func FS

func FS(useLocal bool) http.FileSystem

FS returns an http.FileSystem for the embedded assets. If useLocal is true, the contents of the local filesystem are used instead.

func FSByte

func FSByte(useLocal bool, name string) ([]byte, error)

FSByte returns the named file from the embedded assets. If useLocal is true, the contents of the local filesystem are used instead.

func FSMustByte

func FSMustByte(useLocal bool, name string) []byte

FSMustByte is the same as FSByte, but panics if name is not present.

func FSMustString

func FSMustString(useLocal bool, name string) string

FSMustString is the string version of FSMustByte.

func FSString

func FSString(useLocal bool, name string) (string, error)

FSString is the string version of FSByte.

func HashPeopleDiscoverySQL

func HashPeopleDiscoverySQL() string

HashPeopleDiscoverySQL returns the hashsum of the SQL used to fetch the raw Git signatures.

func ReducePeople

func ReducePeople(people People, matcher external.Matcher, blacklist Blacklist,
	maxIdentities int) error

ReducePeople merges the identities together by following the fixed set of rules.

  1. Run the external matching, if available.
  2. Run the series of heuristics on those items which were left untouched in the list (everything if matcher is nil, otherwise only the items the external matcher did not find).

The heuristics are: TODO(vmarkovtsev): describe the current approach

func SetPrimaryValues

func SetPrimaryValues(people People, nameFreqs, emailFreqs map[string]*Frequency,
	minRecentCount int)

SetPrimaryValues sets each person's primary name and email to the most frequent name and email among that person's identities. Stats for the fixed recent period of time are used if the person's identities made at least minRecentCount commits in that period. Otherwise the all-time stats are used.
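
The recent-versus-total choice can be sketched as below. This is an interpretation of the description above rather than the package's code: the primary helper is hypothetical, Frequency is redeclared locally so the example is self-contained, the recent commit count is assumed to be the sum of the Recent counters of the candidates, and ties go to the earliest candidate.

```go
package main

import "fmt"

// Frequency is redeclared here with the same fields as the package's type.
type Frequency struct {
	Recent int
	Total  int
}

// primary picks the most frequent candidate.
// Recent counts are used when they add up to at least minRecentCount,
// otherwise the all-time counts are used.
func primary(candidates []string, freqs map[string]*Frequency, minRecentCount int) string {
	recentSum := 0
	for _, c := range candidates {
		if f := freqs[c]; f != nil {
			recentSum += f.Recent
		}
	}
	useRecent := recentSum >= minRecentCount
	best, bestCount := "", -1
	for _, c := range candidates {
		f := freqs[c]
		if f == nil {
			continue // no statistics for this candidate
		}
		n := f.Total
		if useRecent {
			n = f.Recent
		}
		if n > bestCount {
			best, bestCount = c, n
		}
	}
	return best
}

func main() {
	freqs := map[string]*Frequency{
		"Alice S": {Recent: 1, Total: 50},
		"alice":   {Recent: 5, Total: 10},
	}
	names := []string{"Alice S", "alice"}
	fmt.Println(primary(names, freqs, 3))  // enough recent commits: recent stats decide
	fmt.Println(primary(names, freqs, 10)) // too few recent commits: all-time stats decide
}
```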

Types

type Blacklist

type Blacklist struct {
	Domains         map[string]struct{}
	TopLevelDomains map[string]struct{}
	Names           map[string]struct{}
	Emails          map[string]struct{}
	PopularEmails   map[string]struct{}
	PopularNames    map[string]struct{}
}

Blacklist contains all the data needed to filter out identities or connections between identities.

func NewBlacklist

func NewBlacklist() (Blacklist, error)

NewBlacklist generates a Blacklist from the data files embedded in blacklists.go.

type Commit

type Commit struct {
	Hash string
	Repo string
}

Commit represents a commit with a specific hash in a specific repository.

type Frequency

type Frequency struct {
	Recent int
	Total  int
}

Frequency is a pair of word frequencies: one for a certain recent period of time and one for all time.

type Int64Slice

type Int64Slice []int64

Int64Slice attaches the methods of sort.Interface to []int64, sorting in increasing order.

func (Int64Slice) Len

func (p Int64Slice) Len() int

func (Int64Slice) Less

func (p Int64Slice) Less(i, j int) bool

func (Int64Slice) Sort

func (p Int64Slice) Sort()

Sort is a convenience method.

func (Int64Slice) Swap

func (p Int64Slice) Swap(i, j int)
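
For readers unfamiliar with the pattern, the following sketch shows how a type like this typically satisfies sort.Interface, mirroring sort.IntSlice from the standard library. The method bodies are a reimplementation written for illustration and are assumed, not copied from this package.

```go
package main

import (
	"fmt"
	"sort"
)

// Int64Slice is redeclared locally; the method bodies below are assumptions.
type Int64Slice []int64

func (p Int64Slice) Len() int           { return len(p) }
func (p Int64Slice) Less(i, j int) bool { return p[i] < p[j] }
func (p Int64Slice) Swap(i, j int)      { p[i], p[j] = p[j], p[i] }

// Sort sorts the slice in place in increasing order.
func (p Int64Slice) Sort() { sort.Sort(p) }

func main() {
	ids := Int64Slice{3, 1, 2}
	ids.Sort()
	fmt.Println(ids)
}
```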

type NameWithRepo

type NameWithRepo struct {
	Name string
	Repo string
}

NameWithRepo is a Name that can be linked to a specific repo.

func (NameWithRepo) String

func (rn NameWithRepo) String() string

String describes the person's identity parts.

type People

type People map[int64]*Person

People is a map of persons indexed by their ID.

func FindPeople

func FindPeople(ctx context.Context, connString string, cachePath string, blacklist Blacklist,
	recentMonths int) (People, map[string]*Frequency, map[string]*Frequency, error)

FindPeople returns all the people in the database or from the disk cache.

func (People) ForEach

func (p People) ForEach(f func(int64, *Person) bool)

ForEach executes a function over each person in the collection. The order is fixed and constant.

func (People) Merge

func (p People) Merge(ids ...int64) (int64, error)

Merge merges several persons with the given ids.

func (People) WriteToParquet

func (p People) WriteToParquet(path string, externalIDProvider string) (err error)

WriteToParquet saves the People structure to a Parquet file.

type Person

type Person struct {
	ID             int64
	NamesWithRepos []NameWithRepo
	Emails         []string
	// SampleCommit is an example Git commit which mentions this identity. May be nil.
	SampleCommit *Commit
	ExternalID   string
	PrimaryName  string
	PrimaryEmail string
}

Person is a single individual that can have multiple names and emails.

func (Person) String

func (p Person) String() string

String describes the person's identity parts.

Directories

Path Synopsis
cmd
