git

package
v0.0.0-...-e560ebb Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 13, 2021 License: BSD-3-Clause Imports: 22 Imported by: 0

Documentation

Overview

Package git implements derivation of a file graph from git log and optionally from the file structure.

Change-log-based distance

This distance is derived from the probability that two files appeared in the same commit. The core idea is that relevant files tend to be modified together.

Distance(x, y) = -log(P(x is relevant to y))
P(x is relevant to y) := sum(1/(len(c.files)-1) for c in y.commits if x in c.files) / len(y.commits)

or in English,

  • the distance from x to y is -log of the probability that x is relevant to y
  • x is relevant to y if x is likely to appear in a commit that touches y, where the commit is chosen randomly and independently.

There are more nuances to this formula, e.g. file removals are not counted towards len(commit.files), and commits with len(files) = 1 or len(files) > limit are ignored. File renames are also taken care of.

Note that this formula penalizes large commits. The more files are modified in a commit, the weaker is the strength of its signal.

This graph defines distance only between files, and not directories.

File-structure-based distance

This distance is derived from the file structure. It is the number of edges between two files in the *file tree*, i.e. the number of hops one has to make to navigate from one file to another in the file tree. For example, given the following file stucture:

//
├── foo/
│   ├── bar.cc
│   └── baz.cc
└── qux.cc

The distance between //foo/bar.cc and //foo/baz.cc is 2 because they have a common parent, and the distance between //foo/bar.cc and and //qux.cc is 3 because the path goes through the root.

The file-structured-based distance can compensate for the weakness of the change-log-based distance.

This distance formula is disabled by default, and can be enabled in EdgeReader.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type EdgeReader

type EdgeReader struct {
	// ChangeLogDistanceFactor is the multiplier for distances derived from the
	// git log. If both ChangeLogDistanceFactor and FileStructureDistanceFactor
	// are zero, then ChangeLogDistanceFactor defaults to 1. If zero then
	// change-log-based edges are ignored.
	ChangeLogDistanceFactor float64

	// FileStructureDistanceFactor is the multiplier for distances derived from
	// the file structure. If zero, then such edges are not reported.
	FileStructureDistanceFactor float64
}

EdgeReader implements filegraph.EdgeReader. It works only with nodes returned by Graph.Node().

func (*EdgeReader) ReadEdges

func (r *EdgeReader) ReadEdges(from filegraph.Node, callback func(to filegraph.Node, distance float64) (keepGoing bool))

ReadEdges implements filegraph.EdgeReader.

type Graph

type Graph struct {
	// Commit is the git commit that the graph state corresponds to.
	Commit string
	// contains filtered or unexported fields
}

Graph is a file graph based on the git history.

The graph represents aggregated history of all file changes in the repo, rather than the state of the repo at a single point of time. In particular, the graph may include nodes for files that no longer exist. It is generally not possible to tell if a node is a file or a directory, because it might have been a file at one point of time, and a directory at another.

TODO(nodir): introduce a decay function to remove old nodes/edges.

func Load

func Load(ctx context.Context, repoDir string, opt LoadOptions) (*Graph, error)

Load returns a file graph for a git repository. Caches the graph under the .git directory. May take minutes and log progress if the cache is cold.

If the cache exists, but no longer matches the current ref commit, then applies new changes to the loaded graph and updates the cache.

func (*Graph) Node

func (g *Graph) Node(name string) filegraph.Node

Node returns a node by its name. Returns nil if the node is not found or if the name is empty. See also Node.Name().

Idempotent: calling many times with the same name returns the same Node object.

func (*Graph) Read

func (g *Graph) Read(r *bufio.Reader) error

Read reads the graph from r. It is the opposite of (*Graph).Write().

func (*Graph) Update

func (g *Graph) Update(ctx context.Context, repoDir, rev string, opt UpdateOptions) error

Update updates the graph based on changes in a git repository. This is the only way to mutate the Graph. Applies all changes reachable from rev, but not from g.Commit, and updates g.Commit.

If returns an error which wasn't returned by the callback, then it is possible that the graph is corrupted.

func (*Graph) Write

func (g *Graph) Write(w io.Writer) error

Write writes the graph to w. It is the opposite of (*Graph).Read().

Spec:

graph = header version git-commit-hash root total-number-of-edges root-edges
header = 54
version = 0

root = node
node = prob-sum-denominator number-of-children children-sorted-by-base-name
children-sorted-by-base-name = child*
child = base-name node

root-edges = node-edges
node-edges = number-of-edges edge*
edge =
  index-of-the-adjacent-node-as-found-in-the-file
  prob-sum
  edges-of-children-sorted-by-base-name
edges-of-children-sorted-by-base-name = edge*

where
 all integer types are encoded as varint
 all strings are encoded as length-prefixed utf8
 `*` means "0 or more"

type LoadOptions

type LoadOptions struct {
	UpdateOptions

	// Ref is the git ref to load the graph for.
	// Defaults to refs/heads/main.
	//
	// If it is refs/heads/main, but it does not exist, then falls back to
	// refs/heads/master.
	Ref string
}

LoadOptions are options for Load() function.

type SelectionStrategy

type SelectionStrategy struct {
	Graph      *Graph
	EdgeReader *EdgeReader

	// MaxDistance decides whether a test is to be selected: if it is closer or
	// equal than MaxDistance, then it is selected. Otherwise, skipped.
	//
	// Ignored by SelectEval.
	MaxDistance float64

	// OnTestNotFound is called when a test file is not found in the filegraph and
	// is not among changed files. If nil, then the file name is logged.
	//
	// Ignored by Select.
	OnTestNotFound func(ctx context.Context, tv *evalpb.TestVariant)
}

SelectionStrategy implements a selection strategy based on a git graph.

func (*SelectionStrategy) Select

func (s *SelectionStrategy) Select(changedFiles []string, skipFile func(name string) (keepGoing bool))

Select calls skipTestFile for each test file that should be skipped. Does not skip files that it does not know about.

func (*SelectionStrategy) SelectEval

func (s *SelectionStrategy) SelectEval(ctx context.Context, in eval.Input, out *eval.Output) error

SelectEval implements eval.Strategy. It can be used to evaluate data quality of the graph. It is a version of Select specifically for evaluation.

This function has minimal input validation. It is not designed to be called by the evaluation framework directly. Instead it should be wrapped with another strategy function that does the validation. In particular, this function does not check in.ChangedFiles[i].Repo and does not check for file patterns that must be exempted from RTS.

type UpdateOptions

type UpdateOptions struct {
	// Callback, if not nil, is called each time after each commit is processed
	// and Graph.Commit is updated.
	Callback func() error

	// MaxCommitSize is the maximum number of files touched by a commit.
	// Commits that exceed this limit are ignored.
	// The rationale is that large commits provide a weak signal of file
	// relatedness and are expensive to process, O(N^2).
	MaxCommitSize int
}

UpdateOptions are options for Graph.Update().

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL