zoekt

package module
v0.0.0-...-f8e8ada Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 8, 2021 License: Apache-2.0 Imports: 22 Imported by: 8

README

"Zoekt, en gij zult spinazie eten" - Jan Eertink

("seek, and ye shall eat spinach" - My primary school teacher)

This is a fast text search engine, intended for use with source code. (Pronunciation: roughly as you would pronounce "zooked" in English)

NOTICE: github.com/sourcegraph/zoekt is the active main repository for Zoekt development.

INSTRUCTIONS

Downloading:

go get github.com/google/zoekt/

Indexing:

go install github.com/google/zoekt/cmd/zoekt-index
$GOPATH/bin/zoekt-index .

Searching

go install github.com/google/zoekt/cmd/zoekt
$GOPATH/bin/zoekt 'ngram f:READ'

Indexing git repositories:

go install github.com/google/zoekt/cmd/zoekt-git-index
$GOPATH/bin/zoekt-git-index -branches master,stable-1.4 -prefix origin/ .

Indexing repo repositories:

go install github.com/google/zoekt/cmd/zoekt-{repo-index,mirror-gitiles}
zoekt-mirror-gitiles -dest ~/repos/ https://gfiber.googlesource.com
zoekt-repo-index \
   -name gfiber \
   -base_url https://gfiber.googlesource.com/ \
   -manifest_repo ~/repos/gfiber.googlesource.com/manifests.git \
   -repo_cache ~/repos \
   -manifest_rev_prefix=refs/heads/ --rev_prefix= \
   master:default_unrestricted.xml

Starting the web interface

go install github.com/google/zoekt/cmd/zoekt-webserver
$GOPATH/bin/zoekt-webserver -listen :6070

A more organized installation on a Linux server should use a systemd unit file, eg.

[Unit]
Description=zoekt webserver

[Service]
ExecStart=/zoekt/bin/zoekt-webserver -index /zoekt/index -listen :443  --ssl_cert /zoekt/etc/cert.pem   --ssl_key /zoekt/etc/key.pem
Restart=always

[Install]
WantedBy=default.target

SEARCH SERVICE

Zoekt comes with a small service management program:

go install github.com/google/zoekt/cmd/zoekt-indexserver

cat << EOF > config.json
[{"GithubUser": "username"},
 {"GithubOrg": "org"},
 {"GitilesURL": "https://gerrit.googlesource.com", "Name": "zoekt" }
]
EOF

$GOPATH/bin/zoekt-server -mirror_config config.json

This will mirror all repos under 'github.com/username', 'github.com/org', as well as the 'zoekt' repository. It will index the repositories.

It takes care of fetching and indexing new data and cleaning up logfiles.

The webserver can be started from a standard service management framework, such as systemd.

It is recommended to install Universal ctags to improve ranking. See here for more information.

ACKNOWLEDGEMENTS

Thanks to Alexander Neubeck for coming up with this idea, and helping me flesh it out.

DISCLAIMER

This is not an official Google product

Documentation

Index

Constants

View Source
const FeatureVersion = 10

FeatureVersion is increased if a feature is added that requires reindexing data without changing the format version 2: Rank field for shards. 3: Rank documents within shards 4: Dedup file bugfix 5: Remove max line size limit 6: Include '#' into the LineFragment template 7: Record skip reasons in the index. 8: Record source path in the index. 9: Bump default max file size. 10: Switch to a more flexible TOC format.

View Source
const IndexFormatVersion = 15

FormatVersion is a version number. It is increased every time the on-disk index format is changed. 5: subrepositories. 6: remove size prefix for posting varint list. 7: move subrepos into Repository struct. 8: move repoMetaData out of indexMetadata 9: use bigendian uint64 for trigrams. 10: sections for rune offsets. 11: file ends in rune offsets. 12: 64-bit branchmasks. 13: content checksums 14: languages 15: rune based symbol sections 16 (TBA): TODO: remove fallback parsing in readTOC

View Source
const ReadMinFeatureVersion = 8

ReadMinFeatureVersion constrains backwards compatibility by refusing to load a file with a FeatureVersion below it.

View Source
const WriteMinFeatureVersion = 10

WriteMinFeatureVersion constrains forwards compatibility by emitting files that won't load in zoekt with a FeatureVersion below it.

Variables

View Source
var DebugScore = false

DebugScore controls whether we collect data on match scores are constructed. Intended for use in tests.

View Source
var Version string

Filled by the linker (see build-deploy.sh)

Functions

func CheckText

func CheckText(content []byte, maxTrigramCount int) error

CheckText returns a reason why the given contents are probably not source texts.

func ReadMetadata

func ReadMetadata(inf IndexFile) (*Repository, *IndexMetadata, error)

ReadMetadata returns the metadata of index shard without reading the index data. The IndexFile is not closed.

func SortFilesByScore

func SortFilesByScore(ms []FileMatch)

Sort a slice of results.

Types

type Document

type Document struct {
	Name              string
	Content           []byte
	Branches          []string
	SubRepositoryPath string
	Language          string

	// If set, something is wrong with the file contents, and this
	// is the reason it wasn't indexed.
	SkipReason string

	// Document sections for symbols. Offsets should use bytes.
	Symbols []DocumentSection
}

Document holds a document (file) to index.

type DocumentSection

type DocumentSection struct {
	Start, End uint32
}

type FileMatch

type FileMatch struct {
	// Ranking; the higher, the better.
	Score float64 // TODO - hide this field?

	// For debugging. Needs DebugScore set, but public so tests in
	// other packages can print some diagnostics.
	Debug string

	FileName string

	// Repository is the globally unique name of the repo of the
	// match
	Repository  string
	Branches    []string
	LineMatches []LineMatch

	// Only set if requested
	Content []byte

	// Checksum of the content.
	Checksum []byte

	// Detected language of the result.
	Language string

	// SubRepositoryName is the globally unique name of the repo,
	// if it came from a subrepository
	SubRepositoryName string

	// SubRepositoryPath holds the prefix where the subrepository
	// was mounted.
	SubRepositoryPath string

	// Commit SHA1 (hex) of the (sub)repo holding the file.
	Version string
}

FileMatch contains all the matches within a file.

type IndexBuilder

type IndexBuilder struct {
	// contains filtered or unexported fields
}

IndexBuilder builds a single index shard.

func NewIndexBuilder

func NewIndexBuilder(r *Repository) (*IndexBuilder, error)

NewIndexBuilder creates a fresh IndexBuilder. The passed in Repository contains repo metadata, and may be set to nil.

func (*IndexBuilder) Add

func (b *IndexBuilder) Add(doc Document) error

Add a file which only occurs in certain branches.

func (*IndexBuilder) AddFile

func (b *IndexBuilder) AddFile(name string, content []byte) error

AddFile is a convenience wrapper for Add

func (*IndexBuilder) ContentSize

func (b *IndexBuilder) ContentSize() uint32

ContentSize returns the number of content bytes so far ingested.

func (*IndexBuilder) Write

func (b *IndexBuilder) Write(out io.Writer) error

type IndexFile

type IndexFile interface {
	Read(off uint32, sz uint32) ([]byte, error)
	Size() (uint32, error)
	Close()
	Name() string
}

IndexFile is a file suitable for concurrent read access. For performance reasons, it allows a mmap'd implementation.

func NewIndexFile

func NewIndexFile(f *os.File) (IndexFile, error)

NewIndexFile returns a new index file. The index file takes ownership of the passed in file, and may close it.

type IndexMetadata

type IndexMetadata struct {
	IndexFormatVersion    int
	IndexFeatureVersion   int
	IndexMinReaderVersion int
	IndexTime             time.Time
	PlainASCII            bool
	LanguageMap           map[string]byte
	ZoektVersion          string
}

IndexMetadata holds metadata stored in the index file. It contains data generated by the core indexing library.

type LineFragmentMatch

type LineFragmentMatch struct {
	// Offset within the line, in bytes.
	LineOffset int

	// Offset from file start, in bytes.
	Offset uint32

	// Number bytes that match.
	MatchLength int
}

LineFragmentMatch a segment of matching text within a line.

type LineMatch

type LineMatch struct {
	// The line in which a match was found.
	Line       []byte
	LineStart  int
	LineEnd    int
	LineNumber int

	// If set, this was a match on the filename.
	FileName bool

	// The higher the better. Only ranks the quality of the match
	// within the file, does not take rank of file into account
	Score         float64
	LineFragments []LineFragmentMatch
}

LineMatch holds the matches within a single line in a file.

type RepoList

type RepoList struct {
	Repos   []*RepoListEntry
	Crashes int
}

RepoList holds a set of Repository metadata.

type RepoListEntry

type RepoListEntry struct {
	Repository    Repository
	IndexMetadata IndexMetadata
	Stats         RepoStats
}

type RepoStats

type RepoStats struct {
	// Repos is used for aggregrating the number of repositories.
	Repos int

	// Shards is the total number of search shards.
	Shards int

	// Documents holds the number of documents or files.
	Documents int

	// IndexBytes is the amount of RAM used for index overhead.
	IndexBytes int64

	// ContentBytes is the amount of RAM used for raw content.
	ContentBytes int64
}

Statistics of a (collection of) repositories.

func (*RepoStats) Add

func (s *RepoStats) Add(o *RepoStats)

type Repository

type Repository struct {
	// The repository name
	Name string

	// The repository URL.
	URL string

	// The physical source where this repo came from, eg. full
	// path to the zip filename or git repository directory. This
	// will not be exposed in the UI, but can be used to detect
	// orphaned index shards.
	Source string

	// The branches indexed in this repo.
	Branches []RepositoryBranch

	// Nil if this is not the super project.
	SubRepoMap map[string]*Repository

	// URL template to link to the commit of a branch
	CommitURLTemplate string

	// The repository URL for getting to a file.  Has access to
	// {{Branch}}, {{Path}}
	FileURLTemplate string

	// The URL fragment to add to a file URL for line numbers. has
	// access to {{LineNumber}}. The fragment should include the
	// separator, generally '#' or ';'.
	LineFragmentTemplate string

	// All zoekt.* configuration settings.
	RawConfig map[string]string

	// Importance of the repository, bigger is more important
	Rank uint16

	// IndexOptions is a hash of the options used to create the index for the
	// repo.
	IndexOptions string
}

Repository holds repository metadata.

type RepositoryBranch

type RepositoryBranch struct {
	Name    string
	Version string
}

RepositoryBranch describes an indexed branch, which is a name combined with a version.

type SearchOptions

type SearchOptions struct {
	// Return an upper-bound estimate of eligible documents in
	// stats.ShardFilesConsidered.
	EstimateDocCount bool

	// Return the whole file.
	Whole bool

	// Maximum number of matches: skip all processing an index
	// shard after we found this many non-overlapping matches.
	ShardMaxMatchCount int

	// Maximum number of matches: stop looking for more matches
	// once we have this many matches across shards.
	TotalMaxMatchCount int

	// Maximum number of important matches: skip processing
	// shard after we found this many important matches.
	ShardMaxImportantMatch int

	// Maximum number of important matches across shards.
	TotalMaxImportantMatch int

	// Abort the search after this much time has passed.
	MaxWallTime time.Duration

	// Trim the number of results after collating and sorting the
	// results
	MaxDocDisplayCount int
}

func (*SearchOptions) SetDefaults

func (o *SearchOptions) SetDefaults()

func (*SearchOptions) String

func (s *SearchOptions) String() string

type SearchResult

type SearchResult struct {
	Stats
	Files []FileMatch

	// RepoURLs holds a repo => template string map.
	RepoURLs map[string]string

	// FragmentNames holds a repo => template string map, for
	// the line number fragment.
	LineFragments map[string]string
}

SearchResult contains search matches and extra data

type Searcher

type Searcher interface {
	Search(ctx context.Context, q query.Q, opts *SearchOptions) (*SearchResult, error)

	// List lists repositories. The query `q` can only contain
	// query.Repo atoms.
	List(ctx context.Context, q query.Q) (*RepoList, error)
	Close()

	// Describe the searcher for debug messages.
	String() string
}

func NewSearcher

func NewSearcher(r IndexFile) (Searcher, error)

NewSearcher creates a Searcher for a single index file. Search results coming from this searcher are valid only for the lifetime of the Searcher itself, ie. []byte members should be copied into fresh buffers if the result is to survive closing the shard.

type Stats

type Stats struct {
	// Amount of I/O for reading contents.
	ContentBytesLoaded int64

	// Amount of I/O for reading from index.
	IndexBytesLoaded int64

	// Number of search shards that had a crash.
	Crashes int

	// Wall clock time for this search
	Duration time.Duration

	// Number of files containing a match.
	FileCount int

	// Number of files in shards that we considered.
	ShardFilesConsidered int

	// Files that we evaluated. Equivalent to files for which all
	// atom matches (including negations) evaluated to true.
	FilesConsidered int

	// Files for which we loaded file content to verify substring matches
	FilesLoaded int

	// Candidate files whose contents weren't examined because we
	// gathered enough matches.
	FilesSkipped int

	// Shards that we did not process because a query was canceled.
	ShardsSkipped int

	// Number of non-overlapping matches
	MatchCount int

	// Number of candidate matches as a result of searching ngrams.
	NgramMatches int

	// Wall clock time for queued search.
	Wait time.Duration

	// Number of times regexp was called on files that we evaluated.
	RegexpsConsidered int
}

Stats contains interesting numbers on the search

func (*Stats) Add

func (s *Stats) Add(o Stats)

Directories

Path Synopsis
package build implements a more convenient interface for building zoekt indices.
package build implements a more convenient interface for building zoekt indices.
cmd
zoekt-archive-index
Command zoekt-archive-index indexes an archive.
Command zoekt-archive-index indexes an archive.
zoekt-git-clone
This binary fetches all repos of a user or organization and clones them.
This binary fetches all repos of a user or organization and clones them.
zoekt-hg-index
zoekt-hg-index provides bare-bones Mercurial indexing
zoekt-hg-index provides bare-bones Mercurial indexing
zoekt-mirror-bitbucket-server
This binary fetches all repos of a project, and of a specific type, in case these are specified, and clones them.
This binary fetches all repos of a project, and of a specific type, in case these are specified, and clones them.
zoekt-mirror-github
This binary fetches all repos of a user or organization and clones them.
This binary fetches all repos of a user or organization and clones them.
zoekt-mirror-gitiles
This binary fetches all repos of a Gitiles host.
This binary fetches all repos of a Gitiles host.
zoekt-mirror-gitlab
This binary fetches all repos for a user from gitlab.
This binary fetches all repos for a user from gitlab.
zoekt-repo-index
zoekt-repo-index indexes a repo-based repository.
zoekt-repo-index indexes a repo-based repository.
zoekt-test
zoekt-test compares the search engine results with raw substring search
zoekt-test compares the search engine results with raw substring search
Package gitindex provides functions for indexing Git repositories.
Package gitindex provides functions for indexing Git repositories.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL