crawler

package
v0.0.0-...-36ba309 Latest
Warning: This package is not in the latest version of its module.
Published: Jul 30, 2022 License: AGPL-3.0 Imports: 37 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func CloneRepository

func CloneRepository(domain Domain, hostname, name, gitURL, index string) error

CloneRepository clones the repository into DATADIR/repos/<hostname>/<vendor>/<repo>/gitClone.

func GetAllBlackListedRepos

func GetAllBlackListedRepos() map[string]string

GetAllBlackListedRepos returns all blacklisted repositories.

func GetClients

func GetClients() map[string]ClientAPI

GetClients returns a map of all registered ClientAPI values.

func IsRepoInBlackList

func IsRepoInBlackList(repoURL string) bool

IsRepoInBlackList reports whether the repository at repoURL is in the blacklist.
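The two blacklist functions above can be pictured as a map lookup. The following is a minimal sketch, not the package's implementation: the lowercase `repo` type, `buildBlacklist`, `normalize` and `isBlacklisted` are hypothetical names, and both the URL normalization and the choice of the reason as the map value are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// repo mirrors two fields of the package's Repo type (hypothetical local copy).
type repo struct {
	URL    string
	Reason string
}

// normalize lowercases a URL and strips a trailing slash and ".git" suffix.
// This normalization is an assumption, not confirmed package behavior.
func normalize(u string) string {
	u = strings.ToLower(strings.TrimSpace(u))
	u = strings.TrimSuffix(u, "/")
	return strings.TrimSuffix(u, ".git")
}

// buildBlacklist indexes the entries by normalized URL; the value is the reason.
func buildBlacklist(repos []repo) map[string]string {
	out := make(map[string]string, len(repos))
	for _, r := range repos {
		out[normalize(r.URL)] = r.Reason
	}
	return out
}

// isBlacklisted reports whether repoURL has an entry in the blacklist map.
func isBlacklisted(bl map[string]string, repoURL string) bool {
	_, ok := bl[normalize(repoURL)]
	return ok
}

func main() {
	bl := buildBlacklist([]repo{{URL: "https://example.com/org/repo", Reason: "duplicate"}})
	fmt.Println(isBlacklisted(bl, "https://example.com/org/repo.git"))
	fmt.Println(isBlacklisted(bl, "https://example.com/org/other"))
}
```

Indexing by a normalized key keeps each lookup constant-time and tolerant of trivial differences in how a repository URL is written.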

func RegisterClientAPIs

func RegisterClientAPIs()

RegisterClientAPIs registers the client APIs for all the clients.

func SaveToFile

func SaveToFile(domain Domain, hostname string, name string, data []byte, index string) error

SaveToFile saves the chosen <file_name> in DATADIR/repos/<source>/<vendor>/<repo>/<crawler_timestamp>_<file_name>.

func WalkMatch

func WalkMatch(root, pattern string) ([]string, error)

WalkMatch is a utility function.

Types

type Bitbucket

type Bitbucket struct {
	Pagelen int `json:"pagelen"`
	Values  []struct {
		Scm        string `json:"scm"`
		Website    string `json:"website"`
		HasWiki    bool   `json:"has_wiki"`
		Name       string `json:"name"`
		Links      Links  `json:"links"`
		ForkPolicy string `json:"fork_policy"`
		UUID       string `json:"uuid"`
		Language   string `json:"language"`
		CreatedOn  string `json:"created_on"`
		Mainbranch struct {
			Type string `json:"type"`
			Name string `json:"name"`
		} `json:"mainbranch"`
		FullName  string `json:"full_name"`
		HasIssues bool   `json:"has_issues"`
		Owner     struct {
			Username    string `json:"username"`
			DisplayName string `json:"display_name"`
			Type        string `json:"type"`
			UUID        string `json:"uuid"`
			Links       struct {
				Self struct {
					Href string `json:"href"`
				} `json:"self"`
				HTML struct {
					Href string `json:"href"`
				} `json:"html"`
				Avatar struct {
					Href string `json:"href"`
				} `json:"avatar"`
			} `json:"links"`
		} `json:"owner"`
		UpdatedOn   string `json:"updated_on"`
		Size        int    `json:"size"`
		Type        string `json:"type"`
		Slug        string `json:"slug"`
		IsPrivate   bool   `json:"is_private"`
		Description string `json:"description"`
		Project     struct {
			Key   string `json:"key"`
			Type  string `json:"type"`
			UUID  string `json:"uuid"`
			Links struct {
				Self struct {
					Href string `json:"href"`
				} `json:"self"`
				HTML struct {
					Href string `json:"href"`
				} `json:"html"`
				Avatar struct {
					Href string `json:"href"`
				} `json:"avatar"`
			} `json:"links"`
			Name string `json:"name"`
		} `json:"project,omitempty"`
		Parent struct {
			Links struct {
				Self struct {
					Href string `json:"href"`
				} `json:"self"`
				HTML struct {
					Href string `json:"href"`
				} `json:"html"`
				Avatar struct {
					Href string `json:"href"`
				} `json:"avatar"`
			} `json:"links"`
			Type     string `json:"type"`
			Name     string `json:"name"`
			FullName string `json:"full_name"`
			UUID     string `json:"uuid"`
		} `json:"parent,omitempty"`
	} `json:"values"`
	Next string `json:"next"`
}

Bitbucket is the complete response for the Bitbucket all repositories list.
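The `Next` field is what makes this response pageable: each page carries the URL of the following one. As a rough sketch, assuming the last page leaves `next` empty or absent, a caller could collect repository names as follows. The trimmed `page` type, `collectRepos` and the `fetch` callback are hypothetical and stand in for the full struct and a real HTTP call.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// page keeps only the fields of the Bitbucket response needed for paging.
type page struct {
	Values []struct {
		FullName string `json:"full_name"`
	} `json:"values"`
	Next string `json:"next"`
}

// collectRepos follows the "next" links starting from first and returns the
// full names found on every page. fetch is a hypothetical callback that
// returns the raw JSON body for a URL.
func collectRepos(first string, fetch func(url string) ([]byte, error)) ([]string, error) {
	var names []string
	for u := first; u != ""; {
		body, err := fetch(u)
		if err != nil {
			return nil, err
		}
		var p page
		if err := json.Unmarshal(body, &p); err != nil {
			return nil, err
		}
		for _, v := range p.Values {
			names = append(names, v.FullName)
		}
		u = p.Next // an empty string ends the loop
	}
	return names, nil
}

func main() {
	// Canned pages replace real HTTP responses in this sketch.
	pages := map[string]string{
		"page1": `{"values":[{"full_name":"Soft/a"}],"next":"page2"}`,
		"page2": `{"values":[{"full_name":"Soft/b"}]}`,
	}
	fetch := func(u string) ([]byte, error) {
		body, ok := pages[u]
		if !ok {
			return nil, fmt.Errorf("unknown page %q", u)
		}
		return []byte(body), nil
	}
	names, err := collectRepos("page1", fetch)
	fmt.Println(names, err)
}
```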

type BitbucketRepo

type BitbucketRepo struct {
	Scm        string    `json:"scm"`
	Website    string    `json:"website"`
	HasWiki    bool      `json:"has_wiki"`
	Name       string    `json:"name"`
	Links      Links     `json:"links"`
	ForkPolicy string    `json:"fork_policy"`
	UUID       string    `json:"uuid"`
	Language   string    `json:"language"`
	CreatedOn  time.Time `json:"created_on"`
	Mainbranch struct {
		Type string `json:"type"`
		Name string `json:"name"`
	} `json:"mainbranch"`
	FullName  string `json:"full_name"`
	HasIssues bool   `json:"has_issues"`
	Owner     struct {
		Username    string `json:"username"`
		DisplayName string `json:"display_name"`
		Type        string `json:"type"`
		UUID        string `json:"uuid"`
		Links       struct {
			Self struct {
				Href string `json:"href"`
			} `json:"self"`
			HTML struct {
				Href string `json:"href"`
			} `json:"html"`
			Avatar struct {
				Href string `json:"href"`
			} `json:"avatar"`
		} `json:"links"`
	} `json:"owner"`
	UpdatedOn   time.Time `json:"updated_on"`
	Size        int       `json:"size"`
	Type        string    `json:"type"`
	Slug        string    `json:"slug"`
	IsPrivate   bool      `json:"is_private"`
	Description string    `json:"description"`
}

BitbucketRepo is the complete response for the Bitbucket single repository.

type Blacklist

type Blacklist struct {
	Repos []Repo `yaml:"repos"`
}

Blacklist contains a list of blocked repositories.

type ClientAPI

type ClientAPI struct {
	Organization OrganizationHandler
	Single       SingleRepoHandler

	APIURL GeneratorAPIURL
}

ClientAPI contains all the API functions for a single client.
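Grouping the handlers in one struct allows a registry keyed by client name, which is roughly what GetClients, GetAPIURL and the other lookup functions suggest. The sketch below illustrates that pattern under assumptions: the lowercase types, `registerClient`, the `clients` variable and the `"example"` key are hypothetical, and only the APIURL field is modeled.

```go
package main

import (
	"fmt"
)

// generatorAPIURL has the same shape as the package's GeneratorAPIURL type.
type generatorAPIURL func(link string) ([]string, error)

// clientAPI is a reduced, hypothetical copy of ClientAPI.
type clientAPI struct {
	APIURL generatorAPIURL
}

// clients is the registry; the key format is an assumption.
var clients = map[string]clientAPI{}

// registerClient adds or replaces a client in the registry.
func registerClient(name string, c clientAPI) {
	clients[name] = c
}

// getAPIURL returns the URL generator registered for name, or an error when
// no such client (or no generator) exists.
func getAPIURL(name string) (generatorAPIURL, error) {
	c, ok := clients[name]
	if !ok || c.APIURL == nil {
		return nil, fmt.Errorf("no API URL generator registered for %q", name)
	}
	return c.APIURL, nil
}

func main() {
	registerClient("example", clientAPI{APIURL: func(link string) ([]string, error) {
		return []string{link + "/api"}, nil
	}})
	gen, err := getAPIURL("example")
	if err != nil {
		fmt.Println(err)
		return
	}
	urls, _ := gen("https://example.com/org")
	fmt.Println(urls)
}
```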

type Crawler

type Crawler struct {
	DryRun bool
	// contains filtered or unexported fields
}

Crawler is a helper type representing a crawler.

func NewCrawler

func NewCrawler(dryRun bool) *Crawler

NewCrawler initializes a new Crawler object, updates the IPA list and connects to Elasticsearch (if dryRun == false).

func (*Crawler) CrawlOrg

func (c *Crawler) CrawlOrg(orgURL string, domain *Domain, pa PA)

CrawlOrg fetches all the repositories belonging to an org and crawls them.

func (*Crawler) CrawlPublisher

func (c *Crawler) CrawlPublisher(pa PA)

CrawlPublisher delegates the work to single PA crawlers.

func (*Crawler) CrawlPublishers

func (c *Crawler) CrawlPublishers(publishers []PA) ([]string, error)

CrawlPublishers processes a list of publishers.

func (*Crawler) CrawlRepo

func (c *Crawler) CrawlRepo(repoURL string, pa PA) error

CrawlRepo crawls a single repository.

func (*Crawler) DeleteByQueryFromES

func (c *Crawler) DeleteByQueryFromES(search string) error

DeleteByQueryFromES deletes the records from Elasticsearch whose publiccode.url field matches the search string.

func (*Crawler) ExportForJekyll

func (c *Crawler) ExportForJekyll() error

ExportForJekyll exports YAML data files for the Jekyll website.

func (*Crawler) KnownHost

func (c *Crawler) KnownHost(link string) (*Domain, error)

KnownHost detects the right Domain API from the given URL and returns it. If no API is recognized, it returns an empty domain and an error.
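Host detection of this kind can be sketched as a comparison between the URL's hostname and the `Host` field of each configured domain. This is an illustration under assumptions rather than the method's actual logic: `knownHost` is a hypothetical free function that takes the domain list as a parameter, and case-insensitive exact matching is a guess.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// domain is a reduced, hypothetical copy of the package's Domain type.
type domain struct {
	Host string
}

// knownHost returns the configured domain whose Host equals the hostname of
// link. When nothing matches it returns an empty domain and an error,
// mirroring the documented contract.
func knownHost(link string, domains []domain) (*domain, error) {
	u, err := url.Parse(link)
	if err != nil {
		return &domain{}, err
	}
	host := strings.ToLower(u.Hostname())
	for i := range domains {
		if strings.ToLower(domains[i].Host) == host {
			return &domains[i], nil
		}
	}
	return &domain{}, fmt.Errorf("no known domain for %q", link)
}

func main() {
	domains := []domain{{Host: "github.com"}, {Host: "gitlab.com"}}
	d, err := knownHost("https://github.com/italia", domains)
	fmt.Println(d.Host, err)
}
```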

func (*Crawler) ProcessRepo

func (c *Crawler) ProcessRepo(repository Repository)

ProcessRepo looks for a publiccode.yml file in a repository, and if found it processes it.

func (*Crawler) ProcessRepositories

func (c *Crawler) ProcessRepositories(repos chan Repository)

ProcessRepositories processes the repositories channel and checks the availability of the file.

type Domain

type Domain struct {
	// Domains.yml data
	Host        string   `yaml:"host"`
	UseTokenFor []string `yaml:"use-token-for"`
	BasicAuth   []string `yaml:"basic-auth"`
}

Domain is a single code hosting service.

func ReadAndParseDomains

func ReadAndParseDomains(domainsFile string) ([]Domain, error)

ReadAndParseDomains reads domainsFile and returns the parsed content in a Domain slice.

func (Domain) API

func (domain Domain) API() string

API returns the Domain without its TLD.

type GeneratorAPIURL

type GeneratorAPIURL func(url string) ([]string, error)

GeneratorAPIURL returns the API URLs that correspond to the given URL in the matching API ecosystem.

func GenerateBitbucketAPIURL

func GenerateBitbucketAPIURL() GeneratorAPIURL

GenerateBitbucketAPIURL returns the API URL of the given Bitbucket organization link. IN: https://bitbucket.org/Soft OUT: https://api.bitbucket.org/2.0/repositories/Soft?pagelen=100

func GenerateGithubAPIURL

func GenerateGithubAPIURL() GeneratorAPIURL

GenerateGithubAPIURL returns the API URLs of the given GitHub organization link. IN: https://github.com/italia OUT: https://api.github.com/orgs/italia/repos, https://api.github.com/users/italia/repos
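The IN/OUT pair above shows why GeneratorAPIURL returns a slice: one link can map to several candidate endpoints. Here is a minimal sketch of such a generator, written as a factory that returns a closure. The lowercase names are hypothetical, and the validation of the path is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// generatorAPIURL has the same shape as the package's GeneratorAPIURL type.
type generatorAPIURL func(link string) ([]string, error)

// generateGithubAPIURL returns a generator that maps an organization or user
// link to the two candidate API URLs listed in the documentation.
func generateGithubAPIURL() generatorAPIURL {
	return func(link string) ([]string, error) {
		u, err := url.Parse(link)
		if err != nil {
			return nil, err
		}
		name := strings.Trim(u.Path, "/")
		// Assumption: the link must contain exactly one path segment.
		if name == "" || strings.Contains(name, "/") {
			return nil, fmt.Errorf("expected an organization link, got %q", link)
		}
		return []string{
			"https://api.github.com/orgs/" + name + "/repos",
			"https://api.github.com/users/" + name + "/repos",
		}, nil
	}
}

func main() {
	urls, err := generateGithubAPIURL()("https://github.com/italia")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, u := range urls {
		fmt.Println(u)
	}
}
```

Returning both the `/orgs/` and `/users/` forms lets the caller try each in turn, since the link alone does not say whether the account is an organization or a user.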

func GenerateGitlabAPIURL

func GenerateGitlabAPIURL() GeneratorAPIURL

GenerateGitlabAPIURL returns the API URL of the given Gitlab organization link. IN: https://gitlab.org/blockninja OUT: https://gitlab.com/api/v4/groups/blockninja

func GetAPIURL

func GetAPIURL(clientAPI string) (GeneratorAPIURL, error)

GetAPIURL checks if the API client for the requested API URL exists and returns its handler.

type GithubFiles

type GithubFiles []struct {
	Name        string `json:"name"`
	Path        string `json:"path"`
	Sha         string `json:"sha"`
	Size        int    `json:"size"`
	URL         string `json:"url"`
	HTMLURL     string `json:"html_url"`
	GitURL      string `json:"git_url"`
	DownloadURL string `json:"download_url"`
	Type        string `json:"type"`
	Links       struct {
		Self string `json:"self"`
		Git  string `json:"git"`
		HTML string `json:"html"`
	} `json:"_links"`
}

GithubFiles is a list of files in a repository.

type GithubOrgs

type GithubOrgs []struct {
	ID               int       `json:"id"`
	Name             string    `json:"name"`
	FullName         string    `json:"full_name"`
	Owner            Owner     `json:"owner"`
	Private          bool      `json:"private"`
	HTMLURL          string    `json:"html_url"`
	Description      string    `json:"description"`
	Fork             bool      `json:"fork"`
	URL              string    `json:"url"`
	ForksURL         string    `json:"forks_url"`
	KeysURL          string    `json:"keys_url"`
	CollaboratorsURL string    `json:"collaborators_url"`
	TeamsURL         string    `json:"teams_url"`
	HooksURL         string    `json:"hooks_url"`
	IssueEventsURL   string    `json:"issue_events_url"`
	EventsURL        string    `json:"events_url"`
	AssigneesURL     string    `json:"assignees_url"`
	BranchesURL      string    `json:"branches_url"`
	TagsURL          string    `json:"tags_url"`
	BlobsURL         string    `json:"blobs_url"`
	GitTagsURL       string    `json:"git_tags_url"`
	GitRefsURL       string    `json:"git_refs_url"`
	TreesURL         string    `json:"trees_url"`
	StatusesURL      string    `json:"statuses_url"`
	LanguagesURL     string    `json:"languages_url"`
	StargazersURL    string    `json:"stargazers_url"`
	ContributorsURL  string    `json:"contributors_url"`
	SubscribersURL   string    `json:"subscribers_url"`
	SubscriptionURL  string    `json:"subscription_url"`
	CommitsURL       string    `json:"commits_url"`
	GitCommitsURL    string    `json:"git_commits_url"`
	CommentsURL      string    `json:"comments_url"`
	IssueCommentURL  string    `json:"issue_comment_url"`
	ContentsURL      string    `json:"contents_url"`
	CompareURL       string    `json:"compare_url"`
	MergesURL        string    `json:"merges_url"`
	ArchiveURL       string    `json:"archive_url"`
	DownloadsURL     string    `json:"downloads_url"`
	IssuesURL        string    `json:"issues_url"`
	PullsURL         string    `json:"pulls_url"`
	MilestonesURL    string    `json:"milestones_url"`
	NotificationsURL string    `json:"notifications_url"`
	LabelsURL        string    `json:"labels_url"`
	ReleasesURL      string    `json:"releases_url"`
	DeploymentsURL   string    `json:"deployments_url"`
	CreatedAt        time.Time `json:"created_at"`
	UpdatedAt        time.Time `json:"updated_at"`
	PushedAt         time.Time `json:"pushed_at"`
	GitURL           string    `json:"git_url"`
	SSHURL           string    `json:"ssh_url"`
	CloneURL         string    `json:"clone_url"`
	SvnURL           string    `json:"svn_url"`
	Homepage         string    `json:"homepage"`
	Size             int       `json:"size"`
	StargazersCount  int       `json:"stargazers_count"`
	WatchersCount    int       `json:"watchers_count"`
	Language         string    `json:"language"`
	HasIssues        bool      `json:"has_issues"`
	HasProjects      bool      `json:"has_projects"`
	HasDownloads     bool      `json:"has_downloads"`
	HasWiki          bool      `json:"has_wiki"`
	HasPages         bool      `json:"has_pages"`
	ForksCount       int       `json:"forks_count"`
	MirrorURL        string    `json:"mirror_url"`
	Archived         bool      `json:"archived"`
	OpenIssuesCount  int       `json:"open_issues_count"`
	License          struct {
		Key    string `json:"key"`
		Name   string `json:"name"`
		SpdxID string `json:"spdx_id"`
		URL    string `json:"url"`
	} `json:"license"`
	Forks         int    `json:"forks"`
	OpenIssues    int    `json:"open_issues"`
	Watchers      int    `json:"watchers"`
	DefaultBranch string `json:"default_branch"`
	Permissions   struct {
		Admin bool `json:"admin"`
		Push  bool `json:"push"`
		Pull  bool `json:"pull"`
	} `json:"permissions"`
}

GithubOrgs is the complete result from the Github API response for /orgs/<Name>/repos.

type GithubRepo

type GithubRepo struct {
	ID               int         `json:"id"`
	Name             string      `json:"name"`
	FullName         string      `json:"full_name"`
	Owner            Owner       `json:"owner"`
	Private          bool        `json:"private"`
	HTMLURL          string      `json:"html_url"`
	Description      string      `json:"description"`
	Fork             bool        `json:"fork"`
	URL              string      `json:"url"`
	ForksURL         string      `json:"forks_url"`
	KeysURL          string      `json:"keys_url"`
	CollaboratorsURL string      `json:"collaborators_url"`
	TeamsURL         string      `json:"teams_url"`
	HooksURL         string      `json:"hooks_url"`
	IssueEventsURL   string      `json:"issue_events_url"`
	EventsURL        string      `json:"events_url"`
	AssigneesURL     string      `json:"assignees_url"`
	BranchesURL      string      `json:"branches_url"`
	TagsURL          string      `json:"tags_url"`
	BlobsURL         string      `json:"blobs_url"`
	GitTagsURL       string      `json:"git_tags_url"`
	GitRefsURL       string      `json:"git_refs_url"`
	TreesURL         string      `json:"trees_url"`
	StatusesURL      string      `json:"statuses_url"`
	LanguagesURL     string      `json:"languages_url"`
	StargazersURL    string      `json:"stargazers_url"`
	ContributorsURL  string      `json:"contributors_url"`
	SubscribersURL   string      `json:"subscribers_url"`
	SubscriptionURL  string      `json:"subscription_url"`
	CommitsURL       string      `json:"commits_url"`
	GitCommitsURL    string      `json:"git_commits_url"`
	CommentsURL      string      `json:"comments_url"`
	IssueCommentURL  string      `json:"issue_comment_url"`
	ContentsURL      string      `json:"contents_url"`
	CompareURL       string      `json:"compare_url"`
	MergesURL        string      `json:"merges_url"`
	ArchiveURL       string      `json:"archive_url"`
	DownloadsURL     string      `json:"downloads_url"`
	IssuesURL        string      `json:"issues_url"`
	PullsURL         string      `json:"pulls_url"`
	MilestonesURL    string      `json:"milestones_url"`
	NotificationsURL string      `json:"notifications_url"`
	LabelsURL        string      `json:"labels_url"`
	ReleasesURL      string      `json:"releases_url"`
	DeploymentsURL   string      `json:"deployments_url"`
	CreatedAt        time.Time   `json:"created_at"`
	UpdatedAt        time.Time   `json:"updated_at"`
	PushedAt         time.Time   `json:"pushed_at"`
	GitURL           string      `json:"git_url"`
	SSHURL           string      `json:"ssh_url"`
	CloneURL         string      `json:"clone_url"`
	SvnURL           string      `json:"svn_url"`
	Homepage         string      `json:"homepage"`
	Size             int         `json:"size"`
	StargazersCount  int         `json:"stargazers_count"`
	WatchersCount    int         `json:"watchers_count"`
	Language         string      `json:"language"`
	HasIssues        bool        `json:"has_issues"`
	HasProjects      bool        `json:"has_projects"`
	HasDownloads     bool        `json:"has_downloads"`
	HasWiki          bool        `json:"has_wiki"`
	HasPages         bool        `json:"has_pages"`
	ForksCount       int         `json:"forks_count"`
	MirrorURL        interface{} `json:"mirror_url"`
	Archived         bool        `json:"archived"`
	OpenIssuesCount  int         `json:"open_issues_count"`
	License          interface{} `json:"license"`
	Forks            int         `json:"forks"`
	OpenIssues       int         `json:"open_issues"`
	Watchers         int         `json:"watchers"`
	DefaultBranch    string      `json:"default_branch"`
	NetworkCount     int         `json:"network_count"`
	SubscribersCount int         `json:"subscribers_count"`
}

GithubRepo is the complete result from the Github API response for a single repository.

type Links

type Links struct {
	Watchers struct {
		Href string `json:"href"`
	} `json:"watchers"`
	Branches struct {
		Href string `json:"href"`
	} `json:"branches"`
	Tags struct {
		Href string `json:"href"`
	} `json:"tags"`
	Commits struct {
		Href string `json:"href"`
	} `json:"commits"`
	Clone []struct {
		Href string `json:"href"`
		Name string `json:"name"`
	} `json:"clone"`
	Self struct {
		Href string `json:"href"`
	} `json:"self"`
	Source struct {
		Href string `json:"href"`
	} `json:"source"`
	HTML struct {
		Href string `json:"href"`
	} `json:"html"`
	Avatar struct {
		Href string `json:"href"`
	} `json:"avatar"`
	Hooks struct {
		Href string `json:"href"`
	} `json:"hooks"`
	Forks struct {
		Href string `json:"href"`
	} `json:"forks"`
	Downloads struct {
		Href string `json:"href"`
	} `json:"downloads"`
	Pullrequests struct {
		Href string `json:"href"`
	} `json:"pullrequests"`
}

Links is the list of links associated with the repository.

type OrganizationHandler

type OrganizationHandler func(domain Domain, url string, repositories chan Repository, pa PA) (string, error)

OrganizationHandler returns the client handler for an organization/team/group page (every domain has a different handler implementation).

func GetClientAPICrawler

func GetClientAPICrawler(clientAPI string) (OrganizationHandler, error)

GetClientAPICrawler checks if the API client for the requested organization clientAPI exists and returns its handler.

func RegisterBitbucketAPI

func RegisterBitbucketAPI() OrganizationHandler

RegisterBitbucketAPI registers the crawler function for the Bitbucket API.

func RegisterGithubAPI

func RegisterGithubAPI() OrganizationHandler

RegisterGithubAPI registers the crawler function for the Github API. It gets the list of repositories at the "link" URL. If a next page is available, it returns its URL; otherwise it returns an empty string ("").
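GitHub's REST API advertises the next page in the `Link` response header. How this package extracts it is not documented here, so the following is only a sketch of one way to obtain the "next page URL or empty string" result described above. `nextPage` is a hypothetical helper, and the simple comma split assumes the URLs themselves contain no commas.

```go
package main

import (
	"fmt"
	"strings"
)

// nextPage extracts the URL marked rel="next" from a Link header value such as
//   <https://api.github.com/...?page=2>; rel="next", <...>; rel="last"
// and returns "" when there is no next page.
func nextPage(linkHeader string) string {
	for _, part := range strings.Split(linkHeader, ",") {
		fields := strings.Split(part, ";")
		if len(fields) < 2 {
			continue
		}
		target := strings.TrimSpace(fields[0])
		if !strings.HasPrefix(target, "<") || !strings.HasSuffix(target, ">") {
			continue
		}
		for _, attr := range fields[1:] {
			if strings.TrimSpace(attr) == `rel="next"` {
				return target[1 : len(target)-1] // strip the angle brackets
			}
		}
	}
	return ""
}

func main() {
	header := `<https://api.github.com/organizations/1/repos?page=2>; rel="next", ` +
		`<https://api.github.com/organizations/1/repos?page=5>; rel="last"`
	fmt.Println(nextPage(header))
}
```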

func RegisterGitlabAPI

func RegisterGitlabAPI() OrganizationHandler

RegisterGitlabAPI registers the crawler function for the Gitlab API.

type Owner

type Owner struct {
	Login             string `json:"login"`
	ID                int    `json:"id"`
	AvatarURL         string `json:"avatar_url"`
	GravatarID        string `json:"gravatar_id"`
	URL               string `json:"url"`
	HTMLURL           string `json:"html_url"`
	FollowersURL      string `json:"followers_url"`
	FollowingURL      string `json:"following_url"`
	GistsURL          string `json:"gists_url"`
	StarredURL        string `json:"starred_url"`
	SubscriptionsURL  string `json:"subscriptions_url"`
	OrganizationsURL  string `json:"organizations_url"`
	ReposURL          string `json:"repos_url"`
	EventsURL         string `json:"events_url"`
	ReceivedEventsURL string `json:"received_events_url"`
	Type              string `json:"type"`
	SiteAdmin         bool   `json:"site_admin"`
}

Owner of the repository.

type PA

type PA struct {
	Name          string   `yaml:"name"`
	CodiceIPA     string   `yaml:"codice-iPA"`
	Organizations []string `yaml:"orgs"`
	Repositories  []string `yaml:"repos"`
	UnknownIPA    bool     `yaml:"unknown-iPA"`
}

PA is a Public Administration.

func ReadAndParseWhitelist

func ReadAndParseWhitelist(whitelistFile string) ([]PA, error)

ReadAndParseWhitelist reads the whitelist and returns the parsed content in a slice of PA.

type Range

type Range struct {
	Min    float64
	Max    float64
	Points float64
}

Range is a range of values, from Min to Max, to which the Points value is assigned.

type Ranges

type Ranges struct {
	Name   string
	Ranges []Range
}

Ranges are the ranges for a specific parameter (userCommunity, codeActivity, releaseHistory, longevity).
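To make the relationship between Range and Ranges concrete, the sketch below looks up the points for a value of a named parameter. The `pointsFor` helper is hypothetical. Treating each range as the half-open interval [Min, Max) and returning 0 when nothing matches are assumptions, since the documentation does not state the boundary rules.

```go
package main

import "fmt"

// Range and Ranges repeat the definitions documented above.
type Range struct {
	Min    float64
	Max    float64
	Points float64
}

type Ranges struct {
	Name   string
	Ranges []Range
}

// pointsFor returns the Points of the first range of the named parameter that
// contains value, using Min <= value < Max (an assumed convention).
// It returns 0 if the parameter or a matching range is not found.
func pointsFor(data []Ranges, name string, value float64) float64 {
	for _, rs := range data {
		if rs.Name != name {
			continue
		}
		for _, r := range rs.Ranges {
			if value >= r.Min && value < r.Max {
				return r.Points
			}
		}
	}
	return 0
}

func main() {
	// Illustrative numbers only; they are not taken from vitality-ranges.yml.
	data := []Ranges{{
		Name: "userCommunity",
		Ranges: []Range{
			{Min: 0, Max: 10, Points: 1},
			{Min: 10, Max: 50, Points: 3},
		},
	}}
	fmt.Println(pointsFor(data, "userCommunity", 12)) // prints 3
}
```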

type RangesData

type RangesData []Ranges

RangesData contains the data loaded from vitality-ranges.yml

type Repo

type Repo struct {
	URL         string `yaml:"url"`
	Reason      string `yaml:"reason"`
	Description string `yaml:"description"`
}

Repo matches a single repository.

func ReadAndParseBlacklist

func ReadAndParseBlacklist(blacklistFile string) ([]Repo, error)

ReadAndParseBlacklist reads the blacklist and returns the parsed content in a slice of Repo.

type Repository

type Repository struct {
	Name        string
	Hostname    string
	FileRawURL  string
	GitCloneURL string
	GitBranch   string
	Domain      Domain
	Pa          PA
	Headers     map[string]string
	Metadata    []byte
}

Repository is a single code repository. FileRawURL contains the direct url to the raw file.

func (*Repository) CalculateRepoActivity

func (repository *Repository) CalculateRepoActivity(days int) (float64, map[int]float64, error)

CalculateRepoActivity returns the repository activity index and the vitality map calculated on the git clone. It follows the document https://lg-acquisizione-e-riuso-software-per-la-pa.readthedocs.io/, section 2.5.2. Fase 2.2: Valutazione soluzioni riusabili per la PA.

type SingleRepoHandler

type SingleRepoHandler func(domain Domain, url string, repositories chan Repository, pa PA) error

SingleRepoHandler returns the client handler for a single repository (every domain has a different handler implementation).

func GetSingleClientAPICrawler

func GetSingleClientAPICrawler(clientAPI string) (SingleRepoHandler, error)

GetSingleClientAPICrawler checks if the API client for the requested single repository clientAPI exists and returns its handler.

func RegisterSingleBitbucketAPI

func RegisterSingleBitbucketAPI() SingleRepoHandler

RegisterSingleBitbucketAPI registers the crawler function for a single Bitbucket repository.

func RegisterSingleGithubAPI

func RegisterSingleGithubAPI() SingleRepoHandler

RegisterSingleGithubAPI registers the crawler function for a single Github repository. It returns nil if the repository was successfully added to the repositories channel; otherwise it returns the generated error.

func RegisterSingleGitlabAPI

func RegisterSingleGitlabAPI() SingleRepoHandler

RegisterSingleGitlabAPI registers the crawler function for a single Gitlab repository.

type Whitelist

type Whitelist []PA

Whitelist contains a list of Public Administrations.
