mediawiki

v0.14.1 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 2, 2024 License: Apache-2.0 Imports: 29 Imported by: 7

README

Utilities for processing Wikipedia and Wikidata dumps in Go


A Go package providing utilities for processing Wikipedia and Wikidata dumps.

Features:

Installation

This is a Go package. You can add it to your project using go get:

go get gitlab.com/tozd/go/mediawiki

It requires Go 1.20 or newer.

Usage

See full package documentation on pkg.go.dev.

GitHub mirror

There is also a read-only GitHub mirror available, if you need to fork the project there.

Documentation

Overview

Package mediawiki provides utilities for processing Wikipedia and Wikidata dumps.

Index

Constants

This section is empty.

Variables

var (
	ErrUnexpectedType = errors.Base("unexpected type")
	ErrInvalidValue   = errors.Base("invalid value")
	ErrNotFound       = errors.Base("not found")
	ErrJSONDecode     = errors.Base("cannot decode json")
	ErrSQLParse       = errors.Base("cannot parse SQL")
)

Functions

func DecodeImageMetadata added in v0.5.0

func DecodeImageMetadata(metadata interface{}) (map[string]interface{}, errors.E)

DecodeImageMetadata decodes the metadata column of the image table, which describes images and other uploaded files. See: https://www.mediawiki.org/wiki/Manual:Image_table

func LatestCommonsEntitiesRun added in v0.6.0

func LatestCommonsEntitiesRun(ctx context.Context, client *retryablehttp.Client) (string, errors.E)

LatestCommonsEntitiesRun returns URL of the latest run of Wikimedia Commons entities JSON dump.

func LatestCommonsImageMetadataRun added in v0.6.0

func LatestCommonsImageMetadataRun(ctx context.Context, client *retryablehttp.Client) (string, errors.E)

LatestCommonsImageMetadataRun returns URL of the latest run of Wikimedia Commons image table dump.

func LatestWikidataEntitiesRun added in v0.6.0

func LatestWikidataEntitiesRun(ctx context.Context, client *retryablehttp.Client) (string, errors.E)

LatestWikidataEntitiesRun returns URL of the latest run of Wikidata entities JSON dump.

func LatestWikipediaImageMetadataRun added in v0.6.0

func LatestWikipediaImageMetadataRun(ctx context.Context, client *retryablehttp.Client, language string) (string, errors.E)

LatestWikipediaImageMetadataRun returns URL of the latest run of Wikipedia image table dump. Use "enwiki" for English Wikipedia.

func LatestWikipediaRun added in v0.6.0

func LatestWikipediaRun(ctx context.Context, client *retryablehttp.Client, language string, namespace int) (string, errors.E)

LatestWikipediaRun returns URL of the latest run of Wikimedia Enterprise HTML dump. Use "enwiki" for English Wikipedia and namespace 0 for its articles.

func Process

func Process[T any](ctx context.Context, config *ProcessConfig[T]) errors.E

Process is a low-level function which decompresses a file (in one of the formats listed in Compression), extracts JSON values or SQL statements from it (according to FileType), decodes them, and calls the Process callback on each decoded JSON value or SQL statement. All of this runs in parallel, controlled by DecompressionThreads, DecodingThreads, and ItemsProcessingThreads. The file is downloaded from an HTTP URL and is processed while it is still downloading. The downloaded file is optionally saved (to a file at Path), and follow-up calls to Process can use the saved file (if the same Path is provided).

func ProcessCommonsEntitiesDump added in v0.6.0

func ProcessCommonsEntitiesDump(
	ctx context.Context, config *ProcessDumpConfig,
	processEntity func(context.Context, Entity) errors.E,
) errors.E

ProcessCommonsEntitiesDump downloads (unless already saved), decompresses, decodes JSON, and calls processEntity on every entity in a Wikimedia Commons entities JSON dump.

func ProcessWikidataDump

func ProcessWikidataDump(
	ctx context.Context, config *ProcessDumpConfig,
	processEntity func(context.Context, Entity) errors.E,
) errors.E

ProcessWikidataDump downloads (unless already saved), decompresses, decodes JSON, and calls processEntity on every entity in a Wikidata entities JSON dump.

func ProcessWikipediaDump

func ProcessWikipediaDump(
	ctx context.Context, config *ProcessDumpConfig,
	processArticle func(context.Context, Article) errors.E,
) errors.E

ProcessWikipediaDump downloads (unless already saved), decompresses, decodes JSON, and calls processArticle on every article in a Wikimedia Enterprise HTML dump.

Types

type Amount

type Amount struct {
	big.Rat
}

Amount is an arbitrary precision number and extends big.Rat.

func (Amount) MarshalJSON

func (a Amount) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler interface for Amount.

func (*Amount) String added in v0.2.0

func (a *Amount) String() string

func (*Amount) UnmarshalJSON

func (a *Amount) UnmarshalJSON(b []byte) error

UnmarshalJSON implements json.Unmarshaler interface for Amount.

type Article

type Article struct {
	Name                   string       `json:"name"`
	Identifier             int64        `json:"identifier"`
	Abstract               string       `json:"abstract,omitempty"`
	WatchersCount          int64        `json:"watchers_count,omitempty"`
	DateCreated            time.Time    `json:"date_created"`
	DateModified           time.Time    `json:"date_modified"`
	DatePreviouslyModified *time.Time   `json:"date_previously_modified,omitempty"`
	Protection             []Protection `json:"protection,omitempty"`
	Version                Version      `json:"version"`
	PreviousVersion        *Version     `json:"previous_version,omitempty"`
	URL                    string       `json:"url"`
	Namespace              Namespace    `json:"namespace"`
	InLanguage             InLanguage   `json:"in_language"`
	MainEntity             *EntityRef   `json:"main_entity,omitempty"`
	AdditionalEntities     []EntityRef  `json:"additional_entities,omitempty"`
	Categories             []Category   `json:"categories,omitempty"`
	Templates              []Template   `json:"templates,omitempty"`
	Redirects              []Redirect   `json:"redirects,omitempty"`
	IsPartOf               IsPartOf     `json:"is_part_of"`
	ArticleBody            ArticleBody  `json:"article_body"`
	License                []License    `json:"license,omitempty"`
	Visibility             *Visibility  `json:"visibility,omitempty"`
	Image                  *Image       `json:"image,omitempty"`
	Event                  Event        `json:"event"`
	InfoBox                []InfoBox    `json:"infobox,omitempty"`
}

Article is a Wikimedia Enterprise HTML dump article.

type ArticleBody

type ArticleBody struct {
	HTML     string `json:"html"`
	WikiText string `json:"wikitext"`
}

type CalendarModel

type CalendarModel int
const (
	Gregorian CalendarModel = iota
	Julian
)

func (CalendarModel) MarshalJSON

func (t CalendarModel) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler interface for CalendarModel.

Go enumeration values are converted to corresponding calendar Wikidata URIs. Those URIs might differ from (but are equivalent to) the ones in the source dump.

func (*CalendarModel) UnmarshalJSON

func (t *CalendarModel) UnmarshalJSON(b []byte) error

UnmarshalJSON implements json.Unmarshaler interface for CalendarModel.

It normalizes calendar Wikidata URIs to Go enumeration values.

type Category

type Category struct {
	Name string `json:"name"`
	URL  string `json:"url"`
}

type Compression

type Compression int
const (
	NoCompression Compression = iota
	Tar
	BZIP2
	BZIP2Tar
	GZIP
	GZIPTar
)

type DataType

type DataType int
const (
	WikiBaseItem DataType = iota
	ExternalID
	String
	Quantity
	Time
	GlobeCoordinate
	CommonsMedia
	MonolingualText
	URL
	GeoShape
	WikiBaseLexeme
	WikiBaseSense
	WikiBaseProperty
	Math
	MusicalNotation
	WikiBaseForm
	TabularData
)

func (DataType) MarshalJSON

func (t DataType) MarshalJSON() ([]byte, error)

func (*DataType) UnmarshalJSON

func (t *DataType) UnmarshalJSON(b []byte) error

type DataValue

type DataValue struct {
	Value interface{} `json:"value"`
}

DataValue provides parsed value as Go value in Value.

Value can be one of ErrorValue, StringValue, WikiBaseEntityIDValue, GlobeCoordinateValue, MonolingualTextValue, QuantityValue, and TimeValue.
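Since Value is an interface{}, callers typically inspect it with a type switch. The sketch below uses local stand-in types that mirror three of the documented value types. Whether the package stores these values directly or as pointers is not stated here, so the direct (non-pointer) form is an assumption.

```go
package main

import "fmt"

// Local stand-ins mirroring some of the documented value types.
type (
	errorValue           string
	stringValue          string
	monolingualTextValue struct {
		Language string
		Text     string
	}
)

// dataValue mirrors the shape of DataValue: a single untyped Value field.
type dataValue struct {
	Value interface{}
}

// describe returns a short human-readable summary of the wrapped value.
func describe(v dataValue) string {
	switch x := v.Value.(type) {
	case errorValue:
		return "error: " + string(x)
	case stringValue:
		return "string: " + string(x)
	case monolingualTextValue:
		return "text (" + x.Language + "): " + x.Text
	default:
		// Remaining value types would get their own cases in real code.
		return fmt.Sprintf("unhandled %T", x)
	}
}

func main() {
	fmt.Println(describe(dataValue{Value: stringValue("Q42")}))
	fmt.Println(describe(dataValue{Value: monolingualTextValue{Language: "en", Text: "hello"}}))
	fmt.Println(describe(dataValue{Value: errorValue("bad value")}))
}
```

Handling the error case first makes it harder to forget that a value may carry only an error and no usable data.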

func (DataValue) MarshalJSON

func (v DataValue) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler interface for DataValue.

The JSON representation of Go values might differ from (but is equivalent to) the representation in the source dump.

func (*DataValue) UnmarshalJSON

func (v *DataValue) UnmarshalJSON(b []byte) error

UnmarshalJSON implements json.Unmarshaler interface for DataValue.

It normalizes JSON representation to Go values.

type Editor

type Editor struct {
	Identifier        int64      `json:"identifier,omitempty"`
	IsAnonymous       bool       `json:"is_anonymous,omitempty"`
	IsBot             bool       `json:"is_bot,omitempty"`
	IsAdmin           bool       `json:"is_admin,omitempty"`
	IsPatroller       bool       `json:"is_patroller,omitempty"`
	HasAdvancedRights bool       `json:"has_advanced_rights,omitempty"`
	Name              string     `json:"name,omitempty"`
	EditCount         int64      `json:"edit_count,omitempty"`
	DateStarted       *time.Time `json:"date_started,omitempty"`
	Groups            []string   `json:"groups,omitempty"`
}

type Entity

type Entity struct {
	ID           string                     `json:"id"`
	PageID       int64                      `json:"pageid"`
	Namespace    int                        `json:"ns"`
	Title        string                     `json:"title"`
	Modified     time.Time                  `json:"modified"`
	Type         EntityType                 `json:"type"`
	DataType     *DataType                  `json:"datatype,omitempty"`
	Labels       map[string]LanguageValue   `json:"labels,omitempty"`
	Descriptions map[string]LanguageValue   `json:"descriptions,omitempty"`
	Aliases      map[string][]LanguageValue `json:"aliases,omitempty"`
	Claims       map[string][]Statement     `json:"claims,omitempty"`
	SiteLinks    map[string]SiteLink        `json:"sitelinks,omitempty"`
	LastRevID    int64                      `json:"lastrevid"`
}

Entity is a Wikidata entities JSON dump entity.
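The struct tags above determine how one entity from the dump maps onto Go fields. The following sketch decodes a small hand-written sample into a trimmed local copy of the struct. The entity and languageValue types are stand-ins with a subset of the documented fields, and the sample JSON is made up for illustration rather than taken from a real dump.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// languageValue mirrors the documented LanguageValue struct.
type languageValue struct {
	Language string `json:"language"`
	Value    string `json:"value"`
}

// entity is a trimmed stand-in for Entity, reusing the documented JSON tags.
type entity struct {
	ID        string                   `json:"id"`
	PageID    int64                    `json:"pageid"`
	Namespace int                      `json:"ns"`
	Title     string                   `json:"title"`
	Modified  time.Time                `json:"modified"`
	Labels    map[string]languageValue `json:"labels,omitempty"`
	LastRevID int64                    `json:"lastrevid"`
}

// sample is a hand-written example, not real dump content.
const sample = `{
  "id": "Q42", "pageid": 138, "ns": 0, "title": "Q42",
  "modified": "2024-01-01T00:00:00Z",
  "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
  "lastrevid": 1
}`

// decodeEntity decodes one JSON object into the trimmed entity struct.
func decodeEntity(data []byte) (entity, error) {
	var e entity
	err := json.Unmarshal(data, &e)
	return e, err
}

func main() {
	e, err := decodeEntity([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println(e.ID, e.Labels["en"].Value, e.Modified.Year())
}
```

Labels, Descriptions, and SiteLinks are maps keyed by language or site code, so a missing key yields a zero value rather than an error.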

type EntityRef

type EntityRef struct {
	Identifier string   `json:"identifier"`
	URL        string   `json:"url"`
	Aspects    []string `json:"aspects,omitempty"`
}

type EntityType

type EntityType int
const (
	Item EntityType = iota
	Property
	MediaInfo
)

func (EntityType) MarshalJSON

func (t EntityType) MarshalJSON() ([]byte, error)

func (*EntityType) UnmarshalJSON

func (t *EntityType) UnmarshalJSON(b []byte) error

type ErrorValue

type ErrorValue string

ErrorValue represents an error with the value.

When the JSON representation contains an error, only the error is provided as a Go value, because any other field might fail to parse.

type Event added in v0.13.0

type Event struct {
	Identifier    string     `json:"identifier"`
	Type          string     `json:"type"`
	DateCreated   time.Time  `json:"date_created"`
	DatePublished *time.Time `json:"date_published,omitempty"`
	Partition     int        `json:"partition,omitempty"`
	Offset        int64      `json:"offset,omitempty"`
}

type FileType

type FileType int
const (
	JSONArray FileType = iota
	NDJSON
	SQLDump
)

type GlobeCoordinateValue

type GlobeCoordinateValue struct {
	Latitude  float64 `json:"latitude"`
	Longitude float64 `json:"longitude"`
	Precision float64 `json:"precision"`
	Globe     string  `json:"globe"`
}

type Image added in v0.13.0

type Image struct {
	ContentURL string `json:"content_url"`
	Width      int    `json:"width,omitempty"`
	Height     int    `json:"height,omitempty"`
}

type InLanguage

type InLanguage struct {
	Identifier string `json:"identifier"`
}

type InfoBox added in v0.13.0

type InfoBox struct {
	Name     string    `json:"name,omitempty"`
	Type     string    `json:"type"`
	Value    string    `json:"value,omitempty"`
	Values   []string  `json:"values,omitempty"`
	HasParts []InfoBox `json:"has_parts,omitempty"`
	Images   []Image   `json:"images,omitempty"`
	Links    []Link    `json:"links,omitempty"`
}

type IsPartOf

type IsPartOf struct {
	Identifier string `json:"identifier"`
	URL        string `json:"url,omitempty"`
}

type LanguageValue

type LanguageValue struct {
	Language string `json:"language"`
	Value    string `json:"value"`
}

type License

type License struct {
	Identifier string `json:"identifier"`
	Name       string `json:"name"`
	URL        string `json:"url"`
}

type Link

type Link struct {
	URL    string  `json:"url"`
	Text   string  `json:"text,omitempty"`
	Images []Image `json:"images,omitempty"`
}

type MonolingualTextValue

type MonolingualTextValue struct {
	Language string `json:"language"`
	Text     string `json:"text"`
}

type Namespace

type Namespace struct {
	Identifier int64 `json:"identifier"`
}

type Probability

type Probability struct {
	False float64 `json:"false"`
	True  float64 `json:"true"`
}

type ProcessConfig

type ProcessConfig[T any] struct {
	URL                    string
	Path                   string
	Client                 *retryablehttp.Client
	DecompressionThreads   int
	DecodingThreads        int
	ItemsProcessingThreads int
	Process                func(context.Context, T) errors.E
	Progress               func(context.Context, x.Progress)
	FileType               FileType
	Compression            Compression
}

ProcessConfig is a configuration for low-level Process function.

URL or Path, Process, FileType, and Compression are required. If URL is provided and Path does not already exist, Client is required, too.

If only URL is provided, without Path, then Process downloads and processes the file at URL, but does not save it. If both URL and Path are provided, and the file at Path does not exist, then Process saves the file at Path while downloading and processing the file at URL. If the file at Path already exists, then Process just uses it as-is and does not download anything from URL.
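The three cases in the paragraph above can be written down as a small decision function. This is only a restatement of the documented behavior under assumptions: plan, action, and fileExists are hypothetical names, and the package may check these conditions differently internally.

```go
package main

import (
	"fmt"
	"os"
)

// action names what a call would do for a given URL/Path combination.
type action int

const (
	invalid         action = iota // nothing usable was provided
	downloadOnly                  // stream from URL, do not save
	downloadAndSave               // stream from URL and save to Path
	useSavedFile                  // read the existing file at Path
)

// plan applies the documented rules to a URL, a path, and a flag
// telling whether a file already exists at that path.
func plan(url, path string, pathExists bool) action {
	switch {
	case path != "" && pathExists:
		return useSavedFile
	case url != "" && path != "":
		return downloadAndSave
	case url != "":
		return downloadOnly
	default:
		return invalid
	}
}

// fileExists reports whether something exists at path.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	url := "https://example.org/dump.json.bz2" // placeholder URL
	path := "dump.json.bz2"
	fmt.Println(plan(url, path, fileExists(path)))
}
```

Note that an existing file at Path takes priority over URL, so a stale saved file has to be removed by the caller before a fresh download can happen.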

Client should set User-Agent header with contact information, e.g.:

client := retryablehttp.NewClient()
client.RequestLogHook = func(logger retryablehttp.Logger, req *http.Request, retry int) {
	req.Header.Set("User-Agent", "My bot (user@example.com)")
}

type ProcessDumpConfig

type ProcessDumpConfig struct {
	URL                    string
	Path                   string
	Client                 *retryablehttp.Client
	DecompressionThreads   int
	DecodingThreads        int
	ItemsProcessingThreads int
	Progress               func(context.Context, x.Progress)
}

ProcessDumpConfig is a configuration for high-level Process*Dump functions.

Either URL or Path is required. If URL is provided and Path does not already exist, Client is required, too.

Client should set User-Agent header with contact information, e.g.:

client := retryablehttp.NewClient()
client.RequestLogHook = func(logger retryablehttp.Logger, req *http.Request, retry int) {
	req.Header.Set("User-Agent", "My bot (user@example.com)")
}

type Protection

type Protection struct {
	Type   string `json:"type"`
	Level  string `json:"level"`
	Expiry string `json:"expiry,omitempty"`
}

type QuantityValue

type QuantityValue struct {
	Amount     Amount  `json:"amount"`
	UpperBound *Amount `json:"upperBound,omitempty"` //nolint:tagliatelle
	LowerBound *Amount `json:"lowerBound,omitempty"` //nolint:tagliatelle
	Unit       string  `json:"unit"`
}

type Redirect

type Redirect struct {
	Name string `json:"name"`
	URL  string `json:"url"`
}

type Reference

type Reference struct {
	Hash       string            `json:"hash,omitempty"`
	Snaks      map[string][]Snak `json:"snaks,omitempty"`
	SnaksOrder []string          `json:"snaks-order,omitempty"` //nolint:tagliatelle
}

type Score

type Score struct {
	Prediction  bool        `json:"prediction"`
	Probability Probability `json:"probability"`
}

type Scores

type Scores struct {
	Damaging  *Score `json:"damaging,omitempty"`
	Goodfaith *Score `json:"goodfaith,omitempty"`
}

type SiteLink

type SiteLink struct {
	Site   string   `json:"site"`
	Title  string   `json:"title"`
	Badges []string `json:"badges,omitempty"`
	URL    string   `json:"url,omitempty"`
}

type Size added in v0.13.0

type Size struct {
	Value int64  `json:"value"`
	Unit  string `json:"unit_text"`
}

type Snak

type Snak struct {
	Hash      string     `json:"hash,omitempty"`
	SnakType  SnakType   `json:"snaktype"`
	Property  string     `json:"property"`
	DataType  *DataType  `json:"datatype,omitempty"`
	DataValue *DataValue `json:"datavalue,omitempty"`
}

type SnakType

type SnakType int
const (
	Value SnakType = iota
	SomeValue
	NoValue
)

func (SnakType) MarshalJSON

func (t SnakType) MarshalJSON() ([]byte, error)

func (*SnakType) UnmarshalJSON

func (t *SnakType) UnmarshalJSON(b []byte) error

type Statement

type Statement struct {
	ID              string            `json:"id"`
	Type            StatementType     `json:"type"`
	MainSnak        Snak              `json:"mainsnak"`
	Rank            StatementRank     `json:"rank"`
	Qualifiers      map[string][]Snak `json:"qualifiers,omitempty"`
	QualifiersOrder []string          `json:"qualifiers-order,omitempty"` //nolint:tagliatelle
	References      []Reference       `json:"references,omitempty"`
}

type StatementRank

type StatementRank int
const (
	Preferred StatementRank = iota
	Normal
	Deprecated
)

func (StatementRank) MarshalJSON

func (r StatementRank) MarshalJSON() ([]byte, error)

func (*StatementRank) UnmarshalJSON

func (r *StatementRank) UnmarshalJSON(b []byte) error

type StatementType

type StatementType int
const (
	StatementT StatementType = iota
)

func (StatementType) MarshalJSON

func (t StatementType) MarshalJSON() ([]byte, error)

func (*StatementType) UnmarshalJSON

func (t *StatementType) UnmarshalJSON(b []byte) error

type StringValue

type StringValue string

type Template

type Template struct {
	Name string `json:"name"`
	URL  string `json:"url"`
}

type TimePrecision

type TimePrecision int
const (
	BillionYears TimePrecision = iota
	HoundredMillionYears
	TenMillionYears
	MillionYears
	HoundredMillenniums
	TenMillenniums
	Millennium
	Century
	Decade
	Year
	Month
	Day
	Hour
	Minute
	Second
)

type TimeValue

type TimeValue struct {
	Time      time.Time     `json:"time"`
	Precision TimePrecision `json:"precision"`
	Calendar  CalendarModel `json:"calendar"`
}

TimeValue represents a time value.

While Time is a regular time.Time struct with nanosecond precision, its real precision is given by Precision.

Note that Wikidata uses historical numbering, in which year 0 is undefined and 1 BCE is represented by -1, but time.Time uses astronomical numbering, in which 1 BCE is represented by 0.
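The difference between the two numbering schemes amounts to a shift of one for years before the common era. A minimal sketch of the conversion follows. historicalToAstronomical is a hypothetical helper written for this illustration and is not part of the package.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// historicalToAstronomical converts a year in historical numbering
// (no year 0, 1 BCE is -1) to astronomical numbering (1 BCE is 0),
// which is what time.Time uses.
func historicalToAstronomical(year int) (int, error) {
	if year == 0 {
		return 0, errors.New("year 0 is undefined in historical numbering")
	}
	if year < 0 {
		return year + 1, nil
	}
	return year, nil
}

func main() {
	// 44 BCE is written as -44 in historical numbering.
	y, err := historicalToAstronomical(-44)
	if err != nil {
		panic(err)
	}
	t := time.Date(y, time.March, 15, 0, 0, 0, 0, time.UTC)
	fmt.Println(y, t.Year())
}
```

Years of the common era are identical in both schemes, so only negative years need the adjustment.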

func (TimeValue) MarshalJSON

func (v TimeValue) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler interface for TimeValue.

func (*TimeValue) UnmarshalJSON

func (v *TimeValue) UnmarshalJSON(b []byte) error

UnmarshalJSON implements json.Unmarshaler interface for TimeValue.

type Version

type Version struct {
	Identifier          int64    `json:"identifier"`
	Editor              *Editor  `json:"editor,omitempty"`
	Comment             string   `json:"comment,omitempty"`
	Tags                []string `json:"tags,omitempty"`
	HasTagNeedsCitation bool     `json:"has_tag_needs_citation,omitempty"`
	IsMinorEdit         bool     `json:"is_minor_edit,omitempty"`
	IsFlaggedStable     bool     `json:"is_flagged_stable,omitempty"`
	Scores              *Scores  `json:"scores,omitempty"`
	Size                *Size    `json:"size,omitempty"`
	NumberOfCharacters  int64    `json:"number_of_characters,omitempty"`
	Event               Event    `json:"event"`
}

type Visibility added in v0.13.0

type Visibility struct {
	Text    bool `json:"text"`
	Editor  bool `json:"editor"`
	Comment bool `json:"comment"`
}

type WikiBaseEntityIDValue

type WikiBaseEntityIDValue struct {
	Type WikiBaseEntityType `json:"entity-type"` //nolint:tagliatelle
	ID   string             `json:"id"`
}

type WikiBaseEntityType

type WikiBaseEntityType int
const (
	ItemType WikiBaseEntityType = iota
	PropertyType
	LexemeType
	FormType
	SenseType
)

func (WikiBaseEntityType) MarshalJSON

func (t WikiBaseEntityType) MarshalJSON() ([]byte, error)

func (*WikiBaseEntityType) UnmarshalJSON

func (t *WikiBaseEntityType) UnmarshalJSON(b []byte) error
