parse

package
v0.0.0-...-192e4b2 Latest
Warning

This package is not in the latest version of its module.

Published: Nov 20, 2013 License: MIT Imports: 7 Imported by: 0

Documentation

Overview

An ad-hoc parser for Wikipedia's 45GB (and growing) XML database.

Constants

This section is empty.

Variables

This section is empty.

Functions

func CategorizedParse

func CategorizedParse(reader io.Reader, out chan<- *Page, categories *Categories)

CategorizedParse is just like Parse, except that it also categorizes pages.

func FilterRedirects

func FilterRedirects(rawPages <-chan []byte, nonRedirectPages chan<- []byte)

FilterRedirects discards all pages that redirect to another page.
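How redirects are recognised is not documented here. The sketch below assumes that a redirect page carries a `<redirect ... />` element, as MediaWiki XML exports do; `isRedirect`, `dropRedirects` and `keepNonRedirects` are hypothetical names, and closing the output channel when the input is drained is also an assumption.

```go
package main

import (
	"bytes"
	"fmt"
)

// redirectTag is the marker this sketch looks for. Whether the package
// relies on the same marker is an assumption.
var redirectTag = []byte("<redirect ")

// isRedirect reports whether a raw XML page looks like a redirect.
func isRedirect(raw []byte) bool {
	return bytes.Contains(raw, redirectTag)
}

// dropRedirects forwards every non-redirect page from in to out and
// closes out once in has been drained.
func dropRedirects(in <-chan []byte, out chan<- []byte) {
	defer close(out)
	for raw := range in {
		if !isRedirect(raw) {
			out <- raw
		}
	}
}

// keepNonRedirects runs a fixed set of pages through dropRedirects.
// Both channels are buffered so the stage can run synchronously.
func keepNonRedirects(pages []string) []string {
	in := make(chan []byte, len(pages))
	for _, p := range pages {
		in <- []byte(p)
	}
	close(in)
	out := make(chan []byte, len(pages))
	dropRedirects(in, out)
	var kept []string
	for p := range out {
		kept = append(kept, string(p))
	}
	return kept
}

func main() {
	pages := []string{
		`<page><title>Golang</title><redirect title="Go" /></page>`,
		`<page><title>Go</title></page>`,
	}
	for _, p := range keepNonRedirects(pages) {
		fmt.Println(p)
	}
}
```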

func GetCategories

func GetCategories(pages <-chan *Page, categorizedPages chan<- *Page, categories *Categories)

GetCategories extracts categories out of each Wikipedia page and adds them to the given categories object. Only links in the form [[Category:target]] are extracted.
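As an illustration of the [[Category:target]] form, the following sketch pulls category targets out of a piece of wikitext with a regular expression. The pattern and the `extractCategories` helper are assumptions made for this example, not the package's actual code; in particular, tolerating a sort key after "|" is a guess.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// categoryLink matches [[Category:target]] and captures the target.
// An optional sort key after "|" is accepted but not captured.
var categoryLink = regexp.MustCompile(`\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]`)

// extractCategories returns every category target found in text, in
// order of appearance.
func extractCategories(text string) []string {
	var cats []string
	for _, m := range categoryLink.FindAllStringSubmatch(text, -1) {
		cats = append(cats, strings.TrimSpace(m[1]))
	}
	return cats
}

func main() {
	text := "Go is a language. [[Category:Programming languages]] " +
		"[[Category:Google software|Go]] See also [[Rob Pike]]."
	for _, c := range extractCategories(text) {
		fmt.Println(c)
	}
}
```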

func GetChunks

func GetChunks(reader io.Reader, chunks chan<- []byte)

GetChunks reads an XML file line by line and dumps each line to its output channel.
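A line-by-line reader of this kind could be sketched as follows. `lineStage` and `linesOf` are hypothetical names. The sketch uses bufio.Reader.ReadBytes rather than bufio.Scanner because Scanner has a default token size limit that very long dump lines could exceed. Keeping the trailing newline and closing the channel at end of input are assumptions.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// lineStage reads r line by line and sends each line, including its
// trailing newline, to out. It closes out at end of input or on error.
func lineStage(r io.Reader, out chan<- []byte) {
	defer close(out)
	br := bufio.NewReader(r)
	for {
		// ReadBytes returns a fresh slice on every call, so the line
		// can be handed to another goroutine safely.
		line, err := br.ReadBytes('\n')
		if len(line) > 0 {
			out <- line
		}
		if err != nil {
			return // io.EOF or a read error ends the stream
		}
	}
}

// linesOf runs lineStage over a string and collects the result.
func linesOf(s string) []string {
	out := make(chan []byte)
	go lineStage(strings.NewReader(s), out)
	var lines []string
	for l := range out {
		lines = append(lines, string(l))
	}
	return lines
}

func main() {
	for _, l := range linesOf("<page>\n<title>Go</title>\n</page>") {
		fmt.Printf("%q\n", l)
	}
}
```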

func GetLinks

func GetLinks(pages <-chan *Page, linkedPages chan<- *Page)

GetLinks extracts all Wikipedia links found in pages. Only links in the form [[target]] are extracted.
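The channel-in, channel-out shape of such a stage can be sketched as below. The `page` type is a local stand-in for Page, reduced to the fields needed here, and `linkStage` and `linksOf` are hypothetical names. The pattern deliberately matches only the plain [[target]] form, so piped links such as [[target|label]] are skipped; whether the package does the same is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// page is a local stand-in for the package's Page type.
type page struct {
	Title string
	Text  string
	Links []string
}

// plainLink matches [[target]] with no pipe and no nested brackets.
var plainLink = regexp.MustCompile(`\[\[([^\[\]|]+)\]\]`)

// linkStage reads pages from in, fills in their Links, and forwards
// them to out. It closes out once in has been drained.
func linkStage(in <-chan *page, out chan<- *page) {
	defer close(out)
	for p := range in {
		for _, m := range plainLink.FindAllStringSubmatch(p.Text, -1) {
			p.Links = append(p.Links, m[1])
		}
		out <- p
	}
}

// linksOf pushes a single page through linkStage and returns its links.
// Both channels are buffered so the stage can run synchronously.
func linksOf(text string) []string {
	in := make(chan *page, 1)
	out := make(chan *page, 1)
	in <- &page{Title: "example", Text: text}
	close(in)
	linkStage(in, out)
	return (<-out).Links
}

func main() {
	fmt.Println(linksOf("See [[Go]] and [[Rob Pike|Rob]] and [[Plan 9]]."))
}
```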

func GetPages

func GetPages(rawPages <-chan []byte, pages chan<- *Page)

GetPages parses a complete XML page into a page object.

func GetRawPages

func GetRawPages(chunks <-chan []byte, pages chan<- []byte)

GetRawPages combines individual line elements into complete XML pages so that they can be processed by a standard in-memory XML parser.
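One way to combine lines into complete pages is to buffer everything between a line starting with `<page>` and a line ending with `</page>`. The sketch below does that; `pageStage` and `assemble` are hypothetical names, and the assumption that both tags sit on their own lines comes from the usual dump layout, not from this package.

```go
package main

import (
	"bytes"
	"fmt"
)

var (
	openTag  = []byte("<page>")
	closeTag = []byte("</page>")
)

// pageStage buffers lines from <page> through </page> and sends each
// complete page to pages. Lines outside a page element are dropped.
func pageStage(lines <-chan []byte, pages chan<- []byte) {
	defer close(pages)
	var buf bytes.Buffer
	inPage := false
	for line := range lines {
		trimmed := bytes.TrimSpace(line)
		if bytes.HasPrefix(trimmed, openTag) {
			inPage = true
			buf.Reset()
		}
		if !inPage {
			continue
		}
		buf.Write(line)
		if bytes.HasSuffix(trimmed, closeTag) {
			// Copy the bytes, since buf is reused for the next page.
			page := make([]byte, buf.Len())
			copy(page, buf.Bytes())
			pages <- page
			inPage = false
		}
	}
}

// assemble feeds the given lines through pageStage and collects pages.
func assemble(lines []string) []string {
	in := make(chan []byte)
	out := make(chan []byte)
	go func() {
		defer close(in)
		for _, l := range lines {
			in <- []byte(l)
		}
	}()
	go pageStage(in, out)
	var pages []string
	for p := range out {
		pages = append(pages, string(p))
	}
	return pages
}

func main() {
	lines := []string{
		"<mediawiki>\n",
		"  <page>\n", "    <title>Go</title>\n", "  </page>\n",
		"</mediawiki>\n",
	}
	for _, p := range assemble(lines) {
		fmt.Print(p)
	}
}
```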

func Parse

func Parse(reader io.Reader, pages chan<- *Page)

Parse parses the given reader as XML and sends Page objects, with their links extracted, to its output channel.

Types

type Categories

type Categories struct {
	// contains filtered or unexported fields
}

Categories is a simple structure for keeping track of the categories of Wikipedia articles. It is optimized for queries like "articles about science" rather than "which categories is this article in".

func NewCategories

func NewCategories() *Categories

NewCategories returns an empty categories object.

func (*Categories) AddPage

func (self *Categories) AddPage(page *Page, cats []string)

AddPage adds the given page to the given categories.

func (*Categories) String

func (self *Categories) String() string

String produces a string representation of the categories in the form: category -> (article, article, article)
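Since the fields of Categories are unexported, its layout is not visible here. A structure optimized for "articles in this category" lookups could be a map from category name to article titles, as in the sketch below. `categoryIndex` and its methods are hypothetical; the mutex, the sorted output, and the one-line-per-category format are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// categoryIndex maps a category name to the titles filed under it.
type categoryIndex struct {
	mu     sync.Mutex
	byName map[string][]string
}

// newCategoryIndex returns an empty index.
func newCategoryIndex() *categoryIndex {
	return &categoryIndex{byName: make(map[string][]string)}
}

// addPage files the given title under each of the given categories.
func (c *categoryIndex) addPage(title string, cats []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, cat := range cats {
		c.byName[cat] = append(c.byName[cat], title)
	}
}

// String renders one "category -> (article, article)" line per
// category, with categories in alphabetical order.
func (c *categoryIndex) String() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	names := make([]string, 0, len(c.byName))
	for name := range c.byName {
		names = append(names, name)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, name := range names {
		fmt.Fprintf(&b, "%s -> (%s)\n", name, strings.Join(c.byName[name], ", "))
	}
	return b.String()
}

func main() {
	idx := newCategoryIndex()
	idx.addPage("Physics", []string{"science"})
	idx.addPage("Chemistry", []string{"science", "laboratories"})
	fmt.Print(idx.String())
}
```

Looking up all articles in a category is a single map access in this layout, while finding the categories of one article would require scanning every entry, which matches the trade-off described for Categories.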

type Page

type Page struct {
	Title    string    `xml:"title"`
	Revision *Revision `xml:"revision"`
	Links    []string
}

Page is a representation of a Wikipedia page with only the necessary fields. The XML of a Wikipedia page can be unmarshalled directly into a Page; elements without a matching field are ignored.
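The unmarshalling can be tried with encoding/xml alone. The sketch below declares local copies of Page and Revision with the tags shown on this page; `decodePage` is a hypothetical helper, and the sample XML is a hand-written fragment rather than real dump data.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Revision and Page are local copies of the types documented here.
type Revision struct {
	Text string `xml:"text"`
}

type Page struct {
	Title    string    `xml:"title"`
	Revision *Revision `xml:"revision"`
	Links    []string
}

// decodePage unmarshals one complete <page> element. Elements without
// a matching field, such as <id> or <timestamp>, are skipped by
// encoding/xml.
func decodePage(raw []byte) (*Page, error) {
	page := &Page{}
	if err := xml.Unmarshal(raw, page); err != nil {
		return nil, err
	}
	return page, nil
}

func main() {
	raw := []byte(`<page>
  <title>Go</title>
  <id>1</id>
  <revision>
    <timestamp>2013-11-20T00:00:00Z</timestamp>
    <text>Go is a [[programming language]].</text>
  </revision>
</page>`)
	page, err := decodePage(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(page.Title)
	fmt.Println(page.Revision.Text)
}
```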

func (*Page) String

func (self *Page) String() string

type Revision

type Revision struct {
	Text string `xml:"text"`
}

A Wikipedia revision usually also carries information about the user and time of the revision. Since Kapok is focused only on the latest version of Wikipedia, these fields are ignored and only the text is kept.
