GoOse: github.com/suzuken/GoOse

package goose

import "github.com/suzuken/GoOse"

Package goose is a golang port of "Goose" originally licensed to Gravity.com under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership.

The Golang port was written by Antonio Linari.

Gravity.com licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Package Files

article.go cleaner.go configuration.go crawler.go doc.go extractor.go goose.go images.go outputformatter.go parser.go stopwords.go tokenizer.go videos.go wordstats.go

func ExtractMainImage

func ExtractMainImage(article *Article) string

ExtractMainImage fetches the main image from the HTML page

func ExtractOpenGraphImage

func ExtractOpenGraphImage(article *Article) string

ExtractOpenGraphImage returns the image declared in the page's OpenGraph properties

type Article

type Article struct {
    Title           string             `json:"title,omitempty"`
    CleanedText     string             `json:"content,omitempty"`
    MetaDescription string             `json:"description,omitempty"`
    MetaLang        string             `json:"lang,omitempty"`
    MetaFavicon     string             `json:"favicon,omitempty"`
    MetaKeywords    string             `json:"keywords,omitempty"`
    CanonicalLink   string             `json:"canonicalurl,omitempty"`
    Domain          string             `json:"domain,omitempty"`
    TopNode         *goquery.Selection `json:"-"`
    TopImage        string             `json:"image,omitempty"`
    Tags            *set.Set           `json:"tags,omitempty"`
    Movies          *set.Set           `json:"movies,omitempty"`
    FinalURL        string             `json:"url,omitempty"`
    RawHTML         string             `json:"rawhtml,omitempty"`
    Doc             *goquery.Document  `json:"-"`
    Links           []string           `json:"links,omitempty"`
    PublishDate     string             `json:"publishdate,omitempty"`
    Delta           int64              `json:"delta,omitempty"`
}

Article is a collection of properties extracted from the HTML body
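
The struct tags above determine how an Article serializes to JSON: every exported field carries omitempty, so fields left at their zero value do not appear in the output, and TopNode and Doc are skipped entirely because of the "-" tag. The sketch below illustrates this with a reduced stand-in type. The names articleSummary and encode are hypothetical and not part of GoOse; only the field names and tags are copied from the struct above, and the set-typed fields are left out so that the example needs nothing beyond the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// articleSummary is a hypothetical, reduced stand-in for goose.Article.
// It copies a few fields and their JSON tags from the documented struct
// so the example runs with the standard library alone.
type articleSummary struct {
	Title           string   `json:"title,omitempty"`
	CleanedText     string   `json:"content,omitempty"`
	MetaDescription string   `json:"description,omitempty"`
	FinalURL        string   `json:"url,omitempty"`
	Links           []string `json:"links,omitempty"`
}

// encode marshals the summary. Fields left at their zero value are
// dropped from the output because every tag carries omitempty.
func encode(a articleSummary) (string, error) {
	b, err := json.Marshal(a)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encode(articleSummary{
		Title:    "Example",
		FinalURL: "http://example.com/post",
	})
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	// Only the two populated fields appear, in struct field order.
	fmt.Println(out) // {"title":"Example","url":"http://example.com/post"}
}
```

Note that the JSON keys differ from the Go field names in several places (CleanedText becomes "content", FinalURL becomes "url"), which matters when consuming the output from another program.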

func (*Article) String

func (article *Article) String() string

type Cleaner

type Cleaner struct {
    // contains filtered or unexported fields
}

Cleaner removes menus, ads, sidebars, etc. and leaves the main content

func NewCleaner

func NewCleaner(config Configuration) *Cleaner

NewCleaner returns a new instance of a Cleaner

type Configuration

type Configuration struct {
    // contains filtered or unexported fields
}

Configuration is a wrapper for various config options

func GetDefaultConfiguration

func GetDefaultConfiguration(args ...string) Configuration

GetDefaultConfiguration returns safe default configuration options

type ContentExtractor

type ContentExtractor struct {
    // contains filtered or unexported fields
}

ContentExtractor can parse the HTML and fetch various properties

func NewExtractor

func NewExtractor(config Configuration) *ContentExtractor

NewExtractor returns a configured HTML parser

type Crawler

type Crawler struct {
    Extractor      *ContentExtractor
    VideoExtractor *VideoExtractor
    Cleaner        *Cleaner
    // contains filtered or unexported fields
}

Crawler can fetch the target HTML page

func NewCrawler

func NewCrawler(config Configuration) *Crawler

NewCrawler returns a crawler object initialized with the given configuration

func (*Crawler) Extract

func (c *Crawler) Extract(r io.Reader, rawurl string) (*Article, error)

func (*Crawler) Fetch

func (c *Crawler) Fetch(rawurl string) (*Article, error)

Fetch gets the contents at rawurl and extracts an article from them.

type Goose

type Goose struct {
    // contains filtered or unexported fields
}

Goose is the main entry point of the program

func New

func New(args ...string) Goose

New returns a new instance of the article extractor

func (Goose) ExtractFromRawHTML

func (g Goose) ExtractFromRawHTML(url string, RawHTML io.Reader) (*Article, error)

ExtractFromRawHTML returns an article object from the raw HTML content

func (Goose) ExtractFromURL

func (g Goose) ExtractFromURL(url string) (*Article, error)

ExtractFromURL follows the URL, fetches the HTML page and returns an article object

type MultilangTokenizer

type MultilangTokenizer struct {
    // contains filtered or unexported fields
}

MultilangTokenizer switches tokenizers according to the given language setting. Each document is tokenized with the tokenizer that matches its language.

func NewMultilangTokenizer

func NewMultilangTokenizer(lang language.Tag) *MultilangTokenizer

NewMultilangTokenizer returns a MultilangTokenizer for the given language.

func (*MultilangTokenizer) Tokenize

func (m *MultilangTokenizer) Tokenize(s string) []string

Tokenize tokenizes the given string and returns its tokens.

type Parser

type Parser struct{}

Parser is an HTML parser specialised in extraction of main content and other properties

func NewParser

func NewParser() *Parser

NewParser returns an HTML parser

type StopWords

type StopWords struct {
    // contains filtered or unexported fields
}

StopWords implements a simple language detector

func NewStopwords

func NewStopwords() StopWords

NewStopwords returns an instance of a stop words detector

func (StopWords) SimpleLanguageDetector

func (stop StopWords) SimpleLanguageDetector(text string) string

SimpleLanguageDetector returns the language code for the text, based on its stop words
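
To make the idea of stop-word-based detection concrete, here is a minimal sketch of the general technique: count how many words of the text appear in each language's stop-word list and return the code of the language with the most hits. This is not GoOse's implementation. The function detectLanguage, the stopWords table, and its tiny word lists are assumptions made for illustration, and the sketch does not strip punctuation.

```go
package main

import (
	"fmt"
	"strings"
)

// stopWords is a tiny, hypothetical table used only for this example.
// A practical detector would rely on much larger lists per language.
var stopWords = map[string][]string{
	"en": {"the", "and", "is", "of", "to", "in"},
	"es": {"el", "la", "y", "es", "de", "en"},
	"fr": {"le", "la", "et", "est", "de", "un"},
}

// languages fixes the iteration order so that ties resolve deterministically
// (Go map iteration order is randomized).
var languages = []string{"en", "es", "fr"}

// detectLanguage returns the code of the language whose stop words occur
// most often in text, or "" when no stop word matches at all.
func detectLanguage(text string) string {
	words := strings.Fields(strings.ToLower(text))
	best, bestCount := "", 0
	for _, lang := range languages {
		// Build a lookup set for this language's stop words.
		set := make(map[string]bool, len(stopWords[lang]))
		for _, w := range stopWords[lang] {
			set[w] = true
		}
		count := 0
		for _, w := range words {
			if set[w] {
				count++
			}
		}
		// Strictly greater: on a tie, the earlier language wins.
		if count > bestCount {
			best, bestCount = lang, count
		}
	}
	return best
}

func main() {
	fmt.Println(detectLanguage("The cat is in the house and the dog is out")) // en
	fmt.Println(detectLanguage("El gato es de la casa y el perro"))           // es
}
```

Short texts, or texts in closely related languages, share many stop words, so a count-based approach like this is only reliable with longer input and fuller word lists.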

type Tokenizer

type Tokenizer interface {
    Tokenize(string) []string
}

Tokenizer splits a given string into tokens.
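
Because Tokenizer is a single-method interface, any type with a matching Tokenize method satisfies it. The sketch below repeats the interface locally so that it compiles on its own, and pairs it with a whitespace-based implementation. The names whitespaceTokenizer and countTokens are hypothetical and not part of GoOse.

```go
package main

import (
	"fmt"
	"strings"
)

// Tokenizer repeats the interface documented above so that this example
// is self-contained.
type Tokenizer interface {
	Tokenize(string) []string
}

// whitespaceTokenizer is a hypothetical implementation for illustration.
type whitespaceTokenizer struct{}

// Tokenize lowercases s and splits it on runs of Unicode whitespace.
func (whitespaceTokenizer) Tokenize(s string) []string {
	return strings.Fields(strings.ToLower(s))
}

// countTokens accepts any Tokenizer, showing how callers can stay
// independent of the concrete tokenizer in use.
func countTokens(t Tokenizer, s string) int {
	return len(t.Tokenize(s))
}

func main() {
	var t Tokenizer = whitespaceTokenizer{}
	fmt.Println(t.Tokenize("Goose extracts  the Main content")) // [goose extracts the main content]
	fmt.Println(countTokens(t, "one two three"))                // 3
}
```

Whitespace splitting only suits languages that separate words with spaces, which is presumably why the package also offers a language-aware MultilangTokenizer.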

type VideoExtractor

type VideoExtractor struct {
    // contains filtered or unexported fields
}

VideoExtractor can extract the main video from an HTML page

func NewVideoExtractor

func NewVideoExtractor() *VideoExtractor

NewVideoExtractor returns a new instance of an HTML video extractor

func (*VideoExtractor) GetVideos

func (ve *VideoExtractor) GetVideos(article *Article) *set.Set

GetVideos returns the video tags embedded in the article

Directories

Path            Synopsis
goose-server    goose-server is a simple service for presenting content extracted by GoOse.

Package goose imports 19 packages and is imported by 1 package. Updated 2016-07-15.