README
Swan


An implementation of the Goose HTML Content / Article Extractor algorithm in Go.
Swan extracts cleaned-up text and HTML content from any webpage by removing all the extra junk that so many pages have these days.
Check out the Go documentation page for full usage and examples.
Features
- Main content extraction from almost any source
- Extract HTML content with images
- Get article metadata, publish dates, and a lot more
- Recognize different content types and apply special extractions (currently only recognizes comic sites and normal sites)
Planned
- Inline videos into HTML content when found in an article
- Recognize news sources and extract corresponding video / audio content
- Recognize and extract more types of content
- An interesting idea: https://github.com/buriy/python-readability/issues/57#issuecomment-67926023
Documentation
Overview
Package swan implements the Goose HTML Content / Article Extractor algorithm.
Currently, swan will try to extract the following content types:
- Comics: if something looks like a web comic, it will be extracted as just an image. This is a WIP.
- Everything else: it will look for article text and try to extract any header image that goes with it.
Constants
const (
	// Version of the library
	Version = "1.0"
)
Types
type Article
type Article struct {
	// Final URL after all redirects
	URL string

	// Newline-separated and cleaned content
	CleanedText string

	// Node from which CleanedText was created. Call .Html() on this to get
	// printable HTML.
	TopNode *goquery.Selection

	// A header image to use for the article. Nil if no image could be
	// detected.
	Img *Image

	// All metadata associated with the original document
	Meta struct {
		Authors     []string
		Canonical   string
		Description string
		Domain      string
		Favicon     string
		Keywords    string
		Links       []string
		Lang        string
		OpenGraph   map[string]string
		PublishDate string
		Tags        []string
		Title       string
	}

	// Full document backing this article
	Doc *goquery.Document
	// contains filtered or unexported fields
}
Article is a fully extracted and cleaned document.
func FromDoc
FromDoc does its best to extract an article from a single document.
Pass in the URL the document came from so that images can be resolved correctly.
func FromHTML
FromHTML does its best to extract an article from a single HTML page.
Pass in the URL the document came from so that images can be resolved correctly.
Example
Output:
Title: Example Title
Site Name: Example Name
HTML: <p>some article body with a bunch of text in it</p>
Plain: some article body with a bunch of text in it
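Both FromDoc and FromHTML ask for the URL the document came from so that images can be resolved. The sketch below illustrates why: a relative image reference can only be made absolute against the page URL. It uses the standard `net/url` package; `resolveImage` is a hypothetical helper, and this is not necessarily how swan resolves images internally.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveImage is a hypothetical helper (not part of swan) that turns an
// image reference found in a page into an absolute URL, using the URL the
// page was fetched from as the base.
func resolveImage(pageURL, src string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(src)
	if err != nil {
		return "", err
	}
	// ResolveReference applies the RFC 3986 reference resolution rules.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "http://example.com/articles/2015/post.html"
	sources := []string{
		"img/header.png",          // relative to the page's directory
		"/static/logo.png",        // relative to the site root
		"//cdn.example.com/a.jpg", // scheme-relative
	}
	for _, src := range sources {
		abs, err := resolveImage(page, src)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```

Without the page URL, a reference such as `img/header.png` cannot be resolved to a fetchable address, which is presumably why both functions take it as a parameter.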