unfurlist: github.com/Doist/unfurlist

package unfurlist

import "github.com/Doist/unfurlist"

Package unfurlist implements a service that unfurls URLs and provides more information about them.

The current version supports the Open Graph and oEmbed formats; support for the Twitter card format is planned. If the URL does not support any of these formats, unfurlist falls back to looking at common HTML tags such as <title> and <meta name="description">.

The endpoint accepts GET and POST requests with `content` as the main argument and returns a JSON-encoded list of the URLs that were parsed.

If a URL lacks an attribute (e.g. `image`), that attribute is omitted from the result.

Example:

?content=Check+this+out+https://www.youtube.com/watch?v=dQw4w9WgXcQ

Will return:

    Type: "application/json"

	[
		{
			"title": "Rick Astley - Never Gonna Give You Up",
			"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
			"url_type": "video",
			"site_name": "YouTube",
			"image": "https://i.ytimg.com/vi/dQw4w9WgXcQ/hqdefault.jpg"
		}
	]

If the handler was configured with FetchImageSize=true in its config, each object may have additional fields `image_width` and `image_height` specifying the dimensions of the image given by the `image` attribute.

Additionally, you can supply `callback` to wrap the result in a JavaScript callback (JSONP); the type of this response is "application/x-javascript".

If the optional `markdown` boolean argument is set (markdown=true), the provided content is parsed as markdown-formatted text and links are extracted in context-aware mode, i.e. preformatted text blocks are skipped.

Security

Care should be taken when running this service inside an internal network, since it may disclose internal endpoints. It is a good idea to run the service on a separate host in an isolated subnet.

Alternatively, access to internal resources may be limited with firewall rules. For example, if the service runs as the 'unfurlist' user on a Linux box, the following iptables and ip6tables rules can reduce the chances of it connecting to internal endpoints (note that the IPv6 rule covers only loopback and link-local addresses):

iptables -A OUTPUT -m owner --uid-owner unfurlist -p tcp --syn \
	-d 127/8,10/8,169.254/16,172.16/12,192.168/16 \
	-j REJECT --reject-with icmp-net-prohibited
ip6tables -A OUTPUT -m owner --uid-owner unfurlist -p tcp --syn \
	-d ::1/128,fe80::/10 \
	-j REJECT --reject-with adm-prohibited
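A similar restriction can also be approximated inside the Go process, as a complement to firewall rules rather than a replacement. The sketch below builds an http.Client whose dialer refuses loopback, private and link-local addresses after DNS resolution; such a client could then be passed to the handler through WithHTTPClient. The names `guardControl`, `newGuardedClient` and `errInternalAddr` and the timeout values are assumptions made for this example, not part of the package.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"syscall"
	"time"
)

var errInternalAddr = errors.New("connection to internal address refused")

// guardControl is called by net.Dialer with the already resolved
// "ip:port" address, just before the connection is made.
func guardControl(network, address string, _ syscall.RawConn) error {
	host, _, err := net.SplitHostPort(address)
	if err != nil {
		return err
	}
	ip := net.ParseIP(host)
	// Unparseable addresses (e.g. with an IPv6 zone) are rejected too.
	if ip == nil || ip.IsLoopback() || ip.IsPrivate() ||
		ip.IsLinkLocalUnicast() || ip.IsUnspecified() {
		return errInternalAddr
	}
	return nil
}

// newGuardedClient returns a client that cannot dial internal addresses.
// No proxy is configured, so every connection goes through guardControl.
func newGuardedClient() *http.Client {
	dialer := &net.Dialer{Timeout: 10 * time.Second, Control: guardControl}
	return &http.Client{
		Timeout:   30 * time.Second,
		Transport: &http.Transport{DialContext: dialer.DialContext},
	}
}

func main() {
	_ = newGuardedClient()
	for _, addr := range []string{"10.0.0.5:80", "93.184.216.34:443"} {
		fmt.Println(addr, guardControl("tcp4", addr, nil))
	}
}
```

Because the check runs on the resolved IP address, a public hostname that resolves to an internal address is rejected as well. net.IP.IsPrivate requires Go 1.17 or newer.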

Package Files

assets-autogenerated.go conf.go fetcher.go googlemaps.go html_meta_parser.go image.go oembed_parser.go opengraph_parser.go prefixmap.go unfurlist.go url_parser.go

Constants

const DefaultMaxResults = 20

DefaultMaxResults is the maximum number of urls to process if not configured with the WithMaxResults function.

func New

func New(conf ...ConfFunc) http.Handler

New returns a new initialized unfurl handler. If no configuration functions are provided, sane defaults are used.

func ParseURLs

func ParseURLs(content string) []string

ParseURLs tries to extract unique url-like (http/https scheme only) substrings from the given text. Results may not be proper urls, since only sequences of matching characters are searched for. This function is optimized for extracting urls from plain text, where they can be mixed with punctuation symbols: trailing symbols []()<>,;. are removed, but trailing >]) are left if any opening <[( is found inside the url.

Code:

text := `This text contains various urls mixed with different reserved per rfc3986 characters:
	http://google.com, https://doist.com/#about (also see https://todoist.com), <http://example.com/foo>,
	**[markdown](http://daringfireball.net/projects/markdown/)**,
	http://marvel-movies.wikia.com/wiki/The_Avengers_(film), https://pt.wikipedia.org/wiki/Mamão.
	https://docs.live.net/foo/?section-id={D7CEDACE-AEFB-4B61-9C63-BDE05EEBD80A},
	http://example.com/?param=foo;bar
	`
for _, u := range ParseURLs(text) {
    fmt.Println(u)
}

Output:

http://google.com
https://doist.com/#about
https://todoist.com
http://example.com/foo
http://daringfireball.net/projects/markdown/
http://marvel-movies.wikia.com/wiki/The_Avengers_(film)
https://pt.wikipedia.org/wiki/Mamão
https://docs.live.net/foo/?section-id={D7CEDACE-AEFB-4B61-9C63-BDE05EEBD80A}
http://example.com/?param=foo;bar
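The trailing-symbol rule described for ParseURLs can be illustrated with a small sketch. This is a simplified illustration and not the package's implementation: the helper `trimTrailing` is hypothetical, it works on a candidate substring that has already been cut out of the text, and it assumes a closing bracket is kept only when its matching opening bracket occurs inside the url.

```go
package main

import (
	"fmt"
	"strings"
)

// closers maps each closing bracket to its opening counterpart.
var closers = map[byte]string{')': "(", ']': "[", '>': "<"}

// trimTrailing strips trailing punctuation from a url-like substring,
// keeping a closing bracket when the matching opener is inside the url.
func trimTrailing(u string) string {
	for len(u) > 0 {
		last := u[len(u)-1]
		if open, ok := closers[last]; ok {
			if strings.Contains(u, open) {
				return u // e.g. ".../The_Avengers_(film)"
			}
		} else if strings.IndexByte("[(<,;.", last) < 0 {
			return u // not a removable trailing symbol
		}
		u = u[:len(u)-1]
	}
	return u
}

func main() {
	for _, s := range []string{
		"http://google.com,",
		"https://todoist.com),",
		"http://marvel-movies.wikia.com/wiki/The_Avengers_(film),",
	} {
		fmt.Println(trimTrailing(s))
	}
}
```

Only trailing bytes are inspected, so a semicolon in the middle of a url, as in http://example.com/?param=foo;bar, is left untouched.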

type ConfFunc

type ConfFunc func(*unfurlHandler) *unfurlHandler

ConfFunc is used to configure a new unfurl handler; such functions should be passed as arguments to the New function.
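The shape of ConfFunc follows the functional options pattern. The sketch below shows how such options can be applied by a constructor. It uses a hypothetical `handler` type and lower-case stand-ins (`confFunc`, `withMaxResults`, `withImageDimensions`, `newHandler`) because unfurlHandler is unexported; it is an illustration of the pattern, not the package's code, and silently ignoring a non-positive n is an assumption of this sketch.

```go
package main

import "fmt"

// defaultMaxResults mirrors the DefaultMaxResults constant.
const defaultMaxResults = 20

// handler is a hypothetical stand-in for the unexported unfurlHandler.
type handler struct {
	maxResults      int
	imageDimensions bool
}

// confFunc has the same shape as ConfFunc: it receives the handler,
// adjusts it, and returns it.
type confFunc func(*handler) *handler

func withMaxResults(n int) confFunc {
	return func(h *handler) *handler {
		if n > 0 { // assumption: non-positive values are ignored here
			h.maxResults = n
		}
		return h
	}
}

func withImageDimensions(enable bool) confFunc {
	return func(h *handler) *handler {
		h.imageDimensions = enable
		return h
	}
}

// newHandler starts from defaults and applies each option in order.
func newHandler(conf ...confFunc) *handler {
	h := &handler{maxResults: defaultMaxResults}
	for _, f := range conf {
		h = f(h)
	}
	return h
}

func main() {
	fmt.Println(newHandler().maxResults)                  // 20
	fmt.Println(newHandler(withMaxResults(5)).maxResults) // 5
}
```

With the real package the equivalent call would take the form New(WithMaxResults(5), WithImageDimensions(true)), using the functions documented in this section.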

func WithBlacklistPrefixes

func WithBlacklistPrefixes(prefixes []string) ConfFunc

WithBlacklistPrefixes configures unfurl handler to skip unfurling urls matching any provided prefix

func WithBlacklistTitles

func WithBlacklistTitles(substrings []string) ConfFunc

WithBlacklistTitles configures unfurl handler to skip unfurling urls that return pages whose title contains one of the provided substrings

func WithExtraHeaders

func WithExtraHeaders(hdr map[string]string) ConfFunc

WithExtraHeaders configures unfurl handler to add extra headers to each outgoing http request

func WithFetchers

func WithFetchers(fetchers ...FetchFunc) ConfFunc

WithFetchers attaches custom fetchers to unfurl handler created by New().

func WithHTTPClient

func WithHTTPClient(client *http.Client) ConfFunc

WithHTTPClient configures unfurl handler to use provided http.Client for outgoing requests

func WithImageDimensions

func WithImageDimensions(enable bool) ConfFunc

WithImageDimensions configures whether unfurl handler fetches image dimensions.

func WithImageProxy

func WithImageProxy(proxyURL, secret string) ConfFunc

WithImageProxy configures unfurl handler to pass plain http image urls through image proxy located at proxyURL. The following query parameters are added to the proxyURL: "u" specifies original image url, "h" specifies sha1 HMAC signature (only if secret is not empty). It is expected that proxyURL does not have query string; it is used "as is", query arguments are appended as "?u=...&h=..." string.

See https://github.com/artyom/image-proxy for proxy implementation example.

func WithLogger

func WithLogger(l Logger) ConfFunc

WithLogger configures unfurl handler to use provided logger

func WithMaxResults

func WithMaxResults(n int) ConfFunc

WithMaxResults configures unfurl handler to process only the first n urls it finds. n must be positive.

func WithMemcache

func WithMemcache(client *memcache.Client) ConfFunc

WithMemcache configures unfurl handler to cache metadata in memcached

type FetchFunc

type FetchFunc func(*url.URL) (*Metadata, bool)

FetchFunc defines custom metadata fetchers that can be attached to unfurl handler

func GoogleMapsFetcher

func GoogleMapsFetcher(key string) FetchFunc

GoogleMapsFetcher returns FetchFunc that recognizes some Google Maps urls and constructs metadata for them containing preview image from Google Static Maps API. The only argument is the API key to create image links with.

type Logger

type Logger interface {
    Print(v ...interface{})
    Printf(format string, v ...interface{})
    Println(v ...interface{})
}

Logger describes set of methods used by unfurl handler for logging; standard lib *log.Logger implements this interface.

type Metadata

type Metadata struct {
    Title       string
    Type        string // TODO: make this int8 w/enum constants
    Description string
    Image       string // image/thumbnail url
    ImageWidth  int
    ImageHeight int
}

Metadata represents metadata retrieved by FetchFunc. At least one of the Title, Description or Image attributes is expected to be non-empty.

func (*Metadata) Valid

func (m *Metadata) Valid() bool

Valid checks that at least one of the mandatory attributes is non-empty

Directories

Path           Synopsis
cmd/unfurlist  Command unfurlist implements http server exposing API endpoint

Package unfurlist imports 31 packages and is imported by 1 package. Updated 2018-08-10.