unfurlist

Version: v0.0.0-...-98bef51
Published: Mar 28, 2024 License: MIT Imports: 31 Imported by: 0

README

Package unfurlist implements a service that unfurls URLs and provides more information about them.

To install the ready-to-use HTTP service:

go get -u github.com/Doist/unfurlist/...

See documentation.

Documentation

Overview

Package unfurlist implements a service that unfurls URLs and provides more information about them.

The current version supports the Open Graph and oEmbed formats; support for the Twitter card format is planned. If a URL does not support these formats, unfurlist falls back to common HTML tags such as <title> and <meta name="description">.

The endpoint accepts GET and POST requests with `content` as the main argument. It returns a JSON-encoded list of the URLs that were parsed.

If a URL lacks an attribute (e.g. `image`), that attribute is omitted from the result.

Example:

?content=Check+this+out+https://www.youtube.com/watch?v=dQw4w9WgXcQ

Will return:

    Type: "application/json"

	[
		{
			"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
			"title": "Rick Astley - Never Gonna Give You Up (Video)",
			"url_type": "video.other",
			"description": "Rick Astley - Never Gonna Give You Up...",
			"site_name": "YouTube",
			"favicon": "https://www.youtube.com/yts/img/favicon_32-vflOogEID.png",
			"image": "https://i.ytimg.com/vi/dQw4w9WgXcQ/maxresdefault.jpg"
		}
	]

If the handler was configured with FetchImageSize=true in its config, each object in the list may have the additional fields `image_width` and `image_height`, which give the dimensions of the image referenced by the `image` attribute.
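A client consuming this response could decode it roughly as follows. This is a minimal sketch: the `Result` struct and the `decodeResults` helper are names invented for illustration, only the JSON keys are taken from the text above, and integer types for the image dimensions are an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Result mirrors one element of the JSON list described above.
// The struct is hypothetical; only the JSON keys come from the documentation.
type Result struct {
	URL         string `json:"url"`
	Title       string `json:"title,omitempty"`
	Type        string `json:"url_type,omitempty"`
	Description string `json:"description,omitempty"`
	SiteName    string `json:"site_name,omitempty"`
	Favicon     string `json:"favicon,omitempty"`
	Image       string `json:"image,omitempty"`
	ImageWidth  int    `json:"image_width,omitempty"`  // assumed to be an integer
	ImageHeight int    `json:"image_height,omitempty"` // assumed to be an integer
}

// sample is a shortened version of the response shown above,
// with several attributes deliberately left out.
const sample = `[{"url":"https://www.youtube.com/watch?v=dQw4w9WgXcQ","site_name":"YouTube"}]`

// decodeResults parses a response body into a slice of Result values.
func decodeResults(data []byte) ([]Result, error) {
	var out []Result
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	results, err := decodeResults([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, r := range results {
		// Attributes omitted from the response are left at their zero values.
		fmt.Printf("url=%s site=%s hasImage=%t\n", r.URL, r.SiteName, r.Image != "")
	}
}
```

Because omitted attributes simply decode to zero values, the caller checks for an empty string or zero dimension rather than for a missing key.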

Additionally, you can supply `callback` to wrap the result in a JavaScript callback (JSONP). The content type of this response is "application/x-javascript".

If the optional `markdown` boolean argument is set (markdown=true), the provided content is parsed as Markdown-formatted text and links are extracted in context-aware mode, i.e. preformatted text blocks are skipped.
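The three request arguments described above (`content`, `callback` and `markdown`) can be assembled with the standard library. The following is a sketch only: `buildQuery` is a hypothetical helper, and the host `http://localhost:8080/` is a placeholder, not an address documented by the service.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQuery encodes the request arguments into a query string.
// It is a hypothetical helper; only the parameter names come from the documentation.
func buildQuery(content string, markdown bool, callback string) string {
	q := url.Values{}
	q.Set("content", content)
	if markdown {
		q.Set("markdown", "true")
	}
	if callback != "" {
		q.Set("callback", callback)
	}
	// Encode percent-escapes the values and sorts the keys alphabetically.
	return q.Encode()
}

func main() {
	// The base URL is a placeholder for wherever the service is deployed.
	base := "http://localhost:8080/?"
	fmt.Println(base + buildQuery("Check this out https://www.youtube.com/watch?v=dQw4w9WgXcQ", false, ""))
	fmt.Println(base + buildQuery("see `https://example.com`", true, "handleUnfurl"))
}
```

Escaping the content this way keeps any `&` or `#` inside an embedded URL from being interpreted as part of the outer request.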

Security

Take care when running this service inside an internal network, since it may disclose internal endpoints. It is a good idea to run the service on a separate host in an isolated subnet.

Alternatively, access to internal resources may be limited with firewall rules. For example, if the service runs as the 'unfurlist' user on a Linux box, the following iptables and ip6tables rules can reduce the chances of it connecting to internal endpoints (note that the IPv6 rules cover only loopback and link-local addresses):

iptables -A OUTPUT -m owner --uid-owner unfurlist -p tcp --syn \
	-d 127/8,10/8,169.254/16,172.16/12,192.168/16 \
	-j REJECT --reject-with icmp-net-prohibited
ip6tables -A OUTPUT -m owner --uid-owner unfurlist -p tcp --syn \
	-d ::1/128,fe80::/10 \
	-j REJECT --reject-with adm-prohibited
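As a complement to firewall rules, a similar restriction can be sketched at the application level with a dialer that refuses internal addresses. This is not something the package documents; `denyInternal` and `newRestrictedClient` are hypothetical names, and the address ranges are an assumption modelled on the rules above. A client built this way could be handed to the handler through WithHTTPClient.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/netip"
	"syscall"
	"time"
)

var errInternal = errors.New("refusing to connect to internal address")

// denyInternal is a net.Dialer Control hook. It runs after DNS resolution,
// so address is the literal IP:port the dialer is about to connect to.
func denyInternal(network, address string, _ syscall.RawConn) error {
	ap, err := netip.ParseAddrPort(address)
	if err != nil {
		return err // fail closed on anything that cannot be parsed
	}
	ip := ap.Addr().Unmap() // treat IPv4-mapped IPv6 addresses as IPv4
	if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() || ip.IsUnspecified() {
		return errInternal
	}
	return nil
}

// newRestrictedClient returns an http.Client whose connections all pass
// through denyInternal. The timeouts are arbitrary example values.
func newRestrictedClient() *http.Client {
	dialer := &net.Dialer{Timeout: 10 * time.Second, Control: denyInternal}
	return &http.Client{
		Transport: &http.Transport{DialContext: dialer.DialContext},
		Timeout:   30 * time.Second,
	}
}

func main() {
	_ = newRestrictedClient()
	for _, addr := range []string{"127.0.0.1:80", "10.1.2.3:443", "[::1]:80", "93.184.216.34:80"} {
		fmt.Printf("%-18s blocked=%t\n", addr, denyInternal("tcp", addr, nil) != nil)
	}
}
```

Checking in the Control hook rather than before the request means the decision is made on the resolved address, so a public hostname that resolves to an internal IP is still refused.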

Constants

const DefaultMaxResults = 20

DefaultMaxResults is the maximum number of URLs to process when no limit is configured with the WithMaxResults function.

Functions

func New

func New(conf ...ConfFunc) http.Handler

New returns a new initialized unfurl handler. If no configuration functions are provided, sane defaults are used.

func ParseURLs

func ParseURLs(content string) []string

ParseURLs tries to extract unique URL-like substrings (http/https schemes only) from the given text. Results may not be valid URLs, since the function only searches for sequences of matching characters. It is optimized for extracting URLs from plain text, where they can be mixed with punctuation: trailing symbols []()<>,;. are removed, but a trailing >, ] or ) is kept if an opening <, [ or ( is found inside the URL.

Example
text := `This text contains various urls mixed with different reserved per rfc3986 characters:
	http://google.com, https://doist.com/#about (also see https://todoist.com), <http://example.com/foo>,
	**[markdown](http://daringfireball.net/projects/markdown/)**,
	http://marvel-movies.wikia.com/wiki/The_Avengers_(film), https://pt.wikipedia.org/wiki/Mamão.
	https://docs.live.net/foo/?section-id={D7CEDACE-AEFB-4B61-9C63-BDE05EEBD80A},
	http://example.com/?param=foo;bar
	HTTPS://EXAMPLE.COM/UPPERCASE
	hTtP://example.com/mixedCase
	`
for _, u := range ParseURLs(text) {
	fmt.Println(u)
}
Output:

http://google.com
https://doist.com/#about
https://todoist.com
http://example.com/foo
http://daringfireball.net/projects/markdown/
http://marvel-movies.wikia.com/wiki/The_Avengers_(film)
https://pt.wikipedia.org/wiki/Mamão
https://docs.live.net/foo/?section-id={D7CEDACE-AEFB-4B61-9C63-BDE05EEBD80A}
http://example.com/?param=foo;bar
HTTPS://EXAMPLE.COM/UPPERCASE
hTtP://example.com/mixedCase
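The trailing-punctuation rule described for ParseURLs can be illustrated in isolation. The sketch below is not the package's implementation: `trimTrailing` is a hypothetical helper, and it assumes that a closing bracket is kept only when its own matching opener appears somewhere in the candidate string.

```go
package main

import (
	"fmt"
	"strings"
)

// opener maps each closing bracket to its opening counterpart.
var opener = map[byte]byte{']': '[', ')': '(', '>': '<'}

// trimTrailing strips trailing punctuation from a URL-like candidate.
// A closing bracket survives only if its opener occurs inside the string
// (an assumption about how the documented rule is applied).
func trimTrailing(s string) string {
	for len(s) > 0 {
		last := s[len(s)-1]
		switch last {
		case ',', ';', '.', '[', '(', '<':
			s = s[:len(s)-1]
		case ']', ')', '>':
			if strings.IndexByte(s, opener[last]) >= 0 {
				return s // the bracket is balanced by an opener, keep it
			}
			s = s[:len(s)-1]
		default:
			return s
		}
	}
	return s
}

func main() {
	for _, candidate := range []string{
		"http://google.com,",
		"https://todoist.com),",
		"http://example.com/foo>,",
		"http://marvel-movies.wikia.com/wiki/The_Avengers_(film),",
	} {
		fmt.Println(trimTrailing(candidate))
	}
}
```

This is why the Wikia link in the example keeps its closing parenthesis while the parenthesis after https://todoist.com is dropped.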

Types

type ConfFunc

type ConfFunc func(*unfurlHandler) *unfurlHandler

ConfFunc is used to configure a new unfurl handler; such functions should be passed as arguments to the New function.
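ConfFunc follows the shape of the functional-options pattern: each option is a function that receives the handler and returns it after adjusting one setting. The sketch below reproduces that shape with stand-in types so it can run without the package. `handler`, `confFunc`, `newHandler` and the two options are hypothetical mirrors, and ignoring a non-positive limit is an assumption.

```go
package main

import "fmt"

// defaultMaxResults mirrors the documented DefaultMaxResults constant.
const defaultMaxResults = 20

// handler is a stand-in for the package's unexported unfurlHandler type.
type handler struct {
	maxResults      int
	imageDimensions bool
}

// confFunc has the same shape as the documented ConfFunc type.
type confFunc func(*handler) *handler

// withMaxResults limits how many URLs are processed.
// Non-positive values are ignored here, which is an assumption.
func withMaxResults(n int) confFunc {
	return func(h *handler) *handler {
		if n > 0 {
			h.maxResults = n
		}
		return h
	}
}

// withImageDimensions toggles fetching of image dimensions.
func withImageDimensions(enable bool) confFunc {
	return func(h *handler) *handler {
		h.imageDimensions = enable
		return h
	}
}

// newHandler starts from defaults and applies each option in order.
func newHandler(conf ...confFunc) *handler {
	h := &handler{maxResults: defaultMaxResults}
	for _, f := range conf {
		h = f(h)
	}
	return h
}

func main() {
	fmt.Printf("%+v\n", *newHandler())
	fmt.Printf("%+v\n", *newHandler(withMaxResults(5), withImageDimensions(true)))
}
```

With this design, a caller who passes no options gets the defaults, and new settings can be added later without changing the constructor's signature.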

func WithBlocklistPrefixes

func WithBlocklistPrefixes(prefixes []string) ConfFunc

WithBlocklistPrefixes configures unfurl handler to skip unfurling urls matching any provided prefix

func WithBlocklistTitles

func WithBlocklistTitles(substrings []string) ConfFunc

WithBlocklistTitles configures unfurl handler to skip unfurling urls that return pages whose title contains one of the provided substrings
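The two blocklist checks amount to a prefix match on the URL and a substring match on the page title. A minimal sketch of that logic follows; the helper names are hypothetical, and case-sensitive comparison is an assumption, since the documentation does not say how the package compares strings.

```go
package main

import (
	"fmt"
	"strings"
)

// hasBlockedPrefix reports whether rawURL starts with any of the prefixes.
// Comparison is case-sensitive here, which is an assumption.
func hasBlockedPrefix(rawURL string, prefixes []string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(rawURL, p) {
			return true
		}
	}
	return false
}

// hasBlockedTitle reports whether title contains any of the substrings.
func hasBlockedTitle(title string, substrings []string) bool {
	for _, s := range substrings {
		if strings.Contains(title, s) {
			return true
		}
	}
	return false
}

func main() {
	prefixes := []string{"https://intranet.example.com/"}
	titles := []string{"Sign in", "Access denied"}

	fmt.Println(hasBlockedPrefix("https://intranet.example.com/wiki", prefixes))
	fmt.Println(hasBlockedPrefix("https://example.com/", prefixes))
	fmt.Println(hasBlockedTitle("Sign in to continue", titles))
}
```

The prefix check can run before any request is made, whereas the title check is only possible once the page has been fetched.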

func WithExtraHeaders

func WithExtraHeaders(hdr map[string]string) ConfFunc

WithExtraHeaders configures unfurl handler to add extra headers to each outgoing http request

func WithFetchers

func WithFetchers(fetchers ...FetchFunc) ConfFunc

WithFetchers attaches custom fetchers to unfurl handler created by New().

func WithHTTPClient

func WithHTTPClient(client *http.Client) ConfFunc

WithHTTPClient configures unfurl handler to use provided http.Client for outgoing requests

func WithImageDimensions

func WithImageDimensions(enable bool) ConfFunc

WithImageDimensions configures whether unfurl handler fetches image dimensions.

func WithLogger

func WithLogger(l Logger) ConfFunc

WithLogger configures unfurl handler to use provided logger

func WithMaxResults

func WithMaxResults(n int) ConfFunc

WithMaxResults configures unfurl handler to only process n first urls it finds. n must be positive.

func WithMemcache

func WithMemcache(client *memcache.Client) ConfFunc

WithMemcache configures unfurl handler to cache metadata in memcached

func WithOembedLookupFunc

func WithOembedLookupFunc(fn oembed.LookupFunc) ConfFunc

WithOembedLookupFunc configures unfurl handler to use custom oembed.LookupFunc for oembed lookups.

type FetchFunc

type FetchFunc func(context.Context, *http.Client, *url.URL) (*Metadata, bool)

FetchFunc defines custom metadata fetchers that can be attached to unfurl handler
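A custom fetcher only has to match the FetchFunc signature. The sketch below declares local mirrors of FetchFunc and Metadata so it can run without the package. The host `status.example.com`, the `exampleFetcher` and `fetchFor` names, and the reading of the boolean result as "this fetcher produced metadata" are all assumptions.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/url"
)

// Metadata is a local mirror of the package's Metadata type.
type Metadata struct {
	Title       string
	Type        string
	Description string
	Image       string
	ImageWidth  int
	ImageHeight int
}

// Valid follows the documented rule: at least one of Title, Description
// or Image must be non-empty. The nil check is an addition of this sketch.
func (m *Metadata) Valid() bool {
	return m != nil && (m.Title != "" || m.Description != "" || m.Image != "")
}

// FetchFunc is a local mirror of the documented type.
type FetchFunc func(context.Context, *http.Client, *url.URL) (*Metadata, bool)

// exampleFetcher recognises a single hypothetical host and builds metadata
// without any network access. It returns false for URLs it does not handle.
func exampleFetcher(_ context.Context, _ *http.Client, u *url.URL) (*Metadata, bool) {
	if u.Hostname() != "status.example.com" {
		return nil, false
	}
	return &Metadata{
		Title:       "Service status",
		Type:        "website",
		Description: "Status page for " + u.Path,
	}, true
}

// fetchFor parses rawURL and runs exampleFetcher on it.
func fetchFor(rawURL string) (*Metadata, bool) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, false
	}
	var f FetchFunc = exampleFetcher
	return f(context.Background(), nil, u)
}

func main() {
	if m, ok := fetchFor("https://status.example.com/api"); ok {
		fmt.Printf("%+v valid=%t\n", *m, m.Valid())
	}
	_, ok := fetchFor("https://example.org/")
	fmt.Println("handled:", ok)
}
```

A fetcher with the real package types would be attached through WithFetchers when calling New.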

func GoogleMapsFetcher

func GoogleMapsFetcher(key string) FetchFunc

GoogleMapsFetcher returns FetchFunc that recognizes some Google Maps urls and constructs metadata for them containing preview image from Google Static Maps API. The only argument is the API key to create image links with.

type Logger

type Logger interface {
	Print(v ...any)
	Printf(format string, v ...any)
	Println(v ...any)
}

Logger describes the set of methods used by unfurl handler for logging; the standard library's *log.Logger implements this interface.
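The claim that *log.Logger satisfies this interface can be checked at compile time. The sketch below redeclares the interface locally so it runs on its own; the `demo` helper and the "unfurlist: " prefix are illustrative choices, not package behaviour.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
)

// Logger is a local copy of the interface shown above.
type Logger interface {
	Print(v ...any)
	Printf(format string, v ...any)
	Println(v ...any)
}

// Compile-time assertion that *log.Logger implements Logger.
var _ Logger = (*log.Logger)(nil)

// demo writes one message through the interface into a buffer and
// returns what was logged. Flags are 0 so no timestamp is added.
func demo() string {
	var buf bytes.Buffer
	var l Logger = log.New(&buf, "unfurlist: ", 0)
	l.Printf("fetched %d urls", 2)
	return buf.String()
}

func main() {
	fmt.Print(demo())
}
```

Any type with these three methods works as well, which makes it straightforward to adapt a structured logger with a thin wrapper.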

type Metadata

type Metadata struct {
	Title       string
	Type        string // TODO: make this int8 w/enum constants
	Description string
	Image       string // image/thumbnail url
	ImageWidth  int
	ImageHeight int
}

Metadata represents metadata retrieved by FetchFunc. At least one of the Title, Description or Image attributes is expected to be non-empty.

func (*Metadata) Valid

func (m *Metadata) Valid() bool

Valid checks that at least one of the mandatory attributes is non-empty

Directories

Path                 Synopsis
cmd/unfurlist        Command unfurlist implements http server exposing API endpoint
internal/useragent   This is a vendored copy of https://github.com/artyom/useragent
