extractor

package

v0.0.0-...-9ce7f06 Published: Aug 7, 2019 License: MIT Imports: 8 Imported by: 0

Repository

github.com/timtosi/mcrawler

Documentation

Index

func GetImg(t html.Token, tokenType html.TokenType) string
func GetLinkBasic(t html.Token, tokenType html.TokenType) string
func GetLinkNoFollow(t html.Token, tokenType html.TokenType) string
type CheckFunc
type Extractor
- func NewExtractor(checkFuncs ...CheckFunc) *Extractor
- func (e *Extractor) ExtractLinks(baseURL string, content []byte) []string
- func (e *Extractor) Pipe(wg *sync.WaitGroup, in <-chan *domain.Target, out chan<- *domain.Target)

Constants

This section is empty.

Variables

This section is empty.

Functions

func GetImg

func GetImg(t html.Token, tokenType html.TokenType) string

GetImg is an `extractor.CheckFunc` used to retrieve image URLs from a web page. It analyses the token `t` together with its `tokenType`, and returns the image URL, or an empty `string` if `t` does not correspond to an image.

func GetLinkBasic

func GetLinkBasic(t html.Token, tokenType html.TokenType) string

GetLinkBasic is an `extractor.CheckFunc` used to retrieve link URLs from a web page. It uses `t` as the token to analyse and its `tokenType`. It returns the link value or an empty `string` if `t` does not correspond to a link.

NOTE: This function ignores the `nofollow` meta tag.

func GetLinkNoFollow

func GetLinkNoFollow(t html.Token, tokenType html.TokenType) string

GetLinkNoFollow is an `extractor.CheckFunc` used to retrieve link URLs from a web page. It uses `t` as the token to analyse and its `tokenType`. It returns the link value or an empty `string` if `t` does not correspond to a link.

NOTE: This function respects the `nofollow` meta tag.

Types

type CheckFunc

type CheckFunc func(html.Token, html.TokenType) string

CheckFunc is a named type representing a function that checks if an `html.Token` has a link that can be crawled.
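As an illustration of the `CheckFunc` contract, here is a minimal sketch of a custom check. So that it runs with the standard library alone, it declares local stand-in types (`Token`, `Attribute`, `TokenType`) that only mirror part of the shape of the `golang.org/x/net/html` types; `getAnchorHref` and `attr` are hypothetical names, and the `rel="nofollow"` handling is an assumption for the example, not a description of how this package implements its own checks.

```go
package main

import (
	"fmt"
	"strings"
)

// TokenType, Attribute and Token are local stand-ins that mirror a subset of
// the golang.org/x/net/html types. They are NOT the real types.
type TokenType int

const (
	TextToken TokenType = iota
	StartTagToken
	EndTagToken
	SelfClosingTagToken
)

type Attribute struct {
	Key, Val string
}

type Token struct {
	Data string // tag name for tag tokens
	Attr []Attribute
}

// CheckFunc mirrors the documented signature, using the stand-in types.
type CheckFunc func(Token, TokenType) string

// attr returns the value of the first attribute named key, or "".
func attr(t Token, key string) string {
	for _, a := range t.Attr {
		if a.Key == key {
			return a.Val
		}
	}
	return ""
}

// getAnchorHref (hypothetical) returns the href of an <a> start tag, or ""
// when the token is not an <a> start tag or is marked rel="nofollow".
func getAnchorHref(t Token, tt TokenType) string {
	if tt != StartTagToken || t.Data != "a" {
		return ""
	}
	for _, v := range strings.Fields(attr(t, "rel")) {
		if strings.EqualFold(v, "nofollow") {
			return ""
		}
	}
	return attr(t, "href")
}

func main() {
	var check CheckFunc = getAnchorHref
	tok := Token{Data: "a", Attr: []Attribute{{Key: "href", Val: "/about"}}}
	fmt.Println(check(tok, StartTagToken))
}
```

With the real package, a function of this shape would presumably take `html.Token` and `html.TokenType` instead and be passed to `NewExtractor`.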

type Extractor

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor is a `struct` that extracts links found in a web page according to the results of its inner `CheckFunc` functions.

func NewExtractor

func NewExtractor(checkFuncs ...CheckFunc) *Extractor

NewExtractor returns a new `*extractor.Extractor` that uses the given `checkFuncs` to decide which tokens yield links.

func (*Extractor) ExtractLinks

func (e *Extractor) ExtractLinks(baseURL string, content []byte) []string

ExtractLinks extracts, cleans, and returns a `[]string` of the links found in `content` that match at least one of the Extractor's `CheckFunc` functions (`e.cf`).
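The documentation does not say what "cleaning" involves or how `baseURL` is used. One plausible reading is that raw links are resolved against the base URL and normalised; the sketch below shows that idea with `net/url` only. The helper name `resolveLinks` and the specific rules (dropping fragments, keeping only http/https, removing duplicates) are assumptions made for the example, not the package's documented behaviour.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLinks (hypothetical) resolves each raw link against baseURL, drops
// the fragment, skips unparsable or non-HTTP(S) links, and removes duplicates
// while preserving first-seen order.
func resolveLinks(baseURL string, raw []string) []string {
	base, err := url.Parse(baseURL)
	if err != nil {
		return nil
	}
	seen := make(map[string]bool)
	var out []string
	for _, r := range raw {
		ref, err := url.Parse(r)
		if err != nil {
			continue // skip links that cannot be parsed
		}
		abs := base.ResolveReference(ref)
		if abs.Scheme != "http" && abs.Scheme != "https" {
			continue // e.g. mailto: or javascript: links
		}
		abs.Fragment = "" // "#section" does not identify a different page
		s := abs.String()
		if !seen[s] {
			seen[s] = true
			out = append(out, s)
		}
	}
	return out
}

func main() {
	links := resolveLinks("https://example.com/blog/", []string{
		"post.html", "/about#team", "mailto:me@example.com", "/about",
	})
	for _, l := range links {
		fmt.Println(l)
	}
}
```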

func (*Extractor) Pipe

func (e *Extractor) Pipe(wg *sync.WaitGroup, in <-chan *domain.Target, out chan<- *domain.Target)

Pipe connects `in` and `out`. Each `*domain.Target` received from `in` is parsed, and the extracted links are sent to `out`.

NOTE: This function loops until `in` is closed, then closes `out`.
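The loop-until-closed behaviour described above is the usual Go pipeline-stage pattern. The sketch below reproduces that pattern with the standard library only. `Target` is a hypothetical stand-in for `domain.Target` (whose fields are not documented here), `pipe` and `run` are illustrative names, and the assumption that the stage calls `wg.Done()` when it finishes is inferred from the `*sync.WaitGroup` parameter rather than stated by the documentation.

```go
package main

import (
	"fmt"
	"sync"
)

// Target is a hypothetical stand-in for domain.Target.
type Target struct {
	URL string
}

// pipe reads from in until it is closed, sends one derived Target to out per
// input, then closes out and signals the WaitGroup (assumed behaviour).
func pipe(wg *sync.WaitGroup, in <-chan *Target, out chan<- *Target) {
	defer wg.Done()
	defer close(out) // deferred calls run last-in first-out: out closes first
	for t := range in {
		// A real stage would fetch/parse t and emit every extracted link.
		out <- &Target{URL: t.URL + "/next"}
	}
}

// run wires a producer, the stage, and a consumer together.
func run(urls []string) []string {
	in := make(chan *Target)
	out := make(chan *Target)

	var wg sync.WaitGroup
	wg.Add(1) // the caller registers the stage before starting it
	go pipe(&wg, in, out)

	go func() {
		for _, u := range urls {
			in <- &Target{URL: u}
		}
		close(in) // closing in is what lets the stage terminate
	}()

	var got []string
	for t := range out { // ends once the stage closes out
		got = append(got, t.URL)
	}
	wg.Wait()
	return got
}

func main() {
	fmt.Println(run([]string{"https://example.com"}))
}
```

Because the stage closes `out` itself, a downstream consumer can simply range over the channel and needs no separate completion signal.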
