pup_lib

package module

Version: v0.0.0-...-afdd2c2 (latest) | Published: Dec 8, 2022 | License: MIT | Imports: 15 | Imported by: 0

Repository

github.com/NordeN37/pup_lib

README ¶

pup_lib is a fork of github.com/ericchiang/pup.

The fork has been cut down to a library so that it can be vendored into other projects.

pup

pup is a command line tool for processing HTML. It reads from stdin, prints to stdout, and allows the user to filter parts of the page using CSS selectors.

Inspired by jq, pup aims to be a fast and flexible way of exploring HTML from the terminal.

Install

Direct downloads are available through the releases page.

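If you have Go installed on your computer, just run go get. On Go 1.18 and later, go get no longer builds or installs binaries; use go install github.com/ericchiang/pup@latest instead.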

go get github.com/ericchiang/pup

If you're on OS X, use Homebrew to install (no Go required).

brew install https://raw.githubusercontent.com/EricChiang/pup/master/pup.rb

Quick start

$ curl -s https://news.ycombinator.com/

Ew, HTML. Let's run that through some pup selectors:

$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a'

Okay, how about only the links?

$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a attr{href}'

Even better, let's grab the titles too:

$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a json{}'

Basic Usage

$ cat index.html | pup [flags] '[selectors] [display function]'

Examples

Download a webpage with wget.

$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html

Clean and indent

By default pup will fill in missing tags and properly indent the page.

$ cat robots.html
# nasty looking HTML
$ cat robots.html | pup --color
# cleaned, indented, and colorful HTML

Filter by tag

$ cat robots.html | pup 'title'
<title>
 Robots exclusion standard - Wikipedia, the free encyclopedia
</title>

Filter by id

$ cat robots.html | pup 'span#See_also'
<span class="mw-headline" id="See_also">
 See also
</span>

Filter by attribute

$ cat robots.html | pup 'th[scope="row"]'
<th scope="row" class="navbox-group">
 Exclusion standards
</th>
<th scope="row" class="navbox-group">
 Related marketing topics
</th>
<th scope="row" class="navbox-group">
 Search marketing related topics
</th>
<th scope="row" class="navbox-group">
 Search engine spam
</th>
<th scope="row" class="navbox-group">
 Linking
</th>
<th scope="row" class="navbox-group">
 People
</th>
<th scope="row" class="navbox-group">
 Other
</th>

Pseudo Classes

CSS selectors have a group of specifiers called "pseudo classes" which are pretty cool. pup implements a majority of the relevant ones.

Here are some examples.

$ cat robots.html | pup 'a[rel]:empty'
<a rel="license" href="//creativecommons.org/licenses/by-sa/3.0/" style="display:none;">
</a>

$ cat robots.html | pup ':contains("History")'
<span class="toctext">
 History
</span>
<span class="mw-headline" id="History">
 History
</span>

$ cat robots.html | pup ':parent-of([action="edit"])'
<span class="wb-langlinks-edit wb-langlinks-link">
 <a action="edit" href="//www.wikidata.org/wiki/Q80776#sitelinks-wikipedia" text="Edit links" title="Edit interlanguage links" class="wbc-editpage">
  Edit links
 </a>
</span>

For a complete list, view the implemented selectors section.

`+`, `>`, and `,`

These are intermediate characters that declare special instructions. For instance, a comma `,` lets you specify multiple groups of selectors in a single pup call.

$ cat robots.html | pup 'title, h1 span[dir="auto"]'
<title>
 Robots exclusion standard - Wikipedia, the free encyclopedia
</title>
<span dir="auto">
 Robots exclusion standard
</span>

Chain selectors together

When combining selectors, the HTML nodes selected by the previous selector are passed to the next one.

$ cat robots.html | pup 'h1#firstHeading'
<h1 id="firstHeading" class="firstHeading" lang="en">
 <span dir="auto">
  Robots exclusion standard
 </span>
</h1>

$ cat robots.html | pup 'h1#firstHeading span'
<span dir="auto">
 Robots exclusion standard
</span>

Implemented Selectors

For further examples of these selectors head over to MDN.

pup '.class'
pup '#id'
pup 'element'
pup 'selector + selector'
pup 'selector > selector'
pup '[attribute]'
pup '[attribute="value"]'
pup '[attribute*="value"]'
pup '[attribute~="value"]'
pup '[attribute^="value"]'
pup '[attribute$="value"]'
pup ':empty'
pup ':first-child'
pup ':first-of-type'
pup ':last-child'
pup ':last-of-type'
pup ':only-child'
pup ':only-of-type'
pup ':contains("text")'
pup ':nth-child(n)'
pup ':nth-of-type(n)'
pup ':nth-last-child(n)'
pup ':nth-last-of-type(n)'
pup ':not(selector)'
pup ':parent-of(selector)'

You can mix and match selectors as you wish.

cat index.html | pup 'element#id[attribute="value"]:first-of-type'
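The `:nth-*` pseudo classes take a CSS-style argument such as the `n+2` used in the quick start. As a rough sketch of the arithmetic only, assuming pup follows the CSS `an+b` definition, an element at 1-based position `pos` matches when `pos = a*n + b` for some integer `n >= 0`. The helper `matchesNth` is hypothetical and is not part of pup or pup_lib.

```go
package main

import "fmt"

// matchesNth reports whether the 1-based position pos satisfies
// pos == a*n + b for some integer n >= 0 (the CSS "an+b" rule).
// Hypothetical helper for illustration; not part of pup's API.
func matchesNth(a, b, pos int) bool {
	if a == 0 {
		// With no step, only the exact position b matches.
		return pos == b
	}
	diff := pos - b
	// n = diff / a must be a non-negative integer.
	return diff%a == 0 && diff/a >= 0
}

func main() {
	// "n+2" means a=1, b=2: every position from the second onwards.
	for pos := 1; pos <= 4; pos++ {
		fmt.Println(pos, matchesNth(1, 2, pos))
	}
}
```

For the `:nth-last-*` variants the position is counted from the end, so under this reading `tr:nth-last-of-type(n+2)` selects every row except the last one of its type.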

Display Functions

Non-HTML selectors which affect the output type are implemented as functions which can be provided as a final argument.
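To make the `name{argument}` shape of a display function concrete, here is a minimal sketch of how such a final argument could be split into its parts. The helper `parseDisplayFunc` is an assumption made for illustration; it is not the library's `ParseDisplayer`, whose behavior is not documented here.

```go
package main

import (
	"fmt"
	"strings"
)

// parseDisplayFunc splits a display function such as "attr{href}" into
// its name ("attr") and argument ("href"). ok is false when cmd does not
// have the name{argument} shape. Hypothetical helper, not pup's parser.
func parseDisplayFunc(cmd string) (name, arg string, ok bool) {
	open := strings.IndexByte(cmd, '{')
	if open <= 0 || !strings.HasSuffix(cmd, "}") {
		return "", "", false
	}
	return cmd[:open], cmd[open+1 : len(cmd)-1], true
}

func main() {
	for _, cmd := range []string{"text{}", "attr{href}", "json{}", "div"} {
		name, arg, ok := parseDisplayFunc(cmd)
		fmt.Printf("%q -> name=%q arg=%q ok=%v\n", cmd, name, arg, ok)
	}
}
```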

`text{}`

Print all text from selected nodes and children in depth first order.

$ cat robots.html | pup '.mw-headline text{}'
History
About the standard
Disadvantages
Alternatives
Examples
Nonstandard extensions
Crawl-delay directive
Allow directive
Sitemap
Host
Universal "*" match
Meta tags and headers
See also
References
External links
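The depth-first order described above can be sketched with a small recursive walk. The `node` type below is a stand-in so the example needs no third-party imports (the library itself works on `*html.Node`), and trimming whitespace-only text is an assumption rather than confirmed pup behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a minimal stand-in for an HTML node: element nodes carry
// children, text nodes carry text. It is not the library's type.
type node struct {
	text     string
	children []*node
}

// collectText appends the trimmed text of n and its descendants to out
// in depth-first (document) order, skipping whitespace-only text.
func collectText(n *node, out []string) []string {
	if t := strings.TrimSpace(n.text); t != "" {
		out = append(out, t)
	}
	for _, c := range n.children {
		out = collectText(c, out)
	}
	return out
}

func main() {
	// Roughly: <span> History <b>of robots</b></span>
	tree := &node{children: []*node{
		{text: " History "},
		{children: []*node{{text: "of robots"}}},
	}}
	for _, line := range collectText(tree, nil) {
		fmt.Println(line)
	}
}
```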

`attr{attrkey}`

Print the values of all attributes with a given key from all selected nodes.

$ cat robots.html | pup '.catlinks div attr{id}'
mw-normal-catlinks
mw-hidden-catlinks

`json{}`

Print HTML as JSON.

$ cat robots.html | pup 'div#p-namespaces a'
<a href="/wiki/Robots_exclusion_standard" title="View the content page [c]" accesskey="c">
 Article
</a>
<a href="/wiki/Talk:Robots_exclusion_standard" title="Discussion about the content page [t]" accesskey="t">
 Talk
</a>

$ cat robots.html | pup 'div#p-namespaces a json{}'
[
 {
  "accesskey": "c",
  "href": "/wiki/Robots_exclusion_standard",
  "tag": "a",
  "text": "Article",
  "title": "View the content page [c]"
 },
 {
  "accesskey": "t",
  "href": "/wiki/Talk:Robots_exclusion_standard",
  "tag": "a",
  "text": "Talk",
  "title": "Discussion about the content page [t]"
 }
]

Use the -i / --indent flag to control the indent level.

$ cat robots.html | pup -i 4 'div#p-namespaces a json{}'
[
    {
        "accesskey": "c",
        "href": "/wiki/Robots_exclusion_standard",
        "tag": "a",
        "text": "Article",
        "title": "View the content page [c]"
    },
    {
        "accesskey": "t",
        "href": "/wiki/Talk:Robots_exclusion_standard",
        "tag": "a",
        "text": "Talk",
        "title": "Discussion about the content page [t]"
    }
]

If the selectors return only one element, the result is printed as a JSON object, not a list.

$ cat robots.html | pup --indent 4 'title json{}'
{
    "tag": "title",
    "text": "Robots exclusion standard - Wikipedia, the free encyclopedia"
}

Because there is no universal standard for converting HTML/XML to JSON, a method has been chosen which hopefully fits. The goal is simply to get the output of pup into a more consumable format.
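As an illustration of the object-versus-list rule and the indent setting described in this section, the following sketch uses only the standard `encoding/json` package. Representing each node as a `map[string]string` and the helper name `marshalNodes` are assumptions made for the example; they are not the types `JSONDisplayer` uses internally.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// marshalNodes follows the README's description: a single element is
// encoded as a JSON object, several elements as a JSON list. indent is
// the number of spaces per level. Hypothetical helper, not pup's code.
func marshalNodes(nodes []map[string]string, indent int) (string, error) {
	var v interface{} = nodes
	if len(nodes) == 1 {
		v = nodes[0]
	}
	out, err := json.MarshalIndent(v, "", strings.Repeat(" ", indent))
	return string(out), err
}

func main() {
	title := map[string]string{
		"tag":  "title",
		"text": "Robots exclusion standard - Wikipedia, the free encyclopedia",
	}
	out, err := marshalNodes([]map[string]string{title}, 4)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Go encodes map keys in sorted order, which happens to match the alphabetical key order in the sample output above.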

Flags

Run pup --help for a list of further options.

Documentation ¶

Index ¶

Variables
func ParseArgs() ([]string, error)
func ParseAttrMatcher(selector *CSSSelector, s scanner.Scanner) error
func ParseClassMatcher(selector *CSSSelector, s scanner.Scanner) error
func ParseCommands(cmdString string) ([]string, error)
func ParseDisplayer(cmd string) error
func ParseHTML(r io.Reader, cs string) (*html.Node, error)
func ParseIdMatcher(selector *CSSSelector, s scanner.Scanner) error
func ParsePseudo(selector *CSSSelector, s scanner.Scanner) error
func ParseTagMatcher(selector *CSSSelector, s scanner.Scanner) error
func PrintHelp(w io.Writer, exitCode int)
func ProcessFlags(cmds []string) (nonFlagCmds []string, err error)
type AttrDisplayer
- func (a AttrDisplayer) Display(nodes []*html.Node)
type CSSSelector
- func ParseSelector(cmd string) (selector CSSSelector, err error)
- func (s CSSSelector) Match(node *html.Node) bool
type Displayer
type JSONDisplayer
- func (j JSONDisplayer) Display(nodes []*html.Node)
type NumDisplayer
- func (d NumDisplayer) Display(nodes []*html.Node)
type PseudoClass
type Selector
type SelectorFunc
- func Select(s Selector) SelectorFunc
- func SelectFromChildren(s Selector) SelectorFunc
- func SelectNextSibling(s Selector) SelectorFunc
type TextDisplayer
- func (t TextDisplayer) Display(nodes []*html.Node)
type TreeDisplayer
- func (t TreeDisplayer) Display(nodes []*html.Node)

Constants ¶

This section is empty.

Variables ¶

var VERSION = "0.4.0"

Functions ¶

func ParseArgs ¶

func ParseArgs() ([]string, error)

func ParseAttrMatcher ¶

func ParseAttrMatcher(selector *CSSSelector, s scanner.Scanner) error

Parse an attribute matcher e.g. `[attr^="http"]`

func ParseClassMatcher ¶

func ParseClassMatcher(selector *CSSSelector, s scanner.Scanner) error

Parse a class matcher e.g. `.btn`

func ParseCommands ¶

func ParseCommands(cmdString string) ([]string, error)

Split a string with awareness for quoted text and commas
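A rough sketch of what quote-aware splitting can look like is given below. It is not the library's implementation: the helper `splitCommands` and its exact rules (split on spaces, emit each comma as its own token, leave quoted text intact) are assumptions chosen to match the one-line description above.

```go
package main

import (
	"errors"
	"fmt"
)

// splitCommands breaks s on spaces, emits each comma as its own token,
// and leaves spaces and commas inside single or double quotes untouched.
// Hypothetical stand-in for illustration; not pup_lib's ParseCommands.
func splitCommands(s string) ([]string, error) {
	var tokens []string
	var cur []rune
	var quote rune // 0 while outside quotes

	// flush stores the token collected so far, if any.
	flush := func() {
		if len(cur) > 0 {
			tokens = append(tokens, string(cur))
			cur = cur[:0]
		}
	}

	for _, r := range s {
		switch {
		case quote != 0:
			// Inside quotes: copy everything until the closing quote.
			cur = append(cur, r)
			if r == quote {
				quote = 0
			}
		case r == '"' || r == '\'':
			quote = r
			cur = append(cur, r)
		case r == ' ':
			flush()
		case r == ',':
			flush()
			tokens = append(tokens, ",")
		default:
			cur = append(cur, r)
		}
	}
	if quote != 0 {
		return nil, errors.New("unterminated quote")
	}
	flush()
	return tokens, nil
}

func main() {
	tokens, err := splitCommands(`title, h1 span[dir="auto"] :contains("a, b")`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", tokens)
}
```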

func ParseDisplayer ¶

func ParseDisplayer(cmd string) error

func ParseHTML ¶

func ParseHTML(r io.Reader, cs string) (*html.Node, error)

Parse the html while handling the charset

func ParseIdMatcher ¶

func ParseIdMatcher(selector *CSSSelector, s scanner.Scanner) error

Parse an id matcher e.g. `#my-picture`

func ParsePseudo ¶

func ParsePseudo(selector *CSSSelector, s scanner.Scanner) error

Parse the selector after ':'

func ParseTagMatcher ¶

func ParseTagMatcher(selector *CSSSelector, s scanner.Scanner) error

Parse the initial tag e.g. `div`

func PrintHelp ¶

func PrintHelp(w io.Writer, exitCode int)

func ProcessFlags ¶

func ProcessFlags(cmds []string) (nonFlagCmds []string, err error)

Process command arguments and return all non-flags.

Types ¶

type AttrDisplayer ¶

type AttrDisplayer struct {
	Attr string
}

Print the attribute of a node

func (AttrDisplayer) Display ¶

func (a AttrDisplayer) Display(nodes []*html.Node)

type CSSSelector ¶

type CSSSelector struct {
	Tag    string
	Attrs  map[string]*regexp.Regexp
	Pseudo PseudoClass
}
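Because `Attrs` maps attribute names to regular expressions, each attribute operator has to be turned into a pattern at some point. The sketch below shows one plausible mapping that follows the CSS meaning of each operator; the helper `attrPattern` is hypothetical, and the patterns pup_lib actually builds may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// attrPattern builds a regexp for an attribute operator and value.
// Hypothetical helper: the mapping follows the CSS definition of each
// operator and is not taken from the pup_lib source.
func attrPattern(op, value string) (*regexp.Regexp, error) {
	v := regexp.QuoteMeta(value) // treat the value as literal text
	switch op {
	case "=": // exact match
		return regexp.Compile(`^` + v + `$`)
	case "*=": // substring
		return regexp.Compile(v)
	case "^=": // prefix
		return regexp.Compile(`^` + v)
	case "$=": // suffix
		return regexp.Compile(v + `$`)
	case "~=": // whitespace-separated word
		return regexp.Compile(`(^|\s)` + v + `(\s|$)`)
	}
	return nil, fmt.Errorf("unsupported operator %q", op)
}

func main() {
	attrs := map[string]*regexp.Regexp{}
	re, err := attrPattern("^=", "http")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	attrs["href"] = re
	fmt.Println(attrs["href"].MatchString("https://example.com"))
	fmt.Println(attrs["href"].MatchString("/wiki/Talk"))
}
```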

func ParseSelector ¶

func ParseSelector(cmd string) (selector CSSSelector, err error)

Parse a selector e.g. `div#my-button.btn[href^="http"]`

func (CSSSelector) Match ¶

func (s CSSSelector) Match(node *html.Node) bool

type Displayer ¶

type Displayer interface {
	Display([]*html.Node)
}

type JSONDisplayer ¶

type JSONDisplayer struct{}

Print nodes as a JSON list

func (JSONDisplayer) Display ¶

func (j JSONDisplayer) Display(nodes []*html.Node)

type NumDisplayer ¶

type NumDisplayer struct{}

Print the number of nodes returned

func (NumDisplayer) Display ¶

func (d NumDisplayer) Display(nodes []*html.Node)

type PseudoClass ¶

type PseudoClass func(*html.Node) bool

type Selector ¶

type Selector interface {
	Match(node *html.Node) bool
}

type SelectorFunc ¶

type SelectorFunc func(nodes []*html.Node) []*html.Node
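A `SelectorFunc` takes a list of nodes and returns a list of nodes, so selectors can be chained by feeding one function's output into the next. The sketch below shows that idea with a stand-in `node` type, so it needs no third-party imports; `selectorFunc`, `selectTag`, and `chain` are hypothetical names and do not exist in pup_lib.

```go
package main

import "fmt"

// node is a minimal stand-in for *html.Node; the real types operate on
// golang.org/x/net/html nodes.
type node struct {
	tag      string
	children []*node
}

// selectorFunc has the same shape as SelectorFunc, over the stand-in type.
type selectorFunc func(nodes []*node) []*node

// selectTag returns a selectorFunc that collects every descendant of the
// input nodes whose tag equals tag. Overlapping inputs are not deduplicated.
func selectTag(tag string) selectorFunc {
	return func(nodes []*node) []*node {
		var out []*node
		var walk func(n *node)
		walk = func(n *node) {
			for _, c := range n.children {
				if c.tag == tag {
					out = append(out, c)
				}
				walk(c)
			}
		}
		for _, n := range nodes {
			walk(n)
		}
		return out
	}
}

// chain feeds the output of each selectorFunc into the next one.
func chain(nodes []*node, funcs ...selectorFunc) []*node {
	for _, f := range funcs {
		nodes = f(nodes)
	}
	return nodes
}

func main() {
	span := &node{tag: "span"}
	h1 := &node{tag: "h1", children: []*node{span}}
	root := &node{tag: "body", children: []*node{h1, {tag: "span"}}}

	// Similar in spirit to the selector string 'h1 span'.
	got := chain([]*node{root}, selectTag("h1"), selectTag("span"))
	fmt.Println(len(got), got[0] == span)
}
```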

func Select ¶

func Select(s Selector) SelectorFunc

func SelectFromChildren ¶

func SelectFromChildren(s Selector) SelectorFunc

Defined for the '>' selector

func SelectNextSibling ¶

func SelectNextSibling(s Selector) SelectorFunc

Defined for the '+' selector

type TextDisplayer ¶

type TextDisplayer struct{}

Print the text of a node

func (TextDisplayer) Display ¶

func (t TextDisplayer) Display(nodes []*html.Node)

type TreeDisplayer ¶

type TreeDisplayer struct {
}

func (TreeDisplayer) Display ¶

func (t TreeDisplayer) Display(nodes []*html.Node)
