microdata

package module
v1.0.1
Published: Feb 21, 2023 | License: BSD-2-Clause | Imports: 11 | Imported by: 3

README

Microdata

Microdata is a package to extract Microdata and JSON-LD from HTML documents.

HTML Microdata is a markup specification often used together with the schema.org vocabulary to make it easier for search engines to identify and understand content on web pages. One of the most common schemas is the rating shown next to search results. Other schemas describe people, places, events, products, etc.

JSON-LD is a lightweight Linked Data format. It is easy for humans to read and write. It is based on the already successful JSON format and provides a way to help JSON data interoperate at Web-scale.

Go package use

Install the package:

go get -u github.com/astappiev/microdata

Use cases:

// Pass a URL to the `ParseURL` function.
data, err := microdata.ParseURL("https://example.com/page")

// Pass an `io.Reader`, a content type and a base URL to the `ParseHTML` function.
data, err := microdata.ParseHTML(reader, contentType, baseURL)

// Pass an `*html.Node` and a base URL to the `ParseNode` function.
data, err := microdata.ParseNode(node, baseURL)

An example program:

package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/astappiev/microdata"
)

func main() {
	data, err := microdata.ParseURL("https://www.allrecipes.com/recipe/84450/ukrainian-red-borscht-soup/")
	if err != nil {
		log.Fatal(err)
	}

	// Iterate over the extracted items and print their types and properties.
	for _, item := range data.Items {
		fmt.Println(item.Types)
		for key, prop := range item.Properties {
			fmt.Printf("%s: %v\n", key, prop)
		}
	}

	// Print the whole result as indented JSON.
	out, err := json.MarshalIndent(data, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}

Command line use

Install the command line tool:

go install github.com/astappiev/microdata/cmd/microdata@latest

Parse a URL:

$ microdata https://www.gog.com/game/...
{
  "items": [
    {
      "type": [
        "http://schema.org/Product"
      ],
      "properties": {
        "additionalProperty": [
          {
            "type": [
              "http://schema.org/PropertyValue"
            ],
...

Parse HTML from stdin:

$ cat saved.html | microdata

Format the output with a Go template to return the "price" property:

$ microdata -format '{{with index .Items 0}}{{with index .Properties "offers" 0}}{{with index .Properties "price" 0 }}{{ . }}{{end}}{{end}}{{end}}' https://www.gog.com/game/...
8.99
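
The template above walks from the first item to its first "offers" value and then to that offer's first "price" value. The sketch below runs the same template with Go's `text/template` against hand-built data. The types are local stand-ins that copy the shape of the documented `Microdata`, `Item`, `PropertyMap` and `ValueList` types; the sample values and the helpers `renderPrice` and `sampleData` are assumptions made for this example, not part of the package or the command line tool.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Local stand-ins mirroring the shape of the package's documented types.
type ValueList []interface{}
type PropertyMap map[string]ValueList

type Item struct {
	Types      []string
	Properties PropertyMap
}

type Microdata struct {
	Items []*Item
}

// priceTmpl is the template from the command line example.
const priceTmpl = `{{with index .Items 0}}{{with index .Properties "offers" 0}}{{with index .Properties "price" 0 }}{{ . }}{{end}}{{end}}{{end}}`

// renderPrice executes priceTmpl against data and returns the rendered text.
func renderPrice(data *Microdata) (string, error) {
	t, err := template.New("price").Parse(priceTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

// sampleData builds one product with one nested offer (illustrative values only).
func sampleData() *Microdata {
	offer := &Item{
		Types:      []string{"http://schema.org/Offer"},
		Properties: PropertyMap{"price": ValueList{"8.99"}},
	}
	product := &Item{
		Types:      []string{"http://schema.org/Product"},
		Properties: PropertyMap{"offers": ValueList{offer}},
	}
	return &Microdata{Items: []*Item{product}}
}

func main() {
	out, err := renderPrice(sampleData())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

With this stand-in data, `index` returns an error when a list is empty, for example when the first item has no "offers" property, so a template like this is best suited to pages whose structure is known in advance.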

Documentation

Types

type Item

type Item struct {
	Types      []string    `json:"type"`
	Properties PropertyMap `json:"properties"`
	ID         string      `json:"id,omitempty"`
}

func NewItem

func NewItem() *Item

NewItem returns a new Item.

func (*Item) CountPaths added in v1.0.0

func (i *Item) CountPaths(prefix string, paths *map[string]int)

func (*Item) GetNested added in v1.0.0

func (i *Item) GetNested(keys ...string) (data Microdata, ok bool)

func (*Item) GetNestedItem added in v1.0.0

func (i *Item) GetNestedItem(keys ...string) (val *Item, ok bool)

func (*Item) GetProperties added in v1.0.0

func (i *Item) GetProperties(keys ...string) (arr []interface{}, ok bool)

func (*Item) GetProperty added in v1.0.0

func (i *Item) GetProperty(keys ...string) (val interface{}, ok bool)

func (*Item) IsOfSchemaType added in v1.0.1

func (i *Item) IsOfSchemaType(itemType string) bool

func (*Item) IsOfType added in v1.0.0

func (i *Item) IsOfType(itemType ...string) bool

type Microdata

type Microdata struct {
	Items []*Item `json:"items"`
}

func ParseHTML

func ParseHTML(r io.Reader, contentType string, urlStr string) (*Microdata, error)

ParseHTML parses the HTML document read from r and returns the microdata. urlStr is the base URL used to resolve the URLs found in attributes. contentType is used to convert the content of r to UTF-8; when contentType is "", the content type is detected using `http.DetectContentType`.
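
For a feel of what the fallback detection yields, the sketch below calls the standard library's `http.DetectContentType` on a short HTML document. How ParseHTML uses the result internally is not shown here; the helper name `sniff` and the sample markup are assumptions made for this example.

```go
package main

import (
	"fmt"
	"net/http"
)

// sniff returns the content type that http.DetectContentType infers
// from the leading bytes of doc (it considers at most the first 512).
func sniff(doc []byte) string {
	return http.DetectContentType(doc)
}

func main() {
	doc := []byte("<!DOCTYPE html><html><head><title>t</title></head></html>")
	fmt.Println(sniff(doc)) // text/html; charset=utf-8
}
```

Sniffing only inspects the leading bytes, so passing the content type from the HTTP response, when one is available, gives the parser more to work with than an empty string.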

func ParseNode added in v1.0.0

func ParseNode(root *html.Node, urlStr string) (*Microdata, error)

ParseNode parses the root Node and returns the microdata.

func ParseURL

func ParseURL(urlStr string) (*Microdata, error)

ParseURL parses the HTML document available at the given URL and returns the microdata.

func (*Microdata) GetFirstOfSchemaType added in v1.0.1

func (m *Microdata) GetFirstOfSchemaType(itemType string) *Item

GetFirstOfSchemaType returns the first item of the given type with possible https://schema.org/ context.

func (*Microdata) GetFirstOfType added in v1.0.0

func (m *Microdata) GetFirstOfType(itemType ...string) *Item

GetFirstOfType returns the first item of the given type.

type PropertyMap

type PropertyMap map[string]ValueList

type ValueList

type ValueList []interface{}
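
To show how these types nest, the sketch below builds a product with a nested offer and encodes it with `encoding/json`. The types are local copies of the declarations above, struct tags included, so the example runs without the package; the sample values and the helpers `encode` and `sample` are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the documented declarations.
type ValueList []interface{}
type PropertyMap map[string]ValueList

type Item struct {
	Types      []string    `json:"type"`
	Properties PropertyMap `json:"properties"`
	ID         string      `json:"id,omitempty"`
}

type Microdata struct {
	Items []*Item `json:"items"`
}

// encode returns m as compact JSON.
func encode(m *Microdata) (string, error) {
	b, err := json.Marshal(m)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// sample builds one product whose "offers" value is itself an item.
func sample() *Microdata {
	offer := &Item{
		Types:      []string{"http://schema.org/Offer"},
		Properties: PropertyMap{"price": ValueList{"8.99"}},
	}
	product := &Item{
		Types: []string{"http://schema.org/Product"},
		Properties: PropertyMap{
			"name":   ValueList{"Example product"},
			"offers": ValueList{offer},
		},
	}
	return &Microdata{Items: []*Item{product}}
}

func main() {
	out, err := encode(sample())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Because a `ValueList` holds `interface{}` values, a property can mix plain values such as strings with nested items, and the empty `ID` field is left out of the JSON by the `omitempty` tag.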

Directories

Path Synopsis
cmd
