sitex

v0.1.0
Published: Dec 9, 2019 License: MIT Imports: 4 Imported by: 0

README

jdchum/sitex

Package jdchum/sitex reads the text content of a website regardless of its styling, behavior, or structure. It can be used to search site text for keywords and phrases and to monitor that text for changes.

Install

go get -u github.com/jdchum/sitex

Example

package main

import (
    "os"

    "github.com/jdchum/sitex"
)

const url = "https://en.wikipedia.org/wiki/Go_(programming_language)"

func main() {
    // Get the site's text, separating parsed chunks with a single space
    text, err := sitex.GetSiteText(url, " ")
    if err != nil {
        panic(err)
    }

    // Output the text to disk (os.WriteFile requires Go 1.16 or later)
    err = os.WriteFile("out.txt", []byte(text), 0644)
    if err != nil {
        panic(err)
    }
}
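
Once the text is available as a string, searching it for keywords and phrases needs only the standard library. The following is a minimal sketch, not part of sitex: findPhrases is a hypothetical helper, and the literal string in main stands in for the value returned by sitex.GetSiteText so the example runs without network access.

```go
package main

import (
	"fmt"
	"strings"
)

// findPhrases is a hypothetical helper (not part of sitex). It returns the
// phrases that occur in text, ignoring case, in the order they were given.
func findPhrases(text string, phrases []string) []string {
	lower := strings.ToLower(text)
	var found []string
	for _, p := range phrases {
		if strings.Contains(lower, strings.ToLower(p)) {
			found = append(found, p)
		}
	}
	return found
}

func main() {
	// Stand-in for text returned by sitex.GetSiteText(url, " ").
	text := "Go is a statically typed, compiled programming language"

	fmt.Println(findPhrases(text, []string{"compiled", "garbage collected", "GO IS"}))
	// Prints: [compiled GO IS]
}
```

The match is a plain substring test, so "compiled" would also match inside "precompiled". Word-boundary matching would need a regular expression or a tokenizing step.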

API

sitex.GetSiteText(url, sep string) (text string, err error)

Attempts to parse all human-readable text from a webpage. "Invisible" text such as HTML tags, JavaScript, and CSS is ignored.

  • url - URL of the webpage to fetch and parse
  • sep - Separator to place between chunks of parsed text

Returns the text parsed from the webpage, or an error if one occurred.
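
The choice of sep affects how easily the result can be processed afterwards. As a sketch under the assumption that a newline separator does not also occur inside a parsed chunk, the returned text can be split back into chunks with the standard library. splitChunks is a hypothetical helper, not part of sitex, and the literal string stands in for the return value of sitex.GetSiteText(url, "\n").

```go
package main

import (
	"fmt"
	"strings"
)

// splitChunks is a hypothetical helper (not part of sitex). It splits text on
// sep, trims surrounding whitespace from each piece, and drops empty pieces.
func splitChunks(text, sep string) []string {
	var chunks []string
	for _, c := range strings.Split(text, sep) {
		if c = strings.TrimSpace(c); c != "" {
			chunks = append(chunks, c)
		}
	}
	return chunks
}

func main() {
	// Stand-in for text returned by sitex.GetSiteText(url, "\n").
	text := "Go\n\nThe Go Programming Language\n Documentation "

	for i, c := range splitChunks(text, "\n") {
		fmt.Printf("%d: %s\n", i, c)
	}
}
```

A single space, as used in the example above, suits full-text search. A separator that is unlikely to appear in page text makes chunk boundaries recoverable.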

Limitations

Text is parsed as-is from the initial content returned by the server. Content that requires additional network requests or user interaction is therefore not available to the parser.
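
Because only the initial server response is parsed, monitoring a page for changes comes down to comparing the text from successive fetches. One possible approach, sketched here with hypothetical helpers that are not part of sitex, is to store a SHA-256 fingerprint of the text and compare it with the fingerprint of the next fetch. The literal strings stand in for values returned by sitex.GetSiteText.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint is a hypothetical helper (not part of sitex). It returns the
// hex-encoded SHA-256 digest of text.
func fingerprint(text string) string {
	sum := sha256.Sum256([]byte(text))
	return hex.EncodeToString(sum[:])
}

// changed reports whether text differs from the snapshot whose fingerprint
// is prev.
func changed(prev, text string) bool {
	return fingerprint(text) != prev
}

func main() {
	// Stand-ins for text from two successive sitex.GetSiteText calls.
	first := "Release notes: version 1.0"
	second := "Release notes: version 1.1"

	prev := fingerprint(first)
	fmt.Println(changed(prev, first))  // false
	fmt.Println(changed(prev, second)) // true
}
```

Storing a fingerprint rather than the full text keeps the saved state small, at the cost of not being able to show what changed.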

Roadmap

  • Unicode support
  • Parse visible text from attributes
  • Follow server redirects
  • Parse embedded iframes
  • Parse embedded PDF text

License

MIT licensed. Copyright (c) 2019-2020 Joshua Chumbley. See the LICENSE file for details.

Documentation


Functions

func GetSiteText

func GetSiteText(url, sep string) (text string, err error)

GetSiteText attempts to parse all human-readable text from a webpage. "Invisible" text such as HTML tags, JavaScript, and CSS is ignored.
