cleanweb

package
v0.0.0-...-a965f51 Latest
Warning

This package is not in the latest version of its module.

Published: Jan 15, 2024 License: MIT Imports: 13 Imported by: 0

README

CleanWeb

CleanWeb is a Go package that provides functionality for parsing web content. It uses a combination of HTTP requests and a headless browser to fetch and parse web content. The parsed content can be returned as HTML or converted to Markdown. The package also includes caching functionality to store and retrieve parsed content.

Installation

To install the package, use the following command:

go get github.com/vaayne/gtk/cleanweb

Usage

Here is a basic example of how to use the package:

package main

import (
    "context"
    "fmt"
    "github.com/vaayne/gtk/cleanweb"
)

func main() {
    // Create a new Parser
    parser := cleanweb.NewParser()

    // Optionally configure the session, browser, timeout, and output format
    // (uncommenting this line also requires importing "time"):
    // parser := cleanweb.NewParser().WithSession(mySession).WithBrowser(myBrowser).WithTimeout(60 * time.Second).WithFormatMarkdown()

    // Parse a URL
    article, err := parser.Parse(context.Background(), "https://example.com")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    // Print the article's title and content
    fmt.Println("Title:", article.Title)
    fmt.Println("Content:", article.Content)
}

In the commented-out configuration line, mySession and myBrowser are placeholders for your own *session.Session and *rod.Browser instances. The WithFormatMarkdown() call is optional; omit it if you want the content returned as HTML.

Documentation

Overview

Package cleanweb provides functionality for parsing web content.


Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Cache

type Cache interface {
	// Get retrieves the value associated with the provided key.
	Get(key string) (interface{}, bool)
	// SetDefault inserts a value into the cache using the provided key, with a default expiration time.
	SetDefault(key string, value interface{})
}

Cache interface defines methods for getting and setting values with a default expiration time.

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser is a struct that holds the session, browser, timeout, format, and cache client for parsing web content.

func NewParser

func NewParser() *Parser

NewParser creates a new Parser with a default session, timeout, and cache client.

func (*Parser) Parse

func (p *Parser) Parse(ctx context.Context, uri string) (readability.Article, error)

Parse is a method of the Parser struct that takes a context and a URI string. It parses the content at the given URI and returns a readability.Article and an error.

func (*Parser) ParseHtml

func (p *Parser) ParseHtml(ctx context.Context, html string, uri string) (readability.Article, error)

ParseHtml is a method of the Parser struct that takes in a context, an HTML string, and a URI string. It parses the HTML content and returns a readability.Article and an error.

func (*Parser) WithBrowser

func (p *Parser) WithBrowser(browser *rod.Browser) *Parser

WithBrowser sets the browser for the Parser and returns the Parser.

func (*Parser) WithBrowserControlURL

func (p *Parser) WithBrowserControlURL(browserURL string) *Parser

WithBrowserControlURL sets the browser for the Parser using a control URL and returns the Parser.

func (*Parser) WithFormatMarkdown

func (p *Parser) WithFormatMarkdown() *Parser

WithFormatMarkdown sets the format for the Parser to Markdown and returns the Parser.

func (*Parser) WithSession

func (p *Parser) WithSession(sess *session.Session) *Parser

WithSession sets the session for the Parser and returns the Parser.

func (*Parser) WithTimeout

func (p *Parser) WithTimeout(timeout time.Duration) *Parser

WithTimeout sets the timeout for the Parser and returns the Parser.
