website

package v0.15.0-beta
Published: Apr 25, 2024 License: MIT Imports: 13 Imported by: 0

README

---
title: "Website"
lang: "en-US"
draft: false
description: "Learn how to set up a VDP Website connector https://github.com/instill-ai/instill-core"
---

The Website component is a data connector that allows users to scrape websites.
It can carry out the following tasks:

- [Scrape Website](#scrape-website)

## Release Stage

`Alpha`

## Configuration

The component configuration is defined and maintained [here](https://github.com/instill-ai/component/blob/main/pkg/connector/website/v0/config/definition.json).

## Supported Tasks

### Scrape Website

Scrape the website contents.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SCRAPE_WEBSITE` |
| Query (required) | `target_url` | string | The root URL to scrape. All links on this page will be scraped, and all links on those pages, and so on. |
| Allowed Domains | `allowed_domains` | array[string] | A list of domains that are allowed to be scraped. If empty, all domains are allowed. |
| Max Number of Pages (required) | `max_k` | integer | The maximum number of pages to return. If set to 0, all pages are returned; if set to a positive integer, at most that many pages are returned. |
| Include Link Text | `include_link_text` | boolean | Indicates whether to scrape each link and include the text of the link associated with this page in the `link_text` field. |
| Include Link HTML | `include_link_html` | boolean | Indicates whether to scrape each link and include the raw HTML of the link associated with this page in the `link_html` field. |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Pages | `pages` | array[object] | The scraped webpages. |
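
For illustration, the task input is a JSON object whose keys match the ID column above. The sketch below builds and prints such a payload in Go; the URL, domain, and limits are hypothetical values, not defaults.

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Hypothetical TASK_SCRAPE_WEBSITE input; the URL and limits are examples only.
	input := map[string]any{
		"task":              "TASK_SCRAPE_WEBSITE",
		"target_url":        "https://example.com",
		"allowed_domains":   []string{"example.com"},
		"max_k":             10,
		"include_link_text": true,
		"include_link_html": false,
	}

	b, err := json.MarshalIndent(input, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```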

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func Init

func Init(l *zap.Logger, u base.UsageHandler) *connector
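
A minimal usage sketch, assuming the package import path follows the configuration link above and that a no-op zap logger with a nil base.UsageHandler is sufficient for local experimentation:

```go
package main

import (
	"go.uber.org/zap"

	website "github.com/instill-ai/component/pkg/connector/website/v0"
)

func main() {
	// Assumption: a no-op zap logger and a nil base.UsageHandler are acceptable here.
	conn := website.Init(zap.NewNop(), nil)
	_ = conn
}
```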

Types

type PageInfo

type PageInfo struct {
	Link     string `json:"link"`
	Title    string `json:"title"`
	LinkText string `json:"link_text"`
	LinkHTML string `json:"link_html"`
}

type ScrapeWebsiteInput

type ScrapeWebsiteInput struct {
	// TargetURL: The URL of the website to scrape.
	TargetURL string `json:"target_url"`
	// AllowedDomains: The list of allowed domains to scrape.
	AllowedDomains []string `json:"allowed_domains"`
	// MaxK: The maximum number of pages to scrape.
	MaxK int `json:"max_k"`
	// IncludeLinkText: Whether to include the scraped text of the scraped web page.
	IncludeLinkText *bool `json:"include_link_text"`
	// IncludeLinkHTML: Whether to include the scraped HTML of the scraped web page.
	IncludeLinkHTML *bool `json:"include_link_html"`
}

ScrapeWebsiteInput defines the input of the scrape website task.
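
For illustration only, a ScrapeWebsiteInput can be populated directly in Go; the boolean options are pointers, so they are set through local variables. All values below are hypothetical.

```go
package main

import (
	website "github.com/instill-ai/component/pkg/connector/website/v0"
)

func main() {
	includeText := true
	includeHTML := false

	// Hypothetical values; a MaxK of 0 would return all pages.
	in := website.ScrapeWebsiteInput{
		TargetURL:       "https://example.com",
		AllowedDomains:  []string{"example.com"},
		MaxK:            10,
		IncludeLinkText: &includeText,
		IncludeLinkHTML: &includeHTML,
	}
	_ = in
}
```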

type ScrapeWebsiteOutput

type ScrapeWebsiteOutput struct {
	// Pages: The list of pages that were scraped.
	Pages []PageInfo `json:"pages"`
}

ScrapeWebsiteOutput defines the output of the scrape website task.
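
A sketch of consuming the output, using a hypothetical helper and example data:

```go
package main

import (
	"fmt"

	website "github.com/instill-ai/component/pkg/connector/website/v0"
)

// printPages is a hypothetical helper that lists each scraped page.
func printPages(out website.ScrapeWebsiteOutput) {
	for _, p := range out.Pages {
		fmt.Printf("%s: %s\n", p.Link, p.Title)
	}
}

func main() {
	printPages(website.ScrapeWebsiteOutput{
		Pages: []website.PageInfo{{Link: "https://example.com", Title: "Example"}},
	})
}
```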

func Scrape

Scrape crawls a webpage and returns a slice of PageInfo.
