website

package v0.15.0-beta
Published: Apr 25, 2024 License: MIT Imports: 13 Imported by: 0

README

---
title: "Website"
lang: "en-US"
draft: false
description: "Learn how to set up a VDP Website connector https://github.com/instill-ai/instill-core"
---

The Website component is a data connector that allows users to scrape websites.
It can carry out the following tasks:

- [Scrape Website](#scrape-website)

## Release Stage

`Alpha`

## Configuration

The component configuration is defined and maintained [here](https://github.com/instill-ai/component/blob/main/pkg/connector/website/v0/config/definition.json).

## Supported Tasks

### Scrape Website

Scrape the website contents.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SCRAPE_WEBSITE` |
| Query (required) | `target_url` | string | The root URL to scrape. All links on this page will be scraped, and all links on those pages, and so on. |
| Allowed Domains | `allowed_domains` | array[string] | A list of domains that are allowed to be scraped. If empty, all domains are allowed. |
| Max Number of Pages (required) | `max_k` | integer | The maximum number of pages to return. If set to 0, all pages are returned; if set to a positive integer, at most that many pages are returned. |
| Include Link Text | `include_link_text` | boolean | Indicates whether to scrape each link and include the text of the link associated with this page in the `link_text` field. |
| Include Link HTML | `include_link_html` | boolean | Indicates whether to scrape each link and include the raw HTML of the link associated with this page in the `link_html` field. |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Pages | `pages` | array[object] | The scraped webpages. |
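
For illustration, the task input is a JSON object whose keys match the ID column above. The sketch below builds and prints such a payload in Go; the URL, domain, and limits are hypothetical values, not defaults.

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Hypothetical TASK_SCRAPE_WEBSITE input; the URL and limits are examples only.
	input := map[string]any{
		"task":              "TASK_SCRAPE_WEBSITE",
		"target_url":        "https://example.com",
		"allowed_domains":   []string{"example.com"},
		"max_k":             10,
		"include_link_text": true,
		"include_link_html": false,
	}

	b, err := json.MarshalIndent(input, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```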

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func Init

func Init(l *zap.Logger, u base.UsageHandler) *connector
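
A minimal usage sketch, assuming the package import path follows the configuration link above and that a no-op zap logger with a nil base.UsageHandler is sufficient for local experimentation:

```go
package main

import (
	"go.uber.org/zap"

	website "github.com/instill-ai/component/pkg/connector/website/v0"
)

func main() {
	// Assumption: a no-op zap logger and a nil base.UsageHandler are acceptable here.
	conn := website.Init(zap.NewNop(), nil)
	_ = conn
}
```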

Types

type PageInfo

type PageInfo struct {
	Link     string `json:"link"`
	Title    string `json:"title"`
	LinkText string `json:"link_text"`
	LinkHTML string `json:"link_html"`
}

type ScrapeWebsiteInput

type ScrapeWebsiteInput struct {
	// TargetURL: The URL of the website to scrape.
	TargetURL string `json:"target_url"`
	// AllowedDomains: The list of allowed domains to scrape.
	AllowedDomains []string `json:"allowed_domains"`
	// MaxK: The maximum number of pages to scrape.
	MaxK int `json:"max_k"`
	// IncludeLinkText: Whether to include the scraped text of the scraped web page.
	IncludeLinkText *bool `json:"include_link_text"`
	// IncludeLinkHTML: Whether to include the scraped HTML of the scraped web page.
	IncludeLinkHTML *bool `json:"include_link_html"`
}

ScrapeWebsiteInput defines the input of the scrape website task.
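
For illustration only, a ScrapeWebsiteInput can be populated directly in Go; the boolean options are pointers, so they are set through local variables. All values below are hypothetical.

```go
package main

import (
	website "github.com/instill-ai/component/pkg/connector/website/v0"
)

func main() {
	includeText := true
	includeHTML := false

	// Hypothetical values; a MaxK of 0 would return all pages.
	in := website.ScrapeWebsiteInput{
		TargetURL:       "https://example.com",
		AllowedDomains:  []string{"example.com"},
		MaxK:            10,
		IncludeLinkText: &includeText,
		IncludeLinkHTML: &includeHTML,
	}
	_ = in
}
```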

type ScrapeWebsiteOutput

type ScrapeWebsiteOutput struct {
	// Pages: The list of pages that were scraped.
	Pages []PageInfo `json:"pages"`
}

ScrapeWebsiteOutput defines the output of the scrape website task.
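
A sketch of consuming the output, using a hypothetical helper and example data:

```go
package main

import (
	"fmt"

	website "github.com/instill-ai/component/pkg/connector/website/v0"
)

// printPages is a hypothetical helper that lists each scraped page.
func printPages(out website.ScrapeWebsiteOutput) {
	for _, p := range out.Pages {
		fmt.Printf("%s: %s\n", p.Link, p.Title)
	}
}

func main() {
	printPages(website.ScrapeWebsiteOutput{
		Pages: []website.PageInfo{{Link: "https://example.com", Title: "Example"}},
	})
}
```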

func Scrape

Scrape crawls a webpage and returns a slice of PageInfo.
