package text

v0.10.0-beta
Published: Mar 12, 2024 License: MIT Imports: 12 Imported by: 0

README

---
title: "Text"
lang: "en-US"
draft: false
description: "Learn how to set up a VDP Text operator https://github.com/instill-ai/vdp"
---

The Text component is an operator that allows users to extract and manipulate text from different sources.
It can carry out the following tasks:

- [Convert To Text](#convert-to-text)
- [Split By Token](#split-by-token)

## Release Stage

`Alpha`

## Configuration

The component configuration is defined and maintained [here](https://github.com/instill-ai/operator/blob/main/pkg/text/v0/config/definitions.json).

## Supported Tasks

### Convert To Text

Convert a document to plain text.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CONVERT_TO_TEXT` |
| Document (required) | `doc` | string | Base64 encoded document (PDF, DOC, DOCX, XML, HTML, RTF, etc.) to be converted to plain text |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Body | `body` | string | Plain text converted from the document |
| Meta | `meta` | object | Metadata extracted from the document |
| MSecs | `msecs` | number | Time taken to convert the document, in milliseconds |
| Error | `error` | string | Error message if any during the conversion process |

### Split By Token

Split text into chunks of a given token count.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SPLIT_BY_TOKEN` |
| Text (required) | `text` | string | Text to be split |
| Model (required) | `model` | string | ID of the model to use for tokenization |
| Chunk Token Size | `chunk_token_size` | integer | Number of tokens per text chunk |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Token Count | `token_count` | integer | Total count of tokens in the input text |
| Text Chunks | `text_chunks` | array[string] | Text chunks after splitting |
| Number of Text Chunks | `chunk_num` | integer | Total number of output text chunks |
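The chunking logic can be sketched as follows. This is not the operator's implementation: whitespace splitting stands in for the model-specific tokenizer selected via the `model` input, but the three outputs above (`token_count`, `text_chunks`, `chunk_num`) are computed the same way:

```go
package main

import (
	"fmt"
	"strings"
)

// splitByToken groups a token slice into chunks of at most chunkTokenSize
// tokens, mirroring the output fields above. Whitespace splitting is a
// stand-in for the model-specific tokenizer.
func splitByToken(text string, chunkTokenSize int) (tokenCount int, textChunks []string, chunkNum int) {
	tokens := strings.Fields(text)
	tokenCount = len(tokens)
	for i := 0; i < len(tokens); i += chunkTokenSize {
		end := i + chunkTokenSize
		if end > len(tokens) {
			end = len(tokens)
		}
		textChunks = append(textChunks, strings.Join(tokens[i:end], " "))
	}
	chunkNum = len(textChunks)
	return
}

func main() {
	count, chunks, n := splitByToken("one two three four five", 2)
	fmt.Println(count, n, chunks)
	// → 5 3 [one two three four five]
}
```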

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Init

```go
func Init(logger *zap.Logger) base.IOperator
```

Init initializes the operator.

Types

type ConvertToTextInput

```go
type ConvertToTextInput struct {
	// Doc: Base64-encoded document to convert
	Doc string `json:"doc"`
}
```

ConvertToTextInput defines the input for the convert-to-text task.

type ConvertToTextOutput

```go
type ConvertToTextOutput struct {
	// Body: Plain text converted from the document
	Body string `json:"body"`
	// Meta: Metadata extracted from the document
	Meta map[string]string `json:"meta"`
	// MSecs: Time taken to convert the document, in milliseconds
	MSecs uint32 `json:"msecs"`
	// Error: Error message, if any, from the conversion process
	Error string `json:"error"`
}
```

ConvertToTextOutput defines the output for the convert-to-text task.

type Execution

```go
type Execution struct {
	base.Execution
}
```

Execution is the derived execution.

func (*Execution) Execute

```go
func (e *Execution) Execute(inputs []*structpb.Struct) ([]*structpb.Struct, error)
```

Execute executes the derived execution.

type Operator

```go
type Operator struct {
	base.Operator
}
```

Operator is the derived operator.

func (*Operator) CreateExecution

```go
func (o *Operator) CreateExecution(defUID uuid.UUID, task string, config *structpb.Struct, logger *zap.Logger) (base.IExecution, error)
```

CreateExecution creates the derived execution.

type SplitByTokenInput

```go
type SplitByTokenInput struct {
	// Text: Text to split
	Text string `json:"text"`
	// Model: ID of the model to use for tokenization
	Model string `json:"model"`
	// ChunkTokenSize: Number of tokens per text chunk
	ChunkTokenSize *int `json:"chunk_token_size,omitempty"`
}
```

SplitByTokenInput defines the input for the split-by-token task.

type SplitByTokenOutput

```go
type SplitByTokenOutput struct {
	// TokenCount: Total number of tokens in the input text
	TokenCount int `json:"token_count"`
	// TextChunks: List of text chunks after splitting
	TextChunks []string `json:"text_chunks"`
	// ChunkNum: Total number of text chunks
	ChunkNum int `json:"chunk_num"`
}
```

SplitByTokenOutput defines the output for the split-by-token task.
