---
title: "Text"
lang: "en-US"
draft: false
description: "Learn how to set up a VDP Text operator https://github.com/instill-ai/instill-core"
---
The Text component is an operator that allows users to extract and manipulate text from different sources.
It can carry out the following tasks:
- [Convert To Text](#convert-to-text)
- [Split By Token](#split-by-token)
## Release Stage
`Alpha`
## Configuration
The component configuration is defined and maintained [here](https://github.com/instill-ai/component/blob/main/pkg/operator/text/v0/config/definition.json).
## Supported Tasks
### Convert To Text
Convert a document to plain text.
| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CONVERT_TO_TEXT` |
| Document (required) | `doc` | string | Base64 encoded document (PDF, DOC, DOCX, XML, HTML, RTF, etc.) to be converted to plain text |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Body | `body` | string | Plain text converted from the document |
| Meta | `meta` | object | Metadata extracted from the document |
| MSecs | `msecs` | number | Time taken to convert the document, in milliseconds |
| Error | `error` | string | Error message if any during the conversion process |
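The `doc` field expects the raw document bytes base64-encoded. A minimal sketch of preparing that payload in Go; the `ConvertToTextInput` struct name and the `newConvertToTextInput` helper are illustrative, not part of the component's API:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// ConvertToTextInput mirrors the input table above. The struct name is
// an assumption for this example, not the component's exported type.
type ConvertToTextInput struct {
	Task string `json:"task"`
	Doc  string `json:"doc"`
}

// newConvertToTextInput base64-encodes raw document bytes (PDF, DOCX,
// HTML, etc.) into the `doc` field, as the table requires.
func newConvertToTextInput(raw []byte) ConvertToTextInput {
	return ConvertToTextInput{
		Task: "TASK_CONVERT_TO_TEXT",
		Doc:  base64.StdEncoding.EncodeToString(raw),
	}
}

func main() {
	// Stand-in for bytes read from a real document file.
	in := newConvertToTextInput([]byte("hello world"))
	payload, _ := json.Marshal(in)
	fmt.Println(string(payload))
}
```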
### Split By Token
Split text by token.
| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SPLIT_BY_TOKEN` |
| Text (required) | `text` | string | Text to be split |
| Model (required) | `model` | string | ID of the model to use for tokenization |
| Chunk Token Size | `chunk_token_size` | integer | Number of tokens per text chunk |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Token Count | `token_count` | integer | Total count of tokens in the input text |
| Text Chunks | `text_chunks` | array[string] | Text chunks after splitting |
| Number of Text Chunks | `chunk_num` | integer | Total number of output text chunks |
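The three outputs are related: the input text is tokenized, the tokens are grouped into chunks of at most `chunk_token_size` tokens, and `chunk_num` reports how many chunks result. A naive sketch of that idea, using whitespace-delimited words as stand-in "tokens"; the real component uses the tokenizer of the model named in the `model` field:

```go
package main

import (
	"fmt"
	"strings"
)

// splitByTokens approximates TASK_SPLIT_BY_TOKEN with whitespace
// tokens. It returns the chunks and the total token count; the number
// of chunks corresponds to the `chunk_num` output.
func splitByTokens(text string, chunkTokenSize int) (chunks []string, tokenCount int) {
	if chunkTokenSize < 1 {
		chunkTokenSize = 1 // guard against a non-positive chunk size
	}
	tokens := strings.Fields(text)
	tokenCount = len(tokens)
	for i := 0; i < len(tokens); i += chunkTokenSize {
		end := i + chunkTokenSize
		if end > len(tokens) {
			end = len(tokens)
		}
		chunks = append(chunks, strings.Join(tokens[i:end], " "))
	}
	return chunks, tokenCount
}

func main() {
	chunks, n := splitByTokens("one two three four five", 2)
	// n -> token_count, len(chunks) -> chunk_num, chunks -> text_chunks
	fmt.Println(n, len(chunks), chunks)
}
```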
In Go, the task inputs and outputs map to the following types:

```go
// ConvertToTextOutput defines the output for the convert-to-text task.
type ConvertToTextOutput struct {
	// Body: Plain text converted from the document
	Body string `json:"body"`
	// Meta: Metadata extracted from the document
	Meta map[string]string `json:"meta"`
	// MSecs: Time taken to convert the document
	MSecs uint32 `json:"msecs"`
	// Error: Error message if any during the conversion process
	Error string `json:"error"`
}

// SplitByTokenInput defines the input for the split-by-token task.
type SplitByTokenInput struct {
	// Text: Text to split
	Text string `json:"text"`
	// Model: ID of the model to use for tokenization
	Model string `json:"model"`
	// ChunkTokenSize: Number of tokens per text chunk
	ChunkTokenSize *int `json:"chunk_token_size,omitempty"`
}

// SplitByTokenOutput defines the output for the split-by-token task.
type SplitByTokenOutput struct {
	// TokenCount: Number of tokens in the text
	TokenCount int `json:"token_count"`
	// TextChunks: List of text chunks
	TextChunks []string `json:"text_chunks"`
	// ChunkNum: Number of text chunks
	ChunkNum int `json:"chunk_num"`
}
```