tika

Version: v1.0.1
Published: Feb 4, 2021 License: Apache-2.0 Imports: 10 Imported by: 0

README

go-tika

go-tika is a Go client library for accessing the Apache Tika Server API.

go-tika requires Go version 1.15 or greater.

See pkg.go.dev for more documentation on what resources are available.

This repo was forked from github.com/google/go-tika.

License

This library is distributed under the Apache V2 License. See the LICENSE file.

Documentation

Overview

Package tika provides a client for using Apache Tika's (http://tika.apache.org) Server API.

To parse the contents of a file (or any io.Reader), you will need to open the io.Reader, create a client, and call client.Parse.

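// import (
// 	"context"
// 	"log"
// 	"os"
// )
f, err := os.Open("path/to/file")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

serverURL := "TODO"

client := tika.NewClient(nil, serverURL)
// Parse takes a Header as its third argument; NewHeader returns one.
body, err := client.Parse(context.Background(), f, tika.NewHeader())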

If you pass an *http.Client to tika.NewClient, it will be used for all requests.

Some methods, such as Parsers, Detectors, and MIMETypes, return custom types. Use them to see which features the current Tika server supports.

Constants

const XTIKAContent = "X-TIKA:content"

XTIKAContent is the metadata field of the content of a file after recursive parsing. See ParseRecursive and MetaRecursive.

Types

type Client

type Client struct {
	// contains filtered or unexported fields
}

Client represents a connection to a Tika Server.

func NewClient

func NewClient(httpClient *http.Client, urlString string) *Client

NewClient creates a new Client. If httpClient is nil, the http.DefaultClient will be used. Be aware that http.DefaultClient has no timeout set.
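
Because http.DefaultClient never times out, a stalled server can block a request indefinitely. The following is a minimal sketch of building an *http.Client with an explicit timeout to pass as the first argument of NewClient. The helper name newHTTPClient and the 30-second value are illustrative assumptions, not part of this package.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newHTTPClient is a hypothetical helper: it returns an *http.Client whose
// Timeout bounds each request, including reading the response body.
func newHTTPClient(timeout time.Duration) *http.Client {
	return &http.Client{Timeout: timeout}
}

func main() {
	httpClient := newHTTPClient(30 * time.Second)
	fmt.Println(httpClient.Timeout)
	// The client would then be handed to the package, for example:
	//   client := tika.NewClient(httpClient, serverURL)
}
```

A per-call deadline can also be set through the context.Context that every Client method accepts.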

func NewDefaultClient added in v1.0.1

func NewDefaultClient(urlString string) *Client

NewDefaultClient creates a new Client with a request timeout of 60 seconds.

func (*Client) Detect

func (c *Client) Detect(ctx context.Context, input io.Reader) (string, error)

Detect gets the MIME type of the given input, returning the MIME type and an error. If the error is not nil, the MIME type is undefined.

func (*Client) Detectors

func (c *Client) Detectors(ctx context.Context) (*Detector, error)

Detectors returns the list of available Detectors for this server. To get all available detectors, iterate through the Children of every Detector.
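
Iterating through the Children of every Detector amounts to a depth-first walk of the tree. The sketch below shows one way to collect every name. The local Detector type mirrors the fields documented for this package so the code runs without a Tika server; detectorNames and the sample tree are hypothetical.

```go
package main

import "fmt"

// Detector mirrors the documented fields of tika.Detector (local stand-in).
type Detector struct {
	Name      string
	Composite bool
	Children  []Detector
}

// detectorNames walks the tree depth-first and returns the name of every
// detector, each parent before its children.
func detectorNames(d Detector) []string {
	names := []string{d.Name}
	for _, child := range d.Children {
		names = append(names, detectorNames(child)...)
	}
	return names
}

func main() {
	// Hypothetical tree; a real one would come from client.Detectors(ctx).
	root := Detector{
		Name:      "root",
		Composite: true,
		Children: []Detector{
			{Name: "a"},
			{Name: "b", Composite: true, Children: []Detector{{Name: "c"}}},
		},
	}
	fmt.Println(detectorNames(root))
}
```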

func (*Client) Language

func (c *Client) Language(ctx context.Context, input io.Reader) (string, error)

Language detects the language of the given input, returning the two-letter language code and an error. If the error is not nil, the language is undefined.

func (*Client) LanguageString

func (c *Client) LanguageString(ctx context.Context, input string) (string, error)

LanguageString detects the language of the given string, returning the two-letter language code and an error. If the error is not nil, the language is undefined.

func (*Client) MIMETypes

func (c *Client) MIMETypes(ctx context.Context) (map[string]MIMEType, error)

MIMETypes returns a map from MIME type name to a MIMEType value, which holds the properties of that MIME type.
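
As an illustration of working with the returned map, the sketch below follows SuperType links to list the ancestors of a MIME type. The local MIMEType type mirrors the fields documented for this package; the helper superTypes and the sample entries are hypothetical, and real data would come from client.MIMETypes(ctx).

```go
package main

import "fmt"

// MIMEType mirrors the documented fields of tika.MIMEType (local stand-in).
type MIMEType struct {
	Alias     []string
	SuperType string
}

// superTypes follows SuperType links starting at name and returns the chain
// of ancestors. It stops at an empty or unknown supertype; the seen set
// guards against cycles in malformed data.
func superTypes(types map[string]MIMEType, name string) []string {
	var chain []string
	seen := map[string]bool{name: true}
	for {
		t, ok := types[name]
		if !ok || t.SuperType == "" || seen[t.SuperType] {
			return chain
		}
		chain = append(chain, t.SuperType)
		seen[t.SuperType] = true
		name = t.SuperType
	}
}

func main() {
	// Made-up entries for illustration only.
	types := map[string]MIMEType{
		"example/child":  {SuperType: "example/parent"},
		"example/parent": {SuperType: "example/root"},
		"example/root":   {},
	}
	fmt.Println(superTypes(types, "example/child"))
}
```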

func (*Client) Meta

func (c *Client) Meta(ctx context.Context, input io.Reader) (string, error)

Meta parses the metadata from the given input, returning the metadata and an error. If the error is not nil, the metadata is undefined.

func (*Client) MetaField

func (c *Client) MetaField(ctx context.Context, input io.Reader, field string) (string, error)

MetaField parses the metadata from the given input and returns the given field. If the error is not nil, the result string is undefined.

func (*Client) MetaRecursive

func (c *Client) MetaRecursive(ctx context.Context, input io.Reader) ([]map[string][]string, error)

MetaRecursive parses the given input and all embedded documents. The result is a list of maps from metadata key to value for each document. The content of each document is in the XTIKAContent field in text form. See ParseRecursive to just get the content of each document. If the error is not nil, the result list is undefined.
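
The result shape can be easier to see with a small example. The sketch below pulls the XTIKAContent value out of each document map. The helper contents and the sample data are hypothetical, and this is not necessarily how ParseRecursive is implemented.

```go
package main

import "fmt"

// XTIKAContent matches the constant documented in this package.
const XTIKAContent = "X-TIKA:content"

// contents returns the first XTIKAContent value of each document, skipping
// documents that carry no content field.
func contents(docs []map[string][]string) []string {
	var out []string
	for _, doc := range docs {
		if vals := doc[XTIKAContent]; len(vals) > 0 {
			out = append(out, vals[0])
		}
	}
	return out
}

func main() {
	// Hypothetical data shaped like the return value of MetaRecursive.
	docs := []map[string][]string{
		{"Content-Type": {"application/pdf"}, XTIKAContent: {"outer text"}},
		{"Content-Type": {"image/png"}},
		{"Content-Type": {"text/plain"}, XTIKAContent: {"attachment text"}},
	}
	for i, c := range contents(docs) {
		fmt.Printf("%d: %s\n", i, c)
	}
}
```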

func (*Client) MetaRecursiveType

func (c *Client) MetaRecursiveType(ctx context.Context, input io.Reader, contentType string) ([]map[string][]string, error)

MetaRecursiveType parses the given input and all embedded documents. The result is a list of maps from metadata key to value for each document. The content of each document is in the XTIKAContent field, and is of the type indicated by the contentType parameter. An empty string can be passed in for a default type of XML. See ParseRecursive to just get the content of each document. If the error is not nil, the result list is undefined.

func (*Client) Parse

func (c *Client) Parse(ctx context.Context, input io.Reader, header Header) (string, error)

Parse parses the given input, returning the body of the input and an error. If the error is not nil, the body is undefined.

func (*Client) ParseRecursive

func (c *Client) ParseRecursive(ctx context.Context, input io.Reader) ([]string, error)

ParseRecursive parses the given input and all embedded documents, returning a list of the contents of the input with one element per document. See MetaRecursive for access to all metadata fields. If the error is not nil, the result is undefined.

func (*Client) Parsers

func (c *Client) Parsers(ctx context.Context) (*Parser, error)

Parsers returns the list of available parsers and an error. If the error is not nil, the list is undefined. To get all available parsers, iterate through the Children of every Parser.

func (*Client) Translate

func (c *Client) Translate(ctx context.Context, input io.Reader, t Translator, src, dst string) (string, error)

Translate translates the given input from the src language to the dst language using t, returning the translation and an error. If the error is not nil, the translation is undefined.

func (*Client) Version

func (c *Client) Version(ctx context.Context) (string, error)

Version returns the default hello message from Tika server.

type ClientError

type ClientError struct {
	// StatusCode is the HTTP status code returned by the Tika server.
	StatusCode int
}

ClientError is returned by Client's various parse methods and represents an error response from the Tika server. Example usage:

client := tika.NewClient(nil, tikaURL)
s, err := client.Parse(context.Background(), input, tika.NewHeader())
var tikaErr tika.ClientError
if errors.As(err, &tikaErr) {
    switch tikaErr.StatusCode {
    case http.StatusUnsupportedMediaType, http.StatusUnprocessableEntity:
        // Handle content related error
    default:
        // Handle possibly intermittent http error
    }
} else if err != nil {
    // Handle non-http error
}

func (ClientError) Error

func (e ClientError) Error() string

type Detector

type Detector struct {
	Name      string
	Composite bool
	Children  []Detector
}

A Detector represents a Tika Detector. Detectors are used to get the filetype of a file. To get a list of all Detectors, see Detectors().

type Header

type Header struct {
	http.Header
}

func NewHeader added in v1.0.1

func NewHeader() Header

func (Header) AcceptText added in v1.0.1

func (qq Header) AcceptText() Header

func (Header) SetOCRLanguage added in v1.0.1

func (qq Header) SetOCRLanguage(language string) Header

SetOCRLanguage accepts a string of language codes joined by "+", for example "eng+deu+spa". Language codes are in ISO 639-2 format.
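
The language string can be assembled from individual codes rather than written by hand. This is a minimal sketch; the helper ocrLanguages is hypothetical and not part of this package.

```go
package main

import (
	"fmt"
	"strings"
)

// ocrLanguages is a hypothetical helper that joins language codes with "+",
// the separator used in the documented "eng+deu+spa" format.
func ocrLanguages(codes ...string) string {
	return strings.Join(codes, "+")
}

func main() {
	fmt.Println(ocrLanguages("eng", "deu", "spa"))
	// The result would then be passed to the package, for example:
	//   header := tika.NewHeader().SetOCRLanguage(ocrLanguages("eng", "deu"))
}
```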

type MIMEType

type MIMEType struct {
	Alias     []string
	SuperType string
}

MIMEType represents a Tika MIME Type. To get a list of all MIME Types, see MIMETypes.

type Parser

type Parser struct {
	Name           string
	Decorated      bool
	Composite      bool
	Children       []Parser
	SupportedTypes []string
}

A Parser represents a Tika Parser. To get a list of all Parsers, see Parsers().
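
One use of the Parser tree is to find which parsers claim a given MIME type. The sketch below searches the tree through Children and SupportedTypes. The local Parser type mirrors the fields documented above; parsersFor and the sample tree are hypothetical.

```go
package main

import "fmt"

// Parser mirrors the documented fields of tika.Parser (local stand-in).
type Parser struct {
	Name           string
	Decorated      bool
	Composite      bool
	Children       []Parser
	SupportedTypes []string
}

// parsersFor returns the names of all parsers in the tree rooted at p that
// list mimeType among their SupportedTypes.
func parsersFor(p Parser, mimeType string) []string {
	var names []string
	for _, t := range p.SupportedTypes {
		if t == mimeType {
			names = append(names, p.Name)
			break
		}
	}
	for _, child := range p.Children {
		names = append(names, parsersFor(child, mimeType)...)
	}
	return names
}

func main() {
	// Hypothetical tree; a real one would come from client.Parsers(ctx).
	root := Parser{
		Name:      "root",
		Composite: true,
		Children: []Parser{
			{Name: "pdf", SupportedTypes: []string{"application/pdf"}},
			{Name: "text", SupportedTypes: []string{"text/plain"}},
		},
	}
	fmt.Println(parsersFor(root, "application/pdf"))
}
```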

type Translator

type Translator string

Translator represents the fully qualified Java class name of a Tika Translator.

const (
	Lingo24Translator   Translator = "org.apache.tika.language.translate.Lingo24Translator"
	GoogleTranslator    Translator = "org.apache.tika.language.translate.GoogleTranslator"
	MosesTranslator     Translator = "org.apache.tika.language.translate.MosesTranslator"
	JoshuaTranslator    Translator = "org.apache.tika.language.translate.JoshuaTranslator"
	MicrosoftTranslator Translator = "org.apache.tika.language.translate.MicrosoftTranslator"
	YandexTranslator    Translator = "org.apache.tika.language.translate.YandexTranslator"
)

Translators available by default in Tika. You must configure all required authentication details in Tika Server (for example, an API key).
