Documentation ¶
Overview ¶
Package tika provides a client for the Apache Tika (http://tika.apache.org) Server API.
To parse the contents of a file (or any io.Reader), you will need to open the io.Reader, create a client, and call client.Parse.
// import (
//	"context"
//	"log"
//	"os"
// )
f, err := os.Open("path/to/file")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

serverURL := "TODO"
client := tika.NewClient(nil, serverURL)
body, err := client.Parse(context.Background(), f)
If you pass an *http.Client to tika.NewClient, it will be used for all requests.
Some functions return a custom type, like Parsers(), Detectors(), and MIMETypes(). Use these to see what features are supported by the current Tika server.
Index ¶
- Constants
- type Client
- func (c *Client) Detect(ctx context.Context, input io.Reader) (string, error)
- func (c *Client) Detectors(ctx context.Context) (*Detector, error)
- func (c *Client) Language(ctx context.Context, input io.Reader) (string, error)
- func (c *Client) LanguageString(ctx context.Context, input string) (string, error)
- func (c *Client) MIMETypes(ctx context.Context) (map[string]MIMEType, error)
- func (c *Client) Meta(ctx context.Context, input io.Reader) (string, error)
- func (c *Client) MetaField(ctx context.Context, input io.Reader, field string) (string, error)
- func (c *Client) MetaRecursive(ctx context.Context, input io.Reader) ([]map[string][]string, error)
- func (c *Client) MetaRecursiveType(ctx context.Context, input io.Reader, contentType string) ([]map[string][]string, error)
- func (c *Client) Parse(ctx context.Context, input io.Reader, header Header) (string, error)
- func (c *Client) ParseRecursive(ctx context.Context, input io.Reader) ([]string, error)
- func (c *Client) Parsers(ctx context.Context) (*Parser, error)
- func (c *Client) Translate(ctx context.Context, input io.Reader, t Translator, src, dst string) (string, error)
- func (c *Client) Version(ctx context.Context) (string, error)
- type ClientError
- type Detector
- type Header
- type MIMEType
- type Parser
- type Translator
Constants ¶
const XTIKAContent = "X-TIKA:content"
XTIKAContent is the metadata field of the content of a file after recursive parsing. See ParseRecursive and MetaRecursive.
Types ¶
type Client ¶
type Client struct {
// contains filtered or unexported fields
}
Client represents a connection to a Tika Server.
func NewClient ¶
NewClient creates a new Client. If httpClient is nil, http.DefaultClient is used. Be aware that http.DefaultClient has no timeout set.
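Because http.DefaultClient has no timeout, one option is to pass your own *http.Client. The following is a minimal sketch using only the standard library. The helper name newTimeoutClient and the 30-second value are assumptions chosen for illustration, and the call to tika.NewClient appears only as a comment that follows the argument order used in the Overview.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newTimeoutClient is a hypothetical helper (not part of package tika).
// It returns an *http.Client that abandons any request after timeout.
func newTimeoutClient(timeout time.Duration) *http.Client {
	return &http.Client{Timeout: timeout}
}

func main() {
	httpClient := newTimeoutClient(30 * time.Second)
	fmt.Println(httpClient.Timeout)

	// The client would then be handed to the package, following the
	// argument order shown in the Overview:
	//   client := tika.NewClient(httpClient, serverURL)
}
```

A per-call deadline can also be set through the context.Context that every Client method accepts, for example with context.WithTimeout.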
func NewDefaultClient ¶ added in v1.0.1
NewDefaultClient creates a new Client with a request timeout of 60 seconds.
func (*Client) Detect ¶
Detect gets the MIME type of the given input, returning the MIME type and an error. If the error is not nil, the MIME type is undefined.
func (*Client) Detectors ¶
Detectors returns the list of available Detectors for this server. To get all available detectors, iterate through the Children of every Detector.
func (*Client) Language ¶
Language detects the language of the given input, returning the two-letter language code and an error. If the error is not nil, the language is undefined.
func (*Client) LanguageString ¶
LanguageString detects the language of the given string, returning the two-letter language code and an error. If the error is not nil, the language is undefined.
func (*Client) MIMETypes ¶
MIMETypes returns a map from MIME type name to MIMEType, which holds the properties of that MIME type.
func (*Client) Meta ¶
Meta parses the metadata from the given input, returning the metadata and an error. If the error is not nil, the metadata is undefined.
func (*Client) MetaField ¶
MetaField parses the metadata from the given input and returns the given field. If the error is not nil, the result string is undefined.
func (*Client) MetaRecursive ¶
MetaRecursive parses the given input and all embedded documents. The result is a list of maps from metadata key to value for each document. The content of each document is in the XTIKAContent field in text form. See ParseRecursive to just get the content of each document. If the error is not nil, the result list is undefined.
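To illustrate the shape of the result, here is a minimal sketch that pulls the content out of each document map. It does not call the Tika server: the sample data is hand-written, the constant mirrors the documented XTIKAContent value so the block compiles on its own, and documentContents is a hypothetical helper, not part of package tika.

```go
package main

import "fmt"

// xTIKAContent mirrors the documented XTIKAContent constant so that this
// sketch compiles without importing package tika.
const xTIKAContent = "X-TIKA:content"

// documentContents is a hypothetical helper. It returns the first content
// value of every document in a MetaRecursive-style result, using an empty
// string for documents that carry no content field.
func documentContents(docs []map[string][]string) []string {
	out := make([]string, 0, len(docs))
	for _, doc := range docs {
		content := ""
		if vals := doc[xTIKAContent]; len(vals) > 0 {
			content = vals[0]
		}
		out = append(out, content)
	}
	return out
}

func main() {
	// Hand-written sample data shaped like the documented return type.
	docs := []map[string][]string{
		{"Content-Type": {"application/zip"}},
		{"Content-Type": {"text/plain"}, xTIKAContent: {"hello"}},
	}
	for i, c := range documentContents(docs) {
		fmt.Printf("document %d: %q\n", i, c)
	}
}
```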
func (*Client) MetaRecursiveType ¶
func (c *Client) MetaRecursiveType(ctx context.Context, input io.Reader, contentType string) ([]map[string][]string, error)
MetaRecursiveType parses the given input and all embedded documents. The result is a list of maps from metadata key to value for each document. The content of each document is in the XTIKAContent field, and is of the type indicated by the contentType parameter. An empty string can be passed in for a default type of XML. See ParseRecursive to just get the content of each document. If the error is not nil, the result list is undefined.
func (*Client) Parse ¶
Parse parses the given input, returning the body of the input and an error. If the error is not nil, the body is undefined.
func (*Client) ParseRecursive ¶
ParseRecursive parses the given input and all embedded documents, returning a list of the contents of the input with one element per document. See MetaRecursive for access to all metadata fields. If the error is not nil, the result is undefined.
func (*Client) Parsers ¶
Parsers returns the list of available parsers and an error. If the error is not nil, the list is undefined. To get all available parsers, iterate through the Children of every Parser.
type ClientError ¶
type ClientError struct {
	// StatusCode is the HTTP status code returned by the Tika server.
	StatusCode int
}
ClientError is returned by Client's various parse methods and represents an error response from the Tika server. Example usage:
client := tika.NewClient(nil, tikaURL)
s, err := client.Parse(context.Background(), input)
var tikaErr tika.ClientError
if errors.As(err, &tikaErr) {
	switch tikaErr.StatusCode {
	case http.StatusUnsupportedMediaType, http.StatusUnprocessableEntity:
		// Handle content related error
	default:
		// Handle possibly intermittent http error
	}
} else if err != nil {
	// Handle non-http error
}
func (ClientError) Error ¶
func (e ClientError) Error() string
type Detector ¶
A Detector represents a Tika Detector. Detectors are used to determine the type of a file. To get a list of all Detectors, see Detectors().
type Header ¶ added in v1.0.1
func (Header) AcceptText ¶ added in v1.0.1
func (Header) SetOCRLanguage ¶ added in v1.0.1
SetOCRLanguage accepts a string of language codes joined by "+", for example "eng+deu+spa". Language codes are in ISO 639-2 format.
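The "+"-joined format can be built from a list of codes with the standard library. This is a small sketch only: joinOCRLanguages is a hypothetical helper, not part of package tika, and it does not validate the codes. The signature of SetOCRLanguage is not shown on this page, so the call itself is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// joinOCRLanguages is a hypothetical helper. It joins ISO 639-2 language
// codes with "+" to produce the format described for SetOCRLanguage.
func joinOCRLanguages(codes ...string) string {
	return strings.Join(codes, "+")
}

func main() {
	fmt.Println(joinOCRLanguages("eng", "deu", "spa"))
}
```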
type MIMEType ¶
MIMEType represents a Tika MIME Type. To get a list of all MIME Types, see MIMETypes.
type Parser ¶
type Parser struct {
	Name           string
	Decorated      bool
	Composite      bool
	Children       []Parser
	SupportedTypes []string
}
A Parser represents a Tika Parser. To get a list of all Parsers, see Parsers().
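Since parsers nest through the Children field, collecting every parser means walking the tree. The sketch below shows one way to do that, assuming a depth-first order is acceptable. The Parser type is redeclared locally with the documented fields so the block compiles on its own, parserNames is a hypothetical helper, and the parser names in main are made-up placeholders rather than real Tika parsers.

```go
package main

import "fmt"

// Parser mirrors the documented tika.Parser struct so that this sketch
// compiles without importing package tika.
type Parser struct {
	Name           string
	Decorated      bool
	Composite      bool
	Children       []Parser
	SupportedTypes []string
}

// parserNames is a hypothetical helper. It returns the name of p followed
// by the names of all of its descendants, in depth-first order.
func parserNames(p Parser) []string {
	names := []string{p.Name}
	for _, child := range p.Children {
		names = append(names, parserNames(child)...)
	}
	return names
}

func main() {
	// Placeholder tree; a real one would come from client.Parsers.
	root := Parser{
		Name:      "root",
		Composite: true,
		Children: []Parser{
			{Name: "a", Children: []Parser{{Name: "a1"}}},
			{Name: "b"},
		},
	}
	fmt.Println(parserNames(root))
}
```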
type Translator ¶
type Translator string
Translator represents the Java package of a Tika Translator.
const (
	Lingo24Translator   Translator = "org.apache.tika.language.translate.Lingo24Translator"
	GoogleTranslator    Translator = "org.apache.tika.language.translate.GoogleTranslator"
	MosesTranslator     Translator = "org.apache.tika.language.translate.MosesTranslator"
	JoshuaTranslator    Translator = "org.apache.tika.language.translate.JoshuaTranslator"
	MicrosoftTranslator Translator = "org.apache.tika.language.translate.MicrosoftTranslator"
	YandexTranslator    Translator = "org.apache.tika.language.translate.YandexTranslator"
)
Translators available by default in Tika. You must configure all required authentication details in Tika Server (for example, an API key).