s3

v0.0.0-...-9003a7a Latest
Warning: This package is not in the latest version of its module.
Published: May 28, 2020 License: Apache-2.0 Imports: 13 Imported by: 0

README

Sif S3 DataSource

An AWS S3 DataSource for Sif.

$ go get github.com/go-sif/sif-datasource-aws-s3@master
$ go get github.com/aws/aws-sdk-go

Usage

  1. Create a Schema which represents the fields you intend to extract from each document in the target bucket:
import (
	"github.com/go-sif/sif"
	"github.com/go-sif/sif/schema"
)

schema := schema.CreateSchema()
schema.CreateColumn("coords.x", &sif.Float64ColumnType{})
schema.CreateColumn("coords.z", &sif.Float64ColumnType{})
schema.CreateColumn("date", &sif.TimeColumnType{Format: "2006-01-02 15:04:05"})
  2. Create an AWS Session with your desired configuration parameters:
import (
	"github.com/go-sif/sif"
	"github.com/go-sif/sif/schema"
	"github.com/aws/aws-sdk-go/aws/session"
)

// ...

sess := session.Must(session.NewSession())
  3. Finally, define your configuration and create a DataFrame which can be manipulated with sif:
import (
	"github.com/go-sif/sif"
	"github.com/go-sif/sif/schema"
	"github.com/aws/aws-sdk-go/aws/session"
	s3Source "github.com/go-sif/sif-datasource-aws-s3"
)
// ...

var parser sif.DataSourceParser // ... assign any Sif parser here

conf := &s3Source.DataSourceConf{
	Bucket:       "bucket.name",       // bucket name
	Prefix:       "/prefix/for/files", // S3 key prefix to filter which keys are accessed
	KeyBatchSize: 5,                   // The number of files assigned to a single worker at a
	                                   // time, to be downloaded concurrently with processing
	Session:      sess,
}

dataframe := s3Source.CreateDataFrame(conf, parser, schema)
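
To make the roles of Prefix and KeyBatchSize more concrete, here is a minimal, standard-library-only sketch of the idea: keep the keys which start with a prefix, then hand them out in groups of a fixed size. The helpers withPrefix and batchKeys are hypothetical names used for illustration and are not part of this package; the package's own key listing and batching may work differently.

```go
package main

import (
	"fmt"
	"strings"
)

// withPrefix keeps only the keys which begin with prefix.
// Hypothetical helper: it mimics the effect of the Prefix setting locally.
func withPrefix(keys []string, prefix string) []string {
	var out []string
	for _, k := range keys {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	return out
}

// batchKeys splits keys into consecutive groups of at most size elements.
// Hypothetical helper: it illustrates the idea behind KeyBatchSize only.
func batchKeys(keys []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(keys); start += size {
		end := start + size
		if end > len(keys) {
			end = len(keys)
		}
		batches = append(batches, keys[start:end])
	}
	return batches
}

func main() {
	keys := []string{"logs/a.json", "logs/b.json", "tmp/x.json", "logs/c.json"}
	selected := withPrefix(keys, "logs/")
	// Each batch stands for the set of files one worker would receive at a time.
	for i, batch := range batchKeys(selected, 2) {
		fmt.Printf("batch %d: %v\n", i, batch)
	}
}
```

A larger batch size means fewer, bigger assignments per worker; a smaller one spreads files more evenly at the cost of more assignments.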

Documentation

Overview

Package s3 provides a DataSource which reads data from AWS S3.

Functions

func CreateDataFrame

func CreateDataFrame(conf *DataSourceConf, parser sif.DataSourceParser, schema sif.Schema) sif.DataFrame

CreateDataFrame is a factory for DataSources

Types

type DataSource

type DataSource struct {
	// contains filtered or unexported fields
}

DataSource is a set of files in an S3 bucket, containing data which will be manipulated according to a DataFrame

func (*DataSource) Analyze

func (fs *DataSource) Analyze() (sif.PartitionMap, error)

Analyze returns a PartitionMap, describing how the source files will be divided into Partitions

func (*DataSource) DeserializeLoader

func (fs *DataSource) DeserializeLoader(bytes []byte) (sif.PartitionLoader, error)

DeserializeLoader creates a PartitionLoader for this DataSource from a serialized representation

func (*DataSource) IsStreaming

func (fs *DataSource) IsStreaming() bool

IsStreaming returns true iff this DataSource provides a continuous stream of data

type DataSourceConf

type DataSourceConf struct {
	Bucket string
	// Prefix limits the response to keys prefixed by this string
	Prefix       string
	Filter       *regexp.Regexp
	RequestPayer string
	// KeyBatchSize must be less than 1000 and represents the number of documents which will
	// be assigned as a batch to a Sif worker at one time. Files are assigned in batches
	// so that workers can download and parse files concurrently.
	KeyBatchSize int64
	// PrefetchLimit is a limit on the number of files which workers will prefetch and store in memory
	PrefetchLimit int
	Session       *session.Session
	Decoder       func([]byte) ([]byte, error)
}

DataSourceConf configures an S3 DataSource

type PartitionLoader

type PartitionLoader struct {
	// contains filtered or unexported fields
}

PartitionLoader is capable of loading partitions of data from a file

func (*PartitionLoader) GobDecode

func (pl *PartitionLoader) GobDecode(in []byte) error

GobDecode deserializes a PartitionLoader

func (*PartitionLoader) GobEncode

func (pl *PartitionLoader) GobEncode() ([]byte, error)

GobEncode serializes a PartitionLoader
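
GobEncode and GobDecode are the methods of the standard gob.GobEncoder and gob.GobDecoder interfaces, which let a type with unexported fields control its own serialization. The sketch below shows that mechanism with a hypothetical keyBatch type; the field names and the wire struct are assumptions for illustration and say nothing about what PartitionLoader actually stores.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// keyBatch is a hypothetical stand-in for a loader with unexported fields.
type keyBatch struct {
	bucket string
	keys   []string
}

// wire is the exported shape which is actually written by gob.
type wire struct {
	Bucket string
	Keys   []string
}

// GobEncode implements gob.GobEncoder.
func (kb *keyBatch) GobEncode() ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(wire{Bucket: kb.bucket, Keys: kb.keys}); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// GobDecode implements gob.GobDecoder.
func (kb *keyBatch) GobDecode(in []byte) error {
	var w wire
	if err := gob.NewDecoder(bytes.NewReader(in)).Decode(&w); err != nil {
		return err
	}
	kb.bucket, kb.keys = w.Bucket, w.Keys
	return nil
}

// roundTrip encodes kb with gob and decodes it into a fresh value.
func roundTrip(kb *keyBatch) (*keyBatch, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(kb); err != nil {
		return nil, err
	}
	out := &keyBatch{}
	if err := gob.NewDecoder(&buf).Decode(out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	out, err := roundTrip(&keyBatch{bucket: "bucket.name", keys: []string{"a.json", "b.json"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(out.bucket, out.keys)
}
```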

func (*PartitionLoader) Load

func (pl *PartitionLoader) Load(parser sif.DataSourceParser, widestInitialSchema sif.Schema) (sif.PartitionIterator, error)

Load is capable of loading partitions of data from a file

func (*PartitionLoader) ToString

func (pl *PartitionLoader) ToString() string

ToString returns a string representation of this PartitionLoader

type PartitionMap

type PartitionMap struct {
	// contains filtered or unexported fields
}

PartitionMap is an iterator producing a sequence of PartitionLoaders

func (*PartitionMap) HasNext

func (pm *PartitionMap) HasNext() bool

HasNext returns true iff there is another PartitionLoader remaining

func (*PartitionMap) Next

func (pm *PartitionMap) Next() sif.PartitionLoader

Next returns the next PartitionLoader for a file
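
HasNext and Next follow the usual iterator pattern: call Next only while HasNext reports true. The sketch below shows that loop with hypothetical stand-in types (loader, loaderIterator, batchLoader, sliceIterator) which mirror only the method names documented above; the real interfaces live in sif and are not reproduced here.

```go
package main

import "fmt"

// loader and loaderIterator are hypothetical stand-ins which mirror only
// the documented method names.
type loader interface {
	ToString() string
}

type loaderIterator interface {
	HasNext() bool
	Next() loader
}

// batchLoader is a toy loader describing a batch of keys.
type batchLoader struct {
	keys []string
}

func (b batchLoader) ToString() string {
	return fmt.Sprintf("loader for %d keys", len(b.keys))
}

// sliceIterator walks a fixed slice of loaders.
type sliceIterator struct {
	loaders []loader
	pos     int
}

func (it *sliceIterator) HasNext() bool { return it.pos < len(it.loaders) }

func (it *sliceIterator) Next() loader {
	l := it.loaders[it.pos]
	it.pos++
	return l
}

// drain consumes the iterator and collects each loader's description.
func drain(it loaderIterator) []string {
	var out []string
	for it.HasNext() {
		out = append(out, it.Next().ToString())
	}
	return out
}

func main() {
	it := &sliceIterator{loaders: []loader{
		batchLoader{keys: []string{"a.json", "b.json"}},
		batchLoader{keys: []string{"c.json"}},
	}}
	for _, description := range drain(it) {
		fmt.Println(description)
	}
}
```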
