Documentation ¶
Overview ¶
Package sequencefile provides functionality for reading and writing Hadoop's SequenceFile format, documented here: http://goo.gl/sOSJmJ
Index ¶
- Constants
- func BytesWritable(b []byte) []byte
- func IntWritable(b []byte) int32
- func LongWritable(b []byte) int64
- func ReadVInt(r io.Reader) (int64, error)
- func Text(b []byte) string
- func WriteVInt(w io.Writer, i int64) (err error)
- type Compression
- type CompressionCodec
- type Header
- type Reader
- type WritableWriter
- type Writer
- type WriterConfig
Examples ¶
Constants ¶
const (
	SyncSize = 16

	GzipClassName   = "org.apache.hadoop.io.compress.GzipCodec"
	SnappyClassName = "org.apache.hadoop.io.compress.SnappyCodec"
	ZlibClassName   = "org.apache.hadoop.io.compress.DefaultCodec"
	ZstdClassName   = "org.apache.hadoop.io.compress.ZStandardCodec"
	Bzip2ClassName  = "org.apache.hadoop.io.compress.BZip2Codec"
)
const (
	BytesWritableClassName = "org.apache.hadoop.io.BytesWritable"
	TextClassName          = "org.apache.hadoop.io.Text"
	IntWritableClassName   = "org.apache.hadoop.io.IntWritable"
	LongWritableClassName  = "org.apache.hadoop.io.LongWritable"
)
Variables ¶
This section is empty.
Functions ¶
func BytesWritable ¶
BytesWritable unwraps a Hadoop BytesWritable and returns the actual bytes.
func IntWritable ¶
IntWritable unwraps an IntWritable and returns the deserialized int32.
func LongWritable ¶
LongWritable unwraps a LongWritable and returns the deserialized int64.
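As an illustration of what these unwrappers decode (a sketch, not the package's own code): Hadoop serializes IntWritable and LongWritable as plain big-endian integers, and BytesWritable as a 4-byte big-endian length prefix followed by the payload.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Illustrative re-implementations of the unwrappers above, assuming
// Hadoop's standard serialization.
func bytesWritable(b []byte) []byte { return b[4:] } // skip the length prefix
func intWritable(b []byte) int32    { return int32(binary.BigEndian.Uint32(b)) }
func longWritable(b []byte) int64   { return int64(binary.BigEndian.Uint64(b)) }

func main() {
	wrapped := []byte{0, 0, 0, 2, 'h', 'i'} // a BytesWritable holding "hi"
	fmt.Println(string(bytesWritable(wrapped)))               // hi
	fmt.Println(intWritable([]byte{0, 0, 0, 42}))             // 42
	fmt.Println(longWritable([]byte{0, 0, 0, 0, 0, 0, 0, 1})) // 1
}
```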
func ReadVInt ¶
ReadVInt reads an int64 encoded in hadoop's "VInt" format, described and implemented here: https://goo.gl/1h4mrG. It does at most two reads to the underlying io.Reader.
Types ¶
type Compression ¶
type Compression int
const (
	NoCompression Compression = iota + 1
	RecordCompression
	BlockCompression
)
type CompressionCodec ¶
type CompressionCodec int
const (
	GzipCompression CompressionCodec = iota + 1
	SnappyCompression
	ZlibCompression
	ZstdCompression
	Bzip2Compression
)
type Header ¶
type Header struct {
	Version                   int
	Compression               Compression
	CompressionCodec          CompressionCodec
	CompressionCodecClassName string
	KeyClassName              string
	ValueClassName            string
	Metadata                  map[string]string
	SyncMarker                string
}
A Header represents the information contained in the header of the SequenceFile.
type Reader ¶
type Reader struct {
	Header Header
	// contains filtered or unexported fields
}
A Reader reads key/value pairs from a SequenceFile input stream.
A reader is valid at any key or block offset; it's safe to start in the middle of a file or seek the underlying input stream if the location was recorded between calls to Scan, and as long as you call Reset after seeking. Note, however, that with a block-compressed file (Header.Compression set to BlockCompression), the position will be at the beginning of the block that holds the key, not right before the key itself.
Example ¶
package main

import (
	"fmt"
	"log"

	"github.com/colinmarc/sequencefile"
)

func main() {
	sf, err := sequencefile.Open("testdata/block_compressed_snappy.sequencefile")
	if err != nil {
		log.Fatal(err)
	}

	// Iterate through the file.
	for sf.Scan() {
		// Unwrap the BytesWritable values.
		key := sequencefile.BytesWritable(sf.Key())
		value := sequencefile.BytesWritable(sf.Value())
		fmt.Println(string(key), string(value))
	}

	if sf.Err() != nil {
		log.Fatal(sf.Err())
	}
}
Output:

Alice Practice
Bob Hope
func NewReader ¶
NewReader returns a new Reader for a SequenceFile, reading data from r. If the io.Reader is positioned at the start of a file, you should immediately call ReadHeader to read through the header.
func NewReaderCompression ¶
func NewReaderCompression(r io.Reader, compression Compression, codec CompressionCodec) *Reader
NewReaderCompression returns a new Reader for a SequenceFile, reading data from r. Normally, compression options are inferred from the header of a file, but if the header is unavailable (because you're starting mid-stream) you can call this method with the compression options set explicitly.
func (*Reader) Key ¶
Key returns the key for the current record. The byte slice will be reused after the next call to Scan.
func (*Reader) ReadHeader ¶
ReadHeader parses the SequenceFile header from the input stream, and fills in the Header struct with the values. This should be called when the reader is positioned at the start of the file or input stream, before any records are read.
ReadHeader will also validate that the settings of the SequenceFile (version, compression, key/value serialization, etc) are compatible.
func (*Reader) Reset ¶
func (r *Reader) Reset()
Reset resets the internal state of the reader, but maintains compression settings and header information. You should call Reset if you seek the underlying reader, but should create an entirely new Reader if you are starting a different file.
type WritableWriter ¶
A WritableWriter knows how to write data wrapped in Hadoop Writables.
Each WritableWriter understands just a single type of data.
func NewWritableWriter ¶
func NewWritableWriter(className string) (WritableWriter, error)
NewWritableWriter gets a WritableWriter for a given Hadoop class name.
type Writer ¶
type Writer struct {
// contains filtered or unexported fields
}
A Writer writes key/value pairs to a sequence file output stream.
Example ¶
package main

import (
	"bytes"
	"log"

	"github.com/colinmarc/sequencefile"
)

func main() {
	var buf bytes.Buffer
	cfg := &sequencefile.WriterConfig{
		Writer:     &buf,
		KeyClass:   sequencefile.BytesWritableClassName,
		ValueClass: sequencefile.BytesWritableClassName,
	}

	w, err := sequencefile.NewWriter(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer w.Close()

	pairs := []struct{ k, v string }{
		{"Alice", "Practice"},
		{"Bob", "Hope"},
	}

	for _, p := range pairs {
		err = w.Append([]byte(p.k), []byte(p.v))
		if err != nil {
			log.Fatal(err)
		}
	}
}
Output:
func NewWriter ¶
func NewWriter(cfg *WriterConfig) (w *Writer, err error)
NewWriter constructs a new Writer.
type WriterConfig ¶
type WriterConfig struct {
	// Writer is where data will be written to.
	Writer io.Writer

	// KeyClass is the type of each key to be written.
	KeyClass string

	// ValueClass is the type of each value to be written.
	ValueClass string

	// Compression is the type of compression to be used.
	// Either none, record or block.
	Compression Compression

	// CompressionCodec is the codec to be used for compression.
	// This is only relevant if compression is used.
	CompressionCodec CompressionCodec

	// BlockSize is the size of each block for compression.
	// This is only relevant if block compression is used.
	BlockSize int

	// Metadata contains key/value pairs to be added to the header.
	Metadata map[string]string

	// Rand is a source of random numbers. Should usually be nil, but useful
	// for reproducible output.
	Rand *rand.Rand
}
A WriterConfig specifies the configuration for a Writer.