gossamr

package module
v0.0.0-...-3ade326
Published: Feb 9, 2018 License: Apache-2.0 Imports: 11 Imported by: 2

README

Gossamr lets you run your Go programs on Hadoop.

Quick Example

Oh, man. Illustrating MapReduce with a word count? Get out of town.

package main

import (
  "log"
  "strings"

  "github.com/vistarmedia/gossamr"
)

type WordCount struct{}

// Map emits (word, 1) for each whitespace-separated token in the line.
func (wc *WordCount) Map(p int64, line string, c gossamr.Collector) error {
  for _, word := range strings.Fields(line) {
    c.Collect(strings.ToLower(word), int64(1))
  }
  return nil
}

// Reduce sums the counts for a word and emits (sum, word).
func (wc *WordCount) Reduce(word string, counts chan int64, c gossamr.Collector) error {
  var sum int64
  for v := range counts {
    sum += v
  }
  c.Collect(sum, word)
  return nil
}

func main() {
  wordcount := gossamr.NewTask(&WordCount{})

  err := gossamr.Run(wordcount)
  if err != nil {
    log.Fatal(err)
  }
}

Running with Hadoop

./bin/hadoop jar ./contrib/streaming/hadoop-streaming-1.2.1.jar \
  -input /mytext.txt \
  -output /output.15 \
  -mapper "gossamr -task 0 -phase map" \
  -reducer "gossamr -task 0 -phase reduce" \
  -io typedbytes \
  -file ./wordcount \
  -numReduceTasks 6
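A note on the flags: -task 0 presumably selects the first task passed to gossamr.Run, and -io typedbytes tells Hadoop streaming to exchange records in the binary typedbytes format that gossamr expects for non-local IO (see NewPairReader and NewPairWriter below).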

Documentation

Overview

Runs a job (or part of a job). There are three primary types of runners:

  1. LocalRunner - Used for simulating a job locally. The sorting and combining behavior of Hadoop is emulated as best as possible, though no guarantees are made.
  2. TaskPhaseRunner - Used inter-step during a Hadoop job. This runs a single phase of a task.
  3. JobRunner - Submits a multi-task Job to Hadoop, organizing temporary files and forking the necessary processes.
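
To see how these pieces compose, here is a minimal sketch that builds a Job explicitly and lets GetRunner pick the runner from the command-line arguments. It reuses the WordCount type from the quick example and calls only signatures documented below; whether GetRunner accepts exactly the flags shown in the Hadoop invocation above is an assumption.

package main

import (
	"log"
	"os"

	"github.com/vistarmedia/gossamr"
)

func main() {
	// Build a job from one or more tasks. WordCount is the type
	// defined in the quick example above.
	job := gossamr.NewJob(gossamr.NewTask(&WordCount{}))

	// GetRunner inspects the arguments and returns a LocalRunner,
	// TaskPhaseRunner, or JobRunner as appropriate.
	runner, err := gossamr.GetRunner(os.Args)
	if err != nil {
		log.Fatal(err)
	}
	if err := runner.Run(job); err != nil {
		log.Fatal(err)
	}
}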

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Copy

func Copy(r Reader, w Writer) (err error)
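
Copy has no doc comment, but its signature suggests a pump that drains a Reader into a Writer pair by pair. A hedged sketch, converting typedbytes pairs on stdin into tab-separated text on stdout (whether Copy closes the writer itself is unknown, so this closes it explicitly):

package main

import (
	"log"
	"os"

	"github.com/vistarmedia/gossamr"
)

func main() {
	// Decode typedbytes key/value pairs from stdin...
	r := gossamr.NewPairReader(os.Stdin)
	// ...and re-emit each pair as a "key\tvalue" line on stdout.
	w := gossamr.NewStringWriter(os.Stdout)

	if err := gossamr.Copy(r, w); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}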

func NewWriterCollector

func NewWriterCollector(writer Writer) *writerCollector

func Run

func Run(tasks ...*Task) error

Types

type Collector

type Collector interface {
	Collect(k, v interface{}) error
}
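
Because Collector is a one-method interface, map and reduce functions can be unit-tested without Hadoop by collecting into memory. The pair and memCollector types below are illustrative helpers, not part of gossamr:

package main

import "fmt"

// pair holds one collected key/value; illustrative only.
type pair struct{ k, v interface{} }

// memCollector satisfies gossamr.Collector by recording pairs in memory.
type memCollector struct{ pairs []pair }

func (m *memCollector) Collect(k, v interface{}) error {
	m.pairs = append(m.pairs, pair{k, v})
	return nil
}

func main() {
	// Exercise the WordCount mapper from the quick example.
	c := &memCollector{}
	wc := &WordCount{}
	if err := wc.Map(0, "to be or not to be", c); err != nil {
		panic(err)
	}
	fmt.Println(len(c.pairs)) // 6
}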

type GroupedReader

type GroupedReader struct {
	// contains filtered or unexported fields
}

A reader that, for each key, will group all its values into a channel.

func (*GroupedReader) Next

func (gr *GroupedReader) Next() (k, v interface{}, err error)

type Job

type Job struct {
	// contains filtered or unexported fields
}

func NewJob

func NewJob(tasks ...*Task) *Job

type LineReader

type LineReader struct {
	// contains filtered or unexported fields
}

LineReader is used by basic streaming jobs. It yields a line number and the raw line, delimited by \n. The consumer must accept the arguments (int64, string).

func NewLineReader

func NewLineReader(r io.Reader) *LineReader

func (*LineReader) Next

func (lr *LineReader) Next() (k, v interface{}, err error)
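
A small sketch of driving a LineReader by hand, assuming Next follows the usual Go convention of returning io.EOF once the input is exhausted:

package main

import (
	"fmt"
	"io"
	"log"
	"strings"

	"github.com/vistarmedia/gossamr"
)

func main() {
	lr := gossamr.NewLineReader(strings.NewReader("first line\nsecond line\n"))
	for {
		k, v, err := lr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		// Per the doc comment, k is the line number (int64) and v
		// is the raw line (string).
		fmt.Printf("%d: %s\n", k.(int64), v.(string))
	}
}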

type LocalRunner

type LocalRunner struct {
	// contains filtered or unexported fields
}

LocalRunner simulates a job locally, emulating Hadoop's sorting and combining as closely as possible (see Overview).

func (*LocalRunner) Run

func (lr *LocalRunner) Run(j *Job) (err error)

type Phase

type Phase uint8
const (
	MapPhase Phase = iota
	CombinePhase
	ReducePhase
)

func GetPhase

func GetPhase(name string) (Phase, error)
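
The names accepted by GetPhase are presumably those passed to the -phase flag in the Hadoop invocation above ("map", "reduce", and by extension "combine"); a short sketch under that assumption:

package main

import (
	"fmt"
	"log"

	"github.com/vistarmedia/gossamr"
)

func main() {
	// "map" matches the -phase flag used with hadoop-streaming above;
	// the full set of accepted names is an assumption.
	phase, err := gossamr.GetPhase("map")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(phase == gossamr.MapPhase) // expected: true
}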

type Reader

type Reader interface {
	Next() (k, v interface{}, err error)
}

func NewGroupedReader

func NewGroupedReader(reader Reader) Reader

func NewPairReader

func NewPairReader(r io.Reader) Reader

Read pairs serialized with Hadoop's typedbytes. It is assumed that in non-local mode, this will always be the wire format for reading and writing.

type Runner

type Runner interface {
	Run(job *Job) error
}

func GetRunner

func GetRunner(args []string) (Runner, error)

Given the arguments, figure out which runner should be used.

type SortWriter

type SortWriter struct {
	// contains filtered or unexported fields
}

func NewSortWriter

func NewSortWriter(w io.WriteCloser, capacity int) (*SortWriter, error)

func (*SortWriter) Close

func (sw *SortWriter) Close() (err error)

func (*SortWriter) Write

func (sw *SortWriter) Write(k, v interface{}) (err error)
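
SortWriter has no doc comment; given LocalRunner's description, it presumably buffers written pairs and emits them to the underlying stream sorted by key, emulating Hadoop's sort. A hedged sketch (treating capacity as the number of pairs buffered in memory, which is an assumption):

package main

import (
	"log"
	"os"

	"github.com/vistarmedia/gossamr"
)

func main() {
	// Buffer up to 1024 pairs before sorting and flushing; the
	// capacity semantics and output format are assumptions.
	sw, err := gossamr.NewSortWriter(os.Stdout, 1024)
	if err != nil {
		log.Fatal(err)
	}
	if err := sw.Write("b", int64(2)); err != nil {
		log.Fatal(err)
	}
	if err := sw.Write("a", int64(1)); err != nil {
		log.Fatal(err)
	}
	// Close presumably flushes the buffered pairs in sorted order.
	if err := sw.Close(); err != nil {
		log.Fatal(err)
	}
}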

type StringWriter

type StringWriter struct {
	// contains filtered or unexported fields
}

StringWriter coaxes each key/value pair to a simple string and outputs it in the basic streaming format: key\tvalue\n

func NewStringWriter

func NewStringWriter(w io.WriteCloser) *StringWriter

func (*StringWriter) Close

func (sw *StringWriter) Close() error

func (*StringWriter) Write

func (sw *StringWriter) Write(k, v interface{}) (err error)
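
A one-pair sketch of StringWriter; os.Stdout satisfies io.WriteCloser, though whether Close should be propagated to stdout in real code is left to the caller:

package main

import (
	"log"
	"os"

	"github.com/vistarmedia/gossamr"
)

func main() {
	w := gossamr.NewStringWriter(os.Stdout)
	if err := w.Write("hello", int64(3)); err != nil {
		log.Fatal(err)
	}
	// Per the doc comment, this prints: hello\t3
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}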

type Task

type Task struct {
	// contains filtered or unexported fields
}

func NewTask

func NewTask(instance interface{}) *Task

func (*Task) Run

func (t *Task) Run(phase Phase, r io.Reader, w io.WriteCloser) (err error)

type TaskPhaseRunner

type TaskPhaseRunner struct {
	// contains filtered or unexported fields
}

TaskPhaseRunner runs a single phase of a task forked from Hadoop. It is assumed that all input and output will be typedbytes at this point.

func TaskPhaseRunnerFromArgs

func TaskPhaseRunnerFromArgs(args []string) (tpr *TaskPhaseRunner, err error)

func (*TaskPhaseRunner) Run

func (tpr *TaskPhaseRunner) Run(j *Job) error

type Writer

type Writer interface {
	Write(k, v interface{}) error
	Close() error
}

func NewPairWriter

func NewPairWriter(w io.WriteCloser) Writer

Write pairs to an underlying writer in Hadoop's typedbytes format. As above, it is assumed that all non-local IO will happen in this format.

Directories

examples
