gossamr

package module
v0.0.0-...-3ade326
Published: Feb 9, 2018 License: Apache-2.0 Imports: 11 Imported by: 2

README

Gossamr lets you run your Go programs on Hadoop.

Quick Example

Oh, man. Illustrating MapReduce with a word count? Get out of town.

package main

import (
  "log"
  "strings"

  "github.com/vistarmedia/gossamr"
)

type WordCount struct{}

// Map emits (word, 1) for each whitespace-separated token in the line.
func (wc *WordCount) Map(p int64, line string, c gossamr.Collector) error {
  for _, word := range strings.Fields(line) {
    c.Collect(strings.ToLower(word), int64(1))
  }
  return nil
}

// Reduce sums the counts for a word and emits (sum, word).
func (wc *WordCount) Reduce(word string, counts chan int64, c gossamr.Collector) error {
  var sum int64
  for v := range counts {
    sum += v
  }
  c.Collect(sum, word)
  return nil
}

func main() {
  wordcount := gossamr.NewTask(&WordCount{})

  err := gossamr.Run(wordcount)
  if err != nil {
    log.Fatal(err)
  }
}

Running with Hadoop

./bin/hadoop jar ./contrib/streaming/hadoop-streaming-1.2.1.jar \
  -input /mytext.txt \
  -output /output.15 \
  -mapper "gossamr -task 0 -phase map" \
  -reducer "gossamr -task 0 -phase reduce" \
  -io typedbytes \
  -file ./wordcount \
  -numReduceTasks 6
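A note on the flags: -task 0 presumably selects the first task passed to gossamr.Run, and -io typedbytes tells Hadoop streaming to exchange records in the binary typedbytes format that gossamr expects for non-local IO (see NewPairReader and NewPairWriter below).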

Documentation

Overview

Runs a job (or part of a job). There are three primary types of runners:

  1. LocalRunner - Used for simulating a job locally. The sorting and combining behavior of Hadoop is emulated as best as possible, though no guarantees are made.
  2. TaskPhaseRunner - Used inter-step during a Hadoop job. This runs a single phase of a task.
  3. JobRunner - Submits a multi-task Job to Hadoop, organizing temporary files and forking the necessary processes.
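
To see how these pieces compose, here is a minimal sketch that builds a Job explicitly and lets GetRunner pick the runner from the command-line arguments. It reuses the WordCount type from the quick example and calls only signatures documented below; whether GetRunner accepts exactly the flags shown in the Hadoop invocation above is an assumption.

package main

import (
	"log"
	"os"

	"github.com/vistarmedia/gossamr"
)

func main() {
	// Build a job from one or more tasks. WordCount is the type
	// defined in the quick example above.
	job := gossamr.NewJob(gossamr.NewTask(&WordCount{}))

	// GetRunner inspects the arguments and returns a LocalRunner,
	// TaskPhaseRunner, or JobRunner as appropriate.
	runner, err := gossamr.GetRunner(os.Args)
	if err != nil {
		log.Fatal(err)
	}
	if err := runner.Run(job); err != nil {
		log.Fatal(err)
	}
}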

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Copy

func Copy(r Reader, w Writer) (err error)
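
Copy has no doc comment, but its signature suggests a pump that drains a Reader into a Writer pair by pair. A hedged sketch, converting typedbytes pairs on stdin into tab-separated text on stdout (whether Copy closes the writer itself is unknown, so this closes it explicitly):

package main

import (
	"log"
	"os"

	"github.com/vistarmedia/gossamr"
)

func main() {
	// Decode typedbytes key/value pairs from stdin...
	r := gossamr.NewPairReader(os.Stdin)
	// ...and re-emit each pair as a "key\tvalue" line on stdout.
	w := gossamr.NewStringWriter(os.Stdout)

	if err := gossamr.Copy(r, w); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}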

func NewWriterCollector

func NewWriterCollector(writer Writer) *writerCollector

func Run

func Run(tasks ...*Task) error

Types

type Collector

type Collector interface {
	Collect(k, v interface{}) error
}
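
Because Collector is a one-method interface, map and reduce functions can be unit-tested without Hadoop by collecting into memory. The pair and memCollector types below are illustrative helpers, not part of gossamr:

package main

import "fmt"

// pair holds one collected key/value; illustrative only.
type pair struct{ k, v interface{} }

// memCollector satisfies gossamr.Collector by recording pairs in memory.
type memCollector struct{ pairs []pair }

func (m *memCollector) Collect(k, v interface{}) error {
	m.pairs = append(m.pairs, pair{k, v})
	return nil
}

func main() {
	// Exercise the WordCount mapper from the quick example.
	c := &memCollector{}
	wc := &WordCount{}
	if err := wc.Map(0, "to be or not to be", c); err != nil {
		panic(err)
	}
	fmt.Println(len(c.pairs)) // 6
}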

type GroupedReader

type GroupedReader struct {
	// contains filtered or unexported fields
}

A reader that, for each key, will group all its values into a channel.

func (*GroupedReader) Next

func (gr *GroupedReader) Next() (k, v interface{}, err error)

type Job

type Job struct {
	// contains filtered or unexported fields
}

func NewJob

func NewJob(tasks ...*Task) *Job

type LineReader

type LineReader struct {
	// contains filtered or unexported fields
}

LineReader is used by basic streaming jobs. It yields a line number and the raw line, delimited by \n. The consumer must accept the arguments (int64, string).

func NewLineReader

func NewLineReader(r io.Reader) *LineReader

func (*LineReader) Next

func (lr *LineReader) Next() (k, v interface{}, err error)
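
A small sketch of driving a LineReader by hand, assuming Next follows the usual Go convention of returning io.EOF once the input is exhausted:

package main

import (
	"fmt"
	"io"
	"log"
	"strings"

	"github.com/vistarmedia/gossamr"
)

func main() {
	lr := gossamr.NewLineReader(strings.NewReader("first line\nsecond line\n"))
	for {
		k, v, err := lr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		// Per the doc comment, k is the line number (int64) and v
		// is the raw line (string).
		fmt.Printf("%d: %s\n", k.(int64), v.(string))
	}
}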

type LocalRunner

type LocalRunner struct {
	// contains filtered or unexported fields
}

LocalRunner simulates a job locally, emulating Hadoop's sorting and combining as closely as possible (see Overview).

func (*LocalRunner) Run

func (lr *LocalRunner) Run(j *Job) (err error)

type Phase

type Phase uint8
const (
	MapPhase Phase = iota
	CombinePhase
	ReducePhase
)

func GetPhase

func GetPhase(name string) (Phase, error)
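
The names accepted by GetPhase are presumably those passed to the -phase flag in the Hadoop invocation above ("map", "reduce", and by extension "combine"); a short sketch under that assumption:

package main

import (
	"fmt"
	"log"

	"github.com/vistarmedia/gossamr"
)

func main() {
	// "map" matches the -phase flag used with hadoop-streaming above;
	// the full set of accepted names is an assumption.
	phase, err := gossamr.GetPhase("map")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(phase == gossamr.MapPhase) // expected: true
}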

type Reader

type Reader interface {
	Next() (k, v interface{}, err error)
}

func NewGroupedReader

func NewGroupedReader(reader Reader) Reader

func NewPairReader

func NewPairReader(r io.Reader) Reader

Read pairs serialized with Hadoop's typedbytes. It is assumed that in non-local mode, this will always be the wire format for reading and writing.

type Runner

type Runner interface {
	Run(job *Job) error
}

func GetRunner

func GetRunner(args []string) (Runner, error)

Given the arguments, figure out which runner should be used.

type SortWriter

type SortWriter struct {
	// contains filtered or unexported fields
}

func NewSortWriter

func NewSortWriter(w io.WriteCloser, capacity int) (*SortWriter, error)

func (*SortWriter) Close

func (sw *SortWriter) Close() (err error)

func (*SortWriter) Write

func (sw *SortWriter) Write(k, v interface{}) (err error)
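
SortWriter has no doc comment; given LocalRunner's description, it presumably buffers written pairs and emits them to the underlying stream sorted by key, emulating Hadoop's sort. A hedged sketch (treating capacity as the number of pairs buffered in memory, which is an assumption):

package main

import (
	"log"
	"os"

	"github.com/vistarmedia/gossamr"
)

func main() {
	// Buffer up to 1024 pairs before sorting and flushing; the
	// capacity semantics and output format are assumptions.
	sw, err := gossamr.NewSortWriter(os.Stdout, 1024)
	if err != nil {
		log.Fatal(err)
	}
	if err := sw.Write("b", int64(2)); err != nil {
		log.Fatal(err)
	}
	if err := sw.Write("a", int64(1)); err != nil {
		log.Fatal(err)
	}
	// Close presumably flushes the buffered pairs in sorted order.
	if err := sw.Close(); err != nil {
		log.Fatal(err)
	}
}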

type StringWriter

type StringWriter struct {
	// contains filtered or unexported fields
}

StringWriter coaxes each key/value pair to a simple string and outputs it in the basic streaming format: key\tvalue\n

func NewStringWriter

func NewStringWriter(w io.WriteCloser) *StringWriter

func (*StringWriter) Close

func (sw *StringWriter) Close() error

func (*StringWriter) Write

func (sw *StringWriter) Write(k, v interface{}) (err error)
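
A one-pair sketch of StringWriter; os.Stdout satisfies io.WriteCloser, though whether Close should be propagated to stdout in real code is left to the caller:

package main

import (
	"log"
	"os"

	"github.com/vistarmedia/gossamr"
)

func main() {
	w := gossamr.NewStringWriter(os.Stdout)
	if err := w.Write("hello", int64(3)); err != nil {
		log.Fatal(err)
	}
	// Per the doc comment, this prints: hello\t3
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}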

type Task

type Task struct {
	// contains filtered or unexported fields
}

func NewTask

func NewTask(instance interface{}) *Task

func (*Task) Run

func (t *Task) Run(phase Phase, r io.Reader, w io.WriteCloser) (err error)

type TaskPhaseRunner

type TaskPhaseRunner struct {
	// contains filtered or unexported fields
}

TaskPhaseRunner runs a single phase of a task forked from Hadoop. It is assumed that all input and output will be typedbytes at this point.

func TaskPhaseRunnerFromArgs

func TaskPhaseRunnerFromArgs(args []string) (tpr *TaskPhaseRunner, err error)

func (*TaskPhaseRunner) Run

func (tpr *TaskPhaseRunner) Run(j *Job) error

type Writer

type Writer interface {
	Write(k, v interface{}) error
	Close() error
}

func NewPairWriter

func NewPairWriter(w io.WriteCloser) Writer

Write pairs to an underlying writer in Hadoop's typedbytes format. As above, it is assumed that all non-local IO will happen in this format.

Directories

examples
