go-mapreduce

module
v0.0.0-...-f737b9d
Published: Sep 21, 2023 License: MIT

While browsing the internet for project ideas to implement in Go, I stumbled on an MIT lab on implementing MapReduce.

Building a student-lab project seemed boring to me, so I decided to turn it into a kind of library/framework that provides primitives and a basic implementation for MapReduce-like processing.

What is MapReduce

MapReduce is a programming model for processing parallelizable problems over large datasets across multiple machines.

You can read more about it on Wikipedia or in the original paper published by Google.

This implementation takes some liberties with the ideas from the original paper, mainly by decoupling the coordinator (called master in the original paper) from workers.
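To make the model itself concrete, here is a minimal single-process sketch of word counting in the map/group/reduce style. It does not use this library; the names `KeyValue`, `mapWords`, `reduceCounts`, and `wordCount` are made up for illustration, and a real deployment would spread the map and reduce calls across machines.

```go
package main

import (
	"fmt"
	"strings"
)

// KeyValue is one intermediate pair emitted by the map phase (illustrative type).
type KeyValue struct {
	Key   string
	Value int
}

// mapWords emits a (word, 1) pair for every word in one input chunk.
func mapWords(chunk string) []KeyValue {
	var out []KeyValue
	for _, w := range strings.Fields(strings.ToLower(chunk)) {
		out = append(out, KeyValue{Key: w, Value: 1})
	}
	return out
}

// reduceCounts sums all values emitted for a single key.
func reduceCounts(values []int) int {
	total := 0
	for _, v := range values {
		total += v
	}
	return total
}

// wordCount maps every chunk, groups the pairs by key, then reduces each group.
func wordCount(chunks []string) map[string]int {
	grouped := make(map[string][]int)
	for _, c := range chunks {
		for _, kv := range mapWords(c) {
			grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
		}
	}
	result := make(map[string]int, len(grouped))
	for k, vs := range grouped {
		result[k] = reduceCounts(vs)
	}
	return result
}

func main() {
	counts := wordCount([]string{"the quick fox", "the lazy dog", "The end"})
	fmt.Println(counts["the"], counts["fox"]) // prints: 3 1
}
```

Each chunk can be mapped independently and each key can be reduced independently, which is what makes the work easy to parallelize.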

High-level Design Overview

This repo provides building blocks and a complete implementation of the coordinator.

The coordinator is responsible for managing processing tasks. It lets callers:

  • Create tasks
  • Get tasks for processing
  • Report processed task results
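The three operations above can be sketched as a small in-memory type. This is not the package's actual API: `Task`, `TaskStatus`, `memCoordinator`, and the method names are assumptions chosen for illustration, and the real coordinator persists tasks rather than holding them in a map.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// TaskStatus is a hypothetical lifecycle state; the real package may model this differently.
type TaskStatus string

const (
	StatusPending    TaskStatus = "pending"
	StatusInProgress TaskStatus = "in_progress"
	StatusDone       TaskStatus = "done"
	StatusFailed     TaskStatus = "failed"
)

// Task is a hypothetical task record that refers to its input only by identifier.
type Task struct {
	ID        int
	InputFile string
	Status    TaskStatus
}

// ErrNoTask is returned when nothing is waiting to be processed.
var ErrNoTask = errors.New("no pending task")

// memCoordinator is an in-memory stand-in for the three operations listed above.
type memCoordinator struct {
	mu     sync.Mutex
	nextID int
	tasks  map[int]*Task
}

func newMemCoordinator() *memCoordinator {
	return &memCoordinator{tasks: make(map[int]*Task)}
}

// CreateTask registers a new pending task for the given input file identifier.
func (c *memCoordinator) CreateTask(inputFile string) Task {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nextID++
	t := &Task{ID: c.nextID, InputFile: inputFile, Status: StatusPending}
	c.tasks[t.ID] = t
	return *t
}

// GetTask hands out the pending task with the lowest ID and marks it in progress.
func (c *memCoordinator) GetTask() (Task, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for id := 1; id <= c.nextID; id++ {
		if t := c.tasks[id]; t != nil && t.Status == StatusPending {
			t.Status = StatusInProgress
			return *t, nil
		}
	}
	return Task{}, ErrNoTask
}

// ReportResult records whether an in-progress task succeeded or failed.
func (c *memCoordinator) ReportResult(id int, ok bool) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	t, found := c.tasks[id]
	if !found || t.Status != StatusInProgress {
		return fmt.Errorf("task %d is not in progress", id)
	}
	if ok {
		t.Status = StatusDone
	} else {
		t.Status = StatusFailed
	}
	return nil
}

func main() {
	c := newMemCoordinator()
	c.CreateTask("input-1")
	t, _ := c.GetTask()
	fmt.Println(t.ID, t.Status, c.ReportResult(t.ID, true))
}
```

Marking a task as in progress inside `GetTask` is one simple way to keep two workers from claiming the same task; the real implementation may handle this differently.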

The end user application can call the coordinator to create a new map task.

The coordinator saves the map task to persistent storage and notifies any listening workers that a new map task is available for processing.

This communication is done through an event queue and the coordinator does not know anything about the workers.

Workers can listen to this event queue and make a request to the coordinator to get a new task for processing.
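A rough sketch of that decoupling, using a Go channel in place of a real event queue. The `TaskCreated` event, the `taskSource` function type, and `runWorker` are hypothetical names; the point is only that the event says "something is available" and the worker then pulls the task itself.

```go
package main

import (
	"fmt"
	"sync"
)

// TaskCreated is a hypothetical event: it announces that a task exists,
// not which worker should take it.
type TaskCreated struct {
	Kind string // "map" or "reduce"
}

// taskSource stands in for the "get a task" call a worker makes to the coordinator.
type taskSource func() (taskID string, ok bool)

// runWorker waits for events and pulls one task from the coordinator after each.
// It returns the IDs it claimed once the queue is closed.
func runWorker(events <-chan TaskCreated, fetch taskSource) []string {
	var claimed []string
	for range events {
		if id, ok := fetch(); ok {
			claimed = append(claimed, id)
		}
	}
	return claimed
}

// newSource returns a fetch function over a fixed list of task IDs.
// The mutex makes it safe to share between several workers.
func newSource(ids []string) taskSource {
	var mu sync.Mutex
	next := 0
	return func() (string, bool) {
		mu.Lock()
		defer mu.Unlock()
		if next >= len(ids) {
			return "", false
		}
		id := ids[next]
		next++
		return id, true
	}
}

func main() {
	events := make(chan TaskCreated, 2)
	fetch := newSource([]string{"map-1", "map-2"})

	// The publisher side never references a worker, only the queue.
	events <- TaskCreated{Kind: "map"}
	events <- TaskCreated{Kind: "map"}
	close(events)

	fmt.Println(runWorker(events, fetch)) // prints: [map-1 map-2]
}
```

Because the event carries no task payload in this sketch, a worker that receives a stale event simply finds nothing to fetch and keeps listening.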

After a worker is done with the map task, it reports its result to the coordinator. If task execution is successful, the coordinator creates a reduce task and notifies workers about it.

The reduce task goes through the same steps of processing.
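The hand-off from map to reduce might look like the sketch below. All types and the `nextTask` function are hypothetical; in particular, how the coordinator treats a failed map task is not described here, so the sketch simply creates no follow-up task in that case.

```go
package main

import "fmt"

// Kind distinguishes the two task types (hypothetical names).
type Kind string

const (
	KindMap    Kind = "map"
	KindReduce Kind = "reduce"
)

// Task refers to its input files only by identifier.
type Task struct {
	ID         string
	Kind       Kind
	InputFiles []string
}

// Result is what a worker reports back: a success flag and output file identifiers.
type Result struct {
	TaskID      string
	OK          bool
	OutputFiles []string
}

// nextTask decides what follows a reported result.
// A successful map task yields a reduce task over the map outputs.
// A failed task or a finished reduce task yields nothing (an assumption of this sketch).
func nextTask(done Task, res Result) (Task, bool) {
	if !res.OK || done.Kind != KindMap {
		return Task{}, false
	}
	return Task{
		ID:         "reduce-for-" + done.ID,
		Kind:       KindReduce,
		InputFiles: res.OutputFiles,
	}, true
}

func main() {
	mapTask := Task{ID: "m1", Kind: KindMap, InputFiles: []string{"input-1"}}
	res := Result{TaskID: "m1", OK: true, OutputFiles: []string{"mid-0", "mid-1"}}

	if next, ok := nextTask(mapTask, res); ok {
		// In the real flow the coordinator would save this task and publish an event.
		fmt.Println(next.ID, next.Kind, next.InputFiles)
	}
}
```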

Here's a diagram of this workflow:

[Figure: design-overview.png]

The workers are implemented by the end user.

The input, intermediate, and output data are files. The coordinator does not know anything about them besides their identifiers.

The end user decides which storage to use: a simple Network File System or something more sophisticated like HDFS.
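One way a worker could keep the storage choice swappable is to hide it behind a small interface keyed by file identifier. Everything here is an assumption for illustration: `FileStore`, `dirStore`, `memStore`, and `upperCase` are not part of this module.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

// FileStore is a hypothetical abstraction over wherever the data files live.
type FileStore interface {
	Read(id string) ([]byte, error)
	Write(id string, data []byte) error
}

// dirStore maps identifiers to files under one directory, for example an NFS mount.
type dirStore struct{ root string }

func (s dirStore) Read(id string) ([]byte, error) {
	return os.ReadFile(filepath.Join(s.root, id))
}

func (s dirStore) Write(id string, data []byte) error {
	return os.WriteFile(filepath.Join(s.root, id), data, 0o644)
}

// memStore keeps files in memory, which is handy for tests.
type memStore map[string][]byte

func (s memStore) Read(id string) ([]byte, error) {
	data, ok := s[id]
	if !ok {
		return nil, fmt.Errorf("file %q not found", id)
	}
	return data, nil
}

func (s memStore) Write(id string, data []byte) error {
	s[id] = append([]byte(nil), data...) // copy so callers cannot mutate stored data
	return nil
}

// upperCase is a stand-in worker step: read input by ID, transform, write output by ID.
func upperCase(store FileStore, inID, outID string) error {
	data, err := store.Read(inID)
	if err != nil {
		return err
	}
	return store.Write(outID, bytes.ToUpper(data))
}

func main() {
	root, err := os.MkdirTemp("", "mapreduce-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	var store FileStore = dirStore{root: root}
	if err := store.Write("input-1", []byte("hello")); err != nil {
		panic(err)
	}
	if err := upperCase(store, "input-1", "output-1"); err != nil {
		panic(err)
	}
	out, _ := store.Read("output-1")
	fmt.Println(string(out)) // prints: HELLO
}
```

Only the identifiers `input-1` and `output-1` would ever need to reach the coordinator; the bytes stay between the worker and the store.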

An example of an end user application (word counting) can be found in the wordcount module listed under examples.

Directories

Path              Synopsis
cmd
examples
wordcount         (module)
pkg
app
events/mocks      Package mocks is a generated GoMock package.
ids
ids/mocks         Package mocks is a generated GoMock package.
mapreduce/mocks   Package mocks is a generated GoMock package.
repository/mocks  Package mocks is a generated GoMock package.
