berserker

v0.0.0-...-cfc38e6
Published: Oct 17, 2017 License: GPL-3.0

Berserker Extractor

Berserker is an Apache Spark application built on its Scala API. It extracts the UAST and other information for every file in a given set of .siva files and stores the result in Parquet format.

Architecture

Berserker is one stage of a repository data collection pipeline. It:

  • reads the output of Borges
  • uses go-siva to unpack .siva files into a headless RootedRepository on the local FS
  • uses JGit to iterate over the files at HEAD of the main original repository (forks are skipped)
  • detects languages using Enry
  • parses every file to UAST using Bblfsh

It talks over gRPC to an Enry server for language detection and to bblfsh/server for the actual UAST parsing.

Prerequisites

  • Bblfsh server running
    docker run --privileged -p 9432:9432 --name bblfsh bblfsh/server:dev-<sha> --max-message-size=100
    
  • enrysrv binary built and running on 9091
    # make sure the Berserker clone is under $GOPATH
    cd enrysrv; ./build
    ./bin/enrysrv server
    
  • Scala client for the Bblfsh server built locally (until it is published on sonatype.org)
    ./local-install-bblfsh-client-scala.sh
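
Before starting a job it can help to confirm that both services are reachable. The following is a minimal sketch, assuming bash (for /dev/tcp) and the coreutils timeout command. The function names port_open and check_prereqs, the localhost default, and the BBLFSH_PORT / ENRYSRV_PORT overrides are hypothetical and not part of Berserker; the default ports are the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of Berserker.

# Succeeds if a TCP connection to host:port can be opened within 2 seconds.
port_open() {
  local host="$1" port="$2"
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Checks the two gRPC services from the prerequisites list.
# Default ports come from the commands above; the host default and the
# *_PORT override variables are assumptions made for this sketch.
check_prereqs() {
  local host="${1:-localhost}" status=0
  port_open "$host" "${BBLFSH_PORT:-9432}" \
    || { echo "bblfsh server not reachable on ${host}:${BBLFSH_PORT:-9432}" >&2; status=1; }
  port_open "$host" "${ENRYSRV_PORT:-9091}" \
    || { echo "enrysrv not reachable on ${host}:${ENRYSRV_PORT:-9091}" >&2; status=1; }
  return "$status"
}
```

For example, after sourcing these functions, `check_prereqs && ./berserker --help` would run the CLI only when both services accept connections. An open port does not prove the service behind it is healthy, so treat this as a quick sanity check only.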
    

Build

  • ./sbt compile to compile the project and generate gRPC code from ./enrysrv/*.proto using ScalaPB
  • ./sbt package to build a .jar file that can be passed to spark-submit
  • ./sbt assembly to build a fat JAR for use with java -jar (with Scala and Apache Spark bundled)

Test

There are two types of tests: unit tests in Scala and end-to-end integration tests. To run both:

./test

Run

Local mode

To run on a local machine with Apache Spark in local mode:

./berserker --help

Apache Spark cluster

MASTER="spark-master-url" ./berserker-cluster --help

Kubernetes

To run on Apache Spark deployed on Kubernetes (K8s):

TBD

kubectl run ....

Directories

Path Synopsis
Package enrysrv is a generated protocol buffer package.
cli
Package extractor is a generated protocol buffer package.
cli
