codetect

package module
v1.0.0-...-97119fa
Published: Mar 8, 2018 License: MIT Imports: 4 Imported by: 0

README

Code Neuron

Recurrent neural network to detect code blocks. Runs on TensorFlow. It is trained in two stages.

The first stage pre-trains the character-level RNN with two branches, before and after:

CharRNN Architecture

my code :  FooBar
------> x <------

We assign the recurrent branches to different GPUs to train faster. I set 512 LSTM neurons and reach 89% validation accuracy over the 200 most frequent character classes:

CharRNN Validation

The second stage trains the same network but with a different dense layer which predicts only 3 classes: code block begins, code block ends, and no-op. The prediction scheme also changes: we now look at adjacent characters and decide whether there is a code boundary between them.

Code Neuron Validation

It is much faster to train and it reaches ~99.2% validation accuracy.
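The second-stage decision rule can be sketched in Go. classifyPair below is a toy stand-in for the trained network (it fires on « and » marker characters, which the real model does not use); it only illustrates the scanning loop over adjacent character pairs:

```go
package main

import "fmt"

// Label is the three-way output of the second-stage classifier.
type Label int

const (
	NoOp Label = iota // no boundary between the two characters
	CodeBegins
	CodeEnds
)

// classifyPair is a hypothetical stand-in for the trained network. It looks
// at the characters on both sides of a candidate boundary; the real model
// consumes a window of context, not just two characters.
func classifyPair(before, after rune) Label {
	if after == '«' {
		return CodeBegins
	}
	if before == '»' {
		return CodeEnds
	}
	return NoOp
}

func main() {
	text := []rune("prose «code» prose")
	// Slide over every adjacent pair and ask whether a boundary lies between.
	for i := 1; i < len(text); i++ {
		switch classifyPair(text[i-1], text[i]) {
		case CodeBegins:
			fmt.Printf("code begins before rune %d\n", i)
		case CodeEnds:
			fmt.Printf("code ends before rune %d\n", i)
		}
	}
}
```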

Training set

StackSample questions and answers, processed with

unzip -p Answers(Questions).csv.zip | ./dataset | sed -r -e '/^$/d' -e '/\x03/ {N; s/\x03\s*\n/\x03/g}' | gzip >> Dataset.txt.gz

Baked model

model_LSTM_600_0.9924.pb - reaches 99.2% accuracy on validation. The model is in TensorFlow "GraphDef" protobuf format.

Pretraining was performed with 20% validation on the first 8000000 bytes of the uncompressed questions. Training was performed with 20% validation and 90% negative samples on the first 256000000 bytes of the uncompressed questions. In other words, I did not want to wait a week for it to train on the whole dataset - you are encouraged to experiment.

Try to run it:

cat sample.txt | python3 run_model.py -m model_LSTM_600_0.9924.pb

You should see:

Here is my Python code, it is awesome and easy to read:
<code>def main():
    print("Hello, world!")
</code>Please say what you think about it. Mad skills. Here is another one,
<code>func main() {
  println("Hello, world!")
}
</code>As you see, I know Go too. Some more text to provide enough context.

Visualize the trained model:

python3 model2tb.py --model-dir model_LSTM_600_0.9924.pb --log-dir tb_logs
tensorboard --logdir=tb_logs

Go inference

go get gopkg.in/vmarkovtsev/CodeNeuron.v1/...
cat sample.txt | $(go env GOPATH)/bin/codetect

API:

import "io/ioutil"
import "gopkg.in/vmarkovtsev/CodeNeuron.v1"

func main() {
  textBytes, _ := ioutil.ReadFile("sample.txt")
  session, _ := codetect.OpenSession()
  defer session.Close()
  result, _ := codetect.Run(string(textBytes), session)
  _ = result
}
Updating the model

go-bindata -nomemcopy -nometadata -pkg assets -o assets/bindata.go model.pb

License

MIT, see LICENSE.

Documentation

Index

Constants

This section is empty.

Variables

var CHARS = map[rune]uint8{}/* 199 elements not displayed */

Functions

func GetSequenceLength

func GetSequenceLength() int

GetSequenceLength returns the sequence length of the RNN model. text[:length / 2] and text[-length / 2:] are not analyzed because the network has too little context there. You can work around this by prepending and appending some constant strings.
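A minimal Go sketch of that workaround; padForContext is a hypothetical helper, seqLen stands in for GetSequenceLength(), and the space filler is an arbitrary choice:

```go
package main

import "fmt"

// padForContext prepends and appends constant filler so that the first and
// last runes of the real text sit outside the blind seqLen/2 margins.
// offset is how many runes were prepended; subtract it from every reported
// PositionInRunes to map boundaries back to the original text.
func padForContext(text string, seqLen int) (padded string, offset int) {
	pad := make([]rune, seqLen/2)
	for i := range pad {
		pad[i] = ' '
	}
	filler := string(pad)
	return filler + text + filler, seqLen / 2
}

func main() {
	padded, offset := padForContext("short snippet", 40)
	fmt.Println(len([]rune(padded)), offset)
}
```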

func OpenSession

func OpenSession() (*tf.Session, error)

OpenSession initializes a new TensorFlow session. Remember to defer session.Close().

Types

type CodeBoundary

type CodeBoundary struct {
	// PositionInRunes is the index of the boundary in the parsed *runes array*.
	// This is not a position in the byte stream.
	// The boundary goes *after* the corresponding rune index.
	PositionInRunes int
	// Start is true if the boundary is a start, otherwise, it is false for an end.
	Start bool
}

CodeBoundary represents a start or an end of a detected code block.

func Run

func Run(text string, session *tf.Session) ([]CodeBoundary, error)

Run detects code block boundaries using the CodeNeuron network. See GetSequenceLength() for details on which portion of the text is analyzed.
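As a sketch of consuming the result, the self-contained snippet below re-declares a CodeBoundary-shaped struct and renders the boundaries back into the text as <code>/</code> tags, the same presentation run_model.py produces. insertTags is a hypothetical helper, not part of the package:

```go
package main

import "fmt"

// CodeBoundary mirrors the struct returned by codetect.Run.
type CodeBoundary struct {
	PositionInRunes int
	Start           bool
}

// insertTags renders detected boundaries as <code>...</code> markers.
// Boundaries must be sorted by PositionInRunes; each boundary goes *after*
// the rune at its index, matching the field's documentation.
func insertTags(text string, boundaries []CodeBoundary) string {
	runes := []rune(text)
	var out []rune
	prev := 0
	for _, b := range boundaries {
		pos := b.PositionInRunes + 1 // boundary lies after this rune
		out = append(out, runes[prev:pos]...)
		if b.Start {
			out = append(out, []rune("<code>")...)
		} else {
			out = append(out, []rune("</code>")...)
		}
		prev = pos
	}
	out = append(out, runes[prev:]...)
	return string(out)
}

func main() {
	text := "see def f(): pass here"
	tagged := insertTags(text, []CodeBoundary{{3, true}, {16, false}})
	fmt.Println(tagged) // see <code>def f(): pass</code> here
}
```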

Directories

Path Synopsis
cmd
