codetect

package module
v1.0.0-...-97119fa
Published: Mar 8, 2018 License: MIT Imports: 4 Imported by: 0

README

Code Neuron

Recurrent neural network to detect code blocks. Runs on TensorFlow. It is trained in two stages.

The first stage pre-trains the character-level RNN with two branches, before and after:

CharRNN Architecture

my code :  FooBar
------> x <------

We assign the recurrent branches to different GPUs to train faster. I set 512 LSTM neurons and reach 89% validation accuracy over the 200 most frequent character classes:

CharRNN Validation

The second stage trains the same network but with a different dense layer which predicts only 3 classes: code block begins, code block ends, and no-op. The prediction scheme also changes: we now look at adjacent characters and decide whether there is a code boundary between them.

Code Neuron Validation

It is much faster to train and it reaches ~99.2% validation accuracy.
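The second-stage decision rule can be sketched in Go. classifyPair below is a toy stand-in for the trained network (it fires on « and » marker characters, which the real model does not use); it only illustrates the scanning loop over adjacent character pairs:

```go
package main

import "fmt"

// Label is the three-way output of the second-stage classifier.
type Label int

const (
	NoOp Label = iota // no boundary between the two characters
	CodeBegins
	CodeEnds
)

// classifyPair is a hypothetical stand-in for the trained network. It looks
// at the characters on both sides of a candidate boundary; the real model
// consumes a window of context, not just two characters.
func classifyPair(before, after rune) Label {
	if after == '«' {
		return CodeBegins
	}
	if before == '»' {
		return CodeEnds
	}
	return NoOp
}

func main() {
	text := []rune("prose «code» prose")
	// Slide over every adjacent pair and ask whether a boundary lies between.
	for i := 1; i < len(text); i++ {
		switch classifyPair(text[i-1], text[i]) {
		case CodeBegins:
			fmt.Printf("code begins before rune %d\n", i)
		case CodeEnds:
			fmt.Printf("code ends before rune %d\n", i)
		}
	}
}
```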

Training set

StackSample questions and answers, processed with

unzip -p Answers(Questions).csv.zip | ./dataset | sed -r -e '/^$/d' -e '/\x03/ {N; s/\x03\s*\n/\x03/g}' | gzip >> Dataset.txt.gz

Baked model

model_LSTM_600_0.9924.pb - reaches 99.2% accuracy on validation. The model is in TensorFlow "GraphDef" protobuf format.

Pretraining was performed with 20% validation on the first 8000000 bytes of the uncompressed questions. Training was performed with 20% validation and 90% negative samples on the first 256000000 bytes of the uncompressed questions. In other words, I did not want to wait a week for it to train on the whole dataset - you are encouraged to experiment.

Try to run it:

cat sample.txt | python3 run_model.py -m model_LSTM_600_0.9924.pb

You should see:

Here is my Python code, it is awesome and easy to read:
<code>def main():
    print("Hello, world!")
</code>Please say what you think about it. Mad skills. Here is another one,
<code>func main() {
  println("Hello, world!")
}
</code>As you see, I know Go too. Some more text to provide enough context.

Visualize the trained model:

python3 model2tb.py --model-dir model_LSTM_600_0.9924.pb --log-dir tb_logs
tensorboard --logdir=tb_logs

Go inference

go get gopkg.in/vmarkovtsev/CodeNeuron.v1/...
cat sample.txt | $(go env GOPATH)/bin/codetect

API:

import "io/ioutil"
import "gopkg.in/vmarkovtsev/CodeNeuron.v1"

func main() {
  textBytes, _ := ioutil.ReadFile("sample.txt")
  session, _ := codetect.OpenSession()
  defer session.Close()
  result, _ := codetect.Run(string(textBytes), session)
  _ = result
}
Updating the model

go-bindata -nomemcopy -nometadata -pkg assets -o assets/bindata.go model.pb

License

MIT, see LICENSE.

Documentation

Index

Constants

This section is empty.

Variables

var CHARS = map[rune]uint8{}/* 199 elements not displayed */

Functions

func GetSequenceLength

func GetSequenceLength() int

GetSequenceLength returns the sequence length of the RNN model. text[:length / 2] and text[-length / 2:] are not analyzed because the network has too little context there. You can work around this by prepending and appending some constant strings.
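A minimal Go sketch of that workaround; padForContext is a hypothetical helper, seqLen stands in for GetSequenceLength(), and the space filler is an arbitrary choice:

```go
package main

import "fmt"

// padForContext prepends and appends constant filler so that the first and
// last runes of the real text sit outside the blind seqLen/2 margins.
// offset is how many runes were prepended; subtract it from every reported
// PositionInRunes to map boundaries back to the original text.
func padForContext(text string, seqLen int) (padded string, offset int) {
	pad := make([]rune, seqLen/2)
	for i := range pad {
		pad[i] = ' '
	}
	filler := string(pad)
	return filler + text + filler, seqLen / 2
}

func main() {
	padded, offset := padForContext("short snippet", 40)
	fmt.Println(len([]rune(padded)), offset)
}
```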

func OpenSession

func OpenSession() (*tf.Session, error)

OpenSession initializes a new TensorFlow session. Remember to defer session.Close().

Types

type CodeBoundary

type CodeBoundary struct {
	// PositionInRunes is the index of the boundary in the parsed *runes array*.
	// This is not a position in the byte stream.
	// The boundary goes *after* the corresponding rune index.
	PositionInRunes int
	// Start is true if the boundary is a start, otherwise, it is false for an end.
	Start bool
}

CodeBoundary represents a start or an end of a detected code block.

func Run

func Run(text string, session *tf.Session) ([]CodeBoundary, error)

Run detects code block boundaries using the CodeNeuron network. See GetSequenceLength() for details on which portion of the text is analyzed.
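As a sketch of consuming the result, the self-contained snippet below re-declares a CodeBoundary-shaped struct and renders the boundaries back into the text as <code>/</code> tags, the same presentation run_model.py produces. insertTags is a hypothetical helper, not part of the package:

```go
package main

import "fmt"

// CodeBoundary mirrors the struct returned by codetect.Run.
type CodeBoundary struct {
	PositionInRunes int
	Start           bool
}

// insertTags renders detected boundaries as <code>...</code> markers.
// Boundaries must be sorted by PositionInRunes; each boundary goes *after*
// the rune at its index, matching the field's documentation.
func insertTags(text string, boundaries []CodeBoundary) string {
	runes := []rune(text)
	var out []rune
	prev := 0
	for _, b := range boundaries {
		pos := b.PositionInRunes + 1 // boundary lies after this rune
		out = append(out, runes[prev:pos]...)
		if b.Start {
			out = append(out, []rune("<code>")...)
		} else {
			out = append(out, []rune("</code>")...)
		}
		prev = pos
	}
	out = append(out, runes[prev:]...)
	return string(out)
}

func main() {
	text := "see def f(): pass here"
	tagged := insertTags(text, []CodeBoundary{{3, true}, {16, false}})
	fmt.Println(tagged) // see <code>def f(): pass</code> here
}
```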

Directories

Path Synopsis
cmd
