uhebom

v0.0.0-...-29d85ef
Published: Sep 3, 2018 | License: MIT | Imports: 12 | Imported by: 0

README

This is a library for unsupervised data extraction from HTML pages.

Its name is an acronym:

  • Unsupervised
  • HTML
  • Extraction
  • Based
  • On
  • Mining Data Records

In short, Uhebom.

It consists of two parts:

  • MDR algorithm for extracting data regions from an HTML web page
  • Needleman–Wunsch algorithm for alignment of data records

The MDR algorithm is based on the Mining Data Records paper. The implementation is heavily inspired by this library.

The alignment part uses this Needleman–Wunsch implementation.
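To illustrate the alignment step, here is a minimal sketch of Needleman–Wunsch global alignment over two sequences of tags. It is not the implementation this package uses: the `align` and `sub` functions, the `gap` marker, and the scoring values (+1 match, -1 mismatch, -1 gap) are all assumptions chosen for the example.

```go
package main

import "fmt"

// gap is a hypothetical marker for positions with no counterpart.
const gap = "-"

// Assumed scoring scheme; real implementations often make this tunable.
const (
	matchScore    = 1
	mismatchScore = -1
	gapPenalty    = -1
)

// sub returns the substitution score for a pair of tokens.
func sub(x, y string) int {
	if x == y {
		return matchScore
	}
	return mismatchScore
}

// align globally aligns a and b and returns both sequences padded
// with gap markers so that they have equal length.
func align(a, b []string) ([]string, []string) {
	n, m := len(a), len(b)
	score := make([][]int, n+1)
	for i := range score {
		score[i] = make([]int, m+1)
		score[i][0] = i * gapPenalty
	}
	for j := 0; j <= m; j++ {
		score[0][j] = j * gapPenalty
	}
	// Fill the dynamic-programming table.
	for i := 1; i <= n; i++ {
		for j := 1; j <= m; j++ {
			best := score[i-1][j-1] + sub(a[i-1], b[j-1])
			if v := score[i-1][j] + gapPenalty; v > best {
				best = v
			}
			if v := score[i][j-1] + gapPenalty; v > best {
				best = v
			}
			score[i][j] = best
		}
	}
	// Trace back from the bottom-right corner, building reversed output.
	var ra, rb []string
	i, j := n, m
	for i > 0 || j > 0 {
		switch {
		case i > 0 && j > 0 && score[i][j] == score[i-1][j-1]+sub(a[i-1], b[j-1]):
			ra, rb = append(ra, a[i-1]), append(rb, b[j-1])
			i, j = i-1, j-1
		case i > 0 && score[i][j] == score[i-1][j]+gapPenalty:
			ra, rb = append(ra, a[i-1]), append(rb, gap)
			i--
		default:
			ra, rb = append(ra, gap), append(rb, b[j-1])
			j--
		}
	}
	// Reverse both slices in place.
	for l, r := 0, len(ra)-1; l < r; l, r = l+1, r-1 {
		ra[l], ra[r] = ra[r], ra[l]
		rb[l], rb[r] = rb[r], rb[l]
	}
	return ra, rb
}

func main() {
	ra, rb := align([]string{"img", "a", "span"}, []string{"img", "span"})
	fmt.Println(ra)
	fmt.Println(rb)
}
```

Aligning records this way lets a record that lacks an element (here, the `a` tag) still line up column by column with its siblings, which is what makes a tabular view possible.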

The purpose of this work

The purpose is to provide a fast and portable way to extract repeating data in tabular form from HTML pages. This implementation also aims to work in a JS environment.

Installation

go get -u github.com/MichaelLeachim/uhebom

Usage


package main

import (
  "log"

  extractor "github.com/MichaelLeachim/uhebom"
)

func main() {
  // Extract takes the raw HTML bytes and returns a [][][]string.
  extracted := extractor.Extract([]byte("<html><div>Hello world</div></html>"))
  log.Println(extracted)
}

Demo

You should check out the result of the system

TODO: add an HTML example that demonstrates usage of this library.

Documentation

Overview

Copyright (c) Michael Leachim. You can find additional information regarding licensing of this work in LICENSE.md. You must not remove this notice, or any other, from this software. All rights reserved. At 2018-27-08 22:34 <mklimoff222@gmail.com>


Constants

const (
	MAX_GENERALIZED_NODES = 5
	THRESHOLD             = 0.75
)
const (
	ALMOST_SIMILAR = 0.8
)

Variables

This section is empty.

Functions

func Extract

func Extract(data []byte) [][][]string
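The signature does not document what each level of the returned `[][][]string` means. The sketch below assumes the outer slice holds data regions, the middle slice holds the records of a region, and the inner slice holds the aligned fields of a record. That reading is an assumption, and `formatRegions` is a hypothetical helper that is not part of the package. The sample data is hand-written rather than real `Extract` output.

```go
package main

import (
	"fmt"
	"strings"
)

// formatRegions renders a value shaped like Extract's result as text.
// Assumed layout: regions -> records -> fields.
func formatRegions(regions [][][]string) string {
	var sb strings.Builder
	for r, records := range regions {
		fmt.Fprintf(&sb, "region %d\n", r)
		for _, fields := range records {
			// One record per line, fields separated by " | ".
			sb.WriteString("  " + strings.Join(fields, " | ") + "\n")
		}
	}
	return sb.String()
}

func main() {
	// Hand-written sample standing in for extractor.Extract(html).
	sample := [][][]string{
		{{"Laptop", "$999"}, {"Phone", "$499"}},
	}
	fmt.Print(formatRegions(sample))
}
```

In real use, the `sample` value would be replaced by the result of calling `Extract` on the page bytes.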

func TestRecordMiningShouldWorkAsExpected

func TestRecordMiningShouldWorkAsExpected(t *testing.T)

Types

type DataRecord

type DataRecord []*DataTree

type DataRegion

type DataRegion struct {
	Parent  *DataTree
	Start   int
	K       int
	Covered int
	Score   float64
	Items   []*DataRecord
}

type DataTree

type DataTree struct {
	Tag      string
	Data     string
	Children []*DataTree

	Index  int
	Parent *DataTree
	Attrs  map[string]string
	// contains filtered or unexported fields
}

type GeneralizedNode

type GeneralizedNode struct {
	// contains filtered or unexported fields
}

type GeneralizedNodeCompareContainer

type GeneralizedNodeCompareContainer struct {
	// contains filtered or unexported fields
}

type MiningDataRecord

type MiningDataRecord struct {
	// contains filtered or unexported fields
}

type MiningDataRegion

type MiningDataRegion struct {
	// contains filtered or unexported fields
}

type SimpleTreeMatch

type SimpleTreeMatch struct{}
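The type carries no documentation, but its name suggests the Simple Tree Matching algorithm that is commonly paired with MDR to score how similar two DOM subtrees are. The following is a minimal sketch of the textbook algorithm, not this package's code. The `node` type, the `simpleTreeMatch`, `size`, and `similarity` functions, and the normalization by average tree size are assumptions made for illustration.

```go
package main

import "fmt"

// node is a hypothetical minimal tree; the package's DataTree has more fields.
type node struct {
	tag      string
	children []*node
}

// simpleTreeMatch returns the size of the largest order-preserving
// matching between a and b in which matched nodes share the same tag.
func simpleTreeMatch(a, b *node) int {
	if a == nil || b == nil || a.tag != b.tag {
		return 0
	}
	n, m := len(a.children), len(b.children)
	w := make([][]int, n+1)
	for i := range w {
		w[i] = make([]int, m+1)
	}
	// Dynamic programming over the two child sequences.
	for i := 1; i <= n; i++ {
		for j := 1; j <= m; j++ {
			best := w[i-1][j]
			if w[i][j-1] > best {
				best = w[i][j-1]
			}
			v := w[i-1][j-1] + simpleTreeMatch(a.children[i-1], b.children[j-1])
			if v > best {
				best = v
			}
			w[i][j] = best
		}
	}
	return w[n][m] + 1 // +1 for the matched roots
}

// size counts the nodes in a tree.
func size(t *node) int {
	if t == nil {
		return 0
	}
	total := 1
	for _, c := range t.children {
		total += size(c)
	}
	return total
}

// similarity normalizes the match count by the average tree size,
// one common choice that yields a score in [0, 1].
func similarity(a, b *node) float64 {
	return float64(simpleTreeMatch(a, b)) / (float64(size(a)+size(b)) / 2)
}

func main() {
	a := &node{"li", []*node{{"a", nil}, {"span", nil}}}
	b := &node{"li", []*node{{"a", nil}, {"span", nil}, {"img", nil}}}
	fmt.Println(simpleTreeMatch(a, b), similarity(a, b))
}
```

A cut-off such as the 0.75 `THRESHOLD` constant listed above could plausibly be compared against a score like this one, but whether the package does so is an assumption.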

type TabularForm

type TabularForm struct {
	Tag     string
	Path    string
	Content string
	IsGap   bool
}
