adult

package

v0.9.1 (latest) Published: Apr 20, 2024 License: Apache-2.0 Imports: 15 Imported by: 0

Repository

github.com/gomlx/gomlx

Documentation ¶

Overview ¶

Package adult provides an `InMemoryDataset` implementation for the UCI Adult Census dataset. See the attached notebook (using GoNB) for details and examples of how to use it.

Most users will want `LoadAndPreprocessData` to download and preprocess the data, the global `Data` to access it, and `NewDataset` to create datasets for training and evaluation.

It also provides preprocessing functionality:

- Downloading and caching of the dataset.
- List of column names organized by types (`AdultFieldNames` and `AdultFieldTypes`).
- Vocabularies for categorical features.
- Quantiles for continuous features.
- Pretty printing of some stats.

Index ¶

Constants
Variables
func AssertNoError(err error)
func BinaryFilePath(dir string, numQuantiles int) string
func DownloadDataset(dir string, force bool, verbosity int)
func FileExists(path string) (bool, error)
func LoadAndPreprocessData(dir string, numQuantiles int, forceDownload bool, verbosity int)
func LoadBinaryData(dir string, numQuantiles int) (found bool)
func LoadDataFrame(path string) dataframe.DataFrame
func NewDataset(manager *Manager, rawData *RawData, name string) *data.InMemoryDataset
func PopulateQuantiles(df dataframe.DataFrame, numQuantiles int)
func PopulateVocabularies(df dataframe.DataFrame)
func PrintBatchSamples(manager *Manager, data *RawData)
func PrintFeatures(df dataframe.DataFrame)
func PrintRawData(r *RawData)
func SaveBinaryData(dir string, numQuantiles int) (err error)
type QuantileTable
type RawData
- func ConvertDataFrameToRawData(df dataframe.DataFrame) *RawData
- func (r *RawData) CategoricalIdx(rowNum, colNum int) int
- func (r *RawData) CategoricalRow(rowNum int) []int
- func (r *RawData) ContinuousIdx(rowNum, colNum int) int
- func (r *RawData) ContinuousRow(rowNum int) []float32
- func (r *RawData) CreateTensors(manager *graph.Manager) *TensorData
- func (r *RawData) SampleWithReplacement(numExamples int) *RawData
type TensorData

Constants ¶

const (
	AdultDatasetDataURL  = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
	AdultDatasetDataFile = "adult.data"
	AdultDatasetTestURL  = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"
	AdultDatasetTestFile = "adult.test"

	AdultDatasetDataCecksum = "5b00264637dbfec36bdeaab5676b0b309ff9eb788d63554ca0a249491c86603d"
	AdultDatasetTestCecksum = "a2a9044bc167a35b2361efbabec64e89d69ce82d9790d2980119aac5fd7e9c05"
)

Various URLs and file names for Adult-UCI dataset.
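
The two checksum constants are 64 hexadecimal characters long, which is consistent with hex-encoded SHA-256 digests. Assuming that is what they are (the documentation does not say), a downloaded file could be verified roughly as follows. `verifyChecksum` is a hypothetical helper, not part of this package.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// verifyChecksum reports whether the SHA-256 digest of content, written as
// lowercase hex, equals wantHex. Hypothetical helper for illustration only.
func verifyChecksum(content []byte, wantHex string) bool {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:]) == wantHex
}

func main() {
	// SHA-256 of the empty input, a well-known test vector.
	const emptySHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
	fmt.Println(verifyChecksum(nil, emptySHA256))
	fmt.Println(verifyChecksum([]byte("tampered"), emptySHA256))
}
```

In real use, `content` would be the bytes of `adult.data` or `adult.test` and `wantHex` the matching constant.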

const (
	WeightCol         = "fnlwgt" // "Final Weight", see adult.names file.
	LabelCol          = "label"  // That is the target prediction column.
	EducationTypeCol  = "education"
	EducationYearsCol = "education-num"
)

Column names:

const (
	LabelTrue  = ">50K"
	LabelFalse = "<=50K"
)

Label values:
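
`RawData.Labels` stores 1.0 for `>50K` and 0.0 for `<=50K`. A minimal sketch of that conversion, using local copies of the two constants; `labelToFloat` is a hypothetical name and the error handling for unexpected values is an assumption:

```go
package main

import "fmt"

// Local copies of the package's label constants, for a self-contained sketch.
const (
	labelTrue  = ">50K"
	labelFalse = "<=50K"
)

// labelToFloat converts a label string to 1.0 (labelTrue) or 0.0 (labelFalse).
// Any other value is reported as an error.
func labelToFloat(label string) (float32, error) {
	switch label {
	case labelTrue:
		return 1, nil
	case labelFalse:
		return 0, nil
	}
	return 0, fmt.Errorf("unexpected label %q", label)
}

func main() {
	for _, label := range []string{">50K", "<=50K", "maybe"} {
		value, err := labelToFloat(label)
		fmt.Println(label, value, err)
	}
}
```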

const Unknown = "?"

Unknown representation for string columns -- used in Adult dataset.

Variables ¶

var (
	// AdultFieldNames in the dataset.
	AdultFieldNames = []string{
		"age", "workclass", WeightCol, EducationTypeCol, EducationYearsCol, "marital-status", "occupation", "relationship",
		"race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", LabelCol,
	}

	// AdultFieldTypes maps the field (column) name to its format.
	AdultFieldTypes = map[string]series.Type{
		"age":             series.Float,
		"workclass":       series.String,
		WeightCol:         series.Float,
		EducationTypeCol:  series.String,
		EducationYearsCol: series.Float,
		"marital-status":  series.String,
		"occupation":      series.String,
		"relationship":    series.String,
		"race":            series.String,
		"sex":             series.String,
		"capital-gain":    series.Float,
		"capital-loss":    series.Float,
		"hours-per-week":  series.Float,
		"native-country":  series.String,
		LabelCol:          series.String,
	}
)

var Data struct {
	// VocabulariesFeatures is a list of feature names for the vocabularies stored in Vocabularies.
	VocabulariesFeatures []string

	// Vocabularies is a list of maps of value to integer. The
	// special string "Unknown" is mapped to 0.
	Vocabularies []map[string]int

	// FeatureNameToVocabIdx maps a feature name to its vocabulary index.
	FeatureNameToVocabIdx map[string]int

	// QuantilesFeatures is the ordered list of numeric features for which we have quantiles.
	QuantilesFeatures []string

	// Quantiles for features listed in QuantilesFeatures
	Quantiles []QuantileTable

	// Train dataset.
	Train *RawData

	// Test dataset.
	Test *RawData
}

Data holds all the data (train and test), and the required information collected statically (i.e., non machine learned) from the training dataset (we don't look at test to generate these).

It is filled out by LoadAndPreprocessData.

Functions ¶

func AssertNoError ¶

func AssertNoError(err error)

AssertNoError checks that err is nil, otherwise it `panic`s with `err`.

func BinaryFilePath ¶

func BinaryFilePath(dir string, numQuantiles int) string

BinaryFilePath returns the name used to store the preprocessed data in binary (fast) format.

The `numQuantiles` is the only preprocessing parameter that affects the result. We use it as part of the filename to make sure we don't re-use data generated for different `numQuantiles`.

Consider using LoadAndPreprocessData instead.
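
To illustrate why `numQuantiles` belongs in the file name, here is a sketch with a made-up naming pattern. The pattern `adult-%d_quantiles.bin` and the helper `binaryFilePath` are hypothetical; the actual name produced by `BinaryFilePath` is not documented here.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// binaryFilePath builds a cache file name that embeds numQuantiles, so data
// preprocessed with a different setting lands in a different file.
// The file name pattern is hypothetical.
func binaryFilePath(dir string, numQuantiles int) string {
	return filepath.Join(dir, fmt.Sprintf("adult-%d_quantiles.bin", numQuantiles))
}

func main() {
	fmt.Println(binaryFilePath("data", 20))
	fmt.Println(binaryFilePath("data", 100))
}
```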

func DownloadDataset ¶

func DownloadDataset(dir string, force bool, verbosity int)

DownloadDataset downloads the Adult dataset files into `dir`. It calls `log.Fatal` if it fails. Verbosity values >= 1 will print what it's doing.

If files are already downloaded it does nothing -- except if `force` is set to true.
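
The skip-unless-forced behavior can be sketched without any network access by injecting the fetch step. `ensureFile`, `demo`, and the placeholder content are hypothetical; the real function downloads from the URLs listed in the constants.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ensureFile writes the bytes returned by fetch to path, unless the file
// already exists and force is false. It reports whether fetch was called.
func ensureFile(path string, force bool, fetch func() ([]byte, error)) (bool, error) {
	if !force {
		if _, err := os.Stat(path); err == nil {
			return false, nil // Already cached, nothing to do.
		}
	}
	content, err := fetch()
	if err != nil {
		return false, err
	}
	if err := os.WriteFile(path, content, 0o644); err != nil {
		return false, err
	}
	return true, nil
}

// demo calls ensureFile three times on a temporary path, with force set to
// false, false, and true, and records whether each call fetched.
func demo() ([]bool, error) {
	dir, err := os.MkdirTemp("", "adult-sketch")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "adult.data")
	fetch := func() ([]byte, error) { return []byte("placeholder content"), nil }

	var fetched []bool
	for _, force := range []bool{false, false, true} {
		f, err := ensureFile(path, force, fetch)
		if err != nil {
			return nil, err
		}
		fetched = append(fetched, f)
	}
	return fetched, nil
}

func main() {
	fetched, err := demo()
	fmt.Println(fetched, err)
}
```

Only the first and the forced call fetch; the second call finds the cached file and returns early.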

func FileExists ¶

func FileExists(path string) (bool, error)

FileExists returns whether the given file path exists.
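
The `(bool, error)` result suggests three outcomes: the file exists, it does not exist, or the check itself failed. A minimal sketch of one way to get that behavior from the standard library; `fileExists` is a local stand-in and may differ from the package's implementation:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// fileExists distinguishes "exists", "does not exist", and "could not check".
func fileExists(path string) (bool, error) {
	_, err := os.Stat(path)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, fs.ErrNotExist):
		return false, nil
	default:
		// For example, a permission error on a parent directory.
		return false, err
	}
}

func main() {
	fmt.Println(fileExists("."))
	fmt.Println(fileExists("no-such-file.adult-sketch"))
}
```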

func LoadAndPreprocessData ¶

func LoadAndPreprocessData(dir string, numQuantiles int, forceDownload bool, verbosity int)

LoadAndPreprocessData is an all-in-one function call for preprocessing the Adult dataset. The resulting information and data are available in the global Data.

Parameters:

- dir: where to store downloaded files. By default, files that are already downloaded are reused.
- numQuantiles: number of quantiles to generate for the continuous features. They can be used for piecewise-linear calibration.
- forceDownload: download the data from the internet even if it is already downloaded.
- verbosity: set to a value >= 1 to print out what it's doing.

The results are stored in the global variable `Data`.

It panics in case of error.
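
As an illustration of how a quantile table can feed piecewise-linear calibration, the sketch below maps a raw value to [0, 1] by interpolating linearly between consecutive quantile boundaries. This is one common form of the technique, not necessarily what a model built on this package does; `calibrate` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// calibrate maps x into [0, 1] using the quantile boundaries as keypoints:
// boundary k maps to k/(n-1), and values in between are interpolated
// linearly. quantiles must be sorted ascending and have at least 2 entries.
func calibrate(quantiles []float32, x float32) float32 {
	n := len(quantiles)
	if x <= quantiles[0] {
		return 0
	}
	if x >= quantiles[n-1] {
		return 1
	}
	// First index i with quantiles[i] > x, so quantiles[i-1] <= x < quantiles[i].
	i := sort.Search(n, func(i int) bool { return quantiles[i] > x })
	lo, hi := quantiles[i-1], quantiles[i]
	frac := (x - lo) / (hi - lo)
	return (float32(i-1) + frac) / float32(n-1)
}

func main() {
	table := []float32{0, 10, 20, 40}
	for _, x := range []float32{-5, 15, 30, 100} {
		fmt.Println(x, calibrate(table, x))
	}
}
```

Values below the first boundary clamp to 0 and values above the last clamp to 1, so outliers cannot dominate the feature's range.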

func LoadBinaryData ¶

func LoadBinaryData(dir string, numQuantiles int) (found bool)

LoadBinaryData loads the global Data structure from its binary format, for faster access. It returns true if the data was available and loaded.

Consider using LoadAndPreprocessData instead.

func LoadDataFrame ¶

func LoadDataFrame(path string) dataframe.DataFrame

LoadDataFrame loads the file at `path` and returns a DataFrame. `path` is the name of the downloaded file.

Consider using LoadAndPreprocessData instead.

func NewDataset ¶

func NewDataset(manager *Manager, rawData *RawData, name string) *data.InMemoryDataset

NewDataset creates a new `data.InMemoryDataset` (can be used for training and evaluation) for the UCI Adult dataset.

func PopulateQuantiles ¶

func PopulateQuantiles(df dataframe.DataFrame, numQuantiles int)

PopulateQuantiles populates the quantile tables with up to numQuantiles entries for each float column.
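
The phrase "up to numQuantiles" can be illustrated with a small sketch: pick evenly spaced positions in the sorted values and drop repeated boundaries, so a column with few distinct values ends up with a shorter table. The exact selection rule used by the package is not documented here; `quantileTable` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// quantileTable returns up to numQuantiles boundaries taken at evenly spaced
// positions of the sorted values. Consecutive duplicates are dropped, which
// is why the result may be shorter than numQuantiles.
func quantileTable(values []float32, numQuantiles int) []float32 {
	if len(values) == 0 || numQuantiles < 2 {
		return nil
	}
	sorted := append([]float32(nil), values...) // Leave the input untouched.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	var table []float32
	for q := 0; q < numQuantiles; q++ {
		idx := q * (len(sorted) - 1) / (numQuantiles - 1)
		v := sorted[idx]
		if len(table) == 0 || v != table[len(table)-1] {
			table = append(table, v)
		}
	}
	return table
}

func main() {
	fmt.Println(quantileTable([]float32{5, 1, 3, 2, 4}, 3))
	fmt.Println(quantileTable([]float32{1, 1, 1, 2}, 4))
}
```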

func PopulateVocabularies ¶

func PopulateVocabularies(df dataframe.DataFrame)

PopulateVocabularies goes over all string columns in the DataFrame and maps their values to integers starting from 0. The results are stored in Data.Vocabularies.
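
A minimal sketch of building one such vocabulary, with the unknown marker reserved at index 0 as described for `Data.Vocabularies`. Assigning the remaining ids in order of first appearance is an assumption; the package may order values differently. `buildVocabulary` is a hypothetical helper.

```go
package main

import "fmt"

// unknown mirrors the package's Unknown constant.
const unknown = "?"

// buildVocabulary maps each distinct value to an integer id. The unknown
// marker always gets id 0; other values get ids in order of first appearance.
func buildVocabulary(values []string) map[string]int {
	vocab := map[string]int{unknown: 0}
	for _, v := range values {
		if _, found := vocab[v]; !found {
			vocab[v] = len(vocab)
		}
	}
	return vocab
}

func main() {
	column := []string{"Private", "?", "State-gov", "Private"}
	vocab := buildVocabulary(column)
	for _, v := range column {
		fmt.Println(v, vocab[v])
	}
}
```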

func PrintBatchSamples ¶

func PrintBatchSamples(manager *Manager, data *RawData)

PrintBatchSamples generates a couple of batches of size 3 and prints them to the output. Intended for debugging only.

func PrintFeatures ¶ added in v0.1.1

func PrintFeatures(df dataframe.DataFrame)

PrintFeatures prints information about the vocabularies and quantiles of the features.

func PrintRawData ¶

func PrintRawData(r *RawData)

PrintRawData prints the positivity ratio and some samples.
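
Taking "positivity ratio" to mean the fraction of examples labeled `>50K`, it is the mean of the 0/1 labels. The sketch below is unweighted, which is an assumption: the package may take `Weights` into account. `positivityRatio` is a hypothetical helper.

```go
package main

import "fmt"

// positivityRatio returns the fraction of labels equal to 1.0, computed as
// the unweighted mean of the 0/1 labels. It returns 0 for an empty slice.
func positivityRatio(labels []float32) float32 {
	if len(labels) == 0 {
		return 0
	}
	var sum float32
	for _, label := range labels {
		sum += label
	}
	return sum / float32(len(labels))
}

func main() {
	fmt.Println(positivityRatio([]float32{1, 0, 0, 1}))
}
```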

func SaveBinaryData ¶

func SaveBinaryData(dir string, numQuantiles int) (err error)

SaveBinaryData saves the global Data structure in binary format, for faster access.

Consider using LoadAndPreprocessData instead.
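
The save and load pair amounts to serializing the preprocessed structure once and reading it back on later runs. The sketch below uses `encoding/gob` on a trimmed-down stand-in struct; both the encoding and the `preprocessed` type are assumptions, since the binary format used by the package is not documented here.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// preprocessed is a trimmed-down, hypothetical stand-in for the global Data.
type preprocessed struct {
	NumQuantiles int
	Vocabularies []map[string]int
	Quantiles    [][]float32
}

// roundTrip encodes in to an in-memory buffer and decodes it back, standing
// in for a save to disk followed by a load.
func roundTrip(in preprocessed) (preprocessed, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(in); err != nil {
		return preprocessed{}, err
	}
	var out preprocessed
	err := gob.NewDecoder(&buf).Decode(&out)
	return out, err
}

func main() {
	in := preprocessed{
		NumQuantiles: 100,
		Vocabularies: []map[string]int{{"?": 0, "Private": 1}},
		Quantiles:    [][]float32{{17, 37, 90}},
	}
	out, err := roundTrip(in)
	fmt.Println(out, err)
}
```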

Types ¶

type QuantileTable ¶

type QuantileTable []float32

QuantileTable holds the quantiles of a set of values.

type RawData ¶

type RawData struct {
	NumRows, NumCategorical, NumContinuous int

	// Categorical is shaped `[NumRows, NumCategorical]` ordered as in VocabulariesFeatures.
	Categorical []int

	// Continuous is shaped `[NumRows, NumContinuous]`, ordered as in QuantilesFeatures.
	Continuous []float32 // Must match ModelDType.

	// Weights is shaped [NumRows]
	Weights []float32 // Must match ModelDType.

	// Labels is shaped [NumRows]: 1.0 (>50K) or 0.0 (<=50K)
	Labels []float32 // Must match ModelDType.
}

RawData holds the data stripped of all metadata: categorical converted to ints. It includes the whole dataset.

func ConvertDataFrameToRawData ¶

func ConvertDataFrameToRawData(df dataframe.DataFrame) *RawData

ConvertDataFrameToRawData converts df to raw data and returns it as a `*RawData`.

func (*RawData) CategoricalIdx ¶

func (r *RawData) CategoricalIdx(rowNum, colNum int) int

func (*RawData) CategoricalRow ¶

func (r *RawData) CategoricalRow(rowNum int) []int

func (*RawData) ContinuousIdx ¶

func (r *RawData) ContinuousIdx(rowNum, colNum int) int

func (*RawData) ContinuousRow ¶

func (r *RawData) ContinuousRow(rowNum int) []float32
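
The four accessors above carry no description. Since `Categorical` and `Continuous` are flat slices documented as shaped `[NumRows, NumCategorical]` and `[NumRows, NumContinuous]`, a plausible reading is row-major indexing, sketched below on a reduced stand-in type. Row-major order and the `rawData` type are assumptions.

```go
package main

import "fmt"

// rawData is a reduced, hypothetical stand-in for RawData with only the
// categorical part.
type rawData struct {
	NumRows, NumCategorical int
	Categorical             []int // Flat storage, assumed row-major.
}

// categoricalIdx returns the position in the flat slice of (rowNum, colNum).
func (r *rawData) categoricalIdx(rowNum, colNum int) int {
	return rowNum*r.NumCategorical + colNum
}

// categoricalRow returns the sub-slice holding all categorical values of a row.
func (r *rawData) categoricalRow(rowNum int) []int {
	start := r.categoricalIdx(rowNum, 0)
	return r.Categorical[start : start+r.NumCategorical]
}

func main() {
	r := &rawData{
		NumRows:        2,
		NumCategorical: 3,
		Categorical:    []int{1, 0, 4, 2, 2, 7},
	}
	fmt.Println(r.categoricalIdx(1, 2))
	fmt.Println(r.categoricalRow(1))
}
```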

func (*RawData) CreateTensors ¶

func (r *RawData) CreateTensors(manager *graph.Manager) *TensorData

CreateTensors converts the dataset to tensors, for faster ML interaction.

func (*RawData) SampleWithReplacement ¶

func (r *RawData) SampleWithReplacement(numExamples int) *RawData

SampleWithReplacement samples numExamples rows with replacement, in local memory.
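
Sampling with replacement means each draw picks a row independently, so the same row can appear more than once and `numExamples` may exceed `NumRows`. A minimal sketch that draws row indices; the seeded generator and the `sampleRowIndices` helper are hypothetical:

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampleRowIndices draws numExamples row indices uniformly from
// [0, numRows), each draw independent of the others.
func sampleRowIndices(seed int64, numRows, numExamples int) []int {
	rng := rand.New(rand.NewSource(seed))
	indices := make([]int, numExamples)
	for i := range indices {
		indices[i] = rng.Intn(numRows)
	}
	return indices
}

func main() {
	// More examples than rows: repeats are guaranteed.
	fmt.Println(sampleRowIndices(42, 5, 8))
}
```

A sampled dataset would then be assembled by copying the categorical, continuous, weight, and label values of each drawn row.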

type TensorData ¶

type TensorData struct {
	CategoricalTensor, ContinuousTensor, WeightsTensor, LabelsTensor tensor.Tensor
}

TensorData contains a RawData converted to tensors.

Directories ¶

Path	Synopsis
demo	Linear generates random synthetic data, based on some linear model + noise.
