adult

package
v0.9.1
Warning

This package is not in the latest version of its module.
Published: Apr 20, 2024 License: Apache-2.0 Imports: 15 Imported by: 0

Documentation

Overview

Package adult provides an `InMemoryDataset` implementation for the UCI Adult Census dataset. See the attached notebook (using GoNB) for details and examples of how to use it.

Most users will want to call `LoadAndPreprocessData` to download and preprocess the data, access it through the global `Data` variable, and call `NewDataset` to create datasets for training and evaluation.

It also provides preprocessing functionality:

- Downloading and caching of the dataset.
- List of column names organized by types (`AdultFieldNames` and `AdultFieldTypes`).
- Vocabularies for categorical features.
- Quantiles for continuous features.
- Pretty-printing of some statistics.

Index

Constants

const (
	AdultDatasetDataURL  = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
	AdultDatasetDataFile = "adult.data"
	AdultDatasetTestURL  = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"
	AdultDatasetTestFile = "adult.test"

	AdultDatasetDataCecksum = "5b00264637dbfec36bdeaab5676b0b309ff9eb788d63554ca0a249491c86603d"
	AdultDatasetTestCecksum = "a2a9044bc167a35b2361efbabec64e89d69ce82d9790d2980119aac5fd7e9c05"
)

URLs, file names and checksums for the UCI Adult dataset files.
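The checksum constants above can be used to validate a downloaded file. The following is a minimal sketch, assuming the 64-hex-digit constants are SHA-256 digests (the page does not say so explicitly); `checksumMatches` is a hypothetical helper, not part of the package.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// checksumMatches reports whether the SHA-256 digest of data equals wantHex.
// Assumption: the checksum constants are hex-encoded SHA-256 digests.
func checksumMatches(data []byte, wantHex string) bool {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]) == strings.ToLower(wantHex)
}

func main() {
	// Stand-in content; a real check would pass the bytes of the downloaded
	// file together with the matching checksum constant.
	content := []byte("hello")
	want := "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"
	fmt.Println("checksum matches:", checksumMatches(content, want))
}
```

Comparing digests before parsing makes a truncated or modified download fail early rather than produce subtly wrong vocabularies and quantiles.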

const (
	WeightCol         = "fnlwgt" // "Final Weight", see adult.names file.
	LabelCol          = "label"  // That is the target prediction column.
	EducationTypeCol  = "education"
	EducationYearsCol = "education-num"
)

Column names:

const (
	LabelTrue  = ">50K"
	LabelFalse = "<=50K"
)

Label values:
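As an illustration of how these label strings relate to the numeric labels described for `RawData.Labels` (1.0 for >50K, 0.0 for <=50K), here is a minimal sketch. `labelToFloat` is a hypothetical helper; trimming surrounding spaces and a trailing period is an assumption about how raw rows may be formatted, not documented behavior of the package.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	LabelTrue  = ">50K"
	LabelFalse = "<=50K"
)

// labelToFloat converts a raw label string to 1.0 (>50K) or 0.0 (<=50K).
// Assumption: surrounding spaces and a trailing "." are noise to be trimmed.
func labelToFloat(raw string) (float32, error) {
	s := strings.TrimSuffix(strings.TrimSpace(raw), ".")
	switch s {
	case LabelTrue:
		return 1, nil
	case LabelFalse:
		return 0, nil
	}
	return 0, fmt.Errorf("unrecognized label %q", raw)
}

func main() {
	for _, raw := range []string{">50K", " <=50K.", "?"} {
		v, err := labelToFloat(raw)
		fmt.Println(raw, v, err)
	}
}
```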

const Unknown = "?"

Unknown representation for string columns -- used in Adult dataset.

Variables

var (
	// AdultFieldNames in the dataset.
	AdultFieldNames = []string{
		"age", "workclass", WeightCol, EducationTypeCol, EducationYearsCol, "marital-status", "occupation", "relationship",
		"race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", LabelCol,
	}

	// AdultFieldTypes maps the field (column) name to its format.
	AdultFieldTypes = map[string]series.Type{
		"age":             series.Float,
		"workclass":       series.String,
		WeightCol:         series.Float,
		EducationTypeCol:  series.String,
		EducationYearsCol: series.Float,
		"marital-status":  series.String,
		"occupation":      series.String,
		"relationship":    series.String,
		"race":            series.String,
		"sex":             series.String,
		"capital-gain":    series.Float,
		"capital-loss":    series.Float,
		"hours-per-week":  series.Float,
		"native-country":  series.String,
		LabelCol:          series.String,
	}
)
var Data struct {
	// VocabulariesFeatures is a list of feature names for the vocabularies stored in Vocabularies.
	VocabulariesFeatures []string

	// Vocabularies is a list of maps of value to integer. The
	// special string "Unknown" is mapped to 0.
	Vocabularies []map[string]int

	// FeatureNameToVocabIdx maps a feature name to its vocabulary index.
	FeatureNameToVocabIdx map[string]int

	// QuantilesFeatures is the ordered list of numeric features for which we have quantiles.
	QuantilesFeatures []string

	// Quantiles for features listed in QuantilesFeatures
	Quantiles []QuantileTable

	// Train dataset.
	Train *RawData

	// Test dataset.
	Test *RawData
}

Data holds all the data (train and test), and the required information collected statically (i.e., non machine learned) from the training dataset (we don't look at test to generate these).

It is filled out by LoadAndPreprocessData.

Functions

func AssertNoError

func AssertNoError(err error)

AssertNoError checks that err is nil, otherwise it `panic`s with `err`.

func BinaryFilePath

func BinaryFilePath(dir string, numQuantiles int) string

BinaryFilePath returns the name used to store the preprocessed data in binary (fast) format.

The `numQuantiles` is the only preprocessing parameter that affects the result. We use it as part of the filename to make sure we don't re-use data generated for different `numQuantiles`.

Consider using LoadAndPreprocessData instead.
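To illustrate why `numQuantiles` is part of the file name, here is a minimal sketch of such a path function. The file name pattern is hypothetical; the actual name used by `BinaryFilePath` is not given on this page.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// binaryFilePath builds a cache file path that embeds numQuantiles, so that
// data preprocessed with different settings is never mixed up.
// The "adult_data-%d_quantiles.bin" pattern is an illustrative assumption.
func binaryFilePath(dir string, numQuantiles int) string {
	name := fmt.Sprintf("adult_data-%d_quantiles.bin", numQuantiles)
	return filepath.Join(dir, name)
}

func main() {
	fmt.Println(binaryFilePath("data", 100))
	fmt.Println(binaryFilePath("data", 20))
}
```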

func DownloadDataset

func DownloadDataset(dir string, force bool, verbosity int)

DownloadDataset downloads the Adult dataset files into `dir`. It calls `log.Fatal` if it fails. With `verbosity` >= 1 it prints what it is doing.

If files are already downloaded it does nothing -- except if `force` is set to true.

func FileExists

func FileExists(path string) (bool, error)

FileExists returns whether the given file path exists.
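A function with this signature is commonly written with `os.Stat`. The following is a minimal sketch of that pattern, not necessarily how the package implements it; `fileExists` is a hypothetical stand-in.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// fileExists reports whether path exists. A "not exist" error is turned into
// (false, nil); any other error (e.g. permission denied) is returned as is.
func fileExists(path string) (bool, error) {
	_, err := os.Stat(path)
	if err == nil {
		return true, nil
	}
	if errors.Is(err, os.ErrNotExist) {
		return false, nil
	}
	return false, err
}

func main() {
	for _, p := range []string{".", "no-such-file-for-this-example"} {
		ok, err := fileExists(p)
		fmt.Println(p, ok, err)
	}
}
```

Returning the error separately lets the caller distinguish "the file is missing" from "the file could not be checked".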

func LoadAndPreprocessData

func LoadAndPreprocessData(dir string, numQuantiles int, forceDownload bool, verbosity int)

LoadAndPreprocessData performs all the data preprocessing for the Adult dataset in a single function call.

Parameters:

- dir: where to store downloaded files. By default, files that are already downloaded are reused.
- numQuantiles: number of quantiles to generate for the continuous features. They can be used for piecewise-linear calibration.
- forceDownload: downloads the data from the internet even if it is already downloaded.
- verbosity: set to a value >= 1 to print what it is doing.

The results are stored in the global variable `Data`.

It panics in case of error.

func LoadBinaryData

func LoadBinaryData(dir string, numQuantiles int) (found bool)

LoadBinaryData loads the global Data structure from its binary format, for faster access. It returns true if the data was available and loaded.

Consider using LoadAndPreprocessData instead.

func LoadDataFrame

func LoadDataFrame(path string) dataframe.DataFrame

LoadDataFrame loads the file at `path` and returns it as a DataFrame. `path` is the name of the downloaded file.

Consider using LoadAndPreprocessData instead.

func NewDataset

func NewDataset(manager *Manager, rawData *RawData, name string) *data.InMemoryDataset

NewDataset creates a new `data.InMemoryDataset` (can be used for training and evaluation) for the UCI Adult dataset.

func PopulateQuantiles

func PopulateQuantiles(df dataframe.DataFrame, numQuantiles int)

PopulateQuantiles computes up to numQuantiles quantiles for each float column.
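One way to read "up to numQuantiles" is that repeated quantile values are dropped, which matters for columns dominated by a single value. The sketch below illustrates that idea under this assumption; `computeQuantiles` is hypothetical and the package's actual selection rule may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// computeQuantiles returns up to numQuantiles values of the sample: it sorts a
// copy, picks evenly spaced positions from the minimum to the maximum, and
// drops consecutive repeated values (an assumption of this sketch).
func computeQuantiles(values []float32, numQuantiles int) []float32 {
	if len(values) == 0 || numQuantiles < 2 {
		return nil
	}
	sorted := append([]float32(nil), values...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	quantiles := make([]float32, 0, numQuantiles)
	for i := 0; i < numQuantiles; i++ {
		pos := i * (len(sorted) - 1) / (numQuantiles - 1)
		v := sorted[pos]
		if len(quantiles) == 0 || v != quantiles[len(quantiles)-1] {
			quantiles = append(quantiles, v)
		}
	}
	return quantiles
}

func main() {
	fmt.Println(computeQuantiles([]float32{5, 1, 3, 2, 4}, 3))
	// A column that is mostly zeros yields fewer quantiles than requested.
	fmt.Println(computeQuantiles([]float32{0, 0, 0, 0, 7}, 5))
}
```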

func PopulateVocabularies

func PopulateVocabularies(df dataframe.DataFrame)

PopulateVocabularies goes over all string columns in the DataFrame and maps their values to integers starting from 0. The results are stored in Vocabularies.
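The mapping for a single string column can be sketched as follows. This assumes the `Unknown` value ("?") is reserved as index 0 and other values are numbered in order of first appearance; the ordering is an assumption of this sketch, and `buildVocabulary` is a hypothetical helper.

```go
package main

import "fmt"

// Unknown mirrors the package constant used for missing string values.
const Unknown = "?"

// buildVocabulary maps each distinct value to an integer id. Unknown always
// gets id 0; other values get ids in order of first appearance (assumption).
func buildVocabulary(values []string) map[string]int {
	vocab := map[string]int{Unknown: 0}
	for _, v := range values {
		if _, found := vocab[v]; !found {
			vocab[v] = len(vocab)
		}
	}
	return vocab
}

func main() {
	column := []string{"Private", "?", "State-gov", "Private"}
	vocab := buildVocabulary(column)
	for _, v := range column {
		fmt.Printf("%q -> %d\n", v, vocab[v])
	}
}
```

Reserving id 0 for the unknown value gives models a stable index for missing data regardless of which other values appear in the training set.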

func PrintBatchSamples

func PrintBatchSamples(manager *Manager, data *RawData)

PrintBatchSamples generates a couple of batches of size 3 and prints them to the output. It is intended only for debugging.

func PrintFeatures added in v0.1.1

func PrintFeatures(df dataframe.DataFrame)

PrintFeatures prints information about the vocabularies and quantiles of the features.

func PrintRawData

func PrintRawData(r *RawData)

PrintRawData prints the positivity ratio and some samples.

func SaveBinaryData

func SaveBinaryData(dir string, numQuantiles int) (err error)

SaveBinaryData saves the global Data structure in binary format, for faster access.

Consider using LoadAndPreprocessData instead.

Types

type QuantileTable

type QuantileTable []float32

QuantileTable holds the quantiles of a set of values.
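As a rough illustration of how a quantile table can feed piecewise-linear calibration, the sketch below maps a value to [0, 1] by interpolating linearly between adjacent quantiles. The local `QuantileTable` type mirrors the one above; `normalize` is a hypothetical helper and assumes the table is sorted with strictly increasing values.

```go
package main

import (
	"fmt"
	"sort"
)

// QuantileTable mirrors the package type: sorted quantile values.
type QuantileTable []float32

// normalize maps x to [0, 1]: values at or below the first quantile map to 0,
// values at or above the last map to 1, and values in between are linearly
// interpolated inside their quantile interval.
func normalize(q QuantileTable, x float32) float32 {
	n := len(q)
	if n < 2 || x <= q[0] {
		return 0
	}
	if x >= q[n-1] {
		return 1
	}
	// First index whose quantile is strictly greater than x; here 1 <= i <= n-1.
	i := sort.Search(n, func(k int) bool { return q[k] > x })
	frac := (x - q[i-1]) / (q[i] - q[i-1])
	return (float32(i-1) + frac) / float32(n-1)
}

func main() {
	q := QuantileTable{0, 10, 30}
	for _, x := range []float32{-1, 5, 10, 20, 100} {
		fmt.Println(x, normalize(q, x))
	}
}
```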

type RawData

type RawData struct {
	NumRows, NumCategorical, NumContinuous int

	// Categorical is shaped `[NumRows, NumCategorical]` ordered as in VocabulariesFeatures.
	Categorical []int

	// Continuous is shaped `[NumRows, NumContinuous]`, ordered as in QuantilesFeatures.
	Continuous []float32 // Must match ModelDType.

	// Weights is shaped [NumRows]
	Weights []float32 // Must match ModelDType.

	// Labels is shaped [NumRows]: 1.0 (>50K) or 0.0 (<=50K)
	Labels []float32 // Must match ModelDType.
}

RawData holds the data stripped of all metadata: categorical converted to ints. It includes the whole dataset.

func ConvertDataFrameToRawData

func ConvertDataFrameToRawData(df dataframe.DataFrame) *RawData

ConvertDataFrameToRawData converts df to raw data and returns it as a `*RawData`.

func (*RawData) CategoricalIdx

func (r *RawData) CategoricalIdx(rowNum, colNum int) int

func (*RawData) CategoricalRow

func (r *RawData) CategoricalRow(rowNum int) []int

func (*RawData) ContinuousIdx

func (r *RawData) ContinuousIdx(rowNum, colNum int) int

func (*RawData) ContinuousRow

func (r *RawData) ContinuousRow(rowNum int) []float32
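The index and row accessors above are undocumented. Given that `Categorical` and `Continuous` are flat slices shaped `[NumRows, NumCategorical]` and `[NumRows, NumContinuous]`, a plausible reading is plain row-major indexing. The sketch below shows that reading on a reduced, hypothetical `rawData` type; it is an assumption, not the package's confirmed behavior.

```go
package main

import "fmt"

// rawData is a reduced stand-in for RawData, holding only categorical values.
type rawData struct {
	NumRows, NumCategorical int
	Categorical             []int // Flat, shaped [NumRows, NumCategorical].
}

// categoricalIdx returns the position of (rowNum, colNum) in the flat slice,
// assuming row-major layout.
func (r *rawData) categoricalIdx(rowNum, colNum int) int {
	return rowNum*r.NumCategorical + colNum
}

// categoricalRow returns the sub-slice holding all columns of one row.
// It shares memory with r.Categorical.
func (r *rawData) categoricalRow(rowNum int) []int {
	start := r.categoricalIdx(rowNum, 0)
	return r.Categorical[start : start+r.NumCategorical]
}

func main() {
	r := &rawData{NumRows: 2, NumCategorical: 3, Categorical: []int{0, 1, 2, 3, 4, 5}}
	fmt.Println(r.categoricalIdx(1, 2)) // Position of row 1, column 2.
	fmt.Println(r.categoricalRow(1))    // All columns of row 1.
}
```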

func (*RawData) CreateTensors

func (r *RawData) CreateTensors(manager *graph.Manager) *TensorData

CreateTensors converts the dataset to tensors, for faster ML interaction.

func (*RawData) SampleWithReplacement

func (r *RawData) SampleWithReplacement(numExamples int) *RawData

SampleWithReplacement samples numExamples rows with replacement, in local memory, and returns them as a new RawData.
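Sampling with replacement over flat row-major slices can be sketched as below. The reduced `rawData` type, the explicit `seed` parameter, and the helper name are hypothetical choices for this example; the point is that every per-row field is copied from the same randomly drawn source row, so rows stay aligned.

```go
package main

import (
	"fmt"
	"math/rand"
)

// rawData is a reduced stand-in for RawData.
type rawData struct {
	NumRows, NumContinuous int
	Continuous             []float32 // Flat, shaped [NumRows, NumContinuous].
	Labels                 []float32 // Shaped [NumRows].
}

// sampleWithReplacement draws numExamples rows uniformly at random, allowing
// repeats, and copies them into a new rawData. It requires r.NumRows > 0.
func sampleWithReplacement(r *rawData, numExamples int, seed int64) *rawData {
	rng := rand.New(rand.NewSource(seed))
	out := &rawData{
		NumRows:       numExamples,
		NumContinuous: r.NumContinuous,
		Continuous:    make([]float32, 0, numExamples*r.NumContinuous),
		Labels:        make([]float32, 0, numExamples),
	}
	for i := 0; i < numExamples; i++ {
		row := rng.Intn(r.NumRows)
		start := row * r.NumContinuous
		out.Continuous = append(out.Continuous, r.Continuous[start:start+r.NumContinuous]...)
		out.Labels = append(out.Labels, r.Labels[row])
	}
	return out
}

func main() {
	src := &rawData{
		NumRows: 3, NumContinuous: 2,
		Continuous: []float32{0, 0, 1, 10, 2, 20},
		Labels:     []float32{0, 1, 2},
	}
	sample := sampleWithReplacement(src, 5, 42)
	fmt.Println(sample.NumRows, len(sample.Continuous), len(sample.Labels))
}
```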

type TensorData

type TensorData struct {
	CategoricalTensor, ContinuousTensor, WeightsTensor, LabelsTensor tensor.Tensor
}

TensorData contains a RawData converted to tensors.

Directories

Path Synopsis
Linear generates random synthetic data, based on some linear model + noise.