Documentation ¶
Overview ¶
Package adult provides an `InMemoryDataset` implementation for the UCI Adult Census dataset. See the attached notebook (using GoNB) for details and examples on how to use it.
Mostly one will want to use `LoadAndPreprocessData` to download and preprocess the data, the global `Data` to access it, and `NewDataset` to create datasets for training and evaluating.
It also provides preprocessing functionality:
- Downloading and caching of the dataset.
- List of column names organized by types (`AdultFieldNames` and `AdultFieldTypes`).
- Vocabularies for categorical features.
- Quantiles for continuous features.
- Pretty print some stats.
Index ¶
- Constants
- Variables
- func AssertNoError(err error)
- func BinaryFilePath(dir string, numQuantiles int) string
- func DownloadDataset(dir string, force bool, verbosity int)
- func FileExists(path string) (bool, error)
- func LoadAndPreprocessData(dir string, numQuantiles int, forceDownload bool, verbosity int)
- func LoadBinaryData(dir string, numQuantiles int) (found bool)
- func LoadDataFrame(path string) dataframe.DataFrame
- func NewDataset(manager *Manager, rawData *RawData, name string) *data.InMemoryDataset
- func PopulateQuantiles(df dataframe.DataFrame, numQuantiles int)
- func PopulateVocabularies(df dataframe.DataFrame)
- func PrintBatchSamples(manager *Manager, data *RawData)
- func PrintFeatures(df dataframe.DataFrame)
- func PrintRawData(r *RawData)
- func SaveBinaryData(dir string, numQuantiles int) (err error)
- type QuantileTable
- type RawData
- func (r *RawData) CategoricalIdx(rowNum, colNum int) int
- func (r *RawData) CategoricalRow(rowNum int) []int
- func (r *RawData) ContinuousIdx(rowNum, colNum int) int
- func (r *RawData) ContinuousRow(rowNum int) []float32
- func (r *RawData) CreateTensors(manager *graph.Manager) *TensorData
- func (r *RawData) SampleWithReplacement(numExamples int) *RawData
- type TensorData
Constants ¶
const (
	AdultDatasetDataURL     = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
	AdultDatasetDataFile    = "adult.data"
	AdultDatasetTestURL     = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"
	AdultDatasetTestFile    = "adult.test"
	AdultDatasetDataCecksum = "5b00264637dbfec36bdeaab5676b0b309ff9eb788d63554ca0a249491c86603d"
	AdultDatasetTestCecksum = "a2a9044bc167a35b2361efbabec64e89d69ce82d9790d2980119aac5fd7e9c05"
)
Various URLs and file names for Adult-UCI dataset.
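The two checksum constants are 64 hexadecimal characters long, which is consistent with SHA-256 digests. Assuming that is the hash in use (the documentation does not say), a downloaded file's contents could be verified along these lines. `matchesChecksum` is a hypothetical helper, not part of this package.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// matchesChecksum reports whether content hashes to wantHex under SHA-256.
// Hypothetical helper: the package does not document which hash it uses.
func matchesChecksum(content []byte, wantHex string) bool {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:]) == wantHex
}

func main() {
	// Well-known SHA-256 digest of the three bytes "abc".
	const sha256OfABC = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	fmt.Println(matchesChecksum([]byte("abc"), sha256OfABC)) // true
	fmt.Println(matchesChecksum([]byte("abd"), sha256OfABC)) // false
}
```

In practice the content would be the bytes of `adult.data` or `adult.test`, compared against the corresponding constant.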
const (
	WeightCol         = "fnlwgt" // "Final Weight", see adult.names file.
	LabelCol          = "label"  // That is the target prediction column.
	EducationTypeCol  = "education"
	EducationYearsCol = "education-num"
)
Column names:
const (
	LabelTrue  = ">50K"
	LabelFalse = "<=50K"
)
Label values:
const Unknown = "?"
Unknown representation for string columns -- used in Adult dataset.
Variables ¶
var (
	// AdultFieldNames in the dataset.
	AdultFieldNames = []string{
		"age",
		"workclass",
		WeightCol,
		EducationTypeCol,
		EducationYearsCol,
		"marital-status",
		"occupation",
		"relationship",
		"race",
		"sex",
		"capital-gain",
		"capital-loss",
		"hours-per-week",
		"native-country",
		LabelCol,
	}

	// AdultFieldTypes maps the field (column) name to its format.
	AdultFieldTypes = map[string]series.Type{
		"age":             series.Float,
		"workclass":       series.String,
		WeightCol:         series.Float,
		EducationTypeCol:  series.String,
		EducationYearsCol: series.Float,
		"marital-status":  series.String,
		"occupation":      series.String,
		"relationship":    series.String,
		"race":            series.String,
		"sex":             series.String,
		"capital-gain":    series.Float,
		"capital-loss":    series.Float,
		"hours-per-week":  series.Float,
		"native-country":  series.String,
		LabelCol:          series.String,
	}
)
var Data struct {
	// VocabulariesFeatures is a list of feature names for the vocabularies stored in Vocabularies.
	VocabulariesFeatures []string

	// Vocabularies is a list of maps of value to integer. The
	// special string "Unknown" is mapped to 0.
	Vocabularies []map[string]int

	// FeatureNameToVocabIdx maps a feature name to its vocabulary index.
	FeatureNameToVocabIdx map[string]int

	// QuantilesFeatures is the ordered list of numeric features for which we have quantiles.
	QuantilesFeatures []string

	// Quantiles for features listed in QuantilesFeatures
	Quantiles []QuantileTable

	// Train dataset.
	Train *RawData

	// Test dataset.
	Test *RawData
}
Data holds all the data (train and test), and the required information collected statically (i.e., non machine learned) from the training dataset (we don't look at test to generate these).
It is filled out by LoadAndPreprocessData.
Functions ¶
func AssertNoError ¶
func AssertNoError(err error)
AssertNoError checks that err is nil, otherwise it `panic`s with `err`.
func BinaryFilePath ¶
BinaryFilePath returns the name used to store the preprocessed data in binary (fast) format.
`numQuantiles` is the only preprocessing parameter that affects the result, so it is used as part of the filename to make sure data generated for a different `numQuantiles` is never reused.
Consider using LoadAndPreprocessData instead.
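The idea of keying the cache file on `numQuantiles` can be sketched as follows. `cachePath` and its file name pattern are hypothetical, chosen only for illustration; the actual name returned by BinaryFilePath is not documented here.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// cachePath embeds numQuantiles in the file name so that caches built with
// different settings never collide. The name pattern is an assumption, not
// the one used by BinaryFilePath.
func cachePath(dir string, numQuantiles int) string {
	return filepath.Join(dir, fmt.Sprintf("adult_data-%d_quantiles.bin", numQuantiles))
}

func main() {
	fmt.Println(cachePath("data", 100))
	fmt.Println(cachePath("data", 20))
}
```

Because the parameter is part of the path, changing `numQuantiles` results in a cache miss rather than silently loading stale quantiles.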
func DownloadDataset ¶
DownloadDataset downloads the Adult dataset files into `dir`. It calls `log.Fatal` if it fails. A `verbosity` value >= 1 makes it print what it's doing.
If the files are already downloaded it does nothing, unless `force` is set to true.
func FileExists ¶
FileExists returns whether the given file path exists.
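A function with this signature is commonly written on top of `os.Stat`, distinguishing "file does not exist" from other errors. The sketch below uses a hypothetical `pathExists` and is not necessarily how FileExists is implemented.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// pathExists is a hypothetical stand-in with the same signature as FileExists.
// A missing path is reported as (false, nil); any other failure from os.Stat
// (for example a permission error) is returned to the caller.
func pathExists(path string) (bool, error) {
	_, err := os.Stat(path)
	if err == nil {
		return true, nil
	}
	if errors.Is(err, fs.ErrNotExist) {
		return false, nil
	}
	return false, err
}

func main() {
	ok, err := pathExists(".")
	fmt.Println(ok, err) // the current directory exists: true <nil>
}
```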
func LoadAndPreprocessData ¶
LoadAndPreprocessData downloads and preprocesses the Adult dataset in a single function call.
Parameters:
- dir: where to store downloaded files. By default, if they are already downloaded they will be reused.
- numQuantiles: number of quantiles to generate for the continuous features. They can be used for piecewise-linear calibration.
- forceDownload: will download data from the internet even if already downloaded.
- verbosity: set to a value >= 1 to print out what it's doing.
The results are stored in the global variable `Data`.
It panics in case of error.
func LoadBinaryData ¶
LoadBinaryData loads the global Data structure from the binary format, for faster access. It returns true if the data was available and loaded.
Consider using LoadAndPreprocessData instead.
func LoadDataFrame ¶
LoadDataFrame loads the file at `path` and returns it as a DataFrame. `path` is the name of the downloaded file.
Consider using LoadAndPreprocessData instead.
func NewDataset ¶
func NewDataset(manager *Manager, rawData *RawData, name string) *data.InMemoryDataset
NewDataset creates a new `data.InMemoryDataset` (can be used for training and evaluation) for the UCI Adult dataset.
func PopulateQuantiles ¶
PopulateQuantiles populates the quantile tables with up to numQuantiles for each float column.
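One way to read "up to numQuantiles" is that repeated values are collapsed, so a column with few distinct values yields a shorter table. The sketch below illustrates that reading with a hypothetical `quantilesOf` function: it picks values at evenly spaced ranks of the sorted column and drops consecutive duplicates. The exact method used by PopulateQuantiles may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// quantilesOf returns up to numQuantiles values taken at evenly spaced ranks
// of the sorted input, skipping consecutive duplicates. Hypothetical sketch,
// not the package's implementation.
func quantilesOf(values []float32, numQuantiles int) []float32 {
	if len(values) == 0 || numQuantiles <= 0 {
		return nil
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]float32(nil), values...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	var table []float32
	for q := 0; q < numQuantiles; q++ {
		idx := 0
		if numQuantiles > 1 {
			// Ranks go from the first to the last element, inclusive.
			idx = q * (len(sorted) - 1) / (numQuantiles - 1)
		}
		v := sorted[idx]
		if len(table) == 0 || v != table[len(table)-1] {
			table = append(table, v)
		}
	}
	return table
}

func main() {
	fmt.Println(quantilesOf([]float32{5, 1, 3, 2, 4}, 3)) // [1 3 5]
	fmt.Println(quantilesOf([]float32{7, 7, 7, 7}, 4))    // [7]
}
```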
func PopulateVocabularies ¶
PopulateVocabularies goes over all string columns in the DataFrame and maps their values to integers starting from 0. The results are stored in Vocabularies.
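A vocabulary for one string column can be sketched as a map from value to integer, with the unknown marker `"?"` reserved as index 0, as described for `Data.Vocabularies`. The helper `buildVocabulary` is hypothetical, and assigning the remaining indices in order of first appearance is an assumption; the package may order values differently.

```go
package main

import "fmt"

// unknown mirrors the package's Unknown constant ("?").
const unknown = "?"

// buildVocabulary maps each distinct value to an integer. The unknown marker
// always gets 0; other values are numbered in order of first appearance
// (an assumption made for this sketch).
func buildVocabulary(values []string) map[string]int {
	vocab := map[string]int{unknown: 0}
	for _, v := range values {
		if _, found := vocab[v]; !found {
			vocab[v] = len(vocab)
		}
	}
	return vocab
}

func main() {
	vocab := buildVocabulary([]string{"Private", "?", "State-gov", "Private"})
	fmt.Println(vocab["?"], vocab["Private"], vocab["State-gov"], len(vocab)) // 0 1 2 3
}
```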
func PrintBatchSamples ¶
func PrintBatchSamples(manager *Manager, data *RawData)
PrintBatchSamples generates a couple of batches of size 3 and prints them to the output. It is meant only for debugging.
func PrintFeatures ¶ added in v0.1.1
PrintFeatures prints information on the vocabularies and quantiles of the features.
func PrintRawData ¶
func PrintRawData(r *RawData)
PrintRawData prints the positivity ratio and some samples.
func SaveBinaryData ¶
SaveBinaryData saves the global Data structure in binary format, for faster access.
Consider using LoadAndPreprocessData instead.
Types ¶
type QuantileTable ¶
type QuantileTable []float32
QuantileTable holds the quantiles of a set of values.
type RawData ¶
type RawData struct {
NumRows, NumCategorical, NumContinuous int
// Categorical is shaped `[NumRows, NumCategorical]` ordered as in VocabulariesFeatures.
Categorical []int
// Continuous is shaped `[NumRows, NumContinuous]`, ordered as in QuantilesFeatures.
Continuous []float32 // Must match ModelDType.
// Weights is shaped [NumRows]
Weights []float32 // Must match ModelDType.
// Labels is shaped [NumRows]: 1.0 (>50K) or 0.0 (<=50K)
Labels []float32 // Must match ModelDType.
}
RawData holds the data stripped of all metadata, with categorical values converted to ints. It includes the whole dataset.
func ConvertDataFrameToRawData ¶
ConvertDataFrameToRawData converts df to the raw data. It returns:
func (*RawData) CategoricalIdx ¶
func (*RawData) CategoricalRow ¶
func (*RawData) ContinuousIdx ¶
func (*RawData) ContinuousRow ¶
func (*RawData) CreateTensors ¶
func (r *RawData) CreateTensors(manager *graph.Manager) *TensorData
CreateTensors creates tensors from the dataset, for faster ML interaction.
func (*RawData) SampleWithReplacement ¶
SampleWithReplacement samples `numExamples` rows with replacement, in local memory, and returns them as a new RawData.
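Sampling with replacement means each draw is independent, so the same row may be picked more than once and the sample can be larger than the source. A minimal sketch of the index selection, using a hypothetical `sampleIndices` function and a caller-provided seed (both assumptions made for illustration):

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampleIndices draws numExamples row indices uniformly from [0, numRows),
// with replacement. Hypothetical helper; the package's own sampling and
// random source may differ.
func sampleIndices(seed int64, numRows, numExamples int) []int {
	if numRows <= 0 || numExamples <= 0 {
		return nil
	}
	rng := rand.New(rand.NewSource(seed))
	indices := make([]int, numExamples)
	for i := range indices {
		indices[i] = rng.Intn(numRows)
	}
	return indices
}

func main() {
	// 10 draws from only 3 rows: repeats are guaranteed.
	fmt.Println(sampleIndices(42, 3, 10))
}
```

The sampled rows would then be copied, index by index, into the slices of a new `RawData`.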
type TensorData ¶
type TensorData struct {
CategoricalTensor, ContinuousTensor, WeightsTensor, LabelsTensor tensor.Tensor
}
TensorData contains a RawData converted to tensors.