Data loading

General dataset API has three main kind of interfaces:

The dataset loaders are used to load toy datasets bundled with sklearn.

The dataset fetchers are used to download and load datasets from the internet.

The dataset generators are used to generate controlled synthetic datasets.

Dataset API

Loaders

Fetchers

Load small standard datasets

Fetch and load larger datasets

Both loaders and fetchers return a Bunch object, which is a dictionary with two keys of our interest:

Generator

Controlled synthetic datasets

Key	Values
data	Array of shape (n, m)
target	Array of shape (n,)

Returns tuple \((\mathbf X, \mathbf y)\)

of numpy arrays:

\(\mathbf{X}\) has shape \((n,m)\)
\(\mathbf y\) has shape \((n, )\)

load_*

fetch_*

make_*

return_X_y = True

Dataset Loaders

Note: These datasets are bundled with sklearn and we do not require to download them from external sources.

Dataset Loader	# samples (n)	# features (m)	# labels	Type
`load_iris`	150	3	1	Classification
`load_diabetes`	442	10	1	Regression
`load_digits`	1797	64	1	Classification
`load_linnerud`	20	3	3	Regression (multi output)
`load_wine`	178	13	1	Classification
`load_breast_cancer`	569	30	1	Classification

Dataset Fetchers

Dataset Loader	# samples (n)	# features (m)	# labels	Type
`fetch_olivetti_faces`	400	4096	1 (40)	multi-class image classification
`fetch_20newsgroups`	18846	1	1 (20)	(multi-class) text classification
`fetch_lfw_people`	13233	5828	1 (5749)	(multi-class) image classification
`fetch_covtype`	581012	54	1 (7)	(multi-class) classification
`fetch_rcv1`	804414	47236	1 (103)	(multi-class) classification
`fetch_kddcup99`	4898431	41	1	(multi-class) classification
`fetch_california_housing`	20640	8	1	regression

Dataset generators

Regression

make_regression() produces regression targets as a sparse random linear combination of random features with noise. The informative features are either uncorrelated or low rank.

Single label

Classification

Multilabel

make_blobs() and make_classification() first creates a bunch of normally-distributed clusters of points and then assign one or more clusters to each class thereby creating multi-class datasets.

make_multilabel_classification() generates random samples with multiple labels with a specific generative process and rejection sampling.

Dataset generators

Clustering

make_blobs()generates a bunch of normally-distributed clusters of points with specific mean and standard deviations for each cluster.

Loading external datasets

fetch_openml()fetches datasets from openml.org, which is a public repository for machine learning data and experiments.

pandas.io provides tools to read from common formats like CSV, excel, json, SQL.

scipy.io specializes in binary formats used in scientific computing like .mat and .arff.

numpy/routines.io specializes in loading columnar data into numpy arrays.

dataset.load_files loads directories of text files where directory name is a label and each file is a sample.

Loading external datasets

datasets.load_svmlight_files() loads data in svmlight and libSVM sparse format.

skimage.io provides tools to load images and videos in numpy arrays.

scipy.io.wavfile.read specializes reading WAV file into a numpy array.

For managing numerical data, sklearn recommends using an optimized file format such as HDF5 (Hierarchical Data Format version 5) to reduce data load times.

Pandas, Py Tables and H5Py provides an interface to read and write data in that format.

Data transformation

sklearn provides a library of transformers for

Data cleaning (sklearn.preprocessing) such as
Feature extraction (sklearn.feature_extraction)
Feature reduction
Feature expansion (sklearn.kernel_approximation)

Types of transformers

fit() method learns model parameters from a training set.

Each transformer has the following methods:

transform() method applies the learnt transformation to the new data.

fit_transform() performs function of both fit() and transform() methods and is more convenient and efficient to use.

Transformer methods

Transformers are combined with one another or with other estimators such as classifiers or regressors to build composite estimators.

Tool	Usage
Pipeline	Chaining multiple estimators to execute a fixed sequence of steps in data preprocessing and modelling.
FeatureUnion	Combines output from several transformer objects by creating a new transformer from them.
ColumnTransformer	Enables different transformations on different columns of data based on their types.

mlp_data_loading

By ashishtendulkar

Data loading

Dataset API

Dataset Loaders

Dataset Fetchers

Dataset generators

Dataset generators

Loading external datasets

Loading external datasets

Data transformation

Types of transformers

Transformer methods

mlp_data_loading

More from ashishtendulkar