Data loading

The general dataset API has three main kinds of interfaces:

  • The dataset loaders are used to load toy datasets bundled with sklearn.
  • The dataset fetchers are used to download and load datasets from the internet.
  • The dataset generators are used to generate controlled synthetic datasets.

Dataset API

  • Loaders (load_*): load small standard datasets.
  • Fetchers (fetch_*): fetch and load larger datasets.
  • Generators (make_*): generate controlled synthetic datasets.

Both loaders and fetchers return a Bunch object, which is a dictionary with two keys of interest:

Key      Value
data     Array of shape (n, m)
target   Array of shape (n,)

Generators return a tuple \((\mathbf X, \mathbf y)\) of numpy arrays:

  • \(\mathbf{X}\) has shape \((n, m)\)
  • \(\mathbf y\) has shape \((n,)\)

Loaders and fetchers return the same \((\mathbf X, \mathbf y)\) tuple when called with return_X_y = True.
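The two return styles can be compared side by side. The following is a minimal sketch, assuming scikit-learn is installed; the choice of load_iris and make_regression, and the sample sizes passed to the generator, are illustrative only.

```python
from sklearn.datasets import load_iris, make_regression

# Loader: returns a Bunch, a dictionary whose keys are also attributes.
iris = load_iris()
print(iris.data.shape)    # feature matrix, shape (n, m)
print(iris.target.shape)  # label vector, shape (n,)

# Same loader with return_X_y=True: returns the (X, y) tuple directly.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)

# Generator: returns a tuple of numpy arrays without any extra flag.
X_syn, y_syn = make_regression(n_samples=50, n_features=3, random_state=0)
print(X_syn.shape, y_syn.shape)
```

The Bunch form is convenient when the extra metadata (feature names, description) is needed; the tuple form is shorter when only the arrays matter.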

Dataset Loaders

Note: These datasets are bundled with sklearn, so we do not need to download them from external sources.

Dataset Loader      # samples (n)  # features (m)  # labels  Type
load_iris           150            4               1         Classification
load_diabetes       442            10              1         Regression
load_digits         1797           64              1         Classification
load_linnerud       20             3               3         Regression (multi-output)
load_wine           178            13              1         Classification
load_breast_cancer  569            30              1         Classification
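The sample and feature counts of the bundled datasets can be checked directly. This is a small sketch, assuming scikit-learn is installed; the shapes dictionary is only a helper for this illustration.

```python
from sklearn.datasets import (
    load_breast_cancer,
    load_diabetes,
    load_digits,
    load_iris,
    load_linnerud,
    load_wine,
)

loaders = [load_iris, load_diabetes, load_digits,
           load_linnerud, load_wine, load_breast_cancer]

# Map each loader name to the shapes of its feature matrix and targets.
shapes = {}
for loader in loaders:
    X, y = loader(return_X_y=True)
    shapes[loader.__name__] = (X.shape, y.shape)
    print(f"{loader.__name__:20s} X: {X.shape}  y: {y.shape}")
```

Note that load_linnerud returns a two-dimensional target because it has three output variables.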

Dataset Fetchers

Dataset Fetcher           # samples (n)  # features (m)  # labels  Type
fetch_olivetti_faces      400            4096            1 (40)    multi-class image classification
fetch_20newsgroups        18846          1               1 (20)    multi-class text classification
fetch_lfw_people          13233          5828            1 (5749)  multi-class image classification
fetch_covtype             581012         54              1 (7)     multi-class classification
fetch_rcv1                804414         47236           103       multilabel classification
fetch_kddcup99            4898431        41              1         multi-class classification
fetch_california_housing  20640          8               1         regression

Dataset generators

Regression

make_regression() produces regression targets as an optionally sparse random linear combination of random features, with noise. The informative features are either uncorrelated or low rank.
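A short sketch of make_regression() follows, assuming scikit-learn and numpy are installed. The parameter values (100 samples, 10 features, 3 informative) are arbitrary choices for illustration; coef=True additionally returns the coefficients of the underlying linear model, which makes the sparsity visible.

```python
import numpy as np
from sklearn.datasets import make_regression

# Only n_informative of the n_features columns contribute to the target.
X, y, coef = make_regression(
    n_samples=100,
    n_features=10,
    n_informative=3,
    noise=0.1,
    coef=True,
    random_state=0,
)

print(X.shape, y.shape)
print("non-zero coefficients:", np.count_nonzero(coef))
```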

Classification

Single label

make_blobs() and make_classification() first create a bunch of normally distributed clusters of points and then assign one or more clusters to each class, thereby creating multi-class datasets.

Multilabel

make_multilabel_classification() generates random samples with multiple labels using a specific generative process and rejection sampling.
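The difference between the single-label and multilabel generators shows up in the shape of the returned targets. This is a minimal sketch, assuming scikit-learn and numpy are installed; all parameter values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification, make_multilabel_classification

# Single label: each sample gets exactly one of n_classes labels.
X_c, y_c = make_classification(
    n_samples=200,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)
print(X_c.shape, y_c.shape, np.unique(y_c))

# Multilabel: each sample gets a binary indicator row, one column per class.
X_m, Y_m = make_multilabel_classification(
    n_samples=50,
    n_features=20,
    n_classes=4,
    n_labels=2,
    random_state=0,
)
print(X_m.shape, Y_m.shape)
```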


Clustering

make_blobs() generates a bunch of normally distributed clusters of points with a specific mean and standard deviation for each cluster.
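As a sketch of how the means and standard deviations are specified, the example below passes explicit cluster centers and one standard deviation per cluster. It assumes scikit-learn and numpy are installed; the centers and spreads are made-up values.

```python
import numpy as np
from sklearn.datasets import make_blobs

centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])  # cluster means
cluster_std = [0.5, 1.0, 1.5]                               # one spread per cluster

X, y = make_blobs(
    n_samples=300,
    centers=centers,
    cluster_std=cluster_std,
    random_state=0,
)

# The empirical mean of each cluster should be close to its requested center.
for k, center in enumerate(centers):
    print(k, center, X[y == k].mean(axis=0).round(2))
```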

Loading external datasets

fetch_openml() fetches datasets from openml.org, a public repository for machine learning data and experiments.

pandas.io provides tools to read from common formats such as CSV, Excel, JSON, and SQL.
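A minimal sketch of reading a CSV with pandas and turning it into the \((\mathbf X, \mathbf y)\) form used by sklearn is given below. It assumes pandas is installed; the column names and values are invented for illustration, and an in-memory string stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path or URL.
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
7.0,3.2,versicolor
6.3,3.3,virginica
"""

df = pd.read_csv(io.StringIO(csv_text))

# Split the frame into a feature matrix and a label vector.
X = df[["sepal_length", "sepal_width"]].to_numpy()
y = df["species"].to_numpy()
print(X.shape, y)
```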

scipy.io specializes in binary formats used in scientific computing like .mat and .arff. 

NumPy's input/output routines (numpy/routines.io in the NumPy reference), such as numpy.loadtxt, load columnar data into numpy arrays.
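For purely numeric columnar text, numpy alone is often enough. The sketch below uses numpy.loadtxt on an in-memory string; the data values and the convention that the last column holds the label are assumptions made for this example.

```python
import io

import numpy as np

# Hypothetical comma-separated data: two features followed by a label.
text = "1.0,2.0,0\n3.0,4.0,1\n5.0,6.0,0\n"

data = np.loadtxt(io.StringIO(text), delimiter=",")

# Split into features (all but last column) and integer labels (last column).
X, y = data[:, :-1], data[:, -1].astype(int)
print(X.shape, y)
```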

datasets.load_files() loads directories of text files where each directory name is a label and each file is a sample.
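The expected directory layout can be illustrated with a temporary folder. This is a sketch, assuming scikit-learn is installed; the folder names "neg" and "pos" and the file contents are invented, and shuffle=False is passed only to keep the output order predictable.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical corpus: one sub-directory per label, one file per sample.
samples = {
    "neg": ["bad movie", "terrible plot"],
    "pos": ["great film", "loved it"],
}

with tempfile.TemporaryDirectory() as root:
    for label, texts in samples.items():
        folder = Path(root) / label
        folder.mkdir()
        for i, text in enumerate(texts):
            (folder / f"doc_{i}.txt").write_text(text, encoding="utf-8")

    # File contents are read into memory, so the Bunch outlives the folder.
    corpus = load_files(root, encoding="utf-8", shuffle=False)

print(corpus.target_names)
print(corpus.target)
print(corpus.data)
```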


datasets.load_svmlight_files() loads data in svmlight and libSVM sparse format.
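A round trip through the svmlight format shows what the loader returns. The sketch below assumes scikit-learn and numpy are installed and uses the single-file variants dump_svmlight_file and load_svmlight_file with an in-memory buffer instead of a file on disk; the small matrix is made up.

```python
import io

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = np.array([[0.0, 1.5, 0.0],
              [2.0, 0.0, 3.0]])
y = np.array([0, 1])

# Write the data in svmlight format to a binary buffer, then read it back.
buffer = io.BytesIO()
dump_svmlight_file(X, y, buffer)
buffer.seek(0)

# n_features is given explicitly so trailing all-zero columns are not lost.
X_loaded, y_loaded = load_svmlight_file(buffer, n_features=3)

print(type(X_loaded))      # the features come back as a scipy sparse matrix
print(X_loaded.toarray())
print(y_loaded)
```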

skimage.io provides tools to load images and videos into numpy arrays.

scipy.io.wavfile.read reads a WAV file into a numpy array.
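To see what scipy.io.wavfile.read returns without needing an audio file at hand, the sketch below first writes a synthetic tone and then reads it back. It assumes scipy and numpy are installed; the sample rate, the 440 Hz tone, and the file name tone.wav are arbitrary.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

rate = 8000                      # samples per second
t = np.arange(rate) / rate       # one second of time stamps
tone = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tone.wav")
    wavfile.write(path, rate, tone)

    # read() returns the sample rate and the samples as a numpy array.
    loaded_rate, samples = wavfile.read(path)

print(loaded_rate, samples.dtype, samples.shape)
```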

For managing numerical data, sklearn recommends using an optimized file format such as HDF5 (Hierarchical Data Format version 5) to reduce data load times.

pandas, PyTables, and h5py provide interfaces to read and write data in that format.

Data transformation

Types of transformers

sklearn provides a library of transformers for:

  • Data cleaning (sklearn.preprocessing)
  • Feature extraction (sklearn.feature_extraction)
  • Feature reduction
  • Feature expansion (sklearn.kernel_approximation)

Transformer methods

Each transformer has the following methods:

  • fit() learns model parameters from a training set.
  • transform() applies the learned transformation to new data.
  • fit_transform() performs both fit() and transform() and is more convenient and efficient to use.
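The three methods can be seen in action on a small array. This is a minimal sketch, assuming scikit-learn and numpy are installed; StandardScaler is used only as one example of a transformer, and the data values are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])
X_new = np.array([[2.0, 20.0]])

# fit(): learn the per-column mean and standard deviation from training data.
scaler = StandardScaler()
scaler.fit(X_train)
print(scaler.mean_)

# transform(): apply the learned scaling to unseen data.
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)

# fit_transform(): learn and apply in a single call on the training data.
X_train_scaled = StandardScaler().fit_transform(X_train)
print(X_train_scaled)
```

Calling fit() only on the training set and then transform() on new data keeps statistics of the new data from leaking into the learned parameters.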

Transformers are combined with one another or with other estimators such as classifiers or regressors to build composite estimators.

Tool               Usage
Pipeline           Chains multiple estimators to execute a fixed sequence of steps in data preprocessing and modelling.
FeatureUnion       Combines the outputs of several transformer objects by creating a new transformer from them.
ColumnTransformer  Applies different transformations to different columns of data based on their types.
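The three tools can be sketched together on a tiny table. The example assumes scikit-learn and pandas are installed; the column names, values, labels, and the choice of StandardScaler, OneHotEncoder, PCA, and LogisticRegression are all illustrative and not prescribed by the material above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with two numeric columns and one categorical column.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [30.0, 45.0, 80.0, 90.0, 60.0, 35.0],
    "city": ["A", "B", "A", "C", "B", "C"],
})
y = [0, 0, 1, 1, 1, 0]

# ColumnTransformer: scale numeric columns, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Pipeline: run preprocessing and then the classifier as a single estimator.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])
model.fit(df, y)
predictions = model.predict(df)
print(predictions)

# FeatureUnion: concatenate the outputs of several transformers column-wise.
union = FeatureUnion([
    ("pca", PCA(n_components=1)),
    ("scale", StandardScaler()),
])
combined = union.fit_transform(df[["age", "income"]])
print(combined.shape)  # one PCA column plus two scaled columns
```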

mlp_data_loading

By ashishtendulkar
