Introduction to Scikit-Learn (sklearn)

sklearn APIs are organized along the lines of our ML framework:

  • Training data and preprocessing
  • Model, which subsumes the loss function and the optimization procedure
  • Model selection and evaluation
  • Model inspection

[Figure: components of the ML framework (training data, model, loss function, optimization, evaluation) mapped onto scikit-learn]

API design principles

sklearn APIs are designed around the following principles:

  • Consistency: All APIs share a simple and consistent interface.
  • Inspection: The learnable parameters as well as the hyperparameters of all estimators are accessible directly via public instance variables.
  • Nonproliferation of classes: Datasets are represented as NumPy arrays or SciPy sparse matrices instead of custom-designed classes.
  • Composition: Existing building blocks are reused as much as possible.
  • Sensible defaults: Default values are provided for parameters, which enables quick baseline building.
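The principles above can be sketched on a toy problem. This is a minimal illustration, not a recommended workflow: the four-row dataset and the variable names (X, y, clf, pipe) are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Nonproliferation of classes: the dataset is a plain NumPy array (toy data).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]])
y = np.array([0, 0, 1, 1])

# Sensible defaults: no arguments are needed to build a first baseline.
clf = LogisticRegression()

# Inspection: hyperparameters are public attributes, also listed by get_params().
default_C = clf.C
params = clf.get_params()

# Consistency: every estimator is trained through fit().
clf.fit(X, y)

# Inspection: learned parameters are public attributes ending in an underscore.
coef_shape = clf.coef_.shape  # one row of weights, one column per feature

# Composition: existing building blocks are chained into a new estimator
# that exposes the same fit()/predict() interface.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)
predictions = pipe.predict(X)
```

The pipeline is itself an estimator, so it can be passed anywhere a single model is expected.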


Types of sklearn objects

Estimators

  • Estimate model parameters based on training data and hyperparameters.
  • fit() method.

Predictors

  • Make predictions on a dataset.
  • predict() method takes a dataset as input and returns predictions.
  • score() method measures the quality of predictions.

Transformers

  • Transform a dataset.
  • fit() learns parameters.
  • transform() applies the transformation to a dataset.
  • fit_transform() fits parameters and then transforms the dataset.
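A short sketch of the three object types working together, assuming a made-up one-feature dataset where y = 2x + 1. StandardScaler plays the transformer role and LinearRegression the estimator and predictor roles; the variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Transformer: fit() learns the mean and scale, transform() applies them.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # same as scaler.fit(X).transform(X)

# Estimator: fit() estimates the model parameters from the training data.
model = LinearRegression()
model.fit(X_scaled, y)

# Predictor: predict() returns predictions, score() measures their quality
# (R^2 for regressors).
y_pred = model.predict(X_scaled)
r2 = model.score(X_scaled, y)
```

At inference time, new data should go through scaler.transform() (not fit_transform()) before model.predict(), so that the parameters learned on the training data are reused.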

[Figure labels: Data Preprocessing, Training, Inference]


sklearn API

Data API

Provides functionality for loading, generating and preprocessing the training and test data.

Module                      Functionality
sklearn.datasets            Loading datasets: custom as well as popular reference datasets.
sklearn.preprocessing       Scaling, centering, normalization and binarization methods.
sklearn.impute              Filling in missing values.
sklearn.feature_selection   Feature selection algorithms.
sklearn.feature_extraction  Feature extraction from raw data.
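The modules listed above can be combined roughly as follows. This sketch assumes the bundled iris dataset and knocks out one value on purpose to have something to impute; the choice of imputation strategy and of k=2 selected features is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# sklearn.datasets: load a popular reference dataset (150 samples, 4 features).
X, y = load_iris(return_X_y=True)

# Introduce one missing value artificially, for illustration only.
X_missing = X.copy()
X_missing[0, 0] = np.nan

# sklearn.impute: fill the missing value with the column mean.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)

# sklearn.preprocessing: center each feature and scale it to unit variance.
X_scaled = StandardScaler().fit_transform(X_imputed)

# sklearn.feature_selection: keep the 2 features with the highest ANOVA F-score.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X_scaled, y)
```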


Model API

Implements supervised and unsupervised models.

Regression

  • sklearn.linear_model (linear, ridge, lasso models)
  • sklearn.tree

Classification

  • sklearn.linear_model
  • sklearn.svm
  • sklearn.tree
  • sklearn.neighbors
  • sklearn.naive_bayes
  • sklearn.multiclass

sklearn.multioutput implements multi-output classification and regression.

sklearn.cluster implements many popular clustering algorithms.
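As a rough illustration of the consistent interface across model modules, the sketch below fits two classifiers and one clustering model on the bundled iris dataset. The specific models and hyperparameter values are arbitrary choices for the example, and accuracy is measured on the training data only to keep it short.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Supervised models from different modules share fit() and score().
classifiers = {
    "tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
train_accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X, y)
    train_accuracy[name] = clf.score(X, y)  # accuracy on the training data

# Unsupervised model: no labels are passed to the clustering algorithm.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)
```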

Model evaluation API

sklearn.metrics implements different metrics for model evaluation.
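A few commonly used metrics, sketched on hand-written labels (the values are made up for this example):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Classification: compare true labels against predicted labels.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)  # fraction of correct predictions
cm = confusion_matrix(y_true, y_pred)      # rows: true class, columns: predicted class

# Regression: mean of the squared differences between targets and predictions.
mse = mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```

Here four of the five predictions are correct, so accuracy is 0.8, and the squared errors are 1, 0 and 4, so mse is 5/3.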

Model selection API

sklearn.model_selection implements various model selection strategies such as cross-validation, hyperparameter tuning and learning curves.
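A minimal sketch of cross-validation and hyperparameter tuning, assuming the iris dataset and a k-nearest-neighbors classifier. The candidate grid for n_neighbors is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set that is never used during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Cross-validation: one accuracy score per fold.
cv_scores = cross_val_score(KNeighborsClassifier(), X_train, y_train, cv=5)

# Hyperparameter tuning: try each candidate value with 5-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7]},
    cv=5,
)
search.fit(X_train, y_train)
best_k = search.best_params_["n_neighbors"]

# The fitted search object behaves like the best estimator it found.
test_accuracy = search.score(X_test, y_test)
```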

Model inspection API

sklearn.inspection includes tools for model inspection.
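One inspection tool is permutation importance, which is imported from sklearn.inspection. The sketch below assumes the iris dataset and a logistic regression model; both are arbitrary choices, and the importances are computed on the training data only for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Shuffle one feature at a time and record how much the score drops.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

# Mean drop in accuracy per feature; a larger drop suggests the model
# relies more on that feature.
importances = result.importances_mean
```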

Practical advice

  • It is not possible to remember each and every sklearn API.
  • Remember the high-level modules and the API design principles.
  • Use the documentation for more information, as follows (in IPython or Jupyter):

from sklearn.linear_model import LogisticRegression
?LogisticRegression
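The ? shortcut only works in IPython and Jupyter. In a plain Python interpreter, roughly the same information is available through the docstring and get_params(), as this small sketch shows:

```python
from sklearn.linear_model import LogisticRegression

# The docstring holds the same text that ?LogisticRegression displays;
# help(LogisticRegression) prints it interactively.
doc = LogisticRegression.__doc__
first_line = doc.strip().splitlines()[0]
print(first_line)

# The hyperparameters and their default values can be listed directly.
defaults = LogisticRegression().get_params()
print(sorted(defaults))
```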

Introduction to sklearn

By ashishtendulkar
