How to approach a Machine Learning problem using Python?

YouTube Like Count Prediction

ayush0016

Ayush Singh
General Workflow/Pipeline

DATA
  • Data Collection
  • Data Sources

DATA PREPARATION
  • Data Preprocessing
  • Data Exploration & Analysis
  • Feature Engineering
  • Feature Selection

MODEL
  • Model selection
  • Evaluation Metrics
  • Cross Validation
  • Parameter Tuning
  • Plotting Curves

Problem?

Predicting the like count of any given YouTube video.

DATA

Data Collection

  • Public/Private Dataset
  • Data Scraping
  • API
  • Internal databases

YouTube Data API

  • Easy to use
  • Good quota limit
  • Quite extensive data
  • ~22-24k videos in each of 15 categories
  • Videos published between 2010 and 2016
  • ~0.35 million videos in total
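
As an illustration of the collection step (a sketch, not code from the talk), per-video statistics can be pulled with the YouTube Data API v3 via google-api-python-client; the API key and video IDs below are placeholders:

from googleapiclient.discovery import build

# Build a client for the YouTube Data API v3 (API_KEY is a placeholder).
youtube = build("youtube", "v3", developerKey="API_KEY")

# 'statistics' carries viewCount/likeCount/commentCount;
# 'snippet' carries title, publishedAt and categoryId.
response = youtube.videos().list(
    part="statistics,snippet",
    id="VIDEO_ID_1,VIDEO_ID_2"   # comma-separated video IDs (placeholders)
).execute()

for item in response["items"]:
    print(item["snippet"]["title"], item["statistics"])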

Dataset

What can we see?

What do we have?

DATA PREPARATION

Data Preprocessing

  • Formatting: Bring the data into a format that is easy to work with, e.g. JSON to CSV.
  • Cleaning: Remove or fix missing and malformed data.
  • Transformation: Normalisation, standardisation, scaling, encoding of categorical data.
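
A minimal sketch of the three steps, assuming hypothetical records shaped like YouTube API responses:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw API records; the nesting mirrors YouTube API responses.
raw_records = [
    {"statistics": {"viewCount": 1200, "likeCount": 30, "commentCount": 4}},
    {"statistics": {"viewCount": 980, "likeCount": 25, "commentCount": None}},
]

# Formatting: nested JSON -> flat table, easy to save as CSV.
df = pd.json_normalize(raw_records)

# Cleaning: fill the missing comment count.
df["statistics.commentCount"] = df["statistics.commentCount"].fillna(0)

# Transformation: standardise numeric features (zero mean, unit variance).
num_cols = ["statistics.viewCount", "statistics.commentCount"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
print(df)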

Data Exploration and Analysis

Univariate analysis is the simplest form of analyzing data: “uni” means “one”, so the data is examined one variable at a time.
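
For example, a quick univariate look at one column (df is assumed to be the preprocessed DataFrame from the sketch above):

import matplotlib.pyplot as plt

# Summary statistics of a single variable.
print(df["statistics.viewCount"].describe())

# Distribution of that one variable.
df["statistics.viewCount"].hist(bins=20)
plt.xlabel("viewCount")
plt.ylabel("frequency")
plt.show()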

Feature Engineering

The process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.
It's an art that comes with a lot of practice and with studying what others have done.
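
As a sketch, the ratio features selected later in this deck (ViewCount/VideoMonthOld, SubscriberCount/VideoCount) can be derived from raw columns; the column names and dates below are assumptions for illustration:

import pandas as pd

# Hypothetical raw columns; names mirror the features selected later on.
df = pd.DataFrame({
    "ViewCount": [120000, 98000],
    "publishedAt": pd.to_datetime(["2014-05-01", "2012-11-20"]),
    "SubscriberCount": [50000, 12000],
    "VideoCount": [200, 85],
})

# Engineered feature: how many months old the video is.
df["VideoMonthOld"] = (pd.Timestamp("2016-12-31") - df["publishedAt"]).dt.days / 30

# Ratio features that capture engagement better than raw counts.
df["ViewCount/VideoMonthOld"] = df["ViewCount"] / df["VideoMonthOld"]
df["SubscriberCount/VideoCount"] = df["SubscriberCount"] / df["VideoCount"]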

Feature Selection

The process of selecting, from the given pool of features, those that are most relevant and best describe the relationship between the predictors and the response.

Top reasons to use feature selection:
  • It enables the machine learning algorithm to train faster.
  • It reduces the complexity of a model and makes it easier to interpret.
  • It improves the accuracy of a model if the right subset is chosen.

Filter Method

Statistical tests are used to infer the relationship between the response and each predictor. These are bivariate tests (one feature against the response at a time), and the right test depends on the types of the two variables:

Feature \ Response    Continuous          Categorical
Continuous            Pearson's Corr.     LDA
Categorical           ANOVA               Chi-square

Wrapper Methods

The learning algorithm (estimator) itself is used to perform the feature selection and pick the top features:
  • Forward Selection
  • Backward Elimination
  • Recursive Feature Elimination
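For contrast, a minimal filter-method sketch with scikit-learn's univariate F-test on toy data (an illustration, not code from the talk); the next slide shows a wrapper method in action:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Toy data standing in for the video features (X) and like counts (y).
X, y = make_regression(n_samples=100, n_features=10, random_state=0)

# Filter method: score each feature against the continuous response
# with an F-test, then keep only the 5 highest-scoring features.
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)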

Recursive Feature Elimination

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Use Random Forest as the model.
estimator = RandomForestRegressor()

# Recursively eliminate features until only the top 5 remain.
rfe = RFE(estimator, n_features_to_select=5)
rfe.fit(X, Y)

# Selected features get rank 1; the rest are ranked by elimination order.
print(rfe.ranking_)

The final ones...

ViewCount, CommentCount, DislikeCount,
ViewCount/VideoMonthOld, SubscriberCount/VideoCount

MODEL

Terminology

Training set: a set of examples used by the model for learning.
Validation set: a set of examples used to tune the parameters of a model.
Testing set: a set of examples used only to assess the performance of a fully-trained estimator.
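
A minimal sketch of carving out a testing set with scikit-learn (toy data):

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=5, random_state=0)

# Hold out 20% of the examples as the testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)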

Model Selection

We generally choose among the following algorithms when selecting a machine learning model:

Classification
  • Logistic Regression
  • Naive Bayes
  • Support Vector Machines
  • k-Nearest Neighbors
  • Random Forest
  • GBM
  • Neural Networks

Regression
  • Linear Regression
  • SVR
  • Random Forest
  • GBM
  • Neural Networks

and many more...

Evaluation Metrics

Quantifying the quality of predictions.

Regression
  • Mean absolute error
  • Mean squared error
  • R^2 score

Classification
  • Accuracy
  • Precision, Recall
  • Log-Loss
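
All three regression metrics are available in scikit-learn; a tiny sketch with hypothetical like counts:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted like counts.
y_true = [120, 300, 45, 980]
y_pred = [100, 310, 60, 900]

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("R^2:", r2_score(y_true, y_pred))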

Cross Validation

K-Fold Cross Validation
A model evaluation technique to see how well the model performs on unseen data: the data is split into K folds, and the model is repeatedly trained on K-1 folds and validated on the remaining one.
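
A minimal K-fold sketch with scikit-learn on toy data:

from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# 5 folds: train on 4, validate on the held-out fold, rotate, average.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(), X, y, cv=cv, scoring="r2")
print(scores.mean(), "+/-", scores.std())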

Hyper Parameter Tuning

Grid Search

Hyperparameters are parameters whose values are set before the learning process begins; they vary from model to model.
Grid search is an exhaustive search over specified values of those hyperparameters for an estimator.
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('clf', RandomForestRegressor())
])

# Candidate values for each hyperparameter of the 'clf' step.
parameters = {
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [15, 25, 30],
    'clf__min_samples_split': [15, 100],
    'clf__min_samples_leaf': [2, 10]
}

# Try every combination, scoring each by cross-validated R^2.
grid_search = GridSearchCV(pipeline, parameters, n_jobs=15, verbose=1, scoring="r2")
grid_search.fit(X, Y)
print(grid_search.best_params_, grid_search.best_score_)

Plotting Curves

Learning curves help visualise the progress of an estimator and how well it is learning:
  • Underfitting: validation and training error both high
  • Overfitting: validation error high, training error low
  • Good fit: validation error low, only slightly higher than the training error
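
A sketch of plotting such a learning curve with scikit-learn on toy data (not code from the talk):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=300, n_features=5, random_state=0)

# Score the model on growing fractions of the training data.
sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(), X, y, cv=5, scoring="r2",
    train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("training examples")
plt.ylabel("R^2")
plt.legend()
plt.show()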

Final Results

Evaluation metric: R^2 score
Cross-validation score: 0.95
Training score: 0.96
Testing score: 0.90

Questions?

https://github.com/ayush1997/YouTube-Like-predictor

ayush0016

ayush1997

ayushsingh97

https://slides.com/ayush1997/ml-3/

ayushkumar97

Made with Slides.com