YouTube Like Count Prediction
Ayush Singh
Data Preprocessing
Data Exploration & Analysis
Feature Engineering
Feature Selection
MODEL
Model selection
Evaluation Metrics
Cross Validation
Parameter Tuning
Plotting Curves
DATA
DATA PREPARATION
Data Collection
Data Sources
Predicting the like count of any given YouTube video.
Public/Private Dataset
Data Scraping
API
Internal databases
What can we see?
What do we have?
Formatting: bring the data into a format that is easy to work with, e.g. JSON to CSV.
Cleaning: remove or fix missing and corrupt data.
Transformation: normalisation, standardisation, scaling, encoding.
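A minimal pandas sketch of these three steps; the file and column names (videos.json, likeCount, category, ...) are assumptions, not the actual dataset schema:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Formatting: flatten the raw JSON dump into a tabular CSV.
df = pd.read_json("videos.json")
df.to_csv("videos.csv", index=False)

# Cleaning: drop rows with a missing target, fill other gaps.
df = df.dropna(subset=["likeCount"])
df["commentCount"] = df["commentCount"].fillna(0)

# Transformation: scale numeric columns, encode categorical ones.
numeric_cols = ["viewCount", "commentCount", "dislikeCount"]
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
df = pd.get_dummies(df, columns=["category"])
```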
Univariate analysis is the simplest form of analyzing data. “Uni” means “one”, so in other words your data has only one variable.
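For example, a univariate look at a single (hypothetical) likeCount column:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("videos.csv")  # hypothetical file from the step above

# Summary statistics of one variable.
print(df["likeCount"].describe())

# Its distribution; like counts are heavily skewed, so a log y-axis helps.
df["likeCount"].plot(kind="hist", bins=50, logy=True)
plt.xlabel("likeCount")
plt.show()
```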
Process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.
It's an art that comes with a lot of practice and with studying what others have done.
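For example, two of the engineered features that survive selection later in this deck are simple ratios; a sketch, assuming these raw column names:

```python
import pandas as pd

df = pd.read_csv("videos.csv")  # hypothetical cleaned dataset from above

# Ratio features matching the final feature list shown later.
df["ViewCount/VideoMonthOld"] = df["viewCount"] / df["videoMonthOld"]
df["SubscriberCount/VideoCount"] = df["subscriberCount"] / df["videoCount"]
```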
Top reasons to use feature selection are:
It enables the machine learning algorithm to train faster.
It reduces the complexity of a model and makes it easier to interpret.
It improves the accuracy of a model if the right subset is chosen.
Process of selecting those attributes or features from our given pool that are most relevant and best describe the relationship between the predictors and the response.
Filter methods: statistical tests to infer the relationship between the response and the predictors.
Wrapper methods: the learning algorithm (estimator) itself is used to perform the feature selection and pick the top features.
Forward Selection
Backward Elimination
Recursive Feature Elimination
| Feature \ Response | Continuous | Categorical |
|---|---|---|
| Continuous | Pearson's Corr. | LDA |
| Categorical | ANOVA | Chi-square |
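A minimal filter-method sketch for the continuous-feature/continuous-response cell of this table, using scikit-learn's univariate F-test (which is built on Pearson correlation); X and Y are assumed to be the feature matrix and like-count target used in the RFE example below:

```python
from sklearn.feature_selection import SelectKBest, f_regression

# Score each continuous feature against the continuous response and
# keep the 5 best; selector.scores_ holds the per-feature test scores.
selector = SelectKBest(score_func=f_regression, k=5)
X_new = selector.fit_transform(X, Y)
print(selector.scores_)
```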
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Use a random forest as the underlying model.
estimator = RandomForestRegressor()

# Recursively eliminate features until the 5 best remain,
# ranking every feature along the way.
rfe = RFE(estimator, n_features_to_select=5)
rfe.fit(X, Y)
print(rfe.ranking_)
The final selected features:
ViewCount, CommentCount, DislikeCount,
ViewCount/VideoMonthOld, SubscriberCount/VideoCount
Training Set: a set of examples used by the model for learning.
Validation Set: a set of examples used to tune the parameters of a model.
Testing Set: a set of examples used only to assess the performance of a fully-trained estimator.
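A minimal scikit-learn sketch of such a split; the 80/20 hold-out and the 75/25 validation carve-out are assumptions, not the deck's actual proportions:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the final test set...
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# ...then carve a validation set out of the remaining training data.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
```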
We generally use the following algorithms in the process of selecting a machine learning model:
Classification:
Random Forest
GBM
Logistic Regression
Naive Bayes
Support Vector Machines
k-Nearest Neighbors
Neural Networks

Regression:
Random Forest
GBM
Linear Regression
SVR
Neural Networks
and many more...
Quantifying the quality of predictions
Regression
Mean absolute error
Mean squared error
R^2 Score
Classification
Accuracy
Precision, Recall
Log-Loss
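A sketch of computing the regression metrics with scikit-learn, assuming y_test and y_pred come from a fitted regressor:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```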
K Fold Cross Validation
Model evaluation technique to see how well the model performs on unseen data: the data is split into K folds, and each fold is held out once for evaluation while the model trains on the rest.
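A minimal 5-fold cross-validation sketch; the random forest and the X, Y data follow the earlier examples:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Each of the 5 folds is held out once; the rest is used for training.
scores = cross_val_score(RandomForestRegressor(), X, Y, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```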
Grid Search
Exhaustive search over specified parameter values for an estimator.
Hyperparameters are parameters whose values are set prior to the commencement of the learning process.
They vary with different models.
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Wrap the estimator in a pipeline so its hyperparameters
# can be addressed as 'clf__<parameter>'.
pipeline = Pipeline([
    ('clf', RandomForestRegressor())
])

# Candidate values to search exhaustively.
parameters = {
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [15, 25, 30],
    'clf__min_samples_split': [15, 100],
    'clf__min_samples_leaf': [2, 10]
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=15, verbose=1, scoring="r2")
grid_search.fit(X, Y)
print(grid_search.best_params_)
Plotting learning curves helps visualise an estimator's progress and how well it is learning.
Underfitting – Validation and training error high
Overfitting – Validation error is high, training error low
Good fit – Validation error low, slightly higher than the training error
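A sketch of plotting such a learning curve with scikit-learn, with the estimator and data assumed as in the earlier examples:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Score the estimator on growing fractions of the training data.
sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(), X, Y, cv=5, scoring="r2",
    train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("training examples")
plt.ylabel("R^2 score")
plt.legend()
plt.show()
```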
Evaluation metric: R^2 score
Cross-validation score: 0.95
Training score: 0.96
Testing score: 0.90
https://github.com/ayush1997/YouTube-Like-predictor
https://slides.com/ayush1997/ml-3/