Abdullah Fathi
List of unsupervised learning algorithms
Before beginning our journey with machine learning, we need to prepare our data
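A minimal data-preparation sketch in R (the caTools package, the column names and the DV name Purchased are assumptions for illustration, not part of the original notes):
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)   # hypothetical DV column
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
training_set[, 1:2] = scale(training_set[, 1:2])            # feature scaling
test_set[, 1:2] = scale(test_set[, 1:2])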
Regression models (both linear and non-linear) are used for predicting a real value, like salary for example. If our independent variable is time, then we are forecasting future values; otherwise our model is predicting present but unknown values. Regression techniques vary from Linear Regression to SVR and Random Forest Regression
y = Dependent Variable (DV)
X = Independent Variable (IV)
b1 = Coefficient
b0 = Constant
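Putting these together, the simple linear regression equation is y = b0 + b1*X; multiple linear regression extends this to y = b0 + b1*X1 + b2*X2 + ... + bn*Xn.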
lm(formula, data)
# Where
# formula = A symbolic description of the model to be fitted
# data = An optional data frame
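A minimal usage sketch (the column names Salary and YearsExperience and the data frames training_set/test_set are hypothetical):
regressor = lm(formula = Salary ~ YearsExperience, data = training_set)
summary(regressor)                                # coefficients, p-values, R-squared
y_pred = predict(regressor, newdata = test_set)   # predict on unseen data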
5 methods of building models:
Stepwise Regression: Backward Elimination, Forward Selection, Bidirectional Elimination
STEP 1: Select a significance level to stay in the model
(ex: Significance Level (SL) = 0.05)
STEP 2: Fit the full model with all possible predictors (independent variables)
STEP 3: Consider the predictor with the highest P-value. If P > Significance Level (SL), go to STEP 4, otherwise go to FIN
STEP 4: Remove the predictor
STEP 5: Fit model without this variable*
FIN: Our Model is Ready
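A minimal backward-elimination sketch in R (the dataset and column names here are hypothetical):
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
               data = dataset)
summary(regressor)   # look at the p-values of all predictors
# Remove the predictor with the highest p-value if it exceeds SL = 0.05, then refit:
regressor = lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = dataset)
summary(regressor)   # repeat until every remaining p-value is below SL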
STEP 1: Select a significance level to enter the model
(ex: SL = 0.05)
STEP 2: Fit all simple regression models y ~ Xn. Select the one with the lowest P-value
STEP 3: Keep this variable and fit all possible models with one extra predictor added to the one(s) we already have
STEP 4: Consider the predictor with the lowest P-value.
If P < SL, go to STEP 3, otherwise go to FIN
FIN: Keep the previous model
STEP 1: Select a significance level to enter and to stay in the model (ex: SLENTER = 0.05, SLSTAY = 0.05)
STEP 2: Perform the next step of Forward Selection (new variables must have: P < SLENTER to enter)
STEP 3: Perform ALL steps of Backward Elimination (old variables must have P < SLSTAY to stay)
STEP 4: No new variables can enter and no old variables can exit
FIN: Our Model is Ready
STEP 1: Select a criterion of goodness of fit
(ex: Akaike criterion)
STEP 2: Construct all possible regression models:
2^n - 1 total combinations
STEP 3: Select the one with the best criterion
FIN: Our Model is Ready
Ex: 10 columns means 2^10 - 1 = 1,023 models
Adding a term like b2X1^2 gives a parabolic effect that fits curved data better (the full polynomial equation is y = b0 + b1X1 + b2X1^2 + ... + bnX1^n)
Polynomial Linear Regression is a special case of Multiple Linear Regression
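A minimal polynomial regression sketch (the columns Level and Salary are hypothetical); we simply add powered terms as extra columns and fit a multiple linear regression on them:
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
poly_reg = lm(formula = Salary ~ ., data = dataset)
summary(poly_reg)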
The next step is to choose the kernel
Regularization
SVR has a different regression goal compared to linear regression. In linear regression we are trying to minimize the error between the prediction and data. In SVR our goal is to make sure that errors do not exceed the threshold
The decision boundary is our margin of tolerance: we are going to take only those points that fall within this boundary.
Or, in simple terms, we take only the points with the lowest error, which gives us a better-fitting model.
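A minimal SVR sketch using the e1071 package (assumed here; the columns Level and Salary are hypothetical):
library(e1071)
regressor = svm(formula = Salary ~ ., data = dataset,
                type = 'eps-regression', kernel = 'radial')
y_pred = predict(regressor, newdata = data.frame(Level = 6.5))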
CART
Classification And Regression Trees
Classification Trees
Regression Trees
X1 and X2 are independent variables
Y is our dependent variable, which we cannot see in the plot because it sits in another dimension (the z-axis)
Calculate Mean/Average for each leaf
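A minimal decision tree regression sketch using the rpart package (assumed; the columns Level and Salary are hypothetical):
library(rpart)
regressor = rpart(formula = Salary ~ ., data = dataset,
                  control = rpart.control(minsplit = 1))           # allow small leaves
y_pred = predict(regressor, newdata = data.frame(Level = 6.5))     # mean of the matching leaf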
STEP 1: Pick at random K data points from the Training set.
STEP 2: Build the tree associated to these K data points.
STEP 3: Choose the number Ntree of trees we want to build and repeat STEPS 1 & 2
STEP 4: For a new data point, make each one of our Ntree trees predict the value of Y for the data point in question, and assign the new data point the average across all of the predicted Y values.
We are not just predicting based on 1 tree; we are predicting based on a forest of trees. This improves the accuracy of the prediction because we take the average of many predictions
Ensemble learning is also more stable: a single change in the data could really impact one tree, but it is much harder for it to impact a whole forest of trees
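A minimal random forest regression sketch using the randomForest package (assumed; the data and column names are hypothetical):
library(randomForest)
set.seed(1234)
regressor = randomForest(x = dataset['Level'], y = dataset$Salary, ntree = 500)
y_pred = predict(regressor, newdata = data.frame(Level = 6.5))   # average over 500 trees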
Unlike regression where we predict a continuous number, we use classification to predict a category. There is a wide variety of classification applications from medicine to marketing. Classification models include linear models like Logistic Regression, SVM, and nonlinear ones like K-NN, Kernel SVM and Random Forests
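The visualisation code below assumes a classifier that has already been fitted; a minimal sketch of how such a logistic regression classifier could be built (the DV name Purchased is hypothetical, the features Age and EstimatedSalary come from the code below):
classifier = glm(formula = Purchased ~ Age + EstimatedSalary,
                 family = binomial, data = training_set)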
library(ElemStatLearn)
set = training_set
# Build a dense grid covering the range of the two features
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
# Predict the class of every grid point to colour the decision regions
prob_set = predict(classifier, type = 'response', newdata = grid_set)
y_grid = ifelse(prob_set > 0.5, 1, 0)
plot(set[, -3],
     main = 'Logistic Regression (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
# Draw the decision boundary, colour the regions, then overlay the real data points
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
STEP 1: Choose the number K of neighbours
STEP 2: Take the K nearest neighbors of the new data point, according to the Euclidean distance
STEP 3: Among these K neighbors, count the number of data points in each category
STEP 4: Assign the new data point to the category where you counted the most neighbours
FIN: Our Model is Ready
We can use other distance metrics as well, such as Manhattan distance, but Euclidean distance is the most commonly used in geometry
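A minimal K-NN sketch using the class package (assumed; columns 1-2 are taken to be the features and column 3 the class label):
library(class)
y_pred = knn(train = training_set[, -3], test = test_set[, -3],
             cl = training_set[, 3], k = 5)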
Mach1: 30 wrenches/hr
Mach2: 20 wrenches/hr
Out of all produced parts:
We can SEE that 1% are defective
Out of all defective parts:
We can SEE that 50% came from Mach1 and 50% came from Mach2
Question:
What is the probability that a part produced by Mach2 is defective?
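Applying Bayes' theorem with the numbers above (Mach2 makes 20 of the 50 wrenches produced per hour, so P(Mach2) = 0.4):
P(Defective | Mach2) = P(Mach2 | Defective) * P(Defective) / P(Mach2)
                     = (0.5 * 0.01) / 0.4
                     = 0.0125 = 1.25%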
Assign the class based on the posterior probabilities. Ex: 0.75 vs 0.25; since 0.75 > 0.25, the new data point is assigned to the class with the 0.75 probability
CART
Classification And Regression Trees
Classification Trees
Regression Trees
STEP 1: Pick at random K data points from the Training set.
STEP 2: Build the tree associated to these K data points.
STEP 3: Choose the number Ntree of trees we want to build and repeat STEPS 1 & 2
STEP 4: For a new data point, make each one of our Ntree trees predict the category to which the data point belongs, and assign the new data point to the category that wins the majority vote.
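A minimal random forest classification sketch using the randomForest package (assumed; columns 1-2 are taken to be the features and column 3 the class label):
library(randomForest)
classifier = randomForest(x = training_set[, -3],
                          y = as.factor(training_set[, 3]), ntree = 500)
y_pred = predict(classifier, newdata = test_set[, -3])   # majority vote across trees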
If our problem is linear, we should go for Logistic Regression or SVM.
If our problem is non-linear, we should go for K-NN, Naive Bayes, Decision Tree or Random Forest.
Clustering is similar to classification, but the basis is different. In Clustering we don’t know what we are looking for, and we are trying to identify some segments or clusters in our data. When we use clustering algorithms on our dataset, unexpected things can suddenly pop up like structures, clusters and groupings we would have never thought of otherwise.
We can apply K-Means for different purposes:
STEP 1: Choose the number of K clusters
STEP 2: Select at random K points, the centroids (not necessarily from our dataset)
STEP 3: Assign each data point to the closest centroid -> that forms K clusters
STEP 4: Compute and place the new centroid of each cluster
STEP 5: Reassign each data point to the new closest centroid. If any reassignment took place, go to Step 4. Otherwise go to FIN.
FIN: Our Model is Ready
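A minimal K-Means sketch using base R's kmeans (X is a hypothetical feature matrix, e.g. two customer-spending columns):
set.seed(29)
kmeans_model = kmeans(X, centers = 5, iter.max = 300, nstart = 10)
kmeans_model$cluster   # the cluster assigned to each data point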
Agglomerative & Divisive
STEP 1: Make each data point a single-point cluster -> That forms N clusters
STEP 2: Take the two closest data points and make them one cluster -> That forms N-1 clusters
STEP 3: Take the two closest clusters and make them one cluster -> That forms N-2 clusters
STEP 4: Repeat STEP 3 until there is only one cluster
FIN: Our Model is Ready
How does it work?
How do we use it?
Another Example:
get optimal number of clusters
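A minimal agglomerative clustering sketch using base R (X is a hypothetical feature matrix); the dendrogram is what we read to pick the optimal number of clusters:
dendrogram = hclust(dist(X, method = 'euclidean'), method = 'ward.D')
plot(dendrogram)                          # inspect to choose the number of clusters
hc_clusters = cutree(dendrogram, k = 5)   # then cut the tree into that many clusters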
People who bought also bought ...
STEP 1: Set a minimum support and confidence
STEP 2: Take all the subsets in transactions having higher support than minimum support
STEP 3: Take all the rules of these subsets having higher confidence than minimum confidence
STEP 4: Sort the rules by decreasing lift
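A minimal Apriori sketch using the arules package (assumed; the transactions file name is hypothetical):
library(arules)
transactions = read.transactions('Market_Basket_Optimisation.csv',
                                 sep = ',', rm.duplicates = TRUE)
rules = apriori(data = transactions,
                parameter = list(support = 0.004, confidence = 0.2))
inspect(sort(rules, by = 'lift')[1:10])   # top 10 rules by decreasing lift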
People who bought also bought ...
STEP 1: Set a minimum support
STEP 2: Take all the subsets in transactions having higher support than minimum support
STEP 3: Sort these subsets by decreasing support
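A minimal Eclat sketch with the same arules package and transactions object as above; Eclat uses support only:
itemsets = eclat(data = transactions,
                 parameter = list(support = 0.004, minlen = 2))
inspect(sort(itemsets, by = 'support')[1:10])   # top 10 itemsets by decreasing support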
Used to solve interacting problems where the data observed up to time t is considered to decide which action to take at time t + 1. It is also used in Artificial Intelligence when training machines to perform tasks such as walking. Desired outcomes provide the AI with a reward, undesired ones with a punishment. Machines learn through trial and error.
Example used for training purposes only. No affiliation with Coca-Cola
Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages. NLP is used to apply Machine Learning models to text and language.
Example use:
Teaching machines to understand what is said in the spoken and written word is the focus of NLP. Whenever we dictate something into our iPhone/Android device that is then converted to text, that's an NLP algorithm in action
The history of natural language processing generally started in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence
Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing
Source: Wikipedia
NLTK - Part of Speech (PoS)
NN = Noun
JJ = Adjective
DT = Determiner
CD = Cardinal/Number
Semantic Information
Words & Relationship
A very popular NLP model: it is used to preprocess the texts we want to classify before fitting the classification algorithms on the observations containing those texts.
It involves 2 things:
The human brain has approximately 100 billion neurons, and each neuron is connected to as many as about a thousand of its neighbours
Sum all the input values it gets
Apply an activation function
Pass the signal on to the next neuron
Something that can learn and adjust itself
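A toy sketch of a single artificial neuron (illustrative only; the sigmoid is just one possible activation function):
sigmoid = function(z) 1 / (1 + exp(-z))
neuron = function(x, w, b) sigmoid(sum(x * w) + b)               # weighted sum, then activation
neuron(x = c(0.5, 0.2, 0.9), w = c(0.4, -0.6, 0.1), b = 0.05)    # output signal passed on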
Perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt
Cost Function: the error we have in our prediction. Our goal is to minimize this error; the lower the cost function, the closer y-hat is to y
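For example, with the squared-error cost used in these notes, C = sum of 1/2 * (y-hat - y)^2 over the observations: the further y-hat is from y, the larger the cost.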
Feed the information back into the NN. Our goal is to minimize the cost function, and all we can do is update the weights
FYI, for now we only deal with 1 row
How the weights are adjusted
Remember the previous example?
Where we ran a NN for property evaluation
This is what it looks like after it has already been trained
Faster way to find the best option
Let's start...
In simple terms, that's how we find the best weights. Of course it's not going to be exactly like a ball rolling; it's more of a zig-zag approach. But it's easier and more fun to remember it this way.
Example: Gradient Descent (2D)
Example: Gradient Descent (3D)
Gradient Descent requires the cost function to be convex. What if our cost function is not convex?
That could happen if we choose a cost function that is not the squared difference between y-hat and y, or if we do choose such a cost function but, in a multi-dimensional space, it turns into something that is not convex
So, basically, we adjust the weights after each single row rather than doing everything together
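A toy gradient-descent loop for a single weight with the squared-error cost (illustrative only; it updates after every single row, as described above):
x = c(1, 2, 3, 4); y = c(2, 4, 6, 8)   # toy data where the true weight is 2
w = 0.1; lr = 0.01                     # small initial weight and learning rate
for (epoch in 1:50) {
  for (i in seq_along(x)) {            # update the weight after each row
    y_hat = w * x[i]
    grad = (y_hat - y[i]) * x[i]       # derivative of 1/2 * (y_hat - y)^2 w.r.t. w
    w = w - lr * grad
  }
}
w   # converges towards 2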
STEP 1: Randomly initialize the weights to small numbers close to 0 (but not 0)
STEP 2: Input the observation of our dataset in the input layer, each feature in one input node
STEP 3: Forward-Propagation: from left to right, the neurons are activated in a way that the impact of each neuron's activation is limited by the weights. Propagate the activations until getting the predicted result y
STEP 4: Compare the predicted result to the actual result. Measure the generated error
STEP 5: Back-Propagation: from right to left, the error is back-propagated. Update the weights according to how much they are responsible for the error. The learning rate decides by how much we update the weights.
STEP 6: repeat Steps 1 to 5 and update the weights after each observation (reinforcement Learning). Or: Repeat Steps 1 to 5 but update the weights only after a batch of observations (Batch Learning).
STEP 7: When the whole training set has passed through the ANN, that makes an epoch. Redo more epochs.
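One way to sketch these steps in R is with the neuralnet package (assumed here, not necessarily what is used elsewhere in these notes; the column names are hypothetical and the features are assumed to be scaled already):
library(neuralnet)
classifier = neuralnet(Exited ~ CreditScore + Age + Balance,
                       data = training_set, hidden = c(6, 6),
                       linear.output = FALSE)   # sigmoid output for classification
prob_pred = compute(classifier,
                    test_set[, c('CreditScore', 'Age', 'Balance')])$net.result
y_pred = ifelse(prob_pred > 0.5, 1, 0)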
Learn about the relationship between X and Y values
Find list of principal axes
From the m independent variables of our dataset, PCA extracts p <= m new independent variables that explain the most variance of the dataset, regardless of the dependent variable
The fact that the DV is not considered makes PCA an unsupervised model
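A minimal PCA sketch using base R's prcomp (X is a hypothetical numeric feature matrix):
pca = prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)               # proportion of variance explained by each component
X_reduced = pca$x[, 1:2]   # keep the p = 2 components that explain the most variance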
LDA differs because, in addition to finding the component axes, with LDA we are interested in the axes that maximize the separation between multiple classes
The goal of LDA is to project the feature space (a dataset of n-dimensional samples) onto a small subspace k (where k <= n-1) while maintaining the class-discriminatory information.
Both PCA and LDA are linear transformation techniques used for dimensionality reduction. PCA is described as unsupervised but LDA is supervised because of its relation to the dependent variable
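A minimal LDA sketch using the MASS package (assumed; the data frame and the class column y are hypothetical):
library(MASS)
lda_model = lda(y ~ ., data = training_set)   # at most k - 1 discriminants for k classes
training_lda = as.data.frame(predict(lda_model, training_set)$x)   # projected features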
Split the training set into k folds (most of the time k = 10 is used).
We train our model on 9 folds and test it on the last remaining fold, repeating so that every fold serves as the test fold once
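A minimal k-fold cross-validation sketch using the caret package (assumed; a hypothetical training_set with a 0/1 column Purchased):
library(caret)
folds = createFolds(training_set$Purchased, k = 10)
cv_accuracy = sapply(folds, function(test_idx) {
  training_fold = training_set[-test_idx, ]   # train on the other 9 folds
  test_fold = training_set[test_idx, ]        # test on the remaining fold
  classifier = glm(Purchased ~ ., family = binomial, data = training_fold)
  y_pred = ifelse(predict(classifier, newdata = test_fold, type = 'response') > 0.5, 1, 0)
  mean(y_pred == test_fold$Purchased)         # accuracy on this fold
})
mean(cv_accuracy)   # average accuracy over the 10 folds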
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
(Wikipedia definition)
Boosting is an ensemble technique in which the predictors (the weak models) are not built independently, but sequentially.
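A minimal gradient boosting sketch using the xgboost package (assumed; the numeric feature columns, the column index 11 and the 0/1 label Exited are hypothetical):
library(xgboost)
classifier = xgboost(data = as.matrix(training_set[, -11]),
                     label = training_set$Exited,
                     nrounds = 10, objective = 'binary:logistic')
y_pred = predict(classifier, newdata = as.matrix(test_set[, -11])) > 0.5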
There are no secrets to success. It is the result of preparation, hard work, and learning from failure. - Colin Powell