Machine Learning Techniques
Example: identifying the different types of fruits present in a picture.
Label set: \(Y = \{y_1, y_2, \ldots, y_k\}\) has \(k\) elements/labels.
For each label present in the example, the corresponding component of the label vector is set to 1.
[Figure: iris flowers of the three classes setosa, versicolor, and virginica. Image source: Wikipedia.org]
Use one-hot encoding scheme for label encoding.
Let's assume the flower has the label versicolor. With the label set ordered as (setosa, versicolor, virginica), we encode it as \(\mathbf{y} = [0, 1, 0]^T\).
Note that the component of \(\mathbf{y}\) corresponding to the label versicolor is 1, and every other component is 0.
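As a concrete illustration, here is a minimal one-hot encoding sketch in NumPy; the label ordering and the helper name `one_hot` are our own choices for this example.

```python
import numpy as np

# Assumed label set and ordering (chosen just for illustration)
labels = ["setosa", "versicolor", "virginica"]

def one_hot(label, labels):
    """Return a one-hot vector with a 1 at the position of `label`."""
    y = np.zeros(len(labels))
    y[labels.index(label)] = 1
    return y

print(one_hot("versicolor", labels))  # [0. 1. 0.]
```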
[Figure: sample image containing different fruits. Image source: Wikipedia.org]
The different fruits in the images are: Apple, Banana, and Orange.
The label vector of an image has a 1 in each component whose fruit appears in that image, and 0 elsewhere.
\(D = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{n}\): set of \(n\) pairs of a feature vector \(\mathbf{x}\) and a label vector \(\mathbf{y}\) representing the training examples.
\(\mathbf{X}\) is an \(n \times m\) feature matrix.
\(\mathbf{Y}\) is a label matrix of shape \(n \times k\), where \(k\) is the total number of classes in the label set.
Concretely, the feature vector of the \(i\)-th training example, \(\mathbf{x}^{(i)}\), can be obtained as \(\mathbf{X}[i]\), i.e. the \(i\)-th row of \(\mathbf{X}\).
The same setup is used for multi-class and multi-label classification; the two differ only in the content of the label vector (a one-hot vector versus a vector with a 1 for every label present).
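A minimal sketch of what these matrices could look like for the iris example, assuming \(n = 4\) examples, \(m = 2\) features, and \(k = 3\) classes; all values and the column ordering are made up for illustration.

```python
import numpy as np

# Feature matrix X: n = 4 examples, m = 2 features (values are made up)
X = np.array([[5.1, 3.5],
              [7.0, 3.2],
              [6.3, 3.3],
              [4.9, 3.0]])

# Label matrix Y: one one-hot row per example, k = 3 classes
# (assumed column order: setosa, versicolor, virginica)
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

print(X.shape, Y.shape)  # (4, 2) (4, 3)
print(X[1])              # feature vector of the 2nd example: [7.  3.2]
```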
Discriminant functions learn a direct mapping between the feature vector \(\mathbf{x}\) and the label \(y\).
\(\mathbf{x} \;\rightarrow\; \text{Discriminant Function} \;\rightarrow\; y\)
In a binary classification setup with \(m\) features, the simplest discriminant function is very similar to linear regression:
\[ y = w_0 + \mathbf{w}^T \mathbf{x} \]
where \(y\) is the label, \(w_0\) is the bias, \(\mathbf{w}\) is the weight vector, and \(\mathbf{x}\) is the feature vector.
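As a quick numeric check, a tiny sketch that evaluates this discriminant function; the weights and feature values below are made up for illustration.

```python
import numpy as np

w0 = -1.0                  # bias (arbitrary value)
w = np.array([2.0, -0.5])  # weight vector, one weight per feature
x = np.array([1.0, 1.0])   # feature vector of one example

y = w0 + w @ x             # discriminant value
print(y)                   # 0.5
```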
Here the label \(y\) is a discrete quantity, unlike the real-valued output in the linear regression setup.
Let's look at this discriminant function from a geometric perspective.
Geometrically, the simplest discriminant function \(y = w_0 + \mathbf{w}^T \mathbf{x}\) defines a decision surface that is an \((m-1)\)-dimensional hyperplane in the \(m\)-dimensional feature space, where \(m\) is the number of features.
| # features (m) | Decision boundary |
|---|---|
| 1 | Point |
| 2 | Line |
| 3 | Plane |
| 4 | 3-D hyperplane |
| ... | ... |
| m | (m-1)-D hyperplane |
The \((m-1)\)-D hyperplane (a line in the two-feature case) divides the feature space into two regions, one for each class.
The decision boundary between the two classes is represented by the \((m-1)\)-D hyperplane \(w_0 + \mathbf{w}^T\mathbf{x} = 0\).
On the decision boundary, \(y = 0\).
Consider two points \(\mathbf{x}^{(A)}\) and \(\mathbf{x}^{(B)}\) on the decision surface; then we have
\[ y^{(A)} = w_0 + \mathbf{w}^T\mathbf{x}^{(A)} = 0, \qquad y^{(B)} = w_0 + \mathbf{w}^T\mathbf{x}^{(B)} = 0 \]
Since \(y^{(A)} = y^{(B)} = 0\), the difference \(y^{(A)} - y^{(B)}\) results in the following equation:
\[ \mathbf{w}^T\left(\mathbf{x}^{(A)} - \mathbf{x}^{(B)}\right) = 0 \]
The vector \(\mathbf{w}\) is therefore orthogonal to every vector lying within the decision surface; hence it determines the orientation of the decision surface.
For points on the decision surface, we have
\[ \mathbf{w}^T\mathbf{x} = -w_0 \]
Normalizing both sides by the length \(\|\mathbf{w}\|\) of the weight vector, we get the normal distance from the origin to the decision surface:
\[ \frac{\mathbf{w}^T\mathbf{x}}{\|\mathbf{w}\|} = \frac{-w_0}{\|\mathbf{w}\|} \]
Thus \(w_0\) determines the location of the decision surface.
\(y\) gives a signed measure of the perpendicular distance of the point \(\mathbf{x}\) from the decision surface: the distance is \(r = \dfrac{y}{\|\mathbf{w}\|}\).
Now that we understand discriminant functions geometrically, let's explore how to classify an example \(\mathbf{x}\) with discriminant functions.
The discriminant function assigns label 1 to an example with feature vector \(\mathbf{x}\) if \(w_0 + \mathbf{w}^T\mathbf{x} > 0\), and label 0 otherwise.
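Putting the geometry and the decision rule together, a small sketch (reusing the made-up weights from the earlier snippet; the helper names are ours) that computes the signed distance of a point from the decision surface and the label assigned to it.

```python
import numpy as np

w0 = -1.0
w = np.array([2.0, -0.5])

def signed_distance(x, w, w0):
    """Signed perpendicular distance of x from the surface w0 + w.x = 0."""
    return (w0 + w @ x) / np.linalg.norm(w)

def predict_label(x, w, w0):
    """Label 1 if the point lies on the positive side of the surface, else 0."""
    return 1 if (w0 + w @ x) > 0 else 0

x = np.array([1.0, 1.0])
print(signed_distance(x, w, w0))  # positive value: x lies on the label-1 side
print(predict_label(x, w, w0))    # 1
```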
Assuming the number of classes to be \(k > 2\), we can build discriminant functions in two ways: by combining multiple two-class discriminants (for example, one-versus-rest or one-versus-one), or by using a single \(k\)-class discriminant.
A single \(k\)-class discriminant comprises \(k\) linear functions:
\[ y_j(\mathbf{x}) = w_{j0} + \mathbf{w}_j^T\mathbf{x}, \qquad j = 1, \ldots, k \]
Concretely: assign label \(y_k\) to example \(\mathbf{x}\) if \(y_k(\mathbf{x}) > y_j(\mathbf{x})\) for all \(j \ne k\).
The decision boundary between classes \(y_k\) and \(y_j\) corresponds to the \((m-1)\)-dimensional hyperplane \(y_k(\mathbf{x}) = y_j(\mathbf{x})\), that is,
\[ \left(w_{k0} - w_{j0}\right) + \left(\mathbf{w}_k - \mathbf{w}_j\right)^T\mathbf{x} = 0 \]
This has the same form as the decision boundary in the two-class case, \(w_0 + \mathbf{w}^T\mathbf{x} = 0\).
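A minimal sketch of this \(k\)-class decision rule, assuming the \(k\) weight vectors are stacked as the rows of a matrix `W` and the biases collected in a vector `w0`; all values are made up for illustration.

```python
import numpy as np

# k = 3 classes, m = 2 features; one row of W and one entry of w0 per class
W = np.array([[ 1.0,  0.5],
              [ 0.2, -1.0],
              [-0.5,  0.3]])
w0 = np.array([0.1, 0.0, -0.2])

def predict_class(x, W, w0):
    """Evaluate all k linear functions and pick the class with the largest value."""
    scores = w0 + W @ x          # y_j(x) for j = 1..k
    return int(np.argmax(scores))

x = np.array([1.0, 2.0])
print(predict_class(x, W, w0))   # 0: the class with the largest score for this x
```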
Now that we have a model of linear discriminant functions, we will study two approaches for learning the parameters of the model:
Least square classification adapts linear regression for classification.
We use the squared error as the loss function.
The error at the \(i\)-th training point is calculated as follows (with the bias \(w_0\) absorbed into \(\mathbf{w}\) by appending a constant feature 1 to every \(\mathbf{x}^{(i)}\)):
\[ e^{(i)} = \mathbf{w}^T\mathbf{x}^{(i)} - y^{(i)} \]
The total loss is the sum of the squared errors between the actual and predicted labels at each training point:
\[ J(\mathbf{w}) = \sum_{i=1}^{n} \left( \mathbf{w}^T\mathbf{x}^{(i)} - y^{(i)} \right)^2 = \left(\mathbf{X}\mathbf{w} - \mathbf{y}\right)^T\left(\mathbf{X}\mathbf{w} - \mathbf{y}\right) \]
Note that the loss depends on the value of \(\mathbf{w}\): as these values change we get a new model, which results in different predictions and hence a different error at each training point.
Calculate the derivative of the loss function \(J(\mathbf{w})\) w.r.t. the weight vector \(\mathbf{w}\) (refer to the linear regression material for the detailed derivation):
\[ \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = 2\,\mathbf{X}^T\left(\mathbf{X}\mathbf{w} - \mathbf{y}\right) \]
Set \(\dfrac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\) to 0 and solve for \(\mathbf{w}\):
\[ \mathbf{w} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y} \]
Whenever \(\mathbf{X}^T \mathbf{X}\) is not full rank (and hence not invertible), we use the Moore-Penrose pseudo-inverse \(\mathbf{X}^{+}\) of \(\mathbf{X}\) in place of \(\left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T\), giving \(\mathbf{w} = \mathbf{X}^{+}\mathbf{y}\).
Alternatively, we derive the weight update rule in vectorized form as follows:
\[ \mathbf{w} := \mathbf{w} - \eta\, \mathbf{X}^T\left(\mathbf{X}\mathbf{w} - \mathbf{y}\right) \]
where \(\eta\) is the learning rate (the constant factor from differentiation is absorbed into \(\eta\)). We will implement this update rule in the gradient descent algorithm in Colab.
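For reference, a minimal sketch of such a gradient descent loop; this is our own illustrative version, not the Colab implementation, the learning rate and iteration count are arbitrary, and the bias is assumed to be absorbed into \(\mathbf{X}\) as a constant column.

```python
import numpy as np

def fit_gd(X, y, lr=0.001, num_iters=5000):
    """Least square classification weights via batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad = X.T @ (X @ w - y)  # gradient of the sum-of-squared-errors loss (up to a constant)
        w = w - lr * grad         # vectorized weight update
    return w
```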
Inference:
```python
import numpy as np

def predict(X, w):
    """Assign label 1 where the discriminant value is non-negative, 0 otherwise."""
    z = X @ w
    return np.array([1 if z_val >= 0 else 0 for z_val in z])
```
Optimization through the normal equation:
```python
def fit(X, y):
    """Least squares weights via the pseudo-inverse of the feature matrix."""
    return np.linalg.pinv(X) @ y
```
[Figure: training data and the decision boundary obtained via least square classification (LSC)]
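To tie things together, a usage sketch with the `fit` and `predict` functions above on a tiny made-up dataset; the feature values are invented, the classes are encoded as \(-1\)/\(+1\) for fitting so that the sign of the fitted value matches `predict`'s threshold at zero, and a constant column is appended so the bias is learned as part of \(\mathbf{w}\).

```python
import numpy as np

# Toy training data: 4 examples, 1 feature (values invented for illustration).
# Classes encoded as -1 and +1 for fitting, matching the threshold-at-zero rule in predict.
X_raw = np.array([[1.0], [2.0], [6.0], [7.0]])
y = np.array([-1, -1, 1, 1])

# Append a constant feature so that the bias w0 is learned as part of w
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

w = fit(X, y)          # normal-equation / pseudo-inverse solution
print(predict(X, w))   # [0 0 1 1]: predict reports the two classes as 0/1
```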