Goal: $\min_{x} f(x)$
Let $f$ be a differentiable and convex function from $\mathbb{R}^d \to \mathbb{R}$. Then $x^* \in \mathbb{R}^d$ is a global minimum of $f$ if and only if $\nabla f(x^*) = 0$.
If $f: \mathbb{R}^d \to \mathbb{R}$ and $g: \mathbb{R}^d \to \mathbb{R}$ are both convex functions, then $f(x) + g(x)$ is a convex function.
Let $f: \mathbb{R} \to \mathbb{R}$ be a convex and non-decreasing function and $g: \mathbb{R}^d \to \mathbb{R}$ be a convex function; then their composition $h = f(g(x))$ is also a convex function.
Let $f: \mathbb{R} \to \mathbb{R}$ be a convex function and $g: \mathbb{R}^d \to \mathbb{R}$ be a linear function; then their composition $h = f(g(x))$ is also a convex function.
In general, if $f$ and $g$ are both convex functions, then $h = f \circ g$ may not be a convex function.
Note: $g$ is concave if and only if $f = -g$ is convex.
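For instance, $h(x) = e^{w^\top x}$ is convex by the second composition rule above: $g(x) = w^\top x$ is linear and $f(t) = e^t$ is convex. Convexity of both pieces alone is not enough: $f(t) = -t$ is convex and $g(x) = x^2$ is convex, yet $f(g(x)) = -x^2$ is concave, because $f$ is not non-decreasing.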
Linear Regression:
Training data: $X_1, X_2, \ldots, X_n$ with corresponding outputs $y_1, y_2, \ldots, y_n$, where $X_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, $\forall i$.
Gradient of the sum-of-squares error:
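The following is the standard form of this error and its gradient, written under the assumption that the model is $\hat{y}_i = w^\top X_i$ and that $X \in \mathbb{R}^{n \times d}$ is the design matrix whose rows are the $X_i$:

$$
E(w) = \sum_{i=1}^{n} \big(y_i - w^\top X_i\big)^2 = \|y - Xw\|_2^2,
\qquad
\nabla_w E(w) = -2\sum_{i=1}^{n} \big(y_i - w^\top X_i\big)\, X_i = 2X^\top (Xw - y).
$$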
Analytical or closed-form solution for the coefficients $w^*$ of a linear regression model:
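Setting the gradient above to zero gives the normal equations; assuming $X^\top X$ is invertible, the standard closed-form solution is

$$
X^\top X\, w^* = X^\top y
\quad \Longrightarrow \quad
w^* = (X^\top X)^{-1} X^\top y.
$$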
In linear regression, the gradient descent approach avoids computing $(X^\top X)^{-1}$ by iteratively updating the weights in the direction of the negative gradient.
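A minimal NumPy sketch of this idea; the step size, iteration count, and synthetic data below are illustrative assumptions, not values from the notes:

```python
import numpy as np

def linreg_gradient_descent(X, y, lr=0.1, n_iters=500):
    """Minimize ||y - Xw||^2 by gradient descent instead of solving the normal equations."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ w - y)  # (scaled) gradient of the sum-of-squares error
        w -= lr * grad                        # gradient descent step
    return w

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

w_gd = linreg_gradient_descent(X, y)
w_closed = np.linalg.solve(X.T @ X, X.T @ y)  # closed-form solution for comparison
print(w_gd, w_closed)
```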
Stochastic gradient descent:
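A standard form of the update, sketched here for the sum-of-squares objective above: at each step a single index $i_t$ is sampled uniformly from $\{1, \ldots, n\}$ and only that example's gradient is used,

$$
w_{t+1} = w_t - \eta_t \,\nabla_w \big(y_{i_t} - w_t^\top X_{i_t}\big)^2
        = w_t + 2\eta_t \big(y_{i_t} - w_t^\top X_{i_t}\big) X_{i_t},
$$

where $\eta_t > 0$ is the step size. This makes each iteration cost independent of $n$, at the price of noisier updates.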
Consider the constrained optimization problem as follows:
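A sketch of the single-inequality-constraint form that the discussion below assumes:

$$
\min_{x \in \mathbb{R}^d} \; f(x) \quad \text{subject to} \quad g(x) \le 0.
$$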
Lagrangian function:
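Assuming the constraint is written as $g(x) \le 0$ with dual variable $\lambda \ge 0$, the Lagrangian is

$$
L(x, \lambda) = f(x) + \lambda\, g(x).
$$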
Note: depending on whether $x$ is inside or outside the constraint set, $\sup_{\lambda \ge 0} L(x, \lambda)$ equals the objective value $f(x)$ or $+\infty$.
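To make the comparison in the table below precise, here is a sketch of the standard definitions assumed in what follows (the notation $p^*$ and $d^*$ is introduced here for illustration):

$$
p^* = \min_{x:\, g(x) \le 0} f(x), \qquad
d(\lambda) = \inf_{x} L(x, \lambda), \qquad
d^* = \sup_{\lambda \ge 0} d(\lambda).
$$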
| Weak Duality | Strong Duality |
|---|---|
| $d^* \le p^*$ always holds, for any optimization problem. | $d^* = p^*$ holds if $f$ and $g$ are convex functions (under a mild constraint qualification such as Slater's condition). |
Consider the optimization problem with multiple equality and inequality constraints as follows:
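A sketch of the general form assumed here, with $m$ inequality and $p$ equality constraints:

$$
\min_{x \in \mathbb{R}^d} \; f(x)
\quad \text{subject to} \quad
g_i(x) \le 0, \; i = 1, \ldots, m,
\qquad
h_j(x) = 0, \; j = 1, \ldots, p.
$$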
The Lagrangian function is expressed as follows:
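With multipliers $\lambda_i \ge 0$ for the inequality constraints and $\nu_j \in \mathbb{R}$ for the equality constraints,

$$
L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i\, g_i(x) + \sum_{j=1}^{p} \nu_j\, h_j(x).
$$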
Karush-Kuhn-Tucker Conditions:
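The standard statement of these conditions for a candidate primal-dual point $(x^*, \lambda^*, \nu^*)$:

$$
\begin{aligned}
&\text{Stationarity:} && \nabla f(x^*) + \sum_{i=1}^{m} \lambda_i^* \nabla g_i(x^*) + \sum_{j=1}^{p} \nu_j^* \nabla h_j(x^*) = 0 \\
&\text{Primal feasibility:} && g_i(x^*) \le 0, \quad h_j(x^*) = 0 \quad \forall i, j \\
&\text{Dual feasibility:} && \lambda_i^* \ge 0 \quad \forall i \\
&\text{Complementary slackness:} && \lambda_i^*\, g_i(x^*) = 0 \quad \forall i
\end{aligned}
$$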
Example:
$$
\begin{aligned}
\text{minimize} \quad & f(x) = 2(x_1 + 1)^2 + 2(x_2 - 4)^2 \\
\text{subject to} \quad & x_1^2 + x_2^2 \le 9 \\
& x_1 + x_2 \ge 2
\end{aligned}
$$
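A minimal numerical check of this example using SciPy's SLSQP solver; the solver choice and starting point are assumptions made for illustration, not part of the notes:

```python
import numpy as np
from scipy.optimize import minimize

# Objective: f(x) = 2(x1 + 1)^2 + 2(x2 - 4)^2
def f(x):
    return 2 * (x[0] + 1) ** 2 + 2 * (x[1] - 4) ** 2

# SLSQP expects inequality constraints in the form c(x) >= 0
constraints = [
    {"type": "ineq", "fun": lambda x: 9 - x[0] ** 2 - x[1] ** 2},  # x1^2 + x2^2 <= 9
    {"type": "ineq", "fun": lambda x: x[0] + x[1] - 2},            # x1 + x2 >= 2
]

# The unconstrained minimizer (-1, 4) violates the circle constraint (1 + 16 > 9),
# so that constraint is expected to be active at the optimum.
res = minimize(f, x0=np.array([1.0, 1.0]), method="SLSQP", constraints=constraints)
print(res.x, res.fun)
```

The reported minimizer can then be checked against the KKT conditions above: the multiplier of the active circle constraint should be positive, while the inactive constraint $x_1 + x_2 \ge 2$ should have a zero multiplier by complementary slackness.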