In supervised machine learning, we want to minimize the error over the training examples during the learning process. This is done with an optimization strategy such as gradient descent, and the error itself comes from a loss function, for example:
\(MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)
, where \(y_i\) is the target and
\(\hat{y}_i\) is the prediction
\(BCE = -(y\log(p)+(1-y)\log(1-p))\)
, where \(y\) is the target and
\(p\) is the predicted probability
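Both losses can be sketched in a few lines of NumPy. This is a minimal illustration rather than a library implementation; the `eps` clipping in BCE is an implementation choice to avoid `log(0)`:

```python
import numpy as np

def mse(y, y_hat):
    # Mean squared error: average squared difference between targets and predictions
    return np.mean((y - y_hat) ** 2)

def bce(y, p, eps=1e-12):
    # Binary cross-entropy; clipping predictions into [eps, 1-eps] avoids log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 2.5])))  # prints 0.1666...
print(bce(np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.1, 0.8])))
```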
Let's say we have \(n\) independent observations \(\{x^{(i)}\}^{n}_{i=1}\); the probability of obtaining all of them is
\(p(\{x^{(i)}\}^{n}_{i=1})=p(x^{(1)})\cdot p(x^{(2)})\cdot...\cdot p(x^{(n)})=\prod^{n}_{i=1}p(x^{(i)})\)
We will assume that all these observations were generated by the same distribution, and that this distribution has some parameter theta \(θ\).
\(p_{θ}(\{x^{(i)}\}^{n}_{i=1})=p_{θ}(x^{(1)})\cdot p_{θ}(x^{(2)})\cdot...\cdot p_{θ}(x^{(n)})=\prod^{n}_{i=1}p_{θ}(x^{(i)})\)
And we want to estimate the parameter \(θ\) from the given observations so that the probability (likelihood) of obtaining the observed data is maximized.
\(\prod^{n}_{i=1}p_{θ}(x^{(i)}) \to \underset{θ}{\max}\)
\(θ_{ML}=\underset{θ}{\mathrm{argmax}} p_{θ}(\{x^{(i)}\}^{n}_{i=1})=\underset{θ}{\mathrm{argmax}}\prod^{n}_{i=1}p_{θ}(x^{(i)})=\underset{θ}{\mathrm{argmax}} \log\prod^{n}_{i=1}p_{θ}(x^{(i)})=\)
\(=\underset{θ}{\mathrm{argmax}}\sum^{n}_{i=1}\log p_{θ}(x^{(i)})\)
Taking the logarithm does not change the argmax because \(\log\) is strictly increasing; it also turns the product into a sum, which is far friendlier numerically.
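The numerical point can be seen directly: multiplying many small probabilities underflows in floating point, while summing their logs stays finite. A small sketch (the specific numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# 500 small per-observation probabilities, e.g. densities of unlikely points
probs = rng.uniform(1e-4, 1e-3, size=500)

product = np.prod(probs)         # underflows to exactly 0.0 in float64
log_sum = np.sum(np.log(probs))  # the log-likelihood stays finite

print(product)   # 0.0
print(log_sum)   # a large negative but finite number
```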
This method of estimating parameters by maximizing the likelihood under an assumed distribution of the observations is called Maximum Likelihood Estimation (MLE); it is the general principle that lets us derive loss functions.
Let's say we have a regression problem \(\hat{y}=f_{θ}(x)\) and assume \(y∼N(y;μ=\hat{y},\sigma^{2})\), with density function: \(p_{θ}(y|x)=\frac{1}{\sigma\sqrt{2\pi}}\exp(\frac{-(y-\hat{y})^2}{2\sigma^2})\)
By the way, why the normal distribution? In the absence of prior knowledge about the distribution of the target variable it is a reasonable default: by the central limit theorem, noise formed as the sum of many small independent effects tends toward a Gaussian.
Now let's write down the log-likelihood \(J\) and use our new knowledge.
\(J=\sum^{n}_{i=1}\log p_{θ}(y^{(i)}|x^{(i)})=\sum^{n}_{i=1}\log\frac{1}{\sigma\sqrt{2\pi}}\exp(\frac{-(y^{(i)}-\hat{y}^{(i)})^2}{2\sigma^2})=\sum^{n}_{i=1}(-\log\sigma-\frac{1}{2}\log(2\pi)-\frac{(y^{(i)}-\hat{y}^{(i)})^2}{2\sigma^2})=\)
\(=-n\log(\sigma)-\frac{n}{2}\log(2\pi)-\sum^{n}_{i=1}\frac{(y^{(i)}-\hat{y}^{(i)})^2}{2\sigma^2}\)
The first two terms do not depend on \(θ\), so
\(\nabla_{θ} J=-\frac{1}{2\sigma^2}\nabla_{θ}\sum^{n}_{i=1}(y^{(i)}-\hat{y}^{(i)})^2\)
So \(θ_{ML}=\underset{θ}{\mathrm{argmax}}\, p_{θ}(y|x)=\underset{θ}{\mathrm{argmin}}\sum^{n}_{i=1}(y^{(i)}-\hat{y}^{(i)})^2\), which is \(MSE = \frac{1}{n}\sum_{i=1}^{n} (y^{(i)}-\hat{y}^{(i)})^2\) up to the constant factor \(\frac{1}{n}\).
This way we find that the optimum of \(θ_{ML}\) and the minimizer of MSE sit at the same position, even though the two objectives take different values.
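This can be checked numerically. Below is a sketch with the simplest possible model, a constant prediction \(\hat{y}=θ\); the data and the grid of candidate values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
y = 3.0 + rng.normal(0.0, 1.0, size=10_000)  # y ~ N(theta_true = 3, sigma = 1)

# Constant model y_hat = theta; evaluate both objectives on a grid of candidates
thetas = np.linspace(2.0, 4.0, 2001)
neg_log_lik = np.array([0.5 * np.sum((y - t) ** 2) for t in thetas])  # NLL up to additive constants
mses = np.array([np.mean((y - t) ** 2) for t in thetas])

# Both objectives attain their minimum at the same theta (the sample mean),
# even though their values differ
theta_mle = thetas[np.argmin(neg_log_lik)]
theta_mse = thetas[np.argmin(mses)]
print(theta_mle, theta_mse, y.mean())
```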
Let's toss a coin \(N\) times. Heads is 1 and tails is 0.
1 0 1 1 0 1 0 1 1 0 ...
And we got heads \(n\) times and tails \(N-n\) times. Each toss is a Bernoulli trial with parameter \(p\), so the number of heads follows a binomial distribution.
It turns out that the probability of this outcome, \(n\) heads in any order, is:
\(P(x)=C^{n}_{N}\cdot p^{n}\cdot (1-p)^{N-n}\) (each specific sequence, like \(p\cdot (1-p)\cdot p\cdot p\cdot (1-p)\cdot p\cdot ...\), has probability \(p^{n}(1-p)^{N-n}\), and the binomial coefficient counts the possible orderings), where
\(C^{n}_{N}=\frac{N!}{n!(N-n)!}\) is the binomial coefficient
\(p_{ML}=\underset{p}{\mathrm{argmax}} P(x)=\underset{p}{\mathrm{argmax}}\log P(x)=\)
\(=\underset{p}{\mathrm{argmax}} \log C^{n}_{N}+n\cdot \log(p)+(N-n)\cdot\log (1-p)=\)
\(=\underset{p}{\mathrm{argmax}}\, \frac{n}{N}\cdot \log(p)+(1-\frac{n}{N})\cdot\log (1-p)\) (the constant \(\log C^{n}_{N}\) is dropped and everything is divided by \(N\); neither step changes the argmax)
\(=\underset{p}{\mathrm{argmin}} - (\frac{n}{N}\cdot \log(p)+(1-\frac{n}{N})\cdot\log (1-p))=\)
\(=\underset{p}{\mathrm{argmin}} \text{BCE}\)
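A quick numerical check of this equivalence (the toss counts below are made up): minimizing the BCE expression over \(p\) recovers the maximum likelihood estimate \(p=\frac{n}{N}\).

```python
import numpy as np

N, n = 100, 62  # hypothetical counts: 62 heads out of 100 tosses

def bce(p):
    # BCE with target y = n/N and prediction p, as in the derivation above
    y = n / N
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

ps = np.linspace(0.01, 0.99, 9801)  # grid over (0, 1) with step 0.0001
p_hat = ps[np.argmin(bce(ps))]
print(p_hat, n / N)  # both are approximately 0.62
```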
https://t.me/NikitaDetkov