In supervised machine learning, we want to minimize the error over the training examples during the learning process. This is done with an optimization strategy such as gradient descent, and the error itself comes from a loss function, for example:
\(MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)
, where \(y_i\) is the target and
\(\hat{y}_i\) is the prediction
\(BCE = -(y\log(p)+(1-y)\log(1-p))\)
, where \(y\) is the target and
\(p\) is the predicted probability
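Both losses can be sketched in a few lines of NumPy. This is a minimal illustration rather than a library implementation; the `eps` clipping in BCE is an implementation choice to avoid `log(0)`:

```python
import numpy as np

def mse(y, y_hat):
    # Mean squared error: average squared difference between targets and predictions
    return np.mean((y - y_hat) ** 2)

def bce(y, p, eps=1e-12):
    # Binary cross-entropy; clipping predictions into [eps, 1-eps] avoids log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 2.5])))  # prints 0.1666...
print(bce(np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.1, 0.8])))
```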
Let's say we have \(n\) independent observations \(\{x^{(i)}\}^{n}_{i=1}\); the probability of obtaining all of them is
\(p(\{x^{(i)}\}^{n}_{i=1})=p(x^{(1)})\cdot p(x^{(2)})\cdot...\cdot p(x^{(n)})=\prod^{n}_{i=1}p(x^{(i)})\)
We will assume that all these observations were generated by the same distribution, and that this distribution has some parameter theta \(θ\).
\(p_{θ}(\{x^{(i)}\}^{n}_{i=1})=p_{θ}(x^{(1)})\cdot p_{θ}(x^{(2)})\cdot...\cdot p_{θ}(x^{(n)})=\prod^{n}_{i=1}p_{θ}(x^{(i)})\)
And we want to estimate the parameter \(θ\) from the given observations so that the probability (likelihood) of obtaining the observed data is maximized.
\(\prod^{n}_{i=1}p_{θ}(x^{(i)}) \to \underset{θ}{\max}\)
\(θ_{ML}=\underset{θ}{\mathrm{argmax}} p_{θ}(\{x^{(i)}\}^{n}_{i=1})=\underset{θ}{\mathrm{argmax}}\prod^{n}_{i=1}p_{θ}(x^{(i)})=\underset{θ}{\mathrm{argmax}} \log\prod^{n}_{i=1}p_{θ}(x^{(i)})=\)
\(=\underset{θ}{\mathrm{argmax}}\sum^{n}_{i=1}\log p_{θ}(x^{(i)})\)
Taking the logarithm does not change the argmax because \(\log\) is strictly increasing; it also turns the product into a sum, which is far friendlier numerically.
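The numerical point can be seen directly: multiplying many small probabilities underflows in floating point, while summing their logs stays finite. A small sketch (the specific numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# 500 small per-observation probabilities, e.g. densities of unlikely points
probs = rng.uniform(1e-4, 1e-3, size=500)

product = np.prod(probs)         # underflows to exactly 0.0 in float64
log_sum = np.sum(np.log(probs))  # the log-likelihood stays finite

print(product)   # 0.0
print(log_sum)   # a large negative but finite number
```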
This method of estimating parameters by maximizing the likelihood under an assumed distribution of the observations is called Maximum Likelihood Estimation (MLE); it is the general principle that lets us derive loss functions.
Let's say we have a regression problem \(\hat{y}=f_{θ}(x)\) and assume \(y∼N(y;μ=\hat{y},\sigma^{2})\), with density function: \(p_{θ}(y|x)=\frac{1}{\sigma\sqrt{2\pi}}\exp(\frac{-(y-\hat{y})^2}{2\sigma^2})\)
By the way, why the normal distribution? In the absence of prior knowledge about the distribution of the target variable it is a reasonable default: by the central limit theorem, noise formed as the sum of many small independent effects tends toward a Gaussian.
Now let's write down the log-likelihood \(J\) and use our new knowledge.
\(J=\sum^{n}_{i=1}\log p_{θ}(y^{(i)}|x^{(i)})=\sum^{n}_{i=1}\log\frac{1}{\sigma\sqrt{2\pi}}\exp(\frac{-(y^{(i)}-\hat{y}^{(i)})^2}{2\sigma^2})=\sum^{n}_{i=1}(-\log\sigma-\frac{1}{2}\log(2\pi)-\frac{(y^{(i)}-\hat{y}^{(i)})^2}{2\sigma^2})=\)
\(=-n\log(\sigma)-\frac{n}{2}\log(2\pi)-\sum^{n}_{i=1}\frac{(y^{(i)}-\hat{y}^{(i)})^2}{2\sigma^2}\)
The first two terms do not depend on \(θ\), so
\(\nabla_{θ} J=-\frac{1}{2\sigma^2}\nabla_{θ}\sum^{n}_{i=1}(y^{(i)}-\hat{y}^{(i)})^2\)
So \(θ_{ML}=\underset{θ}{\mathrm{argmax}}\, p_{θ}(y|x)=\underset{θ}{\mathrm{argmin}}\sum^{n}_{i=1}(y^{(i)}-\hat{y}^{(i)})^2\), which is \(MSE = \frac{1}{n}\sum_{i=1}^{n} (y^{(i)}-\hat{y}^{(i)})^2\) up to the constant factor \(\frac{1}{n}\).
This way we find that the optimum of \(θ_{ML}\) and the minimizer of MSE sit at the same position, even though the two objectives take different values.
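This can be checked numerically. Below is a sketch with the simplest possible model, a constant prediction \(\hat{y}=θ\); the data and the grid of candidate values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
y = 3.0 + rng.normal(0.0, 1.0, size=10_000)  # y ~ N(theta_true = 3, sigma = 1)

# Constant model y_hat = theta; evaluate both objectives on a grid of candidates
thetas = np.linspace(2.0, 4.0, 2001)
neg_log_lik = np.array([0.5 * np.sum((y - t) ** 2) for t in thetas])  # NLL up to additive constants
mses = np.array([np.mean((y - t) ** 2) for t in thetas])

# Both objectives attain their minimum at the same theta (the sample mean),
# even though their values differ
theta_mle = thetas[np.argmin(neg_log_lik)]
theta_mse = thetas[np.argmin(mses)]
print(theta_mle, theta_mse, y.mean())
```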
Let's toss a coin \(N\) times. Heads is 1 and tails is 0.
1 0 1 1 0 1 0 1 1 0 ...
And we got heads \(n\) times and tails \(N-n\) times. Each toss is a Bernoulli trial with parameter \(p\), so the number of heads follows a binomial distribution.
It turns out that the probability of this outcome, \(n\) heads in any order, is:
\(P(x)=C^{n}_{N}\cdot p^{n}\cdot (1-p)^{N-n}\) (each specific sequence, like \(p\cdot (1-p)\cdot p\cdot p\cdot (1-p)\cdot p\cdot ...\), has probability \(p^{n}(1-p)^{N-n}\), and the binomial coefficient counts the possible orderings), where
\(C^{n}_{N}=\frac{N!}{n!(N-n)!}\) is the binomial coefficient
\(p_{ML}=\underset{p}{\mathrm{argmax}} P(x)=\underset{p}{\mathrm{argmax}}\log P(x)=\)
\(=\underset{p}{\mathrm{argmax}} \log C^{n}_{N}+n\cdot \log(p)+(N-n)\cdot\log (1-p)=\)
\(=\underset{p}{\mathrm{argmax}}\, \frac{n}{N}\cdot \log(p)+(1-\frac{n}{N})\cdot\log (1-p)\) (the constant \(\log C^{n}_{N}\) is dropped and everything is divided by \(N\); neither step changes the argmax)
\(=\underset{p}{\mathrm{argmin}} - (\frac{n}{N}\cdot \log(p)+(1-\frac{n}{N})\cdot\log (1-p))=\)
\(=\underset{p}{\mathrm{argmin}} \text{BCE}\)
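A quick numerical check of this equivalence (the toss counts below are made up): minimizing the BCE expression over \(p\) recovers the maximum likelihood estimate \(p=\frac{n}{N}\).

```python
import numpy as np

N, n = 100, 62  # hypothetical counts: 62 heads out of 100 tosses

def bce(p):
    # BCE with target y = n/N and prediction p, as in the derivation above
    y = n / N
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

ps = np.linspace(0.01, 0.99, 9801)  # grid over (0, 1) with step 0.0001
p_hat = ps[np.argmin(bce(ps))]
print(p_hat, n / N)  # both are approximately 0.62
```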
https://t.me/NikitaDetkov