Assumes that the PDF of the random variables is a normal (Gaussian) distribution
$$\mathcal{N}(\vec{y};~\vec{\mu}, \Sigma)$$
Assumes that the PDF of the random variables can be modeled by a linear combination (mixture) of multiple Gaussians $$\sum_i\omega_i\mathcal{N}_i(\vec{y};~\vec{\mu}_i, \Sigma_i)$$
The average value of a function \(f(x)\) of a random variable \(x\) under a probability distribution \(p(x)\) is called the expectation of \(f(x)\) and is denoted by \(E[f(x)]\):
$$E[f(x)]=\sum_x p(x)f(x)\qquad\text{(discrete case)}$$
$$E[f(x)]=\int\limits^{\infty}_{-\infty}p(x)f(x)\,dx\qquad\text{(continuous case)}$$
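As a quick numerical illustration (not part of the original notes), the sketch below computes \(E[f(x)]\) for a small discrete distribution and approximates the continuous case by Monte Carlo sampling; the distribution and the function \(f\) are arbitrary choices for the example.

```python
import numpy as np

# Discrete case: E[f(x)] = sum_x p(x) f(x)
x = np.array([0.0, 1.0, 2.0, 3.0])   # support of the discrete variable
p = np.array([0.1, 0.2, 0.3, 0.4])   # probabilities, must sum to 1
f = lambda v: v ** 2                  # an arbitrary function f(x)
expectation_discrete = np.sum(p * f(x))

# Continuous case: E[f(x)] = integral of p(x) f(x) dx,
# approximated by averaging f over samples drawn from p(x).
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=2.0, size=100_000)   # x ~ N(1, 2^2)
expectation_continuous = f(samples).mean()

print(expectation_discrete)    # exact: 0.1*0 + 0.2*1 + 0.3*4 + 0.4*9 = 5.0
print(expectation_continuous)  # approx E[x^2] = mu^2 + sigma^2 = 1 + 4 = 5
```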
The variance of \(f(x)\) is the expected value of the squared deviation of \(f(x)\) from its mean,
where the mean is \(\mu=E[f(x)]\)
$$Var[f(x)]=E[(f(x)-E[f(x)])^2]$$
$$E[(f(x)-E[f(x)])^2]=E[f(x)^2]-E[f(x)]^2$$
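A small NumPy check (an illustrative sketch with an arbitrary \(f\) and distribution, not from the source) that the two expressions for the variance agree on sampled data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200_000)   # x ~ Uniform(-1, 1)
fx = np.sin(x)                              # an arbitrary function f(x)

mean_f = fx.mean()                                   # E[f(x)]
var_definition = np.mean((fx - mean_f) ** 2)         # E[(f(x) - E[f(x)])^2]
var_identity = np.mean(fx ** 2) - mean_f ** 2        # E[f(x)^2] - E[f(x)]^2

print(var_definition, var_identity)   # the two values agree up to floating-point error
```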
If a random variable \(y\) has a normal distribution \(\mathcal{N}(y;~\mu,\sigma^2)\), then its expectation is \(E[y]=\mu\) and its variance is \(Var[y]=\sigma^2\).
We therefore consider a superposition of \(G\) Gaussian densities of the form:
$$\mathcal{P}(\vec{y}) = \sum^{G}_{k=1}\omega_k\mathcal{N}_k(\vec{y};~\vec{\mu}_k,\Sigma_k)$$
which is called a mixture of Gaussians.
Each Gaussian density \(\mathcal{N}_k(\vec{y};~\vec{\mu}_k,\Sigma_k)\) is called a component of the mixture and has its own mean \(\vec{\mu}_k\) and covariance \(\Sigma_k\)
The weights \(\omega_k\) are called mixture coefficients
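The mixture density \(\mathcal{P}(\vec{y})\) can be evaluated directly from this definition. The sketch below is my own illustration (the two-component parameters are arbitrary) and uses scipy.stats.multivariate_normal for the component densities \(\mathcal{N}_k\):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(y, weights, means, covs):
    """Evaluate P(y) = sum_k w_k N(y; mu_k, Sigma_k) at each point in y."""
    y = np.atleast_2d(y)
    density = np.zeros(len(y))
    for w, mu, cov in zip(weights, means, covs):
        density += w * multivariate_normal(mean=mu, cov=cov).pdf(y)
    return density

# Example: a 2-component mixture in 2D with arbitrary parameters.
weights = [0.4, 0.6]                                   # mixture coefficients, sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

print(gmm_pdf([[0.0, 0.0], [3.0, 3.0]], weights, means, covs))
```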
The Gaussian mixture distribution in 1D formed by 3 Gaussians
$$\mathcal{P}(y)=\omega_1\mathcal{N}_1+\omega_2\mathcal{N}_2 + \omega_3\mathcal{N}_3$$
Source: Christopher M. Bishop's textbook “Pattern Recognition & Machine Learning”
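A curve like the one in Bishop's figure can be reproduced by evaluating the three weighted 1D components and their sum; the parameters below are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import norm

# Arbitrary example parameters for the three 1D components.
weights = np.array([0.3, 0.5, 0.2])   # w1, w2, w3, sum to 1
means   = np.array([-2.0, 0.5, 3.0])  # mu1, mu2, mu3
stds    = np.array([0.8, 1.0, 0.5])   # sigma1, sigma2, sigma3

y = np.linspace(-6.0, 6.0, 500)

# Weighted components w_k * N_k(y) and their sum P(y).
components = weights[:, None] * norm.pdf(y[None, :], loc=means[:, None], scale=stds[:, None])
mixture = components.sum(axis=0)

# The mixture integrates to (approximately) 1 because the weights sum to 1.
print((mixture * (y[1] - y[0])).sum())
```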
The Gaussian mixture distribution in 2D formed by 4 Gaussians
$$\mathcal{P}(\vec{y})=\omega_1\mathcal{N}_1+\omega_2\mathcal{N}_2 + \omega_3\mathcal{N}_3+\omega_4\mathcal{N}_4$$
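A figure like this is obtained by evaluating the four weighted 2D components on a grid of points; a minimal sketch (with invented component parameters) using SciPy and Matplotlib:

```python
import numpy as np
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt

# Arbitrary example parameters for the four 2D components.
weights = [0.25, 0.25, 0.30, 0.20]
means = [(-2.0, -2.0), (2.0, -1.0), (0.0, 2.0), (3.0, 3.0)]
covs = [np.eye(2), [[1.0, 0.6], [0.6, 1.0]], [[0.5, 0.0], [0.0, 1.5]], np.eye(2) * 0.7]

# Evaluate P(y) = sum_k w_k N_k(y) on a grid of 2D points.
xs, ys = np.meshgrid(np.linspace(-5, 6, 200), np.linspace(-5, 6, 200))
grid = np.dstack([xs, ys])                      # shape (200, 200, 2)
density = sum(w * multivariate_normal(mean=m, cov=c).pdf(grid)
              for w, m, c in zip(weights, means, covs))

plt.contour(xs, ys, density)                    # contour plot of the mixture density
plt.xlabel("y1")
plt.ylabel("y2")
plt.show()
```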
Number of parameters required by different models of the feature PDF (for the example \(nFeatures = 2\), \(nStates = 3\)):
Full discrete joint distribution over features quantized to 256 levels:
$$256^{nFeatures}\cdot nStates = 256^2\cdot3$$
Discrete distributions with independent features:
$$256\cdot nFeatures\cdot nStates = 256\cdot 2 \cdot 3$$
A single Gaussian per state (full covariance matrix plus mean vector):
$$(nFeatures^2 + nFeatures)\cdot nStates = (4 + 2)\cdot 3$$
A Gaussian mixture per state (covariance, mean and weight for each of the \(nGaussians\) components):
$$nGaussians\cdot(nFeatures^2 + nFeatures + 1)\cdot nStates = nGaussians\cdot(4 + 2 + 1)\cdot 3$$
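The counts above are plain arithmetic; the following sketch (my own illustration of the formulas, with \(nFeatures = 2\), \(nStates = 3\) as in the example and an arbitrary \(nGaussians = 4\)) prints the number of parameters for each model:

```python
# Parameter counts for the four models compared above.
n_features, n_states, n_gaussians = 2, 3, 4   # n_gaussians = 4 is an arbitrary choice

full_discrete   = 256 ** n_features * n_states                # joint table over quantized features
indep_discrete  = 256 * n_features * n_states                 # one histogram per feature
single_gaussian = (n_features ** 2 + n_features) * n_states   # covariance + mean per state
gmm = n_gaussians * (n_features ** 2 + n_features + 1) * n_states  # + one weight per component

print(full_discrete, indep_discrete, single_gaussian, gmm)    # 196608 1536 18 84
```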
If we integrate both sides of equation
\(\mathcal{P}(\vec{y}) = \sum^{G}_{k=1}\omega_k\mathcal{N}_k(\vec{y};~\vec{\mu}_k,\Sigma_k)\) with respect to \(\vec{y}\), and note that both \(\mathcal{P}(\vec{y})\) and the individual Gaussian components are normalised, we obtain \(\sum^{G}_{k=1}\omega_k=1\).
Also, given that \(\mathcal{N}_k(\vec{y};~\vec{\mu}_k, \Sigma_k)\ge0\), a sufficient condition for the requirement \(\mathcal{P}(\vec{y})\ge0\) is that \(\omega_k\ge0, \forall k \in [1; G]\). Combining this with the condition above we obtain \(0\le\omega_k\le1\).
We therefore see that the mixture coefficients satisfy the requirements to be probabilities, and so can be considered as the prior probabilities \(p(k)=\omega_k\) of picking the \(k\)-th component.
The densities \(\mathcal{N}_k(\vec{y};~\vec{\mu}_k, \Sigma_k)\), in turn, can be considered as the probabilities of \(\vec{y}\) conditioned on \(k\): \(p(\vec{y}~|~k)\).
From Bayes' theorem we can write the posterior probability of the \(k\)-th component given an observation \(\vec{y}\):
$$p(k~|~\vec{y})\propto p(k)\cdot p(\vec{y}~|~k) = \omega_k\mathcal{N}_k(\vec{y};~\vec{\mu}_k,\Sigma_k)$$
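Normalising the weighted component densities gives the posterior \(p(k~|~\vec{y})\), often called the responsibility of component \(k\). A small sketch with arbitrary parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary 2-component mixture in 2D.
weights = np.array([0.4, 0.6])
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]

y = np.array([2.0, 2.5])                                  # a single observation

# Unnormalised posteriors: w_k * N_k(y; mu_k, Sigma_k), the Bayes numerator.
joint = np.array([w * multivariate_normal(mean=m, cov=c).pdf(y)
                  for w, m, c in zip(weights, means, covs)])

responsibilities = joint / joint.sum()                    # p(k | y), sums to 1
print(responsibilities)
```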
The estimation of the mixture coefficients \(\omega_k\) and of the parameters of the mixture components \(\vec{\mu}_k\), \(\Sigma_k\) from a training dataset is a non-trivial task.
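In practice this estimation is usually done with the Expectation-Maximization (EM) algorithm; scikit-learn's GaussianMixture wraps it. A minimal sketch on synthetic data (the dataset and the choice of two components are invented for the example):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1D training data drawn from two well-separated clusters.
rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(-3.0, 1.0, 500),
                       rng.normal(4.0, 0.7, 300)]).reshape(-1, 1)

# Fit a 2-component GMM with EM; n_components is chosen by the modeller.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(data)

print(gmm.weights_)       # estimated mixture coefficients w_k
print(gmm.means_)         # estimated means mu_k
print(gmm.covariances_)   # estimated covariances Sigma_k
```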
The green dots are an example of data that cannot be modelled well by a single Gaussian distribution, because they form two clusters (two "summits").
To model the distribution of the green dots, we can use a linear combination of Gaussian distributions, which fits such data much better than a single Gaussian.
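To make the claim concrete, the sketch below (with synthetic two-cluster data standing in for the green dots) fits a single Gaussian and a two-component mixture to the same data and compares the average log-likelihood per sample; the mixture scores noticeably higher:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two-cluster ("two summits") synthetic 2D data, standing in for the green dots.
rng = np.random.default_rng(7)
data = np.vstack([rng.normal([-2.0, 0.0], 0.6, size=(400, 2)),
                  rng.normal([2.5, 1.5], 0.8, size=(400, 2))])

for k in (1, 2):
    model = GaussianMixture(n_components=k, random_state=0).fit(data)
    # score() returns the average log-likelihood per sample; higher is better.
    print(f"{k} component(s): avg log-likelihood = {model.score(data):.3f}")
```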