A brief introduction to neural networks and deep learning

BE3027 - Medical Robotics

The perceptron as a starting point

How?

Diagram labels: inputs, weights, bias term, sum, activation function (nonlinearity), output, hypothesis.

\displaystyle \hat{y}=g\left(\sum_{j=1}^{d} w_jx_j+w_0\right)=g\left(\mathbf{w}^\top \tilde{\mathbf{x}}\right)

\tilde{\mathbf{x}}=\begin{bmatrix} 1 \\ \mathbf{x} \end{bmatrix}=\begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}

Let us think about what this really is and what its limitations are.
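As a minimal sketch of the formula above, the forward pass of a single perceptron can be written in NumPy. The function names, the choice of a sigmoid for \(g\), and the numeric values are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here only as an example activation g."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, g=sigmoid):
    """Compute y_hat = g(w^T x_tilde) with x_tilde = [1, x] and w = [w_0, w_1, ..., w_d]."""
    x_tilde = np.concatenate(([1.0], x))  # prepend the constant 1 that multiplies the bias w_0
    return g(w @ x_tilde)

# Hypothetical values chosen only for illustration (d = 2).
w = np.array([-1.0, 2.0, 0.5])   # [w_0, w_1, w_2]
x = np.array([0.5, 1.0])         # [x_1, x_2]
y_hat = perceptron(x, w)         # pre-activation: -1 + 2*0.5 + 0.5*1 = 0.5
```

Prepending the 1 to the input lets the bias be treated as just another weight, which is why the sum and the inner product in the formula are the same quantity.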

"Common" activation functions

The original perceptron used the unit step function.

\hat{y}=f\left(\mathbf{x},\mathbf{w}\right)=\mathbf{1} \left(\mathbf{w}^\top \tilde{\mathbf{x}}\right)=\begin{cases} 1 & \text{ si } \mathbf{w}^\top \tilde{\mathbf{x}}\ge0 \\ 0 & \text{ si } \mathbf{w}^\top \tilde{\mathbf{x}}<0 \end{cases}

linear map \(\equiv\) hyperplane in \(\mathbb{R}^d\)

The activation function introduces the nonlinearity that makes binary classification possible.
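For reference, a few widely used activation functions can be sketched as follows. The selection here (unit step, sigmoid, tanh, ReLU) is an assumption made for illustration; the slide text itself names only the unit step at this point.

```python
import numpy as np

def unit_step(z):
    """1 if z >= 0, else 0 (the activation of the original perceptron)."""
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    """Smooth, differentiable squashing of z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit: max(0, z)."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])   # sample pre-activations
activations = {
    "unit_step": unit_step(z),
    "sigmoid": sigmoid(z),
    "tanh": np.tanh(z),
    "relu": relu(z),
}
```

The unit step has zero gradient almost everywhere, which is one reason smooth or piecewise-linear alternatives are preferred when the weights are fitted by gradient-based optimization.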

Example

\displaystyle L(\mathbf{w})=\dfrac{1}{2} \sum_{i=1}^{n} \left[y^{(i)}-f(\mathbf{x}^{(i)},\mathbf{w})\right]^2

sigmoid activation function

\(f(\mathbf{x},\mathbf{w})=\sigma\left(\mathbf{w}^\top \tilde{\mathbf{x}}\right)\)
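One way to minimize this loss is plain batch gradient descent, sketched below. The toy data set (logical AND), the step size, and the iteration count are assumptions chosen for illustration; the lecture does not say which data or optimizer produced its result.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X_tilde, y):
    """L(w) = 1/2 * sum_i (y_i - sigma(w^T x_tilde_i))^2."""
    return 0.5 * np.sum((y - sigmoid(X_tilde @ w)) ** 2)

def gradient(w, X_tilde, y):
    """Gradient of L with respect to w, using sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(X_tilde @ w)
    return -X_tilde.T @ ((y - s) * s * (1.0 - s))

# Hypothetical, linearly separable toy data (logical AND), not the lecture's data set.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])
X_tilde = np.hstack([np.ones((len(X), 1)), X])   # each row is [1, x_1, x_2]

w = np.zeros(3)                # [w_0, w_1, w_2]
initial_loss = loss(w, X_tilde, y)
learning_rate = 1.0            # assumed step size
for _ in range(5000):          # assumed iteration budget
    w -= learning_rate * gradient(w, X_tilde, y)
final_loss = loss(w, X_tilde, y)
predictions = (sigmoid(X_tilde @ w) >= 0.5).astype(float)
```

Thresholding the sigmoid output at 0.5 is equivalent to checking the sign of \(\mathbf{w}^\top \tilde{\mathbf{x}}\), so the fitted weights still define a linear decision boundary.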

Example

\displaystyle \mathbf{w}^\star=\begin{bmatrix} w_0^\star \\ w_1^\star \\ w_2^\star \end{bmatrix}=\arg\min_{\mathbf{w}} L(\mathbf{w})=\begin{bmatrix} -3.8383 \\ 2.0384 \\ 2.0087\end{bmatrix}

separating hyperplane

x_2=mx_1+b=-\dfrac{w_1^\star}{w_2^\star}x_1-\dfrac{w_0^\star}{w_2^\star}
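Plugging the reported weights into the line equation gives the slope and intercept directly. The short sketch below does that arithmetic; the helper `side` is a hypothetical name introduced here to show which side of the line a point falls on.

```python
w0, w1, w2 = -3.8383, 2.0384, 2.0087   # w* as reported in the example

m = -w1 / w2   # slope of the separating line, approximately -1.0148
b = -w0 / w2   # intercept on the x_2 axis, approximately 1.9108

def side(x1, x2):
    """Value of w^T x_tilde: positive on one side of the line, negative on the other."""
    return w0 + w1 * x1 + w2 * x2
```

Points with `side(x1, x2) > 0` receive a sigmoid output above 0.5, and points exactly on the line receive 0.5.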

When does the perceptron fail?

Example: XOR

A	B	A XOR B
0	0	0
0	1	1
1	0	1
1	1	0

The data is not linearly separable.
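The claim can be checked with a short argument. A unit-step perceptron would need \(w_0<0\), \(w_0+w_1\ge 0\), \(w_0+w_2\ge 0\) and \(w_0+w_1+w_2<0\). Adding the two middle inequalities gives \(w_0+w_1+w_2\ge -w_0>0\), which contradicts the last one. The sketch below is only a numerical illustration of the same fact: it searches a grid of weights (the range and resolution are assumptions) and records the best number of correctly classified rows.

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0])
X_tilde = np.hstack([np.ones((4, 1)), X])   # rows are [1, A, B]

grid = np.linspace(-2.0, 2.0, 17)           # assumed search range, step 0.25
best_correct = 0
for w in itertools.product(grid, repeat=3):  # every (w_0, w_1, w_2) on the grid
    predictions = (X_tilde @ np.array(w) >= 0).astype(int)   # unit-step perceptron
    best_correct = max(best_correct, int(np.sum(predictions == y_xor)))
# best_correct ends at 3: no weight vector classifies all four rows.
```

A grid search is not a proof, since it covers only finitely many weights; the inequality argument is what rules out every possible hyperplane.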

Solution?

Diagram labels: \(z=\mathbf{w}^\top\tilde{\mathbf{x}}\), \(g(\cdot)\), \(\cdots\)

Neural network (one hidden layer)

Diagram labels: inputs, hidden layer, outputs.

\displaystyle z_m=g\left(\sum_{j=1}^{d} w_{j,m}^{(1)} x_j+w_{0,m}^{(1)}\right)=g\left(\mathbf{w}^{(1)\top}_m \tilde{\mathbf{x}}\right)

\tilde{\mathbf{z}}=\begin{bmatrix} 1 \\ \mathbf{z} \end{bmatrix}=\begin{bmatrix} 1 \\ g\left(\mathbf{w}^{(1)\top}_1 \tilde{\mathbf{x}}\right) \\ \vdots \\ g\left(\mathbf{w}^{(1)\top}_{M_1} \tilde{\mathbf{x}}\right) \end{bmatrix}

\displaystyle \hat{y}_k=g\left(\sum_{m=1}^{M_1} w_{m,k}^{(2)} z_m+w_{0,k}^{(2)}\right)=g\left(\mathbf{w}_k^{(2)\top}\tilde{\mathbf{z}}\right)
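The two equations translate almost line by line into code. The sketch below stacks the weight vectors \(\mathbf{w}_m^{(1)}\) and \(\mathbf{w}_k^{(2)}\) as rows of matrices `W1` and `W2` (a layout assumed here for convenience) and uses hand-picked, hypothetical weights with a unit-step activation to show that one hidden layer is enough to represent XOR.

```python
import numpy as np

def unit_step(z):
    return np.where(z >= 0, 1.0, 0.0)

def forward(x, W1, W2, g=unit_step):
    """Forward pass of a network with one hidden layer.

    W1 has shape (M_1, d + 1): row m is w_m^(1), bias first.
    W2 has shape (K, M_1 + 1): row k is w_k^(2), bias first.
    """
    x_tilde = np.concatenate(([1.0], x))
    z = g(W1 @ x_tilde)                    # hidden units z_1, ..., z_{M_1}
    z_tilde = np.concatenate(([1.0], z))
    return g(W2 @ z_tilde)                 # outputs y_hat_1, ..., y_hat_K

# Hand-picked (hypothetical) weights: hidden unit 1 acts as OR, hidden unit 2 as AND,
# and the output fires when OR is on and AND is off, which is XOR.
W1 = np.array([[-0.5, 1.0, 1.0],
               [-1.5, 1.0, 1.0]])
W2 = np.array([[-0.5, 1.0, -1.0]])

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [int(forward(np.array(x, dtype=float), W1, W2)[0]) for x in inputs]
# outputs == [0, 1, 1, 0]
```

The hidden layer maps the four input points into a space where a single hyperplane separates them, which is exactly what the lone perceptron could not do.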

Why?

The universal approximation theorem

It states that a neural network with a single hidden layer, a form of multilayer perceptron (MLP), with a suitable nonlinear activation function can approximate any continuous function defined on a compact set to arbitrary precision, provided that it is allowed a sufficient number of neurons in the hidden layer.

- ChatGPT 4o

A retroactive justification

other possible models with different parameters

What happens as nodes are added to the hidden layer?
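A rough numerical illustration of this question is sketched below. It is not the lecture's experiment: the hidden-layer weights are drawn at random and kept fixed, the output layer is linear, and only the output weights are fitted by least squares. The target function, the weight scale, and the hidden-layer sizes are all assumptions. Because each larger network reuses the hidden units of the smaller ones, the training error cannot increase as units are added, apart from floating-point effects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on a compact interval (arbitrary choice for illustration).
x = np.linspace(-np.pi, np.pi, 200)
target = np.sin(2 * x)

# One fixed pool of random hidden-layer weights; a network with M units uses the first M rows.
max_units = 50
W1 = rng.normal(scale=2.0, size=(max_units, 2))   # columns: bias w_{0,m}, weight w_{1,m}
X_tilde = np.column_stack([np.ones_like(x), x])
Z = np.tanh(X_tilde @ W1.T)                        # hidden activations, shape (200, max_units)

errors = {}
for M in (2, 5, 10, 50):
    Z_tilde = np.column_stack([np.ones_like(x), Z[:, :M]])
    # Linear output layer fitted by least squares (hidden weights stay fixed).
    w2, *_ = np.linalg.lstsq(Z_tilde, target, rcond=None)
    errors[M] = float(np.mean((Z_tilde @ w2 - target) ** 2))
```

Printing `errors` shows the mean squared error for each hidden-layer size. This reports training error only, so it says nothing about generalization to new data.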

From the universal approximation theorem to deep learning

Deep learning vs machine learning

Why deep learning? *

When to use deep learning? *

YES

Large amount of data (~10k+ examples).
The problem is complex.
The data lacks structure.
The "best model" is needed.
Suitable hardware is available.

NO

Little data.
Traditional methods are sufficient.
The data has structure.
Domain knowledge is available.
The model must be explainable.

* While this still holds, it reflects a dated perspective.

From the perspective of MLPs with ReLU activation functions, one can simply say that deep networks are "more expressive" than their shallow counterparts.

Example of "expressiveness"

Diagram: a deep network with two hidden layers of three nodes each (\(h_1, h_2, h_3\) followed by \(h_1', h_2', h_3'\)) and a shallow network with one hidden layer of six nodes (\(h_1, \ldots, h_6\)), each producing \(\hat{y}\).

7 linear regions (shallow) vs 16 linear regions (deep)

A single hidden layer with \(D\) nodes gives up to \(D+1\) linear regions; \(K\) layers of \(D\) nodes each give up to \((D+1)^K\).
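The two counts on the slide follow from these bounds by direct arithmetic, as the short sketch below shows. The function names are hypothetical, and the bounds are taken here under the assumption of a single scalar input and ReLU activations.

```python
def max_regions_shallow(D):
    """Upper bound on linear regions for one hidden layer of D ReLU nodes (scalar input assumed)."""
    return D + 1

def max_regions_deep(D, K):
    """Upper bound for K hidden layers of D ReLU nodes each (scalar input assumed)."""
    return (D + 1) ** K

shallow = max_regions_shallow(6)   # h_1, ..., h_6 in a single layer: 7
deep = max_regions_deep(3, 2)      # h_1..h_3 followed by h_1'..h_3': 16
```

Both networks use six hidden nodes, yet the bound for the deep arrangement grows exponentially with the number of layers, while the shallow one grows only linearly with the number of nodes.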

"Specialized" neurons?

The neural network zoo

Frameworks for deep learning

easy, useful for prototyping

BE3027 - Lecture 16 (2024)

By Miguel Enrique Zea Arenales
