re: Capsule Net
Suriyadeepan Ramamoorthy
Who Am I?
- AI Research Engineer, SAAMA Tech.
- I work in NLU
- I am a Free Software Evangelist
- I am interested in
- AGI
- Community Networks
- Data Visualization
- Generative Art
- Creative Coding
- And I have a blog
Motivation
CNN
- Layers of feature extraction followed by Local Pooling
- Precise information about position is thrown away
- in exchange for location invariance
- \(p(y | x)\)
- Pooling (max or average) over the outputs of replicated feature detectors
- Single output to next layer
Why are CNNs doomed?
- Sub-sampling loses information about position
- precise spatial relationship among high level features
- Cannot extrapolate understanding to new viewpoints
Invariance
- Eliminate information about lighting, viewpoint, etc.
- sub-sampling is misguided
- Knowledge of the shape of the object should be invariant
- stored in the weights
Equivariance
- Neural activities change with change in viewpoint
- but knowledge of the entity remains constant
- Know that it's the same shape
- and notice the change in viewpoint
- Weights are invariant
- Neural Activities are equivariant
Representation
Hierarchy of Parts
- From the pose of the mouth \(T_i\)
- we can predict the pose of the face
- \(p_i\), probability that mouth exists
- From the pose of the nose \(T_h\)
- we can predict the pose of the face
- If \(T_iT_{ij} \approx T_hT_{hj}\)
- mouth and nose are related correctly to make a face
- We need pose information of entities
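The agreement test \(T_iT_{ij} \approx T_hT_{hj}\) can be tried out with small matrices. The sketch below is only an illustration: it assumes poses are 3x3 homogeneous 2D transforms, and the helper `pose`, the numeric values and the tolerance are all made up for the example.

```python
import numpy as np

def pose(angle, tx, ty):
    """3x3 homogeneous matrix: rotation by `angle` (radians) plus translation (tx, ty)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0., 0., 1.]])

# Hypothetical fixed part-whole relations (they play the role of learned weights).
T_ij = pose(0.0, 0.0, 2.0)     # relates mouth pose to face pose
T_hj = pose(0.0, 0.0, -0.5)    # relates nose pose to face pose

# A face seen from some viewpoint, and part poses that are consistent with it.
face = pose(np.pi / 6, 4.0, 1.0)
T_i = face @ np.linalg.inv(T_ij)   # pose of the mouth
T_h = face @ np.linalg.inv(T_hj)   # pose of the nose

def agree(a, b, tol=1e-6):
    """True when two predicted face poses match within `tol`."""
    return np.allclose(a, b, atol=tol)

# Both parts predict the same face pose.
parts_agree = agree(T_i @ T_ij, T_h @ T_hj)

# Shift the nose sideways: its prediction of the face pose no longer matches.
T_h_moved = pose(0.0, 3.0, 0.0) @ T_h
moved_agree = agree(T_i @ T_ij, T_h_moved @ T_hj)
```

Changing `face` changes `T_i` and `T_h` (the activities), while `T_ij` and `T_hj` stay fixed, which mirrors the invariant-weights, equivariant-activities split described earlier.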
Capsules
- Performs internal computations
- Encapsulates information as a vector
Capsules
- Each capsule should learn to detect a fragment
- Capsule outputs
- \((x,y)\) : coordinates of the fragment
- \(i\) : intensity of the fragment
- Train capsules by reconstructing the original image
- based on contributions from individual capsules
- Autoencoder
Capsule Dynamics
Output of Capsule
- Output of a capsule is a vector
- Length (magnitude) of the vector
  - probability that the entity represented by the capsule is present
- "Squash" non-linearity
  - shrinks short vectors to nearly zero length and long vectors to a length just below 1, so the length can be read as a probability
\(v_j = \frac{||s_j||^2}{1 + ||s_j||^2}\frac{s_j}{||s_j||}\)
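A minimal NumPy sketch of the squash non-linearity, written for illustration rather than taken from the deck's TensorFlow code. The `eps` term is an assumption added here to avoid dividing by zero for an all-zero input.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Scale vector(s) along `axis` to a length in [0, 1), keeping the direction."""
    sq_norm = np.sum(np.square(s), axis=axis, keepdims=True)
    norm = np.sqrt(sq_norm + eps)          # eps is an assumed safeguard
    return (sq_norm / (1.0 + sq_norm)) * (s / norm)

short_vec = squash(np.array([0.1, 0.0]))    # short input: length close to 0
long_vec = squash(np.array([30.0, 40.0]))   # long input: length just below 1
zero_vec = squash(np.zeros(3))              # stays at zero, no NaN
```

The direction of `long_vec` is still (0.6, 0.8); only its length is compressed.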
Input to Capsule
- Layer \(l\)'s output vectors \(\{u^l_i\}\)
- Prediction vectors
  - \(u_{j|i} = W_{ij}u_i\)
- Total input to capsule \(j\) in layer \((l+1)\)
- weighted sum over prediction vectors
- \(s_j = \sum_{i} c_{ij}u_{j|i}\)
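The two formulas above can be sketched on toy sizes. The layer sizes, the random values and the uniform coupling coefficients below are assumptions chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
num_in, num_out, dim_in, dim_out = 3, 2, 4, 5   # toy sizes, not the MNIST ones

u = rng.normal(size=(num_in, dim_in))                     # u_i
W = rng.normal(size=(num_in, num_out, dim_out, dim_in))   # W_ij

# Prediction vectors u_{j|i} = W_ij u_i, shape [num_in, num_out, dim_out]
u_hat = np.einsum('ijdk,ik->ijd', W, u)

# Uniform coupling coefficients (what routing starts from), each row sums to 1
c = np.full((num_in, num_out), 1.0 / num_out)

# Total input s_j = sum_i c_ij u_{j|i}, shape [num_out, dim_out]
s = np.einsum('ij,ijd->jd', c, u_hat)
```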
Coupling Coefficients
- \(\{c_{ij}\}\) computed by iterative dynamic routing process
- Strength of link between capsules \(i\) and \(j\)
- \(i\) : lower layer capsule
- \(j\) : higher layer capsule
- Note
- \( \sum_j c_{ij} = 1 \)
Routing Softmax
- Log prior probabilities, \(\{b_{ij}\}\)
- for coupling between capsule \(i\), \(j\)
\(c_{ij} = \frac{\exp(b_{ij})}{\sum_{k}\exp(b_{ik})}\)
- Initial coupling coefficients are iteratively refined
  - by measuring the agreement between capsules \(i, j\), that is, between
    - the output vector of capsule \(j\), \(v_j\)
    - the prediction by capsule \(i\), \(u_{j|i}\)
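A small NumPy sketch of the routing softmax, assuming logits are stored as a `[num_in, num_out]` array. The max-subtraction is a standard numerical-stability trick added here; it does not change the result.

```python
import numpy as np

def routing_softmax(b):
    """b: [num_in, num_out] logits b_ij. Softmax over j for every lower capsule i."""
    b = b - b.max(axis=1, keepdims=True)   # stability only
    e = np.exp(b)
    return e / e.sum(axis=1, keepdims=True)

# Zero logits give uniform coupling: each lower capsule splits its output evenly.
c0 = routing_softmax(np.zeros((3, 2)))

# Larger logits pull the coupling towards that higher-layer capsule.
b1 = np.array([[2.0, 0.0],
               [0.0, 0.0],
               [-1.0, 1.0]])
c1 = routing_softmax(b1)
```

Every row of `c0` and `c1` sums to 1, which is the constraint \(\sum_j c_{ij} = 1\) noted earlier.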
Agreement between Capsules
- Scalar product, \(a_{ij} = v_j \cdot u_{j|i}\)
- Add agreement to initial logits
- \(b_{ij} = b_{ij} + a_{ij}\)
Routing Algorithm
- Set priors \(b_{ij}\) to zeros
- For each routing iteration
  - For all capsules \(i\) in layer \((l)\)
    - Calculate coupling coefficients from priors
    - \(c_i = softmax(b_i)\)
  - For all capsules \(j\) in layer \((l+1)\)
    - Prediction vectors, \(u_{j|i}\)
    - Total input, \( s_j = \sum_i c_{ij} u_{j|i}\)
    - Capsule output, \(v_j = squash(s_j)\)
  - For all capsules \(i\) in layer \((l)\) and \(j\) in layer \((l+1)\)
    - \(b_{ij} = b_{ij} + u_{j|i} \cdot v_j\)
- Return capsule output, \(v_j\)
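The routing steps can be sketched end to end in NumPy for a single example. This is an illustration, not the deck's TensorFlow implementation: the function names, the default of 3 iterations, the `eps` safeguard and the toy prediction vectors are all assumptions.

```python
import numpy as np

def squash(s, eps=1e-9):
    sq_norm = np.sum(np.square(s), axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * (s / np.sqrt(sq_norm + eps))

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(u_hat, num_iterations=3):
    """u_hat: prediction vectors [num_in, num_out, dim_out].
    Returns capsule outputs v [num_out, dim_out] and the last couplings c."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                  # set priors to zeros
    for _ in range(num_iterations):
        c = softmax(b, axis=1)                       # c_i = softmax(b_i)
        s = np.einsum('ij,ijd->jd', c, u_hat)        # s_j = sum_i c_ij u_{j|i}
        v = squash(s)                                # v_j = squash(s_j)
        b = b + np.einsum('ijd,jd->ij', u_hat, v)    # b_ij += u_{j|i} . v_j
    return v, c

# Four lower capsules, two higher capsules, 2D outputs.
u_hat = np.zeros((4, 2, 2))
u_hat[:, 0] = [1.0, 0.0]                             # all agree about capsule 0
u_hat[:, 1] = [[1, 0], [-1, 0], [0, 1], [0, -1]]     # they disagree about capsule 1

v, c = route(u_hat)
```

After three iterations the couplings towards capsule 0 have grown well above the uniform 0.5, and capsule 1, whose predictions cancel out, keeps an output of essentially zero length.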
Convolutional Capsule Net
Architecture
(1) Convolution Layer
- [28x28] pixel image as input
- 256 kernels of size [9x9]
- stride = 1
- ReLU activation
- 20x20x256 volume
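The 20x20 output size follows from the usual formula for a convolution without padding. A quick check, where `conv_output_size` is a helper name introduced only for this example:

```python
def conv_output_size(input_size, kernel_size, stride):
    """Output width/height of a 'VALID' (unpadded) convolution."""
    return (input_size - kernel_size) // stride + 1

# 28x28 image, 9x9 kernel, stride 1
conv1_size = conv_output_size(28, 9, 1)

# the same formula applied to a 9x9 kernel with stride 2 on that output
next_size = conv_output_size(conv1_size, 9, 2)
```

Here `conv1_size` is 20 and `next_size` is 6, matching the 20x20 and 6x6 grids used in this architecture.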
(1) Convolution Layer
import tensorflow as tf  # TensorFlow 1.x API (placeholders, tf.contrib)

# input image
_x = tf.placeholder(tf.float32, [None, 784])
# reshape image for convolution
x = tf.reshape(_x, [-1, 28, 28, 1])
# first layer of convolution
with tf.variable_scope('conv1'):
    # create 256 filters of kernel size 9x9
    w = tf.get_variable('w', shape=[9, 9, 1, 256], dtype=tf.float32,
                        initializer=tf.contrib.layers.xavier_initializer())
    # stride = 1, no padding : output is [batch, 20, 20, 256]
    conv1 = tf.nn.conv2d(x, w, [1, 1, 1, 1], padding='VALID', name='conv1')
    # relu activation
    conv1 = tf.nn.relu(conv1)
(2) Primary Capsules
- Convolutional Capsule Layer with "squash" as non-linearity
- 32 channels of 8D capsules
- 32 x 8 = 256 kernels of size [9x9]
- stride = 2
- no activation on the convolution itself; "squash" is applied to the 8D vectors
- 32 [6x6x8] volumes
- 32 [6x6] 8D vectors
In convolutional capsule layers each unit in a capsule is a convolutional unit. Therefore, each capsule will output a grid of vectors rather than a single vector output.
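To make the "grid of vectors" concrete, the sketch below regroups a 6x6x256 convolution output into 8D capsule vectors with NumPy. It assumes a channels-last layout and that each run of 8 consecutive channels forms one capsule; both are conventions chosen for this example.

```python
import numpy as np

batch = 2
# Stand-in for the convolution output: [batch, 6, 6, 32 * 8]
conv_out = np.arange(batch * 6 * 6 * 256, dtype=np.float32).reshape(batch, 6, 6, 256)

# 32 capsule channels at each of the 6x6 positions, 8 values per capsule
capsules = conv_out.reshape(batch, 6 * 6 * 32, 8)
```

`capsules[0, 0]` holds the first 8 channels at grid position (0, 0), `capsules[0, 1]` the next 8 channels at the same position, and `capsules[0, 32]` the first 8 channels at position (0, 1).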
(2) Primary Capsules
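with tf.variable_scope('primary_caps'):
    # 9x9 filters, 32*8=256 channels, stride=2
    primary_capsules = tf.contrib.slim.conv2d(inputs=conv1,
                                              num_outputs=32*8,
                                              kernel_size=9,
                                              stride=2,
                                              padding='VALID',
                                              activation_fn=None)
    # group the 256 channels into 8D capsule vectors : [batch, 32*6*6, 8]
    primary_capsules = tf.reshape(primary_capsules, [-1, 32*6*6, 8])
    # apply "squash" non-linearity along the 8D capsule axis
    # (squash is a helper that is not shown on these slides)
    primary_capsules = squash(primary_capsules)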
Prediction Vectors
\(u_{j|i} = W_{ij}u_i\)
# primary capsules : 32 x [6x6]
num_capsules = 32*6*6
primary_capsule_dim = 8
# reshape primary capsules for calculating prediction vectors
primary_capsules = tf.reshape(primary_capsules,
                              [-1, 1, num_capsules, 1, primary_capsule_dim])
# next capsule layer (digit capsules) : [10, 16]
num_digits = 10
digit_capsule_dim = 16
Prediction Vectors
# weight matrix
Wij = tf.get_variable('Wij',
                      [num_digits, num_capsules, primary_capsule_dim, digit_capsule_dim],
                      dtype=tf.float32)
# tile primary capsules for multiplication with weight matrix
tiled_prim_caps = tf.tile(primary_capsules, [1, num_digits, 1, 1, 1])
# yeah.. we need a loop :(
# help me fix this!
# scan over the batch axis; the accumulator is ignored,
# each example is multiplied with Wij independently
cap_predictions = tf.scan(lambda _, x: tf.matmul(x, Wij),  # fn
                          tiled_prim_caps,  # elements
                          initializer=tf.zeros([num_digits, num_capsules, 1, digit_capsule_dim])
                          )
# squeeze dummy dimension : [batch, num_digits, num_capsules, digit_capsule_dim]
cap_predictions = tf.squeeze(cap_predictions, [3])
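One possible way to drop the loop is a single einsum over the batch, capsule and digit axes. The NumPy sketch below uses small stand-in sizes and compares the einsum against an explicit per-example loop. The same subscripts may carry over to `tf.einsum`, but that is an assumption and has not been tested against the TensorFlow version used in these slides.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_capsules, num_digits = 2, 6, 3        # small stand-in sizes
primary_capsule_dim, digit_capsule_dim = 8, 16

primary = rng.normal(size=(batch, num_capsules, primary_capsule_dim))
Wij = rng.normal(size=(num_digits, num_capsules,
                       primary_capsule_dim, digit_capsule_dim))

# b: batch, i: primary capsule, j: digit capsule, p: primary dim, d: digit dim
cap_predictions = np.einsum('bip,jipd->bjid', primary, Wij)

# Reference: tile and matmul one example at a time, as the scan does
looped = np.stack([
    np.matmul(np.tile(x[None, :, None, :], (num_digits, 1, 1, 1)), Wij)[:, :, 0, :]
    for x in primary
])
```

Both results have shape `[batch, num_digits, num_capsules, digit_capsule_dim]`, so no tiling or squeezing is needed on the einsum path.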
Log Priors
# { b_ij } log prior probabilities
priors = tf.get_variable('log_priors',
                         [num_digits, num_capsules],
                         initializer=tf.zeros_initializer())
# expand to support batch dimension
priors = tf.expand_dims(priors, axis=0)
\(\{ b_{ij} \}\)
(3) Digit Capsules
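# wrapped in a function so that `return` is valid;
# the default of 3 iterations is an assumption (the value used in the paper)
def route(cap_predictions, priors, routing_iterations=3):
    for i in range(routing_iterations):
        with tf.variable_scope('routing_{}'.format(i)):
            # softmax along "digits" axis
            c = tf.nn.softmax(priors, dim=1)
            # reshape to multiply with predictions
            c_t = tf.expand_dims(c, axis=-1)
            # weighted sum of predictions over primary capsules
            s_t = cap_predictions * c_t
            s = tf.reduce_sum(s_t, axis=2)
            digit_caps = squash(s)
            # agreement : dot product of predictions with capsule outputs
            delta_priors = tf.reduce_sum(
                cap_predictions * tf.expand_dims(digit_caps, 2), -1)
            priors = priors + delta_priors
    return digit_caps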
Margin Loss
\(L_c = T_c \max(0, m^+ - ||v_c||)^2 + \lambda (1 - T_c) \max(0, ||v_c|| - m^-)^2\)
- Digit Capsule of class \(c\) must have long instantiation vector \(v_c\)
- iff digit \(c\) is present in the image
- \(T_c\) = 1 if digit of class \(c\) is present
- \(m^+\) = 0.9, \(m^-\) = 0.1
- Down-weight the loss due to absent digit classes by \(\lambda\) = 0.5
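A worked example of the margin loss in NumPy, using \(m^+\) = 0.9, \(m^-\) = 0.1 and \(\lambda\) = 0.5. The capsule lengths below are invented for illustration.

```python
import numpy as np

def margin_loss(v_norm, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    """v_norm: capsule lengths [batch, classes]; T: one-hot targets, same shape."""
    present = T * np.square(np.maximum(0.0, m_pos - v_norm))
    absent = lam * (1.0 - T) * np.square(np.maximum(0.0, v_norm - m_neg))
    return np.sum(present + absent, axis=-1)   # sum L_c over classes

T = np.array([[0.0, 1.0, 0.0]])                # class 1 is present

# Confident and correct: long vector for class 1, short vectors elsewhere
good = margin_loss(np.array([[0.05, 0.95, 0.08]]), T)

# Wrong: class 1 too short (0.4), class 0 too long (0.6)
bad = margin_loss(np.array([[0.6, 0.4, 0.1]]), T)
```

For `bad`, the present class contributes (0.9 - 0.4)^2 = 0.25 and the absent class 0 contributes 0.5 * (0.6 - 0.1)^2 = 0.125, giving 0.375. `good` incurs no loss because every length is already on the right side of its margin.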
Margin Loss
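# length of each digit capsule : [batch, 10]
# (assumed definition; digit_caps_norm is not defined on this slide)
digit_caps_norm = tf.norm(digit_caps, axis=-1)
# positives : _y holds the one-hot labels
pos_loss = tf.maximum(0.,
                      0.9 - tf.reduce_sum(digit_caps_norm * _y,
                                          axis=1))
# squared hinge, averaged over the batch
pos_loss = tf.reduce_mean(tf.square(pos_loss))
# negatives
y_negs = 1. - _y
neg_loss = tf.maximum(0., digit_caps_norm * y_negs - 0.1)
# sum over absent classes, down-weighted by lambda = 0.5
neg_loss = tf.reduce_sum(tf.square(neg_loss), axis=-1) * 0.5
neg_loss = tf.reduce_mean(neg_loss)
margin_loss = pos_loss + neg_loss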
Reconstruction Loss
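# assumed import; the slide does not show where fully_connected comes from
from tensorflow.contrib.layers import fully_connected

# reconstruct original image with a 3-layered MLP
# (target_cap is not defined on this slide)
def reconstruct(target_cap):
    with tf.name_scope('reconstruct'):
        fc = fully_connected(target_cap, 512)
        fc = fully_connected(fc, 1024)
        fc = fully_connected(fc, 784, activation_fn=None)
        out = tf.sigmoid(fc)
    return out

# sum of squared pixel errors, averaged over the batch
reconstruct_loss = tf.reduce_mean(tf.reduce_sum(
    tf.square(_x - reconstruct(target_cap)), axis=-1))
# scale the reconstruction loss down so it does not dominate the margin loss
total_loss = pos_loss + neg_loss + 0.0005 * reconstruct_loss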
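The slides do not show how `target_cap` is built. One plausible construction, following the masking described in the Dynamic Routing Between Capsules paper, keeps only the capsule of the true class and zeroes the rest before flattening. The NumPy sketch below is an assumption about that step, with made-up labels and random capsule values.

```python
import numpy as np

batch, num_digits, digit_capsule_dim = 2, 10, 16
rng = np.random.default_rng(0)

digit_caps = rng.normal(size=(batch, num_digits, digit_capsule_dim))
labels = np.array([3, 7])                      # hypothetical true classes

# one-hot mask over the digit axis : [batch, 10]
one_hot = np.eye(num_digits)[labels]

# zero every capsule except the one for the true class
masked = digit_caps * one_hot[:, :, None]

# flatten to [batch, 160] as input for the reconstruction MLP
target_cap = masked.reshape(batch, num_digits * digit_capsule_dim)
```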
Pros
- Fewer training samples needed
- Equivariance
- Overlapping objects or Crowded Scenes
- Interpretable Activation Vectors
Cons
- Still young : not tested on larger images
- Training is slow due to the routing algorithm
Resources
- Dynamic Routing Between Capsules
- Implementations
- Blogs
- Nick Bourdakos
- Max Pechyonkin, Part I, Part II, Part III
- Soham Chatterjee
- Videos
- Geoff Hinton, Does the Brain do Inverse Graphics?
- Aurélien Géron, Capsule Networks
Thank You!