# re: Capsule Net

## Who Am I?

• AI Research Engineer, SAAMA Tech.
• I work in NLU
• I am a Free Software Evangelist
• I am interested in
• AGI
• Community Networks
• Data Visualization
• Generative Art
• Creative Coding
• And I have a blog

# Motivation

## CNN

• Layers of feature extraction followed by Local Pooling
• Precise information about position is thrown away
• in exchange for location invariance
• $$p(y | x)$$
• Pooling (average or max) over the outputs of replicated feature detectors
• Single output to next layer

## Why are CNNs doomed?

• Sub-sampling loses information about position
• precise spatial relationships among high-level features
• Cannot extrapolate understanding to new viewpoints

## Invariance

• Eliminate information about lighting, viewpoint, etc.
• sub-sampling is misguided
• Knowledge of the shape of the object should be invariant
• stored in the weights

## Equivariance

• Neural activities change with change in viewpoint
• but knowledge of the entity remains constant
• Know that it's the same shape
• and notice the change in viewpoint
• Weights are invariant
• Neural Activities are equivariant

# Representation

## Hierarchy of Parts

• From the pose of the mouth $$T_i$$
• we can predict the pose of the face
• $$p_i$$, probability that mouth exists
• From the pose of the nose $$T_h$$
• we can predict the pose of the face
• If $$T_iT_{ij} \approx T_hT_{hj}$$
• mouth and nose are related correctly to make a face
• We need pose information of entities
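The agreement test $$T_iT_{ij} \approx T_hT_{hj}$$ can be tried out with small pose matrices. The sketch below is only an illustration: it assumes 2D rigid poses written as 3x3 homogeneous matrices, and the helper `pose` and all numeric values are hypothetical, not taken from the slides.

```python
import numpy as np

def pose(angle, tx, ty):
    """2D rigid transform (rotation + translation) as a 3x3 homogeneous matrix."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

# Hypothetical fixed part-to-whole relations (what the weights would store)
T_ij = pose(0.0, 0.0, 2.0)   # mouth -> face
T_hj = pose(0.0, 0.0, 0.5)   # nose  -> face

# A face seen from some viewpoint, and the part poses consistent with it
face = pose(0.3, 5.0, 1.0)
T_i = face @ np.linalg.inv(T_ij)   # pose of the mouth
T_h = face @ np.linalg.inv(T_hj)   # pose of the nose

# Both parts predict the same face pose, so they agree
agree = np.allclose(T_i @ T_ij, T_h @ T_hj)

# A nose in an unrelated pose predicts a different face, so they disagree
T_h_wrong = pose(1.5, -3.0, 4.0)
disagree = np.allclose(T_i @ T_ij, T_h_wrong @ T_hj)

print(agree, disagree)
```

Changing the viewpoint (`face`) changes `T_i` and `T_h`, but `T_ij` and `T_hj` stay the same, which is the sense in which the weights are invariant while the activities are equivariant.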

## Capsules

• Performs internal computations
• Encapsulates information as a vector

## Capsules

• Each capsule should learn to detect a fragment
• Capsule outputs
• $$(x,y)$$ : coordinates of the fragment
• $$i$$ : intensity of the fragment
• Train capsules by reconstructing the original image
• based on contributions from individual capsules
• Autoencoder

# Capsule Dynamics

## Output of Capsule

• Output of a capsule is a vector
• Length or magnitude of vector
• probability that entity represented by capsule is present
• "Squash" Non-linearity
• shrinks short vectors to almost zero length and long vectors to a length just below 1, keeping their direction
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2}\frac{s_j}{\|s_j\|}$$
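A minimal NumPy sketch of the squash non-linearity, to show its effect on short and long vectors. The `eps` term is an assumption added here to avoid dividing by zero; it is not part of the formula on the slide.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Rescale vectors along `axis` so that their length lies in [0, 1)."""
    sq_norm = np.sum(np.square(s), axis=axis, keepdims=True)
    norm = np.sqrt(sq_norm + eps)          # eps guards against zero-length input
    return (sq_norm / (1.0 + sq_norm)) * (s / norm)

short_vec = squash(np.array([0.1, 0.0]))    # length 0.1 before squashing
long_vec = squash(np.array([30.0, 40.0]))   # length 50 before squashing

print(np.linalg.norm(short_vec), np.linalg.norm(long_vec))
```

The short vector ends up with a length close to 0 and the long one with a length close to 1, while both keep their original direction.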

## Input to Capsule

• Layer $$l$$'s output vectors $$\{u^l_i\}$$
• Prediction Vectors
• $$u_{j|i} = W_{ij}u_i$$
• Total input to capsule $$j$$ in layer $$(l+1)$$
• weighted sum over prediction vectors
• $$s_j = \sum_{i} c_{ij}u_{j|i}$$
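The two steps above (prediction vectors, then the weighted sum) can be sketched in NumPy. The sizes below are illustrative assumptions, and the coupling coefficients are simply set to uniform values rather than computed by routing.

```python
import numpy as np

rng = np.random.default_rng(0)
num_in, num_out, d_in, d_out = 3, 2, 4, 5   # illustrative sizes only

u = rng.normal(size=(num_in, d_in))                   # u_i, outputs of layer l
W = rng.normal(size=(num_in, num_out, d_out, d_in))   # W_ij, one matrix per (i, j)
c = np.full((num_in, num_out), 1.0 / num_out)         # uniform c_ij for this sketch

# Prediction vectors: u_hat[i, j] = W_ij @ u_i
u_hat = np.einsum('ijkl,il->ijk', W, u)

# Total input to capsule j: s_j = sum_i c_ij * u_hat[i, j]
s = np.einsum('ij,ijk->jk', c, u_hat)

print(u_hat.shape, s.shape)
```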

## Coupling Coefficients

• $$\{c_{ij}\}$$ computed by an iterative dynamic routing process
• Strength of link between capsules $$i$$ and $$j$$
• $$i$$ : lower layer capsule
• $$j$$ : higher layer capsule
• Note
• $$\sum_j c_{ij} = 1$$

## Routing Softmax

• Log prior probabilities, $$\{b_{ij}\}$$
• for coupling between capsule $$i$$, $$j$$
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_{k}\exp(b_{ik})}$$
• Initial coupling coefficients are iteratively refined
• agreement between capsules $$i, j$$
• output vector of capsule $$j$$, $$v_j$$
• prediction by capsule $$i$$, $$u_{j|i}$$
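As a small illustration, the coupling coefficients can be computed as a softmax over the higher-layer capsules $$j$$, so that $$\sum_j c_{ij} = 1$$ for every lower-layer capsule $$i$$. The function name `routing_softmax` and the max-shift for numerical stability are assumptions of this sketch.

```python
import numpy as np

def routing_softmax(b):
    """Softmax over j (columns) for each lower-layer capsule i (rows)."""
    e = np.exp(b - b.max(axis=1, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

# Zero log priors give uniform coupling
b = np.zeros((4, 3))
c = routing_softmax(b)

# Raising one logit shifts coupling of capsule i=0 toward capsule j=1
b2 = b.copy()
b2[0, 1] = 2.0
c2 = routing_softmax(b2)

print(c[0], c2[0])
```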

## Agreement between Capsules

• Scalar Product, $$a_{ij} = v_j \cdot u_{j|i}$$
• Add agreement to initial logits
• $$b_{ij} = b_{ij} + a_{ij}$$

# Routing Algorithm

## Routing Algorithm

• Set priors $$b_{ij}$$ to zero
• For each routing iteration
  • For every capsule $$i$$ in layer $$(l)$$
    • Calculate coupling coefficients from priors
    • $$c_i = softmax(b_i)$$
  • For every capsule $$j$$ in layer $$(l+1)$$
    • Prediction Vectors, $$u_{j|i}$$
    • Total Input, $$s_j = \sum_i c_{ij} u_{j|i}$$
    • Capsule output, $$v_j = squash(s_j)$$
  • For every capsule $$i$$ in layer $$(l)$$ and every capsule $$j$$ in layer $$(l+1)$$
    • $$b_{ij} = b_{ij} + u_{j|i} \cdot v_j$$
• Return capsule output, $$v_j$$
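The routing steps can be put together in a short NumPy sketch. This is a stand-in for illustration, not the TensorFlow code from the talk; the function name `route`, the toy prediction vectors, and the choice of three iterations are assumptions.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    sq_norm = np.sum(np.square(s), axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(u_hat, num_iterations=3):
    """u_hat: prediction vectors u_{j|i}, shape [num_in, num_out, dim]."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                      # set priors to zero
    for _ in range(num_iterations):
        c = softmax(b, axis=1)                           # c_i = softmax(b_i)
        s = np.sum(c[:, :, None] * u_hat, axis=0)        # s_j = sum_i c_ij u_{j|i}
        v = squash(s)                                    # v_j = squash(s_j)
        b = b + np.sum(u_hat * v[None, :, :], axis=-1)   # b_ij += u_{j|i} . v_j
    return v, c

# Toy case: three lower capsules, two higher capsules, 2D vectors.
# All predictions for capsule j=0 agree; predictions for j=1 conflict.
u_hat = np.zeros((3, 2, 2))
u_hat[:, 0] = [1.0, 0.0]
u_hat[0, 1] = [1.0, 0.0]
u_hat[1, 1] = [-1.0, 0.0]
u_hat[2, 1] = [0.0, 1.0]

v, c = route(u_hat)
print(np.linalg.norm(v, axis=1))
print(c)
```

In this toy case the capsule whose incoming predictions agree ends up with the longer output vector and the larger coupling coefficients.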

# Architecture

## (1) Convolution Layer

• [28x28] pixel image as input
• 256 kernels of size [9x9]
• stride = 1
• ReLU activation
• Output: [20x20x256] volume
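The 20x20 output size follows from the usual formula for a convolution without padding. A small sketch, assuming no padding ('VALID'); the helper name `conv_output_size` is hypothetical:

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial output size of a convolution without padding ('VALID')."""
    return (input_size - kernel_size) // stride + 1

conv1_size = conv_output_size(input_size=28, kernel_size=9, stride=1)
print(conv1_size)  # 20
```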

## (1) Convolution Layer

    import tensorflow as tf  # TensorFlow 1.x (tf.placeholder, tf.contrib)

    # input image
    _x = tf.placeholder(tf.float32, [None, 784])

    # reshape image for convolution
    x = tf.reshape(_x, [-1, 28, 28, 1])

    # first layer of convolution
    with tf.variable_scope('conv1'):
        # create 256 filters of kernel size 9x9
        w = tf.get_variable('w', shape=[9, 9, 1, 256], dtype=tf.float32,
                            initializer=tf.contrib.layers.xavier_initializer())
        # stride = 1, no padding: 28x28 -> 20x20
        conv1 = tf.nn.conv2d(x, w, [1, 1, 1, 1], padding='VALID', name='conv1')

    # relu activation
    conv1 = tf.nn.relu(conv1)

## (2) Primary Capsules

• Convolutional Capsule Layer with "squash" as non-linearity
• 32 channels of 8D capsules
• 32 × 8 = 256 kernels of size [9x9]
• stride = 2
• no activation before "squash"
• 32 [6x6x8] volumes
• 32 × 6 × 6 = 1152 8D vectors

In convolutional capsule layers each unit in a capsule is a convolutional unit. Therefore, each capsule will output a grid of vectors rather than a single vector output.
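To make the "grid of vectors" concrete, here is a NumPy sketch that turns one image's convolution output into a list of 8D capsule vectors. It assumes that each group of 8 consecutive feature maps forms one capsule; the slides do not state the grouping, so treat that layout as an assumption.

```python
import numpy as np

grid = (20 - 9) // 2 + 1        # kernel 9, stride 2, no padding: 20 -> 6
channels, caps_dim = 32, 8

# Stand-in for one image's convolution output: [6, 6, 32 * 8]
conv_out = np.arange(grid * grid * channels * caps_dim, dtype=np.float32)
conv_out = conv_out.reshape(grid, grid, channels * caps_dim)

# Split the 256 feature maps into 32 capsule channels of 8 units each,
# then flatten grid positions and channels into one list of capsule vectors
capsules = conv_out.reshape(grid, grid, channels, caps_dim)
capsules = capsules.reshape(-1, caps_dim)

print(capsules.shape)  # (1152, 8)
```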

## (2) Primary Capsules

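    with tf.variable_scope('primary_caps'):
        # 9x9 filters, 32*8=256 channels, stride=2
        # 'VALID' padding gives the 6x6 grid (slim's default 'SAME' would give 10x10)
        primary_capsules = tf.contrib.slim.conv2d(inputs=conv1,
                                                  num_outputs=32*8,
                                                  kernel_size=9,
                                                  stride=2,
                                                  padding='VALID',
                                                  activation_fn=None)
        # group the 256 channels into 8D capsule vectors: [batch, 32*6*6, 8]
        primary_capsules = tf.reshape(primary_capsules, [-1, 32*6*6, 8])
        # apply "squash" non-linearity (squash is assumed to act on the last axis)
        primary_capsules = squash(primary_capsules)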

## Prediction Vectors

$$u_{j|i} = W_{ij}u_i$$

    # primary capsules : 32 x [6x6]
    num_capsules = 32*6*6
    primary_capsule_dim = 8
    # reshape primary capsules for calculating prediction vectors
    primary_capsules = tf.reshape(primary_capsules,
                                  [-1, 1, num_capsules, 1, primary_capsule_dim])
    # next capsule layer (digit capsules) : [10, 16]
    num_digits = 10
    digit_capsule_dim = 16

## Prediction Vectors

    # weight matrix
    Wij = tf.get_variable('Wij',
                          [num_digits, num_capsules, primary_capsule_dim, digit_capsule_dim],
                          dtype=tf.float32)
    # tile primary capsules for multiplication with weight matrix
    tiled_prim_caps = tf.tile(primary_capsules, [1, num_digits, 1, 1, 1])
    # yeah.. we need a loop :(
    #  help me fix this!
    # scan over the batch: each x is [num_digits, num_capsules, 1, primary_capsule_dim]
    cap_predictions = tf.scan(lambda _, x: tf.matmul(x, Wij),  # fn
                              tiled_prim_caps,                 # elements
                              initializer=tf.zeros([num_digits, num_capsules, 1, digit_capsule_dim]))
    # squeeze only the dummy dimension (a bare squeeze would also drop a batch of size 1)
    cap_predictions = tf.squeeze(cap_predictions, axis=3)
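One loop-free way to express the same product is an einsum over the capsule and digit axes. The sketch below uses NumPy as a stand-in with random data and a batch size of 2 (both assumptions). TensorFlow also provides `tf.einsum`, but this pattern has not been checked against the TensorFlow version used in these slides.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_digits, num_capsules = 2, 10, 1152
primary_capsule_dim, digit_capsule_dim = 8, 16

# Stand-ins for the primary capsule outputs and the weight matrices
u = rng.normal(size=(batch, num_capsules, primary_capsule_dim))
W = rng.normal(size=(num_digits, num_capsules, primary_capsule_dim, digit_capsule_dim))

# cap_predictions[b, j, i] = u[b, i] @ W[j, i] for every b, j, i at once
cap_predictions = np.einsum('bik,jikd->bjid', u, W)

print(cap_predictions.shape)  # (2, 10, 1152, 16)
```

The result has the same layout as the squeezed `cap_predictions` above: batch, digit capsule, primary capsule, digit capsule dimension.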



## Log Priors

    # { b_ij } log prior probabilities
    priors = tf.get_variable('log_priors',
                             [num_digits, num_capsules],
                             initializer=tf.zeros_initializer())

    # expand to support batch dimension
    priors = tf.expand_dims(priors, axis=0)


$$\{ b_{ij} \}$$

## (3) Digit Capsules

    # (excerpt from the body of the routing function)
    for i in range(routing_iterations):
        with tf.variable_scope('routing_{}'.format(i)):
            # softmax along "digits" axis
            c = tf.nn.softmax(priors, dim=1)
            # reshape coupling coefficients to multiply with predictions
            c_t = tf.expand_dims(c, axis=-1)
            s_t = cap_predictions * c_t
            # weighted sum over primary capsules: [batch, num_digits, digit_capsule_dim]
            s = tf.reduce_sum(s_t, axis=2)
            digit_caps = squash(s)
            # agreement: dot product of each prediction with the capsule output
            delta_priors = tf.reduce_sum(
                cap_predictions * tf.expand_dims(digit_caps, 2), -1)
            priors = priors + delta_priors

    return digit_caps

## Margin Loss

$$L_c = T_c \max(0, m^+ - \|v_c\|)^2 + \lambda (1 - T_c) \max(0, \|v_c\| - m^-)^2$$
• Digit Capsule of class $$c$$ must have long instantiation vector $$v_c$$
• iff digit $$c$$ is present in the image
• $$T_c$$ = 1 if digit of class $$c$$ is present
• $$m^+$$ = 0.9, $$m^-$$ = 0.1
• Downweight loss due to absent digit classes by $$\lambda$$ ($$\lambda$$ = 0.5)
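A worked NumPy example of the margin loss, using $$m^+$$ = 0.9 and $$m^-$$ = 0.1 from the slide and $$\lambda$$ = 0.5. The function name `margin_loss` and the capsule lengths below are made up for illustration.

```python
import numpy as np

def margin_loss(v_norm, t, m_pos=0.9, m_neg=0.1, lam=0.5):
    """v_norm: capsule lengths [batch, classes]; t: one-hot targets, same shape."""
    pos = t * np.maximum(0.0, m_pos - v_norm) ** 2
    neg = lam * (1.0 - t) * np.maximum(0.0, v_norm - m_neg) ** 2
    return np.sum(pos + neg, axis=1)   # sum of L_c over classes, per example

t = np.array([[0.0, 1.0, 0.0]])        # class 1 is present

# Confident and correct: long vector for class 1, short vectors elsewhere
good = margin_loss(np.array([[0.05, 0.95, 0.05]]), t)

# Present class too short (0.4) and one absent class too long (0.6)
bad = margin_loss(np.array([[0.6, 0.4, 0.1]]), t)

print(good, bad)
```

For the second case the present class contributes (0.9 - 0.4)^2 = 0.25 and the over-long absent class contributes 0.5 * (0.6 - 0.1)^2 = 0.125, giving 0.375 in total; the first case incurs no loss.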

## Margin Loss

    # assumed inputs: digit_caps_norm = lengths of digit capsules [batch, 10],
    #                 _y = one-hot labels [batch, 10]

    # positives
    pos_loss = tf.maximum(0.,
                          0.9 - tf.reduce_sum(digit_caps_norm * _y,
                                              axis=1))
    # squared, then averaged over the batch
    pos_loss = tf.reduce_mean(tf.square(pos_loss))

    # negatives
    y_negs = 1. - _y
    neg_loss = tf.maximum(0., digit_caps_norm * y_negs - 0.1)
    # sum over classes, downweighted by lambda = 0.5
    neg_loss = tf.reduce_sum(tf.square(neg_loss), axis=-1) * 0.5
    neg_loss = tf.reduce_mean(neg_loss)

    margin_loss = pos_loss + neg_loss

## Reconstruction Loss

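    from tensorflow.contrib.layers import fully_connected

    # reconstruct original image with a 3-layered MLP
    # target_cap: output vector of the target digit capsule (defined elsewhere)
    def reconstruct(target_cap):
        with tf.name_scope('reconstruct'):
            fc = fully_connected(target_cap, 512)
            fc = fully_connected(fc, 1024)
            fc = fully_connected(fc, 784, activation_fn=None)
            out = tf.sigmoid(fc)
            return out

    # sum of squared pixel differences, averaged over the batch
    reconstruct_loss = tf.reduce_mean(tf.reduce_sum(
        tf.square(_x - reconstruct(target_cap)), axis=-1))

    # scale down the reconstruction loss so it does not dominate the margin loss
    total_loss = pos_loss + neg_loss + 0.0005 * reconstruct_loss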

## Pros

• Fewer Training Samples
• Equivariance
• Overlapping objects or Crowded Scenes
• Interpretable Activation Vectors

## Cons

• Still young: not yet tested on larger images
• Training is slow due to the routing algorithm