# input image
_x = tf.placeholder(tf.float32, [None, 784])
# reshape image for convolution
x = tf.reshape(_x, [-1, 28, 28, 1])
# first layer of convolution
with tf.variable_scope('conv1'):
    # create 256 filters of kernel size 9x9
    w = tf.get_variable('w', shape=[9, 9, 1, 256], dtype=tf.float32,
                        initializer=tf.contrib.layers.xavier_initializer())
    # stride = 1
    conv1 = tf.nn.conv2d(x, w, [1, 1, 1, 1], padding='VALID', name='conv1')
    # relu activation
    conv1 = tf.nn.relu(conv1)
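With a 9x9 kernel, VALID padding and stride 1, the 28x28 input comes out of conv1 as a 20x20x256 feature map, which is exactly what the primary capsule layer below expects. A quick sanity check on the static shape:

print(conv1.get_shape().as_list())  # expect [None, 20, 20, 256]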
In convolutional capsule layers, each unit in a capsule is a convolutional unit, so each capsule outputs a grid of vectors rather than a single vector. Here, a 9x9 stride-2 convolution over the 20x20 conv1 feature map gives a 6x6 grid for each of 32 capsule types, i.e. 6x6x32 = 1152 primary capsules, each an 8-D vector.
with tf.variable_scope('primary_caps'):
    # 9x9 filters, 32*8=256 channels, stride=2
    primary_capsules = tf.contrib.slim.conv2d(inputs=conv1,
                                              num_outputs=32*8,
                                              kernel_size=9,
                                              stride=2,
                                              padding='VALID',
                                              activation_fn=None)
    # apply "squash" non-linearity
    primary_capsules = squash(primary_capsules)
    # primary capsules : 32 x [6x6] grids of 8-D vectors
    num_capsules = 32*6*6
    primary_capsule_dim = 8
    # reshape primary capsules for calculating prediction vectors
    primary_capsules = tf.reshape(primary_capsules,
                                  [-1, 1, num_capsules, 1, primary_capsule_dim])
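The squash function isn't defined in this snippet. A minimal sketch of the non-linearity from the paper, assuming it operates along the last axis of its input:

def squash(s, axis=-1, eps=1e-8):
    # squash(s) = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)
    squared_norm = tf.reduce_sum(tf.square(s), axis=axis, keep_dims=True)
    norm = tf.sqrt(squared_norm + eps)
    return (squared_norm / (1. + squared_norm)) * (s / norm)

The small eps keeps the division stable when a capsule's output is all zeros.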
# next capsule layer (digit capsules) : [10, 16]
num_digits = 10
digit_capsule_dim = 16
# weight matrix
Wij = tf.get_variable('Wij',
                      [num_digits, num_capsules, primary_capsule_dim, digit_capsule_dim],
                      dtype=tf.float32)
# tile primary capsules for multiplication with weight matrix
tiled_prim_caps = tf.tile(primary_capsules, [1, num_digits, 1, 1, 1])
# yeah.. we need a loop :(
# help me fix this!
cap_predictions = tf.scan(lambda _, x: tf.matmul(x, Wij),  # fn
                          tiled_prim_caps,                  # elements
                          initializer=tf.zeros([num_digits, num_capsules, 1, digit_capsule_dim]))
# squeeze dummy dimensions
cap_predictions = tf.squeeze(cap_predictions, [3])
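The tf.scan above walks over the batch one example at a time. One possible alternative, assuming tf.einsum in your TensorFlow version supports this contraction, is to let einsum do the per-capsule matrix multiplies on the un-tiled capsules directly:

# u_hat[b, d, n, :] = primary_capsules[b, n, :] x Wij[d, n, :, :]
prim_caps_flat = tf.reshape(primary_capsules,
                            [-1, num_capsules, primary_capsule_dim])
cap_predictions_alt = tf.einsum('bnp,dnpq->bdnq', prim_caps_flat, Wij)
# same shape as cap_predictions: [batch, num_digits, num_capsules, digit_capsule_dim]

This skips both the tiling and the explicit loop.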
# { b_ij } log prior probabilities
priors = tf.get_variable('log_priors',
                         [num_digits, num_capsules],
                         initializer=tf.zeros_initializer())
# expand to support batch dimension
priors = tf.expand_dims(priors, axis=0)
routing_iterations = 3  # the paper uses 3 iterations of routing
for i in range(routing_iterations):
    with tf.variable_scope('routing_{}'.format(i)):
        # softmax along "digits" axis gives the coupling coefficients c_ij
        c = tf.nn.softmax(priors, dim=1)
        # reshape to multiply with predictions
        c_t = tf.expand_dims(c, axis=-1)
        s_t = cap_predictions * c_t
        # weighted sum over the primary capsules
        s = tf.reduce_sum(s_t, axis=2)
        digit_caps = squash(s)
        # agreement between predictions and outputs updates the priors
        delta_priors = tf.reduce_sum(
            cap_predictions * tf.expand_dims(digit_caps, 2), -1)
        priors = priors + delta_priors
return digit_caps
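The margin loss below works on the length of each digit capsule vector and on the one-hot labels. Neither digit_caps_norm nor _y appears in the snippet above; a minimal sketch, assuming digit_caps has shape [batch, 10, 16]:

# one-hot labels for the 10 digit classes
_y = tf.placeholder(tf.float32, [None, 10])
# length of each digit capsule vector -> [batch, 10]
digit_caps_norm = tf.sqrt(tf.reduce_sum(tf.square(digit_caps), axis=-1) + 1e-8)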
# positives
pos_loss = tf.maximum(0.,
                      0.9 - tf.reduce_sum(digit_caps_norm * _y,
                                          axis=1))
# square the positive loss and average over the batch
pos_loss = tf.reduce_mean(tf.square(pos_loss))
# negatives
y_negs = 1. - _y
neg_loss = tf.maximum(0., digit_caps_norm * y_negs - 0.1)
# down-weight the negative term by 0.5, as in the paper
neg_loss = tf.reduce_sum(tf.square(neg_loss), axis=-1) * 0.5
neg_loss = tf.reduce_mean(neg_loss)
margin_loss = pos_loss + neg_loss
# reconstruct original image with a 3-layered MLP
from tensorflow.contrib.layers import fully_connected

def reconstruct(target_cap):
    with tf.name_scope('reconstruct'):
        fc = fully_connected(target_cap, 512)
        fc = fully_connected(fc, 1024)
        fc = fully_connected(fc, 784, activation_fn=None)
        out = tf.sigmoid(fc)
        return out
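reconstruct is fed the 16-D activity vector of the capsule belonging to the true class. target_cap isn't defined above; a minimal sketch of masking out every other capsule, assuming _y is the one-hot label tensor:

# digit_caps : [batch, 10, 16], _y : [batch, 10]
# keep only the capsule of the true digit -> [batch, 16]
target_cap = tf.reduce_sum(digit_caps * tf.expand_dims(_y, axis=-1), axis=1)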
reconstruct_loss = tf.reduce_mean(tf.reduce_sum(
    tf.square(_x - reconstruct(target_cap)), axis=-1))
# scale down the reconstruction loss so it does not dominate the margin loss
total_loss = margin_loss + 0.0005 * reconstruct_loss
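From here, training is just a matter of minimizing total_loss. A minimal sketch using the Adam optimizer (the paper uses Adam with TensorFlow's default parameters) and a standard accuracy op:

# Adam with default hyperparameters
train_op = tf.train.AdamOptimizer().minimize(total_loss)
# prediction = digit capsule with the largest length
predictions = tf.argmax(digit_caps_norm, axis=1)
correct = tf.equal(predictions, tf.argmax(_y, axis=1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))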