re: Capsule Net

Suriyadeepan Ramamoorthy

Who Am I?

  • AI Research Engineer, SAAMA Tech.
  • I work in NLU
  • I am a Free Software Evangelist
  • I am interested in
    • AGI
    • Community Networks
    • Data Visualization
    • Generative Art
    • Creative Coding
  • And I have a blog

Motivation

CNN

  • Layers of feature extraction followed by Local Pooling
  • Precise information about position is thrown away
    • in exchange for location invariance
    • \(p(y | x)\)
  • Pooling (max or average) over the outputs of replicated feature detectors
  • Single output to next layer

Why are CNNs doomed?

  • Sub-sampling loses information about position
    • precise spatial relationship among high level features
  • Cannot extrapolate understanding to new viewpoints

Invariance

  • Eliminate information about lighting, viewpoint, etc.
    • sub-sampling is misguided
  • Knowledge of the shape of the object should be invariant
    • stored in the weights

Equivariance

  • Neural activities change with change in viewpoint
    • but knowledge of the entity remains constant
  • Know that it's the same shape
    • and notice the change in viewpoint
  • Weights are invariant
  • Neural Activities are equivariant

Representation

Hierarchy of Parts

  • From the pose of the mouth \(T_i\)
    • we can predict the pose of the face
    • \(p_i\), probability that mouth exists
  • From the pose of the nose \(T_h\)
    • we can predict the pose of the face
  • If \(T_iT_{ij} \approx T_hT_{hj}\)
    • mouth and nose are related correctly to make a face
  • We need pose information of entities
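The agreement test above can be sketched with small pose matrices. This is only an illustration: the 2D homogeneous poses, the `pose` helper, and all numeric values are assumptions, not something taken from the slides.

```python
import numpy as np

def pose(theta, tx, ty):
    """2D pose as a 3x3 homogeneous matrix (rotation theta, translation tx, ty)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0., 0., 1.]])

# Hypothetical face pose and fixed part-to-face relations T_ij (mouth) and T_hj (nose).
T_face = pose(0.3, 5.0, 2.0)
T_ij = pose(0.0, 0.0, 1.5)
T_hj = pose(0.0, 0.0, -0.5)

# Part poses that are consistent with the face pose, following T_i T_ij = T_face.
T_i = T_face @ np.linalg.inv(T_ij)
T_h = T_face @ np.linalg.inv(T_hj)

# Both parts predict the same face pose, so they agree.
agree = np.allclose(T_i @ T_ij, T_h @ T_hj)

# A nose in the wrong place predicts a different face pose.
T_h_bad = pose(1.2, -3.0, 4.0)
disagree = not np.allclose(T_i @ T_ij, T_h_bad @ T_hj)
```

When the two predictions match, the parts vote for the same face; when they do not, the higher-level entity should not be activated.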

Capsules

  • Performs internal computations
  • Encapsulates information as a vector

Capsules

  • Each capsule should learn to detect a fragment
  • Capsule outputs
    • \((x,y)\) : coordinates of the fragment
    • \(i\) : intensity of the fragment
  • Train capsules by reconstructing the original image
    • based on contributions from individual capsules
    • Autoencoder

Capsule Dynamics

Output of Capsule

  • Output of a capsule is a vector
  • Length or magnitude of vector
    • probability that entity represented by capsule is present
  • "Squash" Non-linearity
    • encourages this feature
\(v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2}\frac{s_j}{\|s_j\|}\)
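A minimal NumPy sketch of the squash non-linearity, assuming capsule vectors lie along the last axis. The `eps` term is an added assumption to avoid division by zero for zero-length inputs; it is not part of the formula.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Shrink short vectors towards length 0 and long vectors towards length 1."""
    sq_norm = np.sum(np.square(s), axis=axis, keepdims=True)
    norm = np.sqrt(sq_norm + eps)  # eps is an assumption for numerical safety
    return (sq_norm / (1.0 + sq_norm)) * (s / norm)

short = squash(np.array([0.1, 0.0]))   # length close to 0
long = squash(np.array([10.0, 0.0]))   # length close to, but below, 1
```

The direction of the vector is preserved; only its length is rescaled so that it can be read as a probability.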

Input to Capsule

  • Layer \(l\)'s output vectors \(\{u^l_i\}\)
  • Prediction Vectors 
    • \(u_{j|i} = W_{ij}u_i\)
  • Total input to capsule \(j\) in layer \((l+1)\)
    • weighted sum over prediction vectors 
    • \(s_j = \sum_{i} c_{ij}u_{j|i}\)
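The two formulas above can be written as a pair of `einsum` calls. This is a sketch with made-up sizes; the array layout (`W` indexed as `[i, j, out, in]`) and the uniform coupling coefficients are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_in, num_out, dim_in, dim_out = 6, 3, 8, 16  # illustrative sizes only

u = rng.normal(size=(num_in, dim_in))                    # u_i
W = rng.normal(size=(num_in, num_out, dim_out, dim_in))  # W_ij

# Prediction vectors: u_{j|i} = W_ij u_i
u_hat = np.einsum('ijdk,ik->ijd', W, u)   # shape (num_in, num_out, dim_out)

# Uniform coupling coefficients, one row per lower-layer capsule i
c = np.full((num_in, num_out), 1.0 / num_out)

# Total input: s_j = sum_i c_ij u_{j|i}
s = np.einsum('ij,ijd->jd', c, u_hat)     # shape (num_out, dim_out)
```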

Coupling Coefficients

  • \(\{c_{ij}\}\) computed by iterative dynamic routing process
  • Strength of link between capsules \(i\) and \(j\)
    • \(i\) : lower layer capsule
    • \(j\) : higher layer capsule
  • Note
    • \( \sum_j c_{ij} = 1 \)

Routing Softmax

  • Log prior probabilities, \(\{b_{ij}\}\)
    • for coupling between capsule \(i\), \(j\)
\(c_{ij} = \frac{\exp(b_{ij})}{\sum_{k}\exp(b_{ik})}\)
  • Initial coupling coefficients are iteratively refined
    • agreement between capsules \(i, j\)
      • output vector of capsule \(j\), \(v_j\)
      • prediction by capsule \(i\), \(u_{j|i}\)
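A small sketch of the routing softmax, assuming the logits are stored as a matrix `b[i, j]`. It uses the standard softmax, whose denominator is a sum of exponentials over the higher-layer capsules. The max-shift is a common numerical-stability trick added here as an assumption.

```python
import numpy as np

def routing_softmax(b):
    """c_ij = exp(b_ij) / sum_k exp(b_ik), normalised over higher-layer capsules j."""
    b = b - b.max(axis=1, keepdims=True)   # stability shift; result is unchanged
    e = np.exp(b)
    return e / e.sum(axis=1, keepdims=True)

# Zero log priors give uniform coupling: each of 3 parents gets 1/3.
b0 = np.zeros((4, 3))
c0 = routing_softmax(b0)

# Raising one logit shifts capsule 0's output towards parent 2 only.
b1 = b0.copy()
b1[0, 2] = 2.0
c1 = routing_softmax(b1)
```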

Agreement between Capsules

  • Scalar Product, \(a_{ij} = v_j \cdot u_{j|i}\)
  • Add agreement to initial logits
    • \(b_{ij} = b_{ij} + a_{ij}\)

Routing Algorithm

  • Set priors to zeros
  • For each routing iteration
    • For each capsule \(i\) in layer \(l\)
      • Calculate coupling coefficients from priors
      • \(c_i = softmax(b_i)\)
    • For each capsule \(j\) in layer \((l+1)\)
      • Prediction Vectors, \(u_{j|i}\)
      • Total Input, \( s_j = \sum_i c_{ij} u_{j|i}\)
      • Capsule output, \(v_j = squash(s_j)\)

Routing Algorithm

  • Set priors to zeros
  • For each routing iteration
    • ...
    • For each capsule \(i\) in layer \(l\)
      • For each capsule \(j\) in layer \((l+1)\)
        • \(b_{ij} = b_{ij} + u_{j|i} \cdot v_j\)
  • return capsule output, \(v_j\)
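Putting the steps together, here is a NumPy sketch of the routing loop for a single example. The function names, the default of 3 iterations, and the toy data are assumptions made for illustration; this is not the code used later in the slides.

```python
import numpy as np

def squash(s, eps=1e-9):
    sq = np.sum(s * s, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(u_hat, num_iterations=3):
    """u_hat: prediction vectors, shape (num_in, num_out, dim_out).
    Returns capsule outputs v (num_out, dim_out) and the couplings c
    (num_in, num_out) used in the final iteration."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                  # set priors to zeros
    for _ in range(num_iterations):
        c = softmax(b, axis=1)                       # c_i = softmax(b_i)
        s = np.einsum('ij,ijd->jd', c, u_hat)        # s_j = sum_i c_ij u_{j|i}
        v = squash(s)                                # v_j = squash(s_j)
        b = b + np.einsum('ijd,jd->ij', u_hat, v)    # b_ij += u_{j|i} . v_j
    return v, c

# Toy input: five lower capsules all predict the same vector for parent 0,
# while their predictions for parent 1 are small and random.
rng = np.random.default_rng(0)
u_hat = 0.1 * rng.normal(size=(5, 2, 4))
u_hat[:, 0, :] = np.array([1.0, 0.0, 0.0, 0.0])
v, c = route(u_hat)
```

In this toy case the couplings drift towards parent 0, because its output agrees with every prediction sent to it.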

Convolutional Capsule Net

Architecture

(1) Convolution Layer

  • [28x28] pixel image as input
  • 256 kernels of size [9x9]
  • stride = 1
  • ReLU activation
  • 20x20x256 volume
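The 20x20 output size follows from the usual formula for a convolution without padding. A tiny sketch (the helper name is hypothetical):

```python
def valid_conv_size(input_size, kernel_size, stride):
    """Output width/height of a convolution with 'VALID' (no) padding."""
    return (input_size - kernel_size) // stride + 1

# 28x28 input, 9x9 kernel, stride 1
conv1_size = valid_conv_size(28, 9, 1)
```

With 256 kernels, this gives the 20x20x256 volume listed above.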

(1) Convolution Layer

# TensorFlow 1.x API (tf.placeholder, tf.contrib)
import tensorflow as tf

# input image: flattened 28x28 digits
_x = tf.placeholder(tf.float32, [None, 784])

# reshape image for convolution
x = tf.reshape(_x, [-1, 28, 28, 1])

# first layer of convolution
with tf.variable_scope('conv1'):
    # create 256 filters of kernel size 9x9
    w = tf.get_variable('w', shape=[9, 9, 1, 256], dtype=tf.float32,
                       initializer=tf.contrib.layers.xavier_initializer())
    # stride = 1
    conv1 = tf.nn.conv2d(x, w, [1,1,1,1], padding='VALID', name='conv1')

    # relu activation
    conv1 = tf.nn.relu(conv1)

(2) Primary Capsules

  • Convolutional Capsule Layer with "squash" as non-linearity
  • 32 channels of 8D capsules
  • 32 x 8 = 256 kernels of size [9x9]
  • stride = 2
  • no ReLU activation; "squash" is applied instead
  • 32 [6x6x8] volumes
  • 32 x [6x6] = 1152 8D vectors

In convolutional capsule layers each unit in a capsule is a convolutional unit. Therefore, each capsule will output a grid of vectors rather than a single vector output.
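To make the grid of vectors concrete, the sketch below reshapes a stand-in for the primary capsule convolution output into individual 8D capsules. Grouping each run of 8 consecutive channels into one capsule is an assumption; the zeros array only stands in for real activations.

```python
import numpy as np

batch = 2
# Stand-in for the conv output: 6x6 grid, 32 channels x 8 dimensions
conv_out = np.zeros((batch, 6, 6, 32 * 8))

# One 8D vector per grid position and channel: 6 * 6 * 32 = 1152 capsules
capsules = conv_out.reshape(batch, 6 * 6 * 32, 8)
```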

(2) Primary Capsules

with tf.variable_scope('primary_caps'):
    # 9x9 filters, 32*8=256 channels, stride=2 
    primary_capsules = tf.contrib.slim.conv2d(inputs=conv1, 
                                              num_outputs=32*8, 
                                              kernel_size=9, 
                                              stride=2, 
                                              padding='VALID', 
                                              activation_fn=None)
    # apply "squash" non-linearity
    primary_capsules = squash(primary_capsules)

Prediction Vectors

\(u_{j|i} = W_{ij}u_i\)
# primary capsules : 32 x [6x6]
num_capsules = 32*6*6
primary_capsule_dim = 8
# reshape primary capsules for calculating prediction vectors
primary_capsules = tf.reshape(primary_capsules, 
         [-1, 1, num_capsules, 1, primary_capsule_dim])
# next capsule layer (digit capsules) : [10, 16]
num_digits = 10
digit_capsule_dim = 16

Prediction Vectors

# weight matrix
Wij = tf.get_variable('Wij', 
        [num_digits, num_capsules, primary_capsule_dim, digit_capsule_dim],
        dtype=tf.float32)
# tile primary capsules for multiplication with weight matrix
tiled_prim_caps = tf.tile(primary_capsules, [1, num_digits, 1, 1, 1])
# yeah.. we need a loop :(
#  help me fix this!
cap_predictions = tf.scan(lambda _, x : tf.matmul(x, Wij), # fn
         tiled_prim_caps, # elements
         initializer = tf.zeros([num_digits, num_capsules, 1, digit_capsule_dim])
         )
# squeeze dummy dimensions
cap_predictions = tf.squeeze(cap_predictions, [3])
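One way to avoid the loop is a single `einsum` over the batch, digit, and capsule axes. The sketch below shows the idea in NumPy with random stand-in data; `tf.einsum` with the same subscript string may work as a drop-in in TensorFlow, but that is an untested assumption here.

```python
import numpy as np

batch, num_digits, num_capsules = 2, 10, 1152
primary_capsule_dim, digit_capsule_dim = 8, 16

rng = np.random.default_rng(0)
# Stand-ins for the primary capsule outputs and the weight matrix
prim = rng.normal(size=(batch, num_capsules, primary_capsule_dim))
Wij = rng.normal(size=(num_digits, num_capsules,
                       primary_capsule_dim, digit_capsule_dim))

# b: batch, j: digit capsule, i: primary capsule, k: input dim, d: output dim
cap_predictions = np.einsum('bik,jikd->bjid', prim, Wij)
```

The result has shape [batch, num_digits, num_capsules, digit_capsule_dim], the same as the scan followed by the squeeze, and needs no tiling of the primary capsules.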

Log Priors

# { b_ij } log prior probabilities
priors = tf.get_variable('log_priors', 
            [num_digits, num_capsules], 
            initializer=tf.zeros_initializer())
            
# expand to support batch dimension
priors = tf.expand_dims(priors, axis=0)

\(\{ b_{ij} \}\)

(3) Digit Capsules

for i in range(routing_iterations):
    with tf.variable_scope('routing_{}'.format(i)):
        # softmax along "digits" axis
        c = tf.nn.softmax(priors, dim=1)
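        # reshape coupling coefficients to multiply with predictions
        c_t = tf.expand_dims(c, axis=-1)
        # weight each prediction vector by its coupling coefficient
        s_t = cap_predictions * c_t
        # total input to each digit capsule: sum over primary capsules
        s = tf.reduce_sum(s_t, axis=2)
        digit_caps = squash(s)
        # agreement: dot product of each prediction with the digit capsule output
        delta_priors = tf.reduce_sum(
            cap_predictions * tf.expand_dims(digit_caps, 2), -1)
        priors = priors + delta_priors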

return digit_caps

Margin Loss

\(L_c = T_c \max(0, m^+ - \|v_c\|)^2 + \lambda (1 - T_c) \max(0, \|v_c\| - m^-)^2\)
  • Digit Capsule of class \(c\) must have long instantiation vector \(v_c\)
    • iff digit \(c\) is present in the image
  • \(T_c\) = 1 if digit of class \(c\) is present
  • \(m^+\) = 0.9, \(m^-\) = 0.1
  • Downweight loss due to absent digit classes by \(\lambda\) = 0.5
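A worked NumPy sketch of the margin loss, assuming capsule lengths and one-hot labels are given as arrays. The function name and the example values are made up; \(\lambda\) = 0.5 matches the factor used in the code on the following slide.

```python
import numpy as np

def margin_loss(v_norm, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    """v_norm: capsule lengths, shape (batch, classes); T: one-hot labels."""
    pos = T * np.maximum(0.0, m_pos - v_norm) ** 2
    neg = lam * (1.0 - T) * np.maximum(0.0, v_norm - m_neg) ** 2
    return np.sum(pos + neg, axis=1)   # sum over classes, one value per example

T = np.array([[1.0, 0.0, 0.0]])        # class 0 is present

# Correct class is long, the others are short: no loss.
good = margin_loss(np.array([[0.95, 0.05, 0.05]]), T)

# Correct class too short (0.7 below m+), wrong class too long (0.7 above m-):
# 0.7^2 + 0.5 * 0.7^2 = 0.49 + 0.245 = 0.735
bad = margin_loss(np.array([[0.2, 0.8, 0.05]]), T)
```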

Margin Loss

#positives
pos_loss = tf.maximum(0., 
    0.9 - tf.reduce_sum(digit_caps_norm * _y,
    axis=1))
# square the hinge term, then average over the batch
pos_loss = tf.reduce_mean(tf.square(pos_loss))

# negatives
y_negs = 1. - _y
neg_loss = tf.maximum(0., digit_caps_norm * y_negs - 0.1)
neg_loss = tf.reduce_sum(tf.square(neg_loss), axis=-1) * 0.5
neg_loss = tf.reduce_mean(neg_loss)

margin_loss = pos_loss + neg_loss

Reconstruction Loss

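# reconstruct original image with a 3-layered MLP
# (fully_connected is assumed to be tf.contrib.layers.fully_connected)
from tensorflow.contrib.layers import fully_connected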
def reconstruct(target_cap):
    with tf.name_scope('reconstruct'):
        fc = fully_connected(target_cap, 512)
        fc = fully_connected(fc, 1024)
        fc = fully_connected(fc, 784, activation_fn=None)
        out = tf.sigmoid(fc)
        return out

reconstruct_loss = tf.reduce_mean(tf.reduce_sum(
        tf.square(_x - reconstruct(target_cap)), axis=-1))

total_loss = pos_loss + neg_loss + 0.0005 * reconstruct_loss
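The code above does not show where `target_cap` comes from. One plausible reading, sketched here as an assumption, is that it is the digit capsule of the true class, selected with the one-hot label. The helper names and toy arrays are hypothetical.

```python
import numpy as np

def mask_target_capsule(digit_caps, y_onehot):
    """Pick the capsule of the labelled class.
    digit_caps: (batch, 10, 16); y_onehot: (batch, 10); returns (batch, 16)."""
    return np.sum(digit_caps * y_onehot[:, :, None], axis=1)

def reconstruction_loss(x, x_rec):
    """Sum of squared pixel differences per image, averaged over the batch."""
    return np.mean(np.sum(np.square(x - x_rec), axis=-1))

# Toy batch of two examples labelled 3 and 7
digit_caps = np.arange(2 * 10 * 16, dtype=float).reshape(2, 10, 16)
y = np.eye(10)[[3, 7]]
target_cap = mask_target_capsule(digit_caps, y)
```

Because the loss sums over all 784 pixels, it is scaled down (0.0005 above) so that it does not dominate the margin loss.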

Pros

  • Fewer training samples needed
  • Equivariance
  • Overlapping objects or Crowded Scenes
  • Interpretable Activation Vectors

Cons

  • Still young : not tested on larger images
  • Training is slow due to the routing algorithm

Resources

Thank You!
