For language models: RNN
The same model is run over each word in turn
For computer vision: Convolution
Only a small block of pixels processed at a time
Convolutional Layer
Recurrent Neural Network
Ideally we'd process everything at once
All words / image pixels being fed to a single layer
Linear & Convolutional layers rely on an input of fixed size
But there are tasks with inputs of arbitrary size
Naive idea:
Have the same "Linear Layer", but derive weights on the go
Weights are stored in a matrix
Weights are derived from the two elements
W(i, j) - significance of word i to word j
Just use the similarity between words
w_i is a word token (a vector)
Cosine Similarity is convenient
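The similarity-based weighting above can be sketched in a few lines; the function names here are illustrative, not from the slides:

```python
import math

def cosine_similarity(u, v):
    # sim(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity_weights(tokens):
    # W[i][j] = significance of word j to word i, taken as cosine similarity
    return [[cosine_similarity(wi, wj) for wj in tokens] for wi in tokens]

# three toy word vectors w_i
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = similarity_weights(tokens)
```

Note that W is built from the input itself, so it works for any sequence length.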
Give even more control to the NNs!
Where does that formula come from?
W(i, j) = sim(w_i, w_j), where sim is the similarity function
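Standard Transformer attention gives the network this control by replacing the fixed similarity with learned query/key/value projections: softmax(QK^T / sqrt(d)) V. A minimal pure-Python sketch, with toy identity matrices standing in for the learned weights:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def attention(X, Wq, Wk, Wv):
    # project inputs into queries, keys, and values (Wq/Wk/Wv are learned in practice)
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Wq[0])
    out = []
    for q in Q:
        # scaled dot-product scores against every key, normalized by softmax
        scores = [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out

I2 = [[1.0, 0.0], [0.0, 1.0]]  # identity stands in for learned projection weights
X = [[1.0, 0.0], [0.0, 1.0]]
Y = attention(X, I2, I2, I2)
```

Because sim is now learned rather than fixed to cosine similarity, the network decides for itself which words should attend to which.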
ViT (Vision Transformer) uses Positional Encoding
To account for the 2D positioning
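One way to account for 2D positions (ViT as published actually learns its positional embeddings; this fixed sinusoidal variant is a common alternative, and the function names are illustrative) is to encode the row index in half the channels and the column index in the other half:

```python
import math

def sincos_1d(pos, dim):
    # standard sinusoidal encoding of a scalar position into `dim` channels
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc

def pos_encoding_2d(rows, cols, dim):
    # first half of the channels encodes the row, second half the column
    half = dim // 2
    return [[sincos_1d(r, half) + sincos_1d(c, half)
             for c in range(cols)] for r in range(rows)]

# one encoding vector per patch of a 4x4 patch grid
pe = pos_encoding_2d(4, 4, 8)
```

Each patch embedding gets its position vector added to it before entering the Transformer, so attention can distinguish otherwise identical patches.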
Segment Anything uses big ViTs
DiNO uses big ViTs
Most CV benchmarks are now dominated by ViTs
The denoising network inside diffusion models is often itself a ViT
The text is integrated there
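The text is typically integrated via cross-attention: queries come from the image features, keys and values from the text embeddings. A minimal sketch (learned projections omitted for brevity; names are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(image_tokens, text_tokens):
    # queries from image patches, keys/values from text embeddings,
    # so every spatial location gathers information from the prompt
    d = len(text_tokens[0])
    out = []
    for q in image_tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in text_tokens]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, text_tokens))
                    for c in range(d)])
    return out

text = [[1.0, 0.0], [0.0, 1.0]]   # toy text embeddings
image = [[1.0, 0.0]]              # one toy image token
mixed = cross_attention(image, text)
```

The same mechanism works for any prompt length, which is exactly why attention is the natural way to condition image generation on text.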
During diffusion, intermediate activations from generating the original image
are injected while generating the new image
The same idea applies to video, but the features are deformed using Optical Flow
Attention is applied across images
Attention is used to inject image features into a 3D structure