Attention
In Computer Vision
Older approaches
For language models: RNN
One model runs over the words one at a time
For computer vision: Convolution
Only a small block of pixels is processed at a time
Convolutional Layer
Recurrent Neural Network
Older approaches
Ideally we'd process everything at once
All words / image pixels being fed to a single layer
Motivation
Linear & Convolutional layers rely on an input of fixed size
Motivation
But there are tasks with inputs of arbitrary size
Intuition
Naive idea:
Have the same "Linear Layer", but derive weights on the go
Weights are stored in a matrix
Weights are derived from the two elements
W(i, j) - significance of word i to word j
Intuition
Just use the similarity between words
w_i is a word token (a vector)
Intuition
Cosine Similarity is convenient
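The idea so far can be sketched in a few lines of NumPy: the weight matrix W is not stored, it is computed from the pairwise cosine similarity of the tokens themselves, so the same code works for any number of tokens. The function names and the softmax normalization of each row are assumptions made for this illustration, not something the slides specify.

```python
import numpy as np

def cosine_similarity_matrix(tokens: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `tokens` (shape n x d)."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def similarity_mixing(tokens: np.ndarray) -> np.ndarray:
    """A "linear layer" whose weights are derived on the go.

    W[i, j] says how much token j contributes to the new value of token i.
    Normalizing each row with softmax is an assumption of this sketch.
    """
    W = softmax(cosine_similarity_matrix(tokens), axis=-1)
    return W @ tokens

rng = np.random.default_rng(0)
short = similarity_mixing(rng.normal(size=(3, 8)))   # 3 tokens
long = similarity_mixing(rng.normal(size=(10, 8)))   # 10 tokens, same code
```

Nothing here is learned yet: the mixing depends only on the token vectors.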
Intuition
Give even more control to the NNs!
Intuition
Where does that formula come from?
sim is the similarity function
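Assuming the formula in question is standard scaled dot-product attention, sim(q, k) is the dot product of learned projections of the tokens, scaled by the square root of their dimension. A minimal NumPy sketch under that assumption follows; the names W_q, W_k, W_v and all sizes are illustrative, and the random matrices stand in for weights a network would learn.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, W_q, W_k, W_v):
    """Scaled dot-product self-attention over `tokens` (shape n x d)."""
    Q = tokens @ W_q  # what each token is looking for
    K = tokens @ W_k  # what each token offers for matching
    V = tokens @ W_v  # the content that actually gets mixed
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # sim(q_i, k_j) for every pair
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_head = 16, 8
W_q = rng.normal(size=(d_model, d_head))  # stand-ins for learned parameters
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

tokens = rng.normal(size=(5, d_model))    # 5 tokens; any count works
out, weights = self_attention(tokens, W_q, W_k, W_v)
```

The projections are where the network gets its control: it learns which directions in token space should count as "similar".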
ViT
ViT (Vision Transformer) uses Positional Encoding
To account for the 2D positioning
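A rough sketch of how an image becomes a sequence of tokens with 2D position information: cut the image into patches, project each patch to a vector, and add an encoding of its (row, column) position. The fixed 2D sinusoidal encoding below is one possible choice, picked here as an assumption; ViT variants can also learn the position embeddings. The helper names and sizes are hypothetical.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must divide into patches"
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C), (gh, gw)

def sincos_1d(positions: np.ndarray, dim: int) -> np.ndarray:
    """Sinusoidal encoding of integer positions; `dim` must be even."""
    half = dim // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    angles = positions[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(gh: int, gw: int, dim: int) -> np.ndarray:
    """Half the channels encode the patch row, the other half its column.

    `dim` must be divisible by 4.
    """
    rows, cols = np.meshgrid(np.arange(gh), np.arange(gw), indexing="ij")
    row_enc = sincos_1d(rows.reshape(-1), dim // 2)
    col_enc = sincos_1d(cols.reshape(-1), dim // 2)
    return np.concatenate([row_enc, col_enc], axis=1)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patches, (gh, gw) = patchify(image, patch=8)  # 4 x 4 grid of patches

dim = 64
W_embed = rng.normal(size=(patches.shape[1], dim)) * 0.02  # stand-in for a learned projection
pos = sincos_2d(gh, gw, dim)
tokens = patches @ W_embed + pos  # one token per patch, position included
```

Without the added encoding, attention would treat the patches as an unordered set and lose the image layout.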
ViT
Segment Anything uses big ViTs
DINO uses big ViTs
ViTs now lead most CV benchmarks
Stable Diffusion
The denoising U-Net is convolutional, with Transformer (attention) blocks inside
The text is integrated there, through cross-attention
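One common way to integrate text into an image model is cross-attention: the image features provide the queries, and the text embeddings provide the keys and values, so every image location can pull in information from the prompt. The sketch below is a toy NumPy version under that assumption; the shapes, names, and random weights are illustrative and not taken from any particular Stable Diffusion implementation.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_tokens, W_q, W_k, W_v):
    """Image tokens attend to text tokens.

    image_tokens: (n_img, d_img), text_tokens: (n_txt, d_txt)
    """
    Q = image_tokens @ W_q  # queries come from the image
    K = text_tokens @ W_k   # keys come from the text
    V = text_tokens @ W_v   # values come from the text
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_img, d_txt, d_head = 32, 48, 16
W_q = rng.normal(size=(d_img, d_head))  # stand-ins for learned parameters
W_k = rng.normal(size=(d_txt, d_head))
W_v = rng.normal(size=(d_txt, d_head))

image_tokens = rng.normal(size=(64, d_img))  # e.g. an 8 x 8 feature map, flattened
text_tokens = rng.normal(size=(7, d_txt))    # e.g. 7 prompt tokens
out, weights = cross_attention(image_tokens, text_tokens, W_q, W_k, W_v)
```

The output has one row per image location, and row i of `weights` shows which prompt tokens that location attended to.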
Image editing tricks
During diffusion, intermediate activations from generating the original image
are injected into the generation of the new image
Video editing tricks
Similar idea for video, but the features are warped using optical flow
Text-to-3D
Attention is applied across images
Text-to-3D
Attention is used to inject image features into a 3D structure
Attention in CV
By xallt