# Planet: Understanding the Amazon from Space

Sr. Data Scientist at TrueAccord

PhD in Physics

Kaggle top 100

• In industry: interpretability, scalability, model size, throughput
• In competitions: accuracy

## Problem description

1. Train: 40k images
2. Test: 60k images (Public 40k, Private 20k)
3. JPG: 3 bands R, G, B 8 bit
4. TIF: 4 bands R, G, B, NIR 16 bit
5. Resolution (256, 256)
6. Multilabel classification (17 classes)
7. Some labels are mutually exclusive.
8. Labels based on jpg
The metric is the F-beta score with $\beta = 2$, which weights recall higher than precision:

$F_{\beta} = (1 + \beta^2) \frac {pr} {\beta^2 p + r}, \qquad p = \frac {tp} {tp + fp}, \qquad r = \frac {tp} {tp + fn}, \qquad \beta = 2$
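A direct numpy implementation of the formula above (in the competition the final score is the mean F2 over images; the function name is illustrative):

```python
import numpy as np

def f_beta(y_true, y_pred, beta=2.0):
    """F-beta for binary label vectors; beta=2 weights recall over precision."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

y_true = np.array([1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0])
# tp=2, fp=1, fn=1 => p = r = 2/3 => F2 = 2/3
print(round(f_beta(y_true, y_pred), 4))  # -> 0.6667
```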

### Classes

https://www.kaggle.com/anokas/data-exploration-analysis/notebook

### Specifics of the data / data leak

• Red: train
• Green: Public test
• Blue: Private test

How to recover it: brute-force matching of tile boundaries using L2 distance
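A minimal sketch of the boundary-matching idea, assuming tiles are adjacent crops of larger mosaics so that neighbouring tiles have similar edge columns (all names and the toy data here are illustrative):

```python
import numpy as np

def edge_l2(tile_a, tile_b):
    """L2 distance between tile_a's right edge column and tile_b's left edge column."""
    return np.linalg.norm(tile_a[:, -1].astype(float) - tile_b[:, 0].astype(float))

def best_right_neighbour(tile, candidates):
    """Index of the candidate whose left edge best matches tile's right edge."""
    return int(np.argmin([edge_l2(tile, c) for c in candidates]))

# toy example: split a smooth "mosaic" into two adjacent tiles
mosaic = np.add.outer(np.arange(4.0), np.arange(8.0))
left, right = mosaic[:, :4], mosaic[:, 4:]
decoy = np.full((4, 4), 100.0)
print(best_right_neighbour(left, [decoy, right]))  # -> 1 (the true neighbour)
```

In the real data this search runs over all train/test tiles, which is why it is a brute-force match.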

1. Train: only 40k images
2. A stable validation scheme is needed

=> the fight is for every 0.0001 => stacking

## Main idea: building an ensemble

### Team setup

1. Code: private repository on GitLab, one folder per person.
2. Predictions: Google Drive, per-fold train/test predictions stored as HDF5.

### 10 Folds

Stratified in a loop starting from the rarest labels
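One possible reading of "stratified in a loop starting from the rarest labels" is a greedy assignment like the following sketch (illustrative, not the team's exact code):

```python
import numpy as np

def stratify_by_rarest(labels, n_folds=10, seed=0):
    """Greedy multilabel stratification: process labels from rarest to most
    common; each still-unassigned positive sample goes to the fold with the
    fewest examples of that label (ties broken by overall fold size)."""
    rng = np.random.default_rng(seed)
    n, k = labels.shape
    fold_of = -np.ones(n, dtype=int)
    per_label = np.zeros((n_folds, k), dtype=int)  # label counts per fold
    fold_size = np.zeros(n_folds, dtype=int)
    for lab in np.argsort(labels.sum(axis=0)):     # rarest label first
        idx = np.where((labels[:, lab] == 1) & (fold_of == -1))[0]
        rng.shuffle(idx)
        for i in idx:
            g = min(range(n_folds), key=lambda g: (per_label[g, lab], fold_size[g]))
            fold_of[i] = g
            per_label[g] += labels[i]
            fold_size[g] += 1
    # samples with no positive labels: fill the smallest folds
    for i in np.where(fold_of == -1)[0]:
        g = int(np.argmin(fold_size))
        fold_of[i] = g
        fold_size[g] += 1
    return fold_of

labels = np.random.default_rng(1).integers(0, 2, size=(40, 3))
folds = stratify_by_rarest(labels, n_folds=4, seed=1)
print(np.bincount(folds))  # fold sizes
```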

### Ways to split into folds:

1. KFold
2. Stratified KFold
3. Grid search over random seeds
4. More advanced techniques (recall the Mercedes competition)

### Let's throw models into stacker...

For each model, for each fold we generate prediction on val and test

### Architectures

• DenseNet 121, 169, 201
• ResNet 34, 50, 101, 152
• ResNeXt 50, 101
• VGG 11, 13, 16, 19
• DPN 92, 96


### Initialization

• From scratch
• ImageNet
• ImageNet 11k + Places 365


### Loss

• binary cross-entropy (BCE)
• BCE - log(F2 approximation)
• softmax over weather classes + BCE over the rest
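The `BCE - log(F2 approximation)` idea can be sketched in plain numpy as below; in training this would be written in the DL framework so it is differentiable, and the soft-F2 form here (counts replaced by probability sums) is one common approximation, not necessarily the exact one the team used:

```python
import numpy as np

def soft_f2(probs, targets, eps=1e-7):
    """Differentiable F2 surrogate: hard counts replaced by probability sums."""
    tp = np.sum(probs * targets)
    fp = np.sum(probs * (1 - targets))
    fn = np.sum((1 - probs) * targets)
    p = tp / (tp + fp + eps)
    r = tp / (tp + fn + eps)
    return (1 + 4) * p * r / (4 * p + r + eps)

def bce(probs, targets, eps=1e-7):
    probs = np.clip(probs, eps, 1 - eps)
    return -np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))

def combined_loss(probs, targets):
    """BCE minus log of the soft F2: low when probabilities imply high F2."""
    return bce(probs, targets) - np.log(soft_f2(probs, targets) + 1e-7)

targets = np.array([1.0, 0.0, 1.0, 0.0])
good, bad = np.array([0.9, 0.1, 0.9, 0.1]), np.array([0.1, 0.9, 0.1, 0.9])
print(combined_loss(good, targets) < combined_loss(bad, targets))  # -> True
```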


### Training

• Freezing / unfreezing weights
• Different learning-rate schedules
• Keras, PyTorch, MXNet

### Augmentations

• Flips
• Rotations + Reflect
• Shear
• Scale
• Contrast
• Blur
• Channel multiplier

numpy + ImgAug + OpenCV

https://github.com/aleju/imgaug
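Flips plus 90° rotations are exactly the 8 elements of the dihedral group D4; a numpy-only sketch of that core, label-preserving part of the augmentation set (also usable for test-time augmentation):

```python
import numpy as np

def d4_views(img):
    """All 8 flip/rotation variants (the dihedral group D4) of an image."""
    views = []
    for k in range(4):
        rot = np.rot90(img, k)        # rotate by k * 90 degrees
        views.append(rot)
        views.append(np.fliplr(rot))  # rotation followed by a horizontal flip
    return views

img = np.arange(16).reshape(4, 4)
views = d4_views(img)
print(len(views))  # -> 8
```

`np.rot90` and `np.fliplr` act on the first two axes, so the same function works for HxWxC images; shear, scale, contrast, and blur need a library such as imgaug or OpenCV.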

Why train on JPG rather than TIFF:

1. The labels were assigned based on the JPGs.
2. The JPGs carry enough information.
3. There are shifts between the JPG and TIFF images.
4. All pre-trained networks expect 8-bit input.

It is still possible to get 0.93+ on TIFF:

https://www.kaggle.com/bguberfain/tif-to-jpg-by-matching-percentiles

• TIFF (RGB + NIR) => NGB
• Percentile matching
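Percentile matching can be sketched as a quantile-to-quantile interpolation that maps each 16-bit TIFF band onto the value distribution of the corresponding 8-bit JPG band (illustrative; the linked kernel's exact procedure may differ):

```python
import numpy as np

def match_percentiles(src, ref, n_points=101):
    """Map src values (e.g. a 16-bit TIFF band) so that their percentiles
    line up with those of ref (e.g. the corresponding 8-bit JPG band)."""
    qs = np.linspace(0, 100, n_points)
    src_q = np.percentile(src, qs)  # source quantiles (monotone)
    ref_q = np.percentile(ref, qs)  # target quantiles
    return np.interp(src.astype(float), src_q, ref_q)

rng = np.random.default_rng(0)
tif_band = rng.integers(0, 65536, size=(256, 256)).astype(np.uint16)
jpg_band = rng.integers(0, 256, size=(256, 256)).astype(np.uint8)
out = match_percentiles(tif_band, jpg_band)
print(out.min() >= 0 and out.max() <= 255)  # -> True
```

The output stays within the reference band's value range, so the result can be fed to networks pre-trained on 8-bit imagery.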

48 networks × 10 folds = 480 networks

### Second level (stacking)

• Models: ExtraTrees, NN, LR
• Blending: weighted average, mean
• Final step: thresholding

### Thresholding

1. Gives a large boost.
2. The optimal thresholds for different classes depend on each other.
3. Weather hack: if cloudy, lower the scores of the other labels.

A natural starting point for the threshold:

$F_2 = (1 + 2^2) \frac {pr} {2^2 p + r}, \qquad T = \frac {1} {1 + 2^2} = 0.2$
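Because the per-class thresholds interact, one way to tune them is a coordinate-wise search initialized at T = 0.2, optimizing the mean per-image F2 directly (a sketch under these assumptions, not the team's exact procedure):

```python
import numpy as np

def f2_samples(y_true, y_pred):
    """Mean per-image F2 (the competition metric) for binary matrices."""
    tp = (y_true * y_pred).sum(1)
    fp = ((1 - y_true) * y_pred).sum(1)
    fn = (y_true * (1 - y_pred)).sum(1)
    p = tp / np.maximum(tp + fp, 1)
    r = tp / np.maximum(tp + fn, 1)
    return (5 * p * r / np.maximum(4 * p + r, 1e-9)).mean()

def fit_thresholds(y_true, probs, grid=np.linspace(0.05, 0.5, 10)):
    """Coordinate-wise search: tune one class at a time while the others
    stay fixed, starting everything from the analytic prior T = 0.2."""
    k = probs.shape[1]
    th = np.full(k, 0.2)
    for j in range(k):
        th[j] = max(grid, key=lambda t: f2_samples(
            y_true, (probs >= np.where(np.arange(k) == j, t, th)).astype(int)))
    return th

# toy data: class 0 acts like an always-present weather label
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(50, 4))
y_true[:, 0] = 1
probs = y_true * 0.7 + rng.uniform(0, 0.25, size=(50, 4))
th = fit_thresholds(y_true, probs)
print(f2_samples(y_true, (probs >= th).astype(int)))  # -> 1.0
```

In practice several passes over the classes (and a finer grid) can be run until the score stops improving.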

What worked best:

On the Bayes-optimality of F-measure maximizers

https://arxiv.org/abs/1310.4849

### Did not work

1. TIFF
2. Spectral indices (NDWI, EVI, SAVI, etc.)
3. F-beta loss
4. Two-headed networks (softmax for weather, BCE for the rest)
5. Dehazing
6. Mosaic features

# Summary

• Three weeks
• ~20 GPUs
• 480 Networks
• 7th Place

Q: How many networks do we need to make it a product?

A: One.