
Machine Learning
Dreams
- Servers (+GPU)
- Team
- Data Engineer(s)
- DevOps
- Assessors
- Senior DS
- Data Infrastructure
- Enough time
- $$$
ML from sh*t & sticks
- No hardware
- No money
- No data (almost)
- You are alone
- Open Source
- You know what you need to solve
ML from sh*t & sticks
f : X → Y


Data
Model

Calculator

DATA
What kind of data?
- text
- image
- audio
- video
- graph
- numeric/categorical

DATA
Open-Source

DATA
Labeling
- "Market" platform
- Self-hosted platform

DATA
Labeling: "Market" platform


- Fast
- Simple
- Cheap
- Low quality
- NDA issues
- Requires "gold questions", "exams", etc.
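
The "gold questions" idea can be sketched in a few lines: mix items with known answers into the task, keep only workers who answer them well, then take a majority vote on the rest. This is a minimal sketch, not any platform's API. The function names, the `(worker, item, label)` tuple layout, and the 0.7 threshold are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def worker_accuracy(answers, gold):
    """Share of gold questions each worker answered correctly.

    answers: list of (worker_id, item_id, label) tuples (assumed layout)
    gold:    dict item_id -> known correct label
    """
    hits = defaultdict(int)
    seen = defaultdict(int)
    for worker, item, label in answers:
        if item in gold:
            seen[worker] += 1
            hits[worker] += int(label == gold[item])
    return {worker: hits[worker] / seen[worker] for worker in seen}


def aggregate(answers, gold, min_accuracy=0.7):
    """Majority vote over non-gold items, using only trusted workers.

    Workers who never saw a gold question have no accuracy score
    and are therefore not trusted.
    """
    accuracy = worker_accuracy(answers, gold)
    trusted = {w for w, acc in accuracy.items() if acc >= min_accuracy}
    votes = defaultdict(Counter)
    for worker, item, label in answers:
        if worker in trusted and item not in gold:
            votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


# Toy data: g1/g2 are gold questions, x1/x2 are the items we actually need labeled.
answers = [
    ("w1", "g1", "cat"), ("w1", "g2", "dog"), ("w1", "x1", "cat"), ("w1", "x2", "dog"),
    ("w2", "g1", "cat"), ("w2", "g2", "dog"), ("w2", "x1", "cat"), ("w2", "x2", "cat"),
    ("w3", "g1", "dog"), ("w3", "g2", "cat"), ("w3", "x1", "dog"), ("w3", "x2", "cat"),
    ("w4", "g1", "cat"), ("w4", "g2", "dog"), ("w4", "x1", "cat"), ("w4", "x2", "dog"),
]
gold = {"g1": "cat", "g2": "dog"}
labels = aggregate(answers, gold)
```

Here worker `w3` fails both gold questions and is ignored, so `x2` is decided by the remaining three workers.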

DATA
Labeling: Self-hosted platform


- SUPER FLEXIBLE
- Self-hosted
- Manual setup
- Requires assessors

DATA
Augmentation


Image / Video
- Crop
- Flip
- Rotate
- Color shift
- Blur
- Color filters
- etc
Text
- Synonyms 's/w1/w2'
- Back translation
- Drop/Insert random word
- Cut/Glue texts
Audio
- Noise
- Pitch
- Speed
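
The text tricks above (synonyms, drop/insert random word) need nothing beyond the standard library. Below is a minimal sketch under stated assumptions: the `synonyms` dictionary and `vocabulary` list are supplied by you, and the function names are made up for this example rather than taken from any augmentation library.

```python
import random


def synonym_replace(words, synonyms, rng):
    """Swap every word that has known synonyms for a random one of them."""
    return [rng.choice(synonyms[w]) if w in synonyms else w for w in words]


def random_drop(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    kept = [w for w in words if rng.random() >= p]
    return kept or [rng.choice(words)]


def random_insert(words, vocabulary, rng):
    """Insert one random vocabulary word at a random position."""
    position = rng.randint(0, len(words))  # both ends inclusive
    return words[:position] + [rng.choice(vocabulary)] + words[position:]


def augment(text, synonyms, vocabulary, n_copies=3, p_drop=0.1, seed=0):
    """Return n_copies noisy variants of text (empty list for empty text)."""
    rng = random.Random(seed)  # seeded so results are reproducible
    words = text.split()
    if not words:
        return []
    copies = []
    for _ in range(n_copies):
        new_words = synonym_replace(words, synonyms, rng)
        new_words = random_drop(new_words, p_drop, rng)
        new_words = random_insert(new_words, vocabulary, rng)
        copies.append(" ".join(new_words))
    return copies


copies = augment(
    "the quick fox",
    synonyms={"quick": ["fast", "speedy"]},
    vocabulary=["very", "really"],
    n_copies=5,
)
```

Each copy keeps the original label, so a small labeled set grows cheaply. Check a few samples by eye: aggressive dropping can change the meaning.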
DATA
Augmentation

Image / Video
Text / Audio
- /makcedward/nlpaug (?)
- ✋✋

MODEL


Data → Model → Predictions
MODEL

Data → BASELINE → Predictions
MODEL
Why such a dummy model?
I want BOOSTING DEEP LEARNING!

- Fast (preparing & training)
- Good first approximation
- Less chance of overfitting
- Still works
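
The slides do not say which baseline is meant, so here is the simplest possible one as a sketch: always predict the most frequent training label. The class name `MajorityBaseline` and the toy data are assumptions for illustration. scikit-learn's `DummyClassifier(strategy="most_frequent")` behaves similarly if you already use that library.

```python
from collections import Counter


class MajorityBaseline:
    """Predicts the most frequent training label for every input."""

    def fit(self, X, y):
        # X is ignored on purpose: the baseline only looks at label counts.
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


X_train = ["spam offer", "hello friend", "meeting at noon", "lunch?"]
y_train = ["spam", "ham", "ham", "ham"]
X_test = ["free money", "see you soon"]
y_test = ["spam", "ham"]

baseline = MajorityBaseline().fit(X_train, y_train)
score = accuracy(y_test, baseline.predict(X_test))
```

Any fancier model has to beat `score` to justify its training time.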
MODEL

Embed your data
Confucius
MODEL



Data → Encoder → Embeddings → Your model → Predictions
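
The flow from data through an encoder and embeddings to your own small model can be sketched end to end. Everything named here is an assumption for illustration: `toy_encoder` is a hashed bag-of-words stand-in for a real pre-trained encoder, and `NearestCentroid` is a hand-written toy, not a library class.

```python
import hashlib
import math

DIM = 64  # embedding size of the toy encoder (arbitrary choice)


def toy_encoder(text, dim=DIM):
    """Stand-in for a frozen pre-trained encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # sha256 gives the same bucket on every run, unlike built-in hash() for str.
        index = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[index] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class NearestCentroid:
    """'Your model': one mean embedding per class, predict by highest dot product."""

    def fit(self, embeddings, labels):
        sums, counts = {}, {}
        for emb, label in zip(embeddings, labels):
            acc = sums.setdefault(label, [0.0] * len(emb))
            for i, v in enumerate(emb):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {
            label: [v / counts[label] for v in total] for label, total in sums.items()
        }
        return self

    def predict(self, embeddings):
        def score(emb, centroid):
            return sum(a * b for a, b in zip(emb, centroid))

        return [
            max(self.centroids_, key=lambda label: score(emb, self.centroids_[label]))
            for emb in embeddings
        ]


texts = ["cheap pills buy now", "buy cheap watches", "lunch at noon", "team meeting at noon"]
labels = ["spam", "spam", "ham", "ham"]

# The encoder is never trained here; only the small model on top is fitted.
model = NearestCentroid().fit([toy_encoder(t) for t in texts], labels)
predictions = model.predict([toy_encoder("buy cheap pills"), toy_encoder("meeting at noon")])
```

With a real pre-trained encoder in place of `toy_encoder`, the model on top can stay this small, which is the point when you have little data and no hardware.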
MODEL

Where can I find "encoders"?

Here
MODEL

Pre-trained models
NLP
MODEL

Pre-trained models
CV
MODEL

MOAR MODELS
CALCULATOR

A laptop is OK,
but ...
CALCULATOR

| Provider | GPU | GPU MEM (GB) | RAM (GB) | CPU (# cores) | DISK (GB) |
|---|---|---|---|---|---|
| Floydhub | K80 / V100 | 12-16 | 61 | - | 10 |
| Paperspace | P5000 | 16 | 30 | 8 | 250 |
| Google Colab | K80 / TPU | 11.5 | 10-11 | 2 | 25 |
CONCLUSION
- Simple is better than complex
- Fail fast
- LABEL YOUR DATA
- Re-use OSS data/models
- Take free stuff

CONCLUSION
Recipe
- Understand your task
- Collect data
- Label
- Augment
- Embed
- Create a simple model
- ...
- PROFIT!
Thanks!




ML from sh*t & sticks
By Ivan Menshikh