How to talk with a data scientist

re:Clojure 2021 workshop series // 2021-11-13

J Santiago

Say hello, ask questions, learn, share, and be kind, as you would with any other human being. Done!

How to talk with a data scientist

think like

The anti-fraud team is hiring!

FinTech from Berlin

  • 120 employees
  • Liquidity solutions for SMEs
  • Buy now, pay later for B2B online shops
  • Raised $100M in Series C

There are no stupid questions 🙂

Agenda

  1. What don't you know about data science?
  2. Core (data) principles (according to me ¯\_(ツ)_/¯)
  3. Group problem-solving

What do you want to know?

Question

Model

Analysis

maybe

Common terms

  • Training
  • Model
  • Metrics

👨‍🦰👩🏽‍🦱👧🏼🧑‍🦲🧟‍♂️👳🧑🏽‍🦱👨🏼

Who is a zombie?

🧟‍♂️🧟‍♂️🧟‍♂️🧟‍♂️🧑🏽‍🦱👨🏼🧑🏽‍🦱👨🏼

Training data

Model features

  • Heart-rate
  • Words per second

🧑‍🦲

"78% probability this is a zombie"
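To make the terms "training", "model" and "features" a bit more concrete, here is a minimal sketch of the zombie example. Everything in it is an assumption made for illustration: the training rows are invented, and a tiny hand-written logistic regression stands in for whatever model a real team would use.

```python
import math

# Hypothetical training data: (heart_rate_bpm, words_per_second, is_zombie).
# The numbers are invented for illustration only.
TRAINING_DATA = [
    (0.0, 0.1, 1),
    (5.0, 0.0, 1),
    (10.0, 0.2, 1),
    (0.0, 0.3, 1),
    (70.0, 2.5, 0),
    (65.0, 3.0, 0),
    (80.0, 2.0, 0),
    (75.0, 2.8, 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def scale(heart_rate, words_per_second):
    # Bring both features into a similar range so gradient descent behaves.
    return heart_rate / 100.0, words_per_second / 3.0


def train(rows, epochs=2000, learning_rate=0.5):
    """Training: fit three numbers (two weights and a bias) to labelled examples."""
    w_hr, w_wps, bias = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for heart_rate, wps, label in rows:
            x_hr, x_wps = scale(heart_rate, wps)
            error = sigmoid(w_hr * x_hr + w_wps * x_wps + bias) - label
            w_hr -= learning_rate * error * x_hr
            w_wps -= learning_rate * error * x_wps
            bias -= learning_rate * error
    return w_hr, w_wps, bias


def predict_zombie_probability(model, heart_rate, words_per_second):
    """Model: turn the features of one unseen person into a probability."""
    w_hr, w_wps, bias = model
    x_hr, x_wps = scale(heart_rate, words_per_second)
    return sigmoid(w_hr * x_hr + w_wps * x_wps + bias)


model = train(TRAINING_DATA)
p = predict_zombie_probability(model, heart_rate=30.0, words_per_second=0.8)
print(f"{p:.0%} probability this is a zombie")
```

The point is the shape of the workflow rather than the numbers: labelled examples go in, a model comes out, and the model answers with a probability, not a yes or no.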

{amount: 122
  n-paid: null
  color: "blue"}

Model

Data

PROD

TRAINING

amount type color
122 0 "blue"

{amount: 122
  n-paid: 0
  color: "blue"}
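One way to read the PROD and TRAINING records above is that the same field can look different in the two places: `null` in one, `0` in the other. A minimal sketch of a check for that kind of mismatch follows. The function name `find_mismatches` is hypothetical, and comparing Python type names is only a stand-in for a proper schema check.

```python
def find_mismatches(training_row, prod_record):
    """Return the fields whose value types differ between TRAINING and PROD.

    A field that is missing on one side is treated like a null (None) value.
    """
    mismatches = {}
    for field in sorted(set(training_row) | set(prod_record)):
        train_type = type(training_row.get(field)).__name__
        prod_type = type(prod_record.get(field)).__name__
        if train_type != prod_type:
            mismatches[field] = (train_type, prod_type)
    return mismatches


# The two records from the slides, written as Python dicts.
training_row = {"amount": 122, "n-paid": 0, "color": "blue"}
prod_record = {"amount": 122, "n-paid": None, "color": "blue"}

print(find_mismatches(training_row, prod_record))
# → {'n-paid': ('int', 'NoneType')}
```

A model trained on `0` has never seen `null`, so a check like this is worth running before the model meets live data.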

core (DATA) principles

for a successful data science project

Immutable &

Traceable &

specific

(... and documented, documented, documented!)

Data Scientist

user_id day volume
1 2021-11-02 1200
2 2021-11-02 1300
3 2021-11-02 600
4 2021-11-02 2100

Data is immutable

Mutating data invariably leads to loss of information. Not recording what mutation occurred leads to chaos 😱.

Stored data must be an exact replica of what happens in a live system.

⚠️⚠️⚠️⚠️

user_id timestamp volume
1 2021-11-01 12:34:00 500
1 2021-11-01 15:10:07 500
1 2021-11-01 20:01:00 500
2 2021-11-01 10:15:00 100
2 2021-11-01 13:28:37 500
2 2021-11-01 21:59:19 500
3 2021-11-01 07:30:00 300
3 2021-11-01 12:34:00 500

No more mutation 🎉

  • When did the user drink?
  • How much did they drink at lunch?
  • When did they use the app the most?
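The questions above can only be answered because every drink is stored as its own event. A minimal sketch, using the rows from the event table: the function names are hypothetical, and treating "lunch" as 12:00 to 13:59 is an assumption.

```python
from collections import defaultdict
from datetime import datetime

# The event rows from the table above: (user_id, timestamp, volume).
EVENTS = [
    (1, "2021-11-01 12:34:00", 500),
    (1, "2021-11-01 15:10:07", 500),
    (1, "2021-11-01 20:01:00", 500),
    (2, "2021-11-01 10:15:00", 100),
    (2, "2021-11-01 13:28:37", 500),
    (2, "2021-11-01 21:59:19", 500),
    (3, "2021-11-01 07:30:00", 300),
    (3, "2021-11-01 12:34:00", 500),
]


def parse(timestamp):
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")


def drink_times(events, user_id):
    """When did the user drink?"""
    return [parse(ts) for uid, ts, _ in events if uid == user_id]


def lunch_volume(events, start_hour=12, end_hour=14):
    """How much did each user drink at lunch (assumed 12:00 to 13:59)?"""
    totals = defaultdict(int)
    for user_id, ts, volume in events:
        if start_hour <= parse(ts).hour < end_hour:
            totals[user_id] += volume
    return dict(totals)


def daily_totals(events):
    """The aggregated per-day view can always be rebuilt from the events."""
    totals = defaultdict(int)
    for user_id, ts, volume in events:
        totals[(user_id, parse(ts).date().isoformat())] += volume
    return dict(totals)


print(lunch_volume(EVENTS))   # → {1: 500, 2: 500, 3: 500}
print(daily_totals(EVENTS)[(1, "2021-11-01")])  # → 1500
```

The reverse does not work: starting from a single daily total per user, none of the three questions can be answered.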

Data is traceable

Events or entities created together are linked together. Changes to data are events!

user_id timestamp volume session
1 2021-11-01 12:34:00 500 3
1 2021-11-01 15:10:07 500 4
1 2021-11-01 20:01:00 500 10
2 2021-11-01 10:15:00 100 1
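"Changes to data are events" can be sketched as an append-only log in which a correction is a new row that points at the row it corrects. The `event_id` and `corrects` fields, the function names, and the correction itself are all hypothetical; the slides only show the `session` column.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: an event cannot be modified once created
class DrinkEvent:
    event_id: int
    user_id: int
    timestamp: str
    volume: int
    session: int
    corrects: Optional[int] = None  # hypothetical link to the corrected event


log = [
    DrinkEvent(1, 1, "2021-11-01 12:34:00", 500, 3),
    DrinkEvent(2, 1, "2021-11-01 15:10:07", 500, 4),
]


def record_correction(log, original_id, new_volume, timestamp, session):
    """Append a correction as a new event instead of overwriting the old one."""
    original = next(e for e in log if e.event_id == original_id)
    correction = DrinkEvent(
        event_id=max(e.event_id for e in log) + 1,
        user_id=original.user_id,
        timestamp=timestamp,
        volume=new_volume,
        session=session,
        corrects=original_id,
    )
    return log + [correction]


def current_view(log):
    """Latest state: hide events that a later event has corrected."""
    corrected = {e.corrects for e in log if e.corrects is not None}
    return [e for e in log if e.event_id not in corrected]


# Invented example: the user later fixes the second entry to 300.
new_log = record_correction(log, 2, 300, "2021-11-01 15:12:00", session=4)
print([e.volume for e in current_view(new_log)])  # → [500, 300]
print(len(new_log))  # → 3, the original entry is still in the log
```

Both the original value and the correction stay available, so anyone can trace what the system believed at any point in time.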

Data is specific

Good domain modelling is critical for fetching exactly the data that is needed.

user_id timestamp volume session beverage
1 2021-11-01 12:34:00 500 3 water
1 2021-11-01 15:10:07 500 4 water
1 2021-11-01 20:01:00 500 10 water
2 2021-11-01 10:15:00 100 1 coffee
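Once the beverage is part of each record, a specific question becomes a simple filter. A minimal sketch over the rows from the table above; the function name `total_volume` is hypothetical.

```python
# The rows from the table above, one dict per event.
ROWS = [
    {"user_id": 1, "timestamp": "2021-11-01 12:34:00", "volume": 500, "session": 3, "beverage": "water"},
    {"user_id": 1, "timestamp": "2021-11-01 15:10:07", "volume": 500, "session": 4, "beverage": "water"},
    {"user_id": 1, "timestamp": "2021-11-01 20:01:00", "volume": 500, "session": 10, "beverage": "water"},
    {"user_id": 2, "timestamp": "2021-11-01 10:15:00", "volume": 100, "session": 1, "beverage": "coffee"},
]


def total_volume(rows, beverage):
    """Sum the volume of one specific beverage across all events."""
    return sum(row["volume"] for row in rows if row["beverage"] == beverage)


print(total_volume(ROWS, "water"))   # → 1500
print(total_volume(ROWS, "coffee"))  # → 100
```

Without the `beverage` column, water and coffee would be indistinguishable and the question could not be answered at all.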

Group WORK

https://docs.google.com/document/d/1QZMOPuLbe2eJBJgDaj_opg1JTYte7T7BOlWQ_ToCa10

English is the most powerful programming language.

- Paulus Esterhazy

Find me at http://jcpsantiago.xyz

and as @jcpsantiago almost everywhere else

Thank you! 🙏

re:Clojure 2021 How to talk with a data scientist

By jcpsantiago
