Engineering Systems for Data Science

... a way of thinking of systems that is global and encompassing, rather than focussed on particular issues.

What is Systems Thinking?

Data Science System =
Data + _____ + _____ + ...

Venn-data

DOMAIN KNOWLEDGE

HACKING SKILLS

MATH AND STATS

DATA SCIENCE

Venn-data

BUSINESS

PROGRAMMING

STATISTICS

COMMUNICATION

DATA SCIENCE

DATA ANALYST

RESEARCH ENGINEER

DATA ENGINEER

Engineering Systems for Data Science

Process

Programming

+

Process

+

Flow of steps

Agile improvement

CRISP-DM

  1. Business understanding

  2. Data understanding

  3. Data preparation

  4. Modelling

  5. Evaluation

  6. Deployment

DATA

CRISP-DM - Business understanding

  1. What are the business objectives?

  2. Why can data science achieve those objectives?

  3. How do we define success metrics?

  4. Are there ethical considerations in data usage?

  5. What have other industries achieved?

CRISP-DM - Data understanding

  1. What are the sources of data?

  2. Does new data need to be collected?

  3. What is the quantity and quality of data available?

  4. What do different data items represent?

  5. Which data is relevant to the objectives?

CRISP-DM - Data preparation

  1. What are the different data formats?

  2. Is there need for annotating data?

  3. How can data be extracted, transformed, loaded?

  4. How to standardise and normalise data?

  5. How to efficiently store data for analysis?
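Question 4 above can be made concrete with a minimal NumPy sketch of two common rescalings: z-score standardisation and min-max normalisation. The sample values and variable names are made up for illustration, and other rescaling schemes may suit a given dataset better.

```python
import numpy as np

# Hypothetical sample: one numeric feature with five observations
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Standardise: subtract the mean, divide by the standard deviation
standardised = (x - x.mean()) / x.std()

# Normalise: rescale linearly so that all values lie in [0, 1]
normalised = (x - x.min()) / (x.max() - x.min())

print(standardised.round(3))
print(normalised)
```

After standardising, the feature has mean 0 and standard deviation 1; after normalising, its smallest value is 0 and its largest is 1.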

CRISP-DM - Modelling

  1. What assumptions to make for the models?

  2. Statistical or algorithmic modelling?

  3. Is clean data sufficient for modelling?

  4. Is compute budget sufficient for modelling?

  5. Are results statistically significant?

CRISP-DM - Evaluation

  1. Does model work correctly on test data?

  2. Does model achieve business objectives?

  3. Does model meet performance requirements?

  4. Is the model unbiased and robust?

  5. What are ways to improve the model?

CRISP-DM - Deployment

  1. Where is the model to be deployed?

  2. What is the HW / SW stack for deployment?

  3. Does it meet performance requirements?

  4. Does it violate privacy requirements?

  5. Does it meet users' expectations?

CRISP-DM

  1. Iterative design and deployment -> MVP

  2. Revise expectation of success and value of data science

  3. Upgrade human and hardware resources  

Programming tools

No-code environments
[H2O.ai, IBM Watson, Amazon Lex, DataRobot, ...]

Spreadsheets, BI tools
[Microsoft Excel, Power BI, Google Sheets, Tableau, ...]

Programming languages
[Weka, SAS, R, Python, MATLAB, Mathematica, ...]

High-performance stacks
[Hadoop, Spark, ...]

Why Python

1. PPP: Python for Prototyping and Production

Sometimes there are two silos in orgs

Business / Stats
[Excel, R, Tableau]

Programmers / IT
[C, Java, Cloud, Spark]

Lot of back and forth in different tools

Frustrating and inefficient

Python solves this problem

Business / Stats
[Easy to learn]

Programmers / IT
[Full-fledged language]

Why Python

2. Python is beginner friendly

First released by Guido van Rossum in 1991,
influenced by ABC, a language designed to teach programming

Executable pseudocode

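# read csv file, print linewise values
with open('data.csv') as my_file:
    # iterate over the file line by line (not character by character)
    for line in my_file:
        # drop the trailing newline before splitting on commas
        vals = line.strip().split(',')
        for val in vals:
            print(val)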

Simple syntax, whitespace, high readability

Why Python

3. Python is increasingly the default choice
for data science

Why Python

4. Python is cool beyond data science too

As a scripting language

For web development

For programming IoT devices

Why not Python

Python is an interpreted language

Compiler
[Runs on specific HW]

Interpreter
[Runs on any HW with interpreter support]

Can be slower due to the interpreter
Limited to running instruction by instruction
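The interpreter overhead can be made visible with a rough timing sketch. It assumes NumPy is installed and compares an interpreted Python loop with the same sum done inside NumPy's compiled code. The function name `python_sum` and the problem size are illustrative choices, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

values = list(range(100_000))
array = np.arange(100_000, dtype=np.int64)

def python_sum(seq):
    """Add numbers one at a time in an interpreted loop."""
    total = 0
    for v in seq:
        total += v
    return total

# Both approaches must give the same answer
assert python_sum(values) == int(array.sum())

# Time 20 repetitions of each approach
loop_time = timeit.timeit(lambda: python_sum(values), number=20)
numpy_time = timeit.timeit(lambda: array.sum(), number=20)
print(f"interpreted loop: {loop_time:.4f} s, NumPy: {numpy_time:.4f} s")
```

This is one reason the libraries introduced later matter: they move the heavy loops out of the interpreter.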

Python - Libraries

IPython and Jupyter

IPython is an interactive interface
Encourages an iterative read-eval-print loop (REPL)

Python - Libraries

IPython and Jupyter

Jupyter: Interactive web-based code notebook

Python - Libraries

IPython and Jupyter

Mathematica

Python - Libraries

NumPy

Short for Numerical Python

Shortened further as np

In [1]: import numpy as np                                                      
In [2]: a = np.identity(3)                                                      
In [3]: a                                                                       
Out[3]: 
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

Python - Libraries

NumPy

Efficient loading, storing, and processing of high-dimensional data

NumPy arrays are the common interface, backed by a low-level implementation
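A minimal sketch of the three operations named above (storing, loading, processing) on a small 3-dimensional array. The array contents, the temporary file location, and the variable names are assumptions made for illustration.

```python
import os
import tempfile
import numpy as np

# Hypothetical data: 4 samples, each a 3x2 grid of measurements
data = np.arange(24, dtype=float).reshape(4, 3, 2)

# Store to disk in NumPy's binary .npy format, then load it back
path = os.path.join(tempfile.mkdtemp(), "data.npy")
np.save(path, data)
loaded = np.load(path)

# Process: average over the sample axis without writing a Python loop
mean_grid = loaded.mean(axis=0)
print(loaded.shape, mean_grid.shape)
```

The averaging runs in NumPy's compiled code, so no explicit Python loop over the samples is needed.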

Python - Libraries

NumPy

MATLAB

Python - Libraries

Pandas

Shortened as pd. Development started in 2008

Data structures and functions to manipulate structured data

We will look at two common objects

DataFrame
[Relational data]

Series
[Time series]
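A small sketch of the two objects, assuming pandas is installed. The city names, temperatures, and dates are made-up illustrative values.

```python
import pandas as pd

# DataFrame: relational (table-like) data with named columns
df = pd.DataFrame({
    "city": ["Chennai", "Delhi", "Mumbai"],
    "temp_c": [31.0, 28.5, 30.0],
})
warm = df[df["temp_c"] > 29]  # filter rows, much like a SQL WHERE clause

# Series: one labelled column, here indexed by date (a time series)
dates = pd.date_range("2021-01-01", periods=4, freq="D")
ts = pd.Series([10, 12, 11, 15], index=dates)
rolling = ts.rolling(window=2).mean()  # 2-day moving average

print(warm)
print(rolling)
```

Each column of a DataFrame is itself a Series, so the two objects are used together.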

Python - Libraries

Pandas

Recall: This course is structured around:

collect -> process -> store -> describe -> model

Pandas is a mainstay across these 3 tasks:

process -> store -> describe

Python - Libraries

Pandas

R programming language

Python - Libraries

Matplotlib, Seaborn

Default visualisation tools in Python
Work especially well with Jupyter
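A minimal Matplotlib sketch, assuming Matplotlib and NumPy are installed. It draws a sine curve and saves it to a temporary file. The off-screen "Agg" backend and the `out_path` location are choices made so that the example runs outside a notebook; in Jupyter the figure would normally appear inline.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Sample the sine function on 100 points between 0 and 2*pi
x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Save the figure as a PNG in a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "sine.png")
fig.savefig(out_path)
plt.close(fig)
```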

Python - Libraries

Matplotlib, Seaborn

Python - Libraries

SciPy

A collection of useful sub-libraries for scientific computing with Python

scipy.stats will be of major help in modelling

collect -> process -> store -> describe -> model

scipy.stats supports these 2 tasks:

describe -> model
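A hedged sketch of how scipy.stats can serve both the describe and the model steps, assuming SciPy and NumPy are installed. The two groups are synthetic samples drawn with a fixed seed, and the two-sample t-test is only one of many tests the module offers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed for repeatability
group_a = rng.normal(loc=50.0, scale=5.0, size=200)
group_b = rng.normal(loc=52.0, scale=5.0, size=200)

# Describe: summary statistics of one sample
summary = stats.describe(group_a)
print(summary.mean, summary.variance)

# Model: two-sample t-test for a difference in means
result = stats.ttest_ind(group_a, group_b)
print(result.statistic, result.pvalue)
```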

Python - Libraries

SciPy

MATLAB

Python - Libraries

Scikit-Learn

Python's workhorse for machine learning

We will begin using it in the last section, on linear regression
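As a preview, a minimal linear regression sketch, assuming scikit-learn and NumPy are installed. The data is a made-up, noise-free line y = 3x + 2, so the fitted slope and intercept should recover those values closely.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data following y = 3x + 2
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X.ravel() + 2.0

# Fit the model, then inspect slope and intercept
model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)

# Predict for an unseen input
prediction = model.predict(np.array([[5.0]]))
print(prediction)
```

Other scikit-learn estimators follow the same fit-then-predict pattern.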

Our approach

This is not a Python course, but ...

we will do a brisk intro to Python.

Main goal is to ...

provide the nuts-and-bolts skills required to implement theoretical ideas on real-world data sets

Your approach

Do not just read code, write it

Replicate all exercises shown in videos

The datasets we provide will offer ample opportunity to explore

Experiment beyond what is shown

System Requirements

Hardware you would need for this course

Standard PC (Windows, Linux, macOS)

Good internet connection

We will show how to use free cloud resources such as Google Colaboratory

Summary

Systems thinking

Roles in Data Science

Process for Data Science - CRISP-DM

Programming tools for Data Science

Why Python?

Libraries we will look at
