Data Science is the most sought-after job of the twenty-first century!

Data is the new oil and Data Science is its combustion engine!

But what exactly is Data Science?

Data science is the future!

Learning Objectives

What exactly is Data Science?

Why is it such a sought-after job description?

What does a Data Scientist actually do?

How important are mathematics and programming skills for a data scientist?

How does Data Science relate to other buzzwords such as ML, DL, AI and DM?

What are some common misconceptions about Data Science?

Why are there multiple confusing definitions?

Data Science is an assortment of several tasks

The attention given to each task depends on the application

What are these tasks?

collect

process

store

describe

model

What is Data Science?

Data Science is the science of collecting, storing, processing, describing and modelling data

Collecting data

What is involved in data collection?

depends on the question a data scientist is trying to answer

depends on the environment in which the data scientist is working

Collecting data

1. A Data Scientist at an e-commerce company

Which items do customers buy together?

Data already exists

Access using code, SQL
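A minimal sketch of how such a question could be answered with SQL from code, assuming a hypothetical `order_items` table with one row per item in each order. The table name, columns and sample rows are illustrative only and do not come from any real schema.

```python
import sqlite3

# Hypothetical order data: (order_id, item). Purely illustrative.
ORDER_ITEMS = [
    (1, "phone"), (1, "case"),
    (2, "phone"), (2, "case"), (2, "charger"),
    (3, "phone"), (3, "charger"),
]

def items_bought_together(rows):
    """Count how often each pair of items appears in the same order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE order_items (order_id INTEGER, item TEXT)")
    conn.executemany("INSERT INTO order_items VALUES (?, ?)", rows)
    # Self-join on order_id; a.item < b.item keeps each pair only once.
    query = """
        SELECT a.item, b.item, COUNT(*) AS pair_count
        FROM order_items AS a
        JOIN order_items AS b
          ON a.order_id = b.order_id AND a.item < b.item
        GROUP BY a.item, b.item
        ORDER BY pair_count DESC, a.item, b.item
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

pairs = items_bought_together(ORDER_ITEMS)
```

Each returned tuple holds two item names and the number of orders containing both.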

Collecting data

2. A Data Scientist working for a political party

What are people saying about the new policy?

Data already exists

Needs to be crawled, scraped
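A rough sketch of the scraping step, using only the Python standard library. It parses an HTML string that is already in memory; the page snippet and the `comment` class name are assumptions made for illustration, and a real crawler would also need to fetch pages and respect each site's terms.

```python
from html.parser import HTMLParser

class CommentExtractor(HTMLParser):
    """Collects the text of every <p class="comment"> element."""

    def __init__(self):
        super().__init__()
        self.in_comment = False
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "p" and ("class", "comment") in attrs:
            self.in_comment = True
            self.comments.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_comment = False

    def handle_data(self, data):
        if self.in_comment:
            self.comments[-1] += data

def extract_comments(html):
    parser = CommentExtractor()
    parser.feed(html)
    return [text.strip() for text in parser.comments]

# Hypothetical page content, not taken from any real site.
PAGE = """
<html><body>
  <h1>Discussion</h1>
  <p class="comment">The new policy helps farmers.</p>
  <p class="ad">Buy now!</p>
  <p class="comment">Too early to judge the policy.</p>
</body></html>
"""

comments = extract_comments(PAGE)
```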

Collecting data

3. A Data Scientist working with farmers

Effect of type of seed, fertiliser, irrigation on yield?

Data not available

Needs to design experiments
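One common way to design such an experiment is a full factorial layout, in which every combination of seed, fertiliser and irrigation is tried on some plots. The sketch below assumes two hypothetical levels per factor and a fixed number of replicate plots; the level names are placeholders, not recommendations.

```python
import itertools
import random

# Hypothetical factor levels, for illustration only.
FACTORS = {
    "seed": ["variety_A", "variety_B"],
    "fertiliser": ["organic", "chemical"],
    "irrigation": ["drip", "flood"],
}

def full_factorial(factors):
    """Return one treatment (a dict) per combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, combo))
            for combo in itertools.product(*factors.values())]

def randomised_plan(factors, replicates, seed=0):
    """Repeat every treatment `replicates` times and shuffle the plot order."""
    plan = [dict(treatment)
            for _ in range(replicates)
            for treatment in full_factorial(factors)]
    random.Random(seed).shuffle(plan)  # random assignment of treatments to plots
    return plan

treatments = full_factorial(FACTORS)
plan = randomised_plan(FACTORS, replicates=3)
```

Shuffling the plot order is a simple guard against systematic differences between plots being mistaken for treatment effects.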

Collecting data

Skills required

Intermediate level programming

Knowledge of Databases

Knowledge of Statistics

Storing data

1. Transactional and Operational Data

patient records

telephone bills

insurance claims

employee records

invoices

inventory

customer records

reimbursements

purchase orders

Emp ID   Name   Role   Salary   Email
00001    ABC    CEO    100$     abc@a.com
00002    XYZ    CTO    100$     xyz@abc.com
...      ...    ...    ...      ...

Structured Data

Relational Databases

(select, insert, update, delete)
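The four operations can be sketched with Python's built-in `sqlite3` module. The `employees` table below mirrors the example columns; the updated salary value is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE employees ("
    "emp_id TEXT PRIMARY KEY, name TEXT, role TEXT, salary INTEGER, email TEXT)"
)

# insert
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?)",
    [
        ("00001", "ABC", "CEO", 100, "abc@a.com"),
        ("00002", "XYZ", "CTO", 100, "xyz@abc.com"),
    ],
)

# update (120 is an arbitrary illustrative value)
conn.execute("UPDATE employees SET salary = ? WHERE emp_id = ?", (120, "00002"))

# delete
conn.execute("DELETE FROM employees WHERE emp_id = ?", ("00001",))

# select
rows = conn.execute(
    "SELECT emp_id, name, role, salary FROM employees ORDER BY emp_id"
).fetchall()
conn.close()
```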

Storing data

2. Data from multiple databases

Data Warehouses

(analytics)

bank accounts

credit cards

investments

integrate into common repository

support analytics

Storing data

3. Unstructured Data

high volume

high variety (text, image, video, speech)

high velocity

The era of Big Data

From the invention of writing up to 2003, a span of about five millennia, the total amount of data collected is estimated to be about 5 exabytes. Since 2013, humans have been generating the same amount of data every day!

Storing data

Summary

Relational Databases: structured, optimised for SQL queries

Data Warehouses: structured, curated, optimised for analytics

Data Lakes: big data, uncurated

Storing data

Programming and Engineering

Skills required

Knowledge of Relational Databases

Knowledge of NoSQL Databases

Knowledge of Data Lakes (Hadoop)

Knowledge of Data Warehouses

Processing data

1. Data Wrangling or Data Munging

Record received from the courier service:

{
  "package_contents": "book",
  "delivery date": "03-Jan-2020",
  "delivery time": "19:30:00",
  "receiver": "John Doe"
}

Row stored by XYZ organisation:

item   timestamp    First Name   Last Name
book   1578079800   John         Doe

extract, transform, load
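The transform step for this example can be sketched as follows. It assumes the courier's date and time are in UTC, which is what makes the combined value equal the timestamp shown above; the function name and output keys are illustrative.

```python
from datetime import datetime, timezone

# The courier record from the example above.
courier_record = {
    "package_contents": "book",
    "delivery date": "03-Jan-2020",
    "delivery time": "19:30:00",
    "receiver": "John Doe",
}

def transform(record):
    """Reshape a courier record into the organisation's row layout."""
    stamp = datetime.strptime(
        f"{record['delivery date']} {record['delivery time']}",
        "%d-%b-%Y %H:%M:%S",
    ).replace(tzinfo=timezone.utc)  # assumption: source times are UTC
    first_name, last_name = record["receiver"].split(" ", 1)
    return {
        "item": record["package_contents"],
        "timestamp": int(stamp.timestamp()),  # seconds since the Unix epoch
        "first_name": first_name,
        "last_name": last_name,
    }

row = transform(courier_record)
```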

Processing data

2. Data cleaning

fill missing values

standardise keywords tags

correct spelling errors

identify and remove outliers
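Three of these cleaning steps can be sketched with the standard library. The choices below (median imputation, lower-casing tags, the 1.5 x IQR outlier rule) are common conventions picked for illustration, not the only reasonable ones; spelling correction is left out because it usually needs a dictionary or domain-specific word list.

```python
import statistics

def fill_missing(values):
    """Replace None entries with the median of the known values."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]

def standardise_tag(tag):
    """Lower-case a tag and collapse repeated or stray whitespace."""
    return " ".join(tag.strip().lower().split())

def remove_outliers(values, k=1.5):
    """Drop values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

filled = fill_missing([1, None, 3])
tag = standardise_tag("  Data   Science ")
kept = remove_outliers([10, 12, 11, 13, 12, 11, 500])
```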

Processing data

3. Data scaling, normalising, standardising

scale: convert units (kilometres to miles, rupees to dollars, etc.)

normalise: all values between 0 and 1 (for example, -10, 0, 10, 30 becomes 0, 0.25, 0.5, 1)

standardise: zero mean, unit variance
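A small sketch of the two rescaling operations, using min-max normalisation and the population standard deviation. Both function names are illustrative, and the input reuses the values -10, 0, 10, 30 from the slide.

```python
import statistics

def normalise(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Shift and scale so the result has zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

readings = [-10, 0, 10, 30]
normalised = normalise(readings)      # [0.0, 0.25, 0.5, 1.0]
standardised = standardise(readings)
```

Both functions assume the values are not all identical; otherwise the denominator would be zero.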

Processing data

If data processing is to be performed on Big Data with millions of data items, then performance becomes a key consideration

Distributed Processing

Hadoop (Map Reduce)
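The Map Reduce idea can be illustrated in a single Python process. This is only a sketch of the programming model (map, shuffle, reduce), not the Hadoop API: in a real cluster each chunk would be mapped on a different machine. The chunk contents are hypothetical.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (key, 1) pair for every item in one chunk of the data."""
    return [(item, 1) for item in chunk.split()]

def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical chunks, as if the data were split across three machines.
chunks = ["book pen book", "pen lamp", "book"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
```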

Programming Skills

Skills required

Map Reduce (Hadoop)

SQL and NoSQL Databases

Basic Statistics

Processing data

Describing Data

1. Visualising Data

[Figure: # of shirts by colour (red, green, blue)]

[Figure: sales ($) of mobile, TV and fridge in 2018 and 2019]

[Figure: sales against marketing expense]

Describing Data

2. Summarising Data

What is the typical # of TVs sold daily?

What is the typical variation in # of TVs sold daily?

mean

median

mode

std. deviation

variance

monthly sales record (# of TVs sold daily):

3   7   5   11   9   8   6
4   5   8   9    9   4   3
5   7   6   11   0
3   5   9   8    4   3   5
5   3   4   11   8
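These summaries are available in Python's `statistics` module. The sketch below treats the first row of the sales record (3, 7, 5, 11, 9, 8, 6) as one week of daily TV sales, which is an assumption about how the record is laid out.

```python
import statistics

# Assumed to be one week of daily TV sales (first row of the record).
week_sales = [3, 7, 5, 11, 9, 8, 6]

def summarise(values):
    """Measures of typical value and of typical variation."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),   # all most frequent values
        "variance": statistics.pvariance(values),
        "std_deviation": statistics.pstdev(values),
    }

summary = summarise(week_sales)
```

For this week the mean and median are both 7 and the variance is 6. Every value occurs once, so every value counts as a mode, which shows why the mode is less informative for data like this.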

Describing Data

mean

median

mode

std. deviation

variance

Visualising Data

Summarising Data

Descriptive Statistics

Iterative Process

Exploratory Data Analysis

Describing Data

Skills required

Statistics

Excel

Python

R

Tableau

Modelling Data

Statistical Modelling: Underlying data distribution

Is the new drug effective in reducing blood sugar level?

I think the readings follow a normal distribution (Data Model)

I am 99% sure that the drug is effective (robust guarantee)

[Figure: distribution of blood sugar readings, with the values 150 and 130 marked]
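One way such a claim could be checked is a paired t-test on readings taken before and after treatment. The sketch below uses made-up readings for ten hypothetical patients and compares the t statistic with 2.821, the one-sided 1% critical value of the t distribution with 9 degrees of freedom. The test assumes the differences are roughly normally distributed.

```python
import math
import statistics

# Hypothetical readings for ten patients; not real clinical data.
before = [150, 152, 148, 151, 149, 150, 153, 147, 150, 150]
after = [130, 134, 126, 136, 124, 131, 132, 130, 127, 130]

def paired_t_statistic(before, after):
    """t statistic for the mean drop (before minus after) being zero."""
    diffs = [b - a for b, a in zip(before, after)]
    mean_diff = statistics.fmean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err

T_CRITICAL_99 = 2.821  # one-sided, 1% level, 9 degrees of freedom

t_stat = paired_t_statistic(before, after)
drug_effective = t_stat > T_CRITICAL_99
```

With these invented numbers the statistic is far above the critical value, so the null hypothesis of no reduction would be rejected at the 1% level.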

Modelling Data

Statistical Modelling: Underlying relationships

What is the relationship between blood sugar level and # of days of treatment?

I think there is a linear relationship between the number of days of treatment and blood sugar level (Data Model)

I am 99% sure that the sugar level drops by 3 +/- 1 points for each day of treatment

[Figure: blood sugar level plotted against no. of days of treatment]
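The slope of such a linear model can be estimated by ordinary least squares. The readings below are invented for illustration; the sketch estimates only the slope and intercept, and does not compute the confidence interval behind a statement like "3 +/- 1".

```python
# Hypothetical readings: day of treatment and blood sugar level.
days = [0, 1, 2, 3, 4, 5]
levels = [150, 148, 143, 142, 137, 135]

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(days, levels)
```

For this made-up data the fitted slope is about -3.1, that is, the level drops by roughly 3 points per day of treatment.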

Statistical Modelling

Modelling underlying data distribution

Modelling underlying relations in data

Formulate and test hypotheses

Give statistical guarantees (p-values, goodness-of-fit tests)

Modelling Data

Algorithmic Modelling

In Statistical Modelling, we assumed simple models which allowed robust statistical analysis

Give statistical guarantees (p-values, goodness-of-fit tests)

Modelling Data

Alternative approach: Build complex models

Algorithmic Modelling

Modelling Data

y = f(x)

y: blood sugar level after 30 days

x: [age, weight, height, blood pressure, ...] (one row per patient)

Estimate f using data and optimisation techniques

For a new patient, plug in the value of x to get y

Focus on prediction (don't care about underlying phenomena)
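As one concrete instance of learning f from data, the sketch below uses k-nearest-neighbour regression: the prediction for a new patient is the average outcome of the k most similar known patients. The two features (age, weight) and all values are hypothetical, and k-NN is just one of many possible choices for f.

```python
import math

# Hypothetical training data: (age, weight) -> blood sugar after 30 days.
train_x = [(30, 60), (35, 65), (50, 80), (55, 85), (60, 90)]
train_y = [110, 115, 140, 145, 150]

def knn_predict(train_x, train_y, query, k=3):
    """Average the outcomes of the k training points closest to the query."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k

prediction = knn_predict(train_x, train_y, query=(52, 82))
```

In practice the features would usually be scaled first so that no single feature dominates the distance.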

Modelling Data

Statistical Modelling v/s Algorithmic Modelling

Statistical Modelling                       Algorithmic Modelling
Simple, intuitive models                    Complex, flexible models
More suited for low-dimensional data        Can work with high-dimensional data
Robust statistical analysis is possible     Not suitable for robust statistical analysis
Focus on interpretability                   Focus on prediction
Data lean models                            Data hungry models
More of Statistics                          More of ML, DL

Modelling Data

Statistical Modelling v/s Algorithmic Modelling

Statistical Modelling: Linear Regression, Logistic Regression, Linear Discriminant Analysis

Algorithmic Modelling: Linear Regression, Logistic Regression, Linear Discriminant Analysis, Decision Trees, K-NNs, SVMs, Naive Bayes, Multilayered Neural Networks

When you have large amounts of high-dimensional data and you want to learn very complex relationships between the output and the input, use a specific class of complex ML models and algorithms, collectively referred to as Deep Learning

Modelling Data

Algorithmic Modelling: Deep Learning

Given a picture of the retina, predict whether the patient is suffering from diabetic retinopathy

Popular today because

- large amounts of data with complex relationships

- good software frameworks

- better compute

Modelling Data

the crux of a DS's job!

Inferential Statistics

Skills required

Probability Theory

Calculus

Optimisation algorithms

ML and DL

Python packages and frameworks (numpy, scipy, scikit-learn, TF, PyTorch, Keras)

Why is DS so popular today?

Keen interest in converting data into insights!

1. Data is everywhere

Personal devices

Sensors

Transactional Data (Digital revolution)

Why is DS so popular today?

Within the last decade, the cost of bulk storage has fallen more than sixfold and GPUs have become 100 times more capable!

2. Devices have become more powerful and cheaper

Bulk storage

Specialised hardware

Why is DS so popular today?

Popular open-source frameworks such as TensorFlow and PyTorch provide easy interfaces while hiding complexities such as compilation and optimisation on hardware

3. Democratisation of software and hardware

Software

Why is DS so popular today?

It is relatively easy for a single data scientist to set up complete stacks on the cloud which were beyond the reach of even large companies a few years back!

Cloud compute

Are AI and DS related? If so, how?

What is the confusion?

AI and DS are synonymous

One is a subset of the other

AI and DS are completely unrelated

Confusion arises due to non-technical and broad usage of these terms

Defining AI

AI is about building systems or agents that demonstrate "intelligence" (not a very useful definition)

What are the tasks that constitute AI?

Problem Solving

Knowledge Representation

Reasoning

Decision Making

Perception, Communication, Actuation

Problem Solving

What is involved in problem solving?

[Figure: search tree in which each step branches into a left (L) or a right (R) move]

No data, No modelling

Only needs efficient search algorithms (BFS, DFS, A*)
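A minimal sketch of breadth-first search (BFS) on a small grid puzzle. The grid, the start and goal cells, and the move names are hypothetical; the point is that the solution is found purely by systematic search, with no data and no learned model.

```python
from collections import deque

# '.' is a free cell, '#' is a wall. Hypothetical puzzle.
GRID = [
    "..#",
    "...",
    "#..",
]

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def bfs_path(grid, start, goal):
    """Return a shortest sequence of moves from start to goal, or None."""
    queue = deque([(start, "")])
    visited = {start}
    while queue:
        (row, col), path = queue.popleft()
        if (row, col) == goal:
            return path
        for name, (d_row, d_col) in MOVES.items():
            nxt = (row + d_row, col + d_col)
            inside = 0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
            if inside and grid[nxt[0]][nxt[1]] != "#" and nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + name))
    return None  # goal is unreachable

path = bfs_path(GRID, start=(0, 0), goal=(2, 2))
```

Because BFS explores positions in order of distance from the start, the first path that reaches the goal is a shortest one.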

Knowledge Representation & Reasoning

What happens if the games are more complex?

if there is a lion in the current cell then there is gold in the cell to its left

if the current cell is windy then there is a pit in the adjacent cell

isLion(cell) --> isGold(left(cell))

isWind(cell) --> isPit(near(cell))

No Data. Knowledge representation and reasoning using propositional and first order logic

Decision Making

Expert Systems

hasRash(Patient) AND hasVomiting(Patient) AND hasHighFever(Patient) --> hasDengue(Patient)

Rules given by domain experts

Rules encoded using knowledge representation

Execution of rules and reasoning done by a program

isTempGreater102(Patient) --> hasHighFever(Patient)
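The two rules above can be executed by a very small forward-chaining program. This is a sketch of the idea rather than a full expert system shell: facts are plain strings about a single patient, and each rule is a set of premises with one conclusion.

```python
# Each rule: (set of premises, conclusion). Taken from the two rules above.
RULES = [
    ({"isTempGreater102"}, "hasHighFever"),
    ({"hasRash", "hasVomiting", "hasHighFever"}, "hasDengue"),
]

def infer(facts, rules=RULES):
    """Apply rules repeatedly until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

patient_facts = {"hasRash", "hasVomiting", "isTempGreater102"}
conclusions = infer(patient_facts)
```

The first rule derives the high fever, which in turn lets the second rule fire. The domain expert supplies the rules and the program only carries out the reasoning.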

Decision Making

Limitations of Expert Systems

Rules may be too complex

Rules may be inexpressible

Rules may be unknown

Alternative Approach: Learn from large amounts of data, a.k.a. Machine Learning

Decision Making

Machine Learning

y = f(x)

y: Ebola or not?

x: [age, weight, height, blood pressure, ...] (one row per patient)

Estimate f using data and optimisation techniques

For a new patient, plug in the value of x to get y
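To make "estimate f using data and optimisation" concrete, the sketch below fits a logistic regression classifier by gradient descent. It uses a single hypothetical, already standardised feature and invented labels; a real model would use many features and far more data.

```python
import math

# Hypothetical training data: one standardised feature, label 1 = disease.
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.5, epochs=2000):
    """Estimate weight and bias by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = fit_logistic(xs, ys)

def predict(x):
    """Plug in the feature value of a new patient to get the predicted label."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0
```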

Decision Making

Deep Learning

When you have large amounts of high-dimensional data and you want to learn very complex relationships between the output and the input, use a specific class of complex ML models and algorithms, collectively referred to as Deep Learning

Popular today because

- large amounts of data with complex relationships

- good software frameworks

- better compute

Decision Making

Reinforcement Learning

Dynamic environment

Sequential Decision Making

Partial Information

One-Off Rewards from the environment

No explicit supervision at each step

Decision Making

Reinforcement Learning

Deep Learning

Machine Learning

This data-driven part of AI intersects with the world of Data Science

Communication, Perception, Actuation

Communication using Language

Natural Language Processing (Natural Language Generation, Natural Language Understanding)

Modern NLP is completely data-driven

1950: Expert Systems, 1980: Machine Learning, 2010: Deep Learning

Communication, Perception, Actuation

Perception using Vision, Speech

Speech Technology

Computer Vision

Modern CV and Speech are completely data-driven

1950: Expert Systems, 1980: Machine Learning, 2010: Deep Learning

Communication, Perception, Actuation

Actuation with Physical Robots

Reinforcement Learning

Robotics

Increasingly data-driven: robots can learn to perform complex actuations by learning from simulations or by mimicking human examples

Speech Technology

Computer Vision

Natural Language Processing

This data-driven part of AI intersects with the world of Data Science

Communication, Perception, Actuation

(a part of) Robotics

Are AI and DS related? If so, how?

AI: I want an intelligent agent! What do I do?

Problem Solving

Knowledge Representation

Reasoning

Decision Making

Perception, Communication, Actuation

DS: I have data. What do I do with it?

collect

process

store

describe

model

The data-driven part of AI is where the two intersect

The Myths of Data Science

World Peace!

Myth #1: Machine does everything

What to collect?

Where to collect?

How to collect?

What schema?

Which file system?

Label data

Study and integrate multiple formats

Domain knowledge

What to clean?

How to clean?

Which columns?

Which plots?

Study trends

Hypothesise

Propose models

Oversee training

Estimate parameters

Execute scripts

Physical storage

Execute scripts

Execute scripts

The Myths of Data Science

Myth #2: DS requires Big Data and DL

Data Science

Example: A rural school with data of less than 500 students

Do more girls drop out of school than boys?

Do students really find maths to be harder than social science?

Do students staying farther from school perform poorly?

Statistics

Big data

Deep Learning

Hardware

The Myths of Data Science

Myth #3: DS is always successful

Data Science

Reasons why it could fail

No meaningful insights in data

Not enough data

No actionable insights in data

Noisy data

If the right amount of clean, usable data is available, if skilled data scientists with technical and domain knowledge are available, and if the organisation has the capacity and resources to act on the insights generated from the data, then data science can be successful and impactful.

The Path to Data Science

Python Packages

Prog. & Databases

Descriptive Statistics

Probability Theory

Inferential Statistics

Statistical Modelling

Functions

Calculus

Linear Algebra

Probability Theory (Adv.)

Optimisation

Machine Learning

Deep Learning

Pre-requisites

Foundations of DS

Foundations of ML

ML

DL

Information Theory

What is Data Science?

By One Fourth Labs
