The Data Scientist Endeavour

Contents


  • What is a Data Scientist

  • Defining a Data Science product

  • Real life examples

What is a Data Scientist

The term that seemed to fit best was data scientist: those who use data and science to create something new

DJ Patil, 2011

... on any given day a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service on Hadoop, or communicate the results and analysis to other members of the organization.

Jeff Hammerbacher, 2009

Drew Conway

Defining a Data Science Product

From our previous definition...

Data scientists are those who use data and science to create something new. Thus:

A data science product is one that uses data, together with the scientific method, to create knowledge.

This method consists of the following steps:

 

  1. Observe a data-based phenomenon
  2. Postulate a conjecture or hypothesis
  3. Test your hypothesis
  4. Return to step 2 and iterate until 'convergence'
  5. Communicate results

Observe a data-based phenomenon

The main objective of this step is to familiarize yourself with the phenomenon you are trying to model. Data processing takes up roughly 90% of the time.

curl -sL http://www.gutenberg.org/cache/epub/17192/pg17192.txt | tr '[:upper:]' '[:lower:]' | grep -Eo '[a-z]+' | sort | uniq -c | sort -nr | head
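The same word count can be sketched in Python. This is a minimal illustration rather than a translation of the exact pipeline: the function name `top_words` and the sample sentence are assumptions made for this example, and it works on an in-memory string instead of downloading the book.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words in text, ignoring case."""
    # Lowercase first so that "The" and "the" are counted as the same word.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(n)


# Hypothetical sample text; the shell pipeline reads a Project Gutenberg book instead.
sample = "The cat saw the dog. The dog saw the cat and the bird."
print(top_words(sample, 3))
```

Even a count this simple is a first observation of the data: the most frequent words are usually function words, which already hints at the cleaning needed later.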


The main components of this stage:

 

- Exploratory data analysis

- Visualizations


Exploratory data analysis

- Single feature transformations

- Multiple feature transformations

- Dealing with missing values

- Removing predictors

- Adding predictors

 

Single feature transformations

Centering and scaling

 

- Subtract the mean from the values

- Divide the values by the standard deviation
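The two steps above can be sketched with the standard library. The function name `standardize` and the sample heights are assumptions for this example, and the sample standard deviation (n - 1 denominator) is one possible choice of scale.

```python
import statistics


def standardize(values):
    """Center values on their mean and scale them by their sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if std == 0:
        raise ValueError("cannot scale a constant feature")
    return [(v - mean) / std for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical feature
z = standardize(heights_cm)
print(z)
```

After the transformation the feature has mean 0 and standard deviation 1, so features measured in different units become comparable.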

 

 

Removing skewness

- Apply logarithms

- Apply the Box-Cox transform
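Both options can be illustrated with a short sketch. The Box-Cox transform is (x^λ - 1) / λ for λ ≠ 0 and log(x) for λ = 0, defined for strictly positive values. In practice λ is usually estimated from the data; here it is passed in by hand, and the `incomes` sample is a made-up right-skewed feature.

```python
import math


def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]  # lambda = 0 is the log transform
    return [(v ** lam - 1) / lam for v in values]


incomes = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 8.0, 20.0, 100.0]  # hypothetical skewed feature
print(skewness(incomes))                # skewness of the raw values
print(skewness(box_cox(incomes, 0)))    # after the log transform
print(skewness(box_cox(incomes, 0.5)))  # after a square-root-like transform
```

Comparing the printed skewness values before and after is a quick way to check whether a transform actually helped.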

 

Multiple feature transformations

Removing outliers

- Check data distribution (cross trend, in trend, fringe)

- Spatial sign transform
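The spatial sign transform listed above divides each sample by its Euclidean norm, so every sample ends up on the unit sphere and distant outliers are pulled in. A minimal sketch follows, assuming the predictors were centered and scaled beforehand; the `samples` data is hypothetical.

```python
import math


def spatial_sign(row):
    """Project one sample onto the unit sphere by dividing it by its Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0:
        return list(row)  # the origin has no direction; leave it unchanged
    return [v / norm for v in row]


# Hypothetical samples, assumed to be centered and scaled already.
samples = [[0.5, -0.5], [1.0, 1.0], [30.0, 40.0]]  # the last row is an outlier
for row in samples:
    print(spatial_sign(row))
```

Because the transform mixes all predictors of a sample, it belongs with the multiple feature transformations rather than the single feature ones.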

Feature construction

- PCA

- PLS

- t-SNE
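As an illustration of the first item only, the sketch below estimates the first principal component with power iteration on the sample covariance matrix. It is a teaching sketch under stated assumptions (pure Python, a fixed number of iterations, an arbitrary starting vector, hypothetical `points`), not how a numerical library would implement PCA; PLS and t-SNE are not shown.

```python
import math


def first_principal_component(rows, iterations=100):
    """Estimate the top principal direction by power iteration and return it with the scores."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d  # arbitrary starting vector (an assumption of this sketch)
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    # Project every centered sample onto the direction to get the new feature.
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, scores


points = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.1]]  # hypothetical data
direction, scores = first_principal_component(points)
print(direction)
print(scores)
```

The two original predictors are almost perfectly aligned, so a single constructed feature (the scores) carries nearly all of their variance.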

Dealing with missing values

What type of missing values are we dealing with?

- MCAR (missing completely at random)

- MAR (missing at random: the missingness can be explained by observed variables)

- NMAR (not missing at random: the missingness cannot be explained by observed variables)
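A small simulation shows why the type of missingness matters. All numbers here are made up for illustration: a normally distributed feature, a 30% drop rate for the MCAR case, and a cutoff of 60 for the NMAR case. MAR is not simulated, since it needs a second observed variable.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible
values = [rng.gauss(50.0, 10.0) for _ in range(10000)]  # hypothetical feature

# MCAR: every value has the same 30% chance of being missing.
mcar = [v for v in values if rng.random() >= 0.3]

# NMAR: missingness depends on the unobserved value itself (large values are hidden).
nmar = [v for v in values if v <= 60.0]

print(statistics.mean(values))  # mean of the complete data
print(statistics.mean(mcar))    # stays close to the complete-data mean
print(statistics.mean(nmar))    # biased downwards
```

Under MCAR, simply dropping incomplete rows loses data but does not bias the mean; under NMAR the remaining data is systematically different from the missing part.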

Removing predictors

- Collinearity (highly correlated predictors)

- Low variance (near-constant predictors)
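Both criteria can be sketched as a simple filter. The function names, the thresholds (`min_variance`, `max_abs_corr`) and the `features` table are assumptions for this example; suitable thresholds depend on the problem.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def predictors_to_drop(features, min_variance=1e-8, max_abs_corr=0.95):
    """Flag near-constant predictors, then the later one of each highly correlated pair."""
    drop = [name for name, col in features.items()
            if statistics.pvariance(col) < min_variance]
    kept = [name for name in features if name not in drop]
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            if a in drop or b in drop:
                continue
            if abs(pearson(features[a], features[b])) > max_abs_corr:
                drop.append(b)
    return drop


features = {  # hypothetical predictors
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # almost a rescaled copy of height_cm
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
    "weight_kg": [70.0, 55.0, 80.0, 62.0, 75.0],
}
print(predictors_to_drop(features))
```

Constant columns are removed first because their correlation with anything else is undefined (the denominator would be zero).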

Adding predictors

- Knowledge of the problem, e.g. h × w = A (height times width gives area)

- Dummy variables
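Both ways of adding predictors can be sketched on plain dictionaries. The record layout (`height`, `width`, `kind`) and the function names are hypothetical and chosen only for this example.

```python
def add_area(rows):
    """Add area = height * width, a predictor derived from knowledge of the problem."""
    return [dict(row, area=row["height"] * row["width"]) for row in rows]


def add_dummies(rows, column):
    """Replace a categorical column with one 0/1 indicator per observed category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


rooms = [  # hypothetical records
    {"height": 2.0, "width": 3.0, "kind": "kitchen"},
    {"height": 4.0, "width": 5.0, "kind": "bedroom"},
]
print(add_dummies(add_area(rooms), "kind"))
```

For linear models with an intercept, one of the dummy columns is usually dropped, because the full set sums to 1 for every row and is therefore collinear with the intercept.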

Visualizations


Postulate a conjecture or hypothesis

Model construction

- Local Models

    - k-nearest neighbors

- Global Models

    - Linear regression

    - SVM

    - Tree based models

    - Neural Networks
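The difference between a local and a global model can be seen with one example of each: k-nearest neighbors averages nearby training targets, while linear regression fits a single trend to all the data. This is a one-predictor sketch with hypothetical data and function names, not a general implementation.

```python
def knn_predict(train_x, train_y, x, k=3):
    """Local model: average the targets of the k training points nearest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def fit_line(train_x, train_y):
    """Global model: least-squares slope and intercept for one predictor."""
    n = len(train_x)
    mean_x, mean_y = sum(train_x) / n, sum(train_y) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
    sxx = sum((x - mean_x) ** 2 for x in train_x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical training data
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
slope, intercept = fit_line(xs, ys)
print(knn_predict(xs, ys, 10.0))  # stays near the closest observed targets
print(slope * 10.0 + intercept)   # extrapolates the fitted global trend
```

Far from the training data the two predictions diverge: the local model can only repeat what it has seen nearby, while the global model extends its assumed shape.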

Test Hypothesis

- Error measures

    - log-loss

    - F1 score

    - Square loss

    - AUC

- Types of error

    - Bias

    - Variance
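The four error measures listed above can be written in a few lines each for the binary case. The labels, probabilities and the 0.5 decision threshold below are hypothetical; the AUC is computed by direct pair counting, which is fine for small samples but slow for large ones.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 1)
    fp = sum(1 for y, yh in zip(y_true, y_pred) if y == 0 and yh == 1)
    fn = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def squared_loss(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 1, 0]               # hypothetical ground truth
probs = [0.9, 0.2, 0.6, 0.4, 0.5]      # hypothetical predicted probabilities
hard = [int(p >= 0.5) for p in probs]  # the 0.5 threshold is an assumption
print(log_loss(labels, probs), f1_score(labels, hard))
print(squared_loss(labels, probs), auc(labels, probs))
```

Log-loss, squared loss and AUC work on the predicted probabilities, while the F1 score needs hard class decisions and therefore depends on the chosen threshold.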

Communicate

Real Life Examples

Crime prediction


Reading Machines

- Lexical Analysis

- Syntactic Analysis

- Semantic Analysis

Opinion extraction


Recommender systems


Time series forecasting

Thanks!


By Luis Roman
