The Data Scientist Endeavour
Contents
- What is a Data Scientist
- Defining a Data Science product
- Real life examples
What is a Data Scientist
The term that seemed to fit best was data scientist: those who use data and science to create something new
DJ Patil, 2011
... on any given day a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service on Hadoop, or communicate the results and analysis to other members of the organization.
Jeff Hammerbacher, 2009

Drew Conway
Defining a Data Science Product
From our previous definition...
Those who use data and science to create something new. Thus:
A data science product is one that uses data, together with the scientific method, to create knowledge.
The scientific method consists of the following steps:
1. Observe a data-based phenomenon
2. Postulate a conjecture or hypothesis
3. Test your hypothesis
4. Iterate from step 2 until 'convergence'
5. Communicate results
Observe a data-based phenomenon
The main objective of this step is to familiarize yourself with the phenomenon you are trying to model. Expect data processing to take about 90% of the time.




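# Ten most frequent words in a Project Gutenberg text (lowercased first so "The" and "the" count together)
curl -sL http://www.gutenberg.org/cache/epub/17192/pg17192.txt | tr '[:upper:]' '[:lower:]' | grep -Eo '[a-z]+' | sort | uniq -c | sort -nr | head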
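The same word count can be sketched in Python. This is a minimal illustration, not the deck's own code: `top_words` and the sample sentence are hypothetical, and lowercasing the text before matching is an assumption of this sketch.

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most frequent lowercase alphabetic words in text."""
    # Lowercase first so "The" and "the" are counted as the same word.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(n)

# Hypothetical sample text; in practice this would be the downloaded book.
sample = "The raven sat. The raven spoke, and the night listened."
print(top_words(sample, 2))
```

Passing the text in as a string keeps the counting step separate from the download, so it can be tried on any local file.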
The main components of this stage:
- Exploratory data analysis
- Visualizations
Exploratory data analysis
- Single feature transformations
- Multiple feature transformations
- Dealing with missing values
- Removing predictors
- Adding predictors
Single feature transformations
Centering and scaling
- Subtract the mean from the values
- Divide the values by the standard deviation
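Centering and scaling can be sketched with the standard library alone. The function name `standardize` and the sample heights are hypothetical, and using the sample standard deviation (n - 1 denominator) is an assumption; some tools use the population version instead.

```python
from statistics import mean, stdev

def standardize(values):
    """Center on the mean and scale by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        raise ValueError("cannot scale a constant feature")
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights)
```

After the transform the feature has mean 0 and standard deviation 1, which puts features measured in different units on a comparable scale.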

Removing Skewness
- Apply logarithms
- Apply the Box-Cox transform
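A minimal sketch of the Box-Cox transform for a fixed parameter `lam`, where `lam = 0` reduces to the logarithm. The name `box_cox` and the sample values are hypothetical; in practice the parameter is usually estimated from the data, which this sketch does not attempt.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform with a fixed lambda; values must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]  # limiting case: plain logarithm
    return [(v ** lam - 1.0) / lam for v in values]

# Hypothetical right-skewed feature.
incomes = [1.0, 10.0, 100.0, 1000.0]
logged = box_cox(incomes, 0)    # log transform
rooted = box_cox(incomes, 0.5)  # shifted and scaled square root
```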



Removing outliers
- Check data distribution (cross trend, in trend, fringe)
- Spatial sign transform
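The spatial sign transform divides each sample by its Euclidean norm, so every sample lands on the unit sphere and distant outliers are pulled in. A minimal sketch follows; `spatial_sign` and the sample rows are hypothetical, and the predictors are assumed to be centered and scaled already.

```python
import math

def spatial_sign(row):
    """Project one sample onto the unit sphere: x / ||x||."""
    norm = math.sqrt(sum(x * x for x in row))
    if norm == 0:
        return list(row)  # the origin has no direction; leave it unchanged
    return [x / norm for x in row]

# Hypothetical samples; the last one is an outlier in the same direction as the second.
samples = [[0.5, -0.5], [3.0, 4.0], [30.0, 40.0]]
projected = [spatial_sign(r) for r in samples]
```

Note that the transform acts on a whole sample at once, so it uses all predictors of that sample together.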
Multiple feature transformations

Feature construction
- PCA
- PLS
- t-SNE
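As an illustration of the first item, here is a minimal sketch that finds the first principal component by power iteration on the sample covariance matrix. This is one simple way to compute it, not necessarily what a library would do; `first_principal_component` and the sample points are hypothetical. PLS and t-SNE are not covered by this sketch.

```python
import math

def first_principal_component(rows, iterations=200):
    """Leading eigenvector of the sample covariance matrix via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d  # arbitrary non-zero starting direction (an assumption)
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break  # no variance along the current direction
        vec = [x / norm for x in nxt]
    return vec

# Hypothetical points lying exactly on the line y = 2x.
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc1 = first_principal_component(points)
```

For these points the returned unit vector points along the line y = 2x. Power iteration can fail if the starting vector happens to be orthogonal to the leading direction, so a random start is a common safeguard.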
Missing values
What type of missing values are we dealing with?
- MCAR (missing completely at random)
- MAR (missing at random; the missingness can be explained by observed variables)
- NMAR (not missing at random; the missingness cannot be explained by observed variables)
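One simple way to handle missing values is mean imputation with an added missingness flag, sketched below. The technique is swapped in for illustration and is generally defensible mainly when values are MCAR; `impute_mean` and the sample ages are hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Replace None with the mean of the observed values; also return a 0/1 missingness flag."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

# Hypothetical feature with two missing entries.
ages = [30.0, None, 50.0, 40.0, None]
filled, flags = impute_mean(ages)
```

Keeping the flag as an extra predictor lets a model pick up on the missingness itself, which can matter when the values are not missing completely at random.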
Removing predictors
- Collinearity
- Low variance
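Both criteria for removing predictors can be sketched as a single filter: drop near-constant columns, then drop any column that is highly correlated with one already kept. The names `pearson` and `select_predictors`, the thresholds, and the sample data are all hypothetical, and pairwise correlation only catches collinearity between pairs of predictors.

```python
import math
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant columns."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_predictors(columns, min_variance=1e-8, max_abs_corr=0.95):
    """columns maps name -> list of values; returns the names that are kept."""
    kept = []
    for name, col in columns.items():
        if pvariance(col) <= min_variance:
            continue  # drop near-constant predictors
        if any(abs(pearson(col, columns[k])) > max_abs_corr for k in kept):
            continue  # drop predictors nearly collinear with one already kept
        kept.append(name)
    return kept

# Hypothetical data set.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],  # almost a rescaling of height_cm
    "constant": [1.0, 1.0, 1.0, 1.0],
    "weight_kg": [70.0, 55.0, 80.0, 60.0],
}
kept = select_predictors(data)
```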


Adding predictors
- Knowledge of the problem: h × w = A
- Dummy variables
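Both ways of adding predictors can be sketched briefly. Reading h, w and A as height, width and area is an assumption of this example, and `add_area`, `dummy_encode` and the sample rows are hypothetical.

```python
def add_area(rows):
    """Add a derived 'area' predictor from 'h' and 'w' (h * w = A)."""
    return [dict(r, area=r["h"] * r["w"]) for r in rows]

def dummy_encode(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    levels = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        out = {k: v for k, v in r.items() if k != column}
        for level in levels:
            out[f"{column}_{level}"] = 1 if r[column] == level else 0
        encoded.append(out)
    return encoded

# Hypothetical rows.
rooms = [
    {"h": 2.0, "w": 3.0, "kind": "kitchen"},
    {"h": 4.0, "w": 5.0, "kind": "bedroom"},
]
features = dummy_encode(add_area(rooms), "kind")
```

For linear models with an intercept, one of the dummy columns is often dropped to avoid perfect collinearity; this sketch keeps all of them.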
Visualizations

Postulate a conjecture or hypothesis
Model construction
- Local models
  - k-nearest neighbors
- Global models
  - Linear regression
  - SVM
  - Tree-based models
  - Neural networks
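As an example of a local model, here is a minimal k-nearest neighbors classifier. It is a sketch under simple assumptions (Euclidean distance, majority vote, Python 3.8+ for `math.dist`); `knn_predict` and the training data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: two well-separated clusters.
train_x = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
train_y = ["a", "a", "a", "b", "b", "b"]
```

The model is local because each prediction depends only on the training points near the query, whereas a global model such as linear regression fits one set of parameters to all the data.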
Test Hypothesis
- Error measures
  - Log-loss
  - F1 score
  - Square loss
  - AUC
- Types of error
  - Bias
  - Variance
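The four measures listed above can be sketched in plain Python for the binary case. All function names are hypothetical, labels are assumed to be 0 or 1, and the AUC is computed by direct pairwise comparison, which is simple but slow on large data.

```python
import math

def square_loss(y_true, y_pred):
    """Mean squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half.

    Assumes at least one positive and one negative label.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Log-loss and square loss are lower-is-better, while F1 and AUC are higher-is-better, so they are scores rather than errors in the strict sense.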


Communicate

Real Life Examples
Crime prediction

Reading Machines
- Lexical Analysis
- Syntactic Analysis
- Semantic Analysis

Opinion extraction

Recommender systems

Time series forecasting

Thanks!
By Luis Roman