The Data Scientist Endeavour
Contents
- What is a Data Scientist
- Defining a Data Science Product
- Real Life Examples
What is a Data Scientist
The term that seemed to fit best was data scientist: those who use data and science to create something new
DJ Patil, 2011
... on any given day a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service on Hadoop, or communicate the results and analysis to other members of the organization.
Jeff Hammerbacher, 2009
Drew Conway
Defining a Data Science Product
From our previous definition...
Those who use data and science to create something new. Thus:
A data science product is one that uses data, together with the scientific method, to create knowledge.
From our previous definition...
The scientific method consists of the following steps:
- Observe a data-based phenomenon
- Postulate a conjecture or hypothesis
- Test your hypothesis
- Iterate from step 2 (postulating a hypothesis) until 'convergence'
- Communicate results
Observe a data-based phenomenon
The main objective of this step is to familiarize yourself with the phenomenon you are trying to model. Expect data processing to take about 90% of the time.
curl http://www.gutenberg.org/cache/epub/17192/pg17192.txt | grep -Eo '[a-z]+' | sort | uniq -c | sort -nr | head
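For readers who prefer Python, the shell pipeline above can be sketched roughly as follows. This is an approximate equivalent, not the author's code: `top_words` is a hypothetical helper name, and the text is passed in as a string (a made-up sample here) so the example runs without network access.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent runs of lowercase letters in text.

    Mirrors: grep -Eo '[a-z]+' | sort | uniq -c | sort -nr | head
    Like the grep pattern, it does not lowercase first, so "The" yields "he".
    """
    words = re.findall(r"[a-z]+", text)
    return Counter(words).most_common(n)


sample = "the raven and the bust; the raven still sat"  # made-up sample text
print(top_words(sample, 3))
```

Lowercasing the text before matching would be a natural next step if capitalized words should be counted together with their lowercase forms.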
Observe a data-based phenomenon
The main components of this stage:
- Exploratory data analysis
- Visualizations
Observe a data-based phenomenon
Exploratory data analysis
- Single feature transformations
- Multiple feature transformations
- Dealing with missing values
- Removing predictors
- Adding predictors
Single feature transformations
Centering and scaling
- Subtract the mean from the values
- Divide the values by the standard deviation
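Centering and scaling can be sketched in a few lines. This is a minimal illustration using only the standard library; `standardize` and the `heights` data are made up for the example, and the sample standard deviation (n - 1 denominator) is an assumption, since the slide does not say which one is meant.

```python
import statistics


def standardize(values):
    """Center values on their mean and scale by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]


heights = [150.0, 160.0, 170.0, 180.0, 190.0]  # made-up data
scaled = standardize(heights)
print(scaled)  # the result has mean 0 and standard deviation 1
```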
Removing Skewness
- Apply logarithms
- Apply a Box-Cox transform
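A rough sketch of both options for removing skewness. The Box-Cox transform is (x^lambda - 1) / lambda for lambda != 0 and log(x) for lambda = 0, so the logarithm is a special case. Assumptions: `box_cox`, `skewness` and the `incomes` data are illustrative names and values, and lambda is passed in by hand, whereas in practice it is usually estimated from the data.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform for strictly positive values.

    lam == 0 gives the natural logarithm; otherwise (x**lam - 1) / lam.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


incomes = [1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 100.0]  # made-up, right-skewed
print(skewness(incomes), skewness(box_cox(incomes, 0)))
```

On this sample the log-transformed values are noticeably less skewed than the raw ones.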
Single feature transformations
Removing outliers
- Check data distribution (cross trend, in trend, fringe)
- Spatial Sign transform
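The spatial sign transform divides each sample by its Euclidean norm, so every sample ends up at distance 1 from the origin and outliers lose their leverage. A minimal sketch follows; `spatial_sign` and the `samples` data are made up, and the predictors are assumed to be centered and scaled already.

```python
import math


def spatial_sign(row):
    """Project one sample (a list of predictor values) onto the unit sphere."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0:
        return list(row)  # the zero vector has no direction; leave it as is
    return [v / norm for v in row]


samples = [[0.5, -0.2], [0.1, 0.3], [30.0, 40.0]]  # last row is an outlier
print([spatial_sign(s) for s in samples])
```

Because the transform acts on a whole sample at once, it needs all predictors of that sample together rather than one column at a time.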
Multiple feature transformations
Feature construction
- PCA
- PLS
- t-SNE
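To make the idea of feature construction concrete, here is a small sketch of the first item, PCA, restricted to the first principal component. It uses power iteration on the sample covariance matrix with only the standard library; this is one possible way to compute it, not necessarily what a library does. `first_principal_component` and the `points` data are made up, and the sketch assumes the data has non-zero variance and that the starting direction is not orthogonal to the answer.

```python
import math


def first_principal_component(rows, iterations=200):
    """Approximate the first principal component by power iteration
    on the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    vec = [1.0] * d  # arbitrary starting direction
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec, means


points = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]  # made-up
pc1, means = first_principal_component(points)

# The constructed feature: each point's coordinate along the component.
scores = [sum((p[j] - means[j]) * pc1[j] for j in range(2)) for p in points]
print(pc1, scores)
```

The two original columns are almost perfectly correlated here, so the single constructed feature in `scores` carries nearly all of their variation.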
Missing values
What type of missing values are we dealing with?
- MCAR (missing completely at random)
- MAR (missing at random; can be controlled for with observed variables)
- NMAR (not missing at random; cannot be controlled for with observed variables)
Removing predictors
- Collinearity
- Low variance
Adding predictors
- Knowledge of the problem (e.g. height x width = area: h x w = A)
- Dummy variables
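The last three exploratory steps (dealing with missing values, removing predictors, adding predictors) can be sketched together on a tiny table. Everything here is illustrative: the table and its column names are made up, mean imputation is a crude choice shown only for simplicity, and only the low-variance check is implemented (a collinearity check is left out).

```python
import statistics

# Made-up table: one dict per observation; None marks a missing value.
rows = [
    {"height": 2.0, "width": 3.0, "unit": 1.0, "colour": "red"},
    {"height": 4.0, "width": None, "unit": 1.0, "colour": "blue"},
    {"height": 6.0, "width": 5.0, "unit": 1.0, "colour": "red"},
]

# Missing values: replace each missing width with the mean of the known ones.
known = [r["width"] for r in rows if r["width"] is not None]
for r in rows:
    if r["width"] is None:
        r["width"] = statistics.mean(known)

# Removing predictors: drop numeric columns with zero variance.
for col in ["height", "width", "unit"]:
    if statistics.pvariance([r[col] for r in rows]) == 0:
        for r in rows:
            del r[col]

# Adding predictors from knowledge of the problem: area = height x width.
for r in rows:
    r["area"] = r["height"] * r["width"]

# Adding predictors: one dummy (0/1) variable per category level.
levels = sorted({r["colour"] for r in rows})
for r in rows:
    for level in levels:
        r["colour_" + level] = int(r["colour"] == level)

print(rows)
```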
Visualizations
Postulate a conjecture or hypothesis
Model construction
- Local models
  - k-nearest neighbors
- Global models
  - Linear regression
  - SVM
  - Tree-based models
  - Neural networks
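The difference between a local and a global model shows up clearly in a one-dimensional sketch. The k-nearest-neighbors predictor only looks at training points close to the query, while linear regression fits one set of parameters to all the data. `knn_predict`, `fit_line` and the data are made up for illustration; real work would normally use a library implementation.

```python
import statistics


def knn_predict(xs, ys, x_new, k=3):
    """Local model: average the targets of the k nearest training points."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return statistics.mean(y for _, y in nearest)


def fit_line(xs, ys):
    """Global model: ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up data on the line y = 2x + 1
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
intercept, slope = fit_line(xs, ys)
print(knn_predict(xs, ys, 10.0), intercept + slope * 10.0)  # local: 9.0, global: 21.0
```

Far from the training data the local model can only repeat what its nearest neighbors say, while the global model extrapolates the fitted trend.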
Test Hypothesis
- Error measures
  - Log-loss
  - F1 score
  - Square loss
  - AUC
- Types of error
  - Bias
  - Variance
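The four error measures can be written out directly for a binary classification problem. This is a plain standard-library sketch, not a reference implementation: the function names, the made-up `labels` and `probs`, and the 0.5 decision threshold are all assumptions for the example. Bias and variance are not computed here.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    fp = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 1)
    fn = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def squared_loss(y_true, y_pred):
    """Mean squared difference between targets and predictions."""
    return sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 1, 0]            # made-up ground truth
probs = [0.9, 0.2, 0.6, 0.4, 0.7]   # made-up predicted probabilities
preds = [int(p >= 0.5) for p in probs]  # assumed threshold of 0.5
print(log_loss(labels, probs), f1_score(labels, preds),
      squared_loss(labels, probs), auc(labels, probs))
```

Log-loss, square loss and AUC work on the predicted probabilities, while the F1 score needs hard class predictions and therefore depends on the chosen threshold.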
Communicate
Real Life Examples
Crime prediction
Reading Machines
- Lexical Analysis
- Syntactic Analysis
- Semantic Analysis
Opinion extraction
Recommender systems
Time series forecasting
Thanks!
By Luis Roman