The term that seemed to fit best was data scientist: those who use data and science to create something new
DJ Patil, 2011
... on any given day a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service on Hadoop, or communicate the results and analysis to other members of the organization.
Jeff Hammerbacher, 2009
Drew Conway
Those who use data and science to create something new. Thus:
A data science product is one that uses data, together with the scientific method, to create knowledge.
This method consists of the following steps:
The main objective of this step is to familiarize yourself with the phenomenon you are trying to model. Data processing takes up about 90% of the time.
curl -s http://www.gutenberg.org/cache/epub/17192/pg17192.txt | tr '[:upper:]' '[:lower:]' | grep -Eo '[a-z]+' | sort | uniq -c | sort -nr | head
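The same word-frequency count can be sketched in Python. This is only an illustration: the function name `top_words` and the sample sentence are assumptions, and the text is passed in as a string rather than downloaded.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lowercase first so that "The" and "the" are counted together.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(n)


# Hypothetical sample text standing in for the downloaded book.
sample = "The cat saw the dog. The dog saw the cat run."
print(top_words(sample, 3))
```

Like the shell pipeline, this treats any run of ASCII letters as a word, so accented characters and apostrophes split words apart.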
The main components of this stage:
- Exploratory data analysis
- Visualizations
Exploratory data analysis
- Single feature transformations
- Multiple feature transformations
- Dealing with missing values
- Removing predictors
- Adding predictors
Centering and scaling
- Subtract the mean from the values
- Divide the values by the standard deviation
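A minimal sketch of centering and scaling using only the standard library. The function name `standardize` and the sample heights are illustrative assumptions; the sample standard deviation (n - 1 denominator) is one common choice, not the only one.

```python
from statistics import mean, stdev


def standardize(values):
    """Center values on zero and scale them to unit standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        raise ValueError("cannot scale a constant feature")
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights_cm)
print(z)
```

When the data is split into training and test sets, the mean and standard deviation would normally be computed on the training set only and reused on the test set.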
Removing Skewness
- Apply logarithms
- Apply the Box-Cox transform
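Both options can be sketched with one function, since the logarithm is the Box-Cox transform with lambda = 0. In this sketch lambda is fixed by hand, which is an assumption: in practice it is usually estimated from the data. The names `box_cox` and `skewness` and the sample values are illustrative.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform for strictly positive values and a fixed lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed feature.
incomes = [1.0, 2.0, 2.0, 3.0, 4.0, 8.0, 50.0]
logged = box_cox(incomes, 0)
print(skewness(incomes), skewness(logged))
```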
Removing outliers
- Check data distribution (cross trend, in trend, fringe)
- Spatial sign transform
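The spatial sign transform divides each sample by its Euclidean norm, so every sample lands on the unit sphere and extreme samples lose their leverage. A minimal sketch follows; the function name and sample values are assumptions, and the predictors are assumed to be centered and scaled beforehand.

```python
import math


def spatial_sign(row):
    """Project one sample onto the unit sphere: x / ||x||."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0:
        return list(row)  # the origin has no direction; leave it unchanged
    return [v / norm for v in row]


# Hypothetical samples, assumed already centered and scaled.
# The last one is an outlier far from the others.
samples = [[0.5, -0.5], [1.0, 1.0], [30.0, 40.0]]
projected = [spatial_sign(r) for r in samples]
print(projected)
```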
Feature construction
- PCA
- PLS
- t-SNE
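As an illustration of the first item, the leading principal component can be estimated with power iteration on the covariance matrix. This is a sketch under assumptions, not a full PCA: it finds only one component, the function names and data are made up, and the all-ones starting vector would fail if it happened to be orthogonal to the leading direction.

```python
import math


def first_principal_component(rows, iterations=200):
    """Estimate the leading PCA direction with power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize to unit length.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, centered


def project(centered, direction):
    """Score each centered sample along the given direction."""
    return [sum(c[j] * direction[j] for j in range(len(direction)))
            for c in centered]


# Hypothetical data: two strongly correlated features.
points = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
pc1, centered = first_principal_component(points)
scores = project(centered, pc1)
print(pc1, scores)
```

The one-dimensional `scores` can then replace the two original features as a constructed predictor.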
What type of missing values are we dealing with?
- MCAR (missing completely at random)
- MAR (missing at random, controlled with observed variables)
- NMAR (not missing at random, cannot be controlled with observed variables)
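The three mechanisms can be made concrete with a small simulation. Everything here is a hypothetical setup: the `(age, income)` rows, the thresholds, and the drop probabilities are assumptions chosen only to show what the missingness depends on in each case.

```python
import random


def make_missing(rows, mechanism, rng):
    """Blank out income under a given missingness mechanism.

    rows are (age, income) pairs; age is always observed.
    """
    out = []
    for age, income in rows:
        if mechanism == "MCAR":
            # Independent of every variable, observed or not.
            drop = rng.random() < 0.3
        elif mechanism == "MAR":
            # Depends only on age, which is observed.
            drop = age > 50 and rng.random() < 0.6
        elif mechanism == "NMAR":
            # Depends on the value that goes missing.
            drop = income > 60000 and rng.random() < 0.6
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append((age, None if drop else income))
    return out


data_rng = random.Random(0)
rows = [(data_rng.randint(20, 80), data_rng.randint(20000, 100000))
        for _ in range(200)]
mar = make_missing(rows, "MAR", random.Random(1))
nmar = make_missing(rows, "NMAR", random.Random(1))
print(sum(inc is None for _, inc in mar), sum(inc is None for _, inc in nmar))
```

In the MAR case the pattern can be recovered from the observed ages; in the NMAR case nothing in the remaining data reveals why the incomes are missing.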
Removing predictors
- Collinearity
- Low variance
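A minimal sketch of filtering predictors on these two criteria. The thresholds, function names, and data are assumptions; the correlation filter is greedy and only catches pairwise correlation, not collinearity among three or more predictors.

```python
import math


def variance(col):
    """Sample variance of one predictor."""
    mu = sum(col) / len(col)
    return sum((v - mu) ** 2 for v in col) / (len(col) - 1)


def correlation(a, b):
    """Pearson correlation between two non-constant predictors."""
    mu_a, mu_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b))
    ss_a = sum((x - mu_a) ** 2 for x in a)
    ss_b = sum((y - mu_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)


def select_predictors(columns, min_variance=1e-8, max_abs_corr=0.95):
    """Return the names of the predictors that survive both filters."""
    kept = []
    for name, col in columns.items():
        if variance(col) < min_variance:
            continue  # (near-)constant: carries almost no information
        if any(abs(correlation(col, columns[k])) > max_abs_corr for k in kept):
            continue  # nearly a linear copy of a predictor already kept
        kept.append(name)
    return kept


# Hypothetical predictors.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # collinear with height_cm
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],        # zero variance
    "weight_kg": [70.0, 55.0, 80.0, 62.0, 75.0],
}
print(select_predictors(data))
```

Variance depends on the unit of measurement, so a fixed variance threshold only makes sense for predictors on comparable scales.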
Adding predictors
- Knowledge of the problem, e.g. h × w = A
- Dummy variables
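Both ideas can be sketched on a toy data set. This assumes that h and w stand for height and width and A for area; the column names, the `rooms` data, and the function names are hypothetical.

```python
def add_area(rows):
    """Add a domain-knowledge predictor: area = height * width."""
    return [dict(r, area=r["height"] * r["width"]) for r in rows]


def add_dummies(rows, column):
    """Replace a categorical column with one 0/1 indicator per level."""
    levels = sorted({r[column] for r in rows})
    out = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for level in levels:
            new[f"{column}_{level}"] = 1 if r[column] == level else 0
        out.append(new)
    return out


# Hypothetical records.
rooms = [
    {"height": 3.0, "width": 4.0, "kind": "kitchen"},
    {"height": 2.0, "width": 5.0, "kind": "bedroom"},
]
features = add_dummies(add_area(rooms), "kind")
print(features)
```

For a linear model with an intercept, one indicator per categorical column is usually dropped, because the full set of indicators always sums to one and is therefore collinear with the intercept.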
Model construction
- Local models
  - k-nearest neighbors
- Global models
  - Linear regression
  - SVM
  - Tree-based models
  - Neural networks
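The difference between a local and a global model can be sketched in one dimension: k-nearest neighbors predicts from the few training points closest to the query, while linear regression fits a single line to all of them. The function names and the toy data are assumptions.

```python
def knn_predict(train_x, train_y, x, k=3):
    """Local model: average the targets of the k nearest training points."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def fit_line(train_x, train_y):
    """Global model: least-squares line fitted to all training points."""
    n = len(train_x)
    mx, my = sum(train_x) / n, sum(train_y) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
             / sum((x - mx) ** 2 for x in train_x))
    return slope, my - slope * mx


# Hypothetical data: y = x^2, deliberately non-linear.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
slope, intercept = fit_line(xs, ys)
local_pred = knn_predict(xs, ys, 3.5, k=2)
global_pred = slope * 3.5 + intercept
print(local_pred, global_pred)
```

On this curved data the local prediction follows the nearby points, while the line is a compromise across the whole range.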
- Error measures
  - Log loss
  - F1 score
  - Square loss
  - AUC
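Each of these measures fits in a few lines. The sketch below assumes binary labels coded as 0 and 1, reports square loss as a mean over samples, and computes AUC by comparing every positive with every negative, which is simple but quadratic. The sample labels and probabilities are made up.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 1)
    fp = sum(1 for y, yh in zip(y_true, y_pred) if y == 0 and yh == 1)
    fn = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def square_loss(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities.
labels = [1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4, 0.5]
preds = [1 if p >= 0.5 else 0 for p in probs]
print(log_loss(labels, probs), f1_score(labels, preds), auc(labels, probs))
```

Log loss and AUC are computed from the probabilities, while F1 needs hard predictions and therefore depends on the chosen threshold (0.5 here, by assumption).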
- Types of error
  - Bias
  - Variance
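For square loss, the two types of error appear as separate terms in the standard bias-variance decomposition. The notation is an assumption: the data is taken to follow y = f(x) + ε with zero-mean noise of variance σ², and the estimate f̂ is the model fitted on a random training set, with the expectation taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Rigid models tend to have high bias and low variance; flexible models tend toward the opposite.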
Lexical Analysis
Syntactic Analysis
Semantic Analysis
Opinion extraction
Recommender systems
Time series forecasting