Broad, shallow tour of data science tasks and methods
Intuition, not math*
Biased to a computer science pov
Sorry statisticians, physicists, signal processing folk
Necessarily incomplete
* ok it turns out there is a little math
What's a Data Science?
No pedantic definitions here, but themes:
Tools to extract meaning from lots of data
Explore structure and relationships in data
Multi-disciplinary, yay!
Statistics
Computer Science
Electrical/Computer Engineering
Econometrics
Multi-disciplinary, boo!
Different names for same thing
Math notation, conventions
Data Science
Data Mining
Machine Learning
Predictive Analytics
Artificial Intelligence
Knowledge Discovery
Obligatory Graphics
Tour Outline
Process
Exploration
Single points of data
Structure of data
Modelling themes
Task Families
Method families
Learning, Optimization
Now the bad news
It's an iterative process
Problem definition - who cares about this?
Data preparation - easy systematic access
Data exploration - signal vs noise, patterns
Modelling - noisy inputs -> something useful
Evaluation - what is good?
Deployment - from science to end users
Exploration
Data Exploration
Data types
Discrete (categorical): Red, Blue, ...
Continuous (numerical): [0, 1], x > 0
Exploring single attributes
Mean, Variance
Skew, Kurtosis
Mode
Data Exploration
Exploring Single Values (cont)
Median, quantiles, box-plots
Histograms
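A minimal sketch of the single-attribute summaries listed above, using only the Python standard library. The `summarize` helper and the `ages` sample are made up for illustration. Skew and kurtosis are computed here as standardized third and fourth moments, which is one common convention among several.

```python
import math
import statistics

def summarize(values):
    """Return common single-attribute summaries for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    # Population variance (divides by n); statistics.variance divides by n - 1.
    variance = statistics.pvariance(values, mu=mean)
    std = math.sqrt(variance)
    # Standardized third moment: > 0 means a longer right tail.
    skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
    # Standardized fourth moment: about 3 for a normal distribution.
    kurtosis = sum((v - mean) ** 4 for v in values) / (n * std ** 4)
    # Quartile cut points (the middle one is the median, computed separately).
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": mean,
        "variance": variance,
        "skew": skew,
        "kurtosis": kurtosis,
        "mode": statistics.mode(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }

ages = [22, 25, 25, 27, 30, 31, 35, 41, 58]  # hypothetical sample
summary = summarize(ages)
for name, value in summary.items():
    print(f"{name:>8}: {value:.3f}")
```

The one large value (58) pulls the mean above the median and gives a positive skew, which is the kind of thing a box-plot or histogram would show at a glance.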
Exploring Pairs of Values
Covariance - how two attributes change together
Pearson correlation coefficient - how two attributes change together, relative to how much each changes on its own
t/z-test, ANOVA
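A small sketch of covariance and the Pearson coefficient, written by hand so the relationship between them is visible. The `hours` and `score` values are invented. The point is that covariance depends on the units of the attributes, while correlation divides that dependence out.

```python
import math

def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson(xs, ys):
    """Covariance scaled by how much each attribute varies on its own."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

hours = [1, 2, 3, 4, 5]        # hypothetical study hours
score = [52, 55, 61, 64, 70]   # hypothetical exam scores
score_x10 = [10 * s for s in score]  # same data in different units

print(covariance(hours, score), covariance(hours, score_x10))  # scales with units
print(pearson(hours, score), pearson(hours, score_x10))        # unchanged by units
```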
Data Exploration
Exploring structure of data aka dimensionality reduction
Principal component analysis
what directions explain the most variance
Linear method
Kernel tricks + linear methods
Manifold learning
Assume there is some lower dimensional structure
Auto-encoders
Neural networks trained to reproduce their own input (the identity function)
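To make "what directions explain the most variance" concrete, here is a rough sketch that finds the first principal component of a small 2-D data set. It uses power iteration on the covariance matrix, which is a stand-in chosen to keep the example dependency-free; real PCA libraries typically use an eigen- or singular value decomposition. The function name and the synthetic data (points near the line y = 2x) are assumptions for illustration.

```python
import math
import random

def top_principal_direction(points, iterations=200):
    """Return (unit direction, fraction of total variance explained)."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(d)] for a in range(d)]
    # Power iteration: repeatedly apply cov and renormalize.
    # Assumes the start vector is not orthogonal to the top direction.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    explained = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    total = sum(cov[a][a] for a in range(d))
    return v, explained / total

rng = random.Random(0)
points = [(x / 10, 2 * x / 10 + rng.gauss(0, 0.1)) for x in range(-20, 21)]
direction, ratio = top_principal_direction(points)
print("direction:", direction)
print("variance explained:", ratio)
```

Because the points lie almost on a line, one direction captures nearly all the variance, so the two attributes could be replaced by a single coordinate along that direction with little loss.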
Modelling Themes
Model? Explain/predict some output by some inputs
Minimize error
Why build models at all?
Incomplete noisy data
Discover some latent, hidden process
Describe phenomena in more compact form
Themes
Bias vs Variance
Parametric vs non-parametric
Frequentist vs Bayesian
Bias vs Variance
Bias vs Variance: two sources of error
Bias - how far the model is from the true answer, on average
Variance - if I build many models using the same process, how much will they vary from one another
Want low+low, but often they're antagonistic
Intuition: predicting election results
If you only poll people from the phone book, the model is biased towards people who own home phones, no matter how many people you poll
If you only poll 30 people from the phone book and repeat the poll several times, the results may differ each time. If you increase the number of people polled, the variance goes down
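The polling intuition above can be simulated in a few lines. This is a toy sketch: the population size, the 40% home-phone ownership rate, and the 60%/40% support rates are all invented numbers, picked only so that phone owners differ from everyone else.

```python
import random
import statistics

rng = random.Random(42)

def make_population(size=20000):
    """Each person is (has_home_phone, supports_candidate)."""
    people = []
    for _ in range(size):
        has_phone = rng.random() < 0.4
        supports = rng.random() < (0.6 if has_phone else 0.4)
        people.append((has_phone, supports))
    return people

def poll(people, sample_size):
    """Fraction of a random sample that supports the candidate."""
    sample = rng.sample(people, sample_size)
    return sum(supports for _, supports in sample) / sample_size

def repeat_poll(people, sample_size, times=200):
    """Run the same poll many times; return (average result, spread)."""
    results = [poll(people, sample_size) for _ in range(times)]
    return statistics.fmean(results), statistics.stdev(results)

population = make_population()
phone_book = [p for p in population if p[0]]
true_support = sum(s for _, s in population) / len(population)

small_biased = repeat_poll(phone_book, 30)     # biased, high variance
large_biased = repeat_poll(phone_book, 2000)   # biased, low variance
large_fair = repeat_poll(population, 2000)     # unbiased, low variance

print(f"true support      : {true_support:.3f}")
print(f"phone book, n=30  : mean {small_biased[0]:.3f}, spread {small_biased[1]:.3f}")
print(f"phone book, n=2000: mean {large_biased[0]:.3f}, spread {large_biased[1]:.3f}")
print(f"everyone,   n=2000: mean {large_fair[0]:.3f}, spread {large_fair[1]:.3f}")
```

Polling more people from the phone book shrinks the spread between repeated polls but leaves the average stuck away from the true value. Only changing who gets sampled removes the bias.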
Bias vs Variance
Generally the challenge of model fitting: do not want to over-fit or under-fit
In machine learning, we check for this using cross-validation
train vs test
train vs dev vs test
n-fold cross-validation
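As a sketch of the n-fold idea: split the data into n parts, and let each part take one turn as the test set while the rest is used for training. The helper name `k_fold_indices` is hypothetical, and this version does not shuffle. In practice the indices are usually shuffled first.

```python
def k_fold_indices(n_items, n_folds):
    """Yield (train_indices, test_indices), one pair per fold."""
    indices = list(range(n_items))
    # Spread any remainder over the first few folds so sizes differ by at most 1.
    fold_sizes = [n_items // n_folds + (1 if i < n_items % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_items=10, n_folds=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every item lands in exactly one test set, so averaging the test error over the folds uses all of the data for evaluation without ever testing a model on data it was trained on.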
Bias vs Variance
Variance via model complexity
h_0(x) = b
h_1(x) = a x + b
Parameters = \theta
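A minimal sketch comparing the two hypotheses h_0 (a constant) and h_1 (a line), each fit by least squares. The data points are made up, and the helper names are hypothetical. The constant model has one parameter, the line has two.

```python
def fit_constant(xs, ys):
    """h_0(x) = b: the least-squares constant is the mean of ys."""
    b = sum(ys) / len(ys)
    return lambda x: b

def fit_line(xs, ys):
    """h_1(x) = a x + b: ordinary least squares for one input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def mse(h, xs, ys):
    """Mean squared error of hypothesis h on the given points."""
    return sum((h(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]  # hypothetical, roughly y = 2x + 1

h0 = fit_constant(xs, ys)
h1 = fit_line(xs, ys)
print("h_0 training error:", mse(h0, xs, ys))
print("h_1 training error:", mse(h1, xs, ys))
```

On the training data the more complex model can never do worse, since a line with a = 0 is a constant. That extra flexibility is also what lets it change more from one training sample to the next, which is the variance side of the trade-off.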
Parametric vs Non-parametric
A few ways to say the same thing?
Is there a hidden process that can be described by finite, fixed parameters and can explain the observed data?
Can the data or process be described by a shape that has convenient math properties?
Parametric statistical tests assume the data follow a particular distribution
Non-parametric tests make fewer assumptions but are often harder to interpret and less powerful
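One way to see the distinction, as a rough sketch: estimate the probability that a value falls below some threshold, first by assuming a normal distribution (two fixed parameters, mean and standard deviation), then by counting directly from the data with no distributional assumption. The synthetic data, the threshold, and the `empirical_cdf` helper are all illustrative assumptions.

```python
import random
from statistics import NormalDist

rng = random.Random(1)
data = [rng.gauss(10, 2) for _ in range(500)]  # synthetic, truly normal here
threshold = 12

# Parametric: summarize the data with two numbers, then use the normal CDF.
fitted = NormalDist.from_samples(data)
parametric = fitted.cdf(threshold)

# Non-parametric: no assumed shape, just the fraction of observed values.
def empirical_cdf(samples, t):
    return sum(s <= t for s in samples) / len(samples)

nonparametric = empirical_cdf(data, threshold)

print(f"fitted normal : mean {fitted.mean:.2f}, stdev {fitted.stdev:.2f}")
print(f"parametric    : P(x <= {threshold}) = {parametric:.3f}")
print(f"non-parametric: P(x <= {threshold}) = {nonparametric:.3f}")
```

The two estimates agree here because the normal assumption happens to be true for this data. If the data had a different shape, the parametric estimate could be badly off, while the empirical one would still be valid but would need all the data points rather than two numbers.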
Frequentist vs Bayesian
Philosophical difference over interpretation of probability - we'll skip that