Predictive Analytics

War Stories

Feb 13, 2015

Hobson Lane

Choose Your Story

7707-2-TOTAL

(770) 728-6825

  1. Only Nyquist Knows
  2. The Meaning of Mean
  3. Data Dearth
  4. Question the Question
  5. Deep Net Runs Aground
  6. Escape the Maze

1. Only Nyquist Knows

When your vehicle is out of control...

SMS: 7707-2-TOTAL  or  (770) 728-6825    MSGS: "1", "2", "3", "4", "5", or "6"

Photo by Eric Cutright (Public Domain)

Image by NASA (Public Domain)

1. Only Nyquist Knows

  • Nav sensors (gyro., accel) are "pegged"
  • All you know is solar power:

 

 

How fast is the tumble? 

4 sec !

12 sec ?

1. Only Nyquist Knows

Try an Anti-Aliasing Filter

Fail: Only Nyquist Knows

12 sec
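
A minimal sketch of how a 4-second tumble can masquerade as a 12-second one. The 3-second sample interval is an assumption made for illustration (the slides do not state the actual telemetry rate); with it, the sampled signal is indistinguishable from a 12-second oscillation.

```python
import numpy as np

true_period = 4.0       # seconds, the real tumble period
sample_interval = 3.0   # seconds, HYPOTHETICAL: slower than the 2 s Nyquist needs

n = np.arange(64)
t = n * sample_interval
samples = np.cos(2 * np.pi * t / true_period)

# Locate the strongest non-DC peak in the spectrum of the sampled signal
spectrum = np.abs(np.fft.rfft(samples))
freqs = np.fft.rfftfreq(len(samples), d=sample_interval)
peak = np.argmax(spectrum[1:]) + 1
apparent_period = 1.0 / freqs[peak]

# The same samples are produced by a genuine 12-second oscillation
alias = np.cos(2 * np.pi * t / 12.0)
print(apparent_period, np.allclose(samples, alias))
```

No filter applied after sampling can separate the two cases, because the samples themselves are identical.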

Workarounds

If sampling at the Nyquist rate (more than 2x the highest frequency in the signal) isn't possible...

  • Use a different sensor
    • Postprocess existing signal (radio doppler)
  • Sample irregularly!
    • Captures higher frequencies
    • Lomb-Scargle to post-process
spectrum = scipy.signal.lombscargle(sample_times, samples, frequencies)
  • Probabilistic modeling
    • Great for overwhelming data volume (IoT)
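
A hedged sketch of the irregular-sampling workaround, using synthetic data rather than the original telemetry. The sample times, noise level, and period grid are all assumptions. Note that `scipy.signal.lombscargle` expects angular frequencies (rad/s), so periods are converted with `2 * pi / period`.

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(0)
true_period = 4.0  # seconds

# 200 samples at random times over 600 s: about one every 3 s on average,
# too slow for uniform sampling to resolve a 4 s period
sample_times = np.sort(rng.uniform(0.0, 600.0, 200))
samples = np.sin(2 * np.pi * sample_times / true_period)
samples += 0.1 * rng.normal(size=samples.size)  # synthetic sensor noise
samples -= samples.mean()

# Candidate periods to test, converted to angular frequencies
periods = np.linspace(2.0, 20.0, 2000)
angular_frequencies = 2 * np.pi / periods

spectrum = lombscargle(sample_times, samples, angular_frequencies)
best_period = periods[np.argmax(spectrum)]
print(best_period)
```

Because the gaps between samples vary, no single alias frequency fits all of them, which is why the periodogram peak lands near the true period.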

2. The Meaning of Mean

  • Means don't tell the whole story
  • Consider both μ and σ
  • Meaning may be found in the means for each...
    • group, cluster, or class
  • We started by grouping by time of day, but that wasn't enough...

2. The Meaning of Mean

  • Regression and classification required
  • Many "fundamental frequencies"

Mean for Each Time of Day

Classify Before Getting Mean
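
A small illustration of "classify before getting mean", using synthetic data. The two classes "A" and "B" and their values are made up for the sketch. The overall mean describes neither class, and the overall σ is inflated by the gap between them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic readings from two hypothetical classes with different levels
labels = np.array(["A"] * 500 + ["B"] * 500)
values = np.concatenate([rng.normal(10.0, 1.0, 500),
                         rng.normal(30.0, 1.0, 500)])

overall_mean = values.mean()
overall_std = values.std()

# Mean and standard deviation within each class
group_stats = {
    label: (values[labels == label].mean(), values[labels == label].std())
    for label in np.unique(labels)
}

print(overall_mean, overall_std)
print(group_stats)
```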

3. Data Dearth

  • Tuning a 2-DOF predictive filter for performance
  • More data gives algorithm more to work with
    • Less Overfitting
    • More Performance

Figure: anticlined cliffs or "terraces" (labels: More Data, Performance ($), Conservatism)

3. Data Dearth

  • Sometimes more of the same doesn't help
    • Exogenous factors confound the smartest algorithm
  • Make the exogenous endogenous (new data source)

4. Question the Question

Correlation != Causation

(à la Tyler Vigen)

More sales => More returns

Normalize return rate for sales

(lag-compensated)

Multiple interacting causes

Reduce these return surges!

4. Question the Question

Simple equation everyone can agree on

"Cost of quality"

"Customer reject rate"

"Defect rate"

6σ

Reject rate = Rejects (last quarter) / Sales (last quarter)

But it's Wrong!

And it's Late!

4. Better "Question"

Reject rate = Rejects (last quarter) / Sales (qtr before last)

Even Better

Reject rate = Rejects (last quarter) / Sales (estimate lagged quarter)

Correct

Reject rate = Rejects (last week) / Sales (integral of lagged sales)

r_r=\sum_k{\alpha{s_{n-k}}}

"Birth-Death Process"

r_r=\sum_k{\alpha{s_{n-k}}}
H(t,τ)
S(t)
R(t)

All products "die",

Question is when

Flow rate

(Reject rate)

Product enters "pipeline" arbitrarily

Sale

Reject

Lag

And the portion that happens too soon

4. Question the Question

Histogram reveals trend and seasonality

Sales

Month-end Surge

Rejects

4. Question the Question

  • Fiscal Quarter
  • Geography
  • Diagnosis
  • Retailer
  • Salesperson
  • Model
  • Lot
  • Reason

Lag

Lagged Sales

Today

Predicted Returns

*

=

H(t,τ)
S(t)
R(t)

Sales

Rejects

Lag Process

÷
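
A rough sketch of the lag process as a discrete convolution, under stated assumptions: the weekly sales figures and the lag kernel below are invented for illustration, and the kernel is taken to be constant over time (the slides allow H to vary with t). Predicted returns are the lagged sales weighted by the kernel, and the lag-compensated reject rate divides returns by the lag-weighted sales instead of by this week's sales.

```python
import numpy as np

# HYPOTHETICAL weekly unit sales with a one-week surge
sales = np.array([100, 100, 100, 400, 100, 100, 100, 100], dtype=float)

# HYPOTHETICAL lag kernel: fraction of units sold k weeks ago returned this week
lag_kernel = np.array([0.00, 0.01, 0.02, 0.01])

# Returns in week n = sum over k of lag_kernel[k] * sales[n - k]
returns = np.convolve(sales, lag_kernel)[:len(sales)]

# Naive metric: this week's returns over this week's sales
naive_rate = returns / sales

# Lag-compensated metric: returns over lag-weighted sales
lag_shape = lag_kernel / lag_kernel.sum()
lagged_sales = np.convolve(sales, lag_shape)[:len(sales)]
start = len(lag_kernel) - 1  # first week where the kernel fully overlaps
compensated_rate = returns[start:] / lagged_sales[start:]

print(naive_rate[start:])
print(compensated_rate)
```

In this toy case the naive rate spikes two weeks after the sales surge even though nothing about product quality changed, while the compensated rate stays flat.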

4. Analyze the Question

Cumulative histograms focus attention on final total

Product returns stop when...

  • You stop counting
  • You stop accepting returns
  • You stop selling

4. Normalize & Compare

  • Fiscal Quarter
  • Geography
  • Diagnosis
  • Retailer
  • Salesperson
  • Model
  • Lot
  • Reason

4. Analyze the Question

Normalize histograms to compare categories

  • Normalize by what?
    • Sales (which ones)?
    • Total returns?
  • How are we doing this week?
    • Not just this quarter
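
One possible way to normalize, sketched with made-up numbers: each category's cumulative histogram of return lags is divided by that category's total returns, so categories with very different volumes can be compared on the same 0-to-1 scale. The model names, lag values, and 10-day bins are assumptions.

```python
import numpy as np

# HYPOTHETICAL days from sale to return for two product models
lags_by_model = {
    "model_a": np.array([3, 5, 8, 13, 21, 30, 45, 60]),
    "model_b": np.array([2, 2, 3, 4, 5, 6, 7, 9, 10, 12, 14, 15, 20, 25, 40, 80]),
}
bin_edges = np.arange(0, 91, 10)  # 10-day bins out to 90 days

def normalized_cumulative(lags, edges):
    """Cumulative histogram scaled so the final value is 1.0."""
    counts, _ = np.histogram(lags, bins=edges)
    return np.cumsum(counts) / counts.sum()

curves = {model: normalized_cumulative(lags, bin_edges)
          for model, lags in lags_by_model.items()}

for model, curve in curves.items():
    print(model, np.round(curve, 3))
```

Normalizing by total returns compares the shape of the lag distribution only; normalizing by lagged sales instead would compare reject rates, which is the other option the slide raises.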

4. Question the Question

Unsupervised natural language processing?

President inaugural speeches

Target category = political party

4. Question the Question

What are the US Presidents' political parties based on speeches?

4. Question the Question

  • The category you're interested in will not likely be the most important "factor" in the NLP statistics
  • Dimension reduction (SVD, PCA) can identify factors
    • Word-sets that are most significant
  • These represent the "themes"
    • Interpretation of these "themes" is up to you
    • Statistics ≠ Meaning
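
A toy sketch of dimension reduction on word counts, assuming four invented one-line "documents" in place of the inaugural speeches. SVD of the centered term-count matrix yields components ("themes"); the words with the largest weights in a component are the word-set it represents. What the theme means is still left to the reader.

```python
import numpy as np

# HYPOTHETICAL miniature corpus standing in for real speeches
docs = [
    "freedom liberty nation people",
    "liberty freedom people union",
    "economy jobs taxes budget",
    "taxes budget economy trade",
]

vocab = sorted({word for doc in docs for word in doc.split()})
counts = np.array([[doc.split().count(word) for word in vocab] for doc in docs],
                  dtype=float)

# Center each word's counts, then factor with SVD (equivalent to PCA)
centered = counts - counts.mean(axis=0)
U, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)

def top_words(component, n=6):
    """Words with the largest absolute weight in one component."""
    order = np.argsort(-np.abs(Vt[component]))[:n]
    return [vocab[i] for i in order]

print(top_words(0))   # the word-set behind the strongest "theme"
print(U[:, 0])        # how strongly each document expresses it
```

Here the strongest theme happens to split the documents into two groups, but nothing in the statistics says what those groups are or whether they match the category of interest.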

5. Deep Nets Run Aground

Deep net performs well!

 

 

5. Deep Nets Run Aground

Not so fast... it's overfitting

 

 

5. Deep Nets Run Aground

\text{a}=\text{W}^{k}_{S^k,S^{(k+1)}}\ \text{p}
  • Conventional Hebb rule
\text{W}^{new}=\text{W}^{old}+\text{t}_q\text{p}_q^T
  • Hebb "delta" rule
\text{W}^{new}=\text{W}^{old}+\alpha(\text{t}_q-\text{a}_q)\text{p}_q^T
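
A minimal sketch of the delta rule for a single linear layer, assuming a = W p with no bias or activation function. The target weights, learning rate, and random inputs are illustrative choices, not values from the talk.

```python
import numpy as np

def delta_rule_update(W, p, t, alpha):
    """One step of W_new = W_old + alpha * (t - a) p^T, with a = W p."""
    a = W @ p
    return W + alpha * np.outer(t - a, p)

rng = np.random.default_rng(0)

# HYPOTHETICAL "true" linear map the layer should learn (3 inputs, 2 outputs)
W_true = np.array([[1.0, -2.0, 0.5],
                   [0.0, 3.0, -1.0]])

W = np.zeros_like(W_true)
for _ in range(2000):
    p = rng.normal(size=3)   # input vector
    t = W_true @ p           # target output for this input
    W = delta_rule_update(W, p, t, alpha=0.05)

print(np.round(W, 4))
```

With noise-free targets and plenty of independent samples, the weights settle on the true map. The overfitting concern arises when the number of weights approaches the number of independent training values.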

5. Shallow Data

\text{a}=\text{W}^{k}_{S^k,S^{(k+1)}}\ \text{p}
  • Model degree:
\sum_k{{S^k}{S^{(k+1)}}}
  • Training data DOF:
{S}^1{S^3}N_{samples}

(independent samples)

5. Shallow Data

\text{a}=\text{W}^{k}_{S^k,S^{(k+1)}}\ \text{p}
  • Model degree:
{S^1}{S^2}+{S^2}{S^3}

(1 hidden layer)
  • Training data DOF:
({S}^1+{S^3})N_{samples}

(independent samples)

5. Bottom Line

N_{hidden} \ll N_{training}
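
A back-of-envelope check of the bottom line, assuming the counting used on the one-hidden-layer slide: model degree is the sum of S^k S^(k+1) over adjacent layers (biases ignored), and training data DOF is (S^1 + S^3) N_samples. The layer sizes and sample count below are made up.

```python
def model_degree(layer_sizes):
    """Number of weights: sum of S^k * S^(k+1) over adjacent layers (no biases)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def training_dof(n_inputs, n_outputs, n_samples):
    """Independent values supplied by the training set: (S^1 + S^3) * N_samples."""
    return (n_inputs + n_outputs) * n_samples

# HYPOTHETICAL network: 10 inputs, 50 hidden nodes, 2 outputs, 100 samples
layer_sizes = [10, 50, 2]
weights = model_degree(layer_sizes)
data_dof = training_dof(layer_sizes[0], layer_sizes[-1], n_samples=100)

print(weights, data_dof, data_dof / weights)
```

Here 1,200 data values constrain 600 weights, a ratio of only 2, which is nowhere near the "much less than" margin the slide asks for.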

6. Escape the Maze

Find Connections

(Actionable Insight)

18 databases

> 10k tables

> 100k fields

> 10M records/table

6. Escape from the Maze

  • Tight heuristics vital for efficient graph search
  • "Always turn right" is not good enough

6. Escape from the Maze

  • Don't bother with "exhaustive" correlation search
\text{complexity} \approx O(M^2 N^2) \approx 10^{24}
(M \approx 10^5 fields, N \approx 10^7 records per table)
  • Find db relationships using meta-data
    • min, max, median
    • #records
    • #distinct
    • for reals: mean, std
\text{complexity} \approx O(M N \log(N)) \approx 10^{13}
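
A sketch of the metadata idea on three toy columns. The column names, values, and the `could_reference` rule are all hypothetical: the rule flags a column as a possible foreign key when its values fall inside the range of another column whose values are all distinct. It is one plausible heuristic, not the one used in the project.

```python
import numpy as np

def column_metadata(values):
    """Cheap per-column summary, computed once per field."""
    arr = np.asarray(values)
    return {
        "min": arr.min(),
        "max": arr.max(),
        "median": float(np.median(arr)),
        "n_records": arr.size,
        "n_distinct": np.unique(arr).size,
    }

def could_reference(child, parent):
    """HYPOTHETICAL heuristic: child may be a foreign key into parent."""
    parent_is_key = parent["n_distinct"] == parent["n_records"]
    in_range = parent["min"] <= child["min"] and child["max"] <= parent["max"]
    return parent_is_key and in_range and child["n_distinct"] <= parent["n_distinct"]

# HYPOTHETICAL columns from two tables
columns = {
    "customers.id": [1, 2, 3, 4, 5],
    "orders.customer_id": [1, 1, 3, 5, 5, 2],
    "orders.amount_cents": [1999, 250, 999, 120000, 45, 310],
}

metadata = {name: column_metadata(values) for name, values in columns.items()}
candidates = [(child, parent)
              for child in metadata for parent in metadata
              if child != parent and could_reference(metadata[child], metadata[parent])]
print(candidates)
```

Comparing summaries touches each record once when the metadata is built, and only the surviving candidate pairs need a record-level check.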

Human Heuristics

  • Business knowledge narrows search:

    • Repair technicians

    • Product designers

    • Factory managers

    • Suppliers

    • Sales channels

    • Call center

Accidental "Experiments"

  • Look for differences in

    • Model
    • Lot
    • Product
    • Sales Channel
    • Customer Demographic
    • Region/Culture
  • Look for ...

    • New/deleted features
    • Documentation updates
    • Cost-saving parts changes
    • Production facilities (outsourced vs insourced)

Kruskal's Algorithm

Minimum Spanning Tree

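  1. Add the lowest-cost edge that does not create a cycle
  2. Repeat until all nodes are connected

Built into Python graph library (`networkx`):

import networkx as nx

def minimum_spanning_zipcodes():
    """Order zipcode queries along the minimum spanning tree of each component."""
    zipcode_query_sequence = []
    # build_graph and api.db are not shown on this slide
    G = build_graph(api.db, limit=1000000)
    # connected_component_subgraphs() was removed in networkx 2.4
    for component in nx.connected_components(G):
        CG = G.subgraph(component)
        for u, v, data in nx.minimum_spanning_edges(CG, data=True):
            zipcode_query_sequence.append(data['zipcode'])
    return zipcode_query_sequence

Produces one spanning tree for each connected subgraph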
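
A self-contained sketch of the same pattern on a tiny made-up graph, since the slide's `build_graph` helper is not shown. The node names and edge weights are assumptions. `nx.minimum_spanning_edges` uses Kruskal's algorithm by default.

```python
import networkx as nx

# HYPOTHETICAL graph with two disconnected components
G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 1), ("B", "C", 2), ("A", "C", 3),  # component 1
    ("D", "E", 4),                                # component 2
])

spanning_edges = []
for component in nx.connected_components(G):
    subgraph = G.subgraph(component)
    # One minimum spanning tree per connected component
    for u, v, data in nx.minimum_spanning_edges(subgraph, data=True):
        spanning_edges.append((u, v, data["weight"]))

tree_weights = sorted(weight for _, _, weight in spanning_edges)
print(spanning_edges)
```

The A-C edge is dropped because A and C are already joined through B at lower cost.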

A* Algorithm

Minimum Path to Goal

Built into Python graph library (`networkx`):

from networkx.algorithms.shortest_paths import astar_path
astar_path(G, source, target, heuristic=None)

Provably optimal and optimally efficient (given an admissible, consistent heuristic)

But a typical data relationship graph has a large branching factor


You better have a good heuristic!
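
A small sketch of A* with a heuristic, on a 5x5 grid standing in for a data relationship graph. The grid and the Manhattan-distance heuristic are assumptions chosen because the heuristic is easy to verify as admissible there; a real schema graph would need a domain-specific estimate.

```python
import networkx as nx

# Nodes are (row, col) tuples; every edge has unit cost
G = nx.grid_2d_graph(5, 5)

def manhattan(node, goal):
    """Never overestimates the remaining distance on a unit-cost grid."""
    return abs(node[0] - goal[0]) + abs(node[1] - goal[1])

source, target = (0, 0), (4, 4)
path = nx.astar_path(G, source, target, heuristic=manhattan)
print(path)
```

With `heuristic=None`, `astar_path` uses a zero heuristic and degenerates to Dijkstra's algorithm, exploring far more of the graph than necessary.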

It's Open Source!

 github.com/sharplabs

  • Consider sample rate
  • Classify before mean
  • Explore data sources
  • Reject rate metric
  • data > nodes x inputs
  • Lazy correlation


References

Data Analytics War Stories -- Predictive Analytics Innovation Summit

By Hobson Lane

Lessons learned in the war on data
