Feb 13, 2015

Hobson Lane

# (770) 728-6825

1. Only Nyquist Knows
2. The Meaning of Mean
3. Data Dearth
4. Question the Question
5. Deep Nets Run Aground
6. Escape the Maze

## 1. Only Nyquist Knows

When your vehicle is out of control...

SMS: 7707-2-TOTAL  or  (770) 728-6825    MSGS: "1", "2", "3", "4", "5", or "6"

Photo by Eric Cutright (Public Domain)

Image by NASA (Public Domain)

## 1. Only Nyquist Knows

• Nav sensors (gyro, accelerometer) are "pegged" (saturated)
• All you know is solar power:

## How fast is the tumble?

12 sec

## Workarounds

If Nyquist-rate sampling (at least 2x the highest frequency in the signal) isn't possible...

• Use a different sensor
• Postprocess existing signal (radio doppler)
• Sample irregularly!
• Captures higher frequencies
• Lomb-Scargle to post-process

• Probabilistic modeling
• Great for overwhelming data volume (IoT)

    import scipy.signal
    spectrum = scipy.signal.lombscargle(sample_times, samples, frequencies)  # frequencies are angular (radians per unit time)
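
As a rough sketch of the irregular-sampling workaround, the snippet below samples a hypothetical 12-second sinusoid at random times whose average spacing (10 s) is slower than the Nyquist interval (6 s), then recovers the period with `scipy.signal.lombscargle`. The signal, the time span, the candidate period grid, and the helper name `dominant_period` are all assumptions made for illustration.

```python
import numpy as np
from scipy.signal import lombscargle


def dominant_period(sample_times, samples, candidate_periods):
    """Return the candidate period with the highest Lomb-Scargle power."""
    candidate_periods = np.asarray(candidate_periods, dtype=float)
    # lombscargle expects angular frequencies, not Hz.
    angular_freqs = 2 * np.pi / candidate_periods
    centered = samples - np.mean(samples)
    power = lombscargle(sample_times, centered, angular_freqs)
    return float(candidate_periods[np.argmax(power)])


# Hypothetical data: 60 samples at random times over 600 s.
rng = np.random.default_rng(0)
true_period = 12.0
sample_times = np.sort(rng.uniform(0.0, 600.0, size=60))
samples = np.sin(2 * np.pi * sample_times / true_period)

candidate_periods = np.linspace(5.0, 60.0, 2000)
estimated_period = dominant_period(sample_times, samples, candidate_periods)
print(f"estimated period: {estimated_period:.2f} s")
```

Uniform sampling at the same average rate would alias this signal; the random sample times are what let the periodogram separate the true period from its aliases.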

## 2. The Meaning of Mean

• Means don't tell the whole story
• Consider both $\mu$ and $\sigma$
• Meaning may be found in the means for each group, cluster, or class
• We started by grouping by time of day, but that wasn't enough...

## 2. The Meaning of Mean

• Regression and classification required
• Many "fundamental frequencies"

## Mean for Each Time of Day

Classify Before Getting Mean
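
A minimal sketch of "classify before getting mean", assuming the class labels have already been assigned (in practice that is the hard part). The readings, the labels, and the helper name `stats_by_class` are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def stats_by_class(values, labels):
    """Return {label: (mean, std)} so each class gets its own mu and sigma."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return {label: (mean(vals), pstdev(vals)) for label, vals in groups.items()}


# Hypothetical readings from two regimes that one overall mean blurs together.
readings = [10.0, 11.0, 9.0, 50.0, 52.0, 48.0]
regimes = ["night", "night", "night", "day", "day", "day"]

overall = (mean(readings), pstdev(readings))
per_class = stats_by_class(readings, regimes)
print("overall:", overall)
print("per class:", per_class)
```

The overall mean lands between the two regimes and describes neither, while the per-class means and much smaller per-class standard deviations describe both.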

## 3. Data Dearth

• Tuning a 2-DOF predictive filter for performance
• More data gives algorithm more to work with
• Less Overfitting
• More Performance

Anticlined cliffs or "terraces"

## 3. Data Dearth

• Sometimes more of the same doesn't help
• Exogenous factors confound the smartest algorithm
• Make the exogenous endogenous (new data source)
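
One way to picture "make the exogenous endogenous" is a plain least-squares fit before and after a new data source is added as a feature. Everything here is synthetic: `x` stands for a signal already measured, `z` for the outside factor, and the coefficients and noise level are arbitrary assumptions.

```python
import numpy as np


def fit_rmse(features, target):
    """Least-squares fit with an intercept; return the RMS residual."""
    design = np.column_stack([np.ones(len(target)), features])
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.sqrt(np.mean((target - design @ coeffs) ** 2)))


rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)  # signal we already measure
z = rng.normal(size=n)  # exogenous factor, initially unobserved
y = 2.0 * x + 3.0 * z + 0.1 * rng.normal(size=n)

rmse_without = fit_rmse(x.reshape(-1, 1), y)       # z left out of the model
rmse_with = fit_rmse(np.column_stack([x, z]), y)   # z brought in as a feature
print(rmse_without, rmse_with)
```

More rows of `x` alone would not shrink the first residual much, because the error comes from the missing factor rather than from too few samples.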

## 4. Question the Question

Correlation != Causation

(à la Tyler Vigen)

More sales => More returns

Normalize return rate for sales

(lag-compensated)

Multiple interacting causes

Reduce these return surges!

$6\sigma$

## "Birth-Death Process"

$r_r = \sum_k \alpha\, s_{n-k}$

Diagram labels: $H(t,\tau)$, $S(t)$, $R(t)$

Question is when

Flow rate

(Reject rate)

Product enters "pipeline" arbitrarily

## Lag

And the portion that happens too soon
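
A hedged sketch of a lag-compensated return rate, loosely following the sum over lagged sales $s_{n-k}$ in the formula above. It assumes a known lag profile, where `lag_profile[k]` is the fraction of eventual returns that arrive `k` periods after the sale; that profile, the numbers, and the function name are illustrative assumptions rather than the method used in the talk.

```python
import numpy as np


def lag_compensated_rate(returns, sales, lag_profile):
    """Divide each period's returns by the lag-weighted sales that fed them.

    lag_profile[k] is the assumed share of returns arriving k periods after
    the sale; it should sum to 1.
    """
    sales = np.asarray(sales, dtype=float)
    returns = np.asarray(returns, dtype=float)
    # lagged_sales[n] = sum_k lag_profile[k] * sales[n - k]
    lagged_sales = np.convolve(sales, lag_profile)[: len(sales)]
    rate = np.full(len(sales), np.nan)
    exposed = lagged_sales > 0
    rate[exposed] = returns[exposed] / lagged_sales[exposed]
    return rate


# Hypothetical weekly data: sales stop after week 2, returns keep arriving.
sales = [100, 120, 90, 0, 0]
returns = [0, 2, 3.4, 3, 0.9]
lag_profile = [0.0, 2 / 3, 1 / 3]

rate = lag_compensated_rate(returns, sales, lag_profile)
print(rate)
```

Dividing returns by same-week sales would show a rate that swings with sales volume and is undefined once sales stop; dividing by lagged sales gives a steady 3% in this made-up example.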

## 4. Question the Question

Histogram reveals trend and seasonality

Month-end Surge

## 4. Question the Question

• Fiscal Quarter
• Geography
• Diagnosis
• Retailer
• Salesperson
• Model
• Lot
• Reason

Today

Diagram symbols: $H(t,\tau)$, $S(t)$, $R(t)$, $\div$, $=$

## 4. Analyze the Question

Cumulative histograms focus attention on final total

Product returns stop when...

• You stop counting
• You stop accepting returns
• You stop selling

## 4. Normalize & Compare

• Fiscal Quarter
• Geography
• Diagnosis
• Retailer
• Salesperson
• Model
• Lot
• Reason

## 4. Analyze the Question

Normalize histograms to compare categories

• Normalize by what?
• Sales (which ones)?
• Total returns?
• How are we doing this week?
• Not just this quarter
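
The sketch below picks one of the normalization options listed above, dividing by each category's total returns, and builds a cumulative histogram per category. The retailers, the day counts, and the helper name `normalized_cumulative` are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def normalized_cumulative(days_to_return, horizon):
    """Cumulative fraction of a category's returns received by each day."""
    counts = Counter(days_to_return)
    total = len(days_to_return)
    running = accumulate(counts.get(day, 0) for day in range(horizon + 1))
    return [count / total for count in running]


# Hypothetical days from sale to return for two retailers of different size.
retailer_a = [1, 2, 2, 3, 5, 5, 5, 6]
retailer_b = [2, 5]

curve_a = normalized_cumulative(retailer_a, horizon=6)
curve_b = normalized_cumulative(retailer_b, horizon=6)
print(curve_a)
print(curve_b)
```

Both curves end at 1.0, so their shapes can be compared directly even though one retailer has four times the volume. Normalizing by sales instead would answer a different question (rate rather than timing).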

## 4. Question the Question

Unsupervised natural language processing?

Presidential inaugural speeches

Target category = political party

## 4. Question the Question

What are the US Presidents' political parties based on speeches?

## 4. Question the Question

• The category you're interested in will not likely be the most important "factor" in the NLP statistics
• Dimension reduction (SVD, PCA) can identify factors
• Word-sets that are most significant
• These represent the "themes"
• Interpretation of these "themes" is up to you
• Statistics $\ne$ Meaning
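
To make the dimension-reduction bullet concrete, here is a toy sketch: an SVD of a word-count matrix, reading off the strongest words in each component as a "theme". The four mini-documents are invented, and raw counts with no stop-word removal or TF-IDF weighting are a simplifying assumption.

```python
import numpy as np


def top_words_per_theme(documents, n_themes=2, n_words=2):
    """Return the strongest words in each of the leading SVD components."""
    vocab = sorted({word for doc in documents for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(documents), len(vocab)))
    for row, doc in enumerate(documents):
        for word in doc.split():
            counts[row, index[word]] += 1
    # Rows of vt are weighted word-sets, ordered by singular value.
    _, _, vt = np.linalg.svd(counts, full_matrices=False)
    themes = []
    for component in vt[:n_themes]:
        strongest = np.argsort(-np.abs(component))[:n_words]
        themes.append({vocab[i] for i in strongest})
    return themes


# Hypothetical mini-corpus with two obvious topics.
docs = [
    "freedom liberty freedom nation",
    "liberty freedom people",
    "tax economy jobs tax",
    "economy tax",
]
themes = top_words_per_theme(docs)
print(themes)
```

The SVD only reports which words vary together. Whether a component corresponds to the category you care about (political party, say) is an interpretation you bring to it.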

## 5. Deep Nets Run Aground

Deep net performs well!

## 5. Deep Nets Run Aground

Not so fast... it's overfitting

## 5. Deep Nets Run Aground

$\text{a}=\text{W}^{k}_{S^k,S^{(k+1)}}\,\text{p}$

• Conventional Hebb rule: $\text{W}^{new}=\text{W}^{old}+\text{t}_q\text{p}_q^T$
• Hebb "delta" rule: $\text{W}^{new}=\text{W}^{old}+\alpha(\text{t}_q-\text{a}_q)\text{p}_q^T$
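
The two update rules above translate almost directly into NumPy. This is a sketch for a single linear layer with no bias term or nonlinearity; the function names and the example values are assumptions.

```python
import numpy as np


def hebb_update(W, p, t):
    """Conventional Hebb rule: W_new = W_old + t p^T."""
    return W + np.outer(t, p)


def delta_update(W, p, t, alpha=0.1):
    """Delta rule: W_new = W_old + alpha (t - a) p^T, with a = W p."""
    a = W @ p
    return W + alpha * np.outer(t - a, p)


W0 = np.zeros((1, 2))
p = np.array([1.0, 0.0])  # input pattern
t = np.array([1.0])       # target output

W_hebb = hebb_update(W0, p, t)
W_step1 = delta_update(W0, p, t, alpha=0.5)
W_step2 = delta_update(W_step1, p, t, alpha=0.5)
print(W_hebb, W_step1, W_step2)
```

The Hebb rule adds the full outer product every time it sees a pattern, while the delta rule scales each step by the remaining error, so its updates shrink as the output approaches the target.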

## 5. Shallow Data

$\text{a}=\text{W}^{k}_{S^k,S^{(k+1)}}\,\text{p}$

• Model degree: $\sum_k S^k S^{(k+1)}$
• Training data DOF: $S^1 S^3 N_{samples}$ (independent samples)

## 5. Shallow Data

$\text{a}=\text{W}^{k}_{S^k,S^{(k+1)}}\,\text{p}$

• Model degree: $S^1 S^2 + S^2 S^3$ (1 hidden layer)
• Training data DOF: $(S^1+S^3) N_{samples}$ (independent samples)
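
A back-of-the-envelope check using the two counts above for a net with one hidden layer. The layer sizes and sample count are made up, and biases are left out to match the weight-only count on the slide.

```python
def model_dof(layer_sizes):
    """Number of weights: sum over k of S^k * S^(k+1)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def training_dof(layer_sizes, n_samples):
    """(S^1 + S^3) * N_samples, assuming the samples are independent."""
    return (layer_sizes[0] + layer_sizes[-1]) * n_samples


layers = [10, 50, 1]              # hypothetical input, hidden, output sizes
weights = model_dof(layers)       # 10*50 + 50*1 = 550
data = training_dof(layers, 20)   # (10 + 1) * 20 = 220
overfit_risk = weights > data
print(weights, data, overfit_risk)
```

With 550 weights and only 220 degrees of freedom in the data, this hypothetical net has room to memorize the training set; shrinking the hidden layer or collecting more independent samples flips the inequality.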

## 5. Bottom Line

$N_{hidden} \ll N_{training}$

## 6. Escape the Maze

Find Connections

(Actionable Insight)

## 6. Escape from the Maze

• Tight heuristics vital for efficient graph search
• "Always turn right" is not good enough

## 6. Escape from the Maze

• Don't bother with "exhaustive" correlation search

$\text{complexity} \approx O(M^2 N^2) \approx 10^{24}$

• Find db relationships using meta-data
• min, max, median
• #records
• #distinct
• for reals: mean, std

$\text{complexity} \approx O(M N \log N) \approx 10^{13}$

with $M \approx 10^5$ and $N \approx 10^7$
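
A small sketch of the meta-data approach: profile each column once (the sort is the $N \log N$ step), then compare the cheap profiles instead of the raw values. The matching rule used here (identical min and max) and the table and column names are illustrative assumptions; a real search would likely use looser criteria.

```python
from itertools import combinations


def column_profile(values):
    """Summarize one column; sorting is the O(N log N) step."""
    ordered = sorted(values)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "median": ordered[len(ordered) // 2],  # upper median for even counts
        "records": len(ordered),
        "distinct": len(set(ordered)),
    }


def candidate_links(columns):
    """Pair columns whose value ranges match; only these merit a closer look."""
    profiles = {name: column_profile(vals) for name, vals in columns.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if profiles[a]["min"] == profiles[b]["min"]
        and profiles[a]["max"] == profiles[b]["max"]
    ]


# Hypothetical columns from three tables.
columns = {
    "orders.customer_id": [3, 1, 5, 3, 2],
    "customers.id": [1, 2, 3, 4, 5],
    "products.price": [9.99, 20.0, 5.5],
}
links = candidate_links(columns)
print(links)
```

Comparing profiles is still pairwise across columns, but each comparison touches a handful of numbers rather than every record.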

## Human Heuristics

### Business knowledge narrows search:

• Repair technicians
• Product designers
• Factory managers
• Suppliers
• Sales channels
• Call center

## Accidental "Experiments"

### Look for differences in

• Model
• Lot
• Product
• Sales Channel
• Customer Demographic
• Region/Culture

### Look for ...

• New/deleted features
• Cost-saving parts changes
• Production facilities (outsourced vs insourced)

# Kruskal's Algorithm

## Minimum Spanning Tree

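1. Add the lowest-cost remaining edge that does not create a cycle
2. Repeat until all nodes are connected

Built into the Python graph library (networkx):

    import networkx as nx

    def minimum_spanning_zipcodes():
        # build_graph and api.db are defined elsewhere in the project
        zipcode_query_sequence = []
        G = build_graph(api.db, limit=1000000)
        # connected_component_subgraphs was removed in networkx 2.4
        for component in nx.connected_components(G):
            CG = G.subgraph(component)
            # Kruskal is the default algorithm; each edge is (u, v, data)
            for u, v, data in nx.minimum_spanning_edges(CG, data=True):
                zipcode_query_sequence.append(data['zipcode'])
        return zipcode_query_sequence

Produces one spanning tree for each connected component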
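
For readers who want to see the two steps without a graph library, here is a compact pure-Python sketch of Kruskal's algorithm using a union-find structure. The integer node numbering and the example edge list are assumptions for illustration.

```python
def kruskal(n_nodes, edges):
    """Minimum spanning forest.

    edges is an iterable of (cost, u, v) with nodes numbered 0..n_nodes-1.
    """
    parent = list(range(n_nodes))

    def find(node):
        # Follow parent links to the component root, halving the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for cost, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two components, so no cycle
            parent[root_u] = root_v
            tree.append((cost, u, v))
    return tree


# Hypothetical weighted graph on four nodes.
edges = [(4, 0, 1), (1, 1, 2), (3, 0, 2), (2, 2, 3), (5, 1, 3)]
tree = kruskal(4, edges)
print(tree, sum(cost for cost, _, _ in tree))
```

On a disconnected graph the same loop yields one tree per connected component without any special handling.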

# A* Algorithm

## Minimum Path to Goal

Built into the Python graph library (networkx):

    from networkx.algorithms.shortest_paths import astar_path
    astar_path(G, source, target, heuristic=None)

Provably optimal and optimally efficient, given an admissible, consistent heuristic

But a typical data relationship graph has a large branching factor

You better have a good heuristic!
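
To show where the heuristic enters, here is a self-contained A* sketch on a small grid maze, with Manhattan distance as the heuristic. The grid, the unit step cost, and the four-way movement are assumptions chosen so that the heuristic is admissible and consistent.

```python
import heapq


def astar(grid, start, goal):
    """Length of the shortest path on a grid of 0 (open) / 1 (wall), or None."""

    def heuristic(cell):
        # Manhattan distance never overestimates with unit-cost 4-way moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0, start)]  # (estimated total, cost so far, cell)
    best_cost = {start: 0}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best_cost[cell]:
            continue  # stale queue entry
        row, col = cell
        for nxt in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
            r, c = nxt
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


maze = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 1],
]
steps = astar(maze, (0, 0), (3, 0))
print(steps)
```

With a zero heuristic this degenerates into Dijkstra's algorithm and expands many more cells; the tighter the heuristic (without overestimating), the fewer nodes are visited.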

## github.com/sharplabs

• Consider sample rate
• Classify before mean
• Explore data sources
• Reject rate metric
• data > nodes x inputs
• Lazy correlation