Interactively Exploring Financial Trades in R
Ryan Hafen
Hafen Consulting, LLC and Purdue University
Michael Kane
Yale University
R/Finance Conference
May 20, 2016
bit.ly/rfinance2016
DARPA XDATA Program
Origin of This Work:
 Part of the White House Big Data Initiative, funding research and development of open source tools for analysis and visualization of big data
 Goal: "Support open source software toolkits to enable flexibility in government applications"
XDATA Summer Camps
 All grant performers convene for ~2 months and work on challenge problems
 Several data sets and challenge problems each year
 Last summer one of the data sets was a sample of data from the Nanex NxCore database
NxCore Data
 6 months of data, 25 ms resolution
 Datasets:
 trades: trade messages for last sale reports and corrections; a typical trading day has 10+ million of these messages
 exchange quotes: sent for every exchange quote and BBO (best bid and offer); a typical trading day has 700+ million option quotes
 market maker quotes: sent for every Market Maker Quote ("level 2") from Nasdaq SuperMontage, Intermarket quotes on Nasdaq issues, or Nasdaq Intermarket quotes on listed securities; a typical trading day has 40+ million of these messages
Trade Data
 ~1.25 billion records, 47 variables
 ~ tera scale (depending on compression, etc.)
 Data for equities, futures, future options, bonds, index, spreads, eq/idx opt root
 13,780,219,669 total trades
 33 exchanges
 ~24k equity symbols
XDATA Challenge Problems
 Identify flash crashes in the data
 Identify and characterize instances of trading halts
 Identify suspected pump and dump schemes
 Identify anomalies associated with quote stuffing
But it's challenging enough to understand the data!
It's dangerous to just go out and start applying algorithms to the data when we don't understand all the variables and how they should be handled, etc.
NxCore Exploratory Analysis
 A careful analysis was conducted of every variable in the trade data
 Ideally we would have NxCore / finance experts to iterate with
 We had to make our own best judgements based on exploratory analysis
 New general-purpose interactive plotting library: rbokeh (hafen.github.io/rbokeh)
NxCore R Package
 Resolve flags to meaningful messages
 Handle out of order trades / cancellations / insertions
 Methods to "roll up" trade data to construct an order book at different time resolutions
 Symbol / name lookups
 Algorithms: cointegration, windowed realized volatility, outlier detection / mini flash crash detection
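As a rough illustration of the "roll up" and windowed realized volatility ideas listed above, here is a minimal base R sketch. It is not the NxCore R package's code: the toy data, the column names (`time`, `price`, `size`), and the helper names `roll_up()` and `realized_vol()` are all assumptions made for this example.

```r
# Toy trade tape; the column names are illustrative, not NxCore field names.
t0 <- as.POSIXct("2016-05-20 09:30:00", tz = "UTC")
trades <- data.frame(
  time  = t0 + c(1, 20, 65, 70, 130, 185, 190, 250),
  price = c(100.0, 100.2, 100.1, 100.4, 100.3, 100.6, 100.5, 100.7),
  size  = c(100, 200, 100, 300, 100, 200, 100, 100)
)

# Hypothetical helper: roll trades up to bars of `secs` seconds, keeping
# the last price and the total traded size in each bar.
roll_up <- function(x, secs = 60) {
  x <- x[order(x$time), ]                        # guard against out-of-order rows
  bin <- floor(as.numeric(x$time) / secs) * secs # bar start, in epoch seconds
  data.frame(
    time   = as.POSIXct(sort(unique(bin)), origin = "1970-01-01", tz = "UTC"),
    close  = as.numeric(tapply(x$price, bin, function(p) p[length(p)])),
    volume = as.numeric(tapply(x$size, bin, sum))
  )
}

# Hypothetical helper: windowed realized volatility, computed as the square
# root of the sum of squared log returns over a sliding window of bars.
realized_vol <- function(close, window = 3) {
  r2 <- diff(log(close))^2
  if (length(r2) < window) return(numeric(0))
  vapply(seq_len(length(r2) - window + 1),
         function(i) sqrt(sum(r2[i:(i + window - 1)])),
         numeric(1))
}

bars <- roll_up(trades, secs = 60)
rv   <- realized_vol(bars$close, window = 3)
```

Changing `secs` gives bars at a different time resolution; the same `realized_vol()` call then works on the coarser or finer series.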
Large-Scale Historical Analysis
 Summaries are a great start, but we need to explore the data in detail
 R is great for detailed EDA of small data
 To analyze data of this scale with R, we used our open source R / Big Data platform, Tessera
Tessera
Guiding Principles:
 Stay in R regardless of size of data
 Be able to use all methods available in R regardless of size of data
 Good for rapid iteration / ad hoc exploration / prototyping
 Offline historical analysis where being fastest in computation is not as important
 Want to minimize both computation time (through scaling) and analyst time (through a simple interface in R), with emphasis on analyst time
 Flexible data structures
 Scalable while keeping the same simple interface!
What's Different About Tessera?
Many other "big data" systems that support R are either:
 Restrictive in data structures (only data frames / tabular)
 Restrictive in methods (only SQL-like operations or a handful of scalable non-native algorithms)
 Or both!
Idea of Tessera:
 Let's use R for the flexibility it was designed for, regardless of the size of the data
 Use any R data structure and run any R code at scale
 Forget "deep learning", we're doing "deep analysis"!
Tessera Environment
 Front end: two R packages, datadr & trelliscope
 Back ends: R, Hadoop, Spark, etc.
 R <-> back end bridges: RHIPE, SparkR, etc.
NxCore Data with Tessera
datadr / RHIPE
 Specification of data partitioning and data structures, and ad hoc analysis
 Data structures:
 Raw equity data partitioned by symbol and date
 Raw equity and option data grouped by symbol and date
 Ad hoc application of R code:
 Higher resolution summaries
 Outlier / anomaly detection
 Cointegration calculations
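The pattern described on this slide (partition by symbol and date, apply arbitrary R code to each subset, recombine the results) can be sketched in plain base R. This is only a stand-in to show the idea on a small in-memory data frame: it does not use the datadr or RHIPE APIs, and the toy data, the column names, and `summarize_subset()` are assumptions made for illustration.

```r
# Toy trades; symbols, dates, and column names are made up for illustration.
trades <- data.frame(
  symbol = c("AAA", "AAA", "AAA", "BBB", "BBB", "AAA", "BBB"),
  date   = as.Date(c("2016-05-19", "2016-05-19", "2016-05-20", "2016-05-19",
                     "2016-05-19", "2016-05-20", "2016-05-20")),
  price  = c(10, 10.5, 11, 50, 49, 11.2, 48),
  size   = c(100, 200, 100, 50, 50, 300, 100)
)

# Divide: one subset per symbol/date combination.
key     <- paste(trades$symbol, trades$date, sep = "|")
subsets <- split(trades, key)

# Apply: any R function of a subset; here, trade count and
# volume-weighted average price (VWAP).
summarize_subset <- function(d) {
  data.frame(
    symbol   = d$symbol[1],
    date     = d$date[1],
    n_trades = nrow(d),
    vwap     = sum(d$price * d$size) / sum(d$size)
  )
}
results <- lapply(subsets, summarize_subset)

# Recombine: stack the per-subset results into a single data frame.
combined <- do.call(rbind, results)
rownames(combined) <- NULL
```

Because `summarize_subset()` only ever sees one subset, the same function could in principle be handed to a distributed back end unchanged, which is the point of keeping the interface the same at any scale.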
NxCore Data with Tessera
Trelliscope
 Interactively investigate data in detail with several different displays
 Example:
 Plot price vs. time for data partitioned by symbol/day
 There are ~1 million subsets based on this partitioning
 Compute cognostics: metrics that allow us to navigate this large space of displays in a meaningful way
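To make the cognostics idea concrete, here is a minimal base R sketch: compute a few summary metrics per symbol/day panel, then sort on one of them so the most unusual panels come first. It does not use the trelliscope API; the toy price series, the choice of metrics, and the `cognostics()` helper are assumptions made for this example.

```r
# Toy price series, one per symbol/day panel (values are made up).
panels <- list(
  "AAA|2016-05-19" = c(10.0, 10.1, 10.0, 10.2),
  "AAA|2016-05-20" = c(10.2, 10.3, 7.5, 10.1),   # contains a sharp dip
  "BBB|2016-05-19" = c(50.0, 49.9, 50.1, 50.0)
)

# Hypothetical cognostics: per-panel metrics used to sort and filter panels.
cognostics <- function(price) {
  r <- diff(log(price))
  data.frame(
    n_trades       = length(price),
    price_range    = max(price) - min(price),
    max_abs_return = max(abs(r))
  )
}

cogs <- do.call(rbind, lapply(panels, cognostics))
cogs$panel <- names(panels)   # lapply preserves the order of the list
rownames(cogs) <- NULL

# Navigate: order panels by largest absolute log return, so the panel
# with the sharpest move is viewed first.
ranked <- cogs[order(-cogs$max_abs_return), ]
```

With about a million panels, viewing them all is not feasible, so sorting or filtering on metrics like these decides which panels an analyst actually looks at.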
More Interesting Targeted Analysis
We are now in good shape to begin more interesting analyses
 Assess marketwide systemic risk with cointegration
 Investigate option prices leading equity prices as indicator of insider trading
 Examine spikes in implied volatility and volume in options to look for fiscal malfeasance
Figure: AAPL price (top) and cointegration measure (bottom) S&P 500 on the 05/06/2010 flash crash
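The slides do not say how the cointegration measure was computed. As one possible illustration, the sketch below runs the two steps of an Engle-Granger style check in base R on simulated series: regress one series on the other, then run a Dickey-Fuller regression on the residuals. The simulated data and the `eg_stat()` helper are assumptions for this example, not the speakers' method.

```r
set.seed(1)
n <- 500
x <- cumsum(rnorm(n))    # random walk
y <- 2 * x + rnorm(n)    # shares x's stochastic trend, so cointegrated with x
z <- cumsum(rnorm(n))    # independent random walk

# Hypothetical helper: Engle-Granger style statistic for a pair of series.
eg_stat <- function(a, b) {
  e     <- residuals(lm(a ~ b))    # step 1: residuals of the long-run regression
  de    <- diff(e)
  e_lag <- e[-length(e)]
  fit   <- lm(de ~ 0 + e_lag)      # step 2: Dickey-Fuller regression, no intercept
  unname(coef(summary(fit))["e_lag", "t value"])
}

stat_coint <- eg_stat(y, x)   # strongly negative: residuals revert to zero
stat_indep <- eg_stat(z, x)   # for unrelated walks, typically much closer to zero
```

The statistic has a non-standard distribution, so it should be compared against Engle-Granger critical values rather than ordinary t tables. Computing it over a sliding window would give a time-varying measure.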
Thank You
References:
 Tessera: http://tessera.io
 rbokeh: http://hafen.github.io/rbokeh
 More: http://ryanhafen.com
@hafenstats