/r/playrust classifier
Real World Rust Data Science
Suchin Gururangan
Colin O'Brien
Suchin
Data Scientist
Appuri
suchin.co
Colin
Software Engineer
Rapid7
insanitybit.github.io
@insanitybit
@pegasos1
Talk Outline
- Technical Debt in Data Science
- /r/playrust classifier
- Vision for Rust ML
Basic Problem
Service-Oriented Machine Learning
One-Off Data Science
Research
Production
Data Collection
Model Generation
Prediction
Service-oriented ML
Data Collection -> Data Investigation -> Feature Extraction -> Model Generation -> Prediction
The ML Service
Feature Engineering: Data Investigation and Feature Extraction
ML System Technical Debt
- Siloed teams
- Pipeline Jungles
- Unscalable Experiments
- ...and all the normal engineering tech debt
Sculley, D., et al. "Machine learning: The high interest credit card of technical debt." (2014).
[...] a mature system might end up being (at most) 5% machine learning code [...]
Building production-level data science services from POC research is unexpectedly difficult.
ML services depend not only on the quality of the model, but also on the quality of feature engineering and data ingestion.
How can Rust help?
What's on /r/rust?
/r/rust
/r/playrust
Rust lang
Rust video game
Task: Create a binary classifier that identifies /r/playrust posts mistakenly published on /r/rust.
Data Collection
- Retrieve data from Reddit safely
- Store data for investigation
struct RawPostData {
    is_self: bool,
    author_name: String,
    url: String,
    downvotes: u64,
    upvotes: u64,
    score: u64,
    edited: bool,
    selftext: String,
    subreddit: String,
    title: String,
}
pub fn get_reddit_post(&self, url: &str) -> Vec<RawPostData> {
    let mut res = self.client
        .get(url)
        .send()
        .unwrap();
    let data = extract_data(&mut res)
        .unwrap();
    data
}
Proof Of Concept Code
pub fn get_reddit_post(&self, url: &str) -> Result<Vec<RawPostData>> {
    let mut res = try!(self.client
        .get(url)
        .send()
        .chain_err(||
            format!("Failed to GET {}", url)));
    let data = try!(extract_data(&mut res)
        .chain_err(||
            format!("Failed to parse data {}", url)));
    Ok(data)
}
Production
Proper error handling enforced by type signature.
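A minimal sketch of the same idea using only the standard library and the `?` operator, rather than the `try!` and `chain_err` calls on the slide. The `PostError` type, the `parse_post` and `parse_count` helpers, and the `title|ups|downs` line format are all hypothetical, chosen for illustration.

```rust
use std::fmt;
use std::num::ParseIntError;

// Hypothetical error type: every failure mode is visible in the signature.
#[derive(Debug)]
enum PostError {
    MissingField(&'static str),
    BadNumber { field: &'static str, source: ParseIntError },
}

impl fmt::Display for PostError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PostError::MissingField(name) => write!(f, "missing field {}", name),
            PostError::BadNumber { field, source } => {
                write!(f, "field {} is not a number: {}", field, source)
            }
        }
    }
}

impl std::error::Error for PostError {}

// Parses one count, attaching the field name as context on failure.
fn parse_count(raw: Option<&str>, field: &'static str) -> Result<u64, PostError> {
    raw.ok_or(PostError::MissingField(field))?
        .trim()
        .parse::<u64>()
        .map_err(|source| PostError::BadNumber { field, source })
}

// Parses an assumed "title|ups|downs" line; callers must handle the Result.
fn parse_post(line: &str) -> Result<(String, u64, u64), PostError> {
    let mut parts = line.split('|');
    let title = parts
        .next()
        .filter(|t| !t.is_empty())
        .ok_or(PostError::MissingField("title"))?;
    let ups = parse_count(parts.next(), "ups")?;
    let downs = parse_count(parts.next(), "downs")?;
    Ok((title.to_string(), ups, downs))
}
```

A caller cannot read the parsed values without first matching on, or propagating, the `Result`.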
def get_reddit_post(self, url):
    res = self.client.get(url).send()
    data = extract_data(res)
    return data
Unchecked Exceptions
Proper error handling relies on knowledge of implementation.
Data Investigation
- Explore the Reddit data to identify the features that best discriminate between /r/rust and /r/playrust
Dynamic Languages dominate
- Instant feedback lends itself well to exploration
- Natural investigative feel of REPL
- Graphing and visuals are a high priority
- Performance and stability don't matter
- Can iterate faster by ignoring types
Feature Extraction
- Extract meaningful features based on trends we have seen in the data.
- Convert those features to a representation the model can understand.
struct ProcessedPostFeatures {
    is_self: f32,
    author_popularity: f32,
    downs: f32,
    ups: f32,
    score: f32,
    post_len: f32,
    word_freq: Vec<f32>,
    symbol_freq: Vec<f32>,
    regex_matches: Vec<f32>,
}
struct RawPostData {
    is_self: bool,
    author_name: String,
    url: String,
    downvotes: u64,
    upvotes: u64,
    score: u64,
    edited: bool,
    selftext: String,
    subreddit: String,
    title: String,
}
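One way the raw-to-processed conversion might look, as a rough sketch rather than the talk's actual code. `RawPost` and `PostFeatures` are trimmed-down stand-ins for the structs above, and `extract_features` and its `vocabulary` argument are hypothetical.

```rust
// Trimmed-down, illustrative stand-ins for RawPostData and ProcessedPostFeatures.
struct RawPost {
    is_self: bool,
    upvotes: u64,
    downvotes: u64,
    selftext: String,
    title: String,
}

struct PostFeatures {
    is_self: f32,
    ups: f32,
    downs: f32,
    post_len: f32,
    word_freq: Vec<f32>,
}

// Turns one raw post into numeric features.
// `vocabulary` is an assumed list of lowercase words worth counting.
fn extract_features(post: &RawPost, vocabulary: &[&str]) -> PostFeatures {
    let text = format!("{} {}", post.title, post.selftext).to_lowercase();
    let words: Vec<&str> = text.split_whitespace().collect();
    // Avoid dividing by zero for posts with no text at all.
    let total = words.len().max(1) as f32;
    // Relative frequency of each vocabulary word in title + body.
    let word_freq = vocabulary
        .iter()
        .map(|term| words.iter().filter(|w| *w == term).count() as f32 / total)
        .collect();
    PostFeatures {
        is_self: if post.is_self { 1.0 } else { 0.0 },
        ups: post.upvotes as f32,
        downs: post.downvotes as f32,
        post_len: post.selftext.chars().count() as f32,
        word_freq,
    }
}
```

Every field of the output is an `f32` or a `Vec<f32>`, so the result can be flattened into one row of a feature matrix.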
DataFrames are ergonomic, but incur technical debt
- Tabular data structure with SQL-like syntax
- Popular in languages like Python, R, and Julia
- Mixed types can lead to error-prone situations
- Memory overhead and dynamic dispatch
[Figure: table with columns author, ups, downs and rows 1-4]
Rust has no DataFrame
How do we interact with our data?
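fn main() {
    let v: Vec<RawPostData> = get_raw_data();
    // Borrow each author name instead of moving it out of the borrowed post.
    let author_values: Vec<f64> = v.iter()
        .map(|post| &post.author_name)
        .map(|author| calculate_author_value(author))
        .collect();
}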
Typed Approach with Structs
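A DataFrame-style "group by author" can also be written against typed structs. The sketch below is illustrative only: `Post` is a cut-down stand-in for RawPostData and `net_votes_by_author` is a hypothetical helper.

```rust
use std::collections::HashMap;

// Cut-down, illustrative stand-in for RawPostData.
struct Post {
    author_name: String,
    upvotes: u64,
    downvotes: u64,
}

// Sums (upvotes - downvotes) per author, like a group-by on a typed column.
fn net_votes_by_author(posts: &[Post]) -> HashMap<&str, i64> {
    let mut totals: HashMap<&str, i64> = HashMap::new();
    for post in posts {
        let net = post.upvotes as i64 - post.downvotes as i64;
        *totals.entry(post.author_name.as_str()).or_insert(0) += net;
    }
    totals
}
```

A misspelled field name or a string used where a number is expected fails at compile time instead of surfacing later as a runtime error.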
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let v: Vec<RawPostFeatures> = get_raw_features();
    // The output buffer must be mutable so rayon can fill it in place.
    let mut processed: Vec<f64> = Vec::with_capacity(v.len());
    v.par_iter()
        .map(|post| &post.author)
        .map(|author| calculate_author_value(author))
        // Called `collect_into` in early rayon releases.
        .collect_into_vec(&mut processed);
}
Parallel Implementation
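For intuition about what a parallel map does, here is a rough standard-library sketch using scoped threads. It is not how rayon works internally (rayon uses a work-stealing thread pool), and `parallel_map` and its `workers` parameter are hypothetical names. It assumes Rust 1.63 or later for `std::thread::scope`.

```rust
use std::thread;

// Splits `items` into roughly equal chunks, maps each chunk on its own
// thread, and concatenates the results in the original order.
fn parallel_map<T, R, F>(items: &[T], workers: usize, f: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let workers = workers.max(1);
    // Ceiling division; never zero, because `chunks(0)` would panic.
    let chunk_size = ((items.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = items
            .chunks(chunk_size)
            .map(|chunk| {
                let f = &f;
                scope.spawn(move || chunk.iter().map(f).collect::<Vec<R>>())
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

The `Sync` and `Send` bounds are what let the compiler reject closures that would share data unsafely between threads.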
LabelEncoding
"A", "B", "A", "C" -> 0, 1, 0, 2
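A minimal label encoder along these lines, offered as a sketch and not as the code that was benchmarked. `label_encode` is a hypothetical name; ids are assigned in order of first appearance.

```rust
use std::collections::HashMap;

// Maps each distinct label to a small integer id, in order of first
// appearance. Returns the encoded column and the id -> label table.
fn label_encode(labels: &[&str]) -> (Vec<usize>, Vec<String>) {
    let mut ids: HashMap<&str, usize> = HashMap::new();
    let mut classes: Vec<String> = Vec::new();
    let mut encoded = Vec::with_capacity(labels.len());
    for &label in labels {
        // The next free id is the number of labels seen so far.
        let next_id = ids.len();
        let id = *ids.entry(label).or_insert_with(|| {
            classes.push(label.to_string());
            next_id
        });
        encoded.push(id);
    }
    (encoded, classes)
}
```

Calling `label_encode(&["A", "B", "A", "C"])` yields the ids `[0, 1, 0, 2]` and the class table `["A", "B", "C"]`, matching the example above.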
| | 100,000 Strings | 1,000,000 Strings |
|---|---|---|
| Rust | 5ms, 11.9KB | 128ms, 48KB |
| Python | 90ms, 71KB | 1816ms, 183MB |
With Python, going from 100,000 to 1,000,000 strings is a roughly 2500x increase in memory usage!
Model Generation and Prediction
- Choose a machine learning algorithm
- Fit the model to our data
- Store the model, load it elsewhere to perform predictions
State of the Rust ML ecosystem
60+ crates with tags machine learning or linear algebra
~800K downloads
500+ versions published
http://arewelearningyet.com
#machine-learning (IRC)
A developing ML community
Rust linear algebra is promising
test dot_product_singlethreaded ... bench: 346,677 ns/iter (+/- 13,246)
test dot_product_rayon ... bench: 224,570 ns/iter (+/- 45,483)
test dot_product_openblas ... bench: 201,170 ns/iter (+/- 18,666)
source: http://www.suchin.co/2016/04/25/Matrix-Multiplication-In-Rust-Pt-1/
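For reference, a single-threaded dot product can be as small as the sketch below. This is an illustration, not the benchmarked code from the linked post; `dot` is a hypothetical name.

```rust
// Dot product of two equal-length slices; None if the lengths differ.
fn dot(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    // Multiply element-wise and sum the products.
    Some(a.iter().zip(b).map(|(x, y)| x * y).sum())
}
```

Returning `Option` makes the length mismatch case explicit instead of silently truncating to the shorter slice.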
let mut model = RandomForestParameters::new(params, 10);
model.fit(&feat_matrix, &ground_truth).unwrap();
serialize_to_file(&model, "./models/rustlearnrf");
Model Generator
let rf: RandomForest = deserialize_from_file("./models/rustlearnrf");
rf.predict(&novel_features).unwrap()
Predictor
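The store-then-load split can be sketched without any ML crate. The example below swaps the random forest for a tiny logistic model and uses a hand-rolled comma-separated file format. `LinearModel`, `save`, and `load` are hypothetical and unrelated to the `serialize_to_file` and `deserialize_from_file` helpers above.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Stand-in model: logistic regression weights instead of a random forest.
#[derive(Debug, PartialEq)]
struct LinearModel {
    weights: Vec<f32>,
    bias: f32,
}

impl LinearModel {
    // Returns a score in (0, 1); higher could mean "more like /r/playrust".
    fn predict(&self, features: &[f32]) -> f32 {
        let z = self.weights.iter().zip(features).map(|(w, x)| w * x).sum::<f32>() + self.bias;
        1.0 / (1.0 + (-z).exp())
    }

    // Generator side: write "bias,w0,w1,..." to disk.
    fn save(&self, path: &Path) -> io::Result<()> {
        let mut fields = vec![self.bias.to_string()];
        fields.extend(self.weights.iter().map(|w| w.to_string()));
        fs::write(path, fields.join(","))
    }

    // Predictor side: read the same format back, rejecting malformed files.
    fn load(path: &Path) -> io::Result<LinearModel> {
        let text = fs::read_to_string(path)?;
        let values = text
            .trim()
            .split(',')
            .map(|v| v.parse::<f32>().map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e)))
            .collect::<io::Result<Vec<f32>>>()?;
        let (bias, weights) = values
            .split_first()
            .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "empty model file"))?;
        Ok(LinearModel { weights: weights.to_vec(), bias: *bias })
    }
}
```

Both `save` and `load` return `io::Result`, so a missing or corrupt model file has to be handled by the prediction service instead of crashing it.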
The final result?
[Figure: classifier output scores 0.1, 0.9, 0.4]
Other reasons we like Rust
- Great built-in tools for testing, benchmarking, and documentation
- High-level abstractions amenable to data scientists coming from dynamic languages
- Rust language community is extremely welcoming and helpful
Where does Rust fall short?
- Machine learning ecosystem is fragmented
- Visualization tools are non-existent
- "Closer-to-metal" implies higher barrier to entry
- Data exploration is difficult in a static language
- Machine learning community is still sparse
Our vision for Rust Machine Learning
- Promoting Rust to improve the reliability of feature engineering systems
- Teaching ML and data science using Rust code
- Standardizing implementations of data science tooling
- Building a community around Rust ML, sharing ideas, having a metric for success
RustConf
By insanitybit
PlayRustClassifier