Suchin Gururangan
Data Scientist, Appuri
suchin.co
@pegasos1
Colin O'Brien
Software Engineer, Rapid7
insanitybit.github.io
@insanitybit
Talk Outline
Basic Problem
Service-Oriented Machine Learning
One-Off Data Science
Research
Production
Service-oriented ML
Data Collection -> Data Investigation -> Feature Extraction -> Model Generation -> Prediction
The ML Service
Feature Engineering: Data Investigation, Feature Extraction
ML System Technical Debt
Sculley, D., et al. "Machine learning: The high
interest credit card of technical debt." (2014).
[...] a mature system might end up being (at most) 5% machine
learning code [...]
Building Production-level Data Science Services
from POC research is unexpectedly difficult.
ML services are not only dependent on the
quality of the model, but also the quality of
feature engineering and data ingestion.
How can Rust help?
What's on /r/rust?
/r/rust: Rust lang
/r/playrust: Rust video game
Task: Create a binary classifier that can identify /r/playrust posts mistakenly published on /r/rust.
Data Collection
struct RawPostData {
    is_self: bool,
    author_name: String,
    url: String,
    downvotes: u64,
    upvotes: u64,
    score: u64,
    edited: bool,
    selftext: String,
    subreddit: String,
    title: String,
}
pub fn get_reddit_post(&self, url: &str) -> Vec<RawPostData> {
    let mut res = self.client
        .get(url)
        .send()
        .unwrap();
    let data = extract_data(&mut res)
        .unwrap();
    data
}
Proof Of Concept Code
pub fn get_reddit_post(&self, url: &str) -> Result<Vec<RawPostData>> {
    let mut res = try!(self.client
        .get(url)
        .send()
        .chain_err(|| format!("Failed to GET {}", url)));
    let data = try!(extract_data(&mut res)
        .chain_err(|| format!("Failed to parse data {}", url)));
    Ok(data)
}
Production
Proper error handling enforced by type signature.
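To make the point about type signatures concrete, here is a minimal standard-library sketch. It is not the talk's code: `FetchError`, `fetch`, `parse_score`, and `get_score` are hypothetical stand-ins (the slides use `chain_err`, presumably from the error-chain crate), and the "network call" is faked so the example stays self-contained.

```rust
use std::fmt;

// Hypothetical error type for this sketch; the slides use `chain_err` instead.
#[derive(Debug, PartialEq)]
enum FetchError {
    Network(String),
    Parse(String),
}

impl fmt::Display for FetchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FetchError::Network(url) => write!(f, "Failed to GET {}", url),
            FetchError::Parse(url) => write!(f, "Failed to parse data {}", url),
        }
    }
}

impl std::error::Error for FetchError {}

// Stand-in for the HTTP request: pretend only https URLs are reachable.
fn fetch(url: &str) -> Result<String, FetchError> {
    if url.starts_with("https://") {
        Ok(String::from("42"))
    } else {
        Err(FetchError::Network(url.to_string()))
    }
}

// Stand-in for `extract_data`: parse the response body into a score.
fn parse_score(url: &str, body: &str) -> Result<u64, FetchError> {
    body.trim()
        .parse::<u64>()
        .map_err(|_| FetchError::Parse(url.to_string()))
}

// The return type tells every caller that both failure modes exist;
// `?` propagates the first error instead of panicking like `unwrap()`.
fn get_score(url: &str) -> Result<u64, FetchError> {
    let body = fetch(url)?;
    parse_score(url, &body)
}
```

A caller has to `match` on the result (or propagate it with `?`) before it can touch the value, which is the property the slide contrasts with unchecked exceptions.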
def get_reddit_post(self, url):
    res = self.client.get(url).send()
    data = extract_data(res)
    return data
Unchecked Exceptions
Proper error handling relies on knowledge of implementation.
Data Investigation
Dynamic Languages dominate
Feature Extraction
struct ProcessedPostFeatures {
    is_self: f32,
    author_popularity: f32,
    downs: f32,
    ups: f32,
    score: f32,
    post_len: f32,
    word_freq: Vec<f32>,
    symbol_freq: Vec<f32>,
    regex_matches: Vec<f32>,
}
struct RawPostData {
    is_self: bool,
    author_name: String,
    url: String,
    downvotes: u64,
    upvotes: u64,
    score: u64,
    edited: bool,
    selftext: String,
    subreddit: String,
    title: String,
}
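One way the step from raw data to numeric features might look is sketched below. The structs are reduced copies of the two above (only the fields with an obvious numeric mapping are kept), and the `From` conversion is an illustrative assumption; the slides do not show how fields such as `author_popularity` or `word_freq` are actually computed.

```rust
// Reduced copy of the raw struct: only fields used in this sketch.
struct RawPostData {
    is_self: bool,
    downvotes: u64,
    upvotes: u64,
    score: u64,
    selftext: String,
}

// Reduced copy of the feature struct: every field is a plain f32.
#[derive(Debug, PartialEq)]
struct ProcessedPostFeatures {
    is_self: f32,
    downs: f32,
    ups: f32,
    score: f32,
    post_len: f32,
}

// Hypothetical conversion; the mapping for each field is an assumption.
impl From<&RawPostData> for ProcessedPostFeatures {
    fn from(raw: &RawPostData) -> Self {
        ProcessedPostFeatures {
            // Encode the boolean flag as 0.0 or 1.0.
            is_self: if raw.is_self { 1.0 } else { 0.0 },
            downs: raw.downvotes as f32,
            ups: raw.upvotes as f32,
            score: raw.score as f32,
            // Assumption: post length is the character count of the self text.
            post_len: raw.selftext.chars().count() as f32,
        }
    }
}
```

Because both sides are named structs, adding or renaming a field in either one is a compile error at this conversion rather than a silent change in a column.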
DataFrames are ergonomic, but incur technical debt
[Figure: DataFrame with columns author, ups, downs and rows 1 to 4]
Rust has no DataFrame
How do we interact with our data?
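fn main() {
    let v: Vec<RawPostData> = get_raw_data();
    // Borrow each author name rather than moving it out of the post.
    let author_values: Vec<f64> = v.iter()
        .map(|post| &post.author_name)
        .map(|author| calculate_author_value(author))
        .collect();
}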
Typed Approach with Structs
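As a further illustration of working without a DataFrame, the sketch below does a group-by over a slice of structs using only the standard library. `Post` and `mean_upvotes_by_author` are hypothetical names for this example and do not come from the talk.

```rust
use std::collections::HashMap;

// Hypothetical, reduced post type for this sketch.
struct Post {
    author_name: String,
    upvotes: u64,
}

// Mean upvotes per author, roughly what a DataFrame group-by would return.
fn mean_upvotes_by_author(posts: &[Post]) -> HashMap<&str, f64> {
    // Accumulate (sum of upvotes, number of posts) for each author.
    let mut totals: HashMap<&str, (u64, u64)> = HashMap::new();
    for post in posts {
        let entry = totals.entry(post.author_name.as_str()).or_insert((0, 0));
        entry.0 += post.upvotes;
        entry.1 += 1;
    }
    // Turn each (sum, count) pair into a mean.
    totals
        .into_iter()
        .map(|(author, (sum, count))| (author, sum as f64 / count as f64))
        .collect()
}
```

The field names are checked by the compiler, so a typo such as `post.author` fails at build time instead of surfacing as a missing column at run time.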
[Figure: DataFrame with columns author, ups, downs and rows 1 to 4]
extern crate rayon;
use rayon::prelude::*;
fn main() {
    let v: Vec<RawPostData> = get_raw_data();
    let mut processed: Vec<f64> = Vec::with_capacity(v.len());
    v.par_iter()
        .map(|post| &post.author_name)
        .map(|author| calculate_author_value(author))
        // `collect_into` was renamed `collect_into_vec` in rayon 1.0.
        .collect_into_vec(&mut processed);
}
Parallel Implementation
[Figure: DataFrame with columns author, ups, downs and rows 1 to 4]
LabelEncoding
"A", "B", "A", "C" -> 0, 1, 0, 2
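The mapping above can be sketched in a few lines of standard-library Rust. This is an illustrative implementation, not the one benchmarked in the talk; `label_encode` is a hypothetical name, and it assumes codes are handed out in order of first appearance, which matches the example.

```rust
use std::collections::HashMap;

// Give each distinct label the next unused integer, in order of first
// appearance, and return the code for every input position.
fn label_encode(labels: &[&str]) -> Vec<usize> {
    let mut codes: HashMap<&str, usize> = HashMap::new();
    labels
        .iter()
        .map(|&label| {
            // The next free code equals the number of labels seen so far.
            let next = codes.len();
            *codes.entry(label).or_insert(next)
        })
        .collect()
}
```

The map borrows the input strings rather than copying them, so the only allocations are the hash table and the output vector.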
| | 100,000 Strings | 1,000,000 Strings |
|---|---|---|
| Rust | 5 ms, 11.9 KB | 128 ms, 48 KB |
| Python | 90 ms, 71 KB | 1816 ms, 183 MB |
With Python, going from 100,000 to 1,000,000 strings brings a roughly 2500x increase in memory usage!
Model Generation and Prediction
State of the Rust ML ecosystem
60+ crates with tags machine learning or linear algebra
~800K downloads
500+ versions published
http://arewelearningyet.com
#machine-learning (IRC)
A developing ML community
Rust linear algebra is promising
test dot_product_singlethreaded ... bench: 346,677 ns/iter (+/- 13,246)
test dot_product_rayon ... bench: 224,570 ns/iter (+/- 45,483)
test dot_product_openblas ... bench: 201,170 ns/iter (+/- 18,666)
source: http://www.suchin.co/2016/04/25/Matrix-Multiplication-In-Rust-Pt-1/
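For readers who want to see what the single-threaded and parallel variants compute, here is a minimal sketch using only the standard library. It is not the benchmarked code: the parallel version uses `std::thread::scope` (available since Rust 1.63) as a stand-in for rayon, and `dot`, `dot_parallel`, and the `n_threads` parameter are names assumed for this example.

```rust
use std::thread;

// Single-threaded dot product over two equal-length slices.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Chunked dot product: split both slices into at most `n_threads` pieces,
// compute each partial sum on its own scoped thread, then add them up.
fn dot_parallel(a: &[f64], b: &[f64], n_threads: usize) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let n = n_threads.max(1);
    // Ceiling division, clamped to 1 because `chunks(0)` would panic.
    let chunk = ((a.len() + n - 1) / n).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = a
            .chunks(chunk)
            .zip(b.chunks(chunk))
            .map(|(ca, cb)| s.spawn(move || dot(ca, cb)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<f64>()
    })
}
```

Spawning one OS thread per chunk is far cruder than rayon's work-stealing pool, so this sketch illustrates the structure of the computation rather than the performance figures above.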
let mut model = RandomForestParameters::new(params, 10);
model.fit(&feat_matrix, &ground_truth).unwrap();
serialize_to_file(&model, "./models/rustlearnrf");
Model Generator
let rf: RandomForest = deserialize_from_file("./models/rustlearnrf");
rf.predict(&novel_features).unwrap()
Predictor
The final result?
Output: 0.1, 0.9, 0.4
Other reasons we like Rust
Where does Rust fall short?
Our vision for Rust Machine Learning