/r/playrust classifier

 

 Real World Rust Data Science

Suchin Gururangan

Colin O'Brien

Suchin

Data Scientist

Appuri

suchin.co

Colin

Software Engineer

Rapid7

insanitybit.github.io

@insanitybit

@pegasos1

Talk Outline

  • Technical Debt in Data Science
  • /r/playrust classifier
  • Vision for Rust ML

Basic Problem

Service-Oriented Machine Learning

One-Off Data Science

Research

Production

Service-oriented ML

Pipeline stages: Data Collection, Data Investigation, Feature Extraction, Model Generation, Prediction

The ML Service

Pipeline stages: Data Collection, Data Investigation, Feature Extraction, Model Generation, Prediction

Feature Engineering: Data Investigation and Feature Extraction

ML System Technical Debt

  • Siloed teams
  • Pipeline Jungles
  • Unscalable Experiments
  • ...and all the normal engineering tech debt

Sculley, D., et al. "Machine learning: The high interest credit card of technical debt." (2014).

[...] a mature system might end up being (at most) 5% machine learning code [...]

Building production-level data science services from POC research is unexpectedly difficult.

ML services depend not only on the quality of the model, but also on the quality of feature engineering and data ingestion.

How can Rust help?


What's on /r/rust?

/r/rust: Rust lang

/r/playrust: Rust video game

Task: Create a binary classifier that can identify /r/playrust posts mistakenly published on /r/rust.

Data Collection

  • Retrieve data from Reddit safely
  • Store data for investigation
struct RawPostData {
    is_self: bool,
    author_name: String,
    url: String,
    downvotes: u64,
    upvotes: u64,
    score: u64,
    edited: bool,
    selftext: String,
    subreddit: String,
    title: String,
}


  
pub fn get_reddit_post(&self, url : &str) -> Vec<RawPostData> {
     let mut res = self.client
                       .get(url)
                       .send()
                       .unwrap();

     let data = extract_data(&mut res)
                    .unwrap();
     data
}

Proof Of Concept Code



  
// `Result` and `chain_err` come from the error-chain crate.
pub fn get_reddit_post(&self, url: &str) -> Result<Vec<RawPostData>> {
    // `?` returns early with the annotated error instead of panicking.
    let mut res = self.client
        .get(url)
        .send()
        .chain_err(|| format!("Failed to GET {}", url))?;

    let data = extract_data(&mut res)
        .chain_err(|| format!("Failed to parse data {}", url))?;
    Ok(data)
}

Production

Proper error handling enforced by type signature.
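
The same idea can be sketched with only the standard library. The names `FetchError` and `parse_score` below are hypothetical and chosen for illustration; the talk's own code relies on the error-chain crate instead. The point is only that the `Result` return type makes both failure cases visible to every caller.

```rust
use std::fmt;
use std::num::ParseIntError;

// Hypothetical error type for illustration; the talk's code uses error-chain instead.
#[derive(Debug)]
pub enum FetchError {
    // The input was empty, so there was nothing to parse.
    Empty,
    // The input was present but was not a valid unsigned integer.
    BadScore { input: String, source: ParseIntError },
}

impl fmt::Display for FetchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FetchError::Empty => write!(f, "no score to parse"),
            FetchError::BadScore { input, source } => {
                write!(f, "failed to parse score {:?}: {}", input, source)
            }
        }
    }
}

impl std::error::Error for FetchError {}

// The return type forces every caller to decide what to do on failure.
pub fn parse_score(raw: &str) -> Result<u64, FetchError> {
    let trimmed = raw.trim();
    if trimmed.is_empty() {
        return Err(FetchError::Empty);
    }
    trimmed.parse::<u64>().map_err(|source| FetchError::BadScore {
        input: trimmed.to_string(),
        source,
    })
}
```

A caller cannot read the `u64` without matching on the `Result` or propagating it with `?`, which is the property the slide contrasts with unchecked exceptions.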

def get_reddit_post(self, url):
    res = self.client.get(url).send()

    data = extract_data(res)

    return data

Unchecked Exceptions

Proper error handling relies on knowledge of implementation.

Data Investigation

  • Explore the Reddit data to identify the features that best discriminate /r/rust and /r/playrust

Dynamic Languages dominate

  • Instant feedback lends itself well to exploration
  • Natural investigative feel of REPL
  • Graphing and visuals are a high priority
  • Performance and stability don't matter
  • Can iterate faster by ignoring types

Feature Extraction

  • Extract meaningful features based on trends we have seen in the data.
  • Convert those features to a representation the model can understand.
struct ProcessedPostFeatures {
   is_self: f32,
   author_popularity: f32,
   downs: f32,
   ups: f32,
   score: f32,
   post_len: f32,
   word_freq: Vec<f32>,
   symbol_freq: Vec<f32>,
   regex_matches: Vec<f32>,
}
struct RawPostData {
    is_self: bool,
    author_name: String,
    url: String,
    downvotes: u64,
    upvotes: u64,
    score: u64,
    edited: bool,
    selftext: String,
    subreddit: String,
    title: String,
}
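
A minimal sketch of the raw-to-processed conversion might look like the following. It uses trimmed-down, hypothetical versions of the two structs (`RawPost`, `PostFeatures`) so that it stands alone, and the symbol list in `SYMBOLS` is an assumption; the talk does not say which symbols or words were counted.

```rust
// Trimmed-down stand-ins for the structs above, kept small so the sketch is self-contained.
struct RawPost {
    is_self: bool,
    upvotes: u64,
    downvotes: u64,
    selftext: String,
}

struct PostFeatures {
    is_self: f32,
    ups: f32,
    downs: f32,
    post_len: f32,
    symbol_freq: Vec<f32>,
}

// Hypothetical choice of symbols to count; the talk does not list the ones it used.
const SYMBOLS: [char; 4] = ['{', '}', ';', '&'];

fn extract_features(post: &RawPost) -> PostFeatures {
    let char_count = post.selftext.chars().count();
    // Fraction of characters in the post body equal to each tracked symbol.
    let symbol_freq = SYMBOLS
        .iter()
        .map(|&sym| {
            if char_count == 0 {
                0.0
            } else {
                let hits = post.selftext.chars().filter(|&c| c == sym).count();
                hits as f32 / char_count as f32
            }
        })
        .collect();

    PostFeatures {
        // Booleans become 0.0 or 1.0 so the model sees only numbers.
        is_self: if post.is_self { 1.0 } else { 0.0 },
        ups: post.upvotes as f32,
        downs: post.downvotes as f32,
        post_len: char_count as f32,
        symbol_freq,
    }
}
```

Every field of the output is numeric, so a typo in a field name or a mismatched type is caught at compile time rather than during a training run.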

DataFrames are ergonomic, but incur technical debt

  • Tabular data structure with SQL-like syntax
  • Popular in languages like Python, R, and Julia
  • Mixed types can lead to error-prone situations
  • Memory overhead and dynamic dispatch

[Figure: a table with columns author, ups, downs and rows 1 to 4]

Rust has no DataFrame

How do we interact with our data?

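fn main() {
    let v: Vec<RawPostData> = get_raw_data();

    // Borrow each author name rather than moving it out of the post.
    // Assumes calculate_author_value takes &String (or &str) and returns f64.
    let author_values: Vec<f64> = v.iter()
        .map(|post| &post.author_name)
        .map(|author| calculate_author_value(author))
        .collect();
}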

Typed Approach with Structs


extern crate rayon;

use rayon::prelude::*;

fn main() {
    let v: Vec<RawPostData> = get_raw_data();
    // The output vector must be mutable so rayon can fill it.
    let mut processed: Vec<f64> = Vec::with_capacity(v.len());

    // Recent rayon releases name this method collect_into_vec.
    v.par_iter()
        .map(|post| &post.author_name)
        .map(|author| calculate_author_value(author))
        .collect_into_vec(&mut processed);
}

Parallel Implementation


LabelEncoding

"A", "B", "A", "C"  →  0, 1, 0, 2
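
One way to sketch label encoding in Rust is with a `HashMap` that hands out the next unused integer the first time a label is seen. The function name `label_encode` is hypothetical, and this is not necessarily the implementation that was benchmarked in the talk.

```rust
use std::collections::HashMap;

// Assigns each distinct label the next unused integer, in order of first appearance.
fn label_encode(labels: &[&str]) -> Vec<usize> {
    let mut ids: HashMap<&str, usize> = HashMap::new();
    labels
        .iter()
        .map(|&label| {
            // The next id equals the number of distinct labels seen so far.
            let next_id = ids.len();
            *ids.entry(label).or_insert(next_id)
        })
        .collect()
}
```

Calling `label_encode(&["A", "B", "A", "C"])` yields `[0, 1, 0, 2]`, matching the example above. The map borrows the input strings instead of copying them, which keeps memory overhead low.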

          100,000 Strings    1,000,000 Strings
Rust      5ms, 11.9KB        128ms, 48KB
Python    90ms, 71KB         1816ms, 183MB

With Python... 2500x increase in memory usage when going from 100,000 to 1,000,000 strings!

Model Generation and Prediction

  • Choose a machine learning algorithm
  • Fit the model to our data
  • Store the model, load it elsewhere to perform predictions
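
The store-and-load step can be sketched without any crates. The example below swaps in a tiny linear model (`LinearModel`, a hypothetical stand-in, not the random forest used in the talk) and a plain-text file format invented for this illustration; a real service would more likely use a serialization crate.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical stand-in model: one weight per feature plus a bias term.
#[derive(Debug, PartialEq)]
struct LinearModel {
    weights: Vec<f32>,
    bias: f32,
}

impl LinearModel {
    // Writes the bias on the first line and one weight per following line.
    fn save(&self, path: &Path) -> io::Result<()> {
        let mut lines = vec![self.bias.to_string()];
        lines.extend(self.weights.iter().map(|w| w.to_string()));
        fs::write(path, lines.join("\n"))
    }

    // Reads a model back from the format produced by `save`.
    fn load(path: &Path) -> io::Result<LinearModel> {
        let text = fs::read_to_string(path)?;
        let mut values = Vec::new();
        for line in text.lines() {
            let value: f32 = line
                .trim()
                .parse()
                .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
            values.push(value);
        }
        if values.is_empty() {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "empty model file"));
        }
        let bias = values.remove(0);
        Ok(LinearModel { weights: values, bias })
    }

    // Raw score for one feature vector; `zip` stops at the shorter of the two inputs.
    fn score(&self, features: &[f32]) -> f32 {
        self.bias + self.weights.iter().zip(features).map(|(w, x)| w * x).sum::<f32>()
    }
}
```

The generator process would call `save` after fitting, and the predictor process would call `load` once at startup and then `score` for each incoming post.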

State of the Rust ML ecosystem

60+ crates with tags machine learning or linear algebra

~800K downloads

500+ versions published

http://arewelearningyet.com

#machine-learning (IRC)

A developing ML community

Rust linear algebra is promising

test dot_product_singlethreaded    ... bench:     346,677 ns/iter (+/- 13,246)
test dot_product_rayon             ... bench:     224,570 ns/iter (+/- 45,483)
test dot_product_openblas          ... bench:     201,170 ns/iter (+/- 18,666)

source: http://www.suchin.co/2016/04/25/Matrix-Multiplication-In-Rust-Pt-1/
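
For reference, a single-threaded dot product can be written with iterators alone. This is a minimal sketch and not the code that produced the benchmark numbers above; the function name `dot_product` is chosen here for illustration.

```rust
// Single-threaded dot product over two equal-length slices.
fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    // Panics on mismatched lengths rather than silently truncating.
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}
```

The rayon and OpenBLAS variants in the benchmark compute the same quantity; they differ in how the work is split across threads or delegated to an optimized library.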


let mut model = RandomForestParameters::new(params, 10);
model.fit(&feat_matrix, &ground_truth).unwrap();
serialize_to_file(&model, "./models/rustlearnrf");

Model Generator

let rf: RandomForest = deserialize_from_file("./models/rustlearnrf");
rf.predict(&novel_features).unwrap()

Predictor

The final result?

 

[Figure: classifier output values of 0.1, 0.9, and 0.4]

Other reasons we like Rust

  • Great built-in tools for testing, benchmarking, and documentation
  • High-level abstractions amenable to data scientists coming from dynamic languages
  • The Rust language community is extremely welcoming and helpful

Where does Rust fall short?

  • Machine learning ecosystem is fragmented 
  • Visualization tools are non-existent
  • "Closer-to-metal" implies higher barrier to entry
  • Data exploration is difficult in a static language
  • Machine learning community is still sparse 

Our vision for Rust Machine Learning

 

 

  • Promoting Rust to improve the reliability of feature engineering systems
  • Teaching ML and data science using Rust code
  • Standardizing implementations of data science tooling
  • Building a community around Rust ML, sharing ideas, and having a metric for success

The PlayRust Classifier: RustConf 2016

By Suchin Gururangan
