Practical Data Science in Ruby

#FF0000 Leather Pants.

ybur-yug == Developer
# => true

@yburyug - Twitter

ybur-yug - Github

PS: This will be my last CRB talk for a long while. Thank you for the support & allowing me to geek out

<3

"Data-Driven"

Data Infrastructure

  • Distributed Systems
  • ETL
  • MapReduce
  • Data Warehousing
  • High Availability
  • PubSub (live analysis)
  • A means of experimentation

Can We Do It Without This?

Yes. Sorta.

Where to Start?

Web Application Datapoints

  • Comments/Feedback
  • Traffic Patterns/Analysis
  • Navigation Recommendation (Netflix recommended content)

The Complicated Stuff

  • Recommender Models

These are generally very linear-algebra heavy and require keeping up with modern research

  • Site Traversal

This too requires a lot of modern reading. Everything from K-Nearest Neighbors to Random Forest algorithms can be used and will be mentioned

Potential Easy Starts

Natural Language Analysis!

NLP

A field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction.

Simple NLP In Ruby

Sentiment Analysis

Sentimental

A simple gem to get us started

https://github.com/ybur-yug/CRB_ruby_data

$ gem install sentimental

A Simple Start

Sentimental is a simple gem for analyzing the sentiment (the positive, negative, or neutral inflection) of a string or corpus
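
To make the idea concrete, here is a minimal pure-Ruby sketch of lexicon-based scoring. It is not the gem's implementation: the tiny `LEXICON`, its weights, and the `score` and `sentiment` helpers are all made up for illustration.

```ruby
# Toy lexicon-based sentiment scorer (illustrative only; NOT the sentimental gem).
# LEXICON and its weights are hypothetical values chosen for this example.
LEXICON = {
  'love' => 1.0, 'great' => 0.75, 'good' => 0.5,
  'bad' => -0.5, 'awful' => -0.75, 'hate' => -1.0
}.freeze

# Sum the weights of every known word in the text; unknown words count as 0.
def score(text)
  text.downcase.scan(/[a-z']+/).sum { |word| LEXICON.fetch(word, 0.0) }
end

# Scores above +threshold are positive, below -threshold negative,
# and everything in between is neutral.
def sentiment(text, threshold: 0.4)
  value = score(text)
  if value > threshold then :positive
  elsif value < -threshold then :negative
  else :neutral
  end
end

puts sentiment('I love this great article')
puts sentiment('What an awful, bad take')
puts sentiment('I read the article')
```

Raising the threshold widens the neutral band, so fewer comments get labelled positive or negative.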

Let's say we have a JSON block of all our comments...

# analyzer.rb
require 'sentimental'

class Analyzer
  def initialize(comments, threshold)
    @comments = comments
    Sentimental.load_defaults         # load the default training model
    Sentimental.threshold = threshold # scores within +/- threshold count as neutral
    @analyzer = Sentimental.new
  end

  def sentiments
    @sentiments = @comments.map do |comment|
      { comment:   comment['body'],
        score:     @analyzer.get_score(comment['body']),
        sentiment: @analyzer.get_sentiment(comment['body']) }
    end
  end
end

A Simple Analysis Class

18 Lines of Code

Using It

  1. Prepare our data
  2. Prepare output destination
  3. Run

So now we do that...

$ ls data
jan feb mar apr may jun jul aug sep oct nov dec
$ ls data/jan
month.json

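# lib/prep.rb
require 'fileutils'

SPLIT_SIZE = 50_000 # lines per chunk

Dir.foreach('../data/') do |dir|
  next if dir == '.' || dir == '..' # Dir.foreach also yields these entries

  FileUtils.mkdir_p "../data/#{dir}/split" # no error if it already exists
  # Shell out to split(1) to break each month into SPLIT_SIZE-line chunks
  system('split', '-l', SPLIT_SIZE.to_s, "../data/#{dir}/month.json", "../data/#{dir}/split/data-")
end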

Run the provided setup script...

And now the sentiment analysis...

# lib/run.rb
require 'json'
require_relative 'analyzer'

all_data = []

Dir.foreach('../data') do |month|
  next if month == '.' || month == '..'

  Dir.foreach("../data/#{month}/split") do |part|
    next if part == '.' || part == '..'

    # Each line of a split file is one JSON-encoded comment
    comments = File.readlines("../data/#{month}/split/#{part}").map do |line|
      JSON.parse(line)
    end
    data = Analyzer.new(comments, 0.4).sentiments
    data.each { |s| s[:month] = month } # tag each result with its month
    all_data.concat(data)               # keep one flat array of results
  end
end

Sample

$ ruby analyze.rb data_sample.json
# =>
Comment Set Size: 10000

Result One Avg (threshold 0.6): 0.12837111622858283
Result One Positive Sentiment Count: 6001
Result One Negative Sentiment Count: 1663
Result One Neutral Sentiment Count: 6001

Result Two Avg (threshold 0.8): 0.12837111622858283
Result Two Positive Sentiment Count: 7006
Result Two Negative Sentiment Count: 1199
Result Two Neutral Sentiment Count: 7006

Result Three Avg (threshold 0.4): 0.12837111622858283
Result Three Positive Sentiment Count: 4983
Result Three Negative Sentiment Count: 2097
Result Three Neutral Sentiment Count: 4983
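
A summary of this kind could be computed from the array of hashes that `Analyzer#sentiments` returns. A minimal sketch, assuming each result carries `:score` and `:sentiment` keys as in analyzer.rb; the `summarize` helper and the sample `results` are hypothetical and are not the data behind the numbers above.

```ruby
# Summarize analyzer output: average score plus a count per sentiment label.
# `summarize` is a hypothetical helper; `results` mimics Analyzer#sentiments.
def summarize(results)
  counts = results.each_with_object(Hash.new(0)) { |r, h| h[r[:sentiment]] += 1 }
  average = results.empty? ? 0.0 : results.sum { |r| r[:score] } / results.size.to_f
  { size: results.size, average: average, counts: counts }
end

results = [
  { comment: 'loved it',   score: 0.75,  sentiment: :positive },
  { comment: 'it was ok',  score: 0.0,   sentiment: :neutral  },
  { comment: 'hated this', score: -0.75, sentiment: :negative },
  { comment: 'so good',    score: 0.5,   sentiment: :positive }
]

summary = summarize(results)
puts "Comment Set Size: #{summary[:size]}"
puts "Average Score: #{summary[:average]}"
summary[:counts].each do |label, count|
  puts "#{label.capitalize} Sentiment Count: #{count}"
end
```

Since every comment gets exactly one label, the three counts should always add up to the comment set size, which makes a handy sanity check.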

Next:

Building It Into Rails

Sentimentalizer

A Rails Engine

$ cd my_rails_app
$ editor Gemfile # add `gem 'sentimentalizer'`
$ bundle
$ rails g sentimentalizer:install

The generator adds an `after_initialize` hook that loads the default training model

Now...

  • Create a 'Sentiment' model that belongs to what you are analyzing
  • Give it text, probability, and sentiment attributes
  • Add an `after_create :find_sentiment` callback
  • Persist
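
The steps above might look roughly like this. The sketch is plain Ruby so it runs without Rails: `Sentiment` stands in for the ActiveRecord model, `save` plays the role of the `after_create` callback, and `FAKE_ANALYZER` is a made-up stand-in for whatever analysis call the engine provides. None of these names come from Sentimentalizer itself.

```ruby
# Plain-Ruby stand-in for the Rails flow described above (no Rails required).
# Sentiment mirrors the suggested attributes: text, probability, sentiment.
Sentiment = Struct.new(:text, :probability, :sentiment)

# Hypothetical analyzer: anything responding to #call(text) that returns a
# hash with :probability and :sentiment. Swap in the real engine call here.
FAKE_ANALYZER = lambda do |text|
  positive = text.match?(/love|great|thanks/i)
  { probability: 0.9, sentiment: positive ? :positive : :negative }
end

class Comment
  attr_reader :body, :sentiment

  def initialize(body, analyzer: FAKE_ANALYZER)
    @body = body
    @analyzer = analyzer
  end

  # In Rails this would be `after_create :find_sentiment` on the model.
  def save
    find_sentiment
    self
  end

  private

  # Analyze the comment body and keep the result alongside the comment.
  def find_sentiment
    result = @analyzer.call(body)
    @sentiment = Sentiment.new(body, result[:probability], result[:sentiment])
  end
end

comment = Comment.new('Thanks, great talk!').save
puts comment.sentiment.sentiment
```

Storing the result next to the comment means the analysis runs once per comment, and later queries are plain database reads.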

Usefulness

  • Understand positivity/negativity in comments/chat
  • Have a means to detect 'flame wars'
  • Detect changes in positive talk towards negative talk or vice-versa over time with simple analysis
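
Tracking change over time can be as simple as averaging scores per month. A minimal sketch, assuming results shaped like the hashes built in lib/run.rb (with `:month` and `:score` keys); `monthly_averages`, `biggest_drop`, and the sample data are hypothetical.

```ruby
# Average sentiment score per month, in the order the months are given.
def monthly_averages(results, months)
  by_month = results.group_by { |r| r[:month] }
  months.each_with_object({}) do |month, averages|
    rows = by_month.fetch(month, [])
    next if rows.empty? # skip months with no comments

    averages[month] = rows.sum { |r| r[:score] } / rows.size.to_f
  end
end

# Find the pair of consecutive months with the largest fall in average score.
# Returns [[month, average], [next_month, average]], or nil with < 2 months.
def biggest_drop(averages)
  averages.each_cons(2).min_by { |(_, before), (_, after)| after - before }
end

results = [
  { month: 'jan', score: 0.5 },  { month: 'jan', score: 0.25 },
  { month: 'feb', score: 0.25 }, { month: 'feb', score: -0.75 },
  { month: 'mar', score: -0.5 }
]

averages = monthly_averages(results, %w[jan feb mar])
(from, _), (to, _) = biggest_drop(averages)
puts averages
puts "Largest drop: #{from} -> #{to}"
```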

Will This Change Your Life?

Nope.

Diving Deeper:

Bayesian Classifiers

What? Math Is Hard.

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
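
Stripped of the notation: count how often each word shows up in each category, then pick the category that makes the words in a new message most likely, treating every word as independent of the others. A minimal pure-Ruby sketch with add-one (Laplace) smoothing follows. `TinyBayes` and the training strings are made up for illustration, and this is not the implementation of any particular gem.

```ruby
# Minimal multinomial naive Bayes with add-one (Laplace) smoothing.
# TinyBayes is a hypothetical teaching class, not a library API.
class TinyBayes
  def initialize(*categories)
    @word_counts = categories.to_h { |c| [c, Hash.new(0)] } # word counts per category
    @doc_counts  = categories.to_h { |c| [c, 0] }           # training docs per category
    @vocabulary  = {}
  end

  def train(category, text)
    @doc_counts[category] += 1
    tokenize(text).each do |word|
      @word_counts[category][word] += 1
      @vocabulary[word] = true
    end
  end

  # Return the category with the highest log-probability for the text.
  # Train every category at least once before calling this.
  def classify(text)
    words = tokenize(text)
    @word_counts.keys.max_by { |category| log_score(category, words) }
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end

  # log P(category) + sum of log P(word | category), each word independent.
  def log_score(category, words)
    total_docs  = @doc_counts.values.sum.to_f
    total_words = @word_counts[category].values.sum
    prior = Math.log(@doc_counts[category] / total_docs)
    words.sum(prior) do |word|
      Math.log((@word_counts[category][word] + 1.0) / (total_words + @vocabulary.size))
    end
  end
end

bayes = TinyBayes.new(:spam, :ham)
bayes.train(:spam, 'buy cheap pills now')
bayes.train(:spam, 'cheap watches buy now')
bayes.train(:ham,  'that was an interesting and thought provoking piece')
bayes.train(:ham,  'i enjoyed this interesting article')

puts bayes.classify('buy cheap pills')
puts bayes.classify('an interesting article')
```

Working in log space avoids multiplying many tiny probabilities together, and the add-one smoothing keeps a single unseen word from zeroing out a whole category.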

Naive Bayes Classifier:

Why?

FIND ALL THE SPAM

Back When Spam Email Was REALLY Bad

A Gem, So That We Don't Have To Do The Math

$ gem install classifier

And now, we grab our spam

Training

require 'classifier'

spam = File.read('our_spam.txt')
good_comments = File.read('our_comments.txt')

classifier = Classifier::Bayes.new('Spam', 'Ham')

# single input example
classifier.train_spam 'BUY THIS SHIT'
classifier.train_ham 'that was an interesting and thought provoking piece'
classifier.classify "I enjoyed this article"
# => ham

# Train on a large set, one example per line
spam.each_line          { |line| classifier.train_spam line }
good_comments.each_line { |line| classifier.train_ham line }

# Classify Away!



Takeaways

Call on God, but row away from the rocks.

- Hunter S. Thompson

Foundations of Machine Learning

Mohri

Advanced Calculus

Woods

The Tools Are There

  • Bayesian Classifiers
  • Term Frequency-Inverse Document Frequency (TF-IDF)
  • Vector Space Models
  • Neural Networks

Commonality: all of these have robust, open-source tools readily available
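
As one example from the list, TF-IDF weights a term by how often it appears in a document, discounted by how many documents contain it. A minimal pure-Ruby sketch using the common `tf * log(N / df)` form, which is only one of several variants; `tokenize`, `tf_idf`, and the sample documents are hypothetical.

```ruby
# TF-IDF: term frequency within a document times log(N / document frequency).
# This is one common variant; libraries differ in smoothing and normalization.
def tokenize(text)
  text.downcase.scan(/[a-z']+/)
end

# Returns one { term => weight } hash per input document.
def tf_idf(documents)
  tokenized = documents.map { |doc| tokenize(doc) }

  # Document frequency: in how many documents does each term appear?
  doc_freq = Hash.new(0)
  tokenized.each { |words| words.uniq.each { |w| doc_freq[w] += 1 } }

  tokenized.map do |words|
    counts = words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
    counts.to_h do |term, count|
      tf  = count / words.size.to_f
      idf = Math.log(documents.size / doc_freq[term].to_f)
      [term, tf * idf]
    end
  end
end

docs = [
  'ruby is great for web apps',
  'python is great for data science',
  'ruby can do data science too'
]

weights = tf_idf(docs)
# Show the two highest-weighted terms of the first document.
weights.first.sort_by { |_, weight| -weight }.first(2).each do |term, weight|
  puts format('%-5s %.3f', term, weight)
end
```

A term that appears in every document gets a weight of 0, since log(N / N) is 0, which is how TF-IDF pushes filler words out of the way.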

BUT

The Best Tools Are In Python

One Last Example (In Python + RestMQ)

# Simple word frequency counter
# Clone RestMQ (https://www.github.com/gleicon/restmq)
$ git clone https://www.github.com/gleicon/restmq.git
$ cd restmq/examples/mapreduce

# Download a huge text file (e.g. the Bible, some Gutenberg books)

$ mkdir files
$ split -l 1000 yourebook.txt files/bookfrag-
# In another terminal run our consumer
$ python reduce.py
# Now, run the producer:
$ for a in `ls files`; do python map.py files/$a; done
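
For comparison, the same word-frequency idea fits in a few lines of plain Ruby when everything runs in one process. This sketch is not a port of the RestMQ example: there is no queue, and `map_fragment`, `reduce_counts`, and the sample fragments are hypothetical. Each fragment is mapped to its own counts, then the counts are merged.

```ruby
# Map step: count the words in a single fragment of text.
def map_fragment(text)
  text.downcase.scan(/[a-z']+/).each_with_object(Hash.new(0)) do |word, counts|
    counts[word] += 1
  end
end

# Reduce step: merge per-fragment counts by adding values for matching words.
def reduce_counts(partials)
  partials.reduce({}) { |total, partial| total.merge(partial) { |_, a, b| a + b } }
end

fragments = [
  'In the beginning was the word',
  'and the word was good'
]

totals = reduce_counts(fragments.map { |fragment| map_fragment(fragment) })

# Print the three most frequent words, ties broken alphabetically.
totals.sort_by { |word, count| [-count, word] }.first(3).each do |word, count|
  puts "#{word}: #{count}"
end
```

Because the map step only ever sees one fragment, the fragments could later be handed to separate workers without changing the reduce step.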

BONUS ROUND:

Live Editable & Updating Page w/no JS in < 20 lines

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Whoa</title>
</head>
<body>
  <p>Edit for live preview</p>
  <style contenteditable="true">
    style { display: block; font-family: "Open Sans", sans-serif; }
    div { color: red; background: black; }
  </style>
  <div>Hello World</div>
</body>
</html>

Magic

Q&A + Yell At Me For Having A Python Example

Practical Data Science In Ruby

By Bobby Grayson

Think you need Mesos, Kafka, data warehousing, and crazy concurrency for data analytics? Well, you sort of do, but we can still do some useful things without them, and we won't even have to add a new language to our stack!
