Map Reduce




Talk by Kieran Andrews

http://kieranandrews.com.au

Map Reduce


...programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

Basics

Map

filtering and sorting

This distributes the set into smaller problems


Reduce

collects all the answers to the sub problems

and combines them

In Ruby

rainbow = ['red', 'orange', 'yellow', 'green', 'blue', 'purple'] 
Select
rainbow.select {|color| color.size >= 5} # ['orange', 'yellow', 'green', 'purple'] 
Map
rainbow.map {|color| color.upcase}# ['RED', 'ORANGE', 'YELLOW', 'GREEN', 'BLUE', 'PURPLE' 
Reduce (inject)
# number of colorsrainbow.reduce(0) {|acc, n| acc += 1}# 6 

How about Highly dIstributed?

The chunks (during map) can be processed in parallel
Distribute this across several servers (100s)

word_chunks.map do |chunk|
  assign_to_server(count_words(chunk)) #[{"the" => 1}, {"cat" => 1}, {"the" => 1], {"dog" => 1}]
end 
Improves performance for large data sets


Example Time


Normally do on something like Hadoop
Local example showing the concept

http://www.bigfastblog.com/map-reduce-with-ruby-using-hadoop#coding-your-map-and-reduce-scripts-in-ruby

https://github.com/TigerWolf/ruby-map-reduce

Hadoop


open source project
for storing and large scale processing of data-sets on clusters of commodity hardware 
apache foundation

MongoDB


Also supports mapreduce!
Try it out yourself, mapreduce with mongo in the browser

https://github.com/angelozerr/mongo-mapreduce-webbrowser

RubyGem for MapReduce with Mongo:

https://github.com/jcoene/mongoid-mapreduce

Fin


Thanks

Hope you enjoyed.
Made with Slides.com