Loading

MapReduce: JavaScript and MongoDB Datasets

michaelcole

This is a live streamed presentation. You will automatically follow the presenter and see the slide they're currently on.

Map, Filter, and Reduce in JavaScript

Thank you

Data Geek User Group Vancouver WA

michael@powma.com

Michael Cole

  • Freelance consultant
  • Senior Dev/Tech Lead
  • Full-stack JavaScript
  • DevOps / µ-services

 

 

 

 

 

 

I like to do tech for fun.

Powma.com

The Power To Create

 

As a freelancer, I've specialized in moving small/medium sized projects from idea to beta.

 

 

"What's you next major business milestone?"

 

Hiring?  I can help, while you find the perfect employee.

michael@powma.com

What is MapReduce?

Letting other people do the work.


By example!

 

Get a card and a pen...

michael@powma.com

Distributed MapReduce

michael@powma.com

MapReduce by Example

1)   Write an animal (5-10 letters) on the top of card.  

 

 

 

 

 

 

 

This is our INPUT DATA SET.

 

 

 

Unicorn

 

 

 

<front>

michael@powma.com

What is Map?

2)   On the back, "map" a summary.

This is our MAP function, we transform a collection 1-1

 

 

 

 

 

 

 

MAP is a DATA TRANSFORMATION

First: u

Length: 7

Vowels: 3

 

 

<back>

What is Reduce?

   3)  Pass cards to the right.

If you're on the isle, you should have a pile of cards.

 

 

 

 

 

 

 

 

   4)  We are now a distrubuted system,

         processing a MapReduce.

 

 

 

unicorn

 

 

 

pony

 

 

 

cat

 

 

 

michael@powma.com

Who is MapReduce what?

DATA -> MAP -> SHUFFLE/SORT -> REDUCE

What is Reduce?

What is the longest word?

What is the last word alphabetically?

What word has the highest vowel/length ratio?

 

 

 

 

 

What do you want to know about this data?

First: u

Length: 7

Vowels: 3

JS: map(), filter(), and reduce()

Mmmmm.... Algorithms

michael@powma.com

Why not loops?

let foo = ['kitty', 'puppy', 'pony'];
let bar = [];

for(var i = 0; i < foo.length; i++) {
    if(foo.indexOf(foo[i]) === i) {
        bar.push(foo[i]);
    }
}

//  What is this code doing?
  • Loops are not self-documenting.
     

  • MapReduce can be a "distrubuted algorithm" and work across a cluster.

Small Data:

Map and Reduce can be used separately.

They make your code more expressive than for() loops

 

Big Data:

MapReduce on document huge sets instead of arrays

Let's do an example with MongoDB

JavaScript Map and Reduce

(browser and Node.js)

Map

Copy an array,
with processing.

Filter

Select items from an array, with processing.

Reduce

Aggregate or "reduce" an array/set of items

to a "total" value.

 

The total can be: a value, object, or arbitrary data structure.

michael@powma.com

Array.map()

Let's start with some stretches.

foo = ['kitty', 'puppy', 'pony'];

bar = foo.map( item => {
  return item.toUpperCase()
})
// ["KITTY", "PUPPY", "PONY"]

// ES6
baz = foo.map( item => item.toUpperCase() )


fahrenheit = [0, 32, 45, 50, 75, 80, 99, 120];
fahrenheit.map(elem => {
  return Math.round((elem - 32) * 5 / 9);
})
// [-18, 0, 7, 10, 24, 27, 37, 49]

// ES6
fahrenheit.map(elem => Math.round((elem - 32) * 5 / 9));

Array.filter()

A quick warm up

// map(), filter(), and reduce() 
// are methods of the Array prototype

['kitty', 'puppy', 'pony'].filter( item => {
  return item.length === 5;
})
// ["kitty", "puppy"]

['kitty', 'puppy', 'pony'].filter( item => item.length === 5)
// ["kitty", "puppy"]

michael@powma.com

Array.reduce()

The workout.

strings = ['kitty', 'puppy', 'pony']

// reduce() takes two arguments:
// A reducer function, and an initial value.

strings.reduce( 
  function reducer(total, item) { 
    // reduce passes same total instance to every item instance
    console.log(total, item)
    return total + item.length
  },
  0 // Initial value for total
)

// 0 "kitty"
// 5 "puppy"
// 10 "pony"
// 14

Array.reduce()

The workouts other 90%.

strings = ['kitty', 'puppy', 'pony']

// reduce() takes two arguments:
// A reducer function, and an initial value.

// Imagine the possibilities!  

strings.reduce( 
  (total, item) => { // reducer function
    total[item] = item.length; 
    return total;
  },
  {} // Initial value for total
)
// {kitty: 5, puppy: 5, pony: 4}

LoDash and map()

  • Map is ES5.  LoDash is a good alternative.
var users = [
  {name: 'barney', age: 36},
  {name: 'fred', age: 40}
];
// We want names = ['barney', 'fred']

// [].map() from ES5
ret = users.map(function(user) { return user.name; });
// ['barney', 'fred']

// LoDash + ES6
ret = _.map(users, user => user.name );
// ['barney', 'fred']

// LoDash sugar dot-string feature 
ret = _.map(users, 'name');
// ['barney', 'fred']

With these primitives,

we can process the work in parallel.  E.g. "cluster up"

  • map()

  • filter()

  • reduce()

michael@powma.com

MongoDB Map and Reduce

Wikipedia titles:

https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia 

 

 

 

#!/bin/bash

# Download Wikipedia Article titles - https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz
gzip -d enwiki-latest-all-titles-in-ns0.gz 

# Start mongo:    
mongod --dbpath . &

# Import the data
mongoimport -d wikipedia -c titles \
  --type tsv --headerline \
  --file enwiki-latest-all-titles-in-ns0

# Start the mongo console
mongo

MongoDB Map and Reduce

MongoDB does the work

We'll run this command from MongoDB's console.

Reads/Writes sharded collections.  

One expressive statement handles many details.

 

API changes:

Map function `emit(key,value)` instead of return value.

 

// JS Map: item is key, item.length is value
function(item) { return item.length; }; 

// MongoDB Map: this._id is key, this.data is doc
function() { emit(this._id, this.data.length); }; 
db.someCollection.mapReduce( map, reduce, { query, out });

MongoDB Map and Reduce

use wikipedia

// Describe it
mapFun = function() {
  if (this.page_title && this.page_title.replace) {
    var noPunctuation = this.page_title.replace(/[^\w]/g, "_");
    var words = noPunctuation.split("_");

    words.forEach(function(word) { 
      if(word) emit(word.toLowerCase(),1); 
    });
  }
};
reduceFun = function(someKey, someValues) {
  return Array.sum(someValues);
};  // Why is this Array.sum() instead of someValues.length?

// Do it.
db.titles.mapReduce(mapFun, reduceFun, { out: {replace:"wordCounts"} });

// Trim the result document set
db.wordCounts.remove({value: {$lt:50}});

MongoDB Map and Reduce

use wikipedia

// Mapper
mapFun = function() {
  if (this.page_title && this.page_title.replace) {
    // Convert title to alpha numberic by replacing punct with _
    var noPunctuation = this.page_title.replace(/[^\w]/g, "_");
    // Split title into words
    var words = noPunctuation.split("_");
	// For each word, emit a lower case version with a count of 1
    words.forEach(function(word) { 
      if(word) emit(word.toLowerCase(),1); 
    });
  }
};
// Reducer
reduceFun = function(someKey, someValues) {
  return Array.sum(someValues);
};  // Why is this Array.sum() instead of someValues.length?

// Entry point
db.titles.mapReduce(mapFun, reduceFun, { out: {replace:"wordCounts"} });

// Trim the result document set
db.wordCounts.remove({value: {$lt:50}});

Map, Filter, and Reduce in JavaScript

Thank you

Data Geek User Group Vancouver WA

michael@powma.com

Questions?

Made with Slides.com