Functional programming and distributed data

Why should you care?

  • Understandable
  • Parallelizable

What is functional programming?

  • A style of programming in which pure functions are the main unit of computation
  • Think jQuery vs React
  • Possible in most languages, but easier in some

What makes a function pure?

  • No side effects
    • e.g. AJAX requests, writing to a DB, printing, changing external state
  • Same input → same output

Is it pure?

const square = (x) => {
  return x * x;
};

Is it pure?

class Counter {
  constructor() {
    this.count = 0;
  }

  increment() {
    this.count += 1;
    return this.count;
  }
}
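No: increment both reads and writes this.count, so calling it twice with the same (empty) input gives different outputs. A pure version, sketched here, makes the count an explicit input and output:

```javascript
// A pure counter: the current count is an explicit input,
// and the incremented count is an explicit output. Nothing is mutated.
const increment = (count) => count + 1;

increment(0); // 1
increment(0); // 1 (same input, same output, every time)
```

The caller, not the function, decides what to do with the new value.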

Is it pure?

const randPlusOne = () => {
  return Math.random() + 1;
};

Is it pure?

const age = (birthday) => {
  return new Date() - birthday;
};
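No: new Date() is a hidden input, so the same birthday gives a different answer every millisecond. A pure variant, sketched here, takes the current time as an explicit parameter:

```javascript
// Pure version: "now" is an explicit input instead of a hidden
// read of the system clock.
const age = (now, birthday) => now - birthday;

const birthday = Date.parse('1990-01-01');
age(birthday + 1000, birthday); // 1000; fully determined by its inputs
```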

Is it pure?

const setText = (newText) => {
  $('#thing').text(newText);
};
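No: it writes to the DOM through jQuery. A pure alternative, sketched here, returns a description of the change and leaves the actual DOM write to an imperative shell (the same idea React applies to components):

```javascript
// Pure: returns a description of the desired change.
// Some outer, imperative layer applies it to the real DOM.
const setText = (newText) => ({ selector: '#thing', text: newText });

setText('hello'); // { selector: '#thing', text: 'hello' }
```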

Is it pure?

const addLengths = (str1, str2) => {
  return str1.length + str2.length;
};

Is it pure?

const addNameLengths = (person1, person2) => {
  return person1.name.length + 
    person2.name.length;
};
const w = { name: 'Will' };
const g = { name: 'Grace' };

addNameLengths(w, g); // 9
w.name = 'William';
addNameLengths(w, g); // 12

Purity requires immutability
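One way to enforce this in JavaScript, sketched here with the example above: freeze the inputs so mutation can no longer change a pure function's answer.

```javascript
const addNameLengths = (person1, person2) =>
  person1.name.length + person2.name.length;

// Frozen inputs can't drift out from under the function:
const w = Object.freeze({ name: 'Will' });
const g = Object.freeze({ name: 'Grace' });

addNameLengths(w, g); // 9
// w.name = 'William' would now throw in strict mode
// (and is silently ignored otherwise), so:
addNameLengths(w, g); // still 9
```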

Is it pure?

const range = (n) => {
  const result = [];
  for (let i = 0; i < n; i++) {
    result.push(i);
  }
  return result;
};
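Yes, surprisingly: result is mutated, but the mutation is local and never observable by callers, so the function is still pure. For reference, a sketch with no mutation at all:

```javascript
// Same behavior, no mutation anywhere:
const range = (n) => Array.from({ length: n }, (_, i) => i);

range(4); // [0, 1, 2, 3]
```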

How can we do anything useful?

  • Functional core, imperative shell
  • Redux
    • Model state changes as pure reducers
    • Your code never mutates state
  • React
    • Model UI as pure components
    • Your code never mutates DOM
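The Redux bullet can be sketched with a hypothetical counter reducer (just the shape of the idea, not the real Redux API):

```javascript
// A reducer is a pure function: (state, action) => newState.
// It never mutates state; it returns a fresh value.
const counterReducer = (state, action) => {
  switch (action.type) {
    case 'INCREMENT':
      return { count: state.count + 1 };
    case 'DECREMENT':
      return { count: state.count - 1 };
    default:
      return state;
  }
};

const s0 = { count: 0 };
const s1 = counterReducer(s0, { type: 'INCREMENT' });

s1.count; // 1
s0.count; // still 0; the old state is untouched
```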

Understandable

  • Impure functions have hidden inputs and outputs
    • hidden inputs: mutable dependencies
    • hidden outputs: side effects
  • Impure functions are often coupled in invisible ways
  • Pure functions require all inputs/outputs to be explicit
  • Calling a pure function can never break other code
  • Values that change over time are difficult to keep track of
// To predict x, you must also know what impureThing reads and writes:
const x = impureThing(a, b);
// To predict x, you only need a and b:
const x = pureThing(a, b);

const makeTiramisu = (
  eggs, sugar1, wine, cheese, cream, 
  fingers, espresso, sugar2, cocoa
) => {
  dissolve(sugar2, espresso);
  const mixture = whisk(eggs);
  beat(mixture, sugar1, wine);
  whisk(mixture);
  whip(cream);
  beat(cheese);
  beat(mixture, cheese);
  fold(mixture, cream);
  assemble(mixture, fingers);
  sift(mixture, cocoa);
  refrigerate(mixture);
  return mixture;
};

Example: tiramisu recipe

const makeTiramisu = (
  eggs, sugar1, wine, cheese, cream, 
  fingers, espresso, sugar2, cocoa
) => {
  const beatEggs = beat(eggs);
  const mixture = beat(beatEggs, sugar1, wine);
  const whisked = whisk(mixture);
  const beatCheese = beat(cheese);
  const cheeseMixture = beat(whisked, beatCheese);
  const whippedCream = whip(cream);
  const foldedMixture = fold(cheeseMixture, whippedCream);
  const sweetEspresso = dissolve(sugar2, espresso);
  const wetFingers = soak2seconds(fingers, sweetEspresso);
  const assembled = assemble(foldedMixture, wetFingers);
  const complete = sift(assembled, cocoa);
  const readyTiramisu = refrigerate(complete);
  return readyTiramisu;
};

Example: tiramisu recipe

Parallelizable

  • Can't parallelize if we don't understand dependencies between steps
  • Mutable values make parallelization nearly impossible
let count = 5;

// Two workers running increment at the same time can both read 5
// and both write 6, losing an update. With mutation, order matters.
const increment = () => {
  count = count + 1;
};
// The loop fixes an order: each iteration mutates result in sequence.
const doubles = (arr) => {
  const result = [];
  for (let i = 0; i < arr.length; i++) {
    result.push(arr[i] * 2);
  }
  return result;
};

// With map, each element is computed independently of the others.
const doubles = (arr) => {
  return arr.map(x => x * 2);
};
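Because map has no dependencies between iterations, the work can be split freely. A sketch of the idea, with chunks standing in for workers or machines:

```javascript
const doublesChunk = (chunk) => chunk.map(x => x * 2);

// Each chunk could be doubled on a different worker; completion order
// doesn't matter because no chunk reads another chunk's data.
const chunks = [[1, 2], [3, 4], [5, 6]];
const doubled = chunks.flatMap(doublesChunk);

doubled; // [2, 4, 6, 8, 10, 12]
```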

Airbnb

  • Lots of transactions
  • Complex db schema
  • Incomprehensible to accountants

Apache Spark

  • "fast and general engine for large-scale data processing"
  • Supports Python, Java, Scala
  • Resilient Distributed Dataset (RDD)
// (Event, Rule) => Array[Entry]
const execute = (event, rule) => { ... };

// (Event, Rule) => Boolean
const applies = (event, rule) => { ... };

// (SparkContext) => RDD[Event]
const loadEvents = (sc) => { ... };

// (SparkContext) => RDD[Rule]
const loadRules = (sc) => { ... };

// (SparkContext, RDD[Entry]) => undefined
const saveEntries = (sc, entries) => { ... };
class Event { ... }
class Rule { ... }
class Entry { ... }
const sc = new SparkContext();
const events = loadEvents(sc);
const rules = loadRules(sc);
const entries = run(events, rules);
saveEntries(sc, entries);
// (RDD[Event], RDD[Rule]) => RDD[Entry]
const run = (events, rules) => {
  let result = [];

  events.forEach(event => {
    rules.forEach(rule => {
      if (applies(event, rule)) {
        const entries = execute(event, rule);
        result = result.concat(entries);
      }
    });
  });

  return result;
};
const makePair = (n) => [n, n];

[1, 2].map(makePair); // [[1, 1], [2, 2]]
[1, 2].flatMap(makePair); // [1, 1, 2, 2]
// (RDD[Event], RDD[Rule]) => RDD[Entry]
const run = (events, rules) => (
  rules.flatMap(rule => (
    events
      .filter(event => applies(event, rule))
      .flatMap(event => execute(event, rule))
  ))
);

Why doesn't everybody do this?

  • Historically, memory was too scarce to afford copying instead of mutating
  • Parallelism has only recently become a mainstream need
  • Imperative style is entrenched in education and language design
  • Doesn't always match our real-world intuition of objects changing over time
  • But things are changing!

What next?

Appendix: Performance

"If you want fast, start with comprehensible"

- Paul Phillips

Lazy evaluation

// Eager: filter builds [1, 3, 5], map squares all of them,
// and then we keep only index 1. The rest was wasted work.
[1, 2, 3, 4, 5]
  .filter(x => x % 2 !== 0)
  .map(x => x * x)
  [1]; // 9
// Lazy: Seq does no work until .get(1), and then only enough
// to produce the element at index 1.
import { Seq } from 'immutable';

Seq([1, 2, 3, 4, 5])
  .filter(x => x % 2 !== 0)
  .map(x => x * x)
  .get(1); // 9
// Laziness even makes infinite sequences practical:
import { Range } from 'immutable';

Range(1, Infinity)
  .filter(x => x % 2 !== 0)
  .map(x => x * x)
  .get(1); // 9

Memoization

import { memoize } from 'lodash';

const memMakePair = memoize(makePair);

memMakePair(1); // [1, 1]
memMakePair(1); // use cached value
const onePair = memMakePair(1);
onePair.push(2); // mutates the array sitting in the cache!
memMakePair(1); // [1, 1, 2]; the cached value has been corrupted
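One defense, sketched here (this is not lodash's API): freeze each result before caching it, so mutation fails loudly instead of silently corrupting the cache.

```javascript
// Hypothetical memoizer that freezes each result before caching it.
const memoizeFrozen = (fn) => {
  const cache = new Map();
  return (arg) => {
    if (!cache.has(arg)) {
      cache.set(arg, Object.freeze(fn(arg)));
    }
    return cache.get(arg);
  };
};

const makePair = (n) => [n, n];
const memMakePair = memoizeFrozen(makePair);

memMakePair(1); // [1, 1]
// memMakePair(1).push(2) now throws (push on a frozen array),
// so the cache stays intact:
memMakePair(1); // still [1, 1]
```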

Questions?

a/A functional programming

By Phil Nachum
