Data-intensive applications

Svetlana Filimonova, 20 August 2015

What is it?

“data is a primary challenge: the quantity of data, the complexity of data, or the speed at which it is changing”

Martin Kleppmann, “Designing Data-Intensive Applications”

result = dataSource.map(element => 
    computeMagic(element)
)

Take a bigger machine?

Distribute computation

  • cluster of machines
  • parallel processing

Stream data

  • single machine
  • takes time to process

or read in chunks
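A minimal sketch of the single-machine option: read the source in chunks so only one chunk sits in memory at a time. The names read_in_chunks and process_stream are made up for illustration, and compute_magic is a hypothetical stand-in for computeMagic (here it just squares a number).

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_in_chunks(source: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most chunk_size elements from source."""
    iterator = iter(source)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # source exhausted
        yield chunk


def compute_magic(element: int) -> int:
    """Hypothetical stand-in for computeMagic: squares its input."""
    return element * element


def process_stream(source: Iterable[int], chunk_size: int = 1000) -> int:
    """Sum compute_magic over the source, holding one chunk in memory at a time."""
    total = 0
    for chunk in read_in_chunks(source, chunk_size):
        total += sum(compute_magic(element) for element in chunk)
    return total
```

Memory use stays bounded by the chunk size, but the work is still sequential, so the total run time grows with the size of the data.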

  • Single source of data
  • Split into chunks
  • Distribute to the machines
  • Collect the result somehow
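The four steps above can be sketched on one machine, with worker threads standing in for the machines of a cluster. This is an illustration under assumptions, not how any particular framework does it: split_into_chunks, process_chunk, and distributed_map are made-up names, and compute_magic is a hypothetical stand-in for computeMagic.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def compute_magic(element: int) -> int:
    """Hypothetical stand-in for computeMagic: squares its input."""
    return element * element


def split_into_chunks(data: Sequence[int], n_chunks: int) -> List[Sequence[int]]:
    """Split data into at most n_chunks contiguous slices of near-equal size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_chunk(chunk: Sequence[int]) -> List[int]:
    """Work done by one 'machine': apply compute_magic to every element."""
    return [compute_magic(element) for element in chunk]


def distributed_map(data: Sequence[int], n_workers: int = 4) -> List[int]:
    """Split, distribute to workers, and collect the results in input order."""
    chunks = split_into_chunks(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map returns results in the order the chunks were submitted.
        partial_results = pool.map(process_chunk, chunks)
        return [value for part in partial_results for value in part]
```

On a real cluster the "collect the result somehow" step also has to cope with network transfer and failed workers, which this sketch ignores.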

http://bit.ly/1NvFPDy

$YOUR_SPARK_PATH/bin/pyspark

or

$YOUR_SPARK_PATH/bin/spark-shell
