IBM BigData
@RomeoKienzler
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1502091/Twitter_logo_white.png)
Preface
Most of the technologies mentioned in this presentation are available at no cost in the IBM Cloud Free Tier. Have a look at http://ibm.biz/joinIBMCloud
State of the Art
- SQL (42%)
- R (33%)
- Python (26%)
- Excel (25%)
- Java, Ruby, C++ (17%)
- SPSS, SAS (9%)
Limits
- Main Memory
- CPU ↔ Main Memory Bandwidth
- CPU
- Storage ↔ Main Memory Bandwidth (either single node or SAN)
Hadoop
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1587530/Screen_Shot_2015-07-21_at_09.35.07.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1587531/Screen_Shot_2015-07-21_at_09.35.46.png)
Hadoop
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1588228/cartoon-data-scientist-salary-negotiation.gif)
Why is Hadoop so fast?
Time to read 1 TB from Disk
- 1 disk - 3.4h
- 10 disks - 20m
- 100 disks - 2m
- 1000 disks - 12s
Time to read 1 TB from Main Memory
- 1 node - 100s
- 10 nodes - 10s
- 100 nodes - 1s
- 1000 nodes - 100ms
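The figures above follow from dividing the data size by the aggregate throughput of all disks or nodes. A minimal sketch of that arithmetic, assuming roughly 82 MB/s sequential read per disk and 10 GB/s memory bandwidth per node (both values are back-calculated from the single-disk and single-node rows, they are not stated in the slides) and ideal linear scaling with no coordination overhead. The function names are hypothetical.

```python
TB = 10**12  # 1 TB in bytes (decimal)

# Assumed per-unit throughputs, back-calculated from the first row of each list.
DISK_BYTES_PER_S = TB / (3.4 * 3600)   # about 82 MB/s per disk
MEMORY_BYTES_PER_S = TB / 100          # 10 GB/s per node


def read_time_seconds(size_bytes, units, bytes_per_s_per_unit):
    """Ideal time to read size_bytes spread evenly over `units` devices."""
    return size_bytes / (units * bytes_per_s_per_unit)


def human(seconds):
    """Format a duration roughly the way the slide does."""
    if seconds >= 3600:
        return f"{seconds / 3600:.1f}h"
    if seconds >= 120:          # switch to minutes from two minutes on
        return f"{seconds / 60:.0f}m"
    if seconds >= 1:
        return f"{seconds:.0f}s"
    return f"{seconds * 1000:.0f}ms"


for n in (1, 10, 100, 1000):
    disk = read_time_seconds(TB, n, DISK_BYTES_PER_S)
    mem = read_time_seconds(TB, n, MEMORY_BYTES_PER_S)
    print(f"{n:>4} units: disk {human(disk)}, memory {human(mem)}")
```

Real clusters will not scale perfectly linearly, so these numbers are best read as lower bounds on read time.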
Data Parallelism
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1587556/Untitled_copy.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1587565/Untitled_2.png)
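Data parallelism in a minimal sketch: split the data into partitions, run the same function on every partition independently, then combine the partial results. This is only an illustration of the principle and not how Hadoop is implemented. Local threads stand in for cluster nodes, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut data into `parts` contiguous chunks of near-equal size."""
    size, rest = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < rest else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_sum(chunk):
    """Work done independently on one partition (the 'map' side)."""
    return sum(chunk)


def parallel_sum(data, workers=4):
    """Sum data by handing each partition to its own worker, then combining."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)  # the 'reduce' side


numbers = list(range(1_000_000))
print(parallel_sum(numbers) == sum(numbers))
```

Because no partition depends on another, adding workers (or, on a cluster, nodes and disks) shortens the time spent on each partition.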
Why use so much data?
The Unreasonable Effectiveness of Data¹: "sometimes it's not who has the best algorithm that wins; it's who has the most data."
¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1587551/Untitled.png)
How to store so much data?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1587568/Untitled_2.png)
"Imagine a Filesystem with unlimited capacity, scalability and fault tolerance"
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1587572/Untitled_2.png)
BigR
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1588202/Screen_Shot_2015-07-21_at_16.49.58.png)
BigR
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1588214/Screen_Shot_2015-07-21_at_16.51.14.png)
Spark
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1588225/spark-devs1.png)
Spark
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1588224/spark-stack.png)
Life Science
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1588896/Biology-icon.png)
Hadoop - Genomics
Crossbow
- one of the first genomics tools on Hadoop
- based on Bowtie + SOAPsnp
ADAM
- genomics analysis platform
- runs on top of Spark
Hadoop-BAM
- built on top of the Picard SAM JDK
Hadoop - Genomics
...some more examples...
- Contrail
- PeakRanger
- Quake
- BlastReduce
- CloudBLAST
- MrsRF
Downstream Analytics
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1588979/Screen_Shot_2015-07-21_at_22.07.33.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1588981/Screen_Shot_2015-07-21_at_22.07.17.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1588992/Screen_Shot_2015-07-21_at_22.09.54.png)
...downstream analytics...
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1589003/visualization-designer3.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1589006/visualization-designer2.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1589009/visualization-designer4.jpg)
Image/Video Processing
![](https://s3.amazonaws.com/media-p.slid.es/uploads/167657/images/1589079/1269877648.png)
On Hadoop
Fiji/ImageJ
- 3D Image Processing library
- also runs on Hadoop / Spark
OpenCV
- Video Processing library
- also runs on Hadoop / Spark or IBM InfoSphere Streams
IBM BigData
By Romeo Kienzler
A short overview of IBM BigData