Brad

Brent
Chandler

This project is based on the Microsoft Malware Classification Challenge, hosted on Kaggle in spring 2015.

 

The Problem

Classify nine kinds of malware

Background

Unigram byte counts alone were a highly predictive feature

 

Word frequency, with each hexadecimal byte value treated as a word:

     56, 00, ff

 

Enough to achieve 96% accuracy
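
As a rough illustration of the unigram feature, the sketch below counts byte tokens on one line of a hex dump and merges per-line counts into per-file counts. It assumes each line starts with an address token followed by whitespace-separated hex bytes; the object name, the helper names and the sample address are hypothetical and do not come from the project.

```scala
object UnigramSketch {
  // Hypothetical helper: count byte tokens on one line of a hex dump.
  // Assumption: the first whitespace-separated token is an address and is skipped.
  def countBytes(line: String): Map[String, Int] = {
    val tokens = line.trim.split("\\s+").drop(1).map(_.toLowerCase)
    tokens.foldLeft(Map.empty[String, Int]) { (acc, b) =>
      acc + (b -> (acc.getOrElse(b, 0) + 1))
    }
  }

  // Merge two count maps, summing the counts of tokens that appear in both.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (k, v)) => acc + (k -> (acc.getOrElse(k, 0) + v)) }
}
```

Calling UnigramSketch.countBytes on every line of a file and folding the results together with merge gives the word frequencies for that file.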

Warning

Any classifier that achieves close to 100% accuracy could be overfitting.
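
One common way to check for this is to compare accuracy on the training data with accuracy on held-out data. The sketch below is a minimal version of that check, not something taken from the project: the object and function names are hypothetical, and the 0.05 gap threshold is an arbitrary assumption.

```scala
object HoldoutSketch {
  // Fraction of predictions that match the true labels.
  def accuracy(predicted: Seq[Int], actual: Seq[Int]): Double = {
    require(predicted.length == actual.length && actual.nonEmpty)
    predicted.zip(actual).count { case (p, a) => p == a }.toDouble / actual.length
  }

  // Hypothetical rule of thumb: flag a model whose training accuracy exceeds
  // its held-out accuracy by more than maxGap (threshold is an assumption).
  def looksOverfit(trainAcc: Double, heldOutAcc: Double, maxGap: Double = 0.05): Boolean =
    trainAcc - heldOutAcc > maxGap
}
```

A large gap between the two numbers suggests the model has memorized the training set rather than learned something that generalizes.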

Random Forest

Performance

Accuracy of 97.8%

all in memory
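
A tree-based classifier needs a fixed-length feature vector per file. The slides do not say how the counts were encoded, so the sketch below is only one plausible encoding: a 256-slot vector of relative byte frequencies. The object name, the lowercase vocabulary and the normalization step are all assumptions.

```scala
object FeatureVectorSketch {
  // Fixed vocabulary: the 256 two-digit lowercase hex byte values, "00" to "ff".
  val vocabulary: IndexedSeq[String] = (0 until 256).map(i => f"$i%02x")

  // Turn a count map into a dense vector of relative frequencies, one slot per
  // vocabulary entry. Tokens outside the vocabulary are ignored (an assumption).
  def toFeatures(counts: Map[String, Double]): Array[Double] = {
    val raw = vocabulary.map(b => counts.getOrElse(b, 0.0))
    val total = raw.sum
    if (total == 0.0) raw.toArray else raw.map(_ / total).toArray
  }
}
```

Because every file maps to the same 256 positions, the vectors can be held in memory and compared directly across files.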

Performance

Two hours to train and classify the entire Kaggle set, using two eight-core worker nodes and one four-core master.

Code

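import org.apache.spark.rdd.RDD
import scala.util.control.NonFatal

// Count how often each byte token ("word") occurs in each document.
// Input: (document id, tokens). Output: (document id, token -> count).
def byteCount(data: RDD[(String, Array[String])]): RDD[(String, Map[String, Double])] = {
  try {
    // map is lazy: this only defines the transformation, so failures inside
    // the closure surface later, when an action runs, not in this try block.
    data.map { case (id, words) =>
      val counts = words.foldLeft(Map.empty[String, Double]) { (acc, word) =>
        acc + (word -> (acc.getOrElse(word, 0.0) + 1.0))
      }
      (id, counts)
    }
  } catch {
    // Match non-fatal errors only; a bare `case _` would also swallow fatal ones.
    case NonFatal(e) =>
      println(s"Error at byteCount: ${e.getMessage}")
      Driver.sc.stop() // Driver is the project's own object, presumably holding the SparkContext
      null
  }
}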