This project is based on the Microsoft Malware challenge hosted on Kaggle in the spring of 2015.
The task is to classify 9 kinds of malware.
Uni-gram byte code alone was a highly predictive feature
Word frequency, where each word is a byte token, e.g.:
56, 00, ff
Enough to achieve 96% accuracy
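To make the "word" idea concrete, here is a minimal sketch of turning one line of a hex dump into unigram byte tokens. The line layout (an address column followed by space-separated hex bytes) and the names `ByteTokens` and `tokens` are assumptions made for illustration; they are not taken from the project code.

```scala
object ByteTokens {
  // Assumed layout: "<address> <byte> <byte> ...", e.g. "00401000 56 00 ff 00".
  // Drops the leading address column and lower-cases the remaining tokens.
  def tokens(line: String): Array[String] =
    line.trim.split("\\s+").drop(1).map(_.toLowerCase)

  def main(args: Array[String]): Unit = {
    // Hypothetical input line, invented for this example.
    val words = tokens("00401000 56 00 FF 00")
    println(words.mkString(","))
  }
}
```

Each token produced this way is one "word"; counting the words per file gives the word-frequency feature described above.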
Any classifier that achieves close to 100% could be overfitting.
accuracy at 97.8%
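One common way to check for overfitting is to compare accuracy on the training data with accuracy on held-out data. The sketch below is only an illustration of that comparison: the object name, the `accuracy` helper, and the toy labels are all hypothetical and do not come from the project.

```scala
object AccuracyCheck {
  // Fraction of predictions that match the true labels.
  def accuracy(predicted: Seq[Int], actual: Seq[Int]): Double = {
    require(predicted.length == actual.length && actual.nonEmpty,
      "predicted and actual must be non-empty and the same length")
    predicted.zip(actual).count { case (p, a) => p == a }.toDouble / actual.length
  }

  def main(args: Array[String]): Unit = {
    // Toy class labels (1 to 9), invented for illustration only.
    val trainAcc   = accuracy(Seq(1, 2, 3, 4), Seq(1, 2, 3, 4))
    val heldOutAcc = accuracy(Seq(1, 2, 3, 9), Seq(1, 2, 3, 4))
    // A large gap between the two values is a warning sign of overfitting.
    println(f"train=$trainAcc%.3f heldOut=$heldOutAcc%.3f gap=${trainAcc - heldOutAcc}%.3f")
  }
}
```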
All in memory.
Two hours to train and classify the entire Kaggle set, using two eight-core nodes and one four-core master.
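import org.apache.spark.rdd.RDD
import scala.util.control.NonFatal

// Count how often each byte token (word) occurs in every document.
// Input: (document id, tokens). Output: (document id, token -> count).
def byteCount(data: RDD[(String, Array[String])]): RDD[(String, Map[String, Double])] = {
  try {
    data.map { doc =>
      (
        doc._1,
        // Fold the tokens into a map, incrementing the count of each word seen.
        doc._2.foldLeft(Map.empty[String, Double]) {
          (acc: Map[String, Double], word: String) =>
            acc + (word -> (acc.getOrElse(word, 0.0) + 1.0))
        }
      )
    }
  } catch {
    // map is lazy, so this only catches errors raised while building the
    // transformation; failures inside the closure surface when an action runs.
    case NonFatal(e) =>
      println(s"Error at byteCount: ${e.getMessage}")
      Driver.sc.stop()
      null
  }
}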
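
The fold inside byteCount can be tried without a cluster. The following is a rough local counterpart that runs the same fold over an in-memory Seq instead of an RDD; `ByteCountLocal`, `byteCountLocal`, and the document ids are hypothetical names used only for this sketch.

```scala
object ByteCountLocal {
  // Same per-document fold as byteCount, but over a plain Seq (no Spark needed).
  def byteCountLocal(data: Seq[(String, Array[String])]): Seq[(String, Map[String, Double])] =
    data.map { case (id, words) =>
      val counts = words.foldLeft(Map.empty[String, Double]) { (acc, word) =>
        acc + (word -> (acc.getOrElse(word, 0.0) + 1.0))
      }
      (id, counts)
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical documents: an id plus its byte tokens.
    val docs = Seq(
      ("docA", Array("56", "00", "ff", "00")),
      ("docB", Array("ff", "ff"))
    )
    byteCountLocal(docs).foreach { case (id, counts) => println(s"$id -> $counts") }
  }
}
```

Counts are kept as Double, as in byteCount, presumably so they can be used directly as numeric feature values.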