Héctor F. Jiménez Saldarriaga
hfjimenez@utp.edu.co
@c1b3rh4ck @h3ct0rjs
*CyberThreat Taxonomy tree, Taken from [3]
We try to discover explicit or latent characteristics hidden in the data. These characteristics can then be used to teach an algorithm to recognize other data that exhibits the same traits.
Instead of learning specific patterns that exist within certain subsets of the data, the goal is to establish a notion of normality that describes most (say, more than 95%) of a given dataset
Spamming is the use of electronic messaging systems to send unsolicited messages (spam), especially advertising, as well as to send the same message repeatedly on the same site.
instant messaging spam, Usenet newsgroup spam, Web search engine spam, spam in blogs, wiki spam, online classified ads spam, mobile phone messaging spam, Internet forum spam, junk fax transmissions, social spam, spam mobile apps
Bayes' Theorem
Knowing the probability of having a headache given that one has the flu, we can compute the probability of having the flu given that one has a headache.
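The headache/flu example can be worked out numerically. The probabilities below are made-up illustrative values, not data from the talk:

```python
# Bayes' theorem: P(flu | headache) = P(headache | flu) * P(flu) / P(headache)
p_headache_given_flu = 0.9   # assumed likelihood
p_flu = 0.05                 # assumed prior
p_headache = 0.2             # assumed evidence

p_flu_given_headache = p_headache_given_flu * p_flu / p_headache
print(p_flu_given_headache)  # 0.225
```

The same inversion is what a naive Bayes spam filter does: it estimates P(spam | words) from P(words | spam) and the prior spam rate.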
We shall use 75% of the dataset for training and the remaining 25% for testing. The 75% is selected uniformly at random.
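A minimal sketch of that split, using a shuffle with a fixed seed so the "uniformly random" selection is reproducible (the 100-element list stands in for labelled messages):

```python
import random

def train_test_split(data, train_frac=0.75, seed=42):
    """Shuffle the data uniformly at random and split it."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

messages = list(range(100))     # stand-in for 100 labelled messages
train, test = train_test_split(messages)
print(len(train), len(test))    # 75 25
```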
Bag of words and TF-IDF
John likes to watch movies. Mary likes movies too
"John","likes","to","watch","movies","Mary","likes","movies","too"
BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
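The BoW1 dictionary above can be built directly with a word counter. A minimal sketch:

```python
from collections import Counter

text = "John likes to watch movies. Mary likes movies too"
tokens = text.replace(".", "").split()   # naive tokenizer: strip periods, split on spaces
bow = Counter(tokens)
print(dict(bow))
# {'John': 1, 'likes': 2, 'to': 1, 'watch': 1, 'movies': 2, 'Mary': 1, 'too': 1}
```

Word order is discarded; only the counts survive, which is exactly what "bag of words" means.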
short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus
The weight of a term that occurs in a document is simply proportional to the term frequency.
The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs
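The two ideas combine into a single weight, tf × idf. A sketch using the common log-ratio form of idf, with a made-up three-document corpus (a real system would use e.g. scikit-learn's TfidfVectorizer, which applies extra smoothing):

```python
import math
from collections import Counter

def tf_idf(term, doc, docs):
    tf = Counter(doc)[term]                    # occurrences of the term in this document
    df = sum(1 for d in docs if term in d)     # number of documents containing the term
    idf = math.log(len(docs) / df)             # rarer across the corpus -> larger weight
    return tf * idf

docs = [
    "free offer click now".split(),
    "meeting agenda for monday".split(),
    "free tickets click here".split(),
]
# "free" appears in 2 of 3 documents, "agenda" in only 1 of 3,
# so "agenda" gets the higher (more specific) weight:
print(tf_idf("free", docs[0], docs))
print(tf_idf("agenda", docs[1], docs))
```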
"Free", "FREE", and "FrEE" all carry the same meaning... let's normalize that data.
Remove the stop words. Stop words are words that occur extremely frequently in any text, e.g. ‘the’, ‘a’, ‘an’, ‘is’, ‘to’.
Stem the words
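The three preprocessing steps above can be sketched in one small pipeline. The stop-word list is a tiny illustrative subset, and the "stemmer" is a crude suffix stripper; a real pipeline would use something like NLTK's PorterStemmer:

```python
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of"}  # illustrative subset

def crude_stem(word):
    # Strip a few common suffixes; only a rough stand-in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()                    # "Free"/"FREE"/"FrEE" -> "free"
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]

print(preprocess("FREE tickets to the matching games"))
# ['free', 'ticket', 'match', 'game']
```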