Topic Modelling (or not?)
for humans
pip install -U gensim
conda install -c conda-forge gensim
Vector + Similarity Is All You Need
Input: text
XXX.YY.ZZ.AA - - [21/Oct/2018:06:59:38 +0200] "GET / HTTP/1.1" 301 194 "-" "Mozilla/5.0 zgrab/0.x"
XXX.YY.ZZ.AA - - [21/Oct/2018:07:11:53 +0200] "GET / HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:47.0) Gecko/20100101 Firefox/47.0"
XXX.YY.ZZ.AA - - [21/Oct/2018:07:47:40 +0200] "t3 12.2.1" 400 182 "-" "-"
XXX.YY.ZZ.AA - - [21/Oct/2018:07:59:49 +0200] "GET / HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
XXX.YY.ZZ.AA - - [21/Oct/2018:08:28:19 +0200] "GET / HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
XXX.YY.ZZ.AA - - [21/Oct/2018:08:41:56 +0200] "GET / HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
XXX.YY.ZZ.AA - - [21/Oct/2018:08:41:56 +0200] "GET / HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
XXX.YY.ZZ.AA - - [21/Oct/2018:08:53:02 +0200] "HEAD /wp-config.php HTTP/1.1" 301 0 "-" "-"
XXX.YY.ZZ.AA - - [21/Oct/2018:09:16:00 +0200] "GET / HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:47.0) Gecko/20100101 Firefox/47.0"
XXX.YY.ZZ.AA - - [21/Oct/2018:10:11:33 +0200] "GET / HTTP/1.1" 301 194 "-" "Mozilla/5.0 zgrab/0.x"
XXX.YY.ZZ.AA - - [21/Oct/2018:11:26:50 +0200] "GET /images.php HTTP/1.1" 301 194 "-" "Mozilla/5.0 zgrab/0.x"
XXX.YY.ZZ.AA - - [21/Oct/2018:11:37:09 +0200] "GET /console HTTP/1.1" 301 194 "-" "python-requests/2.19.1"
XXX.YY.ZZ.AA - - [21/Oct/2018:12:46:42 +0200] "GET /wordpress/ HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"
XXX.YY.ZZ.AA - - [21/Oct/2018:14:07:34 +0200] "GET / HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
XXX.YY.ZZ.AA - - [21/Oct/2018:14:19:41 +0200] "GET / HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
XXX.YY.ZZ.AA - - [21/Oct/2018:14:40:01 +0200] "GET /manager/html HTTP/1.1" 301 194 "-" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)"
XXX.YY.ZZ.AA - - [21/Oct/2018:15:12:56 +0200] "GET / HTTP/1.1" 301 194 "-" "Mozilla/5.0 zgrab/0.x"
XXX.YY.ZZ.AA - - [21/Oct/2018:16:47:36 +0200] "GET / HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7 (KHTML, like Gecko) Version/9.1.2 Safari/601.7.7"
Text - sequence of tokens
Token ~ "word" (???)
Token ~ "discrete value"
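Even the access-log lines above fit this view once you pick a tokenizer. A minimal sketch, assuming one illustrative rule (whitespace split, then stripping quote/bracket punctuation) that is not part of the talk:

```python
line = ('XXX.YY.ZZ.AA - - [21/Oct/2018:06:59:38 +0200] '
        '"GET / HTTP/1.1" 301 194 "-" "Mozilla/5.0 zgrab/0.x"')

# Split on whitespace, then strip quote/bracket punctuation:
# even a log line becomes a sequence of discrete tokens
tokens = [t.strip('"[]') for t in line.split()]
print(tokens)
```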
Tasks?
bag of words
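The bag-of-words idea in a few lines of plain Python (the two documents are assumed, for illustration only): each document collapses into sparse `(token_id, count)` pairs, and word order is discarded.

```python
from collections import Counter

# Two toy documents, assumed for illustration
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

# Vocabulary: every distinct token gets an integer id
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

# A document becomes sparse (token_id, count) pairs; word order is gone
def bow(doc):
    return sorted((vocab[t], c) for t, c in Counter(doc).items())

print(bow(docs[0]))
```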
Main idea - matrix factorization
What are the advantages?
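The factorization idea itself can be sketched with plain NumPy: a truncated SVD of a toy term-document matrix (the counts below are made up) gives the best rank-k approximation, and the k latent dimensions play the role of topics.

```python
import numpy as np

# Toy term-document matrix (made-up counts): rows = terms, columns = documents
A = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
    [1.0, 0.0, 2.0],
])

# Truncated SVD keeps k latent dimensions ("topics")
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: A_k is the best rank-k approximation in Frobenius norm
err = np.linalg.norm(A - A_k)
print(round(float(err), 3))
```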
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import LsiModel
# 1. Load data
data = api.load("text8")
# 2. Create dictionary
dct = Dictionary(data)
dct.filter_extremes(no_below=7, no_above=0.2)
# 3. Convert data to bag-of-words format
corpus = [dct.doc2bow(doc) for doc in data]
# 4. Fit model
model = LsiModel(corpus, id2word=dct, num_topics=300)
1 die to choose a topic
N dice (one per topic) to choose a word from the chosen topic
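That generative story, simulated with NumPy dice (the topic/word probabilities below are toy values, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy parameters: 2 topics over a 4-word vocabulary
topic_word = np.array([
    [0.70, 0.20, 0.05, 0.05],  # topic 0 favours words 0-1
    [0.05, 0.05, 0.20, 0.70],  # topic 1 favours words 2-3
])
doc_topic = np.array([0.9, 0.1])   # this document's topic die

def generate(n_words):
    words = []
    for _ in range(n_words):
        z = rng.choice(2, p=doc_topic)       # roll the topic die
        w = rng.choice(4, p=topic_word[z])   # roll that topic's word die
        words.append(int(w))
    return words

print(generate(10))
```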
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.parsing import preprocess_string
from gensim.models import LdaModel
# 1. Load data
data = api.load("20-newsgroups")
# 2. Tokenize data
data = [preprocess_string(doc["data"]) for doc in data]
# 3. Create dictionary
dct = Dictionary(data)
dct.filter_extremes(no_below=5, no_above=0.15)
# 4. Convert data to bag-of-words format
corpus = [dct.doc2bow(doc) for doc in data]
# 5. Fit model
model = LdaModel(corpus, id2word=dct, num_topics=20, passes=10)
for topic_id, topic_repr in model.show_topics(3, num_words=5):
    print("#{}: {}".format(topic_id, topic_repr))
#5: 0.012*"armenian" + 0.009*"kill" + 0.009*"israel" + 0.008*"fbi" + 0.007*"war"
#0: 0.016*"govern" + 0.013*"law" + 0.010*"presid" + 0.010*"gun" + 0.006*"nation"
#9: 0.023*"space" + 0.011*"car" + 0.011*"mission" + 0.010*"earth" + 0.009*"shuttl"
although I know a hack (:
Embedding matrix: M words × features
import gensim.downloader as api
model = api.load("word2vec-google-news-300")
model.most_similar(
    positive=["king", "woman"],
    negative=["man"],
    topn=1,
)
# [(u'queen', 0.7118192911148071)]
model.most_similar("cat")
# [(u'cats', 0.8099379539489746),
# (u'dog', 0.7609456777572632),
# (u'kitten', 0.7464985251426697),
# (u'feline', 0.7326233983039856),
# (u'beagle', 0.7150583267211914),
# (u'puppy', 0.7075453996658325),
# (u'pup', 0.6934291124343872),
# (u'pet', 0.6891531348228455),
# (u'felines', 0.6755931377410889),
# (u'chihuahua', 0.6709762215614319)]
2. Start from a pre-trained model
3. Implement end-to-end evaluation
You need something similar, but for your specific task
Compare them all, choose the best!
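The comparison loop is the easy part once the end-to-end metric exists. A sketch with hypothetical scores; `evaluate` is a stand-in for running your own pipeline:

```python
# evaluate() is a stand-in: in practice it runs your full pipeline
# with the given model and returns your end-to-end metric
def evaluate(model_name):
    scores = {"lsi": 0.71, "lda": 0.68, "word2vec": 0.83}  # hypothetical numbers
    return scores[model_name]

candidates = ["lsi", "lda", "word2vec"]
best = max(candidates, key=evaluate)
print(best)
```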