Play Shakespeare in Python
@bambooom
What Data?
- Shakespeare's works in XML format
- Comedy/Tragedy/History
- Poems (ignored)
<speech>
<speaker long="Hamlet">HAM.</speaker>
<line globalnumber="1331" number="267" form="prose">Why—</line>
<line globalnumber="1332" number="268" form="verse" offset="0">“One fair daughter, and no more,</line>
<line globalnumber="1333" number="268" form="verse" offset="2">The which he loved passing well.”</line>
</speech>
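A minimal sketch of reading one such speech with the standard library's xml.etree.ElementTree. The helper name parse_speech is hypothetical, and the sketch assumes each speech holds one speaker element followed by line elements, as in the sample above.

```python
import xml.etree.ElementTree as ET

# Sample speech copied from the slide; a full play file would hold many of these.
SAMPLE = """<speech>
<speaker long="Hamlet">HAM.</speaker>
<line globalnumber="1331" number="267" form="prose">Why—</line>
<line globalnumber="1332" number="268" form="verse" offset="0">“One fair daughter, and no more,</line>
<line globalnumber="1333" number="268" form="verse" offset="2">The which he loved passing well.”</line>
</speech>"""


def parse_speech(speech):
    """Return (full speaker name, list of line texts) for one <speech> element.

    Hypothetical helper: assumes a single <speaker> child with a "long" attribute.
    """
    speaker = speech.find("speaker").get("long")
    # itertext() also collects text nested inside any child tags of a line.
    lines = ["".join(line.itertext()).strip() for line in speech.findall("line")]
    return speaker, lines


speaker, lines = parse_speech(ET.fromstring(SAMPLE))
print(speaker, len(lines))
```

Joining the lines of every speech in a play gives one text document per play (or per character), which is the input the later steps work on.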
What I have done
- Predict the genre from the lines
- Cluster the plays as if the genre were unknown
- Visualization
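Predicting Genre from the Lines
- The sklearn.feature_extraction.text module converts text data into a matrix of Term Frequency-Inverse Document Frequency (TF-IDF) weights
- naive_bayes.MultinomialNB and linear_model.SGDClassifier are two classifiers well suited to text data
- Even after trying a stemming tokenizer to reduce dimensionality and a grid search for suitable parameter values, classifier accuracy stayed below 60%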
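The steps above could be wired together roughly as follows. This is a sketch, not the code behind the slides: the toy corpus is invented for illustration, the parameter grid is an assumption, and the stemming tokenizer is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: invented snippets standing in for the real speeches.
texts = [
    "love and laughter at the wedding feast",
    "a merry jest and a happy marriage",
    "the fool sings of love and wine",
    "murder and revenge for the dead king",
    "blood and poison end the prince",
    "the ghost demands revenge and death",
    "the crown and the battle for england",
    "the king leads his army to war",
    "rebels march against the english throne",
]
genres = ["comedy"] * 3 + ["tragedy"] * 3 + ["history"] * 3

# TF-IDF features feed straight into the classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Assumed grid; the values actually searched are not given in the slides.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

# cv=3 works here only because each genre has exactly three samples.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, genres)

print(search.best_params_)
print(search.predict(["revenge for the murdered king"]))
```

Swapping MultinomialNB() for linear_model.SGDClassifier() keeps the same pipeline shape; both accept an alpha parameter, though it means smoothing in one and regularization strength in the other.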
![](https://s3.amazonaws.com/media-p.slid.es/uploads/475555/images/2391812/romeojuliet.jpg)
Trying Clustering
- decomposition.TruncatedSVD reduces dimensionality and can also be used for Latent Semantic Analysis (LSA)
- Neither KMeans nor MiniBatchKMeans produced clear clusters
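A sketch of the LSA-then-cluster approach described above, assuming TF-IDF input and three clusters to mirror the three genres. The toy texts are invented, and the component count is an arbitrary choice for such a small example.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus: invented snippets standing in for the real plays.
texts = [
    "love and laughter at the wedding feast",
    "a merry jest and a happy marriage",
    "murder and revenge for the dead king",
    "the ghost demands revenge and death",
    "the crown and the battle for england",
    "the king leads his army to war",
]

# LSA: TF-IDF, then truncated SVD, then re-normalize each row to unit length
# so that Euclidean k-means behaves more like cosine-similarity clustering.
lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=3, random_state=0),
    Normalizer(copy=False),
)
reduced = lsa.fit_transform(texts)

# Three clusters is an assumption matching Comedy/Tragedy/History.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)

print(reduced.shape)
print(labels)
```

cluster.MiniBatchKMeans takes the same n_clusters and random_state arguments and can be dropped in for larger corpora.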
![](https://s3.amazonaws.com/media-p.slid.es/uploads/475555/images/2391814/df4d3094dac5dccda08d8243a1959b93.jpg)
Visualizing the Term Document Matrix
![](https://s3.amazonaws.com/media-p.slid.es/uploads/475555/images/2391834/Play_Shakespeare_in_Python.png)
Visualizing a Word Cloud
![](https://s3.amazonaws.com/media-p.slid.es/uploads/475555/images/2391840/mask.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/475555/images/2391843/download__1_.png)
What to do
- Classify gender from the lines (existing research papers)
- nltk: sentiment analysis
- Cluster characters' personalities from their lines?
- The Chinese-language world!