Scraped lyrics.com for the top ~100 songs for 3 genres:
For each song, we gathered:
import urllib
import re
import pandas as pd
genre=[20,2,22,6,17,18,11,12,14,15,24,21]
increments=[0,30,60,90,120]
links=[]
tag=[]
for i in genre:
for j in increments:
l='http://www.lyrics.com/tophits/genres/'+str(j)+'/'+str(i)
p=i
tag.append(p)
links.append(l)
dictionary=dict(zip(links,tag))
lyrics_links=[]
gerne_tag=[]
for key,value in dictionary.items():
htmltext=urllib.urlopen(key).read()
htmlstr=str(htmltext)
p=re.findall('<a href="(.*?)" style="padding:2px', htmlstr)
for i in p:
g='http://www.lyrics.com'+i
lyrics_links.append(g)
gerne_tag.append(value)
dict2=dict(zip(lyrics_links,gerne_tag))
lyrics=[]
title=[]
artist=[]
genre2=[]
for key,value in dict2.items():
htmltext=urllib.urlopen(key).read()
htmlstr = str(htmltext)
d=htmlstr.replace('<br />\r\n',' ').replace('<br />\n',' ').replace('<div id="lyrics" class="SCREENONLY" itemprop="description">','')
title2=re.findall('<h1 id="profile_name">([^"]+)<br /><span', d)
print title2
m=re.findall('<div id="lyric_space">\s+((.*?)[^"]+)<br />---<br />', d)
c=re.findall('<h1 id="profile_name">([^"]+)<br /><span class="small">', d)
f=re.findall('">([^"]+)</a></span></h1>\s+<div class="fblike_btn">', d)
if len(m)==1:
if len(c)==1:
if len(f)==1:
lyrics.extend(re.findall('<div id="lyric_space">\s+((.*?)[^"]+)<br />---<br />', d))
title.extend(re.findall('<h1 id="profile_name">([^"]+)<br /><span class="small">', d))
artist.extend(re.findall('">([^"]+)</a></span></h1>\s+<div class="fblike_btn">', d))
genre2.append(value)
data={'lyrics':lyrics,'title':title, 'artist':artist, 'genre2':genre2}
lyric_data= pd.DataFrame(data)
number_to_genre={20:'alternative',2:'blues',22:'christain/gospel',6:'country',17:'dance',18:'hiphop/rap',11:'jazz',
12:'latino',14:'pop',15:'r&b/soul',24:'reggae', 21:'rock'}
lyric_data['genre']=lyric_data['genre2'].map(number_to_genre)
lyric_data.to_csv('C:\Users\MatthewGilmore\Desktop\lyric_data.csv')
1. Build Feature Extractor2. Build Classification Model
- Create a TF-IDF matrix representation of the training lyrics.
- Use SVD to reduce the dimensionally
corpus=numpy.asarray(train['lyrics'])
cv=TfidfVectorizer(stop_words=stops,sublinear_tf=True, use_idf=True)
counts=cv.fit_transform(corpus)
svd=TruncatedSVD(n_components=100)
lsa=svd.fit_transform(counts)
- Genre || Lyrics
- Genre || W1, W2, ..., ~W5000
- Genre || SVD1, SVD2..., SVD100
- With the text now represented numerically we can use standard classification techniques.
- The random forests model had the most accurate classification results.
clf = RandomForestClassifier(n_estimators=2000)
targets = numpy.asarray(train['genre2'])
clf.fit(lsa, targets)
scores=cross_validation.cross_val_score(clf,lsa, targets, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)"%(scores.mean(),scores.std() * 2))
1. Apply Feature Extractor2. Apply Classification Model
test_lyrics= numpy.asarray(test['lyrics2'])
test_data= cv.transform(test_lyrics).toarray()
test_reduced_data=svd.transform(test_data)
test['predictions'] = clf.predict(test_reduced_data)
confusion_matrix=pd.crosstab(test.genre2,test.predictions)
(Accuracy ~ .78)
predictions country pop rap/hip-hop All genre country 9 5 0 14 pop 0 13 0 13 rap/hip-hop 0 3 6 9 All 9 21 6 36
- Use bi-grams
- Integrate POS
- Utilize song features