SAT Analogies
with word vectors
DC Hack and Tell
2017-07-18
Aaron Schumacher
@planarrowspace
http://planspace.org/20170705-word_vectors_and_sat_analogies/
Pop Quiz!
PALTRY : SIGNIFICANCE ::
A. redundant : discussion
B. austere : landscape
C. opulent : wealth
D. oblique : familiarity
E. banal : originality
RUNNER : MARATHON ::
A. envoy : embassy
B. martyr : massacre
C. oarsman : regatta
D. referee : tournament
E. horse : stable
KING : QUEEN ::
A. lion : cat
B. goose : flock
C. ewe : sheep
D. cub : bear
E. man : woman
... word vectors?
Word vectors!
- So much word vector hype
- Every word represented by D numbers
- "D-dimensional word vectors"
- Subtract two words to get a relationship vector
- Compare relationship vectors
- "this relationship is most like that other one"
[Figure labels: cat, lion, goose, flock, ewe, sheep, cub, bear]
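The subtract-and-compare idea can be sketched in a few lines. This is a minimal illustration, not the talk's code: the 3-dimensional vectors are made-up toy values (real word vectors have 25 to 300 dimensions), and `relationship`, `cosine_similarity`, and `answer_analogy` are hypothetical helper names.

```python
import numpy as np

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.1, 0.8]),
    'lion':  np.array([0.7, 0.5, 0.4]),
    'cat':   np.array([0.2, 0.5, 0.4]),
    'goose': np.array([0.3, 0.5, 0.5]),
    'flock': np.array([0.6, 0.5, 0.6]),
    'ewe':   np.array([0.3, 0.2, 0.8]),
    'sheep': np.array([0.3, 0.5, 0.5]),
    'cub':   np.array([0.2, 0.4, 0.5]),
    'bear':  np.array([0.8, 0.4, 0.5]),
    'man':   np.array([0.1, 0.9, 0.1]),
    'woman': np.array([0.1, 0.1, 0.9]),
}

def relationship(a, b, vectors):
    """Relationship vector: the difference between two word vectors."""
    return vectors[a] - vectors[b]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def answer_analogy(stem, choices, vectors):
    """Return the index of the choice whose relationship is most like the stem's."""
    stem_rel = relationship(stem[0], stem[1], vectors)
    scores = [cosine_similarity(stem_rel, relationship(a, b, vectors))
              for a, b in choices]
    return int(np.argmax(scores)), scores

choices = [('lion', 'cat'), ('goose', 'flock'), ('ewe', 'sheep'),
           ('cub', 'bear'), ('man', 'woman')]
best, scores = answer_analogy(('king', 'queen'), choices, toy_vectors)
print('ABCDE'[best])
```

With these toy values the man : woman difference points the same way as king : queen, so choice E scores highest. Real vectors are far noisier.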
How do we get
a bunch of questions?
Random guessing: 1/5=20%
Average college applicant: 57%
Human voting: 82%
https://aclweb.org/aclwiki/SAT_Analogy_Questions_(State_of_the_art)
- 2005 paper (not by AltaVista, just using it)
- for each relationship, do 128 AltaVista searches
  - KING and QUEEN
  - QUEEN but not KING
  - etc.
- log(number of search results) are vector values
- worked! 47% accuracy
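A rough sketch of the idea in the bullets above: each query pattern contributes one number, the log of its hit count. The two patterns, the `fake_hit_counts` table, the helper name `relationship_vector`, and the `+ 1` inside the log (so that zero hits do not break the logarithm) are assumptions for illustration, not details confirmed by the slides. The approach described used 128 searches per word pair, not two.

```python
import math

# Hypothetical query patterns; the slides name only these two examples.
patterns = ['{a} and {b}', '{b} but not {a}']

# Made-up hit counts standing in for a real search engine.
fake_hit_counts = {
    'KING and QUEEN': 120000,
    'QUEEN but not KING': 300,
    'MAN and WOMAN': 900000,
    'WOMAN but not MAN': 2500,
}

def relationship_vector(a, b, hit_counts):
    """One value per pattern: log of (hit count + 1) for the filled-in query."""
    queries = [p.format(a=a, b=b) for p in patterns]
    return [math.log(hit_counts.get(q, 0) + 1) for q in queries]

king_queen = relationship_vector('KING', 'QUEEN', fake_hit_counts)
man_woman = relationship_vector('MAN', 'WOMAN', fake_hit_counts)
print(king_queen, man_woman)
```

The resulting vectors can then be compared exactly like differences of word vectors.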
How do we get
word vectors though?
import gensim

# Pre-trained Google News vectors in word2vec binary format.
word2vec_file = 'data/GoogleNews-vectors-negative300.bin'
word_to_vec = gensim.models.KeyedVectors.load_word2vec_format(
    word2vec_file, binary=True)
vec = word_to_vec['cat']  # look up the vector for one word
import numpy as np

def read(filename):
    """Read text-format vectors: each line is a word followed by its values."""
    word_to_vec = {}
    with open(filename, encoding='utf-8') as f:
        for line in f:
            # The word runs up to the first space; the rest are numbers.
            first_space_index = line.index(' ')
            word = line[:first_space_index]
            values = line[first_space_index + 1:]
            vector = np.fromstring(values, sep=' ', dtype=np.float16)
            word_to_vec[word] = vector
    return word_to_vec

word_to_vec = read('data/glove.twitter.27B.25d.txt')
vec = word_to_vec['cat']
type type type...
What is "near"
in 300 dimensions?
- Almost all of the 3M words are in their own orthant (the 300-dimensional "quadrant")!
- Euclidean distance?
- Cosine distance?
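To make the two options concrete, here is a minimal sketch of both distances with NumPy, plus a count of how many distinct orthants (sign patterns, the high-dimensional analogue of quadrants) a set of vectors occupies. The random vectors are stand-ins for real word vectors, and the helper names are hypothetical.

```python
import numpy as np

def euclidean_distance(u, v):
    """Straight-line distance; sensitive to vector length."""
    return np.linalg.norm(u - v)

def cosine_distance(u, v):
    """1 - cosine similarity; depends only on direction, not length."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def count_occupied_orthants(vectors):
    """Count distinct sign patterns among the rows of a 2-D array."""
    sign_patterns = {tuple(row) for row in (vectors > 0).tolist()}
    return len(sign_patterns)

u = np.array([1.0, 0.0])
v = np.array([10.0, 0.0])  # same direction as u, much longer
w = np.array([0.0, 1.0])   # same length as u, perpendicular to it

print(euclidean_distance(u, v), cosine_distance(u, v))
print(euclidean_distance(u, w), cosine_distance(u, w))

# Random stand-ins for word vectors: 1,000 points in 300 dimensions.
rng = np.random.default_rng(0)
random_vectors = rng.standard_normal((1000, 300))
occupied = count_occupied_orthants(random_vectors)
print(occupied)
```

Euclidean distance calls `u` and `v` far apart, while cosine distance treats them as pointing the same way. With 2^300 possible sign patterns, each of the 1,000 random vectors lands in an orthant of its own.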
Thanks!