SAT Analogies

with word vectors

DC Hack and Tell

2017-07-18

Aaron Schumacher

@planarrowspace

http://planspace.org/20170705-word_vectors_and_sat_analogies/

Pop Quiz!

PALTRY : SIGNIFICANCE ::

A. redundant : discussion

B. austere : landscape

C. opulent : wealth

D. oblique : familiarity

E. banal : originality

RUNNER : MARATHON ::

A. envoy : embassy

B. martyr : massacre

C. oarsman : regatta

D. referee : tournament

E. horse : stable

KING : QUEEN ::

A. lion : cat

B. goose : flock

C. ewe : sheep

D. cub : bear

E. man : woman

... word vectors?

Word vectors!

  • So much word vector hype
  • Every word represented by D numbers
    • "D-dimensional word vectors"
  • Subtract two words to get a relationship vector
  • Compare relationship vectors
    • "this relationship is most like that other one"
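The subtract-and-compare idea above can be sketched in a few lines. This is a minimal illustration rather than the talk's actual code: the 3-dimensional vectors are made up (real ones have tens to hundreds of dimensions), the helper names `relationship`, `cosine_similarity`, and `best_answer` are hypothetical, and using cosine similarity to compare relationship vectors is an assumption.

```python
import numpy as np

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.1, 0.8]),
    'man':   np.array([0.2, 0.9, 0.1]),
    'woman': np.array([0.2, 0.1, 0.9]),
    'cub':   np.array([0.1, 0.3, 0.3]),
    'bear':  np.array([0.8, 0.4, 0.4]),
}


def relationship(vectors, first, second):
    """Relationship vector for a pair: first word minus second word."""
    return vectors[first] - vectors[second]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def best_answer(vectors, stem, choices):
    """Return the choice whose relationship vector is most like the stem's."""
    stem_vec = relationship(vectors, *stem)
    return max(
        choices,
        key=lambda pair: cosine_similarity(stem_vec, relationship(vectors, *pair)),
    )


choices = [('cub', 'bear'), ('man', 'woman')]
picked = best_answer(toy_vectors, ('king', 'queen'), choices)
print(picked)
```

With these invented numbers, KING minus QUEEN points the same way as MAN minus WOMAN, so that pair is picked; real vectors are much noisier.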

(Figure labels: cat, lion, goose, flock, ewe, sheep, cub, bear)

How do we get

a bunch of questions?

 Random guessing: 1/5=20%

 Average college applicant: 57%

 Human voting: 82%

https://aclweb.org/aclwiki/SAT_Analogy_Questions_(State_of_the_art)

  • 2005 paper (not by AltaVista, just using it)
    • for each relationship, do 128 AltaVista searches
      • KING and QUEEN
      • QUEEN but not KING
      • etc.
    • log(number of search results) are vector values
    • worked! 47% accuracy
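To make the search-count idea concrete, here is a rough sketch. The hit counts are invented, only four query patterns are shown instead of 128, and `hits_to_vector` is a hypothetical helper. Adding 1 before taking the log is an assumption to avoid log(0), not something the slide specifies.

```python
import math

# Invented hit counts, one per query pattern (the 2005 approach used 128).
fake_hits = {
    ('king', 'queen'): [120000, 45000, 300, 0],
    ('man', 'woman'): [900000, 310000, 2500, 4],
}


def hits_to_vector(counts):
    """Turn raw hit counts into vector values via log(count + 1)."""
    return [math.log(count + 1) for count in counts]


hit_vectors = {pair: hits_to_vector(counts) for pair, counts in fake_hits.items()}
print(hit_vectors[('king', 'queen')])
```

Each word pair then has its own vector of log counts, and pairs can be compared the same way as any other relationship vectors.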

How do we get

word vectors though?

import gensim


word2vec_file = 'data/GoogleNews-vectors-negative300.bin'
word_to_vec = gensim.models.KeyedVectors.load_word2vec_format(
    word2vec_file, binary=True)

vec = word_to_vec['cat']
import numpy as np


def read(filename):
    word_to_vec = {}
    with open(filename, encoding='utf-8') as f:
        for line in f:
            first_space_index = line.index(' ')
            word = line[:first_space_index]
            values = line[first_space_index + 1:]
            vector = np.fromstring(values, sep=' ', dtype=np.float16)
            word_to_vec[word] = vector
    return word_to_vec

word2vec = read('data/glove.twitter.27B.25d.txt')

vec = word2vec['cat']

type type type...

What is "near"

in 300 dimensions?

  • Almost all of the 3M words are in their own orthant (the 300-dimensional version of a quadrant)!
  • Euclidean distance?
  • Cosine distance?
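The two distance candidates, and the orthant observation, can be tried out directly. This is a small sketch that uses random vectors as stand-ins for real word vectors (an assumption for illustration; real word vectors are not random), and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = 3.0 * a  # same direction as a, three times as long


def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return float(np.linalg.norm(u - v))


def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def orthant(v):
    """The sign pattern of v across all dimensions."""
    return tuple(np.sign(v).astype(int))


# Cosine distance ignores length, Euclidean distance does not.
print(cosine_distance(a, b), euclidean_distance(a, b))

# Count how many distinct orthants 1000 random 300-d vectors land in.
random_vectors = rng.normal(size=(1000, 300))
distinct_orthants = len({orthant(v) for v in random_vectors})
print(distinct_orthants)
```

With 2^300 possible sign patterns, even a large vocabulary has room for nearly every word to sit alone in its orthant, which is part of why the choice of distance matters.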

Thanks!
