Understanding Python's Data Structure: Dictionary

Sayan Chowdhury

@chowdhury_sayan

The Dictionary

>>> d = {
...    'Fedora': 21,
...    3.1415: 'pi',
...    9 : 'Planets', 
...    (1, 4, 17): 'Django version',
...}

Keys can be almost anything, as long as they are hashable.

A dictionary is really a list

>>> # An empty dictionary is an 8-element list!
>>> d = {}
>>> # This actually creates a `hash table` 
>>> # containing `slots`

Keys are hashed to produce indexes

How is the hashing done?

Python uses the built-in hash() function.
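A quick sketch of calling hash() directly. It is written for Python 3, while the slides use Python 2, and it only assumes that every key is hashable. The actual integers differ between runs and platforms, so none are shown here.

```python
# hash() accepts any hashable object and returns a plain integer.
keys = ['Fedora', 3.1415, 9, (1, 4, 17)]
hashes = [hash(key) for key in keys]

for key, h in zip(keys, hashes):
    print(repr(key), '->', h)

# Within one interpreter session, hashing the same key again
# always gives the same integer.
stable = all(hash(key) == h for key, h in zip(keys, hashes))
print(stable)
```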

 

Let's hash some keys now!

>>> def bits(n):
...    n += 2**32
...    return bin(n)[-32:]  # remove '0b'
...
>>> print bits(1)
00000000000000000000000000000001
>>> print bits(-1)
11111111111111111111111111111111
>>> for key in 'Monty', 3.1415, (1, 4, 17):
...     print bits(hash(key)), key

10011000011001101001001100000010 Monty
01101010101011010000100100000010 3.1415
11000111101010001110010110011011 (1, 4, 17)

 

>>> k1 = bits(hash('Monty'))
>>> k2 = bits(hash('Money'))
>>> diff = ('^ '[a==b] for a,b in zip(k1, k2))
>>> print k1; print k2; print ''.join(diff)

10011000011001101001001100000010
10011001010010110111010100010111
       ^  ^ ^^ ^^^^  ^^    ^ ^ ^

 

Even a minor change in the key can lead to a drastic change in the hash value
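The same comparison can be sketched in Python 3. Two things are assumptions of this sketch rather than part of the slides: the hash width is read from sys.hash_info instead of being fixed at 32 bits, and differing_bits is a hypothetical helper name. Python 3 randomizes string hashes per process unless PYTHONHASHSEED is set, so the count printed for 'Monty' and 'Money' changes from run to run.

```python
import sys

WIDTH = sys.hash_info.width  # hash size in bits on this build (commonly 64)

def bits(n):
    """Return n as a WIDTH-bit two's-complement binary string."""
    return format(n & (2**WIDTH - 1), '0%db' % WIDTH)

def differing_bits(a, b):
    """Count the bit positions where the hashes of a and b differ."""
    return sum(x != y for x, y in zip(bits(hash(a)), bits(hash(b))))

print(bits(1))
print(bits(-1))
print(differing_bits('Monty', 'Money'))
```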

Same value returns the same hash

>>> for key in 3.1415, 3.1415, 3.1415:
...     print bits(hash(key)), key
01101010101011010000100100000010 3.1415
01101010101011010000100100000010 3.1415
01101010101011010000100100000010 3.1415

Now, let me show something interesting!

>>> for key in 9, 9.0, complex(9):
...     print bits(hash(key)), key
00000000000000000000000000001001 9
00000000000000000000000000001001 9.0
00000000000000000000000000001001 (9+0j)

Keys <===> Indexes

#1. Keys are hashed.

#2. To build an index, Python uses the bottom n bits of the hash.

>>> d['ftp'] = 21
>>> b = bits(hash('ftp'))
>>> print b
11010010011111111001001010100001
>>> print b[-3:]  # last 3 bits = 8 combinations
001
>>> d['ssh'] = 22
>>> print bits(hash('ssh'))[-3:]
101
>>> d['smtp'] = 25
>>> print bits(hash('smtp'))[-3:]
100
>>> d['time'] = 37
>>> print bits(hash('time'))[-3:]
111
>>> d['www'] = 80
>>> print bits(hash('www'))[-3:]
010
d = {'ftp': 21, 'ssh': 22,
     'smtp': 25, 'time': 37,
     'www': 80}
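Taking the bottom bits of a hash is a single mask operation. The sketch below is written for Python 3 and uses small integers as keys, because their hashes are stable and equal to the integers themselves in CPython, whereas string hashes vary between runs. The function name slot is hypothetical.

```python
def slot(key, table_size=8):
    """Pick a slot from the bottom bits of the hash.

    table_size must be a power of two, so table_size - 1 is a mask
    of the low bits (8 slots -> mask 0b111 -> bottom 3 bits).
    """
    return hash(key) & (table_size - 1)

# Small non-negative ints hash to themselves in CPython.
for key in (9, 17, 21, 80):
    print(key, format(slot(key), '03b'))
```

Here 9 and 17 both map to slot 001, which is exactly the kind of collision discussed later.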

Lookup?

  • compute hash

  • get the index using the bottom `n` bits of the hash

  • search in the hash table

>>> print d['smtp']
25
>>> print bits(hash('smtp'))[-3:]
100

Ever noticed the crazy order of a dictionary?

>>> # Different than our insertion order:
>>> print d
{'ftp': 21, 'www': 80, 'smtp': 25, 'ssh': 22,
 'time': 37}
>>> # But same order as in the hash table!
>>> # keys and values also in table order
>>> d.keys()
['ftp', 'www', 'smtp', 'ssh', 'time']
>>> d.values()
[21, 80, 25, 22, 37]

Collision

What happens when two keys need the same slot?

>>> # start over with a new dictionary
>>> d = {}
>>> # first item inserts fine
>>> d['smtp'] = 21
>>> # second item collides!
>>> d['dict'] = 2628
>>> # third item also finds empty slot
>>> d['svn'] = 3690
>>> # fourth item has multiple collisions
>>> d['ircd'] = 6667
>>> # fifth item collides, but less deeply
>>> d['zope'] = 9673
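To make collisions concrete, here is a toy open-addressing table in Python 3. It is only a sketch: it uses plain linear probing (try the next slot), which is simpler than the probing sequence CPython actually uses, it never resizes, and ToyTable and its methods are hypothetical names.

```python
class ToyTable:
    """Fixed-size open-addressing table (linear probing, no resize)."""

    def __init__(self, size=8):
        self.size = size              # must be a power of two
        self.slots = [None] * size    # each slot: None or (key, value)

    def _probe(self, key):
        """Yield slot indexes to try, starting at the natural slot."""
        start = hash(key) & (self.size - 1)
        for step in range(self.size):
            yield (start + step) & (self.size - 1)

    def insert(self, key, value):
        """Store the pair and return how many slots were examined."""
        for probes, i in enumerate(self._probe(key), 1):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return probes
        raise RuntimeError('table is full')

    def lookup(self, key):
        """Return (value, probes); raise KeyError on an empty slot."""
        for probes, i in enumerate(self._probe(key), 1):
            if self.slots[i] is None:
                raise KeyError(key)
            if self.slots[i][0] == key:
                return self.slots[i][1], probes
        raise KeyError(key)


table = ToyTable()
# 3, 11 and 19 all have the same bottom three bits (011).
for key in (3, 11, 19):
    print(key, 'inserted after', table.insert(key, str(key)), 'probe(s)')
print(table.lookup(19))
```

The first key lands in its natural slot, the second needs two probes and the third needs three, so the lookup for 19 also has to walk three slots.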

Because collisions move keys away from their natural hash values, key order is quite sensitive to dictionary history

>>> d = {'smtp': 21, 'dict': 2628,
...   'svn': 3690, 'ircd': 6667, 'zope': 9673}
>>> d.keys()
['svn', 'dict', 'zope', 'smtp', 'ircd']
>>> e = {'ircd': 6667, 'zope': 9673,
...   'smtp': 21, 'dict': 2628, 'svn': 3690}
>>> e.keys()
['ircd', 'zope', 'smtp', 'svn', 'dict']

We are the same, yet so different!

>>> d == e
True
>>> d.keys()
['svn', 'dict', 'zope', 'smtp', 'ircd']
>>> e.keys()
['ircd', 'zope', 'smtp', 'svn', 'dict']

These dictionaries are equal,

yet they differ in the order of keys in the hash table.
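The same point can be checked in Python 3. One caveat for this sketch: from Python 3.7 onward a dict iterates in insertion order rather than hash-table order, so the key orders printed on the slides will not reproduce there. Equality still ignores order.

```python
d = {'smtp': 21, 'dict': 2628, 'svn': 3690, 'ircd': 6667, 'zope': 9673}
e = {'ircd': 6667, 'zope': 9673, 'smtp': 21, 'dict': 2628, 'svn': 3690}

print(d == e)                  # equality ignores key order
print(list(d) == list(e))      # iteration order does not
print(sorted(d) == sorted(e))  # compare keys order-independently
```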

>>> # Successful lookup, length 1
>>> # Compares HASHES, then compares the KEYS themselves
>>> d['svn']
3690
>>> # Successful lookup, length 4
>>> d['ircd']
6667
>>> # Unsuccessful lookup, length 1
>>> d['nsca']
Traceback (most recent call last):
  ...
KeyError: 'nsca'
>>> # Unsuccessful lookup, length 4
>>> d['netstat']
Traceback (most recent call last):
  ...
KeyError: 'netstat'

Remember, lookups can be costly!

Not all lookups are created equal.

Some finish at their first slot

Some loop over several slots

from timeit import timeit

threes = {
   3: 1, 3+8: 2, 3+16: 3,
   3+24: 4,  3+32: 5
}
# Thanks to piling collisions atop each
# other, we can make lookup more expensive

timeit('d[3]', 'd=%r' % threes)    # -> 0.078
timeit('d[3+32]', 'd=%r' % threes) # -> 0.082

An interesting consequence!

What happens to lookups when we delete keys?

# Can we simply make its slot empty?
del d['smtp']

# But what would happen to d['ircd']?

When a key is deleted, its slot cannot simply be marked as empty.

Otherwise, any keys that collided with it would become impossible to find!

So we create a <dummy> key instead

>>> # Creates a <dummy> slot that
>>> # can be re-used as storage
>>> del d['smtp']
>>> # That way, we can still find d['ircd']
>>> d['ircd']
6667
>>> del d['svn'], d['dict'], d['zope']
>>> d['ircd']
6667
>>> # Still requires 4 steps!
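The dummy-slot idea can be sketched with a toy table in Python 3. As assumptions of this sketch: it uses linear probing instead of CPython's real probing sequence, the table has a fixed size, and DUMMY, probe, insert, lookup and delete are hypothetical names.

```python
DUMMY = object()   # tombstone marker (hypothetical name for this sketch)
SIZE = 8           # fixed table size, a power of two

def probe(key):
    """Yield slot indexes to try: natural slot first, then the next ones."""
    start = hash(key) & (SIZE - 1)
    for step in range(SIZE):
        yield (start + step) & (SIZE - 1)

def insert(slots, key, value):
    for i in probe(key):
        # Simplification: reuse the first free or dummy slot without
        # checking whether the key already sits further along the chain.
        if slots[i] is None or slots[i] is DUMMY:
            slots[i] = (key, value)
            return
    raise RuntimeError('table is full')

def lookup(slots, key):
    """Return (value, probes). Dummies are skipped, empties stop the search."""
    for probes, i in enumerate(probe(key), 1):
        if slots[i] is None:
            raise KeyError(key)
        if slots[i] is not DUMMY and slots[i][0] == key:
            return slots[i][1], probes
    raise KeyError(key)

def delete(slots, key):
    for i in probe(key):
        if slots[i] is None:
            raise KeyError(key)
        if slots[i] is not DUMMY and slots[i][0] == key:
            slots[i] = DUMMY   # leave a marker so later keys stay reachable
            return
    raise KeyError(key)

slots = [None] * SIZE
for key in (3, 11, 19):        # all three share natural slot 3
    insert(slots, key, key * 10)
delete(slots, 3)
delete(slots, 11)
print(lookup(slots, 19))       # still found, still takes 3 probes
```

If delete wrote None instead of DUMMY, the lookup for 19 would stop at slot 3 and report a missing key.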

Dictionaries resize when they get full

To keep collisions rare, dicts resize when only ⅔ full

>>> wordfile = open('/home/sayan/words')
>>> text = wordfile.read().decode('utf-8')
>>> words = [ w for w in text.split()
...     if w == w.lower() and len(w) < 6 ]
>>> words
[u'a', u'abaci', u'aback', u'abaft', u'abase',
 ..., u'zoom', u'zooms', u'zoos', ...]
d = {}
# Again, an empty dict has 8 slots
# Let's start filling it with keys
d = dict.fromkeys(words[:5])
# collision rate 40%
# but now ⅔ full — on verge of resizing!
d['abash'] = None
# Resizes ×4 to 32, collision rate drops to 0%
d = dict.fromkeys(words[:21])
# ⅔ full again — collision rate 29%
d['abode'] = None
# Resizes ×4 to 128, collision rate drops to 9%
d = dict.fromkeys(words[:85])
# ⅔ full again — collision rate 33%

So, a dictionary's size increases as its keys become crowded
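Resizing can be observed from the outside with sys.getsizeof. This Python 3 sketch just records the reported size after every insert; the exact thresholds and growth factor depend on the CPython version, so they may not match the ⅔ and ×4 figures from the slides, and no specific byte counts are claimed here.

```python
import sys

d = {}
sizes = [sys.getsizeof(d)]      # memory footprint before any insert
for n in range(100):
    d[n] = None
    sizes.append(sys.getsizeof(d))

# Report each insert at which the reported size jumped.
for count in range(1, len(sizes)):
    if sizes[count] != sizes[count - 1]:
        print('size changed at', count, 'keys:', sizes[count], 'bytes')
```

The size stays flat for a run of inserts and then jumps, which is the table being reallocated with more slots.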

Average dictionary performance is excellent

A dictionary of common words:

>>> wfile = open('/usr/sayan/words')
>>> words = wfile.read().split()[:1365]
>>> print words
['A', "A's", ..., "Backus's", 'Bacon', "Bacon's"]
>>> pmap = _dictinfo.probe_all_steps(words)

Some are lucky enough to hit on the first try

>>> pmap['Ajax']
[1330]
>>> pmap['Agamemnon']
[2020]

While some of them were not that lucky!

>>> pmap['Aristarchus']  # requires 5 probes
[864, 1089, 801, 1108, 74]
>>> pmap['Baal']         # requires 16 probes!
[916, 1401, 250, 1359, 399, 1156, 1722, 420, 53,
 266, 1331, 512, 513, 518, 543, 668]

Probes sound slow, right?

But they aren't!

>>> from timeit import timeit
>>> setup = "d=dict.fromkeys(%r)" % words
>>> fast = timeit("d['Ajax']", setup)
>>> slow = timeit("d['Baal']", setup)
>>> '%.1f' % (slow/fast)
'1.7'

An insert can completely reorder a dictionary during resizing! 

>>> d = {'Double': 1, 'double': 2, 'toil': 3,
...      'and': 4, 'trouble': 5}
>>> d.keys()
['toil', 'Double', 'and', 'trouble', 'double']
>>> d['fire'] = 6
>>> d.keys()
['and', 'fire', 'Double', 'double', 'toil',
 'trouble']

Because an insert can radically reorder a dictionary, key insertion is prohibited during iteration

>>> d = {'Double': 1, 'double': 2, 'toil': 3,
...     'and': 4, 'trouble': 5}
>>> for key in d:
...     d['fire'] = 6
Traceback (most recent call last):
  ...
RuntimeError: dictionary changed size during iteration
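When new keys really have to be added while walking a dictionary, one common workaround is to iterate over a snapshot of the keys. A minimal Python 3 sketch, using the same sample data as above; the upper-casing step is only an illustration:

```python
d = {'Double': 1, 'double': 2, 'toil': 3, 'and': 4, 'trouble': 5}

# Inserting while iterating the dict itself raises RuntimeError.
raised = False
try:
    for key in d:
        d['fire'] = 6
except RuntimeError as exc:
    raised = True
    print('RuntimeError:', exc)

# Iterating over a snapshot of the keys lets us insert safely.
for key in list(d):
    d[key.upper()] = d[key]

print(len(d))
```

The list(d) call copies the keys before the loop starts, so the loop no longer depends on the table that is being modified.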

Rules to follow! 

  • Don't rely on the order
  • Don't insert while iterating
  • Mutable (unhashable) keys are not allowed
  • Dictionaries trade space for time
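The rule about mutable keys is easy to trip over with lists. A short Python 3 sketch: a list is rejected as a key, while the equivalent tuple works. The exact wording of the error message varies between Python versions.

```python
d = {}

d[(1, 4, 17)] = 'tuple key'      # tuples of hashable items are hashable

try:
    d[[1, 4, 17]] = 'list key'   # lists are mutable and unhashable
except TypeError as exc:
    error = str(exc)
    print('TypeError:', error)

# If list-like data must be a key, convert it to a tuple first.
version = [1, 4, 17]
print(d[tuple(version)])
```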

Equal values should have equal hashes regardless of their type!

>>> hash(9)
9
>>> hash(9.0)
9
>>> hash(complex(9, 0))
9
>>> d = {9: 'nine'}
>>> d[9]
'nine'
>>> d[9.0]
'nine'
>>> d[complex(9)]
'nine'
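The same rule applies to your own classes: if __eq__ says two objects are equal, __hash__ must return the same number for both. Here is a sketch in Python 3 with a hypothetical Version class that compares equal to a plain tuple and therefore hashes like one.

```python
class Version:
    """Hypothetical value type used only for this illustration."""

    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def __eq__(self, other):
        if isinstance(other, Version):
            return (self.major, self.minor) == (other.major, other.minor)
        if isinstance(other, tuple):
            return (self.major, self.minor) == other
        return NotImplemented

    def __hash__(self):
        # Hash the same tuple that __eq__ compares, so equal objects
        # (including equal plain tuples) land in the same slot.
        return hash((self.major, self.minor))


d = {Version(1, 4): 'Django 1.4'}
print(d[Version(1, 4)])
print(d[(1, 4)])
```

Both lookups find the entry. If __hash__ returned something unrelated to the compared fields, equal keys would usually probe different slots and the second lookup would fail.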

References

  • The Mighty Dictionary by Brandon Rhodes - https://www.youtube.com/watch?v=C4Kc8xzcA68
  • http://svn.python.org/projects/python/trunk/Objects/dictnotes.txt
  • http://svn.python.org/projects/python/trunk/Objects/dictobject.c
  • https://gist.github.com/avances123/9497630

The End
