Got Diff?

pip install deepdiff

github.com/seperman/deepdiff

Sep Dehpour

zepworks.com

sep at zepworks.com

github.com/seperman

PyCon - June 1st 2016

Diff It To Dig It

[
  {
    "_id": "574ddc8215220308d959b480",
    "index": 0,
    "guid": "b750dcce-f8ee-45e8-bbb6-32371ce13dd2",
    "isActive": true,
    "balance": "$1,135.27",
    "picture": "http://placehold.it/32x32",
    "thatguy": {
      "age": 25,
      "eyeColor": "brown",
      "name": "Figueroa Kemp",
      "gender": "male",
      "friends": [
        {
          "id": 0,
          "name": "Mendez Foley",
          "thatguy": {
            "age": 23,
            "gender": "male",
            "friends": [
              {
                "thatguy": {
                  "age": 32,
                  "eyeColor": "blue",
                  "name": "Albert Vaughn",
                  "gender": "male",
                  "friends": [
                    {
                      "id": 0,
                      "name": "Harry"
                    },
                    {
                      "id": 1,
                      "name": "Joe"
                    }
                  ]
                },
                "name": "Robinson Fischer"
              }
            ]
          }
        },
        {
          "id": 1,
          "name": "Alyce Simpson",
          "thatguy": {
            "age": 40,
            "gender": "female",
            "friends": [
              {
                "thatguy": {
                  "age": 21,
                  "eyeColor": "blue",
                  "name": "Hobbs Galloway",
                  "gender": "male",
                  "friends": [
                    {
                      "id": 0,
                      "name": "Miranda Hartman"
                    },
                    {
                      "id": 1,
                      "name": "Rowland Peck"
                    }
                  ]
                },
                "name": "Dena Mccall"
              }
            ]
          }
        }
      ]
    },
    "company": "CALLFLEX",
    "email": "denamccall@callflex.com",
    "phone": "+1 (813) 507-3202",
    "address": "821 Wolf Place, Bagtown, Nebraska, 2245",
    "about": "In ex velit voluptate aute velit",
    "registered": "2015-03-22T02:45:12 +07:00",
    "latitude": 55.78885,
    "longitude": 37.013185
  }
]
[
  {
    "_id": "574ddc8215220308d959b480",
    "index": 0,
    "guid": "b750dcce-f8ee-45e8-bbb6-32371ce13dd2",
    "isActive": true,
    "balance": "$1,135.27",
    "picture": "http://placehold.it/32x32",
    "thatguy": {
      "age": 25,
      "eyeColor": "brown",
      "name": "Figueroa Kemp",
      "gender": "male",
      "friends": [
        {
          "id": 0,
          "name": "Mendez Foley",
          "thatguy": {
            "age": 23,
            "gender": "male",
            "friends": [
              {
                "thatguy": {
                  "age": 32,
                  "eyeColor": "blue",
                  "name": "Albert Vaughn",
                  "gender": "male",
                  "friends": [
                    {
                      "id": 0,
                      "name": "Harry"
                    },
                    {
                      "id": 1,
                      "name": "John"
                    }
                  ]
                },
                "name": "Robinson Fischer"
              }
            ]
          }
        },
        {
          "id": 1,
          "name": "Alyce Simpson",
          "thatguy": {
            "age": 40,
            "gender": "female",
            "friends": [
              {
                "thatguy": {
                  "age": 21,
                  "eyeColor": "blue",
                  "name": "Hobbs Galloway",
                  "gender": "male",
                  "friends": [
                    {
                      "id": 0,
                      "name": "Miranda Hartman"
                    },
                    {
                      "id": 1,
                      "name": "Rowland Peck"
                    }
                  ]
                },
                "name": "Dena Mccall"
              }
            ]
          }
        }
      ]
    },
    "company": "CALLFLEX",
    "email": "denamccall@callflex.com",
    "phone": "+1 (813) 507-3202",
    "address": "821 Wolf Place, Bagtown, Nebraska, 2245",
    "about": "In ex velit voluptate aute velit",
    "registered": "2015-03-22T02:45:12 +07:00",
    "latitude": 55.78885,
    "longitude": 37.013185
  }
]

{ 'values_changed': { "root[0]['thatguy']['friends'][0]['thatguy']['friends'][0]['thatguy']['friends'][1]['name']": {
  'oldvalue': 'Joe',
  'newvalue': 'John',
}}}

  • Diff nested objects
  • Get the path and value of changes
  • Ignore order on demand

Objectives

{ 'values_changed': { "root[0]['thatguy']['friends'][0]['thatguy']['friends'][0]['thatguy']['friends'][1]['name']": {
  'oldvalue': 'Joe',
  'newvalue': 'John',
}}}

[0, 1, 2] vs. [2, 0, 1]

How to diff?

Object categories in Py

1. Text Sequences

2. Numerics

3. Sets

5. Mappings

6. Other Iterables (List, Generator, Deque, Tuple, Custom Iterables)

7. User Defined Objects

Diff Text Sequences with Difflib

>>> import difflib
>>> t1="""
... Hello World!
... """.splitlines()
>>> t2="""
... Hello World!
... It is ice-cream time.
... """.splitlines()
>>> g = difflib.unified_diff(t1, t2, lineterm='')
>>> print('\n'.join(list(g)))
---
+++
@@ -1,2 +1,3 @@

 Hello World!
+It is ice-cream time.

Diffing...

1. Text Sequences

2. Numerics

3. Sets

5. Mappings

6. Other Iterables (List, Generator, Deque, Tuple, Custom Iterables)

7. User Defined Objects

Diff Sets, Frozensets

>>> t1 = {1,2,3}
>>> t2 = {3,4,5}
>>> items_added = t2 - t1
>>> items_removed = t1 - t2
>>> items_added
set([4, 5])
>>> items_removed
set([1, 2])

Diffing...

1. Text Sequences

2. Numerics

3. Sets

5. Mappings

6. Other Iterables (List, Generator, Deque, Tuple, Custom Iterables)

7. User Defined Objects

Diff Mapping

{
    'common1': {
        ...
    },
    'common2': {
        ...
    },
}

Dict, OrderedDict, Defaultdict

{
    'common1': {
        ...
    },
    'common2': {
        ...
    },
    'added':{
        ...
    }
}

Diff Mapping

t1_keys= set(t1.keys())
t2_keys= set(t2.keys())

same_keys = t2_keys.intersection(t1_keys)

added = t2_keys - same_keys
removed = t1_keys - same_keys

Dict, OrderedDict, Defaultdict

And then recursively check same_keys values
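
The recursion over same_keys could look roughly like the sketch below. It is a minimal illustration, not DeepDiff's actual code: the function name diff_mappings, the path argument, and the 'added' / 'removed' / 'changed' result keys are assumptions made for this example, and only nested dicts are handled.

```python
def diff_mappings(t1, t2, path="root"):
    """Return how mapping t2 differs from mapping t1 (hypothetical helper)."""
    result = {"added": {}, "removed": {}, "changed": {}}
    t1_keys = set(t1.keys())
    t2_keys = set(t2.keys())
    same_keys = t2_keys & t1_keys

    # Keys that exist on only one side are reported directly.
    for key in t2_keys - same_keys:
        result["added"]["{}[{!r}]".format(path, key)] = t2[key]
    for key in t1_keys - same_keys:
        result["removed"]["{}[{!r}]".format(path, key)] = t1[key]

    # Keys on both sides: recurse into nested dicts, compare everything else.
    for key in same_keys:
        child_path = "{}[{!r}]".format(path, key)
        v1, v2 = t1[key], t2[key]
        if isinstance(v1, dict) and isinstance(v2, dict):
            child = diff_mappings(v1, v2, child_path)
            for kind in result:
                result[kind].update(child[kind])
        elif v1 != v2:
            result["changed"][child_path] = {"oldvalue": v1, "newvalue": v2}
    return result


t1 = {"common1": {"name": "Joe"}, "common2": {"age": 32}}
t2 = {"common1": {"name": "John"}, "common2": {"age": 32}, "added": {"id": 1}}
changes = diff_mappings(t1, t2)
print(changes)
```

Carrying the path down the recursion is what lets the result report both where a change happened and what the old and new values were.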

Diffing...

1. Text Sequences

2. Numerics

3. Sets

5. Mappings

6. Other Iterables (List, Generator, Deque, Tuple, Custom Iterables)

7. User Defined Objects

Diff Iterables

>>> t1 = [1, 2, 3]
>>> t2 = [1, 2, 5]

Consider Order

Diff Iterables

>>> t1 = [1, 2, 3]
>>> t2 = [1, 2, 5, 6]

Consider Order

Diff Iterables

>>> t1 = [1, 2, 3]
>>> t2 = [1, 2, 5, 6]
>>> from itertools import zip_longest  # izip_longest on Python 2
>>> class NotFound(object):
...     "Fill value for zip_longest"
...     def __repr__(self):
...         return "NotFound"
... 
>>> notfound = NotFound()
>>> 
>>> list(zip_longest(t1, t2, fillvalue=notfound))
[(1, 1), (2, 2), (3, 5), (NotFound, 6)]

Consider Order

Diff Iterables

>>> removed, added, modified = [], [], []
>>> for (x, y) in zip_longest(t1, t2, fillvalue=notfound):
...     if x != y:
...         if y is notfound:
...             removed.append(x)
...         elif x is notfound:
...             added.append(y)
...         else:
...             modified.append("{} -> {}".format(x, y))
...
>>> print(removed)
[]
>>> print(added)
[6]
>>> print(modified)
['3 -> 5']

Consider Order

Diff Iterables

Ignore Order

>>> t1=[1,2]
>>> t2=[1,3,4]
>>> t1set=set(t1)
>>> t2set=set(t2)
>>> t1set-t2set
{2}
>>> t2set-t1set
{3, 4}
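
One caveat with the set approach: sets collapse duplicates, so a repeated item that disappears goes unnoticed. A possible workaround for hashable items, shown here as a sketch rather than as what DeepDiff does, is collections.Counter, whose subtraction keeps only positive counts.

```python
from collections import Counter

t1 = [1, 2, 2]
t2 = [1, 2, 3, 4]

# With sets, the second 2 in t1 is lost before the comparison happens.
set_removed = set(t1) - set(t2)

# Counter tracks how many times each item occurs.
c1, c2 = Counter(t1), Counter(t2)
items_removed = sorted((c1 - c2).elements())
items_added = sorted((c2 - c1).elements())

print(set_removed)    # set()
print(items_removed)  # [2]
print(items_added)    # [3, 4]
```

Like sets, Counter still requires every item to be hashable.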

Diff Iterable > Ignore order > convert to set

>>> t1=[1, 2, {3:3}]
>>> t2=[1]
>>> t1set = set(t1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'

but ...

Diff Iterable > Ignore order > convert to set

A set object is an unordered collection of distinct hashable objects.

Diff Iterable > Ignore order > convert to set

Hashable vs. Unhashable

Mutable vs. Immutable

Diff Iterable > Ignore order > convert to set

Mutable vs. Immutable

>>> a=[1,2]
>>> id(a)
400304246
>>> a.append(3)
>>> id(a)
400304246
>>> b=(1,2)
>>> id(b)
399960722
>>> b += (3,)
>>> id(b)
400670561

Diff Iterable > Ignore order > convert to set

Hashable

  • __hash__ with output that does NOT change over object's lifetime.
  • __eq__ for equality
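
As an illustration of those two requirements, here is a small sketch with a made-up Point class (not from the talk). Its hash is derived from values that are set once and not meant to change, and __eq__ agrees with __hash__.

```python
class Point(object):
    """Hypothetical value object: hashable because its hash never changes."""

    def __init__(self, x, y):
        # Leading underscores signal that these are not meant to be reassigned.
        self._x = x
        self._y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self._x, self._y) == (other._x, other._y)

    def __hash__(self):
        # Equal points must produce equal hashes.
        return hash((self._x, self._y))


points = {Point(1, 2), Point(1, 2), Point(3, 4)}
print(len(points))  # 2: the two equal points collapse into one set member

# A list defines __eq__ but not a usable __hash__, so it cannot go in a set.
try:
    {[1, 2]}
    list_is_hashable = True
except TypeError:
    list_is_hashable = False
print(list_is_hashable)  # False
```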

Diff Iterable > Ignore order > convert to set

Unhashable vs. Mutable

Diff Iterable > Ignore order > convert to set

Hashable that is Mutable

>>> class A:
...     aa=1
...
>>> hash(A)
2857987
>>> A.aa=2
>>> hash(A)
2857987

Diff Iterable > Ignore order > convert to set

Converting a list to a set fails when any item is unhashable.

Now what?

>>> t1=[{1:1}, {3:3}, {4:4}]
>>> t2=[{3:3}, {1:1}, {4:4}, {5:5}]

Diff Iterable > Ignore order > sort

>>> t1=[{1:1}, {3:3}, {4:4}]
>>> t2=[{3:3}, {1:1}, {4:4}]
>>> t1.sort()
>>> t1
[{1: 1}, {3: 3}, {4: 4}]
>>> t2.sort()
>>> t2
[{1: 1}, {3: 3}, {4: 4}]
>>> [(a, b) for a, b in zip(t1,t2) if a != b]
[]

Py2

Diff Iterable > Ignore order > sort

>>> t1=[{1:1}, {3:3}, {4:4}]
>>> t2=[{3:3}, {1:1}, {4:4}]
>>> t1.sort()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unorderable types: dict() < dict()

Py3

Diff Iterable > Ignore order > sort

Sort key

Diff Iterable > Ignore order > sort

>>> students = [
...     ('john', 'A', 15),
...     ('jane', 'B', 12),
...     ('dave', 'B', 10),
... ]
>>> sorted(students, key=lambda s: s[2])
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

Diff Iterable > Ignore order > sort

What to use for sort key to order list of dictionaries?

Diff Iterable > Ignore order > sort

Sort key: hash of dictionary contents

>>> from json import dumps
>>> t1=[{1:1}, {3:3}, {4:4}]
>>> t2=[{3:3}, {1:1}, {4:4}]
>>> t1.sort(key=lambda x: hash(dumps(x)))
>>> t2.sort(key=lambda x: hash(dumps(x)))
>>> t1
[{1: 1}, {3: 3}, {4: 4}]
>>> t2
[{1: 1}, {3: 3}, {4: 4}]
>>> [(a, b) for a, b in zip(t1,t2) if a != b]
[]

Py2 & 3
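
Two details are worth keeping in mind with this sort key. Python 3 randomizes string hashes per process by default, so the resulting order can differ between runs (both lists still end up in the same order within one run). And json.dumps follows the dict's own key order unless sort_keys=True is passed. A variation that sidesteps both, offered as a sketch with a made-up helper name canonical, is to sort by the serialized text itself:

```python
import json


def canonical(item):
    # sort_keys=True makes equal dicts serialize to the same text,
    # regardless of the order in which their keys were inserted.
    return json.dumps(item, sort_keys=True)


t1 = [{"a": 1, "b": 2}, {"c": 3}, {"d": 4}]
t2 = [{"c": 3}, {"b": 2, "a": 1}, {"d": 4}]

s1 = sorted(t1, key=canonical)
s2 = sorted(t2, key=canonical)
mismatches = [(a, b) for a, b in zip(s1, s2) if a != b]
print(mismatches)  # []
```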

Diff Iterable > Ignore order > sort

Iterables with different length

Diff Iterable > Ignore order > sort

iterables with different lengths

>>> import json
>>>
>>> t1=[10, {1:1}, {3:3}, {4:4}]
>>> t1.sort(key=lambda x: hash(json.dumps(x)))
>>>
>>> t2=[{3:3}, {1:1}, {4:4}]
>>> t2.sort(key=lambda x: hash(json.dumps(x)))
>>> t1
[{1: 1}, {3: 3}, {4: 4}, 10]
>>> t2
[{1: 1}, {3: 3}, {4: 4}]

Diff Iterable > Ignore order > sort

iterables with different lengths

>>> t1=[10, "a", {1:1}, {3:3}, {4:4}]
>>> t1.sort(key=lambda x: hash(dumps(x)))
>>> t1
['a', {1: 1}, {3: 3}, {4: 4}, 10]
>>> t2
[{1: 1}, {3: 3}, {4: 4}]
...


['a -> {1: 1}', '{1: 1} -> {3: 3}',
'{3: 3} -> {4: 4}']

Diff Iterable > Ignore order > sort

Put items in a dictionary of

{item_hash: item}

>>> t1 = [10, "a", {1:1}, {3:3}, {4:4}]
>>> t2 = [{3:3}, {1:1}, {4:4}, "b"]
>>> def create_hashtable(t):
...     hashes = {}
...     for item in t:
...         try:
...             item_hash = hash(item)
...         except TypeError:
...             try:
...                 item_hash = hash(json.dumps(item))
...             except:
...                 pass # For presentation purposes
...             else:
...                 hashes[item_hash] = item
...         else:
...             hashes[item_hash] = item
...     return hashes

Diff Iterable > Ignore order > hashtable

>>> t1 = [10, "a", {1:1}, {3:3}, {4:4}]
>>> t2 = [{3:3}, {1:1}, {4:4}, "b"]
>>> h1 = create_hashtable(t1)
>>> h2 = create_hashtable(t2)
>>>
>>> items_added = [h2[i] for i in h2 if i not in h1]
>>> items_removed = [h1[i] for i in h1 if i not in h2]
>>>
>>> items_added
['b']
>>> items_removed
['a', 10]

Diff Iterable > Ignore order > hashtable

  1. Iterate over each item in each iterable
  2. Try to hash it
  3. If that fails, serialize to JSON and then hash
  4. Create {item_hash: item}
  5. Diff as you diff dictionaries!

What if the object is not JSON serializable?

What if the JSON serializations of two different objects are the same?

Diff Iterable > Ignore order > hashtable

Pickle

Diff Iterable > Ignore order > hashtable

>>> from pickle import dumps
>>> t = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10},
'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])
>>> dumps(t)
"((dp0\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\
nI10\nsS'Hello World'\np1\n(I1\nI2\nI3\nI4\nI5\
ntp2\n(lp3\nI1\naI2\naI3\naI4\naI5\natp4\n."
>>> dumps(({1: 1, 2: 4, 3: 6, 4: 8, 5: 10},
           'Hello World', (1, 2, 3, 4, 5),
            [1, 2, 3, 4, 5]))
"((dp0\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\
nI10\nsS'Hello World'\np1\n(I1\nI2\nI3\nI4\nI5
\ntp2\n(lp3\nI1\naI2\naI3\naI4\naI5\natp4\n."

Diff Iterable > Ignore order > hashtable

>>> from cPickle import dumps
>>> t = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10},
'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])
>>> dumps(t)
"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\n
I10\nsS'Hello World'\np2\n(I1\nI2\nI3\nI4\nI5\n
tp3\n(lp4\nI1\naI2\naI3\naI4\naI5\nat."
>>> dumps(({1: 1, 2: 4, 3: 6, 4: 8, 5: 10},
           'Hello World', (1, 2, 3, 4, 5),
            [1, 2, 3, 4, 5]))
"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\n
I10\nsS'Hello World'\n(I1\nI2\nI3\nI4\nI5\nt(lp2
\nI1\naI2\naI3\naI4\naI5\natp3\n."

What about cPickle? It is faster than pickle!

Diff Iterable > Ignore order > hashtable

cPickle's serialization depends on whether the object is referenced elsewhere, so equal values can produce different output!

Diff Iterable > Ignore order > hashtable
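
Putting the fallbacks together, a hashtable builder might try plain hashing first, then JSON, then pickle. The sketch below is written for Python 3 and is only one way to do it. The helper name item_key and the tagged-tuple keys are inventions for this example, and it assumes that equal items pickle to identical bytes, which (as the cPickle output shows) is not guaranteed in general.

```python
import json
import pickle


def item_key(item):
    """Return a hashable stand-in for item (hypothetical helper)."""
    try:
        hash(item)
        return ("hashable", item)
    except TypeError:
        pass
    try:
        return ("json", json.dumps(item, sort_keys=True))
    except (TypeError, ValueError):
        # Assumption: equal items produce identical pickle bytes.
        return ("pickle", pickle.dumps(item, protocol=2))


def create_hashtable(iterable):
    return {item_key(item): item for item in iterable}


t1 = [10, "a", {1: 1}, {3: 3}, {4: 4}]
t2 = [{3: 3}, {1: 1}, {4: 4}, "b", [{5, 6}]]  # a set is not JSON serializable

h1 = create_hashtable(t1)
h2 = create_hashtable(t2)
items_added = [h2[k] for k in h2 if k not in h1]
items_removed = [h1[k] for k in h1 if k not in h2]
print(items_added)
print(items_removed)
```

Tagging each key with the method that produced it keeps a plain string from colliding with the JSON text of some other item.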

Diff Iterables

What did we learn from diffing iterables?

 

- Difference between unhashable and mutable
- Sets can only contain hashable objects
- Creating a hash for a dictionary
- Custom sorting with a key function
- Converting a sequence into a hashtable
- Pickling

Diffing...

1. Text Sequences

2. Numerics

3. Sets

5. Mappings

6. Other Iterables (List, Generator, Deque, Tuple, Custom Iterables)

7. User Defined Objects

Diff Custom Objects

__dict__

Diff Custom Objects

>>> class CL:
...     attr1 = 0
...     def __init__(self, thing):
...         self.thing = thing

>>> obj1 = CL(1)
>>> obj2 = CL(2)
>>> obj2.attr1 = 10
>>> obj1.__dict__
{'thing': 1}  # Notice that attr1 is not here
>>> obj2.__dict__
{'attr1': 10, 'thing': 2}

Diff Custom Objects

__slots__

Diff Custom Objects

>>> class ClassA(object):
...     __slots__ = ['x', 'y']
...     def __init__(self, x, y):
...         self.x = x
...         self.y = y
...
>>> t1 = ClassA(1, 1)
>>> t2 = ClassA(1, 2)
>>>
>>> t1.new = 10
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'ClassA' object has no attribute 'new'

Diff Custom Objects

>>> t1 = {i: getattr(t1, i) for i in t1.__slots__}
>>> t2 = {i: getattr(t2, i) for i in t2.__slots__}
>>> t1
{'x': 1, 'y': 1}
>>> t2
{'x': 1, 'y': 2}
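
Both cases can be folded into one helper that turns an object into a dictionary before diffing. This is a sketch only: object_to_dict, diff_objects, and the MISSING sentinel are names made up for the example, __slots__ is assumed to be a list or tuple of names, and slots inherited from base classes are not collected.

```python
MISSING = object()  # sentinel for "attribute not present on this side"


def object_to_dict(obj):
    """Collect instance attributes from __slots__ and/or __dict__ (hypothetical)."""
    attrs = {}
    for name in getattr(obj, "__slots__", ()):
        if hasattr(obj, name):  # a slot may be declared but never assigned
            attrs[name] = getattr(obj, name)
    attrs.update(getattr(obj, "__dict__", {}))
    return attrs


def diff_objects(o1, o2):
    """Map each differing attribute name to its (old, new) pair."""
    d1, d2 = object_to_dict(o1), object_to_dict(o2)
    return {
        name: (d1.get(name, MISSING), d2.get(name, MISSING))
        for name in set(d1) | set(d2)
        if d1.get(name, MISSING) != d2.get(name, MISSING)
    }


class WithDict(object):
    def __init__(self, thing):
        self.thing = thing


class WithSlots(object):
    __slots__ = ["x", "y"]

    def __init__(self, x, y):
        self.x = x
        self.y = y


dict_diff = diff_objects(WithDict(1), WithDict(2))
slots_diff = diff_objects(WithSlots(1, 1), WithSlots(1, 2))
print(dict_diff)   # {'thing': (1, 2)}
print(slots_diff)  # {'y': (1, 2)}
```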

Loops

>>> a=[1,2]
>>> a.append(a)
>>> a
[1, 2, [...]]
>>> b={1:1, 2:2}
>>> b[3]=b
>>> b
{1: 1, 2: 2, 3: {...}}

Diff Custom Objects

>>> class LoopTest(object):
...     def __init__(self, a):
...         self.loop = self
...         self.a = a
...
>>> t1 = LoopTest(1)
>>> t2 = LoopTest(2)
>>> t1
<__main__.LoopTest object at 0x02B9A910>
>>> t1.__dict__
{'a': 1, 'loop': <__main__.LoopTest object at 0x02B9A910>}

Loops

Diff Custom Objects

Detect Loop with ID

A --> B --> C --> A

11 --> 23 --> 2 --> 11

Diff Custom Objects

Detect Loop with ID

def diff_common_children_of_dictionary(t1, t2,
                t_keys_intersect, parents_ids):

    for item_key in t_keys_intersect:

        t1_child = t1[item_key]
        t2_child = t2[item_key]

        item_id = id(t1_child)

        if parents_ids and item_id in parents_ids:
            print ("Warning, a loop is detected.")
            continue

        parents_added = set(parents_ids)
        parents_added.add(item_id)
        parents_added = frozenset(parents_added)

        diff(t1_child, t2_child, parents_ids=parents_added)
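
For something that can be run end to end, here is a stripped-down sketch of the same idea. It is not DeepDiff's implementation: it handles only dictionaries, the 'loops' result key is an assumption for illustration, and the current container's id is added to parents_ids on entry so that a direct self-reference is caught at the first level.

```python
def diff(t1, t2, path="root", parents_ids=frozenset(), result=None):
    """Diff nested dicts, skipping children that point back to a parent."""
    if result is None:
        result = {}
    if isinstance(t1, dict) and isinstance(t2, dict):
        # Remember this container so its descendants can detect a loop.
        parents_ids = parents_ids | {id(t1)}
        for key in set(t1) & set(t2):
            t1_child, t2_child = t1[key], t2[key]
            child_path = "{}[{!r}]".format(path, key)
            if id(t1_child) in parents_ids:
                result.setdefault("loops", []).append(child_path)
                continue
            diff(t1_child, t2_child, child_path, parents_ids, result)
    elif t1 != t2:
        result.setdefault("values_changed", {})[path] = {
            "oldvalue": t1,
            "newvalue": t2,
        }
    return result


b1 = {1: 1, 2: 2}
b1[3] = b1  # b1 contains itself
b2 = {1: 1, 2: 5}
b2[3] = b2

changes = diff(b1, b2)
print(changes)
```

Because parents_ids is a frozenset rebuilt for each branch, sibling subtrees never see each other's ancestors; only a true path back to a parent counts as a loop.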

Diff Custom Objects

What did we learn about diffing custom objects?

  • __dict__ or __slots__
  • Then diff as dictionary
  • Objects can point to self or parent
  • Detecting loops with IDs

Why Diff

  • Debugging
  • Testing, assertEqual with DeepDiff, faster than py.test
  • Comparing Big Datasets
  • Emotional Stability

Deep Diff

Zepworks.com

sep at zepworks.com

 

https://github.com/seperman/deepdiff

http://zepworks.com/blog/diff-it-to-digg-it

pip install deepdiff
