Numpy and Pandas

Overview

Associate Professor Justin Dressel

Faculty of Mathematics, Physics, and Computation

Schmid College of Science and Technology

 

Vanilla Python Types

Recall the four basic Python structured data types:

  • tuple : unchanging (immutable) ordered sequence of mixed types
  • list : changing (mutable) ordered sequence of mixed types
  • set : mutable collection (frozenset immutable), no duplicates, mixed types
  • dict : mutable collection of key-value pairs, no duplicates, mixed types

Recall the basic Python data types:

  • int : arbitrary precision integer in Python 3
    (in Python 2, a 32 or 64 bit signed hardware integer ranging from (-sys.maxint - 1) to sys.maxint)
  • long : Python 2 only; arbitrary precision integer (in Python 3, this is the default int)
  • float : double precision hardware float, see sys.float_info
  • complex : a pair of floats
  • bool : subclass of int, only False (0) and True (1) values
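
The basic types above can be inspected directly. The following is a minimal sketch assuming Python 3 (where int is already arbitrary precision); the variable names are illustrative only.

```python
import sys

# Python 3 ints are arbitrary precision: they grow past the hardware word size.
big = 2 ** 100
print(big > sys.maxsize)        # sys.maxsize is the largest native index value

# float is a hardware double; sys.float_info describes its limits.
print(sys.float_info.max, sys.float_info.epsilon)

# complex is a pair of floats.
c = complex(1.0, 2.0)
print(c.real, c.imag)

# bool is a subclass of int, so True and False behave as 1 and 0.
print(isinstance(True, int), True + True)
```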

Rules of thumb :

  1. Hardware numbers are faster  (use ints and floats for speed)
  2. Immutable data structures are faster  (use tuples and frozensets for speed)
  3. Generality is nice (e.g., mixed types, mutability), but slow 

Python Lists

Crucially, all four vanilla data structures support mixed types, which means that they are boxed: items are not stored in contiguous memory

To illustrate this, let us focus on the main workhorse of vanilla python: lists

(Figure: memory layout of l = [1,2.0]. Each slot of l holds an address; the address in l[0] leads to the value 1 : int, and the address in l[1] leads to the value 2.0 : float.)

In CPython, a list is a contiguous array of addresses, not a linked list: looking up l[i] by position is fast, but each address points to a boxed value that may be stored anywhere in the entire computer memory

This makes it very efficient to append elements of any type, but makes traversing the values slow, since every element requires following an address and unpacking a box
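
A minimal sketch, assuming CPython (where id() reports an object's memory address), of how a list holds references to separately stored objects of mixed types. The names shared and pair are illustrative.

```python
import sys

l = [1, 2.0, "three"]

# Each slot of the list holds a reference to a separately allocated object.
for item in l:
    print(type(item).__name__, hex(id(item)), sys.getsizeof(item))

# The size of the list object accounts for the references, not the values.
print(sys.getsizeof(l))

# Two list slots can refer to the very same object.
shared = [0.5]
pair = [shared, shared]
pair[0].append(1.5)
print(pair[1])   # the change is visible through the second slot as well
```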

Python Boxing

In fact, all types are "boxed" by default in Python

This means that when you access a value, Python must unpack a box of memory to find an address, then find the value at that address. 

This is slow, but very flexible.

(Figure: a = 3. The name a is a memory box holding the address of a value; opening the box and following the address leads to 3 : int. Only there is the type of what is stored inside a revealed.)

For large collections of values, this unboxing process can take a significant portion of the runtime
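
The cost of boxing can be made visible with sys.getsizeof. This is a rough sketch assuming CPython; the 8-byte figure is simply the size of a raw 64-bit number, used here for comparison.

```python
import sys

a = 3
# The name `a` refers to an object; the object carries its own type and value.
print(type(a), id(a))

# A raw 64-bit number needs 8 bytes; the boxed Python object needs more,
# because it also stores a reference count and a pointer to its type.
raw_bytes = 8
print(sys.getsizeof(a), sys.getsizeof(3.0), raw_bytes)

# Assignment copies the reference, not the value: both names share one box.
b = a
print(b is a)
```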

Partial Solution: numpy

The numpy module provides an array type that is a contiguous block of memory, all of one type, stored in a single Python memory box

It is much faster when dealing with many values.

import numpy as np
lnp = np.array([1,2])

(Figure: lnp is a numpy array: a single memory box containing the length (size 2), the type (int), and the packed array of values 1, 2.)

Since a single type has a fixed size in memory per element, and the numpy array stores how many elements there are, it is extremely efficient to randomly locate any element in memory 

Since the elements are stored contiguously in memory, it is even more efficient to traverse the elements sequentially in the array 

However, since the array has a fixed size, it must be recopied any time that size is changed, which is horribly slow

Arrays should be preallocated, not resized
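
A minimal sketch of the two patterns, assuming a loop that computes squares purely for illustration. np.append returns a freshly allocated copy on every call, whereas a preallocated array is filled in place.

```python
import numpy as np

n = 1000

# np.append never resizes in place: it returns a new, larger array.
small = np.array([1, 2])
bigger = np.append(small, 3)
print(small.shape, bigger.shape)

# Slow pattern: every iteration allocates a new array and copies all elements.
grown = np.array([], dtype=np.float64)
for i in range(n):
    grown = np.append(grown, i ** 2)

# Preferred pattern: allocate once, then fill in place.
prealloc = np.empty(n, dtype=np.float64)
for i in range(n):
    prealloc[i] = i ** 2

# Both approaches produce the same values; only the cost differs.
print(np.array_equal(grown, prealloc))
```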

Partial Solution: numpy

The numpy module also provides a much more comprehensive set of numeric types that permit more nuanced bit-level handling of binary data

Jargon reminder:

  • bit : a single 0 or 1
  • byte : 8 bits, e.g., 00010010

Data type    Description
bool_        Boolean (True or False) stored as a byte
int_         Default integer type (same as C long; normally either int64 or int32)
intc         Identical to C int (normally int32 or int64)
intp         Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8         Byte (-128 to 127)
int16        Integer (-32768 to 32767)
int32        Integer (-2147483648 to 2147483647)
int64        Integer (-9223372036854775808 to 9223372036854775807)
uint8        Unsigned integer (0 to 255)
uint16       Unsigned integer (0 to 65535)
uint32       Unsigned integer (0 to 4294967295)
uint64       Unsigned integer (0 to 18446744073709551615)
float_       Shorthand for float64 (alias removed in NumPy 2.0; use float64)
float16      Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
float32      Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
float64      Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
complex_     Shorthand for complex128 (alias removed in NumPy 2.0; use complex128)
complex64    Complex number, represented by two 32-bit floats (real and imaginary components)
complex128   Complex number, represented by two 64-bit floats (real and imaginary components)

Also, platform-dependent C integer types are defined: short, long, longlong and their unsigned versions.

In [2]: a = np.uint8(10)
In [3]: np.iinfo(a)  # Get info about an integer type
Out[3]: iinfo(min=0, max=255, dtype=uint8)

In [4]: b = np.array([1,3,5], dtype=np.float128)  # float128 exists only on some platforms; np.longdouble is the portable name
In [5]: b
Out[5]: array([ 1.0,  3.0,  5.0], dtype=float128)
In [6]: np.finfo(b[1]) # Get info about a floating point type
Out[6]: finfo(resolution=1e-18, min=-1.18973149536e+4932, max=1.18973149536e+4932, dtype=float128)
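
A short sketch of what fixed-width types imply in practice: every element has a known size in bytes, and integer arithmetic wraps around at the limits of the type rather than growing. The array names are illustrative.

```python
import numpy as np

# Each dtype has a fixed size per element, reported by itemsize (in bytes).
for dt in (np.int8, np.int32, np.float32, np.float64, np.complex128):
    print(np.dtype(dt).name, np.dtype(dt).itemsize)

# Fixed-width integers wrap around on overflow instead of growing.
small = np.array([250], dtype=np.uint8)
wrapped = small + np.array([10], dtype=np.uint8)
print(wrapped)          # 250 + 10 = 260, stored modulo 256

# Converting to a wider type first avoids the wraparound.
widened = small.astype(np.uint16) + 10
print(widened)
```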

Array Data: numpy

Remember: a numpy array is a contiguous block of memory, all of one type, stored in a single Python memory box.

In [1]: import numpy as np
 
In [2]: %timeit l = list(range(100000))  # in Python 3, range is lazy, so build the list explicitly
1000 loops, best of 3: 889 µs per loop
 
In [3]: %timeit lnp = np.arange(100000)
10000 loops, best of 3: 140 µs per loop

(Figure: lnp is a single packed array of ints.)

Unlike list slices, which are copies, slices of a numpy array are not copied: assigning to a slice modifies the original array in place

This makes manipulating large arrays of data very efficient

Common operations are also vectorized into element-wise loops across the array

In [4]: lnp[5:12] = -1
 
In [5]: lnp[1:50] * lnp[1:50]
Out[5]: 
array([   1,    4,    9,   16,    1,    1,    1,    1,    1,    1,    1,
        144,  169,  196,  225,  256,  289,  324,  361,  400,  441,  484,
        529,  576,  625,  676,  729,  784,  841,  900,  961, 1024, 1089,
       1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849, 1936,
       2025, 2116, 2209, 2304, 2401])
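
The in-place behavior can be checked by contrasting a list slice with an array slice. A minimal sketch with illustrative names:

```python
import numpy as np

py_list = list(range(10))
py_slice = py_list[2:5]     # slicing a list makes a new list
py_slice[0] = -1
print(py_list[2])           # the original list still holds 2

arr = np.arange(10)
arr_slice = arr[2:5]        # slicing an array makes a view of the same memory
arr_slice[0] = -1
print(arr[2])               # the original array now holds -1

# Arithmetic is applied element by element, with no explicit loop.
squares = arr * arr
print(squares)
```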

Universal Functions

Note that numpy has special basic operations ("universal functions" or ufuncs) that are automatically vectorized over arrays

These completely replace the functionality of the math module, and are significantly more efficient: do not use math if you use numpy!

Example: computing a Gaussian probability density function over an array of coordinates

In [1]: import numpy as np

In [2]: %time x = np.linspace(-100,100,1000000)
CPU times: user 12 ms, sys: 8 ms, total: 20 ms
Wall time: 17.4 ms

In [3]: D = 5.0

In [4]: x0 = -3.5

In [5]: %time g = (1.0/np.sqrt(2*np.pi*D))*np.exp(-(x - x0)**2/(2*D))
CPU times: user 52 ms, sys: 8 ms, total: 60 ms
Wall time: 58.7 ms

g(x) = \frac{e^{-(x-x_0)^2/(2D)}}{\sqrt{2\pi D}}

x : 1e6 points between -100 and 100

g : values of the Gaussian function above at each of the points in x

Note: no for loops are needed

All ufuncs automatically traverse arrays in a very efficient way
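
For comparison, here is a sketch of the same Gaussian written once as an explicit loop over the math module and once with ufuncs. It uses a smaller grid (1001 points, an arbitrary choice) so that it runs quickly; the two results agree to floating point tolerance.

```python
import math
import numpy as np

D = 5.0
x0 = -3.5
x = np.linspace(-100, 100, 1001)

# Explicit loop using the math module: one boxed float at a time.
g_loop = [math.exp(-(xi - x0) ** 2 / (2 * D)) / math.sqrt(2 * math.pi * D)
          for xi in x]

# Same formula with numpy ufuncs: one expression over the whole array.
g_vec = np.exp(-(x - x0) ** 2 / (2 * D)) / np.sqrt(2 * np.pi * D)

print(np.allclose(g_loop, g_vec))
```

Wrapping each version in %timeit with a larger grid is a simple way to see the difference in speed on your own machine.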

Vectorizing Functions

If you have a function of a single argument that is not a ufunc, and you want to apply this function across an entire array, you can vectorize it manually

In [2]: def f(x):
   ...:     if x%2==0:
   ...:         return "Even"
   ...:     else:
   ...:         return "Odd"
   ...: 

In [3]: x = np.arange(1,100)
In [4]: x
Out[4]: 
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
       69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
       86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

In [5]: fv = np.vectorize(f)
In [6]: fv(x)
Out[6]: 
array(['Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd',
       'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even',
       'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd',
       'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even',
       'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd',
       'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even',
       'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd',
       'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even',
       'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd',
       'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even',
       'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd'], 
      dtype='<U4')

x : integers from 1 up to, but not including, 100

f : function from 1 integer to 1 string

fv : vectorized function from array of integers to array of strings

fv(x) : array of unicode strings of length 4 characters or less
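
Note that np.vectorize is a convenience wrapper that still calls the Python function once per element. When the rule can be expressed with array operations, np.where is one alternative; the sketch below restates f for self-containment and checks that both approaches agree.

```python
import numpy as np

def f(n):
    """Label a single integer as "Even" or "Odd" (same rule as f above)."""
    return "Even" if n % 2 == 0 else "Odd"

x = np.arange(1, 100)

# np.vectorize calls f once per element; it is a convenience, not a speedup.
labels_vectorize = np.vectorize(f)(x)

# np.where evaluates the test on the whole array and picks a value per element.
labels_where = np.where(x % 2 == 0, "Even", "Odd")

print(np.array_equal(labels_vectorize, labels_where))
```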

Array Views

Following the principle of not copying data, numpy provides views: different presentations of the same data in memory

In [2]: l = np.arange(20)
In [3]: l
Out[3]: 
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [4]: l_3d = np.reshape(l, (2,2,5))
In [5]: l_3d
Out[5]: 
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9]],

       [[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]]])

In [6]: l_3d[1,1,4]
Out[6]: 19

In [7]: l[0:4] = 999
In [8]: l_3d[1,1,2:] = -1

In [9]: l_3d
Out[9]: 
array([[[999, 999, 999, 999,   4],
        [  5,   6,   7,   8,   9]],

       [[ 10,  11,  12,  13,  14],
        [ 15,  16,  -1,  -1,  -1]]])

(Start with a flat -- 1D -- array of 20 integers)

(Memory is not changed, but is reinterpreted as a 3D nested array of size (2,2,5), i.e., with 2 elements in the outer array, 2 elements in the first inner array, and 5 elements in the second inner array: 2x2x5=20)

(Why does changing l affect l_3d? Note the difference in how the elements of each are referenced)
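
Whether two arrays are views of the same memory can be tested with np.shares_memory. The sketch below uses the illustrative names flat, cube, and independent, and also shows how an explicit copy breaks the link.

```python
import numpy as np

flat = np.arange(20)
cube = flat.reshape(2, 2, 5)      # a view: same memory, new shape
print(np.shares_memory(flat, cube))

# Writing through the view changes the original.
cube[0, 0, 0] = 999
print(flat[0])

# An explicit copy owns its own memory, so changes no longer propagate.
independent = flat.reshape(2, 2, 5).copy()
independent[1, 1, 4] = -1
print(flat[19])

# The flat index of cube[i, j, k] is i*(2*5) + j*5 + k in row-major order.
i, j, k = 1, 0, 3
print(cube[i, j, k] == flat[i * 10 + j * 5 + k])
```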

Boolean Indexing

In [12]: data = np.random.randn(7,4)
 
In [13]: data
Out[13]: 
array([[-0.96984204, -0.55792773,  0.65584348, -0.96020013],
       [ 0.07280736,  0.610084  , -0.32043743, -0.36070071],
       [ 1.48343014,  0.81954353,  1.40535631,  1.67215618],
       [ 1.68529367, -1.19673775, -0.22360459,  1.71824879],
       [-0.28271127,  0.4158064 ,  0.98339965,  1.08078398],
       [-0.81622001,  0.09710239, -1.87426313, -1.57414564],
       [-0.22090031,  0.34779169, -1.4279908 ,  0.4511331 ]])
 
In [14]: data[data < 0]
Out[14]: 
array([-0.96984204, -0.55792773, -0.96020013, -0.32043743, -0.36070071,
       -1.19673775, -0.22360459, -0.28271127, -0.81622001, -1.87426313,
       -1.57414564, -0.22090031, -1.4279908 ])
 
In [15]: data[data < 0] = 0
 
In [16]: data
Out[16]: 
array([[ 0.        ,  0.        ,  0.65584348,  0.        ],
       [ 0.07280736,  0.610084  ,  0.        ,  0.        ],
       [ 1.48343014,  0.81954353,  1.40535631,  1.67215618],
       [ 1.68529367,  0.        ,  0.        ,  1.71824879],
       [ 0.        ,  0.4158064 ,  0.98339965,  1.08078398],
       [ 0.        ,  0.09710239,  0.        ,  0.        ],
       [ 0.        ,  0.34779169,  0.        ,  0.4511331 ]])

Items of an array may be indexed by Boolean tests
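
The test itself produces an ordinary array of Booleans, which can be stored, counted, and combined. A minimal sketch, assuming a seeded generator (np.random.default_rng) so that the run is repeatable; the names mask, small, and clipped are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)            # seeded so the run is repeatable
data = rng.standard_normal((7, 4))

mask = data < 0                           # Boolean array, same shape as data
print(mask.dtype, mask.shape, mask.sum()) # mask.sum() counts the True entries

# Conditions combine element-wise with & (and), | (or), ~ (not).
small = (data > -0.5) & (data < 0.5)
print(data[small])

# Assignment through a mask modifies only the selected elements.
clipped = data.copy()
clipped[mask] = 0
print((clipped >= 0).all())
```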

Self-quiz:

  • Run the code above and understand it.
  • Create a list version of data:

from random import random
dlist = [[random() for x in range(4)]
          for y in range(7)]

  • How could you accomplish the same thing as the code above, but with dlist?
  • Why is using numpy beneficial for efficiently manipulating array-based data?

Thinking in Arrays

Consider the following code:

In [1]: r = np.linspace(-2,2,1000)

In [2]: x, y = np.meshgrid(r,r)

In [3]: z = x + y*1j

In [4]: (z)[0:2,0:2]
Out[4]: 
array([[-2.000000-2.j      , -1.995996-2.j      ],
       [-2.000000-1.995996j, -1.995996-1.995996j]])

In [5]: (z**2)[0:2,0:2]
Out[5]: 
array([[ 0.00000000+8.j        , -0.01599998+7.98398398j],
       [ 0.01599998+7.98398398j,  0.00000000+7.96800003j]])

Self-quiz:

  • What are x and y?  What is meshgrid doing?
  • What is z?  Why does this construction work?

In [2]: a = 3*1j
In [3]: a
Out[3]: 3j

In [4]: a*a
Out[4]: (-9+0j)

In [5]: b = a + 1
In [6]: b
Out[6]: (1+3j)

In [7]: b*b
Out[7]: (-8+6j)

1j is Python's representation of the imaginary unit, so (1j)*(1j) == -1
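
A 3-point grid is small enough to print in full and makes the construction easier to follow. This sketch also builds the same complex grid by broadcasting, as one possible alternative to meshgrid; the name z_broadcast is illustrative.

```python
import numpy as np

r = np.linspace(-1, 1, 3)          # [-1, 0, 1]
x, y = np.meshgrid(r, r)

# x repeats r along each row; y repeats r down each column.
print(x)
print(y)

# Pairing them gives every grid point as a complex number x + iy.
z = x + 1j * y
print(z[0, 0], z[2, 1])

# Broadcasting builds the same grid without storing x and y explicitly.
z_broadcast = r[np.newaxis, :] + 1j * r[:, np.newaxis]
print(np.array_equal(z, z_broadcast))
```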

Structured Data: pandas

For purely numeric arrays, numpy is sufficient.

Real world data contains labels and other descriptive information.

pandas.Series augments a numpy array with dictionary-like labeling.

In [1]: import pandas as pd

In [2]: s = pd.Series( [-1,0,5,4], index=['a','foo','banana','Charles'] )
 
In [3]: s
Out[3]: 
a         -1
foo        0
banana     5
Charles    4
dtype: int64
 
In [4]: s.index
Out[4]: Index(['a', 'foo', 'banana', 'Charles'], dtype='object')

In [5]: s.values
Out[5]: array([-1,  0,  5,  4])

In [6]: s['banana']
Out[6]: 5

In [7]: np.exp(s)
Out[7]: 
a            0.367879
foo          1.000000
banana     148.413159
Charles     54.598150
dtype: float64

(Important note: a pandas Series acts like a numpy array in element-wise operations, with its indexing labels remaining unaffected)

(Note: each series must contain values of only a single data type, since it's an augmented numpy array)
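
A short sketch of the two faces of a Series, using the same s as above: dictionary-like access by label, and array-like behavior for arithmetic and Boolean tests. The names doubled and positive are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([-1, 0, 5, 4], index=["a", "foo", "banana", "Charles"])

# Label-based access with .loc, position-based access with .iloc.
print(s.loc["banana"], s.iloc[2])

# The underlying data is a numpy array of one dtype.
values = s.to_numpy()
print(type(values), values.dtype)

# Element-wise operations keep the labels attached to the results.
doubled = s * 2
print(doubled["Charles"])

# Boolean tests work as in numpy and return the matching labeled entries.
positive = s[s > 0]
print(list(positive.index))
```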

Null Values in Data

In [1]: s = pd.Series( [-1,0,5,4], index=['a','foo','banana','Charles'] )
 
In [2]: d = pd.Series( s, index=['a','banana','Ronaldo'] )

In [3]: d
Out[3]: 
a          -1
banana      5
Ronaldo   NaN
dtype: float64

In [4]: d.isnull()
Out[4]: 
a          False
banana     False
Ronaldo     True
dtype: bool
 
In [5]: d[d.isnull()]
Out[5]: 
Ronaldo   NaN
dtype: float64

In [6]: d[d.notnull()]
Out[6]: 
a        -1
banana    5
dtype: float64

In [7]: d + s
Out[7]: 
Charles   NaN
Ronaldo   NaN
a          -2
banana     10
foo       NaN
dtype: float64

Data that doesn't exist is called "null" and has the special value NaN

Very often, filtering out null data values is a major part of a real world data processing task

NaN propagates through operations

(What does this do? Why is it useful?)
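
Beyond filtering with isnull() and notnull(), pandas has dedicated methods for missing data. A minimal sketch using the same labels as above; reindex is used here as one way to build d, and total is an illustrative name.

```python
import numpy as np
import pandas as pd

s = pd.Series([-1, 0, 5, 4], index=["a", "foo", "banana", "Charles"])
d = s.reindex(["a", "banana", "Ronaldo"])   # "Ronaldo" has no data, so NaN

# Drop the missing entries, or replace them with a chosen default.
print(d.dropna())
print(d.fillna(0))

# fill_value substitutes 0 when only one side is missing;
# an entry that is missing on both sides stays NaN.
total = s.add(d, fill_value=0)
print(total)

# NaN is not equal to itself, so test for it with isnull(), not ==.
print(np.nan == np.nan, d.isnull().sum())
```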

Data Frames

In [1]: data = pd.DataFrame({'time': np.linspace(0,10,1000), 'position': np.random.randn(1000)})
 
In [2]: data[0:10]
Out[2]: 
   position     time
0  0.967507  0.00000
1 -0.732718  0.01001
2  0.678975  0.02002
3 -0.844301  0.03003
4 -0.202790  0.04004
5 -0.948616  0.05005
6  0.864689  0.06006
7  1.330334  0.07007
8  0.172459  0.08008
9  1.173954  0.09009

In [3]: data['position'][0:3]
Out[3]: 
0    0.967507
1   -0.732718
2    0.678975
Name: position, dtype: float64

In [4]: data.time[0:3]
Out[4]: 
0    0.00000
1    0.01001
2    0.02002
Name: time, dtype: float64

Tabular data (multiple columns, each an independent Series) is structured as a DataFrame

(Note: each column can have values of a different data type)
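
A small sketch of a DataFrame whose columns have different types. The column names and values are hypothetical and chosen only to show one text, one float, and one integer column.

```python
import pandas as pd

# Hypothetical measurements: one text column, one float column, one int column.
table = pd.DataFrame({
    "label": ["a", "b", "c"],
    "position": [0.5, -1.25, 2.0],
    "count": [3, 1, 4],
})

# Each column is a Series with its own dtype.
print(table.dtypes)

# Selecting one column returns a Series; selecting a list returns a DataFrame.
one_column = table["position"]
two_columns = table[["position", "count"]]
print(type(one_column).__name__, type(two_columns).__name__)

# Column-wise summaries apply to each numeric Series independently.
print(two_columns.mean())
```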

In [1]: s = pd.Series( [-1,0,5,4], index=['a','foo','banana','Charles'] )
 
In [2]: d = pd.Series( s, index=['a','banana','Ronaldo'] )

In [3]: f = pd.DataFrame( {"s":s, "d":d} )
 
In [4]: f
Out[4]: 
          d   s
Charles NaN   4
Ronaldo NaN NaN
a        -1  -1
banana    5   5
foo     NaN   0

In [5]: f.T
Out[5]: 
   Charles  Ronaldo  a  banana  foo
d      NaN      NaN -1       5  NaN
s        4      NaN -1       5    0

In [6]: f.T**4
Out[6]: 
   Charles  Ronaldo  a  banana  foo
d      NaN      NaN  1     625  NaN
s      256      NaN  1     625    0

DataFrames are a powerful tool for organizing and manipulating big data

(Transposing flips the series and labels.)

In [1]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                            index=['Ohio', 'Colorado', 'Utah', 'New York'],
                            columns=['one', 'two', 'three', 'four'])
 
In [2]: data['three'] = data['three']**2
 
In [3]: data
Out[3]: 
          one  two  three  four
Ohio        0    1      4     3
Colorado    4    5     36     7
Utah        8    9    100    11
New York   12   13    196    15

In [4]: data % 2 == 0
Out[4]: 
           one    two three   four
Ohio      True  False  True  False
Colorado  True  False  True  False
Utah      True  False  True  False
New York  True  False  True  False
 
In [5]: data[data['two'] > 5]
Out[5]: 
          one  two  three  four
Utah        8    9    100    11
New York   12   13    196    15

In [6]: data.loc[:'Utah', 'two']
Out[6]: 
Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int64

DataFrames can be sliced, diced, and indexed in a variety of convenient ways

(Test these out yourself.)

(What does this do?)
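
The two explicit indexers are .loc (by label) and .iloc (by integer position). The sketch below rebuilds the same table, without squaring column three, and shows that a label slice includes its end point while a position slice does not; the names by_label, by_position, and subset are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# .loc selects by label; a label slice includes its end point.
by_label = data.loc[:"Utah", "two"]

# .iloc selects by integer position; a position slice excludes its end point.
by_position = data.iloc[:3, 1]
print(by_label.equals(by_position))

# A Boolean Series selects rows; combine it with column labels in .loc.
subset = data.loc[data["two"] > 5, ["one", "four"]]
print(subset)
```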

Saving and Loading

In [1]: file_name = "http://www.ats.ucla.edu/stat/data/binary.csv"

In [2]: df = pd.read_csv(file_name)

In [3]: df.head()
Out[3]: 
   admit  gre   gpa  rank
0      0  380  3.61     3
1      1  660  3.67     3
2      1  800  4.00     1
3      1  640  3.19     4
4      0  520  2.93     4

In [4]: df.to_json('binary.json')

In [5]: df2 = pd.read_json('binary.json').sort_values(['gre','gpa'])
In [6]: df2.tail()
Out[6]: 
     admit  gre  gpa  rank
2        1  800    4     1
10       0  800    4     4
33       1  800    4     3
77       1  800    4     3
377      1  800    4     2

Pandas provides very sophisticated ways to read and write various data formats

(Tab-completion listing from an older pandas release; some entries, such as read_msgpack, to_msgpack, and to_panel, have since been removed)

pd.read_clipboard  pd.read_excel      pd.read_gbq        pd.read_html       
pd.read_msgpack    pd.read_sql        pd.read_sql_table  pd.read_table
pd.read_csv        pd.read_fwf        pd.read_hdf        pd.read_json       
pd.read_pickle     pd.read_sql_query  pd.read_stata

pd.DataFrame.to_clipboard  pd.DataFrame.to_excel      pd.DataFrame.to_json       
pd.DataFrame.to_period     pd.DataFrame.to_sql        pd.DataFrame.to_wide
pd.DataFrame.to_csv        pd.DataFrame.to_gbq        pd.DataFrame.to_latex      
pd.DataFrame.to_pickle     pd.DataFrame.to_stata
pd.DataFrame.to_dense      pd.DataFrame.to_hdf        pd.DataFrame.to_msgpack    
pd.DataFrame.to_records    pd.DataFrame.to_string
pd.DataFrame.to_dict       pd.DataFrame.to_html       pd.DataFrame.to_panel      
pd.DataFrame.to_sparse     pd.DataFrame.to_timestamp

Notes:

  • csv : simplest to read
  • json : best for portability
  • hdf : best for huge datasets

Aside:   Jupyter notebooks are json!

(Read from csv file on the web)

(Output to json file)

(Read in json file to new dataframe, sort and process)
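
The same round trip can be tried without a network connection or files on disk. This sketch assumes in-memory text buffers (io.StringIO) in place of real files and retypes the first five rows of the admissions table shown above by hand.

```python
import io
import pandas as pd

# First rows of the admissions table shown above, typed in by hand.
df = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380, 660, 800, 640, 520],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
    "rank": [3, 3, 1, 4, 4],
})

# Without a path, to_csv and to_json return the serialized text.
csv_text = df.to_csv(index=False)
json_text = df.to_json()

# Reading the text back reproduces the table.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

# sort_values sorts rows by one or more columns.
df_sorted = df_json.sort_values(["gre", "gpa"])
print(df_sorted.tail(2))
```

Replacing the buffers with file names, as in the session above, writes to and reads from disk instead.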

Further Reading

Practice makes perfect.

 

Keep references handy until you remember commands on command.