Associate Professor Justin Dressel
Faculty of Mathematics, Physics, and Computation
Schmid College of Science and Technology
Recall the four basic Python structured data types:
Recall the basic Python data types:
Rules of thumb :
Crucially, all four vanilla data structures support mixed types, which means that they are boxed: items are not stored in contiguous memory
To illustrate this, let us focus on the main workhorse of vanilla python: lists
l = [1,2.0]
(Diagram: the list l drawn as memory boxes. The l[0] address leads to the value 1 : int, the l[1] address leads to the value 2.0 : float, and the remaining slices l[1:] and l[2:] are reached by address, ending in None.)
To read the elements of a list, python must follow a separate memory address for every element, and the values may be stored anywhere in the entire computer memory (in CPython the list itself is a contiguous array of such addresses, so locating the address of l[i] is fast, but the values themselves are not contiguous)
This makes it very efficient to store and append elements of any type, but makes numeric traversal of many values slow
In fact, all types are "boxed" by default in Python
This means that when you access a value, Python must unpack a box of memory to find an address, then find the value at that address.
This is slow, but very flexible.
a = 3
(Diagram: the name a is a memory box holding the address of a value; the value 3 : int is stored at that address.)
For large collections of values, this unboxing process can take a significant portion of the runtime
(Diagram annotations: a is the memory box; following its address is "opening the box"; only at the stored value is the type of what is inside a revealed.)
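The cost of boxing can be made concrete with a rough sketch. It assumes CPython, where sys.getsizeof reports the size of the whole boxed object; the variable names are illustrative only.

```python
import sys
import numpy as np

# A plain Python int is a full object (a "box") that carries a type
# pointer and a reference count in addition to the value itself.
boxed_bytes = sys.getsizeof(3)

# A numpy int64 array stores only the raw 8-byte values side by side.
packed = np.array([3], dtype=np.int64)
packed_bytes = packed.itemsize

print("boxed int:", boxed_bytes, "bytes")
print("packed int64 element:", packed_bytes, "bytes")

# Every element of a list is its own box and may have its own type.
mixed = [1, 2.0, "three"]
type_names = [type(item).__name__ for item in mixed]
print(type_names)
```

The exact size of the boxed int depends on the interpreter build, so only the comparison matters here.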
The numpy module provides an array type that is a contiguous block of memory, all of one type, stored in a single Python memory box
It is much faster when dealing with many values.
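import numpy as np
lnp = np.array([1,2])
(Diagram: lnp : numpy array, a single memory box holding a packed array together with its length : size (2) and its type : type (int); the values 1 and 2 sit side by side.)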
Since a single type has a fixed size in memory per element, and the numpy array stores how many elements there are, it is extremely efficient to randomly locate any element in memory
Since the elements are stored contiguously in memory, it is even more efficient to traverse the elements sequentially in the array
However, since the array has a fixed size, it must be recopied any time that size is changed, which is horribly slow
Arrays should be preallocated, not resized
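A minimal sketch of the preallocation advice, using a hypothetical task (filling an array with squares). All three variants produce the same values; only the first one recopies the array on every step.

```python
import numpy as np

n = 1000

# Slow pattern: np.append returns a brand-new array each time,
# so every iteration copies all the elements accumulated so far.
grown = np.array([], dtype=np.float64)
for i in range(n):
    grown = np.append(grown, i ** 2)

# Preferred pattern: allocate the full array once, then fill it in place.
filled = np.empty(n, dtype=np.float64)
for i in range(n):
    filled[i] = i ** 2

# Best when possible: let a vectorized expression build the array directly.
direct = np.arange(n, dtype=np.float64) ** 2
```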
The numpy module also provides a much more comprehensive set of numeric types that permit more nuanced bit-level handling of binary data
Jargon reminder:
Data type    Description
bool_        Boolean (True or False) stored as a byte
int_         Default integer type (same as C long; normally either int64 or int32)
intc         Identical to C int (normally int32 or int64)
intp         Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8         Byte (-128 to 127)
int16        Integer (-32768 to 32767)
int32        Integer (-2147483648 to 2147483647)
int64        Integer (-9223372036854775808 to 9223372036854775807)
uint8        Unsigned integer (0 to 255)
uint16       Unsigned integer (0 to 65535)
uint32       Unsigned integer (0 to 4294967295)
uint64       Unsigned integer (0 to 18446744073709551615)
float_       Shorthand for float64.
float16      Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
float32      Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
float64      Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
complex_     Shorthand for complex128.
complex64    Complex number, represented by two 32-bit floats (real and imaginary components)
complex128   Complex number, represented by two 64-bit floats (real and imaginary components)
Also, platform-dependent C integer types are defined: short, long, longlong and their unsigned versions.
In [2]: a = np.uint8(10)
In [3]: np.iinfo(a) # Get info about an integer type
Out[3]: iinfo(min=0, max=255, dtype=uint8)
In [4]: b = np.array([1,3,5], dtype=np.float128)  # float128 exists only on some platforms; np.longdouble is the portable name
In [5]: b
Out[5]: array([ 1.0, 3.0, 5.0], dtype=float128)
In [6]: np.finfo(b[1]) # Get info about a floating point type
Out[6]: finfo(resolution=1e-18, min=-1.18973149536e+4932, max=1.18973149536e+4932, dtype=float128)
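One practical consequence of fixed-size integer types is wraparound on overflow. The sketch below uses made-up values to show it for uint8, together with one possible remedy (widening the type before the operation).

```python
import numpy as np

# The representable range of an 8-bit unsigned integer.
limits = np.iinfo(np.uint8)
print(limits.min, limits.max)

a = np.array([250, 255], dtype=np.uint8)
b = np.array([10, 1], dtype=np.uint8)

# Array arithmetic stays in uint8, so results wrap around modulo 256.
wrapped = a + b

# Converting to a wider type first leaves room for the true sums.
widened = a.astype(np.uint16) + b

print(wrapped, widened)
```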
Remember: a numpy array is a contiguous block of memory,
all of one type, stored in a single Python memory box.
In [1]: import numpy as np
In [2]: %timeit l = list(range(100000))  # in Python 3, range is lazy, so list() is needed to build the full list
1000 loops, best of 3: 889 µs per loop
In [3]: %timeit lnp = np.arange(100000)
10000 loops, best of 3: 140 µs per loop
(Diagram: lnp is a single memory box holding a packed array of ints)
Unlike list slices, which are copies, slices of a numpy array are views: assigning to a slice modifies the original array in place
This makes manipulating large arrays of data very efficient
Common operations are also vectorized into element-wise loops across the array
In [4]: lnp[5:12] = -1
In [5]: lnp[1:50] * lnp[1:50]
Out[5]:
array([ 1, 4, 9, 16, 1, 1, 1, 1, 1, 1, 1,
144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484,
529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089,
1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849, 1936,
2025, 2116, 2209, 2304, 2401])
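The difference between slicing a list and slicing an array can be checked directly. This is a small sketch with illustrative names, not part of the original session.

```python
import numpy as np

py_list = list(range(10))
py_slice = py_list[2:5]      # slicing a list copies the selected items
py_slice[0] = -1             # changes only the copy

arr = np.arange(10)
arr_slice = arr[2:5]         # slicing an array returns a view of the same memory
arr_slice[0] = -1            # changes arr as well

print(py_list[2], arr[2])

# Element-wise (vectorized) multiplication, with no explicit loop.
squares = arr[5:] * arr[5:]
print(squares)
```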
Note that numpy has special basic operations ("universal functions" or ufuncs) that are automatically vectorized over arrays
These cover the functionality of the math module and are significantly more efficient on arrays: prefer them over math whenever you work with numpy arrays!
Example: computing a Gaussian probability density function over an array of dependent coordinates
In [1]: import numpy as np
In [2]: %time x = np.linspace(-100,100,1000000)
CPU times: user 12 ms, sys: 8 ms, total: 20 ms
Wall time: 17.4 ms
In [3]: D = 5.0
In [4]: x0 = -3.5
In [5]: %time g = (1.0/np.sqrt(2*np.pi*D))*np.exp(-(x - x0)**2/(2*D))
CPU times: user 52 ms, sys: 8 ms, total: 60 ms
Wall time: 58.7 ms
x : 1e6 points between -100 and 100
g : values of the Gaussian function above at each of the points in x
Note: no for loops are needed
All ufuncs automatically traverse arrays in a very efficient way
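For comparison, here is a sketch of the same Gaussian computed twice: once with an explicit loop over the math module, and once with ufuncs. It uses a smaller grid than the session above so the loop finishes quickly; the names g_loop and g_vec are illustrative.

```python
import math
import numpy as np

D = 5.0
x0 = -3.5
x = np.linspace(-100, 100, 1001)

# Explicit loop: math functions accept only one scalar at a time.
g_loop = np.empty_like(x)
for i, xi in enumerate(x):
    g_loop[i] = (1.0 / math.sqrt(2 * math.pi * D)) * math.exp(-(xi - x0) ** 2 / (2 * D))

# Vectorized: the ufuncs np.sqrt and np.exp traverse the whole array at once.
g_vec = (1.0 / np.sqrt(2 * np.pi * D)) * np.exp(-(x - x0) ** 2 / (2 * D))

# Both approaches agree to floating point precision.
print(np.allclose(g_loop, g_vec))
```

Wrapping each version in %timeit for a large array shows the speed difference.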
If you have a function of a single argument that is not a ufunc, and you want to apply this function across an entire array, you can vectorize it manually
In [2]: def f(x):
   ...:     if x%2==0:
   ...:         return "Even"
   ...:     else:
   ...:         return "Odd"
   ...:
In [3]: x = np.arange(1,100)
In [4]: x
Out[4]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
In [5]: fv = np.vectorize(f)
In [6]: fv(x)
Out[6]:
array(['Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd',
'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even',
'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd',
'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even',
'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd',
'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even',
'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd',
'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even',
'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd',
'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even',
'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd'],
dtype='<U4')
x : integers from 1 up to (but not including) 100
f : function from 1 integer to 1 string
fv : vectorized function from array of integers to array of strings
fv(x) : array of unicode strings of length 4 characters or less
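np.vectorize is a convenience wrapper: it still calls the Python function once per element. When the logic is a simple condition, a built-in array operation such as np.where can produce the same result. The sketch below compares the two.

```python
import numpy as np

def f(x):
    # Scalar function: one integer in, one string out.
    return "Even" if x % 2 == 0 else "Odd"

x = np.arange(1, 100)

# Convenience route: np.vectorize calls f once per element.
labels_vectorize = np.vectorize(f)(x)

# Array route: the test x % 2 == 0 is evaluated for the whole array,
# and np.where picks "Even" or "Odd" for each element.
labels_where = np.where(x % 2 == 0, "Even", "Odd")

print(labels_where[:4])
```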
Following the principle of not copying data, numpy arrays support views: the same data in memory can be presented with a different shape or slicing
In [2]: l = np.arange(20)
In [3]: l
Out[3]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19])
In [4]: l_3d = np.reshape(l, (2,2,5))
In [5]: l_3d
Out[5]:
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9]],
[[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]]])
In [6]: l_3d[1,1,4]
Out[6]: 19
In [7]: l[0:4] = 999
In [8]: l_3d[1,1,2:] = -1
In [9]: l_3d
Out[9]:
array([[[999, 999, 999, 999, 4],
[ 5, 6, 7, 8, 9]],
[[ 10, 11, 12, 13, 14],
[ 15, 16, -1, -1, -1]]])
(Start with a flat, 1D, array of 20 integers)
(Memory is not changed, but is reinterpreted as a 3D nested array of shape (2,2,5),
i.e., with 2 elements in the outer array, 2 elements in each first inner array, and 5 elements in each second inner array: 2x2x5=20)
(Why does changing l affect l_3d?
Note the difference in how the elements of each are referenced)
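Whether two arrays are views of the same memory can be tested with np.shares_memory. The sketch below uses the illustrative names flat, cube, and independent, and also shows that an explicit copy() breaks the link.

```python
import numpy as np

flat = np.arange(20)
cube = flat.reshape(2, 2, 5)     # a view: same memory, different shape
independent = cube.copy()        # an explicit copy: separate memory

print(np.shares_memory(flat, cube))
print(np.shares_memory(flat, independent))

# Writing through the flat array is visible through the view, not the copy.
flat[0] = 999
print(cube[0, 0, 0], independent[0, 0, 0])

# For this shape, element [i, j, k] of the view is element i*10 + j*5 + k
# of the flat array.
i, j, k = 1, 1, 4
same_element = cube[i, j, k] == flat[i * 10 + j * 5 + k]
```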
In [12]: data = np.random.randn(7,4)
In [13]: data
Out[13]:
array([[-0.96984204, -0.55792773, 0.65584348, -0.96020013],
[ 0.07280736, 0.610084 , -0.32043743, -0.36070071],
[ 1.48343014, 0.81954353, 1.40535631, 1.67215618],
[ 1.68529367, -1.19673775, -0.22360459, 1.71824879],
[-0.28271127, 0.4158064 , 0.98339965, 1.08078398],
[-0.81622001, 0.09710239, -1.87426313, -1.57414564],
[-0.22090031, 0.34779169, -1.4279908 , 0.4511331 ]])
In [14]: data[data < 0]
Out[14]:
array([-0.96984204, -0.55792773, -0.96020013, -0.32043743, -0.36070071,
-1.19673775, -0.22360459, -0.28271127, -0.81622001, -1.87426313,
-1.57414564, -0.22090031, -1.4279908 ])
In [15]: data[data < 0] = 0
In [16]: data
Out[16]:
array([[ 0. , 0. , 0.65584348, 0. ],
[ 0.07280736, 0.610084 , 0. , 0. ],
[ 1.48343014, 0.81954353, 1.40535631, 1.67215618],
[ 1.68529367, 0. , 0. , 1.71824879],
[ 0. , 0.4158064 , 0.98339965, 1.08078398],
[ 0. , 0.09710239, 0. , 0. ],
[ 0. , 0.34779169, 0. , 0.4511331 ]])
Items of an array may be indexed by Boolean tests
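A Boolean test on an array produces an array of True/False values (a mask), and masks can be counted and combined. The sketch below uses a small made-up array; note that & and | are used instead of and/or, with each test in parentheses.

```python
import numpy as np

data = np.array([[-1.5, 0.2, 3.0],
                 [0.7, -0.1, 2.5]])

# The test itself is an array of Booleans with the same shape as data.
negative = data < 0
n_negative = int(negative.sum())        # True counts as 1

# Combine tests element-wise with & (and) or | (or).
in_range = (data > 0) & (data < 1)
selected = data[in_range]               # 1D array of the matching values

# Assigning through a mask modifies only the matching elements.
clipped = data.copy()
clipped[negative] = 0.0

print(n_negative, selected)
```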
Self-quiz:
from random import random
dlist = [[random() for x in range(4)]
         for y in range(7)]
Consider the following code:
In [1]: r = np.linspace(-2,2,1000)
In [2]: x, y = np.meshgrid(r,r)
In [3]: z = x + y*1j
In [4]: (z)[0:2,0:2]
Out[4]:
array([[-2.000000-2.j , -1.995996-2.j ],
[-2.000000-1.995996j, -1.995996-1.995996j]])
In [5]: (z**2)[0:2,0:2]
Out[5]:
array([[ 0.00000000+8.j , -0.01599998+7.98398398j],
[ 0.01599998+7.98398398j, 0.00000000+7.96800003j]])
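To see what np.meshgrid builds, it can help to repeat the construction on a much coarser grid. The sketch below assumes 5 points per axis, so the arrays are small enough to print and inspect.

```python
import numpy as np

r = np.linspace(-2, 2, 5)        # -2, -1, 0, 1, 2

# x repeats r along each row; y repeats r down each column.
x, y = np.meshgrid(r, r)

# Each grid point becomes one complex number x + iy.
z = x + 1j * y

# ufuncs apply to the whole 2D complex array at once.
magnitude = np.abs(z)
inside = magnitude <= 2          # mask of points within radius 2

print(x[0], y[:, 0])
print(z[0, 0], z[2, 2])
```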
Self-quiz:
In [2]: a = 3*1j
In [3]: a
Out[3]: 3j
In [4]: a*a
Out[4]: (-9+0j)
In [5]: b = a + 1
In [6]: b
Out[6]: (1+3j)
In [7]: b*b
Out[7]: (-8+6j)
1j is Python's representation of the imaginary unit, so (1j)*(1j) == -1
For purely numeric arrays, numpy is sufficient.
Real world data contains labels and other descriptive information.
pandas.Series augments a numpy array with dictionary-like labeling.
In [1]: import pandas as pd
In [2]: s = pd.Series( [-1,0,5,4], index=['a','foo','banana','Charles'] )
In [3]: s
Out[3]:
a -1
foo 0
banana 5
Charles 4
dtype: int64
In [4]: s.index
Out[4]: Index([u'a', u'foo', u'banana', u'Charles'], dtype='object')
In [5]: s.values
Out[5]: array([-1, 0, 5, 4])
In [6]: s['banana']
Out[6]: 5
In [7]: np.exp(s)
Out[7]:
a 0.367879
foo 1.000000
banana 148.413159
Charles 54.598150
dtype: float64
(Important note:
a pandas Series acts like a numpy array under element-wise operations, with its indexing labels remaining unaffected)
(Note: each series must contain values of only a single data type, since it's an augmented numpy array)
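The two faces of a Series, dictionary-like and array-like, can be seen side by side in a short sketch that reuses the same Series as above.

```python
import numpy as np
import pandas as pd

s = pd.Series([-1, 0, 5, 4], index=['a', 'foo', 'banana', 'Charles'])

# Dictionary-like behavior: membership tests and lookup by label.
has_foo = 'foo' in s
as_dict = s.to_dict()

# Array-like behavior: Boolean masks and element-wise arithmetic,
# with the labels carried along.
positive = s[s > 0]
doubled = s * 2

print(positive)
print(doubled)
```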
In [1]: s = pd.Series( [-1,0,5,4], index=['a','foo','banana','Charles'] )
In [2]: d = pd.Series( s, index=['a','banana','Ronaldo'] )
In [3]: d
Out[3]:
a -1
banana 5
Ronaldo NaN
dtype: float64
In [4]: d.isnull()
Out[4]:
a False
banana False
Ronaldo True
dtype: bool
In [5]: d[d.isnull()]
Out[5]:
Ronaldo NaN
dtype: float64
In [6]: d[d.notnull()]
Out[6]:
a -1
banana 5
dtype: float64
In [7]: d + s
Out[7]:
Charles NaN
Ronaldo NaN
a -2
banana 10
foo NaN
dtype: float64
Data that doesn't exist is called "null" and has the special value NaN
Very often, filtering out null data values is a major part of a real world data processing task
NaN propagates through operations
(What does this do?
Why is it useful?)
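Beyond isnull and notnull, pandas offers a few common ways to deal with null values. The sketch below, reusing s and d from above, shows dropna, fillna, and the fill_value argument of Series.add; which one is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

s = pd.Series([-1, 0, 5, 4], index=['a', 'foo', 'banana', 'Charles'])
d = pd.Series(s, index=['a', 'banana', 'Ronaldo'])

# Drop the null entries entirely.
cleaned = d.dropna()

# Replace null entries with a chosen value.
zero_filled = d.fillna(0)

# Add with alignment, treating a label missing on ONE side as 0.
# A label that is null on both sides still gives NaN.
total = s.add(d, fill_value=0)

print(total)
```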
In [1]: data = pd.DataFrame({'time': np.linspace(0,10,1000), 'position': np.random.randn(1000)})
In [2]: data[0:10]
Out[2]:
position time
0 0.967507 0.00000
1 -0.732718 0.01001
2 0.678975 0.02002
3 -0.844301 0.03003
4 -0.202790 0.04004
5 -0.948616 0.05005
6 0.864689 0.06006
7 1.330334 0.07007
8 0.172459 0.08008
9 1.173954 0.09009
In [3]: data['position'][0:3]
Out[3]:
0 0.967507
1 -0.732718
2 0.678975
Name: position, dtype: float64
In [4]: data.time[0:3]
Out[4]:
0 0.00000
1 0.01001
2 0.02002
Name: time, dtype: float64
Tabular data (multiple columns, each an independent Series) is structured as a DataFrame
(Note: each column can have values of a different data type)
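A small sketch of a DataFrame whose columns have different types. The column names count, label, and rate are made up for illustration.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({
    'time': np.linspace(0, 1, 5),       # floats
    'count': np.arange(5),              # integers
    'label': ['a', 'b', 'c', 'd', 'e'], # strings
})

# Each column keeps its own data type.
print(table.dtypes)

# A single column is a Series, so element-wise arithmetic works on it,
# and the result can be stored as a new column.
column = table['time']
table['rate'] = table['count'] / (table['time'] + 1.0)

print(table)
```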
In [1]: s = pd.Series( [-1,0,5,4], index=['a','foo','banana','Charles'] )
In [2]: d = pd.Series( s, index=['a','banana','Ronaldo'] )
In [3]: f = pd.DataFrame( {"s":s, "d":d} )
In [4]: f
Out[4]:
d s
Charles NaN 4
Ronaldo NaN NaN
a -1 -1
banana 5 5
foo NaN 0
In [5]: f.T
Out[5]:
Charles Ronaldo a banana foo
d NaN NaN -1 5 NaN
s 4 NaN -1 5 0
In [6]: f.T**4
Out[6]:
Charles Ronaldo a banana foo
d NaN NaN 1 625 NaN
s 256 NaN 1 625 0
DataFrames are a powerful tool for organizing and manipulating big data
(Transposing swaps the rows and columns, so the series names become row labels and the index labels become column names.)
In [1]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   ...:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
   ...:                     columns=['one', 'two', 'three', 'four'])
In [2]: data['three'] = data['three']**2
In [3]: data
Out[3]:
one two three four
Ohio 0 1 4 3
Colorado 4 5 36 7
Utah 8 9 100 11
New York 12 13 196 15
In [4]: data % 2 == 0
Out[4]:
one two three four
Ohio True False True False
Colorado True False True False
Utah True False True False
New York True False True False
In [5]: data[data['two'] > 5]
Out[5]:
one two three four
Utah 8 9 100 11
New York 12 13 196 15
In [6]: data.loc[:'Utah', 'two']  # .ix has been removed from pandas; .loc selects by label
Out[6]:
Ohio 1
Colorado 5
Utah 9
Name: two, dtype: int64
DataFrames can be sliced, diced, and indexed in a variety of convenient ways
(Test these out yourself.)
(What does this do?)
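In current pandas, label-based selection uses .loc and position-based selection uses .iloc. The sketch below rebuilds the same table (without squaring column 'three') and selects the same cells both ways.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Label-based: the slice :'Utah' INCLUDES the row labeled 'Utah'.
by_label = data.loc[:'Utah', 'two']

# Position-based: the slice :3 EXCLUDES position 3, as in plain Python.
by_position = data.iloc[:3, 1]

# A Boolean mask for the rows combined with a list of column labels.
filtered = data.loc[data['two'] > 5, ['one', 'four']]

print(by_label)
print(filtered)
```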
In [1]: file_name = "http://www.ats.ucla.edu/stat/data/binary.csv"
In [2]: df = pd.read_csv(file_name)
In [3]: df.head()
Out[3]:
admit gre gpa rank
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4
In [4]: df.to_json('binary.json')
In [5]: df2 = pd.read_json('binary.json').sort_values(['gre','gpa'])  # DataFrame.sort has been removed from pandas; use sort_values
In [6]: df2.tail()
Out[6]:
admit gre gpa rank
2 1 800 4 1
10 0 800 4 4
33 1 800 4 3
77 1 800 4 3
377 1 800 4 2
Pandas provides very sophisticated ways to read and write various data formats
pd.read_clipboard pd.read_excel pd.read_gbq pd.read_html
pd.read_msgpack pd.read_sql pd.read_sql_table pd.read_table
pd.read_csv pd.read_fwf pd.read_hdf pd.read_json
pd.read_pickle pd.read_sql_query pd.read_stata
pd.DataFrame.to_clipboard pd.DataFrame.to_excel pd.DataFrame.to_json
pd.DataFrame.to_period pd.DataFrame.to_sql pd.DataFrame.to_wide
pd.DataFrame.to_csv pd.DataFrame.to_gbq pd.DataFrame.to_latex
pd.DataFrame.to_pickle pd.DataFrame.to_stata
pd.DataFrame.to_dense pd.DataFrame.to_hdf pd.DataFrame.to_msgpack
pd.DataFrame.to_records pd.DataFrame.to_string
pd.DataFrame.to_dict pd.DataFrame.to_html pd.DataFrame.to_panel
pd.DataFrame.to_sparse pd.DataFrame.to_timestamp
(Read from csv file on the web)
(Output to json file)
(Read in json file to new dataframe, sort and process)
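The same read/write round trip can be tried without a network connection or files on disk by using in-memory text buffers. The sketch below assumes only the five rows shown by df.head() above, and uses sort_values, the sorting method in current pandas.

```python
import io
import pandas as pd

# The first five rows of the admissions data shown above.
df = pd.DataFrame({
    'admit': [0, 1, 1, 1, 0],
    'gre': [380, 660, 800, 640, 520],
    'gpa': [3.61, 3.67, 4.00, 3.19, 2.93],
    'rank': [3, 3, 1, 4, 4],
})

# Write to CSV text and read it back (io.StringIO stands in for a file).
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# Write to JSON text and read it back.
json_text = df.to_json()
from_json = pd.read_json(io.StringIO(json_text))

# Sort by gre, breaking ties by gpa.
sorted_df = from_json.sort_values(['gre', 'gpa'])
print(sorted_df)
```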
Practice makes perfect.
Keep references handy until you remember commands on command.