pandas

Joel Ross

IMT 511

pandas:
Series

pandas

Python Data Analysis Library

# import the library
import pandas as pd  # standard shortcut
import numpy as np  # underlying framework

Series

Series are one-dimensional ordered collections of values (similar to a list). Values are also given indices as labels (similar to dictionary keys).

# create a Series from a list
number_series = pd.Series([1, 2, 2, 3, 5, 8])
print(number_series)


# create a Series from a dictionary
age_series = pd.Series({'sarah':42, 'amit':35, 'zhang':13})
print(age_series)
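As a minimal sketch of how values and labels relate, the example below builds a Series with explicit labels via the `index` parameter of `pd.Series` and then inspects both parts. The `heights` data is hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical example data: labels are passed explicitly via `index`
heights = pd.Series([64, 74, 69], index=['ada', 'bob', 'chris'])

print(list(heights.index))  # the labels: ['ada', 'bob', 'chris']
print(heights.tolist())     # the values: [64, 74, 69]

# Without explicit labels, a Series is labeled with positions 0, 1, 2, ...
default_labels = list(pd.Series([1, 2, 2]).index)
print(default_labels)       # [0, 1, 2]
```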

Series Operations

Basic operations on Series are applied pair-wise, matching elements with the same index label (unmatched elements give NaN).

s1 = pd.Series([3, 1, 4, 1, 5])
s2 = pd.Series([1, 6, 1, 8, 0])

s3 = s1 + s2  # Add the Series
s4 = s1 > s2  # Compare the Series

[Figure: s1 + s2 gives 4, 7, 5, 9, 5, and s1 > s2 gives True, False, True, False, True]

Series of booleans!
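The pair-wise matching described above goes by index label, not by position. As a small sketch with hypothetical data, the two Series below share only the labels 'y' and 'z', so every other label in the result is NaN.

```python
import pandas as pd

# Hypothetical data: only the labels 'y' and 'z' appear in both Series
a = pd.Series({'x': 1, 'y': 2, 'z': 3})
b = pd.Series({'y': 10, 'z': 20, 'w': 30})

total = a + b  # elements are matched by label
print(total)   # 'y' is 12.0 and 'z' is 23.0; 'x' and 'w' are NaN
```

Note that the matched values become floats, because NaN is a floating-point value.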

Broadcasting

If one operand is a scalar (a single value), that value is broadcast across the Series. 

sample = pd.Series([3,1,4,1,5])
result = sample + 4  # add 4 to each element
print(result)

[Figure: sample + 4 gives 7, 5, 8, 5, 9, and sample > 2 gives True, False, True, False, True]

Series of booleans!

Series Methods

Series provide many methods, which are called on the Series using dot notation.
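As a minimal sketch, here are a few commonly used Series methods applied to the number_series from earlier. The selection is illustrative, not a complete list.

```python
import pandas as pd

number_series = pd.Series([1, 2, 2, 3, 5, 8])

print(number_series.mean())            # 3.5 (average of the values)
print(number_series.max())             # 8 (largest value)
print(number_series.idxmax())          # 5 (index label of the largest value)
print(number_series.head(3).tolist())  # [1, 2, 2] (first three elements)
print(number_series.describe())        # a Series of summary statistics
```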

Accessing Series

Elements in a Series can be accessed via bracket notation using the index label.

number_series = pd.Series([1, 2, 2, 3, 5, 8])
age_series = pd.Series({'sarah':42, 'amit':35, 'zhang':13})

# get the element with index label 1 from number_series
number_series[1]  # 2

# get the 'amit' element from age_series
age_series['amit']  # 35

# get the element at position 0 from age_series
# (Series are ordered, so can also be accessed positionally with .iloc;
#  in recent pandas versions, a Series created from a dict
#  keeps the order in which the keys were written)
age_series.iloc[0]  # 42, because 'sarah' is the first key

Multiple Indices

We can also specify sequences (e.g., lists) of elements to access. This returns a new Series.

ages = pd.Series({'sarah':42, 'amit':35, 'zhang':13})

index_list = ['sarah', 'zhang']
print( ages[index_list] )

# using an anonymous variable for the index list 
# (notice the brackets!)
print( ages[['sarah', 'zhang']] )

Boolean Indexing

We can use a sequence of boolean values (True, False) as the index. This extracts every element whose position corresponds to True.

vowels = pd.Series(['a','e','i','o','u'])

# List of elements to extract
filter_indices = [True, False, False, True, True]

# Extract every element corresponding to True
list( vowels[filter_indices] )  # ['a', 'o', 'u']

[Figure: vowels[filter_indices] keeps 'a', 'o', and 'u', the elements paired with True]

When combined with relational operators, we can use this approach to filter Series elements by a criterion!

shoe_sizes = pd.Series([5.5, 11, 7, 8, 4])
small_sizes = shoe_sizes < 6  # True, False, False, False, True

small_shoes = shoe_sizes[small_sizes]  # has values 5.5, 4

# as one line: "shoe sizes, where shoe sizes is less than 6"
small_shoes = shoe_sizes[shoe_sizes < 6]

[Figure: shoe_sizes < 6 gives True, False, False, False, True, so shoe_sizes[shoe_sizes < 6] keeps 5.5 and 4]

pandas:
DataFrames

DataFrames

DataFrames are two-dimensional collections of values, organized into rows and columns (like a table). Think of a DataFrame as a dictionary of Series (each Series is a column).

[Figure: a DataFrame annotated with its column Series, row index labels, and column labels]

Creating DataFrames

DataFrames are usually created from a dictionary of columns (each value can be anything that can be turned into a Series).

name_series = pd.Series(['Ada','Bob','Chris','Diya','Emma'])
heights = [64, 74, 69, 69, 71]
weights = [135, 156, 139, 144, 152]

people_df = pd.DataFrame({'name': name_series, 
                          'height': heights, 
                          'weight': weights})
print(people_df)

DataFrame Operations

Basic operations on DataFrames are applied element-wise. If the other operand is a scalar, it produces a new DataFrame where each element is modified.

# data frame of test scores
test_scores = pd.DataFrame({
    'math':[91, 82, 93, 100, 78, 91],
    'spanish':[88, 79, 77, 99, 88, 93]
})

# Mathematical operators apply to each element in the data frame
curved_scores = test_scores * 1.02  # curve scores up by 2%
print(curved_scores)
    #      math  spanish
    # 0   92.82    89.76
    # 1   83.64    80.58
    # 2   94.86    78.54
    # 3  102.00   100.98
    # 4   79.56    89.76
    # 5   92.82    94.86

DataFrame Methods

DataFrames provide many of the same methods as Series

DataFrame Methods

DataFrame methods are usually applied per column:

  • If a Series version of the method would return a scalar, then the DataFrame version returns a Series whose index labels are the column labels.
     
  • If a Series version of the method would return a Series, then the DataFrame version returns a DataFrame whose columns are each of the resulting Series.
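The two cases above can be sketched with the test_scores data from the earlier slide: max() returns a scalar for a Series, while head() returns a Series. The choice of these two methods is only for illustration.

```python
import pandas as pd

test_scores = pd.DataFrame({
    'math': [91, 82, 93, 100, 78, 91],
    'spanish': [88, 79, 77, 99, 88, 93]
})

# Series.max() returns a scalar, so DataFrame.max() returns a Series
# whose index labels are the column labels
best = test_scores.max()
print(best['math'])     # 100
print(best['spanish'])  # 99

# Series.head() returns a Series, so DataFrame.head() returns a DataFrame
first_two = test_scores.head(2)
print(first_two)
```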

Accessing DataFrames

DataFrames are like a dictionary of columns, so you can access each column by its index label using bracket notation:

df = pd.DataFrame({
    'name':['Ada','Bob','Chris','Diya','Emma'],
    'height':[64, 74, 69, 69, 71],
    'weight':[135, 156, 139, 144, 152]
})

print( df['height'] )

You can also access each column as an attribute of the object using dot notation:

print( df.height )

Accessing DataFrames

It is possible to select multiple columns using a list of column labels:

# count the brackets carefully!
print( df[['name', 'height']] )  # get name and height cols

# supports boolean indexing on ROWS
print( df[ df.height > 60 ] )  # get ROWS with height > 60 

DataFrame Index Lookup

DataFrames provide two attributes, loc and iloc, which act as "look-up tables" for individual elements. Think of them as dictionaries with a variety of keys!
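As a minimal sketch of the difference: loc looks elements up by label, while iloc looks them up by position. The example reuses the people data from earlier and, as an assumption made only for illustration, uses set_index() so that the row labels are names rather than numbers.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ada', 'Bob', 'Chris', 'Diya', 'Emma'],
    'height': [64, 74, 69, 69, 71],
    'weight': [135, 156, 139, 144, 152]
})

# Use the names as row labels (illustrative choice)
by_name = df.set_index('name')

# loc: look up by [row label, column label]
bob_height = by_name.loc['Bob', 'height']       # 74

# iloc: look up by [row position, column position]
first_value = by_name.iloc[0, 0]                # 64 (Ada's height)

# Both accept slices; loc slices include the end label
subset = by_name.loc['Bob':'Diya', 'weight']    # Bob, Chris, Diya
print(subset.tolist())                          # [156, 139, 144]
```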

pandas:
Grouping

groupby() 

The groupby() method "separates" the rows of a DataFrame into groups. The rows in each grouping share the same value in a particular column.
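As a minimal sketch, the example below groups a small table of grades by section. The grades_df data and its column names ('section', 'midterm', 'final') are hypothetical, made up for illustration.

```python
import pandas as pd

# Hypothetical data: each row is one student's grades
grades_df = pd.DataFrame({
    'section': ['A', 'B', 'A', 'B', 'A'],
    'midterm': [80, 70, 90, 60, 85],
    'final':   [85, 75, 95, 70, 80]
})

# Rows that share the same 'section' value end up in the same group
by_section_groups = grades_df.groupby('section')

# Iterating yields (group value, DataFrame of that group's rows) pairs
for section, rows in by_section_groups:
    print(section, len(rows))  # prints "A 3", then "B 2"

# Get the rows of a single group
section_a = by_section_groups.get_group('A')
print(section_a)
```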

Aggregation Methods

Aggregation methods (such as max(), mean(), all(), etc.) are applied to each group. The result is a DataFrame with one row per group, holding that group's results.
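As a minimal sketch, calling mean() on a grouping computes one average per group and column. The grades_df data is hypothetical, made up for illustration.

```python
import pandas as pd

# Hypothetical data: each row is one student's grades
grades_df = pd.DataFrame({
    'section': ['A', 'B', 'A', 'B', 'A'],
    'midterm': [80, 70, 90, 60, 85],
    'final':   [85, 75, 95, 70, 80]
})

# One row per section; the group values become the row labels
avg_df = grades_df.groupby('section').mean()
print(avg_df)

print(avg_df.loc['A', 'midterm'])  # 85.0
print(avg_df.loc['B', 'final'])    # 72.5
```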

The agg() method

You can apply multiple aggregations to specific columns by using the agg() method.

# Apply multiple aggregation functions at once
# by passing in a list of function names (as strings)
range_stats_df = by_section_groups.agg(['min', 'mean', 'max'])

# Named aggregations: apply specific aggregations to specific columns
# The argument name is what will be given to the aggregate column
# each argument is a tuple: (column_name, aggregation_function)
custom_stats_df = by_section_groups.agg(
  avg_mid=('midterm', 'mean'),
  avg_final=('final', 'mean'),
  max_final=('final', 'max')
)
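The snippet above relies on a by_section_groups value that is not defined on this slide. As a runnable sketch, the version below assumes it comes from grouping a hypothetical grades_df by a 'section' column; the data is made up for illustration.

```python
import pandas as pd

# Hypothetical data: each row is one student's grades
grades_df = pd.DataFrame({
    'section': ['A', 'B', 'A', 'B', 'A'],
    'midterm': [80, 70, 90, 60, 85],
    'final':   [85, 75, 95, 70, 80]
})
by_section_groups = grades_df.groupby('section')  # assumed definition

# A list of function names gives one sub-column per (column, function) pair
range_stats_df = by_section_groups.agg(['min', 'mean', 'max'])
print(range_stats_df.loc['A', ('midterm', 'min')])  # 80

# Named aggregations give one plainly named column per argument
custom_stats_df = by_section_groups.agg(
    avg_mid=('midterm', 'mean'),
    avg_final=('final', 'mean'),
    max_final=('final', 'max')
)
print(list(custom_stats_df.columns))  # ['avg_mid', 'avg_final', 'max_final']
print(custom_stats_df.loc['B', 'max_final'])  # 75
```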
