Data Manipulation

Buisness Analytics

Basics

Pandas DataFrame

Creating a DataFrame

Import the library

import pandas as pd

Create Dictionary

my_dict = {'vehicles': ['Toyota Camry', 'Ford Explorer', 'Ford F-150', 'Ford F-150', 'Ford Explorer', 'Toyota Prius', 'Toyota Tundra', 'Ford F-150'],
            'countries': ['USA', 'Canada', 'Ireland', 'Argentina', 'Ireland', 'Ireland', 'England', 'Ireland'],
            'sports': ['Basketball', 'Football', 'Soccer', 'Football', 'Hockley', 'Football', 'Football', 'Running']}

Create DataFrame

df = pd.DataFrame(my_dict)

Visualizing a DataFrame

Show first `n` Rows

df.head(5)

Show last `n` Rows

df.tail(5)

Visualizing a DataFrame

Frequency Table (Normalized)

df['countries'].value_counts(normalize=True)

Frequency Table (Counts)

df['countries'].value_counts()

Selecting Data From a DataFrame

Selecting a Row

df.loc[0]

Selecting a Column

df['countries']

Selecting Data From a DataFrame

Selecting Multiple Rows

df.loc[1:3]

Selecting Multiple Columns

df[['countries', 'sports']]

Selecting Multiple Rows & Columns from a DataFrame

Selecting Multiple Rows and Columns

df.loc[1:3, ['vehicles', 'sports']]

Check Your Understanding

What happens if we write: 

new_df = df.loc[1:3, ['vehicle', 'sports']]
new_df.loc[0]
# Try `new_df.iloc[0]` after

Real World Data

Website:

Click to download this data

What Jumps out?

  • There are multiple tabs each corresponding to a different figure/dataset
  • The first column doesn't have a name
  • The column names are specified in the fourth row
df = pd.read_excel(io='/content/business-analytics/datasets/Household_Debt_and_Credit.xlsx', 
                   sheet_name='Page 3 Data',
                   skiprows=3)

Read In Data

The file path (String)

io

The excel sheet (String)

sheet_name

The number of rows to skip when reading in the data (Int)

skiprows
df.head()

Visualize the Dataset

df = df.rename(columns={'Unnamed: 0' : 'Period'})

Rename a Column

Original Name

New Name

df.columns

All Columns

import matplotlib.pyplot as plt 
plt.plot(df['Period'], df['Mortgage'])
plt.show()

Time Line Plot

Check Your Understanding

Using a for loop, create the following figure

Scatter Plot

plt.scatter(df['Auto Loan'], df['Mortgage'])
plt.title('US Mortgages (Trillions)', loc='left')
plt.xlabel('Auto Loan Debt (Trillions)', size=14)
plt.show()

Correlation Coefficient

It’s a measure of how linearly related two variables are

r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}

Interpretation

-1
1
0
.1
-.1
-.3
-.7
.3
.7

Negative

Positive

Strong

Strong

Moderate

Moderate

Weak

Weak

Not

df['Auto Loan'].corr(df['Mortgage'])

Check Your Understanding

Using a for loop, create the following figure

Interpretation

The Basics

Selecting Observation(s)

Single Observation

df.loc[5]
df.loc[5:8].T

Multiple Observations

df.loc[5:8, ['median_income', 'median_house_value']]

Multiple Observations & Rows

Distribution

Business Analytics - Data Manipulation

By Patrick Power

Business Analytics - Data Manipulation

  • 81