Pandas DataFrame
Creating a DataFrame
Import the library
import pandas as pd
Create Dictionary
my_dict = {'vehicles': ['Toyota Camry', 'Ford Explorer', 'Ford F-150', 'Ford F-150', 'Ford Explorer', 'Toyota Prius', 'Toyota Tundra', 'Ford F-150'],
'countries': ['USA', 'Canada', 'Ireland', 'Argentina', 'Ireland', 'Ireland', 'England', 'Ireland'],
'sports': ['Basketball', 'Football', 'Soccer', 'Football', 'Hockley', 'Football', 'Football', 'Running']}
Create DataFrame
df = pd.DataFrame(my_dict)
Visualizing a DataFrame
Show first `n` Rows
df.head(5)
Show last `n` Rows
df.tail(5)
Visualizing a DataFrame
Frequency Table (Normalized)
df['countries'].value_counts(normalize=True)
Frequency Table (Counts)
df['countries'].value_counts()
Selecting Data From a DataFrame
Selecting a Row
df.loc[0]
Selecting a Column
df['countries']
Selecting Data From a DataFrame
Selecting Multiple Rows
df.loc[1:3]
Selecting Multiple Columns
df[['countries', 'sports']]
Selecting Multiple Rows & Columns from a DataFrame
Selecting Multiple Rows and Columns
df.loc[1:3, ['vehicles', 'sports']]
Check Your Understanding
What happens if we write:
new_df = df.loc[1:3, ['vehicle', 'sports']]
new_df.loc[0]
# Try `new_df.iloc[0]` after
Website:
Click to download this data
What Jumps out?
df = pd.read_excel(io='/content/business-analytics/datasets/Household_Debt_and_Credit.xlsx',
sheet_name='Page 3 Data',
skiprows=3)
Read In Data
The file path (String)
io
The excel sheet (String)
sheet_name
The number of rows to skip when reading in the data (Int)
skiprows
df.head()
Visualize the Dataset
df = df.rename(columns={'Unnamed: 0' : 'Period'})
Rename a Column
Original Name
New Name
df.columns
All Columns
import matplotlib.pyplot as plt
plt.plot(df['Period'], df['Mortgage'])
plt.show()
Time Line Plot
Check Your Understanding
Using a for loop, create the following figure
Scatter Plot
plt.scatter(df['Auto Loan'], df['Mortgage'])
plt.title('US Mortgages (Trillions)', loc='left')
plt.xlabel('Auto Loan Debt (Trillions)', size=14)
plt.show()
Correlation Coefficient
It’s a measure of how linearly related two variables are
Interpretation
Negative
Positive
Strong
Strong
Moderate
Moderate
Weak
Weak
Not
df['Auto Loan'].corr(df['Mortgage'])
Check Your Understanding
Using a for loop, create the following figure
Selecting Observation(s)
Single Observation
df.loc[5]
df.loc[5:8].T
Multiple Observations
df.loc[5:8, ['median_income', 'median_house_value']]
Multiple Observations & Rows