Data Analysis and Geocoding with Python

Rohan Bidarkota

rbidarko@gmu.edu

Agenda for the workshop

Introduction to Pandas Data frames
Loading data into data frames from different file formats
Descriptive statistics
Slicing and Omitting
Introduction to Geopy and GIS libraries
Geocoding using Geopy's geocoders, Calculating Geodistance
Visualizing data
Writing the changed data to .csv

Support and Help

Resources Required

Python 3.7
Jupyter Notebook with appropriate Working Directory
Geopy package
Pandas package
ArcGIS API
Matplotlib

Introduction to Pandas Dataframe

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects.

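#Basic syntax of a pandas DataFrame
pandas.DataFrame(data, index, columns, dtype, copy)

where data holds the values (for example a dictionary, a list or a NumPy array), index and columns set the row and column labels, dtype forces a data type, and copy controls whether the input data is copied.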
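As a minimal sketch of the constructor in use (the column names and values below are made up for illustration, not taken from the workshop data), a DataFrame can be built from a dictionary of columns:

```python
import pandas as pd

# Hypothetical values, only to illustrate the constructor arguments.
data = {"Duration": [300, 720, 1500], "Member type": ["Member", "Casual", "Member"]}

# index labels the rows; columns selects and orders the dictionary keys.
trips = pd.DataFrame(data, index=["a", "b", "c"], columns=["Member type", "Duration"])

print(trips)
print(trips.dtypes)  # each column keeps its own type
```

Leaving out index gives the default row labels 0, 1, 2 and so on.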

Reading Data into Pandas Data frame

#importing packages
import pandas as pd
import numpy as np

#reading the CSV file into Pandas Dataframe
df=pd.read_csv("201812-capitalbikeshare-tripdata.csv")

Try this out. Print out the dataframe

Reading other types of files into Pandas Data frame


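#reading an Excel (.xlsx) file; sheet_name=0 selects the first sheet
#forward slashes in the path avoid backslash escape problems and work on every platform
df2=pd.read_excel("sample datasets/supermarkets.xlsx",sheet_name=0)

#reading a json file
df3=pd.read_json("sample datasets/supermarkets.json")

#reading a semicolon-separated txt file that has no header row
df4=pd.read_csv("sample datasets/supermarkets-semi-colons.txt",sep=";",header=None)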
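To see what sep and header=None do without needing the sample files, here is a small sketch that reads from an in-memory string instead of a file. The rows are made up for illustration.

```python
import io
import pandas as pd

# Made-up rows standing in for a semicolon-separated file without a header line.
raw = "1;3666 21st St;San Francisco\n2;735 Dolores St;San Francisco\n"

# header=None tells pandas the first line is data, so the columns are numbered 0, 1, 2.
stores = pd.read_csv(io.StringIO(raw), sep=";", header=None)
print(stores)
```

Without header=None, the first data row would be used as the column names.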

There is another .txt file in the sample datasets folder. Try reading it into a pandas data frame and store it in df5.

Setting Headers if there are no headers in the data file and exploring some dataframe functions

#setting column names in case the file has no header row
df4.columns=["ID", "Address","City", "State","Country", "Name","Employees"]
#using the Address column as the row index
df4=df4.set_index("Address")

#shape gives the order/dimensions of the dataframe as (no. of rows, no. of columns)
#ndim returns the dimensionality of the dataframe (always 2 for a DataFrame)
print(df4.shape)
print(df4.ndim)

The shape attribute gives you the order of the table/data frame as (rows, columns)

The ndim attribute gives you the dimensionality of the table
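A small sketch of the same steps on a made-up table (the addresses and numbers are hypothetical) shows how set_index changes the shape:

```python
import pandas as pd

# Hypothetical three-row table with no column names yet.
stores = pd.DataFrame([[1, "12 Oak St", 8], [2, "40 Elm St", 15], [3, "7 Pine St", 4]])
stores.columns = ["ID", "Address", "Employees"]
stores = stores.set_index("Address")  # Address moves out of the columns

print(stores.shape)  # (rows, columns) as a tuple
print(stores.ndim)   # number of axes
```

The index column is not counted in shape, so three named columns become two after set_index.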

Sampling the dataset and then print descriptive statistics

#Displaying Descriptive Stats
#sample(10) picks 10 random rows from the full dataframe
#Note: by default describe() covers only the columns with numerical values
sample_df=df.sample(10)
sample_df.describe()
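The sketch below uses a made-up four-row table to show which columns describe() covers. The random_state argument is an addition here, used only to make the sample repeatable.

```python
import pandas as pd

trips = pd.DataFrame({
    "Duration": [100, 200, 300, 400],
    "Member type": ["Member", "Casual", "Member", "Member"],
})

# random_state makes the sample repeatable between runs.
subset = trips.sample(2, random_state=0)

# describe() summarizes numeric columns by default;
# include="all" adds counts and most frequent values for text columns.
stats = trips.describe()
print(stats)
print(trips.describe(include="all"))
```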

Select Row/Column using names and indexes

loc function - works with names (labels)
iloc function - works with indexes (integer positions)

#Label indexing: select rows and columns by their index labels and column names
#A single column can also be selected with square brackets and its name as a string, e.g. data['column_name']
sample_df.loc[:,["Duration","Start station","End station"]]

#Positional indexing: select rows and columns by their integer positions, e.g. data.iloc[:, <column_number>]
#Note: positions start at 0 and end at n-1
#Here: every row except the last 3, and every column except the last 5
sample_df.iloc[:-3,:-5]
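One difference that is easy to miss: label slices in loc include the end label, while position slices in iloc exclude the end position. A minimal sketch on a made-up table:

```python
import pandas as pd

trips = pd.DataFrame(
    {"Duration": [100, 200, 300], "Start station": ["A", "B", "C"], "Bike": ["W1", "W2", "W3"]},
    index=["t1", "t2", "t3"],
)

by_label = trips.loc["t1":"t2", ["Duration", "Start station"]]  # end label t2 is included
by_position = trips.iloc[0:2, 0:2]                              # end position 2 is excluded

print(by_label.equals(by_position))
print(trips.iloc[:-1, :-1])  # negative bounds drop the last row and the last column
```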

Label Indexing and Positional Indexing

#To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]

#To select rows whose column value is in an iterable, some_values, use isin:

df.loc[df['column_name'].isin(some_values)]

#Combine multiple conditions with &:

df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
#Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=.
#Thus, the parentheses in the last example are necessary. Without the parentheses

df['column_name'] >= A & df['column_name'] <= B

#is parsed as

df['column_name'] >= (A & df['column_name']) <= B

#which is a chained comparison and raises an error instead of filtering the rows
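Here is a runnable sketch of the range filter on a made-up column. It also shows Series.between as an alternative that needs no &, and demonstrates that the version without parentheses fails.

```python
import pandas as pd

df_demo = pd.DataFrame({"Duration": [100, 200, 300, 400]})

# Each comparison is wrapped in parentheses before combining with &.
in_range = df_demo.loc[(df_demo["Duration"] >= 200) & (df_demo["Duration"] <= 300)]
print(in_range)

# between() expresses the same inclusive range without &.
same = df_demo.loc[df_demo["Duration"].between(200, 300)]

# Without parentheses the expression becomes a chained comparison and fails.
try:
    df_demo["Duration"] >= 200 & df_demo["Duration"] <= 300
    failed = False
except (ValueError, TypeError):
    failed = True
print(failed)
```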

Selecting and Omitting based on conditions

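#Deleting rows and columns
#drop returns a new dataframe; assign the result (e.g. df4=df4.drop(...)) to keep the change
#the axis must be passed by keyword in pandas 2.0 and later
df4.drop("City",axis=1) #Note axis=1 for columns and axis=0 for rows

#Deleting a particular address
df4.drop("332 Hill St",axis=0)

#Deleting particular records based on indexes
df4.drop(df4.index[-2:],axis=0)

#Similarly deleting columns based on indexes
df4.drop(df4.columns[-3:],axis=1)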
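The sketch below, on a made-up table, shows the columns= and index= keywords as a readable alternative to axis, and confirms that drop leaves the original data frame unchanged.

```python
import pandas as pd

stores = pd.DataFrame(
    {"City": ["X", "Y", "Z"], "Employees": [8, 15, 4]},
    index=["12 Oak St", "40 Elm St", "7 Pine St"],
)

no_city = stores.drop(columns="City")    # same as drop("City", axis=1)
no_oak = stores.drop(index="12 Oak St")  # same as drop("12 Oak St", axis=0)

# The original is unchanged until the result is assigned back.
print(stores.shape, no_city.shape, no_oak.shape)
```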

Deleting Rows / Columns

Geocoding

Geocoding converts location names and addresses into coordinates of actual places on Earth.
Done here using the Geopy and ArcGIS packages
Geocoders are the services that convert address text into a location
The text format for geocoding:
Street, City, State, <Zip> (the zip code may be optional)
Geocoding is a premium feature for most providers
Free ones have a request limit

Formatting the Column suitable for Geocoding

#Formatting the column content to make it easy for Geocoding
sample_df["s_address"]=sample_df["Start station"]+","+"Washington"+","+"District of Columbia"


sample_df["e_address"]=sample_df["End station"]+","+"Washington"+","+"District of Columbia"
sample_df

Importing all the necessary packages and functions

#importing the Geocode function
from arcgis.geocoding import geocode
type(geocode)

#importing Geopy's ArcGIS geocoder and creating a geocoder object
from geopy.geocoders import ArcGIS
nom = ArcGIS()
print(type(nom))

Let's get to geocoding every address in the dataframe

#Geocode the addresses in each column one by one, with a delay of at least 0.2 seconds between requests
from geopy.extra.rate_limiter import RateLimiter
geodelay=RateLimiter(nom.geocode,min_delay_seconds=0.2)
sample_df["s_location"]=sample_df["s_address"].apply(geodelay)
sample_df["e_location"]=sample_df["e_address"].apply(geodelay)
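To make the idea of rate limiting concrete, here is a simplified sketch of a wrapper that spaces out calls. It is not geopy's implementation (geopy's RateLimiter also retries failed calls, which this sketch omits), and fake_geocode is a hypothetical stand-in that makes no network request.

```python
import time

def rate_limited(func, min_delay_seconds):
    """Wrap func so consecutive calls are at least min_delay_seconds apart."""
    last_call = None

    def wrapper(*args, **kwargs):
        nonlocal last_call
        if last_call is not None:
            wait = min_delay_seconds - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        last_call = time.monotonic()
        return func(*args, **kwargs)

    return wrapper

# Hypothetical stand-in for a geocoder call; it only upper-cases the text.
def fake_geocode(address):
    return address.upper()

slow_geocode = rate_limited(fake_geocode, min_delay_seconds=0.05)
start = time.monotonic()
results = [slow_geocode(a) for a in ["a st", "b st", "c st"]]
elapsed = time.monotonic() - start
print(results, round(elapsed, 2))
```

Because the wrapper is an ordinary function of one address, it can be passed to apply just like the geocoder itself.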

The default geopy geocoder crashed at 15, so I didn't go past that. For the Google API: 2,500 free requests per day and 50 requests per second.

ArcGIS can do bulk geocoding with a premium account, and you can try batch geocoding with a Google subscription if you want.

Extracting only latitude and longitude from the geocoded address

#Create coordinate variables that store the latitude and longitude pair for every station
sample_df["s_coord"]=sample_df["s_location"].apply(lambda x:(x.latitude,x.longitude))
sample_df["e_coord"]=sample_df["e_location"].apply(lambda x:(x.latitude,x.longitude))
sample_df

We use a lambda function here to visit each element of the column and extract the latitude and longitude from the geocoded location.
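A geopy geocoder returns None when it finds no match, and a lambda that reads .latitude would then fail. The sketch below shows one way to guard against that. FakeLocation is a hypothetical stand-in for a geocoding result, and to_coord is a helper name chosen for this example.

```python
from collections import namedtuple
import pandas as pd

# FakeLocation is a hypothetical stand-in for a geocoding result.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])

locations = pd.Series([FakeLocation(38.9, -77.03), None], dtype=object)

def to_coord(loc):
    """Return (latitude, longitude), or None when geocoding found nothing."""
    return (loc.latitude, loc.longitude) if loc is not None else None

coords = locations.apply(to_coord)
print(coords.tolist())
```

Rows with a missing coordinate would still need to be dropped or fixed before any distance is calculated.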

Now we make station pairs

#Creating new variable with start station and end station coordinates pair to calculate Geodistance
sample_df['Station Pairs'] = list(zip(sample_df.s_coord, sample_df.e_coord))
sample_df['Station Pairs']

Now we pair up the start and end station coordinates so that we can calculate the geodesic distance between them.

Calculating Geodistance

#Calculating Geodesic distance
from geopy.distance import geodesic
sample_df["Inter station distance"]=[geodesic(x[0],x[1]).miles for x in sample_df['Station Pairs']]
sample_df

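A geodesic is the shortest route between two points on the Earth's surface. geopy computes it on an ellipsoidal model of the Earth (WGS-84 by default), and .miles returns the result in miles.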
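As a rough cross-check that needs no extra packages, the haversine formula gives the great-circle distance on a sphere. This is a different, simpler technique than geopy's ellipsoidal geodesic, so the two results differ slightly. The coordinate pairs below are made up, and haversine_km is a name chosen for this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; treating the Earth as a sphere is an approximation
KM_PER_MILE = 1.609344

def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

# Hypothetical coordinate pairs in the same (start, end) layout as "Station Pairs".
pairs = [((0.0, 0.0), (0.0, 1.0)), ((38.90, -77.03), (38.90, -77.03))]
miles = [haversine_km(start, end) / KM_PER_MILE for start, end in pairs]
print(miles)
```

A trip that starts and ends at the same station gives a distance of zero, which is worth remembering when reading the histogram of distances.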

Writing to a .csv file

#We write the following dataframe to a csv
sample_df.to_csv("Geocoded_bikeshare.csv")

df.to_csv to write to csv

df.to_json to write to json

for other delimited files:

df.to_csv('something.txt', header=True, index=False, sep='\t')
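The sketch below writes a made-up table to an in-memory buffer and reads it back, which is a quick way to check what to_csv produces without creating a file.

```python
import io
import pandas as pd

trips = pd.DataFrame({"Duration": [100, 200], "Start station": ["A", "B"]})

buffer = io.StringIO()             # in-memory stand-in for a file on disk
trips.to_csv(buffer, index=False)  # index=False leaves out the row labels

buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.equals(trips))
```

With the default index=True, the row labels are written as an extra first column, which shows up as an unnamed column when the file is read back.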

import matplotlib.pyplot as plt
#creating a histogram
sample_df.hist(column="Inter station distance")

#histogram for duration
sample_df.hist(column="Duration")

Data Visualisation and Plotting

We use the Matplotlib library for visualization

Here we are creating a histogram
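A histogram groups values into equal-width bins and counts how many fall into each. To look at the counts behind such a chart, numpy.histogram can be used directly. The values below are made up, and two bins are used only to keep the output short.

```python
import numpy as np

distances = [1, 2, 2, 3, 9]  # made-up values

# Two equal-width bins between the minimum and maximum value.
counts, edges = np.histogram(distances, bins=2)
print(counts, edges)
```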

#creating a bar chart
sample_df[["Inter station distance"]].plot(kind='bar')

#creating a bar with x and y
sample_df.plot.bar(x='Duration',y='Inter station distance')

#Creating a scatter plot
sample_df.plot.scatter(x='Inter station distance',y='Duration')

#Creating a box and whisker plot
#a box plot summarizes the distribution of a single column, so no x/y pair is needed
sample_df["Duration"].plot.box()

Other Plots

Plot coordinates on Map

#Visualizing the Map using the gis.map function
import arcgis
from arcgis.gis import GIS
gis=GIS()
map=gis.map('Washington, District of Columbia')

Creating a map Object

#Using the geocode function from ArcGIS so the results can be plotted on the ArcGIS map
#each result is a list of candidate matches for the address
sample_df["s_location"]=sample_df["s_address"].apply(geocode)
sample_df["e_location"]=sample_df["e_address"].apply(geocode)
sample_df

#We create a new list to make it easier for plotting(Start Stations)
#station[0] keeps the first candidate returned for each address
stations_s=[station[0] for station in sample_df["s_location"]]
stations_s

#We create a new list to make it easier for plotting(End Stations)
#station[0] keeps the first candidate returned for each address
stations_e=[station[0] for station in sample_df["e_location"]]
stations_e
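Taking station[0] fails if an address produced no candidates. The sketch below shows a more defensive selection. It assumes each geocoding result is a list of dictionaries with "score" and "location" keys. That shape is an assumption made for this illustration, the data is made up, and best_location is a hypothetical helper name.

```python
# Made-up candidate lists in the assumed shape: one list of dictionaries per address.
results = [
    [{"address": "A", "score": 98.5, "location": {"x": -77.03, "y": 38.90}},
     {"address": "A2", "score": 80.0, "location": {"x": -77.10, "y": 38.80}}],
    [],  # an address that produced no candidates
]

def best_location(candidates):
    """Return the location of the highest-scoring candidate, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["score"])["location"]

points = [best_location(c) for c in results]
print(points)
```

Entries that come back as None can then be skipped before drawing.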

Geocoding again to a map plottable format

#Drawing every station on map
for station in stations_s:
    map.draw(station["location"])
for station in stations_e:
    map.draw(station["location"])

Drawing coordinates onto the Map

Conclusion

To check all the variables you created, use the "%whos" magic command in the Jupyter Notebook

AND

"THANK YOU"

Additional Resources:
