Data Analysis and Geocoding with Python

Rohan Bidarkota

rbidarko@gmu.edu

 

Agenda for the workshop

  • Introduction to Pandas Data frames
  • Loading data into data frames from different file formats
  • Descriptive statistics
  • Slicing and Omitting
  • Introduction to Geopy and GIS libraries
  • Geocoding using Geopy's geocoders, Calculating Geodistance
  • Visualizing data
  • Writing the changed data to .csv

Support and Help

Resources Required

  • Python 3.7
  • Jupyter Notebook with appropriate Working Directory
  • Geopy package
  • Pandas package
  • ArcGIS API
  • Matplotlib

Introduction to Pandas Dataframe

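  • A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects.
#Basic syntax of the pandas DataFrame constructor
pandas.DataFrame(data, index, columns, dtype, copy)

where data is the input (for example a dictionary, a list of rows, or a NumPy array), index and columns are the row and column labels, dtype forces a data type, and copy controls whether the input data is copied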
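As a minimal sketch of the constructor, the example below builds a small data frame from a dictionary of columns. The names and values are made up for illustration and are not part of the workshop data.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column name
data = {"Name": ["Store A", "Store B", "Store C"], "Employees": [8, 15, 10]}

# index supplies the row labels; columns fixes the column order
stores = pd.DataFrame(data, index=["a", "b", "c"], columns=["Name", "Employees"])
print(stores)
```

Leaving out index gives the default row labels 0, 1, 2, and so on.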

Reading Data into Pandas Data frame

#importing packages
import pandas as pd
import numpy as np

#reading the CSV file into Pandas Dataframe
df=pd.read_csv("201812-capitalbikeshare-tripdata.csv")

Try this out. Print out the dataframe

Reading other types of files into Pandas Data frame


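#forward slashes in paths work on every platform and avoid backslash escape problems
#reading an Excel (.xlsx) file
df2=pd.read_excel("sample datasets/supermarkets.xlsx",sheet_name=0)

#reading a json file
df3=pd.read_json("sample datasets/supermarkets.json")

#reading a semicolon-delimited txt file that has no header row
df4=pd.read_csv("sample datasets/supermarkets-semi-colons.txt",sep=";",header=None)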

There is also a .txt file in the sample datasets folder. Try reading it into a pandas data frame and storing it in df5.

Setting Headers if there are no headers in the data file and exploring some dataframe functions

#setting the column names in case the file has no header row
df4.columns=["ID", "Address","City", "State","Country", "Name","Employees"]
df4=df4.set_index("Address")

#shape gives the dimensions of the dataframe as (number of rows, number of columns)
#ndim returns the dimensionality of the dataframe (always 2 for a DataFrame)
print(df4.shape)
print(df4.ndim)

The shape attribute gives you the size of the table/data frame as (rows, columns)

The ndim attribute gives you the number of dimensions of the table
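The sketch below repeats these steps on a tiny in-memory text, so it runs without the sample datasets folder. The file content and column names are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical file content: semicolon-separated, no header row
raw = io.StringIO("1;12 Main St;Springfield;3\n2;7 Oak Ave;Riverton;5\n")

demo = pd.read_csv(raw, sep=";", header=None)
demo.columns = ["ID", "Address", "City", "Employees"]  # assign column names
demo = demo.set_index("Address")                       # use Address as the row label

print(demo.shape)  # (rows, columns); the index is not counted as a column
print(demo.ndim)   # number of dimensions
```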

Sampling the dataset and then printing descriptive statistics

#Displaying Descriptive Stats
#Note: by default describe() only summarizes the columns with numerical values
sample_df=df.sample(10)
sample_df.describe()
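To see which columns describe() picks up, here is a small sketch on made-up trips with one numeric and one text column. The random_state argument is an addition for this sketch so that the sample is reproducible.

```python
import pandas as pd

# Hypothetical trips: one numeric column and one text column
trips = pd.DataFrame({"Duration": [300, 600, 900, 1200],
                      "Start station": ["A", "B", "A", "C"]})

# random_state fixes the random draw so the same rows come back every run
subset = trips.sample(3, random_state=0)

stats = trips.describe()  # only the numeric column "Duration" is summarized
print(stats)
```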

Select Row/Column using names and indexes

  • loc function - selects by label (row index labels and column names)
  • iloc function - selects by integer position
#To select rows and columns based on their labels, use loc (Label Indexing)
#A single column can also be selected with square brackets and its name as a string, e.g. data['column_name']
#For numeric indexing use the iloc selector instead, e.g. data.iloc[:, <column_number>]
sample_df.loc[:,["Duration","Start station","End station"]]

#To select rows and columns based on their integer positions, use iloc (Positional Indexing)
#Note: Positions start at 0 and end at n-1; negative positions count from the end
#Here: every row except the last 3, and every column except the last 5
sample_df.iloc[:-3,:-5]

Label Indexing and Positional Indexing
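The difference between the two selectors is easiest to see side by side. This sketch uses a made-up frame with text row labels; note that a label slice includes its end point while a position slice does not.

```python
import pandas as pd

# Hypothetical trips with text row labels
demo = pd.DataFrame({"Duration": [300, 600, 900],
                     "Start station": ["A", "B", "C"],
                     "End station": ["B", "C", "A"]},
                    index=["t1", "t2", "t3"])

by_label = demo.loc["t1":"t2", ["Duration", "End station"]]  # label slice includes "t2"
by_position = demo.iloc[0:2, [0, 2]]                          # position slice stops before 2

print(by_label.equals(by_position))
```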

#To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]

#To select rows whose column value is in an iterable, some_values, use isin:

df.loc[df['column_name'].isin(some_values)]

#Combine multiple conditions with &:

df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
#Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=.
#Thus, the parentheses in the last example are necessary. Without the parentheses,
#
#    df['column_name'] >= A & df['column_name'] <= B
#
#is parsed as
#
#    df['column_name'] >= (A & df['column_name']) <= B
#
#which typically raises an error (the truth value of a Series is ambiguous) instead of filtering the rows.

Selecting and Omitting based on conditions
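The three patterns can be tried on a small made-up frame. The "Member type" column and all values here are hypothetical, chosen only to make the filters easy to check by eye.

```python
import pandas as pd

demo = pd.DataFrame({"Duration": [300, 600, 900, 1200],
                     "Member type": ["Member", "Casual", "Member", "Casual"]})

# Rows whose column value equals a scalar
equal = demo.loc[demo["Member type"] == "Casual"]

# Rows whose column value is in an iterable
in_list = demo.loc[demo["Duration"].isin([300, 1200])]

# Rows inside a range; each comparison needs its own parentheses
in_range = demo.loc[(demo["Duration"] >= 600) & (demo["Duration"] <= 900)]

print(in_range)
```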

 
#Deleting rows and columns
#Note: axis=1 for columns and axis=0 for rows; drop returns a new dataframe and leaves df4 unchanged
df4.drop("City",axis=1)

#Deleting a particular address
df4.drop("332 Hill St",axis=0)

#Deleting particular records based on indexes
df4.drop(df4.index[-2:],axis=0)

#Similarly deleting columns based on indexes
df4.drop(df4.columns[-3:],axis=1)

Deleting Rows / Columns
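Because drop returns a new data frame, the result has to be stored if you want to keep it. A short sketch with invented addresses:

```python
import pandas as pd

# Hypothetical stores indexed by address
demo = pd.DataFrame({"City": ["X", "Y", "Z"], "Employees": [8, 15, 10]},
                    index=["12 Main St", "7 Oak Ave", "3 Elm Rd"])

no_city = demo.drop("City", axis=1)      # new frame without the City column
no_oak = demo.drop("7 Oak Ave", axis=0)  # new frame without that row

print(demo.shape)  # the original frame still has every row and column
```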

 

Geocoding

 
  • Converting location names or addresses to coordinates (latitude and longitude) on Earth.
  • Done here using the Geopy and ArcGIS packages
  • Geocoders are the services that convert address text into a location
  • The text format for Geocoding:
    Street, City, State, <Zip> (the zip code may be optional)
  • A premium (paid) feature with most providers
  • Free ones have a request limit

Formatting the Column suitable for Geocoding

#Formatting the column content to make it easy for Geocoding
sample_df["s_address"]=sample_df["Start station"]+","+"Washington"+","+"District of Columbia"


sample_df["e_address"]=sample_df["End station"]+","+"Washington"+","+"District of Columbia"
sample_df

Importing all the necessary packages and functions

#importing the Geocode function
from arcgis.geocoding import geocode
type(geocode)

#importing all the necessary packages for Geocoding, Geopy's arcgis geocoder and Geocode function
from geopy.geocoders import ArcGIS
nom = ArcGIS()
print(type(nom))

Let's get to geocoding every address in the dataframe

#Geocode the elements in the column one by one, with a minimum delay of 0.2 seconds between requests
from geopy.extra.rate_limiter import RateLimiter
geodelay=RateLimiter(nom.geocode,min_delay_seconds=0.2)
sample_df["s_location"]=sample_df["s_address"].apply(geodelay)
sample_df["e_location"]=sample_df["e_address"].apply(geodelay)

The default geopy geocoder crashed at 15, so I didn't go past that. For the Google API: 2500 free requests per day and 50 requests per second.

ArcGIS can do bulk geocoding with a premium account, and you can try batch geocoding with a Google subscription if you want.
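One way to stay under a request limit is to geocode each distinct address only once and reuse the result for repeated stations. The sketch below illustrates the idea with a hypothetical fake_geocode function standing in for a real geocoder; its coordinates are made up and no network request is sent.

```python
import pandas as pd

calls = []  # records which addresses were actually looked up

def fake_geocode(address):
    """Hypothetical stand-in for a real geocoder; returns made-up coordinates."""
    calls.append(address)
    return (38.9, -77.0)

demo = pd.DataFrame({"s_address": ["A St,Washington,District of Columbia",
                                   "B St,Washington,District of Columbia",
                                   "A St,Washington,District of Columbia"]})

# Geocode each distinct address once, then copy the result to every matching row
lookup = {addr: fake_geocode(addr) for addr in demo["s_address"].unique()}
demo["s_location"] = demo["s_address"].map(lookup.get)

print(len(calls))  # number of lookups actually made
```

In the workshop code, the rate-limited geodelay function would take the place of fake_geocode.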

 

 

Extracting only latitude and longitude from the geocoded address

#Create coordinate columns that store the latitude and longitude pair for every station
sample_df["s_coord"]=sample_df["s_location"].apply(lambda x:(x.latitude,x.longitude))
sample_df["e_coord"]=sample_df["e_location"].apply(lambda x:(x.latitude,x.longitude))
sample_df

We use a lambda function here to visit every element of the location column and extract the latitude and longitude of the geocoded address.
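A geocoder can return None when it finds no match, and the attribute lookup then fails. The sketch below shows one possible guard; the Location type is a hypothetical stand-in for a geocoder result, and the coordinates are invented.

```python
from collections import namedtuple
import pandas as pd

# Hypothetical stand-in for a geocoder result with latitude/longitude attributes
Location = namedtuple("Location", ["latitude", "longitude"])

def to_coord(location):
    """Return (latitude, longitude), or None when the address was not found."""
    if location is None:
        return None
    return (location.latitude, location.longitude)

demo = pd.DataFrame({"s_location": [Location(38.9, -77.03), None]})
demo["s_coord"] = demo["s_location"].apply(to_coord)

# Keep only the rows that were geocoded successfully
found = demo.dropna(subset=["s_coord"])
print(len(found))
```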

Now we make station pairs

#Creating new variable with start station and end station coordinates pair to calculate Geodistance
sample_df['Station Pairs'] = list(zip(sample_df.s_coord, sample_df.e_coord))
sample_df['Station Pairs']

Now we pair up the start station and end station coordinates so that we can calculate the geodesic distance between them.

Calculating Geodistance

#Calculating Geodesic distance
from geopy.distance import geodesic
sample_df["Inter station distance"]=[geodesic(x[0],x[1]).miles for x in sample_df['Station Pairs']]
sample_df

A geodesic is the shortest route between two points on the Earth's surface.
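For intuition about what such a distance calculation involves, here is a sketch of the haversine formula. This is a different and simpler technique than geopy's geodesic: it treats the Earth as a perfect sphere, so its results are only an approximation. The function name, the radius value and the coordinates are assumptions made for this example.

```python
import math

def great_circle_miles(coord_a, coord_b, radius_miles=3958.8):
    """Approximate distance in miles between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = (math.radians(v) for v in coord_a)
    lat2, lon2 = (math.radians(v) for v in coord_b)
    # Haversine formula on a sphere with the given mean radius
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_miles * math.asin(math.sqrt(h))

# Approximate coordinates of two points in Washington, DC (illustrative values)
point_a = (38.8977, -77.0365)
point_b = (38.8899, -77.0091)
print(great_circle_miles(point_a, point_b))
```

For short trips inside a city the two methods give nearly the same distance.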

Writing to a .csv file

#We write the following dataframe to a csv
sample_df.to_csv("Geocoded_bikeshare.csv")

Use df.to_csv to write to csv

Use df.to_json to write to json

For other delimited files:

df.to_csv('something.txt', header=True, index=False, sep='\t')
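To check what ends up in the file, you can write to an in-memory buffer and read it back. This sketch uses made-up data and a tab separator. With index=False the row labels are left out, which avoids an extra unnamed column when the file is read again.

```python
import io
import pandas as pd

demo = pd.DataFrame({"Duration": [300, 600], "Start station": ["A", "B"]})

buffer = io.StringIO()                      # stands in for a file on disk
demo.to_csv(buffer, index=False, sep="\t")  # header row is written by default
buffer.seek(0)

restored = pd.read_csv(buffer, sep="\t")
print(restored)
```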

import matplotlib.pyplot as plt
#creating a histogram
sample_df.hist(column="Inter station distance")

#histogram for duration
sample_df.hist(column="Duration")

Data Visualisation and Plotting

We use Matplotlib for visualization

Here we are creating a histogram

#creating a bar chart
sample_df[["Inter station distance"]].plot(kind='bar')

#creating a bar with x and y
sample_df.plot.bar(x='Duration',y='Inter station distance')

#Creating a scatter plot
sample_df.plot.scatter(x='Inter station distance',y='Duration')

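#Creating box and whisker plots, one for each column
#Note: a box plot summarizes one column at a time, so no x/y pair is passed
sample_df[["Inter station distance","Duration"]].plot.box(subplots=True)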

Other Plots
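The plotting calls return a Matplotlib Axes object, which can be used to add labels. The sketch below draws a histogram from invented durations. It selects the non-interactive "Agg" backend so that it also runs outside a notebook; inside Jupyter that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical trip durations
demo = pd.DataFrame({"Duration": [300, 450, 600, 900, 1200, 1500]})

ax = demo["Duration"].plot.hist(bins=3)  # returns the Axes the bars were drawn on
ax.set_xlabel("Duration")
ax.set_ylabel("Number of trips")
```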

Plot coordinates on Map

#Visualizing the Map using the gis.map function
import arcgis
from arcgis.gis import GIS
gis=GIS()
map=gis.map('Washington, District of Columbia')

Creating a map Object

#Using the geocode function from ArcGIS so that the results can be plotted on the ArcGIS map
sample_df["s_location"]=sample_df["s_address"].apply(geocode)
sample_df["e_location"]=sample_df["e_address"].apply(geocode)
sample_df

#We create a new list with the first match for each start station to make plotting easier
stations_s=[station[0] for station in sample_df["s_location"]]
stations_s

#We create a new list with the first match for each end station to make plotting easier
stations_e=[station[0] for station in sample_df["e_location"]]
stations_e

Geocoding again to a map plottable format
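Taking the first element fails when an address has no match and the result list is empty. The sketch below shows one way to skip such addresses. The results are hand-written stand-ins, and their shape (a list of dictionaries with a "location" entry) is an assumption based on how the workshop code reads them.

```python
# Hypothetical geocode results: one list of candidate dicts per address, possibly empty
results = [
    [{"address": "A St", "location": {"x": -77.03, "y": 38.90}}],
    [],  # nothing was found for this address
    [{"address": "B St", "location": {"x": -77.01, "y": 38.89}}],
]

# Keep the first candidate of every non-empty result
best_matches = [candidates[0] for candidates in results if candidates]

# The "location" entries are what gets drawn on the map
locations = [match["location"] for match in best_matches]
print(locations)
```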

#Drawing every station on map
for station in stations_s:
    map.draw(station["location"])
for station in stations_e:
    map.draw(station["location"])

Drawing coordinates onto the Map 

Conclusion

To check all the variables you created, use the "%whos" magic command in the notebook

AND

"THANK YOU"
