Data Analysis and Geocoding with Python
Rohan Bidarkota
Agenda for the workshop
- Introduction to Pandas Data frames
- Loading data into data frames from different file formats
- Descriptive statistics
- Slicing and Omitting
- Introduction to Geopy and GIS libraries
- Geocoding using Geopy's geocoders, Calculating Geodistance
- Visualizing data
- Writing the changed data to .csv
Support and Help
Resources Required
- Python 3.7
- Jupyter Notebook with appropriate Working Directory
- Geopy package
- Pandas package
- ArcGIS API
- Matplotlib
Introduction to Pandas Dataframe
- DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects.
#Basic Syntax of a pandas dataframe
pandas.DataFrame( data, index, columns, dtype, copy)
where data is the data variable
Reading Data into Pandas Data frame
#importing packages
import pandas as pd
import numpy as np
#reading the CSV file into Pandas Dataframe
df=pd.read_csv("201812-capitalbikeshare-tripdata.csv")
Try this out. Print out the dataframe
Reading other types of files into Pandas Data frame
#reading an xls file
df2=pd.read_excel("sample datasets\supermarkets.xlsx",sheet_name=0)
#reading a json file
df3=pd.read_json("sample datasets\supermarkets.json")
#reading txt files
df4=pd.read_csv("sample datasets\supermarkets-semi-colons.txt",sep=";",header=None)
Now there is a .txt file in the sample datasets folder. Try to read that out into a pandas data frame and store it in df5.
Setting Headers if there are no headers in the data file and exploring some dataframe functions
#setting header rows incase there are no headers
df4.columns=["ID", "Address","City", "State","Country", "Name","Employees"]
df4=df4.set_index("Address")
#shape functions give you the order/dimension of the dataframe (no.of rows,no.of columns)
#ndim returns the demsionality of the dataframe (Eg: 2 dimension, 3 dimension, etc)
print(df4.shape)
print(df4.ndim)
Shape function gives you the order of the table/data frame
Dimension gives you the dimensionality of the table
Sampling the dataset and then print descriptive statistics
#Displaying Descriptive Stats
#Note: It works on the columns with numerical values
sample_df=df.sample(10)
sample_df.describe()
Select Row/Column using names and indexes
- loc function- works with names
- iloc function - works with indexes
#If we want to select the rows and columns based on the Index labels
#using square braces and the name of the column as a string, e.g. data['column_name']
#for using numeric indexing and the iloc selector data.iloc[:, <column_number>](Label Indexing)
sample_df.loc[:,["Duration","Start station","End station"]]
#If we want to select the rows and columns based on the index numbers of the dataframe
#Note: The indexes start with 0 and end with n-1(Positional Indexing)
sample_df.iloc[:-3,:-5]
Label Indexing and Positional Indexing
#To select rows whose column value equals a scalar, some_value, use ==:
df.loc[df['column_name'] == some_value]
#To select rows whose column value is in an iterable, some_values, use isin:
df.loc[df['column_name'].isin(some_values)]
#Combine multiple conditions with &:
df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
#Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example #are necessary. Without the parentheses
df['column_name'] >= A & df['column_name'] <= B
#is parsed as
df['column_name'] >= (A & df['column_name']) <= B
Selecting and Omitting based on conditions
#Deleting the rows and columns
df4.drop("City",1) #Note 1 for columns and 0 for rows
#Deleting a particular address
df4.drop("332 Hill St",0)
#Deleting particular records based on indexes
df4.drop(df4.index[-2:],0)
#Similarly deleting columns based on indexes
df4.drop(df4.columns[-3:],1)
Deleting Rows / Columns
Geocoding
- Converting location names to actual places on Earth.
- Done using Geopandas and ArcGIS packages
- Geocoders are the unit of codes which convert the text to the address entity
- The text format for Geocoding:
Street,City, State, <Zip> may be optional - Premium Feature for most
- Free ones have a limit
Formatting the Column suitable for Geocoding
#Formatting the column content to make it easy for Geocoding
sample_df["s_address"]=sample_df["Start station"]+","+"Washington"+","+"District of Columbia"
sample_df["e_address"]=sample_df["End station"]+","+"Washington"+","+"District of Columbia"
sample_df
Importing all the necessary packages and functions
#importing the Geocode function
from arcgis.geocoding import geocode
type(geocode)
#importing all the necessary packages for Geocoding, Geopy's arcgis geocoder and Geocode function
from geopy.geocoders import ArcGIS
nom = ArcGIS()
print(type(nom))
Lets get to geocoding every address in the dataframe
#Start geocoding the elements in the column one by one with a delay of 0.2 seconds
from geopy.extra.rate_limiter import RateLimiter
geodelay=RateLimiter(nom.geocode,min_delay_seconds=0.2)
sample_df["s_location"]=sample_df["s_address"].apply(geodelay)
sample_df["e_location"]=sample_df["e_address"].apply(geodelay)
The default geopy geocoder crashed at 15, so I didn't go past that. For google API: 2500 free requests per day and 50 requests per second
ArcGIS can do a bulk geocoding with a premium account and you can try batch geocoding if you want with google subscription.
Extracting only latitude and longitude from the geocoded address
#Create coordinate variables that stores latitude and longitude pair for every station
sample_df["s_coord"]=sample_df["s_location"].apply(lambda x:(x.latitude,x.longitude))
sample_df["e_coord"]=sample_df["e_location"].apply(lambda x:(x.latitude,x.longitude))
sample_df
We use Lambda function here to refer to each and every column element and extract the latitude and longitude part of the address.
Now we make station pairs
#Creating new variable with start station and end station coordinates pair to calculate Geodistance
sample_df['Station Pairs'] = list(zip(sample_df.s_coord, sample_df.e_coord))
sample_df['Station Pairs']
Now we pair up the start station and end station coordinates suitable to calculate the geodesic distance between them.
Calculating Geodistance
#Calculating Geodesic distance
from geopy.distance import geodesic
sample_df["Inter station distance"]=[geodesic(x[0],x[1]).miles for x in sample_df['Station Pairs']]
sample_df
A geodesic is the shortest route between two points on the Earth's surface.
Writing to a .csv file
#We write the following dataframe to a csv
sample_df.to_csv("Geocoded_bikeshare.csv")
df.to_csv to write to csv
df.to_json to write to json
for other delimited files:
df.to_csv('something.txt', header=True, index=False, sep='\t')
import matplotlib.pyplot as plt
#creating a histogram
sample_df.hist(column="Inter station distance")
#histogram for duration
sample_df.hist(column="Duration")
Data Visualisation and Plotting
We use the matplot lib for visualization
Here we are creating a histogram
#creating a bar chart
sample_df[["Inter station distance"]].plot(kind='bar')
#creating a bar with x and y
sample_df.plot.bar(x='Duration',y='Inter station distance')
#Creating a scatter plot
sample_df.plot.scatter(x='Inter station distance',y='Duration')
#Creating a box and whisker plot
sample_df.plot.box(x='Inter station distance',y='Duration')
Other Plots
Plot coordinates on Map
#Visualizing the Map using the gis.map function
import arcgis
from arcgis.gis import GIS
gis=GIS()
map=gis.map('Washington, District of Columbia')
Creating a map Object
#Using geocoding function from ArcGIS to plot it on the ArcMap
sample_df["s_location"]=sample_df["s_address"].apply(geocode)
sample_df["e_location"]=sample_df["e_address"].apply(geocode)
sample_df
#We create a new list to make it easier for plotting(Start Stations)
stations=list(sample_df["s_location"])
stations_s=[]
for station in stations:
stations_s.append(station[0])
stations_s
#We create a new list to make it easier for plotting(End Stations)
stations=list(sample_df["e_location"])
stations_e=[]
for station in stations:
stations_e.append(station[0])
stations_e
Geocoding again to a map plottable format
#Drawing every station on map
for station in stations_s:
map.draw(station["location"])
for station in stations_e:
map.draw(station["location"])
Drawing coordinates onto the Map
Conclusion
To check all the variables you created use "%whos"
AND
"THANK YOU"
Additional Resources:
Data Analysis and Geocoding with Python
By Rohan Bidarkota
Data Analysis and Geocoding with Python
An introduction to Data analysis, Geocoding and Data Visualization using Python.
- 2,407