New York Taxi Cabs - F

Knowledge Discovery

Data Understanding

Yellow - 13587

Green - 6000

Summary statistics

Clip outliers outside 1.5 × IQR
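A minimal sketch of the 1.5 × IQR rule, assuming "clip" means capping values at the fences rather than dropping rows. The function name `clip_iqr` and the sample distances are hypothetical, and the quartile method is one of several reasonable choices.

```python
from statistics import quantiles

def clip_iqr(values, k=1.5):
    """Cap each value to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]

# Hypothetical trip distances in miles; 250.0 is an obvious outlier.
trip_distances = [1.0, 2.0, 3.0, 4.0, 5.0, 250.0]
clipped = clip_iqr(trip_distances)
```

If outliers should be removed instead of capped, the same fences can be used as a filter.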

Hypotheses

H1: Predict future congestion to optimize infrastructure spending.

H2: Determine routes optimized for ride sharing and find good shared pickup points.

Data Preparation

- Location ID clarification

- Date standardization

- Removal of unnecessary variables

- Remove outliers

- Combine green & yellow datasets

- Rename similar variables

- Check similarity between 2015 and 2016 data
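The renaming, date standardization, variable removal and combining steps could look roughly like the sketch below. The column names are assumptions: the yellow and green files are taken to use `tpep_` and `lpep_` prefixes for the same timestamp fields, and `PULocationID` / `DOLocationID` are assumed to be the location columns. The helpers `standardize` and `combine` are hypothetical.

```python
from datetime import datetime

# Assumed source column names, mapped to one shared name per field.
RENAMES = {
    "tpep_pickup_datetime": "pickup_datetime",
    "lpep_pickup_datetime": "pickup_datetime",
    "tpep_dropoff_datetime": "dropoff_datetime",
    "lpep_dropoff_datetime": "dropoff_datetime",
}
# Columns kept after removing unnecessary variables (an assumed selection).
KEEP = ("pickup_datetime", "dropoff_datetime",
        "PULocationID", "DOLocationID", "passenger_count")

def standardize(row, cab_type):
    """Rename columns, drop unused ones, and parse dates for one record."""
    renamed = {RENAMES.get(key, key): value for key, value in row.items()}
    out = {key: renamed[key] for key in KEEP}
    for col in ("pickup_datetime", "dropoff_datetime"):
        out[col] = datetime.strptime(out[col], "%Y-%m-%d %H:%M:%S")
    out["passenger_count"] = int(out["passenger_count"])
    out["cab_type"] = cab_type  # remember which dataset the row came from
    return out

def combine(yellow_rows, green_rows):
    """Return one list of records with a shared schema."""
    return ([standardize(r, "yellow") for r in yellow_rows]
            + [standardize(r, "green") for r in green_rows])

# Made-up sample records, for illustration only.
yellow = [{"tpep_pickup_datetime": "2016-01-04 08:15:00",
           "tpep_dropoff_datetime": "2016-01-04 08:30:00",
           "PULocationID": "161", "DOLocationID": "236",
           "passenger_count": "2", "VendorID": "1"}]
green = [{"lpep_pickup_datetime": "2016-01-05 17:00:00",
          "lpep_dropoff_datetime": "2016-01-05 17:20:00",
          "PULocationID": "74", "DOLocationID": "41",
          "passenger_count": "1", "VendorID": "2"}]
trips = combine(yellow, green)
```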

Modelling

H1: Identify highly congested areas during rush hours.
Forecast time series on these “hot spots” to predict the probability of future congestion.
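A rough sketch of the first step only, finding rush-hour hot spots. It assumes pickup counts per location are an acceptable proxy for congestion and that rush hours are weekdays 07:00-09:59 and 16:00-18:59. Both are assumptions, as are the function name and the sample data. The forecasting step is not covered here.

```python
from collections import Counter
from datetime import datetime

# Assumed rush-hour window: 07:00-09:59 and 16:00-18:59.
RUSH_HOURS = set(range(7, 10)) | set(range(16, 19))

def rush_hour_hot_spots(trips, top_n=3):
    """Count weekday rush-hour pickups per location and return the busiest.

    `trips` is an iterable of (pickup_datetime, location_id) pairs.
    """
    counts = Counter(
        location
        for pickup, location in trips
        if pickup.weekday() < 5 and pickup.hour in RUSH_HOURS
    )
    return counts.most_common(top_n)

# Made-up sample data: location IDs are placeholders.
sample_trips = [
    (datetime(2016, 1, 4, 8, 5), 161),    # Monday morning rush
    (datetime(2016, 1, 4, 8, 40), 161),   # Monday morning rush
    (datetime(2016, 1, 4, 17, 10), 237),  # Monday evening rush
    (datetime(2016, 1, 4, 13, 0), 161),   # midday, ignored
    (datetime(2016, 1, 2, 8, 0), 237),    # Saturday, ignored
]
hot_spots = rush_hour_hot_spots(sample_trips)
```

The resulting counts per location and hour would be the series to forecast.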

H2: K-means cluster analysis to locate common, convenient pickup/drop-off locations such that the walking distance from one location to another is less than 500 m.
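One way to sketch the clustering idea: run k-means on trip endpoints and increase k until every point lies within 500 m of its nearest cluster center. This reads "walking distance" as straight-line (haversine) distance from a trip endpoint to its nearest shared point, which is an assumption. The function names and sample coordinates are hypothetical.

```python
import math
import random

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means; centers are coordinate means (fine at city scale)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: haversine_m(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]  # keep the old center if a cluster is empty
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def smallest_k_within(points, max_walk_m=500, k_max=20):
    """Return (k, centers) for the first k where nobody walks too far."""
    for k in range(1, min(k_max, len(points)) + 1):
        centers = kmeans(points, k)
        worst = max(min(haversine_m(p, c) for c in centers) for p in points)
        if worst <= max_walk_m:
            return k, centers
    return None

# Made-up endpoints: two tight groups several kilometres apart.
points = [
    (40.7500, -73.9900), (40.7505, -73.9905), (40.7495, -73.9895),
    (40.8000, -73.9500), (40.8005, -73.9505), (40.7995, -73.9495),
]
k, centers = smallest_k_within(points)
```

Stepping through k like this is a simple stand-in; real street walking distances would be longer than the straight-line figure used here.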

Evaluation

H1:

- Compare future with current congestion (1) (ROC curve, precision/recall)

H2:

- A stability-based model would be used to identify how many pickup/drop-off locations will be needed.

- Total distance traveled by all taxis should be lower when taking into account ride sharing.

- Compare mean number of passengers
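For the ROC curve and precision/recall comparison, a minimal sketch follows. It assumes the congestion model outputs a probability per location and time slot and that observed congestion is available as 0/1 labels. The function names, the 0.5 threshold and the sample values are illustrative assumptions.

```python
def confusion(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, fn, tn) when predicting 1 for prob >= threshold."""
    tp = fp = fn = tn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall at one threshold (0.0 when undefined)."""
    tp, fp, fn, _ = confusion(y_true, y_prob, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_points(y_true, y_prob):
    """(false positive rate, true positive rate) pairs, one per threshold."""
    if len(set(y_true)) < 2:
        raise ValueError("ROC needs both congested and uncongested examples")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_prob), reverse=True):
        tp, fp, fn, tn = confusion(y_true, y_prob, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Made-up labels (1 = congested) and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
precision, recall = precision_recall(y_true, y_prob)
roc = roc_points(y_true, y_prob)
```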

Sources

(1) http://www.pnas.org/content/114/3/462.abstract

knowledgediscovery1

By laurenstc
