Ames Housing Data Insights
Jake Okinow, Dan Toledano and Simon Yates
June 2020
House Price Data Provides Insights to Multiple Groups
For a potential purchaser or realtor
Given a list of requirements, what should I expect to pay?
What neighborhoods give me the best value for my criteria?
How much could I save if I relaxed some of my requirements?
For a homeowner
What changes can I make to my home to maximally increase its value?
For the Ames department of Finance
What model best predicts house sale prices, to be used as a basis for property tax?
Is that model one which I could explain if my assessment was challenged in court?
The Dataset and Our Analysis
The Dataset
- 1,460 observations of 79 features, plus sale price
- Taken from home sales in Ames, Iowa between 2006 and 2010
Our Analysis
- First, we pre-processed the data to remove missing values and encode categorical variables
- Second, we conducted exploratory data analysis to understand the dataset
- As a result of this stage we opted to focus on log(Sale Price) and exclude houses > 4,000 sq ft
- Then, we trained and analysed the following models for predicting log(sale price) from the other variables:
Forward stepwise multilinear regression
Lasso and Ridge penalized linear regressions
Tree-based models: Random Forests and Gradient Boosting (including an experimental XGBoost implementation)
A Support Vector Machine Regression
- This presentation focuses on the business insights. Full technical details are provided in an appendix
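The filtering, encoding and log-transform steps listed above can be sketched roughly as follows. This is a minimal pure-Python illustration on invented records, not the project's code: the field names (`GrLivArea`, `CentralAir`, `SalePrice`) and the 0/1 encoding are assumptions, and missing-value handling is left out.

```python
import math

# Toy records; all values are invented for illustration only.
houses = [
    {"GrLivArea": 1710, "CentralAir": "Y", "SalePrice": 208500},
    {"GrLivArea": 4676, "CentralAir": "Y", "SalePrice": 184750},
    {"GrLivArea": 1262, "CentralAir": "N", "SalePrice": 181500},
]

MAX_SQFT = 4000  # size threshold quoted on the slide


def prepare(rows):
    """Drop very large houses, encode CentralAir as 0/1, add log price."""
    out = []
    for row in rows:
        if row["GrLivArea"] > MAX_SQFT:
            continue  # excluded as an outlier
        out.append({
            "GrLivArea": row["GrLivArea"],
            "CentralAir_Y": 1 if row["CentralAir"] == "Y" else 0,
            "LogSalePrice": math.log(row["SalePrice"]),
        })
    return out


prepared = prepare(houses)
print(len(prepared), "of", len(houses), "houses kept")
```

Modelling the log of the price means that errors are judged in relative rather than absolute terms, so a $10,000 miss on a cheap house counts for more than the same miss on an expensive one.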
Exploratory Data Analysis: Stationarity
Despite the financial crisis of 2008 occurring in the midst of the period covered by the data, house price per square foot remained remarkably constant.
Exploratory Data Analysis: Normality
A Q-Q plot revealed that the Sale Price data was not normally distributed. However, log(Sale Price) was much closer.
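A Q-Q comparison of this kind can be reduced to a single number: the correlation between the sorted data and the matching normal quantiles, which approaches 1 as the Q-Q plot approaches a straight line. The sketch below uses synthetic right-skewed prices as a stand-in for Sale Price; the data, the seed and the helper name `qq_correlation` are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def qq_correlation(sample):
    """Correlation between sorted data and theoretical normal quantiles.

    Values near 1 mean the Q-Q plot is close to a straight line.
    """
    n = len(sample)
    ordered = sorted(sample)
    # Plotting positions (i + 0.5) / n avoid the infinite 0 and 1 quantiles.
    theory = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx = sum(theory) / n
    my = sum(ordered) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(theory, ordered))
    sxx = sum((x - mx) ** 2 for x in theory)
    syy = sum((y - my) ** 2 for y in ordered)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)
# Synthetic log-normal "prices"; not the Ames data.
prices = [random.lognormvariate(12.0, 0.5) for _ in range(1000)]

r_raw = qq_correlation(prices)
r_log = qq_correlation([math.log(p) for p in prices])
print(f"raw: {r_raw:.4f}  log: {r_log:.4f}")
```

On skewed data like this the log-transformed sample gives the higher correlation, which is the same pattern the Q-Q plots showed.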
Forward Stepwise Linear Regression
AIC suggests limited value beyond variable number 12
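Forward stepwise selection by AIC can be sketched as below: start with an intercept-only model, repeatedly add whichever feature lowers AIC the most, and stop when no addition helps. This is an illustrative pure-Python version on synthetic data, not the implementation used for the slides; the helper names and the data are assumptions.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def aic(columns, y):
    """Gaussian AIC (up to a constant) of OLS on an intercept plus columns."""
    n = len(y)
    cols = [[1.0] * n] + columns
    A = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    c = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    beta = solve(A, c)
    rss = sum((y[i] - sum(b * col[i] for b, col in zip(beta, cols))) ** 2
              for i in range(n))
    return n * math.log(rss / n) + 2 * len(cols)


def forward_stepwise(features, y):
    """Greedily add the feature that lowers AIC most; stop when none does."""
    selected, best = [], aic([], y)
    while len(selected) < len(features):
        trials = {j: aic([features[k] for k in selected + [j]], y)
                  for j in range(len(features)) if j not in selected}
        j = min(trials, key=trials.get)
        if trials[j] >= best:
            break
        selected.append(j)
        best = trials[j]
    return selected


random.seed(1)
n = 200
features = [[random.gauss(0, 1) for _ in range(n)] for _ in range(6)]
# Only features 0, 1 and 2 drive the synthetic target.
y = [3 * features[0][i] + 2 * features[1][i] + features[2][i]
     + random.gauss(0, 0.5) for i in range(n)]

selected = forward_stepwise(features, y)
print("selection order:", selected)
```

The order in which features enter gives a natural importance ranking, and the point at which AIC stops improving is what a statement like "limited value beyond variable 12" refers to.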
OOS Model Scores Summary
- Out-of-sample R2 scores were very high. This dataset is almost too good!
- Automated parameter searches recommended very complex models (~76 features)
- This risks overfitting (although it is not evident here), and reduces interpretability
- We included a 'manual override' of Lasso to force a more parsimonious model
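One plausible reading of the 'manual override' is choosing a larger Lasso penalty by hand instead of accepting the value from the automated search. The sketch below illustrates the effect with a small coordinate-descent Lasso on synthetic data: a tiny penalty keeps most features, while a larger one keeps only the strongest. The penalty values, data and function names are assumptions, not the settings used in the project.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso(features, y, alpha, n_iter=200):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + alpha*||b||_1.

    Assumes centred y and centred columns, so no intercept is fitted.
    """
    n, p = len(y), len(features)
    beta = [0.0] * p
    resid = list(y)  # residual when every coefficient is zero
    norms = [sum(v * v for v in col) / n for col in features]
    for _ in range(n_iter):
        for j, col in enumerate(features):
            # Covariance of column j with the partial residual.
            rho = sum(col[i] * resid[i] for i in range(n)) / n + norms[j] * beta[j]
            new = soft_threshold(rho, alpha) / norms[j]
            if new != beta[j]:
                delta = new - beta[j]
                for i in range(n):
                    resid[i] -= delta * col[i]
                beta[j] = new
    return beta


def centre(values):
    """Subtract the mean so that no intercept term is needed."""
    m = sum(values) / len(values)
    return [v - m for v in values]


random.seed(2)
n = 200
features = [centre([random.gauss(0, 1) for _ in range(n)]) for _ in range(8)]
# Two strong drivers, one weak driver, five pure-noise features.
y = centre([3 * features[0][i] + 2 * features[1][i] + 0.3 * features[2][i]
            + random.gauss(0, 0.5) for i in range(n)])

kept_by_alpha = {}
for alpha in (0.001, 0.5):
    beta = lasso(features, y, alpha)
    kept_by_alpha[alpha] = [j for j, b in enumerate(beta) if b != 0.0]
    print(f"alpha={alpha}: kept features {kept_by_alpha[alpha]}")
```

The trade-off is the one described on the slide: the heavier penalty gives up a little fit in exchange for a model with far fewer terms to explain.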
Model Feature Selection
- There was widespread agreement among models on the most important features
- 16 features appeared in all of the models run
Results Analysis
- Measured by AIC or adjusted R2, the 12-feature forward-stepwise linear model performed the best
- The highest unadjusted R2 came from Gradient Boosting, followed by the very low alpha Lasso
- The linear models benefit from interpretability: Coefficients show the marginal value of a feature
- The lowest performing model was the Support Vector Machine
- This was also one of the more computationally intensive models, and one of the hardest to interpret
- Overall, linear models offered the best balance of accuracy and interpretability
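Adjusted R2 explains how a 12-feature model can rank above a model with a higher unadjusted R2: each extra feature is charged a penalty. The snippet below shows the standard formula. The two R2 values are hypothetical, chosen only to illustrate the effect, and are not results from this analysis.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R2: penalises R2 for the number of fitted features."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)


N_OBS = 1460  # dataset size quoted in the slides

# Hypothetical R2 values for a small and a large model (illustration only).
small = adjusted_r2(0.900, N_OBS, 12)
large = adjusted_r2(0.903, N_OBS, 76)
print(f"12 features: {small:.4f}   76 features: {large:.4f}")
```

With these made-up inputs the smaller model comes out ahead after adjustment even though its raw R2 is lower.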
Owner Action Step 1: Improve Quality Score
- The feature 'Overall Quality' (a 10-point scale) dominates the prediction with a univariate R2 of 68%
- The multivariate regression coefficient implies that a 1-point increase in score corresponds to a 28% increase in sale price
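Because the models predict log(Sale Price), a coefficient acts multiplicatively on price. The helper below converts a log-price coefficient into a percentage change. The slides do not say whether 28% is the exact effect or the raw coefficient read as a percentage, so both readings are shown; the function name is an assumption.

```python
import math


def pct_change(beta, delta=1.0):
    """Percentage change in price when a feature changes by `delta`,
    given its coefficient `beta` in a model of log(price)."""
    return (math.exp(beta * delta) - 1) * 100


# Reading 1: 28% is the exact effect, so the coefficient is ln(1.28).
beta_exact = math.log(1.28)
print(f"coefficient {beta_exact:.4f} -> {pct_change(beta_exact):.1f}% per point")

# Reading 2: the coefficient itself is 0.28, so the exact effect is larger.
print(f"coefficient 0.28 -> {pct_change(0.28):.1f}% per point")

# Effects compound: two quality points under reading 1.
print(f"two points -> {pct_change(beta_exact, 2):.2f}%")
```

Either way, the effect compounds rather than adds, so two quality points are worth more than twice the one-point percentage.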
Buyer Action 1: Is Neighborhood Worth It?
Buyer Action 2: Find Large Lots
Above Grade Square Footage
- The next most important driver of price is square footage above ground level
- Houses that are small compared to their lot size may be profitable to extend
- Time since last remodel date also features as a driver of sale price
Average cost of construction in Iowa: $103 / sq ft
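Whether an extension pays off comes down to comparing the predicted value added with the build cost. The sketch below does this for a log-price model. Only the $103 / sq ft figure comes from the slide; the per-square-foot coefficient and the house prices are hypothetical placeholders, so the printed numbers are illustrative only.

```python
import math

COST_PER_SQFT = 103.0  # average Iowa construction cost quoted on the slide


def extension_gain(price, beta_per_sqft, added_sqft,
                   cost_per_sqft=COST_PER_SQFT):
    """Predicted value added by an extension, minus the cost of building it.

    `beta_per_sqft` is the coefficient on square footage in a model of
    log(price), so the value added scales with the current price.
    """
    value_added = price * (math.exp(beta_per_sqft * added_sqft) - 1)
    return value_added - cost_per_sqft * added_sqft


BETA = 0.0003  # hypothetical coefficient, NOT taken from the fitted models

for price in (200_000, 400_000):
    gain = extension_gain(price, BETA, added_sqft=300)
    print(f"${price:,} house, +300 sq ft: net gain ${gain:,.0f}")
```

Under a log-price model the same extension adds more dollars to a more expensive house, so the break-even point depends on the starting price as well as the lot.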
Owner Action Steps 2 - 7
- Multivariate regression coefficients predict the value add from the following actions:
- Increase garage capacity
- Install central a/c
- Finish the basement and add a full bathroom
- Fireplaces are a no-brainer
- Consider a screen porch and/or wood deck
Challenges and Ideas for Future Work
- One challenge was that the dataset contained only 1,460 observations
- This is not particularly high given the richness of the 79 features
- Additional observations covering a longer time period would increase confidence in lower-ranked features
- We experimented with the XGBoost model, but need to delve much more deeply into it to be confident
- It would be interesting to consider other cities to see if the Ames data generalizes nationally
Summary Conclusions
The dataset and models applied were able to provide solid insights for multiple groups
For a potential purchaser or realtor
- Regression coefficients indicate how expensive a particular criterion will be
- Which neighborhoods offer the best development opportunities
For a homeowner
- 7 action items to increase home value
For the Ames department of Finance
- A simple linear model with 12 inputs that could form the basis for an accurate property tax assessment
- A model with high interpretability which will be feasible to defend if challenged
- A variety of modelling approaches validating the conclusions reached by the first
Questions?