Jake Okinow, Dan Toledano and Simon Yates
June 2020
For a potential purchaser or realtor
Given a list of requirements, what should I expect to pay?
What neighborhoods give me the best value for my criteria?
How much could I save if I relaxed some of my requirements?
For a homeowner
What changes can I make to my home to maximally increase its value?
For the Ames department of Finance
What model best predicts house sale prices, to be used as a basis for property tax?
Could I explain that model if an assessment were challenged in court?
The Dataset
- 1,460 observations of 79 features, plus sale price
- Taken from home sales in Ames, Iowa between 2006 and 2010
Our Analysis
- First, we pre-processed the data to handle missing values and encode categorical variables
- Second, we conducted exploratory data analysis to understand the dataset
- As a result of this stage we opted to model log(Sale Price) and to exclude houses larger than 4,000 sq ft
- Then, we trained and analysed the following models for predicting log(Sale Price) from the other variables:
  - Forward stepwise multiple linear regression
  - Lasso and Ridge penalized linear regressions
  - Tree-based models: Random Forests and Gradient Boosting (including an experimental XGBoost implementation)
  - Support Vector Regression
- This presentation focuses on the business insights. Full technical details are provided in an appendix
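The pre-processing steps listed above can be sketched as follows. This is a minimal illustration, not the team's actual pipeline: the column names (GrLivArea, GarageCars, CentralAir, SalePrice), the median and "Missing" fill rules, and the toy rows are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def preprocess(df, target="SalePrice", area="GrLivArea", max_area=4000.0):
    """Return (X, y) with y = log(target). Column names are assumed, not confirmed."""
    df = df[df[area] <= max_area].copy()      # exclude very large houses
    y = np.log(df.pop(target))                # model log(Sale Price)
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.columns.difference(num_cols)
    # Assumed imputation rules: median for numeric, a "Missing" level for categorical
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    df[cat_cols] = df[cat_cols].fillna("Missing")
    # One-hot encode categorical variables, dropping one level per variable
    X = pd.get_dummies(df, columns=list(cat_cols), drop_first=True)
    return X, y

# Toy rows, invented purely for illustration
toy = pd.DataFrame({
    "GrLivArea": [1500, 2100, 4500, 1200],
    "GarageCars": [2, np.nan, 3, 1],
    "CentralAir": ["Y", "N", "Y", None],
    "SalePrice": [200000, 250000, 180000, 120000],
})
X, y = preprocess(toy)
print(X.shape, y.round(3).tolist())
```

The 4,500 sq ft row is dropped by the size filter, and the remaining gaps are filled before encoding so that no model sees a missing value.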
Despite the financial crisis of 2008 occurring in the midst of the period covered by the data, house price per square foot remained remarkably constant
A Q-Q plot revealed that the Sale Price data was not normally distributed. However, log(Sale Price) was much closer to normal.
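One way to quantify what the Q-Q plot shows is the correlation between the sample quantiles and the theoretical normal quantiles, which scipy.stats.probplot returns alongside the plot coordinates. The sketch below uses synthetic, right-skewed prices (an assumption, not the Ames data) to compare the raw and log-transformed values.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "prices", for illustration only
rng = np.random.default_rng(0)
prices = np.exp(rng.normal(loc=12.0, scale=0.4, size=1460))

def qq_correlation(values):
    """Correlation of the normal Q-Q plot; values near 1 indicate near-normality."""
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return r

r_raw = qq_correlation(prices)
r_log = qq_correlation(np.log(prices))
print(f"Q-Q correlation, raw: {r_raw:.4f}  log: {r_log:.4f}")
```

A higher correlation for the log values is the numerical counterpart of the points lying closer to the straight line in the plot.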
AIC suggests there is limited value in adding variables beyond the 12th
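Forward stepwise selection by AIC can be sketched with ordinary least squares alone. The version below is a simplified illustration on synthetic data, not the implementation used for the results above; it uses the Gaussian AIC up to an additive constant, n·ln(RSS/n) + 2k, and stops as soon as no candidate variable lowers it.

```python
import numpy as np

def aic_ols(X, y):
    """Gaussian AIC (up to a constant) of an OLS fit with an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(rss / n) + 2 * A.shape[1]

def forward_stepwise(X, y):
    """Greedily add the feature that lowers AIC most; stop when none does."""
    remaining = list(range(X.shape[1]))
    selected = []
    best_aic = aic_ols(np.empty((len(y), 0)), y)   # intercept-only model
    while remaining:
        aic, j = min((aic_ols(X[:, selected + [k]], y), k) for k in remaining)
        if aic >= best_aic:
            break
        selected.append(j)
        remaining.remove(j)
        best_aic = aic
    return selected, best_aic

# Synthetic data: only features 0, 3 and 5 carry signal
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = 3.0 * X[:, 0] + 2.0 * X[:, 3] - 1.5 * X[:, 5] + rng.normal(scale=0.5, size=300)
selected, best_aic = forward_stepwise(X, y)
print("selected feature indices:", selected)
```

Recording the AIC at each step gives the curve that flattens out once the informative variables have been added.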
- Out-of-sample R2 scores were very high. This dataset is almost too good!
- Automated parameter searches recommended very complex models (~76 features)
- This risks overfitting (although it is not evident here), and reduces interpretability
- We included a 'manual override' of Lasso to force a more parsimonious model
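The 'manual override' idea can be illustrated with scikit-learn: let cross-validation choose the penalty, then refit with a deliberately larger one to keep fewer features. This is a sketch on synthetic data; the override value of 0.2 and the data-generating coefficients are assumptions chosen for the example, not the settings used in the analysis.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

# Synthetic data: 5 strong features and 25 weak ones
rng = np.random.default_rng(0)
n, p = 400, 30
X = rng.normal(size=(n, p))
true_coef = np.full(p, 0.05)
true_coef[:5] = 1.0
y = X @ true_coef + rng.normal(scale=0.5, size=n)

# Automated search: cross-validation picks the penalty strength
auto = LassoCV(cv=5, random_state=0).fit(X, y)

# Manual override: a larger, hand-picked penalty (hypothetical value)
manual_alpha = 0.2
manual = Lasso(alpha=manual_alpha).fit(X, y)

n_auto = int(np.sum(auto.coef_ != 0))
n_manual = int(np.sum(manual.coef_ != 0))
print(f"CV alpha={auto.alpha_:.4f} keeps {n_auto} features; "
      f"manual alpha={manual_alpha} keeps {n_manual}")
```

The trade-off is a small loss in fit in exchange for a model with far fewer coefficients to explain.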
- There was widespread agreement among models on the most important features
- 16 features appeared in all of the models run
- Measured by AIC or adjusted R2, the 12-feature forward-stepwise linear model performed the best
- The highest unadjusted R2 came from Gradient Boosting, followed by the very low alpha Lasso
- The linear models benefit from interpretability: Coefficients show the marginal value of a feature
- The lowest performing model was the Support Vector Machine
- This was also one of the more computationally intensive models, and one of the hardest to interpret
- Overall, linear models offered the best balance of accuracy and interpretability
- The feature 'Overall Quality' (a 10-point scale) dominates the prediction with a univariate R2 of 68%
- The multivariate regression coefficient implies that a 1-point increase in score is associated with a 28% increase in sale price
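Because the models predict log(Sale Price), a coefficient b on a feature means that a one-unit increase multiplies the predicted price by exp(b), a change of exp(b) - 1 in percentage terms. The sketch below shows the conversion in both directions. Whether the 28% quoted above is the raw coefficient or the already-converted percentage is not stated, so both readings are shown as assumptions.

```python
import math

def pct_change_from_log_coef(beta, delta=1.0):
    """Fractional change in price when the feature rises by `delta` units."""
    return math.exp(beta * delta) - 1.0

def log_coef_from_pct_change(pct):
    """Log-scale coefficient that corresponds to a fractional change `pct`."""
    return math.log1p(pct)

# Reading 1 (assumption): 0.28 is the coefficient on the log scale
print(f"beta = 0.28  ->  {pct_change_from_log_coef(0.28):.1%} per point")

# Reading 2 (assumption): 28% is the converted effect
print(f"28% per point  ->  beta = {log_coef_from_pct_change(0.28):.4f}")
```

For small coefficients the two readings nearly coincide, but at this size the gap is a few percentage points, so it is worth stating which one is meant.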
Above Grade Square Footage
- The next most important driver of price is square footage above ground level
- Houses that are small compared to their lot size may be profitable to extend
- Time since last remodel date also features as a driver of sale price
Average cost of construction in Iowa:
$103 / sq ft
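Whether an extension pays for itself can be checked by comparing the predicted value added with the construction cost quoted above. The sketch below assumes a log-price model with a per-square-foot coefficient; the coefficient (0.0003) and the house prices are hypothetical placeholders, not figures from the fitted models.

```python
import math

CONSTRUCTION_COST_PER_SQFT = 103.0  # average Iowa construction cost from the slide

def extension_gain(price, beta_sqft, added_sqft,
                   cost_per_sqft=CONSTRUCTION_COST_PER_SQFT):
    """Predicted value added minus build cost, under a log-price model."""
    value_added = price * (math.exp(beta_sqft * added_sqft) - 1.0)
    return value_added - cost_per_sqft * added_sqft

def break_even_price(beta_sqft, cost_per_sqft=CONSTRUCTION_COST_PER_SQFT):
    """Price above which a small extension adds more value than it costs.

    Uses the small-extension approximation: value per sq ft ~ price * beta.
    """
    return cost_per_sqft / beta_sqft

beta = 0.0003  # hypothetical log-price coefficient per sq ft
for price in (180_000, 500_000):  # hypothetical current values
    gain = extension_gain(price, beta, added_sqft=400)
    print(f"price ${price:,}: net gain from 400 sq ft = ${gain:,.0f}")
print(f"approximate break-even price: ${break_even_price(beta):,.0f}")
```

Under a log-price model the same extension adds more dollars to a more valuable house, which is why the break-even is expressed as a price threshold.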
- Multivariate regression coefficients predict the value add from the following actions:
- Increase garage capacity
- Install central a/c
- Finish the basement and add a full bathroom
- Fireplaces are a no-brainer
- Consider a screen porch and/or wood deck
- One challenge was that the dataset contained only 1,460 observations
- This is a small sample given the richness of the 79 features
- Additional observations covering a longer time period would increase confidence in lower-ranked features
- We experimented with the XGBoost model, but need to delve much more deeply into it to be confident
- It would be interesting to consider other cities to see if the Ames data generalizes nationally
The dataset and models applied were able to provide solid insights for multiple groups
For a potential purchaser or realtor
- Regression coefficients indicate how expensive a particular criterion will be
- Which neighborhoods offer the best development opportunities
For a homeowner
- 7 action items to increase home value
For the Ames department of Finance
- A simple linear model with 12 inputs that could form the basis for an accurate property tax assessment
- A model with high interpretability which will be feasible to defend if challenged
- A variety of modelling approaches validating the conclusions reached by the first