Washington House Prices Prediction

This repository holds an attempt to apply XGBoost regression to predict house sale prices in Washington State using tabular real estate data from Kaggle.

Overview

The task is to predict the final sale price (lastSoldPrice) of residential properties in Washington State from features such as square footage, number of bedrooms and bathrooms, year built, property type, and ZIP code. The central finding of this project is that location, encoded via ZIP code, is the dominant predictor in the model. The one-hot ZIP code columns accounted for about 70% of total XGBoost feature importance (aggregated score: 0.704), followed by the 3-digit regional prefix at 0.166. Physical features such as sqft (0.009) and baths (0.015) ranked far lower. Feature importance measures how heavily the trees rely on a feature rather than a causal effect, but the result is consistent with location mattering more than physical characteristics in this dataset.
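The "aggregated" importance above can be read as the sum of the per-column importances over all one-hot columns that came from the same source feature. A minimal sketch of that aggregation follows; the column prefixes (`zip_`, `region_`, `type_`) and the toy importance values are assumptions for illustration, not the notebook's actual column names or results.

```python
import pandas as pd

# Toy per-column importances, shaped like a fitted model's
# feature_importances_ array. Names and values are illustrative only.
importances = pd.Series({
    "sqft": 0.009,
    "baths": 0.015,
    "zip_98101": 0.30,
    "zip_98052": 0.25,
    "region_981": 0.10,
    "region_980": 0.05,
})

def feature_group(column: str) -> str:
    """Map a one-hot column such as 'zip_98101' back to its source feature."""
    for prefix in ("zip_", "region_", "type_"):
        if column.startswith(prefix):
            return prefix.rstrip("_")
    return column  # numerical columns are their own group

# groupby with a function applies it to each index label (the column name)
grouped = importances.groupby(feature_group).sum().sort_values(ascending=False)
print(grouped)
```

Summing is only meaningful when the importances are normalized to a common total, as XGBoost's scikit-learn wrapper does by default.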

The approach formulates this as a regression problem, using XGBoost as the primary model after establishing a Ridge Regression baseline. The target variable was log-transformed to handle price skewness before training, compressing the wide price range ($1k–$15.7M) so the model could learn from typical homes without being pulled toward rare luxury outliers. Predictions were converted back to dollars using the inverse transformation.

Our best model achieved an R² of 0.66 and a Mean Absolute Error of ~$105,000 on the held-out test set, improving over the Ridge Regression baseline (R² of 0.63, MAE of ~$113,000). The gap between the two models suggests that house price relationships are partly non-linear and benefit from a tree-based approach.

Summary of Work Done

Data

Type: CSV file of tabular real estate listing features; target is a continuous numerical value (sale price)
Size: 12,017 rows, 15 features before cleaning
Split (11,055 rows remaining after cleaning): 7,075 training / 1,769 validation / 2,211 test, roughly 64% / 16% / 20%

Preprocessing / Clean up

Dropped listPrice, list_to_sold_ratio, price_per_sqft, baths_full, and sanitized_text: each of these either leaks the target (for example, ratios computed from the sale price) or was not useful for generalization
Dropped the 15 rows where lastSoldPrice was null (target variable)
Filled garage missing values with 0, assuming no garage when not listed (over 80% missing)
Filled remaining missing numerical values with the column median to preserve typical values
Removed outliers using IQR (1.5x multiplier) on lastSoldPrice to reduce the influence of extreme luxury properties that were not representative of the general market
One-hot encoded type (property type) and ZIP code at two levels: full ZIP and 3-digit regional prefix to capture both neighborhood and broader region signal
Applied StandardScaler to numerical features — fit only on training data to prevent data leakage into validation and test sets
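The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's code: the `zip` column name, the synthetic values, and the two-stage 80/20 split are assumptions, while lastSoldPrice, the 1.5x IQR rule, the two-level ZIP encoding, and the train-only scaler fit come from the list above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Kaggle data (values are made up).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "lastSoldPrice": rng.lognormal(mean=13.0, sigma=0.6, size=n),
    "sqft": rng.normal(2000, 500, size=n),
    "zip": rng.choice(["98101", "98052", "99201"], size=n),
})

# IQR filter on the target: keep rows within 1.5 * IQR of the quartiles.
q1, q3 = df["lastSoldPrice"].quantile([0.25, 0.75])
iqr = q3 - q1
keep = df["lastSoldPrice"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[keep].copy()

# Two-level location encoding: full ZIP plus 3-digit regional prefix.
df["region"] = df["zip"].str[:3]
X = pd.get_dummies(df[["sqft", "zip", "region"]],
                   columns=["zip", "region"], dtype=float)
y = df["lastSoldPrice"]

# Split first, then scale: 80/20 for test, then 80/20 again for validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42)
X_train, X_val, X_test = X_train.copy(), X_val.copy(), X_test.copy()

# Fit the scaler on training rows only, then reuse it for the other splits.
num_cols = ["sqft"]
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_val[num_cols] = scaler.transform(X_val[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```

Fitting the scaler after the split means the validation and test rows never influence the mean and standard deviation used for scaling.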

Data Visualization

Feature distributions were compared before and after cleaning to confirm outlier removal and imputation worked as expected. After cleaning, histograms were compared across three price tiers (Low / Mid / High) to identify which features carried the most signal. Square footage showed the clearest separation between tiers, followed by year built and number of bathrooms. A correlation heatmap on the raw data confirmed which features were leaking information about the target.

Problem Formulation

Input: Numerical and categorical features describing a residential property (sqft, beds, baths, year built, stories, garage, ZIP code, property type)
Output: Predicted sale price in dollars
Models:
- Ridge Regression: used as a baseline linear model. Achieved a test R² of 0.63 and a test MAE of ~$113,000. Limited by its assumption of linear relationships between features and price.
- XGBoost Regressor: selected as the primary model because house prices have non-linear relationships with features (e.g. price doesn't scale linearly with sqft). XGBoost captures these patterns through ensembled decision trees and handles the mix of sparse one-hot encoded columns and dense numerical features well.
Target transformation: np.log1p applied to sale price before training to reduce skewness; predictions converted back with np.expm1
Hyperparameters: n_estimators=500, learning_rate=0.05, max_depth=6, subsample=0.8, colsample_bytree=0.8, early_stopping_rounds=20
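To illustrate the target transformation, here is a small sketch of the np.log1p / np.expm1 round trip. The price values are illustrative: the extremes echo the $1k to $15.7M range mentioned in the overview, and the others are arbitrary.

```python
import numpy as np

# Illustrative sale prices spanning the range described in the overview.
prices = np.array([1_000.0, 350_000.0, 568_000.0, 15_700_000.0])

# Forward transform applied to the target before training.
log_prices = np.log1p(prices)

# Inverse transform applied to model predictions to get dollars back.
recovered = np.expm1(log_prices)

print("log-space range:", log_prices.min(), "to", log_prices.max())
print("round trip ok:", np.allclose(recovered, prices))
```

In log space the most expensive home is only a few units away from a typical one, so a squared-error objective is less dominated by luxury outliers.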

Training

Trained locally using Python / Jupyter Notebook
XGBoost training completed in under 2 minutes on CPU
Early stopping monitored validation MAE — training halted automatically when no improvement was seen for 20 consecutive trees, preventing overfitting
No major difficulties after fixing a data leakage issue where the scaler was initially fit on the full dataset before splitting

Performance Comparison

Primary metrics: R² (proportion of variance explained) and MAE (average dollar error)

| Model | Validation R² | Validation MAE | Test R² | Test MAE |
| --- | --- | --- | --- | --- |
| Ridge Regression | 0.6820 | $113,156 | 0.6287 | $112,559 |
| XGBoost | 0.7338 | $105,701 | 0.6648 | $105,205 |

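XGBoost outperformed Ridge on both metrics, on both the validation and test sets. MAE is nearly identical between validation and test for each model, which suggests the dollar error generalizes well. R² drops from validation to test by about 0.05 for Ridge and 0.07 for XGBoost; because the drop appears for both models, it likely reflects differences between the two splits more than overfitting specific to XGBoost.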
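For reference, the two metrics can be computed in dollar space once predictions are converted back from the log scale. The sketch below uses four made-up prices and predictions, not values from the notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up true prices and predictions, expressed in log space as the
# model would see and produce them.
y_true_log = np.log1p(np.array([300_000.0, 450_000.0, 600_000.0, 900_000.0]))
y_pred_log = np.log1p(np.array([320_000.0, 400_000.0, 650_000.0, 800_000.0]))

# Convert back to dollars before scoring so the MAE is interpretable.
y_true = np.expm1(y_true_log)
y_pred = np.expm1(y_pred_log)

mae = mean_absolute_error(y_true, y_pred)  # average absolute dollar error
r2 = r2_score(y_true, y_pred)              # share of variance explained

print(f"MAE: ${mae:,.0f}")
print(f"R²: {r2:.4f}")
```

Scoring in dollars rather than log units keeps the reported MAE directly comparable to home prices.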

Conclusions

XGBoost outperforms the Ridge baseline (test R² of 0.66 vs 0.63), which suggests that the relationship between property features and sale price is partly non-linear
Among physical features, square footage, year built, and number of bathrooms showed the clearest separation between price tiers, although ZIP code dominated the model's feature importance
Geographic encoding at two levels (ZIP + region) added useful location signal
A ~$105k average error against a median home price of ~$568k corresponds to roughly 18% of the median price, which is reasonable for a single-model baseline without feature engineering

Future Work

Engineer additional features such as price per sqft by ZIP code or age of home at time of sale
Experiment with lower max_depth to close the gap between validation and test R²
Try ensemble stacking combining Ridge and XGBoost predictions
Incorporate the sanitized_text listing description column using NLP features (TF-IDF or embeddings) as additional input

How to Reproduce Results

Clone this repository
Download the dataset from Kaggle (link below) and place washington_ultimate.csv in the root directory
Install required packages (see Software Setup)
Run real_estate.ipynb top to bottom — all steps from loading to evaluation are contained in a single notebook

Overview of Files in Repository

real_estate.ipynb — main notebook containing all steps: data loading, cleaning, visualization, model training, and evaluation
washington_ultimate.csv — raw dataset (download from Kaggle separately)
README.md — this file

Software Setup

Required packages: pandas, numpy, matplotlib, seaborn, scikit-learn, xgboost
Install all with: pip install pandas numpy matplotlib seaborn scikit-learn xgboost

Data

Dataset: Washington State Real Estate — Kaggle
No additional preprocessing steps needed beyond running the notebook

Training

Open real_estate.ipynb in Jupyter Notebook or JupyterLab
Run all cells in order
XGBoost training cell will print progress every 50 trees and stop automatically via early stopping

Performance Evaluation

Validation and test metrics (R² and MAE) are printed at the end of the notebook
To evaluate on new data, apply the trained model object and scaler to your feature matrix following the same column structure
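"Following the same column structure" mostly means that one-hot columns for new data must match the training columns exactly. One way to do that is to re-encode the new rows and reindex them against the training column list, as in this sketch. The column names and the train_columns list are hypothetical; in practice the list would be saved from the training feature matrix.

```python
import pandas as pd

# Hypothetical column order saved from the training feature matrix.
train_columns = ["sqft", "beds", "type_condo", "type_single_family",
                 "zip_98101", "zip_98052"]

# New listings, including a property type and ZIP never seen in training.
new_raw = pd.DataFrame({
    "sqft": [1500, 2400],
    "beds": [2, 4],
    "type": ["condo", "townhouse"],
    "zip": ["98101", "99999"],
})

# Encode the same way as in training, then align to the training columns:
# unseen categories are dropped, missing columns are filled with 0.
new_encoded = pd.get_dummies(new_raw, columns=["type", "zip"], dtype=float)
X_new = new_encoded.reindex(columns=train_columns, fill_value=0.0)
print(X_new)

# Next steps, using the objects fitted in the notebook: transform the
# numerical columns with the fitted scaler, call model.predict(X_new),
# and apply np.expm1 to convert predictions back to dollars.
```

A listing whose ZIP code was not seen in training ends up with all ZIP columns set to 0, so its prediction relies on the remaining features.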

Citations

Washington State Real Estate dataset — Kaggle
XGBoost documentation: https://xgboost.readthedocs.io
Scikit-learn documentation: https://scikit-learn.org
