This repository holds an attempt to apply XGBoost regression to predict house sale prices in Washington State using tabular real estate data from Kaggle.
The task is to predict the final sale price (`lastSoldPrice`) of residential properties in Washington State given features such as square footage, number of bedrooms/bathrooms, year built, property type, and ZIP code. The central insight of this project is that location, encoded via ZIP code, is the dominant predictor of house prices, outweighing physical features like square footage or number of bedrooms alone. ZIP code alone accounted for over 70% of the model's aggregated XGBoost feature importance (0.704), indicating that where a house is located matters far more to the model than its physical characteristics. For reference, the next strongest feature, the 3-digit regional ZIP prefix, scored 0.166, while physical features like sqft (0.009) and baths (0.015) were comparatively minor.
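
An aggregated importance score for a one-hot encoded feature can be obtained by summing the importances of its dummy columns. The sketch below is one way to do that, not necessarily how the notebook does it. The column prefixes (`zip_`, `zip3_`) and the toy importance values are assumptions made for illustration and are not the project's results.

```python
import pandas as pd

def aggregate_importance(importances: pd.Series, prefixes: dict[str, str]) -> pd.Series:
    """Sum per-column importances into one score per original feature.

    `prefixes` maps a one-hot column prefix (e.g. "zip_") to a group label.
    Columns matching no prefix keep their own name.
    """
    def group_of(col: str) -> str:
        for prefix, label in prefixes.items():
            if col.startswith(prefix):
                return label
        return col

    # groupby with a function applies it to each index label (column name).
    return importances.groupby(group_of).sum().sort_values(ascending=False)

# Toy importances (hypothetical values, for illustration only).
toy = pd.Series({
    "zip_98101": 0.30,
    "zip_98052": 0.25,
    "zip3_981": 0.20,
    "sqft": 0.15,
    "baths": 0.10,
})
grouped = aggregate_importance(toy, {"zip_": "zip_code", "zip3_": "zip_prefix"})
print(grouped)
```

With a fitted scikit-learn style model, the input series could be built as `pd.Series(model.feature_importances_, index=X_train.columns)`, where `X_train` is assumed to be the encoded training DataFrame.
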
The approach formulates this as a regression problem, using XGBoost as the primary model after establishing a Ridge Regression baseline. The target variable was log-transformed to handle price skewness before training, compressing the wide price range ($1k–$15.7M) so the model could learn from typical homes without being pulled toward rare luxury outliers. Predictions were converted back to dollars using the inverse transformation.
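
As a minimal sketch of the target transformation described above, the snippet below applies `np.log1p` to a few toy prices and recovers them with `np.expm1`. The price values are illustrative and only chosen to span the quoted range.

```python
import numpy as np

# Toy prices spanning the range mentioned above (illustrative values only).
prices = np.array([1_000.0, 250_000.0, 568_000.0, 1_200_000.0, 15_700_000.0])

y_log = np.log1p(prices)      # log(1 + price): the compressed training target
recovered = np.expm1(y_log)   # inverse transform back to dollars

print(np.round(y_log, 2))
print(np.allclose(recovered, prices))
```

In log space the gap between the cheapest and most expensive home shrinks from several orders of magnitude to under 10 units, which is why rare luxury sales pull less on the fit.
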
Our best model achieved an R² of 0.66 and a Mean Absolute Error of ~$105,000 on the held-out test set, improving over the Ridge Regression baseline (R² of 0.63, MAE ~$113,000). The performance gap between the two models suggests that house price relationships are non-linear and benefit from a tree-based approach.
- Type: CSV file of tabular real estate listing features; target is a continuous numerical value (sale price)
- Size: 12,017 rows, 15 features before cleaning
- Split: 7,075 training / 1,769 validation / 2,211 test
- Dropped columns that would cause data leakage: `listPrice`, `list_to_sold_ratio`, `price_per_sqft`, `baths_full`, and `sanitized_text`; these were either derived from the target or not useful for generalization
- Dropped the 15 rows where `lastSoldPrice` was null (target variable)
- Filled `garage` missing values with 0, assuming no garage when not listed (over 80% missing)
- Filled remaining missing numerical values with the column median to preserve typical values
- Removed outliers using IQR (1.5x multiplier) on `lastSoldPrice` to reduce the influence of extreme luxury properties that were not representative of the general market
- One-hot encoded `type` (property type) and ZIP code at two levels: full ZIP and 3-digit regional prefix to capture both neighborhood and broader region signal
- Applied StandardScaler to numerical features, fit only on training data to prevent data leakage into validation and test sets
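
The outlier removal and two-level ZIP encoding above could be sketched as follows. This is an illustration under assumptions, not the notebook's exact code: the `zip` column name, the helper names, and the toy rows are hypothetical.

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `col` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[col].between(q1 - k * iqr, q3 + k * iqr)]

def encode_location(df: pd.DataFrame, zip_col: str = "zip") -> pd.DataFrame:
    """One-hot encode the full ZIP and its 3-digit regional prefix."""
    out = df.copy()
    out[zip_col] = out[zip_col].astype(str).str.zfill(5)  # keep ZIPs as 5-char strings
    out["zip3"] = out[zip_col].str[:3]                    # regional prefix
    return pd.get_dummies(out, columns=[zip_col, "zip3"], prefix=["zip", "zip3"])

# Toy rows (hypothetical); the last price is an extreme outlier.
toy = pd.DataFrame({
    "lastSoldPrice": [300_000, 450_000, 500_000, 550_000, 700_000, 9_000_000],
    "zip": [98101, 98101, 98052, 98052, 99201, 98101],
})
clean = encode_location(remove_iqr_outliers(toy, "lastSoldPrice"))
print(clean.columns.tolist())
```

Encoding both levels lets the model fall back on the regional prefix for ZIP codes that have only a handful of sales.
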
Feature distributions were compared before and after cleaning to confirm outlier removal and imputation worked as expected. After cleaning, histograms were compared across three price tiers (Low / Mid / High) to identify which features carried the most signal. Square footage showed the clearest separation between tiers, followed by year built and number of bathrooms. A correlation heatmap on the raw data confirmed which features were leaking information about the target.
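
One way to build such price tiers is to cut the target into three equally sized quantile bins and compare a feature's distribution per tier. The README does not state how the tiers were defined, so the tertile split below is an assumption, and the data is synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in data; not the project's dataset.
sqft = rng.uniform(600, 4000, size=300)
price = 100_000 + 150 * sqft + rng.normal(0, 50_000, size=300)
df = pd.DataFrame({"sqft": sqft, "lastSoldPrice": price})

# Split the target into three equally sized tiers by quantile.
df["tier"] = pd.qcut(df["lastSoldPrice"], q=3, labels=["Low", "Mid", "High"])

# A feature that carries signal shows clearly different medians per tier.
tier_medians = df.groupby("tier", observed=True)["sqft"].median()
print(tier_medians)
```
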
- Input: Numerical and categorical features describing a residential property (sqft, beds, baths, year built, stories, garage, ZIP code, property type)
- Output: Predicted sale price in dollars
- Models:
  - Ridge Regression: used as a baseline linear model. Achieved a test R² of 0.63 and MAE of ~$113,000. Limited by its assumption of linear relationships between features and price.
  - XGBoost Regressor: selected as the primary model because house prices have non-linear relationships with features (e.g. price doesn't scale linearly with sqft). XGBoost captures these patterns through ensembled decision trees and handles the mix of sparse one-hot encoded columns and dense numerical features well.
- Target transformation: `np.log1p` applied to sale price before training to reduce skewness; predictions converted back with `np.expm1`
- Hyperparameters: `n_estimators=500`, `learning_rate=0.05`, `max_depth=6`, `subsample=0.8`, `colsample_bytree=0.8`, `early_stopping_rounds=20`
- Trained locally using Python / Jupyter Notebook
- XGBoost training completed in under 2 minutes on CPU
- Early stopping monitored validation MAE — training halted automatically when no improvement was seen for 20 consecutive trees, preventing overfitting
- No major difficulties after fixing a data leakage issue where the scaler was initially fit on the full dataset before splitting
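
The scaler leakage fix mentioned above comes down to the order of operations: split first, then fit the scaler on the training rows only. A minimal sketch with synthetic features (the array shapes and values are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=2000, scale=800, size=(500, 3))  # synthetic numeric features

# Split first, so held-out rows never influence the scaling statistics.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)   # mean and std from training rows only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuse the training mean and std

print(scaler.mean_)
```

Fitting on the full dataset before splitting would let the validation and test rows shape the mean and standard deviation, which is the leak described above.
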
Primary metrics: R² (proportion of variance explained) and MAE (average dollar error)
| Model | Validation R² | Validation MAE | Test R² | Test MAE |
|---|---|---|---|---|
| Ridge Regression | 0.6820 | $113,156 | 0.6287 | $112,559 |
| XGBoost | 0.7338 | $105,701 | 0.6648 | $105,205 |
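XGBoost outperformed Ridge on both metrics. MAE is nearly identical between validation and test for both models, but R² drops by roughly 0.05 to 0.07 from validation to test, a gap worth investigating before concluding that the model is free of overfitting.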
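
The two metrics can be computed with scikit-learn as sketched below. The helper name is hypothetical, and it assumes both metrics are evaluated on dollar values after inverting the log transform; the README does not say whether R² was computed in dollar or log space.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

def evaluate_in_dollars(y_true_log, y_pred_log):
    """Return (R², MAE) computed on dollar values, given log1p-space inputs."""
    y_true = np.expm1(np.asarray(y_true_log))
    y_pred = np.expm1(np.asarray(y_pred_log))
    return r2_score(y_true, y_pred), mean_absolute_error(y_true, y_pred)

# Toy values (illustrative only).
true_prices = np.array([300_000.0, 500_000.0, 700_000.0])
pred_prices = np.array([320_000.0, 480_000.0, 730_000.0])

r2, mae = evaluate_in_dollars(np.log1p(true_prices), np.log1p(pred_prices))
print(f"R2={r2:.4f}  MAE=${mae:,.0f}")
```
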
- XGBoost meaningfully outperforms linear regression for house price prediction, confirming that the relationship between property features and sale price is non-linear
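- Among physical features, square footage, year built, and number of bathrooms showed the clearest separation across price tiers, although ZIP code dominated the model's feature importance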
- Geographic encoding at two levels (ZIP + region) added useful location signal
- A ~$105k average error on a median home price of ~$568k represents roughly 18% average error — reasonable for a single-model baseline without feature engineering
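
The relative error quoted above is simply the test MAE divided by the median price. Using the XGBoost test MAE from the results table and the approximate median, it works out to about 18.5%:

```python
test_mae = 105_205       # XGBoost test MAE from the results table, in dollars
median_price = 568_000   # approximate median sale price quoted above

relative_error = test_mae / median_price
print(f"{relative_error:.1%}")
```
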
- Engineer additional features such as price per sqft by ZIP code or age of home at time of sale
- Experiment with lower `max_depth` to close the gap between validation and test R²
- Try ensemble stacking combining Ridge and XGBoost predictions
- Incorporate the `sanitized_text` listing description column using NLP features (TF-IDF or embeddings) as additional input
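
As a rough sketch of the TF-IDF idea, listing text can be vectorized and stacked next to the numeric features. Everything here is hypothetical: the example descriptions, the numeric columns, and the `max_features` value are placeholders, and the vectorizer is fit on training text only to avoid leakage.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical listing descriptions standing in for `sanitized_text`.
train_text = [
    "charming craftsman with mountain view",
    "updated condo near downtown",
    "spacious lot with garage and view",
]
test_text = ["downtown condo with view"]

# Hypothetical numeric features (e.g. sqft, beds).
train_numeric = np.array([[1800.0, 3], [900.0, 1], [2400.0, 4]])
test_numeric = np.array([[1000.0, 2]])

vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
train_tfidf = vectorizer.fit_transform(train_text)  # vocabulary learned on training text only
test_tfidf = vectorizer.transform(test_text)        # unseen words are ignored

# Combine numeric and text features into one sparse matrix.
X_train = hstack([csr_matrix(train_numeric), train_tfidf]).tocsr()
X_test = hstack([csr_matrix(test_numeric), test_tfidf]).tocsr()
print(X_train.shape, X_test.shape)
```

Tree-based models such as XGBoost accept sparse matrices directly, so the stacked matrix can be used without converting it to a dense array.
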
- Clone this repository
- Download the dataset from Kaggle (link below) and place `washington_ultimate.csv` in the root directory
- Install required packages (see Software Setup)
- Run `real_estate.ipynb` top to bottom; all steps from loading to evaluation are contained in a single notebook
- `real_estate.ipynb`: main notebook containing all steps: data loading, cleaning, visualization, model training, and evaluation
- `washington_ultimate.csv`: raw dataset (download from Kaggle separately)
- `README.md`: this file
Required packages: pandas, numpy, matplotlib, seaborn, scikit-learn, xgboost

Install all with: `pip install pandas numpy matplotlib seaborn scikit-learn xgboost`
- Dataset: Washington State Real Estate — Kaggle
- No additional preprocessing steps needed beyond running the notebook
- Open `real_estate.ipynb` in Jupyter Notebook or JupyterLab
- Run all cells in order
- XGBoost training cell will print progress every 50 trees and stop automatically via early stopping
- Validation and test metrics (R² and MAE) are printed at the end of the notebook
- To evaluate on new data, apply the trained `model` object and `scaler` to your feature matrix following the same column structure
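
Scoring new rows means reproducing the training column layout before scaling and predicting. The sketch below shows one possible approach; the helper name, the column names, and the toy data are hypothetical, and a Ridge model stands in for the trained XGBoost `model` so the example stays self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

def predict_prices(model, scaler, new_df, train_columns, numeric_cols):
    """Align new rows to the training layout, scale, predict, and return dollars."""
    # One-hot encode, then add missing dummy columns (as 0) and drop unseen ones.
    X_new = pd.get_dummies(new_df).reindex(columns=train_columns, fill_value=0)
    X_new = X_new.astype(float)
    X_new[numeric_cols] = scaler.transform(X_new[numeric_cols])
    return np.expm1(model.predict(X_new))  # invert the log1p target transform

# Toy training data (hypothetical values).
train = pd.DataFrame({
    "sqft": [1000.0, 1500.0, 2000.0, 2500.0],
    "type": ["condo", "single_family", "single_family", "condo"],
    "lastSoldPrice": [300_000.0, 450_000.0, 600_000.0, 520_000.0],
})
numeric_cols = ["sqft"]
X_train = pd.get_dummies(train.drop(columns="lastSoldPrice")).astype(float)
scaler = StandardScaler().fit(X_train[numeric_cols])
X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])

# Stand-in for the trained XGBoost model.
model = Ridge(alpha=1.0).fit(X_train, np.log1p(train["lastSoldPrice"]))

new_homes = pd.DataFrame({"sqft": [1800.0], "type": ["townhouse"]})  # unseen category
preds = predict_prices(model, scaler, new_homes, X_train.columns, numeric_cols)
print(preds)
```

A category that never appeared in training (here `townhouse`) ends up with all of its dummy columns set to 0, so predictions for such rows should be treated with caution.
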
- Washington State Real Estate dataset — Kaggle
- XGBoost documentation: https://xgboost.readthedocs.io
- Scikit-learn documentation: https://scikit-learn.org
