House Prices – Advanced Regression Modeling

This project was developed as part of a graduate data science course and submitted to Kaggle’s House Prices: Advanced Regression Techniques competition. The objective was to predict housing sale prices from a combination of numeric and categorical features. My final model placed in the top 16% of 4,779 teams.


📊 Objective

Build a predictive regression model to estimate home prices in Ames, Iowa using 79 real estate features. Emphasis was placed on data cleaning, feature engineering, and optimizing the performance of a stacked model.


🛠️ Tools & Technologies

| Tool | Purpose |
|------|---------|
| Python | Programming language |
| scikit-learn | Model building, preprocessing, and evaluation |
| XGBoost | Gradient boosting model |
| Pandas, NumPy | Data manipulation and exploration |
| Matplotlib, Seaborn | Visualization |
| Kaggle | Competition platform and dataset source |
| VS Code | Integrated development environment (IDE) |

🧾 Modeling Workflow Overview

  • EDA: Explored skewness, outliers, and feature correlations
  • Data Cleaning (first sketch after this list):
    • Dropped columns with >47% missing values
    • Filled NaNs using column-wise mode or mean
    • Applied domain-specific logic for missing basement/garage data
  • Feature Engineering (second sketch below):
    • Created features like TotalSF, BuildingAge, and TotalQual
    • Applied an overfit-reduction function to low-variance columns
  • Modeling (third sketch below):
    • Built a stacked model using LinearRegression (required by the course) and XGBRegressor
    • Encoded categorical features and scaled numeric ones via ColumnTransformer
    • Applied a log transformation to stabilize the target variable’s distribution
  • Validation (fourth sketch below):
    • Cross-validated model performance
    • Tuned hyperparameters for XGBoost with grid search
  • Reproducibility:
    • All preprocessing steps are handled via pipeline objects
    • File paths are platform-independent
    • Repo includes train, test, and prediction CSVs for replication
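
Below is a minimal sketch of the cleaning logic, assuming the standard Kaggle column names (`BsmtQual`, `GarageType`, etc.); `clean` is a hypothetical helper, and the exact column lists and fill order are assumptions rather than a copy of the notebook.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass mirroring the steps listed above."""
    df = df.copy()

    # Drop columns where more than 47% of the values are missing
    missing_frac = df.isna().mean()
    df = df.drop(columns=missing_frac[missing_frac > 0.47].index)

    # Domain-specific fills: a missing basement/garage field usually means
    # the house has no basement/garage, not that the value is unknown
    for col in ["BsmtQual", "BsmtCond", "GarageType", "GarageFinish"]:
        if col in df.columns:
            df[col] = df[col].fillna("None")

    # Remaining NaNs: column-wise mode for categoricals, mean for numerics
    for col in df.columns[df.isna().any()]:
        if df[col].dtype == "object":
            df[col] = df[col].fillna(df[col].mode()[0])
        else:
            df[col] = df[col].fillna(df[col].mean())
    return df
```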
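
The engineered features can be built along these lines. The feature names match the list above, while the formulas and the dominance threshold used for overfit reduction are illustrative assumptions (`engineer` is likewise a hypothetical helper name).

```python
import pandas as pd

def engineer(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Engineered features named in the workflow; the formulas are assumptions
    # based on the standard Ames column names
    df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
    df["BuildingAge"] = df["YrSold"] - df["YearBuilt"]
    df["TotalQual"] = df["OverallQual"] + df["OverallCond"]

    # One common overfit-reduction step: drop near-constant columns where a
    # single value dominates almost every row
    dominant_share = df.apply(lambda s: s.value_counts(normalize=True).iloc[0])
    return df.drop(columns=dominant_share[dominant_share > 0.99].index)
```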
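
For the modeling step, the preprocessing and stacking described above can be wired together as shown below. The choice of `StackingRegressor` and `TransformedTargetRegressor`, and all estimator settings, are assumptions about how the notebook implements the stack, not its exact code.

```python
import numpy as np
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBRegressor

def build_model(numeric_cols, categorical_cols):
    # Scale numeric features and one-hot encode categoricals
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])

    # Stack the required LinearRegression with an XGBoost regressor
    stack = StackingRegressor(
        estimators=[
            ("linear", LinearRegression()),
            ("xgb", XGBRegressor(n_estimators=500, learning_rate=0.05)),
        ],
        final_estimator=LinearRegression(),
    )

    # log1p / expm1 stabilizes the right-skewed SalePrice distribution
    return TransformedTargetRegressor(
        regressor=Pipeline([("prep", preprocess), ("stack", stack)]),
        func=np.log1p,
        inverse_func=np.expm1,
    )
```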
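
Validation can then follow the pattern below: a grid search over a couple of XGBoost hyperparameters, followed by cross-validating the best estimator. The parameter grid and fold count are illustrative only, and `tune_and_validate` is a hypothetical helper.

```python
from sklearn.model_selection import GridSearchCV, cross_val_score

def tune_and_validate(model, X, y):
    # Parameter names follow the nested pipeline built in build_model above
    grid = GridSearchCV(
        model,
        param_grid={
            "regressor__stack__xgb__max_depth": [3, 4, 5],
            "regressor__stack__xgb__learning_rate": [0.03, 0.05, 0.1],
        },
        scoring="neg_root_mean_squared_error",
        cv=5,
    )
    grid.fit(X, y)

    # Note: Kaggle scores RMSE on log(SalePrice); scoring here is on the raw target
    scores = cross_val_score(grid.best_estimator_, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    return grid.best_estimator_, -scores.mean()
```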

📂 View the full notebook here: House_Prices_Advanced_Regression_Techniques_Stacked_16th_percentile.ipynb


📈 Sample Output

  • EDA correlation matrix (figure in notebook)
  • Modeling pipeline diagram (figure in notebook)


🧠 Key Insights

  • The most predictive features included GrLivArea, TotalSF, and OverallQual
  • Log-transforming the target variable improved accuracy and reduced skew (see the snippet after this list)
  • Pipeline structure ensured consistent preprocessing and reproducible results
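
As a quick check of the second insight, the skew of `SalePrice` before and after `log1p` can be compared directly (assuming the standard `train.csv` from the competition):

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")             # standard Kaggle training file
print(train["SalePrice"].skew())             # strongly right-skewed
print(np.log1p(train["SalePrice"]).skew())   # close to symmetric after log1p
```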

🏅 Competition Performance

| Metric | Value |
|--------|-------|
| Kaggle score (RMSE) | 0.12803 |
| Rank | 771 / 4,779 |
| Percentile | Top 16% |

🔒 Disclaimer

This project is for academic and portfolio purposes only. All data used is public and anonymized, sourced from Kaggle’s open dataset.