This project uses linear regression to explore and understand the relationships between Costco's key financial metrics and revenue. The analysis focuses on discovering which financial indicators have the strongest associations with revenue and how they relate to each other. While the relationships are intuitive, the project demonstrates how statistical modeling can formally quantify financial structure, validate assumptions, and provide analytical backing to business intuition.
Data source (Kaggle dataset): https://www.kaggle.com/datasets/jarvanv/costcodata?resource=download
This exploratory data analysis examines:
- Financial Metrics Analyzed:
  - Gross Profit
  - Operating Income (Loss)
  - Net Income Available to Common Shareholders
  - Total Assets
The analysis generates:
- Statistical Report: Correlation coefficients, R² score, and regression coefficients
- Visualizations: Scatter plots and bar charts showing relationships
- Processed Data: Cleaned and engineered features saved to CSV
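
A minimal sketch of how the statistical report described above (correlation coefficients, R² score, regression coefficients) might be computed. It assumes NumPy only and uses made-up numbers; the variable names, the helper functions `pearson_r` and `fit_ols`, and the figures are hypothetical and are not taken from the project code or the Kaggle dataset.

```python
import numpy as np

# Hypothetical, made-up figures for illustration only.
# They are NOT values from the Kaggle dataset.
revenue = np.array([100.0, 110.0, 125.0, 140.0, 150.0, 170.0, 190.0])
features = {
    "gross_profit": np.array([13.0, 14.2, 16.5, 18.0, 19.8, 22.0, 24.9]),
    "total_assets": np.array([40.0, 43.0, 45.0, 52.0, 55.0, 60.0, 66.0]),
}


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def fit_ols(X, y):
    """Ordinary least squares with an intercept.

    Returns (intercept, coefficients, r_squared).
    """
    A = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return float(beta[0]), beta[1:], 1.0 - ss_res / ss_tot


# Correlation of each metric with revenue.
correlations = {name: pearson_r(col, revenue) for name, col in features.items()}

# Multiple regression of revenue on all metrics together.
X = np.column_stack(list(features.values()))
intercept, coefs, r2 = fit_ols(X, revenue)

for name, r in correlations.items():
    print(f"r({name}, revenue) = {r:.3f}")
print(f"R^2 = {r2:.3f}")
for name, c in zip(features, coefs):
    print(f"coef({name}) = {c:.3f}")
```

The sign of each coefficient gives the direction of the relationship and R² summarizes how much revenue variation the metrics explain together. With strongly correlated features, individual coefficients can be unstable even when R² is high.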
The linear regression model helps identify:
- Which financial metrics have the strongest relationship with revenue
- The direction (positive/negative) and magnitude of each relationship
- How well the combination of metrics explains revenue variation
Limitations:
- The financial metrics analyzed (Gross Profit, Operating Income, Net Income) are mathematically derived from Revenue. As a result, the extremely high correlations (r > 0.98) reflect accounting structure rather than independent predictive power.
- Small sample size: publicly available data on Costco's business operations is limited. With only 7 years of history, the model may overfit and cannot reliably forecast future revenue. The primary goal is to illustrate feature relationships and the modeling workflow.
Future improvements:
- Shift from structural metrics to operational drivers: instead of modeling revenue from income statement items, analyze membership growth, store expansion, and e-commerce sales.
- Expand the dataset and compare Costco against competitors (e.g., Walmart, Target), and apply regularization techniques such as Ridge and Lasso regression to address multicollinearity.
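
To illustrate the regularization idea mentioned above, here is a rough sketch of ridge regression in closed form, assuming NumPy only. The data is synthetic and the helpers `standardize` and `ridge_fit` are hypothetical names introduced for this example; none of this is part of the current project. Lasso has no closed-form solution and would typically rely on a library such as scikit-learn's `Lasso`.

```python
import numpy as np


def standardize(X):
    """Scale each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients on standardized X and centered y.

    Solves (X^T X + alpha * I) beta = X^T y. Centering y removes the
    intercept, so the intercept is not penalized.
    """
    Xs = standardize(X)
    yc = y - y.mean()
    n_features = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + alpha * np.eye(n_features), Xs.T @ yc)


# Synthetic data: two nearly collinear features, seven observations.
rng = np.random.default_rng(0)
x1 = np.linspace(10.0, 20.0, 7)
x2 = 0.5 * x1 + rng.normal(scale=0.05, size=7)  # almost a multiple of x1
X = np.column_stack([x1, x2])
y = 8.0 * x1 + rng.normal(scale=1.0, size=7)

# alpha = 0 is ordinary least squares; larger alpha shrinks coefficients.
for alpha in (0.0, 1.0, 10.0):
    coefs = ridge_fit(X, y, alpha)
    print(f"alpha={alpha:>4}: coefs={np.round(coefs, 3)}")
```

As `alpha` grows, the coefficient vector shrinks toward zero, which trades a little bias for more stable estimates when features are highly correlated. In practice `alpha` would be chosen by cross-validation, which is itself hard to do reliably with only a handful of observations.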