BMW Car Sales Analysis & Price Prediction

Temporal Validation of Market Value

Min Set Khant (Solo)

Introduction

Project: BMW Car Sales — Data Analysis & Price Prediction

  • Goal: Build a robust, temporally validated model to predict BMW car prices using multi-regional sales data.
  • Key Questions Addressed:
    1. What are the key features influencing BMW car prices?
    2. Can a Machine Learning model accurately predict prices?
    3. How stable are these predictions over time (Temporal Validation)?

Data Overview & Setup

Dataset: BMW Worldwide Sales (2010–2024)

  • Source: Kaggle (Multi-Regional Sales Data)
  • Status: Data has been cleaned, imputed, and One-Hot Encoded.
  • Target Variable: Price_USD (Log-transformed for modeling).

| | Model | Year | Region | Color | Fuel_Type | Transmission | Engine_Size_L | Mileage_KM | Price_USD | Sales_Volume | Sales_Classification |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 Series | 2016 | Asia | Red | Petrol | Manual | 3.5 | 151748 | 98740 | 8300 | High |
| 1 | i8 | 2013 | North America | Red | Hybrid | Automatic | 1.6 | 121671 | 79219 | 3428 | Low |
| 2 | 5 Series | 2022 | North America | Blue | Petrol | Automatic | 4.5 | 10991 | 113265 | 6994 | Low |
| 3 | X3 | 2024 | Middle East | Blue | Petrol | Automatic | 1.7 | 27255 | 60971 | 4047 | Low |
| 4 | 7 Series | 2020 | South America | Black | Diesel | Manual | 2.1 | 122131 | 49898 | 3080 | Low |
| 5 | 5 Series | 2017 | Middle East | Silver | Diesel | Manual | 1.9 | 171362 | 42926 | 1232 | Low |
| 6 | i8 | 2022 | Europe | White | Diesel | Manual | 1.8 | 196741 | 55064 | 7949 | High |
| 7 | M5 | 2014 | Asia | Black | Diesel | Automatic | 1.6 | 121156 | 102778 | 632 | Low |
| 8 | X3 | 2016 | South America | White | Diesel | Automatic | 1.7 | 48073 | 116482 | 8944 | High |
| 9 | i8 | 2019 | Europe | White | Electric | Manual | 3.0 | 35700 | 96257 | 4411 | Low |
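
The preparation steps named above (log-transforming Price_USD and one-hot encoding the categorical columns) can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it uses two rows from the sample, keeps only a subset of columns, and assumes a natural-log transform; the `Log_Price` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Two rows copied from the sample above; only a subset of columns is kept.
df = pd.DataFrame({
    "Model": ["5 Series", "i8"],
    "Fuel_Type": ["Petrol", "Hybrid"],
    "Transmission": ["Manual", "Automatic"],
    "Price_USD": [98740, 79219],
})

# Log-transform the target (natural log is an assumption; the column name is hypothetical).
df["Log_Price"] = np.log(df["Price_USD"])

# One-hot encode the categorical columns into 0/1 indicator columns.
encoded = pd.get_dummies(
    df, columns=["Model", "Fuel_Type", "Transmission"], dtype=int
)
print(encoded.columns.tolist())
```

Applying `np.exp` to `Log_Price` recovers the original USD price, which is what makes the USD error reporting later in the deck possible.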

Global Market Demand

Global Market Demand: Top 10 Models (2010–2024)

Figure: Volume by Model (2010–2024)

Insight: Globally, sales volume is often dominated by the core sedan and SUV series (e.g., 3 Series, 5 Series, X3, X5), reflecting high overall market liquidity.

Impact: Models with high global volume typically exhibit more stable pricing due to consistent demand, which reduces prediction volatility.

EDA Detail: This volume data was analyzed before the temporal split to understand the underlying market foundation across all years.
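
A ranking like the one in the chart could be produced by summing Sales_Volume per Model. The sketch below uses five rows from the data sample purely for illustration; the variable names are hypothetical, and the real analysis would run on the full dataset.

```python
import pandas as pd

# Five rows taken from the data sample (Model and Sales_Volume columns only).
sales = pd.DataFrame({
    "Model": ["5 Series", "i8", "5 Series", "X3", "7 Series"],
    "Sales_Volume": [8300, 3428, 6994, 4047, 3080],
})

# Total volume per model, largest first, limited to the top 10.
top_models = (
    sales.groupby("Model")["Sales_Volume"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
print(top_models)
```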

EDA: Correlation Analysis

Price Relationships

Figure: Correlation Matrix
Observation: Log-Price shows a strong inverse linear relationship with Car Age (depreciation) and a strong positive correlation with Engine Size (performance/class).

Figure: Price Distribution by Fuel Type
Observation: Median price differs significantly across fuel types, which indicates that this categorical feature carries predictive signal and is not independent of the target.

Detail: Two key features were engineered for this analysis: Car Age (calculated as Current_Year - Model_Year) and Price per KM (calculated as Price_USD / Mileage_KM), both showing significant predictive value.
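
The two engineered features can be sketched as below, using three rows from the data sample. The reference year of 2024 is an assumption (the last year in the dataset), and the sample's `Year` column is taken to be the model year; the output column names are hypothetical.

```python
import pandas as pd

CURRENT_YEAR = 2024  # assumption: the last year covered by the dataset

# Three rows from the data sample.
cars = pd.DataFrame({
    "Year": [2016, 2013, 2022],
    "Mileage_KM": [151748, 121671, 10991],
    "Price_USD": [98740, 79219, 113265],
})

# Car Age = Current_Year - Model_Year
cars["Car_Age"] = CURRENT_YEAR - cars["Year"]

# Price per KM = Price_USD / Mileage_KM (note: derived from the target itself)
cars["Price_per_KM"] = cars["Price_USD"] / cars["Mileage_KM"]
print(cars)
```

Because Price_per_KM is computed from Price_USD, it is only available for cars whose price is already known.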

Modeling Approach

Temporal Validation Strategy (The Key Differentiator)

  • Goal: Test model stability and real-world predictive ability.
  • Split:
    • Train Set: All data from 2010–2023.
    • Test Set: Data from 2024 (the “future” market).
  • Model Chosen: Random Forest Regressor (highest predictive power).
  • Evaluation: MAE and RMSE are converted back to USD for business interpretability.
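
The year-based split described above can be sketched as a simple filter on the `Year` column. The rows come from the data sample; the variable names are hypothetical.

```python
import pandas as pd

# Five rows from the data sample (Year and Price_USD only).
data = pd.DataFrame({
    "Year": [2016, 2013, 2022, 2024, 2020],
    "Price_USD": [98740, 79219, 113265, 60971, 49898],
})

# Train on everything up to and including 2023; hold out 2024 as the "future".
train = data[data["Year"] <= 2023]
test = data[data["Year"] == 2024]

print(len(train), "training rows;", len(test), "test rows")
```

Unlike a random split, no test-year row can end up in the training set, so the test score reflects prediction on an unseen year.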

Random Forest Performance

Model Accuracy (Reporting in USD)

| Metric | Value | Interpretation |
|---|---|---|
| R-squared (Log Price) | 0.9952 | Variance explained by the model on the transformed price. |
| Root Mean Squared Error (RMSE) | $2,292.13 | Average prediction error, penalizing large errors. |
| Mean Absolute Error (MAE) | $531.13 | Most Interpretable: On average, the model’s price prediction is off by this amount in USD. |

Interpretation: The model predicted 2024 prices with a mean absolute error of $531.13, which supports its temporal stability.
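
Reporting errors in USD when the model predicts log-price means exponentiating both the true and the predicted values before computing the metrics. The helper below is a hypothetical sketch of that step, assuming a natural-log target; the two prices are illustrative values, not results from the project.

```python
import numpy as np

def usd_errors(log_true, log_pred):
    """Return (MAE, RMSE) in USD from natural-log prices."""
    true_usd = np.exp(np.asarray(log_true, dtype=float))
    pred_usd = np.exp(np.asarray(log_pred, dtype=float))
    residuals = pred_usd - true_usd
    mae = float(np.mean(np.abs(residuals)))
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    return mae, rmse

# Illustrative values only: two true prices and two predictions, in USD.
true_prices = np.array([60000.0, 80000.0])
pred_prices = np.array([60500.0, 79000.0])

mae, rmse = usd_errors(np.log(true_prices), np.log(pred_prices))
print(f"MAE: ${mae:,.2f}  RMSE: ${rmse:,.2f}")
```

RMSE is never smaller than MAE, and a large gap between the two (as in the table above) points to a small number of large errors.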

Key Price Drivers: Feature Importance

Top 5 Predictors of BMW Price

The Random Forest model identified these key drivers based on Mean Decrease in Impurity (for a regressor, the reduction in variance at each split):

  1. Price per KM (0.682) - Represents the calculated efficiency/condition ratio; its dominance suggests it captures the residual value most effectively. Because it is computed from Price_USD, the target, this dominance should be read with caution.
  2. Mileage (KM) (0.318) - Primary measure of vehicle use and wear, directly driving the rate of depreciation.
  3. Engine Size (L) (0.000018)
  4. Car Age (0.000015)
  5. Transmission: Manual (0.000004)

Figure: Full Feature Importance

Technical Detail: The model was chosen for its ability to calculate Feature Importance via Mean Decrease in Impurity (variance reduction, since this is a regression model). The trained model was serialized and saved as rf_best_model.joblib for future deployment.
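
Reading impurity-based importances from a fitted Random Forest and saving it with joblib can be sketched as below. This is an illustration on synthetic data, not the project's model: the feature names, hyperparameters, and file name are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only the first column actually drives the target.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ holds the normalized Mean Decrease in Impurity per feature.
feature_names = ["feat_a", "feat_b", "feat_c"]  # hypothetical names
importances = dict(zip(feature_names, model.feature_importances_))
print(importances)

# Serialize the model and load it back (a temporary directory keeps the demo clean).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_demo_model.joblib")
    joblib.dump(model, path)
    reloaded = joblib.load(path)
```

The importances sum to 1, which is why the two leading features in the list above (0.682 and 0.318) leave almost nothing for the rest.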

Conclusion: Summary of Findings

Answering the Key Questions

  • Top Drivers: Price is overwhelmingly driven by two features: Price per KM (engineered) and Mileage (KM).
  • Model Success: The Random Forest model holds up under temporal validation, predicting 2024 market prices with an MAE of $531.13.
  • Market Insight: Model popularity and sales volume are highly specific to Region.

Business Recommendations

Actionable Insights for Strategy

  1. Pricing Strategy: Use the model’s feature importance to set pricing guidelines. The low MAE ($531.13) provides high confidence for automated pricing models.
  2. Inventory Optimization: Focus inventory based on the regional popularity derived from the EDA.
  3. Future Validation: Re-run the temporal validation annually to ensure the model’s stability and adapt to market shifts.
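
One way to organize the annual re-validation in recommendation 3 is a rolling-origin scheme: each year, train on all earlier years and test on the newest one. The generator below is a hypothetical sketch of that idea, not part of the project's notebooks.

```python
def rolling_year_splits(years, min_train_years=1):
    """Yield (train_years, test_year) pairs, each testing on the next unseen year."""
    ordered = sorted(set(years))
    for i in range(min_train_years, len(ordered)):
        yield ordered[:i], ordered[i]

# Example: with data for 2021-2024 and at least two training years.
splits = list(rolling_year_splits([2021, 2022, 2023, 2024], min_train_years=2))
for train_years, test_year in splits:
    print("train:", train_years, "-> test:", test_year)
```

The last pair corresponds to the 2010–2023 / 2024 split used in this project; adding a new year of data would append one more pair.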

Thank You

INFO-523 Final Project

  • Notebooks: 01_Data_Preparation_and_EDA.ipynb, 02-modeling.ipynb
  • Output: Trained model saved as rf_best_model.joblib.
  • Data: Cleaned data saved as bmw_cleaned.csv and bmw_modeling_ready.csv.