BMW Car Sales Analysis & Price Prediction

Temporal Validation of Market Value Using Random Forest Regression

A data mining project to predict used BMW car prices, emphasizing temporal stability and engineered features.
Author: Min Set Khant (Solo)

Affiliation: College of Information Science, University of Arizona (INFO 523)

Abstract

This project focuses on building a robust, temporally validated machine learning model to predict the market price of used BMW vehicles across multiple global regions. The analysis used the BMW Worldwide Sales dataset (2010–2024) and applied Random Forest Regression.

A key methodological choice was the Temporal Train–Test Split: the model was trained only on data up to 2023 and tested exclusively on future sales from 2024.

The final model achieved:

  • R²: 0.9952
  • MAE: $531.13
  • MSE (log scale): 0.0079

The strongest finding is the dominance of the engineered feature Price per KM, which emerged as the most powerful predictor of BMW car prices, outperforming all original features.


1. Introduction and Project Goals

This report outlines the methodology, analysis, and results for the final project of INFO 523. The objective was not just prediction, but building a production-ready forecasting system capable of anticipating market shifts.

The project answers three core questions:

  1. Feature Importance: What real-world factors most strongly determine BMW pricing?

  2. Model Accuracy: Can a non-linear machine learning model achieve high predictive performance (R² > 0.90)?

  3. Temporal Stability: How reliable are predictions when tested on unseen future data?

Full technical analysis and code can be found in the project notebooks:

  • 01_Data_Preparation_and_EDA.ipynb
  • 02_modeling.ipynb

2. Data Preparation and Feature Engineering

2.1 Dataset Overview

The dataset consists of BMW Worldwide Sales Records (2010–2024) sourced from Kaggle, containing over 50,000 vehicle records with fields such as:

  • Model
  • Mileage
  • Production Year
  • Region
  • Fuel Type
  • Transmission
  • Price in USD

Cleaning & Transformation Steps

  • Missing values imputed (categorical → "Unknown").

  • Car Age computed from sale year.

  • Log transformation applied to Price_USD.

  • One-Hot Encoding applied to:

    • Region
    • Model
    • Fuel_Type
    • Transmission
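The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names `Sale_Year` and `Production_Year`, the age definition (sale year minus production year), the natural-log transform, and the toy rows are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

CATEGORICAL = ["Region", "Model", "Fuel_Type", "Transmission"]

def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the listed cleaning steps to a raw sales frame."""
    out = df.copy()
    # Missing categorical values become an explicit "Unknown" level.
    out[CATEGORICAL] = out[CATEGORICAL].fillna("Unknown")
    # Assumed definition: age = sale year minus production year.
    out["Car_Age"] = out["Sale_Year"] - out["Production_Year"]
    # Log-transform the target (natural log is an assumption here).
    out["Log_Price"] = np.log(out["Price_USD"])
    # One-hot encode the four categorical columns.
    return pd.get_dummies(out, columns=CATEGORICAL)

# Toy rows with made-up values, only to exercise the function.
toy = pd.DataFrame({
    "Region": ["Europe", None, "Asia"],
    "Model": ["X5", "3 Series", "X5"],
    "Fuel_Type": ["Diesel", "Petrol", None],
    "Transmission": ["Automatic", "Manual", "Automatic"],
    "Sale_Year": [2022, 2023, 2024],
    "Production_Year": [2018, 2020, 2021],
    "Price_USD": [30000.0, 25000.0, 40000.0],
})
prepared = prepare_features(toy)
```

Filling missing categories before encoding means "Unknown" gets its own indicator column instead of being silently dropped.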

2.2 Engineered Feature: Price per KM

The most critical feature created:

\[ \text{Price per KM} = \frac{\text{Price\_USD}}{\text{Mileage\_KM}} \]

This variable quantifies value retention efficiency, separating high-quality vehicles from low-value ones despite similar mileage.

It ultimately became the single strongest predictor in the model.
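A minimal sketch of how the ratio might be computed, assuming the columns are named `Price_USD` and `Mileage_KM` as in the formula. The zero-mileage guard and the toy values are assumptions added for illustration. Because the ratio is built from `Price_USD`, it can only be computed for records whose price is already known.

```python
import numpy as np
import pandas as pd

def add_price_per_km(df: pd.DataFrame) -> pd.DataFrame:
    """Add Price_per_KM = Price_USD / Mileage_KM."""
    out = df.copy()
    # Treat zero mileage as missing to avoid division by zero (an assumption).
    mileage = out["Mileage_KM"].replace(0, np.nan)
    out["Price_per_KM"] = out["Price_USD"] / mileage
    return out

# Toy rows with made-up values.
toy = pd.DataFrame({
    "Price_USD": [30000.0, 45000.0, 20000.0],
    "Mileage_KM": [60000, 0, 100000],
})
result = add_price_per_km(toy)
```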


3. Exploratory Data Analysis (EDA)

Several key findings influenced our modeling strategy.

3.1 Market Price Trend

Post-2020, BMW used car prices exhibit a sharp upward spike, reflecting:

  • Supply chain shortages
  • Inflation
  • Global demand shifts

This volatility made temporal validation essential.

3.2 Regional Segmentation

Top models vary significantly by region:

  • North America: SUVs & performance models
  • Europe: compact & diesel variants
  • Asia: fuel-efficient premium sedans

This confirmed that Region is a crucial categorical feature.


4. Modeling Approach and Validation Strategy

4.1 Baseline and Final Model

  • Baseline: Linear Regression
  • Final Model: Random Forest Regressor

Reasons for choosing Random Forest:

  • Handles nonlinear relationships
  • Robust to one-hot encoded features
  • Provides interpretable Feature Importance scores
  • Performs well under volatile price conditions
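The first reason, handling nonlinear relationships, can be illustrated on synthetic data. The sketch below is not the project's experiment: the quadratic target, the sample sizes, and the hyperparameters are invented purely to show how a linear baseline and a Random Forest behave when the true relationship is curved.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic, made-up data: a U-shaped target that a straight line cannot fit.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=400)
y = (x - 5) ** 2 + rng.normal(scale=0.5, size=400)
X = x.reshape(-1, 1)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

# Compare held-out R² for the two model families.
r2_linear = r2_score(y_test, linear.predict(X_test))
r2_forest = r2_score(y_test, forest.predict(X_test))
```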

4.2 Temporal Train–Test Split

To simulate real-world forecasting:

  • Train: 2010–2023
  • Test: 2024

No future data was allowed during training.

This ensured the model learns historical trends and predicts market conditions it has never seen.
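The split can be sketched as a simple filter on the sale year. The column name `Sale_Year` and the toy rows are assumptions; the cutoff of 2023 follows the split described above.

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, year_col: str = "Sale_Year",
                   last_train_year: int = 2023):
    """Train on rows up to last_train_year, test on strictly later rows."""
    train = df[df[year_col] <= last_train_year]
    test = df[df[year_col] > last_train_year]
    return train, test

# Toy rows with made-up values.
sales = pd.DataFrame({
    "Sale_Year": [2010, 2015, 2022, 2023, 2024, 2024],
    "Price_USD": [18000.0, 22000.0, 31000.0, 33000.0, 35000.0, 36000.0],
})
train, test = temporal_split(sales)
```

Unlike a random split, no row from the test year can end up in the training set, so the evaluation mimics forecasting a year the model has not seen.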


5. Results and Evaluation

5.1 Performance on Future Data (2024)

Metric            Result     Interpretation
R²                0.9952     Explains about 99.5% of price variance in future sales
MSE (log scale)   0.0079     Low error after log transformation
MAE               $531.13    Average price prediction error is only $531

This MAE is exceptionally low given BMW’s price range, demonstrating strong temporal stability.
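For reference, the three metrics can be computed from log-scale predictions as sketched below. The report does not state on which scale R² was computed, so evaluating it on the log scale is an assumption, as are the natural-log back-transform and the toy prices.

```python
import numpy as np

def evaluate(log_true, log_pred):
    """Return R² and MSE on the log scale, and MAE in dollars."""
    log_true = np.asarray(log_true, dtype=float)
    log_pred = np.asarray(log_pred, dtype=float)
    resid = log_true - log_pred
    mse_log = float(np.mean(resid ** 2))
    # R² = 1 - SS_res / SS_tot (log scale assumed here).
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((log_true - log_true.mean()) ** 2))
    r2_log = 1.0 - ss_res / ss_tot
    # Back-transform to dollars before taking the absolute error.
    mae_usd = float(np.mean(np.abs(np.exp(log_true) - np.exp(log_pred))))
    return {"r2_log": r2_log, "mse_log": mse_log, "mae_usd": mae_usd}

# Toy prices with made-up values.
prices = np.array([20000.0, 30000.0, 45000.0])
preds = np.array([21000.0, 29000.0, 45000.0])
scores = evaluate(np.log(prices), np.log(preds))
```

Reporting MAE in dollars alongside the log-scale MSE keeps one metric in units a buyer or dealer can read directly.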


5.2 Feature Importance

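Ranked by mean decrease in impurity. For a regression forest, impurity is measured as variance (squared error) rather than Gini impurity, which applies to classification trees: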

Rank  Feature               Importance  Meaning
1     Price per KM          0.682       Dominant value-retention indicator
2     Mileage (KM)          0.318       Core depreciation measure
3     Engine Size (L)       0.000018    Minimal impact
4     Car Age               0.000015    Small but logical
5     Transmission: Manual  0.000004    Niche effect

Key insight: The engineered feature Price per KM is more informative than raw Mileage, Age, or Engine Size.
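A ranking like the one above is typically read from the fitted forest's `feature_importances_` attribute in scikit-learn. The sketch below uses synthetic data with invented column names and coefficients; it shows the mechanics only and does not reproduce the reported values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic, made-up data: log price depends on mileage only.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Mileage_KM": rng.uniform(5_000, 200_000, n),
    "Car_Age": rng.integers(0, 15, n),
    "Noise": rng.normal(size=n),
})
y = 11.0 - 1e-5 * X["Mileage_KM"] + rng.normal(scale=0.05, size=n)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1.
importances = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
```

Because the scores are normalized, one very strong feature pushes the rest toward zero, which matches the shape of the table above.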


6. Conclusion and Business Recommendations

6.1 Summary of Findings

The project successfully accomplished its goals:

  • Top Predictors: Price per KM and Mileage dominate price prediction.
  • Accuracy: Random Forest achieved strong predictive performance (R² = 0.9952).
  • Stability: Low MAE on forward-looking data confirms the model is reliable for real-world forecasting.

7. Future Work and Response to Peer Review

The peer review provided insightful suggestions for enhancing the project’s methodological rigor and practical application. We plan to incorporate the following steps in future development:

  1. Enhanced Hyperparameter Tuning: We acknowledge the suggestion to split the training data (2010–2022) into separate internal training and validation sets for dedicated hyperparameter tuning. While the current model was robust, this practice will be adopted to systematically optimize the Random Forest Regressor and prevent any potential overfitting on the primary training set before the final temporal test.

  2. Model Deployment: As suggested, the logical next step is deployment. The trained model (rf_best_model.joblib) will be implemented within a simple web dashboard (like the simulated interface provided in the peer response) to allow non-technical users to interactively input car features and immediately receive a price prediction. This will demonstrate the model’s direct value as a real-time pricing tool.
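The tuning step in item 1 could look roughly like the sketch below, which holds out the last year of the training data as an internal validation set and picks the best setting from a small grid. The helper name `tune_on_last_year`, the column names, the grid, and the toy data are all hypothetical; the final temporal test set is never touched during tuning.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def tune_on_last_year(train, features, target, year_col, grid):
    """Pick the grid entry with the lowest MSE on the last training year."""
    val_year = train[year_col].max()
    inner = train[train[year_col] < val_year]   # internal training set
    val = train[train[year_col] == val_year]    # internal validation set
    best_params, best_mse = None, float("inf")
    for params in grid:
        model = RandomForestRegressor(random_state=0, **params)
        model.fit(inner[features], inner[target])
        mse = mean_squared_error(val[target], model.predict(val[features]))
        if mse < best_mse:
            best_params, best_mse = params, mse
    return best_params, best_mse

# Synthetic, made-up training data covering five sale years.
rng = np.random.default_rng(1)
n = 400
toy = pd.DataFrame({
    "Sale_Year": rng.integers(2018, 2023, n),
    "Mileage_KM": rng.uniform(5_000, 200_000, n),
})
toy["Log_Price"] = 11.0 - 1e-5 * toy["Mileage_KM"] + rng.normal(scale=0.05, size=n)

grid = [
    {"n_estimators": 20, "max_depth": 3},
    {"n_estimators": 20, "max_depth": None},
]
best_params, best_mse = tune_on_last_year(
    toy, ["Mileage_KM"], "Log_Price", "Sale_Year", grid
)
```

Validating on the most recent training year keeps the tuning procedure consistent with the temporal logic of the final test.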