BMW Car Sales — Modeling and Prediction

INFO-523 Final Project | Min Set Khant (Solo)

This notebook loads the modeling-ready data, performs a temporal train-test split (training on 2010-2023, testing on 2024), trains Linear Regression and Random Forest models, and evaluates performance on the original USD price scale.

# 1. SETUP AND IMPORTS
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import os
import joblib 

# Scikit-learn modules for modeling and evaluation
# (the split is done manually by year, so train_test_split is not needed)
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Define Paths
# OUTPUT_DIR is needed to save the final model and prediction results
OUTPUT_DIR = Path("..") / "data" / "cleaned"
VISUALS_DIR = Path("..") / "visuals"
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(VISUALS_DIR, exist_ok=True)
# 2. DATA LOADING
DATA_PATH = Path("..") / "data" / "cleaned" / "bmw_modeling_ready.csv"

df = pd.read_csv(DATA_PATH)
print("Modeling data shape:", df.shape)
df.head()
Modeling data shape: (50000, 33)
   Year  Engine_Size_L  Mileage_KM  Price_USD  Sales_Volume  Car_Age  log_Price_USD  Price_per_KM  \
0  2016            3.5      151748      98740          8300        9      11.500256      0.650684
1  2013            1.6      121671      79219          3428       12      11.279984      0.651092
2  2022            4.5       10991     113265          6994        3      11.637494     10.305250
3  2024            1.7       27255      60971          4047        1      11.018170      2.237057
4  2020            2.1      122131      49898          3080        5      10.817756      0.408561

   Total_Sales_Model  Region_Asia  ...  Model_i8  Fuel_Type_Electric  Fuel_Type_Hybrid  Fuel_Type_Petrol  \
0           23097519         True  ...     False               False             False              True
1           23423891        False  ...      True               False              True             False
2           23097519        False  ...     False               False             False              True
3           22745529        False  ...     False               False             False              True
4           23786466        False  ...     False               False             False             False

   Transmission_Manual  Color_Blue  Color_Grey  Color_Red  Color_Silver  Color_White
0                 True       False       False       True         False        False
1                False       False       False       True         False        False
2                False        True       False      False         False        False
3                False        True       False      False         False        False
4                 True       False       False      False         False        False

[5 rows x 33 columns]

3. Temporal Train-Test Split (Predicting the Future)

This is the most critical part of the validation: a temporal split tests the model's ability to predict the newest market year (2024) using only historical data (2010-2023).

# Define Target (y) and Features (X)
TARGET = "log_Price_USD"

# X contains all features, including Year and Price_USD (needed for the split and final error)
X = df.drop(columns=[TARGET])
y = df[TARGET]

# --- TEMPORAL SPLIT IMPLEMENTATION ---
TEST_YEAR = 2024

# Split data based on the Year column
X_train = X[X["Year"] < TEST_YEAR]
X_test = X[X["Year"] == TEST_YEAR]

y_train = y[X_train.index]
y_test = y[X_test.index]

# Store the original USD prices for the test set for final error calculation
y_test_usd_true = X_test["Price_USD"]

# Clean up X_train/X_test by dropping redundant/leaking columns before training.
# Price_USD is a direct leak, Year is only used for the split, and Sales_Volume /
# Total_Sales_Model are post-sale information.
# NOTE: Price_per_KM is also derived from Price_USD; it is kept here, but its
# effect on the results is revisited in Section 5.
cols_to_drop = ["Year", "Price_USD", "Sales_Volume", "Total_Sales_Model"]
X_train = X_train.drop(columns=[c for c in cols_to_drop if c in X_train.columns], errors='ignore')
X_test = X_test.drop(columns=[c for c in cols_to_drop if c in X_test.columns], errors='ignore')

print(f"Training Data (Years < {TEST_YEAR}): {X_train.shape[0]} rows")
print(f"Testing Data (Year == {TEST_YEAR}): {X_test.shape[0]} rows")
Training Data (Years < 2024): 46573 rows
Testing Data (Year == 2024): 3427 rows
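
Before training, a quick sanity check (a minimal sketch, relying on df still holding the Year column) confirms the split is airtight:

# Verify the temporal split: training years all precede the test year,
# the test set is exactly 2024, and no rows were lost
assert df.loc[X_train.index, "Year"].max() < TEST_YEAR
assert (df.loc[X_test.index, "Year"] == TEST_YEAR).all()
assert len(X_train) + len(X_test) == len(df)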

4. Model Training and Evaluation

The evaluation function converts predictions back to USD using np.expm1() so that MAE and RMSE are directly usable by a business audience.
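
Since the target was built with np.log1p, np.expm1 is its exact inverse; a one-line sanity check (using the row-0 price from the preview above, not part of the pipeline) makes this concrete:

# expm1(log1p(price)) recovers the original dollar value exactly
price = 98740.0  # Price_USD from row 0 of the preview above
assert np.isclose(np.expm1(np.log1p(price)), price)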

# Helper function to evaluate models and print results
def evaluate_model(model_name, y_true_log, y_pred_log, y_true_usd):
    """Score a model on the log scale (R2) and the original USD scale (RMSE/MAE)."""
    # 1. Convert log-predictions back to USD (original price scale)
    y_pred_usd = np.expm1(y_pred_log)

    # 2. Calculate metrics
    r2 = r2_score(y_true_log, y_pred_log)
    rmse = np.sqrt(mean_squared_error(y_true_usd, y_pred_usd))
    mae = mean_absolute_error(y_true_usd, y_pred_usd)

    print(f"{model_name} Results (Temporal Split, Test Year {TEST_YEAR}):")
    print(f"  R-squared (Log Price): {r2:.4f} - Variance explained on the log scale")
    print(f"  RMSE (USD): ${rmse:,.2f} - Root mean squared error (penalizes large errors)")
    print(f"  MAE (USD): ${mae:,.2f} - Mean absolute error (most interpretable)")
    print("")

    return r2, rmse, mae
# Model 1: Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)

lr_r2, lr_rmse, lr_mae = evaluate_model(
    "Linear Regression", 
    y_test, 
    lr_pred, 
    y_test_usd_true
)
Linear Regression Results (Temporal Split, Test Year 2024):
  R-squared (Log Price): 0.0003 - Variance explained on the log scale
  RMSE (USD): $26,490.86 - Root mean squared error (penalizes large errors)
  MAE (USD): $22,927.30 - Mean absolute error (most interpretable)

The near-zero R-squared shows the linear model captures essentially none of the 2024 log-price variance, likely because the strongest signal in the data is multiplicative (Price_per_KM times Mileage_KM reconstructs the price), a relationship a purely linear model cannot represent.
# Model 2: Random Forest Regressor 

rf_model = RandomForestRegressor(
    n_estimators=100, 
    random_state=42, 
    n_jobs=-1, 
    max_depth=15, 
    min_samples_leaf=5
)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

rf_r2, rf_rmse, rf_mae = evaluate_model(
    "Random Forest Regressor", 
    y_test, 
    rf_pred, 
    y_test_usd_true
)
Random Forest Regressor Results (Temporal Split, Test Year 2024):
  R-squared (Log Price): 0.9952 - Variance explained on the log scale
  RMSE (USD): $2,292.13 - Root mean squared error (penalizes large errors)
  MAE (USD): $531.13 - Mean absolute error (most interpretable)

An R-squared this close to 1 on unseen data usually warrants scrutiny; Section 5 traces it to the Price_per_KM feature.
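
With both models scored, a small summary table (a sketch built from the metric variables returned above) makes the gap explicit:

# Side-by-side comparison on the 2024 hold-out set
comparison = pd.DataFrame(
    {
        "R2 (log price)": [lr_r2, rf_r2],
        "RMSE (USD)": [lr_rmse, rf_rmse],
        "MAE (USD)": [lr_mae, rf_mae],
    },
    index=["Linear Regression", "Random Forest"],
)
print(comparison.round(4))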

5. Feature Importance and Model Insights

Analyzing the feature importance from the Random Forest model reveals the key drivers of the price prediction, directly addressing the first proposal question. One caveat up front: Price_per_KM is computed from Price_USD and Mileage_KM, so its dominance below reflects partial target leakage rather than a genuine market driver.

# Random Forest Feature Importance
importances = pd.Series(rf_model.feature_importances_, index=X_train.columns)
top_20_features = importances.sort_values(ascending=False).head(20)

# Visualization (assigning hue and legend=False avoids the seaborn palette deprecation warning)
plt.figure(figsize=(12, 8))
sns.barplot(
    x=top_20_features.values,
    y=top_20_features.index,
    hue=top_20_features.index,
    palette="viridis",
    legend=False,
)
plt.title("Top 20 Features Driving BMW Price Prediction (Random Forest)", fontsize=16)
plt.xlabel("Feature Importance Score (Gini)")
plt.ylabel("Feature")
plt.tight_layout()
plt.savefig(VISUALS_DIR / "rf_feature_importance.png")
plt.show()

print("\nTop 5 Price Drivers (Feature Importance):")
print(top_20_features.head(5))

Top 5 Price Drivers (Feature Importance):
Price_per_KM           0.682248
Mileage_KM             0.317700
Engine_Size_L          0.000018
Car_Age                0.000015
Transmission_Manual    0.000004
dtype: float64
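
Impurity-based importances can flatter features that correlate with the target by construction, so a cross-check with permutation importance on the 2024 hold-out set is worthwhile (a sketch using scikit-learn's permutation_importance; n_repeats=5 is an arbitrary choice):

from sklearn.inspection import permutation_importance

# How much does shuffling each feature degrade hold-out R2 on the log target?
perm = permutation_importance(
    rf_model, X_test, y_test, n_repeats=5, random_state=42, n_jobs=-1
)
perm_top = pd.Series(perm.importances_mean, index=X_test.columns)
print(perm_top.sort_values(ascending=False).head(5))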

6. Final Model Deployment Function

This conceptual function demonstrates how the final model can be wrapped for real-world use: it aligns the input with the training feature columns and returns the predicted price in USD.

# Final Model Deployment Function
def predict_bmw_price(car_features_dict, model):
    """
    Predicts the price of a BMW given its features.

    NOTE: The input dictionary should contain all feature columns
    (including OHE columns) that the model was trained on; any missing
    column is filled with 0. Relies on the global X_train for the
    expected column order.
    """
    # Convert dictionary to DataFrame for model input
    df_new = pd.DataFrame([car_features_dict])

    # Ensure all training columns exist, in the correct order (crucial for
    # OHE data); columns absent from the input are filled with 0
    df_new = df_new.reindex(columns=X_train.columns, fill_value=0)

    # Predict on the log scale
    price_pred_log = model.predict(df_new)[0]

    # Convert back to USD using the inverse of log1p
    price_pred_usd = np.expm1(price_pred_log)

    return price_pred_usd

# Example usage (conceptual: replace with actual input data)
# example_car_dict = {
#     'Engine_Size_L': 3.0, 'Mileage_KM': 10000, 'Car_Age': 1, 'Price_per_KM': 8.5,
#     'Region_Asia': 1, 'Region_Europe': 0, ... (all remaining one-hot columns set to 0 or 1)
# }
# print(f"Predicted Price (USD): ${predict_bmw_price(example_car_dict, rf_model):,.2f}")
# 7. Saving the Model and Prediction Results

# 1. Saving the Trained Random Forest Model
MODEL_SAVE_PATH = OUTPUT_DIR / "rf_best_model.joblib" 

# Save the model object
joblib.dump(rf_model, MODEL_SAVE_PATH)
print(f"✅ Saved the trained Random Forest model to: {MODEL_SAVE_PATH}")


# 2. Saving the Temporal Prediction Results
# evaluate_model computes the USD predictions internally but does not return
# them, so recompute them here
y_pred_usd = np.expm1(rf_pred)

# Retrieve the actual USD prices and create a comparison DataFrame
results_df = pd.DataFrame({
    'Actual_Price_USD': y_test_usd_true,
    'Predicted_Price_USD': y_pred_usd,
    'Prediction_Error_USD': np.abs(y_test_usd_true - y_pred_usd)
})

results_df['Year'] = TEST_YEAR
results_df = results_df.sort_values(by='Prediction_Error_USD', ascending=False)

# Define the path and save the results
RESULTS_SAVE_PATH = OUTPUT_DIR / "temporal_predictions_2024.csv"
results_df.to_csv(RESULTS_SAVE_PATH, index=False)
print(f"✅ Saved temporal prediction results to: {RESULTS_SAVE_PATH}")

print("\nTop 5 Largest Prediction Errors (for inspection):")
display(results_df.head())
✅ Saved the trained Random Forest model to: ../data/cleaned/rf_best_model.joblib
✅ Saved temporal prediction results to: ../data/cleaned/temporal_predictions_2024.csv

Top 5 Largest Prediction Errors (for inspection):
       Actual_Price_USD  Predicted_Price_USD  Prediction_Error_USD  Year
762              114283         76473.818456          37809.181544  2024
483              119117         83383.235095          35733.764905  2024
40586            116759         83383.235095          33375.764905  2024
2101             109818         77051.262024          32766.737976  2024
37022            108646         76714.311586          31931.688414  2024
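
Anyone reviewing the saved artifact can reload it and summarize the full 2024 error distribution (a sketch; the quantiles are an arbitrary choice):

# Reload the saved predictions and summarize the error distribution
saved = pd.read_csv(RESULTS_SAVE_PATH)
print(saved["Prediction_Error_USD"].describe(percentiles=[0.5, 0.9, 0.99]))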