INFO 523 Data Mining Final Project: BMW Car Price Analysis & Prediction (Project Proposal)

Proposal

Proposal to analyze, visualize, and predict BMW car prices (2010–2024) to explore the temporal stability of market value as influenced by model, region, engine size, and transmission.
Author: Team Name - Min Set Khant (Solo)

Affiliation: College of Information Science, University of Arizona

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Dataset

data = pd.read_csv("data/bmw_worldwide_sales.csv")
data.head(10)
   Model     Year  Region         Color   Fuel_Type  Transmission  Engine_Size_L  Mileage_KM  Price_USD  Sales_Volume  Sales_Classification
0  5 Series  2016  Asia           Red     Petrol     Manual        3.5            151748      98740      8300          High
1  i8        2013  North America  Red     Hybrid     Automatic     1.6            121671      79219      3428          Low
2  5 Series  2022  North America  Blue    Petrol     Automatic     4.5            10991       113265     6994          Low
3  X3        2024  Middle East    Blue    Petrol     Automatic     1.7            27255       60971      4047          Low
4  7 Series  2020  South America  Black   Diesel     Manual        2.1            122131      49898      3080          Low
5  5 Series  2017  Middle East    Silver  Diesel     Manual        1.9            171362      42926      1232          Low
6  i8        2022  Europe         White   Diesel     Manual        1.8            196741      55064      7949          High
7  M5        2014  Asia           Black   Diesel     Automatic     1.6            121156      102778     632           Low
8  X3        2016  South America  White   Diesel     Automatic     1.7            48073       116482     8944          High
9  i8        2019  Europe         White   Electric   Manual        3.0            35700       96257      4411          Low

Dataset Overview

data.describe()
data.shape
(50000, 11)

This dataset, BMW Worldwide Sales Records (2010–2024), contains 50,000 records of BMW sales and vehicle specifications across multiple regions, with 11 columns per record. Key features include Model, Year, Engine_Size_L, Transmission, Fuel_Type, Color, Region, Price_USD, and Sales_Volume. This dataset was chosen because it provides a diverse range of attributes for exploring market behavior, pricing trends, and customer preferences in the automotive industry.

Questions

  1. What are the key factors influencing BMW used-car prices in the market?
  2. Can machine learning models accurately predict used-car prices?
  3. How stable are these predictions over time?
  4. Using a temporal split, how well can the model predict pricing trends for “next year”?

Analysis Plan

  1. Data Cleaning & Preparation

  • Handle missing values
  • Standardize numerical units
  • Encode categorical variables
  • Ensure time-related variables are aligned for temporal analysis
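
As a rough illustration of the cleaning steps, the sketch below fills numeric gaps with the column median and one-hot encodes the categorical columns. The sample rows and the prepare helper are hypothetical, made up for illustration; only the column names come from the dataset preview. Median imputation and one-hot encoding are assumptions here, not final choices for the project.

```python
import pandas as pd

# Hypothetical sample rows for illustration only; column names follow the dataset.
sample = pd.DataFrame({
    "Model": ["5 Series", "i8", "X3", "X3"],
    "Year": [2016, 2013, 2024, 2016],
    "Region": ["Asia", "North America", "Middle East", "South America"],
    "Fuel_Type": ["Petrol", "Hybrid", "Petrol", "Diesel"],
    "Transmission": ["Manual", "Automatic", "Automatic", "Automatic"],
    "Engine_Size_L": [3.5, 1.6, None, 1.7],  # one missing value on purpose
    "Mileage_KM": [151748, 121671, 27255, 48073],
    "Price_USD": [98740, 79219, 60971, 116482],
})


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column median, then one-hot encode categoricals."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    # Keep Year as an integer so it can drive a temporal split later.
    out["Year"] = out["Year"].astype(int)
    categorical_cols = ["Model", "Region", "Fuel_Type", "Transmission"]
    return pd.get_dummies(out, columns=categorical_cols)


clean = prepare(sample)
print(clean.head())
```

On the full dataset, the same helper could be applied to the DataFrame loaded from the CSV, after checking which columns actually contain missing values.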

  2. Exploratory Data Analysis (EDA)

  • Price trends by year, region, and model
  • Correlation analysis between price and vehicle attributes
  • Identify patterns that may indicate shifting market preferences
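
The EDA steps might start along the lines of the sketch below, which computes mean price by year, mean price by year and region, and the Pearson correlation of each numeric attribute with price. The rows are hypothetical values chosen for illustration, not records from the dataset; only the column names are taken from it.

```python
import pandas as pd

# Hypothetical rows for illustration only; column names follow the dataset.
sample = pd.DataFrame({
    "Year": [2016, 2016, 2020, 2020, 2024, 2024],
    "Region": ["Asia", "Europe", "Asia", "Europe", "Asia", "Europe"],
    "Engine_Size_L": [3.5, 1.7, 2.1, 1.8, 1.7, 3.0],
    "Mileage_KM": [151748, 48073, 122131, 196741, 27255, 35700],
    "Price_USD": [98740, 116482, 49898, 55064, 60971, 96257],
})

# Mean price per model year.
price_by_year = sample.groupby("Year")["Price_USD"].mean()

# Mean price per year and region, one column per region.
price_by_year_region = sample.pivot_table(
    index="Year", columns="Region", values="Price_USD", aggfunc="mean"
)

# Pearson correlation of each numeric attribute with price.
numeric = sample.select_dtypes(include="number")
corr_with_price = numeric.corr()["Price_USD"].drop("Price_USD")

print(price_by_year)
print(price_by_year_region)
print(corr_with_price)
```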

  3. Modeling Approach

Apply machine learning models:

  • Linear Regression
  • Decision Tree Regression
  • Random Forest Regression

Evaluate using:

  • RMSE, MAE, R²

Perform:

  • Feature importance analysis
  • Temporal train–test split to simulate predicting next-year price trends
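
One possible shape for the modeling step is sketched below with scikit-learn: train on all years before the latest one, test on the latest year, and report RMSE, MAE, and R² for each model. The data is synthetic and generated from an assumed price formula purely so the example runs on its own; the feature subset, the hyperparameters, and the choice of the latest year as the "next year" hold-out are all assumptions, not project results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the real dataset); column names follow the dataset.
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "Year": rng.integers(2010, 2025, size=n),  # 2010..2024 inclusive
    "Engine_Size_L": rng.uniform(1.5, 5.0, size=n).round(1),
    "Mileage_KM": rng.integers(0, 200_000, size=n),
})
# Assumed price formula, used only to give the models something to learn.
df["Price_USD"] = (
    30_000 + 15_000 * df["Engine_Size_L"] - 0.1 * df["Mileage_KM"]
    + rng.normal(0, 5_000, size=n)
)

features = ["Year", "Engine_Size_L", "Mileage_KM"]

# Temporal split: earlier years for training, the latest year held out.
test_year = int(df["Year"].max())
train = df[df["Year"] < test_year]
test = df[df["Year"] == test_year]

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(train[features], train["Price_USD"])
    pred = model.predict(test[features])
    results[name] = {
        "RMSE": float(np.sqrt(mean_squared_error(test["Price_USD"], pred))),
        "MAE": float(mean_absolute_error(test["Price_USD"], pred)),
        "R2": float(r2_score(test["Price_USD"], pred)),
    }

# Impurity-based feature importances from the fitted random forest.
importances = pd.Series(
    models["Random Forest"].feature_importances_, index=features
).sort_values(ascending=False)

print(pd.DataFrame(results).T)
print(importances)
```

Splitting by year rather than at random keeps future records out of the training set, which is what makes the test score a fair proxy for next-year prediction.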

  4. Visualization Dashboard

Build clear, interactive visualizations using matplotlib / seaborn

Include:

  • Price distribution
  • Trend lines across years
  • Model performance summary
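
A minimal sketch of two of the planned panels, the price distribution and the yearly trend line, is shown below using matplotlib only. The rows are hypothetical values for illustration, and the Agg backend is chosen just so the example runs without a display. The model performance panel is left out because it depends on metrics that do not exist yet.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows for illustration only; column names follow the dataset.
sample = pd.DataFrame({
    "Year": [2016, 2016, 2020, 2020, 2024, 2024],
    "Price_USD": [98740, 116482, 49898, 55064, 60971, 96257],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Panel 1: distribution of prices.
axes[0].hist(sample["Price_USD"], bins=5)
axes[0].set_title("Price distribution")
axes[0].set_xlabel("Price_USD")
axes[0].set_ylabel("Count")

# Panel 2: mean price per year as a trend line.
yearly = sample.groupby("Year")["Price_USD"].mean()
axes[1].plot(yearly.index, yearly.values, marker="o")
axes[1].set_title("Mean price by year")
axes[1].set_xlabel("Year")
axes[1].set_ylabel("Mean Price_USD")

fig.tight_layout()
fig.savefig("price_overview.png")  # hypothetical output file name
```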

Ethical AI Use Disclosure:

  • AI tools (e.g., ChatGPT) were used ethically for code debugging, idea exploration, and documentation enhancement.
  • All datasets are real and sourced from Kaggle.