Pacific Beach Drive

Mike's Drive.

Follow me on GitHub

Data cleaning on the gas records of my BB.

I am driving a 2003 4cyl. 1.8 liter C230 coupe. It is a German thing:

  • small car to find always parking space.
  • small engine to make sure I don’t have to worry about prizes.
  • and enough HP since it has a compressor (German word for turbo): I shift into the third and emmidiatly get proper acceleration.

When I bought the car in 2010 (I guess) it had 87K on it and now (2024) the odometer is showing 165K. BB (name of my car; BB = BabyBenz since it is the smallest of all Mercs).

Since 2019 I am recording the gas consumption of my car to get an idea what age does to the gas consumption. Well, in part. I also was hoping to predict any probable issues based on higher or lower consumption as usual.

1.Initial Exploration

import pandas as pd
import numpy as np
df = pd.read_csv("data/bb.csv")
np.random.seed(0) 
print(df.head())  # View the first few rows
         Date  Odometer  miles driven  Gallons  Price/Gal  Total   MPG  \
0  05/23/2024  160439.0         259.0    12.66       5.10  64.57  20.5   
1  04/14/2024  160180.0         306.0    13.26       6.20  82.23  23.1   
2  03/28/2024  159874.0         215.0     9.64       4.80  46.24  22.3   
3  02/09/2024  159659.0         342.0    13.50       5.18  69.89  25.3   
4  12/26/2023  159371.0         317.0    13.00       4.24  55.20  24.4   

   l/100km  
0      NaN  
1     10.2  
2     10.5  
3      9.3  
4      9.6  

delete a column

  • at this point I had already tried to delete the ‘Comment’ column which didn’t work.
df = df.drop(columns=['Random notes', 'Comment\ncapacity: 16.2Gal GOAL:26MPG (8,8 l/km)'])

# Print the updated DataFrame (optional)
print(df)

# Get the index of the row where the second column contains word 'Average'
index_to_drop = df[df.iloc[:, 1] == 'Average'].index

# Drop the row using the index
df = df.drop(index=index_to_drop)

# Print the updated DataFrame (optional)
print(df)
  • For the sake of time and since these are only two columns I removed them by hand.

2.Columns and details

  • after I had deleted the columns, I need to know what I would be working with. So I print out the columns and then check the datatype (consistency, validity).
print(df.columns)  # Get column names
# result
Index(['Date', 'Odometer', 'miles driven', 'Gallons', 'Price/Gal', 'Total',
       'MPG', 'l/100km'],
      dtype='object')
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Date          99 non-null     object 
 1   Odometer      98 non-null     float64
 2   miles driven  96 non-null     float64
 3   Gallons       96 non-null     float64
 4   Price/Gal     94 non-null     float64
 5   Total         94 non-null     float64
 6   MPG           94 non-null     float64
 7   l/100km       93 non-null     float64
dtypes: float64(7), object(1)
memory usage: 6.3+ KB
None

3.Handle Missing Values

  • I knew I had left some cells empty in the past. Before choosing a strategy, I had to get a clear picture.
  • the most common way to get an idea is using isnull().sum()
print(df.isnull().sum()) 
Date            0
Odometer        1
miles driven    3
Gallons         3
Price/Gal       5
Total           5
MPG             5
l/100km         6
dtype: int64
  • Maybe I can get that in percentage?
missing_values_count = df.isnull().sum()
total_cells = np.prod(df.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
print(percent_missing)

result: 3.53%

to be continued