1. Load the data
import pandas as pd
df = pd.read_csv("data/2015.csv")
2. Initial Exploration
print(df.head()) # View the first few rows
print(df.columns) # Get column names
print(df.info()) # Summary of data types and missing values
print(df.describe()) # Summary statistics
3. Handle Missing Values
- Identify missing values
print(df.isnull().sum()) # Count missing values per column
Choose a strategy:
- Drop rows with missing values
df.dropna(inplace=True)
- Fill missing values with a constant (e.g., 0)
df.fillna(0, inplace=True)
- Forward fill or backward fill (fillna(method='ffill') is deprecated; use df.ffill() / df.bfill())
df.ffill(inplace=True)
4. Data Type Conversion
- Columns that need conversion
# 'date' to datetime
df['date'] = pd.to_datetime(df['date'])
# 'price' to numeric; non-numeric values become NaN
df['price'] = pd.to_numeric(df['price'], errors='coerce')
5. Duplicates
- Identify/Count duplicate rows
print(df.duplicated().sum())
- Remove duplicates
df.drop_duplicates(inplace=True)
6. Outlier Handling
(I should probably use Jupyter here; I am still experimenting)
- Explore outliers visually (box plots, histograms)
- Apply methods like:
- Removing outliers directly
- Capping outliers (setting limits)
- Using robust statistical methods (e.g., interquartile range)
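A minimal sketch of IQR-based capping, assuming a numeric 'price' column as in step 4 (the tiny DataFrame here is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 12, 11, 13, 500]})  # 500 is an obvious outlier

# Compute the interquartile range and the usual 1.5*IQR fences
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap (winsorize) values outside the fences instead of dropping rows
df["price"] = df["price"].clip(lower=lower, upper=upper)
```

Capping keeps the row count intact, which matters if other columns in those rows are still useful.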
7. Standardization/Normalization
- If required for machine learning or analysis, consider:
- Standardization (zero mean, unit variance)
- Normalization (scaling to a range like [0, 1])
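Both can be done in plain pandas without pulling in scikit-learn; a sketch on a hypothetical 'price' column:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0]})

# Standardization: zero mean, unit variance (z-score)
df["price_std"] = (df["price"] - df["price"].mean()) / df["price"].std()

# Normalization: min-max scaling to [0, 1]
rng = df["price"].max() - df["price"].min()
df["price_norm"] = (df["price"] - df["price"].min()) / rng
```

Note that `Series.std()` uses the sample standard deviation (ddof=1); scikit-learn's StandardScaler uses the population one, so the two differ slightly.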
8. Encoding Categorical Features
- Convert categorical columns into numerical representations using methods like:
- One-hot encoding (create dummy variables)
- Label encoding (assign numerical labels)
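A quick sketch of both approaches, assuming a hypothetical 'region' column:

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "north", "east"]})

# One-hot encoding: one dummy column per category
dummies = pd.get_dummies(df["region"], prefix="region")

# Label encoding: map each category to an integer code
df["region_code"] = df["region"].astype("category").cat.codes
```

One-hot is safer for models that would read meaning into the ordering of integer labels; label encoding is more compact for tree-based models.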
9. Data Validation
- Ensure data integrity using checks:
- Range checks (e.g., age should be within a reasonable range)
- Consistency checks (e.g., values in different columns should be logically related)
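The checks above can be sketched as boolean masks; the 'age', 'birth_year', and 'signup_year' columns here are hypothetical examples:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 40, 130],
    "birth_year": [1995, 1990, 1890],
    "signup_year": [2020, 2020, 2020],
})

# Range check: flag ages outside a plausible range
bad_age = ~df["age"].between(0, 120)

# Consistency check: age should roughly match signup_year - birth_year
implied_age = df["signup_year"] - df["birth_year"]
inconsistent = (df["age"] - implied_age).abs() > 1

print(df[bad_age | inconsistent])  # rows that fail either check
```

Keeping the flagged rows around (rather than silently dropping them) makes it easier to decide per case whether to fix or discard.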
10. Save the cleaned data
df.to_csv("cleaned_data/2015_cleaned.csv", index=False)