Data Cleaning: The 80% of Data Science Nobody Talks About

Mon, 16 Feb 2026

datacamp

Garbage in, garbage out

By a commonly cited estimate, data scientists spend up to 80% of their time cleaning data. Yet most courses skip straight to the modeling. Here's a real-world cleaning checklist.

Duplicates: df = df.drop_duplicates() (the call returns a new DataFrame, so assign the result)
Missing values: Decide per column whether to drop, fill with the median, or flag as "unknown".
Data types: Dates stored as strings? Numbers as text? Fix them early.
Outliers: Use IQR or z-scores to find them. Don't blindly remove them; investigate why they exist.
Inconsistent categories: "USA", "U.S.A", "United States" should be one value.
Leading/trailing whitespace: df["col"] = df["col"].str.strip()
Validate ranges: Age = -5? Salary = 0? Catch impossible values.
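
The checklist above can be sketched as one small pandas pass. This is a minimal illustration, not a definitive pipeline: the toy data, the column names (name, country, age, salary, signup), the clean() helper, the 0 to 120 age range, and the 1.5 × IQR fence are all assumptions made for the example. Whitespace is stripped before deduplicating here so that rows differing only by stray spaces collapse together.

```python
import pandas as pd

# Hypothetical toy data; every column name and value is invented for illustration.
raw = pd.DataFrame({
    "name": [" Ann", "Bob ", "Bob ", "Cy", "Dee", "Eve"],
    "country": ["USA", "U.S.A", "U.S.A", "United States", "Canada", None],
    "age": ["34", "29", "29", "-5", "41", "38"],
    "salary": [52000.0, 45000.0, 45000.0, 60000.0, None, 950000.0],
    "signup": ["2024-01-15", "2024-02-01", "2024-02-01",
               "2024-03-10", "not a date", "2024-04-22"],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the checklist steps to a copy of df (hypothetical helper)."""
    df = df.copy()

    # Leading/trailing whitespace
    for col in ["name", "country"]:
        df[col] = df[col].str.strip()

    # Duplicates: drop exact repeats and renumber the rows
    df = df.drop_duplicates().reset_index(drop=True)

    # Data types: unparseable entries become NaN / NaT instead of raising
    df["age"] = pd.to_numeric(df["age"], errors="coerce")
    df["signup"] = pd.to_datetime(df["signup"], errors="coerce")

    # Inconsistent categories: map known variants to one canonical label
    df["country"] = df["country"].replace(
        {"USA": "United States", "U.S.A": "United States"}
    )

    # Missing values: decided per column (flag vs. fill with median)
    df["country"] = df["country"].fillna("unknown")
    df["salary"] = df["salary"].fillna(df["salary"].median())

    # Validate ranges: flag impossible ages rather than deleting the rows
    df["age_invalid"] = ~df["age"].between(0, 120)

    # Outliers: flag values outside 1.5 * IQR so they can be investigated
    q1, q3 = df["salary"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["salary_outlier"] = (df["salary"] < q1 - 1.5 * iqr) | (
        df["salary"] > q3 + 1.5 * iqr
    )
    return df


cleaned = clean(raw)
print(cleaned)
```

Flagging suspicious rows in extra columns, rather than dropping them, keeps the "investigate why they exist" step possible: the flagged rows can be reviewed before anything is removed.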

Clean data isn't glamorous, but it's what separates a useful model from a misleading one.
