Data Cleaning: The 80% of Data Science Nobody Talks About

PrimeTek Academy

Mon, 16 Feb 2026

Garbage in, garbage out

By a commonly cited estimate, data scientists spend up to 80% of their time cleaning data. Yet most courses skip straight to modeling. Here's a real-world cleaning checklist.

The 7-step cleaning checklist

  1. Duplicates — df.drop_duplicates()
  2. Missing values — Decide per column: drop, fill with median, or flag as "unknown"
  3. Data types — Dates stored as strings? Numbers as text? Fix them early.
  4. Outliers — Use IQR or z-scores. Don't blindly remove — investigate why they exist.
  5. Inconsistent categories — "USA", "U.S.A", "United States" should be one value.
  6. Leading/trailing whitespace — df["col"] = df["col"].str.strip()
  7. Validate ranges — Age = -5? Salary = 0? Catch impossible values.
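
To make the checklist concrete, here is a minimal pandas sketch that runs all seven steps on a tiny table. Everything in it is an assumption made for illustration: the sample rows, the column names (name, country, signup, age, salary), the 1.5 × IQR outlier fence, and the 0 to 120 age range. The step order also differs slightly from the list: whitespace is stripped first so that padded copies are recognized as duplicates. Treat it as a starting point, not a definitive pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: every column name and value is invented for illustration.
raw = pd.DataFrame({
    "name": ["  Ann ", "Bob", "Bob", "Cara", "Dan", "Eve"],
    "country": ["USA", "U.S.A", "U.S.A", "United States", "Canada", None],
    "signup": ["2024-01-05", "2024-02-10", "2024-02-10", "not a date",
               "2024-03-01", "2024-03-15"],
    "age": ["34", "41", "41", "-5", "29", "38"],
    "salary": [57000.0, 61000.0, 61000.0, 58000.0, np.nan, 950000.0],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the seven checklist steps; suspicious values are flagged, not deleted."""
    df = df.copy()

    # Step 6: strip leading/trailing whitespace from text columns.
    for col in ["name", "country"]:
        df[col] = df[col].str.strip()

    # Step 1: drop exact duplicate rows.
    df = df.drop_duplicates().reset_index(drop=True)

    # Step 3: fix data types; unparseable entries become NaT/NaN instead of raising.
    df["signup"] = pd.to_datetime(df["signup"], format="%Y-%m-%d", errors="coerce")
    df["age"] = pd.to_numeric(df["age"], errors="coerce")

    # Step 5: map spelling variants of a category onto one value.
    df["country"] = df["country"].replace({"U.S.A": "USA", "United States": "USA"})

    # Step 2: missing values, decided per column.
    df["country"] = df["country"].fillna("unknown")      # flag as "unknown"
    df["salary_missing"] = df["salary"].isna()            # remember what was filled
    df["salary"] = df["salary"].fillna(df["salary"].median())

    # Step 4: flag outliers with the IQR rule so they can be investigated.
    q1, q3 = df["salary"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["salary_outlier"] = ~df["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # Step 7: validate ranges; the 0-120 bound for age is an assumed rule.
    df["age_valid"] = df["age"].between(0, 120)

    return df


cleaned = clean(raw)
print(cleaned)
```

The sketch adds flag columns (salary_missing, salary_outlier, age_valid) rather than dropping rows, which keeps the original values available when you go back to ask why an outlier or an impossible value exists.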

Clean data isn't glamorous, but it's what separates a useful model from a misleading one.
