Data Cleaning: The 80% of Data Science Nobody Talks About
Garbage in, garbage out
By a commonly cited estimate, data scientists spend up to 80% of their time cleaning data, yet most courses skip straight to modeling. Here's a real-world cleaning checklist.
The 7-step cleaning checklist
- Duplicates — df = df.drop_duplicates() (it returns a new DataFrame, so assign the result)
- Missing values — Decide per column: drop, fill with median, or flag as "unknown"
- Data types — Dates stored as strings? Numbers as text? Fix them early.
- Outliers — Use IQR or z-scores. Don't blindly remove — investigate why they exist.
- Inconsistent categories — "USA", "U.S.A", "United States" should be one value.
- Leading/trailing whitespace — df["col"] = df["col"].str.strip()
- Validate ranges — Age = -5? Salary = 0? Catch impossible values.
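
To make the checklist concrete, here is a minimal sketch of all seven steps in pandas. The sample data, the column names, the country mapping, the 0 to 120 age range, and the 1.5 × IQR fence are all assumptions made up for illustration; swap in whatever fits your dataset. The steps run in a slightly different order than the list, because stripping whitespace and fixing types first lets drop_duplicates() catch rows that only differ by formatting.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: every name and value here is invented for the example.
raw = pd.DataFrame({
    "name": [" Ann ", "Bob", "Bob", "Cara", "Dan", "Eve"],
    "country": ["USA", "U.S.A", "U.S.A", "United States", "usa ", "Canada"],
    "signup": ["2023-01-05", "2023-02-10", "2023-02-10",
               "not a date", "2023-03-01", "2023-04-15"],
    "age": ["34", "29", "29", "-5", "41", None],
    "salary": [52000, 48000, 48000, 51000, 950000, 50000],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()  # leave the caller's frame untouched

    # Leading/trailing whitespace: strip text columns first.
    for col in ["name", "country"]:
        df[col] = df[col].str.strip()

    # Data types: unparseable values become NaT/NaN instead of raising.
    df["signup"] = pd.to_datetime(df["signup"], errors="coerce")
    df["age"] = pd.to_numeric(df["age"], errors="coerce")

    # Inconsistent categories: map known variants to one value (assumed mapping).
    country_map = {"U.S.A": "USA", "United States": "USA", "usa": "USA"}
    df["country"] = df["country"].replace(country_map)

    # Duplicates: drop exact duplicate rows now that values are normalized.
    df = df.drop_duplicates().reset_index(drop=True)

    # Validate ranges: ages outside an assumed 0-120 range are treated as missing.
    df.loc[~df["age"].between(0, 120), "age"] = np.nan

    # Missing values: keep a flag column, then fill with the median.
    df["age_missing"] = df["age"].isna()
    df["age"] = df["age"].fillna(df["age"].median())

    # Outliers: flag with the 1.5 * IQR rule rather than removing them.
    q1, q3 = df["salary"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["salary_outlier"] = ~df["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    return df


cleaned = clean(raw)
print(cleaned)
```

Note that the outlier step only adds a salary_outlier flag. That keeps the "investigate why they exist" decision with you instead of silently deleting rows.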
Clean data isn't glamorous, but it's what separates a useful model from a misleading one.