Common Data Quality Issues and How Data Engineers Fix Them

Learn about common data quality problems, how to detect them, and proven solutions, with practical examples

Data quality issues are one of the biggest challenges in data engineering. Poor quality data leads to incorrect insights, bad business decisions, and wasted resources. Understanding common data quality problems and how to fix them is essential for building reliable data systems.

In this comprehensive guide, you'll learn the most common data quality issues, how data engineers detect them, proven solutions, and best practices for maintaining data quality. We'll use real-world examples and practical solutions to make everything actionable.

💡 Quick Tip

Use our free JSON Validator to check data quality and our JSON Formatter to identify formatting issues.

Definition: What Is Data Quality?

Data Quality refers to the accuracy, completeness, consistency, validity, and timeliness of data. High-quality data is reliable, accurate, and fit for its intended use. Poor data quality leads to incorrect analysis and bad decisions.

Dimensions of data quality:

Accuracy - Data is correct and reflects reality

Completeness - All required data is present

Consistency - Data is uniform across systems

Validity - Data follows defined rules and formats

Timeliness - Data is up-to-date and available when needed

Uniqueness - No duplicate records

What Are Common Data Quality Issues?

Here are the most common data quality problems data engineers encounter:

1. Missing Values (Nulls)

Problem: Required fields are empty or null

Example: Customer records with missing email addresses

Before:

ID | Name | Email
1 | John | john@ex.com
2 | Jane | null
3 | Bob | null

Impact: Can't send emails, incomplete analysis, broken business processes
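
Detecting missing values is usually the first step; here is a minimal sketch using pandas (the sample DataFrame mirrors the table above and is an assumption):

import pandas as pd

data = pd.DataFrame({"id": [1, 2, 3],
                     "name": ["John", "Jane", "Bob"],
                     "email": ["john@ex.com", None, None]})

print(data.isna().sum())            # nulls per column -> email: 2
print(data["email"].isna().mean())  # share of missing emails -> ~0.67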

2. Duplicate Records

Problem: Same record appears multiple times

Example: Customer "John Doe" entered twice with different IDs

Before:

ID | Name | Email
1 | John Doe | john@ex.com
5 | John Doe | john@ex.com
8 | John Doe | john.doe@ex.com

Impact: Inflated counts, incorrect aggregations, wasted storage
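
To see how many duplicates exist before removing them, a short pandas sketch (the sample data mirrors the table above):

import pandas as pd

data = pd.DataFrame({"id": [1, 5, 8],
                     "name": ["John Doe", "John Doe", "John Doe"],
                     "email": ["john@ex.com", "john@ex.com", "john.doe@ex.com"]})

# keep=False marks every copy of a duplicated key, not just the extras
dupes = data[data.duplicated(subset=["email"], keep=False)]
print(len(dupes))  # 2 rows share john@ex.com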

3. Inconsistent Formats

Problem: Same data in different formats

Example: Dates in multiple formats, phone numbers with/without dashes

Before:

Date formats: "2024-01-15", "01/15/2024", "Jan 15, 2024"
Phone: "123-456-7890", "(123) 456-7890", "1234567890"

Impact: Can't sort/filter properly, parsing errors, user confusion
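
One way to surface format drift is to try parsing every value and flag whatever fails; a sketch with pandas (format="mixed" needs pandas 2.0+, and the sample values come from the example above):

import pandas as pd

dates = pd.Series(["2024-01-15", "01/15/2024", "Jan 15, 2024"])

# each value is parsed individually; unparseable entries become NaT
parsed = pd.to_datetime(dates, format="mixed", errors="coerce")
print(parsed.isna().sum())  # 0 here, since all three formats parse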

4. Invalid Data

Problem: Data doesn't meet validation rules

Example: Email without @, age = 250, negative prices

Before:

Email: "notanemail", "john@", "@domain.com"
Age: 250, -5, "thirty"
Price: -10.50, "free", null

Impact: Application errors, failed validations, incorrect calculations
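
A sketch of rule-based detection for the examples above (the regex and the age range are illustrative assumptions, so tune them to your domain):

import re

emails = ["notanemail", "john@", "@domain.com", "john@ex.com"]
ages = [250, -5, 34]

email_pattern = re.compile(r"^[\w\.-]+@[\w\.-]+\.[a-z]{2,}$")
invalid_emails = [e for e in emails if not email_pattern.match(e)]
invalid_ages = [a for a in ages if not 0 <= a <= 120]

print(invalid_emails)  # ['notanemail', 'john@', '@domain.com']
print(invalid_ages)    # [250, -5]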

5. Data Inconsistency

Problem: Same entity has different values across systems

Example: Customer name is "John" in CRM but "Johnny" in orders system

Before:

CRM: customer_id=1, name="John Smith"
Orders: customer_id=1, name="Johnny Smith"

Impact: Can't join data correctly, reporting errors, user confusion
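
A simple way to find these mismatches is to join the two sources on the shared key and compare the attribute; a minimal sketch (table and column names are assumptions):

import pandas as pd

crm = pd.DataFrame({"customer_id": [1], "name": ["John Smith"]})
orders = pd.DataFrame({"customer_id": [1], "name": ["Johnny Smith"]})

merged = crm.merge(orders, on="customer_id", suffixes=("_crm", "_orders"))
mismatches = merged[merged["name_crm"] != merged["name_orders"]]
print(mismatches)  # customer_id 1 disagrees between systems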

When Do Data Quality Issues Occur?

Data quality issues can happen at various stages:

Data entry - Human errors when entering data manually

System integration - When combining data from multiple sources

Data migration - When moving data between systems

API integrations - When external APIs return inconsistent data

Time decay - When data becomes outdated over time

How Data Engineers Fix Data Quality Issues

1. Missing Values - Solutions

Imputation (Fill Missing Values)

Fill missing values with statistical measures or predictions

# Mean imputation for numeric columns
data['age'] = data['age'].fillna(data['age'].mean())
# Mode imputation for categorical columns
data['category'] = data['category'].fillna(data['category'].mode()[0])

When to use: When missing data is random and you need a complete dataset

Deletion

Remove records with missing critical values

data = data.dropna(subset=['email'])  # remove rows without an email

When to use: When the missing data is a small percentage and not critical

Flag Missing Values

Create indicator column for missing values

data['email_missing'] = data['email'].isna()  # True if missing

When to use: When missingness itself is informative

2. Duplicate Records - Solutions

Deduplication

Identify and remove duplicate records

# Remove exact duplicates
data = data.drop_duplicates()
# Remove duplicates based on key fields
data = data.drop_duplicates(subset=['email'])

Fuzzy Matching

Find near-duplicates using similarity algorithms

# Find near-duplicates with a similarity ratio (standard-library difflib;
# Levenshtein-distance libraries work the same way)
from difflib import SequenceMatcher
similarity = SequenceMatcher(None, "John Doe", "Jon Doe").ratio()  # ~0.93
# Merge the records if similarity exceeds a chosen threshold, e.g. 0.9

3. Format Inconsistencies - Solutions

Standardization

Convert all values to standard format

import re
from dateutil import parser

# Standardize dates to ISO format
date = parser.parse(date_string).strftime("%Y-%m-%d")  # "2024-01-15"
# Standardize phone numbers: keep digits only
phone = re.sub(r'[^0-9]', '', phone)  # "1234567890"

Normalization

Convert to canonical form (lowercase, trim whitespace)

name = name.lower().strip()    # "John Doe" → "john doe"
email = email.lower().strip()  # remove surrounding spaces

4. Invalid Data - Solutions

Validation Rules

Define and enforce validation rules

import re

# Email validation
if not re.match(r'^[\w\.-]+@[\w\.-]+\.[a-z]{2,}$', email):
    flag_as_invalid(email)
# Range validation
if age < 0 or age > 120:
    flag_as_invalid(age)

Data Type Conversion

Convert to correct data types with error handling

try:
    price = float(price_string)
except ValueError:
    price = None  # mark as invalid

Data Quality Improvement Process

1. Profile Data - Analyze data to identify quality issues (see the profiling sketch after this list)

2. Define Rules - Establish data quality rules and standards

3. Clean Data - Apply fixes: impute, deduplicate, standardize

4. Validate - Verify data meets quality standards

5. Monitor - Continuously monitor data quality metrics

6. Prevent - Implement validation at data entry points
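
A minimal profiling sketch for step 1 (the input file name is a hypothetical assumption):

import pandas as pd

data = pd.read_csv("customers.csv")  # hypothetical input

print(data.shape)                    # rows and columns
print(data.dtypes)                   # data type per column
print(data.isna().sum())             # missing values per column
print(data.duplicated().sum())       # exact duplicate rows
print(data.describe(include="all"))  # ranges and basic stats per column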

Data Quality Metrics

Metric | Formula | Target | Example
Completeness | (Non-null records / Total records) × 100 | > 95% | 950/1000 = 95%
Accuracy | (Correct records / Total records) × 100 | > 98% | 980/1000 = 98%
Uniqueness | (Unique records / Total records) × 100 | 100% | 1000/1000 = 100%
Validity | (Valid records / Total records) × 100 | > 95% | 970/1000 = 97%
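
These formulas translate directly into one-liners on a DataFrame; a sketch reusing the data DataFrame from the profiling step (column names are assumptions, and accuracy is omitted because it needs a trusted reference to compare against):

total = len(data)

completeness = data["email"].notna().sum() / total * 100
uniqueness = len(data.drop_duplicates()) / total * 100
validity = data["age"].between(0, 120).sum() / total * 100

print(f"Completeness {completeness:.1f}%, Uniqueness {uniqueness:.1f}%, Validity {validity:.1f}%")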

Why Data Quality Matters

Better Decisions - High-quality data leads to accurate insights and better business decisions

Cost Savings - Prevents costly mistakes, reduces rework, saves time

Trust & Compliance - Builds trust in data and ensures regulatory compliance

Efficiency - Reduces errors, automates processes, improves productivity

Cost of Poor Data Quality: According to studies, poor data quality costs organizations an average of $15 million per year in wasted time, incorrect decisions, and lost opportunities.

Data Quality Best Practices

Validate at Source - Catch issues early by validating data when it enters the system

Automate Data Quality Checks - Use automated pipelines to continuously monitor and fix data quality (a minimal sketch follows this list)

Document Data Quality Rules - Maintain clear documentation of what constitutes good data quality

Create Data Quality Dashboards - Monitor data quality metrics in real time with dashboards
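
A minimal shape of an automated check that a pipeline could run on every load (the rules and file name are illustrative assumptions):

import pandas as pd

def run_quality_checks(data: pd.DataFrame) -> list[str]:
    """Return a list of failed checks; an empty list means the batch passes."""
    failures = []
    if data["email"].isna().mean() > 0.05:
        failures.append("email completeness below 95%")
    if data.duplicated(subset=["email"]).any():
        failures.append("duplicate emails found")
    if not data["age"].between(0, 120).all():
        failures.append("age values out of range")
    return failures

failures = run_quality_checks(pd.read_csv("customers.csv"))  # hypothetical input
if failures:
    raise ValueError(f"Data quality checks failed: {failures}")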
