YongJin Lee

Engineering Data, Investing in Tomorrow, Journeying Through Life.

Exploratory Data Analysis (EDA) Is Your Best Friend as a Data Engineer


As Data Engineers, we often pride ourselves on building robust, scalable pipelines that move data efficiently from source to target. But amid the modeling, scheduling, and orchestration, it’s easy to overlook a foundational principle: you can’t build trustworthy pipelines on unverified data. That’s where exploratory data analysis (EDA) becomes not just helpful, but essential.


A Hard Lesson Learned Early

Years ago, I made a critical mistake that still shapes how I work today. I trusted a column name and its metadata description without verifying the actual data. On the surface, it all looked fine: a clearly named column, a sensible data type, a description that matched our business terminology. But under the hood, it told a different story.

The column had unexpected duplicates, inconsistent categorical values, and subtle shifts in meaning over time due to upstream changes. What should have been a straightforward transformation job turned into days of rework and downstream corrections. It was a humbling and expensive reminder that, as engineers, we must question the data, always.

Don’t Trust Metadata at Face Value

Depending on the organization, table and column names are hints, not truths. Metadata is often outdated, incomplete, or written for a different audience. Even well-intentioned documentation can fail to capture real-world edge cases or historical changes in schema or semantics.

Here are just a few common traps:

  • An `id` column that isn’t unique per entity.
  • A status column with undocumented values introduced years ago.
  • A created_at timestamp that isn’t actually the creation time because of a migration.
  • A revenue column with multiple currencies mixed together.

This is where EDA earns its keep. It may sound obvious, but you’d be surprised how often engineers skip this fundamental analysis.


My EDA Checklist as a Data Engineer

To avoid building pipelines on shaky ground, I’ve developed a core set of EDA practices I follow religiously (this is not everything, but the most common ones):

✅ 1. Check Granularity and Uniqueness

Before you do anything, understand the grain of the data. Are the records unique by a single key or a composite key? If not, what are the rules for deduplication or consolidation?

  • Use COUNT(*) vs COUNT(DISTINCT key) to detect duplicates.
  • Group by potential composite keys and filter HAVING COUNT(*) > 1.
  • Ask: What should make a row unique?

Failing to get this right can lead to inflated metrics, incorrect joins, or data loss when aggregating.
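As a quick runnable illustration of both checks (using Python’s built-in sqlite3 here; the `orders` table and its columns are made up for the example):

```python
import sqlite3

# In-memory toy table; "orders" and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, line_no INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("A1", 1, 10.0), ("A1", 2, 5.0), ("B2", 1, 7.5), ("B2", 1, 7.5)],  # last row duplicates (B2, 1)
)

# COUNT(*) vs COUNT(DISTINCT key): a gap means order_id alone is not the grain.
total, distinct_ids = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT order_id) FROM orders"
).fetchone()
print(total, distinct_ids)  # 4 2 -> order_id is not unique per row

# Group by the candidate composite key and keep groups with more than one row.
dupes = conn.execute(
    """
    SELECT order_id, line_no, COUNT(*) AS n
    FROM orders
    GROUP BY order_id, line_no
    HAVING COUNT(*) > 1
    """
).fetchall()
print(dupes)  # [('B2', 1, 2)] -> even (order_id, line_no) is not unique
```

If the second query returns any rows, the composite key you assumed is not the real grain, and you need a deduplication rule before joining or aggregating.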

✅ 2. Explore Categorical and Enum-Like Columns

Look at value distributions over time. Use GROUP BY with aggregates such as COUNT(*), MIN, and MAX to detect:

  • Rare or unexpected values
  • Typos or inconsistencies
  • Business logic shifts (e.g., a new product category or order status)

Don’t just analyze current snapshots—check for historical changes and anomalies.
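A sketch of that kind of distribution check (sqlite3 again, with a hypothetical `events` table; MIN and MAX over a date column show when each value first and last appeared, which surfaces both casing inconsistencies and recently introduced values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (status TEXT, event_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("shipped", "2023-01-05"),
        ("shipped", "2024-02-10"),
        ("SHIPPED", "2022-11-30"),      # inconsistent casing
        ("backordered", "2024-03-01"),  # new value that appears only recently
    ],
)

# Value distribution plus first/last sighting of each distinct value.
rows = conn.execute(
    """
    SELECT status,
           COUNT(*)        AS n,
           MIN(event_date) AS first_seen,
           MAX(event_date) AS last_seen
    FROM events
    GROUP BY status
    ORDER BY n DESC
    """
).fetchall()
for row in rows:
    print(row)
```

Here `first_seen`/`last_seen` would immediately flag that `SHIPPED` stopped appearing after 2022 and `backordered` only started in 2024, both signs of an upstream logic change.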

✅ 3. Assess Nulls and Missingness

Missing data is rarely random. Investigate null counts across key columns. Use queries like:

SELECT COUNT(*) AS total_rows,
       COUNT(column_name) AS non_nulls,
       COUNTIF(column_name IS NULL) AS nulls
FROM table_name;

(COUNTIF is BigQuery syntax; in engines without it, COUNT(*) - COUNT(column_name) gives the same null count, since COUNT over a column skips NULLs.)

Then ask:

  • Are these nulls expected?
  • Are they indicative of pipeline or source issues?

Null patterns often reveal systemic issues or implicit filtering logic baked into upstream systems.
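As a runnable sketch of the same check (sqlite3 has no COUNTIF, so this uses the portable COUNT(*) - COUNT(column) form; the `users` table and `email` column are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, None), (4, "d@example.com")],
)

# COUNT(email) skips NULLs, so the difference from COUNT(*) is the null count.
total, non_nulls, nulls = conn.execute(
    """
    SELECT COUNT(*) AS total_rows,
           COUNT(email) AS non_nulls,
           COUNT(*) - COUNT(email) AS nulls
    FROM users
    """
).fetchone()
print(total, non_nulls, nulls)  # 4 2 2
```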

✅ 4. Document Assumptions and Edge Cases

EDA isn’t just about SQL queries—it’s about building context. As you uncover quirks, write them down:

  • What assumptions are you making about row-level uniqueness, timestamp ordering, and value meanings?
  • Which fields are required for your transformation logic to work?
  • Have you read through all of the table’s column names and descriptions? It’s easy to stop once you’ve found the fields you were looking for and miss columns that carry additional semantic context.
  • What edge cases could break your logic (e.g., late-arriving data, duplicate keys, nulls in joins)?

Keep these notes in a shared doc or commit them in code comments or test suites.

✅ 5. Implement Defensive Guardrails

Once you’ve verified your assumptions, encode them. For example:

  • Add SQL assertions in dbt (e.g., uniqueness tests, not-null checks)
  • Write Pytest-style unit tests for custom transformations
  • Include sanity checks (e.g., row count thresholds, enum enforcement) in Airflow or Dataform DAGs

These aren’t just safeguards—they’re also living documentation for your team.
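As an illustration of the pytest-style option, here is a minimal sketch of encoding two of the assumptions above as tests; `fetch_orders` and its fixture rows are hypothetical stand-ins for a real warehouse query:

```python
# test_orders_grain.py -- hypothetical guardrail tests; in a real pipeline,
# fetch_orders() would query the warehouse instead of returning fixtures.
from collections import Counter


def fetch_orders():
    return [
        {"order_id": "A1", "line_no": 1},
        {"order_id": "A1", "line_no": 2},
        {"order_id": "B2", "line_no": 1},
    ]


def test_grain_is_order_id_line_no():
    # Encodes the verified assumption: (order_id, line_no) is the grain.
    rows = fetch_orders()
    keys = [(r["order_id"], r["line_no"]) for r in rows]
    dupes = [k for k, n in Counter(keys).items() if n > 1]
    assert not dupes, f"duplicate composite keys: {dupes}"


def test_row_count_above_threshold():
    # Sanity check: an empty or tiny extract usually means an upstream failure.
    assert len(fetch_orders()) >= 3
```

Because the tests name the assumption directly, a failure tells the next engineer exactly which "truth" about the data broke.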

Pro tip: Your pipeline should tell you when an upstream truth breaks, before a stakeholder does.

✅ 6. Log Everything

Data issues often surface weeks or months later. That’s when your EDA notes become gold. Keep logs of:

  • What you investigated
  • What you assumed
  • What you decided (and why)
  • Discussions/Agreements you had with the stakeholders

This habit has saved me more times than I can count—whether in code reviews, incident postmortems, or onboarding new teammates.


Final Thoughts

It’s easy to think of EDA as something for data scientists or analysts. But in reality, it’s a core competency for Data Engineers. Every clean pipeline, every reliable dashboard, every correct model starts with someone asking the right questions of the raw data.

EDA gives you that superpower. It turns “I think” into “I know.”

So next time you sit down to write a new transformation or debug a flaky pipeline, start not with the code, but with the data. Be curious. Be skeptical. Let the data surprise you.

Because the truth is: data isn’t always what it seems.
