In a world where data drives numerous businesses, we aim to establish sources of truth (sometimes called canonical data) characterized by correctness and reliability. In my view, companies like Spotify, Amazon, Meta, and Google are fundamentally data companies. While their customer offerings vary, their business objective remains consistent: to connect users with what they seek. For sectors like e-commerce, making accurate recommendations is crucial, and understanding core metrics, such as user engagement and behavior, is just as vital.
This underscores why reliable data matters to data analysts, data scientists, machine learning engineers, backend engineers, and other data users. Many times I have had to explain the effort and time required to build trustworthy, reliable data. There is a general expectation that a trustworthy data source will simply be available, without much thought about what makes a dataset canonical, the maintenance it needs, the importance of monitoring, and so on. Creating and maintaining canonical datasets takes significant yet often understated effort. It is unrealistic to expect data to arrive in the proper format immediately; correct and reliable data does not appear by declaring, “Let there be canonical data.” It would be wonderful if it did. In reality, it involves numerous considerations.
What Is Canonical Data?
Canonical data refers to a standardized or agreed-upon format within a specific domain or system. It serves as a common language for data representation, enabling effective communication between different systems or applications. This unified model simplifies data integration and communication, reducing complexity and potential errors from handling multiple data formats.
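As a rough illustration, the sketch below shows what a canonical representation might look like for a purchase event, with one upstream system mapped into it. The schema fields and the `from_checkout_payload` mapper are hypothetical assumptions for this post, not a prescribed model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical canonical schema for a purchase event: every producing system
# maps its own representation into this single, agreed-upon shape.
@dataclass(frozen=True)
class PurchaseEvent:
    event_id: str                  # globally unique identifier
    user_id: str                   # internal user identifier, never an email
    amount_minor_units: int        # e.g. cents, to avoid floating-point money
    currency: str                  # ISO 4217 code such as "USD"
    occurred_at: datetime          # always stored in UTC
    source_system: str             # which upstream system produced the event
    is_bot: Optional[bool] = None  # populated once bot detection is available


def from_checkout_payload(payload: dict) -> PurchaseEvent:
    """Map one (assumed) upstream format into the canonical representation."""
    return PurchaseEvent(
        event_id=payload["id"],
        user_id=payload["customer"]["uid"],
        amount_minor_units=int(round(float(payload["total"]) * 100)),
        currency=payload.get("currency", "USD"),
        occurred_at=datetime.fromtimestamp(payload["ts"], tz=timezone.utc),
        source_system="checkout",
    )
```

Each upstream system gets its own small mapper like this, so consumers only ever see the one canonical shape.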
Creation of Canonical Data
Requirement Gathering/Specifications
Creating canonical data begins with gathering requirements and reaching a joint agreement with the business or product area on the expected output. This may seem straightforward, but different teams can interpret the same data differently, even within the same product area. Our goal is to provide data broad enough that teams can apply their own semantic meaning through filtering and downstream use. This requires collaboration and a thorough sign-off procedure to ensure complete stakeholder agreement.
Feasibility is another consideration. Sometimes the necessary data is already available; other times it requires collaboration with backend engineers or acquiring external data, depending on the need. The process is iterative, balancing requirements against availability and feasibility. For instance, in one of my projects, distinguishing user traffic from bot traffic required modifications to the backend systems.
Exploratory Data Analysis (EDA) and reviewing existing data documentation are crucial during the requirements gathering and development stages.
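As a loose sketch of what that EDA pass might look like, assuming the raw data is available as a Parquet file and using illustrative column names:

```python
import pandas as pd

# Illustrative EDA over a candidate source table (file and column names assumed).
df = pd.read_parquet("raw_purchase_events.parquet")

# Quick profile: types, null rates, and cardinality per column.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_rate": df.isna().mean().round(4),
    "distinct": df.nunique(),
})
print(profile)

# Spot-check ranges that requirements usually care about.
print("date range:", df["occurred_at"].min(), "to", df["occurred_at"].max())
print("negative amounts:", (df["amount_minor_units"] < 0).sum())
```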
Development Stage
The development stage typically involves the following steps; a minimal batch-pipeline sketch follows the list:
- Building data pipelines, either in batch or streaming formats.
- Implementing unit tests and validation queries for data quality.
- Creating and updating documentation.
- Establishing monitoring systems for the pipelines and their outputs.
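The sketch below ties these steps together as plain Python functions. The file path, field names, and logging setup are illustrative assumptions rather than any particular framework's API.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("purchase_pipeline")


def extract(path: str) -> list[dict]:
    """Read raw records from the upstream source (assumed JSON lines)."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def validate(records: list[dict]) -> list[dict]:
    """Drop records that violate basic expectations; log what was kept."""
    valid = [
        r for r in records
        if r.get("user_id") and r.get("amount_minor_units", 0) >= 0
    ]
    log.info("validated %d of %d records", len(valid), len(records))
    return valid


def transform(records: list[dict]) -> list[dict]:
    """Map raw records into the canonical shape (fields assumed)."""
    return [
        {
            "user_id": r["user_id"],
            "amount_minor_units": r["amount_minor_units"],
            "currency": r.get("currency", "USD"),
        }
        for r in records
    ]


def load(records: list[dict], table: str) -> None:
    """Placeholder for writing to the warehouse; swap in your sink of choice."""
    log.info("would write %d records to %s", len(records), table)


def run(path: str) -> None:
    load(transform(validate(extract(path))), table="canonical.purchases")
```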
Validation queries are essential for covering edge cases, even those that seem obvious. For example, in a previous role we encountered negative revenue values caused by a data-input error, which significantly impacted our financial reporting. Validation queries help prevent such errors from reaching downstream processes.
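Here is a sketch of what such a check might look like, assuming a warehouse table named `canonical.purchases` and a generic `run_query` client; both names are illustrative:

```python
# Hypothetical validation check: fail the run if any negative revenue rows
# appear in the freshly loaded partition. Table and column names are assumed.
NEGATIVE_REVENUE_CHECK = """
SELECT COUNT(*) AS bad_rows
FROM canonical.purchases
WHERE load_date = :load_date
  AND amount_minor_units < 0
"""


def assert_no_negative_revenue(run_query, load_date: str) -> None:
    """`run_query` stands in for whatever client your warehouse exposes;
    here it is assumed to return rows as dictionaries."""
    result = run_query(NEGATIVE_REVENUE_CHECK, {"load_date": load_date})
    bad_rows = result[0]["bad_rows"]
    if bad_rows:
        raise ValueError(
            f"{bad_rows} negative revenue rows on {load_date}; halting load"
        )
```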
Unit tests and validation queries also serve as informative documentation of expected behaviors and assumptions about the data sources. This helps other engineers modify the data pipelines while keeping them consistent.
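For example, a couple of pytest-style tests over the earlier hypothetical sketches (assuming they live in a module named `pipeline`) read almost like documentation of the contract:

```python
# Test names and assertions double as documentation of what the
# transformation promises. The module name is an assumption.
from pipeline import from_checkout_payload, validate


def test_checkout_payload_amounts_are_stored_in_minor_units():
    payload = {
        "id": "e1",
        "customer": {"uid": "u42"},
        "total": "19.99",
        "currency": "EUR",
        "ts": 1_700_000_000,
    }
    event = from_checkout_payload(payload)
    assert event.amount_minor_units == 1999
    assert event.currency == "EUR"


def test_records_without_a_user_id_are_dropped_by_validation():
    assert validate([{"amount_minor_units": 100, "user_id": None}]) == []
```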
Thorough documentation is crucial for data consumers. Even with clear documentation, there can be misinterpretations, but it’s always beneficial to provide detailed guidance.
I advocate for including counters in data pipelines to monitor data volumes, such as the number of ingested, dropped, and processed data points. For financial data, counting negative versus positive entries is particularly useful. These counters aid in dashboarding and reporting and enable abnormality detection in data volumes.
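A minimal sketch of such counters, assuming a generic `emit_metric` callable stands in for whatever metrics backend is in use:

```python
from collections import Counter

# Hypothetical per-run counters: incremented as records flow through the
# pipeline and emitted at the end for dashboards and anomaly checks.
counters = Counter()


def count_record(record: dict) -> None:
    counters["ingested"] += 1
    if record.get("amount_minor_units", 0) < 0:
        counters["negative_amount"] += 1
    else:
        counters["non_negative_amount"] += 1


def finalize_run(emit_metric) -> None:
    """`emit_metric(name, value)` stands in for your metrics client
    (StatsD, Prometheus pushgateway, a warehouse audit table, ...)."""
    for name, value in counters.items():
        emit_metric(f"purchase_pipeline.{name}", value)
```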
Monitoring, Maintenance, and Modifications
The data team responsible for canonical datasets monitors pipeline performance, addresses validation failures, and implements necessary updates. This includes adapting to data volume changes, scaling compute resources, and modifying pipeline steps.
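One simple way to flag data volume changes, sketched here as a plain threshold check against recent run counts (the tolerance is an arbitrary illustrative default):

```python
# Hypothetical volume check: flag a run whose ingested count deviates sharply
# from the recent average, so the owning team can investigate before
# downstream consumers are affected.
def volume_looks_abnormal(
    todays_count: int, recent_counts: list[int], tolerance: float = 0.5
) -> bool:
    if not recent_counts:
        return False
    baseline = sum(recent_counts) / len(recent_counts)
    return abs(todays_count - baseline) > tolerance * baseline
```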
Effective communication with product and business teams is critical to prevent disruptions. Unexpected changes or system overhauls can lead to significant pipeline rework. Establishing dedicated communication channels and clear expectations between product areas and the data team is essential to maintaining dataset integrity.
Conclusion
Creating and maintaining data, especially canonical datasets, is a time-intensive process. This overview covers only some of the steps, leaving out considerations such as data storage, format, and update cadence, but it highlights the effort required to produce high-quality, reliable data for widespread organizational use.