Good data, bad data, and the art of consensus
For senior data consultant Mark Kaghazgarian, modernizing a data pipeline for a global manufacturing company wasn’t just about AWS architectures or code. It was about defining reality – and getting everyone to agree on it.
Recently, Asteroid’s data expert Mark Kaghazgarian took on a lead role in a massive project for a global manufacturing company. The goal was ambitious: build a scalable, near-real-time data processing pipeline to support IoT monitoring for industrial equipment worldwide.
However, if you ask Kaghazgarian what the most critical aspect of the project was, he doesn’t immediately talk about data compaction or stream processing. Instead, he points to the often-overlooked necessity of establishing a shared architectural reality before writing a single line of new code.
The project’s existing data processing pipeline had evolved over a long period, extending far beyond its original scope. As a result of incremental changes and shifting requirements, the architecture had accumulated significant technical debt. The complexity of the design limited developers’ ability to evaluate alternative approaches or adopt modern data tooling that could deliver better outcomes with lower operational overhead. In this context, any modernization effort required careful planning and a phased introduction of new concepts, ensuring the team could build understanding incrementally while integrating improvements in a controlled, low-risk manner.
The dilemma of good vs. bad data
The client needed to collect raw data from a myriad of sources – devices, sensors, and legacy systems – and refine it into actionable dashboards for their customers. However, Kaghazgarian identified a hurdle that had to be cleared before a single dashboard could be built: the foundation.
"My initial assignment on the project was to define a data quality (DQ) solution," Kaghazgarian explains. "At the outset, the requirement was loosely scoped, and it became clear that 'data quality' was being used as an umbrella term for a wide range of issues. Rather than applying a generic definition, I focused first on identifying concrete pain points that stakeholders were classifying as bad data."
As the core of an end-to-end IoT data processing pipeline, the platform ingested data from multiple providers in heterogeneous formats and served a diverse set of downstream consumers. Beyond common IoT data issues, such as late-arriving events or out-of-range values, Kaghazgarian observed that many reported data quality incidents stemmed from misinterpreted data contracts or historical requirement changes that were no longer well-documented. In several cases, data that was technically valid and correctly processed was being flagged as defective due to a misalignment between producer assumptions, transformation logic, and consumer expectations.
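The article doesn't show the client's actual validation logic, but the two "common IoT data issues" it names — late-arriving events and out-of-range values — can be sketched as simple rule checks. A minimal illustration in Python, with hypothetical field names and thresholds:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical reading shape; real payloads varied by provider.
@dataclass
class SensorReading:
    device_id: str
    temperature_c: float
    event_time: datetime   # when the device recorded the value
    ingest_time: datetime  # when the platform received it

MAX_LATENESS = timedelta(minutes=15)  # illustrative threshold
TEMP_RANGE = (-40.0, 125.0)           # illustrative sensor limits

def quality_issues(reading: SensorReading) -> list[str]:
    """Return the rule violations found in a single reading."""
    issues = []
    # Late-arriving event: recorded long before it reached the platform.
    if reading.ingest_time - reading.event_time > MAX_LATENESS:
        issues.append("late_arrival")
    # Out-of-range value: outside the sensor's physical limits.
    lo, hi = TEMP_RANGE
    if not lo <= reading.temperature_c <= hi:
        issues.append("out_of_range")
    return issues
```

Checks like these catch the mechanical failures; the contract misalignments the article describes are harder, because such a reading can pass every rule and still be "bad data" to a consumer with different assumptions.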
Based on these findings, he introduced the concept of an agnostic data quality and observability framework. This was designed to be extensible and future-proof rather than tightly coupled to specific tools or schemas. Through a series of workshops with the delivery team and key stakeholders, they explored the intended capabilities, benefits, and operational implications of such a solution.
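The article doesn't detail what "agnostic" and "extensible" looked like in practice. One common way to achieve that decoupling is a pluggable check interface over schema-free records, so new checks can be added without touching the pipeline or binding to a specific tool. A minimal sketch, with all names hypothetical:

```python
from typing import Any, Protocol

# Schema-agnostic record: checks never depend on a fixed table layout.
Record = dict[str, Any]

class Check(Protocol):
    """A pluggable data quality check, decoupled from any specific tool."""
    name: str
    def evaluate(self, record: Record) -> bool: ...

class RequiredField:
    def __init__(self, field: str):
        self.name = f"required:{field}"
        self.field = field
    def evaluate(self, record: Record) -> bool:
        return record.get(self.field) is not None

class Bounded:
    def __init__(self, field: str, lo: float, hi: float):
        self.name = f"bounded:{field}"
        self.field, self.lo, self.hi = field, lo, hi
    def evaluate(self, record: Record) -> bool:
        value = record.get(self.field)
        return isinstance(value, (int, float)) and self.lo <= value <= self.hi

def run_checks(record: Record, checks: list[Check]) -> dict[str, bool]:
    """Evaluate every registered check; results can feed any observability sink."""
    return {check.name: check.evaluate(record) for check in checks}
```

Because the interface is just "record in, verdict out", swapping in a new check, a new data source, or a different observability backend doesn't require changing the framework itself — which is the future-proofing the workshops were weighing.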
The blueprint for consensus
Kaghazgarian saw that the organization had plenty of ambition but needed a structured way to navigate the complexity. He organized a series of workshops designed not just to plan features, but to teach the team how to evaluate data quality.
"You can’t just guess at business requirements," Kaghazgarian notes. "We needed a rigorous process to ensure the foundation was solid."
As a continuation of this effort, the project revisited the existing data processing pipeline to establish how a deterministic architecture could meet current requirements while addressing the identified pain points.
"I conducted a comprehensive assessment of the existing solution, focusing on objective evaluation rather than subjective opinion," Kaghazgarian says. "The analysis was grounded in established data architecture principles and well-architected framework guidelines. The resulting assessment, comprising approximately 40 pages, included metrics, diagrams, and holistic evaluations across multiple dimensions: static code analysis, code ownership, testing coverage, data modeling practices, and observability capabilities."
Based on these findings, Kaghazgarian outlined several pragmatic modernization approaches aimed at introducing determinism into the data pipeline. These proposals were grounded in empirical evidence gathered during the assessment and informed by an in-depth understanding of the team’s operating model. Kaghazgarian shared the completed material with the team and key stakeholders through structured walkthrough sessions to build a shared understanding and assess organizational readiness.
To support this transition, he introduced a simple three-step working framework:
1. Documentation: Capturing architectural intent, decisions, and assumptions to prevent knowledge erosion over time.
2. Communication: Ensuring transparency and inclusion across all parties involved, reducing the risk of siloed decision-making.
3. Consensus: Adopting an iterative, stepwise decision process, ensuring shared understanding and alignment before progressing.
With this framework in place, the team began designing and implementing a modernized data processing stack, capable of meeting existing functional requirements while reducing the identified pain points. The target architecture emphasized standardization, improved maintainability, and predictable behavior – all while remaining within the same operational cost envelope as the legacy solution.
The result: Up to a 90% reduction in operational costs
Once the human and theoretical foundation was set, the technical execution followed. After evaluating the proposed approaches, the client granted approval to proceed with the design and implementation of a new data processing pipeline.
"A key design principle throughout the implementation was developer ergonomics, treating engineers as first-class users of the platform," Kaghazgarian emphasizes. "The solution was intentionally aligned with the team's existing skill sets and workflows, avoiding unnecessary tooling or steep learning curves."
This approach proved effective: after onboarding a single team member, an additional legacy pipeline was migrated independently with minimal guidance, demonstrating the accessibility of the new model.
As the prototype matured, additional capabilities were introduced to achieve functional parity with the legacy pipeline. Development continued under the established documentation, communication, and consensus framework, with continuous stakeholder engagement to validate progress.
The outcome was a fully functional, deterministic data processing pipeline delivering the same business logic as the legacy solution at a significantly reduced operational footprint. Operational costs were reduced by up to 90% through the optimized use of cloud-native services, modern data processing frameworks, and adherence to secure data engineering best practices.
Finally, with the deterministic pipeline in place, the original requirement for data quality monitoring could be revisited from a stronger foundation. Increased confidence in pipeline behavior significantly reduced investigation time within the processing layer itself, allowing data quality analysis to focus on upstream ingestion issues or downstream consumption mismatches.
"My philosophy has always been: get the foundation right, deliver quick wins to build trust, and then modernize incrementally," says Mark Kaghazgarian. "Curiosity and discomfort keep me going, but seeing a team align around a shared understanding of their data? That is where the real value lies."
The article is written in collaboration with a partner of Asteroid.