(Part 1)
When clients see real-world data (RWD) in HealthVerity Marketplace, it may appear seamless. But behind the scenes, a sophisticated framework of quality assurance is constantly at work, cleaning, validating, and protecting that data from the moment it enters our ecosystem.
To better understand what it takes to deliver clean, privacy-protected data at scale, we sat down with two members of the HealthVerity Data Quality Assurance team: Ike Osuagwu, Manager of Data Quality, and Dr. Ellen McCleskey, PhD, Senior Data QA Engineer. In this first installment of our two-part series, they walk us through the core responsibilities of their team and explain how the HealthVerity proprietary Theseus framework is transforming the speed and reliability of real-world data delivery.
Ensuring real-world data accuracy and privacy at every stage
At HealthVerity, data quality assurance isn’t a final step. It’s a continuous process that begins the moment a new dataset arrives and continues through normalization, enrichment, and privacy review. As Dr. McCleskey points out, “We check the data at all points. We check the source data coming in, and then through normalization, and the final product before it gets to [HealthVerity] Marketplace.”
This work involves a combination of manual review and automated controls. Source files are scanned for format and schema compliance, and every transformation, such as deduplication, standardization, or enrichment, is monitored for unintended anomalies.
Crucially, privacy is always top of mind:
“The big thing is privacy, of course. We ensure that we’re always conforming to the privacy rules. HealthVerity might be more stringent than required, in a good way, to protect patient privacy and ensure full regulatory compliance.”
- Dr. Ellen McCleskey, PhD, Senior Data QA Engineer
Before any dataset becomes discoverable in HealthVerity Marketplace, it is thoroughly vetted to remove or obfuscate protected health information (PHI) and to confirm that it adheres to HIPAA Expert Determination standards. This privacy rigor is integrated into the quality control pipeline rather than added as an afterthought. Check out our article on privacy-preserving record linkage (PPRL) for more information.
How Theseus improves QA and accelerates data delivery
The quality assurance process is powered by Theseus, HealthVerity’s custom-built ETL framework. Theseus is designed to streamline the onboarding of data suppliers while applying rigorous validation at every stage of the pipeline.
Built on a medallion architecture with Bronze, Silver, and Gold layers, Theseus enables the QA team to progressively check for completeness, standardization, and privacy compliance:
- Bronze (Source ingestion): Ensures raw files are complete and match the expected schema.
- Silver (Normalization): Applies deduplication, formatting rules, and other cleanup steps.
- Gold (Curation & privacy): Applies advanced privacy-preserving techniques to ensure HIPAA compliance.
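To make the three layers more concrete, here is a minimal Python sketch of how staged checks like these could be chained. It is not Theseus code: the column names, the function names (`bronze_check`, `silver_normalize`, `gold_protect`), and the salted hash in the Gold step are all assumptions made for illustration. In particular, a salted hash is only a simple stand-in and does not by itself satisfy HIPAA Expert Determination.

```python
import hashlib
from typing import Dict, List

Row = Dict[str, str]

# Hypothetical schema and identifier fields, chosen only for illustration.
EXPECTED_COLUMNS = {"patient_id", "claim_date", "diagnosis_code"}
DIRECT_IDENTIFIERS = {"patient_id"}


def bronze_check(rows: List[Row]) -> List[Row]:
    """Bronze: confirm every raw row has exactly the expected columns."""
    for index, row in enumerate(rows):
        if set(row) != EXPECTED_COLUMNS:
            raise ValueError(f"row {index} does not match the expected schema")
    return rows


def silver_normalize(rows: List[Row]) -> List[Row]:
    """Silver: trim whitespace, uppercase codes, and drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        normalized = {key: value.strip() for key, value in row.items()}
        normalized["diagnosis_code"] = normalized["diagnosis_code"].upper()
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(normalized)
    return cleaned


def gold_protect(rows: List[Row], salt: str) -> List[Row]:
    """Gold: replace direct identifiers with salted SHA-256 digests.

    This is an illustrative stand-in, not a complete de-identification method.
    """
    protected = []
    for row in rows:
        out = dict(row)
        for field in DIRECT_IDENTIFIERS:
            salted = (salt + out[field]).encode("utf-8")
            out[field] = hashlib.sha256(salted).hexdigest()
        protected.append(out)
    return protected


def run_pipeline(rows: List[Row], salt: str) -> List[Row]:
    """Run the three stages in order: Bronze, then Silver, then Gold."""
    return gold_protect(silver_normalize(bronze_check(rows)), salt)
```

Each stage takes the previous stage's output, so a problem caught at Bronze never reaches the normalization or privacy steps.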
The team is currently working on a new version of Theseus that allows for more granular and real-time validation. “Instead of waiting for the entire dataset to load in,” explained Dr. McCleskey, “we can actually run a few tests on each row to make sure the basics are there, and then we can quarantine any [bad] rows while still allowing the rest of it to go through.”
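The row-level idea Dr. McCleskey describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Theseus implementation: the field names (`patient_id`, `service_date`), the two basic checks, and the `validate_rows` helper are hypothetical.

```python
from datetime import date
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, str]
# A check returns an error message, or None when the row passes.
Check = Callable[[Row], Optional[str]]


def has_patient_id(row: Row) -> Optional[str]:
    """Fail rows whose patient_id is missing or blank (hypothetical field)."""
    return None if row.get("patient_id", "").strip() else "missing patient_id"


def has_valid_date(row: Row) -> Optional[str]:
    """Fail rows whose service_date is not an ISO-formatted date."""
    try:
        date.fromisoformat(row.get("service_date", ""))
    except ValueError:
        return "service_date is not an ISO date"
    return None


BASIC_CHECKS: List[Check] = [has_patient_id, has_valid_date]


def validate_rows(
    rows: Iterable[Row], checks: List[Check] = BASIC_CHECKS
) -> Tuple[List[Row], List[Tuple[Row, List[str]]]]:
    """Check rows one at a time; quarantine failures and let the rest through."""
    accepted: List[Row] = []
    quarantined: List[Tuple[Row, List[str]]] = []
    for row in rows:
        failures = []
        for check in checks:
            message = check(row)
            if message is not None:
                failures.append(message)
        if failures:
            # Keep the row together with its reasons so it can be reviewed later.
            quarantined.append((row, failures))
        else:
            accepted.append(row)
    return accepted, quarantined
```

Because each row is judged on its own, a handful of bad records ends up in the quarantine list with the reasons attached, while every valid row continues downstream without waiting for the full dataset.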
This more dynamic approach also translates into faster timelines for data availability. “You’re getting data at higher rates,” Ike added, “and hitting contract dates a lot better.”
“Onboarding used to take anywhere from three weeks to a month (if you’re really, really fast). And now, we’re projecting, with the loading process that is Theseus, that we can get that done, with all the augmentation of setting up the provider included, in about 5 business days.”
- Ike Osuagwu, Manager of Data Quality at HealthVerity
Why clean real-world data must preserve real-life complexity
Clients may think of “clean” data as synonymous with uniformity. But in real-world data, the opposite is often true. True signal lives in the variation. Ike highlights, “When you put that label of ‘clean’ on [real-world data], you expect a very normal distribution. But it’s kind of a beautiful thing that you don’t get really clean data. You get that truth that goes with real life—like high row counts for one month because everyone decided to go to the doctor after COVID or trends like that.”
That kind of natural irregularity, what statisticians might call meaningful heterogeneity, is exactly what makes RWD so valuable for life sciences. The goal of the HealthVerity QA process is to preserve that complexity while standardizing the data to a consistent format and filtering out what doesn’t belong, ensuring data is not only accurate and privacy-compliant but also rich in clinical insight.
The importance of quality assurance and privacy in real-world data research
When data quality and privacy are engineered into the pipeline from day one, researchers and analysts can trust that what they’re seeing reflects real-world behaviors, not artifacts of poor processing. HealthVerity’s linkage algorithms are designed to keep matching accuracy high while protecting personally identifiable information (PII). Find out how our matching metrics work in the matching accuracy metrics deep dive.
Whether you’re conducting HEOR studies, building predictive models, or tracking public health trends, the accuracy and stability of your underlying data can make or break your results.
At HealthVerity, we’re making that trust Verified—with transparent sourcing, rigorous QA, and industry-leading privacy controls built into every dataset delivered via HealthVerity Marketplace.