(Part 2)
In Part 1, we explored how HealthVerity Marketplace delivers high-quality, privacy-compliant real-world data through rigorous validation, the Theseus pipeline, and continuous quality monitoring.
In Part 2, two members of the HealthVerity Data Quality Assurance team, Ike Osuagwu, Manager of Data Quality, and Ellen McCleskey, PhD, Senior Data QA Engineer, discuss the data types that pose the biggest challenges, how HealthVerity handles complex formats like unstructured notes, and what makes the data in HealthVerity Marketplace fundamentally different from that of traditional aggregators.
Why EMR/EHR and chargemaster data are harder to clean
Not all real-world data is structured in the same way. Electronic medical records (EMR), also known as electronic health records (EHR), and chargemaster data are known to be particularly messy, especially when multiple tables must be combined to represent a single patient event.
“We see a lot of variation in EMR data,” said Dr. Ellen McCleskey, Senior Data QA Engineer at HealthVerity. “EMR/EHR, chargemaster, anything where multiple tables coalesce to form one patient event, that’s definitely where you see the messiest of things,” added Ike Osuagwu, Manager of Data Quality.
EMR data can vary widely depending on the provider system, the software used, and the documentation practices of clinicians. These differences must be reconciled before any dataset reaches HealthVerity Marketplace, where clients depend on consistency, completeness, and privacy compliance.
How HealthVerity de-identifies and validates unstructured notes
Unstructured clinician notes introduce a different level of complexity. These free-text entries often contain highly variable language, shorthand, and even direct patient identifiers (Figure 1).
Figure 1: Examples of three clinical notes that might be found in HealthVerity Notes data with PII that must be addressed.
The HealthVerity Notes product was designed to solve this problem by turning unstructured clinical documentation into structured, research-ready data that meets the same high standards as the rest of HealthVerity Marketplace.
“We have doctors that write patient names in notes and you’ll find more de-identification issues the less common something is. We use a privacy filter so that over 97% of the most common distinct entries for a field are filtered out right away.” - Ike Osuagwu, Manager of Data Quality
To handle this challenge, HealthVerity uses a multi-layered approach to privacy and quality:
- Frequency-based filtering helps flag rare or unusual terms that may indicate personally identifiable information.
- Machine learning models are trained on known patient names as well as the context in which they typically appear. For example, the system is able to spot phrases like, “Sarah came in on this day” as likely containing a name.
- Privacy and QA teams work together to review flagged content and continuously refine filtering techniques.
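The frequency-based idea in the first bullet can be illustrated with a minimal Python sketch. This is not HealthVerity's implementation: the function name `split_by_frequency`, the `min_count` threshold, and the sample entries are all hypothetical, chosen only to show how rare values in a field might be separated out for privacy review while common values pass through.

```python
from collections import Counter


def split_by_frequency(values, min_count=3):
    """Partition the distinct values of a field by how often they occur.

    Values seen at least `min_count` times are treated as common; anything
    rarer is flagged for review, since rare entries are more likely to hold
    identifying details. The threshold is an illustrative assumption.
    """
    # Normalize case and whitespace so trivial variants are counted together.
    counts = Counter(v.strip().lower() for v in values if v and v.strip())
    common = {v for v, n in counts.items() if n >= min_count}
    rare = {v for v, n in counts.items() if n < min_count}
    return common, rare


# Hypothetical field values: two frequent clinical terms and one rare entry
# that happens to contain a name.
entries = ["hypertension"] * 5 + ["diabetes"] * 4 + ["Sarah Jones follow-up"]
common, rare = split_by_frequency(entries)
```

In this toy example the rare set contains only the entry with a name in it. A real pipeline would pass such flagged values to further checks, such as the machine learning models and human review described above, rather than rely on frequency alone.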
These protections ensure that clients using unstructured data from HealthVerity Marketplace receive data that is not only de-identified, but also fit for generating real-world evidence.
Why HealthVerity Marketplace delivers BETTER real-world data
Many vendors claim to offer clean, usable real-world data, but the scale and sophistication of the HealthVerity ecosystem provide several critical advantages. “I personally think it’s the volume of data that we have that allows us to compete at a level that other people just can’t,” said Ike Osuagwu.
“What really helps us compete, is that the HVID is so powerful with tracking a patient’s journey from some arbitrary start date all the way through the present day. It’s at a level that’s rarely seen in our competitors.” - Ike Osuagwu, Manager of Data Quality
The HealthVerity ID (HVID) enables accurate, longitudinal patient tracking across open and closed claims, labs, EMR, and even unstructured notes. This makes it possible for clients to build cohorts that reflect real clinical experiences over time, not just isolated encounters.
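To make the idea of longitudinal tracking concrete, here is a small Python sketch that groups records from several sources under a shared de-identified token and orders each patient's events by date. It assumes the token has already been assigned and says nothing about how an HVID is actually generated; the field names (`patient_token`, `source`, `event_date`) and the sample records are hypothetical.

```python
from collections import defaultdict
from datetime import date


def build_timelines(records):
    """Group records by patient token and sort each group chronologically."""
    timelines = defaultdict(list)
    for rec in records:
        timelines[rec["patient_token"]].append(rec)
    for events in timelines.values():
        events.sort(key=lambda r: r["event_date"])
    return dict(timelines)


# Hypothetical records arriving from different data sources, out of order.
records = [
    {"patient_token": "tok_a", "source": "lab", "event_date": date(2023, 5, 2)},
    {"patient_token": "tok_a", "source": "claims", "event_date": date(2022, 11, 14)},
    {"patient_token": "tok_b", "source": "emr", "event_date": date(2023, 1, 9)},
    {"patient_token": "tok_a", "source": "notes", "event_date": date(2023, 5, 3)},
]
timelines = build_timelines(records)
```

Because every source carries the same token for a given patient, the claims, lab, and notes records for `tok_a` end up in one ordered timeline, which is the kind of structure a cohort definition over time would be built on.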
Just as important is the active feedback loop with data suppliers. When potential issues are identified, they are addressed before the data can ever be surfaced in HealthVerity Marketplace.
“A big thing is PII,” said Osuagwu. “If found, it does not make it past the person who found it. It gets pulled out of the data by the privacy team, and then we go back to the vendor letting them know that, ‘Hey, you sent us patient information.’”
This proactive model sets HealthVerity apart. Rather than simply aggregating files, HealthVerity acts as a data steward, ensuring that clients only receive datasets that have been verified and certified to meet the highest standards.
Delivering real-world evidence from complex data types
Whether it comes from EMR systems, clinician notes, or claims databases, real-world data is rarely clean when it first arrives. What matters is how that data is transformed into something trustworthy, usable, and privacy-compliant.
Through expert QA teams, automated validation processes, and proprietary identity technology, HealthVerity delivers structured insights from even the most chaotic sources.
And with every dataset available through HealthVerity Marketplace, clients can access real-world data with full confidence in its accuracy, completeness, and governance.