What we found about long COVID identification in claims and clinical notes
The first takeaway: unstructured notes materially expanded long COVID identification.
More than 242K unique patients had a post-COVID diagnosis code. When we included mentions of post-COVID in EMR notes, the cohort increased to about 287K unique patients. That means note-based evidence added about 45K patients (an increase of ~19%) beyond diagnosis codes alone (Figure 1).

Figure 1: Cohort attrition; unique patients with a post-COVID diagnosis code U09.9 in clinical notes within 30 days from Source 42 within HealthVerity Marketplace.
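The arithmetic behind that incremental yield can be sketched in a few lines. This is a minimal illustration, not the analysis code used in the project: the function name `incremental_yield` is hypothetical, and the inputs are the rounded cohort sizes quoted above, since exact counts are not reported.

```python
def incremental_yield(code_only_count: int, combined_count: int) -> tuple[int, float]:
    """Return (patients added, percent increase over the code-only cohort)."""
    if code_only_count <= 0:
        raise ValueError("code_only_count must be positive")
    added = combined_count - code_only_count
    return added, 100.0 * added / code_only_count


# Rounded figures from the text: ~242K with a diagnosis code, ~287K with codes plus notes.
added, pct_increase = incremental_yield(242_000, 287_000)
print(f"{added:,} patients added ({pct_increase:.1f}% increase)")
# prints: 45,000 patients added (18.6% increase)
```

The percentage is expressed relative to the code-only cohort, which is why 45K additional patients reads as roughly 19% rather than as a share of the combined 287K.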
The share of long COVID patients identified through diagnosis codes plateaued at approximately 75% after U09.9 was introduced, indicating that providers are not consistently applying the code in routine practice. At the same time, unstructured notes accounted for 25–30% of patient identification, and overlap between the two sources remained low (~4–6%), demonstrating that claims and clinical notes capture complementary, not redundant, patient populations (Figure 2).

Figure 2: Proportion of long COVID patients identified by diagnosis codes and clinical notes over time
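One way to see why the two source shares can sum to more than 100% is to treat each source as a set of patient IDs and measure everything against their union. The sketch below is an assumption about how such shares could be computed, not the project's method; `source_breakdown` and the toy patient IDs are hypothetical and were chosen only to land near the proportions described above.

```python
def source_breakdown(code_ids: set[str], note_ids: set[str]) -> dict[str, float]:
    """Share of the combined cohort found by each source, and by both."""
    combined = code_ids | note_ids
    if not combined:
        raise ValueError("at least one source must contain patients")
    n = len(combined)
    return {
        "code_share": len(code_ids) / n,
        "note_share": len(note_ids) / n,
        "overlap_share": len(code_ids & note_ids) / n,
    }


# Toy data (hypothetical): 20 patients in total, 15 with a code, 6 with a note, 1 with both.
code_ids = {f"p{i}" for i in range(15)}       # p0 .. p14
note_ids = {f"p{i}" for i in range(14, 20)}   # p14 .. p19
shares = source_breakdown(code_ids, note_ids)
for name, value in shares.items():
    print(f"{name}: {value:.0%}")
```

Because patients found by both sources are counted in each source's share, the code share plus the note share minus the overlap share always equals 100% of the combined cohort.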
Additionally, the overlap between diagnosis codes and EMR notes declined over time. Among patients with a post-COVID diagnosis, the share who also had a post-COVID mention in EMR notes fell from 8% in 2023 to 3% in 2025. Among patients identified through clinical notes, the share with a diagnosis code dropped from 25% in late 2022 to less than 10% in 2025 (Figure 3).

Figure 3: Trends in diagnosis and clinical notes
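The two overlap trends use different denominators: one is measured among patients with a diagnosis code, the other among patients with a note mention. A minimal sketch of that distinction follows. The function name `directional_overlap`, the grouping by calendar year, and the toy IDs are all illustrative assumptions; the text does not say how patients were assigned to periods.

```python
def directional_overlap(code_ids: set[int], note_ids: set[int]) -> dict[str, float]:
    """Overlap expressed against each source's own patient count."""
    both = code_ids & note_ids
    return {
        # Among patients with a diagnosis code, the share who also have a note mention.
        "coded_with_note": len(both) / len(code_ids) if code_ids else float("nan"),
        # Among patients with a note mention, the share who also have a diagnosis code.
        "noted_with_code": len(both) / len(note_ids) if note_ids else float("nan"),
    }


# Toy cohorts per year (hypothetical IDs, not study data): (code_ids, note_ids).
cohorts_by_year = {
    "2023": (set(range(1, 11)), {10, 11, 12, 13}),
    "2025": (set(range(1, 21)), set(range(20, 30))),
}
trend = {year: directional_overlap(codes, notes)
         for year, (codes, notes) in cohorts_by_year.items()}
for year, result in trend.items():
    print(year, {name: f"{value:.0%}" for name, value in result.items()})
```

The same number of overlapping patients produces two different percentages, so both directions need to be tracked to tell whether codes are missing from noted patients, notes are missing from coded patients, or both.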
That pattern suggests diagnosis codes and clinical notes are increasingly capturing different patient subsets: long COVID is not consistently formalized in diagnosis coding and is often documented descriptively in clinical practice instead.
We also saw a demographic signal worth noting. Adults aged 45 and older represented the majority of patients with a post-COVID diagnosis code or note evidence (Figure 4). That pattern aligns with what we would expect in claims- and EMR-based identification, where older adults typically have higher healthcare utilization and more opportunities for documentation.

Figure 4: Demographics of patients with a post-COVID diagnosis or evidence in clinical notes
Why this matters for long COVID case identification
For long COVID surveillance, if you want a more comprehensive view of the population, you need more than structured claims alone.
Closed claims bring consistency, longitudinal context and coded clinical events. Unstructured notes add symptom-level detail and post-COVID evidence that may never appear in a diagnosis field. When used together, they create a stronger foundation for surveillance, cohort definition and downstream pathway analysis.
This project also reinforces a broader point for real-world evidence teams. In conditions where coding is uneven or still evolving, unstructured data can change the size and shape of the population you are studying. That affects study design, incidence estimates, care pathway analysis and the confidence you bring to research decisions.
The outcome for long COVID surveillance and cohort strategy
For our client, this work showed that long COVID surveillance improves when closed claims and HealthVerity Notes are used in combination. It also demonstrated that unstructured data is not just additive; it can reveal patient groups that structured sources miss.
That insight gives research teams a better starting point for studying pathways, understanding burden and evaluating how post-COVID conditions appear in routine care.
Long COVID is only one example of what becomes visible when you bring structured and unstructured real-world data together. To see how HealthVerity can help you uncover hard-to-capture patient populations and strengthen syndromic surveillance, explore our approach to privacy-protected real-world data or connect with our team.