HealthVerity Blog

Why diagnosis codes alone fall short for long COVID surveillance

Written by HealthVerity | Mar 19, 2026 8:59:59 AM

When our client, a global pharmaceutical company, set out to better understand long COVID pathways, they had a central question to answer: how much of the long COVID population are we missing if we rely on diagnosis codes alone?

That question matters because long COVID rarely follows a clean, uniform path through the healthcare system. Symptoms can persist, evolve and surface across care settings in ways that are not always captured consistently in structured data like closed claims. For teams focused on syndromic surveillance, that creates a real evidence gap.

In real-world data, long COVID (post-acute sequelae of SARS-CoV-2 infection, or PASC) is not consistently captured through diagnosis codes. Instead, it is often documented descriptively in clinical notes, where providers record symptoms, suspected conditions, or ongoing concerns without assigning a formal code. As a result, approaches that rely on structured claims data alone may miss a meaningful portion of the population.

This challenge extends beyond long COVID. In emerging or poorly standardized conditions, diagnosis codes often lag behind clinical reality. Providers may document symptoms and suspected conditions in notes long before coding practices stabilize. As a result, relying on structured data alone can systematically undercount patient populations and delay insight generation.

In this project, we worked with our client to evaluate how closed claims and unstructured clinical notes could be used together to identify and characterize patients with evidence of long COVID more effectively.

The challenge: identifying long COVID in real-world data

Long COVID is difficult to study because documentation patterns vary widely. Some patients receive a formal diagnosis code for post-COVID conditions, while others have symptoms or post-COVID language documented only in clinical notes. Over time, that disconnect introduces misclassification risk and can make surveillance less reliable.

Our client wanted to better understand:

  • How much unstructured data expands case identification

  • Whether diagnosis codes and EMR notes capture the same patients

  • What this means for long COVID surveillance and pathway analysis

The approach to identifying long COVID using claims and clinical notes

We analyzed a synced population of approximately 24.1 million patients with both closed medical claims and HealthVerity Notes across the study period of October 1, 2021 through September 30, 2025.

Within that population, we identified patients with post-COVID evidence based on either:

  • A diagnosis of ICD-10 U09.9

  • A mention of post-COVID-19 (e.g., “long COVID,” “post-COVID,” “PASC”) identified through structured keyword-based extraction of clinical notes
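The two inclusion criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the extraction logic used in the study: the regular expression, the function names and the input shapes (a list of diagnosis codes and a list of note texts per patient) are all assumptions made for the example.

```python
import re

# Hypothetical keyword pattern covering the example terms in the post
# ("long COVID", "post-COVID", "post-COVID-19", "PASC"). The study's
# actual keyword list and matching rules are not described in detail.
POST_COVID_PATTERN = re.compile(
    r"\b(long[\s-]?covid|post[\s-]?covid(?:[\s-]?19)?|pasc)\b",
    re.IGNORECASE,
)


def has_u099_code(diagnosis_codes):
    """Return True if any diagnosis code is ICD-10 U09.9 (dot optional)."""
    return any(code.replace(".", "").upper() == "U099" for code in diagnosis_codes)


def has_post_covid_mention(note_text):
    """Return True if the note contains one of the post-COVID keywords.

    This is a plain keyword match: it does not handle negation
    (e.g. "no evidence of long COVID"), which a real pipeline would
    need to consider.
    """
    return POST_COVID_PATTERN.search(note_text) is not None


def has_post_covid_evidence(diagnosis_codes, notes):
    """Apply the either/or rule: a U09.9 code OR a keyword mention in any note."""
    return has_u099_code(diagnosis_codes) or any(
        has_post_covid_mention(note) for note in notes
    )
```

For example, `has_post_covid_evidence(["J06.9"], ["Pt reports fatigue, suspected long COVID."])` returns `True` on the strength of the note alone, which is the kind of patient a codes-only definition would miss.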

Within the linked population, approximately 1.19% of patients had evidence of long COVID based on either source.

We then measured the overlap between coded diagnoses and note-based evidence over time, with a focus on how each source contributed to overall patient capture.
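One way to picture the overlap measurement is with set operations on patient identifiers for each period. The sketch below assumes each source has already been reduced to a set of patient IDs per year; the function name, the dictionary keys and the toy IDs are hypothetical and do not reflect the study data.

```python
def overlap_shares(coded_ids, note_ids):
    """Summarize how two patient sets overlap.

    coded_ids: patients with a U09.9 diagnosis code in the period
    note_ids:  patients with a post-COVID mention in notes in the period
    """
    coded, noted = set(coded_ids), set(note_ids)
    both = coded & noted
    return {
        "total": len(coded | noted),
        "coded_only": len(coded - noted),
        "notes_only": len(noted - coded),
        "both": len(both),
        # Share of coded patients who also have a note mention
        "share_of_coded_with_note": len(both) / len(coded) if coded else 0.0,
        # Share of note-identified patients who also have a code
        "share_of_noted_with_code": len(both) / len(noted) if noted else 0.0,
    }


# Toy data only: (coded patients, note-identified patients) per year
by_year = {
    2023: ({"p1", "p2", "p3", "p4"}, {"p4", "p5"}),
    2024: ({"p1", "p2", "p6"}, {"p7", "p8"}),
}

trend = {year: overlap_shares(coded, noted) for year, (coded, noted) in by_year.items()}
```

Tracking both directional shares separately matters because the two denominators differ: a small overlap can be a large share of the note-identified group and a small share of the coded group at the same time.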

What we found about long COVID identification in claims and clinical notes

The first takeaway: unstructured notes materially expanded long COVID identification.

More than 242K unique patients had a post-COVID diagnosis code. When we included mentions of post-COVID in EMR notes, the cohort increased to about 287K unique patients. That means note-based evidence added about 45K patients (an increase of ~19%) beyond diagnosis codes alone (Figure 1).
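The arithmetic behind these figures can be checked directly. The sketch below uses the rounded counts quoted in this post (242K, 287K and 24.1 million), so the results are approximate; the function name is illustrative.

```python
def incremental_lift(coded_count, combined_count):
    """Return (patients added by notes, relative increase over codes alone)."""
    added = combined_count - coded_count
    return added, added / coded_count


# Rounded figures quoted in the post
coded = 242_000
combined = 287_000
linked_population = 24_100_000

added, lift = incremental_lift(coded, combined)
prevalence = combined / linked_population

print(f"{added:,} additional patients ({lift:.1%} increase)")
# prints "45,000 additional patients (18.6% increase)"
print(f"{prevalence:.2%} of the linked population")
# prints "1.19% of the linked population"
```

The 18.6% result rounds to the ~19% increase reported above, and the combined cohort is consistent with the 1.19% share of the linked population.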

Figure 1: Cohort attrition; unique patients with a post-COVID diagnosis code U09.9 in clinical notes within 30 days from Source 42 within HealthVerity Marketplace.

 

Diagnosis codes plateaued at approximately 75% of identified patients after U09.9 was introduced, which suggests that providers are not consistently applying the code in routine practice. Over the same period, unstructured notes accounted for 25–30% of patient identification, and overlap between the two sources remained low (~4–6%). Claims and clinical notes therefore capture complementary, not redundant, patient populations (Figure 2).

Figure 2: Proportion of long COVID patients identified by diagnosis codes and clinical notes over time

 

Additionally, the overlap between diagnosis codes and EMR notes declined over time. Among patients with a post-COVID diagnosis, the share who also had a post-COVID mention in EMR notes fell from 8% in 2023 to 3% in 2025. Among patients identified through clinical notes, the share with a diagnosis code dropped from 25% in late 2022 to less than 10% in 2025 (Figure 3).

Figure 3: Trends in diagnosis and clinical notes

That pattern suggests diagnosis codes and clinical notes are increasingly capturing different patient subsets, indicating that long COVID is not consistently formalized in diagnosis coding and is often documented descriptively in clinical practice.

We also saw a demographic signal worth noting. Adults aged 45 and older represented the majority of patients with post-COVID diagnosis or note evidence (Figure 4). That pattern aligns with what we would expect in claims- and EMR-based identification, where older adults typically have higher healthcare utilization and more opportunities for documentation.

Figure 4: Demographics of post-COVID diagnosis or evidence in clinical notes

 

Why this matters for long COVID case identification

For long COVID surveillance, if you want a more comprehensive view of the population, you need more than structured claims alone.

Closed claims bring consistency, longitudinal context and coded clinical events. Unstructured notes add symptom-level detail and post-COVID evidence that may never appear in a diagnosis field. When used together, they create a stronger foundation for surveillance, cohort definition and downstream pathway analysis.

This project also reinforces a broader point for real-world evidence teams. In conditions where coding is uneven or still evolving, unstructured data can change the size and shape of the population you are studying. That affects study design, incidence estimates, care pathway analysis and the confidence you bring to research decisions.

The outcome for long COVID surveillance and cohort strategy

For our client, this work showed that long COVID surveillance improves when closed claims and HealthVerity Notes are used in combination. It also demonstrated that unstructured data is not just additive; it can reveal patient groups that structured sources miss.

That insight gives research teams a better starting point for studying pathways, understanding burden and evaluating how post-COVID conditions appear in routine care.

Long COVID is only one example of what becomes visible when you bring structured and unstructured real-world data together. To see how HealthVerity can help you uncover hard-to-capture patient populations and strengthen syndromic surveillance, explore our approach to privacy-protected real-world data or connect with our team.