Sync forward: Revealing the top data trends for 2024

Top Data Trends of 2024

Given its role at the intersection of novel real-world data and emerging technologies, HealthVerity is uniquely positioned to offer its predictions for the five most important trends that are likely to impact your data strategy in 2024. If you missed our oversubscribed webinar, here’s a recap: 

Trend #5 - Inflation Reduction Act

By mandate of the Inflation Reduction Act (IRA), the Department of Health and Human Services released the list of the first 10 drugs targeted for price reductions and inflation caps, beginning in 2026. The long-awaited list rocked the industry and caused a frenetic wave of activity around how impacted pharmaceutical manufacturers can reallocate investment to either expand indications on named drugs or shift investment to new drugs currently in the pipeline. We believe that the financial uncertainty imposed by the IRA will force pharmaceutical companies to explore two options:

  • Launch more phase IV trials to support new indications for the named drug, expanding the addressable market for prescriptions and the total revenue potential of the drug

  • Conversely, the IRA may discourage those same companies from investing further in named drugs whose existing markets are potentially capped, and instead shift investment to drugs earlier in the pipeline that have a longer runway before IRA consideration

In both cases, the IRA shifts the focus to more clinical trial activity that can be directly improved and even accelerated with real-world data (RWD). Devising an active RWD strategy for clinical trials can reduce the time to FDA submission, enabling drugs to come to market faster and offsetting the potential impact of the IRA.

Trend #4 - Cell and gene therapy

A study in Gene Therapy forecasted that between 18 and 23 gene therapies will be approved between January 2020 and January 2035.1 While the number of projected new therapies might be lower than expected, the real impact of this trend lies in the uncertainty of the economic model for drugs that can have a ten-year treatment and observation period and cost millions of dollars per patient but still fail, making the notion of a single, upfront payer tenuous.

Insurance companies and employers may be reluctant to pay these exorbitant costs at the time of treatment, given the uncertainty of long-term effectiveness and the potential for the patient to leave the employer or health plan that covered the life-saving investment. If the disease recurs three years later, is the insurance company or employer entitled to a refund from the drug manufacturer, or can they expect some form of cost recovery from the downstream employer or health plan? Additionally, there are concerns about comorbidities or side effects forcing the patient to stop taking the medication. This could all lead to the emergence of novel economic models, such as installment payments, risk pools, reinsurance, price-volume agreements, expenditure caps, subscriptions, outcomes-based payments and rebates, warranties, and coverage with evidence development.
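To make one of these models concrete, consider a simple sketch (in Python) of an outcomes-based installment agreement, where annual payments continue only while the therapy keeps working. All figures and names below are hypothetical and purely illustrative:

    # A toy outcomes-based installment model: a $2M therapy paid over five
    # years, with each annual payment contingent on the patient remaining in
    # remission (verified, for example, through RWD monitoring).
    TOTAL_PRICE = 2_000_000
    YEARS = 5
    installment = TOTAL_PRICE / YEARS

    def amount_owed(remission_by_year: list[bool]) -> float:
        # Payments stop at the first year the outcome is not met.
        owed = 0.0
        for in_remission in remission_by_year[:YEARS]:
            if not in_remission:
                break
            owed += installment
        return owed

    # The therapy works for two years, then the disease recurs:
    print(amount_owed([True, True, False, False, False]))  # 800000.0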

RWD can play a critical role in addressing this economic challenge, allowing the patient to be continually monitored even if they’ve departed the health plan or changed their provider. RWD, whether de-identified under HIPAA or even identifiable with patient consent, can provide direct insight on long-term outcomes, as well as the average timeframe for outcomes, validating patient journeys for payers, providers and pharma.

Trend #3 - Clinical trials and RWD/RWE

The drumbeat for the role of RWD and real-world evidence (RWE) in clinical trials is getting louder by the year. FDA Commissioner Robert Califf has been outspoken at many industry conferences regarding the opportunity to integrate RWD into clinical trials, clearly seeking to accelerate the tempo. The pharma industry, however, has been slow to adapt and adopt this approach. In December 2023 alone, the FDA released three notices or requests for guidance related to this topic (Use of Real-World Evidence to Support Regulatory Decision-Making for Medical Devices; Real-World Data: Assessing Registries to Support Regulatory Decision-Making for Drug and Biological Products Guidance for Industry; and Data Standards for Drug and Biological Product Submissions Containing Real-World Data).

Today, RWD is being used to inform study design, select endpoints, identify participants and sites, and conduct post-market safety assessments. While all of this is beneficial and can minimize the time to study, RWD needs to evolve beyond this status quo to be incorporated throughout the trial and contextualize what’s happening with the patients. For example, a patient might see another doctor or specialist during the trial, or they could suffer a heart attack or other event and be treated in the emergency room. If the patient doesn’t share this information with the site investigator, whether because they’re embarrassed, fear they’ll be excused from the study, simply forget, or consider it irrelevant, the absence of this insight leaves a critical gap in understanding the efficacy of the drug in question. RWD offers a reliable and HIPAA-compliant means to that insight, potentially even reducing the time to clinical trial outcomes by further demonstrating the benefits, safety and/or mitigated risks of the drug.
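As a simple illustration of that linkage concept, the sketch below (in Python with pandas) joins de-identified claims to a trial roster to surface emergency room visits that occurred during the trial but may never have been reported to the site. The tokens, column names and dates are hypothetical; this is a minimal example, not HealthVerity’s implementation:

    import pandas as pd

    # Hypothetical trial roster and de-identified claims, already linked
    # by a shared, privacy-preserving patient token.
    trial = pd.DataFrame({
        "token": ["A1", "B2", "C3"],
        "enrolled": pd.to_datetime(["2024-01-05", "2024-01-12", "2024-02-01"]),
    })
    claims = pd.DataFrame({
        "token": ["B2", "B2", "C3"],
        "service_date": pd.to_datetime(["2024-02-20", "2023-11-02", "2024-03-15"]),
        "place_of_service": ["emergency_room", "office", "emergency_room"],
    })

    # Keep only ER visits that occurred after enrollment, i.e., during the trial.
    linked = claims.merge(trial, on="token")
    unreported_er = linked[
        (linked["place_of_service"] == "emergency_room")
        & (linked["service_date"] >= linked["enrolled"])
    ]
    print(unreported_er[["token", "service_date"]])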

The FDA has given the green light for an RWD-heavy approach. In return, it is asking pharma companies to engage early with the FDA to describe their data strategy and review their approach. HealthVerity is an advocate for this trend and believes it will lead to a better outcome for patients, who can more quickly gain market access to therapies that lead to a better quality of life.

Trend #2 - The token is broken

Tokenization, or the means by which patient identity is masked with an alphanumeric string while patient records are still matched across data sources in a HIPAA-compliant manner, is directly related to trend #3. The FDA has gone as far as to provide draft guidance regarding data linkage in Real-World Data: Assessing Electronic Health Records and Medical Claims Data to Support Regulatory Decision-Making for Drug and Biological Products:

  • Protocols involving internal and external data linkages should describe each data source and the accuracy and completeness of data linkages over time

  • Probabilistic and deterministic approaches to data linkage may result in different linkage quality

  • Demonstrate whether and how data from different sources can be obtained and integrated with acceptable quality

  • Specific attention to data curation, including individual-level and population-level linkages and an understanding of many-to-one and one-to-one linkage, is fundamental to assessing a new data linkage

This guidance gets to the heart of understanding the approaches to tokenization and ensuring the accuracy and quality of the RWD. Accurately resolving patient identities across data sources is critical for building a comprehensive view of the patient journey and making informed correlations; however, because the data is de-identified and RWD carries inherent noise, such as nicknames, misspellings and missing fields, this can be a challenge.

As mentioned in the FDA guidance, there are two methods for linking data: probabilistic and deterministic. The legacy deterministic approach, employed by most established vendors, generally requires an exact match of the PII to create a valid token. Therefore, if a patient is listed as Andrew in the EMR but as Andy in the medical claims being linked, two different tokens will be generated and the patient journey will be incomplete due to the failed token match. This error is known as a false negative and creates a fragmented view of that patient’s journey. Deterministic approaches tend to have an average false negative rate of 15% to 20%, falsely inflating patient counts with extra tokens while leaving important gaps in patient records.
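To make that failure mode concrete, here is a minimal sketch of a deterministic token in Python. The field choices and hashing scheme are illustrative assumptions, not HealthVerity’s or any vendor’s actual algorithm:

    import hashlib

    def deterministic_token(first_name: str, last_name: str, dob: str) -> str:
        # Normalize the PII fields, then hash them into an irreversible token.
        # Any variation in the input yields a completely different token.
        normalized = f"{first_name.strip().lower()}|{last_name.strip().lower()}|{dob}"
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    # The same patient appears under two names in two data sources:
    token_emr = deterministic_token("Andrew", "Smith", "1980-04-12")
    token_claims = deterministic_token("Andy", "Smith", "1980-04-12")
    assert token_emr != token_claims  # two tokens, one patient: a false negative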

Probabilistic approaches use machine learning and other techniques to compare the slight deviations created by the noise in healthcare data to determine that non-exact records are still the same patient. Refining its probabilistic approach, HealthVerity has been able to achieve 10x greater accuracy than legacy tokenization, resulting in the broad synchronization of patient identities over time and across data sources. This is why HealthVerity believes the token (as we know it) is broken and that traditional tokenization fails to uphold the standard required of clinical trials. Only the HealthVerity ID offers the characteristics that meet the FDA standard.
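By contrast, a toy probabilistic comparison scores each PII field for similarity and combines the scores with weights, linking records whose overall score clears a tuned threshold. Production systems rely on trained models and far richer features; the fields, weights and threshold below are illustrative assumptions only:

    from difflib import SequenceMatcher

    def field_similarity(a: str, b: str) -> float:
        # Fuzzy string similarity in [0, 1].
        return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

    def match_probability(rec_a: dict, rec_b: dict) -> float:
        # Weighted combination of per-field similarities; weights are illustrative.
        weights = {"first_name": 0.3, "last_name": 0.4, "dob": 0.3}
        return sum(w * field_similarity(rec_a[f], rec_b[f]) for f, w in weights.items())

    emr = {"first_name": "Andrew", "last_name": "Smith", "dob": "1980-04-12"}
    claims = {"first_name": "Andy", "last_name": "Smith", "dob": "1980-04-12"}

    # A score above a tuned threshold links the records despite the nickname.
    if match_probability(emr, claims) > 0.85:
        print("Same patient: link the records under one identity")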

Quality of the data sources is also noted in the guidance. This topic is known as data provenance, or the ability to cite the origin of the RWD, as well as the validity of the data source. Per their contracts, certain data aggregators or resellers are not permitted to share where the data comes from beyond providing the source type, such as medical claims. With HealthVerity Marketplace, clients know the data source and how the data was obtained. Both clients and the FDA can be put in contact with the data owner to ensure validity and provenance. This important distinction is critical to any clinical trial-driven RWD strategy.

Tokenization, however, is not the end game but only the first step. Clients need to implement a tokenization strategy that lets them exchange RWD when, where and how they need it. Beyond tokenization, data synchronization provides the accuracy and provenance needed to exchange near-limitless data quickly, efficiently and in a HIPAA-compliant, research-ready manner.

Trend #1 - Artificial intelligence

While artificial intelligence (AI) is all the rage, it is still early days and HealthVerity encourages pharma companies to walk, not run. There are already issues with data quality from public sources, not to mention the uncertainty of black box models, expensive compute processes and unclear outputs that will take years to resolve. Following are a few particular concerns and considerations:

Hallucinations and drift - Hallucinations occur when a large language model (LLM) perceives patterns or objects that are nonexistent. A recent example in the news: a lawyer cited fake cases in his court filing after using AI to write the document, the AI having hallucinated cases that didn’t exist.2 Drift is degradation caused by a mismatch between the training data and the data run through the model over time; as more data gets processed, the model drifts and produces different answers. A study by Stanford University and UC Berkeley evaluated the March 2023 and June 2023 versions of GPT-3.5 and GPT-4, having each complete several tasks, and found that performance varied greatly over time. For example, the March 2023 version was able to identify prime versus composite numbers with 84% accuracy; just three months later, accuracy had fallen to 51%.3

To mitigate these challenges:

      • Use high-quality training data - Synthetic or publicly available data may be inexpensive, but high-quality data used to build a model helps to control hallucinations and drift.

      • Define the purpose of your AI model - Don’t expect the model to do everything. Pick one outcome as a focus.

      • Use data templates - Define clear data templates for what kind of data the model will consume.

      • Limit responses - Keep the breadth of responses focused.

      • Test and refine the system continually - Use A/B testing to identify hallucinations or drift before it’s too late (see the sketch after this list).

      • Rely on human oversight - Even in a digital age, humans are still needed to serve as a safeguard against issues, errors and glitches.
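On the testing point above, a drift check can be as simple as maintaining a fixed, labeled benchmark and rerunning it against each model version. The sketch below uses the prime-versus-composite task from the Stanford/Berkeley study as its benchmark; the alarm threshold and stand-in models are illustrative assumptions:

    import random

    def is_prime(n: int) -> bool:
        # Ground truth for the benchmark labels.
        return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

    # A fixed evaluation set, reused across model versions.
    benchmark = random.sample(range(1_000, 20_000), 500)

    def accuracy(model, benchmark) -> float:
        return sum(model(n) == is_prime(n) for n in benchmark) / len(benchmark)

    def check_for_drift(old_model, new_model, threshold: float = 0.05) -> None:
        old_acc = accuracy(old_model, benchmark)
        new_acc = accuracy(new_model, benchmark)
        if old_acc - new_acc > threshold:
            print(f"Drift alarm: accuracy fell from {old_acc:.0%} to {new_acc:.0%}")

    # Stand-ins for two model versions (a real check would call the LLM API):
    march_model = is_prime                # answers correctly
    june_model = lambda n: n % 2 == 1     # degraded heuristic
    check_for_drift(march_model, june_model)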

Process versus predictions - AI is going to yield very different wins for companies depending on whether it is used to make predictions or manage processes. The process approach includes applications such as back-office workflows and filing. An example would be using AI to complete clinical trial case report forms, an incredibly burdensome task that AI could likely do well with minimal editing to save time. In payer underwriting, AI could be used to review patients’ clinical histories or for claims management and adjudication using benchmarks, something that is already happening today despite some controversy.

There are numerous prediction use cases for life sciences, such as prioritizing which physicians to target, molecule candidate review, as seen with Moderna in developing the COVID vaccine, discovery of rare disease patients absent a diagnosis code, and even predicting the likelihood of success for a clinical trial. On the payer side, AI could be used to predict risk before onboarding members. Obviously, more work needs to be done to be able to trust the outputs, but these are just some examples of how AI can change business as we know it.

Digital rights management - One thorny issue in AI is the concept of digital rights management or intellectual property (IP) management, specifically as it impacts the ownership of data being used to build or power AI solutions. For data owners, the data they provide is their IP; for Labcorp or Quest, for example, lab results are their IP. Typically, when a pharma company or other organization licenses that data for a traditional project, the data is destroyed at the end of the project or a certain retention period. When data is fed into an AI model, however, it’s nearly impossible to ever remove it from the model at the end of a data license. That data is now part of the fabric of the current state of the model and persists over time.

This conundrum can be compared to the early days of digital music, when Napster burst onto the scene and allowed people to freely trade music with no economic consideration for the musicians' IP. Today, we have subscription-based providers, such as Spotify and Apple Music, that use an economic model that pays royalties to the music owner. In addition, once your subscription ends, your access to the music also terminates. Something similar will need to be developed for healthcare data owners and AI, but it’s going to be more challenging because, unlike music, when a healthcare data subscription for AI ends, it’s nearly impossible to claw back the IP from the AI model. We’re already seeing an example of this issue: the New York Times is suing the creators of ChatGPT for using its articles to train chatbots without citations or compensation. Economic models need to evolve to provide value to both sides, data owners and AI owners, balancing the upside of early data inclusion against the long-term value the data adds.

This issue is particularly important because of data provenance. As mentioned earlier, it is important to know where your data comes from and that you can continue to feed your model with high-quality data while dealing with the economic considerations. With the nation’s largest healthcare and consumer data ecosystem, HealthVerity is ideally suited to help both buyers and sellers address this issue.

To learn more about how HealthVerity can help you address these trends and advance the science:

Click here


1 Wong, C.H., Li, D., Wang, N., et al. (2023). The estimated annual financial impact of gene therapy in the United States. Gene Therapy 30, 761-773. https://www.nature.com/articles/s41434-023-00419-9

2 Bohannon, M. (2023). Lawyer Used ChatGPT In Court – And Cited Fake Cases. A Judge Is Considering Sanctions. Forbes. June 8, 2023. https://www.forbes.com/sites/mollybohannon/2023/06/08/lawyer-used-chatgpt-in-court-and-cited-fake-cases-a-judge-is-considering-sanctions/?sh=30c331cf7c7f

3 Chen, L., Zaharia, M., Zou, J. (2023). How Is ChatGPT’s Behavior Changing Over Time? Stanford University and UC Berkeley.