
BMS solves its clinical study data conformance challenge with a machine learning system

By Aditya Jain and Sagar Madgi

Nov. 30, 2023 | Article | 10-minute read


This article includes contributions from Dan Bachalis, Senior Director, Engineering & Operations, Drug Development IT Data & Analytics at Bristol Myers Squibb.


Pharma companies typically conduct exploratory analytics on legacy clinical studies (or trial data) to uncover insights that could inform current and future trial phases. These analytics may, for example, uncover characteristics of slower or faster treatment responses in patients or efficacy signals in a subpopulation that could be used to develop tailored treatments. Combining legacy studies is a process prone to friction—not only due to the sheer complexity of clinical data but also because legacy studies come in disparate formats and can differ dramatically across therapeutic areas, organizations and study phases. These challenges are further amplified when there are mergers and acquisitions.


This complexity is a natural consequence of data coming from hundreds of trial sites that package data in different formats at different points in time. Insights already captured in formats such as SDTM and ADaM must also be combined with some of the raw trial data in the studies, which adds to the conformance challenge. The challenge spans the entire industry, as sponsors today maintain historical data in multiple legacy formats.

Chasing the promise of a customizable common data model in clinical study analysis

Source studies must conform to a custom common data model (CDM). This process involves mapping source data columns to target data columns (accounting for nuances such as one-to-many and many-to-one mappings) and transforming some data columns to new columns to generate the final study data package.
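To make the mapping step concrete, here is a minimal sketch of how column mappings to a CDM could be represented. It is an illustration under stated assumptions, not the BMS or ZS implementation: the `ColumnMapping` class, the `conform_row` helper and the target column names are hypothetical, and the source column names are borrowed from SDTM-style vital signs data purely as examples.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def passthrough(value):
    """Default transform: copy the single source value unchanged."""
    return value


@dataclass
class ColumnMapping:
    sources: List[str]                        # one or more source columns
    target: str                               # one target column in the CDM
    transform: Callable[..., object] = passthrough


def conform_row(row: Dict[str, object], mappings: List[ColumnMapping]) -> Dict[str, object]:
    """Build one target row by applying every mapping to a source row."""
    return {m.target: m.transform(*(row.get(s) for s in m.sources)) for m in mappings}


# Hypothetical mappings (target names are invented for illustration).
MAPPINGS = [
    # One-to-one: copy the value as is.
    ColumnMapping(["SUBJID"], "subject_id"),
    # One-to-many: the same source column feeds two target columns.
    ColumnMapping(["VSDTC"], "visit_date", lambda v: v[:10]),
    ColumnMapping(["VSDTC"], "visit_year", lambda v: int(v[:4])),
    # Many-to-one: two source columns are combined into one target column.
    ColumnMapping(["VSORRES", "VSORRESU"], "result_with_unit", lambda v, u: f"{v} {u}"),
]

source_row = {
    "SUBJID": "1001",
    "VSDTC": "2019-03-14T09:30",
    "VSORRES": "120",
    "VSORRESU": "mmHg",
}
conformed = conform_row(source_row, MAPPINGS)
```

Keeping the mappings as data rather than hard-coded logic is one way to let a reviewer accept, reject or edit each recommendation individually.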


Pharma organizations have tried to manage this process using multiple tools and technologies, with varying degrees of success. The broader discipline, known as Intelligent Data Management (IDM), continues to be a priority for pharma R&D organizations.

The data conformance challenge at Bristol Myers Squibb (BMS)

Like its peers, Bristol Myers Squibb (BMS) has a massive repository of historical studies, which it mines on an ongoing basis to uncover insights that could guide clinical and medical decisions. The conformance process was a bottleneck, however—a typical study took three weeks or more to be conformed, which slowed the downstream exploratory analysis phase. BMS identified a strong need to fully automate the conformance process, while building in the ability to handle regular refreshes.

FIGURE 1: Studies analyzed by BMS represent different data formats

To achieve this, BMS worked with a third-party partner to create a sophisticated rules-based solution to map clinical studies to a custom CDM. The solution did not address BMS’s challenges, and the resulting low adoption further slowed the process of deriving study insights. That’s when BMS partnered with ZS to rethink the approach to these underlying challenges.

How ZS took a phased approach to the BMS challenge

The key issue at hand was developing a solution that could handle the sheer diversity of studies, across both legacy studies and those in the pipeline. After analyzing a sample of mapped legacy studies, the ZS team made four observations.


Structural features could act as accelerants. We found structural features (such as syntactic and semantic similarities) between the source and destination common data models, which we knew could help accelerate conformance. While syntactic similarity allowed us to map the more straightforward scenarios, semantic (meaning-based) similarity helped us map the variations observed across diverse data sets.
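The difference between the two kinds of similarity can be illustrated with a short sketch. This is an assumption-laden example, not the method used in the project: syntactic similarity is approximated with the standard library's `difflib.SequenceMatcher`, and semantic similarity is stood in for by a small hand-written synonym table (`SYNONYMS`, hypothetical), where a production system would more likely use a learned model.

```python
import re
from difflib import SequenceMatcher

# Hypothetical synonym table standing in for a real semantic model.
SYNONYMS = {
    "sex": "gender",
    "dob": "birth_date",
    "brthdtc": "birth_date",
    "subjid": "subject_id",
}


def normalize(name: str) -> str:
    """Lowercase a column name and collapse punctuation and spaces to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def syntactic_similarity(a: str, b: str) -> float:
    """Character-level similarity of the normalized names, between 0 and 1."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def semantic_similarity(a: str, b: str) -> float:
    """1.0 when both names resolve to the same concept in SYNONYMS, else 0.0."""
    concept_a = SYNONYMS.get(normalize(a), normalize(a))
    concept_b = SYNONYMS.get(normalize(b), normalize(b))
    return 1.0 if concept_a == concept_b else 0.0


def name_similarity(source: str, target: str) -> float:
    """Take the stronger of the two signals for a source/target column pair."""
    return max(syntactic_similarity(source, target), semantic_similarity(source, target))
```

In this sketch, "Subject ID" and "subject_id" match on spelling alone, while "SEX" and "gender" share almost no characters and are matched only through the synonym table.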


Data-centric features informed our recommendations. In addition to the structural features, the team explored data-centric features to capture nuances in data types and data values (such as the entropy of the value distribution and the percentage of null or unique values). We weighed these features together to arrive at a final recommendation list.
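The data-centric features named above can be computed with a few lines of standard-library code. The sketch below is illustrative only: the function name `profile_column` and the choice to measure entropy in bits over non-null values are assumptions, not details taken from the project.

```python
import math
from collections import Counter


def profile_column(values):
    """Return simple data-centric features for one column of values.

    Nulls are represented as None. Entropy and uniqueness are computed
    over the non-null values only (an assumption made for this sketch).
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    n = len(non_null)
    counts = Counter(non_null)

    # Shannon entropy of the value distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0

    return {
        "pct_null": 100.0 * (total - n) / total if total else 0.0,
        "pct_unique": 100.0 * len(counts) / n if n else 0.0,
        "entropy_bits": entropy,
    }


profile = profile_column(["M", "F", "M", "F", None])
```

A low-entropy column with few unique values (such as a sex code) profiles very differently from a near-unique identifier column, which is what makes these features useful for separating candidate targets that have similar names.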


The process was neither fully deterministic nor highly stochastic. Mappings were not limited to one-to-one cases, so the process was not entirely deterministic, but it was not highly stochastic either. On average, a reviewer would have to choose between two options, and rarely more than three or four.
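Because a reviewer typically chooses among only a handful of candidates, a recommendation step can be framed as returning a short ranked list per source column. The sketch below assumes each candidate target already has a relevance score between 0 and 1; the function name `shortlist`, the default `k` and the `min_score` cutoff are all hypothetical.

```python
def shortlist(scores, k=3, min_score=0.5):
    """Return up to k candidate targets, best first, dropping weak candidates.

    scores maps each candidate target column to a relevance score in [0, 1].
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [target for target, score in ranked[:k] if score >= min_score]


# Hypothetical scores for one source column.
candidates = {"gender": 0.92, "race": 0.61, "age_years": 0.12, "ethnicity": 0.55}
options = shortlist(candidates, k=3)
```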


A large body of historical data existed. We could mine these mappings to understand the broad patterns involved in the conformance process, considering nuances such as formats, therapy areas, study phases and more.


Given the availability of historical data and that we could leverage structural features in the data, the ZS team recommended implementing a machine learning system to handle the unique challenges that this data mapping presented.

Phase 1: Launching the pilot for the BMS clinical studies conformance using machine learning project

We approached this project in three phases, starting with a pilot that focused on a limited set of studies selected for the diversity of their data and mapping complexity. The idea was to assess whether a machine learning solution could reach a “reasonable” degree of accuracy and cut the time to conformance.


One of the primary questions in the pilot’s design was the selection of the studies that would be used to train and test the AI models. Given the sheer number of permutations of information (domains, therapeutic areas, study phase and more), it was important that the studies we selected were not overly simplistic and would reflect the diversity and complexity of the data in the study ecosystem.


The BMS and ZS teams developed a framework for selecting the studies, domains and therapy areas—ultimately aligning on 30 mapped studies and three of the most complex domains. We used this to build the machine learning algorithm and test its ability to generate mapping and transformation recommendations.


The pilot involved five distinct steps:

  • Creating training and test data sets to ensure the machine learning system could be appropriately trained
  • Using statistical and unsupervised methods for data profiling and generation of data features
  • Modeling to generate both mapping and transformation recommendations using a range of supervised and unsupervised models
  • Validating the accuracy of the recommendations and transformations generated for the test set of studies
  • Designing a new DataOps-centric user interface to bring in process efficiencies
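The validation step can be sketched as a top-k accuracy check against mappings that were already confirmed by hand. The article does not define how accuracy was measured, so the metric below is an assumption chosen for illustration, and the function and variable names are hypothetical.

```python
def top_k_accuracy(recommendations, truth, k=3):
    """Share of source columns whose confirmed target appears in the top k.

    recommendations maps each source column to a ranked list of candidate targets.
    truth maps each source column to its manually confirmed target.
    """
    if not truth:
        return 0.0
    hits = sum(
        1 for column, target in truth.items()
        if target in recommendations.get(column, [])[:k]
    )
    return hits / len(truth)


# Hypothetical held-out study: model output versus confirmed mappings.
recommended = {
    "SEX": ["gender", "race"],
    "AGE": ["age_years"],
    "VSDTC": ["visit_id", "visit_date"],
}
confirmed = {"SEX": "gender", "AGE": "age_years", "VSDTC": "visit_date"}

accuracy_at_1 = top_k_accuracy(recommended, confirmed, k=1)
accuracy_at_2 = top_k_accuracy(recommended, confirmed, k=2)
```

Reporting accuracy at more than one value of k shows how often the reviewer finds the right answer at the top of the list versus somewhere in the shortlist.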

The results from the pilot were extremely encouraging, with nearly 90% accuracy in mapping and transformation recommendations, as well as an efficiency gain of nearly 85%. Even with limited training data, the accuracy and efficiency gains were remarkable.

Phase 2: Productizing the BMS automated metadata mapping solution using machine learning

As a result of the success of the pilot, ZS and BMS moved on to phase two of the project to scale to a full-fledged, production-grade solution with the latest serverless architecture and MLOps compliance. This phase also involved testing the solution on a much larger subset of legacy data with higher diversity before deploying it and making it a part of the current data conformance workflow at BMS.

FIGURE 2: Phase 2 of the project productized and scaled the solution

As we scaled the solution, we aimed to build in flexibility to adapt to new domains and therapy areas without requiring manual fine-tuning of the model. We developed it with these considerations:

  • Transformation execution language (TEL) to give more control to the end users (most of whom were not programmers)
  • A parser and compiler to understand and process the TEL so the model could understand the complex expressions and recommendation steps
  • A mechanism to handle unseen domains so the solution can scale beyond the known or seen set of domains
  • Inference pipeline development and containerized deployment to leverage the latest models and generate recommendations for newer data sets
  • A production-grade deployment setup with different test environments and a continuous integration and continuous delivery (CI/CD) pipeline to run at scale
  • A configuration-driven solution for the database with a no-code update setup to support functionalities ranging from adding a model threshold to adding a new domain or therapy area

Overall, we built an adaptable solution that offers updated guidance based on human responses and set it up to significantly improve the accuracy of recommendations over time.

Phase 3: Establishing ongoing operations for the BMS study data conformance system

After the solution went into production, the ZS and BMS teams onboarded new studies faster, comparing the performance of the new solution against existing workflows to check for data drift and identify possible fixes. The team also added new therapy areas and new domains while solving for novel use cases.


This solution has been in a steady state of operation since 2020, with new studies being added daily and its data consumers expanding to more than 100 users across a variety of functional and therapeutic areas at BMS. It was pressure tested when different formats and clinical standards were introduced as a result of the integration after BMS acquired Celgene. The team was able to standardize this new data using the recommendations generated by the solution’s machine learning model, requiring only minimal changes.


The solution continues to scale to include newer therapeutic areas and domains to support evolving scientific research at BMS, including studies from the immunoscience and cardiovascular therapeutic areas that were added in 2021 and 2022. Each therapeutic area has specific endpoints used to analyze clinical efficacy. Today, the solution standardizes approximately 40 domains that include endpoints such as biomarkers, vitals, the British Isles Lupus Assessment Group Composite Lupus Assessment (BICLA), SF-36 scores and more.

BMS achieving significant efficiency gains and cost savings

After the clinical studies conformance solution went live, BMS saw efficiency gains: the time required to map studies decreased by nearly 85% (Figure 3). This, in turn, led to a roughly threefold increase in the number of conformed studies that could be published, from 27 in 2020 to nearly 80 in 2022, under a fixed year-over-year budget.


The machine-learning-driven solution has added efficiency and speed to the study data conformance process, which now allows the BMS team to accomplish more with fewer resources. According to Dan Bachalis, Senior Director, Engineering & Operations, Drug Development IT Data & Analytics at BMS, “Once the model was created for a therapy area, we could constantly move through the backlog of closed studies. For a completely new therapeutic area, the model can take a couple of weeks to months to conform. Subsequently, we can ingest additional studies in a day to several days, depending on the size. I would estimate that for the same amount of work, BMS would have spent an additional $3 million over the past three years if we had not moved to this machine learning solution.”

FIGURE 3: Average time to map each study decreased with the ZS solution, resulting in a higher number of studies mapped annually

“Our business stakeholders were looking to run exploratory analysis on 200 studies for a specific therapy area and indication,” Bachalis explained. “We had already ingested and conformed half of the studies. The scientists were extremely impressed and frankly weren’t prepared for the data to already be available. They had immediate access to get started on their modeling.” The solution has helped BMS address advanced and novel use cases that would have taken months and now take days to solve, such as:

  • Interrogation of clinical and translational data to identify specific biomarkers that do not respond to immuno-oncology therapy and to help design rational treatments driven by underlying biology
  • Identification of key therapeutic areas where improved protein CTLA-4 in the form of an anti-CTLA-4 probody could lead to enhancements in efficacy-based clinical trial endpoints related to progression-free survival and overall survival

Parts of the solution have found their way into other enterprise projects across BMS, supporting other data management processes. For example, it was modified and used in BMS’s intelligent data profiler to reduce a dependency on data dictionaries, in the intelligent data transfer portal it uses with external vendors, for anomaly detection and in other Intelligent Data Management solutions.


Beyond adding speed and efficiency to the study metadata mapping process, this solution increased confidence at BMS in the company’s ability to deploy and achieve value from an enterprise-grade machine learning system. “Once the data is conformed and usable, all the sexy AI, machine learning and even big data hype isn’t hype anymore,” said Bachalis. “It can become reality.”
