A common data model is the solution for oncology data woes

By Mukul Jain and Athul Nair

Jan. 14, 2022 | Article | 6-minute read

Analyzing oncology data can feel like sitting in a room full of people shouting pertinent and useful information in different dialects, using different rules of speech and completely different vocabularies. Your best chance of discovering any sort of meaning in this noise would be to record the shouting and pore over the transcript for hours. The health data analysts, biopharma professionals and medical experts charged with making sense of oncology information, however, are working against time to help save the lives of cancer patients.


Cancer is a complex and diverse disease with numerous subcategorizations and nuances. Oncology data reflects this variance and intricacy, even before considering variability at the individual patient level. Because of the multitude of procedures, regimens and drugs used in oncology, the real-world data associated with cancer-related claims is represented in markedly inconsistent schemas.

The perils of translating oncology data

Due to this variability, analysts must perform extensive onboarding and knowledge transfer activities to familiarize themselves with the structure and granularity of each data source at their disposal. Even if they successfully navigate the complexities of each individual dataset, they must then correlate and connect the results from the distinct datasets to form an overall picture.


For example, the diagnostic information gathered about a metastasized breast tumor would be very different from the diagnostic information for prostate cancer. Data aggregators take one of two approaches to capture these radically different events:

  1. Use multiple diagnosis tables, each molded to conform to the nuances of one tumor type.
  2. Create one all-encompassing diagnosis table with all the elements required for the tumors concerned.

Analyzing either of these models presents several challenges. Multiple diagnosis tables, each with its own structure, make a consistent, homogeneous analysis extremely difficult. On the other hand, a single diagnosis table bloated with custom elements ends up with NULL (i.e., empty or unknown) values wherever an element is irrelevant to a given condition, requiring extensive data quality checks and filters.


Even without considering the approach to data capture, certain oncological events can present an analytical conundrum, usually because they cannot easily be compartmentalized into a single type of claim. For example, these are just some of the questions that may arise when developing a patient journey based on oncology data:

  • Should a CAR-T-related claim count solely as a procedure, even though the process involves the use of a chemotherapy agent?
  • Should a sampling be referred to by the specimen involved, the actual process of extracting the sample or the measurement or test that it is used for?
  • Even if extra effort is taken to partition compound claims into their various components, how do you maintain the source-level relationship to the main event in the patient journey?

What exactly is the OMOP Common Data Model?

The process of analyzing oncology data can be simplified and enhanced using a modified version of the Observational Medical Outcomes Partnership (OMOP) Common Data Model. The Common Data Model (CDM) is a standardized data schema created by the Observational Health Data Sciences and Informatics program, better known as OHDSI. This open-source community is dedicated to improving the value and quality of healthcare data through innovations in analytics, tools and technologies. With its predefined set of tables and columns, this data model provides a means to standardize any dataset. It also contains a concept table that allows for a centralized repository of terms and vocabulary mentioned in the data.


The OMOP model helps data analysts and consultants by offering data protection, standardization and reuse of concepts and vocabulary, as well as scalability and backward compatibility.


OMOP-generated IDs and dedicated reference tables (used to consolidate repeating details such as patient information) remove data redundancy and make the data more efficient to store and query. OHDSI’s Athena, a repository of commonly used concept names, codes and vocabularies, supports the standardization and reuse of the conceptual information present within the data (for example, procedure names, drugs and lab tests). The OMOP model is scalable and can be applied to a dataset regardless of its size. Perhaps its most important advantage is the ability to reuse code, which sets the foundation for more efficient and economical analytics. Results also become comparable across data sources once those sources are converted to the OMOP format.
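To make the idea of a shared concept repository concrete, here is a minimal sketch of how two sources with different local codes might resolve to the same standard concept. The column names follow the OMOP concept table, but the concept IDs, codes, vocabulary name, source names and the `standardize` helper are all hypothetical placeholders, not real Athena values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    concept_id: int
    concept_name: str
    vocabulary_id: str
    concept_code: str


# Hypothetical concept rows; the IDs and codes are placeholders, not Athena values.
CONCEPTS = [
    Concept(9000001, "Drug A", "DemoVocab", "A-001"),
    Concept(9000002, "Procedure B", "DemoVocab", "B-001"),
]
CONCEPT_BY_ID = {c.concept_id: c for c in CONCEPTS}

# Hypothetical mapping: (source name, local code) -> standard concept_id.
SOURCE_TO_CONCEPT = {
    ("claims_feed", "DRG_A"): 9000001,
    ("ehr_feed", "drug-a-iv"): 9000001,
    ("claims_feed", "PX_B"): 9000002,
}


def standardize(source, local_code):
    """Return the standard concept for a source code, or None if unmapped."""
    concept_id = SOURCE_TO_CONCEPT.get((source, local_code))
    return CONCEPT_BY_ID.get(concept_id)


# Two differently coded source records resolve to one shared concept.
from_claims = standardize("claims_feed", "DRG_A")
from_ehr = standardize("ehr_feed", "drug-a-iv")
```

Because both sources land on the same concept ID, a query written once against that ID can be reused across datasets, which is the code-reuse benefit described above.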

Why the OMOP Common Data Model brings value to oncology data

While the OMOP model benefits the study of most therapy areas, its benefits are most evident in oncology. The CDM brings homogeneity by pivoting wide, source-specific tables down to fit the more generic tabular structures specified by OMOP. OHDSI is working on an official extension to the CDM in its next release with oncology data specifically in mind. For now, though, we can rely on the latest release, OMOP CDM v6.0, and extend it with oncology-specific customizations.
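The pivoting step can be sketched roughly as follows. This is an illustration only: the column names (`her2_status`, `gleason_score`, `psa_level`), the sample values and the generic `attribute`/`value` layout are assumptions made for the example and are simpler than the actual OMOP tables.

```python
def pivot_diagnosis(wide_row, id_fields=("patient_id", "diagnosis_date")):
    """Turn one wide diagnosis row into generic long rows, dropping NULLs."""
    ids = {field: wide_row[field] for field in id_fields}
    long_rows = []
    for attribute, value in wide_row.items():
        # Skip identifier columns and elements that are irrelevant (NULL)
        # for this tumor type.
        if attribute in id_fields or value is None:
            continue
        long_rows.append({**ids, "attribute": attribute, "value": value})
    return long_rows


# Hypothetical rows from a single all-encompassing diagnosis table.
breast_row = {
    "patient_id": 1,
    "diagnosis_date": "2021-03-02",
    "her2_status": "positive",
    "gleason_score": None,
    "psa_level": None,
}
prostate_row = {
    "patient_id": 2,
    "diagnosis_date": "2021-04-10",
    "her2_status": None,
    "gleason_score": 7,
    "psa_level": 6.1,
}

rows = pivot_diagnosis(breast_row) + pivot_diagnosis(prostate_row)
```

Both tumor types end up in the same four-column shape, and the NULL placeholders disappear instead of having to be filtered out in every downstream query.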


One critical aspect of converting oncology data is preserving the source-level mappings and associations. For this, we attach unique universal identifiers to records at the source level and allow them to flow through the various transformations, later using them to relink the partitioned records. By applying these techniques, it’s possible to sort and shuffle diverse and complex oncology data into the orderly OMOP format. 
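A minimal sketch of this tagging-and-relinking idea is shown below. The function names, the `source_record_id` field and the domain labels are hypothetical, chosen only to illustrate how an identifier assigned at the source can survive partitioning; they are not the authors' implementation.

```python
import uuid


def tag_source_record(record):
    """Attach a unique identifier to a record as it arrives from the source."""
    return {**record, "source_record_id": str(uuid.uuid4())}


def partition_claim(claim):
    """Split a compound claim into components that share the source ID."""
    shared = {
        "source_record_id": claim["source_record_id"],
        "patient_id": claim["patient_id"],
    }
    return [
        {**shared, "domain": domain, "detail": detail}
        for domain, detail in claim["components"].items()
    ]


def relink(records):
    """Group partitioned records back together by their source ID."""
    grouped = {}
    for record in records:
        grouped.setdefault(record["source_record_id"], []).append(record)
    return grouped


# Hypothetical compound claim with a procedure part and a drug part.
claim = tag_source_record({
    "patient_id": 1,
    "components": {
        "procedure": "cell infusion",
        "drug": "conditioning chemotherapy",
    },
})
parts = partition_claim(claim)  # rows destined for different tables
linked = relink(parts)          # the same rows, regrouped by source event
```

Each component can be loaded into its own table, yet any of them can be traced back to the original event in the patient journey through the shared identifier.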

A common language for real-world oncology data

Let’s go back to that room full of shouting people. Now let’s imagine that all of these shouters have a defined lexicon, defined grammar and an agenda for discussion. Once we bring some standards and guidelines to this interaction, we can turn the chaos into an orderly conversation. As a result, we can understand what each of them is talking about, extract more useful information and compare and contrast one person’s words with what others are saying. The power of the OMOP Common Data Model is the power to help derive insights more quickly from real-world data to improve health outcomes. 
