The foundation for analysis across cohorts: a common data model

One of the objectives of SOPHIA WP2 is to harmonise and standardise obesity cohorts and link them into a federated database system that allows analysis across cohorts. The crucial benefit of a federated database is that the individual cohorts can stay local on edge nodes, while a central server running specialised software learns models across the nodes, preserving privacy and security.

Each cohort is stored in a data model suited to the original purpose of the data collection. For example, most clinical trials are stored in the CDISC format, which is used for reporting to the FDA, while hospitals or academic institutions collecting data might use another data model. Having disparate data models means that an analysis developed on one cohort cannot readily be applied to another.

To make sure that insights can be gathered using all of the available data, two approaches can be taken:

  1. Either adapt each analysis separately to each cohort, or
  2. Transfer each cohort to a common data model, allowing the same analysis to be applied to each cohort

The second approach also allows for federated learning and is the one chosen in SOPHIA.
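To illustrate why a shared data model makes federated analysis possible, here is a minimal sketch in Python. It is not the software used in SOPHIA. The function names, the `LocalSummary` type and the BMI values are all hypothetical, and the "model" is only a pooled mean. The point is that the same summary function can run unchanged on every node because the data have the same shape everywhere, and only aggregates leave the node.

```python
from dataclasses import dataclass


@dataclass
class LocalSummary:
    """Aggregate statistics a node shares instead of row-level data."""
    n: int
    total: float


def summarise_cohort(bmi_values):
    """Runs on an edge node: reduce local records to a count and a sum."""
    return LocalSummary(n=len(bmi_values), total=float(sum(bmi_values)))


def pooled_mean(summaries):
    """Runs on the central server: combine node summaries into one estimate."""
    n = sum(s.n for s in summaries)
    if n == 0:
        raise ValueError("no observations reported by any node")
    return sum(s.total for s in summaries) / n


# Hypothetical BMI measurements held at two separate nodes.
node_a = [31.2, 35.8, 29.9]
node_b = [40.1, 33.0]

# The identical analysis is applied at each node; only summaries are sent on.
summaries = [summarise_cohort(node_a), summarise_cohort(node_b)]
print(round(pooled_mean(summaries), 2))
```

Real federated learning exchanges richer quantities, such as model parameters or gradients, and usually adds further safeguards, but the division of work between nodes and server follows the same pattern.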

The data model that all the cohorts will be transformed into is the OMOP Common Data Model (CDM), maintained by the OHDSI community. The CDM describes both the structure that the data should have and a common vocabulary for describing the events in the data. A great introduction can be found in The Book of OHDSI, and a tutorial is available on the OHDSI YouTube channel.

The conversion to the OMOP CDM is typically managed by a series of SQL statements. However, the CDM is concerned not only with structure but also with the description of the content. Because different coding schemes are in use, two clinically similar events may be described differently in different data models. The OMOP standardized vocabulary offers a way to harmonise disparate coding systems, which is at least as important for cross-cohort analyses as having a shared data structure.
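The two aspects described above, restructuring with SQL and harmonising codes, can be sketched together in a few lines. The example below uses Python's built-in sqlite3 module and is an illustration under stated assumptions, not an actual SOPHIA or OHDSI conversion script. The `source_diagnosis` and `source_to_standard` tables are hypothetical, the concept ID 1000001 is a placeholder rather than a real OMOP concept, and the target table keeps only a reduced subset of the columns of the CDM `condition_occurrence` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Source records as two cohorts might store them (hypothetical layout).
CREATE TABLE source_diagnosis (
    person_id INTEGER, coding_system TEXT, code TEXT, diagnosis_date TEXT
);
INSERT INTO source_diagnosis VALUES
    (1, 'ICD10',  'E66.9',  '2019-03-01'),
    (2, 'ICD9CM', '278.00', '2012-11-15');

-- Simplified stand-in for the vocabulary lookup; the concept ID is a placeholder.
CREATE TABLE source_to_standard (
    coding_system TEXT, code TEXT, standard_concept_id INTEGER
);
INSERT INTO source_to_standard VALUES
    ('ICD10',  'E66.9',  1000001),
    ('ICD9CM', '278.00', 1000001);

-- Reduced subset of the columns of the CDM condition_occurrence table.
CREATE TABLE condition_occurrence (
    person_id INTEGER, condition_concept_id INTEGER,
    condition_start_date TEXT, condition_source_value TEXT
);

-- One statement both reshapes the data and replaces source codes
-- with the shared standard concept.
INSERT INTO condition_occurrence
SELECT s.person_id, m.standard_concept_id, s.diagnosis_date, s.code
FROM source_diagnosis AS s
JOIN source_to_standard AS m
  ON m.coding_system = s.coding_system AND m.code = s.code;
""")

rows = conn.execute(
    "SELECT person_id, condition_concept_id, condition_source_value "
    "FROM condition_occurrence ORDER BY person_id"
).fetchall()
print(rows)
```

Both patients end up with the same `condition_concept_id` even though their diagnoses were recorded in different coding systems, so a single query on that column finds them in either cohort. The original code is kept in `condition_source_value` for traceability.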

Open source tools for all stages of data transformation and analysis have been developed by the OHDSI community; here is a list of useful resources.

Finally, an end-to-end example of converting a synthetic database to the OMOP CDM can be found on the OHDSI GitHub.

This project has received funding from the Innovative Medicines Initiative 2 Joint Undertaking (JU) under grant agreement No 875534. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and EFPIA and T1D Exchange, JDRF, and Obesity Action Coalition.

About IMI

The Innovative Medicines Initiative (IMI) is Europe’s largest public-private initiative aiming to improve health by speeding up the development of, and patient access to, innovative medicines, particularly in areas where there is an unmet medical or social need. IMI facilitates collaboration between the key players involved in healthcare research, including universities, the pharmaceutical and other industries, small and medium-sized enterprises (SMEs), patient organisations, and medicines regulators. It is a partnership between the European Union (represented by the European Commission) and the European pharmaceutical industry (represented by EFPIA, the European Federation of Pharmaceutical Industries and Associations). For further information: