MIDRC Learn.

Representativeness through interoperability.

PART 1.

Accelerating AI development through interoperability.

The development and validation of artificial intelligence and machine learning models that incorporate multimodal data, including data from medical imaging, depend on access to large and representative real-world datasets. Collecting these from scratch can be costly in both time and resources, and the expense can rise steeply with the number of variables collected. Building a new data resource requires institutional agreements, IRB approvals, data transfer infrastructure, and extended accrual periods, and the resulting datasets often lack the linked clinical context needed to train, validate, or interpret AI models.

Interoperability among data commons addresses these costs by enabling researchers to assemble multimodal cohorts from data that have already been collected, curated, and de-identified. Rather than acquiring new data, researchers draw on the investments that contributing institutions and data commons have already made, combining existing resources into datasets that none of the contributing commons could provide on its own. Interoperability between data resources can have particularly high impact in medicine. It is also one of the FAIR principles for scientific data management (Findability, Accessibility, Interoperability, and Reusability), under which interoperability is defined as the ability of data or tools from non-cooperating resources to integrate or work together with minimal effort.

For AI development, several improvements are possible through interoperability.

Easier cohort collection. Once interoperability has been established between two data commons, building a matched multimodal cohort requires only a well-constructed query. A researcher can begin with imaging data in one commons, identify the corresponding subjects in a clinical data commons via privacy-preserving record linkage, and produce an analysis-ready cohort in substantially less time than would be required to acquire new data with the desired multimodal variables.
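The cohort-building step described above can be sketched as a join on a shared linkage token. This is a minimal illustration only, assuming each commons can export rows keyed by a token produced by privacy-preserving record linkage. The field names (`token`, `study_uid`, `diabetes`, `outcome`), the sample rows, and the `build_matched_cohort` helper are hypothetical and do not reflect the actual data models or query interfaces of any commons.

```python
from collections import defaultdict


def build_matched_cohort(imaging_rows, clinical_rows, key="token"):
    """Return one record per subject that appears in both commons.

    Hypothetical helper: every row is a dict carrying a linkage token
    under `key`. No direct identifiers are needed for the join.
    """
    clinical_by_token = {row[key]: row for row in clinical_rows}

    # A subject may have several imaging studies, so group them by token.
    studies_by_token = defaultdict(list)
    for row in imaging_rows:
        studies_by_token[row[key]].append(row["study_uid"])

    cohort = []
    for token in sorted(studies_by_token):
        clinical = clinical_by_token.get(token)
        if clinical is None:
            continue  # imaging exists, but no linked clinical record
        record = dict(clinical)
        record["study_uids"] = studies_by_token[token]
        cohort.append(record)
    return cohort


# Made-up rows, for illustration only.
imaging = [
    {"token": "t-001", "study_uid": "img-A"},
    {"token": "t-001", "study_uid": "img-B"},
    {"token": "t-002", "study_uid": "img-C"},
]
clinical = [
    {"token": "t-001", "diabetes": True, "outcome": "discharged"},
    {"token": "t-003", "diabetes": False, "outcome": "discharged"},
]

cohort = build_matched_cohort(imaging, clinical)
print(cohort)
```

Only subjects present on both sides survive the join, which is why a matched cohort is typically smaller than either source. In this sketch, subject `t-002` (imaging only) and subject `t-003` (clinical only) are dropped.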

Better validation and generalizability. Linked data make it possible to characterize cohorts along clinical and demographic dimensions before training and to evaluate model performance across subgroups defined by attributes (e.g., comorbidity status, treatment history, outcome) that may not be present in the imaging commons. Likewise, imaging characteristics such as imaging device vendor can be used to characterize how image acquisition and processing influence the development and performance of multimodal models.
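Subgroup evaluation of this kind can be sketched in a few lines. The example below assumes a matched cohort in which each record carries a ground-truth label, a model prediction, and attributes drawn from either commons. The field names (`label`, `prediction`, `vendor`, `comorbidity`), the sample records, and the `sensitivity_by_subgroup` helper are hypothetical, and sensitivity is used only as one example metric.

```python
from collections import defaultdict


def sensitivity_by_subgroup(records, attribute):
    """Compute sensitivity (true-positive rate) for each value of `attribute`.

    Hypothetical helper: labels and predictions are encoded as 0 or 1.
    """
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for r in records:
        if r["label"] != 1:
            continue  # sensitivity only considers truly positive cases
        if r["prediction"] == 1:
            true_pos[r[attribute]] += 1
        else:
            false_neg[r[attribute]] += 1

    groups = set(true_pos) | set(false_neg)
    return {g: true_pos[g] / (true_pos[g] + false_neg[g]) for g in groups}


# Made-up records, for illustration only.
records = [
    {"label": 1, "prediction": 1, "vendor": "A", "comorbidity": "yes"},
    {"label": 1, "prediction": 0, "vendor": "A", "comorbidity": "no"},
    {"label": 1, "prediction": 1, "vendor": "B", "comorbidity": "yes"},
    {"label": 0, "prediction": 1, "vendor": "B", "comorbidity": "no"},
]

by_vendor = sensitivity_by_subgroup(records, "vendor")
by_comorbidity = sensitivity_by_subgroup(records, "comorbidity")
print(by_vendor, by_comorbidity)
```

The same function works for an imaging attribute such as `vendor` and for a clinical attribute such as `comorbidity`, which is the point of linking the two sources: subgroups can be defined on variables that neither commons holds alone.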

Last updated July 1, 2026

Preserved privacy protections. Linkage through non-identifying tokens and honest brokers allows matched cohorts to be assembled without personally identifiable information leaving any commons and without controlled-access data leaving its approved enclave.
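One generic way to produce values that link records without revealing identity is a keyed hash computed locally at each site. The sketch below illustrates that general idea and is not a description of the specific linkage method used by any commons. The `make_token` helper, the choice of input fields, the normalization rules, and the way the secret key is shared are all assumptions made for illustration.

```python
import hashlib
import hmac


def make_token(secret_key: bytes, first: str, last: str, dob: str) -> str:
    """Derive a linkage token from identifying fields with HMAC-SHA256.

    Hypothetical helper: fields are normalized so that formatting
    differences between sites do not change the token.
    """
    normalized = "|".join(part.strip().lower() for part in (first, last, dob))
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Placeholder key; how a real key is generated and distributed is assumed here.
key = b"example-shared-secret"

# The same person recorded with different formatting at two sites.
token_site_a = make_token(key, "Jane", "Doe", "1970-01-01")
token_site_b = make_token(key, " jane ", "DOE", "1970-01-01")

# A different person.
token_other = make_token(key, "John", "Doe", "1970-01-01")

print(token_site_a == token_site_b)  # same person, same token
print(token_site_a == token_other)   # different person, different token
```

In a design like this, each site would share only the tokens, so a party comparing them sees which records match without seeing names or dates of birth.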

Lower cost for each new study. Once an interoperability pathway between two commons has been established, it can be reused across many studies and disease areas. The cost of building the pathway is paid once, and each subsequent study that uses it benefits from the existing infrastructure for authentication, matching, and analysis.

More features for the same patients. When imaging data are linked to variables such as laboratory values, comorbidity indices, pharmaceutical records, procedures, and outcomes, AI models can incorporate features that enhance and complement what can be extracted from medical images, and vice versa. The integration of multimodal data has emerged as a promising approach for various clinical tasks, including characterization of disease severity, risk stratification, and treatment selection.

Interoperability between MIDRC and other data commons has been demonstrated through coordination with the National COVID Cohort Collaborative (N3C) and with the NHLBI BioData Catalyst (BDC). In each case, matched cohorts of patients with both imaging data in MIDRC and clinical data in the partner commons were curated using existing data, supported by expanded collaboration and coordinated governance across the contributing resources.