AI reliability tools: an overview
Brought to you by the MIDRC AI Reliability Working Group.
Last updated March 23, 2026
[Image: AI reliability tool. Credit: MIDRC AI Reliability Working Group]
Selected literature
- Citation: Drukker, Karen, Weijie Chen, Judy Gichoya, Nicholas Gruszauskas, Jayashree Kalpathy-Cramer, Sanmi Koyejo, Kyle Myers et al. "Toward fairness in artificial intelligence for medical image analysis: identification and mitigation of potential biases in the roadmap from data collection to model deployment." Journal of Medical Imaging 10, no. 6 (2023): 061104.
- Citation: Banerjee, Imon, Kamanasish Bhattacharjee, John L. Burns, Hari Trivedi, Saptarshi Purkayastha, Laleh Seyyed-Kalantari, Bhavik N. Patel, Rakesh Shiradkar, and Judy Gichoya. "‘Shortcuts’ causing bias in radiology artificial intelligence: causes, evaluation and mitigation." Journal of the American College of Radiology (2023).
- Citation: Zong, Yongshuo, Yongxin Yang, and Timothy Hospedales. "MEDFAIR: Benchmarking fairness for medical imaging." arXiv preprint arXiv:2210.01725 (2022).
- Citation: Brown, Alexander, Nenad Tomasev, Jan Freyberg, Yuan Liu, Alan Karthikesalingam, and Jessica Schrouff. "Detecting shortcut learning for fair medical AI using shortcut testing." Nature Communications 14, no. 1 (2023): 4314.
- Citation: Guldogan, Ozgur, Yuchen Zeng, Jy-yong Sohn, Ramtin Pedarsani, and Kangwook Lee. "Equal improvability: A new fairness notion considering the long-term impact." arXiv preprint arXiv:2210.06732 (2022).
- Citation: Xiao, Yuxin, Shulammite Lim, Tom Joseph Pollard, and Marzyeh Ghassemi. "In the Name of Fairness: Assessing the Bias in Clinical Record De-identification." In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pp. 123-137. 2023.
- Citation: DeGrave, Alex J., Joseph D. Janizek, and Su-In Lee. "AI for radiographic COVID-19 detection selects shortcuts over signal." Nature Machine Intelligence 3, no. 7 (2021): 610-619.
Selected code
- NIH's NCATS challenged participants to create solutions that detect bias in AI/ML models used in clinical decision-making. Note that the submitted solutions were not necessarily related to medical imaging.
url: https://www.expeditionhacks.com/nih-bias-detection-gallery
- Check back soon.
MIDRC-developed code
- MIDRC REACT (representativeness exploration and comparison tool) compares the representativeness of biomedical datasets. Using the Jensen-Shannon distance (JSD), it quantifies how closely the demographic distributions of a dataset match those of a reference population, and it can monitor representativeness over time by evaluating historical snapshots of the data. MIDRC developed and uses REACT to assess how representative the data in its open data commons are of the US population; users can also generalize it to other representativeness needs, such as comparing demographic distributions across multiple attributes in different biomedical datasets.
Available at https://github.com/MIDRC/MIDRC_Diversity_Calculator
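As a rough illustration of the comparison REACT performs, the sketch below computes the Jensen-Shannon distance between two demographic distributions. The function, category counts, and proportions are all hypothetical examples, not REACT's actual code or data; REACT itself supports richer multi-attribute comparisons.

```python
import math

def js_distance(p, q, base=2.0):
    """Jensen-Shannon distance between two discrete distributions.
    The square root of the JS divergence: 0 = identical; 1 = disjoint (base 2)."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with zero probability contribute 0.
        return sum(ai * math.log(ai / bi, base) for ai, bi in zip(a, b) if ai > 0)

    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Hypothetical proportions over five demographic categories (each sums to 1):
dataset = [0.55, 0.20, 0.15, 0.07, 0.03]   # e.g., an imaging data commons
us_pop  = [0.60, 0.18, 0.13, 0.06, 0.03]   # e.g., a census reference

print(f"JSD = {js_distance(dataset, us_pop):.4f}")
```

A small JSD indicates the dataset's demographic mix closely matches the reference population; tracking this value as data accrue gives the over-time monitoring described above.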
- The generalized stratified sampling tool on GitHub is a resource for researchers looking to implement advanced sampling techniques in medical imaging studies. It offers a framework for stratified sampling, which helps ensure that samples are representative of the subgroups within a dataset. By improving the distribution and representativeness of sampled data, it supports the development of more robust and generalizable models and makes complex imaging datasets easier to analyze and interpret. Read more about this tool in its peer-reviewed publication.
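The core idea behind stratified sampling can be sketched as follows. This is a minimal, generic example with proportional allocation; the function name, interface, and toy data are illustrative assumptions, not the MIDRC tool's actual API.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, n, seed=0):
    """Draw n records while preserving each stratum's share of the dataset.
    `key` extracts the stratum label (e.g., sex, site, scanner manufacturer)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for label, members in strata.items():
        # Proportional allocation: each stratum contributes its dataset share.
        k = round(n * len(members) / len(records))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# Hypothetical toy dataset: 80% of cases from site A, 20% from site B.
data = [{"site": "A"} for _ in range(80)] + [{"site": "B"} for _ in range(20)]
sub = stratified_sample(data, key=lambda r: r["site"], n=10)
```

A simple random sample of 10 could easily over- or under-represent site B; the stratified draw returns 8 site-A and 2 site-B records, mirroring the full dataset.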
- MIDRC-MELODY (Model EvaLuation across subgroups for cOnsistent Decision accuracY) is a free, open-source tool designed to assess the performance and subgroup-level reliability and robustness of AI models developed for medical imaging analysis tasks, such as the estimation of disease severity. It enables consistent evaluation of models across predefined subgroups (e.g., manufacturer, race, scanner type) by computing intergroup performance metrics and corresponding confidence intervals.
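The kind of subgroup analysis described above can be sketched as per-subgroup metric estimation with bootstrap confidence intervals. This sketch uses accuracy and a percentile bootstrap purely for illustration; the function name and interface are assumptions, not MELODY's actual API, and MELODY supports other metrics and interval methods.

```python
import random

def subgroup_accuracy_ci(y_true, y_pred, groups, n_boot=1000, alpha=0.05, seed=0):
    """Per-subgroup accuracy with percentile-bootstrap confidence intervals.
    Returns {group: (accuracy, ci_lower, ci_upper)}."""
    rng = random.Random(seed)
    out = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        acc = sum(y_true[i] == y_pred[i] for i in idx) / len(idx)
        # Resample cases within the subgroup to estimate the metric's spread.
        boots = []
        for _ in range(n_boot):
            bs = [rng.choice(idx) for _ in idx]
            boots.append(sum(y_true[i] == y_pred[i] for i in bs) / len(bs))
        boots.sort()
        lo = boots[int(alpha / 2 * n_boot)]
        hi = boots[int((1 - alpha / 2) * n_boot) - 1]
        out[g] = (acc, lo, hi)
    return out

# Hypothetical example: model is perfect on subgroup A, 50% on subgroup B.
groups = ["A"] * 20 + ["B"] * 20
y_true = [1] * 40
y_pred = [1] * 20 + [1] * 10 + [0] * 10
results = subgroup_accuracy_ci(y_true, y_pred, groups, n_boot=200)
```

Comparing the per-subgroup point estimates and whether their confidence intervals overlap is one simple way to flag inconsistent model behavior across subgroups.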