Update section 6 draft #5
Merged
blengerich merged 10 commits into AdaptInfer:main on May 4, 2026
Conversation
Updated sections on evaluation, reliability, deployment, and summary to enhance clarity and structure.
Removed the 'Summary and Outlook' section to streamline content and focus on deployment considerations.
This update enhances the evaluation, reliability, and deployment sections of the document, emphasizing task-centered utility, external validation, and the importance of uncertainty quantification in biomedical foundation models. It also discusses the operational feasibility and staged implementation strategies necessary for effective deployment.
Removed extensive sections on evaluation, reliability, and deployment of biomedical foundation models to streamline content.
Expanded sections on evaluation, reliability, and deployment of biomedical foundation models, detailing task-centered utility, external validation, uncertainty quantification, and operational feasibility.
Add section on integrating foundation models with biomedical data systems.
Removed section on external validation emphasizing the need for multi-cohort and multi-task assessment in biomedical data evaluation.
LZYEIL
reviewed
Apr 16, 2026
LZYEIL
left a comment
Hi Jingyun, this is great work!
Good Things
- Very clear separation of subsections and topics
- Each paragraph is long enough to convey the core insights and short enough to help me (and other readers) follow
- Reference lists are very comprehensive!
Things to Improve
- May need some content on bio foundation models and bio lab automation pipelines for a balanced narrative
- Would be great to have 1-2 informative figures in the benchmark/metrics subsection and the 'Deployment' section
- The current draft's style leans toward a perspective/position paper. Maybe include more summary of what others have done in evaluation/reliability/deployment? (But this depends on what our end goal is with this paper)
hjung-stat
reviewed
Apr 19, 2026
| Evaluation of biomedical foundation models should be task-centered rather than model-centered. The central question is not whether a model scores highly on a generic leaderboard, but whether it improves performance on the biomedical task that actually matters, such as diagnosis, prognosis, survival prediction, retrieval, report generation, or biomarker discovery [@agrawal2025evaluation; @wornow2023shaky]. This requires clearly specifying the intended use case, the target population, the relevant comparator, and the consequences of model errors. Emerging clinical benchmarking paradigms make this point clearly: models excelling on general leaderboards do not necessarily perform best on clinical decision-making tasks, and there remains no single widely accepted benchmark for clinical utility [@sandmann2025deepseek].
| #### Moving Beyond Discriminative Metrics
| Evaluation must also go beyond discrimination alone. Metrics such as AUROC, AUPRC, F1, concordance index, and segmentation overlap are informative, but they are insufficient for high-stakes biomedical use. Methodological reviews in medical imaging emphasize that proper assessment must also incorporate calibration, uncertainty, and other task-specific criteria. A model can appear accurate on average while still producing poorly calibrated or clinically misleading outputs [@kocak2025evaluationmetrics]. The choice of metric should therefore follow the downstream use case: prediction tasks require calibration and risk stratification, retrieval and generation tasks demand expert judgment of quality, and discovery-oriented applications necessitate reproducibility and biological plausibility.
Contributor
For readers less familiar with these metrics, it might help to briefly explain what AUROC or AUPRC capture, or give a short intuitive example.
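To make the requested intuition concrete, here is a minimal Python sketch (my own illustration, not taken from the draft; the toy labels, scores, and function names are invented). AUROC measures *ranking*: the probability that a randomly chosen positive case is scored above a randomly chosen negative one. Calibration measures whether predicted probabilities match observed frequencies. The example shows two models with identical rankings, and therefore identical AUROC, where one is badly miscalibrated:

```python
def auroc(y_true, scores):
    """Probability a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def expected_calibration_error(y_true, probs, n_bins=5):
    """Weighted mean |avg predicted prob - observed positive rate| over bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for _, p in b) / len(b)
        frac_pos = sum(y for y, _ in b) / len(b)
        ece += (len(b) / len(y_true)) * abs(avg_p - frac_pos)
    return ece

# Two models that rank the 8 cases identically; model B pushes every
# probability toward 1, so it is overconfident on the negatives.
y       = [0, 0, 0, 1, 0, 1, 1, 1]
model_a = [0.10, 0.20, 0.30, 0.40, 0.45, 0.60, 0.80, 0.90]
model_b = [0.80, 0.82, 0.84, 0.86, 0.88, 0.90, 0.95, 0.99]

print(auroc(y, model_a), auroc(y, model_b))  # identical: 0.9375 for both
print(expected_calibration_error(y, model_a),
      expected_calibration_error(y, model_b))  # B is markedly worse
```

This is exactly the failure mode the paragraph describes: a leaderboard reporting only AUROC would call the two models equivalent, while a calibration check distinguishes them.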
hjung-stat
reviewed
Apr 19, 2026
| Evaluation must also go beyond discrimination alone. Metrics such as AUROC, AUPRC, F1, concordance index, and segmentation overlap are informative, but they are insufficient for high-stakes biomedical use. Methodological reviews in medical imaging emphasize that proper assessment must also incorporate calibration, uncertainty, and other task-specific criteria. A model can appear accurate on average while still producing poorly calibrated or clinically misleading outputs [@kocak2025evaluationmetrics]. The choice of metric should therefore follow the downstream use case: prediction tasks require calibration and risk stratification, retrieval and generation tasks demand expert judgment of quality, and discovery-oriented applications necessitate reproducibility and biological plausibility.
| #### Addressing Benchmark Contamination
| Benchmark design directly impacts the validity of results. If test sets are contaminated by publicly available pretraining data, reported performance inevitably exaggerates true progress. Pathology benchmarking initiatives have addressed this concern by withholding test data and exposing only an automated evaluation pipeline, explicitly citing contamination risk as the rationale for not releasing test cohorts [@campanella2025clinicalbenchmark]. This issue is especially acute for foundation models, whose massive pretraining corpora make overlap difficult to rule out. Fair evaluation demands comparing foundation models against robust task-specific baselines and smaller domain-adapted models, strictly utilizing contamination-aware protocols [@campanella2025clinicalbenchmark; @agrawal2025evaluation]. Often, the most relevant comparator is not the largest available model, but the strongest clinically realistic alternative.
Contributor
For the first sentence, it might be helpful to briefly clarify why performance is exaggerated. For example, this could happen because the model may recall previously seen cases (memorization) rather than genuinely reasoning.
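Agreed that memorization is the usual mechanism. As a hypothetical illustration (this is not the protocol used in [@campanella2025clinicalbenchmark]; the corpus, function names, and normalization rule below are invented), one simple contamination check is exact-match overlap between held-out test cases and the pretraining corpus via normalized-text hashing. Any overlapping test case may be answered by recall rather than generalization, which is why its score exaggerates true progress:

```python
import hashlib
import re

def fingerprint(text):
    """Hash of lowercased text with punctuation and extra whitespace stripped."""
    norm = re.sub(r"[^a-z0-9 ]", "", text.lower())
    norm = " ".join(norm.split())
    return hashlib.sha256(norm.encode()).hexdigest()

def contamination_rate(test_examples, pretraining_corpus):
    """Fraction of test examples whose fingerprint appears in pretraining data."""
    seen = {fingerprint(doc) for doc in pretraining_corpus}
    hits = sum(1 for ex in test_examples if fingerprint(ex) in seen)
    return hits / len(test_examples)

pretrain = ["Patient presents with chest pain.",
            "MRI shows a lesion in the left lobe."]
test = ["patient presents with chest pain",   # matches after normalization
        "Biopsy confirms adenocarcinoma."]    # genuinely unseen
print(contamination_rate(test, pretrain))  # 0.5
```

Real deduplication pipelines go further (n-gram and near-duplicate matching), but even this exact-match sketch shows why withholding the test set, as the pathology benchmark does, is the more robust safeguard.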
Expanded sections on evaluation, reliability, and deployment of biomedical foundation models, emphasizing task-centered assessment, distribution shift, and operational feasibility.
Author
I updated my forked repo. Not sure if it will sync correctly.
github-actions Bot
pushed a commit
that referenced
this pull request
May 4, 2026
[ci skip] This build is based on d487020. This commit was created by the following CI build and job: https://github.com/AdaptInfer/fm-survey/commit/d4870205b21624400bcf7e506006b03690a062ea/checks https://github.com/AdaptInfer/fm-survey/actions/runs/25328661636