
Update section 6 draft #5

Merged: blengerich merged 10 commits into AdaptInfer:main from Clouddelta:main on May 4, 2026

Conversation

@Clouddelta

No description provided.

Updated sections on evaluation, reliability, deployment, and summary to enhance clarity and structure.
Removed the 'Summary and Outlook' section to streamline content and focus on deployment considerations.
This update enhances the evaluation, reliability, and deployment sections of the document, emphasizing task-centered utility, external validation, and the importance of uncertainty quantification in biomedical foundation models. It also discusses the operational feasibility and staged implementation strategies necessary for effective deployment.
Removed extensive sections on evaluation, reliability, and deployment of biomedical foundation models to streamline content.
Expanded sections on evaluation, reliability, and deployment of biomedical foundation models, detailing task-centered utility, external validation, uncertainty quantification, and operational feasibility.
Add section on integrating foundation models with biomedical data systems.
Removed section on external validation emphasizing the need for multi-cohort and multi-task assessment in biomedical data evaluation.

@LZYEIL left a comment


Hi Jingyun, this is great work!

Good Things

  • Very clear separation of subsections and topics
  • The length of each paragraph is long enough to convey the core insights and short enough to help me (and other audiences) follow along
  • Reference lists are very comprehensive!

Things to Improve

  • May need some content on bio foundation models and bio lab automation pipelines for a balanced narrative
  • Would be great to have 1-2 informative figures on the benchmark/metrics subsection and 'Deployment' section
  • The current draft's style leans toward a perspective/position paper. Maybe include more narrative summarizing what others have done in evaluation/reliability/deployment? (But this depends on what our end goal is with this paper)

Comment thread on content/06.deployment.md (Outdated)
Evaluation of biomedical foundation models should be task-centered rather than model-centered. The central question is not whether a model scores highly on a generic leaderboard, but whether it improves performance on the biomedical task that actually matters, such as diagnosis, prognosis, survival prediction, retrieval, report generation, or biomarker discovery [@agrawal2025evaluation; @wornow2023shaky]. This requires clearly specifying the intended use case, the target population, the relevant comparator, and the consequences of model errors. Emerging clinical benchmarking paradigms make this point clearly: models excelling on general leaderboards do not necessarily perform best on clinical decision-making tasks, and there remains no single widely accepted benchmark for clinical utility [@sandmann2025deepseek].

#### Moving Beyond Discriminative Metrics
Evaluation must also go beyond discrimination alone. Metrics such as AUROC, AUPRC, F1, concordance index, and segmentation overlap are informative, but they are insufficient for high-stakes biomedical use. Methodological reviews in medical imaging emphasize that proper assessment must also incorporate calibration, uncertainty, and other task-specific criteria. A model can appear accurate on average while still producing poorly calibrated or clinically misleading outputs [@kocak2025evaluationmetrics]. The choice of metric should therefore follow the downstream use case: prediction tasks require calibration and risk stratification, retrieval and generation tasks demand expert judgment of quality, and discovery-oriented applications necessitate reproducibility and biological plausibility.
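
For readers who want a concrete handle on these metrics, here is a minimal illustrative sketch (synthetic data and made-up variable names, not drawn from the manuscript): AUROC measures how well predicted scores rank positive cases above negative ones, AUPRC summarizes precision across recall levels and is more informative under class imbalance, and the Brier score is one simple calibration check.

```python
# Illustrative only: toy outcomes and scores showing why discrimination
# metrics (AUROC, AUPRC) should be read alongside a calibration metric.
import numpy as np
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                              # binary outcomes
y_prob = np.clip(0.6 * y_true + rng.normal(0.2, 0.2, 1000), 0, 1)   # toy risk scores

print("AUROC:", roc_auc_score(y_true, y_prob))             # ranking quality
print("AUPRC:", average_precision_score(y_true, y_prob))   # precision across recall levels
print("Brier:", brier_score_loss(y_true, y_prob))          # probability calibration
```

Because AUROC is invariant to any monotone rescaling of the scores, a model can keep a high AUROC while its predicted probabilities drift badly out of calibration, which is exactly the failure mode the paragraph above warns about.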
Contributor


For readers less familiar with these metrics, it might help to briefly explain what AUROC or AUPRC capture, or give a short intuitive example.

Comment thread on content/06.deployment.md (Outdated)

#### Addressing Benchmark Contamination
Benchmark design directly impacts the validity of results. If test sets are contaminated by publicly available pretraining data, reported performance inevitably exaggerates true progress, since the model may simply recall previously seen cases rather than generalize. Pathology benchmarking initiatives have addressed this concern by withholding test data and exposing only an automated evaluation pipeline, explicitly citing contamination risk as the rationale for not releasing test cohorts [@campanella2025clinicalbenchmark]. This issue is especially acute for foundation models, whose massive pretraining corpora make overlap difficult to rule out. Fair evaluation demands comparing foundation models against robust task-specific baselines and smaller domain-adapted models under strictly contamination-aware protocols [@campanella2025clinicalbenchmark; @agrawal2025evaluation]. Often, the most relevant comparator is not the largest available model, but the strongest clinically realistic alternative.
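
As one concrete illustration of the contamination-aware idea (a hypothetical sketch, not the protocol used in [@campanella2025clinicalbenchmark]; all names and strings below are invented), a minimal screen flags test records whose normalized text already appears verbatim in the pretraining corpus:

```python
# Hypothetical exact-match contamination screen. Real protocols would add
# near-duplicate detection (n-gram or embedding similarity) on top of this.
import hashlib

def fingerprint(text: str) -> str:
    # Normalize case and whitespace so trivial formatting differences
    # do not hide an exact overlap, then hash the canonical form.
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Invented stand-ins for a pretraining corpus and a candidate test set.
pretraining_corpus = [
    "Patient presents with acute chest pain and dyspnea.",
    "MRI shows a 2 cm lesion in the left temporal lobe.",
]
test_set = [
    "patient presents with  acute chest pain and dyspnea.",  # matches after normalization
    "Chest X-ray demonstrates right lower lobe consolidation.",
]

pretrain_hashes = {fingerprint(doc) for doc in pretraining_corpus}
contaminated = [doc for doc in test_set if fingerprint(doc) in pretrain_hashes]
print(f"{len(contaminated)} of {len(test_set)} test records overlap the pretraining corpus")
```

Exact-match hashing scales to very large corpora but only catches verbatim overlap, which is one reason withheld test sets remain the stronger safeguard.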
Contributor


For the first sentence, it might be helpful to briefly clarify why performance is exaggerated. For example, this could happen because the model may recall previously seen cases (memorization) rather than genuinely reasoning.

Expanded sections on evaluation, reliability, and deployment of biomedical foundation models, emphasizing task-centered assessment, distribution shift, and operational feasibility.
@Clouddelta
Author

I updated my forked repo. Not sure if it will sync correctly.

@blengerich merged commit d487020 into AdaptInfer:main on May 4, 2026