From e2b32c495dccd02c1e297e7d1053cf7fb0523a04 Mon Sep 17 00:00:00 2001 From: tang274 <160694996+tang274@users.noreply.github.com> Date: Sat, 11 Apr 2026 21:42:06 -0500 Subject: [PATCH 1/5] Integration with biomedical data systems Expanded the section on integrating foundation models with biomedical data systems, detailing various integration paradigms and their challenges. --- content/05.integrating.md | 50 +++++++++++++++++++++++++++++++++++++-- 1 file changed, 48 insertions(+), 2 deletions(-) diff --git a/content/05.integrating.md b/content/05.integrating.md index ea06f99..9b6a581 100644 --- a/content/05.integrating.md +++ b/content/05.integrating.md @@ -1,4 +1,50 @@ +## Integration with Biomedical Data Systems -## Integrating Foundation Models with Biomedical Data Systems +### Overview -How foundation models interact with structured biomedical data and knowledge sources, including electronic health records, ontologies, and databases. +An important theme in biomedical AI is that model performance depends not only on the model itself, but also on how effectively it interacts with biomedical data systems. In real applications, foundation models rarely operate on a single isolated modality. Instead, they often need to connect unstructured inputs such as clinical text, medical images, or biological sequences with structured and semi-structured resources such as electronic health records (EHRs), ontologies, and curated biomedical databases [@doi:10.1038/sdata.2016.35; @doi:10.1093/nar/gkh061; @doi:10.1093/nar/gkaa1113]. + +This perspective is closely related to the central question of this review: whether biomedical tasks are better addressed by domain-specific foundation models or by adapting general foundation models. 
Domain-specific models are often designed with biomedical data structures in mind, whereas general models typically require additional adaptation strategies, such as prompting, retrieval, or tool use, to interact effectively with structured biomedical information. Therefore, integration with biomedical data systems provides a useful lens for comparing the practical strengths and limitations of these two approaches [@doi:10.1093/jamia/ocae074; @doi:10.1093/jamia/ocae202]. + +Among biomedical data systems, EHRs are especially important in clinical AI. They include structured information such as diagnosis codes, medications, procedures, and laboratory measurements, as well as semi-structured or unstructured components such as clinical notes and discharge summaries. EHR data is inherently longitudinal, sparse, noisy, and irregularly sampled, which makes it substantially different from the text corpora or image datasets on which many foundation models are pretrained [@doi:10.1038/sdata.2016.35]. Biomedical ontologies and controlled vocabularies provide another key layer of structure. Resources such as SNOMED CT, UMLS, and Gene Ontology support normalization, interoperability, and structured reasoning across datasets and institutions [@doi:10.1093/nar/gkh061; @doi:10.1093/nar/gkaa1113; @doi:10.2196/62924]. Biomedical workflows also depend heavily on curated databases and knowledge resources, including literature repositories, population-scale datasets, molecular interaction resources, and disease-specific knowledge bases. In many practical settings, the usefulness of a biomedical model depends not only on what it stores internally, but also on how effectively it can access and use such external resources [@doi:10.1093/jamia/ocaf008]. + +### Integration Paradigms + +Existing work on integrating foundation models with biomedical data systems can be broadly grouped into several paradigms. 
These paradigms are not mutually exclusive, and many practical systems combine more than one of them. Rather than treating them as rigid categories, it is more useful to view them as different ways of connecting learned representations with structured biomedical information [@doi:10.1093/jamia/ocae074]. + +#### Native integration in domain-specific models + +One line of work builds domain-specific models that directly operate on biomedical data structures, especially EHRs. For example, BEHRT adapts the Transformer architecture to longitudinal patient records by representing diagnosis and treatment codes as token sequences while also incorporating visit and demographic information [@doi:10.1038/s41598-020-62922-y]. Med-BERT follows a similar direction, using large-scale structured EHR data and pretraining objectives inspired by masked language modeling to learn reusable patient representations [@doi:10.1038/s41746-021-00455-y]. Related efforts such as CLMBR also focus on learning representations from longitudinal clinical records in ways that preserve temporal and clinical structure rather than forcing patient histories into free-form text [@doi:10.1016/j.jbi.2020.103637]. + +These models illustrate the main advantage of native integration: they encode biomedical structure directly rather than indirectly through generic text interfaces. Temporal patterns, code co-occurrence, and patient history can be modeled more naturally, which is especially helpful when working with longitudinal EHR data [@doi:10.1038/s41598-020-62922-y; @doi:10.1038/s41746-021-00455-y]. However, these benefits come with trade-offs. Such models are often tied to particular coding systems, healthcare institutions, or data formats, and they usually require substantial domain-specific pretraining and infrastructure. As a result, they may be harder to scale across settings than general-purpose models [@doi:10.1016/j.jbi.2020.103637; @doi:10.1093/jamia/ocae074]. 
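As a concrete illustration of this paradigm, the following sketch flattens a longitudinal record of coded visits into a token sequence with parallel visit-position indices, loosely in the spirit of BEHRT. This is a hypothetical simplification, not the published implementation: the function name, special tokens, and toy diagnosis codes are all assumptions.

```python
# Hypothetical sketch of BEHRT-style input construction: each visit's
# diagnosis codes become tokens, visits are separated by a [SEP] marker,
# and a parallel sequence records which visit each token came from
# (age and other embeddings could be handled analogously).

def serialize_visits(visits):
    """Flatten a longitudinal record into (tokens, visit_positions)."""
    tokens, positions = ["[CLS]"], [0]
    for visit_idx, codes in enumerate(visits, start=1):
        for code in codes:
            tokens.append(code)
            positions.append(visit_idx)
        tokens.append("[SEP]")
        positions.append(visit_idx)
    return tokens, positions

# A toy patient with two visits of ICD-style codes (illustrative values).
patient = [["I10", "E11.9"], ["E11.9", "N18.3"]]
tokens, positions = serialize_visits(patient)
```

The visit-position sequence is what lets a Transformer distinguish a code recurring across visits from a code repeated within one visit, which plain text serialization discards.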
+ +#### Adaptation of general foundation models + +A second line of work adapts general-purpose foundation models to biomedical data systems instead of designing new architectures from scratch. In this paradigm, structured biomedical inputs are often serialized into natural language or converted into formats compatible with existing general models. One representative example is Med-PaLM, which adapts a general large language model to the medical domain through instruction tuning and demonstrates how broad language-model capabilities can be specialized for medical reasoning without building a fully domain-specific architecture from scratch [@doi:10.1038/s41586-023-06291-2]. More broadly, this line of work treats biomedical integration as an interface problem: rather than redesigning the model around EHR tables, ontologies, or databases, it reformats biomedical information into something a general model can consume [@doi:10.1038/s41586-023-06291-2; @doi:10.1093/jamia/ocae202]. + +In practice, this adaptation can take several forms. One is prompt-based serialization, where patient records, lab trends, or coded events are rewritten as textual summaries and then fed to a language model for question answering, summarization, or risk assessment. Another is biomedical fine-tuning or instruction tuning, which helps general models better interpret biomedical prompts and produce more domain-appropriate outputs. A third is tool-mediated access, where the model itself remains general-purpose but interacts with external biomedical systems through retrieval, APIs, or auxiliary modules [@doi:10.1038/s41586-023-06291-2; @doi:10.1093/jamia/ocae074]. The appeal of this paradigm lies in flexibility: the same model can often support many tasks and modalities. However, its limitations are equally important. 
Irregular time series, coded clinical variables, and hierarchical schemas are not naturally expressed through prompt text alone, so performance may depend heavily on formatting choices and workflow-level orchestration [@doi:10.1093/jamia/ocae202]. + +#### Retrieval-augmented integration + +A third paradigm combines foundation models with external biomedical knowledge sources at inference time. Rather than relying only on information stored in model parameters, these systems retrieve relevant papers from PubMed, entries from biomedical databases, clinical guidelines, or ontology-linked documents and then condition their outputs on that evidence. This approach is particularly attractive in biomedicine because factual correctness, updateability, and traceability matter more than in many general-purpose settings [@doi:10.1093/jamia/ocaf008; @doi:10.1371/journal.pdig.0000877]. + +Concrete examples already show several variants of this idea. Some systems use literature-grounded retrieval to answer biomedical questions or support clinical decision making with explicit evidence rather than unsupported generation [@doi:10.1093/jamia/ocaf008]. Others augment large language models with external medical knowledge bases so that retrieved facts are injected into the model context at inference time; MKRAG is a representative example in medical question answering [@arxiv:2309.16035]. Retrieval can also be structured rather than purely textual; for example, KG-RAG retrieves from the SPOKE biomedical knowledge graph and uses graph-derived biomedical relations to guide prompt generation [@doi:10.1093/bioinformatics/btae560]. Compared with purely parametric models, retrieval-augmented systems can be more transparent and easier to update. 
At the same time, they create new bottlenecks: the retriever may miss key evidence, retrieved sources may be noisy or conflicting, and the downstream model may still fail to use the retrieved context correctly [@doi:10.1093/jamia/ocaf008; @doi:10.1371/journal.pdig.0000877]. + +#### Graph- and ontology-aware integration + +Another important paradigm is to integrate foundation models with structured biomedical graphs or ontologies. This is especially relevant in domains where diseases, genes, proteins, drugs, and phenotypes are linked through explicit semantic or biological relationships. Instead of flattening everything into plain text, graph- and ontology-aware methods try to preserve relational structure and use it during representation learning or inference [@doi:10.1093/nar/gkh061; @doi:10.1093/nar/gkaa1113]. + +Several types of work fall into this category. Biomedical knowledge graphs can support reasoning over gene-disease-drug associations, while ontologies can constrain concept normalization and improve consistency across datasets [@doi:10.1093/bioinformatics/btae560; @doi:10.2196/62924]. In some systems, graph structure is used as an external source of biomedical facts during generation, as in KG-RAG [@doi:10.1093/bioinformatics/btae560]. In others, large language models are used to help construct or extend graph resources from clinical text and biomedical literature [@arxiv:2301.12473]. These examples show that the interaction between foundation models and biomedical graphs can be bidirectional: graphs can guide models, and models can also help update structured knowledge resources. The attraction of this paradigm is that it preserves domain structure that generic sequence models often ignore. The challenge is that graph resources are frequently incomplete, heterogeneous, and difficult to integrate cleanly with large pretrained models not originally designed for relational reasoning [@doi:10.1093/bioinformatics/btae560; @doi:10.2196/62924]. 
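The graph-aware paradigm can be made concrete with a toy sketch that renders retrieved triples into a prompt, loosely following the KG-RAG pattern. The triple store, relation names, and prompt wording below are illustrative assumptions; KG-RAG itself retrieves from the SPOKE knowledge graph rather than an in-memory list.

```python
# Hypothetical sketch: turn knowledge-graph triples mentioning a query
# entity into a textual context block for a language-model prompt.
# The toy graph and relation names are illustrative, not drawn from SPOKE.

TRIPLES = [
    ("metformin", "treats", "type 2 diabetes"),
    ("metformin", "interacts_with", "AMPK"),
    ("type 2 diabetes", "associated_with", "HNF1A"),
]

def retrieve(entity):
    """Return triples whose subject or object matches the entity (naive exact match)."""
    return [t for t in TRIPLES if entity in (t[0], t[2])]

def build_prompt(question, entity):
    facts = "\n".join(f"- {s} {r.replace('_', ' ')} {o}"
                      for s, r, o in retrieve(entity))
    return f"Known biomedical relations:\n{facts}\n\nQuestion: {question}"

prompt = build_prompt("What does metformin treat?", "metformin")
```

Even in this toy form, the design choice is visible: the relational structure is preserved up to the final prompt-assembly step, rather than being flattened away during pretraining.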
+ +#### Multimodal integration + +Many biomedical applications require models to combine heterogeneous modalities rather than operate on a single input type. Clinical decision making may depend simultaneously on EHR data, physician notes, imaging, pathology slides, molecular profiles, and literature-derived knowledge. Multimodal integration attempts to coordinate these information sources, either by learning joint representations or by combining specialized models into a larger pipeline [@pmid:39321458; @pmid:40754135]. + +This paradigm appears in several forms. In clinical settings, multimodal systems may combine imaging with reports, or EHR trajectories with free-text notes, so that the model reasons over both structured and unstructured evidence [@pmid:39321458]. In research settings, multimodal integration may connect molecular measurements, pathology images, and knowledge-graph information for tasks such as drug discovery or precision medicine [@doi:10.1016/j.drudis.2024.104254]. More recent medical multimodal language-model work also shows how visual and textual biomedical evidence can be brought into a common reasoning framework [@pmid:39321458]. The appeal of multimodal integration is clear because biomedical decision-making is rarely unimodal. At the same time, it is one of the hardest paradigms to implement well: modality alignment is often imperfect, many modalities are missing for large subsets of samples, and evaluation becomes substantially more complicated as the number of interacting components increases [@pmid:40754135]. + + +### Open Challenges + +Despite recent progress, integrating foundation models with biomedical data systems still faces several important limitations. A central challenge is the mismatch between symbolic biomedical structure and continuous neural representations. 
Ontologies, coding systems, and curated databases encode discrete relations, hierarchies, and constraints, but these structures are often only partially preserved when they are mapped into embedding-based models [@doi:10.1093/nar/gkh061; @doi:10.1093/nar/gkaa1113]. This difficulty becomes even more pronounced in temporal settings such as EHR modeling, where clinically meaningful interpretation depends not only on which events occur, but also on their ordering, persistence, and irregular timing. Although domain-specific models such as BEHRT, Med-BERT, and CLMBR move in this direction, longitudinal reasoning remains a substantial challenge for current foundation-model-based systems [@doi:10.1038/s41598-020-62922-y; @doi:10.1038/s41746-021-00455-y; @doi:10.1016/j.jbi.2020.103637]. + +These technical issues are compounded by the characteristics of biomedical data itself. Clinical and biomedical datasets are often incomplete, noisy, and highly heterogeneous across institutions, with different coding practices, uneven population coverage, and non-random missingness [@doi:10.1038/sdata.2016.35; @doi:10.1093/jamia/ocae202]. At the same time, high-stakes biomedical applications place demands on interpretability and accountability that go beyond predictive performance alone. In many settings, users need to understand which records, concepts, retrieved sources, or external systems contributed to a model output, yet system-level integration can make such reasoning harder to trace when multiple modules interact in opaque ways [@doi:10.1038/s44172-025-00453-y]. Finally, these challenges are further constrained by privacy, governance, and access. Large-scale clinical and biomedical data are subject to strict regulatory and institutional controls, which limits both the development of foundation models and the realism of downstream evaluation and deployment [@doi:10.1093/jamia/ocae202; @doi:10.1038/s44387-025-00047-1]. 
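To make the retrieval-augmented paradigm discussed above concrete, the sketch below ranks a toy snippet store against a query by token overlap and assembles an evidence-conditioned prompt. The snippets and the overlap scorer are deliberate simplifications standing in for a dense retriever or BM25 over a real literature index.

```python
# Hypothetical retrieval-augmented sketch: rank snippets by token overlap
# with the query, then prepend the top-k as numbered evidence for a
# downstream generator. Token overlap is a stand-in for a real retriever.

SNIPPETS = [
    "Metformin is a first-line therapy for type 2 diabetes.",
    "Statins lower LDL cholesterol and cardiovascular risk.",
    "SGLT2 inhibitors reduce heart-failure hospitalization in diabetes.",
]

def score(query, doc):
    q = set(query.lower().split())
    d = set(doc.lower().rstrip(".").split())
    return len(q & d)

def retrieve_top_k(query, docs, k=2):
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_context(query, docs):
    evidence = retrieve_top_k(query, docs)
    cited = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(evidence))
    return f"Evidence:\n{cited}\n\nAnswer the question using the evidence: {query}"

context = build_context("first-line therapy for type 2 diabetes", SNIPPETS)
```

The numbered evidence markers illustrate why this paradigm aids traceability: each claim in the output can, in principle, be attributed to a retrieved source, which purely parametric generation cannot offer.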
From 1963461575e40cd4cd43f114e62da1d4e81dc833 Mon Sep 17 00:00:00 2001 From: tang274 <160694996+tang274@users.noreply.github.com> Date: Mon, 13 Apr 2026 23:06:24 -0500 Subject: [PATCH 2/5] Delete open challenges section from integrating.md Removed section on open challenges in multimodal integration. --- content/05.integrating.md | 7 ------- 1 file changed, 7 deletions(-) diff --git a/content/05.integrating.md b/content/05.integrating.md index 9b6a581..ff23010 100644 --- a/content/05.integrating.md +++ b/content/05.integrating.md @@ -41,10 +41,3 @@ Several types of work fall into this category. Biomedical knowledge graphs can s Many biomedical applications require models to combine heterogeneous modalities rather than operate on a single input type. Clinical decision making may depend simultaneously on EHR data, physician notes, imaging, pathology slides, molecular profiles, and literature-derived knowledge. Multimodal integration attempts to coordinate these information sources, either by learning joint representations or by combining specialized models into a larger pipeline [@pmid:39321458; @pmid:40754135]. This paradigm appears in several forms. In clinical settings, multimodal systems may combine imaging with reports, or EHR trajectories with free-text notes, so that the model reasons over both structured and unstructured evidence [@pmid:39321458]. In research settings, multimodal integration may connect molecular measurements, pathology images, and knowledge-graph information for tasks such as drug discovery or precision medicine [@doi:10.1016/j.drudis.2024.104254]. More recent medical multimodal language-model work also shows how visual and textual biomedical evidence can be brought into a common reasoning framework [@pmid:39321458]. The appeal of multimodal integration is clear because biomedical decision-making is rarely unimodal. 
At the same time, it is one of the hardest paradigms to implement well: modality alignment is often imperfect, many modalities are missing for large subsets of samples, and evaluation becomes substantially more complicated as the number of interacting components increases [@pmid:40754135]. - - -### Open Challenges - -Despite recent progress, integrating foundation models with biomedical data systems still faces several important limitations. A central challenge is the mismatch between symbolic biomedical structure and continuous neural representations. Ontologies, coding systems, and curated databases encode discrete relations, hierarchies, and constraints, but these structures are often only partially preserved when they are mapped into embedding-based models [@doi:10.1093/nar/gkh061; @doi:10.1093/nar/gkaa1113]. This difficulty becomes even more pronounced in temporal settings such as EHR modeling, where clinically meaningful interpretation depends not only on which events occur, but also on their ordering, persistence, and irregular timing. Although domain-specific models such as BEHRT, Med-BERT, and CLMBR move in this direction, longitudinal reasoning remains a substantial challenge for current foundation-model-based systems [@doi:10.1038/s41598-020-62922-y; @doi:10.1038/s41746-021-00455-y; @doi:10.1016/j.jbi.2020.103637]. - -These technical issues are compounded by the characteristics of biomedical data itself. Clinical and biomedical datasets are often incomplete, noisy, and highly heterogeneous across institutions, with different coding practices, uneven population coverage, and non-random missingness [@doi:10.1038/sdata.2016.35; @doi:10.1093/jamia/ocae202]. At the same time, high-stakes biomedical applications place demands on interpretability and accountability that go beyond predictive performance alone. 
In many settings, users need to understand which records, concepts, retrieved sources, or external systems contributed to a model output, yet system-level integration can make such reasoning harder to trace when multiple modules interact in opaque ways [@doi:10.1038/s44172-025-00453-y]. Finally, these challenges are further constrained by privacy, governance, and access. Large-scale clinical and biomedical data are subject to strict regulatory and institutional controls, which limits both the development of foundation models and the realism of downstream evaluation and deployment [@doi:10.1093/jamia/ocae202; @doi:10.1038/s44387-025-00047-1]. From 61bcac890374f94f0639cf4f63e0f45ac3c76e21 Mon Sep 17 00:00:00 2001 From: tang274 <160694996+tang274@users.noreply.github.com> Date: Sat, 18 Apr 2026 21:45:32 -0500 Subject: [PATCH 3/5] Simplify the main body delete the outline and simplify the main body of five integrations --- content/05.integrating.md | 34 +++++++++++----------------------- 1 file changed, 11 insertions(+), 23 deletions(-) diff --git a/content/05.integrating.md b/content/05.integrating.md index ff23010..479167b 100644 --- a/content/05.integrating.md +++ b/content/05.integrating.md @@ -1,43 +1,31 @@ -## Integration with Biomedical Data Systems - -### Overview - -An important theme in biomedical AI is that model performance depends not only on the model itself, but also on how effectively it interacts with biomedical data systems. In real applications, foundation models rarely operate on a single isolated modality. Instead, they often need to connect unstructured inputs such as clinical text, medical images, or biological sequences with structured and semi-structured resources such as electronic health records (EHRs), ontologies, and curated biomedical databases [@doi:10.1038/sdata.2016.35; @doi:10.1093/nar/gkh061; @doi:10.1093/nar/gkaa1113]. 
- -This perspective is closely related to the central question of this review: whether biomedical tasks are better addressed by domain-specific foundation models or by adapting general foundation models. Domain-specific models are often designed with biomedical data structures in mind, whereas general models typically require additional adaptation strategies, such as prompting, retrieval, or tool use, to interact effectively with structured biomedical information. Therefore, integration with biomedical data systems provides a useful lens for comparing the practical strengths and limitations of these two approaches [@doi:10.1093/jamia/ocae074; @doi:10.1093/jamia/ocae202]. - -Among biomedical data systems, EHRs are especially important in clinical AI. They include structured information such as diagnosis codes, medications, procedures, and laboratory measurements, as well as semi-structured or unstructured components such as clinical notes and discharge summaries. EHR data is inherently longitudinal, sparse, noisy, and irregularly sampled, which makes it substantially different from the text corpora or image datasets on which many foundation models are pretrained [@doi:10.1038/sdata.2016.35]. Biomedical ontologies and controlled vocabularies provide another key layer of structure. Resources such as SNOMED CT, UMLS, and Gene Ontology support normalization, interoperability, and structured reasoning across datasets and institutions [@doi:10.1093/nar/gkh061; @doi:10.1093/nar/gkaa1113; @doi:10.2196/62924]. Biomedical workflows also depend heavily on curated databases and knowledge resources, including literature repositories, population-scale datasets, molecular interaction resources, and disease-specific knowledge bases. In many practical settings, the usefulness of a biomedical model depends not only on what it stores internally, but also on how effectively it can access and use such external resources [@doi:10.1093/jamia/ocaf008]. 
- -### Integration Paradigms - -Existing work on integrating foundation models with biomedical data systems can be broadly grouped into several paradigms. These paradigms are not mutually exclusive, and many practical systems combine more than one of them. Rather than treating them as rigid categories, it is more useful to view them as different ways of connecting learned representations with structured biomedical information [@doi:10.1093/jamia/ocae074]. +Existing work on integrating foundation models with biomedical data systems can be grouped into several paradigms. These categories are not rigid, and many practical systems combine multiple strategies [@doi:10.1093/jamia/ocae074]. #### Native integration in domain-specific models -One line of work builds domain-specific models that directly operate on biomedical data structures, especially EHRs. For example, BEHRT adapts the Transformer architecture to longitudinal patient records by representing diagnosis and treatment codes as token sequences while also incorporating visit and demographic information [@doi:10.1038/s41598-020-62922-y]. Med-BERT follows a similar direction, using large-scale structured EHR data and pretraining objectives inspired by masked language modeling to learn reusable patient representations [@doi:10.1038/s41746-021-00455-y]. Related efforts such as CLMBR also focus on learning representations from longitudinal clinical records in ways that preserve temporal and clinical structure rather than forcing patient histories into free-form text [@doi:10.1016/j.jbi.2020.103637]. +One approach builds domain-specific models that directly operate on biomedical structures, especially EHRs. BEHRT adapts the Transformer architecture to longitudinal patient records by modeling diagnosis and treatment codes together with visit and demographic information [@doi:10.1038/s41598-020-62922-y]. 
Med-BERT similarly uses large-scale structured EHR data with pretraining objectives inspired by masked language modeling to learn reusable patient representations [@doi:10.1038/s41746-021-00455-y]. Related work such as CLMBR also learns representations from longitudinal clinical records while preserving temporal structure [@doi:10.1016/j.jbi.2020.103637]. -These models illustrate the main advantage of native integration: they encode biomedical structure directly rather than indirectly through generic text interfaces. Temporal patterns, code co-occurrence, and patient history can be modeled more naturally, which is especially helpful when working with longitudinal EHR data [@doi:10.1038/s41598-020-62922-y; @doi:10.1038/s41746-021-00455-y]. However, these benefits come with trade-offs. Such models are often tied to particular coding systems, healthcare institutions, or data formats, and they usually require substantial domain-specific pretraining and infrastructure. As a result, they may be harder to scale across settings than general-purpose models [@doi:10.1016/j.jbi.2020.103637; @doi:10.1093/jamia/ocae074]. +The advantage of this paradigm is strong alignment with biomedical structure: temporal patterns, code co-occurrence, and patient history can be modeled more naturally than through generic text interfaces [@doi:10.1038/s41598-020-62922-y; @doi:10.1038/s41746-021-00455-y]. Its limitation is reduced portability, since these models are often tied to specific coding systems, institutions, or data formats and require substantial domain-specific pretraining [@doi:10.1016/j.jbi.2020.103637; @doi:10.1093/jamia/ocae074]. #### Adaptation of general foundation models -A second line of work adapts general-purpose foundation models to biomedical data systems instead of designing new architectures from scratch. In this paradigm, structured biomedical inputs are often serialized into natural language or converted into formats compatible with existing general models. 
One representative example is Med-PaLM, which adapts a general large language model to the medical domain through instruction tuning and demonstrates how broad language-model capabilities can be specialized for medical reasoning without building a fully domain-specific architecture from scratch [@doi:10.1038/s41586-023-06291-2]. More broadly, this line of work treats biomedical integration as an interface problem: rather than redesigning the model around EHR tables, ontologies, or databases, it reformats biomedical information into something a general model can consume [@doi:10.1038/s41586-023-06291-2; @doi:10.1093/jamia/ocae202]. +A second approach adapts general-purpose foundation models to biomedical data systems instead of designing new architectures from scratch. Structured biomedical inputs are often serialized into natural language or otherwise reformatted into model-compatible forms. Med-PaLM is a representative example, showing how a general large language model can be adapted to medical reasoning through instruction tuning rather than fully domain-specific pretraining [@doi:10.1038/s41586-023-06291-2]. -In practice, this adaptation can take several forms. One is prompt-based serialization, where patient records, lab trends, or coded events are rewritten as textual summaries and then fed to a language model for question answering, summarization, or risk assessment. Another is biomedical fine-tuning or instruction tuning, which helps general models better interpret biomedical prompts and produce more domain-appropriate outputs. A third is tool-mediated access, where the model itself remains general-purpose but interacts with external biomedical systems through retrieval, APIs, or auxiliary modules [@doi:10.1038/s41586-023-06291-2; @doi:10.1093/jamia/ocae074]. The appeal of this paradigm lies in flexibility: the same model can often support many tasks and modalities. However, its limitations are equally important. 
Irregular time series, coded clinical variables, and hierarchical schemas are not naturally expressed through prompt text alone, so performance may depend heavily on formatting choices and workflow-level orchestration [@doi:10.1093/jamia/ocae202]. +In practice, this adaptation may involve prompt-based serialization of patient records, biomedical fine-tuning, or tool-mediated access to external biomedical systems [@doi:10.1038/s41586-023-06291-2; @doi:10.1093/jamia/ocae074]. The main benefit is flexibility, since the same model can support multiple tasks and modalities. However, irregular time series, coded variables, and hierarchical schemas are not naturally captured through prompt text alone, so performance may depend heavily on input formatting and orchestration [@doi:10.1093/jamia/ocae202]. #### Retrieval-augmented integration -A third paradigm combines foundation models with external biomedical knowledge sources at inference time. Rather than relying only on information stored in model parameters, these systems retrieve relevant papers from PubMed, entries from biomedical databases, clinical guidelines, or ontology-linked documents and then condition their outputs on that evidence. This approach is particularly attractive in biomedicine because factual correctness, updateability, and traceability matter more than in many general-purpose settings [@doi:10.1093/jamia/ocaf008; @doi:10.1371/journal.pdig.0000877]. +A third paradigm combines foundation models with external biomedical knowledge sources at inference time. Instead of relying only on parametric memory, these systems retrieve relevant papers, database entries, clinical guidelines, or ontology-linked documents and condition their outputs on that evidence [@doi:10.1093/jamia/ocaf008; @doi:10.1371/journal.pdig.0000877]. -Concrete examples already show several variants of this idea. 
Some systems use literature-grounded retrieval to answer biomedical questions or support clinical decision making with explicit evidence rather than unsupported generation [@doi:10.1093/jamia/ocaf008]. Others augment large language models with external medical knowledge bases so that retrieved facts are injected into the model context at inference time; MKRAG is a representative example in medical question answering [@arxiv:2309.16035]. Retrieval can also be structured rather than purely textual; for example, KG-RAG retrieves from the SPOKE biomedical knowledge graph and uses graph-derived biomedical relations to guide prompt generation [@doi:10.1093/bioinformatics/btae560]. Compared with purely parametric models, retrieval-augmented systems can be more transparent and easier to update. At the same time, they create new bottlenecks: the retriever may miss key evidence, retrieved sources may be noisy or conflicting, and the downstream model may still fail to use the retrieved context correctly [@doi:10.1093/jamia/ocaf008; @doi:10.1371/journal.pdig.0000877]. +Several variants already exist. Some systems use literature-grounded retrieval for biomedical question answering or clinical decision support [@doi:10.1093/jamia/ocaf008]. Others augment large language models with external medical knowledge bases, as in MKRAG [@arxiv:2309.16035]. Retrieval can also be structured rather than purely textual; for example, KG-RAG uses relations from the SPOKE biomedical knowledge graph to guide prompt generation [@doi:10.1093/bioinformatics/btae560]. These systems are attractive because they are more transparent and easier to update, but they introduce new bottlenecks when retrieval is incomplete, noisy, or poorly used by the downstream model [@doi:10.1093/jamia/ocaf008; @doi:10.1371/journal.pdig.0000877]. #### Graph- and ontology-aware integration -Another important paradigm is to integrate foundation models with structured biomedical graphs or ontologies. 
This is especially relevant in domains where diseases, genes, proteins, drugs, and phenotypes are linked through explicit semantic or biological relationships. Instead of flattening everything into plain text, graph- and ontology-aware methods try to preserve relational structure and use it during representation learning or inference [@doi:10.1093/nar/gkh061; @doi:10.1093/nar/gkaa1113]. +Another paradigm integrates foundation models with biomedical graphs or ontologies, especially in settings where diseases, genes, proteins, drugs, and phenotypes are linked through explicit semantic or biological relationships [@doi:10.1093/nar/gkh061; @doi:10.1093/nar/gkaa1113]. -Several types of work fall into this category. Biomedical knowledge graphs can support reasoning over gene-disease-drug associations, while ontologies can constrain concept normalization and improve consistency across datasets [@doi:10.1093/bioinformatics/btae560; @doi:10.2196/62924]. In some systems, graph structure is used as an external source of biomedical facts during generation, as in KG-RAG [@doi:10.1093/bioinformatics/btae560]. In others, large language models are used to help construct or extend graph resources from clinical text and biomedical literature [@arxiv:2301.12473]. These examples show that the interaction between foundation models and biomedical graphs can be bidirectional: graphs can guide models, and models can also help update structured knowledge resources. The attraction of this paradigm is that it preserves domain structure that generic sequence models often ignore. The challenge is that graph resources are frequently incomplete, heterogeneous, and difficult to integrate cleanly with large pretrained models not originally designed for relational reasoning [@doi:10.1093/bioinformatics/btae560; @doi:10.2196/62924]. 
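The graph-grounded prompting idea can be illustrated with a toy example. The triples and the verbalization below are invented for illustration and are not the SPOKE schema or the KG-RAG implementation; the point is only that one-hop relational structure can be retrieved and rendered into model context instead of being flattened away:

```python
# Sketch of verbalizing knowledge-graph triples into prompt context, in the
# spirit of KG-RAG. Triples and rendering are illustrative, not SPOKE's schema.
triples = [
    ("BRCA1", "associated_with", "breast carcinoma"),
    ("BRCA1", "interacts_with", "BARD1"),
    ("olaparib", "targets", "PARP1"),
    ("olaparib", "treats", "breast carcinoma"),
]

def neighborhood(entity, triples):
    """Return every triple mentioning the entity as subject or object."""
    return [t for t in triples if entity in (t[0], t[2])]

def render_context(entity, triples):
    """Verbalize the entity's one-hop neighborhood as prompt-ready facts."""
    facts = [f"{s} {p.replace('_', ' ')} {o}." for s, p, o in neighborhood(entity, triples)]
    return "Known relations:\n" + "\n".join(facts)

context = render_context("breast carcinoma", triples)
print(context)
```

The incompleteness problem mentioned above shows up directly here: `neighborhood` can only return relations that curators or extraction pipelines have already added to the graph.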
+Biomedical knowledge graphs can support reasoning over gene-disease-drug associations, while ontologies can constrain concept normalization and improve consistency across datasets [@doi:10.1093/bioinformatics/btae560; @doi:10.2196/62924]. In some systems, graph structure is used as an external source of biomedical facts during generation, as in KG-RAG [@doi:10.1093/bioinformatics/btae560]. In others, large language models help construct or extend graph resources from clinical text and biomedical literature [@arxiv:2301.12473]. These approaches preserve domain structure that generic sequence models often ignore, but graph resources are often incomplete, heterogeneous, and difficult to integrate cleanly with large pretrained models [@doi:10.1093/bioinformatics/btae560; @doi:10.2196/62924]. #### Multimodal integration -Many biomedical applications require models to combine heterogeneous modalities rather than operate on a single input type. Clinical decision making may depend simultaneously on EHR data, physician notes, imaging, pathology slides, molecular profiles, and literature-derived knowledge. Multimodal integration attempts to coordinate these information sources, either by learning joint representations or by combining specialized models into a larger pipeline [@pmid:39321458; @pmid:40754135]. +Many biomedical applications require models to combine multiple modalities, including EHR data, clinical notes, imaging, pathology slides, molecular profiles, and literature-derived knowledge [@pmid:39321458; @pmid:40754135]. -This paradigm appears in several forms. In clinical settings, multimodal systems may combine imaging with reports, or EHR trajectories with free-text notes, so that the model reasons over both structured and unstructured evidence [@pmid:39321458]. 
In research settings, multimodal integration may connect molecular measurements, pathology images, and knowledge-graph information for tasks such as drug discovery or precision medicine [@doi:10.1016/j.drudis.2024.104254]. More recent medical multimodal language-model work also shows how visual and textual biomedical evidence can be brought into a common reasoning framework [@pmid:39321458]. The appeal of multimodal integration is clear because biomedical decision-making is rarely unimodal. At the same time, it is one of the hardest paradigms to implement well: modality alignment is often imperfect, many modalities are missing for large subsets of samples, and evaluation becomes substantially more complicated as the number of interacting components increases [@pmid:40754135]. +In clinical settings, multimodal systems may combine imaging with reports or EHR trajectories with free-text notes [@pmid:39321458]. In research settings, they may connect molecular measurements, pathology images, and knowledge-graph information for tasks such as drug discovery or precision medicine [@doi:10.1016/j.drudis.2024.104254]. More recent medical multimodal language-model work similarly aims to place visual and textual biomedical evidence into a shared reasoning framework [@pmid:39321458]. This paradigm is attractive because biomedical decision-making is rarely unimodal, but it is also one of the most difficult to implement because alignment is imperfect, modalities are often missing, and evaluation becomes more complex as the system grows [@pmid:40754135]. From 331f4f3879f3dee195f5fde23f8d32a8ec15279a Mon Sep 17 00:00:00 2001 From: tang274 <160694996+tang274@users.noreply.github.com> Date: Sun, 26 Apr 2026 16:32:18 -0500 Subject: [PATCH 4/5] Add an overview back Added an overview section discussing the interaction between biomedical AI models and data systems, highlighting the comparison between domain-specific and general foundation models. 
--- content/05.integrating.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/content/05.integrating.md b/content/05.integrating.md index 479167b..652e295 100644 --- a/content/05.integrating.md +++ b/content/05.integrating.md @@ -1,3 +1,9 @@ +## Overview +An important theme in biomedical AI is that model performance depends not only on the model itself, but also on how effectively it interacts with biomedical data systems. In practice, foundation models often need to connect unstructured inputs such as clinical text, medical images, or biological sequences with structured and semi-structured resources such as electronic health records (EHRs), ontologies, and curated biomedical databases [@doi:10.1038/sdata.2016.35; @doi:10.1093/nar/gkh061; @doi:10.1093/nar/gkaa1113]. + +This perspective is closely tied to the central question of this review: whether biomedical tasks are better addressed by domain-specific foundation models or by adapting general foundation models. Domain-specific models are often designed around biomedical data structures, whereas general models usually require prompting, retrieval, or tool use to interact effectively with structured biomedical information. Integration with biomedical data systems therefore provides a useful lens for comparing the strengths and limitations of these two approaches [@doi:10.1093/jamia/ocae074; @doi:10.1093/jamia/ocae202]. + +## Existing Models Existing work on integrating foundation models with biomedical data systems can be grouped into several paradigms. These categories are not rigid, and many practical systems combine multiple strategies [@doi:10.1093/jamia/ocae074].
#### Native integration in domain-specific models From 6fb488d75c685990a95c4d9dc3715e3b51bbff4 Mon Sep 17 00:00:00 2001 From: tang274 <160694996+tang274@users.noreply.github.com> Date: Sun, 26 Apr 2026 16:36:50 -0500 Subject: [PATCH 5/5] Modify the description of EHR to make it more concise and precise Expanded on the challenges of integrating EHRs with foundation models, highlighting their longitudinal, sparse, noisy, and irregularly timed nature.
EHRs are especially challenging in this respect because they are longitudinal, sparse, noisy, and irregularly timed, with measurements collected according to clinical need rather than on a fixed schedule [@doi:10.1038/sdata.2016.35]. This perspective is closely tied to the central question of this review: whether biomedical tasks are better addressed by domain-specific foundation models or by adapting general foundation models. Domain-specific models are often designed around biomedical data structures, whereas general models usually require prompting, retrieval, or tool use to interact effectively with structured biomedical information. Integration with biomedical data systems therefore provides a useful lens for comparing the strengths and limitations of these two approaches [@doi:10.1093/jamia/ocae074; @doi:10.1093/jamia/ocae202].
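One concrete consequence of this irregular sampling is that serializing a patient record for a general model must keep explicit timestamps, since the gaps between measurements may themselves be clinically meaningful. A minimal sketch, assuming an invented record layout and units (this is an illustration of prompt-based serialization, not any production EHR schema):

```python
# Sketch of serializing an irregularly sampled EHR lab series into prompt text.
# The record layout, identifiers, and units are invented for illustration.
from datetime import date

record = {
    "patient": "example-001",
    "labs": [
        {"date": date(2024, 1, 3), "name": "creatinine", "value": 1.1, "unit": "mg/dL"},
        {"date": date(2024, 2, 20), "name": "creatinine", "value": 2.0, "unit": "mg/dL"},
        {"date": date(2024, 1, 4), "name": "creatinine", "value": 1.4, "unit": "mg/dL"},
    ],
}

def serialize(record):
    """Order measurements by time and keep explicit dates, so the model can
    see that sampling is irregular rather than assuming a fixed schedule."""
    lines = [f"Patient {record['patient']} laboratory history:"]
    for lab in sorted(record["labs"], key=lambda x: x["date"]):
        lines.append(f"- {lab['date'].isoformat()}: {lab['name']} {lab['value']} {lab['unit']}")
    return "\n".join(lines)

print(serialize(record))
```

Formatting choices like these (date format, ordering, which fields to drop) are exactly the "input formatting and orchestration" decisions that the adapted-general-model paradigm depends on.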