@jsochava jsochava commented Oct 27, 2025

Closes #14085

This Draft PR introduces a complete, deterministic pipeline for harvesting contextual summaries from a citing paper’s “Related Work” section and appending them to the appropriate BibEntry records.

What and why

  1. RelatedWorkAnnotator.java
  • Appends contextual summaries from a citing paper’s “Related Work” section to a target BibEntry.
  • Uses JabRef’s comment-<username> convention (resolved as UserSpecificCommentField).
  2. HeuristicRelatedWorkExtractor.java
  • Deterministic parser that locates author–year citations (e.g. (Vesce et al., 2016), (Bianchi, 2021)) within Related Work text.
  • Extracts descriptive snippets surrounding each citation.
  • Matches each citation to an existing BibEntry by first author surname + year.
  • Implemented without AI dependencies; designed for reliability and transparent logic.
  3. RelatedWorkHarvester.java
  • High-level orchestrator that connects the extractor and annotator:
    - Accepts PDF-extracted or plain text input.
    - Calls the extractor to identify citation–context pairs.
    - Invokes RelatedWorkAnnotator.appendSummaryToEntry(...) for each match.
  4. RelatedWorkSectionLocator.java
  • Deterministically isolates the “Related Work” / “Literature Review” / “Prior Work” / “Background and Related Work” section from a paper’s full plain text.
  • Recognizes numeric and textual headers.
  • Captures content until the next top-level header, ignoring figure/table captions and unrelated content.
  5. RelatedWorkPipeline.java
  • Convenience façade that chains the full extraction process:
    - RelatedWorkSectionLocator → HeuristicRelatedWorkExtractor → RelatedWorkAnnotator
  • Enables single-call usage for clients that have the full plain text and a list of candidate BibEntry objects.
  6. PdfTextProvider.java
  • A tiny SPI interface: Path → Optional plain-text extraction.
  • Keeps all PDF specifics behind a seam; facilitates unit testing with fakes/mocks and avoids hard dependencies in core logic.
  7. PdfRelatedWorkTextExtractor.java
  • Adapter: PDF → plain text (via PdfTextProvider) → “Related Work” block (via RelatedWorkSectionLocator).
  • Returns an Optional with the body of the section (header stripped), or empty if the section is not found or blank.
  • Validates inputs and surfaces IO errors; does not depend on any PDF library directly.
  8. RelatedWorkPdfPipeline.java
  • End-to-end façade for callers that have a PDF and candidate entries:
    - PdfRelatedWorkTextExtractor → HeuristicRelatedWorkExtractor → RelatedWorkAnnotator
  • Returns the number of annotations appended across matched entries.
  9. RelatedWorkEvaluationRunner.java
  • Deterministic evaluator comparing extracted (citation → snippet) pairs against gold fixtures.
  • Computes precision, recall, F1, and coverage statistics.
  • Supports in-memory or JSON-based fixture definitions.
  10. RelatedWorkMetrics.java
  • Immutable results object summarizing global and per-entry metrics.
  • Includes pretty-printed summaries and detailed statistics for debugging.
  11. RelatedWorkFixture.java
  • Simple model for “gold” fixture data: related-work text plus the expected (author, year, snippet) triples.
  • Supports direct in-memory creation or loading from a JSON file.
  12. HeuristicExtractorAdapter.java
  • Bridge layer converting the HeuristicRelatedWorkExtractor output (Map<String, String>) into the Map<BibEntry, List<String>> format expected by the evaluation runner.
  • Keeps the original extractor untouched.
  13. RelatedWorkSummarizer.java / NoOpRelatedWorkSummarizer.java
  • SPI interface for optional snippet summarization.
  • Default no-op implementation returns empty, so the harvester keeps individual snippets unchanged.
  • Future AI integrations (e.g., LangChain4j or OpenAI) can implement this interface.
  14. CitationResolver.java / NoOpCitationResolver.java
  • SPI interface for resolving missing citations by key or author–year, optionally creating new entries.
  • Default implementation performs a simple local lookup and never creates entries.
  15. RelatedWorkPluginConfig.java + RelatedWorkPluginsFactory.java
  • Lightweight configuration object with feature flags:
    - enableSummarization
    - enableResolution
  • Provides builder methods to safely compose plugin pipelines.
  • Used by RelatedWorkHarvester to inject summarizer and resolver instances.
  • Default build → no-op config (preserves old behavior).
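
To illustrate the PdfTextProvider seam described in the list above, here is a minimal sketch of how a PDF-to-section adapter can be tested with a fake provider. It is not the code from this PR: the names TextProviderSketch, PdfSectionExtractorSketch, extractPlainText and extractRelatedWork are hypothetical, the Optional<String> return type is an assumption, and the section locator is passed in as a plain function.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Objects;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical SPI: turns a PDF path into plain text, or empty if nothing could be read.
@FunctionalInterface
interface TextProviderSketch {
    Optional<String> extractPlainText(Path pdf) throws IOException;
}

// Hypothetical adapter: PDF -> plain text -> section body, with no PDF library in sight.
final class PdfSectionExtractorSketch {
    private final TextProviderSketch provider;
    private final Function<String, Optional<String>> sectionLocator;

    PdfSectionExtractorSketch(TextProviderSketch provider,
                              Function<String, Optional<String>> sectionLocator) {
        this.provider = Objects.requireNonNull(provider);
        this.sectionLocator = Objects.requireNonNull(sectionLocator);
    }

    // Returns the stripped section body, or empty if the section is missing or blank.
    // IOExceptions from the provider are passed on to the caller unchanged.
    Optional<String> extractRelatedWork(Path pdf) throws IOException {
        Objects.requireNonNull(pdf);
        return provider.extractPlainText(pdf)
                       .flatMap(sectionLocator)
                       .map(String::strip)
                       .filter(body -> !body.isEmpty());
    }
}
```

In a unit test the provider can be a lambda such as `path -> Optional.of("some text")`, so no real PDF file is needed.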

Next steps

  1. PDF text extraction
  • Integrate JabRef’s existing PDF parsing utilities or the LangChain4j interface to automatically extract the “Related Work” section from PDFs.
  • Focus on reliable section header detection (e.g., Related Work, Literature Review).
  2. Reference lookup
  • For each parsed in-text citation:
    - Match to an existing library entry.
    - If missing, create a new BibEntry and annotate it.
  3. AI-Assisted Context Summarization
  • Optional integration with LangChain4j for generating concise natural-language summaries when multiple snippets exist.

Steps to test

  1. Run the unit tests for the new features only:
    ./gradlew :jablib:test --tests "org.jabref.logic.importer.RelatedWorkAnnotatorTest"
    ./gradlew :jablib:test --tests "org.jabref.logic.importer.relatedwork.HeuristicRelatedWorkExtractorTest"
    ./gradlew :jablib:test --tests "org.jabref.logic.importer.relatedwork.RelatedWorkHarvesterTest"
    ./gradlew :jablib:test --tests "org.jabref.logic.importer.relatedwork.RelatedWorkSectionLocatorTest"
    ./gradlew :jablib:test --tests "org.jabref.logic.importer.relatedwork.PdfRelatedWorkTextExtractorTest"
    ./gradlew :jablib:test --tests "org.jabref.logic.importer.relatedwork.RelatedWorkMetricsTest"

Mandatory checks

  • I own the copyright of the code submitted and I license it under the MIT license
  • I manually tested my changes in running JabRef (always required)
  • I added JUnit tests for changes (if applicable)
  • [/] I added screenshots in the PR description (if change is visible to the user)
  • [/] I described the change in CHANGELOG.md in a way that is understandable for the average user (if change is visible to the user)
  • [/] I checked the user documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request updating file(s) in https://github.com/JabRef/user-documentation/tree/main/en.

…ries to comment-<username> (JabRef#14085)

This helper takes a BibEntry, a username, the citing paper's key,
and a summary sentence, and appends a block like:

  [LunaOstos_2024]: <summary>

to the field comment-<username>. If that field already has content,
the new block is appended after a blank line.

Includes unit tests verifying first append and multi-append behavior.
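
The append behavior described in this commit message can be sketched roughly as follows. This is an illustration only, not the helper from the PR: a plain Map stands in for BibEntry, and the class and method names (CommentAppendSketch, appendSummary) are hypothetical.

```java
import java.util.Map;

// Hypothetical stand-in for the annotator logic; a Map replaces BibEntry fields.
final class CommentAppendSketch {

    private CommentAppendSketch() {
    }

    // Appends "[citingKey]: summary" to the "comment-<username>" field.
    // Existing content is kept and separated from the new block by a blank line.
    static void appendSummary(Map<String, String> fields, String username,
                              String citingKey, String summary) {
        String fieldName = "comment-" + username;
        String block = "[" + citingKey + "]: " + summary.strip();
        String existing = fields.get(fieldName);
        if (existing == null || existing.isBlank()) {
            fields.put(fieldName, block);
        } else {
            fields.put(fieldName, existing.stripTrailing() + "\n\n" + block);
        }
    }
}
```

Calling appendSummary twice with different citing keys leaves two blocks in the same field, which is the multi-append case the unit tests are said to cover.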
@github-actions
Contributor

Hey @jsochava!

Thank you for contributing to JabRef! Your help is truly appreciated ❤️.

We have automatic checks in place, based on which you will soon get automated feedback if any of them are failing. We also use TragBot with custom rules that scans your changes and provides some preliminary comments, before a maintainer takes a look. TragBot is still learning, and may not always be accurate. In the "Files changed" tab, you can go through its comments and just click on "Resolve conversation" if you are sure that it is incorrect, or comment on the conversation if you are doubtful.

Please re-check our contribution guide in case of any other doubts related to our contribution workflow.

@jsochava force-pushed the feature/related-work-annotator branch from 21d4bac to 711c3a9 on October 30, 2025 00:47
…f#14085)

Implements a deterministic extractor for author–year style citations
in "Related Work" sections and integrates it with RelatedWorkAnnotator.

- Added org.jabref.logic.importer.relatedwork package
- Introduced RelatedWorkExtractor interface
- Implemented HeuristicRelatedWorkExtractor for author–year citation parsing
- Implemented RelatedWorkHarvester orchestrator that uses the extractor
  and appends summaries via RelatedWorkAnnotator
- Added comprehensive JUnit tests verifying extraction and annotation behavior

This change completes the non-AI (LangChain4j-free) MVP for issue JabRef#14085.
Future work may introduce an AI-based RelatedWorkExtractor using LangChain4j.
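
As a rough illustration of deterministic author–year parsing, the sketch below finds parenthetical citations such as (Vesce et al., 2016) and keeps the surrounding sentence as the snippet. The regex, the class name AuthorYearSketch, the "surname-year" key format and the naive sentence splitting are all assumptions made for this example, not the PR's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of an author-year citation extractor.
final class AuthorYearSketch {

    // "(Surname, 2021)", "(Surname et al., 2016)" or "(Surname and Other, 2019)".
    private static final Pattern AUTHOR_YEAR = Pattern.compile(
            "\\(\\s*(\\p{Lu}[\\p{L}'\\-]+)"
            + "(?:\\s+et\\s+al\\.?|\\s+(?:and|&)\\s+\\p{Lu}[\\p{L}'\\-]+)?"
            + "\\s*,\\s*(\\d{4})[a-z]?\\s*\\)");

    private AuthorYearSketch() {
    }

    // Maps "surname-year" (lower-cased surname) to the sentence containing the citation.
    static Map<String, String> extract(String relatedWorkText) {
        Map<String, String> snippets = new LinkedHashMap<>();
        Matcher m = AUTHOR_YEAR.matcher(relatedWorkText);
        while (m.find()) {
            // Naive sentence bounds: previous ". " before the citation, next "." after it.
            int previousStop = relatedWorkText.lastIndexOf(". ", m.start());
            int start = previousStop < 0 ? 0 : previousStop + 2;
            int nextStop = relatedWorkText.indexOf('.', m.end());
            int end = nextStop < 0 ? relatedWorkText.length() : nextStop + 1;
            String key = m.group(1).toLowerCase(Locale.ROOT) + "-" + m.group(2);
            snippets.putIfAbsent(key, relatedWorkText.substring(start, end).strip());
        }
        return snippets;
    }
}
```

The first-author surname and year in the key are the same two pieces of information the PR description names for matching a citation to a BibEntry.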
…critics(JabRef#14085)

- Updated AUTHOR_YEAR_INNER regex to allow all-caps acronyms (e.g., "CIA")
  and Unicode names (e.g., "Šimić").
- Added acronym indexing in buildIndex() so corporate or multi-word authors
  (e.g., "Central Intelligence Agency") map to their acronyms.
- Ensures citations like (CIA, 2021) correctly match entries such as
  "Central Intelligence Agency, 2021".
- Keeps deterministic behavior while improving coverage of real-world
  citation formats in Related Work sections.
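
The acronym indexing idea can be sketched as below. This is a simplified assumption of how such an index might look, not the PR's buildIndex(): the Entry record, the "author|year" key format and the rule of taking only capitalized word initials are hypothetical choices for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: index entries by author name and, for multi-word names, by acronym.
final class AcronymIndexSketch {

    // Minimal stand-in for a BibEntry.
    record Entry(String citationKey, String author, String year) {
    }

    private AcronymIndexSketch() {
    }

    // Initials of capitalized words, e.g. "Central Intelligence Agency" -> "CIA".
    // Lower-case words are skipped, so "Department of Energy" would give "DE".
    static String acronymOf(String name) {
        StringBuilder sb = new StringBuilder();
        for (String word : name.strip().split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.codePointAt(0))) {
                sb.appendCodePoint(word.codePointAt(0));
            }
        }
        return sb.toString();
    }

    static Map<String, Entry> buildIndex(List<Entry> entries) {
        Map<String, Entry> index = new HashMap<>();
        for (Entry entry : entries) {
            index.putIfAbsent(key(entry.author(), entry.year()), entry);
            String acronym = acronymOf(entry.author());
            if (acronym.length() >= 2) {
                index.putIfAbsent(key(acronym, entry.year()), entry);
            }
        }
        return index;
    }

    static Optional<Entry> lookup(Map<String, Entry> index, String citedAuthor, String year) {
        return Optional.ofNullable(index.get(key(citedAuthor, year)));
    }

    // Locale.ROOT lower-casing also handles non-ASCII surnames such as "Šimić".
    private static String key(String author, String year) {
        return author.strip().toLowerCase(Locale.ROOT) + "|" + year;
    }
}
```

With this index, a lookup for ("CIA", "2021") finds an entry whose author is "Central Intelligence Agency" and whose year is 2021.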
…or section detection and overall run(JabRef#14085)

Adds RelatedWorkSectionLocator, a deterministic logic class that extracts the “Related Work” or “Literature Review” section from full plain text using common header variants and numeric patterns.

Introduces RelatedWorkPipeline as a high-level façade chaining:
  - RelatedWorkSectionLocator → HeuristicRelatedWorkExtractor → RelatedWorkAnnotator

The pipeline provides a clean, dependency-free entry point for integrating related-work extraction into JabRef’s importer pipeline.

Includes comprehensive unit tests for RelatedWorkSectionLocator to verify detection robustness across multiple header variants.

This change enables downstream components to reliably isolate and process related-work text without external dependencies.
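
A section locator of this kind could look roughly like the following sketch. The two regexes, the list of "next header" names and the class name SectionLocatorSketch are assumptions for illustration; the locator in this PR may recognize more variants and handle captions differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of header-based section isolation on plain text.
final class SectionLocatorSketch {

    // A line such as "2 Related Work", "2. Literature Review" or just "Prior Work".
    private static final Pattern SECTION_HEADER = Pattern.compile(
            "(?im)^[ \\t]*(?:\\d+(?:\\.\\d+)*\\.?[ \\t]+)?"
            + "(?:background and related work|related work|literature review|prior work)[ \\t]*$");

    // Next top-level header: "3 Method" style, or a few common unnumbered names.
    // Subsection lines like "2.1 Subtopic" do not match and stay inside the section.
    private static final Pattern NEXT_TOP_LEVEL_HEADER = Pattern.compile(
            "(?m)^[ \\t]*(?:\\d+\\.?[ \\t]+\\p{Lu}.*"
            + "|(?:Methods?|Methodology|Approach|Evaluation|Results|Discussion|Conclusions?|References)[ \\t]*)$");

    private SectionLocatorSketch() {
    }

    // Returns the section body without its header, or empty if not found or blank.
    static Optional<String> locate(String fullText) {
        Matcher header = SECTION_HEADER.matcher(fullText);
        if (!header.find()) {
            return Optional.empty();
        }
        int bodyStart = header.end();
        Matcher next = NEXT_TOP_LEVEL_HEADER.matcher(fullText);
        int bodyEnd = next.find(bodyStart) ? next.start() : fullText.length();
        String body = fullText.substring(bodyStart, bodyEnd).strip();
        return body.isEmpty() ? Optional.empty() : Optional.of(body);
    }
}
```

One known weakness of this sketch is that a body line starting with a number followed by a capitalized word would be mistaken for a header.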
…JabRef#14085)

Introduces a metrics module to evaluate the heuristic Related Work extractor.

- Added RelatedWorkEvaluationRunner: computes precision, recall, F1, and coverage
  for extracted citation–context pairs against gold fixtures.
- Added RelatedWorkMetrics: immutable summary object with per-entry statistics.
- Added RelatedWorkFixture: compact JSON or in-memory format for evaluation data.
- Added HeuristicExtractorAdapter: bridges HeuristicRelatedWorkExtractor output
  (Map<String,String>) to the runner’s expected Map<BibEntry,List<String>> form.
- Added RelatedWorkMetricsTest: self-contained JUnit test that runs the extractor
  on a gold "Related Work" text and prints evaluation metrics.

This provides a deterministic, non-AI benchmark for assessing heuristic coverage
before integrating more complex methods.
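
For reference, precision, recall and F1 over extracted versus gold pairs follow the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2PR / (P + R). The sketch below computes them over sets of string-encoded pairs. The class PrfSketch and its Scores record are hypothetical names, and the PR's coverage statistic is not modeled here.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of set-based precision/recall/F1.
final class PrfSketch {

    record Scores(int truePositives, int falsePositives, int falseNegatives) {

        double precision() {
            int predicted = truePositives + falsePositives;
            return predicted == 0 ? 0.0 : (double) truePositives / predicted;
        }

        double recall() {
            int expected = truePositives + falseNegatives;
            return expected == 0 ? 0.0 : (double) truePositives / expected;
        }

        double f1() {
            double p = precision();
            double r = recall();
            return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
        }
    }

    private PrfSketch() {
    }

    // Each element is one (citation, snippet) pair encoded as a string.
    static Scores evaluate(Set<String> gold, Set<String> extracted) {
        Set<String> hits = new HashSet<>(extracted);
        hits.retainAll(gold);
        int tp = hits.size();
        return new Scores(tp, extracted.size() - tp, gold.size() - tp);
    }
}
```

With four gold pairs and three extracted pairs of which two are correct, this yields precision 2/3, recall 1/2 and F1 4/7.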
…olution(JabRef#14085)

- Introduce RelatedWorkSummarizer and CitationResolver interfaces
- Add no-op default implementations (NoOpRelatedWorkSummarizer, NoOpCitationResolver)
- Add RelatedWorkPluginConfig with feature toggles and DI-friendly builder
- Wire RelatedWorkHarvester to optionally use resolver for missing keys and summarizer for multi-snippet entries
- Keep default behavior unchanged (both features disabled)
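
The optional-summarizer wiring might be pictured as in the sketch below: a no-op default returns empty, and the caller falls back to the individual snippets. The interface SnippetSummarizer, its summarize signature and the helper finalSnippets are hypothetical names chosen for this example and may differ from RelatedWorkSummarizer in the PR.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical SPI for optional snippet summarization.
@FunctionalInterface
interface SnippetSummarizer {
    Optional<String> summarize(List<String> snippets);
}

// Default implementation: never summarizes.
final class NoOpSnippetSummarizer implements SnippetSummarizer {
    @Override
    public Optional<String> summarize(List<String> snippets) {
        return Optional.empty();
    }
}

final class SummarizerWiringSketch {

    private SummarizerWiringSketch() {
    }

    // Uses the summary only when the feature is enabled, there are several snippets,
    // and the summarizer returns a value; otherwise the snippets are kept unchanged.
    static List<String> finalSnippets(List<String> snippets,
                                      boolean enableSummarization,
                                      SnippetSummarizer summarizer) {
        if (!enableSummarization || snippets.size() < 2) {
            return snippets;
        }
        return summarizer.summarize(snippets)
                         .map(summary -> List.of(summary))
                         .orElse(snippets);
    }
}
```

Because the no-op summarizer always returns empty, enabling the flag without supplying a real implementation still leaves the output identical to the disabled case.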
Member

@koppor koppor left a comment

Please check your IDE config. It seems you reformatted too much with the wrong style, which makes it hard to give content feedback.

@github-actions github-actions bot added the status: changes-required Pull requests that are not yet complete label Nov 15, 2025
@jsochava
Author

@koppor when I used the IDE config described in the project's setup instructions, I consistently failed the automatic format checks. Is that expected?

Labels

first contrib status: changes-required Pull requests that are not yet complete

Development

Successfully merging this pull request may close these issues.

Extract text about papers from "related work" sections

2 participants