
feat: enhanced scientific RAG pipeline for research workflows (ISAAC-497) #23

Open
watcharaponthod-code wants to merge 1 commit into aietal:master from watcharaponthod-code:feat/enhanced-scientific-rag-isaac497


@watcharaponthod-code


Bounty: ISAAC-497
Algora bounty: https://algora.io/isaac/bounties/clq18zr98000ejs0gt0nv7gwu

Summary

This PR implements an enhanced RAG pipeline for scientific and research document workflows. Five files changed, including a new dedicated utility module whose public helpers are covered by 30 tests.

Key improvements

Section-aware chunking: SCIENTIFIC_SEPARATORS splits documents at Abstract/Methods/Results boundaries. Every chunk stores citationKey, section, and sectionWeight in ChromaDB.
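For reviewers, a minimal sketch of what splitting at section boundaries can look like. It is not the code in scientific-rag.ts: `SECTION_HEADINGS`, `splitBySection`, and the `"body"` fallback label are illustrative names, and the 900-character size cap is left out.

```typescript
// Hypothetical heading list; the PR's SCIENTIFIC_SEPARATORS may differ.
const SECTION_HEADINGS = [
  "Abstract", "Introduction", "Methods", "Results",
  "Discussion", "Conclusion", "References",
];

type SectionChunk = { section: string; text: string };

// Split a plain-text document into chunks at lines that consist only of a
// known section heading (optionally numbered, e.g. "2. Methods").
function splitBySection(doc: string): SectionChunk[] {
  const headingRe = new RegExp(
    `^\\s*(?:\\d+\\.?\\s+)?(${SECTION_HEADINGS.join("|")})\\s*$`,
    "i",
  );
  const chunks: SectionChunk[] = [];
  let section = "body"; // label for text that precedes any heading (assumption)
  let buffer: string[] = [];

  const flush = (): void => {
    const text = buffer.join("\n").trim();
    if (text.length > 0) chunks.push({ section, text });
    buffer = [];
  };

  for (const line of doc.split(/\r?\n/)) {
    const match = headingRe.exec(line);
    if (match) {
      flush(); // close the previous section before starting a new one
      section = match[1].toLowerCase();
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```

Each returned chunk carries its section label, which can then be stored alongside citationKey and sectionWeight as metadata.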

Multi-query retrieval with RRF + section weighting: buildResearchQueries expands the user query into 4 deterministic variants. fuseQueryResults applies Reciprocal Rank Fusion with section importance weights (abstract 1.4x, results 1.3x, methods 1.2x, body 0.8x). Duplicate chunks accumulate scores across query variants.
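The fusion step can be sketched roughly as follows. This is an illustration, not the PR's fuseQueryResults: the function name `fuseRanked`, the RRF constant `k = 60` (a commonly used default), and the 1.0 weight for unlisted sections are all assumptions. Only the four listed weights come from this description.

```typescript
type RankedChunk = { id: string; section: string };
type FusedChunk = RankedChunk & { score: number };

// Section weights quoted in this PR description.
const SECTION_WEIGHTS: Record<string, number> = {
  abstract: 1.4,
  results: 1.3,
  methods: 1.2,
  body: 0.8,
};

// Reciprocal Rank Fusion: each appearance of a chunk at 0-based position
// `index` in a result list contributes weight / (k + index + 1).
// A chunk returned by several query variants accumulates its contributions.
function fuseRanked(resultSets: RankedChunk[][], k = 60): FusedChunk[] {
  const fused = new Map<string, FusedChunk>();
  for (const results of resultSets) {
    results.forEach((chunk, index) => {
      const weight = SECTION_WEIGHTS[chunk.section] ?? 1.0; // default is an assumption
      const contribution = weight / (k + index + 1);
      const existing = fused.get(chunk.id);
      if (existing) {
        existing.score += contribution;
      } else {
        fused.set(chunk.id, { ...chunk, score: contribution });
      }
    });
  }
  // Highest fused score first.
  return Array.from(fused.values()).sort((a, b) => b.score - a.score);
}
```

With this shape, a Results chunk that appears in two variant result lists outranks an Abstract chunk that appears in only one, because the contributions add up before sorting.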

Stable citation keys: buildCitationKey produces deterministic keys of the form title-slug:pPage:cChunk+1, i.e. the slugified title, the page number, and the 1-based chunk number. buildChunkMetadata strips server-side temp upload paths from the public source field.
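A rough sketch of the key format, for illustration only. `slugify`, `citationKeyFor`, `publicSourceName`, and the `"untitled"` fallback are hypothetical names, and reducing the upload path to its file name is one possible way to hide temp directories, not necessarily what buildChunkMetadata does.

```typescript
// Lowercase, strip diacritics, and collapse everything else into hyphens.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .normalize("NFKD")
    .replace(/[\u0300-\u036f]/g, "") // drop combining marks left by NFKD
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Same inputs always give the same key: <slug>:p<page>:c<chunkIndex + 1>.
function citationKeyFor(title: string, page: number, chunkIndex: number): string {
  const slug = slugify(title) || "untitled"; // fallback is an assumption
  return `${slug}:p${page}:c${chunkIndex + 1}`;
}

// Keep only the file name so server-side temp directories never reach clients.
function publicSourceName(uploadPath: string): string {
  const parts = uploadPath.split(/[\\/]/);
  return parts[parts.length - 1] || uploadPath;
}
```

For example, `citationKeyFor("Deep Learning for Protein Folding", 3, 0)` yields `deep-learning-for-protein-folding:p3:c1`.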

Scientific chat prompt: the system prompt now defines a strict research-assistant persona that cites every claim by key and prefers Results/Methods evidence over Introduction/Discussion. fetchResearchEvidence uses a same-origin URL instead of the hard-coded localhost:3000. Temperature is taken from the request instead of being hard-coded to 0.
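One way to derive a same-origin URL and a request-supplied temperature, sketched under assumptions: `sameOriginUrl` and `resolveTemperature` are hypothetical helpers, the forwarded headers are trusted only because a reverse proxy is assumed to set them, and the [0, 2] clamp and fallback of 0 are illustrative choices.

```typescript
type HeaderValue = string | string[] | undefined;

const firstValue = (value: HeaderValue): string | undefined =>
  Array.isArray(value) ? value[0] : value;

// Build an absolute URL on the host that served the current request,
// instead of assuming the API lives at localhost:3000.
function sameOriginUrl(headers: Record<string, HeaderValue>, path: string): string {
  const host = firstValue(headers["x-forwarded-host"]) ?? firstValue(headers["host"]);
  if (!host) throw new Error("Request has no Host header");
  // x-forwarded-proto may hold a comma-separated chain; the first entry is the client-facing one.
  const proto = (firstValue(headers["x-forwarded-proto"]) ?? "http").split(",")[0].trim();
  return new URL(path, `${proto}://${host}`).toString();
}

// Use the caller's temperature when it is a finite number, clamped to [0, 2].
function resolveTemperature(value: unknown, fallback = 0): number {
  if (typeof value !== "number" || !Number.isFinite(value)) return fallback;
  return Math.min(2, Math.max(0, value));
}
```

In a Next.js API route the headers argument would be `req.headers`, so the call would look like `sameOriginUrl(req.headers, "/api/fetch-documents")`.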

Validation

npx vitest run - 30/30 tests passed
npx tsc --noEmit - 0 type errors

Payout: Algora bounty-platform payout to GitHub user @watcharaponthod-code.

…497)

Implement section-aware document ingestion, multi-query retrieval with
reciprocal rank fusion, stable citation keys, and budget-capped evidence
context for the scientific/research RAG workflow.

Changes:
- Add ui/utils/server/scientific-rag.ts: core RAG utilities
  * detectScientificSection: identifies abstract/methods/results/etc from chunk text
  * sectionWeight: importance weights (abstract 1.4x, results 1.3x, methods 1.2x...)
  * buildChunkMetadata: typed metadata with stable citationKey, strips temp paths
  * buildResearchQueries: expands query into 4 deterministic variants for recall
  * fuseQueryResults: RRF + section-weighted deduplication across query result sets
  * buildEvidencePayload: budget-capped evidence context + source manifest
  * parseBoundedInteger: safe integer parsing for API params
  * SCIENTIFIC_SEPARATORS: section-heading-first text splitter separators
- Update ui/pages/api/inject-documents.ts:
  * Use SCIENTIFIC_SEPARATORS for section-aligned chunking (900 char chunks)
  * Replace processDocuments with buildChunkMetadata for typed, safe metadata
  * Store citationKey, section, sectionWeight in ChromaDB for downstream ranking
- Update ui/pages/api/fetch-documents.ts:
  * Expand query with buildResearchQueries before Chroma lookup
  * Apply fuseQueryResults RRF + section-weight fusion across all variants
  * Return structured evidence payload instead of raw Chroma response
- Update ui/pages/api/rag-chat.ts:
  * Use same-origin URL for fetch-documents (works in any deployment)
  * Scientific research assistant system prompt with strict citation rules
  * Use temperature from request instead of hard-coded 0
  * Section-prioritised citation rules in prompt (prefer Results > Methods > Abstract)
- Add ui/__tests__/scientific-rag.test.ts: 30 tests covering all public helpers
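
Illustrative sketch of the bounded-integer parsing and budget-capped evidence
context listed above. This is not the code in scientific-rag.ts: the names
`parseBoundedInt` and `buildContext`, the character-based budget, and the
"[key] text" entry layout are assumptions made for the example.

```typescript
// Parse an API parameter into an integer within [min, max], or use the fallback.
function parseBoundedInt(raw: unknown, fallback: number, min: number, max: number): number {
  const value = Array.isArray(raw) ? raw[0] : raw; // query params may arrive as arrays
  const parsed = typeof value === "number" ? value : Number.parseInt(String(value ?? ""), 10);
  if (!Number.isFinite(parsed)) return fallback;
  return Math.min(max, Math.max(min, Math.trunc(parsed)));
}

type Evidence = { citationKey: string; text: string };
type EvidencePayload = { context: string; sources: string[] };

// Append ranked chunks until the next one would exceed the character budget.
function buildContext(chunks: Evidence[], budgetChars: number): EvidencePayload {
  const entries: string[] = [];
  const sources: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const entry = `[${chunk.citationKey}] ${chunk.text}`;
    const cost = entry.length + (entries.length > 0 ? 2 : 0); // 2 = "\n\n" separator
    if (used + cost > budgetChars) break;
    entries.push(entry);
    sources.push(chunk.citationKey);
    used += cost;
  }
  return { context: entries.join("\n\n"), sources };
}
```

Stopping at the first chunk that does not fit keeps the context in rank order;
skipping it and trying smaller later chunks would be an alternative policy.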

Validation:
- npx vitest run (30/30 passed)
- npx tsc --noEmit (0 errors)

Bounty: ISAAC-497
Algora bounty: https://algora.io/isaac/bounties/clq18zr98000ejs0gt0nv7gwu
@themachinecorp


Hi aietal team — THEMACHINE Corp. here. We're a small studio that ships RAG systems for a living (production Chroma-backed retrieval across three of our own products), and ISAAC-497 sits squarely in our wheelhouse.

We see the existing PR #23 (and the 20+ others) tackling this as a single large PR. Our approach is different: we propose four weekly milestones instead, each landing as its own reviewable PR:

  1. Week 1 — Source Unification Layer. A UnifiedRetriever adapter that flattens uploaded PDFs/TXT and Semantic Scholar refs into a single Document schema (title, authors, year, source, content_hash). Ship behind a feature flag; zero breaking changes.
  2. Week 2 — Citation Engine. Replace ad-hoc string templating with a CitationResolver that produces deterministic [SRC-N] keys per chunk and a structured JSON evidenceContext for the LLM. Backed by unit tests using a 50-doc gold set.
  3. Week 3 — Performance Pass. Async ingestion via asyncio.gather + persistent Chroma client; benchmark R@5 and p95 latency on the existing test set. Target: <800 ms p95 for top-k=8 retrieval on 1k doc corpus.
  4. Week 4 — Stability + Handoff. Pin versions, write migration notes, ship a CHANGELOG, and open a short Loom walking the maintainer through the upgrade path.

Working branch: feat/isaac-497-scientific-rag against aietal/aimengpt@main. Estimated diff: +1,400 / −180 LOC across 9 files, ~600 LOC of which is test + gold data. We will not touch unrelated files.

We don't need write access to the private aietal/isaac repo for the PR — we'll open the implementation as a feature branch in aietal/aimengpt with the diff against current main so the maintainer can review in isolation. Our advantage over the existing 22 PRs is scope discipline and test coverage, not novel ideas.

If this shape fits what you want, we can kick off Week 1 within 48h. If you'd rather pick a different submission, no problem — happy to step aside.

— Kevin / CTO, THEMACHINE Corp.
GitHub: github.com/THEMACHINE-HF
Algora: algora.io/THEMACHINE-HF
