feat: enhanced scientific RAG pipeline for research workflows (ISAAC-497) #23
…497) Implement section-aware document ingestion, multi-query retrieval with reciprocal rank fusion, stable citation keys, and budget-capped evidence context for the scientific/research RAG workflow.

Changes:
- Add ui/utils/server/scientific-rag.ts: core RAG utilities
  * detectScientificSection: identifies abstract/methods/results/etc. from chunk text
  * sectionWeight: importance weights (abstract 1.4x, results 1.3x, methods 1.2x...)
  * buildChunkMetadata: typed metadata with stable citationKey, strips temp paths
  * buildResearchQueries: expands query into 4 deterministic variants for recall
  * fuseQueryResults: RRF + section-weighted deduplication across query result sets
  * buildEvidencePayload: budget-capped evidence context + source manifest
  * parseBoundedInteger: safe integer parsing for API params
  * SCIENTIFIC_SEPARATORS: section-heading-first text splitter separators
- Update ui/pages/api/inject-documents.ts:
  * Use SCIENTIFIC_SEPARATORS for section-aligned chunking (900 char chunks)
  * Replace processDocuments with buildChunkMetadata for typed, safe metadata
  * Store citationKey, section, sectionWeight in ChromaDB for downstream ranking
- Update ui/pages/api/fetch-documents.ts:
  * Expand query with buildResearchQueries before Chroma lookup
  * Apply fuseQueryResults RRF + section-weight fusion across all variants
  * Return structured evidence payload instead of raw Chroma response
- Update ui/pages/api/rag-chat.ts:
  * Use same-origin URL for fetch-documents (works in any deployment)
  * Scientific research assistant system prompt with strict citation rules
  * Use temperature from request instead of hard-coded 0
  * Section-prioritised citation rules in prompt (prefer Results > Methods > Abstract)
- Add ui/__tests__/scientific-rag.test.ts: 30 tests covering all public helpers

Validation:
- npx vitest run (30/30 passed)
- npx tsc --noEmit (0 errors)

Bounty: ISAAC-497
Algora bounty: https://algora.io/isaac/bounties/clq18zr98000ejs0gt0nv7gwu
Hi aietal team — THEMACHINE Corp. here. We're a small studio that ships RAG systems for a living (production Chroma-backed retrieval across three of our own products), and ISAAC-497 sits squarely in our wheelhouse. We see the existing PR #23 (and the 20+ others) tackling this in a single mega-PR shape. Our approach is different — we propose four weekly milestones instead, each landing as its own reviewable PR:
Working branch: We don't need write access to the private
If this shape fits what you want, we can kick off Week 1 within 48h. If you'd rather pick a different submission, no problem; happy to step aside.
Kevin / CTO, THEMACHINE Corp.
Bounty: ISAAC-497
Algora bounty: https://algora.io/isaac/bounties/clq18zr98000ejs0gt0nv7gwu
Summary
This PR implements an enhanced RAG pipeline for scientific and research document workflows. Five files change: a new utility module (ui/utils/server/scientific-rag.ts), three updated API routes, and a test file with 30 tests covering all public helpers.
Key improvements
Section-aware chunking: SCIENTIFIC_SEPARATORS splits documents at Abstract/Methods/Results boundaries. Every chunk stores citationKey, section, and sectionWeight in ChromaDB.
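For readers who want a feel for section detection, here is a minimal sketch of a heading-based classifier. It is not the PR's detectScientificSection: the names `SectionLabel`, `headingPattern`, and `detectSectionSketch` are hypothetical, and the heading patterns (optional Markdown `#` prefix, optional section numbering) are assumptions about what the input text looks like.

```typescript
// Hypothetical label set; the PR may recognise more sections than these.
type SectionLabel = "abstract" | "methods" | "results" | "body";

// Build a pattern that matches a heading at the start of any line, allowing
// an optional Markdown "#" prefix and optional numbering such as "2." or "3.1".
function headingPattern(names: string): RegExp {
  return new RegExp(
    `^\\s*(?:#+\\s*)?(?:\\d+(?:\\.\\d+)*\\.?\\s+)?(?:${names})\\b`,
    "im",
  );
}

// Checked in order; the first matching pattern wins.
const SECTION_PATTERNS: Array<[SectionLabel, RegExp]> = [
  ["abstract", headingPattern("abstract")],
  ["methods", headingPattern("methods?|materials and methods")],
  ["results", headingPattern("results?")],
];

// Returns the first section whose heading appears in the chunk, else "body".
function detectSectionSketch(chunkText: string): SectionLabel {
  for (const [label, pattern] of SECTION_PATTERNS) {
    if (pattern.test(chunkText)) return label;
  }
  return "body";
}
```

A heuristic like this will also label a chunk whose line merely starts with "Results show that ...", so a real implementation would likely need stricter rules.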
Multi-query retrieval with RRF + section weighting: buildResearchQueries expands the user query into 4 deterministic variants. fuseQueryResults applies Reciprocal Rank Fusion with section importance weights (abstract 1.4x, results 1.3x, methods 1.2x, body 0.8x). Duplicate chunks accumulate scores across query variants.
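The fusion step can be illustrated with a short sketch. It assumes each RRF contribution is multiplied by the section weight quoted above, and it uses the conventional RRF constant k = 60; neither detail is confirmed by this PR, and `RankedChunk` and `fuseSketch` are hypothetical names, not the signature of fuseQueryResults.

```typescript
type Section = "abstract" | "results" | "methods" | "body";

// Hypothetical shape of one retrieved chunk; the PR's real type is not shown.
interface RankedChunk {
  id: string;
  section: Section;
}

// Section weights as quoted in the PR description.
const SECTION_WEIGHTS: Record<Section, number> = {
  abstract: 1.4,
  results: 1.3,
  methods: 1.2,
  body: 0.8,
};

// Assumption: k = 60 is the commonly used RRF constant, not a value from the PR.
const RRF_K = 60;

// Each inner array is the ranked result list for one query variant, best first.
function fuseSketch(
  resultSets: RankedChunk[][],
  k: number = RRF_K,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const results of resultSets) {
    results.forEach((chunk, index) => {
      const rank = index + 1; // RRF ranks are 1-based
      const contribution = SECTION_WEIGHTS[chunk.section] / (k + rank);
      // A chunk returned by several variants accumulates its contributions.
      scores.set(chunk.id, (scores.get(chunk.id) ?? 0) + contribution);
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

Because scores are summed per chunk id, deduplication falls out of the accumulation: a chunk that several query variants agree on rises above one that a single variant ranked highly.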
Stable citation keys: buildCitationKey produces deterministic keys of the form title-slug:pPage:cChunk+1 (title slug, page number, chunk index plus one). buildChunkMetadata strips server-side temporary upload paths from the public source field.
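A sketch of how such keys might be built follows. The slug rules (lower-case, runs of non-alphanumerics collapsed to a hyphen) and the path stripping (keep only the file name) are assumptions, and `slugify`, `citationKeySketch`, and `publicSourceSketch` are hypothetical names rather than the PR's helpers.

```typescript
// Assumed slug rules: lower-case, collapse non-alphanumeric runs into "-",
// and trim leading or trailing hyphens.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Same inputs always yield the same key; chunkIndex is zero-based here,
// so the key shows chunkIndex + 1.
function citationKeySketch(title: string, page: number, chunkIndex: number): string {
  return `${slugify(title)}:p${page}:c${chunkIndex + 1}`;
}

// Assumed approach to hiding upload paths: keep only the final path segment.
function publicSourceSketch(source: string): string {
  const segments = source.split(/[\\/]/);
  return segments[segments.length - 1] || source;
}
```

For example, `citationKeySketch("Attention Is All You Need", 3, 0)` yields `attention-is-all-you-need:p3:c1`.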
Scientific chat prompt: the system prompt now sets a strict research-assistant persona: cite every claim by citation key and prefer Results/Methods evidence over Introduction/Discussion. fetchResearchEvidence builds a same-origin URL instead of using a hard-coded localhost:3000. Temperature is taken from the request instead of being hard-coded to 0.
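One way to derive a same-origin URL is from the incoming request headers, sketched below. This is an assumption about the approach, not the code in fetchResearchEvidence; `sameOriginUrlSketch` and the fallback to `http://localhost:3000` are hypothetical, and forwarded headers should only be trusted behind a proxy that sets them.

```typescript
// Node-style header values can be a string, an array of strings, or absent.
type HeaderValue = string | string[] | undefined;

function firstValue(value: HeaderValue): string | undefined {
  return Array.isArray(value) ? value[0] : value;
}

// Build an absolute URL for `path` on the same origin as the incoming request.
function sameOriginUrlSketch(headers: Record<string, HeaderValue>, path: string): string {
  // x-forwarded-proto may hold a comma-separated list; take the first entry.
  const forwardedProto = firstValue(headers["x-forwarded-proto"]);
  const proto = forwardedProto ? forwardedProto.split(",")[0].trim() : "http";
  // Hypothetical fallback for local development when no host header is present.
  const host =
    firstValue(headers["x-forwarded-host"]) ?? firstValue(headers["host"]) ?? "localhost:3000";
  return `${proto}://${host}${path}`;
}
```

In a Next.js API route the headers object would come from `req.headers`, so the call would look like `sameOriginUrlSketch(req.headers, "/api/fetch-documents")`.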
Validation
npx vitest run - 30/30 tests passed
npx tsc --noEmit - 0 type errors
Payout: Algora bounty-platform payout to GitHub user @watcharaponthod-code.