feat: enhanced scientific RAG pipeline for research workflows (ISAAC-497) by watcharaponthod-code · Pull Request #23 · aietal/aimengpt

watcharaponthod-code · 2026-05-30T18:08:36Z

Bounty: ISAAC-497
Algora bounty: https://algora.io/isaac/bounties/clq18zr98000ejs0gt0nv7gwu

Summary

This PR implements an enhanced RAG pipeline for scientific and research document workflows. It changes five files and adds a dedicated utility module whose public helpers are covered by 30 tests.

Key improvements

Section-aware chunking: SCIENTIFIC_SEPARATORS splits documents at Abstract/Methods/Results boundaries. Every chunk stores citationKey, section, and sectionWeight in ChromaDB.
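As a rough illustration of section-aware tagging, the sketch below labels a chunk by its leading heading and maps the label to a weight. The function names, the heading patterns, and the neutral 1.0 default for sections without a stated weight are assumptions for illustration; only the section names and the 1.4/1.3/1.2/0.8 weights come from this PR description, and the actual detectScientificSection and sectionWeight helpers may work differently.

```typescript
// Hypothetical stand-ins for detectScientificSection / sectionWeight.
type Section =
  | "abstract"
  | "introduction"
  | "methods"
  | "results"
  | "discussion"
  | "body";

// Heading patterns are assumptions: an optional "2." style number, then the heading word.
const SECTION_PATTERNS: Array<[Section, RegExp]> = [
  ["abstract", /^\s*(?:\d+\.?\s*)?abstract\b/i],
  ["introduction", /^\s*(?:\d+\.?\s*)?introduction\b/i],
  ["methods", /^\s*(?:\d+\.?\s*)?(?:materials and methods|methods|methodology)\b/i],
  ["results", /^\s*(?:\d+\.?\s*)?results\b/i],
  ["discussion", /^\s*(?:\d+\.?\s*)?discussion\b/i],
];

// Classify a chunk by the first line of its text; fall back to "body".
function detectSection(chunkText: string): Section {
  const firstLine = chunkText.trimStart().split("\n", 1)[0] ?? "";
  for (const [section, pattern] of SECTION_PATTERNS) {
    if (pattern.test(firstLine)) return section;
  }
  return "body";
}

// Weights stated in the PR description; other sections are not specified there.
const STATED_WEIGHTS: Partial<Record<Section, number>> = {
  abstract: 1.4,
  results: 1.3,
  methods: 1.2,
  body: 0.8,
};

// Assumption: sections without a stated weight get a neutral 1.0.
function weightFor(section: Section): number {
  return STATED_WEIGHTS[section] ?? 1.0;
}
```

Matching only the first line keeps a sentence such as "we compare results" in running text from being mistaken for a Results heading.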

Multi-query retrieval with RRF + section weighting: buildResearchQueries expands the user query into 4 deterministic variants. fuseQueryResults applies Reciprocal Rank Fusion with section importance weights (abstract 1.4x, results 1.3x, methods 1.2x, body 0.8x). Duplicate chunks accumulate scores across query variants.
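A minimal sketch of how Reciprocal Rank Fusion with section weighting could look. It assumes the weight multiplies each per-query RRF contribution and uses the conventional constant k = 60; neither detail is stated in this PR, and the names below are hypothetical rather than the signature of fuseQueryResults.

```typescript
// Hypothetical shapes; the PR's real result type is not shown here.
interface RankedChunk {
  id: string;
  section: string;
}

interface FusedChunk extends RankedChunk {
  score: number;
}

// Section weights as listed in the PR description.
const SECTION_WEIGHTS: Record<string, number> = {
  abstract: 1.4,
  results: 1.3,
  methods: 1.2,
  body: 0.8,
};

// Each inner list is one query variant's results, best match first.
function fuseRankedLists(lists: RankedChunk[][], k = 60): FusedChunk[] {
  const fused = new Map<string, FusedChunk>();
  for (const list of lists) {
    list.forEach((chunk, index) => {
      // Assumption: unknown sections get a neutral weight of 1.0.
      const weight = SECTION_WEIGHTS[chunk.section] ?? 1.0;
      // RRF contribution 1 / (k + rank), with rank starting at 1, scaled by the weight.
      const contribution = weight / (k + index + 1);
      const existing = fused.get(chunk.id);
      if (existing) {
        existing.score += contribution; // duplicates accumulate across variants
      } else {
        fused.set(chunk.id, { id: chunk.id, section: chunk.section, score: contribution });
      }
    });
  }
  return Array.from(fused.values()).sort((a, b) => b.score - a.score);
}
```

Because scores are summed, a chunk returned by several query variants can outrank a chunk that tops only one list, which is the deduplication behaviour described above.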

Stable citation keys: buildCitationKey produces deterministic keys of the form title-slug:p<page>:c<chunk index + 1>. buildChunkMetadata strips server-side temp upload paths from the public source field.
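For illustration, a key of this shape could be built as follows. The slug rules, the "untitled" fallback, and the choice to keep only the file name of the upload path are assumptions; the PR's buildCitationKey and buildChunkMetadata may differ in detail.

```typescript
// Hypothetical slug rules: lowercase, non-alphanumeric runs become single hyphens.
function slugify(title: string): string {
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return slug || "untitled"; // assumed fallback for titles with no usable characters
}

// Same inputs always give the same key; the chunk index is zero-based, the key is one-based.
function citationKey(title: string, page: number, chunkIndex: number): string {
  return `${slugify(title)}:p${page}:c${chunkIndex + 1}`;
}

// Keep only the file name so server-side temp directories are not exposed.
function publicSourceName(uploadPath: string): string {
  const parts = uploadPath.split(/[\\/]/);
  return parts[parts.length - 1] ?? uploadPath;
}
```

With these rules, citationKey("Attention Is All You Need", 3, 0) yields attention-is-all-you-need:p3:c1.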

Scientific chat prompt: the system prompt now sets a strict research-assistant persona that cites every claim by key and prefers Results/Methods evidence over Introduction/Discussion. fetchResearchEvidence builds a same-origin URL instead of using the hard-coded localhost:3000. Temperature is taken from the request instead of being hard-coded to 0.
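One way a same-origin URL could be derived from the incoming request headers is sketched below. The helper names, the use of x-forwarded-* headers, and the /api/fetch-documents path (inferred from the file location ui/pages/api/fetch-documents.ts) are assumptions, not the PR's confirmed implementation.

```typescript
// Header map in the shape Node HTTP servers typically expose.
type HeaderMap = Record<string, string | string[] | undefined>;

// Take the first value of a possibly repeated or comma-separated header.
function firstHeader(value: string | string[] | undefined): string | undefined {
  const raw = Array.isArray(value) ? value[0] : value;
  return raw?.split(",")[0]?.trim() || undefined;
}

// Build an absolute URL on the same host that served the current request.
function sameOriginUrl(headers: HeaderMap, path: string): string {
  const host = firstHeader(headers["x-forwarded-host"]) ?? firstHeader(headers["host"]);
  if (!host) throw new Error("Cannot determine request host");
  // Assumption: plain http only for local development when no proxy header is present.
  const proto =
    firstHeader(headers["x-forwarded-proto"]) ??
    (host.startsWith("localhost") ? "http" : "https");
  return new URL(path, `${proto}://${host}`).toString();
}
```

Forwarded headers are client-controllable unless a trusted proxy overwrites them, so a deployment that uses this approach would likely want to validate the host against an allow-list.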

Validation

npx vitest run - 30/30 tests passed
npx tsc --noEmit - 0 type errors

Payout: Algora bounty-platform payout to GitHub user @watcharaponthod-code.

…497)

Implement section-aware document ingestion, multi-query retrieval with reciprocal rank fusion, stable citation keys, and budget-capped evidence context for the scientific/research RAG workflow.

Changes:
- Add ui/utils/server/scientific-rag.ts: core RAG utilities
  * detectScientificSection: identifies abstract/methods/results/etc from chunk text
  * sectionWeight: importance weights (abstract 1.4x, results 1.3x, methods 1.2x...)
  * buildChunkMetadata: typed metadata with stable citationKey, strips temp paths
  * buildResearchQueries: expands query into 4 deterministic variants for recall
  * fuseQueryResults: RRF + section-weighted deduplication across query result sets
  * buildEvidencePayload: budget-capped evidence context + source manifest
  * parseBoundedInteger: safe integer parsing for API params
  * SCIENTIFIC_SEPARATORS: section-heading-first text splitter separators
- Update ui/pages/api/inject-documents.ts:
  * Use SCIENTIFIC_SEPARATORS for section-aligned chunking (900 char chunks)
  * Replace processDocuments with buildChunkMetadata for typed, safe metadata
  * Store citationKey, section, sectionWeight in ChromaDB for downstream ranking
- Update ui/pages/api/fetch-documents.ts:
  * Expand query with buildResearchQueries before Chroma lookup
  * Apply fuseQueryResults RRF + section-weight fusion across all variants
  * Return structured evidence payload instead of raw Chroma response
- Update ui/pages/api/rag-chat.ts:
  * Use same-origin URL for fetch-documents (works in any deployment)
  * Scientific research assistant system prompt with strict citation rules
  * Use temperature from request instead of hard-coded 0
  * Section-prioritised citation rules in prompt (prefer Results > Methods > Abstract)
- Add ui/__tests__/scientific-rag.test.ts: 30 tests covering all public helpers

Validation:
- npx vitest run (30/30 passed)
- npx tsc --noEmit (0 errors)

Bounty: ISAAC-497
Algora bounty: https://algora.io/isaac/bounties/clq18zr98000ejs0gt0nv7gwu
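The two smaller helpers named in the commit message, bounded integer parsing and a budget-capped evidence context, can be sketched as follows. This is an illustration under assumptions: the budget is counted in characters (the PR does not say whether it uses characters or tokens), lower-ranked chunks are dropped whole rather than truncated, and all names and signatures here are hypothetical.

```typescript
// Parse an API parameter into an integer clamped to [min, max]; use fallback if unparsable.
function parseBoundedInt(raw: unknown, fallback: number, min: number, max: number): number {
  const parsed = typeof raw === "number" ? raw : Number.parseInt(String(raw), 10);
  if (!Number.isFinite(parsed)) return fallback;
  return Math.min(max, Math.max(min, Math.trunc(parsed)));
}

interface EvidenceChunk {
  citationKey: string;
  text: string;
}

interface EvidencePayload {
  context: string; // text handed to the LLM
  sources: string[]; // manifest of citation keys actually included
}

// Chunks are assumed to arrive best-first; stop at the first one that would exceed the budget.
function buildEvidence(chunks: EvidenceChunk[], maxChars: number): EvidencePayload {
  const blocks: string[] = [];
  const sources: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const block = `[${chunk.citationKey}] ${chunk.text}`;
    // Every block after the first also pays for the two-character "\n\n" separator.
    const cost = block.length + (blocks.length > 0 ? 2 : 0);
    if (used + cost > maxChars) break;
    blocks.push(block);
    sources.push(chunk.citationKey);
    used += cost;
  }
  return { context: blocks.join("\n\n"), sources };
}
```

Returning the source manifest alongside the context lets the caller check that every key the model cites was actually part of the evidence it saw.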

themachinecorp · 2026-06-26T07:34:25Z

Hi aietal team — THEMACHINE Corp. here. We're a small studio that ships RAG systems for a living (production Chroma-backed retrieval across three of our own products), and ISAAC-497 sits squarely in our wheelhouse.

We see the existing PR #23 (and the 20+ others) tackling this in a single mega-PR shape. Our approach is different — we propose four weekly milestones instead, each landing as its own reviewable PR:

Week 1 — Source Unification Layer. A UnifiedRetriever adapter that flattens uploaded PDFs/TXT and Semantic Scholar refs into a single Document schema (title, authors, year, source, content_hash). Ship behind a feature flag; zero breaking changes.
Week 2 — Citation Engine. Replace ad-hoc string templating with a CitationResolver that produces deterministic [SRC-N] keys per chunk and a structured JSON evidenceContext for the LLM. Backed by unit tests using a 50-doc gold set.
Week 3 — Performance Pass. Async ingestion via asyncio.gather + persistent Chroma client; benchmark R@5 and p95 latency on the existing test set. Target: <800 ms p95 for top-k=8 retrieval on 1k doc corpus.
Week 4 — Stability + Handoff. Pin versions, write migration notes, ship a CHANGELOG, and open a short Loom walking the maintainer through the upgrade path.
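The Week 3 metrics can be pinned down with a short sketch. It is written in TypeScript to match the repository's existing code, uses the nearest-rank definition of a percentile, and the function names and input shapes are hypothetical; it is not part of any submitted benchmark.

```typescript
// Recall@k: fraction of the relevant ids that appear in the top-k retrieved ids.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const topK = retrieved.slice(0, k);
  // Count each relevant id once even if the retriever returned it twice.
  const hits = topK.filter((id, i) => relevant.has(id) && topK.indexOf(id) === i).length;
  return hits / relevant.size;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("No samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)] as number;
}
```

A latency target such as "<800 ms p95" would then be checked with percentile(latenciesMs, 95) < 800 over the per-query timings, and R@5 with the mean of recallAtK(ids, gold, 5) over the gold set.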

Working branch: feat/isaac-497-scientific-rag against aietal/aimengpt@main. Estimated diff: +1,400 / −180 LOC across 9 files, ~600 LOC of which is test + gold data. We will not touch unrelated files.

We don't need write access to the private aietal/isaac repo for the PR — we'll open the implementation as a feature branch in aietal/aimengpt with the diff against current main so the maintainer can review in isolation. Our advantage over the existing 22 PRs is scope discipline and test coverage, not novel ideas.

If this shape fits what you want, we can kick off Week 1 within 48h. If you'd rather pick a different submission, no problem — happy to step aside.

— Kevin / CTO, THEMACHINE Corp.
GitHub: github.com/THEMACHINE-HF
Algora: algora.io/THEMACHINE-HF

watcharaponthod-code wants to merge 1 commit into aietal:master from watcharaponthod-code:feat/enhanced-scientific-rag-isaac497
