feat: add badcase recording, LLM judge fallback, dual metrics, and fi…#26

Open
xujiayuan0205 wants to merge 1 commit into main from feature/pipeline-enhancement

@xujiayuan0205
Contributor

…x ToMi field mapping

  • Add StructuredResult dataclass wrapping parsed output with raw_response and reasoning_content
  • Add src/judge.py for LLM semantic judge fallback when structured extraction fails
  • Extend runner.py with collect_badcases() and build_corrected_predictions()
  • Add dual metrics (strict + judge-corrected) to all task run.py scripts
  • Fix ToMi field mapping: Story.full_story/Question/Answer.Correct_Answer
  • Fix reasoning_content capture: support both 'reasoning' and 'reasoning_content' field names
  • Fix run_all.py subprocess PYTHONPATH for src module resolution
  • Update SUMMARY.md with deepseek-chat and deepseek-r1 results
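
For reviewers, here is a rough sketch of how the pieces listed above could fit together. It is not the code in this PR: the `parsed` field, the `extract_reasoning` and `dual_accuracy` helpers, and the judge callback signature are all hypothetical names chosen for illustration. Only `StructuredResult`, `raw_response`, `reasoning_content`, and the two reasoning field names come from the description. The sketch assumes the judge is consulted only when structured extraction fails, as the description states.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence, Tuple


@dataclass
class StructuredResult:
    """Assumed shape: parsed output plus the raw text it was extracted from."""
    parsed: Optional[Any]                    # hypothetical field; None when extraction failed
    raw_response: str
    reasoning_content: Optional[str] = None


def extract_reasoning(message: dict) -> Optional[str]:
    """Return reasoning text from either field name (hypothetical helper)."""
    for key in ("reasoning_content", "reasoning"):
        value = message.get(key)
        if value:
            return value
    return None


def dual_accuracy(
    results: Sequence[StructuredResult],
    gold: Sequence[Any],
    judge: Callable[[str, Any], bool],
) -> Tuple[float, float]:
    """Return (strict, judge_corrected) accuracy.

    Strict counts only successfully parsed outputs that equal the gold answer.
    Judge-corrected additionally asks the judge about unparsed responses.
    """
    if not gold:
        return 0.0, 0.0
    strict_hits = 0
    corrected_hits = 0
    for result, answer in zip(results, gold):
        if result.parsed is not None:
            hit = int(result.parsed == answer)
            strict_hits += hit
            corrected_hits += hit
        elif judge(result.raw_response, answer):
            # Extraction failed: fall back to the semantic judge.
            corrected_hits += 1
    return strict_hits / len(gold), corrected_hits / len(gold)


def stub_judge(raw: str, answer: Any) -> bool:
    """Stand-in for an LLM judge: a plain substring check."""
    return str(answer).lower() in raw.lower()


results = [
    StructuredResult(parsed="A", raw_response="A"),
    StructuredResult(parsed=None, raw_response="I think the answer is B"),
    StructuredResult(parsed="C", raw_response="C"),
]
gold = ["A", "B", "D"]
strict, corrected = dual_accuracy(results, gold, stub_judge)
```

In this toy run the second response fails extraction, so it counts as wrong under the strict metric but is recovered by the stub judge under the corrected one. A real judge would call a model instead of doing a substring check.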