feat: add badcase recording, LLM judge fallback, dual metrics, and fi… by xujiayuan0205 · Pull Request #26 · TomTraining/TomTest

xujiayuan0205 · 2026-04-15T09:30:32Z

…x ToMi field mapping

Add StructuredResult dataclass wrapping parsed output with raw_response and reasoning_content
Add src/judge.py for LLM semantic judge fallback when structured extraction fails
Extend runner.py with collect_badcases() and build_corrected_predictions()
Add dual metrics (strict + judge-corrected) to all task run.py scripts
Fix ToMi field mapping: Story.full_story/Question/Answer.Correct_Answer
Fix reasoning_content capture: support both 'reasoning' and 'reasoning_content' field names
Fix run_all.py subprocess PYTHONPATH for src module resolution
Update SUMMARY.md with deepseek-chat and deepseek-r1 results
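A minimal sketch of how the pieces listed above might fit together, for reviewers who want a mental model before reading the diff. Only `StructuredResult`, `raw_response`, `reasoning_content`, and the two reasoning field names come from this PR description. The `parsed` field, `extract_reasoning`, `dual_accuracy`, and the `judge` callable are hypothetical names chosen for illustration, and the judge here is a plain function standing in for the LLM call in src/judge.py.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class StructuredResult:
    """Parsed output kept together with the raw text it came from.

    The `parsed` field name is an assumption; the PR only names
    raw_response and reasoning_content.
    """
    parsed: Optional[str]  # None when structured extraction failed
    raw_response: str
    reasoning_content: Optional[str] = None


def extract_reasoning(message: dict) -> Optional[str]:
    # Accept either field name, since providers differ in which one they use.
    return message.get("reasoning_content") or message.get("reasoning")


def dual_accuracy(
    results: List[StructuredResult],
    golds: List[str],
    judge: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Return strict accuracy and judge-corrected accuracy.

    Strict counts only successfully parsed, exactly matching answers.
    Judge-corrected additionally asks `judge` about the raw response
    when parsing failed (hypothetical fallback policy).
    """
    if not golds:
        return {"strict": 0.0, "judge_corrected": 0.0}
    strict_hits = 0
    corrected_hits = 0
    for result, gold in zip(results, golds):
        if result.parsed is not None:
            hit = int(result.parsed == gold)
            strict_hits += hit
            corrected_hits += hit
        elif judge(result.raw_response, gold):
            corrected_hits += 1
    n = len(golds)
    return {"strict": strict_hits / n, "judge_corrected": corrected_hits / n}


def stub_judge(raw_response: str, gold: str) -> bool:
    # Stand-in for an LLM semantic judge: a naive substring check.
    return gold.lower() in raw_response.lower()


demo_results = [
    StructuredResult(parsed="A", raw_response="A"),
    StructuredResult(parsed=None, raw_response="I think the answer is B"),
    StructuredResult(parsed="C", raw_response="C"),
]
demo_metrics = dual_accuracy(demo_results, ["A", "B", "D"], stub_judge)
```

In this sketch the second item fails extraction, so it counts as wrong under the strict metric but is recovered by the judge. Reporting both numbers keeps the gap between formatting failures and genuinely wrong answers visible.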



xujiayuan0205 wants to merge 1 commit into main from feature/pipeline-enhancement
