@chenhao0205 chenhao0205 commented Nov 21, 2025

📌 PR Description

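  • Per the cf requirement, added file-encoding detection to the LazyLLMReaderBase class so that each file is read using its actual encoding.
  • While implementing this, fixed a bug in DirectoryReader involving class versus instance initialization.
  • Added a number of optimizations to the encoding-detection flow.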
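The detection step described above could look roughly like the following. This is a minimal sketch, not the code in this PR: `detect_encoding`, `decode_auto`, and `read_text_auto` are hypothetical names, and the UTF-8 fast path is an assumed optimization. The only library call used is `charset_normalizer.from_bytes(...).best()`, which returns a match object with an `encoding` attribute, or `None` when nothing fits.

```python
from pathlib import Path
from typing import Optional

try:
    # Third-party dependency introduced by this PR.
    from charset_normalizer import from_bytes
except ImportError:
    # Keep the sketch runnable when the package is not installed.
    from_bytes = None


def detect_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Guess the encoding of `raw` (hypothetical helper, not the PR's API)."""
    # Fast path (assumption): most files are UTF-8, so try it before detection.
    # "utf-8-sig" accepts UTF-8 with or without a BOM and strips the BOM.
    try:
        raw.decode("utf-8-sig")
        return "utf-8-sig"
    except UnicodeDecodeError:
        pass
    # Slow path: ask charset_normalizer for its best guess, if available.
    if from_bytes is not None:
        best = from_bytes(raw).best()
        if best is not None:
            return best.encoding
    return default


def decode_auto(raw: bytes, encoding: Optional[str] = None) -> str:
    """Decode with an explicit encoding if given, otherwise auto-detect."""
    enc = encoding or detect_encoding(raw)
    # errors="replace" keeps a wrong guess from raising in the reader.
    return raw.decode(enc, errors="replace")


def read_text_auto(path, encoding: Optional[str] = None) -> str:
    """Read a file as text, detecting its encoding unless one is specified."""
    return decode_auto(Path(path).read_bytes(), encoding)
```

Passing `encoding` explicitly skips detection entirely, which is the "manually specified" case used as the baseline in the overhead measurements.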

🔍 Related Issue

  • image

✅ Type of Change

  • Bug fix (non-breaking change that fixes an issue)

🧪 How Has This Been Tested?

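  1. Verified that the Reader still works correctly with auto-detection added (test_reader.py).
  2. Tested a folder containing files with mixed encodings.
  3. Measured the performance overhead of auto-detection relative to manually specifying the encoding:
    (1) On Reader: 14-15%
    (2) On SimpleDirectoryReader: 7-8%
    (3) On Document: 3-5%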
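For reference, an overhead figure of this kind can be computed as `(auto - manual) / manual * 100`. The sketch below shows one way to measure it with `timeit`. It is an illustration only: `read_auto` is a stand-in that tries candidate encodings in order rather than calling the detection code in this PR, and the sample data, helper names, and repeat counts are all assumptions.

```python
import timeit


def overhead_percent(baseline_s: float, candidate_s: float) -> float:
    """Relative slowdown of `candidate_s` over `baseline_s`, in percent."""
    return (candidate_s - baseline_s) / baseline_s * 100.0


def time_call(fn, repeat: int = 5, number: int = 200) -> float:
    """Best-of-`repeat` wall time for `number` calls of `fn`."""
    return min(timeit.repeat(fn, repeat=repeat, number=number))


# Hypothetical sample payload: GBK-encoded text, which is not valid UTF-8.
RAW = ("编码检测 " * 2000).encode("gbk")


def read_manual() -> str:
    # Baseline: the caller specifies the encoding up front.
    return RAW.decode("gbk")


def read_auto() -> str:
    # Stand-in for auto-detection: try candidate encodings in order.
    for enc in ("utf-8", "gbk"):
        try:
            return RAW.decode(enc)
        except UnicodeDecodeError:
            continue
    return RAW.decode("utf-8", errors="replace")


if __name__ == "__main__":
    base = time_call(read_manual)
    auto = time_call(read_auto)
    print(f"auto-detection overhead: {overhead_percent(base, auto):.1f}%")
```

Taking the minimum over several repeats reduces noise from other processes. The numbers this prints depend on the machine and on the stand-in detector, so they are not comparable to the figures listed above.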

⚠️ Additional Notes

New dependency: charset_normalizer

@ChenJiahaoST ChenJiahaoST left a comment

Please resolve the merge conflicts and add some unit tests for the auto-detection feature.
