Feature/improve scraping rules: Improve Scraping Rules & Error Handling #7
Description
Improves crawling efficiency and robustness by filtering out irrelevant pages and enhancing error handling during history retrieval.
Changes
Tale Spider Optimization
Problem: Spider was crawling 13,792+ pages but only ~6,374 were actual tales, wasting 47+ minutes on irrelevant system pages.
Solution: Added deny rules to filter out non-content pages.
Impact: Reduces crawl time by avoiding ~7,400 unnecessary page requests.
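To illustrate the idea, here is a minimal sketch of deny-rule filtering. It is not the rule set from this PR, which is not reproduced in the description. The patterns and the names `DENY_PATTERNS`, `is_denied`, and `filter_links` are hypothetical. The sketch uses only the standard library. If the spider is built on Scrapy (an assumption), the same regexes would typically be passed as the `deny` argument of a `LinkExtractor`.

```python
import re

# Hypothetical deny patterns for system/non-content pages.
# The actual patterns added in this PR are not shown in the description.
DENY_PATTERNS = [r"/system:", r"/forum/", r"/nav:"]
_DENY_RES = [re.compile(p) for p in DENY_PATTERNS]


def is_denied(url: str) -> bool:
    """Return True if any deny pattern matches anywhere in the URL."""
    return any(rx.search(url) for rx in _DENY_RES)


def filter_links(urls):
    """Keep only the URLs that no deny pattern matches."""
    return [u for u in urls if not is_denied(u)]


if __name__ == "__main__":
    candidates = [
        "https://example.org/some-tale",
        "https://example.org/system:recent-changes",
        "https://example.org/forum/t-123",
    ]
    print(filter_links(candidates))
```

Filtering at link-extraction time means a denied page is never requested at all, which is where the crawl-time saving comes from.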
Enhanced Error Handling
Added try-except blocks to gracefully handle missing history data
Checks that the history key exists in items before processing to prevent KeyErrors
Robustness Improvements
Falls back gracefully when history lookup fails (returns empty dict instead of crashing)
Validates history data structure before sorting
Testing
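As a starting point for checking the error-handling and robustness behavior listed above, the following is a minimal sketch rather than the PR's actual code. The function names `lookup_history` and `sorted_history`, and the assumption that an item's history is a dict keyed by revision, are hypothetical.

```python
def lookup_history(history_by_url, url):
    """Return the history for url, or an empty dict if the lookup fails."""
    try:
        return history_by_url[url]
    except (KeyError, TypeError):
        # Missing entry, or the container itself is missing/not subscriptable.
        return {}


def sorted_history(item):
    """Return the item's history entries sorted by key.

    Returns an empty list when the history key is absent, when its value
    is not a dict, or when the keys cannot be compared with each other.
    """
    history = item.get("history") if isinstance(item, dict) else None
    if not isinstance(history, dict):
        return []
    try:
        return sorted(history.items(), key=lambda kv: kv[0])
    except TypeError:
        # Mixed key types (e.g. int and str) cannot be ordered.
        return []


if __name__ == "__main__":
    print(lookup_history({}, "https://example.org/some-tale"))
    print(sorted_history({"title": "no history here"}))
    print(sorted_history({"history": {"2": "second", "1": "first"}}))
```

Returning an empty dict or list keeps downstream code on a single path: it can iterate over the result without special-casing items whose history could not be retrieved.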