@heroheman heroheman commented Dec 17, 2025

Description

Improves crawling efficiency and robustness by filtering out irrelevant pages and enhancing error handling during history retrieval.

Changes

Tale Spider Optimization

Problem: Spider was crawling 13,792+ pages but only ~6,374 were actual tales, wasting 47+ minutes on irrelevant system pages.

Solution: Added deny rules to filter out non-content pages:

Rule(LinkExtractor(deny=[r"system:.*", r".*:.*", re.escape("tag-search")]))

Impact: Reduces crawl time by avoiding ~7,400 unnecessary page requests.
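
As a rough illustration of what the deny patterns above match, the sketch below applies them to bare page names using only the standard `re` module. The page names and the `is_denied` helper are hypothetical, not taken from the spider. It tests names only: a full URL also contains a colon after its scheme, so how these patterns behave on absolute URLs is not covered here.

```python
import re

# Patterns copied from the Rule above.
DENY_PATTERNS = [r"system:.*", r".*:.*", re.escape("tag-search")]
DENY_RES = [re.compile(pattern) for pattern in DENY_PATTERNS]


def is_denied(page_name: str) -> bool:
    """Return True if any deny pattern is found anywhere in the page name."""
    return any(regex.search(page_name) for regex in DENY_RES)


# Hypothetical page names, for illustration only.
candidates = [
    "system:recent-changes",
    "component:theme",
    "tag-search",
    "some-tale-title",
]

# Only names without a category prefix or "tag-search" survive the filter.
kept = [name for name in candidates if not is_denied(name)]
print(kept)
```

In this sketch any name containing a colon is rejected by `r".*:.*"`, which also covers everything `r"system:.*"` would match.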

Enhanced Error Handling

  • History Processing: Added try-except blocks to gracefully handle missing history data
  • Initialization: Ensure history key exists in items before processing to prevent KeyErrors
  • Logging: Improved error messages for better debugging

Robustness Improvements

  • Handle cases where history lookup fails (returns empty dict instead of crashing)
  • Validate history data structure before sorting
  • Set sensible defaults ("unknown") when creator/date cannot be determined
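
The behaviour described above could look roughly like the following minimal sketch. It is not the code from this PR: the helper names (`parse_date`, `normalize_history`, `creator_and_date`) and the revision fields `author` and `date` are assumptions made for illustration, and dates are assumed to be ISO-8601 strings without mixed time zones.

```python
from datetime import datetime


def parse_date(value):
    """Parse an ISO-8601 string; return None if it is missing or malformed."""
    if not isinstance(value, str):
        return None
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None


def normalize_history(history):
    """Accept a dict or list of revisions and return a list sorted by date."""
    if isinstance(history, dict):
        revisions = list(history.values())
    elif isinstance(history, list):
        revisions = history
    else:
        # A failed lookup (None, or any unexpected type) yields no revisions.
        return []
    # Drop entries that are not revision dicts before sorting.
    revisions = [rev for rev in revisions if isinstance(rev, dict)]

    def sort_key(rev):
        parsed = parse_date(rev.get("date"))
        # Undated revisions sort after dated ones.
        return (parsed is None, parsed or datetime.min)

    return sorted(revisions, key=sort_key)


def creator_and_date(item):
    """Return (creator, created) from the earliest revision, or defaults."""
    history = normalize_history(item.get("history"))
    if not history:
        return "unknown", "unknown"
    first = history[0]
    return first.get("author") or "unknown", first.get("date") or "unknown"
```

Using `item.get("history")` rather than `item["history"]` means an item without a history key falls through to the `"unknown"` defaults instead of raising a KeyError.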

Testing

make data/scp_tales.json  # Should be significantly faster
make data/processed/tales # Should handle edge cases gracefully

- Updated the LinkExtractor in ScpTaleSpider to deny links
  matching specific patterns, improving the relevance of parsed tales.
- Fix unintended removal of the LinkExtractor rule
- Add checks for empty responses and missing 'body' in JSON
- Log errors for various failure scenarios to improve debugging
- Ensure robust parsing of history HTML to prevent crashes
- Handle empty history cases by returning an empty list
- Support both dict and list formats for history input
- Safely parse date strings with error handling
- Sort revisions by date, ensuring robustness against missing values
- Use `get` method to safely access history in hubs and items
- Prevent potential KeyError by ensuring history key is present
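
The response checks listed above (empty responses, missing 'body' in JSON, logged failures) might be sketched as follows. This is an assumption-laden illustration rather than the PR's implementation: the function name `extract_history_html` and the `page_id` parameter are hypothetical, and the response is assumed to be a JSON object whose `body` field carries the history HTML.

```python
import json
import logging

logger = logging.getLogger(__name__)


def extract_history_html(response_text, page_id="?"):
    """Return the 'body' HTML from a JSON response, or None on any failure."""
    if not response_text or not response_text.strip():
        logger.error("Empty history response for page %s", page_id)
        return None
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError as exc:
        logger.error("Invalid JSON in history response for page %s: %s", page_id, exc)
        return None
    if not isinstance(payload, dict) or "body" not in payload:
        logger.error("History response for page %s has no 'body' field", page_id)
        return None
    return payload["body"]
```

Returning `None` for every failure mode lets the caller decide on a fallback, such as an empty history, without wrapping each call in its own try-except.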
@tedivm tedivm force-pushed the feature/improve-scraping-rules branch from b7ec8bc to f2ececf Compare December 29, 2025 22:04