Skip to content

Conversation

@Ahmed-Tawfik94
Copy link
Collaborator

Summary

Fixed a critical bug in AsyncUrlSeeder where _resolve_head() was incorrectly returning redirect targets without verifying they were alive. This could cause dead URLs to be treated as valid during URL discovery.
#1603
Key Changes:

  • Added verify_redirect_targets parameter to AsyncUrlSeeder.__init__() (default: True)
  • Modified _resolve_head() to conditionally verify redirect targets based on the parameter
  • Enhanced documentation with parameter descriptions
  • Added comprehensive test suite for regression testing

Backward Compatibility: Existing code continues to work with improved behavior by default. Users can set verify_redirect_targets=False to restore the previous behavior if needed.

List of files changed and why

  • crawl4ai/async_url_seeder.py - Core bug fix and parameter addition
  • tests/test_async_url_seeder.py - Added unit tests for both verification modes
  • test_scripts/test_async_url_seeder_fixes.py - Comprehensive demo/test suite for all fixes
  • test_scripts/README.md - Documentation for test scripts

How Has This Been Tested?

Created comprehensive test suite covering:

  • ✅ Dead redirect targets are filtered out (default behavior)
  • ✅ Legacy behavior preserved when verify_redirect_targets=False
  • ✅ Valid redirects still work correctly
  • ✅ Direct URL validation unchanged
  • ✅ Context manager functionality preserved
  • ✅ All existing AsyncUrlSeeder functionality works
  • ✅ Integration with SeedingConfig.live_check parameter

Run the test suite with: python test_scripts/test_async_url_seeder_fixes.py

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

- Added `verify_redirect_targets` parameter to control redirect verification.
- Modified `_resolve_head()` to verify redirect targets based on the new parameter.
- Implemented tests for both verification modes, ensuring dead redirects are filtered out and legacy behavior is preserved.
@ntohidi
Copy link
Collaborator

ntohidi commented Nov 27, 2025

@Ahmed-Tawfik94 Why didn't you implement the recommended solution you provided in the root cause message here? #1603 (comment)

@Ahmed-Tawfik94
Copy link
Collaborator Author

@Ahmed-Tawfik94 Why didn't you implement the recommended solution you provided in the root cause message here? #1603 (comment)

According to my root cause analysis, _resolve_head returns redirect targets without confirming that they resolve to 2xx. Setting follow_redirects=True and only returning the final URL if it's 2xx was my suggested solution. it might cause an issue for users who are using the functionality like this and might be a breaking change.
while by usingverify_redirect_targets parameter for single-level verification with an opt-out, making it more conservative and backward-compatible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants