How to aggressively optimize for speed (Target < 5s) even with content loss? #1578
Hello! I am trying to optimize scraping speed as much as possible for my use case.
My Problem
Currently, my [FETCH] time for a page is around 10-11 seconds, even with aggressive resource blocking. My [SCRAPE] time is fast (1-2 seconds).
I am only crawling a single link per run, but I still call it asynchronously through AsyncWebCrawler (a minimal sketch of how I invoke it is below, after the log).
Here is a log example:
```
[FETCH]... ↓ https://www.localeclectic.com/products/example | ✓ | ⏱: 10.39s
[SCRAPE].. ◆ https://www.localeclectic.com/products/example | ✓ | ⏱: 1.12s
[COMPLETE] ● https://www.localeclectic.com/products/example | ✓ | ⏱: 11.52s
```
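For context, this is roughly how I call the crawler for one URL (a simplified sketch; my real code also passes the configuration shown further down):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Single URL per run, but still going through the async API.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://www.localeclectic.com/products/example")
        print(result.success)

asyncio.run(main())
```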
My Goal
My hard requirement is to get a response under 5 seconds.
The most important point: I am willing to sacrifice content accuracy, but the scraper must still work on dynamic (JavaScript-rendered) pages.
My Configuration
Here is the configuration I am using:
```json
{
  "browser": {
    "headers": { "Accept-Language": "en-US,en;q=0.9" },
    "user_agent_mode": "random",
    "enable_stealth": true,
    "headless": true,
    "browser_mode": "dedicated"
  },
  "run": {
    "magic": true,
    "simulate_user": false,
    "override_navigator": true,
    "remove_overlay_elements": false,
    "page_timeout": 10000,
    "delay_before_return_html": 0.1,
    "exclude_all_images": true,
    "markdown_generator": { "content_source": "raw_html", "options": { "ignore_links": true } }
  },
  "block_resource_types": ["image", "font", "media"],
  "block_hosts": []
}
```
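On the Python side, this is roughly how I believe the same settings map onto BrowserConfig / CrawlerRunConfig. This is only a sketch: I am assuming the JSON keys correspond to config fields of the same name, which may not be exact for every option.

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# Sketch of the same settings as Python config objects.
# Field names are assumed to match the JSON keys above; some may differ.
browser_cfg = BrowserConfig(
    headless=True,
    user_agent_mode="random",
    headers={"Accept-Language": "en-US,en;q=0.9"},
)

run_cfg = CrawlerRunConfig(
    magic=True,
    simulate_user=False,
    override_navigator=True,
    remove_overlay_elements=False,
    page_timeout=10000,            # milliseconds
    delay_before_return_html=0.1,  # seconds
    exclude_all_images=True,
)

async def crawl(url: str):
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        return await crawler.arun(url=url, config=run_cfg)
```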
My Questions
Given that my goal is speed above all, what is the recommended "fastest possible" configuration?
Is there a way to set a hard global timeout of 5 seconds for the entire [FETCH] operation? Any tips or life hacks would be very helpful!
I want to interrupt the entire process after 5 seconds; the workaround I am currently considering is sketched below.
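Right now I am experimenting with simply wrapping arun() in asyncio.wait_for() to enforce the deadline from the outside. This is just a sketch of the idea, not something I have confirmed is the intended approach; the 5-second constant is my own.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

HARD_DEADLINE_S = 5  # my own hard cap, not a library setting

async def fetch_with_deadline(url: str):
    async with AsyncWebCrawler() as crawler:
        try:
            # Cancel the whole arun() call if it runs past the deadline.
            return await asyncio.wait_for(crawler.arun(url=url), timeout=HARD_DEADLINE_S)
        except asyncio.TimeoutError:
            # Accept the content loss and give up on this page.
            return None
```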
Thank you!