[Feature Request]: Allow Multiple Proxies in CrawlerRunConfig via Docker API #1315
Closed · duartemvix started this conversation in Feature requests · Replies: 0 comments
What needs to be done?
Currently, it's only possible to set one proxy via `BrowserConfig` or `CrawlerRunConfig`. While that is useful, it requires chaining together multiple API calls to the Docker endpoint, which makes all crawling much slower. My suggestion is to find a way to pass a list of proxies (I get mine from another API) as an array in either `proxy` or `proxy_config`, in both `BrowserConfig` and `CrawlerRunConfig`. The only supported way to set up multiple proxies today is via environment variables, by setting a `PROXIES` var on startup. Since I get the list from an API, that would require quite a workaround to make it work, but I think a lot of other people could benefit from this as well. Here's my config for crawling just one page via the API:
{ "urls": ["https://example.com/"], "browser_config": { "type": "BrowserConfig", "params": { "headless": true, "light_mode": true, "text_mode": true, "user_agent_mode": "random", "verbose": true, "use_persistent_context": true, "extra_args": [ "--disable-extensions", "--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox" ] } }, "crawler_config": { "type": "CrawlerRunConfig", "params": { "cache_mode": "bypass", "remove_forms": true, "override_navigator": true, "only_text": true, "exclude_external_images": true, "exclude_all_images": true, "page_timeout": 10000, "wait_until": "domcontentloaded", "wait_for": "body", "stream": false, "verbose" : true, "mean_delay": 0.3, "magic": true, "delay_before_return_html": 1, "simulate_user": true, "remove_overlay_elements": true, "semaphore_count": 3, "proxy_config": { "server": "127.0.0.1:3000" // <- There should be a way to add multiple proxies here }, "markdown_generator": { "type": "DefaultMarkdownGenerator", "params": { "content_filter": { "type": "PruningContentFilter", "params": { "threshold_type": "dynamic", "min_word_threshold": 3 } } } }, "deep_crawl_strategy": { "type": "BestFirstCrawlingStrategy", "params": { "max_depth": 1, "max_pages": 10, "include_external": false, "filter_chain": { "type": "FilterChain", "params": { "filters": [ { "type": "URLPatternFilter", "params": { "patterns": ["*login*", "*terms*", "*privacy*", "*contact*"], "reverse": true } } ] } }, "url_scorer": { "type": "CompositeScorer", "params": { "scorers": [ { "type": "KeywordRelevanceScorer", "params": { "weight": 1.0, "keywords": [ "growth", "business", "market", "product", "team", "people", "news", "about", "pricing", "company", "how it works" ] } } ] } } } } } } }What problem does this solve?
Increases crawling efficiency and removes an annoying bottleneck: having to spin up new browsers and start new crawls just to change proxies.
Target users/beneficiaries
The whole community.
Current alternatives/workarounds
There are ways to work around this, but they're neither fast nor production-ready. My suggestion would make Crawl4AI more robust. Today the rotation has to happen client-side, as in the sketch below.
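For reference, a minimal Python sketch of that client-side rotation, assuming the Docker server listens on the default port 11235 and accepts the payload shape shown above; the proxy list itself is a hypothetical stand-in for what I fetch from the other API:

```python
import itertools
import requests

# Hypothetical proxy list (in practice fetched from another API).
PROXIES = ["127.0.0.1:3000", "127.0.0.1:3001", "127.0.0.1:3002"]
proxy_pool = itertools.cycle(PROXIES)

def crawl_with_next_proxy(urls):
    """Send one /crawl request using the next proxy in the rotation."""
    payload = {
        "urls": urls,
        "crawler_config": {
            "type": "CrawlerRunConfig",
            "params": {
                "cache_mode": "bypass",
                # Only one proxy fits here today, so every proxy change
                # costs a whole new request to the Docker endpoint.
                "proxy_config": {"server": next(proxy_pool)},
            },
        },
    }
    resp = requests.post("http://localhost:11235/crawl", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()

# One request per proxy rotation -- exactly the overhead this request targets.
results = [crawl_with_next_proxy([u]) for u in ["https://example.com/"]]
```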
Proposed approach
Changing the `server` parameter to take either a string (current setup) or an array (new setup), as sketched below.
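For illustration only, a request payload under the proposed change might look like this. Nothing here exists yet, and round-robin is just one possible rotation policy:

```python
# Hypothetical payload shape if "server" accepted a string OR a list.
# This sketches the requested behavior; it is not an existing API.
payload = {
    "urls": ["https://example.com/"],
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "proxy_config": {
                # Proposed: the server rotates across these proxies for the
                # pages of a single deep crawl, instead of one per request.
                "server": [
                    "127.0.0.1:3000",
                    "127.0.0.1:3001",
                    "127.0.0.1:3002",
                ]
            }
        },
    },
}
```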