[Feature Request]: Config-aware caching #1614
anna-xing started this conversation in Feature requests
Replies: 0 comments
What needs to be done?
When retrieving a cached URL from the SQLite cache, we want a cache hit only if the CrawlerRunConfig used to scrape that URL matches the current CrawlerRunConfig in the request.
What problem does this solve?
We want to run parallel scraping jobs, each of which crawls multiple URLs, and the sets of URLs crawled by different jobs can overlap. Crawler run configs vary between scrapers (e.g. one might exclude external links, while another might include them). If we first crawl a URL with a config that excludes certain elements and then crawl it again soon after with a config that includes those elements, the second crawl should not get a cache hit on the earlier result, which is missing those elements.
Target users/beneficiaries
Developers
Current alternatives/workarounds
We could configure all our crawlers to use the most permissive run configs required across all of them. However, this means that many crawls will return more information than we need, and having extraneous information may dilute a downstream LLM's ability to parse out relevant details.
Proposed approach
Store the CrawlerRunConfig JSON blob as a new column in the SQLite db. In the (a)run method, allow the caller to specify whether the cache query should be config-aware. In (a)get_cached_url, allow the caller to optionally include the current CrawlerRunConfig in the SQLite db query.
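To make the proposal concrete, here is a minimal sketch of a config-aware lookup using only Python's standard sqlite3 and json modules. It is not the library's actual cache code: the table name crawled_data, the config_json column, the helper names init_db, cache_url and canonical_config, and the use of a plain dict in place of a real CrawlerRunConfig are all assumptions made for illustration.

```python
import json
import sqlite3
from typing import Optional


def canonical_config(config: dict) -> str:
    """Serialize a config dict deterministically so equal configs compare equal."""
    # sort_keys makes the JSON text independent of dict insertion order.
    return json.dumps(config, sort_keys=True, separators=(",", ":"))


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per (url, config) pair, so results
    # scraped with different configs can coexist in the cache.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS crawled_data (
            url TEXT NOT NULL,
            config_json TEXT NOT NULL,
            html TEXT NOT NULL,
            PRIMARY KEY (url, config_json)
        )
        """
    )


def cache_url(conn: sqlite3.Connection, url: str, html: str, config: dict) -> None:
    # Store the result together with the config that produced it.
    conn.execute(
        "INSERT OR REPLACE INTO crawled_data (url, config_json, html) "
        "VALUES (?, ?, ?)",
        (url, canonical_config(config), html),
    )


def get_cached_url(
    conn: sqlite3.Connection, url: str, config: Optional[dict] = None
) -> Optional[str]:
    if config is None:
        # Config-agnostic lookup: return the most recently stored row for the URL.
        row = conn.execute(
            "SELECT html FROM crawled_data WHERE url = ? "
            "ORDER BY rowid DESC LIMIT 1",
            (url,),
        ).fetchone()
    else:
        # Config-aware lookup: hit only if the stored config matches exactly.
        row = conn.execute(
            "SELECT html FROM crawled_data WHERE url = ? AND config_json = ?",
            (url, canonical_config(config)),
        ).fetchone()
    return row[0] if row else None
```

In this sketch, passing a config to get_cached_url makes the query config-aware, while omitting it keeps a URL-only lookup. Keying the table on (url, config_json) is one possible design choice. Keeping the URL as the sole key and overwriting the stored config on each crawl would also satisfy the request, at the cost of evicting results whenever jobs with different configs alternate on the same URL.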