[Feature Request]: Config-aware caching #1614

anna-xing · 2025-11-14T18:51:32Z

What needs to be done?

When retrieving a cached URL from the SQLite cache, a lookup should count as a cache hit only if the CrawlerRunConfig used to scrape that URL matches the CrawlerRunConfig in the current request.
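One way to make "matches" concrete is to compare a fingerprint of each config rather than the objects themselves. The sketch below is only an illustration under stated assumptions: `RunConfigStandIn` is a hypothetical stand-in for CrawlerRunConfig with illustrative fields, and `config_fingerprint` and `is_cache_hit` are made-up helper names, not part of the library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfigStandIn:
    """Hypothetical stand-in for CrawlerRunConfig; fields are illustrative only."""
    exclude_external_links: bool = False
    word_count_threshold: int = 200


def config_fingerprint(config: RunConfigStandIn) -> str:
    """Hash a canonical JSON form so that equal configs yield equal fingerprints."""
    # sort_keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(asdict(config), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_cache_hit(cached_fingerprint: str, current: RunConfigStandIn) -> bool:
    """A cached entry is usable only if it was produced under the same config."""
    return cached_fingerprint == config_fingerprint(current)
```

A real config object may hold values that are not directly JSON-serializable (strategies, callbacks), so a production version would need its own serialization step before hashing.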

What problem does this solve?

We want to run parallel scraping jobs, each of which crawls multiple URLs, and the URLs crawled by different jobs can overlap. Crawler run configs vary between scrapers (e.g. one might exclude external links, while another includes them). If we first crawl a URL with a config that excludes certain elements and then crawl it again soon after with a config that includes those elements, we don't want the second crawl to return the cached result that is missing them.

Target users/beneficiaries

Developers

Current alternatives/workarounds

We could configure all our crawlers to use the most permissive run config required by any of them. However, this means that many crawls will return more information than we need, and the extraneous information may make it harder for a downstream LLM to pick out the relevant details.

Proposed approach

Store the CrawlerRunConfig as a JSON blob in a new column of the SQLite db. In the (a)run method, allow the caller to specify whether the cache lookup should be config-aware. In (a)get_cached_url, allow the caller to optionally pass the current CrawlerRunConfig so that it is included in the SQLite query.
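To make the proposal easier to discuss, here is a rough sketch of what the storage and lookup could look like. Everything in it is an assumption for illustration: the `cache_sketch` table, its columns, the plain-dict config, and the synchronous `get_cached_url` signature are hypothetical and do not reflect the library's actual schema or API.

```python
import json
import sqlite3
from typing import Optional


def canonical_config(config: dict) -> str:
    """Serialize the config deterministically so string equality means config equality."""
    return json.dumps(config, sort_keys=True, separators=(",", ":"))


def init_cache(conn: sqlite3.Connection) -> None:
    # Hypothetical table; run_config is the proposed new column.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS cache_sketch (
               url TEXT NOT NULL,
               run_config TEXT NOT NULL,
               html TEXT NOT NULL,
               PRIMARY KEY (url, run_config)
           )"""
    )


def cache_url(conn: sqlite3.Connection, url: str, config: dict, html: str) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO cache_sketch (url, run_config, html) VALUES (?, ?, ?)",
        (url, canonical_config(config), html),
    )


def get_cached_url(
    conn: sqlite3.Connection, url: str, config: Optional[dict] = None
) -> Optional[str]:
    """Return cached HTML; when a config is passed, only a matching entry is a hit."""
    if config is None:
        # Config-unaware lookup: any entry for the URL counts.
        query = "SELECT html FROM cache_sketch WHERE url = ? LIMIT 1"
        params: tuple = (url,)
    else:
        # Config-aware lookup: the stored config must match exactly.
        query = "SELECT html FROM cache_sketch WHERE url = ? AND run_config = ?"
        params = (url, canonical_config(config))
    row = conn.execute(query, params).fetchone()
    return row[0] if row else None
```

Keying the table on `(url, run_config)` is one possible design choice: it lets the same URL be cached once per config, so parallel jobs with different configs do not overwrite each other's entries. Leaving `config` optional keeps the config-unaware behavior available to callers that do not opt in.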
