
Conversation

@Mantisus (Collaborator) commented Nov 30, 2025

Description

  • Protect Request from partial mutations on request handler failure; relevant for AdaptivePlaywrightCrawler.

Issues

@Mantisus requested review from Pijukatel and janbuchar and removed the request for janbuchar on November 30, 2025 00:55
@Mantisus self-assigned this on Nov 30, 2025
@Mantisus marked this pull request as a draft on November 30, 2025 01:33
@Mantisus marked this pull request as ready for review on December 3, 2025 01:22
@Pijukatel (Collaborator)

I am not sure about doing this due to some race conditions. My concern is reverting the changes made to the session by parallel handlers.

Scenario with two handlers running at the same time, both using the same session:

  1. Both handlers create their copy of the session (identical at this point).
  2. Both handlers modify the session.
  3. One of the handlers fails and reverts both modifications.

Also, we should keep the JS version aligned, so please add a note here if any changes should be made there as well:
apify/crawlee#2798

@Mantisus (Collaborator, Author) commented Dec 3, 2025

> I am not sure about doing this due to some race conditions. My concern is reverting the changes made to the session by parallel handlers.

Yes, I am also concerned about the race condition in the session for some cases. But I am not yet sure about the best approach to avoid it.

> One of the handlers fails and reverts both modifications.

This should not revert both modifications.
The failed handler will continue to work with the original session.
The successful handler will apply the update from the copy to the original.

I will try to add a test that reproduces this behavior.
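
For illustration, here is a minimal sketch of the commit-on-success behavior described above. It is not the actual crawlee implementation; the function and parameter names are made up, and only a user_data-like dict is modeled:

import copy
from collections.abc import Awaitable, Callable
from typing import Any


async def run_with_isolated_state(
    handler: Callable[[dict[str, Any]], Awaitable[None]],
    shared_user_data: dict[str, Any],
) -> None:
    """Run a handler against a copy of the shared state and commit it only on success."""
    working_copy = copy.deepcopy(shared_user_data)  # The handler only ever sees this copy.
    await handler(working_copy)  # On failure the exception propagates; the original stays untouched.
    shared_user_data.clear()
    shared_user_data.update(working_copy)  # On success the copy is applied back to the original.

With two parallel handlers sharing the same state, a failed handler's copy is simply discarded, while each successful handler overwrites the shared state on commit, which is exactly where the race between concurrent successful runs comes from.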

@Pijukatel (Collaborator)

This is the scenario I had in mind. But maybe I misunderstood the intent of this PR, because the session modification is not reverted in this case:

import asyncio

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext
from crawlee.sessions import SessionPool


async def test_protect_session_in_run_handlers() -> None:
    """Test that the session in the crawling context is protected in run handlers."""

    both_handlers_started = asyncio.Barrier(2)

    async with SessionPool(max_pool_size=1) as session_pool:
        crawler = BasicCrawler(session_pool=session_pool, max_request_retries=0)

        @crawler.router.default_handler
        async def handler(context: BasicCrawlingContext) -> None:
            await both_handlers_started.wait()  # Block until both handlers are started.
            # Mutate the session user data.
            context.session.user_data[context.request.url] = 'whatever'
            if context.request.url == 'https://test.url1/':
                # Fail in one handler after modifying the session.
                raise ValueError('Simulated error after modifying session')

        await crawler.run(['https://test.url1/', 'https://test.url2/'])

        # Correct modification not reverted.
        assert (await session_pool.get_session()).user_data.get('https://test.url2/')
        # Failed handler modification reverted.
        assert not (await session_pool.get_session()).user_data.get('https://test.url1/')

@Mantisus (Collaborator, Author)

Yes, I experimented with this. For Sessions, the race condition problem arises when parallel runs succeed, since the final state is overwritten by the last successful one. As a result, I dropped the update for Session, but I wanted to think about it some more.

After thinking about it, I believe that such changes should not be made for Sessions, and Session isolation should not be used in AdaptivePlaywrightCrawler, as state rollbacks may be undesirable and unexpected for the user.
For example, I can imagine a case where a pre_navigation_hook checks whether the Session has certain cookies or a parameter in user_data, and if it does not, send_request is executed against a third-party resource. This updates the session state, and if a network error then occurs during navigation, it causes a state rollback. The same applies to AdaptivePlaywrightCrawler.
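
A rough sketch of that kind of hook, assuming the pre_navigation_hook decorator and the context's send_request helper; the token URL and the user_data key are made up for the example:

from crawlee.crawlers import PlaywrightCrawler, PlaywrightPreNavCrawlingContext

crawler = PlaywrightCrawler()


@crawler.pre_navigation_hook
async def ensure_auth_token(context: PlaywrightPreNavCrawlingContext) -> None:
    session = context.session
    if session is not None and 'auth_token' not in session.user_data:
        # The session lacks the token, so fetch it from a third-party resource first.
        response = await context.send_request('https://third-party.example/token')
        session.user_data['auth_token'] = response.headers.get('x-auth-token')
        # If navigation then fails with a network error, rolling the session back
        # would silently discard this token, which is the surprising rollback described above.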

The second problem, as you correctly pointed out, is the race condition and the synchronization of user_data: since it can be a nested structure, merging it properly feels like overkill.

For Request, isolation may make sense in AdaptivePlaywrightCrawler, but I'm not sure it makes sense to extend it to the BasicCrawler level, as we discussed earlier: #1488 (comment)

@Pijukatel (Collaborator)

Ok, thanks for the explanation. I am on the same page.

So let's remove Session from the PR text, comments, and test docs, so that the PR deals only with Request isolation.

@Pijukatel (Collaborator)

The session (being sort of a global variable) has the same problem as point 3 mentioned here: apify/crawlee#2798 (comment), so I think it makes sense to maybe only document this, rather than try to do rollbacks.

@Mantisus changed the title from "fix: Protect Request and Session from partial mutations on request handler failure" to "fix: Protect Request from partial mutations on request handler failure" on Dec 10, 2025
@Pijukatel (Collaborator) left a comment

I am a little afraid that through small incremental changes we have reached a state of the code that is very hard to reason about. The AdaptiveCrawler was kind of hacked on top of BasicCrawler, with some expectations no longer being valid.
This is by no means the fault of this change, but I realize it more and more with each such small change.

For example: in BasicCrawler there are two requests that are identical at some point. One is set on the context as a read-only property; it gradually diverges from the original one, and at some point we restore this read-only property to the original state by rewriting it. We keep using context.request and the (original) request, and the reader has to keep in mind when they are still identical, when they have diverged, and when they are identical again. This is very hard to think about.

I would be very much in favor of taking some time to refactor the way we deal with read-only and mutable objects in BasicCrawler and AdaptiveCrawler so that it is clearer.


try:
    await self._run_request_handler(context=context)
    with swaped_context(context, request):
Collaborator

The name is somewhat confusing. It sets the original Request object to the readonly property context.request. It is kind of breaking our own constraint.

Collaborator Author

Yes, this was done so that the protected request is only used within _run_request_handler.

Outside of _run_request_handler, we always work with the original Request, since we rely on fields such as state, retry_count, no_retry, and others. Their state is not synchronized back, since we expect that the user will not change them in user handlers.

But yes, it is not intuitive and makes the code more difficult to read.
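
To make that concrete, here is a toy sketch of the swap pattern being discussed; the class and attribute names are illustrative only and do not mirror the crawlee internals:

from collections.abc import Iterator
from contextlib import contextmanager


class ToyContext:
    """Stand-in for a crawling context whose request property is read-only for handlers."""

    def __init__(self, request: object) -> None:
        self._request = request

    @property
    def request(self) -> object:
        return self._request


@contextmanager
def swapped_request(context: ToyContext, original_request: object) -> Iterator[ToyContext]:
    """Temporarily back the read-only request property with the original Request object."""
    protected_copy = context._request
    context._request = original_request  # Outside the handler, code sees the original again.
    try:
        yield context
    finally:
        context._request = protected_copy  # Restore the handler-facing copy on exit.

The confusion is visible even in this toy version: the property is advertised as read-only, yet the helper mutates what it returns behind the scenes.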

@Mantisus (Collaborator, Author)

> I am a little afraid that through small incremental changes we have reached a state of the code that is very hard to reason about. The AdaptiveCrawler was kind of hacked on top of BasicCrawler, with some expectations no longer being valid.
> This is by no means the fault of this change, but I realize it more and more with each such small change.

I think I felt something similar when I worked on this PR and thought about it afterwards. But I couldn't quite put my finger on what bothered me the most.

Essentially, this PR solves two problems.

  1. It protects attributes that we don't want users to change.
  2. It protects user data from changes when the user handler fails.

The first problem has already been partially solved in #1603, which protects the attributes whose modification could lead to unexpected crawler behavior.
Changing other internal attributes that crawlee uses, such as retry_count and loaded_url, would be undesirable but not critical.
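
As a toy illustration of that first kind of protection (purely illustrative; the real mechanism lives on crawlee's Request model itself, not in a separate view class):

from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class RequestInternalsView:
    """Read-only view over fields that user handlers are not supposed to change."""

    retry_count: int
    loaded_url: str | None


view = RequestInternalsView(retry_count=0, loaded_url=None)
try:
    view.retry_count = 3  # An attempted mutation from a handler.
except FrozenInstanceError:
    pass  # The frozen dataclass rejects the assignment.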

As for the second problem, it might be best to leave it until we receive a similar request from users.

codecov bot commented Dec 15, 2025

Codecov Report

❌ Patch coverage is 97.29730% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@826adbf). Learn more about missing BASE report.

Files with missing lines    Patch %    Lines
src/crawlee/_types.py       90.90%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master    #1585   +/-   ##
=========================================
  Coverage          ?   92.44%           
=========================================
  Files             ?      157           
  Lines             ?    10415           
  Branches          ?        0           
=========================================
  Hits              ?     9628           
  Misses            ?      787           
  Partials          ?        0           
Flag Coverage Δ
unit 92.44% <97.29%> (?)


@vdusek (Collaborator) left a comment

LGTM

@vdusek removed the request for review from janbuchar on December 15, 2025 12:25
@vdusek merged commit a69caf8 into apify:master on Dec 15, 2025
30 of 31 checks passed


Development

Successfully merging this pull request may close these issues.

Track changes made to request and session in RequestHandlerResult
