
Conversation

@Mantisus (Collaborator) commented Nov 30, 2025

Description

  • Protect Request from partial mutations on request handler failure; relevant for AdaptivePlaywrightCrawler.

Issues

@Mantisus requested review from Pijukatel and janbuchar and removed the request for janbuchar on November 30, 2025 00:55
@Mantisus self-assigned this on Nov 30, 2025
@Mantisus marked this pull request as a draft on November 30, 2025 01:33
@Mantisus marked this pull request as ready for review on December 3, 2025 01:22
@Pijukatel (Collaborator)

I am not sure about doing this due to some race conditions. My concern is reverting the changes made to the session by parallel handlers.

Scenario with two handlers running at the same time, both using the same session:

  1. Both handlers create their copy of the session (identical at this point).
  2. Both handlers modify the session.
  3. One of the handlers fails and reverts both modifications.

Also, we should keep the JS version aligned, so please add a note here if any changes should be made there as well:
apify/crawlee#2798

@Mantisus (Collaborator, Author) commented Dec 3, 2025

> I am not sure about doing this due to some race conditions. My concern is reverting the changes made to the session by parallel handlers.

Yes, I am also concerned about the race condition in the session for some cases. But I am not yet sure about the best approach to avoid it.

> One of the handlers fails and reverts both modifications.

This should not revert both modifications.
The failed handler will continue to work with the original session.
The successful handler will apply the update from the copy to the original.

I will try to add a test that reproduces this behavior.
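
For illustration, here is a minimal sketch of the commit-on-success behavior described above. It is not the actual crawlee implementation; the function and parameter names are made up, and only a user_data-like dict is modeled:

import copy
from collections.abc import Awaitable, Callable
from typing import Any


async def run_with_isolated_state(
    handler: Callable[[dict[str, Any]], Awaitable[None]],
    shared_user_data: dict[str, Any],
) -> None:
    """Run a handler against a copy of the shared state and commit it only on success."""
    working_copy = copy.deepcopy(shared_user_data)  # The handler only ever sees this copy.
    await handler(working_copy)  # On failure the exception propagates; the original stays untouched.
    shared_user_data.clear()
    shared_user_data.update(working_copy)  # On success the copy is applied back to the original.

With two parallel handlers sharing the same state, a failed handler's copy is simply discarded, while each successful handler overwrites the shared state on commit, which is exactly where the race between concurrent successful runs comes from.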

@Pijukatel (Collaborator)

This is the scenario I had in mind. But maybe I misunderstood the intent of this PR, because the session modification is not reverted in this case:

import asyncio

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext
from crawlee.sessions import SessionPool


async def test_protect_session_in_run_handlers() -> None:
    """Test that the session in the crawling context is protected in run handlers."""

    both_handlers_started = asyncio.Barrier(2)

    async with SessionPool(max_pool_size=1) as session_pool:
        crawler = BasicCrawler(session_pool=session_pool, max_request_retries=0)

        @crawler.router.default_handler
        async def handler(context: BasicCrawlingContext) -> None:
            await both_handlers_started.wait()  # Block until both handlers are started.
            # Mutate the session user data.
            context.session.user_data[context.request.url] = 'whatever'
            if context.request.url == 'https://test.url1/':
                # Fail in one handler after modifying the session.
                raise ValueError('Simulated error after modifying session')

        await crawler.run(['https://test.url1/', 'https://test.url2/'])

        # Correct modification not reverted.
        assert (await session_pool.get_session()).user_data.get('https://test.url2/')
        # Failed handler modification reverted.
        assert not (await session_pool.get_session()).user_data.get('https://test.url1/')

@Mantisus (Collaborator, Author)

Yes, I experimented with this. For Sessions, the race condition problem arises when parallel runs succeed, since the final state is overwritten by the last successful one. As a result, I dropped the update for Session, but I wanted to think about it some more.

After thinking about it, I believe that such changes should not be made for Sessions, and Session isolation should not be used in AdaptivePlaywrightCrawler, as state rollbacks may be undesirable and unexpected for the user.
For example, I can imagine a case where a pre_navigation_hook checks whether the Session has certain cookies or a parameter in user_data, and if it does not, send_request is executed against a third-party resource. This updates the session state, and if a network error then occurs during navigation, it causes a state rollback. The same applies to AdaptivePlaywrightCrawler.
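
A rough sketch of that kind of hook, assuming the pre_navigation_hook decorator and the context's send_request helper; the token URL and the user_data key are made up for the example:

from crawlee.crawlers import PlaywrightCrawler, PlaywrightPreNavCrawlingContext

crawler = PlaywrightCrawler()


@crawler.pre_navigation_hook
async def ensure_auth_token(context: PlaywrightPreNavCrawlingContext) -> None:
    session = context.session
    if session is not None and 'auth_token' not in session.user_data:
        # The session lacks the token, so fetch it from a third-party resource first.
        response = await context.send_request('https://third-party.example/token')
        session.user_data['auth_token'] = response.headers.get('x-auth-token')
        # If navigation then fails with a network error, rolling the session back
        # would silently discard this token, which is the surprising rollback described above.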

The second problem, as you correctly pointed out, is the race condition and the synchronization of user_data: since it can be a nested structure, merging it properly feels like overkill.

For Request, isolation may make sense in AdaptivePlaywrightCrawler, but I'm not sure it makes sense to extend it to the BasicCrawler level, as we discussed earlier: #1488 (comment)

@Pijukatel (Collaborator)

Ok, thanks for the explanation. I am on the same page.

So let's remove Session from the PR text, comments, and test docs, so that the PR deals only with Request isolation.

@Pijukatel (Collaborator)

The session (being sort of a global variable) has the same problem as point 3 mentioned here: apify/crawlee#2798 (comment), so I think it makes sense to maybe only document this, rather than try to do rollbacks.

@Mantisus changed the title from "fix: Protect Request and Session from partial mutations on request handler failure" to "fix: Protect Request from partial mutations on request handler failure" on Dec 10, 2025
@Pijukatel (Collaborator) left a comment

I am a little afraid that through small incremental changes we have reached a state of the code that is very hard to reason about. The AdaptiveCrawler was kind of hacked on top of BasicCrawler, with some expectations no longer being valid.
This is by no means the fault of this change, but I realize it more and more with each such small change.

For example: in BasicCrawler there are two requests that are identical at some point. One is set on the context as a read-only property; it gradually diverges from the original one, and at some point we restore this read-only property to the original state by rewriting it. We keep using context.request and the (original) request, and the reader has to keep in mind when they are still identical, when they have diverged, and when they are identical again. This is very hard to think about.

I would be very much in favor of taking some time to refactor the way we deal with read-only and mutable objects in BasicCrawler and AdaptiveCrawler so that it is clearer.


try:
    await self._run_request_handler(context=context)
    with swaped_context(context, request):
Collaborator

The name is somewhat confusing. It sets the original Request object to the readonly property context.request. It is kind of breaking our own constraint.

Collaborator Author

Yes, this was done so that the protected request is only used within _run_request_handler.

Outside of _run_request_handler, we always work with the original Request, since we rely on fields such as state, retry_count, no_retry, and others. Their state is not synchronized back, since we expect that the user will not change them in user handlers.

But yes, it is not intuitive and makes the code more difficult to read.
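
To make that concrete, here is a toy sketch of the swap pattern being discussed; the class and attribute names are illustrative only and do not mirror the crawlee internals:

from collections.abc import Iterator
from contextlib import contextmanager


class ToyContext:
    """Stand-in for a crawling context whose request property is read-only for handlers."""

    def __init__(self, request: object) -> None:
        self._request = request

    @property
    def request(self) -> object:
        return self._request


@contextmanager
def swapped_request(context: ToyContext, original_request: object) -> Iterator[ToyContext]:
    """Temporarily back the read-only request property with the original Request object."""
    protected_copy = context._request
    context._request = original_request  # Outside the handler, code sees the original again.
    try:
        yield context
    finally:
        context._request = protected_copy  # Restore the handler-facing copy on exit.

The confusion is visible even in this toy version: the property is advertised as read-only, yet the helper mutates what it returns behind the scenes.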

@Mantisus (Collaborator, Author)

> I am a little afraid that through small incremental changes we have reached a state of the code that is very hard to reason about. The AdaptiveCrawler was kind of hacked on top of BasicCrawler, with some expectations no longer being valid.
> This is by no means the fault of this change, but I realize it more and more with each such small change.

I think I felt something similar when I worked on this PR and thought about it afterwards. But I couldn't quite put my finger on what bothered me the most.

Essentially, this PR solves two problems.

  1. It protects attributes that we don't want users to change.
  2. It protects user data from changes when the user handler fails.

The first problem has already been partially solved in #1603, which protects the attributes whose modification could lead to unexpected crawler behavior.
Changing other internal attributes that crawlee uses, such as retry_count and loaded_url, would be undesirable but not critical.
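
As a toy illustration of that first kind of protection (purely illustrative; the real mechanism lives on crawlee's Request model itself, not in a separate view class):

from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class RequestInternalsView:
    """Read-only view over fields that user handlers are not supposed to change."""

    retry_count: int
    loaded_url: str | None


view = RequestInternalsView(retry_count=0, loaded_url=None)
try:
    view.retry_count = 3  # An attempted mutation from a handler.
except FrozenInstanceError:
    pass  # The frozen dataclass rejects the assignment.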

As for the second problem, it might be best to leave it until we receive a similar request from users.

codecov bot commented Dec 15, 2025

Codecov Report

❌ Patch coverage is 97.29730% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@826adbf). Learn more about missing BASE report.

Files with missing lines    Patch %    Lines
src/crawlee/_types.py       90.90%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master    #1585   +/-   ##
=========================================
  Coverage          ?   92.44%           
=========================================
  Files             ?      157           
  Lines             ?    10415           
  Branches          ?        0           
=========================================
  Hits              ?     9628           
  Misses            ?      787           
  Partials          ?        0           
Flag Coverage Δ
unit 92.44% <97.29%> (?)


@vdusek (Collaborator) left a comment

LGTM

@vdusek removed the request for review from janbuchar on December 15, 2025 12:25
@vdusek merged commit a69caf8 into apify:master on Dec 15, 2025
30 of 31 checks passed


Development

Successfully merging this pull request may close these issues.

Track changes made to request and session in RequestHandlerResult
