-
Notifications
You must be signed in to change notification settings - Fork 535
fix: Protect Request from partial mutations on request handler failure
#1585
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I am not sure about doing this due to some race conditions. My concern is reverting the changes made to the session by parallel handlers. Scenario with 2 handlers running at the same time and both using the same session: Both handlers create their copy of the session. (Identical at this point) Also, we should keep JS version aligned, so please add a note here if any changes should be made there as well: |
Yes, I am also concerned about the race condition in the session for some cases. But I am not yet sure about the best approach to avoid it.
This should not revert both modifications. I will try to add a test that reproduces this behavior. |
|
This is the scenario I had in mind. But maybe I misunderstood the intent of this PR, because the session modification is not reverted in this case: |
|
Yes, I experimented with this. For After thinking about it, I believe that such changes should not be made for The second problem, as you correctly pointed out, is race condition and synchronization of For |
|
Ok, thanks for the explanation. I am on the same page. So let's remove the |
|
The session (being sort of a global variable) has the same problem as point 3 mentioned here: apify/crawlee#2798 (comment), so I think it makes sense, maybe only to document this, but not try to do rollbacks. |
Request and Session from partial mutations on request handler failureRequest from partial mutations on request handler failure
Pijukatel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am little afraid that through small incremental changes we have reached a state of the code that is very hard to reason about. The AdaptiveCrawler was kind of hacked on top of BasicCrawler with a some expectations no longer being valid.
This is by no means fault of this change, but I realize it with each such small change more and more.
For example. In BasicCrawler there are 2 at some point identical requests. One is set on context as readonly property, it gradually diverges from the original one and at some point we restore this read-only property to the original state by rewriting it. We keep using context.request and (original) request and the reader has to keep in mind when they are still identical, when they are diverged and when they are identical again. This is very hard to think about.
I would be very much in favor of taking some time to refactor the way we deal with read-only and mutables in BasicCrawler and AdaptiveCrawler so that it is more clear.
|
|
||
| try: | ||
| await self._run_request_handler(context=context) | ||
| with swaped_context(context, request): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is somewhat confusing. It sets the original Request object to the readonly property context.request. It is kind of breaking our own constraint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this was done so that protected request would only be executed within _run_request_handler.
Outside of _run_request_handler, we always work with the original Request, since we work with fields such as state, retry_count, no_retry, and others. Their state is not synchronized, since we expect that the user will not change it in user handlers.
But yes, it is not intuitive and makes the code more difficult to read.
I think I felt something similar when I worked on this PR and thought about it afterwards. But I couldn't quite put my finger on what bothered me the most. Essentially, this PR solves two problems.
The first problem has already been partially solved in #1603, protecting attributes that could lead to unexpected crawler behavior. As for the second problem, it might be best to leave it until we receive a similar request from users. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1585 +/- ##
=========================================
Coverage ? 92.44%
=========================================
Files ? 157
Lines ? 10415
Branches ? 0
=========================================
Hits ? 9628
Misses ? 787
Partials ? 0
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
vdusek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description
Requestfrom partial mutations on request handler failure relevant withAdaptivePlaywrightCrawlerIssues
requestandsessioninRequestHandlerResult#1514