Fix test hang in subprocess expansion service on port bind failure#38572
Conversation
|
r: @derrickaw for YAML and python expansion service. |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a recurring test timeout issue caused by the expansion service failing to bind to a port. By introducing a retry mechanism, enforcing strict error checking during port binding, and switching to a more compatible local bind address, the changes ensure that the service either starts successfully or fails immediately, preventing parent process hangs. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
|
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment |
There was a problem hiding this comment.
Code Review
This pull request improves the reliability of the expansion service by switching the binding address to localhost for better dual-stack compatibility and adding explicit error handling for gRPC port binding failures. Additionally, it introduces a retry mechanism with up to three attempts for starting subprocess services. Feedback suggests adding a delay between these retry attempts to more effectively handle transient issues such as port collisions.
* Revert "Fix test hang in subprocess expansion service on port bind failure (#38572)" This reverts commit 930b94c. * Ensure immediate cleanup of subprocess server on start failure When a SubprocessServer fails to start (e.g., due to a process exit or startup error), the server process could leak if standard purging is blocked by other active owners sharing the cached subprocess. To fix this: - Implement `_SharedCache.force_remove()` to immediately remove a key from the cache and run its destructor regardless of active owners. - Add `SubprocessServer.stop_force()` which calls `force_remove()` to completely terminate the server's process. - Call `stop_force()` in the `except` block of `SubprocessServer.start()` * Support modern manylinux tags based on pip version in Stager This ensures we can download pre-built wheels for environment staging rather than relying on tarball building, which is sometimes slow. * Formatting * Trigger more python tests. * Typo
We are seeing the following test timeout when expansion service is being started.
If the expansion service failed to bind to its port,
add_insecure_portreturned0but was ignored.beam/sdks/python/apache_beam/runners/portability/expansion_service_main.py
Line 74 in 290e372
The subprocess remained alive, causing the parent to wait indefinitely for the gRPC channel to become ready until pytest timeout.
Additionally, binding to '0.0.0.0' caused connection timeouts on dual-stack hosts when the client connected via 'localhost' (resolving to IPv6 loopback
::1).In this PR, we address the above problems by:
RuntimeErrorinexpansion_service_mainifadd_insecure_portfails, allowing the parent to instantly detect the crash, and fail fast.