Fix test hang in subprocess expansion service on port bind failure by shunping · Pull Request #38572 · apache/beam

shunping · 2026-05-21T02:03:36Z

We are seeing the following test timeout while the expansion service is starting.

=================================== FAILURES ===================================
______________________ MLTest.test_ml_preprocessing_yaml _______________________

...

/opt/hostedtoolcache/Python/3.14.5/x64/lib/python3.14/threading.py:373: Failed
----------------------------- Captured stdout call -----------------------------
Cloning dev environment from /runner/_work/beam/beam/sdks/python/test-suites/tox/py314/build/srcs/sdks/python/target/.tox-py314/py314
Requirement already satisfied: numpy in ./target/.tox-py314/test-venv-cache-_ujxtw1x/31094b9cfa8383d00947c0b043fad03c1b780a2756271f73ea8e0c51b4845142/lib/python3.14/site-packages (1.26.4)
------------------------------ Captured log call -------------------------------
WARNING  apache_beam.yaml.yaml_provider:yaml_provider.py:1410    WARNING: Apache Beam is installing Python packages from PyPI at runtime.
   This may pose security risks or cause instability due to repository availability.
   Packages: numpy
   Consider pre-staging dependencies or using a private repository mirror.
   For more information, see: https://beam.apache.org/documentation/sdks/python-dependencies/
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
ERROR    apache_beam.utils.subprocess_server:subprocess_server.py:225 Error bringing up service
Traceback (most recent call last):
  File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py314/build/srcs/sdks/python/target/.tox-py314/py314/lib/python3.14/site-packages/apache_beam/utils/subprocess_server.py", line 215, in start
    channel_ready.result(timeout=wait_secs)
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py314/build/srcs/sdks/python/target/.tox-py314/py314/lib/python3.14/site-packages/grpc/_utilities.py", line 160, in result
    self._block(timeout)
    ~~~~~~~~~~~^^^^^^^^^
  File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py314/build/srcs/sdks/python/target/.tox-py314/py314/lib/python3.14/site-packages/grpc/_utilities.py", line 106, in _block
    self._condition.wait(timeout=remaining)
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.14.5/x64/lib/python3.14/threading.py", line 373, in wait
    gotit = waiter.acquire(True, timeout)
  File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py314/build/srcs/sdks/python/target/.tox-py314/py314/lib/python3.14/site-packages/pytest_timeout.py", line 317, in handler
    timeout_sigalrm(item, settings)
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py314/build/srcs/sdks/python/target/.tox-py314/py314/lib/python3.14/site-packages/pytest_timeout.py", line 502, in timeout_sigalrm
    pytest.fail(PYTEST_FAILURE_MESSAGE % settings.timeout)
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py314/build/srcs/sdks/python/target/.tox-py314/py314/lib/python3.14/site-packages/_pytest/outcomes.py", line 163, in __call__
    raise Failed(msg=reason, pytrace=pytrace)
Failed: Timeout (>600.0s) from pytest-timeout.

When the expansion service failed to bind to its port, add_insecure_port returned 0, but the return value was ignored.

beam/sdks/python/apache_beam/runners/portability/expansion_service_main.py

Line 74 in 290e372

server.add_insecure_port(address)

The subprocess remained alive, so the parent kept waiting for the gRPC channel to become ready until the pytest timeout fired.
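The ignored return value can be turned into a hard failure with a check along these lines. This is a minimal sketch, assuming (as stated above) that add_insecure_port returns 0 when the bind fails. `FakeServer` and `bind_or_raise` are hypothetical names used only for illustration, not the code in expansion_service_main.py.

```python
class FakeServer:
    """Hypothetical stand-in for a gRPC server, for illustration only."""

    def __init__(self, bound_port):
        self._bound_port = bound_port

    def add_insecure_port(self, address):
        # Assumption taken from the description above: 0 signals a failed bind.
        return self._bound_port


def bind_or_raise(server, address):
    """Bind `server` to `address` and raise if the bind reports failure."""
    port = server.add_insecure_port(address)
    if port == 0:
        # Exiting with an error lets the parent notice the dead subprocess
        # instead of waiting for a channel that will never become ready.
        raise RuntimeError(f"Failed to bind expansion service to {address}")
    return port


if __name__ == "__main__":
    print(bind_or_raise(FakeServer(55955), "localhost:55955"))
```

With a real server object, the same check would wrap the existing add_insecure_port call.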

Additionally, binding to '0.0.0.0' caused connection timeouts on dual-stack hosts when the client connected via 'localhost' (resolving to IPv6 loopback ::1).
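To see why the bind address matters, it can help to check what 'localhost' resolves to on a given host. The sketch below uses only the standard library. `resolved_addresses` is a hypothetical helper, and its result depends on the host's resolver configuration. A socket bound to 0.0.0.0 listens on IPv4 only, so a client that tries ::1 first will not reach it.

```python
import socket


def resolved_addresses(host, port):
    """Return the distinct IP addresses `host` resolves to, in resolver order."""
    seen = []
    for _family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        ip = sockaddr[0]
        if ip not in seen:
            seen.append(ip)
    return seen


if __name__ == "__main__":
    # On a dual-stack host this may list ::1 as well as 127.0.0.1.
    print(resolved_addresses("localhost", 55955))
```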

In this PR, we address the above problems by:

- Raising a RuntimeError in expansion_service_main if add_insecure_port fails, allowing the parent to detect the crash immediately and fail fast.
- Adding an automatic retry loop in subprocess server to address port problems.
- Changing the bind address from '0.0.0.0' to 'localhost'.
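The retry behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in subprocess_server.py: `start_with_retries`, `start_once`, and the parameter names are hypothetical, and the default of three attempts and the pause between attempts follow the review discussion on this PR.

```python
import time


def start_with_retries(start_once, attempts=3, delay_secs=1.0, sleep=time.sleep):
    """Call `start_once` until it succeeds or `attempts` is exhausted.

    `start_once` is a hypothetical callable that starts the service and
    raises RuntimeError if startup fails (for example, on a port collision).
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return start_once()
        except RuntimeError as e:
            last_error = e
            if attempt < attempts:
                # Pause before retrying so a transient port conflict can clear.
                sleep(delay_secs)
    raise RuntimeError(
        f"Service failed to start after {attempts} attempts") from last_error


if __name__ == "__main__":
    outcomes = iter([RuntimeError("port in use"), "ready"])

    def flaky_start():
        result = next(outcomes)
        if isinstance(result, Exception):
            raise result
        return result

    print(start_with_retries(flaky_start, delay_secs=0.1))
```

Passing `sleep` as a parameter keeps the sketch easy to test without real delays.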

shunping · 2026-05-21T02:43:19Z

r: @derrickaw for YAML and Python expansion service.

gemini-code-assist · 2026-05-21T02:43:34Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a recurring test timeout issue caused by the expansion service failing to bind to a port. By introducing a retry mechanism, enforcing strict error checking during port binding, and switching to a more compatible local bind address, the changes ensure that the service either starts successfully or fails immediately, preventing parent process hangs.

Highlights

- Improved Service Reliability: Implemented an automatic retry loop in the subprocess server to handle transient startup failures and port binding issues.
- Fail-Fast Mechanism: Added explicit error handling in the expansion service to raise a RuntimeError if port binding fails, preventing the parent process from hanging indefinitely.
- Network Compatibility: Updated the bind address from '0.0.0.0' to 'localhost' to resolve connection timeouts on dual-stack systems.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

github-actions · 2026-05-21T02:44:32Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

gemini-code-assist

Code Review

This pull request improves the reliability of the expansion service by switching the binding address to localhost for better dual-stack compatibility and adding explicit error handling for gRPC port binding failures. Additionally, it introduces a retry mechanism with up to three attempts for starting subprocess services. Feedback suggests adding a delay between these retry attempts to more effectively handle transient issues such as port collisions.

* Revert "Fix test hang in subprocess expansion service on port bind failure (#38572)"

  This reverts commit 930b94c.

* Ensure immediate cleanup of subprocess server on start failure

  When a SubprocessServer fails to start (e.g., due to a process exit or startup error), the server process could leak if standard purging is blocked by other active owners sharing the cached subprocess.

  To fix this:
  - Implement `_SharedCache.force_remove()` to immediately remove a key from the cache and run its destructor regardless of active owners.
  - Add `SubprocessServer.stop_force()` which calls `force_remove()` to completely terminate the server's process.
  - Call `stop_force()` in the `except` block of `SubprocessServer.start()`

* Support modern manylinux tags based on pip version in Stager

  This ensures we can download pre-built wheels for environment staging rather than relying on tarball building, which is sometimes slow.

* Formatting

* Trigger more python tests.

* Typo

Fix silent test hang in subprocess expansion service on port bind fai…

ff0f0ab

…lure

github-actions Bot added python runners labels May 21, 2026

shunping added 2 commits May 20, 2026 22:28

Formatting

eb62fce

Add retry when starting subprocess server.

f6252ef

shunping marked this pull request as ready for review May 21, 2026 02:43

gemini-code-assist Bot reviewed May 21, 2026

Comment thread sdks/python/apache_beam/utils/subprocess_server.py

Add sleep before retrying.

002d50b

derrickaw approved these changes May 21, 2026

shunping merged commit 930b94c into apache:master May 21, 2026
99 of 103 checks passed

shunping mentioned this pull request May 23, 2026

Reduce python expansion service startup time #38611

Merged

shunping merged 4 commits into apache:master from shunping:fix-expansion-service-port-binding

2 participants
