
Fix test hang in subprocess expansion service on port bind failure #38572

Merged
shunping merged 4 commits into apache:master from shunping:fix-expansion-service-port-binding
May 21, 2026

@shunping
Collaborator

@shunping shunping commented May 21, 2026

We are seeing the following test timeout while the expansion service is starting.

=================================== FAILURES ===================================
______________________ MLTest.test_ml_preprocessing_yaml _______________________

...

/opt/hostedtoolcache/Python/3.14.5/x64/lib/python3.14/threading.py:373: Failed
----------------------------- Captured stdout call -----------------------------
Cloning dev environment from /runner/_work/beam/beam/sdks/python/test-suites/tox/py314/build/srcs/sdks/python/target/.tox-py314/py314
Requirement already satisfied: numpy in ./target/.tox-py314/test-venv-cache-_ujxtw1x/31094b9cfa8383d00947c0b043fad03c1b780a2756271f73ea8e0c51b4845142/lib/python3.14/site-packages (1.26.4)
------------------------------ Captured log call -------------------------------
WARNING  apache_beam.yaml.yaml_provider:yaml_provider.py:1410    WARNING: Apache Beam is installing Python packages from PyPI at runtime.
   This may pose security risks or cause instability due to repository availability.
   Packages: numpy
   Consider pre-staging dependencies or using a private repository mirror.
   For more information, see: https://beam.apache.org/documentation/sdks/python-dependencies/
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
WARNING  root:subprocess_server.py:219 Waiting for grpc channel to be ready at localhost:55955.
ERROR    apache_beam.utils.subprocess_server:subprocess_server.py:225 Error bringing up service
Traceback (most recent call last):
  File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py314/build/srcs/sdks/python/target/.tox-py314/py314/lib/python3.14/site-packages/apache_beam/utils/subprocess_server.py", line 215, in start
    channel_ready.result(timeout=wait_secs)
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py314/build/srcs/sdks/python/target/.tox-py314/py314/lib/python3.14/site-packages/grpc/_utilities.py", line 160, in result
    self._block(timeout)
    ~~~~~~~~~~~^^^^^^^^^
  File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py314/build/srcs/sdks/python/target/.tox-py314/py314/lib/python3.14/site-packages/grpc/_utilities.py", line 106, in _block
    self._condition.wait(timeout=remaining)
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.14.5/x64/lib/python3.14/threading.py", line 373, in wait
    gotit = waiter.acquire(True, timeout)
  File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py314/build/srcs/sdks/python/target/.tox-py314/py314/lib/python3.14/site-packages/pytest_timeout.py", line 317, in handler
    timeout_sigalrm(item, settings)
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py314/build/srcs/sdks/python/target/.tox-py314/py314/lib/python3.14/site-packages/pytest_timeout.py", line 502, in timeout_sigalrm
    pytest.fail(PYTEST_FAILURE_MESSAGE % settings.timeout)
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py314/build/srcs/sdks/python/target/.tox-py314/py314/lib/python3.14/site-packages/_pytest/outcomes.py", line 163, in __call__
    raise Failed(msg=reason, pytrace=pytrace)
Failed: Timeout (>600.0s) from pytest-timeout.

When the expansion service failed to bind to its port, add_insecure_port returned 0, but that return value was ignored.

The subprocess remained alive, so the parent kept waiting for the gRPC channel to become ready until pytest-timeout failed the test after 600 seconds.
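
A minimal sketch of this failure mode from the parent's side, assuming a hypothetical helper `wait_until_ready` (this is not the code in `subprocess_server.py`). If the parent checks the child's exit status while it waits, a child that exits is noticed right away. A child that stays alive without a bound port would instead keep the loop running until the deadline.

```python
import subprocess
import sys
import time


def wait_until_ready(proc, is_ready, timeout_secs=5.0, poll_secs=0.05):
    """Wait for is_ready() to return True, failing fast if proc exits.

    Hypothetical helper for illustration; not Beam's implementation.
    """
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        exit_code = proc.poll()  # None while the child is still running
        if exit_code is not None:
            raise RuntimeError(f"Service exited early with code {exit_code}")
        if is_ready():
            return
        time.sleep(poll_secs)
    raise TimeoutError(f"Service not ready after {timeout_secs} seconds")


def demo_fail_fast():
    # Child that exits immediately, standing in for a service that
    # raised on a failed port bind.
    proc = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
    try:
        wait_until_ready(proc, is_ready=lambda: False, timeout_secs=30.0)
    except RuntimeError as exc:
        return str(exc)
    finally:
        proc.wait()
    return "no error"
```

In `demo_fail_fast` the readiness check never succeeds, yet the call returns as soon as the child's exit code is observed instead of waiting out the 30-second deadline.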

Additionally, binding to '0.0.0.0' caused connection timeouts on dual-stack hosts when the client connected via 'localhost' (resolving to IPv6 loopback ::1).
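
One way to see what a client may try when it connects to 'localhost' is to list the addresses the name resolves to. The sketch below uses only the standard library; the helper name `loopback_candidates` is hypothetical and the port 55955 is taken from the log above. The result depends on the host's resolver configuration, so no particular output is assumed here. On a dual-stack host the list can include `::1`, which a server bound only to the IPv4 wildcard '0.0.0.0' does not listen on.

```python
import socket


def loopback_candidates(host="localhost", port=55955):
    """Return the (family, address) pairs a client may try for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [(family.name, sockaddr[0]) for family, _, _, _, sockaddr in infos]


if __name__ == "__main__":
    for family, address in loopback_candidates():
        print(family, address)
```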

In this PR, we address the above problems by:

  • Raising a RuntimeError in expansion_service_main if add_insecure_port
    fails, so the subprocess exits and the parent can detect the crash and fail fast.
  • Adding an automatic retry loop (up to three attempts) in the subprocess server to recover from port binding failures.
  • Changing the bind address from '0.0.0.0' to 'localhost'.

@shunping
Collaborator Author

r: @derrickaw for YAML and python expansion service.

@shunping shunping marked this pull request as ready for review May 21, 2026 02:43
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a recurring test timeout issue caused by the expansion service failing to bind to a port. By introducing a retry mechanism, enforcing strict error checking during port binding, and switching to a more compatible local bind address, the changes ensure that the service either starts successfully or fails immediately, preventing parent process hangs.

Highlights

  • Improved Service Reliability: Implemented an automatic retry loop in the subprocess server to handle transient startup failures and port binding issues.
  • Fail-Fast Mechanism: Added explicit error handling in the expansion service to raise a RuntimeError if port binding fails, preventing the parent process from hanging indefinitely.
  • Network Compatibility: Updated the bind address from '0.0.0.0' to 'localhost' to resolve connection timeouts on dual-stack systems.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment "assign set of reviewers".

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request improves the reliability of the expansion service by switching the binding address to localhost for better dual-stack compatibility and adding explicit error handling for gRPC port binding failures. Additionally, it introduces a retry mechanism with up to three attempts for starting subprocess services. Feedback suggests adding a delay between these retry attempts to more effectively handle transient issues such as port collisions.
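
The suggested delay between attempts could look like the sketch below. It is a generic illustration, not the loop in `subprocess_server.py`: the function name `start_with_retries`, the default delay, and the linear backoff are all assumptions. The `sleep` parameter is injectable so the behavior can be checked without actually waiting.

```python
import time


def start_with_retries(start_once, attempts=3, delay_secs=1.0, sleep=time.sleep):
    """Call start_once() up to `attempts` times, pausing between failures."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return start_once()
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                # Linear backoff: wait longer after each failed attempt.
                sleep(delay_secs * attempt)
    raise RuntimeError(
        f"Service failed to start after {attempts} attempts") from last_error
```

A pause gives a port that is briefly held by another process time to be released before the next attempt.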

Comment thread sdks/python/apache_beam/utils/subprocess_server.py
@shunping shunping merged commit 930b94c into apache:master May 21, 2026
99 of 103 checks passed
shunping added a commit that referenced this pull request May 24, 2026
* Revert "Fix test hang in subprocess expansion service on port bind failure (#38572)"

This reverts commit 930b94c.

* Ensure immediate cleanup of subprocess server on start failure

When a SubprocessServer fails to start (e.g., due to a process exit or
startup error), the server process could leak if standard purging
is blocked by other active owners sharing the cached subprocess.

To fix this:
- Implement `_SharedCache.force_remove()` to immediately remove a key
  from the cache and run its destructor regardless of active owners.
- Add `SubprocessServer.stop_force()` which calls `force_remove()` to
  completely terminate the server's process.
- Call `stop_force()` in the `except` block of `SubprocessServer.start()`

* Support modern manylinux tags based on pip version in Stager

This ensures we can download pre-built wheels for environment staging
rather than relying on tarball building, which is sometimes slow.

* Formatting

* Trigger more python tests.

* Typo
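
The forced cleanup described in the follow-up commit above can be illustrated with a small owner-counted cache. This is a hypothetical sketch, not Beam's `_SharedCache`: the class name, method names other than `force_remove`, and the data layout are assumptions, and a real implementation would also need locking for concurrent owners.

```python
class SharedCache:
    """Hypothetical owner-counted cache; not Beam's `_SharedCache`."""

    def __init__(self, constructor, destructor):
        self._constructor = constructor
        self._destructor = destructor
        self._entries = {}  # key -> [value, owner_count]

    def get(self, key):
        """Return the cached value for key, registering one more owner."""
        if key not in self._entries:
            self._entries[key] = [self._constructor(key), 0]
        entry = self._entries[key]
        entry[1] += 1
        return entry[0]

    def release(self, key):
        """Drop one owner; destroy the value only when no owners remain."""
        entry = self._entries[key]
        entry[1] -= 1
        if entry[1] == 0:
            del self._entries[key]
            self._destructor(entry[0])

    def force_remove(self, key):
        """Destroy the value immediately, regardless of remaining owners."""
        entry = self._entries.pop(key, None)
        if entry is not None:
            self._destructor(entry[0])
```

With two owners, a single `release` leaves the value alive, while `force_remove` runs the destructor at once. That is the property the commit relies on to avoid leaking a server process after a failed start.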