Skip to content

Reduce python expansion service startup time#38611

Open
shunping wants to merge 6 commits into
apache:masterfrom
shunping:fix-expansion-service-timeout
Open

Reduce python expansion service startup time#38611
shunping wants to merge 6 commits into
apache:masterfrom
shunping:fix-expansion-service-timeout

Conversation

@shunping
Copy link
Copy Markdown
Collaborator

@shunping shunping commented May 23, 2026

This PR reverts the changes from #38572 and addresses the root cause of the expansion service timeouts by enabling pre-built wheel downloads and improving subprocess cleanup.

Further investigation into the expansion service test hangs revealed that the issue was not caused by network instability or port collisions. Instead, the service was timing out while downloading and building the numpy source tarball. This build process is heavily influenced by GitHub Runner load, leading to frequent timeouts during environment staging.

In this PR, we update Stager to support new manylinux tag, which allows the service to download pre-built wheels rather than compiling from source. This significanly accelerate the startup time of the expansion service. In local environments, these changes reduced the runtime of MLTest::test_ml_preprocessing_yaml from 110 seconds to 8 seconds!

Furthermore, we introduce force_remove() and stop_force() to ensure that if a SubprocessServer fails to start, the process is immediately terminated and removed from the cache, preventing leaks even when shared by multiple owners.

shunping added 3 commits May 22, 2026 19:24
When a SubprocessServer fails to start (e.g., due to a process exit or
startup error), the server process could leak if standard purging
is blocked by other active owners sharing the cached subprocess.

To fix this:
- Implement `_SharedCache.force_remove()` to immediately remove a key
  from the cache and run its destructor regardless of active owners.
- Add `SubprocessServer.stop_force()` which calls `force_remove()` to
  completely terminate the server's process.
- Call `stop_force()` in the `except` block of `SubprocessServer.start()`
This ensures we can download pre-built wheels for environment staging
rather than relying on tarball building, which is sometimes slow.
@shunping shunping changed the title Fix expansion service timeout Fix python expansion service startup timeout May 23, 2026
@shunping
Copy link
Copy Markdown
Collaborator Author

r: @derrickaw

@shunping shunping marked this pull request as ready for review May 24, 2026 01:13
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses expansion service startup timeouts by optimizing dependency installation and improving subprocess lifecycle management. By allowing the use of pre-built wheels and implementing robust cleanup mechanisms for failed processes, the changes ensure more reliable and faster service initialization in CI environments.

Highlights

  • Expansion Service Performance: Updated the Stager to support newer manylinux tags, enabling the download of pre-built wheels instead of compiling from source, which significantly reduces startup time.
  • Subprocess Reliability: Introduced force_remove() and stop_force() methods to ensure immediate termination and cleanup of SubprocessServer instances that fail to start, preventing resource leaks.
  • Network Configuration: Reverted the expansion service binding to 0.0.0.0 to ensure broader compatibility.
New Features

🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions
Copy link
Copy Markdown
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

@shunping
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the Python post-commit trigger version, changes the expansion service binding address to '0.0.0.0', and updates the manylinux platform detection logic for newer pip versions. Additionally, it refactors the SubprocessServer by adding a force_remove method to the shared cache and implementing a stop_force cleanup mechanism during startup failures. Review feedback highlights that removing the validation check for add_insecure_port may lead to silent failures and misleading logs. Furthermore, the removal of the retry logic in the start method could make the service less resilient to transient startup issues.

Comment thread sdks/python/apache_beam/utils/subprocess_server.py
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request modifies the expansion service to bind to 0.0.0.0 and updates the SubprocessServer with a force_remove method and a stop_force cleanup routine to prevent resource leaks. It also adjusts the manylinux platform detection logic for Python SDK containers. The reviewer recommends restoring the gRPC port binding check to ensure the service starts correctly and handles ephemeral ports properly. Additionally, the reviewer suggests replacing bare except blocks with Exception to avoid catching system interrupts and to improve error logging.

Comment thread sdks/python/apache_beam/utils/subprocess_server.py
Comment thread sdks/python/apache_beam/utils/subprocess_server.py
@shunping shunping changed the title Fix python expansion service startup timeout Reduce python expansion service startup timeout May 24, 2026
@shunping shunping changed the title Reduce python expansion service startup timeout Reduce python expansion service startup time May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants