Skip to content

Revert gcsio 3 upgrades#38583

Closed
Abacn wants to merge 4 commits into
apache:masterfrom
Abacn:revert-gcsio-3
Closed

Revert gcsio 3 upgrades#38583
Abacn wants to merge 4 commits into
apache:masterfrom
Abacn:revert-gcsio-3

Conversation

@Abacn
Copy link
Copy Markdown
Contributor

@Abacn Abacn commented May 21, 2026

Please add a meaningful description for your change here

It is found in DataflowTemplates release validation that staging Dataflow artifacts encountering a JUL bug JDK-8349206: GoogleCloudPlatform/DataflowTemplates#3833 (comment)

Likely due to gcsio 3 added added signidicantly more fluent log logging. There is a risk of deadlock in high concurrency scenarios

Considering this risk and we are in progress of migrating off bigdataoss/gcsio, revert this upgrade for now. As a result IcebergIO's gcs-hadoop-connector (as well as Hadoop version) upgrade then needs to be done after the bigdataoss/gcsio migration.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request reverts recent upgrades to gcsio (3.x) to mitigate deadlock risks and address stability concerns identified during release validation. The changes include downgrading the bigdataoss library version, updating build configurations to remove obsolete dependency pins, and adding defensive coding via reflection in GcsUtilV1 to handle connector construction safely. These adjustments stabilize the current environment while the project continues its migration away from bigdataoss/gcsio.

Highlights

  • Reversion of gcsio Upgrades: Reverted the upgrade of gcsio to version 3.x due to identified deadlock risks and compatibility issues with Java logging in high-concurrency scenarios.
  • Dependency Management Adjustments: Downgraded google_cloud_bigdataoss_version to 2.2.26 and updated related Hadoop connector dependencies to ensure stability while migration efforts are ongoing.
  • GcsUtilV1 Compatibility: Implemented reflection-based construction for GoogleCloudStorageImpl to maintain compatibility with older versions of the GCS connector while preparing for future upgrades.
  • Build Configuration Cleanup: Removed outdated Hadoop version pins and adjusted Iceberg runtime dependencies to improve build reliability and support Java 11+ environments.
New Features

🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request downgrades the google_cloud_bigdataoss_version to 2.2.26 and introduces reflection-based logic in GcsUtilV1 to maintain compatibility with both gcsio 2.x and 3.x. It also updates Iceberg dependencies to be conditional on Java 11+ and removes various Hadoop version pins. Review feedback identifies critical issues regarding the removal of IOException handling and declarations which are still required for gcsio 2.x compatibility. Additionally, suggestions were made to improve the robustness of the reflection logic by avoiding hardcoded internal class names, ensuring proper null handling for the httpRequestInitializer, and reverting the use of Hadoop utility classes in favor of Beam's vendored Guava dependencies.

Comment on lines +270 to +271
googleCloudStorage =
createGoogleCloudStorage(googleCloudStorageOptions, storageClient, credentials);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The createGoogleCloudStorage method still needs to handle or declare IOException because the GoogleCloudStorageImpl constructor in gcsio 2.x throws it. The try-catch block should be restored here to maintain proper error handling and ensure the code compiles.

    try {
      googleCloudStorage =
          createGoogleCloudStorage(googleCloudStorageOptions, storageClient, credentials);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }

Comment on lines 727 to +728
GoogleCloudStorage createGoogleCloudStorage(
GoogleCloudStorageOptions options, Storage storage, Credentials credentials)
throws IOException {
return GoogleCloudStorageImpl.builder()
.setOptions(options)
.setHttpTransport(storage.getRequestFactory().getTransport())
.setCredentials(credentials)
// gcsio 3 expects httpRequestInitializer to be either absent or
// com.google.cloud.hadoop.util.RetryHttpInitializer when credentials not provided
.setHttpRequestInitializer(credentials != null ? httpRequestInitializer : null)
.build();
GoogleCloudStorageOptions options, Storage storage, Credentials credentials) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The GoogleCloudStorageImpl constructor in gcsio 2.x throws a checked IOException. Removing throws IOException from this method signature will cause a compilation error when using the 2.x library. Please restore the exception declaration.

  GoogleCloudStorage createGoogleCloudStorage(
      GoogleCloudStorageOptions options, Storage storage, Credentials credentials)
      throws IOException {

Comment on lines +737 to +739
final Class<?> builderClass =
Class.forName(
"com.google.cloud.hadoop.gcsio.AutoBuilder_GoogleCloudStorageImpl_Builder");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Hardcoding the internal AutoBuilder class name is fragile and depends on implementation details of the gcsio library. It is safer to obtain the class directly from the builder instance returned by GoogleCloudStorageImpl.builder().

Suggested change
final Class<?> builderClass =
Class.forName(
"com.google.cloud.hadoop.gcsio.AutoBuilder_GoogleCloudStorageImpl_Builder");
final Class<?> builderClass = builder.getClass();

Comment on lines +757 to +760
final Method setHttpRequestInitializerMethod =
builderClass.getMethod("setHttpRequestInitializer", HttpRequestInitializer.class);
setHttpRequestInitializerMethod.setAccessible(true);
builder = setHttpRequestInitializerMethod.invoke(builder, httpRequestInitializer);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The reflection logic for gcsio 3.x should preserve the original behavior of passing null for the httpRequestInitializer when credentials are null. This was specifically required for gcsio 3.x as noted in the previous implementation's comments.

Suggested change
final Method setHttpRequestInitializerMethod =
builderClass.getMethod("setHttpRequestInitializer", HttpRequestInitializer.class);
setHttpRequestInitializerMethod.setAccessible(true);
builder = setHttpRequestInitializerMethod.invoke(builder, httpRequestInitializer);
final Method setHttpRequestInitializerMethod =
builderClass.getMethod("setHttpRequestInitializer", HttpRequestInitializer.class);
setHttpRequestInitializerMethod.setAccessible(true);
builder = setHttpRequestInitializerMethod.invoke(
builder, credentials != null ? httpRequestInitializer : null);

import static org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions.checkArgument;
import static org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions.checkNotNull;
import static org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.collect.Sets.newHashSet;
import static org.apache.hadoop.util.Sets.newHashSet;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Beam prefers using its vendored Guava dependencies (org.apache.beam.vendor.guava...) over external utility classes like org.apache.hadoop.util.Sets to ensure classpath stability and avoid version conflicts with user-provided Hadoop environments.

Suggested change
import static org.apache.hadoop.util.Sets.newHashSet;
import static org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.collect.Sets.newHashSet;

@Abacn Abacn marked this pull request as draft May 21, 2026 13:59
@Abacn Abacn closed this May 21, 2026
@Abacn Abacn deleted the revert-gcsio-3 branch May 21, 2026 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant