Revert gcsio 3 upgrades#38583
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request reverts recent upgrades to gcsio (3.x) to mitigate deadlock risks and address stability concerns identified during release validation. The changes include downgrading the bigdataoss library version, updating build configurations to remove obsolete dependency pins, and adding defensive coding via reflection in GcsUtilV1 to handle connector construction safely. These adjustments stabilize the current environment while the project continues its migration away from bigdataoss/gcsio. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request downgrades the google_cloud_bigdataoss_version to 2.2.26 and introduces reflection-based logic in GcsUtilV1 to maintain compatibility with both gcsio 2.x and 3.x. It also updates Iceberg dependencies to be conditional on Java 11+ and removes various Hadoop version pins. Review feedback identifies critical issues regarding the removal of IOException handling and declarations which are still required for gcsio 2.x compatibility. Additionally, suggestions were made to improve the robustness of the reflection logic by avoiding hardcoded internal class names, ensuring proper null handling for the httpRequestInitializer, and reverting the use of Hadoop utility classes in favor of Beam's vendored Guava dependencies.
| googleCloudStorage = | ||
| createGoogleCloudStorage(googleCloudStorageOptions, storageClient, credentials); |
There was a problem hiding this comment.
The createGoogleCloudStorage method still needs to handle or declare IOException because the GoogleCloudStorageImpl constructor in gcsio 2.x throws it. The try-catch block should be restored here to maintain proper error handling and ensure the code compiles.
try {
googleCloudStorage =
createGoogleCloudStorage(googleCloudStorageOptions, storageClient, credentials);
} catch (IOException e) {
throw new RuntimeException(e);
}| GoogleCloudStorage createGoogleCloudStorage( | ||
| GoogleCloudStorageOptions options, Storage storage, Credentials credentials) | ||
| throws IOException { | ||
| return GoogleCloudStorageImpl.builder() | ||
| .setOptions(options) | ||
| .setHttpTransport(storage.getRequestFactory().getTransport()) | ||
| .setCredentials(credentials) | ||
| // gcsio 3 expects httpRequestInitializer to be either absent or | ||
| // com.google.cloud.hadoop.util.RetryHttpInitializer when credentials not provided | ||
| .setHttpRequestInitializer(credentials != null ? httpRequestInitializer : null) | ||
| .build(); | ||
| GoogleCloudStorageOptions options, Storage storage, Credentials credentials) { |
There was a problem hiding this comment.
The GoogleCloudStorageImpl constructor in gcsio 2.x throws a checked IOException. Removing throws IOException from this method signature will cause a compilation error when using the 2.x library. Please restore the exception declaration.
GoogleCloudStorage createGoogleCloudStorage(
GoogleCloudStorageOptions options, Storage storage, Credentials credentials)
throws IOException {| final Class<?> builderClass = | ||
| Class.forName( | ||
| "com.google.cloud.hadoop.gcsio.AutoBuilder_GoogleCloudStorageImpl_Builder"); |
There was a problem hiding this comment.
Hardcoding the internal AutoBuilder class name is fragile and depends on implementation details of the gcsio library. It is safer to obtain the class directly from the builder instance returned by GoogleCloudStorageImpl.builder().
| final Class<?> builderClass = | |
| Class.forName( | |
| "com.google.cloud.hadoop.gcsio.AutoBuilder_GoogleCloudStorageImpl_Builder"); | |
| final Class<?> builderClass = builder.getClass(); |
| final Method setHttpRequestInitializerMethod = | ||
| builderClass.getMethod("setHttpRequestInitializer", HttpRequestInitializer.class); | ||
| setHttpRequestInitializerMethod.setAccessible(true); | ||
| builder = setHttpRequestInitializerMethod.invoke(builder, httpRequestInitializer); |
There was a problem hiding this comment.
The reflection logic for gcsio 3.x should preserve the original behavior of passing null for the httpRequestInitializer when credentials are null. This was specifically required for gcsio 3.x as noted in the previous implementation's comments.
| final Method setHttpRequestInitializerMethod = | |
| builderClass.getMethod("setHttpRequestInitializer", HttpRequestInitializer.class); | |
| setHttpRequestInitializerMethod.setAccessible(true); | |
| builder = setHttpRequestInitializerMethod.invoke(builder, httpRequestInitializer); | |
| final Method setHttpRequestInitializerMethod = | |
| builderClass.getMethod("setHttpRequestInitializer", HttpRequestInitializer.class); | |
| setHttpRequestInitializerMethod.setAccessible(true); | |
| builder = setHttpRequestInitializerMethod.invoke( | |
| builder, credentials != null ? httpRequestInitializer : null); |
| import static org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions.checkArgument; | ||
| import static org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions.checkNotNull; | ||
| import static org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.collect.Sets.newHashSet; | ||
| import static org.apache.hadoop.util.Sets.newHashSet; |
There was a problem hiding this comment.
Beam prefers using its vendored Guava dependencies (org.apache.beam.vendor.guava...) over external utility classes like org.apache.hadoop.util.Sets to ensure classpath stability and avoid version conflicts with user-provided Hadoop environments.
| import static org.apache.hadoop.util.Sets.newHashSet; | |
| import static org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.collect.Sets.newHashSet; |
Please add a meaningful description for your change here
It is found in DataflowTemplates release validation that staging Dataflow artifacts encountering a JUL bug JDK-8349206: GoogleCloudPlatform/DataflowTemplates#3833 (comment)
Likely due to gcsio 3 added added signidicantly more fluent log logging. There is a risk of deadlock in high concurrency scenarios
Considering this risk and we are in progress of migrating off bigdataoss/gcsio, revert this upgrade for now. As a result IcebergIO's gcs-hadoop-connector (as well as Hadoop version) upgrade then needs to be done after the bigdataoss/gcsio migration.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.