Skip to content

Conversation

@RocMarshal
Copy link
Contributor

What is the purpose of the change

  • [FLINK-36746][core] Fix the deadlock bug in SerializedThrowable

Brief change log

  • [FLINK-36746][core] Fix the deadlock bug in SerializedThrowable

Verifying this change

Add the related test cases in org.apache.flink.runtime.util.SerializedThrowableTest

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot
Copy link
Collaborator

flinkbot commented Nov 3, 2025

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

serializedThrowable = (SerializedThrowable) s;
} else {
serializedThrowable = new SerializedThrowable(s);
if (alreadySeen.add(s)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is draught - it would be worth checking if the throwable is already seen before doing any logic to consider adding it.

@RocMarshal RocMarshal marked this pull request as ready for review November 4, 2025 15:15
@github-actions github-actions bot added the community-reviewed PR has been reviewed by the community. label Nov 4, 2025
@RocMarshal
Copy link
Contributor Author

Hi, @Myasuka Could u help take a look if you had the free time ?
Thanks

Copy link
Member

@Myasuka Myasuka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please improve the test code to avoid possible endless running.

}

@Test
void testCyclicSuppressedThrowableConcurrentSerialized() throws InterruptedException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't this test is stable without timeout limit, if we do not fix the original code in flink-core/src/main/java/org/apache/flink/util/SerializedThrowable.java, this test would run endlessly without quit properly.

Copy link
Contributor Author

@RocMarshal RocMarshal Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image
image

Hi, @Myasuka Thanks for your reminding.

I increased the number of test runs for the current test case using RepeatedTest and found that deadlocks can indeed still occur.

We considered a solution: making all methods of SerializedThrowable private, thereby controlling the creation of SerializedThrowable objects through a single unified entry point. The goal was to try to elevate the locking hierarchy to the serialized exception object itself.

image

Unfortunately, the following situation can still occur:

When one thread calls the unified creation method, another thread may already be holding the object lock of another exception object referenced within the exception. In such a case, a deadlock can still potentially happen.

Looking forward to your thoughts on this~
Any input is aprreciated !

@RocMarshal RocMarshal requested a review from Myasuka November 18, 2025 02:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-reviewed PR has been reviewed by the community.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants