Connection timeout #11

MohamedRady-99 · 2025-11-06T10:20:10Z

This pull request introduces a connection_timeout to the agent configuration. During the startup phase, agents (primary, secondaries, and recorders) attempt to establish communication channels with each other. Previously, if an expected agent failed to connect, the process could hang indefinitely.

With this change:

A connection_timeout can be specified in the PrimaryConfig.
This timeout dictates the maximum duration the primary agent will wait for all secondary agents and recorders to connect.
If any agent fails to establish a connection within this period, the primary agent will time out, log an error, and exit, preventing the system from getting stuck in a non-operational state.
This improves the robustness and predictability of the system's startup sequence.

github-actions · 2025-11-06T10:21:07Z

License Check Results

🚀 The license check job ran with the Bazel command:

bazel run //:license-check

Status: ⚠️ Needs Review

Click to expand output

[License Check Output]
Starting local Bazel server (8.3.0) and connecting to it...
INFO: Invocation ID: 526235dc-0cb6-4408-bd65-fdc32352e33a
Computing main repo mapping: 
Computing main repo mapping: 
DEBUG: Rule 'rules_boost+' indicated that a canonical reproducible form can be obtained by modifying arguments integrity = "sha256-iKHWUIBrUN/fFOCltkTmHfDcKw0h4ZVmP7NVSoc8zBs="
DEBUG: Repository rules_boost+ instantiated at:
  <builtin>: in <toplevel>
Repository rule http_archive defined at:
  /home/runner/.bazel/external/bazel_tools/tools/build_defs/repo/http.bzl:394:31: in <toplevel>
Computing main repo mapping: 
Computing main repo mapping: 
Computing main repo mapping: 
Computing main repo mapping: 
Computing main repo mapping: 
Loading: 
Loading: 4 packages loaded
Loading: 4 packages loaded
    currently loading: 
Analyzing: target //:license-check (5 packages loaded, 0 targets configured)
Analyzing: target //:license-check (5 packages loaded, 0 targets configured)

Analyzing: target //:license-check (78 packages loaded, 9 targets configured)

Analyzing: target //:license-check (88 packages loaded, 9 targets configured)

Analyzing: target //:license-check (150 packages loaded, 3423 targets configured)

Analyzing: target //:license-check (156 packages loaded, 6864 targets configured)

Analyzing: target //:license-check (156 packages loaded, 6864 targets configured)

INFO: Analyzed target //:license-check (159 packages loaded, 8880 targets configured).
[11 / 13] [Prepa] Generating Dash formatted dependency file ... ... (2 actions, 0 running)
INFO: From Generating Dash formatted dependency file ...:
INFO: Successfully converted 2 packages from Cargo.lock to bazel-out/k8-fastbuild/bin/formatted.txt
INFO: Found 1 target...
Target //:license.check.license_check up-to-date:
  bazel-bin/license.check.license_check
  bazel-bin/license.check.license_check.jar
INFO: Elapsed time: 16.674s, Critical Path: 0.44s
INFO: 13 processes: 4 disk cache hit, 9 internal.
INFO: Build completed successfully, 13 total actions
INFO: Running command line: bazel-bin/license.check.license_check ./formatted.txt <args omitted>
usage: org.eclipse.dash.licenses.cli.Main [-batch <int>] [-cd <url>]
       [-confidence <int>] [-ef <url>] [-excludeSources <sources>] [-help] [-lic
       <url>] [-project <shortname>] [-repo <url>] [-review] [-summary <file>]
       [-timeout <seconds>] [-token <token>]

github-actions · 2025-11-06T10:22:08Z

The created documentation from the pull request is available at: docu-html

MohamedRady-99 · 2025-11-06T10:32:49Z

#8

the link to the issue that PR created for

armin-acn

Thank you for the nice improvement by adding a connection timeout!

However, please do not force signaling senders and receivers to implement the Send trait. I think you could simply leave the code unchanged in that respect.

armin-acn · 2025-11-11T09:56:37Z

feo/src/agent/relayed/primary.rs

+    where
+        <InterChannelTcp as IsChannel>::MultiReceiver: Send,
+        <InterChannelUnix as IsChannel>::MultiReceiver: Send,
+        <IntraChannel as IsChannel>::Sender: Send,


In our original implementation we avoided making senders and receivers "Send", because this is a limitation that not all communication implementations do actually fulfill. As an example, we found iceoryx2. For that reason, we used builders that are passed to the target thread, and the target thread uses them to build it senders and receivers directly.

You're right, thank you for the detailed clarification. I wasn't familiar with using the builder pattern to avoid Send bounds, and your explanation was very helpful.

I'll remove the unnecessary Send trait bounds, as the builder pattern already ensures thread safety by creating the channel components within the target thread. I'll ensure this pattern is followed consistently.

armin-acn · 2025-11-11T09:58:21Z

feo/src/signalling/relayed/connectors/relays.rs

+
        let thread = thread::spawn(move || {
-            Self::thread_main(inter_receiver_builder, intra_sender_builder, timeout)
+            Self::thread_main(inter_receiver, intra_sender, timeout)


As mentioned above, we used the call pattern with builders on purpose to avoid having to implement the Send trait for senders and receivers.

Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.

armin-acn · 2025-11-11T09:59:40Z

feo/src/signalling/relayed/connectors/relays.rs

-            .connect_receiver(timeout)
-            .expect("failed to connect intra-process sender");
-        trace!("PrimaryReceiveRelay connected");
        loop {


This is where the senders and receivers should be built directly in the target thread, so that they do not need to implement the Send trait.

Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.

armin-acn · 2025-11-11T10:03:32Z

feo/src/signalling/relayed/connectors/scheduler.rs

+    where
+        Inter::MultiReceiver: Send,
+        Intra::Sender: Send,
+    {


Same here, please do not enforce implementation of the Send trait.

Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.

armin-acn · 2025-11-11T10:04:03Z

feo/src/signalling/relayed/connectors/scheduler.rs

+where
+    Inter::MultiReceiver: Send,
+    Intra::Sender: Send,
+{


Same here, please do not enforce implementation of the Send trait.

Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.

armin-acn · 2025-11-11T10:06:04Z

feo/src/signalling/relayed/sockets/endpoint.rs

-                    _ => {
-                        return Err(Error::UnexpectedProtocolSignal);
+                    other => {
+                        warn!("Received unexpected signal {:?} from unknown token {:?} during connect", other, token);


Why isn't this an error anymore?

I've changed this from an Error to a warn! to make the connection process more robust.

The previous implementation would cause the entire agent to fail if it received any signal other than ChannelHello during startup. This could happen due to network noise or a race condition where a fast client sends a valid signal immediately after connecting.

The new behavior simply logs the unexpected signal for debugging purposes and continues waiting for the required ChannelHello messages. This makes the startup process resilient to transient issues and ensures it only fails for the correct reason: a timeout because not all peers connected in time.

Ok, I have no objection to the change, although I think in our current implementation of the startup-handshake, the error should never occur.

MohamedRady-99 · 2025-11-12T09:36:20Z

Hi @armin-acn, could you please review the PR again after I made the changes based on your last feedback?

armin-acn · 2025-11-12T09:51:21Z

Sure, I will, but need to find some time.

armin-acn

Sorry, still two minor comments. I hope that will be all.
Did you test all the available signaling variants, though?
I did not but assume you did ...

armin-acn · 2025-11-12T10:35:54Z

feo/src/signalling/direct/scheduler.rs

        let mut missing_activities: HashSet<ActivityId> =
            self.all_activities.iter().cloned().collect();
        let mut missing_recorders: HashSet<AgentId> = self.all_recorders.iter().cloned().collect();
+        let start_time = std::time::Instant::now();


Please add a use-statement for Instant from feo_time and just use Instant::now() here. (As you already did in relayed/connectors/relays.rs.)

Thanks! Updated this to match the implementation in relayed/connectors/relays.rs using Instant::now().

armin-acn · 2025-11-12T10:36:47Z

feo/src/signalling/relayed/mod.rs


 mod connectors;
-mod interface;
+pub(crate) mod interface;


Why does this have to be public now?

Sorry about that, I forgot to revert it.

armin-acn

Looks good now.

armin-acn

Please rebase on top of main before merging, just to be sure not to break our changes wrt the bazel build.
I am surprised the CI tests passed, so I am afraid a simple merge could mess things up ...

MohamedRady-99 · 2025-11-12T19:14:14Z

Hello @armin-acn, the branch has been successfully rebased on top of main. The PR is now ready for merge.

armin-acn

Thank you.

armin-acn · 2025-11-13T09:15:29Z

@MohamedRady-99, are you going to merge?

armin-acn · 2025-11-13T09:32:08Z

Hi @MohamedRady-99, would it be a problem for you, if the very first Release of FEO v0.5 (to be prepared until tomorrow the latest) would not include this PR? We can probably create a minor release update very soon, which could then include your changes.

MohamedRady-99 · 2025-11-13T09:45:43Z

Hi @armin-acn, No problem at all, that’s fine. It can be included in the next minor release.

armin-acn · 2025-11-13T10:36:28Z

Anyway, you can go ahead and merge your PR, if you want.

MohamedRady-99 · 2025-11-13T12:28:07Z

Hi @armin-acn, It seems I don’t have permission to merge, so I’d appreciate your support with that.

MohamedRady-99 added 2 commits November 3, 2025 15:14

Handling Connection timeout in direct mode

fd411d7

Handling Connection timeout in relayed mode

0bccf71

MohamedRady-99 mentioned this pull request Nov 6, 2025

Connection timeout Implementation proposal #8

Closed

armin-acn requested changes Nov 11, 2025

View reviewed changes

remove the implementaion of Send trait

14f95ec

armin-acn reviewed Nov 12, 2025

View reviewed changes

refactor: use Instant::now() and make interface module pub(crate)

9dc193f

armin-acn approved these changes Nov 12, 2025

View reviewed changes

armin-acn requested changes Nov 12, 2025

View reviewed changes

Merge branch 'main' into Connection_timeout

bf8d314

armin-acn approved these changes Nov 12, 2025

View reviewed changes

armin-acn merged commit a1f4373 into eclipse-score:main Nov 13, 2025
8 checks passed

Connection timeout #11

Connection timeout #11

Uh oh!

Conversation

MohamedRady-99 commented Nov 6, 2025

Uh oh!

github-actions bot commented Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

License Check Results

Uh oh!

github-actions bot commented Nov 6, 2025

Uh oh!

MohamedRady-99 commented Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

armin-acn left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MohamedRady-99 commented Nov 12, 2025

Uh oh!

armin-acn commented Nov 12, 2025

Uh oh!

armin-acn left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

armin-acn left a comment

Choose a reason for hiding this comment

Uh oh!

armin-acn left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MohamedRady-99 commented Nov 12, 2025

Uh oh!

armin-acn left a comment

Choose a reason for hiding this comment

Uh oh!

armin-acn commented Nov 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

armin-acn commented Nov 13, 2025

Uh oh!

MohamedRady-99 commented Nov 13, 2025

Uh oh!

armin-acn commented Nov 13, 2025 via email • edited Loading Uh oh! There was an error while loading. Please reload this page.

github-actions bot commented Nov 6, 2025 •

edited

Loading

MohamedRady-99 commented Nov 6, 2025 •

edited

Loading

armin-acn left a comment •

edited

Loading

armin-acn left a comment •

edited

Loading

armin-acn commented Nov 13, 2025 •

edited

Loading

armin-acn commented Nov 13, 2025 via email •

edited

Loading