Skip to content

Conversation

@MohamedRady-99
Copy link
Contributor

This pull request introduces a connection_timeout to the agent configuration. During the startup phase, agents (primary, secondaries, and recorders) attempt to establish communication channels with each other. Previously, if an expected agent failed to connect, the process could hang indefinitely.

With this change:

A connection_timeout can be specified in the PrimaryConfig.
This timeout dictates the maximum duration the primary agent will wait for all secondary agents and recorders to connect.
If any agent fails to establish a connection within this period, the primary agent will time out, log an error, and exit, preventing the system from getting stuck in a non-operational state.
This improves the robustness and predictability of the system's startup sequence.

@github-actions
Copy link

github-actions bot commented Nov 6, 2025

License Check Results

🚀 The license check job ran with the Bazel command:

bazel run //:license-check

Status: ⚠️ Needs Review

Click to expand output
[License Check Output]
Starting local Bazel server (8.3.0) and connecting to it...
INFO: Invocation ID: 526235dc-0cb6-4408-bd65-fdc32352e33a
Computing main repo mapping: 
Computing main repo mapping: 
DEBUG: Rule 'rules_boost+' indicated that a canonical reproducible form can be obtained by modifying arguments integrity = "sha256-iKHWUIBrUN/fFOCltkTmHfDcKw0h4ZVmP7NVSoc8zBs="
DEBUG: Repository rules_boost+ instantiated at:
  <builtin>: in <toplevel>
Repository rule http_archive defined at:
  /home/runner/.bazel/external/bazel_tools/tools/build_defs/repo/http.bzl:394:31: in <toplevel>
Computing main repo mapping: 
Computing main repo mapping: 
Computing main repo mapping: 
Computing main repo mapping: 
Computing main repo mapping: 
Loading: 
Loading: 4 packages loaded
Loading: 4 packages loaded
    currently loading: 
Analyzing: target //:license-check (5 packages loaded, 0 targets configured)
Analyzing: target //:license-check (5 packages loaded, 0 targets configured)

Analyzing: target //:license-check (78 packages loaded, 9 targets configured)

Analyzing: target //:license-check (88 packages loaded, 9 targets configured)

Analyzing: target //:license-check (150 packages loaded, 3423 targets configured)

Analyzing: target //:license-check (156 packages loaded, 6864 targets configured)

Analyzing: target //:license-check (156 packages loaded, 6864 targets configured)

INFO: Analyzed target //:license-check (159 packages loaded, 8880 targets configured).
[11 / 13] [Prepa] Generating Dash formatted dependency file ... ... (2 actions, 0 running)
INFO: From Generating Dash formatted dependency file ...:
INFO: Successfully converted 2 packages from Cargo.lock to bazel-out/k8-fastbuild/bin/formatted.txt
INFO: Found 1 target...
Target //:license.check.license_check up-to-date:
  bazel-bin/license.check.license_check
  bazel-bin/license.check.license_check.jar
INFO: Elapsed time: 16.674s, Critical Path: 0.44s
INFO: 13 processes: 4 disk cache hit, 9 internal.
INFO: Build completed successfully, 13 total actions
INFO: Running command line: bazel-bin/license.check.license_check ./formatted.txt <args omitted>
usage: org.eclipse.dash.licenses.cli.Main [-batch <int>] [-cd <url>]
       [-confidence <int>] [-ef <url>] [-excludeSources <sources>] [-help] [-lic
       <url>] [-project <shortname>] [-repo <url>] [-review] [-summary <file>]
       [-timeout <seconds>] [-token <token>]

@github-actions
Copy link

github-actions bot commented Nov 6, 2025

The created documentation from the pull request is available at: docu-html

@MohamedRady-99
Copy link
Contributor Author

MohamedRady-99 commented Nov 6, 2025

#8

the link to the issue that PR created for

Copy link
Contributor

@armin-acn armin-acn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the nice improvement by adding a connection timeout!

However, please do not force signaling senders and receivers to implement the Send trait. I think you could simply leave the code unchanged in that respect.

where
<InterChannelTcp as IsChannel>::MultiReceiver: Send,
<InterChannelUnix as IsChannel>::MultiReceiver: Send,
<IntraChannel as IsChannel>::Sender: Send,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our original implementation we avoided making senders and receivers "Send", because this is a limitation that not all communication implementations do actually fulfill. As an example, we found iceoryx2. For that reason, we used builders that are passed to the target thread, and the target thread uses them to build it senders and receivers directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, thank you for the detailed clarification. I wasn't familiar with using the builder pattern to avoid Send bounds, and your explanation was very helpful.

I'll remove the unnecessary Send trait bounds, as the builder pattern already ensures thread safety by creating the channel components within the target thread. I'll ensure this pattern is followed consistently.


let thread = thread::spawn(move || {
Self::thread_main(inter_receiver_builder, intra_sender_builder, timeout)
Self::thread_main(inter_receiver, intra_sender, timeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, we used the call pattern with builders on purpose to avoid having to implement the Send trait for senders and receivers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.

.connect_receiver(timeout)
.expect("failed to connect intra-process sender");
trace!("PrimaryReceiveRelay connected");
loop {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where the senders and receivers should be built directly in the target thread, so that they do not need to implement the Send trait.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.

where
Inter::MultiReceiver: Send,
Intra::Sender: Send,
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, please do not enforce implementation of the Send trait.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.

where
Inter::MultiReceiver: Send,
Intra::Sender: Send,
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, please do not enforce implementation of the Send trait.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.

_ => {
return Err(Error::UnexpectedProtocolSignal);
other => {
warn!("Received unexpected signal {:?} from unknown token {:?} during connect", other, token);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this an error anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this from an Error to a warn! to make the connection process more robust.

The previous implementation would cause the entire agent to fail if it received any signal other than ChannelHello during startup. This could happen due to network noise or a race condition where a fast client sends a valid signal immediately after connecting.

The new behavior simply logs the unexpected signal for debugging purposes and continues waiting for the required ChannelHello messages. This makes the startup process resilient to transient issues and ensures it only fails for the correct reason: a timeout because not all peers connected in time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I have no objection to the change, although I think in our current implementation of the startup-handshake, the error should never occur.

@MohamedRady-99
Copy link
Contributor Author

Hi @armin-acn, could you please review the PR again after I made the changes based on your last feedback?

@armin-acn
Copy link
Contributor

Sure, I will, but need to find some time.

Copy link
Contributor

@armin-acn armin-acn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, still two minor comments. I hope that will be all.
Did you test all the available signaling variants, though?
I did not but assume you did ...

let mut missing_activities: HashSet<ActivityId> =
self.all_activities.iter().cloned().collect();
let mut missing_recorders: HashSet<AgentId> = self.all_recorders.iter().cloned().collect();
let start_time = std::time::Instant::now();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a use-statement for Instant from feo_time and just use Instant::now() here. (As you already did in relayed/connectors/relays.rs.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Updated this to match the implementation in relayed/connectors/relays.rs using Instant::now().

mod connectors;
mod interface;
pub(crate) mod interface;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this have to be public now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that, I forgot to revert it.

Copy link
Contributor

@armin-acn armin-acn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now.

Copy link
Contributor

@armin-acn armin-acn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase on top of main before merging, just to be sure not to break our changes wrt the bazel build.
I am surprised the CI tests passed, so I am afraid a simple merge could mess things up ...

@MohamedRady-99
Copy link
Contributor Author

Hello @armin-acn, the branch has been successfully rebased on top of main. The PR is now ready for merge.

Copy link
Contributor

@armin-acn armin-acn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

@armin-acn
Copy link
Contributor

armin-acn commented Nov 13, 2025

@MohamedRady-99, are you going to merge?

@armin-acn
Copy link
Contributor

Hi @MohamedRady-99, would it be a problem for you, if the very first Release of FEO v0.5 (to be prepared until tomorrow the latest) would not include this PR? We can probably create a minor release update very soon, which could then include your changes.

@MohamedRady-99
Copy link
Contributor Author

Hi @armin-acn, No problem at all, that’s fine. It can be included in the next minor release.

@armin-acn
Copy link
Contributor

armin-acn commented Nov 13, 2025 via email

@MohamedRady-99
Copy link
Contributor Author

Hi @armin-acn, It seems I don’t have permission to merge, so I’d appreciate your support with that.

@armin-acn armin-acn merged commit a1f4373 into eclipse-score:main Nov 13, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants