-
Notifications
You must be signed in to change notification settings - Fork 9
Connection timeout #11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Connection timeout #11
Conversation
License Check Results🚀 The license check job ran with the Bazel command: bazel run //:license-checkStatus: Click to expand output |
|
The created documentation from the pull request is available at: docu-html |
|
the link to the issue that PR created for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the nice improvement by adding a connection timeout!
However, please do not force signaling senders and receivers to implement the Send trait. I think you could simply leave the code unchanged in that respect.
feo/src/agent/relayed/primary.rs
Outdated
| where | ||
| <InterChannelTcp as IsChannel>::MultiReceiver: Send, | ||
| <InterChannelUnix as IsChannel>::MultiReceiver: Send, | ||
| <IntraChannel as IsChannel>::Sender: Send, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our original implementation we avoided making senders and receivers "Send", because this is a limitation that not all communication implementations do actually fulfill. As an example, we found iceoryx2. For that reason, we used builders that are passed to the target thread, and the target thread uses them to build it senders and receivers directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, thank you for the detailed clarification. I wasn't familiar with using the builder pattern to avoid Send bounds, and your explanation was very helpful.
I'll remove the unnecessary Send trait bounds, as the builder pattern already ensures thread safety by creating the channel components within the target thread. I'll ensure this pattern is followed consistently.
|
|
||
| let thread = thread::spawn(move || { | ||
| Self::thread_main(inter_receiver_builder, intra_sender_builder, timeout) | ||
| Self::thread_main(inter_receiver, intra_sender, timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned above, we used the call pattern with builders on purpose to avoid having to implement the Send trait for senders and receivers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.
| .connect_receiver(timeout) | ||
| .expect("failed to connect intra-process sender"); | ||
| trace!("PrimaryReceiveRelay connected"); | ||
| loop { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where the senders and receivers should be built directly in the target thread, so that they do not need to implement the Send trait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.
| where | ||
| Inter::MultiReceiver: Send, | ||
| Intra::Sender: Send, | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, please do not enforce implementation of the Send trait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.
| where | ||
| Inter::MultiReceiver: Send, | ||
| Intra::Sender: Send, | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, please do not enforce implementation of the Send trait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the excellent feedback, this confirms my understanding from our previous conversation. I will refactor the code to remove the unnecessary Send trait bounds.
| _ => { | ||
| return Err(Error::UnexpectedProtocolSignal); | ||
| other => { | ||
| warn!("Received unexpected signal {:?} from unknown token {:?} during connect", other, token); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't this an error anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed this from an Error to a warn! to make the connection process more robust.
The previous implementation would cause the entire agent to fail if it received any signal other than ChannelHello during startup. This could happen due to network noise or a race condition where a fast client sends a valid signal immediately after connecting.
The new behavior simply logs the unexpected signal for debugging purposes and continues waiting for the required ChannelHello messages. This makes the startup process resilient to transient issues and ensures it only fails for the correct reason: a timeout because not all peers connected in time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I have no objection to the change, although I think in our current implementation of the startup-handshake, the error should never occur.
|
Hi @armin-acn, could you please review the PR again after I made the changes based on your last feedback? |
|
Sure, I will, but need to find some time. |
armin-acn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, still two minor comments. I hope that will be all.
Did you test all the available signaling variants, though?
I did not but assume you did ...
| let mut missing_activities: HashSet<ActivityId> = | ||
| self.all_activities.iter().cloned().collect(); | ||
| let mut missing_recorders: HashSet<AgentId> = self.all_recorders.iter().cloned().collect(); | ||
| let start_time = std::time::Instant::now(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a use-statement for Instant from feo_time and just use Instant::now() here. (As you already did in relayed/connectors/relays.rs.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Updated this to match the implementation in relayed/connectors/relays.rs using Instant::now().
feo/src/signalling/relayed/mod.rs
Outdated
| mod connectors; | ||
| mod interface; | ||
| pub(crate) mod interface; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this have to be public now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that, I forgot to revert it.
armin-acn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rebase on top of main before merging, just to be sure not to break our changes wrt the bazel build.
I am surprised the CI tests passed, so I am afraid a simple merge could mess things up ...
|
Hello @armin-acn, the branch has been successfully rebased on top of main. The PR is now ready for merge. |
armin-acn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you.
|
@MohamedRady-99, are you going to merge? |
|
Hi @MohamedRady-99, would it be a problem for you, if the very first Release of FEO v0.5 (to be prepared until tomorrow the latest) would not include this PR? We can probably create a minor release update very soon, which could then include your changes. |
|
Hi @armin-acn, No problem at all, that’s fine. It can be included in the next minor release. |
|
Anyway, you can go ahead and merge your PR, if you want.
|
|
Hi @armin-acn, It seems I don’t have permission to merge, so I’d appreciate your support with that. |
This pull request introduces a connection_timeout to the agent configuration. During the startup phase, agents (primary, secondaries, and recorders) attempt to establish communication channels with each other. Previously, if an expected agent failed to connect, the process could hang indefinitely.
With this change:
A connection_timeout can be specified in the PrimaryConfig.
This timeout dictates the maximum duration the primary agent will wait for all secondary agents and recorders to connect.
If any agent fails to establish a connection within this period, the primary agent will time out, log an error, and exit, preventing the system from getting stuck in a non-operational state.
This improves the robustness and predictability of the system's startup sequence.