@Lynx005F Lynx005F commented Jul 8, 2024

This changes how the true_done flag is calculated:

  • If the internal is_working register experiences a single-event upset (SEU), true_done might currently never be set, and the FSM can then stall the accelerator. At the same time, true_done cannot simply be taken from the done input, since that input might be set on reset.
    To solve this, true_done is now asserted on the rising edge of the done input.
    (A fault-tolerant accelerator should continuously assert done and then has the guarantee that this will eventually be forwarded.)
  • The pulsed true_done output itself might also experience an SEU in exactly the cycle where it is asserted, which destroys the done signal. To mitigate this, the above mechanism is extended to assert the output for at least two cycles.

This does not add any protection in the other direction, i.e. against an SEU causing an abort when the accelerator is in fact doing fine.
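
For readers without the diff at hand, the two mechanisms described above (rising-edge detection plus a two-cycle minimum output) could look roughly like the sketch below. The module name `done_pulse_stretch` and all port names are hypothetical, and the actual change in this PR may be structured differently.

```systemverilog
// Hypothetical sketch: module and port names are illustrative, not taken from the PR.
module done_pulse_stretch (
  input  logic clk_i,
  input  logic rst_ni,
  input  logic done_i,       // level from the accelerator, may already be high at reset
  output logic true_done_o   // high for at least two cycles per rising edge of done_i
);
  logic done_q;  // done_i delayed by one cycle
  logic rise;    // one-cycle pulse on the rising edge of done_i
  logic rise_q;  // rise delayed by one cycle

  assign rise        = done_i & ~done_q;
  assign true_done_o = rise | rise_q;  // stretch the pulse to two cycles

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      done_q <= 1'b1;  // reset high: a done_i that is set out of reset is not seen as an edge
      rise_q <= 1'b0;
    end else begin
      done_q <= done_i;
      rise_q <= rise;
    end
  end
endmodule
```

In this sketch the output no longer depends on any long-lived state such as is_working. An upset in `done_q` while `done_i` is held high only causes a re-assertion, which matches the "continuously assert done" assumption above. An upset in `done_q` or `rise_q` while the accelerator is still busy could, however, produce a spurious pulse, which is the unprotected direction mentioned above.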

@Lynx005F force-pushed the itemm/fault_tolerant_fsm branch from ef1663b to 08fffa0 on July 8, 2024 08:20

@Smephite Smephite left a comment

The proposed solution of asserting the done signal for two cycles creates an issue with the context counters.
A two-cycle pulse will increment/decrement the counters twice, which can lead to an over-/underflow.
This in turn will lock up the accelerator if it is not reset between calculations.

I suggest adding positive-edge detection before incrementing/decrementing the counters.

I am uncertain about the effect of the edge detection on reliability, as it once again creates an SEE-susceptible signal. An alternative approach could be to double up the counter / decrease it with every done signal only once.
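
A minimal sketch of the suggested positive-edge detection in front of a counter is shown below. The context counters themselves are not shown in this thread, so the module `ctx_counter_edge`, its ports, and the saturating behavior are assumptions made purely for illustration.

```systemverilog
// Hypothetical sketch: the real context counters are not shown in this thread.
module ctx_counter_edge #(
  parameter int unsigned Width = 4
) (
  input  logic             clk_i,
  input  logic             rst_ni,
  input  logic             start_i,      // assumed one-cycle pulse: a context was queued
  input  logic             true_done_i,  // may be high for two or more cycles
  output logic [Width-1:0] count_o
);
  logic done_q;     // true_done_i delayed by one cycle
  logic done_rise;  // high only in the first cycle of a true_done_i pulse

  assign done_rise = true_done_i & ~done_q;

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      done_q  <= 1'b0;
      count_o <= '0;
    end else begin
      done_q <= true_done_i;
      unique case ({start_i, done_rise})
        2'b10:   if (count_o != '1) count_o <= count_o + 1'b1;  // saturate at max
        2'b01:   if (count_o != '0) count_o <= count_o - 1'b1;  // saturate at zero
        default: ;  // both or neither: count unchanged
      endcase
    end
  end
endmodule
```

With this arrangement a two-cycle true_done pulse decrements the counter once. Two limitations of the sketch are worth noting: two completions whose stretched pulses merge into one continuous high period would be counted only once, and `done_q` is itself a register that an upset can hit, which is the reliability concern raised above.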

@Lynx005F (Author)

I haven't looked at this code for a long time, so I don't remember the exact details, but I think there are two different levels of vulnerability here:

  • The more prevalent and more important case is that the is_working register eventually receives an SEU, leaving the state incoherent and causing a stall. It is more prevalent because this SEU can happen at any time during the whole execution.
  • The output getting masked by an SEU, however, can only happen in one very specific cycle.

I would propose fixing the first, more common vulnerability and keeping the smaller one for now, i.e. just having a basic edge detector like this:

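logic done_q;
...
// true_done is high only in the first cycle in which ctrl_i.done is seen high
regfile_flags.true_done     = ctrl_i.done & ~done_q;
...
// FF to make flags_o.done a pulse without being SEU vulnerable or depending on internal state
always_ff @(posedge clk_i or negedge rst_ni)
begin
  if(!rst_ni) begin
    // Reset high so that a done input already set at reset is not seen as a rising edge
    done_q <= 1'b1;
  end
  else begin
    done_q <= ctrl_i.done;
  end
end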

Unfortunately I am no longer at IIS, so I can't easily simulate this to check whether it works as intended.

If you can move this edge detector closer to your counters, that might be a way to reduce SEU vulnerability a bit more as well.
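
Since the snippet above was posted unsimulated, a small self-checking testbench along the following lines could be used to check the two properties it relies on: no pulse when done is already high at reset, and exactly one pulse per rising edge. This is a standalone sketch that re-implements the detector with local signal names (`done`, `done_q`, `true_done`); it is not wired to the actual FSM and makes no claim about the surrounding logic.

```systemverilog
// Hypothetical standalone testbench: local names only, not connected to the real FSM.
module tb_done_edge;
  logic clk   = 1'b0;
  logic rst_n = 1'b0;
  logic done  = 1'b1;       // done is already high while reset is asserted
  logic done_q;
  logic true_done;
  int unsigned pulses = 0;  // number of cycles in which true_done was high

  assign true_done = done & ~done_q;

  always #5 clk = ~clk;

  // Same edge-detector register as in the comment above
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) done_q <= 1'b1;
    else        done_q <= done;
  end

  always @(posedge clk) if (rst_n && true_done) pulses++;

  initial begin
    repeat (2) @(negedge clk);
    rst_n = 1'b1;                 // release reset with done still high
    repeat (2) @(negedge clk);
    assert (pulses == 0) else $error("unexpected pulse for done held high through reset");
    done = 1'b0;
    repeat (2) @(negedge clk);
    done = 1'b1;                  // genuine rising edge
    repeat (3) @(negedge clk);
    assert (pulses == 1) else $error("expected exactly one pulse, got %0d", pulses);
    $finish;
  end
endmodule
```

The testbench covers only fault-free behavior; checking the SEU cases discussed in this thread would additionally require forcing `done_q` to the wrong value at chosen cycles.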

2 participants