Skip to content

Conversation

@nicu-da
Copy link
Contributor

@nicu-da nicu-da commented Nov 13, 2025

This ensures the process is resilient to a restart after we unhosted the parties from the old node

fixes #2694

Tried to add a unit test but mocking the participant admin connection is a full on terror
testing in the integration tests that we fail at the right time it's also a pain

Pull Request Checklist

Cluster Testing

  • If a cluster test is required, comment /cluster_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.
  • If a hard-migration test is required (from the latest release), comment /hdm_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.

PR Guidelines

  • Include any change that might be observable by our partners or affect their deployment in the release notes.
  • Specify fixed issues with Fixes #n, and mention issues worked on using #n
  • Include a screenshot for frontend-related PRs - see README or use your favorite screenshot tool

Merge Guidelines

  • Make the git commit message look sensible when squash-merging on GitHub (most likely: just copy your PR description).

Signed-off-by: Nicu Reut <[email protected]>

[ci] Store parties being migrated during key recovery

This ensures the process is resilient to a restart after we unhosted the parties from the old node

Signed-off-by: Nicu Reut <[email protected]>
Signed-off-by: Nicu Reut <[email protected]>
Signed-off-by: Nicu Reut <[email protected]>
@nicu-da
Copy link
Contributor Author

nicu-da commented Nov 13, 2025

@pasindutennage-da also requests a review from you as I touched on some of the code you added, mostly tried to keep it as is just try to integrate it with the validator app

@nicu-da nicu-da changed the title Nicuda/fix/resilience key recovery recovery from key - store the parties being migrated in the local db for recovery in case of failure Nov 13, 2025
Signed-off-by: Nicu Reut <[email protected]>
Signed-off-by: Nicu Reut <[email protected]>
Signed-off-by: Nicu Reut <[email protected]>
@martinflorian-da
Copy link
Contributor

martinflorian-da commented Nov 14, 2025

testing in the integration tests that we fail at the right time it's also a pain

Hmm thinking about options...

  • Could we disable the ACS endpoint on scan (via toxiproxy)? Will that cause it to crash at just the right time?
  • Could we startSync once, then stop, then mess around with the participant and/or topology state in a way that simulates a previous crash? You clean up diligently so sadly this won't work.

I'm on the fence of whether it is worth it to add something like this. I'd definitely feel better to have tested this somehow at least once... So I think I'd also be happy with some form of manual test? (For example on a cluster: deploy, rig things so the validator crashes on the right place after the migration, confirm it resumes fine once we unrig things again?)

)(implicit ec: ExecutionContext): OptionT[Future, A] = {
OptionT(futureUnlessShutdownToFuture(f.value))
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very important to have newlines! 😛

import org.apache.pekko.stream.scaladsl.{Sink, Source}
import org.apache.pekko.stream.testkit.StreamSpec
import org.lfdecentralizedtrust.splice.admin.api.client.commands.HttpCommandException
import org.lfdecentralizedtrust.splice.util.{SpliceRateLimitMetrics, SpliceRateLimiter}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just going to pretend I understand how this passed static checks previously...

for {
_ <- importAcs(partiesToMigrateFinal, getAcsSnapshot)
_ <- importAcs(partiesToMigrateFinal.parties, getAcsSnapshot)
_ <- configProvider.clearPartiesToMigrate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, makes sense... so I guess one of my testing ideas doesn't work then...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we can't test in integration tests because we sys.exit if the app fails to start. I will do a manual test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we sys.exit if the app fails to start

Ugh right, we don't have info to "assert-fails-to-start". Meh.

I will do a manual test

Thanks! The code looks perfect to me but it seems like the right thing to do.

}

class ValidatorConfigProvider(config: ValidatorInternalStore) {
class ValidatorConfigProvider(config: ValidatorInternalStore, val loggerFactory: NamedLoggerFactory)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class ValidatorConfigProvider(config: ValidatorInternalStore, val loggerFactory: NamedLoggerFactory)
class ValidatorDynamicConfigProvider(config: ValidatorInternalStore, val loggerFactory: NamedLoggerFactory)

Quite off-topic to this PR, more like a general "let's consider refactoring this" comment; to a casual reader it is hard to distinguish otherwise that this is not related to the (static) config from the config files.

(cc @pasindutennage-da )

Copy link
Contributor

@martinflorian-da martinflorian-da left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, thanks! I like how ValidatorInternalStore is helping us do smart things.

Signed-off-by: Nicu Reut <[email protected]>
Signed-off-by: Nicu Reut <[email protected]>
Signed-off-by: Nicu Reut <[email protected]>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martinflorian-da can you take another look? Added a test for the new functionality. In theory there's a slight chance that it will be flaky but I am confident enough that the startup is not fast enough for it to flake.

Copy link
Contributor

@martinflorian-da martinflorian-da Nov 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ufff IDK, it does sound quite a bit shaky flaky to me... Things change, we'll forget what we did here, then we'll reorder app init order or change something else... and then go debug this.

I guess you run start where you do so it sets up the DB for you? Can't we just do that in the test, before running start?

Yes it seems a bit over the top for this particular PR, but the capability of initialing this store with arbitrary data before we start an app sounds like a handy thing to have also for other tests.

I am confident enough

Not going to veto it but if it was my PR I think I'd lean towards not adding that test, even if the alternative was "I only tested it manually once".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you run start where you do so it sets up the DB for you? Can't we just do that in the test, before running start?

We could but it's not trivial

Not going to veto it but if it was my PR I think I'd lean towards not adding that test, even if the alternative was "I only tested it manually once".

honestly SGTM, will just remove the test, I was on the fence about it but then realized that our startup is real slow but then again that can always change and we shouldn't count on it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

...it's also a dangerous precedent... Not everyone's gut feeling is as well calibrated as yours, but they might still end up copying this approach...

Signed-off-by: Nicu Reut <[email protected]>
# Conflicts:
#	apps/validator/src/main/scala/org/lfdecentralizedtrust/splice/validator/ValidatorApp.scala
Signed-off-by: Nicu Reut <[email protected]>
Signed-off-by: Nicu Reut <[email protected]>
@nicu-da nicu-da merged commit f440a20 into main Nov 19, 2025
62 checks passed
@nicu-da nicu-da deleted the nicuda/fix/resilience_key_recovery branch November 19, 2025 10:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve detection of ACS import-related failure in reonboard from key

4 participants