
Conversation

@ff137 ff137 (Member) commented Jul 18, 2025

This PR refactors AnonCreds revocation registry management into an event-driven, recoverable workflow. The goal is to make revocation registry lifecycle operations resilient to partial failures (ledger, DB, tails server) and agent restarts, so that a credential definition does not silently become unusable when something goes wrong.


Background / motivation

Today, revocation registry setup and maintenance rely on a sequence of automatic handlers. If the agent crashes or any step fails midway (e.g., registry created on the ledger but not stored locally; registry marked full but backup not activated), the issuer may only discover the problem later, when issuance starts failing.

To address this, each step in the revocation workflow is now modeled as an event with:

  • A persisted record in storage (including payload, state, correlation ID, retry metadata), and
  • A clear request/response pair that can be retried or recovered.

Resolves #3819.


What this PR changes

1. Event model for revocation workflows

Adds a more granular set of AnonCreds events for revocation registry lifecycle steps:

  • Revocation registry definition

    • anoncreds::rev-reg-def::create-requested / create-response
    • anoncreds::rev-reg-def::store-requested / store-response
  • Revocation status list

    • anoncreds::revocation-list::create-requested / create-response
    • anoncreds::revocation-list::store-requested / store-response
    • anoncreds::revocation-list::finished (kept for backwards compatibility)
  • Registry activation & full handling

    • anoncreds::revocation-registry::activation-requested / activation-response
    • anoncreds::revocation-registry::full-detected
    • anoncreds::revocation-registry::full-handling-completed

Event payloads now (see the sketch after this list):

  • Carry a generic options dict (including request_id, correlation_id, retry_count, recovery flag, etc.).
  • For response events, include an ErrorInfoPayload with error_msg, should_retry, and retry_count when something fails and is eligible for retry.
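As a rough illustration of the payload shape (the ErrorInfoPayload field names and options keys come from this PR; the dataclass layout and values here are illustrative, not the actual classes):

    from dataclasses import dataclass

    @dataclass
    class ErrorInfoPayload:
        # Included on response events when a step fails and is eligible for retry
        error_msg: str
        should_retry: bool
        retry_count: int

    # Options dict carried alongside request events (keys named in this PR; values illustrative)
    request_options = {
        "request_id": "req-001",
        "correlation_id": "corr-001",
        "retry_count": 0,
        "recovery": False,
    }

    # A failed create-response could then carry something like:
    failure_info = ErrorInfoPayload(
        error_msg="tails server unavailable",
        should_retry=True,
        retry_count=1,
    )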

The existing CredDefFinishedEvent is updated to use this new event machinery as the entry point for revocation setup after a credential definition is created.

The event bus is also updated so EventBus.subscribe accepts string topics (compiled internally to regex) in addition to pre-compiled patterns, simplifying usage for the new events.
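The toy bus below illustrates the string-vs-precompiled distinction; the real EventBus delivers events asynchronously with a profile, so this is only a sketch of the subscribe change, not ACA-Py's actual API surface:

    import re
    from typing import Callable, Union

    class MiniEventBus:
        """Toy stand-in to illustrate string topics being compiled internally."""

        def __init__(self):
            self._subscriptions = []

        def subscribe(self, topic: Union[str, re.Pattern], handler: Callable) -> None:
            # A plain string is compiled to a regex internally; precompiled patterns still work
            pattern = re.compile(topic) if isinstance(topic, str) else topic
            self._subscriptions.append((pattern, handler))

        def notify(self, topic: str, payload: dict) -> None:
            for pattern, handler in self._subscriptions:
                if pattern.match(topic):
                    handler(topic, payload)

    bus = MiniEventBus()
    bus.subscribe("anoncreds::rev-reg-def::create-requested", lambda t, p: print("handling", t, p))
    bus.notify("anoncreds::rev-reg-def::create-requested", {"correlation_id": "corr-001"})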


2. Persistent event storage & retry metadata

Introduces an EventStorageManager to persist revocation-related events as storage records:

  • New record types:

    • rev_reg_def_create_event
    • rev_reg_def_store_event
    • rev_list_create_event
    • rev_list_store_event
    • rev_reg_activation_event
    • rev_reg_full_handling_event
  • New event states:

    • requested
    • response_success
    • response_failure

Each stored event includes:

  • event_type, serialized event_data (payload), correlation_id, optional request_id
  • state, options, created_at, and an expiry_timestamp
  • Optional response_data, error_msg, and retry_metadata when a response is processed

Key behaviors (see the sketch after this list):

  • store_event_request(...)
    Creates a record for the request event, using correlation_id as the record ID for easy lookup.
  • update_event_response(...)
    Updates state to response_success or response_failure, attaching response and retry metadata.
  • update_event_for_retry(...)
    Re-classifies a failed event for retry: increments retry count, refreshes expiry, sets state back to requested.
  • get_in_progress_events(...)
    Returns events still in progress, optionally only those past their expiry timestamp (ready for recovery).
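A minimal in-memory sketch of that lifecycle is below. The real manager persists actual storage records in the wallet; the states and field names follow the PR text, while the class internals here are purely illustrative:

    import time
    import uuid

    class InMemoryEventStorage:
        """Illustrative stand-in for EventStorageManager (not the real implementation)."""

        def __init__(self):
            self.records = {}

        def store_event_request(self, event_type: str, event_data: dict, options: dict) -> str:
            # correlation_id doubles as the record ID for easy lookup
            correlation_id = options.get("correlation_id") or str(uuid.uuid4())
            self.records[correlation_id] = {
                "event_type": event_type,
                "event_data": event_data,
                "correlation_id": correlation_id,
                "request_id": options.get("request_id"),
                "state": "requested",
                "options": options,
                "created_at": time.time(),
                "expiry_timestamp": time.time() + 30.0,  # recovery window; exact composition assumed
            }
            return correlation_id

        def update_event_response(self, correlation_id, success, response_data=None, error_msg=None):
            record = self.records[correlation_id]
            record["state"] = "response_success" if success else "response_failure"
            record["response_data"] = response_data
            record["error_msg"] = error_msg

        def update_event_for_retry(self, correlation_id, next_delay_seconds):
            record = self.records[correlation_id]
            record["options"]["retry_count"] = record["options"].get("retry_count", 0) + 1
            record["expiry_timestamp"] = time.time() + next_delay_seconds  # refreshed expiry
            record["state"] = "requested"  # back to in-progress

        def get_in_progress_events(self, only_expired: bool = False):
            now = time.time()
            return [
                r for r in self.records.values()
                if r["state"] == "requested"
                and (not only_expired or r["expiry_timestamp"] <= now)
            ]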

3. Exponential backoff & expiry windows

Adds retry_utils to centralize retry timing and expiry calculation:

Configurable via environment variables:

  • ANONCREDS_REVOCATION_MIN_RETRY_DURATION_SECONDS (default: 2)
  • ANONCREDS_REVOCATION_MAX_RETRY_DURATION_SECONDS (default: 60)
  • ANONCREDS_REVOCATION_RETRY_MULTIPLIER (default: 2.0)
    → with defaults: 2, 4, 8, 16… seconds, capped at max
  • ANONCREDS_REVOCATION_RECOVERY_DELAY_SECONDS (default: 30)

Functions (sketched in code below):

  • calculate_exponential_backoff_delay(retry_count) – computes next delay.
  • calculate_event_expiry_timestamp(retry_count) – sets when a failed event should be considered “expired” and ready to be recovered.
  • is_event_expired(expiry_timestamp) – checks whether an event is past its recovery window.
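Put together, the helpers could look roughly like this (function names, environment variables, and defaults are from the PR; whether the expiry window also includes the recovery delay is an assumption):

    import os
    import time

    MIN_RETRY = float(os.environ.get("ANONCREDS_REVOCATION_MIN_RETRY_DURATION_SECONDS", "2"))
    MAX_RETRY = float(os.environ.get("ANONCREDS_REVOCATION_MAX_RETRY_DURATION_SECONDS", "60"))
    MULTIPLIER = float(os.environ.get("ANONCREDS_REVOCATION_RETRY_MULTIPLIER", "2.0"))
    RECOVERY_DELAY = float(os.environ.get("ANONCREDS_REVOCATION_RECOVERY_DELAY_SECONDS", "30"))

    def calculate_exponential_backoff_delay(retry_count: int) -> float:
        # With the defaults this yields 2, 4, 8, 16, ... seconds, capped at MAX_RETRY
        return min(MIN_RETRY * (MULTIPLIER ** retry_count), MAX_RETRY)

    def calculate_event_expiry_timestamp(retry_count: int) -> float:
        # After this point a failed event counts as "expired" and is eligible for recovery
        return time.time() + calculate_exponential_backoff_delay(retry_count) + RECOVERY_DELAY

    def is_event_expired(expiry_timestamp: float) -> bool:
        return time.time() >= expiry_timestamp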

4. Event-driven revocation setup & full-registry handling

DefaultRevocationSetup is updated to orchestrate the whole revocation lifecycle purely via events and EventStorageManager:

  • On cred def finished, it:

    • Generates a request_id and correlation_id
    • Stores the create-rev-reg-def request event
    • Emits rev-reg-def::create-requested
  • Each subsequent step (def create/store, list create/store, activation) follows the pattern below (sketched in code after this list):

    • Reads/writes event records using EventStorageManager

    • On success, emits the next event in the chain

    • On failure:

      • If should_retry is true, schedules a retry with backoff and updates the stored event
      • Otherwise, marks the event as non-recoverable and surfaces a meaningful error
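In pseudocode terms, the per-step decision above looks roughly like the following; the handler name, the bus/storage interfaces, and the response shape are stand-ins (an async-capable bus is assumed), and only the success / retry / give-up branching comes from this PR:

    import asyncio

    async def handle_step_response(storage, bus, correlation_id: str, response: dict):
        """Illustrative per-step handler: emit next event, retry with backoff, or give up."""
        if response["success"]:
            storage.update_event_response(correlation_id, success=True, response_data=response["data"])
            await bus.notify(response["next_topic"], response["next_payload"])  # next event in the chain
            return

        error = response["error_info"]  # error_msg, should_retry, retry_count
        if error["should_retry"]:
            delay = min(2 * (2.0 ** error["retry_count"]), 60)  # backoff (see the retry_utils sketch above)
            storage.update_event_for_retry(correlation_id, next_delay_seconds=delay)
            await asyncio.sleep(delay)
            await bus.notify(response["request_topic"], response["request_payload"])  # re-emit the request
        else:
            storage.update_event_response(correlation_id, success=False, error_msg=error["error_msg"])
            # non-recoverable: surface a meaningful error so the issuer knows intervention is needed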

For full registry handling:

  • A new handle_full_registry_event(...) sets the current registry’s state to FULL, logs context, and emits events to:

    • Activate the backup registry, and
    • Create a new backup registry
  • Activation and backup creation are now handled via the same event pipeline, so an agent crash or transient failure in any of these steps can be recovered rather than leaving a cred def with no active registry.


5. Recovery at startup / per-request

Two new components handle recovery:

  1. EventRecoveryManager

    • Scans EventStorageManager for in-progress events that have expired (based on their expiry_timestamp).
    • Re-emits the corresponding request events onto the event bus, tagging options["recovery"] = True and reusing the original correlation_id.
  2. revocation_recovery_middleware (admin server middleware)

    • Registered in admin/server.py for admin routes.

    • For each admin request:

      • Resolves the current profile/tenant.

      • If revocation recovery has not yet run for this profile:

        • Fetches any in-progress revocation events.
        • Runs EventRecoveryManager.recover_in_progress_events() for events whose expiry time has passed.
        • Marks the profile as “recovered” for this server session to avoid repeated scans.
    • All errors during recovery are logged but do not block the original HTTP request.

This means that if the agent is terminated mid-operation, the next admin request for that profile will automatically attempt to repair any stuck revocation workflows.
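Sketched as an aiohttp middleware (ACA-Py's admin server is aiohttp-based), the flow above is roughly the following; the helper functions and the profile lookup are placeholders, not real ACA-Py APIs:

    from aiohttp import web

    _recovered_profiles = set()  # per-server-session "already checked" map

    async def find_expired_in_progress_events(profile_name):
        # placeholder: would ask EventStorageManager for in-progress events past their expiry
        return []

    async def recover_in_progress_events(profile_name, events):
        # placeholder: would re-emit the stored request events with options["recovery"] = True
        pass

    @web.middleware
    async def revocation_recovery_middleware_sketch(request, handler):
        try:
            profile_name = "base"  # placeholder: the real middleware resolves the tenant profile from the request
            if profile_name not in _recovered_profiles:
                pending = await find_expired_in_progress_events(profile_name)
                if pending:
                    await recover_in_progress_events(profile_name, pending)
                else:
                    _recovered_profiles.add(profile_name)  # nothing pending: skip future scans this session
        except Exception:
            pass  # recovery errors are logged (omitted here) and never block the original request
        return await handler(request)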


6. API / behavior changes

Revocation admin routes are updated to align with the new behavior:

  • Revocation list creation
    Delegates to AnonCredsRevocation.create_and_register_revocation_list (see the sketch at the end of this section).

    • If the method returns a RevListResult, it is serialized as the response.
    • If it returns a string, it is treated as a user-facing error message and mapped to 400 Bad Request.
  • Revocation registry state
    set_rev_reg_state now raises AnonCredsRevocationError with “not found” messaging if the rev reg cannot be located; the route maps this to 404 Not Found, rather than silently returning None.

  • Tails file retrieval
    If the revocation registry definition cannot be retrieved, the tails route now returns 404 Not Found.

OpenAPI / Swagger definitions are updated accordingly to reflect the revised error handling.
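The revocation-list route change, for instance, boils down to something like the following (the method name comes from this PR; the handler wrapper, argument names, and response envelope are assumptions):

    from aiohttp import web

    async def create_rev_list_sketch(revocation, rev_reg_def_id: str, options: dict) -> web.Response:
        result = await revocation.create_and_register_revocation_list(rev_reg_def_id, options)
        if isinstance(result, str):
            # A plain string return is treated as a user-facing error message
            raise web.HTTPBadRequest(reason=result)
        return web.json_response(result.serialize())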


7. Testing

New and updated tests cover:

  • Event storage lifecycle (create / update / retry / delete / query).

  • Event recovery logic and mapping from stored records back to events.

  • Revocation recovery middleware behavior, including:

    • One-time-per-profile behavior
    • Distinguishing pending vs expired events
    • Robustness when errors occur during recovery
  • Revocation setup, full-registry handling, and route behavior changes.


Backwards compatibility

  • Existing workflows continue to function, but now gain resiliency: broken or partial revocation operations are captured, retried, and recoverable instead of silently failing.
  • Legacy event topics like anoncreds::revocation-list::finished remain in place for compatibility.
  • New storage record types are additive; existing records are unaffected.
  • Callers of revocation admin endpoints are unchanged, aside from improved error semantics in edge cases (more accurate 4xx/5xx codes and messages).

@swcurran

This comment was marked as outdated.

@ff137

This comment was marked as outdated.

@ff137 ff137 force-pushed the feat/rev-reg-granular-events branch 3 times, most recently from 1730ec8 to 951adf9 Compare October 15, 2025 18:23
@ff137 ff137 marked this pull request as ready for review October 15, 2025 18:38
@ff137 ff137 changed the title 🚧 Recoverable, event-driven revocation registry management ✨ Recoverable, event-driven revocation registry management Oct 15, 2025
@ff137 ff137 (Member, Author) commented Oct 15, 2025

This is a big one ... It'll probably be necessary to contribute some documentation for what's going on here.

@ff137 ff137 marked this pull request as draft October 15, 2025 19:30
@ff137 ff137 force-pushed the feat/rev-reg-granular-events branch from 5c429e9 to a7d10f7 Compare October 16, 2025 11:59
@ff137 ff137 marked this pull request as ready for review October 16, 2025 12:22
@ff137 ff137 requested a review from a team October 28, 2025 11:06
@ff137 ff137 (Member, Author) commented Oct 28, 2025

Tagging reviewers to please check out what the idea is over here. It's fairly complex, and it'll help to organize the code a lil bit better. But before polishing, it'll be good to get some feedback over the ideas here, and how it's implemented. All questions / comments are welcome.

Note: we're currently using this in production in acapy-cloud (or a separately maintained branch of this work; just need to rebase and confirm it's all same-same)

@PatStLouis (Contributor) commented:

@ff137 I just gave this a first look, there's a lot going on here. Couple of questions:

Where did the need for this come from? Have there been many cases where a cred def ended up in a bad state because of this issue, or is this a preventive feature?

How could one reproduce a condition where this would kick in with a running agent?

Would it be possible to get a document for this feature, documenting different states? The existing aries rfc documents come to mind, since some of them also deal with multiple states. Some sort of diagram would be useful for me.
https://identity.foundation/aries-rfcs/latest/features/0453-issue-credential-v2/#choreography-diagram

Does endorsement affect this at all? Has it been tested with an AnonCreds registry other than Indy?

If events continue to fail, exceeding some maximum retry setting, then the issuer can be notified that manual intervention is required to fix revocation registries and ensure the credential definition remains usable.

It's not clear what an issuer needs to do in the case where it needs to take action because of a failure.

Otherwise the code looks solid and well organized, will give a deeper look!

@esune esune (Member) left a comment


I like the direction and structure; being unable to recover from a failed revocation step was pretty frustrating, so this is definitely an improvement.

I wonder if the "accumulator fix" that was wired-up a while back (that gets triggered both automatically and manually by calling [/revocation/registry/{rev_reg_id}/fix-revocation-entry-state](https://traction-tenant-proxy-dev.apps.silver.devops.gov.bc.ca/api/doc#/AnonCreds%20-%20Revocation/put_anoncreds_revocation_registry__rev_reg_id__fix_revocation_entry_state)) is something we can now remove, or should at least review where the self-firing event is triggered to avoid conflicts (I may very well have missed something in the changeset, but thought it is worth flagging).

"ACAPY_PRESERVE_EXCHANGE_RECORDS": "preserve_exchange_records",
"ACAPY_PUBLIC_INVITES": "public_invites",
"ACAPY_REQUESTS_THROUGH_PUBLIC_DID": "requests_through_public_did",
"ACAPY_ENABLE_AUTO_REVOCATION_RECOVERY": "anoncreds.revocation.auto_recovery_enabled",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have missed it, but there doesn't seem to be a (new) startup argument that would allow for this setting to be turned on or off - I thought it would be the default an only behaviour?

This block (and the following one, with related mappings) was meant to expose tenant-specific settings.

@esune esune (Member) commented Oct 28, 2025

How could one reproduce a condition where this would kick in with a running agent?

The "easisest" way I can think of is by making the tails server unavailable when the file needs to be uploaded. Recovering from that would require manually setting the related registry to decommissioned and doing it over again.

It's not clear what an issuer needs to do in the case where it needs to take action because of a failure.

Agree that it would be nice to have a "documented procedure", even now recovering from a failed event is a bit of trial-and-error.

@ff137 ff137 (Member, Author) commented Nov 13, 2025

Hey guys -- I've been less active here, busy on some other work. But I still want to make sure this gets over the line, as I think it's quite an important feature for reliability. So here's some feedback on the comments:


@PatStLouis

Where did the need for this come from? Have there been many cases where a cred def ended up in a bad state because of this issue, or is this a preventive feature?

Basically, anything that can go wrong, will go wrong! And so the idea is to streamline automated revocation registry management as much as possible, to reduce manual intervention.

We did find that revocable cred defs can silently become unusable when anything goes wrong with the background creation processes. An issuer could try to send credentials, and there could be no active registry available, which would raise an error. Importantly: the issuer would not know that something had gone wrong (e.g. a new backup registry wasn't created) until they try to send credentials. This assumes a multitenant scenario where they don't have access to logs.

Ideally, an issuer should not have to understand all the complexities of rev-registry management, and the steps to manually resolve these scenarios. So that's the motivation for making the software handle as much of that as possible.

How could one reproduce a condition where this would kick in with a running agent?

It depends. We use Cheqd, which accepts a local registrar URL. So all we had to do was create a revocable cred def with a small registry size, issue a bunch, and then disable the local registrar service, in which case the background operations would reliably fail. This can probably also be done by making the tails server unavailable.

Another reliable way to reproduce it is to abruptly terminate the agent while it's in the middle of the background job. That's how we've validated that the solution in this PR works: we have this automated test script in acapy-cloud that monitors acapy logs, and terminates the agent when a specific log is found. That way we could cause failures that surface when the agent restarts and an issuer tries to send creds. We also used that script to prove that this new recovery logic works as intended.

Would it be possible to get a document for this feature, documenting different states?

100%. I've just got to sit down and write the documentation. 😄 Soon ™️

Does endorsement affect this at all? Has it been tested with an AnonCreds registry other than Indy?

So as mentioned, it does work with Cheqd. And we've not tested it directly with Indy (apart from the automated GHA tests), since we don't use it anymore. But I understand the operations to be ledger-agnostic, and the BDD tests do cover an endorsement scenario. So that flow works as usual, and anything that kicked off the automated creation of another registry will still function, with the added auto-recoverability.

It's not clear what an issuer needs to do in the case where it needs to take action because of a failure.

Indeed - that's where the documentation will have to borrow from the existing steps (e.g. decommissioning or manually setting a registry to active). The hope is that manual intervention should never be necessary, but it'll be important to explain what that entails, if it needs to happen.


@esune

there doesn't seem to be a (new) startup argument that would allow for this setting to be turned on or off - I thought it would be the default and only behaviour?

Yes ... the motivation there is related to the new middleware, which does the recovery check. The flag basically allows for a micro-optimisation, so that we know which tenants should be checked for recovery, and which shouldn't. We configure the wallet setting for issuer tenants, so the middleware check only runs for them.

When the agent starts up, we want to check if there are any in-progress operations that were interrupted. And the middleware does that once per tenant, and keeps an in-memory map of whether a tenant has been checked yet (I believe it's similar to the anoncreds-wallet upgrade middleware). Every tenant only needs to be checked once to see if anything is pending and should be picked up again, but technically, non-issuer tenants never need to be checked.

For context: in acapy-cloud we have the concept of issuer tenants, with a trust registry, and 'issue cred' operations are only possible for issuers. Now, when we create an issuer tenant, we set this auto-recovery flag to true (in the wallet settings), and so the middleware only intercepts their first call, and maintains an "already been checked" map for these tenants. In a different context where anyone can be an issuer, it should be enabled for everyone.

We felt this optimisation makes sense because there may be 10s of thousands of holders for every 1 issuer, and there's no point checking if holders have pending rev-reg events. It could be enabled by default, because it's just a micro-optimisation, and probably only matters if there's millions of holders (even then, I believe the in-memory map doesn't use too much memory). But that was the thinking behind not enabling it by default for all tenants.
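For reference, enabling it for a single issuer tenant could look something like this hypothetical wallet-create payload (the ACAPY_ENABLE_AUTO_REVOCATION_RECOVERY key is the mapping added in this PR; the rest is illustrative):

    issuer_wallet_request = {
        "wallet_name": "issuer-tenant-1",
        "wallet_type": "askar-anoncreds",
        "extra_settings": {
            "ACAPY_ENABLE_AUTO_REVOCATION_RECOVERY": True,  # maps to anoncreds.revocation.auto_recovery_enabled
        },
    }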

I wonder if the "accumulator fix" that was wired-up a while back [...] is something we can now remove,

That may still be useful. We haven't had to use it in a while, as far as I know, but probably no harm in keeping it.


There we go! Sorry for the delay, and the long response. I've been moving on from this feature, since it's already used where we need it. So I figured adding the context is good. And, contributing this upstream is still a priority for us, so that our fork doesn't diverge too much. And it's probably a useful feature for the community, so long as it gets documented, and if the implementation all looks good and makes sense. In the next week or so I'll try to add an .md file to help explain the new features here.

Commits pushed to the branch (each signed off by ff137):

  • Implement retry mechanism for failed events; add auto-recovery functionality; improve event traceability and logging; handle resource conflicts gracefully; add configurable retry settings
  • :art: Support passing uncompiled string to EventBus.subscribe
  • :art:
  • :construction:
  • :art:
  • :art:
  • :bug: Fix required field
  • :bug:
  • :sparkles: Update handling of full registries
  • :bug:
  • :loud_sound:
  • :sparkles: Failure payloads, retry_count handling, and
  • :sparkles: Recovery manager and middleware wiring
  • :loud_sound:
  • :sparkles: Store full handling events too
  • :art: New correlation_id for each successful request
  • :art: Clear the set_active_registry_event option of correlation id
  • :art: Fix correlation of full handling event
  • :construction: Test async event bus
  • :art: Rename "get or create" method; it just gets
  • :sparkles: Shield methods that create resource on ledger
  • :construction: Debug logging
  • :bug: Fix default value for settings
  • :construction: Debug deserializing
  • :art: Remove rev_reg_def_private from payloads (write private def to temp key)
  • :sparkles: Handle resource already exists for RevList
  • :art:
  • :construction:
  • :bug: pop correlation_id from set active registry options
  • :art: Add retry sleep time
  • :recycle: First upload tail file; skip TailsUpload request / response
  • :art: Call notify issuer placeholder
  • :bug: Fix recovery of private rev reg def
  • :art: Remove unnecessary event and record type
  • :sparkles: Only recover events older than some delay
  • :art: Add descriptions to events
  • :art: Improve state machine log traceability
  • :art: Only mark profile recovered with no pending event
  • :sparkles: Emit event if manual intervention required
  • :wrench: Make retry count and sleep time configurable
  • :art: Add request id, for traceability
  • :sparkles: Retry all failed events if retryable, with expiry timestamps
  • :loud_sound: Verbose logs for traceability
  • :bug:
  • :art:
  • :bug: Add missing log arg
  • :art:
  • :art: Generate request_id with emitting full registry
  • :art: Add options to RevListFinishedEvent
  • :bug: Fix import
  • :bug: Always call handlers outside of try blocks
  • :recycle: Remove legacy in-between event: RevRegDefFinishedEvent
  • :recycle: Refactor event storage type (keep events in requested state when being retried; remove completed state -- success or failure is enough)
  • :art: Debug
  • :bug: Fix updating event for retry
  • :art: Make auto-recovery check disabled by default
  • :art: Add config map for auto recovery (`ACAPY_ENABLE_AUTO_REVOCATION_RECOVERY` maps to `anoncreds.revocation.auto_recovery_enabled`, so it can be passed in wallet extra settings)
@ff137 ff137 (Member, Author) commented Nov 17, 2025

Is the kanon_issuance_and_presentation scenario test failure expected? If it's new, it's maybe related to these changes

@ff137 ff137 (Member, Author) commented Nov 17, 2025

I see it also failed on the last main branch run, so it's unrelated to these changes:
https://github.com/openwallet-foundation/acapy/actions/runs/19397669428/job/55505821837

@ff137 ff137 force-pushed the feat/rev-reg-granular-events branch from 0972b2b to 806ea38 Compare November 17, 2025 15:10
@ff137 ff137 (Member, Author) commented Nov 17, 2025

I used some AI to help summarise the diff of this PR, and updated the PR description with that.

There's also some first-draft documentation that was added: 806ea38

Some more detail remains to be added to those docs, especially with regard to how it's configured (wallet / agent setting). I'll need some feedback on the desired approach with that. I think it should be enabled for all tenants by default, with perhaps a flag that can be passed to restrict it to tenants that have the wallet setting. I think that's a good way to do it, but let me know what you guys think. Also, please take a look at the new doc file, and let me know if there's anything specific to improve.

@swcurran (Contributor) commented:

@ff137 -- thanks. @openwallet-foundation/acapy-committers -- let's get this through as we look at the Kanon test issues.

@esune esune (Member) commented Nov 18, 2025

Yes ... the motivation there is related to the new middleware, which does the recovery check. The flag basically allows for a micro-optimisation, so that we know which tenants should be checked for recovery, and which shouldn't. We configure the wallet setting for issuer tenants, so the middleware check only runs for them.

When the agent starts up, we want to check if there are any in-progress operations that were interrupted. And the middleware does that once per tenant, and keeps an in-memory map of whether a tenant has been checked yet (I believe it's similar to the anoncreds-wallet upgrade middleware). Every tenant only needs to be checked once to see if anything is pending and should be picked up again, but technically, non-issuer tenants never need to be checked.

For context: in acapy-cloud we have the concept of issuer tenants, with a trust registry, and 'issue cred' operations are only possible for issuers. Now, when we create an issuer tenant, we set this auto-recovery flag to true (in the wallet settings), and so the middleware only intercepts their first call, and maintains an "already been checked" map for these tenants. In a different context where anyone can be an issuer, it should be enabled for everyone.

We felt this optimisation makes sense because there may be 10s of thousands of holders for every 1 issuer, and there's no point checking if holders have pending rev-reg events. It could be enabled by default, because it's just a micro-optimisation, and probably only matters if there's millions of holders (even then, I believe the in-memory map doesn't use too much memory). But that was the thinking behind not enabling it by default for all tenants.

This makes sense, thank you for the explanation. As it is right now though (I may have missed something), the self-recovery middleware is a "secret" feature that needs to be enabled, and to do so you would need to know which setting to set. Additionally, for single-instance scenarios this extra step is a bit unusual, so I wonder if it would be best to choose a "reasonable default" of true that can be controlled by a startup argument like for other functionality, and let the deployer/multitenant administrator select whether to keep it this way, or set it to false and selectively turn it on for issuers.

@swcurran (Contributor) commented:

@ff137 -- could you please reply to @esune's last comment? Would like to get this merged...

Thanks

@ff137 ff137 (Member, Author) commented Nov 25, 2025

Sorry for the delay.

This makes sense, thank you for the explanation. As it is right now though (I may have missed something), the self-recovery middleware is a "secret" feature that needs to be enabled, and to do so you would need to know which setting to set. Additionally, for single-instance scenarios this extra step is a bit unusual, so I wonder if it would be best to choose a "reasonable default" of true that can be controlled by a startup argument like for other functionality, and let the deployer/multitenant administrator select whether to keep it this way, or set it to false and selectively turn it on for issuers.

That's true. By default this feature is not yet enabled. And I guess that may be fine since auto-recovery is "experimental" depending on ledger/plugin being used. Only we at DIDx have used it so far, with cheqd. And we may have added manual tweaks to the cheqd plugin to make it work (in terms of error handling and changing error messages to identify retryable vs not retryable errors). I'm not so sure. There's been so much code and we're maintaining divergent forks.

However, I'm confident the logic is sound and should work in general. It might make sense to merge it as is, and then other devs can experiment with it (we have the test script on acapy-cloud to reproduce recovery scenarios). Then there's time to decide on best configuration before adding it to a release.

Personally, I can't decide on my own what the config should be, like adding yet another startup arg, or just baking it in and making it the default. Anything is fine really -- considering the feature should only come up when something has failed anyway, and the auto-recovery is just a bonus if possible -- if it doesn't work, nothing changes. So the best way to configure it should be whatever works best for others. Which probably just requires some playing around with it to understand it better.

For now it's indeed true that the auto-recovery is enabled by a wallet setting. What remains is maybe to document that better. But that's also a change that can be committed after it's merged. My primary focus is shifting away from acapy-cloud, so from my end it's done, but I will address change requests where needed.

@esune esune (Member) commented Nov 27, 2025

@ff137 fair points. Based on your reply I think we may be fine with merging as-is, doing some additional testing and deciding whether to bake this in fully as a replacement for the current stream (which so far would be my preference, having fought with broken revocation data in the past). If other maintainers agree (@jamshale, @PatStLouis, @swcurran) we can merge and move on.


@jamshale (Contributor) commented:

I'm just going to go over this again... I think it's great. I'll merge it after.

@jamshale (Contributor) commented:

Is the kanon_issuance_and_presentation scenario test failure expected? If it's new, it's maybe related to these changes

Yes, it's expected to fail intermittently right now. We can ignore the failure.

@jamshale jamshale (Contributor) left a comment


I'm a bit unsure whether the existing accumulator recovery is still necessary despite this, or if it's still possible to get the accumulator out of sync with the revocation registry values. It happens automatically now as well, so I don't see any conflict with this.

This is a big improvement. Thanks for contributing it to the root project. I'd prefer it just to be enabled by default in the future.

@jamshale jamshale merged commit 34f9472 into openwallet-foundation:main Nov 27, 2025
12 checks passed
@ff137 ff137 (Member, Author) commented Nov 27, 2025

I think the accumulator value can still go out of sync, because as far as I can recall, the "revoke credential" flow changes local records to status revoked right before publishing to the ledger, without the ledger necessarily reflecting that. If that publish fails, the cred-ex / rev-reg records on the issuer's side will still indicate that it's revoked, but it's not. I don't know the technicals, but presumably that relates to the accumulator value.

That's something we've looked into fixing as well, to revert / clean up local records if something goes wrong. Or it may be possible to adopt a similar auto-recovery pattern for the revoke-credential logic. Here it's just for registry creation (initial and backups). And publish-revocation can also be turned into an event that gets replayed if something goes wrong (if it's a retryable error). But that's left as open, future work, because we felt this PR was big enough, and didn't want to complicate things by changing revocation-publish too. But it may be a worthwhile idea to take note of.

@swcurran (Contributor) commented:

From the previous [comment]:

I don't know the technicals, but presumably that relates to accumulator value.

The accumulator is a function of the status of all of the credentials. So if the status of all the credentials is the same, the same accumulator is calculated.

That's something we've looked into fixing as well, to revert / clean up local records if something goes wrong.

That's a questionable general approach, as often the local records reflect the status that has come from the Application/Controller, which is often keeping its own state. So you are really looking at three things that need to be in sync -- the Application/Controller that has the business logic, ACA-Py (the agent) that is told the state by the Application/Controller, and the ledger. I think the only solution to the failed ledger update is to fix the ledger interface and get an update transaction completed. I would note that the "fix rev reg" code that was implemented a few years ago does that -- checks the published state of the RevReg locally vs. the ledger and retries the transaction until it works.
