✨ Recoverable, event-driven revocation registry management #3831
Conversation
This is a big one ... It'll probably be necessary to contribute some documentation for what's going on here.
Tagging reviewers to please check out the idea over here. It's fairly complex, and it'll help to organize the code a little better. But before polishing, it'll be good to get some feedback on the ideas here and how they're implemented. All questions / comments are welcome. Note: we're currently using this in production in acapy-cloud (or rather a separately maintained branch of this work; just need to rebase and confirm it's all the same).
@ff137 I just gave this a first look; there's a lot going on here. A couple of questions: Where did the need for this come from? Have there been many cases where a cred def ended up in a bad state because of this issue, or is this a preventive feature? How could one reproduce a condition where this would kick in with a running agent? Would it be possible to get a document for this feature, documenting the different states? The existing Aries RFC documents come to mind, since some of them also deal with multiple states. Some sort of diagram would be useful for me. Does endorsement affect this at all? Has it been tested with AnonCreds registries other than Indy?
It's not clear what an issuer needs to do in the case where it needs to take action because of a failure. Otherwise the code looks solid and well organized; will give it a deeper look!
esune left a comment:
I like the direction and structure; being unable to recover from a failed revocation step was pretty frustrating, so this is definitely an improvement.
I wonder if the "accumulator fix" that was wired up a while back (triggered both automatically and manually by calling [/revocation/registry/{rev_reg_id}/fix-revocation-entry-state](https://traction-tenant-proxy-dev.apps.silver.devops.gov.bc.ca/api/doc#/AnonCreds%20-%20Revocation/put_anoncreds_revocation_registry__rev_reg_id__fix_revocation_entry_state)) is something we can now remove, or whether we should at least review where the self-firing event is triggered to avoid conflicts (I may very well have missed something in the changeset, but thought it worth flagging).
"ACAPY_PRESERVE_EXCHANGE_RECORDS": "preserve_exchange_records",
"ACAPY_PUBLIC_INVITES": "public_invites",
"ACAPY_REQUESTS_THROUGH_PUBLIC_DID": "requests_through_public_did",
"ACAPY_ENABLE_AUTO_REVOCATION_RECOVERY": "anoncreds.revocation.auto_recovery_enabled",
I may have missed it, but there doesn't seem to be a (new) startup argument that would allow this setting to be turned on or off. I thought it would be the default and only behaviour?
This block (and the following one, with related mappings) was meant to expose tenant-specific settings.
The "easiest" way I can think of is by making the tails server unavailable when the file needs to be uploaded. Recovering from that would require manually setting the related registry to decommissioned and doing it over again.
Agree that it would be nice to have a "documented procedure"; even now, recovering from a failed event is a bit of trial and error.
Hey guys -- I've been less active here, busy on some other work. But I still want to make sure this gets over the line, as I think it's quite an important feature for reliability. So here's some feedback on the comments:
Basically, anything that can go wrong, will go wrong! And so the idea is to streamline automated revocation registry management as much as possible, to reduce manual intervention. We did find that revocable cred defs can silently become unusable when anything goes wrong with the background creation processes. An issuer could try to send credentials, and there could be no active registry available, which would raise an error. Importantly: the issuer would not know that something had gone wrong (e.g. a new backup registry wasn't created) until they try to send credentials. This is assuming a multitenant scenario where they don't have access to logs. Ideally, an issuer should not have to understand all the complexities of rev-registry management, and the steps to manually resolve these scenarios. So that's the motivation for making the software handle as much of that as possible.
It depends. We use Cheqd, which accepts a local registrar URL. So all we had to do was create a revocable cred def with a small registry size, issue a bunch, and then disable the local registrar service, in which case the background operations would reliably fail. This can probably also be done by making the tails server unavailable. Another reliable way to reproduce it is to abruptly terminate the agent while it's in the middle of the background job. That's how we've validated that the solutions in this PR work: we have an automated test script in acapy-cloud that monitors acapy logs and terminates the agent when a specific log is found. That way we could induce failures that surface when the agent restarts and an issuer tries to send creds. We also used that script to prove that this new recovery logic works as intended.
100%. I've just got to sit down and write the documentation. 😄 Soon ™️
So as mentioned, it does work with Cheqd. And we've not tested it directly with Indy (apart from the automated GHA tests), since we don't use it anymore. But I understand the operations to be ledger-agnostic, and the BDD tests do cover an endorsement scenario. So that flow works as usual, and anything that kicks off the automated creation of another registry will still function, with the added auto-recoverability.
Indeed - that's where the documentation will have to borrow from the existing steps (e.g. decommissioning or manually setting a registry to active). The hope is that manual intervention should never be necessary, but it'll be important to explain what that entails, if it needs to happen.
Yes ... the motivation there is related to the new middleware, which does the recovery check. The flag basically allows for a micro-optimisation, so that we know which tenants should be checked for recovery, and which shouldn't. We configure the wallet setting for issuer tenants, so the middleware check only runs for them. When the agent starts up, we want to check if there are any in-progress operations that were interrupted. The middleware does that once per tenant, and keeps an in-memory map of whether a tenant has been checked yet (I believe it's similar to the anoncreds-wallet upgrade middleware). Every tenant only needs to be checked once to see if anything is pending and should be picked up again, but technically, non-issuer tenants never need to be checked. For context: in acapy-cloud we have the concept of issuer tenants, with a trust registry, and 'issue cred' operations are only possible for issuers. When we create an issuer tenant, we set this auto-recovery flag to true (in the wallet settings), so the middleware only intercepts their first call, and maintains an "already been checked" map for these tenants. In a different context where anyone can be an issuer, it should be enabled for everyone. We felt this optimisation makes sense because there may be tens of thousands of holders for every one issuer, and there's no point checking whether holders have pending rev-reg events. It could be enabled by default, because it's just a micro-optimisation, and probably only matters if there are millions of holders (even then, I believe the in-memory map doesn't use much memory). But that was the thinking behind not enabling it by default for all tenants.
That may still be useful. We haven't had to use it in a while, as far as I know, but there's probably no harm in keeping it. There we go! Sorry for the delay, and the long response. I've been moving on from this feature, since it's already used where we need it, so I figured adding the context is good. Contributing this upstream is still a priority for us, so that our fork doesn't diverge too much. And it's probably a useful feature for the community, so long as it gets documented, and if the implementation all looks good and makes sense. In the next week or so I'll try to add an .md file to help explain the new features here.
Commit history:
- Implement retry mechanism for failed events; add auto-recovery functionality; improve event traceability and logging; handle resource conflicts gracefully; add configurable retry settings
- Support passing uncompiled string to EventBus.subscribe
- Fix required field
- Update handling of full registries
- Failure payloads and retry_count handling
- Recovery manager and middleware wiring
- Store full handling events too
- New correlation_id for each successful request
- Clear the set_active_registry_event option of correlation id
- Fix correlation of full handling event
- Test async event bus
- Rename "get or create" method; it just gets
- Shield methods that create resource on ledger
- Fix default value for settings
- Remove rev_reg_def_private from payloads; write private def to temp key
- Handle resource already exists for RevList
- Pop correlation_id from set active registry options
- Add retry sleep time
- First upload tails file; skip TailsUpload request / response
- Call notify issuer placeholder
- Fix recovery of private rev reg def
- Remove unnecessary event and record type
- Only recover events older than some delay
- Add descriptions to events
- Improve state machine log traceability
- Only mark profile recovered with no pending event
- Emit event if manual intervention required
- Make retry count and sleep time configurable
- Add request id, for traceability
- Retry all failed events if retryable, with expiry timestamps
- Always call handlers outside of try blocks
- Remove legacy in-between event: RevRegDefFinishedEvent
- Refactor event storage type: keep events in requested state when being retried; remove completed state (success or failure is enough)
- Fix updating event for retry
- Make auto-recovery check disabled by default
- Add config map for auto recovery: `ACAPY_ENABLE_AUTO_REVOCATION_RECOVERY` maps to `anoncreds.revocation.auto_recovery_enabled`, so it can be passed in wallet extra settings
Is the …
I see it also failed on the last main branch run, so it's unrelated to these changes.
I used some AI to help summarise the diff of this PR, and updated the PR description with that. There's also some first-draft documentation that was added: 806ea38 Some more detail remains to be added to those docs, especially with regard to how it's configured (wallet / agent setting). I'll need some feedback on the desired approach there. I think the default behavior should be that it's enabled for all tenants, with perhaps a flag that can be passed to restrict it to only tenants that have the wallet setting. I think that's a good way to do it, but let me know what you guys think. Also, please take a look at the new doc file, and let me know if there's anything specific to improve.
@ff137 -- thanks. @openwallet-foundation/acapy-committers -- let's get this through as we look at the Kanon test issues.
This makes sense, thank you for the explanation. As it is right now though (I may have missed something), the self-recovery middleware is a "secret" feature that needs to be enabled, and to do so you would need to know which setting to set. Additionally, for single-instance scenarios this extra step is a bit unusual, so I wonder if it would be best to choose a "reasonable default" of …
Sorry for the delay.
That's true. By default this feature is not yet enabled. And I guess that may be fine, since auto-recovery is "experimental" depending on the ledger/plugin being used. Only we at DIDx have used it so far, with cheqd. And we may have added manual tweaks to the cheqd plugin to make it work (in terms of error handling, and changing error messages to identify retryable vs non-retryable errors); I'm not so sure -- there's been so much code, and we're maintaining divergent forks. However, I'm confident the logic is sound and should work in general. It might make sense to merge it as is, and then other devs can experiment with it (we have the test script in acapy-cloud to reproduce recovery scenarios). Then there's time to decide on the best configuration before adding it to a release. Personally, I can't decide on my own what the config should be: adding yet another startup arg, or just baking it in and making it the default. Anything is fine, really; the feature should only come up when something has failed anyway, and the auto-recovery is just a bonus if possible. If it doesn't work, nothing changes. So the best way to configure it should be whatever works best for others, which probably just requires some playing around with it to understand it better. For now it's indeed true that the auto-recovery is enabled by a wallet setting. What remains is maybe to document that better, but that's also a change that can be committed after it's merged. My primary focus is shifting away from acapy-cloud, so from my end it's done, but I will address change requests where needed.
@ff137 fair points. Based on your reply I think we may be fine with merging as-is, doing some additional testing, and deciding whether to bake this in fully as a replacement for the current stream (which so far would be my preference, having fought with broken revocation data in the past). If other maintainers agree (@jamshale @PatStLouis @swcurran) we can merge and move on.
I'm just going to go over this again... I think it's great. I'll merge it after. |
Yes, it's expected to fail intermittently right now. We can ignore the failure.
jamshale left a comment:
I'm a bit unsure whether the existing accumulator recovery is still necessary despite this -- is it still possible to get the accumulator out of sync with the revocation registry values? It happens automatically now as well, so I don't see any conflict with this.
This is a big improvement. Thanks for contributing it to the root project. I'd prefer it just to be enabled by default in the future.
I think the accumulator value can still go out of sync, because as far as I can recall, the "revoke credential" flow changes local records to status revoked right before publishing to the ledger, without the ledger necessarily reflecting that. If that publish fails, the cred-ex / rev-reg records on the issuer's side will still indicate that it's revoked, but it's not. I don't know the technicals, but presumably that relates to the accumulator value. That's something we've looked into fixing as well: reverting / cleaning up local records if something goes wrong. It may also be possible to adopt a similar auto-recovery pattern for the revoke-credential logic; here it's just for registry creation (initial and backups). Publish-revocation could also be turned into an event that gets replayed if something goes wrong (if it's a retryable error). But that's left as open, future work, because we felt this PR was big enough and didn't want to complicate it by changing revocation-publish too. It may be a worthwhile idea to take note of, though.
From the previous [comment]:
The accumulator is a function of the status of all of the credentials. So if the status of all the credentials is the same, the same accumulator is calculated.
That's a questionable general approach, as often the local records reflect the status that has come from the Application/Controller, which is often keeping its own state. So you are really looking at three things that need to be in sync: the Application/Controller that has the business logic, ACA-Py (the agent) that is told the state by the Application/Controller, and the ledger. I think the only solution to the failed ledger update is to fix the ledger interface and get an update transaction completed. I would note that the "fix rev reg" code that was implemented a few years ago does that: it checks the published state of the RevReg locally vs. the ledger and retries the transaction until it works.
This PR refactors AnonCreds revocation registry management into an event-driven, recoverable workflow. The goal is to make revocation registry lifecycle operations resilient to partial failures (ledger, DB, tails server) and agent restarts, so that a credential definition does not silently become unusable when something goes wrong.
Background / motivation
Today, revocation registry setup and maintenance relies on a sequence of automatic handlers. If the agent crashes or any step fails midway (e.g., registry created on the ledger but not stored locally; registry marked full but backup not activated), the issuer may only discover the problem later, when issuance starts failing.
To address this, each step in the revocation workflow is now modeled as an event with persisted state, correlation metadata, and retry/expiry handling (detailed below).
Resolves #3819.
What this PR changes
1. Event model for revocation workflows
Adds a more granular set of AnonCreds events for revocation registry lifecycle steps:
Revocation registry definition
- `anoncreds::rev-reg-def::create-requested` / `create-response`
- `anoncreds::rev-reg-def::store-requested` / `store-response`

Revocation status list

- `anoncreds::revocation-list::create-requested` / `create-response`
- `anoncreds::revocation-list::store-requested` / `store-response`
- `anoncreds::revocation-list::finished` (kept for backwards compatibility)

Registry activation & full handling

- `anoncreds::revocation-registry::activation-requested` / `activation-response`
- `anoncreds::revocation-registry::full-detected`
- `anoncreds::revocation-registry::full-handling-completed`

Event payloads now:

- carry an `options` dict (including `request_id`, `correlation_id`, `retry_count`, a `recovery` flag, etc.)
- include an `ErrorInfoPayload` with `error_msg`, `should_retry`, and `retry_count` when something fails and is eligible for retry

The existing `CredDefFinishedEvent` is updated to use this new event machinery as the entry point for revocation setup after a credential definition is created.

The event bus is also updated so `EventBus.subscribe` accepts string topics (compiled internally to regex) in addition to pre-compiled patterns, simplifying usage for the new events.

2. Persistent event storage & retry metadata

Introduces an `EventStorageManager` to persist revocation-related events as storage records.

New record types:

- `rev_reg_def_create_event`
- `rev_reg_def_store_event`
- `rev_list_create_event`
- `rev_list_store_event`
- `rev_reg_activation_event`
- `rev_reg_full_handling_event`

New event states:

- `requested`
- `response_success`
- `response_failure`

Each stored event includes:

- `event_type`, serialized `event_data` (payload), `correlation_id`, optional `request_id`
- `state`, `options`, `created_at`, and an `expiry_timestamp`
- `response_data`, `error_msg`, and `retry_metadata` when a response is processed

Key behaviors:
- Creates a record for the request event, using `correlation_id` as the record ID for easy lookup.
- Updates state to `response_success` or `response_failure`, attaching response and retry metadata.
- Re-classifies a failed event for retry: increments retry count, refreshes expiry, sets state back to `requested`.
- Returns events still in progress, optionally only those past their expiry timestamp (ready for recovery).
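To make the record lifecycle concrete, here is a minimal in-memory sketch of the behaviors listed above. This is illustrative only: the class and method names are assumptions, not the actual `EventStorageManager` API, which persists real storage records rather than a dict.

```python
# Illustrative sketch of the stored-event lifecycle described above.
# InMemoryEventStore is a hypothetical stand-in, NOT the ACA-Py API.
import time

REQUESTED = "requested"
RESPONSE_SUCCESS = "response_success"
RESPONSE_FAILURE = "response_failure"


class InMemoryEventStore:
    def __init__(self):
        self._records = {}  # correlation_id -> record dict

    def create_request_record(self, correlation_id, event_type, event_data, expiry_in=30):
        # Record is keyed by correlation_id for easy lookup later.
        now = time.time()
        self._records[correlation_id] = {
            "event_type": event_type,
            "event_data": event_data,
            "state": REQUESTED,
            "retry_count": 0,
            "created_at": now,
            "expiry_timestamp": now + expiry_in,
        }

    def record_response(self, correlation_id, success, error_msg=None):
        rec = self._records[correlation_id]
        rec["state"] = RESPONSE_SUCCESS if success else RESPONSE_FAILURE
        rec["error_msg"] = error_msg

    def reclassify_for_retry(self, correlation_id, backoff_seconds):
        # A failed event goes back to "requested" with a bumped retry
        # count and a refreshed expiry window.
        rec = self._records[correlation_id]
        rec["retry_count"] += 1
        rec["state"] = REQUESTED
        rec["expiry_timestamp"] = time.time() + backoff_seconds

    def expired_in_progress(self, now=None):
        # In-progress events past their expiry are candidates for recovery.
        now = time.time() if now is None else now
        return [
            cid
            for cid, rec in self._records.items()
            if rec["state"] == REQUESTED and rec["expiry_timestamp"] <= now
        ]
```

The point of the sketch is the state flow: `requested` → `response_success`/`response_failure`, with failed events re-entering `requested` on retry, and only expired `requested` records being eligible for recovery.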
3. Exponential backoff & expiry windows
Adds `retry_utils` to centralize retry timing and expiry calculation.

Configurable via environment variables:

- `ANONCREDS_REVOCATION_MIN_RETRY_DURATION_SECONDS` (default: `2`)
- `ANONCREDS_REVOCATION_MAX_RETRY_DURATION_SECONDS` (default: `60`)
- `ANONCREDS_REVOCATION_RETRY_MULTIPLIER` (default: `2.0`) → with defaults: 2, 4, 8, 16… seconds, capped at max
- `ANONCREDS_REVOCATION_RECOVERY_DELAY_SECONDS` (default: `30`)

Functions:
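The backoff schedule implied by these variables can be sketched as follows. The environment variable names are from this PR; the function name is illustrative rather than the actual `retry_utils` API.

```python
# Sketch of the exponential backoff implied by the variables above.
# Defaults shown; the real code reads these from the environment.
MIN_RETRY = 2.0    # ANONCREDS_REVOCATION_MIN_RETRY_DURATION_SECONDS
MAX_RETRY = 60.0   # ANONCREDS_REVOCATION_MAX_RETRY_DURATION_SECONDS
MULTIPLIER = 2.0   # ANONCREDS_REVOCATION_RETRY_MULTIPLIER


def backoff_seconds(retry_count: int) -> float:
    """Delay before retry number `retry_count` (0-based), capped at MAX_RETRY."""
    return min(MIN_RETRY * (MULTIPLIER ** retry_count), MAX_RETRY)
```

With the defaults this yields 2, 4, 8, 16, 32 seconds and then stays capped at 60.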
4. Event-driven revocation setup & full-registry handling
`DefaultRevocationSetup` is updated to orchestrate the whole revocation lifecycle purely via events and the `EventStorageManager`.

On cred def finished, it:

- generates a `request_id` and `correlation_id`
- emits `rev-reg-def::create-requested`

Each subsequent step (def create/store, list create/store, activation):

- Reads/writes event records using the `EventStorageManager`
- On success, emits the next event in the chain
- On failure: if `should_retry` is true, schedules a retry with backoff and updates the stored event

For full registry handling:

A new `handle_full_registry_event(...)` sets the current registry's state to FULL, logs context, and emits events to:

Activation and backup creation are now handled via the same event pipeline, so an agent crash or transient failure in any of these steps can be recovered rather than leaving a cred def with no active registry.
5. Recovery at startup / per-request
Two new components handle recovery:
EventRecoveryManager

- Queries the `EventStorageManager` for in-progress events that have expired (based on their `expiry_timestamp`).
- Re-emits them with `options["recovery"] = True`, reusing the original `correlation_id`.

`revocation_recovery_middleware` (admin server middleware)

- Registered in `admin/server.py` for admin routes.
- For each admin request:
  - Resolves the current profile/tenant.
  - If revocation recovery has not yet run for this profile, calls `EventRecoveryManager.recover_in_progress_events()` for events whose expiry time has passed.
- All errors during recovery are logged but do not block the original HTTP request.
This means that if the agent is terminated mid-operation, the next admin request for that profile will automatically attempt to repair any stuck revocation workflows.
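The run-once-per-profile behavior can be sketched in a framework-agnostic way. The real implementation is an admin-server middleware; the function and parameter names below (`recovery_middleware`, `recover_fn`, `profile_id`) are assumptions for illustration.

```python
# Framework-agnostic sketch of the run-once-per-profile recovery check.
import asyncio
import logging

LOGGER = logging.getLogger(__name__)

# In-memory map of profiles already checked (reset on agent restart,
# which is exactly when a fresh check is wanted).
_recovered_profiles: set[str] = set()


async def recovery_middleware(profile_id, recover_fn, handler):
    """Run the recovery check once per profile, then continue the request."""
    if profile_id not in _recovered_profiles:
        try:
            # Replay expired in-progress events for this profile.
            await recover_fn(profile_id)
            # Note: the real code only marks a profile recovered once no
            # events remain pending; this sketch simplifies that.
            _recovered_profiles.add(profile_id)
        except Exception:
            # Recovery errors are logged but never block the request.
            LOGGER.exception("Revocation recovery failed for %s", profile_id)
    return await handler()
```

Because the set is in-memory, an agent restart naturally causes the next admin request per profile to re-trigger the check, which is what repairs workflows interrupted by the restart.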
6. API / behavior changes
Revocation admin routes are updated to align with the new behavior:
Revocation list creation

- Delegates to `AnonCredsRevocation.create_and_register_revocation_list`.
- On success, the resulting `RevListResult` is serialized as the response; errors map to `400 Bad Request`.

Revocation registry state

- `set_rev_reg_state` now raises `AnonCredsRevocationError` with "not found" messaging if the rev reg cannot be located; the route maps this to `404 Not Found`, rather than silently returning `None`.

Tails file retrieval

- If the revocation registry definition cannot be retrieved, the tails route now returns `404 Not Found`.

OpenAPI / Swagger definitions are updated accordingly to reflect the revised error handling.
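The error mapping above could be summarized in one helper. `AnonCredsRevocationError` is a real ACA-Py exception, but it is redefined here to keep the sketch self-contained, and this helper itself is illustrative, not the actual route code.

```python
# Illustrative sketch of the route-level error mapping described above.
class AnonCredsRevocationError(Exception):
    """Stand-in for ACA-Py's exception, redefined for self-containment."""


def status_for_error(err: Exception) -> int:
    """Map a revocation error to an HTTP status, as the routes now do."""
    if isinstance(err, AnonCredsRevocationError) and "not found" in str(err).lower():
        # e.g. set_rev_reg_state when the rev reg cannot be located
        return 404
    # Other revocation errors surface as Bad Request.
    return 400
```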
7. Testing
New and updated tests cover:

- Event storage lifecycle (create / update / retry / delete / query)
- Event recovery logic and the mapping from stored records back to events
- Revocation recovery middleware behavior, including:
- Revocation setup, full-registry handling, and route behavior changes
Backwards compatibility
Events such as `anoncreds::revocation-list::finished` remain in place for compatibility.