✨ Recoverable, event-driven revocation registry management #3831
Conversation
This is a big one ... It'll probably be necessary to contribute some documentation for what's going on here.
Tagging reviewers to please check out the idea over here. It's fairly complex, and it'll help to organize the code a little better. But before polishing, it'll be good to get some feedback on the ideas here and how they're implemented. All questions / comments are welcome. Note: we're currently using this in production in acapy-cloud (or rather a separately maintained branch of this work; just need to rebase and confirm it's all the same).
@ff137 I just gave this a first look; there's a lot going on here. A couple of questions: Where did the need for this come from? Have there been many cases where a cred def ended up in a bad state because of this issue, or is this a preventive feature? How could one reproduce a condition where this would kick in with a running agent? Would it be possible to get a document for this feature, documenting the different states? The existing Aries RFC documents come to mind, since some of them also deal with multiple states. Some sort of diagram would be useful for me. Does endorsement affect this at all? Has it been tested with AnonCreds registries other than Indy?
It's not clear what an issuer needs to do in the case where it needs to take action because of a failure. Otherwise the code looks solid and well organized; will give it a deeper look!
esune left a comment:
I like the direction and structure; being unable to recover from a failed revocation step was pretty frustrating, so this is definitely an improvement.
I wonder if the "accumulator fix" that was wired up a while back (triggered both automatically and manually by calling [/revocation/registry/{rev_reg_id}/fix-revocation-entry-state](https://traction-tenant-proxy-dev.apps.silver.devops.gov.bc.ca/api/doc#/AnonCreds%20-%20Revocation/put_anoncreds_revocation_registry__rev_reg_id__fix_revocation_entry_state)) is something we can now remove, or whether we should at least review where the self-firing event is triggered to avoid conflicts (I may very well have missed something in the changeset, but thought it worth flagging).
"ACAPY_PRESERVE_EXCHANGE_RECORDS": "preserve_exchange_records",
"ACAPY_PUBLIC_INVITES": "public_invites",
"ACAPY_REQUESTS_THROUGH_PUBLIC_DID": "requests_through_public_did",
"ACAPY_ENABLE_AUTO_REVOCATION_RECOVERY": "anoncreds.revocation.auto_recovery_enabled",
I may have missed it, but there doesn't seem to be a (new) startup argument that would allow this setting to be turned on or off. I thought it would be the default and only behaviour?
This block (and the following one, with related mappings) was meant to expose tenant-specific settings.
The "easiest" way I can think of is by making the tails server unavailable when the file needs to be uploaded. Recovering from that would require manually setting the related registry to decommissioned and doing it over again.
Agree that it would be nice to have a "documented procedure"; even now, recovering from a failed event is a bit of trial and error.
Hey guys -- I've been less active here, busy on some other work. But I still want to make sure this gets over the line, as I think it's quite an important feature for reliability. So here's some feedback on the comments:
Basically, anything that can go wrong, will go wrong! And so the idea is to streamline automated revocation registry management as much as possible, to reduce manual intervention. We did find that revocable cred defs can silently become unusable when anything goes wrong with the background creation processes. An issuer could try to send credentials, and there could be no active registry available, which would raise an error. Importantly: the issuer would not know that something had gone wrong (e.g. a new backup registry wasn't created) until they try to send credentials. This is assuming a multitenant scenario where they don't have access to logs. Ideally, an issuer should not have to understand all the complexities of rev-registry management, and the steps to manually resolve these scenarios. So that's the motivation for making the software handle as much of that as possible.
It depends. We use Cheqd, which accepts a local registrar URL. So all we had to do was create a revocable cred def with a small registry size, issue a bunch, and then disable the local registrar service, in which case the background operations would reliably fail. This can probably also be done by making the tails server unavailable. Another reliable way to reproduce it is to abruptly terminate the agent while it's in the middle of the background job. That's how we've validated that the solutions in this PR work: we have an automated test script in acapy-cloud that monitors acapy logs and terminates the agent when a specific log is found. That way we could induce failures that surface when the agent restarts and an issuer tries to send creds. We also used that script to prove that this new recovery logic works as intended.
100%. I've just got to sit down and write the documentation. 😄 Soon ™️
So as mentioned, it does work with Cheqd. And we've not tested it directly with Indy (apart from the automated GHA tests), since we don't use it anymore. But I understand the operations to be ledger-agnostic, and the BDD tests do cover an endorsement scenario. So that flow works as usual, and anything that kicks off the automated creation of another registry will still function, with the added auto-recoverability.
Indeed - that's where the documentation will have to borrow from the existing steps (e.g. decommissioning or manually setting a registry to active). The hope is that manual intervention should never be necessary, but it'll be important to explain what that entails, if it needs to happen.
Yes ... the motivation there is related to the new middleware, which does the recovery check. The flag basically allows for a micro-optimisation, so that we know which tenants should be checked for recovery, and which shouldn't. We configure the wallet setting for issuer tenants, so the middleware check only runs for them. When the agent starts up, we want to check if there are any in-progress operations that were interrupted. The middleware does that once per tenant, and keeps an in-memory map of whether a tenant has been checked yet (I believe it's similar to the anoncreds-wallet upgrade middleware). Every tenant only needs to be checked once to see if anything is pending and should be picked up again, but technically, non-issuer tenants never need to be checked. For context: in acapy-cloud we have the concept of issuer tenants, with a trust registry, and 'issue cred' operations are only possible for issuers. When we create an issuer tenant, we set this auto-recovery flag to true (in the wallet settings), so the middleware only intercepts their first call, and maintains an "already been checked" map for these tenants. In a different context where anyone can be an issuer, it should be enabled for everyone. We felt this optimisation makes sense because there may be tens of thousands of holders for every one issuer, and there's no point checking whether holders have pending rev-reg events. It could be enabled by default, because it's just a micro-optimisation, and probably only matters if there are millions of holders (even then, I believe the in-memory map doesn't use much memory). But that was the thinking behind not enabling it by default for all tenants.
That may still be useful. We haven't had to use it in a while, as far as I know, but there's probably no harm in keeping it. There we go! Sorry for the delay, and the long response. I've been moving on from this feature, since it's already used where we need it, so I figured adding the context is good. Contributing this upstream is still a priority for us, so that our fork doesn't diverge too much. And it's probably a useful feature for the community, so long as it gets documented, and if the implementation all looks good and makes sense. In the next week or so I'll try to add an .md file to help explain the new features here.
Commit history:
- Implement retry mechanism for failed events; add auto-recovery functionality; improve event traceability and logging; handle resource conflicts gracefully; add configurable retry settings
- Support passing uncompiled string to EventBus.subscribe
- Fix required field
- Update handling of full registries
- Failure payloads and retry_count handling
- Recovery manager and middleware wiring
- Store full handling events too
- New correlation_id for each successful request
- Clear the set_active_registry_event option of correlation id
- Fix correlation of full handling event
- Test async event bus
- Rename "get or create" method; it just gets
- Shield methods that create resource on ledger
- Fix default value for settings
- Remove rev_reg_def_private from payloads; write private def to temp key
- Handle resource already exists for RevList
- Pop correlation_id from set active registry options
- Add retry sleep time
- First upload tails file; skip TailsUpload request / response
- Call notify issuer placeholder
- Fix recovery of private rev reg def
- Remove unnecessary event and record type
- Only recover events older than some delay
- Add descriptions to events
- Improve state machine log traceability
- Only mark profile recovered with no pending event
- Emit event if manual intervention required
- Make retry count and sleep time configurable
- Add request id, for traceability
- Retry all failed events if retryable, with expiry timestamps
- Always call handlers outside of try blocks
- Remove legacy in-between event: RevRegDefFinishedEvent
- Refactor event storage type: keep events in requested state when being retried; remove completed state (success or failure is enough)
- Fix updating event for retry
- Make auto-recovery check disabled by default
- Add config map for auto recovery: `ACAPY_ENABLE_AUTO_REVOCATION_RECOVERY` maps to `anoncreds.revocation.auto_recovery_enabled`, so it can be passed in wallet extra settings
Is the …
I see it also failed on the last main branch run, so it's unrelated to these changes.
I used some AI to help summarise the diff of this PR, and updated the PR description with that. There's also some first-draft documentation that was added: 806ea38 Some more detail remains to be added to those docs, especially with regard to how it's configured (wallet / agent setting). I'll need some feedback on the desired approach there. I think the default behavior should be that it's enabled for all tenants, with perhaps a flag that can be passed to restrict it to only tenants that have the wallet setting. I think that's a good way to do it, but let me know what you guys think. Also, please take a look at the new doc file, and let me know if there's anything specific to improve.
@ff137 -- thanks. @openwallet-foundation/acapy-committers -- let's get this through as we look at the Kanon test issues.
This makes sense, thank you for the explanation. As it is right now though (I may have missed something), the self-recovery middleware is a "secret" feature that needs to be enabled, and to do so you would need to know which setting to set. Additionally, for single-instance scenarios this extra step is a bit unusual, so I wonder if it would be best to choose a "reasonable default" of …
Sorry for the delay.
That's true. By default this feature is not yet enabled. And I guess that may be fine, since auto-recovery is "experimental" depending on the ledger/plugin being used. Only we at DIDx have used it so far, with cheqd. And we may have added manual tweaks to the cheqd plugin to make it work (in terms of error handling, and changing error messages to identify retryable vs non-retryable errors); I'm not so sure -- there's been so much code, and we're maintaining divergent forks. However, I'm confident the logic is sound and should work in general. It might make sense to merge it as is, and then other devs can experiment with it (we have the test script in acapy-cloud to reproduce recovery scenarios). Then there's time to decide on the best configuration before adding it to a release. Personally, I can't decide on my own what the config should be: adding yet another startup arg, or just baking it in and making it the default. Anything is fine, really; the feature should only come up when something has failed anyway, and the auto-recovery is just a bonus if possible. If it doesn't work, nothing changes. So the best way to configure it should be whatever works best for others, which probably just requires some playing around with it to understand it better. For now it's indeed true that the auto-recovery is enabled by a wallet setting. What remains is maybe to document that better, but that's also a change that can be committed after it's merged. My primary focus is shifting away from acapy-cloud, so from my end it's done, but I will address change requests where needed.
@ff137 fair points. Based on your reply I think we may be fine with merging as-is, doing some additional testing, and deciding whether to bake this in fully as a replacement for the current stream (which so far would be my preference, having fought with broken revocation data in the past). If other maintainers agree (@jamshale @PatStLouis @swcurran) we can merge and move on.
I'm just going to go over this again... I think it's great. I'll merge it after. |
Yes, it's expected to fail intermittently right now. We can ignore the failure.
jamshale left a comment:
I'm a bit unsure whether the existing accumulator recovery is still necessary despite this -- is it still possible to get the accumulator out of sync with the revocation registry values? It happens automatically now as well, so I don't see any conflict with this.
This is a big improvement. Thanks for contributing it to the root project. I'd prefer it just to be enabled by default in the future.
I think the accumulator value can still go out of sync, because as far as I can recall, the "revoke credential" flow changes local records to status revoked right before publishing to the ledger, without the ledger necessarily reflecting that. If that publish fails, the cred-ex / rev-reg records on the issuer's side will still indicate that it's revoked, but it's not. I don't know the technicals, but presumably that relates to the accumulator value. That's something we've looked into fixing as well: reverting / cleaning up local records if something goes wrong. It may also be possible to adopt a similar auto-recovery pattern for the revoke-credential logic; here it's just for registry creation (initial and backups). Publish-revocation could also be turned into an event that gets replayed if something goes wrong (if it's a retryable error). But that's left as open, future work, because we felt this PR was big enough and didn't want to complicate it by changing revocation-publish too. It may be a worthwhile idea to take note of, though.
From the previous [comment]:
The accumulator is a function of the status of all of the credentials. So if the status of all the credentials is the same, the same accumulator is calculated.
That's a questionable general approach, as often the local records reflect the status that has come from the Application/Controller, which is often keeping its own state. So you are really looking at three things that need to be in sync: the Application/Controller that has the business logic, ACA-Py (the agent) that is told the state by the Application/Controller, and the ledger. I think the only solution to the failed ledger update is to fix the ledger interface and get an update transaction completed. I would note that the "fix rev reg" code that was implemented a few years ago does that: it checks the published state of the RevReg locally vs. the ledger and retries the transaction until it works.
This PR refactors AnonCreds revocation registry management into an event-driven, recoverable workflow. The goal is to make revocation registry lifecycle operations resilient to partial failures (ledger, DB, tails server) and agent restarts, so that a credential definition does not silently become unusable when something goes wrong.
Background / motivation
Today, revocation registry setup and maintenance relies on a sequence of automatic handlers. If the agent crashes or any step fails midway (e.g., registry created on the ledger but not stored locally; registry marked full but backup not activated), the issuer may only discover the problem later, when issuance starts failing.
To address this, each step in the revocation workflow is now modeled as an event with persisted state, correlation metadata, and retry/expiry handling (detailed below).
Resolves #3819.
What this PR changes
1. Event model for revocation workflows
Adds a more granular set of AnonCreds events for revocation registry lifecycle steps:
Revocation registry definition
- `anoncreds::rev-reg-def::create-requested` / `create-response`
- `anoncreds::rev-reg-def::store-requested` / `store-response`

Revocation status list

- `anoncreds::revocation-list::create-requested` / `create-response`
- `anoncreds::revocation-list::store-requested` / `store-response`
- `anoncreds::revocation-list::finished` (kept for backwards compatibility)

Registry activation & full handling

- `anoncreds::revocation-registry::activation-requested` / `activation-response`
- `anoncreds::revocation-registry::full-detected`
- `anoncreds::revocation-registry::full-handling-completed`

Event payloads now:

- carry an `options` dict (including `request_id`, `correlation_id`, `retry_count`, a `recovery` flag, etc.)
- include an `ErrorInfoPayload` with `error_msg`, `should_retry`, and `retry_count` when something fails and is eligible for retry

The existing `CredDefFinishedEvent` is updated to use this new event machinery as the entry point for revocation setup after a credential definition is created.

The event bus is also updated so `EventBus.subscribe` accepts string topics (compiled internally to regex) in addition to pre-compiled patterns, simplifying usage for the new events.

2. Persistent event storage & retry metadata

Introduces an `EventStorageManager` to persist revocation-related events as storage records.

New record types:

- `rev_reg_def_create_event`
- `rev_reg_def_store_event`
- `rev_list_create_event`
- `rev_list_store_event`
- `rev_reg_activation_event`
- `rev_reg_full_handling_event`

New event states:

- `requested`
- `response_success`
- `response_failure`

Each stored event includes:

- `event_type`, serialized `event_data` (payload), `correlation_id`, optional `request_id`
- `state`, `options`, `created_at`, and an `expiry_timestamp`
- `response_data`, `error_msg`, and `retry_metadata` when a response is processed

Key behaviors:
- Creates a record for the request event, using `correlation_id` as the record ID for easy lookup.
- Updates state to `response_success` or `response_failure`, attaching response and retry metadata.
- Re-classifies a failed event for retry: increments retry count, refreshes expiry, sets state back to `requested`.
- Returns events still in progress, optionally only those past their expiry timestamp (ready for recovery).
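To make the record lifecycle concrete, here is a minimal in-memory sketch of the behaviors listed above. This is illustrative only: the class and method names are assumptions, not the actual `EventStorageManager` API, which persists real storage records rather than a dict.

```python
# Illustrative sketch of the stored-event lifecycle described above.
# InMemoryEventStore is a hypothetical stand-in, NOT the ACA-Py API.
import time

REQUESTED = "requested"
RESPONSE_SUCCESS = "response_success"
RESPONSE_FAILURE = "response_failure"


class InMemoryEventStore:
    def __init__(self):
        self._records = {}  # correlation_id -> record dict

    def create_request_record(self, correlation_id, event_type, event_data, expiry_in=30):
        # Record is keyed by correlation_id for easy lookup later.
        now = time.time()
        self._records[correlation_id] = {
            "event_type": event_type,
            "event_data": event_data,
            "state": REQUESTED,
            "retry_count": 0,
            "created_at": now,
            "expiry_timestamp": now + expiry_in,
        }

    def record_response(self, correlation_id, success, error_msg=None):
        rec = self._records[correlation_id]
        rec["state"] = RESPONSE_SUCCESS if success else RESPONSE_FAILURE
        rec["error_msg"] = error_msg

    def reclassify_for_retry(self, correlation_id, backoff_seconds):
        # A failed event goes back to "requested" with a bumped retry
        # count and a refreshed expiry window.
        rec = self._records[correlation_id]
        rec["retry_count"] += 1
        rec["state"] = REQUESTED
        rec["expiry_timestamp"] = time.time() + backoff_seconds

    def expired_in_progress(self, now=None):
        # In-progress events past their expiry are candidates for recovery.
        now = time.time() if now is None else now
        return [
            cid
            for cid, rec in self._records.items()
            if rec["state"] == REQUESTED and rec["expiry_timestamp"] <= now
        ]
```

The point of the sketch is the state flow: `requested` → `response_success`/`response_failure`, with failed events re-entering `requested` on retry, and only expired `requested` records being eligible for recovery.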
3. Exponential backoff & expiry windows
Adds `retry_utils` to centralize retry timing and expiry calculation.

Configurable via environment variables:

- `ANONCREDS_REVOCATION_MIN_RETRY_DURATION_SECONDS` (default: `2`)
- `ANONCREDS_REVOCATION_MAX_RETRY_DURATION_SECONDS` (default: `60`)
- `ANONCREDS_REVOCATION_RETRY_MULTIPLIER` (default: `2.0`) → with defaults: 2, 4, 8, 16… seconds, capped at max
- `ANONCREDS_REVOCATION_RECOVERY_DELAY_SECONDS` (default: `30`)

Functions:
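The backoff schedule implied by these variables can be sketched as follows. The environment variable names are from this PR; the function name is illustrative rather than the actual `retry_utils` API.

```python
# Sketch of the exponential backoff implied by the variables above.
# Defaults shown; the real code reads these from the environment.
MIN_RETRY = 2.0    # ANONCREDS_REVOCATION_MIN_RETRY_DURATION_SECONDS
MAX_RETRY = 60.0   # ANONCREDS_REVOCATION_MAX_RETRY_DURATION_SECONDS
MULTIPLIER = 2.0   # ANONCREDS_REVOCATION_RETRY_MULTIPLIER


def backoff_seconds(retry_count: int) -> float:
    """Delay before retry number `retry_count` (0-based), capped at MAX_RETRY."""
    return min(MIN_RETRY * (MULTIPLIER ** retry_count), MAX_RETRY)
```

With the defaults this yields 2, 4, 8, 16, 32 seconds and then stays capped at 60.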
4. Event-driven revocation setup & full-registry handling
`DefaultRevocationSetup` is updated to orchestrate the whole revocation lifecycle purely via events and the `EventStorageManager`.

On cred def finished, it:

- generates a `request_id` and `correlation_id`
- emits `rev-reg-def::create-requested`

Each subsequent step (def create/store, list create/store, activation):

- Reads/writes event records using the `EventStorageManager`
- On success, emits the next event in the chain
- On failure: if `should_retry` is true, schedules a retry with backoff and updates the stored event

For full registry handling:

A new `handle_full_registry_event(...)` sets the current registry's state to FULL, logs context, and emits events to:

Activation and backup creation are now handled via the same event pipeline, so an agent crash or transient failure in any of these steps can be recovered rather than leaving a cred def with no active registry.
5. Recovery at startup / per-request
Two new components handle recovery:
EventRecoveryManager

- Queries the `EventStorageManager` for in-progress events that have expired (based on their `expiry_timestamp`).
- Re-emits them with `options["recovery"] = True`, reusing the original `correlation_id`.

`revocation_recovery_middleware` (admin server middleware)

- Registered in `admin/server.py` for admin routes.
- For each admin request:
  - Resolves the current profile/tenant.
  - If revocation recovery has not yet run for this profile, calls `EventRecoveryManager.recover_in_progress_events()` for events whose expiry time has passed.
- All errors during recovery are logged but do not block the original HTTP request.
This means that if the agent is terminated mid-operation, the next admin request for that profile will automatically attempt to repair any stuck revocation workflows.
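The run-once-per-profile behavior can be sketched in a framework-agnostic way. The real implementation is an admin-server middleware; the function and parameter names below (`recovery_middleware`, `recover_fn`, `profile_id`) are assumptions for illustration.

```python
# Framework-agnostic sketch of the run-once-per-profile recovery check.
import asyncio
import logging

LOGGER = logging.getLogger(__name__)

# In-memory map of profiles already checked (reset on agent restart,
# which is exactly when a fresh check is wanted).
_recovered_profiles: set[str] = set()


async def recovery_middleware(profile_id, recover_fn, handler):
    """Run the recovery check once per profile, then continue the request."""
    if profile_id not in _recovered_profiles:
        try:
            # Replay expired in-progress events for this profile.
            await recover_fn(profile_id)
            # Note: the real code only marks a profile recovered once no
            # events remain pending; this sketch simplifies that.
            _recovered_profiles.add(profile_id)
        except Exception:
            # Recovery errors are logged but never block the request.
            LOGGER.exception("Revocation recovery failed for %s", profile_id)
    return await handler()
```

Because the set is in-memory, an agent restart naturally causes the next admin request per profile to re-trigger the check, which is what repairs workflows interrupted by the restart.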
6. API / behavior changes
Revocation admin routes are updated to align with the new behavior:
Revocation list creation

- Delegates to `AnonCredsRevocation.create_and_register_revocation_list`.
- On success, the resulting `RevListResult` is serialized as the response; errors map to `400 Bad Request`.

Revocation registry state

- `set_rev_reg_state` now raises `AnonCredsRevocationError` with "not found" messaging if the rev reg cannot be located; the route maps this to `404 Not Found`, rather than silently returning `None`.

Tails file retrieval

- If the revocation registry definition cannot be retrieved, the tails route now returns `404 Not Found`.

OpenAPI / Swagger definitions are updated accordingly to reflect the revised error handling.
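The error mapping above could be summarized in one helper. `AnonCredsRevocationError` is a real ACA-Py exception, but it is redefined here to keep the sketch self-contained, and this helper itself is illustrative, not the actual route code.

```python
# Illustrative sketch of the route-level error mapping described above.
class AnonCredsRevocationError(Exception):
    """Stand-in for ACA-Py's exception, redefined for self-containment."""


def status_for_error(err: Exception) -> int:
    """Map a revocation error to an HTTP status, as the routes now do."""
    if isinstance(err, AnonCredsRevocationError) and "not found" in str(err).lower():
        # e.g. set_rev_reg_state when the rev reg cannot be located
        return 404
    # Other revocation errors surface as Bad Request.
    return 400
```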
7. Testing
New and updated tests cover:

- Event storage lifecycle (create / update / retry / delete / query)
- Event recovery logic and the mapping from stored records back to events
- Revocation recovery middleware behavior, including:
- Revocation setup, full-registry handling, and route behavior changes
Backwards compatibility
Events such as `anoncreds::revocation-list::finished` remain in place for compatibility.