Cache Invalidation in Cachew
This is a one-pager for cache invalidation in cachew.
We need a way to reliably remove objects from the cache, across replicas and tiers. The requirement is that a single API call deletes an object from all caches; deletion need not be immediate, but it must be guaranteed.
Architecture
At the top level are the tier one replicas. Aside from their shared S3 backing store, these are stateless and do not coordinate with each other at all.
S3 <- Disk <- Tier1 Replica1 |
S3 <- Disk <- Tier1 Replica2 +-- Tier1 Load Balancer
S3 <- Disk <- Tier1 Replica3 |
Tier one replicas run as a StatefulSet in Kubernetes. Each pod gets a stable ordinal index (0, 1, 2, ...) derived from its hostname.
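For illustration, a minimal sketch of deriving that index at startup (the pod name format cachew-tier1-&lt;ordinal&gt; is an assumption; any StatefulSet name works the same way):

```python
import socket

def node_id_from_hostname() -> int:
    """Derive the stable node ID from the StatefulSet pod hostname.

    StatefulSet pods are named <statefulset-name>-<ordinal>,
    e.g. "cachew-tier1-2" -> 2 (the name "cachew-tier1" is assumed).
    """
    hostname = socket.gethostname()          # e.g. "cachew-tier1-2"
    return int(hostname.rsplit("-", 1)[-1])  # the text after the last dash
```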
Tier two caches run on each worker node, backed by disk then the upstream tier one load balancer endpoint:
Worker1
Tier1 LB <- Disk <- Tier2 Service <- Apps
Worker2
Tier1 LB <- Disk <- Tier2 Service <- Apps
A delete request can arrive at any node, tier one or tier two.
Problems
The first problem is that the tier one replicas don't communicate. A delete request arriving at one replica must be propagated to the others.
The second problem is that deletes arriving at tier one must also propagate to all tier two nodes.
Solution
Deletion Log
We use S3 as a durable, append-only log of deletion requests. The log is partitioned by day for efficient listing:
deletes/2025-01-15/1252800000000-0.json
deletes/2025-01-15/1252800000123-1.json
deletes/2025-01-16/1296000012345-2.json
Each filename is <microseconds-since-2025>-<node-id>. This ensures uniqueness: a node can only collide with itself, and self-collision at microsecond resolution is effectively impossible.
Each file contains the key to delete:
{"key": "5d0c39b872a86528c12ddd96410dbb8b06c2379dc11a3b44a25b8472bec11e77"}Writing a Deletion
When a delete request arrives:
- Generate the filename as `<microseconds-since-2025>-<node-id>`
- Write to S3 with the `If-None-Match: *` header
- On 412 (collision with self, extremely rare), increment the microsecond and retry
- Return success to the caller once the S3 write succeeds
The delete is now guaranteed to be processed by all replicas.
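A minimal sketch of this write path, assuming boto3 with S3 conditional writes (the IfNoneMatch parameter on PutObject); the bucket name is hypothetical:

```python
import json
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "cachew-deletion-log"               # hypothetical bucket name
EPOCH_2025_US = 1_735_689_600 * 1_000_000    # 2025-01-01T00:00:00 UTC, in microseconds

def log_deletion(cache_key: str, node_id: int) -> str:
    """Durably log a deletion; returns the S3 key that was written."""
    ts_us = time.time_ns() // 1_000 - EPOCH_2025_US
    while True:
        day = time.strftime("%Y-%m-%d", time.gmtime((ts_us + EPOCH_2025_US) // 1_000_000))
        s3_key = f"deletes/{day}/{ts_us}-{node_id}.json"
        try:
            s3.put_object(
                Bucket=BUCKET,
                Key=s3_key,
                Body=json.dumps({"key": cache_key}),
                IfNoneMatch="*",             # refuse to overwrite an existing entry
            )
            return s3_key                    # durably logged; safe to ack the caller
        except ClientError as err:
            if err.response["Error"]["Code"] == "PreconditionFailed":  # HTTP 412
                ts_us += 1                   # self-collision: bump one microsecond, retry
                continue
            raise
```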
Processing Deletions and Tombstone Lookups
Each replica maintains a single roaring bitmap that serves as both the processing cursor and the tombstone set. On startup, the replica scans the entire deletion log and populates the bitmap.
The bitmap maps each deletion entry to a 64-bit integer:
bitmap_key = (microseconds_since_2025 * max_nodes) + node_id
Where max_nodes is a fixed upper bound on node IDs (e.g., 256). Roaring bitmaps handle sparse 64-bit keyspaces efficiently, at roughly 8 bytes per deletion.
| Deletes | Memory |
|---|---|
| 100,000 | ~800 KB |
| 1,000,000 | ~8 MB |
| 10,000,000 | ~80 MB |
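A sketch of the key encoding and the membership check, assuming pyroaring's 64-bit BitMap64 (any roaring bitmap library with 64-bit keys would do); parse_entry_name and the example filename are illustrative:

```python
from pyroaring import BitMap64  # 64-bit roaring bitmap (assumed dependency)

MAX_NODES = 256  # the fixed max_nodes constant from the formula above

tombstones = BitMap64()

def bitmap_key(ts_us: int, node_id: int) -> int:
    """Map a deletion entry to its 64-bit bitmap key."""
    return ts_us * MAX_NODES + node_id

def parse_entry_name(s3_key: str) -> tuple[int, int]:
    """Parse 'deletes/<day>/<microseconds-since-2025>-<node-id>.json'."""
    stem = s3_key.rsplit("/", 1)[-1].removesuffix(".json")
    ts_us, node_id = stem.rsplit("-", 1)
    return int(ts_us), int(node_id)

# Example: mark one log entry as processed, skipping it if already seen.
ts_us, node_id = parse_entry_name("deletes/2025-01-15/1252800000000-0.json")
key = bitmap_key(ts_us, node_id)
if key not in tombstones:
    tombstones.add(key)
```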
Periodically, each replica:
- Scans the current day's partition and the previous day's partition (to handle clock skew)
- Parses each filename to extract the timestamp and node ID
- Computes the bitmap key and skips if already set
- Deletes the corresponding cache key from local disk
- Sets the bit in the bitmap and persists it
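A hedged sketch of one pass of that loop, reusing bitmap_key and parse_entry_name from above; delete_from_local_disk and persist_bitmap stand in for the replica's own disk and snapshot routines:

```python
import datetime as dt
import json

def process_deletions(s3, bucket: str, tombstones) -> None:
    """One pass over today's and yesterday's partitions, applying unseen deletions."""
    today = dt.datetime.now(dt.timezone.utc).date()
    for day in (today - dt.timedelta(days=1), today):
        prefix = f"deletes/{day.isoformat()}/"
        pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
        for page in pages:
            for obj in page.get("Contents", []):
                ts_us, node_id = parse_entry_name(obj["Key"])
                key = bitmap_key(ts_us, node_id)
                if key in tombstones:
                    continue                                     # already processed
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
                delete_from_local_disk(json.loads(body)["key"])  # hypothetical local purge
                tombstones.add(key)                              # set the bit only after the purge succeeds
                persist_bitmap(tombstones)                       # hypothetical durable snapshot
```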
On every cache read, the replica checks the bitmap. If the key's bitmap index is set, the replica returns a cache miss. This prevents serving deleted content even if the local disk cache hasn't been purged yet.
Deletions are idempotent, so reprocessing an entry is harmless.
Tier Two Propagation
Tier two replicas maintain their own bitmap on local disk. They connect to an SSE endpoint on the tier one load balancer, sending their last-seen event ID. Tier one streams back deletion entries from that point forward.
On startup, the tier two node loads its bitmap from disk, then connects to tier one and catches up. The bitmap serves the same dual purpose: tracking processed deletions and providing tombstone lookups.
If a tier two node restarts, it resumes from its last persisted bitmap state.
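A minimal sketch of the tier two subscriber, assuming the tier one load balancer exposes an SSE endpoint at /deletions and honors the standard Last-Event-ID header (the URL, event format, and requests dependency are assumptions):

```python
import requests

TIER1_SSE_URL = "http://tier1-lb/deletions"       # hypothetical endpoint

def follow_deletions(last_event_id: str, on_delete) -> None:
    """Stream deletion events from tier one, resuming after last_event_id."""
    headers = {"Accept": "text/event-stream"}
    if last_event_id:
        headers["Last-Event-ID"] = last_event_id  # standard SSE resume header
    resp = requests.get(TIER1_SSE_URL, headers=headers, stream=True)
    resp.raise_for_status()
    event_id, data = None, None
    for line in resp.iter_lines(decode_unicode=True):
        if not line:                              # blank line ends one event
            if data is not None:
                on_delete(event_id, data)         # purge disk, set bit, persist
            event_id, data = None, None
            continue
        field, _, value = line.partition(":")
        value = value.lstrip(" ")
        if field == "id":
            event_id = value
        elif field == "data":
            data = value if data is None else data + "\n" + value
```

The on_delete callback would purge the local disk entry, set the bit in the node's bitmap, and persist it, mirroring the tier one processing loop.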
Failure Modes
| Failure | Outcome |
|---|---|
| S3 write fails | API returns error, caller retries |
| Replica crashes mid-processing | Restarts, scans log, rebuilds bitmap |
| Replica offline for extended period | Catches up by scanning all partitions |
| Network partition | Resumes processing when connectivity restored |
Guarantees
- Every deletion is durably logged before the API returns success
- Every replica eventually processes every deletion
- Bitmap is updated only after successful local deletion
- Log entries are never garbage collected
- Tombstone lookups prevent serving deleted content even before local purge