Cache Invalidation in Cachew
This is a one-pager for cache invalidation in cachew.
We need a way to reliably remove objects from the cache, across replicas and tiers. The requirement is that a single API call deletes an object from all caches; deletion need not be immediate, but it must be guaranteed.
Architecture
At the top level are the tier one replicas. Aside from their shared S3 backing store, these are stateless and do not coordinate with each other at all.
S3 <- Disk <- Tier1 Replica1 |
S3 <- Disk <- Tier1 Replica2 +-- Tier1 Load Balancer
S3 <- Disk <- Tier1 Replica3 |
Tier one replicas run as a StatefulSet in Kubernetes. Each pod gets a stable ordinal index (0, 1, 2, ...) derived from its hostname.
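For illustration, a minimal sketch of deriving that index at startup (the pod name format cachew-tier1-&lt;ordinal&gt; is an assumption; any StatefulSet name works the same way):

```python
import socket

def node_id_from_hostname() -> int:
    """Derive the stable node ID from the StatefulSet pod hostname.

    StatefulSet pods are named <statefulset-name>-<ordinal>,
    e.g. "cachew-tier1-2" -> 2 (the name "cachew-tier1" is assumed).
    """
    hostname = socket.gethostname()          # e.g. "cachew-tier1-2"
    return int(hostname.rsplit("-", 1)[-1])  # the text after the last dash
```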
Tier two caches run on each worker node, backed by disk then the upstream tier one load balancer endpoint:
Worker1
Tier1 LB <- Disk <- Tier2 Service <- Apps
Worker2
Tier1 LB <- Disk <- Tier2 Service <- Apps
A delete request can arrive at any node, tier one or tier two.
Problems
The first problem is that the tier one replicas don't communicate. A delete request arriving at one replica must be propagated to the others.
The second problem is that deletes arriving at tier one must also propagate to all tier two nodes.
Solution
Deletion Log
We use S3 as a durable, append-only log of deletion requests. The log is partitioned by day for efficient listing:
deletes/2025-01-15/1252800000000-0.json
deletes/2025-01-15/1252800000123-1.json
deletes/2025-01-16/1296000012345-2.json
Each filename is <microseconds-since-2025>-<node-id>. This ensures uniqueness: a node can only collide with itself, and self-collision at microsecond resolution is effectively impossible.
Each file contains the key to delete:
{"key": "5d0c39b872a86528c12ddd96410dbb8b06c2379dc11a3b44a25b8472bec11e77"}Writing a Deletion
When a delete request arrives:
- Generate the filename as `<microseconds-since-2025>-<node-id>`
- Write to S3 with the `If-None-Match: *` header
- On 412 (collision with self, extremely rare), increment the microsecond and retry
- Return success to the caller once the S3 write succeeds
The delete is now guaranteed to be processed by all replicas.
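A minimal sketch of this write path, assuming boto3 with S3 conditional writes (the IfNoneMatch parameter on PutObject); the bucket name is hypothetical:

```python
import json
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "cachew-deletion-log"               # hypothetical bucket name
EPOCH_2025_US = 1_735_689_600 * 1_000_000    # 2025-01-01T00:00:00 UTC, in microseconds

def log_deletion(cache_key: str, node_id: int) -> str:
    """Durably log a deletion; returns the S3 key that was written."""
    ts_us = time.time_ns() // 1_000 - EPOCH_2025_US
    while True:
        day = time.strftime("%Y-%m-%d", time.gmtime((ts_us + EPOCH_2025_US) // 1_000_000))
        s3_key = f"deletes/{day}/{ts_us}-{node_id}.json"
        try:
            s3.put_object(
                Bucket=BUCKET,
                Key=s3_key,
                Body=json.dumps({"key": cache_key}),
                IfNoneMatch="*",             # refuse to overwrite an existing entry
            )
            return s3_key                    # durably logged; safe to ack the caller
        except ClientError as err:
            if err.response["Error"]["Code"] == "PreconditionFailed":  # HTTP 412
                ts_us += 1                   # self-collision: bump one microsecond, retry
                continue
            raise
```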
Processing Deletions and Tombstone Lookups
Each replica maintains a single roaring bitmap that serves as both the processing cursor and the tombstone set. On startup, the replica scans the entire deletion log and populates the bitmap.
The bitmap maps each deletion entry to a 64-bit integer:
bitmap_key = (microseconds_since_2025 * max_nodes) + node_id
Where max_nodes is a fixed upper bound on node IDs (e.g., 256). Roaring bitmaps handle sparse 64-bit keyspaces efficiently, at roughly 8 bytes per deletion.
| Deletes | Memory |
|---|---|
| 100,000 | ~800 KB |
| 1,000,000 | ~8 MB |
| 10,000,000 | ~80 MB |
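A sketch of the key encoding and the membership check, assuming pyroaring's 64-bit BitMap64 (any roaring bitmap library with 64-bit keys would do); parse_entry_name and the example filename are illustrative:

```python
from pyroaring import BitMap64  # 64-bit roaring bitmap (assumed dependency)

MAX_NODES = 256  # the fixed max_nodes constant from the formula above

tombstones = BitMap64()

def bitmap_key(ts_us: int, node_id: int) -> int:
    """Map a deletion entry to its 64-bit bitmap key."""
    return ts_us * MAX_NODES + node_id

def parse_entry_name(s3_key: str) -> tuple[int, int]:
    """Parse 'deletes/<day>/<microseconds-since-2025>-<node-id>.json'."""
    stem = s3_key.rsplit("/", 1)[-1].removesuffix(".json")
    ts_us, node_id = stem.rsplit("-", 1)
    return int(ts_us), int(node_id)

# Example: mark one log entry as processed, skipping it if already seen.
ts_us, node_id = parse_entry_name("deletes/2025-01-15/1252800000000-0.json")
key = bitmap_key(ts_us, node_id)
if key not in tombstones:
    tombstones.add(key)
```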
Periodically, each replica:
- Scans the current day's partition and the previous day's partition (to handle clock skew)
- Parses each filename to extract the timestamp and node ID
- Computes the bitmap key and skips if already set
- Deletes the corresponding cache key from local disk
- Sets the bit in the bitmap and persists it
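A hedged sketch of one pass of that loop, reusing bitmap_key and parse_entry_name from above; delete_from_local_disk and persist_bitmap stand in for the replica's own disk and snapshot routines:

```python
import datetime as dt
import json

def process_deletions(s3, bucket: str, tombstones) -> None:
    """One pass over today's and yesterday's partitions, applying unseen deletions."""
    today = dt.datetime.now(dt.timezone.utc).date()
    for day in (today - dt.timedelta(days=1), today):
        prefix = f"deletes/{day.isoformat()}/"
        pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
        for page in pages:
            for obj in page.get("Contents", []):
                ts_us, node_id = parse_entry_name(obj["Key"])
                key = bitmap_key(ts_us, node_id)
                if key in tombstones:
                    continue                                     # already processed
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
                delete_from_local_disk(json.loads(body)["key"])  # hypothetical local purge
                tombstones.add(key)                              # set the bit only after the purge succeeds
                persist_bitmap(tombstones)                       # hypothetical durable snapshot
```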
On every cache read, the replica checks the bitmap. If the key's bitmap index is set, the replica returns a cache miss. This prevents serving deleted content even if the local disk cache hasn't been purged yet.
Deletions are idempotent, so reprocessing an entry is harmless.
Tier Two Propagation
Tier two replicas maintain their own bitmap on local disk. They connect to an SSE endpoint on the tier one load balancer, sending their last-seen event ID. Tier one streams back deletion entries from that point forward.
On startup, the tier two node loads its bitmap from disk, then connects to tier one and catches up. The bitmap serves the same dual purpose: tracking processed deletions and providing tombstone lookups.
If a tier two node restarts, it resumes from its last persisted bitmap state.
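A minimal sketch of the tier two subscriber, assuming the tier one load balancer exposes an SSE endpoint at /deletions and honors the standard Last-Event-ID header (the URL, event format, and requests dependency are assumptions):

```python
import requests

TIER1_SSE_URL = "http://tier1-lb/deletions"       # hypothetical endpoint

def follow_deletions(last_event_id: str, on_delete) -> None:
    """Stream deletion events from tier one, resuming after last_event_id."""
    headers = {"Accept": "text/event-stream"}
    if last_event_id:
        headers["Last-Event-ID"] = last_event_id  # standard SSE resume header
    resp = requests.get(TIER1_SSE_URL, headers=headers, stream=True)
    resp.raise_for_status()
    event_id, data = None, None
    for line in resp.iter_lines(decode_unicode=True):
        if not line:                              # blank line ends one event
            if data is not None:
                on_delete(event_id, data)         # purge disk, set bit, persist
            event_id, data = None, None
            continue
        field, _, value = line.partition(":")
        value = value.lstrip(" ")
        if field == "id":
            event_id = value
        elif field == "data":
            data = value if data is None else data + "\n" + value
```

The on_delete callback would purge the local disk entry, set the bit in the node's bitmap, and persist it, mirroring the tier one processing loop.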
Failure Modes
| Failure | Outcome |
|---|---|
| S3 write fails | API returns error, caller retries |
| Replica crashes mid-processing | Restarts, scans log, rebuilds bitmap |
| Replica offline for extended period | Catches up by scanning all partitions |
| Network partition | Resumes processing when connectivity restored |
Guarantees
- Every deletion is durably logged before the API returns success
- Every replica eventually processes every deletion
- Bitmap is updated only after successful local deletion
- Log entries are never garbage collected
- Tombstone lookups prevent serving deleted content even before local purge