We want to isolate our main store instance from being overloaded by external RPC requests. To facilitate this we need to replicate the primary store to secondary follower stores.
We can re-use most of the existing interfaces to do this, making the initial implementation fairly non-invasive.
Our store contains several separate storage media, so it is not possible to use an off-the-shelf solution to sync them; we instead use an in-protocol approach.
This acts as both a backup solution and an IO scaling solution.
No gRPC changes
A secondary store can be spun up by:
- fetching blocks from another store by polling the `getBlockByNumber` endpoint
- applying each block to itself using `apply_block` as per normal
This means a secondary store is no longer simply passive, but also contains an active task which drives the fetch-and-apply loop.
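As a rough illustration, a minimal sketch of that fetch-and-apply loop could look like the following. All types and signatures here are stand-ins for illustration; the real store interfaces and gRPC client are not shown.

```rust
use std::time::Duration;

/// Stand-ins for the real store types; names and shapes are assumptions.
struct Block {
    number: u64,
}

struct Store {
    latest: u64,
}

impl Store {
    /// Stand-in for the existing apply_block path.
    fn apply_block(&mut self, block: Block) {
        self.latest = block.number;
    }
}

/// Stand-in for a gRPC client of the primary's getBlockByNumber endpoint;
/// returns None once we ask past the primary's tip.
async fn get_block_by_number(number: u64) -> Option<Block> {
    if number <= 100 {
        Some(Block { number })
    } else {
        None
    }
}

/// The active task a secondary store runs: fetch the next block from the
/// primary and apply it locally, backing off once we are caught up.
async fn follow_primary(store: &mut Store) {
    loop {
        let next = store.latest + 1;
        match get_block_by_number(next).await {
            Some(block) => store.apply_block(block),
            None => tokio::time::sleep(Duration::from_secs(1)).await,
        }
    }
}

#[tokio::main]
async fn main() {
    let mut store = Store { latest: 0 };
    follow_primary(&mut store).await;
}
```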
Shortcomings
The above approach falls somewhat short, because we also want to fetch block proofs. The endpoint does support including block proofs, but since proofs lag behind a committed block, we have to fetch the same block data twice: once for the committed block, and a second time to fetch the proof (which returns the block data redundantly).
Additionally, since we use polling and cannot know what the latest proven block is, we have to continually poll block `proven+1`, which will exist as a committed block (but perhaps not yet proven) and will therefore return redundant block data on each poll.
This means we are continuously fetching a fair amount of redundant data, but in the short term this is unlikely to be a large problem. In part this is because we don't actually have block proofs yet, so the proven and committed tips will be close or identical for a large part of the time.
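To make the redundancy concrete, here is a hedged sketch of what the proof-polling side could look like, assuming a hypothetical response shape where the block data always comes back and the proof only appears once the block is proven; the real request and response shapes may differ.

```rust
use std::time::Duration;

/// Hypothetical response shape: block data is always returned, the proof
/// only once the block has been proven. Both are assumptions.
struct Block;
struct Proof;

async fn get_block_with_proof(_number: u64) -> (Block, Option<Proof>) {
    // In the real implementation this is a gRPC call with proofs included.
    (Block, None)
}

/// Poll block proven+1 until its proof lands. The block data is fetched
/// (and discarded) on every poll, which is the redundancy described above.
async fn follow_proofs(mut latest_proven: u64) {
    loop {
        match get_block_with_proof(latest_proven + 1).await {
            (_block, Some(_proof)) => {
                // Store the proof; the block itself was already applied by
                // the committed-block loop.
                latest_proven += 1;
            }
            // Committed but not proven yet: we paid for the block data anyway.
            (_block, None) => tokio::time::sleep(Duration::from_secs(1)).await,
        }
    }
}

#[tokio::main]
async fn main() {
    follow_proofs(0).await;
}
```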
The approach also falls short as a robust backup solution, since the primary store has no way of knowing that a block backup has completed.
There are also alternatives available, but they require quite a bit more work, and this will suffice for now.
Infrastructure
A trickier part is defining health at the infrastructure level. We want more than one of these secondary stores for load-balancing and redundancy, which also means we should be able to identify a lagging or unhealthy store node.
This can be done by defining the chain tip as the maximum across all stores, and then marking any store that lags behind by e.g. `N=2` blocks as unhealthy or out-of-sync. This could presumably be done by the load-balancer, though that means it needs to perform non-trivial work.
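A minimal sketch of that health rule, assuming whoever performs the check can already collect each store's tip height; the threshold and input shape are assumptions for illustration.

```rust
/// A store lagging the chain tip (the maximum height across all stores)
/// by N or more blocks is flagged out-of-sync. N=2 is the example threshold.
const N: u64 = 2;

fn out_of_sync(tips: &[u64]) -> Vec<bool> {
    let chain_tip = tips.iter().copied().max().unwrap_or(0);
    tips.iter().map(|&tip| chain_tip - tip >= N).collect()
}

fn main() {
    // Three stores at heights 12, 11 and 7: the last lags the tip by 5
    // blocks and is marked out-of-sync.
    assert_eq!(out_of_sync(&[12, 11, 7]), vec![false, false, true]);
}
```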