Support streaming split/splice APIs for large blobs #377
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds streamed equivalents of CAS blob split/splice operations to avoid unary message size limits while exposing capability flags for feature discovery.
Changes:
- Document unary SplitBlob/SpliceBlob size limitations and recommend streaming alternatives.
- Add `SplitChunks` (server-streaming) and `SpliceChunks` (client-streaming) RPCs plus request/response messages.
- Extend `CacheCapabilities` with support flags for the new streaming RPCs.
Files not reviewed (2)
- build/bazel/remote/execution/v2/remote_execution.pb.go: Language not supported
- build/bazel/remote/execution/v2/remote_execution_grpc.pb.go: Language not supported
|
I got added as a reviewer, so here's my take: just as with how chunking was added to the protocol in its current form, I'm not a fan of this. Rather than adding streaming, requiring a recursive construction of very large files using Prolly trees is, in my opinion, the way to go. That said, if the people actually on the working group think that this is the way to go, go for it! :-)
@EdSchouten Thanks, that makes sense. I agree that a recursive/Merkle-style representation such as Prolly trees is probably the better long-term model for representing extremely large blobs, since it bounds individual node sizes and avoids one giant flat chunk list.

My goal with this PR is narrower: provide a transport-level alternative to #376 without making clients materialize or dereference a chunk-list object in CAS. Streaming lets clients send the existing ordered chunk mapping in bounded batches, and it lets servers begin verification as chunks become available instead of waiting until every chunk has been uploaded and then receiving the whole mapping in one unary

I also see this as compatible with a future recursive representation. A server could still store the received mapping internally as a Prolly tree or some other recursive structure. The protocol question here is whether clients need to understand and construct that representation, or whether they can just stream the chunk references they already have.

Longer term, I would prefer for the streaming APIs to become the primary split/splice shape for clients that support them, with the existing unary APIs remaining as the compatibility/simple-case path. For cases where the chunk list fits in one request, sending it as a single stream message should be performance-equivalent to a unary gRPC request in practice, while still leaving the same API able to scale when the list grows.

A recursive representation also fits into that model: the stream does not require the server to store a flat list internally, it just gives the server live visibility into the composition progress. The server can choose to persist that mapping as a Prolly tree, a flat manifest, or another internal structure, while still getting the benefits of streaming transport: bounded message sizes, earlier validation, and the ability to pipeline verification while the client is still uploading or discovering chunks, since chunking is inherently sequential.

Some recursive composition can be approximated with the current API today by splicing subranges and then splicing those intermediate blobs, but standardizing that as the preferred model seems like it would need more protocol work: tree node encoding, fanout rules, digest/validation semantics, and how
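To make the "bounded batches" idea concrete, here is a minimal Go sketch of how a client might group an ordered chunk list into stream messages. The `Digest` type, the `approxEncodedSize` estimate, and the 256-byte budget are all assumptions made for illustration; they are not the generated protobuf types or limits from this PR.

```go
package main

import "fmt"

// Digest mirrors the shape of a REAPI Digest (hash plus size). It is a
// stand-in type for this sketch, not the generated protobuf message.
type Digest struct {
	Hash      string
	SizeBytes int64
}

// approxEncodedSize is a rough, assumed upper bound on the wire size of one
// digest: the hash string, a varint size, and a few bytes of field framing.
func approxEncodedSize(d Digest) int {
	return len(d.Hash) + 10 + 6
}

// batchDigests splits an ordered chunk list into consecutive batches whose
// estimated size stays at or below maxBytes. Order is preserved, and a digest
// larger than the budget is still emitted alone so that no chunk is dropped.
func batchDigests(digests []Digest, maxBytes int) [][]Digest {
	var batches [][]Digest
	var current []Digest
	used := 0
	for _, d := range digests {
		n := approxEncodedSize(d)
		if len(current) > 0 && used+n > maxBytes {
			batches = append(batches, current)
			current = nil
			used = 0
		}
		current = append(current, d)
		used += n
	}
	if len(current) > 0 {
		batches = append(batches, current)
	}
	return batches
}

func main() {
	var digests []Digest
	for i := 0; i < 10; i++ {
		digests = append(digests, Digest{Hash: fmt.Sprintf("%064x", i), SizeBytes: 1 << 20})
	}
	// Each digest is estimated at 80 bytes here, so a 256-byte budget fits three.
	for i, b := range batchDigests(digests, 256) {
		fmt.Printf("message %d carries %d digests\n", i, len(b))
	}
}
```

A chunk list that fits in the budget comes back as a single batch, which corresponds to the "single stream message" case described above.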
|
@EdSchouten One open question I have for a Prolly tree design is how it preserves the existing full-blob digest semantics. In REAPI, the digest in an If the Prolly root digest is a digest of tree-node bytes, it is not a substitute for the file's content digest. If the key remains the full blob digest, the server still has to verify the leaves in order by hashing the reconstructed content, or trust a mapping it has already validated. And if a server only has chunks/tree nodes under one chunking strategy, it is not obvious how it can know that a different client-provided tree already represents the same blob without either having a stored full-digest-to-tree mapping or walking/rehashing the content. |
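The "verify the leaves in order by hashing the reconstructed content" step can be sketched in Go as follows. This assumes SHA-256 as the digest function and passes chunk contents as in-memory byte slices; a real server would read chunks from its CAS, which the sketch does not model.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// verifySplice hashes the chunks in the order given and reports whether the
// concatenation matches the expected full-blob SHA-256 hash and size.
func verifySplice(chunks [][]byte, wantHash string, wantSize int64) bool {
	h := sha256.New()
	var total int64
	for _, c := range chunks {
		h.Write(c) // hash.Hash.Write never returns an error
		total += int64(len(c))
	}
	return total == wantSize && hex.EncodeToString(h.Sum(nil)) == wantHash
}

func main() {
	blob := []byte("hello, chunked world")
	sum := sha256.Sum256(blob)
	want := hex.EncodeToString(sum[:])

	inOrder := [][]byte{blob[:7], blob[7:15], blob[15:]}
	swapped := [][]byte{blob[7:15], blob[:7], blob[15:]}

	// The same chunks only match the full-blob digest in the original order.
	fmt.Println("in order:", verifySplice(inOrder, want, int64(len(blob))))
	fmt.Println("swapped: ", verifySplice(swapped, want, int64(len(blob))))
}
```

Because the check depends only on the concatenated bytes, it gives the same answer for any chunking of the same blob, which is why a tree-node digest cannot stand in for it.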
Pull request overview
Copilot reviewed 1 out of 3 changed files in this pull request and generated 2 comments.
// call [SplitBlob][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitBlob]
// to check what chunk mapping the server is using.
// If set, this commits the splice and this MUST be the final request in the
// stream. Clients MUST set this on exactly one request. If the final request
// also contains `chunk_digests`, servers MUST include those digests in the
// spliced chunk list before committing the splice.
That is true for this specific RPC, but many other parts of the protocol already work this way.
I agree that it would probably be easier to adapt most servers and clients.
The speed improvement I can see is that splicing can finish faster after the last chunk has been uploaded. The work still has to be done, so it is a matter of cutting latency: instead of loading gigabytes of data at the end, the server can load it while the client is uploading the chunks. @tyler-french Do you see the latency as a problem in your current clusters? My biggest concern is that servers may have to hold an unbounded number of connections open, waiting for more chunks to arrive.
To address my concerns, I might implement streaming splicing by waiting for the last message and then doing the splicing anyway, effectively converting it into the chunk-list version.
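A rough Go sketch of that "wait for the last message, then splice" approach is below. The `spliceRequest` type and its `Commit` field are hypothetical stand-ins for the streamed request message (the actual field names are not taken from the proto), and the stream is modelled as a slice rather than a gRPC `Recv` loop.

```go
package main

import (
	"errors"
	"fmt"
)

// Digest is a stand-in for the REAPI Digest message.
type Digest struct {
	Hash      string
	SizeBytes int64
}

// spliceRequest is a hypothetical stand-in for one streamed request message.
// The Commit field name is assumed for this sketch.
type spliceRequest struct {
	ChunkDigests []Digest
	Commit       bool
}

// collectChunkList buffers a whole request stream into one ordered chunk
// list, so the server can then run its existing unary-style splice logic.
// It enforces that the commit marker appears on the final request only, and
// it includes digests carried by the committing request itself.
func collectChunkList(reqs []spliceRequest) ([]Digest, error) {
	var all []Digest
	for i, r := range reqs {
		all = append(all, r.ChunkDigests...)
		if r.Commit {
			if i != len(reqs)-1 {
				return nil, errors.New("commit must be set on the final request only")
			}
			return all, nil
		}
	}
	return nil, errors.New("stream ended without a commit request")
}

func main() {
	stream := []spliceRequest{
		{ChunkDigests: []Digest{{Hash: "aa", SizeBytes: 4}, {Hash: "bb", SizeBytes: 6}}},
		{ChunkDigests: []Digest{{Hash: "cc", SizeBytes: 2}}, Commit: true},
	}
	list, err := collectChunkList(stream)
	fmt.Println(len(list), err)
}
```

With this shape the server holds only the digest list in memory while the stream is open, and does no splice work until the commit arrives.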
The chunk-list approach also keeps the old API fully usable by older clients.
Either we gate large-blob handling behind a capability, or we make it a requirement of the Split/Splice API from a certain version of the REv2 API onward.
|
@moroten (and cc @meroton-benjamin) Thanks, these are good concerns. I am also happy to support #376 if that is where we land. I needed to add streaming support on our side anyway, so this felt like a good time to see whether it could also solve the large chunk-list problem. I think the main point is that streaming does not remove the splice work. It gives clients a bounded-message way to send the chunk list, and it lets servers start seeing the mapping earlier if they want to. A client does not have to keep the stream open while chunks are still uploading. It can upload the chunks first, then call
Yes. We have hit issues with blobs around 1 GB. Today the client uploads chunks with I agree that the work still has to be done. The benefit is mostly message sizing and tail latency. With a fast chunker, I would expect an incremental
That risk is real, but it is similar to Also, clients that want to avoid long idle splice streams can wait until all chunks are uploaded, then send the mapping in one or a few
I think this should follow the normal RPC deadline model. For Bazel, the call still needs to complete inside
This is the main tradeoff. If the stream fails before the server commits the splice, the client has to retry. After the chunks are uploaded, that retry is much cheaper than retrying the original upload. There is also a useful fallback. If the chunk list fits in one message, the client can retry with unary
Agreed. Keeping a []digests is pretty cheap. You could even just keep []offsets to save even more memory.
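As an illustration of the offsets-only idea, the Go sketch below recovers each chunk's byte range from a list of ascending end offsets. It assumes the chunker records where each chunk ends; the digests themselves would still have to be recomputed by re-reading those ranges, which the sketch does not do.

```go
package main

import "fmt"

// chunkRange is a half-open byte range [Start, Start+Length) in the blob.
type chunkRange struct {
	Start  int64
	Length int64
}

// rangesFromOffsets converts ascending chunk end offsets into byte ranges.
// The first chunk is assumed to start at offset 0.
func rangesFromOffsets(ends []int64) []chunkRange {
	ranges := make([]chunkRange, 0, len(ends))
	var start int64
	for _, end := range ends {
		ranges = append(ranges, chunkRange{Start: start, Length: end - start})
		start = end
	}
	return ranges
}

func main() {
	for _, r := range rangesFromOffsets([]int64{4096, 10000, 16384}) {
		fmt.Printf("chunk at %d, %d bytes\n", r.Start, r.Length)
	}
}
```

This trades memory for work: one 8-byte offset per chunk instead of a full digest, at the cost of rehashing when the digest list is needed.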
I got this feedback, and I think it makes sense the request requires
That seems like a valid implementation. It would still avoid the unary message size limit and let clients use one API shape for larger blobs.
Agreed. I did not mean that the chunk-list approach breaks older clients. The difference I see is that PR #376 adds another representation to the existing unary API shape. Newer clients then need logic to write, read, and interpret chunk-list objects. With streaming, the old unary APIs stay as the simple path, and the large-blob path is a separate capability-gated RPC. A client that is not aware of indirection may fail if the SplitBlob response sets it. |
|
@tyler-french I would be happy with the streaming approach as well. Both approaches have their pros and cons, but the main problem is that the whole API refers to blobs as whole files, not a Prolly tree. Maybe set the decision next monthly? I'll add some code comments as well. |
// operation. Starting in RE API v2.13, this also includes the
// [ContentAddressableStorage.SplitChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitChunks]
Servers that stop supporting RE API v2.12 should be able to drop the old implementation.
// operation. Starting in RE API v2.13, this also includes the
// [ContentAddressableStorage.SplitChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitChunks]
// operation for RE API v2.12 and from
// [ContentAddressableStorage.SplitChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitChunks]
// operation for RE API v2.13 or higher.
// operation. Starting in RE API v2.13, this also includes the
// [ContentAddressableStorage.SpliceChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SpliceChunks]
Servers that stop supporting RE API v2.12 should be able to drop the old implementation.
// operation. Starting in RE API v2.13, this also includes the
// [ContentAddressableStorage.SpliceChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SpliceChunks]
// operation for RE API v2.12 and from
// [ContentAddressableStorage.SpliceChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SpliceChunks]
// operation for RE API v2.13 or higher.
// The parameters for the RepMaxCDC chunking algorithm.
// If set, the server supports the RepMaxCDC chunking algorithm.
RepMaxCdcParams rep_max_cdc_params = 12;
|
|
Remove this inserted line.
// split information available for the blob, OR at least one chunk needed to
// reconstruct the blob is missing from the CAS.
// * `RESOURCE_EXHAUSTED`: There is insufficient disk quota to store the blob
// chunks.
There was a problem hiding this comment.
// chunks.
//
// New in v2.12 and removed in v2.13, use [SplitChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitChunks] instead.
// it prefers a different chunking and extended those instead. Clients can
// call [SplitBlob][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitBlob]
// or [SplitChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitChunks]
// to check what chunk mapping the server is using.
There was a problem hiding this comment.
// to check what chunk mapping the server is using.
// to check what chunk mapping the server is using.
//
// New in v2.12 and removed in v2.13, use [SpliceChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SpliceChunks] instead.
|
Thanks to the BuildBarn folks for the thorough reviews. From my understanding, this PR seems to be in a good state for a final round of readings and discussion in tomorrow's meeting. If you have any questions / comments / amendments / objections, make sure to add them async here (preferred) and/or during the WG meeting tomorrow. |
//
// Clients SHOULD limit the number of digests in each request to remain below
// the maximum message size accepted by the client/server pair.
repeated Digest chunk_digests = 3;
It would be good to specify the expectations around an empty chunk_digests list now that clients/servers may send multiple messages.
This PR adds streaming versions of split and splice. This lets clients send the ordered chunk list across multiple messages instead of fitting it all into one unary request or response.
Streaming also gives servers a chance to process the mapping as it arrives. For splice, this can avoid doing all of the work in one final RPC after every chunk has already been uploaded. Servers can still choose to wait until the stream is complete and process it like the unary path.
This solves the issues described in #376.