
Support streaming split/splice APIs for large blobs #377

Open
tyler-french wants to merge 2 commits into bazelbuild:main from tyler-french:tfrench/stream-splice

@tyler-french tyler-french commented May 18, 2026

Contributor

This PR adds streaming versions of split and splice. This lets clients send the ordered chunk list across multiple messages instead of fitting it all into one unary request or response.

Streaming also gives servers a chance to process the mapping as it arrives. For splice, this can avoid doing all of the work in one final RPC after every chunk has already been uploaded. Servers can still choose to wait until the stream is complete and process it like the unary path.

This solves the issues described in #376.
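
A minimal sketch of the bounded-message idea, assuming a client already holds the ordered chunk list in memory. The `Digest` tuple and the `batch_digests` helper are hypothetical names used for illustration and are not part of the proposed proto. A real client would probably size batches by encoded bytes rather than by count.

```python
from typing import Iterator, List, NamedTuple


class Digest(NamedTuple):
    """Stand-in for the REAPI Digest message (hash plus size in bytes)."""
    hash: str
    size_bytes: int


def batch_digests(digests: List[Digest], max_per_message: int) -> Iterator[List[Digest]]:
    """Yield consecutive slices of the ordered chunk list, preserving order."""
    if max_per_message <= 0:
        raise ValueError("max_per_message must be positive")
    for start in range(0, len(digests), max_per_message):
        yield digests[start:start + max_per_message]


# Ten placeholder chunks, sent as stream messages of at most four digests each.
chunks = [Digest(hash=f"{i:064x}", size_bytes=1024) for i in range(10)]
batches = list(batch_digests(chunks, max_per_message=4))
```

Concatenating the batches in order gives back the original list, which is the property the receiver relies on when reassembling the mapping.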

@tyler-french tyler-french changed the title from "feat: Support streaming versions of split/splice for very large blobs" to "WIP feat: Support streaming versions of split/splice for very large blobs" May 18, 2026
@tyler-french tyler-french force-pushed the tfrench/stream-splice branch from f2161ab to e8f806f May 18, 2026 14:39
@tyler-french tyler-french changed the title from "WIP feat: Support streaming versions of split/splice for very large blobs" to "Support streaming split/splice APIs for large blobs" May 19, 2026
@tyler-french tyler-french marked this pull request as ready for review May 19, 2026 15:08
Copilot AI review requested due to automatic review settings May 19, 2026 15:08

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds streamed equivalents of CAS blob split/splice operations to avoid unary message size limits while exposing capability flags for feature discovery.

Changes:

  • Document unary SplitBlob/SpliceBlob size limitations and recommend streaming alternatives.
  • Add SplitChunks (server-streaming) and SpliceChunks (client-streaming) RPCs plus request/response messages.
  • Extend CacheCapabilities with support flags for the new streaming RPCs.
Files not reviewed (2)
  • build/bazel/remote/execution/v2/remote_execution.pb.go: Language not supported
  • build/bazel/remote/execution/v2/remote_execution_grpc.pb.go: Language not supported


@EdSchouten

Collaborator

I got added as a reviewer, so here's my take: just like how chunking has been added to the protocol in its current form, I'm not a fan of this. Not having streaming, but requiring a recursive construction of very large files using Prolly trees is in my opinion the way to go.

That said, if people actually on the working group think that this is the way to go, go for it! :-)

@tyler-french

Contributor Author

> I got added as a reviewer, so here's my take: just like how chunking has been added to the protocol in its current form, I'm not a fan of this. Not having streaming, but requiring a recursive construction of very large files using Prolly trees is in my opinion the way to go.
>
> That said, if people actually on the working group think that this is the way to go, go for it! :-)

@EdSchouten Thanks, that makes sense. I agree that a recursive/Merkle-style representation such as Prolly trees is probably the better long-term model for representing extremely large blobs, since it bounds individual node sizes and avoids one giant flat chunk list.

My goal with this PR is narrower: provide a transport-level alternative to #376 without making clients materialize or dereference a chunk-list object in CAS. Streaming lets clients send the existing ordered chunk mapping in bounded batches, and it lets servers begin verification as chunks become available instead of waiting until every chunk has been uploaded and then receiving the whole mapping in one unary SpliceBlob request.

I also see this as compatible with a future recursive representation. A server could still store the received mapping internally as a Prolly tree or some other recursive structure. The protocol question here is whether clients need to understand and construct that representation, or whether they can just stream the chunk references they already have.

Longer term, I would prefer for the streaming APIs to become the primary split/splice shape for clients that support them, with the existing unary APIs remaining as the compatibility/simple-case path. For cases where the chunk list fits in one request, sending it as a single stream message should be performance-equivalent to a unary gRPC request in practice, while still leaving the same API able to scale when the list grows. A recursive representation also fits into that model: the stream does not require the server to store a flat list internally, it just gives the server live visibility into the composition progress. The server can choose to persist that mapping as a Prolly tree, a flat manifest, or another internal structure, while still getting the benefits of streaming transport: bounded message sizes, earlier validation, and the ability to pipeline verification while the client is still uploading or discovering chunks, since chunking is inherently sequential.

Some recursive composition can be approximated with the current API today by splicing subranges and then splicing those intermediate blobs, but standardizing that as the preferred model seems like it would need more protocol work: tree node encoding, fanout rules, digest/validation semantics, and how chunking_function applies at each level. Streaming is a smaller additive step that addresses the immediate unary message-size and final-RPC verification problem while leaving room for that design.

@tyler-french

Contributor Author

@EdSchouten One open question I have for a Prolly tree design is how it preserves the existing full-blob digest semantics. In REAPI, the digest in an ActionResult is the digest of the output file contents, and CAS lookups are keyed by that full content digest. So even if the server stores the blob as a recursive tree internally, both the client and server eventually need to know that the tree represents exactly the bytes for the full blob digest.

If the Prolly root digest is a digest of tree-node bytes, it is not a substitute for the file's content digest. If the key remains the full blob digest, the server still has to verify the leaves in order by hashing the reconstructed content, or trust a mapping it has already validated. And if a server only has chunks/tree nodes under one chunking strategy, it is not obvious how it can know that a different client-provided tree already represents the same blob without either having a stored full-digest-to-tree mapping or walking/rehashing the content.
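
To make the verification concern concrete, here is a rough sketch of what "hashing the reconstructed content" amounts to, assuming SHA-256 as the digest function and chunks that are already available as byte strings. `verify_splice` is a hypothetical helper, not an REAPI call.

```python
import hashlib
from typing import Iterable


def verify_splice(chunks: Iterable[bytes], expected_hash: str, expected_size: int) -> bool:
    """Hash chunks in order and compare against the full-blob digest."""
    hasher = hashlib.sha256()
    total = 0
    for chunk in chunks:
        hasher.update(chunk)  # incremental: never holds the whole blob in memory
        total += len(chunk)
    return total == expected_size and hasher.hexdigest() == expected_hash


parts = [b"hello, ", b"remote ", b"execution"]
blob = b"".join(parts)
blob_hash = hashlib.sha256(blob).hexdigest()

ok = verify_splice(parts, blob_hash, len(blob))
reordered = verify_splice(list(reversed(parts)), blob_hash, len(blob))
```

The same chunks in a different order fail the check, which is why the order of the mapping matters and why a digest over tree-node bytes cannot stand in for the content digest.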

@tyler-french tyler-french force-pushed the tfrench/stream-splice branch 2 times, most recently from 5af53b6 to 638b7e6 May 20, 2026 16:39
@tyler-french tyler-french requested a review from Copilot May 20, 2026 16:55

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 3 changed files in this pull request and generated 2 comments.

Files not reviewed (2)
  • build/bazel/remote/execution/v2/remote_execution.pb.go: Language not supported
  • build/bazel/remote/execution/v2/remote_execution_grpc.pb.go: Language not supported

Comment on lines +638 to +639
// call [SplitBlob][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitBlob]
// to check what chunk mapping the server is using.
// If set, this commits the splice and this MUST be the final request in the
// stream. Clients MUST set this on exactly one request. If the final request
// also contains `chunk_digests`, servers MUST include those digests in the
// spliced chunk list before committing the splice.
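
One way to read the commit rule in the comment above, sketched as a server-side accumulator. The `SpliceRequest` class and its `commit` field are assumed names (the excerpt does not show the actual field name), and digests are simplified to strings.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class SpliceRequest:
    chunk_digests: List[str] = field(default_factory=list)
    commit: bool = False  # hypothetical name for the "commits the splice" field


class SpliceProtocolError(Exception):
    """Raised when the stream violates the commit rule."""


def collect_chunk_list(requests: Iterable[SpliceRequest]) -> List[str]:
    """Accumulate digests in order; require exactly one commit, on the last request."""
    collected: List[str] = []
    committed = False
    for request in requests:
        if committed:
            raise SpliceProtocolError("request received after the commit request")
        # Digests on the committing request are included before committing.
        # Empty chunk_digests lists are accepted here; the excerpt does not
        # say whether they should be.
        collected.extend(request.chunk_digests)
        committed = request.commit
    if not committed:
        raise SpliceProtocolError("stream ended without a commit request")
    return collected


stream = [SpliceRequest(["a", "b"]), SpliceRequest(["c"], commit=True)]
full_list = collect_chunk_list(stream)
```

A server written this way only starts splicing once the stream is complete; an incremental server could begin verification inside the loop instead.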
@moroten

moroten commented May 21, 2026

Contributor

> The streaming approach keeps chunk-list transport in the RPC layer instead. That has a few advantages:
>
>   • It avoids using CAS as a protocol transport for RPC metadata.

That is true for this specific RPC, but many other parts of the protocol already work this way.

>   • It does not require clients to add extra read/write logic for chunk-list objects.

I agree that it would probably be easier to adapt most servers and clients.

>   • It lets servers receive the ordered chunk mapping incrementally instead of only at the final splice call.

The speed improvement I can see is that splicing can finish faster after the last chunk has been uploaded. The work still has to be done, so it is a matter of cutting latency: instead of loading GB of data at the end, it can be loaded while the client is uploading the chunks. @tyler-french Do you see the latency as a problem in your current clusters?

My biggest concern is that the servers may have to hold an infinite number of connections open, waiting for more chunks to arrive.

  • What timeout should be allowed? For Bazel, we will still have to be quicker than the global --remote_timeout.
  • If a splice call times out, the server has to throw away all work. With a unary call with a chunk-list the server can process the whole splicing and store the result, so timeout should not be any problem (unless server internal networking is slow).
  • Minor note: For clients to handle timeouts, they need to keep the full list of chunks until splicing is done. There will not be any benefit in lower memory consumption on client side.
  • With a unary call, the splicing can be deduplicated on server side.

To address my concerns, I might implement the streaming splicing by waiting for the last message and then do the splicing anyway, effectively converting it into the chunk-list version.

>   • It keeps the old unary APIs simple and fully usable by older clients.

The chunk-list approach also keeps the old API fully usable by older clients.

>   • It gives clients a capability-gated path for large blobs without changing the semantics of existing fields.

Either we capability-gate the handling of large blobs, or we make it a requirement of the Split/Splice API from a certain version of the REv2 API onward.

@tyler-french

Contributor Author

@moroten (and cc @meroton-benjamin) Thanks, these are good concerns. I am also happy to support #376 if that is where we land. I needed to add streaming support on our side anyway, so this felt like a good time to see whether it could also solve the large chunk-list problem.

I think the main point is that streaming does not remove the splice work. It gives clients a bounded-message way to send the chunk list, and it lets servers start seeing the mapping earlier if they want to. A client does not have to keep the stream open while chunks are still uploading. It can upload the chunks first, then call SpliceChunks with the full chunk list in one or a few messages. That makes this usable as an alternative to the indirect chunk-list approach, without requiring the chunk list itself to be stored and fetched through CAS.

> The speed improvement I can see is that splicing can finish faster after the last chunk has been uploaded. The work still has to be done, so it is a matter of cutting latency: instead of loading GB of data at the end, it can be loaded while the client is uploading the chunks. @tyler-french Do you see the latency as a problem in your current clusters?

Yes. We have hit issues with blobs around 1 GB. Today the client uploads chunks with BatchUpdateBlobs, then calls SpliceBlob after all chunks are available. Since the server only sees the mapping at the end, all splice work is deferred into one final RPC, and we have seen that call time out.

I agree that the work still has to be done. The benefit is mostly message sizing and tail latency. With a fast chunker, I would expect an incremental SpliceChunks flow to take about as long as Write for the same blob.

> My biggest concern is that the servers may have to hold an infinite number of connections open, waiting for more chunks to arrive.

That risk is real, but it is similar to Write. Servers still need normal RPC controls, such as deadlines, idle timeouts, max concurrent streams, and per instance limits.

Also, clients that want to avoid long idle splice streams can wait until all chunks are uploaded, then send the mapping in one or a few SpliceChunks messages. That gives a similar call shape to PR #376, but keeps the chunk-list transport in the RPC layer.

> What timeout should be allowed? For Bazel, we will still have to be quicker than the global --remote_timeout.

I think this should follow the normal RPC deadline model. For Bazel, the call still needs to complete inside --remote_timeout. The deadline would probably be similar to whatever a Write call for the same blob would use.

> If a splice call times out, the server has to throw away all work. With a unary call with a chunk-list the server can process the whole splicing and store the result, so timeout should not be any problem (unless server internal networking is slow).

This is the main tradeoff. If the stream fails before the server commits the splice, the client has to retry. After the chunks are uploaded, that retry is much cheaper than retrying the original upload.

There is also a useful fallback. If the chunk list fits in one message, the client can retry with unary SpliceBlob. If it does not, it can retry SpliceChunks with smaller batches or only a few messages. So the client has a bit more control over getting things to succeed.
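
The fallback described here could look roughly like the following. The 72-byte per-digest estimate and the 4 MiB limit (gRPC's default maximum receive size) are assumptions for illustration, and `plan_retry` is a hypothetical helper rather than anything defined by this PR.

```python
from typing import Tuple

GRPC_DEFAULT_MAX_MESSAGE_BYTES = 4 * 1024 * 1024  # gRPC's default receive limit
ASSUMED_BYTES_PER_DIGEST = 72  # rough encoded size of one SHA-256 Digest (assumption)


def plan_retry(num_digests: int,
               bytes_per_digest: int = ASSUMED_BYTES_PER_DIGEST,
               max_message_bytes: int = GRPC_DEFAULT_MAX_MESSAGE_BYTES) -> Tuple[str, int]:
    """Pick an RPC and a batch size for retrying a splice.

    Returns (rpc_name, digests_per_message). The estimate ignores the other
    request fields, so a real client would leave some headroom.
    """
    per_message = max(1, max_message_bytes // bytes_per_digest)
    if num_digests <= per_message:
        return ("SpliceBlob", num_digests)  # whole list fits in one unary request
    return ("SpliceChunks", per_message)    # stream it in bounded batches


small = plan_retry(1_000)
large = plan_retry(100_000)
```

Passing a smaller `max_message_bytes` on a later attempt is one way to express "retry with smaller batches".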

> Minor note: For clients to handle timeouts, they need to keep the full list of chunks until splicing is done. There will not be any benefit in lower memory consumption on client side.

Agreed. Keeping a []digests is pretty cheap. You could even just keep []offsets to save even more memory.
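
As a back-of-the-envelope check of "pretty cheap", assuming a 1 GiB blob, a 512 KiB average chunk size, raw 32-byte SHA-256 hashes, and 8-byte integers (all illustrative numbers, not measurements):

```python
GIB = 1024 ** 3
KIB = 1024

blob_size = 1 * GIB          # assumed blob size
avg_chunk_size = 512 * KIB   # assumed average chunk size

num_chunks = blob_size // avg_chunk_size

# Raw payload only; language-level object overhead is not counted.
digest_list_bytes = num_chunks * (32 + 8)  # 32-byte hash + 8-byte size per chunk
offset_list_bytes = num_chunks * 8         # one 8-byte offset per chunk
```

Under these assumptions the digest list is about 80 KiB and the offset list about 16 KiB, both tiny next to the blob itself.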

> With a unary call, the splicing can be deduplicated on server side.

I got this feedback elsewhere too, and I think it makes sense for the request to require blob_digest. Servers should now be able to deduplicate well by routing on it.

> To address my concerns, I might implement the streaming splicing by waiting for the last message and then do the splicing anyway, effectively converting it into the chunk-list version.

That seems like a valid implementation. It would still avoid the unary message size limit and let clients use one API shape for larger blobs.

> The chunk-list approach also keeps the old API fully usable by older clients.

Agreed. I did not mean that the chunk-list approach breaks older clients.

The difference I see is that PR #376 adds another representation to the existing unary API shape. Newer clients then need logic to write, read, and interpret chunk-list objects. With streaming, the old unary APIs stay as the simple path, and the large-blob path is a separate capability-gated RPC. A client that is not aware of indirection may fail if the SplitBlob response sets it.

@moroten

moroten commented May 21, 2026

Contributor

@tyler-french I would be happy with the streaming approach as well. Both approaches have their pros and cons, but the main problem is that the whole API refers to blobs as whole files, not a Prolly tree.

Maybe we can settle the decision at the next monthly meeting? I'll add some code comments as well.

@tyler-french tyler-french force-pushed the tfrench/stream-splice branch 3 times, most recently from 638b7e6 to d06fafd May 22, 2026 18:06
Comment on lines +2472 to +2473
// operation. Starting in RE API v2.13, this also includes the
// [ContentAddressableStorage.SplitChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitChunks]

Contributor

Servers that stop supporting RE API v2.12 should be able to drop the old implementation.

Suggested change
// operation. Starting in RE API v2.13, this also includes the
// [ContentAddressableStorage.SplitChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitChunks]
// operation for RE API v2.12 and from
// [ContentAddressableStorage.SplitChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitChunks]
// operation for RE API v2.13 or higher.

Comment on lines +2481 to +2482
// operation. Starting in RE API v2.13, this also includes the
// [ContentAddressableStorage.SpliceChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SpliceChunks]

Contributor

Servers that stop supporting RE API v2.12 should be able to drop the old implementation.

Suggested change
// operation. Starting in RE API v2.13, this also includes the
// [ContentAddressableStorage.SpliceChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SpliceChunks]
// operation for RE API v2.12 and from
// [ContentAddressableStorage.SpliceChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SpliceChunks]
// operation for RE API v2.13 or higher.

// The parameters for the RepMaxCDC chunking algorithm.
// If set, the server supports the RepMaxCDC chunking algorithm.
RepMaxCdcParams rep_max_cdc_params = 12;

Contributor

Remove this inserted line.

Suggested change

// split information available for the blob, OR at least one chunk needed to
// reconstruct the blob is missing from the CAS.
// * `RESOURCE_EXHAUSTED`: There is insufficient disk quota to store the blob
// chunks.

Contributor

Suggested change
// chunks.
//
// New in v2.12 and removed in v2.13, use [SplitChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitChunks] instead.

// it prefers a different chunking and extended those instead. Clients can
// call [SplitBlob][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitBlob]
// or [SplitChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitChunks]
// to check what chunk mapping the server is using.

Contributor

Suggested change
// to check what chunk mapping the server is using.
// to check what chunk mapping the server is using.
//
// New in v2.12 and removed in v2.13, use [SpliceChunks][build.bazel.remote.execution.v2.ContentAddressableStorage.SpliceChunks] instead.

@sluongng

sluongng commented Jun 8, 2026

Collaborator

Thanks to the Buildbarn folks for the thorough reviews.

From my understanding, this PR seems to be in a good state for a final round of readings and discussion in tomorrow's meeting. If you have any questions / comments / amendments / objections, make sure to add them async here (preferred) and/or during the WG meeting tomorrow.

cc other active maintainers: @tjgq @ulfjack @werkt

@tyler-french

Contributor Author

@tjgq @sluongng @fmeum let me know if you have time to take a pass. Thanks!

//
// Clients SHOULD limit the number of digests in each request to remain below
// the maximum message size accepted by the client/server pair.
repeated Digest chunk_digests = 3;

Contributor

It would be good to specify the expectations around an empty chunk_digests list now that clients/servers may send multiple messages.


6 participants