Skip to content

Conversation

@jrray
Copy link
Collaborator

@jrray jrray commented Nov 11, 2025

After testing with the blob removal against a repo that contains blob objects it became apparent that it was impossible to do things like spfs cat the payload digest (both using a full or a partial digest) because it would fail saying the digest is ambiguous.

$ spfs info PWDF5FM3ERTJDDEYMOX4VFBNB64J27E2YDEZXL6DOSKQJXWZO4YA====
Error:   × Ambiguous reference [too short]: PWDF5FM3ERTJDDEYMOX4VFBNB64J27E2YDEZXL6DOSKQJXWZO4YA

My first thought was that this situation is not ambiguous and implemented a way to make it prefer the payload when a blob object exists. However, on second thought I realized that it is possible for a non-blob object to share a digest with an unrelated payload (granted this is highly improbable). I decided to make it so the caller can hint at what type of item is expected when resolving a partial digest.

I think it is for the greater good to prefer payloads as the default behavior when the type is unknown, but now there is still a way to resolve ambiguities for this other extreme edge case. It didn't seem appropriate to try to read the object to check if it is a blob and disambiguate that way, but that would be an option.

If the repo still contains legacy blob files, resolving a partial digest
can discover both the blob object and payload digest, which is then
treated as ambiguous. Even when not using a partial digest, resolving
the env spec can fail as ambiguous.

This change makes it prefer finding a payload instead of being ambiguous
in this situation.

It ignores the possibility that a non-blob object may exist with the
same payload as a digest, but in the past this would have been a bigger
problem of the payload's blob conflicting with the non-blob object that
both want to have the same filename.

Signed-off-by: J Robert Ray <[email protected]>
As mentioned in the previous commit, a possibility exists for a non-blob
object to have the same digest as a payload. When resolving a partial
digest for this case, the most correct behavior would be to return an
ambiguous digest error. What happens as of the previous commit is it
will always resolve to the payload.

This new argument offers a way to still resolve the object, if the
caller knows in advance that the type of item expected is an object. In
practice there was only one place where the expected item type is
already known.

Signed-off-by: J Robert Ray <[email protected]>
If a parallel test changes the config to use legacy encoding then this
test will fail when trying to write a layer with blob annotation.

Signed-off-by: J Robert Ray <[email protected]>
@jrray jrray self-assigned this Nov 11, 2025
@jrray jrray added bug Something isn't working pr-chain This PR doesn't target the main branch, don't merge! labels Nov 11, 2025
It order for the test to be able to create a blob object [on disk], add
a `write_object_unchecked` operation that bypasses the check that the
object being written is not a blob.

Signed-off-by: J Robert Ray <[email protected]>
@rydrman
Copy link
Collaborator

rydrman commented Nov 12, 2025

Am I wrong in thinking that we had some kind of salting process for objects so that we wouldn't need to practically worry about the kind of collision that you're addressing with the hinting?

@jrray
Copy link
Collaborator Author

jrray commented Nov 12, 2025

Am I wrong in thinking that we had some kind of salting process for objects so that we wouldn't need to practically worry about the kind of collision that you're addressing with the hinting?

If you mean guaranteeing no hash collision, that's impossible to guarantee, but it is already unlikely enough to not have to worry about it practically.

The hint allows some of the implementations to skip doing I/O for the undesired type so there is some practical benefit.

If we treat object digests and payload digests as completely different things then collisions aren't possible in the sense that even if two things share a digest they are still independently addressable.

@jrray
Copy link
Collaborator Author

jrray commented Nov 12, 2025

As an exercise I modified the flatbuffer object digest calculation to save all the bytes that pass through the hasher.

The test called storage::repository::database_test::test_object_existence writes a layer object with digest NCDBKSOXIJFJDENWSKY5OPKAWMYUGAG32KFMHEHOWGTRLZWDWGTA====.

With my modifications I get a file called digest-NCDBKSOXIJFJDENWSKY5OPKAWMYUGAG32KFMHEHOWGTRLZWDWGTA====.bin. The contents of this file are the bytes that were hashed to calculate the object digest. If I put this file into an spfs layer then I get a payload with the same digest as the object:

$ spfs info D3QJEFHQRMM24WQXOSXA3F6WXJHPFZNU75G57UJLGUZU4IH3PQHQ====
D3QJEFHQRMM24WQXOSXA3F6WXJHPFZNU75G57UJLGUZU4IH3PQHQ====:
manifest:
 100644 file NCDBKSOXIJFJDENWSKY5OPKAWMYUGAG32KFMHEHOWGTRLZWDWGTA==== /spfs/digest-NCDBKSOXIJFJDENWSKY5OPKAWMYUGAG32KFMHEHOWGTRLZWDWGTA====.bin

So it is a contrived example but shows how it is possible to have a non-blob object and payload share the same digest. A payload contains arbitrary bytes so all possible digest values are possible.

I was thinking about using this technique to generate content for tests for coverage of when there is a digest collision in a repo. But, we don't require payload contents actually hash correctly so test content could be cheesed by manually writing a payload with a desired digest without proper file contents.

But I don't think this is something we need to worry about. A change that could help is making a decision that an EnvSpecItem can't be a payload digest. Trouble is that that type is used is contexts where a payload is acceptable.

  • spfs push <payload digest>
  • spfs shell <payload digest>

There could be a different type that is a superset of EnvSpecItem that does allow payload digests but there would be limited places that accept this type.

jrray added a commit that referenced this pull request Nov 14, 2025
The `EnvSpec` type has become overloaded and used is contexts where it
can represent things that don't make sense. For example, `"-"` is a
legal (empty) `EnvSpec` which can be used to create an empty
environment, but it doesn't make sense to `spfs push` an empty
environment.

A `RefSpec` is a similar type but instead of representing items that can
be used to create an environment, it represents items that can be copied
between repositories.

The `EnvSpec` scope has narrowed to restrict that any digests within it
are treated as object digests (not payloads), because it is non-sensical
to create an environment from a payload digest.

The `RefSpec` allows digests to be payload digests, but does not support
live layer files. It is also required to be non-empty, since it doesn't
make sense to copy "nothing" between repositories. Spec files are still
supported although the files themselves cannot be copied, however it
should be possible to provide the list of refs to sync to `spfs push`
via a spec file.

# Design Notes

There are still several places that accept an `EnvSpec` that is intended
to be used to create an environment, but it is first used to sync items
locally, so it needs to be converted to a `RefSpec` first (and the sync
result converted back to an `EnvSpec`). These conversions are either
lossy or fallible, so there are things likely broken in this PR that
just aren't covered by tests yet.

In this design, it makes sense for the conversion from `EnvSpec` to
`RefSpec` to be lossy (instead of fallible) in the context of syncing content
locally.

These changes were created to build on the ideas in PR #1289 and further
leverage `PartialDigestType` introduced there.

Signed-off-by: J Robert Ray <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working pr-chain This PR doesn't target the main branch, don't merge!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants