-
Notifications
You must be signed in to change notification settings - Fork 9
Ambiguous digest problems with existing blob objects #1289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: push-volovwxpnxry
Are you sure you want to change the base?
Conversation
If the repo still contains legacy blob files, resolving a partial digest can discover both the blob object and payload digest, which is then treated as ambiguous. Even when not using a partial digest, resolving the env spec can fail as ambiguous. This change makes it prefer finding a payload instead of being ambiguous in this situation. It ignores the possibility that a non-blob object may exist with the same payload as a digest, but in the past this would have been a bigger problem of the payload's blob conflicting with the non-blob object that both want to have the same filename. Signed-off-by: J Robert Ray <[email protected]>
As mentioned in the previous commit, a possibility exists for a non-blob object to have the same digest as a payload. When resolving a partial digest for this case, the most correct behavior would be to return an ambiguous digest error. What happens as of the previous commit is it will always resolve to the payload. This new argument offers a way to still resolve the object, if the caller knows in advance that the type of item expected is an object. In practice there was only one place where the expected item type is already known. Signed-off-by: J Robert Ray <[email protected]>
If a parallel test changes the config to use legacy encoding then this test will fail when trying to write a layer with blob annotation. Signed-off-by: J Robert Ray <[email protected]>
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
It order for the test to be able to create a blob object [on disk], add a `write_object_unchecked` operation that bypasses the check that the object being written is not a blob. Signed-off-by: J Robert Ray <[email protected]>
|
Am I wrong in thinking that we had some kind of salting process for objects so that we wouldn't need to practically worry about the kind of collision that you're addressing with the hinting? |
If you mean guaranteeing no hash collision, that's impossible to guarantee, but it is already unlikely enough to not have to worry about it practically. The hint allows some of the implementations to skip doing I/O for the undesired type so there is some practical benefit. If we treat object digests and payload digests as completely different things then collisions aren't possible in the sense that even if two things share a digest they are still independently addressable. |
|
As an exercise I modified the flatbuffer object digest calculation to save all the bytes that pass through the hasher. The test called With my modifications I get a file called So it is a contrived example but shows how it is possible to have a non-blob object and payload share the same digest. A payload contains arbitrary bytes so all possible digest values are possible. I was thinking about using this technique to generate content for tests for coverage of when there is a digest collision in a repo. But, we don't require payload contents actually hash correctly so test content could be cheesed by manually writing a payload with a desired digest without proper file contents. But I don't think this is something we need to worry about. A change that could help is making a decision that an
There could be a different type that is a superset of |
The `EnvSpec` type has become overloaded and used is contexts where it can represent things that don't make sense. For example, `"-"` is a legal (empty) `EnvSpec` which can be used to create an empty environment, but it doesn't make sense to `spfs push` an empty environment. A `RefSpec` is a similar type but instead of representing items that can be used to create an environment, it represents items that can be copied between repositories. The `EnvSpec` scope has narrowed to restrict that any digests within it are treated as object digests (not payloads), because it is non-sensical to create an environment from a payload digest. The `RefSpec` allows digests to be payload digests, but does not support live layer files. It is also required to be non-empty, since it doesn't make sense to copy "nothing" between repositories. Spec files are still supported although the files themselves cannot be copied, however it should be possible to provide the list of refs to sync to `spfs push` via a spec file. # Design Notes There are still several places that accept an `EnvSpec` that is intended to be used to create an environment, but it is first used to sync items locally, so it needs to be converted to a `RefSpec` first (and the sync result converted back to an `EnvSpec`). These conversions are either lossy or fallible, so there are things likely broken in this PR that just aren't covered by tests yet. In this design, it makes sense for the conversion from `EnvSpec` to `RefSpec` to be lossy (instead of fallible) in the context of syncing content locally. These changes were created to build on the ideas in PR #1289 and further leverage `PartialDigestType` introduced there. Signed-off-by: J Robert Ray <[email protected]>
After testing with the blob removal against a repo that contains blob objects it became apparent that it was impossible to do things like
spfs catthe payload digest (both using a full or a partial digest) because it would fail saying the digest is ambiguous.My first thought was that this situation is not ambiguous and implemented a way to make it prefer the payload when a blob object exists. However, on second thought I realized that it is possible for a non-blob object to share a digest with an unrelated payload (granted this is highly improbable). I decided to make it so the caller can hint at what type of item is expected when resolving a partial digest.
I think it is for the greater good to prefer payloads as the default behavior when the type is unknown, but now there is still a way to resolve ambiguities for this other extreme edge case. It didn't seem appropriate to try to read the object to check if it is a blob and disambiguate that way, but that would be an option.