design/sedimentree.md
Lines changed: 84 additions & 20 deletions
@@ -108,29 +108,29 @@ As implied by these diagrams, the sedimentree data structure first organizes com
108
108
109
109
## Terminology
110
110
111
-
A "commit" refers to the abstract idea of a node in the DAG which has a payload, a hash, and a set of parents identified by hash. A range of commits which has been compressed is referred to as a "stratum". A stratum has a start and end hash and zero or more interior "checkpoint" hashes - on which more later. If a commit is stored outside of a stratum it is a "loose commit". The payloads of both strata and loose commits are stored separately from the metadata about those objects as a "blob" - which is a content addressed binary array.
111
+
A "commit" refers to the abstract idea of a node in the DAG which has a payload, a hash, and a set of parents identified by hash. A range of commits which has been compressed is referred to as a "chunk". A chunk has a start and end hash and zero or more interior "checkpoint" hashes - on which more later. If a commit is stored outside of a chunk it is a "loose commit". All the chunks at a given layer are a "stratum". The payloads of both chunks and loose commits are stored separately from the metadata about those objects as a "blob" - which is a content addressed binary array.
112
112
113
113

114
114
115
-
Each stratum has a "level". Stratum with higher levels are further down in the sedimentree - composed of larger ranges of the commit graph. The first level stratum is level 1.
115
+
Each stratum has a "depth". Strata with larger depths are further down in the sedimentree and are composed of larger ranges of the commit graph. The shallowest stratum has depth 1.
116
116
117
-
A stratum which contains the data from some strata or loose commits above it is said to "support" the smaller strata. A sedimentree can be simplified by removing all the strata or loose commits which are supported by strata below them recursively, such a simplified sedimentree is called "minimal".
117
+
A chunk which contains the data from some chunks or loose commits above it (that is, with a smaller depth) is said to "support" those shallower chunks and loose commits. A sedimentree can be simplified by recursively removing all the chunks and loose commits which are supported by chunks below them; such a simplified sedimentree is called "minimal".
118
118
119
119
## Constructing a Sedimentree
120
120
121
121
To construct a sedimentree we need these things:
122
122
123
123
* A way to organize the commit DAG into a linear order
124
-
* A way to choose the stratum boundaries
125
-
* A way to recognize whenever one stratum supports another
124
+
* A way to choose the chunk boundaries
125
+
* A way to recognize whenever one chunk supports another
126
126
127
-
All of these mechanisms need to produce the same results for the shared components of the history for peers with divergent commit graphs. Furthermore, we would like for chains of commits to be more likely to end up in the same stratum as this allows us to achieve better compression.
127
+
All of these mechanisms need to produce the same results for the shared components of the history for peers with divergent commit graphs. Furthermore, we would like for chains of commits to be more likely to end up in the same chunk as this allows us to achieve better compression.
128
128
129
129
Here are the ingredients we are going to use:
130
130
131
131
* Order the graph via a reverse depth first traversal of the reversed change graph. I.e. start from the heads of the graph and traverse from commit to the parents of the commit
132
-
* Choose stratum boundaries based on the number of leading zeros in the hash of each commit. A commit with two leading zeros is a level 1 stratum
133
-
* Every stratum regardless of level retains checkpoint hashes - the hashes of the level 1 boundaries of which it is composed - which allows us to always tell if one stratum supports another by checking if the lower stratum contains the boundaries of the higher stratum in it's checkpoint
132
+
* Choose chunk boundaries based on the number of leading zeros in the hash of each commit. A commit whose hash has two leading zeros is the boundary of a depth 1 chunk
133
+
* Every chunk, regardless of depth, retains checkpoint hashes - the hashes of the depth 1 boundaries of which it is composed - which allows us to always tell if one chunk supports another by checking if the deeper chunk contains the boundaries of the shallower chunk in its checkpoints
134
134
135
135
### Reverse Depth First Traversal
136
136
@@ -176,32 +176,35 @@ c --> a
176
176
177
177
Starting from `b` we still have `b,a` and from `e` we still have `e,c,a`, but now we also have `d` as a loose commit.
178
178
179
+
There is one detail remaining. Where do we start our traversal from? When we start from the root it is easy as there is only one root, but starting from the heads of the document there are many options. To solve this we run the traversal starting from each head - that is, instead of having a single traversal over the graph, we have one traversal for each head. For Automerge documents this is not a huge problem because most documents are tall and narrow, so even with many heads most traversals consist of duplicate chunks.
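As a rough illustration of running one traversal per head, here is a minimal Python sketch. It assumes the commit graph is given as a mapping from each commit hash to the hashes of its parents, and it breaks ties between parents by sorting their hashes; both the representation and the tie-breaking rule are assumptions made for this example, as are the names `traverse_from_head`, `find_heads` and `traverse_all_heads`.

```python
from typing import Dict, List


def traverse_from_head(parents: Dict[str, List[str]], head: str) -> List[str]:
    """Depth first walk from `head` towards the root, following parent links."""
    order: List[str] = []
    seen = set()
    stack = [head]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        order.append(commit)
        # Assumption: visit parents in sorted hash order so that every peer
        # walks the shared history in the same order.
        for parent in sorted(parents.get(commit, ()), reverse=True):
            if parent not in seen:
                stack.append(parent)
    return order


def find_heads(parents: Dict[str, List[str]]) -> List[str]:
    """Heads are the commits which are not a parent of any other commit."""
    referenced = {p for ps in parents.values() for p in ps}
    return sorted(c for c in parents if c not in referenced)


def traverse_all_heads(parents: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Run one independent traversal for each head of the graph."""
    return {head: traverse_from_head(parents, head) for head in find_heads(parents)}


# A small hypothetical graph: `b` and `c` are children of `a`, `e` is a child of `c`.
example = {"a": [], "b": ["a"], "c": ["a"], "e": ["c"]}
traversals = traverse_all_heads(example)
```

In this hypothetical graph the traversals from the two heads share the commit `a`, which is the duplication the paragraph above refers to.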
180
+
179
181
### Chunk Boundaries
180
182
181
-
We want a way to divide up the linear order into chunks in such a way that everyone agrees on the chunk boundaries. We also want to be able to do this recursively, so that we choose the boundaries for lower strata consistently. We can do this by interpreting the hash of each commit as a number and using the number of trailing zeros in the number as the level of the chunk boundary.
183
+
We want a way to divide up the linear order into chunks in such a way that everyone agrees on the chunk boundaries. We also want to be able to do this recursively, so that we choose the boundaries for deeper chunks consistently. We can do this by interpreting the hash of each commit as a number and using the number of trailing zeros in that number as the depth of the chunk boundary.
182
184
183
-
For example, if we have a commit with hash `0xbce71a3b59784f0d507fd66abeb8d95e6bb2f2d606ff159ae01f8c719b2e0000` then we can say that this is the boundary of a level 4 stratum due to the four trailing zeros. Because hashes are distributed uniformly (or else we have other problems) then the chance of any particular character in some hash being `0` is $\frac{1}{10}$ and so the chance of having $n$ trailing zeros is $10^{-n}$ which means that we will have a hash boundary approximately every $10^{n}$ changes.
185
+
For example, if we have a commit with hash `0xbce71a3b59784f0d507fd66abeb8d95e6bb2f2d606ff159ae01f8c719b2e0000` (written here in hexadecimal) then we can say that this is the boundary of a depth 4 chunk due to the four trailing zeros. Because hashes are distributed uniformly (or else we have other problems), the chance of any particular digit of a hash written in base 10 being `0` is $\frac{1}{10}$, and so the chance of having at least $n$ trailing zeros is $10^{-n}$, which means that we will have a chunk boundary approximately every $10^{n}$ changes. (For the hexadecimal digits shown above the corresponding figures are $\frac{1}{16}$ and $16^{-n}$.)
184
186
185
187
We are not forced to stick with base 10: we can interpret the hash as a number in some base $b$, and then the number of trailing zeros in that base will give us a boundary approximately every $b^{n}$ changes.
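The trailing-zero rule can be sketched in a few lines of Python. This is only an illustration: it assumes the hash bytes are read as a big-endian unsigned integer, and the function names `trailing_zeros` and `expected_spacing` are made up for this example.

```python
def trailing_zeros(commit_hash: bytes, base: int = 10) -> int:
    """Count the trailing zero digits of the hash when written in `base`.

    Assumption: the hash bytes are interpreted as a big-endian unsigned integer.
    """
    if base < 2:
        raise ValueError("base must be at least 2")
    value = int.from_bytes(commit_hash, "big")
    if value == 0:
        # An all-zero hash would loop forever below; this sketch rejects it.
        raise ValueError("all-zero hash has no well-defined trailing zero count")
    count = 0
    while value % base == 0:
        value //= base
        count += 1
    return count


def expected_spacing(base: int, zeros: int) -> int:
    """Approximate number of changes between boundaries with `zeros` trailing zeros."""
    return base ** zeros


example_hash = bytes.fromhex(
    "bce71a3b59784f0d507fd66abeb8d95e6bb2f2d606ff159ae01f8c719b2e0000"
)
hex_zeros = trailing_zeros(example_hash, base=16)
```

For the example hash above, `hex_zeros` counts the four trailing hexadecimal zeros, and `expected_spacing(16, 4)` gives the approximate number of changes between boundaries of that kind.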
186
188
187
-
### Supporting Stratum And Checkpoint Commits
189
+
### Supporting Chunks And Checkpoint Commits
188
190
189
-
A stratum $x$ supports another stratum $y$ whenever $x$ contains all the commits in $y$. It is important for sedimentree sync to be able to determine whether one stratum supports another in order to be able to determine the minimal sedimentree.
191
+
A chunk $x$ supports another chunk $y$ whenever $x$ contains all the commits in $y$. It is important for sedimentree sync to be able to determine whether one chunk supports another in order to be able to determine the minimal sedimentree.
190
192
191
-
To this point I have talked about the boundaries of a stratum, but the start and an end hash is not enough to determine whether one stratum supports another without additional information. Consider this sedimentree:
193
+
To this point I have talked about the boundaries of a chunk, but a start and an end hash are not enough to determine whether one chunk supports another without additional information. Consider this sedimentree:
192
194
193
195

194
196
195
-
In this example the ghosted out boxes represent commits (in the case of square boxes with a letter) and stratum (in the case of rectangles) which were used to derive the non-ghosted strata but which we don't have access to (maybe we never had them, maybe we discarded them). All we know is that we have some strata, one which starts at `A` and ends at `F` (the larger one), one which starts at `A` and ends at `C`, and one which starts at `G` and ends at `I`. How can we know which of the smaller strata the large one supports?
197
+
In this example the ghosted out boxes represent commits (in the case of square boxes with a letter) and chunks (in the case of rectangles) which were used to derive the non-ghosted chunks but which we don't have access to (maybe we never had them, maybe we discarded them). All we know is that we have some chunks: one which starts at `A` and ends at `F` (the larger one), one which starts at `A` and ends at `C`, and one which starts at `G` and ends at `I`. How can we know which of the smaller chunks the large one supports?
196
198
197
-
To solve this we add the concept of "checkpoint commits". A checkpoint commit is a commit hash which would be the boundary of the smallest stratum in the system. For example, if we are producing strata for every commit that begins with two zeros then every commit hash which begins with two zeros is a checkpoint commit. We never discard checkpoint commits, which means that a stratum is now defined by it's start and end hash _and_ the checkpoint commits in it's interior.
199
+
To solve this we add the concept of "checkpoint commits". A checkpoint commit is a commit hash which would be the boundary of the shallowest chunk in the system. For example, if we are producing chunks for every commit that begins with two zeros then every commit hash which begins with two zeros is a checkpoint commit. We never discard checkpoint commits, which means that a chunk is now defined by its start and end hash _and_ the checkpoint commits in its interior.
198
200
199
201

200
202
201
-
With checkpoint commits we can always determine the supporting relationship. All stratum boundaries are on checkpoint commits, so if stratum $x$ supports stratum $y$ then the start and end hashes of $y$ will be somewhere in the set (start hash of x, end hash of x, checkpoints of x).
203
+
With checkpoint commits we can always determine the supporting relationship. All chunk boundaries are on checkpoint commits, so if chunk $x$ supports chunk $y$ then the start and end hashes of $y$ will be somewhere in the set (start hash of $x$, end hash of $x$, checkpoints of $x$).
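The check described above can be sketched as follows. This is a minimal illustration rather than the actual implementation: the `Chunk` type and its field names are hypothetical, and the sketch treats membership of both of $y$'s boundaries in $x$'s boundary set as the test for support, as the paragraph above suggests.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk metadata: boundary hashes plus interior checkpoints."""

    start: str
    end: str
    checkpoints: FrozenSet[str] = frozenset()

    def boundary_set(self) -> FrozenSet[str]:
        # The set (start hash, end hash, checkpoints) described in the text.
        return frozenset({self.start, self.end}) | self.checkpoints


def supports(x: Chunk, y: Chunk) -> bool:
    """Return True if both boundaries of `y` appear among `x`'s boundaries."""
    known = x.boundary_set()
    return y.start in known and y.end in known


# Mirrors the example above, assuming `C` is a checkpoint inside the large chunk.
large = Chunk(start="A", end="F", checkpoints=frozenset({"C"}))
small_ac = Chunk(start="A", end="C")
small_gi = Chunk(start="G", end="I")
```

With these assumed values `supports(large, small_ac)` holds while `supports(large, small_gi)` does not; without the checkpoint `C` the first relationship could not be established from the start and end hashes alone.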
204
+
202
205
### Loose Commits
203
206
204
-
The reverse depth first traversal ordering allows us to group commits into strata. But what do we do about commits for which we don't yet have a stratum boundary? Consider this DAG (with arrows from parents to children):
207
+
The reverse depth first traversal ordering allows us to group commits into chunks. But what do we do about commits for which we don't yet have a chunk boundary? Consider this DAG (with arrows from parents to children):
205
208
206
209
```mermaid
207
210
graph LR
@@ -210,19 +213,80 @@ b --> c
210
213
a --> d
211
214
```
212
215
213
-
Let's say we have stratum boundaries, `a,c`. Then we have one chunk which is `c,b,a`, but `d` doesn't belong in any stratum yet because there are no stratum boundaries which are children of it. This means we must store and transmit the commit as is. However, the commit on its own is not enough because we also need to be able to determine if, given some stratum $x$, the commit is supported by the stratum so that we can discard loose commits when we receive strata which cover them.
216
+
Let's say we have chunk boundaries `a,c`. Then we have one chunk which is `c,b,a`, but `d` doesn't belong in any chunk yet because there are no chunk boundaries which are children of it. This means we must store and transmit the commit as is. However, the commit on its own is not enough because we also need to be able to determine if, given some chunk $x$, the commit is supported by the chunk so that we can discard loose commits when we receive chunks which support them.
214
217
215
-
As with strata, just knowing the hash of a commit isn't enough to know whether it is supported by some stratum, so for loose commits we must ensure that we always retain all the commits linking the original commit back to any stratum boundaries which are it's parents.
218
+
As with chunks, just knowing the hash of a commit isn't enough to know whether it is supported by some chunk, so for loose commits we must ensure that we always retain all the commits linking the original commit back to any chunk boundaries which are its parents.
216
219
217
220
## Syncing a Sedimentree
218
221
219
222
Sedimentree sync starts by first requesting from a remote peer a "summary" of the minimal sedimentree according to that remote. Having received this summary we will know that there are certain ranges of the tree which we do not have and how large the data representing that section is. At this point we can decide to recurse into the missing structure to see if there are smaller sections of it to download at the cost of an extra round trip, or just download the whole missing part, accepting that we might download some data we already have. Once we have completed this process we know exactly which blobs we need to download from the remote in order to be in sync and we also know what blobs we need to upload in order for them to be in sync.
220
223
221
224
#### Sedimentree Summaries
222
225
223
-
The summary contains the boundaries of the strata in the tree as well as any loose commits. Importantly the summary _does not_ contain the internal checkpoint hashes of any of the strata or any of the actual data in the strata or commits.
226
+
The summary contains the boundaries of the chunks in the tree as well as any loose commits. Importantly the summary _does not_ contain the internal checkpoint hashes of any of the chunks or any of the actual data in the chunks or commits.
224
227
225
228
Omitting the checkpoint hashes is necessary because otherwise we would have to transmit all the checkpoint hashes in the document. If we use the first two leading zeros in base 10 as our chunk boundary then this would mean we're sending approximately 1% of the hashes in the document every time we sync. A hash is 32 bytes and it is quite normal for a document to contain hundreds of thousands of changes; for large documents this would mean sending multiple megabytes of data just to get the sedimentree.
226
229
227
230
For loose commits we can also omit all but the end of the commit chain and a count from the summary.
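To make the shape of a summary concrete, here is a hypothetical Python sketch. None of these type or field names come from the design; they only illustrate that a summary keeps chunk boundaries and sizes, drops checkpoint hashes, and reduces each loose commit chain to one hash plus a count. Which end of the chain is kept is an assumption here (the last element of the list).

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ChunkSummary:
    """Boundaries and payload size of a chunk; no checkpoints, no payload."""

    start: str
    end: str
    blob_size: int


@dataclass(frozen=True)
class LooseCommitSummary:
    """One end of a loose commit chain and the number of commits in it."""

    end: str
    count: int


@dataclass(frozen=True)
class SedimentreeSummary:
    chunks: Tuple[ChunkSummary, ...]
    loose_commits: Tuple[LooseCommitSummary, ...]


def summarize_chunk(
    start: str, end: str, checkpoints: Iterable[str], blob_size: int
) -> ChunkSummary:
    # The checkpoints are deliberately dropped from the summary.
    return ChunkSummary(start=start, end=end, blob_size=blob_size)


def summarize_loose_chain(chain: List[str]) -> LooseCommitSummary:
    # Assumption: `chain` is ordered so that its last element is the end we keep.
    if not chain:
        raise ValueError("a loose commit chain must contain at least one commit")
    return LooseCommitSummary(end=chain[-1], count=len(chain))


summary = SedimentreeSummary(
    chunks=(summarize_chunk("A", "F", ["C"], blob_size=4096),),
    loose_commits=(summarize_loose_chain(["x1", "x2", "x3"]),),
)
```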
228
231
232
+
## The Sedimentree Traversal Algorithm
233
+
234
+
This section describes in detail how to traverse the commit graph to produce the sequences of commits in each chunk in the sedimentree. Roughly speaking we want to start at the heads of the commit graph, then walk backwards towards the root. Every time we encounter a commit which is a chunk boundary, we continue walking towards the root until every path in the traversal towards the root has encountered a chunk boundary at the same level. Now that we know the start commits and end commit, we gather every commit on every path between any start commit and the end commit. This gives us the contents of the chunk. Finally, we update the heads we are traversing from to be the parents of the start commits and repeat the process until we have traversed the entire graph.
235
+
236
+
(Note that this means that each chunk has a single end commit but may have multiple start commits).
237
+
238
+
```
239
+
current_heads = stack of commits which are the heads of the commit graph