diff --git a/docs/xet/chunking.md b/docs/xet/chunking.md
index 1d501e044f..16db8fcc11 100644
--- a/docs/xet/chunking.md
+++ b/docs/xet/chunking.md
@@ -48,6 +48,24 @@ When a boundary found or taken:
At end-of-file, if `start_offset < len(data)`, emit the final chunk `[start_offset, len(data))`.
+### Decision Flowchart
+
+```mermaid
+flowchart TD
+ A["Read next byte b"] --> B["h = (h << 1) + TABLE[b]"]
+ B --> C["size = offset - start + 1"]
+ C --> D{"size < MIN_CHUNK_SIZE\n(8 KiB)?"}
+ D -->|Yes| A
+ D -->|No| E{"size >= MAX_CHUNK_SIZE\n(128 KiB)?"}
+ E -->|Yes| G["Emit chunk, reset h = 0"]
+ E -->|No| F{"(h & MASK) == 0?"}
+ F -->|Yes| G
+ F -->|No| A
+ G --> H{"End of file?"}
+ H -->|No| A
+ H -->|Yes| I["Emit final chunk if data remains"]
+```
+
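The flowchart above can be sketched in Python roughly as follows. This is a minimal illustration, not the reference implementation: `chunk_boundaries` is a hypothetical name, the 64-bit wrap-around of `h` is an assumption, and `table` and `mask` are supplied by the caller because this section does not spell out `TABLE` or `MASK`.

```python
from typing import Iterator, Sequence, Tuple

MIN_CHUNK_SIZE = 8 * 1024     # 8 KiB, as in the flowchart
MAX_CHUNK_SIZE = 128 * 1024   # 128 KiB, as in the flowchart
U64 = (1 << 64) - 1           # assumption: h is an unsigned 64-bit value that wraps


def chunk_boundaries(
    data: bytes, table: Sequence[int], mask: int
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) byte ranges, one per chunk."""
    start = 0
    h = 0
    for offset, b in enumerate(data):
        # Rolling hash update: h = (h << 1) + TABLE[b]
        h = ((h << 1) + table[b]) & U64
        size = offset - start + 1
        if size < MIN_CHUNK_SIZE:
            continue  # too small to cut, keep reading
        if size >= MAX_CHUNK_SIZE or (h & mask) == 0:
            yield (start, offset + 1)  # boundary found or forced
            start = offset + 1
            h = 0
    # End of file: emit whatever is left as the final chunk.
    if start < len(data):
        yield (start, len(data))
```

With a table of all zeros the hash never leaves 0, so every chunk is cut as soon as it reaches `MIN_CHUNK_SIZE`; with a table of all ones and a full-width mask the hash never matches, so every chunk is forced at `MAX_CHUNK_SIZE`. Both are toy inputs chosen only to exercise the two branches.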
### Pseudocode
```text
diff --git a/docs/xet/deduplication.md b/docs/xet/deduplication.md
index 1f00cefa10..4c625bdb6f 100644
--- a/docs/xet/deduplication.md
+++ b/docs/xet/deduplication.md
@@ -56,10 +56,10 @@ When a file is processed for upload, it undergoes the following steps:
```mermaid
graph TD
- A[File Input] --> B[Content-Defined Chunking]
- B --> C[Hash Computation]
- C --> D[Chunk Creation]
- D --> E[Deduplication Query]
+ A["File Input"] --> B["Content-Defined Chunking"]
+ B --> C["Hash Computation"]
+ C --> D["Chunk Creation"]
+ D --> E["Deduplication Query"]
```
1. **Chunking**: Content-defined chunking using GearHash algorithm creates variable-sized chunks of file data
diff --git a/docs/xet/file-id.md b/docs/xet/file-id.md
index 9e4bae8afd..d6ab8bd70b 100644
--- a/docs/xet/file-id.md
+++ b/docs/xet/file-id.md
@@ -31,3 +31,14 @@ This is the string representation of the hash and can be used directly in the fi
> [!NOTE]
> The resolve URL will return a 302 redirect http status code, following the redirect will download the content via the old LFS compatible route rather than through the Xet protocol.
In order to use the Xet protocol you MUST NOT follow this redirect.
+
+```mermaid
+sequenceDiagram
+ autonumber
+ actor C as Client
+ participant Hub as Hugging Face Hub
+ C->>Hub: GET /namespace/repo/resolve/branch/filepath<br/>Authorization: Bearer
+ Hub-->>C: 302 Redirect + X-Xet-Hash header
+ Note over C: Extract X-Xet-Hash value = Xet File ID<br/>Do NOT follow the 302 redirect
+ C->>C: Use File ID with CAS Reconstruction API
+```
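The exchange in the diagram can be sketched with Python's standard library. This is an illustration under stated assumptions: `extract_xet_file_id` and `fetch_xet_file_id` are hypothetical helper names, and the host, path and token are caller-supplied placeholders. `http.client` is used because it returns the redirect response as-is rather than following it.

```python
import http.client
from typing import Mapping, Optional


def extract_xet_file_id(headers: Mapping[str, str]) -> Optional[str]:
    """Return the X-Xet-Hash value, matching the header name case-insensitively."""
    for name, value in headers.items():
        if name.lower() == "x-xet-hash":
            return value
    return None


def fetch_xet_file_id(host: str, path: str, token: str) -> Optional[str]:
    """Request the resolve URL and read the file ID from the redirect response."""
    # http.client does not follow redirects, so the 302 is never chased.
    conn = http.client.HTTPSConnection(host)
    try:
        conn.request("GET", path, headers={"Authorization": f"Bearer {token}"})
        response = conn.getresponse()
        response.read()  # drain the (unused) body before closing
        return extract_xet_file_id(dict(response.getheaders()))
    finally:
        conn.close()
```

The returned string is the Xet file ID; a `None` result means the response carried no `X-Xet-Hash` header.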
diff --git a/docs/xet/hashing.md b/docs/xet/hashing.md
index 6b2c3ab7c9..db157ba5c5 100644
--- a/docs/xet/hashing.md
+++ b/docs/xet/hashing.md
@@ -9,6 +9,19 @@ The Xet protocol utilizes a few different hashing types.
All hashes referenced are 32 bytes (256 bits) long.
+```mermaid
+flowchart LR
+ subgraph Input
+ CD["Chunk Data"]
+ CH["Chunk Hashes"]
+ end
+ CD -->|"blake3(data, DATA_KEY)"| ChunkHash["Chunk Hash"]
+ ChunkHash --> CH
+ CH -->|"Merkle Tree\n+ INTERNAL_NODE_KEY"| XorbHash["Xorb Hash"]
+ CH -->|"Merkle Tree\n+ INTERNAL_NODE_KEY\nthen blake3(root, zeros)"| FileHash["File Hash"]
+ CH -->|"blake3(concat hashes,\nVERIFICATION_KEY)"| VerifHash["Term Verification Hash"]
+```
+
## Chunk Hashes
After cutting a chunk of data, the chunk hash is computed via a blake3 keyed hash with the following key (DATA_KEY):
diff --git a/docs/xet/index.md b/docs/xet/index.md
index 18faa58822..532852eb62 100644
--- a/docs/xet/index.md
+++ b/docs/xet/index.md
@@ -19,6 +19,43 @@ Implementors can create their own clients, SDKs, and tools that speak the Xet pr
## Overall Xet Architecture
+```mermaid
+block
+ columns 3
+ File["File"]
+ space
+ space
+ CDC["Chunking (CDC)"]
+ space
+ space
+ block:chunks
+ columns 5
+ C0["Chunk 0"] C1["Chunk 1"] C2["Chunk 2"] C3["..."] C4["Chunk N"]
+ end
+ space
+ space
+ space
+ block:xorbs
+ columns 2
+ X0["Xorb A\n(chunks 0–1023)"]
+ X1["Xorb B\n(chunks 1024–N)"]
+ end
+ space
+ Shard["Shard\n(file reconstructions\n+ xorb metadata)"]
+ space
+ space
+ space
+ CAS["CAS Server\n(Content Addressable Storage)"]
+ space
+ space
+ File --> CDC
+ CDC --> chunks
+ chunks --> xorbs
+ xorbs --> Shard
+ xorbs --> CAS
+ Shard --> CAS
+```
+
- [Content-Defined Chunking](./chunking): Gearhash-based CDC with parameters, boundary rules, and performance optimizations.
- [Hashing Methods](./hashing): Descriptions and definitions of the different hashing functions used for chunks, xorbs and term verification entries.
- [File Reconstruction](./file-reconstruction): Defining "term"-based representation of files using xorb hash + chunk ranges.
diff --git a/docs/xet/shard.md b/docs/xet/shard.md
index ab62b370e1..31d3a87e9d 100644
--- a/docs/xet/shard.md
+++ b/docs/xet/shard.md
@@ -116,12 +116,14 @@ struct MDBShardFileHeader {
**Memory Layout**:
-```txt
-┌────────────────────────────────────────────────────────────────┬───────────┬───────────┐
-│                         tag (32 bytes)                         │  version  │ footer_sz │
-│                    Magic Number Identifier                     │ (8 bytes) │ (8 bytes) │
-└────────────────────────────────────────────────────────────────┴───────────┴───────────┘
-0                                                                32          40          48
+```mermaid
+---
+title: "MDBShardFileHeader (48 bytes)"
+---
+packet
+ 0-31: "tag (32 bytes) - Magic Number Identifier"
+ 32-39: "version (u64)"
+ 40-47: "footer_size (u64)"
```
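A minimal sketch of decoding this 48-byte header with Python's `struct` module. The little-endian byte order is an assumption here (this section does not state it), and `ShardHeader` and `parse_shard_header` are hypothetical names used only for illustration.

```python
import struct
from typing import NamedTuple

# Assumption: integers are little-endian. "<" also disables alignment padding.
HEADER_FORMAT = "<32sQQ"  # tag (32 bytes), version (u64), footer_size (u64)
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 48 bytes


class ShardHeader(NamedTuple):
    tag: bytes
    version: int
    footer_size: int


def parse_shard_header(buf: bytes) -> ShardHeader:
    """Decode the fixed-size header from the start of a shard buffer."""
    if len(buf) < HEADER_SIZE:
        raise ValueError(f"need {HEADER_SIZE} bytes, got {len(buf)}")
    return ShardHeader(*struct.unpack_from(HEADER_FORMAT, buf))
```

The sketch only splits the bytes into fields; checking the tag and version against their expected values is left to the deserialization steps.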
**Deserialization steps**:
@@ -220,12 +222,15 @@ Given the `file_data_sequence_header.file_flags & MASK` (bitwise AND) operations
**Memory Layout**:
-```txt
-┌────────────────────────────────────────────────────────────────┬──────────┬───────────┬────────────┐
-│                      file_hash (32 bytes)                      │file_flags│num_entries│  _unused   │
-│                        File Hash Value                         │(4 bytes) │(4 bytes)  │ (8 bytes)  │
-└────────────────────────────────────────────────────────────────┴──────────┴───────────┴────────────┘
-0                                                                32         36          40           48
+```mermaid
+---
+title: "FileDataSequenceHeader (48 bytes)"
+---
+packet
+ 0-31: "file_hash (32 bytes)"
+ 32-35: "file_flags (u32)"
+ 36-39: "num_entries (u32)"
+ 40-47: "_unused (8 bytes)"
```
### FileDataSequenceEntry
@@ -247,13 +252,16 @@ struct FileDataSequenceEntry {
**Memory Layout**:
-```txt
-┌────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
-│                      cas_hash (32 bytes)                       │cas_flags│unpacked │chunk_idx│chunk_idx│
-│                         CAS Block Hash                         │(4 bytes)│seg_bytes│start    │end      │
-│                                                                │         │(4 bytes)│(4 bytes)│(4 bytes)│
-└────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘
-0                                                                32        36        40        44        48
+```mermaid
+---
+title: "FileDataSequenceEntry (48 bytes)"
+---
+packet
+ 0-31: "cas_hash (32 bytes) - Xorb Hash"
+ 32-35: "cas_flags (u32)"
+ 36-39: "unpacked_segment_bytes (u32)"
+ 40-43: "chunk_index_start (u32)"
+ 44-47: "chunk_index_end (u32)"
```
### FileVerificationEntry (OPTIONAL)
@@ -271,12 +279,13 @@ struct FileVerificationEntry {
**Memory Layout**:
-```txt
-┌────────────────────────────────────────────────────────────────┬────────────────────────────────┐
-│                     range_hash (32 bytes)                      │       _unused (16 bytes)       │
-│                       Verification Hash                        │         Reserved Space         │
-└────────────────────────────────────────────────────────────────┴────────────────────────────────┘
-0                                                                32                               48
+```mermaid
+---
+title: "FileVerificationEntry (48 bytes)"
+---
+packet
+ 0-31: "range_hash (32 bytes) - Verification Hash"
+ 32-47: "_unused (16 bytes)"
```
When a shard has verification entries, all file info sections MUST have verification entries.
@@ -302,12 +311,13 @@ struct FileMetadataExt {
**Memory Layout**:
-```txt
-┌────────────────────────────────────────────────────────────────┬────────────────────────────────┐
-│                       sha256 (32 bytes)                        │       _unused (16 bytes)       │
-│                          SHA256 Hash                           │         Reserved Space         │
-└────────────────────────────────────────────────────────────────┴────────────────────────────────┘
-0                                                                32                               48
+```mermaid
+---
+title: "FileMetadataExt (48 bytes)"
+---
+packet
+ 0-31: "sha256 (32 bytes) - SHA256 Hash"
+ 32-47: "_unused (16 bytes)"
```
### File Info Bookend
@@ -381,13 +391,16 @@ struct CASChunkSequenceHeader {
**Memory Layout**:
-```txt
-┌────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
-│                      cas_hash (32 bytes)                       │cas_flags│num_     │num_bytes│num_bytes│
-│                         CAS Block Hash                         │(4 bytes)│entries  │in_cas   │on_disk  │
-│                                                                │         │(4 bytes)│(4 bytes)│(4 bytes)│
-└────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘
-0                                                                32        36        40        44        48
+```mermaid
+---
+title: "CASChunkSequenceHeader (48 bytes)"
+---
+packet
+ 0-31: "cas_hash (32 bytes) - Xorb Hash"
+ 32-35: "cas_flags (u32)"
+ 36-39: "num_entries (u32)"
+ 40-43: "num_bytes_in_cas (u32)"
+ 44-47: "num_bytes_on_disk (u32)"
```
### CASChunkSequenceEntry
@@ -406,15 +419,15 @@ struct CASChunkSequenceEntry {
**Memory Layout**:
-```txt
-┌────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────────────┐
-│                     chunk_hash (32 bytes)                      │chunk_   │unpacked │     _unused     │
-│                           Chunk Hash                           │byte_    │segment_ │    (8 bytes)    │
-│                                                                │range_   │bytes    │                 │
-│                                                                │start    │(4 bytes)│                 │
-│                                                                │(4 bytes)│         │                 │
-└────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────────────┘
-0                                                                32        36        40                48
+```mermaid
+---
+title: "CASChunkSequenceEntry (48 bytes)"
+---
+packet
+ 0-31: "chunk_hash (32 bytes)"
+ 32-35: "chunk_byte_range_start (u32)"
+ 36-39: "unpacked_segment_bytes (u32)"
+ 40-47: "_unused (8 bytes)"
```
### CAS Info Bookend
@@ -451,23 +464,20 @@ struct MDBShardFileFooter {
**Memory Layout**:
-> [!NOTE]
-> Fields are not exactly to scale
-
-```txt
-┌─────────┬─────────┬─────────┬────────────────────────────────────────────────┬────────────────────────────────┐
-│ version │file_info│cas_info │               _buffer (reserved)               │      chunk_hash_hmac_key       │
-│(8 bytes)│offset   │offset   │                   (48 bytes)                   │           (32 bytes)           │
-│         │(8 bytes)│(8 bytes)│                                                │                                │
-└─────────┴─────────┴─────────┴────────────────────────────────────────────────┴────────────────────────────────┘
-0         8         16        24                                               72                               104
-
-┌─────────┬──────────┬────────────────────────────────────────────────────────────────────────┬─────────┐
-│creation │shard_    │                           _buffer (reserved)                           │footer_  │
-│timestamp│key_expiry│                               (72 bytes)                               │offset   │
-│(8 bytes)│ (8 bytes)│                                                                        │(8 bytes)│
-└─────────┴──────────┴────────────────────────────────────────────────────────────────────────┴─────────┘
-104       112        120                                                                      192       200
+```mermaid
+---
+title: "MDBShardFileFooter (200 bytes)"
+---
+packet
+ 0-7: "version (u64)"
+ 8-15: "file_info_offset (u64)"
+ 16-23: "cas_info_offset (u64)"
+ 24-71: "_buffer (48 bytes reserved)"
+ 72-103: "chunk_hash_hmac_key (32 bytes)"
+ 104-111: "shard_creation_timestamp (u64)"
+ 112-119: "shard_key_expiry (u64)"
+ 120-191: "_buffer2 (72 bytes reserved)"
+ 192-199: "footer_offset (u64)"
```
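The 200-byte footer layout maps onto a single `struct` format string in which the two reserved buffers become pad bytes. As a sketch only: little-endian integers are an assumption, and `ShardFooter` and `parse_shard_footer` are hypothetical names.

```python
import struct
from typing import NamedTuple

# Assumption: little-endian integers. "48x" and "72x" skip the reserved buffers.
FOOTER_FORMAT = "<QQQ48x32sQQ72xQ"
FOOTER_SIZE = struct.calcsize(FOOTER_FORMAT)  # 200 bytes


class ShardFooter(NamedTuple):
    version: int
    file_info_offset: int
    cas_info_offset: int
    chunk_hash_hmac_key: bytes
    shard_creation_timestamp: int
    shard_key_expiry: int
    footer_offset: int


def parse_shard_footer(buf: bytes) -> ShardFooter:
    """Decode a footer from exactly FOOTER_SIZE bytes."""
    if len(buf) != FOOTER_SIZE:
        raise ValueError(f"expected {FOOTER_SIZE} bytes, got {len(buf)}")
    return ShardFooter(*struct.unpack(FOOTER_FORMAT, buf))
```

Pad bytes produce no output values, so the seven unpacked values line up with the seven named fields in the diagram.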
**Deserialization steps**:
diff --git a/docs/xet/xorb.md b/docs/xet/xorb.md
index e043b129ec..b1f399dcb5 100644
--- a/docs/xet/xorb.md
+++ b/docs/xet/xorb.md
@@ -58,13 +58,15 @@ the uncompressed size also being at a maximum of 128KiB.
#### Chunk Header Layout
-```txt
-┌─────────┬─────────────────────────────────┬──────────────┬─────────────────────────────────┐
-│ Version │         Compressed Size         │ Compression  │        Uncompressed Size        │
-│ 1 byte  │             3 bytes             │     Type     │             3 bytes             │
-│         │         (little-endian)         │    1 byte    │         (little-endian)         │
-└─────────┴─────────────────────────────────┴──────────────┴─────────────────────────────────┘
-0         1                                 4              5                                 8
+```mermaid
+---
+title: "Chunk Header (8 bytes)"
+---
+packet
+ 0-7: "Version (1 byte)"
+ 8-31: "Compressed Size (3 bytes, LE)"
+ 32-39: "Compression Type (1 byte)"
+ 40-63: "Uncompressed Size (3 bytes, LE)"
```
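Because the two size fields are 3-byte little-endian integers, `struct` has no direct format code for them; `int.from_bytes` handles them cleanly. A minimal sketch, with `ChunkHeader` and `parse_chunk_header` as hypothetical names:

```python
from typing import NamedTuple


class ChunkHeader(NamedTuple):
    version: int
    compressed_size: int
    compression_type: int
    uncompressed_size: int


def parse_chunk_header(buf: bytes) -> ChunkHeader:
    """Decode the 8-byte chunk header at the start of buf."""
    if len(buf) < 8:
        raise ValueError(f"need 8 bytes, got {len(buf)}")
    return ChunkHeader(
        version=buf[0],
        compressed_size=int.from_bytes(buf[1:4], "little"),    # bytes 1..3
        compression_type=buf[4],
        uncompressed_size=int.from_bytes(buf[5:8], "little"),  # bytes 5..7
    )
```

Note that the packet diagram above counts bit positions (0-63), while the slices here use byte offsets (0-7).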
### Chunk Compression Schemes