feat: Lazy Spans and KV Blocks #249

nrfulton · 2025-12-11T21:18:40Z

Introduction to Mellea's Spans

A Span is a contiguous piece of text that defines a tokenization and KV cache boundary within a context.

Spans play two roles.

Spans delineate conceptually/semantically related content. Examples of Spans in this sense include: RAG documents, chunks of RAG docuemnts, encoded images, other artifacts (such as code, execution traces, error logs), and chat messages.

Most spans also define sensible KV boundaries modulo positional encodings. For example: we can pre-compute KV cache for all of the documents in a RAG database and then re-use those as prefixes. So we have the KV blocks associated with each document, and each of those KV blocks corresponds to a conceptually whole entity (the document). This is why we include the words "tokenization boundary" in the definition of the term Span.

It is useful to distinguish between the two roles that a Span plays when discussing implementation details. We refer to all the KV caching semantics as "KV blocks" and we refer the conceptual grouping semantics as "conceptual spans".

When we say "span" nakedly, we mean something that is both a "conceptual span" and also a "kv block"; i.e., a span is is a contiguous piece of text that defines a tokenization and KV cache boundary within a context

This PR is about both. It started with conceptual spans and focused on re-introducing these "conceptual spans" into mellea from one of our earlier experimental code bases. We then merged in the corresponding PR, now closed, on the KV span / KV block aspect (#111). The two PRs are now merged together.

Lazy Span Implementation Details

The Mellea tutorial uses the stdlib MelleaSession and mfunc abstractions to hide Mellea's core from the user. In this section we peel back the Session and mfunc abstracions so that we can see how Mellea works under the hood.

Mellea represents data using three types: Component | CBlock | ModelOutputThunk.

CBlocks are a wrapper around inputs to an LLM.
ModelOutputThunks are outputs from LLMs. These are created prior to any LLM call actually happening.
Components are composite types that implement a protocol that explains how the Component should be represented to an LLM.

Let's review each of these.

CBlocks and Thunks

CBlocks (and Components) are passed into a model via a Backend. The Backend emits a ModelOutputThunk (with a new Context which we will talk about in a moent). For example,

async def main():
    in_0 = CBlock("What is 1+1? Reply with only the number.")

    out_0: ModelOutputThunk, _ = await backend.generate_from_context(in_0, ctx=SimpleContext())
    print(f"Note: right now out_0 is not computed, so out_0=None (proof: {out_0.value})")
    next_int = await out_0.avalue()
    print(next_int) # out_0.value == next_int.

    in_1 = CBlock(value=f"What is {next_int} + {next_int}? Reply with only the number.")
    out_1: ModelOutputThunk, _ = await backend.generate_from_context(in_1, ctx=SimpleContext())
    print(await out_1.avalue())

asyncio.run(main())

Notice how a ModelOutputThunk can be uncomputed (mot.value is None) or computed mot.avalue is not None.

Important

We need to think about intermediate MoT states, such as where a mot has been cmoputed but has a tool call that is pending.

Components

Components can be composed of both CBlocks and ModelOutputThunks. For example,

class SimpleComponent(Component):
    """ aka tagless spans """
    def __init__(self, parts):
        self._parts = parts
    
    def parts(self):
        return self._parts

Let's extend this component a bit so that we can print it out and see which of its thunks are computed:

    @staticmethod
    def part_to_string(part) -> str:
        match part:
            case ModelOutputThunk() if part.value is not None:
                return part.value
            case ModelOutputThunk() if part.value is None:
                return "uncomputed!"
            case CBlock():
                return part.value
            case Component():
                formatted = part.format_for_llm()
                assert type(formatted) == str, "sic: actually need a formatter because this could be a template repr or a str but we're simplifying for now."
                return formatted

    def format_for_llm(self):
        str_parts : list[str] = [SimpleComponent.part_to_string(x) for x in self._parts]
        return " :: ".join(str_parts) + " :: EndList"

(Aside: Recall in the first example we had to await the value of out_0 before computing next_int.
One of the things we need to change is automatic awaiting on MoTs that are contituents of Components as part of the generate call. This existed in our first couple codebases and we need to add thatb ack here.)

Notably, Components can be constructed using ModelOutputThunks that are not yet computed. So, in our core data structure we have a data dependency graph. E.g.,

async def main():
    in_0 = CBlock("This is an input query")
    out_0, _ = await backend.generate_from_context(in_0, ctx=SimpleContext())

    # nb: out_0 is NOT necessarily computed as this program point! Notice the None.
    component = SimpleComponent(parts=[in_0, out_0])
    print(backend.formatter.print(component))

    # again: out_0 is not yet computed so this component is also not computed.
    another_component = SimpleComponent(parts=[in_0, out_0, component])

    # we can generate into the component after forcing out_0 to complete.
    await out_0.avalue()
    out_component, _ = await backend.generate_from_context(component, ctx=SimpleContext())

    # now out_0 IS computed, but the output from component (out_component) is not!
    third_component = SimpleComponent(parts=[in_0, out_0, out_component])
    print(backend.formatter.print(third_component))

    # so le's resolve everything and make sure there's nothing uncomputed in the output from third component.
    await out_component.avalue()
    third_out, _ = await backend.generate_from_context(third_component, ctx=SimpleContext())
    assert "uncomputed!" not in await third_out.avalue()

asyncio.run(main())

KV Blocks

Each CBlock | Component additionally corresponds to a tokenization boundary and associated KV cache. These KV caches can be mashed together to allow for cache reuse beyond prefix-based reuse. This is currently implemented in the huggingface backend.

Remaining Todos

kv block stuff

Component.parts() needs implementation and walks, similar to the old span default_formatter_walk.
Heuristics for excluding improperly marked blocks/parts
Think about how this integrates with alora code
Add use_kv_caching flag to generate_from_context.
Add benchmarking (quality impacts and time savings)
Tests, examples, tutorials
deal with positions

Won't-do-for-now

lmcache integration
Perhaps refactor at least the HF backend code - captured in a separate issue.

stdlib fixes

Define parts() for existing components:

no-verify.

And the way it fits into a model that uses apply_chat_tempalte or any other parser/renderer. Note that there's still a bug entailed by the chance that there are also substrings which "hit" on the cached contents. We don't anticipate this happens often in practice because of how KV cache smashing should typically be used, but it's something we need to address by introducing the use of sentinel values, or indexing string machines, or something else along those lines. no-verify commit because the point of this code is documentation.

mergify · 2025-12-11T21:19:14Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

nrfulton · 2025-12-12T03:30:35Z

mellea/backends/_utils.py

+def generate_walk(c: CBlock | Component | ModelOutputThunk) -> list[ModelOutputThunk]:
+    """Returns the generation walk ordering for a Span."""
+    match c:
+        case ModelOutputThunk() if not c.is_computed():
+            return [c]
+        case CBlock():
+            return []
+        case Component():
+            parts_walk = [generate_walk(p) for p in c.parts()]
+            return itertools.chain.from_iterable(parts_walk)  # aka flatten


FYI @jakelorocco

we'll have to start doing this in the backend generate calls.

This also means that we need to go back through stdlib and use parts() correctly. (No action on your part atm)

[-] We probably want some sort of linting rule for third party code that warns the developer when they've got data in a Component class which has type CBlock | Component but which does not appear in parts().

[-] I think we might want to make ModelOutputThunk NOT be a subtype of CBlock because Python pattern matching is first-match not most-specific-match.

@nrfulton, should we also add some sort of computed / non-computed flag to Components because they will now suffer a similar situation as ModelOutputThunks?

And is it up to the Component owner what happens when not all parts of a Component are computed? For example, with a ModelOutputThunk, it's value is None until it is fully computed. Should we specify a similar default behavior for components?

I think we might want to make ModelOutputThunk NOT be a subtype of CBlock because Python pattern matching is first-match not most-specific-match.

I think that's fine. It's yet to be seen / fully implemented, but in the work for adding return types and parsing functions to Components, a CBlock is really just a Component with no parts (or one part?) that has a str return type.

should we also add some sort of computed / non-computed flag to Components because they will now suffer a similar situation as ModelOutputThunks?

I need to think about this. It's not quite the same as ModelOutputThunks. And I think it can be a computed method rather than a flag.

Should we specify a similar default behavior for components?

We need to think about this. It's different from what happens with mots.

Things can go wrong. In particular: Component.format_for_llm should only be called when component prefillable judgement is derivable. But to your question regarding "similar behavior": format_for_llm can't ensure this contract holds itself because it doesn't have a backend in context (and shouldn't!).

NB: the problem isn't introduced by this PR, it already exists, right?

Yes, it already exists but doesn't manifest since we pretty much always use computed stuff right now.

nrfulton · 2025-12-12T03:33:29Z

docs/examples/melp/states.py

+import mellea
+from mellea.stdlib.base import CBlock, Context, SimpleContext
+from mellea.stdlib.span import Span, SimpleComponent
+from mellea.backends import Backend
+from mellea.backends.ollama import OllamaModelBackend
+import asyncio
+
+
+async def main(backend: Backend, ctx: Context):
+    a_states = "Alaska,Arizona,Arkansas".split(",")
+    m_states = "Missouri", "Minnesota", "Montana", "Massachusetts"
+
+    a_state_pops = dict()
+    for state in a_states:
+        a_state_pops[state], _ = await backend.generate_from_context(
+            CBlock(f"What is the population of {state}? Respond with an integer only."),
+            SimpleContext(),
+        )
+    a_total_pop = SimpleComponent(
+        instruction=CBlock(
+            "What is the total population of these states? Respond with an integer only."
+        ),
+        **a_state_pops,
+    )
+    a_state_total, _ = await backend.generate_from_context(a_total_pop, SimpleContext())
+
+    m_state_pops = dict()
+    for state in m_states:
+        m_state_pops[state], _ = await backend.generate_from_context(
+            CBlock(f"What is the population of {state}? Respond with an integer only."),
+            SimpleContext(),
+        )
+    m_total_pop = SimpleComponent(
+        instruction=CBlock(
+            "What is the total population of these states? Respond with an integer only."
+        ),
+        **m_state_pops,
+    )
+    m_state_total, _ = await backend.generate_from_context(m_total_pop, SimpleContext())
+
+    print(await a_state_total.avalue())
+    print(await m_state_total.avalue())
+
+
+backend = OllamaModelBackend(model_id="granite4:latest")
+asyncio.run(main(backend, SimpleContext()))


FYI @HendrikStrobelt this is what lazy spans look like now.

Remember that await backend.generate_from_context doesn't actually await on the computation of the result. This merely awaits on the triggering on the generate call. So the full lifecycle of an call that looks sync has two awaits:

mot, new_ctx = await backend.generate_from_context(...) result: str = await mot.avalue()

It's not the prettiest code in the world, but it's nice to see that lazy spans still work after our long sojourn into devexp land.

Remember that await backend.generate_from_context doesn't actually await on the computation of the result. This merely awaits on the triggering on the generate call.

Just wanted to call this out since python async is weird. Since backend.generate_from_context() can always do work immediately (ie processing the model opts / context, queueing up the API call, ...), Python should never actually pause the control flow at that await boundary. It will always immediately do the work to get you the ModelOutputThunk since none of the backends (currently) have await statements inside their backend.generate_from_context() functions that actually have to await asynchronous work being done.

Another gotcha: we should await .gather() rather than await if you are awaiting on multiple things. There's a bug in my version of the generate_walk:

_to_compute = generate_walk(action) await asyncio.gather([x.avalue() for x in _to_compute])

nrfulton · 2025-12-12T15:36:31Z

Related stuff coming out of today's standup:

Refactor Backend so that the generate calls implement a protocol and individual stepsi n that protocol are overridden by specific implementations while others aren't <- opened issue and did it the manual way for now.
Similarly, we need to define a lifecycle for spans. (new -> prefilled -> computed -> post-processing). <- this is being addressing in melp.

nrfulton · 2025-12-12T21:45:00Z

Backend cleanup debt captured in #253

TODO-nrf: we need to add generate walks to every generation call.

Deletes the stdlib.span package and moves simplecomponent into base. Fixes a big in call to gather (should be *list not list)

Accepted the `nathan/conceptual_spans` side of this merge for huggingface.py. I'm now going to re-add that code in the next commit.

nrfulton · 2026-01-05T15:58:27Z

From a conversation with @jakelorocco : It is dangerous for us to have both format_for_llm and parts.

We should make it easy for Component developers to say how a component should be represented to an LLM without understanding the core, but also in a way that does not violate invariants about parts().

A major "gotcha" is that Component developers might rip out strings from CBlocks or ModelOutputThunks within their format_for_llm call. This would violate our parts() invariant (would it?). the main failure mode is that someone calls .value on a mot in format_for_llm (or prior to format_for_llm), the .value call returns None because the mot is not computed, and the component developer (or their users) get surprising behavior because of Nones everywhere.

We can guard against this in two ways:

In format_for_llm our output types should insist on CBlock | Component | MOT. E.g., in TemplateRepresntation. We should do runtime type checking to provide training wheels.
Documentation...

Then, the parts() method is entirely derived from format_for_llm.

The format_for_llm method might need to be refactored to enable this.

docs/examples/melp/lazy.py

docs/examples/melp/lazy_fib_sample.py

docs/examples/melp/lazy_fib.py

docs/rewrite/streaming/1.py

mellea/backends/huggingface.py

mellea/stdlib/base.py

mellea/stdlib/docs/richdocument.py

mellea/stdlib/intrinsics/intrinsic.py

mellea/stdlib/requirement.py

jakelorocco · 2026-01-05T17:21:38Z

[!IMPORTANT]
We need to think about intermediate MoT states, such as where a mot has been cmoputed but has a tool call that is pending.

Mots are fully computed when generation is done (even if they have an unresolved tool request). It's up to the user or sampling strategy to decide whether to actually call that tool. We can flag that better. But the result of the tool call shouldn't necessarily be a part of the mot. It's its own object that must be passed back to the model as a separate message if desired.

anyways.

jakelorocco

lgtm; ran all tests

* Adds cache smash code from the Project M codebase. * rename to avoid clash b/w cache/ and cache.py * Adds cache flag to CBlock. * Initial work on re-introducing span-ish KV caching. no-verify. * Adds a crystallization of the kv smash code And the way it fits into a model that uses apply_chat_tempalte or any other parser/renderer. Note that there's still a bug entailed by the chance that there are also substrings which "hit" on the cached contents. We don't anticipate this happens often in practice because of how KV cache smashing should typically be used, but it's something we need to address by introducing the use of sentinel values, or indexing string machines, or something else along those lines. no-verify commit because the point of this code is documentation. * Adds KV cache smash. * Adds example of kv cache smash. * Adds a SimpleComponent. * Adds a simple lazy example. * ollama generate walk. TODO-nrf: we need to add generate walks to every generation call. * Does gather() instead of awaiting on each thunk separately. * Refactor and bug fixes. Deletes the stdlib.span package and moves simplecomponent into base. Fixes a big in call to gather (should be *list not list) * backend walks. * Adds heapcomponents. * Make uncomputed mots logging less noisy. * adds a simple example. * Cleans up fib example. * Adds parts() for instruction and genslot components. * Don't call things components which are not components. * ruff. * Starts adding some examples for a deepdive on sessions. * blah * blah * Add parts() to chat. * Fixes GenerativeSlot.parts() * Confirm assumption that RichDocument has no parts() for now. * Define parts() on TableQuery * Fixes ruff errors. * Fixes error in HeapContext.add caught by mypy. * Fixes mypy errors caused by shadowing * Adds parts() definitions to the rest of the RichDocument components. These need a substantial cleanup and refactor with greater attention to detail. * fixes Instruction.parts() * Improves warning message for Intrinsic.parts() * update comment on mify.parts() * parts() implementations for MObject components. * parts() implementation for Requirements. * Some notes about the deep dives. * Fixes line noise in previous commit. * Finish resolving merge. * Examples are working (for some value of working -- results are garbage. * precommit hooks are passing. * Small changes to hf kv smash example. * Fix fib example. * Remove accidental commit. * Removes unnecessary print statements. * Removes HeapContext. * Intrinsics cannot surface parts because they always rewrite history anyways. * removes dead helper code. * removed code clone. * adds test. * Adds type:ignore because mypy 1.19.1 is buggy. * fixes bug in GenerativeSlot.parts() * adds missing arg in span tests. * See generative-computing#258 * fixes failing tests. --------- Co-authored-by: Avinash Balakrishnan <[email protected]>

nrfulton and others added 10 commits August 27, 2025 19:31

Adds cache smash code from the Project M codebase.

49fdcf1

rename to avoid clash b/w cache/ and cache.py

a1a4eb7

Adds cache flag to CBlock.

5989664

Initial work on re-introducing span-ish KV caching.

fab35d9

no-verify.

Adds KV cache smash.

ead3fe8

Adds example of kv cache smash.

1cd08ae

Merge branch 'main' into nathan/kv_block_hack

212a768

Merge branch 'generative-computing:main' into nathan/kv_block_hack

e53e13b

Merge branch 'main' into nathan/kv_block_hack

806eef7

nrfulton commented Dec 12, 2025

View reviewed changes

nrfulton assigned guicho271828 and unassigned guicho271828 Dec 17, 2025

nrfulton added 13 commits December 17, 2025 13:09

Adds a SimpleComponent.

60f2178

Adds a simple lazy example.

ef6daf6

ollama generate walk.

2fa6967

TODO-nrf: we need to add generate walks to every generation call.

Does gather() instead of awaiting on each thunk separately.

a9be6f6

Refactor and bug fixes.

5ea5312

Deletes the stdlib.span package and moves simplecomponent into base. Fixes a big in call to gather (should be *list not list)

backend walks.

22ac0db

Adds heapcomponents.

7187941

Make uncomputed mots logging less noisy.

6ea6d46

adds a simple example.

4f37d96

Cleans up fib example.

152ede9

Adds parts() for instruction and genslot components.

477275d

Don't call things components which are not components.

976ac06

ruff.

1797be4

nrfulton mentioned this pull request Jan 3, 2026

Spans and KV Blocks #279

Closed

nrfulton linked an issue Jan 3, 2026 that may be closed by this pull request

Spans and KV Blocks #279

Closed

nrfulton added 4 commits January 3, 2026 13:47

PARTIAL Merge branch 'nathan/kv_block_hack' into nathan/conceptual_spans

7ec3aef

Accepted the `nathan/conceptual_spans` side of this merge for huggingface.py. I'm now going to re-add that code in the next commit.

Finish resolving merge.

6095189

Examples are working (for some value of working -- results are garbage.

912d6ea

precommit hooks are passing.

156f3db

nrfulton mentioned this pull request Jan 3, 2026

feat: KV Blocks #111

Closed

8 tasks

nrfulton changed the title ~~feat: Conceptual Spans~~ feat: Lazy Spans and KV Blocks Jan 3, 2026

Small changes to hf kv smash example.

09d502c

jakelorocco reviewed Jan 5, 2026

View reviewed changes

nrfulton and others added 15 commits January 5, 2026 14:04

Fix fib example.

7f21472

Remove accidental commit.

ece648a

Removes unnecessary print statements.

a81818d

Removes HeapContext.

1c5a03a

Intrinsics cannot surface parts because they always rewrite history

0e962b4

anyways.

removes dead helper code.

f247196

removed code clone.

adea6aa

adds test.

5d9a2c1

Adds type:ignore because mypy 1.19.1 is buggy.

cce4fd2

fixes bug in GenerativeSlot.parts()

511bca7

Merge branch 'main' into nathan/conceptual_spans

f8a6a52

adds missing arg in span tests.

c91d3ad

Merge branch 'main' into nathan/conceptual_spans

2cb6abe

See generative-computing#258

f4d1afe

fixes failing tests.

b388091

jakelorocco approved these changes Jan 7, 2026

View reviewed changes

jakelorocco merged commit b9e4a33 into generative-computing:main Jan 7, 2026
1 of 4 checks passed

feat: Lazy Spans and KV Blocks #249

feat: Lazy Spans and KV Blocks #249

Uh oh!

Conversation

nrfulton commented Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Introduction to Mellea's Spans

Lazy Span Implementation Details

CBlocks and Thunks

Components

KV Blocks

Remaining Todos

kv block stuff

Won't-do-for-now

stdlib fixes

Uh oh!

mergify bot commented Dec 11, 2025

Merge Protections

🟢 Enforce conventional commit

Uh oh!

nrfulton Dec 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jakelorocco Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

jakelorocco Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

nrfulton Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

nrfulton Dec 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jakelorocco Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

nrfulton Dec 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jakelorocco Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

nrfulton Dec 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nrfulton commented Dec 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nrfulton commented Dec 12, 2025

Uh oh!

nrfulton commented Jan 5, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jakelorocco commented Jan 5, 2026

Uh oh!

jakelorocco left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

nrfulton commented Dec 11, 2025 •

edited

Loading

nrfulton Dec 12, 2025 •

edited

Loading

nrfulton Dec 12, 2025 •

edited

Loading

nrfulton Dec 12, 2025 •

edited

Loading

nrfulton Dec 12, 2025 •

edited

Loading

nrfulton commented Dec 12, 2025 •

edited

Loading