
@register_section: pluggable AnnData sections#7

Draft
katosh wants to merge 15 commits into html_rep from register_section

Collaborator

@katosh katosh commented Mar 30, 2026

@register_section: pluggable AnnData sections

This PR lets external packages add new sections to AnnData — with storage, validation, subsetting, IO, and repr — using a single decorator. No subclassing needed.

Quick example

import anndata as ad
import numpy as np
import pandas as pd
from anndata.extensions import register_section

@register_section("obst", alignment="obs")
class ObstSection:
    """Observation trees (like obsm, but for tree data)."""
    pass

That's it. Now every AnnData object has an obst section:

>>> adata = ad.AnnData(
...     X=np.random.rand(4, 3),
...     obs=pd.DataFrame({"cell_type": ["T", "T", "B", "B"]}, index=["c1", "c2", "c3", "c4"]),
...     var=pd.DataFrame(index=["CD8A", "LCK", "MS4A1"]),
...     obst={"lineage": np.random.rand(4, 2)},    # init kwarg works
... )

>>> adata.obst["lineage"].shape
(4, 2)

>>> repr(adata)
AnnData object with n_obs × n_vars = 4 × 3
    obs: 'cell_type'
    obst: 'lineage'

>>> t_cells = adata[adata.obs["cell_type"] == "T"]   # subsetting works
>>> t_cells.obst["lineage"].shape
(2, 2)

>>> adata.write("test.h5ad")                          # IO works
>>> adata2 = ad.read_h5ad("test.h5ad")
>>> adata2.obst["lineage"].shape
(4, 2)
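
As a rough illustration of how a decorator can make a section appear on every object, the sketch below registers a spec in a class-level registry and attaches a property with a lazily created backing store. It is not the PR's implementation: ToySpec, ToyContainer, and toy_register_section are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical metadata record for one registered section."""
    name: str
    alignment: tuple
    doc: str


class ToyContainer:
    """Hypothetical stand-in for AnnData with a class-level registry."""
    _registry = {}

    def __init__(self):
        self._stores = {}  # one dict per section, created on first access


def toy_register_section(name, alignment=()):
    """Hypothetical decorator: record a spec and attach a property accessor."""
    if isinstance(alignment, str):
        alignment = (alignment,)  # "obs" is shorthand for ("obs",)

    def decorator(cls):
        spec = ToySpec(name, tuple(alignment), cls.__doc__ or "")
        ToyContainer._registry[name] = spec

        def getter(self):
            # Lazily create the backing store for this section.
            return self._stores.setdefault(name, {})

        setattr(ToyContainer, name, property(getter))
        return cls

    return decorator


@toy_register_section("obst", alignment="obs")
class ObstSection:
    """Observation trees."""


container = ToyContainer()
container.obst["lineage"] = [0, 1, 2]
print(ToyContainer._registry["obst"].alignment)  # ('obs',)
print(container.obst)                            # {'lineage': [0, 1, 2]}
```

Because the property lives on the class, objects created before or after registration both expose the section, each with its own store.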

Why

Packages that extend AnnData (TreeData, SpatialData) currently have to subclass AnnData or reimplement its internals. TreeData reimplements the entire write traversal with its own hardcoded section list. The same list is hardcoded in at least four places across anndata (write_h5ad, write_anndata, read_anndata, _gen_repr). @register_section replaces those hardcoded lists with a single registry, so all four code paths discover extension sections automatically.

Alignment

The alignment parameter declares which AnnData axes each dimension of the stored data is aligned to. This controls both validation (shape must match) and subsetting (which dims get sliced when you do adata[obs_idx, var_idx]).

| alignment | subsetting behavior | like | use case |
|---|---|---|---|
| "obs" | dim 0 follows obs | obsm | per-cell embeddings |
| "var" | dim 0 follows var | varm | per-gene annotations |
| ("obs", "var") | dim 0 = obs, dim 1 = var | layers | alternative matrices |
| ("obs", "obs") | both dims follow obs | obsp | cell-cell distances |
| ("var", "var") | both dims follow var | varp | gene-gene correlations |
| () | no subsetting | | images, configs |
| ("obs", "obs", "var") | 3D tensor | | cell-cell communication per gene |
| ("obs", "var", "var") | 3D tensor | | cell-specific gene regulation |
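
The way an alignment tuple can drive subsetting is sketched below with plain NumPy: each aligned dimension is indexed by the obs or var index, and np.ix_ keeps the dimensions independent. This is a minimal sketch under assumptions, not the PR's code; subset_by_alignment is a hypothetical helper, and indices are assumed to be integer positions.

```python
import numpy as np


def subset_by_alignment(value, alignment, obs_idx, var_idx):
    """Slice each aligned dimension of `value` by the matching axis index."""
    if not alignment:
        return value  # unaligned sections pass through unchanged
    per_axis = {"obs": np.asarray(obs_idx), "var": np.asarray(var_idx)}
    # np.ix_ builds an open mesh so every dimension is indexed independently.
    return value[np.ix_(*(per_axis[axis] for axis in alignment))]


n_obs, n_vars = 4, 3
all_obs, all_vars = np.arange(n_obs), np.arange(n_vars)
t_cells = [0, 1]   # positions of the selected cells
genes = [0, 1]     # positions of the selected genes

cellcomm = np.random.rand(n_obs, n_obs, n_vars)   # ("obs", "obs", "var")
genereg = np.random.rand(n_obs, n_vars, n_vars)   # ("obs", "var", "var")
embedding = np.random.rand(n_obs, 2)              # "obs": trailing dim is free

print(subset_by_alignment(cellcomm, ("obs", "obs", "var"), t_cells, all_vars).shape)  # (2, 2, 3)
print(subset_by_alignment(cellcomm, ("obs", "obs", "var"), all_obs, genes).shape)     # (4, 4, 2)
print(subset_by_alignment(genereg, ("obs", "var", "var"), all_obs, genes).shape)      # (4, 2, 2)
print(subset_by_alignment(embedding, ("obs",), t_cells, all_vars).shape)              # (2, 2)
```

Dimensions beyond the length of the alignment tuple (such as the embedding's second axis) are left untouched.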

3D tensors for cell-cell communication (CellChat, LIANA, CellPhoneDB):

@register_section("cellcomm", alignment=("obs", "obs", "var"))
class CellCommSection:
    """Ligand-receptor scores: sender_cell × receiver_cell × gene."""
    pass

>>> adata.cellcomm["lr_scores"] = np.random.rand(4, 4, 3)  # (n_obs, n_obs, n_vars)
>>> adata.cellcomm["lr_scores"].shape
(4, 4, 3)

>>> t_cells = adata[adata.obs["cell_type"] == "T"]
>>> t_cells.cellcomm["lr_scores"].shape      # both cell dims subset
(2, 2, 3)

>>> sub = adata[:, ["CD8A", "LCK"]]
>>> sub.cellcomm["lr_scores"].shape           # gene dim subsets
(4, 4, 2)

Cell-specific gene regulatory networks (SCENIC, CellOracle, Dictys):

@register_section("genereg", alignment=("obs", "var", "var"))
class GeneRegSection:
    """Per-cell GRN: cell × source_gene × target_gene."""
    pass

>>> adata.genereg["scenic"] = np.random.rand(4, 3, 3)  # (n_obs, n_vars, n_vars)

>>> t_cells = adata[adata.obs["cell_type"] == "T"]
>>> t_cells.genereg["scenic"].shape          # cell dim subsets
(2, 3, 3)

>>> sub = adata[:, ["CD8A", "LCK"]]
>>> sub.genereg["scenic"].shape               # both gene dims subset
(4, 2, 2)

Custom behavior

All methods are optional. Omit any you don't need.

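import networkx as nx

# subset_tree, digraph_to_json, json_to_digraph, and FormattedOutput are
# assumed to be defined or imported elsewhere (not shown here).
@register_section("obst", alignment="obs")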
class ObstSection:
    value_type = nx.DiGraph                      # type enforcement
    section_after = "obsm"                       # position in repr
    section_tooltip = "Observation trees"        # hover text

    @staticmethod
    def validate(key, value):                    # custom validation
        if not nx.is_tree(value):
            raise ValueError(f"{key} must be a tree")

    @staticmethod
    def subset(value, idx):                      # custom subsetting
        return subset_tree(value, idx)

    @staticmethod
    def serialize(value):                        # custom write
        return digraph_to_json(value)

    @staticmethod
    def deserialize(data):                       # custom read
        return json_to_digraph(data)

    @staticmethod
    def repr_entry(key, value, context):         # custom HTML repr
        return FormattedOutput(type_name=f"Tree ({value.number_of_nodes()} nodes)")

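Validation in action (the messages below assume a section registered with value_type = np.ndarray and a validate hook that requires 2D arrays, not the nx.DiGraph variant above):

>>> adata.obst["bad"] = [[1, 2], [3, 4]]        # type check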
TypeError: Values in 'obst' must be ndarray, got list

>>> adata.obst["bad"] = np.ones(3)              # custom validate
ValueError: bad must be 2D, got 1D

>>> adata.obst["bad"] = np.ones((10, 2))        # alignment check
ValueError: Value for obst['bad'] has shape[0]=10, expected 4 (n_obs)
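
The three messages above suggest a check order of type, then custom hook, then alignment. The sketch below reproduces that order in plain Python; the order is inferred from the example output and is an assumption, and check_value, require_2d, and AXIS_LABEL are hypothetical names rather than part of the PR.

```python
import numpy as np

AXIS_LABEL = {"obs": "n_obs", "var": "n_vars"}


def check_value(section, key, value, *, value_type, alignment, axis_len, validate=None):
    """Run the checks in order: type, custom hook, then alignment."""
    if not isinstance(value, value_type):
        raise TypeError(
            f"Values in {section!r} must be {value_type.__name__}, "
            f"got {type(value).__name__}"
        )
    if validate is not None:
        validate(key, value)
    if value.ndim < len(alignment):
        raise ValueError(f"{key} needs at least {len(alignment)} dimensions")
    for dim, axis in enumerate(alignment):
        if value.shape[dim] != axis_len[axis]:
            raise ValueError(
                f"Value for {section}[{key!r}] has shape[{dim}]={value.shape[dim]}, "
                f"expected {axis_len[axis]} ({AXIS_LABEL[axis]})"
            )


def require_2d(key, value):
    """Example custom hook: only accept 2D arrays."""
    if value.ndim != 2:
        raise ValueError(f"{key} must be 2D, got {value.ndim}D")


axis_len = {"obs": 4, "var": 3}
for candidate in ([[1, 2], [3, 4]], np.ones(3), np.ones((10, 2)), np.ones((4, 2))):
    try:
        check_value("obst", "bad", candidate, value_type=np.ndarray,
                    alignment=("obs",), axis_len=axis_len, validate=require_2d)
        print("accepted", candidate.shape)
    except (TypeError, ValueError) as err:
        print(f"{type(err).__name__}: {err}")
```

Running the custom hook before the alignment check is what lets np.ones(3) fail with the 2D message rather than a shape mismatch.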

xarray DataArray example

Custom types that anndata can't natively serialize work end-to-end via serialize/deserialize:

import xarray as xr

@register_section("xr_layers", alignment=("obs", "var"))
class XarrayLayers:
    value_type = xr.DataArray

    @staticmethod
    def serialize(value):
        return value.values          # xarray → numpy for h5ad (dims and coords are dropped)

    @staticmethod
    def deserialize(data):
        return xr.DataArray(data)    # numpy → xarray on read (default dim names)

>>> adata.xr_layers["scaled"] = xr.DataArray(np.random.rand(4, 3), dims=["obs", "var"])
>>> adata.write("test.h5ad")
>>> adata2 = ad.read_h5ad("test.h5ad")
>>> isinstance(adata2.xr_layers["scaled"], xr.DataArray)
True

What you get for free

| Feature | Works automatically |
|---|---|
| adata.obst["x"] = array | Property accessor + validation |
| adata[:10].obst | Subsetting via declared alignment |
| adata.copy() | Deep copy of registered sections |
| adata.write("f.h5ad") | IO via serialize (or standard write_elem) |
| ad.read_h5ad("f.h5ad") | IO via deserialize (or standard read_elem) |
| AnnData(obst={...}) | Init kwargs |
| repr(adata) | Shows when non-empty |
| View copy-on-write | Writing to a view triggers copy |

Scaling 3D tensors: factored storage + accessor

A dense (n_obs × n_obs × n_vars) tensor is infeasible for large datasets (1M cells × 1M cells × 30K genes ≈ 3 × 10^16 entries). The practical pattern is to store compact rank-R factors and reconstruct on demand:

# Register factor storage (tiny: n_obs × rank and n_vars × rank)
@register_section("comm_obs", alignment="obs")
class CommObs:
    pass

@register_section("comm_var", alignment="var")
class CommVar:
    pass

# Register accessor for tensor reconstruction
@register_anndata_namespace("comm")
class CellCommAccessor:
    def __init__(self, adata: ad.AnnData):
        self._adata = adata

    def tensor(self, key="default"):
        """Reconstruct (obs × obs × var) tensor from factors."""
        U = self._adata.comm_obs[key]   # (n_obs, rank)
        V = self._adata.comm_var[key]   # (n_vars, rank)
        return np.einsum("ir,jr,kr->ijk", U, U, V)

    def query(self, sender, receiver, gene, key="default"):
        """O(rank) point query without materializing tensor."""
        U = self._adata.comm_obs[key]
        V = self._adata.comm_var[key]
        i = self._adata.obs_names.get_loc(sender)
        j = self._adata.obs_names.get_loc(receiver)
        k = self._adata.var_names.get_loc(gene)
        return float(U[i] @ (U[j] * V[k]))

>>> adata.comm_obs["lr"] = np.random.rand(100, 10)   # factors: 12 KB
>>> adata.comm_var["lr"] = np.random.rand(50, 10)

>>> adata.comm.tensor("lr").shape                      # dense tensor: 4 MB
(100, 100, 50)

>>> adata.comm.query("cell_0", "cell_1", "CD8A", "lr") # O(rank), no tensor
0.7386

>>> t_cells = adata[adata.obs["cell_type"] == "T"]
>>> t_cells.comm.tensor("lr").shape                     # factors were subsetted
(50, 50, 50)

>>> adata.write("test.h5ad")                            # only factors written
>>> adata2 = ad.read_h5ad("test.h5ad")
>>> adata2.comm.tensor("lr").shape                      # reconstructs from factors
(100, 100, 50)
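
The agreement between the einsum reconstruction and the O(rank) point query can be checked in isolation with NumPy. The sketch below uses small random factors (the sizes and seed are arbitrary choices for this example) and the same two expressions as the accessor above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars, rank = 6, 5, 3
U = rng.random((n_obs, rank))    # obs factors
V = rng.random((n_vars, rank))   # var factors

# Dense reconstruction: T[i, j, k] = sum_r U[i, r] * U[j, r] * V[k, r]
dense = np.einsum("ir,jr,kr->ijk", U, U, V)

# Point query for one (sender, receiver, gene) triple: O(rank) work.
i, j, k = 1, 4, 2
point = float(U[i] @ (U[j] * V[k]))

print(dense.shape)                                    # (6, 6, 5)
print(np.isclose(dense[i, j, k], point))              # True
print(np.allclose(dense, dense.transpose(1, 0, 2)))   # True
```

The last line shows a consequence of reusing the same factor U for senders and receivers: the reconstructed tensor is symmetric in its two cell dimensions. Directed sender-to-receiver scores would need separate sender and receiver factors.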

This combines @register_section (factor storage with automatic subsetting and IO) with @register_anndata_namespace (tensor API and point queries). For 1M cells and 30K genes with rank 20, the float64 factors are ~165 MB, while the dense tensor would be ~240 PB (3 × 10^16 entries × 8 bytes), a compression of roughly 1.5 × 10^9×.
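
The storage comparison can be worked through with a few lines of arithmetic. The sketch assumes float64 values (8 bytes each), rank 20, and the 1M cells × 30K genes scale mentioned earlier in this section; other dtypes or gene counts change the figures proportionally.

```python
n_obs, n_vars, rank = 1_000_000, 30_000, 20
bytes_per_value = 8  # float64 (assumption)

# Two factor matrices: (n_obs x rank) and (n_vars x rank).
factor_bytes = (n_obs + n_vars) * rank * bytes_per_value

# Dense tensor: n_obs x n_obs x n_vars entries.
dense_entries = n_obs * n_obs * n_vars
dense_bytes = dense_entries * bytes_per_value

print(f"factors: {factor_bytes / 1e6:.1f} MB")        # factors: 164.8 MB
print(f"dense entries: {dense_entries:.1e}")          # dense entries: 3.0e+16
print(f"dense tensor: {dense_bytes / 1e15:.0f} PB")   # dense tensor: 240 PB
print(f"compression: {dense_bytes / factor_bytes:.2e}x")
```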

For moderately-sized datasets, sparse.COO from the PyData sparse package also works directly in registered sections (subsetting handles N-D sparse arrays).

iter_sections: centralized section iteration

All built-in sections are registered in _registered_sections alongside extension sections. The iter_sections() utility provides filtered iteration, replacing the hardcoded section lists that were previously duplicated across write_h5ad, write_anndata, read_anndata, _gen_repr, and _mutated_copy.

from anndata._core.section_registry import iter_sections

# All sections with metadata
for spec, value in iter_sections(adata):
    print(f"{spec.name}: kind={spec.kind}, alignment={spec.alignment}")

X: kind=X, alignment=('obs', 'var')
obs: kind=dataframe, alignment=('obs',)
var: kind=dataframe, alignment=('var',)
uns: kind=unstructured, alignment=()
obsm: kind=mapping, alignment=('obs',)
varm: kind=mapping, alignment=('var',)
layers: kind=mapping, alignment=('obs', 'var')
obsp: kind=mapping, alignment=('obs', 'obs')
varp: kind=mapping, alignment=('var', 'var')
raw: kind=raw, alignment=()
obst: kind=mapping, alignment=('obs',)          # ← registered extension

Filter by kind:

# Only dict-like sections (built-in + registered)
for spec, mapping in iter_sections(adata, kinds={"mapping"}):
    print(f"{spec.name}: {list(mapping.keys())}")

# Everything except X, raw, and uns
for spec, value in iter_sections(adata, exclude_kinds={"X", "raw", "unstructured"}):
    ...

# Non-empty sections (for repr)
for spec, value in iter_sections(adata, only_nonempty=True):
    ...

This is how anndata's own IO now works internally:

# write_h5ad (simplified)
for spec, value in iter_sections(adata, exclude_kinds={"X", "raw"}):
    if spec.kind == "dataframe":
        write_elem(f, spec.io_key, value, ...)        # DataFrame directly
    else:
        write_elem(f, spec.io_key, dict(value), ...)   # mapping → dict

Section kinds: "X", "dataframe" (obs/var), "mapping" (obsm/layers/etc. + extensions), "unstructured" (uns), "raw".
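
To make the filtering semantics concrete, here is a toy version of registry-driven iteration. It is a sketch under assumptions, not the code in anndata._core.section_registry: ToySpec, TOY_REGISTRY, and toy_iter_sections are hypothetical, and only the parameter names kinds, exclude_kinds, and only_nonempty are taken from the examples above.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical per-section metadata."""
    name: str
    kind: str
    alignment: tuple = ()


TOY_REGISTRY = {
    spec.name: spec
    for spec in [
        ToySpec("X", "X", ("obs", "var")),
        ToySpec("obs", "dataframe", ("obs",)),
        ToySpec("uns", "unstructured"),
        ToySpec("obsm", "mapping", ("obs",)),
        ToySpec("obst", "mapping", ("obs",)),  # extension section
    ]
}


def toy_iter_sections(obj, kinds=None, exclude_kinds=None, only_nonempty=False):
    """Yield (spec, value) pairs in registration order, applying the filters."""
    for spec in TOY_REGISTRY.values():
        if kinds is not None and spec.kind not in kinds:
            continue
        if exclude_kinds is not None and spec.kind in exclude_kinds:
            continue
        value = getattr(obj, spec.name, None)
        if only_nonempty and (value is None or len(value) == 0):
            continue
        yield spec, value


# A stand-in object with list/dict values instead of real arrays.
obj = SimpleNamespace(
    X=[[1.0]], obs={"cell_type": ["T"]}, uns={}, obsm={"pca": [[0.1]]}, obst={}
)

print([s.name for s, _ in toy_iter_sections(obj, kinds={"mapping"})])
# ['obsm', 'obst']
print([s.name for s, _ in toy_iter_sections(obj, exclude_kinds={"X", "unstructured"})])
# ['obs', 'obsm', 'obst']
print([s.name for s, _ in toy_iter_sections(obj, only_nonempty=True)])
# ['X', 'obs', 'obsm']
```

Because callers filter by kind rather than by name, a newly registered "mapping" section is picked up by every loop without further changes.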

Also in this PR

  • @register_anndata_namespace — custom accessor APIs (adata.spatial.images)
  • @register_formatter — custom HTML type/section formatters
  • anndata.extensions module consolidating all extension APIs

Test coverage

73 tests covering all alignment patterns, custom validation, custom IO (JSON, xarray), 3D tensor subsetting, factored tensor with accessor, copy-on-write, and end-to-end workflows for TreeData-like, SpatialData-like, CellChat-like, SCENIC-like, and factored communication scenarios.

Future direction

The alignment tuple naturally extends to custom axes beyond obs/var. A future register_axis could let packages define new named dimensions with their own indices, enabling N-dimensional indexing like adata[obs_idx, var_idx, spatial_idx]. This is the conceptual step from DataFrame (2D) to xarray Dataset (N-D) — with @register_section as the foundation.

# Conflicts:
#	tests/test_repr_html.py
#	tests/visual_inspect_repr_html.py
Add register_aligned_section() to anndata.extensions that allows
external packages to register new axis-aligned sections (like obsm,
layers) on AnnData without subclassing.

A registered section gets:
- Property accessor (adata.obst)
- Axis-aligned storage with validation
- Automatic subsetting (adata[:10].obst works)
- IO integration (write/read to h5ad and zarr)
- Repr discovery (shows in repr output)
- Init kwargs (AnnData(obst={...}))

Changes:
- aligned_mapping.py: AlignedMappingProperty lazily inits backing store
- extensions.py: SectionRegistration dataclass + register_aligned_section()
- anndata.py: _registered_sections ClassVar, **extra_sections in init,
  registered sections in _gen_repr
- methods.py: write_anndata/read_anndata iterate registered sections
- h5ad.py: write_h5ad iterates registered sections
- AlignedMappingProperty.construct sets _attrname_override so
  registered sections report their own name (e.g., "obst") instead
  of the default ("obsm")
- AlignedView propagates _attrname_override from parent mapping
- _mutated_copy includes registered sections in the copy loop
- _init_as_actual copies registered sections when init from AnnData
- _default_attrname replaces attrname in concrete bases (LayersBase,
  AxisArraysBase, PairwiseArraysBase) to support the override pattern
- Add comprehensive test suite (35 tests) covering storage,
  validation, subsetting, copy-on-write, IO roundtrip, repr, and
  TreeData-like workflow

…section

New @register_section decorator with:
- Alignment as tuple of "obs"/"var" axes: ("obs",), ("obs","var"),
  ("obs","obs"), ("var","var"), () for unaligned
- Custom value_type enforcement
- Custom validate/subset/serialize/deserialize methods
- Custom repr_entry for HTML repr
- Auto-registers SectionFormatter for HTML repr

New container classes in section_registry.py:
- SectionMapping: validates on assignment (type, alignment, custom)
- SectionMappingView: subsets on access, copy-on-write on mutation
- SectionProperty: descriptor creating ephemeral containers

45 tests covering all alignment combinations, custom validation,
custom IO, subsetting, copy-on-write, init kwargs, TreeData-like
and SpatialData-like scenarios.

alignment="obs" is now equivalent to alignment=("obs",).
Updated docstring examples and tests to use the string form.

Demonstrates using register_section with custom types: xr.DataArray
as layer values with serialize/deserialize for h5ad IO roundtrip.
Shows that custom types work end-to-end: storage with type enforcement,
alignment validation, subsetting, copy, IO, repr.

6 new tests (51 total).

Support >2D alignment tuples with proper subsetting. anndata's
built-in _subset only handles ≤2D, so SectionMappingView implements
N-D fancy indexing via np.ix_ for higher dimensions.

New biology-motivated test cases:
- cellcomm: alignment=("obs", "obs", "var") for ligand-receptor
  cell-cell communication tensors (CellChat, LIANA, CellPhoneDB)
- genereg: alignment=("obs", "var", "var") for cell-specific gene
  regulatory networks (SCENIC, CellOracle, Dictys)

67 tests total, all passing.

@settylab deleted a comment from coderabbitai bot on Mar 30, 2026

@katosh force-pushed the register_section branch from 38e77af to a40a331 on March 30, 2026 20:34

Add TestFactoredTensor: sections store compact rank-R factors
(n_obs × rank) and (n_vars × rank), accessor reconstructs the
full (obs × obs × var) tensor on demand via einsum. Includes
point queries without materializing the tensor.

Demonstrates combining register_section (for factor storage with
axis-aligned subsetting and IO) with register_anndata_namespace
(for the tensor reconstruction API and HTML repr).

73 tests total, all passing. Ruff formatting applied.

@settylab deleted a comment from coderabbitai bot on Mar 30, 2026

@katosh force-pushed the register_section branch 2 times, most recently from d8d2de9 to 70fe498 on March 31, 2026 01:44

Register all built-in sections (X, obs, var, uns, obsm, varm, obsp,
varp, layers, raw) in _registered_sections with SectionSpec metadata.
Add iter_sections() utility for filtered iteration with options for
kind filtering, empty-section skipping.

Replace hardcoded section lists in:
- _gen_repr: uses iter_sections(exclude_kinds={"X", "raw"})
- _mutated_copy: uses iter_sections(kinds={"dataframe", "mapping"})
- write_h5ad: uses iter_sections(exclude_kinds={"X", "raw"})
- write_anndata: same
- read_anndata: iterates _registered_sections.values()

The five aligned mapping sections (obsm, varm, obsp, varp, layers),
both DataFrames (obs, var), uns, and all extension sections are now
discovered from a single registry. Only X and raw retain special
handling due to their unique structure.

@katosh force-pushed the register_section branch from 70fe498 to cd2b37d on March 31, 2026 01:45

Raw manages its own subsetting internally (X along obs, var/varm
unchanged). The alignment tuple shouldn't imply it behaves like
an obs-aligned mapping.