Bug: vocab.length not decremented when memory_zone clears transient lexemes #13882

@TonyStark7862

Description

The Vocab.length counter is incremented when adding lexemes but is never decremented when memory_zone clears transient lexemes. This causes len(nlp.vocab) to grow continuously even though the actual lexemes are properly removed from the internal hash map.

This makes len(nlp.vocab) unreliable for monitoring memory_zone effectiveness in production environments.

Reproduction

import spacy

# Load model
nlp = spacy.load("en_core_web_sm")

# Check initial vocab size
initial_vocab_size = len(nlp.vocab)
print(f"Initial vocab size: {initial_vocab_size}")

# Process text with memory_zone
texts = ["unique_word_" + str(i) for i in range(1000)]

for text in texts:
    with nlp.memory_zone():
        doc = nlp(text)

# Check final vocab size
final_vocab_size = len(nlp.vocab)
print(f"Final vocab size: {final_vocab_size}")
print(f"Growth: {final_vocab_size - initial_vocab_size}")

# Count the lexemes actually present by iterating over the vocab
actual_count = sum(1 for _ in nlp.vocab)
print(f"Actual lexeme count (via iteration): {actual_count}")

Output:

Initial vocab size: 1456
Final vocab size: 2456
Growth: 1000
Actual lexeme count (via iteration): 1456

Expected Behavior

After exiting memory_zone, len(nlp.vocab) should decrease to reflect the removal of transient lexemes, matching the count from iteration.

Actual Behavior

  • len(nlp.vocab) continues to grow and never decreases
  • Iteration over nlp.vocab correctly shows only permanent lexemes
  • Actual memory IS freed (confirmed via RSS measurements)
  • Only the counter is incorrect
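
Until the counter is corrected, the two observations above can be compared directly. The following is a minimal sketch, assuming only that the object supports len() and iteration the way nlp.vocab does in the reproduction. The helper name vocab_counter_drift is hypothetical and not part of spaCy.

```python
def vocab_counter_drift(vocab):
    """Return (reported, actual, drift) for a Vocab-like object.

    reported: len(vocab), i.e. the internal counter
    actual:   number of lexemes yielded by iteration
    drift:    reported - actual (0 when the counter is accurate)
    """
    reported = len(vocab)
    actual = sum(1 for _ in vocab)
    return reported, actual, reported - actual
```

Called as vocab_counter_drift(nlp.vocab) after the reproduction loop, the drift would be expected to equal the number of transient lexemes cleared so far. Iteration walks the whole vocab, so this is more expensive than len() and is probably better suited to periodic checks than to per-document monitoring.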

Root Cause

Looking at spacy/vocab.pyx:

When adding lexemes:

cdef int _add_lex_to_vocab(self, hash_t key, const LexemeC* lex, bint is_transient):
    self._by_orth.set(lex.orth, <void*>lex)
    self.length += 1  # Counter incremented
    if is_transient and self.in_memory_zone:
        self._transient_orths.push_back(lex.orth)

When clearing transient lexemes:

def _clear_transient_orths(self):
    """Remove transient lexemes from the index"""
    cdef hash_t orth
    for orth in self._transient_orths:
        map_clear(self._by_orth.c_map, orth)  # Hash map cleared
    self._transient_orths.clear()
    # self.length is never decremented!

When getting length:

def __len__(self):
    return self.length  # Returns the incorrect counter
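
The interaction between the three snippets above can be modeled in plain Python. This is a toy stand-in written for illustration, not spaCy code: ToyVocab, add, and clear_transient are hypothetical names, and a dict plays the role of the hash map.

```python
class ToyVocab:
    """Pure-Python model of the bookkeeping described above (not spaCy code)."""

    def __init__(self):
        self._by_orth = {}          # plays the role of the hash map
        self._transient_orths = []  # orths added inside a memory zone
        self.length = 0             # the counter returned by __len__
        self.in_memory_zone = False

    def add(self, orth):
        if orth in self._by_orth:
            return
        self._by_orth[orth] = object()
        self.length += 1            # incremented on every insert
        if self.in_memory_zone:
            self._transient_orths.append(orth)

    def clear_transient(self):
        for orth in self._transient_orths:
            self._by_orth.pop(orth, None)  # entry removed from the map
        self._transient_orths.clear()
        # length is left untouched, mirroring the reported behavior

    def __len__(self):
        return self.length

    def __iter__(self):
        return iter(self._by_orth)


vocab = ToyVocab()
vocab.add("permanent")
vocab.in_memory_zone = True
for i in range(3):
    vocab.add(f"transient_{i}")
vocab.clear_transient()
vocab.in_memory_zone = False

print(len(vocab))             # 4: counter still includes the cleared entries
print(sum(1 for _ in vocab))  # 1: only the permanent entry remains
```

The model shows the same symptom as the reproduction: the map shrinks when the zone is cleared, while the counter keeps its peak value.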

Impact

  • Production Monitoring: Teams cannot reliably use len(nlp.vocab) to monitor memory_zone effectiveness
  • Misleading Metrics: Counter suggests memory isn't being freed when it actually is
  • Semantic Inconsistency: len(vocab) doesn't match iteration count

Suggested Fix

Decrement self.length when clearing transient lexemes:

def _clear_transient_orths(self):
    """Remove transient lexemes from the index"""
    cdef hash_t orth
    cdef int num_cleared = 0
    
    for orth in self._transient_orths:
        if self._by_orth.get(orth) is not NULL:
            map_clear(self._by_orth.c_map, orth)
            num_cleared += 1
    
    self._transient_orths.clear()
    self.length -= num_cleared  # Decrement counter
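
One way to guard a fix like this against regressions is to check that the counter returns to its starting value once every zone has exited. The following is a sketch, assuming the nlp.memory_zone() context manager used in the reproduction. The helper name check_len_restored is hypothetical.

```python
def check_len_restored(nlp, texts):
    """Return True if len(nlp.vocab) is unchanged after processing
    every text inside its own memory zone."""
    before = len(nlp.vocab)
    for text in texts:
        with nlp.memory_zone():
            nlp(text)  # any lexemes added here should be transient
    after = len(nlp.vocab)
    return before == after
```

In a test suite this could be called with a loaded pipeline and a handful of previously unseen strings, for example the texts list from the reproduction. On affected versions the check would be expected to return False.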

Verification

I've checked this behavior on production workloads with 1.9M records and confirmed:

  • Memory IS being freed correctly (RSS measurements show ~9-10% reduction)
  • memory_zone functionality works as intended
  • Only the length counter is inaccurate

Environment

  • spaCy version: 3.8.7 (issue likely present since memory_zone introduction in 3.8.0)
  • Platform: Linux (Databricks)
  • Python version: 3.9+

Additional Context

This was discovered while implementing memory_zone in a production NER pipeline processing millions of records. The discrepancy between len(vocab) and actual memory behavior initially caused confusion until source code analysis revealed the counter bug.

The underlying memory management works correctly - this is purely a counter bookkeeping issue that affects observability.
