Description
The Vocab.length counter is incremented when adding lexemes but is never decremented when memory_zone clears transient lexemes. This causes len(nlp.vocab) to grow continuously even though the actual lexemes are properly removed from the internal hash map.
This makes len(nlp.vocab) unreliable for monitoring memory_zone effectiveness in production environments.
Reproduction
```python
import spacy

# Load model
nlp = spacy.load("en_core_web_sm")

# Check initial vocab size
initial_vocab_size = len(nlp.vocab)
print(f"Initial vocab size: {initial_vocab_size}")

# Process text with memory_zone
texts = ["unique_word_" + str(i) for i in range(1000)]
for text in texts:
    with nlp.memory_zone():
        doc = nlp(text)

# Check final vocab size
final_vocab_size = len(nlp.vocab)
print(f"Final vocab size: {final_vocab_size}")
print(f"Growth: {final_vocab_size - initial_vocab_size}")

# But iterate to see actual lexemes
actual_count = sum(1 for _ in nlp.vocab)
print(f"Actual lexeme count (via iteration): {actual_count}")
```

Output:
```
Initial vocab size: 1456
Final vocab size: 2456
Growth: 1000
Actual lexeme count (via iteration): 1456
```
Expected Behavior
After exiting memory_zone, len(nlp.vocab) should decrease to reflect the removal of transient lexemes, matching the count from iteration.
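Concretely, an invariant check along these lines should pass once the counter is kept in sync (a minimal sketch; the token string is arbitrary, and it assumes every lexeme created inside the zone is transient):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
baseline = len(nlp.vocab)

with nlp.memory_zone():
    nlp("transient_token_xyz")

# The counter should drop back to its pre-zone value and agree
# with the number of lexemes actually reachable by iteration.
assert len(nlp.vocab) == baseline
assert len(nlp.vocab) == sum(1 for _ in nlp.vocab)
```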
Actual Behavior
- `len(nlp.vocab)` continues to grow and never decreases
- Iteration over `nlp.vocab` correctly shows only permanent lexemes
- Actual memory IS freed (confirmed via RSS measurements; a sketch of the check follows this list)
- Only the counter is incorrect
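For reference, a minimal sketch of this kind of RSS check (assumes `psutil` is installed; the model and iteration count are illustrative):

```python
import os

import psutil
import spacy

proc = psutil.Process(os.getpid())
nlp = spacy.load("en_core_web_sm")


def rss_mb() -> float:
    # Resident set size of the current process, in MiB.
    return proc.memory_info().rss / (1024 * 1024)


before = rss_mb()
for i in range(10_000):
    with nlp.memory_zone():
        nlp(f"unique_word_{i}")
after = rss_mb()

# RSS stays roughly flat even though len(nlp.vocab) now reports
# thousands of extra lexemes; only the counter has drifted.
print(f"RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
print(f"len(nlp.vocab) reports: {len(nlp.vocab)}")
```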
Root Cause
Looking at `spacy/vocab.pyx`:
When adding lexemes:
```cython
cdef int _add_lex_to_vocab(self, hash_t key, const LexemeC* lex, bint is_transient):
    self._by_orth.set(lex.orth, <void*>lex)
    self.length += 1  # Counter incremented
    if is_transient and self.in_memory_zone:
        self._transient_orths.push_back(lex.orth)
```

When clearing transient lexemes:
```cython
def _clear_transient_orths(self):
    """Remove transient lexemes from the index"""
    cdef hash_t orth
    for orth in self._transient_orths:
        map_clear(self._by_orth.c_map, orth)  # Hash map cleared
    self._transient_orths.clear()
    # self.length is never decremented!
```

When getting the length:
```cython
def __len__(self):
    return self.length  # Returns the incorrect counter
```
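The failure pattern is easy to see in isolation: any container that pairs a hash map with a manually maintained counter drifts as soon as entries are removed without updating the counter. A simplified pure-Python model (names are illustrative, not actual spaCy internals):

```python
class CountedMap:
    """Toy model of Vocab: a hash map plus a manually kept counter."""

    def __init__(self):
        self._by_orth = {}
        self.length = 0

    def add(self, orth, lex):
        self._by_orth[orth] = lex
        self.length += 1  # incremented on insert...

    def clear_transient(self, transient_orths):
        for orth in transient_orths:
            self._by_orth.pop(orth, None)
        # ...but never decremented on removal: the bug.

    def __len__(self):
        return self.length


vocab = CountedMap()
vocab.add("hello", object())
vocab.add("tmp", object())
vocab.clear_transient(["tmp"])
print(len(vocab), len(vocab._by_orth))  # 2 1: the counter has drifted
```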
Impact
- Production Monitoring: Teams cannot reliably use `len(nlp.vocab)` to monitor memory_zone effectiveness
- Misleading Metrics: The counter suggests memory isn't being freed when it actually is
- Semantic Inconsistency: `len(vocab)` doesn't match the iteration count
Suggested Fix
Decrement `self.length` when clearing transient lexemes:
```cython
def _clear_transient_orths(self):
    """Remove transient lexemes from the index"""
    cdef hash_t orth
    cdef int num_cleared = 0
    for orth in self._transient_orths:
        if self._by_orth.get(orth) is not NULL:
            map_clear(self._by_orth.c_map, orth)
            num_cleared += 1
    self._transient_orths.clear()
    self.length -= num_cleared  # Decrement counter
```
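A regression test along these lines could guard the fix (a sketch using a blank pipeline; it assumes every lexeme created inside the zone is transient):

```python
import spacy


def test_len_vocab_shrinks_after_memory_zone():
    nlp = spacy.blank("en")
    before = len(nlp.vocab)

    with nlp.memory_zone():
        for i in range(100):
            nlp(f"transient_token_{i}")

    # With the fix, the counter shrinks back to its pre-zone value
    # and matches the number of lexemes seen via iteration.
    assert len(nlp.vocab) == before
    assert len(nlp.vocab) == sum(1 for _ in nlp.vocab)
```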
Verification
I've tested this on production workloads with 1.9M records and confirmed:
- Memory IS being freed correctly (RSS measurements show a ~9-10% reduction)
- `memory_zone` functionality works as intended
- Only the `length` counter is inaccurate
Environment
- spaCy version: 3.8.7 (issue likely present since memory_zone introduction in 3.8.0)
- Platform: Linux (Databricks)
- Python version: 3.9+
Additional Context
This was discovered while implementing memory_zone in a production NER pipeline processing millions of records. The discrepancy between len(vocab) and actual memory behavior initially caused confusion until source code analysis revealed the counter bug.
The underlying memory management works correctly - this is purely a counter bookkeeping issue that affects observability.