Description
The Vocab.length counter is incremented when adding lexemes but is never decremented when memory_zone clears transient lexemes. This causes len(nlp.vocab) to grow continuously even though the actual lexemes are properly removed from the internal hash map.
This makes len(nlp.vocab) unreliable for monitoring memory_zone effectiveness in production environments.
Reproduction
```python
import spacy

# Load model
nlp = spacy.load("en_core_web_sm")

# Check initial vocab size
initial_vocab_size = len(nlp.vocab)
print(f"Initial vocab size: {initial_vocab_size}")

# Process text with memory_zone
texts = ["unique_word_" + str(i) for i in range(1000)]
for text in texts:
    with nlp.memory_zone():
        doc = nlp(text)

# Check final vocab size
final_vocab_size = len(nlp.vocab)
print(f"Final vocab size: {final_vocab_size}")
print(f"Growth: {final_vocab_size - initial_vocab_size}")

# But iterate to see actual lexemes
actual_count = sum(1 for _ in nlp.vocab)
print(f"Actual lexeme count (via iteration): {actual_count}")
```

Output:
```
Initial vocab size: 1456
Final vocab size: 2456
Growth: 1000
Actual lexeme count (via iteration): 1456
```
Expected Behavior
After exiting memory_zone, len(nlp.vocab) should decrease to reflect the removal of transient lexemes, matching the count from iteration.
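Concretely, an invariant check along these lines should pass once the counter is kept in sync (a minimal sketch; the token string is arbitrary, and it assumes every lexeme created inside the zone is transient):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
baseline = len(nlp.vocab)

with nlp.memory_zone():
    nlp("transient_token_xyz")

# The counter should drop back to its pre-zone value and agree
# with the number of lexemes actually reachable by iteration.
assert len(nlp.vocab) == baseline
assert len(nlp.vocab) == sum(1 for _ in nlp.vocab)
```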
Actual Behavior
- `len(nlp.vocab)` continues to grow and never decreases
- Iteration over `nlp.vocab` correctly shows only permanent lexemes
- Actual memory IS freed (confirmed via RSS measurements; a sketch of the check follows this list)
- Only the counter is incorrect
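For reference, a minimal sketch of this kind of RSS check (assumes `psutil` is installed; the model and iteration count are illustrative):

```python
import os

import psutil
import spacy

proc = psutil.Process(os.getpid())
nlp = spacy.load("en_core_web_sm")


def rss_mb() -> float:
    # Resident set size of the current process, in MiB.
    return proc.memory_info().rss / (1024 * 1024)


before = rss_mb()
for i in range(10_000):
    with nlp.memory_zone():
        nlp(f"unique_word_{i}")
after = rss_mb()

# RSS stays roughly flat even though len(nlp.vocab) now reports
# thousands of extra lexemes; only the counter has drifted.
print(f"RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
print(f"len(nlp.vocab) reports: {len(nlp.vocab)}")
```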
Root Cause
Looking at `spacy/vocab.pyx`:
When adding lexemes:
```cython
cdef int _add_lex_to_vocab(self, hash_t key, const LexemeC* lex, bint is_transient):
    self._by_orth.set(lex.orth, <void*>lex)
    self.length += 1  # Counter incremented
    if is_transient and self.in_memory_zone:
        self._transient_orths.push_back(lex.orth)
```

When clearing transient lexemes:
```cython
def _clear_transient_orths(self):
    """Remove transient lexemes from the index"""
    cdef hash_t orth
    for orth in self._transient_orths:
        map_clear(self._by_orth.c_map, orth)  # Hash map cleared
    self._transient_orths.clear()
    # self.length is never decremented!
```

When getting the length:
```cython
def __len__(self):
    return self.length  # Returns the incorrect counter
```
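The failure pattern is easy to see in isolation: any container that pairs a hash map with a manually maintained counter drifts as soon as entries are removed without updating the counter. A simplified pure-Python model (names are illustrative, not actual spaCy internals):

```python
class CountedMap:
    """Toy model of Vocab: a hash map plus a manually kept counter."""

    def __init__(self):
        self._by_orth = {}
        self.length = 0

    def add(self, orth, lex):
        self._by_orth[orth] = lex
        self.length += 1  # incremented on insert...

    def clear_transient(self, transient_orths):
        for orth in transient_orths:
            self._by_orth.pop(orth, None)
        # ...but never decremented on removal: the bug.

    def __len__(self):
        return self.length


vocab = CountedMap()
vocab.add("hello", object())
vocab.add("tmp", object())
vocab.clear_transient(["tmp"])
print(len(vocab), len(vocab._by_orth))  # 2 1: the counter has drifted
```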
Impact
- Production Monitoring: Teams cannot reliably use `len(nlp.vocab)` to monitor memory_zone effectiveness
- Misleading Metrics: The counter suggests memory isn't being freed when it actually is
- Semantic Inconsistency: `len(vocab)` doesn't match the iteration count
Suggested Fix
Decrement `self.length` when clearing transient lexemes:
```cython
def _clear_transient_orths(self):
    """Remove transient lexemes from the index"""
    cdef hash_t orth
    cdef int num_cleared = 0
    for orth in self._transient_orths:
        if self._by_orth.get(orth) is not NULL:
            map_clear(self._by_orth.c_map, orth)
            num_cleared += 1
    self._transient_orths.clear()
    self.length -= num_cleared  # Decrement counter
```
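A regression test along these lines could guard the fix (a sketch using a blank pipeline; it assumes every lexeme created inside the zone is transient):

```python
import spacy


def test_len_vocab_shrinks_after_memory_zone():
    nlp = spacy.blank("en")
    before = len(nlp.vocab)

    with nlp.memory_zone():
        for i in range(100):
            nlp(f"transient_token_{i}")

    # With the fix, the counter shrinks back to its pre-zone value
    # and matches the number of lexemes seen via iteration.
    assert len(nlp.vocab) == before
    assert len(nlp.vocab) == sum(1 for _ in nlp.vocab)
```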
Verification
I've tested this on production workloads with 1.9M records and confirmed:
- Memory IS being freed correctly (RSS measurements show a ~9-10% reduction)
- `memory_zone` functionality works as intended
- Only the `length` counter is inaccurate
Environment
- spaCy version: 3.8.7 (issue likely present since memory_zone introduction in 3.8.0)
- Platform: Linux (Databricks)
- Python version: 3.9+
Additional Context
This was discovered while implementing memory_zone in a production NER pipeline processing millions of records. The discrepancy between len(vocab) and actual memory behavior initially caused confusion until source code analysis revealed the counter bug.
The underlying memory management works correctly - this is purely a counter bookkeeping issue that affects observability.