@ram-nadella
Owner

Summary

  • Implemented string interning to reduce memory usage by deduplicating file paths and module paths
  • Changed Symbol struct to use Arc<str> instead of PathBuf/String for interned fields
  • Added thread-safe StringCache module for managing interned strings

Motivation

In large codebases with thousands of files and tens of thousands of symbols, the same file paths and module paths are duplicated across many symbols. This PR optimizes memory usage by ensuring each unique path string is stored only once in memory.

Changes

  1. Symbol struct changes:

    • file_path: PathBuf → Arc<str> (interned)
    • module_path: String → Arc<str> (interned)
    • Custom Serialize/Deserialize implementations to handle Arc<str>
  2. String interning implementation:

    • Created StringCache module with thread-safe string deduplication
    • Uses parking_lot::RwLock and HashMap<String, Arc<str>>
    • Integrated into SymbolIndex to automatically intern strings when symbols are added
  3. Parser updates:

    • Updated all parsers to use String instead of PathBuf for file paths
    • File paths are converted to strings at parse time
  4. Test updates:

    • Fixed all tests to work with the new types
    • Updated benchmarks to use strings instead of PathBuf
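The interning approach described in item 2 can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the `intern` and `len` method names are assumptions, and `std::sync::RwLock` is swapped in for `parking_lot::RwLock` so the example has no external dependencies.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical sketch of a thread-safe string interner.
/// Assumption: the real StringCache uses parking_lot::RwLock; std's RwLock
/// is used here only to keep the example dependency-free.
#[derive(Default)]
pub struct StringCache {
    map: RwLock<HashMap<String, Arc<str>>>,
}

impl StringCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns a shared handle for `s`, allocating only on first sight.
    pub fn intern(&self, s: &str) -> Arc<str> {
        // Fast path: a shared read lock is enough when the string is cached.
        {
            let map = self.map.read().unwrap();
            if let Some(existing) = map.get(s) {
                return Arc::clone(existing);
            }
        } // read guard dropped here, before the write lock is requested

        // Slow path: take the write lock. `entry` re-checks the key, so a
        // string inserted by another thread in the meantime is reused.
        let mut map = self.map.write().unwrap();
        map.entry(s.to_owned())
            .or_insert_with(|| Arc::from(s))
            .clone()
    }

    /// Number of unique strings currently stored.
    pub fn len(&self) -> usize {
        self.map.read().unwrap().len()
    }
}
```

Under these assumptions, calling `intern` twice with the same path returns two `Arc<str>` handles that point at the same allocation, which is what allows many symbols to share one copy of a file path.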

Performance Impact

  • Memory: Significant reduction for large codebases (each unique path stored once)
  • CPU: Minimal overhead from string cache lookups (fast path for cache hits)
  • Thread safety: Low lock contention expected, since the workload is read-heavy and cache hits take only the shared read lock
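As a rough illustration of where the memory saving comes from, the sketch below creates many handles to a single path. The helper names `share_path` and `handle_sizes` are hypothetical and exist only for this example; no measurements from the actual codebase are implied.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Hypothetical helper: builds `n` handles that all refer to one allocation
/// holding `path`, mimicking many symbols from the same file.
pub fn share_path(path: &str, n: usize) -> Vec<Arc<str>> {
    let interned: Arc<str> = Arc::from(path);
    // Each clone bumps a reference count; the string bytes are not copied.
    (0..n).map(|_| Arc::clone(&interned)).collect()
}

/// Hypothetical helper: per-handle size in bytes of (Arc<str>, String).
/// Arc<str> is a pointer plus a length; String also carries a capacity.
pub fn handle_sizes() -> (usize, usize) {
    (size_of::<Arc<str>>(), size_of::<String>())
}
```

With a per-symbol `String` or `PathBuf`, every symbol would own its own heap copy of the path bytes. With `Arc<str>`, each symbol holds only the handle, and the bytes are freed when the last handle is dropped.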

Test plan

  • All existing tests pass
  • cargo fmt
  • cargo clippy (no warnings)
  • Manual testing with pylight binary

🤖 Generated with Claude Code

- Changed Symbol struct to use Arc<str> for file_path and module_path
- Created StringCache module for thread-safe string deduplication
- Integrated string interning into SymbolIndex at symbol insertion time
- Updated all parsers to use String instead of PathBuf for file paths
- Fixed all tests to work with the new types

This optimization significantly reduces memory usage for large codebases
by ensuring each unique file path and module path is stored only once
in memory, with symbols holding lightweight Arc<str> references.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>