
Conversation

@Xuanwo
Collaborator

@Xuanwo Xuanwo commented Dec 18, 2025

This PR provides an alternative implementation of memtest. It introduces memtrace as a new feature in Pylance, which users can enable via the memtrace feature flag. It offers an API similar to memtest's, but eliminates the need to hook or inject dynamic libraries, making it easier to use and test.

Before:

maturin develop
make -C ../memtest build-release
LIB_PATH=$(lance-memtest)
LD_PRELOAD=$LIB_PATH pytest python/ci_benchmarks

After:

maturin develop --features memtrace
pytest python/ci_benchmarks

Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.
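
For readers unfamiliar with the technique, a feature-gated counting allocator in Rust looks roughly like the sketch below. It is illustrative only: CountingAlloc, reset_peak, and peak_bytes are made-up names rather than the code in this PR; only the memtrace feature name comes from the PR itself.

    use std::alloc::{GlobalAlloc, Layout, System};
    use std::sync::atomic::{AtomicUsize, Ordering};

    static ALLOCATED: AtomicUsize = AtomicUsize::new(0);
    static PEAK: AtomicUsize = AtomicUsize::new(0);

    struct CountingAlloc;

    unsafe impl GlobalAlloc for CountingAlloc {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            let ptr = System.alloc(layout);
            if !ptr.is_null() {
                // Track current and peak bytes handed out by this allocator.
                let now = ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
                PEAK.fetch_max(now, Ordering::Relaxed);
            }
            ptr
        }

        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            System.dealloc(ptr, layout);
            ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
        }
    }

    // Only installed when the crate is built with `--features memtrace`.
    #[cfg(feature = "memtrace")]
    #[global_allocator]
    static GLOBAL: CountingAlloc = CountingAlloc;

    /// Start a new measurement window (illustrative helper).
    pub fn reset_peak() {
        PEAK.store(ALLOCATED.load(Ordering::Relaxed), Ordering::Relaxed);
    }

    /// Peak bytes seen since the last `reset_peak` (illustrative helper).
    pub fn peak_bytes() -> usize {
        PEAK.load(Ordering::Relaxed)
    }

Because the counters live in process-global atomics, the statistics can be exposed to Python through the bindings without any LD_PRELOAD or library injection, which is the usability gain described above.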

@github-actions github-actions bot added enhancement New feature or request python labels Dec 18, 2025
Contributor

@wjones127 wjones127 left a comment


This approach only captures allocations made in our Rust code, which means it doesn't work for tests like the insert one in test_memory.py. In that test, we create the input data with PyArrow, so those allocations won't be captured. The point of that test is to show we don't buffer or collect too much data into memory.

memtest was the fourth approach I tried, so I might as well share all four approaches and why I went with this one:

  1. First I tried to implement a solution that uses tracing subscribers to capture allocations in Rust. This would have been cool as it would have worked in Rust unit tests even if there were tests running concurrently. However, each time we called tokio::spawn, we needed to make sure we passed down the span so tracing would continue to capture them. This ended up being too much work.
  2. Next I tried implementing a custom allocator in Rust, similar to this PR, but using it in Rust tests (see the sketch below). That was much simpler and caught all allocations. However, it would not work if tests were running concurrently in the same process, and there wasn't an easy way to force serial execution in a Rust test. You could always pass cargo test -- --test-threads=1, but that would be annoying. We could use cargo nextest run, which uses a separate process for each test and is generally faster. But (a) that library crashes on my Mac and (b) new contributors might call cargo test and get confused by the failures.
  3. Next I implemented basically what is in this PR. The idea was that I could solve the multi-threading issue by just using Python, which by default runs only one test at a time in a process. However, I found that it wasn't that useful for things like write tests if it didn't capture allocations made outside of our Rust code.
    a. I tried to get around the limitation by also using allocation stats from other libraries in Python. PyArrow has some stats on its global memory pool, but most of those stats can't be reset like ours can, so there wasn't a clear way to combine them.
  4. What I finally settled on was memtest, using the LD_PRELOAD trick with a Python library. This captures all allocations reliably, and because it's run from Python it doesn't need to worry about concurrency.

Those are all the attempts I can think of. If you have any new ideas, I'd be glad to hear them.
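
For illustration, approach (2) would have looked something like the test below. reset_peak, peak_bytes, and write_one_gigabyte_in_batches are hypothetical names (the first two from the allocator sketch earlier in this thread), and the 64 MiB limit is arbitrary. Because the counters are process-global, a second test running on another thread in the same process inflates the peak and makes the assertion flaky, which is the concurrency problem described in that item.

    #[test]
    fn insert_does_not_buffer_whole_input() {
        // Hypothetical helper from a counting global allocator: start a window.
        reset_peak();

        // Hypothetical operation under test: stream ~1 GiB of batches through a write.
        write_one_gigabyte_in_batches();

        // Expect a small, bounded peak rather than the full input size.
        assert!(peak_bytes() < 64 * 1024 * 1024);
    }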

@Xuanwo
Collaborator Author

Xuanwo commented Dec 19, 2025

The memtest one is the fourth approach I tried, so I might share the four approaches I tried and why I went with this:

Thank you very much for this! I initially assumed that most of our workload is handled within the Rust core, so the Python part wouldn't require much attention. However, it seems my assumption was incorrect.

Could you elaborate on why we need to consider Python's memory usage as well? From my current understanding, if we're building an online service around lancedb, operations like building the index should be handled server-side in Rust, while users would primarily use Python on the client side.

@wjones127
Contributor

Could you elaborate on why we need to consider Python's memory usage as well? From my current understanding, if we're building an online service around lancedb, operations like building the index should be handled server-side in Rust, while users would primarily use Python on the client side.

This is the Lance library, not LanceDB.

Some operations are handled entirely in Rust, like indexing and queries. But for writes, the data comes from outside. Being able to stream writes properly is one of the main things we want to test. That's why I gave the example of insert earlier: it checks that we can take a stream of data and write it out without collecting it all into memory.
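
In code terms, the property that insert test is after is roughly the difference between the two functions below. This is a generic sketch using the futures crate; Batch and write_batch are stand-ins, not Lance's actual API.

    use futures::{Stream, StreamExt};

    // Stand-in types for this sketch only.
    struct Batch(Vec<u8>);

    async fn write_batch(_batch: Batch) {
        // flush the batch to storage; it is dropped afterwards
    }

    // What we want: each batch is written and freed before the next one arrives,
    // so peak memory stays around the size of one batch.
    async fn streaming_write(mut batches: impl Stream<Item = Batch> + Unpin) {
        while let Some(batch) = batches.next().await {
            write_batch(batch).await;
        }
    }

    // What the test guards against: collecting first makes peak memory
    // proportional to the whole input.
    async fn buffering_write(batches: impl Stream<Item = Batch> + Unpin) {
        let all: Vec<Batch> = batches.collect().await;
        for batch in all {
            write_batch(batch).await;
        }
    }

Since the input batches are produced by PyArrow on the Python side, only an allocator that sees the whole process (like the LD_PRELOAD approach) can confirm the streaming behavior end to end.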

@Xuanwo
Collaborator Author

Xuanwo commented Dec 19, 2025

Thanks! That thorough answer addressed my questions. I'm going to close this PR, since the current approach does seem to be the only way.

But for writes, the data comes from outside.

One last question: would it be a good idea to measure that Lance's own memory usage doesn't grow during writes? That way we could still measure the Rust side instead.

@Xuanwo
Collaborator Author

Xuanwo commented Jan 4, 2026

Let's close.

@Xuanwo Xuanwo closed this Jan 4, 2026
@Xuanwo Xuanwo deleted the xuanwo/mem-usage-measure branch January 4, 2026 10:53