Rewrite batch mode to use temp directory #136
Conversation
Write batch data to temp directory instead of memory to avoid high RAM usage with large files. Each batch creates its own temp directory, writes blocks directly to disk, and renames all files atomically on commit.
Thank you for submitting this PR!
Getting other community members to review complex PRs is a great help too (you can ask in the chats/forums). If you are unsure about something, just leave us a comment.
We currently aim to provide initial feedback/triage within two business days. Please keep an eye on any labelling actions, as these will indicate the priority and status of your contribution.
- Removed unused tempFileOnce() method
- Replaced unused 'done' variable with '_' in Commit method
Only check temp directory for keys that exist in puts slice
Need to review. Check that there is still the ability to do in-memory batching for smaller batches.
My concern is that this will negatively impact the performance of Put. This will make Commit faster because the temp files are already written, but existing code may expect a Put that is not slowed down by writing to the filesystem.
It may be worth considering having Put write the temp files asynchronously in a separate goroutine. Then Commit will wait for any outstanding Put to finish, handle any errors from async Puts, and then move all the temp files to their final destination.
Triage notes:
@requilence gentle ping, are you able to answer where this problem/need surfaced? Are you using Kubo or your own implementation? On what hardware are you running?
I do not think we need or want to guarantee this, because the ability to reuse a Batch was not consistent across different datastores, so should not be done in general.
Address PR review feedback
@lidel Thanks for the review! I've addressed the feedback:
Async Write Operations
We are using flatfs and the IPLD structures in Anytype (http://github.com/anyproto/anytype-heart) to store files on users' devices. If we are not using a batch operation, we can find ourselves in a situation where the user force-closes the app in the middle of adding a large file (e.g., a 4 GB video), which will leave a lot of garbage blocks in flatfs. Therefore, we are using atomic batch operations. At the same time, we are trying to avoid heap allocations by reusing memory, as this is especially critical on mobile devices. @lidel, I'm not sure async Put is really a good option for us. We would prefer to do it synchronously; otherwise, it could lead to heap spikes. I'm considering a clean way to make this configurable while maintaining backward compatibility. Perhaps a new OpenWithArgs that will have variadic arguments? It will also be possible to override the fs.FS implementation with a custom or mocked version.
Consider limiting the number of asynchronous write operations. I think that it is reasonable for an async write operation to wait until the previous async write operation has completed. This will allow a write function to return immediately, while the work is done during accumulation of operations in the next batch. Additionally, there could be an option to enable/disable asynchronous puts, although this is probably not necessary if the above limit is implemented.
gammazero left a comment:
Addressed review suggestions: added a semaphore to limit concurrent puts
lidel left a comment:
I tried to document the current state (see godoc drafts in comments below) and one thing that stands out is the O(n) linear search in puts – not sure how big of an impact this is, but we already have O(1) for deletes, so maybe do the same for puts? (details inline).
If we do that, godoc should be updated to reflect that.
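The O(n)-vs-O(1) point above can be addressed with a map index kept alongside the puts slice, the same way deletes already get O(1) lookups. A minimal sketch, with hypothetical names (`putIndex`, `Add`, `Has`) that are not the PR's actual code:

```go
package main

import "fmt"

// putIndex pairs the ordered slice of put keys with a map, so membership
// checks are O(1) map lookups instead of O(n) slice scans, while insertion
// order is preserved for commit. Illustrative names only.
type putIndex struct {
	keys  []string
	index map[string]int // key -> position in keys
}

func newPutIndex() *putIndex {
	return &putIndex{index: make(map[string]int)}
}

// Add records a key once, keeping insertion order.
func (p *putIndex) Add(key string) {
	if _, ok := p.index[key]; ok {
		return
	}
	p.index[key] = len(p.keys)
	p.keys = append(p.keys, key)
}

// Has is the O(1) lookup that replaces a linear search over the puts slice.
func (p *putIndex) Has(key string) bool {
	_, ok := p.index[key]
	return ok
}

func main() {
	p := newPutIndex()
	p.Add("blockA")
	p.Add("blockB")
	fmt.Println(p.Has("blockA"), p.Has("blockC")) // true false
}
```

The extra map costs one pointer-sized entry per pending put, which is usually a fair trade against scanning the slice on every Get/Has within a large batch.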
Triage: We will look at this after kubo v0.39
Co-authored-by: Marcin Rataj <[email protected]>
Problem
When uploading large files through UnixFS or other IPLD structures, the current batch implementation keeps everything in memory and writes blocks individually. If the application crashes mid-process, partial blocks remain in FlatFS, making garbage collection extremely difficult since there's no way to identify which blocks belong to incomplete operations.
Solution
This PR rewrites the batch mode to use a temporary directory approach:
- BatchReader interface implementing Get/Has/GetSize methods. This is essential for IPLD structures, which need to verify block links during construction. Without batch reads, you cannot build complex IPLD structures that reference blocks added earlier in the same batch. This follows standard database transaction semantics, where reads within a transaction see uncommitted writes.
Implementation Details
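The read-your-writes semantics described for BatchReader can be illustrated with a tiny in-memory sketch: Get consults the batch's uncommitted puts before falling back to committed data. Only the interface name BatchReader and the Get/Has/GetSize method names come from the PR; everything else here (`memStore`, `batch`, the map-based layout) is hypothetical.

```go
package main

import "fmt"

// memStore stands in for committed datastore contents.
type memStore map[string][]byte

// batch holds uncommitted puts; reads within the batch see them, mirroring
// transaction semantics where reads observe uncommitted writes.
type batch struct {
	base    memStore          // committed data
	pending map[string][]byte // uncommitted puts in this batch
}

func (b *batch) Put(key string, val []byte) { b.pending[key] = val }

// Get checks this batch's pending puts first, then the committed store,
// so blocks added earlier in the same batch are visible before Commit.
func (b *batch) Get(key string) ([]byte, bool) {
	if v, ok := b.pending[key]; ok {
		return v, true
	}
	v, ok := b.base[key]
	return v, ok
}

func (b *batch) Has(key string) bool {
	_, ok := b.Get(key)
	return ok
}

// Commit publishes the pending puts into the committed store.
func (b *batch) Commit() {
	for k, v := range b.pending {
		b.base[k] = v
	}
	b.pending = map[string][]byte{}
}

func main() {
	store := memStore{}
	b := &batch{base: store, pending: map[string][]byte{}}
	b.Put("child", []byte("leaf"))
	// An IPLD builder can verify the link before commit:
	fmt.Println(b.Has("child")) // true
	b.Commit()
	fmt.Println(string(store["child"])) // leaf
}
```

This is the property the Solution section relies on: a DAG builder can reference a child block put moments earlier in the same batch without forcing an intermediate commit.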