Fix: Omit threading.get_ident() from cache token for async implementations in sync mode#2028
Open
yuxin00j wants to merge 2 commits intofsspec:masterfrom
Open
Fix: Omit threading.get_ident() from cache token for async implementations in sync mode#2028yuxin00j wants to merge 2 commits intofsspec:masterfrom
threading.get_ident() from cache token for async implementations in sync mode#2028yuxin00j wants to merge 2 commits intofsspec:masterfrom
Conversation
ec015b2 to
8f896d7
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The Problem:
Currently, the
AbstractFileSysteminstance cache incorporatesthreading.get_ident()into its cache token. For workloads using multi-threading (like Hugging Facedatasetsfetching metadata), this completely bypasses the cache, forcingfsspecto create a brand-new filesystem instance andaiohttp.ClientSessionfor every single thread. This results in redundant authentication and connection overhead, defeating the purpose of the cache.The Solution:
As discussed in #2020, because
fsspeccurrently uses a single globalfsspecIOevent loop for all synchronous calls (asynchronous=False), allsync()coroutine executions are inherently funneled into that single background thread. Therefore, multiple calling Python threads can perfectly and safely share the exact sameAsyncFileSysteminstance.This PR modifies
fsspec/spec.pyto conditionally removethreading.get_ident()from the caching token if:async_impl = True).asynchronous=False).For purely synchronous filesystems (e.g., FTP) or when
asynchronous=True, the thread ID remains in the cache token to ensure thread safety and avoid cross-loop boundary violations.Changes:
tokenin_Cached.__call__.fsspec/tests/test_cache_threads.pyto explicitly verify the caching behavior across multiple threads for bothAsyncFileSystemandAbstractFileSystemunder different modes.Closes #2020