Skip to content

Fix: Omit threading.get_ident() from cache token for async implementations in sync mode#2028

Open
yuxin00j wants to merge 2 commits intofsspec:masterfrom
yuxin00j:fix-cache-thread-id
Open

Fix: Omit threading.get_ident() from cache token for async implementations in sync mode#2028
yuxin00j wants to merge 2 commits intofsspec:masterfrom
yuxin00j:fix-cache-thread-id

Conversation

@yuxin00j
Copy link
Copy Markdown

The Problem:
Currently, the AbstractFileSystem instance cache incorporates threading.get_ident() into its cache token. For workloads using multi-threading (like Hugging Face datasets fetching metadata), this completely bypasses the cache, forcing fsspec to create a brand-new filesystem instance and aiohttp.ClientSession for every single thread. This results in redundant authentication and connection overhead, defeating the purpose of the cache.

The Solution:
As discussed in #2020, because fsspec currently uses a single global fsspecIO event loop for all synchronous calls (asynchronous=False), all sync() coroutine executions are inherently funneled into that single background thread. Therefore, multiple calling Python threads can perfectly and safely share the exact same AsyncFileSystem instance.

This PR modifies fsspec/spec.py to conditionally remove threading.get_ident() from the caching token if:

  1. The class uses an async implementation (async_impl = True).
  2. The user requests a synchronous interface (asynchronous=False).

For purely synchronous filesystems (e.g., FTP) or when asynchronous=True, the thread ID remains in the cache token to ensure thread safety and avoid cross-loop boundary violations.

Changes:

  • Conditionally generated the cache token in _Cached.__call__.
  • Added tests in fsspec/tests/test_cache_threads.py to explicitly verify the caching behavior across multiple threads for both AsyncFileSystem and AbstractFileSystem under different modes.

Closes #2020

@yuxin00j yuxin00j force-pushed the fix-cache-thread-id branch from ec015b2 to 8f896d7 Compare April 30, 2026 12:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove threading.get_ident() from instance cache token for asynchronous=False instances

1 participant