The Problem:
Currently, the AbstractFileSystem instance cache incorporates threading.get_ident() into its cache token (in fsspec/spec.py).
This means that if multiple Python threads request the same filesystem (e.g., fsspec.filesystem("gs")), the cache lookup misses in each new thread, so fsspec creates a brand-new filesystem instance and a separate aiohttp.ClientSession for every single thread.
Libraries like Hugging Face datasets rely heavily on multi-threading (via thread_map) to fetch metadata. There, every background thread creates a new FS instance, forcing redundant authentication, TLS handshakes, and connection pool setup. This defeats the purpose of the cache and wastes significant time and bandwidth.
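The effect can be reproduced with a toy cache that uses only the standard library. This is a minimal sketch, not fsspec's actual code: `ToyFS`, `get`, and `distinct_instances` are hypothetical names, and the token is simplified to the protocol plus the thread ident.

```python
import threading


class ToyFS:
    """Hypothetical stand-in for a cached filesystem class (not fsspec code)."""

    _cache = {}
    _lock = threading.Lock()

    def __init__(self, protocol):
        self.protocol = protocol

    @classmethod
    def get(cls, protocol):
        # Assumption: the token mixes in the calling thread's ident,
        # mirroring the behaviour described in this issue.
        token = (protocol, threading.get_ident())
        with cls._lock:
            if token not in cls._cache:
                cls._cache[token] = cls(protocol)
            return cls._cache[token]


def distinct_instances(n_threads):
    """Return how many different ToyFS objects n_threads concurrent threads see."""
    barrier = threading.Barrier(n_threads)
    seen = []

    def worker():
        # Wait until every thread is alive, so all thread idents are distinct.
        barrier.wait()
        seen.append(ToyFS.get("gs"))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len({id(fs) for fs in seen})
```

Calling `distinct_instances(4)` yields 4 here: one instance per thread, even though every thread asked for the same "gs" filesystem.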
Historical Context:
By tracing the fsspec git history, it appears this thread-specific caching is a vestige of an older event-loop architecture:
In PR #572, "One event loop per thread" (commit b252369), fsspec implemented thread-local event loops (loops[ident] = loop). Because aiohttp.ClientSession is bound to a specific event loop, threading.get_ident() was correctly added to the cache token so each calling thread got its own FS instance and its own event loop.
In PR #590, "Iothread" (commit 57d6ce92), the architecture was completely revamped. The thread-local event loops were removed in favor of a single, global background IO loop (fsspecIO).
However, when the event loop became a global singleton, threading.get_ident() was never removed from the cache token.
Why it is safe to remove now (for asynchronous=False):
Because fsspec currently uses a single global fsspecIO event loop for all synchronous calls (asynchronous=False), all sync() coroutine executions are inherently funneled into that single background thread.
Therefore, multiple calling Python threads can safely share the same AsyncFileSystem instance (and its underlying aiohttp session). The session is only ever touched from the fsspecIO background thread, where coroutines are scheduled cooperatively on one loop.
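The funneling argument can be illustrated with a standard-library sketch. It assumes a single background loop thread and a blocking `sync()` helper built on `asyncio.run_coroutine_threadsafe`; the thread name "toy-io" and the helpers `where_am_i` and `loop_threads_used` are hypothetical and only model the pattern described above, not fsspec's implementation.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

# One global loop running in one daemon thread (a toy analogue of fsspecIO).
_loop = asyncio.new_event_loop()
_io_thread = threading.Thread(target=_loop.run_forever, name="toy-io", daemon=True)
_io_thread.start()


def sync(coro):
    """Submit coro to the shared loop and block the caller until it finishes."""
    return asyncio.run_coroutine_threadsafe(coro, _loop).result()


async def where_am_i():
    # Reports the thread that actually executes the coroutine body.
    return threading.current_thread().name


def loop_threads_used(n_callers):
    """Call sync() from n_callers worker threads; collect the executing thread names."""
    with ThreadPoolExecutor(max_workers=n_callers) as pool:
        return set(pool.map(lambda _: sync(where_am_i()), range(n_callers)))
```

However many caller threads are used, `loop_threads_used` returns only the name of the single loop thread, which is why state bound to that loop is not accessed concurrently from the callers.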
Proposed Solution:
We propose modifying fsspec/spec.py to remove threading.get_ident() from the caching token if asynchronous=False (the default behavior).
For asynchronous=True, retaining a loop-specific or thread-specific token is still necessary, as advanced users may be running multiple event loops across different threads, and aiohttp sessions cannot cross event loop boundaries.
This change should substantially reduce connection overhead for synchronous, multi-threaded workloads interacting with remote filesystems, since each thread would reuse the existing instance instead of re-authenticating and opening new connections.
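The proposed rule could look roughly like the sketch below. It is only an illustration: `cache_token_parts` is a hypothetical helper, and fsspec's real tokenization in fsspec/spec.py is structured differently. The one point it shows is that the thread ident enters the token only when asynchronous=True.

```python
import threading


def cache_token_parts(cls_name, args, kwargs, asynchronous=False):
    """Build a hashable cache key (hypothetical helper, not fsspec's API)."""
    parts = [cls_name, tuple(args), tuple(sorted(kwargs.items()))]
    if asynchronous:
        # Async instances stay tied to the creating thread, because their
        # sessions cannot cross event loop boundaries.
        parts.append(threading.get_ident())
    return tuple(parts)


def token_from_new_thread(**call_kwargs):
    """Compute the token inside a freshly started thread and return it."""
    result = []
    t = threading.Thread(
        target=lambda: result.append(cache_token_parts("GCSFileSystem", (), {}, **call_kwargs))
    )
    t.start()
    t.join()
    return result[0]
```

With this rule, a synchronous token computed in a worker thread equals the one computed in the main thread, so both would hit the same cache entry, while asynchronous tokens still differ per thread.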