Remove threading.get_ident() from instance cache token for asynchronous=False instances #2020

@yuxin00j

The Problem:
Currently, the AbstractFileSystem instance cache incorporates threading.get_ident() into its cache token (in fsspec/spec.py).

This means that if multiple Python threads request the same filesystem (e.g., fsspec.filesystem("gs")), each thread misses the cache, so fsspec creates a new filesystem instance and a separate aiohttp.ClientSession for every thread.

In libraries like Hugging Face datasets, which rely heavily on multi-threading (via thread_map) to fetch metadata, every background thread creates a new FS instance, forcing redundant authentication, TLS handshakes, and connection-pool setup. This defeats the purpose of the cache and wastes significant time and bandwidth.
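
The effect can be reproduced with a toy cache that uses only the standard library. This is a minimal sketch, not fsspec code: ToyCachedFS, its get() method, and the include_thread_id flag are hypothetical names introduced for illustration. It only assumes that the cache token includes threading.get_ident(), as described above.

```python
import threading


class ToyCachedFS:
    """Hypothetical stand-in for a cached filesystem class; not fsspec code."""

    _cache = {}
    _lock = threading.Lock()

    @classmethod
    def get(cls, protocol, include_thread_id=True):
        # Build a cache token that optionally folds in the calling thread's id.
        thread_part = threading.get_ident() if include_thread_id else None
        token = (protocol, thread_part)
        with cls._lock:
            if token not in cls._cache:
                cls._cache[token] = cls()
            return cls._cache[token]


def count_instances(include_thread_id, n_threads=4):
    """Return how many distinct instances n_threads concurrent threads receive."""
    ToyCachedFS._cache.clear()
    seen = []
    barrier = threading.Barrier(n_threads)

    def worker():
        seen.append(ToyCachedFS.get("gs", include_thread_id))
        # Hold every thread alive until all have fetched, so a thread id is
        # not reused by a thread that starts after another has exited.
        barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len({id(fs) for fs in seen})
```

With the thread id in the token, count_instances(True) returns 4 (one instance per thread). Without it, count_instances(False) returns 1.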

Historical Context:
Tracing the fsspec git history suggests that this thread-specific caching is a vestige of an older event-loop architecture:

  1. In PR #572, "One event loop per thread" (commit b252369), fsspec implemented thread-local event loops (loops[ident] = loop). Because aiohttp.ClientSession is bound to a specific event loop, threading.get_ident() was correctly added to the cache token so each calling thread got its own FS instance and its own event loop.
  2. In PR #590, "Iothread" (commit 57d6ce92), the architecture was completely revamped. The thread-local event loops were removed in favor of a single, global background IO loop (fsspecIO).
  3. However, when the event loop became a global singleton, threading.get_ident() was never removed from the cache token.

Why it is safe to remove now (for asynchronous=False):
Because fsspec currently uses a single global fsspecIO event loop for all synchronous calls (asynchronous=False), all sync() coroutine executions are inherently funneled into that single background thread.

Therefore, multiple calling threads can safely share the same AsyncFileSystem instance (and its underlying aiohttp session). The session is only ever mutated on the fsspecIO background thread, where coroutines are scheduled cooperatively.
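
The funneling behavior can be illustrated with plain asyncio. This is a simplified sketch under stated assumptions: the loop, the "toyIO" thread name, and the sync() helper below are hypothetical stand-ins written for this example, and fsspec's own helper differs in its details. The point is only that coroutines submitted from many threads all execute on the one loop thread.

```python
import asyncio
import threading

# One shared background event loop, loosely analogous to the fsspecIO thread.
_loop = asyncio.new_event_loop()
_io_thread = threading.Thread(target=_loop.run_forever, name="toyIO", daemon=True)
_io_thread.start()


def sync(coro):
    """Submit a coroutine to the shared loop and block until it finishes."""
    return asyncio.run_coroutine_threadsafe(coro, _loop).result()


async def where_am_i():
    # Report the thread that actually executes the coroutine body.
    return threading.current_thread().name


def call_from_threads(n_threads=4):
    """Call sync() from several threads and collect the executing thread names."""
    names = []

    def worker():
        names.append(sync(where_am_i()))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return names
```

Every entry returned by call_from_threads() is "toyIO", regardless of which thread made the call, so any state touched inside the coroutine is only ever accessed from that one thread.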

Proposed Solution:
We propose modifying fsspec/spec.py to remove threading.get_ident() from the caching token if asynchronous=False (the default behavior).

For asynchronous=True, retaining a loop-specific or thread-specific token is still necessary, as advanced users may be running multiple event loops across different threads, and aiohttp sessions cannot cross event loop boundaries.
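
A rough sketch of the proposed rule follows. The make_token function and its arguments are hypothetical and do not reproduce the actual tokenization in fsspec/spec.py. The sketch keeps the thread id for asynchronous=True, though a loop-specific component could be used there instead.

```python
import threading


def make_token(cls_name, args, kwargs):
    """Hypothetical token builder; not the actual fsspec implementation."""
    parts = [cls_name, tuple(args), tuple(sorted(kwargs.items()))]
    # Only async instances are tied to the calling thread; sync instances
    # (asynchronous=False, the default) share one token across threads.
    if kwargs.get("asynchronous", False):
        parts.append(threading.get_ident())
    return tuple(parts)


def token_from_new_thread(cls_name, args, kwargs):
    """Compute the token inside a separate thread, for comparison."""
    result = []
    t = threading.Thread(
        target=lambda: result.append(make_token(cls_name, args, kwargs))
    )
    t.start()
    t.join()
    return result[0]
```

Under this rule, make_token("ToyFS", (), {}) gives the same token in every thread, while passing {"asynchronous": True} gives a token that differs between the calling thread and a worker thread.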

This change would let synchronous, multi-threaded workloads reuse one cached instance per remote filesystem instead of creating one per thread, avoiding the repeated authentication, TLS handshakes, and connection-pool setup described above.
