Add API-backed OneLake personal drive layer#262
Conversation
0be9c30 to
6d417bd
Compare
Keayoub
left a comment
There was a problem hiding this comment.
Performance Review — API-backed OneLake layer
The broker-credential pattern and no-FUSE direction are correct for Zone's privilege-restricted containers. However there are several performance issues that will make this noticeably slower than BlobFuse2 in interactive use. All are fixable without changing the architecture. Inline comments on the five critical paths below, with suggested fixes included.
Critical (will cause visible slowness):
- BrokerCredential.get_token() has no token cache — broker HTTP call on every SDK operation
- No
equests.Session reuse — new TLS handshake per token fetch - _read_saved_config() reads disk on every API operation
- Missing _open() in OneLakeFileSystem — JupyterLab downloads entire files before rendering
- info() makes 2 API calls per entry; isfile() chains 4
Full corrected versions of onelake_utils.py and onelake_fsspec.py available on request.
| headers["Authorization"] = f"Bearer {token}" | ||
|
|
||
| response = requests.get( | ||
| f"{self.broker_url}/onelake/token", |
There was a problem hiding this comment.
[Performance - Critical] No token cache — broker HTTP call on every SDK operation
get_token() makes a live HTTP request + new TLS handshake to the broker on every single call. The Azure SDK calls get_token() before each REST operation, so every ls, read, write, or append triggers a full broker roundtrip (~50–100 ms). In a JupyterLab session browsing files this fires hundreds of times per minute.
The AccessToken already contains expires_on — use it. This is exactly what azure-identity's DefaultAzureCredential does internally.
`python
import time, threading
class BrokerCredential:
_EXPIRY_BUFFER_SECONDS = 60
def __init__(self, broker_url=None, token_file=None):
...
self._cached_token: tuple[str, int] | None = None
self._session = None
self._lock = threading.Lock()
def _get_session(self):
# reuse one TCP+TLS connection — see requests.Session docs
if self._session is None:
import requests
self._session = requests.Session()
return self._session
def get_token(self, *scopes, **_kwargs):
with self._lock:
if self._cached_token:
token_str, mono_exp = self._cached_token
if time.monotonic() < mono_exp - self._EXPIRY_BUFFER_SECONDS:
from azure.core.credentials import AccessToken
return AccessToken(token_str, int(mono_exp))
# fetch, then: self._cached_token = (token, mono_expiry)
`
Without this fix the implementation will be measurably slower than BlobFuse2 for interactive use regardless of network speed.
| if not CONFIG_FILE.exists(): | ||
| return {} | ||
| try: | ||
| return json.loads(CONFIG_FILE.read_text(encoding="utf-8")) |
There was a problem hiding this comment.
[Performance] _read_saved_config() reads disk on every operation
_get_config() calls _read_saved_config() which does a stat() + read() syscall every time. Since _get_config() is on the hot path of every SDK operation (ls, read, write, _get_fs_client(), _build_path()), this adds unnecessary disk I/O to every API call.
Fix: cache in a module-level variable, invalidate only on configure() / reset_clients():
`python
_config_cache: dict | None = None
def _read_saved_config():
global _config_cache
if _config_cache is None:
if not CONFIG_FILE.exists():
_config_cache = {}
else:
try:
_config_cache = json.loads(CONFIG_FILE.read_text(encoding="utf-8"))
except (json.JSONDecodeError, OSError):
_config_cache = {}
return _config_cache
def reset_clients():
global _client, _fs_client, _config_cache
_client = _fs_client = _config_cache = None # invalidate on reset
`
"@
},
@{
path = "images/onelake/onelake_utils.py"
position = 129
body = @"
[Performance + Correctness] Singleton init is not thread-safe and has no retry policy
Two issues here:
1. TOCTOU race: JupyterLab runs an async Tornado event loop alongside kernel threads. Two concurrent coroutines can both observe _client is None and create duplicate DataLakeServiceClient instances. Guard with threading.Lock().
2. No retry/backoff: OneLake's DFS endpoint throttles under load (HTTP 429) and returns 503 on transient failures. In a shared JupyterHub with multiple concurrent users this is expected. The SDK accepts retry config directly:
`python
_client_lock = threading.Lock()
def _get_service_client():
global _client
with _client_lock:
if _client is None:
from azure.storage.filedatalake import DataLakeServiceClient
_client = DataLakeServiceClient(
account_url=os.environ.get("ONELAKE_ENDPOINT", ONELAKE_ENDPOINT),
credential=BrokerCredential(),
retry_total=4,
retry_backoff_factor=0.5,
)
return _client
"@ }, @{ path = "images/onelake/onelake_utils.py" position = 256 body = @" **[Performance]readall()` buffers the entire file into RAM**
readall() downloads the complete file before returning a single byte. For a 200 MB CSV or Parquet file this means JupyterLab freezes for 30–120 s and risks OOM in memory-constrained notebook pods.
read() is called by OneLakeFileSystem.cat_file() in onelake_fsspec.py, which is the fallback for every file open in JupyterLab. The real fix is implementing _open() in OneLakeFileSystem (see comment there) — once _open() exists, cat_file() / readall() are no longer on the JupyterLab hot path. For the CLI cat command readall() is fine.
| raise FileNotFoundError(path) | ||
|
|
||
| if self.isdir(path): | ||
| return {"name": path, "type": "directory", "size": 0, "created": _now(), "mtime": _now()} |
There was a problem hiding this comment.
[Performance] info() makes 2 API round trips for every file entry
isdir() calls get_directory_properties() (API call 1), then if not a directory get_file_properties() is called (API call 2). JupyterLab calls info() on every entry it renders, so a folder with 50 files costs 100 API calls instead of 50.
The ADLS2 SDK's get_file_properties() returns an hdi_isfolder metadata field — call it once and derive the type from the response:
`python
from azure.core.exceptions import ResourceNotFoundError
def info(self, path, _kwargs):
path = _clean(self._strip_protocol(path))
if path == "":
return {"name": "", "type": "directory", "size": 0, "created": _now(), "mtime": _now()}
if not self._configured():
if path == _MARKER: return self._marker_info()
raise FileNotFoundError(path)
try:
props = onelake._get_file_client(path).get_file_properties()
is_dir = dict(props.metadata or {}).get("hdi_isfolder") == "true"
return {
"name": path,
"type": "directory" if is_dir else "file",
"size": 0 if is_dir else int(getattr(props, "size", 0) or 0),
"created": _now(), "mtime": _now(),
}
except ResourceNotFoundError:
raise FileNotFoundError(path)
"@ }, @{ path = "images/onelake/onelake_fsspec.py" position = 114 body = @" **[Performance]isfile()` chains 4 API calls
isfile() → exists() → info() → isdir() (API call) + get_file_properties() (API call), then info() is called again for the ["type"] check — 4 API calls total for a single boolean result.
Replace with a single info() call:
python def isfile(self, path): try: return self.info(path)["type"] == "file" except FileNotFoundError: return False
Same fix applies to isdir():
python def isdir(self, path): try: return self.info(path)["type"] == "directory" except FileNotFoundError: return False
"@
},
@{
path = "images/onelake/onelake_fsspec.py"
position = 29
body = @"
[Performance - Critical] Missing _open() — JupyterLab downloads entire files
OneLakeFileSystem does not implement _open(). Without it, fsspec falls back to cat_file() for every file open in JupyterLab, which calls onelake.read() → readall() — downloading the entire file into memory before the first byte is shown. A user double-clicking a 200 MB CSV waits for the full download. Parquet column pruning (one of OneLake's key advantages over BlobFuse2) is also impossible without range reads.
The ADLS2 SDK supports download_file(offset=..., length=...) which maps directly to an HTTP Range: header. Implement _open() using fsspec's AbstractBufferedFile with a 5 MB block cache:
`python
from fsspec.spec import AbstractBufferedFile, AbstractFileSystem
class OneLakeFile(AbstractBufferedFile):
def _fetch_range(self, start, end):
# ADLS2 SDK maps offset/length to HTTP Range header natively
fc = onelake._get_file_client(self.path)
return fc.download_file(offset=start, length=end - start).readall()
class OneLakeFileSystem(AbstractFileSystem):
...
def _open(self, path, mode="rb", block_size=None, **kwargs):
path = _clean(self._strip_protocol(path))
return OneLakeFile(
self, path, mode=mode,
block_size=block_size or 5 * 1024 * 1024, # 5 MB blocks
**kwargs,
)
`
AbstractBufferedFile caches blocks in memory — same behaviour as BlobFuse2's kernel page cache but in userspace. pyarrow and pandas are fsspec-aware and will only fetch the Parquet row groups they need via range reads, which BlobFuse2 cannot do.
7e4b629 to
399b16a
Compare
bcaf59c to
b26202e
Compare
4fc3bbd to
4ae4c99
Compare
Adds the OneLake image layer with CLI/Python/R/JupyterLab access path, wires it into the build chain (mid now derives from onelake), and includes the simplified-architecture brief. The get_branch_name helper is updated to sanitize PR branch names so they're valid Docker tags.
4ae4c99 to
d5f55d1
Compare
Summary
onelakeimage layer betweenbaseandmid.onelakeCLI, and a Jupyter/fsspec-backedOneLakefile-browser resource.images/mid.Important Notes
/dev/fuse,SYS_ADMIN, device-code auth, oraz login.ONELAKE_BROKER_URL.Tests
python -m pytest tests/onelake -q-> 11 passedfsspecsmoke check foronelake://-> passedAuthorship
All commits are authored by
vashkartik <kartikvashisth675@gmail.com>.