Add API-backed OneLake personal drive layer by vashkartik · Pull Request #262 · StatCan/zone-kubeflow-containers

vashkartik · 2026-05-07T17:13:35Z

Summary

Add a fresh OneLake architecture plan documenting the move away from BlobFuse2/FUSE.
Add a dedicated onelake image layer between base and mid.
Add broker-backed OneLake Python helpers, a cron-friendly onelake CLI, and a Jupyter/fsspec-backed OneLake file-browser resource.
Keep OneLake implementation isolated from images/mid.

Important Notes

This path does not use BlobFuse2, /dev/fuse, SYS_ADMIN, device-code auth, or az login.
Users should not select the legacy "Enable OneLake FUSE mount" notebook configuration for this image.
Live OneLake access still depends on the platform token broker contract via ONELAKE_BROKER_URL.

Tests

python -m pytest tests/onelake -q -> 11 passed
Local fsspec smoke check for onelake:// -> passed

Authorship

All commits are authored by vashkartik <kartikvashisth675@gmail.com>.

Keayoub

Performance Review — API-backed OneLake layer

The broker-credential pattern and no-FUSE direction are correct for Zone's privilege-restricted containers. However there are several performance issues that will make this noticeably slower than BlobFuse2 in interactive use. All are fixable without changing the architecture. Inline comments on the five critical paths below, with suggested fixes included.

Critical (will cause visible slowness):

BrokerCredential.get_token() has no token cache — broker HTTP call on every SDK operation
No
equests.Session reuse — new TLS handshake per token fetch
_read_saved_config() reads disk on every API operation
Missing _open() in OneLakeFileSystem — JupyterLab downloads entire files before rendering
info() makes 2 API calls per entry; isfile() chains 4

Full corrected versions of onelake_utils.py and onelake_fsspec.py available on request.

Keayoub · 2026-05-14T18:13:07Z

+            headers["Authorization"] = f"Bearer {token}"
+
+        response = requests.get(
+            f"{self.broker_url}/onelake/token",


[Performance - Critical] No token cache — broker HTTP call on every SDK operation

get_token() makes a live HTTP request + new TLS handshake to the broker on every single call. The Azure SDK calls get_token() before each REST operation, so every ls, read, write, or append triggers a full broker roundtrip (~50–100 ms). In a JupyterLab session browsing files this fires hundreds of times per minute.

The AccessToken already contains expires_on — use it. This is exactly what azure-identity's DefaultAzureCredential does internally.

`python
import time, threading

class BrokerCredential:
_EXPIRY_BUFFER_SECONDS = 60

def __init__(self, broker_url=None, token_file=None): ... self._cached_token: tuple[str, int] | None = None self._session = None self._lock = threading.Lock() def _get_session(self): # reuse one TCP+TLS connection — see requests.Session docs if self._session is None: import requests self._session = requests.Session() return self._session def get_token(self, *scopes, **_kwargs): with self._lock: if self._cached_token: token_str, mono_exp = self._cached_token if time.monotonic() < mono_exp - self._EXPIRY_BUFFER_SECONDS: from azure.core.credentials import AccessToken return AccessToken(token_str, int(mono_exp)) # fetch, then: self._cached_token = (token, mono_expiry)

`

Without this fix the implementation will be measurably slower than BlobFuse2 for interactive use regardless of network speed.

Keayoub · 2026-05-14T18:13:07Z

+    if not CONFIG_FILE.exists():
+        return {}
+    try:
+        return json.loads(CONFIG_FILE.read_text(encoding="utf-8"))


[Performance] _read_saved_config() reads disk on every operation

_get_config() calls _read_saved_config() which does a stat() + read() syscall every time. Since _get_config() is on the hot path of every SDK operation (ls, read, write, _get_fs_client(), _build_path()), this adds unnecessary disk I/O to every API call.

Fix: cache in a module-level variable, invalidate only on configure() / reset_clients():

`python
_config_cache: dict | None = None

def _read_saved_config():
global _config_cache
if _config_cache is None:
if not CONFIG_FILE.exists():
_config_cache = {}
else:
try:
_config_cache = json.loads(CONFIG_FILE.read_text(encoding="utf-8"))
except (json.JSONDecodeError, OSError):
_config_cache = {}
return _config_cache

def reset_clients():
global _client, _fs_client, _config_cache
_client = _fs_client = _config_cache = None # invalidate on reset
`
"@
},
@{
path = "images/onelake/onelake_utils.py"
position = 129
body = @"
[Performance + Correctness] Singleton init is not thread-safe and has no retry policy

Two issues here:

1. TOCTOU race: JupyterLab runs an async Tornado event loop alongside kernel threads. Two concurrent coroutines can both observe _client is None and create duplicate DataLakeServiceClient instances. Guard with threading.Lock().

2. No retry/backoff: OneLake's DFS endpoint throttles under load (HTTP 429) and returns 503 on transient failures. In a shared JupyterHub with multiple concurrent users this is expected. The SDK accepts retry config directly:

`python
_client_lock = threading.Lock()

def _get_service_client():
global _client
with _client_lock:
if _client is None:
from azure.storage.filedatalake import DataLakeServiceClient
_client = DataLakeServiceClient(
account_url=os.environ.get("ONELAKE_ENDPOINT", ONELAKE_ENDPOINT),
credential=BrokerCredential(),
retry_total=4,
retry_backoff_factor=0.5,
)
return _client
"@ }, @{ path = "images/onelake/onelake_utils.py" position = 256 body = @" **[Performance]readall()` buffers the entire file into RAM**

readall() downloads the complete file before returning a single byte. For a 200 MB CSV or Parquet file this means JupyterLab freezes for 30–120 s and risks OOM in memory-constrained notebook pods.

read() is called by OneLakeFileSystem.cat_file() in onelake_fsspec.py, which is the fallback for every file open in JupyterLab. The real fix is implementing _open() in OneLakeFileSystem (see comment there) — once _open() exists, cat_file() / readall() are no longer on the JupyterLab hot path. For the CLI cat command readall() is fine.

Keayoub · 2026-05-14T18:13:07Z

+            raise FileNotFoundError(path)
+
+        if self.isdir(path):
+            return {"name": path, "type": "directory", "size": 0, "created": _now(), "mtime": _now()}


[Performance] info() makes 2 API round trips for every file entry

isdir() calls get_directory_properties() (API call 1), then if not a directory get_file_properties() is called (API call 2). JupyterLab calls info() on every entry it renders, so a folder with 50 files costs 100 API calls instead of 50.

The ADLS2 SDK's get_file_properties() returns an hdi_isfolder metadata field — call it once and derive the type from the response:

`python
from azure.core.exceptions import ResourceNotFoundError

def info(self, path, _kwargs):
path = _clean(self._strip_protocol(path))
if path == "":
return {"name": "", "type": "directory", "size": 0, "created": _now(), "mtime": _now()}
if not self._configured():
if path == _MARKER: return self._marker_info()
raise FileNotFoundError(path)
try:
props = onelake._get_file_client(path).get_file_properties()
is_dir = dict(props.metadata or {}).get("hdi_isfolder") == "true"
return {
"name": path,
"type": "directory" if is_dir else "file",
"size": 0 if is_dir else int(getattr(props, "size", 0) or 0),
"created": _now(), "mtime": _now(),
}
except ResourceNotFoundError:
raise FileNotFoundError(path)
"@ }, @{ path = "images/onelake/onelake_fsspec.py" position = 114 body = @" **[Performance]isfile()` chains 4 API calls

isfile() → exists() → info() → isdir() (API call) + get_file_properties() (API call), then info() is called again for the ["type"] check — 4 API calls total for a single boolean result.

Replace with a single info() call:

python def isfile(self, path): try: return self.info(path)["type"] == "file" except FileNotFoundError: return False

Same fix applies to isdir():

python def isdir(self, path): try: return self.info(path)["type"] == "directory" except FileNotFoundError: return False
"@
},
@{
path = "images/onelake/onelake_fsspec.py"
position = 29
body = @"
[Performance - Critical] Missing _open() — JupyterLab downloads entire files

OneLakeFileSystem does not implement _open(). Without it, fsspec falls back to cat_file() for every file open in JupyterLab, which calls onelake.read() → readall() — downloading the entire file into memory before the first byte is shown. A user double-clicking a 200 MB CSV waits for the full download. Parquet column pruning (one of OneLake's key advantages over BlobFuse2) is also impossible without range reads.

The ADLS2 SDK supports download_file(offset=..., length=...) which maps directly to an HTTP Range: header. Implement _open() using fsspec's AbstractBufferedFile with a 5 MB block cache:

`python
from fsspec.spec import AbstractBufferedFile, AbstractFileSystem

class OneLakeFile(AbstractBufferedFile):
def _fetch_range(self, start, end):
# ADLS2 SDK maps offset/length to HTTP Range header natively
fc = onelake._get_file_client(self.path)
return fc.download_file(offset=start, length=end - start).readall()

class OneLakeFileSystem(AbstractFileSystem):
...
def _open(self, path, mode="rb", block_size=None, **kwargs):
path = _clean(self._strip_protocol(path))
return OneLakeFile(
self, path, mode=mode,
block_size=block_size or 5 * 1024 * 1024, # 5 MB blocks
**kwargs,
)
`

AbstractBufferedFile caches blocks in memory — same behaviour as BlobFuse2's kernel page cache but in userspace. pyarrow and pandas are fsspec-aware and will only fetch the Parquet row groups they need via range reads, which BlobFuse2 cannot do.

Adds the OneLake image layer with CLI/Python/R/JupyterLab access path, wires it into the build chain (mid now derives from onelake), and includes the simplified-architecture brief. The get_branch_name helper is updated to sanitize PR branch names so they're valid Docker tags.

vashkartik force-pushed the feat/onelake-personal-drive branch from 0be9c30 to 6d417bd Compare May 8, 2026 16:12

Keayoub suggested changes May 14, 2026

View reviewed changes

github-advanced-security AI found potential problems May 15, 2026

View reviewed changes

Comment thread tests/onelake/test_onelake_layer.py Fixed

vashkartik force-pushed the feat/onelake-personal-drive branch 3 times, most recently from 7e4b629 to 399b16a Compare May 15, 2026 17:55

github-advanced-security AI found potential problems May 15, 2026

View reviewed changes

Comment thread .github/workflows/docker.yaml Fixed

Comment thread .github/workflows/docker.yaml Fixed

vashkartik force-pushed the feat/onelake-personal-drive branch 3 times, most recently from bcaf59c to b26202e Compare May 20, 2026 18:14

vashkartik force-pushed the feat/onelake-personal-drive branch 5 times, most recently from 4fc3bbd to 4ae4c99 Compare May 29, 2026 13:55

vashkartik force-pushed the feat/onelake-personal-drive branch from 4ae4c99 to d5f55d1 Compare May 29, 2026 18:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add API-backed OneLake personal drive layer#262

Add API-backed OneLake personal drive layer#262
vashkartik wants to merge 1 commit into
betafrom
feat/onelake-personal-drive

vashkartik commented May 7, 2026

Uh oh!

Keayoub left a comment

Uh oh!

Keayoub May 14, 2026

Uh oh!

Keayoub May 14, 2026

Uh oh!

Keayoub May 14, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

vashkartik commented May 7, 2026

Summary

Important Notes

Tests

Authorship

Uh oh!

Keayoub left a comment

Choose a reason for hiding this comment

Performance Review — API-backed OneLake layer

Uh oh!

Keayoub May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Keayoub May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Keayoub May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants