Skip to content

Add API-backed OneLake personal drive layer#262

Open
vashkartik wants to merge 1 commit into
betafrom
feat/onelake-personal-drive
Open

Add API-backed OneLake personal drive layer#262
vashkartik wants to merge 1 commit into
betafrom
feat/onelake-personal-drive

Conversation

@vashkartik
Copy link
Copy Markdown
Collaborator

Summary

  • Add a fresh OneLake architecture plan documenting the move away from BlobFuse2/FUSE.
  • Add a dedicated onelake image layer between base and mid.
  • Add broker-backed OneLake Python helpers, a cron-friendly onelake CLI, and a Jupyter/fsspec-backed OneLake file-browser resource.
  • Keep OneLake implementation isolated from images/mid.

Important Notes

  • This path does not use BlobFuse2, /dev/fuse, SYS_ADMIN, device-code auth, or az login.
  • Users should not select the legacy "Enable OneLake FUSE mount" notebook configuration for this image.
  • Live OneLake access still depends on the platform token broker contract via ONELAKE_BROKER_URL.

Tests

  • python -m pytest tests/onelake -q -> 11 passed
  • Local fsspec smoke check for onelake:// -> passed

Authorship

All commits are authored by vashkartik <kartikvashisth675@gmail.com>.

@vashkartik vashkartik force-pushed the feat/onelake-personal-drive branch from 0be9c30 to 6d417bd Compare May 8, 2026 16:12
Copy link
Copy Markdown

@Keayoub Keayoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance Review — API-backed OneLake layer

The broker-credential pattern and no-FUSE direction are correct for Zone's privilege-restricted containers. However there are several performance issues that will make this noticeably slower than BlobFuse2 in interactive use. All are fixable without changing the architecture. Inline comments on the five critical paths below, with suggested fixes included.

Critical (will cause visible slowness):

  • BrokerCredential.get_token() has no token cache — broker HTTP call on every SDK operation
  • No
    equests.Session reuse — new TLS handshake per token fetch
  • _read_saved_config() reads disk on every API operation
  • Missing _open() in OneLakeFileSystem — JupyterLab downloads entire files before rendering
  • info() makes 2 API calls per entry; isfile() chains 4

Full corrected versions of onelake_utils.py and onelake_fsspec.py available on request.

Comment thread images/onelake/onelake_utils.py Outdated
headers["Authorization"] = f"Bearer {token}"

response = requests.get(
f"{self.broker_url}/onelake/token",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Performance - Critical] No token cache — broker HTTP call on every SDK operation

get_token() makes a live HTTP request + new TLS handshake to the broker on every single call. The Azure SDK calls get_token() before each REST operation, so every ls, read, write, or append triggers a full broker roundtrip (~50–100 ms). In a JupyterLab session browsing files this fires hundreds of times per minute.

The AccessToken already contains expires_on — use it. This is exactly what azure-identity's DefaultAzureCredential does internally.

`python
import time, threading

class BrokerCredential:
_EXPIRY_BUFFER_SECONDS = 60

def __init__(self, broker_url=None, token_file=None):
    ...
    self._cached_token: tuple[str, int] | None = None
    self._session = None
    self._lock = threading.Lock()

def _get_session(self):
    # reuse one TCP+TLS connection — see requests.Session docs
    if self._session is None:
        import requests
        self._session = requests.Session()
    return self._session

def get_token(self, *scopes, **_kwargs):
    with self._lock:
        if self._cached_token:
            token_str, mono_exp = self._cached_token
            if time.monotonic() < mono_exp - self._EXPIRY_BUFFER_SECONDS:
                from azure.core.credentials import AccessToken
                return AccessToken(token_str, int(mono_exp))
    # fetch, then: self._cached_token = (token, mono_expiry)

`

Without this fix the implementation will be measurably slower than BlobFuse2 for interactive use regardless of network speed.

Comment thread images/onelake/onelake_utils.py Outdated
if not CONFIG_FILE.exists():
return {}
try:
return json.loads(CONFIG_FILE.read_text(encoding="utf-8"))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Performance] _read_saved_config() reads disk on every operation

_get_config() calls _read_saved_config() which does a stat() + read() syscall every time. Since _get_config() is on the hot path of every SDK operation (ls, read, write, _get_fs_client(), _build_path()), this adds unnecessary disk I/O to every API call.

Fix: cache in a module-level variable, invalidate only on configure() / reset_clients():

`python
_config_cache: dict | None = None

def _read_saved_config():
global _config_cache
if _config_cache is None:
if not CONFIG_FILE.exists():
_config_cache = {}
else:
try:
_config_cache = json.loads(CONFIG_FILE.read_text(encoding="utf-8"))
except (json.JSONDecodeError, OSError):
_config_cache = {}
return _config_cache

def reset_clients():
global _client, _fs_client, _config_cache
_client = _fs_client = _config_cache = None # invalidate on reset
`
"@
},
@{
path = "images/onelake/onelake_utils.py"
position = 129
body = @"
[Performance + Correctness] Singleton init is not thread-safe and has no retry policy

Two issues here:

1. TOCTOU race: JupyterLab runs an async Tornado event loop alongside kernel threads. Two concurrent coroutines can both observe _client is None and create duplicate DataLakeServiceClient instances. Guard with threading.Lock().

2. No retry/backoff: OneLake's DFS endpoint throttles under load (HTTP 429) and returns 503 on transient failures. In a shared JupyterHub with multiple concurrent users this is expected. The SDK accepts retry config directly:

`python
_client_lock = threading.Lock()

def _get_service_client():
global _client
with _client_lock:
if _client is None:
from azure.storage.filedatalake import DataLakeServiceClient
_client = DataLakeServiceClient(
account_url=os.environ.get("ONELAKE_ENDPOINT", ONELAKE_ENDPOINT),
credential=BrokerCredential(),
retry_total=4,
retry_backoff_factor=0.5,
)
return _client
"@ }, @{ path = "images/onelake/onelake_utils.py" position = 256 body = @" **[Performance]readall()` buffers the entire file into RAM**

readall() downloads the complete file before returning a single byte. For a 200 MB CSV or Parquet file this means JupyterLab freezes for 30–120 s and risks OOM in memory-constrained notebook pods.

read() is called by OneLakeFileSystem.cat_file() in onelake_fsspec.py, which is the fallback for every file open in JupyterLab. The real fix is implementing _open() in OneLakeFileSystem (see comment there) — once _open() exists, cat_file() / readall() are no longer on the JupyterLab hot path. For the CLI cat command readall() is fine.

raise FileNotFoundError(path)

if self.isdir(path):
return {"name": path, "type": "directory", "size": 0, "created": _now(), "mtime": _now()}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Performance] info() makes 2 API round trips for every file entry

isdir() calls get_directory_properties() (API call 1), then if not a directory get_file_properties() is called (API call 2). JupyterLab calls info() on every entry it renders, so a folder with 50 files costs 100 API calls instead of 50.

The ADLS2 SDK's get_file_properties() returns an hdi_isfolder metadata field — call it once and derive the type from the response:

`python
from azure.core.exceptions import ResourceNotFoundError

def info(self, path, _kwargs):
path = _clean(self._strip_protocol(path))
if path == "":
return {"name": "", "type": "directory", "size": 0, "created": _now(), "mtime": _now()}
if not self._configured():
if path == _MARKER: return self._marker_info()
raise FileNotFoundError(path)
try:
props = onelake._get_file_client(path).get_file_properties()
is_dir = dict(props.metadata or {}).get("hdi_isfolder") == "true"
return {
"name": path,
"type": "directory" if is_dir else "file",
"size": 0 if is_dir else int(getattr(props, "size", 0) or 0),
"created": _now(), "mtime": _now(),
}
except ResourceNotFoundError:
raise FileNotFoundError(path)
"@ }, @{ path = "images/onelake/onelake_fsspec.py" position = 114 body = @" **[Performance]isfile()` chains 4 API calls

isfile()exists()info()isdir() (API call) + get_file_properties() (API call), then info() is called again for the ["type"] check — 4 API calls total for a single boolean result.

Replace with a single info() call:

python def isfile(self, path): try: return self.info(path)["type"] == "file" except FileNotFoundError: return False

Same fix applies to isdir():

python def isdir(self, path): try: return self.info(path)["type"] == "directory" except FileNotFoundError: return False
"@
},
@{
path = "images/onelake/onelake_fsspec.py"
position = 29
body = @"
[Performance - Critical] Missing _open() — JupyterLab downloads entire files

OneLakeFileSystem does not implement _open(). Without it, fsspec falls back to cat_file() for every file open in JupyterLab, which calls onelake.read()readall() — downloading the entire file into memory before the first byte is shown. A user double-clicking a 200 MB CSV waits for the full download. Parquet column pruning (one of OneLake's key advantages over BlobFuse2) is also impossible without range reads.

The ADLS2 SDK supports download_file(offset=..., length=...) which maps directly to an HTTP Range: header. Implement _open() using fsspec's AbstractBufferedFile with a 5 MB block cache:

`python
from fsspec.spec import AbstractBufferedFile, AbstractFileSystem

class OneLakeFile(AbstractBufferedFile):
def _fetch_range(self, start, end):
# ADLS2 SDK maps offset/length to HTTP Range header natively
fc = onelake._get_file_client(self.path)
return fc.download_file(offset=start, length=end - start).readall()

class OneLakeFileSystem(AbstractFileSystem):
...
def _open(self, path, mode="rb", block_size=None, **kwargs):
path = _clean(self._strip_protocol(path))
return OneLakeFile(
self, path, mode=mode,
block_size=block_size or 5 * 1024 * 1024, # 5 MB blocks
**kwargs,
)
`

AbstractBufferedFile caches blocks in memory — same behaviour as BlobFuse2's kernel page cache but in userspace. pyarrow and pandas are fsspec-aware and will only fetch the Parquet row groups they need via range reads, which BlobFuse2 cannot do.

Comment thread tests/onelake/test_onelake_layer.py Fixed
@vashkartik vashkartik force-pushed the feat/onelake-personal-drive branch 3 times, most recently from 7e4b629 to 399b16a Compare May 15, 2026 17:55
Comment thread .github/workflows/docker.yaml Fixed
Comment thread .github/workflows/docker.yaml Fixed
@vashkartik vashkartik force-pushed the feat/onelake-personal-drive branch 3 times, most recently from bcaf59c to b26202e Compare May 20, 2026 18:14
@vashkartik vashkartik force-pushed the feat/onelake-personal-drive branch 5 times, most recently from 4fc3bbd to 4ae4c99 Compare May 29, 2026 13:55
Adds the OneLake image layer with CLI/Python/R/JupyterLab access path, wires it into the build chain (mid now derives from onelake), and includes the simplified-architecture brief. The get_branch_name helper is updated to sanitize PR branch names so they're valid Docker tags.
@vashkartik vashkartik force-pushed the feat/onelake-personal-drive branch from 4ae4c99 to d5f55d1 Compare May 29, 2026 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants