Task Summary
Title: Refactor dataset to generalized asset concept to support new resource types
Context
Currently in Texera, we use the keyword dataset to represent a group of files a user has uploaded into a LakeFS repository. Essentially, a dataset acts as a file system. This concept is heavily coupled throughout our stack: it functions as a resource in the dashboard, defines types and structures during workflow execution, and serves as a direct reference point within UDFs and Python code.
Motivation
We are planning to introduce a new resource type called model. Under the hood, a model is architecturally identical to a dataset: it is simply a repository containing files and folders in a tree structure (similar to an S3 bucket), backed by LakeFS and MinIO.
Because models and datasets share the same underlying file-system storage mechanism, the interpretation of the files stored in MinIO should be decoupled from the storage structure itself. Instead, the reader (e.g., a UDF operator reading a file as binary) should dictate how the content is interpreted.
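As a rough illustration of this decoupling, the sketch below keeps the storage side limited to returning raw bytes and lets each reader choose the interpretation. Everything here is hypothetical: `FAKE_STORE` stands in for the LakeFS/MinIO-backed storage, and `open_file`, `read_as_json`, and `read_as_binary` are illustrative names, not existing Texera APIs.

```python
import io
import json

# Hypothetical in-memory stand-in for the LakeFS/MinIO-backed store:
# it maps a file path to raw bytes and knows nothing about file formats.
FAKE_STORE = {
    "my-dataset/rows.json": b'[{"id": 1}, {"id": 2}]',
    "my-model/weights.bin": bytes([0, 1, 2, 3]),
}


def open_file(path: str) -> io.BytesIO:
    """Return the stored bytes as a stream; no interpretation happens here."""
    return io.BytesIO(FAKE_STORE[path])


def read_as_json(path: str):
    """Reader-side choice: parse the bytes as JSON."""
    return json.load(open_file(path))


def read_as_binary(path: str) -> bytes:
    """Reader-side choice: keep the bytes untouched."""
    return open_file(path).read()
```

The same stored file can be passed to either reader; only the caller's choice changes what comes back.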
Proposed Solution
To generalize our current LakeFS/MinIO storage architecture to support both datasets and models, we need to refactor the codebase to use a broader abstraction.
We propose introducing a new core keyword: asset.
The asset concept will act as the universal pointer to our storage layer, encompassing various specific resource types, including both dataset and model.
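One possible shape for this abstraction is sketched below, under stated assumptions: the `Asset` class, the `AssetType` enum, the `owner` and `name` fields, and the path layout in `file_path` are all hypothetical and only illustrate that the resource type can be a label on top of a shared storage reference.

```python
from dataclasses import dataclass
from enum import Enum


class AssetType(Enum):
    """Specific resource types covered by the generalized asset concept."""
    DATASET = "dataset"
    MODEL = "model"


@dataclass(frozen=True)
class Asset:
    """Hypothetical reference to a repository of files in the storage layer."""
    owner: str
    name: str
    asset_type: AssetType

    def file_path(self, relative_path: str) -> str:
        # Assumed layout for illustration only: the path does not depend
        # on the asset type, since both types share the same storage.
        return f"{self.owner}/{self.name}/{relative_path}"
```

In this sketch a dataset and a model with the same owner and name resolve files identically; the type only distinguishes how the resource is presented and used.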
Tasks & Acceptance Criteria
To implement this abstraction, we need to replace occurrences of dataset with the generalized asset keyword across the stack.
- … dataset occurrences in Postgres table names.
- … common directory that currently refer to dataset.
- … file-service to use the asset terminology.
- … dataset as the sole reference to storage.
Priority
P2 – Medium
Task Type