Remove dataset keyword from code base and use keyword asset #4324

@aicam

Description

Task Summary

Title: Refactor dataset to generalized asset concept to support new resource types

Context

Currently in Texera, we use the keyword dataset to represent a group of files a user has uploaded into a LakeFS repository. Essentially, a dataset acts as a file system. This concept is heavily coupled throughout our stack: it functions as a resource in the dashboard, defines types and structures during workflow execution, and serves as a direct reference point within UDFs and Python code.

Motivation

We are planning to introduce a new resource type called model. Under the hood, a model is architecturally identical to a dataset: it is simply a repository containing files and folders in a tree structure (similar to an S3 bucket), backed by LakeFS and MinIO.

Because models and datasets share the exact same underlying file-system storage mechanism, the interpretation of the files stored in MinIO should be decoupled from the storage structure itself. Instead, the reader (e.g., a UDF operator reading a file as binary) should dictate how the content is interpreted.
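
To illustrate the decoupling, here is a minimal sketch in which the storage layer only returns raw bytes and each reader decides how to interpret them. The in-memory `repository` dict and the function names `read_bytes`, `read_as_csv_rows`, and `read_as_binary` are hypothetical stand-ins, not existing Texera, LakeFS, or MinIO APIs.

```python
import csv
import io

# Hypothetical stand-in for a LakeFS/MinIO-backed repository: path -> raw bytes.
# The storage layer has no notion of "dataset" or "model".
repository = {
    "data/points.csv": b"x,y\n1,2\n3,4\n",
}


def read_bytes(path: str) -> bytes:
    """Return the stored object unchanged; no interpretation happens here."""
    return repository[path]


def read_as_csv_rows(path: str) -> list[dict[str, str]]:
    """One reader chooses to interpret the bytes as UTF-8 CSV."""
    text = read_bytes(path).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


def read_as_binary(path: str) -> io.BytesIO:
    """Another reader treats the same bytes as an opaque binary stream."""
    return io.BytesIO(read_bytes(path))


rows = read_as_csv_rows("data/points.csv")
blob = read_as_binary("data/points.csv")
```

Both readers go through the same `read_bytes` call, so the storage side would not need to know whether the caller considers the file part of a dataset or a model.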

Proposed Solution

To generalize our current LakeFS/MinIO storage architecture to support both datasets and models, we need to refactor the codebase to use a broader abstraction.

We propose introducing a new core keyword: asset.
The asset concept will act as the universal pointer to our storage layer, encompassing various specific resource types, including both dataset and model.
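
One possible shape for this abstraction, sketched under assumptions: a single `Asset` type carries the storage pointer, and the resource kind is only a tag on it. The names `AssetType`, `Asset`, and `storage_path`, as well as the `owner/name/version/file` path layout, are illustrative and not taken from the Texera codebase.

```python
from dataclasses import dataclass
from enum import Enum


class AssetType(Enum):
    """Hypothetical tag for the kind of resource an asset represents."""
    DATASET = "dataset"
    MODEL = "model"


@dataclass(frozen=True)
class Asset:
    """Hypothetical universal pointer to a repository in the storage layer."""
    owner: str
    name: str
    asset_type: AssetType

    def storage_path(self, file_path: str, version: str = "main") -> str:
        # Assumed path layout; the type tag deliberately plays no role here,
        # so datasets and models resolve to storage in exactly the same way.
        return f"{self.owner}/{self.name}/{version}/{file_path.lstrip('/')}"


dataset = Asset("alice", "iris", AssetType.DATASET)
model = Asset("alice", "iris-classifier", AssetType.MODEL)
```

In this sketch, `dataset.storage_path("/train.csv")` and `model.storage_path("weights.bin")` follow the same code path, which mirrors the claim that both resource types share one storage mechanism.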

Tasks & Acceptance Criteria

To implement this abstraction, we need to replace occurrences of dataset with the generalized asset keyword across the stack.

  • Database: Rename all dataset occurrences in Postgres table names.
  • Common Utilities: Update all storage utilities in the common directory that currently refer to dataset.
  • File Service: Refactor all files within file-service to use the asset terminology.
  • UDF Definitions: Update UDF classes and type definitions that currently hardcode dataset as the sole reference to storage.
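
To scope the rename across these areas, a small inventory script could help. The sketch below counts case-insensitive occurrences of a keyword per file under a directory; `count_keyword` is a hypothetical helper, and it only reports matches without rewriting anything.

```python
import re
from collections import Counter
from pathlib import Path


def count_keyword(root: str, keyword: str = "dataset") -> Counter:
    """Count case-insensitive keyword occurrences per text file under root."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    counts: Counter = Counter()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            # Skip binary or unreadable files rather than failing the scan.
            continue
        hits = len(pattern.findall(text))
        if hits:
            counts[str(path.relative_to(root))] = hits
    return counts
```

Running it separately on each directory named in the task list and checking `counts.most_common()` would give a rough checklist of files to review. Substring matching also catches identifiers such as `DatasetResource`, so every hit would still need manual inspection.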

Priority

P2 – Medium

Task Type

  • Code Implementation
  • Documentation
  • Refactor / Cleanup
  • Testing / QA
  • DevOps / Deployment
