Feature Request: Support ODCS for Blob Storage Metadata (File-Level Governance)
Description
I would like to propose extending datacontract-cli to support the definition and validation of ODCS data contracts applied to blob storage structures, focusing on file metadata rather than file content.
Use Case
In scenarios where blob storage is used to host unstructured data (e.g., GS1 product images), the primary concern is not the content itself but:
- The organization of files and paths (hierarchical structure)
- The consistency and validity of file metadata
- The enforcement of storage-level policies (naming conventions, required attributes, etc.)
Proposed Enhancement
Introduce support for defining data contracts that:
- Validate blob hierarchy and folder structure
- Enforce naming conventions and path patterns
- Verify file metadata integrity and required attributes
- Ensure compliance with governance rules at the storage level
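A path-pattern check of this kind could look roughly like the sketch below. It is not part of datacontract-cli: the helpers `pattern_to_regex` and `find_nonconforming_paths` are hypothetical, and the sketch assumes that `*` matches within a single path segment and that a `{name}` placeholder matches exactly one non-empty segment. The GTIN value in the usage note is made up for illustration.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a path pattern into a compiled regex (hypothetical helper).

    Assumptions: `*` matches any run of characters except `/`, and
    `{name}` matches one non-empty path segment, captured under that name.
    """
    parts = []
    # Split the pattern into placeholders, wildcards, and literal text.
    for token in re.split(r"(\{\w+\}|\*)", pattern):
        if token == "*":
            parts.append(r"[^/]*")
        elif token.startswith("{") and token.endswith("}"):
            parts.append(rf"(?P<{token[1:-1]}>[^/]+)")
        else:
            parts.append(re.escape(token))
    return re.compile("".join(parts))


def find_nonconforming_paths(pattern: str, paths: list[str]) -> list[str]:
    """Return the paths that do not fully match the pattern."""
    regex = pattern_to_regex(pattern)
    return [p for p in paths if regex.fullmatch(p) is None]
```

For example, with the pattern `api=media/content={model}/gtin=*/type=*/*.*`, a path such as `api=media/content=marketingContent/gtin=00012345678905/image.png` would be reported because the `type=` segment is missing.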
CLI Integration
Extend the datacontract-cli test command to:
- Execute sanity checks on blob storage organization
- Detect structural inconsistencies or corruption
- Validate that the storage layout complies with the defined ODCS contract
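As a rough illustration of what such a check might do for a single file, the sketch below compares one blob's properties (passed in as a plain dictionary) with a list of required property names and a set of allowed content types. The function `check_blob_properties` and its inputs are hypothetical; this is not the datacontract-cli API, and fetching the properties from storage is left out.

```python
def check_blob_properties(
    properties: dict[str, str],
    required: list[str],
    valid_content_types: set[str],
) -> list[str]:
    """Return human-readable violations for one blob (hypothetical helper)."""
    violations = []
    # Every required property must be present and non-empty.
    for name in required:
        if not properties.get(name):
            violations.append(f"missing required property: {name}")
    # Content-Type, when present, must be one of the allowed values.
    content_type = properties.get("Content-Type")
    if content_type and content_type not in valid_content_types:
        violations.append(f"invalid Content-Type: {content_type}")
    # Content-Length arrives as a string but must be a non-negative integer.
    length = properties.get("Content-Length")
    if length and not (length.isascii() and length.isdigit()):
        violations.append(f"Content-Length is not an integer: {length}")
    return violations
```

An empty list means the blob passed all checks; a test run could aggregate the violations per path and fail if any are found.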
Expected Outcome
This feature would allow teams to:
- Treat blob storage as a governed data product
- Validate storage organization externally, independent of the systems that write to it
- Prevent silent drift or corruption in file structures
- Apply ODCS principles beyond tabular data to unstructured storage ecosystems
version: 0.0.1
kind: DataContract
apiVersion: v3.0.1
id: gs1ca.internal.azure.image.yaml
name: GS1 Canada Image Collection (Datalake)
description:
  purpose: |
    This contract defines the structure of the GS1 Canada image collection stored in the datalake.
    It includes the different types of images such as marketing, pharmaceutical, and planograms, along with their properties and metadata.
    Based on the Azure Blob Storage Get Blob Properties REST API: https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob-properties?tabs=microsoft-entra-id
status: draft
servers:
  - server: datalake_dev
    type: azure
    format: binary
    location: abfss://web@myaccount.dfs.core.windows.net/api=media/content={model}/gtin=*/type=*/*.*
  - server: datalake_prod
    type: azure
    format: binary
    location: abfss://web@myaccount.dfs.core.windows.net/api=media/content={model}/gtin=*/type=*/*.*
schema:
  - name: image
    physicalType: file
    description: GS1 picture files
    properties:
      - name: product_id
        physicalType: string
        description: The GTIN of the product
  - name: ecommerceContent
    physicalType: file
    description: GS1 picture files
    properties:
      - name: Name
        logicalType: string
        physicalType: string
        description: The name of the file in the storage
        required: true
      - name: DateUploaded
        logicalType: timestamp
        physicalType: string
        description: The date the file was uploaded to the storage
        required: true
      - name: ETag
        logicalType: string
        physicalType: string
        description: The ETag of the file in the storage
        required: true
      - name: Last-Modified
        logicalType: timestamp
        physicalType: string
        description: The last modified date of the file in the storage
        required: true
      - name: Content-Length
        logicalType: integer
        physicalType: string
        description: The content length of the file in the storage
        required: true
        quality:
          - type: text
            description: The content length is returned as a string in the storage, but it represents an integer value. If PNG files are around 1 MB, the content length should be around 1000000.
      - name: Content-Type
        logicalType: string
        physicalType: string
        required: true
        description: The content type of the file in the storage
        quality:
          - metric: invalidValues
            arguments:
              validValues: ['image/png', 'image/jpeg', 'image/tiff']
            mustBe: 0
            description: The content type is determined based on the file extension in the storage. For example, if the file is a PNG image, the content type should be "image/png".
      - name: Content-MD5
        logicalType: string
        physicalType: string
        description: The content MD5 of the file in the storage
        required: true
      - name: Metadata
        logicalType: array
        physicalType: string
        description: The metadata attributes of the file in the storage
        items:
          logicalType: object
          properties:
            - name: Name
              logicalType: string
              physicalType: string
              description: The name of the metadata attribute
            - name: Value
              logicalType: string
              physicalType: string
              description: The value of the metadata attribute
      - name: Tags
        logicalType: array
        description: The tag attributes of the file in the storage
        items:
          logicalType: object
          properties:
            - name: Key
              logicalType: string
              physicalType: string
              description: The name of the tag
            - name: Value
              logicalType: string
              physicalType: string
              description: The value of the tag
      - name: Owner
        logicalType: string
        physicalType: string
        description: The owner of the file in the storage
      - name: Encrypted
        logicalType: boolean
        description: Whether the file is encrypted in the storage
      - name: expiry-time
        logicalType: timestamp
        description: The expiration time that's set on the blob. Returned only for files that have an expiration time set.
      - name: acl
        logicalType: string
        description: |
          The combined list of access and default access control lists that are set for user, group, and other on the file or directory.
          Each access control entry (ACE) consists of a scope, a type, a user or group identifier, and permissions in the format [scope]:[type]:[id]:[permissions].
          The default scope indicates that the ACE belongs to the default ACL for a directory; otherwise scope is implicit and the ACE belongs to the access ACL.
          Each individual permission is in [r,w,x,-]{3} format.
  - name: marketingContent
    physicalType: file
    description: GS1 marketing pictures
    properties:
      - name: Name
        logicalType: string
        physicalType: string
        description: The name of the file in the storage
        required: true
      - name: DateUploaded
        logicalType: timestamp
        physicalType: string
        description: The date the file was uploaded to the storage
        required: true
      - name: ETag
        logicalType: string
        physicalType: string
        description: The ETag of the file in the storage
        required: true
      - name: Last-Modified
        logicalType: timestamp
        physicalType: string
        description: The last modified date of the file in the storage
        required: true
      - name: Content-Length
        logicalType: integer
        physicalType: string
        description: The content length of the file in the storage
        required: true
        quality:
          - type: text
            description: The content length is returned as a string in the storage, but it represents an integer value. If PNG files are around 1 MB, the content length should be around 1000000.
      - name: Content-Type
        logicalType: string
        physicalType: string
        required: true
        description: The content type of the file in the storage
        quality:
          - metric: invalidValues
            arguments:
              validValues: ['image/png', 'image/jpeg', 'image/tiff']
            mustBe: 0
            description: The content type is determined based on the file extension in the storage. For example, if the file is a PNG image, the content type should be "image/png".
      - name: Content-MD5
        logicalType: string
        physicalType: string
        description: The content MD5 of the file in the storage
        required: true
      - name: Metadata
        logicalType: array
        physicalType: string
        description: The metadata attributes of the file in the storage
        items:
          logicalType: object
          properties:
            - name: Name
              logicalType: string
              physicalType: string
              description: The name of the metadata attribute
            - name: Value
              logicalType: string
              physicalType: string
              description: The value of the metadata attribute
      - name: Tags
        logicalType: array
        description: The tag attributes of the file in the storage
        items:
          logicalType: object
          properties:
            - name: Key
              logicalType: string
              physicalType: string
              description: The name of the tag
            - name: Value
              logicalType: string
              physicalType: string
              description: The value of the tag
      - name: Owner
        logicalType: string
        physicalType: string
        description: The owner of the file in the storage
      - name: Encrypted
        logicalType: boolean
        description: Whether the file is encrypted in the storage
      - name: expiry-time
        logicalType: timestamp
        description: The expiration time that's set on the blob. Returned only for files that have an expiration time set.
      - name: acl
        logicalType: string
        description: |
          The combined list of access and default access control lists that are set for user, group, and other on the file or directory.
          Each access control entry (ACE) consists of a scope, a type, a user or group identifier, and permissions in the format [scope]:[type]:[id]:[permissions].
          The default scope indicates that the ACE belongs to the default ACL for a directory; otherwise scope is implicit and the ACE belongs to the access ACL.
          Each individual permission is in [r,w,x,-]{3} format.
  - name: pharmaceuticalContent
    physicalType: file
    description: GS1 pharmaceutical pictures
    properties:
      - name: Name
        logicalType: string
        physicalType: string
        description: The name of the file in the storage
        required: true
      - name: DateUploaded
        logicalType: timestamp
        physicalType: string
        description: The date the file was uploaded to the storage
        required: true
      - name: ETag
        logicalType: string
        physicalType: string
        description: The ETag of the file in the storage
        required: true
      - name: Last-Modified
        logicalType: timestamp
        physicalType: string
        description: The last modified date of the file in the storage
        required: true
      - name: Content-Length
        logicalType: integer
        physicalType: string
        description: The content length of the file in the storage
        required: true
        quality:
          - type: text
            description: The content length is returned as a string in the storage, but it represents an integer value. If PNG files are around 1 MB, the content length should be around 1000000.
      - name: Content-Type
        logicalType: string
        physicalType: string
        required: true
        description: The content type of the file in the storage
        quality:
          - metric: invalidValues
            arguments:
              validValues: ['image/png', 'image/jpeg', 'image/tiff']
            mustBe: 0
            description: The content type is determined based on the file extension in the storage. For example, if the file is a PNG image, the content type should be "image/png".
      - name: Content-MD5
        logicalType: string
        physicalType: string
        description: The content MD5 of the file in the storage
        required: true
      - name: Metadata
        logicalType: array
        physicalType: string
        description: The metadata attributes of the file in the storage
        items:
          logicalType: object
          properties:
            - name: Name
              logicalType: string
              physicalType: string
              description: The name of the metadata attribute
            - name: Value
              logicalType: string
              physicalType: string
              description: The value of the metadata attribute
      - name: Tags
        logicalType: array
        description: The tag attributes of the file in the storage
        items:
          logicalType: object
          properties:
            - name: Key
              logicalType: string
              physicalType: string
              description: The name of the tag
            - name: Value
              logicalType: string
              physicalType: string
              description: The value of the tag
      - name: Owner
        logicalType: string
        physicalType: string
        description: The owner of the file in the storage
      - name: Encrypted
        logicalType: boolean
        description: Whether the file is encrypted in the storage
      - name: expiry-time
        logicalType: timestamp
        description: The expiration time that's set on the blob. Returned only for files that have an expiration time set.
      - name: acl
        logicalType: string
        description: |
          The combined list of access and default access control lists that are set for user, group, and other on the file or directory.
          Each access control entry (ACE) consists of a scope, a type, a user or group identifier, and permissions in the format [scope]:[type]:[id]:[permissions].
          The default scope indicates that the ACE belongs to the default ACL for a directory; otherwise scope is implicit and the ACE belongs to the access ACL.
          Each individual permission is in [r,w,x,-]{3} format.
  - name: planoContent
    physicalType: file
    description: GS1 planogram pictures
    properties:
      - name: Name
        logicalType: string
        physicalType: string
        description: The name of the file in the storage
        required: true
      - name: DateUploaded
        logicalType: timestamp
        physicalType: string
        description: The date the file was uploaded to the storage
        required: true
      - name: ETag
        logicalType: string
        physicalType: string
        description: The ETag of the file in the storage
        required: true
      - name: Last-Modified
        logicalType: timestamp
        physicalType: string
        description: The last modified date of the file in the storage
        required: true
      - name: Content-Length
        logicalType: integer
        physicalType: string
        description: The content length of the file in the storage
        required: true
        quality:
          - type: text
            description: The content length is returned as a string in the storage, but it represents an integer value. If PNG files are around 1 MB, the content length should be around 1000000.
      - name: Content-Type
        logicalType: string
        physicalType: string
        required: true
        description: The content type of the file in the storage
        quality:
          - metric: invalidValues
            arguments:
              validValues: ['image/png', 'image/jpeg', 'image/tiff']
            mustBe: 0
            description: The content type is determined based on the file extension in the storage. For example, if the file is a PNG image, the content type should be "image/png".
      - name: Content-MD5
        logicalType: string
        physicalType: string
        description: The content MD5 of the file in the storage
        required: true
      - name: Metadata
        logicalType: array
        physicalType: string
        description: The metadata attributes of the file in the storage
        items:
          logicalType: object
          properties:
            - name: Name
              logicalType: string
              physicalType: string
              description: The name of the metadata attribute
            - name: Value
              logicalType: string
              physicalType: string
              description: The value of the metadata attribute
      - name: Tags
        logicalType: array
        description: The tag attributes of the file in the storage
        items:
          logicalType: object
          properties:
            - name: Key
              logicalType: string
              physicalType: string
              description: The name of the tag
            - name: Value
              logicalType: string
              physicalType: string
              description: The value of the tag
      - name: Owner
        logicalType: string
        physicalType: string
        description: The owner of the file in the storage
      - name: Encrypted
        logicalType: boolean
        description: Whether the file is encrypted in the storage
      - name: expiry-time
        logicalType: timestamp
        description: The expiration time that's set on the blob. Returned only for files that have an expiration time set.
      - name: acl
        logicalType: string
        description: |
          The combined list of access and default access control lists that are set for user, group, and other on the file or directory.
          Each access control entry (ACE) consists of a scope, a type, a user or group identifier, and permissions in the format [scope]:[type]:[id]:[permissions].
          The default scope indicates that the ACE belongs to the default ACL for a directory; otherwise scope is implicit and the ACE belongs to the access ACL.
          Each individual permission is in [r,w,x,-]{3} format.
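
The `acl` format described in the contract can be checked mechanically. Below is a minimal sketch, assuming entries are comma-separated and that an entry with the `default` scope has four colon-separated fields while an access entry has three. `parse_acl` is a hypothetical helper, not an Azure SDK or datacontract-cli function.

```python
import re

# Permission triple as described in the contract: three characters from r, w, x, -.
_PERMISSIONS = re.compile(r"[rwx-]{3}")


def parse_acl(acl: str) -> list[dict[str, str]]:
    """Split an ACL string into entries, raising ValueError on malformed input."""
    entries = []
    for ace in acl.split(","):
        fields = ace.split(":")
        if len(fields) == 4 and fields[0] == "default":
            # Explicit scope: the entry belongs to a directory's default ACL.
            scope, ace_type, identifier, permissions = fields
        elif len(fields) == 3:
            # Implicit scope: the entry belongs to the access ACL.
            scope = "access"
            ace_type, identifier, permissions = fields
        else:
            raise ValueError(f"malformed ACL entry: {ace!r}")
        if not _PERMISSIONS.fullmatch(permissions):
            raise ValueError(f"invalid permissions in entry: {ace!r}")
        entries.append(
            {"scope": scope, "type": ace_type, "id": identifier, "permissions": permissions}
        )
    return entries
```

A contract check could call `parse_acl` on each file's `acl` value and report a violation whenever it raises.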