ODCS test for images or binaries on top of local / cloud blob storage (ADLS Gen2, AWS S3, and GCP storage) #1227

@dmaresma

Feature Request: Support ODCS for Blob Storage Metadata (File-Level Governance)

Description

I would like to propose extending datacontract-cli to support the definition and validation of ODCS data contracts applied to blob storage structures, focusing on file metadata rather than file content.

Use Case

In scenarios where blob storage is used to host unstructured data (e.g., GS1 product images), the primary concern is not the content itself but:

  • The organization of files and paths (hierarchical structure)
  • The consistency and validity of file metadata
  • The enforcement of storage-level policies (naming conventions, required attributes, etc.)

Proposed Enhancement

Introduce support for defining data contracts that:

  • Validate blob hierarchy and folder structure
  • Enforce naming conventions and path patterns
  • Verify file metadata integrity and required attributes
  • Ensure compliance with governance rules at the storage level
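
A minimal sketch of the path-pattern check, assuming that `*` in a `location` pattern matches within a single path segment and that `{model}` is replaced by the schema object name. The helper names `layout_to_regex` and `path_matches` are hypothetical and are not part of datacontract-cli.

```python
import re

# Layout taken from the `location` of the example contract (path part only).
LAYOUT = "api=media/content={model}/gtin=*/type=*/*.*"


def layout_to_regex(layout: str, model: str) -> re.Pattern:
    """Compile a layout into a regex where '*' never crosses a '/' boundary."""
    concrete = layout.replace("{model}", model)
    # Escape the literal parts and join them with "one or more non-slash characters".
    parts = [re.escape(part) for part in concrete.split("*")]
    return re.compile("^" + "[^/]+".join(parts) + "$")


def path_matches(path: str, model: str) -> bool:
    """Return True if the blob path follows the contract layout for the given model."""
    return layout_to_regex(LAYOUT, model).match(path) is not None
```

With these assumptions, a blob stored in an extra nested folder, or a file without an extension, would be reported as a layout violation.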

CLI Integration

Extend the datacontract-cli test command to:

  • Execute sanity checks on blob storage organization
  • Detect structural inconsistencies or corruption
  • Validate that the storage layout complies with the defined ODCS contract
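
As an illustration of what such a sanity check could look like for a single blob, here is a hedged sketch. It assumes the blob properties have already been fetched into a plain dictionary of strings. The function name `check_blob_properties` is hypothetical; the required properties and valid content types are copied from the example contract in this issue.

```python
# Properties marked `required: true` in the example contract.
REQUIRED = (
    "Name", "DateUploaded", "ETag", "Last-Modified",
    "Content-Length", "Content-Type", "Content-MD5",
)
# `validValues` of the Content-Type quality rule in the example contract.
VALID_CONTENT_TYPES = {"image/png", "image/jpeg", "image/tiff"}


def check_blob_properties(props: dict[str, str]) -> list[str]:
    """Return a list of human-readable violations; an empty list means the blob passes."""
    violations = []
    for key in REQUIRED:
        if not props.get(key):
            violations.append(f"missing required property: {key}")
    content_type = props.get("Content-Type")
    if content_type and content_type not in VALID_CONTENT_TYPES:
        violations.append(f"invalid Content-Type: {content_type}")
    # The storage returns Content-Length as a string that must represent an integer.
    length = props.get("Content-Length")
    if length and not length.isdigit():
        violations.append(f"Content-Length is not an integer: {length}")
    return violations
```

A test run could aggregate these violations over all blobs matching the server location and fail if any are found.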

Expected Outcome

This feature would allow teams to:

  • Treat blob storage as a governed data product
  • Validate the storage organization externally, outside the systems that write the files
  • Prevent silent drift or corruption in file structures
  • Apply ODCS principles beyond tabular data to unstructured storage ecosystems

version: 0.0.1
kind: DataContract
apiVersion: v3.0.1
id: gs1ca.internal.azure.image.yaml
name: GS1 Canada Image Collection (Datalake)
description:
   purpose: |
     This contract defines the structure of the GS1 Canada image collection stored in the datalake.
     It covers the different types of images, such as marketing, pharmaceutical, and planogram images, along with their properties and metadata.
     The properties are based on the Azure Get Blob Properties REST API: https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob-properties?tabs=microsoft-entra-id
status: draft
servers:
- server: datalake_dev
  type: azure
  format: binary
  location: abfss://web@myaccount.dfs.core.windows.net/api=media/content={model}/gtin=*/type=*/*.*
- server: datalake_prod
  type: azure
  format: binary
  location: abfss://web@myaccount.dfs.core.windows.net/api=media/content={model}/gtin=*/type=*/*.*
schema:
 - name: image
   physicalType: file
   description: GS1 picture files
   properties:
     - name: product_id
       physicalType: string
       description: The GTIN of the product
 - name: ecommerceContent
   physicalType: file
   description: GS1 e-commerce pictures
   properties:
     - name: Name
       logicalType: string
       physicalType: string
       description: The name of the file in the storage
       required: true
     - name: DateUploaded
       logicalType: timestamp
       physicalType: string
       description: The date the file was uploaded to the storage
       required: true
     - name: ETag
       logicalType: string
       physicalType: string
       description: The ETag of the file in the storage
       required: true
     - name: Last-Modified
       logicalType: timestamp
       physicalType: string
       description: The last modified date of the file in the storage
       required: true
     - name: Content-Length
       logicalType: integer
       physicalType: string
       description: The content length of the file in the storage
       required: true
       quality:
         - type: text
           description: The storage returns the content length as a string, but it represents an integer value. If PNG files are around 1 MB, the content length should be around 1000000.
     - name: Content-Type
       logicalType: string
       physicalType: string
       required: true
       description: The content type of the file in the storage
       quality:
       - metric: invalidValues
         arguments:
           validValues: ['image/png', 'image/jpeg', 'image/tiff']
         mustBe: 0
         description: The content type is determined based on the file extension in the storage. For example, if the file is a PNG image, the content type should be "image/png".
     - name: Content-MD5
       logicalType: string
       physicalType: string
       description: The content MD5 of the file in the storage
       required: true
     - name: Metadata
       logicalType: array
       physicalType: string
       description: The metadata attributes of the file in the storage
       items:
         logicalType: object
         properties:
         - name: Name
           logicalType: string
           physicalType: string
           description: The name of the metadata
         - name: Value
           logicalType: string
           physicalType: string
           description: The value of the metadata
     - name: Tags
       logicalType: array
       description: The tags attributes of the file in the storage
       items:
         logicalType: object
         properties:
         - name: Key
           logicalType: string
           physicalType: string
           description: The name of the tag
         - name: Value
           logicalType: string
           physicalType: string
           description: The value of the tag
     - name: Owner
       logicalType: string
       physicalType: string
       description: The owner of the file in the storage
     - name: Encrypted
       logicalType: boolean
       description: Whether the file is encrypted in the storage
     - name: expiry-time
       logicalType: timestamp
       description: The expiration time that is set on the blob. Returned only for files that have an expiration time set.
     - name: acl
       logicalType: string
       description: |
         The combined list of access and default access control lists that are set for user, group, and other on the file or directory.
         Each access control entry (ACE) consists of a scope, a type, a user or group identifier, and permissions in the format [scope]:[type]:[id]:[permissions].
         The default scope indicates that the ACE belongs to the default ACL for a directory; otherwise scope is implicit and the ACE belongs to the access ACL.
         Each individual permission is in [r,w,x,-]{3} format.

 - name: marketingContent
   physicalType: file
   description: GS1 marketing pictures
   properties:
     - name: Name
       logicalType: string
       physicalType: string
       description: The name of the file in the storage
       required: true
     - name: DateUploaded
       logicalType: timestamp
       physicalType: string
       description: The date the file was uploaded to the storage
       required: true
     - name: ETag
       logicalType: string
       physicalType: string
       description: The ETag of the file in the storage
       required: true
     - name: Last-Modified
       logicalType: timestamp
       physicalType: string
       description: The last modified date of the file in the storage
       required: true
     - name: Content-Length
       logicalType: integer
       physicalType: string
       description: The content length of the file in the storage
       required: true
       quality:
         - type: text
           description: The storage returns the content length as a string, but it represents an integer value. If PNG files are around 1 MB, the content length should be around 1000000.
     - name: Content-Type
       logicalType: string
       physicalType: string
       required: true
       description: The content type of the file in the storage
       quality:
       - metric: invalidValues
         arguments:
           validValues: ['image/png', 'image/jpeg', 'image/tiff']
         mustBe: 0
         description: The content type is determined based on the file extension in the storage. For example, if the file is a PNG image, the content type should be "image/png".
     - name: Content-MD5
       logicalType: string
       physicalType: string
       description: The content MD5 of the file in the storage
       required: true
     - name: Metadata
       logicalType: array
       physicalType: string
       description: The metadata attributes of the file in the storage
       items:
         logicalType: object
         properties:
         - name: Name
           logicalType: string
           physicalType: string
           description: The name of the metadata
         - name: Value
           logicalType: string
           physicalType: string
           description: The value of the metadata
     - name: Tags
       logicalType: array
       description: The tags attributes of the file in the storage
       items:
         logicalType: object
         properties:
         - name: Key
           logicalType: string
           physicalType: string
           description: The name of the tag
         - name: Value
           logicalType: string
           physicalType: string
           description: The value of the tag
     - name: Owner
       logicalType: string
       physicalType: string
       description: The owner of the file in the storage
     - name: Encrypted
       logicalType: boolean
       description: Whether the file is encrypted in the storage
     - name: expiry-time
       logicalType: timestamp
       description: The expiration time that is set on the blob. Returned only for files that have an expiration time set.
     - name: acl
       logicalType: string
       description: |
         The combined list of access and default access control lists that are set for user, group, and other on the file or directory.
         Each access control entry (ACE) consists of a scope, a type, a user or group identifier, and permissions in the format [scope]:[type]:[id]:[permissions].
         The default scope indicates that the ACE belongs to the default ACL for a directory; otherwise scope is implicit and the ACE belongs to the access ACL.
         Each individual permission is in [r,w,x,-]{3} format.
 - name: pharmaceuticalContent
   physicalType: file
   description: GS1 pharmaceutical pictures
   properties:
     - name: Name
       logicalType: string
       physicalType: string
       description: The name of the file in the storage
       required: true
     - name: DateUploaded
       logicalType: timestamp
       physicalType: string
       description: The date the file was uploaded to the storage
       required: true
     - name: ETag
       logicalType: string
       physicalType: string
       description: The ETag of the file in the storage
       required: true
     - name: Last-Modified
       logicalType: timestamp
       physicalType: string
       description: The last modified date of the file in the storage
       required: true
     - name: Content-Length
       logicalType: integer
       physicalType: string
       description: The content length of the file in the storage
       required: true
       quality:
         - type: text
           description: The storage returns the content length as a string, but it represents an integer value. If PNG files are around 1 MB, the content length should be around 1000000.
     - name: Content-Type
       logicalType: string
       physicalType: string
       required: true
       description: The content type of the file in the storage
       quality:
       - metric: invalidValues
         arguments:
           validValues: ['image/png', 'image/jpeg', 'image/tiff']
         mustBe: 0
         description: The content type is determined based on the file extension in the storage. For example, if the file is a PNG image, the content type should be "image/png".
     - name: Content-MD5
       logicalType: string
       physicalType: string
       description: The content MD5 of the file in the storage
       required: true
     - name: Metadata
       logicalType: array
       physicalType: string
       description: The metadata attributes of the file in the storage
       items:
         logicalType: object
         properties:
         - name: Name
           logicalType: string
           physicalType: string
           description: The name of the metadata
         - name: Value
           logicalType: string
           physicalType: string
           description: The value of the metadata
     - name: Tags
       logicalType: array
       description: The tags attributes of the file in the storage
       items:
         logicalType: object
         properties:
         - name: Key
           logicalType: string
           physicalType: string
           description: The name of the tag
         - name: Value
           logicalType: string
           physicalType: string
           description: The value of the tag
     - name: Owner
       logicalType: string
       physicalType: string
       description: The owner of the file in the storage
     - name: Encrypted
       logicalType: boolean
       description: Whether the file is encrypted in the storage
     - name: expiry-time
       logicalType: timestamp
       description: The expiration time that is set on the blob. Returned only for files that have an expiration time set.
     - name: acl
       logicalType: string
       description: |
         The combined list of access and default access control lists that are set for user, group, and other on the file or directory.
         Each access control entry (ACE) consists of a scope, a type, a user or group identifier, and permissions in the format [scope]:[type]:[id]:[permissions].
         The default scope indicates that the ACE belongs to the default ACL for a directory; otherwise scope is implicit and the ACE belongs to the access ACL.
         Each individual permission is in [r,w,x,-]{3} format.
 - name: planoContent
   physicalType: file
   description: GS1 planogram pictures
   properties:
     - name: Name
       logicalType: string
       physicalType: string
       description: The name of the file in the storage
       required: true
     - name: DateUploaded
       logicalType: timestamp
       physicalType: string
       description: The date the file was uploaded to the storage
       required: true
     - name: ETag
       logicalType: string
       physicalType: string
       description: The ETag of the file in the storage
       required: true
     - name: Last-Modified
       logicalType: timestamp
       physicalType: string
       description: The last modified date of the file in the storage
       required: true
     - name: Content-Length
       logicalType: integer
       physicalType: string
       description: The content length of the file in the storage
       required: true
       quality:
         - type: text
           description: The storage returns the content length as a string, but it represents an integer value. If PNG files are around 1 MB, the content length should be around 1000000.
     - name: Content-Type
       logicalType: string
       physicalType: string
       required: true
       description: The content type of the file in the storage
       quality:
       - metric: invalidValues
         arguments:
           validValues: ['image/png', 'image/jpeg', 'image/tiff']
         mustBe: 0
         description: The content type is determined based on the file extension in the storage. For example, if the file is a PNG image, the content type should be "image/png".
     - name: Content-MD5
       logicalType: string
       physicalType: string
       description: The content MD5 of the file in the storage
       required: true
     - name: Metadata
       logicalType: array
       physicalType: string
       description: The metadata attributes of the file in the storage
       items:
         logicalType: object
         properties:
         - name: Name
           logicalType: string
           physicalType: string
           description: The name of the metadata
         - name: Value
           logicalType: string
           physicalType: string
           description: The value of the metadata
     - name: Tags
       logicalType: array
       description: The tags attributes of the file in the storage
       items:
         logicalType: object
         properties:
         - name: Key
           logicalType: string
           physicalType: string
           description: The name of the tag
         - name: Value
           logicalType: string
           physicalType: string
           description: The value of the tag
     - name: Owner
       logicalType: string
       physicalType: string
       description: The owner of the file in the storage
     - name: Encrypted
       logicalType: boolean
       description: Whether the file is encrypted in the storage
     - name: expiry-time
       logicalType: timestamp
       description: The expiration time that is set on the blob. Returned only for files that have an expiration time set.
     - name: acl
       logicalType: string
       description: |
         The combined list of access and default access control lists that are set for user, group, and other on the file or directory.
         Each access control entry (ACE) consists of a scope, a type, a user or group identifier, and permissions in the format [scope]:[type]:[id]:[permissions].
         The default scope indicates that the ACE belongs to the default ACL for a directory; otherwise scope is implicit and the ACE belongs to the access ACL.
         Each individual permission is in [r,w,x,-]{3} format.
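
The `acl` format described in the contract could be checked with a small parser. This is a sketch under stated assumptions: entries are comma-separated, the scope prefix is either `default` or absent, the identifier may be empty, and the entry types are limited to `user`, `group`, `mask`, and `other`. It also assumes permissions are positional (`r`, `w`, `x`, each replaceable by `-`), which is stricter than the `[r,w,x,-]{3}` wording above. The names `ACE` and `parse_acl` are hypothetical.

```python
import re

# [scope]:[type]:[id]:[permissions], where the "default:" scope prefix is optional.
ACE = re.compile(r"^(?:(default):)?(user|group|mask|other):([^:,]*):([r-][w-][x-])$")


def parse_acl(acl: str) -> list[dict[str, str]]:
    """Split an ACL string into entries and raise ValueError on a malformed entry."""
    entries = []
    for raw in acl.split(","):
        match = ACE.match(raw.strip())
        if match is None:
            raise ValueError(f"malformed ACL entry: {raw!r}")
        scope, kind, ident, perms = match.groups()
        entries.append({
            # An absent scope means the entry belongs to the access ACL.
            "scope": scope or "access",
            "type": kind,
            "id": ident,
            "permissions": perms,
        })
    return entries
```

A contract test could call such a parser on each blob's `acl` value and report any entry that fails to parse.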
