Bio-DID-Seq: Decentralized Identifiers for Biological Research Data

Bio-DID-Seq is a GDPR compliant Decentralized Identifier (DID) system designed for research data, integrating with Dataverse and powered by AI agents to showcase decentralized metadata management and user empowerment in cultural heritage and biological research data contexts.

Project Overview

Bio-DID-Seq provides a solution for managing research data identifiers through decentralized technology like IPFS, DIDs and UCAN. It addresses key challenges in the scientific data management field:

User Data Control: Empowering researchers with control over their metadata
GDPR Compliance: Built in privacy protections aligned with European regulations
Cost Efficiency: Reducing expenses for large scale identifier creation
Accessibility: Enhancing data discoverability and community engagement
Interoperability: Seamless integration with existing identifier systems

Key Features

W3C DID Compliance: Implementation of W3C Decentralized Identifier standards
UCAN Integration: Authorization using User Controlled Authorization Networks
IPFS Storage: Decentralized content-addressed storage for research data
BioAgents Integration: AI agents for enhanced biological data processing
Harvard Dataverse Integration: Compatibility with the Dataverse repository infrastructure

Architecture

Bio-DID-Seq implements a layered architecture combining several technologies:

Storage Layer: IPFS based decentralized storage for data persistence
Identity Layer: W3C DID implementation for decentralized identity management
Authorization Layer: UCAN for user controlled access management
Application Layer: Actix Web API for service endpoints
Integration Layer: Connectors for BioAgents and Dataverse

Cost Reduction Numbers for Bio-DID-Seq Identifier Creation

Decentralized Storage: Centralized cloud storage (e.g., AWS S3) for large scale research data can cost $0.023-$0.10 per GB/month for storage and $0.09-$0.12 per GB for data retrieval. Bio-DID-Seq uses IPFS for decentralized storage, reducing costs to near $0 for storage if hosted on community nodes, or ~$0.005-$0.01 per GB/month on paid pinning services (e.g., Pinata, Filebase, Publiish etc). This can lead to 70-90% savings for storing identifier metadata (e.g., DID documents, typically <1 KB each). Example: For 1 million DID documents (~1 GB total), IPFS storage costs ~$0.01/month vs. $23-$100/month on AWS S3.
Automation via AI (BioAgents): Manual creation of identifiers and metadata for research datasets can cost $20-$50 per hour for skilled labor, with an average of 1-2 minutes per identifier. For 1 million identifiers, this translates to ~16,667-33,000 hours or $333,340-$1,666,650 in labor costs. Bio-DID-Seq’s BioAgents automate metadata extraction and DID creation, reducing processing time to seconds per identifier. Assuming a server cost of $0.10-$1.50 per hour for AI processing, generating 1 million identifiers might cost ~$100-$500, a 99.9% cost reduction.
DID Creation Efficiency: Traditional Identifier Systems: Proprietary systems like DOIs (Digital Object Identifiers) charge $0.01-$1 per identifier for creation and maintenance (e.g., DataCite pricing). For 1 million identifiers, this costs $10,000-$300,000. W3C compliant DIDs in Bio-DID-Seq are generated using Rust based DID Manager, with negligible computational costs (~$0.001 per 1,000 DIDs on a standard server). For 1 million DIDs, total cost is ~$1-$10, a 99.9% reduction compared to DOI systems. Decentralized DIDs require no annual fees (unlike DOIs), saving $0.50-$5 per identifier annually.
UCAN Authorization: Bio-DID-Seq’s UCAN based authorization eliminates central server costs, relying on cryptographic tokens managed client side. Setup costs are minimal and operational costs are near $0, yielding 90-100% savings for access management. Publishing datasets to Dataverse manually involves labor costs of $18-$50/hour for metadata entry and validation, with 10-30 minutes per dataset. For 10,000 datasets, this costs $33,000-$100,000.
Bio-DID-Seq Automation: The Dataverse Adapter automates metadata synchronization and publishing, reducing time to seconds per dataset. Assuming API call costs of $0.001-$0.01 per dataset, publishing 10,000 datasets costs $10-$100, a 99% cost reduction.

Components

IPFS Cluster: Provides redundant, decentralized storage
DID Manager: Creates and manages W3C compliant DIDs
UCAN Auth: Handles authorization using capability-based security
BioAgents Connector: Integrates with AI powered biological data processing
Dataverse Adapter: Connects with Harvard Dataverse for data publishing

Prerequisites

Ensure you have the following installed:

Rust (1.84.1 or later)
Cargo (included with Rust)
Docker and Docker Compose
IPFS (if running outside Docker)

Getting Started

Installation

Clone the repository:

git clone https://github.com/publiish/bio-did-seq
cd bio-did-seq

Configure your environment:

cp .env.example .env
# Edit .env with your settings

Generate cryptographic keys:

cargo run -- generate-keys --output ./keys

Start the services using Docker Compose:

docker-compose up -d

Configuration

The system is configured through environment variables in the .env file:

DATABASE_URL=mysql://user:password@localhost:3306/bio_did_seq
IPFS_NODE=http://127.0.0.1:5001
BIND_ADDRESS=127.0.0.1:8081
MAX_CONCURRENT_UPLOADS=20
RUST_LOG=info
DILITHIUM_PUBLIC_KEY=path/to/dilithium5_public.key
DILITHIUM_PRIVATE_KEY=path/to/dilithium5_secret.key
BIOAGENTS_API_URL=http://localhost:3000
DATAVERSE_API_URL=https://dataverse.harvard.edu/api
DATAVERSE_API_KEY=your_api_key

API Documentation

Core Endpoints

All endpoints are prefixed with /api:

POST /api/signup - Register a new user
POST /api/signin - Authenticate a user and receive a token
POST /api/did/create - Create a new DID for research data
GET /api/did/{id} - Retrieve a DID document
PUT /api/did/{id} - Update a DID document (requires authorization)
POST /api/upload - Upload research data (requires authorization)
GET /api/download/{cid} - Download research data
POST /api/bioagent/process - Process data using BioAgents
POST /api/dataverse/publish - Publish data to Dataverse

BioAgents Integration

Bio DID-Seq integrates with BioAgents for AI powered analysis of biological data:

Automatic metadata extraction from research papers
Semantic linking between related datasets
Generation of knowledge graphs from unstructured data
Enhanced search capabilities across biological datasets

UCAN Authorization

The system uses UCAN (User Controlled Authorization Network) for decentralized authorization:

Users control who can access their data through capability based tokens
Fine grained permissions for specific operations
Delegated authorization without central authority
Cryptographic verification of access rights

Deployment Options

Local Development

cargo run

Production Deployment

cargo build --release
./target/release/bio-did-seq

Docker Deployment

docker-compose up -d --build

Integrating with Dataverse

Bio-DID-Seq integrates with Harvard Dataverse through its API:

Create DIDs that reference Dataverse datasets
Publish datasets to Dataverse while maintaining DID linkage
Synchronize metadata between DID documents and Dataverse entries

Contributing

We welcome contributions to Bio-DID-Seq! Please see our CONTRIBUTING.md file for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

W3C DID and Verifiable Credentials Working Groups
UCAN Working Group
IPFS and Filecoin Teams
Harvard Dataverse Project
BioAgents Contributors

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
src		src
.env.example		.env.example
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md
docker-compose.bak		docker-compose.bak
docker-compose.yml		docker-compose.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Bio-DID-Seq: Decentralized Identifiers for Biological Research Data

Project Overview

Key Features

Architecture

Cost Reduction Numbers for Bio-DID-Seq Identifier Creation

Components

Prerequisites

Getting Started

Installation

Configuration

API Documentation

Core Endpoints

BioAgents Integration

UCAN Authorization

Deployment Options

Local Development

Production Deployment

Docker Deployment

Integrating with Dataverse

Contributing

License

Acknowledgments

About

Uh oh!

Packages

Uh oh!

Languages

ekayaan/bio-did-seq

Folders and files

Latest commit

History

Repository files navigation

Bio-DID-Seq: Decentralized Identifiers for Biological Research Data

Project Overview

Key Features

Architecture

Cost Reduction Numbers for Bio-DID-Seq Identifier Creation

Components

Prerequisites

Getting Started

Installation

Configuration

API Documentation

Core Endpoints

BioAgents Integration

UCAN Authorization

Deployment Options

Local Development

Production Deployment

Docker Deployment

Integrating with Dataverse

Contributing

License

Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Packages 0

Uh oh!

Languages

Packages