8 changes: 8 additions & 0 deletions .github/workflows/CI.yml
@@ -13,6 +13,10 @@ concurrency:
  cancel-in-progress: ${{ startsWith(github.ref, 'refs/pull/') }}
jobs:
  test:
    env:
      DATADEPS_ALWAYS_ACCEPT: "true"
      DATADEPS_PROGRESS_UPDATE_PERIOD: "Inf"
      DATADEPS_DISABLE_DOWNLOAD: "false"
    name: Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }}
    runs-on: ${{ matrix.os }}
    timeout-minutes: 60
@@ -46,6 +50,10 @@ jobs:
          fail_ci_if_error: false
  docs:
    name: Documentation
    env:
      DATADEPS_ALWAYS_ACCEPT: "true"
      DATADEPS_PROGRESS_UPDATE_PERIOD: "Inf"
      DATADEPS_DISABLE_DOWNLOAD: "false"
    runs-on: ubuntu-latest
    permissions:
      actions: write # needed to allow julia-actions/cache to proactively delete old caches that it has created
12 changes: 5 additions & 7 deletions Project.toml
@@ -1,17 +1,15 @@
name = "HealthSampleData"
uuid = "85295614-c7c2-47eb-9e31-7664d7fbe6db"
authors = ["TheCedarPrince <[email protected]>, ParamThakkar123 <[email protected]> and contributors"]
version = "0.0.1"
authors = ["TheCedarPrince <[email protected]>, ParamThakkar123 <[email protected]> and contributors"]

[deps]
DataDeps = "124859b0-ceae-595e-8997-d05f6a7a8dfe"
Downloads = "f43a241f-c20a-4ad4-852c-f6b1247861c6"
HuggingFaceHub = "d0076355-e2c0-48e6-a044-05906e51b7fc"
Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"

[compat]
DataDeps = "0.7"
Downloads = "1.6.0"
julia = "1.10"

[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test"]
31 changes: 25 additions & 6 deletions docs/src/index.md
@@ -4,11 +4,30 @@ CurrentModule = HealthSampleData

# HealthSampleData

Documentation for [HealthSampleData](https://github.com/TheCedarPrince/HealthSampleData.jl).
> To provide consistent data sets for teaching and learning across JuliaHealth.

```@index
```
Welcome to `HealthSampleData.jl`!

```@autodocs
Modules = [HealthSampleData]
```
This package curates and provisions datasets useful in health informatics, public health, medical imaging, and machine learning research.
It aims to provide learning resources that are consistently available across JuliaHealth.

## Dataset Overview

`HealthSampleData.jl` uses `DataDeps.jl` to download data sources from a variety of locations.
Each dataset provides:

1. A short description
2. Relevant links or resources
3. Its file type (e.g. CSV or SQLite)
4. A quickstart guide
5. Where it is being downloaded from

> **NOTE:** For more information about datasets and data sources, please refer to [Supported Datasets](./supported_datasets).

## Installation

To install `HealthSampleData.jl`, run the following snippet in the Julia REPL:

```julia
using Pkg
Pkg.add("HealthSampleData")
```
39 changes: 39 additions & 0 deletions docs/src/quick_start.md
@@ -0,0 +1,39 @@
# Quick Start Guide

This guide walks through a complete example workflow with `HealthSampleData.jl`.

## Installation

To install `HealthSampleData.jl`, run the following snippet in the Julia REPL:

```julia
using Pkg
Pkg.add("HealthSampleData")
```

## Download a Dataset

We'll download a small dataset:

```julia
import HealthSampleData: Test

Test()
```

You should see something like the following:

```text
This program has requested access to the data dependency Test.
which is not currently installed. It can be installed automatically, and you will not see this message again.

The Palmer Penguins test dataset for HealthSampleData.jl. To cite:

Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer
Archipelago (Antarctica) penguin data. R package version 0.1.0.
https://allisonhorst.github.io/palmerpenguins/. doi:
10.5281/zenodo.3960218.

Do you want to download the dataset from https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets/penguins.csv to "C:\Users\You\.julia\scratchspaces\[UUID]\datadeps\Test"?
[y/n]
```
17 changes: 17 additions & 0 deletions docs/src/supported_datasets.md
@@ -0,0 +1,17 @@
# Supported Datasets

`HealthSampleData.jl` supports datasets from several sources.
Here is an overview of the available datasets, grouped loosely by domain:

## Patient Medical Records

```@docs
Eunomia
Synthea
```

## Miscellaneous Data Sets

```@docs
Test
```
6 changes: 5 additions & 1 deletion src/HealthSampleData.jl
@@ -1,7 +1,11 @@
module HealthSampleData

using DataDeps

using HuggingFaceHub
using Logging

include("huggingface.jl")
include("OMOP_Common_Data_Model/data.jl")
include("HuggingFaceDatasets/data.jl")

end
63 changes: 63 additions & 0 deletions src/HuggingFaceDatasets/data.jl
@@ -0,0 +1,63 @@
"""
    Synthea()

Register the Synthea sample dataset as a DataDep and trigger its download.
Returns the string `"Synthea/synthea_1M_3YR.duckdb"`.
"""
function Synthea()
    # Download through HuggingFaceHub first; the DataDep below reuses this local copy.
    localpath = HealthSampleData._huggingface_dataset_register("Synthea", "JuliaHealthOrg/JuliaHealthDatasets", "synthea_1M_3YR.duckdb")
    register(DataDep(
        "Synthea",
        "1 million patients each with 3 year retrospective medical histories generated using the Synthea data generator (https://synthea.mitre.org). DuckDB database following the OMOP Common Data Model layout.",
        "https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets/blob/main/synthea_1M_3YR.duckdb";
        fetch_method = (remotepath, localdir) -> begin
            return localpath
        end
    ))

    datadep"Synthea"

    @info "Synthea data source is downloaded!"

    return "Synthea/synthea_1M_3YR.duckdb"
end


"""
    Test()

Register the Palmer Penguins test dataset as a DataDep and trigger its download.
Returns the string `"Test/penguins.csv"`.
"""
function Test()
    # Download through HuggingFaceHub first; the DataDep below reuses this local copy.
    localpath = HealthSampleData._huggingface_dataset_register("Test", "JuliaHealthOrg/JuliaHealthDatasets", "penguins.csv")
    register(DataDep(
        "Test",
        """
        The Palmer Penguins test dataset for HealthSampleData.jl. To cite:

        Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer
        Archipelago (Antarctica) penguin data. R package version 0.1.0.
        https://allisonhorst.github.io/palmerpenguins/. doi:
        10.5281/zenodo.3960218.

        """,
        "https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets/penguins.csv";
        fetch_method = (remotepath, localdir) -> begin
            return localpath
        end
    ))

    datadep"Test"

    @info "Test data source is downloaded!"

    return "Test/penguins.csv"
end

"""
    download_hf_dataset(name::String)

Registers a dataset from HuggingFace as a DataDep and returns the local path.
Throws an error if `name` is not one of the supported datasets ("Synthea" or "Test").
"""
function download_hf_dataset(name::String)
    if name == "Synthea"
        @info "Downloading Synthea dataset as DataDep..."
        return Synthea()
    elseif name == "Test"
        @info "Downloading Test dataset as DataDep..."
        return Test()
    else
        error("Dataset registration for $name is not implemented.")
    end
end

export download_hf_dataset
53 changes: 53 additions & 0 deletions src/huggingface.jl
@@ -0,0 +1,53 @@
using Downloads

const HF = HuggingFaceHub

"""
    _huggingface_dataset_register(name::String, repo::String, filename::String)

Resolve dataset metadata from Hugging Face, download `filename` via HuggingFaceHub,
and return the local filesystem path to the downloaded file.
"""
function _huggingface_dataset_register(name::String, repo::String, filename::String)

    @info "Resolving Hugging Face metadata for $repo"

    # Try fetching dataset info safely: a failure here yields `nothing`
    # so that the direct-download fallback below is actually reachable.
    dataset = try
        HF.info(HF.Dataset, repo)
    catch e
        @warn "Could not resolve Hugging Face metadata for $repo" exception = e
        nothing
    end

    @info "Downloading $filename from $repo via HuggingFaceHub..."
    try
        # Prefer official HuggingFaceHub download if dataset info is available
        if dataset !== nothing
            localpath = HF.file_download(dataset, filename)
        else
            # Direct fallback if HF.info failed; the URL needs the full host and dataset prefix
            url = "https://huggingface.co/datasets/$repo/resolve/main/$filename"
            tmpdir = mktempdir()
            dest = joinpath(tmpdir, filename)
            @info "Downloading $url -> $dest"
            Downloads.download(url, dest)
            localpath = dest
        end
        @info "Downloaded to $localpath"
        return localpath

    catch e
        msg = string(e)
        if occursin("symlink", msg) || occursin("creating symlinks", msg) ||
           occursin("Administrator", msg) || occursin("operation not permitted", msg)

            @warn "Symlink creation failed (likely Windows privilege issue). Falling back to direct HTTP download: $e"
            url = "https://huggingface.co/datasets/$repo/resolve/main/$filename"
            tmpdir = mktempdir()
            dest = joinpath(tmpdir, filename)
            @info "Downloading $url -> $dest (no symlink)"
            Downloads.download(url, dest)
            localpath = dest
            @info "Fallback download complete: $localpath"
            return localpath
        else
            rethrow(e)
        end
    end
end
41 changes: 41 additions & 0 deletions test/Manifest.toml
@@ -0,0 +1,41 @@
# This file is machine-generated - editing it directly is not advised

julia_version = "1.11.7"
manifest_format = "2.0"
project_hash = "71d91126b5a1fb1020e1098d9d492de2a4438fd2"

[[deps.Base64]]
uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"
version = "1.11.0"

[[deps.InteractiveUtils]]
deps = ["Markdown"]
uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240"
version = "1.11.0"

[[deps.Logging]]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"
version = "1.11.0"

[[deps.Markdown]]
deps = ["Base64"]
uuid = "d6f4376e-aef5-505a-96c1-9c027394607a"
version = "1.11.0"

[[deps.Random]]
deps = ["SHA"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
version = "1.11.0"

[[deps.SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"
version = "0.7.0"

[[deps.Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"
version = "1.11.0"

[[deps.Test]]
deps = ["InteractiveUtils", "Logging", "Random", "Serialization"]
uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
version = "1.11.0"
2 changes: 2 additions & 0 deletions test/Project.toml
@@ -0,0 +1,2 @@
[deps]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
12 changes: 11 additions & 1 deletion test/runtests.jl
@@ -2,5 +2,15 @@ using HealthSampleData
using Test

@testset "HealthSampleData.jl" begin
Review comment (Member):

Add tests for downloading; you could upload a very small file to HF JuliaHealthDatasets and use that for testing.

Review comment (Member):

Hey @ParamThakkar123, you didn't add a test for downloading the test file. Could you do that? Additionally, please note that you can set DataDeps to always accept downloads in CI; read here: https://www.oxinabox.net/DataDeps.jl/stable/z10-for-end-users/#Configuration-1

    # Write your tests here.
    @test isa(HealthSampleData.Test, Function)
    @test hasmethod(HealthSampleData.Test, Tuple{})

    @test isa(HealthSampleData.download_hf_dataset, Function)

    @test_throws ErrorException HealthSampleData.download_hf_dataset("NonExistentDataset12345")
    @testset "HuggingFaceDatasets - Test dataset download" begin
        path = HealthSampleData.download_hf_dataset("Test")
        @test isa(path, String)
        @test path == "Test/penguins.csv"
    end
end
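
For runs outside CI, the non-interactive behavior that the workflow's `env` block configures could be reproduced locally. The following is a minimal sketch, assuming DataDeps.jl reads `DATADEPS_ALWAYS_ACCEPT` from the environment at download time, as its configuration documentation describes. `with_datadeps_accept` is a hypothetical helper name and is not part of this PR.

```julia
# Hypothetical helper (not part of HealthSampleData.jl): run `f` with
# DATADEPS_ALWAYS_ACCEPT temporarily set, then restore the previous environment.
function with_datadeps_accept(f)
    withenv("DATADEPS_ALWAYS_ACCEPT" => "true") do
        f()
    end
end

# Inside the block the variable is visible to any code that reads ENV,
# which is assumed to include the DataDeps download prompt.
accepted = with_datadeps_accept() do
    get(ENV, "DATADEPS_ALWAYS_ACCEPT", "false")
end
```

In `test/runtests.jl`, the download testset could be wrapped in `with_datadeps_accept() do ... end` so that the `[y/n]` prompt never blocks a local test run.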