Added Huggingface support #13
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,17 +1,15 @@ | ||
| name = "HealthSampleData" | ||
| uuid = "85295614-c7c2-47eb-9e31-7664d7fbe6db" | ||
| authors = ["TheCedarPrince <[email protected]>, ParamThakkar123 <[email protected]> and contributors"] | ||
| version = "0.0.1" | ||
| authors = ["TheCedarPrince <[email protected]>, ParamThakkar123 <[email protected]> and contributors"] | ||
|
|
||
| [deps] | ||
| DataDeps = "124859b0-ceae-595e-8997-d05f6a7a8dfe" | ||
| Downloads = "f43a241f-c20a-4ad4-852c-f6b1247861c6" | ||
| HuggingFaceHub = "d0076355-e2c0-48e6-a044-05906e51b7fc" | ||
| Logging = "56ddb016-857b-54e1-b83d-db4d58db5568" | ||
|
|
||
| [compat] | ||
| DataDeps = "0.7" | ||
| Downloads = "1.6.0" | ||
| julia = "1.10" | ||
|
|
||
| [extras] | ||
| Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40" | ||
|
|
||
| [targets] | ||
| test = ["Test"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| # Quick Start Guide | ||
|
|
||
| This guide walks through a complete example workflow for `HealthSampleData.jl`. | ||
|
|
||
| ## Installation | ||
|
|
||
| To install `HealthSampleData.jl`, type the following snippet into the Julia REPL: | ||
|
|
||
| ```julia | ||
| using Pkg | ||
| Pkg.add("HealthSampleData") | ||
| ``` | ||
|
|
||
| ## Download a Dataset | ||
|
|
||
| We'll download a small dataset: | ||
|
|
||
| ```julia | ||
| import HealthSampleData: Test | ||
|
|
||
| Test() | ||
| ``` | ||
|
|
||
| You should see something like the following: | ||
|
|
||
| ```text | ||
| This program has requested access to the data dependency Test. | ||
| which is not currently installed. It can be installed automatically, and you will not see this message again. | ||
|
|
||
| The Palmer Penguins test dataset for HealthSampleData.jl. To cite: | ||
|
|
||
| Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer | ||
| Archipelago (Antarctica) penguin data. R package version 0.1.0. | ||
| https://allisonhorst.github.io/palmerpenguins/. doi: | ||
| 10.5281/zenodo.3960218. | ||
|
|
||
| Do you want to download the dataset from https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets/penguins.csv to "C:\Users\You\.julia\scratchspaces\[UUID]\datadeps\Test"? | ||
| [y/n] | ||
| ``` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| # Supported Datasets | ||
|
|
||
| `HealthSampleData.jl` provides datasets from several sources. | ||
| Here is an overview of available datasets grouped loosely by domain: | ||
|
|
||
| ## Patient Medical Records | ||
|
|
||
| ```@docs | ||
| Eunomia | ||
| Synthea | ||
| ``` | ||
|
|
||
| ## Miscellaneous Datasets | ||
|
|
||
| ```@docs | ||
| Test | ||
| ``` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,7 +1,11 @@ | ||
| module HealthSampleData | ||
|
|
||
| using DataDeps | ||
|
|
||
| using HuggingFaceHub | ||
| using Logging | ||
|
|
||
| include("huggingface.jl") | ||
| include("OMOP_Common_Data_Model/data.jl") | ||
| include("HuggingFaceDatasets/data.jl") | ||
|
|
||
| end |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,63 @@ | ||
| function Synthea() | ||
| localpath = HealthSampleData._huggingface_dataset_register("Synthea", "JuliaHealthOrg/JuliaHealthDatasets", "synthea_1M_3YR.duckdb") | ||
| register(DataDep( | ||
| "Synthea", | ||
| "1 million patients each with 3 year retrospective medical histories generated using the Synthea data generator (https://synthea.mitre.org). DuckDB database following the OMOP Common Data Model layout.", | ||
| "https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets/blob/main/synthea_1M_3YR.duckdb"; | ||
| fetch_method = (remotepath, localdir) -> begin | ||
| return localpath | ||
| end | ||
| )) | ||
|
|
||
| datadep"Synthea" | ||
|
|
||
| @info "Synthea data source is downloaded!" | ||
|
|
||
| return "Synthea/synthea_1M_3YR.duckdb" | ||
| end | ||
|
|
||
|
|
||
| function Test() | ||
| localpath = HealthSampleData._huggingface_dataset_register("Test", "JuliaHealthOrg/JuliaHealthDatasets", "penguins.csv") | ||
| register(DataDep( | ||
| "Test", | ||
| """ | ||
| The Palmer Penguins test dataset for HealthSampleData.jl. To cite: | ||
|
|
||
| Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer | ||
| Archipelago (Antarctica) penguin data. R package version 0.1.0. | ||
| https://allisonhorst.github.io/palmerpenguins/. doi: | ||
| 10.5281/zenodo.3960218. | ||
|
|
||
| """, | ||
| "https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets/penguins.csv"; | ||
| fetch_method = (remotepath, localdir) -> begin | ||
| return localpath | ||
| end | ||
| )) | ||
|
|
||
| datadep"Test" | ||
|
|
||
| @info "Test data source is downloaded!" | ||
|
|
||
| return "Test/penguins.csv" | ||
| end | ||
|
|
||
| """ | ||
| download_hf_dataset(name::String) | ||
|
|
||
| Registers a dataset from HuggingFace as a DataDep and returns the local path. | ||
| """ | ||
| function download_hf_dataset(name::String) | ||
| if name == "Synthea" | ||
| @info "Downloading Synthea dataset as DataDep..." | ||
| return Synthea() | ||
| elseif name == "Test" | ||
| @info "Downloading Test dataset as DataDep..." | ||
| return Test() | ||
| else | ||
| error("Dataset registration for $name is not implemented.") | ||
| end | ||
| end | ||
|
|
||
| export download_hf_dataset |
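
The `if`/`elseif` chain in `download_hf_dataset` grows by one branch per dataset. One possible alternative, sketched here and not part of this PR, is a lookup table from dataset name to loader function. The names `DATASET_LOADERS`, `download_by_name`, `synthea_stub`, and `test_stub` are hypothetical; the stubs stand in for the PR's `Synthea()` and `Test()` so the sketch runs without any network access.

```julia
# Hypothetical stand-ins for the PR's Synthea() and Test(); they only return
# the same strings those functions return, without downloading anything.
synthea_stub() = "Synthea/synthea_1M_3YR.duckdb"
test_stub() = "Test/penguins.csv"

# Lookup table from dataset name to the function that loads it (assumed name).
DATASET_LOADERS = Dict{String,Function}(
    "Synthea" => synthea_stub,
    "Test" => test_stub,
)

# Dispatch on the dataset name; unknown names raise an ErrorException,
# matching the behavior of the else branch in download_hf_dataset.
function download_by_name(name::AbstractString)
    loader = get(DATASET_LOADERS, name, nothing)
    loader === nothing && error("Dataset registration for $name is not implemented.")
    return loader()
end
```

With this layout, supporting a new dataset would mean adding one entry to the table rather than another branch.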
ParamThakkar123 marked this conversation as resolved.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,53 @@ | ||
| using Downloads | ||
|
|
||
| const HF = HuggingFaceHub | ||
|
|
||
| """ | ||
| _huggingface_dataset_register(name::String, repo::String, filename::String) | ||
|
|
||
| Resolve dataset metadata from Hugging Face, download `filename` via HuggingFaceHub, | ||
| and return the local filesystem path to the downloaded file. | ||
| """ | ||
| function _huggingface_dataset_register(name::String, repo::String, filename::String) | ||
|
|
||
| @info "Resolving Hugging Face metadata for $repo" | ||
|
|
||
| # Try fetching dataset info safely | ||
| dataset = HF.info(HF.Dataset, repo) | ||
|
|
||
| @info "Downloading $filename from $repo via HuggingFaceHub..." | ||
| try | ||
| # Prefer official HuggingFaceHub download if dataset info is available | ||
| if dataset !== nothing | ||
| localpath = HF.file_download(dataset, filename) | ||
| else | ||
| # Direct fallback if HF.info failed | ||
| url = "https://huggingface.co/datasets/$repo/resolve/main/$filename" | ||
| tmpdir = mktempdir() | ||
| dest = joinpath(tmpdir, filename) | ||
| @info "Downloading $url -> $dest" | ||
| Downloads.download(url, dest) | ||
| localpath = dest | ||
| end | ||
| @info "Downloaded to $localpath" | ||
| return localpath | ||
|
|
||
| catch e | ||
| msg = string(e) | ||
| if occursin("symlink", msg) || occursin("creating symlinks", msg) || | ||
| occursin("Administrator", msg) || occursin("operation not permitted", msg) | ||
|
|
||
| @warn "Symlink creation failed (likely Windows privilege issue). Falling back to direct HTTP download: $e" | ||
| url = "https://huggingface.co/datasets/$repo/resolve/main/$filename" | ||
| tmpdir = mktempdir() | ||
| dest = joinpath(tmpdir, filename) | ||
| @info "Downloading $url -> $dest (no symlink)" | ||
| Downloads.download(url, dest) | ||
| localpath = dest | ||
| @info "Fallback download complete: $localpath" | ||
| return localpath | ||
| else | ||
| rethrow(e) | ||
| end | ||
| end | ||
| end |
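
Both fallback branches in `_huggingface_dataset_register` build a direct download URL from `repo` and `filename`. Hugging Face serves raw dataset files under `https://huggingface.co/datasets/<repo>/resolve/<revision>/<filename>`, so the URL needs the host and the `datasets/` prefix in addition to the repository id. A minimal sketch of a helper that builds it in one place follows; the function name `hf_dataset_resolve_url` and the `revision` keyword are assumptions for illustration, not part of this PR or of HuggingFaceHub.jl.

```julia
# Hypothetical helper: build the direct-download URL for a file in a
# Hugging Face dataset repository. `revision` defaults to the main branch.
function hf_dataset_resolve_url(repo::AbstractString, filename::AbstractString;
                                revision::AbstractString = "main")
    return "https://huggingface.co/datasets/$repo/resolve/$revision/$filename"
end

url = hf_dataset_resolve_url("JuliaHealthOrg/JuliaHealthDatasets", "penguins.csv")
```

The resulting string could then be passed to `Downloads.download(url, dest)` in either fallback branch, which would also remove the duplicated URL construction.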
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| # This file is machine-generated - editing it directly is not advised | ||
|
|
||
| julia_version = "1.11.7" | ||
| manifest_format = "2.0" | ||
| project_hash = "71d91126b5a1fb1020e1098d9d492de2a4438fd2" | ||
|
|
||
| [[deps.Base64]] | ||
| uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f" | ||
| version = "1.11.0" | ||
|
|
||
| [[deps.InteractiveUtils]] | ||
| deps = ["Markdown"] | ||
| uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240" | ||
| version = "1.11.0" | ||
|
|
||
| [[deps.Logging]] | ||
| uuid = "56ddb016-857b-54e1-b83d-db4d58db5568" | ||
| version = "1.11.0" | ||
|
|
||
| [[deps.Markdown]] | ||
| deps = ["Base64"] | ||
| uuid = "d6f4376e-aef5-505a-96c1-9c027394607a" | ||
| version = "1.11.0" | ||
|
|
||
| [[deps.Random]] | ||
| deps = ["SHA"] | ||
| uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c" | ||
| version = "1.11.0" | ||
|
|
||
| [[deps.SHA]] | ||
| uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce" | ||
| version = "0.7.0" | ||
|
|
||
| [[deps.Serialization]] | ||
| uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b" | ||
| version = "1.11.0" | ||
|
|
||
| [[deps.Test]] | ||
| deps = ["InteractiveUtils", "Logging", "Random", "Serialization"] | ||
| uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40" | ||
| version = "1.11.0" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| [deps] | ||
| Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,5 +2,15 @@ using HealthSampleData | |
| using Test | ||
|
|
||
| @testset "HealthSampleData.jl" begin | ||
|
Member
Add tests for downloading; you could upload a very small file to HF JuliaHealthDatasets and use that for testing.
Member
Hey @ParamThakkar123, you didn't add a test for downloading the test file. Could you do that? Additionally, please note that you can set DataDeps to always be downloaded in CI; read here: https://www.oxinabox.net/DataDeps.jl/stable/z10-for-end-users/#Configuration-1
||
| # Write your tests here. | ||
| @test isa(HealthSampleData.Test, Function) | ||
| @test hasmethod(HealthSampleData.Test, Tuple{}) | ||
|
|
||
| @test isa(HealthSampleData.download_hf_dataset, Function) | ||
|
|
||
| @test_throws ErrorException HealthSampleData.download_hf_dataset("NonExistentDataset12345") | ||
| @testset "HuggingFaceDatasets - Test dataset download" begin | ||
| path = HealthSampleData.download_hf_dataset("Test") | ||
| @test isa(path, String) | ||
| @test path == "Test/penguins.csv" | ||
| end | ||
| end | ||
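
The review comment above points to the DataDeps.jl configuration for non-interactive downloads. DataDeps.jl reads the `DATADEPS_ALWAYS_ACCEPT` environment variable and skips the `[y/n]` prompt when it is set to `true`. A minimal sketch of enabling it only on CI, for example at the top of `test/runtests.jl`, is shown below. The helper name `accept_datadeps_in_ci!` is hypothetical, and checking the `CI` variable assumes a CI service that sets it to `true`, as GitHub Actions does.

```julia
# Hypothetical helper: when running on CI, tell DataDeps.jl to accept
# downloads without prompting. `env` defaults to the process environment
# and can be replaced with a Dict to try the logic in isolation.
function accept_datadeps_in_ci!(env = ENV)
    if get(env, "CI", "false") == "true"
        env["DATADEPS_ALWAYS_ACCEPT"] = "true"
    end
    return env
end

# Call this before any data dependency is resolved.
accept_datadeps_in_ci!()
```

The variable has to be set before the first data dependency is resolved, so the call belongs ahead of the download testset.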