Skip to content

Conversation

@ParamThakkar123
Copy link
Collaborator

@TheCedarPrince

This PR Fixes #12

Added a _huggingface_dataset_register to register datasets from huggingface hub using DataDeps.jl and implemented a load function to load the dataset using HealthSampleData.jl module for the users

Copy link
Member

@TheCedarPrince TheCedarPrince left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @ParamThakkar123 for your work! I left a bunch of comments based on our call; apologies for the delay in reviewing with being out sick! Very excited here and happy to review multiple times as needed. Just let me know via ping again!

I think we also will need to implement #14 first before continuing with this PR however.

@TheCedarPrince
Copy link
Member

Hey @ParamThakkar123 , I think you are unblocked on this PR now! Could you rebase and address the comments I left here? Thanks and keep up the good work! :D

@ParamThakkar123
Copy link
Collaborator Author

Hey @ParamThakkar123 , I think you are unblocked on this PR now! Could you rebase and address the comments I left here? Thanks and keep up the good work! :D

Yes @TheCedarPrince I will start working on this now!!

@codecov
Copy link

codecov bot commented Oct 30, 2025

Codecov Report

❌ Patch coverage is 6.55738% with 57 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@921e0ee). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/huggingface.jl 0.00% 29 Missing ⚠️
src/HuggingFaceDatasets/data.jl 20.00% 16 Missing ⚠️
src/utilities.jl 0.00% 12 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##             main     #13   +/-   ##
======================================
  Coverage        ?   7.46%           
======================================
  Files           ?       5           
  Lines           ?      67           
  Branches        ?       0           
======================================
  Hits            ?       5           
  Misses          ?      62           
  Partials        ?       0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@TheCedarPrince TheCedarPrince left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Param, very very close here! Some final comments; thanks for the hard work!

using HealthSampleData
using Test

@testset "HealthSampleData.jl" begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests for downloading; you could upload a very small file to HF JuliaHealthDatasets and use that for testing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ParamThakkar123 , you didn't add a test for downloading the test file. Could you do that? Additionally, please note that you can set DataDeps to always be downloaded in CI; read here: https://www.oxinabox.net/DataDeps.jl/stable/z10-for-end-users/#Configuration-1

Copy link
Member

@TheCedarPrince TheCedarPrince left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Param, very very close here! Some final comments; thanks for the hard work!

@ParamThakkar123
Copy link
Collaborator Author

@TheCedarPrince I fixed the error which was occurring during our call yesterday. The HealthSampleData.download_hf_dataset("Test") should work perfectly now. Can you please review this and also check if this works on your side too ?

@TheCedarPrince
Copy link
Member

Hey @ParamThakkar123

I tried out the code and it still doesn't work. I did this and it hung indefinitely until I pressed Ctrl+C:

julia> HealthSampleData.download_hf_dataset("Test")
[ Info: Downloading Test dataset as DataDep...
[ Info: Resolving Hugging Face metadata for https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets

^C┌ Warning: Failed to resolve dataset metadata: HTTP.Exceptions.RequestError(HTTP.Messages.Request:"""
│ GET /api/datasets/https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets/revision/main HTTP/1.1
│ Host: huggingface.co
│ Accept: */*
│ User-Agent: HTTP.jl/1.11.7
│ Content-Length: 0
│ Accept-Encoding: gzip

""", InterruptException()). Falling back to direct download.
└ @ HealthSampleData ~/FOSS/HealthSampleData.jl/src/huggingface.jl:23
[ Info: Downloading penguins.csv from https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets via HuggingFaceHub...
[ Info: Downloading https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets/resolve/main/penguins.csv -> /tmp/jl_p2qTDI/penguins.csv
^CERROR: InterruptException:
Stacktrace:
 [1] open_nolock
   @ ~/.julia/juliaup/julia-1.11.7+0.x64.linux.gnu/share/julia/stdlib/v1.11/ArgTools/src/ArgTools.jl:35 [inlined]
 [2] arg_write(f::Function, arg::String)
   @ ArgTools ~/.julia/juliaup/julia-1.11.7+0.x64.linux.gnu/share/julia/stdlib/v1.11/ArgTools/src/ArgTools.jl:103
 [3] #download#2
   @ ~/.julia/juliaup/julia-1.11.7+0.x64.linux.gnu/share/julia/stdlib/v1.11/Downloads/src/Downloads.jl:258 [inlined]
 [4] download
   @ ~/.julia/juliaup/julia-1.11.7+0.x64.linux.gnu/share/julia/stdlib/v1.11/Downloads/src/Downloads.jl:247 [inlined]
 [5] _huggingface_dataset_register(name::String, repo::String, filename::String)
   @ HealthSampleData ~/FOSS/HealthSampleData.jl/src/huggingface.jl:42
 [6] Test
   @ ~/FOSS/HealthSampleData.jl/src/HuggingFaceDatasets/data.jl:33 [inlined]
 [7] download_hf_dataset(name::String)
   @ HealthSampleData ~/FOSS/HealthSampleData.jl/src/HuggingFaceDatasets/data.jl:69
 [8] top-level scope
   @ REPL[5]:1

I was looking at your code for this and it is doing a variety of weird things -- if you are using LLM-generated code here, the LLM is leading you wrong. Let me make some comments.

@ParamThakkar123
Copy link
Collaborator Author

I tried this out and it worked for me. I am able to download the datasets and store it into the right place

@ParamThakkar123
Copy link
Collaborator Author

Will take a look into this again

Copy link
Member

@TheCedarPrince TheCedarPrince left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ParamThakkar123 , I left multiple comments as we are getting close. As I pointed out, if you were using LLMs, it led you to doing a lot of strange things in the code itself. Please review my comments carefully as this corrects a lot of what the LLM was having you do.

Comment on lines 32 to 64
try
# Prefer official HuggingFaceHub download if dataset info is available
if dataset !== nothing
localpath = HF.file_download(dataset, filename; progress = progress_fn)
else
# Direct fallback if HF.info failed
url = "$repo/resolve/main/$filename"
tmpdir = mktempdir()
dest = joinpath(tmpdir, filename)
@info "Downloading $url -> $dest"
Downloads.download(url, dest; progress = progress_fn)
localpath = dest
end
@info "Downloaded to $localpath"
return localpath

catch e
msg = string(e)
if occursin("symlink", msg) || occursin("creating symlinks", msg) ||
occursin("Administrator", msg) || occursin("operation not permitted", msg)

@warn "Symlink creation failed (likely Windows privilege issue). Falling back to direct HTTP download: $e"
url = "$repo/resolve/main/$filename"
tmpdir = mktempdir()
dest = joinpath(tmpdir, filename)
@info "Downloading $url -> $dest (no symlink)"
Downloads.download(url, dest; progress = progress_fn)
localpath = dest
@info "Fallback download complete: $localpath"
return localpath
else
rethrow(e)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would instead just turn this into a simple try catch block that does:

try 
localpath =  HF.file_download(dataset, filename)
catch e
@error "Dataset X had an error; please open an issue at HEALTHSAMPLEDATA-URL"
end

We should be forcing it to use HuggingFaceHub and if it fails, that is an error from HF that we would need to address. The other content I really do not know what it is there for because DataDeps handles all the pathing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other things are for symlink errors that occured frequently with windows

using HealthSampleData
using Test

@testset "HealthSampleData.jl" begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ParamThakkar123 , you didn't add a test for downloading the test file. Could you do that? Additionally, please note that you can set DataDeps to always be downloaded in CI; read here: https://www.oxinabox.net/DataDeps.jl/stable/z10-for-end-users/#Configuration-1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Download Data Sources from HuggingFace

2 participants