-
Notifications
You must be signed in to change notification settings - Fork 2
Added Huggingface support #13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
TheCedarPrince
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @ParamThakkar123 for your work! I left a bunch of comments based on our call; apologies for the delay in reviewing with being out sick! Very excited here and happy to review multiple times as needed. Just let me know via ping again!
I think we also will need to implement #14 first before continuing with this PR however.
|
Hey @ParamThakkar123 , I think you are unblocked on this PR now! Could you rebase and address the comments I left here? Thanks and keep up the good work! :D |
Yes @TheCedarPrince I will start working on this now!! |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #13 +/- ##
======================================
Coverage ? 7.46%
======================================
Files ? 5
Lines ? 67
Branches ? 0
======================================
Hits ? 5
Misses ? 62
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
TheCedarPrince
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Param, very very close here! Some final comments; thanks for the hard work!
| using HealthSampleData | ||
| using Test | ||
|
|
||
| @testset "HealthSampleData.jl" begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add tests for downloading; you could upload a very small file to HF JuliaHealthDatasets and use that for testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ParamThakkar123 , you didn't add a test for downloading the test file. Could you do that? Additionally, please note that you can set DataDeps to always be downloaded in CI; read here: https://www.oxinabox.net/DataDeps.jl/stable/z10-for-end-users/#Configuration-1
TheCedarPrince
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Param, very very close here! Some final comments; thanks for the hard work!
|
@TheCedarPrince I fixed the error which was occurring during our call yesterday. The |
|
Hey @ParamThakkar123 I tried out the code and it still doesn't work. I did this and it hung indefinitely until I pressed Ctrl+C: julia> HealthSampleData.download_hf_dataset("Test")
[ Info: Downloading Test dataset as DataDep...
[ Info: Resolving Hugging Face metadata for https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets
^C┌ Warning: Failed to resolve dataset metadata: HTTP.Exceptions.RequestError(HTTP.Messages.Request:
│ """
│ GET /api/datasets/https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets/revision/main HTTP/1.1
│ Host: huggingface.co
│ Accept: */*
│ User-Agent: HTTP.jl/1.11.7
│ Content-Length: 0
│ Accept-Encoding: gzip
│
│ """, InterruptException()). Falling back to direct download.
└ @ HealthSampleData ~/FOSS/HealthSampleData.jl/src/huggingface.jl:23
[ Info: Downloading penguins.csv from https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets via HuggingFaceHub...
[ Info: Downloading https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets/resolve/main/penguins.csv -> /tmp/jl_p2qTDI/penguins.csv
^CERROR: InterruptException:
Stacktrace:
[1] open_nolock
@ ~/.julia/juliaup/julia-1.11.7+0.x64.linux.gnu/share/julia/stdlib/v1.11/ArgTools/src/ArgTools.jl:35 [inlined]
[2] arg_write(f::Function, arg::String)
@ ArgTools ~/.julia/juliaup/julia-1.11.7+0.x64.linux.gnu/share/julia/stdlib/v1.11/ArgTools/src/ArgTools.jl:103
[3] #download#2
@ ~/.julia/juliaup/julia-1.11.7+0.x64.linux.gnu/share/julia/stdlib/v1.11/Downloads/src/Downloads.jl:258 [inlined]
[4] download
@ ~/.julia/juliaup/julia-1.11.7+0.x64.linux.gnu/share/julia/stdlib/v1.11/Downloads/src/Downloads.jl:247 [inlined]
[5] _huggingface_dataset_register(name::String, repo::String, filename::String)
@ HealthSampleData ~/FOSS/HealthSampleData.jl/src/huggingface.jl:42
[6] Test
@ ~/FOSS/HealthSampleData.jl/src/HuggingFaceDatasets/data.jl:33 [inlined]
[7] download_hf_dataset(name::String)
@ HealthSampleData ~/FOSS/HealthSampleData.jl/src/HuggingFaceDatasets/data.jl:69
[8] top-level scope
@ REPL[5]:1I was looking at your code for this and it is doing a variety of weird things -- if you are using LLM-generated code here, the LLM is leading you wrong. Let me make some comments. |
|
I tried this out and it worked for me. I am able to download the datasets and store it into the right place |
|
Will take a look into this again |
TheCedarPrince
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ParamThakkar123 , I left multiple comments as we are getting close. As I pointed out, if you were using LLMs, it led you to doing a lot of strange things in the code itself. Please review my comments carefully as this corrects a lot of what the LLM was having you do.
| try | ||
| # Prefer official HuggingFaceHub download if dataset info is available | ||
| if dataset !== nothing | ||
| localpath = HF.file_download(dataset, filename; progress = progress_fn) | ||
| else | ||
| # Direct fallback if HF.info failed | ||
| url = "$repo/resolve/main/$filename" | ||
| tmpdir = mktempdir() | ||
| dest = joinpath(tmpdir, filename) | ||
| @info "Downloading $url -> $dest" | ||
| Downloads.download(url, dest; progress = progress_fn) | ||
| localpath = dest | ||
| end | ||
| @info "Downloaded to $localpath" | ||
| return localpath | ||
|
|
||
| catch e | ||
| msg = string(e) | ||
| if occursin("symlink", msg) || occursin("creating symlinks", msg) || | ||
| occursin("Administrator", msg) || occursin("operation not permitted", msg) | ||
|
|
||
| @warn "Symlink creation failed (likely Windows privilege issue). Falling back to direct HTTP download: $e" | ||
| url = "$repo/resolve/main/$filename" | ||
| tmpdir = mktempdir() | ||
| dest = joinpath(tmpdir, filename) | ||
| @info "Downloading $url -> $dest (no symlink)" | ||
| Downloads.download(url, dest; progress = progress_fn) | ||
| localpath = dest | ||
| @info "Fallback download complete: $localpath" | ||
| return localpath | ||
| else | ||
| rethrow(e) | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would instead just turn this into a simple try catch block that does:
try
localpath = HF.file_download(dataset, filename)
catch e
@error "Dataset X had an error; please open an issue at HEALTHSAMPLEDATA-URL"
endWe should be forcing it to use HuggingFaceHub and if it fails, that is an error from HF that we would need to address. The other content I really do not know what it is there for because DataDeps handles all the pathing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other things are for symlink errors that occured frequently with windows
| using HealthSampleData | ||
| using Test | ||
|
|
||
| @testset "HealthSampleData.jl" begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ParamThakkar123 , you didn't add a test for downloading the test file. Could you do that? Additionally, please note that you can set DataDeps to always be downloaded in CI; read here: https://www.oxinabox.net/DataDeps.jl/stable/z10-for-end-users/#Configuration-1
@TheCedarPrince
This PR Fixes #12
Added a
_huggingface_dataset_registerto register datasets from huggingface hub using DataDeps.jl and implemented aloadfunction to load the dataset usingHealthSampleData.jlmodule for the users