Rework retrieve_ror_fundref_ids rake task & Fix HTTParty InvalidURIError with non-ASCII characters
#3582
`query_ror()` calls `http_get()`, which includes a `HTTParty.get()` call. Without this percent-encoding, HTTParty throws InvalidURIError when given non-ASCII characters. TODO: Determine if any other `http_get()` callers require percent-encoding.
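A minimal sketch of the percent-encoding idea, assuming the search term is encoded before it is placed in the URL. The helper name `encode_query_term` is hypothetical, and the PR may use a different encoding call than the one shown here.

```ruby
require 'uri'

# Hypothetical helper (not taken from the PR): percent-encode a search term
# so that the resulting URL component contains only ASCII characters.
def encode_query_term(term)
  URI.encode_www_form_component(term)
end

encoded = encode_query_term('Université de Montréal')
puts encoded
# Decoding gives back the original term, so nothing is lost in the request.
puts URI.decode_www_form_component(encoded)
```

Note that `URI.encode_www_form_component` encodes spaces as `+`, which is appropriate for query strings but not for path segments.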
Our code relies on ROR's v1 API. https://ror.readme.io/docs/rest-api states the following: "Changes to the ROR API begin the week of July 28, 2025. Beginning the week of July 28, 2025, ROR API requests with no version in the path will default to responses that use version 2 of the ROR schema instead of version 1. Read more in our changelog."
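One way to pin the version is to keep it in a constant and always build the path from it. This is a sketch only: the constant names, the helper `ror_search_url`, and the exact base URL and `query` parameter are assumptions for illustration, not code from the PR.

```ruby
require 'uri'

# Assumed values for illustration; the PR's actual constants are not shown.
ROR_API_BASE = 'https://api.ror.org'
ROR_API_VERSION = 'v1'

# Hypothetical helper: build a search URL with the schema version in the path,
# so the request no longer depends on the API's default version.
def ror_search_url(term)
  query = URI.encode_www_form(query: term)
  "#{ROR_API_BASE}/#{ROR_API_VERSION}/organizations?#{query}"
end

puts ror_search_url('University of Edinburgh')
```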
This change addresses the following suggestion for improving the UX (link includes screenshots): #837 (comment)
This change ensures that the task exits gracefully if either of the required schemes (ror or fundref) is missing from the db. Refactoring is also performed.
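The guard can be sketched roughly as follows. To keep the example runnable without Rails, a plain Hash stands in for the `IdentifierScheme` lookup; the method names `missing_schemes` and `run_update` are hypothetical.

```ruby
# Stand-in for IdentifierScheme lookups so the sketch runs without Rails.
# `available` maps scheme names to scheme records (any object here).
def missing_schemes(available, required = %w[ror fundref])
  required.reject { |name| available.key?(name) }
end

# Hypothetical entry point: return early with a message instead of raising
# later when a scheme turns out to be nil.
def run_update(available)
  missing = missing_schemes(available)
  unless missing.empty?
    warn "Missing IdentifierScheme(s): #{missing.join(', ')}. Exiting."
    return false
  end
  true # the real task would continue with the Org updates here
end

run_update('ror' => :ror_scheme) # warns about fundref and returns false
```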
- Replaced `CSV.generate` + `File.open` with `CSV.open`.
- This change should also be more memory-efficient because rows are now written directly. Prior to this change, the full string was built in memory.
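For illustration, writing rows straight to a file with `CSV.open` looks roughly like this. The helper name, column names, file name, and row values are placeholders, not values from the PR.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical helper: each row is appended to the file as it is produced,
# so no full CSV string has to be built in memory first.
def write_rows(path, headers, rows)
  CSV.open(path, 'w') do |csv|
    csv << headers
    rows.each { |row| csv << row }
  end
end

path = File.join(Dir.tmpdir, 'ror_sketch.csv')
write_rows(path, %w[org_name identifier], [['Example Org', 'placeholder-id']])
```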
`OrgSelection::SearchService#weigh` states "The lower the weight the closer the match".
- This change still attempts to find a result with weight <= 1. However, a result with weight == 0 is now prioritised, allowing for closer potential matches.
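The selection rule described above could be sketched as follows, assuming each result is a Hash with a `:weight` key. The method name `best_match` and the result shape are assumptions, and the PR's actual implementation may differ.

```ruby
# Prefer an exact match (weight 0); otherwise fall back to the first result
# whose weight is at most 1. Returns nil when nothing is close enough.
def best_match(results)
  candidates = results.select { |r| r[:weight] <= 1 }
  candidates.find { |r| r[:weight].zero? } || candidates.first
end

results = [
  { name: 'Close match', weight: 1 },
  { name: 'Exact match', weight: 0 }
]
puts best_match(results)[:name]
```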
- Updated CSV to include a weight column. Knowing this value tells us how much confidence we can have in the new Identifier entries we are writing to the db.
- Updated rake task to also log unmatched results to the generated CSV. With these org names, we can determine whether ROR/Fundref values aren't available, or whether there are issues with the org names themselves or with the rake task itself.
- Also added `puts` statements for unmatched results.
- Renamed variable `rslt` to `result` and `rstls` to `results`.
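A sketch of how matched and unmatched orgs might be turned into CSV rows. The column names, the `status` column, and the method name `csv_row` are hypothetical; the PR's actual CSV layout is not shown here.

```ruby
# Hypothetical column layout for illustration only.
CSV_HEADERS = %w[org_name matched_name weight status].freeze

# Build one CSV row per org. Unmatched orgs are still recorded (and printed),
# so they can be reviewed afterwards.
def csv_row(org_name, match)
  if match
    [org_name, match[:name], match[:weight], 'matched']
  else
    puts "No match found for #{org_name}"
    [org_name, nil, nil, 'unmatched']
  end
end

matched_row = csv_row('Example Org', { name: 'Example Organisation', weight: 1 })
unmatched_row = csv_row('Unknown Org', nil)
```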
Refactor `task update_ror_data` into `module Orgs::UpdateRorService`.
- Helper methods now live inside the module instead of globally, preventing accidental overrides by other tasks or code.
- The module is placed in `app/services/orgs/` to follow Rails conventions, enabling autoloading and keeping service code separate from Rake task definitions.
- Using Orgs (plural) for the module namespace avoids conflicts with the existing Org ActiveRecord model.
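The namespacing benefit can be illustrated with a stripped-down module. A top-level `def` in a rake file becomes a private method on `Object` and is therefore visible everywhere, whereas a helper defined inside the module is not. The method names `call` and `normalise_name` are placeholders, not the service's real interface.

```ruby
module Orgs
  module UpdateRorService
    class << self
      # Placeholder entry point; the real service does the ROR/Fundref work.
      def call
        normalise_name('  Example Org ')
      end

      private

      # Helper scoped to the module, so it cannot clash with a method of the
      # same name defined by another rake task.
      def normalise_name(name)
        name.strip
      end
    end
  end
end

puts Orgs::UpdateRorService.call
```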
This refactor and added comments are being made to address rubocop offences.
- Add `UPDATE_EXISTING` environment variable to control whether orgs
that already have ROR/Fundref identifiers should be updated.
- Default behavior remains the same: existing identifiers are skipped.
- Rake task usage documented with example:
`UPDATE_EXISTING=true bundle exec rake orgs:update_ror_data`
- Update `Orgs::UpdateRorService` to accept and utilise the `update_existing` keyword argument.
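A rough sketch of how the flag might flow from the environment into a keyword argument. How the PR parses the variable is not shown, so the string comparison below is an assumption, as are the method names and the `:has_identifiers` key.

```ruby
# Assumed parsing rule: only the literal value "true" (any case) enables the
# flag, so the default behaviour of skipping existing identifiers is kept.
def update_existing?(env = ENV)
  env.fetch('UPDATE_EXISTING', 'false').strip.downcase == 'true'
end

# Hypothetical filter: orgs are Hashes with a :has_identifiers key here.
def orgs_to_process(orgs, update_existing: false)
  return orgs if update_existing

  orgs.reject { |org| org[:has_identifiers] }
end

orgs = [
  { name: 'Org A', has_identifiers: true },
  { name: 'Org B', has_identifiers: false }
]
selected = orgs_to_process(orgs, update_existing: update_existing?)
puts selected.map { |org| org[:name] }.join(', ')
```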
Checked the rake task `UPDATE_EXISTING=true bundle exec rake orgs:update_ror_data`. The CSV was produced after I added University of Edinburgh as an Org by creating an account.

@aaronskiba, can I suggest you add the ror and fundref identifier schemes to the seeds.rb file to make testing easier?
johnpinto1
left a comment
It would be good to have identifier_schemes for ror and fundref in the seeds file. It makes testing the rake task easier. See this comment for details: #3582 (comment)
After that, it is good to merge, as it sorts out the non-ASCII search bug, the schema version, and the update of ROR and funder IDs.
Sounds great. Thank you for the thorough review and I will add those requests to the current PR.
Used the same values from `task add_new_identifier_schemes` and `task contextualize_identifier_schemes` (see `lib/tasks/upgrade.rake`) for adding seeds.
- However, `fundref.identifier_prefix = 'https://doi.org/10.13039/'` doesn't seem to be correct in `lib/tasks/upgrade.rake`.
- "https://api.crossref.org/funders/" is used in DMP Assistant's DB, so that was used instead.
Hi @johnpinto1, I finally added the "ror" and "fundref" IdentifierSchemes to db/seeds.rb.


Fixes portagenetwork#733
Fixes portagenetwork#837
Changes proposed in this PR:
1. Fix HTTParty InvalidURIError with non-ASCII characters

`query_ror()` calls `http_get()`, which includes a `HTTParty.get()` call. Without this percent-encoding, HTTParty throws InvalidURIError when given non-ASCII characters.

- TODO: Determine if any other `http_get()` callers require percent-encoding.

2. Rework `retrieve_ror_fundref_ids` rake task

`lib/tasks/upgrade.rake` includes a one-off upgrade task for adding ROR and FUNDREF identifiers for Orgs in the database. This PR adds the `orgs:update_ror_data` rake task, which is essentially a rework of the one-off upgrade.

- Make v1 explicit when using ROR api
- Remove repetitive rendering of ROR/Fundref scheme names
- Apply .managed filter to ror/fundref org updates
- Guard against missing IdentifierScheme(s)
- Improve best match result in org / ror rake task
  - `OrgSelection::SearchService#weigh` states "The lower the weight the closer the match".
- Update CSV handling: add weights and unmatched results
  - Renamed variable `rslt` to `result` and `rstls` to `results`
- Refactor: Namespace ROR Rake task in module
  - Refactor `task update_ror_data` into `module Orgs::UpdateRorService`.
  - The module is placed in `app/services/orgs/` to follow Rails conventions, enabling autoloading and keeping service code separate from Rake task definitions.
- Add conditional flag to update existing ROR/Fundref data
  - Add `UPDATE_EXISTING` environment variable to control whether orgs that already have ROR/Fundref identifiers should be updated.
  - Rake task usage documented with example: `UPDATE_EXISTING=true bundle exec rake orgs:update_ror_data`
  - Update `Orgs::UpdateRorService` to accept and utilise the `update_existing` keyword argument.