Rework `retrieve_ror_fundref_ids` rake task & Fix HTTParty InvalidURIError with non-ASCII characters #3582

aaronskiba · 2025-11-04T20:35:22Z

Fixes portagenetwork#733
Fixes portagenetwork#837

Changes proposed in this PR:

1. Fix HTTParty InvalidURIError with non-ASCII characters

Percent-encode non-ASCII chars in query_ror
- query_ror() calls http_get(), which in turn calls HTTParty.get(). Without this percent-encoding, HTTParty raises InvalidURIError when given non-ASCII characters.
- TODO: Determine if any other http_get() callers require percent-encoding.
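For illustration, here is a minimal sketch of the kind of percent-encoding described above, using only Ruby's standard library. The helper name `encode_search_term` and the example URL are assumptions made for this sketch, not the code in this PR, and `URI.encode_www_form_component` is only one of several ways to encode a query value.

```ruby
require 'uri'

# Hypothetical helper (not the PR's actual code): percent-encode a search
# term so that the resulting URL contains only ASCII characters.
def encode_search_term(term)
  URI.encode_www_form_component(term)
end

# Placeholder endpoint, used only to show where the encoded term goes.
base = 'https://example.org/search'
url  = "#{base}?q=#{encode_search_term('Méditerranée')}"

# URI.parse accepts the encoded URL. The same URL built from the raw,
# unencoded term is not ASCII-only, which is what triggers
# URI::InvalidURIError.
URI.parse(url)
```

The same helper could be applied in any other `http_get()` caller that interpolates user-supplied text into a URL, which is what the TODO above would need to check.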

2. Rework `retrieve_ror_fundref_ids` rake task

lib/tasks/upgrade.rake includes a one-off upgrade task (retrieve_ror_fundref_ids) for adding ROR and FUNDREF identifiers for Orgs in the database. This PR adds the orgs:update_ror_data task, which is essentially a rework of that one-off upgrade task.

Make v1 explicit when using ROR api

Our code relies on ROR's v1 API. https://ror.readme.io/docs/rest-api states the following:

> Changes to the ROR API begin the week of July 28, 2025
> Beginning the week of July 28, 2025, ROR API requests with no version in the path will default to responses that use version 2 of the ROR schema instead of version 1. Read more in our changelog.
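As a rough sketch of what "making v1 explicit" can look like, the version segment is placed in the request path instead of relying on the server default. The constant and method names below are assumptions for illustration and may not match the names used in this PR.

```ruby
require 'uri'

# Assumed constants for illustration; the PR's actual names may differ.
ROR_API_HOST    = 'https://api.ror.org'
ROR_API_VERSION = 'v1'

# Build a search URL that always names the API version in the path, so the
# response schema does not depend on the server-side default.
def ror_search_url(term)
  query = URI.encode_www_form(query: term)
  "#{ROR_API_HOST}/#{ROR_API_VERSION}/organizations?#{query}"
end
```

Keeping the version in a single constant means a later move to v2 is a one-line change, followed by updating the response parsing.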

Remove repetitive rendering of ROR/Fundref scheme names
- This change addresses the following suggestion for improving the UX (link includes screenshots): Org profiles with ROR/Fundref portagenetwork/roadmap#837 (comment)
Apply .managed filter to ror/fundref org updates
Guard against missing IdentifierScheme(s)
- This change ensures that the task exits gracefully if either of the required schemes (ror or fundref) is missing from the db.
- Refactoring is also performed.
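The guard described above could be sketched roughly as follows. This is a plain-Ruby illustration under stated assumptions: the list of available scheme names is passed in as an array (in the real task it would come from the database), and the constant, method names and message text are hypothetical.

```ruby
# Scheme names the task depends on, as named in this PR.
REQUIRED_SCHEMES = %w[ror fundref].freeze

# Return the required scheme names that are absent from the given list.
# `available_names` stands in for whatever the database lookup returns.
def missing_schemes(available_names)
  REQUIRED_SCHEMES - available_names.map(&:to_s)
end

# Return a message explaining why the task should stop, or nil when both
# schemes are present and the task can continue.
def guard_message(available_names)
  missing = missing_schemes(available_names)
  return nil if missing.empty?

  "Missing IdentifierScheme(s): #{missing.join(', ')}. Nothing was updated."
end
```

A caller would print the message and return early instead of raising, which is one way to get the graceful exit mentioned above.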
Improve best match result in org / ror rake task
- OrgSelection::SearchService#weigh states "The lower the weight the closer the match".
- This change still attempts to find a result with weight <= 1. However, now a result with weight == 0 is prioritised, allowing for closer potential matches.
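The selection rule above can be sketched as follows. The result shape (a hash with a `:weight` key) and the method name `best_match` are assumptions for illustration, not the PR's actual code.

```ruby
# Hypothetical result shape: each candidate carries a :weight where lower
# means a closer match (0 is the closest).
def best_match(results, max_weight: 1)
  # Prefer an exact match (weight 0) wherever it appears in the list.
  exact = results.find { |result| result[:weight].zero? }
  return exact if exact

  # Otherwise fall back to the first candidate within the threshold.
  results.find { |result| result[:weight] <= max_weight }
end
```

Without the first step, a weight-1 candidate listed ahead of a weight-0 candidate would win, which is the case this change is meant to avoid.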
Update CSV handling: add weights and unmatched results
- Updated CSV to include a weight column. This value indicates how much confidence we can have in the new Identifier entries we are writing to the db.
- Updated rake task to also log unmatched results to the generated CSV. With these org names, we can determine whether ROR/Fundref values are simply unavailable, or whether there are issues with the org names or with the rake task itself.
- Also added puts statements for unmatched results
- Renamed variable rslt to result and rstls to results
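A minimal sketch of a report that includes both a weight column and unmatched orgs might look like this. The column names, the row shape and the method name `write_report` are assumptions for illustration; the actual CSV layout in this PR may differ.

```ruby
require 'csv'

# Hypothetical column layout, for illustration only.
HEADERS = ['Org name', 'ROR name', 'ROR id', 'FundRef id', 'Weight'].freeze

# Write one line per org. Unmatched orgs only have :org_name set, so their
# remaining columns are left empty and are easy to filter on during review.
def write_report(path, rows)
  # CSV.open appends rows to the file as they are added, rather than
  # building the whole report as one string first.
  CSV.open(path, 'w', write_headers: true, headers: HEADERS) do |csv|
    rows.each do |row|
      csv << row.values_at(:org_name, :ror_name, :ror_id, :fundref_id, :weight)
    end
  end
end
```

Sorting or filtering the resulting file on the weight column then separates exact matches, near matches that need a manual check, and orgs with no match at all.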
Refactor: Namespace ROR Rake task in module
Refactor task update_ror_data into module Orgs::UpdateRorService.
- Helper methods now live inside the module instead of globally, preventing accidental overrides by other tasks or code.
- The module is placed in app/services/orgs/ to follow Rails conventions, enabling autoloading and keeping service code separate from Rake task definitions.
  - Using Orgs (plural) for the module namespace avoids conflicts with the existing Org ActiveRecord model.
Add conditional flag to update existing ROR/Fundref data
- Add UPDATE_EXISTING environment variable to control whether orgs that already have ROR/Fundref identifiers should be updated.
- Default behavior remains the same: existing identifiers are skipped.
- Rake task usage documented with example:
  - UPDATE_EXISTING=true bundle exec rake orgs:update_ror_data
- Update Orgs::UpdateRorService to accept and utilise the update_existing keyword argument.
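The flag handling described above could be sketched like this. The module name and the `update_existing` keyword come from this PR, but the method names and bodies are assumptions made for illustration.

```ruby
# Hypothetical sketch: only the module name and the update_existing keyword
# are taken from this PR; everything else is illustrative.
module Orgs
  module UpdateRorService
    module_function

    # Treat only the string "true" (in any case) as enabling updates, so an
    # unset or empty variable keeps the default of skipping existing data.
    def update_existing?(env = ENV)
      env.fetch('UPDATE_EXISTING', '').strip.casecmp?('true')
    end

    # Decide whether an org should be processed, given whether it already
    # has ROR/Fundref identifiers.
    def process?(has_identifiers:, update_existing: false)
      update_existing || !has_identifiers
    end
  end
end
```

In this sketch the rake task would read the flag once via `update_existing?` and pass the result down as the keyword argument, so the service itself never touches `ENV`.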

johnpinto1 · 2025-11-11T14:24:08Z

TESTED: Fix HTTParty InvalidURIError with non-ASCII characters
Tested with "Méditerranée" & "Würzburg".

johnpinto1 · 2025-11-11T15:48:01Z

Checked the Rake task with `UPDATE_EXISTING=true bundle exec rake orgs:update_ror_data`:

Scanning ROR for each of your existing Orgs.
The results will be written to "/media/jpinto/e88feba4-2a6d-4136-bc61-9af95ba1ca2e/DMP-WS/ROR-ORGS_WORK/roadmap-nontest/tmp/ror_fundref_ids.csv" to facilitate 
review and any corrections that may need to be made.
The CSV file contains the Org name stored in your DB next to the ROR org 
name that was matched. Use these 2 values to determine if the match was valid.
You can use the ROR search page to find the correct match for any organizations 
that need to be corrected: https://ror.org/search

Found 3 org(s) to process.
⚠️  No results found for Org with id: 2 and name: Government Agency
✅  Updated University of Edinburgh (ed.ac.uk) -> ROR: https://ror.org/01nrxwf90, University of Edinburgh (ed.ac.uk)
✅  Updated University of Edinburgh (ed.ac.uk) -> FUNDREF: https://api.crossref.org/funders/501100000848, University of Edinburgh (ed.ac.uk)
⚠️  No results found for Org with id: 3 and name: University of Exampleland

CSV produced after I added University of Edinburgh as Org by creating an account:
ror_fundref_ids.csv

@aaronskiba, can I suggest adding the ror and fundref identifier schemes to the seeds.rb file to make testing easier:
identifier_schemes = [
  {
    name: 'orcid',
    description: 'ORCID',
    active: true,
    logo_url: 'http://orcid.org/sites/default/files/images/orcid_16x16.png',
    identifier_prefix: 'https://orcid.org',
    context: 25
  },
  {
    name: 'shibboleth',
    description: 'Your institutional credentials',
    active: true,
    context: 11
  },
  {
    name: 'ror',
    description: 'Research Organization Registry (ROR)',
    active: true,
    identifier_prefix: 'https://ror.org/',
    context: 2
  },
  {
    name: 'fundref',
    description: 'Crossref Funder Registry (FundRef)',
    active: true,
    identifier_prefix: 'https://api.crossref.org/funders/',
    context: 2
  }
]

johnpinto1

It would be good to have identifier_schemes for ror and fundref in the seeds file, as it makes testing the rake task easier (cf. comment for details: #3582 (comment)).
After that, this is good to merge, as it sorts out the non-ASCII search bug, the schema version, and the update of ROR and funder IDs.

aaronskiba · 2025-11-12T16:45:25Z

> It would be good to have identifier_schemes for ror and fundref in the seeds file, as it makes testing the rake task easier (cf. comment for details: #3582 (comment)).
> After that, this is good to merge, as it sorts out the non-ASCII search bug, the schema version, and the update of ROR and funder IDs.

Sounds great. Thank you for the thorough review and I will add those requests to the current PR.

Used the same values from `task add_new_identifier_schemes` and `task contextualize_identifier_schemes` (see `lib/tasks/upgrade.rake`) for adding seeds.
- However, `fundref.identifier_prefix = 'https://doi.org/10.13039/'` in `lib/tasks/upgrade.rake` doesn't seem to be correct.
- "https://api.crossref.org/funders/" is used in DMP Assistant's DB, so that was used instead.

aaronskiba · 2025-11-20T21:17:46Z

Hi @johnpinto1, I finally added the "ror" and "fundref" IdentifierSchemes to db/seeds.rb.

aaronskiba added 13 commits November 4, 2025 11:25

Remove repetitive rendering of ROR/Fundref scheme names

e1de44c

This change addresses the following suggestion for improving the UX (link includes screenshots): #837 (comment)

cp upgrade:retrieve_ror_fundref_ids to lib/tasks/orgs.rake

6bb157f

Apply .managed filter to ror/fundref org updates

4456e97

Guard against missing IdentifierScheme(s)

6120184

This change ensures that the task exits gracefully if either of the required schemes (ror or fundref) are missing from the db. Refactoring is also performed.

Refactor: Simplify CSV logic in org/ror rake task

16a037a

- Replaced CSV.generate + File.open with CSV.open
- This change should also be more memory-efficient because rows are now written directly. Prior to this change, the full string was built in memory.

Refactor orgs/ROR rake task

7950593

Make rubocop happy

22d8f40

This refactor and added comments are being made to address rubocop offences.

aaronskiba requested review from gjacob24, johnpinto1 and martaribeiro November 4, 2025 20:44

johnpinto1 approved these changes Nov 11, 2025
