@aaronskiba
Contributor

Fixes portagenetwork#733
Fixes portagenetwork#837

Changes proposed in this PR:

1. Fix HTTParty InvalidURIError with non-ASCII characters

  • Percent-encode non-ASCII chars in query_ror
    • query_ror() calls http_get(), which in turn makes an HTTParty.get() call. Without this percent-encoding, HTTParty raises URI::InvalidURIError when the URL contains non-ASCII characters.
    • TODO: Determine if any other http_get() callers require percent-encoding.
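
A minimal sketch of the percent-encoding idea, not the PR's actual code. `encoded_ror_query_url` and `parseable?` are hypothetical helpers, and the base URL is an assumption used only for illustration. The sketch relies on Ruby's standard `URI.encode_www_form_component`, which encodes non-ASCII bytes as `%XX` (and spaces as `+`).

```ruby
require 'uri'

# Hypothetical helper for illustration; the real query_ror may build its URL differently.
# The base URL is an assumption.
def encoded_ror_query_url(term, base: 'https://api.ror.org/v1/organizations')
  "#{base}?query=#{URI.encode_www_form_component(term)}"
end

# Ruby's URI parser rejects URLs that contain raw non-ASCII characters,
# which is the failure mode described above.
def parseable?(url)
  URI.parse(url)
  true
rescue URI::InvalidURIError
  false
end

raw_url = 'https://api.ror.org/v1/organizations?query=Würzburg'
puts parseable?(raw_url)                            # false
puts parseable?(encoded_ror_query_url('Würzburg'))  # true
puts encoded_ror_query_url('Würzburg')
```

Encoding only the query value, rather than the whole URL, keeps the scheme, host, and `?`/`=` separators intact.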

2. Rework retrieve_ror_fundref_ids rake task

lib/tasks/upgrade.rake includes a one-off upgrade task for adding ROR and FUNDREF identifiers for Orgs in the database. This PR adds the orgs:update_ror_data task, which is essentially a rework of that one-off upgrade task.

  • Make v1 explicit when using ROR api

    • Our code relies on ROR's v1 API. https://ror.readme.io/docs/rest-api states the following:
      Changes to the ROR API begin the week of July 28, 2025
      Beginning the week of July 28, 2025, ROR API requests with no version in the path will default to responses that use version 2 of the ROR schema instead of version 1. Read more in our changelog. 
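
One way to make the version explicit is to keep it in a single constant and build every request path from it. This is a sketch under assumptions: the constant names and `ror_search_uri` are hypothetical and are not taken from the PR.

```ruby
require 'uri'

# Hypothetical constants; the PR may store the host and version elsewhere.
ROR_API_HOST = 'api.ror.org'
ROR_API_VERSION = 'v1'

# Builds a search URI with the API version pinned in the path, so the
# request does not depend on whichever version the API treats as default.
def ror_search_uri(term)
  URI::HTTPS.build(
    host: ROR_API_HOST,
    path: "/#{ROR_API_VERSION}/organizations",
    query: URI.encode_www_form(query: term)
  )
end

puts ror_search_uri('University of Edinburgh')
```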
      
  • Remove repetitive rendering of ROR/Fundref scheme names

  • Apply .managed filter to ror/fundref org updates

  • Guard against missing IdentifierScheme(s)

    • This change ensures that the task exits gracefully if either of the required schemes (ror or fundref) is missing from the db.
    • Refactoring is also performed.
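
The guard can be sketched roughly as follows. Everything here is illustrative: `fetch_required_schemes` is a hypothetical name, and the `lookup` callable stands in for whatever database query the task really uses.

```ruby
REQUIRED_SCHEME_NAMES = %w[ror fundref].freeze

# `lookup` is any callable that maps a scheme name to a scheme or nil.
# In the Rails app this might wrap a query on IdentifierScheme (assumption).
# Returns a Hash of name => scheme, or nil when a required scheme is missing.
def fetch_required_schemes(lookup)
  schemes = REQUIRED_SCHEME_NAMES.to_h { |name| [name, lookup.call(name)] }
  missing = schemes.select { |_name, scheme| scheme.nil? }.keys
  return schemes if missing.empty?

  warn "Missing IdentifierScheme(s): #{missing.join(', ')}. Nothing was updated."
  nil
end

# Stand-in data: only the ror scheme exists, so the guard returns nil.
fake_db = { 'ror' => { name: 'ror' } }
schemes = fetch_required_schemes(->(name) { fake_db[name] })
puts schemes.nil? ? 'Exiting early' : 'Proceeding'
```

Returning `nil` lets the caller stop with a plain `return` instead of raising partway through the task.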
  • Improve best match result in org / ror rake task

    • OrgSelection::SearchService#weigh states "The lower the weight the closer the match".
    • This change still attempts to find a result with weight <= 1. However, now a result with weight == 0 is prioritised, allowing for closer potential matches.
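
A minimal sketch of that selection rule, assuming each search result is a Hash with a `:weight` key. The result shape and the `best_match` name are assumptions for illustration.

```ruby
# Keeps only results within the accepted weight, then picks the lowest
# weight, so an exact match (weight 0) wins over a weight-1 match.
# Returns nil when no result is close enough.
def best_match(results, max_weight: 1)
  results
    .select { |result| result[:weight] <= max_weight }
    .min_by { |result| result[:weight] }
end

results = [
  { name: 'University of Edinburgh Press', weight: 1 },
  { name: 'University of Edinburgh', weight: 0 },
  { name: 'Edinburgh Napier University', weight: 2 }
]
puts best_match(results)[:name]
```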
  • Update CSV handling: add weights and unmatched results

    • Updated CSV to include a weight column. This value indicates how much confidence we can have in the new Identifier entries being written to the db.
    • Updated rake task to also log unmatched results to the generated CSV. With these org names, we can determine whether ROR/Fundref values are simply unavailable, or whether there are issues with the org names or with the rake task itself.
    • Also added puts statements for unmatched results.
    • Renamed variable rslt to result and rstls to results
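
The CSV changes could look roughly like this. The column names, the row shape, and `write_report` are all assumptions made for the sketch; only the use of `CSV.open` and the presence of a weight column and unmatched rows come from the description above.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical column layout; the real task's headers may differ.
REPORT_HEADERS = ['Org ID', 'Org name', 'ROR name', 'ROR ID', 'Fundref ID', 'Weight'].freeze

# Writes one row per org. Unmatched orgs (match is nil) are still written,
# with the match columns left empty, so they can be reviewed afterwards.
def write_report(path, rows)
  CSV.open(path, 'w', write_headers: true, headers: REPORT_HEADERS) do |csv|
    rows.each do |row|
      match = row[:match]
      csv << [row[:org_id], row[:org_name], match&.dig(:name),
              match&.dig(:ror), match&.dig(:fundref), match&.dig(:weight)]
    end
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'ror_fundref_ids.csv')
  write_report(path, [
    { org_id: 1, org_name: 'University of Edinburgh',
      match: { name: 'University of Edinburgh', ror: 'https://ror.org/01nrxwf90', weight: 0 } },
    { org_id: 2, org_name: 'Government Agency', match: nil }
  ])
  puts File.read(path)
end
```

Because `CSV.open` streams each row to the file as it is appended, the report never has to be held in memory as one string.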
  • Refactor: Namespace ROR Rake task in module
    Refactor task update_ror_data into module Orgs::UpdateRorService.

    • Helper methods now live inside the module instead of globally, preventing accidental overrides by other tasks or code.
    • The module is placed in app/services/orgs/ to follow Rails conventions, enabling autoloading and keeping service code separate from Rake task definitions.
      • Using Orgs (plural) for the module namespace avoids conflicts with the existing Org ActiveRecord model.
  • Add conditional flag to update existing ROR/Fundref data

    • Add UPDATE_EXISTING environment variable to control whether orgs that already have ROR/Fundref identifiers should be updated.
    • Default behavior remains the same: existing identifiers are skipped.
    • Rake task usage documented with example:
      • UPDATE_EXISTING=true bundle exec rake orgs:update_ror_data
    • Update Orgs::UpdateRorService to accept and utilise the update_existing keyword argument.
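
A rough sketch of how the flag and the namespaced service could fit together. Only the `Orgs::UpdateRorService` name, the `update_existing` keyword, and the `UPDATE_EXISTING` variable come from the description; `orgs_to_process`, `update_existing?`, the org Hash shape, and the exact parsing of the variable are assumptions.

```ruby
module Orgs
  # Helpers live inside the module rather than at the top level,
  # so other rake tasks cannot accidentally override them.
  module UpdateRorService
    module_function

    # Hypothetical helper: orgs are Hashes with an :identifiers array.
    # By default, orgs that already have identifiers are skipped.
    def orgs_to_process(orgs, update_existing: false)
      return orgs if update_existing

      orgs.select { |org| org[:identifiers].empty? }
    end
  end
end

# Hypothetical parsing of the flag: anything other than "true" means false.
def update_existing?(env = ENV)
  env.fetch('UPDATE_EXISTING', 'false').casecmp?('true')
end

orgs = [
  { name: 'Government Agency', identifiers: [] },
  { name: 'University of Edinburgh', identifiers: ['https://ror.org/01nrxwf90'] }
]
selected = Orgs::UpdateRorService.orgs_to_process(orgs, update_existing: update_existing?)
puts "Found #{selected.size} org(s) to process."
```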

This change addresses the following suggestion for improving the UX (link includes screenshots): #837 (comment)
Refactoring performed alongside the IdentifierScheme guard:
- Replaced `CSV.generate` + `File.open` with `CSV.open`.
- This change should also be more memory-efficient because rows are now written directly to the file. Prior to this change, the full CSV string was built in memory.
This refactor and added comments are being made to address rubocop offences.
@johnpinto1
Contributor

johnpinto1 commented Nov 11, 2025

TESTED: Fix HTTParty InvalidURIError with non-ASCII characters
Tested with "Méditerranée" & "Würzburg".

[Screenshots: Selection_028, Selection_027]

@johnpinto1
Contributor

johnpinto1 commented Nov 11, 2025

Checked the Rake task: `UPDATE_EXISTING=true bundle exec rake orgs:update_ror_data`

Scanning ROR for each of your existing Orgs.
The results will be written to "/media/jpinto/e88feba4-2a6d-4136-bc61-9af95ba1ca2e/DMP-WS/ROR-ORGS_WORK/roadmap-nontest/tmp/ror_fundref_ids.csv" to facilitate 
review and any corrections that may need to be made.
The CSV file contains the Org name stored in your DB next to the ROR org 
name that was matched. Use these 2 values to determine if the match was valid.
You can use the ROR search page to find the correct match for any organizations 
that need to be corrected: https://ror.org/search

Found 3 org(s) to process.
⚠️  No results found for Org with id: 2 and name: Government Agency
✅  Updated University of Edinburgh (ed.ac.uk) -> ROR: https://ror.org/01nrxwf90, University of Edinburgh (ed.ac.uk)
✅  Updated University of Edinburgh (ed.ac.uk) -> FUNDREF: https://api.crossref.org/funders/501100000848, University of Edinburgh (ed.ac.uk)
⚠️  No results found for Org with id: 3 and name: University of Exampleland

CSV produced after I added University of Edinburgh as Org by creating an account:
ror_fundref_ids.csv

@aaronskiba can I suggest you add the ror and fundref schemes to the seeds.rb file to make testing easier:
identifier_schemes = [
  {
    name: 'orcid',
    description: 'ORCID',
    active: true,
    logo_url: 'http://orcid.org/sites/default/files/images/orcid_16x16.png',
    identifier_prefix: 'https://orcid.org',
    context: 25
  },
  {
    name: 'shibboleth',
    description: 'Your institutional credentials',
    active: true,
    context: 11
  },
  {
    name: 'ror',
    description: 'Research Organization Registry (ROR)',
    active: true,
    identifier_prefix: 'https://ror.org/',
    context: 2
  },
  {
    name: 'fundref',
    description: 'Crossref Funder Registry (FundRef)',
    active: true,
    identifier_prefix: 'https://api.crossref.org/funders/',
    context: 2
  }
]

Contributor

@johnpinto1 johnpinto1 left a comment

It would be good to have identifier_schemes for ror and fundref in the seeds file, as that makes testing the rake task easier. See #3582 (comment) for details.
After that, this is good to merge, as it sorts out the non-ASCII search bug, the schema version, and the update of ROR and funder IDs.

@aaronskiba
Contributor Author

> It would be good to have identifier_schemes for ror and fundref in the seeds file, as that makes testing the rake task easier. See #3582 (comment) for details. After that, this is good to merge, as it sorts out the non-ASCII search bug, the schema version, and the update of ROR and funder IDs.

Sounds great. Thank you for the thorough review and I will add those requests to the current PR.

Used the same values from `task add_new_identifier_schemes` and `task contextualize_identifier_schemes` (see `lib/tasks/upgrade.rake`) for adding seeds.
- However, `fundref.identifier_prefix = 'https://doi.org/10.13039/'` doesn't seem to be correct in `lib/tasks/upgrade.rake`.
  - "https://api.crossref.org/funders/" is used in DMP Assistant's DB, so that was used instead.
@aaronskiba
Contributor Author

Hi @johnpinto1, I finally added the "ror" and "fundref" IdentifierSchemes to db/seeds.rb.

Development

Successfully merging this pull request may close these issues.

Org profiles with ROR/Fundref
URI::InvalidURIError: Raised When Querying Orgs With Non-ASCII Characters

2 participants