curate format-dates: Support date ranges #2001

Merged
victorlin merged 2 commits into master from victorlin/ncbi-datasets-date-ranges on May 19, 2026

Conversation

@victorlin (Member) commented May 18, 2026

Description of proposed changes

This PR contains 1 prep commit + 1 main commit. Message from main commit:

This refactors code to handle date ranges, starting with Augur's standard format and the format of values newly emitted by NCBI Datasets.

Related issue(s)

Closes #2000

Checklist

  • Automated checks pass
  • Check if you need to add a changelog message
  • Check if you need to add tests
  • Check if you need to update docs

Update the format_date() doctest examples to pass either the default
formats or a single custom format per example.

This makes the intended parser path explicit for custom format examples.
@victorlin victorlin self-assigned this May 18, 2026
@victorlin victorlin force-pushed the victorlin/ncbi-datasets-date-ranges branch from c7c9569 to c33195f Compare May 18, 2026 20:42
codecov Bot commented May 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.93%. Comparing base (5ac3b06) to head (52a160b).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2001      +/-   ##
==========================================
+ Coverage   72.90%   72.93%   +0.03%     
==========================================
  Files          85       85              
  Lines       10714    10728      +14     
  Branches     2096     2100       +4     
==========================================
+ Hits         7811     7825      +14     
  Misses       2537     2537              
  Partials      366      366              

Comment thread augur/curate/format_dates.py Outdated
Comment on lines +20 to +24
# NCBI Datasets can return dates in '[YYYY TO YYYY]' format.
# There is no mention of this in Datasets docs, but it is documented
# for NCBI Pathogen Detection which may be related.
# <https://www.ncbi.nlm.nih.gov/pathogens/pathogens_help/#range-searches>
NCBI_DATASETS_RANGE = '[%Y TO %Y]'
@joverlee521 (Contributor) left a comment

Thanks for getting to this so quickly @victorlin! I think it would be good to wait on confirmation from NCBI on the expected date formats before releasing.

Comment thread augur/curate/format_dates.py Outdated
Comment thread augur/curate/format_dates.py Outdated
Comment on lines +162 to +168
# NCBI Datasets date ranges can't be parsed by strptime, so handle them
# with regex.
if date_format == NCBI_DATASETS_RANGE:
    if match := RE_NCBI_DATASETS_RANGE.match(date_string):
        start_year, end_year = match.groups()
        # Return a format compatible with RE_DATE_RANGE in augur.dates
        return f"{start_year}-01-01/{end_year}-12-31"
Contributor

non-blocking, leaving as future improvement

Ah, I see strptime will not work for any date ranges... What if we added a new CLI option to support ranges (e.g. --expected-range-patterns)?

Not sure how validation would work, maybe require the pattern to have two groups? Loop through patterns and parse the start and end date separately. The start should be simple with default date from strptime being the start of the month/year. The end would need munging to return the appropriate last day of the month/year.
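
The approach suggested in this comment could be sketched roughly as follows. This is only an illustration of the idea, not code from the PR: `parse_range` and its parameters are hypothetical, the proposed `--expected-range-patterns` option is not implemented here, and the precision check assumes the format uses only the `%Y`, `%m`, and `%d` directives.

```python
import calendar
import re
from datetime import datetime
from typing import Optional


def parse_range(date_string: str, range_pattern: str, date_format: str) -> Optional[str]:
    """Hypothetical helper: parse a range as 'start/end' in ISO 8601 dates.

    range_pattern must be a regex with exactly two groups (start, end).
    date_format is a strptime format applied to each group.
    """
    compiled = re.compile(range_pattern)
    # Validation idea from the comment: require exactly two groups.
    if compiled.groups != 2:
        raise ValueError("range pattern must have exactly two groups")

    match = compiled.fullmatch(date_string)
    if match is None:
        return None

    try:
        start = datetime.strptime(match.group(1), date_format)
        end = datetime.strptime(match.group(2), date_format)
    except ValueError:
        return None

    # strptime defaults missing fields to month=1 and day=1, which is
    # already the correct lower bound for the start date. The end date
    # needs to be moved to the last day of the year or month.
    if "%m" not in date_format:
        end = end.replace(month=12, day=31)
    elif "%d" not in date_format:
        last_day = calendar.monthrange(end.year, end.month)[1]
        end = end.replace(day=last_day)

    return f"{start:%Y-%m-%d}/{end:%Y-%m-%d}"


print(parse_range("[2020 TO 2022]", r"\[(.+) TO (.+)\]", "%Y"))
print(parse_range("2020-02/2021-02", r"(.+)/(.+)", "%Y-%m"))
```

Under these assumptions, a year-only range expands to January 1 through December 31, and a year-month range ends on the last calendar day of the end month. Formats with other directives (for example `%b`) would need a more careful precision check.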

Member Author

I've refactored the code and left a note:

# TODO: make BUILTIN_RANGE_FORMATS extendable like --expected-date-formats

But yes, supporting custom ranges will be much more involved with the need for regex matching and handling various cases for start/end.

This refactors code to handle date ranges, starting with Augur's
standard format and the format of values newly emitted by NCBI Datasets.
@victorlin victorlin force-pushed the victorlin/ncbi-datasets-date-ranges branch from c33195f to 52a160b Compare May 19, 2026 00:40
@victorlin victorlin changed the title from "curate format-dates: Support NCBI Datasets date ranges" to "curate format-dates: Support date ranges" May 19, 2026
@victorlin (Member Author)

I think it would be good to wait on confirmation from NCBI on the expected date formats before releasing.

I've updated the implementation so that there is less hardcoding around NCBI Datasets. Since we're already seeing [%Y TO %Y] in some data, I doubt it's going away. We could merge+release now and add other range formats as needed.

@victorlin (Member Author)

Spot-checked pathogen-repo-ci (rabies) artifact output-rabies/ingest/results/metadata.tsv to confirm that [2020 TO 2022] gets translated to 2020-01-01/2022-12-31:

image
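
The translation spot-checked above can be reproduced in isolation with a small standalone sketch. The regex and the helper name `ncbi_range_to_augur` are assumptions for illustration; the actual definition of RE_NCBI_DATASETS_RANGE is not shown in this thread, and the merged implementation was refactored to be less NCBI-specific.

```python
import re
from typing import Optional

# Assumed pattern for '[YYYY TO YYYY]'; not copied from the PR.
RE_YEAR_RANGE = re.compile(r"^\[(\d{4}) TO (\d{4})\]$")


def ncbi_range_to_augur(date_string: str) -> Optional[str]:
    """Hypothetical helper: convert '[YYYY TO YYYY]' to 'YYYY-01-01/YYYY-12-31'."""
    if match := RE_YEAR_RANGE.match(date_string):
        start_year, end_year = match.groups()
        # Expand to the first day of the start year and the last day of the end year.
        return f"{start_year}-01-01/{end_year}-12-31"
    # Anything else is left for other formats to handle.
    return None


print(ncbi_range_to_augur("[2020 TO 2022]"))  # 2020-01-01/2022-12-31
print(ncbi_range_to_augur("2020-05"))         # None
```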

@victorlin victorlin merged commit 5cb2bff into master May 19, 2026
34 checks passed
@victorlin victorlin deleted the victorlin/ncbi-datasets-date-ranges branch May 19, 2026 17:22