curate format-dates: Support date ranges #2001
Update the format_date() doctest examples to pass either the default formats or a single custom format per example. This makes the intended parser path explicit for custom format examples.
c7c9569 to c33195f
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff

| Metric | master | #2001 | +/- |
| --- | --- | --- | --- |
| Coverage | 72.90% | 72.93% | +0.03% |
| Files | 85 | 85 | |
| Lines | 10714 | 10728 | +14 |
| Branches | 2096 | 2100 | +4 |
| Hits | 7811 | 7825 | +14 |
| Misses | 2537 | 2537 | |
| Partials | 366 | 366 | |
# NCBI Datasets can return dates in '[YYYY TO YYYY]' format.
# There is no mention of this in Datasets docs, but it is documented
# for NCBI Pathogen Detection which may be related.
# <https://www.ncbi.nlm.nih.gov/pathogens/pathogens_help/#range-searches>
NCBI_DATASETS_RANGE = '[%Y TO %Y]'
joverlee521
left a comment
Thanks for getting to this so quickly @victorlin! I think it would be good to wait on confirmation from NCBI on the expected date formats before releasing.
# NCBI Datasets date ranges can't be parsed by strptime, so handle them
# with regex.
if date_format == NCBI_DATASETS_RANGE:
    if match := RE_NCBI_DATASETS_RANGE.match(date_string):
        start_year, end_year = match.groups()
        # Return a format compatible with RE_DATE_RANGE in augur.dates
        return f"{start_year}-01-01/{end_year}-12-31"
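For readers without the full diff, here is a minimal standalone sketch of the conversion in the snippet above. The definition of `RE_NCBI_DATASETS_RANGE` is not shown in this excerpt, so the pattern `RE_YEAR_RANGE` and the helper name `year_range_to_interval` below are assumptions for illustration, not the PR's actual code.

```python
import re

# Assumed pattern: the PR's real RE_NCBI_DATASETS_RANGE is not shown here.
RE_YEAR_RANGE = re.compile(r'^\[(\d{4}) TO (\d{4})\]$')


def year_range_to_interval(date_string):
    """Convert '[YYYY TO YYYY]' to 'YYYY-01-01/YYYY-12-31'.

    Returns None when the string does not look like a year range.
    (Hypothetical helper, not part of Augur.)
    """
    match = RE_YEAR_RANGE.match(date_string)
    if match is None:
        return None
    start_year, end_year = match.groups()
    # Widen the range to whole years: first day of the start year
    # through the last day of the end year.
    return f"{start_year}-01-01/{end_year}-12-31"


print(year_range_to_interval('[2019 TO 2021]'))
```

Returning `None` for non-matching input is a choice made for this sketch only; how the PR handles a failed match is not visible in the excerpt.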
non-blocking, leaving as future improvement
Ah, I see strptime will not work for any date ranges. What if we added a new CLI option to support ranges (e.g. --expected-range-patterns)?
Not sure how validation would work; maybe require the pattern to have two groups? We could loop through the patterns and parse the start and end dates separately. The start should be simple, since the default date from strptime is the start of the month/year. The end would need munging to return the appropriate last day of the month/year.
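A rough sketch of how the idea in this comment might look, purely as an illustration. Nothing here exists in Augur: `parse_range`, `end_of_period`, and the pairing of a two-group regex with a single strptime format are all hypothetical, and `--expected-range-patterns` is only the option name floated above.

```python
import calendar
import re
from datetime import date, datetime


def end_of_period(parsed, date_format):
    """Move a strptime result to the last day of the period it names.

    Hypothetical helper: inspects the format to decide the resolution.
    """
    if '%d' in date_format:
        return parsed  # day was given explicitly
    if any(code in date_format for code in ('%m', '%b', '%B')):
        last = calendar.monthrange(parsed.year, parsed.month)[1]
        return date(parsed.year, parsed.month, last)
    return date(parsed.year, 12, 31)  # year-only resolution


def parse_range(date_string, range_pattern, date_format):
    """Parse a range using a regex with exactly two groups (start, end).

    Returns 'YYYY-MM-DD/YYYY-MM-DD', or None if the regex does not match.
    """
    regex = re.compile(range_pattern)
    if regex.groups != 2:
        raise ValueError("range pattern must have exactly two groups")
    match = regex.fullmatch(date_string)
    if match is None:
        return None
    start_text, end_text = match.groups()
    # strptime fills missing fields with the start of the month/year.
    start = datetime.strptime(start_text, date_format).date()
    end = end_of_period(datetime.strptime(end_text, date_format).date(), date_format)
    return f"{start.isoformat()}/{end.isoformat()}"


print(parse_range('2020-01/2020-02', r'(\d{4}-\d{2})/(\d{4}-\d{2})', '%Y-%m'))
```

A real option would also need to loop over several patterns and decide how users pair each regex with a date format, which is part of why this was left as a future improvement.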
I've refactored the code and left a note:
augur/augur/curate/format_dates.py
Line 241 in 52a160b
But yes, supporting custom ranges will be much more involved with the need for regex matching and handling various cases for start/end.
This refactors code to handle date ranges, starting with Augur's standard format and the format of values newly emitted by NCBI Datasets.
c33195f to 52a160b
I've updated the implementation so that there is less hardcoding around NCBI Datasets. Since we're already seeing |
Spot-checked pathogen-repo-ci (rabies) artifact output-rabies/ingest/results/metadata.tsv to confirm that

Description of proposed changes
This PR contains 1 prep commit + 1 main commit. Message from main commit:
This refactors code to handle date ranges, starting with Augur's standard format and the format of values newly emitted by NCBI Datasets.
Related issue(s)
Closes #2000
Checklist