Scope
filter, frequencies, subsample
Description
Take this example dataset:
echo -e 'strain\tdate
SEQ1\t2019-01-01
SEQ2\t2020-01-01
SEQ3\t2020-12-31
SEQ4\t2021-01-01
SEQ5\t2021-12-31
SEQ6\t2022-01-01
' > metadata.tsv
--min-date is inclusive. With --min-date 2020, both 2020-01-01 and 2020-12-31 pass as expected:
augur filter \
--metadata metadata.tsv \
--min-date 2020 \
--output-metadata filtered.tsv
# strain date
# SEQ3 2020-12-31
# SEQ4 2021-01-01
# SEQ2 2020-01-01
# SEQ6 2022-01-01
# SEQ5 2021-12-31
However, --max-date is not inclusive in the same way. With --max-date 2021, both 2021-01-01 and 2021-12-31 are expected to pass, but they are filtered out instead:
augur filter \
--metadata metadata.tsv \
--max-date 2021 \
--output-metadata filtered.tsv
# strain date
# SEQ1 2019-01-01
# SEQ3 2020-12-31
# SEQ2 2020-01-01
Reason
In --max-date 2021, the value 2021 gets evaluated as 2021.0 by the type converter function numeric_date (augur/augur/dates.py, lines 30 to 32 in c264580):
# date is numeric
try:
    return float(date)
and that value is used as max_date here (augur/augur/filter.py, lines 332 to 333 in c264580):
if max_date:
    filtered = {s for s in filtered if (np.isscalar(dates[s]) or all(dates[s])) and np.min(dates[s]) <= max_date}
This means that <= max_date effectively behaves as < 2021: even the earliest date in 2021, 2021-01-01, converts to roughly 2021.001, which is greater than 2021.0.
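The comparison can be reproduced outside of augur with a small sketch. The to_numeric helper below is hypothetical: it assumes a numeric date is the year plus the fraction of the year elapsed at midday, which matches the ~2021.001 figure above but is not necessarily the exact formula augur uses.

```python
import calendar
from datetime import date


def to_numeric(d: date) -> float:
    """Hypothetical conversion: year + fraction of the year elapsed at midday."""
    days_in_year = 366 if calendar.isleap(d.year) else 365
    return d.year + (d.timetuple().tm_yday - 0.5) / days_in_year


# The example dataset from the description.
metadata = {
    "SEQ1": date(2019, 1, 1),
    "SEQ2": date(2020, 1, 1),
    "SEQ3": date(2020, 12, 31),
    "SEQ4": date(2021, 1, 1),
    "SEQ5": date(2021, 12, 31),
    "SEQ6": date(2022, 1, 1),
}

# A year-only value converted with float(), as in the excerpt above.
max_date = float("2021")

# Same comparison as the filter: keep records whose numeric date is <= max_date.
kept = sorted(s for s, d in metadata.items() if to_numeric(d) <= max_date)

print(to_numeric(date(2021, 1, 1)) > max_date)  # True: 2021-01-01 is already past 2021.0
print(kept)  # ['SEQ1', 'SEQ2', 'SEQ3']
```

Under this assumption every date in 2021 maps to a value strictly greater than 2021.0, so nothing from 2021 can pass the <= check.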
Possible solution:
This has already been solved in #854. Two parts:
- Treat 2021 as 2021-XX-XX (augur/augur/dates.py, lines 123 to 133 in 110af66):
    # Absolute date in numeric format.
    if RE_NUMERIC_DATE.match(date_in):
        return float(date_in)

    # Absolute date in potentially incomplete/ambiguous ISO 8601 date format.
    if (RE_ISO_8601_DATE.match(date_in) or
        RE_AMBIGUOUS_ISO_8601_DATE.match(date_in) or
        RE_AMBIGUOUS_ISO_8601_DATE_YEAR_MONTH.match(date_in) or
        RE_YEAR_ONLY.match(date_in)
        ):
        return iso_to_numeric(date_in, ambiguity_resolver)
- Use different type converters for --min-date and --max-date, taking the minimum of the ambiguity for --min-date and the maximum for --max-date (augur/augur/filter.py, lines 22 to 23 in 110af66):
    metadata_filter_group.add_argument('--min-date', type=any_to_numeric_type_min, help="minimal cutoff for date, the cutoff date is inclusive; may be specified as an Augur-style numeric date (with the year as the integer part) or YYYY-MM-DD")
    metadata_filter_group.add_argument('--max-date', type=any_to_numeric_type_max, help="maximal cutoff for date, the cutoff date is inclusive; may be specified as an Augur-style numeric date (with the year as the integer part) or YYYY-MM-DD")
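The two parts can be illustrated together with a minimal, standalone sketch. It is not the code from #854: resolve, date_min and date_max are hypothetical names, the sketch compares datetime.date objects instead of numeric dates, and it does not handle Augur-style numeric input such as 2021.5. It only shows the idea of resolving an incomplete date to its earliest day for --min-date and its latest day for --max-date through separate argparse type converters.

```python
import argparse
import calendar
import re
from datetime import date

RE_YEAR = re.compile(r"^\d{4}$")
RE_YEAR_MONTH = re.compile(r"^\d{4}-\d{2}$")


def resolve(value: str, bound: str) -> date:
    """Resolve a possibly incomplete date to its earliest ('min') or latest ('max') day."""
    if RE_YEAR.match(value):
        year = int(value)
        return date(year, 1, 1) if bound == "min" else date(year, 12, 31)
    if RE_YEAR_MONTH.match(value):
        year, month = map(int, value.split("-"))
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1 if bound == "min" else last_day)
    return date.fromisoformat(value)  # complete YYYY-MM-DD


def date_min(value: str) -> date:
    """Hypothetical converter for --min-date: lower end of the ambiguity."""
    return resolve(value, "min")


def date_max(value: str) -> date:
    """Hypothetical converter for --max-date: upper end of the ambiguity."""
    return resolve(value, "max")


parser = argparse.ArgumentParser()
parser.add_argument("--min-date", type=date_min)
parser.add_argument("--max-date", type=date_max)
args = parser.parse_args(["--max-date", "2021"])

metadata = {
    "SEQ1": date(2019, 1, 1),
    "SEQ2": date(2020, 1, 1),
    "SEQ3": date(2020, 12, 31),
    "SEQ4": date(2021, 1, 1),
    "SEQ5": date(2021, 12, 31),
    "SEQ6": date(2022, 1, 1),
}

kept = sorted(s for s, d in metadata.items() if d <= args.max_date)
print(args.max_date)  # 2021-12-31
print(kept)  # ['SEQ1', 'SEQ2', 'SEQ3', 'SEQ4', 'SEQ5']
```

With the upper bound resolved to 2021-12-31, both SEQ4 and SEQ5 pass, which is the inclusive behavior expected from --max-date 2021.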
Your environment: if running Nextstrain locally