Skip to content

Commit f50727f

Browse files
committed
Review comments
1 parent b758464 commit f50727f

File tree

3 files changed

+29
-14
lines changed

3 files changed

+29
-14
lines changed

docs/data-model.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -554,7 +554,7 @@ will always fail. Use `tskit.is_unknown_time` to detect unknown values.
554554

555555
#### Migration requirements
556556

557-
Given a valid set of nodes and edges, the requirements for a value set of
557+
Given a valid set of nodes and edges, the requirements for a valid set of
558558
migrations are:
559559

560560
- `left` and `right` must be finite values that lie within the tree sequence

docs/export.md

Lines changed: 26 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -382,19 +382,34 @@ id and individual id, and two underscores will throw an error.
382382
### Modifying coordinates
383383

384384
Tree sequence site positions can be floating point values, whereas VCF
385-
requires discrete, 1-based integers. The ``position_transform`` argument
386-
controls how tskit maps coordinates into VCF. Translating non-integral
385+
requires positive integers. The ``position_transform`` argument
386+
controls how tskit maps coordinates into VCF. Translating non-integer
387387
positions necessarily loses precision; by default we round to the nearest
388-
integer, so multiple sites may share the same output position. Because VCF
389-
parsers differ, we do **not** automatically shift from tskit's 0-based
390-
convention to VCF's 1-based convention.
388+
integer, so multiple sites may share the same output position.
389+
Furthermore, tskit's coordinate system starts at zero,
390+
whereas the VCF standard requires positions to be positive,
391+
and so a site at position 0 is not valid in the VCF standard.
392+
Because VCF parsers differ, we do **not** do anything to account for this.
393+
394+
The simplest resolution of this discrepancy in convention between tskit and VCF
395+
positions is deal with any site at position 0 as a special case (for instance,
396+
by discarding or ignoring it).
397+
A different interpretation of this difference between tskit's position
398+
and VCF's POS field
399+
is that they are different coordinate systems: tskit coordinates are
400+
"distance to the right of the left end of the chromosome",
401+
while VCF coordinates are "which number site, counting from the left end
402+
of the chromosome and starting at one".
403+
Under this interpretation, the solution is to supply an explicit
404+
``position_transform`` that adds 1 to the coordinate when outputting
405+
to VCF (or, using the ``"legacy"`` option described below). However, this can
406+
easily lead to off-by-one errors converting between the coordinate systems,
407+
so should only be used if you really are using 0-based coordinates for your
408+
tree sequence.
391409

392410
:::{warning}
393-
Most VCF tools expect POS and contig coordinates to be 1-based. If your tree
394-
sequence already uses discrete, 0-based integer positions, leaving the default
395-
transform will produce off-by-one coordinates in the VCF. Supply an explicit
396-
``position_transform`` (for example, add 1 after rounding) or use the
397-
``"legacy"`` option to shift to 1-based coordinates.
411+
Most VCF tools cannot deal with a POS value of 0. If your tree
412+
sequence contains a site with position 0, this will likely cause an error.
398413
:::
399414

400415
Internally, the coordinates used in the VCF output are obtained by applying
@@ -415,8 +430,7 @@ transformed position is 0, :meth:`TreeSequence.write_vcf` will raise a
415430
- mask the offending sites using the ``site_mask`` argument; or
416431
- choose a ``position_transform`` that maps 0 to a valid positive position.
417432

418-
For example, to shift all coordinates by 1 to make them strictly
419-
1-based, we could define:
433+
For example, to shift all coordinates by 1, we could define:
420434

421435
```{code-cell}
422436
def one_based_positions(positions):

docs/file-formats.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -240,7 +240,8 @@ The mutation text format must contain the columns `site`,
240240
may also be optionally present (but `parent` must be specified if
241241
more than one mutation occurs at the same site). If the `time` column
242242
is absent, the mutation times in the resulting tree sequence are set to
243-
{data}`tskit.UNKNOWN_TIME`. Unknown mutation times written out by
243+
{data}`tskit.UNKNOWN_TIME`, which is a numeric value that behaves like NaN.
244+
Unknown mutation times written out by
244245
{meth}`TreeSequence.dump_text` are represented in the text file by the
245246
literal string ``\"unknown\"`` in the `time` column, and
246247
{func}`tskit.load_text` treats this string as `UNKNOWN_TIME` on input.

0 commit comments

Comments
 (0)