@@ -382,19 +382,34 @@ id and individual id, and two underscores will throw an error.
382382### Modifying coordinates
383383
384384Tree sequence site positions can be floating point values, whereas VCF
385- requires discrete, 1-based integers. The `` position_transform `` argument
386- controls how tskit maps coordinates into VCF. Translating non-integral
385+ requires positive integers. The `` position_transform `` argument
386+ controls how tskit maps coordinates into VCF. Translating non-integer
387387positions necessarily loses precision; by default we round to the nearest
388- integer, so multiple sites may share the same output position. Because VCF
389- parsers differ, we do ** not** automatically shift from tskit's 0-based
390- convention to VCF's 1-based convention.
388+ integer, so multiple sites may share the same output position.
389+ Furthermore, tskit's coordinate system starts at zero,
390+ whereas the VCF standard requires positions to be positive,
391+ and so a site at position 0 is not valid in the VCF standard.
392+ Because VCF parsers differ, we do ** not** do anything to account for this.
393+
394+ The simplest resolution of this discrepancy in convention between tskit and VCF
395+ positions is deal with any site at position 0 as a special case (for instance,
396+ by discarding or ignoring it).
397+ A different interpretation of this difference between tskit's position
398+ and VCF's POS field
399+ is that they are different coordinate systems: tskit coordinates are
400+ "distance to the right of the left end of the chromosome",
401+ while VCF coordinates are "which number site, counting from the left end
402+ of the chromosome and starting at one".
403+ Under this interpretation, the solution is to supply an explicit
404+ `` position_transform `` that adds 1 to the coordinate when outputting
405+ to VCF (or, using the `` "legacy" `` option described below). However, this can
406+ easily lead to off-by-one errors converting between the coordinate systems,
407+ so should only be used if you really are using 0-based coordinates for your
408+ tree sequence.
391409
392410:::{warning}
393- Most VCF tools expect POS and contig coordinates to be 1-based. If your tree
394- sequence already uses discrete, 0-based integer positions, leaving the default
395- transform will produce off-by-one coordinates in the VCF. Supply an explicit
396- `` position_transform `` (for example, add 1 after rounding) or use the
397- `` "legacy" `` option to shift to 1-based coordinates.
411+ Most VCF tools cannot deal with a POS value of 0. If your tree
412+ sequence contains a site with position 0, this will likely cause an error.
398413:::
399414
400415Internally, the coordinates used in the VCF output are obtained by applying
@@ -415,8 +430,7 @@ transformed position is 0, :meth:`TreeSequence.write_vcf` will raise a
415430- mask the offending sites using the `` site_mask `` argument; or
416431- choose a `` position_transform `` that maps 0 to a valid positive position.
417432
418- For example, to shift all coordinates by 1 to make them strictly
419- 1-based, we could define:
433+ For example, to shift all coordinates by 1, we could define:
420434
421435``` {code-cell}
422436def one_based_positions(positions):
0 commit comments