SLiM needs a place to put mutation metadata #569

@bhaller

Since the dawn of tree-sequence recording in SLiM, we've been putting metadata about mutations (selection coefficient, mutation type id, origin subpopulation, origin tick, nucleotide) into the derived state information. That has never been great, for three reasons.

First, since the derived state has to be ASCII, we have to convert all this metadata to ASCII on the way out, and back to binary on the way back in, which is slow and messy.

Second, since SLiM can "stack" mutations, creating a new derived state that includes the previous mutation at the site as well as a new mutation "stacked" on top of it, we end up re-emitting all of this information whenever that happens (which is rare in many models, but commonplace in some). So we end up with multiple copies of what might be duplicate information about a given mutation; or it might not even be duplicate information, since the state of the mutation might have been changed in the meantime.

Third, the final state of the mutation is not captured; if the user changes the mutation type of a mutation and then writes out, that state change is (I'm 99% sure?) not captured, since no new derived state gets recorded.

In all, we've got a bloated, inefficient, confusing mess.
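To make the round-trip and duplication problems concrete, here is a minimal Python sketch. The field widths, the colon-separated text form, and all names (`RECORD`, `to_ascii`, `from_ascii`) are assumptions made up for illustration; this is not SLiM's actual encoding.

```python
import struct

# Hypothetical fixed-width layout (NOT SLiM's actual format): float32 selection
# coefficient, int32 mutation type id, int32 origin subpopulation,
# int32 origin tick, int8 nucleotide (-1 meaning "none").
RECORD = struct.Struct("<fiiib")


def to_ascii(mut):
    """Encode one mutation's fields as colon-separated ASCII text."""
    return ":".join(repr(field) for field in mut)


def from_ascii(text):
    """Parse the text form back into typed fields."""
    s, mtype, subpop, tick, nuc = text.split(":")
    return (float(s), int(mtype), int(subpop), int(tick), int(nuc))


old = (-0.015625, 1, 0, 100, -1)  # mutation already present at the site
new = (0.25, 2, 0, 250, -1)       # mutation stacked on top of it later

# Each derived state lists every mutation present, so the stacked state
# repeats the older mutation's fields in full.
first_state = to_ascii(old)
stacked_state = ",".join(to_ascii(m) for m in (old, new))
copies_of_old = sum(
    state.count(to_ascii(old)) for state in (first_state, stacked_state)
)

# The binary form needs no text formatting or parsing at all.
packed = RECORD.pack(*old)
```

In this toy setup the older mutation's fields are written twice (once per derived state), and if its state had changed between the two events, the two copies would disagree.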

But we've been limping along with this. The straw that breaks the camel's back is that I am now shifting SLiM to support multiple traits. A given mutation can have pleiotropic effects on any/all traits in the model, which means that it needs to have separate selection and dominance coefficients for each trait in the model. So the potential amount of metadata per mutation, and the complexity of it, is about to jump upward sharply. Even in present-day SLiM people have been doing models with something like six or eight traits (with zero support in tree-sequence recording for that, and close to zero support in SLiM for it too!); once doing this is properly supported in SLiM, I expect people will do models with even more traits, and they will want that to work with tree-sequence recording. So we clearly need a new design.
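As a rough illustration of how per-trait coefficients change the shape of the data, the sketch below packs one mutation as a fixed header followed by one (selection, dominance) pair per trait. The layout, field widths, and function names are hypothetical, chosen only to show that the record becomes variable-length once the trait count is a model parameter.

```python
import struct

# Hypothetical header: int32 mutation type id, int32 origin subpopulation,
# int32 origin tick, int8 nucleotide, uint16 number of traits.
HEADER = struct.Struct("<iiibH")
# Hypothetical per-trait payload: float32 selection coeff, float32 dominance coeff.
EFFECT = struct.Struct("<ff")


def pack_mutation(mtype, subpop, tick, nuc, effects):
    """Pack one mutation; `effects` holds one (selection, dominance) pair per trait."""
    body = b"".join(EFFECT.pack(s, h) for s, h in effects)
    return HEADER.pack(mtype, subpop, tick, nuc, len(effects)) + body


def unpack_mutation(buf):
    """Inverse of pack_mutation; the trait count is read from the header."""
    mtype, subpop, tick, nuc, n_traits = HEADER.unpack_from(buf, 0)
    effects = [
        EFFECT.unpack_from(buf, HEADER.size + i * EFFECT.size)
        for i in range(n_traits)
    ]
    return mtype, subpop, tick, nuc, effects


# A mutation with pleiotropic effects on three traits.
three_traits = [(0.5, 0.25), (-0.125, 1.0), (0.0, 0.5)]
record = pack_mutation(1, 0, 100, -1, three_traits)
```

Under these assumptions the record grows by 8 bytes per trait, and every extra copy of a mutation (as with stacking) pays that cost again.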

What we need is a proper place to put mutation metadata: one that can support binary data of arbitrary form, that can be written out at the end of a run (capturing current state) rather than at the moment a new derived state is created (capturing state that might be stale later on), and that can capture each mutation once rather than glomming things up with duplicate copies (as the derived state column does, for SLiM with stacking). And the question is where that table of metadata can live, in a .trees file, and how we write it out and read it in.
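The three requirements (opaque binary records, written at the end of the run, one record per mutation) could be sketched as follows. `MutationRegistry`, `snapshot`, and `lookup` are hypothetical names, and the flat blob plus offsets layout is just one possible choice, similar in spirit to the offset-indexed ragged columns that tskit tables already use; where such a table would live in a .trees file is exactly the open question.

```python
import bisect


class MutationRegistry:
    """Hypothetical in-memory store holding one live record per mutation id."""

    def __init__(self):
        self._records = {}

    def set(self, mutation_id, record):
        """Create or overwrite the record; only the latest state is kept."""
        self._records[mutation_id] = bytes(record)

    def snapshot(self):
        """Flatten to (ids, offsets, blob) at write-out time."""
        ids = sorted(self._records)
        offsets = [0]
        blob = bytearray()
        for mutation_id in ids:
            blob += self._records[mutation_id]
            offsets.append(len(blob))
        return ids, offsets, bytes(blob)


def lookup(table, mutation_id):
    """Fetch one mutation's bytes from a snapshot by binary search on the ids."""
    ids, offsets, blob = table
    i = bisect.bisect_left(ids, mutation_id)
    if i == len(ids) or ids[i] != mutation_id:
        raise KeyError(mutation_id)
    return blob[offsets[i]:offsets[i + 1]]


registry = MutationRegistry()
registry.set(7, b"type=1")
registry.set(3, b"type=2;extra")  # records may have different lengths
registry.set(7, b"type=4")        # state changed mid-run; the old copy is replaced
table = registry.snapshot()       # taken once, at the end of the run
```

Because the snapshot is taken at write-out, mutation 7 appears once, with its final state, however many derived states mentioned it along the way.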

We could approach this as a general problem – maybe it makes sense for tskit to provide a general strategy for attaching metadata to mutations (without having to wedge it in as ASCII data in the derived state column), or maybe it makes sense for tskit to provide a general strategy for client code to add their own tables to a .trees file, and read those tables back in. Or we could approach this as a SLiM-specific problem – maybe this is really only SLiM's problem and nobody else's, in which case a general tskit-level solution is perhaps not needed, so then maybe we just figure out how to work at the kastore level on our own and write/read our own table behind tskit's back. Or maybe we even write out a separate file – a .slimtrees file – side by side with the .trees file, such that the pair of files are associated and the pair is needed to correctly read anything useful back in.

I don't know which avenue is better. To me, at least minimal support in tskit for custom client tables seems like it would make sense, to make the .trees data structure extensible. That would probably prove useful to lots of people in lots of ways going forward. But if that is a no-go for some reason, then we need to figure out how SLiM is going to hack this in without tskit support, I guess.

So, let's talk about this. I've raised this issue at points in the past, but it has seemed like a distant problem and has not gotten solved; now it has become rather urgent, since I'm doing the multitrait work in SLiM now and I need a solution. Seems like this is of interest to @petrelharp @jeromekelleher @benjeffery @hyanwong and probably others. I wasn't sure whether to make this issue on SLiM or on tskit; feel free to make a companion issue on the tskit side if it seems like that makes sense, but to avoid confusion let's either discuss here or there, not both. :->

Labels: bug, priority, trees (related to tree-seq, tskit, etc.)
