-
Notifications
You must be signed in to change notification settings - Fork 190
[ENH] BEP036 - Phenotypic Data Guidelines #2123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Upstream PR
Quick update before merging our PR on surchs fork
BEP036 brings guidelines for best tabular phenotypic data to the BIDS specification. - Includes an appendix called `phenotype.md` - Includes admonitions for the guidelines in-line with modality agnostic files sections --------- Co-authored-by: Eric Earl <[email protected]> Co-authored-by: Samuel Guay <[email protected]> Co-authored-by: Sebastian Urchs <[email protected]> Co-authored-by: Arshitha B <[email protected]>
Changed "e.g." to "for example" to follow contributing style guidelines.
for more information, see https://pre-commit.ci
src/appendices/phenotype.md
Outdated
| each `phenotype/<measurement_tool_name>.json` data dictionary. | ||
| This improves reusability and provides clarity about the measurement tool. | ||
|
|
||
| ### 5. Use the demographics file for common variables about participants |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copying from https://github.com/surchs/bids-specification/pull/1/files#r2103117486
For this section, would it make sense to suggest that demo-like information be prioritized in this file rather than participants.tsv, making the latter primarily a list of subject IDs? I haven't seen this explicitly addressed anywhere, though I'm unsure if it's something we want to formalize 😬
Something like this could follow the paragraph?:
When all demographic data is stored in
phenotype/demographics.tsv,participants.tsvmay serve primarily as a minimal listing of subject identifiers with only theparticipant_idcolumn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. It'd be good to mention this.
Put the phenotypic and assessment data content where it belongs.
Attempt to address more of @surchs comments.
Thanks for catching that excess newline, remark!
Remove acq_time as a phenotype column recommendation/option, as it should go into the sessions file instead.
Remove acq_time__phenotype from columns.yaml since it was removed from the rest of the schema.
Accept Sebastian's suggestion about the phrasing of guideline 8. Co-authored-by: Sebastian Urchs <[email protected]>
for more information, see https://pre-commit.ci
Changing "subject-level" to "participant-level" in sessions files section.
To better differentiate demographic data from phenotypic data
Made changes to align with final feedback prior to community review.
|
@effigies @rwblair Here is a blurb for the community review period to make announcements easier. If edits are needed, I will apply them directly to this comment before tomorrow. Community Review: BEP036 - Phenotypic Data GuidelinesWe are pleased to announce the community review period for BIDS Extension Proposal (BEP) 036! BEP036 extends the BIDS standard to include an appendix with tabular phenotypic data guidelines you can opt into for the BIDS validator. We have developed the extension to allow everyone to follow good practices in preparing their tabular phenotypic data. Additionally, this BEP introduces the ability to include
To view the file differences in either pull request, click the "Files changed" tab. |
This is logically equivalent to "the
This is extremely difficult to do without requiring a root-level
The bolded text is not doable in the current schema. This would need access to all the (subject, session) pairs in
Unvalidatable and ambiguous. I think this should just piggy-back off of common principles:
|
src/appendices/phenotype.md
Outdated
|
|
||
| Aggregate participant information across all sessions into one tabular TSV file per | ||
| measurement or phenotypic assessment and store this file in the `/phenotype` directory. | ||
| Demographic information is a special case and MUST be aggregated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of right now there are suggestions of what counts as demographic data, from a validation perspective this is hard to enforce without specific field names being listed in the schema. My interpretation is that these are then to become forbidden columns in any pheno/*.tsv? Are there any other demographic fields we'd like to enforce that on beyond sex age, gender, race, household_income?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a growing list of specific field names considered common demographics would be great: sex, age, gender, race, and income for starters. Though perhaps that should be a validator WARNING and not an ERROR, so I will de-escalate that "MUST" to a "SHOULD".
Your comment also raises to me the thought of "does the validator check for the presence of duplicate-named columns across tabular data"? While I don't think it's a good idea to duplicate column names, it might happen sometimes and should raise a WARNING to encourage people to de-duplicate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"does the validator check for the presence of duplicate-named columns across tabular data"?
It does not.
it might happen sometimes and should raise a WARNING to encourage people to de-duplicate.
Our problem remains that we emit so many warnings that people simply stop reading them.
| measurement or phenotypic assessment and store this file in the `/phenotype` directory. | ||
| Demographic information is a special case and MUST be aggregated | ||
| in the `participants.tsv` file at the root level of the dataset. | ||
| It is RECOMMENDED to use the `age` column in the `participants.tsv` file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically we could validate the appropriate age being used in each session based on the relative acq_times if present but I don't think its worth the effort. Maybe monotonically increasing age like schema.rules.checks.mri.VolumeTimingNotMonotonicallyIncreasing would be a compromise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it's not worth the effort. That and we can't rely on the age monotonically increasing as some sessions may be close enough to not affect the reported age.
|
|
||
| ### 3. Add `MeasurementToolMetadata` to each tabular phenotypic measurement tool | ||
|
|
||
| Whenever possible, it is RECOMMENDED to add `MeasurementToolMetadata` to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not an issue for this bep: In this and in the main phenotype article its implied that every tsv in the phenotype directory is a "Measurement Tool", but never explicitly stated that this is the only kind of tsv. Gave me pause when reviewing this, but it may be obvious to everyone else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the only permitted files in the phenotype folder are the measurement tool's TSVs and JSONs. Do you have an idea of where a sentence in the spec might help clear this up for other folks with the same experience you had there?
src/appendices/phenotype.md
Outdated
| - If more than one of the same measurement tool is acquired within | ||
| the same `session_id`, a `run_id` column MUST be added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| - If more than one of the same measurement tool is acquired within | |
| the same `session_id`, a `run_id` column MUST be added. | |
| - If a measurement tool is acquired multiple times within a single session, a `run_id` column must be added to disambiguate the separate acquisitions. |
Note: This MUST is implicitly enforced by the combined index columns for phenotype tsv, If multiple results are acquired for the same subject and session with no run_id column the index check will error out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the edit! Implementing it now. Thanks!
src/appendices/phenotype.md
Outdated
| - Encoding the acquisition time for a measurement tool’s `session_id`, | ||
| is RECOMMENDED. This information MUST be stored in the `sessions.tsv` | ||
| file at the root level of the dataset in the `acq_time` column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@effigies mentioned this as "This is logically equivalent to "the acq_time column MUST NOT appear in a phenotype TSV file", but it takes some thinking about to get there. The spec should just say that."
I agree with the explicit "MUST NOT", But it also goes a step further in enforcing a root level sessions.tsv.
| The combination of values in the `participant_id`, `session_id`, and `run_id` (if present) | ||
| columns MUST be unique for the entire tabular file. | ||
|
|
||
| ### 5. Store demographic data in the participants file and instrument data in the phenotype directory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mentioned in ### 1. Aggregate data across sessions, moving the two closer together or combining them would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed those lines 78 and 79 because they are covered in the spec table by the schema.
| Create one tabular file for each instrument | ||
| in the phenotypic and assessment data directory. | ||
|
|
||
| ### 6. Record participant properties in the participants file and session properties in the sessions file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phenotypes aren't properties of participants? ;)
https://bids-specification.readthedocs.io/en/stable/modality-agnostic-files/data-summary-files.html#sessions-file
For pathology states When different from healthy, pathology SHOULD be specified.
Should this entry be taken as overriding the main spec, and this field should go in participants.tsv instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tabular phenotypic data from measurement tools in the /phenotype directory are the results of specific measurement tools. In fact, it's interesting that sometimes the "participant properties" collated into the participants.tsv file may be aggregated from separate interviews/screenings/measurement tools.
I hadn't seen that pathology line before. Thanks for finding it. Not sure what to do about that... @surchs @SamGuay @Arshitha ?
I suppose this BEP could make the additional validation opt-in override that sentence of the main spec?
| Properties of participants MAY include things like | ||
| age, sex, race, or household income. | ||
| Properties of sessions MAY include things like | ||
| acquisition time, measurement device properties, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like participant properties, any way to explicitly list what is session appropriate metadata in the schema will make enforcing these rules easier/allow for making strong requirement claims. There is much less consistency here, so it may not be feasible.
Doing my best to address Chris' PR comment about a few pieces.
How's this?
The point here is that the guideline would rather a curator store
If a curator opts into this additional validation, then I agree it should require a
Does that mean the validator needing "access to all the (subject, session) pairs in
I removed that guideline because you're right, the common principle there is enough. For all little edits, see: surchs@b60eac1 |
Trying to address some of @rwblair's comments on the PR.
This is a warning for regular datasets and an error for additional validation datasets. bids-specification/src/schema/rules/tabular_data/modality_agnostic.yaml Lines 103 to 131 in d8b34f3
Yes. If you want to enforce that, it's going to require some schema and validation design. |
|
Moving @yarikoptic's good comments from the BEP Google Doc to here.
@yarikoptic It took me a long time to, I think, realize what you meant. I see a little clearer now, but there's a few conflated issues to untangle here. I'll try... What I think you're saying
Below is my reasoning for the 3 things to be as-is. Please refer back to them as 1, 2, and 3 as you reference them. 1. session_id as a column available in participants.tsvChris (@effigies) pointed out to me in conversation that this achieves most of what we set out to do, and Nell/Chris/Ross seemed to agree this was not a significant technical hurdle. So I pivoted the BEP, in the hopes of closing it off sooner, to focus on the session_id column being allowed in the participants file. Ultimately the goal is to capture multi-session data that changes about participants in an aggregated tabular file because our BEP leads feel that all the segregated 2. root-level sessions file nameThis is another one I don't care a lot about. It could be called sessions.tsv or participant_sessions.tsv or broccoli.tsv. I just think that pulling the prefix of 3. root-level sessions file participant_id columnWhether people put the participant_id and session_id pair in the participants.tsv file or the sessions.tsv or both, I would rather people record it and it be permitted by the BIDS validator to do so than not to record it because they don't know where to put it. Options and repetition can be good. Let me know your thoughts or if you want to hop on a call to continue the discussion, then come back here and type out the outcomes of our discussion. |
To me the main argument to not recommend putting longitudinal demographic info about participants (e.g. "age" or "medication status") in the Another way to solve it (that we discussed originally and dismissed after I don't think we should go back to encouraging storing information about participants in
The main purpose for the root-level sessions file is to provide a place where I can see (and describe) both the "imaging" sessions (e.g. I think it'd be odd to force such a file to the subject level (i.e. to disallow |

The BEP leads can meet as-needed to discuss this BEP PR
Coordinate a meeting by emailing Eric Earl: [email protected].
Communicate on this PR to provide feedback otherwise.
HTML preview of this BEP
BEP036 brings guidelines for best tabular phenotypic data to the BIDS specification.
phenotype.mdAdditionalValidationkey for thedataset_description.json, for which the usage is described in the modality agnostic files sectionssession_idas the second column in theparticipants.tsvAdditional Links
Co-authored-by: Eric Earl [email protected] @ericearl
Co-authored-by: Samuel Guay [email protected] @SamGuay
Co-authored-by: Sebastian Urchs [email protected] @surchs
Co-authored-by: Arshitha Basavaraj [email protected] @Arshitha