[ENH] BEP036 - Phenotypic Data Guidelines #2123

ericearl · 2025-05-30T14:09:19Z

The BEP leads can meet as needed to discuss this BEP PR.

Coordinate a meeting by emailing Eric Earl: [email protected].

Communicate on this PR to provide feedback otherwise.

BEP036 brings best-practice guidelines for tabular phenotypic data to the BIDS specification.

- Includes an appendix called phenotype.md
- Includes a new AdditionalValidation key for dataset_description.json, whose usage is described in the modality agnostic files sections
- Includes the new option to store session_id as the second column in participants.tsv

Additional Links

Co-authored-by: Eric Earl [email protected] @ericearl
Co-authored-by: Samuel Guay [email protected] @SamGuay
Co-authored-by: Sebastian Urchs [email protected] @surchs
Co-authored-by: Arshitha Basavaraj [email protected] @Arshitha

Upstream PR

Quick update before merging our PR on surchs fork

BEP036 brings guidelines for best tabular phenotypic data to the BIDS specification.

- Includes an appendix called `phenotype.md`
- Includes admonitions for the guidelines in-line with modality agnostic files sections

---------

Co-authored-by: Eric Earl <[email protected]>
Co-authored-by: Samuel Guay <[email protected]>
Co-authored-by: Sebastian Urchs <[email protected]>
Co-authored-by: Arshitha B <[email protected]>

Changed "e.g." to "for example" to follow contributing style guidelines.

for more information, see https://pre-commit.ci

src/modality-agnostic-files/data-summary-files.md

surchs · 2025-05-30T14:57:31Z

src/appendices/phenotype.md

+each `phenotype/<measurement_tool_name>.json` data dictionary.
+This improves reusability and provides clarity about the measurement tool.
+
+### 5. Use the demographics file for common variables about participants


Copying from https://github.com/surchs/bids-specification/pull/1/files#r2103117486

For this section, would it make sense to suggest that demo-like information be prioritized in this file rather than participants.tsv, making the latter primarily a list of subject IDs? I haven't seen this explicitly addressed anywhere, though I'm unsure if it's something we want to formalize 😬
Something like this could follow the paragraph?:

When all demographic data is stored in phenotype/demographics.tsv, participants.tsv may serve primarily as a minimal listing of subject identifiers with only the participant_id column.

I agree. It'd be good to mention this.

src/appendices/phenotype.md

src/modality-agnostic-files/data-summary-files.md

Put the phenotypic and assessment data content where it belongs.

src/modality-agnostic-files/data-summary-files.md

src/modality-agnostic-files/phenotypic-and-assessment-data.md

src/appendices/phenotype.md

@surchs

Attempt to address more of @surchs comments.

Thanks for catching that excess newline, remark!

Remove acq_time as a phenotype column recommendation/option, as it should go into the sessions file instead.

src/schema/objects/columns.yaml

Remove acq_time__phenotype from columns.yaml since it was removed from the rest of the schema.

Accept Sebastian's suggestion about the phrasing of guideline 8. Co-authored-by: Sebastian Urchs <[email protected]>

for more information, see https://pre-commit.ci

src/modality-agnostic-files/data-summary-files.md

Changing "subject-level" to "participant-level" in sessions files section.

To better differentiate demographic data from phenotypic data

Made changes to align with final feedback prior to community review.

ericearl · 2025-10-12T13:05:53Z

@effigies @rwblair Here is a blurb for the community review period to make announcements easier. If edits are needed, I will apply them directly to this comment before tomorrow.

Community Review: BEP036 - Phenotypic Data Guidelines

We are pleased to announce the community review period for BIDS Extension Proposal (BEP) 036!

BEP036 extends the BIDS standard with an appendix of tabular phenotypic data guidelines that you can opt into for validation by the BIDS validator. We developed the extension to help everyone follow good practices when preparing their tabular phenotypic data. Additionally, this BEP introduces the ability to include session_id as a second column in participants files and to aggregate sessions files at the root level, allowing you to store longitudinal tabular data about participants and sessions, respectively, inside those files.

The draft specification may be found at: https://bids-specification--2123.org.readthedocs.build/en/2123/
The proposed changes may be found at bids-specification pull request #2123.
Example datasets may be found under the titles pheno001 through pheno006 in the bids-examples pull request #465.

To view the file differences in either pull request, click the "Files changed" tab.

effigies · 2025-10-16T21:01:16Z

Encoding the acquisition time for a measurement tool’s session_id, is RECOMMENDED. This information MUST be stored in the sessions.tsv file at the root level of the dataset in the acq_time column.

This is logically equivalent to "the acq_time column MUST NOT appear in a phenotype TSV file", but it takes some thinking about to get there. The spec should just say that.

"if anyone uses sessions, everyone uses sessions."

This is extremely difficult to do without requiring a root-level /sessions.tsv to the exclusion of subject-level sub-<label>_sessions.tsv files. The reason is that sessions columns in phenotype are analyzed on their own. If we can depend on the presence or absence of sessions.tsv as an indication of whether there are any sessions in the dataset, then when we visit a phenotype file, we can check that length(columns.session_id) > 0 iff exists('/sessions.tsv'). Similarly when visiting a subject directory, we can check that length(subject.sessions.ses_dirs) > 0 iff exists('/sessions.tsv').
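As a rough illustration of the consistency condition described above, here is a minimal Python sketch. The function name `sessions_consistent` and its arguments are hypothetical and are not part of the BIDS validator or schema; the sketch only mirrors the `iff` condition for a single phenotype file.

```python
def sessions_consistent(phenotype_columns, root_sessions_exists):
    """Check the "iff" condition for one phenotype file.

    phenotype_columns: column names of one phenotype TSV (hypothetical input).
    root_sessions_exists: True if /sessions.tsv exists at the dataset root.
    """
    has_session_column = "session_id" in phenotype_columns
    # "A iff B" holds exactly when both sides have the same truth value.
    return has_session_column == root_sessions_exists
```

The same comparison could be applied when visiting a subject directory, with "has `ses-` directories" in place of "has a `session_id` column".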

7. Use the sessions file at the root-level

If there is more than one session for any one participant, then it is RECOMMENDED to provide a sessions file at the dataset root. The sessions file MUST list all sessions for all subjects across imaging and tabular phenotypic data. The data dictionary JSON file’s session_id field MUST include Levels with the description of each session_id.

The bolded text is not doable in the current schema. This would need access to all the (subject, session) pairs in /sessions.tsv and in each phenotype file. I think it's tractable, but we will need to extend the validation context and implement those changes in the validator.
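To make concrete what "access to all the (subject, session) pairs" would involve, here is a small Python sketch. It assumes the pairs have already been collected from the directory tree and from each phenotype file; the function name and input shapes are hypothetical and do not describe how the validator is implemented.

```python
def missing_session_pairs(sessions_rows, imaging_pairs, phenotype_pairs):
    """Return (participant_id, session_id) pairs that occur in the data
    but are not listed in the root-level sessions.tsv.

    sessions_rows: rows of /sessions.tsv as dicts (hypothetical input).
    imaging_pairs: pairs found as sub-<label>/ses-<label> directories.
    phenotype_pairs: pairs found in phenotype/*.tsv files.
    """
    listed = {(row["participant_id"], row["session_id"]) for row in sessions_rows}
    observed = set(imaging_pairs) | set(phenotype_pairs)
    # Anything observed but not listed would violate the quoted MUST.
    return sorted(observed - listed)
```

An empty result would mean the sessions file covers every session seen in imaging and tabular phenotypic data.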

10. Respect participant privacy when recording acquisition times

When needed to preserve participant privacy, you SHOULD record relative acquisition times with respect to the earliest session. Relative session acquisition times MAY be listed as durations from the earliest session (baseline) in days, months, or years using the acq_time column.

Unvalidatable and ambiguous. I think this should just piggy-back off of common principles:

Dates can be shifted by a random number of days for privacy protection reasons. To distinguish real dates from shifted dates, it is RECOMMENDED to set shifted dates to the year 1925 or earlier. Note that some data formats do not support arbitrary recording dates. [...] For longitudinal studies dates MUST be shifted by the same number of days within each subject to maintain the interval information. For example: 1867-06-15T13:45:30
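The date-shifting principle quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `shift_subject_dates` and the choice of offset are hypothetical, and a real curation workflow would pick and protect the per-subject offset itself.

```python
from datetime import datetime, timedelta


def shift_subject_dates(acq_times, offset_days):
    """Shift all ISO 8601 timestamps of one subject back by the same
    number of days, so that intervals between sessions are preserved."""
    delta = timedelta(days=offset_days)
    return [(datetime.fromisoformat(t) - delta).isoformat() for t in acq_times]


# Hypothetical example: one subject, two sessions, one shared offset.
original = ["2020-01-10T09:00:00", "2020-07-10T09:00:00"]
shifted = shift_subject_dates(original, offset_days=40000)
```

Because the same offset is applied to every timestamp of the subject, the difference between any two shifted times equals the difference between the original times.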

rwblair · 2025-10-16T21:14:42Z

src/appendices/phenotype.md

+
+Aggregate participant information across all sessions into one tabular TSV file per
+measurement or phenotypic assessment and store this file in the `/phenotype` directory.
+Demographic information is a special case and  MUST be aggregated


As of right now there are only suggestions of what counts as demographic data. From a validation perspective, this is hard to enforce without specific field names being listed in the schema. My interpretation is that these are then to become forbidden columns in any pheno/*.tsv? Are there any other demographic fields we'd like to enforce that on beyond sex, age, gender, race, household_income?

I think a growing list of specific field names considered common demographics would be great: sex, age, gender, race, and income for starters. Though perhaps that should be a validator WARNING and not an ERROR, so I will de-escalate that "MUST" to a "SHOULD".
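A warning-level check along these lines might look like the following Python sketch. The set of names is taken from the starter list in this comment; the constant and function names are hypothetical and nothing here reflects an agreed schema entry.

```python
# Starter list from the discussion; an assumption, not a schema definition.
COMMON_DEMOGRAPHIC_COLUMNS = {"sex", "age", "gender", "race", "income"}


def demographic_columns_in(phenotype_columns):
    """Return the common demographic column names that appear in a
    phenotype TSV, which SHOULD live in participants.tsv instead."""
    return sorted(COMMON_DEMOGRAPHIC_COLUMNS & set(phenotype_columns))
```

A non-empty result would translate into a WARNING rather than an ERROR, in line with de-escalating the "MUST" to a "SHOULD".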

Your comment also raises to me the thought of "does the validator check for the presence of duplicate-named columns across tabular data"? While I don't think it's a good idea to duplicate column names, it might happen sometimes and should raise a WARNING to encourage people to de-duplicate.

"does the validator check for the presence of duplicate-named columns across tabular data"?

It does not.

it might happen sometimes and should raise a WARNING to encourage people to de-duplicate.

Our problem remains that we emit so many warnings that people simply stop reading them.
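For reference, a duplicate-column check of the kind discussed here could be sketched as below. The validator does not currently do this; the function name, the input mapping, and the decision to ignore the shared index columns are all assumptions made for illustration.

```python
from collections import defaultdict

# Index columns are expected to repeat across files, so they are skipped.
INDEX_COLUMNS = {"participant_id", "session_id", "run_id"}


def duplicated_columns(columns_by_file):
    """Map each non-index column name that occurs in more than one
    tabular file to the sorted list of files that contain it."""
    files_by_column = defaultdict(list)
    for filename, columns in columns_by_file.items():
        for column in columns:
            if column not in INDEX_COLUMNS:
                files_by_column[column].append(filename)
    return {c: sorted(f) for c, f in files_by_column.items() if len(f) > 1}
```

Reporting one aggregated message per dataset, rather than one per column, would be one way to limit the warning noise mentioned above.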

rwblair · 2025-10-16T21:18:29Z

src/appendices/phenotype.md

+measurement or phenotypic assessment and store this file in the `/phenotype` directory.
+Demographic information is a special case and  MUST be aggregated
+in the `participants.tsv` file at the root level of the dataset.
+It is RECOMMENDED to use the `age` column in the `participants.tsv` file


Theoretically we could validate the appropriate age being used in each session based on the relative acq_times if present, but I don't think it's worth the effort. Maybe monotonically increasing age like schema.rules.checks.mri.VolumeTimingNotMonotonicallyIncreasing would be a compromise?

I agree it's not worth the effort. That and we can't rely on the age monotonically increasing as some sessions may be close enough to not affect the reported age.

rwblair · 2025-10-16T21:26:24Z

src/appendices/phenotype.md

+
+### 3. Add `MeasurementToolMetadata` to each tabular phenotypic measurement tool
+
+Whenever possible, it is RECOMMENDED to add `MeasurementToolMetadata` to


Not an issue for this BEP: in this and in the main phenotype article it's implied that every tsv in the phenotype directory is a "Measurement Tool", but it is never explicitly stated that this is the only kind of tsv. Gave me pause when reviewing this, but it may be obvious to everyone else.

Yeah, the only permitted files in the phenotype folder are the measurement tool's TSVs and JSONs. Do you have an idea of where a sentence in the spec might help clear this up for other folks with the same experience you had there?

rwblair · 2025-10-16T21:35:53Z

src/appendices/phenotype.md

+-   If more than one of the same measurement tool is acquired within
+    the same `session_id`, a `run_id` column MUST be added.


Suggested change

--   If more than one of the same measurement tool is acquired within
-    the same `session_id`, a `run_id` column MUST be added.
+-   If a measurement tool is acquired multiple times within a single session, a `run_id` column must be added to disambiguate the separate acquisitions.

Note: This MUST is implicitly enforced by the combined index columns for phenotype tsv. If multiple results are acquired for the same subject and session with no run_id column, the index check will error out.
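The index check mentioned in this note can be approximated with the Python sketch below. It is not the validator's implementation; the function name and row format are hypothetical, and the index columns are the ones named in the guideline text.

```python
from collections import Counter


def duplicate_index_rows(rows, index_columns=("participant_id", "session_id", "run_id")):
    """Return index tuples that occur more than once in a phenotype TSV.

    rows: list of dicts, one per TSV row (hypothetical input format).
    Only index columns that are actually present are used, so a file
    without run_id is indexed by participant_id and session_id alone.
    """
    present = [c for c in index_columns if rows and c in rows[0]]
    counts = Counter(tuple(row[c] for c in present) for row in rows)
    return [key for key, n in counts.items() if n > 1]
```

With two rows for the same participant and session and no `run_id` column, the pair is reported as a duplicate; adding distinct `run_id` values makes the combined index unique again.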

Love the edit! Implementing it now. Thanks!

rwblair · 2025-10-16T21:39:37Z

src/appendices/phenotype.md

+-   Encoding the acquisition time for a measurement tool’s `session_id`,
+    is RECOMMENDED. This information MUST be stored in the `sessions.tsv`
+    file at the root level of the dataset in the `acq_time` column.


@effigies mentioned this as "This is logically equivalent to "the acq_time column MUST NOT appear in a phenotype TSV file", but it takes some thinking about to get there. The spec should just say that."

I agree with the explicit "MUST NOT", but it also goes a step further in enforcing a root level sessions.tsv.

rwblair · 2025-10-16T21:45:31Z

src/appendices/phenotype.md

+The combination of values in the `participant_id`, `session_id`, and `run_id` (if present)
+columns MUST be unique for the entire tabular file.
+
+### 5. Store demographic data in the participants file and instrument data in the phenotype directory


This is mentioned in ### 1. Aggregate data across sessions, moving the two closer together or combining them would be nice.

I removed those lines 78 and 79 because they are covered in the spec table by the schema.

rwblair · 2025-10-16T21:53:54Z

src/appendices/phenotype.md

+Create one tabular file for each instrument
+in the phenotypic and assessment data directory.
+
+### 6. Record participant properties in the participants file and session properties in the sessions file


Phenotypes aren't properties of participants? ;)

https://bids-specification.readthedocs.io/en/stable/modality-agnostic-files/data-summary-files.html#sessions-file
For pathology, it states: "When different from healthy, pathology SHOULD be specified."

Should this entry be taken as overriding the main spec, and this field should go in participants.tsv instead?

Tabular phenotypic data from measurement tools in the /phenotype directory are the results of specific measurement tools. In fact, it's interesting that sometimes the "participant properties" collated into the participants.tsv file may be aggregated from separate interviews/screenings/measurement tools.

I hadn't seen that pathology line before. Thanks for finding it. Not sure what to do about that... @surchs @SamGuay @Arshitha ?

I suppose this BEP could make the additional validation opt-in override that sentence of the main spec?

rwblair · 2025-10-16T21:55:51Z

src/appendices/phenotype.md

+Properties of participants MAY include things like
+age, sex, race, or household income.
+Properties of sessions MAY include things like
+acquisition time, measurement device properties,


Like participant properties, any way to explicitly list what is session appropriate metadata in the schema will make enforcing these rules easier/allow for making strong requirement claims. There is much less consistency here, so it may not be feasible.

Doing my best to address Chris' PR comment about a few pieces.

ericearl · 2025-11-04T19:32:25Z

@effigies

Encoding the acquisition time for a measurement tool’s session_id, is RECOMMENDED. This information MUST be stored in the sessions.tsv file at the root level of the dataset in the acq_time column.

This is logically equivalent to "the acq_time column MUST NOT appear in a phenotype TSV file", but it takes some thinking about to get there. The spec should just say that.

How's this?

A measurement tool’s acquisition time SHOULD be stored in the sessions.tsv file at the root level of the dataset in the acq_time column.

The point here is that the guideline would rather a curator store acq_time somewhere than store it nowhere. The preference is that it goes into the sessions.tsv. So I suppose a sessions.tsv missing an acq_time column would receive a validator warning.

"if anyone uses sessions, everyone uses sessions."

This is extremely difficult to do without requiring a root-level /sessions.tsv to the exclusion of subject-level sub-<label>_sessions.tsv files. The reason is that sessions columns in phenotype are analyzed on their own. If we can depend on the presence or absence of sessions.tsv as an indication of whether there are any sessions in the dataset, then when we visit a phenotype file, we can check that length(columns.session_id) > 0 iff exists('/sessions.tsv'). Similarly when visiting a subject directory, we can check that length(subject.sessions.ses_dirs) > 0 iff exists('/sessions.tsv').

If a curator opts into this additional validation, then I agree it should require a sessions.tsv file instead of sub-<label>_sessions.tsv files.

7. Use the sessions file at the root-level
If there is more than one session for any one participant, then it is RECOMMENDED to provide a sessions file at the dataset root. The sessions file MUST list all sessions for all subjects across imaging and tabular phenotypic data. The data dictionary JSON file’s session_id field MUST include Levels with the description of each session_id.

The bolded text is not doable in the current schema. This would need access to all the (subject, session) pairs in /sessions.tsv and in each phenotype file. I think it's tractable, but we will need to extend the validation context and implement those changes in the validator.

Does that mean the validator needing "access to all the (subject, session) pairs in /sessions.tsv and in each phenotype file" is a blocker for this BEP?

10. Respect participant privacy when recording acquisition times
When needed to preserve participant privacy, you SHOULD record relative acquisition times with respect to the earliest session. Relative session acquisition times MAY be listed as durations from the earliest session (baseline) in days, months, or years using the acq_time column.

Unvalidatable and ambiguous. I think this should just piggy-back off of common principles:

Dates can be shifted by a random number of days for privacy protection reasons. To distinguish real dates from shifted dates, it is RECOMMENDED to set shifted dates to the year 1925 or earlier. Note that some data formats do not support arbitrary recording dates. [...] For longitudinal studies dates MUST be shifted by the same number of days within each subject to maintain the interval information. For example: 1867-06-15T13:45:30

I removed that guideline because you're right, the common principle there is enough.

For all little edits, see: surchs@b60eac1

@rwblair

Trying to address some of @rwblair's comments on the PR.

effigies · 2025-11-05T21:53:52Z

So I suppose a sessions.tsv missing an acq_time column would receive a validator warning.

This is a warning for regular datasets and an error for additional validation datasets.

bids-specification/src/schema/rules/tabular_data/modality_agnostic.yaml

Lines 103 to 131 in d8b34f3

    
    Sessions:
      selectors:
        - suffix == "sessions"
        - extension == ".tsv"
        - '!intersects(dataset.dataset_description.AdditionalValidation, ["Phenotype"])'
      initial_columns:
        - participant_id
        - session_id
        - run_id
      columns:
        participant_id: optional
        session_id: required
        run_id: optional
        acq_time__sessions: recommended
        pathology: recommended
        HED: optional
      index_columns: [participant_id, session_id, run_id]
      additional_columns: allowed
    Sessions__Additional:
      $ref: rules.tabular_data.modality_agnostic.Sessions
      selectors:
        - suffix == "sessions"
        - extension == ".tsv"
        - intersects(dataset.dataset_description.AdditionalValidation, ["Phenotype"])
      columns:
        $ref: rules.tabular_data.modality_agnostic.Sessions.columns
        acq_time__sessions: required
      additional_columns: allowed_if_defined
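Read as plain logic, the two rules above select on whether `AdditionalValidation` contains `"Phenotype"` and change the requirement level of the `acq_time` column accordingly. The Python sketch below paraphrases that behaviour; the function name is hypothetical, it assumes `AdditionalValidation` is a list of strings, and it is not how the validator evaluates the schema.

```python
def missing_acq_time_severity(columns, dataset_description):
    """Return None if acq_time is present in a sessions file; otherwise
    "error" for additional-validation datasets and "warning" for
    regular datasets."""
    if "acq_time" in columns:
        return None
    # Assumption: AdditionalValidation is a list such as ["Phenotype"].
    additional = dataset_description.get("AdditionalValidation", [])
    # required column missing -> error; recommended column missing -> warning
    return "error" if "Phenotype" in additional else "warning"
```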

The sessions file MUST list all sessions for all subjects across imaging and tabular phenotypic data. The data dictionary JSON file’s session_id field MUST include Levels with the description of each session_id.

The bolded text is not doable in the current schema. This would need access to all the (subject, session) pairs in /sessions.tsv and in each phenotype file. I think it's tractable, but we will need to extend the validation context and implement those changes in the validator.

Does that mean the validator needing "access to all the (subject, session) pairs in /sessions.tsv and in each phenotype file" is a blocker for this BEP?

Yes. If you want to enforce that, it's going to require some schema and validation design.

ericearl · 2025-11-19T20:52:21Z

Moving @yarikoptic's good comments from the BEP Google Doc to here.

@yarikoptic It took me a long time to, I think, realize what you meant. I see a little clearer now, but there are a few conflated issues to untangle here. I'll try...

What I think you're saying

You don't want session_id as a column available in participants.tsv
You don't want the root-level sessions.tsv file to have that file name
You don't want a root-level sessions file to contain a participant_id column

Below is my reasoning for the 3 things to be as-is. Please refer back to them as 1, 2, and 3 as you reference them.

1. session_id as a column available in participants.tsv

Chris (@effigies) pointed out to me in conversation that this achieves most of what we set out to do, and Nell/Chris/Ross seemed to agree this was not a significant technical hurdle. So I pivoted the BEP, in the hopes of closing it off sooner, to focus on the session_id column being allowed in the participants file. Ultimately the goal is to capture multi-session data that changes about participants in an aggregated tabular file because our BEP leads feel that all the segregated sub-<label>_sessions.tsv files are an undesirable solution. Whether that happens in the participants file or sessions file matters very little to me. I just want people to record it somewhere obvious/predictable.

2. root-level sessions file name

This is another one I don't care a lot about. It could be called sessions.tsv or participant_sessions.tsv or broccoli.tsv. I just think that pulling the prefix of sub-<label>_ out of the sessions.tsv files makes the most sense and will be familiar to BIDS users.

3. root-level sessions file participant_id column

Whether people put the participant_id and session_id pair in the participants.tsv file or the sessions.tsv or both, I would rather people record it and it be permitted by the BIDS validator to do so than not to record it because they don't know where to put it. Options and repetition can be good.

Let me know your thoughts or if you want to hop on a call to continue the discussion, then come back here and type out the outcomes of our discussion.

surchs · 2025-11-21T04:36:45Z

session_id as a column available in participants.tsv

To me the main argument to not recommend putting longitudinal demographic info about participants (e.g. "age" or "medication status") in the sessions.tsv file was clarity of purpose / scope: The information is not about the session (but about the participant) and the only reason we put it in sessions.tsv is because this file allows us to list multiple sessions. Allowing session_id columns (and thus repeated rows of participant_id) in participants.tsv is one way to solve this - with the possible added benefit that many groups already collect their longitudinal participant info in a single table.

Another way to solve it (that we discussed originally and dismissed after session_id in participants.tsv was an option) was to recommend that people move all demographic info into the /phenotype directory (where session_id is allowed), e.g. into a /phenotype/demographic.tsv file - and thus turn participants.tsv into just an index of participant_id.

I don't think we should go back to encouraging storing information about participants in sessions.tsv.

root-level sessions file

The main purpose for the root-level sessions file is to provide a place where I can see (and describe) both the "imaging" sessions (e.g. /sub-01/ses-01) and "phenotypic" sessions (that may only exist as a session_id entry in a file in /phenotype) so that I can avoid unintentional reuse of session-ids (i.e. my participant completes a questionnaire in /phenotype with session_id = ses-01 and then 6 months later shows up for their first MRI which I now store under /sub-01/ses-01, making it seem as if both were collected together).

I think it'd be odd to force such a file to the subject level (i.e. to disallow participant_id column inside). In an extreme case with only "phenotypic" sessions you would have an empty /sub-01 directory with only a sessions.tsv inside. If we make participants.tsv the "single place" for this info, then we will need to explain where I should put info about sessions like acq_time etc for phenotypic sessions vs imaging sessions.

ericearl and others added 4 commits May 20, 2025 08:24

Merge pull request #2 from bids-standard/master

3cedc86

Upstream PR

Merge pull request #3 from bids-standard/master

11fbb47

Quick update before merging our PR on surchs fork

Update phenotype.md and data-summary-files.md

0a640e6

Changed "e.g." to "for example" to follow contributing style guidelines.

ericearl requested review from effigies and rwblair May 30, 2025 14:09

ericearl assigned surchs, ericearl and SamGuay May 30, 2025

ericearl requested review from DimitriPapadopoulos and erdalkaraca as code owners May 30, 2025 14:09

ericearl added enhancement New feature or request BEP phenotype labels May 30, 2025

[pre-commit.ci] auto fixes from pre-commit.com hooks

a19512b

for more information, see https://pre-commit.ci

effigies reviewed May 30, 2025


surchs reviewed May 30, 2025
