Collaborator

@ericearl ericearl commented May 30, 2025

The BEP leads can meet as needed to discuss this BEP PR.

Coordinate a meeting by emailing Eric Earl: [email protected].

Communicate on this PR to provide feedback otherwise.

HTML preview of this BEP

BEP036 brings best-practice guidelines for tabular phenotypic data to the BIDS specification.

  • Includes an appendix called phenotype.md
  • Includes a new AdditionalValidation key for the dataset_description.json, for which the usage is described in the modality agnostic files sections
  • Includes the new option to store session_id as the second column in the participants.tsv
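
A minimal sketch of what opting in might look like. It assumes the `AdditionalValidation` value is the list `["Phenotype"]` (as used in the schema selectors quoted later in this thread); the dataset name, version string, and participant rows are made-up placeholders.

```python
import csv
import io
import json

# Hypothetical dataset_description.json content. Only AdditionalValidation
# relates to this PR; Name and BIDSVersion are placeholder values.
dataset_description = {
    "Name": "Example dataset",
    "BIDSVersion": "1.10.0",
    "AdditionalValidation": ["Phenotype"],
}

# Hypothetical participants.tsv with session_id as the second column,
# so that one participant can have one row per session.
participants_tsv = (
    "participant_id\tsession_id\tage\n"
    "sub-01\tses-01\t34\n"
    "sub-01\tses-02\t35\n"
    "sub-02\tses-01\t29\n"
)


def opts_into_phenotype(description):
    """Return True if the dataset opts into the phenotype guidelines."""
    return "Phenotype" in description.get("AdditionalValidation", [])


rows = list(csv.DictReader(io.StringIO(participants_tsv), delimiter="\t"))
header = list(rows[0].keys())

print(json.dumps(dataset_description, indent=2))
print(header[:2])
```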

Additional Links

  1. Original Google Doc
  2. Draft BIDS Validator errors and warnings
  3. BIDS Examples PR

Co-authored-by: Eric Earl [email protected] @ericearl
Co-authored-by: Samuel Guay [email protected] @SamGuay
Co-authored-by: Sebastian Urchs [email protected] @surchs
Co-authored-by: Arshitha Basavaraj [email protected] @Arshitha

ericearl and others added 4 commits May 20, 2025 08:24
Quick update before merging our PR on surchs fork
BEP036 brings guidelines for best tabular phenotypic data to the BIDS specification.

- Includes an appendix called `phenotype.md`
- Includes admonitions for the guidelines in-line with modality agnostic files sections

---------

Co-authored-by: Eric Earl <[email protected]>
Co-authored-by: Samuel Guay <[email protected]>
Co-authored-by: Sebastian Urchs <[email protected]>
Co-authored-by: Arshitha B <[email protected]>
Changed "e.g." to "for example" to follow contributing style guidelines.
each `phenotype/<measurement_tool_name>.json` data dictionary.
This improves reusability and provides clarity about the measurement tool.

### 5. Use the demographics file for common variables about participants
Contributor

Copying from https://github.com/surchs/bids-specification/pull/1/files#r2103117486

For this section, would it make sense to suggest that demo-like information be prioritized in this file rather than participants.tsv, making the latter primarily a list of subject IDs? I haven't seen this explicitly addressed anywhere, though I'm unsure if it's something we want to formalize 😬
Something like this could follow the paragraph?:

When all demographic data is stored in phenotype/demographics.tsv, participants.tsv may serve primarily as a minimal listing of subject identifiers with only the participant_id column.

Contributor

I agree. It'd be good to mention this.

Put the phenotypic and assessment data content where it belongs.
Attempt to address more of @surchs comments.
Thanks for catching that excess newline, remark!
Remove acq_time as a phenotype column recommendation/option, as it should go into the sessions file instead.
ericearl and others added 3 commits September 30, 2025 06:15
Remove acq_time__phenotype from columns.yaml since it was removed from the rest of the schema.
Accept Sebastian's suggestion about the phrasing of guideline 8.

Co-authored-by: Sebastian Urchs <[email protected]>
Collaborator Author

ericearl commented Oct 12, 2025

@effigies @rwblair Here is a blurb for the community review period to make announcements easier. If edits are needed, I will apply them directly to this comment before tomorrow.


Community Review: BEP036 - Phenotypic Data Guidelines

We are pleased to announce the community review period for BIDS Extension Proposal (BEP) 036!

BEP036 extends the BIDS standard to include an appendix with tabular phenotypic data guidelines you can opt into for the BIDS validator. We have developed the extension to allow everyone to follow good practices in preparing their tabular phenotypic data. Additionally, this BEP introduces the ability to include session_id as a second column in participants files and to aggregate sessions files to the root-level, allowing you to store longitudinal tabular data about participants and sessions, respectively, inside those files.

To view the file differences in either pull request, click the "Files changed" tab.

Collaborator

effigies commented Oct 16, 2025

Encoding the acquisition time for a measurement tool’s session_id, is RECOMMENDED. This information MUST be stored in the sessions.tsv file at the root level of the dataset in the acq_time column.

This is logically equivalent to "the acq_time column MUST NOT appear in a phenotype TSV file", but it takes some thinking about to get there. The spec should just say that.

"if anyone uses sessions, everyone uses sessions."

This is extremely difficult to do without requiring a root-level /sessions.tsv to the exclusion of subject-level sub-<label>_sessions.tsv files. The reason is that sessions columns in phenotype are analyzed on their own. If we can depend on the presence or absence of sessions.tsv as an indication of whether there are any sessions in the dataset, then when we visit a phenotype file, we can check that length(columns.session_id) > 0 iff exists('/sessions.tsv'). Similarly when visiting a subject directory, we can check that length(subject.sessions.ses_dirs) > 0 iff exists('/sessions.tsv').
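
The two "iff" checks described above can be sketched in plain Python. This is not validator code: the function name and the dictionary inputs are hypothetical stand-ins for what the validation context would provide.

```python
def session_usage_problems(phenotype_session_ids, subject_session_dirs,
                           has_root_sessions_tsv):
    """Return the items whose session usage disagrees with /sessions.tsv.

    phenotype_session_ids: {phenotype filename: list of session_id values}
    subject_session_dirs:  {subject label: list of ses-* directory names}
    has_root_sessions_tsv: whether /sessions.tsv exists
    """
    problems = []
    # length(columns.session_id) > 0 iff exists('/sessions.tsv')
    for filename, session_ids in sorted(phenotype_session_ids.items()):
        if (len(session_ids) > 0) != has_root_sessions_tsv:
            problems.append(f"phenotype/{filename}")
    # length(subject.sessions.ses_dirs) > 0 iff exists('/sessions.tsv')
    for subject, ses_dirs in sorted(subject_session_dirs.items()):
        if (len(ses_dirs) > 0) != has_root_sessions_tsv:
            problems.append(subject)
    return problems


example = session_usage_problems(
    phenotype_session_ids={"tool_a.tsv": ["ses-01", "ses-02"], "tool_b.tsv": []},
    subject_session_dirs={"sub-01": ["ses-01"], "sub-02": []},
    has_root_sessions_tsv=True,
)
print(example)
```

Each check only needs one file or one subject directory plus a single dataset-level fact, which is why it depends on the root-level file being the sole indicator of session use.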

7. Use the sessions file at the root-level

If there is more than one session for any one participant, then it is RECOMMENDED to provide a sessions file at the dataset root. The sessions file MUST list all sessions for all subjects across imaging and tabular phenotypic data. The data dictionary JSON file’s session_id field MUST include Levels with the description of each session_id.

The bolded text is not doable in the current schema. This would need access to all the (subject, session) pairs in /sessions.tsv and in each phenotype file. I think it's tractable, but we will need to extend the validation context and implement those changes in the validator.
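
To illustrate why this needs a wider validation context, here is a rough sketch of the cross-file comparison. It assumes every file has `participant_id` and `session_id` columns; the function names and sample data are hypothetical and not part of the validator.

```python
import csv
import io


def pairs(tsv_text):
    """Collect the (participant_id, session_id) pairs listed in one TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {(row["participant_id"], row["session_id"]) for row in reader}


def missing_from_sessions(sessions_tsv, phenotype_tsvs, imaging_pairs):
    """Return pairs used in phenotype files or imaging directories
    that are not listed in the root-level sessions file."""
    listed = pairs(sessions_tsv)
    used = set(imaging_pairs)
    for text in phenotype_tsvs.values():
        used |= pairs(text)
    return sorted(used - listed)


sessions_tsv = "participant_id\tsession_id\nsub-01\tses-01\n"
phenotype_tsvs = {
    "tool_a.tsv": (
        "participant_id\tsession_id\tscore\n"
        "sub-01\tses-01\t3\n"
        "sub-01\tses-02\t4\n"
    ),
}
imaging_pairs = [("sub-02", "ses-01")]

missing = missing_from_sessions(sessions_tsv, phenotype_tsvs, imaging_pairs)
print(missing)
```

Unlike the per-file checks, this one has to hold the contents of `/sessions.tsv` and every phenotype file at the same time.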

10. Respect participant privacy when recording acquisition times

When needed to preserve participant privacy, you SHOULD record relative acquisition times with respect to the earliest session. Relative session acquisition times MAY be listed as durations from the earliest session (baseline) in days, months, or years using the acq_time column.

Unvalidatable and ambiguous. I think this should just piggy-back off of common principles:

Dates can be shifted by a random number of days for privacy protection reasons. To distinguish real dates from shifted dates, it is RECOMMENDED to set shifted dates to the year 1925 or earlier. Note that some data formats do not support arbitrary recording dates. [...] For longitudinal studies dates MUST be shifted by the same number of days within each subject to maintain the interval information. For example: 1867-06-15T13:45:30
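
As an illustration of that common principle, the sketch below shifts all of one subject's acquisition times by the same whole number of days so that they land in 1925 or earlier. The function name, the cutoff constant, and the upper bound on the random extra shift are assumptions made for this example only.

```python
import random
from datetime import datetime, timedelta

# Latest moment that still counts as "1925 or earlier".
CUTOFF = datetime(1925, 12, 31, 23, 59, 59)
ONE_DAY = timedelta(days=1)


def shift_subject_times(times, rng, max_extra_days=3650):
    """Shift all times of one subject by the same whole number of days."""
    latest = max(times)
    # Smallest whole-day shift that moves the latest time to the cutoff
    # or before it (ceiling division on timedeltas).
    min_days = max(0, -((CUTOFF - latest) // ONE_DAY))
    # Add a random number of extra days so the shift is not predictable.
    shift = timedelta(days=min_days + rng.randint(0, max_extra_days))
    return [t - shift for t in times]


rng = random.Random(0)  # seeded only to make the example repeatable
original = [datetime(2024, 3, 1, 9, 30), datetime(2024, 9, 1, 9, 30)]
shifted = shift_subject_times(original, rng)

for t in shifted:
    print(t.isoformat())
```

Because one offset is applied to every time of the subject, the interval between sessions is unchanged.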


Aggregate participant information across all sessions into one tabular TSV file per
measurement or phenotypic assessment and store this file in the `/phenotype` directory.
Demographic information is a special case and MUST be aggregated
Member

Right now there are only suggestions of what counts as demographic data. From a validation perspective, this is hard to enforce without specific field names being listed in the schema. My interpretation is that these are then to become forbidden columns in any pheno/*.tsv? Are there any other demographic fields we'd like to enforce that on beyond sex, age, gender, race, household_income?

Collaborator Author

I think a growing list of specific field names considered common demographics would be great: sex, age, gender, race, and income for starters. Though perhaps that should be a validator WARNING and not an ERROR, so I will de-escalate that "MUST" to a "SHOULD".

Your comment also raises the question: "does the validator check for the presence of duplicate-named columns across tabular data"? While I don't think it's a good idea to duplicate column names, it might happen sometimes and should raise a WARNING to encourage people to de-duplicate.
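
A rough sketch of both ideas, assuming the starter list of demographic names from this thread and that index columns are expected to repeat across files. The function names and the example headers are hypothetical, and nothing here reflects existing validator behavior.

```python
# Starter list from this discussion; an assumption, not a schema definition.
COMMON_DEMOGRAPHICS = {"sex", "age", "gender", "race", "household_income"}
# Index columns legitimately appear in many files, so they are ignored.
INDEX_COLUMNS = {"participant_id", "session_id", "run_id"}


def demographic_columns(phenotype_headers):
    """Map each phenotype file to the common demographic columns it contains."""
    found = {}
    for filename, columns in phenotype_headers.items():
        hits = sorted(COMMON_DEMOGRAPHICS.intersection(columns))
        if hits:
            found[filename] = hits
    return found


def repeated_columns(headers):
    """Map each non-index column used in more than one file to those files."""
    files_by_column = {}
    for filename, columns in headers.items():
        for column in set(columns):
            if column not in INDEX_COLUMNS:
                files_by_column.setdefault(column, []).append(filename)
    return {c: sorted(f) for c, f in files_by_column.items() if len(f) > 1}


headers = {
    "participants.tsv": ["participant_id", "age", "sex"],
    "phenotype/tool_a.tsv": ["participant_id", "session_id", "age", "score"],
}
phenotype_only = {k: v for k, v in headers.items() if k.startswith("phenotype/")}

print(demographic_columns(phenotype_only))
print(repeated_columns(headers))
```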

Collaborator

"does the validator check for the presence of duplicate-named columns across tabular data"?

It does not.

it might happen sometimes and should raise a WARNING to encourage people to de-duplicate.

Our problem remains that we emit so many warnings that people simply stop reading them.

measurement or phenotypic assessment and store this file in the `/phenotype` directory.
Demographic information is a special case and MUST be aggregated
in the `participants.tsv` file at the root level of the dataset.
It is RECOMMENDED to use the `age` column in the `participants.tsv` file
Member

Theoretically, we could validate that the appropriate age is used in each session based on the relative acq_times, if present, but I don't think it's worth the effort. Maybe checking for monotonically increasing age, like schema.rules.checks.mri.VolumeTimingNotMonotonicallyIncreasing, would be a compromise?

Collaborator Author

I agree it's not worth the effort. We also can't rely on age monotonically increasing, as some sessions may be close enough together that the reported age does not change.
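
For reference, a weaker check that tolerates repeated ages and only flags a decrease could look like the sketch below. It assumes rows arrive in session order per participant; the function name and sample rows are hypothetical.

```python
def participants_with_decreasing_age(rows):
    """Return participants whose age goes down between consecutive rows.

    rows: iterable of (participant_id, age) pairs, in session order.
    Equal ages in consecutive sessions are allowed.
    """
    last_age = {}
    flagged = []
    for participant, age in rows:
        previous = last_age.get(participant)
        if previous is not None and age < previous and participant not in flagged:
            flagged.append(participant)
        last_age[participant] = age
    return flagged


rows = [
    ("sub-01", 34), ("sub-01", 34), ("sub-01", 35),  # repeated age is fine
    ("sub-02", 29), ("sub-02", 28),                  # age goes down
]
flagged = participants_with_decreasing_age(rows)
print(flagged)
```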


### 3. Add `MeasurementToolMetadata` to each tabular phenotypic measurement tool

Whenever possible, it is RECOMMENDED to add `MeasurementToolMetadata` to
Member

Not an issue for this BEP: in this and in the main phenotype article it's implied that every TSV in the phenotype directory is a "Measurement Tool", but it's never explicitly stated that this is the only kind of TSV. It gave me pause when reviewing this, but it may be obvious to everyone else.

Collaborator Author

Yeah, the only permitted files in the phenotype folder are the measurement tool's TSVs and JSONs. Do you have an idea of where a sentence in the spec might help clear this up for other folks with the same experience you had there?

Comment on lines 54 to 55
- If more than one of the same measurement tool is acquired within
the same `session_id`, a `run_id` column MUST be added.
Member

Suggested change
- If more than one of the same measurement tool is acquired within
the same `session_id`, a `run_id` column MUST be added.
- If a measurement tool is acquired multiple times within a single session, a `run_id` column must be added to disambiguate the separate acquisitions.

Note: This MUST is implicitly enforced by the combined index columns for phenotype TSV files. If multiple results are acquired for the same subject and session with no run_id column, the index check will error out.
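
A small sketch of how such an index check behaves, assuming the index is the combination of `participant_id`, `session_id`, and `run_id` (when present). The function name and sample rows are hypothetical; this is not the validator's implementation.

```python
import csv
import io
from collections import Counter

INDEX_COLUMNS = ("participant_id", "session_id", "run_id")


def duplicate_index_rows(tsv_text):
    """Return index tuples that occur more than once in a phenotype TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Only use the index columns that the file actually has.
    present = [c for c in INDEX_COLUMNS if c in (reader.fieldnames or [])]
    counts = Counter(tuple(row[c] for c in present) for row in reader)
    return sorted(key for key, n in counts.items() if n > 1)


# Two acquisitions in the same session without run_id: the index collides.
without_run = (
    "participant_id\tsession_id\tscore\n"
    "sub-01\tses-01\t3\n"
    "sub-01\tses-01\t5\n"
)
# Adding run_id disambiguates the two acquisitions.
with_run = (
    "participant_id\tsession_id\trun_id\tscore\n"
    "sub-01\tses-01\trun-1\t3\n"
    "sub-01\tses-01\trun-2\t5\n"
)

print(duplicate_index_rows(without_run))
print(duplicate_index_rows(with_run))
```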

Collaborator Author

Love the edit! Implementing it now. Thanks!

Comment on lines 57 to 59
- Encoding the acquisition time for a measurement tool’s `session_id`,
is RECOMMENDED. This information MUST be stored in the `sessions.tsv`
file at the root level of the dataset in the `acq_time` column.
Member

@effigies mentioned this as "This is logically equivalent to "the acq_time column MUST NOT appear in a phenotype TSV file", but it takes some thinking about to get there. The spec should just say that."

I agree with the explicit "MUST NOT", but it also goes a step further in enforcing a root-level sessions.tsv.

The combination of values in the `participant_id`, `session_id`, and `run_id` (if present)
columns MUST be unique for the entire tabular file.

### 5. Store demographic data in the participants file and instrument data in the phenotype directory
Member

This is also mentioned in "### 1. Aggregate data across sessions"; moving the two closer together or combining them would be nice.

Collaborator Author

I removed those lines 78 and 79 because they are covered in the spec table by the schema.

Create one tabular file for each instrument
in the phenotypic and assessment data directory.

### 6. Record participant properties in the participants file and session properties in the sessions file
Member

Phenotypes aren't properties of participants? ;)

https://bids-specification.readthedocs.io/en/stable/modality-agnostic-files/data-summary-files.html#sessions-file
For pathology, it states: "When different from healthy, pathology SHOULD be specified."

Should this entry be taken as overriding the main spec, and this field should go in participants.tsv instead?

Collaborator Author

Tabular phenotypic data from measurement tools in the /phenotype directory are the results of specific measurement tools. In fact, it's interesting that sometimes the "participant properties" collated into the participants.tsv file may be aggregated from separate interviews/screenings/measurement tools.

I hadn't seen that pathology line before. Thanks for finding it. Not sure what to do about that... @surchs @SamGuay @Arshitha ?

I suppose this BEP could make the additional validation opt-in override that sentence of the main spec?

Properties of participants MAY include things like
age, sex, race, or household income.
Properties of sessions MAY include things like
acquisition time, measurement device properties,
Member

Like participant properties, any way to explicitly list what is session appropriate metadata in the schema will make enforcing these rules easier/allow for making strong requirement claims. There is much less consistency here, so it may not be feasible.

Doing my best to address Chris' PR comment about a few pieces.
Collaborator Author

ericearl commented Nov 4, 2025

@effigies

Encoding the acquisition time for a measurement tool’s session_id, is RECOMMENDED. This information MUST be stored in the sessions.tsv file at the root level of the dataset in the acq_time column.

This is logically equivalent to "the acq_time column MUST NOT appear in a phenotype TSV file", but it takes some thinking about to get there. The spec should just say that.

How's this?

A measurement tool’s acquisition time SHOULD be stored in the sessions.tsv file at the root level of the dataset in the acq_time column.

The point here is that the guideline would rather a curator store acq_time somewhere than nowhere. The preference is that it goes into sessions.tsv. So I suppose a sessions.tsv missing an acq_time column would receive a validator warning.

"if anyone uses sessions, everyone uses sessions."

This is extremely difficult to do without requiring a root-level /sessions.tsv to the exclusion of subject-level sub-<label>_sessions.tsv files. The reason is that sessions columns in phenotype are analyzed on their own. If we can depend on the presence or absence of sessions.tsv as an indication of whether there are any sessions in the dataset, then when we visit a phenotype file, we can check that length(columns.session_id) > 0 iff exists('/sessions.tsv'). Similarly when visiting a subject directory, we can check that length(subject.sessions.ses_dirs) > 0 iff exists('/sessions.tsv').

If a curator opts into this additional validation, then I agree it should require a sessions.tsv file instead of sub-<label>_sessions.tsv files.

7. Use the sessions file at the root-level
If there is more than one session for any one participant, then it is RECOMMENDED to provide a sessions file at the dataset root. The sessions file MUST list all sessions for all subjects across imaging and tabular phenotypic data. The data dictionary JSON file’s session_id field MUST include Levels with the description of each session_id.

The bolded text is not doable in the current schema. This would need access to all the (subject, session) pairs in /sessions.tsv and in each phenotype file. I think it's tractable, but we will need to extend the validation context and implement those changes in the validator.

Does that mean the validator needing "access to all the (subject, session) pairs in /sessions.tsv and in each phenotype file" is a blocker for this BEP?

10. Respect participant privacy when recording acquisition times
When needed to preserve participant privacy, you SHOULD record relative acquisition times with respect to the earliest session. Relative session acquisition times MAY be listed as durations from the earliest session (baseline) in days, months, or years using the acq_time column.

Unvalidatable and ambiguous. I think this should just piggy-back off of common principles:

Dates can be shifted by a random number of days for privacy protection reasons. To distinguish real dates from shifted dates, it is RECOMMENDED to set shifted dates to the year 1925 or earlier. Note that some data formats do not support arbitrary recording dates. [...] For longitudinal studies dates MUST be shifted by the same number of days within each subject to maintain the interval information. For example: 1867-06-15T13:45:30

I removed that guideline because you're right, the common principle there is enough.

For all little edits, see: surchs@b60eac1

Collaborator

effigies commented Nov 5, 2025

So I suppose a sessions.tsv missing an acq_time column would receive a validator warning.

This is a warning for regular datasets and an error for additional validation datasets.

Sessions:
  selectors:
    - suffix == "sessions"
    - extension == ".tsv"
    - '!intersects(dataset.dataset_description.AdditionalValidation, ["Phenotype"])'
  initial_columns:
    - participant_id
    - session_id
    - run_id
  columns:
    participant_id: optional
    session_id: required
    run_id: optional
    acq_time__sessions: recommended
    pathology: recommended
    HED: optional
  index_columns: [participant_id, session_id, run_id]
  additional_columns: allowed
Sessions__Additional:
  $ref: rules.tabular_data.modality_agnostic.Sessions
  selectors:
    - suffix == "sessions"
    - extension == ".tsv"
    - intersects(dataset.dataset_description.AdditionalValidation, ["Phenotype"])
  columns:
    $ref: rules.tabular_data.modality_agnostic.Sessions.columns
    acq_time__sessions: required
  additional_columns: allowed_if_defined
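
The effect of the two rules above on a missing acquisition time column can be paraphrased in a few lines of Python. This is only a sketch of the intended outcome: it assumes the `acq_time__sessions` schema entry corresponds to a column named `acq_time`, and the function name is hypothetical.

```python
def acq_time_issue(dataset_description, sessions_columns):
    """Return None, "warning", or "error" for a sessions.tsv header."""
    if "acq_time" in sessions_columns:
        return None
    additional = dataset_description.get("AdditionalValidation", [])
    # Sessions__Additional applies when "Phenotype" validation is opted into,
    # making the column required; otherwise it is only recommended.
    return "error" if "Phenotype" in additional else "warning"


columns = ["participant_id", "session_id"]
print(acq_time_issue({}, columns))
print(acq_time_issue({"AdditionalValidation": ["Phenotype"]}, columns))
```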

The sessions file MUST list all sessions for all subjects across imaging and tabular phenotypic data. The data dictionary JSON file’s session_id field MUST include Levels with the description of each session_id.

The bolded text is not doable in the current schema. This would need access to all the (subject, session) pairs in /sessions.tsv and in each phenotype file. I think it's tractable, but we will need to extend the validation context and implement those changes in the validator.

Does that mean the validator needing "access to all the (subject, session) pairs in /sessions.tsv and in each phenotype file" is a blocker for this BEP?

Yes. If you want to enforce that, it's going to require some schema and validation design.

@ericearl
Collaborator Author

Moving @yarikoptic's good comments from the BEP Google Doc to here.


@yarikoptic It took me a long time to realize, I think, what you meant. I see it a little more clearly now, but there are a few conflated issues to untangle here. I'll try...

What I think you're saying

  1. You don't want session_id as a column available in participants.tsv
  2. You don't want the root-level sessions.tsv file to have that file name
  3. You don't want a root-level sessions file to contain a participant_id column

Below is my reasoning for the 3 things to be as-is. Please refer back to them as 1, 2, and 3 as you reference them.

1. session_id as a column available in participants.tsv

Chris (@effigies) pointed out to me in conversation that this achieves most of what we set out to do, and Nell/Chris/Ross seemed to agree this was not a significant technical hurdle. So I pivoted the BEP, in the hopes of closing it off sooner, to focus on the session_id column being allowed in the participants file. Ultimately the goal is to capture multi-session data that changes about participants in an aggregated tabular file, because our BEP leads feel that all the segregated sub-<label>_sessions.tsv files are an undesirable solution. Whether that happens in the participants file or sessions file matters very little to me. I just want people to record it somewhere obvious/predictable.

2. root-level sessions file name

This is another one I don't care a lot about. It could be called sessions.tsv or participant_sessions.tsv or broccoli.tsv. I just think that pulling the prefix of sub-<label>_ out of the sessions.tsv files makes the most sense and will be familiar to BIDS users.

3. root-level sessions file participant_id column

Whether people put the participant_id and session_id pair in the participants.tsv file or the sessions.tsv or both, I would rather people record it and it be permitted by the BIDS validator to do so than not to record it because they don't know where to put it. Options and repetition can be good.

Let me know your thoughts or if you want to hop on a call to continue the discussion, then come back here and type out the outcomes of our discussion.

Contributor

surchs commented Nov 21, 2025

  1. session_id as a column available in participants.tsv

To me the main argument for not recommending putting longitudinal demographic info about participants (e.g. "age" or "medication status") in the sessions.tsv file was clarity of purpose / scope: the information is not about the session (but about the participant), and the only reason we put it in sessions.tsv is that this file allows us to list multiple sessions. Allowing session_id columns (and thus repeated rows of participant_id) in participants.tsv is one way to solve this, with the possible added benefit that many groups already collect their longitudinal participant info in a single table.

Another way to solve it (that we discussed originally and dismissed after session_id in participants.tsv was an option) was to recommend that people move all demographic info into the /phenotype directory (where session_id is allowed), e.g. into a /phenotype/demographic.tsv file, and thus turn participants.tsv into just an index of participant_id.

I don't think we should go back to encouraging storing information about participants in sessions.tsv.

  2. root-level sessions file

The main purpose of the root-level sessions file is to provide a place where I can see (and describe) both the "imaging" sessions (e.g. /sub-01/ses-01) and "phenotypic" sessions (that may only exist as a session_id entry in a file in /phenotype) so that I can avoid unintentional reuse of session IDs (i.e. my participant completes a questionnaire in /phenotype with session_id = ses-01 and then 6 months later shows up for their first MRI which I now store under /sub-01/ses-01, making it seem as if both were collected together).

I think it'd be odd to force such a file to the subject level (i.e. to disallow participant_id column inside). In an extreme case with only "phenotypic" sessions you would have an empty /sub-01 directory with only a sessions.tsv inside. If we make participants.tsv the "single place" for this info, then we will need to explain where I should put info about sessions like acq_time etc for phenotypic sessions vs imaging sessions.
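
One way to picture the overview a root-level sessions file is meant to give: the sketch below groups (participant, session) pairs by where they occur, so a curator can review pairs used by both sources. Whether a shared pair is intentional cannot be decided automatically; the function name and sample data are hypothetical.

```python
def sessions_by_source(phenotype_pairs, imaging_pairs):
    """Group (participant_id, session_id) pairs by where they are used."""
    phenotype = set(phenotype_pairs)
    imaging = set(imaging_pairs)
    return {
        "both": sorted(phenotype & imaging),
        "phenotype_only": sorted(phenotype - imaging),
        "imaging_only": sorted(imaging - phenotype),
    }


overview = sessions_by_source(
    phenotype_pairs=[("sub-01", "ses-01"), ("sub-01", "ses-02")],
    imaging_pairs=[("sub-01", "ses-01"), ("sub-02", "ses-01")],
)
# Pairs under "both" are the ones to double-check against acq_time.
print(overview["both"])
```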
