
Conversation

@noemimontobbio
Collaborator

Main changes

  • Model selection: as discussed (see below*). I split the lmer function into an lmer_selection function and an lmer_tests function. lmer_selection implements model selection the “old way” (BIC for random effects + backward LRT selection for fixed effects) but can be customised to do only the BIC part (or only the LRT part, or some other method). I used the existing R library buildmer - the backward selection process is very tedious to implement, so there is no need to do it from scratch. I didn't keep the code from the original lmer function because it required the user to specify all anova comparisons in advance, which is not what is normally done in LRT model selection (i.e., anovas).
    lmer_tests runs emmeans and emtrends comparisons on the provided model. If more than one model is provided, it chooses which one to use based on BIC and then runs the tests on the selected model.
    Several modelling strategies can be implemented with these two functions - it's best explained by example: see examples/analysis_example.py (I put the script there for now, feel free to move it / remove it) and the sketch after this list.

  • Fitting a separate model for each pairwise comparison (group1 vs. HC, group2 vs. HC, ...) is not good practice. It is better to fit a single model with group as a multi-level factor and then run the pairwise comparisons as a post-hoc analysis.

  • The output of lmer_tests is now a dictionary whose keys are the specs requested by the user.

  • I removed other_col. The code now extracts the variables directly from the formulas, so the user doesn't have to specify things twice: they just provide the formulas they need and the data are checked for those variables. The names of the group and ID variables can be specified through group_col and id_col (instead of requiring them to be 'Group' and 'ID'), so users don't have to rename columns in their data. There is also no need to rename data_col to 'Y'.
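
A minimal sketch of the intended workflow follows. The import path, argument names, column names, and signatures of lmer_selection / lmer_tests below are assumptions based on the description above, not the actual API; see examples/analysis_example.py for real usage.

```python
# Hypothetical sketch: import path, signatures, and column names are assumptions,
# not the actual API (see examples/analysis_example.py for the real usage).
import pandas as pd
from Analysis import lmer_selection, lmer_tests  # hypothetical import location

data = pd.read_csv("eeg_features.csv")  # hypothetical input data

# Full model: group as a single multi-level factor, subject-level random intercept.
# Column names are whatever the user has in their data; no renaming is required.
full_model = "power ~ group * Sensor + age + (1 | subject)"

# Step 1 (optional): model selection, e.g. choose the random-effect structure by BIC
# (and, if requested, reduce the fixed effects by backward LRT via buildmer).
selected = lmer_selection(full_model, data=data)

# Step 2: post-hoc tests with emmeans/emtrends. If several candidate models are
# passed, lmer_tests() picks the lowest-BIC one before running the tests.
results = lmer_tests([selected], data=data,
                     specs=["group", "group | Sensor"],
                     group_col="group", id_col="subject")

# The output is a dictionary keyed by the requested specs.
for spec, table in results.items():
    print(spec)
    print(table)
```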

Questions

  • I don't understand what data_index does. I left it as it was in the code.

  • Forcing a categorical data type for 'Sensor': is it necessary? It feels a bit artificial and very specific. If it is needed, we could add a sensor_col argument so the user can specify the name they used (if any) for the sensor column.

  • Removing rows where data_col is zero: why is this done?


(*) Model selection strategy:
Old:

  • create the full model with all possible predictors you can think of
  • select the random-effect structure with BIC
  • starting from the selected model, recursively apply likelihood ratio tests to reduce the fixed-effect structure and keep only significant predictors
  • run your post-hoc stats on the final model. Non-significant predictors have already been removed, so we know the remaining effects are significant; we only need to quantify them.

New:

  • create the full model with all predictors that can reasonably be expected to have an effect on the dependent variable
  • select the random-effect structure with BIC
  • run your post-hoc stats on that model. The fixed-effect structure still includes all predictors, and the significance of the effects is established via post-hoc tests (a sketch contrasting the two strategies follows after the motivation below).

Motivation:
Iterative use of statistical inference to select variables is a flawed way to build multivariable models: it invalidates the assumptions of inference (which presuppose a model fixed in advance, not one selected from the data) and induces selection bias. Moreover, variables that are not “significantly” associated with the outcome (even though the model is likely underpowered for at least some of them) are dropped even when they are true confounders of the relationship of interest. A multivariable model should instead be built from the known and anticipated causal relationships.
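
A rough sketch of the two strategies, continuing the hypothetical example above (the reduce_fixed keyword is hypothetical, introduced only to illustrate the customisation of lmer_selection mentioned earlier; the real option names may differ):

```python
# Continuing the hypothetical sketch above (imports and `data` as before).
full_model = "power ~ group * Sensor + age + (1 | subject)"

# Old strategy: select the random effects by BIC, then prune the fixed effects by
# backward LRT -- the final model may have dropped e.g. 'age' or the interaction.
old_final = lmer_selection(full_model, data=data, reduce_fixed=True)   # hypothetical keyword

# New strategy: only the random-effect structure is selected; all fixed effects stay in.
new_final = lmer_selection(full_model, data=data, reduce_fixed=False)  # hypothetical keyword

# In both cases the effects of interest are then quantified with lmer_tests();
# under the new strategy, their significance is also established there.
results = lmer_tests([new_final], data=data,
                     specs=["group | Sensor"], group_col="group", id_col="subject")
```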

@noemimontobbio
Collaborator Author

Main changes

EEG_predictions.py

  • append_lmer_results(): if 'z.ratio' is not found in the result table, fall back to 't.ratio' (this can happen, e.g., if the selected model is a plain lm with no random effects). This change is not needed if random effects are always retained (as is now the case), but we don't know that a priori and I think it's safer this way (see the sketch after this list).
  • Converted data['Sensor'] from np.str_ to str to avoid a warning message when the data are imported into R.
  • Removed MCI from the data – otherwise the models would still include the "MCI-HC" comparisons in the count when correcting for multiple tests.
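
The fallback is roughly equivalent to the following sketch, assuming the emmeans result arrives as a pandas DataFrame; the actual handling inside append_lmer_results() may differ in detail.

```python
import pandas as pd

def get_ratio_column(result_table: pd.DataFrame) -> pd.Series:
    """Return the test-statistic column: 'z.ratio' when available, falling back to
    't.ratio' (e.g. when the selected model is a plain lm with no random effects).
    Sketch only -- not the actual append_lmer_results() implementation."""
    if "z.ratio" in result_table.columns:
        return result_table["z.ratio"]
    if "t.ratio" in result_table.columns:
        return result_table["t.ratio"]
    raise KeyError("Neither 'z.ratio' nor 't.ratio' found in the result table.")
```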

Analysis.lmer_tests()

  • Given your changes, models has effectively become a required argument, which makes sense in my opinion. I made it a positional argument and removed the check that raised an error when models is None.
  • Minor changes to the checks on group_col: it must appear in at least one of the models (so there is no longer any need to check that it is in the DataFrame, since all variables appearing in the models are already checked for that). Note: instead of raising an error, we could consider just issuing a warning saying that group_col is not specified correctly and will not be used.
  • Made some changes to the checks on specs. If multiple models are provided to lmer_tests(), the lowest-BIC one is selected, and it may not contain all the variables listed in the specs – e.g. models = ['z ~ x + y', 'z ~ x'] with specs = ['x', 'y']. If the model 'z ~ x' is selected, the test on 'y' is now automatically skipped and only the test on 'x' is performed. More generally, if one of the specs is invalid (it contains variables that are not fixed effects in the model, or more than one numeric variable), the test is simply skipped without raising an error. If no valid specs remain, the result is left empty and a warning is displayed. This makes it easier to automate model selection and testing (see the sketch after this list).
  • "Ensure Sensor remains as a character column": extended that to any grouping variable so that original level values appear in results.
  • Enabled the "trt.vs.control" comparison for group_col also when there is a grouping factor (e.g., in cases like "Group|Sensor").
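
The spec check behaves roughly like the sketch below; the helper and variable names are hypothetical, and the actual logic lives inside lmer_tests().

```python
import warnings
import pandas as pd

def filter_valid_specs(specs, fixed_effects, data):
    """Keep only specs whose variables are all fixed effects of the selected model
    and that contain at most one numeric variable; invalid specs are skipped
    silently. Sketch only -- not the actual lmer_tests() implementation."""
    valid = []
    for spec in specs:
        # A spec like "group | Sensor" involves both the primary and grouping factors.
        variables = [v.strip() for v in spec.replace("|", "+").split("+")]
        in_model = all(v in fixed_effects for v in variables)
        n_numeric = sum(pd.api.types.is_numeric_dtype(data[v])
                        for v in variables if v in data.columns)
        if in_model and n_numeric <= 1:
            valid.append(spec)
    if not valid:
        warnings.warn("No valid specs left: no tests will be run.")
    return valid
```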

Analysis.lmer_selection()

  • Given your changes, full_model has effectively become a required argument, which makes sense in my opinion. I made it a positional argument and removed the check that raised an error when full_model is None.
  • Removed the group_col argument – it is no longer needed now that full_model is mandatory!
  • Modified the buildmer behaviour by setting singular.ok=T (so "singular" models are not excluded a priori).
