From 6e548249019c74fab532bf34d9e69a50c4542392 Mon Sep 17 00:00:00 2001 From: avahoffman Date: Mon, 12 Jan 2026 13:52:32 -0500 Subject: [PATCH 1/7] fix typo --- modules/Statistics/Statistics.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/modules/Statistics/Statistics.Rmd b/modules/Statistics/Statistics.Rmd index b574de91..e6aa4584 100644 --- a/modules/Statistics/Statistics.Rmd +++ b/modules/Statistics/Statistics.Rmd @@ -379,7 +379,7 @@ where: * $y_i$ is the outcome for person i * $\alpha$ is the intercept -* $\beta_1$, $\beta_2$, $\beta_2$ are the slopes/coefficients for variables $x_{i1}$, $x_{i2}$, $x_{i3}$ - average difference in y for a unit change (or each value) in x while accounting for other variables +* $\beta_1$, $\beta_2$, $\beta_3$ are the slopes/coefficients for variables $x_{i1}$, $x_{i2}$, $x_{i3}$ - average difference in y for a unit change (or each value) in x while accounting for other variables * $x_{i1}$, $x_{i2}$, $x_{i3}$ are the predictors for person i * $\varepsilon_i$ is the residual variation for person i From 3886e555775ba940a43337cce4bef3ba76724aef Mon Sep 17 00:00:00 2001 From: avahoffman Date: Tue, 13 Jan 2026 12:33:23 -0500 Subject: [PATCH 2/7] Update regression dataset and examples --- modules/Statistics/Statistics.Rmd | 174 +++++++++++++++++------------- 1 file changed, 102 insertions(+), 72 deletions(-) diff --git a/modules/Statistics/Statistics.Rmd b/modules/Statistics/Statistics.Rmd index e6aa4584..e3915e00 100644 --- a/modules/Statistics/Statistics.Rmd +++ b/modules/Statistics/Statistics.Rmd @@ -452,34 +452,31 @@ For example, if we want to fit a regression model where outcome is `income` and ## Linear regression example -Let's look variables that might be able to predict the number of heat-related ER visits in Colorado. +Let's look at variables that might be able to predict the number of "crowded" households. -We'll use the dataset that has ER visits separated out by age category. 
+We'll use a dataset that has socioeconomic measures from the CDC. Find out more on https://daseh.org/data. -We'll combine this with a new dataset that has some weather information about summer temperatures in Denver, downloaded from [https://www.weather.gov/bou/DenverSummerHeat](https://www.weather.gov/bou/DenverSummerHeat). +It has already been filtered to include a few counties from Washington State. -We will use this as a proxy for temperatures for the state of CO as a whole for this example. +Each row represents a census tract/area. ## Linear regression example{.codesmall} ```{r} -er <- read_csv(file = "https://daseh.org/data/CO_ER_heat_visits_by_age.csv") +sp_dat <- read_csv(file = "https://daseh.org/data/socioeco_cdc.csv") -temps <- read_csv(file = "https://daseh.org/data/Denver_heat_data.csv") - -er_temps <- full_join(er, temps) -er_temps +sp_dat ``` ## Linear regression: model fitting{.codesmall} For this model, we will use two variables: -- **visits** - number of visits to the ER for heat-related illness -- **highest_temp** - the highest recorded temperature of the summer +- **crowd** - At household level (occupied housing units), more people than rooms +- **hu** - Number of housing units ```{r} -fit <- glm(visits ~ highest_temp, data = er_temps) +fit <- glm(crowd ~ hu, data = sp_dat) fit ``` @@ -497,7 +494,7 @@ The broom package can help us here too! The estimate is the coefficient or slope. -for every 1 degree increase in the highest temperature, we see 1.134 more heat-related ER visits. The error for this estimate is pretty big at 3.328. This relationship appears to be insignificant with a p value = 0.735. +For every 1 additional housing unit, we see 0.035 more crowded households (put another way, roughly 1 more crowded household for every ~29 additional housing units). This relationship appears to be highly statistically significant, with a p-value of 7.96e-44! 
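A minimal sketch of where the "~29 housing units" reading of the 0.035 slope comes from, assuming simulated data rather than the course dataset. The `toy` data frame, its values, and the names `toy_fit`, `slope`, and `units_per_household` are hypothetical and exist only for this illustration; the point is simply that the reciprocal of the slope gives the number of housing units associated with one more crowded household (1 / 0.035 is about 28.6).

```r
# Hypothetical, simulated data. This is NOT the socioeco_cdc.csv dataset.
set.seed(1)
toy <- data.frame(hu = round(runif(200, min = 200, max = 3000)))

# Build an outcome with a known slope of 0.035 plus random noise.
toy$crowd <- 0.035 * toy$hu + rnorm(200, sd = 10)

# glm() uses the gaussian family by default, i.e. ordinary linear regression.
toy_fit <- glm(crowd ~ hu, data = toy)

# Slope: extra crowded households per additional housing unit.
slope <- unname(coef(toy_fit)["hu"])

# Reciprocal: additional housing units per one more crowded household.
units_per_household <- 1 / slope

slope
units_per_household
```

Reading the slope "upside down" like this does not change the model; it only restates the same coefficient in units that may be easier to picture.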
```{r} tidy(fit) |> glimpse() @@ -505,10 +502,10 @@ tidy(fit) |> glimpse() ## Linear regression: multiple predictors {.smaller} -Let's try adding another other explanatory variable to our model, year (`year`). +Let's try adding another explanatory variable to our model, average per capita income for each census area (`pci`). ```{r} -fit2 <- glm(visits ~ highest_temp + year, data = er_temps) +fit2 <- glm(crowd ~ hu + pci, data = sp_dat) summary(fit2) ``` @@ -526,53 +523,39 @@ fit2 |> Factors get special treatment in regression models - lowest level of the factor is the comparison group, and all other factors are **relative** to its values. -Let's add age category (`age`) as a factor into our model. We'll need to convert it to a factor first. +Let's add the county (`county`) as a factor into our model. We'll need to convert it to a factor first. ```{r} -er_temps <- er_temps |> mutate(age = factor(age)) +sp_dat <- sp_dat |> mutate(county = factor(county)) ``` - ## Linear regression: factors {.smaller} The comparison group that is not listed is treated as intercept. All other estimates are relative to the intercept. ```{r regressbaseline, comment="", fig.height=4,fig.width=8} -fit3 <- glm(visits ~ highest_temp + year + age, data = er_temps) +fit3 <- glm(crowd ~ hu + pci + county, data = sp_dat) summary(fit3) ``` ## Linear regression: factors {.smaller} -Maybe we want to use the age group "65+ years" as our reference. We can relevel the factor. +Maybe we want to use King County as our reference. We can relevel the factor. -The ages are relative to the level that is not listed. +The counties are relative to the level that is not listed. 
```{r} -er_temps <- - er_temps |> - mutate(age = factor(age, - levels = c("65+ years", "35-64 years", "15-34 years", "5-14 years", "0-4 years") +sp_dat <- + sp_dat |> + mutate(county = factor(county, + levels = c("King", "Clark", "Pierce", "Snohomish", "Spokane") )) -fit4 <- glm(visits ~ highest_temp + year + age, data = er_temps) +fit4 <- glm(crowd ~ hu + pci + county, data = sp_dat) summary(fit4) ``` -## Linear regression: factors {.smaller} - -You can view estimates for the comparison group by removing the intercept in the GLM formula - -`y ~ x - 1` - -*Caveat* is that the p-values change, and interpretation is often confusing. - -```{r regress9, comment="", fig.height=4, fig.width=8} -fit5 <- glm(visits ~ highest_temp + year + age - 1, data = er_temps) -summary(fit5) -``` - ## Linear regression: interactions ```{r, fig.alt="Statistical interaction showing the relationship between cookie yield, temperature, and cooking duration.", out.width = "70%", echo = FALSE, fig.align='center'} @@ -583,12 +566,11 @@ knitr::include_graphics("images/interaction.png") ## Linear regression: interactions {.smaller} -You can also specify interactions between variables in a formula `y ~ x1 + x2 + x1 * x2`. This allows for not only the intercepts between factors to differ, but also the slopes with regard to the interacting variable. +You can also specify interactions between variables in a formula with `*`. This allows for not only the intercepts between factors to differ, but also the slopes with regard to the interacting variable. ```{r fig.height=4, fig.width=8} -fit6 <- glm(visits ~ highest_temp + year + age + age*highest_temp, data = er_temps -) -tidy(fit6) +fit5 <- glm(crowd ~ hu + pci * county, data = sp_dat) +summary(fit5) ``` ## Linear regression: interactions {.smaller} @@ -596,8 +578,8 @@ tidy(fit6) By default, `ggplot` with a factor added as a color will look include the interaction term. Notice the different intercept and slope of the lines. 
```{r fig.height=3.5, fig.width=7, warning=FALSE} -ggplot(er_temps, aes(x = highest_temp, y = visits, color = age)) + - geom_point(size = 1, alpha = 0.1) + +ggplot(sp_dat, aes(x = pci, y = hu, color = county)) + + geom_point(size = 1, alpha = 0.2) + geom_smooth(method = "glm", se = FALSE) + theme_classic() ``` @@ -620,21 +602,19 @@ See `?family` documentation for details of family functions. ## Logistic regression {.smaller} -Let's look at a logistic regression example. We'll use the `er_temps` dataset again. +Let's look at a logistic regression example. We'll use the `sp_dat` dataset again with a different variable. -We will create a new binary variable `high_rate`. We will say a visit rate of more than 8 qualifies as a high visit rate. +- **f_crowd** - Flag for the percentage of crowded households is in the 90th percentile (1 = yes, 0 = no) -```{r} -er_temps <- - er_temps |> mutate(high_rate = rate > 8) +There are 36 census tracts in the 90th percentile for crowded households. -glimpse(er_temps) +```{r} +sp_dat |> count(f_crowd) ``` - ## Logistic regression {.smaller} -Let's explore how `highest_temp`, `year`, and `age` might predict `high_rate`. +Let's explore how `hu`, `pci`, and `county` might predict `f_crowd`. ``` # General format @@ -642,7 +622,8 @@ glm(y ~ x, data = DATASET_NAME, family = binomial(link = "logit")) ``` ```{r regress7, comment="", fig.height=4,fig.width=8} -binom_fit <- glm(high_rate ~ highest_temp + year + age, data = er_temps, family = binomial(link = "logit")) +binom_fit <- glm(f_crowd ~ hu + pci + county, + data = sp_dat, family = binomial(link = "logit")) summary(binom_fit) ``` @@ -650,7 +631,6 @@ summary(binom_fit) See this [case study](https://www.opencasestudies.org/ocs-bp-vaping-case-study/#Logistic_regression_%E2%80%9Cby_hand%E2%80%9D_and_by_model) for more information. - ## Odds ratios > An odds ratio (OR) is a measure of association between an exposure and an outcome. 
The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure. @@ -661,34 +641,27 @@ Use `oddsratio(x, y)` from the `epitools()` package to calculate odds ratios. ## Odds ratios {.smaller} -During the years 2012, 2018, 2021, and 2022, there were multiple consecutive days with temperatures above 100 degrees. We will code this as `heatwave`. +Let's see if a high prevalence of no vehicle homes can predict a high prevalence of crowded homes. -```{r} -library(epitools) +- **f_noveh** - Flag for the percentage of households with no vehicles is in the 90th percentile (1 = yes, 0 = no) +- **f_crowd** - Flag for the percentage of crowded households is in the 90th percentile (1 = yes, 0 = no) -er_temps <- - er_temps |> - mutate(heatwave = year %in% c(2012, 2018, 2021, 2022)) - -glimpse(er_temps) -``` - -## Odds ratios {.smaller} +## Odds ratios -In this case, we're calculating the odds ratio for whether a heatwave is associated with having a visit rate greater than 8. +In this case, we're calculating the odds ratio for census areas, indicating whether a prevalence of no vehicle households is associated with more crowded households. ```{r} -response <- er_temps |> pull(high_rate) -predictor <- er_temps |> pull(heatwave) +library(epitools) -oddsratio(predictor, response) +response <- sp_dat %>% pull(f_crowd) +predictor <- sp_dat %>% pull(f_noveh) ``` ## Odds ratios {.smaller} -The Odds Ratio is 3.86. +The Odds Ratio is 3.33. -When the predictor is TRUE (aka it was a heatwave year), the odds of the response (high hospital visitation) are 3.86 times greater than when it is FALSE (not a heatwave year). +When the predictor is 1 (aka the census area has a lot of no vehicle households), the odds of the response (prevalence of crowded homes) are 3.33 times greater than when it is 0 (not a lot of no vehicle households). 
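As a rough illustration of the odds ratio definition quoted above, the sketch below computes one by hand from a 2x2 table using base R only. The counts are invented for this example and are not taken from the course dataset; `tab`, `odds_exposed`, `odds_unexposed`, and `or_by_hand` are hypothetical names.

```r
# Hypothetical 2x2 counts (made up for illustration, not real data).
# Rows: predictor flag (0 = no, 1 = yes). Columns: response flag (0 = no, 1 = yes).
tab <- matrix(
  c(50, 10,   # predictor = 0: 50 without the response, 10 with it
    20,  8),  # predictor = 1: 20 without the response,  8 with it
  nrow = 2, byrow = TRUE,
  dimnames = list(predictor = c("0", "1"), response = c("0", "1"))
)

# Odds of the response within each predictor group.
odds_exposed   <- tab["1", "1"] / tab["1", "0"]   # 8 / 20
odds_unexposed <- tab["0", "1"] / tab["0", "0"]   # 10 / 50

# The odds ratio compares the two odds.
or_by_hand <- odds_exposed / odds_unexposed
or_by_hand
```

As far as I know, `oddsratio()` from `epitools` defaults to a median-unbiased estimate rather than this simple ratio of odds, so its reported value can differ slightly from a hand calculation like this one.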
```{r echo = FALSE} oddsratio(predictor, response) @@ -782,4 +755,61 @@ For balanced designs, determine if multiple variables influence a dependent vari `aov()` +## More on Linear regression: factors {.smaller} + +You can view estimates for the comparison group by removing the intercept in the GLM formula + +`y ~ x - 1` + +*Caveat* is that the p-values change, and interpretation is often confusing. +```{r regress9, comment="", fig.height=4, fig.width=8} +fit_force_intercept <- + glm(crowd ~ pci + sngpnt + county - 1, data = sp_dat) +summary(fit_force_intercept) +``` + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + From d946943ab724a576d43e0d93f86ef93e395532ae Mon Sep 17 00:00:00 2001 From: avahoffman Date: Tue, 13 Jan 2026 12:57:56 -0500 Subject: [PATCH 3/7] Trivial change to rerender --- modules/Statistics/Statistics.Rmd | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/modules/Statistics/Statistics.Rmd b/modules/Statistics/Statistics.Rmd index e3915e00..78d79872 100644 --- a/modules/Statistics/Statistics.Rmd +++ b/modules/Statistics/Statistics.Rmd @@ -813,3 +813,7 @@ summary(fit_force_intercept) + + + + From 865357a11712e57f1d86c234a66dd7128da75374 Mon Sep 17 00:00:00 2001 From: avahoffman Date: Tue, 13 Jan 2026 13:08:06 -0500 Subject: [PATCH 4/7] Add new dataset to data tab of website --- data.Rmd | 1 + 1 file changed, 1 insertion(+) diff --git a/data.Rmd b/data.Rmd index 59b34ebb..0344f6fa 100644 --- a/data.Rmd +++ b/data.Rmd @@ -16,6 +16,7 @@ | [Climate change disasters](https://daseh.org/data/Yearly_CC_disasters_total_affected.csv) | Data about the number of people affected by total disasters (including droughts, extreme temperatures, floods, landslides, storms, and wildfires) by country and year. 
| • Manipulating Data in R Lab | [International Monetary Fund (IMF)](https://climatedata.imf.org/datasets/b13b69ee0dde43a99c811f592af4e821_0/about) | | [CO heat-related ER visits](https://daseh.org/data/CO_ER_heat_visits.csv) | Age-adjusted visit rates and total number of visits for all genders by Colorado county for 2011-2023, collected by the Colorado Environmental Public Health Tracking program | • Data Input
• Subsetting Data in R
• Data Summarization
• Data Cleaning
• Intro to Data Visualization
• Data Visualization
• Factors
• Statistics
• Data Output
• Functions | https://coepht.colorado.gov/heat-related-illness | | [COVID wastewater surveillance](https://daseh.org/data/SARS-CoV-2_Wastewater_Data.csv) | SARS-CoV-2 viral load measured in wastewater between 2022 and 2024, collected by the collected by the National Wastewater Surveillance System |• Data Classes | https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Wastewater-Metric-Data/2ew6-ywp6/about_data | +| [CDC socioeconomic themes](https://daseh.org/data/socioeco_dat.csv) | 2018 socioeconomic theme data created by the Centers for Disease Control and Prevention (CDC) / Agency for Toxic Substances and Disease Registry (ATSDR) / Geospatial Research, Analysis, and Services Program (GRASP) |• Statistics | https://hub.scag.ca.gov/datasets/18981b657cf04f2dbe0df065f20581db_5/about | | [Flu internet searches](https://daseh.org/data/Wojcik_2021_flu.csv) | This study looks at the use of internet search data to track prevalence of Influenza-Like Illness (ILI) | • Statistics | https://www.nature.com/articles/s41467-020-20206-z | | [Nitrate exposure](https://daseh.org/data/Nitrate_Exposure_for_WA_Public_Water_Systems_byquarter_data.csv) | The amount of people in Washington exposed to excess levels of nitrate in their water between 1999 and 2020 by quarter, collected by the Washington Tracking Network | • Manipulating Data in R | https://doh.wa.gov/data-and-statistical-reports/washington-tracking-network-wtn/drinking-water | | [Weather on Mars](https://daseh.org/data/kaggleMars_Dataset.csv) | Information about temperature measures from the Rover Environmental Monitoring Station (REMS) on Mars, collected by Spain and Finland | • Homework 2 | https://www.kaggle.com/datasets/deepcontractor/mars-rover-environmental-monitoring-station/data | From a3686a3b0f0b5db46fdaa008afbd0b212d1e5f35 Mon Sep 17 00:00:00 2001 From: avahoffman Date: Tue, 13 Jan 2026 13:09:54 -0500 Subject: [PATCH 5/7] Update dictionary.txt --- resources/dictionary.txt | 3 +++ 1 file 
changed, 3 insertions(+) diff --git a/resources/dictionary.txt b/resources/dictionary.txt index a7da0f15..8e507f45 100644 --- a/resources/dictionary.txt +++ b/resources/dictionary.txt @@ -123,12 +123,14 @@ hrbr http HTTPS https +hu Humphries HW hydrocodone ide ifelse Ihaka +ILI IMG inclusivity inute @@ -181,6 +183,7 @@ NISSANs nizovatina NonCommercial nonconfidential +noveh NWSS obert ocs From bd0b80049138c9c069a3a0637c78a3fd921a5ccf Mon Sep 17 00:00:00 2001 From: avahoffman Date: Tue, 13 Jan 2026 13:19:25 -0500 Subject: [PATCH 6/7] Resolve #235 --- modules/Statistics/Statistics.Rmd | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/modules/Statistics/Statistics.Rmd b/modules/Statistics/Statistics.Rmd index 78d79872..f4d88179 100644 --- a/modules/Statistics/Statistics.Rmd +++ b/modules/Statistics/Statistics.Rmd @@ -711,6 +711,12 @@ Image by