Fitting many statistical models at once using dplyr

One common task in applied statistics is to fit and interpret a number of statistical models at once. For example, fitting a model with the same structure to a number of different outcome or explanatory variables, or fitting several models with different structure to the same data. Here are some examples of how I usually do this, using features that were introduced with dplyr version 1.1.0.

For this demonstration, we’ll be using the R packages dplyr, tidyr, ggplot2 (all of which are included in the tidyverse), as well as gt for making tables, emmeans for obtaining estimated means and comparisons, and performance for model-checking.


We’ll be using the Palmer Penguins data collected at Palmer Station, Antarctica, made available by Dr Kristen Gorman, and conveniently accessible in R using the palmerpenguins package. This dataset contains measurements on a number of penguins of different species on different islands.

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Suppose we want to compare bill length, bill depth, flipper length, and body mass between species. We could manually run a separate model for each, but here’s a way to to automate the process. As with many things in R, the trick to doing this easily is to get the data in long form, with the outcomes stacked on top of each other, and a variable indicating which outcome is being measured.

The pivot_longer() function from tidyr gets the data into this format. I’ve also taken an extra step of recoding the “outcome” variable to give more descriptive labels. This isn’t required but it will make the tables and plots that we make later look nicer.

penguins_long <- penguins %>%
    c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g),
    names_to = "outcome",
    values_to = "value"
  ) %>%
    outcome = fct_inorder(outcome),
    outcome = fct_recode(
      "Bill length (mm)" = "bill_length_mm",
      "Bill depth (mm)" = "bill_depth_mm",
      "Flipper length (mm)" = "flipper_length_mm",
      "Body mass (g)" = "body_mass_g"

Looking at the first few rows of this data frame, you can see we now have four rows for each penguin, one for each type of measurement:

# A tibble: 6 × 6
  species island    sex     year outcome              value
  <fct>   <fct>     <fct>  <int> <fct>                <dbl>
1 Adelie  Torgersen male    2007 Bill length (mm)      39.1
2 Adelie  Torgersen male    2007 Bill depth (mm)       18.7
3 Adelie  Torgersen male    2007 Flipper length (mm)  181  
4 Adelie  Torgersen male    2007 Body mass (g)       3750  
5 Adelie  Torgersen female  2007 Bill length (mm)      39.5
6 Adelie  Torgersen female  2007 Bill depth (mm)       17.4

We can use group_by() and summarise() from the dplyr package to group the rows by outcome, and then “summarise” these groups of rows down to a single row containing a statistical model for each outcome. This makes use of a couple of R tricks: ‘list columns’, a column in a data frame that contains a more complex object than the standard R data types (numeric, character, etc); and the new pick() verb which returns a data frame containing selected columns (in this case, all of them) within the group that’s being operated on.

penguin_models <- penguins_long %>%
  group_by(outcome) %>%
    model = list(
      lm(value ~ species, data = pick(everything()))

You can see the resulting data from contains four rows, one for each outcome, and a statistical model (“lm”) for each:

# A tibble: 4 × 2
  outcome             model 
  <fct>               <list>
1 Bill length (mm)    <lm>  
2 Bill depth (mm)     <lm>  
3 Flipper length (mm) <lm>  
4 Body mass (g)       <lm>  

Ideally we would also do a visual check of model assumptions. One way to do this is something like the code below, which saves a residual plot for each model to an image file, which can be inspected later. It uses the check_model() function in the performance package to generate the plots. The generated residual plots aren’t shown here, but they all look fine.

  ~ ggsave(
    paste0(.x, ".png"),
    plot(check_model(.y, check = c("pp_check", "linearity",
                                   "homogeneity", "qq"))),
    width = 12,
    height = 9

Once we’ve fitted the models, we can obtain quantities of interest from them. In this example we’ll look at estimated means for each species, p-values testing the hypothesis that all species means are equal (against at least one pair of means being different), and comparisons (differences in means) between all pairs of species.

The reframe() function from dplyr allows us to run some code that produces a data frame on each model and stack the results on top of each other. We can use the emmeans() function from the emmeans package to obtain estimated marginal means and as_tibble() to convert the result into a data frame. The rowwise(outcome) at the start tells reframe() that we want to call emmeans() separately for each row of the data frame (i.e., each outcome model), and preserve the outcome variable in the result.

penguin_means <- penguin_models %>%
  rowwise(outcome) %>%
    emmeans(model, "species") %>%

The first few rows of the resulting data frame are shown below. There is a row for each outcome for each species, containing the mean (emmean), standard error (SE), degrees of freedom (df) and lower and upper confidence limits (lower.CL and upper.CL).

# A tibble: 6 × 7
  outcome          species   emmean     SE    df lower.CL upper.CL
  <fct>            <fct>      <dbl>  <dbl> <dbl>    <dbl>    <dbl>
1 Bill length (mm) Adelie      38.8 0.241    339     38.3     39.3
2 Bill length (mm) Chinstrap   48.8 0.359    339     48.1     49.5
3 Bill length (mm) Gentoo      47.5 0.267    339     47.0     48.0
4 Bill depth (mm)  Adelie      18.3 0.0912   339     18.2     18.5
5 Bill depth (mm)  Chinstrap   18.4 0.136    339     18.2     18.7
6 Bill depth (mm)  Gentoo      15.0 0.101    339     14.8     15.2

We can use ggplot() to present the results visually. The plot shows that there’s a substantial variation between species in means of all of these measurements, with little or no overlap between many of the confidence intervals. Gentoo penguins appear to be heavier, and have longer flippers but shorter and shallower bills, than the other species.

penguin_means %>%
  ggplot(aes(x = emmean, y = species, xmin = lower.CL, xmax = upper.CL)) +
  geom_errorbar(width = 0.5) +
  geom_point() +
  scale_y_discrete(limits = rev) +
  facet_wrap(vars(outcome), nrow = 2, scales = "free_x") +
    x = "Mean", 
    y = "Species",
    caption = "Error bars show 95% confidence interval for mean."

The gt() function can be used to produce a nice table of results. The code shown below combines the lower.CL and upper.CL columns to produce a single column with the confidence interval, and separately specifies fewer decimal places for body mass than the other measures. The group_by() function before gt() results in a table sub-heading for each outcome. You could easily change this to group_by(species) to arrange the results by species.

penguin_means %>%
  group_by(outcome) %>%
  gt() %>%
  fmt_number(c(emmean, SE, lower.CL, upper.CL),
             decimals = 1, use_seps = FALSE) %>%
  fmt_number(c(emmean, SE, lower.CL, upper.CL),
             rows = outcome == "Body mass (g)",
             decimals = 0, use_seps = FALSE) %>%
  fmt_number(df, decimals = 0) %>%
  cols_align("left", species) %>%
  cols_merge_range(lower.CL, upper.CL, sep = ", ") %>%
    species = "Species",
    emmean = "Mean",
    lower.CL = "95% Confidence Interval"
SpeciesMeanSEdf95% Confidence Interval
Bill length (mm)
Adelie38.80.233938.3, 39.3
Chinstrap48.80.433948.1, 49.5
Gentoo47.50.333947.0, 48.0
Bill depth (mm)
Adelie18.30.133918.2, 18.5
Chinstrap18.40.133918.2, 18.7
Gentoo15.00.133914.8, 15.2
Flipper length (mm)
Adelie190.00.5339188.9, 191.0
Chinstrap195.80.8339194.2, 197.4
Gentoo217.20.6339216.0, 218.4
Body mass (g)
Adelie3701383393627, 3775
Chinstrap3733563393623, 3843
Gentoo5076423394994, 5158

We can do similar to produce an overall “F” test for each model, testing the hypothesis that all species have equal means for a particular measure against the hypothesis that at least one pair of means is different. The joint_tests() function in emmeans does this.

penguin_tests <- penguin_models %>%
  rowwise(outcome) %>%
    joint_tests(model) %>%

The resulting data frame is shown below. This time there is one row per model, but if there had been multiple variables in the model, there would have been one row per variable or interaction term (distinguished by the model term column). Each row contains the results of a hypothesis test: numerator and denominator degrees of freedom (df1 and df2), F-statistics (F.ratio) and p-value (p.value).

# A tibble: 4 × 6
  outcome             `model term`   df1   df2 F.ratio   p.value
  <fct>               <chr>        <dbl> <dbl>   <dbl>     <dbl>
1 Bill length (mm)    species          2   339    411. 2.69e- 91
2 Bill depth (mm)     species          2   339    360. 1.51e- 84
3 Flipper length (mm) species          2   339    595. 1.35e-111
4 Body mass (g)       species          2   339    344. 2.89e- 82

Again, this can be presented in a table using gt():

penguin_tests %>%
  gt() %>%
  fmt_number(F.ratio, decimals = 1) %>%
  fmt_number(p.value, decimals = 3) %>%
  cols_merge_range(df1, df2, sep = ", ") %>%
  sub_small_vals(p.value, threshold = 0.001) %>%
    outcome = "Outcome",
    `model term` = "Predictor",
    df1 = "df",
    F.ratio = "F",
    p.value = "p-value"
Bill length (mm)species2, 339410.6<0.001
Bill depth (mm)species2, 339359.8<0.001
Flipper length (mm)species2, 339594.8<0.001
Body mass (g)species2, 339343.6<0.001

Finally, we often want to obtain comparisons between particular estimated quantities. In this example we use the emmeans package again for this, this time using the pairs() function to produce comparisons between all pairs of species.

penguin_pairs <- penguin_models %>%
  rowwise(outcome) %>%
    emmeans(model, "species") %>%
      pairs(infer = TRUE, reverse = TRUE, adjust = "none") %>%

The first few rows of the data frame are shown below. The contents are similar to what we saw earlier for the estimated means, but this time each row contains information on a difference between pairs of means (described in the contrast column), along with the estimated difference in means, standard error, degrees of freedom, confidence interval, t-statistic and p-value.

# A tibble: 6 × 9
  outcome       contrast estimate    SE    df lower.CL upper.CL t.ratio  p.value
  <fct>         <chr>       <dbl> <dbl> <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
1 Bill length … Chinstr…  10.0    0.432   339    9.19    10.9    23.2   4.23e-72
2 Bill length … Gentoo …   8.71   0.360   339    8.01     9.42   24.2   5.33e-76
3 Bill length … Gentoo …  -1.33   0.447   339   -2.21    -0.449  -2.97  3.18e- 3
4 Bill depth (… Chinstr…   0.0742 0.164   339   -0.248    0.396   0.453 6.50e- 1
5 Bill depth (… Gentoo …  -3.36   0.136   339   -3.63    -3.10  -24.7   7.93e-78
6 Bill depth (… Gentoo …  -3.44   0.169   339   -3.77    -3.11  -20.3   1.59e-60

These comparisons can be plotted or presented in a table using code very similar to what we used for the estimated means. The plot below also includes a dotted line indicating zero difference, which can be used as a visual indicator for whether comparisons are statistically significant.

penguin_pairs %>%
  ggplot(aes(x = estimate, y = contrast, xmin = lower.CL, xmax = upper.CL)) +
  geom_vline(xintercept = 0, linetype = "dotted") +
  geom_errorbar(width = 0.5) +
  geom_point() +
  scale_y_discrete(limits = rev) +
  facet_wrap(vars(outcome), nrow = 2, scales = "free_x") +
    x = "Difference in means", 
    y = "Contrast",
    caption = "Error bars show 95% confidence interval for difference in mean."