What this post is about

I have been thinking a lot lately about the importance of centering variables. This is something that has earned me eyerolls and scorn from my nearest collaborators, because it is so basic. Yet I think there is something to unpack here. To pick up on a theme of this blog – yes, it is good to recover the basics, doing stuff that everyone knows but that are basic to master before you go on to more advanced things.

This post is based on simulation of data from models, then fitting models to that data. I love this approach! It helps me to understand how a statistical mocle works before you apply it to real data. So let’s do that here! I hope that this useful to anyone who tries to perform their own regression analyses.

hmm let’s eat pizza!

Suppose you are in charge of ordering pizza for this year’s Sportsball tournament. All you have is the mass of each kid, which is something that sportsball teams usually maesure. Of course, you have some data about last year’s tourney – you know the mass of each kid and about how much each one ate (idk why you have this information, but come with me on an imagination voyage just the same).

First let’s generate random kids masses! Every Sportsball team has 33 players:


knitr::opts_chunk$set(message = FALSE, cache=TRUE)
body_mass <- rnorm(33, 40, 2.3) %>% round(digits = 1)

(I just googled “mass of 12-year-old in kg”; turns out this is 40kg)

We’re going to simulate data, and we need some plausible average values. OK, So you know from last year’s event that kids ate on average 4 slices of pizza each, and you reckon that for every kilogram of body mass they probably eat about another fifth of a slice – so if a kid is 5kg heavier, they’d probably want another whole slice.

We haven’t even got to centering the model, and already it is easier to think of the varialbe as centered at the mean.

average_pizza <- 3

mass_hunger <- 0.2
mean_mass <- round(mean(body_mass), 1)
body_mass_centered <- body_mass - mean_mass

average_per_kid <- average_pizza + mass_hunger * body_mass_centered

## [1] 2.18 3.98

OK so between 2 and 4 slices per kid – seems plausible! But of course how much a kid eats is not determined by their mass alone. They vary in many ways, including their own inclinations, how they’ve been socialized, how hard they played Sportsball that day, what they had for breakfast etc. Many small effects might well give us a normal variation around the pizza slice average1:

# add normal variation to the average values
# and round because anything smaller than tenths is just silly:
actual_per_kid <- round(average_per_kid + rnorm(n = 33, mean = 0, sd = 0.3), 1)

plot(body_mass_centered + mean_mass, actual_per_kid)

Above I generate the actual amount that each kid eats, by calculating the average and then adding some random noise on top of it. To do that, you can either generate random numbers from a normal distribution around the average rnorm(33, average_per_kid, sd = 0.25) or generate values with a mean of 0 and add them to the average – I chose to do the latter2.

In the figure above we can easily two things: the data look plausible and suitably variable, and the body masses look correct. You can see that the x-axis is actually a sum of the centered body mass and the mean – which of course gives you back the original mass.

We can do a quick linear model to see if we can recover the original parameters:

pizza_data <- data.frame(body_mass_centered, mean_mass, average_per_kid, actual_per_kid)

pizza_amount <- stan_lm(actual_per_kid ~ 1 + body_mass_centered,
                        data = pizza_data,
                        prior = R2(0.3, what = "mean"),
                        chains = 2, cores = 2)

pizza_amount %>% 
  gather_samples(`(Intercept)`, body_mass_centered) %>%
  ggplot(aes(y = term, x = estimate)) +
  geom_halfeyeh() + 
  geom_vline(xintercept = c(0.2, 3), colour = "forestgreen") + 

Green vertical lines show us the position of the true values before the simulation.

This shows us that the model estimates the original values pretty precisely. The estimate of the intercept is lower than our intended value, but this is due to sampling error: we set a value of 3 for an average mass of 40kg, but the average in this sample is lower: 39.6kg. So here we are getting an estimate of the sample mean (ie for this Sportsball team), while 3 was the population mean (i.e. average across all possible teams). We’ll come back to this later.

Little side note: units and intercepts

I wish more training in linear regression emphasized the units of the coefficents. It’s important to remember that even though we plotted the Intercept and the slope of body_mass_centered on the same axis above, they are NOT the same kind of number!

  • The Intercept is in units of “Slices of Pizza”
  • the slope of body mass is in units of “Slices of pizza x kilogram-1

In other words, the slope turns the predictor into the same units as the response, while the Intercept is already in those units. I see so many graduate students who think that statistics is something magical, and I think that a clear understanding of what the numbers represent is a key to unlock their biological thinking from the prison of statistical fear.

As an aside, the intercept is just a constant added to the model to slide the regression line up and down. We could add any constant we want! A very logical one might be the overall mean of the regression:

pizza_amount_mean <- lm(actual_per_kid ~ 0 + mean_mass + body_mass_centered,
                        data = pizza_data)

##                 term   estimate    std.error statistic      p.value
## 1          mean_mass 0.07421651 0.0001213148  611.7681 7.658278e-65
## 2 body_mass_centered 0.34489206 0.0024172800  142.6777 2.977745e-45

That gives you a sort of scale to transform the average x value into the average y value:

## [1] 39.6
mean_mass * 0.07379
## [1] 2.922084
## [1] 2.951515

When you do it this way, both coefficients in the model are on the same scale: both convert body size of a player into pizza slices.

More than one team

Let’s say you are actually buying the pizzas for a whole competition, between five competing teams of about the same size.

team_data <- data.frame(body_mass_centered, mean_mass, average_per_kid)

average_pizza <- 3
mass_hunger <- 0.2


five_teams <- rerun(5,
                      body_mass = rnorm(33, 40, 2.3) %>% round(digits = 1),
                      mean_mass = round(mean(body_mass), 1),
                      body_mass_centered = round(body_mass - mean_mass, 1),
                      average_per_kid = average_pizza + mass_hunger * body_mass_centered,
                      actual_per_kid = round(average_per_kid + rnorm(33, 0, 0.33), 1)
                    ) %>% 
  set_names(paste0("team", LETTERS[1:5])) %>% 
  bind_rows(.id = "team") %>% 

five_teams %>% 
  ggplot(aes(x = body_mass, y = actual_per_kid, colour = team)) +
  theme_bw() + 
  stat_smooth(method = "lm", se = FALSE) + 
  geom_point() + 
  scale_color_brewer(type = "qual", palette = 2)

OK the one weird thing about these simulations is that we always assume that the mean of the response is 3, no matter what the sampled average is. We’ll get to adding more realistic scenarios later.

pizza_amount_five_teams <- stan_lmer(
  actual_per_kid ~ 1 + mean_mass + body_mass_centered + (1 + body_mass_centered | team),
  data = five_teams,
  chains = 2, cores = 2)
## Warning: There were 15 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: Examine the pairs() plot to diagnose sampling problems
## $team
##       (Intercept)   mean_mass body_mass_centered
## teamA    3.639084 -0.01577895          0.2003332
## teamB    3.637929 -0.01577895          0.2039835
## teamC    3.637528 -0.01577895          0.1951352
## teamD    3.636932 -0.01577895          0.2022604
## teamE    3.639153 -0.01577895          0.1952134
## attr(,"class")
## [1] "coef.mer"
##        (Intercept)          mean_mass body_mass_centered 
##         3.63839189        -0.01577895         0.19913441
pizza_amount_five_teams %>% 
  gather_samples(`(Intercept)`, body_mass_centered, mean_mass) %>%
  ggplot(aes(y = term, x = estimate)) +
  geom_halfeyeh() + 
  geom_vline(xintercept = c(0.2, 3), colour = "forestgreen") + 
  theme_bw() + 
  coord_cartesian(xlim = c(-1, 7))

These coefficients are .. a little hard to interpret. The slope of body mass is still OK and close to the true value.. but the intercept is a huge quantity (7!? Who can eat 7 slices of pizza?). Here, the regression is giving us an answer to an impossible question: what is the average appetite of the average sized kid (body_mass_centered = 0) in a group with no mass (mean_mass = 0). This happens to be a large number here, because the masses happen to have a (slight, insignificant) downward trend.

This too-high intercept will still get us good predictions, because it gets lowered by the product of mean_mass (a large number) but the coefficient (a small, but negative, number). This works because the mean masses are large numbers – much larger than the centered body sizes, which are differences from these values:

five_teams$body_mass_centered %>% range
## [1] -5.5  6.1
five_teams$mean_mass %>% range
## [1] 39.3 40.7

We can make this a lot simpler by CENTERING AGAIN!! This time, we centre the group means – in other words, we centre the centres!

five_teams_cen <- five_teams %>% 
  mutate(mean_mass_c = mean_mass - mean(mean_mass))

pizza_amount_five_teams_c <- stan_lmer(
  actual_per_kid ~ 1 + mean_mass_c + body_mass_centered + (1 + body_mass_centered | team),
  data = five_teams_cen,
  chains = 2, cores = 2)

##        (Intercept)        mean_mass_c body_mass_centered 
##          3.0055382         -0.0176928          0.1998710
pizza_amount_five_teams_c %>% 
  gather_samples(`(Intercept)`, body_mass_centered, mean_mass_c) %>%
  ggplot(aes(y = term, x = estimate)) +
  geom_halfeyeh() + 
  geom_vline(xintercept = c(0.2, 3), colour = "forestgreen") + 
  theme_bw() + 
  coord_cartesian(xlim = c(-0.5, 3.2))

Much better! Now the coefficients on body_mass_centered and (Intercept) are super narrow and exactly what we expect to see. The effect of mean_mass is still slightly negative, but very close to 0. That’s the right answer, of course – we didn’t use those values to simulate the data, and the model correctly thinks they don’t matter.

We have broken up the numbers in the dataset into small bits, but the result is more interpretable model coefficients.

five_teams_cen$body_mass_centered %>% range
## [1] -5.5  6.1
five_teams_cen$mean_mass_c %>% range
## [1] -0.7  0.7

In constrast to the above, the mean_mass_c values are now very close to 0, in addition to covering a very small range. In this example, where they do nothing, it amounts to something close to adding 0 to the model formula – in other words, doing nothing. Because we’ve centered them, the model is free to use the intercept to represent the mean of the response, and the other terms can represent differences to this mean.

I.. can’t resist..

  1. people who think that pizza slices ought to follow a poisson distribution clearly are not eating pizza in the same way as I do

  2. I think that generating normal variates around the average is probably easier for students to understand, because it seems closer to a textbook definition of regression. But adding normal errors to an average is awesome because it builds our intuition: linear models are all about just adding things together. We need to start thinking of that right away!