j’ n’y suis jamais allé quoi

Ah dacc, tout s’explique donc

Andrew: … pardon, mais vous avez dit combien de mots là?

I have something like that interaction almost every day. French is an intricate language, and high school studying prepared me well for all the conjugations, rules, and exceptions. What I was not prepared for was the sheer number of words. The French pepper their speech with small pronouns, exclamations and phrases – which they then elide when speaking, you know – for convenience. I’m sure it makes speaking faster – but putain it also makes understanding them harder!

In other words, it’s just like y ~ x + (x|group)

Not all of us are language minorities in statistical modelling. Many practitioners today started their career several programming languages ago. To them, modern “conveniences” like R formulae are compact and information dense. I imagine they marvel at how much can be expressed in so few keystrokes1 But to myself and many other early-career scientists, statistical programming is a country we had to move to. And in this land they speak a strange tongue, and they speak it quickly.

How to speak formula like a local

Here is an example, straight from the lme4 vignette:

## Loading required package: Matrix
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (Days | Subject)
##    Data: sleepstudy
## REML criterion at convergence: 1743.6
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9536 -0.4634  0.0231  0.4634  5.1793 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 612.09   24.740       
##           Days         35.07    5.922   0.07
##  Residual             654.94   25.592       
## Number of obs: 180, groups:  Subject, 18
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  251.405      6.825   36.84
## Days          10.467      1.546    6.77
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138

If you had asked the Andrew of a few years ago what this model formula meant, you would have heard a rather vague answer, something like “well it fits the fixed effect of Days with a random effect of Days by Subject”. But if you pressed him, he’d have a hard time explaining exactly what he meant. How many numbers does that take? What is the equation for these effects?

The model actually has five numbers in it:

##                           term     estimate std.error statistic    group
## 1                  (Intercept) 251.40510485  6.824556 36.838310    fixed
## 2                         Days  10.46728596  1.545789  6.771485    fixed
## 3       sd_(Intercept).Subject  24.74044768        NA        NA  Subject
## 4              sd_Days.Subject   5.92213326        NA        NA  Subject
## 5 cor_(Intercept).Days.Subject   0.06555134        NA        NA  Subject
## 6      sd_Observation.Residual  25.59181589        NA        NA Residual
  1. The intercept (Reaction when Days = 0)
  2. A slope (How Reaction goes up when Days goes up by 1)
  3. The standard deviation of the Intercept: how different are Subjects in their Intercept?
  4. The standard deviation of the slope: how different are the Subjects in the effect of Days?
  5. the correlation between those last two items

And if you consider that you’re interested in the actual people in the study, there’s an average slope and intercept and 18 departures from it:

## (Intercept)        Days 
##   251.40510    10.46729
## $Subject
##     (Intercept)        Days
## 308   2.2585654   9.1989719
## 309 -40.3985770  -8.6197032
## 310 -38.9602459  -5.4488799
## 330  23.6904985  -4.8143313
## 331  22.2602027  -3.0698946
## 332   9.0395259  -0.2721707
## 333  16.8404312  -0.2236244
## 334  -7.2325792   1.0745761
## 335  -0.3336959 -10.7521591
## 337  34.8903509   8.6282839
## 349 -25.2101104   1.1734143
## 350 -13.0699567   6.6142050
## 351   4.5778352  -3.0152572
## 352  20.8635925   3.5360133
## 369   3.2754530   0.8722166
## 370 -25.6128694   4.8224646
## 371   0.8070397  -0.9881551
## 372  12.3145394   1.2840297
## [1] 18

Not knowing, then suddenly realizing, can come as a nasty shock. If you’re speaking your second language, why would you think that this:

Reaction ~ Days + (Days | Subject)

contains a correlation, for crying out loud!? In the same way, I can’t help but be vaguely, unpleasantly surprised when I’m trying to parse some French that somebody just spoke, and I find a y right there among some other normal words. (what is the y even for?!)

There are still more means of writing models that are meant to “save time” (how about Days + (Days|Group/Subject), but compress even more terms into fewer characters on the computer. Fortunately there is an extensive resource on mixed models, which includes a dictionary defining the different model formulae.

How to learn what all these mean? Well, I did what I’ve done in the past when I’ve traveled to lands where I didn’t speak the language: I made flashcards. That worked for a while. But I think it is better for learners if they master the meaning, not just the sound. Is there an easier way?

“Sorry could you repeat that please”

Can you slow down the speech, so you can hear every syllable? Of course. There are always many ways to write out a model. Here is the exact2 same model as above, written in the rethinking package.

It takes… a bit more typing:


sleep_model <- alist(
  Reaction                     ~ dnorm(mean_reaction, sigma_reaction),  
  mean_reaction              <-  a              + bdays              * Days + 
                                 a_sub[Subject] + bdays_sub[Subject] * Days,
  c(a_sub, bdays_sub)[Subject] ~ dmvnormNC(sig_sub, rho_sub),
  sig_sub                      ~ dcauchy(0, 2),
  rho_sub                      ~ dlkjcorr(4),
  a                            ~ dnorm(30, 5),
  bdays                        ~ dnorm(2, 5),
  sigma_reaction               ~ dcauchy(0, 2)

data(sleepstudy, package = "lme4")

sleepstudy$Subject <- as.integer(sleepstudy$Subject)

sleep_model <- map2stan(sleep_model, data = sleepstudy , iter = 3000 , warmup=1000 , chains=1 , cores=1 )
# sample via rethinking

aaaaaand now my hands are tired. When you write a model this way you have to put all the pieces together by hand. The philosophy behind the rethinking package and book is that students should learn how the little statistical robot works before they go and turn it loose on the world. I couldn’t agree more.

Slow down and talk proper

Just as there’s no right way to speak, there is no right way to create statistical models. But I think we can all agree that there is such a thing as clarity in communication. This is especially obvious when you’re in front of a class teaching, or when your colleague / grad student / friend’s grad student / friend calls you for advice. Can we keep the convenience, but express ourselves more clearly?

Perhaps we can do some simple, even trivial things to make our models a little easier to read. One Weird Trick might be to just write out the intercept:

Reaction ~ 1 + Days + (1 + Days | Subject)

Even easier, if you are a Bayesian anyways, is to write out all your priors (which is an upright and righteous thing to do anyways). This way you can see your model just by counting them. This is good practice, for example, when you’re writing using brms or rstanarm.

et alors

I would wager that most ecologists don’t realize how model formulae translate into parameters in the model. Perhaps that’s not the biggest problem. Just as I may not need to know how to separate every word in a French expression to understand it, or even use it myself. At the same time though – as anyone who has lived as a linguistic minority will tell you – if you only partly understand the words you are using, you will humiliate yourself eventually. I suspect the same holds for all languages, even R’s weird formula one.

However we write out our models, we should make sure that our students (collaborators, supervisors, consciences) know how many actual numbers are going to be calculated. Most of all, let’s remember how hard it was to strike up a conversation when we first started learning this odd foreign language.

  1. Richard McElreath wrote an excellent blog on this very topic, which you should read. In fact, this post is a sort of compliment to his. His post describes how statistical software grew up around him as he was developing as a scientist (There are photos of punch cards).

  2. Ok not exact exact, it is obviously Bayesian and has priors, which the lme4 one does not.