These notes largely follow the book, and I don’t have time to do all the exercises. In some cases I’ve compared Simon Wood’s “by hand” solutions in the text to other packages or methods.
Chapter 1 is about linear models and their theory.
Write out the following models in form \(\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}\).
It has one factor, with 3 levels, and each level has 2 observations. R can make this with model.matrix
.
<- factor(rep(1:3,each=2))
b model.matrix(~ b, data.frame(b=b))
## (Intercept) b2 b3
## 1 1 0 0
## 2 1 0 0
## 3 1 1 0
## 4 1 1 0
## 5 1 0 1
## 6 1 0 1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$b
## [1] "contr.treatment"
So we write it as: \[ \begin{bmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ y_{31} \\ y_{32} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta_{2} \\ \beta_{3} \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{21} \\ \epsilon_{22} \\ \epsilon_{31} \\ \epsilon_{32} \end{bmatrix} \]
We do it in R, note that we use expand.grid
to generate all unique combinations. There’s only 1 observation per stratum, so that’s sufficient.
<- factor(1:3)
beta <- factor(1:4)
gamma model.matrix(~ beta + gamma, expand.grid(beta=beta,gamma=gamma))
## (Intercept) beta2 beta3 gamma2 gamma3 gamma4
## 1 1 0 0 0 0 0
## 2 1 1 0 0 0 0
## 3 1 0 1 0 0 0
## 4 1 0 0 1 0 0
## 5 1 1 0 1 0 0
## 6 1 0 1 1 0 0
## 7 1 0 0 0 1 0
## 8 1 1 0 0 1 0
## 9 1 0 1 0 1 0
## 10 1 0 0 0 0 1
## 11 1 1 0 0 0 1
## 12 1 0 1 0 0 1
## attr(,"assign")
## [1] 0 1 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$beta
## [1] "contr.treatment"
##
## attr(,"contrasts")$gamma
## [1] "contr.treatment"
It’s written as: \[ \begin{bmatrix} y_{11} \\ y_{21} \\ y_{31} \\ y_{12} \\ y_{22} \\ y_{32} \\ y_{13} \\ y_{23} \\ y_{33} \\ y_{14} \\ y_{24} \\ y_{34} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta_{2} \\ \beta_{3} \\ \gamma_{2} \\ \gamma_{3} \\ \gamma_{4} \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{21} \\ \epsilon_{31} \\ \epsilon_{12} \\ \epsilon_{22} \\ \epsilon_{32} \\ \epsilon_{13} \\ \epsilon_{23} \\ \epsilon_{33} \\ \epsilon_{14} \\ \epsilon_{24} \\ \epsilon_{34} \end{bmatrix} \]
<- factor(rep(1:2,times=c(2,4)))
b <- c(0.1,0.4,0.5,0.3,0.4,0.7)
x model.matrix(~ beta + x, data.frame(beta = b, x = x))
## (Intercept) beta2 x
## 1 1 0 0.1
## 2 1 0 0.4
## 3 1 1 0.5
## 4 1 1 0.3
## 5 1 1 0.4
## 6 1 1 0.7
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$beta
## [1] "contr.treatment"
Written out: \[ \begin{bmatrix} y_{1} \\ y_{2} \\ y_{3} \\ y_{4} \\ y_{5} \\ y_{6} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0.1 \\ 1 & 0 & 0.4 \\ 1 & 1 & 0.5 \\ 1 & 1 & 0.3 \\ 1 & 1 & 0.4 \\ 1 & 1 & 0.7 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta_{2} \\ \gamma \end{bmatrix} + \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \\ \epsilon_{6} \end{bmatrix} \]
lm
with R’s linear algebraWe are using the linear model \(y_{i} = \mathbf{X}_{i}\mathbf{\beta} + \epsilon_{i}\) and \(\epsilon_{i}\sim N(0,\sigma^2)\). This question basically asks us to replicate lm
using R’s built in linear algebra functions.
This stackexchange answer is really helpful for figuring out what’s going on.
y
, model matrix X
and computes least squares estimate of parameters \(\beta\) based on QR decomposition of X
.The function works like this:
backsolve
to get \(\hat{\mathbf{\beta}}=\mathbf{R}^{-1}\mathbf{f}\).backsolve
to get \(\mathbf{R}^{-1}\) by solving the problem \(\mathbf{R}\mathbf{R}^{-1} = \mathbf{I}\) for the inverse, then the covariance matrix is \(\mathbf{V}_{\hat{\beta}} = \mathbf{R}^{-1}\mathbf{R}^{\intercal}\sigma^{2}\).<- 100
n <- c(5,3)
beta <- rlnorm(n = n)
x1 <- beta[1] + beta[2]*x1 + rnorm(n = n)
y
<- model.matrix(~ x, data.frame(x=x1))
X
<- function(y,X) {
least_sq <- qr(X) ## QR decomposition
qrx <- qr.qty(qrx,y) ## form Q'y efficiently
yy <- qr.R(qrx) ## R
R <- ncol(X);n <- nrow(X) ## get dimensions
p <- yy[1:p]; r <- yy[(p+1):n] ## partition Q'y
f <- backsolve(R,f) ## backsolve(R,f) solves R * beta = f for beta
beta <- sum(r^2)/(n-p) ## resid variance estimate (1.8)
sig2 <- backsolve(R,diag(ncol(R))) ## inverse of R matrix, R * Ri = I for Ri
Ri <- Ri%*%t(Ri)*sig2 ## covariance matrix
Vb <- diag(Vb)^.5 ## standard errors (c)
se <- f^2/sig2 ## sequential F-ratios
F.ratio <- 1-pf(F.ratio,1,n-p) ## seq. p-values (e)
seq.p.val list(beta=beta,se=se,sig2=sig2,seq.p.val=seq.p.val,df=n-p)
}
least_sq(y = y,X = X)
## $beta
## [1] 5.120932 2.982596
##
## $se
## [1] 0.09368107 0.01867322
##
## $sig2
## [1] 0.738315
##
## $seq.p.val
## [1] 0 0
##
## $df
## [1] 98
<- model.matrix(dist ~ speed + I(speed^2),cars)
X <- lm(dist ~ speed + I(speed^2),cars)
cars.fit.lm <- least_sq(y = cars$dist,X = X) cars.fit
pcls
The data InsectSprays
has measurements on counts of insects in plots, with different insecticide sprays. We’ll use the model: \(y_{i} = \mu + \beta_{j}\) if observation \(i\) had spray \(j\).
It wont be identifiable right away. Use the constraint \(\sum_{j}\beta_{j} = 0\). We make the model matrix \(\mathbf{X}\) which has a redundant column for the constraint.
<- model.matrix(~ spray - 1, InsectSprays)
X <- cbind(1,X)
X <- 7 # 7 unconstrained parameters
p <- 1 # 1 constraint m
We want to impose some general linear constraints of the form \(\mathbf{C}\mathbf{\beta}=\mathbf{0}\), with \(\mathbf{C}\) the constraint matrix. \(\mathbf{C}\) has dimension \(m\times p\) where \(m\) is the number of constraints. \(\mathbf{C}\) should have a 1 for the constrained parameters, as they should sum to 0 when multiplied by it.
<- matrix(c(0,rep(1,6)),1,7) C
From the book:
We have the fact that \(\mathbf{XQ} = (\mathbf{Q}^{\intercal}\mathbf{X}^{\intercal})^{\intercal}\). The call qr.qty(qrc,t(X))
returns \(\mathbf{Q}^{\intercal}\mathbf{X}^{\intercal}\). Transposing it and taking the last \(p-m\) columms gives us \(\mathbf{XZ}\). It’s the same as t(t(qr.Q(qrc,complete = TRUE)) %*% t(X))[,(m+1):p]
or (X %*% qr.Q(qrc,complete = TRUE))[,(m+1):p]
.
<- qr(t(C))
qrc <- t(qr.qty(qrc,t(X)))[,(m+1):p] XZ
<- lm(InsectSprays$count~XZ-1)
m1 <- coef(m1) bz
qr.qy(qrc,b)
gives \(\mathbf{Z}\hat{\beta}_{z}\). It is the same as doing qr.Q(qrc,complete = T) %*% c(0,bz)
.<- c(0,bz)
b <- qr.qy(qrc,b)
b sum(b[2:7])
## [1] 3.552714e-15
Now we still have an intercept, but all the other terms sum to 0. We can check this against mgcv::pcls
, it’s the same. Note that we can’t leave Ain
blank in pcls
, even if we don’t have any inequality constraints, so we put in the trivial one that any parameter vector will fulfill (since Ain
is all zeros).
<- list(
M y = InsectSprays$count,
w = rep(1, nrow(InsectSprays)),
X = X,
C = C,
S = list(),
off = array(0,0),
sp = array(0,0),
p = c(5, -1,1,-1,1,-1,1),
Ain = matrix(0,1,7),
bin = -1
)<- mgcv::pcls(M = M) pcls.fit
9.5 | 5 | 5.833333 | -7.416667 | -4.583333 | -6 | 7.166667 | |
pcls.fit | 9.5 | 5 | 5.833333 | -7.416667 | -4.583333 | -6 | 7.166667 |
nls
Consider the following non-linear model: \(V_{i} = \beta_{1} G_{i}^{\beta_{2}} H_{i}^{\beta_{3}} + \epsilon_{i}\), where \(G_{i},H_{i}\) are girth and height of trees, and \(V_{i}\) are the trees’ volumes.
Let’s first work out what the Jacobian (matrix of partial first derivatives) should be.
The partial derivative with respect to \(\beta_{1}\) is: \(\partial\mathbb{E}(V_{i})/\partial{\beta_{1}} = G_{i}^{\beta_{2}} H_{i}^{\beta_{3}} \frac{d}{d\beta_{1}}(\beta_{1}) = G_{i}^{\beta_{2}} H_{i}^{\beta_{3}}\). We solved it by pulling out constants and differentiating.
The partial derivative with respect to \(\beta_{2}\) is: \(\partial\mathbb{E}(V_{i})/\partial{\beta_{2}} = \beta_{1} H_{i}^{\beta_{3}} \frac{d}{d\beta_{2}}(G_{i}^{\beta_{2}})\). After we pulled out constants, lets work on differentiating the exponential expression. \(\frac{d}{d\beta_{2}}(G_{i}^{\beta_{2}}) = \frac{d}{d\beta_{2}}(e^{\log G_{i}})^{\beta_{2}} = \frac{d}{d\beta_{2}}(e^{\beta_{2}\log G_{i}})\). Now we get to apply the chain rule: \(e^{\beta_{2}\log G_{i}} \frac{d}{d\beta_{2}}(\beta_{2}\log G_{i}) = e^{\beta_{2}\log G_{i}} \log G_{i} \frac{d}{d\beta_{2}}(\beta_{2}) = e^{\beta_{2}\log G_{i}} \log G_{i} = G_{i}^{\beta_{2}} \log G_{i}\). So the answer is:
\[ \partial\mathbb{E}(V_{i})/\partial{\beta_{2}} = \beta_{1} H_{i}^{\beta_{3}} G_{i}^{\beta_{2}} \log G_{i} \]
\[ \partial\mathbb{E}(V_{i})/\partial{\beta_{3}} = \beta_{1} G_{i}^{\beta_{2}} H_{i}^{\beta_{3}} \log H_{i} \]
The R function, is therefore (taking advantage of vectorized operators):
<- function(beta, G, H) {
EV.func <- beta[1] * (G^beta[2]) * (H^beta[3])
mu <- cbind(
J ^beta[2]) * (H^beta[3]),
(G*log(G),
mu*log(H)
mu
)list(mu=mu,J=J)
}
trees
data. Use \(\beta = [.002, 2, 1]\) for starting values. We use non-linear least squares (NLS).data(trees)
<- trees$Volume
V <- trees$Girth
G <- trees$Height
H <- c(0.002, 2, 1)
beta <- Inf
diff while (diff > 1e-8) {
<- EV.func(beta = beta, G = G, H = H)
EV <- V - EV$mu + (EV$J %*% beta) # calc pseudodata
z <- beta
beta_old # linear least squares to minimize S^{[k]} with respect to beta to get improved estimate \beta^{[k+1]}
<- coef(lm(z ~ EV$J - 1))
beta <- sum(abs(beta - beta_old))
diff }
solve(t(EV$J)%*%EV$J)
finds \((\mathbf{J}^{\intercal}\mathbf{J})^{-1}\).<- sum((V - EV$mu)^2)/(nrow(trees)-3)
sig2 <- solve(t(EV$J)%*%EV$J)*sig2
Vb <- diag(Vb)^.5;se se
## [1] 0.001366994 0.082077439 0.242158812
Let’s compare the above to what we get from stats::nls
. It’s the same.
<- stats::nls(
nls.fit ~ beta1 * (Girth^beta2) * (Height^beta3),
Volume data = trees,
start = as.list(c(beta1=0.002, beta2=2, beta3=1))
)
beta1 | beta2 | beta3 | |
---|---|---|---|
0.0014488 | 1.996921 | 1.087647 | |
beta | 0.0014488 | 1.996922 | 1.087646 |
beta1 | beta2 | beta3 | |
---|---|---|---|
0.001367 | 0.0820774 | 0.2421588 | |
se | 0.001367 | 0.0820774 | 0.2421588 |
Linear mixed models (LMM) extend the linear model (LM).
Linear models are of the form:
\[ \mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}, \mathbf{\epsilon}\sim N(\mathbf{0},\mathbf{I}\sigma^{2}) \]
LMM are extended to be:
\[ \mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{Z}\mathbf{b} + \mathbf{\epsilon}, \mathbf{\epsilon}\sim N(\mathbf{0},\mathbf{\Lambda_{\theta}}), \mathbf{b}\sim N(\mathbf{0},\mathbf{\psi_{\theta}}) \]
The random vector \(\mathbf{b}\) has the random effects, whose covariance matrix depends on some parameters \(\theta\); these are usually the target of inference, rather than \(\mathbf{b}\) per se. \(\mathbf{Z}\) is the model matrix for the random effects. \(\mathbf{\Lambda_{\theta}}\) models residual autocorrelation, but often it is as simple as \(\mathbf{\Lambda_{\theta}} = \mathbf{I}\sigma^{2}\). So the random effects also get a linear model structure, just like the fixed effects.
The model for the outcome itself will follow a multivariate normal distribution:
\[ \mathbf{y}\sim N(\mathbf{X\beta},\mathbf{Z}\mathbf{\psi}_{\theta}\mathbf{Z}^{\intercal} + \mathbf{\Lambda}_{\theta}) \]
For the tree example, a fixed effects model is clearly unidentifiable. The random effects model will be set up as:
\[ y_{i} = \alpha_{j} + b_{k} +\epsilon_{i} \] For measurement \(i\) in CO2 level \(j\) and tree \(k\), and \(b_{k}\sim N(0,\sigma_{b}^{2}), \epsilon_{i}\sim N(0,\sigma^{2})\).
If we don’t need/want to do inference on the specific values of \(b_{k}\) and simply want to test for CO2 effects in the presence of random tree effects. Not that for tree \(k\), the average response is:
\[ \overline{y}_{k} = \alpha_{j} + e_{k} \] Where \(e_{k}\sim N(0,\sigma^{2}_{b} + \sigma^{2}/4)\). This is because there are 4 observations for each tree. We averaged over the data at each level of the random effect \(b_{k}\), that is, for each tree.
data(stomata)
<- aggregate(
st data.matrix(stomata),
by=list(tree=stomata$tree),mean
)$CO2 <- as.factor(st$CO2);st st
<- lm(area ~ CO2, st)
m3 anova(m3)
There’s evidence for C02 effect in this model. Again, \(\theta\) is a target of inference, and in this case it’s \(\sigma_{b}^{2}\), so let’s estimate it.
<- lm(area ~ CO2 + tree, stomata) m1
The full unidentifiable model has \(\hat{\sigma}^{2} = \mathrm{RSS}_{1}/18\). There are 18 degrees of freedom \(n-p = 18, n=24, p=6\).
The LMM has \(\widehat{\sigma_{b}^{2} + \sigma^{2}/4} = \mathrm{RSS}_{3}/4\).
To get \(\hat{\sigma}^{2}_{b}\) we just need to subtract:
\[ \hat{\sigma}^{2}_{b} = \widehat{\sigma_{b}^{2} + \sigma^{2}/4} - \hat{\sigma}^{2}/4 = \mathrm{RSS}_{3}/4 - \mathrm{RSS}_{1}/72 \]
summary(m3)$sigma^2 - summary(m1)$sigma^2/4
## [1] 0.06770177
The next example is about some data on rails. We do the inference by hand:
<- lm(travel ~ Rail,Rail)
m1 <- aggregate(data.matrix(Rail),by=list(Rail$Rail),mean)
rt <- lm(travel ~ 1,rt)
m0 <- summary(m1)$sigma
sig <- (summary(m0)$sigma^2 - sig^2/3)^0.5
sigb
<- rbind(sigb = sigb, sig = sig, intercept = unname(coef(m0)))
tab ::kable(tab,caption = "By-hand LMM for rails data") knitr
## Warning in kable_pipe(x = structure(c("sigb", "sig", "intercept", "24.805465", :
## The table should have a header (column names)
sigb | 24.805465 |
sig | 4.020779 |
intercept | 66.500000 |
And now we compare to nlme::lme
. It will give us the same thing.
<- as.data.frame(data.matrix(Rail))
rail_df # lme.fit = lme(travel ~ 1,random = ~ 1 | Rail,data = rail_df)
= lme(travel ~ 1,random = list(Rail = ~1),data = rail_df)
lme.fit summary(lme.fit)
## Linear mixed-effects model fit by REML
## Data: rail_df
## AIC BIC logLik
## 128.177 130.6766 -61.0885
##
## Random effects:
## Formula: ~1 | Rail
## (Intercept) Residual
## StdDev: 24.80547 4.020779
##
## Fixed effects: travel ~ 1
## Value Std.Error DF t-value p-value
## (Intercept) 66.5 10.17104 12 6.538173 0
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.61882658 -0.28217671 0.03569328 0.21955784 1.61437744
##
## Number of Observations: 18
## Number of Groups: 6
We want to do inference on \(\mathbf{\beta,\theta}\), but we don’t want to work directly with the pdf \(\mathbf{y}\sim N(\mathbf{X\beta},\mathbf{Z}\mathbf{\psi}_{\theta}\mathbf{Z}^{\intercal} + \mathbf{\Lambda}_{\theta})\) as it is too unwieldy.
Instead what we work with is the marginal \(f(\mathbf{y}|\mathbf{\beta})\), which has \(\mathbf{b}\) integrated out (the part depending on \(\theta\)).
We will also need \(f(\mathbf{y,b} | \mathbf{\beta})\), which is just the pdf of \(\mathbf{y}\) conditional on particular values of \(\mathbf{b,\beta}\). We know that \(f(\mathbf{y,b} | \mathbf{\beta}) = f(\mathbf{y} | \mathbf{b,\beta}) f(\mathbf{b})\).
Say we have \(\hat{\mathbf{b}}\) which maximizes \(f(\mathbf{y,b} | \mathbf{\beta})\) for a particular \(\mathbf{\beta}\). Then \(f(\mathbf{y}|\mathbf{\beta}) = f(\mathbf{y,\hat{b}} | \mathbf{\beta}) \times (\text{terms depending only on }\theta)\). The point is that we can do inference on \(f(\mathbf{y}|\mathbf{\beta})\) by profile likelihood methods. For any \(\theta\) it is easy to get an estimator for \(\beta\) by minimizing with respect to \(\mathbf{\beta,b}\).
We call the profile likelihood \(l_{p}(\mathbf{\theta}) = l(\mathbf{\theta,\hat{\beta}_{\theta}})\), where \(\hat{\beta}_{\theta}\) maximizes \(f(\mathbf{y}|\mathbf{\beta})\) given \(\mathbf{\theta}\).
We have:
\[ \mathbf{b|y,\hat{\beta}_{\theta}} \sim N(\mathbf{\hat{b} , (Z^{\intercal}\Lambda_{\theta}Z | \psi_{\theta}^{-1})^{-1} }) \] Where \(\hat{\mathbf{b}} = \mathbf{(Z^{\intercal}\Lambda_{\theta}Z | \psi_{\theta}^{-1})^{-1}Z^{\intercal}\Lambda_{\theta}\tilde{y}}\). \(\mathbf{\tilde{y} = y - x\hat{\beta}}\) are the residuals. The \(\hat{\mathbf{b}}\) estimate is called the maximum a posteriori estimator or the predicted random effects vector.
To get the distribution of \(\hat{\beta}_{\theta}\) for a given value of \(\theta\), we can work with the pdf: \(\mathbf{y}\sim N(\mathbf{X\beta},\mathbf{Z}\mathbf{\psi}_{\theta}\mathbf{Z}^{\intercal} + \mathbf{\Lambda}_{\theta})\). For a given \(\theta\) a weighted least squares estimate gives \(\hat{\beta}_{\theta}\). We eventually get:
\[ $\hat{\beta}_{\theta}$ \sim N(\beta, (\mathbf{X}^{\intercal}(\mathbf{Z}\mathbf{\psi}_{\theta}\mathbf{Z}^{\intercal} + \mathbf{\Lambda}_{\theta})^{-1}\mathbf{X})^{-1} ) \]
Which we use for large sample inference.
The function below is the log likelihood for a given value of \(\theta\); given a lattice of \(\theta\) values, it can be maximized (actually minimized, it gives negative log likelihood).
This function essentially minimizes the profile likelihood:
\[ l_{p}(\theta) = \left\lVert \mathbf{y - X\beta - Zb} \right\rVert^{2}_{\Lambda_{\theta}} + \mathbf{b^{\intercal}\psi_{\theta}^{-1}b} \]
with respect to \(\mathbf{\beta,b}\).
<- function(theta,X,Z,y) {
llm ## untransform parameters...
<- exp(theta[1])
sigma.b <- exp(theta[2])
sigma ## extract dimensions...
<- length(y); pr <- ncol(Z); pf <- ncol(X) ## obtain \hat \beta, \hat b...
n <- cbind(X,Z)
X1 <- c(rep(0,pf),rep(1/sigma.b^2,pr))
ipsi <- solve(crossprod(X1)/sigma^2+diag(ipsi),t(X1)%*%y/sigma^2)
b1 ## compute log|Z’Z/sigma^2 + I/sigma.b^2|...
<- sum(log(diag(chol(crossprod(Z)/sigma^2 + diag(ipsi[-(1:pf)])))))
ldet ## compute log profile likelihood...
<- (-sum((y-X1%*%b1)^2)/sigma^2 - sum(b1^2*ipsi) - n*log(sigma^2) - pr*log(sigma.b^2) - 2*ldet - n*log(2*pi))/2
l attr(l,"b") <- as.numeric(b1) ## return \hat beta and \hat b
return(-l)
}
If there is only one level (factor) of grouping in the data, the LMM is of the form:
\[ \mathbf{y_{i}} = \mathbf{X_{i}\beta + Z_{i}b_{i} + \epsilon_{i}} \]
Terms that have the \(i\) subscript depend on group (they are for group \(i\)) and others are common to all groups. So \(\mathbf{\beta,\psi_{\theta},\Lambda_{\theta}}\), the coefficients for fixed effects, covariance of random effects, and error terms are the same for all groups.
nlme
The function lme
handles LMMs. It needs 2 formulas, one for the fixed effects, and one for \(\mathbf{Z_{i}}\) and grouping. For example to fit the rails model \(y_{ij} = \alpha + b_{i} + \epsilon_{ij}\) for grouping factor \(i\) with repeated measures \(j\), we’d do:
data(Rail)
# lme(travel ~ 1, Rail, ~ 1|Rail) # works too
lme(travel ~ 1, Rail, list(Rail = ~ 1))
## Linear mixed-effects model fit by REML
## Data: Rail
## Log-restricted-likelihood: -61.0885
## Fixed: travel ~ 1
## (Intercept)
## 66.5
##
## Random effects:
## Formula: ~1 | Rail
## (Intercept) Residual
## StdDev: 24.80547 4.020779
##
## Number of Observations: 18
## Number of Groups: 6
lme
We have data for 14 trees on their growth (height
) as a function of age
, where Seed
is the unique ID for each tree. Let \(y\) be height, \(x\) be age, then our model might be:
\[ y_{ij} + \beta_{0} + \beta_{1}x_{ij} + \beta_{2}x_{ij}^{2} + \beta_{3}x_{ij}^{3} + b_{0} + b_{1}x_{ij}^{2} + b_{2}x_{ij}^{3} + \epsilon_{ij} \]
and correlation between errors is given by \(\rho(\epsilon_{ij}, \epsilon_{ij-k}) = \phi^{k}\), with \(\phi\) unknown. That means errors are autocorrelated within tree over time. So measurements taken 3 years apart would have a decaying correlation of \(\phi^{3}\).
data(Loblolly)
# center the data, polynomial terms can become too correlated otherwise
$age <- Loblolly$age - mean(Loblolly$age)
Loblolly# use more EM iters before Newton for the final optimization
<- lmeControl(niterEM=500,msMaxIter=100)
lmc # fit the model
<- lme(height ~ age + I(age^2) + I(age^3),Loblolly,
m0 random = list(Seed = ~ age + I(age^2) + I(age^3)),
correlation = corAR1(form = ~ age|Seed),control=lmc)
If there are two factors, and repeated measurements within each strata, we need to be more careful when using lme
. Say we index factors by \(i,j\) and measurements within factor strata by \(k\):
\[
y_{ijk} = \mu + \alpha_{i} + b_{j} + (\alpha b)_{ij} + \epsilon_{ijk}
\] This model has an overall mean, terms for the independent random effect of both factors, and one for the specific strata. Distributions follow: \(b_{j} \sim N(0,\sigma_{b}^{2}), (\alpha b)_{ij} \sim N(0,\sigma_{\alpha b}^{2}), \epsilon_{ijk} \sim N(0,\sigma^{2})\). For the dataset Machines
where the factors are Worker
and Machine
and outcome is score
, we’d use:
data(Machines)
lme(score ~ Machine,Machines,list(Worker = ~1, Machine = ~1))
## Linear mixed-effects model fit by REML
## Data: Machines
## Log-restricted-likelihood: -107.8438
## Fixed: score ~ Machine
## (Intercept) MachineB MachineC
## 52.355556 7.966667 13.916667
##
## Random effects:
## Formula: ~1 | Worker
## (Intercept)
## StdDev: 4.78105
##
## Formula: ~1 | Machine %in% Worker
## (Intercept) Residual
## StdDev: 3.729532 0.9615771
##
## Number of Observations: 54
## Number of Groups:
## Worker Machine %in% Worker
## 6 18
lme4
nlme
is efficient for models with nesting, but less so for non nested models like \(y_{ijk} = \alpha + b_{i} + c_{j} + \epsilon_{ijk}\). lme4
addresses this. Unlike lme
, the lmer
function in lme4
uses a single formula to specify both fixed and random effects. Random effects (\(b\)) are specified like \((x|g)\) where \(g\) is the grouping factor and \(x\) is the right hand of the model formula.
We fit the Machines model again:
\[ y_{ijk} = \mu + \alpha_{i} + b_{j} + (\alpha b)_{ij} + \epsilon_{ijk} \]
<- lmer(score ~ Machine + (1|Worker) + (1|Worker:Machine), data=Machines)
a1 a1
## Linear mixed model fit by REML ['lmerMod']
## Formula: score ~ Machine + (1 | Worker) + (1 | Worker:Machine)
## Data: Machines
## REML criterion at convergence: 215.6876
## Random effects:
## Groups Name Std.Dev.
## Worker:Machine (Intercept) 3.7295
## Worker (Intercept) 4.7811
## Residual 0.9616
## Number of obs: 54, groups: Worker:Machine, 18; Worker, 6
## Fixed Effects:
## (Intercept) MachineB MachineC
## 52.356 7.967 13.917
mgcv
gam
in mgcv
can also fit LMMs, with independent Gaussian random effects, but need to be specified using “smooth” terms of the form s(z,x,v,bs="re")
.
Once again, with gusto, the Machines model:
\[ y_{ijk} = \mu + \alpha_{i} + b_{j} + (\alpha b)_{ij} + \epsilon_{ijk} \]
<- gam(score ~ Machine + s(Worker,bs="re") + s(Machine,Worker,bs="re"),data=Machines,method="REML")
b1 summary(b1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## score ~ Machine + s(Worker, bs = "re") + s(Machine, Worker, bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.356 2.486 21.062 < 2e-16 ***
## MachineB 7.967 2.177 3.660 0.000799 ***
## MachineC 13.917 2.177 6.393 2.02e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Worker) 4.141 5 8502 0.000487 ***
## s(Machine,Worker) 10.623 15 595 0.300410
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.986 Deviance explained = 99%
## -REML = 107.84 Scale est. = 0.92463 n = 54
GLMs give you 2 things:
They all have the basic structure:
\[ g(\mu_{i}) = \mathbf{X_{i}\beta} \]
A GLMM follows naturally:
\[ g(\mu_{i}) = \mathbf{X_{i}\beta} + \mathbf{Z_{i}b}, \mathbf{b} \sim N(0,\psi) \]
For either one, the “linear predictor” is \(\mathbf{\eta = X\beta}\).
glmmPQL
This function comes from the MASS
package.
A GAM is a GLM with the linear predictor being a sum of smooth functions of covariates.
\[ g(\mu_{i}) = \mathbf{A_{i}\theta} + f_{1}(x_{1i}) + f_{2}(x_{2i}) + ... \] Here, \(\mathbf{A}\) plays the role of the typical model matrix and \(\mathbf{\theta}\) akin to \(\mathbf{\beta}\). All the \(f_{i}\) are smooth functions of covariates.
Let’s start by considering
\[ y_{i} = f(x_{i}) + \epsilon_{i} \]
To estimate \(f\) in terms of the statistical methods we know requires we turn it (somehow) into a linear model. We do this by representing it by a basis expansion:
\[ f(x) = \sum_{j=1}^{k} b_{j}(x) \beta_{j} \]
We choose the basis expansion, so all \(b_{j}(x)\) terms are completely known (just as if we’d added, say, polynomial terms to a standard regression). Now we want to estimate the \(\beta_{j}\) coefficients. So then we get a linear model \(\mathbf{y = X\beta + \epsilon}\) where \(X_{ij} = b_{j}(x_{i})\).
Let’s start with the simplest basis function, the piecewise linear basis. Our function tf
will define \(b_{j}(x)\). It will be used in tf.X
, which takes a sequence of knot points and \(x\) observed data to produce a model matrix \(\mathbf{X}\) for the piecewise linear basis expansion. We’ll use 6 knots, spread evenly over the range of \(x\). The basis functions are plotted in light dashed colors, and the basis functions multiplied by estimated coefficients in lighted colors, and the knot locations are vertical lines. Their sum is the fit curve. The reason the tent functions are not triangles is that the knot points are not necessarily at all the data points, as would be the case for simple interpolation. Each basis \(b_{j}(x)\) must have a peak at the \(j^{th}\) knot point but can change slope at data points as it falls off to zero.
data("engine")
attach(engine)
# tf: tent function
# defines b_{j}(x)
# gives the j^th function from a set defined by knot points xj
<- function(x, xj, j) {
tf <- xj * 0
dj <- 1
dj[j] approx(x = xj,y = dj,xout = x,method = "linear")$y
}
# x: data
# xj: knot sequence
<- function(x, xj) {
tf.X <- length(xj)
nk <- length(x)
n <- matrix(NaN, n, nk)
X for (j in 1:nk) {
<- tf(x, xj, j)
X[, j]
}return(X)
}
<- seq(min(size),max(size),length=6) ## generate knots
sj <- tf.X(size,sj) ## get model matrix
X <- lm(wear ~ X - 1) ## fit model
b <- seq(min(size),max(size),length=200) ## prediction data
s <- tf.X(s,sj) ## prediction matrix
Xp plot(size,wear,xlab="Engine capacity",ylab="Wear index",ylim=c(0,5)) ## plot data
lines(s,Xp %*% coef(b)) ## overlay estimated f
<- gg_color_hue(n = 6)
tent_cols
for (j in 1:ncol(X)) {
abline(v = sj[j],col = adjustcolor(tent_cols[j],0.5),lty = 3)
lines(x = size[order(size)], y = X[,j][order(size)], col = adjustcolor(tent_cols[j],0.85), lty = 2, lwd = 1.5)
lines(x = size[order(size)], y = X[,j][order(size)] * coef(b)[j], col = adjustcolor(tent_cols[j],0.75), lty = 3, lwd = 1.5)
}
detach(engine)
We got a decent fit, but we still don’t know how to choose the basis dimension \(k\) (here, the number of knots). One thing to do is keep \(k\) larger than what one might think necessary but to add a penalty to the standard fitting objective:
\[ \left\lVert \mathbf{y-X\beta} \right\rVert^2 + \lambda \sum_{j=2}^{k-1} (f(x_{j-1}) - 2f(x_{j}) + f(x_{j+1}))^2 \]
The first term is obviously the least squares loss, and the second term is a penalty upon the sum of (approximate) squared second derivatives of the function at the knots. Note if \(f\) is just a constant (straight line), the penalty is zero. The smoothing parameter \(\lambda\) controls how smooth the estimated \(f\) should be: if \(\lambda \to \infty\) the only way to have a finite loss is to make the second term go away by estimating a straight line. If \(\lambda = 0\) we get the standard least squares solution to the basis expansion.
We can start by writing the penalty term in terms of matrices, which will shed light on things. For the piecewise linear case, the coefficients of \(f\) in the summed penalty are just the function values at the knots, \(\beta_{j} = f(x_{j})\). First, write it as:
\[ \begin{bmatrix} \beta_{1} - 2\beta_{2} + \beta_{3} \\ \beta_{2} - 2\beta_{3} + \beta_{4} \\ \beta_{3} - 2\beta_{4} + \beta_{5} \\ \vdots \end{bmatrix} = \begin{bmatrix} 1 & -2 & 1 & 0 & \dots & \dots & \dots \\ 0 & 1 & -2 & 1 & 0 & \dots & \dots \\ 0 & 0 & 1 & -2 & 1 & 0 & \dots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix} \begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \beta_{3} \\ \vdots \end{bmatrix} \] Call the right hand side \(\mathbf{D\beta}\). Note that \(\mathbf{D}\) has dimension \((k-2)\times k\). That means it has a null space (recall the null space of a matrix \(\mathbf{A}\) is the space of non-zero vectors \(\mathbf{x}\) which solve \(\mathbf{Ax=0}\)). The null space has dimension \(k - (k-2) = 2\), so it’s 2 dimensional. That makes perfect sense, because the only time the penalty can be zero is when your function \(f\) is a straight line with no wigglyness (non-zero second derivatives)! When the penalty is zero, the fitted function we get is a linear function that lies in the null space.
The penalty term is then \(\mathbf{\lambda\beta^{\intercal}D^{\intercal}D\beta}\). Once again though, for practical computation, we don’t solve that directly but we note that:
\[ \left\lVert \mathbf{y-X\beta} \right\rVert^2 + \mathbf{\lambda\beta^{\intercal}D^{\intercal}D\beta} = \left\lVert \begin{bmatrix}y \\ 0 \end{bmatrix} - \begin{bmatrix} X \\ \sqrt{\lambda}D \end{bmatrix}\beta \right\rVert^2 \] Now, let’s write an R function to solve that linear model on the RHS that’s equivalent to what we want to solve!
# y: outcome data
# x: "predictor" data
# xj: knot locations
# lambda: smoothing parameter
<- function(y, x, xj, lambda) {
prs.fit <- tf.X(x, xj) # model matrix
X <- diff(diag(length(xj)),differences = 2) # D penalty matrix
D <- rbind(X, sqrt(lambda)*D) # augment model matrix
X <- c(y, rep(0,nrow(D))) # augment data
y lm(y ~ X - 1) # penalized fit
}
attach(engine)
par(mfrow=c(1,3))
<- seq(min(size),max(size),length=20) ## knots
sj
for (lambda in c(80,2,0.05)) {
<- prs.fit(wear,size,sj,lambda) ## penalized fit
b <- tf.X(s,sj) ## prediction matrix
Xp plot(size,wear,xlab="Engine capacity",ylab="Wear index", main=paste0("lambda: ",lambda)) ## plot data
lines(s,Xp %*% coef(b)) ## plot the smooth
}
par(mfrow=c(1,1))
detach(engine)
If \(\lambda\) is too high, the data will be over-smoothed; small values give under-smoothing. How to choose \(\lambda\) to get a good estimate \(\hat{f}\) of the (unknown) \(f\)?
\[ \frac{1}{n}\sum_{i=1}^{n} (\hat{f}_{i} - f_{i})^{2} \] We can’t minimize the above since \(f\) is unknown. Therefore let us use leave one out cross-validation:
\[ \frac{1}{n}\sum_{i=1}^{n} (\hat{f}_{i}^{[-i]} - y_{i})^{2} \] Where \(\hat{f}_{i}^{[-i]}\) is the estimated function with \(y_{i}\) left out. We can run GCV (generalized cross validation), a simplification that is equivalent to the above which uses the influence matrix rather than refitting \(n\) times.
data(engine)
= seq(-9,11,length=90) # log values of lambda
rho <- length(engine$wear) # number of data points
n <- rep(NA,90) # GCV score
V <- seq(min(engine$size),max(engine$size),length=20) ## knots
sj
for (i in 1:90) { ## loop through smoothing params
<- prs.fit(engine$wear,engine$size,sj,exp(rho[i])) ## fit model
b <- sum(influence(b)$hat[1:n]) ## extract EDF
trF <- sum((engine$wear-fitted(b)[1:n])^2) ## residual SS
rss <- n*rss/(n-trF)^2 ## GCV
V[i]
}
par(mfrow=c(1,2))
plot(rho,V,type="l",xlab=expression(log(lambda)),main="GCV score")
<- exp(rho[V==min(V)]) ## extract optimal sp
sp <- seq(min(engine$size),max(engine$size),length=200) ## prediction data
s <- tf.X(s,sj) ## prediction matrix
Xp <- prs.fit(engine$wear,engine$size,sj,sp) ## re-fit
b plot(engine$size,engine$wear,main="GCV optimal fit")
lines(s,Xp %*% coef(b))
par(mfrow=c(1,1))
The smoothing penalty we’ve introduced so far can also be interpreted in a Bayesian way, as a prior distribution over the function’s wiggliness. The most parsimonious way is to say the function’s wiggliness is proportional to an exponential prior:
\[ \propto \mathrm{exp}(\mathbf{\lambda \beta^{\intercal}S\beta}/\sigma^{2}) \] Where \(\mathbf{S = D^{\intercal}D}\) that we introduced earlier. This is equivalent to a prior on the coefficients:
\[ \mathbf{\beta}\sim N(\mathbf{0},\sigma^{2}\mathbf{S}^{-1}/\lambda) \]
When we reinterpret the model is a Bayesian way, the whole thing can be recast as fitting a linear mixed model. Let \(\mathbf{\beta}^{\prime} = \mathbf{D_{+}\beta}\) where \(\mathbf{D_{+}}\) is the augmented penalty matrix \(\mathbf{D_{+}} = \begin{bmatrix} \mathbf{I}_{2} \mathbf{0} \\ \mathbf{D} \end{bmatrix}\). We also have to adjust the data to be \(\mathbf{X}^{*}\) be the first two columns of \(\mathbf{XD}_{+}^{-1}\), and \(\mathbf{\beta}^{*}\) be the first 2 elements of \(\mathbf{\beta}^{\prime}\). The mixed model interpretation of the Bayesian smooth model is:
\[ \mathbf{y} = \mathbf{X^{*}\beta^{*} + Zb + \epsilon}, \mathbf{b}\sim N(\mathbf{0},\mathbf{I}\sigma^{2}/\lambda), \mathbf{\epsilon}\sim N(\mathbf{0},\mathbf{I}\sigma^{2}) \]
It can be estimated by working with the marginal likelihood using llm
from before. We also use lme
for comparison, which estimates using REML.
attach(engine)
# original parameterization of X
<- tf.X(size,sj)
X0
# make D_{+}
<- rbind(0,0,diff(diag(20),difference = 2))
D diag(D) <- 1
# make new X
<- t(backsolve(t(D),t(X0)))
X
# the random and fixed effects
<- X[,-c(1,2)]
Z <- X[,1:2]
X
# estimate smoothing lambda and variance
<- optim(c(0,0),llm,method="BFGS",X=X,Z=Z,y=wear)
m <- attr(llm(m$par,X,Z,wear),"b") ## extract coefficients
b
plot(size,wear)
<- t(backsolve(t(D),t(Xp))) ## re-parameterized pred. mat.
Xp1 lines(s,Xp1 %*% as.numeric(b),col="grey",lwd=2)
# use REML (lme)
library(nlme)
<- factor(rep(1,nrow(X))) # dummy factor variable because all terms are in the same "group"
g <- lme(wear ~ X - 1, random=list(g = pdIdent(~ Z-1)))
m lines(s,Xp1 %*% as.numeric(coef(m))) ## and to plot
detach(engine)
Let’s now move on where we have two \(x,v\) as explanatory variables for response \(y\). Then our model has the additive structure:
\[ y_{i} = \alpha + f_{1}(x_{i}) + f_{2}(v_{i}) + \epsilon_{i} \] Now that there are two smooths, we have an identifiability problem. Any constant \(c\) could be added to \(f_{1}\) and subtracted from \(f_{2}\) without changing predicted responses, so we need to introduce constraints to fit the model.