These notes largely follow the book, and I don’t have time to do all the exercises. In some cases I’ve compared Simon Wood’s “by hand” solutions in the text to other packages or methods.

Ch. 1: Linear Models

Chapter 1 is about linear models and their theory.

Exercises

Exercise 4: Write out factor models with model matrices

Write out the following models in form \(\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}\).

  1. ‘balanced one-way ANOVA model’: \(y_{ij} = \alpha + \beta_{i} + \epsilon_{ij}\) where \(i=1,2,3\) and \(j=1,2\).

It has one factor, with 3 levels, and each level has 2 observations. R can make this with model.matrix.

b <- factor(rep(1:3,each=2))
model.matrix(~ b, data.frame(b=b))
##   (Intercept) b2 b3
## 1           1  0  0
## 2           1  0  0
## 3           1  1  0
## 4           1  1  0
## 5           1  0  1
## 6           1  0  1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$b
## [1] "contr.treatment"

So we write it as: \[ \begin{bmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ y_{31} \\ y_{32} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta_{2} \\ \beta_{3} \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{21} \\ \epsilon_{22} \\ \epsilon_{31} \\ \epsilon_{32} \end{bmatrix} \]

  2. A model with 2 factor variables (3 and 4 levels, respectively) and 1 observation per combination: \(y_{ij} = \alpha + \beta_{i} + \gamma_{j} + \epsilon_{ij}\).

We do this in R; note that we use expand.grid to generate all unique combinations of the two factors. Since there is only one observation per combination, that is sufficient to build the model matrix.

beta <- factor(1:3)
gamma <- factor(1:4)
model.matrix(~ beta + gamma, expand.grid(beta=beta,gamma=gamma))
##    (Intercept) beta2 beta3 gamma2 gamma3 gamma4
## 1            1     0     0      0      0      0
## 2            1     1     0      0      0      0
## 3            1     0     1      0      0      0
## 4            1     0     0      1      0      0
## 5            1     1     0      1      0      0
## 6            1     0     1      1      0      0
## 7            1     0     0      0      1      0
## 8            1     1     0      0      1      0
## 9            1     0     1      0      1      0
## 10           1     0     0      0      0      1
## 11           1     1     0      0      0      1
## 12           1     0     1      0      0      1
## attr(,"assign")
## [1] 0 1 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$beta
## [1] "contr.treatment"
## 
## attr(,"contrasts")$gamma
## [1] "contr.treatment"

It’s written as: \[ \begin{bmatrix} y_{11} \\ y_{21} \\ y_{31} \\ y_{12} \\ y_{22} \\ y_{32} \\ y_{13} \\ y_{23} \\ y_{33} \\ y_{14} \\ y_{24} \\ y_{34} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta_{2} \\ \beta_{3} \\ \gamma_{2} \\ \gamma_{3} \\ \gamma_{4} \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{21} \\ \epsilon_{31} \\ \epsilon_{12} \\ \epsilon_{22} \\ \epsilon_{32} \\ \epsilon_{13} \\ \epsilon_{23} \\ \epsilon_{33} \\ \epsilon_{14} \\ \epsilon_{24} \\ \epsilon_{34} \end{bmatrix} \]

  3. A model with a factor variable (2 levels) and a continuous variable: \(y_{i} = \alpha + \beta_{j} + \gamma x_{i} + \epsilon_{i}\).

b <- factor(rep(1:2,times=c(2,4)))
x <- c(0.1,0.4,0.5,0.3,0.4,0.7)
model.matrix(~ beta + x, data.frame(beta = b, x = x))
##   (Intercept) beta2   x
## 1           1     0 0.1
## 2           1     0 0.4
## 3           1     1 0.5
## 4           1     1 0.3
## 5           1     1 0.4
## 6           1     1 0.7
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$beta
## [1] "contr.treatment"

Written out: \[ \begin{bmatrix} y_{1} \\ y_{2} \\ y_{3} \\ y_{4} \\ y_{5} \\ y_{6} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0.1 \\ 1 & 0 & 0.4 \\ 1 & 1 & 0.5 \\ 1 & 1 & 0.3 \\ 1 & 1 & 0.4 \\ 1 & 1 & 0.7 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta_{2} \\ \gamma \end{bmatrix} + \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \\ \epsilon_{6} \end{bmatrix} \]

Exercise 11: Replicate lm with R’s linear algebra

We are using the linear model \(y_{i} = \mathbf{X}_{i}\mathbf{\beta} + \epsilon_{i}\) with \(\epsilon_{i}\sim N(0,\sigma^2)\). This question basically asks us to replicate lm using R’s built-in linear algebra functions.

This stackexchange answer is really helpful for figuring out what’s going on.

  1. Write an R function that takes a vector of responses y and a model matrix X, and computes the least squares estimate of the parameters \(\beta\) based on the QR decomposition of X.

The function works like this:

  1. get the QR decomposition of the model matrix: \(\mathbf{X} = \mathbf{Q}\mathbf{R}\). Recall that this factorizes \(\mathbf{X}\) (whose columns are independent but not orthonormal) into \(\mathbf{Q}\), whose columns are orthonormal, and an upper triangular matrix \(\mathbf{R}\).
  2. get \(\mathbf{Q}^\intercal \mathbf{y} = \begin{bmatrix} \mathbf{f}\\ \mathbf{r} \end{bmatrix}\).
  3. Multiplying \(\mathbf{y}-\mathbf{X}\mathbf{\beta}\) on the left by \(\mathbf{Q}^{\intercal}\) (which leaves the squared norm unchanged) gives \(\left\lVert \mathbf{f}-\mathbf{R}\mathbf{\beta} \right\rVert^{2} + \left\lVert \mathbf{r} \right\rVert^{2}\). We estimate \(\mathbf{\beta}\) by making the first term zero, i.e. solving \(\mathbf{R}\mathbf{\beta} = \mathbf{f}\); we use backsolve to get \(\hat{\mathbf{\beta}}=\mathbf{R}^{-1}\mathbf{f}\).
  4. variance estimate of \(\sigma^{2}\) is \(\hat{\sigma}^{2}= \left\lVert \mathbf{r} \right\rVert^{2} / (n-p)\).
  5. To get the covariance matrix of \(\hat{\mathbf{\beta}}\), first use backsolve to get \(\mathbf{R}^{-1}\) by solving \(\mathbf{R}\mathbf{R}^{-1} = \mathbf{I}\); the covariance matrix is then \(\mathbf{V}_{\hat{\beta}} = \mathbf{R}^{-1}\mathbf{R}^{-\intercal}\hat{\sigma}^{2}\).
n <- 100
beta <- c(5,3)
x1 <- rlnorm(n = n)
y <- beta[1] + beta[2]*x1 + rnorm(n = n)

X <- model.matrix(~ x, data.frame(x=x1))

least_sq <- function(y,X) {
  qrx <- qr(X)                ## QR decomposition
  yy <- qr.qty(qrx,y)          ## form Q'y efficiently
  R <- qr.R(qrx)              ## R
  p <- ncol(X);n <- nrow(X)   ## get dimensions
  f <- yy[1:p]; r <- yy[(p+1):n] ## partition Q'y
  beta <- backsolve(R,f)      ## backsolve(R,f) solves R * beta = f for beta
  sig2 <- sum(r^2)/(n-p)      ## resid variance estimate (1.8)
  Ri <- backsolve(R,diag(ncol(R))) ## inverse of R matrix, R * Ri = I for Ri
  Vb <- Ri%*%t(Ri)*sig2       ## covariance matrix
  se <- diag(Vb)^.5           ## standard errors (c)
  F.ratio <- f^2/sig2         ## sequential F-ratios
  seq.p.val <- 1-pf(F.ratio,1,n-p) ## seq. p-values (e)
  list(beta=beta,se=se,sig2=sig2,seq.p.val=seq.p.val,df=n-p)
}

least_sq(y = y,X = X)
## $beta
## [1] 5.120932 2.982596
## 
## $se
## [1] 0.09368107 0.01867322
## 
## $sig2
## [1] 0.738315
## 
## $seq.p.val
## [1] 0 0
## 
## $df
## [1] 98

Finally, we fit the quadratic model for the cars data both ways, for comparison:

X <- model.matrix(dist ~ speed + I(speed^2),cars)
cars.fit.lm <- lm(dist ~ speed + I(speed^2),cars)
cars.fit <- least_sq(y = cars$dist,X = X)
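
To verify that the by-hand fit matches lm, one could compare the estimates directly (a quick check of my own, not from the book):

rbind(least_sq = cars.fit$beta, lm = unname(coef(cars.fit.lm)))       # coefficient estimates agree
rbind(least_sq = cars.fit$se,
      lm = unname(summary(cars.fit.lm)$coefficients[, "Std. Error"])) # so do the standard errors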

Exercise 12: Constrained linear model, compare to pcls

The InsectSprays data contain counts of insects in agricultural plots treated with different insecticide sprays. We’ll use the model \(y_{i} = \mu + \beta_{j} + \epsilon_{i}\) if observation \(i\) received spray \(j\).

It won’t be identifiable as it stands, so we use the constraint \(\sum_{j}\beta_{j} = 0\). We build the model matrix \(\mathbf{X}\) with an intercept column plus an indicator column for every spray level, so it has one redundant column; the constraint deals with the redundancy.

X <- model.matrix(~ spray - 1, InsectSprays)
X <- cbind(1,X)
p <- 7 # 7 unconstrained parameters
m <- 1 # 1 constraint

We want to impose general linear constraints of the form \(\mathbf{C}\mathbf{\beta}=\mathbf{0}\), with \(\mathbf{C}\) the constraint matrix of dimension \(m\times p\), where \(m\) is the number of constraints. Here \(\mathbf{C}\) has a 0 for the intercept and a 1 for each \(\beta_{j}\), so that \(\mathbf{C\beta} = \sum_{j}\beta_{j} = 0\).

C <- matrix(c(0,rep(1,6)),1,7)

From the book:

  1. Find the QR decomposition of \(\mathbf{C}^{\intercal}\), the final \(p-m\) columns of the orthogonal factor (that’s \(\mathbf{Q}\)) define \(\mathbf{Z}\). If \(\mathbf{\beta}_{Z}\) is the \(p-m\) dimensional vector of unconstrained parameters, then \(\beta = \mathbf{Z}\mathbf{\beta}_{Z}\). We form \(\mathbf{X}\mathbf{Z}\), but we don’t want to get \(\mathbf{Z}\) explicitly because it’s computationally inefficient.

We use the fact that \(\mathbf{XQ} = (\mathbf{Q}^{\intercal}\mathbf{X}^{\intercal})^{\intercal}\). The call qr.qty(qrc,t(X)) returns \(\mathbf{Q}^{\intercal}\mathbf{X}^{\intercal}\); transposing it and taking the last \(p-m\) columns gives us \(\mathbf{XZ}\). It’s the same as t(t(qr.Q(qrc,complete = TRUE)) %*% t(X))[,(m+1):p] or (X %*% qr.Q(qrc,complete = TRUE))[,(m+1):p].

qrc <- qr(t(C))
XZ <- t(qr.qty(qrc,t(X)))[,(m+1):p]
  2. Minimize the unconstrained objective \(\left\lVert \mathbf{y} - \mathbf{X}\mathbf{Z}\mathbf{\beta}_{z} \right\rVert^2\) to get \(\hat{\mathbf{\beta}_{z}}\).
m1 <- lm(InsectSprays$count~XZ-1)
bz <- coef(m1) 
  3. \(\hat{\mathbf{\beta}} = \mathbf{Z}\hat{\mathbf{\beta}}_{z}\). qr.qy(qrc,b) gives \(\mathbf{Z}\hat{\beta}_{z}\). It is the same as doing qr.Q(qrc,complete = T) %*% c(0,bz).
b <- c(0,bz)
b <- qr.qy(qrc,b)
sum(b[2:7])
## [1] 3.552714e-15

Now we still have an intercept, but the other coefficients sum to 0. We can check this against mgcv::pcls; it gives the same result. Note that we can’t leave Ain blank in pcls even though we have no inequality constraints, so we supply a trivial one that any parameter vector satisfies (Ain is all zeros and bin = -1, so the constraint is 0 ≥ -1).

M <- list(
  y = InsectSprays$count,
  w = rep(1, nrow(InsectSprays)),
  X = X,
  C = C,
  S = list(),
  off = array(0,0),
  sp = array(0,0),
  p = c(5, -1,1,-1,1,-1,1),
  Ain = matrix(0,1,7),
  bin = -1
)
pcls.fit <- mgcv::pcls(M = M)
pcls v. hand constraints

by hand     9.5  5  5.833333  -7.416667  -4.583333  -6  7.166667
pcls.fit    9.5  5  5.833333  -7.416667  -4.583333  -6  7.166667

Exercise 13: Nonlinear least squares, compare to nls

Consider the following non-linear model: \(V_{i} = \beta_{1} G_{i}^{\beta_{2}} H_{i}^{\beta_{3}} + \epsilon_{i}\), where \(G_{i},H_{i}\) are girth and height of trees, and \(V_{i}\) are the trees’ volumes.

  1. Write an R function to evaluate the vector of \(\mathbb{E}(V)\) estimates given a vector of \(\beta\) values and covariates (\(G,H\)). Also return a Jacobian matrix containing \(\partial\mathbb{E}(V_{i}) / \partial\beta_{j}\).

Let’s first work out what the Jacobian (matrix of partial first derivatives) should be.

  • The partial derivative with respect to \(\beta_{1}\) is: \(\partial\mathbb{E}(V_{i})/\partial{\beta_{1}} = G_{i}^{\beta_{2}} H_{i}^{\beta_{3}} \frac{d}{d\beta_{1}}(\beta_{1}) = G_{i}^{\beta_{2}} H_{i}^{\beta_{3}}\). We solved it by pulling out constants and differentiating.

  • The partial derivative with respect to \(\beta_{2}\) is: \(\partial\mathbb{E}(V_{i})/\partial{\beta_{2}} = \beta_{1} H_{i}^{\beta_{3}} \frac{d}{d\beta_{2}}(G_{i}^{\beta_{2}})\). Having pulled out the constants, let’s differentiate the exponential expression: \(\frac{d}{d\beta_{2}}(G_{i}^{\beta_{2}}) = \frac{d}{d\beta_{2}}(e^{\log G_{i}})^{\beta_{2}} = \frac{d}{d\beta_{2}}(e^{\beta_{2}\log G_{i}})\). Now apply the chain rule: \(e^{\beta_{2}\log G_{i}} \frac{d}{d\beta_{2}}(\beta_{2}\log G_{i}) = e^{\beta_{2}\log G_{i}} \log G_{i} \frac{d}{d\beta_{2}}(\beta_{2}) = e^{\beta_{2}\log G_{i}} \log G_{i} = G_{i}^{\beta_{2}} \log G_{i}\). So the answer is:

\[ \partial\mathbb{E}(V_{i})/\partial{\beta_{2}} = \beta_{1} H_{i}^{\beta_{3}} G_{i}^{\beta_{2}} \log G_{i} \]

  • The partial derivative with respect to \(\beta_{3}\) can be solved for just like the previous one. We’ll skip the algebra.

\[ \partial\mathbb{E}(V_{i})/\partial{\beta_{3}} = \beta_{1} G_{i}^{\beta_{2}} H_{i}^{\beta_{3}} \log H_{i} \]

The R function is therefore (taking advantage of vectorized operations):

EV.func <- function(beta, G, H) { 
  mu <- beta[1] * (G^beta[2]) * (H^beta[3])
  J <- cbind(
    (G^beta[2]) * (H^beta[3]),
    mu*log(G),
    mu*log(H)
  )
  list(mu=mu,J=J)
}
  2. Fit the model to the trees data. Use \(\beta = [.002, 2, 1]\) for starting values. We use non-linear least squares (NLS).
  • Consider fitting the model \(\mathbb{E}(y) = f(\beta)\), where \(f\) is some nonlinear function of parameters/coefficients.
  • Let the objective function be \(\mathcal{S} = \left\lVert y - f(\beta) \right\rVert^{2}\).
  • Use iterative linear least squares to solve; we have some starting parameters \(\hat{\beta}^{[k]}\) and Taylor expand \(f\) so that \(\mathcal{S}\) becomes: \(\mathcal{S} \approx \mathcal{S}^{[k]} = \left\lVert y - f(\hat{\beta}^{[k]}) + J^{[k]}\hat{\beta}^{[k]} - J^{[k]}\beta \right\rVert^{2}\).
  • Define pseudodata \(z^{[k]} = y - f(\hat{\beta}^{[k]}) + J^{[k]}\hat{\beta}^{[k]}\) so the objective is \(\mathcal{S}^{[k]} = \left\lVert z^{[k]} - J^{[k]}\beta \right\rVert^{2}\). Minimize \(\mathcal{S}^{[k]}\) with respect to \(\beta\) to get an improved parameter estimate \(\hat{\beta}^{[k+1]}\). Do that until convergence.
data(trees)
V <- trees$Volume
G <- trees$Girth
H <- trees$Height
beta <- c(0.002, 2, 1)
diff <- Inf
while (diff > 1e-8) {
  EV <- EV.func(beta = beta, G = G, H = H)
  z <- V - EV$mu + (EV$J %*% beta)  # calc pseudodata
  beta_old <- beta
  # linear least squares to minimize S^{[k]} with respect to beta to get improved estimate \beta^{[k+1]}
  beta <- coef(lm(z ~ EV$J - 1))
  diff <- sum(abs(beta - beta_old))
}
  3. Get standard error estimates. The call solve(t(EV$J)%*%EV$J) finds \((\mathbf{J}^{\intercal}\mathbf{J})^{-1}\).
sig2 <- sum((V - EV$mu)^2)/(nrow(trees)-3)
Vb <- solve(t(EV$J)%*%EV$J)*sig2
se <- diag(Vb)^.5;se
## [1] 0.001366994 0.082077439 0.242158812

Let’s compare the above to what we get from stats::nls. It’s the same.

nls.fit <- stats::nls(
  Volume ~ beta1 * (Girth^beta2) * (Height^beta3),
  data = trees,
  start = as.list(c(beta1=0.002, beta2=2, beta3=1))
)
parameter estimates

       beta1      beta2     beta3
nls    0.0014488  1.996921  1.087647
beta   0.0014488  1.996922  1.087646

standard error estimates

       beta1     beta2      beta3
nls    0.001367  0.0820774  0.2421588
se     0.001367  0.0820774  0.2421588

Ch. 2: Linear Mixed Models

Linear mixed models (LMM) extend the linear model (LM).

Linear models are of the form:

\[ \mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}, \mathbf{\epsilon}\sim N(\mathbf{0},\mathbf{I}\sigma^{2}) \]

LMM are extended to be:

\[ \mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{Z}\mathbf{b} + \mathbf{\epsilon}, \mathbf{\epsilon}\sim N(\mathbf{0},\mathbf{\Lambda_{\theta}}), \mathbf{b}\sim N(\mathbf{0},\mathbf{\psi_{\theta}}) \]

The random vector \(\mathbf{b}\) has the random effects, whose covariance matrix depends on some parameters \(\theta\); these are usually the target of inference, rather than \(\mathbf{b}\) per se. \(\mathbf{Z}\) is the model matrix for the random effects. \(\mathbf{\Lambda_{\theta}}\) models residual autocorrelation, but often it is as simple as \(\mathbf{\Lambda_{\theta}} = \mathbf{I}\sigma^{2}\). So the random effects also get a linear model structure, just like the fixed effects.

The model for the outcome itself will follow a multivariate normal distribution:

\[ \mathbf{y}\sim N(\mathbf{X\beta},\mathbf{Z}\mathbf{\psi}_{\theta}\mathbf{Z}^{\intercal} + \mathbf{\Lambda}_{\theta}) \]
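
As a concrete illustration (my own sketch, not from the book), here is a small simulated random-intercept model, with made-up dimensions and variance parameters, and the implied marginal covariance \(\mathbf{Z}\mathbf{\psi}_{\theta}\mathbf{Z}^{\intercal} + \mathbf{\Lambda}_{\theta}\) built explicitly:

set.seed(1)
n.group <- 5; n.per <- 4; n <- n.group*n.per
g <- factor(rep(1:n.group, each = n.per))  # grouping factor
sigma.b <- 2; sigma <- 0.5                 # random intercept and residual standard deviations
x <- runif(n)
X <- model.matrix(~ x)                     # fixed effects model matrix
Z <- model.matrix(~ g - 1)                 # random effects model matrix: one indicator column per group
b <- rnorm(n.group, 0, sigma.b)            # b ~ N(0, psi), here psi = sigma.b^2 * I
y <- X %*% c(1, 2) + Z %*% b + rnorm(n, 0, sigma)
V <- Z %*% (sigma.b^2*diag(n.group)) %*% t(Z) + sigma^2*diag(n)  # Z psi Z' + Lambda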

2.1: Mixed models for balanced data

For the tree (stomata) example, a model with fixed effects for both CO2 and tree is unidentifiable, because each tree is grown at a single CO2 level (the tree factor is nested in CO2). The random effects model will be set up as:

\[ y_{i} = \alpha_{j} + b_{k} +\epsilon_{i} \] For measurement \(i\) in CO2 level \(j\) and tree \(k\), and \(b_{k}\sim N(0,\sigma_{b}^{2}), \epsilon_{i}\sim N(0,\sigma^{2})\).

We don’t need to do inference on the specific values of \(b_{k}\); we simply want to test for CO2 effects in the presence of random tree effects. Note that for tree \(k\), the average response is:

\[ \overline{y}_{k} = \alpha_{j} + e_{k} \] Where \(e_{k}\sim N(0,\sigma^{2}_{b} + \sigma^{2}/4)\). This is because there are 4 observations for each tree. We averaged over the data at each level of the random effect \(b_{k}\), that is, for each tree.

library(gamair) # the stomata data comes with Simon Wood's gamair package
data(stomata)
st <- aggregate(
  data.matrix(stomata),
  by=list(tree=stomata$tree),mean
)
st$CO2 <- as.factor(st$CO2);st
m3 <- lm(area ~ CO2, st)
anova(m3)

There’s evidence for a CO2 effect in this model. Again, \(\theta\) is a target of inference, and in this case it’s \(\sigma_{b}^{2}\), so let’s estimate it.

m1 <- lm(area ~ CO2 + tree, stomata)

The full (unidentifiable) model m1 has \(\hat{\sigma}^{2} = \mathrm{RSS}_{1}/18\); there are 18 residual degrees of freedom (\(n-p = 24 - 6 = 18\)).

The model for the tree means, m3, gives \(\widehat{\sigma_{b}^{2} + \sigma^{2}/4} = \mathrm{RSS}_{3}/4\) (4 residual degrees of freedom).

To get \(\hat{\sigma}^{2}_{b}\) we just need to subtract:

\[ \hat{\sigma}^{2}_{b} = \widehat{\sigma_{b}^{2} + \sigma^{2}/4} - \hat{\sigma}^{2}/4 = \mathrm{RSS}_{3}/4 - \mathrm{RSS}_{1}/72 \]

summary(m3)$sigma^2 - summary(m1)$sigma^2/4
## [1] 0.06770177

The next example uses the Rail data from nlme: six rails, each with three measurements of travel time. We do the inference by hand:

library(nlme) # the Rail data (and lme, used below) come from nlme
m1 <- lm(travel ~ Rail,Rail)
rt <- aggregate(data.matrix(Rail),by=list(Rail$Rail),mean)
m0 <- lm(travel ~ 1,rt)
sig <- summary(m1)$sigma
sigb <- (summary(m0)$sigma^2 - sig^2/3)^0.5

tab <- rbind(sigb = sigb, sig = sig, intercept = unname(coef(m0)))
knitr::kable(tab,caption = "By-hand LMM for rails data")
By-hand LMM for rails data

sigb        24.805465
sig          4.020779
intercept   66.500000

And now we compare to nlme::lme. It will give us the same thing.

rail_df <- as.data.frame(data.matrix(Rail))
# lme.fit = lme(travel ~ 1,random = ~ 1 | Rail,data = rail_df)
lme.fit = lme(travel ~ 1,random = list(Rail = ~1),data = rail_df)
summary(lme.fit)
## Linear mixed-effects model fit by REML
##   Data: rail_df 
##       AIC      BIC   logLik
##   128.177 130.6766 -61.0885
## 
## Random effects:
##  Formula: ~1 | Rail
##         (Intercept) Residual
## StdDev:    24.80547 4.020779
## 
## Fixed effects:  travel ~ 1 
##             Value Std.Error DF  t-value p-value
## (Intercept)  66.5  10.17104 12 6.538173       0
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -1.61882658 -0.28217671  0.03569328  0.21955784  1.61437744 
## 
## Number of Observations: 18
## Number of Groups: 6

2.4: MLE for LMMs

We want to do inference on \(\mathbf{\beta,\theta}\), but we don’t want to work directly with the pdf \(\mathbf{y}\sim N(\mathbf{X\beta},\mathbf{Z}\mathbf{\psi}_{\theta}\mathbf{Z}^{\intercal} + \mathbf{\Lambda}_{\theta})\) as it is too unwieldy.

Instead we work with the marginal \(f(\mathbf{y}|\mathbf{\beta})\), in which \(\mathbf{b}\) has been integrated out (it still depends on \(\mathbf{\theta}\) through the covariance matrices).

We will also need \(f(\mathbf{y,b} | \mathbf{\beta})\), the joint density of \(\mathbf{y}\) and \(\mathbf{b}\) given \(\mathbf{\beta}\). We know that \(f(\mathbf{y,b} | \mathbf{\beta}) = f(\mathbf{y} | \mathbf{b,\beta}) f(\mathbf{b})\).

Say we have \(\hat{\mathbf{b}}\) which maximizes \(f(\mathbf{y,b} | \mathbf{\beta})\) for a particular \(\mathbf{\beta}\). Then \(f(\mathbf{y}|\mathbf{\beta}) = f(\mathbf{y,\hat{b}} | \mathbf{\beta}) \times (\text{terms depending only on }\theta)\). The point is that we can do inference on \(f(\mathbf{y}|\mathbf{\beta})\) by profile likelihood methods: for any \(\theta\) it is easy to get estimates of \(\mathbf{\beta}\) and \(\mathbf{b}\) by maximizing \(f(\mathbf{y,b} | \mathbf{\beta})\) jointly with respect to \(\mathbf{\beta}\) and \(\mathbf{b}\) (equivalently, minimizing a penalized sum of squares).

We call the profile likelihood \(l_{p}(\mathbf{\theta}) = l(\mathbf{\theta,\hat{\beta}_{\theta}})\), where \(\hat{\beta}_{\theta}\) maximizes \(f(\mathbf{y}|\mathbf{\beta})\) given \(\mathbf{\theta}\).

2.4.1: The distribution of \(\mathbf{b|y,\hat{\beta}_{\theta}}\) given \(\mathbf{\theta}\)

We have:

\[ \mathbf{b}|\mathbf{y},\hat{\mathbf{\beta}}_{\theta} \sim N\left(\hat{\mathbf{b}},\ (\mathbf{Z}^{\intercal}\mathbf{\Lambda}_{\theta}^{-1}\mathbf{Z} + \mathbf{\psi}_{\theta}^{-1})^{-1}\right) \] Where \(\hat{\mathbf{b}} = (\mathbf{Z}^{\intercal}\mathbf{\Lambda}_{\theta}^{-1}\mathbf{Z} + \mathbf{\psi}_{\theta}^{-1})^{-1}\mathbf{Z}^{\intercal}\mathbf{\Lambda}_{\theta}^{-1}\tilde{\mathbf{y}}\) and \(\tilde{\mathbf{y}} = \mathbf{y} - \mathbf{X}\hat{\mathbf{\beta}}\) is the residual vector. The \(\hat{\mathbf{b}}\) estimate is called the maximum a posteriori estimator or the predicted random effects vector.

2.4.2: The distribution of \(\hat{\beta}_{\theta}\) given \(\mathbf{\theta}\)

To get the distribution of \(\hat{\beta}_{\theta}\) for a given value of \(\theta\), we can work with the pdf: \(\mathbf{y}\sim N(\mathbf{X\beta},\mathbf{Z}\mathbf{\psi}_{\theta}\mathbf{Z}^{\intercal} + \mathbf{\Lambda}_{\theta})\). For a given \(\theta\) a weighted least squares estimate gives \(\hat{\beta}_{\theta}\). We eventually get:

\[ \hat{\beta}_{\theta} \sim N\left(\beta,\ (\mathbf{X}^{\intercal}(\mathbf{Z}\mathbf{\psi}_{\theta}\mathbf{Z}^{\intercal} + \mathbf{\Lambda}_{\theta})^{-1}\mathbf{X})^{-1}\right) \]

which we use for large-sample inference.
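
To make these formulas concrete (again my own sketch, reusing X, Z, y, sigma.b and sigma from the simulation at the start of this chapter, and treating \(\theta\) as known):

V <- Z %*% (sigma.b^2*diag(ncol(Z))) %*% t(Z) + sigma^2*diag(nrow(Z))  # marginal covariance for this theta
Vi <- solve(V)
beta.hat <- solve(t(X) %*% Vi %*% X, t(X) %*% Vi %*% y)  # weighted (GLS) estimate of beta
Vbeta <- solve(t(X) %*% Vi %*% X)                        # its covariance matrix
r <- y - X %*% beta.hat                                  # residuals y - X beta.hat
b.hat <- solve(t(Z) %*% Z/sigma^2 + diag(ncol(Z))/sigma.b^2, t(Z) %*% r/sigma^2)  # predicted random effects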

2.4.3: The distribution of \(\hat{\theta}\)

2.4.4: Maximizing the profile likelihood

The function below evaluates the negative log profile likelihood for a given value of \(\theta\) (supplied on the log scale, so the variance parameters stay positive); handing it to a numerical optimizer then gives \(\hat{\theta}\).

For each \(\theta\), the function finds \(\hat{\mathbf{\beta}},\hat{\mathbf{b}}\) by minimizing the penalized least squares objective

\[ \left\lVert \mathbf{y - X\beta - Zb} \right\rVert^{2}_{\Lambda_{\theta}} + \mathbf{b^{\intercal}\psi_{\theta}^{-1}b} \]

with respect to \(\mathbf{\beta,b}\), and then returns the negative log profile likelihood \(-l_{p}(\theta)\).

llm <- function(theta,X,Z,y) { 
  ## untransform parameters... 
  sigma.b <- exp(theta[1]) 
  sigma <- exp(theta[2])
  ## extract dimensions...
  n <- length(y); pr <- ncol(Z); pf <- ncol(X) ## obtain \hat \beta, \hat b...
  X1 <- cbind(X,Z)
  ipsi <- c(rep(0,pf),rep(1/sigma.b^2,pr))
  b1 <- solve(crossprod(X1)/sigma^2+diag(ipsi),t(X1)%*%y/sigma^2)
  ## compute log|Z’Z/sigma^2 + I/sigma.b^2|...
  ldet <- sum(log(diag(chol(crossprod(Z)/sigma^2 + diag(ipsi[-(1:pf)])))))
  ## compute log profile likelihood...
  l <- (-sum((y-X1%*%b1)^2)/sigma^2 - sum(b1^2*ipsi) - n*log(sigma^2) - pr*log(sigma.b^2) - 2*ldet - n*log(2*pi))/2 
  attr(l,"b") <- as.numeric(b1) ##   return \hat beta and \hat b 
  return(-l)
}
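
As a quick sanity check (my own, not from the book), we can hand llm to optim to fit the rails model from section 2.1 and compare with the by-hand and lme results above; llm gives ML rather than REML estimates, so the values will differ slightly from the REML ones:

rail <- factor(as.character(Rail$Rail))      # plain factor of rail IDs
X <- matrix(1, length(rail), 1)              # fixed effects: intercept only
Z <- model.matrix(~ rail - 1)                # one random intercept column per rail
m.rail <- optim(c(0, 0), llm, method = "BFGS", X = X, Z = Z, y = Rail$travel)
exp(m.rail$par)                              # ML estimates of sigma.b and sigma
attr(llm(m.rail$par, X, Z, Rail$travel), "b")[1]  # estimate of the overall intercept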

2.5: Linear mixed models in R

If there is only one level (factor) of grouping in the data, the LMM is of the form:

\[ \mathbf{y_{i}} = \mathbf{X_{i}\beta + Z_{i}b_{i} + \epsilon_{i}} \]

Terms with the \(i\) subscript are specific to group \(i\); the others are common to all groups. So \(\mathbf{\beta}\), \(\mathbf{\psi_{\theta}}\) and \(\mathbf{\Lambda_{\theta}}\) (the fixed effect coefficients, the random effect covariance, and the error covariance) are the same for all groups.

2.5.1: package nlme

The function lme handles LMMs. It needs 2 formulas, one for the fixed effects, and one for \(\mathbf{Z_{i}}\) and grouping. For example to fit the rails model \(y_{ij} = \alpha + b_{i} + \epsilon_{ij}\) for grouping factor \(i\) with repeated measures \(j\), we’d do:

data(Rail)
# lme(travel ~ 1, Rail, ~ 1|Rail)  # works too
lme(travel ~ 1, Rail, list(Rail = ~ 1))
## Linear mixed-effects model fit by REML
##   Data: Rail 
##   Log-restricted-likelihood: -61.0885
##   Fixed: travel ~ 1 
## (Intercept) 
##        66.5 
## 
## Random effects:
##  Formula: ~1 | Rail
##         (Intercept) Residual
## StdDev:    24.80547 4.020779
## 
## Number of Observations: 18
## Number of Groups: 6

2.5.2: Tree growth, an example using lme

We have data for 14 trees on their growth (height) as a function of age, where Seed is the unique ID for each tree. Let \(y\) be height, \(x\) be age, then our model might be:

\[ y_{ij} = \beta_{0} + \beta_{1}x_{ij} + \beta_{2}x_{ij}^{2} + \beta_{3}x_{ij}^{3} + b_{i0} + b_{i1}x_{ij} + b_{i2}x_{ij}^{2} + b_{i3}x_{ij}^{3} + \epsilon_{ij} \]

where \(i\) indexes trees and \(j\) measurements within a tree, and the correlation between errors is given by \(\rho(\epsilon_{ij}, \epsilon_{i,j-k}) = \phi^{k}\), with \(\phi\) unknown. That means errors are autocorrelated within a tree over time: measurements \(k\) time points apart have a correlation that decays as \(\phi^{k}\).

data(Loblolly)
# center the data, polynomial terms can become too correlated otherwise
Loblolly$age <- Loblolly$age - mean(Loblolly$age)
# use more EM iters before Newton for the final optimization
lmc <- lmeControl(niterEM=500,msMaxIter=100)
# fit the model
m0 <- lme(height ~ age + I(age^2) + I(age^3),Loblolly,
          random = list(Seed = ~ age + I(age^2) + I(age^3)),
          correlation = corAR1(form = ~ age|Seed),control=lmc)

2.5.3: Several levels of nesting

If there are two factors, with repeated measurements within each combination of factor levels, we need to be more careful when using lme. Say we index the factor levels by \(i,j\) and the measurements within each combination by \(k\):

\[ y_{ijk} = \mu + \alpha_{i} + b_{j} + (\alpha b)_{ij} + \epsilon_{ijk} \] This model has an overall mean, a fixed effect \(\alpha_{i}\) for the first factor, a random effect \(b_{j}\) for the second, and a random interaction \((\alpha b)_{ij}\) for each combination of the two. The random terms are distributed as \(b_{j} \sim N(0,\sigma_{b}^{2})\), \((\alpha b)_{ij} \sim N(0,\sigma_{\alpha b}^{2})\), \(\epsilon_{ijk} \sim N(0,\sigma^{2})\). For the dataset Machines, where the fixed factor is Machine, the random (grouping) factor is Worker, and the outcome is score, we’d use:

data(Machines)
lme(score ~ Machine,Machines,list(Worker = ~1, Machine = ~1))
## Linear mixed-effects model fit by REML
##   Data: Machines 
##   Log-restricted-likelihood: -107.8438
##   Fixed: score ~ Machine 
## (Intercept)    MachineB    MachineC 
##   52.355556    7.966667   13.916667 
## 
## Random effects:
##  Formula: ~1 | Worker
##         (Intercept)
## StdDev:     4.78105
## 
##  Formula: ~1 | Machine %in% Worker
##         (Intercept)  Residual
## StdDev:    3.729532 0.9615771
## 
## Number of Observations: 54
## Number of Groups: 
##              Worker Machine %in% Worker 
##                   6                  18

2.5.4: Package lme4

nlme is efficient for models with nested random effects, but less so for non-nested (crossed) models like \(y_{ijk} = \alpha + b_{i} + c_{j} + \epsilon_{ijk}\); lme4 addresses this. Unlike lme, the lmer function in lme4 uses a single formula to specify both fixed and random effects. Random effects (\(b\)) are specified as \((x|g)\), where \(g\) is the grouping factor and \(x\) is the right hand side of a model formula for the random effects.

We fit the Machines model again:

\[ y_{ijk} = \mu + \alpha_{i} + b_{j} + (\alpha b)_{ij} + \epsilon_{ijk} \]

library(lme4)
a1 <- lmer(score ~ Machine + (1|Worker) + (1|Worker:Machine), data=Machines)
a1
## Linear mixed model fit by REML ['lmerMod']
## Formula: score ~ Machine + (1 | Worker) + (1 | Worker:Machine)
##    Data: Machines
## REML criterion at convergence: 215.6876
## Random effects:
##  Groups         Name        Std.Dev.
##  Worker:Machine (Intercept) 3.7295  
##  Worker         (Intercept) 4.7811  
##  Residual                   0.9616  
## Number of obs: 54, groups:  Worker:Machine, 18; Worker, 6
## Fixed Effects:
## (Intercept)     MachineB     MachineC  
##      52.356        7.967       13.917

2.5.5: Package mgcv

gam in mgcv can also fit LMMs with independent Gaussian random effects, but they need to be specified using “smooth” terms of the form s(z,x,v,bs="re").

Once again, with gusto, the Machines model:

\[ y_{ijk} = \mu + \alpha_{i} + b_{j} + (\alpha b)_{ij} + \epsilon_{ijk} \]

library(mgcv)
b1 <- gam(score ~ Machine + s(Worker,bs="re") + s(Machine,Worker,bs="re"),data=Machines,method="REML")
summary(b1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## score ~ Machine + s(Worker, bs = "re") + s(Machine, Worker, bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   52.356      2.486  21.062  < 2e-16 ***
## MachineB       7.967      2.177   3.660 0.000799 ***
## MachineC      13.917      2.177   6.393 2.02e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                      edf Ref.df    F  p-value    
## s(Worker)          4.141      5 8502 0.000487 ***
## s(Machine,Worker) 10.623     15  595 0.300410    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.986   Deviance explained =   99%
## -REML = 107.84  Scale est. = 0.92463   n = 54

Ch. 3: Generalized Linear Models

GLMs give you 2 things:

  1. response distributions other than Gaussian
  2. a degree of non-linearity in model structure (the link function)

They all have the basic structure:

\[ g(\mu_{i}) = \mathbf{X_{i}\beta} \]

A GLMM follows naturally:

\[ g(\mu_{i}) = \mathbf{X_{i}\beta} + \mathbf{Z_{i}b}, \mathbf{b} \sim N(0,\psi) \]

In either case, the “linear predictor” is the right hand side: \(\mathbf{\eta = X\beta}\) for the GLM, and \(\mathbf{\eta = X\beta + Zb}\) for the GLMM.
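
For example (a minimal sketch of my own, on simulated data), a Poisson GLM with log link has \(g(\mu_{i}) = \log(\mu_{i}) = \mathbf{X}_{i}\mathbf{\beta}\):

set.seed(2)
x <- runif(50)
mu <- exp(1 + 2*x)                       # inverse link applied to the linear predictor
y <- rpois(50, mu)
m.pois <- glm(y ~ x, family = poisson(link = "log"))
coef(m.pois)                             # estimates of beta, on the linear predictor scale
head(predict(m.pois, type = "link"))     # eta = X beta
head(predict(m.pois, type = "response")) # mu = g^{-1}(eta)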

3.5: GLMMs with R

3.5.1: glmmPQL

This function comes from the MASS package; it fits GLMMs by penalized quasi-likelihood (PQL), iteratively fitting working linear mixed models with lme.
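
A minimal usage sketch, following the example on MASS’s glmmPQL help page (a binomial GLMM for the bacteria data, with a random intercept per patient ID):

library(MASS)
library(nlme)    # glmmPQL works by repeatedly calling lme
data(bacteria)
pql.fit <- glmmPQL(y ~ trt + I(week > 2), random = ~ 1 | ID,
                   family = binomial, data = bacteria)
summary(pql.fit)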

Ch. 4: Introducing GAMs

A GAM is a GLM with the linear predictor being a sum of smooth functions of covariates.

\[ g(\mu_{i}) = \mathbf{A_{i}\theta} + f_{1}(x_{1i}) + f_{2}(x_{2i}) + ... \] Here, \(\mathbf{A}\) plays the role of the typical model matrix and \(\mathbf{\theta}\) is akin to \(\mathbf{\beta}\). All the \(f_{i}\) are smooth functions of covariates.
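
As a quick preview of where this chapter is heading (my own minimal sketch, on simulated data): mgcv’s gam fits such a model directly, choosing the amount of smoothing automatically (GCV by default).

library(mgcv)
set.seed(4)
x <- runif(200)
y <- sin(2*pi*x) + rnorm(200, 0, 0.3)
fit <- gam(y ~ s(x))   # one smooth term; the smoothing parameter is selected automatically
summary(fit)
plot(fit)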

4.2: univariate smoothing

Let’s start by considering

\[ y_{i} = f(x_{i}) + \epsilon_{i} \]

To estimate \(f\) in terms of the statistical methods we know requires we turn it (somehow) into a linear model. We do this by representing it by a basis expansion:

\[ f(x) = \sum_{j=1}^{k} b_{j}(x) \beta_{j} \]

We choose the basis expansion, so all \(b_{j}(x)\) terms are completely known (just as if we’d added, say, polynomial terms to a standard regression); we only need to estimate the \(\beta_{j}\) coefficients. This gives a linear model \(\mathbf{y = X\beta + \epsilon}\) where \(X_{ij} = b_{j}(x_{i})\).

Let’s start with the simplest basis, the piecewise linear (“tent function”) basis. Our function tf defines \(b_{j}(x)\). It is used by tf.X, which takes a sequence of knot points and the observed \(x\) data and produces the model matrix \(\mathbf{X}\) for the piecewise linear basis expansion. We’ll use 6 knots, spread evenly over the range of \(x\).

In the plot below, the basis functions are drawn as dashed lines, the basis functions multiplied by their estimated coefficients as dotted lines, and the knot locations as vertical lines; their sum is the fitted curve. The tent functions don’t look like clean triangles because they are drawn only at the observed data points, which don’t coincide with the knots (as they would for simple interpolation): each \(b_{j}(x)\) peaks at the \(j\)th knot but the plotted line can change slope at data points as it falls off to zero.

data("engine")
attach(engine)

# tf: tent function
# defines b_{j}(x)
# gives the j^th function from a set defined by knot points xj
tf <- function(x, xj, j) {
  dj <- xj * 0
  dj[j] <- 1
  approx(x = xj,y = dj,xout = x,method = "linear")$y
}

# x: data
# xj: knot sequence
tf.X <- function(x, xj) {
  nk <- length(xj)
  n <- length(x)
  X <- matrix(NaN, n, nk)
  for (j in 1:nk) {
    X[, j] <- tf(x, xj, j)
  }
  return(X)
}

sj <- seq(min(size),max(size),length=6) ## generate knots
X <- tf.X(size,sj) ## get model matrix
b <- lm(wear ~ X - 1) ## fit model
s <- seq(min(size),max(size),length=200) ## prediction data
Xp <- tf.X(s,sj) ## prediction matrix
plot(size,wear,xlab="Engine capacity",ylab="Wear index",ylim=c(0,5)) ## plot data
lines(s,Xp %*% coef(b)) ## overlay estimated f

# gg_color_hue: small helper emulating ggplot2's default hue palette (not defined elsewhere in these notes)
gg_color_hue <- function(n) hcl(h = seq(15, 375, length = n + 1), l = 65, c = 100)[1:n]
tent_cols <- gg_color_hue(n = 6)

for (j in 1:ncol(X)) {
  abline(v = sj[j],col = adjustcolor(tent_cols[j],0.5),lty = 3)
  lines(x = size[order(size)], y = X[,j][order(size)], col = adjustcolor(tent_cols[j],0.85), lty = 2, lwd = 1.5)
  lines(x = size[order(size)], y = X[,j][order(size)] * coef(b)[j], col = adjustcolor(tent_cols[j],0.75), lty = 3, lwd = 1.5)
}

detach(engine)

We got a decent fit, but we still don’t know how to choose the basis dimension \(k\) (here, the number of knots). One thing to do is keep \(k\) larger than what one might think necessary but to add a penalty to the standard fitting objective:

\[ \left\lVert \mathbf{y-X\beta} \right\rVert^2 + \lambda \sum_{j=2}^{k-1} (f(x_{j-1}) - 2f(x_{j}) + f(x_{j+1}))^2 \]

The first term is the least squares loss; the second is a penalty on the sum of squared (approximate, second-difference) second derivatives of the function at the knots. If \(f\) is a straight line, all the second differences are zero and the penalty vanishes. The smoothing parameter \(\lambda\) controls how smooth the estimated \(f\) is: as \(\lambda \to \infty\) the wiggliness is penalized away and the estimate approaches a straight line, while \(\lambda = 0\) recovers the unpenalized least squares fit of the basis expansion.

We can start by writing the penalty term in terms of matrices, which will shed light on things. For the piecewise linear case, the coefficients of \(f\) in the summed penalty are just the function values at the knots, \(\beta_{j} = f(x_{j})\). First, write it as:

\[ \begin{bmatrix} \beta_{1} - 2\beta_{2} + \beta_{3} \\ \beta_{2} - 2\beta_{3} + \beta_{4} \\ \beta_{3} - 2\beta_{4} + \beta_{5} \\ \vdots \end{bmatrix} = \begin{bmatrix} 1 & -2 & 1 & 0 & \dots & \dots & \dots \\ 0 & 1 & -2 & 1 & 0 & \dots & \dots \\ 0 & 0 & 1 & -2 & 1 & 0 & \dots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix} \begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \beta_{3} \\ \vdots \end{bmatrix} \] Call the right hand side \(\mathbf{D\beta}\). Note that \(\mathbf{D}\) has dimension \((k-2)\times k\). That means it has a null space (recall the null space of a matrix \(\mathbf{A}\) is the space of non-zero vectors \(\mathbf{x}\) which solve \(\mathbf{Ax=0}\)). The null space has dimension \(k - (k-2) = 2\), so it’s 2 dimensional. That makes perfect sense, because the only time the penalty can be zero is when your function \(f\) is a straight line with no wigglyness (non-zero second derivatives)! When the penalty is zero, the fitted function we get is a linear function that lies in the null space.

The penalty term is then \(\mathbf{\lambda\beta^{\intercal}D^{\intercal}D\beta}\). Once again though, for practical computation, we don’t solve that directly but we note that:

\[ \left\lVert \mathbf{y-X\beta} \right\rVert^2 + \mathbf{\lambda\beta^{\intercal}D^{\intercal}D\beta} = \left\lVert \begin{bmatrix}y \\ 0 \end{bmatrix} - \begin{bmatrix} X \\ \sqrt{\lambda}D \end{bmatrix}\beta \right\rVert^2 \] Now, let’s write an R function to solve that linear model on the RHS that’s equivalent to what we want to solve!

# y: outcome data
# x: "predictor" data
# xj: knot locations
# lambda: smoothing parameter
prs.fit <- function(y, x, xj, lambda) {
  X <- tf.X(x, xj) # model matrix
  D <- diff(diag(length(xj)),differences = 2) # D penalty matrix
  X <- rbind(X, sqrt(lambda)*D) # augment model matrix
  y <- c(y, rep(0,nrow(D))) # augment data
  lm(y ~ X - 1) # penalized fit
}

attach(engine)
par(mfrow=c(1,3))

sj <- seq(min(size),max(size),length=20) ## knots

for (lambda in c(80,2,0.05)) {
  b <- prs.fit(wear,size,sj,lambda) ## penalized fit
  Xp <- tf.X(s,sj) ## prediction matrix
  plot(size,wear,xlab="Engine capacity",ylab="Wear index", main=paste0("lambda: ",lambda)) ## plot data
  lines(s,Xp %*% coef(b)) ## plot the smooth
}

par(mfrow=c(1,1))
detach(engine)

4.2.3: Choosing the smoothing parameter, \(\lambda\), by cross validation

If \(\lambda\) is too high, the data will be over-smoothed; small values give under-smoothing. How to choose \(\lambda\) to get a good estimate \(\hat{f}\) of the (unknown) \(f\)?

\[ \frac{1}{n}\sum_{i=1}^{n} (\hat{f}_{i} - f_{i})^{2} \] We can’t minimize the above directly since \(f\) is unknown, so instead we use leave-one-out cross-validation:

\[ \frac{1}{n}\sum_{i=1}^{n} (\hat{f}_{i}^{[-i]} - y_{i})^{2} \] Where \(\hat{f}_{i}^{[-i]}\) is the estimate of \(f\) at \(x_{i}\) when \(y_{i}\) is left out of the fit. Rather than refitting \(n\) times, we use GCV (generalized cross validation), a computationally convenient variant based on the influence (hat) matrix \(\mathbf{F}\): \(\mathcal{V}_{g} = n\left\lVert \mathbf{y} - \hat{\mathbf{f}} \right\rVert^{2} / (n - \mathrm{tr}(\mathbf{F}))^{2}\). This is what the loop below computes.

data(engine)
rho = seq(-9,11,length=90) # log values of lambda
n <- length(engine$wear) # number of data points
V <- rep(NA,90) # GCV score
sj <- seq(min(engine$size),max(engine$size),length=20) ## knots

for (i in 1:90) { ## loop through smoothing params
  b <- prs.fit(engine$wear,engine$size,sj,exp(rho[i])) ## fit model
  trF <- sum(influence(b)$hat[1:n]) ## extract EDF
  rss <- sum((engine$wear-fitted(b)[1:n])^2) ## residual SS
  V[i] <- n*rss/(n-trF)^2 ## GCV
}

par(mfrow=c(1,2))
plot(rho,V,type="l",xlab=expression(log(lambda)),main="GCV score")
sp <- exp(rho[V==min(V)]) ## extract optimal sp
s <- seq(min(engine$size),max(engine$size),length=200) ## prediction data
Xp <- tf.X(s,sj) ## prediction matrix
b <- prs.fit(engine$wear,engine$size,sj,sp) ## re-fit
plot(engine$size,engine$wear,main="GCV optimal fit")
lines(s,Xp %*% coef(b))

par(mfrow=c(1,1))

4.2.4: The Bayesian/mixed model alternative

The smoothing penalty we’ve introduced so far can also be interpreted in a Bayesian way, as a prior distribution expressing a preference for smooth functions. The simplest way to say this is that the prior density on \(\mathbf{\beta}\) is

\[ \propto \mathrm{exp}(-\lambda \mathbf{\beta^{\intercal}S\beta}/(2\sigma^{2})) \] Where \(\mathbf{S = D^{\intercal}D}\) is the penalty matrix we introduced earlier. This is equivalent to a prior on the coefficients:

\[ \mathbf{\beta}\sim N(\mathbf{0},\sigma^{2}\mathbf{S}^{-1}/\lambda) \]

When we reinterpret the model in a Bayesian way, the whole thing can be recast as a linear mixed model. Let \(\mathbf{\beta}^{\prime} = \mathbf{D_{+}\beta}\), where \(\mathbf{D_{+}}\) is the augmented (and hence invertible, \(k\times k\)) penalty matrix \(\mathbf{D_{+}} = \begin{bmatrix} \mathbf{I}_{2}\ \mathbf{0} \\ \mathbf{D} \end{bmatrix}\). Let \(\mathbf{X}^{*}\) be the first two columns of \(\mathbf{XD}_{+}^{-1}\), \(\mathbf{Z}\) the remaining columns, and \(\mathbf{\beta}^{*}\) the first 2 elements of \(\mathbf{\beta}^{\prime}\); the remaining elements of \(\mathbf{\beta}^{\prime}\) become the random effects \(\mathbf{b}\). The mixed model interpretation of the Bayesian smoothing model is:

\[ \mathbf{y} = \mathbf{X^{*}\beta^{*} + Zb + \epsilon}, \mathbf{b}\sim N(\mathbf{0},\mathbf{I}\sigma^{2}/\lambda), \mathbf{\epsilon}\sim N(\mathbf{0},\mathbf{I}\sigma^{2}) \]

It can be estimated by maximizing the marginal likelihood, using llm from before. We also fit it with lme for comparison, which estimates by REML.

attach(engine)

# original parameterization of X
X0 <- tf.X(size,sj)

# make D_{+}
D <- rbind(0,0,diff(diag(20),differences = 2))
diag(D) <- 1

# make new X
X <- t(backsolve(t(D),t(X0)))

# the random and fixed effects
Z <- X[,-c(1,2)]
X <- X[,1:2]

# estimate smoothing lambda and variance
m <- optim(c(0,0),llm,method="BFGS",X=X,Z=Z,y=wear)
b <- attr(llm(m$par,X,Z,wear),"b") ## extract coefficients

plot(size,wear)
Xp1 <- t(backsolve(t(D),t(Xp))) ## re-parameterized pred. mat.
lines(s,Xp1 %*% as.numeric(b),col="grey",lwd=2)

# use REML (lme)
library(nlme)
g <- factor(rep(1,nrow(X))) # dummy factor variable because all terms are in the same "group"
m <- lme(wear ~ X - 1, random=list(g = pdIdent(~ Z-1)))
lines(s,Xp1 %*% as.numeric(coef(m))) ## and to plot

detach(engine)

4.3: Additive models

Let’s now move on to the case where we have two explanatory variables, \(x\) and \(v\), for the response \(y\). Then our model has the additive structure:

\[ y_{i} = \alpha + f_{1}(x_{i}) + f_{2}(v_{i}) + \epsilon_{i} \] Now that there are two smooths, we have an identifiability problem. Any constant \(c\) could be added to \(f_{1}\) and subtracted from \(f_{2}\) without changing predicted responses, so we need to introduce constraints to fit the model.