This is a recall for some of what I have learned in 2018, Fall. Back then, the main focus was on Bayesian models. A friend asked me a question few days ago on estimating the parameters related to Generalized Estimating Equation (GEE). In general, this can be comfortably handeled by the R
package such as geepack
, but she wanted to know the exact estimation procedures. Hence, I review some of my materials, and a book recommended by my colleague which will be referenced in the end.
The estimation is proposed in Liang & Zeger (1986), and I found something not that clear for me in the first place and I consult for a very nice book written by Hardin & Hilbe (2013), which gave a more fridenly explaination.
I will illustrate this using a seizure data example. Data on seizures were collected on 59 epileptics, i.e. subject \(i = \lbrace 1, \ldots, 59 \rbrace\). For each patient the no. of seizures were recorded during a baseline period of 8 weeks, after which patients were randomized to treatment with the drug progabide or to placebo. The no. of seizures was then recorded in 4 consecutive 2-week periods, i.e. \(n_i = 5, \forall i\). Patient age was also available. Let \[\begin{align*} Y_{ij} &= \text{number of seizuires on patitent } i \text{ at occation }j\\ t_{ij} &= \text{length of observation period on patient } i \text{ at occation } j\\ x_{i1} &= 0/1 \text{ if patient } i \text{ was assigned placebo/progabibe}\\ x_{ij2} &= 0/1 \text{ if } j = 0/1,2,3,4 \end{align*}\] with \(t_{ij} = 8\) if \(j = 0\) and \(t_{ij} =2\) if \(j \geq 1, \forall i\). The total number of observation is \(N = 59 \times 5 = 295\).
For this data, we have
The standard way as mentioned previously is to use the geepack
package to run the model given data. Here we assume a exchangeable correlation structures, i.e.
\[R(\alpha) = \begin{bmatrix}\begin{array}{ccccc} 1 & \alpha & \alpha & \alpha & \alpha\\ \alpha & 1 & \alpha & \alpha & \alpha\\ \alpha & \alpha & 1& \alpha & \alpha\\ \alpha & \alpha & \alpha & 1 & \alpha\\ \alpha & \alpha & \alpha & \alpha & 1\\ \end{array}\end{bmatrix}\]
library(MASS)
library(geepack)
seizure <- read.csv("data/seizure.csv",header=T)
new_id <- c(1,diff(seizure$subject))
seizure$trt <- as.numeric(as.factor(seizure$trt))
y <- num_of_seizure <- c(seizure$base[new_id==1],seizure$y)
id <- c(unique(seizure$subject),seizure$subject)
t <- c(rep(8,length(unique(seizure$subject))),rep(2,length(seizure$subject)))
x1 <- c(seizure$trt[new_id==1],seizure$trt)-1
x2 <- c(rep(0,sum(new_id)),rep(1,59*4))
seizure <- data.frame(id=id,y=y,x1=x1,x2=x2,t=t)
seizure <- seizure[order(seizure$id),]
fit <- geese(y~x1*x2+offset(log(t)),id=id,
corstr="exchangeable",
family="poisson",data=seizure)
summary(fit)
##
## Call:
## geese(formula = y ~ x1 * x2 + offset(log(t)), id = id, data = seizure,
## family = "poisson", corstr = "exchangeable")
##
## Mean Model:
## Mean Link: log
## Variance to Mean Relation: poisson
##
## Coefficients:
## estimate san.se wald p
## (Intercept) 1.34760922 0.1573571 73.34238067 0.0000000
## x1 0.02651461 0.2218539 0.01428355 0.9048683
## x2 0.10871914 0.1156491 0.88374536 0.3471779
## x1:x2 -0.10160167 0.2133655 0.22675324 0.6339418
##
## Scale Model:
## Scale Link: identity
##
## Estimated Scale Parameters:
## estimate san.se wald p
## (Intercept) 19.42418 8.699135 4.985779 0.02555648
##
## Correlation Model:
## Correlation Structure: exchangeable
## Correlation Link: identity
##
## Estimated Correlation Parameters:
## estimate san.se wald p
## alpha 0.7765117 0.07534162 106.2248 0
##
## Returned Error Value: 0
## Number of clusters: 59 Maximum cluster size: 5
Now you can see that the point estimate of scale parameter \(\hat{\phi} = 19.424\) and the correlation parameter \(\hat{\alpha} = 0.777\).
According to Liang & Zeger (1986) and Hardin and Hilbe (2013). A GEE with an exchageable correlation structure uses the estimated Pearson residuals (\(\hat{r}_{ij}\)) from the current fit of the model to estimate the parameters \(\phi, \alpha\), where
\[\begin{align} \hat{r}_{ij} & = \dfrac{Y_{ij} - \hat{\mu}_{ij}}{\sqrt{\mathrm{Var}(\hat{\mu}_{ij})}} \\ \hat{\phi} &= \dfrac{1}{N-p}\sum_{i = 1}^{K}\sum_{j = 1}^{n_i}\hat{r}_{ij}^2\\ \hat{\alpha} &= \dfrac{1}{\hat{\phi}}\sum_{i = 1}^{K}\left(\dfrac{\sum_{j = 1}^{n_i}\sum_{k = 1}^{n_i}\hat{r}_{ij}\hat{r}_{ik} - \sum_{j = 1}^{n_i}\hat{r}_{ij}^2}{n_i(n_i-1)}\right) \end{align}\]
To compute these two parameters esitmates, first, we have to obtain the Pearson’s residuals under Poisson model. Recall that for the Poisson model, mean and variance are equal, hence the Pearson residuals can be re-written as
\[\hat{r}_{ij} = \dfrac{Y_{ij} - \hat{\mu}_{ij}}{\sqrt{\mathrm{Var}(\hat{\mu}_{ij})}} = \dfrac{Y_{ij} - \hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}}}\] And the rest is straightfoward, not that since \(n_i = 5, \forall i \Rightarrow n_i(n_i-1) = 20, \forall i\) as well.
fit2 <- glm(y~x1*x2+offset(log(t)), family = 'poisson', data = seizure)
N <- nrow(seizure)
# Pearson residuals under Poisson model
seizure$res <- (seizure$y - fit2$fitted.values)/sqrt(fit2$fitted.values)
(phi <- sum(seizure$res^2)/N)
## [1] 19.42418
library(dplyr)
tmp <- seizure %>% group_by(id) %>% summarize(num = sum(outer(res, res)) - sum(res^2), den = 20)
(alpha <- sum(tmp$num)/sum(tmp$den)/phi)
## [1] 0.7765117
The output is the same as the geese
function from geepack
package.
One thing is that although we have \(x_1, x_2\), because of the formation, we may treated \(p = 0\).
Reference
Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1), 13-22.
Hardin, J., & Hilbe, J. (2013). Generalized Estimating Equations (Second).