Model for categorical data

Overview

Assume now that the observed data takes its values in a fixed and finite set of nominal categories $\{c_1, c_2,\ldots , c_K\}$. Considering the observations $(y_{ij}, 1 \leq j \leq n_i)$ of any individual $i$ as a sequence of independent random variables, the model is completely defined by the probability mass functions $\prob{y_{ij}=c_k | \psi_i}$, for $k=1,\ldots, K$ and $1 \leq j \leq n_i$.

For a given $(i,j)$, the sum of the $K$ probabilities is 1, so in fact only $K-1$ of them need to be defined.

In the most general way possible, any model can be considered so long as it defines a probability distribution, i.e., for each $k$, $\prob{y_{ij}=c_k | \psi_i} \in [0,1]$, and $\sum_{k=1}^{K} \prob{y_{ij}=c_k | \psi_i} = 1$. For instance, we could define $K$ time-dependent parametric functions $a_1$, $a_2$, ..., $a_K$ and set for any individual $i$, time $t_{ij}$ and $k \in \{1,\ldots,K\}$,

 $$\prob{y_{ij}=c_k | \psi_i} = \displaystyle{\frac{e^{a_k(t_{ij},\psi_i)} }{\sum_{m=1}^K e^{a_m(t_{ij},\psi_i)} } }.$$ (1) Example:

Suppose we want to model binary data, i.e., data where $y_{ij} \in \{0,1\}$.

Let $\psi_i=(\alpha_i,\beta_i)$ and let $a_1(t,\psi_i)=0$ and $a_2(t,\psi_i) = \alpha_i + \beta_i \, t$. Then, (1) gives a probability distribution for binary outcomes:

$$\prob{y_{ij}=0 | \psi_i} = \displaystyle{\frac{1}{1 + e^{\alpha_i + \beta_i \, t_{ij} } } } \quad \ \ \ \text{and} \quad \ \ \ \prob{y_{ij}=1 | \psi_i} = \displaystyle{\frac{e^{\alpha_i + \beta_i \, t_{ij} } }{1 + e^{\alpha_i + \beta_i \, t_{ij} } } }.$$

Such parametrizations are extremely flexible and easy to interpret in simple situations. In the previous example for instance, $\prob{y_{ij}=1 | \psi_i}$ and $a_2(t_{ij},\psi_i)$ move in the same direction as time increases.

Ordinal data

Ordinal data further assumes that the categories are ordered, i.e., there exists an order $\prec$ such that

$$c_1 \prec c_2,\prec \ldots \prec c_K .$$

We can think for instance of levels of pain (low, moderate, severe), or any scores on a discrete scale, e.g., from 1 to 10.

Instead of defining the probabilities of each category, it may be convenient to define the cumulative probabilities $\prob{y_{ij} \preceq c_k | \psi_i}$ for $k=1,\ldots ,K-1$, or in the other direction: $\prob{y_{ij} \succeq c_k | \psi_i}$ for $k=2,\ldots, K$. Any model is possible as long as it defines a probability distribution, i.e., satisfies:

$$0 \leq \prob{y_{ij} \preceq c_1 | \psi_i} \leq \prob{y_{ij} \preceq c_2 | \bpsi_i} \leq \ldots \leq \prob{y_{ij} \preceq c_K | \psi_i} =1 .$$

Without any loss of generality, we will consider numerical categories in what follows. The order $\prec$ then reduces to the usual order $<$ on $\Rset$. Currently, the most popular model for ordinal data is the proportional odds model which uses logits of these cumulative probabilities, also called cumulative logits. We assume that there exist $\alpha_{i,1}\geq0$, $\alpha_{i,2}\geq 0, \ldots , \alpha_{i,K-1}\geq 0$ such that for $k=1,2,\ldots,K-1$,

 $$\logit \left(\prob{y_{ij} \leq c_k | \psi_i} \right) = \left( \sum_{m=1}^k \alpha_{im}\right) + \beta_i \, x(t_{ij}) ,$$ (2)

where $x(t_{ij})$ is a vector of regression variables and $\beta_i$ a vector of coefficients. Here, $\bpsi_i=(\alpha_{i1},\alpha_{i2},\ldots,\alpha_{i,K-1},\beta_i)$.

Recall that $\logit(p) = \log\left(p/(1-p)\right)$. Then, the probability defined in (2) can also be expressed as

$$\prob{y_{ij} \leq c_k | \bpsi_i} = \displaystyle{\frac{1}{1 + e^{ \left(\sum_{m=1}^k \alpha_{im}\right) + \beta_i \, x(t_{ij})} } }.$$ Example:

We give to patients a drug which is supposed to decrease the level of a given type of pain.

The level of pain is measured on a scale from 1 to 3: 1=low, 2=moderate, 3=high. We consider the following model with the constraint that $\alpha_{i2}\geq 0$:

$$\begin{eqnarray} \logit \left(\prob{y_{ij} \leq 1 | \psi_i}\right) &=& \alpha_{i,1} + \beta_{i,1}\, t_{ij} + \beta_{i,2}\, C_{ij} \\ \logit \left(\prob{y_{ij} \leq 2 | \psi_i}\right) &=& \alpha_{i,1} + \alpha_{i,2} + \beta_{i,1}\, t_{ij} + \beta_{i2}\, C_{ij} \\ \prob{y_{ij} \leq 3 | \psi_i} &=& 1, \end{eqnarray}$$

where $C_{ij}$ is the concentration of the drug at time $t_{ij}$. The model parameters are quite easy to explain:

• $\beta_{i,1}=0$ means that without treatment, the level of pain tends to remains stable over time.
• $\beta_{i,1}<0$ (resp. $\beta_{i1}>0$) means that the pain tends to increase (resp. decrease) over time.
• $\beta_{i,2}=0$ means that the drug has no effect on pain.
• $\beta_{i,2}>0$ means that the level of pain tends to decrease when the drug concentration increases, whereas $\beta_{i2}<0$ means that pain is an adverse drug effect.

Remarks

Exclusive use of linear models (or generalized linear models) has no real justification today since very efficient tools are available for nonlinear models.

Model (2) can be easily extended to a nonlinear model:

 $$\logit \left(\prob{y_{ij} \leq k | \psi_i } \right) = \sum_{m=1}^k \alpha_{i,m} + \beta(x(t_{ij})) ,$$ (3)
where $\beta$ is any (linear or nonlinear) function of $x(t_{ij})$.

Markovian dependence

For the sake of simplicity, we will assume here that the observations $(y_{ij})$ take their values in $\{1, 2, \ldots, K\}$.

We have so far assumed that the categorical observations $(y_{ij},\,j=1,2,\ldots,n_i)$ for individual $i$ are independent. It is however possible to introduce dependency between observations from the same individual by assuming that $(y_{ij},\,j=1,2,\ldots,n_i)$ forms a Markov chain. For instance, a Markov chain with memory 1 assumes that all is required from the past to determine the distribution of $y_{i,j}$ is the value of the previous observation $y_{i,j-1}$. i.e., for all $k=1,2,\ldots ,K$,

$$\prob{y_{i,j} = k\, | \,y_{i,j-1}, y_{i,j-2}, y_{i,j-3},\ldots,\psi_i} = \prob{y_{i,j} = k | y_{i,j-1},\psi_i}.$$

Discrete time Markov chains

If the observation times are regularly spaced (constant length of time between successive observations), we can consider the observations $(y_{ij},\,j=1,2,\ldots,n_i)$ to be a discrete time Markov chain. Here, for each individual $i$, the probability distribution of the sequence $(y_{ij},\,j=1,2,\ldots,n_i)$ is defined by:

• the distribution $\pi_{i,1} = (\pi_{i,1}^{k} , k=1,2,\ldots,K)$ of the first observation $y_{i,1}$:

$$\pi_{i,1}^{k} = \prob{y_{i,1} = k | \psi_i}$$

• the sequence of transition matrices $(Q_{i,j}, j=2,3,\ldots)$, where for each $j$, $Q_{i,j} = (q_{i,j}^{\ell,k}, 1\leq \ell,k \leq K)$ is a matrix of size $K \times K$ such that,

$$\begin{eqnarray} q_{i,j}^{\ell,k} &=& \prob{y_{i,j} = k | y_{i,j-1}=\ell , \psi_i} \quad \text{ for all } (\ell,k),\\ \sum_{k=1}^{K}q_{ij}^{\ell,k} &=& 1 \quad \text{ for all } (\ell,k). \end{eqnarray}$$

The conditional distribution of $y_i=(y_{i,j}, j=1,2,\ldots, n_i)$ is then well-defined:

$$\pcyipsii(y_i | \psi_i) = \pmacro(y_{i,1}|\psi_i) \prod_{j=2}^{n_i} \pmacro(y_{i,j} | y_{i,j-1},\psi_i) .$$

For a given individual $i$, $Q_{i,j}$ defines the transition probabilities between states at a given time $t_{ij}$: Our model must therefore give, for each individual $i$, the distribution of first observation $(y_{i,1})$ and a description of how the transition probabilities evolve with time.

The figure below shows several examples of simulated sequences coming from a model with 2 states defined by:

$$\begin{eqnarray} \logit\left(q_{i,j}^{1,2}\right) &=& a_i+b_i \, t_j \\ \logit\left(q_{i,j}^{2,1}\right) &=& c_i+d_i \, t_j \\ \prob{y_{i,1}=1} &=& 0.5 , \end{eqnarray}$$

where $t_j = j$. In the first example (left), the logits of the transitions between states are constant ($b_i = d_i = 0$). Transition probabilities are therefore constant over time. Here, $q^{1,2}=1/(1+\exp(2.5))=0.0759$ and $q^{2,1}=1/(1+\exp(2))=0.1192$. As $q^{1,2}$ and $q^{2,1}$ are small with $q^{1,2}<q^{2,1}$, transitions between the two states are rare, and a larger amount of time (on average) is spent in state 1. Indeed, the stationary distribution is the eigenvector of the transition matrix $P$: $\prob{y_{ij}=1}=0.611$ and $\prob{y_{ij}=2}=0.389$. The figure (left) displays the transition rates $q^{1,2}$ and $q^{2,1}$ as function of the time (top left) and two simulated sequences of states (centre and bottom left).

In the second example (center), $b_i$ and $d_i$ are negative. This means that as time progresses, transitions from state 1 to 2 become rarer, and the same is true from 2 to 1.

In the third example (right), now $b_i$ and $d_i$ are positive. This means that as time progresses, transitions from state 1 to 2 become more and more frequent, and also more frequent from 2 to 1. Note that the value of $a_i$ (resp. $c_i$) can be seen as the transition probability from state 1 to 2 (resp. 2 to 1) at time $t=0$.

Different choices can be made for defining an initial distribution $\pi_{i,1}$:

• The initial state can be defined arbitrarily: $y_{i,1}=k_0$. This means that $\pi_{i,1}^{k_0} = 1$ and $\pi_{i,1}^{k} = 0$ for $k\neq k_0$.

• More generally, any simple probability distribution can be put on the choice of the initial state, e.g., the uniform distribution $\pi_{i,1}^{k} = 1/K$ for $k=1,2,\ldots , K$.

• If a transition matrix $Q_{i1}$ has been defined at time $t_1$, we might consider using its stationary distribution, i.e., taking for $\pi_{i,1}$ the solution to:

$$\pi_{i,1} = \pi_{i,1} Q_{i1} .$$

Continuous time Markov chains

The previous situation can be extended to the case where observation times are irregular, by modeling the sequence of states as a continuous-time Markov process. The difference is that rather than transitioning to a new (possibly the same) state at each time step, the system remains in the current state for some random amount of time before transitioning. This process is now characterized by transition rates instead of transition probabilities:

$$\prob{y_{i}(t+h) = k\, | \,y_{i}(t)=\ell , \psi_i} = h \, \rho_{i}^{\ell,k}(t) + o(h),\quad k \neq \ell .$$

The probability that no transition happens between $t$ and $t+h$ is

$$\prob{y_{i}(s) = \ell, \forall s\in(t, t+h) \ | \ y_{i}(t)=\ell , \psi_i} = e^{h \, \rho_{i}^{\ell,\ell}(t)} .$$

Summary

A model for independent categorical data is completely defined by:
• The probability mass functions $\left(\prob{y_{ij} = k | \psi_i} \right)$
• (or) the cumulative probability functions $\left(\prob{y_{ij} \leq c_k | \psi_i} \right)$ for ordinal data
• (or) the cumulative logits $\left(\logit \left( \prob{y_{ij} \leq k | \psi_i} \right)\right)$ for a proportional odds model

A model for categorical data with Markovian dependency is completely defined by:

1. the probability transitions in the case of a discrete-time Markov chain
2. (or) the transition rates in the case of a continuous-time Markov process
3. the probability distribution of the initial states

$\mlxtran$ for categorical data models Example 1: $\quad y_{ij} \in \{0, 1, 2\}$

 $$\begin{eqnarray} \psi_i &=& (V_i, k_i, \alpha_{0,i}, \alpha_{1,i}, \gamma_i) \\[0.2cm] D &=&100 \\ C(t,\psi_i) &=& \frac{D_i}{V_i} e^{-k_i \, t} \\[0.2cm] \prob{y_{ij}\leq 0} &=& \alpha_{0,i} + \gamma_i \, C(t_{ij},\psi_i) \\ \prob{y_{ij}\leq 1} &=& \alpha_{0,i} + \alpha_{1,i} + \gamma_i \, C(t_{ij},\psi_i) \\ \prob{y_{ij}\leq 2} &=& 1 \end{eqnarray}$$ MLXTran INPUT: input = {V, k, alpha0, alpha1, gamma} EQUATION: D = 100 C = D/V*exp(-k*t) p0 = alpha0 + gamma*C p1 = p0 + alpha1 DEFINITION: y = {type=categorical, categories={0, 1, 2}, P(y<=0)=p0, P(y<=1)=p1 } Example 2: $\quad$ 2-state discrete-time Markov chain

 $$\begin{eqnarray} \psi_i &=& (a_i,b_i,c_i,d_i) \\[0.2cm] \logit(p_{ij}^{12}) &=& a_i+b_i \, t_{ij} \\ \logit(p_{ij}^{21}) &=& c_i+d_i \, t_{ij} \\ \prob{y_{i,1}=1} &=& 0.5 \end{eqnarray}$$ MLXTran INPUT: input = {a, b, c, d} DEFINITION: Y = { type = categorical, categories = {1, 2}, dependence = Markov P(Y_1=1) = 0.5 logit(P(Y=2 | Y_p=1)) = a + b*t logit(P(Y=1 | Y_p=2)) = c + d*t } Example 3: $\quad$ 2-state continuous-time Markov chain

 $$\begin{eqnarray} \psi_i &=& (a_i,b_i,c_i,d_i,\pi_i) \\[0.2cm] q_{i}^{12}(t) &=& e^{a_i+b_i \, t} \\ q_{i}^{21}(t) &=& e^{c_i+d_i \, t} \\ \prob{y_{i,1}=1} &=& \pi_i \end{eqnarray}$$ MLXTran INPUT: input = {a, b, c, d, pi} DEFINITION: Y = { type = categorical, categories = {1, 2}, dependence = Markov P(Y_1=1) = pi transitionRate(1,2) = exp(a + b*t) transitionRate(2,1) = exp(c + d*t) }

Bibliography

Agresti, A. - Analysis of ordinal categorical data

Vol. 656, Wiley,2010
Agresti, A. - An introduction to categorical data analysis
Vol. 423, Wiley-Interscience,2007
Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., White, J.-S. S., others - Generalized linear mixed models: a practical guide for ecology and evolution
Trends in ecology & evolution 24(3):127-135,2009
Davidian, M., Giltinan, D. M. - Nonlinear Models for Repeated Measurements Data
Chapman & Hall., London,1995
Jiang., J. - Linear and Generalized Linear Mixed Models and Their Applications.
Springer Series in Statistics, New York,2007
Littell, R. C. - SAS for mixed models
SAS institute,2006
McCulloch, C. E., Searle, S. R., Neuhaus, J. M. - Generalized, Linear, and Mixed Models
Wiley,2011
Molenberghs, G., Verbeke, G. - Models for discrete longitudinal data
Springer,2005
Powers, D. A., Xie, Y. - Statistical methods for categorical data analysis
Emerald Group Publishing,2008
Wolfinger, R., O'Connell, M. - Generalized linear mixed models a pseudo-likelihood approach
Journal of statistical Computation and Simulation 48(3-4):233-243,1993