# Mixture models

## Introduction

Mixed-effects models are frequently used for modeling longitudinal data when data is obtained from different individuals from the same population. These models allow us to take into account between-subject variability. One complicating factor arises when data is obtained from a population with some underlying heterogeneity. If we assume that the population consists of several homogeneous sub-populations, a straightforward extension of mixed-effects models is a finite mixture of mixed-effects models.

As an example, the use of a mixture of mixed-effects models is particularly relevant when patients' responses to a drug therapy are heterogeneous. In any clinical efficacy trial, patients who respond, partially respond or do not respond at all can be considered distinct sub-populations with quite different profiles.

The introduction of a categorical covariate (e.g., sex, genotype, treatment, status, etc.) into such a model already supposes that the whole population can be decomposed into sub-populations. The covariate then serves as a label for assigning each individual to a sub-population. In practice, the covariate can either be known or not.

Mixture models usually refer to models for which the categorical covariate is unknown, but whatever the case, the joint model that brings together all the parts (observations, individual parameters, covariates, labels, design, etc.) is the same. The difference appears when having to perform certain tasks and in the methods needed to implement them. For instance, the task of simulation makes no distinction between the two situations because all the variables are simulated, whereas model construction is different depending on whether the labels are known or unknown: we have supervised learning if the labels are known and unsupervised learning otherwise.

There exist several types of mixture models which are useful in the context of mixed-effects models, e.g., mixtures of distributions, mixtures of residual error models, and mixtures of structural models. Indeed, heterogeneity in the response variable cannot always be adequately explained by inter-patient variability of certain parameters alone. It can therefore be necessary to introduce diversity into the structural models themselves:

• Between-subject model mixtures assume that there exist sub-populations of individuals. Here, various structural models describe the response of the different sub-populations, and each subject belongs to one sub-population. One can imagine for example different structural models for responders, non responders and partial responders to a given treatment.

• Within-subject model mixtures assume that there exist sub-populations (of cells, viruses, etc.) within each patient. Again, differing structural models describe the response of the different sub-populations, but the proportion of each sub-population depends on the patient.

## Mixtures of mixed-effects models

For the sake of simplicity, we will consider a basic model that involves individual parameters $\bpsi=(\psi_i,1\leq i \leq N)$ and observations $\by=(y_i,1\leq i \leq N)$, where $y_i=(y_{ij},1\leq j \leq n_i)$. Then, the simplest way to define a finite mixture model is to introduce a label sequence $\bz=(z_i ; 1\leq i \leq N)$ that takes its values in $\{1,2,\ldots,M\}$ and is such that $z_i=m$ if subject $i$ belongs to sub-population $m$.

In some situations, the label sequence $\bz$ is known and can then be used as a categorical covariate in the model. If $\bz$ is known and we consider it to be the realization of a random vector, the model is the conditional distribution

 $$\pcypsiz(\by,\bpsi | \bz;\theta) = \pccypsiz(\by | \bpsi , \bz)\pcpsiz(\bpsi | \bz;\theta) .$$ (1)

If $\bz$ is unknown, it is modeled as a random vector and the model is the joint distribution

 $$\pypsiz(\by,\bpsi, \bz;\theta) = \pcypsiz(\by,\bpsi |\bz;\theta)\pz(\bz;\theta) .$$ (2)

We therefore consider that $\bz=(z_i)$ is a set of independent random variables taking its values in $\{1,2,\ldots,M\}$: for $i=1,2,\ldots, N$, there exist $\pw_{i,1},\pw_{i,2},\ldots,\pw_{i,M}$ such that

$$\prob{z_i = m} = \pw_{i,m} .$$

A simple model might assume that the $(z_i)$ are identically distributed: $\pw_{i,m} = \pw_{m}$ for $m=1,\ldots,M$. But more complex models can be considered, assuming for instance that an individual's probabilities depend on their covariate values.

Example

The Hepatitis C virus (HCV) can be divided into six distinct genotypes. Genotype 1 is the most difficult to treat, whereas individuals with genotypes 2 and 3 are almost three times more likely to respond to therapy with a combination of alpha interferon and ribavirin. Suppose we want to divide patients infected with HCV into three outcome groups: patients who respond, partially respond or do not respond. It is reasonable to assume that an individual's probabilities of ending up in each of these groups depend on the value of their genotype covariate.
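Such covariate-dependent label probabilities can be written, for instance, as a multinomial logistic (softmax) model. Below is a minimal sketch; the intercepts, coefficients and covariate coding are all hypothetical illustrations, not estimates from any real HCV study:

```python
import numpy as np

def label_probabilities(c_i, alpha, beta):
    """Multinomial-logistic (softmax) model for P(z_i = m | c_i).

    alpha, beta: length-M vectors of hypothetical intercepts and
    covariate coefficients (illustrative values only).
    """
    logits = alpha + beta * c_i
    w = np.exp(logits - logits.max())  # numerically stable softmax
    return w / w.sum()

# Hypothetical coding: c_i = 1 if genotype is 2 or 3, else 0.
# Three groups: responder, partial responder, non-responder.
alpha = np.array([-0.5, 0.0, 0.5])
beta = np.array([1.5, 0.2, -1.7])
p_easy = label_probabilities(1.0, alpha, beta)   # genotypes 2/3
p_hard = label_probabilities(0.0, alpha, beta)   # genotype 1
```

With this (made-up) parameterization, patients with genotypes 2 or 3 get a higher probability of landing in the responder group than genotype-1 patients, mirroring the example above.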

In its most general form, a mixture of mixed-effects models assumes that there exist $M$ joint distributions $\pyipsii_{1}$, ..., $\pyipsii_{M}$ and vectors of parameters $\theta_1$, ..., $\theta_M$ such that for any individual $i$, the joint distribution of $y_i$ and $\psi_i$ becomes

$$\pyipsii(y_i,\psi_i;\theta) = \sum_{m=1}^M \prob{z_i = m} \pyipsii_{m}(y_i,\psi_i;\theta_m) ,$$

where $\pyipsii_{m}$ is the joint distribution of $(y_i,\psi_i)$ in group $m$ and where $\theta=(\theta_1,\ldots,\theta_M)$.

The distribution of the observations $y_i$ is therefore itself a mixture of $M$ distributions:

$$\begin{eqnarray} \pyi(y_i;\theta) &=& \int \pyipsii(y_i,\psi_i;\theta) \, d \psi_i \\ &=& \sum_{m=1}^M \prob{z_i = m} \left( \int \pyipsii_{m}(y_i,\psi_i;\theta_m) \, d \psi_i \right) \\ &=& \sum_{m=1}^M \prob{z_i = m} \pyi_{m}(y_i;\theta_m) . \end{eqnarray}$$ (3)
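The fact that a weighted sum of densities is itself a density can be checked numerically. A minimal sketch with two toy Gaussian marginals $\pyi_{m}$ (the weights, means and standard deviations are hypothetical):

```python
import numpy as np

# Toy instance of the mixture marginal: two sub-populations with
# hypothetical weights and Gaussian marginals p_m(y_i) for each group.
weights = np.array([0.35, 0.65])
means = np.array([70.0, 42.0])
sds = np.array([10.0, 8.0])

def mixture_density(y):
    # sum_m P(z_i = m) p_m(y; theta_m), evaluated on a grid of y values
    comps = np.exp(-0.5 * ((y[:, None] - means) / sds) ** 2) \
        / (sds * np.sqrt(2.0 * np.pi))
    return comps @ weights

# The weighted sum of densities still integrates to 1 (trapezoid rule).
y = np.linspace(-50.0, 200.0, 20001)
d = mixture_density(y)
total = float(np.sum((d[1:] + d[:-1]) * np.diff(y) / 2.0))
```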

The mixture can then be looked at via the distribution of the individual parameters $\qpsii$ and/or the conditional distribution of the observations $\qcyipsii$. Let us now look at some examples of such mixture models.

• A latency structure can be introduced at the individual parameter level:

$$\begin{eqnarray} \pyipsii(y_i,\psi_i;\theta) & =& \pcyipsii(y_i | \psi_i)\ppsii(\psi_i;\theta) \\ & =& \pcyipsii(y_i | \psi_i) \left(\sum_{m=1}^M \prob{z_i = m} \ppsii_{m}(\psi_i;\theta_m) \right) , \end{eqnarray}$$

where $\ppsii_{m}(\psi_i;\theta_m)$ is the distribution of the individual parameters in group $m$. For example, a mixture of linear Gaussian models for the individual parameters assumes that there exist $M$ population parameters $\psi_{{\rm pop},1}, \ldots, \psi_{{\rm pop},M}$, vectors of coefficients $\beta_{1}, \ldots, \beta_{M}$, variance matrices $\Omega_{1}, \ldots, \Omega_{M}$ and transformations $h_1,\ldots,h_M$ such that

$$h_m(\psi_i) \ | \ z_i=m \ \ \sim \ \ {\cal N}(\mu_m , \Omega_m),$$

where $\mu_m = h_m(\psi_{ {\rm pop},m})+ \langle \beta_m , c_i \rangle$.
This is the most general representation possible because it allows the transformation, population parameters, covariate model and variance-covariance structure of the random effects all to vary from one group to the next. A simpler representation would hold some or all of these fixed across groups.
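A minimal sketch of sampling an individual parameter under this latent-group model, assuming $h_m = \log$ for both groups, a scalar covariate, and hypothetical population values (none of these numbers come from a real study):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-group model for a scalar parameter (say, a volume V):
# log(V_i) | z_i = m  ~  N( log(V_pop_m) + beta_m * c_i , omega_m^2 )
w = np.array([0.35, 0.65])       # P(z_i = m)
V_pop = np.array([70.0, 42.0])   # group-specific population values
beta = np.array([0.1, 0.1])      # group-specific covariate coefficients
omega = np.array([0.3, 0.3])     # group-specific standard deviations

def sample_V(c_i):
    m = rng.choice(2, p=w)                  # latent group label z_i
    mu = np.log(V_pop[m]) + beta[m] * c_i   # mu_m = h_m(V_pop_m) + <beta_m, c_i>
    return float(np.exp(rng.normal(mu, omega[m])))  # h_m = log => log-normal

V = np.array([sample_V(0.0) for _ in range(5000)])
```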

• A latency structure can also be introduced at the level of the conditional distribution of the observations $(y_{ij})$:

$$\begin{eqnarray} \pyipsii(y_i,\psi_i;\theta) & =& \pcyipsii(y_i | \psi_i)\ppsii(\psi_i;\theta) \\ & =& \left(\sum_{m=1}^M \prob{z_i = m} \pcyipsii_{m}(y_i|\psi_i) \right) \ppsii(\psi_i;\theta) , \end{eqnarray}$$

where $\pcyipsii_{m}$ is the conditional distribution of the observations in group $m$. For example, the model for continuous data
 $$y_{ij} = f\left( t_{ij};\psi_i,z_i \right) + g\left( t_{ij};\psi_i,z_i \right)\teps_{ij}$$ (4)
with $\teps_{ij} \sim {\cal N}(0,1)$, can be equivalently represented as

$$y_{ij} | \,z_i=m \ \ \sim \ \ {\cal N}(f_m( t_{ij};\psi_i) , \ g_m( t_{ij};\psi_i)^2)$$

for each $m=1,\ldots,M$. A mixture of conditional distributions therefore reduces to a mixture of structural models and/or residual errors.
To give a precise example, a mixture of constant error models would assume that

$$\begin{eqnarray} y_{ij} &= & f\left( t_{ij};\psi_i \right) + \left( \sum_{m=1}^M \one_{z_i = m} a_m \right) \varepsilon_{ij} . \end{eqnarray}$$

Alternatively, between-subject model mixtures (BSMM) assume that the structural model is a mixture of $M$ different structural models:
 $$f\left( t_{ij};\psi_i,z_i \right) = \sum_{m=1}^M \one_{z_i = m} f_m\left( t_{ij};\psi_i \right) .$$ (5)

Remarks

It may be too simplistic to assume that each individual is represented by only one well-defined model from the mixture. For instance, in a pharmacological setting there may be subpopulations of cells or viruses within each patient that react differently to a drug treatment. In this case, it makes sense to consider that the mixture of models happens within each individual. Such within-subject model mixtures (WSMM) therefore require additional vectors of individual parameters $\pi_i=(\pi_{i,1},\ldots \pi_{i,M})$ representing proportions of the $M$ models within each individual $i$:
 $$f\left( t_{ij};\psi_i \right) = \sum_{m=1}^M \pi_{i,m} f_m\left( t_{ij};\psi_i \right) .$$ (6)

The proportions $(\pi_{i,m})$ are now individual parameters in the model and the problem is transformed into a standard NLMEM. These proportions are assumed to be positive and to sum to $1$ for each patient. We can then define $\pi_{i,m}$ so as to satisfy these constraints. One possible way to do this is:

$$\pi_{i,m} =\displaystyle{ \frac{\gamma_{i,m} }{\sum_{\ell=1}^M \gamma_{i,\ell} } },$$

where $\log(\gamma_{i,m}) \sim {\cal N}(\log(\gamma_{ {\rm pop},m}), \omega^2_m)$.
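A minimal sketch of this normalization, with hypothetical population values $\gamma_{{\rm pop},m}$ and $\omega_m$ for $M=3$ models:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population values for the gamma's defined above (M = 3)
gamma_pop = np.array([1.0, 2.0, 0.5])
omega = np.array([0.3, 0.3, 0.3])

def sample_proportions():
    # log(gamma_im) ~ N(log(gamma_pop_m), omega_m^2), then normalize
    gamma = np.exp(rng.normal(np.log(gamma_pop), omega))
    return gamma / gamma.sum()   # pi_im: positive and summing to 1

pi_i = sample_proportions()
```

By construction the sampled proportions are positive and sum to 1, whatever the values of the random effects.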

## Example 1: Mixtures of normal distributions

We consider here a simple PK model for a single oral administration:

$$\begin{eqnarray} f(t ; k_a,V,k_e) &=& \frac{D\, k_a}{V(k_a-k_e)} \left( e^{-k_e \, t} - e^{-k_a \, t} \right). \end{eqnarray}$$

Here, the PK parameters are the absorption rate constant $k_a$, the elimination rate constant $k_e$ and the volume of distribution $V$.

We can model the PK parameters $\psi_i=(ka_i,V_i, ke_i)$ of individual $i$ randomly chosen from the population as a vector of independent random parameters.
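This structural model is straightforward to code. A minimal sketch; the dose $D$ and the parameter values are arbitrary illustrations:

```python
import numpy as np

def f(t, ka, V, ke, D=100.0):
    """Concentration after a single oral dose D (arbitrary value here):
    one-compartment model with first-order absorption and elimination."""
    t = np.asarray(t, dtype=float)
    return D * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

t = np.linspace(0.0, 24.0, 97)
conc = f(t, ka=1.0, V=50.0, ke=0.2)   # hypothetical individual parameters
```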

The figure shows the final distribution obtained for the volume when given as a mixture of two log-normal distributions:

$$\log(V_i ) \sim 0.35 \ {\cal N}(\log(70) , 0.3^2) + 0.65 \ {\cal N}(\log(42) , 0.3^2).$$

Figure: two log-normal distributions $p_1$ and $p_2$ for the volume, and the mixture of these two distributions.

Here, the structural model $f$ is a function of time and $f( t ; \psi_i)$ is the predicted concentration of the drug in individual $i$ at time $t$. Then, $f( \, \cdot \, ; \psi_i)$ is a random function because it depends on a random parameter $\psi_i$. The probability distribution of $f( \, \cdot \, ; \psi_i)$ is therefore that of the concentration predicted by the model. It represents the inter-individual variability of the drug's pharmacokinetics in the population.

The figure below displays prediction intervals for the concentration $f( \, \cdot \, ; \psi_i)$ for one individual $i$ randomly chosen from the population, where $\psi_i=(ka_i,V_i,ke_i)$ are the PK parameters. In other words, this plot allows us to visualize the impact of the inter-individual variability of the individual PK parameters on the exposure to the drug. Here, $V_i$ is a mixture of two log-normal distributions as described above, while $ka_i$ and $ke_i$ have log-normal distributions:

$$\begin{eqnarray} \log(ka_i ) &\sim& {\cal N}(\log(1) , 0.3^2) \\ \log(ke_i ) &\sim& {\cal N}(\log(4) , 0.3^2) . \end{eqnarray}$$

Figure: prediction intervals for the predicted concentration kinetics $f( \, \cdot \, ; \psi_i)$.

Remarks

Here, the distribution of $f( \, \cdot \, ; \psi_i)$ cannot be computed in a closed form because the model is non-linear, but it can be easily estimated by Monte Carlo simulation.
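Such a Monte Carlo estimate of the prediction intervals might look like the following sketch, drawing the parameters from the distributions of this example (the dose, number of simulated individuals and time grid are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, D = 2000, 100.0                  # simulated individuals; arbitrary dose
times = np.linspace(0.1, 24.0, 48)

def f(t, ka, V, ke):
    return D * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

# individual parameters drawn from the distributions of this example
ka = np.exp(rng.normal(np.log(1.0), 0.3, n))
ke = np.exp(rng.normal(np.log(4.0), 0.3, n))
in_group_1 = rng.random(n) < 0.35   # latent label behind the V_i mixture
logV = np.where(in_group_1,
                rng.normal(np.log(70.0), 0.3, n),
                rng.normal(np.log(42.0), 0.3, n))
V = np.exp(logV)

# one concentration curve per simulated individual, then pointwise quantiles
curves = f(times[None, :], ka[:, None], V[:, None], ke[:, None])
lo, med, hi = np.percentile(curves, [5, 50, 95], axis=0)  # 90% band
```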

Here, the distribution of $f( \, \cdot \, ; \psi_i)$ is itself a mixture of two distributions, since the distribution of $\psi_i$ is a mixture of distributions due to $V_i$. It is interesting to see the distribution of the predicted concentration in each subpopulation. Indeed, any individual $i$ will either have a log-volume drawn from ${\cal N}(\log(70) , 0.3^2)$ (with probability 0.35) or from ${\cal N}(\log(42) , 0.3^2)$ (with probability 0.65), so in order to visualize what really happens to a single individual $i$, we need to split the data into two plots: 35% of the individuals will have concentration kinetics distributed as in the left-hand plot, and 65% as in the right-hand one.

Figure: the probability distribution of the predicted concentration kinetics $f( \, \cdot \, ; \psi_i)$ in the two subpopulations.

## Example 2: Mixtures of structural models

Here we are interested in a study which concerns treated HIV-infected patients. The output data is the viral load evolution for these patients. The figure below gives examples of patients with one of three "characteristic" viral load progressions:

• Non-responders (1) show no decline in viral load.
• Responders (2) exhibit a sustained viral load decline.
• Rebounders (3 and 4) exhibit an initial drop in viral load, then a rebound to higher viral load levels.

Figure: viral load progression for 4 HIV-infected patients. (1) non-responder; (2) responder; (3) and (4) rebounders. Red points indicate data below the level of quantification.

Remarks:

• Since viral loads generally evolve exponentially over time, they are most commonly expressed on a logarithmic scale.
• There is a detection limit at $50$ HIV RNA copies/ml, corresponding to a log-viral load of $1.7$, i.e., data are left-censored. These points are shown in red.

Within a few months of HIV infection, patients typically enter a steady state of chronic infection and have a stabilized concentration of HIV-1 in blood plasma. This concentration is modeled by an individual constant $A_{i,0}$. When anti-retroviral treatment starts, the viral load of patients who respond shows an initial rapid exponential decay, usually followed by a slower second phase of exponential decay. This two-phase decay in viral load can be approximated by the bi-exponential model:

$$A_{1}e^{-\lambda_{1}t} +A_{2}e^{-\lambda_{2}t} .$$

After the decrease in viral load level, some subjects show a rebound, which can be due to several factors (non-adherence to the therapy, emergence of drug-resistant virus strains, etc.). We propose to extend the bi-exponential model to these patients by adding a third phase, characterized by a logistic growth process $A_{3}/({1+e^{-\lambda_{3}(t-\tau)}})$, where $\tau$ is the inflection point of this growth process.

We can then describe the log-transformed viral load with a BSMM with three simple models, corresponding to each of three characteristic viral load progressions:

$$\begin{eqnarray} f_1(t_{ij},\psi_i) &=& A_{i,0} \\ f_2(t_{ij},\psi_i) &=&A_{i,1}e^{-\lambda_{i,1}t_{ij} } +A_{i,2}e^{-\lambda_{i,2}t_{ij} } \\ f_3(t_{ij},\psi_i) &=&A_{i,1}e^{-\lambda_{i,1}t_{ij} } +A_{i,2}e^{-\lambda_{i,2}t_{ij} } + \displaystyle{ \frac{A_{i,3} }{1+e^{-\lambda_{i,3}(t_{ij}-\tau_i) } } } . \end{eqnarray}$$

The log-transformed viral load can then be modeled by:

$$\log(y_{ij} ) = \sum_{m=1}^3\one_{z_i=m}\log(f_m(t_{ij},\psi_i) ) + \varepsilon_{ij},$$

where $y_{ij}$ is the viral load for subject $i$ at time $t_{ij}$ and $\psi_i=(A_{i,0},A_{i,1},A_{i,2},A_{i,3},\lambda_{i,1},\lambda_{i,2},\lambda_{i,3},\tau_i)$ is the vector of individual parameters.
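The three sub-models can be simulated directly. A minimal sketch, using patient 3's rebounder parameters from this example; the residual standard deviation of 0.2 is an assumption for illustration, not a fitted value:

```python
import numpy as np

rng = np.random.default_rng(3)

# the three structural models, for one individual's parameter dict psi
def f1(t, psi):
    return psi["A0"] * np.ones_like(t)

def f2(t, psi):
    return psi["A1"] * np.exp(-psi["l1"] * t) + psi["A2"] * np.exp(-psi["l2"] * t)

def f3(t, psi):
    # bi-exponential decay plus a logistic rebound with inflection at tau
    return f2(t, psi) + psi["A3"] / (1.0 + np.exp(-psi["l3"] * (t - psi["tau"])))

# rebounder parameters of patient 3 from this example
psi = {"A1": 53.0, "A2": 6.0, "A3": 28.0,
       "l1": 0.15, "l2": 1.5e-5, "l3": 0.15, "tau": 200.0}

t = np.linspace(0.0, 400.0, 201)
# simulated log viral load for a rebounder (z_i = 3), assumed residual sd 0.2
log_y = np.log(f3(t, psi)) + rng.normal(0.0, 0.2, t.shape)
```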

The figure below displays the predicted viral loads for the 4 patients, using model $f_1$ for patient 1, $f_2$ for patient 2 and $f_3$ for patients 3 and 4, with the following parameters:

| ID | $A_0$ | $A_1$ | $A_2$ | $A_3$ | $\lambda_1$ | $\lambda_2$ | $\lambda_3$ | $\tau$ |
|----|-------|-------|-------|-------|-------------|-------------|-------------|--------|
| 1 | 92 | $-$ | $-$ | $-$ | $-$ | $-$ | $-$ | $-$ |
| 2 | $-$ | 66 | 5 | $-$ | 0.14 | $2\times10^{-5}$ | $-$ | $-$ |
| 3 | $-$ | 53 | 6 | 28 | 0.15 | $1.5\times10^{-5}$ | 0.15 | 200 |
| 4 | $-$ | 77 | 10 | 100 | 0.1 | $1.5\times10^{-5}$ | 0.013 | 270 |

Figure: observed and predicted viral load progression for 4 HIV-infected patients.

Not all observed viral load progressions fall so easily into one of the three classes, as for example the patients shown in the next figure.

Figure: viral load data for 4 patients with ambiguous progressions.

In these cases, it does not seem quite so reasonable to model the data under the BSMM assumption that each patient must belong uniquely to one class. Instead, it is perhaps more natural to suppose that each patient is partially responding, partially non-responding and partially rebounding to the given drug treatment. The goal becomes to find the relative strength of each process in each patient, and a WSMM is an ideal tool for doing this. Without going further into the details, here are the resulting observed and predicted viral loads for these 4 individuals when each individual represents a mixture of the three viral load progressions.

Figure: observed and predicted viral load progression for 4 patients using WSMM.

## Bibliography

Biernacki, C., Celeux, G., Govaert, G., Langrognet, F. - Model-Based Cluster and Discriminant Analysis with the MIXMOD Software. Computational Statistics and Data Analysis 51(2):587-600, 2006.

Celeux, G., Hurn, M., Robert, C. - Model Based Clustering for Longitudinal Data. Computational Statistics and Data Analysis 52(3):1441-1457, 2008.

Frühwirth-Schnatter, S. - Finite Mixture and Markov Switching Models. Springer, New York, 2006.

Hou, W., Li, H., Zhang, B., Huang, M., Wu, R. - A nonlinear mixed-effect mixture model for functional mapping of dynamic traits. Heredity 101:321-328, 2008.

Ketchum, J. M., Best, A. M., Ramakrishnan, V. - A Within-Subject Normal-Mixture Model with Mixed-Effects for Analyzing Heart Rate Variability. J. Biomet. Biostat. S7:013, 2012.

Lavielle, M., Mbogning, C. - An improved SAEM algorithm for maximum likelihood estimation in mixtures of non linear mixed effects models. Statistics & Computing (to appear), 2013.

Mbogning, C., Bleakley, K., Lavielle, M. - Between-subject and within-subject model mixtures for classifying HIV treatment response. Progress in Applied Mathematics 4(2):148-166, 2012.

McLachlan, G. J., Peel, D. - Finite Mixture Models. Wiley-Interscience, New York, 2000.

Muthén, B., Shedden, K. - Finite mixture modeling with mixture outcomes using the EM algorithm. Biometrics 55(2):463-469, 1999.

Ng, S. K., McLachlan, G. J., Wang, K., Ben-Tovim, L., Ng, S. W. - A mixture model with mixed effects components for clustering correlated gene-expression profiles. Bioinformatics 22:1745-1752, 2006.

Rosner, G. L., Muller, P. - Bayesian population pharmacokinetic and pharmacodynamic analyses using mixture models. J. Pharmacokin. Biopharm. 25:209-233, 1997.

Verbeke, G., Lesaffre, E. - A linear mixed-effects model with heterogeneity in the random-effects population. Journal of the American Statistical Association 91(433):217-221, 1996.

Wang, X., Schumitzky, A., D'Argenio, D. Z. - Nonlinear random effects mixture models: Maximum likelihood estimation via the EM algorithm. Comput. Stat. Data Anal. 51:6614-6623, 2007.