# Models for count data

Count data are a type of statistical data that take only non-negative integer values $\{0, 1, 2,\ldots\}$ obtained by counting something, e.g., the number of seizures, hemorrhages or lesions in a given time period. More precisely, the data from individual $i$ form the sequence $y_i=(y_{ij},1\leq j \leq n_i)$ where $y_{ij}$ is the number of events observed in the $j$th time interval $I_{ij}$.

For the moment, let us assume that all the intervals have the same length. This is the case, for instance, if data are daily seizure counts: $I_{ij}$ is the $j$th day after the start of the experiment and $y_{ij}$ the number of seizures observed during that day.

We will then model the sequence $y_i=(y_{ij},1\leq j \leq n_i)$ as a sequence of random variables that take their values in $\{ 0, 1, 2,\ldots\}$.

If we assume that these random variables are independent, then the model is completely defined by the probability mass functions $\prob{y_{ij}=k}$, for $k \geq 0$ and $1 \leq j \leq n_i$. Common distributions used to model count data include Poisson, binomial and negative binomial.

Here, we will only consider parametric distributions. In this context, building a model means defining:

• the parameter function (or "intensity") $\lambda_{ij} = \lambda(t_{ij},\psi_i)$ for any individual $i$ that depends on individual parameters $\psi_i$ and possibly the time $t_{ij}$.
• the probability mass function $\prob{y_{ij}=k; \lambda_{ij}}$.

The conditional distribution of the observations is therefore written:

$$\prob{y_{ij}=k | \psi_i} = \prob{y_{ij}=k ; \lambda_{ij} }.$$

Example

Let us illustrate this approach for the Poisson distribution.

A Poisson distribution with intensity $\lambda$ is defined by its probability mass function:

$$\prob{y=k ; \lambda} = \displaystyle{\frac{\lambda^{k} \, e^{-\lambda} }{k!} }.$$

One of the main properties of the Poisson distribution is that $\lambda$ is both the mean and the variance of the distribution:

$$\esp{y} = \var{y} = \lambda$$
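
The equality of mean and variance can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `poisson_pmf`, the value $\lambda = 3.5$ and the truncation point are illustrative choices, not part of the original text.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(y = k; lam) = lam^k * exp(-lam) / k!, evaluated in log space (lam > 0)."""
    if k < 0:
        return 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

lam = 3.5          # illustrative intensity
ks = range(200)    # truncation point; the tail beyond it is negligible for this lam
probs = [poisson_pmf(k, lam) for k in ks]

total = sum(probs)                                          # should be close to 1
mean = sum(k * p for k, p in zip(ks, probs))                # should be close to lam
var = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))   # should be close to lam
print(total, mean, var)
```

Working in log space with `math.lgamma(k + 1)` in place of $\log(k!)$ avoids overflow in $\lambda^k$ and $k!$ for large $k$.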

All that remains is to define the Poisson intensity function $\lambda_{ij} = \lambda(t_{ij},\psi_i)$. Then,

$$\prob{y_{ij}=k | \psi_i} = \displaystyle{\frac{\lambda_{ij}^{k}\, e^{-\lambda_{ij} } } {k!} }.$$

There are many variations of the Poisson model:

• Homogeneous Poisson distribution: this assumes a constant intensity $\lambda_i$ for each individual $i$. Here, $\psi_i = \lambda_i$ and $\lambda(t_{ij},\psi_i)=\lambda_i$.

• Non-homogeneous Poisson distribution: this assumes that the Poisson intensity is a function of time. For example, suppose that we believe that the frequency of a disease-related event increases linearly from month to month. We could then model this using $\lambda(t_{ij},\psi_i) = \lambda_{i} + a_i t_{ij}$, where $t_{ij} = j$ (months). Here, $\psi_i=(\lambda_{i},a_i)$.

• Additional regression variables: the Poisson intensity may depend on regression variables other than time. For example, assume that taking a drug tends to reduce the number of events. We can then link the time-varying drug concentration $C$ to the value of $\lambda$ at time $t_{ij}$ using for instance an "Imax" model:

$$\lambda(t_{ij},\psi_i) = \lambda_{i}\left(1-\Imax_i\displaystyle{\frac{ \ C_i(t_{ij})}{IC_{50,i} + C_i(t_{ij})} }\right) ,$$

where $\lambda_{i}$ is the baseline intensity and where $0\leq \Imax_i\leq 1$. Here, $\psi_{i} = (\lambda_{i}, \Imax_i, IC_{50,i})$.
This model can even be combined with the previous non-homogeneous model by assuming a time-varying baseline $\lambda_{i}(t)$, for instance to combine a drug effect model with a disease model.

• Instead of assuming independent count data, we can introduce Markovian dependency into the model by assuming for example that $\lambda_{ij}$ is a function of $y_{i,j-1}$. Then, $\prob{y_{ij}=k\, |\, y_{i,j-1}, t_{ij},\psi_i}$ is the probability mass function of a Poisson random variable with parameter $\lambda_{ij} =\lambda(y_{i,j-1}, t_{ij},\psi_i)$.

• If $y_{ij}$ is the number of events of a given type (seizures, hemorrhages, etc.) in a given time interval $I_{ij}$, and if $h_i(t)=h(t,\psi_i)$ is the hazard function associated with this sequence of events for individual $i$, then the sequence of events is a non-homogeneous Poisson process and $y_{ij}$ is a Poisson random variable with intensity $\lambda_{ij}=\displaystyle{ \int_{I_{ij}}} h(t,\psi_i)\,dt$ in interval $I_{ij}$ (see Models for time-to-event data section).
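
The intensity functions listed above can be sketched in a few lines of Python. All function names, parameter values and the mono-exponential concentration profile below are hypothetical and only serve as an illustration; the integral of the hazard is approximated with a simple midpoint rule.

```python
import math

def lambda_homogeneous(t, lam):
    """Constant intensity: lambda(t, psi) = lam."""
    return lam

def lambda_linear(t, lam0, a):
    """Linearly time-varying intensity: lambda(t, psi) = lam0 + a * t."""
    return lam0 + a * t

def lambda_imax(t, lam0, imax, ic50, conc):
    """Baseline intensity lam0 reduced by an Imax drug effect; conc(t) returns C(t)."""
    c = conc(t)
    return lam0 * (1.0 - imax * c / (ic50 + c))

def interval_intensity(hazard, start, end, n=1000):
    """Midpoint-rule approximation of the integral of hazard(t) over [start, end]."""
    step = (end - start) / n
    return step * sum(hazard(start + (i + 0.5) * step) for i in range(n))

def conc(t):
    """Hypothetical concentration profile (mono-exponential decay)."""
    return 10.0 * math.exp(-0.2 * t)

lam_month4 = lambda_linear(4.0, lam0=2.0, a=0.5)
lam_day3 = lambda_imax(3.0, lam0=5.0, imax=0.8, ic50=2.0, conc=conc)
lam_const = interval_intensity(lambda t: 0.5, 2.0, 3.0)      # close to 0.5
lam_lin = interval_intensity(lambda t: 0.1 * t, 2.0, 3.0)    # close to 0.25
print(lam_month4, lam_day3, lam_const, lam_lin)
```

Note that a linear intensity only defines a valid Poisson model as long as $\lambda_i + a_i t_{ij}$ stays positive over the observation period.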

Let us now look at some other distributions for count data:

• The zero-inflated Poisson distribution:

$$\prob{y=k ; \lambda,p_0} = \left\{ \begin{array}{cc} p_0 + (1-p_0)e^{-\lambda} & {\rm if } \ k=0 \\ (1-p_0) \displaystyle {\frac{e^{-\lambda} \lambda^{k} }{k!} } & {\rm if } \ k>0 . \end{array} \right.$$

where $0\leq p_0 <1$. This is useful when the data seem generally to follow a Poisson distribution except for an excess of zeros, i.e., too many cases with $k=0$.

• The negative binomial distribution is:

$$\prob{y=k ; p,r} = \displaystyle{ \frac{\Gamma(k+r)}{k!\, \Gamma(r)} }(1-p)^r p^k ,$$

with $0\leq p < 1$ and $r>0$. If $r$ is an integer, then the negative binomial (NB) distribution with parameters $(p,r)$ is the probability distribution of the number of successes in a sequence of independent Bernoulli trials with probability of success $p$ before $r$ failures occur.

• The generalized Poisson distribution is:

$$\prob{y=k ; \lambda,\delta} = \displaystyle {\frac{\lambda (\lambda+k\delta)^{k-1} e^{-\lambda-k\delta} }{k!} },$$

with $\lambda>0$ and $0\leq \delta <1$.
The generalized Poisson (GP) distribution includes the Poisson distribution as a special case $(\delta=0)$, and is over-dispersed relative to the Poisson when $\delta>0$. Indeed, the variance-to-mean ratio is then $1/(1-\delta)^2 > 1$:

$$\begin{eqnarray} \esp{y} &=& \frac{\lambda}{1-\delta} \\ \var{y} &=& \frac{\lambda}{(1-\delta)^3}. \end{eqnarray}$$
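
As a sanity check, the three probability mass functions above can be evaluated numerically. The sketch below uses only the Python standard library; the function names and parameter values are illustrative assumptions. It verifies that each distribution sums to 1 and compares the numerical mean and variance of the generalized Poisson distribution with the standard closed forms $\lambda/(1-\delta)$ and $\lambda/(1-\delta)^3$.

```python
import math

def zip_pmf(k, lam, p0):
    """Zero-inflated Poisson: extra mass p0 at zero, Poisson(lam) otherwise."""
    pois = math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
    return p0 + (1 - p0) * pois if k == 0 else (1 - p0) * pois

def nb_pmf(k, p, r):
    """Negative binomial in the (p, r) parameterization of the text (0 < p < 1)."""
    log_coef = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coef + r * math.log(1 - p) + k * math.log(p))

def gp_pmf(k, lam, delta):
    """Generalized Poisson; delta = 0 gives back the ordinary Poisson."""
    log_p = (math.log(lam) + (k - 1) * math.log(lam + k * delta)
             - lam - k * delta - math.lgamma(k + 1))
    return math.exp(log_p)

def moments(pmf, kmax=400):
    """Total mass, mean and variance of a pmf truncated to k = 0, ..., kmax - 1."""
    probs = [pmf(k) for k in range(kmax)]
    total = sum(probs)
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return total, mean, var

lam, delta = 2.0, 0.3   # illustrative values
total_zip, _, _ = moments(lambda k: zip_pmf(k, lam, 0.25))
total_nb, _, _ = moments(lambda k: nb_pmf(k, 0.4, 2.5))
total_gp, mean_gp, var_gp = moments(lambda k: gp_pmf(k, lam, delta))

print(total_zip, total_nb, total_gp)     # each should be close to 1
print(mean_gp, lam / (1 - delta))        # numerical vs closed-form mean
print(var_gp, lam / (1 - delta) ** 3)    # numerical vs closed-form variance
```

The truncation at `kmax = 400` is an assumption that suits these parameter values; heavier tails (for example $\delta$ close to 1) would need a larger cut-off.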

Summary

For a given design $\bx_{i}$ and a given vector of parameters $\psi_i$, a parametric model for count data is completely defined by:

- the probability mass function used to represent the distribution of the data in a given time interval

- a model which defines how the distribution's parameter function (i.e., intensity) varies over time.

## $\mlxtran$ for count data models

Example 1: Poisson model with time-varying intensity

$$\begin{array}{rcl} \psi_i &=& (\alpha_i,\beta_i) \\[0.3cm] \lambda(t,\psi_i) &=& \alpha_i + \beta_i\,t \\[0.3cm] \prob{y_{ij}=k} &=& \displaystyle{ \frac{\lambda(t_{ij} , \psi_i)^k}{k!} } e^{-\lambda(t_{ij} , \psi_i)} \end{array}$$

MLXTran

    INPUT:
    input = {alpha, beta}

    EQUATION:
    lambda = alpha + beta*t

    DEFINITION:
    y ~ poisson(lambda)
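
To get a feel for the data such a model generates, counts can be simulated outside of $\mlxtran$. The following Python sketch is not part of the $\mlxtran$ toolchain: the function names, the parameter values and the choice of Knuth's multiplication method as the Poisson sampler are assumptions made for illustration.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Suitable for small to moderate lam; requires lam > 0.
    """
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_counts(alpha, beta, times, rng):
    """One count per observation time, with intensity lambda(t) = alpha + beta * t."""
    return [sample_poisson(alpha + beta * t, rng) for t in times]

rng = random.Random(12345)       # fixed seed for reproducibility
times = list(range(1, 11))       # ten observation times, t = 1, ..., 10
y = simulate_counts(alpha=2.0, beta=0.5, times=times, rng=rng)
print(y)
```

Each simulated $y_{ij}$ is an independent draw, which matches the independence assumption made in this example.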

Example 2: generalized Poisson model

$$\begin{array}{rcl} \psi_i &=& (\lambda_i,\delta_i) \\[0.3cm] \log\left( \prob{y_{ij}=k} \right) &=& \log(\lambda_i) + (k-1)\log(\lambda_i+k\delta_i) \\ && -\lambda_i-k\delta_i - \log(k!) \end{array}$$

MLXTran

    INPUT:
    parameter = {lambda, delta}

    DEFINITION:
    Y = {
      type = count,
      log(P(Y=k)) = log(lambda) + (k-1)*log(lambda+k*delta) - lambda - k*delta - factln(k)
    }
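
The same log-probability can be written in Python to compute the log-likelihood of a series of counts. This is only a sketch: the function names and the example counts are made up, and `math.lgamma(k + 1)` is used as a stand-in for `factln(k)`.

```python
import math

def gp_log_pmf(k, lam, delta):
    """log P(Y = k) for the generalized Poisson distribution (lam > 0, 0 <= delta < 1)."""
    return (math.log(lam) + (k - 1) * math.log(lam + k * delta)
            - lam - k * delta - math.lgamma(k + 1))

def gp_log_likelihood(counts, lam, delta):
    """Log-likelihood of independent counts sharing the same (lam, delta)."""
    return sum(gp_log_pmf(k, lam, delta) for k in counts)

counts = [0, 2, 1, 4, 0, 3]                        # made-up observations
ll_gp = gp_log_likelihood(counts, 1.5, 0.2)
ll_poisson = gp_log_likelihood(counts, 1.5, 0.0)   # delta = 0 reduces to the Poisson model
print(ll_gp, ll_poisson)
```

Summing log-probabilities rather than multiplying probabilities keeps the computation numerically stable for long series.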
