Models for time-to-event data

Here, observations are the "times at which events occur". An event may be one-off (e.g., death, hardware failure) or repeated (e.g., epileptic seizures, metro strikes).


Single event

To begin with, we will consider a one-off event. Depending on the application, the length of time to this event may be called the survival time (until death), failure time (until hardware fails), etc. To be general, we can just say event time.

The random variable representing the event time for subject $i$ is typically written $T_i$. Several situations are then possible to define the observations:

    • The event time is exactly observed.
    Then, the observation for individual $i$ is $y_i = t_i$, where $t_i$ is a realization of the random variable $T_i$.

    • We may know the event has happened in an interval $I_i$ but not know the exact time $t_i$. This is interval censoring. For example, at a routine check-up, cancer recurrence may be detected, and we only know that it has occurred at some point in time since the last check-up.
    The observation for individual $i$ is the event: $y_i = $ "$a_i < t_i \leq b_i$".

    • If we assume that the trial ends at time $\tstop$, then the event may happen after the end of the trial period. This is right censoring.
    There are several variations of this for defining what the observations are:

    • If events (before $\tstop$) are exactly observed, then for $i=1,2,\ldots, N$,

    \( y_i = \left\{ \begin{array}{ll} t_i & {\rm if \quad} t_i \leq \tstop \\ \text{"}t_i > \tstop\text{"} & {\rm otherwise.} \end{array} \right. \)

Assume that a trial starts at $\tstart=0$ and ends at $\tstop=5$, and that we obtain the following observations from 4 individuals:

$y_1 = 3.2$

$y_2=$ "$t_2>5$"

$y_3= 2.7$

$y_4 =$ "$t_4>5$"

These observations can be stored in a data file as shown in the table below.

Here, "event=0" at time $t$ means that the event happened after $t$ while "event=1" means that the event happened at time $t$.

The lines with $t=0$ are used to state the trial start time $\tstart=0$.

ID  TIME  EVENT
1   0     0
1   3.2   1
2   0     0
2   5     0
3   0     0
3   2.7   1
4   0     0
4   5     0

    • If events before $\tstop$ are interval censored, then for $i=1,2,\ldots, N$,

    \( y_i = \left\{ \begin{array}{ll} \text{"}a_i < t_i \leq b_i\text{"} & {\rm if \quad} t_i\leq \tstop \\ \text{"}t_i > \tstop\text{"} & {\rm otherwise.} \end{array} \right. \)

Assume that we have censoring intervals of length 1. For the same four individuals as in the previous example, we now have the following observations:

$y_1=$ "$3 < t_1 \leq 4$",

$y_2=$ "$t_2>5$",

$y_3=$ "$2< t_3 \leq 3$",

$y_4=$ "$t_4>5$".

These observations can be stored in a data file as shown in the table below.

Here "event=0" at time $t$ means that the event happened after $t$ while "event=1" at time $t$ means that the event happened at or before $t$, i.e., between the previous recorded time and $t$.

ID  TIME  EVENT
1   0     0
1   3     0
1   4     1
2   0     0
2   5     0
3   0     0
3   2     0
3   3     1
4   0     0
4   5     0
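
The two data layouts above can be produced mechanically from the observations. The following is a minimal Python sketch of that encoding; the function name `encode_single_event` and the tuple-based description of each observation are assumptions made for illustration, not part of any particular software format.

```python
def encode_single_event(observations, t_start=0):
    """Turn per-subject observations into (id, time, event) rows.

    observations maps a subject id to one of (hypothetical encoding):
      ("exact", t)        event observed exactly at time t
      ("interval", a, b)  event known to lie in (a, b]
      ("right", t_stop)   no event observed up to t_stop
    """
    rows = []
    for subject, record in observations.items():
        rows.append((subject, t_start, 0))        # row stating the trial start
        kind = record[0]
        if kind == "exact":
            rows.append((subject, record[1], 1))  # event at this time
        elif kind == "interval":
            rows.append((subject, record[1], 0))  # still event-free at a
            rows.append((subject, record[2], 1))  # event happened by b
        elif kind == "right":
            rows.append((subject, record[1], 0))  # event-free at t_stop
        else:
            raise ValueError(f"unknown observation type: {kind}")
    return rows


# The four individuals of the two examples above
exact = {1: ("exact", 3.2), 2: ("right", 5), 3: ("exact", 2.7), 4: ("right", 5)}
censored = {1: ("interval", 3, 4), 2: ("right", 5),
            3: ("interval", 2, 3), 4: ("right", 5)}

for row in encode_single_event(censored):
    print(*row)
```

The same row layout serves both cases: only the meaning of "event=1" changes (event at $t$ versus event at or before $t$).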

Probability distributions

Several functions play key roles in time-to-event analysis: the survival function, the hazard function and the cumulative hazard function. Since we are still working under a population approach, these functions, detailed below, are individual functions, i.e., each subject has its own. As we are using parametric models, they depend on individual parameters $(\psi_i)$.

    • The survival function $S(t; \psi_i)$ gives the probability that the event happens to individual $i$ after time $t>t_{start}$:

    \( S(t; \psi_i) \ \ \eqdef \ \ \prob{T_i>t ; \psi_i} . \)

    • The hazard function $\hazard(t;\psi_i)$ is defined for individual $i$ as the instantaneous rate of the event at time $t$, given that the event has not already occurred:

    \( \hazard(t;\psi_i) \ \ \eqdef \ \ \lim_{dt\to 0} \displaystyle{\frac{S(t;\psi_i) - S(t + dt;\psi_i)}{ S(t;\psi_i) \, dt} }. \)

    This is equivalent to:
    \( \hazard(t;\psi_i) \ \ = \ \ -\displaystyle{ \frac{d}{dt} } \log{S(t;\psi_i)}. \)

    • Another useful quantity is the cumulative hazard function $\cumhaz(a,b;\psi_i)$, defined for individual $i$ as:

    \( \cumhaz(a,b;\psi_i) \ \ \eqdef \ \ \displaystyle{\int_a^b \hazard(t;\psi_i) \, dt }. \)

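    Integrating the relation $\hazard(t;\psi_i) = -\frac{d}{dt} \log{S(t;\psi_i)}$ from $t_{start}$ to $t$, with $S(t_{start};\psi_i)=1$, gives: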

    \( S(t;\psi_i) \ \ = \ \ e^{-\cumhaz(t_{start},t;\psi_i)}. \)
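
As a numerical illustration of the link between hazard, cumulative hazard and survival, here is a minimal Python sketch. The Weibull-type hazard and its parameter values (`k`, `lam`) are assumptions chosen only because the survival function is then known in closed form, which allows a check of the numerical integration.

```python
import math


def cumulative_hazard(hazard, a, b, n=10_000):
    """Approximate the integral of hazard over [a, b] (trapezoidal rule)."""
    step = (b - a) / n
    total = 0.5 * (hazard(a) + hazard(b))
    for j in range(1, n):
        total += hazard(a + j * step)
    return total * step


def survival(hazard, t, t_start=0.0):
    """S(t) = exp(-cumulative hazard between t_start and t)."""
    return math.exp(-cumulative_hazard(hazard, t_start, t))


# Hypothetical Weibull-type hazard: h(t) = (k / lam) * (t / lam)**(k - 1)
k, lam = 1.5, 4.0


def weibull_hazard(t):
    return (k / lam) * (t / lam) ** (k - 1)


t = 3.0
numeric = survival(weibull_hazard, t)
closed_form = math.exp(-((t / lam) ** k))  # known Weibull survival function
print(numeric, closed_form)
```

The two printed values should agree up to the integration error, which illustrates that specifying the hazard is enough to recover the survival function.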

The relation $S(t;\psi_i) = e^{-\cumhaz(t_{start},t;\psi_i)}$ shows that the hazard function $\hazard(t;\psi_i)$ characterizes the problem, because knowing it is the same as knowing the survival function $S(t;\psi_i)$. The probability distribution of survival data is therefore completely defined by the hazard function. Let $\qcyipsii$ be the conditional distribution of the observation $y_i$ given the vector of individual parameters $\psi_i$. Its pdf can be easily computed for the various censoring situations discussed above:

  1. If the event is exactly observed with $y_i=t_i$, the density is the derivative of the cumulative distribution function, i.e., the derivative of $1 - S(t_i;\psi_i)$:
     \( \begin{eqnarray}\pcyipsii(y_i | \psi_i) &=& \frac{d}{dt_i}\left(1 - e^{-\cumhaz(t_{start},t_i;\psi_i)}\right)\\ &=& \left(\frac{d}{dt_i} \int_{t_{start} }^{t_i} \hazard(u;\psi_i) \, du \right) e^{-\cumhaz(t_{start},t_i;\psi_i)}\\ &=&\hazard(t_i;\psi_i)e^{-\cumhaz(t_{start},t_i;\psi_i)} . \end{eqnarray}\)

  2. If the event is interval-censored with $y_i=\,$ "$a_i<t_i\leq b_i$":
     \(\begin{eqnarray} \pcyipsii(y_i | \psi_i) &=& \prob{T_i \in (a_i,b_i]\,| \,\psi_i} \\ &=& \prob{T_i \leq b_i | \psi_i} - \prob{T_i \leq a_i | \psi_i} \\ &=& (1-S( b_i ; \psi_i)) - (1-S( a_i ; \psi_i)) \\ &=& e^{-\cumhaz(t_{start},a_i;\psi_i)} - e^{-\cumhaz(t_{start},b_i;\psi_i)} . \end{eqnarray}\)

  3. If the event is right-censored with $y_i= \,$ "$t_i>t_{stop}$":
     \(\begin{eqnarray} \pcyipsii(y_i | \psi_i) &=& \prob{T_i > t_{stop} | \psi_i} \\ &=& S( t_{stop} ; \psi_i) \\ &=& e^{-\cumhaz(t_{start},t_{stop};\psi_i)} . \end{eqnarray}\)
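
The three cases can be sketched as a single log-likelihood function in Python. This is only an illustration: the tuple encoding of the observations, the helper `constant_model`, and the choice of a constant hazard $h = 0.2$ are assumptions, and the data are those of the four individuals used in the examples above.

```python
import math


def log_pdf(obs, hazard, cumhaz, t_start=0.0):
    """Log of the conditional pdf of one observation.

    obs is ("exact", t), ("interval", a, b) or ("right", t_stop);
    hazard(t) and cumhaz(a, b) describe the individual model.
    """
    kind = obs[0]
    if kind == "exact":
        t = obs[1]
        return math.log(hazard(t)) - cumhaz(t_start, t)
    if kind == "interval":
        a, b = obs[1], obs[2]
        return math.log(math.exp(-cumhaz(t_start, a))
                        - math.exp(-cumhaz(t_start, b)))
    if kind == "right":
        return -cumhaz(t_start, obs[1])
    raise ValueError(f"unknown observation type: {kind}")


def constant_model(h):
    """Hazard and cumulative hazard of a constant hazard h (illustrative)."""
    return (lambda t: h), (lambda a, b: h * (b - a))


hazard, cumhaz = constant_model(0.2)

exact_obs = [("exact", 3.2), ("right", 5.0), ("exact", 2.7), ("right", 5.0)]
censored_obs = [("interval", 3.0, 4.0), ("right", 5.0),
                ("interval", 2.0, 3.0), ("right", 5.0)]

loglik_exact = sum(log_pdf(o, hazard, cumhaz) for o in exact_obs)
loglik_censored = sum(log_pdf(o, hazard, cumhaz) for o in censored_obs)
print(loglik_exact, loglik_censored)
```

With a constant hazard, the first log-likelihood reduces to $2\log h - 15.9\,h$: two observed events and a total follow-up time of $3.2 + 5 + 2.7 + 5 = 15.9$.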

Repeated events

Sometimes, an event can potentially happen again and again, e.g., epileptic seizures, heart attacks, etc. For any given hazard function $\hazard$, the survival function $S$ for individual $i$ now represents survival since the previous event at $t_{i,j-1}$, written here in terms of the cumulative hazard from $t_{i,j-1}$ to $t_{i,j}$:

\(\begin{eqnarray} S(t_{i,j} | t_{i,j-1};\psi_i) &=& \prob{T_{i,j} > t_{i,j}\, | \,T_{i,j-1} = t_{i,j-1};\psi_i} \\ &=& e^{-\cumhaz(t_{i,j-1},t_{i,j};\psi_i)} \\ &=& \exp\left({-\int_{t_{i,j-1} }^{t_{i,j} } \hazard(t;\psi_i) \, dt}\right) . \end{eqnarray}\)
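
A short Python sketch of this conditional survival, assuming a Weibull-type cumulative hazard with hypothetical parameters `k` and `lam` (any other hazard could be used instead). It also checks that the survival since the previous event equals the ratio of the survival functions measured from the start.

```python
import math


def cumhaz(a, b, k=1.5, lam=4.0):
    """Assumed Weibull-type cumulative hazard between a and b."""
    return (b / lam) ** k - (a / lam) ** k


def survival_since_previous(t_prev, t):
    """Probability of no new event in (t_prev, t] given an event at t_prev."""
    return math.exp(-cumhaz(t_prev, t))


def survival_from_start(t, t_start=0.0):
    """Survival function measured from the trial start."""
    return math.exp(-cumhaz(t_start, t))


t_prev, t = 3.5, 4.4
conditional = survival_since_previous(t_prev, t)
ratio = survival_from_start(t) / survival_from_start(t_prev)
print(conditional, ratio)
```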

Censoring and probability distributions

Taking into account censoring for repeated events is slightly more complicated than for one-off events. First, let us assume that a trial starts at time $t_{start}$ and ends at time $t_{stop}$. Let $(T_{i,1}, T_{i,2}, \ldots )$ be random event times after $t_{start}$. Then, we can distinguish between the following two situations:

    1. Exactly observed events: A sequence of $n_i$ event times is precisely observed before $\tstop$, i.e., $y_i = (t_{i,1},t_{i,2},\ldots,t_{i,n_i}, \text{"}t_{i,n_i+1}>\tstop\text{"})$.
    The conditional pdf of $y_i$ is given by:
    \( \pcyipsii(y_i | \psi_i) = \left(\prod_{j=1}^{n_i}\hazard(t_{i,j};\psi_i)e^{-\cumhaz(t_{i,j-1},t_{i,j};\psi_i)} \right)e^{-\cumhaz(t_{i,n_i},\tstop;\psi_i)} , \)
    where $t_{i,0}=\tstart$.


Suppose that for individual $i=1$ we know there were 8 events but only 7 of them occurred before $\tstop$. Here is a graphic showing the events that were exactly observed:


These data are stored in the table below. The 8th and final row is noted "event = 0" with time $\tstop = 18$, indicating that the 8th event had not occurred by the end of the observation period. Each row contributes one factor to the conditional pdf of $y_1=(y_{1,1}, \ldots, y_{1,8})$ given above: an event observed at $t_{1,j}$ contributes $\hazard(t_{1,j};\psi_1)e^{-\cumhaz(t_{1,j-1},t_{1,j};\psi_1)}$, and the censored final row contributes $e^{-\cumhaz(15.8,18;\psi_1)}$. The pdf of $y_1$ is the product of these contributions.

ID  TIME  EVENT
1   0     0
1   1.4   1
1   3.5   1
1   4.4   1
1   5.6   1
1   9.7   1
1   11.4  1
1   15.8  1
1   18    0
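
The product of contributions for exactly observed repeated events can be sketched in Python as follows. The function name `log_pdf_exact` and the constant hazard used to evaluate it are assumptions made for illustration; the rows are those of the example for individual 1.

```python
import math


def log_pdf_exact(rows, hazard, cumhaz):
    """Log conditional pdf for exactly observed repeated events.

    rows are (id, time, event) tuples for one individual; the first row
    gives the start time, and event=0 on the last row marks right censoring.
    """
    total = 0.0
    previous = rows[0][1]
    for _, time, event in rows[1:]:
        total -= cumhaz(previous, time)      # no event in (previous, time)
        if event == 1:
            total += math.log(hazard(time))  # event exactly at this time
        previous = time
    return total


rows = [(1, 0.0, 0), (1, 1.4, 1), (1, 3.5, 1), (1, 4.4, 1), (1, 5.6, 1),
        (1, 9.7, 1), (1, 11.4, 1), (1, 15.8, 1), (1, 18.0, 0)]


def loglik_constant(h):
    """Log-likelihood under an assumed constant hazard h."""
    return log_pdf_exact(rows, lambda t: h, lambda a, b: h * (b - a))


print(loglik_constant(7 / 18))
```

Under this constant-hazard assumption the log-likelihood is $7\log h - 18\,h$, which is maximized at $h = 7/18$, i.e., the number of observed events divided by the length of the observation period.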

    2. Interval-censored events: Let $(b_{0}, b_1], (b_{1}, b_2], \ldots , (b_{K-1}, b_K]$ be a sequence of successive intervals with $\tstart=b_0<b_1<b_2 < \ldots <b_K = \tstop$. We do not know the exact event times, but a sequence $(m_{ik}; \, 1 \leq k \leq K)$ is observed, where $m_{ik}$ is the number of events that occurred for individual $i$ in interval $(b_{k-1}, b_k]$.
    We can show that the conditional pdf of $y_i$ is given by:
    \( \pcyipsii(y_i | \psi_i) = \prod_{k=1}^{K} e^{-\cumhaz(b_{k-1}, b_k;\psi_i)} \displaystyle{\frac{\cumhaz^{m_{ik} }(b_{k-1}, b_k;\psi_i)}{m_{ik}!} } . \)
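    In other words, for individual $i$, the number of events in interval $(b_{k-1}, b_k]$ follows a Poisson distribution with parameter $\cumhaz(b_{k-1}, b_k;\psi_i)$, independently across intervals; that is, the events follow a (possibly non-homogeneous) Poisson process with intensity $\hazard(t;\psi_i)$.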


Here is a graphic that shows an example of the interval boundaries and the number of events that occurred in each interval for individual $i=1$.


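The table below shows the same data, with one row per interval boundary and the number of events observed in the interval ending there. Using the formula above, we see that the conditional pdf of $y_1=(y_{1,1}, \ldots, y_{1,6})$ is the product of the six contributions listed after the table, one per interval; an interval with no events contributes only the exponential term.

ID  TIME  EVENTS
1   0     0
1   3     1
1   6     3
1   9     0
1   12    2
1   15    0
1   18    1

Interval $(0,3]$: $e^{-\cumhaz(0,3;\psi_1)}\cumhaz(0,3;\psi_1)$
Interval $(3,6]$: $e^{-\cumhaz(3,6;\psi_1)} {\cumhaz^{3}(3,6;\psi_1)}/{6}$
Interval $(6,9]$: $e^{-\cumhaz(6,9;\psi_1)}$
Interval $(9,12]$: $e^{-\cumhaz(9,12;\psi_1)} {\cumhaz^{2}(9,12;\psi_1)}/{2}$
Interval $(12,15]$: $e^{-\cumhaz(12,15;\psi_1)}$
Interval $(15,18]$: $e^{-\cumhaz(15,18;\psi_1)}\cumhaz(15,18;\psi_1)$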
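
The Poisson form of the likelihood for interval-censored repeated events can be evaluated with a few lines of Python. This is a sketch under assumptions: the function name `log_pdf_counts` is hypothetical, and a constant hazard is used only to obtain a closed form to compare against.

```python
import math


def log_pdf_counts(rows, cumhaz):
    """Log conditional pdf of the event counts per interval.

    rows are (id, boundary, count) tuples for one individual; the first row
    gives the start time, and each later row gives the number of events in
    the interval ending at that boundary.
    """
    total = 0.0
    previous = rows[0][1]
    for _, bound, count in rows[1:]:
        lam = cumhaz(previous, bound)  # Poisson parameter of the interval
        total += -lam + count * math.log(lam) - math.lgamma(count + 1)
        previous = bound
    return total


rows = [(1, 0, 0), (1, 3, 1), (1, 6, 3), (1, 9, 0),
        (1, 12, 2), (1, 15, 0), (1, 18, 1)]


def loglik_constant(h):
    """Log-likelihood under an assumed constant hazard h."""
    return log_pdf_counts(rows, lambda a, b: h * (b - a))


print(loglik_constant(7 / 18))
```

Here `math.lgamma(count + 1)` is the logarithm of `count!`. With a constant hazard, every interval has parameter $3h$ and the log-likelihood is $-18\,h + 7\log(3h) - \log(3!\,2!)$, again maximized at $h = 7/18$.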


If the total number $n_i$ of (observed and unobserved) events for individual $i$ is known to be finite, then the formula for interval-censored events above is slightly modified when the last event occurs before $\tstop$ ($t_{i,n_i}<\tstop$).

Assume that the last event for individual $i$ occurs in the $K_i$-th interval. Let $s_{i} = \sum_{k=1}^{K_i-1} m_{ik}$ be the number of events that occurred before this interval. Then, we can show that

\( \pcyipsii(y_i | \psi_i) = \prod_{k=1}^{K_i-1} \left( \displaystyle{ \frac{\cumhaz^{m_{ik} }(b_{k-1}, b_k;\psi_i)}{m_{ik}!} }e^{-\cumhaz(b_{k-1}, b_k;\psi_i)} \right) \!\times \!\left(1 - \sum_{\ell=0}^{n_i-s_{i}-1 } \displaystyle{ \frac{\cumhaz^{\ell}(b_{K_i -1},b_{K_i};\psi_i)}{\ell!} } e^{-\cumhaz(b_{K_i -1},b_{K_i};\psi_i)}\right) . \)

The last factor is the probability that at least $n_i-s_i$ events occur in the $K_i$-th interval. For a single event ($n_i=1$, $s_i=0$), it reduces to $1-e^{-\cumhaz(b_{K_i-1},b_{K_i};\psi_i)}$, which agrees with the interval-censored case for a one-off event.

Examples of hazard functions

    • Constant hazard model:
    The simplest case is that of a constant hazard function: $\hazard(t;\psi_i) = \hazard_i$, with $\hazard_i > 0$. Here, $\psi_i=\hazard_i$.

    • Proportional hazards model:

    \( \hazard(t;\psi_i) = \hazard_0(t;\alpha_i) \, e^{ \langle \beta , c_i \rangle}. \)

    Here, the hazard is decomposed into two terms: a baseline function $\hazard_0$ of $t$, and an "individual" term, function of some individual covariates $c_i$. $ \langle \beta , c_i \rangle$ means a scalar product, i.e., a linear function of $c_i$. In a proportional hazards model, a unit increase in the value of a covariate has a multiplicative effect on the hazard.
    In the usual proportional hazard model, $\alpha_i$ is a population constant ($\alpha_i=\alpha$). Then, $\psi_i$ can be decomposed into a set of population parameters $\alpha$ and an individual parameter $ \langle \beta , c_i \rangle$. A straightforward extension consists in assuming that $\alpha_i$ is also an individual parameter.

    • Extended proportional hazards model:
    Another possible extension assumes that the individual term of the hazard depends on a regression variable $x_i$ through a (possibly nonlinear) function $u$:

    \( \hazard(t;\bpsi_i) = \hazard_0(t;\alpha_{i}) \, e^{ u(\beta_i,x_i(t))} . \)

    Consider for example that $x_i(t)$ is the plasma concentration of a drug at time $t$ for individual $i$. Then, $u(\beta_i,x_i(t))$ is the term that models the effect of the drug on the hazard, while $\hazard_0(t;\alpha_i)$ might model the effect of disease progression on the hazard.
    In this example, $x_i(t)$ is the "true" plasma concentration for subject $i$ at time $t$, and it is a continuous function of time. However, in practice it is only measured at a finite number of time points, so a longitudinal model for the plasma concentration is needed to give a concentration value for each $t$.
    Therefore, in practice we need to develop a joint model in order to simultaneously model time-to-event data and longitudinal data. Such an approach is introduced in the Joint models section.

    • Accelerated failure time (AFT) model:
    Unlike proportional hazards models, the AFT model supposes that a change in a covariate has a multiplicative effect not on the hazard but on the predicted event time. This can be written as:

    \( \log(T_i) = \langle \psi_i , c_i \rangle + \xi_i \)

    where $\xi_i$ is a zero-mean random variable, e.g., normally distributed with mean zero. Usually, parameters are fixed effects: $\psi_i=\psi$ for each subject $i$.
    To calculate the hazard function, let us first denote by $p_{\xi_i}$ the density and by $F_{\xi_i}$ the cdf of $\xi_i$, and to simplify, denote by $\mu_i = \langle \psi_i , c_i \rangle$ the mean of $\log(T_i)$. We begin by calculating the survival function:

    \(\begin{eqnarray} S(t;\psi_i) &=& \prob{\log{T_i} > \log{t} ; \bpsi_i} \\ &=& \int_{\log{t}-\mu_i}^{\infty} p_{\xi_i}(u; \psi_i) \, du \\ &=& 1 - F_{\xi_i}(\log{t}-\mu_i ; \psi_i) . \end{eqnarray}\)

    Applying $\hazard(t;\psi_i) = -\frac{d}{dt} \log{S(t;\psi_i)}$ then gives the hazard function:

    \( \hazard(t;\psi_i) = \displaystyle{ \frac{p_{\xi_i}(\log{t} - \mu_i; \psi_i)}{t(1- F_{\xi_i}(\log{t} - \mu_i; \psi_i))} }\, \)
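
The AFT hazard formula can be checked numerically. The Python sketch below assumes that $\xi_i$ is normal with mean zero and standard deviation `sigma` (a log-normal AFT model); the values of `mu`, `sigma` and `t` are hypothetical. It compares the closed-form hazard with a finite-difference approximation of $-\frac{d}{dt}\log S(t)$.

```python
import math


def normal_pdf(x, sigma):
    """Density of a centered normal variable with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, sigma):
    """Cdf of a centered normal variable with standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def aft_survival(t, mu, sigma):
    """S(t) = 1 - F(log t - mu)."""
    return 1.0 - normal_cdf(math.log(t) - mu, sigma)


def aft_hazard(t, mu, sigma):
    """h(t) = p(log t - mu) / (t * (1 - F(log t - mu)))."""
    x = math.log(t) - mu
    return normal_pdf(x, sigma) / (t * (1.0 - normal_cdf(x, sigma)))


mu, sigma, t, eps = 1.0, 0.5, 2.0, 1e-5
closed_form = aft_hazard(t, mu, sigma)
# Central finite difference of -log S(t)
numeric = -(math.log(aft_survival(t + eps, mu, sigma))
            - math.log(aft_survival(t - eps, mu, sigma))) / (2.0 * eps)
print(closed_form, numeric)
```

The factor $1/t$ in the hazard comes from differentiating $\log t$ inside the cdf.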


    For a given vector of individual parameters $\psi_i$, a model for (repeated) time-to-event data is completely defined by

    1. the hazard function $\hazard(t ; \psi_i)$, or the survival function $S(t ; \psi_i)$
    2. (possibly) the interval and/or right censoring process
    3. (possibly) the maximum number of possible events
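
To illustrate how a hazard function and a censoring process together define a model, here is a minimal Python simulation sketch for a one-off event. It assumes a proportional hazards model with a constant baseline hazard `h0`, a single binary covariate and right censoring at `t_stop`; all names and numerical values are hypothetical.

```python
import math
import random


def individual_hazard(h0, beta, covariates):
    """Proportional hazards with constant baseline: h0 * exp(<beta, c>)."""
    return h0 * math.exp(sum(b * c for b, c in zip(beta, covariates)))


def simulate_subject(subject, hazard, t_stop, rng):
    """Draw one event time and return the (id, time, event) rows."""
    u = 1.0 - rng.random()     # uniform on (0, 1], avoids log(0)
    t = -math.log(u) / hazard  # inverts S(t) = exp(-hazard * t)
    if t <= t_stop:
        return [(subject, 0, 0), (subject, t, 1)]
    return [(subject, 0, 0), (subject, t_stop, 0)]  # right-censored


rng = random.Random(0)
h0, beta, t_stop = 0.2, [0.7], 5.0
rows = []
for subject, c in enumerate([[0.0], [1.0], [0.0], [1.0]], start=1):
    rows += simulate_subject(subject, individual_hazard(h0, beta, c), t_stop, rng)

for row in rows:
    print(*row)
```

A unit increase of the covariate multiplies the hazard by $e^{\beta}$, here $e^{0.7}$, whatever the value of the baseline.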



