Introduction
Predictive time-to-event modeling in medical survival analysis usually serves one of two broad purposes. The first is prognostic: developing models of how a certain disease will progress. Such models are used to understand disease processes and to predict how new patients will fare in the context of existing data. Examples include predicting which prostate cancer patients will experience recurrence so that therapy can be initiated early (Donovan et al., 2009), or identifying which group of patients will benefit more from a certain therapy. The second purpose is factor analysis: analyzing disease processes and exploring interaction effects between disease factors. An example is determining whether a potentially significant gene remains relevant when combined with other factors in a multivariate setting (Donovan et al., 2009), in order to prioritize and identify candidate genes for targeted therapeutic drug development.
While time-to-event prediction is inherently a regression problem, it challenges computational modeling approaches because healthcare data in such settings consists of both censored and non-censored (event) observations. Healthcare data used in such prognostic modeling is usually obtained by tracking patients over the course of a well-designed study, perhaps lasting years. In contrast to traditional regression problems, the information for most observations is incomplete and only known "up to a point." Patients who experienced the endpoint of interest (cancer remission, recurrence, etc.) during their follow-up are considered non-censored, or events. Patients who did not experience the endpoint during the study or were lost to follow-up for any reason (e.g., the patient moved during a multi-year study) are considered censored. All that is known about them is that they were disease free up to a certain point; what subsequently occurred is unknown. For a censored patient described by a d-dimensional vector $x_i \in \mathbb{R}^d$, the observed time $S_i$ is called the censoring time. For such individuals, it is only known that they survived for at least time $S_i$. The actual target $T_i$ is unknown for censored cases, thus $S_i < T_i$. An important assumption is that $T_i$ and $S_i$ are independent conditional on $x_i$, i.e., the cause of censoring is independent of the survival time. With an indicator $\delta_i$ that is 0 if an event occurred and 1 if the observation is censored, the available training data for $N$ patients can be summarized as $D = \{T_i, x_i, \delta_i\}_{i=1}^{N}$ (Raykar et al., 2008).
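To make the notation concrete, the following minimal Python sketch shows how such a data set might be assembled. The `Observation` class, the `observe` helper, and the three toy patients are hypothetical and used only for illustration. The sketch also assumes that the time stored for each patient is the event time when the event is observed and the censoring time otherwise, since the true event time of a censored patient is not available.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Observation:
    x: Sequence[float]  # feature vector x_i
    time: float         # recorded time: event time if observed, else censoring time S_i
    delta: int          # 0 = event occurred, 1 = censored (convention used in the text)


def observe(x: Sequence[float], event_time: float, censor_time: float) -> Observation:
    """Return what a study would record, given latent event and censoring times."""
    if event_time <= censor_time:
        return Observation(x, event_time, 0)   # endpoint seen during follow-up
    return Observation(x, censor_time, 1)      # only known: survived at least censor_time


# Hypothetical patients: (features, latent event time, censoring time), times in years.
patients = [
    ([63.0, 1.0], 2.5, 5.0),  # event at 2.5 years, before follow-up ended
    ([71.0, 0.0], 8.0, 4.0),  # lost to follow-up at 4 years, event not seen
    ([58.0, 1.0], 3.0, 3.0),  # event exactly at the end of follow-up
]

D: List[Observation] = [observe(x, t, s) for x, t, s in patients]

for obs in D:
    status = "censored" if obs.delta == 1 else "event"
    print(f"x={list(obs.x)} time={obs.time} delta={obs.delta} ({status})")
```

In real study data the latent event time of a censored patient is never available; it appears here only so that the recording rule can be shown explicitly.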
Censored observations contribute incomplete information, as the event of interest may occur after the patient was lost to follow-up. Both simply omitting the censored observations (Burke et al., 1997; Shivaswamy, Chu, & Jansche, 2007) and treating them as non-recurring samples in a classifier (Snow, Smith, & Catalona, 1997) bias the resulting model and should be avoided.
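The direction of this bias can be illustrated with a small simulation. The sketch below is not taken from the cited studies. It assumes exponentially distributed event times, censoring times drawn uniformly over the study length and independently of the event times, and arbitrary values for the mean event time, study length, and prediction horizon. It uses the same indicator convention as above (0 = event, 1 = censored).

```python
import random

rng = random.Random(0)        # fixed seed so the run is repeatable

N = 20000                     # number of simulated patients (assumed)
MEAN_EVENT_TIME = 10.0        # mean of the latent event time, in years (assumed)
STUDY_LENGTH = 15.0           # censoring times are uniform on [0, STUDY_LENGTH] (assumed)
HORIZON = 5.0                 # "did the event occur within 5 years?" (assumed)

# Latent event times and independent censoring times.
latent = [rng.expovariate(1.0 / MEAN_EVENT_TIME) for _ in range(N)]
censor = [rng.uniform(0.0, STUDY_LENGTH) for _ in range(N)]

# What the study records: (observed time, delta).
observed = [(min(t, s), 0 if t <= s else 1) for t, s in zip(latent, censor)]

# Strategy 1: omit censored patients and average the remaining event times.
true_mean = sum(latent) / N
event_times = [time for time, delta in observed if delta == 0]
omit_mean = sum(event_times) / len(event_times)

# Strategy 2: label every censored patient as a non-event in a classifier.
true_rate = sum(t <= HORIZON for t in latent) / N
naive_rate = sum(delta == 0 and time <= HORIZON for time, delta in observed) / N

print(f"mean event time:  true={true_mean:.2f}  censored omitted={omit_mean:.2f}")
print(f"event rate by horizon:  true={true_rate:.3f}  censored as non-events={naive_rate:.3f}")
```

Under these assumptions, long event times are more likely to be censored, so dropping censored patients underestimates the mean event time. Labeling censored patients as non-events undercounts events within the horizon, because some of those patients experienced the event after they left the study.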