I am learning survival analysis from this post on UCLA IDRE and got tripped up at section 1.2.1. The tutorial says:

... if the survival times were known to be exponentially distributed, then the probability of observing a survival time ...

Why are survival times assumed to be exponentially distributed? It seems very unnatural to me.

Why not normally distributed? Suppose we are investigating some creature's life span under certain conditions, measured in days. Shouldn't it be centered around some number, with some variance (say, 100 days with a variance of 3 days)?

If we want time to be strictly positive, why not use a normal distribution with a higher mean?

Heuristically, I cannot think of the normal distribution as an intuitive way to model failure time. It has never cropped up in any of my applied work; survival times are always skewed very far right. I think normal distributions heuristically come about as a matter of averages, whereas survival times heuristically come about as a matter of extrema, such as the effect of a constant hazard applied to a sequence of parallel or series components. – AdamO 9 hours ago

@AdamO Coming from averages vs. extrema! This helps a lot! – hxd1011 9 hours ago

I agree with @AdamO about the extreme distributions inherent to survival and time to failure. As others have noted, exponential assumptions have the advantage of being tractable. The biggest problem with them is the implicit assumption of a constant rate of decay. Other functional forms are possible and come as standard options depending on the software, e.g., the generalized gamma. Goodness-of-fit tests can be employed to test differing functional forms and assumptions. The best text on survival modeling is Paul Allison's Survival Analysis Using SAS, 2nd ed. Forget SAS; it's an excellent review. – DJohnson 6 hours ago

@DJohnson Thanks for the recommendation. Yes, SAS is intimidating for me :) and I might have skipped the book if you had not mentioned it explicitly. – hxd1011 6 hours ago

I would note that the very first word in your quote is "if". – Fomite 5 hours ago

Exponential distributions are often used to model survival times because they are the simplest distributions that can be used to characterize survival / reliability data. This is because they are memoryless, and thus the hazard function is constant w/r/t time, which makes analysis very simple. This kind of assumption may be valid, for example, for some kinds of electronic components like high-quality integrated circuits. I'm sure you can think of more examples where the effect of time on hazard can safely be assumed to be negligible.
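To make the memoryless property concrete, here is a minimal simulation sketch (my own addition, not part of the original answer; the mean lifetime of 100 is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)
    t = rng.exponential(scale=100.0, size=1_000_000)  # mean lifetime 100

    # Memorylessness: P(T > s + t | T > s) = P(T > t).
    p_uncond = (t > 50).mean()            # unconditional survival past 50
    p_cond = (t[t > 100] > 150).mean()    # survival past 150, given past 100
    print(p_uncond, p_cond)               # both close to exp(-50/100) ~ 0.607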

However, you are correct to observe that this would not be an appropriate assumption to make in many cases. Normal distributions can be alright in some situations, though obviously negative survival times are meaningless. For this reason, lognormal distributions are often considered. Other common choices include the Weibull, smallest extreme value, largest extreme value, logistic, etc. A sensible choice of model would be informed by subject-area experience and probability plotting. You can also, of course, consider non-parametric modeling.
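As a rough sketch of what comparing candidate models can look like in practice (my example, with made-up, fully observed failure times; real survival data usually involves censoring, which the plain maximum-likelihood fits below ignore):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    times = rng.weibull(1.5, size=500) * 100  # hypothetical failure times

    # Compare candidate models by log-likelihood at the fitted parameters.
    for name, dist in [("exponential", stats.expon),
                       ("Weibull", stats.weibull_min),
                       ("lognormal", stats.lognorm)]:
        params = dist.fit(times, floc=0)        # keep location fixed at 0
        loglik = dist.logpdf(times, *params).sum()
        print(f"{name:12s} log-likelihood = {loglik:.1f}")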

A good reference for classical parametric modeling in survival analysis is: William Q. Meeker and Luis A. Escobar (1998). Statistical Methods for Reliability Data, Wiley

share|improve this answer
1  
+1 for mention different other options and nice reference! – hxd1011 8 hours ago

To add a bit of mathematical intuition behind how exponents pop up in survival distributions:

The probability density of a survival variable is $f(t) = h(t)S(t)$, where $h(t)$ is the current hazard (the risk for a person to "die" on day $t$) and $S(t)$ is the probability that the person survived until $t$. $S(t)$ can be expanded as the probability that the person survived day 1, and day 2, and so on up to day $t$. With a constant hazard $h(t) = \lambda$: $$P(\text{survived day } t) = 1 - \lambda$$ $$P(\text{survived days } 1, 2, \ldots, t) = (1 - \lambda)^t$$ For small $\lambda$, we can use $$e^{-\lambda} \approx 1 - \lambda$$ to approximate $S(t)$ as simply $$(1 - \lambda)^t \approx e^{-\lambda t},$$ and the probability density is then $$f(t) = h(t)S(t) = \lambda e^{-\lambda t}$$
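(A quick numerical check of the approximation above; the snippet and the hazard value 0.01 are my additions, not the answerer's.)

    import numpy as np

    lam = 0.01                      # small constant daily hazard
    t = np.arange(0, 501)
    discrete = (1 - lam) ** t       # survive t independent days
    continuous = np.exp(-lam * t)   # exponential survival function
    print(np.abs(discrete - continuous).max())  # ~0.002, worst near t = 100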

Disclaimer: this is in no way an attempt at a proper derivation of the pdf; I just figured this is a neat coincidence, and I welcome any comments on why it is correct or incorrect.

EDIT: changed the approximation per advice by @SamT, see comments for discussion.

+1 This helped me understand more about the properties of the exponential distribution. – hxd1011 8 hours ago

Could you explain your penultimate line? It says $S(t) = ...$, so the left-hand side is a function of $t$; moreover, so is the right. However, the two middle terms are functions of $\lambda$ (as is the right-hand side), but not functions of $t$. Moreover, the approximation $(1+x/n)^n \approx e^{x}$ only holds for $x = o(\sqrt{n})$. It's certainly not true that $\lim_{t \to \infty} (1-\lambda t/t)^t = e^{-\lambda t}$ -- it's not even approximately true for large $t$. I guess this is just a notational mistake you've made, though...? – Sam T 7 hours ago

@SamT Thanks for the comment; edited. Coming from an applied background, I very much welcome any corrections, esp. on notation. Passing to the limit wrt $t$ was certainly not needed there, but I still believe the approximation holds for small $\lambda$, as are typically encountered in survival models. Or would you say there's something else that coincidentally makes this approximation hold? – juod 5 hours ago

Looks better now :) -- the issue is that while $\lambda$ may be small, it's not true that $\lambda t$ is necessarily small; as such, you can't use the approximation $$(1+x/n)^n \approx e^x$$ (directly): it's not even "you can in applied maths but can't in pure"; it just doesn't hold at all. However, we can get around this: we do have that $\lambda$ is small, so we can get there directly, writing $$e^{-\lambda t} = \big(e^{-\lambda}\big)^t \approx \big(1-\lambda\big)^t.$$ Of course, $\lambda = \lambda t / t$, so we can then deduce that $$e^{-\lambda t} \approx \big(1 - \lambda t / t\big)^t.$$ – Sam T 5 hours ago

Being applied, you may feel this is slightly picky, but the point is that the reasoning wasn't valid; similar invalid steps may not happen to be true. Of course, as someone applied, you may be happy to make this step, find it holds in the majority of cases, and not worry about the specifics! As someone who does pure maths, this is out of the question for me, but I understand that we need both pure and applied! (And particularly in stats it's good not to get bogged down in pure technicalities.) – Sam T 4 hours ago

You'll almost certainly want to look at reliability engineering and predictions for thorough analyses of survival times. Within that, there are a few distributions which get used often:

The Weibull (or "bathtub") distribution is the most complex. It accounts for three types of failure modes, which dominate at different ages: infant mortality (where defective parts break early on), induced failures (where parts break randomly throughout the life of the system), and wear-out (where parts break down from use). Taken together, these give a hazard curve that looks like "\__/". For some electronics especially, you might hear about "burn-in" times, which means those parts have already been operated through the "\" part of the curve, and early failures have (ideally) been screened out. Unfortunately, Weibull analysis breaks down fast if your parts aren't homogeneous (including their use environment!) or if you are using them at different time scales (e.g., if some parts go directly into use while other parts go into storage first, the "random failure" rate is going to be significantly different, due to blending two measurements of time: operating hours vs. use hours).
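To get a feel for the three regimes, here is a small sketch (my own illustration, not from the answer) of the Weibull hazard $h(t) = (k/s)(t/s)^{k-1}$; note that a single Weibull gives only one regime at a time, and the full "\__/" curve is really a combination of them:

    import numpy as np

    def weibull_hazard(t, shape, scale=1.0):
        # h(t) = (k/s) * (t/s)**(k - 1)
        return (shape / scale) * (t / scale) ** (shape - 1)

    t = np.linspace(0.1, 3.0, 6)
    for shape, regime in [(0.5, "infant mortality (decreasing hazard)"),
                          (1.0, "random failures (constant hazard)"),
                          (3.0, "wear-out (increasing hazard)")]:
        print(regime, np.round(weibull_hazard(t, shape), 2))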

Normal distributions are almost always wrong. Every normal distribution puts probability on negative values; no reliability distribution does. They can sometimes be a useful approximation, but in the cases where that's true you're almost always looking at a lognormal anyway, so you may as well just use the right distribution. Lognormal distributions are correctly used when you have some sort of wear-out and negligible random failures, and in no other circumstances! Like the normal distribution, they're flexible enough that you can force them to fit most data; you need to resist that urge and check that the circumstances make sense.

Finally, the exponential distribution is the real workhorse. You often don't know how old parts are (for example, when parts aren't serialized and entered service at different times), so any memory-based distribution is out. Additionally, many parts have a wear-out time so arbitrarily long that it's either completely dominated by induced failures or outside the useful time frame of the analysis. So while it may not be as perfect a model as other distributions, it just doesn't care about the things that trip them up. If you have an MTTF (population time / failure count), you have an exponential distribution. On top of that, you don't need any physical understanding of your system: you can do exponential estimates just based on observed part MTTFs (assuming a large enough sample), and they come out pretty dang close. It's also resilient to causes: if every other month someone gets bored and plays croquet with some part until it breaks, the exponential accounts for that (it rolls into the MTTF). The exponential is also simple enough that you can do back-of-the-envelope calculations for the availability of redundant systems and such, which significantly increases its usefulness.
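As a back-of-the-envelope sketch of those last two points (all numbers invented for illustration):

    import math

    total_hours = 1_200_000.0        # pooled operating hours across all units
    failures = 15
    mttf = total_hours / failures    # 80,000 h
    lam = 1.0 / mttf                 # constant hazard rate

    # Probability one unit survives a 10,000-hour mission:
    r = math.exp(-lam * 10_000)
    print(r)                         # ~0.88

    # Mission reliability with two redundant units (independent failures):
    print(1 - (1 - r) ** 2)          # ~0.986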

This is a good answer, but note that the Weibull distribution is not "the most complex" parametric distribution for survival models. I'm not sure there could be such a thing, but certainly relative to the Weibull there are the generalized gamma distribution and the generalized F distribution, both of which can take the Weibull as a special case by setting parameters to 0. – gung 5 hours ago

Some ecology might help answer the "Why" behind this question.

The exponential distribution is used for modeling survival because of the life strategies of organisms living in nature. There are essentially two extremes with regard to survival strategy, with some room for a middle ground.

Here's an image that illustrates what I mean (courtesy of Khan Academy):

https://www.khanacademy.org/science/biology/ecology/population-ecology/a/life-tables-survivorship-age-sex-structure

This graph plots surviving individuals on the Y axis and "percentage of maximum life expectancy" (i.e., an approximation of the individual's age) on the X axis.

Type I is exemplified by humans and models organisms that take extreme care of their offspring, ensuring very low infant mortality. Often these species have very few offspring, because each one takes a large amount of the parents' time and effort. The majority of what kills Type I organisms is the kind of complication that arises in old age. The strategy here is high investment for a high payoff in long, productive lives, albeit at the cost of sheer numbers.

Conversely, Type III is exemplified by trees (but could also be plankton, corals, spawning fish, many types of insects, etc.), where the parent invests relatively little in each offspring but produces a ton of them in the hope that a few will survive. The strategy here is "spray and pray": while most offspring are destroyed relatively quickly by predators taking advantage of easy pickings, the few that survive long enough to grow become increasingly difficult to kill, eventually becoming (practically) impossible to eat. All the while, these individuals produce huge numbers of offspring in the hope that a few will likewise survive to their own age.

Type II is a middling strategy with moderate parental investment for moderate survivability at all ages.

I had an ecology professor who put it this way:

"Type III (trees) is the 'Curve of Hope', because the longer an individual survives, the more likely it becomes that it will continue to survive. Meanwhile Type I (humans) is the 'Curve of Despair', because the longer you live, the more likely it becomes that you will die."

This is interesting, but note that for humans, before modern medicine (and still in some places in the world today), infant mortality is very high. Baseline human survival is often modeled with a "bathtub hazard". – gung 5 hours ago

@gung Absolutely; this is a broad generalization, and there are variations among humans of different regions and time periods. The main difference is clearer when you compare extremes, i.e. Western human families (~2.5 children per pair, most of whom don't die in infancy) vs. corals or spawning fish (millions of eggs released per mating cycle, most of which die from being eaten, starvation, hazardous water chemistry, or simply failing to drift to a habitable destination). – CaffeineConnoisseur 5 hours ago

While I'm all for explanations from ecology, I'll note that assumptions like this are also made for things like hard drives and aircraft engines. – Fomite 5 hours ago

To answer your explicit question, you cannot use the normal distribution for survival because the normal distribution goes to negative infinity, and survival is strictly non-negative. Moreover, I don't think it's true that "survival times are assumed to be exponentially distributed" by anyone in reality.

When survival times are modeled parametrically (i.e., when any named distribution is invoked), the Weibull distribution is the typical starting place. Note that the Weibull has two parameters, shape and scale, and that when shape = 1, the Weibull simplifies to the exponential distribution. A way of thinking about this is that the exponential distribution is the simplest possible parametric distribution for survival times, which is why it is often discussed first when survival analysis is being taught. (By analogy, consider that we often begin teaching hypothesis testing by going over the one-sample $z$-test, where we pretend to know the population SD a priori, and then work up to the $t$-test.)

The exponential distribution assumes that the hazard is always exactly the same, no matter how long a unit has survived (consider the figure in @CaffeineConnoisseur's answer). In contrast, when the shape is $>1$ in the Weibull distribution, it implies that hazards increase the longer you survive (like the 'human curve'); and when it is $<1$, it implies hazards decrease (the 'tree').
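A one-line check (my addition) that the Weibull with shape = 1 really is the exponential; the scale of 2 is arbitrary:

    import numpy as np
    from scipy import stats

    t = np.linspace(0.1, 5.0, 50)
    weib = stats.weibull_min.pdf(t, c=1.0, scale=2.0)  # shape c = 1
    expo = stats.expon.pdf(t, scale=2.0)               # same scale
    print(np.allclose(weib, expo))                     # True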

Most commonly, survival distributions are complex and not well fit by any named distribution. People typically don't even bother trying to figure out what distribution it might be. That's what makes the Cox proportional hazards model so popular: it is semi-parametric in that the baseline hazard can be left completely unspecified but the rest of the model can be parametric in terms of its relationship to the unspecified baseline.
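For concreteness, here is a hedged sketch of fitting a Cox model with the third-party lifelines package; the package choice and the toy data are my assumptions, not the answer's:

    import pandas as pd
    from lifelines import CoxPHFitter  # third-party: pip install lifelines

    # Made-up data: duration T, event indicator E (1 = died, 0 = censored).
    df = pd.DataFrame({
        "T":   [5, 8, 12, 3, 9, 14, 7, 11],
        "E":   [1, 1, 0, 1, 1, 0, 1, 1],
        "age": [60, 65, 50, 70, 55, 45, 68, 62],
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="T", event_col="E")
    cph.print_summary()  # hazard ratio for age; baseline hazard unspecified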

"Moreover, I don't think it's true that 'survival times are assumed to be exponentially distributed' by anyone in reality." I've actually found it to be quite common in epidemiology, usually implicitly. – Fomite 1 hour ago

tl;dr: An exponential distribution is equivalent to assuming that individuals are as likely to die at any given moment as at any other.

Derivation

  1. Assume that a living individual is as likely to die at any given moment as at any other.

  2. So, the death rate $-\frac{\text{d}P}{\text{d}t}$ is proportional to the population, $P$.

$$-\frac{\text{d}P}{\text{d}t}{\space}{\propto}{\space}P$$

  3. Introducing a proportionality constant $\lambda > 0$ and solving (e.g., on WolframAlpha) shows:

$$P\left(t\right)={c_1}{e^{-{\lambda}t}}$$

So, the population follows an exponential distribution.
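(A symbolic check of this solution, my addition, using the sympy package with the rate constant $\lambda$ made explicit:)

    import sympy as sp

    t, lam = sp.symbols("t lambda", positive=True)
    P = sp.Function("P")
    # -dP/dt = lam * P
    sol = sp.dsolve(sp.Eq(-P(t).diff(t), lam * P(t)), P(t))
    print(sol)  # P(t) = C1*exp(-lambda*t)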

Math note

The above math is a reduction of a first-order ordinary differential equation (ODE). Normally, we would also solve for $c_1$ by noting the boundary condition that the population starts at some given value, $P\left(t_0\right)$, at start time $t_0$.

Then the equation becomes: $$P\left(t\right)={e^{-{\lambda}\left(t-t_0\right)}}P\left({t_0}\right).$$

Reality check

The exponential distribution assumes that people in the population tend to die at the same rate over time. In reality, death rates will tend to vary for finite populations.

Coming up with better distributions involves stochastic differential equations. Then, we can't say that there's a constant death likelihood; rather, we have to come up with a distribution for each individual's odds of dying at any given moment, then combine those various possibility trees together for the entire population, then solve that differential equation over time.

I can't recall having seen this done anywhere online, so you probably won't run into it; but that's the next modeling step if you want to improve upon the exponential distribution.
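As a toy version of that first step (assumptions mine, not the answerer's): give every individual the same small per-day death probability, and the lifetimes come out geometric, which is the discrete analogue of the exponential:

    import numpy as np

    rng = np.random.default_rng(2)
    n, p_daily = 100_000, 0.01      # population size, daily death probability

    # Geometric lifetimes: days until death with constant daily hazard.
    lifetimes = rng.geometric(p_daily, size=n)
    print(lifetimes.mean())         # ~1/p_daily = 100 days
    print(np.median(lifetimes))     # ~ln(2)/p_daily ~ 69 days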

