J. Berger

High error rates in discussions of error rates: no end in sight

27D0BB5300000578-3168627-image-a-27_1437433320306

waiting for the other shoe to drop…

“Guides for the Perplexed” in statistics become “Guides to Become Perplexed” when “error probabilities” (in relation to statistical hypotheses tests) are confused with posterior probabilities of hypotheses. Moreover, these posteriors are neither frequentist, subjectivist, nor default. Since this doublespeak is becoming more common in some circles, it seems apt to reblog a post from one year ago (you may wish to check the comments).

Do you ever find yourself holding your breath when reading an exposition of significance tests that’s going swimmingly so far? If you’re a frequentist in exile, you know what I mean. I’m sure others feel this way too. When I came across Jim Frost’s posts on The Minitab Blog, I thought I might actually have located a success story. He does a good job explaining P-values (with charts), the duality between P-values and confidence levels, and even rebuts the latest “test ban” (the “Don’t Ask, Don’t Tell” policy). Mere descriptive reports of observed differences that the editors recommend, Frost shows, are uninterpretable without a corresponding P-value or the equivalent. So far, so good. I have only small quibbles, such as the use of “likelihood” when meaning probability, and various and sundry nitpicky things. But watch how in some places significance levels are defined as the usual error probabilities —indeed in the glossary for the site—while in others it is denied they provide error probabilities. In those other places, error probabilities and error rates shift their meaning to posterior probabilities, based on priors representing the “prevalence” of true null hypotheses.

Begin with one of his kosher posts “Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics” (blue is Frost):

(1) The Significance level is the Type I error probability (3/15)

The significance level, also denoted as alpha or α, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference….

Keep in mind that there is no magic significance level that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy. The common alpha values of 0.05 and 0.01 are simply based on tradition. For a significance level of 0.05, expect to obtain sample means in the critical region 5% of the time when the null hypothesis is true. In these cases, you won’t know that the null hypothesis is true but you’ll reject it because the sample mean falls in the critical region. That’s why the significance level is also referred to as an error rate! (My emphasis.)

Note: Frost is using the term “error rate” here, which is why I use it in my title. Error probability would be preferable.

This type of error doesn’t imply that the experimenter did anything wrong or require any other unusual explanation. The graphs show that when the null hypothesis is true, it is possible to obtain these unusual sample means for no reason other than random sampling error. It’s just luck of the draw.

(2) Definition Link: Now we go to the blog’s definition link for this “type of error”

No hypothesis test is 100% certain. Because the test is based on probabilities, there is always a chance of drawing an incorrect conclusion.

Type I error

When the null hypothesis is true and you reject it, you make a type I error. The probability of making a type I error is α, which is the level of significance you set for your hypothesis test. An α of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis. (My emphasis)

  Null Hypothesis
Decision True False
Fail to reject Correct Decision (probability = 1 – α) Type II Error – fail to reject the null when it is false (probability = β)
Reject Type I Error – rejecting the null when it is true (probability = α) Correct Decision (probability = 1 – β)

 

He gives very useful graphs showing quite clearly that the probability of a Type I error comes from the sampling distribution of the statistic (in the illustrated case, it’s the distribution of sample means).

So it is odd that elsewhere Frost tells us that a significance level (attained or fixed) is not the probability of a Type I error. Note: the issue here isn’t whether the significance level is fixed or attained, the difference to which I’m calling your attention is between an ordinary frequentist error probability and a posterior probability in a null hypothesis, given it is rejected—based on a prior probability for the null vs a probability for a single alternative, which he writes as P(real). We may call it the false finding rate (FFR), and it arises in typical diagnostic screening contexts. I elsewhere take up the allegation, by some, that a significance level is an error probability but a P-value is not (Are P-values error probabilities?). A post is here. Also see note [a] below.]
Here are some examples from Frost’s posts on 4/14 & 5/14:

 (3)  In a different post Frost alleges: A significance level is not the Type I error probability

Incorrect interpretations of P values are very common. The most common mistake is to interpret a P value as the probability of making a mistake by rejecting a true null hypothesis (a Type I error).

Now “making a mistake” may be vague, (and his statement here is a bit wonky) but the parenthetical link makes it clear he intends the Type I error probability. Guess what? The link is to the exact same definition of Type I error as before: the ordinary error probability computed from the sampling distribution. Yet in the blogpost itself, the Type I error probability now refers to a posterior probability of the null, based on an assumed prior probability of .5!

If a P value is not the error rate, what the heck is the error rate?

Sellke et al.* have estimated the error rates associated with different P values. While the precise error rate depends on various assumptions (which I discuss here), the table summarizes them for middle-of-the-road assumptions.

P value Probability of incorrectly rejecting a true null hypothesis
0.05 At least 23% (and typically close to 50%)
0.01 At least 7% (and typically close to 15%)

*Thomas SELLKE, M. J. BAYARRI, and James O. BERGER, Calibration of p Values for Testing Precise Null Hypotheses, The American Statistician, February 2001, Vol. 55, No. 1

We’ve discussed how J. Berger and Sellke (1987) compute these posterior probabilities using spiked priors, generally representing undefined “reference” or conventional priors. (Please see my previous post.) J. Berger claims, at times, that these posterior probabilities (which he computes in lots of different ways) are the error probabilities, and Frost does too, at least in some posts. The allegation that therefore P-values exaggerate the evidence can’t be far behind–or so a reader of this blog surmises–and there it is, right below:

Do the higher error rates in this table surprise you? Unfortunately, the common misinterpretation of P values as the error rate creates the illusion of substantially more evidence against the null hypothesis than is justified. As you can see, if you base a decision on a single study with a P value near 0.05, the difference observed in the sample may not exist at the population level. (Frost 5/14)

Admittedly, Frost is led to his equivocation by the sleights of hands of others, encouraged after around 2003 (in my experience).

(4) J. Berger’s Sleight of Hand: These sleights of hand are familiar to readers of this blog; I wouldn’t have expected them in a set of instructional blogposts about misinterpreting significance levels (at least without a great big warning). But Frost didn’t dream them up, he’s following a practice, traceable to J. Berger, of claiming that a posterior (usually based on conventional priors) gives the real error rate (or the conditional error rate). [The computations come from Edwards, Lindmann, and Savage 1963.) Whether it’s the ‘right’ default prior to use or not, my point is simply that the meaning is changed, and Frost ought to issue a trigger alert! Instead, he includes numerous links to related posts on significance tests, making it appear that blithely assigning a .5 spiked prior to the null is not only kosher, but is part of ordinary significance testing. It’s scarcely a justified move. As Casella and R.Berger (1987) show, this is a highly biased prior to use. See this post, and others by Stephen Senn (3/16/15, 5/9/15). Moreover, many people regard point nulls as always false.

In my comment on J. Berger (2003), I noted my surprise at his redefinition of error probability. (See pp. 19-24 in this paper). In response to me, Berger asks,

“Why should the frequentist school have exclusive right to the term ‘error probability’? It is not difficult to simply add the designation ‘frequentist’ (or Type I or Type II) or ‘Bayesian’ to the term to differentiate between the schools” (J. Berger 2003, p. 30).

That would work splendidly, I’m all in favor of differentiating between the schools. Note that he allows “Type I ” to go with the ordinary frequentist variant. If Berger had emphasized this distinction in his paper, Frost would have been warned of the slipperly slide he’s about to take a trip on. Instead, Berger has increasingly grown accustomed to claiming these are the real frequentist(!) error probabilities. (Others have followed.)

(5) Is Hypothesis testing like Diagnostic Screening? Now Frost (following David Colquhoun) appears to favor (not Berger’s conventionalist prior) a type of frequentist or “prevalence” prior:

It is the proportion of hypothesis tests in which the alternative hypothesis is true at the outset. It can be thought of as the long-term probability, or track record, of similar types of studies. It’s the plausibility of the alternative hypothesis.

Do we know these prevalences? What reference class should a given hypothesis be regarded as having been selected? There would be many different urns to which a particular hypothesis belongs. Such frequentist-Bayesian computations may be appropriate in contexts of high throughput screening, where a hypothesis is viewed as a generic, random selection from an urn of hypotheses. Here, a (behavioristic) concern to control the rates of following up false leads is primary. But that’s very different from evaluating how well tested or corroborated a particular H is. And why should fields with “high crud factors” (as Meehl called them) get the benefit? (having a low prior prevalence of “no effect”).

I’ve discussed all these points elsewhere, and they are beside my current complaint which is simply this: In some places, Frost construes the probability of a Type I error as an ordinary error probability based on the sampling distribution alone; and other places as a Bayesian posterior probability of a hypothesis, conditional on a set of data.

Frost goes further in the post to suggest that “hypotheses tests are journeys from the prior probability to posterior probability”.

Hypothesis tests begin with differing probabilities that the null hypothesis is true depending on the specific hypotheses being tested. [Mayo: They do?] This prior probability influences the probability that the null is true at the conclusion of the test, the posterior probability.

 

Initial Probability of
true null (1 – P(real))
P value obtained Final Minimum Probability
of true null
0.5 0.05 0.289
0.5 0.01 0.110
0.5 0.001 0.018
0.33 0.05 0.12
0.9 0.05 0.76

The table is based on calculations by Colquhoun and Sellke et al. It shows that the decrease from the initial probability to the final probability of a true null depends on the P value. Power is also a factor but not shown in the table.

It is assumed that there is just a crude dichotomy: the null is true vs the effect is real (never mind magnitudes of discrepancy which I and others insist upon), and further, that you reject on the basis of a single, just statistically significant result. But these moves go against the healthy recommendations for good testing in the other posts on the Minitab blog. I recommend Frost go back and label the places where he has conflated the probability of a Type I error with a posterior probability based on a prior: use Berger’s suggestion of reserving “Type I error probability” for the ordinary frequentist error probability based on a sampling distribution alone, calling the posterior error probability Bayesian. Else contradictory claims will ensue…But I’m not holding my breath.

I may continue this in (ii)….

***********************************************************************

January 21, 2016 Update:

Jim Frost from Minitab responded, not to my post, but to a comment I made on his blog prior to writing the post. Since he hasn’t commented here, let me paste the relevant portion of his reply. I want to separate the issue of predesignating alpha and the observed P-value, because my point now is quite independent of it. Let’s even just talk of a fixed alpha or fixed P-value for rejecting the null. My point’s very simple: Frost sometimes considers the type I error probability to be alpha (in the 2015 posts) based solely on the sampling distribution which he ably depicts– whereas in other places he regards it as the posterior probability of the null hypothesis based on a prior probability (the 2014 posts). He does the same in his reply to me (which I can’t seem to link, but it’s in the discussion here):

From Frost’s reply to me:

Yes, if your study obtains a p-value of 0.03, you can say that 3% of all studies that obtain a p-value less than or equal to 0.03 will have a Type I error. That’s more of a N-P error rate interpretation (except that N-P focused on critical test values rather than p-values). …..

Based on Fisher’s measure of evidence approach, the correct way to interpret a p-value of exactly 0.03 is like this:

Assuming the null hypothesis is true, you’d obtain the observed effect or more in 3% of studies due to random sampling error.

We know that the p-value is not the error rate because:

1) The math behind the p-values is not designed to calculate the probability that the null hypothesis is true (which is actually incalculable based solely on sample statistics). See a graphical representation of the math behind p-values and a post dedicated to how to correctly interpret p-values.

2) We also know this because there have been a number of simulation studies that look at the relationship between p-values and the probability that the null is true. These studies show that the actual probability that the null hypothesis is true tends to be greater than the p-value by a large margin.

3) Empirical studies that look at the replication of significant results also suggest that the actual probability that the null is true is greater than the p-value.

Frost’s points (1)-(3) above would also oust alpha as the type I error probability, for it’s also not designed to give a posterior. Never mind the question of the irrelevance or bias associated with the hefty spiked prior to the null involved in the simulations, all I’m saying is that Frost should make the distinction that even J. Berger agrees to, if he doesn’t want to confuse his readers.

—————————————————————————

[a] A distinct issue as to whether significance levels, but not P-values (the attained significance level) are error probabilities, is discussed here. Here are some of the assertions from Fisherian, Neyman-Pearsonian and Bayesian camps cited in that post. (I make no attempt at uniformity in writing the “P-value”, but retain the quotes as written.

From the Fisherian camp (Cox and Hinkley):

For given observations y we calculate t = tobs = t(y), say, and the level of significance pobs by

pobs = Pr(T > tobs; H0).

….Hence pobs is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as being just decisive against H0.” (Cox and Hinkley 1974, 66).

Thus pobs would be the Type I error probability associated with the test.

From the Neyman-Pearson N-P camp (Lehmann and Romano):

“[I]t is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level…at which the hypothesis would be rejected for the given observation. This number, the so-called p-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.” (Lehmann and Romano 2005, 63-4) 

Very similar quotations are easily found, and are regarded as uncontroversial—even by Bayesians whose contributions stood at the foot of Berger and Sellke’s argument that P values exaggerate the evidence against the null.

Gibbons and Pratt:

“The P-value can then be interpreted as the smallest level of significance, that is, the ‘borderline level’, since the outcome observed would be judged significant at all levels greater than or equal to the P-value[i] but not significant at any smaller levels. Thus it is sometimes called the ‘level attained’ by the sample….Reporting a P-value, whether exact or within an interval, in effect permits each individual to choose his own level of significance as the maximum tolerable probability of a Type I error.” (Gibbons and Pratt 1975, 21).

 

REFERENCES:

Berger, J. O. and Sellke, T.  (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Casella G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82 106–111, 123–139.

Edwards, A. ., Lindman, H. and Savage, L. 1963. ‘Bayesian Statistical Inference for Psychological Research’, Psychological Review 70(3): 193–242.

Sellke, T., Bayarri, M. J. and Berger, J. O. 2001. Calibration of p Values for Testing Precise Null Hypotheses. The American Statistician, 55(1): 62-71.

  1. Frost blog posts:
  • 4/17/14: How to Correctly Interpret P Values
  • 5/1/14: Not All P Values are Created Equal
  • 3/19/15: Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics

Some Relevant Errorstatistics Posts:

  •  4/28/12: Comedy Hour at the Bayesian Retreat: P-values versus Posteriors.
  • 9/29/13: Highly probable vs highly probed: Bayesian/ error statistical differences.
  • 7/14/14: “P-values overstate the evidence against the null”: legit or fallacious? (revised)
  • 8/17/14: Are P Values Error Probabilities? or, “It’s the methods, stupid!” (2nd install)
  • 3/5/15: A puzzle about the latest test ban (or ‘don’t ask, don’t tell’)
  • 3/16/15: Stephen Senn: The pathetic P-value (Guest Post)
  • 5/9/15: Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)
Categories: highly probable vs highly probed, J. Berger, reforming the reformers, Statistics | 1 Comment

Frequentstein: What’s wrong with (1 – β)/α as a measure of evidence against the null? (ii)

Slide1

.

In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

The recommendation is much more fully fleshed out in a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Their recommendation is:

…that researchers should report the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results. (BBBS 2016, p. 3)….

The (pre-experimental) ‘rejection ratio’ Rpre , the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for Hover H0. (ibid., p. 2)

But in fact it does no such thing! [See my post from the FUSION conference here.] J. Berger, and his co-authors, will tell you the rejection ratio (and a variety of other measures created over the years) are entirely frequentist because they are created out of frequentist error statistical measures. But a creation built on frequentist measures doesn’t mean the resulting animal captures frequentist error statistical reasoning. It might be a kind of Frequentstein monster! [1] Continue reading

Categories: J. Berger, power, reforming the reformers, S. Senn, Statistical power, Statistics | 36 Comments

High error rates in discussions of error rates (1/21/16 update)

27D0BB5300000578-3168627-image-a-27_1437433320306

waiting for the other shoe to drop…

Do you ever find yourself holding your breath when reading an exposition of significance tests that’s going swimmingly so far? If you’re a frequentist in exile, you know what I mean. I’m sure others feel this way too. When I came across Jim Frost’s posts on The Minitab Blog, I thought I might actually have located a success story. He does a good job explaining P-values (with charts), the duality between P-values and confidence levels, and even rebuts the latest “test ban” (the “Don’t Ask, Don’t Tell” policy). Mere descriptive reports of observed differences that the editors recommend, Frost shows, are uninterpretable without a corresponding P-value or the equivalent. So far, so good. I have only small quibbles, such as the use of “likelihood” when meaning probability, and various and sundry nitpicky things. But watch how in some places significance levels are defined as the usual error probabilities and error rates—indeed in the glossary for the site—while in others it is denied they provide error rates. In those other places, error probabilities and error rates shift their meaning to posterior probabilities, based on priors representing the “prevalence” of true null hypotheses. Continue reading

Categories: highly probable vs highly probed, J. Berger, reforming the reformers, Statistics | 11 Comments

Are P Values Error Probabilities? or, “It’s the methods, stupid!” (2nd install)

f1ce127a4cfe95c4f645f0cc98f04fca

.

Despite the fact that Fisherians and Neyman-Pearsonians alike regard observed significance levels, or P values, as error probabilities, we occasionally hear allegations (typically from those who are neither Fisherian nor N-P theorists) that P values are actually not error probabilities. The denials tend to go hand in hand with allegations that P values exaggerate evidence against a null hypothesis—a problem whose cure invariably invokes measures that are at odds with both Fisherian and N-P tests. The Berger and Sellke (1987) article from a recent post is a good example of this. When leading figures put forward a statement that looks to be straightforwardly statistical, others tend to simply repeat it without inquiring whether the allegation actually mixes in issues of interpretation and statistical philosophy. So I wanted to go back and look at their arguments. I will post this in installments.

1. Some assertions from Fisher, N-P, and Bayesian camps

Here are some assertions from Fisherian, Neyman-Pearsonian and Bayesian camps: (I make no attempt at uniformity in writing the “P-value”, but retain the quotes as written.)

a) From the Fisherian camp (Cox and Hinkley):

For given observations y we calculate t = tobs = t(y), say, and the level of significance pobs by

pobs = Pr(T > tobs; H0).

….Hence pobs is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as being just decisive against H0.” (Cox and Hinkley 1974, 66).

Thus pobs would be the Type I error probability associated with the test.

b) From the Neyman-Pearson N-P camp (Lehmann and Romano):

“[I]t is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level…at which the hypothesis would be rejected for the given observation. This number, the so-called p-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.” (Lehmann and Romano 2005, 63-4) 

Very similar quotations are easily found, and are regarded as uncontroversial—even by Bayesians whose contributions stood at the foot of Berger and Sellke’s argument that P values exaggerate the evidence against the null. Continue reading

Categories: frequentist/Bayesian, J. Berger, P-values, Statistics | 32 Comments

U-Phil: Deconstructions [of J. Berger]: Irony & Bad Faith 3

Memory Lane: 2 years ago:
My efficient Errorstat Blogpeople1 have put forward the following 3 reader-contributed interpretive efforts2 as a result of the “deconstruction” exercise from December 11, (mine, from the earlier blog, is at the end) of what I consider:

“….an especially intriguing remark by Jim Berger that I think bears upon the current mindset (Jim is aware of my efforts):

Too often I see people pretending to be subjectivists, and then using “weakly informative” priors that the objective Bayesian community knows are terrible and will give ridiculous answers; subjectivism is then being used as a shield to hide ignorance. . . . In my own more provocative moments, I claim that the only true subjectivists are the objective Bayesians, because they refuse to use subjectivism as a shield against criticism of sloppy pseudo-Bayesian practice. (Berger 2006, 463)” (From blogpost, Dec. 11, 2011)
_________________________________________________
Andrew Gelman:

The statistics literature is big enough that I assume there really is some bad stuff out there that Berger is reacting to, but I think that when he’s talking about weakly informative priors, Berger is not referring to the work in this area that I like, as I think of weakly informative priors as specifically being designed to give answers that are _not_ “ridiculous.”

Keeping things unridiculous is what regularization’s all about, and one challenge of regularization (as compared to pure subjective priors) is that the answer to the question, What is a good regularizing prior?, will depend on the likelihood.  There’s a lot of interesting theory and practice relating to weakly informative priors for regularization, a lot out there that goes beyond the idea of noninformativity.

To put it another way:  We all know that there’s no such thing as a purely noninformative prior:  any model conveys some information.  But, more and more, I’m coming across applied problems where I wouldn’t want to be noninformative even if I could, problems where some weak prior information regularizes my inferences and keeps them sane and under control. Continue reading

Categories: Gelman, Irony and Bad Faith, J. Berger, Statistics, U-Phil | Tags: , , , | 3 Comments

Blog at WordPress.com.