Monthly Archives: February 2017

Hocus pocus! Adopt a magician’s stance, if you want to reveal statistical sleights of hand

Here’s the follow-up post to the one I reblogged on Feb 3. When they sought to subject Uri Geller to the scrutiny of scientists, magicians had to be brought in because only they were sufficiently trained to spot the subtle sleights of hand by which a magician tricks through misdirection. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in discussions of statistical significance tests (and other methods), even within the same statistical guide. We needn’t suppose anything deliberately devious is going on at all! Often, the statistical guidebook reflects shifts of meaning that grow out of one or another critical argument. These days, such shifts trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that the guidebooks end up containing inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.[0]

I don’t know Jim Frost, but he gives statistical guidance at the Minitab blog. The purpose of my previous post was to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this:

Based on Fisher’s measure of evidence approach, the correct way to interpret a p-value of exactly 0.03 is like this:

Assuming the null hypothesis is true, you’d obtain the observed effect or more in 3% of studies due to random sampling error.

We know that the p-value is not the error rate because:

1) The math behind the p-values is not designed to calculate the probability that the null hypothesis is true (which is actually incalculable based solely on sample statistics). …

But this is also true for a test’s significance level α, so on these grounds α couldn’t be an “error rate” or error probability either. Yet Frost defines α to be a Type I error probability (“An α of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis”). [1]

Let’s use the philosopher’s slightly obnoxious but highly clarifying move of subscripts. There is error probability₁, the usual frequentist (sampling theory) notion, and error probability₂, the posterior probability that the null hypothesis is true conditional on the data, as in Frost’s remark. (It may also be stated as conditional on the p-value, or on rejecting the null.) Whether a p-value is predesignated or attained (observed), error probability₁ ≠ error probability₂.[2] Frost, inadvertently I assume, uses the probability of a Type I error in these two incompatible ways in his posts on significance tests.[3]

Interestingly, the simulations to which Frost refers, which “show that the actual probability that the null hypothesis is true [i.e., error probability₂] tends to be greater than the p-value by a large margin”, work with a fixed p-value, or α level, of say .05. So it’s not a matter of predesignated or attained p-values [4]. Their computations also employ predesignated probabilities of Type II errors and corresponding power values. The null is rejected based on a single finding that attains a .05 p-value. Moreover, the point null (of “no effect”) is given a spiked prior of .5. (The idea comes from a context of diagnostic testing; the prior is often based on an assumed “prevalence” of true nulls, of which the current null is taken to be a member. Please see my previous post.)
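A minimal sketch of a screening simulation of this general kind may help keep the two quantities apart. It is not a reproduction of the simulations Frost cites: the one-sided z test, the power of 0.8, the number of hypotheses, and the function name are all assumptions made for illustration. It reports the relative frequency of rejection among true nulls (a frequentist Type I error probability) alongside the relative frequency of true nulls among rejections (a posterior under the assumed prevalence).

```python
import random
from statistics import NormalDist


def simulate_screening(n_hyp=200_000, prior_null=0.5, alpha=0.05, power=0.8, seed=1):
    """Return (Pr(reject; H0), Pr(H0 | reject)) estimated from a screening simulation.

    Assumptions (illustrative only): one-sided z test, a single alternative
    placed so that the test has the stated power, and a spiked prior on the null.
    """
    std = NormalDist()
    cutoff = std.inv_cdf(1 - alpha)       # reject when z exceeds this value
    shift = cutoff + std.inv_cdf(power)   # alternative mean (in SE units) giving that power
    rng = random.Random(seed)
    true_nulls = rejections = false_rejections = 0
    for _ in range(n_hyp):
        null_is_true = rng.random() < prior_null
        z = rng.gauss(0.0 if null_is_true else shift, 1.0)
        if null_is_true:
            true_nulls += 1
        if z > cutoff:
            rejections += 1
            false_rejections += null_is_true
    return false_rejections / true_nulls, false_rejections / rejections


type_i_rate, post_null = simulate_screening()
print(f"Pr(reject; H0)  ~ {type_i_rate:.3f}")   # computed from the sampling distribution under H0
print(f"Pr(H0 | reject) ~ {post_null:.3f}")     # depends on the assumed prior and power
```

Changing `prior_null` or `power` moves the second number but leaves the first one at about α, which is one way to see that they answer different questions.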

Their simulations are the basis of criticisms of error probability₁ because what really matters, or so these critics presuppose, is error probability₂.

Whether this assumption is correct, and whether these simulations are the slightest bit relevant to appraising the warrant for a given hypothesis, are completely distinct issues. I’m just saying that Frost’s own links mix these notions. If you approach statistical guidebooks with the magician’s suspicious eye, however, you can pull back the curtain on these sleights of hand.

Oh, and don’t lose your nerve just because the statistical guides themselves don’t see it or don’t relent. Send it on to me at [email protected].

******************************************************************************

[0] They are the focus of a book I am completing: “Statistical Inference As Severe Testing: How to Get Beyond the Statistics Wars” (CUP, 2017).

[1] I admit we need a more careful delineation of the meaning of ‘error probability’. One doesn’t have an error probability without there being something that could be “in error”. That something is generally understood as an inference or an interpretation of data. A method of statistical inference moves from data to some inference about the source of the data as modeled; some may wish to see the inference as a kind of “act” (using Neyman’s language) or “decision to assert” but nothing turns on this.

Associated error probabilities refer to the probability a method outputs an erroneous interpretation of the data, where the particular error is pinned down. For example, it might be that the test infers μ > 0 when in fact the data have been generated by a process where μ = 0. The test is defined in terms of a test statistic d(X), and the error probabilities refer to the probability distribution of d(X), the sampling distribution, under various assumptions about the data generating process. Error probabilities in tests, whether of the Fisherian or N-P varieties, refer to hypothetical relative frequencies of error in applying a method.

[2] Notice that error probability₂ involves conditioning on the particular outcome. Say you have observed a 1.96 standard deviation difference, and that’s your fixed cut-off. There’s no consideration of the sampling distribution of d(X), if you’ve conditioned on the actual outcome. Yet probabilities of Type I and Type II errors, as well as p-values, are defined exclusively in terms of the sampling distribution of d(X), under a statistical hypothesis of interest. But all that’s error probability₁.

[3] Doubtless, part of the problem is that testers fail to clarify when and why a small significance level (or p-value) provides a warrant for inferring a discrepancy from the null. First, for a p-value to be actual (and not merely nominal), we need:

Pr(P < p_obs; H0) = p_obs.

Cherry picking and significance seeking can yield a small nominal p-value while the actual probability of attaining even smaller p-values under the null is high, so this identity fails. Second, a small p-value warrants inferring a discrepancy from the null because, and to the extent that, a larger p-value would very probably have occurred, were the null hypothesis correct. This links error probabilities of a method to an inference in the case at hand.

“….Hence p_obs is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as being just decisive against H0.” (Cox and Hinkley 1974, p. 66).
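The gap between a nominal and an actual p-value under cherry picking, described in this note, can be sketched numerically. The setup is an assumption chosen for illustration: 20 independent one-sided z tests, all with true nulls, of which only the smallest p-value is reported. The function names are hypothetical.

```python
import random
from statistics import NormalDist


def actual_prob_of_nominal(p_nominal, n_tests):
    """Pr(smallest of n_tests independent p-values <= p_nominal; every null true)."""
    return 1 - (1 - p_nominal) ** n_tests


def simulate_cherry_picking(p_nominal=0.05, n_tests=20, n_rep=20_000, seed=2):
    """Relative frequency with which the best of n_tests null results reaches p_nominal."""
    std = NormalDist()
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        # Each test statistic is drawn under its null; keep only the smallest p-value.
        smallest = min(1 - std.cdf(rng.gauss(0.0, 1.0)) for _ in range(n_tests))
        hits += smallest <= p_nominal
    return hits / n_rep


print(f"analytic:  {actual_prob_of_nominal(0.05, 20):.3f}")
print(f"simulated: {simulate_cherry_picking():.3f}")
```

Under these assumptions the reported “0.05” is attained far more often than 5% of the time when every null is true, so the nominal value no longer equals the relevant error probability.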

[4] The myth that significance levels lose their error probability status once the attained p-value is reported is just that, a myth. I’ve discussed it a lot elsewhere, but the current point doesn’t turn on this. Still, it’s worth listening to statistician Stephen Senn (2002, p. 2438) on this point.

I disagree with [Steve Goodman] on two grounds here: (i) it is not necessary to separate p-values from their hypothesis test interpretation; (ii) the replication probability has no direct bearing on inferential meaning. Second he claims that, ‘the replication probability can be used as a frequentist counterpart of Bayesian and likelihood models to show that p-values overstate the evidence against the null-hypothesis’ (p. 875, my italics). I disagree that there is such an overstatement. In my opinion, whatever philosophical differences there are between significance tests and hypothesis tests, they have little to do with the use or otherwise of p-values. For example, Lehmann in Testing Statistical Hypotheses, regarded by many as the most perfect and complete expression of the Neyman–Pearson approach, says

‘In applications, there is usually available a nested family of rejection regions corresponding to different significance levels. It is then good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level … the significance probability or p-value, at which the hypothesis would be rejected for the given observation’. (Lehmann, Testing Statistical Hypotheses, 1994, p. 70, original italics).

Note to subscribers: Please check back to find follow-ups and corrected versions of blogposts, indicated with (ii), (iii) etc.


High error rates in discussions of error rates: no end in sight

waiting for the other shoe to drop…

“Guides for the Perplexed” in statistics become “Guides to Become Perplexed” when “error probabilities” (in relation to statistical hypotheses tests) are confused with posterior probabilities of hypotheses. Moreover, these posteriors are neither frequentist, subjectivist, nor default. Since this doublespeak is becoming more common in some circles, it seems apt to reblog a post from one year ago (you may wish to check the comments).

Do you ever find yourself holding your breath when reading an exposition of significance tests that’s going swimmingly so far? If you’re a frequentist in exile, you know what I mean. I’m sure others feel this way too. When I came across Jim Frost’s posts on The Minitab Blog, I thought I might actually have located a success story. He does a good job explaining P-values (with charts), the duality between P-values and confidence levels, and even rebuts the latest “test ban” (the “Don’t Ask, Don’t Tell” policy). Mere descriptive reports of observed differences that the editors recommend, Frost shows, are uninterpretable without a corresponding P-value or the equivalent. So far, so good. I have only small quibbles, such as the use of “likelihood” when meaning probability, and various and sundry nitpicky things. But watch how in some places significance levels are defined as the usual error probabilities —indeed in the glossary for the site—while in others it is denied they provide error probabilities. In those other places, error probabilities and error rates shift their meaning to posterior probabilities, based on priors representing the “prevalence” of true null hypotheses.

Begin with one of his kosher posts “Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics” (blue is Frost):

(1) The Significance level is the Type I error probability (3/15)

The significance level, also denoted as alpha or α, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference….

Keep in mind that there is no magic significance level that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy. The common alpha values of 0.05 and 0.01 are simply based on tradition. For a significance level of 0.05, expect to obtain sample means in the critical region 5% of the time when the null hypothesis is true. In these cases, you won’t know that the null hypothesis is true but you’ll reject it because the sample mean falls in the critical region. That’s why the significance level is also referred to as an error rate! (My emphasis.)

Note: Frost is using the term “error rate” here, which is why I use it in my title. Error probability would be preferable.

This type of error doesn’t imply that the experimenter did anything wrong or require any other unusual explanation. The graphs show that when the null hypothesis is true, it is possible to obtain these unusual sample means for no reason other than random sampling error. It’s just luck of the draw.

(2) Definition Link: Now we go to the blog’s definition link for this “type of error”

No hypothesis test is 100% certain. Because the test is based on probabilities, there is always a chance of drawing an incorrect conclusion.

Type I error

When the null hypothesis is true and you reject it, you make a type I error. The probability of making a type I error is α, which is the level of significance you set for your hypothesis test. An α of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis. (My emphasis)

Decision | Null hypothesis true | Null hypothesis false
Fail to reject | Correct decision (probability = 1 – α) | Type II error: fail to reject the null when it is false (probability = β)
Reject | Type I error: rejecting the null when it is true (probability = α) | Correct decision (probability = 1 – β)

He gives very useful graphs showing quite clearly that the probability of a Type I error comes from the sampling distribution of the statistic (in the illustrated case, it’s the distribution of sample means).
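The point that this error probability is a property of the sampling distribution can be checked with a short simulation. This is a sketch under assumed conditions, not Frost’s own computation: normal data with known σ = 1, a two-sided z test on the sample mean, n = 25 observations per study, and a hypothetical function name.

```python
import random
from statistics import NormalDist, fmean


def type_one_error_rate(alpha=0.05, n=25, n_studies=20_000, seed=3):
    """Relative frequency of rejecting a true null (mu = 0) over repeated studies."""
    cutoff = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value for z
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]   # the null is true by construction
        z = fmean(sample) * n ** 0.5                       # standardized sample mean, sigma = 1
        rejections += abs(z) > cutoff
    return rejections / n_studies


print(f"Relative frequency of rejection under H0: {type_one_error_rate():.3f}")
```

No prior probability for the null appears anywhere in the computation: the rate is fixed by the distribution of sample means under the null and the chosen critical region.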

So it is odd that elsewhere Frost tells us that a significance level (attained or fixed) is not the probability of a Type I error. [Note: the issue here isn’t whether the significance level is fixed or attained. The difference to which I’m calling your attention is between an ordinary frequentist error probability and a posterior probability of a null hypothesis, given that it is rejected, based on a prior probability for the null vs a probability for a single alternative, which he writes as P(real). We may call it the false finding rate (FFR), and it arises in typical diagnostic screening contexts. I elsewhere take up the allegation, by some, that a significance level is an error probability but a P-value is not (Are P-values error probabilities?). A post is here. Also see note [a] below.]

Here are some examples from Frost’s posts on 4/14 & 5/14:

 (3)  In a different post Frost alleges: A significance level is not the Type I error probability

Incorrect interpretations of P values are very common. The most common mistake is to interpret a P value as the probability of making a mistake by rejecting a true null hypothesis (a Type I error).

Now “making a mistake” may be vague (and his statement here is a bit wonky), but the parenthetical link makes it clear he intends the Type I error probability. Guess what? The link is to the exact same definition of Type I error as before: the ordinary error probability computed from the sampling distribution. Yet in the blogpost itself, the Type I error probability now refers to a posterior probability of the null, based on an assumed prior probability of .5!

If a P value is not the error rate, what the heck is the error rate?

Sellke et al.* have estimated the error rates associated with different P values. While the precise error rate depends on various assumptions (which I discuss here), the table summarizes them for middle-of-the-road assumptions.

P value | Probability of incorrectly rejecting a true null hypothesis
0.05 | At least 23% (and typically close to 50%)
0.01 | At least 7% (and typically close to 15%)

*Thomas SELLKE, M. J. BAYARRI, and James O. BERGER, Calibration of p Values for Testing Precise Null Hypotheses, The American Statistician, February 2001, Vol. 55, No. 1

We’ve discussed how J. Berger and Sellke (1987) compute these posterior probabilities using spiked priors, generally representing undefined “reference” or conventional priors. (Please see my previous post.) J. Berger claims, at times, that these posterior probabilities (which he computes in lots of different ways) are the error probabilities, and Frost does too, at least in some posts. The allegation that therefore P-values exaggerate the evidence can’t be far behind–or so a reader of this blog surmises–and there it is, right below:

Do the higher error rates in this table surprise you? Unfortunately, the common misinterpretation of P values as the error rate creates the illusion of substantially more evidence against the null hypothesis than is justified. As you can see, if you base a decision on a single study with a P value near 0.05, the difference observed in the sample may not exist at the population level. (Frost 5/14)

Admittedly, Frost is led to his equivocation by the sleights of hand of others, encouraged after around 2003 (in my experience).

(4) J. Berger’s Sleight of Hand: These sleights of hand are familiar to readers of this blog; I wouldn’t have expected them in a set of instructional blogposts about misinterpreting significance levels (at least without a great big warning). But Frost didn’t dream them up; he’s following a practice, traceable to J. Berger, of claiming that a posterior (usually based on conventional priors) gives the real error rate (or the conditional error rate). (The computations come from Edwards, Lindman, and Savage 1963.) Whether it’s the ‘right’ default prior to use or not, my point is simply that the meaning is changed, and Frost ought to issue a trigger alert! Instead, he includes numerous links to related posts on significance tests, making it appear that blithely assigning a .5 spiked prior to the null is not only kosher, but is part of ordinary significance testing. It’s scarcely a justified move. As Casella and R. Berger (1987) show, this is a highly biased prior to use. See this post, and others by Stephen Senn (3/16/15, 5/9/15). Moreover, many people regard point nulls as always false.

In my comment on J. Berger (2003), I noted my surprise at his redefinition of error probability. (See pp. 19-24 in this paper). In response to me, Berger asks,

“Why should the frequentist school have exclusive right to the term ‘error probability’? It is not difficult to simply add the designation ‘frequentist’ (or Type I or Type II) or ‘Bayesian’ to the term to differentiate between the schools” (J. Berger 2003, p. 30).

That would work splendidly; I’m all in favor of differentiating between the schools. Note that he allows “Type I” to go with the ordinary frequentist variant. If Berger had emphasized this distinction in his paper, Frost would have been warned of the slippery slide he’s about to take a trip on. Instead, Berger has increasingly grown accustomed to claiming these are the real frequentist(!) error probabilities. (Others have followed.)

(5) Is Hypothesis testing like Diagnostic Screening? Now Frost (following David Colquhoun) appears to favor (not Berger’s conventionalist prior) a type of frequentist or “prevalence” prior:

It is the proportion of hypothesis tests in which the alternative hypothesis is true at the outset. It can be thought of as the long-term probability, or track record, of similar types of studies. It’s the plausibility of the alternative hypothesis.

Do we know these prevalences? From what reference class should a given hypothesis be regarded as having been selected? There would be many different urns to which a particular hypothesis belongs. Such frequentist-Bayesian computations may be appropriate in contexts of high throughput screening, where a hypothesis is viewed as a generic, random selection from an urn of hypotheses. Here, a (behavioristic) concern to control the rates of following up false leads is primary. But that’s very different from evaluating how well tested or corroborated a particular H is. And why should fields with “high crud factors” (as Meehl called them) get the benefit of having a low prior prevalence of “no effect”?
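How strongly the screening-style posterior depends on the chosen urn can be seen from the standard diagnostic-screening formula. The sketch below is illustrative only: the α of 0.05, the power of 0.8, the prevalence values, and the function name are assumptions, not figures taken from Frost or Colquhoun.

```python
def posterior_null_given_reject(prev_null, alpha=0.05, power=0.8):
    """Proportion of rejections that are true nulls, for an assumed prevalence of true nulls."""
    false_positives = alpha * prev_null          # true nulls that get rejected
    true_positives = power * (1 - prev_null)     # real effects that get detected
    return false_positives / (false_positives + true_positives)


# The same test (same alpha, same power) placed in three hypothetical urns.
for prev in (0.1, 0.5, 0.9):
    print(f"prevalence of true nulls = {prev:.1f} -> Pr(H0 | reject) = "
          f"{posterior_null_given_reject(prev):.3f}")
```

The Type I error probability is 0.05 in every row; only the assumed prevalence changes, which is why the choice of reference class carries so much weight in these computations.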

I’ve discussed all these points elsewhere, and they are beside my current complaint, which is simply this: In some places, Frost construes the probability of a Type I error as an ordinary error probability based on the sampling distribution alone; and in other places as a Bayesian posterior probability of a hypothesis, conditional on a set of data.

Frost goes further in the post to suggest that “hypotheses tests are journeys from the prior probability to posterior probability”.

Hypothesis tests begin with differing probabilities that the null hypothesis is true depending on the specific hypotheses being tested. [Mayo: They do?] This prior probability influences the probability that the null is true at the conclusion of the test, the posterior probability.

 

Initial Probability of true null (1 – P(real)) | P value obtained | Final Minimum Probability of true null
0.5 | 0.05 | 0.289
0.5 | 0.01 | 0.110
0.5 | 0.001 | 0.018
0.33 | 0.05 | 0.12
0.9 | 0.05 | 0.76

The table is based on calculations by Colquhoun and Sellke et al. It shows that the decrease from the initial probability to the final probability of a true null depends on the P value. Power is also a factor but not shown in the table.
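For readers who want to see how a “minimum probability of a true null” can be computed from a P value and a prior, here is a minimal sketch using the calibration in Sellke, Bayarri and Berger (2001): for p < 1/e, the Bayes factor in favor of the null is bounded below by −e·p·ln(p). The function names are hypothetical. With a prior of 0.5 and p = 0.05 the bound gives roughly 0.289, in line with the first row of the table; no claim is made that this formula reproduces the remaining rows, which may rest on other calculations.

```python
import math


def min_bayes_factor(p):
    """Lower bound -e * p * ln(p) on the Bayes factor for the null (valid for 0 < p < 1/e)."""
    if not 0 < p < 1 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)


def min_posterior_null(p, prior_null=0.5):
    """Smallest posterior probability of the null implied by the bound and the assumed prior."""
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * min_bayes_factor(p)
    return posterior_odds / (1 + posterior_odds)


for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: minimum posterior of the null = {min_posterior_null(p):.3f}")
```

Note that the prior on the null enters the computation directly: the result is a posterior probability, not a quantity defined by the sampling distribution of the test statistic.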

It is assumed that there is just a crude dichotomy: the null is true vs the effect is real (never mind magnitudes of discrepancy which I and others insist upon), and further, that you reject on the basis of a single, just statistically significant result. But these moves go against the healthy recommendations for good testing in the other posts on the Minitab blog. I recommend Frost go back and label the places where he has conflated the probability of a Type I error with a posterior probability based on a prior: use Berger’s suggestion of reserving “Type I error probability” for the ordinary frequentist error probability based on a sampling distribution alone, calling the posterior error probability Bayesian. Else contradictory claims will ensue…But I’m not holding my breath.

I may continue this in (ii)….

***********************************************************************

January 21, 2016 Update:

Jim Frost from Minitab responded, not to my post, but to a comment I made on his blog prior to writing the post. Since he hasn’t commented here, let me paste the relevant portion of his reply. I want to separate the issue of predesignating alpha and the observed P-value, because my point now is quite independent of it. Let’s even just talk of a fixed alpha or fixed P-value for rejecting the null. My point’s very simple: Frost sometimes considers the type I error probability to be alpha (in the 2015 posts) based solely on the sampling distribution which he ably depicts– whereas in other places he regards it as the posterior probability of the null hypothesis based on a prior probability (the 2014 posts). He does the same in his reply to me (which I can’t seem to link, but it’s in the discussion here):

From Frost’s reply to me:

Yes, if your study obtains a p-value of 0.03, you can say that 3% of all studies that obtain a p-value less than or equal to 0.03 will have a Type I error. That’s more of a N-P error rate interpretation (except that N-P focused on critical test values rather than p-values). …..

Based on Fisher’s measure of evidence approach, the correct way to interpret a p-value of exactly 0.03 is like this:

Assuming the null hypothesis is true, you’d obtain the observed effect or more in 3% of studies due to random sampling error.

We know that the p-value is not the error rate because:

1) The math behind the p-values is not designed to calculate the probability that the null hypothesis is true (which is actually incalculable based solely on sample statistics). See a graphical representation of the math behind p-values and a post dedicated to how to correctly interpret p-values.

2) We also know this because there have been a number of simulation studies that look at the relationship between p-values and the probability that the null is true. These studies show that the actual probability that the null hypothesis is true tends to be greater than the p-value by a large margin.

3) Empirical studies that look at the replication of significant results also suggest that the actual probability that the null is true is greater than the p-value.

Frost’s points (1)-(3) above would also oust alpha as the type I error probability, for it’s also not designed to give a posterior. Never mind the question of the irrelevance or bias associated with the hefty spiked prior to the null involved in the simulations, all I’m saying is that Frost should make the distinction that even J. Berger agrees to, if he doesn’t want to confuse his readers.

—————————————————————————

[a] A distinct issue as to whether significance levels, but not P-values (the attained significance level) are error probabilities, is discussed here. Here are some of the assertions from Fisherian, Neyman-Pearsonian and Bayesian camps cited in that post. (I make no attempt at uniformity in writing the “P-value”, but retain the quotes as written.)

From the Fisherian camp (Cox and Hinkley):

“For given observations y we calculate t = t_obs = t(y), say, and the level of significance p_obs by

p_obs = Pr(T > t_obs; H0).

….Hence p_obs is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as being just decisive against H0.” (Cox and Hinkley 1974, 66).

Thus p_obs would be the Type I error probability associated with the test.

From the Neyman-Pearson (N-P) camp (Lehmann and Romano):

“[I]t is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level…at which the hypothesis would be rejected for the given observation. This number, the so-called p-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.” (Lehmann and Romano 2005, 63-4) 

Very similar quotations are easily found, and are regarded as uncontroversial—even by Bayesians whose contributions stood at the foot of Berger and Sellke’s argument that P values exaggerate the evidence against the null.

Gibbons and Pratt:

“The P-value can then be interpreted as the smallest level of significance, that is, the ‘borderline level’, since the outcome observed would be judged significant at all levels greater than or equal to the P-value but not significant at any smaller levels. Thus it is sometimes called the ‘level attained’ by the sample….Reporting a P-value, whether exact or within an interval, in effect permits each individual to choose his own level of significance as the maximum tolerable probability of a Type I error.” (Gibbons and Pratt 1975, 21).
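The reading of the P-value as the smallest significance level at which the observed outcome would be rejected, common to the quotations in this note, can be illustrated for a simple case. The one-sided z test, the observed value of 1.96, and the function names are assumptions chosen for the sketch.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_value(z_obs):
    """Attained significance level Pr(Z > z_obs; H0) for a one-sided z test."""
    return 1 - STD_NORMAL.cdf(z_obs)


def rejects_at(z_obs, alpha):
    """True if a level-alpha test (reject when z exceeds its critical value) rejects H0."""
    return z_obs >= STD_NORMAL.inv_cdf(1 - alpha)


z_obs = 1.96   # assumed observed test statistic
print(f"p-value: {p_value(z_obs):.4f}")
for alpha in (0.10, 0.05, 0.03, 0.02, 0.01):
    # The test rejects exactly at those levels that are at least as large as the p-value.
    print(f"alpha = {alpha:.2f}: reject = {rejects_at(z_obs, alpha)}")
```

Both functions are computed from the same sampling distribution under the null, so reporting the P-value and choosing a level afterwards stays within the ordinary error-probability framework.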

 

REFERENCES:

Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.

Edwards, W., Lindman, H. and Savage, L. (1963). “Bayesian Statistical Inference for Psychological Research,” Psychological Review 70(3): 193–242.

Sellke, T., Bayarri, M. J. and Berger, J. O. (2001). “Calibration of p Values for Testing Precise Null Hypotheses,” The American Statistician 55(1): 62–71.

Frost blog posts:
  • 4/17/14: How to Correctly Interpret P Values
  • 5/1/14: Not All P Values are Created Equal
  • 3/19/15: Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics

Some Relevant Errorstatistics Posts:

  •  4/28/12: Comedy Hour at the Bayesian Retreat: P-values versus Posteriors.
  • 9/29/13: Highly probable vs highly probed: Bayesian/ error statistical differences.
  • 7/14/14: “P-values overstate the evidence against the null”: legit or fallacious? (revised)
  • 8/17/14: Are P Values Error Probabilities? or, “It’s the methods, stupid!” (2nd install)
  • 3/5/15: A puzzle about the latest test ban (or ‘don’t ask, don’t tell’)
  • 3/16/15: Stephen Senn: The pathetic P-value (Guest Post)
  • 5/9/15: Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)