Criticisms of Meta-analysis
The first appearances of meta-analysis in the 1970s were not met universally with
encomiums and expressions of gratitude. There was no shortage of critics who found the
whole idea wrong-headed, senseless, misbegotten, etc.
The Apples-and-Oranges Problem
Of course the most often repeated criticism of meta-analysis was that it was meaningless
because it "mixed apples and oranges." I was not unprepared for this criticism; indeed, I
had long before prepared my own defense: "Of course it mixes apples and oranges; in the
study of fruit nothing else is sensible; comparing apples and oranges is the only endeavor
worthy of true scientists; comparing apples to apples is trivial." But I misjudged the
degree to which this criticism would take hold of people's opinions and shut down their
minds. At times I even began to entertain my own doubts that it made sense to integrate
any two studies unless they were studies of "the same thing." But the same persons who
were arguing that no two studies should be compared unless they were studies of the
"same thing" were blithely comparing persons (i.e., experimental "subjects") within their
studies all the time. This seemed inconsistent. Plus, I had a glimmer of the self-contradictory
nature of the statement "No two things can be compared unless they are the
same." If they are the same, there is no reason to compare them; indeed, if "they" are the
same, then there are not two things, there is only one thing and comparison is not an
issue. And yet I had a gnawing insecurity that the critics might be right. One study is an
apple, and a second study is an orange; and comparing them is as stupid as comparing
apples and oranges, except that sometimes I do hesitate while considering whether I'm
hungry for an apple or an orange.
At about this time (the late 1970s) I was browsing through a new book that I had bought out
of a vague sense that it might be worth my time because it was
written by a Harvard philosopher, carried a title like Philosophical Explanations,
and its author, Robert Nozick, had written one of the few pieces on
the philosophy of the social sciences that ever impressed me as being worth rereading. To
my amazement, Nozick spent the first one hundred pages of his book on the problem of
"identity," i.e., what does it mean to say that two things are the same? Starting with the
puzzle of how two things that are alike in every respect would not be one thing, Nozick
unraveled the problem of identity and discovered its fundamental nature underlying a
host of philosophical questions ranging from "How do we think?" to "How do I know
that I am I?" Here, I thought at last, might be the answer to the "apples and oranges"
question. And indeed, it was there.
Nozick considered the classic problem of Theseus's ship. Theseus, King of Athens, and
his men are plying the waters of the Mediterranean. Each day a sailor
replaces a wooden plank in the ship. After nearly five years, every plank has been
replaced. Are Theseus and his men still sailing in the same ship that was launched five
years earlier on the Mediterranean? "Of course," most will answer. But suppose that as
each original plank was removed, it was taken ashore and repositioned exactly as it had
been on the waters, so that at the end of five years, there exists a ship on shore, every
plank of which once stood in exactly the same relationship to every other in what five
years earlier had been Theseus's ship. Is this ship on shore (which we could easily launch
if we so chose) Theseus's ship? Or is the ship sailing the Mediterranean with all of its new
planks the same ship that we originally regarded as Theseus's ship? The answer depends
on what we understand the concept of "same" to mean.
Consider an even more troubling example that stems from the problem of the persistence
of personal identity. How do I know that I am that person who I was yesterday, or last
year, or twenty-five years ago? Why would an old high-school friend say that I am Gene
Glass, even though hundreds, no, thousands of things about me have changed since high
school? Probably no cells are in common between this organism and the organism that
responded to the name "Gene Glass" forty years ago; I can assure you that there are few
attitudes and thoughts held in common between these two organisms (or is it one
organism?). Why, then, would an old high-school friend, suitably prompted, say without
hesitation, "Yes, this is Gene Glass, the same person I went to high school with." Nozick
argued that the only sense in which personal identity survives across time is in the sense
of what he called "the closest related continuer." I am still recognized as Gene Glass to
those who knew me then because I am that thing most closely related to that person to
whom they applied the name "Gene Glass" over forty years ago. Now notice that implied
in this concept of the "closest related continuer" are notions of distance and relationship.
Nozick was quite clear that these concepts had to be given concrete definition to
understand how in particular instances people use the concept of identity. In fact, to
Nozick's way of thinking, things are compared by means of weighted functions of
constituent factors, and their "distance" from each other is "calculated" in many instances in
a Euclidean way.
Consider Theseus's ship again. Is the ship sailing the seas the "same" ship that Theseus
launched five years earlier? Or is the ship on the shore made of all the original planks
from that first ship the "same" as Theseus's original ship? If I give great weight to the
materials and the length of time those materials functioned as a ship (i.e., to displace
water and float things), then the vessel on the shore is the closest related continuation of
what historically had been called "Theseus's ship." But if, instead, I give great weight to
different factors such as the importance of the battles the vessel was involved in (and
Theseus's big battles were all within the last three years), then the vessel that now floats
on the Mediterranean (not the ship on the shore made up of Theseus's original planks) is
Theseus's ship, and the thing on the shore is old spare parts.
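To make this concrete, here is a minimal sketch (my own illustration, not anything Nozick wrote) of how a "closest related continuer" might be computed as a weighted Euclidean distance over a few invented continuity features; changing the weights changes which candidate wins, exactly as in the ship example above.

```python
import numpy as np

def closest_related_continuer(ideal, candidates, weights):
    """Return the candidate with the smallest weighted Euclidean distance to the ideal continuer."""
    ideal = np.asarray(ideal, dtype=float)
    weights = np.asarray(weights, dtype=float)
    distances = {
        name: float(np.sqrt(np.sum(weights * (np.asarray(vec, dtype=float) - ideal) ** 2)))
        for name, vec in candidates.items()
    }
    return min(distances, key=distances.get), distances

# Invented continuity features: [material continuity, continuity of service, share of recent battles].
perfect_continuer = [1.0, 1.0, 1.0]
candidates = {
    "ship on shore":    [1.0, 0.2, 0.1],  # all original planks, little recent service or battle history
    "ship still afloat": [0.3, 1.0, 1.0],  # new planks, unbroken service, all the recent battles
}

# Weight the materials heavily: the ship on shore is the closest related continuer.
print(closest_related_continuer(perfect_continuer, candidates, weights=[10.0, 0.1, 0.1]))

# Weight service and battle history heavily: the ship still afloat wins instead.
print(closest_related_continuer(perfect_continuer, candidates, weights=[0.1, 5.0, 5.0]))
```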
So here was Nozick saying that the fundamental riddle of how two things could be the
same ultimately resolves itself into an empirical question involving observable factors,
weighted in various combinations, to determine the closest related continuer.
The question of "sameness" is not an a priori question at all; strict sameness being a
logical impossibility, it is an empirical question. For us, no two "studies" are the same.
All studies differ and the only interesting questions to ask about them concern how they
vary across the factors we conceive of as important. This notion is not fully developed
here and I will return to it later.
The "Flat Earth" Criticism
I may not be the best person to critique meta-analysis, for obvious reasons. However, I
will cop to legitimate criticisms of the approach when I see them, and I haven't seen
many. But one criticism rings true because I knew at the time that I was being forced into
a position with which I wasn't comfortable. Permit me to return to the psychotherapy
meta-analysis.
Eysenck was, as I have said, a nettlesome critic of the psychotherapy establishment in the
1960s and 1970s. His exaggerated and inflammatory statements about psychotherapy
being worthless (no better than a placebo) were not believed by psychotherapists or
researchers, but they were not being effectively rebutted either. Instead of taking him
head-on, as my colleagues and I attempted to do, researchers such as Gordon Paul
argued that the question of whether psychotherapy was effective was
fundamentally meaningless. Rather, Paul asserted (and many others assented), the only
legitimate research question was "What type of therapy, with what type of client,
produces what kind of effect?" I confess that I found this distracting dodge as frustrating
as I found Eysenck's blanket condemnation. Here was a critic (Eysenck) saying that all
psychotherapists are either frauds or gullible, self-deluded incompetents, and the
establishment's response is to assert that he is not making a meaningful claim. Well, he
was making a meaningful claim; and I already knew enough from the meta-analysis of
the outcome studies to know that Paul's question was unanswerable due to insufficient
data, and that researchers were showing almost no interest in collecting the kind of data
that Paul and others argued were the only meaningful data.
It fell to me, I thought, to argue that the general question "Is psychotherapy effective?" is
meaningful and that psychotherapy is effective. Such generalizations (across types of
therapy, types of client and types of outcome) are meaningful to many people (policy
makers, average citizens) if not to psychotherapy researchers or psychotherapists
themselves. It was not that I necessarily believed that different therapies did not have
different effects for different kinds of people; rather, I felt certain that the available
evidence, tons of it, did not establish with any degree of confidence what these
differential effects were. It was safe to say that in general psychotherapy works on many
things for most people, but it was impossible to argue that this therapy was better than
that therapy for this kind of problem. (I might add that twenty years after the publication
of The Benefits of Psychotherapy, I still have not seen compelling answers to
Paul's questions, nor is there evidence of researchers having any interest in answering
them.)
The circumstances of the debate, then, put me in the position of arguing, circa 1980, that
there are very few differences among various ways of treating human beings and that, at
least, there is scarcely any convincing experimental evidence to back up claims of
differential effects. And that policy makers and others hardly need to waste their time
asking such questions or looking for the answers. Psychotherapy works; all types of
therapy work about equally well; support any of them with your tax dollars or your
insurance policies. Class size reductions work: very gradually at first (from 30 to 25 say)
but more impressively later (from 15 to 10); they work equally for all grades, all subjects,
all types of student. Reduce class sizes, and it doesn't matter where or for whom.
Well, one of my most respected colleagues took me to task for this way of
thinking and using social science research. In a beautiful and important paper entitled
"Prudent Aspirations for the Social Sciences," Lee Cronbach chastised his profession for
promising too much and chastised me for expecting too little. He lumped me with a small
group of like-minded souls into what he named the "Flat Earth Society," i.e., a group of
people who believe that the terrain that social scientists explore is featureless, flat, with
no interesting interactions or topography. All therapies work equally well; all tests predict
success to about the same degree; etc.:
"...some of our colleagues are beginning to sound like a kind of Flat
Earth Society. They
tell us that the world is essentially simple: most social phenomena are adequately
described by linear relations; one-parameter scaling can discover coherent variables
independent of culture and population; and inconsistencies among studies of the same
kind will vanish if we but amalgamate a sufficient number of studies.... The Flat Earth
folk seek to bury any complex hypothesis with an empirical bulldozer" (Cronbach, 1982,
p. 70).
Cronbach's criticism stung because it was on target. In attempting to refute Eysenck's
outlandishness without endorsing the psychotherapy establishment's obfuscation, I had
taken a position of condescending simplicity. A meta-analysis will give you the BIG
FACT, I said; don't ask for more sophisticated answers; they aren't there. My own work
tended to take this form, and much of what has ensued in the past 25 years has regrettably
followed suit. Effect sizes (if it is experiments that are at issue) are calculated, classified
in a few ways, perhaps, and all their variability is then averaged across. Little effort is
invested in trying to plot the complex, variegated landscape that most likely underlies our
crude averages.
Consider an example that may help illuminate these matters. Perhaps the most
controversial conclusion from the psychotherapy meta-analysis that my colleagues and I
published in 1980 was that there was no evidence favoring behavioral psychotherapies
over non-behavioral psychotherapies. This finding was vilified by the behavioral therapy
camp and praised by the Rogerians and Freudians. Some years later, prodded by
Cronbach's criticism, I returned to the database and dug a little deeper. What I found
appears in Figure 2. When, for the nine experiments extant in 1979 (and I would be surprised
if there are many more now) in which behavioral and non-behavioral psychotherapies were
compared in the same experiment between randomized groups, the effects of
treatment are plotted as a function of follow-up time, the two curves in Figure 2 result.
The findings are quite extraordinary and suggestive. Behavioral therapies produce large
short-term effects which decay in strength over the first year of follow-up; non-
behavioral therapies produce initially smaller effects which increase over time. The two
curves appear to be converging on the same long-term effect. I leave it to the reader to
imagine why. One answer, I suspect, is not arcane and is quite plausible.
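The kind of plot behind Figure 2 is easy to sketch. The numbers below are invented purely to show the shape of such an analysis (they are not the data from the 1980 meta-analysis); the idea is simply to average effect sizes within follow-up intervals separately for the two classes of therapy and plot them against follow-up time.

```python
import matplotlib.pyplot as plt

# Invented illustrative values only -- not the actual data behind Figure 2.
followup_months = [0, 3, 6, 12]
behavioral_effects = [0.95, 0.80, 0.70, 0.60]      # large initial effect, decaying over follow-up
non_behavioral_effects = [0.40, 0.48, 0.55, 0.58]  # smaller initial effect, growing over follow-up

plt.plot(followup_months, behavioral_effects, marker="o", label="behavioral therapies")
plt.plot(followup_months, non_behavioral_effects, marker="s", label="non-behavioral therapies")
plt.xlabel("Follow-up time (months)")
plt.ylabel("Mean effect size")
plt.legend()
plt.show()
```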
Figure 2, I believe, is truer to Cronbach's conception of reality and how research, even
meta-analysis, can lead us to a more sophisticated understanding of our world. Indeed,
the world is not flat; it encompasses all manner of interesting hills and valleys, and in
general, averages do not do it justice.
Extensions of Meta-analysis
In the twenty-five years between the first appearance of the word "meta-analysis" in print
and today, there have been several attempts to modify the approach, or advance
alternatives to it, or extend the method to reach auxiliary issues. If I may be so cruel, few
of these efforts have added much. One of the hardest things to abide in following the
developments in meta-analysis methods in the past couple of decades was the frequent
observation that what I had contributed to the problem of research synthesis was the idea
of dividing mean differences by standard deviations. "Effect sizes," as they are called,
had been around for decades before I opened my first statistics text. Having to read that
"Glass has proposed integrating studies by dividing mean differences by standard
deviations and averaging them" was a bitter pill to swallow. Some of the earliest work
that I and my colleagues did involved using a variety of outcome measures to be analyzed
and synthesized: correlations, regression coefficients, proportions, odds ratios. Well, so
be it; better to be mentioned in any favorable light than not to be remembered at all.
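For readers who have not seen it spelled out, here is a minimal sketch of the calculation just described: a mean difference divided by a standard deviation (here the control group's, one common choice), averaged across studies. The variable names and numbers are mine, invented for illustration.

```python
import numpy as np

def effect_size(treatment_mean, control_mean, control_sd):
    """Standardized mean difference: (treatment mean - control mean) / control-group SD."""
    return (treatment_mean - control_mean) / control_sd

# Each tuple is one study's summary statistics (hypothetical numbers).
studies = [
    (52.0, 47.0, 10.0),
    (101.5, 98.0, 7.0),
    (3.4, 2.9, 1.2),
]

effects = [effect_size(t, c, sd) for t, c, sd in studies]
print("per-study effect sizes:", np.round(effects, 2))
print("mean effect size:", round(float(np.mean(effects)), 2))
```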
After all, this was not as hard to take as newly minted confections such as "best evidence
research synthesis," a come-lately contribution that added nothing whatsoever to what
I and many others had been saying repeatedly on the question of whether meta-
analyses should use all studies or only "good" studies. I remain staunchly committed to
the idea that meta-analyses must deal with all studies, good, bad, and indifferent, and that
their results are only properly understood in the context of each other, not after having
been censored by some a priori set of prejudices. An effect size of 1.50 for 20
studies employing randomized groups has a whole different meaning when 50 studies
using matching show an average effect of 1.40 than if 50 matched groups studies show an
effect of -.50, for example.
Statistical Inference in Meta-analysis
The appropriate role for inferential statistics in meta-analysis is not merely unclear, it has
been seen quite differently by different methodologists in the 25 years since meta-
analysis appeared. In 1981, in the first extended discussion of the topic (Glass, McGaw
and Smith, 1981), I raised doubts about the applicability of inferential statistics in meta-
analysis. Inference at the level of persons within studies seemed quite unnecessary, since even a
modest-sized synthesis will involve a few hundred persons (nested within studies) and lead
to nearly automatic rejection of null hypotheses. Moreover, the chances are remote that
the persons or subjects within studies were drawn from defined populations with anything
even remotely resembling probabilistic techniques. Hence, probabilistic calculations
advanced as if subjects had been randomly selected would be dubious. At the level of
"studies," the question of the appropriateness of inferential statistics can be posed again,
and the answer again seems to be negative. There are two instances in which common
inferential methods are clearly appropriate, not just in meta-analysis but in any research:
1) when a well-defined population has been randomly sampled, and 2) when subjects
have been randomly assigned to conditions in a controlled experiment. In the latter case,
Fisher showed how the permutation test can be used to make inferences to the universe of
all possible permutations. But this case is of little interest to meta-analysts who never
assign units to treatments. Moreover, the typical meta-analysis virtually never meets the
condition of probabilistic sampling of a population (though in one instance (Smith, Glass
& Miller, 1980), the available population of psychoactive drug treatment experiments
was so large that a random sample of experiments was in fact drawn for the meta-
analysis). Inferential statistics has little role to play in meta-analysis.
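For concreteness, here is a minimal sketch of the permutation (randomization) test mentioned above as one of the two clearly appropriate cases: within a single randomized two-group experiment, the group labels are reshuffled many times and the observed mean difference is referred to the distribution of differences under reshuffling. The data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcome scores from one randomized two-group experiment.
treatment = np.array([12.1, 9.8, 11.4, 13.0, 10.7, 12.6])
control = np.array([9.9, 8.7, 10.2, 9.5, 11.0, 8.9])

observed = treatment.mean() - control.mean()
pooled = np.concatenate([treatment, control])
n_treat = len(treatment)

n_perms = 10_000
count = 0
for _ in range(n_perms):
    perm = rng.permutation(pooled)                      # reshuffle labels
    diff = perm[:n_treat].mean() - perm[n_treat:].mean()
    if abs(diff) >= abs(observed):
        count += 1

print(f"observed difference = {observed:.2f}, two-sided permutation p = {count / n_perms:.4f}")
```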
It is common to acknowledge, in meta-analysis and elsewhere, that many data sets
fail to meet probabilistic sampling conditions, and then to argue that one ought to treat
the data in hand "as if" it were a random sample of some hypothetical population. One
must be wary here of the slide from "hypothesis about a population" into "a hypothetical
population." They are quite different things, the former being standard and
unobjectionable, the latter being a figment with which we hardly know how to deal.
Under this stipulation that one is making inferences not to some defined or known
population but a hypothetical one, inferential techniques are applied and the results
inspected. The direction taken mirrors some of the earliest published opinion on this
problem in the context of research synthesis, expressed, for example, by Mosteller and
his colleagues in 1977: "One might expect that if our MEDLARS approach were perfect
and produced all the papers we would have a census rather than a sample of the papers.
To adopt this model would be to misunderstand our purpose. We think of a process
producing these research studies through time, and we think of our sample (even if it
were a census) as a sample in time from the process. Thus, our inference would still be to
the general process, even if we did have all appropriate papers from a time period."
(Gilbert, McPeek and Mosteller, 1977, p. 127; quoted in Cook et al., 1992, p. 291). This
position is repeated in slightly different language by Larry Hedges in Chapter 3
"Statistical Considerations" of the Handbook of Research Synthesis (1994): "The universe
is the hypothetical collection of studies that could be conducted in principle and about
which we wish to generalize. The study sample is the ensemble of studies that are used in
the review and that provide the effect size data used in the research synthesis." (p. 30)
These notions appear to be circular. If the sample is fixed and the population is
allowed to be hypothetical, then surely the data analyst will imagine a population that
resembles the sample of data. If I show you a handful of red and green M&Ms, you will
naturally assume that I have just drawn my hand out of a bowl of mostly red and green
M&Ms, not red and green and brown and yellow ones. Hence, all of these "hypothetical
populations" will be merely reflections of the samples in hand and there will be no need
for inferential statistics. Or put another way, if the population of inference is not defined
by considerations separate from the characterization of the sample, then the population is
merely a large version of the sample. With what confidence is one able to generalize the
character of this sample to a population that looks like a big version of the sample? Well,
with a great deal of confidence, obviously. But then, the population is nothing but the
sample writ large and we really know nothing more than what the sample tells us in spite
of the fact that we have attached misleadingly precise probability numbers to the result.
Hedges and Olkin (1985) have developed inferential techniques that ignore the pro
forma testing (because of large N) of null hypotheses and focus on the estimation of
regression functions that estimate effects at different levels of study. They worry about
both sources of statistical instability: that arising from persons within studies and that
which arises from variation between studies. The techniques they present are based on
traditional assumptions of random sampling and independence. It is, of course, unclear to
me precisely how the validity of their methods is compromised by failure to achieve
probabilistic sampling of persons and studies.
The irony of traditional hypothesis testing approaches applied to meta-analysis is that
whereas consideration of sampling error at the level of persons always leads to a pro
forma rejection of "null hypotheses" (of zero correlation or zero average effect size),
consideration of sampling error at the level of study characteristics (the study, not the
person as the unit of analysis) leads to too few rejections (too many Type II errors, one
might say). Hedges's homogeneity test of the hypothesis that all studies in a group
estimate the same population parameter is frequently seen in published meta-analyses these
days. Once a hypothesis of homogeneity is accepted by Hedges's test, one is advised to
treat all studies within the ensemble as the same. Experienced data analysts know,
however, that there is typically a good deal of meaningful covariation between study
characteristics and study findings even within ensembles where Hedges's test cannot
reject the homogeneity hypothesis. The situation is parallel to the experience of
psychometricians discovering that they could easily interpret several more common
factors than inferential solutions (maximum likelihood; LISREL) could confirm. The
best data exploration and discovery are more complex and convincing than the most
exact inferential test. In short, classical statistics seems not able to reproduce the complex
cognitive processes that are commonly applied with success by data analysts.
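For the curious reader, a minimal sketch of the homogeneity statistic referred to here follows: effect sizes are weighted by the inverses of their sampling variances, and the weighted sum of squared deviations (Q) is referred to a chi-square distribution with k - 1 degrees of freedom. The numbers are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def q_homogeneity(effect_sizes, variances):
    """Q statistic for homogeneity of k effect sizes, referred to chi-square with k-1 df."""
    d = np.asarray(effect_sizes, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)   # inverse-variance weights
    d_bar = np.sum(w * d) / np.sum(w)              # weighted mean effect size
    q = np.sum(w * (d - d_bar) ** 2)
    p = chi2.sf(q, df=len(d) - 1)
    return q, p

# Hypothetical effect sizes and sampling variances for five studies.
q, p = q_homogeneity([0.35, 0.50, 0.10, 0.62, 0.41], [0.04, 0.05, 0.03, 0.06, 0.04])
print(f"Q = {q:.2f}, p = {p:.3f}")
```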
Donald Rubin (1990) addressed some of these issues squarely and articulated a
position that I find very appealing: "...consider the idea that sampling and
representativeness of the studies in a meta-analysis are important. I will claim that this is
nonsense; we don't have to worry about representing a population but rather about other
far more important things." (p. 155) These more important things to Rubin are the
estimation of treatment effects under a set of standard or ideal study conditions. This
process, as he outlined it, involves the fitting of response surfaces (a form of quantitative
model building) between study effects (Y) and study conditions (X, W, Z etc.). I would
only add to Rubin's statement that we are interested in not merely the response of the
system under ideal study conditions but under many conditions having nothing to do with
an ideally designed study, e.g., person characteristics, follow-up times and the like.
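A minimal sketch of the response-surface idea might look like the following: regress study effect sizes on study-level conditions (treatment intensity, follow-up time, and the like) and use the fitted surface to estimate the effect at any combination of conditions of interest, idealized or not. The variables and data below are invented for illustration, not Rubin's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical study-level data: effect size Y, treatment hours X, follow-up months W.
n_studies = 40
X = rng.uniform(5, 30, n_studies)        # hours of treatment
W = rng.uniform(0, 24, n_studies)        # follow-up time in months
Y = 0.2 + 0.02 * X - 0.01 * W + rng.normal(0, 0.1, n_studies)

# Fit a simple response surface Y = b0 + b1*X + b2*W + b3*X*W by least squares.
design = np.column_stack([np.ones(n_studies), X, W, X * W])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)

# Estimate the effect under a chosen (possibly idealized) combination of conditions.
x_star, w_star = 25.0, 6.0
y_hat = coef @ np.array([1.0, x_star, w_star, x_star * w_star])
print(f"estimated effect at {x_star} hours of treatment, {w_star}-month follow-up: {y_hat:.2f}")
```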
The great majority of meta-analyses are undertaken in pursuit not of scientific theory but
of technological evaluation. The evaluation question is never whether some hypothesis or
model is accepted or rejected but rather how "outputs" or "benefits" or "effect sizes" vary
from one set of circumstances to another; and the meta-analysis rarely works on a
collection of data that can sensibly be described as a probability sample from anything.