The Three Faces of Bayes

Last summer, I was at a conference having lunch with Hal Daume III when we got to talking about how “Bayesian” can be a funny and ambiguous term. It seems like the definition should be straightforward: “following the work of English mathematician Rev. Thomas Bayes,” perhaps, or even “uses Bayes’ theorem.” But many methods bearing the reverend’s name or using his theorem aren’t even considered “Bayesian” by his most religious followers. Why is it that Bayesian networks, for example, aren’t considered… y’know… Bayesian?
As I’ve read more outside the fields of machine learning and natural language processing — from psychometrics and environmental biology to hackers who dabble in data science — I’ve noticed three broad uses of the term “Bayesian.” I mentioned to Hal that I wanted to blog about these different uses, and he said it probably would have been more useful about six years ago when being “Bayesian” was all the rage (being “deep” is where it’s at these days). I still think these are useful distinctions, though, so here it is anyway.
I’ll present the three main uses of “Bayesian” as I understand them, all through the lens of a naïve Bayes classifier. I hope you find it useful and interesting!
A Theorem By Any Other Name…
First off, Bayes’ theorem (in some form) is involved in all three takes on “Bayesian.” This 250-year-old staple of statistics gives us a way to estimate the probability of some outcome of interest $A$ given some evidence $B$:

$$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)}$$

If we care about inferring $A$ from $B$, Bayes’ theorem says this can be done by estimating the joint probability $P(A, B)$ and then dividing by the marginal probability $P(B)$ to get the conditional probability $P(A \mid B)$. The second equivalence follows from the chain rule: $P(A, B) = P(B \mid A)\,P(A)$. $P(A)$ is called the prior distribution over $A$, and $P(A \mid B)$ is called the posterior distribution over $A$ after having observed $B$.
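To make the theorem concrete, here is a tiny numeric sketch of a posterior computation. The probabilities are made up purely for illustration, and the marginal $P(B)$ is expanded by the law of total probability:

```python
# Toy numbers (assumed for illustration only):
# prior P(A) = 0.3, likelihood P(B | A) = 0.8, and P(B | not A) = 0.1.
p_a = 0.3
p_b_given_a = 0.8
p_b_given_not_a = 0.1

# Marginal P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A | B) = P(B | A) * P(A) / P(B).
p_a_given_b = p_b_given_a * p_a / p_b

print(round(p_a_given_b, 4))  # → 0.7742
```

Note how the evidence shifts us from a prior of 0.3 to a posterior of about 0.77: exactly the update Bayes’ theorem licenses.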
1. Bayesians Against Discrimination
Now briefly consider this photo from the Pittsburgh G20 protests in 2009. That’s me carrying a sign that says “Bayesians Against Discrimination” behind the one and only John Oliver. I don’t think he realized he was in satirical company at the time (photo by Arthur Gretton, more ML protest photos here).

To get the joke, you need to grasp the first interpretation of “Bayesian”: a model that uses Bayes’ theorem to make predictions given some data…
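Since the running example here is a naïve Bayes classifier, a minimal sketch of that first sense of “Bayesian” might look like the following. The training data, labels, and smoothing choice are all my own assumptions for illustration; the model simply picks the label maximizing $P(\text{label}) \prod_w P(w \mid \text{label})$, with add-one smoothing:

```python
import math
from collections import Counter, defaultdict

# Toy training data (entirely made up): (label, words) pairs.
train = [
    ("spam", ["buy", "cheap", "pills"]),
    ("spam", ["cheap", "pills", "now"]),
    ("ham",  ["meeting", "at", "noon"]),
    ("ham",  ["lunch", "meeting", "tomorrow"]),
]

label_counts = Counter(label for label, _ in train)
word_counts = defaultdict(Counter)
for label, words in train:
    word_counts[label].update(words)
vocab = {w for _, words in train for w in words}

def predict(words):
    # Score each label by log P(label) + sum of log P(word | label),
    # with add-one (Laplace) smoothing over the vocabulary.
    scores = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / len(train))
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict(["cheap", "pills"]))  # → spam
```

Nothing here is “Bayesian” in the stricter senses discussed later: it just uses Bayes’ theorem to turn class-conditional word probabilities into a prediction, which is the whole joke on the sign.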