Simply Statistics

Hurricane María official death count in conflict with mortality data

Sun, 03 Dec 2017 00:00:00 +0000

A recent preprint by Alexis R. Santos-Lozada and Jeffrey T. Howard concludes that

The mortality burden may [be] higher than official counts, and may exceed the current official death toll by a factor of 10.

The authors used monthly death records from the Puerto Rico Vital Statistics system from 2010 to 2016. Although data for 2017 was apparently not available, they extracted data from a statement made by Héctor Pesquera, the Secretary of Public Safety:

The number of deaths for September 2017 is 2,838, with 95% of the deaths processed by the Puerto Rico Department of Health.

Their final conclusions rely on assumptions and methodology needed to predict October figures. But just by looking at the data, we can see that the official figure of 55 deaths appears to be way off.

To create this plot, I downloaded the Microsoft Word version of the preprint, converted it to PDF, then scraped the data from Table 1. Because there is month-to-month variability in total deaths in Puerto Rico, I computed the difference between each data point and the average for their respective month. The September 2017 data point is a clear outlier, 455 deaths above average, and is well beyond 55 deaths above the largest deviation from the monthly average. Keep in mind that Hurricane María hit Puerto Rico on September 20th, so only 10 days account for the observed difference. The official figure includes September and October so it covers at least 40 days.
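The month-adjusted difference described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the code used for the plot: the function name `monthly_excess`, the choice of which years form the baseline average, and the death counts in the example are all made up for demonstration.

```python
from collections import defaultdict
from statistics import fmean

def monthly_excess(records, baseline_years):
    """Return deaths minus the baseline average for the same calendar month.

    records maps (year, month) -> total deaths.
    baseline_years is an assumed choice of which years define "average".
    """
    by_month = defaultdict(list)
    for (year, month), deaths in records.items():
        if year in baseline_years:
            by_month[month].append(deaths)
    monthly_mean = {month: fmean(values) for month, values in by_month.items()}
    return {
        (year, month): deaths - monthly_mean[month]
        for (year, month), deaths in records.items()
    }

# Hypothetical counts, NOT the Puerto Rico data from the preprint.
records = {
    (2015, 9): 100, (2016, 9): 110, (2017, 9): 150,
    (2015, 10): 120, (2016, 10): 124,
}
excess = monthly_excess(records, baseline_years={2015, 2016})
for key in sorted(excess):
    print(key, excess[key])
```

Subtracting the per-month average, rather than a single overall average, keeps ordinary seasonal variation from being mistaken for an unusual month.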

Below is a plot of the total deaths, which the preprint shows in Figure 2.

Note that 55 was the official figure at the time the preprint was written.

Some roadblocks to the broad adoption of machine learning and AI

Mon, 27 Nov 2017 00:00:00 +0000

I read two blog posts on AI over the Thanksgiving break. One was a nice post discussing the challenges for AI in medicine by Luke Oakden-Rayner and the other was about the need for increased focus on basic research in AI motivated by AlphaGo by Tim Harford.

I’ve had a lot of interactions with people lately who want to take advantage of machine learning/AI in their research or business. Despite the excitement around AI and the impressive results we see from sophisticated research teams almost daily, the actual extent and application of AI is much smaller. In fact, most AI usually ends up being humans in the end.

While the promise of AI/ML has never been clearer, there are still only a handful of organizations that are using the technology in a major way. Sometimes even apparent success stories turn out to be problematic.

I was thinking about the lifecycle of developing an AI application. I have defined this type of application previously as having three parts: (i) an interface to humans, (ii) a data set, and (iii) an algorithm for turning the data into interactions. I started thinking about the extension of this idea to the development of an AI application and all the steps involved. Then I started thinking about potential barriers.

To develop an AI application you need a few things:

A group of people who are willing to let you have their data.
A technology for data capture from people (this could be as simple as a website, or an Echo, or as complex as a robot).
A data storage mechanism for collecting the raw data from this input (this could just be a database).
A set of algorithms and scripts for organizing the data for analysis.
A definition of the problem you’d like to solve in quantitative terms - usually generated through exploratory analysis.
An algorithm trained on a massive data set or at minimum trained with a good prior or expert knowledge.
A way to structure the responses and provide feedback either to the original users of your application or to other users (researchers or executives at a company for example).
The pipeline to take those formatted responses and return them to the user in a way that they can take advantage of.

I think that a lot of attention is focused on the algorithm (the sixth item in the list above) and how costly talent is for designing AI algorithms. I think for the big players, where a lot of the other steps have been solved, this is for sure the limiting factor, and it is no wonder that the talent war is fierce.

But I think that for 95% of organizations, whether they be researchers, businesses, or individuals, the problem isn’t in developing the algorithm. A random forest can be fit with one line of R code and while it won’t be as accurate as an expertly trained neural network on a gigantic training data set, it will be really useful.

So I think that most of the roadblocks to the democratization of AI are actually in the other steps and in particular the “glue” between the steps. For example:

Getting access to people’s data (Step 0) - It is very hard, even for researchers, to get access to health information if you aren’t DeepMind and Google. It can take years to deal with the bureaucracy of getting access to even simple data sets.
Having the infrastructure to capture data (Step 1) - If you aren’t a major player you might not even be capturing complete data from people visiting your website, let alone sensors, images, text, and everything else you would want in order to perform AI.
Storing data centrally (Step 2) - In almost all of the organizations I’ve talked to, data are scattered and managed across multiple systems and with different protocols. Just knowing where and what the data are can be a multi-month process.
Tidying the data (Step 3) - There is an entire industry of data scientists built to tackle this problem, but if you can’t find the data or if the data isn’t stored centrally (Step 2), then this can be delayed. Even if it isn’t, there is rarely a standardized data tidying pipeline even in places that only have one data type, so that makes it hard to do the next step.
Defining a question AI can answer (Step 4) - I would venture to say this is maybe one of the biggest bottlenecks. To create an AI system as they currently exist a human needs to (i) define a concrete scientific/business problem, (ii) create a quantitative definition of that model, and (iii) define an objective function to optimize. This process can take a huge amount of expert/knowledgeable work.
Fitting the algorithm (Step 5) - While I think there are some commodity technologies that work well here, streamlining the process from modeling to implementation is probably where a lot of AI applications could use work. It can take a while just to get the model set up even after you know exactly what you want to fit.
Making the model output mean something to humans (Step 6) - Even in the cases where we ideally want the computer to do everything (self driving cars) we’d still like to summarize the choices the AI might make so humans can decide if they are ethical and how to regulate those decisions. But there is still a whole field of interpretable AI that needs to be developed and disseminated. So even places that have models built often struggle to communicate the results in a way that they can be used.
Automating the use of AI models (Step 7) - Even if you get past all those other hurdles and have a working, interpretable AI model, you need humans to use it. Whether that is a doctor using the output of a radiology scan to make diagnostic/treatment decisions, or a car that can actually drive, the last step of actually making the model useful is still a major barrier to many projects.

I think a lot of these barriers come down to the fact that for the most part we don’t have strict standards for data capture/tidying/organization/use that are used across organizations. We also don’t have the “glue” steps between each of these components automated. So while I think that the algorithms for AI will continue to develop rapidly in accuracy and range, for organizations to keep up they will need a lot more than just a way to fit the latest model. The reason that I think some organizations are leaping so far ahead is that they already have spent a huge amount of time thinking about all the Steps but the model fitting, so now they can focus their time/energy/resources on making algorithms do things we didn’t imagine were possible.

A few things that would reduce stress around reproducibility/replicability in science

Tue, 21 Nov 2017 00:00:00 +0000

I was listening to the Effort Report Episode on The Messy Execution of Reproducible Research where they were discussing the piece about Amy Cuddy in the New York Times. I think both the article and the podcast did a good job of discussing the nuances of the importance of reproducibility and the challenges of the social interactions around this topic. After listening to the podcast I realized that I see a lot of posts about reproducibility/replicability, but many of them are focused on the technical side. So I started to think about compiling a list of more cultural things we can do to reduce the stress/pressure around the reproducibility crisis.

I’m sure others have pointed these out in other places but I am procrastinating writing something else so I’m writing these down while I’m thinking about them :).

We can define what we mean by “reproduce” and “replicate”. Different fields have different definitions of the words reproduce and replicate. If you are publishing a new study we now have an R package that you can use to create figures that show what changed and what was the same between the original study and your new work. Defining concretely what was the same and different will reduce some of the miscommunication about what a reproducibility/replicability study means.
We can remember that replication is statistical, not deterministic. If you are doing a new study where you re-collect data according to a protocol from another group, you should not expect to get exactly the same answer. So if a result is statistically significant in one study and not significant in another, that may be within the bounds of what we’d expect to see.
We can remember that there is a difference between exploratory and confirmatory research. There is a reason that randomized trials are the basis for regulatory decisions by the FDA and others. But if we require every single study to meet the requirements of a pre-registered, randomized, double blind controlled trial with a huge sample size we might miss some important discoveries.
We can remember that a failed replication isn’t always a scientific failure. One thing Roger and Elizabeth point out is that many scientific studies won’t replicate, nor should we expect all studies to replicate. Sometimes a study is preliminary, exploratory, or the first observation of an unusual event. It may be a perfectly well executed study, and not replicate because the sample size was too small or there was an unmodeled confounder. This doesn’t mean that the scientific study was a failure, it just was an observation that didn’t pan out.
We can stop publicizing scientific results as solutions so quickly. University press offices, startup companies, and researchers stressed for funding are under pressure to label every discovery as a “cure”, a “diagnosis”, a “solution to the crisis”. A lot of the frustration in the scientific community arises from this overstatement of results. It is hard to escape this (for me too!), but we can exercise skepticism about claims of solutions on the basis of single scientific papers.
We can be persistent and private as long as possible. Like many people I’ve run into frustrating cases where data isn’t available from a paper that has been published. I have contacted the authors only to be rebuffed. I have found that it takes work to convince them to provide the data, but I can often do it without having to resort to publicizing the problems and trying to make it adversarial.
We can make the realization that data is valuable but in science you don’t own it. There is still discussion of data parasites and data symbionts. I have been both a data collector and a data analyst. I realize there is frustration in releasing your data and seeing others quickly publish ideas you may have had. At the same time I’ve seen how frustrating it can be to see people keep their data private and inaccessible indefinitely after publication. The reality is that people do deserve credit for collecting data, but that they don’t own the data they collect.
We should cut each other some slack. I think that a lot of the frustration around reproducibility and replicability can come from the way the problem is approached. On the one hand, if you publish a scientific paper and someone tries to reproduce or replicate your work you can realize that they are doing that because they are interested and try to help them. Even if they find some flaws (as they inevitably will) or the study doesn’t replicate (as it might not) that is not a failure by you. On the other hand if you are reproducing or replicating, remember that the scientist on the other end is a person. That person is subject to all the same funding, publication, and promotion stresses as everyone else. So we should try not to make the discovery of reproducibility/replicability problems high profile “gotchas” but focus on pointing out the real scientific issues that arise and trying to move forward in a positive way.
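The point above that replication is statistical, not deterministic, can be illustrated with a small simulation. This is a sketch under assumed settings, not an analysis of any real study: the effect size, sample size, unit variance, and the use of a two-sided z-test are all choices made for illustration, and the function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean

def study_is_significant(rng, true_effect, n, alpha=0.05):
    """Simulate one study of n unit-variance observations and test mean = 0."""
    sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    z = fmean(sample) * n ** 0.5          # z statistic with known sigma = 1
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha

def disagreement_rate(true_effect=0.3, n=50, pairs=2000, seed=1):
    """Fraction of original/replication pairs whose significance calls differ."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(pairs):
        original = study_is_significant(rng, true_effect, n)
        replication = study_is_significant(rng, true_effect, n)
        disagreements += original != replication
    return disagreements / pairs

rate = disagreement_rate()
print(f"Pairs with conflicting significance: {rate:.0%}")
```

Both studies in every pair measure the same real effect with the same protocol, yet under these assumed settings (power a little above one half) a large share of pairs disagree about significance purely because of sampling variability.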

Like many others, I have noticed that the stakes around reproducibility/replicability have been ratcheted up by social media and the rise in the prominence of the field of reproducibility. As someone who has experienced the real stress all of this can create in the life of real scientists, I’d love to see if we could move forward in a way that was more positive and collaborative while we address the explosion of data in the scientific enterprise.

Follow Up on Reasoning About Data

Mon, 20 Nov 2017 00:00:00 +0000

Sometimes, when I write a really long blog post, I forget what the point was at the end. I suppose I could just update the previous post…but that feels wrong for some reason.

I meant to make one final point in my last post about how better data analyses help you reason about the data. In particular, I meant to tie together the discussion about garbage collection to the section on data analysis. With respect to garbage collection in programming, I wrote

The promise of garbage collection is clear: the programmer doesn’t have to think about memory. Lattner’s criticism is that when building large-scale systems the programmer always has to think about memory, and as such, garbage collection makes it harder to do so. The goal of [Automatic Reference Counting] is to make it easier for other people to understand your code and to allow programmers to reason clearly about memory.

To pick this apart just a little bit, it’s tempting to think that “not having to think about memory management” is a worthy goal, but Lattner’s point is that it’s not. At least not right now. Better to make tools that will help people to think more effectively about this important topic (a “bicycle for the mind”).

The analogy for data analysis is that it’s tempting to think of the goal of methods development as removing the need to think about data and assumptions. The “ultimate” method is one where you don’t have to worry about distributions or nonlinearities or interactions or anything like that. But I don’t see that as the goal. Good methods, and good analyses, help us think about all those things much more efficiently. So what I might say is that

When doing large-scale data analyses, the data analyst always has to think about the data and assumptions, and as such, some approaches can actually make that harder to do than others. The goal of the good data analysis is to make it easier to reason about how the data are related to the result, relative to the assumptions you make about the data and the models.

Reasoning About Data

Thu, 16 Nov 2017 00:00:00 +0000

In my ongoing discussion in my mind about what makes for a good data analysis, one of the ideas that keeps coming back to me is this notion of being able to “reason about the data”. The idea here is that it’s important that a data analysis allow you to understand how the data, as opposed to other aspects of an analysis like assumptions or models, played a role in producing the outputs. I think for a given problem, some kinds of analysis do a better job of that than others.

To Collect Garbage or Not?

Programmers talk a lot about the importance of being able to “reason about code”, in part so that you can understand what’s going on and perhaps make modifications in the future. If there are two software solutions that produce equivalent results, one might prefer the one that allows you to reason about the code and about what the software is doing, even if it is less efficient. Aside from having a certain common sense logic to it, there are also important implications for maintaining the code.

Along these lines, a while back I was listening to one of my favorite podcasts, the Accidental Tech Podcast, when Chris Lattner made a surprise appearance. Chris previously created LLVM and worked at Apple where he developed the Swift programming language. On this episode, around the 1 hour and 56 minute mark, Lattner and John Siracusa get into a discussion about memory management models for programming languages. In particular, Siracusa asks Lattner to discuss garbage collection versus automatic reference counting (ARC), a system that Lattner developed while working at Apple.

John Siracusa: Objective-C had garbage collection, as you mentioned. Eventually, Objective-C dropped the garbage collection and got ARC [Automatic Reference Counting], and of course, Swift doesn’t have garbage collection at all. Can you talk about the trade-offs there and why Swift is the way it is?

Just as a quick summary, garbage collection is a system by which programmers can allocate memory for objects and not have to worry about de-allocating the memory when they’re done with it. The idea is that a separate process called the garbage collector periodically goes through all the memory that a program is using and checks to see if it is still being used by some aspect of the program. If the memory is not being used, the garbage collector marks the memory as available so that it can be allocated for something else. If the garbage collector never ran, then eventually all the memory would get allocated and nothing would free up.
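To make that description concrete, here is a toy mark-and-sweep collector in Python. It is a sketch of one classic garbage collection strategy, not a description of how R or any particular runtime implements it; the `Obj` class, the `collect` function, and the idea of a `heap` list with explicit `roots` are all simplifications invented for the example.

```python
class Obj:
    """A toy heap object that may hold references to other objects."""
    def __init__(self, name):
        self.name = name
        self.refs = []
        self.marked = False

def collect(heap, roots):
    """Mark everything reachable from the roots, then sweep the rest."""
    for obj in heap:
        obj.marked = False
    stack = list(roots)
    while stack:                      # mark phase: walk the reference graph
        obj = stack.pop()
        if not obj.marked:
            obj.marked = True
            stack.extend(obj.refs)
    live = [obj for obj in heap if obj.marked]
    freed = [obj.name for obj in heap if not obj.marked]   # sweep phase
    return live, freed

a, b, c, d = Obj("a"), Obj("b"), Obj("c"), Obj("d")
a.refs.append(b)          # a -> b, and a is still in use by the program
c.refs.append(d)          # c and d only reference each other,
d.refs.append(c)          # so nothing in the program can reach them
live, freed = collect([a, b, c, d], roots=[a])
print("still in use:", [obj.name for obj in live])
print("reclaimed:", freed)
```

The program never says when `c` and `d` should go away; the collector works that out whenever it happens to run.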

Here’s how Siracusa sells it:

Siracusa: Well, the idea [is] that memory management is completely out of the hands of the programmer, and that some magical fairy behind the scenes will make it all good for you. What you’re giving up, as you mentioned before…you lack some amount of control…

If you use R, you’re using an application that uses a garbage collector (originally written by Luke Tierney). If you’ve ever noticed that sometimes you run an operation in R that normally takes a short amount of time and it takes an unusually long time to run, that’s often because R had to stop for a second or two to let the garbage collector run in the background. This “stutter” is the bane of all applications that use garbage collection. You can actually force the garbage collector to run in R by calling the gc() function (although you should never have to do that in practice). I don’t think the “stutter” is a big issue in R, but it can be a big problem for applications that demand a fluid user experience.

One downside of garbage collection is that it adds a random aspect to program operation. Since you never know when the garbage collector is going to get called (that is based on runtime conditions), you never know when your program is going to be halted to allow the garbage collector to do its thing. The upside though is that programmers don’t have to worry about memory management, which is often the source of nasty bugs.

Automatic reference counting is essentially a fancier version of manual memory management, where programmers have to manually allocate and deallocate. But instead, with ARC the compiler does most of the work for you. In order to allow for this, certain restrictions must be placed on the language to allow the compiler to analyze programs and determine whether certain blocks of memory will be in use at a given time. The advantage here is that the analysis occurs at compile time, thus allowing for the running of a program to be deterministic. The downside, of course, is that the programmer has to do more deliberate thinking about memory.
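The difference between deterministic and runtime-dependent release can be seen in CPython, which happens to use both approaches. This is an analogy rather than ARC itself: CPython counts references at run time instead of having a compiler insert the bookkeeping, and the behavior shown is specific to the CPython implementation. The `Resource` class and the `events` log are made up for the demonstration.

```python
import gc

events = []   # records the order in which objects are destroyed

class Resource:
    def __init__(self, name):
        self.name = name
        self.partner = None

    def __del__(self):
        # Runs when the object is destroyed.
        events.append(self.name)

gc.disable()  # stop the cycle collector from running on its own schedule

solo = Resource("solo")
del solo                              # last reference gone: freed right here
after_del = list(events)

left, right = Resource("left"), Resource("right")
left.partner, right.partner = right, left
del left, right                       # the cycle keeps both counts above zero
after_cycle_del = list(events)

gc.collect()                          # only now does the collector find them
after_collect = sorted(events)
gc.enable()

print(after_del, after_cycle_del, after_collect)
```

With reference counting you can point to the exact line where `solo` goes away. The two objects in the cycle are released whenever the collector next runs, which in a real program is not something you control.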

A key question then, as posed by Lattner is this:

Chris Lattner: You said GC [garbage collection] means that you don’t have to think about memory. Is that true?

Frankly, before hearing the rest, I would have said “yes, it’s true”. But Lattner continues (emphasis added):

Lattner: The stutter problem, to me, isn’t really the issue, even though that’s what GC-haters will bring up all the time. It’s more about being able to reason about when the memory goes away. The most important aspect of that is that ARC gets rid of finalizers.

If you use a garbage-collected language, you use finalizers. Finalizers are the thing that gets run when your object gets destroyed. Finalizers have so many problems that there are entire bodies of work talking about how to work around problems with finalizers.

For example, the finalizer gets run on the wrong thread, it has to get run multiple times, the object can get resurrected while the finalizer’s running. It happens non-deterministically later. You can’t count on it, and so you can’t use it for resource management for database handles and things like that, for example. There are so many problems with finalizers that ARC just defines [it] away by having deterministic destruction.

Even without knowing what a finalizer is, it’s clear that there’s a lot of complexity involved with using a garbage collector. Lattner’s final point is critical:

Lattner: There’s this question of if you’re building a large scale system, do you want people to “never think about memory?” Do you want them to think about memory all the time, like they did in Objective-C’s classic manual retain-and-release? Or do you want something in the middle?

His argument is that ARC is that “something in the middle”.

Lattner: I think that ARC strikes a really interesting balance, whether it’s in Objective-C or Swift. I look at manual retain-and-release as being a very imperative style of memory management…where you’re telling the code, line by line…this is what you should do at this point in time.

ARC then takes that model and bubbles it up a big step, and it makes it be a very declarative model…. The cool thing about that to me is that not only does it get rid of the mechanics of maintaining reference counting and define away tons of bugs by doing that, it also means that it is now explicit in your code what your intention was. That’s something that people who maintain your code benefit from.

The promise of garbage collection is clear: the programmer doesn’t have to think about memory. Lattner’s criticism is that when building large-scale systems the programmer always has to think about memory, and as such, garbage collection makes it harder to do so. The goal of ARC is to make it easier for other people to understand your code and to allow programmers to reason clearly about memory.

What About Data?

What does any of this have to do with data analysis? Well, nothing, directly. But I think there is an analogous spectrum of techniques in data analysis. In programming, one can think of garbage collection (“no thinking”) on one end of the spectrum and manual memory management (“constantly thinking”) on the other end. With data analysis, there are certain black box-type procedures proposed on one end and, well, just an uncoordinated jumble of procedures on the other end of the spectrum.

When I mention “black box-type” procedures, I’m not referring to machine learning prediction models, which are often derided as “black boxes” because of their complexity. Rather, I’m thinking of a way of doing data analysis that focuses primarily on outcomes and results. With these kinds of approaches, the discussion is almost entirely focused on whether the outcomes are “correct” or not, which I think is misguided in many scientific contexts. With prediction models, at least initially, it makes sense to focus on outcomes. But with data analysis, we often want to know more about how the data are connected to those outcomes.

One famous example of how seemingly equivalent methods can diverge is Anscombe’s Quartet. Anscombe presented four datasets that have nearly identical summary statistics but appear very different when plotted. In particular, the correlations for the four datasets agree to two decimal places. My take from his paper is that computing a correlation coefficient provides a result and the correlation will provide you some information about the data. We could have endless discussions about whether this result is correct, whether it truly exists in the population, whether it might replicate in the next study, or whether we should have used some other correlation metric. But none of that matters once you make a scatterplot of the data. The scatterplots show you how the data inform the result. In this case, there are four different ways in which the data can inform a given correlation coefficient, some of which are perhaps more concerning than others.

The scatterplots in Anscombe’s Quartet allow us to reason about the data in a way that the correlation coefficient alone does not. They tell us more about what is going on with the data and aid us in interpreting the correlation coefficient. In this case, I think an analysis that includes the scatterplots is a better data analysis than one that does not. Of course, the scatterplots alone may not be sufficient because they don’t actually give you a number, which may be needed in subsequent analyses. But there’s nothing that says you can’t do both.
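For readers who want to check the numbers, here is a short Python sketch that computes the Pearson correlation for each of the four datasets. The values are the ones published in Anscombe's 1973 paper; the helper name `pearson` and the `QUARTET` dictionary are just conveniences for this example.

```python
from math import sqrt

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by datasets I-III
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(xs, ys):
    """Pearson correlation coefficient computed from first principles."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

correlations = {name: pearson(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, r in correlations.items():
    print(f"dataset {name}: r = {r:.2f}")
```

All four come out at about 0.82, which is exactly why the number on its own cannot tell you which of the four pictures you are looking at.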

This is more than simply saying that you should make plots (although you should). When you run a simple linear regression, there are two contributing parts: (a) the data and (b) the assumption of linearity. Reporting the regression coefficient gives you a sense of the strength of the relationship between the predictor and the response, but does that relationship arise from the data or from the model’s assumptions? Any analysis that can help the reader elicit the relative contributions of those two components is better than an analysis that doesn’t.
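One simple way to separate the two contributions is to look at the residuals from the fitted line. The sketch below fits ordinary least squares to the first two Anscombe datasets and prints the residuals in order of the predictor. The function names are hypothetical, and residual inspection is only one of several diagnostics one might use.

```python
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def least_squares(xs, ys):
    """Return (intercept, slope) of the ordinary least squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def residuals_by_x(xs, ys):
    """Map each x (the x values here are unique) to its residual, sorted by x."""
    intercept, slope = least_squares(xs, ys)
    return {x: y - (intercept + slope * x) for x, y in sorted(zip(xs, ys))}

slope_1 = least_squares(X, Y1)[1]
slope_2 = least_squares(X, Y2)[1]
resid_1 = residuals_by_x(X, Y1)
resid_2 = residuals_by_x(X, Y2)

print(f"slopes: {slope_1:.2f} and {slope_2:.2f}")
print("dataset I residuals: ", [round(r, 2) for r in resid_1.values()])
print("dataset II residuals:", [round(r, 2) for r in resid_2.values()])
```

The slopes are essentially the same, but the residuals for the second dataset are negative at both ends and positive in the middle. There the reported coefficient owes a lot to the assumption of linearity, while in the first dataset the residuals show no such pattern.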

Evaluating Data Analysis

Interestingly, Anscombe laid out three “myths” that he hoped to debunk:

Numerical calculations are exact, but graphs are rough;
For any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis;
Performing intricate calculations is virtuous, whereas actually looking at the data is cheating.

Unfortunately, I think these three ideas are alive and well in much of the world of data analysis, especially points 2 and 3. In discussions with various people about what makes for a good data analysis, I’ve found there’s a common focus on “correctness”. For example, if a data analysis leads to a result that later replicates, then the original analysis is a good one. It’s hard to be against correctness, but I think it’s misguided. In particular, the correctness of a scientific claim cannot usually be evaluated with a single study, so this would imply that you cannot evaluate the data analysis that went into making that claim. If a replication study occurs 10 years later, that would imply that we will be waiting 10 years to determine if a data analysis is good.

I think it should be possible to evaluate the quality of a data analysis once it’s done, not 10 years later when a replication study verifies a scientific claim. One way to do so is to ask whether the analysis allows us to reason about the data relative to another possible analysis. If we can answer the question, “How do the data in this study inform the result?” then I think we are in good shape.

How do you convince other people to use R?

Mon, 30 Oct 2017 00:00:00 +0000

I just got back from the rOpenSci OzUnconf that was run in Melbourne last week. I’d like to give a big thanks to the organizers (Nick Tierney, Di Cook, Rob Hyndman and others) for putting on a great unconference. These events are always a great opportunity to meet people just getting started in the R community and to get them involved.

As is typical for these unconferences, topic ideas were pitched via issues on the OzUnconf GitHub repo. One issue that I filed was titled “How do you pitch R to new users?”. While this issue was not taken up during the unconference (for good reason, I think), Nick was kind enough to initiate a lunch time discussion of the topic (with Daniel Falster kindly taking notes).

The topic came up in my mind because I’ve found that I’ve had to change the way that I “sell” R to new users. Through various discussions at the unconference and in many other venues, I’ve found that many R users, even today, are the “lone R user” in their group/institution/organization. While they may enjoy using R, it’s often made difficult by the fact that others in their group do not use R and therefore there is some negotiation over “how things get done”. Convincing others in the group to use R is one way to go (I suppose abandoning R is the other) and the question of how best to do this for different audiences is the question that arose in my mind. As a teacher, you can usually design the curriculum so that the students are forced to use R. But in most other environments, a different approach is needed.

As part of the introduction to the topic, I talked about how I used to convince others to use R. Bear in mind, this was almost 20 years ago and the majority of people I was talking to were using SAS, Stata, SPSS, Excel, or some other commercial package. These were largely interactive packages, perhaps with graphical user interfaces, designed to do more or less traditional statistical analyses. My pitch usually involved three things:

Free. R was both free as in cost and free as in free software. The free cost part made it a highly accessible package and the free software part allowed for anyone to tinker with the package, inspect its code, and make improvements.
Graphics. R was able to produce high quality “publication ready” graphics and it gave you detailed control over all the graphical elements. S-PLUS could also do this but S-PLUS didn’t come with the free part mentioned above.
Programming language. Unlike packages like SAS, Stata, or SPSS, R came with a robust and sophisticated Lisp-like programming language that was well-suited for data analysis applications. In addition, you could use it to build packages that could extend the core R system.

Much has changed since those early days and I’ve varied my pitch quite a bit to focus on a few different things (that didn’t exist back then). In particular, the audience has changed: I talk to many more people who are just getting started in data analysis and therefore are a bit more open-minded about which software to use. Also, Python has come on the scene as a viable alternative to R for data science and so there are even more arguments to consider. Some of the things I focus on now are

Reproducibility and Reporting. With the development of knitr and its combination with R Markdown, the writing of reproducible reports was made infinitely easier. (Markdown itself probably deserves its own discussion, but it’s not specifically R-related.)
RStudio. The development of the RStudio IDE has made getting started with R much easier. Having a powerful IDE was important to me for learning other languages and I’m glad R finally has something solid for itself. RStudio has significantly simplified the development of R packages via devtools and roxygen2. While it’s not yet perfect, these tools have changed what used to be a labor-intensive and finicky process into a more manageable and easier to learn work flow.
Graphics. R still has the ability to make great data graphics and with the introduction of ggplot2, it has become easier to make good graphics.
R Packages and Community. With over 10,000 packages on CRAN alone, there’s pretty much a package to do anything. More importantly, the people contributing those packages and the greater R community have expanded tremendously over time, bringing in new users and pushing R to be useful in more applications. Every year now there are probably hundreds if not thousands of meetups, conferences, seminars, and workshops all around the world, all related to R.

Discussion

At the unconference, a number of people had different approaches to how to convince others to use R. Here is just a summary:

Someone mentioned Bioconductor, which is a huge resource for those doing research in the world of high throughput biology. Not many other packages have something similar, so it makes an obvious selling point to people working in this area.
The idea of using R end-to-end came up, meaning using R to clean up messy data and taking it all the way to some interactive Shiny app on the other end. The idea that you can use the same tool to do all the things in between made for a compelling case for R.
For the spreadsheet audience, the dplyr package was sometimes a good selling point. The idea here was that you could show people how much time could be saved by automating analyses and using dplyr to clean up data.
The open source nature of R came up a few times, primarily as a means for developing transferable skills. If you work at a company/institution that specializes in some proprietary package, it’s often difficult to transfer those skills somewhere else if your new job doesn’t use that package. The fact that R is open source means that, in theory, you could use it anywhere and the skills that you build up in R are (again, in theory) applicable everywhere.
Someone mentioned that if you want to convince someone to learn/use R, just show them the multitude of jobs available to R programmers (and in particular, the salaries attached to them).
The fact that R is obtainable for free is still important, given that Matlab and SAS licenses have not gotten any cheaper over time. In my experience, this is particularly important in non-industrialized countries where for many people paying for expensive licenses is not an option.

This is just a brief summary of our discussion at the unconference and I was heartened to see all of the enthusiasm for R there. Even with R’s incredible growth over the last 20 years, there will still come a time when a case needs to be made to use R over something else. I’m just glad that we have so many more reasons today than we used to.

It Costs Money to Get It Right

Mon, 09 Oct 2017 00:00:00 +0000

On the latest episode of Not So Standard Deviations I talked with Hilary about Apple’s efforts to train machine learning algorithms in their Face ID technology in the iPhone X. The gist of Face ID is that it recognizes your face using a mathematical representation and then unlocks the phone when it can confirm that it is you. In its keynote presentation, Apple mentioned that it’s using machine learning to do this and even had developed its own custom chips to do the computations.

There were many questions regarding Face ID, including how exactly Apple had trained this machine learning model without it ever being released to the public. There were natural concerns that the system would only recognize certain kinds of faces, perhaps of certain ethnic backgrounds or of certain shapes or sizes. Two bits of evidence indicate that Apple likely spent a lot of money in this training process. The first bit comes from Craig Federighi, Senior Vice President for Software Engineering, in an interview with TechCrunch,

“Phil [Schiller] mentioned that we’d gathered a billion images and that we’d done data gathering around the globe to make sure that we had broad geographic and ethnic data sets. Both for testing and validation for great recognition rates,” says Federighi. “That wasn’t just something you could go pull off the internet.”

Especially given that the data needed to include a high-fidelity depth map of facial data. So, says Federighi, Apple went out and got consent from subjects to provide scans that were “quite exhaustive.” Those scans were taken from many angles and contain a lot of detail that was then used to train the Face ID system.

The second bit of information comes from the Accidental Tech Podcast where a listener wrote in to describe a study that he had recently enrolled in that took very detailed images of his face. Although there is no explicit connection, it sounds from the description as though this study could have been related to the Face ID efforts (you can hear Casey Liss discuss the study here). Apparently, the study was being conducted by Exponent, which from its web site appears to be a scientific and technical consulting firm.

Given the two bits of information, I don’t think it’s a stretch to say that

Apple decided it needed a huge dataset to train the Face ID algorithm and that no such dataset existed that they could access
Having no real desire to collect such data themselves, Apple outsourced the study to a third party company (in much the same way a biotech company might outsource a clinical trial to a clinical research organization) to collect the data to their specifications. The outsourcing also allowed Apple to shield its identity in the process so that they could keep the whole Face ID system secret.
The collected dataset (and the subsequently trained model) effectively serve as the “prior” distribution in a complex classification scheme that is adapted to a specific user once a user allows the iPhone X to collect their face data.

Apple claims that user data is not sent back to Apple for privacy reasons, so that the “big” model is not updated according to user data. Perhaps Apple will continue to collect data separately from user data, but there was no mention of that.

If any of this is true, there are a few things worth noting:

This is not the way things are done in Silicon Valley these days. A perhaps less costly way would be to give some app away for free (like a fun game or something similar), let the app take a bunch of pictures of you from different angles, and then gather all the data to train a model. Of course, such an approach would be subject to the selection bias that all approaches like this suffer from, because the data only represents the people whom the company could get to use the product.
It must have been very expensive to conduct this study, although that is less of a big deal when you have over $150 billion in net cash. That said, only a handful of companies could afford to do this and certainly not your average startup.
If the Face ID system works well and is capable of recognizing the diversity of faces that represent the iPhone customer base, then it would suggest that getting this kind of machine learning right costs a lot of money. In particular, it highlights the limitations of crowdsourcing-as-data-collection and it would suggest that old-fashioned ideas of sampling and experimental design are still needed once in a while.

Is It Research?

On Twitter, someone mused: wouldn’t it be nice if they released all this data to the public? Certainly, it would be a valuable and unique resource if they did and would likely spur a wealth of new innovations. Of course, this will almost certainly not happen, if only because Apple considers this dataset a key to their competitive advantage in the area of face detection, which they believe is the future of authentication. Why give this up to their competitors? Indeed, I don’t think anyone would expect Apple to give up this dataset any more than we might expect Google to give up data it collects from its search engine.

On the other hand, if I had conducted a study collecting similar face information (albeit on a smaller scale!) and published a paper about my findings about face morphology, there would likely be an expectation of me to release my data to the public along with the findings. And rightly so! Reproducibility is critical to moving science forward and releasing the data allows others to reproduce the findings. Furthermore, such a dataset would likely be useful to others’ investigations and would benefit all of science.

But why is there no expectation for Apple to release data from its study but there is an expectation on me to release data from my study?

In my opinion, Apple is not conducting research, even though it kind of looks like it is. Rather, Apple is doing product development, which requires some gathering of information as part of the process. The fact that the information gathering was done on such a large scale isn’t relevant. Just as any consumer packaged goods company might gather a focus group before developing their next toothpaste, Apple gathered face data in order to develop Face ID.

I often argue with people over whether companies like Google, Facebook, and Apple do research. My argument is that for the most part, they do not, because they are not interested in creating new knowledge. They do not make any specific public claims or inferences about the data they’ve gathered and so there isn’t really anything for them to defend. They are interested in taking whatever information they collected and channeling it into products. Yes, all of these companies occasionally publish a paper (I think Apple has a grand total of five), and I would say that those papers represent real research. But I would wager that those papers represent a small fraction of the work going on in those companies.

As a side note, Mark Neuman on Twitter suggested that this kind of work does qualify as research for the purposes of evaluating the ethical treatment of research subjects. I would have to agree here, and I would hope that all of these companies obtain proper informed consent from subjects before collecting their data (at least their lawyers spend a lot of time thinking about it). The fact that these companies may be collecting these data for product development and not for research doesn’t absolve them of the need to treat subjects properly.

Creating an expository graph for a talk

Mon, 02 Oct 2017 00:00:00 +0000

I’m co-teaching a data science class at Johns Hopkins with John Muschelli. I gave the lectures on EDA and he just gave a lecture on how to create an “expository graph”. When we teach the class, an exploratory graph is the kind of graph you make for yourself just to try to understand a data set. An expository graph is one where you are trying to communicate information to someone else.

When you are making an exploratory graph it is usually simple, with no axes, legends, fancy colors, or other effort to make it pretty, understandable and clear. John has a great blog post on how to build up a figure that is expository.

Recently I gave a talk at McGill University and needed to create a plot for the talk. I figured one more example is always better for everything, so I thought I’d go through my process here.

I wanted to show the p-value distribution from the tidypvals package. So first I loaded the data:

library(tidypvals)
library(ggridges)

## Loading required package: ggplot2

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(forcats)
data(allp)

I knew I wanted to use the ggridges package so I read the docs and started with the easiest version:

allp %>% 
  ggplot(aes(x = pvalue, y = field)) +
    geom_density_ridges()

## Picking joint bandwidth of 0.00413

Right away I saw there were some problems here. First of all, clearly a p-value greater than one shouldn’t be in there, so that was a mistake. I also don’t like that you can’t really see the values because most of the action is near zero.
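Before plotting, it can help to check how many impossible p-values there are. The following is only a sketch on a made-up data frame (`toy` and its values are hypothetical, chosen to mimic the `pvalue` and `field` columns used in this post); the same pipeline could be pointed at `allp`.

```r
library(dplyr)

# Toy stand-in for allp: same column names, invented values
toy <- data.frame(
  pvalue = c(0.001, 0.04, 1.2, 0.5, 3),
  field  = c("A", "A", "B", "B", "B")
)

# Count p-values outside [0, 1] in each field
bad <- toy %>%
  filter(pvalue < 0 | pvalue > 1) %>%
  count(field)

# Keep only the valid p-values
valid <- toy %>%
  filter(between(pvalue, 0, 1))

print(bad)
print(nrow(valid))
```

In the plots below the out-of-range values end up being dropped by the x-axis limits anyway, so this check is mostly useful for knowing how much bad data there is.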

So let’s fix the x-axis a bit. I spent a few minutes fiddling and decided I just wanted to see the values between 0 and 0.25.

allp %>% 
  ggplot(aes(x = pvalue, y = field)) +
    geom_density_ridges() + 
  xlim(c(0,0.25))

## Picking joint bandwidth of 0.00401

## Warning: Removed 359521 rows containing non-finite values
## (stat_density_ridges).

Ok that’s better, but I don’t really like the grey background so let’s pick a different background color

allp %>% 
  ggplot(aes(x = pvalue, y = field)) +
    geom_density_ridges() + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)

## Picking joint bandwidth of 0.00401

## Warning: Removed 359521 rows containing non-finite values
## (stat_density_ridges).

That’s a bit prettier, but we see that field is sometimes NA so we need to remove those values.

allp %>% 
  filter(!is.na(field)) %>%
  ggplot(aes(x = pvalue, y = field)) +
    geom_density_ridges() + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)

## Picking joint bandwidth of 0.00404

## Warning: Removed 349629 rows containing non-finite values
## (stat_density_ridges).

And actually the density plots are a little weird for p-values, let’s see if we can turn them into something a little more like a histogram, which I think fits this data type better. To do that we have to change the parameters in geom_density_ridges.

allp %>% 
  filter(!is.na(field)) %>%
  ggplot(aes(x = pvalue, y = field)) +
    geom_density_ridges(stat = "binline") + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)

## `stat_binline()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 349629 rows containing non-finite values (stat_binline).

Ok but I think it would look better if it was a little bit higher resolution, let’s up the number of bins

allp %>% 
  filter(!is.na(field)) %>%
  ggplot(aes(x = pvalue, y = field)) +
    geom_density_ridges(stat = "binline",bins=50) + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)

## Warning: Removed 349629 rows containing non-finite values (stat_binline).

Ok but as people have pointed out the spike at 0.05 is due to censoring (p-values reported like $P < 0.05$). So let’s break it down by operator.

allp %>% 
  filter(!is.na(field)) %>%
  ggplot(aes(x = pvalue, y = field,fill=operator)) +
    geom_density_ridges(stat = "binline",bins=50) + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)

## Warning: Removed 349629 rows containing non-finite values (stat_binline).

Ok there aren’t that many “greater than” p-values and they make the plot messy so let’s drop those

allp %>% 
  filter(!is.na(field)) %>%
  filter(operator != "greaterthan") %>%
  ggplot(aes(x = pvalue, y = field,fill=operator)) +
    geom_density_ridges(stat = "binline",bins=50) + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)

## Warning: Removed 332965 rows containing non-finite values (stat_binline).

The histograms overlap a bit so let’s alpha blend the colors.

allp %>% 
  filter(!is.na(field)) %>%
  filter(operator != "greaterthan") %>%
  ggplot(aes(x = pvalue, y = field,fill=operator)) +
    geom_density_ridges(stat = "binline",
                        bins=50,alpha=0.25) + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)

## Warning: Removed 332965 rows containing non-finite values (stat_binline).

There is some funkiness in how the histogram bins are computed so I went to the internet and figured out I needed to set the boundary at 0 and make the bins be closed on the right.

allp %>% 
  filter(!is.na(field)) %>%
  filter(operator != "greaterthan") %>%
  ggplot(aes(x = pvalue, y = field,fill=operator)) +
    geom_density_ridges(stat = "binline",
                        bins=50,alpha=0.25,
                        boundary=0,closed="right") + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE)

## Warning: Removed 332965 rows containing non-finite values (stat_binline).
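To see why the choice of which side of the bin is closed matters for p-values reported as exactly 0.05, here is a small base R sketch using cut(). This is not what ggridges does internally, just the same interval convention, and the break points are made up for illustration rather than being the 50 bins in the plot.

```r
# Illustrative break points (assumed for this example only)
breaks <- c(0, 0.05, 0.10)
p <- 0.05

# Bins closed on the right: (0, 0.05], (0.05, 0.1]
right_closed <- as.character(cut(p, breaks = breaks, right = TRUE))

# Bins closed on the left: [0, 0.05), [0.05, 0.1)
left_closed <- as.character(cut(p, breaks = breaks, right = FALSE))

print(right_closed)
print(left_closed)
```

With right-closed bins, a value of exactly 0.05 is counted in the bin that ends at 0.05, together with the other values at or just below the threshold. With left-closed bins it is pushed into the next bin up.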

Now we make sure that there isn’t wasted space on the y-axis by using the expand argument.

allp %>% 
  filter(!is.na(field)) %>%
  filter(operator != "greaterthan") %>%
  ggplot(aes(x = pvalue, y = field,fill=operator)) +
    geom_density_ridges(stat = "binline",
                        bins=50,alpha=0.25,
                        boundary=0,closed="right") + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE) + 
  scale_y_discrete(expand=c(0,0))

## Warning: Removed 332965 rows containing non-finite values (stat_binline).

Remove the baseline from the plot for true ggridges coolness

allp %>% 
  filter(!is.na(field)) %>%
  filter(operator != "greaterthan") %>%
  ggplot(aes(x = pvalue, y = field,fill=operator)) +
    geom_density_ridges(stat = "binline",
                        bins=50,alpha=0.25,
                        boundary=0,closed="right",
                        draw_baseline=FALSE) + 
  xlim(c(0,0.25)) + 
  theme_ridges(grid = FALSE) + 
  scale_y_discrete(expand=c(0,0))

## Warning: Removed 332965 rows containing non-finite values (stat_binline).

That’s definitely not a perfect plot, but it worked for my talk and was at least able to communicate a couple of the key points (about variation by field, variation by operator, and spikes at critical values).

If I was going beyond the talk I’d probably reduce the number of fields displayed or really increase the size of the plot. I’d probably make the bin width even smaller and I’d add a title. I’d also probably clean up the “greaterthan” and “lessthan” to be “Greater than” and “Less than”.
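As a rough sketch of that label cleanup, forcats::fct_recode() can rename the operator levels. The data frame below is a made-up stand-in for allp, and the "equals" level is an assumption about what else the operator column contains. A title could be added to the plot in the same spirit with ggplot2’s labs(title = ...).

```r
library(dplyr)
library(forcats)

# Toy stand-in for the filtered data; "equals" is an assumed level
toy <- data.frame(
  pvalue   = c(0.01, 0.049, 0.03, 0.2),
  operator = c("lessthan", "equals", "lessthan", "equals")
)

# Rename the levels to something readable in a legend
relabeled <- toy %>%
  mutate(operator = fct_recode(operator,
                               "Less than" = "lessthan",
                               "Equals"    = "equals"))

print(levels(relabeled$operator))
```

Only levels that are actually present should be listed, since fct_recode() warns about levels it cannot find (for example "greaterthan" after it has been filtered out).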

Regardless, I’m always surprised how much work it takes to go from an exploratory plot I’m just looking at myself to one I’d show to other people.

Recording Podcasts with a Remote Co-Host

Wed, 20 Sep 2017 00:00:00 +0000

I previously wrote about my editing workflow for podcasts and I thought I’d follow up with some details on how I record both Not So Standard Deviations and The Effort Report. This post is again going to be a bit Mac-specific because, well, that’s what I do.

Communication

Both of my podcasts have a co-host who is not in the same physical location as me. Therefore, we need to use some sort of Internet-based communication software (Skype, Google Hangouts, FaceTime, etc.) to talk with each other. Another simpler option is to use web-based communications platforms that are built for podcasting, including Zencastr or Cast. These platforms take advantage of WebRTC to record audio on each co-host’s machine (more on this later).

Web Platforms

For those who are just starting up, I would recommend something like Zencastr or Cast because they are easy to set up and basically “just work”. (One note: if you are using Safari 10.x on the Mac (which doesn’t support WebRTC), these sites won’t work. Use Google Chrome instead.) For these web sites, you need nothing more than a pair of headphones with a built in microphone (like the ones that likely came with your cell phone) and a computer. Communication occurs over the Internet, as you might expect, but the audio that is recorded is the audio that comes from your microphone. The audio that goes over the Internet is not recorded.

The advantage of this system is that

You don’t have to worry about synchronization—the audio is recorded simultaneously.
The audio quality is good because it is recorded locally.
Files can be uploaded immediately to the cloud so that the editor can quickly assemble them (i.e. Zencastr uploads files to Dropbox).

The disadvantages of these services are

They charge a monthly fee with tiered plans (Zencastr has a free “hobbyist” plan)
They only can handle certain host setups
In my experience, communications noise can still affect the audio (not sure why).

Both Zencastr and Cast are fairly new offerings and therefore occasionally have to work through some bugs and wonkiness. Hilary and I started out using Zencastr for Not So Standard Deviations but eventually moved off of it. This was in part because one time we thought we recorded an entire hour-long episode and it turned out the file was corrupted (we had to re-record it the next day). To this day I’m not sure what happened, but after that I resolved to have more control over the process.

“Manual” Platforms

The alternative to the web-based platforms is to use a more manual approach. I have found this system to be more reliable, if a bit more complicated. Here, you can use Google Hangouts, Skype, FaceTime, or another Internet-based communications program. I’ve used most of them and I’ve found that Google Hangouts is probably the best (but it is pretty close all around). Obviously, the quality of your Internet connection will be more important than which software you use, and that may depend on exactly where in the world you are.

Recording

The basic idea is that there are two types of audio streams: the audio generated by each co-host/speaker and the audio that is passing across the Internet. The basic process here is

Each speaker/co-host records their own audio file on their computer using a program like QuickTime or Audacity or something similar. On the Mac, this is super simple: just open up QuickTime Player and go to File > New Audio Recording…. After that, just click the red circle button and you’re recording!
You can also record the Internet audio using a program like Ecamm or Audio Hijack. This is strictly speaking not necessary but it is useful as a backup audio recording and also for synchronizing the various speaker audio files in the editing phase.

If you’re going down the road of manual recording (and you use a Mac) I strongly recommend that you invest in getting Audio Hijack. It is a phenomenal program that lets you do basically anything on the Mac involving audio. If your Mac makes a sound in any way, you can easily record it.

With Audio Hijack, you can set things up to record audio, save it to a file in various formats, and send the audio to different outputs. Here is my setup:

Here’s how it breaks down:

I am recording my audio using an ATR USB Microphone (the exclamation point is there because my mic is not connected right now).
I record my co-host’s audio through Google Chrome (via Google Hangouts).
Both my and my co-host’s audio is saved to separate uncompressed 16-bit mono AIFF files that will later be imported into Logic Pro X for editing.
I then route each audio stream through Peak/RMS meters to make sure our respective microphones are tuned properly and that the audio isn’t peaking.
Then my co-host’s audio is routed to my headphones (Internal Speakers) so that I can hear her.
The combined audio with both my and my co-host’s voice is then saved to a single 256 kbps MP3 file. This final file is the backup audio file that I can use to help with synchronization.
The co-host then sends me her locally recorded audio file through Google Drive or Dropbox.

The advantage of this approach is that it can be adapted and modified in a variety of ways including a combination of in-person and remote guests (the web platforms can only really do remote guests). For example, if you have an in-person guest you can record both in-person audio streams and send them through a loopback device so that the remote person can hear them. “Multi-ender” podcasts are straightforward because everyone just records their audio separately and sends me the resulting audio file. I still have the combined recording as a backup just in case.

Editing Podcasts with Logic Pro X

Mon, 18 Sep 2017 00:00:00 +0000

I thought I’d write a brief description of how I edit podcasts using Logic Pro X because when I was first getting into podcasts, I didn’t find a lot of useful stuff out there. A lot of it was YouTube videos of advanced editing or very basic stuff. I don’t consider myself a sound expert in any way, but I wanted a good workflow that would produce decent quality stuff. When I first started Not So Standard Deviations with Hilary Parker I actually edited the first few episodes using Final Cut Pro X, a video editor, because that’s just what I was familiar with. Eventually, I learned Logic Pro X, which is Apple’s digital audio workstation, largely because of this post by Jason Snell.

On with the show. Unfortunately, this is going to be an Apple-heavy post given that Logic Pro X is only available on Macs. Sorry, rest of world!

Recording the Episode

I assume you’ve recorded your podcast and that you have one file for each speaker. In both The Effort Report and Not So Standard Deviations, I record with a co-host. As of this writing, both podcasts are recorded remotely because of geographic constraints. The way we do this is that each co-host records her microphone input locally and then sends me the audio file (I also record my own microphone input). This way, we don’t have to worry about any noise going over the Internet connection (and there is a lot sometimes). Unfortunately, there is no way to deal with the connection delays that inevitably come up (more on that later). As a backup, I also record the combined conversation going over the Internet using Audio Hijack. Besides being a backup, this recording is also useful for synchronizing the separate tracks, but at the end of the day, I don’t actually use it. I’ll probably write a separate post on the details of how I record each podcast.

Starting the Project

I start each project in Logic Pro X by selecting the “Multi-Track” option after clicking File > New from Template…. This just opens you up in an empty project with a whole bunch of empty tracks set up.

Next I import three audio files: One containing just my voice, another containing just my co-host’s voice, and another containing our combined conversation. These tracks are then labeled according to the speaker’s name. Below is an example from Episode 54 of The Effort Report after importing the audio.

At this point I don’t worry about where the audio tracks are placed, I just get them into the project. This is a good time to save your project (and also give it a name).

Synchronization

The first problem is that the audio tracks are not synchronized. For the most part this is not such a big deal as long as you have a recollection of how the conversation went. But it’s easier if you have a recording of the combined (synchronized) conversation. Here I can line up my voice in the separate audio file with the combined audio file, and then line up Elizabeth Matsui’s voice in her audio file to the same combined audio. It takes a little trial and error, but doesn’t take too long.

Here is a screenshot after I’ve synchronized the audio files. You can see that the waveforms nicely line up with each other in the combined audio file.

At this point I usually delete the combined audio track because it’s no longer needed (and is of lower quality because it is recorded over the Internet).

Strip Silence

Perhaps the key reason I use Logic Pro X for podcast editing (a task for which it is not particularly well-suited) is the strip silence feature. What this does is it takes an audio track and just deletes anything that it considers to be silence. What exactly is “silence” is configurable and you may need to play with it depending on your recording levels. You can run strip silence by first selecting the track in the window so that it is highlighted and then pressing Ctrl-X. You will be presented with this window:

I find the defaults don’t work for me, so I modify them slightly. My parameters are:

Threshold: 4%
Minimum Time to accept as Silence: 1.0 sec
Pre Attack-Time: 0.1 sec
Post Release-Time: 0

And I leave the “Search Zero Crossing” box checked. Once you hit “OK”, the track will look something like this.

After doing it again to my track, your track editor window will look something like this.

Now you can see that the only thing that remains of each track are the parts where one of us is talking. We are getting there!
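To give a feel for what the threshold and minimum-silence parameters control, here is a toy sketch in R. It is not Logic Pro X’s actual algorithm: it assumes the threshold is a fraction of full scale applied to a 0 to 1 amplitude envelope, it ignores the pre-attack and post-release settings and zero crossings, and the function name and test signal are made up for illustration.

```r
# Mark which samples survive a simplified "strip silence" pass.
# amp: amplitude envelope scaled to 0..1; rate: samples per second.
strip_silence <- function(amp, rate, threshold = 0.04, min_silence = 1.0) {
  quiet <- abs(amp) < threshold   # samples below the threshold
  runs <- rle(quiet)              # consecutive quiet / loud stretches
  # Only quiet stretches lasting at least min_silence seconds count as silence
  is_silence <- runs$values & (runs$lengths / rate >= min_silence)
  !rep(is_silence, runs$lengths)  # TRUE = keep this sample
}

# Toy envelope at 10 samples per second:
# speech, a 0.3 s pause, speech, a 1.5 s pause, speech
amp <- c(rep(0.5, 5), rep(0, 3), rep(0.5, 5), rep(0, 15), rep(0.5, 5))
keep <- strip_silence(amp, rate = 10)

print(sum(keep))
```

The short pause is kept because it is under the one-second minimum, while the long pause is removed. Raising the threshold or lowering the minimum time makes the stripping more aggressive, which is the trade-off to play with for your own recording levels.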

Dealing with Delay, Cross talk, and Noise

At this point you might think we were done, given that the audio has been synchronized. However, there are a few issues left that need to be dealt with:

Cross talk. Especially on remote connections with a delay (or on recordings with multiple speakers), there will likely be moments when people talk over each other. This is unpleasant to listen to and it’s nice to clean up those moments if possible.
Connectivity Delay. Often with remote recording, there is a delay as a person’s voice is relayed by the Internet gnomes over thousands of miles. The fact that there is only a 1–3 second delay in a connection between Melbourne and Baltimore is, frankly, a miracle, but it’s still annoying to listen to.
Conversational Noise. Often in remote conversations, the person not doing the talking is prone to say “Uh huh”, or “yeah”, just to indicate that they’re still listening. This stuff is not necessary in the final recording and it’s good to clean that up.

The nice thing about all of the problems above is that they are easily dealt with after running strip silence. In particular, in sections where a single person is just talking, there is nothing to do. So you can just skip over that. For example, in the section below, Elizabeth is speaking for just over a minute uninterrupted.

There’s nothing to edit here so you can skip to the next section.

By contrast in the section below, I am speaking and Elizabeth is agreeing with me by saying “right” and “yeah”. With strip silence it’s easy to recognize this pattern and I can just delete those little blurbs.

Finally, with connectivity delays, there will often be a gap between when one speaker finishes and another responds. In the example below, when I’m done speaking, there’s about a 2 second delay between when I stop and when Elizabeth responds. Sometimes that’s on purpose, but usually it’s because of the connectivity delay.

This can be easily fixed by clicking on Elizabeth’s track at that point, hitting Shift-F, which selects everything after that point, and then dragging the whole thing to the left a little bit. The result is below. It’s important to use Shift-F in order to preserve synchronization in the rest of the recording!

The nice thing about the strip silence feature is that you only need to navigate to all the points where one speaker stops and another starts. These boundaries are easy to find in the track editor. That doesn’t mean that this process isn’t tedious, especially in an hour-long episode. But it’s faster than it would be otherwise and it makes for a much cleaner recording and a much more listenable episode.

Lastly, cross talk is usually easy to spot because it looks something like this.

Here, both Elizabeth and I are talking at the same time and it looks messy. In these cases I’ll listen to what was happening before and after the cross talk and see if I can clean it up. Usually, I just end up deleting the section with cross talk and try to paste together what came before to what comes after. Or if it makes sense, sometimes I can just shift one person’s voice over just a little bit so that both speakers’ points are made, but just not at the same time.

Exporting

After all the editing is done, I export the final episode by bouncing to a file. You can hit Cmd-B to get the bounce menu. Typically I bounce to MP3 at 160 kbps and I turn Normalize to “Off”.

Summary

That’s pretty much it for editing. There are a few extra things that I do (like sound compression), but they’re not quite as important as getting the editing right. A lot of stuff out there encourages you to use free editors like Audacity or something similar—and that’s a good place to start—but I think a professional tool like Logic Pro X is essential for dealing with problems like noise, cross talk, and delay, which I think are a feature of every remotely recorded podcast.

Specialization and Communication in Data Science

Mon, 11 Sep 2017 00:00:00 +0000

I have been interested for a while now in how data scientists can better communicate data analysis activities to each other and to people outside the field. I believe that our current methods are inadequate because they have mostly been borrowed from other areas (notably, computer science). Many of those tools are useful, but they were not developed to communicate data analysis concepts specifically and often fall short. I talked about this problem in my Dean’s Lecture earlier this year and how the field of data science could benefit from developing its own theories, to simplify communication as other fields have done.

One thing that I have noticed is that the development of other fields can be viewed in part as a trend towards increasing specialization. As people in a field increasingly specialize in a certain sub-specialty, there is a parallel need for the specialists to communicate and coordinate with each other in order to produce a whole product. Over time, the separation of a field into a collection of specialists drives the development of communication tools that can serve as clearinghouses of mutually agreed upon information. Without adequate tools, the communications overhead involved with adding more people to a project will become too great and the entire enterprise can collapse. This phenomenon is famously described in Fred Brooks’s The Mythical Man-Month as it relates to software engineering projects.

I thought it might be useful to talk about some of these other areas and how they have overcome increased specialization and the separation of duties with communications tools. Tracing the history of other fields is instructive as it potentially provides a basis on which we can discuss data analysis. Listeners of my podcast with Hilary Parker know that we regularly have a segment that we refer to as “analogy corner” and this is the Simply Statistics version of that.

Specialization in Other Areas

The first example comes from filmmaking and the development of the screenplay. The Script Lab describes The History of the Screenplay and how filmmaking worked before the development of scripts:

When contemplating the history of screenwriting, one cannot divorce the theories of screenwriting from the evolution of film production. The earliest films were often solo projects, from conception to completion. Referred to as the “cameraman system” this was the most primitive of filmmaking. Soon, directors became central to the process, but most movies were filmed with only a vague idea of what the director wanted to shoot. Often crews were kept waiting while the director planned what to film next.

Films were one-person projects and were developed in a more or less linear fashion. It was an inefficient system—most films today are produced in a highly nonlinear fashion to accommodate actors’ schedules and various production processes.

Today, the screenplay serves as a critical communications hub around which many filmmaking departments (costume, makeup, hair, props, sets) can organize their activity. Imagine if representatives of each of these departments had to individually consult with the screenwriter or director about every detail of their work. It would be a nightmare of exponentially growing complexity. With a written document, such as a screenplay, that everyone can agree on as the authority of “what is happening in the film”, people can get their jobs done without the need of constant back and forth communication.

The second analogy comes from finance. In finance there was a similar development of specialization as it relates to limited liability. Here, the “specialization” refers to the separation of the owners of a company from its managers. As a result, there must be a way for managers of the company to communicate to the investors what exactly is going on with the operations of the company. This need drove the development of financial statements, accounting rules, and various publicly available documents that let investors analyze the health of a company. Graham and Dodd’s seminal Security Analysis is essentially a plea to investors to evaluate companies based on the publicly available data, rather than on common myths and legends about what makes for a good or safe investment. Today, with the separation of owners from managers and the creation of standardized communication formats between the two (e.g. S-1s, 10-Ks, 10-Qs, etc.), we have the basis of the global capital markets system.

The last analogy comes from western classical music, where there is often a division between the composer of the music and the performer. In more complex symphonic music, you might say there are three roles: the composer, the performer, and the interpreter/conductor. However, in early classical music, such a division didn’t exist and composers typically performed their own music, often by themselves. In this setup, there’s not much need to write things down, as the music can be stored in and performed from the composer’s head. This concept was well-captured in the movie Amadeus where Mozart describes his opera The Magic Flute as being “up here in my noodle” (the rest is just scribbling and bibbling).

Of course, opera might be the ultimate example from classical music where some sort of communication tool is needed to coordinate between the musicians, singers, and set designers. Thus, for most classical music, we have the score, which specifies what every instrument and singer is doing at any given time. There is a standardized notation that allows others unfamiliar with the composer to quickly grasp what is going on and to assemble the time and resources needed to perform the work.

What About Data Analysis?

In data science today, or really in science, much of what goes on follows the “vertically integrated” model, where the same person asks the question, collects the data, and analyzes the data. The need for communication methods doesn’t really arise until that work needs to be disseminated to others (including yourself in the future). In large collaborations where communication over analyses needs to be done from the start, my experience has been that even in the best case scenarios, the methodology is ad hoc and difficult to re-create in another project involving different people.

Most would agree that the software code that actually does the analysis is an important component of communicating what is being done. However, not everyone needs or wants all of the details provided by the code. Perhaps one concept we could steal from music is the distinction between the score and the parts. In a symphony, the conductor needs the full score because they need to know what everyone is doing at all times. But the first violinist only reads the first violin part; they don’t need to read the entire score in order to play a vital part in creating the finished product.

Developing appropriate communications tools for data science is critical to scaling data analysis so that more people can be involved and to reproducibility/replicability so that more people can understand what is going on in an analysis. Until then, I think we will continue to jam tools from other fields into the data science process, and that is fine. These tools are useful, but I think ultimately are not a perfect fit.

Moon Shots Cost More Than You Think

Thu, 07 Sep 2017 00:00:00 +0000

In a deeply reported article, Casey Ross and Ike Swetlitz report that IBM’s Watson isn’t living up to its hype when it comes to cancer care:

The interviews suggest that IBM, in its rush to bolster flagging revenue, unleashed a product without fully assessing the challenges of deploying it in hospitals globally. While it has emphatically marketed Watson for cancer care, IBM hasn’t published any scientific papers demonstrating how the technology affects physicians and patients. As a result, its flaws are getting exposed on the front lines of care by doctors and researchers who say that the system, while promising in some respects, remains undeveloped.

I thought the article was very well-written, covering many angles and getting into the details. I would encourage reading the entire thing.

While many issues are covered, I bumped on this one quote about the amount of investment (or lack thereof) that IBM is putting into Watson.

[Peter] Greulich said IBM needs to invest more money in Watson and hire more people to make it successful. In the 1960s, he said, IBM spent about 11.5 times its annual earnings to develop its mainframe computer, a line of business that still accounts for much of its profitability today. If it were to make an equivalent investment in Watson, it would need to spend $137 billion. “The only thing it’s spent that much money on is stock buybacks,” Greulich said.

That Watson hasn’t cured cancer overnight—or lived up to its own marketing material—should come as no surprise. But I wonder if people are nevertheless up to investing in what it would take to find a cure.

I’ve often heard of curing cancer as a “moon shot”, in reference to the Apollo program to land a man on the moon and return him safely to Earth. But that project cost a lot of money, about $112 billion in today’s dollars, or about $11 billion per year over the roughly 10-year span of the project. The National Cancer Institute’s annual budget is about $5 billion, so it’s less than half of that. Also, I would argue we know substantially less about human cancer now than we did about physics and space in 1959 (although there was plenty to learn back then too). How much additional money would we need to make up for that lack of base knowledge?
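The comparison is simple enough to check by hand, but here is a small Python sketch of the arithmetic using only the approximate figures quoted above. The variable names are mine, and the "years to match" line is just an illustration of scale that assumes a flat budget, not a forecast.

```python
# Approximate figures quoted in the post (all in US dollars).
apollo_total_usd = 112e9   # Apollo program cost in today's dollars
apollo_years = 10          # rough duration of the program
nci_annual_usd = 5e9       # National Cancer Institute annual budget

# Average Apollo spending per year.
apollo_annual_usd = apollo_total_usd / apollo_years

# How many NCI budgets fit into one year of Apollo spending.
budget_ratio = apollo_annual_usd / nci_annual_usd

# Years of a flat NCI budget needed to add up to the full Apollo cost.
years_to_match = apollo_total_usd / nci_annual_usd

print(f"Apollo per year: ${apollo_annual_usd / 1e9:.1f} billion")
print(f"Apollo per year / NCI per year: {budget_ratio:.2f}")
print(f"Years of NCI budget to equal Apollo: {years_to_match:.1f}")
```

On these numbers, one year of Apollo spending is a bit more than two years of the NCI budget.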

Deep Dive - Y. Ogata's Residual Analysis for Point Processes

Mon, 04 Sep 2017 00:00:00 +0000

For a long time now—actually ever since we started the blog—I’ve wanted to do a series of deep dives into specific papers that I thought were great. Clearly, it’s taken a bit longer than I expected, but I figure better late than never. Actually, that’s become a bit of a theme for my work these days!

One problem I have with much academic writing on the Internet is that I feel like most of it is devoted to (1) promoting one’s own work; or (2) identifying weaknesses in others’ work. Now, there’s absolutely nothing wrong with either activity—both are essential to the functioning of science. But I think there’s room for more activity. In particular, I think putting up examples of good work, and explaining why they are good, is an important part of producing more good work in the community. I hope to make this entry the first in a series, but we’ll see if I can sustain it. Seeing as this is a statistics/data science blog most of the papers will be statistically oriented. But I may try to throw in a few biomedical/environmental papers if I can.

The first paper I want to write about is one that had a huge influence on me when I was a grad student, many years ago, doing research on point process models. This is Yosihiko Ogata’s paper Statistical models for earthquake occurrences and residual analysis for point processes, originally published in the Journal of the American Statistical Association in 1988 (there is a pdf here). I haven’t picked up this paper in quite some time as I no longer do research in point process models, but it remains remarkable in my mind. So let’s get into it.

A Road Map for Data Analysis

One immediately important aspect of this paper is that it presents a road map for how to apply point process models to real data, using a formal likelihood-based inference framework. This may sound unremarkable, but in the point process area, there was a lot of focus on (1) theoretical properties of various estimators; and (2) ad hoc fitting of models to data via 2nd-order methods. A lot of the methodologies in point process modeling were translated from time series, which involved a lot of 2nd order correlation methods. Likelihood methods were far less common, in part because they involved substantial computation. For people who are interested in applying point process models to data to make scientific inferences, I think this is still a great paper to read. In fact, I think it’s a great paper to read if you’re interested in data analysis at all, for any kind of model.

Scientific Theory

Ogata’s paper is focused on a theory of earthquakes that posits that there are main shocks followed by aftershocks. After a long decay in aftershock occurrences, there is a period of “quiescence”, after which there are foreshocks, followed by another main shock. The problem is, of course, that nature is not kind enough to label all these shocks as “main shock”, “aftershock” and “foreshock”. We need a model to help us infer which is which. The primary goal lying in the background is predicting the next main shock.

Ogata reviews some of the fundamental theory for earthquakes. There’s the famous Gutenberg-Richter law (equation 2) which explains the distribution of earthquake magnitudes, and the modified Omori law (equation 1) which explains the frequency of aftershocks. Given these two laws, Ogata puts them together in an “epidemic-type” conditional intensity model (equation 10). This model can be thought of as describing the instantaneous rate at which earthquakes occur at any given time. If the model is correct, we can predict the rate at which an earthquake will occur at any time (along with its magnitude), given the entire history of earthquakes to date. These kinds of models would later be described as epidemic-type aftershock sequence, or ETAS, models.

The bottom line here is that the ultimate model presented in equation 17 of the paper is rooted in formally hypothesized theories of how earthquakes work along with interpretable parameters. These theories may or may not be correct, but they are presented and modeled in a manner where we can determine whether evidence in the data favor them or not. In other words, we have a good chance of knowing whether the theory/model fits the data or not. This aspect is key—what’s important is not whether your model is correct or not, but whether you know if your model is correct or not. Not every field is so lucky to have parsimoniously specified physical models to go on. But the truth is, if the models are incorrect, or do not fit the data, then the fact that they are parsimonious is not that helpful. Nevertheless, it’s worth trying to develop a model that we can understand, and which will teach us something about the underlying phenomena, in this case how earthquakes occur.
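For readers who want the formulas in front of them, the two laws and the resulting intensity are usually written along the following lines. This is the common textbook form rather than a transcription of the paper: the symbols ($K$, $c$, $p$, $a$, $b$, $\mu$, $\alpha$, $M_0$) are conventional choices that I am assuming here, so the notation may not match Ogata’s numbered equations exactly.

```latex
% Modified Omori law: rate of aftershocks at time t after a main shock
n(t) = \frac{K}{(t + c)^{p}}

% Gutenberg-Richter law: number of events with magnitude at least M
\log_{10} N(M) = a - bM

% Equivalently, magnitudes above a threshold M_0 are exponentially distributed
\Pr(\text{magnitude} > m \mid \text{magnitude} \ge M_0)
  = \frac{N(m)}{N(M_0)} = 10^{-b (m - M_0)} = e^{-\beta (m - M_0)},
  \qquad \beta = b \ln 10

% Epidemic-type conditional intensity: a background rate plus one
% Omori-type term for every past event, scaled by that event's magnitude
\lambda(t \mid \mathcal{H}_t)
  = \mu + \sum_{i:\, t_i < t} \frac{K \, e^{\alpha (M_i - M_0)}}{(t - t_i + c)^{p}}
```

Every parameter has a physical reading: $\mu$ is the background rate, $p$ controls how fast aftershock activity decays, and $\alpha$ controls how much more productive a larger event is. That is what makes the model interpretable in the sense described above.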

The Data

Before getting into how to fit the model to data, Ogata presents the data—literally. This may seem quaint in today’s big data era, but Figure 1 and Table 1 show all the data. Table 1 takes 2 pages just to print out (there was no supplementary material back then). But there’s something very visceral and satisfying about being able to flip through the data like that. I feel like I notice details that I never would have noticed in a single plot or a table of summary statistics. The dataset is 483 earthquakes with magnitude 6 or more from 1885–1980.

Figures 2 and 3 show more summaries of the data. Figure 3 in particular suggests a reasonable adherence to the Gutenberg-Richter law of magnitudes (i.e. straight line on a log-plot indicating an exponential distribution). Figures 4, 5, and 6 are designed to show whether the earthquake process is purely random (Poisson) or exhibits some clustering behavior. Suffice it to say that all plots show that there is a lot of clustering (non-Poisson). Under the Poisson assumption, Figure 4 should be a straight line, Figure 5 would follow the solid line, and Figure 6 would just be a horizontal scatter (without the little spike up near zero).

This section is short, but Ogata lets the data do some talking. He couples the literal presentation of the data with some summary statistics, some of which are highly specific to point process data. But ultimately, he takes a light touch to the summarization and tries to present as many angles as feasible/useful to the reader. This is one of my favorite aspects of this paper—the careful curation of how the data are presented so that you might best assess how the data eventually do or do not fit the hypothesized models.

The Modeling

Section 3.2 shows the log-likelihood of the model and how we can compare models using AIC. There’s nothing super novel here, except that maximizing the likelihood must have been a pain in the butt at the time.
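As a reminder of what is being maximized, the log-likelihood of a point process observed on $[0, T]$ with events $t_1 < \dots < t_n$ and conditional intensity $\lambda_\theta$ has the general form below, and AIC penalizes it by the number of parameters $k$. These are the standard expressions; the notation is mine and not copied from the paper.

```latex
% Log-likelihood of a point process with conditional intensity \lambda_\theta
\log L(\theta)
  = \sum_{i=1}^{n} \log \lambda_{\theta}(t_i \mid \mathcal{H}_{t_i})
  - \int_{0}^{T} \lambda_{\theta}(t \mid \mathcal{H}_t) \, dt

% Akaike information criterion for a model with k free parameters
\mathrm{AIC} = -2 \max_{\theta} \log L(\theta) + 2k
```

For an epidemic-type intensity, each $\lambda_\theta(t_i \mid \mathcal{H}_{t_i})$ is itself a sum over all earlier events, so a single evaluation of the log-likelihood costs on the order of $n^2$ operations before any optimizer is run, which is presumably a large part of the pain.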

Residual Analysis

Section 3.3 is the heart and soul of this paper, in my view. Ogata explains that while AIC can be useful for comparing a set of models, it’s not useful for determining whether there is a better model outside that set. Here, we need the point process equivalent of residual analysis for regression models. The problem is that at the time there really wasn’t a settled understanding of what residual analysis was for point process models (or even what constituted a “residual”).

Ogata uses a theoretical result of Papangelou that says that you can shift the time points of the earthquake occurrences in a 1-to-1 manner dictated by the model so that the rescaled time points follow a Poisson process with rate 1. A Poisson process with constant rate 1 is the “white noise” of point process modeling. So, in other words, if your model captures the main features of the data, then the rescaled time points, which we will call the residuals, should essentially be noise. The nice thing about this approach is that there are already a bunch of tests designed to determine if a point process is Poisson or not. So we can apply that battery of tests to the residuals to see if we are missing anything. Furthermore, if we detect anything unusual in the residuals, we can easily map them back to the original data to see what was going on.
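To make the rescaling idea concrete, here is a minimal Python sketch on a toy example. It does not use Ogata’s model or data. I assume a simple deterministic intensity, lambda(t) = 5 + 4 sin(t), simulate events from it by thinning, and then map each event time t_i to the residual tau_i = Lambda(t_i), where Lambda is the integral of the intensity from 0 to t. The function names, the intensity, and the seed are all illustrative choices of mine. If the model is right, the gaps between residuals should look like draws from an exponential distribution with mean 1 and variance 1.

```python
import math
import random


def intensity(t):
    """Toy intensity lambda(t); it oscillates between 1 and 9."""
    return 5.0 + 4.0 * math.sin(t)


def compensator(t):
    """Integral of intensity() from 0 to t, worked out by hand."""
    return 5.0 * t + 4.0 * (1.0 - math.cos(t))


def simulate_by_thinning(t_end, lam_max, rng):
    """Simulate event times on [0, t_end] by thinning a rate-lam_max Poisson process."""
    events = []
    t = 0.0
    while True:
        t += rng.expovariate(lam_max)        # candidate from the dominating process
        if t > t_end:
            return events
        if rng.random() * lam_max <= intensity(t):
            events.append(t)                 # keep with probability lambda(t) / lam_max


def gaps(points):
    """Waiting times between successive points, measuring the first from 0."""
    return [b - a for a, b in zip([0.0] + points[:-1], points)]


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


rng = random.Random(1988)
events = simulate_by_thinning(t_end=200.0, lam_max=9.0, rng=rng)

# The "residuals": event times rescaled through the model's compensator.
residuals = [compensator(t) for t in events]

raw_gaps = gaps(events)
residual_gaps = gaps(residuals)

print("number of events:", len(events))
print("raw gaps:      mean %.3f, variance %.3f" % (mean(raw_gaps), variance(raw_gaps)))
print("residual gaps: mean %.3f, variance %.3f" % (mean(residual_gaps), variance(residual_gaps)))
```

Because the rescaling is one-to-one and increasing, any residual that looks odd can be traced straight back to the event that produced it. Rescaling with the wrong compensator (say, a constant rate) would leave the periodic clustering visible in the residual gaps.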

Figures 9-15 look at the residuals carefully and most indicate that the residuals look Poisson. However, Figure 13(a) suggests a violation of the variance assumption, so Figure 15(3a) is used to diagnose that problem. It seems there was an unusual “swarm” of earthquakes in 1938—when these are removed, the residuals appear Poisson. Of course, the model will need to be re-examined in light of its inability to model this “swarm”, but the point is that these residual methods allowed us to easily identify this “outlying” group of earthquakes.

Interestingly, the way that Ogata structures the article sets us up for this moment. In Section 3.1 he shows all these plots of the raw data and concludes that the data are highly clustered and definitely not Poisson. Later, in the residual analysis section, he can show us the exact same series of plots and say, “Look, they’re Poisson now!” Or at least, almost Poisson.

In this section, Ogata develops an entirely new tool for assessing point process models and shows how it can be used to identify interesting aspects of the data that are not captured by existing theories or models, even after evaluating a set of models with AIC. The thoughtful treatment of residuals and the visualization of how they do or do not adhere to the Poisson assumption is important for allowing the reader to evaluate for themselves the adequacy of the model.

Summary

The remainder of the paper is an assessment of the theory of “quiescence”, which I will skip as this entry is long enough. In summary, I think this is a great applied statistics and data analysis paper. Some of the things that stand out to me are

There is a strong reliance on the theory of earthquakes, and Ogata constantly references those theories in each section as he evaluates the data. In other words, the point of the analysis is to develop evidence for or against the theory. Note that there’s very little emphasis on formal hypothesis testing. Nevertheless, Ogata puts the reader in a position where they can judge for themselves how well the models fit the data.
Ogata goes to great lengths to present the data, plainly yet informatively. This is not a data dump. I think the reason he can present the data in this informative manner is again because he is guided by the theory and therefore has a prioritization of what is or is not important to show.
In a way, Ogata buries the lede in this paper. He waits until Section 3.3 to introduce the idea of residual analysis for point process models, which in reality was a totally new applied statistics method. He could have written an entire paper about that approach. But ultimately, I think in his mind this paper was about earthquakes and earthquake models. Statisticians routinely have to decide whether a paper is about the method or the application—there is no right or wrong answer. Regardless, Ogata didn’t let the novelty of the method get in the way of answering the primary question.

Data Science on a Chromebook

Tue, 29 Aug 2017 00:00:00 +0000

About nine months ago I announced that I was attempting a Chromebook experiment for the 2nd time. At first I thought it was going to be a short term experiment just to see if it was possible to function with only a Chromebook. But in an interesting twist I got used to it and have been working exclusively on a Chromebook for the last few months since the experiment started.

I set myself the following requirements:

I could only use Chrome OS - no installing/booting to Linux
I couldn’t use another computer for any task
I had to be “fully cloudy” in the sense that I didn’t run any additional hardware

One of the reasons I did this was I wanted to see if it was possible to be a functioning/day to day data scientist without using an expensive laptop. This is part of a broader experiment I’m just beginning on how to democratize data science education.

I’m not going to go into extreme detail on how I set everything up here (more on that in a second) but I thought I’d describe my Chromebook set up that I’ve been using for the last couple of months.

I have been using two Samsung Chromebook Plus computers, one of which I keep at home and one which I keep at work. One of the best parts about the fully cloudy/Chrome OS requirement is this means that from the user perspective everything is always in sync. I log off the computer at home, come to work, log on and it’s like I’m on the same computer.

I thought I’d just go through at a high level the software I’m using to keep everything running.

Google Slides for presentations - (Cost:free) For the most part this has been really easy and is a smooth transition from PowerPoint. One thing I’ve found really useful is the laser pointer mode of the Chromebook Plus for highlighting things on screen when presenting. I have also found that since they are using USB-C adapters I can participate in dongle communism with Apple users. I had to figure out the display mirroring menus in Chrome OS but after that this was easy.
Google Docs/Paperpile for writing - (Cost:free) This works great and has been my work flow as I describe in my book since before the Chromebook experiment started.
DocHub for signing things - (Cost:$4.99/month/billed yearly) Often I have to “sign” a document by adding my electronic signature. I used the note feature to create a jpeg of my signature. I can then upload the file to DocHub.
Overleaf for writing LaTeX - (Cost:free or $10/month/billed yearly) This is not necessary for all data scientists, but it has some nice features, including the time I was able to write a grant live while people watched.
Gmail for email - (Cost:free) this one is pretty obvious.
Google Sheets for data - (Cost:free) this is again a choice I had been making frequently before I moved to Chromebooks. The googlesheets R package lets you do all sorts of cool things with Google Sheets.
Digital Ocean for RStudio - (Cost: $20/month) I set up an RStudio server and run it remotely on Digital Ocean. I currently use the $20/month option but sometimes scale it up or down as needed. One great thing about the dockerized version of the software is that I can pause the instance, scale up the compute infrastructure, restart and everything is just as I left it but with more computational horsepower. I can then use that for a few hours as needed and scale back down. I use the terminal in RStudio for most of my management of code/etc. on GitHub.
Google Hangouts for video conferencing - (Cost:free) this is the default but honestly I wish I had a better option. I often find it complicated and laggy to work with, but still mostly better than Skype. Would be open to suggestions on that front.
Slack for communication - (Cost: $6.67/month) a variety of different teams here at JHU and around the country use Slack for group communication. I use it through the web browser, although the Chromebook Plus allows you to install Android apps.
Google Music for listening to music/podcasts - (Cost:$10/month) This is an unnecessary expense but I like having something to listen to while I work.
TweetDeck for Twitter - (Cost:free) I have a couple of accounts I manage and I do this through the web browser. For the most part this works great.

So my total monthly cost comes to something like $35 a month for various cloud services. At first doing this was sort of like writing a Haiku. I could still write, but the constraints made me think hard about how I did things. But after a while I have gotten so used to the form that it feels natural and I don’t miss my (really expensive) Apple products anymore.

The biggest headaches have been:

Wifi connectivity issues - this isn’t as big as I thought it would be, most places have wifi where I work and it is mostly ok. When I have trouble I stream from my phone.
Wifi blocking my DO server - this one has been a headache. I think if I just got a custom domain for the webserver and didn’t just use the IP address I could avoid it. When I have trouble I stream from my phone.
httr and RStudio on a server - when I need to authenticate for websites I have run into trouble, but if I set options(httr_oob_default = TRUE) (documentation here) then the OAuth process generates a code I can paste into my server.

Beyond that it has actually been pretty straightforward to do almost anything I need. Stay tuned because this experiment has inspired a broader effort we are doing with Chromebooks here at the JHU Data Science Lab. If you want to hear about this effort as it gets underway, sign up for our weekly newsletter and you’ll be the first to hear new announcements.

Simple Queue Package for R

Mon, 28 Aug 2017 00:00:00 +0000

I recently coded up a simple package that implements a file-based queue abstract data type. This package was needed for a different package that I’m working on involving parallel processing (more on that in the near future). Actually, this other package is a project that I started almost nine years ago but was never able to get off the ground. I tried to implement a queue interface in the filehash package but it never served the purpose that I needed.

Recently, I came across Rich FitzJohn’s thor package, which provides an interface to the LMDB database, which is a light-weight key-value style database. When I saw this, I realized that it was exactly what I needed because it supports transactions with blocking. So if multiple processes attempt to write to the database at once, it will block the other processes until the currently writing one is finished, and then move on to the next one.

The code for the queue package is available on GitHub and can be installed in the usual way via devtools:

devtools::install_github("rdpeng/queue")

The package just involves four functions:

enqueue(): add an element to the queue
dequeue(): remove an element from the queue (and return it)
peek(): return the next element of the queue (but don’t remove it)
is_empty(): is the queue empty?

There are also two setup functions:

create_Q(): create the queue and associated files on the disk
init_Q(): initialize a queue that has been previously created

Everything is implemented as S3 classes and methods. Queues are just sub-directories on the filesystem that contain some metadata for the underlying LMDB database.

You can create a queue with the create_Q() function.

library(queue)
x <- create_Q("testq")
is_empty(x)

## [1] TRUE

You can then add arbitrary objects to the queue with enqueue(). Behind the scenes, objects are serialize()-ed.

enqueue(x, "hello")
enqueue(x, 1)
peek(x)

## [1] "hello"

Finally, you can remove an object from the front of the queue with dequeue().

obj <- dequeue(x)
obj

## [1] "hello"

peek(x)

## [1] 1

The implementation is pretty basic right now but the guts of it are there. I don’t plan to add any more features, but hopefully will round out the documentation and add some tests.
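For readers who want to see the idea outside of R, here is a rough Python sketch of a file-backed queue whose writes are serialized by database transactions. It is not how the queue package is implemented: I am substituting SQLite (from Python’s standard library) for LMDB, pickle for serialize(), and the class name FileQueue, the table layout, and the choice to return None from an empty queue are all my own assumptions. The point is only to show how a blocking write transaction lets several processes share one queue on disk.

```python
import os
import pickle
import sqlite3
import tempfile


class FileQueue:
    """A tiny file-backed FIFO queue (illustrative stand-in, not the R package)."""

    def __init__(self, path):
        # isolation_level=None leaves transaction control to the explicit
        # BEGIN/COMMIT statements below. timeout is how long a blocked
        # writer waits for the lock before raising an error.
        self.conn = sqlite3.connect(path, timeout=30.0, isolation_level=None)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload BLOB NOT NULL)"
        )

    def _front(self):
        # The row with the smallest id is the oldest item, i.e. the front.
        return self.conn.execute(
            "SELECT id, payload FROM items ORDER BY id LIMIT 1"
        ).fetchone()

    def enqueue(self, obj):
        # BEGIN IMMEDIATE takes the write lock up front, so a second writer
        # waits here until the first one commits.
        self.conn.execute("BEGIN IMMEDIATE")
        self.conn.execute(
            "INSERT INTO items (payload) VALUES (?)", (pickle.dumps(obj),)
        )
        self.conn.execute("COMMIT")

    def peek(self):
        row = self._front()
        return None if row is None else pickle.loads(row[1])

    def dequeue(self):
        # Read and delete inside one transaction so that two processes
        # can never remove the same item.
        self.conn.execute("BEGIN IMMEDIATE")
        row = self._front()
        if row is not None:
            self.conn.execute("DELETE FROM items WHERE id = ?", (row[0],))
        self.conn.execute("COMMIT")
        return None if row is None else pickle.loads(row[1])

    def is_empty(self):
        return self._front() is None


# Same sequence of operations as the R session above.
path = os.path.join(tempfile.mkdtemp(), "testq.sqlite")
q = FileQueue(path)
empty_at_start = q.is_empty()
q.enqueue("hello")
q.enqueue(1)
first_peek = q.peek()
removed = q.dequeue()
second_peek = q.peek()
print(empty_at_start, first_peek, removed, second_peek)
```

The design choice that matters is doing the read and the delete in dequeue() under one write lock. Without that, two workers could both read the same front element before either removes it.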