«I chose statistics because I thought it was just more interesting, had more life to it than pure mathematics, even though I was trained mathematically». This is how Alan Gelfand explains the passion for statistics to which he has devoted his life. In the nineties he popularised, together with Adrian Smith, the Markov Chain Monte Carlo (MCMC) method, a contribution that transformed Bayesian statistics by making it possible to sample from complicated probability distributions. «It is one of those things where you can feel very fortunate because there are lots of really smart people out there who work really hard for their careers and don’t happen to find something. You discover something and it happens to be a breakthrough. The only thing I can say for me is at least I took advantage of it», he remembers.
In recent years, Alan Gelfand’s research has revolved around space-time statistics, a booming field with a lot of possibilities, as he explains humbly yet passionately: «There is an expression we use in English, “low-hanging fruit”, which you can reach up and grab without working so hard because it hasn’t been picked yet, and in spatial-temporal analysis there was so much low-hanging fruit that you could play, enjoy all the possibilities that are out there. Many other areas have been developed so much, have been pushed so much, that you have to reach much higher to find some fruit. I have been very lucky». In fact, he has published four books and more than 250 scientific papers on these questions, and has received a number of awards. The most recent was the Distinguished Research Medal from the ASA Section on Statistics and the Environment.
Alan Gelfand is currently a Professor of Statistics at Duke University (Durham, USA) and a Fellow of the American Statistical Association, the Institute of Mathematical Statistics and the International Statistical Institute. Applications of his work can be found predominantly in areas such as environmental exposure, space-time ecological processes and the development of climate models.
What is your vision of the applicability of statistics?
The 21st-century statistician is a package: somebody who has a good methodological background, is solid mathematically, is good in terms of modeling, has good computing skills, has good data-analysis skills, is able to do visualization if needed, and has a whole tool kit that enables her or him to do the things that are needed nowadays. It may be the case that different people will be stronger in different areas, but you do need this whole package in some way if you’re going to really be a modern statistician. And the result is the life that has come to statistics from this: an energy, a vitality that has come from all of this interdisciplinary development, all of this collaboration, and it doesn’t matter that much whether it is collaboration in environmental sciences or genomics or neural sciences or social sciences or economics. In the 21st century the statistician is an integrated player from the beginning, and it’s scientifically much more exciting, because you are really part of the team. So it’s a very exciting time for statistics, and that’s what I mean in terms of application.
What exactly is Bayesian thinking?
The Bayesian view is scientifically intuitive, and the only challenge Bayesian thinking ever faced was within the statistics community, not really in the bigger scientific community. The idea behind Bayesian thinking is just the simplest thing you can imagine: you try to infer what you don’t know given what you see. What could be more natural than that? But the more classical view is sort of inverted; the classical view says: what might you imagine seeing, given what you don’t know? It’s a strange way to look at things; it’s a strange pair of glasses. I think that’s why the Bayesian paradigm for inference is so attractive, and the reason it took so long to come to the front of statistics is the computational problem: for a long time the Bayesian community could only do very simple problems. Bayesians spent a lot of time playing with formal axiomatic theory and a mathematical perspective, which was foundationally important but, from an applied perspective, from the practical perspective, did not really help people.
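The inversion Gelfand describes, inferring what you don’t know given what you see, is just Bayes’ rule: posterior ∝ prior × likelihood. A minimal sketch in Python of this idea; the coin-flip setting and all the numbers here are illustrative choices of ours, not an example from the interview:

```python
from math import comb

# What we don't know: a coin's bias theta. What we see: 7 heads in 10 flips.
# Bayes' rule on a discretized grid of candidate theta values.
thetas = [i / 100 for i in range(1, 100)]        # candidate biases 0.01 .. 0.99
prior = [1 / len(thetas)] * len(thetas)          # flat prior: all equally plausible

heads, flips = 7, 10
likelihood = [comb(flips, heads) * t**heads * (1 - t)**(flips - heads)
              for t in thetas]                   # probability of the data given theta

unnorm = [p * l for p, l in zip(prior, likelihood)]
evidence = sum(unnorm)
posterior = [u / evidence for u in unnorm]       # updated beliefs about theta

post_mean = sum(t * p for t, p in zip(thetas, posterior))
print(round(post_mean, 3))  # close to the exact Beta(8, 4) posterior mean, 8/12
```

The classical view would instead ask how the data would behave under a fixed, unknown theta; here the unknown itself gets a distribution, updated by what was seen.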
What did Markov Chain Monte Carlo (MCMC) mean for Bayesian statistics at the time?
For a non-technical audience, the Bayesian computing problem is one of high-dimensional integration, complicated integrations that could not be done explicitly. We needed a mechanism to break what was called at that point the curse of dimensionality, because high-dimensional problems could not be done analytically, and we needed a strategy. So, what the strategy turns out to be is the most basic idea in all of statistics: if you want to learn about a population, you sample from it. That’s what we do, and it turns out that Gibbs sampling and Markov Chain Monte Carlo are just the mechanism for enabling sampling from complicated populations, complicated multi-dimensional distributions. So it’s really a very elegant idea. But why does it work the way it worked? Because it was impossible to sample directly from a very high-dimensional distribution, maybe a hundred or a thousand dimensions, but you could break it into smaller pieces and sample from low-dimensional distributions. If you did it the right way, the sampling from the low-dimensional distributions would produce realizations, samples from the high-dimensional distribution, which is what we really were interested in. That was the power of this thing, and when we first looked at it we were like little kids in a toy store.
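The «break it into smaller pieces» strategy Gelfand describes is Gibbs sampling: draw each coordinate from its one-dimensional conditional distribution in turn, and the chain of draws converges to samples from the joint. A minimal sketch, using a two-dimensional toy case to stand in for the hundred-dimensional problems he mentions; the bivariate normal target is our illustrative choice, not an example from the interview:

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples=20000, burn_in=2000, seed=42):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    The joint over (x, y) is two-dimensional, but each full conditional
    is a one-dimensional normal:
        x | y ~ N(rho * y, 1 - rho**2)
        y | x ~ N(rho * x, 1 - rho**2)
    Alternately sampling these low-dimensional conditionals yields draws
    from the higher-dimensional joint.
    """
    rng = random.Random(seed)
    sd = math.sqrt(1 - rho ** 2)
    x, y = 0.0, 0.0
    samples = []
    for i in range(n_samples + burn_in):
        x = rng.gauss(rho * y, sd)   # sample x given current y
        y = rng.gauss(rho * x, sd)   # sample y given the new x
        if i >= burn_in:             # discard early draws before convergence
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8)
# Since both margins have mean 0 and variance 1, E[x * y] equals rho.
corr = sum(x * y for x, y in samples) / len(samples)
print(round(corr, 2))  # should be close to 0.8
```

In a real hierarchical model the conditionals are not all normal, but the pattern is identical: cycle through the unknowns, sampling each from its conditional given the rest.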
You have also been working in Spatial Statistics. What was your experience in that?
I realized how well suited spatial statistics was for Bayesian inference. The idea is that for a lot of statistical work people use likelihoods and likelihood-based inference, and when you have to attach uncertainty to likelihood-based inference, the uncertainty usually comes from asymptotic ideas. And what happens with spatial statistics is that asymptotics do not work, asymptotics are not right; the only asymptotics you could use in spatial statistics were like time-series asymptotics, where you let time go to infinity. But with space, you don’t want to let space go to infinity. If you want to study a region, Valencia or even Spain, you are not going to look at a region the size of the Earth. What you really want to do is think about inference where the samples get bigger by looking at more locations within the region, not by letting the region grow. And that sort of asymptotic behavior doesn’t work with classical time-series and other sorts of asymptotic theory, so it turns out that Bayes is really good for this, because Bayes gives exact inference and does not require any asymptotic arguments. Space is interesting because when you work with time series in one dimension, you have order: you can tell what’s before and what’s after. But in space there’s no order. That makes life much more interesting, more challenging, more fun, and it just opened up lots of possibilities. The combination of space and time is really beautiful.
In what direction do you think statistics will grow? Do you think it will be more theoretical or more applied?
Statistical contribution is really inference, and so the question is whether we can maintain inference as a scientific contribution or not. That is where the challenges of Big Data come in: the question of whether we need statistics, whether we need inference or not, or whether we can do things just through algorithms, whether we can do science with just descriptive summaries and exploring databases, without necessarily doing probabilistic inference. It’s sort of interesting because of this identity crisis that statistics is going to face: the question is, where should we be heading? And there are some people who will say: «Well, we’ll always need the theoretical side because we need to have some rigor, we need to make sure that people are doing things correctly». And it is true: if you provide the software, people use it. But it can be dangerous if people haven’t rigorously looked at the foundational challenges that go with some of this work. So I think there’s always going to be a place for the theoretical side, but there’s no doubt that the future is on the applied side. And in particular, what is really changing is the style of doing things. It has to change the way you think about things, because it requires more integrated thinking, more synthesizing of different sources of information, because you are not able to control everything. You are essentially looking at a more complex process, which has features you have to try to capture in as many ways as you can. So all this is a long way of saying that the future of statistics is going to be in working on these complicated interdisciplinary projects, working on challenging processes, challenging systems.
Viktor Mayer-Schönberger and Kenneth Cukier open their book about Big Data^{1} by telling how Google searches predicted the spread of the H1N1 flu outbreak in 2009. The authors use this example to cite the article by Chris Anderson, «The End of Theory: The Data Deluge Makes the Scientific Method Obsolete»^{2}, published in 2008 in Wired magazine, which provocatively proclaimed: «Petabytes allow us to say: “Correlation is enough”. We can stop looking for models». What do you think about that? Is the future of statistics to become just simple descriptive data analysis?
In the old days I thought of myself as a statistician, but now, if you ask what I am, I would call myself a stochastic modeler; that is, I model problems with uncertainty. My view of the world is that when we look at these complex processes we are not able to explain them perfectly, and therefore, if we can’t explain them perfectly, we should introduce some measure of uncertainty, and for me that means probabilistic or stochastic modeling. That is not what I would think of purely as a Big Data idea. There are many problems that do not require big data. You can work on interesting scientific problems on much smaller scales, but what’s really troubling in this business is that machine learning and Big Data try to envision a large umbrella under which would be statistics, computer science, maybe computer engineering, applied mathematics, and they think of it all as machine learning. That’s not really the way I want to think about doing science. I think a lot of this machine-learning stuff is kind of a search for structure in big data, a search for patterns, for relationships. But it’s not the same as what we do: we try to understand complex processes, explain, predict, capture uncertainty. It’s not the same way of looking at things; it cannot answer the same questions. We need science with data, for sure, and it’s not science without statistics. I really think that the approach to science I’m describing is much different from just exploring an enormous database and trying to tease some structure out of it, which is the prevalent thing in machine learning.
«But Big Data is not the same as statistics. We try to understand complex processes, explain, predict, capture uncertainty»
David Valls
«If statisticians are not visible enough, people would just assume that we don’t have that much to contribute»
We would like to talk about the visibility of statistics. What should statisticians do to make themselves more visible?
Last year was the International Year of Statistics, and so it was an opportunity for us to think a little bit more about how we are going to continue to participate in the bigger scientific community. That’s really good. Historically, statisticians have been happy to be in the background. We’ve been the low-visibility people. We didn’t need to be in El País or The New York Times. We did not need to be on television. That is, the statistician on a scientific project was happy to be in the background. I think there’s a lot of good in that, in the sense that it makes the statistical community more caring, a little less ruthless, a little more supportive. Because we are not so much looking for glory, we don’t compete so much, we support more, and that kind of innocence is really a nice thing to have. I don’t want to lose that, but I recognize that if we are not visible enough, people will just assume that we don’t have that much to contribute. If we are not visible enough, they will assume that a scientist in another field can do whatever we do, and that we don’t have that much of an important role to play. I don’t really know how to get around this problem, because I don’t want to sacrifice the innocence, but on the other hand I don’t want to be out there looking for the glory. It’s not a drive. For me what’s important is the science: doing good science, trying to solve important problems, and trying to make a contribution to important problems. I think that is what’s really challenging.
What is your opinion about the future of research funding?
In the old days, you could propose a research agenda that was purely theoretical statistics and, ok, maybe you would not get wealthy, but you could get some money to fund it, maybe enough to support a student or two and a little bit of travel, and that was sort of the way to do things. But now that sort of research is not going to be very fundable. At least in the United States, and I know in the UK, and maybe in Spain, because the crisis is so difficult, funding is more challenging, but I know the model the funding agencies are looking for is this interdisciplinary project idea. I think this really will be the future of funding. The difficulty is that the amount of funding is shrinking, because people outside the scientific community don’t value basic science that much, don’t really think the government should spend money on science when there are other things it should spend money on, and maybe they shouldn’t have spent so much money in the first place. And so what’s happening is that the funding amounts are going down but there’s more pressure on them because, at least in the United States, the universities are expecting their faculty to bring in money. So it’s a game they cannot win. I don’t know how this is going to play out. Some people suggest that private industry will help. I’m not so sure; I think private industries come in with their own agendas. I don’t think you can expect them to provide the funding you would want to sustain basic science research. I’m not sure I can see a solution.
Misuse of statistics in the media can lead, on purpose or unintentionally, to manipulating the figures. Do you think this is a common issue?
The old line used to be, at least in English, «liars, damned liars and statisticians». The idea was that statisticians certainly could abuse data, and there’s no doubt that it has been done. What has emerged now is an enormous amount of pressure to publish, and the pressure to publish is the pressure to find things. That sort of pressure leads to misuse or misrepresentation of statistical results. This is combined with a challenge we call reproducibility: can you give a data set that you have written something about to another scientific team, and can they reproduce what you found or not? And the story on this is remarkably disappointing. It depends on the field, but I just listened to a wonderful talk on reproducibility in medical statistics, and the numbers are scary: in less than 50% of data sets, people have been able to reproduce the results that were actually published. There’s a real challenge with misuse of the data and, in the public mind, it creates a real skepticism, a real doubt about the validity of statistical analysis, and the feeling amongst the public is that a statistician can tell any story you want, you can manipulate the data any way you would like. And in some sense it is true.
Do you think the general public is prepared to correctly understand statistical analysis in the media?
I am deeply concerned about the numeracy, as we call it, of the population. I don’t think many people understand even orders of magnitude. I know, for example, that if I walk into a small shop in the United States and buy a candy bar, and I pay with a twenty-dollar bill, the person behind the register has no idea roughly how much money I should get back. They have no notion of numeracy. The only way they can do it is to put the number in and have the register tell them the change to return. And I know it’s not only the United States; it’s everywhere. I believe there’s this complete loss of the magnitudes of numbers, which makes it really hard for people to understand statistical information. On top of this, when we provide statistical summaries, most people might be able to grasp averages, maybe proportions, like percentages, but when you start going beyond this, even uncertainty, variability, these are ideas that people just really struggle with, even people who are well educated. All of these things make it really difficult for large-scale public acceptance and comprehension of statistical information. I think we have made it too easy for people not to appreciate numbers, because everybody can do simple calculations on a cellphone, and everything is automated when they buy, and they don’t do anything.
1. Mayer-Schönberger, V. and K. Cukier, 2013. Big Data: A Revolution That Will Transform How We Live, Work and Think. Eamon Dolan/Houghton Mifflin Harcourt. Boston.
2. Anderson, C., 2008. «The End of Theory: The Data Deluge Makes the Scientific Method Obsolete». Wired, 16(7). Available at: <http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory>.
«The future of statistics is going to be in working in these complicated interdisciplinary projects, working on challenging processes, challenging systems»
«There’s a real challenge with misuse of the data and, in the public mind, it creates a real skepticism, a real doubt about the validity of statistical analysis»