Thursday, March 31, 2005

Why Tim Lambert is Wrong

Tim Lambert of the University of New South Wales (and of fighting John Lott fame) seems to misunderstand the meaning of confidence intervals and Gaussian distributions when discussing the Lancet study (here and here), which used cluster sampling to claim in its abstract that 100,000 excess deaths occurred among Iraqi civilians as a result of the war.

The study reported that the 95% confidence interval for excess deaths is between 8,000 and 194,000. Lancet picked the midpoint when making the 100,000 excess deaths claim and this has been the source of a lot of controversy. Critics of the study say that it is incorrect to merely pick the midpoint as the most likely figure. Lambert writes:
Yes, the 95% confidence interval by itself doesn’t tell us what the probabilities are. But this doesn’t mean that each value is equally likely. We can also construct other confidence intervals. We can be 67% confident that the number is between 50,000 and 150,000. In this sense the end points of the 95% CI are less likely and the middle is most likely.

In the second link I provide above, he writes:
Not all values in the confidence interval are equally likely. The ones in the middle are more likely and 100,000 is the most likely number.

This seriously misunderstands the point of Gaussian distributions (and Lambert admits earlier in the first post I linked to that the distribution is well-approximated by a Gaussian). First, a Gaussian (a.k.a. Normal, bell curve, etc.) distribution is a continuous distribution. As such, any single point has probability 0 of being attained, because the "width" or measure of any single point is 0. The only things that have positive measure (or positive probability) in a Gaussian distribution are intervals. So, yes, while the "mass" of a Gaussian is mostly "concentrated in the center", it does not mean that a single point in the center is more likely than a single point at the edge. They are both equally likely, meaning not likely at all (this is something of a paradox since every single point has probability 0, but adding them all up, you get 100% probability... this is the result of adding an uncountable, rather than just countably infinite, number of 0's).

Let me demonstrate with a small example. Take a normal distribution with mean 3 and variance 1. If you look in a statistics table (or compute using Mathematica or Matlab), you'll notice that the interval 0 to 1.04 has about 2.4% probability. On the other hand, the interval between 2.975 and 3.025 has only 2% probability, even though these points are clearly closer to the mean of 3 than any point between 0 and 1.04. So, no, points closer to the mean are not more likely. Intervals of the same length that are closer to the mean are. But that's not what is at question here. The question is which number is more likely, 8,000 or 100,000. These two are equally likely in the case of continuous Gaussian distributions.
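For anyone who wants to check the numbers in this example without a statistics table, here is a minimal Python sketch. The helper names (normal_cdf, normal_pdf, interval_prob) are my own for illustration, and the third interval (0 to 0.05) is an extra comparison I added so that two intervals of equal length can be set side by side.

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def normal_pdf(x, mean=0.0, sd=1.0):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def interval_prob(a, b, mean, sd):
    """Probability that a normal variable falls between a and b."""
    return normal_cdf(b, mean, sd) - normal_cdf(a, mean, sd)

MEAN, SD = 3.0, 1.0  # mean 3, variance 1, as in the example

wide_tail = interval_prob(0.0, 1.04, MEAN, SD)         # long interval far from the mean
narrow_center = interval_prob(2.975, 3.025, MEAN, SD)  # short interval around the mean
narrow_tail = interval_prob(0.0, 0.05, MEAN, SD)       # same length as narrow_center

print(f"P(0 < X < 1.04)      = {wide_tail:.4f}")
print(f"P(2.975 < X < 3.025) = {narrow_center:.4f}")
print(f"P(0 < X < 0.05)      = {narrow_tail:.6f}")
print(f"density at 3 vs. at 0: {normal_pdf(3.0, MEAN, SD):.4f} vs. {normal_pdf(0.0, MEAN, SD):.4f}")
```

The wide tail interval beats the narrow central one only because it is about twenty times longer; at equal length the central interval wins by a wide margin.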

Here's also why taking the midpoint is a bit silly on just a common sense level. We know for 100% certainty that there are between 0 and 100 billion excess deaths as a result of the war. That doesn't mean that 50 billion is the most likely one.

What a confidence interval means is this: if you repeat the experiment many, many times, then 95% of the time the real number will fall in the confidence interval given by the experiment. I am not a statistician, so I can't comment on the regression analysis that the Lancet authors did (which they don't really make clear... they just refer to a software package they used). I assume that the confidence interval they cite is not a Gaussian interval itself but is a transform of the Gaussian intervals for the standard errors. Again, I'm not an expert on regression analysis, so I don't want to venture into that area. But since my research is in probability theory, I had to correct Prof. Lambert's misunderstanding of the likelihood of specific numbers in a confidence interval.
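The "repeat the experiment many times" reading of a confidence interval can be illustrated with a small simulation. This is only a sketch under simplifying assumptions of my own: independent normal samples with a known standard deviation, nothing like the cluster sampling in the Lancet study, and the function name coverage_rate and all parameter values are made up for the illustration.

```python
import math
import random

def coverage_rate(true_mean, sd, sample_size, trials, seed=0):
    """Fraction of repeated experiments whose 95% interval contains true_mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Half-width of a 95% interval for the mean when sd is known.
    half_width = 1.96 * sd / math.sqrt(sample_size)
    hits = 0
    for _ in range(trials):
        draws = [rng.gauss(true_mean, sd) for _ in range(sample_size)]
        sample_mean = sum(draws) / sample_size
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials

rate = coverage_rate(true_mean=3.0, sd=1.0, sample_size=25, trials=4000)
print(f"intervals containing the true mean: {rate:.1%}")
```

Each simulated experiment produces a different interval; the 95% describes how often the procedure captures the fixed true value, not a probability attached to any one number inside a particular interval.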

UPDATE: After a number of email exchanges with Daniel Davies (a.k.a. dsquared) of Crooked Timber and Tim Lambert, I suppose I will have to agree that by looking at a discrete version of a "Gaussian," yes, indeed the mean has the highest likelihood of occurring. However, the Lancet study obtained a confidence interval from their bootstrap procedure, in which, I assume, the parameters are continuous. Again, I am not an expert on bootstrap or maximum likelihood estimators, so I will defer to Davies on this one. As for Tim Lambert, a little less snarkiness is in order.

UPDATE II: Tim Lambert points out in the comments below that my correction was ungracious. I apologize to Tim for being ungracious. As a practical matter, his analysis was more apropos than mine. And I thank him for particularly poignant examples in his email and in the comments on his blog that made me change my views.


Blogger Kevin Donoghue said...

"We know for 100% certainty that there are between 0 and 100 billion excess deaths as a result of the war."

It took a while to post this comment here, so in the meantime I explained why this is wrong over at Tim Lambert's blog. The word "excess" is used for a reason.

11:51 AM  
Blogger Yevgeny Vilensky said...

Ok, would you like it better if it were between -6 billion and 100 billion? Then, the mean in your case would be 47 billion. Either way, it's nonsense.

3:47 PM  
Blogger Tim Lambert said...

Well, that was a rather graceless correction.

12:37 AM  
Blogger Tim Lambert said...

Thank you for your second update.

8:56 AM  
Anonymous Dan said...

I think there's a natural way of observing the same thing even in the continuous context. Part of the problem comes, I think, from the fact that we too often treat probability densities as if they were the same thing as a probability (I rather like E. T. Jaynes' take on this). If you ignore the measure theory (i.e., the Kolmogorov system) for the moment (I hate sigma fields with a passion), then the natural way to get a density over a continuous space is to take a finite partition over the space and take the limit as the partition becomes arbitrarily fine, thereby producing the density.

From this perspective, the problem arises when you then treat the density as if it were a real thing rather than the outcome of a limiting argument. By (correctly) regarding any single outcome as having probability measure zero, and then (incorrectly) comparing the probability of two outcomes each of measure zero, we get the undefined odds ratio of 0/0. But if we take the odds ratio in the discrete/finite case first and then take the limit at the very end, we get the ratio of the two Gaussian densities, which is also properly defined as an odds ratio.

Basically, the argument is that you should always define the problem properly in the finite case, and only then take the limit. Otherwise you run into all sorts of horrible paradoxes.
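The limiting argument can be checked numerically. The sketch below is my own illustration, not anything from the study: it uses a standard normal, compares a small interval around the mean with an equally small interval 1.96 standard deviations away (where the endpoint of a 95% interval sits), and shrinks the width. The names point_odds, center and edge are hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    """Standard normal probability density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def point_odds(a, b, width):
    """Odds of landing within width/2 of a versus within width/2 of b."""
    prob_a = normal_cdf(a + width / 2) - normal_cdf(a - width / 2)
    prob_b = normal_cdf(b + width / 2) - normal_cdf(b - width / 2)
    return prob_a / prob_b

center, edge = 0.0, 1.96  # the mean, and an endpoint of the 95% interval

widths = [1.0, 0.1, 0.01, 0.001]
odds = [point_odds(center, edge, w) for w in widths]
density_ratio = normal_pdf(center) / normal_pdf(edge)

for w, o in zip(widths, odds):
    print(f"width {w:<6} odds {o:.4f}")
print(f"ratio of densities: {density_ratio:.4f}")
```

Both interval probabilities shrink toward zero, but their ratio settles on the ratio of the densities, which is exp(1.96^2 / 2), a bit under 7 to 1 in favor of the center.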

11:11 AM  
Anonymous Dan said...

Oops. I should have been explicit in saying that I was referring to your original update, where you mention that the problem doesn't come up in the discrete case, but wonder about what happens with respect to continuous probability spaces. (Also, for anyone happening past unaware, I'm not Daniel Davies. Different Daniel).

11:20 AM  
Blogger Sammler said...

The difference between continuous and discrete distributions is, of course, not germane. However, there is no reason to suppose that the distribution is symmetric (in fact, it would be highly unlikely). An equally valid and more parsimonious fit is a lognormal distribution (N = 39400*exp(0.813*X), where X is a standard normal deviate); this gives the same confidence interval, with a median at about 39,400 and a mode at about 20,300. Not that this median and mode are right, either; they are equally useless.
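A lognormal fit of this kind can be reconstructed from the interval endpoints alone. The sketch below assumes only the reported 95% interval of 8,000 to 194,000 and the standard lognormal formulas (median = scale, mode = scale * exp(-sigma^2)); the variable names are mine. It gives a scale near 39,400 deaths and a sigma near 0.813.

```python
import math

LOWER, UPPER = 8_000.0, 194_000.0  # reported 95% confidence interval
Z95 = 1.96                         # standard normal quantile for a 95% interval

# On the log scale the interval is symmetric, so its center is the log of the
# geometric mean of the endpoints and its half-width is Z95 * sigma.
scale = math.sqrt(LOWER * UPPER)
sigma = math.log(UPPER / LOWER) / (2.0 * Z95)

median = scale                           # median of exp(mu + sigma * X)
mode = scale * math.exp(-sigma ** 2)     # mode of a lognormal distribution

# Recover the interval endpoints from the fitted parameters as a sanity check.
fitted_lower = scale * math.exp(-Z95 * sigma)
fitted_upper = scale * math.exp(Z95 * sigma)

print(f"sigma  = {sigma:.3f}")
print(f"median = {median:,.0f}")
print(f"mode   = {mode:,.0f}")
print(f"95% interval = {fitted_lower:,.0f} to {fitted_upper:,.0f}")
```

The same interval is thus compatible with a "most likely" value anywhere from roughly 20,000 to 100,000, depending on which shape one assumes for the distribution.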

3:17 AM  
