Why Tim Lambert is Wrong
Tim Lambert of the University of New South Wales (and of fighting John Lott fame) seems to misunderstand the meaning of confidence intervals and Gaussian distributions in his two posts (here and here) on the Lancet study, which used cluster sampling to claim in its abstract that 100,000 excess deaths occurred among Iraqi civilians as a result of the war.
The study reported a 95% confidence interval for excess deaths of 8,000 to 194,000. The Lancet picked the midpoint when making the 100,000 excess deaths claim, and this has been the source of a lot of controversy. Critics of the study say that it is incorrect to merely pick the midpoint as the most likely figure. Lambert writes:
Yes, the 95% confidence interval by itself doesn’t tell us what the probabilities are. But this doesn’t mean that each value is equally likely. We can also construct other confidence intervals. We can be 67% confident that the number is between 50,000 and 150,000. In this sense the end points of the 95% CI are less likely and the middle is most likely.
In the second link I provide above, he writes:
Not all values in the confidence interval are equally likely. The ones in the middle are more likely and 100,000 is the most likely number.
This seriously misunderstands the point of Gaussian distributions (and Lambert admits earlier in the first post I linked to that the distribution is well-approximated by a Gaussian). First, a Gaussian (a.k.a. Normal, bell curve, etc.) distribution is a continuous distribution. As such, any single point has probability 0 of being attained. The reason is that the "width" or measure of any single point is 0. The only things that have positive measure (or positive probability) in a Gaussian distribution are intervals. So, yes, while the "mass" of a Gaussian is mostly "concentrated in the center", it does not mean that a single point in the center is more likely than a single point at the edge. They are both equally likely, meaning not likely at all (this is somewhat of a paradox since every single point has probability 0, but adding them all up, you get 100% probability... this is the result of adding an uncountable, rather than merely countably infinite, number of 0's).
Let me demonstrate with a small example. Take a normal distribution with mean 3 and variance 1. If you look in a statistics table (or compute using Mathematica or Matlab), you'll notice that the interval from 0 to 1.04 has about 2.4% probability. On the other hand, the interval between 2.975 and 3.025 has only 2% probability, even though every point in it is clearly closer to the mean of 3 than any point between 0 and 1.04. So, no, points closer to the mean are not more likely. Of two intervals of the same length, the one closer to the mean is more likely. But that's not what is at question here. The question is which number is more likely, 8,000 or 100,000. These two are equally likely in the case of continuous Gaussian distributions.
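For anyone without a statistics table handy, here is a minimal sketch of that computation in Python. The helper names (normal_cdf, interval_probability) are my own for illustration, and the last comparison, on two intervals of length 0.05, is an extra check that is not part of the example above.

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of Normal(mean, sd^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def interval_probability(lo, hi, mean, sd):
    """P(lo <= X <= hi) for X ~ Normal(mean, sd^2)."""
    return normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)

MEAN, SD = 3.0, 1.0  # variance 1, so the standard deviation is also 1

# The two intervals from the example: a long one in the tail, a short one at the mean.
tail = interval_probability(0.0, 1.04, MEAN, SD)
center = interval_probability(2.975, 3.025, MEAN, SD)
print(f"P(0 <= X <= 1.04)      = {tail:.4f}")
print(f"P(2.975 <= X <= 3.025) = {center:.4f}")

# Extra check (my assumption, not in the text): two intervals of equal length 0.05.
same_length_tail = interval_probability(0.0, 0.05, MEAN, SD)
print(f"P(0 <= X <= 0.05)      = {same_length_tail:.6f}")
```

The long tail interval comes out slightly more probable than the short central one, while for equal lengths the central interval wins by a wide margin.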
Here's also why taking the midpoint is a bit silly on just a common sense level. We know with 100% certainty that there are between 0 and 100 billion excess deaths as a result of the war. That doesn't mean that 50 billion is the most likely figure.
What a confidence interval means is that if you repeat the experiment many, many times, then 95% of the time the real number will fall in the confidence interval given by the experiment. I am not a statistician, so I can't comment on the regression analysis that the Lancet study did (which the authors don't really make clear... they just refer to a software package they used). I assume that the confidence interval they cite is not a Gaussian interval itself but is a transform of the Gaussian intervals for the standard errors. Again, I'm not an expert on regression analysis, so I don't want to venture into that area. But since my research is in probability theory, I had to correct Prof. Lambert's misunderstanding of the likelihood of specific numbers in a confidence interval.
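The "repeat the experiment many times" reading can be checked with a small simulation. This is only a sketch under simplifying assumptions of my own: normally distributed data with a known standard deviation and a plain z-interval for the mean, which is not the procedure the Lancet study used. The function name and all parameter values are hypothetical.

```python
import math
import random

def coverage(true_mean=3.0, sd=1.0, n=25, trials=10_000, seed=0):
    """Fraction of repeated experiments whose 95% interval contains true_mean."""
    rng = random.Random(seed)
    # Half-width of a 95% z-interval for the mean when sd is known.
    half_width = 1.96 * sd / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        # One "experiment": draw n observations and average them.
        sample_mean = sum(rng.gauss(true_mean, sd) for _ in range(n)) / n
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials

rate = coverage()
print(f"fraction of intervals covering the true mean: {rate:.3f}")
```

The printed fraction should land close to 0.95. Note that it is the interval that changes from experiment to experiment; the true mean stays fixed.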
UPDATE: After a number of email exchanges with Daniel Davies (a.k.a. dsquared) of Crooked Timber and Tim Lambert, I suppose I will have to agree that by looking at a discrete version of a "Gaussian," yes, indeed the mean has the highest likelihood of occurring. However, the Lancet study obtained a confidence interval from their bootstrap procedure, in which, I assume, the parameters are continuous. Again, I am not an expert on bootstrap or maximum likelihood estimators, so I will defer to Davies on this one. As for Tim Lambert, a little less snarkiness is in order.
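To make the discrete version concrete, here is a rough sketch that reuses the toy distribution from my example (mean 3, variance 1) and rounds outcomes to the nearest 0.5. The bin width and the range of bins are arbitrary choices of mine, and none of this is taken from the Lancet data.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of Normal(mean, sd^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

MEAN, SD, WIDTH = 3.0, 1.0, 0.5

# Bin centers 0.0, 0.5, ..., 6.0; each bin collects everything that rounds to its center.
centers = [i * WIDTH for i in range(13)]
bin_prob = {
    c: normal_cdf(c + WIDTH / 2, MEAN, SD) - normal_cdf(c - WIDTH / 2, MEAN, SD)
    for c in centers
}

most_likely = max(bin_prob, key=bin_prob.get)
for c in centers:
    print(f"rounded value {c:3.1f}: probability {bin_prob[c]:.4f}")
print("most likely rounded value:", most_likely)
```

Once each value stands for a small interval of fixed width, every value has positive probability, and the one at the mean has the largest.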
UPDATE II: Tim Lambert points out in the comments below that my correction was ungracious. I apologize to Tim for being ungracious. As a practical matter, his analysis was more apropos than mine. And I thank him for particularly poignant examples in his email and in the comments on his blog that made me change my views.