The paradoxical title of this post is quoted from this article in Science News about the inconsistent and often inappropriate use of statistics in academic research.  I would strongly recommend every accounting researcher read the article, make sure they understand the criticisms, and then promptly ignore most of them.

The article emphasizes that statistical significance is not the same as substantive (in our field, economic) significance; that p-values rest on a weak theoretical basis (blending two very different statistical approaches, Fisher's and Neyman-Pearson's); that a result significant at the .05 level does NOT mean there is a 95% chance that the hypothesis in question is true; and that the .05 threshold itself, for all the weight placed on it, is arbitrary.  Moreover, the article argues that researchers consistently ignore these facts.  For example:

Statisticians perpetually caution against mistaking statistical significance for practical importance, but scientific papers commit that error often. Ziliak studied journals from various fields — psychology, medicine and economics among others — and reported frequent disregard for the distinction.  For example:

“I found that eight or nine of every 10 articles published in the leading journals make the fatal substitution” of equating statistical significance to importance, he said in an interview. Ziliak’s data are documented in the 2008 book The Cult of Statistical Significance, coauthored with Deirdre McCloskey of the University of Illinois at Chicago.
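
To see how easily the two come apart, here is a minimal Python sketch (the sample size and the 0.01-standard-deviation “effect” are invented purely for illustration) in which a trivially small difference clears the .05 threshold simply because the sample is huge:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two hypothetical groups whose means differ by a trivial 0.01 standard
# deviations -- an effect most readers would call economically negligible.
n = 200_000
control = rng.normal(loc=0.00, scale=1.0, size=n)
treated = rng.normal(loc=0.01, scale=1.0, size=n)

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"mean difference: {treated.mean() - control.mean():.4f}")
print(f"p-value: {p_value:.4g}")
# With a sample this large, the p-value typically lands well below .05
# even though the effect itself is far too small to matter in practice.
```

The point is not that large samples are bad; it is that a small p-value, by itself, says nothing about whether the effect is big enough to care about.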

Another point is that various biases in how statistical tests are applied (most notably testing many hypotheses at once) can lead to inappropriate inferences:

Nowhere are the problems with statistics more blatant than in studies of genetic influences on disease. In 2007, for instance, researchers combing the medical literature found numerous studies linking a total of 85 genetic variants in 70 different genes to acute coronary syndrome, a cluster of heart problems. When the researchers compared genetic tests of 811 patients that had the syndrome with a group of 650 (matched for sex and age) that didn’t, only one of the suspect gene variants turned up substantially more often in those with the syndrome — a number to be expected by chance.

“Our null results provide no support for the hypothesis that any of the 85 genetic variants tested is a susceptibility factor” for the syndrome, the researchers reported in the Journal of the American Medical Association.
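
The “number to be expected by chance” arithmetic is worth making explicit.  Here is a rough sketch, assuming (purely for illustration) that none of the 85 variants has any real effect and that each is tested independently at the .05 level:

```python
import numpy as np

rng = np.random.default_rng(1)

n_variants = 85   # hypotheses tested, as in the replication described above
alpha = 0.05      # conventional significance threshold

# If no variant has any real effect, the expected number of nominally
# "significant" findings is simply n_variants * alpha.
print(f"expected false positives: {n_variants * alpha:.1f}")   # about 4

# Simulate many hypothetical replications under that null assumption;
# under the null, each test's p-value is uniform on [0, 1].
n_sims = 10_000
hits = (rng.uniform(size=(n_sims, n_variants)) < alpha).sum(axis=1)
print(f"average significant variants per simulated study: {hits.mean():.1f}")
print(f"share of studies with at least one: {(hits >= 1).mean():.2f}")
```

Under those assumptions you would expect roughly four “significant” variants in any given study even if none of them mattered, so finding one is no evidence at all.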

I say that most practicing researchers should ignore most of the article because it fails to recognize how p-values are actually used in the research community (at least in accounting):  they are simply a filter, a relatively consistent and useful standard for identifying results worth further exploration.  If you use standard statistical techniques and clear the magic 0.05 threshold, most researchers will view the results as worth paying attention to, and will then turn to additional questions.  Is the effect economically significant?  Does the story make sense?  Are you missing obvious controls?  Is the result novel?  Is there an easy alternative explanation?  Are there further testable implications?  If you can’t give a satisfactory answer on each count, you probably won’t get through the publication process; if you can, well then at least you have a shot.

It may seem unfair that some studies are arbitrarily dismissed for failing to reach the .05 level of statistical significance, but I believe it is a reasonable price to pay for additional assurance that we aren’t publishing “novel” results every month that are simply flukes.  The important point (especially if you are just starting in your research career) is to keep reminding yourself that statistical significance is not a statement of the TRUTH VALUE of a proposition, but merely an indication that the finding is sufficiently reliable to warrant further consideration.

Before you accuse me of having no regard for Research As a Path to True Knowledge, keep in mind that a Bayesian perspective makes it pretty much impossible to read truth directly off a calculated significance level.  As discussed in the article, Bayesians form beliefs by combining newly-collected data with the beliefs they held before observing that data.  This means that if you thought in 1990 that there was a 99% chance that markets were informationally efficient (for example), it would have taken whopping evidence to change your mind.  A single study significant at the 5% level wouldn’t mean much, and indeed those who held strong prior beliefs about efficiency spent years arguing that contrary evidence was methodologically flawed.  Since we never know whether a model is well-specified, and we all rely on our priors, the p-value cannot tell us what is true.  But the .05 threshold does force the conversation to continue, rather than letting those papers be rejected out of hand.
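
To make the arithmetic concrete, here is a back-of-the-envelope Bayesian update for the efficiency example; the 1% prior on inefficiency and the assumed 80% power of the study are invented numbers, chosen only to show the mechanics:

```python
# Back-of-the-envelope Bayesian update for the market-efficiency example.
# Every number here is an illustrative assumption, not an estimate.

prior_inefficient = 0.01   # in 1990 you gave "markets are inefficient" a 1% chance
alpha = 0.05               # chance a study finds p < .05 if markets are efficient
power = 0.80               # assumed chance it finds p < .05 if they are not

# Total probability of observing a significant anomaly study:
p_significant = power * prior_inefficient + alpha * (1 - prior_inefficient)

# Bayes' rule: belief in inefficiency after one significant study.
posterior = power * prior_inefficient / p_significant
print(f"posterior P(markets are inefficient): {posterior:.2f}")   # about 0.14
```

Even a clean result at the .05 level leaves the committed skeptic mostly unmoved, which is exactly why those debates dragged on for years.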

I am a Bayesian, but I am also an instrumentalist:  p-values and significance thresholds are tools that researchers use to make their community work.  We should know and understand the underlying theory, but let’s not get too carried away with what the results mean.  Instead, let’s focus on how they are used.