In my distant past I was a research assistant working with a scientist who had a whole bunch of pairwise correlations between variables in a very large data set. The researcher found some interesting correlations between some of these variables and complained that the statistician involved with the research said that the correlations were not significant, even though, looked at in isolation, they were significant. I had to agree with the statistician and fortunately was able to convince my boss that the statistician was indeed correct here. My reasoning was basically by analogy: if you flip 100 coins you expect to find on average 50 heads and 50 tails, and your results should vary around that by chance. Further, every once in a while, you expect to find, say, 40 heads and 60 tails just by chance, and you would not get excited about that. In the same way, if you have a slew of uncorrelated variables, by chance you can get some significant correlations: test enough pairs at the 5% level and roughly one in twenty will look significant even when nothing is going on.
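For readers who like to see this for themselves, here is a minimal sketch in Python of the same argument. It is not the analysis from that old project: the number of variables (40), the number of observations (30), and the function names are all made up for illustration. The cutoff of 0.361 is the standard two-tailed 5% critical value of Pearson's r for 30 observations. Every variable is generated independently, so any "significant" correlation the code finds is pure chance.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(n_vars=40, n_obs=30, r_crit=0.361, seed=1):
    """Count pairs of independent variables whose |r| exceeds r_crit.

    r_crit = 0.361 is the two-tailed 5% critical value for n_obs = 30;
    it must be changed if n_obs changes. All sizes are illustrative.
    """
    rng = random.Random(seed)
    # Each variable is independent random noise: there are no real correlations.
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    n_pairs = 0
    n_hits = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            n_pairs += 1
            if abs(pearson_r(data[i], data[j])) > r_crit:
                n_hits += 1
    return n_hits, n_pairs


hits, pairs = count_spurious()
print(f"{hits} of {pairs} pairs look 'significant' at the 5% level")
```

With 40 variables there are 780 pairs, so something on the order of a few dozen "significant" correlations should turn up even though none is real. That is why the statistician insisted on judging the correlations as a group rather than one at a time.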
Scientists love statistically "significant" results, but this last year a very interesting article titled "The Truth Wears Off" appeared in the New Yorker, looking at a disturbingly common pattern in science called the decline effect. Often, especially with complex phenomena, initial experimental results are highly significant, but then with replication the results either are not so pronounced or disappear entirely. The results cannot be replicated. The article's author, Jonah Lehrer, gives a wide range of examples from both basic and applied science and lists a number of reasons why this phenomenon happens.
Some of the reasons discussed include bias toward positive results, regression toward the mean, and publication bias, none of which by itself explains what is going on. The most likely culprit seems to be selective reporting, and this has actually been tested by a biologist named Richard Palmer. Lehrer says the following about Palmer's investigation:
"Palmer’s most convincing evidence relies on a statistical tool known as a funnel graph. When a large number of studies have been done on a single subject, the data should follow a pattern: studies with a large sample size should all cluster around a common value—the true result—whereas those with a smaller sample size should exhibit a random scattering, since they’re subject to greater sampling error. This pattern gives the graph its name, since the distribution resembles a funnel....after Palmer plotted every study of fluctuating asymmetry, he noticed that the distribution of results with smaller sample sizes wasn’t random at all but instead skewed heavily toward positive results."
Palmer's results suggest that reporting bias is everywhere in science and Palmer concludes that:
“We cannot escape the troubling conclusion that some—perhaps many—cherished generalities are at best exaggerated in their biological significance and at worst a collective illusion nurtured by strong a-priori beliefs often repeated.”
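The funnel pattern described above is easy to reproduce in a toy simulation. The sketch below is not Palmer's method or his data. It assumes a true effect of exactly zero, and it uses a made-up reporting rule: small studies only get reported when their estimate is more than one standard error above zero. The sample sizes and function names are likewise chosen only for illustration.

```python
import random
import statistics


def simulate_study(n, rng, true_effect=0.0):
    """Run one hypothetical study: return (estimated effect, standard error)."""
    sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    return mean, se


def reported_effects(n, n_studies, selective, rng):
    """Effect estimates that get reported.

    If selective is True, a study is reported only when its estimate is more
    than one standard error above zero (an assumed rule, for illustration).
    """
    effects = []
    for _ in range(n_studies):
        mean, se = simulate_study(n, rng)
        if selective and mean <= se:
            continue  # this result stays in the file drawer
        effects.append(mean)
    return effects


rng = random.Random(42)
small_all = reported_effects(10, 2000, selective=False, rng=rng)
small_selected = reported_effects(10, 2000, selective=True, rng=rng)
large_all = reported_effects(1000, 2000, selective=False, rng=rng)

print("small studies, all reported:   mean", round(statistics.fmean(small_all), 3))
print("small studies, selective:      mean", round(statistics.fmean(small_selected), 3))
print("large studies, all reported:   mean", round(statistics.fmean(large_all), 3))
print("spread, small vs large:", round(statistics.stdev(small_all), 3),
      round(statistics.stdev(large_all), 3))
```

When everything is reported, the small studies scatter widely but evenly around zero and the large ones cluster tightly near it, which is the funnel. Once the reporting rule kicks in, the small studies that survive all sit on the positive side, so their average is well above zero even though the true effect is nil. That lopsided funnel is the signature Palmer was looking for.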
The article gives disturbing examples from a wide range of disciplines and gloomily concludes that:
"Such anomalies demonstrate the slipperiness of empiricism. Although many scientific ideas generate conflicting results and suffer from falling effect sizes, they continue to get cited in the textbooks and drive standard medical practice. Why? Because these ideas seem true. Because they make sense. Because we can’t bear to let them go. And this is why the decline effect is so troubling. Not because it reveals the human fallibility of science, in which data are tweaked and beliefs shape perceptions."
So what's to be done? From my perspective there are several things, some of which are discussed in the article. First of all, scientists have to avoid what the article terms significance chasing, that is, getting excited about seemingly significant correlations as my old boss did. Related to that, scientists need to be sure that they are indeed using appropriately conservative tests, that is, tests that minimize the chance that an incorrect hypothesis will be accepted. Data need to be open and available for scrutiny, and experimental controls need to be carefully laid out. When I teach, one thing I tell my students is that the hypothesis and the experimental design, including the sample sizes, have to be worked out in advance of testing the hypothesis.
Maybe we forget, and don't properly train our students to remember, that a hypothesis is just that: it is a guess, perhaps informed by theory, but it is still a guess, and nonsignificant results are still meaningful. Of course it would help if funding agencies, be they governmental or private, would remember that as well!