In 2011, a group of researchers conducted a scientific study to find an impossible result: that listening to certain songs can make you younger.
Their study involved real people, truthfully reported data, and commonplace statistical analyses. So how did they do it?
The answer lies in a statistical method scientists often use to try to figure out whether their results mean something or if they're random noise.
In fact, the whole point of the music study was to point out ways this method can be misused.
A famous thought experiment explains the method: there are eight cups of tea, four with the milk added first, and four with the tea added first.
A participant must determine which are which according to taste.
There are 70 different ways the cups can be sorted into two groups of four, and only one is correct.
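The count of 70 can be checked directly. The short sketch below labels the cups 0 through 7 (an arbitrary labeling chosen for illustration) and lists every way to pick the four "milk first" cups; the remaining four are then the "tea first" group.

```python
from itertools import combinations
from math import comb

cups = range(8)  # label the eight cups 0..7 (labels are arbitrary)

# Choosing which four cups are "milk first" fixes the other four as "tea first".
guesses = list(combinations(cups, 4))

print(len(guesses))  # 70
print(comb(8, 4))    # 70, the same count from the binomial coefficient
```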
So, can she taste the difference? That's our research question.
To analyze her choices, we define what's called a null hypothesis: that she can't distinguish the teas.
If she can't distinguish the teas, she'll still get the right answer 1 in 70 times by chance.
1 in 70 is roughly .014. That single number is called a p-value.
In many fields, a p-value of .05 or below is considered statistically significant, meaning there's enough evidence to reject the null hypothesis.
Based on a p-value of .014, researchers would reject the null hypothesis that she can't distinguish the teas.
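To make the arithmetic concrete, here is a minimal sketch of the tea calculation. It assumes pure guessing, so every one of the 70 sortings is equally likely. The function names and the extra case of three correct cups are illustrative additions, not part of the original study.

```python
from fractions import Fraction
from math import comb


def prob_exactly(k, cups_per_group=4):
    """Chance of naming exactly k of the milk-first cups correctly by pure guessing."""
    total = comb(2 * cups_per_group, cups_per_group)
    ways = comb(cups_per_group, k) * comb(cups_per_group, cups_per_group - k)
    return Fraction(ways, total)


def p_value(k_observed, cups_per_group=4):
    """Chance of doing at least as well as k_observed if she can't tell the teas apart."""
    return sum(prob_exactly(k, cups_per_group)
               for k in range(k_observed, cups_per_group + 1))


ALPHA = 0.05  # the conventional significance threshold
for k in (4, 3):
    p = p_value(k)
    print(k, p, round(float(p), 3), float(p) <= ALPHA)
# 4 1/70 0.014 True
# 3 17/70 0.243 False
```

Sorting all eight cups correctly clears the .05 threshold. Getting three of the four milk-first cups right would not, because guessing does at least that well about 24% of the time.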
Though p-values are commonly used by both researchers and journals to evaluate scientific results, they're really confusing, even for many scientists.
That's partly because all a p-value actually tells us is the probability of getting a certain result, assuming the null hypothesis is true.
So if she correctly sorts the teas, the p-value is the probability of her doing so assuming she can't tell the difference.
But the reverse isn't true: the p-value doesn't tell us the probability that she can taste the difference, which is what we're trying to find out.
So if a p-value doesn't answer the research question, why does the scientific community use it?
Well, because even though a p-value doesn't directly state the probability that the results are due to random chance, it usually gives a pretty reliable indication.
At least, it does when used correctly. And that's where many researchers, and even whole fields, have run into trouble.
Most real studies are more complex than the tea experiment.
Scientists can test their research question in multiple ways, and some of these tests might produce a statistically significant result, while others don't.
It might seem like a good idea to test every possibility.
But it's not, because with each additional test, the chance of a false positive increases.
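How fast that chance grows can be sketched with one line of arithmetic. The example assumes the tests are independent and that every null hypothesis is true, which real analyses of the same data rarely satisfy exactly, so treat the numbers as a rough guide.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests of true
    null hypotheses comes out significant at threshold alpha."""
    return 1 - (1 - alpha) ** num_tests


for k in (1, 5, 10, 20):
    print(k, round(chance_of_false_positive(k), 3))
# 1 0.05
# 5 0.226
# 10 0.401
# 20 0.642
```

With twenty tests, under these assumptions, a "significant" result somewhere is more likely than not, even when nothing real is going on.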
Searching for a low p-value, and then presenting only that analysis, is often called p-hacking.
It's like throwing darts until you hit a bullseye and then saying you only threw the dart that hit the bullseye.
This is exactly what the music researchers did.
They played a different song to each of three groups of participants and collected lots of information about them.
The analysis they published included only two out of the three groups.
Of all the information they collected, their analysis used only participants' fathers' age, to "control for variation in baseline age across participants."
They also paused their experiment after every ten participants, and continued if the p-value was above .05, but stopped when it dipped below .05.
They found that participants who heard one song were 1.5 years younger than those who heard the other song, with a p-value of .04.
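The effect of the stop-when-significant rule can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reconstruction of the music study. It assumes two groups with no real difference, ten new participants per group at each check, at most five checks, and a simple z-test with a known standard deviation of 1. All names and parameters are hypothetical.

```python
import math
import random


def two_sided_p(mean_diff, n_per_group):
    """Two-sided p-value from a z-test, assuming each score has known SD 1."""
    z = mean_diff / math.sqrt(2.0 / n_per_group)
    return math.erfc(abs(z) / math.sqrt(2.0))


def run_study(rng, batch=10, max_looks=5, alpha=0.05, peek=True):
    """Simulate one study in which the null hypothesis is true.

    Returns True if the study ends up reporting p < alpha.
    """
    sum_a = sum_b = 0.0
    n = 0
    for look in range(1, max_looks + 1):
        for _ in range(batch):
            sum_a += rng.gauss(0.0, 1.0)  # group A score, no real effect
            sum_b += rng.gauss(0.0, 1.0)  # group B score, same distribution
        n += batch
        is_last_look = look == max_looks
        # With peek=True the p-value is checked after every batch;
        # with peek=False it is checked once, at the planned sample size.
        if peek or is_last_look:
            if two_sided_p(sum_a / n - sum_b / n, n) < alpha:
                return True
    return False


def false_positive_rate(peek, n_studies=2000, seed=0):
    """Share of simulated null studies that report a significant result."""
    rng = random.Random(seed)
    hits = sum(run_study(rng, peek=peek) for _ in range(n_studies))
    return hits / n_studies


print("check once at the end:", false_positive_rate(peek=False))
print("stop when p < .05:   ", false_positive_rate(peek=True))
```

Checking once at the planned sample size should give a false positive rate near 5%. Stopping as soon as the p-value dips below .05 should push it noticeably higher, even though every individual test is a perfectly ordinary one.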
Usually it's much tougher to spot p-hacking, because we don't know the results are impossible: the whole point of doing experiments is to learn something new.
Fortunately, there's a simple way to make p-values more reliable: pre-registering a detailed plan for the experiment and analysis that others can check,
so researchers can't keep trying different analyses until they find a significant result.
And, in the true spirit of scientific inquiry,
there's even a new field that's basically science doing science on itself: studying scientific practices in order to improve them.