In 2013, a group of researchers at DeepMind in London had set their sights on a grand challenge.
They wanted to create an AI system that could beat, not just a single Atari game, but every Atari game.
They developed a system they called Deep Q Networks, or DQN, and less than two years later, it was superhuman.
DQN was getting scores 13 times better than professional human games testers at "Breakout," 17 times better at "Boxing," and 25 times better at "Video Pinball."
But there was one notable, and glaring, exception.
When playing "Montezuma's Revenge," DQN couldn't score a single point, even after playing for weeks.
What was it that made this particular game so vexingly difficult for AI? And what would it take to solve it?
Spoiler alert: babies. We'll come back to that in a minute.
Playing Atari games with AI involves what's called reinforcement learning, where the system is designed to maximize some kind of numerical reward.
In this case, those rewards were simply the game's points.
This underlying goal drives the system to learn which buttons to press and when to press them to get the most points.
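To make that concrete, here is a minimal sketch of that reward-maximizing loop in Python. The environment object and its reset/step interface are hypothetical stand-ins for an Atari emulator, not DeepMind's actual code; the point is simply that the only feedback the system gets is the points it scores.

```python
# A minimal sketch of the reward-maximizing loop described above.
# `env` and `choose_button` are hypothetical stand-ins (not DeepMind's code):
# the agent sees a screen image, presses a button, and its only feedback
# is however many points that press earned.

def play_one_game(env, choose_button):
    screen = env.reset()                         # first frame of the game
    total_points = 0
    done = False
    while not done:
        button = choose_button(screen)           # decide which button to press, given the frame
        screen, points, done = env.step(button)  # press it; get the next frame and any points scored
        total_points += points                   # the quantity the system is trying to maximize
    return total_points
```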
Some systems use model-based approaches, where they have a model of the environment that they can use to predict what will happen next once they take a certain action.
DQN, however, is model-free. Instead of explicitly modeling its environment,
it just learns to predict, based on the images on screen, how many future points it can expect to earn by pressing different buttons.
For instance, "if the ball is here and I move left, more points, but if I move right, no more points."
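In code, that kind of model-free value prediction might look something like the sketch below. This is not DeepMind's actual network, just a small PyTorch-style stand-in: it maps the current screen to one number per button, each an estimate of the future points that button should lead to, and playing greedily just means pressing the button with the largest estimate.

```python
import torch
import torch.nn as nn

# A toy, model-free value network in the spirit of DQN (not the original architecture):
# input  = a stack of recent grayscale screen frames,
# output = one estimated "future points" number per button.
class TinyQNetwork(nn.Module):
    def __init__(self, n_buttons: int, n_frames: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_frames, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.values = nn.LazyLinear(n_buttons)  # one predicted-return value per button

    def forward(self, screens: torch.Tensor) -> torch.Tensor:
        return self.values(self.features(screens))

# "If the ball is here and I move left, more points" becomes: press whichever
# button currently has the highest predicted value.
def greedy_button(net: TinyQNetwork, screens: torch.Tensor) -> int:
    with torch.no_grad():
        return int(net(screens.unsqueeze(0)).argmax(dim=1))
```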
But learning these connections requires a lot of trial and error.
The DQN system would start by mashing buttons randomly, and then slowly piece together which buttons to mash when in order to maximize its score.
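That trial-and-error schedule is often implemented as an "epsilon-greedy" rule, sketched below with hypothetical names: early on the system presses random buttons nearly all of the time, and as training goes on it leans more and more on whichever button its value estimates currently favor.

```python
import random

# A sketch of the random-mashing-to-informed-play schedule ("epsilon-greedy").
# `best_button` stands in for whatever the agent's value estimates currently recommend.
def epsilon_greedy_button(best_button: int, n_buttons: int, step: int,
                          start: float = 1.0, end: float = 0.05,
                          decay_steps: int = 1_000_000) -> int:
    # The chance of acting randomly shrinks from `start` to `end` over `decay_steps` presses.
    epsilon = max(end, start - (start - end) * step / decay_steps)
    if random.random() < epsilon:
        return random.randrange(n_buttons)  # early on: mash a random button
    return best_button                      # later on: press the button believed to score most
```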
But in playing "Montezuma's Revenge," this approach of random button-mashing fell flat on its face.
A player would have to perform an entire sequence of actions just to score their first points at the very end.
A mistake? Game over. So how could DQN even know it was on the right track? This is where babies come in.
In studies, infants consistently look longer at pictures they haven't seen before than ones they have.
There just seems to be something intrinsically rewarding about novelty.
This behavior has been essential in understanding the infant mind.
It also turned out to be the secret to beating "Montezuma's Revenge."
The DeepMind researchers worked out an ingenious way to plug this preference for novelty into reinforcement learning.
They made it so that unusual or new images appearing on the screen were every bit as rewarding as real in-game points.
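One simple way to "plug in" that preference, sketched below, is to add a novelty bonus on top of the game's own points. The count-based bonus here is only an illustrative stand-in (the bonus used in the actual DeepMind work was more sophisticated); the key idea is that screens the system has rarely seen get treated as if they were worth points.

```python
from collections import defaultdict
from math import sqrt

# An illustrative novelty bonus (a simple count-based stand-in, not the exact
# method from the DeepMind work): rarely seen screens earn extra reward.
seen_counts = defaultdict(int)

def reward_with_novelty(screen_fingerprint, game_points: float, bonus_scale: float = 1.0) -> float:
    # `screen_fingerprint` is any hashable summary of the frame (e.g. a downsampled image as a tuple).
    seen_counts[screen_fingerprint] += 1
    novelty_bonus = bonus_scale / sqrt(seen_counts[screen_fingerprint])  # shrinks as a screen becomes familiar
    return game_points + novelty_bonus  # new sights count "every bit as much" as real in-game points
```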
Suddenly, DQN was behaving totally differently from before.
It wanted to explore the room it was in, to grab the key and escape through the locked door
not because it was worth 100 points, but for the same reason we would: to see what was on the other side.
With this new drive, DQN not only managed to grab that first key -- it explored all the way through 15 of the temple's 24 chambers.
But emphasizing novelty-based rewards can sometimes create more problems than it solves.
A novelty-seeking system that's played a game too long will eventually lose motivation.
If it's seen it all before, why go anywhere? Alternatively, if it encounters, say, a television, it will freeze.
The constant novel images are essentially paralyzing. The ideas and inspiration here go in both directions.
AI researchers stuck on a practical problem, like how to get DQN to beat a difficult game, are turning increasingly to experts in human intelligence for ideas.
At the same time, AI is giving us new insights into the ways we get stuck and unstuck:
into boredom, depression, and addiction, along with curiosity, creativity, and play.