7.bitter_lesson
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. There were many examples of AI researchers' belated learning of this bitter lesson, and it is instructive to review some of the most prominent.
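As a rough back-of-the-envelope illustration of that compounding (the two-year doubling time below is an assumed parameter for the sketch, not a figure from the essay):

```python
def compute_per_dollar_growth(years: float, doubling_years: float = 2.0) -> float:
    """Factor by which computation per dollar grows over `years` years,
    assuming the cost per unit of computation halves every `doubling_years`
    years (the doubling time is an assumption, not from the essay)."""
    return 2.0 ** (years / doubling_years)

# Over a horizon only slightly longer than a typical research project,
# the available computation multiplies dramatically:
for horizon in (5, 10, 20):
    print(f"{horizon:>2} years -> ~{compute_per_dollar_growth(horizon):,.0f}x compute per dollar")
```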
In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that "brute force" search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.
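The heart of such a "brute force" approach is minimax search with alpha-beta pruning, driven deep by specialized hardware. A minimal generic sketch, where the `game` interface (`is_terminal`, `evaluate`, `moves`, `apply`) is an assumed placeholder rather than any real engine's API:

```python
import math

def alphabeta(state, depth, alpha, beta, maximizing, game):
    """Minimax search with alpha-beta pruning over an assumed `game`
    interface. Real chess engines add move ordering, transposition
    tables, and hardware acceleration on top of this skeleton."""
    if depth == 0 or game.is_terminal(state):
        return game.evaluate(state)  # static evaluation at the search horizon
    if maximizing:
        value = -math.inf
        for move in game.moves(state):
            value = max(value, alphabeta(game.apply(state, move),
                                         depth - 1, alpha, beta, False, game))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cutoff: the opponent will never allow this line
        return value
    else:
        value = math.inf
        for move in game.moves(state):
            value = min(value, alphabeta(game.apply(state, move),
                                         depth - 1, alpha, beta, True, game))
            beta = min(beta, value)
            if beta <= alpha:
                break  # alpha cutoff: we already have a better option
        return value
```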
A similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale. Also important was the use of learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion). Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research. In computer Go, as in computer chess, researchers' initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.
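In its simplest form, learning a value function by self play can be sketched as tabular TD(0) under a negamax convention. Everything here (the `game` interface, the parameters) is an illustrative assumption, not the method of any particular program:

```python
import random
from collections import defaultdict

def self_play_td0(game, episodes=10_000, alpha=0.1, epsilon=0.1):
    """Learn V[s], the value of state `s` for the player to move, by
    self-play with tabular TD(0). The `game` interface (`start`, `moves`,
    `apply`, `is_terminal`, `outcome`) is an assumed toy API, where
    `outcome(s)` scores a terminal state for the player who just moved."""
    V = defaultdict(float)
    for _ in range(episodes):
        s = game.start()
        while not game.is_terminal(s):
            moves = game.moves(s)
            if random.random() < epsilon:
                m = random.choice(moves)  # occasional exploration
            else:
                # A successor that is bad for the opponent is good for us.
                m = min(moves, key=lambda m: V[game.apply(s, m)])
            s2 = game.apply(s, m)
            # TD(0) target: terminal result, else the negated value of the
            # successor state (which belongs to the opponent).
            target = game.outcome(s2) if game.is_terminal(s2) else -V[s2]
            V[s] += alpha * (target - V[s])
            s = s2
    return V
```

Note that nothing in this sketch encodes chess- or Go-specific knowledge; the state space and the outcome signal are the only inputs, which is what lets the same scheme absorb arbitrarily more computation.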
In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge---knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems. As in the games, researchers always tried to make systems that worked the way the researchers thought their own minds worked---they tried to put that knowledge in their systems---but it proved ultimately counterproductive, and a colossal waste of researchers' time, when, through Moore's law, massive computation became available and a means was found to put it to good use.
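What made the HMM approach "more statistical" is that it scores an observation sequence with a probabilistic recursion, the forward algorithm, rather than with hand-built linguistic rules. A minimal discrete-observation sketch (toy dimensions; real speech systems are far richer):

```python
import numpy as np

def hmm_forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of an observation sequence under a discrete HMM,
    via the standard forward algorithm with per-step rescaling to avoid
    numerical underflow.

    obs : sequence of observation indices, length T
    pi  : (N,) initial state probabilities
    A   : (N, N) transitions, A[i, j] = P(state j | state i)
    B   : (N, M) emissions,   B[i, k] = P(observation k | state i)
    """
    alpha = pi * B[:, obs[0]]   # joint prob. of first obs and each state
    log_likelihood = 0.0
    for t in range(1, len(obs)):
        scale = alpha.sum()
        log_likelihood += np.log(scale)
        alpha = (alpha / scale) @ A * B[:, obs[t]]  # recurse, rescaled
    return log_likelihood + np.log(alpha.sum())
```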
In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
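The convolution these networks are built from is a single generic operation: one small learned filter slid across the entire image. A minimal single-channel NumPy sketch (valid padding, stride 1; the `edge_filter` below is a classic hand-picked example, whereas in a CNN the kernel values are learned from data):

```python
import numpy as np

def conv2d(image, kernel):
    """Single-channel 2D convolution (cross-correlation, as deep-learning
    frameworks implement it), valid padding, stride 1."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output pixel is the filter's dot product with one patch.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)                 # stand-in for a real image
edge_filter = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])      # vertical-edge kernel, hand-picked
print(conv2d(image, edge_filter).shape)      # -> (6, 6)
```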
This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that:
1) AI researchers have often tried to build knowledge into their agents,
2) this always helps in the short term, and is personally satisfying to the researcher, but
3) in the long run it plateaus and even inhibits further progress, and
4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.
The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.
One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.