Ensuring smarter-than-human intelligence has a positive outcome

||yabo app ,视频


The talk was inspired by “AI Alignment: Why It’s Hard, and Where to Start,” and serves as an introduction to the subfield of alignment research in AI. A modified transcript follows.




2.1.Task: Fill a cauldron
2.2.Subproblem: Suspend buttons

3.The big picture

3.2。Four key propositions

4.Fundamental difficulties


I’m the executive director of the Machine Intelligence Research Institute. Very roughly speaking, we’re a group that’s thinking in the long term about artificial intelligence and working to make sure that by the time we have advanced AI systems, we also know how to point them in useful directions.

Across history, science and technology have been the largest drivers of change in human and animal welfare, for better and for worse. If we can automate scientific and technological innovation, that has the potential to change the world on a scale not seen since the Industrial Revolution. When I talk about “advanced AI,” it’s this potential for automating innovation that I have in mind.

AI systems that exceed humans in this capacity aren’t coming next year, but many smart people are working on it, and I’m not one to bet against human ingenuity. I think it’s likely that we’ll be able to build something like an automated scientist in our lifetimes, which suggests that this is something we need to take seriously.

当人们谈论社会影响general AI,他们经常沦为拟人化的猎物。他们将人工混合在一起智力with artificialconsciousness, or assume that if AI systems are “intelligent,” they must be intelligent in the same way a human is intelligent. A lot of journalists express a concern that when AI systems pass a certain capability level, they’ll spontaneously develop “natural” desires like a human hunger for power; or they’ll reflect on their programmed goals, find them foolish, and “rebel,” refusing to obey their programmed instructions.

These are misplaced concerns. The human brain is a complicated product of natural selection. We shouldn’t expect machines that exceed human performance in scientific innovation to closely resemble humans, any more than early rockets, airplanes, or hot air balloons closely resembled birds.1

The notion of AI systems “breaking free” of the shackles of their source code or spontaneously developing human-like desires is just confused. The AI systemits source code, and its actions will only ever follow from the execution of the instructions that we initiate. The CPU just keeps on executing the next instruction in the program register. We could write a program that manipulates its own code, including coded objectives. Even then, though, the manipulations that it makes are made as a result of executing the original code that we wrote; they do not stem from some kind of ghost in the machine.

The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification. As Stuart Russell (co-author ofArtificial Intelligence: A Modern Approach)puts it:

The primary concern is not spooky emergent consciousness but simply the ability to make高质量的决定。在这里,质量是指采取的动作的预期结果实用性,该效用功能可能是由人类设计师指定的。现在我们有一个问题:

1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.

2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.

A system that is optimizing a function ofn变量,目标取决于大小的子集k<n,通常会将其余的不受限制变量设置为极值;如果实际上我们关心的那些不受约束的变量之一,则发现的解决方案可能是高度不受欢迎的。

These kinds of concerns deserve a lot more attention than the more anthropomorphic risks that are generally depicted in Hollywood blockbusters.


Task: Fill a cauldron

Many people, when they start talking about concerns with smarter-than-human AI, will throw up a picture of the Terminator. I was once quoted in a news article making fun of people who put up Terminator pictures in all their articles about AI, next to a Terminator picture. I learned something about the media that day.

I think this is a much better picture:


This is Mickey Mouse in the movieFantasia, who has very cleverly enchanted a broom to fill a cauldron on his behalf.

How might Mickey do this? We can imagine that Mickey writes a computer program and has the broom execute the program. Mickey starts by writing down a scoring function or objective function:
$$\mathcal{U}_{broom} =
\ begin {cases}
1&\ text {如果cauldron full} \\
0&\ text {如果cauldron空}
\ end {case} $$
Given some set of available actions, Mickey then writes a program that can take one of these actions as input and calculate how high the score is expected to be if the broom takes that action. Then Mickey can write a function that spends some time looking through actions and predicting which ones lead to high scores, and outputs an action that leads to a relatively high score:
$$\underset{a\in A}{\mathrm{sorta\mbox{-}argmax}} \ \mathbb{E}\left[\mathcal{U}_{broom}\mid a\right]$$The reason this is “sorta-argmax” is that there may not be time to evaluate every action in . For realistic action sets, agents should only need to find actions that make the scoring function as large as they can given resource constraints, even if this isn’t the maximal action.

This program may look simple, but of course, the devil’s in the details: writing an algorithm that does accurate prediction and smart search through action space is basically the whole problem of AI. Conceptually, however, it’s pretty simple: We can describe in broad strokes the kinds of operations the broom must carry out, and their plausible consequences at different performance levels.

When Mickey runs this program, everything goes smoothly at first.Then:


我声称作为AI Go的虚构描述,这是非常现实的。


第一个困难是米奇释放扫帚的目标功能a bunch of other termsMickey cares about:

$$ \ MATHCAL {U} _ {human} =
\ begin {cases}
1&\ text {如果cauldron full} \\
0&\text{ if cauldron empty} \\
-10&\text{ if workshop flooded} \\
+0.2&\ text {如果很有趣} \\
-1000000&\text{ if someone gets killed} \\
&\ text {…和更多的} \\
\ end {case} $$

The second difficulty is that Mickey programmed the broom to make the expectation of its score as large as it could. “Just fill one cauldron with water” looks like a modest, limited-scope goal, but when we translate this goal into a probabilistic context, we find that optimizing it means driving up the probability of success to absurd heights. If the broom assigns a 99.9% probability to “the cauldron is full,” and it has extra resources lying around, then it will always try to find ways to use those resources to drive the probability even a little bit higher.

Contrast this with the limited “task-like我们可能想到的目标。我们希望大锅饱满,但是从某种意义上说,我们希望该系统“不要太努力”,即使它具有许多可用的认知和物理资源来致力于这个问题。亚博体育苹果app官方下载我们希望它能在某些直观的限制内行使创造力和机智,但我们不希望它采取“荒谬”策略,尤其是具有很大意外后果的策略。2

In this example, the original objective function looked pretty task-like. It was bounded and quite simple. There was no way to get ever-larger amounts of utility. It’s not like the system got one point for every bucket of water it poured in — then there would clearly be an incentive to overfill the cauldron. The problem was hidden in the fact that we’re maximizingexpected效用。这使得目标开放式,这意味着系统的目标函数中的小错误也会亚博体育苹果app官方下载爆炸

There are a number of different ways that a goal that looks task-like can turn out to be open-ended. Another example: a larger system that has an overarching task-like goal may havesubprocessesthat are themselves trying to maximize a variety of different objective functions, such as optimizing the system’s memory usage. If you don’t understand your system well enough to track whether any of its subprocesses are themselves acting like resourceful open-ended optimizers, thenit may not matter how safe the top-level objective is

因此,扫帚不断抓住更多的水 - 例如,大锅有泄漏的机会,或者“饱满”要求水略高于边缘的水平。而且,当然,扫帚从来没有“反抗”米奇的代码。如果有的话,扫帚追求了它的目标tooeffectively.

Subproblem: Suspend buttons

A common response to this problem is: “OK, there may be some unintended consequences of the objective function, but we can always pull the plug, right?”

Mickeytries this,并且行不通:

vlcsnap-2016-05-04-19h21m04s349 VLCSNAP-2016-05-04-19H22M09S178 vlcsnap-2016-05-04-19h53m09s315

我声称这也是现实的,对于足够擅长建模其环境的系统。亚博体育苹果app官方下载如果该系统试图亚博体育苹果app官方下载提高其评分功能的期望,并且足够聪明,可以意识到其被关闭将导致得分较低的结果,那么该系统的激励措施就是颠覆关闭尝试。该系统越有能力,就越有可能找到创造性的方法亚博体育苹果app官方下载来实现该子目标 - 例如,通过将自己复制到互联网上,或者欺骗程序员认为它更安全。

This isn’t to say that it’s impossible to shut down sufficiently capable AI systems; it’s just that we need to put in the extra work of deliberately designing systems to avoid searching for ways to avoid being shut down. If you’ve written code that would try to resist attempts to shut it down, then you shouldn’t run the code even if you’re convinced it would fail — there are better ways to spend those compute cycles than on adversarial policy searches.

相关的安全措施不认为我们总是可以超越并超越该系统以某种相关方式修改系统的目标函数。亚博体育苹果app官方下载我们需要弄清楚如何正式指定不会自动将AI系统与操作员一起将AI系统置于对抗环境中的目标函数;亚博体育苹果app官方下载或者,我们需要找出某种方式使系统实现目标,而无需在传统意义上优化某些目标功能。亚博体育苹果app官方下载例如,我们可以寻找限制系统功能的方法亚博体育苹果app官方下载到一小撮“白名单”域that don’t include thinking about whether the system will be shut down.

$$\mathcal{U}^\prime_{broom} =
\ begin {cases}
1 \ text {如果cauldron full}&\ text {&} \ textbf {b} \ text {= off} \\
0 \text{ if cauldron empty } & \text{& } \textbf{B}\text{=OFF} \\
1 \text{ if robot suspended } & \text{& } \textbf{B}\text{=ON} \\
0 \ text {否则}
\ end {case} $$

It’s not clear how we could actually specify this goal — if there are any loopholes in the way we’ve specified “suspension,” then the system might try to exploit them to hedge its bets. E.g., it could fulfill the技术的definition of “suspend operations” while tasking a child agent it built or subprocess it started with filling the cauldron. This would allow is to achieve both “cauldron full” and “robot suspended” at once, just in case it’s wrong about whether the button has been pushed.

但是,假设我们已经找到了一种技术上指定悬架的气密方法。我们现在遇到的问题是,我们已经使系统有动力去关心按钮是否被按下。亚博体育苹果app官方下载If it has any way to influence its operators’ decision to push the button, or any other way to exercise more control, then it will jump on the opportunity and try to ensure the button is pushed, or ensure that it’s not pushed — depending on whether it looks easier to suspend or to fill the cauldron.

And, surprisingly, the state of the art on this problem isn’t much more advanced than that. There’s a paper co-authored by me, Fallenstein, Yudkowsky, and Armstrong (“Corrigibility“), and a related result by Orseau and Armstrong (“安全中断的代理“), but not a lot more.

我们想要的是一种结合两个目标函数的方法 - 正常操作的默认函数,以及何时暂停系统磁盘的暂停功能。亚博体育苹果app官方下载$$\mathcal{U}_{normal}(act) =
\ begin {cases}
1\text{ if cauldron full} \\
0 \ text {如果cauldron空}
\ end {case} $$

$$\mathcal{U}_{suspend}(act) =
\ begin {cases}
1\text{ if suspended} \hspace{0.8cm}\\
0 \ text {否则}
\ end {case} $$

So far, we haven’t found any way to achieve all three goals at once. As an example, we can try to remove any incentive on the system’s part to control whether its suspend button is pushed by giving the system a switching objective function that always assigns the same expected utility to the button being on or off:
$$ \ Mathcal {U} _ {switch}(act)=
\ begin {cases}
\mathcal{U}_{normal}(act) &\text{ if button=OFF} \\
\mathcal{U}_{suspend}(act) + \theta &\text{ if button=ON}
\ end {case} $$
$ \ theta = \ max_ {act} \ mathbb {e} \ left [\ mathcal {u} _ {normal} \ mid act \ right] - \ max_ {act} \ mathbb {e}u} _ {暂停} \ mid act \ right] $$Unfortunately, this approach makes it impossible to satisfy the third condition. It turns out that the above objective function is equivalent to setting the probability of the button being pressed to 0. The system won’t故意地tamper with the button, but it will happily approve updates that delete any reference to the button as dead code, or build subagents that ignore the button.

Meeting these intuitively simple constraints turns out to be a nontrivial problem. This is a pattern that runs through a lot of questions in this space: traditional tools and concepts run into immediate safety problems that don’t turn up in conventional capabilities research.

The big picture


Let’s take a step back and talk about what’s needed overall in order to align highly capable AI systems with our interests.

Here’s a dramatically simplified pipeline: You have some humans who come up with some task or goal or preference set that serves as their intended value function . Since our values are complicated and context-sensitive, in practice we’ll need to build systems to learn our values over time, rather than coding them by hand.3我们将AI系统的目标称为目标(这可能与或可能不会相同)亚博体育苹果app官方下载。

alignment-prioritiesWhen the press covers this topic, they often focus on one of two problems: “What if the wrong group of humans develops smarter-than-human AI first?”, and “What if AI’s natural desires cause to diverge from ?”

humans-nd在我看来,“错误的人类”问题不应该是我们关注的事情,直到我们有理由认为我们可以通过right一群人。在某种情况下,我们非常有意义的人也无法利用一般的AI系统来做好事情,即使他们尝试过。亚博体育苹果app官方下载As a simple example, if you handed me a box that was an extraordinarily powerful function optimizer — I could put in a description of any mathematical function, and it would give me an input that makes the output extremely large — then I don’t know how I could use that box to develop a new technology or advance a scientific frontier without causing any catastrophes.4


我相信有很多唾手可得in this area, and also that a fair amount of the work does need to be done early (e.g., to help inform capabilities research directions — some directions may produce systems that are much easier to align than others). If we don’t solve these problems, developers with arbitrarily good or bad intentions will end up producing equally bad outcomes. From an academic or scientific standpoint, our first objective in that kind of situation should be to remedy this state of affairs and at least make good outcomes technologically possible.

许多人很快认识到“自然的欲望”是一个小说,但是从此推断出我们需要专注于媒体倾向于强调的其他问题 - “如果坏演员掌握了比人类更聪明的人AI,该怎么办?”, “How will this kind of AI impact employment and the distribution of wealth?”, etc. These are important questions, but they’ll only end up actually being relevant if we figure out how to bring general AI systems up to a minimum level of reliability and safety.

另一个共同的线程是“为什么不告诉AI系统(在此处插入直觉的道德戒律)?”亚博体育苹果app官方下载在这种思考问题的方式上,通常(也许是不公平的)与艾萨克·阿西莫夫(Isaac Asimov)的写作相关联,确保AI系统的积极影响很大程度上是关于提出足够模糊的自然语言指令,以涵盖许多人类的道德推理:亚博体育苹果app官方下载


In contrast, precision is a virtue in real-world safety-critical software systems. Driving down accident risk requires that we begin with limited-scope goals rather than trying to “solve” all of morality at the outset.5

My view is that the critical work is mostly in designing an effective value learning process, and in ensuring that the sorta-argmax process is correctly hooked up to the resultant objective function :


The better your value learning framework is, the less explicit and precise you need to be in pinpointing your value function , and the more you can offload the problem of figuring out what you want to the AI system itself. Value learning, however, raisesa number of basic difficultiesthat don’t crop up in ordinary machine learning tasks.


Natural selection is the only “engineering” process we know of that has ever led to a generally intelligent artifact: the human brain. Since natural selection relies on a fairly unintelligent hill-climbing approach, one lesson we can take away from this is that it’s possible to reach general intelligence with a hill-climbing approach and enough brute force — though we can presumably do better with our human creativity and foresight.

Another key take-away is that natural selection was maximally strict aboutonly为一个非常简单的目标优化大脑:遗传健康。尽管如此,人类代表的内部目标不是遗传健康。我们有无数的目标 - 爱,正义,美丽,怜悯,娱乐,尊重,美食,身体健康,这与祖先稀树草原中的良好生存和繁殖策略有关。但是,我们最终将这些关联直接相关,而不是重视我们基因的传播本身的目的 - 正如我们每次采用节育措施时所证明的那样。

This is a case where the external optimization pressure on an artifact resulted in a general intelligence with internal objectives that didn’t match the external selection pressure. And just as this caused humans’ actions to diverge from natural selection’s pseudo-goal once we gained new capabilities, we can expect AI systems’ actions to diverge from humans’ if we treat their inner workings as black boxes.

如果我们将梯度下降应用于黑匣子,试图使它非常擅长最大化某些目标,那么以足够的独创性和耐心,我们也许能够产生某种强大的优化过程。6By default, we should expect an artifact like that to have a goal that strongly correlates with our objective in the training environment, but sharply diverges from在某些新环境或较宽的选项集可用时

On my view, the most important part of the alignment problem is ensuring that the value learning framework and overall system design we implement allow us to crack open the hood and confirm when the internal targets the system is optimizing for match (or don’t match) the targets we’re externally selecting through the learning process.7


Four key propositions

退后一步:我给出了该领域中开放问题的一些示例(悬挂按钮,价值学习,有限的基于任务的AI等),并且概述了我认为是主要问题类别的内容。但是,我最初的特征是为什么我认为这一重要领域是“ AI可以自动化通用科学推理,而通用科学推理是一个很大的重要性”。优先考虑这项工作的核心原因是什么?



第二,sufficiently optimized objectives tend to converge on adversarial instrumental strategies。大多数目标自己人工智能系统c亚博体育苹果app官方下载ould possess would be furthered by subgoals like “acquire resources” and “remain operational” (along with “learn more about the environment,” etc.).


To use an example due to Stuart Russell: If you build a robot and program it to go to the supermarket to fetch some milk, and the robot’s model says that one of the paths is much safer than the other, then the robot, in optimizing for the probability that it returns with milk, will automatically take the safer path. It’s not that the system fears death, but that it can’t fetch the milk if it’s dead.

Third,通用AI系统可能会显示出较大且快速的能亚博体育苹果app官方下载力增长。The human brain isn’t anywhere near the upper limits for hardware performance (or, one assumes, software performance), and there are a number of other reasons to expect large capability advantages and rapid capability gain from advanced AI systems.

一个简单的例子,Google可以购买有希望的AI创业公司并向他们投掷大量GPU,从而快速跳出“这些问题看起来可能从现在开始相关的是相关的”到“我们需要在T​​he The The The The The The The The The The The The Pressight中解决所有这些问题next year” à la DeepMind’s progress in Go. Or performance may suddenly improve when a system is first given large-scale Internet access, when there’s a conceptual breakthrough in algorithm design, or when the system itself is able to propose improvements to its hardware and software.9

Fourth,将高级AI系统与我们的兴趣保持一致看起来很亚博体育苹果app官方下载困难。I’ll say more about why I think this presently.

Roughly speaking, the first proposition says that AI systems won’t naturally end up sharing our objectives. The second says that by default, systems with substantially different objectives are likely to end up adversarially competing for control of limited resources. The third suggests that adversarial general-purpose AI systems are likely to have a strong advantage over humans. And the fourth says that this problem is hard to solve — for example, that it’s hard to transmit our values to AI systems (addressing orthogonality) or避免对抗激励措施(addressing convergent instrumental strategies).

These four propositions don’t mean that we’re screwed, but they mean that this problem is critically important. General-purpose AI has the potential to bring enormous benefits if we solve this problem, but we do need to make finding solutions a priority for the field.

Fundamental difficulties

我为什么认为AI的对齐看起来相当困难?主要原因是这是我真正解决这些问题的经验。我鼓励你自己看看一些问题and try to solve them in toy settings; we could use more eyes here. I’ll also make note of a few structural reasons to expect these problems to be hard:




Analogously, even though general AI and narrow AI are “just AI” in some sense, we can expect that the more general AI systems are likely to experience a wider range of stressors, and possess more dangerous failure modes.


第二,对齐看起来困难相同的意图son it’s harder to build a good space probe than to write a good app.


This is significantly more caution than goes into the deployment of, say, the new WhatsApp. One big reason for the difference is that it’s hard to roll back a space probe. You can send version updates to a space probe and correct software bugs, but only if the probe’s antenna and receiver work, and if all the code required to apply the patch is working. If your system for applying patches is itself failing, then there’s nothing to be done.

In that respect, smarter-than-human AI is more like a space probe than like an ordinary software project. If you’re trying to build something smarter than yourself, there are parts of the system that have to work perfectly on the first real deployment. We can do all the test runs we want, but once the system is out there, we can only make online improvements if the code that makes the system允许those improvements is working correctly.

If nothing yet has struck fear into your heart, I suggest meditating on the fact that the future of our civilization may well depend on our ability to write code that正常工作on the first deploy.

Lastly, alignment looks difficult for the same reason computer security is difficult: systems need to be robust to intelligent searches for loopholes.


AI出现了类似的问题。我在这里强调的问题并不是AI系统可能会采取对抗:AI的一致性作为研究计划是为了寻找方法亚博体育官网亚博体育苹果app官方下载prevent adversarial behaviorbefore it can crop up. We don’t want to be in the business of trying to outsmart arbitrarily intelligent adversaries. That’s a losing game.

The parallel to cryptography is that in AI alignment we deal with systems that perform intelligent searches through a very large search space, and which can produce weird contexts that force the code down unexpected paths. This is because the weirdedge casesare places of extremes, and places of extremes are often the place where a given objective function is optimized.11像计算机安全专业人员一样,AI一致性研究人员需要非常擅长考虑边缘案例。亚博体育官网

It’s much easier to make code that works well on the path that you were visualizing than to make code that works on all the paths that you weren’t visualizing. AI alignment needs to workon all the paths you weren’t visualizing

总结,我们应该这类问题的方法with the same level of rigor and caution we’d use for a security-critical rocket-launched space probe, and do the legwork as early as possible. At this early stage, a key part of the work is just to formalize basic concepts and ideas so that others can critique them and build on them. It’s one thing to have a philosophical debate about what kinds of suspend buttons people intuit ought to work, and another thing to translate your intuition into an equation so that others can fully evaluate your reasoning.

This is a crucial project, and I encourage all of you who are interested in these problems to get involved and try your hand at them. There areample resources onlinefor learning more about the open technical problems. Some good places to start include MIRI’s亚博体育官网 以及Google Brain,Openai和S亚博体育官网tanford的研究人员的一篇很棒的论文称为“AI安全中的具体问题。”

  1. An airplane can’t heal its injuries or reproduce, though it can carry heavy cargo quite a bit further and faster than a bird. Airplanes are simpler than birds in many respects, while also being significantly more capable in terms of carrying capacity and speed (for which they were designed). It’s plausible that early automated scientists will likewise be simpler than the human mind in many respects, while being significantly more capable in certain key dimensions. And just as the construction and design principles of aircraft look alien relative to the architecture of biological creatures, we should expect the design of highly capable AI systems to be quite alien when compared to the architecture of the human mind.
  2. Trying to give some formal content to these attempts to differentiate task-like goals from open-ended goals is one way of generating open research problems. In the “Alignment for Advanced Machine Learning Systems”研亚博体育官网究建议,正式化“不要太努力”的问题是轻度优化, “steer clear of absurd strategies” is保守主义, and “don’t have large unanticipated consequences” isimpact measures。See also “avoiding negative side effects” in Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané’s “AI安全中的具体问题。”
  3. 在过去的几十年中,我们在机器视觉领域中学到的一件事是,手工指定猫的外观是没有希望的,但是指定可以学习识别猫的学习系统并不难。亚博体育苹果app官方下载可以手工指定我们重视的所有内容更加绝望,但是我们可以指定一个可以学习相关的“价值”概念的学习系统是合理的。亚博体育苹果app官方下载
  4. 看 ”环境目标,”“”低影响力的代理,” and “Mild Optimization” for examples of obstacles to specifying physical goals without causing catastrophic side-effects.

    Roughly speaking, MIRI’s focus is on research directions that seem likely to help us conceptually understand how to do AI alignment in principle, so we’re fundamentally less confused about the kind of work that’s likely to be needed.

    What do I mean by this? Let’s say that we’re trying to develop a new chess-playing programs. Do we understand the problem well enough that we could solve it if someone handed us an arbitrarily large computer? Yes: We make the whole search tree, backtrack, see whether white has a winning move.

    如果我们不知道如何回答这个问题with an arbitrarily large computer, then this would suggest that we were fundamentally confused about chess in some way. We’d either be missing the search-tree data structure or the backtracking algorithm, or we’d be missing some understanding of how chess works.

    This was the position we were in regarding chess prior to Claude Shannon’s seminal paper, and it’s the position we’re currently in regarding many problems in AI alignment. No matter how large a computer you hand me, I could not make a smarter-than-human AI system that performs even a very simple limited-scope task (e.g., “put a strawberry on a plate without producing any catastrophic side-effects”) or achieves even a very simple open-ended goal (e.g., “maximize the amount of diamond in the universe”).

    If I didn’t have any particular goal in mind for the system, Icouldwrite a program (assuming an arbitrarily large computer) that strongly optimized the future in an undirected way, using a formalism likeaixi。In that sense we’re less obviously confused about capabilities than about alignment, even though we’re still missing a lot of pieces of the puzzle on the practical capabilities front.

    Similarly, we do know how to leverage a powerful function optimizer to mine bitcoin or prove theorems. But we don’t know how to (safely) do the kind of prediction and policy search tasks I described in the “fill a cauldron” section, even for modest goals in the physical world.

    Our goal is to develop and formalize basic approaches and ways of thinking about the alignment problem, so that our engineering decisions don’t end up depending on sophisticated and clever-sounding verbal arguments that turn out to be subtly mistaken. Simplifications like “what if we weren’t worried about resource constraints?” and “what if we were trying to achieve a much simpler goal?” are a good place to start breaking down the problem into manageable pieces. For more on this methodology, see “MIRI’s Approach。”

  5. “Fill this cauldron without being too clever about it or working too hard or having any negative consequences I’m not anticipating” is a rough example of a goal that’s intuitively limited in scope. The things we actually want to use smarter-than-human AI for are obviously more ambitious than that, but we’d still want to begin with various limited-scope tasks rather than open-ended goals.

    Asimov’s Three Laws of Robotics make for good stories partly for the same reasons they’re unhelpful from a research perspective. The hard task of turning a moral precept into lines of code is hidden behind phrasings like “[don’t,] through inaction, allow a human being to come to harm.” If one followed a rule like that strictly, the result would be massively disruptive, as AI systems would need to systematically intervene to prevent即使是最小的危害的最小风险; and if the intent is that one follow the rule loosely, then all the work is being done by the human sensibilities and intuitions that tell us何时以及如何应用规则

    A common response here is that vague natural-language instruction is sufficient, because smarter-than-human AI systems are likely to be capable of natural language comprehension. However, this is eliding the distinction between the system’s objective function and its model of the world. A system acting in an environment containing humans may learn a world-model that has lots of information about human language and concepts, which the system can then use to achieve its objective function; but this fact doesn’t imply that any of the information about human language and concepts will “leak out” and alter the system’s objective function directly.

    Some kind of value learning process needs to be defined where the objective function itself improves with new information. This is a tricky task because there aren’t known (scalable) metrics or criteria for value learning in the way that there are for conventional learning.

    If a system’s world-model is accurate in training environments but fails in the real world, then this is likely to result in lower scores on its objective function — the system itself has an incentive to improve. The severity of accidents is also likelier to be self-limiting in this case, since false beliefs limit a system’s ability to effectively pursue strategies.

    In contrast, if a system’s value learning process results in a that matches our in training but diverges from in the real world, then the system’s will obviously not penalize it for optimizing . The system has no incentive relative to to “correct” divergences between and , if the value learning process is initially flawed. And accident risk is larger in this case, since a mismatch between and doesn’t necessarily place any limits on the system’s instrumental effectiveness at coming up with effective and creative strategies for achieving .


    1. “Do What I Mean” is an informal idea, and even if we knew how to build a smarter-than-human AI system, we wouldn’t know how to precisely specify this idea in lines of code.

    2. If doing what we actually mean is instrumentally useful for achieving a particular objective, then a sufficiently capable system may learn how to do this, and may act accordingly so long as doing so is useful for its objective. But as systems become more capable, they are likely to find creative new ways to achieve the same objectives, and there is no obvious way to get an assurance that “doing what I mean” will continue to be instrumentally useful indefinitely.

    3. If we use value learning to refine a system’s goals over time based on training data that appears to be guiding the system toward a that inherently values doing what we mean, it is likely that the system will actually end up zeroing in on a that approximately does what we mean during training but catastrophically diverges in some difficult-to-anticipate contexts. See “Goodhart’s Curse”有关此的更多信息。

    For examples of problems faced by existing techniques for learning goals and facts, such as reinforcement learning, see “使用机器学习来解决AI风险。”

  6. The result will probably not be a particularly human-like design, since so many complex historical contingencies were involved in our evolution. The result will also be able to benefit from a number of largesoftware and hardware advantages
  7. This concept is sometimes lumped into the “transparency” category, but standard algorithmic transparency research isn’t really addressing this particular problem. A better term for what I have in mind here is “understanding。” What we want is to gain deeper and broader insights into the kind of cognitive work the system is doing and how this work relates to the system’s objectives or optimization targets, to provide a conceptual lens with which to make sense of the hands-on engineering work.
  8. We could选择to program the system to tire, but we don’t have to. In principle, one could program a broom that only ever finds and executes actions that optimize the fullness of the cauldron. Improving the system’s ability to efficiently find high-scoring actions (in general, or relative to a particular scoring rule) doesn’t in itself change the scoring rule it’s using to evaluate actions.
  9. We can imagine the latter case resulting in afeedback loopas the system’s design improvements allow it to come up with further design improvements, until all the low-hanging fruit is exhausted.


  10. 具体来说,火箭更广泛的经验temperatures and pressures, traverse those ranges more rapidly, and are also packed more fully with explosives.
  11. Consider Bird and Layzell’sexampleof a very simple genetic algorithm that was tasked with evolving an oscillating circuit. Bird and Layzell were astonished to find that the algorithm made no use of the capacitor on the chip; instead, it had repurposed the circuit tracks on the motherboard as a radio to replay the oscillating signal from the test device back to the test device.

    This was not a very smart program. This is just using hill climbing on a very small solution space. In spite of this, the solution turned out to be outside the space of solutions the programmers were themselves visualizing. In a computer simulation, this algorithm might have behaved as intended, but the actual solution space in the real world was wider than that, allowing hardware-level interventions.

    In the case of an intelligent system that’s significantly smarter than humans on whatever axes you’re measuring, you should by default expect the system to push toward weird and creative solutions like these, and for the chosen solution to be difficult to anticipate.