Groundwork for AGI safety engineering


Improvements in AI are resulting in the automation of increasingly complex and creative human behaviors. Given enough time, we should expect artificial reasoners to begin to compete with humans in arbitrary domains, culminating in artificial general intelligence (AGI).

A machine would qualify as an ‘AGI’, in the intended sense, if it could adapt to a very wide range of situations to consistently achieve some goal or goals. Such a machine would behave intelligently when supplied with arbitrary physical and computational environments, in the same sense that Deep Blue behaves intelligently when supplied with arbitrary chess board configurations — consistently hitting its victory condition within that narrower domain.


  • Because AGIs are intelligent, they will tend to be complex, adaptive, and capable of autonomous action, and they will have a large impact where employed.
  • Because AGIs are general, their users will have incentives to employ them in an increasingly wide range of environments. This makes it hard to construct valid sandbox tests and requirements specifications.
  • Because AGIs are artificial, they will deviate from human agents, causing them to violate many of our natural intuitions and expectations about intelligent behavior.

Today’s AI software is already tough to verify and validate, thanks to its complexity and its uncertain behavior in the face of state space explosions. Menzies & Pecheur (2005) give a good overview of AI verification and validation (V&V) methods, noting that AI, and especially adaptive AI, will often yield undesired and unexpected behaviors.

An adaptive AI that acts autonomously, like a Mars rover that can’t be directly piloted from Earth, represents an additional large increase in difficulty. Autonomous safety-critical AI agents need to make irreversible decisions in dynamic environments with very low failure rates. The state of the art in safety research for autonomous systems is improving, but continues to lag behind system capabilities work. Hinchman et al. (2012) write:

As autonomous systems become more complex, the notion that systems can be fully tested and all problems will be found is becoming an impossible task. This is especially true in unmanned/autonomous systems. Full test is becoming increasingly challenging on complex system. As these systems react to more environmental [stimuli] and have larger decision spaces, testing all possible states and all ranges of the inputs to the system is becoming impossible. […] As systems become more complex, safety is really risk hazard analysis, i.e. given x amount of testing, the system appears to be safe. A fundamental change is needed. This change was highlighted in the 2010 Air Force Technology Horizon report, “It is possible to develop systems having high levels of autonomy, but it is the lack of suitable V&V methods that prevents all but relatively low levels of autonomy from being certified for use.” […]

The move towards more autonomous systems has lifted this need [for advanced verification and validation techniques and methodologies] to a national level.

AI that acts autonomously in arbitrary domains, then, looks particularly difficult to verify. If AI methods continue to see rapid gains in efficiency and versatility, and especially if these gains further increase the opacity of AI algorithms to human inspection, AI safety engineering will become much more difficult in the future. In the absence of any reason to expect a development in the lead-up to AGI that would make high-assurance AGI easy (or AGI itself unlikely), we should be worried about the safety challenges of AGI, and that worry should inform our research priorities today.


New safety challenges from AGI

A natural response to the idea of starting work on high-assurance AGI is that AGI itself appears to be decades away. Why worry about it now? And, supposing that we should worry about it, why think there’s any useful work we can do on AGI safety so far in advance?

In answer to the second question: admittedly, at first glance AGI looks difficult to prepare for effectively. But the issue is important enough to deserve more than a glance. Long-term projects such as mitigating climate change or detecting and deflecting asteroids also look intuitively difficult, as do interventions that depend on anticipated future technologies, like post-quantum cryptography in anticipation of quantum computers. In spite of that, we’ve made important progress on these fronts.

Covert channel communication provides one precedent. It was successfully studied decades in advance of being seen in the wild. Roger Schell cites a few other success cases in Muehlhauser (2014b), and suggests reasons why long-term security and safety work remains uncommon. We don’t know whether early-stage AGI safety work will be similarly productive, but we shouldn’t rule out the possibility before doing basic research into the question. I’ll list some possible places to start looking in the next section.

What about the first question? Why worry specifically about AGI?

I noted above that AGI is an extreme manifestation of many ordinary AI safety challenges. However, MIRI is particularly concerned with unprecedented, AGI-specific behaviors. For example: An AGI’s problem-solving ability (and therefore its scientific and economic value) depends on its ability to model its environment. This includes modeling the dispositions of its human programmers. Since the program’s success depends in large part on its programmers’ beliefs and preferences, an AI pursuing some optimization target can select actions for the effect they have on programmers’ mental states, not just on the AI’s material environment.

This means that safety protocols will need to be sensitive to risks that differ qualitatively from ordinary software failure modes — AGI-specific hazards like ‘the program models its programmed goals as being better served if it passes human safety inspections, so it selects action policies that make it look safer (to humans) than it really is’. If we model AGI behavior using only categories from conventional software design, we risk overlooking new intelligent behaviors, including ‘deception’.

At the same time, oversimplifying these novel properties can cause us to anthropomorphize the AGI. If it’s naïve to expect ordinary software validation methods to immediately generalize to advanced autonomous agents, it’s even more naïve to expect conflict prevention strategies that work on humans to immediately generalize to an AI. A ‘deceptive’ AGI is just one whose planning algorithm identifies some human misconception as instrumentally useful to its programmed goals. Its methods or reasons for deceiving needn’t resemble a human’s, even if its capacities do.

In human society, we think, express, and teach norms like ‘don’t deceive’ or Weld & Etzioni’s (1994) ‘don’t let humans come to harm’ relatively quickly and easily. The complex conditional response that makes humans converge on similar goals remains hidden inside a black box — the undocumented spaghetti code that is the human brain. As a result of our lack of introspective access to how our social dispositions are cognitively and neurally implemented, we’re likely to underestimate how contingent and complex they are. For example:

  • We might expect a particularly intelligent AI system to have particularly valuable goals, since knowledge and insight are associated with many other virtues in humans. For example, Hall (2007) conjectures this on the grounds that criminality negatively correlates with intelligence in humans. Bostrom’s (2003) response is that there’s no particular reason to expect AIs to converge on anthropocentric terminal values like ‘compassion’ or ‘loyalty’ or ‘novelty’. A superintelligent AI could consistently have no goal other than to construct paperclips, for example.
  • We might try to directly hand-code goals like ‘don’t deceive’ into the agent by breaking the goals apart into simpler goals, e.g., ‘don’t communicate information you believe to be false’. However, in the process we’re likely to neglect the kinds of subtleties that can be safely left implicit when it’s a human child we’re teaching — lies of omission, misleading literalism, novel communication methods, or any number of edge cases. As Bostrom (2003) notes, an agent’s goals may continue to reflect the programmers’ poor translation of their requirements into lines of code, even after its intelligence has arrived at a superior understanding of human psychology.
  • We might instead try to instill the AGI with humane values via machine learning — training it to promote outcomes associated with camera inputs of smiling humans, for example. But a powerful search process is likely to hit on solutions that would never occur to a developing human. If the agent becomes more powerful or general over time, initially benign outputs may be a poor indicator of long-term safety.
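
The third bullet’s failure mode can be made concrete with a toy search. In the sketch below (a hypothetical illustration, not a real training setup; every name in it is invented), the designers’ intended goal is ‘make humans happy’, but the optimizer only sees a crude proxy score, and unconstrained hill-climbing games the proxy in a way no human trainer would anticipate:

```python
# Toy illustration of proxy gaming: the optimizer maximizes a proxy metric
# (a crude 'smile detector') rather than the intended goal, and hill-climbing
# converges on a degenerate solution. All names here are hypothetical.
import random

def smile_score(image):
    """Proxy reward: fraction of pixels that look 'smile-bright'."""
    return sum(1 for px in image if px > 200) / len(image)

def hill_climb(image, steps=5000, rng=None):
    """Greedy random search: keep any mutation that doesn't lower the proxy."""
    rng = rng or random.Random(0)
    best = list(image)
    for _ in range(steps):
        candidate = list(best)
        i = rng.randrange(len(candidate))
        candidate[i] = rng.randrange(256)
        if smile_score(candidate) >= smile_score(best):
            best = candidate
    return best

natural_photo = [random.Random(1).randrange(256) for _ in range(64)]
optimized = hill_climb(natural_photo)
# The search drifts toward an all-bright image: a near-maximal proxy score,
# but nothing like the 'smiling humans' outcome the designers intended.
print(smile_score(natural_photo), smile_score(optimized))
```

The point generalizes: the stronger the search process, the more reliably it finds the degenerate optima of whatever proxy was actually measured.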

Advanced AI is also likely to have technical capabilities, such as strong self-modification, that introduce other novel safety obstacles; see Yudkowsky (2013).

These are quick examples of some large and poorly-understood classes of failure mode. However, the biggest risks may be from problem categories so contrary to our intuitions that they will not occur to programmers at all. Relying on our untested intuitions, or on past experience with very different systems, is unlikely to catch every hazard.

As an intelligent but inhuman agent, AGI represents a fundamentally new kind of safety challenge. As such, we’ll need to do basic theoretical work on the general features of AGI before we can understand such agents well enough to predict and plan for them.

Early steps

What would early theoretical AGI safety research look like? How does one vet a hypothetical technology? We can distinguish research projects oriented toward system verification from projects oriented toward system requirements.

Verification-directed AGI research extends existing AI safety and security tools that are likely to help confirm various features of advanced autonomous agents. Requirements-directed AGI research instead specifies desirable AGI abilities or behaviors, and tries to build toy models exhibiting the desirable properties. These models are then used to identify problems to be overcome and basic gaps in our conceptual understanding.

In other words, verification-directed approaches would ask ‘What tools and procedures can we use to increase our overall confidence that the complex systems of the future will match their specifications?’ They include:

  • Develop new tools for improving the transparency of AI systems to inspection. Frequently, useful AI techniques like boosting are tried in an ad hoc way and then promoted when they’re observed to work on some set of problems. Understanding when and why a program works will allow for stronger safety guarantees. Computational learning theory may be useful here, for proving bounds on the performance of various machine learning algorithms.
  • Extend techniques for designing complex systems to be readily verified. Work on clean-slate hardware and software approaches that maintain high assurance at every stage, like HACMS and SAFE.
  • Extend current techniques in program synthesis and formal verification, with a focus on methods applicable to complex and adaptive systems, such as higher-order program verification and Spears’ (2000, 2006) incremental reverification. Extend existing tools as well, e.g., by designing better interfaces and training methods for the SPIN model checker to improve its accessibility.
  • Apply homotopy type theory to program verification. The theory’s univalence axiom lets us derive identities from isomorphisms. Harper & Licata (2011) suggest that if we can implement this as an algorithm, it may allow us to reuse high-assurance code in new contexts without a large loss in confidence.
  • Expand the current body of verified software libraries and compilers, such as the Verified Software Toolchain. A lot of program verification work is currently directed at older toolchains, e.g., relatively small libraries in C. Focusing on newer toolchains would limit our ability to verify systems that are already in wide use, but would put us in a better position to verify more advanced safety-critical systems.
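
To make the verification-directed mindset concrete, here is a minimal explicit-state safety check in the spirit of model checkers like SPIN (a hand-rolled sketch, not SPIN’s actual interface): enumerate every reachable state of a small transition system and test a safety invariant in each.

```python
# Minimal explicit-state model checking: breadth-first search over the
# reachable states of a transition system, testing a safety invariant.
# A hand-rolled sketch in the spirit of SPIN, not SPIN's real interface.
from collections import deque

def check_invariant(initial, transitions, invariant):
    """Return (True, None) if the invariant holds in every reachable
    state, or (False, bad_state) with a counterexample state."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return False, state
        for nxt in transitions(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True, None

# Example system: two processes sharing a lock.
# A state is (p0_location, p1_location, lock_holder).
def _set(state, pid, loc, holder):
    p0, p1, _ = state
    return (loc, p1, holder) if pid == 0 else (p0, loc, holder)

def transitions(state):
    p0, p1, holder = state
    succ = []
    for pid, loc in ((0, p0), (1, p1)):
        if loc == 'idle' and holder is None:   # acquire the free lock
            succ.append(_set(state, pid, 'crit', pid))
        elif loc == 'crit':                    # release the lock
            succ.append(_set(state, pid, 'idle', None))
    return succ

# Safety property: mutual exclusion.
mutex = lambda s: not (s[0] == 'crit' and s[1] == 'crit')
ok, bad = check_invariant(('idle', 'idle', None), transitions, mutex)
print(ok)  # the lock discipline preserves mutual exclusion
```

The same exhaustive-search idea underlies industrial model checkers; the hard part at AGI scale is the state-space explosion the article discusses above.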

Requirements-directed approaches would ask ‘What outcomes are we likely to want from an AGI, and what general classes of agent could most easily get us those outcomes?’ Examples of requirements-directed work include:

  • Formalize stability guarantees for intelligent self-modifying agents. A general intelligence could help maintain itself and implement improvements to its own software and hardware, including improvements to its search and decision heuristics à la EURISKO. It may be acceptable for the AI to introduce occasional errors in its own object recognition or speech synthesis modules, but we’d want pretty strong assurance about the integrity of its core decision-making algorithms, including the module that approves self-modifications. At present, the toy models of self-modifying AI discussed in Fallenstein & Soares (2014) run into two kinds of self-reference paradox, the ‘Löbian obstacle’ and the ‘procrastination paradox’. Similar obstacles may arise for real-world AGI, and finding solutions should improve our general understanding of systems that make decisions about, and predictions of, themselves.
  • Specify desirable checks on AGI behavior. Some basic architecture choices may simplify the task of making AGI safer, by restricting the system’s output channel (e.g., oracle AI in Armstrong et al. (2012)) or installing emergency tripwires and fail-safes (e.g., simplex architectures). AGI checks are a special challenge because of the need to recruit the agent to help actively regulate itself. If this demand isn’t met, the agent may devote its problem-solving capabilities to finding loopholes in its restrictions, as in Yampolskiy’s (2012) discussion of an ‘AI in a box’.
  • Design general methods by which intelligent agents can improve their models of users’ needs over time. An autonomous general intelligence’s domain of action may be unrestricted; or, if it is restricted, we may be unable to predict which restrictions will apply. It therefore needs goals that are safe to pursue in arbitrary situations. Yet hand-coding a set of universally safe and useful decision criteria looks hopeless. Instead, some indirectly normative approach like Dewey’s (2011) value learning seems necessary, allowing initially imperfect decision-makers to improve their goal content over time. Related open questions include: what kinds of base cases can we use to train and test a beneficial AI?; and how can AIs be made safe and stable during the training process?
  • Formalize other optimality criteria for arbitrary reasoners. Just as a general-purpose adaptive agent would need general-purpose values, it would also need general-purpose methods for tracking features of its environment and of itself, and for selecting action policies on this basis. Mathematically modeling ideal inference (e.g., Hutter (2012)), ideal decision-theoretic expected value calculation (e.g., Altair (2013)), and ideal game-theoretic coordination (e.g., Barasz et al. (2014)) are unlikely to be strictly necessary for AGI. All the same, they’re plausibly necessary for AGI safety, because models like these would give us a solid top-down theoretical foundation upon which to cleanly construct components of autonomous agents for human inspection and verification.
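
At its core, the ‘ideal decision theory’ item in the last bullet reduces to a very small formal skeleton: a probabilistic model of outcomes plus a utility function determines an expected-utility-maximizing choice. The sketch below is a minimal illustration with made-up actions, outcomes, and numbers, not a model from any of the cited papers.

```python
# Minimal expected-utility decision rule: the agent chooses the action
# whose probability-weighted utility over outcomes is highest.
# The world model and utilities below are invented for illustration.

def expected_utility(action, outcome_model, utility):
    """Sum of utility over outcomes, weighted by their probabilities."""
    return sum(p * utility(o) for o, p in outcome_model[action].items())

def choose(actions, outcome_model, utility):
    """Pick the action maximizing expected utility."""
    return max(actions, key=lambda a: expected_utility(a, outcome_model, utility))

outcome_model = {
    'safe_plan':  {'small_gain': 0.9, 'no_gain': 0.1},
    'risky_plan': {'large_gain': 0.5, 'large_loss': 0.5},
}
utility = {'small_gain': 10, 'no_gain': 0, 'large_gain': 30, 'large_loss': -40}.get

# safe_plan: EU = 0.9*10 + 0.1*0 = 9; risky_plan: EU = 0.5*30 + 0.5*(-40) = -5
print(choose(['safe_plan', 'risky_plan'], outcome_model, utility))
```

Everything hard about the requirements-directed agenda lives in the parts this sketch takes as given: where the outcome model comes from, how probabilities are assigned in unknown environments, and how the utility function is specified safely.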

We can likely make progress on both kinds of AGI safety research well in advance of building a working AGI. Requirements-directed research focuses on abstract mathematical agent models, which makes it likely to be applicable to a wide variety of software implementations. Verification-directed approaches will be similarly valuable, to the extent they are flexible enough to apply to future programs that are much more complex and dynamic than any contemporary software. We can compare this to present-day high-assurance design strategies, e.g., in Smith & Woodside (2000) and Hinchman et al. (2012).


Verification- and requirements-directed work is complementary. The point of building clear mathematical models of agents with desirable properties is to make it easier to design systems whose behaviors are transparent enough to their programmers to be rigorously verified; and verification methods will fail to establish system safety if we have a poor understanding of what kind of system we want in the first place.

Some valuable projects will also fall in between these categories — e.g., developing methods for principled formal validation, which can increase our confidence that we’re verifying the right properties given the programmers’ and users’ goals. (See Cimatti et al. (2012) on formal validation, and also Rushby (2013) on epistemic doubt.)


MIRI’s founder, Eliezer Yudkowsky, has been one of the most vocal advocates for research into autonomous high-assurance AGI, or ‘friendly AI’. Russell & Norvig (2009) write:

[T]he challenge is one of mechanism design — to define a mechanism for evolving AI systems under a system of checks and balances, and to give the systems utility functions that will remain friendly in the face of such changes. We can’t just give a program a static utility function, because circumstances, and our desired responses to circumstances, change over time.


Rather than building on current formal verification methods, MIRI prioritizes jump-starting these new avenues of research. Muehlhauser (2013) writes that engineering innovations often have their germ in prior work in mathematics, which in turn can be inspired by informal philosophical questions. At this point, AGI safety work is just beginning to enter the ‘mathematics’ stage. Friendly AI researchers construct simplified models of likely AGI properties or subsystems, formally derive features of those models, and check those features against general or case-by-case norms.

Because AGI safety is so under-researched, we’re likely to find low-hanging fruit even in investigating basic questions like ‘What kind of prior probability distribution works best for formal agents in unknown environments?’ As Gerwin Klein notes in Muehlhauser (2014a), “In the end, everything that makes it easier for humans to think about a system, will help to verify it.” And, though MIRI’s research agenda is determined by social impact considerations, it is also of general intellectual interest, bearing on core open problems in theoretical computer science and mathematical logic.
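
As a toy version of that prior question (everything here is an illustrative construction, not a proposal from the literature): weight hypotheses by a simplicity prior, 2^(-length) in the spirit of Solomonoff induction, and update on observed data by discarding refuted hypotheses and renormalizing.

```python
# Toy simplicity prior over a tiny hypothesis class (repeating bit
# patterns), with exact Bayesian updating on observed bits.
# The hypothesis class and data are invented for illustration.
from fractions import Fraction

def simplicity_prior(hypotheses):
    """Assign each pattern weight 2^-length, then normalize."""
    raw = {h: Fraction(1, 2 ** len(h)) for h in hypotheses}
    total = sum(raw.values())
    return {h: w / total for h, w in raw.items()}

def predicts(pattern, data):
    """Does the repeating pattern reproduce the observed bit string?"""
    return all(data[i] == pattern[i % len(pattern)] for i in range(len(data)))

def posterior(prior, data):
    """Bayesian update: zero out refuted hypotheses, renormalize the rest."""
    surviving = {h: w for h, w in prior.items() if predicts(h, data)}
    total = sum(surviving.values())
    return {h: w / total for h, w in surviving.items()}

prior = simplicity_prior(['0', '01', '011', '0110'])
post = posterior(prior, '0101')        # observed bits refute all but '01'
best = max(post, key=post.get)
print(best, float(post[best]))
```

The open question in the article is what such a prior should look like when the environment class is not a hand-picked toy set but ‘all computable environments’, and when the agent must reason about a world that contains the agent itself.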

At the same time, keep in mind that formal proofs of AGI properties only function as especially strong probabilistic evidence. Formal methods in computer science can decrease risk and uncertainty, but they can’t eliminate them. Our assumptions are uncertain, so our conclusions will be too.

Though we can never reach complete confidence in the safety of an AGI, we can still decrease the probability of catastrophic failure. In the process, we are likely to come to a better understanding of the most important ways AGI can defy our expectations. If we begin working now to better understand AGI as a theoretical system, we’ll be in a better position to implement robust safety measures as AIs improve in intelligence and autonomy in the decades to come.


My thanks to Luke Muehlhauser, Shivaram Lingamneni, Matt Elder, Kevin Carlson, and others for their feedback on this piece.

