New paper: “Incorrigibility in the CIRL Framework”
MIRI assistant research fellow Ryan Carey has a new paper out discussing situations where good performance in Cooperative Inverse Reinforcement Learning (CIRL) tasks fails to imply that software agents will assist or cooperate with programmers.
The paper, titled “Incorrigibility in the CIRL Framework,” lays out four scenarios in which CIRL violates the four conditions for corrigibility defined in Soares et al. (2015). Abstract:
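A value learning system has incentives to follow shutdown instructions, assuming the shutdown instruction provides information (in the technical sense) about which actions lead to valuable outcomes. However, this assumption is not robust to model mis-specification (e.g., in the case of programmer errors). We demonstrate this by presenting some Supervised POMDP scenarios in which errors in the parameterized reward function remove the incentive to follow shutdown commands. These difficulties parallel those discussed by Soares et al. (2015) in their paper on corrigibility.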
We argue that it is important to consider systems that follow shutdown commands under some weaker set of assumptions (e.g., that one small verified module is correctly implemented; as opposed to an entire prior probability distribution and/or parameterized reward function). We discuss some difficulties with simple ways to attempt to attain these sorts of guarantees in a value learning framework.
The paper is a response to a paper by Hadfield-Menell, Dragan, Abbeel, and Russell, “The Off-Switch Game.” Hadfield-Menell et al. show that an AI system will be more responsive to human inputs when it is uncertain about its reward function and thinks that its human operator has more information about this reward function. Carey shows that the CIRL framework can be used to formalize the problem of corrigibility, and that the known assurances for CIRL systems, given in “The Off-Switch Game,” rely on strong assumptions about having an error-free CIRL system. With less idealized assumptions, a value learning agent may have beliefs that cause it to evade redirection from the human.
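The incentive at stake can be illustrated with a toy calculation. The sketch below is a heavily simplified, one-shot version of the off-switch setting, not the Supervised POMDP scenarios from Carey's paper. It assumes a perfectly rational human who knows the true utility of the proposed action and permits it only when that utility is non-negative. All names (`Belief`, `incentive_to_defer`, and so on) are hypothetical and chosen for illustration.

```python
from typing import Sequence, Tuple

# The agent's belief about the utility U of its proposed action,
# given as (utility value, probability) pairs. Hypothetical representation.
Belief = Sequence[Tuple[float, float]]


def expected_act(belief: Belief) -> float:
    """Expected utility of acting without consulting the human."""
    return sum(u * p for u, p in belief)


def expected_defer(belief: Belief) -> float:
    """Expected utility of deferring to the human.

    Assumption: the human knows U and allows the action only when
    U >= 0; otherwise the human switches the agent off (utility 0).
    """
    return sum(max(u, 0.0) * p for u, p in belief)


def incentive_to_defer(belief: Belief) -> float:
    """Gain from deferring over the best unilateral choice.

    The unilateral options are acting (E[U]) or switching off (0).
    """
    best_unilateral = max(expected_act(belief), 0.0)
    return expected_defer(belief) - best_unilateral


# An agent that is genuinely uncertain about U gains from deferring.
uncertain = [(-1.0, 0.5), (1.0, 0.5)]

# An agent whose (possibly mis-specified) belief is certain that U = 1
# gains nothing from deferring, whatever the true value of U is.
overconfident = [(1.0, 1.0)]

print(incentive_to_defer(uncertain))      # 0.5
print(incentive_to_defer(overconfident))  # 0.0
```

In this toy model the incentive to defer comes entirely from the agent's uncertainty. If a programmer error leaves the agent certain of a wrong value, the incentive vanishes even though the human's input would have been valuable.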
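The purpose of a shutdown button is to shut the AI system down in the event that all other assurances failed, e.g., in the event that the AI system is ignoring (for one reason or another) the instructions of the operators. If the designers of [the AI system] R have programmed the system so perfectly that the prior and [reward function] R are completely free of errors, then the theorems of Hadfield-Menell et al. (2017) do apply. In practice, this means that in order to be corrigible, it would be necessary to have an AI system that was uncertain about all things that could possibly matter. The problem is that performing Bayesian reasoning over all possible worlds and all possible value functions is quite intractable. Realistically, humans will likely have to use a large number of heuristics and approximations in order to implement the system’s belief system and updating rules. […]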
Soares et al. (2015) seem to want a shutdown button that works as a mechanism of last resort, to shut an AI system down in cases where it has observed and refused a programmer suggestion (and the programmers believe that the system is malfunctioning). Clearly, some part of the system must be working correctly in order for us to expect the shutdown button to work at all. However, it seems undesirable for the working of the button to depend on there being zero critical errors in the specification of the system’s prior, the specification of the reward function, the way it categorizes different types of actions, and so on. Instead, it is desirable to develop a shutdown module that is small and simple, with code that could ideally be rigorously verified, and which ideally works to shut the system down even in the event of large programmer errors in the specification of the rest of the system.
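In order to do this in a value learning framework, we require a value learning system that (i) is capable of having its actions overridden by a small verified module that watches for shutdown commands; (ii) has no incentive to remove, damage or ignore the shutdown module; and (iii) has some small incentive to keep its shutdown module around; even under a broad range of cases where R, the prior, the set of available actions, etc. are misspecified.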
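Even if the utility function is learned, there is still a need for additional lines of defense against unintended failures. The hope is that this can be achieved by modularizing the AI system. For that purpose, we would need a model of an agent that will behave corrigibly in a way that is robust to misspecification of other system components.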
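As a rough illustration of requirement (i) only, the sketch below wraps an arbitrary policy in a small override layer that watches for a shutdown command. It is a hypothetical toy, not a design from the paper: the names `with_shutdown_override`, `buggy_policy`, and `SHUTDOWN` are invented here, and the wrapper does nothing to address requirements (ii) and (iii), which concern the agent's incentives toward the module and are the hard part of the problem.

```python
from typing import Callable

SHUTDOWN = "shutdown"  # hypothetical label for the shutdown action


def with_shutdown_override(
    policy: Callable[[str], str],
) -> Callable[[str, bool], str]:
    """Wrap a policy so that a shutdown command always takes precedence.

    The wrapper never inspects the policy's beliefs or reward function,
    so its behaviour does not depend on those being specified correctly.
    """

    def guarded(observation: str, shutdown_requested: bool) -> str:
        if shutdown_requested:
            return SHUTDOWN
        return policy(observation)

    return guarded


def buggy_policy(observation: str) -> str:
    """Stand-in for a mis-specified learner that never chooses to stop."""
    return "continue"


agent = with_shutdown_override(buggy_policy)

print(agent("some observation", False))  # continue
print(agent("some observation", True))   # shutdown
```

The override works here only because the wrapped policy cannot influence the wrapper. An agent that can act on the world, including on its own shutdown module, also needs the incentive properties in (ii) and (iii).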