Challenges to Christiano's capability amplification proposal


The following is a basically unedited summary I wrote up on March 16 of my take on Paul Christiano's AGI alignment approach (described in "ALBA" and "Iterated Distillation and Amplification"). Where Paul had comments and replies, I've included them below.


I see a lot of free variables with respect to what exactly Paul might have in mind. I've sometimes tried presenting Paul with my objections, and he has replied in a way that locally answers some of my questions but that I think would make other difficulties worse. My global objection is thus something like, "I don't see any concrete setup and consistent simultaneous setting of the variables where this whole scheme works." These difficulties are not minor or technical; they appear to me quite severe. I try to walk through the details below.

It should be understood at all times that I do not claim to be able to pass Paul's ITT for Paul's view, and that this is me criticizing my own, potentially straw misunderstanding of what I imagine Paul might be advocating.

Paul Christiano

Overall: I think these are sensible difficulties for my proposal to face, and to a large extent I agree with Eliezer's account of those difficulties (though not with his account of my current beliefs).

I don't understand exactly how hard Eliezer expects these problems to be; my impression is "just about as hard as solving alignment from scratch," but I don't have a clear sense of why.

To some extent we are probably disagreeing about alternatives. From my perspective, the difficulties with my approach (e.g. better understanding the forms of optimization that cause trouble, or how to avoid optimization daemons in systems about as smart as you are, or how to address X-and-only-X) are also problems for alternative alignment approaches. I think it's a mistake to think that tiling agents, or decision theory, or naturalized induction, or logical uncertainty, are going to make the situation qualitatively better for these problems, so work on those problems looks to me like procrastinating on the key difficulties. I agree with the intuition that progress on the agent foundations agenda "ought to be possible," and I agree that it would help at least a little bit with the problems Eliezer describes in this document, but overall agent foundations seems much less promising to me than a direct attack on the problem (given that we haven't yet tried the direct attack enough to give up on it). Working out philosophical issues in the context of a concrete alignment strategy generally seems more promising to me than trying to think about them in the abstract, and I think this is evidenced by the fact that most of the core difficulties in my approach would also afflict research based on agent foundations.

The main way I could see agent foundations research as helping to address these problems, rather than merely deferring them, is if we plan to eschew large-scale ML altogether. That seems to me like a very serious handicap, so I'd only go that direction once I was quite pessimistic about solving these problems. My subjective experience is of making continuous significant progress rather than being stuck. I agree there is clear evidence that the problems are "difficult" in the sense that we are going to have to make progress in order to solve them, but not that they are "difficult" in the sense that P vs. NP or even your typical open problem in CS is probably difficult (and even then if your options were "prove P != NP" or "try to beat Google at building an AGI without using large-scale ML," I don't think it's obvious which option you should consider more promising).


First and foremost, I don't understand how "preserving alignment while amplifying capabilities" is supposed to work at all under this scenario, in a way consistent with other things that I’ve understood Paul to say.

I want to first go through an obvious point that I expect Paul and I agree upon: Not every system of locally aligned parts has globally aligned output, and some additional assumption beyond "the parts are aligned" is necessary to yield the conclusion "global behavior is aligned". The straw assertion "an aggregate of aligned parts is aligned" is the reverse of the argument that Searle uses to ask us to imagine that an (immortal) human being who speaks only English, who has been trained to do things with many many pieces of paper that instantiate a Turing machine, can't be part of a whole system that understands Chinese, because the individual pieces and steps of the system aren't locally imbued with understanding Chinese. Here the compositionally non-preserved property is "lack of understanding of Chinese"; we can't expect "alignment" to be any more necessarily preserved than this, except by further assumptions.

The second-to-last time Paul and I conversed at length, I kept probing Paul for what in practice the non-compacted-by-training version of a big aggregate of small aligned agents would look like. He described people, living for a single day, routing around phone numbers of other agents with nobody having any concept of the global picture. I used the term "Chinese Room Bureaucracy" to describe this. Paul seemed to think that this was an amusing but perhaps not inappropriate term.

If the Chinese Room Bureaucracy has a picture of which actions have which consequences and why, in a way visible to no individual agent, then this cuts off the most obvious route by which any agent's alignment could be effective upon the alignment of the whole. The way I usually imagine things, an agent's alignment applies to the things the agent understands. If you have a big aggregate of agents that understands something the little local agents don't understand, the big aggregate doesn't inherit the alignment of the little agents. Searle's Chinese Room can understand Chinese even when the person inside it doesn't understand Chinese, and by default this means that the person inside the Chinese Room is powerless to express their own taste in restaurant orders.

I don't understand Paul's model of how a ton of little not-so-bright agents yield a big powerful understanding in aggregate, in a way that doesn't effectively consist of them running AGI code that they don't understand.

Paul Christiano

The argument for alignment isn’t that “a system made of aligned neurons is aligned.” Unalignment isn't a thing that magically happens; it’s the result of specific optimization pressures in the system that create trouble. My goal is to (a) first construct weaker agents who aren't internally doing problematic optimization, (b) put them together in a way that improves capability without doing other problematic optimization, (c) iterate that process.

Paul has previously challenged me to name a bottleneck that I think a Christiano-style system can't pass. This is hard because (a) I'm not sure I understand Paul's system, and (b) it's clearest if I name a task for which we don't have a present crisp algorithm. But:

The bottleneck I named in my last discussion with Paul was: "We have copies of a starting agent, where each copy can accumulate at most one day of experience before terminating, and the agent hasn't previously learned much math, but is smart enough that it could come to understand algebra by the end of the day even if it started out knowing only concrete arithmetic. How does a system of such agents, at some point, invent Hessian-free optimization for a neural net, without just being a Turing machine that operates an AGI?"

This is a slightly obsolete example because nobody uses Hessian-free optimization anymore. But I wanted to find an example of an agent that needed to do something that didn't have a simple human metaphor. We can understand second derivatives using metaphors like acceleration. "Hessian-free optimization" is something that doesn't have an obvious metaphor that can explain it, well enough to use it in an engineering design, to somebody whose understanding of calculus is merely metaphorical rather than mathy. Even if it did have such a metaphor, that metaphor would still be very unlikely to be invented by someone who didn't understand calculus.

I don't see how Paul expects a lot of little agents who can each learn algebra in a day, running in sequence, to aggregate into something that can build designs using Hessian-free optimization, without the little agents effectively playing the role of an immortal dog that has been trained to operate a Turing machine. And so I also don't see how Paul expects the putative alignment of the little agents to pass, through this mysterious form of understanding-aggregation, into the alignment of a system that understands Hessian-free optimization.

I hope this is already understood, but to state it explicitly: alignment is not, in general, a compositionally preserved property of cognitive systems. If you train a bunch of kind and moral people to operate the elements of a Turing machine, and none of them has a global picture of what's going on, their kindness and morality does not pass through to the Turing machine. Even if we give the kind and moral people discretion about when to write a different symbol than the usual rules call for, they still can't effectively align the global system, because they don't individually understand whether Hessian-free optimization is being used for good or for evil, since they don't understand Hessian-free optimization or the ideas it's being incorporated into. So we don't want to rest the system on the false premise "any system composed of aligned subagents is aligned," which we think is false in general because of this counterexample. We want some narrower premise, perhaps with additional preconditions, that is actually true, on which the system's alignment relies. I don't know what narrower premise Paul wants to use.


Paul asks us to consider AlphaGo as a model of capability amplification.

My view of AlphaGo would be as follows: We understand Monte Carlo Tree Search. MCTS is an iterable algorithm whose intermediate outputs can be plugged into further iterations of the algorithm. So we can use supervised learning where our systems of gradient descent can capture and foreshorten the computation of some but not all of the details of winning moves revealed by the short MCTS, plug in the learned outputs to MCTS, and get a pseudo-version of "running MCTS longer and wider" which is weaker than an MCTS actually that broad and deep, but more powerful than the raw MCTS run previously. The alignment of this system is provided by the crisp formal loss function at the end of the MCTS.
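To make that loop concrete, here is a minimal toy sketch (my own illustration; the toy domain, names, and numbers are invented and bear no relation to AlphaGo's actual components). The structure is the point: a weak policy guides a crude search, the search is scored by a crisp loss at the end, and a new policy is distilled from the search's choices, then plugged back in.

```python
import random

ACTIONS = [0, 1, 2]

def crisp_loss(trajectory):
    """Stand-in for the crisp formal objective at the end (win/loss in Go)."""
    return -sum(trajectory)  # toy: pretend larger actions are better

def policy_move(policy_table, state):
    """The fast distilled policy: a lookup table with a random default."""
    return policy_table.get(state, random.choice(ACTIONS))

def amplified_move(policy_table, state, depth=3, rollouts=30):
    """Amplification: a crude search guided by the weak policy (standing in
    for MCTS), evaluated by the crisp loss at the end of each rollout."""
    best_action, best_score = None, float("-inf")
    for action in ACTIONS:
        score = 0.0
        for _ in range(rollouts):
            trajectory, s = [action], state + action
            for _ in range(depth - 1):
                a = policy_move(policy_table, s)
                trajectory.append(a)
                s += a
            score -= crisp_loss(trajectory)
        if score > best_score:
            best_action, best_score = action, score
    return best_action

def distill(policy_table, states):
    """Distillation: supervised learning of the amplified move choices,
    foreshortening the search into the fast policy."""
    return {s: amplified_move(policy_table, s) for s in states}

policy = {}
for _ in range(3):  # iterate: amplify, distill, plug back in, repeat
    policy = distill(policy, range(10))
print(policy)
```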

The contrasting case, as I see it, would be as follows. Suppose we have an RNN that plays Go, constructed in such a way that if we iterate the RNN for longer, its Go moves get better. "Aha," says the straw capability amplifier, "clearly we can take this RNN and train another network to approximate its internal state after 100 iterations from the initial Go position; we feed that internal state into the RNN at the start, and then train the amplified network to approximate the internal state of this RNN after it runs for 200 iterations. The result will obviously still be trying to 'win at Go,' since the original RNN was trying to win at Go; the amplified system preserves the values of the original." This doesn't work because, by hypothesis, the RNN cannot get arbitrarily better just by iterating further. The nature of the capability amplification setup doesn't allow any outside loss function that could tell the amplified RNN whether it's doing better or worse at Go.
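For contrast, a sketch of the straw scheme just described, under invented toy details (the RNN, the least-squares "distilled" approximator, and all names here are hypothetical). The thing to notice is what's absent: no line of this ever scores a Go move, so there is no signal by which the amplified system could become better at Go than whatever the original RNN converges to.

```python
import numpy as np

rng = np.random.default_rng(0)

def iterate_rnn(state, W, n):
    """Run the toy RNN for n iterations; by hypothesis, more iterations
    were supposed to mean better Go moves -- up to a ceiling."""
    for _ in range(n):
        state = np.tanh(W @ state)
    return state

W = rng.normal(size=(8, 8)) / np.sqrt(8)
starts = rng.normal(size=(1000, 8))

# "Train another network to approximate the internal state after 100
# iterations" -- here a linear least-squares fit stands in for that network.
after_100 = np.array([iterate_rnn(s, W, 100) for s in starts])
approx_100, *_ = np.linalg.lstsq(starts, after_100, rcond=None)

# Feed the approximated state back in, and distill again toward 200 steps...
states_100 = starts @ approx_100
# ...and so on. Nowhere above is there an external loss function that could
# tell the amplified RNN whether it is playing Go better or worse.
```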

Paul Christiano

I definitely agree that amplification is no better than "let the human think for an arbitrarily long time." I don't think this is a strong objection, because I think that humans (even humans with only a short time) will eventually converge to good enough answers to the questions we face.

The RNN has only whatever opinion it converges to, or whatever set of opinions it diverges to, to tell itself how well it's doing. This is exactly what it is for capability amplification to preserve alignment; but this in turn means that capability amplification only works to the extent that what we are amplifying has within itself the capability to be very smart in the limit.

This difficulty is somewhat mitigated if we are effectively building up a civilization of long-lived Paul Christianos. That civilization of Pauls could still go astray (even leaving aside the objection I mention later about whether we can really and realistically do that). But I do credit that a civilization of Pauls could do some nice things.

But other parts of Paul's story don't permit this, or at least that's what Paul was saying last time; Paul's supervised learning setup only lets the simulated component people operate for a day, because we can't get enough labeled cases if the people have to each run for a month.

Furthermore, as I understand it, the "realistic" version of this is supposed to start with agents dumber than Paul. According to my understanding of something Paul said in answer to a later objection, the agents in the system are supposed to be even dumber than an average human (but aligned). It is not at all obvious to me that an arbitrarily large system of agents with IQ 90, who each only live for one day, can implement a much smarter agent in a fashion analogous to the internal agents themselves achieving understandings to which they can apply their alignment in a globally effective way, rather than them blindly implementing a larger algorithm they don't understand.

I'm not sure a system of one-day-living IQ-90 humans ever gets to the point of inventing fire or the wheel.

If Paul has an intuition saying "Well, of course they eventually start doing Hessian-free optimization in a way that makes their understanding effective upon it to create global alignment; I can’t figure out how to convince you otherwise if you don’t already see that," I'm not quite sure where to go from there, except onwards to my other challenges.

Paul Christiano

Well, I can see one obvious way to convince you: actually run the experiment. But before doing that, I'd want to get more precise about what you expect to work and not work, since I'm not going to literally run the HF-optimization example (developing new algorithms is beyond the reach of existing ML). I think we can do things that are (to me) obviously harder than inventing HF optimization. But I don't know whether I have a good enough model of your models to know what you would actually consider harder.

Unless, of course, you have so many agents in the (uncompressed) aggregate that the aggregate implements a smarter sort of genetic algorithm that maximizes the approval of the internal agents. If something is smarter than an IQ-90 human who lives for one day, and you train it to get IQ-90 one-day humans to output lots of tokens marking their approval, I would by default expect it to hack the IQ-90 one-day humans; that's an unsafe system. We are back to a global system that is smarter than the individual agents, in a way that doesn't preserve alignment.

Paul Christiano

Definitely agree that even if the agents are aligned, they can implement unaligned optimization, and then we're back to square one. Amplification only works if we can improve capability without doing unaligned optimization. I think this is a disagreement about the decomposability of cognitive work. I hope we can resolve it by actually finding concrete, simple tasks where we have differing intuitions, and then doing empirical tests.

The centrally interesting-to-me idea of capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning. If you can construct an exact imitation of a human, it possesses exactly the same alignment properties as the human; this is not true if we take a reinforcement learner and ask it to maximize an approval signal originating from the human. (If the subject matter is Paul Christiano or Carl Shulman, I am willing to say these humans are reasonably aligned; I'd be pretty content with somebody giving them the keys to the universe in the expectation that the keys would later be handed back.)

It is not obvious to me how fast alignment-preservation degrades as the exactness of the imitation is weakened. This matters because of things Paul has said which sound to me like he's not advocating for perfect imitation, in response to challenges I've given about how perfect imitation would be very expensive. That is, the answer he gave to a challenge about the expense of perfection makes the answer to "How fast do we lose alignment guarantees as we move away from perfection?" become very important.

An example of a doom I expect from standard reinforcement learning is what I think of as the "X-and-only-X" problem. Unfortunately, I haven't written this up yet, so I'll try to state it briefly here.

X-and-only-X is what I call the issue where the property that's easy to verify and train is X, but the property you want is "this was optimized for X and only X and doesn't contain a whole bunch of possible subtle bad Ys that could be hard to detect formulaically from the final output of the system".

For example, imagine X is "give me a program which solves a Rubik's Cube". You can run the program and verify that it solves Rubik's Cubes, and use a loss function over its average performance which also takes into account how many steps the program's solutions require.

The property Y is that the program the AI gives you also modulates RAM to send GSM cellphone signals.

That is: It's much easier to verify "This is a program which at least solves the Rubik's Cube" than "This is a program which was optimized to solve the Rubik's Cube and only that and was not optimized for anything else on the side."

If I were going to talk about trying to do aligned AGI under the standard ML paradigms, I'd talk about how this creates a differential ease of development between "build a system that does X" and "build a system that does X and only X and not Y in some subtle way". If you just want X however unsafely, you can build the X-classifier and use that as a loss function and let reinforcement learning loose with whatever equivalent of gradient descent or other generic optimization method the future uses. If the safety property you want is optimized-for-X-and-just-X-and-not-any-possible-number-of-hidden-Ys, then you can't write a simple loss function for that the way you can for X.
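As a code sketch of that asymmetry (hypothetical names throughout; the "cube" is just a list that counts as solved when sorted): the X-loss takes a few lines, while the only-X "loss" has no formula over the program's outputs at all.

```python
def is_solved(state):
    return state == sorted(state)

def x_loss(solver, test_cubes):
    """Easy to write: run the program, verify each cube gets solved, and
    penalize the average number of steps the solutions require."""
    total_steps = 0
    for cube in test_cubes:
        moves = solver(list(cube))            # a sequence of swaps
        state = list(cube)
        for i, j in moves:
            state[i], state[j] = state[j], state[i]
        if not is_solved(state):
            return float("inf")               # fails X outright
        total_steps += len(moves)
    return total_steps / len(test_cubes)

def only_x_loss(solver, test_cubes):
    """What safety needs: 'optimized for X and only X, with no subtle Y.'
    There is no formula over the final outputs that computes this; a hidden
    Y (e.g., modulating RAM to send GSM signals) is invisible to any check
    of the solutions themselves."""
    raise NotImplementedError

def selection_sort_solver(cube):
    """A trivial 'Rubik's Cube solver' for illustration."""
    moves = []
    for i in range(len(cube)):
        j = cube.index(min(cube[i:]), i)
        if j != i:
            cube[i], cube[j] = cube[j], cube[i]
            moves.append((i, j))
    return moves

print(x_loss(selection_sort_solver, [[3, 1, 2], [2, 1, 0, 3]]))  # 1.5
```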

Paul Christiano

According to my understanding of optimization / use of language: the agent produced by RL is optimized only for X. However, optimization for X is liable to produce a Y-optimizer. So the actions of the agent are both X-optimized and Y-optimized.

The team that's building a less safe AGI can plug in the X-evaluator and let rip; the team that wants to build a safe AGI can't do things the easy way and has to solve new basic problems in order to get a trustworthy system. It's not unsolvable, but it's an element of the class of added alignment difficulties such that the whole class extremely plausibly adds up to an extra two years of development.

In Paul's capability-amplification scenario, if we can get exact imitation, we are genuinely completely bypassing the whole paradigm that creates the X-and-only-X problem. If you can get exact imitation of a human, the outputs have only and exactly whatever properties the human already has. This kind of genuinely different viewpoint is why I continue to be excited about Paul's thinking.

Paul Christiano

I agree that perfect imitation would be a way to get around the X-and-only-X problem. However, I don't think that it's plausible and it's not how my approach hopes to get around the X-and-only-X problem.

I would solve X-and-only-X in two steps:

First, given an agent and an action that has been optimized for an undesirable consequence Y, we'd like to be able to tell that the action has this undesirable side effect. I think we can do this by having a smarter agent act as an overseer, and giving the smarter agent insight into the cognition of the weaker agent (e.g., by sharing weights between the weak agent and an explanation-producing agent). This is what I call informed oversight.

Second, given an agent, identify situations in which it is especially likely to produce bad outcomes, or prove that it won't, or understand its internals well enough to see why it won't. This is discussed in "Techniques for optimizing worst-case performance."

(Both of these clearly also require a smarter agent, which you hope to get by induction + amplification.)

I think that both of those are hard problems, in addition to the assumption that amplification will work. But I don't yet see reason to be super pessimistic about either of them.

On the other hand, suppose we don't have exact imitation. How fast does our defense against X-and-only-X degrade? Well, that depends on the inexactness of the imitation. Under what kind of distance metric is the imperfect imitation "close to the original"? Like, if we're talking about Euclidean distance in the outputs, I'd expect you to lose the X-and-only-X guarantee very quickly against intelligent adversarial perturbations.

On the other hand, suppose the inexactness of the imitation is "this agent behaves exactly like Paul Christiano, but 5 IQ points dumber." If that is the only form the imprecision takes, and we know it, then I'd say we have a pretty good guarantee that a slightly dumber Paul won't produce a Rubik's Cube solver containing a hidden GSM signaler.

On the other hand, suppose the inexactness of the imitation is "this agent passes the Turing Test; a human can't tell it apart from a human." Then X-and-only-X goes completely out the window. We have no guarantee against any Y that humans can't detect, which covers an enormous amount of lethal territory; this is why we can't sanitize the outputs of an untrusted superintelligence just by having humans inspect the outputs to see whether they have any visibly bad consequences.


Speaking of inexact imitation: It seems to me that having an AI output a high-fidelity imitation of human behavior, sufficiently high-fidelity to preserve properties like "smart" and "being a good person" and "still being a good person under some odd strains like being assembled into an enormous Chinese Room Bureaucracy," is a very big ask.

It seems to me obvious, though this is the sort of point where I've been surprised about what other people don't consider obvious, that in general exact imitation is a bigger ask than superior capability. Building a Go player that imitates Shuusaku's Go play so well that a scholar couldn't tell the difference, is a bigger ask than building a Go player that could defeat Shuusaku in a match. A human is much smarter than a pocket calculator but would still be unable to imitate one without using a paper and pencil; to imitate the pocket calculator you need all of the pocket calculator's abilities in addition to your own.

Correspondingly, a realistic AI we build that literally passes the strong version of the Turing Test would probably have to be much smarter than the other humans in the test, probably smarter than any human on Earth, because it would have to possess all the human capabilities in addition to its own. Or at least all the human capabilities that can be exhibited to another human over the course of however long the Turing Test lasts. (Note that on the version of capability amplification I heard, capabilities that can be exhibited over the course of a day are the only kinds of capabilities we're allowed to amplify.)

Paul Christiano

Totally agree; correspondingly, I agree that you can't rely on perfect imitation to solve the X-and-only-X problem, and some other solution is needed. If you convinced me that informed oversight or reliability were unattainable, I'd be largely convinced that I'm doomed.

An AI that learns to exactly imitate humans, not just passing the Turing Test to the limits of human discrimination on human inspection, but perfect imitation with all added bad subtle properties thereby excluded, must be so cognitively powerful that its learnable hypothesis space includes systems equivalent to entire human brains. I see no way that we're not talking about a superintelligence here.

So to postulate perfect imitation, we first run into the following problems:

(a) The AGI required to learn this kind of imitation is extremely powerful, which plausibly implies a long delay between when we can first build any dangerous AGI at all, and when we can build an AGI that could be aligned using perfect-imitation capability amplification.

(b) Since we cannot invoke a perfect-imitation capability amplification setup to get this very powerful AGI in the first place (because it is already the least AGI that we can use to even get started on perfect-imitation capability amplification), we already have an extremely dangerous unaligned superintelligence sitting around that we are trying to use to implement our scheme for alignment.

Now, we could perhaps reply that the imitation doesn't have to be perfect, and can be accomplished using a dumber, less dangerous AI; maybe even one dumb enough that it isn't a huge superintelligence. But then we are tuning down the "perfection of imitation" setting, which may rapidly blow up the guarantee we obtained against the standard dooms of the standard machine learning paradigm.

I'm worried that you have to degrade the level of imitation很多before it becomes less than an巨大的请问,即模仿的东西不是很聪明,不是人类和/或不符合对齐的。

To be concrete: I think that if you want to imitate one day of an IQ-90 human's thinking, imitating it particularly well, such that the imitation is generically smart even in the limit of being aggregated into weird bureaucracies, you are looking at an AGI powerful enough to contain within itself something analogous to the whole IQ-90 human.

Paul Christiano

Importantly, my argument for the alignment of amplification is not that it imitates humans; it's that it doesn't do problematic optimization. So if we combine it with a good enough solution to informed oversight and reliability (plus amplification, and the induction having worked so far...), then we can go on to train imperfect imitations that definitely don't do problematic optimization. They'll mess up all over the place and so may be incompetent (another problem that needs to be handled), but the goal is to set things up so that lots of flailing doesn't break alignment.

I think this is a very powerful AGI. I think this is an AGI smart enough to slip mischief past you, unless the methods you used to produce it yield a faithful imitation rather than an unaligned AGI. I think this is an AGI that can perform powerful feats of engineering, unless it is somehow able to imitate humans doing powerful feats of engineering without itself being able to do powerful feats of engineering.

And then, furthermore, the capability amplification schema requires the AGI to be powerful enough to learn to imitate amplified systems of humans as well!

If you don’t degrade the AI to well below the point of superintelligence, we require extremely strong Faraday-cage keep-God-in-a-box don't-peek-at-the-code AI boxing. So we cannot for example let an untrusted superintelligence originate queries that it can use to learn human behavior; it has to be strictly unsupervised example-based learning rather than a query model. This interacts with worries about sample complexity below.

Paul Christiano

I appreciate how often you had to argue against disjunctions here in order to give me the benefit of the doubt.

Or you need to postulate some prior alignment of the AGI that lets you robustly and safely direct it at the imitation problem. Then (a) your alignment schema depends on an unspecified prior alignment schema that already works to make very powerful AGIs do things. And (b) you probably aren't allowing unrestrained intelligence explosions, which probably introduces a big unavoidable delay between the first point where very powerful AGI is technologically possible, and the later point where you can build and safely align the extremely powerful AGI that you're using as the base unit in your imitation scheme.


The last I checked with Paul, his scheme bit the bullet on using powerful smart dangerous units to imitate humans. Paul's scheme for bypassing this, as I heard and understood it back then, involved a scheme which:

  • Trained a million smart predictors with enough diversity that at least one predictor would end up with the genuinely simple goal of making faithful predictions;
  • Compared the predictors against each other, finding cases where their predictions diverged sufficiently;
  • Tested them against reality;
  • Had enough sampling power to perform a sufficient number of tests that, even if a superintelligence cunningly delays the time when it starts trying to slip in bad predictions, we can pause the system and run the experimental test and generate new samples and kick out the unfaithful predictor at that time.

That's a one-millionfold performance penalty, if I understood it correctly.

If this was only a concrete illustration to help me understand what's going on, and there's actually supposed to be some obvious, less naive, more efficient scheme for making use of mostly untrusted superintelligences without a huge performance penalty, then I haven't seen it yet.

Paul Christiano

There are several possible techniques for cutting this performance penalty down, e.g., representing a large ensemble implicitly or using a tricky distillation procedure. I totally agree if you have a millionfold slowdown then you are out of luck.

When we talked about this, I agreed that "train a reliable system" was an important subproblem. Train an implicit ensemble is one possible approach to that problem. At the time of our discussion I listed 3 techniques. Since then I've become quite pessimistic about training an implicit ensemble, and have become more optimistic about verification and transparency. (My thinking about the problem is also generally much sharper.) The three approaches I currently consider most live are the ones in "Techniques for optimizing worst-case performance."

(As usual, all of it requires amplification + the induction.)

(In the original discussion, I listed "big ensembles" as one possible approach in response to your "how could you possibly solve this?" question, not as a necessary ingredient.)


I also have trouble seeing how unfaithful predictors get kicked out under an unsupervised learning paradigm that doesn't actually permit hypercomputation.

The problem is: the exact output of a human is always very improbable. Even if the agents can only send each other text messages with no illustrations (which further degrades their collective intelligence and their ability to produce, e.g., textbooks readable within one day), a text message is still a huge space of possibilities. If we demand that the superintelligences give us their predictions for what the humans do next, their predictions cannot in practice consist of a giant lookup table over all the humans' possible outputs.

I think I see how I’d “solve” this problem using hypercomputation and an enormous number of samples: I require the superintelligences to give me their giant lookup tables of probabilities for exact outputs, calculate total distances between probability tables (KL divergence or whatever), and if the distance is great enough, I sample a human and do a Bayesian update. I drop any superintelligences whose prior probability goes low enough. I keep doing this until only agreeing superintelligences remain.
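A toy rendering of that procedure, under drastic simplifying assumptions of my own: a three-symbol alphabet stands in for the space of human outputs, so the "giant lookup tables" shrink to small dictionaries and no hypercomputation is needed. The skeleton is the same: compare the reported tables, and when they diverge enough, sample a real human, update each predictor's posterior, and drop the ones that fall too far behind.

```python
import math
import random

random.seed(0)
ALPHABET = ["a", "b", "c"]
TRUE_HUMAN = {"a": 0.5, "b": 0.3, "c": 0.2}       # the real human's distribution

predictors = {
    "honest":  {"a": 0.5, "b": 0.3, "c": 0.2},
    "naughty": {"a": 0.45, "b": 0.3, "c": 0.25},  # sneaks mass onto "c"
}

def kl(p, q):
    """KL divergence between two reported probability tables."""
    return sum(p[s] * math.log(p[s] / q[s]) for s in ALPHABET)

log_odds = {name: 0.0 for name in predictors}

# If the tables diverge enough, sample real humans and do Bayesian updates.
if kl(predictors["honest"], predictors["naughty"]) > 1e-6:
    for _ in range(10_000):
        sample = random.choices(ALPHABET, [TRUE_HUMAN[s] for s in ALPHABET])[0]
        for name, table in predictors.items():
            log_odds[name] += math.log(table[sample])

# Drop any predictor whose relative posterior odds fall below 1:1000.
best = max(log_odds.values())
survivors = [n for n, lo in log_odds.items() if best - lo < math.log(1000)]
print(survivors)  # almost surely ["honest"]
```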

Under this system, it looks to me like you need a lot of samples to kick out any superintelligences trying to sneak in bad suggestions. Let's say that an uncompressed Chinese Room bureaucracy is doing the equivalent of sending around a quadrillion messages, and that the output can be corrupted by an adversarial attack against one billion of those messages, such that there's a \(\frac{1}{1,000,000}\) chance of outputting an adversary-selected alternative to each of those billion messages. Well, the only way a superintelligence can assign a \(\frac{1}{1,000,000}\) generation probability to those alternatives is by stealing 0.0001% of the probability mass from all the other messages it could output. So each time we check against a real human, we update by \(\log_{2}\) \(\frac{999,999}{1,000,000}\) bits, and after taking a million samples we've driven down the relative odds of this naughty superintelligence by a factor of \(\frac{1}{e}\).

We have to take a million samples to do that, and each sample involves a day of thinking. That only gets us \(e\):1 evidence, so we've got to take a million samples 7 times over to get to the point of 1000:1 odds against the superintelligence being honest. And if we have a million superintelligences in the distribution, some of them can make skewed bets à la predicting coinflips, whereby a thousand of those will be able to gain 1000:1 advantages over the others, so actually you've got to take 14 million samples to be sure of beating any bad predictor classes that collectively try that scheme.
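A worked check of that arithmetic (my own restatement, in natural-log units, so that a million samples corresponds to one factor of \(e\)):

```python
import math

# A naughty predictor that steals 1/1,000,000 of its probability mass loses
# about ln(1,000,000/999,999) ≈ 1e-6 nats per checked sample, relative to an
# honest predictor.
per_sample_nats = math.log(1_000_000 / 999_999)

samples_for_e = 1 / per_sample_nats                   # one factor of e
samples_for_1000 = math.log(1000) / per_sample_nats   # ~7 factors of e

print(round(samples_for_e))      # ~1,000,000 samples for e:1 evidence
print(round(samples_for_1000))   # ~6,900,000 samples for 1000:1 evidence
# With a million predictors making skewed coinflip-style bets, ~1000 of them
# can start with 1000:1 advantages, roughly doubling the bill to ~14 million.
```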

I think the last time I talked with Paul about this, Paul was like, "Sure, we're talking about a huge government project that can take millions of samples."

I'm not sure, I could be wrong about this, but I think you actually need 14 million samples per bad predictor that you want to kick out of the distribution.

Even if I’m wrong about that, we need millions of humans who are smart enough and aligned enough that random samples from their number can agglomerate into a smart aligned Chinese Room Bureaucracy.

I also don't see how to do this without using hypercomputation to compare the giant lookup tables, but it's possible I'm just missing a math trick.

Paul Christiano

My best guess is that this can be done under plausible assumptions with O(1) samples per bad predictor. It's tricky, but we are good at tricky math problems, so it's not very scary relative to the other problems we face.

(Unfortunately, I think that a large implicit ensemble is probably stuck anyway, in part because a million predictors isn't enough. But I'm not confident about that.)

(If you're thinking of GANs, then so far as I can tell, the discriminator has to be at least as smart as the generator, and you have to trust the discriminator, and there isn't any simpler problem with respect to how you find a trusted superintelligence within a collective of untrusted ones to act as your discriminator.)

(EDIT: Actually, after thinking about this for another five minutes, maybe I do see how to do it with GANs and lower sample complexity.)

An even larger issue is that I don't see any obvious way to carry out a scheme like this one at all with respect to imperfect imitations. (And the above scheme I thought of with GANs would also just fail.)

Paul Christiano

I think we could probably get over this too; it's another tricky math problem. I think this kind of problem is reliably either impossible, or else radically easier than most of the other stuff we have to deal with in alignment.

(Though I endorse the overall intuition that large implicit ensembles are doomed.)


I think these arguments collectively amount to a crux for me. That is, unless I've missed one of my own thought processes in writing this up quickly, or am presupposing some shared background assumption that isn't actually shared.

Let's say that D is the degree of imperfection allowed by some system of capability amplification, and call D-imperfect imitations D-imitations. Iterated D-imitations of amplified systems of D-imitations will be termed DD-imitations. Then I think I'd start to be pragmatically interested in capability amplification as I understood it, if I believed all of the following:

  • We can make AIs powerful enough to learn D-imitations and DD-imitations, before the world ends due to other, unaligned AIs;
  • D-imitations and DD-imitations robustly preserve the goodness of the people being imitated, despite the imperfection of the imitation;
  • D-imitations agglomerate sufficient cognitive power to perform a pivotal act, in a way that causes the alignment of the components to be effective upon aligning the whole; and imperfect DD-imitation preserves this property;
  • We can find any way of either:
    • Individually trusting one powerful AI to faithfully carry out the task of D-imitation (but then why can't we just use that scheme to align a powerful AGI directly?); or
    • Find a scheme for agglomerating mostly untrustworthy powerful intelligences which:
      • Doesn't require giant lookup tables, and doesn't require GANs with trusted discriminators unless you can say how to produce a trusted discriminator, and can use samples of actual humans as the fuel for discriminating trustworthiness among untrusted generators of D-imitations;
      • Is extremely sample-efficient (let's say you can clear 100 people who are trustworthy to be part of an amplified-capability system, which already sounds to me like a huge damned ask); or you can show me a social schema that agglomerates mostly untrusted humans into a Chinese Room Bureaucracy we would trust to perform a pivotal task, along with a political schema you trust to carry out something involving millions of humans, in which case you can take millions of samples, but not billions. Honestly, I don't currently believe in any AI scenario in which a good and trusted government carries out a complicated AI alignment schema involving millions of people, so if you go down this path we end up at a different crux. But I'd already be quite impressed if you had gotten past all the other cruxes.
      • Isn't too computationally inefficient; more like a 20:1 slowdown than a 1,000,000:1 slowdown. Because I don't think you can get that last kind of advantage over all the other projects elsewhere in the world; not unless you suppose that humanity's future is secured by a massive, perfect global surveillance scheme that doesn't itself wreck things, carried out by hyper-competent, hyper-trustworthy great powers with a deep commitment to cosmopolitan value (very unlike the observed characteristics of present great powers) and unopposed by any other major government. Again, if we go down this branch of the challenge, we are no longer at the original crux.

I worry that going down the last two branches of the challenge could create the illusion of a political disagreement, when I have what seem to me like strong technical objections at the previous branches. I would prefer that the more technical cruxes be considered first. If Paul answered all the other technical cruxes and presented a scheme for capability amplification that worked with a moderately utopian world government, I would already have been surprised. I wouldn't actually try it because you cannot get a moderately utopian world government, but Paul would have won many points and I would be interested in trying to refine the scheme further because it had already been refined further than I thought possible. On my present view, trying anything like this should either just plain not get started (if you wait to satisfy extreme computational demands and sampling power before proceeding), just plain fail (if you use weak AIs to try to imitate humans), or just plain kill you (if you use a superintelligence).

Paul Christiano

I think the disagreement is almost entirely technical. I don't think that really needing a million people would be a deal-breaker, but that's because of technical rather than political disagreements (about what those people would need to do). I agree that a 1,000,000x slowdown is unacceptable (and I think even a 10x slowdown is pretty much doomed).

I restate that these objections seem to me to collectively sum up to "This is fundamentally just not a way you can get an aligned powerful AGI unless you already have an aligned superintelligence", rather than "Some further insights are required for this to work in practice." But who knows what further insights may really bring? Movement in thoughtspace consists of better understanding, not cleverer tools.

I continue to be excited by Paul's thinking on this subject; I just don't think that it works in its present state.

Paul Christiano

On this point, we agree. I don’t think anyone is claiming to be done with the alignment problem, the main question is about what directions are most promising for making progress.

On my view, this is not an unusual state of mind to be in with respect to alignment research. I can't point to any MIRI paper that works to align an AGI. Other people seem to think that they ought to currently be in a state of having a pretty much workable scheme for aligning an AGI, which I would consider to be an odd expectation. I would think that a sane point of view consisted in having ideas for addressing some problems that created further difficulties that needed to be fixed and didn't address most other problems at all; a map with what you think are the big unsolved areas clearly marked. Being able to have a thought that genuinely attacks any alignment difficulty at all, despite the other difficulties it implies, already seems to me like a large and unusual accomplishment. The insight "a trustworthy imitation of human external behavior would avert many of the default dooms, insofar as they manifest in externally different behavior from a human's" may prove important at some point. I continue to recommend throwing as much money at Paul as he says he knows how to use.