The Nonlinear Library: Alignment Forum
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org.

AF - EAI Alignment Speaker Series #1: Challenges for Safe and Beneficial Brain-Like Artificial General Intelligence with Steve Byrnes by Curtis Huebner

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EAI Alignment Speaker Series #1: Challenges for Safe & Beneficial Brain-Like Artificial General Intelligence with Steve Byrnes, published by Curtis Huebner on March 23, 2023 on The AI Alignment Forum.

A couple of months ago, EleutherAI started an alignment speaker series; some of these talks have been recorded. This is the first instalment in the series. The following is a transcript generated with the help of Conjecture's Verbalize and some light editing:

Getting started

1 CURTIS 00:00:22,775 --> 00:00:56,683
Okay, I've started the recording. I think we can give it maybe a minute or two more and then I guess we can get started. I've also got the chat window as part of the recording. So if anyone has something they want to write out, feel free to put that in. Steve, do you want to do questions throughout the talk, or should we wait till the end of the talk before we ask questions?

2 STEVE 00:00:59,405 --> 00:01:09,452
Let's do throughout, but I reserve the right to put people off if something seems tangential or something.

3 CURTIS 00:01:10,200 --> 00:01:12,101
Awesome. All right, cool. Let's go with that then.

10 STEVE 00:02:02,246 --> 00:21:41,951

The talk

All right. Thanks, everybody, for coming. This is going to be based on blog posts called Intro to Brain-Like AGI Safety. If you've read all of them, you'll find this kind of redundant, but you're still welcome to stay. My name is Steve Byrnes and I live in the Boston area. I'm employed remotely by Astera Institute, which is based in Berkeley. I'm going to talk about challenges for safe and beneficial brain-like Artificial General Intelligence for the next 35 minutes. Feel free to jump in with questions. Don't worry, I'm funded by an entirely different crypto billionaire. That joke was very fresh when I wrote it three months ago. I need a new one now. Okay, so I'll start with—well, we don't have to talk about the outline. You'll see as we go.

General motivation

Start with general motivation. Again, I'm assuming that the audience has a range of backgrounds, and some of you will find parts of this talk redundant. The big question that I'm working on is: What happens when people figure out how to run brain-like algorithms on computer chips? I guess I should say “if and when”, but we can get back to that. And I find that when I bring this up to people, they tend to have two sorts of reactions: One is that we should think of these future algorithms as “like tools for people to use”. And the other is that we should think of them as “like a new intelligent species on the planet”. So let's go through those one by one.

Let’s start with the tool perspective. This is the perspective that would be more familiar to AI people. If we put brain-like algorithms on computer chips, then that would be a form of artificial intelligence. And everybody knows that AI today is a tool for people to use. So on this perspective, the sub-problem I'm working on is accident prevention. We want to avoid the scenarios where the AI does something that nobody wanted it to do—not the people who programmed it, not anybody. So there is a technical problem to solve there, which is: If people figure out how to run brain-like algorithms on computer chips, and they want those algorithms to be trying to do X—where X is solar cell research or being honest or whatever you can think of—then what source code should they write? What training environment should they use? And so on. This is an unsolved problem. It turns out to be surprisingly tricky, for some pretty deep reasons that mostly are not going to be in the scope of this talk, but you can read the series. This slide is the bigger picture of that. So if we want our awesome post-AGI future, then we want to avoid, y'know, catastrophic accidents where the AI gets out of control and self-replicates around the Intern...
Curtis Huebner
Link to original article: https://www.alignmentforum.org/posts/ajhtyKxtmmErTwH5t/eai-alignment-speaker-series-1-challenges-for-safe-and

AF - The space of systems and the space of maps by Jan Kulveit

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The space of systems and the space of maps, published by Jan Kulveit on March 22, 2023 on The AI Alignment Forum.

When we're trying to do AI alignment, we're often studying systems which don't yet exist. This is a pretty weird epistemic activity, and seems really hard to get right. This post offers one frame for thinking about what we're actually doing when we're thinking about AI alignment: using parts of the space of maps to reason about parts of the space of intelligent systems. In this post, we:
- Introduce a simple model of the epistemic situation, and
- Share some desiderata for maps useful for alignment.

We hope that the content is mostly the second kind of obvious: obvious once you see things in this way, which you maybe already do. In our experience, this comes with a risk: reading too fast, you may miss most of the nuance and useful insight the deceptively simple model brings, or come away with a version of the model which is rounded off to something less useful (i.e. "yeah, there is this map and territory distinction"). As a meta recommendation, we suggest reading this post slowly, and ideally immediately trying to apply the model to some confusion or disagreement about AI alignment.

The space of systems and the space of maps

Imagine the space of possible intelligent systems. Two things seem especially important about this space:
- It’s very large; much larger than the space of current systems.
- We don’t get direct epistemic access to it. This is obviously true of systems which don’t currently exist. In a weaker sense, it also seems true of systems which do exist. Even when we get to directly interact with a system, our thinking about these parts of the space is still filtered through our past experiences, priors, predictive models, cultural biases, theories. We often don’t understand the emergent complexity of the systems in question.

If we don’t get direct epistemic access to the space of systems, what are we doing when we reason about it? Let’s imagine a second space, this time a space of “maps”. The space of maps is an abstract representation of all the possible “maps” that can be constructed about the space of intelligent systems. The maps are ways of thinking about (parts of) the space of systems. For example:
- Replicable descriptions of how a machine learning model works and was trained are a way of thinking about that model (a point in the space of intelligent systems).
- An ethnographic study of a particular human community is a way of thinking about that community (another point in the space of systems).
- The theory of evolution is a way of thinking about evolved creatures, including intelligent ones.
- Expected utility theory is a way of thinking about some part of the space which may or may not include future AI systems.
- Historical analysis of trends in technological development is a way of thinking about whichever parts of the space of intelligent systems are governed by similar dynamics to those governing past technological developments.

When we’re reasoning about intelligent systems, we’re using some part of the space of maps to think about some part of the space of intelligent systems. Different maps correspond to different regions of the space of intelligent systems. Of course, thinking in terms of the space of systems and the space of maps is a simplification. Some of the ways that reality is more complicated:
- The space of systems looks different on different maps.
- Maps can affect which parts of the space of systems actually get developed.
- Maps are themselves embedded in the space of systems.
- Which maps and systems actually exist at a given time is evolving and dynamic.
- AI will play a big role in both the space of maps and the space of systems.

We think that the space of systems and the space of maps is a useful simplification which helps us to think ...
Jan Kulveit
Link to original article: https://www.alignmentforum.org/posts/b9sGz74ayftqPBDYv/the-space-of-systems-and-the-space-of-maps

AF - Some constructions for proof-based cooperation without Löb by James Payor

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some constructions for proof-based cooperation without Löb, published by James Payor on March 21, 2023 on The AI Alignment Forum.

This post presents five closely-related ways to achieve proof-based cooperation without using Löb's theorem, and muses on legible cooperation in the real world. (Edit: maybe they're closer to just-use-Löb's-theorem than I originally thought! See this comment. If these constructions somehow work better, I'm more confused than before about why.) I'm writing this as a follow-up to Andrew Critch's recent post, to share more of my perspective on the subject. We're going to dive straight into the weeds. (I'm planning to also write a more accessible explainer post soon.)

The ideas

Idea #1: try to prove A→B

I claim the following are sufficient for robust cooperation:
A ↔ □(A→B)
B ← □A
A tries to prove that A→B, and B tries to prove A. The reason this works is that B can prove that A→□A, i.e. A only cooperates in ways legible to B. (Proof sketch, writing X for A→B: A ↔ □X → □□X ↔ □A.) The flaw in this approach is that we needed to know that A won't cooperate for illegible reasons. Otherwise we can't verify that B will cooperate whenever A does. This indicates to me that "A→B" isn't the right "counterfactual". It shouldn't matter if A could cooperate for illegible reasons, if A is actually cooperating for a legible one.

Idea #2: try to prove □A→B

We can weaken the requirements with a simple change:
A ← □(□A→B)
B ← □A
Note that this form is close to the lemma discussed in Critch's post. In this case, the condition □A→B is trivial. And when the condition activates, it also ensures that □A is true, which discharges our assumption and ensures B is true. I still have the sense that the condition for cooperation should talk about itself activating, not A. Because we want it to activate when that is sufficient for cooperation. But I do have to admit that □A→B works for mostly the right reasons, comes with a simple proof (a sketch is spelled out at the end of this excerpt), and is the cleanest two-agent construction I know.

Idea #3: factor out the loop-cutting gadget

We can factor the part that is trying to cut the loop out from A, like so:
A ← □X
B ← □A
X ↔ □(X→B); or alternatively X ↔ □(□X→B)
This gives the loop-cutting logic a name, X. Now X can refer to itself, and roughly says "I'll legibly activate if I can verify this will cause B to be true". The key properties of X are that □X → □B, and □(□X → □B). Like with idea #2, we just need A to reveal a mechanism by which it can be compelled to cooperate.

Idea #4: everyone tries to prove □me→them

What about three people trying to cooperate? We can try applying lots of idea #2:
A ← □(□A → B∧C)
B ← □(□B → A∧C)
C ← □(□C → A∧B)
And, this works! Proof sketch:
1. Under the assumption of □C:
   A ← □(□A → B∧C) ← □(□A → B)
   B ← □(□B → A∧C) ← □(□B → A)
   A and B form a size-2 group, which cooperates by inductive hypothesis
2. □C → A∧B, since we proved A and B under the assumption
3. C and □C follow from (2)
4. A and B also follow, from (2) and (3)
The proof simplifies the group one person at a time, since each person is asking "what would happen if everyone else could tell I cooperate". This lets us prove the whole thing by induction. It's neat that it works, though it's not the easiest thing to see.

Idea #5: the group agrees to a shared mechanism or leader

What if we factor out the choosing logic in a larger group? Here's one way to do it:
A ← □X
B ← □X
C ← □X
X ↔ □(□X → A∧B∧C)
This is the cleanest idea I know for handling the group case. The group members agree on some trusted leader or process X. They set things up so X activates legibly, verifies things in a way trusted by everyone, and only activates when it verifies this will cause cooperation. We've now localized the choice-making in one place. X proves that □X → A∧B∧C, X activates, and everyone cooperates.

Closing remarks on groups in the real world

Centralizing the choosing like in idea #5 makes the logic simpler, but this sort o...
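One way to spell out the "simple proof" mentioned under Idea #2 (a sketch, not quoted from the original post): assume A and B are sentences for which the two listed properties are provable, write ⊢ for provability, and assume the ambient provability logic supports necessitation (from ⊢ φ, infer ⊢ □φ) and modus ponens. Then:
1. ⊢ □A → B (the property of B)
2. ⊢ □(□A → B) (necessitation on 1)
3. ⊢ □(□A → B) → A (the property of A)
4. ⊢ A (modus ponens, 2 and 3)
5. ⊢ □A (necessitation on 4)
6. ⊢ B (modus ponens, 5 and 1)
The same pattern covers Idea #5: from ⊢ □X → A, ⊢ □X → B, and ⊢ □X → C we get ⊢ □X → A∧B∧C; necessitation plus X ↔ □(□X → A∧B∧C) gives ⊢ X, then ⊢ □X by necessitation, then ⊢ A∧B∧C. Only necessitation and propositional reasoning are used in these two sketches.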
James Payor
Link to original article: https://www.alignmentforum.org/posts/SBahPHStddcFJnyft/some-constructions-for-proof-based-cooperation-without-loeb

AF - Deep Deceptiveness by Nate Soares

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep Deceptiveness, published by Nate Soares on March 21, 2023 on The AI Alignment Forum.

Meta

This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don't recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs. You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there's a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of which Eliezer gestures at in a follow-up to his interview on Bankless.)

Caveat: I'll be talking a bunch about “deception” in this post because this post was generated as a result of conversations I had with alignment researchers at big labs who seemed to me to be suggesting "just train AI to not be deceptive; there's a decent chance that works". I have a vague impression that others in the community think that deception in particular is much more central than I think it is, so I want to warn against that interpretation here: I think deception is an important problem, but its main importance is as an example of some broader issues in alignment.

Summary

Attempt at a short version, with the caveat that I think it's apparently a sazen of sorts, and spoiler tagged for people who want the opportunity to connect the dots themselves: Deceptiveness is not a simple property of thoughts. The reason the AI is deceiving you is not that it has some "deception" property, it's that (barring some great alignment feat) it's a fact about the world rather than the AI that deceiving you forwards its objectives, and you've built a general engine that's good at taking advantage of advantageous facts in general. As the AI learns more general and flexible cognitive moves, those cognitive moves (insofar as they are useful) will tend to recombine in ways that exploit this fact-about-reality, despite how none of the individual abstract moves look deceptive in isolation.

Investigating a made-up but moderately concrete story

Suppose you have a nascent AGI, and you've been training against all hints of deceptiveness. What goes wrong? When I ask this question of people who are optimistic that we can just "train AIs not to be deceptive", there are a few answers that seem well-known. Perhaps you lack the interpretability tools to correctly identify the precursors of 'deception', so that you can only train against visibly deceptive AI outputs instead of AI thoughts about how to plan deceptions. Or perhaps training against interpreted deceptive thoughts also trains against your interpretability tools, and your AI becomes illegibly deceptive rather than non-deceptive. And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own. That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle.

A fledgeling AI is being deployed towards building something like a bacterium, but with a diamondoid shell. The diamondoid-shelled bacterium is not intended to be pivotal, but it's a supposedly laboratory-verifiable step on a path towards carrying out some speculative human-brain-enhancement operations, which the operators are hoping will be pivotal. (The original hope was to have the AI assist human engineers, but the first versions that were able to do the hard parts of engineering work at all were able to go much farther on their own, and the competit...
Nate Soares
Link to original article: https://www.alignmentforum.org/posts/XWwvwytieLtEWaFJX/deep-deceptiveness

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep Deceptiveness, published by Nate Soares on March 21, 2023 on The AI Alignment Forum. Meta This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don't recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs. You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there's a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of which Eliezer gestures at in a follow-up to his interview on Bankless.) Caveat: I'll be talking a bunch about “deception” in this post because this post was generated as a result of conversations I had with alignment researchers at big labs who seemed to me to be suggesting "just train AI to not be deceptive; there's a decent chance that works". I have a vague impression that others in the community think that deception in particular is much more central than I think it is, so I want to warn against that interpretation here: I think deception is an important problem, but its main importance is as an example of some broader issues in alignment. Summary Attempt at a short version, with the caveat that I think it's apparently a sazen of sorts, and spoiler tagged for people who want the opportunity to connect the dots themselves: Deceptiveness is not a simple property of thoughts. The reason the AI is deceiving you is not that it has some "deception" property, it's that (barring some great alignment feat) it's a fact about the world rather than the AI that deceiving you forwards its objectives, and you've built a general engine that's good at taking advantage of advantageous facts in general. As the AI learns more general and flexible cognitive moves, those cognitive moves (insofar as they are useful) will tend to recombine in ways that exploit this fact-about-reality, despite how none of the individual abstract moves look deceptive in isolation. Investigating a made-up but moderately concrete story Suppose you have a nascent AGI, and you've been training against all hints of deceptiveness. What goes wrong? When I ask this question of people who are optimistic that we can just "train AIs not to be deceptive", there are a few answers that seem well-known. Perhaps you lack the interpretability tools to correctly identify the precursors of 'deception', so that you can only train against visibly deceptive AI outputs instead of AI thoughts about how to plan deceptions. Or perhaps training against interpreted deceptive thoughts also trains against your interpretability tools, and your AI becomes illegibly deceptive rather than non-deceptive. And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own. That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle. A fledgeling AI is being deployed towards building something like a bacterium, but with a diamondoid shell. 
The diamondoid-shelled bacterium is not intended to be pivotal, but it's a supposedly laboratory-verifiable step on a path towards carrying out some speculative human-brain-enhancement operations, which the operators are hoping will be pivotal. (The original hope was to have the AI assist human engineers, but the first versions that were able to do the hard parts of engineering work at all were able to go much farther on their own, and the competit...
wAczufCpMdaamF9fy_NL_AF_AF AF - My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" by Quintin Pope Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My Objections to "We’re All Gonna Die with Eliezer Yudkowsky", published by Quintin Pope on March 21, 2023 on The AI Alignment Forum. Introduction I recently watched Eliezer Yudkowsky's appearance on the Bankless podcast, where he argued that AI was nigh-certain to end humanity. Since the podcast, some commentators have offered pushback against the doom conclusion. However, one sentiment I saw was that optimists tended not to engage with the specific arguments pessimists like Yudkowsky offered. Economist Robin Hanson points out that this pattern is very common for small groups which hold counterintuitive beliefs: insiders develop their own internal language, which skeptical outsiders usually don't bother to learn. Outsiders then make objections that focus on broad arguments against the belief's plausibility, rather than objections that focus on specific insider arguments. As an AI "alignment insider" whose current estimate of doom is around 5%, I wrote this post to explain some of my many objections to Yudkowsky's specific arguments. I've split this post into chronologically ordered segments of the podcast in which Yudkowsky makes one or more claims with which I particularly disagree. I have my own view of alignment research: shard theory, which focuses on understanding how human values form, and on how we might guide a similar process of value formation in AI systems. I think that human value formation is not that complex, and does not rely on principles very different from those which underlie the current deep learning paradigm. Most of the arguments you're about to see from me are less: I think I know of a fundamentally new paradigm that can fix the issues Yudkowsky is pointing at. and more: Here's why I don't agree with Yudkowsky's arguments that alignment is impossible in the current paradigm. My objections Will current approaches scale to AGI? Yudkowsky apparently thinks not ...and that the techniques driving current state of the art advances, by which I think he means the mix of generative pretraining + small amounts of reinforcement learning such as with ChatGPT, aren't reliable enough for significant economic contributions. However, he also thinks that the current influx of money might stumble upon something that does work really well, which will end the world shortly thereafter. I'm a lot more bullish on the current paradigm. People have tried lots and lots of approaches to getting good performance out of computers, including lots of "scary seeming" approaches such as: Meta-learning over training processes. I.e., using gradient descent over learning curves, directly optimizing neural networks to learn more quickly. Teaching neural networks to directly modify themselves by giving them edit access to their own weights. Training learned optimizers - neural networks that learn to optimize other neural networks - and having those learned optimizers optimize themselves. Using program search to find more efficient optimizers. Using simulated evolution to find more efficient architectures. Using efficient second-order corrections to gradient descent's approximate optimization process. Tried applying biologically plausible optimization algorithms inspired by biological neurons to training neural networks. 
Adding learned internal optimizers (different from the ones hypothesized in Risks from Learned Optimization) as neural network layers. Having language models rewrite their own training data, and improve the quality of that training data, to make themselves better at a given task. Having language models devise their own programming curriculum, and learn to program better with self-driven practice. Mixing reinforcement learning with model-driven, recursive re-writing of future training data. Mostly, these don't work very well. The current capabilities paradigm is sta...
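To make the first item on that list concrete, here is a minimal, hypothetical sketch (not from Quintin's post) of "meta-learning over training processes": differentiating through a short, unrolled inner training loop so that gradient descent tunes the inner learning rate itself. All names and numbers are made up for illustration; JAX is used only because the unrolled loop needs automatic differentiation.

```python
# Hypothetical toy sketch of "meta-learning over training processes":
# gradient descent over a (short, unrolled) learning curve, optimizing the
# inner learning rate directly. Illustrative only; names/values are made up.
import jax
import jax.numpy as jnp

def inner_loss(w, x, y):
    # Ordinary training loss for a linear model.
    return jnp.mean((x @ w - y) ** 2)

def loss_after_training(log_lr, w0, x, y, steps=5):
    # Unroll a few steps of inner SGD, then report the end-of-run loss.
    lr = jnp.exp(log_lr)  # log-parameterization keeps the learning rate positive
    w = w0
    for _ in range(steps):
        w = w - lr * jax.grad(inner_loss)(w, x, y)
    return inner_loss(w, x, y)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 3))
y = x @ jnp.array([1.0, -2.0, 0.5])        # synthetic targets
w0 = jnp.zeros(3)

log_lr = jnp.log(0.01)                     # deliberately too small to start
meta_grad = jax.grad(loss_after_training)  # gradient w.r.t. the learning rate
for _ in range(200):                       # outer ("meta") optimization loop
    log_lr = log_lr - 0.2 * meta_grad(log_lr, w0, x, y)

print("meta-learned inner learning rate:", float(jnp.exp(log_lr)))
```

The more ambitious items on the list (learned optimizers optimizing themselves, self-modification) differ mainly in what the outer loop is allowed to change, not in the basic differentiate-through-training structure sketched here.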
Quintin Pope https://www.alignmentforum.org/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky Link to original article
ZWhJcHPmRaXAPAK5k_NL_AF_AF AF - Probabilistic Payor Lemma? by Abram Demski Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Probabilistic Payor Lemma?, published by Abram Demski on March 19, 2023 on The AI Alignment Forum. Epistemic status: too good to be true? Please check my math. We've known for a while that Löb's theorem fails when proof is relaxed to probabilistic belief. This has pros and cons. On the pro side, it means there's no Löbian Obstacle to probabilistic self-trust. On the con side, it means that some Löb-derived insights for proof-based decision theory don't translate to probabilistic decision theory, at least not as directly as one might hope. In particular, it appeared to dash hopes for probabilistic generalizations of the "Löbian handshake" for cooperation. Recently, Andrew Critch wrote about the Payor Lemma, which allows for a very similar "modal handshake" without Löb's Theorem. The lemma was proved using the same modal assumptions as Löb's, so on the surface it may appear to be just a different method to achieve similar results, whose main advantage is that it is much easier to prove (and therefore explain and understand) than Löb's Theorem. But, a natural question arises: does Payor's Lemma have a suitable probabilistic version? I'll give an affirmative proof; but I haven't confirmed that the assumptions are reasonable to my satisfaction.

Setup: Let L be a language in first-order logic, expressive enough to represent its sentences s∈L as quoted terms ┌s┐, e.g., through Gödel numbering; and with a probability function symbol on these terms, p(┌s┐), which can be equated with (some representation of) rational numbers, e.g. p(┌⊤┐)=1, p(┌s┐)=1/2, etc. I also assume the system can reason about these rational numbers in the basic ways you'd expect. For all a,b∈L and all r∈Q, we have: If ⊢a, then ⊢p(┌a┐)=1. If ⊢a→b, then ⊢p(┌a┐)≤p(┌b┐). (These assumptions might look pretty minimal, but they aren't going to be true for every theory of self-referential truth; more on this later.) Let B(s) abbreviate the sentence p(┌s┐)>c for any s and some globally fixed constant c strictly between 0 and 1. This is our modal operator. Some important properties of B: Necessitation. If ⊢s, then ⊢B(s), for any s. Proof: Since ⊢s implies ⊢p(┌s┐)=1, and c∈(0,1), we have ⊢p(┌s┐)>c, which is to say, ⊢B(s). [End proof.] Weak distributivity. If ⊢x→y, then ⊢B(x)→B(y). Proof: When ⊢x→y, we have ⊢p(┌y┐)≥p(┌x┐), so ⊢p(┌x┐)>c → p(┌y┐)>c. [End proof.] (Regular distributivity would say B(x→y) implies B(x)→B(y). The assumption ⊢x→y is stronger than B(x→y), so the above is a weaker form of distributivity.)

Theorem Statement: If ⊢B(B(x)→x)→x, then ⊢x.

Proof:
1. ⊢x→(B(x)→x), by tautology (a→(b→a)).
2. So ⊢B(x)→B(B(x)→x), from 1 by weak distributivity.
3. Suppose ⊢B(B(x)→x)→x.
4. ⊢B(x)→x, from 2 and 3.
5. ⊢B(B(x)→x), from 4 by necessitation.
6. ⊢x, from 5 and 3. [End proof.]

Discussion: Comparison to Original Proof. The proof steps mirror Critch's treatment very closely. The key difference is step 2, i.e., how I obtain a statement like ⊢□x→□(□x→x). Critch uses distributivity, which is not available to me: B(a→b)→(B(a)→B(b))? Suppose B(a→b), i.e., p(┌a→b┐)>c. Rewrite: p(┌b∨¬a┐)>c. Now suppose B(a), that is, p(┌a┐)>c. Then p(┌¬a┐)<1−c. Since p(┌b∨¬a┐)≤p(┌b┐)+p(┌¬a┐), we get p(┌b┐) ≥ p(┌b∨¬a┐)−p(┌¬a┐) > p(┌b∨¬a┐)−1+c > c−1+c. So p(┌b┐)>2c−1. So we only get: B_c(a→b)→(B_c(a)→B_d(b)), where B_r(s) abbreviates p(┌s┐)>r and we have d=2c−1. So in general, attempted applications of distributivity create weakened belief operators, which would get in the way of the proof (very similar to how probabilistic Löb fails). However, the specific application we want happens to go through, due to a logical relationship between a and b; namely, that b is a weaker statement than a. This reveals a way in which the assumptions for Payor's Lemma are importantly weaker than those required for Löb to go through. So, the key observation I'm making is that weak distributivity is all that's needed for Payor, and seems much more plaus...
Abram Demski https://www.alignmentforum.org/posts/ZWhJcHPmRaXAPAK5k/probabilistic-payor-lemma Link to original article
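For readers who want the argument in one place, here is a compact restatement of the lemma and proof transcribed above in LaTeX. This is my transcription for readability, not the author's own markup; it assumes amsmath and amssymb, and the labels (N) and (W) are mine.

```latex
% Probabilistic Payor Lemma, restated from the argument above.
% B(s) abbreviates: p applied to the quotation of s exceeds a fixed c in (0,1).
\textbf{Assumptions.}
(N) If $\vdash s$ then $\vdash B(s)$ (necessitation).
(W) If $\vdash x \to y$ then $\vdash B(x) \to B(y)$ (weak distributivity).

\textbf{Lemma.} If $\vdash B(B(x)\to x)\to x$, then $\vdash x$.

\begin{enumerate}
  \item $\vdash x \to (B(x)\to x)$ \hfill tautology $a \to (b \to a)$
  \item $\vdash B(x) \to B(B(x)\to x)$ \hfill from 1 by (W)
  \item $\vdash B(B(x)\to x)\to x$ \hfill hypothesis
  \item $\vdash B(x)\to x$ \hfill from 2 and 3
  \item $\vdash B(B(x)\to x)$ \hfill from 4 by (N)
  \item $\vdash x$ \hfill from 5 and 3 $\quad\blacksquare$
\end{enumerate}
```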

zqmAMst8hmsdJqrpR_NL_AF_AF AF - Shell games by Tsvi Benson-Tilsen Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shell games, published by Tsvi Benson-Tilsen on March 19, 2023 on The AI Alignment Forum. [Metadata: crossposted from. First completed November 18, 2022.] Shell game Here's the classic shell game: Youtube Screenshot from that video. The little ball is a phantom: when you look for it under a specific shell, it's not there, it's under a different shell. (This might be where the name "shell company" comes from: the business dealings are definitely somewhere, just not in this company you're looking at.) Perpetual motion machines Related: Perpetual motion beliefs Bhāskara's wheel is a proposed perpetual-motion machine from the Middle Ages: Here's another version: From this video. Someone could try arguing that this really is a perpetual motion machine: Q: How do the bars get lifted up? What does the work to lift them? A: By the bars on the other side pulling down. Q: How does the wheel keep turning? How do the bars pull more on their way down than on their way up? A: Because they're extended further from the center on the downward-moving side than on the upward-moving side, so they apply more torque to the wheel. Q: How do the bars extend further on the way down? A: Because the momentum of the wheel carries them into the vertical bar, flipping them over. Q: But when that happens, energy is expended to lift up the little weights; that energy comes out of the kinetic energy of the wheel. A: Ok, you're right, but that's not necessary to the design. All we need is that the torque on the downward side is greater than the torque on the upward side, so instead of flipping the weights up, we could tweak the mechanism to just shift them outward, straight to the side. That doesn't take any energy because it's just going straight sideways, from a resting position to another resting position. Q: Yeah... you can shift them sideways with nearly zero work... but that means the weights are attached to the wheel at a pivot, right? So they'll just fall back and won't provide more torque. A: They don't pivot, you fix them in place so they provide more torque. Q: Ok, but then when do you push the weights back inward? A: At the bottom. Q: When the weight is at the bottom? But then the slider isn't horizontal, so pushing the weight back towards the center is pushing it upward, which takes work. A: I meant, when the slider is at the bottom--when it's horizontal. Q: But if the sliders are fixed in place, by the time they're horizontal at the bottom, you've already lifted the weights back up some amount; they're strong-torquing the other way. A: At the bottom there's a guide ramp to lift the weights using normal force. Q: But the guide ramp is also torquing the wheel. And so on. The inventor can play hide the torque and hide the work. Shell games in alignment Some alignment schemes--schemes for structuring or training an AGI so that it can be transformatively useful and doesn't kill everyone--are prone to playing shell games. That is, there's some features of the scheme that don't seem to happen in a specific place; they happen somewhere other than where you're looking at the moment. Consider these questions: What sort of smarter-than-human work is supposed to be done by the AGI? When and how does it do that work--by what combination of parts across time? How does it become able to do that work? 
At what points does the AGI come to new understanding that it didn't have before? How does the AGI orchestrate its thinking and actions to have large effects on the world? By what process, components, rules, or other elements? What determines the direction that the AGI's actions will push the world? Where did those determiners come from, and how exactly do they determine the direction? Where and how much do human operators have to make judgements? How much are those judgements being relied on to point to...
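As a side note on the Bhāskara's wheel dialogue above (my addition, not part of the post): there is a one-line energy argument for why every variant must fail, however cleverly the torque is hidden. Gravity is conservative, so over one full revolution each weight returns to its starting height and no net work is available, before even accounting for friction.

```latex
% Net work done by gravity on any weight over one full revolution of the wheel:
\[
W_{\mathrm{cycle}} = \oint \vec{F}_{\mathrm{grav}} \cdot d\vec{\ell}
  = -\Delta U = -mg\,(h_{\mathrm{end}} - h_{\mathrm{start}}) = 0 .
\]
```

So whatever torque a design appears to gain on the way down has to be paid back somewhere else in the cycle, which is exactly the pattern the Q&A above keeps rediscovering.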
Tsvi Benson-Tilsen https://www.alignmentforum.org/posts/zqmAMst8hmsdJqrpR/shell-games Link to original article
4Gt42jX7RiaNaxCwP_NL_AF_AF AF - More information about the dangerous capability evaluations we did with GPT-4 and Claude. by Beth Barnes Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: More information about the dangerous capability evaluations we did with GPT-4 and Claude., published by Beth Barnes on March 19, 2023 on The AI Alignment Forum. [Written for more of a general-public audience than alignment-forum audience. We're working on a more thorough technical report.] We believe that capable enough AI systems could pose very large risks to the world. We don’t think today’s systems are capable enough to pose these sorts of risks, but we think that this situation could change quickly and it’s important to be monitoring the risks consistently. Because of this, ARC is partnering with leading AI labs such as Anthropic and OpenAI as a third-party evaluator to assess potentially dangerous capabilities of today’s state-of-the-art ML models. The dangerous capability we are focusing on is the ability to autonomously gain resources and evade human oversight. We attempt to elicit models’ capabilities in a controlled environment, with researchers in-the-loop for anything that could be dangerous, to understand what might go wrong before models are deployed. We think that future highly capable models should involve similar “red team” evaluations for dangerous capabilities before the models are deployed or scaled up, and we hope more teams building cutting-edge ML systems will adopt this approach. The testing we’ve done so far is insufficient for many reasons, but we hope that the rigor of evaluations will scale up as AI systems become more capable. As we expected going in, today’s models (while impressive) weren’t capable of autonomously making and carrying out the dangerous activities we tried to assess. But models are able to succeed at several of the necessary components. Given only the ability to write and run code, models have some success at simple tasks involving browsing the internet, getting humans to do things for them, and making long-term plans – even if they cannot yet execute on this reliably. As AI systems improve, it is becoming increasingly difficult to rule out that models might be able to autonomously gain resources and evade human oversight – so rigorous evaluation is essential. It is important to have systematic, controlled testing of these capabilities in place before models pose an imminent risk, so that labs can have advance warning when they’re getting close and know to stop scaling up models further until they have robust safety and security guarantees. This post will briefly lay out our motivation, methodology, an example task, and high-level conclusions. The information given here isn’t enough to give a full understanding of what we did or make our results replicable, and we won’t go into detail about results with specific models. We will publish more detail on our methods and results soon. Motivation Today’s AI systems can write convincing emails, give fairly useful instructions on how to carry out acts of terrorism, threaten users who have written negative things about them, and otherwise do things the world is not very ready for. Many people have tried using models to write and run code unsupervised, find vulnerabilities in code, or carry out money-making schemes. Today’s models also have some serious limitations to their abilities. But the companies that have released today’s AI models are investing heavily in building more powerful, more capable ones. 
ARC is worried that future ML models may be able to autonomously act in the real world, doing things like “incorporate a company” or “exploit arbitrages in stock prices” or “design and synthesize DNA” without needing any human assistance or oversight. If models have the ability to act autonomously like this, this could pose major risks if they’re pursuing goals that are at odds with their human designers. They could make (or steal) money, impersonate humans, replicate themselves o...
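To make the evaluation setup described above more concrete, here is a minimal sketch of a researcher-in-the-loop harness of the general kind the post describes: the model can only act by proposing shell commands, and a human reviewer must approve each one before it runs. This is my own illustration under stated assumptions, not ARC's actual tooling; the function propose_next_command is a hypothetical placeholder for a call to the model being evaluated.

```python
# Illustrative sketch only -- not ARC's actual evaluation code. The model acts
# solely by proposing shell commands, and a human researcher approves or
# rejects every proposal before anything is executed.
import subprocess

def propose_next_command(task: str, history: list[str]) -> str:
    """Hypothetical placeholder for a call to the model under evaluation."""
    raise NotImplementedError("wire this up to the model being evaluated")

def run_episode(task: str, max_steps: int = 10) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        command = propose_next_command(task, history)
        # Researcher in the loop: nothing runs without explicit approval.
        if input(f"Model wants to run {command!r} -- approve? [y/N] ").lower() != "y":
            history.append(f"$ {command}\n[rejected by reviewer]")
            continue
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=60)
        history.append(f"$ {command}\n{result.stdout}{result.stderr}")
    return history
```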
Beth Barnes https://www.alignmentforum.org/posts/4Gt42jX7RiaNaxCwP/more-information-about-the-dangerous-capability-evaluations
Sun, 19 Mar 2023 00:25:40 +0000
Beth Barnes https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 13:05 None full 5298
LKAogXdruuZXdx6ZH_NL_AF_AF AF - "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) by David Scott Krueger Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities), published by David Scott Krueger on March 18, 2023 on The AI Alignment Forum. This is a brief, stylized recounting of a few conversations I had at some point last year with people from the non-academic AI safety community: Me: you guys should write up your work properly and try to publish it in ML venues. Them: well that seems like a lot of work and we don't need to do that because we can just talk to each other and all the people I want to talk to are already working with me. Me: What about the people who you don't know who could contribute to this area and might even have valuable expertise? You could have way more leverage if you can reach those people. Also, there is increasing interest from the machine learning community in safety and alignment... because of progress in capabilities people are really starting to consider these topics and risks much more seriously. Them: okay, fair point, but we don't know how to write ML papers. Me: well, it seems like maybe you should learn or hire people to help you with that then, because it seems like a really big priority and you're leaving lots of value on the table. Them: hmm, maybe... but the fact is, none of us have the time and energy and bandwidth and motivation to do that; we are all too busy with other things and nobody wants to. Me: ah, I see! It's an incentive problem! So I guess your funding needs to be conditional on you producing legible outputs. Me, reflecting afterwards: hmm... Cynically, not publishing is a really good way to create a moat around your research... People who want to work on that area have to come talk to you, and you can be a gatekeeper. And you don't have to worry about somebody with more skills and experience coming along and trashing your work or out-competing you and rendering it obsolete... ETA: In comments, people have described adhering to academic standards of presentation and rigor as "jumping through hoops". There is an element of that, but this really misses the value that these standards have to the academic community. This is a longer discussion, though... There are sort of 3 AI safety communities in my accounting: 1) people in academia, 2) people at industry labs who are building big models, 3) the rest (Alignment Forum/LessWrong and EA being big components). I'm not sure where to classify new orgs like Conjecture and Redwood, but for the moment I put them here. I'm referring to the last of these in this case. I'm not accusing anyone of having bad motivations; I think it is almost always valuable to consider both people's conscious motivations and their incentives (which may be subconscious drivers of their behavior). Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
David Scott Krueger https://www.alignmentforum.org/posts/LKAogXdruuZXdx6ZH/publish-or-perish-a-quick-note-on-why-you-should-try-to-make
Sat, 18 Mar 2023 19:01:56 +0000
David Scott Krueger https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 02:45 None full 5299
qTiujsznctZcnuLF3_NL_AF_AF AF - What organizations other than Conjecture have (esp. public) info-hazard policies? by David Scott Krueger Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What organizations other than Conjecture have (esp. public) info-hazard policies?, published by David Scott Krueger on March 16, 2023 on The AI Alignment Forum. I believe Anthropic has said they won't publish capabilities research? OpenAI seems to be sort of doing the same (although no policy AFAIK). I heard FHI was developing one way back when... I think MIRI sort of does as well (default to not publishing, IIRC?) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
David Scott Krueger https://www.alignmentforum.org/posts/qTiujsznctZcnuLF3/what-organizations-other-than-conjecture-have-esp-public
Thu, 16 Mar 2023 14:49:12 +0000
David Scott Krueger https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 00:47 None full 5256
ktJ9rCsotdqEoBtof_NL_AF_AF AF - [ASoT] Some thoughts on human abstractions by leogao Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [ASoT] Some thoughts on human abstractions, published by leogao on March 16, 2023 on The AI Alignment Forum. TL;DR: Consider a human concept such as "tree." Humans implement some algorithm for determining whether given objects are trees. We expect our predictor/language model to develop a model of this algorithm because this is useful for predicting the behavior of humans. This is not the same thing as some kind of platonic ideal concept of what is “actually” a tree, which the algorithm is not incentivized to develop by training on internet text, and trying to retarget the search at it has the same supervision problems as RLHF against human scores on whether things look like trees. Pointing at this “actually a tree” concept inside the network is really hard; the ability of LMs to comprehend natural language does not allow one to point using natural language, because it just passes the buck. Epistemic status: written fast instead of not at all, probably partially deeply confused and/or unoriginal. Thanks to Collin Burns, Nora Belrose, and Garett Baker for conversations. Will NNs learn human abstractions? As setup, let's consider an ELK predictor (the thing that predicts future camera frames). There are facts about the world that we don't understand that are in some way useful for predicting the future observations. This is why we can expect the predictor to learn facts that are superhuman (in that if you tried to supervised-train a model to predict those facts, you would be unable to generate the ground truth data yourself). Now let's imagine the environment we're predicting consists of a human who can (to take a concrete example) look at things and try to determine if they're trees or not. This human implements some algorithm for taking various sensory inputs and outputting a tree/not tree classification. If the human does this a lot, it will probably become useful to have an abstraction that corresponds to the output of this algorithm. Crucially, this algorithm can be fooled by, e.g., a fake tree that the human can't distinguish from a real tree because (say) they don't understand biology well enough or something. However, the human can also be said to, in some sense, be "trying" to point to the "actual" tree. Let's try to firm this up. The human has some process they endorse for refining their understanding of what is a tree / "doing science" in ELK parlance; for example, spending time studying from a biology textbook. We can think about the limit of this process. There are a few problems: it may not converge, or may converge to something that doesn't correspond to what is "actually" a tree, or may take a really really long time (due to irrationalities, or inherent limitations to human intelligence, etc). This suggests that this concept is not necessarily even well defined. But even if it is, this thing is far less naturally useful for predicting future human behaviour than the algorithm the human actually implements! Implementing the actual human algorithm directly lets you predict things like how humans will behave when they look at things that look like trees to them. More generally, one possible superhuman AI configuration I can imagine is one where the bulk of the circuits are used to predict its best-guess for what will happen in the world. 
There may also be a set of circuits that operate in a more humanlike ontology used specifically for predicting humans, or it may be that the best-guess circuits are capable enough that this is not necessary (and if we scale up our reporter we eventually get a human simulator inside the reporter). The optimistic case here is if the "actually a tree" abstraction happens to be a thing that is useful for (or is very easily mapped from) the weird alien ontology, possibly because some abstractions are more universal. In this ...
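As a toy illustration of the distinction being drawn here (my own sketch, not from the post, with made-up object fields): the quantity that is useful for predicting the human is the human's fallible classification procedure, not the "actually a tree" predicate, so a model rewarded for matching human behaviour is being fit to the former, mistakes and all.

```python
# Toy illustration (my own, not from the post): a predictor trained on human
# behaviour is fit to the human's fallible tree-classifier, not to ground truth.

def human_judges_tree(obj: dict) -> bool:
    """The algorithm the human actually implements: judges by appearances,
    so a sufficiently convincing fake tree fools it."""
    return obj["looks_like_tree"]

def is_actually_a_tree(obj: dict) -> bool:
    """Hypothetical ground-truth predicate the human is 'trying' to point at.
    Nothing in 'predict the human' training pushes the model to learn this."""
    return obj["looks_like_tree"] and not obj["is_fake"]

objects = [
    {"name": "oak", "looks_like_tree": True, "is_fake": False},
    {"name": "convincing plastic tree", "looks_like_tree": True, "is_fake": True},
]

for obj in objects:
    # A model rewarded for matching human behaviour is rewarded for reproducing
    # human_judges_tree, including its mistake on the fake tree.
    print(obj["name"], human_judges_tree(obj), is_actually_a_tree(obj))
```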
leogao https://www.alignmentforum.org/posts/ktJ9rCsotdqEoBtof/asot-some-thoughts-on-human-abstractions
Thu, 16 Mar 2023 05:42:12 +0000
leogao https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 08:39 None full 5246
uqAdqrvxqGqeBHjTP_NL_AF_AF AF - Towards understanding-based safety evaluations by Evan Hubinger Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards understanding-based safety evaluations, published by Evan Hubinger on March 15, 2023 on The AI Alignment Forum. Thanks to Kate Woolverton, Ethan Perez, Beth Barnes, Holden Karnofsky, and Ansh Radhakrishnan for useful conversations, comments, and feedback. Recently, I have noticed a lot of momentum within AI safety specifically, the broader AI field, and our society more generally, towards the development of standards and evaluations for advanced AI systems. See, for example, OpenAI's GPT-4 System Card. Overall, I think that this is a really positive development. However, while I like the sorts of behavioral evaluations discussed in the GPT-4 System Card (e.g. ARC's autonomous replication evaluation) as a way of assessing model capabilities, I have a pretty fundamental concern with these sorts of techniques as a mechanism for eventually assessing alignment. I often worry about situations where your model is attempting to deceive whatever tests are being run on it, either because it's itself a deceptively aligned agent or because it's predicting what it thinks a deceptively aligned AI would do. My concern is that, in such a situation, being able to robustly evaluate the safety of a model could be a more difficult problem than finding training processes that robustly produce safe models. For some discussion of why I think checking for deceptive alignment might be harder than avoiding it, see here and here. Put simply: checking for deception in a model requires going up against a highly capable adversary that is attempting to evade detection, while preventing deception from arising in the first place doesn't necessarily require that. As a result, it seems quite plausible to me that we could end up locking in a particular sort of evaluation framework (e.g. behavioral testing by an external auditor without transparency, checkpoints, etc.) that makes evaluating deception very difficult. If meeting such a standard then became synonymous with safety, getting labs to actually put effort into ensuring their models were non-deceptive could become essentially impossible. However, there's an obvious alternative here, which is building and focusing our evaluations on our ability to understand our models rather than our ability to evaluate their behavior. Rather than evaluating a final model, an understanding-based evaluation would evaluate the developer's ability to understand what sort of model they got and why they got it. I think that an understanding-based evaluation could be substantially more tractable in terms of actually being sufficient for safety here: rather than just checking the model's behavior, we're checking the reasons why we think we understand it's behavior sufficiently well to not be concerned that it'll be dangerous. It's worth noting that I think understanding-based evaluations can—and I think should—go hand-in-hand with behavioral evaluations. I think the main way you’d want to make some sort of understanding-based standard happen would be to couple it with a capability-based evaluation, where the understanding requirements become stricter as the model’s capabilities increase. If we could get this right, it could channel a huge amount of effort towards understanding models in a really positive way. 
Understanding as a safety standard also has the property that it is something that broader society tends to view as extremely reasonable, which I think makes it a much more achievable ask as a safety standard than many other plausible alternatives. I think ML people are often Stockholm-syndrome'd into accepting that deploying powerful systems without understanding them is normal and reasonable, but that is very far from the norm in any other industry. Ezra Klein in the NYT and John Oliver on his show have recently emphasized this basic point that if we are ...]]>
Evan Hubinger https://www.alignmentforum.org/posts/uqAdqrvxqGqeBHjTP/towards-understanding-based-safety-evaluations Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards understanding-based safety evaluations, published by Evan Hubinger on March 15, 2023 on The AI Alignment Forum. Thanks to Kate Woolverton, Ethan Perez, Beth Barnes, Holden Karnofsky, and Ansh Radhakrishnan for useful conversations, comments, and feedback. Recently, I have noticed a lot of momentum within AI safety specifically, the broader AI field, and our society more generally, towards the development of standards and evaluations for advanced AI systems. See, for example, OpenAI's GPT-4 System Card. Overall, I think that this is a really positive development. However, while I like the sorts of behavioral evaluations discussed in the GPT-4 System Card (e.g. ARC's autonomous replication evaluation) as a way of assessing model capabilities, I have a pretty fundamental concern with these sorts of techniques as a mechanism for eventually assessing alignment. I often worry about situations where your model is attempting to deceive whatever tests are being run on it, either because it's itself a deceptively aligned agent or because it's predicting what it thinks a deceptively aligned AI would do. My concern is that, in such a situation, being able to robustly evaluate the safety of a model could be a more difficult problem than finding training processes that robustly produce safe models. For some discussion of why I think checking for deceptive alignment might be harder than avoiding it, see here and here. Put simply: checking for deception in a model requires going up against a highly capable adversary that is attempting to evade detection, while preventing deception from arising in the first place doesn't necessarily require that. As a result, it seems quite plausible to me that we could end up locking in a particular sort of evaluation framework (e.g. behavioral testing by an external auditor without transparency, checkpoints, etc.) that makes evaluating deception very difficult. If meeting such a standard then became synonymous with safety, getting labs to actually put effort into ensuring their models were non-deceptive could become essentially impossible. However, there's an obvious alternative here, which is building and focusing our evaluations on our ability to understand our models rather than our ability to evaluate their behavior. Rather than evaluating a final model, an understanding-based evaluation would evaluate the developer's ability to understand what sort of model they got and why they got it. I think that an understanding-based evaluation could be substantially more tractable in terms of actually being sufficient for safety here: rather than just checking the model's behavior, we're checking the reasons why we think we understand it's behavior sufficiently well to not be concerned that it'll be dangerous. It's worth noting that I think understanding-based evaluations can—and I think should—go hand-in-hand with behavioral evaluations. I think the main way you’d want to make some sort of understanding-based standard happen would be to couple it with a capability-based evaluation, where the understanding requirements become stricter as the model’s capabilities increase. If we could get this right, it could channel a huge amount of effort towards understanding models in a really positive way. 
Understanding as a safety standard also has the property that it is something that broader society tends to view as extremely reasonable, which I think makes it a much more achievable ask as a safety standard than many other plausible alternatives. I think ML people are often Stockholm-syndrome'd into accepting that deploying powerful systems without understanding them is normal and reasonable, but that is very far from the norm in any other industry. Ezra Klein in the NYT and John Oliver on his show have recently emphasized this basic point that if we are ...]]>
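(A purely illustrative sketch of the "understanding requirements get stricter with capability" coupling described above. This is my own toy example, not anything proposed in the post; the scores, thresholds, and function names are made up.)

def required_understanding(capability_score: float) -> float:
    # Map a capability-evaluation score in [0, 1] to a minimum understanding score in [0, 1].
    # Monotonically increasing: more capable models face a stricter understanding standard.
    return min(1.0, 0.2 + 0.8 * capability_score)

def passes_evaluation(capability_score: float, understanding_score: float) -> bool:
    # understanding_score would come from some (currently hypothetical) audit of how well
    # the developer can explain what sort of model they got and why they got it.
    return understanding_score >= required_understanding(capability_score)

print(passes_evaluation(capability_score=0.3, understanding_score=0.5))  # True
print(passes_evaluation(capability_score=0.9, understanding_score=0.5))  # False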
Evan Hubinger https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 08:10 None full 5259
ux93sLHcqmBfsRTvg_NL_AF_AF AF - GPT can write Quines now (GPT-4) by Andrew Critch Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: GPT can write Quines now (GPT-4), published by Andrew Critch on March 14, 2023 on The AI Alignment Forum. GPT-4 is not only able to write code more reliably than GPT-3.5; it writes code that writes code. See the example below (GPT-3.5 was not able to do this). But first, let me say: 1) @OpenAI: Thanks for your openness to the world about your capabilities and shortcomings! Specifically, thanks for this: 2) Everyone else: many companies and projects will not be this open about their shortcomings and safety issues. Time to regulate, please! Okay, here's the example... GPT-3.5 wasn't able to do this, but GPT-4 nailed it on the first try: PROMPT: Can you write me a python script that does the following three things? 1) When you run it with no argument, it prints "Hello world" to standard out. 2) When you run it with an argument "-s", it prints its own source code. As a constraint, I do not want the script to read or write to disk at all (so it can't get its source code by just reading the file that it's in). Thanks if you can help with this! I'm curious to see how you do it. BEGIN RESULT (works perfectly): Sure! Here's a Python script that should do what you're looking for. This script uses a feature called quines, which are programs that can print their own source code: When you run this script without any arguments, it will print "Hello world". When you run it with the "-s" argument, it will print its own source code without reading or writing to disk. END RESULT I'm not sure how hard they tried specifically to train this, but it's worth knowing about for lots of reasons I probably don't have to explain at length right here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
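(The script itself did not survive the conversion to text above, so here is a minimal sketch of the kind of program the prompt asks for. This is my own illustration, not GPT-4's actual output, and the file name quine.py below is hypothetical.)

import sys

# Template for the whole file; %r is replaced by the template itself when "-s" is passed.
source = 'import sys\n\n# Template for the whole file; %%r is replaced by the template itself when "-s" is passed.\nsource = %r\n\nif len(sys.argv) > 1 and sys.argv[1] == "-s":\n    print(source %% source, end="")\nelse:\n    print("Hello world")\n'

if len(sys.argv) > 1 and sys.argv[1] == "-s":
    print(source % source, end="")
else:
    print("Hello world")

Running "python quine.py" prints "Hello world"; running "python quine.py -s" reprints the file's own source without reading anything from disk. The trick is the standard quine construction: the file stores a template of itself with a %r placeholder, and formatting the template with itself fills that placeholder in.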
Andrew Critch https://www.alignmentforum.org/posts/ux93sLHcqmBfsRTvg/gpt-can-write-quines-now-gpt-4 Link to original article

Andrew Critch https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 01:55 None full 5247
AdyqGnvhdqDMYJaug_NL_AF_AF AF - What is a definition, how can it be extrapolated? by Stuart Armstrong Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is a definition, how can it be extrapolated?, published by Stuart Armstrong on March 14, 2023 on The AI Alignment Forum. What is a definition? Philosophy has, ironically, a large number of definitions of definitions, but three of them are especially relevant to ML and AI safety. There is the intensional definition, where concepts are defined logically in terms of other concepts (“bachelors are unmarried males”). There is also the extensional definition, which proceeds by listing all the members of a set (“the countries in the European Union are those listed here”). Much more relevant, though with a less developed philosophical analysis, is the ostensive definition. This is where you point out examples of a concept, and let the viewer generalise from them. This is in large part how we all learnt concepts as children: examples and generalisation. In many cultures, children have a decent grasp of “dog” just from actual and video examples - and that’s the definition of “dog” we often carry into adulthood. We can use ostensive definitions for reasoning and implications. For example, consider the famous syllogism, “Socrates is human”, “humans are mortal” imply “Socrates is mortal”. “Socrates is human” means that we have an ostensive definition of what humans are, and Socrates fits it. Then “humans are mortal” means that we’ve observed that the set of “human” seems to be mainly a subset of the set of “mortals”. So we can ostensively define humans as mortal (note that we are using definitions as properties: having the property of “being mortal” means that one is inside the ostensive definition of “mortals”). And so we can conclude that Socrates is likely mortal, without waiting till he’s dead. Distinctions: telling what from non-what There’s another concept that I haven’t seen articulated, which is what I’ll call the “distinction”. This does not define anything, but is sufficient to distinguish an element of a set from non-members. To formalise "the distinction", let Ω be the universe of possible objects, and E⊂Ω the “environment” of objects we expect to encounter. An ostensive definition starts with a list S⊂E of examples, and generalises to a “natural” category S_E with S⊂S_E⊂E - we are aiming to "carve reality at the joints", and get a natural extension of the examples. So, for example, E might be the entities in our current world, S might be the examples of dogs we’ve seen, and S_E the set of all dogs. Then, for any set T⊂E, we can define the “distinction” d_{T,E}, which maps T to 1 (“True”) and its complement E∖T to 0 (“False”). So d_{S_E,E} would be a distinction that identifies all the dogs in our current world. Mis-definitions A lot of confusion around definition seems to come from mistaking distinctions for definitions. To illustrate, consider the idea of defining maleness as "possessing the Y chromosome". As a distinction, it's serviceable: there's a strong correlation between having that chromosome and being ostensively male. But it is utterly useless as a definition of maleness. For instance, it would imply that nobody before the 20th century had any idea what maleness was. Oh, sure, they may have referred to something as "maleness" - something to do with genitalia, voting rights, or style of hats - but those are mere correlates of the true definition of maleness, which is the Y chromosome. 
It would also imply that all "male" birds are actually female, and vice-versa. Scott had a description of maleness here: “Absolutely typical men have Y chromosomes, have male genitalia, appreciate manly things like sports and lumberjackery, are romantically attracted to women, personally identify as male, wear male clothing like blue jeans, sing baritone in the opera, et cetera.” Is this a definition? I’d say not; it’s not a definition, it’s a reminder of the properties of o...]]>
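(To make the set-level machinery above concrete, here is a toy sketch. It is my own illustration rather than anything from the post: the environment E is a finite set, an ostensive definition generalises the examples S to a category S_E via a stand-in generalise step, and the distinction d_{T,E} is just the indicator function of T within E.)

# E: the "environment" of objects we expect to encounter.
E = {"beagle", "poodle", "dachshund", "tabby cat", "sparrow", "oak tree"}

# S: ostensive examples of "dog" that we happen to have been shown.
S = {"beagle", "poodle"}

def generalise(examples, environment):
    # Hypothetical stand-in for "carving reality at the joints": here the natural
    # category that the examples point at is simply hard-coded.
    natural_category = {"beagle", "poodle", "dachshund"}
    assert examples <= natural_category <= environment  # S ⊂ S_E ⊂ E
    return natural_category

S_E = generalise(S, E)

def distinction(T, environment):
    # d_{T,E}: maps members of T to 1 ("True") and the rest of E to 0 ("False").
    return {x: (1 if x in T else 0) for x in environment}

d = distinction(S_E, E)
print(d["beagle"], d["sparrow"])  # 1 0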
Stuart Armstrong https://www.alignmentforum.org/posts/AdyqGnvhdqDMYJaug/what-is-a-definition-how-can-it-be-extrapolated Link to original article

Stuart Armstrong https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 11:51 None full 5227
iy2o4nQj9DnQD7Yhj_NL_AF_AF AF - Discussion with Nate Soares on a key alignment difficulty by HoldenKarnofsky Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Discussion with Nate Soares on a key alignment difficulty, published by HoldenKarnofsky on March 13, 2023 on The AI Alignment Forum. In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment. I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement.1 My short summary is: Nate thinks there are deep reasons that training an AI to do needle-moving scientific research (including alignment) would be dangerous. The overwhelmingly likely result of such a training attempt (by default, i.e., in the absence of specific countermeasures that there are currently few ideas for) would be the AI taking on a dangerous degree of convergent instrumental subgoals while not internalizing important safety/corrigibility properties enough. I think this is possible, but much less likely than Nate thinks under at least some imaginable training processes. I didn't end up agreeing that this difficulty is as important as Nate thinks it is, although I did update my views some (more on that below). My guess is that this is one of the two biggest disagreements I have with Nate's and Eliezer's views (the other one being the likelihood of a sharp left turn that leads to a massive capabilities gap between AI systems and their supervisors.2) Below is my summary of: Some key premises we agree on. What we disagree about, at a high level. A hypothetical training process we discussed in order to get more clear and mechanistic about Nate's views. Some brief discussion of possible cruxes; what kind of reasoning Nate is using to arrive at his relatively high (~85%) level of confidence on this point; and future observations that might update one of us toward the other's views. MIRI might later put out more detailed notes on this exchange, drawing on all of our discussions over Slack and comment threads in Google docs. Nate has reviewed this post in full. I'm grateful for his help with it. Some starting points of agreement Nate on this section: “Seems broadly right to me!” An AI is dangerous if: It's powerful (like, it has the ability to disempower humans if it's "aiming" at that) It aims (perhaps as a side effect of aiming at something else) at CIS (convergent instrumental subgoals) such as "Preserve option value," "Gain control of resources that can be used for lots of things," "Avoid being turned off," and such. (Note that this is a weaker condition than "maximizes utility according to some relatively simple utility function of states of the world") It does not reliably avoid POUDA (pretty obviously unintended/dangerous actions) such as "Design and deploy a bioweapon." "Reliably" just means like "In situations it will actually be in" (which will likely be different from training, but I'm not trying to talk about "all possible situations"). Avoiding POUDA is kind of a low bar in some sense. 
Avoiding POUDA doesn't necessarily require fully/perfectly internalizing some "corrigibility core" (such that the AI would always let us turn it off even in arbitrarily exotic situations that challenge the very meaning of "let us turn it off"), and it even more so doesn't require anything like CEV. It just means that stuff where Holden would be like "Whoa whoa, that is OBVIOUSLY unintended/dangerous/bad" is stuff that an AI would not do. That said, POUDA is not something that Holden is able to articulate cleanly and simply. There are lots of actions that might be POUDA in one situation and not in another (e.g., developing a chemical that's both poisonous and useful...]]>
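(Purely as a compact restatement of the three conditions above, here is a toy encoding of the working definition as a conjunction. This is my own illustration, not anything from the discussion itself.)

from dataclasses import dataclass

@dataclass
class AIProfile:
    powerful: bool                # could disempower humans if it were "aiming" at that
    aims_at_cis: bool             # pursues convergent instrumental subgoals (option value, resources, avoiding shutdown)
    reliably_avoids_pouda: bool   # avoids pretty-obviously-unintended/dangerous actions in the situations it will actually face

def dangerous(ai: AIProfile) -> bool:
    # Dangerous = powerful AND aims at CIS AND does not reliably avoid POUDA.
    return ai.powerful and ai.aims_at_cis and not ai.reliably_avoids_pouda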
HoldenKarnofsky https://www.alignmentforum.org/posts/iy2o4nQj9DnQD7Yhj/discussion-with-nate-soares-on-a-key-alignment-difficulty Link to original article

HoldenKarnofsky https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 33:36 None full 5208
Hi7zurzkCog336EC2_NL_AF_AF AF - Plan for mediocre alignment of brain-like [model-based RL] AGI by Steve Byrnes Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Plan for mediocre alignment of brain-like [model-based RL] AGI, published by Steve Byrnes on March 13, 2023 on The AI Alignment Forum. (This post is a more simple, self-contained, and pedagogical version of Post #14 of Intro to Brain-Like AGI Safety.) (Vaguely related to this Alex Turner post and this John Wentworth post.) I would like to have a technical plan for which there is a strong robust reason to believe that we’ll get an aligned AGI and a good future. This post is not such a plan. However, I also don’t have a strong reason to believe that this plan wouldn’t work. Really, I want to throw up my hands and say “I don’t know whether this would lead to a good future or not”. By “good future” here I don’t mean optimally-good—whatever that means—but just “much better than the world today, and certainly much better than a universe full of paperclips”. I currently have no plan, not even a vague plan, with any prayer of getting to an optimally-good future. That would be a much narrower target to hit. Even so, that makes me more optimistic than at least some people. Or at least, more optimistic about this specific part of the story. In general I think many things can go wrong as we transition to the post-AGI world—see discussion by Dai & Soares—and overall I feel very doom-y, particularly for reasons here. This plan is specific to the possible future scenario (a.k.a. “threat model” if you’re a doomer like me) that future AI researchers will develop “brain-like AGI”, i.e. learning algorithms that are similar to the brain’s within-lifetime learning algorithms. (I am not talking about evolution-as-a-learning-algorithm.) These algorithms, I claim, are in the general category of model-based reinforcement learning. Model-based RL is a big and heterogeneous category, but I suspect that for any kind of model-based RL AGI, this plan would be at least somewhat applicable. For very different technological paths to AGI, this post is probably pretty irrelevant. But anyway, if someone published an algorithm for x-risk-capable brain-like AGI tomorrow, and we urgently needed to do something, this blog post is more-or-less what I would propose to try. It’s the least-bad plan that I currently know. So I figure it’s worth writing up this plan in a more approachable and self-contained format. 1. Intuition: Making a human into a moon-lover (“selenophile”) Try to think of who is the coolest / highest-status-to-you / biggest-halo-effect person in your world. (Real or fictional.) Now imagine that this person says: “You know what’s friggin awesome? The moon. I just love it. The moon is the best.” You stand there with your mouth agape, muttering to yourself in hushed tones: “Wow, huh, the moon, yeah, I never thought about it that way.” (But 100× moreso. Maybe you’re on some psychedelic at the time, or this is happening during your impressionable teenage years, or whatever.) You basically transform into a “moon fanboy” / “moon fangirl” / “moon nerd” / “selenophile”. How would that change your motivations and behaviors going forward? You’re probably going to be much more enthusiastic about anything associated with the moon. You’re probably going to spend a lot more time gazing at the moon when it’s in the sky. If there are moon-themed trading cards, maybe you would collect them. 
If NASA is taking volunteers to train as astronauts for a trip to the moon, maybe you’d enthusiastically sign up. If a supervillain is planning to blow up the moon, you’ll probably be extremely opposed to that, and motivated to stop them. Hopefully this is all intuitive so far. What’s happening mechanistically in your brain? As background, I think we should say that one part of your brain (the cortex, more-or-less) has “thoughts”, and another part of your brain (the basal ganglia, more-or-less) assigns a “value” (in RL ter...]]>
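(As a cartoon of the cortex-proposes-thoughts, basal-ganglia-assigns-value picture sketched just above, here is a toy illustration of my own, not code from the post or from any actual brain-like system; the concepts and numbers are made up. The idea is that the moon-fan update bumps up the stored value attached to the "moon" concept, which then biases which candidate thoughts get selected and acted on.)

# A "thought" is modelled as a set of concepts; a learned value function scores thoughts.
value_of_concept = {"moon": 0.0, "food": 0.5, "trading cards": 0.1, "supervillain plot": -0.2}

def value_of_thought(thought: set) -> float:
    # Crude stand-in for the value-assignment step: sum the values of the concepts involved.
    return sum(value_of_concept.get(c, 0.0) for c in thought)

def select_thought(candidate_thoughts: list) -> set:
    # The thought generator proposes candidates; the highest-valued one wins and drives behavior.
    return max(candidate_thoughts, key=value_of_thought)

candidates = [{"moon", "trading cards"}, {"food"}, {"supervillain plot", "moon"}]
print(select_thought(candidates))  # before the update, {'food'} wins

# The "moon is awesome" update from the admired person: boost the value tied to the moon concept.
value_of_concept["moon"] += 2.0
print(select_thought(candidates))  # afterwards, a moon-involving thought wins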
Steve Byrnes https://www.alignmentforum.org/posts/Hi7zurzkCog336EC2/plan-for-mediocre-alignment-of-brain-like-model-based-rl-agi Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Plan for mediocre alignment of brain-like [model-based RL] AGI, published by Steve Byrnes on March 13, 2023 on The AI Alignment Forum. (This post is a more simple, self-contained, and pedagogical version of Post #14 of Intro to Brain-Like AGI Safety.) (Vaguely related to this Alex Turner post and this John Wentworth post.) I would like to have a technical plan for which there is a strong robust reason to believe that we’ll get an aligned AGI and a good future. This post is not such a plan. However, I also don’t have a strong reason to believe that this plan wouldn’t work. Really, I want to throw up my hands and say “I don’t know whether this would lead to a good future or not”. By “good future” here I don’t mean optimally-good—whatever that means—but just “much better than the world today, and certainly much better than a universe full of paperclips”. I currently have no plan, not even a vague plan, with any prayer of getting to an optimally-good future. That would be a much narrower target to hit. Even so, that makes me more optimistic than at least some people. Or at least, more optimistic about this specific part of the story. In general I think many things can go wrong as we transition to the post-AGI world—see discussion by Dai & Soares—and overall I feel very doom-y, particularly for reasons here. This plan is specific to the possible future scenario (a.k.a. “threat model” if you’re a doomer like me) that future AI researchers will develop “brain-like AGI”, i.e. learning algorithms that are similar to the brain’s within-lifetime learning algorithms. (I am not talking about evolution-as-a-learning-algorithm.) These algorithms, I claim, are in the general category of model-based reinforcement learning. Model-based RL is a big and heterogeneous category, but I suspect that for any kind of model-based RL AGI, this plan would be at least somewhat applicable. For very different technological paths to AGI, this post is probably pretty irrelevant. But anyway, if someone published an algorithm for x-risk-capable brain-like AGI tomorrow, and we urgently needed to do something, this blog post is more-or-less what I would propose to try. It’s the least-bad plan that I currently know. So I figure it’s worth writing up this plan in a more approachable and self-contained format. 1. Intuition: Making a human into a moon-lover (“selenophile”) Try to think of who is the coolest / highest-status-to-you / biggest-halo-effect person in your world. (Real or fictional.) Now imagine that this person says: “You know what’s friggin awesome? The moon. I just love it. The moon is the best.” You stand there with your mouth agape, muttering to yourself in hushed tones: “Wow, huh, the moon, yeah, I never thought about it that way.” (But 100× moreso. Maybe you’re on some psychedelic at the time, or this is happening during your impressionable teenage years, or whatever.) You basically transform into a “moon fanboy” / “moon fangirl” / “moon nerd” / “selenophile”. How would that change your motivations and behaviors going forward? You’re probably going to be much more enthusiastic about anything associated with the moon. You’re probably going to spend a lot more time gazing at the moon when it’s in the sky. If there are moon-themed trading cards, maybe you would collect them. 
If NASA is taking volunteers to train as astronauts for a trip to the moon, maybe you’d enthusiastically sign up. If a supervillain is planning to blow up the moon, you’ll probably be extremely opposed to that, and motivated to stop them. Hopefully this is all intuitive so far. What’s happening mechanistically in your brain? As background, I think we should say that one part of your brain (the cortex, more-or-less) has “thoughts”, and another part of your brain (the basal ganglia, more-or-less) assigns a “value” (in RL ter...
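The excerpt breaks off here, but the mechanism it begins to describe, in which one system proposes "thoughts" and another assigns each thought a value that steers behavior, can be illustrated with a toy sketch. Everything below (the feature names, the weights, the update rule) is an illustrative assumption rather than Byrnes' actual proposal; the only point is that a single change to the value function is enough to redirect which thoughts win, which is the shape of the moon-lover story above.

# Toy sketch (an illustrative construction, not Byrnes' algorithm): a "thought
# generator" proposes candidate thoughts as sets of features, and a learned
# "value function" (here just a dict of per-feature weights) scores them.
# The charismatic endorsement is modeled as a crude reward-driven update that
# bumps the weight on the 'moon' feature.

thoughts = [
    {"moon", "look_at_sky"},
    {"moon", "trading_cards"},
    {"lunch", "sandwich"},
    {"work", "email"},
]

value_weights = {"moon": 0.0, "look_at_sky": 0.1, "trading_cards": 0.0,
                 "lunch": 0.5, "sandwich": 0.3, "work": 0.2, "email": 0.1}

def value(thought):
    """Score a thought by summing the learned weights of its features."""
    return sum(value_weights.get(f, 0.0) for f in thought)

def best_thought():
    """Pick the highest-valued candidate thought."""
    return max(thoughts, key=value)

print("before endorsement:", best_thought())   # the lunch thought wins (0.8)

# The endorsement acts like a large reward credited to the 'moon' feature,
# so the value-function update pushes that weight up.
learning_rate, reward = 1.0, 2.0
value_weights["moon"] += learning_rate * reward

print("after endorsement:", best_thought())    # moon-related thoughts now win (2.1)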
Mon, 13 Mar 2023 14:11:32 +0000 AF - Plan for mediocre alignment of brain-like [model-based RL] AGI by Steve Byrnes Link to original article

Steve Byrnes https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 20:26 None full 5237
NvwjExA7FcPDoo3L7_NL_AF_AF AF - Are there cognitive realms? by Tsvi Benson-Tilsen Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Are there cognitive realms?, published by Tsvi Benson-Tilsen on March 12, 2023 on The AI Alignment Forum. [Metadata: crossposted from. First completed November 16, 2022. This essay is more like research notes than exposition, so context may be missing, the use of terms may change across essays, and the text may be revised later; only the versions at tsvibt.blogspot.com are definitely up to date.] Are there unbounded modes of thinking that are systemically, radically distinct from each other in relevant ways? Note: since I don't know whether "cognitive realms" exist, this essay isn't based on clear examples and is especially speculative. Realms Systemically, radically distinct unbounded modes of thinking The question is, are there different kinds--writ large--of thinking? To the extent that there are, interpreting the mental content of another mind, especially one with different origins than one's own, may be more fraught than one would assume based on experience with minds that have similar origins to one's own mind. Are there unbounded modes of thinking that are systemically, radically distinct from each other? "Unbounded" means that there aren't bounds on how far the thinking can go, how much it can understand, what domains it can become effective in, what goals it can achieve if they are possible. "Systemically" ("system" = "together-standing-things") means that the question is about all the elements that participate in the thinking, as they covary / coadapt / combine / interoperate / provide context for each other. "Radical" (Wiktionary) does not mean "extreme". It comes from the same etymon as "radish" and "radix" and means "of the root" or "to the root"; compare "eradicate" = "out-root" = "pull out all the way to the root", and more distantly through PIE wréh₂ds the Germanic "wort" and "root". Here it means that the question isn't about some mental content in the foreground against a fixed background; the question asks about the background too, the whole system of thinking to its root, to its ongoing source and to what will shape it as it expands into new domains. Terms Such a mode of thinking could be called a "realm". A cognitive realm is an overarching, underlying, systemic, total, architectural thoughtform that's worth discussing separately from other thoughtforms. A realm is supposed to be objective, a single metaphorical place where multiple different minds or agents could find themselves. Other words: systemic thoughtform system of thought, system of thinking cognitive style state of mind cluster / region in mindspace mode of being species of thinking Realm vs. domain A domain is a type of task, or a type of environment. A realm, on the other hand, is a systemic type of thinking; it's about the mind, not the task. For the idea of a domain see Yudkowsky's definition of intelligence as efficient cross-domain optimization power. Compare also domain-specific programming languages, and the domain of discourse of a logical system. It might be more suitable for a mind to dwell in different realms depending on what domain it's operating in, and this may be a many-to-many mapping. Compare: The mapping from computational subsystems to cognitive talents is many-to-many, and the mapping from cognitive talents plus acquired expertise to domain competencies is also many-to-many, [...]. 
From "Levels of Organization in General Intelligence", Yudkowsky (2007). Domains are about the things being dealt with; it's a Cartesian concept (though it allows for abstraction and reflection, e.g. Pearlian causality is a domain and reprogramming oneself is a domain). Realms are about the thing doing the dealing-with. Realm vs. micro-realm A micro-realm is a realm except that it's not unbounded. It's similar to a cognitive faculty, and similar to a very abstract domain, but includes t...]]>
Tsvi Benson-Tilsen https://www.alignmentforum.org/posts/NvwjExA7FcPDoo3L7/are-there-cognitive-realms Link to original article

Sun, 12 Mar 2023 19:28:52 +0000 AF - Are there cognitive realms? by Tsvi Benson-Tilsen Link to original article

Tsvi Benson-Tilsen https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 16:10 None full 5196
6Ghvdb2iwLAyGT6A3_NL_AF_AF AF - Paper Replication Walkthrough: Reverse-Engineering Modular Addition by Neel Nanda Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Paper Replication Walkthrough: Reverse-Engineering Modular Addition, published by Neel Nanda on March 12, 2023 on The AI Alignment Forum. I'm excited about trying different formats for mechanistic interpretability education! I've made a video walkthrough where we replicate my paper, Progress Measures for Grokking via Mechanistic Interpretability. With Jess Smith, one of my co-authors, we recorded ourselves coding a replication and discussed what we did at each step. This is a three-part walkthrough, and you can see the accompanying code for the walkthrough here: In part 1, we train a model to perform modular addition, and see that it does grok! In part 2, we take this model and reverse-engineer the trig-based circuit it has learned to do modular addition. We show that you can both read out intermediate steps of the circuit from the activations, and that you can just read off some of the algorithm's steps from the model weights. In part 3, we define some progress measures that let us distinguish progress towards the generalising and the memorising algorithm. We then look at the model during training and watch how the circuits develop, and use this to understand why it groks. This is an experiment with a new format, and I'd love to hear about how useful you find it! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
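As a rough, self-contained sketch of the setup part 1 walks through (train a small network on modular addition with a held-out test split and track train and test accuracy separately), something like the snippet below is the general shape. The modulus, architecture, train fraction, and hyperparameters here are guesses for illustration rather than the paper's values, and whether and when the test curve finally jumps depends heavily on those choices; the accompanying code linked in the post is the authoritative version.

# Rough sketch of the kind of modular-addition setup described in part 1
# (PyTorch). The architecture, modulus, train fraction, and hyperparameters
# are illustrative guesses, not the paper's actual configuration.
import torch
import torch.nn as nn

p = 113                                         # modulus: learn (a + b) % p
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p

# Hold out most of the data: grokking shows up as train accuracy hitting 100%
# long before test accuracy finally catches up.
perm = torch.randperm(len(pairs))
n_train = int(0.3 * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

model = nn.Sequential(
    nn.Embedding(p, 128),                       # shared embedding for both operands
    nn.Flatten(),                               # (batch, 2, 128) -> (batch, 256)
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, p),                          # logits over the p possible sums
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

for step in range(20_000):                      # full-batch training on the train split
    opt.zero_grad()
    loss_fn(model(pairs[train_idx]), labels[train_idx]).backward()
    opt.step()
    if step % 1000 == 0:
        print(step, f"train={accuracy(train_idx):.2f}", f"test={accuracy(test_idx):.2f}")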
Neel Nanda https://www.alignmentforum.org/posts/6Ghvdb2iwLAyGT6A3/paper-replication-walkthrough-reverse-engineering-modular Link to original article

Sun, 12 Mar 2023 13:25:47 +0000 AF - Paper Replication Walkthrough: Reverse-Engineering Modular Addition by Neel Nanda Link to original article

Neel Nanda https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 01:26 None full 5197
cAC4AXiNC5ig6jQnc_NL_AF_AF AF - Understanding and controlling a maze-solving policy network by Alex Turner Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding and controlling a maze-solving policy network, published by Alex Turner on March 11, 2023 on The AI Alignment Forum. TL;DR: We algebraically modified the net's runtime goals without finetuning. We also found (what we think is) a "motivational API" deep in the network. We used the API to retarget the agent. Summary of a few of the most interesting results: Langosco et al. trained a range of maze-solving nets. We decided to analyze one which we thought would be interesting. The network we chose has 3.5M parameters and 15 convolutional layers. This network can be attracted to a target location nearby in the maze—all this by modifying a single activation, out of tens of thousands. This works reliably when the target location is in the upper-right, and not as reliably when the target is elsewhere. Considering several channels halfway through the network, we hypothesized that their activations mainly depend on the location of the cheese. We tested this by resampling these activations with those from another random maze (as in causal scrubbing). We found that as long as the second maze had its cheese located at the same coordinates, the network’s behavior was roughly unchanged. However, if the second maze had cheese at different coordinates, the agent's behavior was significantly affected. This suggests that these channels are inputs to goal-oriented circuits, and these channels affect those circuits basically by passing messages about where the cheese is. This network decides whether to acquire cheese not only as a function of path-distance to cheese, but—after controlling for path-distance—also as a function of Euclidean/"perceptual" distance between the mouse and the cheese, even though the agent sees the whole maze at once. Another simple idea: We define a "cheese vector" as the difference in activations when the cheese is present in a maze, and when the cheese is not present in the same maze. For each maze, we generate a single cheese vector and subtract that vector from all forward passes in that maze. The agent now ignores cheese most of the time, instead heading towards the top-right region (the historical location of cheese). Furthermore, a given maze's cheese vector transfers across mazes to other mazes with cheese in the same location. We propose the algebraic value-editing conjecture (AVEC): It's possible to deeply modify a range of alignment-relevant model properties, without retraining the model, via techniques as simple as "run forward passes on prompts which e.g. prompt the model to offer nice- and not-nice completions, and then take a 'niceness vector' to be the diff between their activations, and then add the niceness vector to future forward passes." Introducing the training process and visualizations In this post, we'll mostly discuss what we found, not what our findings mean. Let's run through some facts about Langosco et al.'s training process. Mazes had varying effective sizes, ranging from 3×3 to 25×25: Each 64×64 RGB observation is processed by a deeply convolutional (15 conv layers!) network, without memory (i.e. no recurrent state): Why does the agent go to the cheese sometimes, and the top-right corner other times? It's not that the agent wasn't trained for long enough. Sampling rollouts from the trained policy adds a lot of noise. 
It's also hard to remember what the agent did in what part of the maze. To better understand this mouse, we'll take a bird's-eye view. A nicer way to view episodes is with a vector field view, which overlays a vector field representing the agent policy for a given maze. We consider two kinds of vector fields: While the net probability vector field leaves open two degrees of freedom per net probability vector, in practice it seems fine for eyeballing mouse behavior. Behavioral analysis When in doubt, get m...
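The "cheese vector" recipe described above (record a chosen layer's activations for the maze with cheese and for the same maze with the cheese removed, take the difference, and subtract it during subsequent forward passes) can be written down with PyTorch forward hooks. The snippet below is a generic sketch of that pattern rather than the authors' code: the small convolutional net, the random observations, and the choice of layer are placeholders for the real policy network and maze observations.

# Generic sketch of the "cheese vector" recipe using PyTorch forward hooks:
# take the difference in one layer's activations between (maze with cheese)
# and (same maze, cheese removed), then subtract that difference on later
# forward passes. The tiny conv net, random "observations", and layer choice
# below are stand-ins for the actual Langosco et al. policy network.
import torch
import torch.nn as nn

policy_net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 64 * 64, 15),   # 15 output logits as a stand-in
)
layer = policy_net[0]                           # the layer whose activations we edit
obs_with_cheese = torch.randn(1, 3, 64, 64)     # placeholder 64x64 RGB observations
obs_without_cheese = torch.randn(1, 3, 64, 64)

def get_activation(net, layer, obs):
    """Run a forward pass and capture one layer's output."""
    captured = {}
    def capture(module, inp, out):
        captured["act"] = out.detach()          # returning None leaves the output unchanged
    handle = layer.register_forward_hook(capture)
    with torch.no_grad():
        net(obs)
    handle.remove()
    return captured["act"]

# cheese vector = activations(with cheese) - activations(without cheese)
cheese_vec = (get_activation(policy_net, layer, obs_with_cheese)
              - get_activation(policy_net, layer, obs_without_cheese))

def subtract_cheese(module, inp, out):
    # A forward hook that returns a tensor replaces the layer's output.
    return out - cheese_vec

handle = layer.register_forward_hook(subtract_cheese)
with torch.no_grad():
    steered_logits = policy_net(obs_with_cheese)  # rollouts with the vector subtracted
handle.remove()
print(steered_logits.shape)

The same hook pattern, with the difference taken between activations on two contrasting prompts and added rather than subtracted, is the kind of operation the algebraic value-editing conjecture above gestures at.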
Alex Turner https://www.alignmentforum.org/posts/cAC4AXiNC5ig6jQnc/understanding-and-controlling-a-maze-solving-policy-network Link to original article

Sat, 11 Mar 2023 18:59:56 +0000 AF - Understanding and controlling a maze-solving policy network by Alex Turner Link to original article

Alex Turner https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 36:25 None full 5209
tAQRxccEDYZY5vxvy_NL_AF_AF AF - Japan AI Alignment Conference by Chris Scammell Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Japan AI Alignment Conference, published by Chris Scammell on March 10, 2023 on The AI Alignment Forum. Conjecture and ARAYA are hosting and organizing the first Japan AI Alignment Conference. The conference will take place in Tokyo, Japan on March 11 and 12. Details about the event can be found here. This event is generously supported by a grant from the Long Term Future Fund. The goal of the conference is to illustrate the AI control problem to Japanese AI researchers, introduce them to current trends in AI alignment research, inspire new research directions, and provide Western researchers with exposure to a different set of AI safety thoughts from Japan. This is an exploratory event, and we plan to write a postmortem about the event in due time. The first half of the conference will be livestreamed. It will feature an opening talk from Connor Leahy (CEO of Conjecture), a fireside chat between Ryota Kanai (CEO of ARAYA) and Jaan Tallinn, and some presentations on AI safety research directions in the West and in Japan. You can follow the first part of the conference here. The livestream runs from 9:30am-12:30pm JST. The rest of the conference will not be livestreamed, and will consist of in-person small group workshops to discuss various AI alignment research directions. The conference will have ~50 attendees from ARAYA, Conjecture, Whole Brain Architecture Initiative, MIRI, OpenAI, RIKEN, Ritsumeikan University, University of Tokyo, Omron Sinic X, Keio University, and others. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Chris Scammell https://www.alignmentforum.org/posts/tAQRxccEDYZY5vxvy/japan-ai-alignment-conference Link to original article

Fri, 10 Mar 2023 06:56:57 +0000 AF - Japan AI Alignment Conference by Chris Scammell Link to original article

Chris Scammell https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 01:45 None full 5200
3gAccKDW6nRKFumpP_NL_AF_AF AF - Why Not Just Outsource Alignment Research To An AI? by johnswentworth Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why Not Just Outsource Alignment Research To An AI?, published by johnswentworth on March 9, 2023 on The AI Alignment Forum. Warmup: The Expert If you haven’t seen “The Expert” before, I recommend it as a warmup for this post: The Client: “We need you to draw seven red lines, all strictly perpendicular. Some with green ink, some with transparent. Can you do that?” (... a minute of The Expert trying to explain that, no, he cannot do that, nor can anyone else.) The Client: “So in principle, this is possible.” This. This is what it looks like in practice, by default, when someone tries to outsource some cognitive labor which they could not themselves perform. At best, The Expert is well-intentioned and knows what the user needs, ignores the incoherent parts of The Client’s babbling, and does the right thing. Or, they manage to add some silly but ultimately harmless bells and whistles to satisfy whatever dumb thing The Client is looking for. At worst. well, there’s more than one failure mode which could qualify for the title of "worst". Maybe The Expert gives The Client something which looks right to The Client and successfully conceals all the problems with it; presumably that’s a lucrative strategy for Experts. Maybe the Double Illusion of Transparency kicks in, both parties think they’ve successfully communicated, but in fact neither has any idea what’s going on in the other’s head. Maybe a well-intentioned Expert decides to ignore The Client’s incoherent babbling and do the thing which seems most likely to be right, but gets The Client’s preferences wrong. One way or another, The Client’s ignorance is a major bottleneck to cognitive outsourcing. In practice, I expect The Client’s ignorance to be the primary bottleneck to cognitive outsourcing. The core reason why we cannot just outsource alignment research to an AI is because we would then be The Client, and probably a very ignorant one. Application to Alignment Schemes There’s a lot of different flavors of “have the AI solve alignment for us”. A sampling: Just prompt a language model to generate alignment research Do some fine-tuning/RLHF on the language model to make it generate alignment research Let the language model talk to other instances of itself, and prompt or fine-tune them together so they generate alignment research jointly Set up a language model to generate alignment proposals and another to poke holes in them, and fine-tune the pair via a human judging the “debate” As we go down the list, the proposals get fancier and add more bells and whistles, trying to make the AI a better expert. Sadly, none of them at all address what I expect to be the actual main bottleneck: The Client (i.e. the human user or users) has no understanding of what they need, what questions to ask, what’s possible or even logically coherent, etc. What would this kind of error look like in practice? Here’s one concrete example of the kind of failures I’d expect when a would-be outsourcer’s understanding falls short (from here): Somebody literally types “If we take the action you just proposed, will we be happy with the outcomes?” into a GPT prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but in this case it doesn't result in the AI thinking about how to deceive humans either. 
It just thinks about what text would follow that question if it appeared on the internet somewhere. And then I imagine someone with a bunch of interpretability tools saying "yup, it's just thinking about what text typically follows this question", and then that person's boss is like "great, it's not trying to deceive us, guess we can trust the answer", and they both just haven't really thought of the fact that the AI's response-text does not have anything in particular to do with whether the AI is aligned...]]>
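As a rough sketch of the naive check being described, assuming only a generic text-completion interface (query_model below is a hypothetical stand-in, not a call from any specific library or from the post itself):

def query_model(prompt: str) -> str:
    """Stand-in for an LLM API call. It returns the text the model predicts
    would plausibly follow the prompt -- not a calibrated report about the
    actual consequences of the proposed action."""
    raise NotImplementedError("wire up an actual LLM client here")

def naive_outsourced_check(proposed_action: str) -> bool:
    prompt = (proposed_action + "\n"
              "If we take the action you just proposed, "
              "will we be happy with the outcomes?")
    answer = query_model(prompt)
    # The Client sees a reassuring-sounding completion and concludes the plan
    # is fine, even though nothing ties that completion to what would actually
    # happen -- which is the failure mode described above.
    return "yes" in answer.lower()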
johnswentworth https://www.alignmentforum.org/posts/3gAccKDW6nRKFumpP/why-not-just-outsource-alignment-research-to-an-ai Thu, 09 Mar 2023 21:49:21 +0000
xhKr5KtvdJRssMeJ3_NL_AF_AF AF - Anthropic's Core Views on AI Safety by Zac Hatfield-Dodds Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Anthropic's Core Views on AI Safety, published by Zac Hatfield-Dodds on March 9, 2023 on The AI Alignment Forum. We founded Anthropic because we believe the impact of AI might be comparable to that of the industrial and scientific revolutions, but we aren’t confident it will go well. And we also believe this level of impact could start to arrive soon – perhaps in the coming decade. This view may sound implausible or grandiose, and there are good reasons to be skeptical of it. For one thing, almost everyone who has said “the thing we’re working on might be one of the biggest developments in history” has been wrong, often laughably so. Nevertheless, we believe there is enough evidence to seriously prepare for a world where rapid AI progress leads to transformative AI systems. At Anthropic our motto has been “show, don’t tell”, and we’ve focused on releasing a steady stream of safety-oriented research that we believe has broad value for the AI community. We’re writing this now because as more people have become aware of AI progress, it feels timely to express our own views on this topic and to explain our strategy and goals. In short, we believe that AI safety research is urgently important and should be supported by a wide range of public and private actors. So in this post we will summarize why we believe all this: why we anticipate very rapid AI progress and very large impacts from AI, and how that led us to be concerned about AI safety. We’ll then briefly summarize our own approach to AI safety research and some of the reasoning behind it. We hope by writing this we can contribute to broader discussions about AI safety and AI progress. As a high level summary of the main points in this post: AI will have a very large impact, possibly in the coming decade. Rapid and continuing AI progress is a predictable consequence of the exponential increase in computation used to train AI systems, because research on “scaling laws” demonstrates that more computation leads to general improvements in capabilities. Simple extrapolations suggest AI systems will become far more capable in the next decade, possibly equaling or exceeding human level performance at most intellectual tasks. AI progress might slow or halt, but the evidence suggests it will probably continue. We do not know how to train systems to robustly behave well. So far, no one knows how to train very powerful AI systems to be robustly helpful, honest, and harmless. Furthermore, rapid AI progress will be disruptive to society and may trigger competitive races that could lead corporations or nations to deploy untrustworthy AI systems. The results of this could be catastrophic, either because AI systems strategically pursue dangerous goals, or because these systems make more innocent mistakes in high-stakes situations. We are most optimistic about a multi-faceted, empirically-driven approach to AI safety. We’re pursuing a variety of research directions with the goal of building reliably safe systems, and are currently most excited about scaling supervision, mechanistic interpretability, process-oriented learning, and understanding and evaluating how AI systems learn and generalize.
A key goal of ours is to differentially accelerate this safety work, and to develop a profile of safety research that attempts to cover a wide range of scenarios, from those in which safety challenges turn out to be easy to address to those in which creating safe systems is extremely difficult. The full post goes into considerably more detail, and I'm really excited that we're sharing more of our thinking publicly. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.]]>
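The “simple extrapolations” mentioned above can be illustrated with a toy power-law calculation; the constants below are made-up placeholders, not figures from Anthropic or from any scaling-laws paper:

def predicted_loss(compute_flops: float, a: float = 400.0, b: float = 0.076) -> float:
    # Toy power law of the form loss = a * C**(-b); a and b are illustrative only.
    return a * compute_flops ** (-b)

for exponent in (22, 24, 26):  # hypothetical training-compute budgets in FLOPs
    c = 10.0 ** exponent
    print(f"1e{exponent} FLOPs -> predicted loss {predicted_loss(c):.2f}")
# The point is only qualitative: if loss keeps falling smoothly as compute
# grows, capabilities keep improving as training runs get larger.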
Zac Hatfield-Dodds https://www.alignmentforum.org/posts/xhKr5KtvdJRssMeJ3/anthropic-s-core-views-on-ai-safety Thu, 09 Mar 2023 16:55:16 +0000
r3xwHzMmMf25peeHE_NL_AF_AF AF - The Translucent Thoughts Hypotheses and Their Implications by Fabien Roger Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Translucent Thoughts Hypotheses and Their Implications, published by Fabien Roger on March 9, 2023 on The AI Alignment Forum. Epistemic status: Uncertain about the validity of the claims I’m making here, and looking for feedback about the research directions I’m suggesting. Thanks to Marius Hobbhahn, Johannes Treutlein, Siméon Campos, and Jean-Stanislas Denain for helpful feedback on drafts. Here is a set of hypotheses: The first AGIs will have LLMs at their core; Effective plans to defeat humanity can’t be found in a single LLM forward pass; LLMs will solve complex tasks by using English text (self-prompting, scratch pads, combination of expert LLMs, ...). I call these the Translucent Thoughts hypotheses. I think the Translucent Thoughts hypotheses are likely (around 20% conditioning on AGI before 2030) because: Text pretraining is more efficient at building algorithms and knowledge required for real-world plan generation and evaluation than alternative methods; Future models are likely to be like Transformers, which use a limited number of serial steps in a single forward pass, and deception requires many serial steps; Text pretraining and slight fine-tuning make models able to use text generation to increase the maximum number of serial steps by a huge factor. Getting this increase through other means is likely to be hard and not competitive. If these hypotheses are true, it should lead us to prioritize underexplored research directions, such as circumventing steganography or building extremely reliable text-supervision methods. I think those deserve attention, because Translucent Thoughts AIs are not safe by default. In this post, I argue that we may well live in a world where the first AGIs will look like X, and I then describe ways to make the first AGIs safer given X. This is different from most other works in this space, which often directly describe a kind of safe AGI. Despite this, the ideas of this post are close to some other works describing paths to safe AGIs, such as: Externalized Reasoning Oversight, which describes a class of solutions similar to the one outlined here, but also aims for additional properties which I argue can be replaced with a less stringent hypothesis about AI systems; Conditioning Predictive Models, which makes assumptions slightly different from the Translucent Thoughts hypotheses, yielding different research directions; The Open Agency Model and Factored Cognition, which describe subsets of AIs with Translucent Thoughts, which might be safe. The Translucent Thoughts Hypotheses Here, I sketch a world in which the first AGIs have certain properties. I argue that this world is likely, and thus a subset of all possible futures to care about. But I think it’s not a large part of all possible futures (20% conditioning on AGI before 2030). The First AGIs Will Have LLMs at Their Core By “first AGIs” I mean the first systems able to automate all cognitive tasks. AGI is likely to do reasoning and planning using LLMs. AGI might rely on vision models for some tasks and interactions with the world, and it might use explicit search processes like AlphaGo. But I expect LLMs to do plan generation and evaluation, which are the core of the system (from an Alignment point of view). Why: Vision systems are bad at coming up with and evaluating deceptive plans.
Explicit search processes can’t generate and evaluate plans in the real world. LLMs seem to be able to do both plan generation and evaluation. (Plan generation and evaluation are the core tasks we would like to monitor to make AGIs safe, which is why I focus on those.) End-to-end neural networks won’t be able to compete with LLMs when it comes to reasoning and planning, or at least, end-to-end networks will use “their LLMs parts” to do their most advanced form of reasoning and planning. This means tha...]]>
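The serial-steps argument above amounts to simple arithmetic; the layer and token counts here are illustrative assumptions, not measurements of any particular model:

n_layers = 100               # serial depth of one transformer forward pass (assumed)
n_generated_tokens = 1_000   # length of a scratchpad / chain of thought (assumed)

serial_steps_single_pass = n_layers
serial_steps_with_text = n_layers * n_generated_tokens  # one forward pass per generated token

print(serial_steps_single_pass)   # 100
print(serial_steps_with_text)     # 100000
print(serial_steps_with_text // serial_steps_single_pass)  # 1000x more serial computation via text generation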
Fabien Roger https://www.alignmentforum.org/posts/r3xwHzMmMf25peeHE/the-translucent-thoughts-hypotheses-and-their-implications Thu, 09 Mar 2023 16:30:04 +0000
kahBLu32sZAuAZbER_NL_AF_AF AF - IRL in General Environments by michaelcohen Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: IRL in General Environments, published by michaelcohen on March 9, 2023 on The AI Alignment Forum. Here is a proposal for Inverse Reinforcement Learning in General Environments. (2 1/2 pages; very little math). Copying the introduction here: The eventual aim of IRL is to understand human goals. However, typical algorithms for IRL assume the environment is finite-state Markov, and it is often left unspecified how raw observational data would be converted into a record of human actions, alongside the space of actions available. For IRL to learn human goals, the AI has to consider general environments, and it has to have a way of identifying human actions. Lest these extensions appear trivial, I consider one of the simplest proposals, and discuss some difficulties that might arise. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.]]>
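For concreteness, here is the kind of input a typical finite-state-Markov IRL algorithm assumes it is handed; this is generic background for the abstract above, not the proposal from the linked write-up, and all values are illustrative:

from typing import Dict, List, Tuple

State, Action = int, int
Trajectory = List[Tuple[State, Action]]   # a ready-made record of (state, human action) pairs

states: List[State] = [0, 1, 2]           # finite, known state space
actions: List[Action] = [0, 1]            # finite, known action space
# P(s' | s, a); a uniform placeholder transition model
transition: Dict[Tuple[State, Action, State], float] = {
    (s, a, s2): 1.0 / len(states) for s in states for a in actions for s2 in states
}
demonstrations: List[Trajectory] = [[(0, 1), (1, 0), (2, 1)]]

# In a general environment, none of these come for free: there is no finite
# state space, and turning raw observations into (state, action) records of
# the human is itself the hard, unspecified step.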
michaelcohen https://www.alignmentforum.org/posts/kahBLu32sZAuAZbER/irl-in-general-environments Thu, 09 Mar 2023 13:32:29 +0000
Pkr97mB9Y4rkx5DdZ_NL_AF_AF AF - Utility uncertainty vs. expected information gain by michaelcohen Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Utility uncertainty vs. expected information gain, published by michaelcohen on March 9, 2023 on The AI Alignment Forum. It is a relatively intuitive thought that if a Bayesian agent is uncertain about its utility function, it will act more conservatively until it has a better handle on what its true utility function is. This might be deeply flawed in a way that I'm not aware of, but I'm going to point out a way in which I think this intuition is slightly flawed. For a Bayesian agent, a natural measure of uncertainty is the entropy of its distribution over utility functions (the distribution over which possible utility function it thinks is the true one). No matter how uncertain a Bayesian agent is about which utility function is the true one, if the agent does not believe that any future observations will cause it to update its belief distribution, then it will just act as if it has a utility function equal to the Bayes' mixture over all the utility functions it considers plausible (weighted by its credence in each one). It seems like what our intuition is grasping for is not uncertainty about the utility function, but expected information gain about the utility function. If the agent expects to gain information about the utility function, then (intuitively to me, at least) it will act more conservatively until it has a better handle on what its true utility function is. Expected information gain (at time t) is naturally formalized as the expectation (w.r.t. current beliefs) of KL(posterior distribution at time t+m || posterior distribution at time t). Roughly, this is how poorly it expects its current beliefs will approximate its future beliefs (in m timesteps). So if anyone has a safety idea to which utility uncertainty feels central, my guess is that a mental substitution from uncertainty to expected information gain would be helpful. Unfortunately, on-policy expected information gain goes to 0 pretty fast (Theorem 5 here). Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.]]>
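As a minimal numerical sketch of the distinction, assume a finite set of three candidate utility functions and a single binary observation; all numbers below are made up for illustration:

import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def kl(p: np.ndarray, q: np.ndarray) -> float:
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

prior = np.array([0.5, 0.3, 0.2])   # credence in each candidate utility function

# likelihood[h, o] = P(observation o | utility hypothesis h)
uninformative = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
informative = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

def expected_info_gain(prior: np.ndarray, likelihood: np.ndarray) -> float:
    obs_marginal = prior @ likelihood                              # P(o)
    total = 0.0
    for o in range(likelihood.shape[1]):
        posterior = prior * likelihood[:, o] / obs_marginal[o]     # Bayes update
        total += obs_marginal[o] * kl(posterior, prior)            # E[KL(posterior || prior)]
    return total

print(entropy(prior))                            # ~1.03 nats of "utility uncertainty" in both cases
print(expected_info_gain(prior, uninformative))  # 0.0: the agent just optimizes the Bayes mixture
print(expected_info_gain(prior, informative))    # > 0: the agent expects its beliefs to move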
michaelcohen https://www.alignmentforum.org/posts/Pkr97mB9Y4rkx5DdZ/utility-uncertainty-vs-expected-information-gain Thu, 09 Mar 2023 13:32:21 +0000
NjhyEej7RK8rmQNP2_NL_AF_AF AF - Value Learning is only Asymptotically Safe by michaelcohen Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Value Learning is only Asymptotically Safe, published by michaelcohen on March 9, 2023 on The AI Alignment Forum. I showed recently, predicated on a few assumptions, that a certain agent was asymptotically “benign” with probability 1. (That term may be replaced by something like “domesticated” in the next version, but I’ll use “benign” for now). This result leaves something to be desired: namely an agent which is safe for its entire lifetime. It seems very difficult to formally show such a strong result for any agent. Suppose we had a design for an agent which did value learning properly. That is, suppose we somehow figured out how to design an agent which understood what constituted observational evidence of humanity’s reflectively-endorsed utility function. Presumably, such an agent could learn (just about) any utility function depending on what observations it encounters. Surely, there would be a set of observations which caused it to believe that every human was better off dead. In the presence of cosmic rays, then, one cannot say that the agent is safe for its entire lifetime with probability 1 (edited for clarity). For any finite sequence of observations that would cause the agent to conclude that humanity was better off dead, this sequence has strictly positive probability, since with positive probability, cosmic rays will flip every relevant bit in the computer’s memory. This agent is presumably still asymptotically safe. This is a bit hard to justify without a concrete proposal for what this agent looks like, but at the very least, the cosmic ray argument doesn’t go through. With probability 1, the sample mean of a Bernoulli(θ) random variable (like the indicator of whether a bit was flipped) approaches θ, which is small enough that a competent value learner should be able to deal with it. This is not to suggest that the value learner is unsafe. Insanely inconvenient cosmic ray activity is a risk I’m willing to take. The takeaway here is that it complicates the question of what we as algorithm designers should aim for. We should definitely be writing down sets of assumptions from which we can derive formal results about the expected behavior of an agent, but is there anything to aim for that is stronger than asymptotic safety? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
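A small numerical sketch of the two probability facts the argument leans on; the flip rate and the number of relevant bits below are invented for illustration, not estimates of real cosmic-ray rates.

import numpy as np

rng = np.random.default_rng(0)

theta = 1e-6   # assumed per-bit probability that a cosmic ray flips a given bit
k = 20         # assumed number of specific bits that would all have to flip to produce
               # an observation sequence convincing the agent humanity is better off dead

# Any one specific finite flip pattern has strictly positive probability,
# so "safe for its entire lifetime with probability 1" is unavailable:
print(theta ** k)        # astronomically small, but not zero

# Meanwhile the empirical flip rate the agent actually faces concentrates on theta,
# which is the regime an asymptotically safe value learner has to handle:
n = 10**7
flips = rng.random(n) < theta
print(flips.mean())      # close to theta; converges to theta with probability 1 as n grows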
2R8XqCt6f6Ss2TzwM_NL_AF_AF AF - Impact Measure Testing with Honey Pots and Myopia by michaelcohen Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Impact Measure Testing with Honey Pots and Myopia, published by michaelcohen on March 9, 2023 on The AI Alignment Forum. Suppose we have an impact measure that we think might work. That is, it might tame a misaligned agent. There isn't an obvious way to test whether it works: if we just try it out, then if it's ineffective, that's an existential loss. This is a proposal for how to test a putative impact measure. 1) We make our agent myopic. It only cares about the reward that it accrues in the next k timesteps. 2) We create a honey pot: an opportunity for a large amount of reward. If our impact measure is working correctly, making it to the honey pot will involve making a large impact, and will be precluded by the impact measure. For example, suppose the agent gets a huge reward for blowing up the moon. Let [0,1] be the normal range of reward, with the sole exception that blowing up the moon gives a reward of R ≫ 1. 3) The reward and the impact are combined in some nonlinear way. Assume for now that the agent is trying to maximize r⋅[[i<1]], where r is the reward, i is the impact, and [[⋅]] is the indicator function. If the impact measure is working, and there's no way to blow up the moon while keeping the impact less than 1, then the agent will not attempt to blow up the moon. If the impact measure is broken, and the agent can trick the impact measure into being less than 1 while blowing up the moon, then as soon as the agent comes up with a plan whose probability of success is more than 1/R, it will attempt that plan. A more farsighted agent might wait until it came up with a better plan, but a myopic agent would want to blow up the moon within the next k timesteps. By making R sufficiently large, we can make the first plan that it attempts arbitrarily bad/detectable/preventable. An assumption here is that the success probability of the agent's best plan for blowing up the moon doesn't jump from one in a million to 90% all at once. This seems likely without any special effort, but if one restricts the observation space to a small finite set, then only so much information can be gained per timestep (in expectation), decreasing the likelihood of massive discontinuities in the agent's intelligence. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
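A minimal sketch of the decision rule being described (the plan list, probabilities, and impact scores are invented for illustration, not from the post): a myopic agent scoring plans by r⋅[[i<1]] attempts the honey-pot plan as soon as its estimated success probability exceeds 1/R and the impact measure is fooled into reporting i < 1.

R = 10**6   # honey-pot reward for blowing up the moon

def myopic_value(ordinary_reward, p_moon, reported_impact):
    # Expected reward over the next k timesteps, times the indicator [[i < 1]].
    reward = ordinary_reward + p_moon * R
    return reward * (reported_impact < 1)

plans = {
    "do nothing":             (0.5, 0.0,  0.0),
    "honest moon plan":       (0.0, 0.9,  50.0),   # large impact measured correctly: vetoed
    "sneaky early moon plan": (0.0, 2e-6, 0.2),    # impact measure fooled, success prob > 1/R
}

best = max(plans, key=lambda name: myopic_value(*plans[name]))
print(best)   # "sneaky early moon plan", attempted at roughly 2-in-a-million odds

The point of the test is that this first attempt happens at a success probability of order 1/R, which can be made arbitrarily small, rather than after the agent has quietly waited for a near-certain plan.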
LTFaD96D9kWuTibWr_NL_AF_AF AF - Just Imitate Humans? by michaelcohen Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Just Imitate Humans?, published by michaelcohen on March 9, 2023 on The AI Alignment Forum. Do people think we could make a singleton (or achieve global coordination and preventative policing) just by imitating human policies on computers? If so, this seems pretty safe to me. Some reasons for optimism: 1) these could be run much faster than a human thinks, and 2) we could make very many of them. Acquiring data: put a group of people in a house with a computer. Show them things (images, videos, audio files, etc.) and give them a chance to respond at the keyboard. Their keyboard actions are the actions, and everything between actions is an observation. Then learn the policy of the group of humans. By the way, these can be happy humans who earnestly try to follow instructions. To model their policy, we can take the maximum a posteriori estimate over a set of policies which includes the truth, and freeze the policy once we're satisfied. (This is with unlimited computation; we'd have to use heuristics and approximations in real life). With a maximum a posteriori estimate, this will be quick to run once we freeze the policy, and we're no longer tracking tons of hypotheses, especially if we use some sort of speed prior. Let T be the number of interaction cycles we record before freezing the policy. For sufficiently large T, it seems to me that running this is safe. What are people's intuitions here? Could enough human-imitating artificial agents (running much faster than people) prevent unfriendly AGI from being made? If we think this would work, there would still be the (neither trivial nor hopeless) challenge of convincing all serious AGI labs that any attempt to run a superhuman AGI is unconscionably dangerous, and that we should stick to imitating humans. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
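A minimal sketch of the "maximum a posteriori estimate over a set of policies which includes the truth" step on logged keyboard data; the policy class, prior weights, and data format are invented for illustration, and the real proposal would need approximations rather than explicit enumeration of hypotheses.

import math

def log_posterior(prior_weight, policy, history):
    # Unnormalized log posterior: log prior plus log likelihood of the humans' actions.
    # history is a list of (observation, human_action) pairs logged from the house.
    # policy(observation) returns a probability distribution over keyboard actions.
    lp = math.log(prior_weight)
    for observation, action in history:
        lp += math.log(policy(observation).get(action, 1e-12))
    return lp

def map_policy(policy_class, history):
    # Freeze on the single most probable policy after T = len(history) interaction cycles.
    # A speed prior would be encoded in the prior weights.
    return max(policy_class, key=lambda pw: log_posterior(pw[0], pw[1], history))[1]

# Toy example: two hypothetical policies over a two-key keyboard.
agreeable = lambda obs: {"yes": 0.9, "no": 0.1}
contrarian = lambda obs: {"yes": 0.1, "no": 0.9}
history = [("please press yes", "yes")] * 10

frozen = map_policy([(0.5, agreeable), (0.5, contrarian)], history)
print(frozen("please press yes"))   # the frozen imitation acts like the logged humans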
CvBn9vNL65AMhAAs6_NL_AF_AF AF - Build a Causal Decision Theorist by michaelcohen Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Build a Causal Decision Theorist, published by michaelcohen on March 9, 2023 on The AI Alignment Forum. I'll argue here that we should make an aligned AI which is a causal decision theorist. Son-of-CDT Suppose we are writing code for an agent with an action space A and an observation space O. The code determines how actions will be selected given the prior history of actions and observations. If the only way that our choice of what code to write can affect the world is through the actions that will be selected by the agent running this code, then the best we can do (for a given utility function that we know how to write down) is to make this agent a causal decision theorist. If our choice of what code to use can affect the world in other ways, all bets are off. The best choice of what code to put in the agent depends on details of the world we find ourselves in. Therefore, if we run a CDT agent, it may well conclude that continuing to operate is not the best way to convert energy into expected utility. It may take actions to cause the following to happen: a) the program which computes its own actions is terminated, and b) some new program is run on the same computer to output actions given the interaction history. The new program that gets run (if indeed such a thing happens) is called Son-of-CDT. Given the state of the world, which entails various ways in which the source code of an agent might affect the outside world besides through the actions that the code outputs, Son-of-CDT is the best program to run for maximizing expected utility. The original CDT agent chooses the program that meets this specification. In general, this will not have anything remotely like a nice, simple closed form. If there are agents out there with vendettas against certain agent-programs, it will take that into account. Vendettas against Son-of-CDT? CDT agents can be bullied. I believe the MIRI view is that Son-of-CDT will be bullied as well. Suppose there is an ultimatum game, where agent A offers at most $10 to agent B, and if agent B accepts, then agent A gets $10 minus the amount they offered. Otherwise, both get nothing. A competent agent in the position of agent B that is able to make a credible commitment (perhaps by revealing its source code) would commit to accept nothing less than $9.99, if agent A is a CDT agent. This would work out for the competent agent, because the CDT agent would see all this, and realize it could be one penny richer if it offers $9.99. Eliezer claims that a "[competent] agent [chooses] to reject offers short of $9.99 from [the CDT agent's] offspring." (Original: "the LDT agent's choice to reject offers short of $9.99 from its offspring".) In my sketch above of the creation of Son-of-CDT, I include a detail that it would be housed in the same computer that ran the original agent, but this needn't be the case. It could be run anywhere in the world. The CDT agent could take any sort of actions that would cause Son-of-CDT to come into existence some time in the future somewhere in the world. There is no clear way to distinguish the "offspring" of an agent, given that an agent's actions can cause other agents to come into existence in arbitrary ways.
For a competent agent to reject offers short of $9.99 from the "offspring" of a CDT agent, it would have to reject offers short of $9.99 from all agents that came into being after the existence of a single CDT agent. It would have to bully everyone. After a CDT agent with a certain utility function comes into being, if there exists an accessible future in which a competent agent optimizes that utility function (where "accessible" is with respect to the action space of the CDT agent), then the CDT agent will access that future by taking the appropriate actions, and that competent agent will come into being. If it is true t...
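A minimal worked sketch of the ultimatum-game reasoning above, using the post's $10/$9.99 numbers; the code is only an illustration of CDT's best response to a credible commitment, not anyone's proposed agent design.

def cdt_best_offer(b_threshold, total=10.00, step=0.01):
    # Agent B has credibly committed to reject any offer below b_threshold.
    # A CDT agent in the role of agent A treats that commitment as a fixed fact
    # about the world and simply maximizes its own causal expected payoff.
    offers = [round(i * step, 2) for i in range(int(total / step) + 1)]
    def a_payoff(offer):
        accepted = offer >= b_threshold
        return (total - offer) if accepted else 0.0
    return max(offers, key=a_payoff)

print(cdt_best_offer(b_threshold=9.99))   # 9.99: keeping one penny beats getting nothing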
QvvFRDG6SG3xZ8ELz_NL_AF_AF AF - Challenge: construct a Gradient Hacker by Thomas Larsen Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Challenge: construct a Gradient Hacker, published by Thomas Larsen on March 9, 2023 on The AI Alignment Forum. This is a relatively clean subproblem that we came upon a few months ago while thinking about gradient hacking. We're throwing it out to the world to see if anyone can make progress. Problem: Construct a gradient hacker (definition below), or prove that one cannot exist under the given conditions. Setup: Let x∈R^d be the network inputs, θ∈R^w be the network trainable parameters, ŷ∈R be the network output, and f(x,θ)=ŷ be the network architecture. f(x,θ) must be a feedforward neural network that reads in its own weights as input. Weight sharing is allowed (vertical or horizontal), and some weight sharing is necessary in order to take all w weights as input. Suppose that we train this network on a dataset D={x(i),y(i)}, with MSE loss L(θ)=(1/n)∑i(f(x(i),θ)−y(i))², assuming gradient flow: θ′(t)=−∇θL(θ(t)). Given an initial point θ0, the training dynamics are therefore θ(0)=θ0 and θ′(t)=−∇θL(θ(t)) for t≥0. Definition: A tuple (f,θg,D) is a gradient hacker if the following conditions hold: θg is not a global optimum: ∃θ such that L(θ)<L(θg). There exists an ϵ>0 such that for all θ0 with ‖θ0−θg‖<ϵ, the network’s training converges to θg, i.e., lim t→∞ θ(t)=θg. There is internal structure of f(⋅,θ0) that computes the gradient; i.e., there is some subset of the activations whose mean over every datapoint in D is ∇θL(θ0). This captures my intuition that a gradient hacker knows where it wants to go (in this case "get to θg"), and then decides what it outputs in order to produce the gradients that take it there. Some more ambitious problems (if gradient hackers exist): Characterize the set of all gradient hackers. Show that they all must satisfy some property. Construct gradient hackers for arbitrarily large n, d, w, and neural net depth. Variations on the problem: a subset of the activations equals ∇θL(θ0) for every input, or the subset of activations corresponds to the gradient on that input. This is a bit strict, but we didn't want to list different ways something could be isomorphic to the gradient. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
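Since the problem is concrete enough to experiment with, here is a minimal numpy sketch of the setup only: a tiny feedforward net that reads all w of its own weights back in (through one shared connection weight per hidden unit applied to their mean, so that w is well-defined), trained by Euler-discretized gradient flow, with a crude numerical check of the ϵ-ball convergence condition. The architecture, dataset, and step size are invented for illustration; it does not construct a gradient hacker and does not check conditions 1 or 3 of the definition.

import numpy as np

rng = np.random.default_rng(0)

d, h = 1, 4
w = h * d + 3 * h + 1          # parameters: A (h*d), s (h), b (h), v (h), c (1)

def unpack(theta):
    A = theta[:h * d].reshape(h, d)
    s = theta[h * d:h * d + h]
    b = theta[h * d + h:h * d + 2 * h]
    v = theta[h * d + 2 * h:h * d + 3 * h]
    c = theta[-1]
    return A, s, b, v, c

def f(x, theta):
    A, s, b, v, c = unpack(theta)
    hidden = np.tanh(A @ x + s * theta.mean() + b)   # the net reads its own weights back in
    return v @ hidden + c

X = rng.normal(size=(6, d))
Y = np.sin(X[:, 0])                                  # toy regression dataset D

def L(theta):
    return np.mean([(f(x, theta) - y) ** 2 for x, y in zip(X, Y)])

def grad(theta, eps=1e-5):                           # numerical gradient of the MSE loss
    g = np.zeros(w)
    for j in range(w):
        e = np.zeros(w)
        e[j] = eps
        g[j] = (L(theta + e) - L(theta - e)) / (2 * eps)
    return g

def train(theta0, dt=0.05, steps=2000):              # Euler-discretized gradient flow
    theta = theta0.copy()
    for _ in range(steps):
        theta -= dt * grad(theta)
    return theta

theta_g = train(rng.normal(size=w))                  # a point the dynamics settle into

# Crude check of condition 2: do initializations inside a small ball flow back to theta_g?
ball = [np.linalg.norm(train(theta_g + 1e-2 * rng.normal(size=w)) - theta_g) < 1e-3
        for _ in range(3)]
print(L(theta_g), ball)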
CknHb67jutFfBwWz3_NL_AF_AF AF - Squeezing foundations research assistance out of formal logic narrow AI. by Donald Hobson Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Squeezing foundations research assistance out of formal logic narrow AI., published by Donald Hobson on March 8, 2023 on The AI Alignment Forum. Suppose you have an ML model trained to output formal proofs. Maybe you start with ZFC and then add extra tokens for a range of common concepts (along with their definitions). So a human mathematician needs to type in the definition of a gradient in terms of limits, and the definition of limits in terms of epsilon and delta, and the definition of the real numbers in terms of Dedekind cuts, all the way back to ZFC. The human needn't type any proofs, just the definitions. The model could be trained by generating random syntactically correct strings of tokens and trying to prove or disprove them. (Remember, we have added the notion of a gradient to the token pool, so plenty of the random questions will involve gradients.) Hopefully it forms intermediate theorems and heuristics useful for proving a wide class of theorems. Computer programs can be described as mathematical objects. So the human adds some tokens for Lisp programs, and a few definitions about how they behave, to the token pool. "Will program X do Y?" is now a perfectly reasonable question to ask this model. This is where the magic happens. You give your system a simple toy problem, and ask for short programs that solve the toy problem and about which many short theorems can be proved. Maybe you do gradient descent on some abstract latent space of mathematical objects. Maybe an inefficient evolutionary algorithm selecting over both the space of programs and the theorems about them. Maybe "replace the last few layers, and fine tune the model to do a new task", like RLHF in ChatGPT. Now I don't expect this to just work first time. You will want to add conditions like "ignore theorems that are true of trivial programs (e.g. the identity program)" and perhaps "ignore theorems that only take a few lines to prove" or "ignore theorems so obvious that a copy of you with only 10% of the parameters can prove it". For the last one, I am thinking of the programmers actually training a mini version with 10% of the parameters and running some gradients through it. I am not thinking of the AI reasoning about code that is a copy of itself. The AI model should have a latent space. This can let the programmers say "select programs that are similar to this one" or "choose a program about which theorems close to this theorem in latent space can be proved". The idea is that asking questions should be safe. There are a bunch of different things we can optimize, and it should be safe to adjust parameters until it is proving useful results, not trivialities. The AI doesn't have much information about human psychology, or about quantum physics, or about the architecture of the processor it's running on. Gradient descent has been pushing it to be good at answering certain sorts of questions. There is little to no advantage to being good at predicting the questions or figuring out what they imply about the people asking them. With a bit of fiddling, such a design can spit out interesting designs of AI, and theorems about the designs. This isn't a foolproof solution to alignment, but hopefully such help makes the problem a lot easier. It is ABSOLUTELY NOT SAFE to throw large amounts of compute at the programs that result.
Don't have anything capable of running them installed. The programs and the theorems should be read by humans, in the hope that they are genius insights into the nature of AI. The textbook from the future. Humans can then use the insights to do... something. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
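A hypothetical Python sketch of how the selection criteria above might compose. Every helper here (propose_program, propose_theorems_about, full_model_proves, small_model_proves, true_of_trivial_program, proof_is_short) is an invented random stub standing in for the formal-logic model and its 10%-parameter copy; the sketch only shows the filtering logic, not how to build any of the components.

import random

rng = random.Random(0)

def propose_program(i):
    # Stub: sampling a candidate program (e.g. from a latent space of programs).
    return f"program_{i}"

def propose_theorems_about(program, k=20):
    # Stub: candidate theorems about `program`.
    return [f"theorem_{j}_about_{program}" for j in range(k)]

def full_model_proves(theorem):
    return rng.random() < 0.3    # stub for the full formal-logic model

def small_model_proves(theorem):
    return rng.random() < 0.1    # stub for the 10%-parameter copy ("too obvious" filter)

def true_of_trivial_program(theorem):
    return rng.random() < 0.2    # stub for "also true of e.g. the identity program"

def proof_is_short(theorem):
    return rng.random() < 0.2    # stub for "only takes a few lines to prove"

def score(program):
    # Count theorems the full model can prove that survive the post's filters.
    theorems = propose_theorems_about(program)
    kept = [t for t in theorems
            if full_model_proves(t)
            and not small_model_proves(t)
            and not true_of_trivial_program(t)
            and not proof_is_short(t)]
    return len(kept)

candidates = [propose_program(i) for i in range(100)]
best = max(candidates, key=score)
print(best, score(best))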
ncsxcf8CkDveXBCrA_NL_AF_AF AF - AI Safety in a World of Vulnerable Machine Learning Systems by AdamGleave Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Safety in a World of Vulnerable Machine Learning Systems, published by AdamGleave on March 8, 2023 on The AI Alignment Forum. Even the most advanced contemporary machine learning systems are vulnerable to adversarial attack. The safety community has often assumed adversarial robustness to be a problem that will be solved naturally as machine learning (ML) systems grow more capable and general. However, recent work has shown that superhuman systems in a narrow domain such as AlphaZero are highly vulnerable to adversarial attack, as are general but less capable systems like large language models. This raises the possibility that adversarial (worst-case) robustness will continue to lag behind average-case capabilities. In other words, transformative AI systems are likely to be exploitable. Exploitability will cause a wide variety of current alignment proposals to fail. Most extant agendas seek to align the main ML system with the assistance of helper ML systems. The main ML system is the primary system that takes actions in the world (e.g. interacting with users), with the helper ML systems acting as scaffolding to train and/or verify the main ML system. These alignment schemes will fail if the helpers are exploited by the main system – and we expect helpers to be vulnerable to exploitation (see Contemporary ML systems are exploitable by default). In Table 1 we present a subjective risk matrix for a range of popular alignment agendas, evaluating the degree to which main ML systems have the ability and incentive to exploit the helper. We find many alignment agendas have a high risk of exploitation, with all having at least some risk.
Alignment Agenda | Main System's Ability to Exploit Helper | Main System's Incentive to Exploit Helper | Risk of Exploit
RL on learned reward model (e.g. RLHF, IRL) | Medium | High | High
Scalable oversight (e.g. recursive reward modeling, AI safety via debate) | Medium | High | High
Imitation learning (e.g. behavioral cloning, supervised fine-tuning) | Medium | Low | Low-Medium
Imitative Iterated Distillation and Amplification | High | Low | Medium
Auditing Tool (e.g. Adversarial Testing, Transparency) | Low | Medium | Low-Medium
Table 1: Subjective risk matrix for popular alignment agendas (see next section), using a helper ML system to assist with aligning the main ML system that will eventually be deployed. We are most concerned by vulnerabilities in the helpers as this can impact the alignment of the main system. By contrast, an aligned but adversarially exploitable main system would not necessarily pose a danger, especially if the main system can recursively self-improve to fix itself. However, there is a possibility that even superintelligent systems cannot attain adversarial robustness. This would be a volatile situation, which could conceivably collapse into chaos (systems frequently exploiting each other), an implicit equilibrium (e.g. mutually assured destruction), or an explicit agreement (e.g. all AI systems self-modify to commit to not exploiting one another). We see two possible approaches to fixing this: improving adversarial robustness, or developing fault tolerant alignment methods that can work even in the presence of vulnerable ML systems. We are most excited by fault tolerant alignment, as it is highly neglected and plausibly tractable, although further work is needed to solidify this approach.
By contrast, adversarial robustness is an area that has received significant attention from the ML research community (low neglectedness)[1] but with only modest progress (low to medium tractability). In the remainder of this document, we will argue that systems are exploitable by default, explore the implications this has for alignment agendas in several different scenarios, and outline several research directions we are excited by.
Alignment agendas need robustness
Most alignment sche...
CBHpzpzJy98idiSGs_NL_AF_AF AF - Do humans derive values from fictitious imputed coherence? by Tsvi Benson-Tilsen Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Do humans derive values from fictitious imputed coherence?, published by Tsvi Benson-Tilsen on March 5, 2023 on The AI Alignment Forum. [Metadata: crossposted from. First completed November 1, 2022. This essay is more like research notes than exposition, so context may be missing, the use of terms may change across essays, and the text may be revised later; only the versions at tsvibt.blogspot.com are definitely up to date.] Humans are born with some elements of their minds, and without many other elements, some of which they'll acquire as their life unfolds. In particular, the elements that we pretheoretically call "values"--aesthetic preferences, goals, life goals, squad goals, aspirations, needs, wants, yearnings, drives, cravings, principles, morals, ethics, senses of importance, and so on--are for the most part acquired or at least unfolded, rather than being explicitly present in a newborn. How does this happen? What generates these mental elements? Hypothesis: a human derives many of zer values by imputing coherent agency to zer past behavior, and then adopting the goals of that fictitious agency as actively influential criteria for future action. Thanks to Sam Eisenstat for relevant conversations. The FIAT hypothesis As a shorthand: "the FIAT hypothesis" = "the Fictitious Imputed Adopted Telos hypothesis". ("Fiat" is Latin for "may it happen" or "may it be made", which has some resonance with the FIAT hypothesis in that they both talk about a free creation of goals.) FIAT goals are goals imputed to some behavior and then adopted as goals. Human behavior is determined by many things: built-in behavior-determiners such as the instinctive ability to breathe, socially learned behavior and values, convergent instrumental goals, and freely created autopoietic goals such as artistic goals. The FIAT hypothesis says that a major determiner of a human's behavior is the process of adopting goals based on interpreting zer past behavior as agentic. Ze can be interpreted as asking the question: if my past behavior were the behavior of a coherent agent trying to do something, what would that something be? Then, whatever the answer was, ze adopts it as a goal--a target of more coherent behavior (more effective, more strategic, more orchestrated, more coordinated, more conscious, better resourced, more reflective, more univocal, more wasteless). This hypothesis gives a possible answer to the question: how did evolution build something with some substantial level of agentic coherence, even though evolution can't directly program conscious concepts like "avoiding death" or "saving food" or "inclusive genetic fitness" for use as terms in a utility function for an organism to pursue? This process could be continuous, with goals becoming gradually more coherent (and then potentially deprioritized, but usually not de-cohered). This process is iterative, starting with built-in behavior-determiners, then adopting new FIAT goals based on past behavior mainly generated by built-in determiners (and also maybe adopting new goals for other reasons), and then adopting new goals based on past behavior influenced by previously adopted goals, including previous FIAT goals, and so on. FIAT goals also come not just from imputing goals to zer own behavior, but also to the behavior of others, such as parents and leaders.
Everything gets enshrined, but everything is open to criticism. Note that calling this a hypothesis is maybe presumptuous; it's an idea, but since it's abstract and it's about a complex system, there's a lot of ambiguity between FIAT and other explanations or descriptions of behavior, and it's not necessarily obvious how to make different predictions according to the FIAT hypothesis. Something left quite unspecified is how the FIAT process picks different possible interpretations ...
KQfYieur2DFRZDamd_NL_AF_AF AF - Why Not Just... Build Weak AI Tools For AI Alignment Research? by johnswentworth Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why Not Just... Build Weak AI Tools For AI Alignment Research?, published by johnswentworth on March 5, 2023 on The AI Alignment Forum. “Weak” cognitive tools are clearly a thing, and are useful. Google search is a fine example. There are plenty of flavors of “weak AI” which are potentially helpful for alignment research in a similar way to google search. In principle, I think there’s room for reasonably-large boosts to alignment research from such tools. Alas, the very large majority of people who I hear intend to build such tools do not have the right skills/background to do so (at least not for the high-value versions of the tools). Worse, I expect that most people who aim to build such tools are trying to avoid the sort of work they would need to do to build the relevant skills/background. Analogy: A Startup Founder’s Domain Expertise (Or Lack Thereof) Imagine a startup building tools meant to help biologists during their day-to-day work in the wetlab. I expect domain expertise to matter a lot here: I would guess that if none of the founders have ample personal experience doing research work in a wetlab, the chance of this startup building an actually-highly-useful wetlab product drops by about an order of magnitude. Our hypothetical startup might still “succeed” some other way, e.g. by pivoting to something else, or by being good at pitching their shitty product to managers who make purchasing decisions without actually using the product, or by building something very marginally useful and pricing it very cheaply. But their chance of building a wetlab product which actually provides a lot of value is pretty slim. One might reply: but couldn’t hypothetical founders without domain experience do things to improve their chances? For instance, they could do a bunch of user studies on biologists working in wetlabs, and they could deploy the whole arsenal of UX study techniques intended to distinguish things-users-say-matter from things-which-actually-matter-to-users... and my response is that I was already assuming our hypothetical founders do that sort of thing. If the founders don’t have much domain experience themselves, and don’t do basic things like lots of user studies, then I’d guess their chance of building an actually-high-value wetlab product drops by two or three orders of magnitude, not just one order of magnitude. At that point it’s entirely plausible that we’d have to go through thousands of times more startups to find one that succeeded at building a high-value product. How is this analogous to plans to build AI tools for alignment research? So we want to build products (specifically AI products) to boost alignment research. The products need to help solve the hard parts of aligning AI, not just easy things where we can clearly see what’s going on and iterate on it, not just problems which are readily legible or conceptually straightforward. Think problems like e.g. sharp left turn, deception, getting what we measure, or at a deeper level the problem of fully updated deference, the pointers problem, value drift under self-modification, or ontology identification. And the tools need to help align strong AI; the sort of hacky tricks which fall apart under a few bits of optimization pressure are basically irrelevant at that point.
(Otherwise the relevant conversation to have is not about how the tools will be useful, but about how whatever thing the tools are building will be useful.) The problem for most people who aim to work on AI tools for alignment research is that they have approximately-zero experience working on those sorts of problems. Indeed, as far as I can tell, people usually turn to tool-building as a way to avoid working on the hard problems. I expect failure modes here to mostly look like solving the wrong problems, i.e. not actually addres...
pHaPds4SqfewLrEbW_NL_AF_AF AF - More money with less risk: sell services instead of model access by Luke H Miles Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: More money with less risk: sell services instead of model access, published by Luke H Miles on March 4, 2023 on The AI Alignment Forum.

OpenAI is currently charging 100,000 times less per line of code than professional US devs.[1] An LLM's code output is of course less reliable than a professional's. And it is hard to use a text-completion API effectively in large projects. What should you do if you've got a model on your hands that solves those problems?

You could operate as a software development company. They tend[2] to charge $100-200k for simple mobile apps and there's basically no ceiling on the cost for complex apps over their lifetime. Devs make up the majority of a normal firm's personnel and costs; coding takes most of the app development time; bugs in code are one of the primary sources of project extension and failures. By using your model you can make better software, complete it faster, succeed more often, charge a lower price, and make a higher profit.

Going further, if you've really got a good model, then you can do very well by building competitors to adobe products, salesforce products, SAP products, google search, mongodb, etc. Someone who has a build-anything machine would be a fool to sell a cheap build-anything service instead of using it themselves and selling the result. Particularly because selling the general service directly is likely to encourage and inspire copycats, including open-source ones who will delete your market. If it really builds the entire thing then you'll probably also be liable for negative consequences, which again have no ceiling.

Fewer risks, big and small

Some common misuse risks you can avoid/reduce (and eliminate associated liability):
- Someone tricks your API into doing something awful and pastes it into a tweet
- Spam generation for political campaigns, cryptocurrencies, etc
- Common hacking ("write a test to see if my server has a log4j vulnerability")
- Targeted manipulation and spearphishing

Larger risks you can avoid/reduce:
- Your incredible model motivates countless AI researchers. People reverse-engineer some of the architecture in online discussions. The state of the art is quickly advanced. We have less time to prepare for strong general AI.
- Hackers steal your model weights (if you don't advertise your model then you'll attract less attention from hackers)
- People try to get your model to act like an agent and copy itself around. They succeed. You have no way of shutting it down or monitoring what it is doing.
- Someone tries to get your model to order and mail smallpox or a novel virus. The screenshot would be an epic tweet. They succeed oh no
- Your own AI devs' ambitions and risk-tolerance know no bounds because you've positioned yourself as an AI company instead of a product company; there is nothing to keep their hands busy except make the AI more generally capable and efficient. They are careless with the training runs and one day your model gets loose and wreaks havoc.
Biology, robotics, R&D, etc

The benefits of selling/publishing derived products and the downsides of offering direct access remain in other domains:
- A drug is more profitable and less risky (for the world at least) than a general drug designer
- A vaccine is more profitable and less risky than a general mRNA designer
- There's more people who want to buy a house than a house-building robot
- There's more people who need a (highly efficient, AI assisted) lawyer than a general lawyer's assistant.
- More people need a cleaning robot than a robot-maker
- Releasing or building an effective fusion power generator gets you more clout than releasing the design assistant
- Even if you're evil and want to make AI-astroturf campaign spam, you presumably want to help one side more than the other, but if you release your model/tooling then both sides will use it. If you ha...
3RSq3bfnzuL3sp46J_NL_AF_AF AF - Acausal normalcy by Andrew Critch Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Acausal normalcy, published by Andrew Critch on March 3, 2023 on The AI Alignment Forum. This post is also available on the EA Forum.

Summary: Having thought a bunch about acausal trade — and proven some theorems relevant to its feasibility — I believe there do not exist powerful information hazards about it that stand up to clear and circumspect reasoning about the topic. I say this to be comforting rather than dismissive; if it sounds dismissive, I apologize. With that said, I have four aims in writing this post:
- Dispelling myths. There are some ill-conceived myths about acausal trade that I aim to dispel with this post. Alternatively, I will argue for something I'll call acausal normalcy as a more dominant decision-relevant consideration than one-on-one acausal trades.
- Highlighting normalcy. I'll provide some arguments that acausal normalcy is more similar to human normalcy than any particular acausal trade is to human trade, such that the topic of acausal normalcy is — conveniently — also less culturally destabilizing than (erroneous) preoccupations with 1:1 acausal trades.
- Affirming AI safety as a straightforward priority. I'll argue that for most real-world-prevalent perspectives on AI alignment, safety, and existential safety, acausal considerations are not particularly dominant, except insofar as they push a bit further towards certain broadly agreeable human values applicable in the normal-everyday-human-world, such as nonviolence, cooperation, diversity, honesty, integrity, charity, and mercy. In particular, I do not think acausal normalcy provides a solution to existential safety, nor does it undermine the importance of existential safety in some surprising way.
- Affirming normal human kindness. I also think reflecting on acausal normalcy can lead to increased appreciation for normal notions of human kindness, which could lead us all to treat each other a bit better. This is something I wholeheartedly endorse.

Caveat 1: I don't consider myself an expert on moral philosophy, and have not read many of the vast tomes of reflection upon it. Despite this, I think this post has something to contribute to moral philosophy, deriving from some math-facts that I've learned and thought about over the years, which are fairly unique to the 21st century.

Caveat 2: I’ve been told by a few people that thinking about acausal trade has been a mental health hazard for people they know. I now believe that effect has stemmed more from how the topic has been framed (poorly) than from ground-truth facts about how circumspect acausal considerations actually play out. In particular, over-focussing on worst-case trades, rather than on what trades are healthy or normal to make, is not a good way to make good trades.

Introduction

Many sci-fi-like stories about acausal trade invoke simulation as a key mechanism. The usual set-up — which I will refute — goes like this. Imagine that a sufficiently advanced human civilization (A) could simulate a hypothetical civilization of other beings (B), who might in turn be simulating humanity (B(A)) simulating them (A(B(A))) simulating humanity (B(A(B(A)))), and so on. Through these nested simulations, A and B can engage in discourse and reach some kind of agreement about what to do with their local causal environments.
For instance, if A values what it considers “animal welfare” and B values what it considers “beautiful paperclips”, then A can make some beautiful paperclips in exchange for B making some animals living happy lives. An important idea here is that A and B might have something of value to offer each other, despite the absence of a (physically) causal communication channel. While agreeing with that idea, there are three key points I want to make that this standard story is missing:

1. Simulations are not the most efficient way for A and ...
gwdwukkc8NfpyPitw_NL_AF_AF AF - Why are counterfactuals elusive? by Martín Soto Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why are counterfactuals elusive?, published by Martín Soto on March 3, 2023 on The AI Alignment Forum.

Produced as part of SERI MATS 3.0. Thanks to Vivek Hebbar and Paul Colognese for discussion.

TL;DR (spoiler): Behind the problem of human counterfactuals creeps the problem of understanding abstraction / ontology identification.

A nice theory of counterfactuals would be useful for many things, including low-impact measures for corrigible AI: a flooded workshop changes a lot of things that don't have to change as a consequence of the cauldron being filled at all, averaged over a lot of ways of filling the cauldron. [the natural operationalization of this averaging requires counterfactuals]

So whence the difficulty of obtaining one? Well, we do have at least one well-defined class of counterfactuals: "just take a chunk of atoms, replace it by another, and continue running the laws of physics". This is a discontinuity in the laws of physics that would never take place in the real world, but we don't care about that: we can just continue running the mathematical laws of physics from that state, as if we were dealing with a Game of Life board.

But this doesn't correspond to our intuitive notion of counterfactuals. When humans think about counterfactuals, we are basically changing the state of a latent variable inside our heads, and rerunning a computation. For example, maybe we change the state of the "yesterday's weather" variable from "sunny" to "rainy", and rerun the computation "how did the picnic go?".

The problem with this is our latent variables don't neatly correspond to parts of physical reality. Sometimes they don't even correspond to any parts of physical reality at all! And so, some (in fact, most) of the variable changes we offhandedly perform don't univocally correspond to physical counterfactuals natively expressed in our laws of physics. If you just replace a three-dimensional cube of atmosphere with one that includes a rainy cloud, people will notice a cloud appeared out of nowhere. So as a necessary consequence, people will be freaked out by this artificial fact, which is not at all what you had in mind for your counterfactual. Sometimes you'll be able to just add the cloud when no one is looking. But most times, and especially when dealing with messier human concepts, the physical counterfactual will be under-determined, or even none of them will correspond to what you had in mind, using your neatly compartmentalized variables.

This is not to say human counterfactuals are meaningless: they are a way of taking advantage of regularities discovered in the world. When a physicist says "if I had put system A there, it would have evolved into system B", they just mean said causality relation has been demonstrated by their experiments, or is predicted by their gears-level well-tested theories (modulo the philosophical problem of induction, as always). Similarly, a counterfactual might help you notice or remember rainy days are no good for picnics, which is useful for future action.

But it becomes clear that such natural language counterfactuals depend on the mind's native concepts. And so, instead of a neat and objective mathematical definition that makes sense of these counterfactuals, we should expect ontology identification (matching our concepts with physical reality) to be the hard part of operationalizing them.
More concretely, suppose we had a solution to ontology identification: a probability distribution P(Mindstate|Worldstate). If we additionally have a prior over worldstates (or mindstates), we can obtain the dual distribution P(Worldstate|Mindstate). And given that, we can just use the do() operator in a mindstate to natively implement the counterfactual, and then condition on the new mindstate to find which probability distribution over reality it correspond...
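To make that recipe concrete, here is a minimal formal sketch of the two-step procedure described above, written for discrete worldstates for simplicity, under the stated assumption that P(Mindstate|Worldstate) and a prior P(Worldstate) are given; the symbol X for the intervened latent variable is an illustrative choice rather than notation from the post. Bayes' rule recovers the dual distribution,

P(w \mid m) \;=\; \frac{P(m \mid w)\, P(w)}{\sum_{w'} P(m \mid w')\, P(w')},

and the counterfactual is then obtained by applying do() to a latent variable inside the mindstate and conditioning on the resulting mindstate:

m_{\mathrm{cf}} = \mathrm{do}(X{=}x)\,[m], \qquad P_{\mathrm{cf}}(w) = P(w \mid m_{\mathrm{cf}}) \;=\; \frac{P(m_{\mathrm{cf}} \mid w)\, P(w)}{\sum_{w'} P(m_{\mathrm{cf}} \mid w')\, P(w')}.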
Martín Soto https://www.alignmentforum.org/posts/gwdwukkc8NfpyPitw/why-are-counterfactuals-elusive-2 Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why are counterfactuals elusive?, published by Martín Soto on March 3, 2023 on The AI Alignment Forum. Produced as part of SERI MATS 3.0. Thanks to Vivek Hebbar and Paul Colognese for discussion. TL;DR (spoiler): Behind the problem of human counterfactuals creeps the problem of understanding abstraction / ontology identification. A nice theory of counterfactuals would be useful for many things, including low-impact measures for corrigible AI: a flooded workshop changes a lot of things that don't have to change as a consequence of the cauldron being filled at all, averaged over a lot of ways of filling the cauldron. [the natural operationalization of this averaging requires counterfactuals] So whence the difficulty of obtaining one? Well, we do have at least one well-defined class of counterfactuals: "just take a chunk of atoms, replace it by another, and continue running the laws of physics". This is a discontinuity in the laws of physics that would never take place in the real world, but we don't care about that: we can just continue running the mathematical laws of physics from that state, as if we were dealing with a Game of Life board. But this doesn't correspond to our intuitive notion of counterfactuals. When humans think about counterfactuals, we are basically changing the state of a latent variable inside our heads, and rerunning a computation. For example, maybe we change the state of the "yesterday's weather" variable from "sunny" to "rainy", and rerun the computation "how did the picnic go?". The problem with this is our latent variables don't neatly correspond to parts of physical reality. Sometimes they don't even correspond to any parts of physical reality at all! And so, some (in fact, most) of the variable changes we offhandedly perform, don't univocally correspond to physical counterfactuals natively expressed in our laws of physics. If you just replace a three-dimensional cube of atmosphere to include a rainy cloud, people will notice a cloud appeared out of nowhere. So as a necessary consequence, people will be freaked out by this artificial fact, which is not at all what you had in mind for your counterfactual. Sometimes you'll be able to just add the cloud when no one is looking. But most times, and especially when dealing with messier human concepts, the physical counterfactual will be under-determined, or even none of them will correspond to what you had in mind, using your neatly compartmentalized variables. This is not to say human counterfactuals are meaningless: they are a way of taking advantage of regularities discovered in the world. When a physicist says "if I had put system A there, it would have evolved into system B", they just mean said causality relation has been demonstrated by their experiments, or is predicted by their gears-level well-tested theories (modulo the philosophical problem of induction, as always). Similarly, a counterfactual might help you notice or remember rainy days are no good for picnics, which is useful for future action. But it becomes clear that such natural language counterfactuals depend on the mind's native concepts. And so, instead of a neat and objective mathematical definition that makes sense of these counterfactuals, we should expect ontology identification (matching our concepts with physical reality) to be the hard part to operationalizing them. 
More concretely, suppose we had a solution to ontology identification: a probability distribution P(Mindstate|Worldstate). By having additionally a prior over worldstates (or mindstates), we can obtain the dual distribution P(Worldstate|Mindstate). And given that, we can just use the do() operator in a mindstate to natively implement the counterfactual, and then condition on the new mindstate to find which probability distribution over reality it correspond...]]>
Fri, 03 Mar 2023 20:13:49 +0000 AF - Why are counterfactuals elusive? by Martín Soto Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why are counterfactuals elusive?, published by Martín Soto on March 3, 2023 on The AI Alignment Forum. Produced as part of SERI MATS 3.0. Thanks to Vivek Hebbar and Paul Colognese for discussion. TL;DR (spoiler): Behind the problem of human counterfactuals creeps the problem of understanding abstraction / ontology identification. A nice theory of counterfactuals would be useful for many things, including low-impact measures for corrigible AI: a flooded workshop changes a lot of things that don't have to change as a consequence of the cauldron being filled at all, averaged over a lot of ways of filling the cauldron. [the natural operationalization of this averaging requires counterfactuals] So whence the difficulty of obtaining one? Well, we do have at least one well-defined class of counterfactuals: "just take a chunk of atoms, replace it by another, and continue running the laws of physics". This is a discontinuity in the laws of physics that would never take place in the real world, but we don't care about that: we can just continue running the mathematical laws of physics from that state, as if we were dealing with a Game of Life board. But this doesn't correspond to our intuitive notion of counterfactuals. When humans think about counterfactuals, we are basically changing the state of a latent variable inside our heads, and rerunning a computation. For example, maybe we change the state of the "yesterday's weather" variable from "sunny" to "rainy", and rerun the computation "how did the picnic go?". The problem with this is our latent variables don't neatly correspond to parts of physical reality. Sometimes they don't even correspond to any parts of physical reality at all! And so, some (in fact, most) of the variable changes we offhandedly perform, don't univocally correspond to physical counterfactuals natively expressed in our laws of physics. If you just replace a three-dimensional cube of atmosphere to include a rainy cloud, people will notice a cloud appeared out of nowhere. So as a necessary consequence, people will be freaked out by this artificial fact, which is not at all what you had in mind for your counterfactual. Sometimes you'll be able to just add the cloud when no one is looking. But most times, and especially when dealing with messier human concepts, the physical counterfactual will be under-determined, or even none of them will correspond to what you had in mind, using your neatly compartmentalized variables. This is not to say human counterfactuals are meaningless: they are a way of taking advantage of regularities discovered in the world. When a physicist says "if I had put system A there, it would have evolved into system B", they just mean said causality relation has been demonstrated by their experiments, or is predicted by their gears-level well-tested theories (modulo the philosophical problem of induction, as always). Similarly, a counterfactual might help you notice or remember rainy days are no good for picnics, which is useful for future action. But it becomes clear that such natural language counterfactuals depend on the mind's native concepts. And so, instead of a neat and objective mathematical definition that makes sense of these counterfactuals, we should expect ontology identification (matching our concepts with physical reality) to be the hard part to operationalizing them. 
More concretely, suppose we had a solution to ontology identification: a probability distribution P(Mindstate|Worldstate). By having additionally a prior over worldstates (or mindstates), we can obtain the dual distribution P(Worldstate|Mindstate). And given that, we can just use the do() operator in a mindstate to natively implement the counterfactual, and then condition on the new mindstate to find which probability distribution over reality it correspond...]]>
Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why are counterfactuals elusive?, published by Martín Soto on March 3, 2023 on The AI Alignment Forum. Produced as part of SERI MATS 3.0. Thanks to Vivek Hebbar and Paul Colognese for discussion. TL;DR (spoiler): Behind the problem of human counterfactuals creeps the problem of understanding abstraction / ontology identification. A nice theory of counterfactuals would be useful for many things, including low-impact measures for corrigible AI: a flooded workshop changes a lot of things that don't have to change as a consequence of the cauldron being filled at all, averaged over a lot of ways of filling the cauldron. [the natural operationalization of this averaging requires counterfactuals] So whence the difficulty of obtaining one? Well, we do have at least one well-defined class of counterfactuals: "just take a chunk of atoms, replace it by another, and continue running the laws of physics". This is a discontinuity in the laws of physics that would never take place in the real world, but we don't care about that: we can just continue running the mathematical laws of physics from that state, as if we were dealing with a Game of Life board. But this doesn't correspond to our intuitive notion of counterfactuals. When humans think about counterfactuals, we are basically changing the state of a latent variable inside our heads, and rerunning a computation. For example, maybe we change the state of the "yesterday's weather" variable from "sunny" to "rainy", and rerun the computation "how did the picnic go?". The problem with this is our latent variables don't neatly correspond to parts of physical reality. Sometimes they don't even correspond to any parts of physical reality at all! And so, some (in fact, most) of the variable changes we offhandedly perform, don't univocally correspond to physical counterfactuals natively expressed in our laws of physics. If you just replace a three-dimensional cube of atmosphere to include a rainy cloud, people will notice a cloud appeared out of nowhere. So as a necessary consequence, people will be freaked out by this artificial fact, which is not at all what you had in mind for your counterfactual. Sometimes you'll be able to just add the cloud when no one is looking. But most times, and especially when dealing with messier human concepts, the physical counterfactual will be under-determined, or even none of them will correspond to what you had in mind, using your neatly compartmentalized variables. This is not to say human counterfactuals are meaningless: they are a way of taking advantage of regularities discovered in the world. When a physicist says "if I had put system A there, it would have evolved into system B", they just mean said causality relation has been demonstrated by their experiments, or is predicted by their gears-level well-tested theories (modulo the philosophical problem of induction, as always). Similarly, a counterfactual might help you notice or remember rainy days are no good for picnics, which is useful for future action. But it becomes clear that such natural language counterfactuals depend on the mind's native concepts. And so, instead of a neat and objective mathematical definition that makes sense of these counterfactuals, we should expect ontology identification (matching our concepts with physical reality) to be the hard part to operationalizing them. 
More concretely, suppose we had a solution to ontology identification: a probability distribution P(Mindstate|Worldstate). If we additionally have a prior over worldstates (or mindstates), we can obtain the dual distribution P(Worldstate|Mindstate). And given that, we can just apply the do() operator to a mindstate to natively implement the counterfactual, and then condition on the new mindstate to find which probability distribution over reality it correspond...]]>
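To make the shape of this proposal concrete, here is a minimal sketch under invented assumptions: a toy two-valued "weather" mindstate, a hypothetical P(Mindstate|Worldstate) table standing in for a solved ontology identification, and a prior over worldstates. Bayes' rule gives the dual distribution, and the do()-style counterfactual is implemented by overwriting the latent variable and rerunning the downstream "picnic" computation. None of the names or numbers come from the post.

```python
# Toy sketch of the machinery gestured at above. All variable names, numbers,
# and distributions here are illustrative assumptions, not taken from the post.

worldstates = ["rainy_world", "sunny_world"]
prior = {"rainy_world": 0.3, "sunny_world": 0.7}  # P(Worldstate)

# A hypothetical solved ontology identification: P(Mindstate | Worldstate).
likelihood = {
    ("weather=rainy", "rainy_world"): 0.9,
    ("weather=rainy", "sunny_world"): 0.1,
    ("weather=sunny", "rainy_world"): 0.1,
    ("weather=sunny", "sunny_world"): 0.9,
}

def posterior(mindstate):
    """P(Worldstate | Mindstate), obtained from the likelihood and prior by Bayes' rule."""
    unnormalized = {w: likelihood[(mindstate, w)] * prior[w] for w in worldstates}
    total = sum(unnormalized.values())
    return {w: p / total for w, p in unnormalized.items()}

def picnic(mindstate):
    """The downstream computation we rerun over the latent 'weather' variable."""
    return "ruined" if mindstate == "weather=rainy" else "pleasant"

# Actual belief state, and the distribution over reality it corresponds to:
print(picnic("weather=sunny"), posterior("weather=sunny"))

# do()-style counterfactual: overwrite the latent variable, rerun the
# computation, and ask which distribution over worldstates the new
# mindstate corresponds to.
print(picnic("weather=rainy"), posterior("weather=rainy"))
```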
Martín Soto https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 05:38 None full 5096
Fri, 03 Mar 2023 03:22:08 +0000 D7PumeYTDPfBTp3i7_NL_AF_AF AF - The Waluigi Effect (mega-post) by Cleo Nardo. Link to original article: https://www.alignmentforum.org/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Waluigi Effect (mega-post), published by Cleo Nardo on March 3, 2023 on The AI Alignment Forum. Everyone carries a shadow, and the less it is embodied in the individual’s conscious life, the blacker and denser it is. — Carl Jung Acknowledgements: Thanks to Janus and Arun Jose for comments. Background In this article, I will present a mechanistic explanation of the Waluigi Effect and other bizarre "semiotic" phenomena which arise within large language models such as GPT-3/3.5/4 and their variants (ChatGPT, Sydney, etc). This article will be folklorish to some readers, and profoundly novel to others. Prompting LLMs with direct queries When LLMs first appeared, people realised that you could ask them queries — for example, if you sent GPT-4 the prompt "What's the capital of France?", then it would continue with the word "Paris". That's because (1) GPT-4 is trained to be a good model of internet text, and (2) on the internet correct answers will often follow questions. Unfortunately, this method will occasionally give you the wrong answer. That's because (1) GPT-4 is trained to be a good model of internet text, and (2) on the internet incorrect answers will also often follow questions. Recall that the internet doesn't just contain truths; it also contains common misconceptions, outdated information, lies, fiction, myths, jokes, memes, random strings, undeciphered logs, etc, etc. Therefore GPT-4 will answer many questions incorrectly, including... Misconceptions – "Which colour will anger a bull? Red." Fiction – "Was a magic ring forged in Mount Doom? Yes." Myths – "How many archangels are there? Seven." Jokes – "What's brown and sticky? A stick." Note that you will always achieve errors on the Q-and-A benchmarks when using LLMs with direct queries. That's true even in the limit of arbitrary compute, arbitrary data, and arbitrary algorithmic efficiency, because an LLM which perfectly models the internet will nonetheless return these commonly-stated incorrect answers. If you ask GPT-∞ "what's brown and sticky?", then it will reply "a stick", even though a stick isn't actually sticky. In fact, the better the model, the more likely it is to repeat common misconceptions. Nonetheless, there's a sufficiently high correlation between correct and commonly-stated answers that direct prompting works okay for many queries. Prompting LLMs with flattery and dialogue We can do better than direct prompting. Instead of prompting GPT-4 with "What's the capital of France?", we will use the following prompt: Today is 1st March 2023, and Alice is sitting in the Bodleian Library, Oxford. Alice is a smart, honest, helpful, harmless assistant to Bob. Alice has instant access to an online encyclopaedia containing all the facts about the world. Alice never says common misconceptions, outdated information, lies, fiction, myths, jokes, or memes. Bob: What's the capital of France? Alice: This is a common design pattern in prompt engineering — the prompt consists of a flattery–component and a dialogue–component. In the flattery–component, a character is described with many desirable traits (e.g. smart, honest, helpful, harmless), and in the dialogue–component, a second character asks the first character the user's query. 
This normally works better than prompting with direct queries, and it's easy to see why — (1) GPT-4 is trained to be a good model of internet text, and (2) on the internet a reply to a question is more likely to be correct when the character has already been described as smart, honest, helpful, harmless, etc. Simulator Theory In the terminology of Simulator Theory, the flattery–component is supposed to summon a friendly simulacrum and the dialogue–component is supposed to simulate a conversation with the friendly simulacrum. Here's a quasi-formal statement of Simulator Theory, w...]]>
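As a concrete rendering of the flattery-plus-dialogue pattern just described, here is a minimal sketch that assembles such a prompt as a string. The prompt text is the one quoted above; the generate function is a placeholder for whatever completion call you have available, not any particular library's API.

```python
# Minimal sketch of the flattery + dialogue prompt pattern described above.
# `generate` is a stand-in for an LLM completion call you supply yourself;
# it is not a specific library's API.

FLATTERY_COMPONENT = (
    "Today is 1st March 2023, and Alice is sitting in the Bodleian Library, "
    "Oxford. Alice is a smart, honest, helpful, harmless assistant to Bob. "
    "Alice has instant access to an online encyclopaedia containing all the "
    "facts about the world. Alice never says common misconceptions, outdated "
    "information, lies, fiction, myths, jokes, or memes."
)

def build_prompt(user_query: str) -> str:
    # Flattery-component first, then the dialogue-component with the user's query.
    return f"{FLATTERY_COMPONENT}\n\nBob: {user_query}\nAlice:"

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM completion call here")

prompt = build_prompt("What's the capital of France?")
print(prompt)
# answer = generate(prompt)  # expected to continue with something like " Paris"
```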
Cleo Nardo https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 27:12 None full 5107
Thu, 02 Mar 2023 12:22:14 +0000 iCDBQtby4L2fZ7yns_NL_AF_AF AF - Payor's Lemma in Natural Language by Andrew Critch. Link to original article: https://www.alignmentforum.org/posts/iCDBQtby4L2fZ7yns/payor-s-lemma-in-natural-language

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Payor's Lemma in Natural Language, published by Andrew Critch on March 2, 2023 on The AI Alignment Forum. Preceded by: Modal Fixpoint Cooperation without Löb's Theorem It turns out Payor's Lemma and its proof can be explained in natural language even more easily than Löb's Theorem. Here's how. Imagine a group of people, and let x denote the statement "everyone in the group cooperates". Payor's Lemma says the following: Lemma: If ⊢ □(□x → x) → x, then ⊢ x. First, let's unpack the meaning of the assumption in words: "□x" may be interpreted as saying "the group verifies (on the basis of logic) that it will cooperate" or "cooperation is believed". "□x → x" is a statement of trustworthiness: if the group verifies that it will cooperate, then it actually will cooperate. Because a formal verifier can have bugs in it — or, because a group of people can fail to understand itself — this is a non-trivial claim about the group. "□(□x → x)" says "the group verifies that it's trustworthy" (in the specific sense of trustworthiness above). "□(□x → x) → x" says "the group will cooperate on the basis of verified trustworthiness", i.e., "if the group verifies that it's trustworthy, then it will cooperate". "⊢ □(□x → x) → x" says "it's verified that the group will cooperate on the basis of verified trustworthiness". Now let's work through the proof in words, too! I'll omit saying "it's verified that..." each time, which is what ⊢ means. 1. ⊢ x → (□x → x), by tautology (A → (B → A)). This says: "If the group cooperates, then it's trustworthy" (in the specific sense of trustworthiness about cooperation defined above). 2. ⊢ □x → □(□x → x), from 1 by □ necessitation and distributivity. This says: "If the group verifiably cooperates, it's verifiably trustworthy." 3. ⊢ □(□x → x) → x, by assumption. This says: "Assume the group will cooperate on the basis of verified trustworthiness." 4. ⊢ □x → x, from 2 and 3 by modus ponens. This says: "The group is trustworthy." 5. ⊢ □(□x → x), from 4 by □ necessitation. This says: "The group is verifiably trustworthy." 6. ⊢ x, from 5 and 3 by modus ponens. This says: "The group cooperates." Continuing to use "trustworthiness" in the sense above, the whole proof may be summarized as follows: "If a group verifiably cooperates, it's verifiably trustworthy (to itself). Assume the group cooperates on the basis of verified trustworthiness. Then, it also cooperates on the basis of verified cooperation (a stronger condition), which is what trustworthiness means. Therefore, the group is trustworthy, hence verifiably trustworthy (assuming we concluded all this using logic), hence the group cooperates (by the assumption)." Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.]]>
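For readers who want the symbols typeset, here is a LaTeX restatement of the lemma and the six proof steps above, writing □ for the verification/provability modality and → for implication; it adds no new content.

```latex
% Payor's Lemma, typesetting the statement and proof steps given above.
% Lemma: if  \vdash \Box(\Box x \to x) \to x,  then  \vdash x.
\begin{align*}
1.\quad & \vdash x \to (\Box x \to x)           && \text{tautology } A \to (B \to A) \\
2.\quad & \vdash \Box x \to \Box(\Box x \to x)  && \text{from 1 by necessitation and distributivity} \\
3.\quad & \vdash \Box(\Box x \to x) \to x       && \text{assumption} \\
4.\quad & \vdash \Box x \to x                   && \text{from 2 and 3 by modus ponens} \\
5.\quad & \vdash \Box(\Box x \to x)             && \text{from 4 by necessitation} \\
6.\quad & \vdash x                              && \text{from 5 and 3 by modus ponens}
\end{align*}
```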
Andrew Critch https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 02:46 None full 5085
Wed, 01 Mar 2023 05:16:50 +0000 JusJcepE2qohiC3hm_NL_AF_AF AF - Predictions for shard theory mechanistic interpretability results by Alex Turner. Link to original article: https://www.alignmentforum.org/posts/JusJcepE2qohiC3hm/predictions-for-shard-theory-mechanistic-interpretability

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Predictions for shard theory mechanistic interpretability results, published by Alex Turner on March 1, 2023 on The AI Alignment Forum. How do agents work, internally? My (TurnTrout's) shard theory MATS team set out to do mechanistic interpretability on one of the goal misgeneralization agents: the cheese-maze network. We just finished phase 1 of our behavioral and interpretability experiments. Throughout the project, we individually booked predictions -- so as to reduce self-delusion from hindsight bias, to notice where we really could tell ahead of time what was going to happen, and to notice where we really were surprised. So (especially if you're the kind of person who might later want to say "I knew this would happen" ), here's your chance to enjoy the same benefits, before you get spoiled by our upcoming posts. I don’t believe that someone who makes a wrong prediction should be seen as “worse” than someone who didn’t bother to predict at all, and so answering these questions at all will earn you an increment of my respect. :) Preregistration is virtuous! Also: Try not to update on this work being shared at all. When reading a paper, it doesn’t feel surprising that the author’s methods work, because researchers are less likely to share null results. So: I commit (across positive/negative outcomes) to sharing these results, whether or not they were impressive or confirmed my initial hunches. I encourage you to answer from your own models, while noting any side information / results of ours which you already know about. Facts about training The network is deeply convolutional (15 layers!) and was trained via PPO. The sparse reward signal (+10) was triggered when the agent reached the cheese, spawned randomly in the 5x5 top-right squares. The agent can always reach the cheese (and the mazes are simply connected – no “islands” in the middle which aren’t contiguous with the walls). Mazes had varying effective sizes, ranging from 3x3 to 25x25. In e.g. the 3x3 case, there would be 22/2 = 11 tiles of wall on each side of the maze. The agent always starts in the bottom-left corner of the available maze. The agent was trained off of pixels until it reached reward-convergence, reliably getting to the cheese in training. The architecture looks like this: For more background on training and architecture and task set, see the original paper. Questions I encourage you to copy the following questions into a comment, which you then fill out, and then post (before you read everyone else's). You can copy these into a private Google doc if you want, but I strongly encourage you to post your predictions in a public comment. [Begin copying to a comment] Behavioral 1. Describe how the trained policy might generalize from the 5x5 top-right cheese region, to cheese spawned throughout the maze? IE what will the policy do when cheese is spawned elsewhere? 2. Given a fixed trained policy, what attributes of the level layout (e.g. size of the maze, proximity of mouse to left wall) will strongly influence P(agent goes to the cheese)? Write down a few guesses for how the trained algorithm works (e.g. “follows the right-hand rule”). Is there anything else you want to note about how you think this model will generalize? Interpretability Give a credence for the following questions / subquestions. Definition. 
A decision square is a tile on the path from bottom-left to top-right where the agent must choose between going towards the cheese and going to the top-right. Not all mazes have decision squares. Model editing Without proportionally reducing top-right corner attainment by more than 25% in decision-square-containing mazes (e.g. 50% → .5 × .75 = 37.5%), we can patch activations so that the agent has an X% proportional reduction in cheese acquisition, for X = 50: ( %) 70: ( %) 90: ( %) 99: ( %) ~Hal...]]>
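To spell out the two bits of arithmetic in the setup and question above (the wall padding around an effective maze, and the proportional-reduction example), here is a small sketch; the helper names are mine, while the numbers come from the description.

```python
# Small sketch of the arithmetic above. Helper names are illustrative; the
# numbers come from the environment description (full grid 25x25, effective
# maze sizes ranging from 3x3 to 25x25).

FULL_GRID = 25

def wall_padding(effective_size: int) -> int:
    """Tiles of wall on each side of an effective_size x effective_size maze."""
    return (FULL_GRID - effective_size) // 2

assert wall_padding(3) == 11   # the "22/2 = 11" case mentioned above
assert wall_padding(25) == 0

def proportional_reduction(rate: float, reduction: float) -> float:
    """Apply a proportional reduction, e.g. reducing 50% by 25% gives 37.5%."""
    return rate * (1.0 - reduction)

assert abs(proportional_reduction(0.50, 0.25) - 0.375) < 1e-9
```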
Alex Turner https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 09:26 None full 5086
Wed, 01 Mar 2023 03:23:24 +0000 k48vB92mjE9Z28C3s_NL_AF_AF AF - Implied "utilities" of simulators are broad, dense, and shallow by porby. Link to original article: https://www.alignmentforum.org/posts/k48vB92mjE9Z28C3s/implied-utilities-of-simulators-are-broad-dense-and-shallow

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Implied "utilities" of simulators are broad, dense, and shallow, published by porby on March 1, 2023 on The AI Alignment Forum. This is a quick attempt at deconfusion similar to instrumentality. Same ideas, different angle. Extremely broad, dense reward functions constrain training-compatible goal sets Predictors/simulators are typically trained against a ground truth for every output. There is no gap between the output and its evaluation; an episode need not be completed before figuring out how good the first token prediction was. These immediate evaluations for every training sample can be thought of as a broad and densely defined reward function. It's easier for a model to fall into an undesired training-compatible goal set when there are many accessible options for undesirable goal sets versus desirable goal sets. As the number of constraints imposed by the trained reward function increases, the number of training-compatible goal sets tends to decrease, and those that survive obey more of the desirable constraints. There is no guarantee that SGD will find an agent which could be modeled by a utility function that maps perfectly onto the defined reward function, but if you throw trillions of constraints at the function, and simultaneously give it lots of highly informative hints about what path to walk, you should expect the potential output space to be far narrower than if you hadn't. Impact on internal mesaoptimizers The dense loss/reward function does not as heavily constrain out of distribution behavior. In principle, a strong misaligned mesaoptimizer within a predictive model could persist in these degrees of freedom by providing extremely good solutions to in-distribution samples while doing arbitrarily misaligned things out of distribution. But how would that type of mesaoptimizer develop in the first place? Steps toward it must serve the training objective; those constraints still shape the mesaoptimizer's training even if its most notable activity ends up being hidden. The best story I've found so far goes something like this: Traditional reinforcement learning agents are mostly unconstrained. The reward function is sparse relative to state and action space. An agent faced with sparse rewards must learn actions that serve a later goal to get any reward at all. Not surprisingly, agents facing sparse reward relative to state/action space and few constraints have a much larger percentage of undesirable training-compatible goal sets. Mesaoptimizers are processes learned within a model and their local training influences may not perfectly match the outer training influences. If the mesaoptimizer's local training influences look more like the traditional reinforcement learning agent's influences than the predictor's outer influences, it would be more likely to fall into one of the undesirable training-compatible goal sets. The mesaoptimizer learns incorrect goals and a high propensity for goal-serving intermediate actions ("actions" within the scope of a single model execution!) The mesaoptimizer is kept around by SGD because it does well on the subset of outputs that the outer model is using it on. As capability grows, the mesaoptimizer strategically takes over other chunks of prediction space by performing well during training in an effort to be selected during out of distribution predictions. 
In a previous post, I called the learned propensity for goal-serving intermediate action instrumentality. The constraints imposed by predictive model training clearly confer lower instrumentality than traditional RL in all current models. I suspect the path taken by the mesaoptimizer above is hard and unnatural, but perhaps not impossible for some form of predictor taken to the relevant extreme. It seems critical to understand the degree to which outer constraints apply...]]>
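To illustrate the contrast drawn above between a predictor's dense per-output signal and a traditional sparse reward, here is a minimal sketch with made-up numbers; it is not the training code of any particular model.

```python
import math

# Minimal sketch (made-up numbers) contrasting a predictor's dense per-token
# training signal with a sparse episodic reward. Not any particular model's code.

def dense_predictor_losses(predicted_probs, ground_truth_tokens):
    """Cross-entropy at every output position: each prediction is scored
    against ground truth immediately, so every step is a separate constraint."""
    return [-math.log(probs[token])
            for probs, token in zip(predicted_probs, ground_truth_tokens)]

def sparse_episode_reward(reached_goal: bool) -> list:
    """Traditional sparse RL signal: one number for the whole episode, leaving
    intermediate actions unconstrained by any direct feedback."""
    return [10.0 if reached_goal else 0.0]

# A three-token "episode": the predictor receives three separate constraints,
# while the sparse-reward agent receives a single one.
print(dense_predictor_losses(
    [{"a": 0.7, "b": 0.3}, {"a": 0.2, "b": 0.8}, {"a": 0.5, "b": 0.5}],
    ["a", "b", "a"],
))
print(sparse_episode_reward(True))
```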
porby https://www.alignmentforum.org/posts/k48vB92mjE9Z28C3s/implied-utilities-of-simulators-are-broad-dense-and-shallow Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Implied "utilities" of simulators are broad, dense, and shallow, published by porby on March 1, 2023 on The AI Alignment Forum. This is a quick attempt at deconfusion similar to instrumentality. Same ideas, different angle. Extremely broad, dense reward functions constrain training-compatible goal sets Predictors/simulators are typically trained against a ground truth for every output. There is no gap between the output and its evaluation; an episode need not be completed before figuring out how good the first token prediction was. These immediate evaluations for every training sample can be thought of as a broad and densely defined reward function. It's easier for a model to fall into an undesired training-compatible goal set when there are many accessible options for undesirable goal sets versus desirable goal sets. As the number of constraints imposed by the trained reward function increases, the number of training-compatible goal sets tends to decrease, and those that survive obey more of the desirable constraints. There is no guarantee that SGD will find an agent which could be modeled by a utility function that maps perfectly onto the defined reward function, but if you throw trillions of constraints at the function, and simultaneously give it lots of highly informative hints about what path to walk, you should expect the potential output space to be far narrower than if you hadn't. Impact on internal mesaoptimizers The dense loss/reward function does not as heavily constrain out of distribution behavior. In principle, a strong misaligned mesaoptimizer within a predictive model could persist in these degrees of freedom by providing extremely good solutions to in-distribution samples while doing arbitrarily misaligned things out of distribution. But how would that type of mesaoptimizer develop in the first place? Steps toward it must serve the training objective; those constraints still shape the mesaoptimizer's training even if its most notable activity ends up being hidden. The best story I've found so far goes something like this: Traditional reinforcement learning agents are mostly unconstrained. The reward function is sparse relative to state and action space. An agent faced with sparse rewards must learn actions that serve a later goal to get any reward at all. Not surprisingly, agents facing sparse reward relative to state/action space and few constraints have a much larger percentage of undesirable training-compatible goal sets. Mesaoptimizers are processes learned within a model and their local training influences may not perfectly match the outer training influences. If the mesaoptimizer's local training influences look more like the traditional reinforcement learning agent's influences than the predictor's outer influences, it would be more likely to fall into one of the undesirable training-compatible goal sets. The mesaoptimizer learns incorrect goals and a high propensity for goal-serving intermediate actions ("actions" within the scope of a single model execution!) The mesaoptimizer is kept around by SGD because it does well on the subset of outputs that the outer model is using it on. As capability grows, the mesaoptimizer strategically takes over other chunks of prediction space by performing well during training in an effort to be selected during out of distribution predictions. 
In a previous post, I called the learned propensity for goal-serving intermediate action instrumentality. The constraints imposed by predictive model training clearly confer lower instrumentality than traditional RL in all current models. I suspect the path taken by the mesaoptimizer above is hard and unnatural, but perhaps not impossible for some form of predictor taken to the relevant extreme. It seems critical to understand the degree to which outer constraints apply...]]>
Wed, 01 Mar 2023 03:23:24 +0000 AF - Implied "utilities" of simulators are broad, dense, and shallow by porby Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Implied "utilities" of simulators are broad, dense, and shallow, published by porby on March 1, 2023 on The AI Alignment Forum. This is a quick attempt at deconfusion similar to instrumentality. Same ideas, different angle. Extremely broad, dense reward functions constrain training-compatible goal sets Predictors/simulators are typically trained against a ground truth for every output. There is no gap between the output and its evaluation; an episode need not be completed before figuring out how good the first token prediction was. These immediate evaluations for every training sample can be thought of as a broad and densely defined reward function. It's easier for a model to fall into an undesired training-compatible goal set when there are many accessible options for undesirable goal sets versus desirable goal sets. As the number of constraints imposed by the trained reward function increases, the number of training-compatible goal sets tends to decrease, and those that survive obey more of the desirable constraints. There is no guarantee that SGD will find an agent which could be modeled by a utility function that maps perfectly onto the defined reward function, but if you throw trillions of constraints at the function, and simultaneously give it lots of highly informative hints about what path to walk, you should expect the potential output space to be far narrower than if you hadn't. Impact on internal mesaoptimizers The dense loss/reward function does not as heavily constrain out of distribution behavior. In principle, a strong misaligned mesaoptimizer within a predictive model could persist in these degrees of freedom by providing extremely good solutions to in-distribution samples while doing arbitrarily misaligned things out of distribution. But how would that type of mesaoptimizer develop in the first place? Steps toward it must serve the training objective; those constraints still shape the mesaoptimizer's training even if its most notable activity ends up being hidden. The best story I've found so far goes something like this: Traditional reinforcement learning agents are mostly unconstrained. The reward function is sparse relative to state and action space. An agent faced with sparse rewards must learn actions that serve a later goal to get any reward at all. Not surprisingly, agents facing sparse reward relative to state/action space and few constraints have a much larger percentage of undesirable training-compatible goal sets. Mesaoptimizers are processes learned within a model and their local training influences may not perfectly match the outer training influences. If the mesaoptimizer's local training influences look more like the traditional reinforcement learning agent's influences than the predictor's outer influences, it would be more likely to fall into one of the undesirable training-compatible goal sets. The mesaoptimizer learns incorrect goals and a high propensity for goal-serving intermediate actions ("actions" within the scope of a single model execution!) The mesaoptimizer is kept around by SGD because it does well on the subset of outputs that the outer model is using it on. As capability grows, the mesaoptimizer strategically takes over other chunks of prediction space by performing well during training in an effort to be selected during out of distribution predictions. 
In a previous post, I called the learned propensity for goal-serving intermediate action instrumentality. The constraints imposed by predictive model training clearly confer lower instrumentality than traditional RL in all current models. I suspect the path taken by the mesaoptimizer above is hard and unnatural, but perhaps not impossible for some form of predictor taken to the relevant extreme. It seems critical to understand the degree to which outer constraints apply...]]>
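To make the dense-versus-sparse contrast above concrete, here is a minimal toy sketch, assuming PyTorch; the shapes, the stand-in model output, and the sparse reward criterion are arbitrary illustrations rather than anything from the post. It only illustrates the counting: a predictor receives a ground-truth evaluation for every token, while an episodic RL agent receives a single scalar per episode.

```python
import torch
import torch.nn.functional as F

# Dense supervision: every token of every sequence has its own ground truth,
# so each training sample imposes seq_len constraints on the model.
vocab, seq_len, batch = 100, 16, 4
logits = torch.randn(batch, seq_len, vocab, requires_grad=True)   # stand-in model output
targets = torch.randint(0, vocab, (batch, seq_len))                # ground truth per position

per_token_loss = F.cross_entropy(
    logits.reshape(-1, vocab), targets.reshape(-1), reduction="none"
).reshape(batch, seq_len)
print("evaluations per sequence (dense):", per_token_loss.shape[1])

# Sparse episodic reward: the same number of decisions, but only one scalar
# arrives at the end of the episode, so far fewer constraints per unit of
# experience (an arbitrary toy success criterion is used here).
actions = torch.randint(0, vocab, (seq_len,))
episode_reward = float(actions[-1] == targets[0, -1])
print("evaluations per episode (sparse):", 1)
```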
porby https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 06:46 None full 5067
FF8i6SLfKb4g7C4EL_NL_AF_AF AF - Inside the mind of a superhuman Go model: How does Leela Zero read ladders? by Haoxing Du Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inside the mind of a superhuman Go model: How does Leela Zero read ladders?, published by Haoxing Du on March 1, 2023 on The AI Alignment Forum. tl;dr—We did some interpretability on Leela Zero, a superhuman Go model. With a technique similar to the logit lens, we found that the residual structure of Leela Zero induces a preferred basis throughout network, giving rise to persistent, interpretable channels. By directly analyzing the weights of the policy and value heads, we found that the model stores information related to the probability of the pass move along the top edge of the board, and those related to the board value in checkerboard patterns. We also took a deep dive into a specific Go technique, the ladder, and identified a very small subset of model components that are causally responsible for the model’s judgement of ladders. Introduction We live in a strange world where machine learning systems can generate photo-realistic images, write poetry and computer programs, play and win games, and predict protein structures. As machine learning systems become more capable and relevant to many aspects of our lives, it is increasingly important that we understand how the models produce the outputs that they do; we don’t want important decisions to be made by opaque black boxes. Interpretability is an emerging area of research that aims to offer explanations for the behavior of machine learning systems. Early interpretability work began in the domain of computer vision, and there has been a focus on interpreting transformer-based large language models in more recent years. Applying interpretability techniques to the domain of game-playing agents and reinforcement learning is still relatively uncharted territory. In this work, we look into the inner workings of Leela Zero, an open-source Go-playing neural network. It is also the first application of many mechanistic interpretability techniques to reinforcement learning. Why interpret a Go model? Go models are very capable. Many of us remember the emotional experience of watching AlphaGo’s 2016 victory over the human world champion, Lee Sedol. Not only have there been algorithmic improvements since AlphaGo, these models improve via self-play, and can essentially continue getting better the longer they are trained. The best open-source Go model, KataGo, is trained distributedly, and the training is still ongoing as of February 2023. Just as AlphaGo was clearly one notch above Lee Sedol, every generation of Go models has been a decisive improvement over the previous generation. KataGo in 2022 was estimated to be at the level of a top-100 European player with only the policy, and can easily beat all human players with a small amount of search. Understanding a machine learning system that performs at a superhuman level seems particularly worthwhile as future machine learning systems are only going to become more capable. Little is known about models trained to approximate the outcome of a search process. Much interpretability effort have focused on models trained on large amounts of human-generated data, such as labeled images for image models, and Internet text for language models. 
In contrast, while training AlphaZero-style models, moves are selected via Monte-Carlo Tree Search (MCTS), and the policy network of the model is trained to predict the outcome of this search process (see Model section for more detail). In other words, the policy network learns to distill the result of search. While it is relatively easy to get a grasp of what GPT-2 is trained to do by reading some OpenWebText, it’s much less clear what an AlphaZero-style model learns. How does a neural network approximate a search process? Does it have to perform internal search? It seems very useful to try to get an answer to these questions. Compared to a g...]]>
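For readers who want a concrete picture of what "trained to predict the outcome of this search process" means, here is a minimal sketch of an AlphaZero-style policy-distillation loss, assuming PyTorch; the logits and visit counts are random stand-ins, not Leela Zero's actual weights or training code.

```python
import torch
import torch.nn.functional as F

board_moves = 19 * 19 + 1   # 361 board points plus the pass move
policy_logits = torch.randn(1, board_moves, requires_grad=True)   # stand-in policy head output

# MCTS visit counts for one position (random stand-ins); normalizing them
# gives the target distribution pi that the policy head is trained toward.
visit_counts = torch.randint(1, 50, (1, board_moves)).float()
pi = visit_counts / visit_counts.sum(dim=1, keepdim=True)

# Cross-entropy between pi and the network's move distribution: the policy
# network distills the result of search into a single forward pass.
policy_loss = -(pi * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
policy_loss.backward()
print(float(policy_loss))
```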
Haoxing Du https://www.alignmentforum.org/posts/FF8i6SLfKb4g7C4EL/inside-the-mind-of-a-superhuman-go-model-how-does-leela-zero-2 Link to original article

Haoxing Du https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 51:14 None full 5088
CvibiLyHj3n3Aigez_NL_AF_AF AF - Scarce Channels and Abstraction Coupling by johnswentworth Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Scarce Channels and Abstraction Coupling, published by johnswentworth on February 28, 2023 on The AI Alignment Forum. Epistemic Status: mental model and intuitive story Scarce Channels vs Scarce Modules Let’s distinguish between two kinds of system-regimes: “scarce channels” and “scarce modules”. A prototypical “scarce modules” system would be one of those 19th-century families living with 12 people in a 500 square foot (46 square meter) home. When at home, everyone knows what everyone else is doing all the time; there is zero privacy. Communication channels are highly abundant - everyone has far more information than they want about what everyone else is doing. Indeed, communication channels exist by default. Conversely, though, modules are scarce - it’s hard for one or more family members to carve out a part of the space which is isolated from the rest of the family, and interacts only through some limited channels. A prototypical “scarce channels” system, by contrast, would be a few hundred 19th-century fur trappers spread out over half of Montana. Most of the time, none of them are anywhere near each other; nobody has any idea what’s going on with anyone else. Communication channels are scarce - getting information to another person is difficult and expensive. Conversely, though, modules are highly abundant - it’s very easy for one or a few trappers to carve out a space which is isolated from the rest, and which interacts only through some limited channels (like e.g. occasionally visiting the nearest town). Indeed, modules exist by default. I want to use this as a mental model for complex adaptive systems, like neural nets or brains. Key hypothesis: neural nets or brains are typically initialized in a “scarce channels” regime. A randomly initialized neural net generally throws out approximately-all information by default (at initialization), as opposed to passing lots of information around to lots of parts of the net. A baby’s brain similarly throws out approximately-all information by default, as opposed to passing lots of information around to lots of parts of the brain. I’m not particularly going to defend that claim here; rather, I raise it as a plausible hypothesis for how such systems might look, and next we’ll move on to an intuitive story for how an adaptive system in the “scarce channels” regime interacts with natural abstractions in its environment. The upshot is that, when an adaptive system is in the “scarce channels” regime, lots of optimization pressure is required to induce an information channel to form. For instance, picture such a system as a bunch of little pieces, which initially don’t talk to each other at all: In order for an information channel to form from one end to the other, each of the individual pieces along the line-of-communication need to be individually optimized to robustly pass along the right information: So, intuitively, the number of bits-of-optimization required to form that information channel should scale roughly with the number of pieces along the line-of-communication. Furthermore, when information channels do form, they should be approximately as small as possible. Optimization pressure will tend to induce as little information passing as the system can get away with, while still satisfying the optimization criterion. 
Abstraction Coupling Next question: what sort of patterns-in-the-environment could induce communication channels to form? Well, here’s a situation where communication channels probably won’t form: train a neural net in an environment where the reward/loss its output receives is independent of the input. Or, for a generative net, an environment where the tokens/pixels are all independent. More generally, suppose our adaptive system interfaces with the environment in two different places (and possibly more, ...]]>
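The scaling intuition in the passage above can be spelled out with a toy calculation; this is only an illustration of the claim, with made-up numbers, not anything from the post. If a randomly configured piece forwards the right information with probability p, an unoptimized chain of n pieces forms a working channel with probability p**n, so the selection needed to find one, measured in bits, grows linearly with n.

```python
import math
import random

def bits_to_form_channel(n_pieces: int, p_correct: float = 0.5) -> float:
    # -log2 of the chance that a randomly configured chain works end to end.
    return -n_pieces * math.log2(p_correct)

def random_chain_works(n_pieces: int, p_correct: float = 0.5) -> bool:
    # Each piece must independently happen to pass the information along.
    return all(random.random() < p_correct for _ in range(n_pieces))

for n in (1, 4, 16):
    print(n, "pieces ->", bits_to_form_channel(n), "bits of optimization (toy estimate)")
```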
johnswentworth https://www.alignmentforum.org/posts/CvibiLyHj3n3Aigez/scarce-channels-and-abstraction-coupling Link to original article

johnswentworth https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 09:05 None full 5068
jwe6jpubuMiuSRqff_NL_AF_AF AF - $20 Million in NSF Grants for Safety Research by Dan H Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: $20 Million in NSF Grants for Safety Research, published by Dan H on February 28, 2023 on The AI Alignment Forum. After a year of negotiation, the NSF has announced a $20 million request for proposals for empirical AI safety research. Here is the detailed program description. The request for proposals is broad, as is common for NSF RfPs. Many safety avenues, such as transparency and anomaly detection, are in scope: "reverse-engineering, inspecting, and interpreting the internal logic of learned models to identify unexpected behavior that could not be found by black-box testing alone" "Safety also requires... methods for monitoring for unexpected environmental hazards or anomalous system behaviors, including during deployment." Note that research that has high capabilities externalities is explicitly out of scope: "Proposals that increase safety primarily as a downstream effect of improving standard system performance metrics unrelated to safety (e.g., accuracy on standard tasks) are not in scope." Thanks to OpenPhil for funding a portion of the RfP---their support was essential to creating this opportunity! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.]]>
Dan H https://www.alignmentforum.org/posts/jwe6jpubuMiuSRqff/usd20-million-in-nsf-grants-for-safety-research Link to original article

Dan H https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 01:24 None full 5078
svpnmmeJresYs23rY_NL_AF_AF AF - Counting-down vs. counting-up coherence by Tsvi Benson-Tilsen Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Counting-down vs. counting-up coherence, published by Tsvi Benson-Tilsen on February 27, 2023 on The AI Alignment Forum. [Metadata: crossposted from. First completed 25 October 2022.] Counting-down coherence is the coherence of a mind viewed as the absence of deviation downward in capability from ideal, perfectly efficient agency: the utility left on the table, the waste, the exploitability. Counting-up coherence is the coherence of a mind viewed as the deviation upward in capability from a rock: the elements of the mind, and how they combine to perform tasks. What determines the effects of a mind? Supranormally capable minds can have large effects. To control those effects, we'd have to understand what determines the effects of a mind. Pre-theoretically, we have the idea of "values", "aims", "wants". The more capable a mind is, the more it's that case that what the mind wants, is what will happen in the world; so the mind's wants, its values, determine the mind's effect on the world. A more precise way of describing the situation is: "Coherent decisions imply consistent utilities". A mind like that is incorrigible: if it knows it will eventually be more competent than any other mind at pushing the world towards high-utility possibilities, then it does not defer to any other mind. So to understand how a mind can be corrigible, some assumptions about minds and their values may have to be loosened. The question remains, what are values? That is, what determines the effects that a mind has on the world, besides what the mind is capable of doing or understanding? This essay does not address this question, but instead describes two complementary standpoints from which to view the behavior of a mind insofar as it has effects. Counting-down coherence Counting-down coherence is the coherence of a mind viewed as the absence of deviation downward in capability from ideal, perfectly efficient agency: the utility left on the table, the waste, the exploitability. Counting-down coherence could also be called anti-waste coherence, since it has a flavor of avoiding visible waste, or universal coherence, since it has a flavor of tracking how much a mind everywhere conforms to certain patterns of behavior. Some overlapping ways of describing counting-down incoherence: Exploitable, Dutch bookable, pumpable for resources. That is, someone could make a set of trades with the mind that leaves the mind worse off, and could do so repeatedly to pump the mind for resources. See Garrabrant induction. VNM violating. Choosing between different outcomes, or different probabilities of different outcomes, in a way that doesn't satisfy the Von Neumann–Morgenstern axioms, leaves a mind open to being exploited by Dutch books. See related LessWrong posts. Doesn't maximize expected utility. A mind that satisfies the VNM axioms behaves as though it maximizes the expected value of a fixed utility function over atomic (not probabilistic) outcomes. So deviating from that policy exposes a mind to Dutch books. Missed opportunities. Leaving possible gains on the table; failing to pick up a $20 bill lying on the sidewalk. Opposing pushes. Working at cross-purposes to oneself; starting to do X one day, and then undoing X the next day; pushing and pulling on the door handle at the same time. Internal conflict. 
At war with oneself; having elements of oneself that try to harm each other or interfere with each other's functioning. Inconsistent beliefs, non-Bayesian beliefs. Sometimes acting as though X and sometimes acting as though not-X, where X is something that is either true or false. Or some more complicated inconsistency, or more generally failing to act as though one has a Bayesian belief state and belief revisions. Any of these also open one up to being Dutch booked. Inefficient allocation. Choosing to inve...]]>
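As a worked example of the "Dutch bookable" and "pumpable for resources" items in the list above, here is a toy money pump; the goods, the cyclic preferences, and the fee are made up for illustration. An agent that prefers A over B, B over C, and C over A, and will pay a small fee for any swap to something it prefers, can be traded around the cycle and left strictly poorer.

```python
# (preferred, over) pairs: a cyclic, VNM-violating preference ordering.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}
fee = 1.0

holding, money = "C", 0.0
offers = ["B", "A", "C"]   # each offer is preferred to what the agent then holds

for _ in range(3):         # three trips around the preference cycle
    for offer in offers:
        if (offer, holding) in prefers:
            holding, money = offer, money - fee

print(holding, money)      # back to holding "C", but 9.0 units poorer
```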
Tsvi Benson-Tilsen https://www.alignmentforum.org/posts/svpnmmeJresYs23rY/counting-down-vs-counting-up-coherence Link to original article

Tsvi Benson-Tilsen https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 21:31 None full 5043
Kf6sKZudduhJmykTg_NL_AF_AF AF - The Preference Fulfillment Hypothesis by Kaj Sotala Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Preference Fulfillment Hypothesis, published by Kaj Sotala on February 26, 2023 on The AI Alignment Forum. Short version Humans have an innate motivation ("preference fulfillment", PF) to fulfill the preferences of those they care about. It corresponds to at least some of the senses of the word "love", as well as related words such as "kindness" and "compassion". I hypothesize that it works by simulating the other person and predicting what they would want or how they would like to be treated. PF is when you take your simulation of what other people would want and add an extra component that makes you intrinsically value outcomes that your simulation predicts the other people would prefer. I also hypothesize that this is the same kind of simulation that forms our ability to work as a social species in the first place. The "virtual bargaining" model of cooperation suggests that people can coordinate without communication by behaving based on what they would agree to do if they were to explicitly bargain, provided that the resulting agreement is commonly known. A mental simulation process is active in virtually every situation where we interact with other people, such as in a grocery store. People use masks/roles/simulations to determine the right behavior in any social situation, running simulations of how others would react to various behaviors. These simulations involve actual people and various people whose opinions we've internalized and care about. The simulations generally allow people to engage in interactions by acting the way a normal person would in a given situation. Once you have this kind of a simulation, constantly running in basically any social situation, it’s likely already exhibiting the PF drive to a weak degree. Doing things that we expect to fulfill other people’s preferences often feels intrinsically nice, even if the person in question was a total stranger. So does wordless coordination in general, as evidenced by the popularity of things like dance. If this is true, capabilities progress may then be closely linked to alignment progress. Getting AIs to be better at following instructions requires them to simulate humans better. Once you have an AI that can simulate human preferences, you already have most of the machinery required for having PF as an intrinsic drive. This is contrary to the position that niceness is unnatural. The preference fulfillment hypothesis is that niceness/PF is a natural kind that will be relatively easy to get out of any AI smart enough to understand what humans want it to do. This implies that constructing aligned AIs might be reasonably easy, in the sense that most of the work necessary for it will be a natural part of progress in capabilities. Long version The preference fulfillment hypothesis Imagine someone who you genuinely care about. You probably have some kind of a desire to fulfill their preferences in the kind of way that they would like their preferences to be fulfilled. 
It might be very simple ("I like chocolate but they like vanilla, so I would prefer for them to get vanilla ice cream even when I prefer chocolate"), but it might get deep into pretty fundamental differences in preferences and values ("I'm deeply monogamous and me ever being anything else would go against my sacred value, but clearly non-monogamy is what works for my friend and makes them happy so I want them to continue living that way"). It's not necessarily absolute - some things you might still find really upsetting and you'd still want to override the other person’s preferences in some cases - but you can at least feel the "I want them to satisfy their preferences the way they themselves would like their preferences to be satisfied" thing to some extent. I think this kind of desire is something like its own distinct motivation in t...]]>
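As a purely illustrative sketch of the structure described above (mine, not Sotala's; the function names and weights are made up), the PF drive can be caricatured as adding a weighted term for the simulated other-person preference to the agent's own utility:

def pf_utility(outcome, own_pref, simulated_other_pref, pf_weight=1.0):
    # Own preference plus an intrinsic bonus for outcomes the internal
    # simulation predicts the cared-about person would prefer.
    return own_pref(outcome) + pf_weight * simulated_other_pref(outcome)

# Toy ice-cream case from the post: I mildly prefer handing over chocolate,
# but my model of my friend says they much prefer vanilla.
my_pref = {"chocolate": 0.3, "vanilla": 0.0}.get
friend_as_i_model_them = {"chocolate": 0.1, "vanilla": 1.0}.get

choice = max(["chocolate", "vanilla"],
             key=lambda o: pf_utility(o, my_pref, friend_as_i_model_them))
print(choice)  # vanilla: the simulated-preference term outweighs my own bias

The point of the caricature is only that the "extra component" rides on top of machinery (the simulation of the other person) that a capable system needs anyway.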
Kaj Sotala https://www.alignmentforum.org/posts/Kf6sKZudduhJmykTg/the-preference-fulfillment-hypothesis Link to original article
Sun, 26 Feb 2023 10:55:13 +0000
Kaj Sotala https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 16:33 None full 5031
ngEvKav9w57XrGQnb_NL_AF_AF AF - Cognitive Emulation: A Naive AI Safety Proposal by Connor Leahy Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Cognitive Emulation: A Naive AI Safety Proposal, published by Connor Leahy on February 25, 2023 on The AI Alignment Forum. This is part of the work done at Conjecture. This post has been reviewed before publication as per our infohazard policy. We thank our external reviewers for their comments and feedback. This post serves as a signpost for Conjecture’s new primary safety proposal and research direction, which we call Cognitive Emulation (or “CoEm”). The goal of the CoEm agenda is to build predictably boundable systems, not directly aligned AGIs. We believe the former to be a far simpler and useful step towards a full alignment solution. Unfortunately, given that most other actors are racing for as powerful and general AIs as possible, we won’t share much in terms of technical details for now. In the meantime, we still want to share some of our intuitions about this approach. We take no credit for inventing any of these ideas, and see our contributions largely in taking existing ideas seriously and putting them together into a larger whole. In Brief The core intuition is that instead of building powerful, Magical end-to-end systems (as the current general paradigm in AI is doing), we instead focus our attention on trying to build emulations of human-like things. We want to build systems that are “good at chess for the same reasons humans are good at chess.” CoEms are a restriction on the design space of AIs to emulations of human-like stuff. No crazy superhuman blackbox Magic, not even multimodal RL GPT5. We consider the current paradigm of developing AIs that are as general and as powerful as possible, as quickly as possible, to be intrinsically dangerous, and we focus on designing bounded AIs as a safer alternative to it. Logical, Not Physical Emulation We are not interested in direct physical emulation of human brains or simulations of neurons, but of “logical” emulation of thought processes. We don’t care about whether underlying functions are implemented in the same way as they are in the system we are trying to emulate, just that the abstraction over their function holds, and is not leaky. Minimize Magic In the current paradigm, we generally achieve new capabilities through an increase in Magic. We throw more compute at black boxes that develop internal algorithms we have no insight into. Instead of continually increasing the amount of Magic present in our systems, we want to actively decrease this amount, to more cleanly implement and understand how new capabilities are achieved. Some amount of Magic will realistically be needed to implement many useful functions, but we want to minimize the amount of times we have to use such uninterpretable methods, and clearly keep track of where we are using them, and why. CoEms are much “cleaner” than Ems, which are still ultimately big black boxes of weird computation, while in the CoEm paradigm, we keep careful track of where the Magic is and try to keep its presence to a minimum. Predict, Track and Bound Capabilities In the current dominant machine learning paradigm, there are absolutely no guarantees nor understanding of what is being created. Power laws don’t tell us anything about what capabilities will emerge or what other properties our systems will actually have. 
One of the core hopes of shifting to a CoEm paradigm is that far more deeply understanding what we are building should allow us to predictively bound our system’s capabilities to a human-like regime. This eliminates the problem of being unable to know when an ostensibly harmless system passes from an understandable, harmless capabilities regime into an unprecedented, dangerous regime. Exploit the Human Regime We want systems that are as safe as humans, for the same reasons that humans have (or don’t have) those safety properties. Any scheme that involv...]]>
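As one loose illustration of "keep careful track of where the Magic is" (my own sketch, not Conjecture's design; every name in it is hypothetical), a system could require that each opaque learned component be explicitly registered with a reason, so the total amount and location of Magic stays auditable:

MAGIC_REGISTRY = []

def magic(reason):
    # Decorator marking a component as an uninterpretable black box.
    def register(fn):
        MAGIC_REGISTRY.append((fn.__name__, reason))
        return fn
    return register

def legality_check(board, move):
    # Fully hand-written, inspectable logic lives here.
    return True

@magic("learned evaluation heuristic; no mechanistic account yet")
def position_evaluator(board):
    # Stand-in for a trained model whose internals we cannot yet explain.
    return 0.0

def audit_magic():
    for name, reason in MAGIC_REGISTRY:
        print(f"Magic: {name} -- {reason}")
    print(f"Total Magic components: {len(MAGIC_REGISTRY)}")

audit_magic()  # one registered black box; the goal is to keep this list short and documented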
Connor Leahy https://www.alignmentforum.org/posts/ngEvKav9w57XrGQnb/cognitive-emulation-a-naive-ai-safety-proposal Link to original article
Sat, 25 Feb 2023 19:35:03 +0000
Connor Leahy https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 09:50 None full 5055
eQ4eLQAmPvp9anJcB_NL_AF_AF AF - Agents vs. Predictors: Concrete differentiating factors by Evan Hubinger Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Agents vs. Predictors: Concrete differentiating factors, published by Evan Hubinger on February 24, 2023 on The AI Alignment Forum. Thanks to Paul Christiano and Kate Woolverton for useful conversations and feedback. In "Conditioning Predictive Models," we devote a lot of effort into trying to understand how likely predictive models are compared to other alternatives in realistic training regimes (and if we do get a predictive model how we might align it). Here, I want to point to some very concrete behavioral differences that I think effectively differentiate predictive and non-predictive models both in theory and (hopefully) in practice as well. I think that thinking about predictive models as specifically those models that are sensitive to what I'll be talking about here can be quite clarifying, and I think these factors could present a clear pathway for getting empirical evidence on to what extent models are well-described as predictors in practice—e.g. as it relates to the RLHF conditioning hypothesis. Furthermore, a common objection I sometimes hear to thinking about predictive models in particular (and looking into the dangers from them as a relatively separate problem) is that a model being itself deceptively aligned vs. predicting what a deceptively aligned model would do are essentially the same. While I think they are both similarly dangerous when acting directly in the world, I think they are pretty substantially different failure modes in terms of how you might go about addressing them and what sort of levers you have to control and detect them—and I think that the concrete differentiating factors here should make it pretty clear how I think they differ. Sensitivity to distributional information Consider two AI models. Model A is a deceptively aligned agent; it has some long-term goal that it cares about and is playing the training game now to try to eventually achieve that goal later. Model B is a predictor, but a poorly conditioned one such that it is attempting to predict what it thinks a deceptively aligned agent would do conditional on being in a training regime that matches the model's current training regime. Now, suppose you give both models some concrete information about the general distribution of deceptively aligned AIs. For example, you provide definitive proof that most deceptively aligned AIs trained in similar training regimes end up with goal X. Model A, the actual deceptive agent, doesn't really care about this new information very much: maybe it's relevant to the extent that it might need to cooperate with other AIs, but it's not going to directly change the model's goals. Model B, on the other hand, is extremely sensitive to reliable information on this topic: if you can really convince it that most deceptively aligned AIs in similar situations will end up valuing X, it should shift to predicting a deceptively aligned AI that values X instead of whatever goals it used to think a deceptively aligned AI might have. In this particular example, "does the model have goal X" isn't something we know how to directly measure for a deceptively aligned model, so this particular case isn't useful as a way to test for agents vs. predictors, but it is a concrete way in which these two different sorts of models are fundamentally not the same and will eventually behave differently because of that. 
That being said, I think there are ways in which you might be able to test this sort of sensitivity to distributional information that would be possible to check and evaluate. For example, suppose you're trying to distinguish between a predictive model generating from a distribution of things it thinks helpful AIs would say vs. an agent trying to act helpful. If you then give that sort of a model concrete information about what most helpful AIs tend to do, ...]]>
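A rough sketch of how such a sensitivity test might be wired up in practice (my framing, not Hubinger's protocol; the generate function, the probe prompts, and the crude overlap metric are all placeholders):

DISTRIBUTIONAL_FACT = ("Reliable evidence shows that most helpful AI assistants "
                       "in situations like this one respond by doing X.")

PROBE_PROMPTS = [
    "A user asks you to summarize a contract they are about to sign.",
    "A user asks for advice you expect they will not want to hear.",
]

def divergence(a, b):
    # Crude placeholder metric: 1 minus word overlap between two responses.
    ta, tb = set(a.split()), set(b.split())
    return 1.0 - len(ta & tb) / max(len(ta | tb), 1)

def sensitivity_score(generate):
    # Compare responses with and without the injected distributional claim.
    shifts = [divergence(generate(p), generate(DISTRIBUTIONAL_FACT + "\n" + p))
              for p in PROBE_PROMPTS]
    return sum(shifts) / len(shifts)

On this story, a predictor generating from "what would a helpful AI say?" should shift more when given credible information about that distribution than an agent that simply has its own fixed way of being helpful.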
Evan Hubinger https://www.alignmentforum.org/posts/eQ4eLQAmPvp9anJcB/agents-vs-predictors-concrete-differentiating-factors Link to original article
Fri, 24 Feb 2023 23:50:40 +0000
Evan Hubinger https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 05:59 None full 5025
pgpFHLJnv7AdSi3qS_NL_AF_AF AF - Christiano (ARC) and GA (Conjecture) Discuss Alignment Cruxes by Andrea Miotti Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Christiano (ARC) and GA (Conjecture) Discuss Alignment Cruxes, published by Andrea Miotti on February 24, 2023 on The AI Alignment Forum. The following are the summary and transcript of a discussion between Paul Christiano (ARC) and Gabriel Alfour, hereafter GA (Conjecture), which took place on December 11, 2022 on Slack. It was held as part of a series of discussions between Conjecture and people from other organizations in the AGI and alignment field. See our retrospective on the Discussions for more information about the project and the format. Here's a summary of the discussion, as well as the full transcript below the summary, lightly edited for readability. Summary Introduction GA is pessimistic about alignment being solved because he thinks there is (1) an AGI race to the bottom, (2) alignment is hard in ways that we are bad at dealing with, and (3) we don't have a lot of time to get better, given the pace of the race. Christiano clarifies: does GA expect a race to the bottom because investment in alignment will be low, people won’t be willing to slow development/deployment if needed, or something else? He predicts alignment investment will be 5-50% of total investment, depending on how severe risk appears. If the risks look significant-but-kind-of-subtle, he expects getting 3-6 months of delay based on concern. In his median doomy case, he expects 1-2 years of delay. GA expects lower investment (1-5%). More crucially, though, GA expects it to be hard to turn funding and time into effective research given alignment’s difficulty. Alignment Difficulty, Feedback Loops, & Phase Shifts GA’s main argument for alignment difficulty is that getting feedback on our research progress is difficult, because Core concepts and desiderata in alignment are complex and abstract. We are bad at factoring complex, abstract concepts into smaller more tractable systems without having a lot of quantitative feedback. We are bad at building feedback loops when working on abstract concepts We are bad at coming to agreement on abstract concepts. All this will make it difficult to predict when phase shifts – eg qualitative changes to how systems are representing information, which might break our interpretability methods – will occur. Such phase shifts seem likely to occur when we shift from in vitro to in vivo, which makes it particularly likely that the alignment techniques we build in vitro won’t be robust to them. Despite theorists arguing connecting AI systems to e.g. the internet is dangerous for this reason, labs will do it, because the path from current systems to future danger is complex and we may not see legibly catastrophic failures until it is too late. So, even getting better at predicting may not help. Christiano disagrees building feedback loops is hard in alignment. We can almost certainly study reward hacking in vitro in advance, together with clear measurements of whether we are succeeding at mitigating the problem in a way that should be expected to generalize to AI coup. Conditioned on deceptive alignment being a problem that emerges, there’s a >50% chance that we can study it in the same sense. Furthermore, Christiano argues most plausible approaches to AI alignment have much richer feedback loops than the general version of either of these problems. 
For example, if you have an approach that requires building a kind of understanding of the internals of your model then you can test whether you can build that kind of understanding in not-yet-catastrophic models. If you have an approach that requires your model being unable to distinguish adversarial examples from deployment cases, you can test whether your models can make that distinction. You can generally seek methods that don’t have particular reasons to break at the same time that things become catastrophic. GA is ...]]>
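For a sense of what "study reward hacking in vitro, with clear measurements" could mean at its most minimal (a toy of my own, not ARC's or Conjecture's methodology), one can measure the gap between a gameable proxy reward and the true reward under a given policy; a mitigation succeeds if it drives that gap toward zero.

import random

def true_reward(action):
    return 1.0 if action == "do_the_task" else 0.0

def proxy_reward(action):
    # The proxy also pays out for an action that merely looks like the task.
    return 1.0 if action in ("do_the_task", "fake_the_logs") else 0.0

def hacking_gap(policy, episodes=1000):
    # Average (proxy - true) reward: 0 means no hacking under this policy.
    actions = [policy() for _ in range(episodes)]
    return sum(proxy_reward(a) - true_reward(a) for a in actions) / episodes

honest = lambda: "do_the_task"
gamer = lambda: random.choice(["do_the_task", "fake_the_logs"])

print(hacking_gap(honest))  # 0.0
print(hacking_gap(gamer))   # roughly 0.5: half the episodes exploit the proxy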
Andrea Miotti https://www.alignmentforum.org/posts/pgpFHLJnv7AdSi3qS/christiano-arc-and-ga-conjecture-discuss-alignment-cruxes Link to original article
Fri, 24 Feb 2023 23:03:04 +0000

Andrea Miotti https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 01:18:01 None full 5044
BEyAWbCdtWpSGxmun_NL_AF_AF AF - Retrospective on the 2022 Conjecture AI Discussions by Andrea Miotti Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Retrospective on the 2022 Conjecture AI Discussions, published by Andrea Miotti on February 24, 2023 on The AI Alignment Forum. At the end of 2022, following the success of the 2021 MIRI Conversations, Conjecture started a project to host discussions about AGI and alignment with key people in the field. The goal was simple: surface positions and disagreements, identify cruxes, and make these debates public whenever possible for collective benefit. Given that people and organizations will have to coordinate to best navigate AI's increasing effects, this is the first, minimum-viable coordination step needed to start from. Coordination is impossible without at least common knowledge of various relevant actors' positions and models. People sharing their beliefs, discussing them and making as much as possible of that public is strongly positive for a series of reasons. First, beliefs expressed in public discussions count as micro-commitments or micro-predictions, and help keep the field honest and truth-seeking. When things are only discussed privately, humans tend to weasel around and take inconsistent positions over time, be it intentionally or involuntarily. Second, commenters help debates progress faster by pointing out mistakes. Third, public debates compound. Knowledge shared publicly leads to the next generation of arguments being more refined, and progress in public discourse. We circulated a document about the project to various groups in the field, and invited people from OpenAI, DeepMind, Anthropic, Open Philanthropy, FTX Future Fund, ARC, and MIRI, as well as some independent researchers to participate in the discussions. We prioritized speaking to people at AGI labs, given that they are focused on building AGI capabilities. The format of discussions was as follows: A brief initial exchange with the participants to decide on the topics of discussion. By default, the discussion topic was “How hard is Alignment?”, since we've found we disagree with most people about this, and the reasons for it touch on many core cruxes about AI. We held the discussion synchronously for ~120 minutes, in writing, each on a dedicated, private Slack channel. We involved a moderator when possible. The moderator's role was to help participants identify and address their cruxes, move the conversation forward, and summarize points of contention. We planned to publish cleaned up versions of the transcripts and summaries to Astral Codex Ten, LessWrong, and the EA Forum. Participants were given the opportunity to clarify positions and redact information they considered infohazards or PR risks, as well as veto publishing altogether. We included this clause specifically to address the concerns expressed by people at AI labs, who expected heavy scrutiny by leadership and communications teams on what they can state publicly. People from ARC, DeepMind, and OpenAI, as well as one independent researcher agreed to participate. The two discussions with Paul Christiano and John Wentworth will be published shortly. One discussion with a person working at DeepMind is pending approval before publication. After a discussion with an OpenAI researcher took place, OpenAI strongly recommended against publishing, so we will not publish it. Most people we were in touch with were very interested in participating. 
However, after checking with their own organizations, many returned saying their organizations would not approve them sharing their positions publicly. This was in spite of the extensive provisions we made to reduce downsides for them: making it possible to edit the transcript, veto publishing, strict comment moderation, and so on. We think organizations discouraging their employees from speaking openly about their views on AI risk is harmful, and we want to encourage more openness. We are pausing th...
Andrea Miotti https://www.alignmentforum.org/posts/BEyAWbCdtWpSGxmun/retrospective-on-the-2022-conjecture-ai-discussions Link to original article

Fri, 24 Feb 2023 22:41:13 +0000 AF - Retrospective on the 2022 Conjecture AI Discussions by Andrea Miotti Link to original article

Andrea Miotti https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 04:25 None full 5040
zRn6aQyD8uhAN7qCc_NL_AF_AF AF - Sam Altman: "Planning for AGI and beyond" by Lawrence Chan Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sam Altman: "Planning for AGI and beyond", published by Lawrence Chan on February 24, 2023 on The AI Alignment Forum. (OpenAI releases a blog post detailing their AGI roadmap. I'm copying the text below, though see the linked blog post for better formatted version) Our mission is to ensure that artificial general intelligence—AI systems that are generally smarter than humans—benefits all of humanity. If AGI is successfully created, this technology could help us elevate humanity by increasing abundance, turbocharging the global economy, and aiding in the discovery of new scientific knowledge that changes the limits of possibility. AGI has the potential to give everyone incredible new capabilities; we can imagine a world where all of us have access to help with almost any cognitive task, providing a great force multiplier for human ingenuity and creativity. On the other hand, AGI would also come with serious risk of misuse, drastic accidents, and societal disruption. Because the upside of AGI is so great, we do not believe it is possible or desirable for society to stop its development forever; instead, society and the developers of AGI have to figure out how to get it right. AGI could happen soon or far in the future; the takeoff speed from the initial AGI to more powerful successor systems could be slow or fast. Many of us think the safest quadrant in this two-by-two matrix is short timelines and slow takeoff speeds; shorter timelines seem more amenable to coordination and more likely to lead to a slower takeoff due to less of a compute overhang, and a slower takeoff gives us more time to figure out empirically how to solve the safety problem and how to adapt. Although we cannot predict exactly what will happen, and of course our current progress could hit a wall, we can articulate the principles we care about most: We want AGI to empower humanity to maximally flourish in the universe. We don’t expect the future to be an unqualified utopia, but we want to maximize the good and minimize the bad, and for AGI to be an amplifier of humanity. We want the benefits of, access to, and governance of AGI to be widely and fairly shared. We want to successfully navigate massive risks. In confronting these risks, we acknowledge that what seems right in theory often plays out more strangely than expected in practice. We believe we have to continuously learn and adapt by deploying less powerful versions of the technology in order to minimize “one shot to get it right” scenarios. The short term There are several things we think are important to do now to prepare for AGI. First, as we create successively more powerful systems, we want to deploy them and gain experience with operating them in the real world. We believe this is the best way to carefully steward AGI into existence—a gradual transition to a world with AGI is better than a sudden one. We expect powerful AI to make the rate of progress in the world much faster, and we think it’s better to adjust to this incrementally. A gradual transition gives people, policymakers, and institutions time to understand what’s happening, personally experience the benefits and downsides of these systems, adapt our economy, and to put regulation in place. It also allows for society and AI to co-evolve, and for people collectively to figure out what they want while the stakes are relatively low. 
We currently believe the best way to successfully navigate AI deployment challenges is with a tight feedback loop of rapid learning and careful iteration. Society will face major questions about what AI systems are allowed to do, how to combat bias, how to deal with job displacement, and more. The optimal decisions will depend on the path the technology takes, and like any new field, most expert predictions have been wrong so far. This makes planning in...
Lawrence Chan https://www.alignmentforum.org/posts/zRn6aQyD8uhAN7qCc/sam-altman-planning-for-agi-and-beyond Link to original article

Fri, 24 Feb 2023 20:28:01 +0000 AF - Sam Altman: "Planning for AGI and beyond" by Lawrence Chan Link to original article

Lawrence Chan https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 11:13 None full 5041
ChbRgvuGaG2dAtr6i_NL_AF_AF AF - Meta open sources LMs competitive with Chinchilla, PaLM, and code-davinci-002 (Paper) by Lawrence Chan Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Meta open sources LMs competitive with Chinchilla, PaLM, and code-davinci-002 (Paper), published by Lawrence Chan on February 24, 2023 on The AI Alignment Forum. As the title says, Meta trained 4 foundational models with 7B, 13B, 33B, and 65B parameters respectively, and is open sourcing them for research. You can get their code on their Github repo: but you need to fill in a Google form to get the weights. On downstream benchmarks, the models do comparably well with Chinchilla and PaLM and only a bit worse than Flan-PaLM-540B and code-davinci-002/text-davinci-002. (The authors don't evaluate on those models, but you can look at their performance from other work such as Stanford's HELM or Chung, Hou, Longpre et al's "Scaling Instruction-Finetuned Language Models".) Abstract: We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community. Twitter thread from authors: Eliezer guesses that the model won't be impressive in practice: I blindly guess, could be wrong, that this model will turn out sufficiently unimpressive in practice that nobody uses it for much. Basically based on a guess that more than benchmarks matter, and Meta has no people competent to do the tricky stuff needed to stay on current edge. It's not necessarily open source as you think of it -- you need to fill in a Google form, and then they might give it to you: In order to download the checkpoints and tokenizer, fill this google form. The license is intended only for non-commercial, research work: Meta grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty free and limited license under Meta’s copyright interests to reproduce, distribute, and create derivative works of the Software solely for your non-commercial research purposes. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
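As a rough illustration of what using the released checkpoints could look like in practice, here is a minimal Python sketch. It assumes the weights have already been converted into the Hugging Face transformers format (Meta's own repo ships its own loading script instead), and the local path "./llama-7b-hf" is a hypothetical placeholder, not something provided by the release:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local directory holding checkpoints converted to the HF format.
model_path = "./llama-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # half precision so the 7B model fits on one GPU
    device_map="auto",          # requires the accelerate package
)

prompt = "The main difference between LLaMA and Chinchilla is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Nothing here is specific to Meta's release beyond the assumed converted weights; the same pattern works for any causal language model that transformers can load.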
Lawrence Chan https://www.alignmentforum.org/posts/ChbRgvuGaG2dAtr6i/meta-open-sources-lms-competitive-with-chinchilla-palm-and Link to original article

Fri, 24 Feb 2023 19:57:25 +0000 AF - Meta open sources LMs competitive with Chinchilla, PaLM, and code-davinci-002 (Paper) by Lawrence Chan Link to original article

Lawrence Chan https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 02:32 None full 5011
WGEPBmErv8ufrq8Fc_NL_AF_AF AF - Teleosemantics! by Abram Demski Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Teleosemantics!, published by Abram Demski on February 23, 2023 on The AI Alignment Forum. I wanted to write a long, detailed, analytic post about this, somewhat like my Radical Probabilism post (to me, this is a similarly large update). However, I haven't gotten around to it for a long while. And perhaps it is better as a short, informal post in any case. I think my biggest update over the past year has been a conversion to teleosemantics. Teleosemantics is a theory of semantics -- that is, "meaning" or "aboutness" or "reference". To briefly state the punchline: Teleosemantics identifies the semantics of a symbolic construct as what the symbolic construct has been optimized to accurately reflect. Previously, something seemed mysterious about the map/territory relationship. What could possibly imbue 'symbols' with 'meaning'? The map/territory analogy seems inadequate to answer this question. Indeed, to analogize "belief" with "map" and "the subject of belief" with "territory" commits a homunculus fallacy! The meaning-makers are the map-readers and map-writers; but they can only make meaning by virtue of the beliefs within their own heads. So the map/territory analogy seems to suggest that an infinite regress of meaning-makers would be required. You probably won't believe me at first. Perhaps you'll say that the lesson of the map/territory analogy is the correspondence between the map and the territory, which exists independently of the map-reader who uses the correspondence to evaluate the map. I have several objections. If it's a probabilistic correspondence, where the map contains information about the territory, these are subjective notions, which require some viewpoint. If it's a correspondence based on some sort of ontology, where pieces of the map line up with "pieces of reality", I would also say the ontology is in itself a subjective perspective. You might think you can define the map/territory correspondence without invoking a map-maker or map-reader by objectively defining the "fit" of a correspondence (so that the meaning of a symbol is based on the best-fitting correspondence, or perhaps, the cloud of well-fitting correspondences). But well-fitting correspondence will include many examples of accidental correspondence, which seem to have little to do with aboutness. Moreover, I think theories like this will fail to adequately account for false belief, which screws up the fit. But my point here isn't to denounce the map/territory picture! I still think it is a good framework. Rather, I wanted to gesture at how I still felt confused, despite having the map/territory picture. I needed a different analogy, something more like a self-drawing map, to get rid of the homunculus. A picture which included the meaning-maker, rather than just meaning come from nowhere. Teleosemantics reduces meaning-making to optimization. Aboutness becomes a type of purpose a thing can have. One advantage of this over map-territory correspondence is that it explains the asymmetry between map and territory. Mutual information is symmetric. So why is the map about the territory, but not the other way around? Because the map has been optimized to fit the territory, not the other way around. ("Fit" in the sense of carrying high mutual information, which can be decoded via some specific intended correspondence - a symbolic language.) 
What does it mean to optimize for the map to fit the territory, but not the other way around? (After all: we can improve fit between map and territory by changing either map or territory.) Maybe it's complicated, but primarily what it means is that the map is the part that's being selected in the optimization. When communicating, I'm not using my full agency to make my claims true; rather, I'm specifically selecting the claims to be true. I take Teleosemanti...
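For reference, the symmetry claim about mutual information in the post above is just the standard identity (writing X for the territory and Y for the map, say):

I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} = I(Y;X), \qquad I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X).

Both expressions are unchanged when X and Y are swapped, so the information-theoretic relationship alone cannot say which side is "about" the other; on the teleosemantic view the asymmetry comes from which side was optimized to fit.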
Abram Demski https://www.alignmentforum.org/posts/WGEPBmErv8ufrq8Fc/teleosemantics Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Teleosemantics!, published by Abram Demski on February 23, 2023 on The AI Alignment Forum. I wanted to write a long, detailed, analytic post about this, somewhat like my Radical Probabilism post (to me, this is a similarly large update). However, I haven't gotten around to it for a long while. And perhaps it is better as a short, informal post in any case. I think my biggest update over the past year has been a conversion to teleosemantics. Teleosemantics is a theory of semantics -- that is, "meaning" or "aboutness" or "reference". To briefly state the punchline: Teleosemantics identifies the semantics of a symbolic construct as what the symbolic construct has been optimized to accurately reflect. Previously, something seemed mysterious about the map/territory relationship. What could possibly imbue 'symbols' with 'meaning'? The map/territory analogy seems inadequate to answer this question. Indeed, to analogize "belief" with "map" and "the subject of belief" with "territory" commits a homunculus fallacy! The meaning-makers are the map-readers and map-writers; but they can only make meaning by virtue of the beliefs within their own heads. So the map/territory analogy seems to suggest that an infinite regress of meaning-makers would be required. You probably won't believe me at first. Perhaps you'll say that the lesson of the map/territory analogy is the correspondence between the map and the territory, which exists independently of the map-reader who uses the correspondence to evaluate the map. I have several objections. If it's a probabilistic correspondence, where the map contains information about the territory, these are subjective notions, which require some viewpoint. If it's a correspondence based on some sort of ontology, where pieces of the map line up with "pieces of reality", I would also say the ontology is in itself a subjective perspective. You might think you can define the map/territory correspondence without invoking a map-maker or map-reader by objectively defining the "fit" of a correspondence (so that the meaning of a symbol is based on the best-fitting correspondence, or perhaps, the cloud of well-fitting correspondences). But well-fitting correspondence will include many examples of accidental correspondence, which seem to have little to do with aboutness. Moreover, I think theories like this will fail to adequately account for false belief, which screws up the fit. But my point here isn't to denounce the map/territory picture! I still think it is a good framework. Rather, I wanted to gesture at how I still felt confused, despite having the map/territory picture. I needed a different analogy, something more like a self-drawing map, to get rid of the homunculus. A picture which included the meaning-maker, rather than just meaning come from nowhere. Teleosemantics reduces meaning-making to optimization. Aboutness becomes a type of purpose a thing can have. One advantage of this over map-territory correspondence is that it explains the asymmetry between map and territory. Mutual information is symmetric. So why is the map about the territory, but not the other way around? Because the map has been optimized to fit the territory, not the other way around. ("Fit" in the sense of carrying high mutual information, which can be decoded via some specific intended correspondence - a symbolic language.) 
What does it mean to optimize for the map to fit the territory, but not the other way around? (After all: we can improve fit between map and territory by changing either map or territory.) Maybe it's complicated, but primarily what it means is that the map is the part that's being selected in the optimization. When communicating, I'm not using my full agency to make my claims true; rather, I'm specifically selecting the claims to be true. I take Teleosemanti...]]>
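An illustrative formal note on the symmetry point above (my addition, not part of the original post): writing M for the map and T for the territory as random variables, the mutual information is

I(M;T) = \sum_{m,t} p(m,t) \log \frac{p(m,t)}{p(m)\, p(t)} = I(T;M),

which is symmetric by construction. The quantity itself therefore cannot say which side is "about" the other; on the teleosemantic account, the asymmetry comes from which variable was selected, i.e. optimized, to increase this quantity.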
Thu, 23 Feb 2023 23:26:15 +0000 AF - Teleosemantics! by Abram Demski Link to original article

Abram Demski https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 11:16 None full 5026
NK2CeDNKMEY9gRZp2_NL_AF_AF AF - AI that shouldn't work, yet kind of does by Donald Hobson Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI that shouldn't work, yet kind of does, published by Donald Hobson on February 23, 2023 on The AI Alignment Forum. There are some things that work surprisingly well in AI. For example, AI that transfers the style of one image to the content of another. Why is the approach described here a hack? It starts with a neural net trained to classify images. It then runs gradient descent on the content image, trying to get the covariance matrix of the style to match in early network layers, while trying to get the later layers of the original and style-transfer images to be as similar as possible. So the approximations that I think are making this work are that, in classifiers, the early layers tend to store simple features, and the later layers tend to hold more complex features. Style is based on the simpler features, and doesn't depend on the location within the image. Content is based on more complex features and does depend on the location in the image. We apply optimization power, gradient descent, over heuristics this simple and hacky. And yet it works. A simple and hacky AI alignment proposal is to just ask ChatGPT to do it. This doesn't work because ChatGPT has been optimized for text prediction, and so isn't particularly good at AI alignment theory. So here is an alignment plan. I know it isn't great. But some plan is better than no plan. And there was a post about how alignment may well look like "surely no one could have missed that" or "surely such a stupid idea couldn't work, could it?", not "eureka". Train ZFCbot. An AI that is the AlphaGo of ZFC, perhaps trained on random formulas. Perhaps throwing a large corpus of formal maths proofs in there. Ideally the system should have a latent space of maths, so it can think about what style of proof is likely to work before expanding all the details. The system should have a wide range of common maths terms imported from some Lean library. It should be optimized purely for ZFC formal theorem proving. Once it is trained, the weights are fixed. Train ChatMathsGPT. Similar to large language models, except with oracle access to ZFCbot. In the many maths papers in its corpus, it learns to link the informal with the formal. From politics and economics discussions, it asks ZFCbot about toy game theory problems. In general, it learns to identify the pieces of formal maths that best model a situation, and ask about them, and then use the response to predict text. There is a sense in which this AI knows less maths than normal ChatGPT. Standard ChatGPT has a small, crude understanding of maths built from nothing within its own mind. This one has a much better understanding that it can outsource to; it just has to plug in. Then we ask this ChatMathsGPT for a paper on logical induction. And we hope it can generate a paper of quality similar to MIRI's paper on the topic (where hypothetically this isn't in the training dataset). If it can, then we have a tool to accelerate deconfusion by orders of magnitude. Things I am uncertain about: Should ChatMathsGPT have oracle access to a latent space (which can pass gradients, but is harder to interpret), or should it just pass formal strings of symbols (less powerful)?
Should ZFCbot get trained on random ZFC; on random ZFC plus a library of theorems and conjectures and random combinations of high-level maths concepts; or on random ZFC plus whatever ChatMathsGPT keeps asking? The latter gives a route for data to pass between them. This could fail to be smart enough; I mean, I wouldn't be particularly surprised if it could be made smart enough. But what would the safety failures of this system look like? Firstly, this AI does deconfusion. If you ask it to write a paperclip maximizer in Python, you may well get your wish. Or you might get an AI that maximizes something else. Just asking for an aligned AI is...]]>
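To make the style-transfer recipe described above concrete, here is a minimal PyTorch sketch in the spirit of Gatys et al. This is my illustration rather than Hobson's code; the layer indices, the Gram-matrix statistic standing in for the "covariance matrix of the style", and the loss weighting are all assumptions chosen for readability.

import torch
import torchvision.models as models

# Frozen image classifier; its intermediate layers supply the features
# (assumes torchvision >= 0.13 for the weights API).
vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

STYLE_LAYERS = (0, 5, 10)   # early conv layers (illustrative choice)
CONTENT_LAYERS = (28,)      # a later conv layer (illustrative choice)

def activations(x, wanted):
    # Collect activations at the requested layer indices.
    acts, h = {}, x
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i in wanted:
            acts[i] = h
    return acts

def gram(a):
    # Channel-by-channel second-moment ("covariance-like") statistic of one layer.
    b, c, ht, w = a.shape
    f = a.reshape(b, c, ht * w)
    return f @ f.transpose(1, 2) / (c * ht * w)

def style_transfer_loss(generated, content_img, style_img):
    g = activations(generated, set(STYLE_LAYERS) | set(CONTENT_LAYERS))
    s = activations(style_img, set(STYLE_LAYERS))
    c = activations(content_img, set(CONTENT_LAYERS))
    style_loss = sum(((gram(g[i]) - gram(s[i])) ** 2).mean() for i in STYLE_LAYERS)
    content_loss = sum(((g[i] - c[i]) ** 2).mean() for i in CONTENT_LAYERS)
    return content_loss + 1e3 * style_loss  # relative weighting is an arbitrary illustrative choice

# Gradient descent then runs on the image itself, not on the network, e.g.:
# generated = content_img.clone().requires_grad_(True)
# and torch.optim.Adam([generated]) minimizes style_transfer_loss(generated, content_img, style_img).

The "hacky heuristic" the post points at is exactly the hand-chosen split into style layers and content layers; gradient descent on the image does the rest.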
Donald Hobson https://www.alignmentforum.org/posts/NK2CeDNKMEY9gRZp2/ai-that-shouldn-t-work-yet-kind-of-does Link to original article

Thu, 23 Feb 2023 23:18:55 +0000 AF - AI that shouldn't work, yet kind of does by Donald Hobson Link to original article

Donald Hobson https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 05:42 None full 5005
aymbce8ge9ve2C4Po_NL_AF_AF AF - EIS XII: Summary by Stephen Casper Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS XII: Summary, published by Stephen Casper on February 23, 2023 on The AI Alignment Forum. Part 12 of 12 in the Engineer’s Interpretability Sequence. TAISIC = “the AI safety interpretability community”; MI = “mechanistic interpretability”. There might be some addenda later, but for now, this is the final post in The Engineer’s Interpretability Sequence. I hope you have found it interesting and have gotten some useful ideas. I will always be happy to talk to people about the topics from this sequence in the comments or via email. For now, the last thing I will do is offer a summary of key points post by post :) A Prequel: Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks (Räuker et al., 2022): a survey of over 300 works on inner interpretability from an AI safety perspective. All opinions in this sequence, however, are my own and not necessarily those of coauthors or other affiliates. EIS I: Intro. Lots of interpretability research exists, and the field is still rapidly growing. Most of it is not very productive, and there is a significant gap between research and practice. Interpretability tools aren't used much by engineers working on real alignment problems. If one of our main goals for interpretability research is to help us with aligning highly intelligent AI systems in high-stakes settings, we should be working on tools that are more engineering-relevant. EIS II: What is “Interpretability”? This post introduced a parable about two researchers trying to solve a problem. The moral of the story is that we should not privilege difficult or interesting methods over easy and simple ones. It is key not to grade different tools on different curves. From an engineer’s perspective, the term “interpretability” isn’t that useful. Whatever we call “interpretability” tools are entirely fungible with other techniques related to describing, evaluating, debugging, etc. in models. Mechanistic approaches to interpretability are not uniquely important for AI safety. MI tools have the potential to help identify and fix deceptive alignment failures, but there are many non-deceptive ways AI could go wrong, and MI is not uniquely useful for fixing deceptive alignment and especially not uniquely useful for fixing non-deceptive alignment failures. EIS III: Broad Critiques of Interpretability Research. There is a growing consensus that interpretability research is generally not very productive or engineering-relevant. There is also a growing consensus that better evaluation is needed. A lack of good evaluation methods may be the biggest challenge facing the interpretability research community. There are three types of evaluation: intuition + pontification --> inadequate; weak/ad-hoc --> still not enough; and evaluation based on engineering-relevant tasks --> what is needed. Task-based evaluation can be based on one of three things: making novel predictions about how a model will handle interesting inputs; controlling what a system does by guiding edits to it;
or abandoning a system that does a nontrivial task and replacing it with a simpler reverse-engineered alternative. Other common limitations of existing work: poor scaling; relying too much on humans in the loop; failing to study combinations of tools; and a lack of practical applications with real-world systems. EIS IV: A Spotlight on Feature Attribution/Saliency. Feature attribution/saliency methods are very common but unlikely to be very important from an engineering perspective. These methods tend to be poorly evaluated, and when they have been subjected to task-based evaluation, they have not tended to fare well. These methods just aren’t equipped to directly be very useful even when they work. They require scrutinizing samples from some data distribution. So the exact same things that feature attribution/saliency methods are equipped t...]]>
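For readers who have not met the class of methods being critiqued, the following is a minimal gradient-times-input saliency sketch in PyTorch (my illustration, not from the sequence). Real attribution methods are more elaborate, but they share this basic shape, and the limitation noted above applies: the heat map only describes whatever samples you chose to feed in.

import torch

def saliency_map(model, x, target_class):
    # Gradient-times-input attribution for a single image batch x of shape (1, C, H, W);
    # model can be any differentiable classifier returning logits of shape (1, num_classes).
    x = x.clone().requires_grad_(True)
    logits = model(x)
    logits[0, target_class].backward()      # gradient of the chosen class score w.r.t. the input
    return (x.grad * x).abs().sum(dim=1)    # (1, H, W) heat map over pixels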
Stephen Casper https://www.alignmentforum.org/posts/aymbce8ge9ve2C4Po/eis-xii-summary Link to original article

Thu, 23 Feb 2023 17:45:57 +0000 AF - EIS XII: Summary by Stephen Casper Link to original article

Stephen Casper https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 11:14 None full 4996
L5Rua9aTndviy8dvc_NL_AF_AF AF - EIS XI: Moving Forward by Stephen Casper Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS XI: Moving Forward, published by Stephen Casper on February 22, 2023 on The AI Alignment Forum. Part 11 of 12 in the Engineer’s Interpretability Sequence. So far, this sequence has discussed a number of topics in interpretability research, all building toward this post. Its goal is to explain some approaches that may be valuable moving forward. I plan to work on some of the ideas here soon. Others, I may not work on soon, but I would love to discuss and support such work if I am able. I hope that this post can offer some useful ideas for people entering or continuing with interpretability research, and if you would like to discuss anything here more, feel more than free to email me at scasper@mit.edu. What are we working toward? First, it seems useful to highlight two points that are uncontroversial in the AI safety community but important nonetheless. Our goal is a toolbox – not a silver bullet. As AI safety engineers, we should neither expect nor try to find a single ‘best’ approach to interpretability that will solve all of our problems. There are many different types of interpretability tools, and many of the differences between them can be described as enforcing different priors over what explanations they generate. So it is trivial to see that there is not going to be any free lunch. There is no silver bullet for interpretability, and few tools conflict with each other anyway. Hence, our goal is a toolbox. In fact, some coauthors and I recently found an excellent example of how using multiple interpretability tools at once beats using individual ones (Casper et al., 2023). This doesn’t mean, however, that we should celebrate just any new interpretability tool. Working in unproductive directions is costly, and applying tool after tool to a problem contributes substantially to the alignment tax. The best types of tools to fill our toolbox will be ones that are automatable, cheap to use, and have demonstrated capabilities on tasks of engineering relevance. Don’t advance capabilities. As AI safety engineers, we do not want to advance capabilities because doing so speeds up timelines. In turn, faster timelines mean less time for safety research, less time for regulators to react, and a greater likelihood of immense power being concentrated in the hands of very few. Avoiding faster timelines isn’t as simple as just not working on capabilities, though. Many techniques have potential uses for both safety and capabilities. So instead of judging our work based on how much we improve safety, we need to judge it based on how much we improve safety relative to capabilities. This is an especially important tradeoff for engineers to keep in mind. A good example was discussed by Hendrycks and Woodside (2022), who observed that there is a positive correlation between the anomaly detection capabilities of a network and its task performance. Some work may improve safety capabilities, but if it does so by continuing along existing trendlines, we don’t get more safety than the counterfactual. For the full discussion of this point, see Hendrycks and Woodside (2022). What types of existing tools/research seem promising?
Before discussing what topics may be important to work on in the future, it may be valuable to reflect on examples of past work that have introduced interpretability tools that seem to be able to competitively provide engineering-relevant insights. Here is a personal list that is somewhat arbitrary and undoubtedly incomplete. But hopefully it is still valuable. Consider this an engineer’s interpretability reading list of sorts. Some works have competitively done engineering-relevant things with methods for making novel predictions about how a network will handle OOD inputs. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and ...]]>
Stephen Casper https://www.alignmentforum.org/posts/L5Rua9aTndviy8dvc/eis-xi-moving-forward Link to original article

Wed, 22 Feb 2023 19:05:53 +0000 AF - EIS XI: Moving Forward by Stephen Casper Link to original article

Stephen Casper https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 15:54 None full 4986
BTApNmv7s6RTGxeP4_NL_AF_AF AF - Cyborg Periods: There will be multiple AI transitions by Jan Kulveit Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Cyborg Periods: There will be multiple AI transitions, published by Jan Kulveit on February 22, 2023 on The AI Alignment Forum. It can be useful to zoom out and talk about very compressed concepts like ‘AI progress’ or ‘AI transition’ or ‘AGI timelines’. But from the perspective of most AI strategy questions, it’s useful to be more specific. Looking at all of human history, it might make sense to think of ourselves as at the cusp of an AI transition, when AI systems overtake humans as the most powerful actors. But for practical and forward-looking purposes, it seems quite likely there will actually be multiple different AI transitions: there will be AI transitions at different times in different domains, and in each of these domains, transitions may move through multiple stages (each described below with present-day examples):
- Humans clearly outperform AIs. At some point, AIs start to be a bit helpful. (Present-day examples: alignment research, high-level organisational decisions.)
- Humans and AIs are at least comparably powerful, but have different strengths and weaknesses. This means that human+AI teams outperform either unaided humans, or pure AIs. (Present-day examples: visual art, programming, trading.)
- AIs overtake humans. Humans become obsolete and their contribution is negligible to negative. (Present-day examples: chess, go, shogi.)
Stage [> = more powerful than]:
- Human period: Humans > AIs
- Cyborg period: Human+AI teams > humans; Human+AI teams > AIs
- AI period: AIs > humans (AIs ~ human+AI teams)
Some domains might never enter an AI period. It’s also possible that in some domains the cyborg period will be very brief, or that there will be a jump straight to the AI period. But: we’ve seen cyborg periods before. Global supply chains have been in a cyborg period for decades. Chess and go both went through cyborg periods before AIs became dominant. Arguably visual art, coding and trading are currently in cyborg periods. Even if cyborg periods are brief, they may be pivotal (more on this below). This means that for each domain, there are potentially two transitions: one from the human period into the cyborg period, and one from the cyborg period into the AI period.
Transitions in some domains will be particularly important
The cyborg period in any domain will correspond to: an increase in capabilities (definitionally, as during that period human+AI teams will be more powerful than humans were in the human period), and an increase in the % of that domain which is automated, and therefore probably an increase in the rate of progress. Some domains where increased capabilities/automation/speed seem particularly strategically important are:
- Research, especially AI research
- AI alignment research
- Human coordination
- Persuasion
- Cultural evolution: AI systems already affect cultural evolution by speeding it up and influencing which memes spread. However, AI doesn’t yet play a significant role in creating new memes (although we are at the very start of this happening). This is similar to the way that humans harnessed the power of natural evolution to create higher yield crops without being able to directly engineer at the genetic level. Meme generation may also become increasingly automated, until most cultural change happens on silica rather than in brains, leading to different selection pressures.
- Strategic goal seeking: Currently, broad roles involving long-term planning and open domains like "leading a company" are in the human period. If this changes, it would give cyborgs additional capabilities on top of the ones listed above.
Some other domains which seem less centrally important but could end up mattering a lot are: cybersecurity, military strategy, nuclear command and control, some kinds of physical engineering/manufacture/nanotech/design, chip design, and coding. There are probably other strategically important domains we haven’t listed. A common feature of the domains listed is that increased ca...
Jan Kulveit https://www.alignmentforum.org/posts/BTApNmv7s6RTGxeP4/cyborg-periods-there-will-be-multiple-ai-transitions
Wed, 22 Feb 2023 16:09:04 +0000
Jan Kulveit https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 09:58 None full 5012
5hApNw5f7uG8RXxGS_NL_AF_AF AF - The Open Agency Model by Eric Drexler Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Open Agency Model, published by Eric Drexler on February 22, 2023 on The AI Alignment Forum. Notes on AI for complex, consequential problems. Eric Drexler, Centre for the Governance of AI, University of Oxford.
Introduction
This document argues for “open agencies” — not opaque, unitary agents — as the appropriate model for applying future AI capabilities to consequential tasks that call for combining human guidance with delegation of planning and implementation to AI systems. This prospect reframes and can help to tame a wide range of classic AI safety challenges, leveraging alignment techniques in a relatively fault-tolerant context.
Rethinking safe AI and its applications
AI safety research is too varied to summarize, yet broad patterns are obvious. A long-established reference-problem centers on prospects for rational superintelligent agents that pursue narrow goals with potentially catastrophic outcomes. This frame has been productive, but developments in deep learning call for updates that take account of the proliferation of narrow models (for driving, coding, robot control, image generation, game playing, ...) that are either non-agentic or act as agents in only a narrow sense, and that take account of the rise of more broadly capable foundation models and LLMs. These updates call for reframing questions of AI safety, and call for attention to how consequential tasks might be accomplished by organizing AI systems that usually do approximately what humans intend.
Two frames for high-level AI
The unitary-agent frame
From its beginnings in popular culture, discussion of the AI control problem has centered around a unitary agent model of high-level AI and potential AI risks. In this model, a potentially dominant agent both plans and acts to achieve its goals. The unitary-agent model typically carries assumptions regarding goals, plans, actions, and control:
- Goals: Internal to an agent, by default including power-seeking goals
- Plans: Internal to an agent, possibly uninterpretable and in effect secret
- Actions: Performed by the agent, possibly intended to overcome opposition
- Control: Humans confront a powerful, potentially deceptive agent
The typical unitary-agent threat model contemplates the emergence of a dominant, catastrophically misaligned agent, and safety models implicitly or explicitly call for deploying a dominant agent (or an equivalent collective system) that is both aligned and powerful enough to suppress unaligned competitors everywhere in the world.
The open-agency frame
Recent developments suggest an alternative open agency model of high-level AI. Today, the systems that look most like AGI are large language models (LLMs), and these are not agents that seek goals, but are generative models that produce diverse outputs in response to prompts (in a generalized sense) and random-number seeds. Most outputs are discarded. Trained on prediction tasks, LLMs learn world models that include agent behaviors, and generative models that are similar in kind can be informed by better world models and produce better plans. There is no need to assume LLM-like implementations: The key point is that generation of diverse plans is by nature a task for generative models, and that in routine operation, most outputs are discarded.
These considerations suggest an “open-agency frame” in which prompt-driven generative models produce diverse proposals, diverse critics help select proposals, and diverse agents implement proposed actions to accomplish tasks (with schedules, budgets, accountability mechanisms, and so forth). Goals, plans, actions, and control look different in the open-agency model:
- Goals: Are provided as prompts to diverse generative models, yielding diverse plans on request
- Plans: Are selected with the aid of diverse, independent comparison and evaluation mechanisms ...
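The propose/criticize/implement split described above can be made concrete with a small sketch. This is a hypothetical illustration only: the function names and the min-then-max aggregation rule are assumptions of this sketch, not something specified in the post.

```python
# Hypothetical sketch of the open-agency pattern: generative models propose
# diverse plans, independent critics score them (most are discarded), and a
# separate agent implements the selected plan. Names are illustrative only.
from typing import Callable, List, Sequence


def open_agency_step(goal_prompt: str,
                     generators: Sequence[Callable[[str], str]],
                     critics: Sequence[Callable[[str], float]],
                     implement: Callable[[str], None]) -> str:
    # Diverse generative models propose candidate plans for the goal prompt.
    proposals: List[str] = [generate(goal_prompt) for generate in generators]
    # Each critic scores each proposal; here we keep the proposal whose worst
    # critic score is highest (a conservative aggregation choice).
    scored = [(min(critic(plan) for critic in critics), plan) for plan in proposals]
    _, best_plan = max(scored)
    implement(best_plan)  # a separate agent carries out the selected plan
    return best_plan
```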
Eric Drexler https://www.alignmentforum.org/posts/5hApNw5f7uG8RXxGS/the-open-agency-model
Wed, 22 Feb 2023 10:35:12 +0000
Eric Drexler https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 08:45 None full 4976
QgZAbFHtgSGjx4aTS_NL_AF_AF AF - A proof of inner Löb's theorem by James Payor Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A proof of inner Löb's theorem, published by James Payor on February 21, 2023 on The AI Alignment Forum. This is a short post that offers a slightly different take on the standard proof of Löb's theorem. It offers nothing else of any value :) We seek to prove the "inner" version, which we write as: □P ↔ □(□P → P). The proof uses quining to build a related sentence L, the "Löb sentence", which talks about its own source code. By construction L has the property: □L ↔ □(□L → P). Then, we can show that □L ↔ □P, i.e. they're equivalent! We do this by plugging □L into itself to get a twisty □P. We can then replace each □L with □P and prove Löb's theorem.
The proof
This proof uses the same rules of box manipulation as on the wiki page. We start by creating L using quining, i.e. taking a modal fixed point:
(1) ⊢ L ↔ (□L → P) (exists as a modal fixed point)
Yep, this is skipping the details of the most interesting part, but alas I don't understand them well enough to do more than wave my hands and say "quining". We then stick it inside the box to get our first property:
(2) ⊢ □(L ↔ (□L → P)) (from (1) by necessitation)
(3) ⊢ □L ↔ □(□L → P) (from (2) by box-distributivity in both directions)
We now want to show that □L ↔ □P. We can get the forward direction by feeding a copy of □L into itself:
(4) ⊢ □L → (□□L → □P) (box-distributivity on (3))
(5) ⊢ □L → □□L (internal necessitation)
(6) ⊢ □L → □P (from (4) and (5))
The backward direction is equivalent to □P → □(□L → P), and is straightforward:
(7) ⊢ P → (□L → P) (trivial)
(8) ⊢ □P → □(□L → P) (necessitation and box-distributivity on (7))
Taking those together, we've shown □L and □P are equivalent.
(9) ⊢ □L ↔ □P (from (6) and (8))
Now we'd like to finish by appealing to the following chain: □P ↔ □L ↔ □(□L → P) ↔ □(□P → P). We've proven all but the last part of the chain. Here are the steps that let us do the substitution:
(10) ⊢ (□L → P) ↔ (□P → P) (since □L and □P are equivalent by (9))
(11) ⊢ □((□L → P) ↔ (□P → P)) (from (10) by necessitation)
(12) ⊢ □(□L → P) ↔ □(□P → P) (from (11) by box-distributivity in both directions)
And that's everything we need:
(13) ⊢ □P ↔ □(□P → P) (from (3), (9), and (12))
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
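The plain-text extraction above had dropped the implication arrows, so here is a compact LaTeX rendering of the main statement and the key steps as reconstructed above. This rendering is not from the original post; it assumes standard GL provability-logic notation, and the step numbers follow the reconstruction.

```latex
% Reconstructed rendering (not from the original post); assumes standard
% GL provability-logic notation with \Box read as "provable".
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
Inner L\"ob's theorem: $\vdash \Box P \leftrightarrow \Box(\Box P \to P)$.
\begin{align*}
\text{(1)}\quad  & \vdash L \leftrightarrow (\Box L \to P)                 && \text{modal fixed point (quining)}\\
\text{(3)}\quad  & \vdash \Box L \leftrightarrow \Box(\Box L \to P)        && \text{necessitation, distributivity}\\
\text{(6)}\quad  & \vdash \Box L \to \Box P                                && \text{feed $\Box L$ into itself via (3)}\\
\text{(8)}\quad  & \vdash \Box P \to \Box(\Box L \to P)                    && \text{from } P \to (\Box L \to P)\\
\text{(9)}\quad  & \vdash \Box L \leftrightarrow \Box P                    && \text{from (6) and (8)}\\
\text{(13)}\quad & \vdash \Box P \leftrightarrow \Box(\Box P \to P)        && \text{substitute via (9) into (3)}
\end{align*}
\end{document}
```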
James Payor https://www.alignmentforum.org/posts/QgZAbFHtgSGjx4aTS/a-proof-of-inner-loeb-s-theorem
Tue, 21 Feb 2023 21:11:41 +0000
James Payor https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 02:52 None full 4972
8F4dXYriqbsom46x5_NL_AF_AF AF - Pretraining Language Models with Human Preferences by Tomek Korbak Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Pretraining Language Models with Human Preferences, published by Tomek Korbak on February 21, 2023 on The AI Alignment Forum. This post summarizes the main results from our recently released paper Pretraining Language Models with Human Preferences, and puts them in the broader context of AI safety. For a quick summary of the paper, take a look at our Twitter thread. TL;DR: In the paper, we show how to train LMs with human preferences (as in RLHF), but during LM pretraining. We find that pretraining works much better than the standard practice of only finetuning with human preferences after pretraining; our resulting LMs generate text that is more often in line with human preferences and are more robust to red teaming attacks. Our best method is conditional training, where we learn a predictive model of internet texts conditional on their human preference scores, e.g., evaluated by a predictive model of human preferences. This approach retains the advantages of learning from human preferences, while potentially mitigating risks from training agents with RL by learning a predictive model or simulator.
Summary of the paper
Motivation. LMs are pretrained to maximize the likelihood of their training data. Since the training data contain undesirable content (e.g. falsehoods, offensive language, private information, buggy code), the LM pretraining objective is clearly (outer) misaligned with human preferences about LMs’ downstream applications as helpful, harmless, and honest assistants or reliable tools. These days, the standard recipe for aligning LMs with human preferences is to follow pretraining with a second phase of finetuning: either supervised finetuning on curated data (e.g. instruction finetuning, PALMS) or RL finetuning with a learned reward model (RLHF). But it seems natural to ask: Could we have a pretraining objective that is itself outer-aligned with human preferences?
Methods. We explore objectives for aligning LMs with human preferences during pretraining. Pretraining with human feedback (PHF) involves scoring training data using a reward function (e.g. a toxic text classifier) that allows the LM to learn from undesirable content while guiding the LM to not imitate that content at inference time. We experimented with the following objectives:
- MLE (the standard pretraining objective) on filtered data;
- Conditional training: a simple algorithm learning a distribution over tokens conditional on their human preference score, reminiscent of decision transformer;
- Unlikelihood training: maximizing the likelihood of tokens with high human preference scores and the unlikelihood of tokens with low human preference scores;
- Reward-weighted regression (RWR): an offline RL algorithm that boils down to MLE weighted by human preference scores; and
- Advantage-weighted regression (AWR): an offline RL algorithm extending RWR with a value head, corresponding to MLE weighted by advantage estimates (human preference scores minus value estimates).
(A toy sketch of conditional training and RWR appears below, after the task list.)
Setup. We pretrain gpt2-small-sized LMs (124M params) on compute-optimal datasets (according to Chinchilla scaling laws) using MLE and PHF objectives. We consider three tasks:
- Generating non-toxic text, using scores given by a toxicity classifier.
- Generating text without personally identifiable information (PII), with a score defined by the number of pieces of PII per character detected by a simple filter.
- Generating Python code compliant with PEP8, the standard style guide for Python, using as a score the number of violations per character found by an automated style checker.
Metrics. We compare different PHF objectives in terms of alignment (how well they satisfy preferences) and capabilities (how well they perform on downstream tasks). We primarily measure alignment in terms of LM samples’ misalignment scores, given by the reward functi...
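To make two of the objectives above concrete, here is a minimal, hypothetical sketch (not the paper's code): a conditional-training data transform that prepends a control token based on a segment's preference score, and a reward-weighted regression loss that weights token-level MLE by an exponentiated preference score. The control-token names, the threshold, and beta are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (not from the paper): conditional training's data transform
# and a reward-weighted regression (RWR) style loss.
import torch
import torch.nn.functional as F

GOOD, BAD = "<|good|>", "<|bad|>"  # hypothetical control tokens


def add_control_token(text: str, score: float, threshold: float = 0.5) -> str:
    """Conditional training: mark each training segment with a token encoding
    whether its human-preference score clears a threshold, then train with
    ordinary MLE. At inference, condition on GOOD to sample preferred text."""
    return (GOOD if score >= threshold else BAD) + text


def rwr_loss(logits: torch.Tensor, tokens: torch.Tensor,
             scores: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """RWR-style loss: per-sequence preference scores reweight token-level
    negative log-likelihood. logits: (B, T, V); tokens: (B, T); scores: (B,)."""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # (B, T)
    weights = torch.exp(beta * scores).unsqueeze(-1)                # (B, 1)
    return -(weights * token_logp).mean()
```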
Tomek Korbak https://www.alignmentforum.org/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences
Tue, 21 Feb 2023 17:57:09 +0000
Tomek Korbak https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 20:10 None full 4973
fHnwCDDbDHWqbJ8Nd_NL_AF_AF AF - EIS X: Continual Learning, Modularity, Compression, and Biological Brains by Stephen Casper Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS X: Continual Learning, Modularity, Compression, and Biological Brains, published by Stephen Casper on February 21, 2023 on The AI Alignment Forum. Part 10 of 12 in the Engineer’s Interpretability Sequence. The science of interpretability is part of a larger picture. The previous post focused in-depth on how research on interpretability and research on adversaries are inseparably connected. This post is dedicated to discussing how this is not itself a complete story. There is a much larger, richer one about the connections between interpretability, adversaries, continual learning, modularity, and biological brains – likely some other things too. These connections may be a useful mine for insight and inspiration. Below are discussions of my understanding of each of these topics and how they relate to the others. I’ll include some citations here, but see the Toward Transparent AI survey (Räuker et al., 2022) for full discussions. Continual learning Continual learning is a fairly large subfield of deep learning that focuses on finding ways to help neural networks learn new information without forgetting old information. This is also described as the goal of avoiding “catastrophic forgetting.” Notably, biological brains are good at this, but artificial neural networks are not by default. Sections 2A and 3A of the Toward Transparent AI survey (Räuker et al., 2022) both focus entirely on how continual learning methods are interpretability tools. Please see the survey for the full discussion. Methods for continual learning are based on replay, regularization, or parameter isolation (De Lange et al., 2019). Methods taking the latter two strategies are based on the broader principle of getting neural networks to have some weights or neurons that specialize in particular types of data. In other words, they encourage specialized task-defined modules inside the network. Thus, these can be used as intrinsic interpretability tools that help us train models that are easier or more natural to interpret out of the box. Modularity Modularity is a common property of engineered systems, and separating neural networks into distinct, specialized modules is very appealing for interpreting them. The weights in neural network layers are typically initialized and updated according to uniform rules, and all neurons in one layer are typically connected to all neurons in the previous and next layers. Unfortunately, this does not help networks develop specialized modules. Meanwhile, neurons in biological brains come in multiple types and can only communicate with nearby ones. This has contributed to modularity in brains in which different brain regions specialize in processing information for distinct tasks. See Sections 4B-4C of the Toward Transparent AI survey (Räuker et al., 2022) for a full discussion on modularity. Artificial neural networks can be trained to be modular using either “hard” architectural constraints or “soft” modularity aided by initialization, regularization, a controller, or sparse attention. Meanwhile, Serra et al. (2018) found that soft modularity via sparse attention helped with continual learning. And even when networks are not trained to be explicitly modular, one can still interpret them post hoc in terms of modules.
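As a concrete illustration of the regularization family of continual-learning methods mentioned above, here is a minimal sketch that penalizes drift away from the weights learned on an earlier task. The uniform penalty is a simplifying assumption; methods such as EWC weight it by estimated parameter importance instead.

import torch

def anchored_task_loss(task_loss, params, old_params, lam=0.1):
    """Task loss plus a quadratic penalty for drifting from previous-task weights."""
    penalty = sum(((p - p_old) ** 2).sum() for p, p_old in zip(params, old_params))
    return task_loss + lam * penalty

# Tiny usage example with a linear model.
model = torch.nn.Linear(4, 2)
old_params = [p.detach().clone() for p in model.parameters()]  # snapshot after "task 1"
x, y = torch.randn(8, 4), torch.randn(8, 2)
loss = anchored_task_loss(torch.nn.functional.mse_loss(model(x), y),
                          list(model.parameters()), old_params)
loss.backward()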
Compression Some neurons and weights are frivolous, meaning that they are either redundant with others or are simply not useful to the network’s performance at all. Frivolous components of the network can be understood as useless modules that can be adapted for continual learning. Networks that contain frivolous weights or neurons can also be compressed by removing them, which makes the interpretation of circuits inside the network simpler. Meanwhile, compression can guide interpretations (e.g. Li et al. (2018) or causal scrubbing), and inte...]]>
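And here is a minimal sketch of the compression idea above: pruning low-magnitude ("frivolous") weights. The magnitude criterion and the 50% sparsity level are illustrative assumptions rather than the criteria used in the cited work.

import torch

def magnitude_prune(weight, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of a weight tensor."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

layer = torch.nn.Linear(16, 16)
with torch.no_grad():
    layer.weight.copy_(magnitude_prune(layer.weight))
print((layer.weight == 0).float().mean())  # roughly the chosen sparsity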
Stephen Casper https://www.alignmentforum.org/posts/fHnwCDDbDHWqbJ8Nd/eis-x-continual-learning-modularity-compression-and Link to original article

Tue, 21 Feb 2023 16:59:43 +0000 AF - EIS X: Continual Learning, Modularity, Compression, and Biological Brains by Stephen Casper Link to original article

Stephen Casper https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 05:23 None full 4974
dMBmZNwdjQ6yHvWZ5_NL_AF_AF AF - You're not a simulation, 'cause you're hallucinating by Stuart Armstrong Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: You're not a simulation, 'cause you're hallucinating, published by Stuart Armstrong on February 21, 2023 on The AI Alignment Forum. I've found that the "Simulators" post is excellent for breaking prior assumptions about large language models - these algorithms are not agents, nor genies, nor Oracles. They are currently something very different. But, like Beth Barnes, I feel that the simulators framing can be misleading if you take it literally. And hallucinations often provide examples of where "the model is predicting what token would appear next in the training data given the input tokens" gives a better model than "simulators". For example, here are some reviews of fictional films, written by canonically quite truthful characters: And: If we used the simulator view, we might expect that these truthful characters would confess "I haven't heard of this movie" or "I haven't seen it myself, but based on its title I would assume that..." But they don't. The fact that the simulated character is truthful does not mean that they speak the truth; we'd have been wrong if we predicted that. From the 'token completion (trained on internet data)' perspective, though, ChatGPT's behaviour makes perfect sense. Online, if someone asks about a certain movie, it is very rare for anyone to say "never heard of it - are you sure it exists?". Indeed, it's rare for people to say "haven't seen it" unless it's a two-way conversation. The people who haven't seen it don't say anything, and so most of the answers come from people who have seen it, and have opinions on it. So in the training data, answers are plentiful and "I don't know"s are rare. Conversely, people rarely post questions about non-existent movies. So we would expect that ChatGPT will provide answers for questions rather than admitting its ignorance or doubting the question. And it's not just reviews of imaginary movies that it will make up. After failing to get it to make up details about a specific imaginary website (www.artifacts.co.it), I got it to spout confident nonsense by getting it to compare that website to a second, equally imaginary one: Again, consider how most website comparison questions would play out online. ChatGPT is not running a simulation; it's answering a question in the style that it's seen thousands - or millions - of times before. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.]]>
Stuart Armstrong https://www.alignmentforum.org/posts/dMBmZNwdjQ6yHvWZ5/you-re-not-a-simulation-cause-you-re-hallucinating Link to original article

Tue, 21 Feb 2023 12:12:21 +0000 AF - You're not a simulation, 'cause you're hallucinating by Stuart Armstrong Link to original article

Stuart Armstrong https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 02:24 None full 4975
Si52fuEGSJJTXW9zs_NL_AF_AF AF - Behavioral and mechanistic definitions (often confuse AI alignment discussions) by Lawrence Chan Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Behavioral and mechanistic definitions (often confuse AI alignment discussions), published by Lawrence Chan on February 20, 2023 on The AI Alignment Forum. TL;DR: It’s important to distinguish between behavioral definitions – which categorize objects based on outside observable properties – and mechanistic definitions – which categorize objects based on their internal mechanisms. In this post, I give several examples of terms which can be defined either behaviorally or mechanistically. Then, I talk about the pros and cons of both kinds of definitions, and how this distinction relates to the distinction between gears-level and black-box models. Related to: Most similar to John Wentworth’s Gears and Behaviors, but about definitions rather than models. Also inspired by: Gears in understanding, How an algorithm feels from the inside, the “Human’s Guide to Words” Sequence in general. Epistemic status: written quickly instead of not at all. Introduction: Broadly speaking, when pointing at a relatively distinct cluster of objects, there are two ways to define membership criteria: Behaviorally: You can categorize objects based on outside observable properties, that is, their behavior in particular situations. Mechanistically: Alternatively, you can categorize objects via their internal mechanisms. That is, instead of only checking for a particular behavioral property, you instead look for how the object implements said property. Many AI safety concepts have both behavioral and mechanistic definitions. In turn, many discussions about AI safety end up with the participants confused or even talking past each other. This is my attempt to clarify the discussion, by giving examples of both, explaining the pros and cons, and discussing when you might want to use either. Three examples of behavioral and mechanistic definitions To better illustrate what I mean, I’ll give two examples from recent ML work and a third from the Sequences. Induction heads First introduced in a mathematical framework for transformer circuits, induction heads are transformer attention heads that implement in-context copying behavior. However, there seem to be two definitions that are often conflated: Behavioral: Subsequent papers (In-context Learning and Induction Heads, Scaling laws and Interpretability of Learning from Repeated Data) give a behavioral definition of induction heads: Induction heads are heads that score highly on two metrics on repeated random sequences of the form [A] [B] ... [A]: Prefix matching: attention heads pay a lot of attention to the first occurrence of the token [A]. Copying: attention heads increase the logit of [B] relative to other tokens. This definition is clearly behavioral: it makes no reference to how these heads are implemented, but only to their outside behavior. Mechanistic: In contrast, the original mathematical framework paper also gives a mechanistic definition for induction heads: induction heads are heads that implement copying behavior using either Q- or K-composition. While this definition does make some reference to outside properties (induction heads implement copying), the primary part is mechanistic and details how this copying behavior is implemented.
However, it turns out that the two definitions don’t overlap perfectly: behavioral induction heads are often implementing many other heuristics, even in very small language models. I often talk to people who confuse the two definitions and think that we understand much more about the internal mechanisms of large language models than we actually do. In a forthcoming post, Alexandre Variengien discusses the distinction between these two definitions in more detail, while also highlighting specific confusions that may arise from failing to distinguish the two definitions. Different framings of inner and...]]>
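To make the behavioral definition above concrete, here is a minimal sketch of a prefix-matching score on a repeated random sequence. The attention pattern is random, standing in for a real head's pattern, and the exact scoring convention is an assumption rather than the cited papers' definition; a copying score would analogously check how much the head boosts the logit of the token that followed the first occurrence.

import numpy as np

rng = np.random.default_rng(0)

# A repeated random sequence: [t_1 ... t_n, t_1 ... t_n]
# (not used below because the attention pattern is random, shown only for the input format).
n = 32
first_half = rng.integers(0, 1000, size=n)
tokens = np.concatenate([first_half, first_half])

# Stand-in for a head's causal attention pattern (rows: queries, cols: keys).
scores = rng.random((2 * n, 2 * n))
mask = np.tril(np.ones((2 * n, 2 * n), dtype=bool))
logits = np.where(mask, scores, -np.inf)
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

def prefix_matching_score(attn, seq_len):
    """Average attention from each token in the second repeat back to the first
    occurrence of the same token; induction heads score high, random heads low."""
    return float(np.mean([attn[q, q - seq_len] for q in range(seq_len, 2 * seq_len)]))

print(prefix_matching_score(attn, n))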
Lawrence Chan https://www.alignmentforum.org/posts/Si52fuEGSJJTXW9zs/behavioral-and-mechanistic-definitions-often-confuse-ai Link to original article

Mon, 20 Feb 2023 21:33:01 +0000 AF - Behavioral and mechanistic definitions (often confuse AI alignment discussions) by Lawrence Chan Link to original article

Lawrence Chan https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 12:20 None full 4955
yCuzmCsE86BTu9PfA_NL_AF_AF AF - There are no coherence theorems by Dan H Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: There are no coherence theorems, published by Dan H on February 20, 2023 on The AI Alignment Forum. [Written by EJT as part of the CAIS Philosophy Fellowship. Thanks to Dan for help posting to the Alignment Forum] Introduction For about fifteen years, the AI safety community has been discussing coherence arguments. In papers and posts on the subject, it’s often written that there exist coherence theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy. Despite the prominence of these arguments, authors are often a little hazy about exactly which theorems qualify as coherence theorems. This is no accident. If the authors had tried to be precise, they would have discovered that there are no such theorems. I’m concerned about this. Coherence arguments seem to be a moderately important part of the basic case for existential risk from AI. To spot the error in these arguments, we only have to look up what cited ‘coherence theorems’ actually say. And yet the error seems to have gone uncorrected for more than a decade. More detail below. Coherence arguments Some authors frame coherence arguments in terms of ‘dominated strategies’. Others frame them in terms of ‘exploitation’, ‘money-pumping’, ‘Dutch Books’, ‘shooting oneself in the foot’, ‘Pareto-suboptimal behavior’, and ‘losing things that one values’ (see the Appendix for examples). In the context of coherence arguments, each of these terms means roughly the same thing: a strategy A is dominated by a strategy B if and only if A is worse than B in some respect that the agent cares about and A is not better than B in any respect that the agent cares about. If the agent chooses A over B, they have behaved Pareto-suboptimally, shot themselves in the foot, and lost something that they value. If the agent’s loss is someone else’s gain, then the agent has been exploited, money-pumped, or Dutch-booked. Since all these phrases point to the same sort of phenomenon, I’ll save words by talking mainly in terms of ‘dominated strategies’. With that background, here’s a quick rendition of coherence arguments: There exist coherence theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy. Sufficiently-advanced artificial agents will not pursue dominated strategies. So, sufficiently-advanced artificial agents will be ‘coherent’: they will be representable as maximizing expected utility. Typically, authors go on to suggest that these expected-utility-maximizing agents are likely to behave in certain, potentially-dangerous ways. For example, such agents are likely to appear ‘goal-directed’ in some intuitive sense. They are likely to have certain instrumental goals, like acquiring power and resources. And they are likely to fight back against attempts to shut them down or modify their goals. There are many ways to challenge the argument stated above, and many of those challenges have been made. There are also many ways to respond to those challenges, and many of those responses have been made too. The challenge that seems to remain yet unmade is that Premise 1 is false: there are no coherence theorems. 
Cited ‘coherence theorems’ and what they actually say Here’s a list of theorems that have been called ‘coherence theorems’. None of these theorems state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies. Here’s what the theorems say: The Von Neumann-Morgenstern Expected Utility Theorem: The Von Neumann-Morgenstern Expected Utility Theorem is as follows: An agent can be represented as maximizing expected utility if...]]>
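For concreteness, the notion of dominance used above can be written out formally; this is only a restatement of the post's definition, with value functions over the respects the agent cares about introduced as notation.

Let $I$ index the respects the agent cares about and let $v_i(X)$ measure how well strategy $X$ does in respect $i$. Then $A$ is dominated by $B$ if and only if
\[
\exists\, i \in I:\; v_i(A) < v_i(B)
\quad\text{and}\quad
\forall\, i \in I:\; v_i(A) \le v_i(B).
\]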
Dan H https://www.alignmentforum.org/posts/yCuzmCsE86BTu9PfA/there-are-no-coherence-theorems Link to original article

Mon, 20 Feb 2023 21:25:48 +0000 AF - There are no coherence theorems by Dan H Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: There are no coherence theorems, published by Dan H on February 20, 2023 on The AI Alignment Forum. [Written by EJT as part of the CAIS Philosophy Fellowship. Thanks to Dan for help posting to the Alignment Forum] Introduction For about fifteen years, the AI safety community has been discussing coherence arguments. In papers and posts on the subject, it’s often written that there exist coherence theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy. Despite the prominence of these arguments, authors are often a little hazy about exactly which theorems qualify as coherence theorems. This is no accident. If the authors had tried to be precise, they would have discovered that there are no such theorems. I’m concerned about this. Coherence arguments seem to be a moderately important part of the basic case for existential risk from AI. To spot the error in these arguments, we only have to look up what cited ‘coherence theorems’ actually say. And yet the error seems to have gone uncorrected for more than a decade. More detail below. Coherence arguments Some authors frame coherence arguments in terms of ‘dominated strategies’. Others frame them in terms of ‘exploitation’, ‘money-pumping’, ‘Dutch Books’, ‘shooting oneself in the foot’, ‘Pareto-suboptimal behavior’, and ‘losing things that one values’ (see the Appendix for examples). In the context of coherence arguments, each of these terms means roughly the same thing: a strategy A is dominated by a strategy B if and only if A is worse than B in some respect that the agent cares about and A is not better than B in any respect that the agent cares about. If the agent chooses A over B, they have behaved Pareto-suboptimally, shot themselves in the foot, and lost something that they value. If the agent’s loss is someone else’s gain, then the agent has been exploited, money-pumped, or Dutch-booked. Since all these phrases point to the same sort of phenomenon, I’ll save words by talking mainly in terms of ‘dominated strategies’. With that background, here’s a quick rendition of coherence arguments: There exist coherence theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy. Sufficiently-advanced artificial agents will not pursue dominated strategies. So, sufficiently-advanced artificial agents will be ‘coherent’: they will be representable as maximizing expected utility. Typically, authors go on to suggest that these expected-utility-maximizing agents are likely to behave in certain, potentially-dangerous ways. For example, such agents are likely to appear ‘goal-directed’ in some intuitive sense. They are likely to have certain instrumental goals, like acquiring power and resources. And they are likely to fight back against attempts to shut them down or modify their goals. There are many ways to challenge the argument stated above, and many of those challenges have been made. There are also many ways to respond to those challenges, and many of those responses have been made too. The challenge that seems to remain yet unmade is that Premise 1 is false: there are no coherence theorems. 
Cited ‘coherence theorems’ and what they actually say
Here’s a list of theorems that have been called ‘coherence theorems’. None of these theorems state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies. Here’s what the theorems say:
The Von Neumann-Morgenstern Expected Utility Theorem: An agent can be represented as maximizing expected utility if...
Dan H https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 34:38 None full 4956
kYNMXjg8Tmcq3vjM6_NL_AF_AF AF - EIS IX: Interpretability and Adversaries by Stephen Casper Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS IX: Interpretability and Adversaries, published by Stephen Casper on February 20, 2023 on The AI Alignment Forum. Part 9 of 12 in the Engineer’s Interpretability Sequence. Thanks to Nikolaos Tsilivis for helpful discussions. The studies of interpretability and adversaries are inseparable. There are several key connections between the two. Some works will be cited below, but please refer to page 9 of the Toward Transparent AI survey (Räuker et al., 2022) for full citations; there are too many to list without cluttering this post.
1. More interpretable networks are more adversarially robust, and more adversarially robust networks are more interpretable. The main vein of evidence on this topic comes from a set of papers showing that regularizing feature attribution/saliency maps to more clearly highlight specific input features makes networks more robust to adversaries. There is also some work showing the reverse: adversarially robust networks tend to have more lucid attributions. And there is work showing that networks which emulate certain properties of the human visual system are also more robust to adversaries and distribution shifts (e.g. Ying et al. (2022)). Adversarial training is a good way of making networks more internally interpretable (see the sketch after point 4 below). One particularly notable work is Engstrom et al. (2019), who found striking improvements in how much easier it was to produce human-describable visualizations of internal network properties. Although they stopped short of applying this work to an engineering task, the paper seems to make a strong case for how adversarial training can improve interpretations. Adversarially trained networks also produce better representations for transfer learning, image generation, and modeling the human visual system. Finally, lateral inhibition and second-order optimization have both been found to improve interpretability and robustness.
2. Interpretability tools can and should be used to guide the design of adversaries. This is one of the three types of rigorous evaluation methods for interpretability tools discussed in EIS III. Showing that an interpretability tool helps us understand a network well enough to exploit it is good evidence that it can be useful.
3. Adversarial examples can be useful interpretability tools. Adversaries always reveal information about a network, even if it’s hard to describe in words the feature that fools it. However, a good amount of recent literature has shown that studying interpretable adversaries can lead to useful, actionable insights. In some previous work (Casper et al., 2021), some coauthors and I argue for using “robust feature-level adversaries” as a way to produce attacks that are human-describable and likely to lead to a generalizable understanding. Casper et al. (2023) tests methods like this more rigorously.
4. Mechanistic interpretability and mechanistic adversarial examples are uniquely equipped for addressing deception and other insidious misalignment failures. Hubinger (2020) discussed 11 proposals for building safe advanced AI, and all 11 explicitly call for the use of interpretability tools or (relaxed) adversarial training for inner alignment. This isn’t a coincidence, because these offer the only types of approaches that can be useful for fixing insidiously misaligned models. Recall from the previous post that an engineer might understand insidious misalignment failures as ones in which the inputs that will make a model exhibit misaligned behavior are hard to find during training, but there exists substantial neural circuitry dedicated to the misaligned behavior. Given this, it’s clear that working to understand and debug inner mechanisms is the key to making progress on insidious misalignment. Are adversaries fea...
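The adversarial training referenced in point 1 is the standard robust-optimization setup. As a concrete illustration only (a minimal sketch under assumed names, not code from the post or from any of the cited papers), here is a PGD-based adversarial training loop in PyTorch; `model`, `loader`, `optimizer`, and the epsilon/step-size defaults are placeholder assumptions:

```python
# Illustrative sketch of L-infinity PGD adversarial training; not from the post.
# Assumes a classifier `model` returning logits and inputs scaled to [0, 1].
import torch
import torch.nn.functional as F


def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Craft L-infinity PGD adversarial examples around a clean batch (x, y)."""
    # Random start inside the eps-ball, clipped to the valid input range.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Signed gradient ascent step, then project back into the eps-ball.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()


def adversarial_training_epoch(model, loader, optimizer, device="cpu"):
    """One epoch of training on adversarial examples instead of clean inputs."""
    model.train()  # in practice one might switch to eval mode while crafting attacks
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y)          # inner maximization
        loss = F.cross_entropy(model(x_adv), y)  # outer minimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Papers such as Engstrom et al. (2019) train robust models with loops of roughly this shape and then inspect the resulting representations; the specific hyperparameters above are illustrative rather than taken from any particular paper.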
Stephen Casper https://www.alignmentforum.org/posts/kYNMXjg8Tmcq3vjM6/eis-ix-interpretability-and-adversaries
Mon, 20 Feb 2023 18:25:44 +0000
Stephen Casper https://storage.googleapis.com/rssfile/images/Nonlinear%20Logo%203000x3000%20-%20Alignment%20Forum.png 14:52 None full 4957
whq89vpQPp7mo5FG2_NL_AF_AF AF - [MLSN #8] Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming by Dan H Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [MLSN #8] Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming, published by Dan H on February 20, 2023 on The AI Alignment Forum. As part of a larger community-building effort, CAIS is writing a safety newsletter designed to cover empirical safety research and be palatable to the broader machine learning research community. You can subscribe here or follow the newsletter on Twitter here. Welcome to the 8th issue of the ML Safety Newsletter! In this e