“to edit away undesired things like private data, stale knowledge, copyrighted materials, toxic/unsafe content, dangerous capabilities, and misinformation, without retraining models from scratch”
To say nothing of unlearning those safeguards and/or “safeguards”.
It sounds like you're mistakenly grouping together three very different methods of changing an AI's behaviour.
You have some model, M™, which can do Stuff. Some of the Stuff is, by your personal standards, Bad (I don't care what your standard is, roll with this).
You have three solutions:
1) Bolt on a post-processor that takes the output of M™ and, if the output is detectably Bad, censors it (see the sketch after this list).
Failure mode: this is trivial to remove; just delete the post-processor.
Analogy: put secret documents into a folder called "secret do not read".
2) Retrain the weights within M™ to have a similar effect to 1.
Failure mode: this is still fairly easy to remove, but it takes re-training to get there. Why? Because the weights containing this information are not completely zeroed out by this process.
Analogy: how and why "un-deletion" is possible on file systems.
3) Find and eliminate the weights within M™ that lead to the Bad output.
Analogy: "secure deletion" involves overwriting files with random data before unlinking them, possibly several times if it's a spinning disk.
--
People are still doing research on 3 to make sure the elimination actually happens, what with it being very important for a lot of different reasons, including legal obligations.
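For what it's worth, the mechanical half of 3 is easy to sketch; the research is in the "find" step, i.e. reliably locating which parameters are implicated. A conceptual sketch, assuming PyTorch, with the layer and the index list made up purely for illustration:

```python
# A conceptual sketch of the mechanical half of 3, assuming PyTorch. The layer
# and the index list are made up for illustration; the actual research problem
# is the attribution step that would produce bad_neuron_indices in the first place.
import torch
import torch.nn as nn

layer = nn.Linear(4096, 4096)           # stand-in for one block inside M™
bad_neuron_indices = [17, 902, 3031]    # hypothetical output of some attribution method

with torch.no_grad():
    # Overwrite rather than merely dampen: the point of "secure deletion" is that
    # the capability shouldn't be recoverable by a bit of further fine-tuning.
    layer.weight[bad_neuron_indices, :] = 0.0
    layer.bias[bad_neuron_indices] = 0.0
```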
Until we have a very different method of actually controlling LLM behavior, 1 is the only feasible one.
Your framing only makes sense when "Bad" is something so bad that we can't bear its existence, as opposed to just "commercially bad" where it shouldn't behave that way with an end user. In the latter, your choice 1 - imposing external guardrails - is fine. I'm not aware of anything LLMs can do that fits in the former category.
> Until we have a very different method of actually controlling LLM behavior, 1 is the only feasible one.
Most of the stuff I've seen is 2. I've only seen a few places use 1; you can tell the difference, because when an LLM pops out a message and then deletes it, that's type 1 behaviour, whereas if the first thing it outputs is directly a sequence of tokens saying (any variant of) "nope, not gonna do that", that's type 2 behaviour.
The research into going from type 2 to type 3 is the entirety of the article.
> Your framing only makes sense when "Bad" is something so bad that we can't bear its existence, as opposed to just "commercially bad" where it shouldn't behave that way with an end user. In the latter, your choice 1 - imposing external guardrails - is fine.
I disagree; I think my framing applies to all cases. Right now, LLMs are like old PCs with no user accounts and a single shared memory space, which is fine and dandy when you're not facing malicious input, but we live in a world with malicious input.
You might be able to use a type 1 solution, but it's going to be fragile and, more pertinently, slow: you only know to reject content once generation has finished, so you may end up in an unbounded loop of the LLM generating content that the censor rejects.
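Concretely, the only way to make a type 1 wrapper robust against rejection is a retry loop, and that's where the latency goes. A sketch, using the same kind of hypothetical `model_generate` and `is_bad` callables as in the sketch further up:

```python
# Why type 1 is slow: the censor can only judge a finished response, so a
# rejection means paying the full generation cost again. model_generate and
# is_bad are hypothetical callables, as in the earlier sketch.
def guarded_generate_with_retries(model_generate, is_bad, prompt: str,
                                  max_attempts: int = 5) -> str:
    for _ in range(max_attempts):
        output = model_generate(prompt)   # full generation latency, every attempt
        if not is_bad(output):
            return output
    # Without the cap, this loop is unbounded whenever M™ reliably produces Bad
    # output for this prompt; with the cap, the user waits N generations for nothing.
    return "[no acceptable response produced]"
```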
A type 2 solution is still fragile, but it just doesn't make the "bad" content in the first place — and, to be clear, "bad" in this context can be anything undesired, including "uses vocabulary too advanced for a 5 year old who just started school" if that's what you care about using some specific LLM for.
I think you mistakenly replied to my comment instead of one that made some sort of grouping?
Alternatively, you're assuming that because there is some possible technique that can't be reversed, it's no longer useful to remove the effects of techniques that _can_ be reversed?