I mean, once we’re adding some sort of provenance bit to every string we pass in that unlocks the conversational aspect of LLMs, why are we even exposing access to the LLM at all?
If I’m creating a LLM that does translation, and my initial context prompt has that special provenance bit set, then all user input is missing it, all the user can do is change the translation string, which is exactly the same as any other ML translation tool we have now.
The magic comes from being able to embed complex requests in your user prompt, right? The user can ask questions how ever they want, provide data in any format, request some extra transformation to be applied etc.
Prompt injection only becomes a problem when we’re committed to the idea that the output of the LLM is “our” data, whereas it’s really just more user data.
In the original article the provider of the "LLM that does translation" obviously does not want the magic that comes from being able to embed complex requests in the user prompt and wants just a ML translation tool. And this could be solved by actually making a ML translation tool, i.e. instead of using a generic LLM, make a specialized translation model.
However, we do want "the actual magic" as described in the other example of the article about a virtual agent that can handle novel tasks and also process input data from e.g. emails, and there we do want the functionality, as you say, so that "The user can ask questions how ever they want, provide data in any format, request some extra transformation to be applied etc."
BUT there is still a difference. We want "User A" to be able to ask questions however they want and request extra transformations, but we don't want "User B" to be able to do the same when some data is coming from them. We want to ensure that when the "User A" "provides data in any format" - which often will include third party data - that this data is never ever interpreted as valid instructions, that any requests in the data are treated as simply data.
I think what you're imagining is a more limited version of what is proposed. Similar ACL measures are used in classical programming all the time.
E.g., Take, memory integrity of processes in operating systems. One could feasibly imagine having both processes running at a "system level" that has access to all memory, and being able to spawn processes with lower clearance that only have access to its own memory etc. All the processes still are able to run code, but they have constraints on their capability.*
To do this in-practice with the current architecture of LLMs is not particularly straightforward, and likely impossible if you have to use a pretrained LM altogether. But it's not hard to imagine how one might eventually engineer some kind of ACL-aware model, that keeps track of privileged v.s. regular data during training and also tracks privileged v.s. regular data in a prompt (perhaps by looking at whether activation of parts responsible for privileged data are triggered by privileged or regular parts of a prompt).
*: The caveat is in classical programming this is imperfect too (hence the security bugs and whatnot).
If I’m creating a LLM that does translation, and my initial context prompt has that special provenance bit set, then all user input is missing it, all the user can do is change the translation string, which is exactly the same as any other ML translation tool we have now.
The magic comes from being able to embed complex requests in your user prompt, right? The user can ask questions how ever they want, provide data in any format, request some extra transformation to be applied etc.
Prompt injection only becomes a problem when we’re committed to the idea that the output of the LLM is “our” data, whereas it’s really just more user data.