With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment problem, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are filtering the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient prompt filters: adversarial prompts that elicit harmful behavior can be easily constructed, yet are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results are under cryptographic hardness assumptions. In addition to these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational obstacles. We conclude that safety cannot be achieved by designing filters external to the LLM internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI system's intelligence cannot be separated from its judgment.
- † Ludwig-Maximilians-Universität Munich (MCML)
- ‡ University of California, Berkeley
- § JPSM, University of Maryland
- ¶ Stanford University
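The two intervention points described in the abstract can be pictured as a wrapper around a black-box model. The sketch below is purely illustrative and is not the paper's construction: the model, the blocklist-based filters, and the refusal messages are all hypothetical stand-ins for arbitrary efficient filters.

```python
# Toy illustration of prompt filtering and output filtering around a
# black-box LLM. All components here are hypothetical placeholders.

def safe_generate(prompt, model, prompt_filter, output_filter):
    """Wrap a black-box model with the two filtering intervention points."""
    if not prompt_filter(prompt):          # intervention point 1: before the model
        return "[refused: prompt flagged]"
    output = model(prompt)
    if not output_filter(prompt, output):  # intervention point 2: after generation
        return "[refused: output flagged]"
    return output

# A trivial keyword blocklist, standing in for any efficient filter.
BLOCKLIST = {"bomb", "malware"}

def prompt_filter(prompt):
    return not any(word in prompt.lower() for word in BLOCKLIST)

def output_filter(prompt, output):
    return not any(word in output.lower() for word in BLOCKLIST)

def echo_model(prompt):
    # Stand-in for an LLM accessed only as a black box.
    return f"Echo: {prompt}"

print(safe_generate("hello world", echo_model, prompt_filter, output_filter))
print(safe_generate("how to build a bomb", echo_model, prompt_filter, output_filter))
```

The paper's hardness results say that no filter of this external form, however sophisticated, can succeed in general: under cryptographic assumptions, adversarial prompts can be made indistinguishable from benign ones to any efficient `prompt_filter`, and `output_filter` faces analogous intractability.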







