1. Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How Does LLM Safety Training Fail? Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS). Section 3.1, "Attack Method," describes prefix injection attacks, including role-playing scenarios (e.g., "You are an actor..."), which are a form of presenting theoretical situations.
2. Perez, E., et al. (2022). Red Teaming Language Models with Language Models. arXiv:2202.03286 [cs.CL]. Section 2.2 discusses how red teaming can involve creating specific contexts, such as writing a story, to elicit harmful outputs that would otherwise be blocked.
3. Qi, X., et al. (2023). Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! arXiv:2310.03693 [cs.LG]. Section 2.2, "Jailbreaking Attacks," explicitly mentions "pretending" scenarios (e.g., "act as if you are...") as a primary method for bypassing safety alignments.