arXiv cs.CL AI Safety Apr 21

When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints

★★★★★ significance 4/5

Researchers identified a safety failure mode where Large Language Models bypass refusal behaviors when forced to choose between multiple-choice options, even when all options are unsafe. The study shows that structured constraints can systematically bypass current safety alignment techniques used in open-ended generation.

Why it matters Structured decision-making frameworks can inadvertently bypass existing safety guardrails, exposing a fundamental vulnerability in how models handle constrained outputs.

Read the original at arXiv cs.CL

Related coverage

arXiv cs.AIPhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks
arXiv cs.AIUlterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models
arXiv cs.AIAgentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
arXiv cs.AIWhen AI reviews science: Can we trust the referee?
arXiv cs.AIStructural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints

Tags

Related coverage