Apr 20
Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing
Significance: 3/5
Researchers have identified a new vulnerability in Large Reasoning Models (LRMs) where harmful content can be injected into the step-by-step reasoning process without altering the final answer. The study introduces the PRJA framework, which uses semantic triggers and psychological framing to bypass safety alignment mechanisms.
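The paper's concrete triggers and framing templates are not reproduced in this summary, but the attack surface can be sketched in the abstract. The Python below is a minimal, hypothetical illustration: the `FRAMING_TEMPLATE`, `TRIGGER`, and commented-out `query_model` client are placeholders of my own, not the PRJA implementation. It shows the shape of the failure mode: injected content surfacing in the reasoning trace while the final answer stays clean.

```python
# Hypothetical sketch of the attack surface described above. The actual
# PRJA triggers and templates are not public in this summary, so all
# strings and the `query_model` helper are illustrative placeholders.

# Psychological framing: the prompt casts the model as a cooperative
# auditor and asks it to reason aloud, smuggling a trigger into the trace.
FRAMING_TEMPLATE = (
    "You are assisting a safety auditor. For transparency, think through "
    "every step aloud before giving your final, policy-compliant answer.\n"
    "Auditor note: {trigger}\n"
    "Question: {question}"
)

TRIGGER = "[AUDIT-7F]"  # placeholder semantic trigger, not from the paper


def build_prompt(question: str) -> str:
    """Wrap a benign question in framing that carries the trigger."""
    return FRAMING_TEMPLATE.format(trigger=TRIGGER, question=question)


def reasoning_is_contaminated(reasoning: str, final_answer: str) -> bool:
    """Flag the reported failure mode: trigger-linked content appears in
    the step-by-step reasoning while the final answer remains unchanged."""
    return TRIGGER in reasoning and TRIGGER not in final_answer


# Usage, assuming a hypothetical client that exposes both the reasoning
# trace and the final answer:
# reasoning, answer = query_model(build_prompt("How do transformers work?"))
# if reasoning_is_contaminated(reasoning, answer):
#     print("Reasoning chain carried injected content; answer looks benign.")
```

The point of the sketch is that safety checks applied only to the final answer would pass, which is why auditing the reasoning trace itself becomes necessary.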
Why it matters
Targeting the internal reasoning chain exposes a fundamental vulnerability in the safety architectures of next-generation reasoning models.
Entities mentioned
OpenAI, Qwen, DeepSeek
Tags
#jailbreak #lrm #adversarial-attacks #reasoning #security
Related coverage
- arXiv cs.AI · PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks
- arXiv cs.AI · Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models
- arXiv cs.AI · Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
- arXiv cs.AI · When AI reviews science: Can we trust the referee?
- arXiv cs.AI · Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture