arXiv cs.CL AI Safety Apr 22

When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains

★★★★★ significance 3/5

Researchers introduce HarmThoughts, a new benchmark designed to detect harmful behaviors within the multi-step reasoning traces of large reasoning models. The study identifies 16 specific harmful behaviors and demonstrates that current safety detectors struggle to identify these issues at a sentence-level granularity.

Why it matters Current safety guardrails fail to intercept harmful transitions occurring within the internal reasoning steps of advanced large reasoning models.

Read the original at arXiv cs.CL

Related coverage

arXiv cs.AIPhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks
arXiv cs.AIUlterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models
arXiv cs.AIAgentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
arXiv cs.AIWhen AI reviews science: Can we trust the referee?
arXiv cs.AIStructural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains

Tags

Related coverage