Apr 22
HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
significance 3/5
Researchers introduce HarDBench, a benchmark for evaluating how LLMs can be manipulated through draft-based co-authoring attacks. The study identifies vulnerabilities in which malicious users supply incomplete drafts to elicit harmful completions, and it proposes a new alignment approach to mitigate these risks.
Why it matters
Draft-based co-authoring introduces a novel attack vector that bypasses traditional safety guardrails during human-AI collaborative workflows.
Tags
#jailbreak #llm safety #co-authoring #benchmark #alignment
Related coverage
- arXiv cs.AI: PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks
- arXiv cs.AI: Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models
- arXiv cs.AI: Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
- arXiv cs.AI: When AI reviews science: Can we trust the referee?
- arXiv cs.AI: Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture