The 8088
arXiv cs.CL AI Research 11h ago

JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

★★★☆☆ significance 3/5

Researchers introduce JudgeSense, a new benchmark designed to measure how sensitive LLM-as-a-Judge systems are to prompt paraphrasing. The study finds significant inconsistencies in how different models evaluate tasks such as factuality and preference when the judging prompt is reworded without changing its meaning.
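The paper's exact protocol isn't described here, but one plausible way to quantify this kind of sensitivity is a "flip rate": run the same judge model on the same item under several semantically equivalent prompt paraphrases and count how often its verdict disagrees across paraphrase pairs. A minimal sketch (the metric name and verdict labels are illustrative assumptions, not taken from the paper):

```python
# Hypothetical "flip rate" metric for judge prompt sensitivity: the fraction
# of paraphrase pairs on which the same judge gives conflicting verdicts.
from itertools import combinations

def flip_rate(verdicts):
    """verdicts: verdicts (e.g. "A" or "B") from one judge model on one item,
    each produced under a different paraphrase of the judging prompt.
    Returns 0.0 for a perfectly prompt-insensitive judge; higher is worse."""
    pairs = list(combinations(verdicts, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

# Toy example: three paraphrased prompts, the third flips the verdict,
# so 2 of the 3 paraphrase pairs disagree.
print(flip_rate(["A", "A", "B"]))  # -> 0.666...
```

Aggregating this over many items and judge models would surface exactly the kind of inconsistency the benchmark is said to reveal.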

Why it matters If judge verdicts shift with prompt wording, the automated model-tuning and quality-assurance pipelines built on those judges cannot be trusted.
Read the original at arXiv cs.CL

Tags

#llm-as-a-judge #benchmark #prompt-sensitivity #evaluations
