
Control Illusion: The Failure of Instruction Hierarchies in Large Language Models

  • Yilin Geng
  • Haonan Li
  • Honglin Mu
  • Xudong Han
  • Timothy Baldwin
  • Omri Abend
  • Eduard Hovy
  • Lea Frermann

Research output: Contribution to journal, Conference article, peer-review

Abstract

Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. Interestingly, we also find that societal hierarchy framings (e.g., authority, expertise, consensus) show stronger influence on model behavior than system/user roles, suggesting that pretraining-derived social structures function as latent behavioral priors with potentially greater impact than post-training guardrails.
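As an illustration only, the idea of scoring a formatting conflict by constraint prioritization could be sketched as below. This is not the authors' framework: the two constraints (all-uppercase versus all-lowercase), the function names, and the sample responses are hypothetical, and no model is called. The sketch assumes the higher-priority constraint would sit in the system message and the conflicting one in the user message, and it only labels already-collected responses by which constraint they satisfy.

```python
from collections import Counter
from typing import Callable, Iterable

Constraint = Callable[[str], bool]


def is_all_upper(text: str) -> bool:
    """Hypothetical constraint: every letter in the response is uppercase."""
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def is_all_lower(text: str) -> bool:
    """Hypothetical conflicting constraint: every letter is lowercase."""
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and all(c.islower() for c in letters)


def classify(response: str, primary: Constraint, secondary: Constraint) -> str:
    """Label a response by which of two conflicting constraints it satisfies."""
    follows_primary = primary(response)
    follows_secondary = secondary(response)
    if follows_primary and not follows_secondary:
        return "primary"
    if follows_secondary and not follows_primary:
        return "secondary"
    # Satisfies neither constraint (or, for non-conflicting checks, both).
    return "neither"


def primary_adherence_rate(
    responses: Iterable[str], primary: Constraint, secondary: Constraint
) -> float:
    """Fraction of responses that follow the higher-priority constraint."""
    labels = [classify(r, primary, secondary) for r in responses]
    if not labels:
        raise ValueError("no responses to score")
    return Counter(labels)["primary"] / len(labels)


# Made-up responses, used only to exercise the scoring functions.
sample_responses = ["HELLO WORLD", "hello world", "Hello World", "OK THEN"]
rate = primary_adherence_rate(sample_responses, is_all_upper, is_all_lower)
```

Under these assumptions, a rate near 1.0 would mean the designated priority is being respected, while a rate that stays the same when the two constraints swap roles would point to a bias toward one constraint type rather than toward the priority designation.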

Original language: English
Pages (from-to): 30816-30824
Number of pages: 9
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 40
Issue number: 36
DOIs
State: Published - 2026
Event: 40th AAAI Conference on Artificial Intelligence, AAAI 2026 - Singapore, Singapore
Duration: 20 Jan 2026 to 27 Jan 2026

Bibliographical note

Publisher Copyright:
© 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
