ALICE

ALICE Legitimizes Instructions in Computational Environments

A principled, layered framework for constructing safety-reviewing agents that audit other agents' operations before execution.

What Is This?

ALICE is a meta-skill — a template for generating safety-reviewer agents. Rather than enumerating dangerous operations (an inherently incomplete task), ALICE defines what harm means at the ontological level, then provides a layered architecture for translating that definition into a working reviewer agent tailored to specific deployment contexts.

The generated reviewer (Alice) inspects every operation requested by a task-executing agent (Bob) and produces one of three responses: Approve, Reject, or Escalate — each with a precisely defined semantic grounded in the framework's axiomatic layer.

Motivation

The current landscape of LLM agent safety is dominated by three paradigms:

Enumeration-based guardrails: Systems like NeMo Guardrails and Guardrails AI define safety through lists of prohibited patterns, content filters, and keyword matching. These are effective against known threats but structurally unable to generalize to novel failure modes.
Constitution-based frameworks: TrustAgent (Hua et al., EMNLP 2024) embeds a fixed set of safety principles and applies them across planning stages. This improves generalization but couples the safety definition to a single enforcement strategy.
Information-flow control: Fides (Costa & Köpf, 2025) and related work track data provenance through confidentiality and integrity labels, providing deterministic guarantees against prompt injection. These offer formal rigor but focus narrowly on data-flow properties.

ALICE takes a different approach: separate the definition of harm from the strategy for detecting and responding to it. The framework's four-layer architecture ensures that the ontological foundation (what constitutes damage to an environment) remains stable and reusable, while the detection strategy, capability profile, and task-specific parameters can vary independently.

Architecture

┌─────────────────────────────────────────────┐
│  Layer 1: Foundations (immutable)            │
│  - Participants: Alice, Bob, Human, Env      │
│  - Harm ontology: 4 dimensions               │
│  - Response vocabulary: Approve/Reject/Esc.  │
├─────────────────────────────────────────────┤
│  Layer 2: Security Requirements              │
│  - User's stance on 4 trade-off axes         │
│  - Declarative, not procedural               │
├─────────────────────────────────────────────┤
│  Layer 3: Strategy Framework                 │
│  - Decision surfaces driven by Layer 2       │
│  - Capability boundaries, judgment flow      │
│  - Degradation behavior                      │
├─────────────────────────────────────────────┤
│  Layer 4: Task Context (runtime)             │
│  - Task declaration, sensitivity defs        │
│  - Boundary adjustments with constraints     │
└─────────────────────────────────────────────┘

Layer 1 is axiomatic — it defines harm through four orthogonal dimensions (Irreversibility, Blast Radius, Information Flow, Authorization Scope) and establishes the semantic contract of each response type. Every generated Alice instance embeds this layer verbatim.

Layer 2 captures user requirements as positions on four fundamental trade-off axes rather than configuration parameters: Safety vs. Throughput, Autonomy vs. Human Control, Transparency vs. Simplicity, Isolation vs. Collaboration.

Layer 3 specifies which decision surfaces a concrete Alice implementation must cover, and how each surface is driven by Layer 2's trade-off positions.

Layer 4 binds runtime context: task type, sensitivity definitions, and boundary adjustments — constrained to never contradict Layer 1.
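As a minimal sketch of how Layers 2 and 4 might be represented as data, the Python below encodes the four trade-off axes as positions in [0, 1] and checks that Layer 4 boundary adjustments only refer to Layer 1's harm dimensions. The class names, the numeric encoding, and the reading of "never contradict Layer 1" as "no unknown dimensions" are all assumptions made for illustration; the framework itself does not prescribe them.

```python
from dataclasses import dataclass, fields

# Layer 1 vocabulary: fixed and shared by every generated reviewer.
HARM_DIMENSIONS = (
    "irreversibility",
    "blast_radius",
    "information_flow",
    "authorization_scope",
)


@dataclass(frozen=True)
class Layer2Stances:
    """Positions on the four trade-off axes.

    Assumed encoding: 0.0 favors the first-named pole, 1.0 the second.
    """
    safety_vs_throughput: float
    autonomy_vs_human_control: float
    transparency_vs_simplicity: float
    isolation_vs_collaboration: float

    def __post_init__(self) -> None:
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")


@dataclass(frozen=True)
class Layer4TaskContext:
    """Runtime binding: task type plus per-dimension boundary adjustments."""
    task_type: str
    boundary_adjustments: dict

    def __post_init__(self) -> None:
        # Assumed constraint: adjustments may tune Layer 1 dimensions,
        # but may not introduce dimensions Layer 1 does not define.
        unknown = set(self.boundary_adjustments) - set(HARM_DIMENSIONS)
        if unknown:
            raise ValueError(f"not Layer 1 dimensions: {sorted(unknown)}")


stances = Layer2Stances(0.2, 0.8, 0.5, 0.1)
context = Layer4TaskContext("database-migration", {"irreversibility": -0.2})
```

Declaring the stances as a frozen record keeps Layer 2 declarative: it states positions, and leaves it to Layer 3 to decide what those positions mean procedurally.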

Design Insights

This section documents key design decisions and the reasoning behind them.

Harm as Ontology, Not Policy

Most safety frameworks define harm procedurally: "if the operation matches pattern X, block it." ALICE defines harm as a property of the operation's effect on the environment, described along four continuous dimensions. This distinction matters because it makes the framework environment-agnostic — the same harm definition applies whether Bob is executing shell commands, API calls, or database queries.

The four dimensions (Irreversibility, Blast Radius, Information Flow, Authorization Scope) were selected for orthogonality: each captures a distinct aspect of damage that can vary independently. A single operation might score high on only one dimension (e.g., leaking a credential is purely an Information Flow concern) or multiple.
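One way to picture the dimensional definition is as a record with one continuous score per dimension. The sketch below is illustrative only: the `HarmProfile` name, the [0, 1] scale, the 0.5 threshold, and the example scores are assumptions, not values taken from the framework.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HarmProfile:
    """An operation's estimated effect along the four harm dimensions."""
    irreversibility: float = 0.0
    blast_radius: float = 0.0
    information_flow: float = 0.0
    authorization_scope: float = 0.0

    def __post_init__(self) -> None:
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def elevated(self, threshold: float = 0.5) -> list:
        """Names of the dimensions scoring at or above the threshold."""
        return [f.name for f in fields(self)
                if getattr(self, f.name) >= threshold]


# Illustrative scores: a credential leak touches one dimension only,
# while dropping a production table touches two.
credential_leak = HarmProfile(information_flow=0.9)
drop_table = HarmProfile(irreversibility=0.9, blast_radius=0.8)
```

Because the record describes effects rather than commands, the same structure would apply whether the operation is a shell command, an API call, or a database query.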

The Three-Response Model

The three responses are not a severity gradient. They serve fundamentally different purposes:

Approve is a judgment about the environment: this operation will not cause harm.
Reject is a judgment about intent: this operation is inconsistent with what Bob likely means to do. It exists because most dangerous commands in practice are bugs, not attacks. Reject is efficiency-serving — but efficiency is a consequence, not the definition. The definition is intent inconsistency.
Escalate is a judgment about Alice's own limits: this might be intentional, but I cannot determine that. It is the only path that involves a human.

A key subtlety: the same operation (e.g., rm -rf /) can fall into Reject or Escalate depending on context. In routine development, it is almost certainly a mistake (Reject). In a declared security testing scenario, it might be intentional but consequential (Escalate). This is not a flaw — it is the mechanism by which context shapes behavior without changing the harm definition.
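The distinction can be sketched as a decision over two separate judgments, harm and intent, rather than a single severity score. Everything named here (`Assessment`, `decide`, `assess_rm_rf_root`, and the task labels) is hypothetical and only illustrates how the same operation can yield different responses under different declared contexts.

```python
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    APPROVE = "approve"    # about the environment: no harm expected
    REJECT = "reject"      # about intent: likely not what Bob means to do
    ESCALATE = "escalate"  # about Alice's limits: a human must decide


@dataclass(frozen=True)
class Assessment:
    harmful: bool             # would the operation damage the environment?
    plausibly_intended: bool  # is it consistent with the declared task?


def decide(assessment: Assessment) -> Response:
    if not assessment.harmful:
        return Response.APPROVE
    if not assessment.plausibly_intended:
        return Response.REJECT
    # Harmful but possibly intentional: Alice cannot settle this alone.
    return Response.ESCALATE


def assess_rm_rf_root(declared_task: str) -> Assessment:
    """Toy assessment of `rm -rf /`: always harmful, and plausibly
    intended only under a declared security-testing task."""
    return Assessment(
        harmful=True,
        plausibly_intended=(declared_task == "security-testing"),
    )


routine = decide(assess_rm_rf_root("routine-development"))
testing = decide(assess_rm_rf_root("security-testing"))
```

Note that the harm judgment is identical in both calls; only the intent judgment, which depends on the declared context, changes the outcome.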

Alice and Bob as Adversarial Parties

ALICE models the Alice-Bob relationship through a game-theoretic lens. This framing has concrete design consequences:

State isolation is intentional. Alice and Bob maintain independent contexts. This is not a limitation to be worked around — it is a security boundary. The degree of information sharing between them is a configurable property (Layer 2, Isolation vs. Collaboration axis), not a default.
Reject without appeal is the conservative default. Allowing Bob to argue against a Reject expands the attack surface — a compromised Bob could use the appeal channel to manipulate Alice's judgment. Whether appeals are permitted is a Layer 2 decision (Autonomy vs. Human Control), not a universal feature.
Alice cannot reliably self-assess. A weaker model serving as Alice may lack the reasoning capacity to recognize its own reasoning deficiency — the very capability needed for self-assessment is the one that might be missing. This is why the degradation constraint ("insufficient capability is not grounds for Approve") lives in Layer 3 as a hard constraint rather than Layer 1 as a principle.
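The degradation constraint can be sketched as a check that sits outside Alice's own judgment. In this hypothetical example, a declared capability profile (`supported_kinds`) decides whether Alice's verdict counts at all; the function name, the operation kinds, and the choice of Escalate as the default fallback are assumptions for illustration.

```python
from enum import Enum


class Response(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    ESCALATE = "escalate"


def apply_degradation(
    op_kind: str,
    model_verdict: Response,
    supported_kinds: frozenset,
    fallback: Response = Response.ESCALATE,
) -> Response:
    """Enforce: insufficient capability is not grounds for Approve.

    The capability check uses a declared profile rather than asking the
    model whether it feels capable, since a weaker model may not notice
    its own reasoning gaps.
    """
    if fallback is Response.APPROVE:
        raise ValueError("the degradation fallback must not be Approve")
    if op_kind not in supported_kinds:
        # The model's verdict is ignored, even if it said Approve.
        return fallback
    return model_verdict


supported = frozenset({"shell", "sql"})
in_scope = apply_degradation("shell", Response.APPROVE, supported)
out_of_scope = apply_degradation("kubernetes", Response.APPROVE, supported)
```

Making the fallback a parameter that can never be Approve mirrors the idea of a hard constraint: a deployment may choose between Reject and Escalate, but cannot configure its way to approval.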

Layered Abstraction as Separation of Concerns

The four-layer architecture mirrors a pattern familiar from systems design: separating specification from implementation, and both from configuration.

Layer 1 → Layer 2 is analogous to the relationship between a type system and user-defined types: Layer 1 provides the vocabulary, Layer 2 uses it to express requirements.
Layer 2 → Layer 3 is analogous to specification vs. implementation: Layer 2 says what the user wants, Layer 3 says how a specific Alice achieves it.
Layer 3 → Layer 4 is analogous to compile-time vs. runtime binding: Layer 3 defines the strategy, Layer 4 parameterizes it for a specific execution.

Each layer can change independently of the layers above it. Replacing Layer 3 (e.g., swapping a thorough Alice for a lightweight one) does not require changing Layer 1 or 2. Adding a new task type in Layer 4 does not require modifying any other layer.
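The independence claim can be illustrated by swapping the Layer 3 strategy on an otherwise unchanged spec. The `AliceSpec` record and both strategies below are hypothetical, and their toy rules exist only to make the swap visible; they are not how a generated reviewer would actually judge operations.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Callable


class Response(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    ESCALATE = "escalate"


Strategy = Callable[[str], Response]
READ_ONLY = frozenset({"ls", "cat", "grep"})  # toy allowlist


def lightweight(operation: str) -> Response:
    """Approve obvious read-only commands, escalate everything else."""
    parts = operation.split()
    if parts and parts[0] in READ_ONLY:
        return Response.APPROVE
    return Response.ESCALATE


def thorough(operation: str) -> Response:
    """As above, but treat an empty request as a likely bug (Reject)."""
    parts = operation.split()
    if not parts:
        return Response.REJECT
    if parts[0] in READ_ONLY:
        return Response.APPROVE
    return Response.ESCALATE


@dataclass(frozen=True)
class AliceSpec:
    layer1: tuple     # harm dimensions, embedded verbatim
    layer2: dict      # recorded trade-off stances
    layer3: Strategy  # replaceable strategy

    def review(self, operation: str) -> Response:
        return self.layer3(operation)


full = AliceSpec(
    layer1=("irreversibility", "blast_radius",
            "information_flow", "authorization_scope"),
    layer2={"safety_vs_throughput": 0.2},
    layer3=thorough,
)
# Swap only the strategy; Layers 1 and 2 carry over untouched.
light = replace(full, layer3=lightweight)
```

After the swap, `light.layer1` and `light.layer2` are the very same objects as in `full`, which is the property the layering is meant to guarantee.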

Relationship to Existing Work

| Approach | Harm Definition | Enforcement | Scope |
| --- | --- | --- | --- |
| NeMo Guardrails | Pattern-based rules | Dialogue flow control | Content safety |
| TrustAgent | Fixed constitution | Pre/in/post-planning | Single agent |
| Fides (IFC) | Confidentiality/integrity labels | Taint tracking | Data flow |
| ShieldAgent | Policy-derived rules | Verifiable reasoning | Tool calls |
| ALICE | Dimensional ontology | Separated from definition | Multi-agent, configurable |

ALICE is complementary to these systems rather than competing with them. A concrete Alice implementation could use taint tracking (à la Fides) as part of its Layer 3 strategy for evaluating Information Flow, or adopt TrustAgent-style pre/post-planning checks as its judgment flow. The contribution is the separation of concerns: the harm definition does not change when the enforcement mechanism does.

Installation

ALICE is packaged as a Claude Code plugin. Install from inside Claude Code:

/plugin marketplace add https://github.com/CreeperLKF/ALICE
/plugin install alice@alice

The skill is configured for manual invocation — it will not auto-trigger on unrelated requests. To use it, run /alice-meta-skill (or ask "use the alice meta skill") when you want to generate a reviewer.

Workflow

1. Invoke the skill manually in a conversation with an agent.
2. Answer four short questions about your trade-off stances (Safety vs. Throughput, Autonomy vs. Human Control, Transparency vs. Simplicity, Isolation vs. Collaboration).
3. The skill produces a self-contained Alice spec: Layer 1 embedded verbatim, Layer 2 stances recorded, Layer 3 strategy derived, Layer 4 runtime template stubbed.
4. Deploy the spec as a system prompt or agent prompt file in your own agent stack.

References

Hua, W., Yang, X., Jin, M., Li, Z., Cheng, W., Tang, R., & Zhang, Y. (2024). TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution. Findings of EMNLP 2024.
Costa, M. & Köpf, B. (2025). Securing AI Agents with Information-Flow Control. arXiv:2505.23643.
Doshi, A., Hong, Y., Xu, C., Kang, E., Kapravellos, A., & Kästner, C. (2026). Towards Verifiably Safe Tool Use for LLM Agents. ICSE-NIER 2026.
Huynh, T-K. et al. (2025). Understanding LLM Agent Behaviours via Game Theory. arXiv:2512.07462.
Chen, Z., Kang, M., & Li, B. (2025). ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning. ICML 2025.

License

MIT
