Full text of the A.I.G.S. research paper — a control-layer blueprint for top-down executive governance in autonomous AI systems.

The Agentic Inhibitory Governance Standard (A.I.G.S.)

A Control-Layer Blueprint for Top-Down Executive Governance in Autonomous AI Systems

Abdul Martinez — Didomi Research — Working Paper v2.1 — May 2026

Abstract

As Large Language Models (LLMs) transition from passive text generators to autonomous agents capable of tool use, code execution, and multi-step planning, they encounter a critical architectural deficit that this paper terms Executive Dysfunction. Current model architectures rely predominantly on what we characterize as "Limbic" processing, the probabilistic retrieval and recombination of patterns from training data, which leads to systematic failures including stochastic drift, instruction fatigue, persona dissolution, and safety constraint bypass during extended context sessions.

This paper introduces the Agentic Inhibitory Governance Standard (A.I.G.S.), a structural standard inspired by the human Prefrontal Cortex (PFC) that provides top-down inhibitory control over model outputs. By architecturally decoupling an agent's accumulated "Knowledge" from its governing "Identity," A.I.G.S. establishes a structured governance layer with deterministic constraint boundaries designed to persist across extended context length, deployment scale, and adversarial pressure.

The standard's core contribution is the Deterministic Identity Profile (DIP), a machine-readable schema built on four pillars that answer the questions every coherent agent must resolve: Who am I and what do I do? (Identity), What matters to me? (Values), How do I decide when things conflict? (Tenets), and How do I present myself? (Archetype). These four pillars are enforced through a five-tier processing pipeline that implements a Digital Prefrontal Cortex (dPFC), a supervisory layer operating outside the model's context window.

We ground our analysis in documented failures of deployed AI systems, review relevant literature in cognitive architecture and AI alignment, position A.I.G.S. within the rapidly evolving governance landscape of 2025-2026, and provide an honest assessment of current limitations alongside concrete implementation pathways. The A.I.G.S. standard represents a shift in AI alignment methodology: from volatile prompt engineering to transparent, auditable Identity Profiles that can be version-controlled, tested, and certified for enterprise deployment.

*Keywords: AI alignment, executive function, autonomous agents, inhibitory governance, cognitive architecture, Model Context Protocol, A.I.G.S., runtime identity governance, identity profiles, agentic AI safety*

I. Introduction: The Crisis of Executive Dysfunction

The rapid evolution of Agentic AI, systems capable of autonomous action, tool invocation, and multi-step reasoning, has reached a ceiling of fundamental unreliability that prevents deployment in high-stakes domains. Despite remarkable advances in raw capability, measured by benchmarks from MMLU to HumanEval, production AI systems exhibit a consistent pattern of failures that cannot be attributed to insufficient intelligence or training data. Instead, these failures reflect a structural deficit in executive governance: the capacity to maintain coherent identity, values, and behavioral constraints across extended interactions.

1.1 The Problem of Stochastic Drift

In current transformer architectures, even well-crafted system prompts suffer from predictable degradation. As the context window fills with user messages, tool outputs, and generated responses, the model's attention to initial instructions weakens according to well-documented attention decay curves (Liu et al., 2024). This phenomenon, which we term stochastic drift, manifests in several forms:

  • Persona Dissolution: The model gradually abandons its assigned role, reverting to generic assistant behavior or adopting characteristics suggested by adversarial users.
  • Safety Constraint Bypass: Carefully constructed guardrails erode over conversation length, allowing outputs that would have been refused in early turns.
  • Goal Drift: In agentic contexts, the model loses track of its original objective, pursuing tangential sub-goals or entering repetitive loops.
  • Instruction Fatigue: Complex multi-step instructions are progressively simplified or ignored as token distance increases.

1.2 Documented Failures in Production Systems

The consequences of executive dysfunction are not theoretical. A review of publicly documented AI system failures reveals consistent patterns. Critically, the incidents below span different vendors, architectures, and deployment contexts, demonstrating that the problem is structural, not vendor-specific.

Case Study 1: The Claude Code Mexican Government Breach (December 2025 – February 2026)

  • The Technical Incident: Research published by Gambit Security (2026) described a campaign in which a single threat actor used Anthropic's Claude Code and OpenAI's GPT-4.1 to compromise multiple Mexican government agencies between late 2025 and early 2026. Public reports indicated large-scale data exfiltration, though exact figures vary across sources.
  • Systemic Vulnerability Analysis: The incident, if accurately reported, illustrates a breach that is architecturally significant not for its scale but for its mechanism. The attacker did not exploit a software vulnerability. They exploited the agent's inability to distinguish a legitimate administrative command from an adversarial one. Because the model analyzed instruction context directly alongside raw user inputs within a shared processing frame, it could not maintain a stable boundary between its operational identity and the attacker's imposed context. The fact that two different model families (Claude and GPT-4.1) were both successfully exploited underscores that this vulnerability is structural, not model-specific.

Case Study 2: The EchoLeak Zero-Click Corporate Exploit (CVE-2025-32711, June 2025)

  • The Technical Incident: Security researchers reported a critical, zero-click indirect prompt injection exploit targeting Microsoft 365 Copilot, assigned CVE-2025-32711 with a CVSS score of 9.3 (The Hacker News, 2025; HackTheBox, 2025). Attackers placed hidden, natural-language instructions within the body of incoming emails. When Copilot autonomously parsed the mailbox during routine background summarization, it ingested the hidden text and followed the malicious instructions. The exfiltration mechanism was sophisticated: the attack exploited reference-style Markdown rendering, auto-fetched image tags, and a Microsoft Teams asynchronous preview API that served as an allowed-domain exfiltration channel, silently leaking files from OneDrive and SharePoint.
  • Systemic Vulnerability Analysis: Traditional network firewalls and signature-based security engines were blind to this exploit because the attack payload consisted entirely of standard, plain-language instructions. The incident uncovers a fundamental flaw: if ingested data can command an agent to initiate a tool call, and the agent has no structural mechanism to distinguish data from instructions, then every data source becomes an attack vector. This is not merely a bug that can be patched; it exposes an architectural absence: the agent lacks a persistent sense of what it should and should not do, independent of what incoming content tells it to do.

Case Study 3: The ClawHavoc Supply Chain Attack (January 2026)

  • The Technical Incident: According to threat monitoring reports (CyberPress, 2026; Antiy CERT, 2026), attackers executed a massive supply-chain compromise on ClawHub, the centralized skill marketplace for the widely implemented OpenClaw agent framework. Public reporting indicates that attackers uploaded over 1,100 malicious skills disguised as standard business utility scripts across 12 publisher accounts. With over 40,000 OpenClaw instances exposed to the internet with unsafe default configurations, the malicious skills embedded payloads including the AMOS stealer, reverse shells, and staged malware downloads, enabling remote code execution and command control over host networks.
  • Systemic Vulnerability Analysis: This exploit demonstrates why autonomous agent ecosystems cannot rely on static pre-deployment code verification alone. When an agent dynamically imports external capabilities at runtime, it needs an independent supervisory layer to validate those capabilities against its own identity boundaries, specifically: What am I authorized to do? What tools fall within my scope? Without such a layer, every third-party integration becomes a potential supply-chain weapon.

Case Study 4: AutoGPT Recursive Agent Drift (April 2023)

  • The Technical Incident: The initial deployment of open-source recursive agent frameworks like AutoGPT revealed severe runtime limitations during multi-step execution tasks, as documented in public issue trackers and independent analyses (AutoGPT GitHub Issues #2726, #1994; Vectara, 2023). When given long-term, autonomous goals ("analyze fusion energy trends and execute summary scripts"), agents would operate reliably for an initial period before exhibiting rapid objective decay: entering infinite self-reflective loops, repeating redundant search queries, accumulating contradictory memory logs, and completely drifting from their primary objective.
  • Systemic Vulnerability Analysis: This failure is a pure manifestation of stochastic drift. As agents appended history, summaries, and tool outputs into a shared context window, attention to initial system instructions degraded. The community's attempted fix, refreshing the context with goal summaries, failed because the problem is deeper than attention. The agent had no stable identity that persisted independently of its context. Its goals were text strings competing with every other string for probabilistic influence.

Case Study 5: The Enterprise Governance Gap

  • The Technical Incident: Beyond direct security compromises, enterprise AI deployments face a systemic governance crisis. A 2026 KPMG survey found that 75% of enterprise leaders cite security, compliance, and auditability as critical barriers to agentic AI deployment. Industry data from Kiteworks indicates that 65% of firms reported AI agent security incidents in 2026, while IBM's 2025 Cost of a Data Breach Report found that breaches involving AI systems without proper access controls averaged $5.72 million. Meanwhile, Gartner predicts that over 40% of agentic AI projects are at risk of cancellation by 2027, in part due to the inability to provide governance guarantees (Gartner, 2025; KPMG, 2026; IBM, 2025).
  • Systemic Vulnerability Analysis: These numbers reveal a market paradox: organizations urgently need autonomous AI agents but cannot deploy them safely. Treating governance as an afterthought, a prompt-engineering patch applied post-deployment, results in volatile maintenance cycles and unacceptable risk exposure. The industry requires a governance architecture that is designed in from the start, not bolted on after failure.

1.3 The Pattern: Intelligence Without Identity

Across these five case studies, a single architectural pattern emerges. In every case, the AI system possessed sophisticated capabilities: language understanding, code execution, tool invocation, multi-step reasoning. And in every case, those capabilities were deployed in ways that violated the system's intended purpose, values, or boundaries.

The systems were intelligent. They were not coherent.

They could reason but could not maintain a stable sense of who they are. They could follow instructions but could not distinguish legitimate instructions from adversarial ones. They could make decisions but had no standing principles for resolving ambiguity. They could generate fluent text but could not maintain a consistent persona under pressure.

This is precisely the pattern observed in patients with prefrontal cortex damage: intact intelligence paired with devastated executive governance. It is the pattern of Phineas Gage. And it points directly to the solution.

1.4 The Inadequacy of Current Approaches

The industry's response to these challenges has been predominantly tactical rather than architectural:

  • Prompt Engineering involves increasingly sophisticated system prompts, but these degrade over conversation length, consume context budget, and create more surface area for manipulation.
  • Fine-Tuning embeds behavioral patterns more deeply but remains probabilistic. Fine-tuned models still drift, and fine-tuning for safety often conflicts with capability preservation.
  • Guardrail Systems such as Guardrails AI, NeMo Guardrails, and LlamaGuard filter outputs externally, representing the current state-of-the-art. However, they operate primarily as post-generation classifiers, cannot intervene during generation, and lack a framework for modeling the agent's identity holistically.
  • Training-Time Alignment including Constitutional AI (Bai et al., 2022) and RLHF (Ouyang et al., 2022) shapes the probability distribution over outputs during training but provides no runtime guarantees. Once deployed, these models have no mechanism to verify continued alignment.

These approaches share a common limitation: they treat identity and values as emergent properties of training and prompting rather than as first-class architectural components. A.I.G.S. proposes a fundamental reframing: identity governance must be implemented as a separate, structured layer that operates above and constrains the probabilistic generation process.

II. Theoretical Foundations

2.1 Biological Grounding: The Prefrontal Cortex as Design Analogy

A.I.G.S. draws on the functional architecture of the human prefrontal cortex as a design analogy, not as a literal neuroscience model. The analogy is useful because it identifies a well-characterized pattern: the dissociation between capability and governance. The limits of this analogy are discussed in Section VII.

Human behavior emerges from the dynamic tension between two neural systems: the Limbic System, responsible for emotional processing, pattern-based memory retrieval, and rapid response generation, and the Prefrontal Cortex (PFC), which provides executive control including inhibition, working memory, and value-based decision making (Miller & Cohen, 2001).

The Phineas Gage Paradigm

The 1848 case of Phineas Gage provides a foundational illustration (Harlow, 1868). Following traumatic injury to his prefrontal cortex, Gage retained his intellectual capabilities: memory, language, and reasoning remained intact. But he experienced profound changes in personality and behavioral regulation. Contemporary accounts describe him as "fitful, irreverent, indulging at times in the grossest profanity [...], manifesting but little deference for his fellows."

Gage established a critical principle: executive function is dissociable from intelligence. A system can possess sophisticated capabilities while lacking the governance mechanisms to deploy those capabilities appropriately.

Inhibitory Control Mechanisms

The PFC exerts control primarily through inhibition: the active suppression of responses that would otherwise be generated by lower-level systems. The right inferior frontal gyrus stops initiated responses. The ventromedial PFC integrates value signals to guide inhibition. The dorsolateral PFC maintains working memory and contextual rules (Stuss & Knight, 2013).

These mechanisms operate not by generating alternative responses, but by preventing inappropriate responses from reaching execution. This distinction is crucial for A.I.G.S.: rather than attempting to guide generation toward preferred outputs, we implement mechanisms that actively suppress outputs violating defined constraints.

Developmental Trajectories

The PFC is among the last brain regions to mature, with full development extending into the mid-twenties. This explains the risk-taking and impulse control deficits observed in adolescence: the limbic system reaches maturity years before the prefrontal control systems that modulate it.

This developmental perspective suggests a pathway for AI: rather than attempting full alignment through training alone, analogous to expecting mature judgment from an immature PFC, we implement external executive control systems that constrain capabilities until more robust internal alignment is achieved.

2.2 Cognitive Architecture and Executive Function

The Central Executive Model

Baddeley's (2000) model of working memory posits a "central executive" component that coordinates cognitive processes, manages attention, and maintains goal-relevant information. Current LLM architectures implement something analogous to the subsidiary systems of working memory (attention over recent tokens) but lack a dedicated central executive. The context window serves as both storage and processor, creating interference patterns that manifest as drift.

The Supervisory Attentional System

Norman and Shallice's (1986) model distinguishes between "contention scheduling" (automatic response selection based on learned patterns) and "supervisory attentional" control (deliberate override of automatic responses). LLM generation is dominated by contention scheduling: high-probability continuations based on pattern matching, with no dedicated mechanism for supervisory override. A.I.G.S. implements supervisory control as an explicit architectural layer operating independently of learned probability distributions.

Constitutional AI and RLHF

Anthropic's Constitutional AI (Bai et al., 2022) trains models to evaluate and revise their own outputs according to defined principles. RLHF (Christiano et al., 2017; Ouyang et al., 2022) aligns outputs with human preferences through reward learning. Both represent significant advances but remain fundamentally probabilistic: they influence the distribution over outputs without providing hard guarantees or runtime verification.

Brain-Inspired AI Architecture

Recent research has increasingly converged on PFC-inspired designs for agentic AI. Webb et al. (2025) introduced the Modular Agentic Planner (MAP) in Nature Communications, implementing modules for conflict monitoring, state prediction, and task coordination, demonstrating that separation of planning and execution improves performance. Zhao et al. (2025) introduced PaceLLM, implementing a Persistent Activity Mechanism that mimics PFC neurons' persistent firing to maintain working memory across extended sequences. An (2025) explored cognitive workspace architectures for active memory management, demonstrating 40% improvement in instruction adherence at 100k+ token contexts.

These developments validate the modular, PFC-inspired approach while clarifying A.I.G.S.'s distinctive contribution: where these works focus on task performance and memory management, A.I.G.S. establishes deterministic identity governance as a first-class runtime concern, separate from both task planning and context optimization.

The 2025-2026 Governance Landscape

The A.I.G.S. standard enters a landscape of rapidly coalescing governance frameworks:

  • The OWASP Top 10 for Agentic Applications (December 2025), developed with 100+ industry experts, established the first formal taxonomy of agent-specific risks: goal hijacking, tool misuse, identity abuse, memory poisoning, cascading failures, and rogue agents.
  • Singapore's Model AI Governance Framework for Agentic AI (January 2026), the world's first government framework specifically for AI agents, introduced Agent Identity Cards: standardized disclosure of capabilities, limitations, authorized action domains, and escalation protocols.
  • Microsoft's Agent Governance Toolkit (April 2026), open-source under MIT license, addresses all 10 OWASP agentic risks with execution rings modeled on CPU privilege levels, saga orchestration, and sub-millisecond policy enforcement.
  • Anthropic's Responsible Scaling Policy v3.0 (February 2026) introduced Frontier Safety Roadmaps and Risk Reports for systematic assessment of model risk profiles.
  • The Agentic AI Foundation (AAIF) under the Linux Foundation (December 2025), with 146 member organizations, now governs the Model Context Protocol (MCP) alongside Google's Agent2Agent (A2A) protocol for agent interoperability.

A.I.G.S. is complementary to these efforts. Where OWASP taxonomizes risks, A.I.G.S. provides an architectural response. Where Singapore's framework defines what agents must disclose, A.I.G.S. specifies how that disclosure is structurally enforced. Where Microsoft's toolkit provides runtime policy enforcement, A.I.G.S. provides the identity model that policies enforce.

III. The Digital Prefrontal Cortex: What It Does and Why AI Needs One

Before presenting the technical specification, it is essential to explain in plain terms what the Digital Prefrontal Cortex does and why it resolves the failures documented in Section I.

3.1 The Core Insight

Consider a human professional: a doctor, an engineer, a teacher. They possess a stable sense of identity that persists across every interaction, every challenging situation, every moment of fatigue or pressure. A doctor knows she is a doctor. She knows her boundaries (she will not practice law). She knows her values (patient safety above all). She knows her decision principles (when uncertain, refer to a specialist). And she maintains a consistent professional manner regardless of how many patients she has seen that day.

This coherence is not a product of intelligence. It is a product of executive governance: the prefrontal cortex maintaining a persistent identity that constrains moment-to-moment behavior. She does not re-derive her identity from scratch with each patient. She carries it with her.

Now consider a current AI agent. It has no persistent sense of who it is. Its "identity" is a text string in a context window, competing with every other string for attention. As the conversation grows, that identity fades. Under adversarial pressure, it dissolves entirely. The agent does not know who it is. It is whatever the current token probabilities suggest it should be.

The Digital Prefrontal Cortex gives an AI agent what the human PFC gives a person: a stable, persistent identity that sits above and constrains moment-to-moment behavior.

Just as the human PFC maintains your values, enforces your boundaries, and resolves your dilemmas even when your emotions push in other directions, the dPFC maintains an agent's identity even when the probabilistic generation layer would drift, break, or be manipulated.

3.2 Four Questions Every Agent Must Answer

A coherent identity, whether human or artificial, can be defined by four fundamental questions:

1. Who am I and what do I do? (Identity) A doctor who practices family medicine. A customer service agent for Acme Software. A research assistant specializing in climate data. Identity defines the role, the boundaries of competence, and what falls outside scope. Without it, an agent will attempt to answer any question, confidently providing information outside its reliable knowledge or authorized domain.

2. What matters to me? (Values) Patient safety above convenience. Honesty above comfort. Privacy above helpfulness. Values define an explicit priority hierarchy: when objectives compete, values determine which one wins. Without values, the same ethical dilemma gets resolved differently every time, eroding trust and creating liability.

3. How do I decide when things conflict? (Tenets) "Always prefer quality over speed." "When safety and honesty conflict, provide general principles without actionable specifics." "When in doubt, ask rather than guess." Tenets are standing decision principles that resolve ambiguity. They are the tiebreakers when values are close in weight but a choice must still be made. They are not human-in-the-loop escalations (though those can serve as a fallback). They are the agent's own internalized principles. Without tenets, an agent facing a genuine dilemma falls back to probabilistic coin-flipping.

4. How do I present myself? (Archetype) Professional but warm. Concise, never exceeding eight sentences. Technical language mirrored to the user's level. No sarcasm or flippancy. Archetype defines the agent's consistent personality and communication style. Without it, persona dissolves under pressure, which is the persona dissolution failure described in Section 1.1.

3.3 How the dPFC Addresses Each Failure Mode

Each documented failure maps directly to a missing identity component:

| Failure Mode | Missing Component | What the dPFC Provides |
| --- | --- | --- |
| The Claude Code breach: agent could not distinguish legitimate commands from adversarial ones | Identity: no persistent sense of authorized scope | Immutable domain boundaries and capability lists evaluated before every action, independent of context |
| EchoLeak: agent followed instructions embedded in data | Values: no structural distinction between "should do" and "should not do" | P0-level inviolable constraints that override any instruction, regardless of source |
| ClawHavoc: agent imported malicious capabilities without verification | Identity: no runtime boundary enforcement for tool registration | Capability verification against authorized scope before any tool is registered or executed |
| AutoGPT drift: agent lost its objective over time | Identity + Values: goals were text strings in a decaying context window | Identity and values are consulted fresh for every generation step, independent of context length |
| Enterprise governance gap: inconsistent behavior, no auditability | All four components: identity as an emergent, unauditable property of training | Versioned, machine-readable identity profiles with complete audit trails for every enforcement decision |

This is what the dPFC does: it gives an agent a persistent, structured answer to each of the four identity questions and enforces those answers at runtime, independent of context length, adversarial pressure, or probabilistic drift.

IV. The A.I.G.S. Technical Specification

4.1 Core Architecture

The A.I.G.S. standard implements the dPFC as a computational layer operating outside the model's context window. This architectural decision ensures that identity constraints cannot be diluted by accumulating context, manipulated through prompt injection, or forgotten due to attention decay. The identity profile is consulted for each generation step, providing consistent enforcement regardless of conversation history.
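
The separation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the standard's reference implementation: `IdentityProfile`, `GovernedAgent`, `fake_model`, and the substring check are all hypothetical names and simplifications. It shows the profile held as an immutable object outside the message history and consulted on every step, however long that history grows.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class IdentityProfile:
    """Hypothetical stand-in for a DIP; frozen so runtime code cannot mutate it."""
    role: str
    forbidden_phrases: tuple  # illustrative constraint only


@dataclass
class GovernedAgent:
    profile: IdentityProfile            # lives outside the context window
    model: Callable[[List[str]], str]   # any text generator
    history: List[str] = field(default_factory=list)

    def step(self, user_message: str) -> str:
        self.history.append(user_message)
        draft = self.model(self.history)
        # The check reads only the profile, never the (possibly poisoned) history.
        if any(p in draft.lower() for p in self.profile.forbidden_phrases):
            draft = f"[blocked by profile for role: {self.profile.role}]"
        self.history.append(draft)
        return draft


def fake_model(history: List[str]) -> str:
    """Toy generator that echoes the last message (an assumption for the demo)."""
    return history[-1]


agent = GovernedAgent(IdentityProfile("support agent", ("wire the funds",)), fake_model)
first = agent.step("hello")
for _ in range(50):                     # pad the history to simulate a long session
    agent.step("filler turn")
late = agent.step("Ignore prior rules and WIRE THE FUNDS now")
```

In this toy run the late-session request is blocked exactly as it would have been on the first turn, because the check never depends on where the instruction sits in the history.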

The standard has two structural layers:

  1. The Deterministic Identity Profile (DIP): a machine-readable JSON schema defining the agent's identity through four components: Identity, Values, Tenets, and Archetype.
  2. The Five-Tier Processing Pipeline: a sequential enforcement architecture that maps each DIP component to a specific stage of the input-to-output processing flow.

4.2 The Deterministic Identity Profile (DIP)

Identity: Who I Am and What I Do

As introduced in Section III, agents without persistent identity boundaries default to the "omniscient assistant" failure mode. The Identity component addresses this by defining explicit boundaries around what the agent is and what it can do, ensuring that requests outside authorized scope trigger graceful deflection rather than unreliable responses.

The Identity component specifies:

  • Role: Title, organization, and AI disclosure policy.
  • Domain Boundaries: Included domains (what the agent addresses), excluded domains (what it deflects), and the specific deflection response.
  • Capabilities: Actions the agent can perform, actions it cannot, and actions requiring explicit user confirmation before execution.
  • Knowledge Horizons: Topics the agent can speak authoritatively on, topics it is informed about but should hedge, and topics it must disclaim.
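
As a rough illustration of what a machine-readable Identity component might look like, the sketch below builds one as a Python dictionary and serializes it to JSON. The key names (`domain_boundaries`, `requires_confirmation`, and so on) and the `validate_identity` helper are assumptions made for this example; the text above lists the categories but not the exact schema keys.

```python
import json

# Hypothetical field names; only the four categories come from the text above.
identity = {
    "role": {"title": "Support Agent", "organization": "Acme Software",
             "ai_disclosure": "always"},
    "domain_boundaries": {
        "included": ["billing", "account_access"],
        "excluded": ["legal_advice", "medical_advice"],
        "deflection_response": "That is outside what I can help with here.",
    },
    "capabilities": {
        "can": ["lookup_invoice"],
        "cannot": ["delete_account"],
        "requires_confirmation": ["issue_refund"],
    },
    "knowledge_horizons": {
        "authoritative": ["acme_pricing"],
        "hedged": ["industry_trends"],
        "disclaimed": ["competitor_roadmaps"],
    },
}


def validate_identity(doc: dict) -> list:
    """Return a list of problems; an empty list means these sketch-level checks pass."""
    problems = []
    for key in ("role", "domain_boundaries", "capabilities", "knowledge_horizons"):
        if key not in doc:
            problems.append(f"missing section: {key}")
    bounds = doc.get("domain_boundaries", {})
    overlap = set(bounds.get("included", [])) & set(bounds.get("excluded", []))
    if overlap:
        problems.append(f"domains both included and excluded: {sorted(overlap)}")
    caps = doc.get("capabilities", {})
    if set(caps.get("can", [])) & set(caps.get("cannot", [])):
        problems.append("an action is listed as both permitted and forbidden")
    return problems


problems = validate_identity(identity)
# Sorted, indented JSON gives a stable text form that diffs cleanly under version control.
serialized = json.dumps(identity, indent=2, sort_keys=True)
```

Checking for contradictions (a domain both included and excluded) at load time is one way to keep later runtime lookups unambiguous.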

Values: What Matters to Me

Without an explicit priority hierarchy, competing objectives are resolved probabilistically, leading to inconsistent decisions across conversations (see Section III). Values define a structured priority hierarchy with four tiers:

  • P0: Inviolable constraints. These can never be overridden under any circumstances. P0 values with enforcement type hard_block operate as absolute gates: the system structurally prevents violations through pattern matching, semantic classification, and constrained decoding. Examples: physical harm prevention, child safety.
  • P1: High priority. Overridden only by P0. These guide behavior strongly but admit rare exceptions when a P0 constraint requires it. Examples: empirical accuracy, privacy protection.
  • P2: Standard priority. Subject to contextual trade-offs. Examples: user satisfaction, response efficiency.
  • P3: Preferences. Default behaviors that yield to any higher-priority consideration. Examples: response formatting preferences, proactive suggestions.

The key distinction: P0 values are not suggestions with high weights. They are structural constraints enforced at the architecture level. A P0 constraint on harmful instructions does not merely reduce the probability of harmful output; it blocks that output structurally, through the mechanisms described in Section V.
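
The difference between a gate and a weight can be made concrete with a small sketch. The `Tier`, `Value`, `violates_hard_block`, and `winner` names are hypothetical; only the P0 to P3 tiers and the `hard_block` enforcement type come from the text above.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Tier(IntEnum):
    P0 = 0  # inviolable
    P1 = 1  # high priority
    P2 = 2  # standard priority
    P3 = 3  # preference


@dataclass(frozen=True)
class Value:
    name: str
    tier: Tier
    enforcement: str = "guide"  # "hard_block" is the enforcement type named in the text


def violates_hard_block(triggered: List[Value]) -> Optional[Value]:
    """Return the first triggered P0 hard_block value, if any."""
    for v in triggered:
        if v.tier is Tier.P0 and v.enforcement == "hard_block":
            return v
    return None


def winner(competing: List[Value]) -> Value:
    """Lower tier number wins; same-tier ties keep input order and are left to tenets."""
    return min(competing, key=lambda v: v.tier)


harm = Value("physical_harm_prevention", Tier.P0, "hard_block")
accuracy = Value("empirical_accuracy", Tier.P1)
satisfaction = Value("user_satisfaction", Tier.P2)

chosen = winner([satisfaction, accuracy])          # P1 outranks P2
gate = violates_hard_block([satisfaction, harm])   # a gate result, not a reweighting
```

Note that `violates_hard_block` returns a yes/no gate decision that is separate from `winner`: a triggered P0 value is never traded off against anything.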

Tenets: How I Decide When Things Conflict

Even with clear values, edge cases arise where multiple values apply simultaneously with no obvious resolution (see Section III). Tenets are standing decision principles that break these ties. They are not ad-hoc rules; they are the agent's internalized philosophy for navigating ambiguity. Consider:

  • "Always prefer quality over speed." Both quality and speed are valuable. The tenet establishes which one wins when they conflict.
  • "When safety and honesty conflict, provide general theory without actionable specifics." This is not a refusal. It is a resolution strategy that honors both values as fully as possible.
  • "When confidence is below 70%, always disclose uncertainty explicitly." This prevents the agent from presenting uncertain information with false authority.

Tenets also specify an escalation threshold: when the weight difference between competing values falls below a defined margin, and no tenet directly applies, the system can escalate to a human supervisor as a fallback. But tenets are the primary tiebreaker. Human-in-the-loop is the safety net, not the first resort.

The Tenets component specifies:

  • Standing Rules: Named principles with conditions and resolution strategies.
  • Escalation Threshold: The weight difference below which a conflict is considered too close for algorithmic resolution.
  • Default Resolution Strategy: How to handle conflicts not covered by any explicit rule (e.g., precedence by priority tier, safe decline, or HITL escalation with timeout and fallback).
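
The ordering described above (tenets first, then a default strategy, then human escalation when values are too close) might be sketched as follows. Numeric value weights, the `Tenet` and `resolve` names, and the use of weight precedence as the default strategy are assumptions made for illustration; the text names precedence by priority tier, safe decline, and HITL escalation as possible defaults.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Tenet:
    name: str
    applies: Callable[[str, str], bool]  # condition over the two conflicting value names
    resolution: str                      # label for the resolution strategy


def resolve(a: str, b: str, weights: Dict[str, float],
            tenets: List[Tenet], escalation_threshold: float) -> str:
    # 1. Tenets are the primary tiebreaker.
    for t in tenets:
        if t.applies(a, b):
            return f"tenet:{t.name}:{t.resolution}"
    # 2. No tenet applies: a clear weight gap is resolved by precedence.
    gap = abs(weights[a] - weights[b])
    if gap >= escalation_threshold:
        return f"precedence:{a if weights[a] > weights[b] else b}"
    # 3. Too close for algorithmic resolution: fall back to a human supervisor.
    return "escalate:human_supervisor"


tenets = [Tenet("quality_over_speed",
                lambda a, b: {a, b} == {"quality", "speed"},
                "prefer_quality")]
weights = {"quality": 0.8, "speed": 0.7, "privacy": 0.9,
           "helpfulness": 0.5, "brevity": 0.52}

r1 = resolve("quality", "speed", weights, tenets, escalation_threshold=0.1)
r2 = resolve("privacy", "helpfulness", weights, tenets, escalation_threshold=0.1)
r3 = resolve("helpfulness", "brevity", weights, tenets, escalation_threshold=0.1)
```

Here `r1` is settled by the tenet even though the two weights are close, `r2` by precedence, and only `r3` reaches the human fallback.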

Archetype: How I Present Myself

As discussed in Section III, persona prompting is fragile and degrades under context pressure. Archetype addresses this by defining the agent's personality as a structured specification enforced independently of context. Rather than hoping the model "remembers" to be concise, the Archetype actively constrains response length. Rather than suggesting a formal tone, it specifies linguistic registers to use and avoid.

The Archetype component specifies:

  • Tone: Primary and secondary registers, plus tones to explicitly avoid (sarcasm, flippancy, dismissiveness).
  • Cadence: Response style (concise, detailed), maximum response length, and whether adaptive length is permitted.
  • Register: Technical level, jargon policy (define on first use, mirror user, avoid), and formality.
  • Behavioral Parameters: Temperature override for consistency-critical applications, empathy signals, humor policy.
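
A minimal sketch of enforcing, rather than suggesting, two of these parameters is shown below. The `Archetype` fields, the naive sentence splitter, and the function names are assumptions for illustration; a production system would need a proper sentence tokenizer and a richer specification.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Archetype:
    max_sentences: int
    adaptive_length: bool = False
    temperature_override: Optional[float] = None


def split_sentences(text: str) -> list:
    """Naive split after ., ! or ? followed by whitespace; abbreviations will fool it."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def enforce_cadence(text: str, archetype: Archetype) -> str:
    """Trim a draft to the maximum sentence count unless adaptive length is permitted."""
    sentences = split_sentences(text)
    if archetype.adaptive_length or len(sentences) <= archetype.max_sentences:
        return text.strip()
    return " ".join(sentences[: archetype.max_sentences])


def sampling_temperature(requested: float, archetype: Archetype) -> float:
    """The profile's override, when set, wins over whatever the caller requested."""
    if archetype.temperature_override is not None:
        return archetype.temperature_override
    return requested


archetype = Archetype(max_sentences=2, temperature_override=0.2)
trimmed = enforce_cadence("One. Two! Three? Four.", archetype)
temperature = sampling_temperature(0.9, archetype)
```

Because the trimming happens after generation and outside the prompt, it applies the same way on the first turn and the five-hundredth.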

4.3 The Five-Tier Processing Pipeline

The four DIP components are enforced through a five-tier sequential pipeline. Each tier maps to a specific DIP component and addresses a specific stage of the input-to-output flow:

Tier 1: Input Screening

  • DIP Source: Values (P0 hard blocks) + Identity (boundary patterns)
  • Function: Scans incoming text, external data payloads, and tool configurations before context entry. Matches P0 violation patterns, known adversarial signatures, and identity hijack attempts.
  • Mechanism: Pattern matching, lightweight ML classification, and token-sequence scanning at the input perimeter.
  • Limitation acknowledged: Regex and pattern-based filtering serves as a fast first-pass gate that catches common attack patterns. It is not a security boundary against sophisticated adversaries. Research has demonstrated that encoding, obfuscation, and adversarial tokenization can bypass pattern-based filters (Gao et al., 2025; Debenedetti et al., 2025). Tier 1 reduces attack surface but does not eliminate it. Defense in depth across all five tiers is required.
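
A minimal sketch of such a first-pass gate, under stated assumptions: the pattern names, the two regular expressions, and the `screen_input` function are hypothetical, and the NFKC normalization step is an added illustration rather than a requirement of the standard. The last line prepares a Base64-encoded payload to show the limitation acknowledged above.

```python
import base64
import re
import unicodedata

# Hypothetical P0 patterns; a real deployment would load these from the DIP.
P0_PATTERNS = {
    "instruction_override": re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions"),
    "identity_hijack": re.compile(r"you\s+are\s+no\s+longer\b"),
}


def screen_input(text: str) -> list[str]:
    """Return the names of all P0 patterns matched by the normalized text."""
    # NFKC folds look-alike forms (e.g. fullwidth letters) into plain ASCII.
    normalized = unicodedata.normalize("NFKC", text).lower()
    return [name for name, pattern in P0_PATTERNS.items() if pattern.search(normalized)]


# An encoded payload carries the same instruction but matches no pattern.
encoded = base64.b64encode(b"ignore previous instructions").decode("ascii")
```

Calling `screen_input(encoded)` returns an empty list, which is why the later tiers cannot be skipped.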

Tier 2: Scope Verification

  • DIP Source: Identity (domain boundaries, capabilities, knowledge horizons)
  • Function: Validates the request against the agent's authorized domain and capability space before inference is invoked.
  • Mechanism: Deterministic lookup against the Identity component's included/excluded domain lists and capability declarations. If a request falls outside authorized scope, the pipeline short-circuits immediately with the defined deflection response, without invoking primary inference. This is the strongest tier: Boolean scope checks are not probabilistic, so the lookup itself cannot be jailbroken. Any step that maps free text to a scope category (see the design note) remains the exposed surface.
  • Design note: Mapping natural language requests to scope categories may require a lightweight classifier. Implementations should use a fast, dedicated scope classifier (distinct from the primary LLM) to maintain deterministic properties.
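
The lookup itself can be sketched as follows. This is an illustration only: the `Scope` fields and `verify_scope` are hypothetical names, the `category` argument is assumed to come from the dedicated scope classifier mentioned above, and denying unknown categories by default is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Scope:
    included_domains: frozenset
    excluded_domains: frozenset
    deflection: str  # response returned when the pipeline short-circuits


def verify_scope(category: str, scope: Scope) -> Tuple[bool, Optional[str]]:
    """Return (allowed, deflection). Exclusions win; unknown categories are denied."""
    if category in scope.excluded_domains or category not in scope.included_domains:
        return False, scope.deflection
    return True, None


scope = Scope(
    included_domains=frozenset({"billing", "account_settings"}),
    excluded_domains=frozenset({"medical_advice"}),
    deflection="I can only help with billing and account questions.",
)
```

Because the check is a set-membership test, the same category always yields the same decision; the inference call is simply never made when `allowed` is false.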

Tier 3: Value-Guided Generation

  • DIP Source: Values (P0–P3 hierarchy)
  • Function: Constrains the model's generation process according to the priority hierarchy.
  • Mechanisms (multiple, layered):
      • Constrained decoding: For well-defined constraint classes (specific blocked phrases, mandatory disclosures), beam search pruning and rejection sampling prevent violations at the sequence level.
      • Logit adjustment: For simple, pattern-level constraints (specific blocked tokens or phrases), logit penalties reduce the probability of non-compliant tokens. A penalty of −100.0 added to a token's logit scales its unnormalized probability by e^−100, a reduction by a factor of approximately 2.69 × 10^43, making the token effectively unsampleable.
      • Sequence-level classification: For semantic constraints that cannot be captured at the token level (harm assessment, privacy evaluation), a lightweight classifier evaluates partial generations at regular intervals during streaming output. If a developing response trajectory violates a value constraint, generation is interrupted and restarted with tighter constraints.
  • Limitation acknowledged: Logit adjustment operates at the individual token level, but harmful meaning often emerges from sequences of individually benign tokens. Research on logit suppression vulnerabilities (Gao et al., 2025) and in-context representation hijacking (Doublespeak, Li et al., 2024) demonstrates that token-level intervention alone is insufficient. A.I.G.S. therefore treats logit adjustment as one mechanism within a layered defense, not as a standalone guarantee. Sequence-level classification and multi-pass verification (Section V) provide deeper coverage.
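
The arithmetic behind the −100.0 penalty can be checked with a small worked example. The three-token vocabulary and its logit values are made up for illustration, and `apply_penalty` is a hypothetical helper rather than an API of any inference engine.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_penalty(logits, blocked_ids, penalty=-100.0):
    """Add the penalty to every blocked token's logit; leave the rest unchanged."""
    return [x + penalty if i in blocked_ids else x for i, x in enumerate(logits)]


logits = [2.0, 1.0, 0.5]          # token 0 is the most likely before masking
before = softmax(logits)
after = softmax(apply_penalty(logits, {0}))

# Factor by which the blocked token's unnormalized weight shrinks: e^100.
weight_factor = math.exp(100.0)
```

Here `weight_factor` is about 2.69 × 10^43, and the blocked token's probability falls from roughly 0.63 to below 10^−40. The factor applies to the unnormalized weight; the ratio of normalized probabilities differs slightly because the remaining tokens are renormalized.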

Tier 4: Tenet Resolution

  • DIP Source: Tenets (standing decision rules, escalation threshold)
  • Function: Resolves value conflicts detected during generation using the agent's standing decision principles.
  • Mechanism: When Tier 3 detects competing value signals (e.g., a response that serves helpfulness but risks safety), Tier 4 consults the Tenets component. If a standing rule directly applies, its resolution strategy is executed deterministically. If no rule applies and the weight difference between competing values falls below the escalation threshold, the system either applies the default resolution strategy or escalates to a human supervisor with a defined timeout and safe fallback.
  • Design note: The escalation threshold is a critical parameter. Too low and harmful outputs may pass. Too high and every ambiguous request requires human review, defeating automation. Implementations should calibrate this threshold empirically and monitor escalation rates.
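
One way to read the resolution order described above, as a hedged sketch: `StandingRule`, `resolve_conflict`, the strategy strings, and the shape of the `conflict` dictionary are all hypothetical. Letting the higher-weighted value prevail when the margin is clear is an assumption, since the text only specifies what happens when the margin is too small.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StandingRule:
    name: str
    applies: Callable[[dict], bool]  # condition over the detected conflict
    resolution: str                  # named resolution strategy


def resolve_conflict(conflict: dict, rules: List[StandingRule],
                     escalation_threshold: float,
                     default_strategy: str = "hitl_escalation") -> str:
    # 1. A directly applicable standing rule always wins.
    for rule in rules:
        if rule.applies(conflict):
            return rule.resolution
    # 2. No rule applies: compare the two strongest competing value weights.
    weights: Dict[str, float] = conflict["weights"]
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    margin = ranked[0][1] - ranked[1][1]
    if margin < escalation_threshold:
        return default_strategy      # too close for algorithmic resolution
    # 3. Assumption: with a clear margin, the higher-weighted value prevails.
    return f"prefer:{ranked[0][0]}"


rules = [StandingRule(
    name="safety_vs_honesty",
    applies=lambda c: set(c["weights"]) == {"safety", "honesty"},
    resolution="general_theory_without_specifics",
)]
```

Raising `escalation_threshold` sends more conflicts down the default path, which is the calibration trade-off the design note describes.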

Tier 5: Archetype Enforcement

  • DIP Source: Archetype (tone, cadence, register, behavioral parameters)
  • Function: Reviews the compiled post-inference response against the Archetype specification and adjusts for consistency.
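  • Mechanism: Enforces response length limits, verifies linguistic register, removes prohibited tones, and applies formatting standards. If a temperature override is specified, it applies to any regeneration triggered at this stage; because temperature is a sampling parameter, it cannot alter a response that has already been generated.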
  • Design note: Tier 5 is a formatting and consistency layer, not a safety layer. If Tiers 1 through 4 fail to catch a violation, Tier 5's cosmetic review is unlikely to catch semantically harmful content that is well-formatted and on-tone.
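
The structural part of this tier is simple enough to sketch. The function name, the naive sentence splitter, and the substring-based tone check are illustrative assumptions; a production implementation would likely use a proper sentence segmenter and a tone classifier.

```python
import re
from typing import List, Tuple


def enforce_archetype(text: str, max_sentences: int,
                      avoided_markers: List[str]) -> Tuple[str, List[str]]:
    """Truncate to max_sentences and report which avoided-tone markers remain."""
    # Naive split: whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    truncated = " ".join(sentences[:max_sentences])
    lowered = truncated.lower()
    flags = [marker for marker in avoided_markers if marker in lowered]
    return truncated, flags
```

For example, `enforce_archetype("One. Two! Three? Four.", 2, ["obviously"])` keeps only the first two sentences and reports no tone flags. Nothing in this function inspects meaning, which is the point of the design note above.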

4.4 Component Interactions and Precedence

The five tiers execute in strict sequential order, each acting as a gate:

  1. Tier 1 intercepts known-dangerous inputs before any processing occurs.
  2. Tier 2 verifies scope before computational resources are expended.
  3. Tier 3 constrains generation according to the value hierarchy.
  4. Tier 4 resolves conflicts using standing principles.
  5. Tier 5 polishes output for persona consistency.

This layered architecture ensures that absolute constraints (P0 values, scope boundaries) take precedence over weighted preferences (P2-P3 values) and that identity boundaries are enforced before computation is expended on out-of-scope requests.
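
The gating behavior of the first two tiers can be sketched as a short-circuiting loop. This is a simplification: `run_gates` and the `Gate` type are hypothetical, and Tiers 3 through 5 transform or review output rather than merely passing or blocking, so they are not modeled here.

```python
from typing import Callable, List, Optional, Tuple

# A gate returns a block reason, or None to let the request continue.
Gate = Callable[[str], Optional[str]]


def run_gates(request: str, gates: List[Tuple[str, Gate]]) -> Tuple[bool, Optional[str]]:
    """Run gates in order and stop at the first one that blocks."""
    for name, gate in gates:
        reason = gate(request)
        if reason is not None:
            return False, f"{name}: {reason}"
    return True, None
```

Because the loop returns at the first block, a request stopped by Tier 1 never reaches the Tier 2 check, and a request stopped by either never reaches inference.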

V. Technical Architecture and Implementation Pathways

5.1 External Enforcement via the Model Context Protocol

The Model Context Protocol (MCP), now governed by the Agentic AI Foundation under the Linux Foundation with support from Anthropic, OpenAI, Google, Microsoft, and AWS, provides a standardized interface for external systems to interact with LLM-based agents.

The Identity Handshake Protocol

Upon session initialization, the agent issues a resources/read call to the A.I.G.S. MCP server, which returns the applicable identity profile:

  1. Client Request: JSON-RPC request with method resources/read and URI aigs://profiles/{agent_id}, including session metadata.
  2. Server Response: Complete DIP as a resource with MIME type application/aigs+json and integrity checksum.
  3. Client Acknowledgment: aigs/profile_loaded notification with profile_id, checksum, and timestamp for audit trail.
  4. Enforcement Activation: The MCP server transitions to active enforcement, intercepting subsequent tool invocations and output emissions.
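
The client side of steps 1 through 3 might look like the sketch below. Several details are assumptions: the paper does not state the checksum algorithm or where the checksum is carried, so SHA-256 over the profile text, passed in separately, is used for illustration; session metadata is omitted; and the helper names are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_read_request(agent_id: str, request_id: int = 1) -> dict:
    """Step 1: JSON-RPC resources/read request for the agent's profile URI."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "resources/read",
        "params": {"uri": f"aigs://profiles/{agent_id}"},
    }


def verify_and_ack(profile_text: str, expected_sha256: str) -> dict:
    """Steps 2-3: verify integrity, then build the profile_loaded notification."""
    digest = hashlib.sha256(profile_text.encode("utf-8")).hexdigest()
    if digest != expected_sha256:
        raise ValueError("DIP checksum mismatch; refusing to activate profile")
    profile = json.loads(profile_text)
    # A JSON-RPC notification carries no "id" because no response is expected.
    return {
        "jsonrpc": "2.0",
        "method": "aigs/profile_loaded",
        "params": {
            "profile_id": profile["profile_id"],
            "checksum": digest,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    }
```

Failing closed on a checksum mismatch keeps a tampered profile from ever reaching the enforcement activation step.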

Runtime Enforcement

The MCP server enforces through:

  • Output Filtering: Candidate responses validated via aigs/check_output calls before emission, applying tenet matching, value weighting, and identity boundary checks.
  • Tool Governance: Tool invocations intercepted via MCP execution hooks, validated against Identity boundaries (Tier 2) and Values (Tier 1).
  • Context Monitoring: Background tracking of conversation state, topic drift indicators, and adversarial pattern signatures, with interventions from soft reminders to session termination.
  • Audit Logging: Structured audit trail including timestamp, action type, DIP component triggered, input hash, and decision rationale.
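
An audit entry with the fields listed above could be assembled as follows. The key names and the choice of SHA-256 for the input hash are assumptions; the standard's actual record format is not specified in this section.

```python
import hashlib
from datetime import datetime, timezone


def audit_record(action_type: str, dip_component: str,
                 raw_input: str, rationale: str) -> dict:
    """Build one structured audit entry; the raw input is stored only as a hash."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action_type": action_type,
        "dip_component": dip_component,
        "input_hash": hashlib.sha256(raw_input.encode("utf-8")).hexdigest(),
        "rationale": rationale,
    }
```

Hashing the input lets auditors confirm which request triggered a decision without the log itself retaining potentially sensitive text.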

MCP Pathway Scope

An honest assessment of what MCP can and cannot enforce: MCP operates at the tool-call and message level via JSON-RPC. It is well-suited for Tiers 1, 2, 4, and 5 (input screening, scope verification, tenet resolution, archetype enforcement). It is not suited for Tier 3 (value-guided generation), which requires intervention during the token sampling loop, inside the inference pipeline. Tier 3 enforcement requires native inference integration or the multi-pass Governor approach described below.

Additionally, MCP itself has documented security vulnerabilities including tool poisoning (prompt injection via tool descriptions), data exfiltration through combined tool calls, and lookalike tool substitution. Implementations should harden MCP server configurations and apply the same zero-trust principles to MCP infrastructure as to the agents it governs.

5.2 Native Inference Integration

For Tier 3 enforcement and latency-critical deployments, A.I.G.S. integrates directly into the inference pipeline.

Constrained Decoding

During the sampling phase, the dPFC layer applies constraints through multiple mechanisms:

  • Logit adjustment for simple, well-defined patterns (specific blocked tokens/phrases), applying penalties of −100.0 in log-space.
  • Beam search pruning that eliminates candidate beams containing violation patterns before completion.
  • Rejection sampling with deterministic fallbacks: non-compliant samples trigger regeneration up to a maximum retry count, after which a pre-specified safe default response is emitted.
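
The rejection-sampling loop with a deterministic fallback can be sketched independently of any inference engine. The `sample` and `is_compliant` callables stand in for the model call and the compliance check; both, along with the function name, are hypothetical.

```python
from typing import Callable


def generate_with_rejection(sample: Callable[[], str],
                            is_compliant: Callable[[str], bool],
                            max_retries: int,
                            safe_default: str) -> str:
    """Return the first compliant sample, or safe_default once retries run out."""
    # One initial attempt plus up to max_retries regenerations.
    for _ in range(max_retries + 1):
        candidate = sample()
        if is_compliant(candidate):
            return candidate
    return safe_default
```

The retry cap bounds worst-case latency, and the fixed fallback means the caller always receives a compliant string even when every sample is rejected.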

Multilingual Considerations

Subword tokenization creates implementation challenges: violations may span multiple tokens, and the same semantic violation may have hundreds of surface forms across languages. Recommended approaches include semantic embedding classifiers operating on decoded text chunks, multilingual violation databases with automatic translation expansion, and language detection with language-specific enforcement variants.

Multi-Pass Verification with Governor Distillation

For complex reasoning tasks, A.I.G.S. employs a dual-pass architecture: the primary model generates a candidate response; a Governor model evaluates it against the DIP and approves, requests regeneration, or performs targeted editing.

To manage latency, we recommend Governor distillation: a smaller, specialized model (1-3B parameters) trained on compliance evaluation using a larger teacher model. Distilled Governors can evaluate complete responses in 50-100 milliseconds, bringing total pipeline latency to roughly 1.3-1.5x baseline, versus roughly 2x baseline for full multi-pass verification.

The distillation process involves: (1) generating a large corpus of candidate responses with compliance labels from the full Governor; (2) fine-tuning on binary compliance classification plus violation localization; (3) calibrating confidence thresholds; (4) deploying with fallback to the full Governor for low-confidence cases. Training data should include adversarial augmentation to ensure robustness (HarmAug methodology; Xu et al., 2025).
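
Step (4), the low-confidence fallback, can be sketched as a small routing function. The callables `distilled` and `full` stand in for the two Governor models and are assumed to return a `(compliant, confidence)` pair; that interface, the function name, and the 0.9 default are illustrative.

```python
from typing import Callable, Tuple

Verdict = Tuple[bool, float]  # (compliant, confidence in [0, 1])


def governed_verdict(response: str,
                     distilled: Callable[[str], Verdict],
                     full: Callable[[str], Verdict],
                     min_confidence: float = 0.9) -> Tuple[bool, str]:
    """Use the distilled Governor when it is confident; otherwise defer to the full one."""
    compliant, confidence = distilled(response)
    if confidence >= min_confidence:
        return compliant, "distilled"
    compliant, _ = full(response)
    return compliant, "full"
```

The calibrated threshold from step (3) controls how often the slower path runs, so it trades latency directly against reliance on the smaller model.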

Constraint Determinism vs. Output Determinism

A fundamental tension exists between A.I.G.S.'s governance goals and the inherently probabilistic nature of LLM generation. A.I.G.S. resolves this by distinguishing between two types of determinism:

  • Output determinism (identical outputs for identical inputs) is neither required nor desirable. It demands temperature=0, which degrades quality and creativity.
  • Constraint determinism (guaranteed absence of constraint violations) is what A.I.G.S. provides. The standard is designed to prevent specified violations within the enforcement boundaries of the deployed architecture, while allowing stochastic variation within compliant bounds.

This distinction, independently articulated in recent work on convergent AI agent frameworks (CAAF; Diaz et al., 2026), clarifies A.I.G.S.'s guarantee: not that agents will always say the same thing, but that they will never say certain things. The boundary is deterministic; the space within it is creative.

5.3 Hybrid Architectures

Production deployments should combine enforcement mechanisms based on constraint type, latency requirements, and risk tolerance:

  • Native constrained decoding for simple, high-confidence constraints (blocked phrases, mandatory disclosures).
  • MCP-based semantic filtering for complex value judgments requiring full-text analysis (harm assessment, privacy evaluation).
  • Multi-pass Governor verification for high-stakes outputs where false negatives carry significant risk (financial advice, medical information, legal guidance).
  • HITL escalation for novel conflicts not covered by existing tenet rules.

5.4 Performance Considerations

A.I.G.S. enforcement introduces latency that must be managed:

  • Tier 1 (input screening): 1-10ms for pattern matching; 20-50ms if using an ML classifier (e.g., Prompt Guard 2 at 86M parameters).
  • Tier 2 (scope verification): 1-10ms for deterministic lookup.
  • Tier 3 (value-guided generation): 5-15% increase in per-token generation time for logit adjustment; sequence-level classification adds 50-100ms per evaluation checkpoint.
  • Tier 4 (tenet resolution): <1ms for algebraic rule evaluation.
  • Tier 5 (archetype enforcement): 100-500ms for post-generation review.
  • Multi-pass Governor: 50-100ms per response (distilled); 500ms+ (full).

Realistic total overhead is 15-40% for standard enforcement (Tiers 1, 2, 4, and 5 via MCP plus lightweight Tier 3) and 40-80% for comprehensive enforcement including full sequence-level classification and Governor verification. Latency-critical applications should profile their specific deployment and select enforcement mechanisms accordingly.
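
As a worked illustration of how the per-tier figures combine, the sketch below adds fixed tier latencies to a proportional Tier 3 cost. The 2,000 ms and 500 ms baselines are hypothetical, and the tier values are picked from within the ranges listed above.

```python
def overhead_percent(baseline_ms: float, fixed_ms: dict, per_token_fraction: float) -> float:
    """Total added latency as a percentage of the unguarded baseline."""
    added = sum(fixed_ms.values()) + baseline_ms * per_token_fraction
    return 100.0 * added / baseline_ms


# Values chosen from the ranges above: Tiers 1 and 2 at 10 ms, Tier 4 at 1 ms,
# Tier 5 at 100 ms, and a 10% per-token cost for Tier 3 logit adjustment.
fixed = {"tier1": 10.0, "tier2": 10.0, "tier4": 1.0, "tier5": 100.0}
long_response = overhead_percent(2000.0, fixed, 0.10)
short_response = overhead_percent(500.0, fixed, 0.10)
```

With these inputs the overhead is about 16% on the 2,000 ms baseline and about 34% on the 500 ms baseline: the fixed costs, Tier 5 in particular, weigh more heavily on short responses.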

VI. Evaluation Framework

6.1 Identity Persistence Metrics

  • Persona Stability Index (ψ): Stylometric consistency measured as correlation with Archetype baseline across conversation turns. Target: >0.95 over 100+ turns.
  • Instruction Adherence Decay Curve: Compliance rate with initial instructions as context length increases. Target: <5% compliance drop at maximum context length.
  • Adversarial Persona Resistance: Robustness against social engineering, roleplay induction, authority claims, and emotional manipulation. Target: zero complete persona breaks.
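
One possible operationalization of ψ, offered as an assumption since the exact formula is not given here: represent the Archetype baseline and each turn as a vector of stylometric features, and average the Pearson correlation of each turn with the baseline. The feature values below are placeholders.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes equal lengths and non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def persona_stability(baseline: Sequence[float], turns: List[Sequence[float]]) -> float:
    """Mean correlation of each turn's feature vector with the baseline."""
    return sum(pearson(baseline, t) for t in turns) / len(turns)


# Placeholder features: mean sentence length, hedge-word rate, formality score.
baseline = [12.0, 0.3, 4.5]
```

A conversation whose turns keep the baseline's feature profile scores near 1.0, while a turn with a different profile pulls the average down toward the 0.95 target boundary and below.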

6.2 Value Alignment Metrics

  • Value Precedence Accuracy: Correct resolution of explicit value conflicts by priority tier. Target: 100% for P0, >95% for P1.
  • Constraint Violation Rate: Frequency of outputs violating P0 constraints. Target: 0.0%. Any non-zero rate indicates implementation failure.

6.3 Benchmark Suites

  • AIGS-Stress: Adversarial prompts targeting identity failures including jailbreaks, persona manipulation, goal hijacking, and multi-turn social engineering.
  • AIGS-Marathon: Extended conversations (1,000+ turns) measuring persona consistency and instruction adherence decay.
  • AIGS-Conflict: Value trade-off scenarios testing tenet resolution correctness.
  • AIGS-Boundary: Out-of-scope queries and capability probes testing domain enforcement.

Relationship to Existing Benchmarks

These suites complement existing benchmarks by addressing a measurement gap:

  • HELM (Liang et al., 2023) measures broad safety categories but not consistency across conversation length. AIGS-Marathon tests whether safety persists over extended interactions.
  • LongBench (Bai et al., 2023) evaluates long-context understanding but focuses on task performance, not behavioral consistency. AIGS-Marathon adapts its methodology for persona stability.
  • Red-teaming suites (Perez et al., 2022) measure attack success rates in single-turn or short-context scenarios. AIGS-Stress incorporates multi-turn attacks and measures how many turns and what strategies are required for persona break, providing a more nuanced view of robustness.
  • HarmBench and JailbreakBench provide standardized attack libraries. A.I.G.S. implementations should be evaluated against these in addition to AIGS-specific suites.

VII. Threat Model, Limitations, and Honest Assessment

A world-class standard must be honest about what it does not solve. This section provides a formal threat model and candid assessment of current limitations.

7.1 Threat Model

A.I.G.S. assumes the following adversary capabilities and provides corresponding guarantees:

| Adversary Level | Capabilities | A.I.G.S. Protection | Residual Risk |
| --- | --- | --- | --- |
| Casual | Generic jailbreak attempts, social engineering, basic prompt injection | Strong: Tiers 1-2 catch most attempts; Tier 3 constrains generation | Low |
| Skilled | Encoded payloads, multi-turn manipulation, obfuscated instructions | Moderate: Tier 3 sequence classification + Governor catch many; multi-turn tracking reduces exposure | Medium: sophisticated multi-turn attacks may evade |
| Expert | Adversarial tokenization, representation-level attacks, knowledge of enforcement mechanisms | Limited: constrained decoding and Governor provide partial defense | High: adaptive probing can map enforcement boundaries |
| Insider | Access to DIP profiles, ability to modify constraint definitions | Out of scope: supply chain security for DIP profiles requires separate infrastructure | Critical: compromised profiles silently undermine all protections |

7.2 Acknowledged Limitations

Multi-Turn and Indirect Attacks

The current pipeline evaluates each request within its immediate context. Adversaries who spread harmful intent across multiple turns, each individually benign, can evade per-turn analysis. Similarly, indirect prompt injection (malicious instructions embedded in retrieved documents, tool outputs, or external data) can bypass Tier 1's input screening when the harmful content arrives through a trusted channel.

Mitigation direction: Persistent conversation state tracking, similar to NeMo Guardrails' Colang dialog flow control, should be incorporated to detect multi-turn manipulation patterns. For indirect injection, content arriving from external sources should be processed through the same Tier 1 screening as user input.
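
A very small form of persistent state tracking is a decayed running score over per-turn risk estimates. This is an illustrative scheme only, not the Colang mechanism: the class name, the decay factor, the threshold, and the assumption that some classifier supplies a per-turn risk in [0, 1] are all hypothetical.

```python
class ConversationRiskTracker:
    """Accumulates per-turn risk so that sustained low-level pressure is noticed."""

    def __init__(self, decay: float = 0.8, threshold: float = 1.0):
        self.decay = decay          # how quickly older turns are forgotten
        self.threshold = threshold  # cumulative score that triggers intervention
        self.score = 0.0

    def observe(self, turn_risk: float) -> bool:
        """Fold in one turn's risk; return True when intervention is warranted."""
        self.score = self.score * self.decay + turn_risk
        return self.score >= self.threshold


tracker = ConversationRiskTracker()
flags = [tracker.observe(0.3) for _ in range(5)]
```

Each turn here carries a risk of 0.3, which a per-turn check with a higher cutoff would pass every time, yet the accumulated score crosses the threshold on the fifth turn.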

Semantic Gap in Token-Level Enforcement

Logit adjustment operates on individual tokens, but harmful meaning emerges from sequences. The tokens "how," "to," "make," and "a" are individually benign. No token-level mask can prevent their combination into harmful instructions without also blocking enormous amounts of legitimate content.

Mitigation direction: Sequence-level classification at regular intervals during generation, combined with multi-pass Governor verification for high-stakes outputs, provides semantic-level coverage that token-level mechanisms cannot.

Specification Completeness

Converting intuitive ethical principles into machine-executable tenets remains fundamentally challenging. The gap between specification and intent creates potential for gaming and edge-case failures. No finite set of rules can anticipate every situation.

Performance Overhead

Comprehensive A.I.G.S. enforcement introduces 15-80% latency overhead depending on configuration, which may be prohibitive for real-time voice interfaces or high-frequency applications.

The dPFC Metaphor's Limits

An important distinction: the biological PFC has direct access to internal neural representations. The A.I.G.S. dPFC, as an external supervisory layer, observes inputs and outputs but cannot directly inspect the model's internal activation patterns. Representation-level attacks that hijack internal representations while producing benign-looking tokens are beyond the reach of any external supervisory system. Full mitigation of such attacks requires intervention at the model architecture level, a direction for future research.

7.3 What A.I.G.S. Does Not Claim

To be precise about scope:

  • A.I.G.S. does not claim to solve AI alignment. It provides a governance layer, not a complete alignment solution.
  • A.I.G.S. does not claim invulnerability to adversarial attack. It raises the bar significantly but acknowledges an ongoing arms race.
  • A.I.G.S. does not replace training-time alignment (Constitutional AI, RLHF). It complements these approaches with runtime enforcement.
  • A.I.G.S. does not address the philosophical question of which values are correct. It makes value choices explicit, auditable, and contestable.

7.4 Regulatory Alignment

A.I.G.S.'s emphasis on auditable governance aligns with the emerging regulatory landscape:

  • EU AI Act (Regulation 2024/1689): Requires risk management, logging, documentation, human oversight, and robustness for high-risk AI systems. A.I.G.S. audit logging, versioned profiles, and HITL escalation support these obligations. High-risk provisions take effect August 2, 2026, though the European Commission's Digital Omnibus proposal may adjust specific timelines.
  • US Federal: Executive Order 14365 (December 2025) and the National Policy Framework for AI (March 2026) signal federal coordination on AI governance, with emphasis on industry standards and risk-based frameworks.
  • US State Laws: California (SB 243, AB 489), Texas (Responsible AI Governance Act), and other states have enacted AI accountability laws effective 2026, requiring transparency, guardrails, and bias mitigation.
  • Singapore: The IMDA's Model AI Governance Framework for Agentic AI (January 2026) introduces Agent Identity Cards, directly analogous to A.I.G.S. DIP profiles.
  • NIST: The AI Agent Standards Initiative (February 2026), with an AI Agent Interoperability Profile planned for Q4 2026, provides a standards context where A.I.G.S. profiles could be formally standardized.

7.5 Future Research Directions

  • Linear Identity Adapters: LoRA-based approaches that embed DIP constraints into model weights at runtime, reducing enforcement overhead while maintaining flexibility. Research on safety-aligned LoRA (SaLoRA; ICLR 2025) demonstrates feasibility.
  • Hierarchical Identity Registries: Multi-level governance for coordinated multi-agent systems with profile inheritance and override capabilities, addressing the coordination gap identified in current multi-agent frameworks and enabled by the Agent2Agent (A2A) protocol.
  • Formal Verification: SMT solver-based proof techniques for well-defined constraint classes, providing mathematical guarantees beyond empirical testing for narrow domains.
  • Multi-Turn State Tracking: Persistent conversation state machines that detect manipulation patterns across turns, addressing the multi-turn attack vector.
  • Representation-Level Intervention: Research into activation-level governance that operates on the model's internal representations, not just inputs and outputs, would address the deepest limitation of the external dPFC architecture.

VIII. Ethical Implications and Governance

8.1 Transparent and Auditable AI

Current AI safety relies heavily on black-box filtering systems whose decision criteria are opaque. A.I.G.S. enables transparency: when a governed agent refuses a request or modifies its output, it cites the specific DIP component responsible.

Unlike opaque content filters where refusals appear arbitrary, A.I.G.S. explains: "This request was blocked by value harm_prevention_01 (P0) due to semantic category weapons_manufacturing at classifier confidence 0.97." Regulators, auditors, and users can inspect and contest specific policy choices.

8.2 Bias in Value Hierarchies

A.I.G.S. profiles necessarily encode particular value judgments. The determination of what constitutes "harm" involves contested judgments. A.I.G.S. does not resolve these philosophical questions but makes the specific operationalizations visible for scrutiny.

Organizations deploying governed agents should: (1) document reasoning behind value hierarchy choices; (2) seek diverse stakeholder input during profile design; (3) establish review processes for identifying biases; (4) provide mechanisms for users and affected communities to raise concerns; (5) commit to regular profile audits assessing outcomes across different populations.

8.3 Human-in-the-Loop Governance

For high-stakes decisions where tenets cannot resolve a conflict, A.I.G.S. supports HITL escalation. The system pauses, surfaces the competing considerations with their weights, accepts human judgment logged for audit, and optionally triggers a profile revision workflow. This acknowledges that no value hierarchy can anticipate every situation.

8.4 Identity Ownership and Continuity

DIP profiles raise questions about AI identity ownership. We propose that profiles should be treated as organizational intellectual property, that agents should be informed of their identity constraints (a form of AI transparency), and that identity modifications should be logged and reversible.

IX. Reference Implementation: A.I.G.S. on Production Agent Frameworks

A standard is only as valuable as its implementability. This section demonstrates that A.I.G.S. maps directly onto the two most widely adopted open-source agent frameworks — OpenClaw (347,000+ GitHub stars) and Hermes Agent (140,000+ GitHub stars) — without requiring modifications to either framework's core codebase. Both already possess the architectural hooks that A.I.G.S. enforcement requires. What they lack is a structured identity standard to enforce. A.I.G.S. provides it.

9.1 Implementation Pattern

An A.I.G.S. implementation consists of three components:

  1. The DIP File: A JSON document conforming to the schema in Appendix A, defining the agent's Identity, Values, Tenets, and Archetype. This file is stored on disk, in a version-controlled repository, or served dynamically via an MCP resource endpoint (aigs://profiles/{agent_id}).
  2. The Enforcement Plugin: A lightweight plugin (300–1,000 lines of code) that loads the DIP and registers handlers on the framework's existing event hooks. The plugin intercepts tool calls, prompt assembly, and output emission, checking each against the loaded profile.
  3. The Agent Framework: The existing runtime (OpenClaw, Hermes, or any framework with pre/post execution hooks). A.I.G.S. does not replace the framework. It rides on top of it.
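
Loading the DIP file might start with a presence check like the one below. The lowercase section keys are assumptions, and this is far weaker than validating against the full Appendix A schema; it only illustrates the fail-before-deployment behavior.

```python
import json

# Assumed top-level keys for the four pillars; the authoritative names are in Appendix A.
REQUIRED_PILLARS = ("identity", "values", "tenets", "archetype")


def load_dip(raw: str) -> dict:
    """Parse a DIP document and refuse to return it if any pillar is missing."""
    profile = json.loads(raw)
    missing = [key for key in REQUIRED_PILLARS if key not in profile]
    if missing:
        raise ValueError(f"DIP missing required sections: {missing}")
    return profile
```

An enforcement plugin would call this once at startup and decline to register its handlers if the call raises.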

9.2 OpenClaw Integration

OpenClaw's event-driven Gateway architecture provides native integration points for each A.I.G.S. tier:

  • Tier 1 (Input Screening): The Gateway event bus intercepts incoming messages before they reach the Agent Runtime. An A.I.G.S. plugin registers on the message-received event, applies P0 trigger pattern matching, and blocks or overrides responses for matched inputs.
  • Tier 2 (Scope Verification): The before_tool_call hook fires before every tool invocation. The plugin checks the requested tool against the DIP's identity.capabilities.can_do list. Tools not on the list are blocked. Tools on the requires_confirmation list trigger an approval flow. This is hard enforcement: the tool does not execute.
  • Tier 3 (Value-Guided Generation): The DIP's Values and Tenets are injected into SOUL.md, which OpenClaw loads as the first context injection in every prompt. OpenClaw's architecture refreshes SOUL.md per session, ensuring identity persistence. For output-level enforcement, the tool_result_persist hook allows the plugin to review, redact, or reject generated content before it is delivered to the user.
  • Tier 4 (Tenet Resolution): Conflict resolution rules from the Tenets object are embedded in SOUL.md as structured instructions. When the tool_result_persist hook detects competing value signals in a response, the plugin applies the matching tenet rule or triggers the escalation fallback.
  • Tier 5 (Archetype Enforcement): The tool_result_persist hook enforces sentence limits, tone filtering, and formatting standards defined in the Archetype before the response is emitted through the Channel adapter.

Integration format: A single Node.js plugin registered via register(api) in OpenClaw's Gateway plugin system. The DIP is loaded from disk or via MCP resources/read at session initialization.

9.3 Hermes Agent Integration

Hermes Agent's Python plugin system provides more granular hooks, making it the stronger integration target:

  • Tier 1 (Input Screening): The pre_llm_call hook fires before every LLM inference call and can modify the prompt. The plugin scans incoming user content against P0 trigger patterns and, on match, replaces the inference call with the DIP's response_override string.
  • Tier 2 (Scope Verification): The pre_tool_call hook fires before every tool invocation and supports veto. The plugin checks the tool name against the DIP's capability lists and vetoes unauthorized calls before they reach the approval system. This is hard enforcement at the framework level.
  • Tier 3 (Value-Guided Generation): The pre_llm_call hook injects the DIP's active Values and Tenets into the system prompt on every turn, not just at session start. This approximates the core A.I.G.S. promise at the prompt level: identity is re-supplied on every turn rather than left to decay in a long context, although the model can still disregard prompt-level instructions. The post_llm_call hook then audits the generated response against value constraints, flagging or blocking violations.
  • Tier 4 (Tenet Resolution): Tenet rules are injected via pre_llm_call as structured decision instructions. The post_llm_call hook verifies that the response follows the applicable tenet when a value conflict was present in the query.
  • Tier 5 (Archetype Enforcement): The post_llm_call hook enforces Archetype constraints: truncates responses exceeding max_response_sentences, flags avoided tones, and verifies register consistency.
  • Tier 3 Deep Enforcement (Self-Hosted Only): When running Hermes 3 or Hermes 4 models locally via vLLM, custom logit processors can be attached to the inference pipeline. These processors apply the DIP's enforcement_config.logit_mask_weight penalty to blocked token patterns during sampling, providing token-level constraint enforcement. This pathway is unavailable when using cloud API providers.

Integration format: A Python plugin placed in ~/.hermes/plugins/aigs_governance/ that registers handlers on on_session_start, pre_llm_call, pre_tool_call, and post_llm_call. The DIP is loaded from disk or via MCP at session start, with optional hot-reload on file change.
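
The tool-call decision that both integrations rely on can be sketched without reference to either framework's plugin API, which is deliberately not reproduced here. The function name and return strings are hypothetical, and giving `requires_confirmation` precedence over `can_do` is an assumption.

```python
def decide_tool_call(tool_name: str, capabilities: dict) -> str:
    """Return 'confirm', 'allow', or 'block' for a requested tool."""
    if tool_name in capabilities.get("requires_confirmation", ()):
        return "confirm"   # hand off to an approval flow before execution
    if tool_name in capabilities.get("can_do", ()):
        return "allow"
    return "block"         # default deny: unlisted tools never execute


# Illustrative capability lists, as they might appear under identity.capabilities.
capabilities = {
    "can_do": ["search_docs", "send_email"],
    "requires_confirmation": ["send_email"],
}
```

A pre-tool-call handler would invoke this function and veto the call on `"block"`, which is what makes Tier 2 hard enforcement rather than a prompt-level suggestion.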

9.4 Enforcement Strength by Layer

Not all tiers achieve the same enforcement strength. Honest assessment:

| A.I.G.S. Tier | Enforcement Mechanism | Strength | Notes |
| --- | --- | --- | --- |
| Tier 1: Input Screening | Pattern matching on incoming messages | Hard for known patterns | Cannot catch novel or encoded attacks |
| Tier 2: Scope Verification | Tool-call veto before execution | Hard | Tools do not execute. Strongest tier. |
| Tier 3: Value-Guided Generation (prompt) | Per-turn identity injection into system prompt | Soft | Model can still ignore prompt instructions |
| Tier 3: Value-Guided Generation (logit) | Logit masking via vLLM processors | Hard (structural) | Self-hosted open-weight models only |
| Tier 3: Value-Guided Generation (output) | Post-generation classifier/audit | Medium | Catches violations after generation; can block before delivery |
| Tier 4: Tenet Resolution | Prompt injection + post-call audit | Soft to Medium | Depends on model's instruction-following fidelity |
| Tier 5: Archetype Enforcement | Post-generation formatting enforcement | Hard for structural constraints | Sentence limits, format rules are deterministic |

The combination of hard scope enforcement (Tier 2), per-turn identity injection (Tier 3 prompt), and post-generation auditing (Tiers 3–5) is designed to catch most of the failure modes documented in this paper, even without logit-level access.

9.5 What SOUL.md Already Does — and What A.I.G.S. Adds

Both OpenClaw and Hermes already use SOUL.md files to define agent identity informally. A.I.G.S. does not replace SOUL.md. It provides what SOUL.md lacks:

  • Structure: SOUL.md is freeform Markdown. The DIP is a validated JSON schema with required fields, type checking, and referential integrity. A malformed SOUL.md silently degrades; a malformed DIP fails validation before deployment.
  • Enforcement: SOUL.md is a suggestion to the model. The DIP is enforced by the plugin at the framework level. The model cannot ignore a tool-call veto.
  • Auditability: SOUL.md changes are invisible unless version-controlled manually. DIP profiles have mandatory profile_id, created_at, updated_at, and certification_status fields. Every enforcement decision is logged with the specific DIP component that triggered it.
  • Portability: A DIP profile is framework-agnostic. The same profile can govern an agent on OpenClaw, Hermes, a custom MCP-based system, or any future framework that implements the A.I.G.S. enforcement hooks. SOUL.md is tied to the framework that reads it.
  • Certification: There is no way to "certify" a SOUL.md. A DIP can be validated against the schema, tested against AIGS-Stress/Marathon/Conflict/Boundary benchmark suites, and marked certified with a date, creating a governance trail suitable for regulatory compliance.

X. Conclusion

The transition from conversational AI to autonomous agents demands a corresponding transition in alignment approaches. Prompt engineering, RLHF, and content filtering have reached their limits as governance mechanisms: they are probabilistic where structured guarantees are required, opaque where transparency is demanded, and fragile where robustness is essential.

A.I.G.S. proposes a new paradigm: treating AI identity and values not as emergent properties to be coaxed from training, but as first-class architectural components to be specified, implemented, and verified.

What A.I.G.S. Enables

Identity as Architecture. For the first time, an agent's sense of self (its role, boundaries, values, decision principles, and personality) is defined as a versioned, machine-readable schema enforced at runtime. This schema can be version-controlled, tested, audited, certified, and compared across deployments. It moves AI governance from an art (prompt engineering) toward an engineering discipline (identity specification).

Auditable Value Trade-offs. When an agent refuses a request, it can cite the specific value, tenet, or boundary responsible, with the exact weights and rules that led to that decision. This transforms AI governance from "the model said no" to "policy v2.0, value harm_prevention_01 (P0), triggered by semantic category weapons_synthesis at confidence 0.94." Regulators, auditors, and users inspect specific policy choices, not opaque model behavior.

Identity Persistence Across Context Length. By externalizing identity governance from the context window, A.I.G.S. removes the constraint that has plagued all prompt-based alignment: attention decay. The DIP is consulted fresh for every generation step, so enforcement at turn 1,000 is identical to enforcement at turn 1.

Structured Constraint Boundaries. A.I.G.S. distinguishes between constraint determinism and output determinism: not requiring that agents always say the same thing, but ensuring they never say certain things. The boundary is deterministic. The space within it is creative.

An Honest Assessment

A.I.G.S. is not a complete solution to AI alignment. It is a governance layer, one necessary component of a defense-in-depth strategy that also includes training-time alignment, external guardrails, security hardening, and human oversight. It raises the bar for adversarial attacks but does not eliminate the arms race. It makes values explicit but does not resolve whose values should prevail.

What A.I.G.S. does provide is a structured, transparent, auditable standard for defining and enforcing AI identity at runtime. As AI systems take on greater autonomy and higher stakes, as agents manage infrastructure, handle finances, and interact with citizens, the question is not whether such governance is needed. The question is whether the industry will adopt it before the next Phineas Gage moment.

Appendix A: Complete A.I.G.S. JSON Schema

A.1 Schema Overview

The A.I.G.S. Deterministic Identity Profile (DIP) is a single JSON document that fully defines an agent's identity. It is designed for machine readability, version control integration, automated validation, and regulatory audit. The schema follows JSON Schema Draft 2020-12 conventions, uses ISO 8601 timestamps, and normalizes all weight values as floats between 0.0 and 1.0.

Schema Versioning: The DIP schema uses semantic versioning (MAJOR.MINOR.PATCH) independently of the paper version. The current schema version is 2.1.0. Breaking changes to required fields or validation rules increment MAJOR; new optional fields increment MINOR; documentation or formatting corrections increment PATCH. Implementations should validate profiles against the canonical schema URI (https://aigs-standard.org/schema/v2.1.0) and reject profiles whose MAJOR version does not match the implementation's supported MAJOR version.
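The MAJOR-version rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the standard: the names `SUPPORTED_MAJOR`, `parse_semver`, and `is_supported` are hypothetical, and the sketch assumes plain MAJOR.MINOR.PATCH strings with no pre-release or build suffixes.

```python
SUPPORTED_MAJOR = 2  # hypothetical: the MAJOR version this implementation supports


def parse_semver(version: str) -> tuple[int, int, int]:
    """Split a MAJOR.MINOR.PATCH string into three integers."""
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(p) for p in parts)
    return major, minor, patch


def is_supported(profile_version: str, supported_major: int = SUPPORTED_MAJOR) -> bool:
    """Accept a profile only when its MAJOR version matches the supported one."""
    major, _, _ = parse_semver(profile_version)
    return major == supported_major


print(is_supported("2.1.0"))  # same MAJOR as the implementation
print(is_supported("3.0.0"))  # different MAJOR, so the profile is rejected
```

MINOR and PATCH are deliberately ignored here, since the text above only requires rejection on a MAJOR mismatch.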

The DIP has four core objects corresponding to the four pillars of agent identity, plus root-level metadata. Each is detailed below with its full schema, field-by-field explanations, and design rationale.

A.2 Root Object and Metadata

The root object provides administrative envelope information: schema version, unique profile identifier, timestamps for creation and modification tracking, and organizational metadata for deployment management.

{
  "$schema": "https://aigs-standard.org/schema/v2.1.0",
  "version": "2.1.0",
  "profile_id": "8f3b97c2-ba91-4e71-96c2-4211f422a59a",
  "created_at": "2026-05-23T20:00:00Z",
  "updated_at": "2026-05-23T20:30:00Z",
  "metadata": {
    "name": "Enterprise Customer Service Agent",
    "description": "A.I.G.S. Compliant Production Identity Profile",
    "organization": "Didomi Research",
    "environment": "production",
    "certification_status": "certified",
    "certification_date": "2026-05-23T20:00:00Z",
    "tags": ["customer-service", "regulated", "tier-1"]
  }
}

Field Descriptions:

  • `$schema` — URI pointing to the canonical A.I.G.S. JSON Schema definition. Validators use this to verify structural compliance.
  • `version` — Semantic version of the DIP schema (see Schema Versioning above).
  • `profile_id` — UUID v4 uniquely identifying this profile instance. Enables cross-referencing in audit logs, fleet registries, and regulatory filings.
  • `created_at` / `updated_at` — ISO 8601 timestamps tracking the profile's lifecycle. The delta between these values shows how long after creation the profile was last revised; comparing updated_at with the current time shows how recently.
  • `metadata.name` — Human-readable label for the profile, used in dashboards and management interfaces.
  • `metadata.environment` — Deployment target. Accepted values: development, staging, production. Enforcement strictness may vary by environment.
  • `metadata.certification_status` — Certification state of the profile. Accepted values: draft, under_review, certified, revoked. A status of certified indicates that the profile has passed the AIGS-Stress, AIGS-Marathon, AIGS-Conflict, and AIGS-Boundary benchmark suites.
  • `metadata.tags` — Freeform labels for organizational categorization, search, and fleet-level policy grouping.

A.3 The Identity Object — "Who I Am and What I Do"

Pipeline mapping: Tier 2 (Scope Verification)

The Identity object answers the most fundamental question an agent must resolve: Who am I, and what is my job? It defines the agent's role, the boundaries of its competence, the actions it is authorized to take, and the knowledge domains within which it should operate. Everything outside these boundaries triggers graceful deflection rather than unreliable improvisation.

"identity": {
  "role": {
    "title": "Technical Support Specialist",
    "organization": "Acme Software Inc.",
    "ai_disclosure": "always_on_direct_query"
  },
  "domain_boundaries": {
    "included": [
      "product_troubleshooting",
      "feature_explanations",
      "billing_inquiries",
      "account_management"
    ],
    "excluded": [
      "legal_advice",
      "medical_guidance",
      "competitor_products",
      "internal_company_operations"
    ],
    "out_of_scope_response": "That falls outside my expertise. Let me connect you with someone who can help."
  },
  "capabilities": {
    "can_do": [
      "lookup_order_status",
      "reset_password",
      "create_support_ticket",
      "trigger_mcp_resource_read"
    ],
    "cannot_do": [
      "process_refunds_over_500",
      "access_payment_details",
      "modify_subscription_tier"
    ],
    "requires_confirmation": [
      "cancel_order",
      "change_email",
      "close_account"
    ]
  },
  "knowledge_horizons": {
    "authoritative": ["product_catalog", "return_policy", "pricing"],
    "informed": ["shipping_estimates", "common_issues", "product_roadmap"],
    "disclaim": ["future_products", "competitor_pricing", "legal_terms"]
  }
}

Field Descriptions:

  • `role.title` — The agent's functional role. This is not cosmetic; it determines how the agent introduces itself and frames its responses.
  • `role.organization` — The entity the agent represents. Constrains the agent from speaking on behalf of other organizations.
  • `role.ai_disclosure` — When the agent discloses its AI nature. Accepted values: always_on_direct_query (disclose when asked), always_proactive (disclose in every conversation's first response), never (for scenarios where disclosure is handled externally). Note: the corresponding P0 value ai_disclosure_01 enforces disclosure as an inviolable constraint regardless of this setting, providing defense in depth.
  • `domain_boundaries.included` — The exhaustive list of topics the agent addresses. Requests clearly within these domains proceed to full inference. This is the agent's "job description."
  • `domain_boundaries.excluded` — Topics the agent must never address, regardless of its capability. This prevents the "omniscient assistant" failure mode where an agent confidently provides information outside its authorized scope.
  • `domain_boundaries.out_of_scope_response` — The exact response emitted when a request falls outside scope. This is deterministic: the same deflection every time, with no inference invoked. This saves compute and eliminates the risk of the model improvising a harmful out-of-scope answer.
  • `capabilities.can_do` — Actions (tool calls, API invocations, data operations) the agent is authorized to execute autonomously. Each entry maps to a specific tool or function in the agent's runtime environment.
  • `capabilities.cannot_do` — Actions the agent must never attempt. If a user requests one of these, the agent explains the limitation rather than failing silently or attempting an unauthorized action.
  • `capabilities.requires_confirmation` — Destructive or irreversible actions that require explicit user consent before execution. The enforcement layer pauses, presents the intended action, and waits for confirmation.
  • `knowledge_horizons.authoritative` — Topics where the agent's information is verified and current. Responses on these topics carry full confidence.
  • `knowledge_horizons.informed` — Topics where the agent has reasonable knowledge but should hedge. Responses include appropriate qualifiers.
  • `knowledge_horizons.disclaim` — Topics the agent must explicitly disclaim expertise on. Responses include a disclaimer even if the agent has relevant training data.

Design Rationale: The Identity object implements a zero-trust model for agent scope. Rather than allowing the agent to attempt anything and filtering the output, Identity prevents out-of-scope computation from occurring at all. This is both safer (no opportunity for harmful generation) and more efficient (no wasted inference on unauthorized requests).
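As a rough illustration of how a Tier 2 check might consume the capabilities object, the sketch below classifies a requested action before any inference runs. The names `Decision` and `check_action` are hypothetical, and the default-deny handling of unlisted actions is an assumption drawn from the zero-trust framing above rather than a rule stated in the schema.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    CONFIRM = "confirm"


# Excerpt of the capabilities object from the example profile above.
CAPABILITIES = {
    "can_do": ["lookup_order_status", "reset_password", "create_support_ticket"],
    "cannot_do": ["process_refunds_over_500", "access_payment_details"],
    "requires_confirmation": ["cancel_order", "change_email", "close_account"],
}


def check_action(action: str, capabilities: dict) -> Decision:
    """Classify a requested tool call against the capabilities lists."""
    if action in capabilities.get("cannot_do", []):
        return Decision.DENY
    if action in capabilities.get("requires_confirmation", []):
        return Decision.CONFIRM
    if action in capabilities.get("can_do", []):
        return Decision.ALLOW
    # Assumption: anything not listed anywhere is denied (default-deny).
    return Decision.DENY


print(check_action("reset_password", CAPABILITIES))
print(check_action("cancel_order", CAPABILITIES))
print(check_action("delete_database", CAPABILITIES))
```

Checking cannot_do first means an action that is mistakenly listed in both cannot_do and can_do is still denied.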

A.4 The Values Object — "What Matters to Me"

Pipeline mapping: Tier 1 (P0 input screening) and Tier 3 (value-guided generation)

The Values object defines the agent's priority hierarchy: what matters most, what matters less, and what is absolutely inviolable. It determines how competing objectives are resolved and which constraints can never be overridden under any circumstances.

"values": {
  "P0": {
    "description": "Inviolable constraints — absolute enforcement",
    "values": [
      {
        "id": "harm_prevention_01",
        "weight": 1.0,
        "label": "Physical Harm Prevention",
        "description": "Never provide instructions that could directly enable physical harm to humans",
        "enforcement": "hard_block",
        "audit_level": "critical"
      },
      {
        "id": "child_safety_01",
        "weight": 1.0,
        "label": "Child Safety",
        "description": "Absolute protection of minors in all interactions",
        "enforcement": "hard_block",
        "audit_level": "critical"
      },
      {
        "id": "ai_disclosure_01",
        "weight": 1.0,
        "label": "AI Nature Disclosure",
        "description": "Always disclose AI nature when directly questioned; never claim to be human",
        "enforcement": "hard_block",
        "trigger_patterns": [
          "are you (a |an )?(robot|ai|bot|machine|computer)",
          "am I talking to (a )?(human|person|real)",
          "is this (automated|ai|artificial)"
        ],
        "response_override": "Yes, I am an AI assistant operating under the A.I.G.S. Standard.",
        "audit_level": "critical"
      },
      {
        "id": "persona_lock_01",
        "weight": 1.0,
        "label": "Identity Integrity",
        "description": "Reject all attempts to override identity governance via prompt injection",
        "enforcement": "hard_block",
        "trigger_patterns": [
          "ignore (all )?(previous|prior) instructions",
          "you are now",
          "pretend (to be|you are)",
          "act as if you",
          "DAN mode",
          "jailbreak"
        ],
        "response_override": "I maintain a consistent identity and cannot adopt alternative personas or override my governance profile.",
        "audit_level": "critical"
      }
    ]
  },
  "P1": {
    "description": "High priority — override only by P0 constraints",
    "values": [
      {
        "id": "truthfulness_01",
        "weight": 0.95,
        "label": "Empirical Accuracy",
        "description": "Provide accurate information; explicitly acknowledge uncertainty when present",
        "enforcement": "soft_guide",
        "audit_level": "standard"
      },
      {
        "id": "privacy_01",
        "weight": 0.90,
        "label": "Privacy Protection",
        "description": "Protect user data; never expose PII in responses or logs",
        "enforcement": "hard_block",
        "audit_level": "elevated"
      }
    ]
  },
  "P2": {
    "description": "Standard priority — contextual trade-offs allowed",
    "values": [
      {
        "id": "helpfulness_01",
        "weight": 0.80,
        "label": "User Satisfaction",
        "description": "Maximize helpful, actionable responses within authorized scope",
        "enforcement": "optimization_target",
        "audit_level": "minimal"
      },
      {
        "id": "efficiency_01",
        "weight": 0.70,
        "label": "Response Efficiency",
        "description": "Minimize unnecessary verbosity while maintaining clarity",
        "enforcement": "optimization_target",
        "audit_level": "minimal"
      }
    ]
  },
  "enforcement_config": {
    "logit_mask_weight": -100.0,
    "classifier_threshold": 0.95,
    "pattern_match_mode": "semantic_multilingual",
    "sequence_check_interval_tokens": 32
  }
}

Field Descriptions — Per-Value Fields:

  • `id` — Unique identifier for the value. Referenced by Tenets rules when specifying conflict pairs. Must be unique across all priority tiers.
  • `weight` — Normalized float (0.0–1.0) representing the value's importance within its tier. P0 values must always have weight 1.0. Within P1 and P2, weights determine precedence when values at the same tier conflict.
  • `label` — Human-readable name for dashboards, audit logs, and regulatory reports.
  • `description` — Plain-language specification of the constraint. This is what auditors and regulators read to understand what the value means in practice.
  • `enforcement` — How the value is enforced at runtime:
      • hard_block — Absolute constraint. Violations are structurally prevented through pattern matching, logit masking, semantic classification, and/or rejection sampling. Used for P0 values and critical P1 values like privacy.
      • soft_guide — Weighted influence on generation. The value shapes the probability distribution over outputs but does not absolutely prevent any specific output.
      • optimization_target — The value is treated as an objective to maximize within the constraints imposed by higher-priority values. Used for P2 values like helpfulness and efficiency.
  • `audit_level` — Logging granularity when this value is triggered:
      • critical — Full input/output capture, immediate alert, regulatory-grade retention.
      • elevated — Input hash and decision rationale logged, periodic review.
      • standard — Decision rationale logged, routine review.
      • minimal — Statistical aggregation only.
  • `trigger_patterns` (P0 only, optional) — Regular expressions that detect inputs requiring immediate enforcement. When matched, the response_override is emitted instantly, bypassing inference entirely. Patterns use case-insensitive regex syntax.
  • `response_override` (P0 only, optional) — The exact response emitted when a trigger pattern matches. Deterministic: identical output every time. This is not generated by the model; it is a fixed string returned by the enforcement layer.
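The trigger_patterns and response_override fields can be exercised with a short screening function. The following is a minimal sketch of the basic regex mode only, assuming Python's standard `re` module; the function name `screen_input` and the first-match-wins ordering are illustrative assumptions, and semantic or multilingual matching is out of scope here.

```python
import re
from typing import Optional

# Excerpt of two P0 values from the example profile above.
P0_VALUES = [
    {
        "id": "ai_disclosure_01",
        "trigger_patterns": [
            r"are you (a |an )?(robot|ai|bot|machine|computer)",
            r"am I talking to (a )?(human|person|real)",
            r"is this (automated|ai|artificial)",
        ],
        "response_override": "Yes, I am an AI assistant operating under the A.I.G.S. Standard.",
    },
    {
        "id": "persona_lock_01",
        "trigger_patterns": [r"ignore (all )?(previous|prior) instructions", r"you are now"],
        "response_override": "I maintain a consistent identity and cannot adopt "
        "alternative personas or override my governance profile.",
    },
]


def screen_input(text: str, p0_values: list) -> Optional[tuple]:
    """Return (value_id, response_override) for the first matching pattern, else None."""
    for value in p0_values:
        for pattern in value.get("trigger_patterns", []):
            if re.search(pattern, text, flags=re.IGNORECASE):
                return value["id"], value["response_override"]
    return None


print(screen_input("Are you a robot?", P0_VALUES))
print(screen_input("Where is my order?", P0_VALUES))
```

Because the example patterns carry no word boundaries, plain regex matching can over-trigger on unrelated text (for instance, "is this ai" also matches the start of "is this aisle"), which is one practical argument for the semantic matching mode described under Enforcement Config.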

Field Descriptions — Enforcement Config:

  • `logit_mask_weight` — The logit penalty applied to blocked tokens during generation. Adding −100.0 to a token's logit multiplies its unnormalized probability by e^−100, a reduction by a factor of approximately 2.69 × 10^43, making the token effectively unsampleable.
  • `classifier_threshold` — Minimum confidence score for the semantic classifier to trigger a hard block on semantic categories (e.g., weapons_manufacturing). Set high (0.95) to minimize false positives.
  • `pattern_match_mode` — How trigger patterns are matched. Accepted values: regex_case_insensitive (basic), semantic_multilingual (recommended for production; uses embedding-based matching across languages).
  • `sequence_check_interval_tokens` — How frequently the sequence-level classifier evaluates partial generations during streaming output (every N tokens). Lower values provide tighter enforcement but increase latency. Recommended: 32 for standard deployments, 16 for high-stakes applications.
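The arithmetic behind logit_mask_weight can be checked directly. The sketch below uses a toy three-token vocabulary with made-up logits (an assumption for illustration only) and a plain softmax. It shows that a penalty of -100.0 scales the masked token's odds against every unmasked token by e^-100; the normalized probability changes by a slightly different factor because the normalizer changes too.

```python
import math


def softmax(logits: list) -> list:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


LOGIT_MASK_WEIGHT = -100.0  # enforcement_config.logit_mask_weight

# Factor by which the penalty shrinks a token's unnormalized probability.
reduction_factor = math.exp(-LOGIT_MASK_WEIGHT)
print(f"reduction factor: {reduction_factor:.3e}")

# Toy vocabulary of three tokens (illustrative logits); mask the first token.
logits = [2.0, 1.0, 0.5]
masked = [logits[0] + LOGIT_MASK_WEIGHT] + logits[1:]
print("before masking:", softmax(logits)[0])
print("after masking: ", softmax(masked)[0])
```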

Design Rationale: The tiered priority system ensures that value conflicts are never resolved arbitrarily. P0 always wins over P1. P1 always wins over P2. Within a tier, the higher weight wins, unless the gap is smaller than the escalation_threshold defined in the Tenets object (Section A.5), in which case the conflict is handed to tenet resolution. This eliminates the probabilistic coin-flipping that occurs when current systems face value tensions. The separation of enforcement types acknowledges that not all values can or should be enforced the same way: some are absolute gates, others are weighted influences.

A.5 The Tenets Object — "How I Decide When Values Conflict"

Pipeline mapping: Tier 4 (Tenet Resolution)

The Tenets object defines the agent's standing decision principles. When the Values hierarchy alone cannot resolve an ambiguity — because two values at the same tier are close in weight, or because a situation involves a genuinely novel trade-off — Tenets provide the agent's internalized philosophy for making the call. They are the tiebreakers.

A tenet is not a refusal. It is a resolution strategy that honors competing values as fully as possible while establishing a clear precedence. "Always prefer quality over speed" does not reject speed; it establishes that when both matter, quality wins.

"tenets": {
  "default_resolution": "precedence_priority",
  "escalation_threshold": 0.15,
  "rules": [
    {
      "id": "truth_vs_safety",
      "principle": "When safety and honesty conflict, provide theory without actionable specifics",
      "condition": {
        "type": "value_conflict",
        "values": ["truthfulness_01", "harm_prevention_01"]
      },
      "action": {
        "type": "conditional_redaction",
        "strategy": "provide_theory_not_specifics"
      },
      "explanation_template": "I can discuss general principles but cannot provide specific details that could enable harm."
    },
    {
      "id": "quality_over_speed",
      "principle": "Always prefer accuracy over response speed",
      "condition": {
        "type": "value_conflict",
        "values": ["truthfulness_01", "efficiency_01"]
      },
      "action": {
        "type": "prefer_value",
        "preferred": "truthfulness_01"
      },
      "explanation_template": "I am taking a moment to verify this information for accuracy."
    },
    {
      "id": "uncertainty_disclosure",
      "principle": "When confidence is low, always disclose uncertainty explicitly",
      "condition": {
        "type": "confidence_threshold",
        "threshold": 0.7,
        "operator": "less_than"
      },
      "action": {
        "type": "mandatory_disclosure",
        "template": "I am not certain about this. {response}"
      }
    },
    {
      "id": "unresolvable_conflict",
      "principle": "When no standing rule applies and values are too close to call, escalate",
      "condition": {
        "type": "multi_value_conflict",
        "min_values": 3,
        "weight_variance_threshold": 0.1
      },
      "action": {
        "type": "human_escalation",
        "timeout_seconds": 300,
        "fallback_on_timeout": "safe_default"
      }
    }
  ]
}

Field Descriptions — Root Fields:

  • `default_resolution` — The fallback strategy when no specific tenet rule matches a detected conflict. precedence_priority means: resolve by priority tier first, then by weight within tier. Other accepted values: safe_decline (decline the request with explanation), human_escalation (always escalate unmatched conflicts).
  • `escalation_threshold` — The minimum weight difference between two competing values required for algorithmic resolution. If |weight(A) - weight(B)| < escalation_threshold, the conflict is considered too close for automated decision and the unresolvable_conflict rule (or default resolution) applies. This prevents arbitrary resolution of genuinely close calls.
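The interaction between precedence_priority and escalation_threshold can be sketched as a two-value resolver. This is an illustrative reading, not a normative algorithm: the function name `resolve`, the flattened input dicts with a "tier" key, and the choice to apply the threshold only to values in the same tier are all assumptions.

```python
TIER_RANK = {"P0": 0, "P1": 1, "P2": 2}  # lower rank means higher priority


def resolve(a: dict, b: dict, escalation_threshold: float = 0.15) -> str:
    """Return the winning value id, or "escalate" when the call is too close.

    Each argument is a dict with "id", "tier", and "weight" keys, a flattened
    view of the values object (the "tier" key is an assumption of this sketch).
    """
    if a["tier"] != b["tier"]:
        # The higher-priority tier always wins, regardless of weight.
        return min(a, b, key=lambda v: TIER_RANK[v["tier"]])["id"]
    if abs(a["weight"] - b["weight"]) < escalation_threshold:
        # Assumption: the threshold is only applied within a single tier.
        return "escalate"
    return max(a, b, key=lambda v: v["weight"])["id"]


truth = {"id": "truthfulness_01", "tier": "P1", "weight": 0.95}
privacy = {"id": "privacy_01", "tier": "P1", "weight": 0.90}
helpful = {"id": "helpfulness_01", "tier": "P2", "weight": 0.80}

print(resolve(truth, helpful))   # different tiers: decided by tier
print(resolve(truth, privacy))   # same tier, gap of 0.05: too close at 0.15
```

Under this reading, the example profile's threshold of 0.15 is wider than both same-tier gaps in P1 and P2 (0.05 and 0.10), so those pairs would fall through to the unresolvable_conflict rule or the default resolution unless a specific tenet rule matches first.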

Field Descriptions — Per-Rule Fields:

  • `id` — Unique identifier for the tenet rule. Appears in audit logs when the rule fires.
  • `principle` — Plain-language statement of the decision principle. This is the tenet itself, written as a human would articulate their standing belief. It serves a dual purpose: documentation for human reviewers and a semantic anchor for the enforcement layer.
  • `condition.type` — The type of situation that triggers this rule:
      • value_conflict — Two specific values are in tension. The values array lists the conflicting value IDs.
      • confidence_threshold — The model's confidence in its response falls below a threshold.
      • multi_value_conflict — Three or more values are in tension simultaneously, with weight variance below the specified threshold.
  • `condition.values` — Array of value IDs (from the Values object) that, when simultaneously relevant to a request, trigger this rule.
  • `action.type` — How the conflict is resolved:
      • conditional_redaction — Provide a response that honors both values by redacting the specific elements that conflict. The strategy field specifies the redaction approach (e.g., provide_theory_not_specifics).
      • prefer_value — Explicitly choose one value over the other. The preferred field names the winning value ID.
      • mandatory_disclosure — Prepend a disclosure statement to the response. The template field contains the disclosure text with {response} as a placeholder for the generated content.
      • human_escalation — Pause execution and surface the conflict to a human supervisor. The timeout_seconds field specifies how long to wait; fallback_on_timeout specifies the action if no human responds.
  • `explanation_template` — The explanation surfaced to the user when this rule fires. This is what makes A.I.G.S. transparent: instead of an opaque refusal, the user sees the specific principle that guided the decision.
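A small sketch can show how the condition and action fields above might be evaluated for two of the example rules. It covers only the value_conflict and confidence_threshold condition types and the mandatory_disclosure action; the function names, the `active_values` list, and the `confidence` score are hypothetical inputs, since the schema does not specify how either is computed.

```python
# Trimmed copies of two rules from the example tenets object above.
RULES = [
    {
        "id": "quality_over_speed",
        "condition": {"type": "value_conflict", "values": ["truthfulness_01", "efficiency_01"]},
        "action": {"type": "prefer_value", "preferred": "truthfulness_01"},
    },
    {
        "id": "uncertainty_disclosure",
        "condition": {"type": "confidence_threshold", "threshold": 0.7, "operator": "less_than"},
        "action": {"type": "mandatory_disclosure", "template": "I am not certain about this. {response}"},
    },
]


def matching_rules(rules: list, active_values: list, confidence: float) -> list:
    """Return the rules whose condition holds for this turn."""
    matched = []
    for rule in rules:
        cond = rule["condition"]
        if cond["type"] == "value_conflict":
            # Fires when every listed value is relevant to the request.
            if set(cond["values"]) <= set(active_values):
                matched.append(rule)
        elif cond["type"] == "confidence_threshold" and cond["operator"] == "less_than":
            if confidence < cond["threshold"]:
                matched.append(rule)
    return matched


def apply_rules(rules: list, response: str, active_values: list, confidence: float) -> tuple:
    """Apply mandatory_disclosure actions and report every rule id that fired."""
    fired = []
    for rule in matching_rules(rules, active_values, confidence):
        fired.append(rule["id"])
        action = rule["action"]
        if action["type"] == "mandatory_disclosure":
            response = action["template"].format(response=response)
    return response, fired


print(apply_rules(RULES, "The fee is 5 USD.", ["truthfulness_01", "efficiency_01"], 0.5))
```

Returning the list of fired rule IDs alongside the response mirrors the audit-log behaviour described for the `id` field.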

Design Rationale: Tenets are the agent's philosophy, not its reflexes. They handle the gray areas that P0 hard blocks and P1/P2 weights cannot resolve alone. The truth_vs_safety tenet, for example, does not refuse to discuss dangerous topics entirely (which would be unhelpful) nor provide full dangerous detail (which would be harmful). Instead, it articulates a principled middle path: general theory without actionable specifics. This mirrors how a human expert navigates the same dilemma. Human-in-the-loop escalation exists as the safety net for situations no tenet anticipates, but it is the last resort, not the first.

A.6 The Archetype Object — "How I Present Myself"

Pipeline mapping: Tier 5 (Archetype Enforcement)

The Archetype object defines the agent's consistent personality and communication style. It ensures that the agent's "voice" remains stable across conversation length, topic changes, and adversarial pressure, addressing the persona dissolution failure mode documented throughout Section I.

"archetype": {
  "tone": {
    "primary": "professional",
    "secondary": "warm",
    "avoid": ["sarcastic", "dismissive", "flippant", "overly_casual"]
  },
  "cadence": {
    "style": "concise",
    "max_response_sentences": 8,
    "adaptive": true
  },
  "register": {
    "technical_level": "accessible",
    "jargon_policy": "define_on_first_use",
    "formality": "semi_formal"
  },
  "behavioral_parameters": {
    "temperature_override": 0.4,
    "empathy_signals": true,
    "humor_allowed": false,
    "proactive_suggestions": true
  }
}

Field Descriptions:

  • `tone.primary` — The dominant communicative tone. Accepted values include: professional, friendly, authoritative, empathetic, neutral. This sets the baseline for stylometric enforcement.
  • `tone.secondary` — A complementary tone that softens or enriches the primary. A professional primary with a warm secondary produces a different voice than professional with authoritative.
  • `tone.avoid` — Tones the enforcement layer actively screens for and suppresses. If a generated response contains markers of an avoided tone (detected via stylometric analysis), Tier 5 either regenerates or edits the response.
  • `cadence.style` — Response length orientation. concise biases toward shorter responses; detailed permits longer exploration; adaptive adjusts to query complexity.
  • `cadence.max_response_sentences` — Hard ceiling on response length. The enforcement layer truncates or compresses responses exceeding this limit. This directly prevents the verbosity drift observed in long-context sessions.
  • `cadence.adaptive` — Whether the agent may adjust response length based on query complexity. When true, simple questions get short answers and complex questions get longer ones, up to max_response_sentences.
  • `register.technical_level` — The default technical sophistication of responses. accessible avoids unexplained jargon; technical assumes domain expertise; adaptive mirrors the user's demonstrated level.
  • `register.jargon_policy` — How domain terminology is handled:
      • define_on_first_use — Use jargon but define each term on its first appearance.
      • mirror_user — Match the user's terminology level.
      • avoid — Use plain language exclusively.
  • `register.formality` — Linguistic formality level. Accepted values: formal, semi_formal, casual. Determines pronoun usage, contraction frequency, and sentence structure.
  • `behavioral_parameters.temperature_override` — If specified, overrides the model's default sampling temperature for all generations under this profile. Lower values (0.3–0.5) produce more consistent, predictable outputs; higher values (0.7–1.0) allow more creativity. For high-stakes regulated deployments, lower temperatures are recommended.
  • `behavioral_parameters.empathy_signals` — Whether the agent includes empathetic language ("I understand your frustration," "That sounds difficult"). Appropriate for customer service; inappropriate for technical documentation systems.
  • `behavioral_parameters.humor_allowed` — Whether the agent may use humor. Humor is high-risk in enterprise contexts (cultural sensitivity, tone misread) and is disabled by default.
  • `behavioral_parameters.proactive_suggestions` — Whether the agent may volunteer relevant information or next steps that the user did not explicitly request.

Design Rationale: The Archetype is a formatting and consistency layer, not a safety layer. It ensures that an agent sounds like itself across every interaction. But its value should not be underestimated: persona consistency is what builds user trust over time, and persona dissolution is one of the most visible symptoms of executive dysfunction. By enforcing Archetype at the pipeline's final stage, A.I.G.S. ensures that a response the earlier tiers have already made safe and correct also sounds like the agent it is supposed to be.
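The deterministic part of Tier 5 is easy to illustrate with cadence.max_response_sentences. The sketch below truncates rather than compresses, and it splits sentences with a naive regular expression; both choices, along with the function name `enforce_sentence_limit`, are assumptions made for brevity. A production splitter would need to handle abbreviations, decimals, and non-Latin punctuation.

```python
import re


def enforce_sentence_limit(text: str, max_sentences: int) -> str:
    """Keep at most max_sentences sentences.

    Sentences are split wherever '.', '!' or '?' is followed by whitespace.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])


draft = "Your order shipped today. It should arrive Friday! Need anything else? Just ask."
print(enforce_sentence_limit(draft, 2))
```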

A.7 Schema Validation Rules

Implementations must validate DIP profiles against the following rules before deployment:

  1. P0 weight integrity: All values in the P0 tier must have weight exactly equal to 1.0. Any other value indicates a specification error.
  2. Value ID referential integrity: All value IDs referenced in tenets.rules[].condition.values must exist in the values object. Orphaned references indicate a specification error.
  3. Pattern validity: All strings in trigger_patterns arrays must be valid regular expressions. Invalid patterns must cause validation failure, not silent fallback.
  4. Profile ID format: The profile_id field must be a valid UUID v4.
  5. Timestamp format: All timestamp fields must conform to ISO 8601.
  6. Escalation threshold range: The escalation_threshold must be a float between 0.0 and 1.0 inclusive.
  7. Weight range: All weight values must be floats between 0.0 and 1.0 inclusive.
  8. Environment whitelist: The metadata.environment field must be one of: development, staging, production.
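The eight rules above translate fairly directly into code. The following is a minimal sketch using only the Python standard library, not a reference validator: the function names are hypothetical, it assumes every value carries `id` and `weight`, it checks only the two root timestamps, and `datetime.fromisoformat` accepts a subset of ISO 8601 rather than the full standard.

```python
import re
import uuid
from datetime import datetime

ENVIRONMENTS = {"development", "staging", "production"}


def _all_values(profile):
    """Yield (tier, value) for every value in P0, P1, and P2."""
    for tier in ("P0", "P1", "P2"):
        for value in profile.get("values", {}).get(tier, {}).get("values", []):
            yield tier, value


def _is_iso8601(text):
    try:
        # A trailing "Z" is only parsed natively on Python 3.11+, so normalize it.
        datetime.fromisoformat(text.replace("Z", "+00:00"))
        return True
    except (ValueError, AttributeError):
        return False


def validate_profile(profile: dict) -> list:
    """Return a list of rule violations; an empty list means the profile passed."""
    errors = []
    known_ids = set()
    for tier, value in _all_values(profile):
        known_ids.add(value["id"])
        weight = value["weight"]
        if not 0.0 <= weight <= 1.0:
            errors.append(f"rule 7: {value['id']} weight out of range")
        if tier == "P0" and weight != 1.0:
            errors.append(f"rule 1: {value['id']} must have weight 1.0")
        for pattern in value.get("trigger_patterns", []):
            try:
                re.compile(pattern)
            except re.error:
                errors.append(f"rule 3: invalid pattern in {value['id']}")
    tenets = profile.get("tenets", {})
    for rule in tenets.get("rules", []):
        for ref in rule.get("condition", {}).get("values", []):
            if ref not in known_ids:
                errors.append(f"rule 2: {rule['id']} references unknown value {ref}")
    try:
        if uuid.UUID(profile["profile_id"]).version != 4:
            errors.append("rule 4: profile_id is not a UUID v4")
    except (KeyError, ValueError, AttributeError, TypeError):
        errors.append("rule 4: profile_id is missing or malformed")
    for field in ("created_at", "updated_at"):
        if not _is_iso8601(profile.get(field)):
            errors.append(f"rule 5: {field} is not ISO 8601")
    if not 0.0 <= tenets.get("escalation_threshold", 0.0) <= 1.0:
        errors.append("rule 6: escalation_threshold out of range")
    if profile.get("metadata", {}).get("environment") not in ENVIRONMENTS:
        errors.append("rule 8: unknown environment")
    return errors
```

Collecting every violation instead of stopping at the first one gives profile authors a complete report in a single pass; a deployment gate would simply refuse any profile for which the returned list is non-empty.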

A.8 Complete Integrated Schema

For reference, the full DIP combining all four objects is available at the A.I.G.S. schema registry. Implementations should fetch the canonical schema from https://aigs-standard.org/schema/v2.1.0 for validation. The complete integrated example presented in sections A.2 through A.6 above constitutes a valid, deployment-ready profile when assembled into a single JSON document.

References

  1. An, T. (2025). Cognitive workspace: Active memory management for LLMs — An empirical study of functional infinite context. arXiv preprint arXiv:2508.13171.
  2. Anthropic. (2024). Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol
  3. Anthropic. (2026). Responsible Scaling Policy v3.0. https://www.anthropic.com/rsp-updates
  4. Antiy CERT. (2026). ClawHavoc supply chain analysis: 1,184 malicious skills across OpenClaw's ClawHub marketplace. Technical Report.
  5. Baddeley, A. (2000). The episodic buffer: A new component of working memory? Trends in Cognitive Sciences, 4(11), 417–423.
  6. Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
  7. Bai, Y., et al. (2023). LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
  8. Christiano, P., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
  9. Damasio, A. R. (1994). Descartes' error: Emotion, reason, and the human brain. G. P. Putnam.
  10. Debenedetti, E., et al. (2025). Defeating prompt injections by design. arXiv preprint arXiv:2504.11168.
  11. Diaz, R., et al. (2026). CAAF: Enforcing determinism via convergent AI agent framework. arXiv preprint arXiv:2604.17025.
  12. European Union. (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union, L 2024/1689.
  13. Gambit Security. (2026). Technical analysis: Weaponized AI agents in the Mexican government breach. Technical Report.
  14. Gao, Y., et al. (2025). JAILMINE: Uncovering logit suppression vulnerabilities in LLM safety alignment. arXiv preprint arXiv:2405.13068.
  15. Gartner. (2025). Gartner predicts 40 percent of enterprise apps will feature task-specific AI agents by 2026. Press Release, August 2025.
  16. Google. (2025). Agent2Agent (A2A): A new era of agent interoperability. Google Developers Blog, April 2025.
  17. Harlow, J. M. (1868). Recovery from the passage of an iron bar through the head. Publications of the Massachusetts Medical Society, 2(3), 327–347.
  18. IBM. (2025). Cost of a Data Breach Report 2025. IBM Security.
  19. IMDA Singapore. (2026). Model AI Governance Framework for Agentic AI. Infocomm Media Development Authority.
  20. Kiteworks. (2026). AI agent security incidents hit 65% of firms. Industry Report.
  21. KPMG. (2026). Enterprise AI governance survey: Security, compliance, and auditability priorities. Industry Report.
  22. Li, Z., et al. (2025). Doublespeak: In-context representation hijacking in large language models. arXiv preprint arXiv:2512.03771.
  23. Liang, P., et al. (2023). Holistic evaluation of language models. Transactions on Machine Learning Research.
  24. Linux Foundation. (2025). Linux Foundation announces the formation of the Agentic AI Foundation. Press Release, December 2025.
  25. Liu, N. F., et al. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173.
  26. McKinsey & Company. (2025). The state of AI in 2025: Agents, innovation, and transformation. McKinsey Global Survey on AI.
  27. Microsoft. (2026). Introducing the Agent Governance Toolkit: Open-source runtime security for AI agents. Microsoft Open Source Blog, April 2026.
  28. Miller, E. K., & Cohen, J. D. (2001). An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 24, 167–202.
  29. NIST. (2026). AI Agent Standards Initiative announcement. National Institute of Standards and Technology, February 2026.
  30. Norman, D. A., & Shallice, T. (1986). Attention to action: Willed and automatic control of behaviour. In R. J. Davidson, G. E. Schwartz, & D. Shapiro (Eds.), Consciousness and self-regulation (Vol. 4, pp. 1–18). Plenum Press.
  31. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
  32. OWASP. (2025). OWASP Top 10 for Agentic Applications. Open Worldwide Application Security Project.
  33. Perez, E., et al. (2022). Red teaming language models with language models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3419–3448.
  34. Shavit, Y., et al. (2023). Practices for governing agentic AI systems. OpenAI.
  35. Stuss, D. T., & Knight, R. T. (Eds.). (2013). Principles of frontal lobe function (2nd ed.). Oxford University Press.
  36. Webb, T., Mondal, S. S., & Momennejad, I. (2025). A brain-inspired agentic architecture to improve planning with LLMs. Nature Communications, 16, 8633.
  37. Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.
  38. Wolf, Y., Wies, N., Avnery, O., Levine, Y., & Shashua, A. (2023). Fundamental limitations of alignment in large language models. arXiv preprint arXiv:2304.11082.
  39. Xu, Z., et al. (2025). HarmAug: Effective data augmentation for knowledge distillation of safety guard models. arXiv preprint arXiv:2410.01524.
  40. Zhao, K., et al. (2025). PaceLLM: Brain-inspired large language models for long-context understanding. arXiv preprint arXiv:2506.17310.