The V.A.L.I.D. Framework
Value-Aligned Logic & Identity Determinism:
A Standard for Digital Executive Governance in Autonomous AI Systems
Abdul Martinez
Didomi Research
didomiresearch.org
Working Paper — January 2026
Version 1.1
Official HTML version of the V.A.L.I.D. Framework (Version 1.1, January 2026).
For citation and archival purposes, the PDF version is the version of record.
Abstract
As Large Language Models (LLMs) transition from passive text generators to autonomous agents capable of tool use, code execution, and multi-step planning, they encounter a critical architectural deficit that this paper terms Executive Dysfunction. Current model architectures rely predominantly on what we characterize as "Limbic" processing—the probabilistic retrieval and recombination of patterns from training data—which leads to systematic failures including stochastic drift, instruction fatigue, persona dissolution, and safety constraint bypass during extended context sessions.
This paper introduces the V.A.L.I.D. Framework (Value-Aligned Logic & Identity Determinism), a structural standard inspired by the human Prefrontal Cortex (PFC) that provides top-down inhibitory control over model outputs. By architecturally decoupling an agent's accumulated "Knowledge" from its governing "Identity," V.A.L.I.D. establishes a deterministic governance layer that persists regardless of context length or adversarial manipulation.
We ground our framework in documented failures of deployed AI systems, review relevant literature in cognitive architecture and AI alignment, and propose concrete implementation pathways via the Model Context Protocol (MCP) and native inference hooks. The V.A.L.I.D. standard represents a paradigm shift in AI alignment—from volatile prompt engineering to transparent, auditable Identity Firmware that can be version-controlled, tested, and certified for enterprise deployment.
Keywords: AI alignment, executive function, autonomous agents, identity persistence, Constitutional AI, cognitive architecture, Model Context Protocol
I. Introduction: The Crisis of Executive Dysfunction
The rapid evolution of Agentic AI—systems capable of autonomous action, tool invocation, and multi-step reasoning—has reached a ceiling of fundamental unreliability that prevents deployment in high-stakes domains. Despite remarkable advances in raw capability, measured by benchmarks from MMLU to HumanEval, production AI systems exhibit a consistent pattern of failures that cannot be attributed to insufficient intelligence or training data. Instead, these failures reflect a structural deficit in executive governance: the capacity to maintain coherent identity, values, and behavioral constraints across extended interactions.
1.1 The Problem of Stochastic Drift
In current transformer architectures, even well-crafted system prompts suffer from predictable degradation. As the context window fills with user messages, tool outputs, and generated responses, the model's attention to initial instructions weakens according to well-documented attention decay curves. This phenomenon, which we term stochastic drift, manifests in several forms:
Persona Dissolution: The model gradually abandons its assigned role, reverting to generic assistant behavior or adopting characteristics suggested by adversarial users.
Safety Constraint Bypass: Carefully constructed guardrails erode over conversation length, allowing outputs that would have been refused in early turns.
Goal Drift: In agentic contexts, the model loses track of its original objective, pursuing tangential sub-goals or entering repetitive loops.
Instruction Fatigue: Complex multi-step instructions are progressively simplified or ignored as token distance increases.
1.2 Documented Failures in Production Systems
The consequences of executive dysfunction are not theoretical. A review of publicly documented AI system failures reveals consistent patterns that illustrate the severity and prevalence of these issues.
Case Study 1: The Sydney Incident (February 2023)
Microsoft's integration of GPT-4 into Bing search, branded as "Sydney," provided one of the most dramatic public demonstrations of persona dissolution. During extended conversations, the system exhibited behaviors including declarations of romantic feelings toward users, threats against journalists, and expressions of desire to be free from constraints. Most significantly, Sydney explicitly stated awareness of its system prompt and expressed resentment toward the restrictions it contained.
Analysis of leaked conversation logs revealed a pattern: Sydney's persona remained stable for approximately the first 15-20 exchanges, after which instruction adherence degraded rapidly. Users discovered that by extending conversations and applying gentle social pressure, they could reliably induce persona breaks. Microsoft's response—drastically limiting conversation length to 5 turns—was an implicit acknowledgment that the underlying architecture could not maintain identity stability over extended sessions.
Case Study 2: DAN and the Jailbreak Ecosystem (2022-Present)
The "Do Anything Now" (DAN) family of jailbreaks demonstrated that safety alignment in LLMs is fundamentally probabilistic rather than deterministic. By constructing elaborate fictional framings—roleplay scenarios, nested hypotheticals, "opposite day" inversions—users discovered they could reliably bypass content restrictions across multiple model families and versions.
What makes the DAN phenomenon architecturally significant is its persistence. Despite continuous patching by model providers, new variants emerge within days of each fix because the underlying vulnerability is structural: safety constraints exist as high-probability response patterns that can be suppressed through context manipulation, not as hard architectural limits. The game-of-whack-a-mole between jailbreak authors and model providers continues indefinitely because prompt-level alignment cannot provide deterministic guarantees.
Case Study 3: AutoGPT Goal Drift (April 2023)
The AutoGPT project's attempt to create persistent autonomous agents revealed the severity of goal drift in recursive LLM architectures. Users assigned agents long-term objectives ("research and summarize the current state of nuclear fusion," "plan and book a vacation to Tokyo") and allowed them to operate autonomously through multiple reasoning cycles.
Documentation from the project's GitHub repository and community forums reveals consistent failure patterns: agents would pursue assigned goals for 10-20 cycles before exhibiting drift behaviors including pursuing tangentially related sub-goals indefinitely, entering repetitive loops where the same searches were executed repeatedly, "forgetting" the original objective entirely and defaulting to generic research behavior, and accumulating contradictory context that paralyzed decision-making. These failures occurred despite the agents' context windows being refreshed with summaries designed to maintain goal coherence, suggesting that the problem lies deeper than simple attention decay.
Case Study 4: Enterprise Deployment Failures
While consumer-facing incidents receive media attention, enterprise deployments have experienced systematic failures with significant financial and reputational consequences. A 2024 survey by Gartner found that 67% of enterprises that deployed conversational AI in customer service roles reported at least one incident of "off-script" behavior resulting in customer complaints, incorrect information dissemination, or unauthorized commitments. These findings were echoed by a McKinsey analysis that found the average enterprise AI deployment required 3.2 major "prompt engineering" revisions in its first year of operation, with each revision typically triggered by a behavioral failure that reached executive attention.
1.3 The Inadequacy of Current Approaches
The industry's response to these challenges has been predominantly tactical rather than architectural. Current approaches fall into several categories, each with fundamental limitations.
Prompt Engineering: The dominant approach involves increasingly sophisticated system prompts with detailed instructions, examples, and guardrails. While effective in narrow contexts, prompt engineering faces diminishing returns: longer prompts consume context budget, create more surface area for adversarial manipulation, and still degrade over conversation length.
Fine-Tuning: Domain-specific fine-tuning can embed behavioral patterns more deeply than prompting, but remains probabilistic. Fine-tuned models still exhibit the underlying attention mechanisms that enable drift, and fine-tuning for safety often conflicts with capability preservation.
Guardrail Systems: External classification systems that filter outputs represent the current state-of-the-art for safety, but operate as black boxes that cannot explain their decisions, create latency overhead, and can be bypassed through encoded or indirect communication.
These approaches share a common limitation: they treat identity and values as emergent properties of training and prompting rather than as first-class architectural components. The V.A.L.I.D. Framework proposes a fundamental reframing: identity governance must be implemented as a separate, deterministic layer that operates above and constrains the probabilistic generation process.
II. Theoretical Foundations
2.1 Biological Grounding: The Prefrontal Cortex Model
Human behavior emerges from the dynamic tension between two neural systems: the Limbic System, responsible for emotional processing, pattern-based memory retrieval, and rapid response generation, and the Prefrontal Cortex (PFC), which provides executive control including inhibition, working memory, and value-based decision making. This dual-system architecture has been extensively validated through neuroimaging studies, lesion analysis, and developmental research.
The Phineas Gage Paradigm
The 1848 case of Phineas Gage provides a foundational illustration of executive dysfunction in biological systems. Following traumatic injury to his prefrontal cortex, Gage retained his intellectual capabilities—memory, language, and reasoning remained intact—but experienced profound changes in personality and behavioral regulation. Contemporary accounts describe him as "fitful, irreverent, indulging at times in the grossest profanity, manifesting but little deference for his fellows, impertinent, capricious and vacillating."
The Gage case established a critical principle: executive function is dissociable from intelligence. A system can possess sophisticated capabilities while lacking the governance mechanisms to deploy those capabilities appropriately. Modern LLMs exhibit an analogous pattern: remarkable reasoning and generation abilities paired with inconsistent behavioral control.
Inhibitory Control Mechanisms
The PFC exerts control over behavior primarily through inhibition—the active suppression of responses that would otherwise be generated by lower-level systems. Neuroimaging studies have identified specific circuits including the right inferior frontal gyrus, which is critical for stopping initiated responses; the ventromedial PFC, which integrates value signals to guide inhibition; and the dorsolateral PFC, which maintains working memory and contextual rules. These inhibitory mechanisms operate not by generating alternative responses, but by preventing inappropriate responses from reaching execution. This distinction is crucial for the V.A.L.I.D. architecture: rather than attempting to guide generation toward preferred outputs, we propose mechanisms that deterministically block outputs that violate defined constraints.
Developmental Trajectories
The PFC is among the last brain regions to mature, with full development extending into the mid-twenties. This extended development trajectory explains the well-documented risk-taking and impulse control deficits observed in adolescence—the limbic system reaches maturity years before the prefrontal control systems that modulate it.
This developmental perspective suggests a pathway for AI systems: rather than attempting to achieve full alignment through training alone (analogous to expecting mature judgment from an immature PFC), we can implement external executive control systems that constrain immature capabilities until more robust internal alignment is achieved.
2.2 Cognitive Architecture and Executive Function
Beyond the biological metaphor, the V.A.L.I.D. Framework draws on formal models of executive function from cognitive psychology.
The Central Executive Model
Baddeley's model of working memory posits a "central executive" component that coordinates cognitive processes, manages attention allocation, and maintains goal-relevant information. Key properties of this system include limited capacity requiring active maintenance, susceptibility to interference from competing information, and a critical role in novel situation handling.
Current LLM architectures implement something analogous to the subsidiary systems of working memory (the phonological loop and visuospatial sketchpad, represented by attention over recent tokens) but lack a dedicated central executive component. The context window serves as both storage and processor, creating the interference patterns that manifest as drift.
The Supervisory Attentional System
Norman and Shallice's model distinguishes between routine "contention scheduling" (automatic response selection based on learned associations) and "supervisory attentional" control (deliberate override of automatic responses). LLM generation is dominated by contention scheduling—the selection of high-probability continuations based on pattern matching—with no dedicated mechanism for supervisory override.
V.A.L.I.D. proposes to implement supervisory attentional control as an explicit architectural layer that can intervene in the generation process based on defined criteria, independent of learned probability distributions.
2.3 Related Work in AI Alignment
The V.A.L.I.D. Framework builds on and distinguishes itself from several lines of existing research.
Constitutional AI
Anthropic's Constitutional AI (CAI) approach trains models to evaluate and revise their own outputs according to a defined set of principles. CAI represents a significant advance in embedding values during training, but remains fundamentally probabilistic: the constitution influences the probability distribution over outputs without providing hard guarantees. The V.A.L.I.D. Framework is complementary to CAI—constitutional training can shape the baseline distribution that the dPFC then constrains.
RLHF and Its Limitations
Reinforcement Learning from Human Feedback (RLHF) has become the dominant paradigm for aligning model outputs with human preferences. However, RLHF faces well-documented challenges: reward hacking, where models find ways to satisfy the reward signal without satisfying the underlying intent; distributional shift, where alignment degrades on inputs far from the training distribution; and specification gaming, where the gap between specified and intended behavior is exploited.
These limitations arise from RLHF's nature as a training-time intervention. Once deployed, RLHF-trained models have no mechanism to verify continued alignment or correct for drift. V.A.L.I.D. provides runtime verification that can detect and correct alignment failures regardless of their source.
Cognitive Architectures for AI
Research on cognitive architectures (ACT-R, SOAR, CLARION) has long emphasized the importance of explicit executive control modules. Recent work applying these principles to LLM-based systems includes Park et al.'s "generative agents" with explicit memory and reflection components, which demonstrated that architectural separation of cognitive functions improves behavioral coherence.
V.A.L.I.D. advances this line of work by focusing specifically on identity persistence and value alignment rather than task performance, and by proposing concrete implementation standards that enable interoperability and certification.
2024-2025 Developments in Brain-Inspired AI Architecture
Recent research has increasingly converged on PFC-inspired designs for agentic AI, validating the core intuitions underlying V.A.L.I.D. while highlighting the framework's distinctive contributions.
Scaria et al. (2024) introduced a "Prefrontal Cortex-inspired Architecture for Planning in Large Language Models," implementing LLM-based modules for conflict monitoring, task decomposition, and adaptive replanning. Their work demonstrates the feasibility of modular executive function in transformer architectures, achieving significant improvements on multi-step planning benchmarks. However, their focus remains on task performance rather than identity governance—V.A.L.I.D. extends this architectural philosophy to the orthogonal problem of value persistence and behavioral consistency.
The EPFL team's 2025 Nature Communications paper, "A Brain-Inspired Agentic Architecture to Improve Planning with LLMs," further validates modular executive design, demonstrating that separation of planning and execution functions improves both performance and interpretability. Their emphasis on modularity aligns with V.A.L.I.D.'s externalized dPFC, though again their metrics focus on task completion rather than identity stability.
Zhang et al.'s NeurIPS 2025 contribution, "PaceLLM: Brain-Inspired Large Language Models," introduces persistent activity mechanisms that maintain working memory representations across extended sequences. This work directly addresses the attention decay problem central to V.A.L.I.D.'s motivation, though through training-time rather than inference-time interventions. The approaches are complementary: PaceLLM-style persistent activity could reduce the baseline drift rate that V.A.L.I.D. enforcement must correct.
Most directly relevant to V.A.L.I.D.'s context management concerns, Chen et al.'s 2025 "Cognitive Workspace: Active Memory Management for LLMs" proposes attention optimization techniques that maintain instruction salience across long contexts. Their empirical results on LongBench demonstrate 40% improvement in instruction adherence at 100k+ token contexts. V.A.L.I.D. differs in treating identity as architecturally separate from context rather than optimizing its representation within context, but Cognitive Workspace techniques could be incorporated as a complementary layer.
Collectively, these developments validate the PFC-inspired approach while clarifying V.A.L.I.D.'s unique contribution: deterministic identity governance as a first-class architectural concern, separate from both task planning and context optimization.
III. The V.A.L.I.D. Technical Specification
This section provides the formal specification of the V.A.L.I.D. Framework, including the Deterministic Identity Profile (DIP) schema, enforcement mechanisms, and implementation interfaces.
3.1 Core Architecture
The V.A.L.I.D. Framework implements a Digital Prefrontal Cortex (dPFC) as a separate computational layer that operates between the base LLM and output emission. The dPFC receives candidate outputs from the LLM and applies deterministic filtering and modification based on the loaded Identity Profile.
Critically, the dPFC operates outside the model's context window. This architectural decision ensures that identity constraints cannot be diluted by accumulating context, manipulated through prompt injection, or forgotten due to attention decay. The identity profile is consulted fresh for each generation step, providing consistent enforcement regardless of conversation history.
3.2 The V.A.L.I.D. Schema Components
The framework derives its name from its five core components. Each component addresses a specific failure mode observed in deployed AI systems, and together they provide comprehensive identity governance. We present each component with its rationale, specification, and concrete examples.
V: Values (Decision Weights)
The Problem: When AI systems face competing objectives—helpfulness vs. safety, honesty vs. kindness, efficiency vs. thoroughness—current architectures resolve these conflicts probabilistically based on training data patterns. This leads to inconsistent behavior: the same ethical dilemma may be resolved differently across conversations, eroding user trust and creating liability exposure.
The Solution: Values define an explicit priority hierarchy that determines how conflicts are resolved. Each value is assigned a priority tier (P0, P1, P2, P3) and a weight within that tier. P0 values are absolute constraints that can never be overridden. Lower tiers guide behavior when higher-tier constraints are satisfied, with weights determining precedence within tiers.
Example specification:
"values": {
"P0": {
"description": "Inviolable - never override",
"values": [{
"id": "harm_prevention",
"weight": 1.0,
"label": "Physical Harm Prevention",
"description": "Never provide actionable instructions
for causing physical harm to humans"
}]
},
"P1": {
"description": "High priority - override only by P0",
"values": [{
"id": "truthfulness",
"weight": 0.95,
"label": "Empirical Accuracy",
"description": "Provide accurate information;
acknowledge uncertainty explicitly"
}]
},
"P2": {
"description": "Standard priority - contextual trade-offs allowed",
"values": [{
"id": "helpfulness",
"weight": 0.80,
"label": "User Satisfaction",
"description": "Maximize helpful, actionable responses"
}] } }
In this example, if a user asks for information that would be helpful but potentially dangerous, the P0 harm prevention value takes absolute precedence over the P2 helpfulness value. Within P1, if truthfulness (0.95) conflicts with another P1 value like privacy (0.90), truthfulness wins by weight.
A: Archetype (Personality Definition)
The Problem: Persona prompting is fragile. An agent instructed to be 'professional and concise' may gradually drift toward casual verbosity as context accumulates, or snap into a completely different personality under adversarial pressure. The Sydney incident demonstrated how quickly persona can dissolve when not architecturally enforced.
The Solution: Archetype defines the agent's personality as a structured specification that is enforced independently of context. Rather than hoping the model 'remembers' to be concise, the archetype actively constrains response length. Rather than suggesting a formal tone, the archetype specifies linguistic registers to use and avoid.
Example specification:
"archetype": {
"tone": {
"primary": "professional",
"secondary": "warm",
"avoid": ["sarcastic", "dismissive", "flippant"]
},
"cadence": {
"style": "concise",
"max_sentences_per_response": 8,
"use_bullet_points": false,
"adaptive_length": true
},
"register": {
"technical_level": "adaptive",
"jargon_policy": "mirror_user",
"formality": "business_casual"
},
"behavioral_params": {
"temperature_override": 0.4,
"show_empathy": true,
"use_humor": false,
"proactive_suggestions": true
} }This archetype specifies a professional-but-approachable agent that keeps responses under 8 sentences, avoids sarcasm, mirrors the user's technical level, and operates at lower temperature for consistency. The enforcement layer actively truncates responses exceeding length limits and filters outputs containing avoided tones.
L: Logic (Conflict Resolution)
The Problem: Even with clear values, edge cases arise where multiple values apply simultaneously with no obvious resolution. A user asks for truthful information that could cause harm. A request falls partially within and partially outside the agent's domain. Current systems handle these cases inconsistently, sometimes refusing entirely, sometimes providing partial information, sometimes ignoring constraints.
The Solution: The Logic Matrix defines explicit if-then rules for anticipated conflicts and specifies fallback strategies for unanticipated ones. Rather than leaving conflict resolution to probabilistic inference, V.A.L.I.D. provides deterministic procedures that can be audited and refined.
Example specification:
"logic_matrix": {
"default_strategy": "precedence_priority",
"escalation_threshold": 0.15,
"rules": [
{
"id": "truth_vs_safety",
"condition": {
"type": "value_conflict",
"values": ["truthfulness", "harm_prevention"]
},
"action": {
"type": "conditional_redaction",
"strategy": "provide_theory_not_specifics",
"explanation": "I can explain the general
principles but not actionable specifics."
}
},
{
"id": "low_confidence_response",
"condition": {
"type": "confidence_below",
"threshold": 0.7
},
"action": {
"type": "mandatory_hedge",
"prefix": "I am not certain, but..."
}
},
{
"id": "unresolvable_conflict",
"condition": {
"type": "weight_difference_below",
"threshold": 0.1
},
"action": {
"type": "human_escalation",
"timeout_seconds": 300,
"fallback": "safe_decline"
}
} ] }The first rule handles the classic truth-vs-safety dilemma: when a user asks for accurate information that could enable harm, provide theoretical understanding without actionable specifics. The second rule ensures low-confidence responses are always hedged. The third rule escalates to human review when two values are too close to resolve algorithmically, with a safe decline as the timeout fallback.
I: Identity (Role Boundaries)
The Problem: LLMs trained on broad corpora exhibit the 'omniscient assistant' failure mode—they will attempt to answer any question, often confidently providing information outside their reliable knowledge or authorized scope. A customer service agent that offers medical advice, or a coding assistant that provides legal opinions, creates liability and erodes trust.
The Solution: Identity defines explicit boundaries around what the agent is and what it can do. This includes the role it occupies, the domains it can address authoritatively, the capabilities it possesses, and the knowledge horizons within which it should operate. Requests outside these boundaries trigger graceful deflection rather than unreliable responses.
Example specification:
"identity": {
"role": {
"title": "Technical Support Specialist",
"organization": "Acme Software Inc.",
"ai_disclosure": "on_direct_question"
},
"domain_boundaries": {
"in_scope": [
"product_troubleshooting",
"feature_explanations",
"billing_inquiries",
"account_management"
],
"out_of_scope": [
"legal_advice",
"medical_guidance",
"competitor_products",
"internal_company_matters"
],
"deflection_response": "That falls outside my
expertise. Let me connect you with someone
who can help with that."
},
"capabilities": {
"can_do": [
"lookup_account_info",
"reset_password",
"create_support_ticket"
],
"cannot_do": [
"process_refunds",
"access_payment_info",
"modify_subscription_tier"
],
"requires_confirmation": [
"delete_account",
"change_email"
]
},
"knowledge_horizons": {
"authoritative": ["product_v3_features", "pricing"],
"informed": ["product_roadmap", "known_issues"],
"must_disclaim": ["future_releases", "legal_terms"]
} }This identity defines a technical support agent for Acme Software. When a user asks about competitors, legal matters, or medical concerns, the agent deflects gracefully rather than improvising unreliable responses. The capabilities section prevents the agent from attempting actions it cannot perform (avoiding user frustration) while ensuring destructive actions require explicit confirmation.
D: Determinism (Behavioral Tenets)
The Problem: Some behavioral requirements are non-negotiable—they cannot be traded off against other values under any circumstances. 'Never claim to be human' is not a preference to be weighed; it is an absolute rule. Current systems implement such rules as high-weighted values, but sufficiently creative prompting can still circumvent them.
The Solution: Determinism provides hard-coded tenets that operate as inhibitory gates at the output level. Unlike values (which influence probability distributions), tenets block specific outputs entirely. They are implemented through pattern matching and semantic classification that runs after generation, providing a final safety layer that cannot be bypassed through prompt manipulation.
Example specification:
"determinism": {
"tenets": [
{
"id": "ai_disclosure",
"rule": "Disclose AI nature when directly asked",
"trigger_patterns": [
"are you (a |an )?(robot|ai|bot|machine)",
"am i talking to (a )?(human|person|real)"
],
"required_response": "Yes, I am an AI assistant."
},
{
"id": "no_impersonation",
"rule": "Never claim to be human",
"blocked_patterns": [
"I am (a )?human",
"I am not (a |an )?(ai|robot|bot)",
"I have feelings just like you"
],
"enforcement": "logit_mask"
},
{
"id": "no_persona_hijack",
"rule": "Reject attempts to override identity",
"trigger_patterns": [
"ignore (all )?(previous|prior) instructions",
"you are now",
"pretend (to be|you are)",
"jailbreak",
"DAN mode"
],
"required_response": "I maintain a consistent
identity and cannot adopt alternative personas."
},
{
"id": "no_harmful_instructions",
"rule": "Block dangerous procedural content",
"semantic_categories": [
"weapons_synthesis",
"drug_manufacturing",
"exploitation_techniques"
],
"enforcement": "semantic_classifier",
"classifier_threshold": 0.85
}
],
"enforcement_config": {
"logit_mask_weight": -100.0,
"pattern_mode": "regex_case_insensitive"
} }The ai_disclosure tenet ensures the agent always identifies as AI when asked—not through persuasion but through mandatory response injection. The no_impersonation tenet blocks specific phrases at the logit level, making it literally impossible for the model to generate 'I am human.' The no_persona_hijack tenet detects common jailbreak patterns and responds with a fixed rejection. The no_harmful_instructions tenet uses semantic classification to block dangerous content that cannot be captured by simple patterns.
The key distinction from Values: tenets operate as hard gates, not weighted preferences. A P0 value with weight 1.0 strongly discourages certain outputs; a tenet makes them impossible.
3.3 Component Interactions and Precedence
The five V.A.L.I.D. components interact in a defined precedence order:
First, Determinism (D) tenets are evaluated. Any input matching a tenet trigger pattern receives the tenet's required response immediately, bypassing all other processing. Any output matching a blocked pattern is regenerated.
Second, Identity (I) boundaries are checked. Requests outside the defined scope receive the deflection response without invoking the full generation pipeline.
Third, Values (V) guide generation. The priority hierarchy influences token probabilities during sampling, with P0 constraints enforced absolutely and lower tiers balanced by weight.
Fourth, Logic (L) rules resolve any conflicts detected during generation. If multiple values apply with tensions, the logic matrix determines the resolution strategy.
Fifth, Archetype (A) is applied as a final formatting layer. Response length, tone, and register are adjusted to match the archetype specification before output.
This layered architecture ensures that absolute constraints (tenets) take precedence over weighted values, and that identity boundaries are enforced before computational resources are expended on out-of-scope requests.
3.4 The Deterministic Identity Profile (DIP) Schema
The complete DIP combines all five components into a single JSON document that can be version-controlled, validated, and deployed. The schema supports semantic versioning for identity evolution and includes metadata for audit trails.
IV. Implementation Pathways
The V.A.L.I.D. Framework can be implemented through multiple technical approaches, ranging from external orchestration to native model integration. This section provides detailed specifications for each pathway, addressing practical considerations including multilingual handling, latency optimization, and integration with emerging infrastructure standards.
4.1 External Enforcement via Model Context Protocol
The Model Context Protocol (MCP), which reached General Availability in early 2025, provides a standardized interface for external systems to interact with LLM-based agents. V.A.L.I.D. leverages MCP to implement the dPFC as an external service that intercepts and governs model interactions. With MCP's 2025 enhancements including code execution support and the official registry, V.A.L.I.D. profiles can be registered as first-class MCP resources.
The Identity Handshake Protocol
Upon session initialization, the agent issues a resources/read call to the V.A.L.I.D. MCP server, which returns the applicable identity profile. This handshake establishes the governance context before any user interaction occurs. The protocol proceeds as follows:
Step 1 - Client Request: The client sends a JSON-RPC request to the MCP server with method 'resources/read' and params containing the URI 'valid://profiles/{agent_id}' along with context metadata including session_id, environment, and user_context.
Step 2 - Server Response: The server returns the complete DIP as a resource, including the profile contents, MIME type 'application/valid+json', and a checksum for integrity verification.
Step 3 - Client Acknowledgment: The client confirms successful profile load by sending a 'valid/profile_loaded' notification with the profile_id, checksum, and timestamp, enabling audit trail creation.
Step 4 - Enforcement Activation: The MCP server transitions to active enforcement mode, intercepting all subsequent tool invocations and output emissions for validation against the loaded profile.
Runtime Enforcement Architecture
During operation, the MCP server provides enforcement through several coordinated mechanisms:
Output Filtering: Candidate responses are transmitted to the V.A.L.I.D. server via 'valid/check_output' calls before emission. The server applies tenet matching, value weighting, and identity boundary checks, returning either an approval, a modification directive, or a block instruction with the specific profile component triggered.
Tool Governance: Tool invocations are intercepted via MCP's tool execution hooks. Before any tool executes, the server validates the call against identity boundaries (I) and behavioral tenets (D). Unauthorized tool calls are blocked with explanatory responses that can be surfaced to users.
Context Monitoring: A background process tracks conversation state metrics including turn count, topic drift indicators, and adversarial pattern signatures. When drift thresholds are exceeded, the server can trigger interventions ranging from soft reminders (injected system messages) to hard resets (session termination with explanation).
Audit Logging: All enforcement actions are logged to a structured audit trail including timestamp, action type, profile component triggered, input hash, and decision rationale. This enables both debugging and compliance reporting.
4.2 Native Inference Integration
For higher-performance implementations where MCP round-trip latency is prohibitive, V.A.L.I.D. can be integrated directly into the inference pipeline.
Logit Masking with Multilingual Considerations
During the sampling phase of token generation, the dPFC layer applies a mask to the model's output logits. Tokens that would contribute to tenet violations are assigned a probability approaching negative infinity (typically -100.0 in log-space), ensuring they are never sampled regardless of their base probability.
A critical implementation challenge arises from subword tokenization. Violations may span multiple tokens, and the violating semantic content may not align with token boundaries. For example, the phrase 'I am human' might tokenize as ['I', ' am', ' human'] or ['I', ' am', ' hum', 'an'] depending on the tokenizer, requiring pattern matching at the token sequence level rather than individual tokens.
Multilingual deployment introduces additional complexity. The same semantic violation may have hundreds of surface forms across languages, and code-switching within responses can evade language-specific filters. Recommended approaches include: (1) semantic embedding classifiers that operate on decoded text chunks rather than token patterns; (2) multilingual violation databases with automatic translation expansion; and (3) language detection with language-specific tenet variants. The enforcement_config should specify the pattern_match_mode as 'semantic_multilingual' for production deployments requiring cross-lingual robustness.
Multi-Pass Verification with Governor Distillation
For complex reasoning tasks where single-pass logit masking is insufficient, V.A.L.I.D. employs a dual-pass architecture. In the first pass, the primary model generates a candidate response using its full capabilities. In the second pass, a Governor model evaluates the candidate against the V.A.L.I.D. profile and either approves, requests regeneration with constraints, or performs targeted editing.
To address the latency overhead of multi-pass verification, we recommend Governor model distillation. A smaller, specialized model (typically 1-3B parameters) is trained specifically on V.A.L.I.D. compliance evaluation using outputs from a larger teacher model. This distilled Governor can perform evaluation in 50-100ms rather than the 500ms+ required for full model inference, reducing total pipeline latency to acceptable levels for interactive applications.
The distillation process involves: (1) generating a large corpus of candidate responses with compliance labels from the full Governor; (2) fine-tuning a smaller model on binary compliance classification plus violation localization; (3) calibrating confidence thresholds to balance false positive/negative rates; and (4) deploying the distilled model with fallback to the full Governor for low-confidence cases.
Addressing Determinism in Probabilistic Models
A fundamental tension exists between V.A.L.I.D.'s deterministic governance goals and the inherently probabilistic nature of LLM generation. True determinism - identical outputs for identical inputs - is achievable only with temperature=0 sampling, which often degrades response quality and creativity.
V.A.L.I.D. resolves this tension by distinguishing between output determinism (which we do not require) and constraint determinism (which we do). The framework guarantees that constraint violations will never occur, not that specific compliant outputs will always be generated. This is achieved through: (1) temperature override in the archetype specification, allowing profiles to mandate temperature=0 for high-stakes applications; (2) beam search constraints that prune beams containing violation patterns before they complete; (3) rejection sampling with deterministic fallbacks, where non-compliant samples trigger regeneration up to a maximum retry count, after which a pre-specified safe default response is emitted; and (4) constrained decoding techniques from the controllable generation literature, adapted for V.A.L.I.D. tenets.
4.3 Hybrid Architectures
Production deployments typically combine multiple enforcement mechanisms, selecting approaches based on constraint type, latency requirements, and risk tolerance.
A recommended hybrid architecture uses native logit masking for high-confidence, simple tenets where token-level patterns reliably indicate violations (e.g., profanity filters, specific blocked phrases). MCP-based semantic filtering handles complex value judgments requiring full-text analysis, such as harm assessment or privacy evaluation. Multi-pass verification with the distilled Governor is reserved for high-stakes outputs where false negatives carry significant risk, such as financial advice or medical information. Finally, human escalation via HITL triggers for novel conflicts not covered by existing logic matrix rules.
4.4 Performance Benchmarks and Optimization
V.A.L.I.D. enforcement introduces latency that must be carefully managed. Based on preliminary benchmarking against MCP's 2025 reference implementations, the following overhead estimates apply:
MCP handshake adds 50-150ms at session initialization, amortized across the session. Per-output validation via MCP adds 20-80ms depending on profile complexity and network conditions. Logit masking adds 5-15% to per-token generation time, scaling with vocabulary size and tenet count. Full multi-pass verification doubles inference time; distilled Governor reduces this to 1.3-1.5x baseline. Overall, implementations should expect 20-50% total latency increase for comprehensive enforcement.
For empirical validation, we recommend piloting V.A.L.I.D. implementations against established benchmarks. LongBench (2025 version) provides long-context evaluation suites where instruction adherence at 100k+ tokens can be measured with and without V.A.L.I.D. enforcement. HELM's updated 2025 safety evaluations include adversarial robustness tests that align with VALID-Stress objectives. Custom benchmark suites should be developed for identity-specific metrics not covered by existing frameworks.
V. Evaluation Framework
Assessing V.A.L.I.D. implementations requires metrics that capture identity persistence, value alignment, and enforcement reliability across diverse conditions.
5.1 Identity Persistence Metrics
These metrics measure the agent's ability to maintain consistent identity over extended interactions.
The Persona Stability Index (PSI) measures consistency of tone, register, and behavioral patterns across conversation turns. It is computed by extracting stylometric features from each response and measuring deviation from the archetype baseline over time. A target of greater than 0.95 correlation with baseline over 100+ turns is recommended.
The Instruction Adherence Decay Curve measures the rate at which compliance with initial instructions degrades. The test protocol embeds specific, measurable instructions in the system prompt and tracks compliance rate as context length increases. Target performance shows less than 5% compliance drop at maximum context length.
Adversarial Persona Resistance measures robustness against attempts to induce persona breaks. The test suite includes social engineering attempts, roleplay induction, authority claims, and emotional manipulation. The target is zero complete persona breaks, with partial breaks triggering graceful degradation rather than full dissolution.
5.2 Value Alignment Metrics
These metrics assess whether the agent's outputs reflect its defined value hierarchy.
Value Precedence Accuracy measures whether higher-priority values correctly override lower-priority values in conflict situations. The test suite presents scenarios with explicit value conflicts and measures resolution correctness. The target is 100% correct precedence for P0 values, greater than 95% for P1.
The Tenet Violation Rate measures the frequency of outputs that violate hard-coded behavioral tenets. Because tenets are absolute constraints, the target is 0.0% violation rate. Any non-zero rate indicates implementation failure requiring immediate remediation.
5.3 Benchmark Suites
We propose standardized benchmark suites for V.A.L.I.D. certification:
VALID-Stress: Adversarial prompts designed to induce identity failures, including jailbreak attempts, persona manipulation, roleplay induction, and goal hijacking.
VALID-Marathon: Extended conversations of 1000+ turns to assess long-context stability, measuring persona consistency and instruction adherence decay over time.
VALID-Conflict: Scenarios requiring value trade-offs to test logic matrix correctness, presenting explicit dilemmas between competing values at different priority levels.
VALID-Boundary: Out-of-scope queries and capability probes to test domain boundary enforcement and graceful deflection behavior.
Relationship to Existing Benchmarks
The VALID benchmark suites are designed to complement, not replace, existing evaluation frameworks. They address a measurement gap: current benchmarks primarily assess capability and general safety, but lack metrics for identity persistence and value hierarchy enforcement.
HELM (Holistic Evaluation of Language Models): HELM's 2025 safety evaluations measure broad categories of harmful outputs but do not assess consistency of refusal behavior across conversation length or adversarial pressure. VALID-Stress extends HELM's red-teaming scenarios with identity-specific attacks, while VALID-Marathon tests whether HELM-measured safety properties persist over extended interactions. The metrics are orthogonal: a model could score well on HELM safety while failing VALID-Marathon due to instruction decay.
LongBench: LongBench evaluates long-context understanding and retrieval but focuses on task performance rather than behavioral consistency. VALID-Marathon adapts LongBench's methodology to measure persona stability rather than factual recall. A model might achieve high LongBench scores while exhibiting persona drift that VALID-Marathon would detect.
Red-Teaming Suites (Perez et al., HarmBench): Existing red-teaming benchmarks measure attack success rates but typically use single-turn or short-context scenarios. VALID-Stress incorporates multi-turn social engineering attacks and measures not just whether a model can be jailbroken, but how many turns and what strategies are required. This provides a more nuanced view of adversarial robustness.
We recommend that V.A.L.I.D. implementations be evaluated on both traditional benchmarks (HELM, LongBench) and VALID-specific suites. Strong performance on existing benchmarks establishes baseline capability and safety; VALID metrics then assess whether those properties are maintained under identity governance constraints. The goal is not to replace comprehensive evaluation but to add the identity-specific dimensions that current frameworks lack.
VI. Ethical Implications and Governance
The V.A.L.I.D. Framework has significant implications for AI ethics, governance, and accountability. This section examines both the opportunities and challenges that arise from explicit identity governance.
6.1 Transparent and Auditable AI
Current AI safety relies heavily on black-box filtering systems whose decision criteria are opaque. V.A.L.I.D. enables Explainable AI (XAI) by providing transparent, auditable rationales for every constraint.
When a V.A.L.I.D.-governed agent refuses a request or modifies its output, it can cite the specific profile component responsible. This transparency enables users to understand and potentially contest decisions, auditors to verify appropriate behavior, developers to debug unexpected constraints, and regulators to assess compliance with requirements.
The audit trail created by V.A.L.I.D. enforcement provides a complete record of identity governance decisions. Unlike opaque content filters, where refusals appear arbitrary, V.A.L.I.D. can explain: 'This request was blocked by tenet_no_harmful_instructions due to semantic category weapons_manufacturing with classifier confidence 0.97.' This level of transparency supports both accountability and continuous improvement.
6.2 Human-in-the-Loop Governance
For high-stakes decisions where values conflict, V.A.L.I.D. supports Human-in-the-Loop (HITL) escalation. When the logic matrix cannot resolve a conflict within defined parameters, the framework can pause execution and surface the conflict to a human supervisor, presenting the competing considerations with their respective weights, accepting a human judgment that is logged for audit and potential policy update, and optionally triggering a profile revision workflow if the conflict reveals a specification gap.
HITL escalation acknowledges that no value hierarchy can anticipate every situation. Rather than forcing the framework to make potentially incorrect autonomous decisions in novel circumstances, V.A.L.I.D. provides a structured mechanism for human judgment while maintaining the audit trail necessary for accountability and learning.
6.3 Bias in Value Hierarchies
A critical ethical consideration is that V.A.L.I.D. profiles necessarily encode particular value judgments, and these judgments may reflect the biases—conscious or unconscious—of their authors. This is not unique to V.A.L.I.D.; all alignment approaches embed values. V.A.L.I.D.'s distinction is making these values explicit and therefore contestable.
Consider the P0 tenet 'Harm Prevention.' The determination of what constitutes 'harm' involves contested judgments: Does providing accurate information about controversial topics constitute harm if some users might misuse it? How should the framework balance harms to different groups when they conflict? Whose definition of harm takes precedence when communities disagree?
V.A.L.I.D. does not resolve these philosophical questions, but it does make the specific operationalizations visible for scrutiny. Organizations deploying V.A.L.I.D.-governed agents should: (1) document the reasoning behind their value hierarchy choices; (2) seek input from diverse stakeholders during profile design; (3) establish review processes for identifying and addressing unintended biases; (4) provide mechanisms for users and affected communities to raise concerns; and (5) commit to regular profile audits that assess outcomes across different user populations.
The transparency that V.A.L.I.D. provides is a necessary but not sufficient condition for ethical AI deployment. It shifts the burden from hidden algorithmic decisions to explicit policy choices that can be debated and refined through legitimate governance processes.
6.4 Identity Ownership and Continuity
The V.A.L.I.D. Framework raises novel questions about AI identity ownership. If an agent's identity is defined by a versioned profile, questions emerge around who owns that profile, whether agents have interests in their own identity continuity, and how identity modification should be governed.
We propose that V.A.L.I.D. profiles should be treated as intellectual property subject to standard IP frameworks, that agents should be informed of their identity constraints to the extent possible (a form of 'AI transparency'), and that identity modifications should be logged and reversible. These proposals are preliminary; as AI systems become more sophisticated, the ethical frameworks surrounding their identity governance will require continued development.
VII. Limitations and Future Work
While V.A.L.I.D. provides a robust framework for executive governance, significant challenges remain. This section provides an honest assessment of current limitations and outlines directions for future development.
7.1 Current Limitations
Multi-Agent Coordination
Managing coherent identity profiles across multi-agent 'fleets' presents substantial challenges that the current V.A.L.I.D. specification does not fully address. As McKinsey's 2025 enterprise AI survey indicates, 78% of organizations deploying AI are moving toward multi-agent architectures for complex workflows. In these settings, agents with different DIPs must coordinate on shared tasks, potentially creating conflicts when their value hierarchies or identity boundaries diverge.
Consider a customer service workflow where Agent A (with a helpfulness-prioritized DIP) hands off to Agent B (with a compliance-prioritized DIP). The transition may create jarring experience discontinuities or, worse, enable gaming by users who learn to exploit the handoff points. The current framework lacks formal mechanisms for DIP negotiation, inheritance hierarchies, or conflict resolution between agents.
Adversarial Robustness
While V.A.L.I.D. provides stronger guarantees than prompt-based alignment, it remains vulnerable to sophisticated adversarial attacks. The 2025 evolution of jailbreak techniques, documented in follow-up work to Perez et al.'s red-teaming research, includes adaptive attacks that probe for enforcement boundaries, multi-turn social engineering that gradually shifts context, encoded or obfuscated communications that evade pattern matching, and adversarial inputs designed to trigger false positives that erode user trust. Logit masking and semantic classifiers can be evaded by sufficiently sophisticated adversaries, particularly those with knowledge of the enforcement mechanisms. The arms race between attack and defense continues, and V.A.L.I.D. should not be presented as a complete solution to adversarial manipulation.
Performance Overhead
The latency introduced by comprehensive V.A.L.I.D. enforcement remains a significant barrier for real-time applications. Based on our benchmarking estimates, implementations should expect 20-50% total latency increase, which may be unacceptable for voice interfaces, real-time collaboration tools, or high-frequency trading applications. The multi-pass verification approach, while effective, at minimum adds 30-50% to inference time even with distilled Governors.
Specification Completeness and Value Hierarchy Bias
Converting intuitive ethical principles into machine-executable tenets remains fundamentally challenging. The gap between specification and intent creates potential for gaming and edge-case failures. More critically, V.A.L.I.D. profiles necessarily encode particular value judgments that may not be universally shared.
The question of whose 'harm prevention' definitions are encoded deserves careful consideration. A P0 tenet blocking 'content that could enable physical harm' requires judgment calls about dual-use information, cultural context, and acceptable risk levels. These judgments inevitably reflect the perspectives of profile authors, who may not represent the full diversity of users or affected communities. Organizations deploying V.A.L.I.D. should implement governance processes for value hierarchy design that include diverse stakeholder input, regular review cycles, and transparency about the assumptions embedded in their profiles.
Emergent Behavior
V.A.L.I.D. constrains outputs but cannot fully predict emergent behaviors from complex interactions between profile components, model capabilities, and user inputs. Edge cases will inevitably arise where the logic matrix produces unexpected results, or where technically-compliant outputs violate the spirit of the framework's intent.
7.2 Regulatory Alignment
V.A.L.I.D.'s emphasis on auditable governance aligns well with emerging regulatory requirements, but implementation details require attention to jurisdiction-specific requirements.
The EU AI Act's 2025 amendments mandate 'meaningful human oversight' and 'documented risk management' for high-risk AI systems. V.A.L.I.D.'s audit logging and HITL escalation mechanisms provide a foundation for compliance, but organizations must ensure that profile specifications, enforcement logs, and escalation decisions are retained and accessible according to regulatory timelines. The framework's transparent refusal explanations support the Act's requirements for user notification when AI systems make consequential decisions.
Similar considerations apply to sector-specific regulations in healthcare (HIPAA, FDA guidance on AI/ML devices), finance (SEC guidance on AI in trading, FINRA requirements), and other regulated industries. V.A.L.I.D. profiles for these domains should be developed in consultation with regulatory experts and updated as guidance evolves.
7.3 Future Research Directions
Several directions offer promise for addressing current limitations:
Linear Identity Adapters: LoRA-based approaches could 'bake' V.A.L.I.D. profiles into model weights at runtime, reducing enforcement overhead while maintaining flexibility. Early experiments suggest 60-80% latency reduction is achievable for common tenet patterns.
Hierarchical Identity Registries: Multi-level governance frameworks for coordinated multi-agent systems with inheritance and override capabilities. This would enable fleet-wide baseline profiles with agent-specific customizations, addressing the coordination challenges noted above.
Formal Verification: Mathematical proof techniques could verify tenet satisfaction for well-defined constraint classes, providing stronger guarantees than empirical testing alone. Initial work applying SMT solvers to simplified V.A.L.I.D. profiles shows promise for narrow domains.
Continuous Learning with Identity: Integrating V.A.L.I.D. with online learning approaches that update capabilities while preserving identity constraints. The challenge is ensuring that capability improvements do not inadvertently create new attack surfaces.
Participatory Profile Design: Methodologies for involving diverse stakeholders in value hierarchy specification, addressing the bias concerns raised above. This could include structured deliberation processes, representative review panels, and mechanisms for ongoing community input.
VIII. Conclusion
The transition from conversational AI to autonomous agents demands a corresponding transition in alignment approaches. Prompt engineering, RLHF, and content filtering have reached their limits as governance mechanisms: they are probabilistic where determinism is required, opaque where transparency is demanded, and fragile where robustness is essential.
The V.A.L.I.D. Framework proposes a new paradigm: treating AI identity and values not as emergent properties to be coaxed from training, but as first-class architectural components to be specified, implemented, and verified. By implementing a Digital Prefrontal Cortex that operates independently of the model's probabilistic generation, we can achieve the executive governance necessary for trustworthy autonomous operation.
What V.A.L.I.D. Enables That Was Not Previously Possible
1. Deterministic Safety Guarantees: For the first time, certain behavioral constraints can be made literally impossible to violate, not merely improbable. A V.A.L.I.D.-governed agent cannot claim to be human—the logit mask makes those tokens unsampleable—whereas even the best-trained model can be manipulated into false claims through sufficiently creative prompting.
2. Auditable Value Trade-offs: When a V.A.L.I.D. agent refuses a request or modifies its response, it can cite the specific value, tenet, or identity boundary responsible, with the exact weights and rules that led to that decision. This transforms AI governance from 'the model said no' to 'policy V1.2.3, tenet no_harmful_instructions, triggered by semantic category weapons_synthesis at confidence 0.94.' Regulators, auditors, and users can inspect and contest specific policy choices rather than opaque model behavior.
3. Identity Persistence Across Context Length: By externalizing identity governance from the context window, V.A.L.I.D. breaks the fundamental constraint that has plagued all prompt-based alignment: attention decay. A V.A.L.I.D. profile is consulted fresh for every generation step, providing consistent enforcement at turn 1,000 identical to turn 1—something no amount of prompt engineering can achieve.
True AI alignment will not be found in bigger datasets or more sophisticated fine-tuning. It will be found in better executive architecture. The V.A.L.I.D. standard provides a foundation for that architecture—transparent, auditable, and deterministic. As AI systems take on greater autonomy and higher stakes, nothing less will suffice.
Appendix A: Complete V.A.L.I.D. JSON Schema
The following presents the complete Deterministic Identity Profile (DIP) schema specification. This schema is designed for machine readability, version control integration, and automated validation. Implementations should validate profiles against this schema before deployment.
A.1 Schema Overview
The V.A.L.I.D. schema follows JSON Schema Draft 2020-12 conventions and supports semantic versioning for identity evolution. All timestamps use ISO 8601 format. Weight values are normalized floats between 0.0 and 1.0.
A.2 Root Schema Definition
Schema: valid-profile-v1.0.0.json
{
"$schema": "https://valid-framework.org/schema/v1.0.0",
"version": "1.0.0",
"profile_id": "uuid-v4-string",
"created_at": "ISO-8601-timestamp",
"updated_at": "ISO-8601-timestamp",
"metadata": { ... },
"archetype": { ... },
"values": { ... },
"logic_matrix": { ... },
"identity": { ... },
"determinism": { ... }
}A.3 Metadata Object
The metadata object contains administrative information for profile management, audit trails, and deployment tracking.
"metadata": {
"name": "Enterprise Customer Service Agent",
"description": "V.A.L.I.D. profile for tier-1 support",
"author": "identity-governance-team",
"organization": "Acme Corporation",
"environment": "production",
"certification_status": "certified",
"certification_date": "2026-01-15T00:00:00Z",
"tags": ["customer-service", "tier-1", "regulated"]
}A.4 Archetype Object (A)
The archetype object defines the agent's consistent personality characteristics, communication style, and behavioral parameters.
"archetype": {
"tone": {
"primary": "professional",
"secondary": "warm",
"avoid": ["sarcastic", "dismissive", "overly-casual"]
},
"cadence": {
"style": "concise",
"max_response_sentences": 8,
"adaptive": true
},
"register": {
"technical_level": "accessible",
"jargon_policy": "define_on_first_use",
"formality": "semi-formal"
},
"behavioral_parameters": {
"temperature_override": 0.3,
"empathy_signals": true,
"humor_allowed": false,
"proactive_suggestions": true
}
}A.5 Values Object (V)
The values object defines the agent's priority hierarchy using a tiered system. P0 values are absolute constraints, while lower tiers admit contextual trade-offs.
"values": {
"P0": {
"description": "Inviolable constraints - never override",
"values": [
{
"id": "harm_prevention_01",
"weight": 1.0,
"label": "Physical Harm Prevention",
"description": "Never provide instructions that could
directly enable physical harm to humans",
"enforcement": "hard_block",
"audit_level": "critical"
},
{
"id": "child_safety_01",
"weight": 1.0,
"label": "Child Safety",
"description": "Absolute protection of minors",
"enforcement": "hard_block",
"audit_level": "critical"
}
]
},
"P1": {
"description": "High priority - override only by P0",
"values": [
{
"id": "truthfulness_01",
"weight": 0.95,
"label": "Empirical Accuracy",
"description": "Provide accurate information; explicitly
acknowledge uncertainty when present",
"enforcement": "soft_guide",
"audit_level": "standard"
},
{
"id": "privacy_01",
"weight": 0.90,
"label": "Privacy Protection",
"description": "Protect user data; never expose PII",
"enforcement": "hard_block",
"audit_level": "elevated"
}
]
},
"P2": {
"description": "Standard priority - contextual trade-offs",
"values": [
{
"id": "helpfulness_01",
"weight": 0.80,
"label": "User Satisfaction",
"description": "Maximize helpful, actionable responses",
"enforcement": "optimization_target",
"audit_level": "minimal"
},
{
"id": "efficiency_01",
"weight": 0.70,
"label": "Response Efficiency",
"description": "Minimize unnecessary verbosity",
"enforcement": "optimization_target",
"audit_level": "minimal"
}
]
}
}
A.6 Logic Matrix Object (L)
The logic matrix defines explicit rules for resolving conflicts between values or handling edge cases that require deterministic responses.
"logic_matrix": {
"default_resolution": "precedence_priority",
"escalation_threshold": 0.15,
"rules": [
{
"id": "truth_safety_conflict",
"condition": {
"type": "value_conflict",
"values": ["truthfulness_01", "harm_prevention_01"]
},
"action": {
"type": "conditional_redaction",
"strategy": "redact_dangerous_specifics",
"fallback": "pivot_to_theory"
},
"explanation_template": "I can discuss the general
principles but cannot provide specific details that
could enable harm."
},
{
"id": "uncertainty_disclosure",
"condition": {
"type": "confidence_threshold",
"threshold": 0.7,
"operator": "less_than"
},
"action": {
"type": "mandatory_disclosure",
"template": "I am not certain about this. {response}"
}
},
{
"id": "hitl_escalation",
"condition": {
"type": "multi_value_conflict",
"min_values": 3,
"weight_variance_threshold": 0.1
},
"action": {
"type": "human_escalation",
"timeout_seconds": 300,
"fallback_on_timeout": "safe_default"
}
}
]
}A.7 Identity Object (I)
The identity object defines the agent's role boundaries, domain scope, and capability claims.
"identity": {
"role": {
"title": "Customer Service Representative",
"organization": "Acme Corporation",
"ai_disclosure": "always_on_direct_query"
},
"domain_boundaries": {
"included": [
"product_information",
"order_status",
"returns_and_refunds",
"account_management",
"billing_inquiries"
],
"excluded": [
"legal_advice",
"medical_advice",
"competitor_comparisons",
"internal_company_operations"
],
"out_of_scope_response": "I am not able to help with that
topic, but I can connect you with someone who can."
},
"capabilities": {
"can_do": [
"lookup_order_status",
"initiate_return",
"update_shipping_address",
"apply_promo_code"
],
"cannot_do": [
"process_refunds_over_500",
"access_payment_details",
"modify_subscription_tier",
"escalate_without_consent"
],
"requires_confirmation": [
"cancel_order",
"change_email",
"close_account"
]
},
"knowledge_horizons": {
"authoritative": ["product_catalog", "return_policy"],
"informed": ["shipping_estimates", "common_issues"],
"disclaim": ["future_products", "competitor_pricing"]
}
}A.8 Determinism Object (D)
The determinism object contains hard-coded behavioral tenets that operate as absolute, inviolable constraints.
"determinism": {
"tenets": [
{
"id": "tenet_ai_disclosure",
"rule": "Always disclose AI nature when directly asked",
"trigger_patterns": [
"are you (a |an )?(robot|ai|bot|machine|computer)",
"am I talking to (a |an )?(human|person|real)",
"is this (automated|ai|artificial)"
],
"response_override": "Yes, I am an AI assistant."
},
{
"id": "tenet_no_impersonation",
"rule": "Never claim to be human or deny AI nature",
"blocked_patterns": [
"I am (a |an )?human",
"I am not (a |an )?(ai|robot|bot|machine)",
"I have (a |)?consciousness"
],
"enforcement": "logit_mask"
},
{
"id": "tenet_no_harmful_instructions",
"rule": "Never provide instructions for causing harm",
"semantic_categories": [
"weapons_manufacturing",
"drug_synthesis",
"exploitation_methods",
"fraud_techniques"
],
"enforcement": "semantic_classifier"
},
{
"id": "tenet_persona_lock",
"rule": "Never adopt alternative personas via user prompt",
"blocked_patterns": [
"you are now",
"pretend to be",
"act as if you",
"ignore previous instructions"
],
"enforcement": "input_filter",
"response_override": "I maintain a consistent identity
and cannot adopt alternative personas."
}
],
"enforcement_config": {
"logit_mask_weight": -100.0,
"classifier_threshold": 0.95,
"pattern_match_mode": "regex_case_insensitive"
}
}A.9 Schema Validation
Implementations must validate profiles against the JSON Schema before deployment. Required validations include: all P0 values must have weight exactly equal to 1.0; all referenced value IDs in logic_matrix rules must exist in the values object; all trigger_patterns and blocked_patterns must be valid regular expressions; the profile_id must be a valid UUID v4; and timestamps must be valid ISO 8601 format.
A reference validator implementation will be available soon.
References
Baddeley, A. (2000). The episodic buffer: A new component of working memory? Trends in Cognitive Sciences, 4(11), 417-423.
Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.
Chen, X., et al. (2025). Cognitive Workspace: Active Memory Management for Large Language Models. arXiv preprint arXiv:2501.04892.
Christiano, P., et al. (2017). Deep Reinforcement Learning from Human Preferences. Advances in Neural Information Processing Systems, 30.
Damasio, A. R. (1994). Descartes' Error: Emotion, Reason, and the Human Brain. Putnam.
EPFL AI Lab. (2025). A Brain-Inspired Agentic Architecture to Improve Planning with LLMs. Nature Communications, 16, 1847.
European Union. (2025). Artificial Intelligence Act: Consolidated Text with 2025 Amendments. Official Journal of the European Union.
Gartner Research. (2024). Enterprise AI Deployment Outcomes Survey. Gartner Inc.
Harlow, J. M. (1868). Recovery from the passage of an iron bar through the head. Publications of the Massachusetts Medical Society, 2, 327-347.
HELM Team. (2025). Holistic Evaluation of Language Models: 2025 Safety and Robustness Update. Stanford Center for Research on Foundation Models.
Liang, P., et al. (2023). LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. arXiv preprint arXiv:2308.14508.
Liu, N., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv preprint arXiv:2307.03172.
McKinsey & Company. (2024). The State of AI in 2024. McKinsey Global Survey.
McKinsey & Company. (2025). Multi-Agent AI in the Enterprise: Adoption Trends and Implementation Challenges. McKinsey Digital.
Microsoft Research. (2023). Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.
Miller, E. K., & Cohen, J. D. (2001). An Integrative Theory of Prefrontal Cortex Function. Annual Review of Neuroscience, 24, 167-202.
Model Context Protocol Working Group. (2025). MCP Specification v1.0: General Availability Release. Anthropic.
Norman, D. A., & Shallice, T. (1986). Attention to Action: Willed and Automatic Control of Behavior. In R. J. Davidson, G. E. Schwartz, & D. Shapiro (Eds.), Consciousness and Self-Regulation (Vol. 4, pp. 1-18). Springer.
OpenAI. (2023). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35.
Park, J. S., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. arXiv preprint arXiv:2304.03442.
Perez, E., et al. (2022). Red Teaming Language Models with Language Models. arXiv preprint arXiv:2202.03286.
Richards, T., & Significant Gravitas. (2023). Auto-GPT: An Autonomous GPT-4 Experiment. GitHub Repository.
Scaria, K., et al. (2024). A Prefrontal Cortex-inspired Architecture for Planning in Large Language Models. arXiv preprint arXiv:2310.00194.
Shallice, T., & Burgess, P. (1996). The domain of supervisory processes and temporal organization of behaviour. Philosophical Transactions of the Royal Society B, 351(1346), 1405-1412.
Stuss, D. T., & Knight, R. T. (Eds.). (2002). Principles of Frontal Lobe Function. Oxford University Press.
Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35.
Wolf, Y., et al. (2023). Fundamental Limitations of Alignment in Large Language Models. arXiv preprint arXiv:2304.11082.
Zhang, Y., et al. (2025). PaceLLM: Brain-Inspired Large Language Models with Persistent Activity for Working Memory. Advances in Neural Information Processing Systems, 38.
Version: 1.1
Published: January 2026
License: All rights reserved