How to Secure AI Agents That Run Untrusted Code Safely

AI agents are evolving from simple chatbots into autonomous systems capable of executing code, running shell commands, and managing databases. This shift introduces a critical challenge: how can developers grant these agents the power to act independently while preventing them from causing irreversible damage?

The stakes are high. A single parsing error or hallucinated variable can transform a harmless command like rm -rf /temp into a system-wrecking disaster. As AI agents gain the ability to self-orchestrate tools and modify environments, the need for robust safeguards has never been more urgent.

The Shift from Static Tools to Dynamic State Machines

Traditional software development treats tools as static libraries—reusable functions called by human developers. In an autonomous agent architecture, however, the agent itself becomes the orchestrator, and tools are its interface to the external world. Each tool execution represents a deliberate mutation of state, whether it’s writing a file, running a command, or querying a database.

This transformation demands a new architectural approach. Tools must be treated as interfaces to state machines, where every action triggers a state change. The agent’s core loop—perceiving input, processing cognition via an LLM, and executing actions—must be carefully controlled to prevent runaway behavior. Without safeguards, the LLM’s unpredictable output could spiral into catastrophic results, much like an overheating nuclear reactor.

A Three-Layered Defense System for AI Agents

To mitigate these risks, the Hermes Agent framework (v0.13) introduces a hierarchical, policy-driven architecture divided into three critical layers:

1. Tool Definition Layer: The Agent’s Catalog of Capabilities

This foundational layer serves as the agent’s "catalog," defining every tool’s schema, description, and strict JSON validation rules. It acts as a curated list of capabilities presented to the LLM during its cognitive phase. The catalog is dynamically filtered based on enabled or disabled toolsets, ensuring the agent operates within predefined boundaries.

Key components include:

Tool schemas that enforce argument structure
Descriptive metadata to guide the LLM’s decision-making
Version control to track tool evolution

2. Tool Execution Layer: The Dispatch Center

When the LLM generates a tool call, this layer acts as the "dispatch center," parsing arguments, validating inputs, and routing requests to the appropriate handlers. It handles type coercion, initial error detection, and sequential or concurrent execution of tasks. By centralizing this logic, the system ensures consistent behavior and early intervention for malformed requests.

3. Sandboxing Layer: The Containment Vessel

The final and most critical layer is the sandboxing mechanism, designed to isolate dangerous operations like terminal commands and code execution. This layer enforces guardrails, implements checkpoints, and leverages containerized environments (such as Docker) to contain potential damage. Even if the agent’s intent is flawed or malicious, the sandbox ensures the host system remains protected.

Inside the Agent’s Core Loop: A Closed Learning System

At the heart of the Hermes Agent is the run_conversation method, a state machine that implements a closed learning loop. Unlike traditional APIs that call a tool and forget, this loop ensures the agent continuously learns from its actions. Here’s how it works:

API_CALL State: The agent sends the conversation history to the LLM, which generates a response.
TOOL_EXECUTION State: If the LLM’s response includes tool calls, the system executes them and appends the results back into the conversation history as a role: "tool" message.
FINAL_RESPONSE State: Once all necessary tools are executed, the agent delivers a final response to the user.

This feedback mechanism makes the agent highly capable but also introduces risks like infinite loops or resource-draining tasks. To prevent these, the framework enforces strict iteration limits and real-time monitoring.

def run_conversation(self, user_message, ...):
    # Setup and memory loading
    while (api_call_count < self.max_iterations):
        # 1. API_CALL State: Send history to LLM
        response = self._interruptible_api_call(api_kwargs)
        normalized = self._get_transport().normalize_response(response)
        assistant_message = normalized
        
        # 2. TOOL_EXECUTION State: Process tool calls if present
        if assistant_message.tool_calls:
            assistant_msg = self._build_assistant_message(assistant_message, finish_reason)
            messages.append(assistant_msg)
            self._execute_tool_calls(assistant_message, messages, effective_task_id)
            continue
        else:
            # 3. FINAL_RESPONSE State: No more tools needed
            final_response = assistant_message.content
            break

Balancing Autonomy and Safety in Agentic AI

The rise of autonomous AI agents marks a pivotal moment in software development, but it also demands a fundamental shift in how we approach security. By adopting a multi-layered, state-machine-driven architecture, developers can harness the power of AI agents without sacrificing safety. Frameworks like Hermes Agent demonstrate that it’s possible to balance autonomy with robust containment, ensuring that even untrusted code can be executed responsibly.

As AI agents continue to evolve, the focus must remain on building systems that are not only intelligent but also resilient. The future of agentic AI lies in architectures that learn from their actions, adapt to new challenges, and—above all—protect the systems they operate in.

AI summary

Learn how to build self-healing AI agents that execute untrusted code without risking system failures. Explore the Hermes Agent framework's three-layered defense system for secure automation.

How to Secure AI Agents That Run Untrusted Code Safely

The Shift from Static Tools to Dynamic State Machines

A Three-Layered Defense System for AI Agents

1. Tool Definition Layer: The Agent’s Catalog of Capabilities

2. Tool Execution Layer: The Dispatch Center

3. Sandboxing Layer: The Containment Vessel

Inside the Agent’s Core Loop: A Closed Learning System

Balancing Autonomy and Safety in Agentic AI

Comments

Why Companies Should Focus on Operations, Not Build Tech Stacks

Cut Aider AI coding costs with a single LLM gateway setup

Python YouTube downloader with async downloads and real-time queue management