Core Concepts

ARES provides a reinforcement learning framework that enables training policies (agents) to produce better LLM responses for code agents. Unlike traditional frameworks that treat the entire code agent as the optimization target, ARES trains the LLM within the agent by treating LLM interactions as observations and actions within a standard RL loop.

Key Distinction

It’s important to understand two different concepts in ARES:

Code Agent (Static)
The orchestration logic that uses a Container and LLM to solve tasks (e.g., MiniSWECodeAgent). This is part of the environment and remains fixed during training. Think of it as the scaffold that defines how an LLM interacts with code.
Agent/Policy (Trained)
The component you’re actually training - a function that maps LLMRequest → LLMResponse. This could be a fine-tuned LLM, a prompt optimizer, or any policy that produces better responses. This is what improves through reinforcement learning.

System Architecture

Here’s how the components fit together:

Your Training Loop                    ARES Environment
═════════════════                     ════════════════

┌────────────────────────┐
│  Your RL Policy/Agent  │            ┌──────────────────────────────────────┐
│  (e.g. Fine-tuned LLM) │            │         CodeEnvironment              │
│  receives request,     │            │                                      │
│  generates response    │            │                                      │
└──────────┬─────────────┘            │  ┌────────────────────────────────┐  │
    ^      │                          │  │   QueueMediatedLLMClient       │  │
    |      │ LLMResponse (action)     │  │                                │  │
    |      └──────────────────────────┼─>│   Intercepts LLM calls         │  │
    |                                 │  │   from code agent via          │  │
    └─────────────────────────────────┼──│   QueueMediatedLLMClient       │  │
             LLMRequest (observation) │  └──────────────────┬─────────────┘  │
                                      │                 ^   │                │
                                      │      LLMRequest │   │ LLMResponse    │
                                      │                 │   v                │
                                      │  ┌──────────────┴─────────────────┐  │
                                      │  │       CodeAgent                │  │
                                      │  │  (e.g. MiniSWECodeAgent)       │  │
                                      │  │                                │  │
                                      │  │  - Reasons about task          │  │
                                      │  │  - Calls LLM (blocks)          │  │
┌────────────────────────────┐        │  │  - Runs commands in Container  │  │
│  Multiple Environments     │        │  │  - Iterates until done         │  │
│  can run in parallel       │        │  └────────────────────┬───────────┘  │
│                            │        │                 ^     │              │
│  async with env1, env2:    │        │      cmd output │     │ exec_run()   │
│      # Parallel episodes   │        │                 │     v              │
└────────────────────────────┘        │  ┌──────────────┴─────────────────┐  │
                                      │  │       Container                │  │
                                      │  │  (Docker or Daytona)           │  │
                                      │  │                                │  │
                                      │  │  - Isolated environment        │  │
                                      │  │  - Runs bash commands          │  │
                                      │  │  - File upload/download        │  │
                                      │  └────────────────────────────────┘  │
                                      └──────────────────────────────────────┘

Key Properties

Composability: Each component has a narrow interface and can be swapped independently. Want cloud containers? Switch the factory. Want a different agent? Swap the agent factory.
Scalability: Environments are async and independent. Run hundreds concurrently with asyncio.gather() for high-throughput data collection.
RL Native: The architecture naturally maps to RL: observations, actions, rewards, episodes. Use any RL algorithm - policy gradient, Q-learning, behavioral cloning, etc.
LLM-Focused Optimization: Unlike frameworks that treat the entire agent as a black box, ARES gives you fine-grained control over the LLM’s behavior at every step.

Environment

An Environment encapsulates the task, container, and code agent as a single RL environment. ARES implements an async version of DeepMind’s dm_env specification.

The key abstraction is CodeEnvironment, which:

Manages a Container - Provides an isolated execution environment
Manages a CodeAgent - Runs the orchestration logic for solving the task
Exposes LLM requests as observations - Intercepts calls from the code agent
Treats LLM responses as actions - Your trainable agent/policy provides responses

Crucially, the CodeAgent is part of the environment, not what you’re training. Your training loop optimizes an agent/policy that produces better LLMResponse outputs given LLMRequest observations.

Standard RL Loop

Every environment follows the standard RL pattern:

async with env:
    # Start a new episode
    timestep = await env.reset()

    while not timestep.last():
        # timestep.observation is an LLMRequest from the code agent
        action = await your_policy(timestep.observation)

        # action is an LLMResponse that continues the agent's execution
        timestep = await env.step(action)

    # timestep.reward contains the reward for the final step
    print(f"Final reward: {timestep.reward}")

TimeStep Structure

Each call to reset() or step() returns a TimeStep with:

step_type: One of "FIRST", "MID", or "LAST"
observation: An LLMRequest object (or None on termination)
reward: The float reward for this step
discount: A float discount factor for RL algorithms
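To illustrate how reward and discount are typically combined, here is a minimal sketch. SketchTimeStep is a hypothetical stand-in that mirrors the four fields above, not the ARES class, and discounted_return is the usual backward recursion rather than anything ARES-specific.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SketchTimeStep:
    """Hypothetical stand-in with the four fields listed above."""
    step_type: str    # "FIRST", "MID", or "LAST"
    observation: Any  # an LLMRequest in ARES; None on termination
    reward: float
    discount: float

    def last(self) -> bool:
        return self.step_type == "LAST"


def discounted_return(steps: list[SketchTimeStep]) -> float:
    """Return measured from the start of `steps`, accumulated backward.

    Uses the recursion G = reward + discount * G_next, so each step's
    discount scales everything that happens after it.
    """
    g = 0.0
    for ts in reversed(steps):
        g = ts.reward + ts.discount * g
    return g
```

For example, three steps with rewards 0, 0, 1 and discounts 0.9, 0.9, 0 give a return of roughly 0.81, since the final reward is scaled by the two earlier discounts.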

CodeAgent

A CodeAgent implements the orchestration logic for attempting to solve a task. It has access to a Container (to execute shell commands) and an LLMClient (to interact with the language model).

The minimal interface is simple:

class CodeAgent(Protocol):
    async def run(self, task: str) -> None:
        """Runs the agent for the specific task."""

class CodeAgentFactory[T: CodeAgent](Protocol):
    def __call__(self, *, container: Container, llm_client: QueueMediatedLLMClient) -> T:
        """Instantiates a new CodeAgent."""
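Because the factory is just a callable taking container and llm_client as keyword arguments, one way to pass extra configuration is to bind it ahead of time. The sketch below assumes a hypothetical SketchAgent with a made-up max_turns setting; neither is part of ARES.

```python
import functools
from typing import Any


class SketchAgent:
    """Hypothetical agent with one setting beyond the two required arguments."""

    def __init__(self, *, container: Any, llm_client: Any, max_turns: int) -> None:
        self.container = container
        self.llm_client = llm_client
        self.max_turns = max_turns

    async def run(self, task: str) -> None:
        ...


# Bind the extra setting; the result only needs container= and llm_client=.
agent_factory = functools.partial(SketchAgent, max_turns=20)
```

The environment can then call agent_factory(container=..., llm_client=...) without knowing about max_turns.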

Agent Implementation Pattern

A typical code agent:

Receives a task description (e.g., “Fix the authentication bug”)
Makes LLM calls to reason about what to do
Executes bash commands in the container to inspect code, run tests, make edits
Iterates between LLM reasoning and command execution
Signals completion when done (implementation-specific)

Example structure:

class MyCodeAgent:
    def __init__(self, container: Container, llm_client: QueueMediatedLLMClient):
        self._container = container
        self._llm_client = llm_client

    async def run(self, task: str) -> None:
        while not self.is_done():
            # Ask LLM what to do next
            request = LLMRequest(messages=[...])
            response = await self._llm_client(request)

            # Parse and execute commands from LLM response
            commands = self.parse_commands(response)
            for cmd in commands:
                result = await self._container.exec_run(cmd)
                # Use result in next LLM call...

Connection to the RL Loop

Here’s the key insight: The agent doesn’t know it’s part of an RL loop.

When the agent calls await self._llm_client(request), it blocks and waits for a response. But the LLMClient is actually a QueueMediatedLLMClient (see How It Works), which:

Puts the request into a queue
Waits for someone to provide a response
Returns that response to the agent

The environment watches this queue and exposes requests as observations. Your RL policy provides responses as actions. This lets you train the LLM while the agent code remains simple and linear.
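The three steps above can be sketched with an asyncio.Queue and one Future per request. This is a hypothetical illustration of the pattern, not the actual QueueMediatedLLMClient; requests and responses are plain strings, and SketchQueueClient, agent, and demo are made-up names.

```python
import asyncio


class SketchQueueClient:
    """Hypothetical illustration of the queue-mediated pattern."""

    def __init__(self) -> None:
        # Each item pairs a request with the future that will hold its response.
        self.pending = asyncio.Queue()

    async def __call__(self, request: str) -> str:
        future = asyncio.get_running_loop().create_future()
        await self.pending.put((request, future))  # put the request in the queue
        return await future  # block until someone provides a response


async def agent(client: SketchQueueClient, log: list[str]) -> None:
    """Linear agent code: it just awaits the client, unaware of any RL loop."""
    for i in range(2):
        reply = await client(f"request {i}")
        log.append(reply)


async def demo() -> list[str]:
    client = SketchQueueClient()
    log: list[str] = []
    agent_task = asyncio.create_task(agent(client, log))
    for _ in range(2):
        # Environment side: the queued request becomes the observation...
        request, future = await client.pending.get()
        # ...and the policy's action resolves the agent's blocked call.
        future.set_result(f"action for {request}")
    await agent_task
    return log
```

Running asyncio.run(demo()) returns the replies the agent saw, in order. The agent function contains no RL-specific code; all coordination happens through the queue and the futures.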

Available Agents

MiniSWECodeAgent (ares.code_agents.mini_swe_agent): Wraps the mini-swe-agent library. Uses Jinja2 templates for prompts, parses bash commands from markdown, handles timeouts and retries.

Implementing your own CodeAgent

To bring in your own CodeAgent implementation, the main work is typically rewriting the LLM calls and command execution your agent makes so that they go through the ARES interfaces. An existing agent might look like this:

class MyCurrentCodeAgent:
    def __init__(self, ..., llm_client: openai.AsyncClient):
        ...
        self.llm_client = llm_client

    async def run(self, task: str) -> None:
        # Do some setup for tools and what not
        ...
        while not self.is_done():
            # Decide what to ask LLM
            ...
            llm_response = await self.llm_client.chat.completions.create(
                ...
                messages=[...],
            )
            # Parse the LLM response and execute commands
            ...
            cmd_output = await self.run_command(command)
            ...

You would rewrite this into something like:

class MyARESCodeAgent:
    def __init__(self, container: Container, llm_client: QueueMediatedLLMClient):
        self.llm_client = llm_client
        self.container = container
        # Replace other init setup
        ...

    async def run(self, task: str) -> None:
        # Do some setup for tools and what not
        ...
        while not self.is_done():
            # Decide what to ask LLM next
            ...
            llm_response = await self.llm_client(
                LLMRequest(
                    messages=[...],
                    ...  # Other request params
                )
            )
            # Parse the LLM response and execute commands
            ...
            cmd_output = await self.container.exec_run(command)
            ...

We are working to make adding CodeAgents as easy as possible, with more to come soon. ARES does not yet support arbitrary MCP tool calls, though this is high on our list. In the meantime, depending on the specific tools, you may be able to fit them into the existing CodeAgent API. If not, please let us know on [GitHub](https://github.com/withmartian/ares/issues) or join our [Discord server](https://discord.gg/5Y93Zhg3eS)!

Container

A Container provides an isolated execution environment where code agents can safely run commands, modify files, and execute code.

Containers abstract over different backend implementations (local Docker, cloud providers) with a consistent interface:

class Container(Protocol):
    async def start(self, env: dict[str, str] | None) -> None: ...
    async def stop(self) -> None: ...
    async def exec_run(self, command, workdir, env, timeout_s) -> ExecResult: ...
    async def upload_files(self, local_paths, remote_paths) -> None: ...
    async def download_files(self, remote_paths, local_paths) -> None: ...

Available Implementations

DockerContainer (ares.containers.docker): Uses local Docker for container management. Builds images from Dockerfiles on-demand. Best for development and single-machine experiments.
DaytonaContainer (ares.containers.daytona): Uses Daytona for cloud-based containers. Supports distributed workloads, resource limits (CPU/memory/disk/GPU), and auto-cleanup. Best for production training runs.

Container Lifecycle

Containers are managed by the environment:

Creation: Environment calls the container factory with an image or Dockerfile
Start: Container is started with task-specific environment variables
Execution: Code agent runs commands via exec_run()
Cleanup: Container is stopped and removed when the environment closes

You typically don’t interact with containers directly - the CodeEnvironment handles their lifecycle.

LLMClient

An LLMClient provides a simple, uniform interface for making LLM API calls. It’s a quality-of-life abstraction that makes it easy to treat LLM interactions as observations and actions.

Core Interface

class LLMClient(Protocol):
    async def __call__(self, request: LLMRequest) -> LLMResponse:
        ...

@dataclass(frozen=True)
class LLMRequest:
    messages: Iterable[ChatCompletionMessageParam]
    temperature: float | None = None

@dataclass(frozen=True)
class LLMResponse:
    chat_completion_response: ChatCompletion
    cost: float

This simple interface wraps OpenAI-style chat completion APIs. The messages field follows the OpenAI format with role (system/user/assistant) and content.

Why LLMClient?

The LLMClient abstraction serves two purposes:

Observations = LLM Requests: In the RL loop, timestep.observation is an LLMRequest containing the messages the code agent wants to send to the LLM. This is the “state” your policy observes.
Actions = LLM Responses: In the RL loop, the action you pass to env.step() is an LLMResponse containing the LLM’s reply. This is how your policy controls the agent’s behavior.

This framing makes it natural to think about code agent training as an RL problem: you’re learning a policy that maps agent requests to helpful responses.

Available Implementations

ChatCompletionCompatibleLLMClient (ares.llms.chat_completions_compatible): Makes real API calls to OpenAI-compatible endpoints (OpenAI, Martian, etc.). Includes retry logic, cost tracking, and configurable base URLs.
QueueMediatedLLMClient (ares.llms.queue_mediated_client): The critical piece that enables the RL abstraction. See How It Works for details.
MockLLMClient (ares.llms.mock_llm_client): Returns pre-defined responses for testing and debugging.
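To show the idea behind a mock client, here is a minimal scripted test double. ScriptedClient is a hypothetical sketch, not the ARES MockLLMClient, and it uses plain strings in place of LLMRequest and LLMResponse.

```python
import asyncio
from collections import deque


class ScriptedClient:
    """Hypothetical test double: replies with canned strings, in order."""

    def __init__(self, replies: list[str]) -> None:
        self._replies = deque(replies)
        self.requests: list[str] = []

    async def __call__(self, request: str) -> str:
        self.requests.append(request)  # keep requests so a test can assert on them
        if not self._replies:
            raise RuntimeError("no scripted replies left")
        return self._replies.popleft()
```

Because a code agent only needs something awaitable that maps a request to a response, a double like this lets you exercise agent logic deterministically without any network calls.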

Next Steps

Learn about the QueueMediatedLLMClient pattern that makes the RL abstraction possible - How It Works
See the README for usage examples