Core Concepts
ARES provides a reinforcement learning framework that enables training policies (agents) to produce better LLM responses for code agents. Unlike traditional frameworks that treat the entire code agent as the optimization target, ARES trains the LLM within the agent by treating LLM interactions as observations and actions within a standard RL loop.
Key Distinction
It’s important to understand two different concepts in ARES:
- Code Agent (Static)
The orchestration logic that uses a Container and LLM to solve tasks (e.g., MiniSWECodeAgent). This is part of the environment and remains fixed during training. Think of it as the scaffold that defines how an LLM interacts with code.
- Agent/Policy (Trained)
The component you’re actually training: a function that maps LLMRequest → LLMResponse. This could be a fine-tuned LLM, a prompt optimizer, or any policy that produces better responses. This is what improves through reinforcement learning.
System Architecture
Here’s how the components fit together:
Your Training Loop                          ARES Environment
══════════════════                          ════════════════
┌────────────────────────┐
│  Your RL Policy/Agent  │                  ┌──────────────────────────────────────┐
│  (e.g. Fine-tuned LLM) │                  │           CodeEnvironment            │
│  receives request,     │                  │                                      │
│  generates response    │                  │                                      │
└──────────┬─────────────┘                  │  ┌────────────────────────────────┐  │
  ^        │                                │  │     QueueMediatedLLMClient     │  │
  │        │  LLMResponse (action)          │  │                                │  │
  │        └────────────────────────────────┼─>│  Intercepts LLM calls          │  │
  │                                         │  │  from code agent via           │  │
  └─────────────────────────────────────────┼──│  QueueMediatedLLMClient        │  │
     LLMRequest (observation)               │  └──────────────────┬─────────────┘  │
                                            │                 ^   │                │
                                            │      LLMRequest │   │ LLMResponse    │
                                            │                 │   v                │
                                            │  ┌──────────────┴─────────────────┐  │
                                            │  │           CodeAgent            │  │
                                            │  │    (e.g. MiniSWECodeAgent)     │  │
                                            │  │                                │  │
                                            │  │  - Reasons about task          │  │
                                            │  │  - Calls LLM (blocks)          │  │
┌────────────────────────────┐              │  │  - Runs commands in Container  │  │
│  Multiple Environments     │              │  │  - Iterates until done         │  │
│  can run in parallel       │              │  └────────────────────┬───────────┘  │
│                            │              │                 ^     │              │
│  async with env1, env2:    │              │      cmd output │     │ exec_run()   │
│    # Parallel episodes     │              │                 │     v              │
└────────────────────────────┘              │  ┌──────────────┴─────────────────┐  │
                                            │  │           Container            │  │
                                            │  │      (Docker or Daytona)       │  │
                                            │  │                                │  │
                                            │  │  - Isolated environment        │  │
                                            │  │  - Runs bash commands          │  │
                                            │  │  - File upload/download        │  │
                                            │  └────────────────────────────────┘  │
                                            └──────────────────────────────────────┘
Key Properties
- Composability
Each component has a narrow interface and can be swapped independently. Want cloud containers? Switch the factory. Want a different agent? Swap the agent factory.
- Scalability
Environments are async and independent. Run hundreds in parallel with asyncio.gather() for distributed data collection.
- RL Native
The architecture naturally maps to RL: observations, actions, rewards, episodes. Use any RL algorithm - policy gradient, Q-learning, behavioral cloning, etc.
- LLM-Focused Optimization
Unlike frameworks that treat the entire agent as a black box, ARES gives you fine-grained control over the LLM’s behavior at every step.
Environment
An Environment encapsulates the task, container, and code agent as a single RL environment. ARES implements an async version of DeepMind’s dm_env specification.
The key abstraction is CodeEnvironment, which:
Manages a Container - Provides an isolated execution environment
Manages a CodeAgent - Runs the orchestration logic for solving the task
Exposes LLM requests as observations - Intercepts calls from the code agent
Treats LLM responses as actions - Your trainable agent/policy provides responses
Crucially, the CodeAgent is part of the environment, not what you’re training. Your training loop optimizes an agent/policy that produces better LLMResponse outputs given LLMRequest observations.
Standard RL Loop
Every environment follows the standard RL pattern:
async with env:
    # Start a new episode
    timestep = await env.reset()
    while not timestep.last():
        # timestep.observation is an LLMRequest from the code agent
        action = await your_policy(timestep.observation)
        # action is an LLMResponse that continues the agent's execution
        timestep = await env.step(action)
    # timestep.reward contains the reward for the final step
    print(f"Final reward: {timestep.reward}")
TimeStep Structure
Each call to reset() or step() returns a TimeStep with:
- step_type: One of "FIRST", "MID", or "LAST"
- observation: An LLMRequest object (or None on termination)
- reward: A float reward for each step
- discount: A float discount factor for RL algorithms
CodeAgent
A CodeAgent implements the orchestration logic for attempting to solve a task. It has access to a Container (to execute shell commands) and an LLMClient (to interact with the language model).
The minimal interface is simple:
from typing import Protocol

class CodeAgent(Protocol):
    async def run(self, task: str) -> None:
        """Runs the agent for the specific task."""

# PEP 695 generic syntax; requires Python 3.12+
class CodeAgentFactory[T: CodeAgent](Protocol):
    def __call__(self, *, container: Container, llm_client: QueueMediatedLLMClient) -> T:
        """Instantiates a new CodeAgent."""
        ...
Agent Implementation Pattern
A typical code agent:
Receives a task description (e.g., “Fix the authentication bug”)
Makes LLM calls to reason about what to do
Executes bash commands in the container to inspect code, run tests, make edits
Iterates between LLM reasoning and command execution
Signals completion when done (implementation-specific)
Example structure:
class MyCodeAgent:
    def __init__(self, container: Container, llm_client: QueueMediatedLLMClient):
        self._container = container
        self._llm_client = llm_client

    async def run(self, task: str) -> None:
        while not self.is_done():
            # Ask LLM what to do next
            request = LLMRequest(messages=[...])
            response = await self._llm_client(request)
            # Parse and execute commands from LLM response
            commands = self.parse_commands(response)
            for cmd in commands:
                result = await self._container.exec_run(cmd)
                # Use result in next LLM call...
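The `parse_commands` step in the example structure is left abstract. One possible way to implement it, assuming the LLM's reply text has already been extracted as a plain string and that commands arrive in bash-tagged fenced blocks, is sketched below. This is an illustration only and not the parser any particular agent uses; `FENCE` and `BASH_BLOCK` are hypothetical names.

```python
import re

# The fence marker is built programmatically so this example can sit inside a fenced block.
FENCE = "`" * 3
BASH_BLOCK = re.compile(FENCE + r"bash[ \t]*\n(.*?)\n" + FENCE, re.DOTALL)


def parse_commands(reply: str) -> list[str]:
    """Returns the body of every bash-tagged fenced block in an LLM reply."""
    return [body.strip() for body in BASH_BLOCK.findall(reply)]
```

A reply with no fenced bash block yields an empty list, which an agent could treat as a signal to re-prompt or to finish.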
Connection to the RL Loop
Here’s the key insight: The agent doesn’t know it’s part of an RL loop.
When the agent calls await self._llm_client(request), it blocks and waits for a response. But the LLMClient is actually a QueueMediatedLLMClient (see How It Works), which:
Puts the request into a queue
Waits for someone to provide a response
Returns that response to the agent
The environment watches this queue and exposes requests as observations. Your RL policy provides responses as actions. This lets you train the LLM while the agent code remains simple and linear.
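The three steps above can be sketched with an `asyncio.Queue` and one future per request. This is a simplified illustration of the pattern under stated assumptions, not the actual QueueMediatedLLMClient: requests and responses are plain strings here, and `QueueClientSketch`, `toy_agent`, and `drive` are hypothetical names.

```python
import asyncio


class QueueClientSketch:
    """Hypothetical queue-backed client: callers block until a response is supplied."""

    def __init__(self) -> None:
        self.pending: asyncio.Queue[tuple[str, asyncio.Future[str]]] = asyncio.Queue()

    async def __call__(self, request: str) -> str:
        future: asyncio.Future[str] = asyncio.get_running_loop().create_future()
        await self.pending.put((request, future))  # put the request into a queue
        return await future  # wait for a response, then return it to the caller


async def toy_agent(client: QueueClientSketch, log: list[str]) -> None:
    """Linear agent code: it just awaits the client, unaware of the queue."""
    for i in range(2):
        reply = await client(f"request {i}")
        log.append(reply)


async def drive() -> list[str]:
    """Plays the environment's role: read requests, answer them as actions."""
    client = QueueClientSketch()
    log: list[str] = []
    agent_task = asyncio.create_task(toy_agent(client, log))
    for _ in range(2):
        request, future = await client.pending.get()  # observation
        future.set_result(f"reply to {request}")      # action
    await agent_task
    return log
```

The agent coroutine stays suspended on `await future` until the driver calls `set_result`, which is what lets an outer training loop decide when and how each LLM call is answered.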
Available Agents
- MiniSWECodeAgent (ares.code_agents.mini_swe_agent)
Wraps the mini-swe-agent library. Uses Jinja2 templates for prompts, parses bash commands from markdown, handles timeouts and retries.
Implementing your own CodeAgent
To bring your own CodeAgent implementation into ARES, the main work is typically rewriting the LLM calls and command execution that your agent makes. An existing agent might look like this:
class MyCurrentCodeAgent:
    def __init__(self, ..., llm_client: openai.AsyncClient):
        ...
        self.llm_client = llm_client

    async def run(self, task: str) -> None:
        # Do some setup for tools and what not
        ...
        while not self.is_done():
            # Decide what to ask LLM
            ...
            llm_response = await self.llm_client.chat.completions.create(
                ...
                messages=[...],
            )
            # Parse the LLM response and execute commands
            ...
            cmd_output = await self.run_command(command)
            ...
You will need to rewrite this into something like:
class MyARESCodeAgent:
    def __init__(self, container: Container, llm_client: QueueMediatedLLMClient):
        self.llm_client = llm_client
        self.container = container
        # Replace other init setup
        ...

    async def run(self, task: str) -> None:
        # Do some setup for tools and what not
        ...
        while not self.is_done():
            # Decide what to ask LLM next
            ...
            llm_response = await self.llm_client(
                LLMRequest(
                    messages=[...],
                    ...  # Other request params
                )
            )
            # Parse the LLM response and execute commands
            ...
            cmd_output = await self.container.exec_run(command)
            ...
We are working on making it as easy as possible to integrate new CodeAgents, and hope to share more on this soon! We unfortunately don’t support arbitrary MCP tool calls yet, but that is one of several things that are top of mind. For the time being, depending on the specific tools, you may be able to fit them into the existing CodeAgent API. If not, please let us know on [GitHub](https://github.com/withmartian/ares/issues) or join our [Discord server](https://discord.gg/5Y93Zhg3eS)!
Container
A Container provides an isolated execution environment where code agents can safely run commands, modify files, and execute code.
Containers abstract over different backend implementations (local Docker, cloud providers) with a consistent interface:
class Container(Protocol):
    async def start(self, env: dict[str, str] | None) -> None: ...
    async def stop(self) -> None: ...
    async def exec_run(self, command, workdir, env, timeout_s) -> ExecResult: ...
    async def upload_files(self, local_paths, remote_paths) -> None: ...
    async def download_files(self, remote_paths, local_paths) -> None: ...
Available Implementations
- DockerContainer (ares.containers.docker)
Uses local Docker for container management. Builds images from Dockerfiles on-demand. Best for development and single-machine experiments.
- DaytonaContainer (ares.containers.daytona)
Uses Daytona for cloud-based containers. Supports distributed workloads, resource limits (CPU/memory/disk/GPU), and auto-cleanup. Best for production training runs.
Container Lifecycle
Containers are managed by the environment:
Creation: Environment calls the container factory with an image or Dockerfile
Start: Container is started with task-specific environment variables
Execution: Code agent runs commands via exec_run()
Cleanup: Container is stopped and removed when the environment closes
You typically don’t interact with containers directly - the CodeEnvironment handles their lifecycle.
LLMClient
An LLMClient provides a simple, uniform interface for making LLM API calls. It’s a quality-of-life abstraction that makes it easy to treat LLM interactions as observations and actions.
Core Interface
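from __future__ import annotations  # lets LLMClient reference the classes defined below

from collections.abc import Iterable
from dataclasses import dataclass
from typing import Protocol

# Assumed to come from the openai package, whose chat types use these names
from openai.types.chat import ChatCompletion, ChatCompletionMessageParam

class LLMClient(Protocol):
    async def __call__(self, request: LLMRequest) -> LLMResponse:
        ...

@dataclass(frozen=True)
class LLMRequest:
    messages: Iterable[ChatCompletionMessageParam]
    temperature: float | None = None

@dataclass(frozen=True)
class LLMResponse:
    chat_completion_response: ChatCompletion
    cost: float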
This simple interface wraps OpenAI-style chat completion APIs. The messages field follows the OpenAI format with role (system/user/assistant) and content.
Why LLMClient?
The LLMClient abstraction serves two purposes:
Observations = LLM Requests: In the RL loop, timestep.observation is an LLMRequest containing the messages the code agent wants to send to the LLM. This is the “state” your policy observes.
Actions = LLM Responses: In the RL loop, the action you pass to env.step() is an LLMResponse containing the LLM’s reply. This is how your policy controls the agent’s behavior.
This framing makes it natural to think about code agent training as an RL problem: you’re learning a policy that maps agent requests to helpful responses.
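One way to picture this mapping is a policy wrapper that logs each (observation, action) pair so the pairs can later be used for training. The sketch below simplifies observations to OpenAI-style message dictionaries and actions to reply strings rather than the LLMRequest and LLMResponse types; `Policy`, `RecordingPolicy`, and `last_user_echo` are hypothetical names.

```python
import asyncio
from collections.abc import Awaitable, Callable
from dataclasses import dataclass, field

Messages = list[dict[str, str]]
Policy = Callable[[Messages], Awaitable[str]]  # messages -> reply text (simplified)


@dataclass
class RecordingPolicy:
    """Wraps a policy and logs each (observation, action) pair for later training."""
    inner: Policy
    transitions: list[tuple[Messages, str]] = field(default_factory=list)

    async def __call__(self, messages: Messages) -> str:
        reply = await self.inner(messages)
        self.transitions.append((messages, reply))
        return reply


async def last_user_echo(messages: Messages) -> str:
    """Placeholder policy: a real one would sample a reply from an LLM."""
    return "ack: " + messages[-1]["content"]
```

After an episode, `transitions` can be paired with the rewards from the corresponding timesteps to form training examples for whichever algorithm you choose.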
Available Implementations
- ChatCompletionCompatibleLLMClient (ares.llms.chat_completions_compatible)
Makes real API calls to OpenAI-compatible endpoints (OpenAI, Martian, etc.). Includes retry logic, cost tracking, and configurable base URLs.
- QueueMediatedLLMClient (ares.llms.queue_mediated_client)
The critical piece that enables the RL abstraction. See How It Works for details.
- MockLLMClient (ares.llms.mock_llm_client)
Returns pre-defined responses for testing and debugging.
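To illustrate why a client that returns pre-defined responses is useful in tests, here is a minimal scripted test double. It is a sketch of the general idea only and makes no claim about how MockLLMClient is implemented; `ScriptedClient` is a hypothetical name, and requests and replies are plain strings for brevity.

```python
import asyncio
from collections.abc import Sequence


class ScriptedClient:
    """Hypothetical test double: replays fixed replies and records requests."""

    def __init__(self, replies: Sequence[str]) -> None:
        self._replies = list(replies)
        self.requests: list[str] = []

    async def __call__(self, request: str) -> str:
        self.requests.append(request)
        if not self._replies:
            raise RuntimeError("script exhausted")
        return self._replies.pop(0)
```

Because the replies are fixed, an agent driven by such a client behaves deterministically, which makes it practical to assert on the exact commands it runs.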
Next Steps
Learn about the QueueMediatedLLMClient pattern that makes the RL abstraction possible - How It Works
See the README for usage examples