
Exo: Can Scaffolding Scale Test-Time Compute To Close the Model Gap?


Testing whether structured reasoning architectures can close the performance gap between frontier and economy models cost-effectively.


TL;DR

Chain-of-thought shows test-time compute scaling improves LLM performance. Frontier models now score 80%+ on GAIA; economy models plateau around 55%. The gap presumably reflects better internal reasoning from higher parameter counts capturing more training data. But can external structure compensate? If a scaffold can guide a weaker model through the right reasoning steps — decomposed, domain-specialized, optimized from experience — can it close that gap cost-effectively? I built Exo to find out. It decomposes ReAct into modular components (reflection → domain routing → specialized action) operating on a shared trajectory, with an adapted GEPA optimizer for trajectory-level self-improvement. Result: modest architectural gains (+0.7%), but automated optimization hurt performance. I didn’t close the gap in this experiment. Not surprising in retrospect — GAIA spans a wide range of tasks, including web browsing, general computer use, and multimodal capability, and is riddled with edge cases and gotchas. Not the easiest benchmark to crack with context-based optimization. Below: what I built, what I learned, and what I’d try next.


What I Built

Modular ReAct Architecture (Exo)

Decomposed monolithic ReAct into explicit modules operating on a shared trajectory:

Five domain specialists: Browser, Wikipedia, ArXiv, Code, Miscellaneous

Key design choice: Unlike subagent architectures where specialists maintain separate contexts, Exo’s modules operate on a single continuous trajectory. Each step routes to the appropriate specialist, but context accumulates without fragmentation. (This addresses context discontinuity problems I observed in earlier subagent-as-tools experiments.)
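The "single trajectory, multiple specialists" loop can be sketched as below. This is a toy illustration of the control flow only: the names (`route`, `SPECIALISTS`, the reflection string) are mine, and in the real agent the router, reflection, and specialists are LLM calls rather than string stubs.

```python
# Toy sketch of a single-trajectory loop with per-step specialist routing.
# All names here are illustrative, not the actual Exo API.

SPECIALISTS = {
    "browser": lambda step, history: f"[browser acts on: {step}]",
    "wikipedia": lambda step, history: f"[wikipedia acts on: {step}]",
    "arxiv": lambda step, history: f"[arxiv acts on: {step}]",
    "code": lambda step, history: f"[code acts on: {step}]",
    "misc": lambda step, history: f"[misc acts on: {step}]",
}

def route(step: str) -> str:
    """Toy router: keyword match; a real router would be an LLM call."""
    for name in SPECIALISTS:
        if name in step.lower():
            return name
    return "misc"

def run(task: str, max_steps: int = 3) -> list[str]:
    trajectory = [f"task: {task}"]  # one shared, continuous context
    for _ in range(max_steps):
        reflection = f"reflect on {len(trajectory)} entries"   # reflection module
        domain = route(trajectory[-1])                         # domain routing
        action = SPECIALISTS[domain](trajectory[-1], trajectory)
        trajectory += [reflection, action]  # context accumulates, no fragmentation
    return trajectory
```

The point of the sketch is the data flow: every specialist reads and writes the same `trajectory` list, so there is no agent-to-agent handoff where context can be lost.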

Trajectory-Level GEPA Adaptation

GEPA provides a useful framework for optimizing LLM prompts, but out of the box it is suitable only for "single-shot" LLM applications. Agents need trajectory-level analysis: you can't attribute a final failure to a specific step without examining the full chain. I adapted GEPA to:

Infrastructure


The Hypothesis

Claim: Modular decomposition + automated trajectory-level optimization should outperform monolithic ReAct.

Reasoning:


What I Tried First

Before settling on this architecture, I experimented with subagents-as-tools: giving a top-level agent tools to spawn specialist agents for browsing, coding, and so on. Problem: the interface between agent and subagent became a point of context loss, dependent on both agents communicating flawlessly. And complex tasks like GAIA's often require discovery, which further complicates context transmission, since it's not obvious what information each agent needs to pass to the other.

This motivated the “single trajectory, multiple specialists” design — preserve continuity while still enabling domain-specific prompting.


The Experiment

I used the GAIA validation set to create a new train/validation split and ran my modified dspy-GEPA optimizer on the Exo modular ReAct agent for thousands of rollouts. Important context: GEPA validates its updates. All updates are created from train-set trajectory feedback, but an update is accepted only if it improves validation-set scores. The GEPA author recommends using the smallest validation set that accurately represents the problem domain, to reduce compute and training time. I used a small validation set of 5 items, but regularly resampled those 5 items from the held-out validation set to balance training efficiency against overfitting risk.
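The accept-if-validation-improves loop with periodic resampling can be sketched like this. The scorer and proposer here are toy stand-ins, not the real dspy-GEPA internals; only the acceptance and resampling logic reflects the setup described above.

```python
import random

# Sketch of GEPA-style update acceptance with a small, periodically resampled
# validation set. `score` and `propose_update` are toy stand-ins.

def score(agent, val_set):
    """Toy scorer: fraction of val items whose key appears in the prompt."""
    return sum(k in agent["prompt"] for k in val_set) / len(val_set)

def propose_update(agent, train):
    """Toy proposer: append one random training hint to the prompt."""
    return {"prompt": agent["prompt"] + " " + random.choice(train)}

def optimize(agent, train, held_out_val, rounds=20, val_size=5, resample_every=5):
    val = random.sample(held_out_val, val_size)
    best = score(agent, val)
    for r in range(1, rounds + 1):
        if r % resample_every == 0:              # rotate the tiny val set
            val = random.sample(held_out_val, val_size)
            best = score(agent, val)             # re-baseline on the new sample
        candidate = propose_update(agent, train) # built from train-set feedback
        cand_score = score(candidate, val)
        if cand_score > best:                    # accept only on val improvement
            agent, best = candidate, cand_score
    return agent
```

Note the re-baselining step after each resample: without it, scores from the old 5-item sample would be compared against scores from the new one.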

Curriculum

GAIA splits tasks into three difficulty levels, and I tagged each task with the tool group it required (e.g., web browsing, Wikipedia, calculator). From this I built a curriculum, starting with the easiest tasks in the most commonly required tool group: web browsing. After validation scores started to plateau, I moved on to the next tool group, progressively expanding to all tool groups, then increased the difficulty.
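The ordering described above (easiest level first; within a level, most common tool group first) amounts to a simple sort. The task-dict shape below is a stand-in for my actual task metadata:

```python
from collections import Counter

# Sketch of the curriculum ordering: difficulty level ascending, and within a
# level, the most commonly required tool group first. Field names are illustrative.

def build_curriculum(tasks):
    group_freq = Counter(t["tool_group"] for t in tasks)
    return sorted(
        tasks,
        key=lambda t: (t["level"],                     # easiest difficulty first
                       -group_freq[t["tool_group"]]),  # most common group first
    )
```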

Apparent Progress

Over tens of runs and thousands of rollouts, validation scores generally improved. On some runs performance was already strong (80%+); on others, scores did not improve. After a while, I decided it was time to stop optimizing and start evaluating.

Test

I tested two versions of my agent, Exo, and one baseline (Vanilla ReAct) on the full 300-task GAIA test set. The baseline is DSPy's implementation of ReAct with the Exo tool set. The two Exo versions were (1) an unoptimized version, to test the effect of the modular architecture without GEPA optimization, and (2) a distilled version of the GEPA-optimized agent. The GEPA-optimized version is distilled because, in preliminary validation-set evaluations, the non-distilled version performed worse than both the unoptimized version and the baseline. After reviewing the GEPA-generated instructions and finding them problematic, I attempted to salvage them by having Claude 4.5 remove redundant and incoherent information, and then deleting obviously wrong guidance myself. The distillation partly invalidates that variant's results, but re-running the optimization was out of scope for this experiment, and an optimistic preview of potential performance seemed more informative than verifying its underperformance.


Results

Test set performance (using Grok 4 Fast, a top performing economy model, at $0.20/$0.50 per million tokens):

| Variant | Accuracy | Error Rate |
| --- | --- | --- |
| Vanilla ReAct baseline | 54.8% | 1.7% |
| Exo (unoptimized) | 55.5% | 5.6% |
| Exo (GEPA-optimized, distilled) | 52.5% | |

What the numbers show:

GEPA-generated instructions

GEPA generated very long instructions: up to 2.2k tokens in one module, with detailed guidance relating to specific tasks in some cases, and incorrect guidance in others.

Secondary Results

I wanted to see whether decomposed ReAct could have a greater impact on older models, trained before the labs' increased focus on agentic tool calling. Scores on a 50-task validation-set sample, using DeepSeek V3 0324:


Reflections

GEPA Optimization Failure

The GEPA-optimized agent’s poor performance was surprising, since GEPA validates instructions: all updates are tested against a validation set the optimizer never sees, and that score must improve for an update to be accepted. In retrospect, I put too much faith in this validation mechanism. I assumed it would ensure updates were generally helpful, but unfortunately that wasn’t the case. If I continue this experiment, increasing the validation set size is one easy change to try.

I would also consider changing the optimizer more aggressively. One limitation of GEPA’s current implementation is that it runs training minibatches (default size 3), and all feedback and updates are based only on the current minibatch’s trajectories. This could be expanded to include all trajectories for a target module in its current version. My adapter could already synthesize feedback from multiple trajectories, but there was no minimum number of trajectories required before synthesizing an update. Forcing the optimizer to synthesize updates from more trajectories could help against overfit instructions.
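A minimal sketch of that fix: buffer per-module trajectory feedback and only synthesize an update once enough trajectories have accumulated. The class and the `synthesize` callback are hypothetical, not part of GEPA or my adapter as written.

```python
from collections import defaultdict

MIN_TRAJECTORIES = 8  # hypothetical floor: no update from fewer trajectories

class FeedbackBuffer:
    """Accumulates trajectory feedback per module; gates update synthesis."""

    def __init__(self):
        self.by_module = defaultdict(list)

    def add(self, module: str, feedback: str):
        self.by_module[module].append(feedback)

    def maybe_update(self, module: str, synthesize):
        fb = self.by_module[module]
        if len(fb) < MIN_TRAJECTORIES:
            return None              # too little evidence: skip the update
        update = synthesize(fb)      # synthesize from ALL buffered feedback
        fb.clear()                   # start fresh for the next module version
        return update
```

In contrast, GEPA's default behavior corresponds to calling `synthesize` on every minibatch of 3, which is where the task-specific overfitting seems to come from.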

I’ve also wondered about more memory-based approaches. For example, instead of mutating instructions, give the agent a mechanism to reflect on trajectories and make those reflections retrievable for injection while the agent is running. ReasoningBank does something like this.
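A toy version of that idea, assuming keyword overlap as a stand-in for real retrieval (ReasoningBank and production systems would use embeddings):

```python
class ReflectionMemory:
    """Store reflections on past trajectories; retrieve the most relevant
    ones for injection into a new run. Keyword overlap is a toy scorer."""

    def __init__(self):
        self.entries = []  # (keyword set, reflection text)

    def add(self, trajectory_summary: str, reflection: str):
        self.entries.append((set(trajectory_summary.lower().split()), reflection))

    def retrieve(self, task: str, k: int = 2):
        words = set(task.lower().split())
        scored = sorted(self.entries,
                        key=lambda e: len(e[0] & words), reverse=True)
        return [reflection for _, reflection in scored[:k]]
```

The appeal over instruction mutation is that nothing is overwritten: a bad reflection can only hurt the tasks it gets retrieved for, rather than degrading a module's prompt globally.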

GAIA is hard

GAIA covers a very diverse set of tasks, which makes optimization difficult. To have been successful, I think I would have needed to generate a training set covering the full distribution of GAIA, which is no small task. The validation set is too sparse to learn from effectively. In the end, most of the performance gains on GAIA I’ve seen over the course of this project came from the models themselves (most likely the result of a big push on agentic tool calling at the labs). I probably would have done better on a narrower benchmark, in a domain the labs aren’t actively working on. In retrospect, this project was too ambitious given my constraints (one person working on a side project).

Context-based learning

This project really hammered home the constraints of context based learning for me. Weight-based learning uses gradient descent to synthesize large numbers of examples into generalizable parameter updates. Context-based optimization has to do this explicitly: integrate trajectory feedback into context, balancing detail with limited attention. My domain specialist modules are one way of addressing this, since they narrow the scope of each module’s responsibilities. A memory mechanism is another way.

On GEPA adaptation

I anticipated that single-task analysis could produce over-specific guidance, and addressed it pragmatically: the instruction proposer synthesized across trajectories, I added generalization guidelines, and I preserved GEPA’s core plumbing to minimize scope. A reasonable tradeoff, but insufficient: the instructions still overfit. Next iteration: rip out more core machinery and implement a broader historical search across all feedback for a module version before proposing updates. Validation rotation also wasn’t enough of a safeguard. I started with 10-task sets, then cut to 5 with rotation and automatic resampling per GEPA’s guidance, and it still overfit. Given GAIA’s diversity, I’d need 20+ tasks or synthetic augmentation.

On optimizer context

The analysis and instruction-proposing modules need architectural context—how modules interact, what flows where, which failures are addressable at which layer. Obvious in retrospect; took about a day to adjust once I noticed. Would front-load this more aggressively.

On benchmark choice

GAIA’s diversity meant sparse signal per domain. Would validate on a vertical slice first—just web browsing or just Wikipedia—before expanding. Denser feedback, cleaner iteration.


What I’d Do Differently

Scope: Single domain first (e.g., just web browsing or just code), validate the approach works, then expand. GAIA’s diversity made iteration cycles expensive and signal noisy.

Iteration size: Smaller increments. Add just the reflection module, measure. Then add routing, measure. I made too many architectural bets simultaneously.

GEPA fixes I’d try:


Technical Appendix

[Link to detailed page or expandable sections] basic architecture

© 2025 Alex Choi.