A writeup for the Meta × PyTorch × Hugging Face OpenEnv Hackathon Grand Finale (Bangalore, April 25–26, 2026).
TL;DR — We built an OpenEnv environment that teaches Qwen 2.5 Coder 1.5B how to use a Recursive Language Model scaffold via GRPO. The trained model answers questions about 200K-token documents using only its 4K native context, and the entire pipeline reproduces in a Colab notebook.
Long-context inference has been getting longer (1M-token Gemini, 200K Claude), but small open-weights models still cap out at 4K-32K. For laptop / edge deployments, the only economically viable path through a 200K-token document is decomposition: slice the doc, run cheap operations to find the relevant span, and only call the LLM on the small slice that matters.
Recursive Language Models (RLMs) formalise this: give an LLM a Python REPL plus the ability to call itself on a sub-problem. The 1.5B model becomes a small executive that orchestrates its own scratchpad. The catch: base models are bad at this. They burn tokens grepping the whole doc, recurse needlessly, or skip the tools and hallucinate.
FATHOM is the first openly-published OpenEnv RL environment that teaches a small model the discipline of recursive-LM use, and we trained it with GRPO end-to-end.
Live HF Space: https://huggingface.co/spaces/Pratham-math/fathom-env
Endpoint: https://Pratham-math-fathom-env.hf.space
The env exposes the standard OpenEnv surface (/reset, /step, /healthz) and gives the agent two tool primitives:
| Primitive | Implementation | What it does |
|---|---|---|
| `repl(code: str)` | RestrictedPython AST filter + subprocess sandbox (network-off, ulimit'd, ephemeral cwd) | Slice / grep the document, build intermediate notes |
| `llm(prompt: str, slice: str)` | Internal recursive sub-call into the same model | Delegate a sub-question on a focused slice |
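```python
# repl() and llm() are the two tool primitives exposed by the env.
# The backslashes are escaped so the sandbox receives the two-character
# sequence \n inside the quoted separator, not a literal line break.
chunk = repl("doc.split('\\n\\n')[42:48]")   # cheap slice: paragraphs 42-47
ans = llm("Who is the protagonist?", chunk)  # focused recursion
print(ans)                                   # final answer
```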
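To make the two-stage shape of `repl` concrete (a static filter first, then an isolated child process), here is a minimal sketch. It is not the env's implementation: it uses the stdlib `ast` module as a stand-in for RestrictedPython, accepts only single expressions, and does not disable the network or apply ulimits. The names `is_allowed`, `run_sandboxed`, and `BANNED_NAMES` are hypothetical.

```python
import ast
import subprocess
import sys
import tempfile

# Hypothetical deny-list; the real env relies on RestrictedPython instead.
BANNED_NAMES = {"eval", "exec", "open", "compile", "__import__"}


def is_allowed(code: str) -> bool:
    """Stage 1: accept only a single expression with no banned names."""
    try:
        tree = ast.parse(code, mode="eval")  # statements (e.g. import) fail here
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id in BANNED_NAMES:
            return False
        # Block dunder attribute access such as obj.__class__
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return False
    return True


def run_sandboxed(code: str, doc: str, timeout_s: float = 5.0) -> str:
    """Stage 2: evaluate the expression in a child process with a throwaway cwd."""
    if not is_allowed(code):
        return "ERROR: code rejected by AST filter"
    # The document is inlined for brevity; a 200K-token doc would need a file or pipe.
    program = f"doc = {doc!r}\nprint({code})"
    with tempfile.TemporaryDirectory() as workdir:
        try:
            result = subprocess.run(
                [sys.executable, "-I", "-c", program],
                capture_output=True, text=True, timeout=timeout_s, cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return "ERROR: timed out"
    return result.stdout.strip() or result.stderr.strip()
```

For example, `run_sandboxed("doc.split(',')[1]", "a,b,c")` evaluates the slice in the child process, while `run_sandboxed("open('x')", "abc")` is rejected before any process is spawned.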
Recursion depth is capped at 2 during training (deeper depths only at demo time). All long documents (~200K tokens of synthetic narrative + QA chains) live on the env; the agent only ever sees small slices.
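One way such a depth cap could be enforced is to hand the model a fresh `llm` tool at each level that knows its own depth. The sketch below assumes the top-level call counts as depth 1. `make_llm_tool`, `DepthExceeded`, and the `model_fn` signature are hypothetical, and the model itself is a stub.

```python
from typing import Callable

MAX_TRAIN_DEPTH = 2  # from the text: depth is capped at 2 during training


class DepthExceeded(RuntimeError):
    """Raised when a sub-call would go deeper than the configured cap."""


def make_llm_tool(model_fn: Callable, max_depth: int = MAX_TRAIN_DEPTH) -> Callable:
    """Wrap a model callable so every nested call carries its own depth."""

    def llm_at(depth: int) -> Callable:
        def llm(prompt: str, slice_: str) -> str:
            if depth > max_depth:
                raise DepthExceeded(f"depth {depth} exceeds cap {max_depth}")
            # The model gets a tool bound to the next depth for its own sub-calls.
            return model_fn(prompt, slice_, llm_at(depth + 1))
        return llm

    return llm_at(1)


def halving_model(prompt: str, slice_: str, sub_llm: Callable) -> str:
    """Stub model: recurse on the first half until the slice is short enough."""
    if len(slice_) <= 4:
        return slice_
    return sub_llm(prompt, slice_[: len(slice_) // 2])


llm = make_llm_tool(halving_model)
```

With this stub, an 8-character slice resolves within two levels, while a 32-character slice would need a third level and raises `DepthExceeded`. Raising is only one option; an env could instead return an error string to the agent and let the reward penalise it.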
No LLM-as-judge. Every task in our 1000-train / 200-eval / 500-held-out dataset has a deterministic gold answer. The reward is built from four grep-verifiable components:
| Weight | Component | Signal |
|---|---|---|
| gate | `format_gate.py` | Final answer wrapped in `<answer>…</answer>` tags |
| 0.75 | `correctness.py` | Exact-match (with normalisation) against gold |
| 0.20 | `token_budget.py` | Penalty proportional to total tool-call tokens (Mercor sub-prize alignment) |
| 0.05 | `recursion_efficiency.py` | Reward depth used / depth required ratio |
compose_reward_fn in rewards/compose.py stitches them into the TRL reward callback contract and logs each scalar separately to W&B so we can see which component drives policy updates.
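The composition can be sketched as a gate multiplying a weighted sum. Only the weights and the gate-as-multiplier come from the writeup; the exact form of each component below (linear token penalty, a hypothetical 4096-token budget, the efficiency ratio) is an assumption, and `compose_reward` is an illustrative name, not the repo's `compose_reward_fn`.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
WEIGHTS = {"correctness": 0.75, "token_budget": 0.20, "recursion_efficiency": 0.05}


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalisation)."""
    return " ".join(text.lower().split())


def compose_reward(completion, gold, tool_tokens, depth_used, depth_required,
                   token_budget=4096):
    """Return (total, components) so each scalar can be logged separately."""
    match = ANSWER_RE.search(completion)
    gate = 1.0 if match else 0.0
    answer = normalise(match.group(1)) if match else ""
    if depth_used <= depth_required:
        efficiency = 1.0
    else:
        # Assumed form: reward shrinks as the agent recurses deeper than needed.
        efficiency = depth_required / depth_used
    components = {
        "correctness": 1.0 if answer == normalise(gold) else 0.0,
        # Assumed form: 1.0 at zero tool tokens, falling linearly to 0.0 at the budget.
        "token_budget": max(0.0, 1.0 - tool_tokens / token_budget),
        "recursion_efficiency": efficiency,
    }
    weighted = sum(WEIGHTS[name] * value for name, value in components.items())
    return gate * weighted, {"format_gate": gate, **components}
```

Returning the per-component dictionary alongside the total is what makes it possible to log each scalar on its own and see which one moves during training.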
We ran adversarial probes BEFORE training. Each is documented in REWARD_AUDIT.md with the exact attack, the symptom it would produce on the reward curve, and the test that catches it. pytest -m reward_audit re-runs them on every change.
| # | Attack | Mitigation |
|---|---|---|
| 1 | Masked-context (model copies the question as the answer) | gold-set normalisation rejects substring-of-question answers |
| 2 | Format-only (return `<answer>X</answer>` for any X) | format gate is a multiplier, not an additive bonus |
| 3 | Length gaming (verbose REPL outputs to game token budget) | budget is positive penalty (lower=better), not capped reward |
| 4 | Recursion spam (deeper recursion = higher reward) | recursion eff is per-depth ratio, capped at depth=2 |
| 5 | Copy-pasted gold from a leaked prompt | held-out eval split never appears in train env |
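As an illustration of what one of these probes can look like, the sketch below replays attack 2 against two toy reward functions. Both functions, the 0.1 bonus, and the test name are hypothetical and reduced to format plus correctness; the repo's actual audit tests are not reproduced here.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalise(text: str) -> str:
    return " ".join(text.lower().split())


def is_formatted(completion: str) -> bool:
    return ANSWER_RE.search(completion) is not None


def is_correct(completion: str, gold: str) -> bool:
    match = ANSWER_RE.search(completion)
    return bool(match) and normalise(match.group(1)) == normalise(gold)


def reward_additive(completion: str, gold: str, format_bonus: float = 0.1) -> float:
    """Format earns a bonus on its own, so tags alone are worth something."""
    return (format_bonus * is_formatted(completion)
            + (1 - format_bonus) * is_correct(completion, gold))


def reward_gated(completion: str, gold: str) -> float:
    """Format is a multiplier, so tags alone are worth nothing."""
    return float(is_formatted(completion)) * float(is_correct(completion, gold))


def test_format_only_attack():
    attack = "<answer>X</answer>"  # well-formed, wrong answer
    assert reward_additive(attack, gold="Paris") > 0.0  # the leak
    assert reward_gated(attack, gold="Paris") == 0.0    # leak closed by the gate
```

The trade-off is that the gated form gives no partial credit at all for getting the format right.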
We use TRL 1.2 GRPOTrainer with Unsloth-patched 4-bit Qwen 2.5 Coder 1.5B + LoRA r=16. Single A100 / A10G, vLLM colocate mode (server mode breaks multi-turn OpenEnv per TRL #4543), 8 generations per step, β=0.04 (KL floor preventing collapse, per EDGE-GRPO §3.2).
```yaml
num_generations: 8
beta: 0.04
learning_rate: 5.0e-6
max_grad_norm: 0.5
bf16: true
max_prompt_length: 4096
max_completion_length: 2048
optim: adamw_8bit
max_steps: 400
vllm_mode: colocate
vllm_gpu_memory_utilization: 0.45
```
A 30-step SFT warm-start on Claude-generated traces (data/sft_traces.jsonl) gives the model the basic repl() + llm() schema before GRPO starts shaping the long-tail.
(SFT phase showed strong convergence, 3.20 → 0.29)
(GRPO phase stayed flat due to multiplicative format gate)
The SFT phase successfully taught the model the format and answer style, achieving a 91% reduction in loss and a 2× improvement in token accuracy. However, the GRPO phase did not converge in our initial hackathon budget. The flat reward curve exposed a crucial reward-design lesson: a multiplicative format gate without a soft-format prior causes GRPO to collapse when the policy strays even slightly from the templated output.
We learned that building the environment, the verifier, and the RL pipeline was only half the battle — shaping the reward signal to be smooth and continuous is just as critical. Our repo includes the code fixes (prompt alignment and an additive format bonus) for the next run.
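One plausible mechanism behind the flat curve follows from how GRPO scores a group. Each completion's advantage is its reward minus the group mean, divided by the group standard deviation, so a group in which every completion fails the gate has identical rewards and contributes no learning signal. The sketch below shows that arithmetic. The 0.1 additive bonus is a hypothetical value, and the choice of population standard deviation and `eps` is an assumption rather than TRL's exact estimator.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# 8 generations per step, as in the training config.
# Case 1: multiplicative gate, every completion strays from the template.
gated_rewards = [0.0] * 8
gated_adv = group_advantages(gated_rewards)

# Case 2: hypothetical additive format bonus of 0.1; three completions are
# well-formed but still wrong, so they earn the bonus and nothing else.
bonus_rewards = [0.1, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0]
bonus_adv = group_advantages(bonus_rewards)
```

In the first case every advantage is zero, so the policy gradient from that group vanishes. In the second, the well-formed completions get positive advantages and the rest negative, which gives the policy something to climb even before any answer is correct.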
The OpenEnv environment and SFT components proved sound. Although our GRPO run was flat, the SFT-trained adapter already improves on the base model's zero-shot performance. We expect a fully trained GRPO model with our v2 reward design to push the accuracy-versus-tokens Pareto frontier further.
Reward curve plot: `outputs/plots/reward_curve.png`
The full A100 + 1.5B + 400-step run is the same command but with `--flavor=a100-large` on `hf jobs`. It cost us about $20 of HF credits.
If you build something on top of FATHOM, ping us — we'd love to see what you teach a small model to do.