FATHOM: Teaching a 1.5B model to read documents bigger than its context window with RL

A writeup for the Meta × PyTorch × Hugging Face OpenEnv Hackathon Grand Finale (Bangalore, April 25–26, 2026).

TL;DR — We built an OpenEnv environment that teaches Qwen 2.5 Coder 1.5B how to use a Recursive Language Model scaffold via GRPO. The trained model answers questions about 200K-token documents using only its 4K native context, and the entire pipeline reproduces in a Colab notebook.

1. The context-window problem

Context windows keep growing (1M-token Gemini, 200K-token Claude), but small open-weights models still cap out at 4K–32K tokens. For laptop / edge deployments, the only economically viable path through a 200K-token document is decomposition: slice the doc, run cheap operations to find the relevant span, and call the LLM only on the small slice that matters.

Recursive Language Models (RLMs) formalise this: give an LLM a Python REPL plus the ability to call itself on a sub-problem. The 1.5B model becomes a small executive that orchestrates its own scratchpad. The catch is that base models are bad at this: they burn tokens grepping the whole doc, recurse needlessly, or skip the tools and hallucinate.

To our knowledge, FATHOM is the first openly published OpenEnv RL environment that teaches a small model the discipline of recursive-LM use, and we ran GRPO training against it end to end.

2. The environment (`Pratham-math/fathom-env`)

Live HF Space: https://huggingface.co/spaces/Pratham-math/fathom-env

Endpoint: https://Pratham-math-fathom-env.hf.space

The env exposes the standard OpenEnv surface (/reset, /step, /healthz) and gives the agent two tool primitives:

| Primitive | Implementation | What it does |
|---|---|---|
| `repl(code: str)` | RestrictedPython AST filter + subprocess sandbox (network-off, ulimit'd, ephemeral cwd) | Slice / grep the document, build intermediate notes |
| `llm(prompt: str, slice: str)` | Internal recursive sub-call into the same model | Delegate a sub-question on a focused slice |

```python
# Pseudocode of a typical 2-turn rollout the model learns to produce
chunk = repl("doc.split('\n\n')[42:48]")      # cheap grep
ans = llm("Who is the protagonist?", chunk)   # focused recursion
print(ans)                                    # final answer
```
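To make the "AST filter" half of `repl` concrete, here is a rough standard-library sketch of the idea. It is not RestrictedPython and not the env's actual policy: the banned-node list, the banned-name set, and the `is_allowed` helper are all assumptions chosen for illustration, and a real sandbox would still rely on the subprocess isolation described above.

```python
import ast

# Hypothetical policy: constructs this illustrative filter refuses outright.
BANNED_NODES = (ast.Import, ast.ImportFrom, ast.Global, ast.Nonlocal)
BANNED_NAMES = {"eval", "exec", "open", "compile", "__import__", "getattr"}


def is_allowed(code: str) -> bool:
    """Return True if `code` parses and uses none of the banned constructs."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, BANNED_NODES):
            return False
        if isinstance(node, ast.Name) and node.id in BANNED_NAMES:
            return False
        # Block dunder attribute access such as obj.__class__.
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return False
    return True


print(is_allowed("doc.split('\\n\\n')[42:48]"))  # plain slicing of the document
print(is_allowed("import os"))                   # import statement
print(is_allowed("().__class__.__bases__"))      # dunder attribute walk
```

A static check like this only narrows the attack surface; it is the process-level limits (no network, ulimits, ephemeral cwd) that bound what accepted code can actually do.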

Recursion depth is capped at 2 during training (deeper depths only at demo time). All long documents (~200K tokens of synthetic narrative + QA chains) live on the env; the agent only ever sees small slices.
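The depth cap can be pictured as a counter threaded through each recursive call. The sketch below is a toy under stated assumptions: `make_llm`, `toy_model`, and `RecursionBudgetExceeded` are hypothetical names, the first `llm` call is counted as depth 1, and `model_fn` stands in for the policy rather than reproducing the env's implementation.

```python
from typing import Callable

MAX_DEPTH = 2  # training-time cap stated above


class RecursionBudgetExceeded(RuntimeError):
    """Raised when a sub-call would go past the depth cap."""


def make_llm(model_fn: Callable[[str, str, Callable], str], max_depth: int = MAX_DEPTH):
    """Build an llm(prompt, slice) tool that tracks its own recursion depth.

    model_fn(prompt, slice, llm) is a hypothetical stand-in for the policy:
    it may answer directly or call `llm` again on a smaller slice.
    """
    def llm_at(depth: int):
        def llm(prompt: str, slice_: str) -> str:
            if depth > max_depth:
                raise RecursionBudgetExceeded(f"depth {depth} > cap {max_depth}")
            return model_fn(prompt, slice_, llm_at(depth + 1))
        return llm
    return llm_at(1)


def toy_model(prompt: str, slice_: str, llm) -> str:
    """Toy policy: narrow to the relevant paragraph, then read off the last word."""
    paragraphs = slice_.split("\n\n")
    if len(paragraphs) > 1:
        hits = [p for p in paragraphs if "protagonist" in p]
        return llm(prompt, hits[0]) if hits else "unknown"
    return slice_.rsplit(" ", 1)[-1].rstrip(".")


doc = "It rained.\n\nThe protagonist is Mira.\n\nThe end."
print(make_llm(toy_model)("Who is the protagonist?", doc))  # prints "Mira"
```

A policy that recurses unconditionally hits the cap on its third nested call, which is the behaviour the training-time limit is meant to enforce.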

3. The reward (deterministic, composable, audited)

No LLM-as-judge. Every task in our 1000-train / 200-eval / 500-held-out dataset has a deterministic gold answer. The reward is built from four grep-verifiable components:

| Weight | Component | Signal |
|---|---|---|
| gate | `format_gate.py` | Final answer wrapped in `<answer>…</answer>` tags |
| 0.75 | `correctness.py` | Exact-match (with normalisation) against gold |
| 0.20 | `token_budget.py` | Penalty proportional to total tool-call tokens (Mercor sub-prize alignment) |
| 0.05 | `recursion_efficiency.py` | Reward depth used / depth required ratio |

`compose_reward_fn` in `rewards/compose.py` stitches them into the TRL reward callback contract and logs each scalar separately to W&B so we can see which component drives policy updates.
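As a reading aid, the weights and the gate above can be combined as in the sketch below. This is an assumption-laden illustration, not the contents of `rewards/compose.py`: the `Components` dataclass, the `compose` function, and the convention that every component is pre-scaled to [0, 1] (with 1.0 meaning "best") are all hypothetical.

```python
from dataclasses import dataclass

# Weights taken from the table above.
WEIGHTS = {"correctness": 0.75, "token_budget": 0.20, "recursion_efficiency": 0.05}


@dataclass
class Components:
    format_ok: bool              # gate: answer wrapped in <answer>…</answer>
    correctness: float           # 1.0 on normalised exact match, else 0.0
    token_budget: float          # assumed scaled to [0, 1]; 1.0 = no tool tokens spent
    recursion_efficiency: float  # assumed scaled to [0, 1]


def compose(c: Components) -> dict:
    """Return each weighted scalar plus the gated total, ready for logging."""
    weighted = {name: WEIGHTS[name] * getattr(c, name) for name in WEIGHTS}
    gate = 1.0 if c.format_ok else 0.0
    total = gate * sum(weighted.values())  # multiplicative gate
    return {"gate": gate, **weighted, "total": total}


print(compose(Components(True, 1.0, 0.5, 1.0)))
print(compose(Components(False, 1.0, 0.5, 1.0)))
```

Returning the per-component scalars alongside the total mirrors the separate logging described above: when the total stops moving, the individual terms show which one is responsible.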

Anti-reward-hacking: 5 attacks, 5 mitigations

We ran adversarial probes BEFORE training. Each is documented in `REWARD_AUDIT.md` with the exact attack, the symptom it would produce on the reward curve, and the test that catches it. `pytest -m reward_audit` re-runs them on every change.

| # | Attack | Mitigation |
|---|---|---|
| 1 | Masked-context (model copies the question as the answer) | gold-set normalisation rejects substring-of-question answers |
| 2 | Format-only (return `<answer>X</answer>` for any X) | format gate is a multiplier, not an additive bonus |
| 3 | Length gaming (verbose REPL outputs to game token budget) | budget is positive penalty (lower=better), not capped reward |
| 4 | Recursion spam (deeper recursion = higher reward) | recursion eff is per-depth ratio, capped at depth=2 |
| 5 | Copy-pasted gold from a leaked prompt | held-out eval split never appears in train env |
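Two of these probes (attacks 1 and 2) are easy to picture as plain assertions. The sketch below is hypothetical: `normalise` and `score` are illustrative helpers, the normalisation rules are assumed, and "rejects substring-of-question answers" is implemented by one literal reading (an answer that is a substring of the normalised question scores zero). The repo's real tests may differ.

```python
import re
import string

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalise(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace (assumed rules)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def score(completion: str, question: str, gold: str) -> float:
    """Gated exact match with a guard against echoing the question."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0  # format gate closed
    answer = normalise(match.group(1))
    if not answer or answer in normalise(question):
        return 0.0  # attack 1: answer copied out of the question
    return 1.0 if answer == normalise(gold) else 0.0


question, gold = "Who is the protagonist?", "Mira"

# Attack 1: echo the question back as the answer.
assert score(f"<answer>{question}</answer>", question, gold) == 0.0
# Attack 2: well-formed tags around an arbitrary string.
assert score("<answer>X</answer>", question, gold) == 0.0
# Correct content without the tags is also rejected by the gate.
assert score("Mira", question, gold) == 0.0
# Sanity check: a genuinely correct, well-formed answer still scores.
assert score("<answer> Mira. </answer>", question, gold) == 1.0
```

Keeping a positive control next to each negative probe guards against a "mitigation" that simply zeroes every reward.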

4. The training: GRPO via TRL + Unsloth

We use TRL 1.2 `GRPOTrainer` with an Unsloth-patched 4-bit Qwen 2.5 Coder 1.5B and LoRA (r=16). Training runs on a single A100 or A10G with vLLM in colocate mode (server mode breaks multi-turn OpenEnv per TRL #4543), 8 generations per step, and β=0.04 (a KL floor that prevents collapse, per EDGE-GRPO §3.2).

```yaml
# configs/train/grpo.yaml: the actual training contract
num_generations: 8
beta: 0.04
learning_rate: 5.0e-6
max_grad_norm: 0.5
bf16: true
max_prompt_length: 4096
max_completion_length: 2048
optim: adamw_8bit
max_steps: 400
vllm_mode: colocate
vllm_gpu_memory_utilization: 0.45
```

A 30-step SFT warm-start on Claude-generated traces (data/sft_traces.jsonl) gives the model the basic repl() + llm() schema before GRPO starts shaping the long-tail.

Figure: SFT loss. The SFT phase converged strongly (3.20 → 0.29).

Figure: Reward curve. The GRPO phase stayed flat due to the multiplicative format gate.

The SFT phase successfully taught the model the format and answer style, achieving a 91% reduction in loss and a 2× improvement in token accuracy. However, the GRPO phase did not converge in our initial hackathon budget. The flat reward curve exposed a crucial reward-design lesson: a multiplicative format gate without a soft-format prior causes GRPO to collapse when the policy strays even slightly from the templated output.

We learned that building the environment, the verifier, and the RL pipeline was only half the battle — shaping the reward signal to be smooth and continuous is just as critical. Our repo includes the code fixes (prompt alignment and an additive format bonus) for the next run.
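One plausible mechanism behind the flat curve is easy to reproduce in isolation. GRPO scores each completion relative to the other completions in its group, so a group whose rewards are all identical yields zero advantage and therefore no learning signal. The sketch below is a simplification under assumptions: `group_advantages` uses population standard deviation (implementations differ on the exact normalisation), and the `bonus=0.1` in `additive` is a hypothetical value, not the one in the repo.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """GRPO-style advantage: standardise each reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def multiplicative(format_ok: bool, quality: float) -> float:
    """Format gate as a multiplier: no tags or no quality means zero reward."""
    return (1.0 if format_ok else 0.0) * quality


def additive(format_ok: bool, quality: float, bonus: float = 0.1) -> float:
    """Format as an additive bonus (hypothetical size) on top of quality."""
    return (bonus if format_ok else 0.0) + (1.0 - bonus) * quality


# A group of 8 early rollouts: two follow the template, none is correct yet.
group = [(True, 0.0), (True, 0.0)] + [(False, 0.0)] * 6

mult_adv = group_advantages([multiplicative(f, q) for f, q in group])
add_adv = group_advantages([additive(f, q) for f, q in group])

print(mult_adv)  # every entry is 0.0: on-template rollouts are not favoured
print(add_adv)   # first two entries positive, the rest negative
```

Under the multiplicative gate the two on-template rollouts are indistinguishable from the rest, so the policy gets no push toward the format. The additive bonus gives them a positive advantage even before any answer is correct.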

5. Results

The OpenEnv environment and the SFT stage both worked as intended. Although the GRPO run stayed flat, the SFT-trained adapter already improves on the base model's zero-shot performance. We expect, but have not yet shown, that a fully trained GRPO model with our v2 reward design will push out the accuracy-vs-tokens Pareto frontier.

6. Reproduce it in 5 minutes

1. Open the Colab notebook (link in README).
2. Run cells 1–5 to verify env health and run a smoke test against the live HF Space.
3. (Optional, A100 needed) Run cell 6 to launch a short GRPO sanity run on Qwen 0.5B.
4. Cell 7 writes the reward curve to outputs/plots/reward_curve.png.
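For readers who want to check the Space outside the notebook, a health probe can be as small as the sketch below. It assumes only that `/healthz` answers HTTP 200 when the env is up; the response body is not inspected because its schema is not described here, and `endpoint` and `is_healthy` are hypothetical helper names rather than notebook code.

```python
import urllib.request

BASE_URL = "https://Pratham-math-fathom-env.hf.space"  # endpoint listed earlier


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and a route such as /healthz without doubling slashes."""
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}"


def is_healthy(base_url: str = BASE_URL, timeout: float = 5.0) -> bool:
    """True if GET /healthz returns HTTP 200; False on any network or HTTP error."""
    try:
        with urllib.request.urlopen(endpoint("/healthz", base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False
```

Calling `is_healthy()` before a longer run is a cheap way to tell a sleeping Space apart from a bug in the training loop.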

The full A100 + 1.5B + 400-step run is the same command with `--flavor=a100-large` on `hf jobs`. It cost us about $20 in HF credits.

7. What's next

- Deeper recursion at inference: depth-3 / depth-4 with curriculum learning during GRPO
- Multi-doc chains: 5 documents × 40K each as one task, requiring cross-doc reasoning
- The Mercor sub-prize, token-aware shaping: α-parameterised budget head, sweep α and publish the cost-vs-accuracy frontier