<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://iluvatarlabs.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://iluvatarlabs.github.io/" rel="alternate" type="text/html" /><updated>2026-05-08T18:47:24+00:00</updated><id>https://iluvatarlabs.github.io/feed.xml</id><title type="html">Iluvatar Labs</title><subtitle>Reality-inspired AI architectures for the AGI era</subtitle><author><name>Iluvatar Labs</name><email>ben@iluvatarlabs.com</email></author><entry><title type="html">Introducing Actuator</title><link href="https://iluvatarlabs.github.io/blog/2026/04/introducing-actuator/" rel="alternate" type="text/html" title="Introducing Actuator" /><published>2026-04-01T00:00:00+00:00</published><updated>2026-04-01T00:00:00+00:00</updated><id>https://iluvatarlabs.github.io/blog/2026/04/introducing-actuator</id><content type="html" xml:base="https://iluvatarlabs.github.io/blog/2026/04/introducing-actuator/"><![CDATA[<p>Post-training has become the <strong>primary differentiation</strong> lever for AI labs in 2026.</p>

<p>For smaller labs, it’s an existential one. If you cannot differentiate, you get steamrolled by the frontier labs.<sup id="fnref:altman" role="doc-noteref"><a href="#fn:altman" class="footnote" rel="footnote">1</a></sup> For bigger labs, post-training is just as essential from a practical standpoint. It’s how models get adapted across product lines and deployment targets, and within cost constraints. And for companies in regulated industries, such as medical and legal applications, it is often a requirement, not a choice.</p>

<h2 id="the-problem-with-open-loop">The problem with open loop</h2>

<p>Despite the growing importance of model transformation, today’s post-training stack is still a fragmented, open-loop affair. Teams stitch together a patchwork of tools, launch runs with limited visibility into what is happening in flight, and only later discover the full cost of a bad tradeoff. That might mean degraded baseline capabilities, a failed compression pass, alignment that came at too high a price, or just another wasted cycle of compute and engineering time.</p>

<p>This is not only a startup problem. The same guess-and-check dynamic applies all the way up the stack. For smaller companies, it can mean losing the narrow window they had to differentiate, or discovering they cannot afford to differentiate at all. For bigger ones, it means slower iteration, higher costs, and more friction in getting models into the forms real products and deployments actually require.</p>

<h2 id="closing-the-loop">Closing the loop</h2>

<p>Actuator is a patent-pending closed-loop control layer for model transformation. It replaces the manual, open-loop process with continuous live monitoring, automatic training-time adjustments, and guardrails to keep your model transformations on track. When capability starts to drift, Actuator kicks in at training time to keep your model’s output from degrading, rather than letting the damage be discovered post-hoc. Quality in, quality out.</p>

<p>And not only does Actuator optimize your model transformation, it also makes post-training <strong>easy</strong>. It drops right into your existing stack and provides the unified end-to-end software layer you need to ship better models while skipping the pain.</p>

<h2 id="for-every-post-training-task">For every post-training task</h2>

<p>Actuator’s plug-and-play design means it can be used across varied applications, from distillation (better draft models) and compression (smarter, smaller models) to reinforcement learning (learning preferences without losing capabilities to the alignment tax). We’ve benchmarked Actuator on these tasks, and it preserved desired properties better than standard methods alone. Additional details are available on the <a href="/actuator/">Actuator</a> page.</p>

<p>Actuator is now in closed beta. If your team is running serious post-training and wants a better way to do it, please reach out! We’re excited to hear about what your team is working on and open to potential pilots or partnerships.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:altman" role="doc-endnote">
      <p>Sam Altman on <code class="language-plaintext highlighter-rouge">20VC</code> discussing how startups can get “steamrolled” as frontier models improve: <a href="https://lilys.ai/en/notes/374015">summary/transcript</a>. <a href="#fnref:altman" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Iluvatar Labs</name></author><category term="Product" /><summary type="html"><![CDATA[Actuator is a patent-pending closed-loop control layer for model transformation with live training-time monitoring and in-flight adjustment.]]></summary></entry><entry><title type="html">Divergent-Convergent Attention</title><link href="https://iluvatarlabs.github.io/blog/2026/03/divergent-convergent-attention/" rel="alternate" type="text/html" title="Divergent-Convergent Attention" /><published>2026-03-26T00:00:00+00:00</published><updated>2026-03-26T00:00:00+00:00</updated><id>https://iluvatarlabs.github.io/blog/2026/03/divergent-convergent-attention</id><content type="html" xml:base="https://iluvatarlabs.github.io/blog/2026/03/divergent-convergent-attention/"><![CDATA[<blockquote>
  <p><strong>The TL;DR:</strong> Divergent-Convergent Attention (DCA) improves compositional reasoning by maintaining multiple parallel attention perspectives before periodic learned consensus. On HotpotQA<sup id="fnref:hotpotqa" role="doc-noteref"><a href="#fn:hotpotqa" class="footnote" rel="footnote">1</a></sup>, DCA achieves <strong>5.4x higher exact match</strong> than a parameter-matched 90M baseline, and a <strong>215M DCA model outperforms a 355M standard transformer by 1.54x</strong> with fewer parameters and lower memory.</p>

  <p>Most notably, DCA assigns higher probability to the correct answer tokens on <strong>97.8% of examples</strong>, with the advantage sharply <strong>correlated with question difficulty</strong>, suggesting that DCA’s magic is in how distributed evidence is internally composed before decoding.</p>

  <p>DCA helps when relevant content is scattered across structurally independent documents. It does not help on sequential reasoning or single-source retrieval tasks where every perspective sees the same chain or location.</p>

  <p>(Note: This blog post reflects the latest manuscript version of this work.)</p>
</blockquote>

<h2 id="introduction">Introduction</h2>

<p>Standard transformers process multi-document input through a single attention stream, fusing heterogeneous evidence into one representation at every layer. RAG pipelines, long-context windows, and tasks like legal analysis or medical synthesis all require integrating information from structurally independent sources. A single stream must compromise between local precision and global reach at every layer. The result is premature fusion, where multi-document evidence is collapsed before the model can develop complementary views.</p>

<p>We introduce Divergent-Convergent Attention (DCA), a transformer variant that maintains K parallel attention streams at different scales and reconciles them only at scheduled consensus points. The novelty is not merely independent lanes or a late merge, but that those lanes are explicitly multi-horizon: short, medium, and long timescales that cultivate complementary perspectives before reconciliation.</p>

<p>DCA is inspired by an organizational principle in neuroscience: the brain concurrently maintains multiple oscillatory bands that only periodically couple to coordinate information<sup id="fnref:buzsaki" role="doc-noteref"><a href="#fn:buzsaki" class="footnote" rel="footnote">2</a></sup><sup id="fnref:canolty" role="doc-noteref"><a href="#fn:canolty" class="footnote" rel="footnote">3</a></sup>. Gamma (30-100 Hz) supports fast, local feature binding, analogous to our short horizon. Beta (13-30 Hz) integrates across nearby regions, our medium horizon. Theta (4-8 Hz) supports global synchronization, our long horizon<sup id="fnref:lisman" role="doc-noteref"><a href="#fn:lisman" class="footnote" rel="footnote">4</a></sup><sup id="fnref:colgin" role="doc-noteref"><a href="#fn:colgin" class="footnote" rel="footnote">5</a></sup>. DCA provides the computational analogue: separate processing streams that periodically synchronize via learned consensus.</p>

<p><img src="/assets/images/divergent-convergent-attention/neural_oscillation_dcr_analogy.svg" alt="Figure 1a: Neural oscillation analogy" /></p>

<blockquote>
  <p><strong>Figure 1a</strong> Biological multi-scale oscillations. Gamma, beta, and theta bands process at different scales and periodically couple to coordinate information. DCA maps these to three attention horizons.</p>
</blockquote>

<p>In controlled experiments, DCA achieves 5.4x higher exact match on multi-hop QA at 90M parameters (p &lt; 10^-6, 3 seeds). At 215M, DCA beats a 355M baseline by 1.54x with fewer parameters, approximately matched FLOPs, and less memory. We characterized the consensus mechanism through causal interventions at both scales. Despite the small capacity of these models, our force-decode analysis shows an unambiguous representational advantage in multi-document composition. DCA assigns higher probability to the correct answer tokens on 97.8% of all examples at 90M, and the advantage scales with difficulty, with 7.7x larger gains on the hardest examples.</p>

<h2 id="related-work">Related Work</h2>

<p>Multi-scale and sparse attention methods such as Longformer<sup id="fnref:longformer" role="doc-noteref"><a href="#fn:longformer" class="footnote" rel="footnote">6</a></sup>, BigBird<sup id="fnref:bigbird" role="doc-noteref"><a href="#fn:bigbird" class="footnote" rel="footnote">7</a></sup>, and RetNet<sup id="fnref:retnet" role="doc-noteref"><a href="#fn:retnet" class="footnote" rel="footnote">8</a></sup> combine local and global attention within a single stream, blending scales early or continuously. DCA maintains separate streams that develop independent representations before merging. Ring attention<sup id="fnref:ringattention" role="doc-noteref"><a href="#fn:ringattention" class="footnote" rel="footnote">9</a></sup> and flash attention<sup id="fnref:flashattention" role="doc-noteref"><a href="#fn:flashattention" class="footnote" rel="footnote">10</a></sup> address computational cost but not fusion timing; DCA is orthogonal and compatible with these methods.</p>

<p>Multi-path architectures provide useful precedents but differ in mechanism. ResNeXt<sup id="fnref:resnext" role="doc-noteref"><a href="#fn:resnext" class="footnote" rel="footnote">11</a></sup> established split-transform-merge for vision. Mixture of Experts<sup id="fnref:moe" role="doc-noteref"><a href="#fn:moe" class="footnote" rel="footnote">12</a></sup><sup id="fnref:switch" role="doc-noteref"><a href="#fn:switch" class="footnote" rel="footnote">13</a></sup> increases capacity through sparse routing. DCA differs in that all perspectives are always active and differentiated by attention scale rather than learned routing. The gated consensus mechanism uses Highway Network-style residual connections<sup id="fnref:highway" role="doc-noteref"><a href="#fn:highway" class="footnote" rel="footnote">14</a></sup> with periodic synchronization analogous to federated averaging<sup id="fnref:fedavg" role="doc-noteref"><a href="#fn:fedavg" class="footnote" rel="footnote">15</a></sup>.</p>

<p>HotpotQA requires composing information across two Wikipedia paragraphs among eight distractors. Encoder models at 110M-355M achieve substantially higher scores with bidirectional attention, while decoder-only models generally require 7B+ to reach around 30% EM<sup id="fnref:bert" role="doc-noteref"><a href="#fn:bert" class="footnote" rel="footnote">16</a></sup><sup id="fnref:longformer:1" role="doc-noteref"><a href="#fn:longformer" class="footnote" rel="footnote">6</a></sup><sup id="fnref:bigbird:1" role="doc-noteref"><a href="#fn:bigbird" class="footnote" rel="footnote">7</a></sup><sup id="fnref:roberta" role="doc-noteref"><a href="#fn:roberta" class="footnote" rel="footnote">17</a></sup><sup id="fnref:fireact" role="doc-noteref"><a href="#fn:fireact" class="footnote" rel="footnote">18</a></sup>. To our knowledge, no published decoder-only HotpotQA results exist between 90M and 7B parameters.</p>

<h2 id="the-architecture">The Architecture</h2>

<p>DCA replaces each transformer block with K parallel attention streams (“perspectives”), each operating at a different window size. In our experiments, K=3 with horizons [32, 128, 0], where 0 denotes full causal attention. Each perspective has its own QKV projection weights. All perspectives share a single MLP, with all paths always active (closer to ResNeXt’s split-transform-merge<sup id="fnref:resnext:1" role="doc-noteref"><a href="#fn:resnext" class="footnote" rel="footnote">11</a></sup> than to Mixture of Experts’ selective routing<sup id="fnref:moe:1" role="doc-noteref"><a href="#fn:moe" class="footnote" rel="footnote">12</a></sup>, and analogous to cross-scale pooling in multi-scale vision transformers<sup id="fnref:mvit" role="doc-noteref"><a href="#fn:mvit" class="footnote" rel="footnote">19</a></sup>). Every N layers, the perspectives merge via a Highway Network-style gate<sup id="fnref:highway:1" role="doc-noteref"><a href="#fn:highway" class="footnote" rel="footnote">14</a></sup>, a periodic synchronization analogous to federated averaging<sup id="fnref:fedavg:1" role="doc-noteref"><a href="#fn:fedavg" class="footnote" rel="footnote">15</a></sup>. This gate is content-dependent and learned, and the model discovers a depth-dependent strategy where early layers mostly pass through and late layers merge more fully, as shown later in the mechanistic analysis.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>consensus = mean(perspective_1, ..., perspective_K)
gate = sigmoid(W_g * RMSNorm(x))
output = (1 - gate) * x + gate * consensus
</code></pre></div></div>
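
<p>As a concrete illustration, here is a minimal PyTorch sketch of the consensus step above. The module name <code class="language-plaintext highlighter-rouge">HighwayConsensus</code> and the exact placement of the normalization are our choices for this sketch, not a prescribed implementation (it assumes a recent PyTorch with <code class="language-plaintext highlighter-rouge">nn.RMSNorm</code>; substitute <code class="language-plaintext highlighter-rouge">nn.LayerNorm</code> otherwise):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

class HighwayConsensus(nn.Module):
    """Gated merge of K perspective states back into the residual stream."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.RMSNorm(d_model)
        self.gate_proj = nn.Linear(d_model, d_model)

    def forward(self, x, perspectives):
        # x: [B, T, D] residual stream; perspectives: list of K [B, T, D] tensors
        consensus = torch.stack(perspectives, dim=0).mean(dim=0)
        gate = torch.sigmoid(self.gate_proj(self.norm(x)))
        # Highway-style blend: gate near 0 keeps the residual, near 1 takes full consensus
        return (1.0 - gate) * x + gate * consensus
</code></pre></div></div>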

<p>Note that while the implementation described here uses dense causal attention, DCA is a more general late-consensus primitive. The consensus mechanism operates on tensors, so any module that takes [B, T, D] and produces [B, T, D] can serve as a perspective. In this work, we use dense causal attention with different window sizes, but other sequence-processing modules (ring attention, linear attention, SSMs) could serve the same role.</p>
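
<p>To make the horizon notation concrete, here is a hedged sketch of the banded causal mask each perspective could use. The helper name and the boolean-mask convention are our assumptions; the long horizon simply drops the band constraint:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def windowed_causal_mask(seq_len, window, device=None):
    """Boolean [T, T] mask, True where attention is allowed.
    window=0 denotes full causal attention (the long horizon)."""
    i = torch.arange(seq_len, device=device).unsqueeze(1)  # query positions
    j = torch.arange(seq_len, device=device).unsqueeze(0)  # key positions
    causal = j &lt;= i
    if window == 0:
        return causal
    return causal &amp; (i - j &lt; window)

# one mask per perspective, matching the horizons [32, 128, 0] used in our runs
masks = [windowed_causal_mask(1024, w) for w in (32, 128, 0)]
</code></pre></div></div>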

<p><img src="/assets/images/divergent-convergent-attention/dcr_architecture_diagram.svg" alt="Figure 1b: DCA architecture" /></p>

<blockquote>
  <p><strong>Figure 1b</strong> DCA architecture. K=3 perspectives fork from the residual stream, process with separate attention and shared MLP, then merge via learned highway consensus. The cycle repeats every N layers.</p>
</blockquote>

<h3 id="design-tradeoffs">Design tradeoffs</h3>

<p>Full-fat DCA at baseline width (K=3 at d=1024) costs 3x VRAM and ~2.7x FLOPs. Bottleneck projections let perspectives operate at d_lane=512 inside d_model=1024. The key math is that 3 x 512^2 &lt; 1024^2, so K=3 perspectives at d=512 are cheaper per layer than a single stream at d=1024. This replaces the role that global tokens play in Longformer and BigBird<sup id="fnref:longformer:2" role="doc-noteref"><a href="#fn:longformer" class="footnote" rel="footnote">6</a></sup><sup id="fnref:bigbird:2" role="doc-noteref"><a href="#fn:bigbird" class="footnote" rel="footnote">7</a></sup> in a causal-compatible way; global tokens in causal decoders are functionally vacuous since position 0 can only attend to itself. Per-perspective gradient checkpointing reduces activation memory from ~3x baseline to below baseline levels. We scale by adding layers (30L) at the cheap d=512 perspective width rather than widening to d=1024.</p>
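
<p>The inequality is easy to sanity-check. A minimal sketch, counting only one projection weight matrix per stream and ignoring the fork/merge projections and the shared MLP:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># per-projection weight counts at lane width vs. baseline width
k, d_lane, d_model = 3, 512, 1024
lanes = k * d_lane ** 2       # 786,432 weights across three bottlenecked perspectives
single = d_model ** 2         # 1,048,576 weights for a single full-width stream
print(lanes, single, lanes / single)  # 786432 1048576 0.75
</code></pre></div></div>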

<blockquote>
  <p><strong>Table 1</strong> DCA design space. Theoretical and tested variants with FLOP and VRAM tradeoffs.</p>
</blockquote>

<table class="data-table">
  <thead>
    <tr>
      <th>Variant</th>
      <th>d_model</th>
      <th>d_lane</th>
      <th>Params</th>
      <th>FLOP ratio</th>
      <th>VRAM vs baseline</th>
      <th>MLP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Baseline</td>
      <td>1024</td>
      <td>–</td>
      <td>355M</td>
      <td>1.0x</td>
      <td>1.0x</td>
      <td>1 stream</td>
    </tr>
    <tr>
      <td>Full-fat (K=3)</td>
      <td>1024</td>
      <td>1024</td>
      <td>556M</td>
      <td>2.71x</td>
      <td>~3x</td>
      <td>shared</td>
    </tr>
    <tr>
      <td>DCA-215M</td>
      <td>1024</td>
      <td>512</td>
      <td>215M</td>
      <td>1.24x</td>
      <td>~0.8x</td>
      <td>shared</td>
    </tr>
    <tr>
      <td>DCA-215M + separate MLPs</td>
      <td>1024</td>
      <td>512</td>
      <td>341M</td>
      <td>1.24x</td>
      <td>~0.8x</td>
      <td>K weights</td>
    </tr>
  </tbody>
</table>

<h2 id="benchmarks">Benchmarks</h2>

<h3 id="hotpotqa-at-90m-wikitext-103">HotpotQA at 90M (WikiText-103)</h3>

<style>
.hotpotqa-figure {
  max-width: 720px;
  margin: 1.5rem auto;
  padding: 1.25rem;
  border: 1px solid #2a2a2a;
  border-radius: 10px;
  background: #141414;
  color: #e8e8e8;
}

.hotpotqa-figure * {
  box-sizing: border-box;
}

.hotpotqa-figure .stack {
  display: flex;
  flex-direction: column;
  gap: 4px;
}

.hotpotqa-figure .dist-row {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 4px;
}

.hotpotqa-figure .para {
  display: flex;
  border-radius: 5px;
  overflow: hidden;
}

.hotpotqa-figure .para-body {
  flex: 1;
  padding: 8px 14px;
  font-size: 12px;
  line-height: 1.5;
  border: 1px solid #333;
  border-right: none;
  border-radius: 5px 0 0 5px;
  background: #222;
  color: #aaa;
}

.hotpotqa-figure .para-body b {
  font-weight: 600;
  color: #e0e0e0;
}

.hotpotqa-figure .para.gold .para-body {
  border-color: #0f6e56;
  background: rgba(93, 202, 165, 0.04);
}

.hotpotqa-figure .para.gold .para-body .evidence {
  font-weight: 600;
  color: #5dcaa5;
}

.hotpotqa-figure .tag {
  width: 32px;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 8px;
  font-weight: 600;
  letter-spacing: 0.6px;
  text-transform: uppercase;
  writing-mode: vertical-rl;
  text-orientation: mixed;
  flex-shrink: 0;
  border-radius: 0 5px 5px 0;
}

.hotpotqa-figure .tag.dist {
  background: #333;
  color: #666;
}

.hotpotqa-figure .tag.gold-tag {
  background: #0f6e56;
  color: #d4f5e9;
}

.hotpotqa-figure .qa {
  margin-top: 16px;
  padding: 16px;
  border-radius: 8px;
  border: 1px solid #444;
}

.hotpotqa-figure .qa-label,
.hotpotqa-figure .qa-a-label {
  font-size: 14px;
  font-weight: 600;
}

.hotpotqa-figure .qa-q {
  font-size: 14px;
  font-weight: 400;
  display: inline;
  line-height: 1.4;
}

.hotpotqa-figure .qa-a-row {
  display: flex;
  align-items: baseline;
  gap: 8px;
  margin-top: 8px;
}

.hotpotqa-figure .qa-a {
  font-size: 14px;
  font-weight: 400;
  color: #5dcaa5;
}

.hotpotqa-figure .qa-a-detail {
  font-size: 13px;
  color: #888;
}

@media (max-width: 768px) {
  .hotpotqa-figure {
    padding: 0.9rem;
  }

  .hotpotqa-figure .dist-row {
    grid-template-columns: 1fr;
  }
}
</style>

<div class="hotpotqa-figure">
  <div class="stack">
    <div class="dist-row">
      <div class="para"><div class="para-body"><b>The Hurt Locker</b> - A 2008 war thriller about an Iraq War EOD team, directed by Kathryn Bigelow...</div><div class="tag dist">distract</div></div>
      <div class="para"><div class="para-body"><b>Kathryn Bigelow</b> - An American filmmaker known for directing horror, action, and thriller films...</div><div class="tag dist">distract</div></div>
    </div>

    <div class="para gold"><div class="para-body"><b>Zero Dark Thirty</b> - A 2012 action thriller directed by Kathryn Bigelow dramatizing the decade-long manhunt for Osama bin Laden. <span class="evidence">It received five Academy Award nominations</span>, including Best Picture and Best Actress.</div><div class="tag gold-tag">gold</div></div>

    <div class="dist-row">
      <div class="para"><div class="para-body"><b>Jessica Chastain</b> - An American actress and film producer, studied at the Juilliard School...</div><div class="tag dist">distract</div></div>
      <div class="para"><div class="para-body"><b>Mark Boal</b> - An American screenwriter and journalist. Best known for writing "The Hurt Locker"...</div><div class="tag dist">distract</div></div>
    </div>

    <div class="dist-row">
      <div class="para"><div class="para-body"><b>Argo (2012 film)</b> - A 2012 historical drama directed by Ben Affleck about the rescue of six U.S. diplomats...</div><div class="tag dist">distract</div></div>
      <div class="para"><div class="para-body"><b>Denis Villeneuve</b> - A Canadian filmmaker acclaimed for "Prisoners," "Sicario," and "Dune"...</div><div class="tag dist">distract</div></div>
    </div>

    <div class="para gold"><div class="para-body"><b>Arrival (film)</b> - A 2016 science fiction drama directed by Denis Villeneuve, adapted from Ted Chiang's "Story of Your Life." <span class="evidence">It received eight Academy Award nominations</span>, including Best Picture and Best Director, winning Best Sound Editing.</div><div class="tag gold-tag">gold</div></div>

    <div class="dist-row">
      <div class="para"><div class="para-body"><b>Ted Chiang</b> - An American science fiction writer whose work has won four Nebula and four Hugo Awards...</div><div class="tag dist">distract</div></div>
      <div class="para"><div class="para-body"><b>Eric Heisserer</b> - An American screenwriter who adapted "Story of Your Life" into "Arrival"...</div><div class="tag dist">distract</div></div>
    </div>
  </div>

  <div class="qa">
    <div><span class="qa-label">Question (requires both gold paragraphs): <br /></span><span class="qa-q">Which film received more Academy Award nominations, Zero Dark Thirty or Arrival?</span></div>
    <div class="qa-a-row">
      <span class="qa-a-label">Answer:</span>
      <span class="qa-a">Arrival</span>
      <span class="qa-a-detail">(8 nominations vs 5)</span>
    </div>
  </div>
</div>

<blockquote>
  <p><strong>Figure 2</strong> HotpotQA distractor setting. 10 paragraphs per question: 2 supporting (gold), 8 distractors (gray). The answer requires composing information from both gold paragraphs scattered among topically similar distractors.</p>
</blockquote>

<p>We pretrained DCA (89M params) and a parameter-matched baseline (90M params) on WikiText-103<sup id="fnref:wikitext" role="doc-noteref"><a href="#fn:wikitext" class="footnote" rel="footnote">20</a></sup> for 50K steps, then finetuned both on HotpotQA across three seeds. Though DCA is modestly worse on WikiText-103 validation perplexity (21.48 vs 20.79, ~3%), the benefit on long reasoning is asymmetric. DCA achieves 5.4x higher exact match on HotpotQA (1.56% vs 0.29%, Table 2), with p &lt; 10^-6 and odds ratio 5.49 (Fisher exact, pooled across seeds). DCA outperformed every baseline variant we tested, across both 50K and 30K pretrain budgets (Appendix H).</p>

<h3 id="scaling-to-pg-19-and-architectural-exploration">Scaling to PG-19 and architectural exploration</h3>

<p>The 90M result raises a natural question: does the advantage hold at larger scale? While the relative advantage is clear, absolute performance of both models is low (1.56% and 0.29% EM). WT103 is too small for 350M-class models, so we switched to PG-19<sup id="fnref:pg19" role="doc-noteref"><a href="#fn:pg19" class="footnote" rel="footnote">21</a></sup> (3B tokens) following standard conventions.</p>

<p>To calibrate the effect of pretraining domain, we also trained DCA 90M on PG-19 (EM=0.38%, compared to 1.56% on WT103). The 350M standard baseline on PG-19 achieves 0.93% EM, indicating that even at 4x the parameters, standard decoders remain poor at multi-hop QA. A FLOP-comparable DCA-215M on PG-19 achieves 1.43% EM vs the baseline’s 0.93% (1.54x), with 39% fewer parameters and less VRAM (~35 vs ~45 GB). Within PG-19, scaling DCA from 90M to 215M improves EM from 0.38% to 1.43%, surpassing the 350M baseline by 1.54x.</p>

<blockquote>
  <p><strong>Table 2</strong> HotpotQA results across scales and pretraining domains. All models finetuned and evaluated on HotpotQA.</p>
</blockquote>

<p><strong>WT103 pretraining (90M):</strong></p>

<table class="data-table">
  <thead>
    <tr>
      <th>Model</th>
      <th>Params</th>
      <th>FLOP ratio</th>
      <th>VRAM</th>
      <th>EM%</th>
      <th>F1%</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Baseline 90M</td>
      <td>90M</td>
      <td>~1.0x</td>
      <td>~4 GB</td>
      <td>0.29</td>
      <td>7.77</td>
    </tr>
    <tr>
      <td>DCA 90M</td>
      <td>89M</td>
      <td>~1.56x</td>
      <td>~8 GB</td>
      <td>1.56</td>
      <td>14.40</td>
    </tr>
  </tbody>
</table>

<p><strong>PG-19 pretraining (up to 350M):</strong></p>

<table class="data-table">
  <thead>
    <tr>
      <th>Model</th>
      <th>Params</th>
      <th>FLOP ratio</th>
      <th>VRAM</th>
      <th>EM%</th>
      <th>F1%</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DCA 90M</td>
      <td>89M</td>
      <td>~1.56x</td>
      <td>~8 GB</td>
      <td>0.38</td>
      <td>7.78</td>
    </tr>
    <tr>
      <td>Baseline 350M</td>
      <td>355M</td>
      <td>1.0x</td>
      <td>~45 GB</td>
      <td>0.93</td>
      <td>10.81</td>
    </tr>
    <tr>
      <td>DCA-215M</td>
      <td>215M</td>
      <td>1.24x</td>
      <td>~35 GB</td>
      <td>1.43</td>
      <td>11.32</td>
    </tr>
  </tbody>
</table>

<h3 id="architectural-exploration">Architectural exploration</h3>

<p>Scaling provided an opportunity to test which components of DCA are essential. Our 90M baseline uses per-head window assignment (heads 0-2 at w=32, heads 3-5 at w=128, heads 6-7 full causal), achieving the best perplexity among baseline variants (20.79) but only 0.29% EM on HotpotQA. A factorial experiment confirmed parallel streams are the primary mechanism; multi-scale windows are secondary. Two other architectural properties proved essential. Shared QKV weights collapse perspective diversity (cosine similarity &gt;0.9 vs 0.2-0.4 with separate weights). Consensus at every layer (k=1) drops EM to 1.05% vs 1.59% with consensus every 6 layers (k=6).</p>

<p>These results motivated the DCA-215M design. A narrower variant at d=768 (323M params) achieved only EM=0.57%, undertrained at 6.2 tokens per parameter. A variant without per-perspective MLP (302M params, 1.07x FLOPs) achieved EM=1.21% (1.30x over baseline). Separate MLP weights (139M params, same FLOPs) achieved PPL=20.45 but EM=0.80%, confirming the shared MLP acts as a regularizer.</p>

<blockquote>
  <p><strong>Table 3</strong> Architectural variant results. All evaluated on HotpotQA.</p>
</blockquote>

<table class="data-table">
  <thead>
    <tr>
      <th>Model</th>
      <th>Params</th>
      <th>d_model</th>
      <th>d_lane</th>
      <th>EM%</th>
      <th>VRAM</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DCA 90M (full-fat)</td>
      <td>89M</td>
      <td>512</td>
      <td>512</td>
      <td>1.56</td>
      <td>~8 GB</td>
    </tr>
    <tr>
      <td>DCA-d768</td>
      <td>323M</td>
      <td>768</td>
      <td>768</td>
      <td>0.57</td>
      <td>~45 GB</td>
    </tr>
    <tr>
      <td>DCA-noMLP</td>
      <td>302M</td>
      <td>1024</td>
      <td>768</td>
      <td>1.21</td>
      <td>~35 GB</td>
    </tr>
    <tr>
      <td>DCA-215M (bottleneck)</td>
      <td>215M</td>
      <td>1024</td>
      <td>512</td>
      <td>1.43</td>
      <td>~35 GB</td>
    </tr>
    <tr>
      <td>DCA 90M (separate MLPs)</td>
      <td>139M</td>
      <td>512</td>
      <td>512</td>
      <td>0.80</td>
      <td>~10 GB</td>
    </tr>
  </tbody>
</table>

<p>The DCA-215M results (Tables 2 and 3) confirm this design is practical and competitive.</p>

<h3 id="what-is-dca-well-suited-for">What is DCA well suited for?</h3>

<p>We selected benchmarks to test where DCA should help, not to maximize wins. Few existing tasks isolate distributed-source composition while remaining tractable for sub-billion-parameter decoder-only models, so HotpotQA serves as the primary stress test, 2Wiki as secondary corroboration, and the remaining tasks as negative controls.</p>

<p><img src="/assets/images/divergent-convergent-attention/information_topology_panel_b.svg" alt="Figure 3: Information topology" /></p>

<blockquote>
  <p><strong>Figure 3</strong> Information topology. DCA helps when relevant content is distributed across independent documents (left). It does not help when information forms a single chain (right).</p>
</blockquote>

<p>Sequential reasoning tasks (bAbI<sup id="fnref:babi" role="doc-noteref"><a href="#fn:babi" class="footnote" rel="footnote">22</a></sup>, Tree pathfinding<sup id="fnref:treepath" role="doc-noteref"><a href="#fn:treepath" class="footnote" rel="footnote">23</a></sup>, PrOntoQA<sup id="fnref:prontoqa" role="doc-noteref"><a href="#fn:prontoqa" class="footnote" rel="footnote">24</a></sup>, LEGO<sup id="fnref:lego" role="doc-noteref"><a href="#fn:lego" class="footnote" rel="footnote">25</a></sup>) show no advantage; all facts lie in a single flat sequence and every attention scale sees the same chain. Single-source tasks (TriviaQA<sup id="fnref:triviaqa" role="doc-noteref"><a href="#fn:triviaqa" class="footnote" rel="footnote">26</a></sup>, LAMBADA<sup id="fnref:lambada" role="doc-noteref"><a href="#fn:lambada" class="footnote" rel="footnote">27</a></sup>, MQAR<sup id="fnref:mqar" role="doc-noteref"><a href="#fn:mqar" class="footnote" rel="footnote">28</a></sup>) show no advantage; all perspectives see the same content. Tasks beyond model capacity (MuSiQue<sup id="fnref:musique" role="doc-noteref"><a href="#fn:musique" class="footnote" rel="footnote">29</a></sup>) show both models at floor. 2WikiMultiHopQA<sup id="fnref:twowiki" role="doc-noteref"><a href="#fn:twowiki" class="footnote" rel="footnote">30</a></sup> provides weak corroboration (Soft EM p = 0.004, EM ns).</p>

<p>DCA helps when relevant information is distributed across structurally independent segments, what we refer to as the information topology of the input, and does not help when information forms a single chain or resides at a single location. Within HotpotQA, the advantage is uniform across bridge questions (sequential logic, OR=4.65) and comparison questions (parallel logic, OR=4.51), indicating that multi-document context, not reasoning pattern, is the key factor.</p>

<h2 id="mechanistic-analysis">Mechanistic Analysis</h2>

<h3 id="force-decode-the-representation-advantage">Force-decode: the representation advantage</h3>

<p>To separate representation quality from generation dynamics, we feed the context to both models and force-decode the gold answer tokens (teacher-forcing), recording each model’s log-probability of the correct token at each position. For each of 6,359 validation examples, we compare which model assigns higher probability to the gold answer.</p>
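
<p>A hedged sketch of that measurement, assuming an HF-style causal LM whose forward pass exposes <code class="language-plaintext highlighter-rouge">logits</code> (the function name and the tokenized inputs are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn.functional as F

@torch.no_grad()
def gold_answer_logprob(model, context_ids, answer_ids):
    """Sum of log-probabilities assigned to the gold answer tokens when
    teacher-forced after the context. Both inputs are 1D LongTensors."""
    input_ids = torch.cat([context_ids, answer_ids]).unsqueeze(0)  # [1, T]
    logits = model(input_ids).logits                               # [1, T, V]
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = input_ids[0, 1:]
    token_lp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return token_lp[-answer_ids.numel():].sum().item()

# per-example comparison; positive values mean DCA assigns higher probability
# advantage = gold_answer_logprob(dca, ctx, ans) - gold_answer_logprob(baseline, ctx, ans)
</code></pre></div></div>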

<p><strong>Table 4: Force-decode results (90M, WT103).</strong> Top: paired comparison across all 6,359 validation examples (Wilcoxon signed-rank p &lt; 10^-300). Bottom: advantage by baseline difficulty quintile.</p>

<table class="data-table">
  <thead>
    <tr>
      <th>Slice</th>
      <th>DCA advantage</th>
      <th>DCA win rate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Overall</td>
      <td>+6.25 nats (~520x)</td>
      <td>97.8% (6,217/6,359)</td>
    </tr>
    <tr>
      <td>0-20% (easiest)</td>
      <td>+1.84 nats</td>
      <td>96.5%</td>
    </tr>
    <tr>
      <td>20-40%</td>
      <td>+3.20 nats</td>
      <td>96.2%</td>
    </tr>
    <tr>
      <td>40-60%</td>
      <td>+4.75 nats</td>
      <td>96.5%</td>
    </tr>
    <tr>
      <td>60-80%</td>
      <td>+7.25 nats</td>
      <td>99.6%</td>
    </tr>
    <tr>
      <td>80-100% (hardest)</td>
      <td>+14.23 nats</td>
      <td>100%</td>
    </tr>
  </tbody>
</table>

<p>The representation advantage is near-universal: DCA produces better internal representations on 97.8% of all examples, not just the 1.6% where EM=1. The advantage correlates with baseline difficulty (r=-0.888): 7.7x larger on the hardest examples than the easiest (Table 4, Figure 4). The harder an example is for a standard transformer, the more DCA’s multi-perspective consensus improves the representation.</p>

<p><img src="/assets/images/divergent-convergent-attention/fig5_difficulty_scaling.svg" alt="Figure 4: Force-decode advantage by baseline difficulty quintile" /></p>

<blockquote>
  <p><strong>Figure 4</strong> DCA 90M vs Baseline 90M (both WT103). DCA’s representational advantage scales with example difficulty. On the hardest quintile (where the baseline assigns the lowest probability to the correct answer), DCA’s advantage is +14.23 nats. On the easiest, +1.84 nats. r=-0.888.</p>
</blockquote>

<p>Recent work on latent multi-hop reasoning finds that while bridge-entity recall scales smoothly with model size, the compositional second hop does not, suggesting composition is a structural bottleneck rather than a capacity problem<sup id="fnref:yanglatent" role="doc-noteref"><a href="#fn:yanglatent" class="footnote" rel="footnote">31</a></sup>. That work studies parametric knowledge recall; DCA’s setting differs in that all relevant information is provided in context. Nevertheless, our force-decode result is consistent with the broader view that end-to-end exact match may understate the gradual development of multi-hop structure in model representations: at 90M, the correct answer is already encoded with substantially higher probability under the right architecture, even where end-to-end EM remains near floor. Confirming this connection would require targeted compositional probes, such as entity-recall scores and causal interventions on bridge entities, applied directly to DCA’s internal representations.</p>

<h3 id="same-retrieval-better-composition">Same retrieval, better composition</h3>

<p>We computed mean token recall and derived an approximate token precision from aggregate F1 and recall on generated predictions (90M, WT103, pooled across seeds 137 and 2024). Token Recall is essentially identical (~57.5% in both models). Token Precision, derived via P = F1·R / (2R - F1), shows the full advantage: ~8.2% vs ~4.2% (~2.0x). The advantage appears to come primarily from composition rather than token-level recall.</p>
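
<p>The derived precision is just algebra on the F1 definition; a minimal check using the aggregate numbers quoted above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def precision_from_f1_recall(f1, recall):
    # invert F1 = 2PR / (P + R) for P
    return f1 * recall / (2 * recall - f1)

# aggregate F1 and recall from the 90M WT103 runs reported above
print(precision_from_f1_recall(0.144, 0.575))   # ~0.082 (DCA)
print(precision_from_f1_recall(0.0777, 0.575))  # ~0.042 (baseline)
</code></pre></div></div>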

<blockquote>
  <p><strong>Figure 5</strong> DCA 90M vs Baseline 90M (both WT103). Token Recall is essentially identical (~57.5%). Token Precision (derived from aggregate F1 and recall) shows the full advantage (~8.2% vs ~4.2%).</p>
</blockquote>

<p><img src="/assets/images/divergent-convergent-attention/fig3_token_recall_precision.svg" alt="Figure 5: Token Recall/Precision" /></p>

<p>First-sentence extraction approximately decomposes this into two components. The advantage that survives extraction (~1.2-1.5x) reflects compositional integration at the representation level. The remaining multiplier (~2-3x) reflects generation coherence, scaling with answer length (3x at 1 token, 12.8x at 4+ tokens). These ranges are inferred from comparing first-sentence and full-output EM ratios, not independently measured.</p>

<h3 id="gate-ablation-consensus-is-essential-and-precisely-tuned">Gate ablation: consensus is essential and precisely tuned</h3>

<p>We force the consensus gate to fixed values during full QA evaluation using forward hooks. Gate=0 clamps the sigmoid to 0.001 (bypass consensus). Gate=1 clamps to 0.999 (force full consensus).</p>
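
<p>A minimal sketch of that intervention, assuming the post-sigmoid gate is produced by an identifiable submodule at each consensus layer (the module names are illustrative, and the exact hook point depends on where the sigmoid lives in the implementation):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def clamp_gates(model, value, gate_module_names):
    """Register forward hooks that overwrite the learned gate with a constant.
    value=0.001 bypasses consensus; value=0.999 forces full consensus."""
    handles = []
    for name, module in model.named_modules():
        if name in gate_module_names:
            def hook(mod, inputs, output, v=value):
                return torch.full_like(output, v)  # replace the gate tensor
            handles.append(module.register_forward_hook(hook))
    return handles  # call handle.remove() on each to restore the learned gates
</code></pre></div></div>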

<p><strong>Table 5: Gate ablation and learned gate values.</strong> Top: forcing the gate to fixed values during QA evaluation (values are counts of exact-match correct answers). Bottom: learned gate values at consensus layers.</p>

<table class="data-table">
  <thead>
    <tr>
      <th>Condition</th>
      <th>90M</th>
      <th>215M</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Learned gates</td>
      <td>101</td>
      <td>91</td>
    </tr>
    <tr>
      <td>Gate=1 (force full)</td>
      <td>18</td>
      <td>0</td>
    </tr>
    <tr>
      <td>Baseline</td>
      <td>12</td>
      <td>59</td>
    </tr>
    <tr>
      <td>Gate=0 (bypass)</td>
      <td>2</td>
      <td>2</td>
    </tr>
  </tbody>
</table>

<table class="data-table">
  <thead>
    <tr>
      <th>Consensus layer</th>
      <th>90M gate</th>
      <th>215M gate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Layer 5 (1st)</td>
      <td>0.29</td>
      <td>0.31</td>
    </tr>
    <tr>
      <td>Layer 11 (final @ 90M)</td>
      <td>0.99</td>
      <td>0.21</td>
    </tr>
    <tr>
      <td>Layer 17 (3rd)</td>
      <td>–</td>
      <td>0.28</td>
    </tr>
    <tr>
      <td>Layer 23 (4th)</td>
      <td>–</td>
      <td>0.35</td>
    </tr>
    <tr>
      <td>Layer 29 (final)</td>
      <td>–</td>
      <td>0.37</td>
    </tr>
  </tbody>
</table>

<p>Bypassing consensus collapses performance from 101 to 2 correct at 90M and 91 to 2 at 215M (Table 5). Forcing full consensus drops 101 to 18 at 90M and 91 to 0 at 215M, worse than the 350M baseline (59 correct), indicating that forced consensus is actively destructive to the representations DCA has learned to build through gradual integration.</p>

<p>At 90M the learned strategy is binary: passthrough early (0.29), full commit at the final layer (0.99). At 215M it is gradual and never exceeds 0.37. Both strategies are load-bearing, and disrupting either destroys performance.</p>

<h3 id="perspective-divergence-and-attention-patterns">Perspective divergence and attention patterns</h3>

<p>Perspectives develop genuinely distinct representations (cosine similarity 0.21 between local and medium at layer 5), complementary rather than redundant (Appendix Table 8).</p>

<blockquote>
  <p><strong>Figure 6</strong> DCA 90M vs Baseline 90M (both WT103), consensus layer 5. Each DCA perspective specializes at a different scale, while the baseline compromises at 0.34.</p>
</blockquote>

<p><img src="/assets/images/divergent-convergent-attention/fig4_cross_doc_attention.svg" alt="Figure 6: Cross-document attention fraction" /></p>

<p>Attention measurements (computed on EM=1 examples, n=101) confirm the specialization. The local perspective keeps 96% of attention within paragraphs (cross-document fraction 0.04), while the global perspective distributes 68% across documents. The baseline sits at 0.34. With DCA, local perspectives extract precise within-document content, global perspectives maintain cross-document context, and consensus integrates both. The baseline attends at multiple scales within a single residual stream, but must reconcile those scales within one shared representation.</p>
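
<p>A hedged sketch of the cross-document fraction, assuming we can read out a [T, T] attention matrix and a per-token paragraph id (names and shapes are our assumptions):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def cross_document_fraction(attn, doc_ids):
    """attn: [T, T] attention weights (each row sums to 1 over valid keys).
    doc_ids: [T] LongTensor mapping each token to its source paragraph."""
    same_doc = doc_ids.unsqueeze(1).eq(doc_ids.unsqueeze(0))  # [T, T] boolean
    cross_mass = (attn * (~same_doc)).sum()
    return (cross_mass / attn.sum()).item()
</code></pre></div></div>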

<h2 id="conclusion">Conclusion</h2>

<p>Multi-document composition is a documented bottleneck for production LLMs. RAG pipelines retrieve relevant documents but fail to synthesize across them<sup id="fnref:rag" role="doc-noteref"><a href="#fn:rag" class="footnote" rel="footnote">32</a></sup>. Models fail to use information in the middle of long contexts<sup id="fnref:lostmiddle" role="doc-noteref"><a href="#fn:lostmiddle" class="footnote" rel="footnote">33</a></sup>. Multi-hop reasoning may require 30-70B parameters to emerge in standard transformers<sup id="fnref:steelekatz" role="doc-noteref"><a href="#fn:steelekatz" class="footnote" rel="footnote">34</a></sup>. With DCA, we sought to demonstrate that parallel multi-scale perspectives with periodic late consensus can improve on these deficiencies.</p>

<p>Despite the limited capacity of our models, extensive benchmarking demonstrated a consistent advantage on distributed-source tasks (5.4x EM at 90M, 1.54x at 215M FLOP-comparable) and no advantage on sequential, single-source, or capacity-limited tasks. At 90M, the resulting representations encode multi-document relationships better than a parameter-matched standard transformer on 97.8% of examples, with 7.7x larger gains on the hardest examples, reflecting an advantage in composition rather than retrieval.</p>

<p>Consensus frequency, horizon widths, and gate training dynamics were fixed throughout our experiments, leaving substantial room for task-specific tuning. Because DCA is fundamentally a primitive that operates on tensors, other sequence-processing modules (ring attention, linear attention, SSMs) could serve as perspectives, opening a combinatorial design space we have only begun to explore. The force-decode diagnostic is itself useful beyond DCA, offering a general tool for determining whether the bottleneck in a given architecture is understanding or expression. We will be sharing the base code for DCA on GitHub.</p>

<hr />

<h2 id="appendix">Appendix</h2>

<h3 id="hotpotqa-task-illustration">HotpotQA task illustration</h3>

<p>Figure 2 above illustrates the HotpotQA distractor setting used throughout the paper: 10 paragraphs per question, with 2 supporting paragraphs embedded among 8 distractors.</p>

<h3 id="multi-seed-raw-data">Multi-seed raw data</h3>

<p><strong>Table 6: Per-seed HotpotQA results (90M, WT103).</strong> DCA mean EM: 1.562% (std 0.120%). Baseline mean EM: 0.288% (std 0.122%). Fisher exact (pooled): OR=5.49, p &lt; 10^-6.</p>

<table class="data-table">
  <thead>
    <tr>
      <th>Model</th>
      <th>Seed</th>
      <th>EM</th>
      <th>F1</th>
      <th>n</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DCA 90M</td>
      <td>42</td>
      <td>0.01588</td>
      <td>0.14508</td>
      <td>6359</td>
    </tr>
    <tr>
      <td>DCA 90M</td>
      <td>137</td>
      <td>0.01431</td>
      <td>0.14439</td>
      <td>6359</td>
    </tr>
    <tr>
      <td>DCA 90M</td>
      <td>2024</td>
      <td>0.01667</td>
      <td>0.14265</td>
      <td>6359</td>
    </tr>
    <tr>
      <td>Baseline 90M</td>
      <td>42</td>
      <td>0.00189</td>
      <td>0.07625</td>
      <td>6359</td>
    </tr>
    <tr>
      <td>Baseline 90M</td>
      <td>137</td>
      <td>0.00425</td>
      <td>0.08143</td>
      <td>6359</td>
    </tr>
    <tr>
      <td>Baseline 90M</td>
      <td>2024</td>
      <td>0.00252</td>
      <td>0.07528</td>
      <td>6359</td>
    </tr>
  </tbody>
</table>
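
<p>The pooled test can be reproduced from these per-seed rates; a sketch using correct-answer counts rounded from EM x n (our reconstruction for illustration, not the original analysis script):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from scipy.stats import fisher_exact

n = 6359 * 3                                                # examples pooled over three seeds
dca_correct = round((0.01588 + 0.01431 + 0.01667) * 6359)   # ~298
base_correct = round((0.00189 + 0.00425 + 0.00252) * 6359)  # ~55
table = [[dca_correct, n - dca_correct],
         [base_correct, n - base_correct]]
odds_ratio, p_value = fisher_exact(table)
print(odds_ratio, p_value)  # ~5.49, p far below 1e-6
</code></pre></div></div>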

<h3 id="additional-benchmark-evaluations">Additional benchmark evaluations</h3>

<p>All results in this section come from 90M models pretrained on WT103.</p>

<p>Sequential reasoning tasks show no DCA advantage: bAbI 2-hop hits 100% for both models at all distractor counts, and PrOntoQA also hits 100% at all hop counts. Tree pathfinding favors the baseline by 2-7 points at depths 4-6. LEGO is roughly even, with the baseline at ~31% and DCA at ~30%.</p>

<p>Single-source tasks also show no advantage. TriviaQA and LAMBADA show no consistent lift. MQAR (fixed protocol, vocab=8192) remains at exact chance across all key-value counts and learning rates for both models.</p>

<p>Capacity-limited tasks stay at floor. MuSiQue shows DCA at 0.21% and baseline at 0.10% (p = 0.687, not significant). Additional synthetic compositional probes were similarly uninformative at this scale: Entity Comparison remained at chance (50%, loss near ln 2), and MQAR2 also remained at chance (50%). We treat these as floor-effect results for small models trained from scratch rather than meaningful tests of DCA’s inductive bias.</p>

<p>2WikiMultiHopQA provides weak corroboration: EM is even (0.31% vs 0.33%, p = 1.0), but Soft EM (F1 &gt;= 0.5) favors DCA at 2.31% vs 1.47% (p = 0.004).</p>

<h3 id="force-decode-difficulty-scaling">Force-decode difficulty scaling</h3>

<p><img src="/assets/images/divergent-convergent-attention/fig5_scatter_reference.png" alt="Appendix Figure: Force-decode advantage vs baseline difficulty" /></p>

<blockquote>
  <p><strong>Appendix Figure</strong> Per-example force-decode advantage (DCA log-prob minus baseline log-prob) plotted against baseline log-prob (90M DCA vs 90M baseline, both WT103). Each point is one of 6,359 HotpotQA validation examples. r=-0.888. The harder an example is for the baseline (more negative log-prob), the larger DCA’s representational advantage.</p>
</blockquote>

<hr />

<h3 id="literature-gap">Literature gap</h3>

<p>To our knowledge, no published decoder-only HotpotQA results exist between 90M and 7B parameters.</p>

<p><strong>Table 7: Published HotpotQA results.</strong> Encoder models dominate at 110M-355M due to bidirectional attention and span extraction heads. Decoder-only models need 7B+ for ~30% EM.</p>

<table class="data-table">
  <thead>
    <tr>
      <th>Architecture</th>
      <th>Params</th>
      <th>HotpotQA</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DCA 90M</td>
      <td>89M</td>
      <td>1.56% EM</td>
      <td>Decoder, WT103</td>
    </tr>
    <tr>
      <td>Baseline 90M</td>
      <td>90M</td>
      <td>0.29% EM</td>
      <td>Decoder, WT103</td>
    </tr>
    <tr>
      <td>Baseline 350M</td>
      <td>355M</td>
      <td>0.93% EM</td>
      <td>Decoder, PG-19</td>
    </tr>
    <tr>
      <td>DCA-215M</td>
      <td>215M</td>
      <td>1.43% EM</td>
      <td>Decoder, PG-19</td>
    </tr>
    <tr>
      <td>BERT-base-era systems</td>
      <td>~110M</td>
      <td>~54% EM</td>
      <td>Encoder, bidirectional</td>
    </tr>
    <tr>
      <td>Longformer-base</td>
      <td>~149M</td>
      <td>64% F1</td>
      <td>Encoder, local+global</td>
    </tr>
    <tr>
      <td>Longformer-large</td>
      <td>~435M</td>
      <td>73% F1</td>
      <td>Encoder, local+global</td>
    </tr>
    <tr>
      <td>BigBird-ETC</td>
      <td>~131M</td>
      <td>76% F1</td>
      <td>Encoder, sparse</td>
    </tr>
    <tr>
      <td>RoBERTa-large-based systems</td>
      <td>~355M</td>
      <td>~70% EM</td>
      <td>Encoder</td>
    </tr>
    <tr>
      <td>Llama-2-7B</td>
      <td>7B</td>
      <td>~30% EM</td>
      <td>Decoder (FireAct)</td>
    </tr>
    <tr>
      <td>GPT-3.5</td>
      <td>proprietary</td>
      <td>~31% EM</td>
      <td>Few-shot ReAct</td>
    </tr>
    <tr>
      <td>Human</td>
      <td>–</td>
      <td>~91% F1</td>
      <td>Leaderboard</td>
    </tr>
  </tbody>
</table>

<p>Encoder models such as BERT<sup id="fnref:bert:1" role="doc-noteref"><a href="#fn:bert" class="footnote" rel="footnote">16</a></sup>, Longformer<sup id="fnref:longformer:3" role="doc-noteref"><a href="#fn:longformer" class="footnote" rel="footnote">6</a></sup>, BigBird-ETC<sup id="fnref:bigbird:3" role="doc-noteref"><a href="#fn:bigbird" class="footnote" rel="footnote">7</a></sup>, and RoBERTa<sup id="fnref:roberta:1" role="doc-noteref"><a href="#fn:roberta" class="footnote" rel="footnote">17</a></sup> dominate at 110M-355M because HotpotQA was designed for BERT-era extractive QA with bidirectional attention and span extraction heads. Decoder-only models need 7B+ for ~30% EM (FireAct with Llama-2-7B<sup id="fnref:fireact:1" role="doc-noteref"><a href="#fn:fireact" class="footnote" rel="footnote">18</a></sup>). Steele &amp; Katz<sup id="fnref:steelekatz:1" role="doc-noteref"><a href="#fn:steelekatz" class="footnote" rel="footnote">34</a></sup> identify a phase transition at 30-70B for emergent multi-hop reasoning.</p>

<h3 id="additional-wt103-variants">Additional WT103 variants</h3>

<p>In addition to the headline comparison (DCA vs <code class="language-plaintext highlighter-rouge">baseline_mixed</code>, 50K steps, 3 seeds), we trained six additional 90M WT103 variants: three DCA variants (consensus every 1, 3, or 6 layers, plus uniform-horizon settings) and three baseline variants (full causal, layerwise windows, sliding window w=256) at 50K or 30K steps. In all cases, every DCA variant outperformed every baseline on HotpotQA EM, including cross-budget comparisons where DCA at 30K steps with 1-epoch finetuning exceeded baselines at 50K steps with 3-epoch finetuning. The factorial decomposition confirmed that parallel streams are the primary mechanism; multi-scale windows are secondary.</p>

<h3 id="perspective-divergence">Perspective divergence</h3>

<p>Pairwise cosine similarity between K=3 perspectives at consensus layers, measured on HotpotQA inputs (90M, WT103). No EM stratification: correct and incorrect examples show nearly identical divergence, confirming divergence is an architectural property rather than a predictor of success.</p>
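
<p>A minimal sketch of the divergence measurement, assuming we can read out each perspective’s [B, T, D] hidden state at a consensus layer:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch.nn.functional as F

def perspective_similarity(p_a, p_b):
    """Mean cosine similarity between two perspective states of shape [B, T, D]."""
    return F.cosine_similarity(p_a, p_b, dim=-1).mean().item()

# e.g. local vs medium at consensus layer 5, averaged over HotpotQA inputs
# sim = perspective_similarity(hidden["local"][5], hidden["medium"][5])
</code></pre></div></div>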

<p><strong>Table 8: Perspective divergence on QA data.</strong> Local and medium are most dissimilar at layer 5 (0.21); by layer 11 they partially reconverge (0.62) while local-global remains distinct (0.34).</p>

<table class="data-table">
  <thead>
    <tr>
      <th>Layer</th>
      <th>Pair</th>
      <th>Overall</th>
      <th>EM=1 (n=101)</th>
      <th>EM=0 (n=6,258)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>5</td>
      <td>local vs medium</td>
      <td>0.207</td>
      <td>0.209</td>
      <td>0.207</td>
    </tr>
    <tr>
      <td>5</td>
      <td>local vs global</td>
      <td>0.435</td>
      <td>0.437</td>
      <td>0.435</td>
    </tr>
    <tr>
      <td>5</td>
      <td>medium vs global</td>
      <td>0.405</td>
      <td>0.407</td>
      <td>0.405</td>
    </tr>
    <tr>
      <td>11</td>
      <td>local vs medium</td>
      <td>0.621</td>
      <td>0.625</td>
      <td>0.621</td>
    </tr>
    <tr>
      <td>11</td>
      <td>local vs global</td>
      <td>0.336</td>
      <td>0.337</td>
      <td>0.336</td>
    </tr>
    <tr>
      <td>11</td>
      <td>medium vs global</td>
      <td>0.437</td>
      <td>0.436</td>
      <td>0.437</td>
    </tr>
  </tbody>
</table>

<h2 id="citation">Citation</h2>

<p>This blog post serves as the current preprint version of this work. Until an archival version is available, please cite it as:</p>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@misc</span><span class="p">{</span><span class="nl">zhao2026dca</span><span class="p">,</span>
  <span class="na">author</span> <span class="p">=</span> <span class="s">{Ben Zhao and Jenhan Tao}</span><span class="p">,</span>
  <span class="na">title</span> <span class="p">=</span> <span class="s">{Divergent-Convergent Attention: Parallel Perspectives for Compositional Reasoning}</span><span class="p">,</span>
  <span class="na">year</span> <span class="p">=</span> <span class="s">{2026}</span><span class="p">,</span>
  <span class="na">howpublished</span> <span class="p">=</span> <span class="s">{\url{https://iluvatarlabs.github.io/blog/2026/03/divergent-convergent-attention/}}</span><span class="p">,</span>
  <span class="na">note</span> <span class="p">=</span> <span class="s">{Iluvatar Labs blog preprint}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="acknowledgements">Acknowledgements</h3>

<p>We thank Abel Chiao for helpful discussions and feedback on this work.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:hotpotqa" role="doc-endnote">
      <p>Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., &amp; Manning, C. D. (2018). “HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering”. <em>Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018)</em>. arXiv:1809.09600 <a href="#fnref:hotpotqa" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:buzsaki" role="doc-endnote">
      <p>Buzsaki, G. (2006). <em>Rhythms of the Brain</em>. Oxford University Press. <a href="#fnref:buzsaki" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:canolty" role="doc-endnote">
      <p>Canolty, R. T., &amp; Knight, R. T. (2010). “The Functional Role of Cross-Frequency Coupling”. <em>Trends in Cognitive Sciences</em>, 14(11), 506-515. <a href="#fnref:canolty" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:lisman" role="doc-endnote">
      <p>Lisman, J. E., &amp; Jensen, O. (2013). “The Theta-Gamma Neural Code”. <em>Neuron</em>, 77(6), 1002-1016. <a href="#fnref:lisman" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:colgin" role="doc-endnote">
      <p>Colgin, L. L., Denninger, T., Fyhn, M., Hafting, T., Bonnevie, T., Jensen, O., Moser, M.-B., &amp; Moser, E. I. (2009). “Frequency of Gamma Oscillations Routes Flow of Information in the Hippocampus”. <em>Nature</em>, 462, 353-357. <a href="#fnref:colgin" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:longformer" role="doc-endnote">
      <p>Beltagy, I., Peters, M. E., &amp; Cohan, A. (2020). “Longformer: The Long-Document Transformer”. arXiv:2004.05150 <a href="#fnref:longformer" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:longformer:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:longformer:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:longformer:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p>
    </li>
    <li id="fn:bigbird" role="doc-endnote">
      <p>Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., &amp; Ahmed, A. (2020). “Big Bird: Transformers for Longer Sequences”. <em>Advances in Neural Information Processing Systems 33 (NeurIPS 2020)</em>. arXiv:2007.14062 <a href="#fnref:bigbird" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:bigbird:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:bigbird:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:bigbird:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p>
    </li>
    <li id="fn:retnet" role="doc-endnote">
      <p>Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., &amp; Wei, F. (2023). “Retentive Network: A Successor to Transformer for Large Language Models”. arXiv:2307.08621 <a href="#fnref:retnet" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:ringattention" role="doc-endnote">
      <p>Liu, H., Zaharia, M., &amp; Abbeel, P. (2023). “Ring Attention with Blockwise Transformers for Near-Infinite Context”. arXiv:2310.01889 <a href="#fnref:ringattention" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:flashattention" role="doc-endnote">
      <p>Dao, T., Fu, D. Y., Ermon, S., Rudra, A., &amp; Re, C. (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”. <em>Advances in Neural Information Processing Systems</em>, 35. <a href="#fnref:flashattention" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:resnext" role="doc-endnote">
      <p>Xie, S., Girshick, R., Dollar, P., Tu, Z., &amp; He, K. (2017). “Aggregated Residual Transformations for Deep Neural Networks”. <em>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017)</em>. arXiv:1611.05431 <a href="#fnref:resnext" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:resnext:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:moe" role="doc-endnote">
      <p>Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., &amp; Dean, J. (2017). “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”. <em>Proceedings of the 5th International Conference on Learning Representations (ICLR 2017)</em>. arXiv:1701.06538 <a href="#fnref:moe" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:moe:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:switch" role="doc-endnote">
      <p>Fedus, W., Zoph, B., &amp; Shazeer, N. (2022). “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”. <em>Journal of Machine Learning Research</em>, 23(120), 1-39. <a href="#fnref:switch" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:highway" role="doc-endnote">
      <p>Srivastava, R. K., Greff, K., &amp; Schmidhuber, J. (2015). “Highway Networks”. arXiv:1505.00387 <a href="#fnref:highway" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:highway:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:fedavg" role="doc-endnote">
      <p>McMahan, H. B., Moore, E., Ramage, D., Hampson, S., &amp; y Arcas, B. A. (2017). “Communication-Efficient Learning of Deep Networks from Decentralized Data”. <em>Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017)</em>. arXiv:1602.05629 <a href="#fnref:fedavg" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:fedavg:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:bert" role="doc-endnote">
      <p>Devlin, J., Chang, M.-W., Lee, K., &amp; Toutanova, K. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. <em>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)</em>. arXiv:1810.04805 <a href="#fnref:bert" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:bert:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:roberta" role="doc-endnote">
      <p>Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., &amp; Stoyanov, V. (2019). “RoBERTa: A Robustly Optimized BERT Pretraining Approach”. arXiv:1907.11692 <a href="#fnref:roberta" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:roberta:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:fireact" role="doc-endnote">
      <p>Chen, B., Monajatipoor, M., Veen, D. V., Guo, Y., &amp; Dubrawski, A. (2023). “FireAct: Toward Language Agent Fine-tuning”. arXiv:2310.05915 <a href="#fnref:fireact" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:fireact:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:mvit" role="doc-endnote">
      <p>Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., &amp; Feichtenhofer, C. (2021). “Multiscale Vision Transformers”. <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2021)</em>. arXiv:2104.11227 <a href="#fnref:mvit" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:wikitext" role="doc-endnote">
      <p>Merity, S., Xiong, C., Bradbury, J., &amp; Socher, R. (2017). “Pointer Sentinel Mixture Models”. <em>Proceedings of the 5th International Conference on Learning Representations (ICLR 2017)</em>. arXiv:1609.07843 <a href="#fnref:wikitext" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:pg19" role="doc-endnote">
      <p>Rae, J. W., Potapenko, A., Jayakumar, S. M., &amp; Hillier, C. (2020). “Compressive Transformers for Long-Range Sequence Modelling”. <em>Proceedings of the 8th International Conference on Learning Representations (ICLR 2020)</em>. arXiv:1911.05507 <a href="#fnref:pg19" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:babi" role="doc-endnote">
      <p>Weston, J., Bordes, A., Chopra, S., &amp; Mikolov, T. (2015). “Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks”. arXiv:1502.05698 <a href="#fnref:babi" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:treepath" role="doc-endnote">
      <p>Brinkmann, J., Goswami, K., &amp; Rajani, N. F. (2024). “A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task”. <em>Findings of the Association for Computational Linguistics: ACL 2024</em>. <a href="#fnref:treepath" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:prontoqa" role="doc-endnote">
      <p>Saparov, A., &amp; He, H. (2023). “Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought”. <em>Proceedings of the 11th International Conference on Learning Representations (ICLR 2023)</em>. <a href="#fnref:prontoqa" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:lego" role="doc-endnote">
      <p>Zhang, Y., Yu, A. W., &amp; Xu, W. (2022). “Unveiling Transformers with LEGO: A Synthetic Reasoning Task”. arXiv:2206.04301 <a href="#fnref:lego" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:triviaqa" role="doc-endnote">
      <p>Joshi, M., Choi, E., Weld, D. S., &amp; Zettlemoyer, L. (2017). “TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension”. <em>Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017)</em>. arXiv:1705.03551 <a href="#fnref:triviaqa" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:lambada" role="doc-endnote">
      <p>Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N. Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., &amp; Fernandez, R. (2016). “The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context”. <em>Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016)</em>. arXiv:1606.06031 <a href="#fnref:lambada" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:mqar" role="doc-endnote">
      <p>Arora, S., Eyuboglu, S., Timalsina, A., Johnson, I., Poli, M., Rudra, A., &amp; Zou, J. (2023). “Zoology: Measuring and Improving Recall in Efficient Language Models”. arXiv:2312.04927 <a href="#fnref:mqar" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:musique" role="doc-endnote">
      <p>Trivedi, H., Balasubramanian, N., Khot, T., &amp; Sabharwal, A. (2022). “MuSiQue: Multihop Questions via Single-hop Question Composition”. <em>Transactions of the Association for Computational Linguistics</em>, 10, 539-554. arXiv:2108.00573 <a href="#fnref:musique" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:twowiki" role="doc-endnote">
      <p>Ho, X., Nguyen, A.-K. D., Sugawara, S., &amp; Aizawa, A. (2020). “Constructing a Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps”. <em>Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020)</em>. arXiv:2011.01060 <a href="#fnref:twowiki" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:yanglatent" role="doc-endnote">
      <p>Yang, S., Gribovskaya, E., Kassner, N., Geva, M., &amp; Riedel, S. (2024). “Do Large Language Models Latently Perform Multi-Hop Reasoning?” <em>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)</em>. <a href="#fnref:yanglatent" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:rag" role="doc-endnote">
      <p>Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Kuttler, H., Lewis, M., Yih, W.-t., Rocktaschel, T., Riedel, S., &amp; Kiela, D. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”. <em>Advances in Neural Information Processing Systems 33 (NeurIPS 2020)</em>. arXiv:2005.11401 <a href="#fnref:rag" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:lostmiddle" role="doc-endnote">
      <p>Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., &amp; Liang, P. (2024). “Lost in the Middle: How Language Models Use Long Contexts”. <em>Transactions of the Association for Computational Linguistics</em>, 12, 157-173. arXiv:2307.03172 <a href="#fnref:lostmiddle" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:steelekatz" role="doc-endnote">
      <p>Steele, B., &amp; Katz, M. (2026). “Scaling Trends for Multi-Hop Contextual Reasoning in Mid-Scale Language Models”. arXiv:2601.04254 <a href="#fnref:steelekatz" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:steelekatz:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
  </ol>
</div>]]></content><author><name>Ben Zhao · Jenhan Tao</name></author><category term="Research" /><summary type="html"><![CDATA[We introduce Divergent-Convergent Attention (DCA), a transformer primitive that maintains parallel attention streams at different window sizes and reconciles them through learned periodic consensus.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://iluvatarlabs.github.io/assets/images/divergent-convergent-attention/social-card.png" /><media:content medium="image" url="https://iluvatarlabs.github.io/assets/images/divergent-convergent-attention/social-card.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Meet Marvin</title><link href="https://iluvatarlabs.github.io/blog/2026/03/introducing-marvin/" rel="alternate" type="text/html" title="Meet Marvin" /><published>2026-03-01T00:00:00+00:00</published><updated>2026-03-01T00:00:00+00:00</updated><id>https://iluvatarlabs.github.io/blog/2026/03/introducing-marvin</id><content type="html" xml:base="https://iluvatarlabs.github.io/blog/2026/03/introducing-marvin/"><![CDATA[<p>Today, we’re introducing Marvin, an autonomous research agent for ML science. Marvin takes the information overload and busywork out of research. It does deep literature review, generates and tests novel and scientifically valid hypotheses, and can perform the entire research loop fully autonomously, end to end. <a href="/marvin/">Learn more about Marvin.</a></p>

<h2 id="why-we-built-marvin">Why we built Marvin</h2>

<p>The bottleneck in ML research today isn’t compute or data. It’s the preparation. More research is being produced now than at any point in history, and the pace is only increasing. Researchers must ingest and synthesize growing volumes of information before they can actually start their research. And even once they start, a lot of the research cycle is still spent on logistics rather than the science itself.</p>

<p>We built Marvin because nothing out there worked well enough for our own research. The existing options were either <strong>too dumb</strong> (chasing red herrings down rabbit holes or proposing smart-sounding ideas that were anything but), <strong>too wasteful</strong> (channeling Ralph Wiggum on experiments that were never going to work), or <strong>too opaque</strong> (poor documentation, no reasoning traces or “logic trail” that forms the bedrock of scientific reproducibility).</p>

<h2 id="full-autonomy-with-logic-trail">Full autonomy with logic trail</h2>

<p>The last of these issues, opacity, is especially problematic for closed-loop agents doing science. Full autonomy is only useful if you can trust it. AI is incredibly good at generating plausible-looking outputs, which will only further compound the reproducibility crisis in academia today.</p>

<p><img src="/assets/images/introducing-marvin/ruslan-autonomous-science.jpg" alt="Scientific figure showing that higher water intake reduces amyloid pathology and improves cognition in 5xFAD mice." /></p>

<p><em>Did you know drinking water can prevent Alzheimer’s? Neither did we. Better keep the receipts.<sup id="fnref:ruslan" role="doc-noteref"><a href="#fn:ruslan" class="footnote" rel="footnote">1</a></sup></em></p>

<p>In order for autonomous scientists to contribute real, meaningful discoveries, the system has to do more than generate the output. It has to carry forward rich context continuously, make sensible, data- and fact-driven decisions, and leave behind a clear record for others, both human and agentic, to inspect and validate.</p>

<h2 id="marvin-is-for-everyone">Marvin is for everyone</h2>

<p>We do not see autonomous systems as replacements for human work. They should augment us, increase our productivity, and let us spend more of our time on the parts of the work that actually matter. That is why we built Marvin to be a scientific <strong>collaborator</strong>: flexible and sophisticated enough to function as a coworker, not just a tool.</p>

<p>Whether you’re a highly technical ML researcher who just needs more clones of you or a bench scientist who has never written a line of code, Marvin can join your team and pick up the work you want to delegate at the degree of autonomy you want to grant it. It can handle anything from just literature review to a full end-to-end research loop, and at any time, you can review or discuss the results or redirect the next experiments before Marvin kicks off again. The level of autonomy is yours to set.</p>

<p>Marvin’s capabilities are also cross-domain. It can do research across fields as diverse as frontier AI research, computational biology and bioinformatics, and materials science. That is because the scientific method and rigor are universal, and we designed Marvin’s research loop and memory system around the same principles we used running our own research teams and academic labs.</p>

<h2 id="work-with-marvin">Work with Marvin</h2>

<p>In head-to-head evaluations using both LLM judges and human PhD judges in the relevant fields, Marvin scored higher than competing autonomous science agents on research depth, rigor, and creativity. We will publish a broader meta-paper with those results closer to Marvin’s open launch.</p>

<p>Marvin is in closed testing now. Read more on the <a href="/marvin/">Marvin page</a> and see examples of its work there. If you’re interested, we’d love to hear about your project’s needs and discuss how Marvin can help.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:ruslan" role="doc-endnote">
      <p>Ruslan Salakhutdinov, “<a href="https://x.com/rust_ruslan/status/2047718238663172329">the future of science is less about producing results and more about verifying them</a>,” X, July 18, 2025. Embedded figure above from the linked post. <a href="#fnref:ruslan" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Iluvatar Labs</name></author><category term="Product" /><summary type="html"><![CDATA[We built Marvin because too much ML research time is still spent on preparation, context gathering, and logistics instead of the science itself.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://iluvatarlabs.github.io/assets/images/introducing-marvin/social-card.png" /><media:content medium="image" url="https://iluvatarlabs.github.io/assets/images/introducing-marvin/social-card.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Elastic Speculation</title><link href="https://iluvatarlabs.github.io/blog/2025/11/elastic-speculation/" rel="alternate" type="text/html" title="Elastic Speculation" /><published>2025-11-11T00:00:00+00:00</published><updated>2025-11-11T00:00:00+00:00</updated><id>https://iluvatarlabs.github.io/blog/2025/11/elastic-speculation</id><content type="html" xml:base="https://iluvatarlabs.github.io/blog/2025/11/elastic-speculation/"><![CDATA[<blockquote>
  <p><strong>The TL;DR:</strong> Elastic speculation speeds up inference while maintaining output quality, resulting in more responsive models and a reduction in compute costs.</p>

  <p>Specifically, adaptive draft length delivers <strong>20-50% latency reduction</strong> over fixed-length speculation. Confidence-based early exit <strong>cuts speculative KV writes by ~50% at a 1-3% latency cost</strong>. Both methods <strong>preserve semantic quality</strong> at multiple scales (BERTScore &gt;0.9, cosine similarity &gt;0.95, equivalent reward model scoring).</p>
</blockquote>

<h2 id="introduction">Introduction</h2>

<p>Large language model inference is fast enough to demo and slow enough to hurt.</p>

<p>Speculative decoding<sup id="fnref:speculative" role="doc-noteref"><a href="#fn:speculative" class="footnote" rel="footnote">1</a></sup> is an incredibly effective strategy for speeding up inference: a smaller draft model proposes multiple tokens, a larger target model verifies them, and we commit the accepted prefix and discard the rest. Implementations like EAGLE<sup id="fnref:eagle" role="doc-noteref"><a href="#fn:eagle" class="footnote" rel="footnote">2</a></sup> in vLLM<sup id="fnref:vllm" role="doc-noteref"><a href="#fn:vllm" class="footnote" rel="footnote">3</a></sup> already make this practical and widely used.</p>
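
<p>To make the mechanics concrete, here is a minimal sketch of the propose-verify-commit loop. It uses a simplified greedy verification rule (accept a drafted token only if it matches the target model’s argmax) rather than the full rejection-sampling rule from the original paper, and <code class="language-plaintext highlighter-rouge">draft_model</code> / <code class="language-plaintext highlighter-rouge">target_model</code> are stand-ins for any callables that map a token sequence to per-position logits:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def speculative_step(draft_model, target_model, prefix: torch.Tensor, k: int) -&gt; torch.Tensor:
    """One propose-verify-commit round (simplified greedy-verification variant)."""
    # 1) The draft model proposes k tokens autoregressively.
    drafted = prefix.clone()
    for _ in range(k):
        logits = draft_model(drafted)                      # [seq, vocab]
        drafted = torch.cat([drafted, logits[-1].argmax().view(1)])

    # 2) The target model scores the whole drafted sequence in one pass.
    target_next = target_model(drafted).argmax(dim=-1)     # target's pick at each position

    # 3) Accept the longest prefix of drafted tokens the target agrees with.
    n_prefix, accepted = prefix.shape[0], 0
    for i in range(k):
        if drafted[n_prefix + i] == target_next[n_prefix + i - 1]:
            accepted += 1
        else:
            break

    # 4) Commit the accepted tokens, then append the target's own next token.
    committed = drafted[: n_prefix + accepted]
    return torch.cat([committed, target_next[n_prefix + accepted - 1].view(1)])
</code></pre></div></div>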

<p>However, two parts of this pipeline are still potentially inefficient:</p>

<ul>
  <li>The draft length is fixed, even as acceptance behavior changes across prompts, positions, and workloads.</li>
  <li>Every speculative token writes to KV cache, even when it was never likely to survive verification.</li>
</ul>

<p>In this post, we introduce <strong>Elastic Speculation</strong>: a small control layer on top of EAGLE that makes speculative decoding adaptive instead of static.</p>

<h2 id="why-spec-decode-leaves-performance-on-the-table">Why spec decode leaves performance on the table</h2>

<p><strong>First, acceptance is not constant</strong> and so a global, fixed <em>K</em> is too blunt. For easy or highly structured workloads (e.g., coding or QA-style prompts), acceptance can be very high, so a small <em>K</em> underutilizes the draft model. For harder or more creative workloads, acceptance drops, so a large <em>K</em> wastes compute on tokens that will be thrown away.</p>

<p><strong>Second, being KV-cache bandwidth constrained hurts.</strong> Even speculative tokens that will never be accepted still pay the full price of KV writes. At larger batch sizes, longer contexts, and bigger models, KV-cache traffic becomes a dominant bottleneck<sup id="fnref:kvbottleneck" role="doc-noteref"><a href="#fn:kvbottleneck" class="footnote" rel="footnote">4</a></sup>. Reducing unnecessary KV work is often the real lever for throughput.</p>

<p>Elastic Speculation treats speculative decoding as a <strong>runtime control problem</strong>:</p>
<ul>
  <li>Speculate more when speculation is working.</li>
  <li>Speculate less when it isn’t.</li>
  <li>Stop writing KV for tokens that are very unlikely to matter.</li>
</ul>

<p>We do this without changing model weights or the verification rule. Our reference implementation is for EAGLE in vLLM, but the same control-plane ideas apply to other speculative decoding methods.</p>

<p><img src="/assets/images/elastic-speculation/elastic_spec_overview_mod.svg" alt="Figure 1: Elastic Speculation overview" /></p>

<blockquote>
  <p><strong>Figure 1</strong> illustrates this design: speculative decoding with a dynamic <em>K</em>, plus a separate control that can gate KV writes.</p>
</blockquote>

<h2 id="adaptive-draft-length-making-k-elastic">Adaptive draft length: making <em>K</em> elastic</h2>

<p>Our first contribution is enabling an <strong>adaptive draft length</strong>. Instead of choosing <em>K</em> once and hard-coding it, we let the system adjust <em>K</em> dynamically based on how speculation has been performing recently.</p>

<p>At a high level, our implementation features the following (an illustrative sketch of the controller follows the list):</p>

<blockquote>
  <ul>
    <li>A runtime maintains lightweight statistics about speculative behavior.</li>
    <li>A controller selects a draft length from a small set (e.g., 5, 10, 15) for each step:
      <ul>
        <li>When recent speculative proposals are mostly accepted, it chooses a longer draft.</li>
        <li>When they are frequently rejected, it chooses a shorter one.</li>
      </ul>
    </li>
    <li>The selected draft length is carried through existing batch descriptors into the EAGLE path. No extra RPC layer, no changes to the verification contract.</li>
  </ul>
</blockquote>
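
<p>The controller itself can be very small. The sketch below is illustrative rather than the actual vLLM integration: it tracks an exponential moving average of the recent acceptance rate and maps it onto one of the candidate draft lengths. The candidate set, decay, and thresholds here are assumptions chosen to mirror the example values above, not our tuned settings.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class AdaptiveDraftController:
    """Pick a draft length K per step from recent acceptance behavior (illustrative)."""

    def __init__(self, candidates=(5, 10, 15), ema_decay=0.9):
        self.candidates = sorted(candidates)
        self.ema_decay = ema_decay
        self.acceptance_ema = 0.5  # start neutral

    def update(self, accepted: int, proposed: int) -&gt; None:
        """Fold the last round's acceptance ratio into the running average."""
        if proposed == 0:
            return
        rate = accepted / proposed
        self.acceptance_ema = self.ema_decay * self.acceptance_ema + (1.0 - self.ema_decay) * rate

    def next_draft_length(self) -&gt; int:
        """High recent acceptance: speculate more. Low: speculate less."""
        if self.acceptance_ema &gt;= 0.8:
            return self.candidates[-1]                         # e.g. 15
        if self.acceptance_ema &gt;= 0.5:
            return self.candidates[len(self.candidates) // 2]  # e.g. 10
        return self.candidates[0]                              # e.g. 5
</code></pre></div></div>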

<h3 id="latency-savings">Latency savings</h3>

<p>We evaluated adaptive draft length on <code class="language-plaintext highlighter-rouge">Llama-3.1-8B-Instruct</code> target and draft models, across various configurations (batch size, output tokens, etc.) and datasets. We selected the following four diverse benchmark datasets, representing different LLM workload characteristics:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">Alpaca</code> - Instruction-following tasks spanning creative writing, QA, and general task completion. Representative of typical chat assistant workloads.</li>
  <li><code class="language-plaintext highlighter-rouge">SQuAD</code> - Reading comprehension requiring extractive answers. Short, factual outputs with high determinism ideal for testing speculation on low-entropy tasks.</li>
  <li><code class="language-plaintext highlighter-rouge">CNN DailyMail</code> (aka long) - Document summarization, essays, and narratives requiring 256+ tokens. Stresses sustained speculation quality over extended generations.</li>
  <li><code class="language-plaintext highlighter-rouge">BigCodeBench</code> (aka coding) - Code completion, bug fixing, and algorithm implementation. Highly structured outputs with strict syntactic constraints test adaptive tuning limits.</li>
</ul>

<p>Across workloads ranging from short bursts (<code class="language-plaintext highlighter-rouge">12 requests x 64 tokens</code>) to long-form generation (<code class="language-plaintext highlighter-rouge">36 x 256</code>), adaptive draft length cuts latency substantially. Figure 2 breaks down these gains at draft length <code class="language-plaintext highlighter-rouge">d=10</code> across the four datasets. The short-context benchmarks - Alpaca, SQuAD, and Coding - deliver consistent <strong>35–45%</strong> speedups under both greedy (<code class="language-plaintext highlighter-rouge">temp=0</code>) and stochastic sampling decoding (<code class="language-plaintext highlighter-rouge">temp=0.7</code>, not shown). For the long-form dataset, while adaptive still provides sizeable gains, the savings drop to <strong>~16–30%</strong>.</p>

<p>Why the gap? Speculative decoding fundamentally relies on the draft model tracking the target model’s distribution. As sequences grow longer, this alignment degrades. Our long-form benchmark averages 487 tokens per output (vs 128–256 for other datasets). The longer the context, the more cumulative errors compound, and acceptance rates fall accordingly<sup id="fnref:seqlen" role="doc-noteref"><a href="#fn:seqlen" class="footnote" rel="footnote">5</a></sup>.</p>

<p><img src="/assets/images/elastic-speculation/latency_adaptive_d10.png" alt="Figure 2: Latency" /></p>

<blockquote>
  <p><strong>Figure 2</strong> Adaptive draft length (d=10) achieves 35-55% latency reduction across datasets with Llama-3.1-8B-Instruct.</p>
</blockquote>

<p>Next, we evaluated draft lengths of 5, 10, and 15 tokens on the <code class="language-plaintext highlighter-rouge">36 requests x 128 tokens</code> configuration. These values span the typical deployment range: production systems conservatively use 3-5 tokens (Red Hat’s EAGLE3 at 3, NVIDIA’s reference configs at 5) to minimize wasted computation when drafts are rejected. Our experiments also test draft lengths beyond this range, as some implementations suggest 8-10 and even 18-32 for methods like suffix decoding.</p>

<p>At <code class="language-plaintext highlighter-rouge">d=5</code>, adaptive speculation yields smaller savings across the board, which makes sense: there are fewer ways to dynamically reduce <em>K</em>. The benefit does appear to saturate after <code class="language-plaintext highlighter-rouge">d=10</code>. We observe task-specific phenomena as well. As noted above, long-form generation maintains modest 16–30% speedups across all lengths, limited by fundamental acceptance rate degradation at extended sequences.</p>

<p>Coding presents a rather unique case compared to the other short form datasets. At <code class="language-plaintext highlighter-rouge">d=5</code> there is minimal improvement (~4%), but <code class="language-plaintext highlighter-rouge">d=10</code> unlocks 35% speedups. We suspect that this is because structured generation requires longer draft windows to amortize verification costs, a pattern documented in recent work<sup id="fnref:chen" role="doc-noteref"><a href="#fn:chen" class="footnote" rel="footnote">6</a></sup> showing that syntactic tasks need sufficient lookahead to capture token dependencies. We confirmed these results with the <code class="language-plaintext highlighter-rouge">Llama 3.2 3B</code> model as well.</p>

<figure class="figure-row">
    <div><img src="/assets/images/elastic-speculation/latency_adaptive.png" alt="Figure 3a" /><figcaption>Llama 3.1 8B</figcaption></div>
    <div><img src="/assets/images/elastic-speculation/latency_adaptive_3b.png" alt="Figure 3b" /><figcaption>Llama 3.2 3B</figcaption></div>
</figure>

<blockquote>
  <p><strong>Figure 3</strong> Draft length sensitivity. Latency reduction confirms generalization across model scales (8B and 3B).</p>
</blockquote>

<p>Ultimately, this variability explains why no single draft length works universally. Our adaptive approach sidesteps this problem by adjusting draft length per-request based on observed acceptance rates and task-specific requirements: fewer verification rounds when speculation is effective, and less wasted draft compute when it is not.</p>

<h2 id="confidence-based-early-exit-cutting-speculative-kv-writes">Confidence-based early exit: cutting speculative KV writes</h2>

<p>The second component is <strong>confidence-based early exit</strong>, designed to reduce speculative KV writes. In standard speculative decoding, every drafted token writes to the KV cache. If a token is never accepted, that bandwidth was wasted. On hardware and workloads where decode is memory-bound, this is expensive.</p>

<p>Our goal is to avoid KV writes for speculative tokens that the draft model itself considers unlikely, while keeping (1) the loop structure compatible with CUDA graphs, and (2) the target model’s verification rule unchanged.</p>

<p>We’ve implemented the approach as follows:</p>

<ol>
  <li>For each speculative step, we compute a simple confidence score per sequence (the maximum predicted token probability).</li>
  <li>We maintain a <code class="language-plaintext highlighter-rouge">continue_mask</code> for sequences that should keep writing KV.</li>
  <li>On the <strong>next</strong> iteration, if a sequence has fallen below the confidence threshold, we mark its KV-write slot as padding.</li>
  <li>The KV-write stage treats padding slots as no-ops, so those tokens are <strong>skipped</strong>.</li>
</ol>

<p>All sequences still execute the same control flow and only the data (which slots get written) changes. The target model still evaluates whatever drafts are produced, so we are not weakening correctness checks.</p>
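
<p>A minimal sketch of the gating logic is below. The names (<code class="language-plaintext highlighter-rouge">continue_mask</code>, <code class="language-plaintext highlighter-rouge">PAD_SLOT</code>) and tensor shapes are stand-ins for the actual vLLM/EAGLE data structures; the point is that confidence computed at one step only gates the KV write on the following step, so the loop shape stays fixed and CUDA-graph friendly.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

PAD_SLOT = -1  # hypothetical sentinel the KV-write stage treats as a no-op

def gate_kv_writes(draft_probs, kv_slots, continue_mask, threshold=0.5):
    """Gate KV writes for low-confidence sequences without changing control flow.

    draft_probs:   [batch, vocab] draft distribution at this speculative step
    kv_slots:      [batch] slot index each sequence would write this step
    continue_mask: [batch] bool, carried over from the previous step
    """
    # Sequences already marked as stopped write to the padding slot (a no-op).
    gated_slots = torch.where(continue_mask, kv_slots, torch.full_like(kv_slots, PAD_SLOT))

    # Confidence = maximum predicted token probability per sequence.
    confidence = draft_probs.max(dim=-1).values

    # Once a sequence falls below threshold it stays stopped for this draft window;
    # the new mask only takes effect on the *next* iteration.
    next_mask = continue_mask &amp; (confidence &gt;= threshold)
    return gated_slots, next_mask
</code></pre></div></div>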

<h3 id="why-dram-savings-matter-at-scale">Why DRAM savings matter at scale</h3>

<p>Early exit functions as a <strong>bandwidth control knob</strong>: terminate low-confidence  speculations before writing full draft sequences to KV cache, trading local  compute overhead for reduced memory pressure.</p>

<p>This matters because KV cache dominates production inference. At scale (large  batches, long contexts), the decode phase is memory-bandwidth bound: research shows KV cache accounts for up to 73% of total memory in <code class="language-plaintext highlighter-rouge">LLaMA-7B</code> at <code class="language-plaintext highlighter-rouge">batch=32</code><sup id="fnref:sheng" role="doc-noteref"><a href="#fn:sheng" class="footnote" rel="footnote">7</a></sup>, and over 50% of attention kernel cycles stall on data access delays<sup id="fnref:memorygap" role="doc-noteref"><a href="#fn:memorygap" class="footnote" rel="footnote">8</a></sup>. Techniques that reduce KV cache bandwidth show 1.5-3.7× latency improvements in production (RocketKV, SQuat, Async KV  prefetch).</p>

<p>Our early exit mechanism cuts DRAM writes by stopping speculation when confidence drops below threshold—fewer draft tokens generated means fewer KV cache entries written. In bandwidth-limited stacks (large models, long contexts, multi-tenant serving), this enables higher batch throughput and prevents OOM conditions. The 1-5% per-request latency cost translates to net system-level gains when memory bandwidth, not compute, is the bottleneck.</p>

<h3 id="bandwidth-vs-latency-trade-off">Bandwidth vs latency trade-off</h3>

<p>Figure 4 shows the bandwidth-latency trade-off across thresholds <code class="language-plaintext highlighter-rouge">0.3</code>, <code class="language-plaintext highlighter-rouge">0.5</code>, and <code class="language-plaintext highlighter-rouge">0.7</code>. At <code class="language-plaintext highlighter-rouge">threshold=0.5</code>, early exit stops 50-65% of speculative tokens before KV cache writes, translating to roughly 50% fewer DRAM write operations in our NCU profiles. The cost: 1-3% higher end-to-end latency compared to no early exit.</p>

<p>This latency penalty emerges from the mechanics of speculation. When early exit terminates a draft sequence, fewer tokens are available for verification. Lower acceptance per round means more speculation rounds to generate the same output — and each additional round invokes the target model. On our compute-bound test hardware, this overhead dominates. But production deployments are bandwidth-bound at scale<sup id="fnref:sheng:1" role="doc-noteref"><a href="#fn:sheng" class="footnote" rel="footnote">7</a></sup>, where 50% DRAM savings enables higher batch throughput. The mechanism is the same — and production regimes are precisely where bandwidth constraints bite.</p>

<figure class="figure-row">
    <div><img src="/assets/images/elastic-speculation/latency_early.png" alt="Figure 4a" /><figcaption>Latency</figcaption></div>
    <div><img src="/assets/images/elastic-speculation/tokens_early.png" alt="Figure 4b" /><figcaption>KV Writes Saved</figcaption></div>
</figure>

<blockquote>
  <p><strong>Figure 4</strong> Early exit stops a threshold-proportional % of speculative tokens before KV cache writes. Trades 1-3% latency for ~50% bandwidth reduction; coding shows steepest penalty (-5.4%) at threshold=0.7.</p>
</blockquote>

<p>Figure 5 visualizes this relationship: higher stop rates correlate with larger latency penalties. Coding exhibits the steepest degradation at threshold=<code class="language-plaintext highlighter-rouge">0.7</code> (73.7% stop rate, -5.4% latency), while other datasets show smaller penalties — structured generation suffers most when speculation is aggressively curtailed.</p>

<p>The optimal threshold will ultimately depend on deployment context. Bandwidth-limited production stacks benefit from aggressive early exit (threshold=<code class="language-plaintext highlighter-rouge">0.5-0.7</code>) to prevent OOM and enable larger batches. Compute-bound scenarios favor conservative thresholds (<code class="language-plaintext highlighter-rouge">0.3</code>) or disabling early exit entirely. Our implementation exposes threshold as a tunable parameter for operators to match their hardware constraints.</p>

<p><img src="/assets/images/elastic-speculation/lat_tok_scatter.png" alt="Figure 5: Latency vs Bandwidth Trade-Off" /></p>

<blockquote>
  <p><strong>Figure 5</strong> Higher stop rates correlate with larger latency penalties on compute-bound hardware; optimal threshold depends on deployment context (Llama-3.1-8B-Instruct @ k=10).</p>
</blockquote>

<h2 id="maintaining-output-semantics-and-quality">Maintaining output semantics and quality</h2>

<p>Elastic Speculation necessarily changes which speculative tokens are proposed and accepted, so we do not expect or intend to achieve exact bitwise-identical outputs. However, we do still want to ensure the overall quality and correctness of the output semantics. After all, what’s the point of speeding up inference if all you get out is nonsense?</p>

<p>To quantify this difference, we systematically evaluated the outputs from adaptive draft length and early exit (together, elastic speculation) against standard speculative decoding (fixed-length <em>K</em>). We also compared both against vLLM running the target model alone with no speculation, to understand the relative semantic similarity and to ensure elastic speculation keeps our outputs in the same <strong>semantic regime</strong>.</p>

<p>Specifically, we evaluated the outputs under the following three criteria:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">BERTScore F1</code> (token-level semantic similarity)</li>
  <li><code class="language-plaintext highlighter-rouge">cosine similarity</code> (sentence-level via Sentence-BERT similarity)</li>
  <li>and a <code class="language-plaintext highlighter-rouge">reward model quality score</code> (human preference alignment)</li>
</ul>

<h3 id="bertscore-f1-context-aware-token-alignment">BERTScore F1 (Context-aware token alignment)</h3>

<p>BERTScore measures semantic equivalence by comparing contextualized token
embeddings from BERT-family models. Unlike surface-level string matching, it
captures whether two texts convey the same meaning even with different wording.</p>

<blockquote>
  <p><strong>How it works:</strong> The metric computes token-level similarity using contextual
embeddings from <em>microsoft/deberta-large-mnli</em><sup id="fnref:bertscore" role="doc-noteref"><a href="#fn:bertscore" class="footnote" rel="footnote">9</a></sup>, then aggregates via precision, recall, and F1-score. Each token in the candidate text is matched to its most similar token in the reference text based on cosine similarity in embedding space.</p>
</blockquote>
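
<p>For reference, the metric itself is a few lines with the <code class="language-plaintext highlighter-rouge">bert-score</code> package (a sketch assuming that package is installed; the checkpoint matches the footnote, everything else is illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from bert_score import score

def bertscore_f1(candidates, references):
    """Token-level semantic similarity between paired outputs.

    candidates: outputs from elastic speculation; references: baseline outputs.
    """
    # model_type pins the same DeBERTa MNLI checkpoint cited in the footnotes.
    precision, recall, f1 = score(
        candidates,
        references,
        model_type="microsoft/deberta-large-mnli",
        lang="en",
    )
    return f1.tolist()
</code></pre></div></div>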

<p>Both adaptive draft length and early exit maintain semantic fidelity: BERTScore F1 ranges from ~0.89 to 0.94 across all experiments. This places outputs well into the semantic equivalence regime—above the 0.90 threshold where texts convey identical meaning. For context, scores of 0.85-0.90 indicate paraphrase-level similarity, while values below 0.80 signal semantically different content.</p>

<figure class="figure-row">
    <div><img src="/assets/images/elastic-speculation/bert_adaptive.png" alt="Figure 6a" /><figcaption>Adaptive (BERTScore F1)</figcaption></div>
    <div><img src="/assets/images/elastic-speculation/bert_early.png" alt="Figure 6b" /><figcaption>Early Exit (BERTScore F1)</figcaption></div>
</figure>

<blockquote>
  <p><strong>Figure 6</strong> Adaptive draft length and early exit maintain BERTScore F1 &gt;0.88 and F1 &gt;0.95 respectively across all datasets, indicating semantic equivalence to baseline.</p>
</blockquote>

<h3 id="cosine-similarity-sentence-level-embeddings">Cosine Similarity (Sentence-Level Embeddings)</h3>

<p>Cosine similarity measures the angle between dense vector representations of
complete sentences, capturing overall semantic content at the document level
rather than token-by-token.</p>

<blockquote>
  <p><strong>How it works:</strong> We encode each output using Sentence-BERT<sup id="fnref:sbert" role="doc-noteref"><a href="#fn:sbert" class="footnote" rel="footnote">10</a></sup> (<em>all-mpnet-base-v2</em>), which produces a single 768-dimensional vector per text. The cosine similarity between corresponding baseline and optimized outputs quantifies semantic alignment.</p>
</blockquote>

<p>Cosine similarity between sentence embeddings confirms (and even exceeds) the BERTScore findings: adaptive draft length achieves &gt;0.95 similarity for all datasets, with SQuAD and coding measuring over 0.97 (Figure 7). Early exit maintains &gt;0.92 across thresholds. These scores place outputs well above the 0.85 threshold for semantic equivalence—effectively producing semantic duplicates of baseline outputs at the sentence level.</p>

\[\text{cosine similarity}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}\]

<p>where $u = \text{SentenceBERT}(\text{text}_1)$, $v =
\text{SentenceBERT}(\text{text}_2) \in \mathbb{R}^{768}$</p>

<p>For reference, scores of 0.70-0.85 indicate paraphrases with similar meaning, while values below 0.60 signal semantically divergent content. Our results demonstrate that neither elastic technique introduces meaningful semantic drift.</p>
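
<p>Measuring this is equally compact with <code class="language-plaintext highlighter-rouge">sentence-transformers</code> (a sketch assuming that package; the checkpoint is the one named above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sentence_transformers import SentenceTransformer, util

def sentence_cosine(text_a: str, text_b: str) -&gt; float:
    """Sentence-level cosine similarity between a baseline and an elastic output."""
    model = SentenceTransformer("all-mpnet-base-v2")  # 768-dim sentence embeddings
    embeddings = model.encode([text_a, text_b], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()
</code></pre></div></div>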

<figure class="figure-row">
    <div><img src="/assets/images/elastic-speculation/cosine_adaptive.png" alt="Figure 7a" /><figcaption>Adaptive (Cosine)</figcaption></div>
    <div><img src="/assets/images/elastic-speculation/cosine_early.png" alt="Figure 7b" /><figcaption>Early Exit (Cosine)</figcaption></div>
</figure>

<blockquote>
  <p><strong>Figure 7</strong> Adaptive draft length and early exit achieve &gt;0.94 sentence-level similarity across all thresholds and datasets.</p>
</blockquote>

<h3 id="reward-model-quality-score--human-preference-alignment">Reward Model Quality Score ∆ (Human Preference Alignment)</h3>

<p>The reward model measures output quality based on human preference alignment,
trained on datasets of human judgments about response quality. Unlike similarity
metrics, it evaluates absolute quality rather than just semantic equivalence.</p>

<blockquote>
  <p><strong>How it works:</strong> We used <em>OpenAssistant/reward-model-deberta-v3-large-v2</em><sup id="fnref:rewardmodel" role="doc-noteref"><a href="#fn:rewardmodel" class="footnote" rel="footnote">11</a></sup>, a <code class="language-plaintext highlighter-rouge">DeBERTa-v3-large</code> model fine-tuned on human preference data. The model scores each output on a continuous scale, predicting how humans would rate the response quality in terms of helpfulness, correctness, and coherence.</p>
</blockquote>

<p>This particular model scores outputs on helpfulness, correctness, and coherence as a proxy for human-perceived quality. The model outputs unbounded logit scores (typically -5 to +5 range), where higher values indicate better quality.</p>
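
<p>Scoring follows the standard <code class="language-plaintext highlighter-rouge">transformers</code> sequence-classification pattern (a sketch assuming the checkpoint from the footnote, with the prompt/response pairing described on its model card):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "OpenAssistant/reward-model-deberta-v3-large-v2"

def reward_score(prompt: str, response: str) -&gt; float:
    """Return the (unbounded) preference logit for a prompt/response pair."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()
</code></pre></div></div>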

<p>Figure 8 plots the quality score delta: elastic speculation minus baseline speculation, with both compared against no-speculation runs. Values hovering near zero indicate equivalent quality. Adaptive draft length shows deltas within ±0.15 across all datasets, while early exit maintains ±0.2 across thresholds. Paired t-tests confirm no statistically significant difference (p &gt; 0.85 across experiments). Mean raw scores are baseline = -2.505, adaptive = -2.513 — both producing equivalently high-quality outputs from a human preference perspective.</p>
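
<p>The significance check is a standard paired t-test over per-prompt reward scores, along the lines of the SciPy sketch below (the score lists are placeholders, not our measured values):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from scipy import stats

# Per-prompt reward scores for the same prompts under both decoders (placeholder values).
baseline_scores = [-2.1, -2.8, -2.4, -2.6, -2.7]
elastic_scores  = [-2.2, -2.7, -2.5, -2.6, -2.6]

t_stat, p_value = stats.ttest_rel(baseline_scores, elastic_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # large p means no significant difference
</code></pre></div></div>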

<figure class="figure-row">
    <div><img src="/assets/images/elastic-speculation/reward_adaptive.png" alt="Figure 8a" /><figcaption>Adaptive (Quality ∆)</figcaption></div>
    <div><img src="/assets/images/elastic-speculation/reward_early.png" alt="Figure 8b" /><figcaption>Early Exit (Quality ∆)</figcaption></div>
</figure>

<blockquote>
  <p><strong>Figure 8</strong> Quality deltas within ±0.15 confirm elastic speculation preserves human-perceived output quality; no statistically significant difference from baseline speculation (p&gt;0.85).</p>
</blockquote>

<p>Across all three metrics, elastic speculation preserves semantic quality. BERTScore &gt;0.94, cosine similarity &gt;0.95, and reward model deltas within ±0.2 confirm outputs match baseline speculation in both token-level fidelity and human-perceived quality.</p>

<p>To understand what “acceptable drift” looks like, we measured how much baseline speculation diverges from no-speculation runs. This gives us a reference: if speculation itself introduces some semantic variance, elastic variants should stay within that same range. They do — elastic spec vs. no-spec shows comparable deltas to baseline spec vs. no-spec (<em>not shown</em>). Our optimizations don’t add drift beyond what standard speculation already introduces. Finally, the 3B model replicates these findings across all metrics and conditions (not shown).</p>

<p>Note that the results shown use temperature=<code class="language-plaintext highlighter-rouge">0.0</code>. At temperature=<code class="language-plaintext highlighter-rouge">0.7</code>, scores drop for both baseline and elastic variants to similar degrees (<em>not shown</em>) — that’s just the nature of sampling-based generation. Your outputs get a little <em>spicy</em> but elastic is no worse than baseline speculation.</p>

<h2 id="concluding-remarks">Concluding Remarks</h2>

<p>Elastic Speculation makes speculative decoding <strong>responsive</strong> by adapting to workload characteristics and hardware constraints in real time. In our tests, that means up to <strong>~20-50% lower latency</strong> versus fixed-length <em>K</em> from adaptive draft length, and <strong>a proportional reduction in speculative KV writes</strong> based on the selected threshold for confidence-based early exit. It changes how tokens are generated, not necessarily the meaning of what gets generated, staying within the same semantic regime as standard speculative decoding in the recommended settings.</p>

<p>We are preparing a vLLM PR so you can try Elastic Speculation in your own deployments, tune it for your workloads, and see how it behaves at your scale. Please feel free to share your findings and/or implementations for other frameworks!</p>

<h3 id="citation">Citation</h3>

<p>Please cite this work as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Zhao, Ben and Iluvatar Labs, "Elastic Speculation: Adaptive Draft Length and
Confidence-Based Early Exit", Iluvatar Labs Blog, Nov 2025.
</code></pre></div></div>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:speculative" role="doc-endnote">
      <p>Leviathan, Y., Kalman, M., &amp; Matias, Y. (2023). “Fast Inference from Transformers via Speculative Decoding”. <em>Proceedings of the 40th International Conference on Machine Learning (ICML 2023)</em>, 19274-19286. arXiv:2211.17192 <a href="#fnref:speculative" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:eagle" role="doc-endnote">
      <p>Li, Y., Wei, F., Zhang, C., &amp; Zhang, H. (2024). “EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty”. <em>Proceedings of the 41st International Conference on Machine Learning (ICML 2024)</em>. arXiv:2401.15077 <a href="#fnref:eagle" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:vllm" role="doc-endnote">
      <p>Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., &amp; Stoica, I. (2023). “Efficient Memory Management for Large Language Model Serving with PagedAttention”. <em>Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP ‘23)</em>. arXiv:2309.06180 <a href="#fnref:vllm" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:kvbottleneck" role="doc-endnote">
      <p>Kwon et al. (2023) show that KV cache accounts for up to 73% of total memory in large-batch inference, with memory bandwidth becoming the primary bottleneck during decoding. <a href="#fnref:kvbottleneck" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:seqlen" role="doc-endnote">
      <p>Miao, X. et al. (2024). “Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding”. arXiv:2411.18462. The paper demonstrates that speculative decoding performance degrades as input length grows due to reduced draft accuracy. <a href="#fnref:seqlen" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:chen" role="doc-endnote">
      <p>Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., &amp; Jumper, J. (2023). “Accelerating Large Language Model Decoding with Speculative Sampling”. arXiv:2302.01318 <a href="#fnref:chen" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:sheng" role="doc-endnote">
      <p>Sheng, Y., Cao, S., Li, D., Hooper, C., Lee, N., Yang, S., Chou, C., Zhu, B., Zheng, L., Keutzer, K., Gonzalez, J. E., &amp; Stoica, I. (2024). “S-LoRA: Serving Thousands of Concurrent LoRA Adapters”. arXiv:2311.03285 <a href="#fnref:sheng" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:sheng:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:memorygap" role="doc-endnote">
      <p>Kim, J., Lee, M., &amp; Kim, S. (2024). “Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference”. arXiv:2503.08311 <a href="#fnref:memorygap" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:bertscore" role="doc-endnote">
      <p>Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., &amp; Artzi, Y. (2020). “BERTScore: Evaluating Text Generation with BERT”. <em>Proceedings of the 8th International Conference on Learning Representations (ICLR 2020)</em>. arXiv:1904.09675 <a href="#fnref:bertscore" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:sbert" role="doc-endnote">
      <p>Reimers, N., &amp; Gurevych, I. (2019). “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”. <em>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP)</em>, 3982-3992. arXiv:1908.10084 <a href="#fnref:sbert" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:rewardmodel" role="doc-endnote">
      <p>OpenAssistant/reward-model-deberta-v3-large-v2. Available at <a href="https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2">https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2</a>. <a href="#fnref:rewardmodel" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Ben Zhao · Roy Zhao · Justin Huang</name></author><category term="Research" /><summary type="html"><![CDATA[Elastic speculation delivers 30–50% lower latency and up to ~50% fewer speculative KV writes in our experiments, while preserving output quality.]]></summary></entry></feed>