<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://iluvatarlabs.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://iluvatarlabs.github.io/" rel="alternate" type="text/html" /><updated>2026-05-08T18:47:24+00:00</updated><id>https://iluvatarlabs.github.io/feed.xml</id><title type="html">Iluvatar Labs</title><subtitle>Reality-inspired AI architectures for the AGI era</subtitle><author><name>Iluvatar Labs</name><email>ben@iluvatarlabs.com</email></author><entry><title type="html">Introducing Actuator</title><link href="https://iluvatarlabs.github.io/blog/2026/04/introducing-actuator/" rel="alternate" type="text/html" title="Introducing Actuator" /><published>2026-04-01T00:00:00+00:00</published><updated>2026-04-01T00:00:00+00:00</updated><id>https://iluvatarlabs.github.io/blog/2026/04/introducing-actuator</id><content type="html" xml:base="https://iluvatarlabs.github.io/blog/2026/04/introducing-actuator/"><![CDATA[<p>Post-training has become the <strong>primary differentiation</strong> lever for AI labs in 2026.</p>

<p>For smaller labs, it’s an existential one. If you cannot differentiate, you get steamrolled by the frontier labs.<sup id="fnref:altman" role="doc-noteref"><a href="#fn:altman" class="footnote" rel="footnote">1</a></sup> For bigger labs, post-training is just as essential from a practical standpoint. It’s how models get adapted across product lines and deployment targets, and within cost constraints. And for companies in regulated industries, such as medical and legal applications, it is often a requirement, not a choice.</p>

<h2 id="the-problem-with-open-loop">The problem with open loop</h2>

<p>Despite the growing importance of model transformation, today’s post-training stack is still a fragmented, open-loop affair. Teams stitch together a patchwork of tools, launch runs with limited visibility into what is happening in flight, and only later discover the full cost of a bad tradeoff. That might mean degraded baseline capabilities, a failed compression pass, alignment that came at too high a price, or just another wasted cycle of compute and engineering time.</p>

<p>This is not only a startup problem. The same guess-and-check dynamic applies all the way up the stack. For smaller companies, it can mean losing the narrow window they had to differentiate, or discovering they cannot afford to differentiate at all. For bigger ones, it means slower iteration, higher costs, and more friction in getting models into the forms real products and deployments actually require.</p>

<h2 id="closing-the-loop">Closing the loop</h2>

<p>Actuator is a patent-pending closed-loop control layer for model transformation. It replaces the manual, open-loop process with continuous live monitoring, automatic training-time adjustments, and guardrails to keep your model transformations on track. When capability starts to drift, Actuator kicks in at training time to keep your model’s output from degrading, rather than letting the damage be discovered post-hoc. Quality in, quality out.</p>

<p>And not only does Actuator optimize your model transformation, it also makes post-training <strong>easy</strong>. It drops right into your existing stack and provides the unified end-to-end software layer you need to ship better models while skipping the pain.</p>

<h2 id="for-every-post-training-task">For every post-training task</h2>

<p>Actuator’s plug-and-play design means it can be used across varied applications, from distillation (better draft models) and compression (smarter, smaller models) to reinforcement learning (learning preferences without losing capabilities to the alignment tax). We’ve benchmarked Actuator on these tasks, and it preserved desired properties better than standard methods alone. Additional details are available on the <a href="/actuator/">Actuator</a> page.</p>

<p>Actuator is now in closed beta. If your team is running serious post-training and wants a better way to do it, please reach out! We’re excited to hear about what your team is working on and open to potential pilots or partnerships.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:altman" role="doc-endnote">
      <p>Sam Altman on <code class="language-plaintext highlighter-rouge">20VC</code> discussing how startups can get “steamrolled” as frontier models improve: <a href="https://lilys.ai/en/notes/374015">summary/transcript</a>. <a href="#fnref:altman" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Iluvatar Labs</name></author><category term="Product" /><summary type="html"><![CDATA[Actuator is a patent-pending closed-loop control layer for model transformation with live training-time monitoring and in-flight adjustment.]]></summary></entry><entry><title type="html">Divergent-Convergent Attention</title><link href="https://iluvatarlabs.github.io/blog/2026/03/divergent-convergent-attention/" rel="alternate" type="text/html" title="Divergent-Convergent Attention" /><published>2026-03-26T00:00:00+00:00</published><updated>2026-03-26T00:00:00+00:00</updated><id>https://iluvatarlabs.github.io/blog/2026/03/divergent-convergent-attention</id><content type="html" xml:base="https://iluvatarlabs.github.io/blog/2026/03/divergent-convergent-attention/"><![CDATA[<blockquote>
  <p><strong>The TL;DR:</strong> Divergent-Convergent Attention (DCA) improves compositional reasoning by maintaining multiple parallel attention perspectives before periodic learned consensus. On HotpotQA<sup id="fnref:hotpotqa" role="doc-noteref"><a href="#fn:hotpotqa" class="footnote" rel="footnote">1</a></sup>, DCA achieves <strong>5.4x higher exact match</strong> than a parameter-matched 90M baseline, and a <strong>215M DCA model outperforms a 355M standard transformer by 1.54x</strong> with fewer parameters and lower memory.</p>

  <p>Most notably, DCA assigns higher probability to the correct answer tokens on <strong>97.8% of examples</strong>, with the advantage sharply <strong>correlated with question difficulty</strong>, suggesting that DCA’s magic is in how distributed evidence is internally composed before decoding.</p>

  <p>DCA helps when relevant content is scattered across structurally independent documents. It does not help on sequential reasoning or single-source retrieval tasks where every perspective sees the same chain or location.</p>

  <p>(Note: This blog post reflects the latest manuscript version of this work.)</p>
</blockquote>

<h2 id="introduction">Introduction</h2>

<p>Standard transformers process multi-document input through a single attention stream, fusing heterogeneous evidence into one representation at every layer. RAG pipelines, long-context windows, and tasks like legal analysis or medical synthesis all require integrating information from structurally independent sources. A single stream must compromise between local precision and global reach at every layer. The result is premature fusion, where multi-document evidence is collapsed before the model can develop complementary views.</p>

<p>We introduce Divergent-Convergent Attention (DCA), a transformer variant that maintains K parallel attention streams at different scales and reconciles them only at scheduled consensus points. The novelty is not merely independent lanes or a late merge, but that those lanes are explicitly multi-horizon: short, medium, and long timescales that cultivate complementary perspectives before reconciliation.</p>

<p>DCA is inspired by an organizational principle in neuroscience: the brain concurrently maintains multiple oscillatory bands that only periodically couple to coordinate information<sup id="fnref:buzsaki" role="doc-noteref"><a href="#fn:buzsaki" class="footnote" rel="footnote">2</a></sup><sup id="fnref:canolty" role="doc-noteref"><a href="#fn:canolty" class="footnote" rel="footnote">3</a></sup>. Gamma (30-100 Hz) supports fast, local feature binding, analogous to our short horizon. Beta (13-30 Hz) integrates across nearby regions, our medium horizon. Theta (4-8 Hz) supports global synchronization, our long horizon<sup id="fnref:lisman" role="doc-noteref"><a href="#fn:lisman" class="footnote" rel="footnote">4</a></sup><sup id="fnref:colgin" role="doc-noteref"><a href="#fn:colgin" class="footnote" rel="footnote">5</a></sup>. DCA provides the computational analogue: separate processing streams that periodically synchronize via learned consensus.</p>

<p><img src="/assets/images/divergent-convergent-attention/neural_oscillation_dcr_analogy.svg" alt="Figure 1a: Neural oscillation analogy" /></p>

<blockquote>
  <p><strong>Figure 1a</strong> Biological multi-scale oscillations. Gamma, beta, and theta bands process at different scales and periodically couple to coordinate information. DCA maps these to three attention horizons.</p>
</blockquote>

<p>In controlled experiments, DCA achieves 5.4x higher exact match on multi-hop QA at 90M parameters (p &lt; 10^-6, 3 seeds). At 215M, DCA beats a 355M baseline by 1.54x with fewer parameters, approximately matched FLOPs, and less memory. We characterized the consensus mechanism through causal interventions at both scales. Despite the small capacity of these models, our force-decode analysis shows an unambiguous representational advantage in multi-document composition. DCA assigns higher probability to the correct answer tokens on 97.8% of all examples at 90M, and the advantage scales with difficulty, with 7.7x larger gains on the hardest examples.</p>

<h2 id="related-work">Related Work</h2>

<p>Multi-scale and sparse attention methods such as Longformer<sup id="fnref:longformer" role="doc-noteref"><a href="#fn:longformer" class="footnote" rel="footnote">6</a></sup>, BigBird<sup id="fnref:bigbird" role="doc-noteref"><a href="#fn:bigbird" class="footnote" rel="footnote">7</a></sup>, and RetNet<sup id="fnref:retnet" role="doc-noteref"><a href="#fn:retnet" class="footnote" rel="footnote">8</a></sup> combine local and global attention within a single stream, blending scales early or continuously. DCA maintains separate streams that develop independent representations before merging. Ring attention<sup id="fnref:ringattention" role="doc-noteref"><a href="#fn:ringattention" class="footnote" rel="footnote">9</a></sup> and flash attention<sup id="fnref:flashattention" role="doc-noteref"><a href="#fn:flashattention" class="footnote" rel="footnote">10</a></sup> address computational cost but not fusion timing; DCA is orthogonal and compatible with these methods.</p>

<p>Multi-path architectures provide useful precedents but differ in mechanism. ResNeXt<sup id="fnref:resnext" role="doc-noteref"><a href="#fn:resnext" class="footnote" rel="footnote">11</a></sup> established split-transform-merge for vision. Mixture of Experts<sup id="fnref:moe" role="doc-noteref"><a href="#fn:moe" class="footnote" rel="footnote">12</a></sup><sup id="fnref:switch" role="doc-noteref"><a href="#fn:switch" class="footnote" rel="footnote">13</a></sup> increases capacity through sparse routing. DCA differs in that all perspectives are always active and differentiated by attention scale rather than learned routing. The gated consensus mechanism uses Highway Network-style residual connections<sup id="fnref:highway" role="doc-noteref"><a href="#fn:highway" class="footnote" rel="footnote">14</a></sup> with periodic synchronization analogous to federated averaging<sup id="fnref:fedavg" role="doc-noteref"><a href="#fn:fedavg" class="footnote" rel="footnote">15</a></sup>.</p>

<p>HotpotQA requires composing information across two Wikipedia paragraphs among eight distractors. Encoder models at 110M-355M achieve substantially higher scores with bidirectional attention, while decoder-only models generally require 7B+ to reach around 30% EM<sup id="fnref:bert" role="doc-noteref"><a href="#fn:bert" class="footnote" rel="footnote">16</a></sup><sup id="fnref:longformer:1" role="doc-noteref"><a href="#fn:longformer" class="footnote" rel="footnote">6</a></sup><sup id="fnref:bigbird:1" role="doc-noteref"><a href="#fn:bigbird" class="footnote" rel="footnote">7</a></sup><sup id="fnref:roberta" role="doc-noteref"><a href="#fn:roberta" class="footnote" rel="footnote">17</a></sup><sup id="fnref:fireact" role="doc-noteref"><a href="#fn:fireact" class="footnote" rel="footnote">18</a></sup>. To our knowledge, no published decoder-only HotpotQA results exist between 90M and 7B parameters.</p>

<h2 id="the-architecture">The Architecture</h2>

<p>DCA replaces each transformer block with K parallel attention streams (“perspectives”), each operating at a different window size. In our experiments, K=3 with horizons [32, 128, 0], where 0 denotes full causal attention. Each perspective has its own QKV projection weights. All perspectives share a single MLP, with all paths always active (closer to ResNeXt’s split-transform-merge<sup id="fnref:resnext:1" role="doc-noteref"><a href="#fn:resnext" class="footnote" rel="footnote">11</a></sup> than to Mixture of Experts’ selective routing<sup id="fnref:moe:1" role="doc-noteref"><a href="#fn:moe" class="footnote" rel="footnote">12</a></sup>, and analogous to cross-scale pooling in multi-scale vision transformers<sup id="fnref:mvit" role="doc-noteref"><a href="#fn:mvit" class="footnote" rel="footnote">19</a></sup>). Every N layers, the perspectives merge via a Highway Network-style gate<sup id="fnref:highway:1" role="doc-noteref"><a href="#fn:highway" class="footnote" rel="footnote">14</a></sup>, a periodic synchronization analogous to federated averaging<sup id="fnref:fedavg:1" role="doc-noteref"><a href="#fn:fedavg" class="footnote" rel="footnote">15</a></sup>. This gate is content-dependent and learned, and the model discovers a depth-dependent strategy where early layers mostly pass through and late layers merge more fully, as shown later in the mechanistic analysis.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>consensus = mean(perspective_1, ..., perspective_K)
gate = sigmoid(W_g * RMSNorm(x))
output = (1 - gate) * x + gate * consensus
</code></pre></div></div>
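
<p>As a concrete illustration, here is a minimal PyTorch sketch of the consensus step above. The module name <code class="language-plaintext highlighter-rouge">HighwayConsensus</code> and the exact placement of the normalization are our choices for this sketch, not a prescribed implementation (it assumes a recent PyTorch with <code class="language-plaintext highlighter-rouge">nn.RMSNorm</code>; substitute <code class="language-plaintext highlighter-rouge">nn.LayerNorm</code> otherwise):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

class HighwayConsensus(nn.Module):
    """Gated merge of K perspective states back into the residual stream."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.RMSNorm(d_model)
        self.gate_proj = nn.Linear(d_model, d_model)

    def forward(self, x, perspectives):
        # x: [B, T, D] residual stream; perspectives: list of K [B, T, D] tensors
        consensus = torch.stack(perspectives, dim=0).mean(dim=0)
        gate = torch.sigmoid(self.gate_proj(self.norm(x)))
        # Highway-style blend: gate near 0 keeps the residual, near 1 takes full consensus
        return (1.0 - gate) * x + gate * consensus
</code></pre></div></div>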

<p>Note that while the implementation described here uses dense causal attention, DCA is a more general late-consensus primitive. The consensus mechanism operates on tensors, so any module that takes [B, T, D] and produces [B, T, D] can serve as a perspective. In this work, we use dense causal attention with different window sizes, but other sequence-processing modules (ring attention, linear attention, SSMs) could serve the same role.</p>
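
<p>To make the horizon notation concrete, here is a hedged sketch of the banded causal mask each perspective could use. The helper name and the boolean-mask convention are our assumptions; the long horizon simply drops the band constraint:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def windowed_causal_mask(seq_len, window, device=None):
    """Boolean [T, T] mask, True where attention is allowed.
    window=0 denotes full causal attention (the long horizon)."""
    i = torch.arange(seq_len, device=device).unsqueeze(1)  # query positions
    j = torch.arange(seq_len, device=device).unsqueeze(0)  # key positions
    causal = j &lt;= i
    if window == 0:
        return causal
    return causal &amp; (i - j &lt; window)

# one mask per perspective, matching the horizons [32, 128, 0] used in our runs
masks = [windowed_causal_mask(1024, w) for w in (32, 128, 0)]
</code></pre></div></div>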

<p><img src="/assets/images/divergent-convergent-attention/dcr_architecture_diagram.svg" alt="Figure 1b: DCA architecture" /></p>

<blockquote>
  <p><strong>Figure 1b</strong> DCA architecture. K=3 perspectives fork from the residual stream, process with separate attention and shared MLP, then merge via learned highway consensus. The cycle repeats every N layers.</p>
</blockquote>

<h3 id="design-tradeoffs">Design tradeoffs</h3>

<p>Full-fat DCA at baseline width (K=3 at d=1024) costs 3x VRAM and ~2.7x FLOPs. Bottleneck projections let perspectives operate at d_lane=512 inside d_model=1024. The key math is that 3 x 512^2 &lt; 1024^2, so K=3 perspectives at d=512 are cheaper per layer than a single stream at d=1024. This replaces the role that global tokens play in Longformer and BigBird<sup id="fnref:longformer:2" role="doc-noteref"><a href="#fn:longformer" class="footnote" rel="footnote">6</a></sup><sup id="fnref:bigbird:2" role="doc-noteref"><a href="#fn:bigbird" class="footnote" rel="footnote">7</a></sup> in a causal-compatible way; global tokens in causal decoders are functionally vacuous since position 0 can only attend to itself. Per-perspective gradient checkpointing reduces activation memory from ~3x baseline to below baseline levels. We scale by adding layers (30L) at the cheap d=512 perspective width rather than widening to d=1024.</p>
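
<p>The inequality is easy to sanity-check. A minimal sketch, counting only one projection weight matrix per stream and ignoring the fork/merge projections and the shared MLP:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># per-projection weight counts at lane width vs. baseline width
k, d_lane, d_model = 3, 512, 1024
lanes = k * d_lane ** 2       # 786,432 weights across three bottlenecked perspectives
single = d_model ** 2         # 1,048,576 weights for a single full-width stream
print(lanes, single, lanes / single)  # 786432 1048576 0.75
</code></pre></div></div>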

<blockquote>
  <p><strong>Table 1</strong> DCA design space. Theoretical and tested variants with FLOP and VRAM tradeoffs.</p>
</blockquote>

<table class="data-table">
  <thead>
    <tr>
      <th>Variant</th>
      <th>d_model</th>
      <th>d_lane</th>
      <th>Params</th>
      <th>FLOP ratio</th>
      <th>VRAM vs baseline</th>
      <th>MLP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Baseline</td>
      <td>1024</td>
      <td>–</td>
      <td>355M</td>
      <td>1.0x</td>
      <td>1.0x</td>
      <td>1 stream</td>
    </tr>
    <tr>
      <td>Full-fat (K=3)</td>
      <td>1024</td>
      <td>1024</td>
      <td>556M</td>
      <td>2.71x</td>
      <td>~3x</td>
      <td>shared</td>
    </tr>
    <tr>
      <td>DCA-215M</td>
      <td>1024</td>
      <td>512</td>
      <td>215M</td>
      <td>1.24x</td>
      <td>~0.8x</td>
      <td>shared</td>
    </tr>
    <tr>
      <td>DCA-215M + separate MLPs</td>
      <td>1024</td>
      <td>512</td>
      <td>341M</td>
      <td>1.24x</td>
      <td>~0.8x</td>
      <td>K weights</td>
    </tr>
  </tbody>
</table>

<h2 id="benchmarks">Benchmarks</h2>

<h3 id="hotpotqa-at-90m-wikitext-103">HotpotQA at 90M (WikiText-103)</h3>

<style>
.hotpotqa-figure {
  max-width: 720px;
  margin: 1.5rem auto;
  padding: 1.25rem;
  border: 1px solid #2a2a2a;
  border-radius: 10px;
  background: #141414;
  color: #e8e8e8;
}

.hotpotqa-figure * {
  box-sizing: border-box;
}

.hotpotqa-figure .stack {
  display: flex;
  flex-direction: column;
  gap: 4px;
}

.hotpotqa-figure .dist-row {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 4px;
}

.hotpotqa-figure .para {
  display: flex;
  border-radius: 5px;
  overflow: hidden;
}

.hotpotqa-figure .para-body {
  flex: 1;
  padding: 8px 14px;
  font-size: 12px;
  line-height: 1.5;
  border: 1px solid #333;
  border-right: none;
  border-radius: 5px 0 0 5px;
  background: #222;
  color: #aaa;
}

.hotpotqa-figure .para-body b {
  font-weight: 600;
  color: #e0e0e0;
}

.hotpotqa-figure .para.gold .para-body {
  border-color: #0f6e56;
  background: rgba(93, 202, 165, 0.04);
}

.hotpotqa-figure .para.gold .para-body .evidence {
  font-weight: 600;
  color: #5dcaa5;
}

.hotpotqa-figure .tag {
  width: 32px;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 8px;
  font-weight: 600;
  letter-spacing: 0.6px;
  text-transform: uppercase;
  writing-mode: vertical-rl;
  text-orientation: mixed;
  flex-shrink: 0;
  border-radius: 0 5px 5px 0;
}

.hotpotqa-figure .tag.dist {
  background: #333;
  color: #666;
}

.hotpotqa-figure .tag.gold-tag {
  background: #0f6e56;
  color: #d4f5e9;
}

.hotpotqa-figure .qa {
  margin-top: 16px;
  padding: 16px;
  border-radius: 8px;
  border: 1px solid #444;
}

.hotpotqa-figure .qa-label,
.hotpotqa-figure .qa-a-label {
  font-size: 14px;
  font-weight: 600;
}

.hotpotqa-figure .qa-q {
  font-size: 14px;
  font-weight: 400;
  display: inline;
  line-height: 1.4;
}

.hotpotqa-figure .qa-a-row {
  display: flex;
  align-items: baseline;
  gap: 8px;
  margin-top: 8px;
}

.hotpotqa-figure .qa-a {
  font-size: 14px;
  font-weight: 400;
  color: #5dcaa5;
}

.hotpotqa-figure .qa-a-detail {
  font-size: 13px;
  color: #888;
}

@media (max-width: 768px) {
  .hotpotqa-figure {
    padding: 0.9rem;
  }

  .hotpotqa-figure .dist-row {
    grid-template-columns: 1fr;
  }
}
</style>

<div class="hotpotqa-figure">
  <div class="stack">
    <div class="dist-row">
      <div class="para"><div class="para-body"><b>The Hurt Locker</b> - A 2008 war thriller about an Iraq War EOD team, directed by Kathryn Bigelow...</div><div class="tag dist">distract</div></div>
      <div class="para"><div class="para-body"><b>Kathryn Bigelow</b> - An American filmmaker known for directing horror, action, and thriller films...</div><div class="tag dist">distract</div></div>
    </div>

    <div class="para gold"><div class="para-body"><b>Zero Dark Thirty</b> - A 2012 action thriller directed by Kathryn Bigelow dramatizing the decade-long manhunt for Osama bin Laden. <span class="evidence">It received five Academy Award nominations</span>, including Best Picture and Best Actress.</div><div class="tag gold-tag">gold</div></div>

    <div class="dist-row">
      <div class="para"><div class="para-body"><b>Jessica Chastain</b> - An American actress and film producer, studied at the Juilliard School...</div><div class="tag dist">distract</div></div>
      <div class="para"><div class="para-body"><b>Mark Boal</b> - An American screenwriter and journalist. Best known for writing "The Hurt Locker"...</div><div class="tag dist">distract</div></div>
    </div>

    <div class="dist-row">
      <div class="para"><div class="para-body"><b>Argo (2012 film)</b> - A 2012 historical drama directed by Ben Affleck about the rescue of six U.S. diplomats...</div><div class="tag dist">distract</div></div>
      <div class="para"><div class="para-body"><b>Denis Villeneuve</b> - A Canadian filmmaker acclaimed for "Prisoners," "Sicario," and "Dune"...</div><div class="tag dist">distract</div></div>
    </div>

    <div class="para gold"><div class="para-body"><b>Arrival (film)</b> - A 2016 science fiction drama directed by Denis Villeneuve, adapted from Ted Chiang's "Story of Your Life." <span class="evidence">It received eight Academy Award nominations</span>, including Best Picture and Best Director, winning Best Sound Editing.</div><div class="tag gold-tag">gold</div></div>

    <div class="dist-row">
      <div class="para"><div class="para-body"><b>Ted Chiang</b> - An American science fiction writer whose work has won four Nebula and four Hugo Awards...</div><div class="tag dist">distract</div></div>
      <div class="para"><div class="para-body"><b>Eric Heisserer</b> - An American screenwriter who adapted "Story of Your Life" into "Arrival"...</div><div class="tag dist">distract</div></div>
    </div>
  </div>

  <div class="qa">
    <div><span class="qa-label">Question (requires both gold paragraphs): <br /></span><span class="qa-q">Which film received more Academy Award nominations, Zero Dark Thirty or Arrival?</span></div>
    <div class="qa-a-row">
      <span class="qa-a-label">Answer:</span>
      <span class="qa-a">Arrival</span>
      <span class="qa-a-detail">(8 nominations vs 5)</span>
    </div>
  </div>
</div>

<blockquote>
  <p><strong>Figure 2</strong> HotpotQA distractor setting. 10 paragraphs per question: 2 supporting (gold), 8 distractors (gray). The answer requires composing information from both gold paragraphs scattered among topically similar distractors.</p>
</blockquote>

<p>We pretrained DCA (89M params) and a parameter-matched baseline (90M params) on WikiText-103<sup id="fnref:wikitext" role="doc-noteref"><a href="#fn:wikitext" class="footnote" rel="footnote">20</a></sup> for 50K steps, then finetuned both on HotpotQA across three seeds. Though DCA is modestly worse on WikiText-103 validation perplexity (21.48 vs 20.79, ~3%), the benefit on long reasoning is asymmetric. DCA achieves 5.4x higher exact match on HotpotQA (1.56% vs 0.29%, Table 2), with p &lt; 10^-6 and odds ratio 5.49 (Fisher exact, pooled across seeds). DCA outperformed every baseline variant we tested, across both 50K and 30K pretrain budgets (Appendix H).</p>

<h3 id="scaling-to-pg-19-and-architectural-exploration">Scaling to PG-19 and architectural exploration</h3>

<p>The 90M result raises a natural question: does the advantage hold at larger scale? While the relative advantage is clear, absolute performance of both models is low (1.56% and 0.29% EM). WT103 is too small for 350M-class models, so we switched to PG-19<sup id="fnref:pg19" role="doc-noteref"><a href="#fn:pg19" class="footnote" rel="footnote">21</a></sup> (3B tokens) following standard conventions.</p>

<p>To calibrate the effect of pretraining domain, we also trained DCA 90M on PG-19 (EM=0.38%, compared to 1.56% on WT103). The 350M standard baseline on PG-19 achieves 0.93% EM, indicating that even at 4x the parameters, standard decoders remain poor at multi-hop QA. A FLOP-comparable DCA-215M on PG-19 achieves 1.43% EM vs the baseline’s 0.93% (1.54x), with 39% fewer parameters and less VRAM (~35 vs ~45 GB). Within PG-19, scaling DCA from 90M to 215M improves EM from 0.38% to 1.43%, surpassing the 350M baseline by 1.54x.</p>

<blockquote>
  <p><strong>Table 2</strong> HotpotQA results across scales and pretraining domains. All models finetuned and evaluated on HotpotQA.</p>
</blockquote>

<p><strong>WT103 pretraining (90M):</strong></p>

<table class="data-table">
  <thead>
    <tr>
      <th>Model</th>
      <th>Params</th>
      <th>FLOP ratio</th>
      <th>VRAM</th>
      <th>EM%</th>
      <th>F1%</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Baseline 90M</td>
      <td>90M</td>
      <td>~1.0x</td>
      <td>~4 GB</td>
      <td>0.29</td>
      <td>7.77</td>
    </tr>
    <tr>
      <td>DCA 90M</td>
      <td>89M</td>
      <td>~1.56x</td>
      <td>~8 GB</td>
      <td>1.56</td>
      <td>14.40</td>
    </tr>
  </tbody>
</table>

<p><strong>PG-19 pretraining (up to 350M):</strong></p>

<table class="data-table">
  <thead>
    <tr>
      <th>Model</th>
      <th>Params</th>
      <th>FLOP ratio</th>
      <th>VRAM</th>
      <th>EM%</th>
      <th>F1%</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DCA 90M</td>
      <td>89M</td>
      <td>~1.56x</td>
      <td>~8 GB</td>
      <td>0.38</td>
      <td>7.78</td>
    </tr>
    <tr>
      <td>Baseline 350M</td>
      <td>355M</td>
      <td>1.0x</td>
      <td>~45 GB</td>
      <td>0.93</td>
      <td>10.81</td>
    </tr>
    <tr>
      <td>DCA-215M</td>
      <td>215M</td>
      <td>1.24x</td>
      <td>~35 GB</td>
      <td>1.43</td>
      <td>11.32</td>
    </tr>
  </tbody>
</table>

<h3 id="architectural-exploration">Architectural exploration</h3>

<p>Scaling provided an opportunity to test which components of DCA are essential. Our 90M baseline uses per-head window assignment (heads 0-2 at w=32, heads 3-5 at w=128, heads 6-7 full causal), achieving the best perplexity among baseline variants (20.79) but only 0.29% EM on HotpotQA. A factorial experiment confirmed parallel streams are the primary mechanism; multi-scale windows are secondary. Two other architectural properties proved essential. Shared QKV weights collapse perspective diversity (cosine similarity &gt;0.9 vs 0.2-0.4 with separate weights). Consensus at every layer (k=1) drops EM to 1.05% vs 1.59% with consensus every 6 layers (k=6).</p>

<p>These results motivated the DCA-215M design. A narrower variant at d=768 (323M params) achieved only EM=0.57%, undertrained at 6.2 tokens per parameter. A variant without per-perspective MLP (302M params, 1.07x FLOPs) achieved EM=1.21% (1.30x over baseline). Separate MLP weights (139M params, same FLOPs) achieved PPL=20.45 but EM=0.80%, confirming the shared MLP acts as a regularizer.</p>

<blockquote>
  <p><strong>Table 3</strong> Architectural variant results. All evaluated on HotpotQA.</p>
</blockquote>

<table class="data-table">
  <thead>
    <tr>
      <th>Model</th>
      <th>Params</th>
      <th>d_model</th>
      <th>d_lane</th>
      <th>EM%</th>
      <th>VRAM</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DCA 90M (full-fat)</td>
      <td>89M</td>
      <td>512</td>
      <td>512</td>
      <td>1.56</td>
      <td>~8 GB</td>
    </tr>
    <tr>
      <td>DCA-d768</td>
      <td>323M</td>
      <td>768</td>
      <td>768</td>
      <td>0.57</td>
      <td>~45 GB</td>
    </tr>
    <tr>
      <td>DCA-noMLP</td>
      <td>302M</td>
      <td>1024</td>
      <td>768</td>
      <td>1.21</td>
      <td>~35 GB</td>
    </tr>
    <tr>
      <td>DCA-215M (bottleneck)</td>
      <td>215M</td>
      <td>1024</td>
      <td>512</td>
      <td>1.43</td>
      <td>~35 GB</td>
    </tr>
    <tr>
      <td>DCA 90M (separate MLPs)</td>
      <td>139M</td>
      <td>512</td>
      <td>512</td>
      <td>0.80</td>
      <td>~10 GB</td>
    </tr>
  </tbody>
</table>

<p>The DCA-215M results (Tables 2 and 3) confirm this design is practical and competitive.</p>

<h3 id="what-is-dca-well-suited-for">What is DCA well suited for?</h3>

<p>We selected benchmarks to test where DCA should help, not to maximize wins. Few existing tasks isolate distributed-source composition while remaining tractable for sub-billion-parameter decoder-only models, so HotpotQA serves as the primary stress test, 2Wiki as secondary corroboration, and the remaining tasks as negative controls.</p>

<p><img src="/assets/images/divergent-convergent-attention/information_topology_panel_b.svg" alt="Figure 3: Information topology" /></p>

<blockquote>
  <p><strong>Figure 3</strong> Information topology. DCA helps when relevant content is distributed across independent documents (left). It does not help when information forms a single chain (right).</p>
</blockquote>

<p>Sequential reasoning tasks (bAbI<sup id="fnref:babi" role="doc-noteref"><a href="#fn:babi" class="footnote" rel="footnote">22</a></sup>, Tree pathfinding<sup id="fnref:treepath" role="doc-noteref"><a href="#fn:treepath" class="footnote" rel="footnote">23</a></sup>, PrOntoQA<sup id="fnref:prontoqa" role="doc-noteref"><a href="#fn:prontoqa" class="footnote" rel="footnote">24</a></sup>, LEGO<sup id="fnref:lego" role="doc-noteref"><a href="#fn:lego" class="footnote" rel="footnote">25</a></sup>) show no advantage; all facts lie in a single flat sequence and every attention scale sees the same chain. Single-source tasks (TriviaQA<sup id="fnref:triviaqa" role="doc-noteref"><a href="#fn:triviaqa" class="footnote" rel="footnote">26</a></sup>, LAMBADA<sup id="fnref:lambada" role="doc-noteref"><a href="#fn:lambada" class="footnote" rel="footnote">27</a></sup>, MQAR<sup id="fnref:mqar" role="doc-noteref"><a href="#fn:mqar" class="footnote" rel="footnote">28</a></sup>) show no advantage; all perspectives see the same content. Tasks beyond model capacity (MuSiQue<sup id="fnref:musique" role="doc-noteref"><a href="#fn:musique" class="footnote" rel="footnote">29</a></sup>) show both models at floor. 2WikiMultiHopQA<sup id="fnref:twowiki" role="doc-noteref"><a href="#fn:twowiki" class="footnote" rel="footnote">30</a></sup> provides weak corroboration (Soft EM p = 0.004, EM ns).</p>

<p>DCA helps when relevant information is distributed across structurally independent segments, what we refer to as the information topology of the input, and does not help when information forms a single chain or resides at a single location. Within HotpotQA, the advantage is uniform across bridge questions (sequential logic, OR=4.65) and comparison questions (parallel logic, OR=4.51), indicating that multi-document context, not reasoning pattern, is the key factor.</p>

<h2 id="mechanistic-analysis">Mechanistic Analysis</h2>

<h3 id="force-decode-the-representation-advantage">Force-decode: the representation advantage</h3>

<p>To separate representation quality from generation dynamics, we feed the context to both models and force-decode the gold answer tokens (teacher-forcing), recording each model’s log-probability of the correct token at each position. For each of 6,359 validation examples, we compare which model assigns higher probability to the gold answer.</p>
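
<p>A hedged sketch of that measurement, assuming an HF-style causal LM whose forward pass exposes <code class="language-plaintext highlighter-rouge">logits</code> (the function name and the tokenized inputs are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn.functional as F

@torch.no_grad()
def gold_answer_logprob(model, context_ids, answer_ids):
    """Sum of log-probabilities assigned to the gold answer tokens when
    teacher-forced after the context. Both inputs are 1D LongTensors."""
    input_ids = torch.cat([context_ids, answer_ids]).unsqueeze(0)  # [1, T]
    logits = model(input_ids).logits                               # [1, T, V]
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = input_ids[0, 1:]
    token_lp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return token_lp[-answer_ids.numel():].sum().item()

# per-example comparison; positive values mean DCA assigns higher probability
# advantage = gold_answer_logprob(dca, ctx, ans) - gold_answer_logprob(baseline, ctx, ans)
</code></pre></div></div>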

<p><strong>Table 4: Force-decode results (90M, WT103).</strong> Top: paired comparison across all 6,359 validation examples (Wilcoxon signed-rank p &lt; 10^-300). Bottom: advantage by baseline difficulty quintile.</p>

<table class="data-table">
  <thead>
    <tr>
      <th>Slice</th>
      <th>DCA advantage</th>
      <th>DCA win rate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Overall</td>
      <td>+6.25 nats (~520x)</td>
      <td>97.8% (6,217/6,359)</td>
    </tr>
    <tr>
      <td>0-20% (easiest)</td>
      <td>+1.84 nats</td>
      <td>96.5%</td>
    </tr>
    <tr>
      <td>20-40%</td>
      <td>+3.20 nats</td>
      <td>96.2%</td>
    </tr>
    <tr>
      <td>40-60%</td>
      <td>+4.75 nats</td>
      <td>96.5%</td>
    </tr>
    <tr>
      <td>60-80%</td>
      <td>+7.25 nats</td>
      <td>99.6%</td>
    </tr>
    <tr>
      <td>80-100% (hardest)</td>
      <td>+14.23 nats</td>
      <td>100%</td>
    </tr>
  </tbody>
</table>

<p>The representation advantage is near-universal: DCA produces better internal representations on 97.8% of all examples, not just the 1.6% where EM=1. The advantage correlates with baseline difficulty (r=-0.888): 7.7x larger on the hardest examples than the easiest (Table 4, Figure 4). The harder an example is for a standard transformer, the more DCA’s multi-perspective consensus improves the representation.</p>

<p><img src="/assets/images/divergent-convergent-attention/fig5_difficulty_scaling.svg" alt="Figure 4: Force-decode advantage by baseline difficulty quintile" /></p>

<blockquote>
  <p><strong>Figure 4</strong> DCA 90M vs Baseline 90M (both WT103). DCA’s representational advantage scales with example difficulty. On the hardest quintile (where the baseline assigns the lowest probability to the correct answer), DCA’s advantage is +14.23 nats. On the easiest, +1.84 nats. r=-0.888.</p>
</blockquote>

<p>Recent work on latent multi-hop reasoning finds that while bridge-entity recall scales smoothly with model size, the compositional second hop does not, suggesting composition is a structural bottleneck rather than a capacity problem<sup id="fnref:yanglatent" role="doc-noteref"><a href="#fn:yanglatent" class="footnote" rel="footnote">31</a></sup>. That work studies parametric knowledge recall; DCA’s setting differs in that all relevant information is provided in context. Nevertheless, our force-decode result is consistent with the broader view that end-to-end exact match may understate the gradual development of multi-hop structure in model representations: at 90M, the correct answer is already encoded with substantially higher probability under the right architecture, even where end-to-end EM remains near floor. Confirming this connection would require targeted compositional probes, such as entity-recall scores and causal interventions on bridge entities, applied directly to DCA’s internal representations.</p>

<h3 id="same-retrieval-better-composition">Same retrieval, better composition</h3>

<p>We computed mean token recall and derived an approximate token precision from aggregate F1 and recall on generated predictions (90M, WT103, pooled across seeds 137 and 2024). Token Recall is essentially identical (~57.5% in both models). Token Precision, derived via P = F1·R / (2R - F1), shows the full advantage: ~8.2% vs ~4.2% (~2.0x). The advantage appears to come primarily from composition rather than token-level recall.</p>
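
<p>The derived precision is just algebra on the F1 definition; a minimal check using the aggregate numbers quoted above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def precision_from_f1_recall(f1, recall):
    # invert F1 = 2PR / (P + R) for P
    return f1 * recall / (2 * recall - f1)

# aggregate F1 and recall from the 90M WT103 runs reported above
print(precision_from_f1_recall(0.144, 0.575))   # ~0.082 (DCA)
print(precision_from_f1_recall(0.0777, 0.575))  # ~0.042 (baseline)
</code></pre></div></div>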

<blockquote>
  <p><strong>Figure 5</strong> DCA 90M vs Baseline 90M (both WT103). Token Recall is essentially identical (~57.5%). Token Precision (derived from aggregate F1 and recall) shows the full advantage (~8.2% vs ~4.2%).</p>
</blockquote>

<p><img src="/assets/images/divergent-convergent-attention/fig3_token_recall_precision.svg" alt="Figure 5: Token Recall/Precision" /></p>

<p>First-sentence extraction approximately decomposes this into two components. The advantage that survives extraction (~1.2-1.5x) reflects compositional integration at the representation level. The remaining multiplier (~2-3x) reflects generation coherence, scaling with answer length (3x at 1 token, 12.8x at 4+ tokens). These ranges are inferred from comparing first-sentence and full-output EM ratios, not independently measured.</p>

<h3 id="gate-ablation-consensus-is-essential-and-precisely-tuned">Gate ablation: consensus is essential and precisely tuned</h3>

<p>We force the consensus gate to fixed values during full QA evaluation using forward hooks. Gate=0 clamps the sigmoid to 0.001 (bypass consensus). Gate=1 clamps to 0.999 (force full consensus).</p>
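
<p>A minimal sketch of that intervention, assuming the post-sigmoid gate is produced by an identifiable submodule at each consensus layer (the module names are illustrative, and the exact hook point depends on where the sigmoid lives in the implementation):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def clamp_gates(model, value, gate_module_names):
    """Register forward hooks that overwrite the learned gate with a constant.
    value=0.001 bypasses consensus; value=0.999 forces full consensus."""
    handles = []
    for name, module in model.named_modules():
        if name in gate_module_names:
            def hook(mod, inputs, output, v=value):
                return torch.full_like(output, v)  # replace the gate tensor
            handles.append(module.register_forward_hook(hook))
    return handles  # call handle.remove() on each to restore the learned gates
</code></pre></div></div>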

<p><strong>Table 5: Gate ablation and learned gate values.</strong> Top: forcing the gate to fixed values during QA evaluation (values are counts of exact-match correct answers). Bottom: learned gate values at consensus layers.</p>

<table class="data-table">
  <thead>
    <tr>
      <th>Condition</th>
      <th>90M</th>
      <th>215M</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Learned gates</td>
      <td>101</td>
      <td>91</td>
    </tr>
    <tr>
      <td>Gate=1 (force full)</td>
      <td>18</td>
      <td>0</td>
    </tr>
    <tr>
      <td>Baseline</td>
      <td>12</td>
      <td>59</td>
    </tr>
    <tr>
      <td>Gate=0 (bypass)</td>
      <td>2</td>
      <td>2</td>
    </tr>
  </tbody>
</table>

<table class="data-table">
  <thead>
    <tr>
      <th>Consensus layer</th>
      <th>90M gate</th>
      <th>215M gate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Layer 5 (1st)</td>
      <td>0.29</td>
      <td>0.31</td>
    </tr>
    <tr>
      <td>Layer 11 (final @ 90M)</td>
      <td>0.99</td>
      <td>0.21</td>
    </tr>
    <tr>
      <td>Layer 17 (3rd)</td>
      <td>–</td>
      <td>0.28</td>
    </tr>
    <tr>
      <td>Layer 23 (4th)</td>
      <td>–</td>
      <td>0.35</td>
    </tr>
    <tr>
      <td>Layer 29 (final)</td>
      <td>–</td>
      <td>0.37</td>
    </tr>
  </tbody>
</table>

<p>Bypassing consensus collapses performance from 101 to 2 correct at 90M and 91 to 2 at 215M (Table 5). Forcing full consensus drops 101 to 18 at 90M and 91 to 0 at 215M, worse than the 350M baseline (59 correct), indicating that forced consensus is actively destructive to the representations DCA has learned to build through gradual integration.</p>

<p>At 90M the learned strategy is binary: passthrough early (0.29), full commit at the final layer (0.99). At 215M it is gradual and never exceeds 0.37. Both strategies are load-bearing, and disrupting either destroys performance.</p>

<h3 id="perspective-divergence-and-attention-patterns">Perspective divergence and attention patterns</h3>

<p>Perspectives develop genuinely distinct representations (cosine similarity 0.21 between local and medium at layer 5), complementary rather than redundant (Appendix Table 8).</p>

<blockquote>
  <p><strong>Figure 6</strong> DCA 90M vs Baseline 90M (both WT103), consensus layer 5. Each DCA perspective specializes at a different scale, while the baseline compromises at 0.34.</p>
</blockquote>

<p><img src="/assets/images/divergent-convergent-attention/fig4_cross_doc_attention.svg" alt="Figure 6: Cross-document attention fraction" /></p>

<p>Attention measurements (computed on EM=1 examples, n=101) confirm the specialization. The local perspective keeps 96% of attention within paragraphs (cross-document fraction 0.04), while the global perspective distributes 68% across documents. The baseline sits at 0.34. With DCA, local perspectives extract precise within-document content, global perspectives maintain cross-document context, and consensus integrates both. The baseline attends at multiple scales within a single residual stream, but must reconcile those scales within one shared representation.</p>
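
<p>A hedged sketch of the cross-document fraction, assuming we can read out a [T, T] attention matrix and a per-token paragraph id (names and shapes are our assumptions):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def cross_document_fraction(attn, doc_ids):
    """attn: [T, T] attention weights (each row sums to 1 over valid keys).
    doc_ids: [T] LongTensor mapping each token to its source paragraph."""
    same_doc = doc_ids.unsqueeze(1).eq(doc_ids.unsqueeze(0))  # [T, T] boolean
    cross_mass = (attn * (~same_doc)).sum()
    return (cross_mass / attn.sum()).item()
</code></pre></div></div>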

<h2 id="conclusion">Conclusion</h2>

<p>Multi-document composition is a documented bottleneck for production LLMs. RAG pipelines retrieve relevant documents but fail to synthesize across them<sup id="fnref:rag" role="doc-noteref"><a href="#fn:rag" class="footnote" rel="footnote">32</a></sup>. Models fail to use information in the middle of long contexts<sup id="fnref:lostmiddle" role="doc-noteref"><a href="#fn:lostmiddle" class="footnote" rel="footnote">33</a></sup>. Multi-hop reasoning may require 30-70B parameters to emerge in standard transformers<sup id="fnref:steelekatz" role="doc-noteref"><a href="#fn:steelekatz" class="footnote" rel="footnote">34</a></sup>. With DCA, we sought to demonstrate that parallel multi-scale perspectives with periodic late consensus can improve on these deficiencies.</p>

<p>Despite the limited capacity of our models, extensive benchmarking demonstrated a consistent advantage on distributed-source tasks (5.4x EM at 90M, 1.54x at 215M FLOP-comparable) and no advantage on sequential, single-source, or capacity-limited tasks. At 90M, the resulting representations encode multi-document relationships better than a parameter-matched standard transformer on 97.8% of examples, with 7.7x larger gains on the hardest examples, reflecting an advantage in composition rather than retrieval.</p>

<p>Consensus frequency, horizon widths, and gate training dynamics were fixed throughout our experiments, leaving substantial room for task-specific tuning. Because DCA is fundamentally a primitive that operates on tensors, other sequence-processing modules (ring attention, linear attention, SSMs) could serve as perspectives, opening a combinatorial design space we have only begun to explore. The force-decode diagnostic is itself useful beyond DCA, offering a general tool for determining whether the bottleneck in a given architecture is understanding or expression. We will be sharing the base code for DCA on GitHub.</p>

<hr />

<h2 id="appendix">Appendix</h2>

<h3 id="hotpotqa-task-illustration">HotpotQA task illustration</h3>

<p>Figure 2 above illustrates the HotpotQA distractor setting used throughout the paper: 10 paragraphs per question, with 2 supporting paragraphs embedded among 8 distractors.</p>

<h3 id="multi-seed-raw-data">Multi-seed raw data</h3>

<p><strong>Table 6: Per-seed HotpotQA results (90M, WT103).</strong> DCA mean EM: 1.562% (std 0.120%). Baseline mean EM: 0.288% (std 0.122%). Fisher exact (pooled): OR=5.49, p &lt; 10^-6.</p>

<table class="data-table">
  <thead>
    <tr>
      <th>Model</th>
      <th>Seed</th>
      <th>EM</th>
      <th>F1</th>
      <th>n</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DCA 90M</td>
      <td>42</td>
      <td>0.01588</td>
      <td>0.14508</td>
      <td>6359</td>
    </tr>
    <tr>
      <td>DCA 90M</td>
      <td>137</td>
      <td>0.01431</td>
      <td>0.14439</td>
      <td>6359</td>
    </tr>
    <tr>
      <td>DCA 90M</td>
      <td>2024</td>
      <td>0.01667</td>
      <td>0.14265</td>
      <td>6359</td>
    </tr>
    <tr>
      <td>Baseline 90M</td>
      <td>42</td>
      <td>0.00189</td>
      <td>0.07625</td>
      <td>6359</td>
    </tr>
    <tr>
      <td>Baseline 90M</td>
      <td>137</td>
      <td>0.00425</td>
      <td>0.08143</td>
      <td>6359</td>
    </tr>
    <tr>
      <td>Baseline 90M</td>
      <td>2024</td>
      <td>0.00252</td>
      <td>0.07528</td>
      <td>6359</td>
    </tr>
  </tbody>
</table>
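
<p>The pooled test can be reproduced from these per-seed rates; a sketch using correct-answer counts rounded from EM x n (our reconstruction for illustration, not the original analysis script):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from scipy.stats import fisher_exact

n = 6359 * 3                                                # examples pooled over three seeds
dca_correct = round((0.01588 + 0.01431 + 0.01667) * 6359)   # ~298
base_correct = round((0.00189 + 0.00425 + 0.00252) * 6359)  # ~55
table = [[dca_correct, n - dca_correct],
         [base_correct, n - base_correct]]
odds_ratio, p_value = fisher_exact(table)
print(odds_ratio, p_value)  # ~5.49, p far below 1e-6
</code></pre></div></div>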

<h3 id="additional-benchmark-evaluations">Additional benchmark evaluations</h3>

<p>All results in this section come from 90M models pretrained on WT103.</p>

<p>Sequential reasoning tasks show no DCA advantage: bAbI 2-hop hits 100% for both models at all distractor counts, and PrOntoQA also hits 100% at all hop counts. Tree pathfinding favors the baseline by 2-7 points at depths 4-6. LEGO is roughly even, with the baseline at ~31% and DCA at ~30%.</p>

<p>Single-source tasks also show no advantage. TriviaQA and LAMBADA show no consistent lift. MQAR (fixed protocol, vocab=8192) remains at exact chance across all key-value counts and learning rates for both models.</p>

<p>Capacity-limited tasks stay at floor. MuSiQue shows DCA at 0.21% and baseline at 0.10% (p = 0.687, not significant). Additional synthetic compositional probes were similarly uninformative at this scale: Entity Comparison remained at chance (50%, loss near ln 2), and MQAR2 also remained at chance (50%). We treat these as floor-effect results for small models trained from scratch rather than meaningful tests of DCA’s inductive bias.</p>

<p>2WikiMultiHopQA provides weak corroboration: EM is even (0.31% vs 0.33%, p = 1.0), but Soft EM (F1 &gt;= 0.5) favors DCA at 2.31% vs 1.47% (p = 0.004).</p>

<h3 id="force-decode-difficulty-scaling">Force-decode difficulty scaling</h3>

<p><img src="/assets/images/divergent-convergent-attention/fig5_scatter_reference.png" alt="Appendix Figure: Force-decode advantage vs baseline difficulty" /></p>

<blockquote>
  <p><strong>Appendix Figure</strong> Per-example force-decode advantage (DCA log-prob minus baseline log-prob) plotted against baseline log-prob (90M DCA vs 90M baseline, both WT103). Each point is one of 6,359 HotpotQA validation examples. r=-0.888. The harder an example is for the baseline (more negative log-prob), the larger DCA’s representational advantage.</p>
</blockquote>

<hr />

<h3 id="literature-gap">Literature gap</h3>

<p>To our knowledge, no published decoder-only HotpotQA results exist between 90M and 7B parameters.</p>

<p><strong>Table 7: Published HotpotQA results.</strong> Encoder models dominate at 110M-355M due to bidirectional attention and span extraction heads. Decoder-only models need 7B+ for ~30% EM.</p>

<table class="data-table">
  <thead>
    <tr>
      <th>Architecture</th>
      <th>Params</th>
      <th>HotpotQA</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DCA 90M</td>
      <td>89M</td>
      <td>1.56% EM</td>
      <td>Decoder, WT103</td>
    </tr>
    <tr>
      <td>Baseline 90M</td>
      <td>90M</td>
      <td>0.29% EM</td>
      <td>Decoder, WT103</td>
    </tr>
    <tr>
      <td>Baseline 350M</td>
      <td>355M</td>
      <td>0.93% EM</td>
      <td>Decoder, PG-19</td>
    </tr>
    <tr>
      <td>DCA-215M</td>
      <td>215M</td>
      <td>1.43% EM</td>
      <td>Decoder, PG-19</td>
    </tr>
    <tr>
      <td>BERT-base-era systems</td>
      <td>~110M</td>
      <td>~54% EM</td>
      <td>Encoder, bidirectional</td>
    </tr>
    <tr>
      <td>Longformer-base</td>
      <td>~149M</td>
      <td>64% F1</td>
      <td>Encoder, local+global</td>
    </tr>
    <tr>
      <td>Longformer-large</td>
      <td>~435M</td>
      <td>73% F1</td>
      <td>Encoder, local+global</td>
    </tr>
    <tr>
      <td>BigBird-ETC</td>
      <td>~131M</td>
      <td>76% F1</td>
      <td>Encoder, sparse</td>
    </tr>
    <tr>
      <td>RoBERTa-large-based systems</td>
      <td>~355M</td>
      <td>~70% EM</td>
      <td>Encoder</td>
    </tr>
    <tr>
      <td>Llama-2-7B</td>
      <td>7B</td>
      <td>~30% EM</td>
      <td>Decoder (FireAct)</td>
    </tr>
    <tr>
      <td>GPT-3.5</td>
      <td>proprietary</td>
      <td>~31% EM</td>
      <td>Few-shot ReAct</td>
    </tr>
    <tr>
      <td>Human</td>
      <td>–</td>
      <td>~91% F1</td>
      <td>Leaderboard</td>
    </tr>
  </tbody>
</table>

<p>Encoder models such as BERT<sup id="fnref:bert:1" role="doc-noteref"><a href="#fn:bert" class="footnote" rel="footnote">16</a></sup>, Longformer<sup id="fnref:longformer:3" role="doc-noteref"><a href="#fn:longformer" class="footnote" rel="footnote">6</a></sup>, BigBird-ETC<sup id="fnref:bigbird:3" role="doc-noteref"><a href="#fn:bigbird" class="footnote" rel="footnote">7</a></sup>, and RoBERTa<sup id="fnref:roberta:1" role="doc-noteref"><a href="#fn:roberta" class="footnote" rel="footnote">17</a></sup> dominate at 110M-355M because HotpotQA was designed for BERT-era extractive QA with bidirectional attention and span extraction heads. Decoder-only models need 7B+ for ~30% EM (FireAct with Llama-2-7B<sup id="fnref:fireact:1" role="doc-noteref"><a href="#fn:fireact" class="footnote" rel="footnote">18</a></sup>). Steele &amp; Katz<sup id="fnref:steelekatz:1" role="doc-noteref"><a href="#fn:steelekatz" class="footnote" rel="footnote">34</a></sup> identify a phase transition at 30-70B for emergent multi-hop reasoning.</p>

<h3 id="additional-wt103-variants">Additional WT103 variants</h3>

<p>In addition to the headline comparison (DCA vs <code class="language-plaintext highlighter-rouge">baseline_mixed</code>, 50K steps, 3 seeds), we trained six additional 90M WT103 variants: three DCA variants (consensus every 1, 3, or 6 layers, plus uniform-horizon settings) and three baseline variants (full causal, layerwise windows, sliding window w=256) at 50K or 30K steps. In all cases, every DCA variant outperformed every baseline on HotpotQA EM, including cross-budget comparisons where DCA at 30K steps with 1-epoch finetuning exceeded baselines at 50K steps with 3-epoch finetuning. The factorial decomposition confirmed that parallel streams are the primary mechanism; multi-scale windows are secondary.</p>

<h3 id="perspective-divergence">Perspective divergence</h3>

<p>Pairwise cosine similarity between K=3 perspectives at consensus layers, measured on HotpotQA inputs (90M, WT103). No EM stratification: correct and incorrect examples show nearly identical divergence, confirming divergence is an architectural property rather than a predictor of success.</p>
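
<p>A minimal sketch of the divergence measurement, assuming we can read out each perspective’s [B, T, D] hidden state at a consensus layer:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch.nn.functional as F

def perspective_similarity(p_a, p_b):
    """Mean cosine similarity between two perspective states of shape [B, T, D]."""
    return F.cosine_similarity(p_a, p_b, dim=-1).mean().item()

# e.g. local vs medium at consensus layer 5, averaged over HotpotQA inputs
# sim = perspective_similarity(hidden["local"][5], hidden["medium"][5])
</code></pre></div></div>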

<p><strong>Table 8: Perspective divergence on QA data.</strong> Local and medium are most dissimilar at layer 5 (0.21); by layer 11 they partially reconverge (0.62) while local-global remains distinct (0.34).</p>

<table class="data-table">
  <thead>
    <tr>
      <th>Layer</th>
      <th>Pair</th>
      <th>Overall</th>
      <th>EM=1 (n=101)</th>
      <th>EM=0 (n=6,258)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>5</td>
      <td>local vs medium</td>
      <td>0.207</td>
      <td>0.209</td>
      <td>0.207</td>
    </tr>
    <tr>
      <td>5</td>
      <td>local vs global</td>
      <td>0.435</td>
      <td>0.437</td>
      <td>0.435</td>
    </tr>
    <tr>
      <td>5</td>
      <td>medium vs global</td>
      <td>0.405</td>
      <td>0.407</td>
      <td>0.405</td>
    </tr>
    <tr>
      <td>11</td>
      <td>local vs medium</td>
      <td>0.621</td>
      <td>0.625</td>
      <td>0.621</td>
    </tr>
    <tr>
      <td>11</td>
      <td>local vs global</td>
      <td>0.336</td>
      <td>0.337</td>
      <td>0.336</td>
    </tr>
    <tr>
      <td>11</td>
      <td>medium vs global</td>
      <td>0.437</td>
      <td>0.436</td>
      <td>0.437</td>
    </tr>
  </tbody>
</table>

<h2 id="citation">Citation</h2>

<p>This blog post serves as the current preprint version of this work. Until an archival version is available, please cite it as:</p>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@misc</span><span class="p">{</span><span class="nl">zhao2026dca</span><span class="p">,</span>
  <span class="na">author</span> <span class="p">=</span> <span class="s">{Ben Zhao and Jenhan Tao}</span><span class="p">,</span>
  <span class="na">title</span> <span class="p">=</span> <span class="s">{Divergent-Convergent Attention: Parallel Perspectives for Compositional Reasoning}</span><span class="p">,</span>
  <span class="na">year</span> <span class="p">=</span> <span class="s">{2026}</span><span class="p">,</span>
  <span class="na">howpublished</span> <span class="p">=</span> <span class="s">{\url{https://iluvatarlabs.github.io/blog/2026/03/divergent-convergent-attention/}}</span><span class="p">,</span>
  <span class="na">note</span> <span class="p">=</span> <span class="s">{Iluvatar Labs blog preprint}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="acknowledgements">Acknowledgements</h3>

<p>We thank Abel Chiao for helpful discussions and feedback on this work.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:hotpotqa" role="doc-endnote">
      <p>Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., &amp; Manning, C. D. (2018). “HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering”. <em>Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018)</em>. arXiv:1809.09600 <a href="#fnref:hotpotqa" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:buzsaki" role="doc-endnote">
      <p>Buzsaki, G. (2006). <em>Rhythms of the Brain</em>. Oxford University Press. <a href="#fnref:buzsaki" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:canolty" role="doc-endnote">
      <p>Canolty, R. T., &amp; Knight, R. T. (2010). “The Functional Role of Cross-Frequency Coupling”. <em>Trends in Cognitive Sciences</em>, 14(11), 506-515. <a href="#fnref:canolty" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:lisman" role="doc-endnote">
      <p>Lisman, J. E., &amp; Jensen, O. (2013). “The Theta-Gamma Neural Code”. <em>Neuron</em>, 77(6), 1002-1016. <a href="#fnref:lisman" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:colgin" role="doc-endnote">
      <p>Colgin, L. L., Denninger, T., Fyhn, M., Hafting, T., Bonnevie, T., Jensen, O., Moser, M.-B., &amp; Moser, E. I. (2009). “Frequency of Gamma Oscillations Routes Flow of Information in the Hippocampus”. <em>Nature</em>, 462, 353-357. <a href="#fnref:colgin" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:longformer" role="doc-endnote">
      <p>Beltagy, I., Peters, M. E., &amp; Cohan, A. (2020). “Longformer: The Long-Document Transformer”. arXiv:2004.05150 <a href="#fnref:longformer" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:longformer:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:longformer:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:longformer:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p>
    </li>
    <li id="fn:bigbird" role="doc-endnote">
      <p>Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., &amp; Ahmed, A. (2020). “Big Bird: Transformers for Longer Sequences”. <em>Advances in Neural Information Processing Systems 33 (NeurIPS 2020)</em>. arXiv:2007.14062 <a href="#fnref:bigbird" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:bigbird:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:bigbird:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:bigbird:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p>
    </li>
    <li id="fn:retnet" role="doc-endnote">
      <p>Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., &amp; Wei, F. (2023). “Retentive Network: A Successor to Transformer for Large Language Models”. arXiv:2307.08621 <a href="#fnref:retnet" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:ringattention" role="doc-endnote">
      <p>Liu, H., Zaharia, M., &amp; Abbeel, P. (2023). “Ring Attention with Blockwise Transformers for Near-Infinite Context”. arXiv:2310.01889 <a href="#fnref:ringattention" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:flashattention" role="doc-endnote">
      <p>Dao, T., Fu, D. Y., Ermon, S., Rudra, A., &amp; Re, C. (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”. <em>Advances in Neural Information Processing Systems</em>, 35. <a href="#fnref:flashattention" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:resnext" role="doc-endnote">
      <p>Xie, S., Girshick, R., Dollar, P., Tu, Z., &amp; He, K. (2017). “Aggregated Residual Transformations for Deep Neural Networks”. <em>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017)</em>. arXiv:1611.05431 <a href="#fnref:resnext" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:resnext:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:moe" role="doc-endnote">
      <p>Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., &amp; Dean, J. (2017). “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”. <em>Proceedings of the 5th International Conference on Learning Representations (ICLR 2017)</em>. arXiv:1701.06538 <a href="#fnref:moe" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:moe:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:switch" role="doc-endnote">
      <p>Fedus, W., Zoph, B., &amp; Shazeer, N. (2022). “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”. <em>Journal of Machine Learning Research</em>, 23(120), 1-39. <a href="#fnref:switch" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:highway" role="doc-endnote">
      <p>Srivastava, R. K., Greff, K., &amp; Schmidhuber, J. (2015). “Highway Networks”. arXiv:1505.00387 <a href="#fnref:highway" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:highway:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:fedavg" role="doc-endnote">
      <p>McMahan, H. B., Moore, E., Ramage, D., Hampson, S., &amp; y Arcas, B. A. (2017). “Communication-Efficient Learning of Deep Networks from Decentralized Data”. <em>Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017)</em>. arXiv:1602.05629 <a href="#fnref:fedavg" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:fedavg:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:bert" role="doc-endnote">
      <p>Devlin, J., Chang, M.-W., Lee, K., &amp; Toutanova, K. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. <em>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)</em>. arXiv:1810.04805 <a href="#fnref:bert" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:bert:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:roberta" role="doc-endnote">
      <p>Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., &amp; Stoyanov, V. (2019). “RoBERTa: A Robustly Optimized BERT Pretraining Approach”. arXiv:1907.11692 <a href="#fnref:roberta" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:roberta:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:fireact" role="doc-endnote">
      <p>Chen, B., Monajatipoor, M., Veen, D. V., Guo, Y., &amp; Dubrawski, A. (2023). “FireAct: Toward Language Agent Fine-tuning”. arXiv:2310.05915 <a href="#fnref:fireact" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:fireact:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:mvit" role="doc-endnote">
      <p>Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., &amp; Feichtenhofer, C. (2021). “Multiscale Vision Transformers”. <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2021)</em>. arXiv:2104.11227 <a href="#fnref:mvit" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:wikitext" role="doc-endnote">
      <p>Merity, S., Xiong, C., Bradbury, J., &amp; Socher, R. (2017). “Pointer Sentinel Mixture Models”. <em>Proceedings of the 5th International Conference on Learning Representations (ICLR 2017)</em>. arXiv:1609.07843 <a href="#fnref:wikitext" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:pg19" role="doc-endnote">
      <p>Rae, J. W., Potapenko, A., Jayakumar, S. M., &amp; Hillier, C. (2020). “Compressive Transformers for Long-Range Sequence Modelling”. <em>Proceedings of the 8th International Conference on Learning Representations (ICLR 2020)</em>. arXiv:1911.05507 <a href="#fnref:pg19" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:babi" role="doc-endnote">
      <p>Weston, J., Bordes, A., Chopra, S., &amp; Mikolov, T. (2015). “Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks”. arXiv:1502.05698 <a href="#fnref:babi" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:treepath" role="doc-endnote">
      <p>Brinkmann, J., Goswami, K., &amp; Rajani, N. F. (2024). “A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task”. <em>Findings of the Association for Computational Linguistics: ACL 2024</em>. <a href="#fnref:treepath" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:prontoqa" role="doc-endnote">
      <p>Saparov, A., &amp; He, H. (2023). “Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought”. <em>Proceedings of the 11th International Conference on Learning Representations (ICLR 2023)</em>. <a href="#fnref:prontoqa" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:lego" role="doc-endnote">
      <p>Zhang, Y., Yu, A. W., &amp; Xu, W. (2022). “Unveiling Transformers with LEGO: A Synthetic Reasoning Task”. arXiv:2206.04301 <a href="#fnref:lego" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:triviaqa" role="doc-endnote">
      <p>Joshi, M., Choi, E., Weld, D. S., &amp; Zettlemoyer, L. (2017). “TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension”. <em>Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017)</em>. arXiv:1705.03551 <a href="#fnref:triviaqa" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:lambada" role="doc-endnote">
      <p>Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N. Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., &amp; Fernandez, R. (2016). “The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context”. <em>Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016)</em>. arXiv:1606.06031 <a href="#fnref:lambada" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:mqar" role="doc-endnote">
      <p>Arora, S., Eyuboglu, S., Timalsina, A., Johnson, I., Poli, M., Rudra, A., &amp; Zou, J. (2023). “Zoology: Measuring and Improving Recall in Efficient Language Models”. arXiv:2312.04927 <a href="#fnref:mqar" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:musique" role="doc-endnote">
      <p>Trivedi, H., Balasubramanian, N., Khot, T., &amp; Sabharwal, A. (2022). “MuSiQue: Multihop Questions via Single-hop Question Composition”. <em>Transactions of the Association for Computational Linguistics</em>, 10, 539-554. arXiv:2108.00573 <a href="#fnref:musique" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:twowiki" role="doc-endnote">
      <p>Ho, X., Nguyen, A.-K. D., Sugawara, S., &amp; Aizawa, A. (2020). “Constructing a Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps”. <em>Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020)</em>. arXiv:2011.01060 <a href="#fnref:twowiki" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:yanglatent" role="doc-endnote">
      <p>Yang, S., Gribovskaya, E., Kassner, N., Geva, M., &amp; Riedel, S. (2024). “Do Large Language Models Latently Perform Multi-Hop Reasoning?” <em>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)</em>. <a href="#fnref:yanglatent" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:rag" role="doc-endnote">
      <p>Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Kuttler, H., Lewis, M., Yih, W.-t., Rocktaschel, T., Riedel, S., &amp; Kiela, D. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”. <em>Advances in Neural Information Processing Systems 33 (NeurIPS 2020)</em>. arXiv:2005.11401 <a href="#fnref:rag" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:lostmiddle" role="doc-endnote">
      <p>Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., &amp; Liang, P. (2024). “Lost in the Middle: How Language Models Use Long Contexts”. <em>Transactions of the Association for Computational Linguistics</em>, 12, 157-173. arXiv:2307.03172 <a href="#fnref:lostmiddle" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:steelekatz" role="doc-endnote">
      <p>Steele, B., &amp; Katz, M. (2026). “Scaling Trends for Multi-Hop Contextual Reasoning in Mid-Scale Language Models”. arXiv:2601.04254 <a href="#fnref:steelekatz" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:steelekatz:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
  </ol>
</div>]]></content><author><name>Ben Zhao · Jenhan Tao</name></author><category term="Research" /><summary type="html"><![CDATA[We introduce Divergent-Convergent Attention (DCA), a transformer primitive that maintains parallel attention streams at different window sizes and reconciles them through learned periodic consensus.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://iluvatarlabs.github.io/assets/images/divergent-convergent-attention/social-card.png" /><media:content medium="image" url="https://iluvatarlabs.github.io/assets/images/divergent-convergent-attention/social-card.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Meet Marvin</title><link href="https://iluvatarlabs.github.io/blog/2026/03/introducing-marvin/" rel="alternate" type="text/html" title="Meet Marvin" /><published>2026-03-01T00:00:00+00:00</published><updated>2026-03-01T00:00:00+00:00</updated><id>https://iluvatarlabs.github.io/blog/2026/03/introducing-marvin</id><content type="html" xml:base="https://iluvatarlabs.github.io/blog/2026/03/introducing-marvin/"><![CDATA[<p>Today, we’re introducing Marvin, an autonomous research agent for ML science. Marvin takes the information overload and busywork out of research. It does deep literature review, generates and tests novel and scientifically valid hypotheses, and can perform the entire research loop fully autonomously, end to end. <a href="/marvin/">Learn more about Marvin.</a></p>

<h2 id="why-we-built-marvin">Why we built Marvin</h2>

<p>The bottleneck in ML research today isn’t compute or data. It’s the preparation. More research is being produced now than at any point in history, and the pace is only increasing. Researchers must ingest and synthesize growing volumes of information before they can actually start their research. And even once they start, a lot of the research cycle is still spent on logistics rather than the science itself.</p>

<p>We built Marvin because nothing out there worked well enough for our own research. The existing options were either <strong>too dumb</strong> (chasing red herrings down rabbit holes or proposing smart-sounding ideas that were anything but), <strong>too wasteful</strong> (channeling Ralph Wiggum on experiments that were never going to work), or <strong>too opaque</strong> (poor documentation, no reasoning traces or “logic trail” that forms the bedrock of scientific reproducibility).</p>

<h2 id="full-autonomy-with-logic-trail">Full autonomy with logic trail</h2>

<p>The last of these issues, opacity, is especially problematic for closed-loop agents doing science. Full autonomy is only useful if you can trust it. AI is incredibly good at generating plausible-looking outputs, which will only further compound the reproducibility crisis in academia today.</p>

<p><img src="/assets/images/introducing-marvin/ruslan-autonomous-science.jpg" alt="Scientific figure showing that higher water intake reduces amyloid pathology and improves cognition in 5xFAD mice." /></p>

<p><em>Did you know drinking water can prevent Alzheimer’s? Neither did we. Better keep the receipts.<sup id="fnref:ruslan" role="doc-noteref"><a href="#fn:ruslan" class="footnote" rel="footnote">1</a></sup></em></p>

<p>In order for autonomous scientists to contribute real, meaningful discoveries, the system has to do more than generate the output. It has to carry forward rich context continuously, make sensible, data- and fact-driven decisions, and leave behind a clear record for others, both human and agentic, to inspect and validate.</p>

<h2 id="marvin-is-for-everyone">Marvin is for everyone</h2>

<p>We do not see autonomous systems as replacements for human work. They should augment us, increase our productivity, and let us spend more of our time on the parts of the work that actually matter. That is why we built Marvin to be a scientific <strong>collaborator</strong>: flexible and sophisticated enough to function as a coworker, not just a tool.</p>

<p>Whether you’re a highly technical ML researcher who just needs more clones of you or a bench scientist who has never written a line of code, Marvin can join your team and pick up the work you want to delegate at the degree of autonomy you want to grant it. It can handle anything from just literature review to a full end-to-end research loop, and at any time, you can review or discuss the results or redirect the next experiments before Marvin kicks off again. The level of autonomy is yours to set.</p>

<p>Marvin’s capabilities are also cross-domain. It can do research across fields as diverse as frontier AI research, computational biology and bioinformatics, and materials science. That is because the scientific method and rigor are universal, and we designed Marvin’s research loop and memory system around the same principles we used running our own research teams and academic labs.</p>

<h2 id="work-with-marvin">Work with Marvin</h2>

<p>In head-to-head evaluations using both LLM judges and human PhD judges in the relevant fields, Marvin scored higher than competing autonomous science agents on research depth, rigor, and creativity. We will publish a broader meta-paper with those results closer to Marvin’s open launch.</p>

<p>Marvin is in closed testing now. Read more on the <a href="/marvin/">Marvin page</a> and see examples of its work there. If you’re interested, we’d love to hear about your project’s needs and discuss how Marvin can help.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:ruslan" role="doc-endnote">
      <p>Ruslan Salakhutdinov, “<a href="https://x.com/rust_ruslan/status/2047718238663172329">the future of science is less about producing results and more about verifying them</a>,” X, July 18, 2025. Embedded figure above from the linked post. <a href="#fnref:ruslan" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Iluvatar Labs</name></author><category term="Product" /><summary type="html"><![CDATA[We built Marvin because too much ML research time is still spent on preparation, context gathering, and logistics instead of the science itself.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://iluvatarlabs.github.io/assets/images/introducing-marvin/social-card.png" /><media:content medium="image" url="https://iluvatarlabs.github.io/assets/images/introducing-marvin/social-card.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Elastic Speculation</title><link href="https://iluvatarlabs.github.io/blog/2025/11/elastic-speculation/" rel="alternate" type="text/html" title="Elastic Speculation" /><published>2025-11-11T00:00:00+00:00</published><updated>2025-11-11T00:00:00+00:00</updated><id>https://iluvatarlabs.github.io/blog/2025/11/elastic-speculation</id><content type="html" xml:base="https://iluvatarlabs.github.io/blog/2025/11/elastic-speculation/"><![CDATA[<blockquote>
  <p><strong>The TL;DR:</strong> Elastic speculation speeds up inference while maintaining output quality, resulting in more responsive models and a reduction in compute costs.</p>

  <p>Specifically, adaptive draft length delivers <strong>20-50% latency reduction</strong> over fixed-length speculation. Confidence-based early exit <strong>cuts speculative KV writes by ~50% at a 1-3% latency cost</strong>. Both methods <strong>preserve semantic quality</strong> at multiple scales (BERTScore &gt;0.9, cosine similarity &gt;0.95, equivalent reward model scoring).</p>
</blockquote>

<h2 id="introduction">Introduction</h2>

<p>Large language model inference is fast enough to demo and slow enough to hurt.</p>

<p>Speculative decoding<sup id="fnref:speculative" role="doc-noteref"><a href="#fn:speculative" class="footnote" rel="footnote">1</a></sup> is an incredibly effective strategy for speeding up inference: a smaller draft model proposes multiple tokens, a larger target model verifies them, and we commit the accepted prefix and discard the rest. Implementations like EAGLE<sup id="fnref:eagle" role="doc-noteref"><a href="#fn:eagle" class="footnote" rel="footnote">2</a></sup> in vLLM<sup id="fnref:vllm" role="doc-noteref"><a href="#fn:vllm" class="footnote" rel="footnote">3</a></sup> already make this practical and widely used.</p>
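
<p>To make the mechanics concrete, here is a minimal sketch of the propose-verify-commit loop. It uses a simplified greedy verification rule (accept a drafted token only if it matches the target model’s argmax) rather than the full rejection-sampling rule from the original paper, and <code class="language-plaintext highlighter-rouge">draft_model</code> / <code class="language-plaintext highlighter-rouge">target_model</code> are stand-ins for any callables that map a token sequence to per-position logits:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def speculative_step(draft_model, target_model, prefix: torch.Tensor, k: int) -&gt; torch.Tensor:
    """One propose-verify-commit round (simplified greedy-verification variant)."""
    # 1) The draft model proposes k tokens autoregressively.
    drafted = prefix.clone()
    for _ in range(k):
        logits = draft_model(drafted)                      # [seq, vocab]
        drafted = torch.cat([drafted, logits[-1].argmax().view(1)])

    # 2) The target model scores the whole drafted sequence in one pass.
    target_next = target_model(drafted).argmax(dim=-1)     # target's pick at each position

    # 3) Accept the longest prefix of drafted tokens the target agrees with.
    n_prefix, accepted = prefix.shape[0], 0
    for i in range(k):
        if drafted[n_prefix + i] == target_next[n_prefix + i - 1]:
            accepted += 1
        else:
            break

    # 4) Commit the accepted tokens, then append the target's own next token.
    committed = drafted[: n_prefix + accepted]
    return torch.cat([committed, target_next[n_prefix + accepted - 1].view(1)])
</code></pre></div></div>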

<p>However, two parts of this pipeline are still potentially inefficient:</p>

<ul>
  <li>The draft length is fixed, even as acceptance behavior changes across prompts, positions, and workloads.</li>
  <li>Every speculative token writes to KV cache, even when it was never likely to survive verification.</li>
</ul>

<p>In this post, we introduce <strong>Elastic Speculation</strong>: a small control layer on top of EAGLE that makes speculative decoding adaptive instead of static.</p>

<h2 id="why-spec-decode-leaves-performance-on-the-table">Why spec decode leaves performance on the table</h2>

<p><strong>First, acceptance is not constant</strong> and so a global, fixed <em>K</em> is too blunt. For easy or highly structured workloads (e.g., coding or QA-style prompts), acceptance can be very high, so a small <em>K</em> underutilizes the draft model. For harder or more creative workloads, acceptance drops, so a large <em>K</em> wastes compute on tokens that will be thrown away.</p>

<p><strong>Second, being KV-cache bandwidth constrained hurts.</strong> Even speculative tokens that will never be accepted still pay the full price of KV writes. At larger batch sizes, longer contexts, and bigger models, KV-cache traffic becomes a dominant bottleneck<sup id="fnref:kvbottleneck" role="doc-noteref"><a href="#fn:kvbottleneck" class="footnote" rel="footnote">4</a></sup>. Reducing unnecessary KV work is often the real lever for throughput.</p>

<p>Elastic Speculation treats speculative decoding as a <strong>runtime control problem</strong>:</p>
<ul>
  <li>Speculate more when speculation is working.</li>
  <li>Speculate less when it isn’t.</li>
  <li>Stop writing KV for tokens that are very unlikely to matter.</li>
</ul>

<p>We do this without changing model weights or the verification rule. Our reference implementation is for EAGLE in vLLM, but the same control-plane ideas apply to other speculative decoding methods.</p>

<p><img src="/assets/images/elastic-speculation/elastic_spec_overview_mod.svg" alt="Figure 1: Elastic Speculation overview" /></p>

<blockquote>
  <p><strong>Figure 1</strong> illustrates this design: speculative decoding with a dynamic <em>K</em>, plus a separate control that can gate KV writes.</p>
</blockquote>

<h2 id="adaptive-draft-length-making-k-elastic">Adaptive draft length: making <em>K</em> elastic</h2>

<p>Our first contribution is enabling an <strong>adaptive draft length</strong>. Instead of choosing <em>K</em> once and hard-coding it, we let the system adjust <em>K</em> dynamically based on how speculation has been performing recently.</p>

<p>At a high level, our implementation features the following (an illustrative sketch of the controller follows the list):</p>

<blockquote>
  <ul>
    <li>A runtime maintains lightweight statistics about speculative behavior.</li>
    <li>A controller selects a draft length from a small set (e.g., 5, 10, 15) for each step:
      <ul>
        <li>When recent speculative proposals are mostly accepted, it chooses a longer draft.</li>
        <li>When they are frequently rejected, it chooses a shorter one.</li>
      </ul>
    </li>
    <li>The selected draft length is carried through existing batch descriptors into the EAGLE path. No extra RPC layer, no changes to the verification contract.</li>
  </ul>
</blockquote>
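
<p>The controller itself can be very small. The sketch below is illustrative rather than the actual vLLM integration: it tracks an exponential moving average of the recent acceptance rate and maps it onto one of the candidate draft lengths. The candidate set, decay, and thresholds here are assumptions chosen to mirror the example values above, not our tuned settings.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class AdaptiveDraftController:
    """Pick a draft length K per step from recent acceptance behavior (illustrative)."""

    def __init__(self, candidates=(5, 10, 15), ema_decay=0.9):
        self.candidates = sorted(candidates)
        self.ema_decay = ema_decay
        self.acceptance_ema = 0.5  # start neutral

    def update(self, accepted: int, proposed: int) -&gt; None:
        """Fold the last round's acceptance ratio into the running average."""
        if proposed == 0:
            return
        rate = accepted / proposed
        self.acceptance_ema = self.ema_decay * self.acceptance_ema + (1.0 - self.ema_decay) * rate

    def next_draft_length(self) -&gt; int:
        """High recent acceptance: speculate more. Low: speculate less."""
        if self.acceptance_ema &gt;= 0.8:
            return self.candidates[-1]                         # e.g. 15
        if self.acceptance_ema &gt;= 0.5:
            return self.candidates[len(self.candidates) // 2]  # e.g. 10
        return self.candidates[0]                              # e.g. 5
</code></pre></div></div>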

<h3 id="latency-savings">Latency savings</h3>

<p>We evaluated adaptive draft length on <code class="language-plaintext highlighter-rouge">Llama-3.1-8B-Instruct</code> target and draft models, across various configurations (batch size, output tokens, etc.) and datasets. We selected the following four diverse benchmark datasets, representing different LLM workload characteristics:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">Alpaca</code> - Instruction-following tasks spanning creative writing, QA, and general task completion. Representative of typical chat assistant workloads.</li>
  <li><code class="language-plaintext highlighter-rouge">SQuAD</code> - Reading comprehension requiring extractive answers. Short, factual outputs with high determinism ideal for testing speculation on low-entropy tasks.</li>
  <li><code class="language-plaintext highlighter-rouge">CNN DailyMail</code> (aka long) - Document summarization, essays, and narratives requiring 256+ tokens. Stresses sustained speculation quality over extended generations.</li>
  <li><code class="language-plaintext highlighter-rouge">BigCodeBench</code> (aka coding) - Code completion, bug fixing, and algorithm implementation. Highly structured outputs with strict syntactic constraints test adaptive tuning limits.</li>
</ul>

<p>Across workloads ranging from short bursts (<code class="language-plaintext highlighter-rouge">12 requests x 64 tokens</code>) to long-form generation (<code class="language-plaintext highlighter-rouge">36 x 256</code>), adaptive draft length cuts latency substantially. Figure 2 breaks down these gains at draft length <code class="language-plaintext highlighter-rouge">d=10</code> across the four datasets. The short-context benchmarks - Alpaca, SQuAD, and Coding - deliver consistent <strong>35–45%</strong> speedups under both greedy (<code class="language-plaintext highlighter-rouge">temp=0</code>) and stochastic sampling decoding (<code class="language-plaintext highlighter-rouge">temp=0.7</code>, not shown). For the long-form dataset, while adaptive still provides sizeable gains, the savings drop to <strong>~16–30%</strong>.</p>

<p>Why the gap? Speculative decoding fundamentally relies on the draft model tracking the target model’s distribution. As sequences grow longer, this alignment degrades. Our long-form benchmark averages 487 tokens per output (vs 128–256 for other datasets). The longer the context, the more cumulative errors compound, and acceptance rates fall accordingly<sup id="fnref:seqlen" role="doc-noteref"><a href="#fn:seqlen" class="footnote" rel="footnote">5</a></sup>.</p>

<p><img src="/assets/images/elastic-speculation/latency_adaptive_d10.png" alt="Figure 2: Latency" /></p>

<blockquote>
  <p><strong>Figure 2</strong> Adaptive draft length (d=10) achieves 35-55% latency reduction across datasets with Llama-3.1-8B-Instruct.</p>
</blockquote>

<p>Next, we evaluated draft lengths of 5, 10, and 15 tokens on the <code class="language-plaintext highlighter-rouge">36 requests x 128 tokens</code> configuration. These values span the typical deployment range: production systems conservatively use 3-5 tokens (Red Hat’s EAGLE3 at 3, NVIDIA’s reference configs at 5) to minimize wasted computation when drafts are rejected. Our experiments also test draft lengths beyond this range, as some implementations suggest 8-10 and even 18-32 for methods like suffix decoding.</p>

<p>At <code class="language-plaintext highlighter-rouge">d=5</code>, adaptive speculation yields smaller savings across the board, which makes sense: there are fewer ways to dynamically reduce <em>K</em>. The benefit does appear to saturate after <code class="language-plaintext highlighter-rouge">d=10</code>. We observe task-specific phenomena as well. As noted above, long-form generation maintains modest 16–30% speedups across all lengths, limited by fundamental acceptance rate degradation at extended sequences.</p>

<p>Coding presents a rather unique case compared to the other short form datasets. At <code class="language-plaintext highlighter-rouge">d=5</code> there is minimal improvement (~4%), but <code class="language-plaintext highlighter-rouge">d=10</code> unlocks 35% speedups. We suspect that this is because structured generation requires longer draft windows to amortize verification costs, a pattern documented in recent work<sup id="fnref:chen" role="doc-noteref"><a href="#fn:chen" class="footnote" rel="footnote">6</a></sup> showing that syntactic tasks need sufficient lookahead to capture token dependencies. We confirmed these results with the <code class="language-plaintext highlighter-rouge">Llama 3.2 3B</code> model as well.</p>

<figure class="figure-row">
    <div><img src="/assets/images/elastic-speculation/latency_adaptive.png" alt="Figure 3a" /><figcaption>Llama 3.1 8B</figcaption></div>
    <div><img src="/assets/images/elastic-speculation/latency_adaptive_3b.png" alt="Figure 3b" /><figcaption>Llama 3.2 3B</figcaption></div>
</figure>

<blockquote>
  <p><strong>Figure 3</strong> Draft length sensitivity. Latency reduction confirms generalization across model scales (8B and 3B).</p>
</blockquote>

<p>Ultimately, this variability explains why no single draft length works universally. Our adaptive approach sidesteps this problem by adjusting draft length per-request based on observed acceptance rates and task-specific requirements: fewer verification rounds when speculation is effective, and less wasted draft compute when it is not.</p>

<h2 id="confidence-based-early-exit-cutting-speculative-kv-writes">Confidence-based early exit: cutting speculative KV writes</h2>

<p>The second component is <strong>confidence-based early exit</strong>, designed to reduce speculative KV writes. In standard speculative decoding, every drafted token writes to the KV cache. If a token is never accepted, that bandwidth was wasted. On hardware and workloads where decode is memory-bound, this is expensive.</p>

<p>Our goal is to avoid KV writes for speculative tokens that the draft model itself considers unlikely, while keeping (1) the loop structure compatible with CUDA graphs, and (2) the target model’s verification rule unchanged.</p>

<p>We’ve implemented the approach as follows:</p>

<ol>
  <li>For each speculative step, we compute a simple confidence score per sequence (the maximum predicted token probability).</li>
  <li>We maintain a <code class="language-plaintext highlighter-rouge">continue_mask</code> for sequences that should keep writing KV.</li>
  <li>On the <strong>next</strong> iteration, if a sequence has fallen below the confidence threshold, we mark its KV-write slot as padding.</li>
  <li>The KV-write stage treats padding slots as no-ops, so those tokens are <strong>skipped</strong>.</li>
</ol>

<p>All sequences still execute the same control flow and only the data (which slots get written) changes. The target model still evaluates whatever drafts are produced, so we are not weakening correctness checks.</p>
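
<p>A minimal sketch of the gating logic is below. The names (<code class="language-plaintext highlighter-rouge">continue_mask</code>, <code class="language-plaintext highlighter-rouge">PAD_SLOT</code>) and tensor shapes are stand-ins for the actual vLLM/EAGLE data structures; the point is that confidence computed at one step only gates the KV write on the following step, so the loop shape stays fixed and CUDA-graph friendly.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

PAD_SLOT = -1  # hypothetical sentinel the KV-write stage treats as a no-op

def gate_kv_writes(draft_probs, kv_slots, continue_mask, threshold=0.5):
    """Gate KV writes for low-confidence sequences without changing control flow.

    draft_probs:   [batch, vocab] draft distribution at this speculative step
    kv_slots:      [batch] slot index each sequence would write this step
    continue_mask: [batch] bool, carried over from the previous step
    """
    # Sequences already marked as stopped write to the padding slot (a no-op).
    gated_slots = torch.where(continue_mask, kv_slots, torch.full_like(kv_slots, PAD_SLOT))

    # Confidence = maximum predicted token probability per sequence.
    confidence = draft_probs.max(dim=-1).values

    # Once a sequence falls below threshold it stays stopped for this draft window;
    # the new mask only takes effect on the *next* iteration.
    next_mask = continue_mask &amp; (confidence &gt;= threshold)
    return gated_slots, next_mask
</code></pre></div></div>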

<h3 id="why-dram-savings-matter-at-scale">Why DRAM savings matter at scale</h3>

<p>Early exit functions as a <strong>bandwidth control knob</strong>: terminate low-confidence  speculations before writing full draft sequences to KV cache, trading local  compute overhead for reduced memory pressure.</p>

<p>This matters because KV cache dominates production inference. At scale (large  batches, long contexts), the decode phase is memory-bandwidth bound: research shows KV cache accounts for up to 73% of total memory in <code class="language-plaintext highlighter-rouge">LLaMA-7B</code> at <code class="language-plaintext highlighter-rouge">batch=32</code><sup id="fnref:sheng" role="doc-noteref"><a href="#fn:sheng" class="footnote" rel="footnote">7</a></sup>, and over 50% of attention kernel cycles stall on data access delays<sup id="fnref:memorygap" role="doc-noteref"><a href="#fn:memorygap" class="footnote" rel="footnote">8</a></sup>. Techniques that reduce KV cache bandwidth show 1.5-3.7× latency improvements in production (RocketKV, SQuat, Async KV  prefetch).</p>

<p>Our early exit mechanism cuts DRAM writes by stopping speculation when confidence drops below threshold—fewer draft tokens generated means fewer KV cache entries written. In bandwidth-limited stacks (large models, long contexts, multi-tenant serving), this enables higher batch throughput and prevents OOM conditions. The 1-5% per-request latency cost translates to net system-level gains when memory bandwidth, not compute, is the bottleneck.</p>

<h3 id="bandwidth-vs-latency-trade-off">Bandwidth vs latency trade-off</h3>

<p>Figure 4 shows the bandwidth-latency trade-off across thresholds <code class="language-plaintext highlighter-rouge">0.3</code>, <code class="language-plaintext highlighter-rouge">0.5</code>, and <code class="language-plaintext highlighter-rouge">0.7</code>. At <code class="language-plaintext highlighter-rouge">threshold=0.5</code>, early exit stops 50-65% of speculative tokens before KV cache writes, translating to roughly 50% fewer DRAM write operations in our NCU profiles. The cost: 1-3% higher end-to-end latency compared to no early exit.</p>

<p>This latency penalty emerges from the mechanics of speculation. When early exit terminates a draft sequence, fewer tokens are available for verification. Lower acceptance per round means more speculation rounds to generate the same output — and each additional round invokes the target model. On our compute-bound test hardware, this overhead dominates. But production deployments are bandwidth-bound at scale<sup id="fnref:sheng:1" role="doc-noteref"><a href="#fn:sheng" class="footnote" rel="footnote">7</a></sup>, where 50% DRAM savings enables higher batch throughput. The mechanism is the same — and production regimes are precisely where bandwidth constraints bite.</p>

<figure class="figure-row">
    <div><img src="/assets/images/elastic-speculation/latency_early.png" alt="Figure 4a" /><figcaption>Latency</figcaption></div>
    <div><img src="/assets/images/elastic-speculation/tokens_early.png" alt="Figure 4b" /><figcaption>KV Writes Saved</figcaption></div>
</figure>

<blockquote>
  <p><strong>Figure 4</strong> Early exit stops a threshold-proportional % of speculative tokens before KV cache writes. Trades 1-3% latency for ~50% bandwidth reduction; coding shows steepest penalty (-5.4%) at threshold=0.7.</p>
</blockquote>

<p>Figure 5 visualizes this relationship: higher stop rates correlate with larger latency penalties. Coding exhibits the steepest degradation at threshold=<code class="language-plaintext highlighter-rouge">0.7</code> (73.7% stop rate, -5.4% latency), while other datasets show smaller penalties — structured generation suffers most when speculation is aggressively curtailed.</p>

<p>The optimal threshold will ultimately depend on deployment context. Bandwidth-limited production stacks benefit from aggressive early exit (threshold=<code class="language-plaintext highlighter-rouge">0.5-0.7</code>) to prevent OOM and enable larger batches. Compute-bound scenarios favor conservative thresholds (<code class="language-plaintext highlighter-rouge">0.3</code>) or disabling early exit entirely. Our implementation exposes threshold as a tunable parameter for operators to match their hardware constraints.</p>

<p><img src="/assets/images/elastic-speculation/lat_tok_scatter.png" alt="Figure 5: Latency vs Bandwidth Trade-Off" /></p>

<blockquote>
  <p><strong>Figure 5</strong> Higher stop rates correlate with larger latency penalties on compute-bound hardware; optimal threshold depends on deployment context (Llama-3.1-8B-Instruct @ k=10).</p>
</blockquote>

<h2 id="maintaining-output-semantics-and-quality">Maintaining output semantics and quality</h2>

<p>Elastic Speculation necessarily changes which speculative tokens are proposed and accepted, so we do not expect or intend to achieve exact bitwise-identical outputs. However, we do still want to ensure the overall quality and correctness of the output semantics. After all, what’s the point of speeding up inference if all you get out is nonsense?</p>

<p>To quantify this difference, we systematically evaluated the outputs from adaptive draft length and early exit (together, elastic speculation) against standard speculative decoding (fixed-length <em>K</em>). We also compared both against vLLM running the target model alone with no speculation, to understand the relative semantic similarity and to ensure elastic speculation keeps our outputs in the same <strong>semantic regime</strong>.</p>

<p>Specifically, we evaluated the outputs under the following three criteria:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">BERTScore F1</code> (token-level semantic similarity)</li>
  <li><code class="language-plaintext highlighter-rouge">cosine similarity</code> (sentence-level via Sentence-BERT similarity)</li>
  <li>and a <code class="language-plaintext highlighter-rouge">reward model quality score</code> (human preference alignment)</li>
</ul>

<h3 id="bertscore-f1-context-aware-token-alignment">BERTScore F1 (Context-aware token alignment)</h3>

<p>BERTScore measures semantic equivalence by comparing contextualized token
embeddings from BERT-family models. Unlike surface-level string matching, it
captures whether two texts convey the same meaning even with different wording.</p>

<blockquote>
  <p><strong>How it works:</strong> The metric computes token-level similarity using contextual
embeddings from <em>microsoft/deberta-large-mnli</em><sup id="fnref:bertscore" role="doc-noteref"><a href="#fn:bertscore" class="footnote" rel="footnote">9</a></sup>, then aggregates via precision, recall, and F1-score. Each token in the candidate text is matched to its most similar token in the reference text based on cosine similarity in embedding space.</p>
</blockquote>
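
<p>For reference, the metric itself is a few lines with the <code class="language-plaintext highlighter-rouge">bert-score</code> package (a sketch assuming that package is installed; the checkpoint matches the footnote, everything else is illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from bert_score import score

def bertscore_f1(candidates, references):
    """Token-level semantic similarity between paired outputs.

    candidates: outputs from elastic speculation; references: baseline outputs.
    """
    # model_type pins the same DeBERTa MNLI checkpoint cited in the footnotes.
    precision, recall, f1 = score(
        candidates,
        references,
        model_type="microsoft/deberta-large-mnli",
        lang="en",
    )
    return f1.tolist()
</code></pre></div></div>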

<p>Both adaptive draft length and early exit maintain semantic fidelity: BERTScore F1 ranges from ~0.89 to 0.94 across all experiments. This places outputs well into the semantic equivalence regime—above the 0.90 threshold where texts convey identical meaning. For context, scores of 0.85-0.90 indicate paraphrase-level similarity, while values below 0.80 signal semantically different content.</p>

<figure class="figure-row">
    <div><img src="/assets/images/elastic-speculation/bert_adaptive.png" alt="Figure 6a" /><figcaption>Adaptive (BERTScore F1)</figcaption></div>
    <div><img src="/assets/images/elastic-speculation/bert_early.png" alt="Figure 6b" /><figcaption>Early Exit (BERTScore F1)</figcaption></div>
</figure>

<blockquote>
  <p><strong>Figure 6</strong> Adaptive draft length and early exit maintain BERTScore F1 &gt;0.88 and F1 &gt;0.95 respectively across all datasets, indicating semantic equivalence to baseline.</p>
</blockquote>

<h3 id="cosine-similarity-sentence-level-embeddings">Cosine Similarity (Sentence-Level Embeddings)</h3>

<p>Cosine similarity measures the angle between dense vector representations of
complete sentences, capturing overall semantic content at the document level
rather than token-by-token.</p>

<blockquote>
  <p><strong>How it works:</strong> We encode each output using Sentence-BERT<sup id="fnref:sbert" role="doc-noteref"><a href="#fn:sbert" class="footnote" rel="footnote">10</a></sup> (<em>all-mpnet-base-v2</em>), which produces a single 768-dimensional vector per text. The cosine similarity between corresponding baseline and optimized outputs quantifies semantic alignment.</p>
</blockquote>

<p>Cosine similarity between sentence embeddings confirms (and even exceeds) the BERTScore findings: adaptive draft length achieves &gt;0.95 similarity for all datasets, with SQuAD and coding measuring over 0.97 (Figure 7). Early exit maintains &gt;0.92 across thresholds. These scores place outputs well above the 0.85 threshold for semantic equivalence—effectively producing semantic duplicates of baseline outputs at the sentence level.</p>

\[\text{cosine similarity}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}\]

<p>where $u = \text{SentenceBERT}(\text{text}_1)$, $v =
\text{SentenceBERT}(\text{text}_2) \in \mathbb{R}^{768}$</p>

<p>For reference, scores of 0.70-0.85 indicate paraphrases with similar meaning, while values below 0.60 signal semantically divergent content. Our results demonstrate that neither elastic technique introduces meaningful semantic drift.</p>
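
<p>Measuring this is equally compact with <code class="language-plaintext highlighter-rouge">sentence-transformers</code> (a sketch assuming that package; the checkpoint is the one named above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sentence_transformers import SentenceTransformer, util

def sentence_cosine(text_a: str, text_b: str) -&gt; float:
    """Sentence-level cosine similarity between a baseline and an elastic output."""
    model = SentenceTransformer("all-mpnet-base-v2")  # 768-dim sentence embeddings
    embeddings = model.encode([text_a, text_b], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()
</code></pre></div></div>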

<figure class="figure-row">
    <div><img src="/assets/images/elastic-speculation/cosine_adaptive.png" alt="Figure 7a" /><figcaption>Adaptive (Cosine)</figcaption></div>
    <div><img src="/assets/images/elastic-speculation/cosine_early.png" alt="Figure 7b" /><figcaption>Early Exit (Cosine)</figcaption></div>
</figure>

<blockquote>
  <p><strong>Figure 7</strong> Adaptive draft length and early exit achieve &gt;0.94 sentence-level similarity across all thresholds and datasets.</p>
</blockquote>

<h3 id="reward-model-quality-score--human-preference-alignment">Reward Model Quality Score ∆ (Human Preference Alignment)</h3>

<p>The reward model measures output quality based on human preference alignment,
trained on datasets of human judgments about response quality. Unlike similarity
metrics, it evaluates absolute quality rather than just semantic equivalence.</p>

<blockquote>
  <p><strong>How it works:</strong> We used <em>OpenAssistant/reward-model-deberta-v3-large-v2</em><sup id="fnref:rewardmodel" role="doc-noteref"><a href="#fn:rewardmodel" class="footnote" rel="footnote">11</a></sup>, a <code class="language-plaintext highlighter-rouge">DeBERTa-v3-large</code> model fine-tuned on human preference data. The model scores each output on a continuous scale, predicting how humans would rate the response quality in terms of helpfulness, correctness, and coherence.</p>
</blockquote>

<p>This particular model scores outputs on helpfulness, correctness, and coherence as a proxy for human-perceived quality. The model outputs unbounded logit scores (typically -5 to +5 range), where higher values indicate better quality.</p>
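
<p>Scoring follows the standard <code class="language-plaintext highlighter-rouge">transformers</code> sequence-classification pattern (a sketch assuming the checkpoint from the footnote, with the prompt/response pairing described on its model card):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "OpenAssistant/reward-model-deberta-v3-large-v2"

def reward_score(prompt: str, response: str) -&gt; float:
    """Return the (unbounded) preference logit for a prompt/response pair."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()
</code></pre></div></div>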

<p>Figure 8 plots the quality score delta: elastic speculation minus baseline speculation, with both compared against no-speculation runs. Values hovering near zero indicate equivalent quality. Adaptive draft length shows deltas within ±0.15 across all datasets, while early exit maintains ±0.2 across thresholds. Paired t-tests confirm no statistically significant difference (p &gt; 0.85 across experiments). Mean raw scores are baseline = -2.505, adaptive = -2.513 — both producing equivalently high-quality outputs from a human preference perspective.</p>
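
<p>The significance check is a standard paired t-test over per-prompt reward scores, along the lines of the SciPy sketch below (the score lists are placeholders, not our measured values):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from scipy import stats

# Per-prompt reward scores for the same prompts under both decoders (placeholder values).
baseline_scores = [-2.1, -2.8, -2.4, -2.6, -2.7]
elastic_scores  = [-2.2, -2.7, -2.5, -2.6, -2.6]

t_stat, p_value = stats.ttest_rel(baseline_scores, elastic_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # large p means no significant difference
</code></pre></div></div>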

<figure class="figure-row">
    <div><img src="/assets/images/elastic-speculation/reward_adaptive.png" alt="Figure 8a" /><figcaption>Adaptive (Quality ∆)</figcaption></div>
    <div><img src="/assets/images/elastic-speculation/reward_early.png" alt="Figure 8b" /><figcaption>Early Exit (Quality ∆)</figcaption></div>
</figure>

<blockquote>
  <p><strong>Figure 8</strong> Quality deltas within ±0.15 confirm elastic speculation preserves human-perceived output quality; no statistically significant difference from baseline speculation (p&gt;0.85).</p>
</blockquote>

<p>Across all three metrics, elastic speculation preserves semantic quality. BERTScore &gt;0.94, cosine similarity &gt;0.95, and reward model deltas within ±0.2 confirm outputs match baseline speculation in both token-level fidelity and human-perceived quality.</p>

<p>To understand what “acceptable drift” looks like, we measured how much baseline speculation diverges from no-speculation runs. This gives us a reference: if speculation itself introduces some semantic variance, elastic variants should stay within that same range. They do — elastic spec vs. no-spec shows comparable deltas to baseline spec vs. no-spec (<em>not shown</em>). Our optimizations don’t add drift beyond what standard speculation already introduces. Finally, the 3B model replicates these findings across all metrics and conditions (not shown).</p>

<p>Note that the results shown use temperature=<code class="language-plaintext highlighter-rouge">0.0</code>. At temperature=<code class="language-plaintext highlighter-rouge">0.7</code>, scores drop for both baseline and elastic variants to similar degrees (<em>not shown</em>) — that’s just the nature of sampling-based generation. Your outputs get a little <em>spicy</em> but elastic is no worse than baseline speculation.</p>

<h2 id="concluding-remarks">Concluding Remarks</h2>

<p>Elastic Speculation makes speculative decoding <strong>responsive</strong> by adapting to workload characteristics and hardware constraints in real time. In our tests, that means up to <strong>~20-50% lower latency</strong> versus fixed-length <em>K</em> from adaptive draft length, and <strong>a proportional reduction in speculative KV writes</strong> based on the selected threshold for confidence-based early exit. It changes how tokens are generated, not necessarily the meaning of what gets generated, staying within the same semantic regime as standard speculative decoding in the recommended settings.</p>

<p>We are preparing a vLLM PR so you can try Elastic Speculation in your own deployments, tune it for your workloads, and see how it behaves at your scale. Please feel free to share your findings and/or implementations for other frameworks!</p>

<h3 id="citation">Citation</h3>

<p>Please cite this work as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Zhao, Ben and Iluvatar Labs, "Elastic Speculation: Adaptive Draft Length and
Confidence-Based Early Exit", Iluvatar Labs Blog, Nov 2025.
</code></pre></div></div>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:speculative" role="doc-endnote">
      <p>Leviathan, Y., Kalman, M., &amp; Matias, Y. (2023). “Fast Inference from Transformers via Speculative Decoding”. <em>Proceedings of the 40th International Conference on Machine Learning (ICML 2023)</em>, 19274-19286. arXiv:2211.17192 <a href="#fnref:speculative" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:eagle" role="doc-endnote">
      <p>Li, Y., Wei, F., Zhang, C., &amp; Zhang, H. (2024). “EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty”. <em>Proceedings of the 41st International Conference on Machine Learning (ICML 2024)</em>. arXiv:2401.15077 <a href="#fnref:eagle" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:vllm" role="doc-endnote">
      <p>Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., &amp; Stoica, I. (2023). “Efficient Memory Management for Large Language Model Serving with PagedAttention”. <em>Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP ‘23)</em>. arXiv:2309.06180 <a href="#fnref:vllm" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:kvbottleneck" role="doc-endnote">
      <p>Kwon et al. (2023) show that KV cache accounts for up to 73% of total memory in large-batch inference, with memory bandwidth becoming the primary bottleneck during decoding. <a href="#fnref:kvbottleneck" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:seqlen" role="doc-endnote">
      <p>Miao, X. et al. (2024). “Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding”. arXiv:2411.18462. The paper demonstrates that speculative decoding performance degrades as input length grows due to reduced draft accuracy. <a href="#fnref:seqlen" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:chen" role="doc-endnote">
      <p>Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., &amp; Jumper, J. (2023). “Accelerating Large Language Model Decoding with Speculative Sampling”. arXiv:2302.01318 <a href="#fnref:chen" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:sheng" role="doc-endnote">
      <p>Sheng, Y., Cao, S., Li, D., Hooper, C., Lee, N., Yang, S., Chou, C., Zhu, B., Zheng, L., Keutzer, K., Gonzalez, J. E., &amp; Stoica, I. (2024). “S-LoRA: Serving Thousands of Concurrent LoRA Adapters”. arXiv:2311.03285 <a href="#fnref:sheng" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:sheng:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:memorygap" role="doc-endnote">
      <p>Kim, J., Lee, M., &amp; Kim, S. (2024). “Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference”. arXiv:2503.08311 <a href="#fnref:memorygap" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:bertscore" role="doc-endnote">
      <p>Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., &amp; Artzi, Y. (2020). “BERTScore: Evaluating Text Generation with BERT”. <em>Proceedings of the 8th International Conference on Learning Representations (ICLR 2020)</em>. arXiv:1904.09675 <a href="#fnref:bertscore" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:sbert" role="doc-endnote">
      <p>Reimers, N., &amp; Gurevych, I. (2019). “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”. <em>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP)</em>, 3982-3992. arXiv:1908.10084 <a href="#fnref:sbert" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:rewardmodel" role="doc-endnote">
      <p>OpenAssistant/reward-model-deberta-v3-large-v2. Available at <a href="https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2">https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2</a>. <a href="#fnref:rewardmodel" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Ben Zhao · Roy Zhao · Justin Huang</name></author><category term="Research" /><summary type="html"><![CDATA[Elastic speculation delivers 30–50% lower latency and up to ~50% fewer speculative KV writes in our experiments, while preserving output quality.]]></summary></entry></feed>