The TL;DR: Divergent-Convergent Attention (DCA) improves compositional reasoning by maintaining multiple parallel attention perspectives before periodic learned consensus. On HotpotQA[1], DCA achieves 5.4x higher exact match than a parameter-matched 90M baseline, and a 215M DCA model outperforms a 355M standard transformer by 1.54x with fewer parameters and lower memory.
Most notably, DCA assigns higher probability to the correct answer tokens on 97.8% of examples, with the advantage sharply correlated with question difficulty, suggesting that DCA’s magic is in how distributed evidence is internally composed before decoding.
DCA helps when relevant content is scattered across structurally independent documents. It does not help on sequential reasoning or single-source retrieval tasks where every perspective sees the same chain or location.
(Note: This blog post reflects the latest manuscript version of this work.)
Introduction
Standard transformers process multi-document input through a single attention stream, fusing heterogeneous evidence into one representation at every layer. RAG pipelines, long-context windows, and tasks like legal analysis or medical synthesis all require integrating information from structurally independent sources. A single stream must compromise between local precision and global reach at every layer. The result is premature fusion, where multi-document evidence is collapsed before the model can develop complementary views.
We introduce Divergent-Convergent Attention (DCA), a transformer variant that maintains K parallel attention streams at different scales and reconciles them only at scheduled consensus points. The novelty is not merely independent lanes or a late merge, but that those lanes are explicitly multi-horizon: short, medium, and long timescales that cultivate complementary perspectives before reconciliation.
DCA is inspired by an organizational principle in neuroscience: the brain concurrently maintains multiple oscillatory bands that only periodically couple to coordinate information[2][3]. Gamma (30-100 Hz) supports fast, local feature binding, analogous to our short horizon. Beta (13-30 Hz) integrates across nearby regions, our medium horizon. Theta (4-8 Hz) supports global synchronization, our long horizon[4][5]. DCA provides the computational analogue: separate processing streams that periodically synchronize via learned consensus.
Figure 1a Biological multi-scale oscillations. Gamma, beta, and theta bands process at different scales and periodically couple to coordinate information. DCA maps these to three attention horizons.
In controlled experiments, DCA achieves 5.4x higher exact match on multi-hop QA at 90M parameters (p < 10^-6, 3 seeds). At 215M, DCA beats a 355M baseline by 1.54x with fewer parameters, approximately matched FLOPs, and less memory. We characterized the consensus mechanism through causal interventions at both scales. Despite the small capacity of these models, our force-decode analysis shows an unambiguous representational advantage in multi-document composition. DCA assigns higher probability to the correct answer tokens on 97.8% of all examples at 90M, and the advantage scales with difficulty, with 7.7x larger gains on the hardest examples.
Related Work
Multi-scale and sparse attention methods such as Longformer[6], BigBird[7], and RetNet[8] combine local and global attention within a single stream, blending scales early or continuously. DCA maintains separate streams that develop independent representations before merging. Ring Attention[9] and FlashAttention[10] address computational cost but not fusion timing; DCA is orthogonal to and compatible with these methods.
Multi-path architectures provide useful precedents but differ in mechanism. ResNeXt[11] established split-transform-merge for vision. Mixture of Experts[12][13] increases capacity through sparse routing. DCA differs in that all perspectives are always active and differentiated by attention scale rather than learned routing. The gated consensus mechanism uses Highway Network-style residual connections[14] with periodic synchronization analogous to federated averaging[15].
HotpotQA requires composing information across two Wikipedia paragraphs among eight distractors. Encoder models at 110M-355M achieve substantially higher scores with bidirectional attention, while decoder-only models generally require 7B+ to reach around 30% EM[6][7][16][17][18]. To our knowledge, no published decoder-only HotpotQA results exist between 90M and 7B parameters.
The Architecture
DCA replaces each transformer block with K parallel attention streams (“perspectives”), each operating at a different window size. In our experiments, K=3 with horizons [32, 128, 0], where 0 denotes full causal attention. Each perspective has its own QKV projection weights. All perspectives share a single MLP, with all paths always active (closer to ResNeXt’s split-transform-merge[11] than to Mixture of Experts’ selective routing[12], and analogous to cross-scale pooling in multi-scale vision transformers[19]). Every N layers, the perspectives merge via a Highway Network-style gate[14], a periodic synchronization analogous to federated averaging[15]. This gate is content-dependent and learned, and the model discovers a depth-dependent strategy where early layers mostly pass through and late layers merge more fully, as shown later in the mechanistic analysis.
```
consensus = mean(perspective_1, ..., perspective_K)
gate      = sigmoid(W_g * RMSNorm(x))
output    = (1 - gate) * x + gate * consensus
```
Note that while the implementation described here uses dense causal attention, DCA is a more general late-consensus primitive. The consensus mechanism operates on tensors, so any module that takes [B, T, D] and produces [B, T, D] can serve as a perspective. In this work, we use dense causal attention with different window sizes, but other sequence-processing modules (ring attention, linear attention, SSMs) could serve the same role.
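For concreteness, below is a simplified PyTorch sketch of one windowed perspective and the highway consensus merge. It mirrors the equations above but is illustrative rather than our exact training code; the masking, normalization, and module boundaries are minimal stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Perspective(nn.Module):
    """One attention stream: causal self-attention restricted to a window (window=0 means full causal)."""
    def __init__(self, d_model, n_heads, window):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)   # separate QKV weights per perspective
        self.proj = nn.Linear(d_model, d_model)
        self.n_heads, self.window = n_heads, window

    def forward(self, x):                            # x: [B, T, D]
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        idx = torch.arange(T, device=x.device)
        mask = idx[None, :] <= idx[:, None]                               # causal
        if self.window > 0:
            mask = mask & ((idx[:, None] - idx[None, :]) < self.window)   # banded horizon
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        return self.proj(out.transpose(1, 2).reshape(B, T, D))

class HighwayConsensus(nn.Module):
    """Periodic merge: a learned gate decides how much of the mean-of-perspectives to adopt."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.RMSNorm(d_model)   # requires PyTorch >= 2.4; LayerNorm is a close stand-in
        self.w_g = nn.Linear(d_model, d_model)

    def forward(self, x, perspectives):   # perspectives: list of K tensors, each [B, T, D]
        consensus = torch.stack(perspectives, dim=0).mean(dim=0)
        gate = torch.sigmoid(self.w_g(self.norm(x)))
        return (1 - gate) * x + gate * consensus
```

In the full model the perspectives run at every layer with a shared MLP, while the highway consensus is applied only every N layers; the bottleneck variant described below additionally projects each perspective down to d_lane before attention.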
Figure 1b DCA architecture. K=3 perspectives fork from the residual stream, process with separate attention and shared MLP, then merge via learned highway consensus. The cycle repeats every N layers.
Design tradeoffs
Full-fat DCA at baseline width (K=3 at d=1024) costs 3x VRAM and ~2.7x FLOPs. Bottleneck projections let perspectives operate at d_lane=512 inside d_model=1024. The key math is that 3 x 512^2 < 1024^2, so K=3 perspectives at d=512 are cheaper per layer than a single stream at d=1024. This replaces the role that global tokens play in Longformer and BigBird[6][7] in a causal-compatible way; global tokens are functionally vacuous in causal decoders because a global token placed at position 0 can attend only to itself under the causal mask. Per-perspective gradient checkpointing reduces activation memory from ~3x baseline to below baseline levels. We scale by adding layers (30 layers) at the cheap d=512 perspective width rather than widening to d=1024.
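The per-layer arithmetic behind the bottleneck claim, as a quick check (this covers only the quadratic projection term the inequality refers to; the full per-layer cost also depends on how lanes are projected in and out of d_model):

```python
d_model, d_lane, K = 1024, 512, 3

single_stream = d_model ** 2          # one projection matrix at full width
bottleneck_lanes = K * d_lane ** 2    # K perspective projections at lane width

print(single_stream, bottleneck_lanes, bottleneck_lanes / single_stream)
# 1048576 786432 0.75 -> three d=512 lanes cost ~75% of one d=1024 stream per projection
```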
Table 1 DCA design space. Theoretical and tested variants with FLOP and VRAM tradeoffs.
| Variant | d_model | d_lane | Params | FLOP ratio | VRAM vs baseline | MLP |
|---|---|---|---|---|---|---|
| Baseline | 1024 | – | 355M | 1.0x | 1.0x | 1 stream |
| Full-fat (K=3) | 1024 | 1024 | 556M | 2.71x | ~3x | shared |
| DCA-215M | 1024 | 512 | 215M | 1.24x | ~0.8x | shared |
| DCA-215M + separate MLPs | 1024 | 512 | 341M | 1.24x | ~0.8x | K weights |
Benchmarks
HotpotQA at 90M (WikiText-103)
Which film received more Academy Award nominations, Zero Dark Thirty or Arrival?
Figure 2 HotpotQA distractor setting. 10 paragraphs per question: 2 supporting (gold), 8 distractors (gray). The answer requires composing information from both gold paragraphs scattered among topically similar distractors.
We pretrained DCA (89M params) and a parameter-matched baseline (90M params) on WikiText-103[20] for 50K steps, then finetuned both on HotpotQA across three seeds. Though DCA is modestly worse on WikiText-103 validation perplexity (21.48 vs 20.79, ~3%), the tradeoff on multi-hop reasoning is strongly in its favor: DCA achieves 5.4x higher exact match on HotpotQA (1.56% vs 0.29%, Table 2), with p < 10^-6 and odds ratio 5.49 (Fisher exact, pooled across seeds). DCA outperformed every baseline variant we tested, across both 50K and 30K pretrain budgets (see “Additional WT103 variants” in the Appendix).
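The pooled test can be reproduced from the per-seed results in Appendix Table 6; a minimal sketch, with per-seed correct counts recovered by multiplying each seed’s EM by n=6,359 and rounding:

```python
from scipy.stats import fisher_exact

n = 3 * 6359                       # 3 seeds x 6,359 validation examples
dca_correct = 101 + 91 + 106       # per-seed EM counts for DCA 90M (~1.56% pooled)
base_correct = 12 + 27 + 16        # per-seed EM counts for the 90M baseline (~0.29% pooled)

table = [[dca_correct, n - dca_correct],
         [base_correct, n - base_correct]]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(odds_ratio, p_value)         # ~5.49, p < 1e-6
```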
Scaling to PG-19 and architectural exploration
The 90M result raises a natural question: does the advantage hold at larger scale? While the relative advantage is clear, absolute performance of both models is low (1.56% and 0.29% EM). WT103 is too small for 350M-class models, so we switched to PG-19[21] (3B tokens), following standard practice.
To calibrate the effect of pretraining domain, we also trained DCA 90M on PG-19 (EM=0.38%, compared to 1.56% on WT103). The 350M standard baseline on PG-19 achieves 0.93% EM, indicating that even at 4x the parameters, standard decoders remain poor at multi-hop QA. A FLOP-comparable DCA-215M on PG-19 achieves 1.43% EM vs the baseline’s 0.93% (1.54x), with 39% fewer parameters and less VRAM (~35 vs ~45 GB). Within PG-19, scaling DCA from 90M to 215M improves EM from 0.38% to 1.43%, surpassing the 350M baseline by 1.54x.
Table 2 HotpotQA results across scales and pretraining domains. All models finetuned and evaluated on HotpotQA.
WT103 pretraining (90M):
| Model | Params | FLOP ratio | VRAM | EM% | F1% |
|---|---|---|---|---|---|
| Baseline 90M | 90M | ~1.0x | ~4 GB | 0.29 | 7.77 |
| DCA 90M | 89M | ~1.56x | ~8 GB | 1.56 | 14.40 |
PG-19 pretraining (up to 350M):
| Model | Params | FLOP ratio | VRAM | EM% | F1% |
|---|---|---|---|---|---|
| DCA 90M | 89M | ~1.56x | ~8 GB | 0.38 | 7.78 |
| Baseline 350M | 355M | 1.0x | ~45 GB | 0.93 | 10.81 |
| DCA-215M | 215M | 1.24x | ~35 GB | 1.43 | 11.32 |
Architectural exploration
Scaling provided an opportunity to test which components of DCA are essential. Our 90M baseline uses per-head window assignment (heads 0-2 at w=32, heads 3-5 at w=128, heads 6-7 full causal), achieving the best perplexity among baseline variants (20.79) but only 0.29% EM on HotpotQA. A factorial experiment confirmed parallel streams are the primary mechanism; multi-scale windows are secondary. Two other architectural properties proved essential. Shared QKV weights collapse perspective diversity (cosine similarity >0.9 vs 0.2-0.4 with separate weights). Consensus at every layer (N=1) drops EM to 1.05%, versus 1.59% with consensus every 6 layers (N=6).
These results motivated the DCA-215M design. A narrower variant at d=768 (323M params) achieved only EM=0.57%, undertrained at 6.2 tokens per parameter. A variant without per-perspective MLP (302M params, 1.07x FLOPs) achieved EM=1.21% (1.30x over baseline). Separate MLP weights (139M params, same FLOPs) achieved PPL=20.45 but EM=0.80%, confirming the shared MLP acts as a regularizer.
Table 3 Architectural variant results. All evaluated on HotpotQA.
| Model | Params | d_model | d_lane | EM% | VRAM |
|---|---|---|---|---|---|
| DCA 90M (full-fat) | 89M | 512 | 512 | 1.56 | ~8 GB |
| DCA-d768 | 323M | 768 | 768 | 0.57 | ~45 GB |
| DCA-noMLP | 302M | 1024 | 768 | 1.21 | ~35 GB |
| DCA-215M (bottleneck) | 215M | 1024 | 512 | 1.43 | ~35 GB |
| DCA 90M (separate MLPs) | 139M | 512 | 512 | 0.80 | ~10 GB |
The DCA-215M results (Tables 2 and 3) confirm this design is practical and competitive.
What is DCA well suited for?
We selected benchmarks to test where DCA should help, not to maximize wins. Few existing tasks isolate distributed-source composition while remaining tractable for sub-billion-parameter decoder-only models, so HotpotQA serves as the primary stress test, 2Wiki as secondary corroboration, and the remaining tasks as negative controls.
Figure 3 Information topology. DCA helps when relevant content is distributed across independent documents (left). It does not help when information forms a single chain (right).
Sequential reasoning tasks (bAbI[22], Tree pathfinding[23], PrOntoQA[24], LEGO[25]) show no advantage; all facts lie in a single flat sequence and every attention scale sees the same chain. Single-source tasks (TriviaQA[26], LAMBADA[27], MQAR[28]) show no advantage; all perspectives see the same content. Tasks beyond model capacity (MuSiQue[29]) show both models at floor. 2WikiMultiHopQA[30] provides weak corroboration (Soft EM p = 0.004, EM ns).
DCA helps when relevant information is distributed across structurally independent segments (what we refer to as the information topology of the input), and does not help when information forms a single chain or resides at a single location. Within HotpotQA, the advantage is uniform across bridge questions (sequential logic, OR=4.65) and comparison questions (parallel logic, OR=4.51), indicating that multi-document context, not reasoning pattern, is the key factor.
Mechanistic Analysis
Force-decode: the representation advantage
To separate representation quality from generation dynamics, we feed the context to both models and force-decode the gold answer tokens (teacher-forcing), recording each model’s log-probability of the correct token at each position. For each of 6,359 validation examples, we compare which model assigns higher probability to the gold answer.
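A minimal sketch of the per-example comparison, assuming a Hugging Face-style causal LM that returns logits; tokenization details and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def gold_logprob(model, context_ids, answer_ids):
    """Teacher-force the gold answer and sum its per-token log-probabilities."""
    ids = torch.cat([context_ids, answer_ids], dim=-1).unsqueeze(0)   # [1, T]
    logits = model(ids).logits                                        # [1, T, V]
    start = context_ids.size(-1) - 1            # logits at position t predict token t+1
    logp = F.log_softmax(logits[0, start:-1], dim=-1)                 # [A, V], A = answer length
    return logp.gather(-1, answer_ids.unsqueeze(-1)).sum().item()

# Per example: advantage = gold_logprob(dca, ctx, ans) - gold_logprob(baseline, ctx, ans).
# DCA "wins" the example when advantage > 0; Table 4 reports the win rate and mean advantage.
```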
Table 4: Force-decode results (90M, WT103). Top: paired comparison across all 6,359 validation examples (Wilcoxon signed-rank p < 10^-300). Bottom: advantage by baseline difficulty quintile.
| Slice | DCA advantage | DCA win rate |
|---|---|---|
| Overall | +6.25 nats (~520x) | 97.8% (6,217/6,359) |
| 0-20% (easiest) | +1.84 nats | 96.5% |
| 20-40% | +3.20 nats | 96.2% |
| 40-60% | +4.75 nats | 96.5% |
| 60-80% | +7.25 nats | 99.6% |
| 80-100% (hardest) | +14.23 nats | 100% |
The representation advantage is near-universal: DCA produces better internal representations on 97.8% of all examples, not just the 1.6% where EM=1. The advantage correlates with baseline difficulty (r=-0.888): 7.7x larger on the hardest examples than the easiest (Table 4, Figure 4). The harder an example is for a standard transformer, the more DCA’s multi-perspective consensus improves the representation.
Figure 4 DCA 90M vs Baseline 90M (both WT103). DCA’s representational advantage scales with example difficulty. On the hardest quintile (where the baseline assigns the lowest probability to the correct answer), DCA’s advantage is +14.23 nats. On the easiest, +1.84 nats. r=-0.888.
Recent work on latent multi-hop reasoning finds that while bridge-entity recall scales smoothly with model size, the compositional second hop does not, suggesting composition is a structural bottleneck rather than a capacity problem[31]. That work studies parametric knowledge recall; DCA’s setting differs in that all relevant information is provided in context. Nevertheless, our force-decode result is consistent with the broader view that end-to-end exact match may understate the gradual development of multi-hop structure in model representations: at 90M, the correct answer is already encoded with substantially higher probability under the right architecture, even where end-to-end EM remains near floor. Confirming this connection would require targeted compositional probes, such as entity-recall scores and causal interventions on bridge entities, applied directly to DCA’s internal representations.
Same retrieval, better composition
We computed mean token recall and derived an approximate token precision from aggregate F1 and recall on generated predictions (90M, WT103, pooled across seeds 137 and 2024). Token Recall is essentially identical (~57.5% in both models). Token Precision, derived via P = F1·R / (2R - F1), shows the full advantage: ~8.2% vs ~4.2% (~2.0x). The advantage appears to come primarily from composition rather than token-level recall.
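Rearranging F1 = 2PR/(P+R) gives the precision formula; plugging in the aggregate values above reproduces the reported numbers (treating pooled means as if they came from a single prediction, which is why we describe the precision as approximate):

```python
def precision_from_f1_recall(f1, r):
    # F1 = 2PR / (P + R)  =>  P = F1 * R / (2R - F1)
    return f1 * r / (2 * r - f1)

print(precision_from_f1_recall(0.144, 0.575))   # DCA 90M:      ~0.082
print(precision_from_f1_recall(0.078, 0.575))   # baseline 90M: ~0.042
```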
Figure 5 DCA 90M vs Baseline 90M (both WT103). Token Recall is essentially identical (~57.5%). Token Precision (derived from aggregate F1 and recall) shows the full advantage (~8.2% vs ~4.2%).
First-sentence extraction approximately decomposes this into two components. The advantage that survives extraction (~1.2-1.5x) reflects compositional integration at the representation level. The remaining multiplier (~2-3x) reflects generation coherence, scaling with answer length (3x at 1 token, 12.8x at 4+ tokens). These ranges are inferred from comparing first-sentence and full-output EM ratios, not independently measured.
Gate ablation: consensus is essential and precisely tuned
We force the consensus gate to fixed values during full QA evaluation using forward hooks. Gate=0 clamps the sigmoid to 0.001 (bypass consensus). Gate=1 clamps to 0.999 (force full consensus).
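A minimal sketch of the intervention, assuming the gate computation is exposed as a submodule whose output a forward hook can override; the module paths are placeholders for however the checkpoint actually names them:

```python
import torch

def clamp_gate(value):
    """Forward hook that overrides a gate module's output with a fixed value."""
    def hook(module, inputs, output):
        return torch.full_like(output, value)   # 0.001 to bypass consensus, 0.999 to force it
    return hook

def run_gate_ablation(model, eval_fn, value):
    """Clamp every consensus gate to `value`, run evaluation, then restore."""
    handles = [layer.gate.register_forward_hook(clamp_gate(value))
               for layer in model.consensus_layers]   # hypothetical module path
    try:
        return eval_fn(model)
    finally:
        for h in handles:
            h.remove()
```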
Table 5: Gate ablation and learned gate values. Top: number of exactly correct answers (out of 6,359 validation examples) when forcing the gate to fixed values during QA evaluation. Bottom: learned gate values at consensus layers.
| Condition | 90M (correct) | 215M (correct) |
|---|---|---|
| Learned gates | 101 | 91 |
| Gate=1 (force full) | 18 | 0 |
| Baseline | 12 | 59 |
| Gate=0 (bypass) | 2 | 2 |
| Consensus layer | 90M gate | 215M gate |
|---|---|---|
| Layer 5 (1st) | 0.29 | 0.31 |
| Layer 11 (final @ 90M) | 0.99 | 0.21 |
| Layer 17 (3rd) | – | 0.28 |
| Layer 23 (4th) | – | 0.35 |
| Layer 29 (final) | – | 0.37 |
Bypassing consensus collapses performance from 101 to 2 correct at 90M and 91 to 2 at 215M (Table 5). Forcing full consensus drops 101 to 18 at 90M and 91 to 0 at 215M, worse than the 350M baseline (59 correct), indicating that forced consensus is actively destructive to the representations DCA has learned to build through gradual integration.
At 90M the learned strategy is binary: passthrough early (0.29), full commit at the final layer (0.99). At 215M it is gradual and never exceeds 0.37. Both strategies are load-bearing, and disrupting either destroys performance.
Perspective divergence and attention patterns
Perspectives develop genuinely distinct representations (cosine similarity 0.21 between local and medium at layer 5), complementary rather than redundant (Appendix Table 8).
Figure 6 DCA 90M vs Baseline 90M (both WT103), consensus layer 5. Each DCA perspective specializes at a different scale, while the baseline compromises at 0.34.
Attention measurements (computed on EM=1 examples, n=101) confirm the specialization. The local perspective keeps 96% of attention within paragraphs (cross-document fraction 0.04), while the global perspective distributes 68% across documents. The baseline sits at 0.34. With DCA, local perspectives extract precise within-document content, global perspectives maintain cross-document context, and consensus integrates both. The baseline attends at multiple scales within a single residual stream, but must reconcile those scales within one shared representation.
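A minimal sketch of the cross-document attention statistic, assuming access to per-head attention weights and a per-token document ID (both are instrumentation we add for analysis, not part of the model interface):

```python
import torch

def cross_document_fraction(attn, doc_ids):
    """
    attn:    [H, T, T] attention weights for one example (rows sum to 1 over keys)
    doc_ids: [T] integer paragraph/document id per token
    Returns the mean fraction of attention mass landing on a *different* document.
    """
    cross = doc_ids[None, :] != doc_ids[:, None]          # [T, T] True where query/key docs differ
    return (attn * cross.unsqueeze(0)).sum(-1).mean().item()
```

Averaged over the EM=1 examples, this statistic gives the 0.04 (local), 0.68 (global), and 0.34 (baseline) fractions quoted above.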
Conclusion
Multi-document composition is a documented bottleneck for production LLMs. RAG pipelines retrieve relevant documents but fail to synthesize across them[32]. Models fail to use information in the middle of long contexts[33]. Multi-hop reasoning may require 30-70B parameters to emerge in standard transformers[34]. With DCA, we sought to demonstrate that parallel multi-scale perspectives with periodic late consensus can improve on these deficiencies.
Despite the limited capacity of our models, extensive benchmarking demonstrated a consistent advantage on distributed-source tasks (5.4x EM at 90M, 1.54x at a FLOP-comparable 215M) and no advantage on sequential, single-source, or capacity-limited tasks. At 90M, the resulting representations encode multi-document relationships better than a parameter-matched standard transformer on 97.8% of examples, with 7.7x larger gains on the hardest examples, reflecting an advantage in composition rather than retrieval.
Consensus frequency, horizon widths, and gate training dynamics were fixed throughout our experiments, leaving substantial room for task-specific tuning. Because DCA is fundamentally a primitive that operates on tensors, other sequence-processing modules (ring attention, linear attention, SSMs) could serve as perspectives, opening a combinatorial design space we have only begun to explore. The force-decode diagnostic is itself useful beyond DCA, offering a general tool for determining whether the bottleneck in a given architecture is understanding or expression. We will be sharing the base code for DCA on GitHub.
Appendix
HotpotQA task illustration
Figure 2 above illustrates the HotpotQA distractor setting used throughout the paper: 10 paragraphs per question, with 2 supporting paragraphs embedded among 8 distractors.
Multi-seed raw data
Table 6: Per-seed HotpotQA results (90M, WT103). DCA mean EM: 1.562% (std 0.120%). Baseline mean EM: 0.288% (std 0.122%). Fisher exact (pooled): OR=5.49, p < 10^-6.
| Model | Seed | EM | F1 | n |
|---|---|---|---|---|
| DCA 90M | 42 | 0.01588 | 0.14508 | 6359 |
| DCA 90M | 137 | 0.01431 | 0.14439 | 6359 |
| DCA 90M | 2024 | 0.01667 | 0.14265 | 6359 |
| Baseline 90M | 42 | 0.00189 | 0.07625 | 6359 |
| Baseline 90M | 137 | 0.00425 | 0.08143 | 6359 |
| Baseline 90M | 2024 | 0.00252 | 0.07528 | 6359 |
Additional benchmark evaluations
All results in this section come from 90M models pretrained on WT103.
Sequential reasoning tasks show no DCA advantage: bAbI 2-hop hits 100% for both models at all distractor counts, and PrOntoQA also hits 100% at all hop counts. Tree pathfinding favors the baseline by 2-7 points at depths 4-6. LEGO is roughly even, with the baseline at ~31% and DCA at ~30%.
Single-source tasks also show no advantage. TriviaQA and LAMBADA show no consistent lift. MQAR (fixed protocol, vocab=8192) remains at exact chance across all key-value counts and learning rates for both models.
Capacity-limited tasks stay at floor. MuSiQue shows DCA at 0.21% and baseline at 0.10% (p = 0.687, not significant). Additional synthetic compositional probes were similarly uninformative at this scale: Entity Comparison remained at chance (50%, loss near ln 2), and MQAR2 also remained at chance (50%). We treat these as floor-effect results for small models trained from scratch rather than meaningful tests of DCA’s inductive bias.
2WikiMultiHopQA provides weak corroboration: EM is even (0.31% vs 0.33%, p = 1.0), but Soft EM (F1 >= 0.5) favors DCA at 2.31% vs 1.47% (p = 0.004).
Force-decode difficulty scaling

Appendix Figure Per-example force-decode advantage (DCA log-prob minus baseline log-prob) plotted against baseline log-prob (90M DCA vs 90M baseline, both WT103). Each point is one of 6,359 HotpotQA validation examples. r=-0.888. The harder an example is for the baseline (more negative log-prob), the larger DCA’s representational advantage.
Literature gap
To our knowledge, no published decoder-only HotpotQA results exist between 90M and 7B parameters.
Table 7: Published HotpotQA results. Encoder models dominate at 110M-355M due to bidirectional attention and span extraction heads. Decoder-only models need 7B+ for ~30% EM.
| Architecture | Params | HotpotQA | Notes |
|---|---|---|---|
| DCA 90M | 89M | 1.56% EM | Decoder, WT103 |
| Baseline 90M | 90M | 0.29% EM | Decoder, WT103 |
| Baseline 350M | 355M | 0.93% EM | Decoder, PG-19 |
| DCA-215M | 215M | 1.43% EM | Decoder, PG-19 |
| BERT-base-era systems | ~110M | ~54% EM | Encoder, bidirectional |
| Longformer-base | ~149M | 64% F1 | Encoder, local+global |
| Longformer-large | ~435M | 73% F1 | Encoder, local+global |
| BigBird-ETC | ~131M | 76% F1 | Encoder, sparse |
| RoBERTa-large-based systems | ~355M | ~70% EM | Encoder |
| Llama-2-7B | 7B | ~30% EM | Decoder (FireAct) |
| GPT-3.5 | proprietary | ~31% EM | Few-shot ReAct |
| Human | – | ~91% F1 | Leaderboard |
Encoder models such as BERT[16], Longformer[6], BigBird-ETC[7], and RoBERTa[17] dominate at 110M-355M because HotpotQA was designed for BERT-era extractive QA with bidirectional attention and span extraction heads. Decoder-only models need 7B+ for ~30% EM (FireAct with Llama-2-7B[18]). Steele & Katz[34] identify a phase transition at 30-70B for emergent multi-hop reasoning.
Additional WT103 variants
In addition to the headline comparison (DCA vs baseline_mixed, 50K steps, 3 seeds), we trained six additional 90M WT103 variants: three DCA variants (consensus every 1, 3, or 6 layers, plus uniform-horizon settings) and three baseline variants (full causal, layerwise windows, sliding window w=256) at 50K or 30K steps. Every DCA variant outperformed every baseline variant on HotpotQA EM, including cross-budget comparisons where DCA trained for 30K steps with 1-epoch finetuning exceeded baselines trained for 50K steps with 3-epoch finetuning. The factorial decomposition confirmed that parallel streams are the primary mechanism; multi-scale windows are secondary.
Perspective divergence
Pairwise cosine similarity between K=3 perspectives at consensus layers, measured on HotpotQA inputs (90M, WT103). Stratifying by EM makes no difference: correct and incorrect examples show nearly identical divergence, confirming divergence is an architectural property rather than a predictor of success.
Table 8: Perspective divergence on QA data. Local and medium are most dissimilar at layer 5 (0.21); by layer 11 they partially reconverge (0.62) while local-global remains distinct (0.34).
| Layer | Pair | Overall | EM=1 (n=101) | EM=0 (n=6,258) |
|---|---|---|---|---|
| 5 | local vs medium | 0.207 | 0.209 | 0.207 |
| 5 | local vs global | 0.435 | 0.437 | 0.435 |
| 5 | medium vs global | 0.405 | 0.407 | 0.405 |
| 11 | local vs medium | 0.621 | 0.625 | 0.621 |
| 11 | local vs global | 0.336 | 0.337 | 0.336 |
| 11 | medium vs global | 0.437 | 0.436 | 0.437 |
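A minimal sketch of the divergence measurement behind Table 8, assuming the per-perspective outputs at a consensus layer have been captured (e.g. with forward hooks); tensor names are illustrative:

```python
import torch.nn.functional as F

def perspective_similarity(p_a, p_b):
    """
    p_a, p_b: [B, T, D] outputs of two perspectives at the same consensus layer.
    Returns the mean per-token cosine similarity, as reported in Table 8.
    """
    return F.cosine_similarity(p_a, p_b, dim=-1).mean().item()
```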
Citation
This blog post serves as the current preprint version of this work. Until an archival version is available, please cite it as:
```
@misc{zhao2026dca,
  author       = {Ben Zhao and Jenhan Tao},
  title        = {Divergent-Convergent Attention: Parallel Perspectives for Compositional Reasoning},
  year         = {2026},
  howpublished = {\url{https://iluvatarlabs.github.io/blog/2026/03/divergent-convergent-attention/}},
  note         = {Iluvatar Labs blog preprint}
}
```
Acknowledgements
We thank Abel Chiao for helpful discussions and feedback on this work.
References

1. Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., & Manning, C. D. (2018). “HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering”. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018). arXiv:1809.09600
2. Buzsaki, G. (2006). Rhythms of the Brain. Oxford University Press.
3. Canolty, R. T., & Knight, R. T. (2010). “The Functional Role of Cross-Frequency Coupling”. Trends in Cognitive Sciences, 14(11), 506-515.
4. Lisman, J. E., & Jensen, O. (2013). “The Theta-Gamma Neural Code”. Neuron, 77(6), 1002-1016.
5. Colgin, L. L., Denninger, T., Fyhn, M., Hafting, T., Bonnevie, T., Jensen, O., Moser, M.-B., & Moser, E. I. (2009). “Frequency of Gamma Oscillations Routes Flow of Information in the Hippocampus”. Nature, 462, 353-357.
6. Beltagy, I., Peters, M. E., & Cohan, A. (2020). “Longformer: The Long-Document Transformer”. arXiv:2004.05150
7. Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., & Ahmed, A. (2020). “Big Bird: Transformers for Longer Sequences”. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). arXiv:2007.14062
8. Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., & Wei, F. (2023). “Retentive Network: A Successor to Transformer for Large Language Models”. arXiv:2307.08621
9. Liu, H., Zaharia, M., & Abbeel, P. (2023). “Ring Attention with Blockwise Transformers for Near-Infinite Context”. arXiv:2310.01889
10. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Re, C. (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”. Advances in Neural Information Processing Systems, 35.
11. Xie, S., Girshick, R., Dollar, P., Tu, Z., & He, K. (2017). “Aggregated Residual Transformations for Deep Neural Networks”. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017). arXiv:1611.05431
12. Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”. Proceedings of the 5th International Conference on Learning Representations (ICLR 2017). arXiv:1701.06538
13. Fedus, W., Zoph, B., & Shazeer, N. (2022). “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”. Journal of Machine Learning Research, 23(120), 1-39.
14. Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). “Highway Networks”. arXiv:1505.00387
15. McMahan, H. B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017). “Communication-Efficient Learning of Deep Networks from Decentralized Data”. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017). arXiv:1602.05629
16. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019). arXiv:1810.04805
17. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). “RoBERTa: A Robustly Optimized BERT Pretraining Approach”. arXiv:1907.11692
18. Chen, B., Monajatipoor, M., Veen, D. V., Guo, Y., & Dubrawski, A. (2023). “FireAct: Toward Language Agent Fine-tuning”. arXiv:2310.05915
19. Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., & Feichtenhofer, C. (2021). “Multiscale Vision Transformers”. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2021). arXiv:2104.11227
20. Merity, S., Xiong, C., Bradbury, J., & Socher, R. (2017). “Pointer Sentinel Mixture Models”. Proceedings of the 5th International Conference on Learning Representations (ICLR 2017). arXiv:1609.07843
21. Rae, J. W., Potapenko, A., Jayakumar, S. M., & Hillier, C. (2020). “Compressive Transformers for Long-Range Sequence Modelling”. Proceedings of the 8th International Conference on Learning Representations (ICLR 2020). arXiv:1911.05507
22. Weston, J., Bordes, A., Chopra, S., & Mikolov, T. (2015). “Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks”. arXiv:1502.05698
23. Brinkmann, J., Goswami, K., & Rajani, N. F. (2024). “A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task”. Findings of the Association for Computational Linguistics: ACL 2024.
24. Saparov, A., & He, H. (2023). “Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought”. Proceedings of the 11th International Conference on Learning Representations (ICLR 2023).
25. Zhang, Y., Yu, A. W., & Xu, W. (2022). “Unveiling Transformers with LEGO: A Synthetic Reasoning Task”. arXiv:2206.04301
26. Joshi, M., Choi, E., Weld, D. S., & Zettlemoyer, L. (2017). “TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension”. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017). arXiv:1705.03551
27. Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N. Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., & Fernandez, R. (2016). “The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context”. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). arXiv:1606.06031
28. Arora, S., Eyuboglu, S., Timalsina, A., Johnson, I., Poli, M., Rudra, A., & Zou, J. (2023). “Zoology: Measuring and Improving Recall in Efficient Language Models”. arXiv:2312.04927
29. Trivedi, H., Balasubramanian, N., Khot, T., & Sabharwal, A. (2022). “MuSiQue: Multihop Questions via Single-hop Question Composition”. Transactions of the Association for Computational Linguistics, 10, 539-554. arXiv:2108.00573
30. Ho, X., Nguyen, A.-K. D., Sugawara, S., & Aizawa, A. (2020). “Constructing a Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps”. Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020). arXiv:2011.01060
31. Yang, S., Gribovskaya, E., Kassner, N., Geva, M., & Riedel, S. (2024). “Do Large Language Models Latently Perform Multi-Hop Reasoning?” Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024).
32. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Kuttler, H., Lewis, M., Yih, W.-t., Rocktaschel, T., Riedel, S., & Kiela, D. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). arXiv:2005.11401
33. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). “Lost in the Middle: How Language Models Use Long Contexts”. Transactions of the Association for Computational Linguistics, 12, 157-173. arXiv:2307.03172
34. Steele, B., & Katz, M. (2026). “Scaling Trends for Multi-Hop Contextual Reasoning in Mid-Scale Language Models”. arXiv:2601.04254