Comparison between DMLR and two existing reasoning paradigms. (A) Text-only reasoning: relies solely on explicit chain-of-thought (CoT) text, often causing visual grounding errors and redundant steps. (B) Think-with-Image reasoning: depends on external perception tools, leading to unstable tool calls and extra overhead. (C) DMLR (ours): iteratively refines latent think tokens through confidence-guided optimization and dynamically injects visual information, achieving self-improving reasoning without additional training while maintaining high efficiency.
Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding. However, existing methods often rely on explicit step-by-step textual reasoning or unstable external tools. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. We propose DMLR (Dynamic Multimodal Latent Reasoning), a test-time framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens. Furthermore, we introduce a Dynamic Visual Injection Strategy, which retrieves relevant visual features only when needed during the thought process. Experiments across seven benchmarks demonstrate that DMLR significantly improves reasoning performance and efficiency without additional training.
Analysis of visual dependency and confidence.
Visual information is not uniformly needed; it contributes only at specific reasoning stages.
Internal confidence strongly correlates with reasoning correctness and visual grounding quality.
Confidence-guided optimization of Latent Think Tokens with Dynamic Visual Injection.
Instead of generating explicit reasoning text, DMLR introduces learnable Latent Think Tokens ($\mathcal{T}$). At test time, we employ a policy gradient approach to iteratively refine these tokens. The optimization is guided by a Confidence Reward ($\mathcal{R}$), which encourages the model to find reasoning paths that maximize internal consistency and prediction confidence.
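As a concrete illustration, a minimal sketch of such an update, assuming a REINFORCE-style estimator with the confidence reward acting as the return (the step size $\eta$, sampled answer $\hat{y}^{(t)}$, question $x$, and image $I$ are our notation, not necessarily the paper's exact formulation), is:

$$\mathcal{R}\big(\hat{y}^{(t)}\big) = \frac{1}{|\hat{y}^{(t)}|} \sum_{j} \log p_\theta\big(\hat{y}^{(t)}_j \mid x, I, \mathcal{T}^{(t)}, \hat{y}^{(t)}_{<j}\big), \qquad \mathcal{T}^{(t+1)} = \mathcal{T}^{(t)} + \eta\,\mathcal{R}\big(\hat{y}^{(t)}\big)\,\nabla_{\mathcal{T}} \log p_\theta\big(\hat{y}^{(t)} \mid x, I, \mathcal{T}^{(t)}\big)$$

Under this instantiation, $\mathcal{R}$ is the mean answer-token log-probability, so higher-confidence answers reinforce the latent think tokens that produced them.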
To mimic human cognitive interleaving, we do not inject the entire image at once. Instead, DVI dynamically retrieves the most relevant visual features based on the current thought state. At each iteration, the model selects top-$k$ visual patches. If injecting these patches improves the confidence reward, they are integrated into the reasoning process, ensuring precise and efficient visual grounding.
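One DVI retrieval-and-check step can be sketched as below. This is an illustrative outline rather than the released implementation: the tensor shapes, the `confidence_fn` hook (which scores prediction confidence for a candidate input sequence), and the default $k$ are assumptions.

import torch
import torch.nn.functional as F

def dvi_step(thought_state, patch_feats, context, confidence_fn, k=8):
    # thought_state: [d] current latent thought vector
    # patch_feats:   [num_patches, d] visual patch embeddings
    # context:       [seq_len, d] current reasoning sequence
    # Score each patch by similarity to the current thought state.
    sims = F.cosine_similarity(patch_feats, thought_state.unsqueeze(0), dim=-1)
    topk_idx = sims.topk(min(k, sims.numel())).indices

    # Tentatively inject the top-k patches into the reasoning context.
    candidate = torch.cat([context, patch_feats[topk_idx]], dim=0)

    # Keep the injection only if it improves the confidence reward.
    if confidence_fn(candidate) > confidence_fn(context):
        return candidate
    return context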
Efficiency vs. Accuracy
Comparison of Visual Grounding: Explicit CoT vs. DMLR
Visualization of attention heatmaps during the reasoning process. The baseline Explicit CoT method often exhibits scattered attention, shifting focus towards task-irrelevant regions, which leads to hallucinated reasoning steps.
In contrast, DMLR maintains a stable and focused attention distribution throughout the iterative optimization process. By dynamically injecting only the best visual patches, DMLR successfully converges on the key visual evidence required to solve the problem, resulting in more reliable and grounded reasoning chains. This demonstrates that latent interleaving effectively bridges the gap between high-level semantic reasoning and low-level visual perception.
@misc{liu2025reasoningminddynamicmultimodal,
      title={Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space},
      author={Chengzhi Liu and Yuzhe Yang and Yue Fan and Qingyue Wei and Sheng Liu and Xin Eric Wang},
      year={2025},
      eprint={2512.12623},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.12623},
}