Comparison between DMLR and two existing reasoning paradigms. (A) Text-only reasoning: relies solely on explicit chain-of-thought (CoT) text, often causing visual grounding errors and redundant steps. (B) Think-with-Image reasoning: depends on external perception tools, leading to unstable tool calls and extra overhead. (C) DMLR (ours): iteratively refines latent think tokens through confidence-guided optimization and dynamically injects visual information, achieving self-improving reasoning without additional training while maintaining high efficiency.
Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding. However, existing methods often rely on explicit step-by-step textual reasoning or unstable external tools. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. We propose DMLR (Dynamic Multimodal Latent Reasoning), a test-time framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens. Furthermore, we introduce a Dynamic Visual Injection Strategy, which retrieves relevant visual features only when needed during the thought process. Experiments across seven benchmarks demonstrate that DMLR significantly improves reasoning performance and efficiency without additional training.
Analysis of visual dependency and confidence.
Visual information is not uniformly needed; it contributes only at specific reasoning stages.
Internal confidence strongly correlates with reasoning correctness and visual grounding quality.
Confidence-guided optimization of Latent Think Tokens with Dynamic Visual Injection.
Instead of generating explicit reasoning text, DMLR introduces learnable Latent Think Tokens ($\mathcal{T}$). At test time, we employ a policy gradient approach to iteratively refine these tokens. The optimization is guided by a Confidence Reward ($\mathcal{R}$), which encourages the model to find reasoning paths that maximize internal consistency and prediction confidence.
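As a concrete illustration, a minimal sketch of such an update, assuming a REINFORCE-style estimator with the confidence reward acting as the return (the step size $\eta$, sampled answer $\hat{y}^{(t)}$, question $x$, and image $I$ are our notation, not necessarily the paper's exact formulation), is:

$$\mathcal{R}\big(\hat{y}^{(t)}\big) = \frac{1}{|\hat{y}^{(t)}|} \sum_{j} \log p_\theta\big(\hat{y}^{(t)}_j \mid x, I, \mathcal{T}^{(t)}, \hat{y}^{(t)}_{<j}\big), \qquad \mathcal{T}^{(t+1)} = \mathcal{T}^{(t)} + \eta\,\mathcal{R}\big(\hat{y}^{(t)}\big)\,\nabla_{\mathcal{T}} \log p_\theta\big(\hat{y}^{(t)} \mid x, I, \mathcal{T}^{(t)}\big)$$

Under this instantiation, $\mathcal{R}$ is the mean answer-token log-probability, so higher-confidence answers reinforce the latent think tokens that produced them.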
To mimic human cognitive interleaving, we do not inject the entire image at once. Instead, DVI dynamically retrieves the most relevant visual features based on the current thought state. At each iteration, the model selects top-$k$ visual patches. If injecting these patches improves the confidence reward, they are integrated into the reasoning process, ensuring precise and efficient visual grounding.
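One DVI retrieval-and-check step can be sketched as below. This is an illustrative outline rather than the released implementation: the tensor shapes, the `confidence_fn` hook (which scores prediction confidence for a candidate input sequence), and the default $k$ are assumptions.

import torch
import torch.nn.functional as F

def dvi_step(thought_state, patch_feats, context, confidence_fn, k=8):
    # thought_state: [d] current latent thought vector
    # patch_feats:   [num_patches, d] visual patch embeddings
    # context:       [seq_len, d] current reasoning sequence
    # Score each patch by similarity to the current thought state.
    sims = F.cosine_similarity(patch_feats, thought_state.unsqueeze(0), dim=-1)
    topk_idx = sims.topk(min(k, sims.numel())).indices

    # Tentatively inject the top-k patches into the reasoning context.
    candidate = torch.cat([context, patch_feats[topk_idx]], dim=0)

    # Keep the injection only if it improves the confidence reward.
    if confidence_fn(candidate) > confidence_fn(context):
        return candidate
    return context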
Efficiency vs. Accuracy
Comparison of Visual Grounding: Explicit CoT vs. DMLR
Visualization of attention heatmaps during the reasoning process. The baseline Explicit CoT method often exhibits scattered attention, shifting focus towards task-irrelevant regions, which leads to hallucinated reasoning steps.
In contrast, DMLR maintains a stable and focused attention distribution throughout the iterative optimization process. By dynamically injecting only the best visual patches, DMLR successfully converges on the key visual evidence required to solve the problem, resulting in more reliable and grounded reasoning chains. This demonstrates that latent interleaving effectively bridges the gap between high-level semantic reasoning and low-level visual perception.
@misc{liu2025reasoningminddynamicmultimodal,
      title={Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space},
      author={Chengzhi Liu and Yuzhe Yang and Yue Fan and Qingyue Wei and Sheng Liu and Xin Eric Wang},
      year={2025},
      eprint={2512.12623},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.12623},
}