Multimodal DeepResearcher
Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework
Minfeng Zhu‡✉, Bo Zhang§✉, Wei Chen§
We introduce an agentic framework that automatically generates comprehensive multimodal reports from scratch with interleaved texts and visualizations, going beyond text-only content generation.
The Problem We Study
Existing deep research frameworks primarily focus on generating text-only content, leaving the automated generation of interleaved texts and visualizations underexplored
Visualizations play a crucial part in communicating concepts and information effectively, yet generating them automatically remains challenging
Despite advances in reasoning and retrieval augmented generation, LLMs lack standardized methods for understanding and generating diverse, high-quality visualizations
⚡Key Challenges
Designing informative and meaningful visualizations that enhance content understanding
Effectively integrating visualizations with text reports in a coherent manner
Enabling LLMs to learn from and generate diverse, high-quality chart representations
Developing comprehensive evaluation frameworks for multimodal report generation
💡Our Method
Formal Description of Visualization (FDV)
We propose FDV, a structured textual representation of charts that enables Large Language Models to learn from and generate diverse, high-quality visualizations.
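To make the idea of a structured textual chart representation concrete, here is a minimal sketch in Python. It is not the FDV schema: the class name `ChartDescription`, its fields, and the `to_text` layout are all hypothetical, chosen only to show how a chart can be flattened into plain text that a language model can read and write.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ChartDescription:
    """Hypothetical structured description of one chart (NOT the FDV schema)."""

    chart_type: str
    title: str
    x_label: str
    y_label: str
    # Series name -> values, aligned with `categories` by position.
    series: dict[str, list[float]] = field(default_factory=dict)
    categories: list[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Serialize the chart as one 'key: value' line per component."""
        lines = [
            f"chart_type: {self.chart_type}",
            f"title: {self.title}",
            f"x_axis: {self.x_label} = {', '.join(self.categories)}",
            f"y_axis: {self.y_label}",
        ]
        for name, values in self.series.items():
            joined = ", ".join(str(v) for v in values)
            lines.append(f"series {name}: {joined}")
        return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder data, for illustration only.
    chart = ChartDescription(
        chart_type="bar",
        title="Example chart",
        x_label="Year",
        y_label="Count",
        series={"A": [1.0, 2.5]},
        categories=["2020", "2021"],
    )
    print(chart.to_text())
```

A text form like this can be placed in a prompt as an in-context example and parsed back into a plotting call, which is the general role a textual chart description plays.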

Four-Stage Agentic Framework

Researching
Iterative research on the given topic
Exemplar Report Textualization
In-context learning from high-quality multimodal reports
Planning
Strategic content organization and a visualization style guide
Multimodal Report Generation
Generation of multimodal reports with interleaved texts and visualizations
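The four stages above can be sketched as a sequential pipeline. This is an illustration under stated assumptions, not the authors' implementation: the `ReportState` fields, the function names, the prompt strings, and the fixed number of research rounds are all hypothetical, and `llm` stands for any callable that maps a prompt string to a response string.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# Assumption: the model is abstracted as a plain prompt -> text function.
LLM = Callable[[str], str]


@dataclass
class ReportState:
    """Hypothetical container passed from stage to stage."""

    topic: str
    findings: list[str] = field(default_factory=list)
    exemplar_text: str = ""
    plan: str = ""
    report: str = ""


def research(state: ReportState, llm: LLM, rounds: int = 2) -> ReportState:
    # Stage 1: iterate, letting each round see what was already found.
    for i in range(rounds):
        known = " | ".join(state.findings)
        state.findings.append(
            llm(f"Round {i + 1}: research '{state.topic}'. Known so far: {known}")
        )
    return state


def textualize_exemplar(state: ReportState, llm: LLM, exemplar: str) -> ReportState:
    # Stage 2: turn an exemplar report, charts included, into text for in-context learning.
    state.exemplar_text = llm(f"Describe this report and its charts as text: {exemplar}")
    return state


def make_plan(state: ReportState, llm: LLM) -> ReportState:
    # Stage 3: outline the content and fix a visualization style guide.
    state.plan = llm(
        f"Plan a report on '{state.topic}' from {len(state.findings)} findings, "
        f"following this exemplar: {state.exemplar_text}"
    )
    return state


def generate_report(state: ReportState, llm: LLM) -> ReportState:
    # Stage 4: write the interleaved text-and-chart report from the plan.
    state.report = llm(f"Write the report with interleaved charts. Plan: {state.plan}")
    return state


def run_pipeline(topic: str, exemplar: str, llm: LLM) -> ReportState:
    state = ReportState(topic=topic)
    state = research(state, llm)
    state = textualize_exemplar(state, llm, exemplar)
    state = make_plan(state, llm)
    return generate_report(state, llm)
```

Passing one state object through every stage keeps each step's inputs explicit, and any stage can be tested on its own with a stub `llm`.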
📊Evaluation and Results
MultimodalReportBench
Comprehensive evaluation benchmark with 100 diverse topics
5 dedicated metrics for multimodal report assessment
Extensive experiments across models (proprietary and open-source) and evaluation methods (automatic and human)
Experimental Results
Overall win rate over the baseline method using the Claude 3.7 Sonnet model
Demonstrating the effectiveness of our approach across diverse evaluation scenarios
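For readers unfamiliar with the metric, a win rate over a baseline is typically the share of pairwise comparisons in which the evaluated system's report is preferred. The sketch below assumes each comparison is labeled `"win"`, `"loss"`, or `"tie"`. The page does not say how ties are handled, so the `count_ties_as_half` option is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def win_rate(outcomes: list[str], count_ties_as_half: bool = False) -> float:
    """Fraction of pairwise comparisons won against the baseline.

    `outcomes` holds one label per comparison: "win", "loss", or "tie".
    Tie handling is an assumption; the source does not specify it.
    """
    if not outcomes:
        raise ValueError("at least one comparison is required")
    counts = Counter(outcomes)
    wins = float(counts["win"])
    if count_ties_as_half:
        wins += 0.5 * counts["tie"]
    return wins / len(outcomes)


if __name__ == "__main__":
    # Placeholder labels, not results from the paper.
    sample = ["win", "win", "loss", "tie"]
    print(win_rate(sample))
    print(win_rate(sample, count_ties_as_half=True))
```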