Multimodal DeepResearcher

Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework

Zhaorui Yang§*, Bo Pan§*, Han Wang§*, Yiyao Wang§, Xingyu Liu§, Luoxuan Weng§, Yingchaojie Feng§,
Minfeng Zhu, Bo Zhang§, Wei Chen§
§State Key Lab of CAD&CG, Zhejiang University
*Equal Contribution; Corresponding Authors
TL;DR

We introduce an agentic framework that automatically generates comprehensive multimodal reports from scratch, with interleaved text and visualizations, going beyond text-only content generation.

The Problem We Study

Existing deep research frameworks focus primarily on generating text-only content, leaving the automated generation of interleaved text and visualizations underexplored.

Visualizations play a crucial role in communicating concepts and information effectively, yet their automated generation remains challenging.

Despite advances in reasoning and retrieval-augmented generation, LLMs lack standardized methods for understanding and generating diverse, high-quality visualizations.

Key Challenges

Designing informative and meaningful visualizations that enhance content understanding

Effectively integrating visualizations with text reports in a coherent manner

Enabling LLMs to learn from and generate diverse, high-quality chart representations

Developing comprehensive evaluation frameworks for multimodal report generation

💡Our Method

Formal Description of Visualization (FDV)

We propose FDV, a structured textual representation of charts that enables Large Language Models to learn from and generate diverse, high-quality visualizations.

Structured textual representation for charts
Enables LLMs to perform in-context learning and generation
Supports diverse visualization types
[Figure: Formal Description of Visualization (FDV) example]
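To make the idea concrete, here is a minimal sketch of what a structured textual chart description in the spirit of FDV might look like. The schema and field names below are illustrative assumptions, not the paper's actual FDV specification:

```python
import json

# Illustrative sketch only: this schema is an assumption in the spirit of FDV,
# not the paper's actual specification.
fdv_example = {
    "chart_type": "grouped_bar",                  # visualization family
    "title": "Quarterly Revenue by Region",
    "data": {
        "categories": ["Q1", "Q2", "Q3", "Q4"],   # x-axis groups
        "series": [
            {"name": "North America", "values": [12.4, 13.1, 14.8, 16.2]},
            {"name": "Europe",        "values": [9.7, 10.2, 10.9, 11.5]},
        ],
    },
    "axes": {
        "x": {"label": "Quarter"},
        "y": {"label": "Revenue (USD millions)"},
    },
    "style": {"palette": "colorblind_safe", "legend_position": "top_right"},
    "annotations": ["North America growth accelerates in H2"],
}

# Because the description is plain text, it can be embedded directly in
# prompts: an LLM can read such blocks from exemplar reports and emit new
# ones, which a chart renderer then turns into actual images.
print(json.dumps(fdv_example, indent=2))
```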

Four-Stage Agentic Framework

[Figure: Multimodal DeepResearcher framework overview]

A. Researching: iterative research on the given topic

B. Exemplar Report Textualization: in-context learning from high-quality multimodal reports

C. Planning: strategic content organization and a visualization style guide

D. Multimodal Report Generation: producing the final report with interleaved text and visualizations
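The four stages compose into a straightforward pipeline. The sketch below is a minimal, runnable illustration of how they might be orchestrated; all function names, signatures, and stub bodies are our own assumptions, not the framework's actual API:

```python
# Hypothetical orchestration of stages A-D. Each stub stands in for an
# LLM-driven agent step; names and signatures are illustrative assumptions.

def research(topic: str) -> list[str]:
    """A. Researching: iteratively gather findings on the topic (stubbed)."""
    return [f"finding about {topic}"]

def textualize_exemplar(report: str) -> str:
    """B. Exemplar Report Textualization: serialize a multimodal report to
    text, with each chart replaced by its FDV description (stubbed)."""
    return f"textualized({report})"

def plan_report(topic: str, findings: list[str], exemplars: list[str]) -> dict:
    """C. Planning: produce an outline plus a visualization style guide."""
    return {"outline": [f"Section on {f}" for f in findings],
            "style_guide": "consistent palette, one chart per key claim"}

def generate_report(plan: dict, findings: list[str]) -> str:
    """D. Multimodal Report Generation: write text and emit FDV blocks where
    the plan calls for a chart; a renderer turns FDV into images (stubbed)."""
    return "\n\n".join(plan["outline"])

def multimodal_deep_research(topic: str, exemplar_reports: list[str]) -> str:
    findings = research(topic)
    exemplars = [textualize_exemplar(r) for r in exemplar_reports]
    plan = plan_report(topic, findings, exemplars)
    return generate_report(plan, findings)

print(multimodal_deep_research("global EV adoption", ["exemplar report 1"]))
```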

📊Evaluation and Results

MultimodalReportBench

Comprehensive evaluation benchmark with 100 diverse topics

5 dedicated metrics for multimodal report assessment

Extensive experiments across both proprietary and open-source models, with both automatic and human evaluation
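As one illustration of how pairwise automatic evaluation can be scored, consider the sketch below. The metric names and the `llm_judge` stub are assumptions for illustration, not MultimodalReportBench's actual metric set or judging implementation:

```python
import random

# Illustrative metric names; not the benchmark's actual five metrics.
METRICS = ["informativeness", "coherence", "visualization_quality",
           "text_chart_integration", "overall"]

def llm_judge(report_a: str, report_b: str, metric: str) -> str:
    """Stub judge returning 'A' or 'B'. A real judge would prompt an LLM,
    randomizing presentation order to avoid position bias."""
    return random.choice(["A", "B"])

def win_rate(ours: list[str], baseline: list[str], metric: str) -> float:
    """Fraction of pairwise comparisons where our report is preferred."""
    wins = sum(llm_judge(a, b, metric) == "A" for a, b in zip(ours, baseline))
    return wins / len(ours)

ours = ["our report 1", "our report 2"]
base = ["baseline report 1", "baseline report 2"]
print({m: win_rate(ours, base, m) for m in METRICS})
```

Under a scheme like this, an 82% overall win rate on 100 topics would mean the generated report was preferred in 82 of 100 pairwise comparisons.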

Experimental Results

82%

Overall win rate over the baseline method when using the Claude 3.7 Sonnet model, demonstrating the effectiveness of our approach across diverse evaluation scenarios