Multimodal DeepResearcher

Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework

Zhaorui Yang§*, Bo Pan§*, Han Wang§*, Yiyao Wang§, Xingyu Liu§, Luoxuan Weng§, Yingchaojie Feng§,
Minfeng Zhu, Bo Zhang§, Wei Chen§
§State Key Lab of CAD&CG, Zhejiang University
*Equal Contribution; Corresponding Authors
TL;DR

We introduce an agentic framework that automatically generates comprehensive multimodal reports from scratch, with interleaved text and visualizations, going beyond text-only content generation.

The Problem We Study

Existing deep research frameworks focus primarily on generating text-only content, leaving the automated generation of interleaved text and visualizations underexplored.

Visualizations play a crucial role in communicating concepts and information effectively, yet their automated generation remains challenging.

Despite advances in reasoning and retrieval-augmented generation, LLMs lack standardized methods for understanding and generating diverse, high-quality visualizations.

Key Challenges

Designing informative and meaningful visualizations that enhance content understanding

Effectively integrating visualizations with text reports in a coherent manner

Enabling LLMs to learn from and generate diverse, high-quality chart representations

Developing comprehensive evaluation frameworks for multimodal report generation

💡Our Method

Formal Description of Visualization (FDV)

We propose FDV, a structured textual representation of charts that enables Large Language Models to learn from and generate diverse, high-quality visualizations.

Structured textual representation for charts
Enables LLMs to perform in-context learning and generation
Supports diverse visualization types
[Figure: Formal Description of Visualization (FDV) example]
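To make the idea concrete, here is a minimal sketch of what a structured textual chart description in the spirit of FDV might look like. The schema and field names below are illustrative assumptions, not the paper's actual FDV specification:

```python
import json

# Illustrative sketch only: this schema is an assumption in the spirit of FDV,
# not the paper's actual specification.
fdv_example = {
    "chart_type": "grouped_bar",                  # visualization family
    "title": "Quarterly Revenue by Region",
    "data": {
        "categories": ["Q1", "Q2", "Q3", "Q4"],   # x-axis groups
        "series": [
            {"name": "North America", "values": [12.4, 13.1, 14.8, 16.2]},
            {"name": "Europe",        "values": [9.7, 10.2, 10.9, 11.5]},
        ],
    },
    "axes": {
        "x": {"label": "Quarter"},
        "y": {"label": "Revenue (USD millions)"},
    },
    "style": {"palette": "colorblind_safe", "legend_position": "top_right"},
    "annotations": ["North America growth accelerates in H2"],
}

# Because the description is plain text, it can be embedded directly in
# prompts: an LLM can read such blocks from exemplar reports and emit new
# ones, which a chart renderer then turns into actual images.
print(json.dumps(fdv_example, indent=2))
```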

Four-Stage Agentic Framework

[Figure: Multimodal DeepResearcher framework overview]

A. Researching: iterative research on the given topic

B. Exemplar Report Textualization: in-context learning from high-quality multimodal reports

C. Planning: strategic content organization and a visualization style guide

D. Multimodal Report Generation: producing the final report with interleaved text and visualizations
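The four stages compose into a straightforward pipeline. The sketch below is a minimal, runnable illustration of how they might be orchestrated; all function names, signatures, and stub bodies are our own assumptions, not the framework's actual API:

```python
# Hypothetical orchestration of stages A-D. Each stub stands in for an
# LLM-driven agent step; names and signatures are illustrative assumptions.

def research(topic: str) -> list[str]:
    """A. Researching: iteratively gather findings on the topic (stubbed)."""
    return [f"finding about {topic}"]

def textualize_exemplar(report: str) -> str:
    """B. Exemplar Report Textualization: serialize a multimodal report to
    text, with each chart replaced by its FDV description (stubbed)."""
    return f"textualized({report})"

def plan_report(topic: str, findings: list[str], exemplars: list[str]) -> dict:
    """C. Planning: produce an outline plus a visualization style guide."""
    return {"outline": [f"Section on {f}" for f in findings],
            "style_guide": "consistent palette, one chart per key claim"}

def generate_report(plan: dict, findings: list[str]) -> str:
    """D. Multimodal Report Generation: write text and emit FDV blocks where
    the plan calls for a chart; a renderer turns FDV into images (stubbed)."""
    return "\n\n".join(plan["outline"])

def multimodal_deep_research(topic: str, exemplar_reports: list[str]) -> str:
    findings = research(topic)
    exemplars = [textualize_exemplar(r) for r in exemplar_reports]
    plan = plan_report(topic, findings, exemplars)
    return generate_report(plan, findings)

print(multimodal_deep_research("global EV adoption", ["exemplar report 1"]))
```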

📊Evaluation and Results

MultimodalReportBench

Comprehensive evaluation benchmark with 100 diverse topics

5 dedicated metrics for multimodal report assessment

Extensive experiments across both proprietary and open-source models, with both automatic and human evaluation
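As one illustration of how pairwise automatic evaluation can be scored, consider the sketch below. The metric names and the `llm_judge` stub are assumptions for illustration, not MultimodalReportBench's actual metric set or judging implementation:

```python
import random

# Illustrative metric names; not the benchmark's actual five metrics.
METRICS = ["informativeness", "coherence", "visualization_quality",
           "text_chart_integration", "overall"]

def llm_judge(report_a: str, report_b: str, metric: str) -> str:
    """Stub judge returning 'A' or 'B'. A real judge would prompt an LLM,
    randomizing presentation order to avoid position bias."""
    return random.choice(["A", "B"])

def win_rate(ours: list[str], baseline: list[str], metric: str) -> float:
    """Fraction of pairwise comparisons where our report is preferred."""
    wins = sum(llm_judge(a, b, metric) == "A" for a, b in zip(ours, baseline))
    return wins / len(ours)

ours = ["our report 1", "our report 2"]
base = ["baseline report 1", "baseline report 2"]
print({m: win_rate(ours, base, m) for m in METRICS})
```

Under a scheme like this, an 82% overall win rate on 100 topics would mean the generated report was preferred in 82 of 100 pairwise comparisons.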

Experimental Results

82%

Overall win rate over the baseline method when using the Claude 3.7 Sonnet model, demonstrating the effectiveness of our approach across diverse evaluation scenarios