MileBench

Benchmarking MLLMs in Long Context


The Chinese University of Hong Kong, Shenzhen
Shenzhen Research Institute of Big Data

*Corresponding to: wangbenyou@cuhk.edu.cn

An image generated by DALLE3 using the prompt "Create the scene: a few llamas in a unique marathon. The track is paved with pictures, books, articles, and video clips. There's a finish line with a 'MileBench Winner!' banner." together with a style-control prompt.

🔔News

🏆[2024-5-14]: GPT-4o achieves the highest scores on MileBench! Please refer to the Leaderboard.
🔥[2024-4-30]: Our evaluation code is now available on GitHub.

Introduction

We introduce MileBench, a pioneering benchmark designed to rigorously test the MultImodal Long-contExt capabilities of MLLMs. The benchmark comprises a mix of text and images, long contexts, multiple tasks, and tasks requiring both comprehension and generation. To systematically assess MLLMs in multimodal long contexts, it consists of two distinct evaluation sets: diagnostic evaluation and realistic evaluation. The former probes the long-context recall abilities of MLLMs using needle-in-a-haystack and image retrieval tasks, while the latter stress-tests the models under conditions akin to real-world use, with both temporal multi-image tasks and semantic multi-image tasks.


MLLMs' performance fluctuates with the number of images per sample. Open-source MLLMs show a marked performance drop as the number of images increases, and the gap between open-source and closed-source MLLMs widens accordingly. For single-image performance, we refer to SEED-Bench, since single-image tasks are absent from MileBench.

After evaluating 22 models, we find that the closed-source GPT-4o excels in both the diagnostic and realistic evaluations, achieving impressive scores of 99.4% and 60.3%, respectively, though it still falls short of a perfect 100%. Among the open-source models, only Mantis and Qwen-VL-7B managed average scores of 47.5% and 37.2% in the realistic and diagnostic evaluations, respectively. These results underscore that there are "miles to go" towards fully realized long-context MLLMs, and we call for increased research focus on such tasks, especially those involving numerous images.

MileBench

Overview

MileBench consists of two major components: Realistic Evaluation and Diagnostic Evaluation. Realistic Evaluation requires MLLMs to complete tasks in multimodal long-context scenarios, emphasizing their proficiency in comprehending and reasoning over extended multimodal contexts. Conversely, Diagnostic Evaluation requires MLLMs to retrieve information from the provided context, highlighting their capability for long-range information retrieval and for ignoring distractors. We present examples from each evaluation set for better understanding.

MileBench

We present a detailed taxonomy of the dataset and its task composition, along with the number of samples and the metric corresponding to each task.

detail_table

The Realistic Evaluation is designed to assess an MLLM's ability to comprehend, integrate, and infer information in a multimodal long context. We categorize the tasks into two main groups: Temporal Multi-Image tasks and Semantic Multi-Image tasks. Temporal Multi-Image tasks test an MLLM's ability to discern temporal relationships among several time-related images, emphasizing the model's predictive capabilities in real-world scenarios. Semantic Multi-Image tasks, on the other hand, challenge MLLMs to process multiple images that may be temporally unrelated but are semantically interconnected.
The Diagnostic Evaluation focuses on an MLLM's capability to retrieve information from a multimodal long context without being distracted. We transform the "Needle in a Haystack" task from NLP and the "Image Retrieval" task from CV into a multimodal format for assessment. This transformation preserves the core of the conventional tasks while offering a more challenging and realistic measure of MLLMs' performance.
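For intuition, below is a minimal sketch of how a needle-in-a-haystack sample can be built by planting a "needle" sentence at a controlled depth inside a long context. The helper name, the password-style needle, and the sample format are illustrative assumptions, not MileBench's actual construction code.

      # Illustrative only: insert a "needle" at a relative depth (0.0 = start, 1.0 = end)
      # inside a list of context segments (text chunks or image placeholders).
      def build_needle_sample(haystack_segments, needle, depth_ratio, question):
          position = round(depth_ratio * len(haystack_segments))
          context = haystack_segments[:position] + [needle] + haystack_segments[position:]
          return {"context": context, "question": question, "answer": needle}

      # Example: a 1,000-chunk haystack with the needle buried at 50% depth.
      haystack = [f"Filler paragraph {i}." for i in range(1000)]
      sample = build_needle_sample(
          haystack,
          "The secret password is 'milebench'.",
          depth_ratio=0.5,
          question="What is the secret password mentioned in the context?",
      )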

Comparisons with Existing Benchmarks

Existing benchmarks primarily focus on single-image and short-text samples, failing to fully reflect typical real-world scenarios. While some benchmarks evaluate multi-image tasks, they either limit the number of images provided per sample (e.g., SEED-Bench-2, DEMON) or only include time-series captioning tasks (e.g., Mementos), as evidenced in the figure below. To the best of our knowledge, MileBench is the first comprehensive benchmark that evaluates MLLMs across both multi-image and long-context dimensions, catering to a broader spectrum of general scenarios.

comparison

Visualization of the distribution of image and word counts in existing MLLM benchmarks. The range and mean of both word and image counts per sample in MileBench far exceed those of previous works.

Statistics

Experiment Results

Leaderboard

We evaluated several models from three distinct categories that can potentially handle multimodal long contexts. All models used greedy decoding to generate answers, with the generation length bounded between 1 and 512 tokens, and all evaluations were performed in a zero-shot setting. When the input exceeds a model's maximum context length, we keep the instruction and truncate the interleaved image-text content from the left so that the question of the sample is preserved; the instruction and question carry the critical information, and in many tasks the later images matter more. For tasks built on existing datasets, metrics are kept consistent with the original work.
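A minimal sketch of this left-truncation rule is shown below; the function name and the count_length callable (which would measure a prompt's length in tokens, counting each image by its visual-token budget) are assumptions for illustration, not the released evaluation code.

      # Sketch: keep the instruction and question intact, drop interleaved
      # image/text segments from the left until the prompt fits.
      def truncate_from_left(instruction, segments, question, max_len, count_length):
          kept = list(segments)
          while kept and count_length([instruction] + kept + [question]) > max_len:
              kept.pop(0)  # discard the oldest (left-most) segment first
          return [instruction] + kept + [question]

      # Toy usage with a character count standing in for a real tokenizer;
      # here the two oldest segments are dropped, the rest are kept.
      prompt = truncate_from_left(
          "Answer the question based on the images.",
          ["<img_1>", "caption 1", "<img_2>", "caption 2"],
          "What changed between the last two images?",
          max_len=100,
          count_length=lambda parts: sum(len(p) for p in parts),
      )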

Result

Experiment results on MileBench. T-1, etc. refer to task numbers; NH and IR refer to Needle in a Haystack and Image Retrieval, respectively. The highest scores for closed-source models, open-source image models, and open-source video models are marked in red, blue, and green, respectively.

Takeaways:
(1) Closed-source MLLMs outperform open-source MLLMs in multimodal long-context tasks.
(2) Open-source image models generally perform better than video models.
(3) The ability to adapt to long contexts and the ability to perform long-context tasks are not necessarily linked.
(4) Interestingly, the majority of open-source models failed to score in the Image Needle in a Haystack task.

Error Analysis

We conducted a detailed error analysis to further investigate the flaws of the models.

error distribution

An example of the Space Understanding task is shown in the figure above. When asked to recognize spatial positions and current actions, GPT-4V declined to respond, while the other models did not follow the instructions correctly: they merely generated captions for the images. This may be because they were not trained on multi-image QA data, underscoring the importance of multi-image training.

error distribution

In a Visual Relation Inference task example (figure above), Qwen-VL-Chat and Valley struggled with differentiating between images and with following instructions, resulting in inaccurate inferences. This suggests MLLMs could improve at recognizing subtle differences between images, possibly because of their low-resolution vision encoders. The hallucination issue with multi-image inputs for video models also highlights the need for ample multi-image training data.

Different Difficulty Levels

To investigate how model performance varies with the number of images, we divide our dataset into three levels, Few, Medium, and Many, based on the number of images per sample; the specific quantities for each level can be found in the table above. The figure below reports the average performance of each model on these three levels.

Result

Average performance across various levels of image quantity.
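For reference, the bucketing described above can be expressed as a simple rule like the sketch below; the function name and threshold values are placeholders, and MileBench's actual cut-offs are those listed in the statistics table.

      # Placeholder thresholds: assign a sample to a difficulty level by image count.
      def difficulty_level(num_images, medium_start=6, many_start=17):
          if num_images < medium_start:
              return "Few"
          if num_images < many_start:
              return "Medium"
          return "Many"

      samples = [{"images": [f"img_{i}.jpg" for i in range(n)]} for n in (2, 12, 40)]
      levels = [difficulty_level(len(s["images"])) for s in samples]  # ['Few', 'Medium', 'Many']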

It can be observed that as the number of images increases, the performance of most models declines significantly (as indicated by the steep slope of the curves), especially for the LLaVA-1.5 series. This is likely because most models have been trained only on single images, resulting in insufficient generalization to multi-image test data. However, GPT-4V, GPT-4o, Gemini 1.5, Claude 3 Opus, and Qwen-VL-Chat perform better at the Medium level than at the Few level. This could be attributed to their training on multi-image data, where a larger number of images provides more information to some extent, thereby aiding task completion. Despite their strong performance on multi-image tasks, their performance still declines when the number of images reaches the Many level, leaving room for future work on modeling multi-image contexts.

"Lost in the Middle" for MLLMs

Prior work has pointed out that in needle-in-a-haystack tasks over long texts, models may exhibit the "Lost in the Middle" phenomenon, struggling to find a needle located in the middle of the context. We investigated whether MLLMs exhibit this phenomenon in multimodal contexts, choosing the best-performing closed-source and open-source models on the Needle in a Haystack tasks for analysis.

Result

Visualization of results across needle depths and context lengths in the Needle in a Haystack tasks. The x-axis represents the number of tokens or images in the context, while the y-axis indicates the depth in the context at which the needle resides. Green squares indicate successful extraction of the needle at that position; red squares denote failure.

As can be seen from the results in the figure, MLLMs displayed varying behaviors. In multimodal long contexts, GPT-4V did not "get lost in the middle" and completed both tasks impressively. On the other hand, setting aside the cases where the input exceeds its maximum context length (8,192 tokens or 32 images) and gets truncated, Qwen-VL-Chat showed a certain degree of "lost in the middle", particularly in the image needle task. This indicates that the "lost in the middle" phenomenon also exists in multimodal scenarios; however, a strong ability to handle long contexts can significantly reduce this risk.

Risk of Data Contamination

Considering MileBench's use of public datasets, there is a potential risk of data contamination. To investigate, we excluded models trained solely on single-image data and selected cost-effective open-source models: Qwen-VL-Chat, Cheetor, OpenFlamingo, and VILA. We then constructed an Adversarial (ADV) set with shuffled options and paraphrased reference answers and compared results on the original and ADV sets. The performance drop is negligible (0.1% ~ 1.2%) for all models, indicating minimal likelihood that these models were trained on our data.

Result

Contamination Detection. We present Regular (result on MileBench), ADV (result on the ADV set) and their difference Δ.
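As a sketch of the option-shuffling half of the ADV-set construction (the reference-answer paraphrasing would typically be done with an LLM or by hand and is omitted), assuming a hypothetical multiple-choice sample format with 'options' and a letter 'answer' field:

      import random

      # Shuffle the multiple-choice options and remap the gold answer letter, so the
      # question's meaning is unchanged while any memorized option order is broken.
      def shuffle_options(sample, seed=0):
          rng = random.Random(seed)
          options = list(sample["options"])
          gold_text = options[ord(sample["answer"]) - ord("A")]
          rng.shuffle(options)
          new_answer = chr(ord("A") + options.index(gold_text))
          return {**sample, "options": options, "answer": new_answer}

      adv_sample = shuffle_options({"options": ["cat", "dog", "bird"], "answer": "B"})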

Other Analysis

For more analysis, please refer to our paper.

BibTeX


      @article{song2024milebench,
        title={MileBench: Benchmarking MLLMs in Long Context},
        author={Song, Dingjie and Chen, Shunian and Chen, Guiming Hardy and Yu, Fei and Wan, Xiang and Wang, Benyou},
        journal={arXiv preprint arXiv:2404.18532},
        year={2024}
      }