Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual–conversational interactions.
We construct an integrated benchmark for training multimodal LLMs, comprising five widely used surgical video datasets and the newly collected MAVIS dataset. SurgMLLMBench covers diverse procedures and clinical domains: Laparoscopic Surgery (LS), Robot-Assisted Surgery (RAS), and Micro-Surgical Training procedures (MST).
To standardize existing datasets with varying resolutions, frame rates, annotation formats, and task definitions, all videos and annotations were integrated into a unified structure. Each record contains video_id, frame_id, stage, phase, step, instrument_action, and segmentation; missing entries are left blank for consistency.
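The unified record described above can be sketched as a small data structure; the field names follow the schema listed here, while the loader helper and CSV-style input are illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass

@dataclass
class SurgAnnotation:
    """One row of the unified annotation schema (field names from the text)."""
    video_id: str
    frame_id: int
    stage: str = ""              # blank when a source dataset lacks the field
    phase: str = ""
    step: str = ""
    instrument_action: str = ""
    segmentation: str = ""       # e.g. a path to a pixel-level mask, blank if absent

def from_row(row: dict) -> SurgAnnotation:
    """Build a record, leaving missing entries blank for consistency."""
    return SurgAnnotation(
        video_id=row.get("video_id", ""),
        frame_id=int(row.get("frame_id", 0)),
        stage=row.get("stage", ""),
        phase=row.get("phase", ""),
        step=row.get("step", ""),
        instrument_action=row.get("instrument_action", ""),
        segmentation=row.get("segmentation", ""),
    )
```

Keeping absent attributes as empty strings (rather than dropping the columns) lets every source dataset share one table layout.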
The MAVIS dataset contains the entire sequence of both anterior and posterior anastomosis of an artificial vessel within a single video. Each procedure is hierarchically annotated following a stage–phase–step workflow structure. The MAVIS dataset is the first to include the Stage attribute, providing a more detailed representation of the surgical workflow.
It comprises 19 videos of 1 mm artificial vessel anastomosis procedures performed by three expert micro-surgeons.
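The hierarchical stage–phase–step structure can be represented as a nested mapping with a consistency check. This is a minimal sketch; the label names below are placeholders, not MAVIS's actual annotation vocabulary.

```python
# Hypothetical stage -> phase -> steps hierarchy; label names are illustrative.
WORKFLOW = {
    "anterior_anastomosis": {
        "suturing": ["needle_insertion", "knot_tying"],
    },
    "posterior_anastomosis": {
        "suturing": ["needle_insertion", "knot_tying"],
    },
}

def validate(stage: str, phase: str, step: str) -> bool:
    """Check that a (stage, phase, step) triple respects the hierarchy."""
    return step in WORKFLOW.get(stage, {}).get(phase, [])
```

A check like this keeps frame-level labels consistent with the stage–phase–step workflow during annotation.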
Each frame was paired with one of roughly ten fixed prompt templates, uniformly sampled across five question categories.
A template-based VQA approach was used instead of generative models to ensure precise, consistent, and reproducible annotations for surgical workflow and instrument counting tasks, supporting stable multimodal training in SurgMLLMBench.
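The template-based pairing can be sketched as below. The template wording, the category names, and the record fields are assumptions for illustration; the benchmark's actual five categories and templates are not reproduced here.

```python
import random

# Hypothetical fixed templates grouped by category (placeholders, not the
# benchmark's actual question set).
TEMPLATES = {
    "phase": ["What surgical phase is shown in this frame?"],
    "step": ["Which step of the procedure is being performed?"],
    "count": ["How many instruments are visible in this frame?"],
}

def make_vqa_pair(record: dict, rng: random.Random) -> dict:
    """Sample a category uniformly, then a fixed template from it, and read
    the answer directly from the frame's annotation record."""
    category = rng.choice(sorted(TEMPLATES))
    question = rng.choice(TEMPLATES[category])
    answer = str(record.get(category, ""))
    return {"category": category, "question": question, "answer": answer}
```

Because answers come straight from the annotations rather than a generative model, the resulting QA pairs are exact and reproducible given a fixed random seed.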
The figure shows that a single model instruction-tuned on the full SurgMLLMBench corpus (denoted with §) achieves competitive performance across diverse surgical datasets, despite not being optimized for any single domain. These findings highlight the effectiveness of SurgMLLMBench for developing a single model that can perform interactive VQA across multiple surgical domains.
LLaVA instruction-tuned on SurgMLLMBench tends to exhibit lower action prediction accuracy. This degradation arises from the imbalanced task distribution within the benchmark: the model often conflates semantically related actions across datasets, where functionally similar gestures are expressed under different naming conventions (Figure (c) above). We argue that this imbalance reduces numerical accuracy even when the predicted actions remain contextually valid.
@misc{choi2025surgmllmbenchmultimodallargelanguage,
  title={SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding},
  author={Tae-Min Choi and Tae Kyeong Jeong and Garam Kim and Jaemin Lee and Yeongyoon Koh and In Cheul Choi and Jae-Ho Chung and Jong Woong Park and Juyoun Park},
  year={2025},
  eprint={2511.21339},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.21339},
}