SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

Tae-Min Choi1, Tae Kyeong Jeong2,*, Garam Kim2,*, Jaemin Lee3, Yeongyoon Koh4, In Cheul Choi4, Jae-Ho Chung3, Jong Woong Park4, Juyoun Park2,†
1Samsung Research, 2Center for Humanoid Research, Korea Institute of Science and Technology, 3Department of Plastic Surgery, College of Medicine, Korea University, 4Department of Orthopedic Surgery, College of Medicine, Korea University
*Equal contribution, †Corresponding author.
Teaser image.

SurgMLLMBench: a benchmark for training and evaluating interactive multimodal LLMs in surgical scene understanding with pixel-level instrument segmentation across diverse surgical domains, including the newly proposed Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset.

Through taxonomy alignment, label unification, and VQA annotation, datasets are consolidated into a unified multimodal framework. On the right, the template-based question generator illustrates four query types used to create structured VQA pairs.

Abstract

Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual–conversational interactions.

Data Collection

Data comparison image.

We construct an integrated benchmark dataset for training multimodal LLMs, comprising five widely used surgical video datasets and the newly collected MAVIS dataset. SurgMLLMBench covers diverse procedures and clinical domains: Laparoscopic Surgery (LS), Robot-Assisted Surgery (RAS), and Micro-Surgical Training procedures (MST).

To standardize existing datasets with varying resolutions, frame rates, annotation formats, and task definitions, all videos and annotations were integrated into a unified structure.

  • Frame-level conversion: All videos were converted into frame-level representations.
  • Harmonized annotations: A COCO-style metadata schema unifies annotation information across datasets.
  • Frame fields: Each frame contains video_id, frame_id, stage, phase, step, instrument_action, and segmentation; missing entries are left blank for consistency.
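As a concrete illustration of the unified schema, the sketch below normalizes a raw annotation into a record carrying every expected frame field, leaving missing entries blank as described above. The field names come from the benchmark description; the sample values and the helper function are illustrative assumptions, not the dataset's actual API.

```python
# Frame fields defined by the unified COCO-style schema (from the list above).
FRAME_FIELDS = [
    "video_id", "frame_id", "stage", "phase", "step",
    "instrument_action", "segmentation",
]

def normalize_frame(raw: dict) -> dict:
    """Return a record with every expected field; missing entries stay blank."""
    return {field: raw.get(field, "") for field in FRAME_FIELDS}

# Hypothetical raw entry; label values are illustrative, not dataset vocabulary.
record = normalize_frame({
    "video_id": "mavis_007",
    "frame_id": 1532,
    "stage": "anterior_anastomosis",  # Stage attribute introduced by MAVIS
    "phase": "suturing",
    "step": "needle_insertion",
    # instrument_action and segmentation absent in this raw entry
})
```

This keeps records from all six source datasets structurally identical, so downstream VQA generation can treat a blank field as "annotation unavailable" rather than a schema mismatch.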

MAVIS Dataset

MAVIS dataset overview image.

The MAVIS dataset contains the entire sequence of both anterior and posterior anastomosis of an artificial vessel within a single video. Each procedure is hierarchically annotated following a stage–phase–step workflow structure. The MAVIS dataset is the first to include the Stage attribute, providing a more detailed representation of the surgical workflow.

It comprises 19 videos of 1 mm artificial vessel anastomosis procedures performed by three expert micro-surgeons.
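The stage–phase–step hierarchy described above can be pictured as a nested mapping. The sketch below is a minimal illustration; the label names are assumptions for demonstration, not the dataset's actual annotation vocabulary.

```python
# Hypothetical stage -> phase -> steps hierarchy for a MAVIS procedure.
# Label strings are illustrative assumptions only.
mavis_workflow = {
    "anterior_anastomosis": {          # stage
        "suturing": [                  # phase
            "needle_insertion",        # steps
            "thread_pulling",
            "knot_tying",
        ],
    },
    "posterior_anastomosis": {
        "vessel_repositioning": ["flip_vessel"],
    },
}

def flatten(workflow: dict) -> list:
    """Enumerate every (stage, phase, step) triple in the hierarchy."""
    return [(stage, phase, step)
            for stage, phases in workflow.items()
            for phase, steps in phases.items()
            for step in steps]
```

Flattening the hierarchy this way yields the per-frame (stage, phase, step) labels used in the unified frame schema.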

VQA Prompt Generation

VQA prompt generation image.

Each frame was paired with one of roughly ten fixed prompt templates, uniformly sampled across five categories:

  • Workflow queries: “Which stage, phase, and step are shown?”
  • Instrument count queries: “How many surgical tools are visible?”
  • Instrument type queries: “Which instruments are present?”
  • Instrument action queries: “What action is the needle holder performing?”
  • Dataset source queries: “What is the source of this dataset?”

A template-based VQA approach was used instead of generative models to ensure precise, consistent, and reproducible annotations for surgical workflow and instrument counting tasks, supporting stable multimodal training in SurgMLLMBench.
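The template-based generator can be sketched as uniform sampling over fixed templates, one pool per category. The category names and example questions come from the list above; any additional template wording and the function interface are assumptions for illustration.

```python
import random

# One example template per category, quoted from the list above; real pools
# would hold roughly ten templates in total across the five categories.
TEMPLATES = {
    "workflow": ["Which stage, phase, and step are shown?"],
    "instrument_count": ["How many surgical tools are visible?"],
    "instrument_type": ["Which instruments are present?"],
    "instrument_action": ["What action is the {instrument} performing?"],
    "dataset_source": ["What is the source of this dataset?"],
}

def generate_question(rng: random.Random, instrument: str = "needle holder"):
    """Uniformly pick a category, then a template within it."""
    category = rng.choice(sorted(TEMPLATES))          # uniform over categories
    template = rng.choice(TEMPLATES[category])        # uniform within category
    return category, template.format(instrument=instrument)
```

Because the templates are fixed strings rather than model generations, every question is exactly reproducible from the sampling seed, which is the consistency property the benchmark relies on.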

Evaluation Results

Evaluation results image.

The figure demonstrates that a single model instruction-tuned on the full SurgMLLMBench corpus (denoted with §) achieves competitive performance across diverse surgical datasets, despite not being optimized for any specific domain. Notably, the findings highlight the effectiveness of SurgMLLMBench in facilitating the development of a single model that can perform interactive VQA across multiple surgical domains.

Action prediction analysis image.

LLaVA instruction-tuned on SurgMLLMBench tends to exhibit lower action-prediction accuracy. This degradation arises from the imbalanced task distribution within the benchmark: the model often conflates semantically related actions across datasets, where functionally similar gestures are expressed under different naming conventions (Figure (c) above). We argue that this imbalance reduces numerical accuracy even when the predicted actions remain contextually valid.

BibTeX


@misc{choi2025surgmllmbenchmultimodallargelanguage,
      title={SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding}, 
      author={Tae-Min Choi and Tae Kyeong Jeong and Garam Kim and Jaemin Lee and Yeongyoon Koh and In Cheul Choi and Jae-Ho Chung and Jong Woong Park and Juyoun Park},
      year={2025},
      eprint={2511.21339},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21339}, 
}