SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

Tae-Min Choi1, Tae Kyeong Jeong2,*, Garam Kim2,*, Jaemin Lee3, Yeongyoon Koh4, In Cheul Choi4, Jae-Ho Chung3, Jong Woong Park4, Juyoun Park2,†
1Samsung Research, 2Center for Humanoid Research, Korea Institute of Science and Technology, 3Department of Plastic Surgery, College of Medicine, Korea University, 4Department of Orthopedic Surgery, College of Medicine, Korea University
*Equal contribution, †Corresponding author.
Teaser image.

SurgMLLMBench: a benchmark for training and evaluating interactive multimodal LLMs in surgical scene understanding with pixel-level instrument segmentation across diverse surgical domains, including the newly proposed Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset.

Through taxonomy alignment, label unification, and VQA annotation, datasets are consolidated into a unified multimodal framework. On the right, the template-based question generator illustrates four query types used to create structured VQA pairs.

Abstract

Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual–conversational interactions.

Data Collection

Data comparison image.

We construct an integrated benchmark dataset for training multimodal LLMs, comprising five widely used surgical video datasets and the newly collected MAVIS dataset. SurgMLLMBench covers diverse procedures and clinical domains: Laparoscopic Surgery (LS), Robot-Assisted Surgery (RAS), and Micro-Surgical Training procedures (MST).

To standardize existing datasets with varying resolutions, frame rates, annotation formats, and task definitions, all videos and annotations were integrated into a unified structure.

  • Frame-level conversion: All videos were converted into frame-level representations.
  • Harmonized annotations: A COCO-style metadata schema unifies annotation information across datasets.
  • Frame fields: Each frame contains video_id, frame_id, stage, phase, step, instrument_action, and segmentation; missing entries are left blank for consistency.
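As a concrete illustration of the unified schema, the sketch below normalizes a raw annotation into a record carrying every expected frame field, leaving missing entries blank as described above. The field names come from the benchmark description; the sample values and the helper function are illustrative assumptions, not the dataset's actual API.

```python
# Frame fields defined by the unified COCO-style schema (from the list above).
FRAME_FIELDS = [
    "video_id", "frame_id", "stage", "phase", "step",
    "instrument_action", "segmentation",
]

def normalize_frame(raw: dict) -> dict:
    """Return a record with every expected field; missing entries stay blank."""
    return {field: raw.get(field, "") for field in FRAME_FIELDS}

# Hypothetical raw entry; label values are illustrative, not dataset vocabulary.
record = normalize_frame({
    "video_id": "mavis_007",
    "frame_id": 1532,
    "stage": "anterior_anastomosis",  # Stage attribute introduced by MAVIS
    "phase": "suturing",
    "step": "needle_insertion",
    # instrument_action and segmentation absent in this raw entry
})
```

This keeps records from all six source datasets structurally identical, so downstream VQA generation can treat a blank field as "annotation unavailable" rather than a schema mismatch.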

MAVIS Dataset

MAVIS dataset overview image.

The MAVIS dataset contains the entire sequence of both anterior and posterior anastomosis of an artificial vessel within a single video. Each procedure is hierarchically annotated following a stage–phase–step workflow structure. The MAVIS dataset is the first to include the Stage attribute, providing a more detailed representation of the surgical workflow.

It comprises 19 videos of 1 mm artificial vessel anastomosis procedures performed by three expert micro-surgeons.
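The stage–phase–step hierarchy described above can be pictured as a nested mapping. The sketch below is a minimal illustration; the label names are assumptions for demonstration, not the dataset's actual annotation vocabulary.

```python
# Hypothetical stage -> phase -> steps hierarchy for a MAVIS procedure.
# Label strings are illustrative assumptions only.
mavis_workflow = {
    "anterior_anastomosis": {          # stage
        "suturing": [                  # phase
            "needle_insertion",        # steps
            "thread_pulling",
            "knot_tying",
        ],
    },
    "posterior_anastomosis": {
        "vessel_repositioning": ["flip_vessel"],
    },
}

def flatten(workflow: dict) -> list:
    """Enumerate every (stage, phase, step) triple in the hierarchy."""
    return [(stage, phase, step)
            for stage, phases in workflow.items()
            for phase, steps in phases.items()
            for step in steps]
```

Flattening the hierarchy this way yields the per-frame (stage, phase, step) labels used in the unified frame schema.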

VQA Prompt Generation

VQA prompt generation image.

Each frame was paired with one of roughly ten fixed prompt templates, uniformly sampled across five categories:

  • Workflow queries: “Which stage, phase, and step are shown?”
  • Instrument count queries: “How many surgical tools are visible?”
  • Instrument type queries: “Which instruments are present?”
  • Instrument action queries: “What action is the needle holder performing?”
  • Dataset source queries: “What is the source of this dataset?”

A template-based VQA approach was used instead of generative models to ensure precise, consistent, and reproducible annotations for surgical workflow and instrument counting tasks, supporting stable multimodal training in SurgMLLMBench.
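The template-based generator can be sketched as uniform sampling over fixed templates, one pool per category. The category names and example questions come from the list above; any additional template wording and the function interface are assumptions for illustration.

```python
import random

# One example template per category, quoted from the list above; real pools
# would hold roughly ten templates in total across the five categories.
TEMPLATES = {
    "workflow": ["Which stage, phase, and step are shown?"],
    "instrument_count": ["How many surgical tools are visible?"],
    "instrument_type": ["Which instruments are present?"],
    "instrument_action": ["What action is the {instrument} performing?"],
    "dataset_source": ["What is the source of this dataset?"],
}

def generate_question(rng: random.Random, instrument: str = "needle holder"):
    """Uniformly pick a category, then a template within it."""
    category = rng.choice(sorted(TEMPLATES))          # uniform over categories
    template = rng.choice(TEMPLATES[category])        # uniform within category
    return category, template.format(instrument=instrument)
```

Because the templates are fixed strings rather than model generations, every question is exactly reproducible from the sampling seed, which is the consistency property the benchmark relies on.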

Evaluation Results

Evaluation results image.

The figure demonstrates that a single model instruction-tuned on the full SurgMLLMBench corpus (denoted with §) achieves competitive performance across diverse surgical datasets, despite not being optimized for any specific domain. Notably, the findings highlight the effectiveness of SurgMLLMBench in facilitating the development of a single model that can perform interactive VQA across multiple surgical domains.

Action prediction analysis image.

LLaVA instruction-tuned on SurgMLLMBench tends to exhibit lower action-prediction accuracy. This degradation arises from the imbalanced task distribution within the benchmark: the model often conflates semantically related actions across datasets, where functionally similar gestures are expressed under different naming conventions (Figure (c) above). We argue that this imbalance reduces numerical accuracy even when the predicted actions remain contextually valid.

BibTeX


@misc{choi2025surgmllmbenchmultimodallargelanguage,
      title={SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding}, 
      author={Tae-Min Choi and Tae Kyeong Jeong and Garam Kim and Jaemin Lee and Yeongyoon Koh and In Cheul Choi and Jae-Ho Chung and Jong Woong Park and Juyoun Park},
      year={2025},
      eprint={2511.21339},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21339}, 
}