CUI Yaqi,
ZHOU Tian,
XIONG Wei,
XU Saifei,
LIN Chuanqi,
XIA Shutao,
SUN Weiwei,
TANG Tiantian,
ZHANG Jie,
GUO Hengguang,
SONG Penghan,
HUAN Yingchun,
ZHANG Zhenjie
2025, 47(12): 5236-5250.
doi: 10.11999/JEIT250138
Abstract:
Objective  The increasing intensity of marine resource development and maritime operations has heightened the need for accurate vessel detection under complex marine conditions, which is essential for protecting maritime rights and interests. In recent years, deep-learning-based object detection algorithms such as YOLO and Faster R-CNN have emerged as key methods for maritime target perception owing to their strong feature extraction capabilities. However, their performance relies heavily on large-scale, high-quality training data. Existing general-purpose datasets, such as COCO and PASCAL VOC, offer limited vessel classes and predominantly feature static, urban, or terrestrial scenes, making them unsuitable for marine environments. Likewise, specialized datasets such as SeaShips and the Singapore Maritime Dataset (SMD) suffer from limited data sources, simple scenes, small sample sizes, and incomplete coverage of marine target categories. These limitations significantly hinder further performance improvement of detection algorithms. The development of large-scale, multimodal, and comprehensive marine-specific datasets is therefore a critical step toward resolving current application challenges, and is urgently needed to strengthen marine monitoring capabilities and ensure operational safety at sea.

Methods  To overcome these challenges, a multi-sensor marine target acquisition system integrating radar, visible-light, infrared, laser, Automatic Identification System (AIS), and Global Positioning System (GPS) technologies is developed. A two-month shipborne observation campaign is conducted, yielding 200 hours of maritime monitoring and over 90 TB of multimodal raw data. To process this large volume of low-value-density data efficiently, a rapid annotation pipeline is designed that combines automated labeling with manual verification.
Iterative training of intelligent annotation models, supplemented by extensive manual correction, enables the construction of the Visible Figure Part of the Multi-Source Maritime Ship Dataset (MSMS-VF). This dataset comprises 265 233 visible-light images with 1 097 268 bounding boxes across nine target categories: passenger ship, cargo vessel, speedboat, sailboat, fishing boat, buoy, floater, offshore platform, and others. Notably, 55.88% of targets are small, with pixel areas below 1 024. The dataset incorporates diverse environmental conditions, including backlighting, haze, rain, and occlusion, and spans representative maritime settings such as harbor basins, open seas, and navigation channels. MSMS-VF offers a comprehensive data foundation for advancing maritime target detection, recognition, and tracking research.

Results and Discussions  The MSMS-VF dataset exhibits substantially greater diversity than existing datasets (Table 1, Table 2). Small targets, including buoys and floaters, occur frequently (Table 5), posing significant challenges for detection. Five object detection models are assessed: the YOLO series, Real-Time Detection Transformer (RT-DETR), Faster R-CNN, Single Shot MultiBox Detector (SSD), and RetinaNet, together with five multi-object tracking algorithms: Simple Online and Realtime Tracking (SORT), Observation-Centric SORT (OC-SORT), DeepSORT, ByteTrack, and MotionTrack. YOLO models exhibit the most favorable trade-off between speed and accuracy: YOLOv11 achieves a mAP50 of 0.838 on the test set at a processing speed of 34.43 fps (Table 6). However, substantial performance gaps remain for small targets; for instance, YOLOv11 yields a mAP50 of 0.549 for speedboats, markedly lower than the 0.946 obtained for large targets such as cargo vessels (Table 7). RT-DETR shows moderate performance on small objects, achieving a mAP50 of 0.532 for floaters, whereas conventional models such as Faster R-CNN perform poorly, with mAP50 values below 0.1.
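The mAP50 figures above score a predicted box as correct only when its Intersection-over-Union (IoU) with a ground-truth box reaches 0.5. A minimal sketch of that criterion, assuming boxes in (x, y, w, h) pixel format (the exact annotation format of MSMS-VF is not specified here):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes in (x, y, w, h) format."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # overlap extents along each axis, clamped at zero for disjoint boxes
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# two 10x10 boxes shifted by half a width: intersection 50 px², union 150 px², IoU 1/3
print(iou((0, 0, 10, 10), (5, 0, 10, 10)))
```

Under this criterion, two boxes that each cover half of the other still fail the 0.5 threshold (IoU = 1/3), which is one reason small, loosely localized targets such as buoys depress mAP50 so sharply.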
For tracking, MotionTrack performs best under low-frame-rate conditions, achieving a MOTA of 0.606, an IDF1 of 0.750, and an S of 0.681 using a Gaussian distance cascade-matching strategy (Table 8, Fig. 13).

Conclusions  This study presents the MSMS-VF dataset, which offers essential data support for maritime perception research through its integration of multi-source inputs, diverse environmental scenarios, and a high proportion of small targets. Experimental validation confirms the dataset’s utility for training and evaluating state-of-the-art algorithms, while also revealing persistent challenges in detecting and tracking small objects under dynamic maritime conditions. Nevertheless, the dataset has limitations: the current data are predominantly sourced from waters near Yantai, leading to imbalanced ship-type representation and the absence of certain vessel categories. Future efforts will focus on expanding data acquisition to additional maritime regions, broadening the scope of multi-source data collection, and incrementally releasing extended components of the dataset to support ongoing research.
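The MOTA score quoted in the tracking results follows the standard CLEAR-MOT definition, which penalizes misses, false positives, and identity switches against the total number of ground-truth objects. A minimal sketch with hypothetical error counts (the actual counts behind the 0.606 figure are not given here):

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """CLEAR-MOT accuracy: 1 - (FN + FP + IDSW) / number of ground-truth objects."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

# hypothetical sequence: 250 misses, 200 false alarms, 50 identity switches
# against 2 000 annotated targets -> MOTA = 1 - 500/2000 = 0.75
print(mota(250, 200, 50, 2000))
```

Because identity switches enter the numerator directly, low-frame-rate video (where targets jump far between frames) drags MOTA down, which is the regime in which MotionTrack's Gaussian distance cascade matching is reported to help.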