Volume 47 Issue 4
Apr.  2025
MEI Tiancan, WANG Yaru, CHEN Yuanhao. Sample Generation Based on Conditional Diffusion Model for Few-Shot Object Detection[J]. Journal of Electronics & Information Technology, 2025, 47(4): 1182-1191. doi: 10.11999/JEIT240841

Sample Generation Based on Conditional Diffusion Model for Few-Shot Object Detection

doi: 10.11999/JEIT240841 cstr: 32379.14.JEIT240841
  • Received Date: 2024-10-08
  • Revised Date: 2025-03-05
  • Available Online: 2025-03-19
  • Publish Date: 2025-04-10
Objective: Deep learning-based object detection typically requires a large volume of high-quality annotated samples, which limits its practical applicability. Few-Shot Object Detection (FSOD), which leverages base classes with abundant labeled data to recognize novel classes with limited training samples, has therefore gained significant attention. Several generative-model-based methods have been proposed to address the shortage of annotated data in FSOD, but two limitations remain. (1) Most generative models fail to sufficiently capture the relationships between base and novel classes, which hinders their ability to accurately estimate novel class distributions and degrades the quality of generated samples. (2) Existing methods often prioritize sample diversity while neglecting representativeness; low representativeness can cause confusion between categories and reduce detection performance. Because the object detection network is trained on both original and generated samples, the quality of the generated samples directly affects detection accuracy, so these issues must be addressed. To this end, a novel framework for data generation in FSOD via additional high-Quality and Representative Samples (FQRS) is introduced. A conditional control module incorporating both inter-class and intra-class information improves the quality and representativeness of generated samples, ultimately enhancing FSOD accuracy.

Methods: The proposed model consists of a fine-tuning-based object detector and a data generator. First, the object detector is trained on base class data. The pre-trained detector is then used to extract Region of Interest (RoI) features, which serve as training data for the generator. Once trained, the generator produces new samples for the novel classes. The generator comprises a diffusion model for sample generation and an inter-class and intra-class conditional control module that guides the diffusion process. For inter-class conditional control, a semantic relation embedding is introduced, using cosine similarity to represent the degree of correlation between classes; the relations between base and novel classes help the diffusion model estimate novel class distributions, improving the quality of generated samples. For intra-class conditional control, Intersection over Union (IoU) information constrains the position of generated samples within the corresponding feature space, ensuring that they cluster around their category centers, which enhances their representativeness and preserves important class characteristics. Finally, the object detector is fine-tuned on both the generated and the original training samples, with a hyperparameter in the loss function controlling the influence of the generated samples on training. A sketch of these conditioning signals and the weighted loss is given below.
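The following PyTorch code is a minimal sketch of the mechanisms described above, not the authors' implementation. The module and function names (ConditionalControl, denoiser_loss, detector_loss), the embedding dimension, the toy cosine noise schedule, and the way the two conditions are combined are all assumptions; the denoiser network itself is left abstract.

```python
# Minimal sketch (not the paper's code): inter-/intra-class conditional
# control for a feature-level conditional diffusion generator.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalControl(nn.Module):
    """Builds the condition vector that guides the diffusion denoiser."""
    def __init__(self, num_classes: int, embed_dim: int = 256):
        super().__init__()
        # Learnable semantic embedding per class (base + novel).
        self.class_embed = nn.Embedding(num_classes, embed_dim)
        # Projects the scalar IoU of an RoI into the embedding space.
        self.iou_proj = nn.Linear(1, embed_dim)

    def inter_class_relation(self) -> torch.Tensor:
        """Cosine-similarity matrix over class embeddings: base-to-novel
        similarities let the generator borrow knowledge from data-rich
        base classes when estimating novel class distributions."""
        e = F.normalize(self.class_embed.weight, dim=-1)    # (C, D)
        return e @ e.t()                                    # (C, C) in [-1, 1]

    def forward(self, labels: torch.Tensor, ious: torch.Tensor) -> torch.Tensor:
        """labels: (B,) class ids of RoIs; ious: (B,) IoU with ground truth."""
        relation = self.inter_class_relation()              # (C, C)
        # Inter-class condition: each sample's row of the relation matrix,
        # mixed back through the embedding table.
        inter = relation[labels] @ self.class_embed.weight  # (B, D)
        # Intra-class condition: the IoU signal pushes generated samples
        # toward the category center (high-IoU RoIs are most representative).
        intra = self.iou_proj(ious.unsqueeze(-1))           # (B, D)
        return inter + intra                                # condition vector

def denoiser_loss(denoiser, x0, cond, T: int = 1000):
    """One DDPM-style training step for the conditional generator:
    predict the noise added to clean RoI features x0, given cond."""
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    # Toy cosine schedule standing in for the real alpha-bar sequence.
    alpha_bar = torch.cos(t.float() / T * torch.pi / 2) ** 2
    x_t = (alpha_bar.sqrt().unsqueeze(-1) * x0
           + (1 - alpha_bar).sqrt().unsqueeze(-1) * noise)
    return F.mse_loss(denoiser(x_t, t, cond), noise)

def detector_loss(loss_orig, loss_gen, lam: float = 0.5):
    """Fine-tuning objective: a hyperparameter lam scales the influence
    of generated samples, as described in the Methods section."""
    return loss_orig + lam * loss_gen
```

In this sketch, the relation matrix plays the role of the semantic relation embedding and the IoU projection supplies the intra-class constraint; the value of lam and how the two conditions are fused would need to be tuned against the paper's actual design.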
Results and Discussions: The effectiveness and robustness of the proposed network are validated on two public datasets, PASCAL VOC and MS COCO, with detection accuracy evaluated using the mAP and mAP50 metrics. Quantitative comparisons (Tables 1 and 2) show that the proposed network outperforms existing methods on both datasets. For example, on MS COCO under the 1-shot setting, the proposed method achieves a 16.9% improvement over the state-of-the-art DeFRCN approach. A cross-domain experiment (Table 3), in which base and novel class data are drawn from different datasets, demonstrates the superior generalization capability of the proposed method. Visual comparisons (Fig. 5) show that the method effectively alleviates missed detections and category confusion arising from limited training data, thus improving FSOD performance. Ablation studies (Tables 4, 5, and 6) confirm the efficacy of the proposed modules and reveal the impact of different parameter configurations on detection performance. t-SNE visualizations (Fig. 6) show that the inter-class and intra-class conditional control module tightens feature aggregation within each category while improving discriminability between categories, reducing categorical confusion. A quantitative analysis (Table 7) examines the model complexity added by the data generator in terms of parameter count and floating-point operations.

Conclusions: This paper presents a novel data-generation-based framework for obtaining additional samples in FSOD. The framework integrates a data generator, built on a conditional diffusion model, into a fine-tuning-based object detection network. The generator learns category features together with inter-class relations, capturing distinct category characteristics and improving generalization to novel classes; it also enhances sample representativeness by constraining generated samples to cluster around category centers. These high-quality, representative generated samples facilitate the object detector's training, leading to improved FSOD accuracy. Across various few-shot settings, the proposed model outperforms the state-of-the-art fine-tuning-based detector, Decoupled Faster Region-based Convolutional Neural Network (DeFRCN), on both the PASCAL VOC and MS COCO datasets. Extensive experimental results validate the superiority of the proposed approach.
References

[1] LIU Qiankun, LIU Rui, ZHENG Bolun, et al. Infrared small target detection with scale and location sensitivity[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 17490–17499. doi: 10.1109/CVPR52733.2024.01656.
[2] ZHANG Gang, CHEN Junnan, GAO Guohuan, et al. SAFDNet: A simple and effective network for fully sparse 3D object detection[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 14477–14486. doi: 10.1109/CVPR52733.2024.01372.
[3] YE Mingqiao, KE Lei, LI Siyuan, et al. Cascade-DETR: Delving into high-quality universal object detection[C]. 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 6704–6714. doi: 10.1109/ICCV51070.2023.00617.
[4] WANG C Y, BOCHKOVSKIY A, and LIAO H Y M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 7464–7475. doi: 10.1109/CVPR52729.2023.00721.
[5] WANG Yuxiong, RAMANAN D, and HEBERT M. Meta-learning to detect rare objects[C]. 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 9925–9934. doi: 10.1109/ICCV.2019.01002.
[6] WANG Xin, HUANG T E, DARRELL T, et al. Frustratingly simple few-shot object detection[C]. The 37th International Conference on Machine Learning, 2020: 920.
[7] SUN Bo, LI Banghuai, CAI Shengcai, et al. FSCE: Few-shot object detection via contrastive proposal encoding[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 7352–7362. doi: 10.1109/CVPR46437.2021.00727.
[8] YAN Xiaopeng, CHEN Ziliang, XU Anni, et al. Meta R-CNN: Towards general solver for instance-level low-shot learning[C]. 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 9577–9586. doi: 10.1109/ICCV.2019.00967.
[9] QIAO Limeng, ZHAO Yuxuan, LI Zhiyuan, et al. DeFRCN: Decoupled faster R-CNN for few-shot object detection[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 8681–8690. doi: 10.1109/ICCV48922.2021.00856.
[10] ZHU Chenchen, CHEN Fangyi, AHMED U, et al. Semantic relation reasoning for shot-stable few-shot object detection[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 8782–8791. doi: 10.1109/CVPR46437.2021.00867.
[11] REN Shaoqing, HE Kaiming, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149. doi: 10.1109/TPAMI.2016.2577031.
[12] ZHANG Weilin and WANG Yuxiong. Hallucination improves few-shot object detection[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 13008–13017. doi: 10.1109/CVPR46437.2021.01281.
[13] ZHU Pengkai, WANG Hanxiao, and SALIGRAMA V. Don't even look once: Synthesizing features for zero-shot detection[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 11693–11702. doi: 10.1109/CVPR42600.2020.01171.
[14] XU Jingyi, LE H, and SAMARAS D. Generating features with increased crop-related diversity for few-shot object detection[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 19713–19722. doi: 10.1109/CVPR52729.2023.01888.
[15] HO J, JAIN A, and ABBEEL P. Denoising diffusion probabilistic models[C]. The 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 574.
[16] HO J and SALIMANS T. Classifier-free diffusion guidance[EB/OL]. https://arxiv.org/abs/2207.12598, 2022.
[17] QI Tianhao, FANG Shancheng, WU Yanze, et al. DEADiff: An efficient stylization diffusion model with disentangled representations[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 8693–8702. doi: 10.1109/CVPR52733.2024.00830.
[18] GARBER T and TIRER T. Image restoration by denoising diffusion models with iteratively preconditioned guidance[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 25245–25254. doi: 10.1109/CVPR52733.2024.02385.
[19] LI Muyang, CAI Tianle, CAO Jiaxin, et al. DistriFusion: Distributed parallel inference for high-resolution diffusion models[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 7183–7193. doi: 10.1109/CVPR52733.2024.00686.
[20] HUANG Ziqi, CHAN K C K, JIANG Yuming, et al. Collaborative diffusion for multi-modal face generation and editing[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 6080–6090. doi: 10.1109/CVPR52729.2023.00589.
[21] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]. The 38th International Conference on Machine Learning, 2021: 8748–8763.
[22] RONNEBERGER O, FISCHER P, and BROX T. U-net: Convolutional networks for biomedical image segmentation[C]. The 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 2015: 234–241. doi: 10.1007/978-3-319-24574-4_28.
[23] WU Jiaxi, LIU Songtao, HUANG Di, et al. Multi-scale positive sample refinement for few-shot object detection[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 456–472. doi: 10.1007/978-3-030-58517-4_27.
[24] LI Aoxue and LI Zhenguo. Transformation invariant few-shot object detection[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 3094–3102. doi: 10.1109/CVPR46437.2021.00311.
[25] HU Hanzhe, BAI Shuai, LI Aoxue, et al. Dense relation distillation with context-aware aggregation for few-shot object detection[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 10185–10194. doi: 10.1109/CVPR46437.2021.01005.
[26] LI Bohao, YANG Boyu, LIU Chang, et al. Beyond max-margin: Class margin equilibrium for few-shot object detection[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 7363–7372. doi: 10.1109/CVPR46437.2021.00728.
[27] CAO Yuhang, WANG Jiaqi, JIN Ying, et al. Few-shot object detection via association and discrimination[C]. The 35th International Conference on Neural Information Processing Systems, 2021: 1267.
[28] HAN Guangxing, MA Jiawei, HUANG Shiyuan, et al. Few-shot object detection with fully cross-transformer[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 5321–5330. doi: 10.1109/CVPR52688.2022.00525.
[29] ZHANG Gongjie, LUO Zhipeng, CUI Kaiwen, et al. Meta-DETR: Image-level few-shot detection with inter-class correlation exploitation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(11): 12832–12843. doi: 10.1109/TPAMI.2022.3195735.
[30] XIAO Yang, LEPETIT V, and MARLET R. Few-shot object detection and viewpoint estimation for objects in the wild[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3090–3106. doi: 10.1109/TPAMI.2022.3174072.