
BIRD1445: Large-scale Multimodal Bird Dataset for Ecological Monitoring

WANG Hongchang, XIAN Fengyu, XIE Zihui, DONG Miaomiao, JIAN Haifang

Citation: WANG Hongchang, XIAN Fengyu, XIE Zihui, DONG Miaomiao, JIAN Haifang. BIRD1445: Large-scale Multimodal Bird Dataset for Ecological Monitoring[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250647

doi: 10.11999/JEIT250647 cstr: 32379.14.JEIT260647
Funds: The National Science and Technology Major Project of New Generation Artificial Intelligence (2022ZD0116304)
Details
    Author biographies:

    WANG Hongchang: male, assistant researcher; research interests include multimodal intelligent perception algorithms and hardware-software co-design

    XIAN Fengyu: female, master's student; research interests include deep learning and computer vision

    XIE Zihui: male, master's student; research interests include deep learning and computer vision

    DONG Miaomiao: female, master's student; research interests include deep learning and multimodal intelligent perception

    JIAN Haifang: male, researcher, doctoral supervisor; research interests include high-performance application-specific integrated circuit design and intelligent information processing algorithms and systems

    Corresponding author:

    JIAN Haifang, jhf@semi.ac.cn

  • CLC number: TP391

  • Abstract: With the rapid development of artificial intelligence, deep-learning-based computer vision, acoustic analysis, and multimodal fusion have become important tools for ecological monitoring and are widely applied to tasks such as bird species identification and surveys. However, existing bird datasets suffer from difficult field data collection, high expert-annotation costs, scarce samples for rare species, and single-modality data, and thus cannot meet the training and application needs of large models and other AI techniques in ecological monitoring and conservation. To address these problems, this paper proposes an efficient construction method for large-scale multimodal domain-specific datasets, which reduces expert-annotation costs while ensuring data quality through multi-source heterogeneous data collection, intelligent semi-automatic annotation, and a multi-model collaborative verification mechanism based on heterogeneous attention fusion. A dataset verification method based on multi-scale attention fusion is designed: a multi-model collaborative verification system uses a category-sensitive weight assignment mechanism to improve both the accuracy and the efficiency of dataset verification. With these methods, we build BIRD1445, a large-scale multimodal bird dataset covering 1,445 bird species in four modalities (image, video, audio, and text) with 3.54 million samples in total. It supports intelligent analysis tasks such as object detection, density estimation, and fine-grained recognition, and provides an important data foundation for applying AI to ecological monitoring and conservation.
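The multi-model collaborative verification with category-sensitive weights described in the abstract can be sketched as a weighted consensus vote over several verifier models. The function name, weighting rule, and threshold below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def verify_label(probs_list, proposed_label, class_weights, threshold=0.7):
    """Cross-check a proposed annotation against several models' softmax outputs.

    probs_list: list of (num_classes,) arrays, one per verifier model.
    class_weights: (num_classes,) array; rarer classes could get higher weight
                   (a stand-in for the paper's category-sensitive weights).
    Returns True if the weighted consensus supports the proposed label.
    """
    # Each model "votes" with its confidence in the proposed label.
    votes = [probs[proposed_label] for probs in probs_list]
    # Scale the mean agreement by the class-sensitive weight.
    score = class_weights[proposed_label] * float(np.mean(votes))
    return score >= threshold
```

Samples that fail the consensus check would be routed back for manual review, which is one way such a mechanism can cut expert-annotation cost while preserving quality.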
  • Figure 1  The large-scale multimodal bird dataset

    Figure 2  Construction pipeline of the large-scale multimodal domain-specific dataset

    Figure 3  Multimodal intelligent bird monitoring system for open scenarios

    Figure 4  Data standardization and redundancy optimization pipeline

    Figure 5  Data verification method based on multi-scale attention fusion

    Figure 6  Example of the multimodal data structure in the BIRD dataset

    Figure 7  Statistical analysis of the long-tailed image fine-grained recognition dataset

    Figure 8  Statistical analysis of the balanced image fine-grained recognition dataset

    Figure 9  Statistical analysis of the long-tailed audio fine-grained recognition dataset

    Figure 10  Evaluation of large models' perception of fine-grained domain-specific details

    Table 1  Summary of mainstream datasets

    | Dataset | Category | Year | Data type | Classes | Total samples | Main tasks | Key features |
    |---|---|---|---|---|---|---|---|
    | PASCAL VOC[16] | General AI | 2005 | Images | 21 | 1.15×10⁴ | Classification, detection, segmentation | High-precision annotation; standard evaluation protocol; difficulty levels and occlusion flags |
    | MS COCO[17] | General AI | 2014 | Images | 80 | 3.30×10⁵ | Detection, instance segmentation, captioning, keypoint detection | Pixel-level segmentation masks; complex, diverse scenes; multi-task; object relations and language descriptions |
    | ImageNet[18] | General AI | 2009 | Images | 20,000+ | 2.10×10⁷ | Classification, detection | Very large scale; broad, structured category hierarchy |
    | Open Images[19] | General AI | 2016 | Images | 600 | 9.00×10⁶ | Multi-label classification, detection, instance segmentation, visual relationship detection | Comprehensive annotation types; broad real-world distribution; supports fine-grained visual understanding |
    | VQA v2[20] | Large-model benchmark | 2017 | QA pairs | \ | 2.14×10⁵ | Visual question answering | Balanced design that removes language bias |
    | GQA[21] | Large-model benchmark | 2019 | QA pairs | \ | 2.20×10⁷ | Compositional reasoning | Multi-step reasoning; scene-graph structure |
    | OK-VQA[22] | Large-model benchmark | 2019 | QA pairs | \ | 2.50×10⁴ | Knowledge reasoning | External knowledge integration; cross-domain reasoning |
    | CUB-200-2011[11] | Domain application | 2011 | Images | 200 | 1.18×10⁴ | Fine-grained recognition, part localization | Fine annotation; single scene type |
    | NABirds[12] | Domain application | 2015 | Images | 1,011 | 4.86×10⁴ | Hierarchical classification, fine-grained recognition | Comprehensive North American bird coverage; taxonomic hierarchy |
    | iNaturalist Birds[13] | Domain application | 2021 | Images | 1,203 | 2.70×10⁵ | In-the-wild recognition, biodiversity monitoring | Realistic natural scenes; imbalanced distribution |
    | Species196[24] | Domain application | 2024 | Images | 196 | 1.22×10⁶ | Invasive species recognition | 1.2M unlabeled + 19,236 finely annotated samples |
    | TreeOfLife-1M[14] | Domain application | 2024 | Images | 454,000 | 1.00×10⁷ | Biological image recognition | Detailed taxonomic hierarchy labels |
    | BIRD1445 (ours) | Domain application | 2025 | Images, audio, video, text | 1,445 | 3.54×10⁶ | Multiple AI tasks | Field data from 18 nature reserves across China; nationwide bird species coverage; multi-task; intelligent expert annotation; text modality with 1M structured observation records linked to latitude/longitude, timestamps, and other metadata |
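Table 1 notes that BIRD1445's text modality stores structured observation records linked to latitude/longitude and timestamp metadata. A minimal sketch of what one such record might look like; the field names are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ObservationRecord:
    """One structured observation record (hypothetical schema)."""
    species_id: int                  # index into the 1,445-species label set
    latitude: float                  # observation site, decimal degrees
    longitude: float
    timestamp: str                   # e.g. ISO-8601 capture time
    modality: str                    # "image" | "video" | "audio" | "text"
    media_path: Optional[str] = None # path to the associated media file, if any
```

Keeping geolocation and time alongside every sample is what lets the same record serve both recognition training and downstream ecological analyses such as distribution or migration studies.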

    Table 2  Comparison of mainstream fine-grained recognition algorithms (%)

    | Model | Top-1 accuracy | vs. baseline |
    |---|---|---|
    | ResNet50[33] | 87.74 | - |
    | DCL[28] | 86.86 | –0.88 |
    | PIM[29] | 91.65 | +3.91 |
    | MPSA[31] | 91.79 | +4.05 |
    | ConvNeXt[32] | 87.38 | –0.36 |
    | TransFG[34] | 88.49 | +0.75 |
    | PMG[35] | 88.32 | +0.58 |
    | CrossX[30] | 87.00 | –0.74 |
    | Ours | 95.39 | +7.65 |

    Table 3  Ablation results for the core modules (%)

    | Model | Top-1 accuracy | Relative gain |
    |---|---|---|
    | MPSA[31] | 91.79 | - |
    | AVE (ours) | 94.19 | +2.40 |
    | GWV (ours) | 94.46 | +2.67 |
    | PKFD (ours) | 95.39 | +3.60 |

    Table 4  Top-K ablation results (%)

    | Number of models K | Top-1 accuracy | Relative gain |
    |---|---|---|
    | 1 | 94.66 | - |
    | 2 | 95.25 | +0.59 |
    | 3 | 95.39 | +0.73 |
    | 4 | 95.18 | +0.52 |
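Table 4 varies the number K of collaborating models in the verification ensemble. One plausible reading of such a Top-K scheme, sketched below, is to fuse the K most confident models per sample; the selection and fusion rules here are assumptions for illustration, not necessarily the paper's exact method:

```python
import numpy as np

def topk_ensemble_predict(probs_per_model, k):
    """Fuse predictions from the k most confident models for one sample.

    probs_per_model: (num_models, num_classes) softmax outputs.
    """
    probs = np.asarray(probs_per_model)
    conf = probs.max(axis=1)           # each model's peak confidence
    top = np.argsort(conf)[::-1][:k]   # indices of the k most confident models
    fused = probs[top].mean(axis=0)    # average their class distributions
    return int(fused.argmax())
```

Consistent with the table's trend, such an ensemble typically improves up to a point (K = 3 here) and then degrades slightly as lower-confidence models are admitted.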

    Table 5  Benchmark for fine-grained recognition on the balanced 1,445-class image dataset (%)

    | Model | Top-1 accuracy (CUB-200-2011) | Top-1 accuracy (BIRD1445) | Change |
    |---|---|---|---|
    | ResNet50[33] | 87.74 | 78.00 | –9.74 |
    | DCL[28] | 86.86 | 83.43 | –3.43 |
    | PIM[29] | 91.65 | 85.00 | –6.65 |
    | MPSA[31] | 91.79 | 89.17 | –2.62 |
    | ConvNeXt[32] | 87.38 | 80.72 | –6.66 |
    | TransFG[34] | 88.49 | 72.73 | –15.76 |
    | PMG[35] | 88.32 | 73.69 | –14.63 |
    | CrossX[30] | 87.00 | 82.00 | –5.00 |
  • [1] ZHU Ruizhe, JIN Hai, HAN Yonghua, et al. Aircraft target detection in remote sensing images based on improved YOLOv7-tiny network[J]. IEEE Access, 2025, 13: 48904–48922. doi: 10.1109/ACCESS.2025.3551320.
    [2] HOU Zhiqiang, DONG Jiale, MA Sugang, et al. Video object segmentation algorithm based on multi-scale feature enhancement and global-local feature aggregation[J]. Journal of Electronics & Information Technology, 2024, 46(11): 4198–4207. doi: 10.11999/JEIT231394.
    [3] ZHA Zhiyuan, YUAN Xin, ZHANG Jiachao, et al. Low-rank regularized joint sparsity modeling for image denoising[J]. Journal of Electronics & Information Technology, 2025, 47(2): 561–572. doi: 10.11999/JEIT240324.
    [4] TUIA D, KELLENBERGER B, BEERY S, et al. Perspectives in machine learning for wildlife conservation[J]. Nature Communications, 2022, 13(1): 792. doi: 10.1038/s41467-022-27980-y.
    [5] WANG Hongchang, LU Huaxiang, GUO Huimin, et al. Bird-Count: A multi-modality benchmark and system for bird population counting in the wild[J]. Multimedia Tools and Applications, 2023, 82(29): 45293–45315. doi: 10.1007/s11042-023-14833-z.
    [6] WANG Hongchang, XIA Fang, ZHANG Yuanyuan, et al. Bird and habitat recognition based on deep learning algorithm: A case study of Beijing Cuihu National Urban Wetland Park[J]. Chinese Journal of Ecology, 2024, 43(7): 2231–2238. doi: 10.13292/j.1000-4890.202407.045.
    [7] GUO Huimin, JIAN Haifang, WANG Yiyu, et al. CDPNet: Conformer-based dual path joint modeling network for bird sound recognition[J]. Applied Intelligence, 2024, 54(4): 3152–3168. doi: 10.1007/s10489-024-05362-9.
    [8] NICHOLS J D and WILLIAMS B K. Monitoring for conservation[J]. Trends in Ecology & Evolution, 2006, 21(12): 668–673. doi: 10.1016/j.tree.2006.08.007.
    [9] NOROUZZADEH M S, NGUYEN A, KOSMALA M, et al. Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning[J]. Proceedings of the National Academy of Sciences, 2018, 115(25): E5716–E5725. doi: 10.1073/pnas.1719367115.
    [10] HAMPTON S E, STRASSER C A, TEWKSBURY J J, et al. Big data and the future of ecology[J]. Frontiers in Ecology and the Environment, 2013, 11(3): 156–162. doi: 10.1890/120103.
    [11] WAH C, BRANSON S, WELINDER P, et al. The Caltech-UCSD birds-200-2011 dataset[R]. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
    [12] VAN HORN G, BRANSON S, FARRELL R, et al. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection[C]. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 595–604. doi: 10.1109/CVPR.2015.7298658.
    [13] VAN HORN G, MAC AODHA O, SONG Yang, et al. The iNaturalist species classification and detection dataset[C]. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 8769–8778. doi: 10.1109/CVPR.2018.00914.
    [14] STEVENS S, WU Jiaman, THOMPSON M J, et al. BioCLIP: A vision foundation model for the tree of life[C]. Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA, 2024: 19412–19424. doi: 10.1109/CVPR52733.2024.01836.
    [15] FERGUS P, CHALMERS C, LONGMORE S, et al. Harnessing artificial intelligence for wildlife conservation[J]. Conservation, 2024, 4(4): 685–702. doi: 10.3390/conservation4040041.
    [16] EVERINGHAM M, VAN GOOL L, WILLIAMS C K I, et al. The PASCAL visual object classes (VOC) challenge[J]. International Journal of Computer Vision, 2010, 88(2): 303–338. doi: 10.1007/s11263-009-0275-4.
    [17] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context[C]. 13th European Conference on Computer Vision -- ECCV 2014, Zurich, Switzerland, 2014: 740–755. doi: 10.1007/978-3-319-10602-1_48.
    [18] DENG Jia, DONG Wei, SOCHER R, et al. ImageNet: A large-scale hierarchical image database[C]. 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, USA, 2009: 248–255. doi: 10.1109/CVPR.2009.5206848.
    [19] KUZNETSOVA A, ROM H, ALLDRIN N, et al. The open images dataset V4: Unified image classification, object detection, and visual relationship detection at scale[J]. International Journal of Computer Vision, 2020, 128(7): 1956–1981. doi: 10.1007/s11263-020-01316-z.
    [20] GOYAL Y, KHOT T, SUMMERS-STAY D, et al. Making the V in VQA matter: Elevating the role of image understanding in visual question answering[C]. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 6325–6334. doi: 10.1109/CVPR.2017.670.
    [21] HUDSON D A and MANNING C D. GQA: A new dataset for real-world visual reasoning and compositional question answering[C]. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 6693–6702. doi: 10.1109/CVPR.2019.00686.
    [22] MARINO K, RASTEGARI M, FARHADI A, et al. OK-VQA: A visual question answering benchmark requiring external knowledge[C]. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 3190–3199. doi: 10.1109/CVPR.2019.00331.
    [23] KAHL S, WILHELM-STEIN T, KLINCK H, et al. Recognizing birds from sound - The 2018 BirdCLEF baseline system[J]. arXiv preprint arXiv:1804.07177, 2018. doi: 10.48550/arXiv.1804.07177.
    [24] HE Wei, HAN Kai, NIE Ying, et al. Species196: A one-million semi-supervised dataset for fine-grained species recognition[C]. Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 1949. doi: 10.5555/3666122.3668071.
    [25] GUO Huimin, JIAN Haifang, WANG Yequan, et al. MAMGAN: Multiscale attention metric GAN for monaural speech enhancement in the time domain[J]. Applied Acoustics, 2023, 209: 109385. doi: 10.1016/j.apacoust.2023.109385.
    [26] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C]. 9th International Conference on Learning Representations, 2021.
    [27] WANG Xinlong, ZHANG Xiaosong, CAO Yue, et al. SegGPT: Towards segmenting everything in context[C]. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023: 1130–1140. doi: 10.1109/ICCV51070.2023.00110.
    [28] CHEN Yue, BAI Yalong, ZHANG Wei, et al. Destruction and construction learning for fine-grained image recognition[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 5152–5161. doi: 10.1109/CVPR.2019.00530.
    [29] CHOU P Y, LIN C H, and KAO W C. A novel plug-in module for fine-grained visual classification[J]. arXiv preprint arXiv:2202.03822, 2022. doi: 10.48550/arXiv.2202.03822.
    [30] LUO Wei, YANG Xitong, MO Xianjie, et al. Cross-X learning for fine-grained visual categorization[C]. Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 8241–8250. doi: 10.1109/ICCV.2019.00833.
    [31] WANG Jiahui, XU Qin, JIANG Bo, et al. Multi-granularity part sampling attention for fine-grained visual classification[J]. IEEE Transactions on Image Processing, 2024, 33: 4529–4542. doi: 10.1109/TIP.2024.3441813.
    [32] LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, USA, 2022: 11966–11976. doi: 10.1109/CVPR52688.2022.01167.
    [33] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778. doi: 10.1109/CVPR.2016.90.
    [34] HE Ju, CHEN Jieneng, LIU Shuai, et al. TransFG: A transformer architecture for fine-grained recognition[C]. Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022: 852–860. doi: 10.1609/aaai.v36i1.19967.
    [35] DU Ruoyi, CHANG Dongliang, BHUNIA A K, et al. Fine-grained visual classification via progressive multi-granularity training of jigsaw patches[C]. 16th European Conference on Computer Vision, Glasgow, UK, 2020: 153–168. doi: 10.1007/978-3-030-58565-5_10.
    [36] BAI Shuai, CHEN Keqin, LIU Xuejing, et al. Qwen2.5-VL technical report[J]. arXiv preprint arXiv:2502.13923, 2025. doi: 10.48550/arXiv.2502.13923.
Publication history
  • Received: 2025-07-09
  • Revised: 2025-08-19
  • Published online: 2025-09-01
