MG-MoE：基于路由机制的多粒度专家集成模型

咸凤羽; 鉴海防; 谢子晖; 杜军; 张渊媛; 宁欣; 董苗苗; 王洪昌

doi:10.11999/JEIT260219

MG-MoE：基于路由机制的多粒度专家集成模型

doi: 10.11999/JEIT260219 cstr: 32379.14.JEIT260219

咸凤羽¹,
鉴海防^{2, 5},
谢子晖¹,
杜军¹,
张渊媛³,
宁欣^{4, 5},
董苗苗^{2, 5},
王洪昌^{2, 5}

1.
山东师范大学通信与电子工程学院山东济南 250358
2.
中国科学院半导体研究所固态光电信息技术实验室北京 100083
3.
北京麋鹿生态实验中心北京 100076
4.
中国科学院半导体研究所人工智能与高速电路实验室北京 100083
5.
中国科学院大学北京 100049

基金项目: XX基金(基金号)；YY项目(项目号)基金项目：国家重点研发计划战略性创新合作项目(2024YFE0210600)

详细信息

作者简介:
咸凤羽：女，硕士生，研究方向为深度学习，计算机视觉

鉴海防：男，研究员，博士生导师，研究方向为高性能专用集成电路设计、智能信息处理算法与系统等

谢子晖：男，硕士生，研究方向为深度学习，计算机视觉

杜军：女，副教授，研究方向为优化计算、医学图像处理、智能机器人导航等

张渊媛：女，副研究员，研究方向为野生动物保护生物学

宁欣：男，博士，研究员，博士生导师，研究方向为形象认知计算理论和2D/3D视觉算法

董苗苗：女，硕士研究生，研究方向为深度学习，多模态智能感知

王洪昌：男，助理研究员，研究方向为多模态智能感知算法与软硬件协同设计

中图分类号: TP391
计量
- 文章访问数: 10
- HTML全文浏览量: 5
- PDF下载量: 1
- 被引次数: 0
出版历程
- 修回日期: 2026-04-23
- 录用日期: 2026-04-23
- 网络出版日期: 2026-05-23

MG-MoE: Routed Multi-Granularity Expert Ensemble

1.
School of Communication and Electronic Engineering, Shandong Normal University, Jinan 250358, China
2.
Laboratory of Solid State Optoelectronics Information Technology, Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China
3.
Beijing Milu Ecological Research Center, Beijing 100076, China
4.
Laboratory of Artificial Intelligence and High Speed Circuit, Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China
5.
University of Chinese Academy of Sciences, Beijing 100049, China

Funds: This work was supported by the National Key Research and Development Program of China (2024YFE0210600)

摘要

摘要: 细粒度图像识别任务中，模型需在类间差异较小的条件下，同时捕捉局部判别线索与全局结构特征，且在复杂背景、姿态变化及长尾数据分布下，仍能保持稳定的泛化性能。本文提出MG-MoE (Multi-Granularity Mixture-of-Experts)，一种基于路由机制的多粒度专家集成模型，依托样本自适应条件计算机制，在可控推理开销内实现判别性能的提升。针对路由学习的稳定性与泛化性问题，本文提出两阶段优化策略。第一阶段为动态簇级训练，基于验证集统计构建簇级软教师分布，借助KL散度正则化稳定路由行为，推动专家间形成有效分工；第二阶段为残差微调，在保持特征驱动路由形式不变的前提下，按簇对Top-2专家的分类头进行解冻，并以分组学习率对门控与专家头联合微调，从而缓解专家融合偏差并增强模型对困难样本与长尾类别的判别能力。在CUB-200-2011与Bird-1445两个基准数据集上的实验结果表明，所提出的MG-MoE具有较好的有效性。其中，在CUB-200-2011上，MG-MoE取得了92.89%的准确率；在Bird-1445抽样集上，MG-MoE的准确率达到96.80%，均达到常见模型中的最佳准确率;消融分析进一步表明，受控的Top-2融合与四专家互补结构共同决定了性能上限，并在专家过少或同质扩展时呈现出可解释的退化规律。该研究为细粒度场景下的多粒度专家建模与路由训练提供了可复用的实现范式与分析框架。
- 细粒度图像识别 /
- 混合专家模型 /
- 多粒度学习 /
- 路由机制 /
- 稀疏激活
Abstract: Objective Fine-grained image recognition (FGIR) aims to distinguish visually similar subcategories that differ only in subtle local patterns while remaining robust to large intra-class variations such as pose changes, occlusions, illumination shifts, and complex backgrounds. In real-world settings, these challenges are further compounded by long-tailed category distributions, where rare or hard classes are prone to overfitting spurious context and suffering unstable decision boundaries. This motivates a conditional computation paradigm in which complementary inductive biases are explicitly separated into specialized expert branches and combined adaptively per sample. The goal of this work is to develop a routed multi-granularity mixture-of-experts framework that improves discriminative performance under controllable inference cost, while enhancing robustness on difficult samples and long-tailed categories through adaptive sparse expert activation. Methods We propose MG-MoE (Multi-Granularity Mixture-of-Experts), a routed ensemble architecture composed of a shared backbone, four heterogeneous experts, and a learnable router that predicts expert weights conditioned on the input (Fig. 2). The experts are deliberately instantiated with complementary inductive biases to cover the key factors in FGIR: (1) MPSA emphasizes global structure and contour-level semantics; (2) PMG captures fine local details through multi-granularity part modeling; (3) TransFG focuses on pose- and deformation-aware modeling; and (4) PIM improves robustness under cluttered backgrounds via background suppression mechanisms. To limit interference and reduce unnecessary computation, MG-MoE adopts sparse fusion, where only the Top-K experts (K=2 by default) contribute to the final prediction at inference.To improve routing stability and generalization, we introduce a two-stage optimization strategy. The first stage performs dynamic cluster-level training, where a cluster-level soft teacher distribution is constructed from validation-set statistics and imposed through KL-divergence regularization to stabilize routing behavior and promote effective specialization among experts. The second stage performs residual fine-tuning: while keeping the feature-driven routing mechanism unchanged, the classification heads of the Top-2 experts associated with each cluster are selectively unfrozen, and the router and expert heads are jointly optimized with grouped learning rates. This design reduces fusion bias and strengthens discrimination on difficult samples and long-tailed categories. Results and Discussions MG-MoE achieves strong performance on standard FGIR benchmarks. On CUB-200-2011, MG-MoE attains 92.89% Top-1 accuracy, exceeding representative expert backbones when used individually, such as MPSA (91.23%), PIM (91.17%), and TransFG (90.49%), and surpassing multi-granularity baselines such as PMG (88.32%) (Table 1). On the larger Bird-1445 dataset, MG-MoE continues to show consistent improvements over strong baselines, indicating that routed multi-expert specialization remains effective under a higher number of categories and stronger long-tail effects (Table 2).The efficiency–accuracy trade-off is summarized in Table 3. MG-MoE (Top-2) reaches the best accuracy (92.89%) with a compute budget of 143.9 GFLOPs.Importantly, MG-MoE avoids dense expert activation at inference by selecting only the Top-2 experts for each sample, yielding a favorable accuracy–efficiency trade-off, and ablations show that increasing K beyond 2 does not yield consistent gains, suggesting that indiscriminate fusion can dilute discriminative evidence. Specifically, Top-2 fusion delivers the best performance, whereas Top-1 is more sensitive to routing errors and larger K can introduce noise and reduce accuracy (Table 4).We further analyze the role of expert diversity and composition. Experiments with fewer experts (two- or three-expert variants) generally underperform the full four-expert configuration, indicating that each inductive bias contributes nontrivially to handling different fine-grained difficulty factors. Conversely, simply adding more experts without introducing genuinely new inductive biases yields diminishing or negative returns, consistent with increased routing ambiguity and limited functional diversity (Table 5). These results support the design choice of a compact set of heterogeneous experts combined with sparse routing.To interpret the learned specialization, we visualize category-wise routing statistics. The expert–category heatmap shows that MPSA dominates routing weight across many categories, reflecting the central role of global structure in fine-grained discrimination; meanwhile, PIM and TransFG exhibit noticeable activation increases on specific difficult categories, aligning with their intended functionality for background suppression and pose/deformation modeling (Fig. 3). Finally, t-SNE visualizations illustrate the qualitative effect of expert fusion on class separability: shared backbone features exhibit stronger inter-class entanglement for visually similar subcategories, whereas fused outputs form clearer clusters with improved between-class separation and within-class compactness, consistent with a more reliable decision space shaped by routed expert aggregation (Fig. 4). Conclusions This work presents MG-MoE, a multi-granularity routed mixture-of-experts framework for fine-grained recognition. By combining four complementary experts with Top-2 sparse fusion and a two-stage optimization strategy for stable routing and calibrated fusion, MG-MoE improves recognition accuracy on CUB-200-2011 and Bird-1445 while providing interpretable evidence of expert specialization (Table 1, Table 2, Fig. 3, Fig. 4). Ablations confirm that controlled Top-2 fusion and heterogeneous expert design are key to the observed gains, while overly dense fusion or homogeneous expert expansion offers limited benefit (Table 4, Table 5).
- fine-grained image recognition /
- mixture-of-experts /
- multi-granularity learning /
- routing /
- sparse activation.

HTML全文

图 1 不同鸟类图像专家注意力区域可视化

每行对应一个测试样本，每列展示输入图像及不同专家的注意力响应。专家名称后的“*”表示该专家对该样本分类正确，“×”表示该专家对该样本分类错误。

下载: 全尺寸图片幻灯片

图 2 MG-MoE模型框架图

下载: 全尺寸图片幻灯片

图 3 MG-MoE专家激活与类别的统计关联热力图

下载: 全尺寸图片幻灯片

图 4 t-SNE 特征空间流形图

下载: 全尺寸图片幻灯片

表 1 CUB-200-2011测试集Top-1准确率对比(%)

模型	Top-1准确率(%)	相对提升(pp)
DCL^[31]	86.86	-
CrossX^[32]	87.00	+0.14
ConvNeXt^[33]	87.38	+0.52
ResNet-50^[34]	87.74	+0.88
PMG^[18]	88.32	+1.46
TransFG^[17]	90.49	+3.63
PIM^[19]	91.17	+4.31
MPSA^[3]	91.23	+4.37
MG-MoE(本文)	92.89	+6.03

下载: 导出CSV

表 2 Bird-1445抽测试集(200类)Top-1准确率对比(%)

方法	Top-1 准确率(%)	相对提升(pp)
ResNet-50	88.90	—
DCL	89.40	+0.50
CrossX	90.60	+1.70
ConvNeXt	90.70	+1.80
PMG	93.10	+4.20
TransFG	93.60	+4.70
PIM	94.80	+5.90
MPSA	95.10	+6.20
MG-MoE(本文)	96.80	+7.90

下载: 导出CSV

表 3 多专家模型CUB-200-2011效率分析对比

模型	Acc(%)	GFLOPs	Params(M)	Latency(ms)	FPS
PMG	88.32	37.4	45.1	7.5	133.3
MPSA	91.23	62.7	94.2	18.4	54.3
PIM	91.17	73.2	94.3	7.7	130.5
TransFG	90.49	99.1	87.6	22.1	45.2
MG-MoE (Full)	92.89	275.5	330.9	56.9	17.6
MG-MoE (Top-2)	92.89	143.9	330.9	37	27

下载: 导出CSV

表 4 MG-MoE模型Top-K消融实验结果(CUB-200-2011)

模型数量	Top-1(%)	相对最优(pp)
Top-1	91.95	–0.94
Top-2	92.89	0.00
Top-3	92.62	–0.27
Top-4	92.10	–0.79

下载: 导出CSV

表 5 MG-MoE模型不同专家数量Top-2融合消融实验结果(CUB-200-2011)

专家总数	专家组合	Top-1(%)	相对最优(pp)
2	MPSA+TransFG	91.28	–1.61
2	PMG+PIM	91.22	–1.67
2	MPSA+PIM	92.03	–0.86
3	MPSA+TransFG+PIM	92.36	–0.53
3	MPSA+PMG+PIM	92.15	–0.74
3	MPSA+PMG+TransFG	92.01	–0.88
4	MPSA+PMG+TransFG+PIM	92.89	-
5	四专家+1个同质专家	92.78	–0.11
6	四专家+2个同质专家	92.60	–0.29

下载: 导出CSV

参考文献(34)

[1]	SUN Hongbo, HE Xiangteng, XU Jinglin, et al. SIM-OFE: Structure information mining and object-aware feature enhancement for fine-grained visual categorization[J]. IEEE Transactions on Image Processing, 2024, 33: 5312–5326. doi: 10.1109/TIP.2024.3459788.
[2]	YANG Shengying, YANG Xinqi, WU Jianfeng, et al. Significant feature suppression and cross-feature fusion networks for fine-grained visual classification[J]. Scientific Reports, 2024, 14(1): 24051. doi: 10.1038/s41598-024-74654-4.
[3]	WANG Jiahui, XU Qin, JIANG Bo, et al. Multi-granularity part sampling attention for fine-grained visual classification[J]. IEEE Transactions on Image Processing, 2024, 33: 4529–4542. doi: 10.1109/TIP.2024.3441813.
[4]	MA Bing, LI Junyi, JIN Zhengbei, et al. Fine-grained image recognition with bio-inspired gradient-aware attention[J]. Biomimetics, 2025, 10(12): 834. doi: 10.3390/biomimetics10120834.
[5]	CHANG Dongliang, TONG Yujun, DU Ruoyi, et al. An erudite fine-grained visual classification model[C]. Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 7268–7277. doi: 10.1109/CVPR52729.2023.00702.
[6]	SU J C, CHENG Zezhou, and MAJI S. A realistic evaluation of semi-supervised learning for fine-grained classification[C]. Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 12966–12975. doi: 10.1109/CVPR46437.2021.01277.
[7]	SHU Yangyang, YU Baosheng, XU Haiming, et al. Improving fine-grained visual recognition in low data regimes via self-boosting attention mechanism[C]. Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022: 449–465. doi: 10.1007/978-3-031-19806-9_26.
[8]	FEDUS W, ZOPH B, and SHAZEER N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity[J]. Journal of Machine Learning Research, 2022, 23(120): 1–39.
[9]	JACOBS R A, JORDAN M I, NOWLAN S J, et al. Adaptive mixtures of local experts[J]. Neural Computation, 1991, 3(1): 79–87. doi: 10.1162/neco.1991.3.1.79.
[10]	SHAZEER N, MIRHOSEINI A, MAZIARZ K, et al. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer[C]. Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 2017.
[11]	RIQUELME C, PUIGCERVER J, MUSTAFA B, et al. Scaling vision with sparse mixture of experts[C]. Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021: 657. (查阅网上资料, 未找到对应的出版地信息, 请确认).
[12]	HAN Xumeng, WEI Longhui, DOU Zhiyang, et al. ViMoE: An empirical study of designing vision mixture-of-experts[J]. IEEE Transactions on Image Processing, 2025, 34: 7209–7221. doi: 10.1109/TIP.2025.3626887.
[13]	ZHU Jinguo, ZHU Xizhou, WANG Wenhai, et al. Uni-perceiver-MoE: Learning sparse generalist models with conditional MoEs[C]. Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 193.
[14]	MUSTAFA B, RIQUELME C, PUIGCERVER J, et al. Multimodal contrastive learning with LIMoE: The language-image mixture of experts[C]. Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 695.
[15]	SHEN Leyang, CHEN Gongwei, SHAO Rui, et al. MoME: Mixture of multimodal experts for generalist multimodal large language models[C]. Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 1330.
[16]	ZHENG Haiyang, PU Nan, LI Wenjing, et al. Generalized fine-grained category discovery with multi-granularity conceptual experts[J]. arXiv preprint arXiv: 2509.26227, 2025. (查阅网上资料, 请核对文献类型及格式).
[17]	HE Ju, CHEN Jieneng, LIU Shuai, et al. TransFG: A transformer architecture for fine-grained recognition[C]. Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022: 852–860. doi: 10.1609/aaai.v36i1.19967. (查阅网上资料,未找到对应的出版地信息,请确认).
[18]	DU Ruoyi, CHANG Dongliang, BHUNIA A K, et al. Fine-grained visual classification via progressive multi-granularity training of jigsaw patches[C]. Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 2020: 153–168. doi: 10.1007/978-3-030-58565-5_10.
[19]	CHOU P Y, LIN C H, and KAO W C. A novel plug-in module for fine-grained visual classification[J]. arXiv preprint arXiv: 2202.03822, 2022. <b>(查阅网上资料, 请核对文献类型及格式)</b>.
[20]	XU Zhikang, YUE Xiaodong, LV Ying, et al. Trusted fine-grained image classification through hierarchical evidence fusion[C]. Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, USA, 2023: 10657–10665. doi: 10.1609/aaai.v37i9.26265.
[21]	ZHENG Haiyang, PU Nan, LI Wenjing, et al. Generalized fine-grained category discovery with multi-granularity conceptual experts[J]. arXiv preprint arXiv: 2509.26227, 2025. (查阅网上资料, 请核对文献类型及格式)(查阅网上资料, 本条文献与第16条文献重复, 请确认).
[22]	LEPIKHIN D, LEE H, XU Yuanzhong, et al. GShard: Scaling giant models with conditional computation and automatic sharding[C]. Proceedings of the 9th International Conference on Learning Representations, Austria, 2021. (查阅网上资料, 未找到对应的出版城市信息, 请确认).
[23]	GURURANGAN S, LEWIS M, HOLTZMAN A, et al. DEMix layers: Disentangling domains for modular language modeling[C]. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, United States, 2022: 5557–5576. doi: 10.18653/v1/2022.naacl-main.407.
[24]	RAJBHANDARI S, LI Conglong, YAO Zhewei, et al. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale[C]. Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, 2022: 18332–18346.
[25]	WANG Lean, GAO Huazuo, ZHAO Chenggang, et al. Auxiliary-loss-free load balancing strategy for mixture-of-experts[J]. arXiv preprint arXiv: 2408.15664, 2024. (查阅网上资料, 请核对文献类型及格式).
[26]	ROLLER S, SUKHBAATAR S, SZLAM A, et al. Hash layers for large sparse models[C]. Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021: 1343. (查阅网上资料, 未找到对应的出版地信息, 请确认).
[27]	JIANG A Q, SABLAYROLLES A, ROUX A, et al. Mixtral of experts[J]. arXiv preprint arXiv: 2401.04088, 2024. (查阅网上资料, 请核对文献类型及格式).
[28]	DAI Damai, DENG Chengqi, ZHAO Chenggang, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models[C]. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 2024. doi: 10.18653/v1/2024.acl-long.70.
[29]	CHEN Tianlong, CHEN Xuxi, DU Xianzhi, et al. AdaMV-MoE: Adaptive multi-task vision mixture-of-experts[C]. Proceedings of 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 17346–17357. doi: 10.1109/ICCV51070.2023.01591.
[30]	王洪昌, 咸凤羽, 谢子晖, 等. BIRD1445: 面向生态监测的大规模多模态鸟类数据集[J]. 电子与信息学报, 2026, 48(2): 873–888. doi: 10.11999/JEIT250647. WANG Hongchang, XIAN Fengyu, XIE Zihui, et al. BIRD1445: Large-scale multimodal bird dataset for ecological monitoring[J]. Journal of Electronics & Information Technology, 2026, 48(2): 873–888. doi: 10.11999/JEIT250647.
[31]	CHEN Yue, BAI Yalong, ZHANG Wei, et al. Destruction and construction learning for fine-grained image recognition[C]. Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 5157–5166. doi: 10.1109/CVPR.2019.00530.
[32]	LUO Wei, YANG Xitong, MO Xianjie, et al. Cross-x learning for fine-grained visual categorization[C]. Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 8242–8251. doi: 10.1109/ICCV.2019.00833.
[33]	LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]. Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 11976–11986. doi: 10.1109/CVPR52688.2022.01167.
[34]	HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778. doi: 10.1109/CVPR.2016.90.