MG-MoE: Routed Multi-Granularity Expert Ensemble
-
摘要: 细粒度图像识别任务中,模型需在类间差异较小的条件下,同时捕捉局部判别线索与全局结构特征,且在复杂背景、姿态变化及长尾数据分布下,仍能保持稳定的泛化性能。本文提出MG-MoE (Multi-Granularity Mixture-of-Experts),一种基于路由机制的多粒度专家集成模型,依托样本自适应条件计算机制,在可控推理开销内实现判别性能的提升。针对路由学习的稳定性与泛化性问题,本文提出两阶段优化策略。第一阶段为动态簇级训练,基于验证集统计构建簇级软教师分布,借助KL散度正则化稳定路由行为,推动专家间形成有效分工;第二阶段为残差微调,在保持特征驱动路由形式不变的前提下,按簇对Top-2专家的分类头进行解冻,并以分组学习率对门控与专家头联合微调,从而缓解专家融合偏差并增强模型对困难样本与长尾类别的判别能力。在CUB-200-2011与Bird-1445两个基准数据集上的实验结果表明,所提出的MG-MoE具有较好的有效性。其中,在CUB-200-2011上,MG-MoE取得了92.89%的准确率;在Bird-1445抽样集上,MG-MoE的准确率达到96.80%,均达到常见模型中的最佳准确率;消融分析进一步表明,受控的Top-2融合与四专家互补结构共同决定了性能上限,并在专家过少或同质扩展时呈现出可解释的退化规律。该研究为细粒度场景下的多粒度专家建模与路由训练提供了可复用的实现范式与分析框架。Abstract:
Objective Fine-grained image recognition (FGIR) aims to distinguish visually similar subcategories that differ only in subtle local patterns while remaining robust to large intra-class variations such as pose changes, occlusions, illumination shifts, and complex backgrounds. In real-world settings, these challenges are further compounded by long-tailed category distributions, where rare or hard classes are prone to overfitting spurious context and suffering unstable decision boundaries. This motivates a conditional computation paradigm in which complementary inductive biases are explicitly separated into specialized expert branches and combined adaptively per sample. The goal of this work is to develop a routed multi-granularity mixture-of-experts framework that improves discriminative performance under controllable inference cost, while enhancing robustness on difficult samples and long-tailed categories through adaptive sparse expert activation. Methods We propose MG-MoE (Multi-Granularity Mixture-of-Experts), a routed ensemble architecture composed of a shared backbone, four heterogeneous experts, and a learnable router that predicts expert weights conditioned on the input ( Fig. 2 ). The experts are deliberately instantiated with complementary inductive biases to cover the key factors in FGIR: (1) MPSA emphasizes global structure and contour-level semantics; (2) PMG captures fine local details through multi-granularity part modeling; (3) TransFG focuses on pose- and deformation-aware modeling; and (4) PIM improves robustness under cluttered backgrounds via background suppression mechanisms. To limit interference and reduce unnecessary computation, MG-MoE adopts sparse fusion, where only the Top-K experts (K=2 by default) contribute to the final prediction at inference.To improve routing stability and generalization, we introduce a two-stage optimization strategy. The first stage performs dynamic cluster-level training, where a cluster-level soft teacher distribution is constructed from validation-set statistics and imposed through KL-divergence regularization to stabilize routing behavior and promote effective specialization among experts. The second stage performs residual fine-tuning: while keeping the feature-driven routing mechanism unchanged, the classification heads of the Top-2 experts associated with each cluster are selectively unfrozen, and the router and expert heads are jointly optimized with grouped learning rates. This design reduces fusion bias and strengthens discrimination on difficult samples and long-tailed categories.Results and Discussions MG-MoE achieves strong performance on standard FGIR benchmarks. On CUB-200-2011, MG-MoE attains 92.89% Top-1 accuracy, exceeding representative expert backbones when used individually, such as MPSA (91.23%), PIM (91.17%), and TransFG (90.49%), and surpassing multi-granularity baselines such as PMG (88.32%) ( Table 1 ). On the larger Bird-1445 dataset, MG-MoE continues to show consistent improvements over strong baselines, indicating that routed multi-expert specialization remains effective under a higher number of categories and stronger long-tail effects (Table 2 ).The efficiency–accuracy trade-off is summarized inTable 3 . MG-MoE (Top-2) reaches the best accuracy (92.89%) with a compute budget of 143.9 GFLOPs.Importantly, MG-MoE avoids dense expert activation at inference by selecting only the Top-2 experts for each sample, yielding a favorable accuracy–efficiency trade-off, and ablations show that increasing K beyond 2 does not yield consistent gains, suggesting that indiscriminate fusion can dilute discriminative evidence. Specifically, Top-2 fusion delivers the best performance, whereas Top-1 is more sensitive to routing errors and larger K can introduce noise and reduce accuracy (Table 4 ).We further analyze the role of expert diversity and composition. Experiments with fewer experts (two- or three-expert variants) generally underperform the full four-expert configuration, indicating that each inductive bias contributes nontrivially to handling different fine-grained difficulty factors. Conversely, simply adding more experts without introducing genuinely new inductive biases yields diminishing or negative returns, consistent with increased routing ambiguity and limited functional diversity (Table 5 ). These results support the design choice of a compact set of heterogeneous experts combined with sparse routing.To interpret the learned specialization, we visualize category-wise routing statistics. The expert–category heatmap shows that MPSA dominates routing weight across many categories, reflecting the central role of global structure in fine-grained discrimination; meanwhile, PIM and TransFG exhibit noticeable activation increases on specific difficult categories, aligning with their intended functionality for background suppression and pose/deformation modeling (Fig. 3 ). Finally, t-SNE visualizations illustrate the qualitative effect of expert fusion on class separability: shared backbone features exhibit stronger inter-class entanglement for visually similar subcategories, whereas fused outputs form clearer clusters with improved between-class separation and within-class compactness, consistent with a more reliable decision space shaped by routed expert aggregation (Fig. 4 ).Conclusions This work presents MG-MoE, a multi-granularity routed mixture-of-experts framework for fine-grained recognition. By combining four complementary experts with Top-2 sparse fusion and a two-stage optimization strategy for stable routing and calibrated fusion, MG-MoE improves recognition accuracy on CUB-200-2011 and Bird- 1445 while providing interpretable evidence of expert specialization (Table 1 ,Table 2 ,Fig. 3 ,Fig. 4 ). Ablations confirm that controlled Top-2 fusion and heterogeneous expert design are key to the observed gains, while overly dense fusion or homogeneous expert expansion offers limited benefit (Table 4 ,Table 5 ). -
表 1 CUB-200-2011测试集Top-1准确率对比(%)
表 2 Bird-
1445 抽测试集(200类)Top-1准确率对比(%)方法 Top-1 准确率(%) 相对提升(pp) ResNet-50 88.90 — DCL 89.40 +0.50 CrossX 90.60 +1.70 ConvNeXt 90.70 +1.80 PMG 93.10 +4.20 TransFG 93.60 +4.70 PIM 94.80 +5.90 MPSA 95.10 +6.20 MG-MoE(本文) 96.80 +7.90 表 3 多专家模型CUB-200-2011效率分析对比
模型 Acc(%) GFLOPs Params(M) Latency(ms) FPS PMG 88.32 37.4 45.1 7.5 133.3 MPSA 91.23 62.7 94.2 18.4 54.3 PIM 91.17 73.2 94.3 7.7 130.5 TransFG 90.49 99.1 87.6 22.1 45.2 MG-MoE (Full) 92.89 275.5 330.9 56.9 17.6 MG-MoE (Top-2) 92.89 143.9 330.9 37 27 表 4 MG-MoE模型Top-K消融实验结果(CUB-200-2011)
模型数量 Top-1(%) 相对最优(pp) Top-1 91.95 –0.94 Top-2 92.89 0.00 Top-3 92.62 –0.27 Top-4 92.10 –0.79 表 5 MG-MoE模型不同专家数量Top-2融合消融实验结果(CUB-200-2011)
专家总数 专家组合 Top-1(%) 相对最优(pp) 2 MPSA+TransFG 91.28 –1.61 2 PMG+PIM 91.22 –1.67 2 MPSA+PIM 92.03 –0.86 3 MPSA+TransFG+PIM 92.36 –0.53 3 MPSA+PMG+PIM 92.15 –0.74 3 MPSA+PMG+TransFG 92.01 –0.88 4 MPSA+PMG+TransFG+PIM 92.89 - 5 四专家+1个同质专家 92.78 –0.11 6 四专家+2个同质专家 92.60 –0.29 -
[1] SUN Hongbo, HE Xiangteng, XU Jinglin, et al. SIM-OFE: Structure information mining and object-aware feature enhancement for fine-grained visual categorization[J]. IEEE Transactions on Image Processing, 2024, 33: 5312–5326. doi: 10.1109/TIP.2024.3459788. [2] YANG Shengying, YANG Xinqi, WU Jianfeng, et al. Significant feature suppression and cross-feature fusion networks for fine-grained visual classification[J]. Scientific Reports, 2024, 14(1): 24051. doi: 10.1038/s41598-024-74654-4. [3] WANG Jiahui, XU Qin, JIANG Bo, et al. Multi-granularity part sampling attention for fine-grained visual classification[J]. IEEE Transactions on Image Processing, 2024, 33: 4529–4542. doi: 10.1109/TIP.2024.3441813. [4] MA Bing, LI Junyi, JIN Zhengbei, et al. Fine-grained image recognition with bio-inspired gradient-aware attention[J]. Biomimetics, 2025, 10(12): 834. doi: 10.3390/biomimetics10120834. [5] CHANG Dongliang, TONG Yujun, DU Ruoyi, et al. An erudite fine-grained visual classification model[C]. Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 7268–7277. doi: 10.1109/CVPR52729.2023.00702. [6] SU J C, CHENG Zezhou, and MAJI S. A realistic evaluation of semi-supervised learning for fine-grained classification[C]. Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 12966–12975. doi: 10.1109/CVPR46437.2021.01277. [7] SHU Yangyang, YU Baosheng, XU Haiming, et al. Improving fine-grained visual recognition in low data regimes via self-boosting attention mechanism[C]. Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022: 449–465. doi: 10.1007/978-3-031-19806-9_26. [8] FEDUS W, ZOPH B, and SHAZEER N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity[J]. Journal of Machine Learning Research, 2022, 23(120): 1–39. [9] JACOBS R A, JORDAN M I, NOWLAN S J, et al. Adaptive mixtures of local experts[J]. Neural Computation, 1991, 3(1): 79–87. doi: 10.1162/neco.1991.3.1.79. [10] SHAZEER N, MIRHOSEINI A, MAZIARZ K, et al. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer[C]. Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 2017. [11] RIQUELME C, PUIGCERVER J, MUSTAFA B, et al. Scaling vision with sparse mixture of experts[C]. Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021: 657. (查阅网上资料, 未找到对应的出版地信息, 请确认). [12] HAN Xumeng, WEI Longhui, DOU Zhiyang, et al. ViMoE: An empirical study of designing vision mixture-of-experts[J]. IEEE Transactions on Image Processing, 2025, 34: 7209–7221. doi: 10.1109/TIP.2025.3626887. [13] ZHU Jinguo, ZHU Xizhou, WANG Wenhai, et al. Uni-perceiver-MoE: Learning sparse generalist models with conditional MoEs[C]. Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 193. [14] MUSTAFA B, RIQUELME C, PUIGCERVER J, et al. Multimodal contrastive learning with LIMoE: The language-image mixture of experts[C]. Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 695. [15] SHEN Leyang, CHEN Gongwei, SHAO Rui, et al. MoME: Mixture of multimodal experts for generalist multimodal large language models[C]. Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 1330. [16] ZHENG Haiyang, PU Nan, LI Wenjing, et al. Generalized fine-grained category discovery with multi-granularity conceptual experts[J]. arXiv preprint arXiv: 2509.26227, 2025. (查阅网上资料, 请核对文献类型及格式). [17] HE Ju, CHEN Jieneng, LIU Shuai, et al. TransFG: A transformer architecture for fine-grained recognition[C]. Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022: 852–860. doi: 10.1609/aaai.v36i1.19967. (查阅网上资料,未找到对应的出版地信息,请确认). [18] DU Ruoyi, CHANG Dongliang, BHUNIA A K, et al. Fine-grained visual classification via progressive multi-granularity training of jigsaw patches[C]. Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 2020: 153–168. doi: 10.1007/978-3-030-58565-5_10. [19] CHOU P Y, LIN C H, and KAO W C. A novel plug-in module for fine-grained visual classification[J]. arXiv preprint arXiv: 2202.03822, 2022. <b>(查阅网上资料, 请核对文献类型及格式)</b>. [20] XU Zhikang, YUE Xiaodong, LV Ying, et al. Trusted fine-grained image classification through hierarchical evidence fusion[C]. Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, USA, 2023: 10657–10665. doi: 10.1609/aaai.v37i9.26265. [21] ZHENG Haiyang, PU Nan, LI Wenjing, et al. Generalized fine-grained category discovery with multi-granularity conceptual experts[J]. arXiv preprint arXiv: 2509.26227, 2025. (查阅网上资料, 请核对文献类型及格式)(查阅网上资料, 本条文献与第16条文献重复, 请确认). [22] LEPIKHIN D, LEE H, XU Yuanzhong, et al. GShard: Scaling giant models with conditional computation and automatic sharding[C]. Proceedings of the 9th International Conference on Learning Representations, Austria, 2021. (查阅网上资料, 未找到对应的出版城市信息, 请确认). [23] GURURANGAN S, LEWIS M, HOLTZMAN A, et al. DEMix layers: Disentangling domains for modular language modeling[C]. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, United States, 2022: 5557–5576. doi: 10.18653/v1/2022.naacl-main.407. [24] RAJBHANDARI S, LI Conglong, YAO Zhewei, et al. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale[C]. Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, 2022: 18332–18346. [25] WANG Lean, GAO Huazuo, ZHAO Chenggang, et al. Auxiliary-loss-free load balancing strategy for mixture-of-experts[J]. arXiv preprint arXiv: 2408.15664, 2024. (查阅网上资料, 请核对文献类型及格式). [26] ROLLER S, SUKHBAATAR S, SZLAM A, et al. Hash layers for large sparse models[C]. Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021: 1343. (查阅网上资料, 未找到对应的出版地信息, 请确认). [27] JIANG A Q, SABLAYROLLES A, ROUX A, et al. Mixtral of experts[J]. arXiv preprint arXiv: 2401.04088, 2024. (查阅网上资料, 请核对文献类型及格式). [28] DAI Damai, DENG Chengqi, ZHAO Chenggang, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models[C]. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 2024. doi: 10.18653/v1/2024.acl-long.70. [29] CHEN Tianlong, CHEN Xuxi, DU Xianzhi, et al. AdaMV-MoE: Adaptive multi-task vision mixture-of-experts[C]. Proceedings of 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 17346–17357. doi: 10.1109/ICCV51070.2023.01591. [30] 王洪昌, 咸凤羽, 谢子晖, 等. BIRD1445: 面向生态监测的大规模多模态鸟类数据集[J]. 电子与信息学报, 2026, 48(2): 873–888. doi: 10.11999/JEIT250647.WANG Hongchang, XIAN Fengyu, XIE Zihui, et al. BIRD1445: Large-scale multimodal bird dataset for ecological monitoring[J]. Journal of Electronics & Information Technology, 2026, 48(2): 873–888. doi: 10.11999/JEIT250647. [31] CHEN Yue, BAI Yalong, ZHANG Wei, et al. Destruction and construction learning for fine-grained image recognition[C]. Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 5157–5166. doi: 10.1109/CVPR.2019.00530. [32] LUO Wei, YANG Xitong, MO Xianjie, et al. Cross-x learning for fine-grained visual categorization[C]. Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 8242–8251. doi: 10.1109/ICCV.2019.00833. [33] LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]. Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 11976–11986. doi: 10.1109/CVPR52688.2022.01167. [34] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778. doi: 10.1109/CVPR.2016.90. -
下载:
下载: