XIAN Fengyu, JIAN Haifang, XIE Zihui, DU Jun, ZHANG Yuanyuan, NING Xin, DONG Miaomiao, WANG Hongchang. MG-MoE: Routed Multi-Granularity Expert Ensemble[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT260219

MG-MoE: Routed Multi-Granularity Expert Ensemble

doi: 10.11999/JEIT260219 cstr: 32379.14.JEIT260219
Funds: This work was supported by the National Key Research and Development Program of China (2024YFE0210600).
  • Accepted Date: 2026-04-23
  • Revision Received Date: 2026-04-23
  • Available Online: 2026-05-23

Objective: Fine-grained image recognition (FGIR) aims to distinguish visually similar subcategories that differ only in subtle local patterns while remaining robust to large intra-class variations such as pose changes, occlusions, illumination shifts, and complex backgrounds. In real-world settings, these challenges are further compounded by long-tailed category distributions, where rare or hard classes are prone to overfitting spurious context and suffering unstable decision boundaries. This motivates a conditional computation paradigm in which complementary inductive biases are explicitly separated into specialized expert branches and combined adaptively per sample. The goal of this work is to develop a routed multi-granularity mixture-of-experts framework that improves discriminative performance under controllable inference cost, while enhancing robustness on difficult samples and long-tailed categories through adaptive sparse expert activation.

Methods: We propose MG-MoE (Multi-Granularity Mixture-of-Experts), a routed ensemble architecture composed of a shared backbone, four heterogeneous experts, and a learnable router that predicts expert weights conditioned on the input (Fig. 2). The experts are deliberately instantiated with complementary inductive biases to cover the key factors in FGIR: (1) MPSA emphasizes global structure and contour-level semantics; (2) PMG captures fine local details through multi-granularity part modeling; (3) TransFG focuses on pose- and deformation-aware modeling; and (4) PIM improves robustness under cluttered backgrounds via background suppression mechanisms. To limit interference and reduce unnecessary computation, MG-MoE adopts sparse fusion, where only the Top-K experts (K=2 by default) contribute to the final prediction at inference.

To improve routing stability and generalization, we introduce a two-stage optimization strategy. The first stage performs dynamic cluster-level training, where a cluster-level soft teacher distribution is constructed from validation-set statistics and imposed through KL-divergence regularization to stabilize routing behavior and promote effective specialization among experts. The second stage performs residual fine-tuning: while keeping the feature-driven routing mechanism unchanged, the classification heads of the Top-2 experts associated with each cluster are selectively unfrozen, and the router and expert heads are jointly optimized with grouped learning rates. This design reduces fusion bias and strengthens discrimination on difficult samples and long-tailed categories.

Results and Discussions: MG-MoE achieves strong performance on standard FGIR benchmarks. On CUB-200-2011, MG-MoE attains 92.89% Top-1 accuracy, exceeding representative expert backbones when used individually, such as MPSA (91.23%), PIM (91.17%), and TransFG (90.49%), and surpassing multi-granularity baselines such as PMG (88.32%) (Table 1). On the larger Bird-1445 dataset, MG-MoE continues to show consistent improvements over strong baselines, indicating that routed multi-expert specialization remains effective under a higher number of categories and stronger long-tail effects (Table 2).

The efficiency–accuracy trade-off is summarized in Table 3. MG-MoE (Top-2) reaches the best accuracy (92.89%) with a compute budget of 143.9 GFLOPs. Importantly, MG-MoE avoids dense expert activation at inference by selecting only the Top-2 experts for each sample, yielding a favorable accuracy–efficiency trade-off, and ablations show that increasing K beyond 2 does not yield consistent gains, suggesting that indiscriminate fusion can dilute discriminative evidence. Specifically, Top-2 fusion delivers the best performance, whereas Top-1 is more sensitive to routing errors and larger K can introduce noise and reduce accuracy (Table 4).

We further analyze the role of expert diversity and composition. Experiments with fewer experts (two- or three-expert variants) generally underperform the full four-expert configuration, indicating that each inductive bias contributes nontrivially to handling different fine-grained difficulty factors. Conversely, simply adding more experts without introducing genuinely new inductive biases yields diminishing or negative returns, consistent with increased routing ambiguity and limited functional diversity (Table 5). These results support the design choice of a compact set of heterogeneous experts combined with sparse routing.

To interpret the learned specialization, we visualize category-wise routing statistics. The expert–category heatmap shows that MPSA dominates routing weight across many categories, reflecting the central role of global structure in fine-grained discrimination; meanwhile, PIM and TransFG exhibit noticeable activation increases on specific difficult categories, aligning with their intended functionality for background suppression and pose/deformation modeling (Fig. 3). Finally, t-SNE visualizations illustrate the qualitative effect of expert fusion on class separability: shared backbone features exhibit stronger inter-class entanglement for visually similar subcategories, whereas fused outputs form clearer clusters with improved between-class separation and within-class compactness, consistent with a more reliable decision space shaped by routed expert aggregation (Fig. 4).

Conclusions: This work presents MG-MoE, a multi-granularity routed mixture-of-experts framework for fine-grained recognition. By combining four complementary experts with Top-2 sparse fusion and a two-stage optimization strategy for stable routing and calibrated fusion, MG-MoE improves recognition accuracy on CUB-200-2011 and Bird-1445 while providing interpretable evidence of expert specialization (Table 1, Table 2, Fig. 3, Fig. 4). Ablations confirm that controlled Top-2 fusion and heterogeneous expert design are key to the observed gains, while overly dense fusion or homogeneous expert expansion offers limited benefit (Table 4, Table 5).
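
Two mechanisms named in the Methods paragraph, Top-K sparse fusion and KL-divergence regularization of the router toward a cluster-level teacher, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify whether fusion acts on logits or probabilities, whether the selected weights are renormalized, or the direction of the KL term, so the choices below (fusing logits, renormalizing the Top-K weights, and using KL(teacher || router)) are assumptions. The function names `top_k_fuse` and `routing_kl` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def top_k_fuse(router_logits, expert_logits, k=2):
    """Fuse expert outputs using only the k highest-weighted experts per sample.

    router_logits: (batch, experts) raw router scores.
    expert_logits: (batch, experts, classes) per-expert class scores.
    Returns (fused, weights): fused is (batch, classes); weights is
    (batch, experts) with exactly k nonzero entries per row summing to 1.
    Assumption: the kept weights are renormalized and logits are fused.
    """
    gate = softmax(router_logits, axis=-1)
    # Indices of the k largest gate values in each row.
    top_idx = np.argsort(-gate, axis=-1)[:, :k]
    mask = np.zeros_like(gate)
    np.put_along_axis(mask, top_idx, 1.0, axis=-1)
    sparse = gate * mask
    sparse = sparse / sparse.sum(axis=-1, keepdims=True)
    # Weighted sum over the expert axis.
    fused = np.einsum("be,bec->bc", sparse, expert_logits)
    return fused, sparse


def routing_kl(teacher, gate, eps=1e-8):
    """Batch-mean KL(teacher || gate) between two expert distributions.

    teacher: (batch, experts) soft target per sample, e.g. looked up from a
             per-cluster table; how that table is built is not shown here.
    gate:    (batch, experts) dense router probabilities.
    Assumption: the KL direction is teacher-to-router.
    """
    t = np.clip(teacher, eps, 1.0)
    g = np.clip(gate, eps, 1.0)
    return float(np.mean(np.sum(t * (np.log(t) - np.log(g)), axis=-1)))
```

With a hypothetical per-cluster table `cluster_teacher` of shape (clusters, experts) and integer `cluster_ids` per sample, the regularizer would be called as `routing_kl(cluster_teacher[cluster_ids], softmax(router_logits))` and added to the classification loss with some weight; that weight, like the table itself, is not given in the abstract.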