Facial Expression Recognition Model Based on Improved YOLO12n
To address the loss of facial expression recognition accuracy under low resolution, complex illumination, and partial occlusion, this paper proposes YOLO-FER (Facial Expression Recognition), an improved model based on YOLO12n. The model redesigns the original C3k2 bottleneck with a NewStarBlock module to mitigate the loss of high-dimensional features, introduces a Multidimensional Collaborative Attention (MCA) module that jointly models the channel, height, and width dimensions to strengthen fine-grained feature extraction, adds a Low Resolution Feature Extractor (LRFE) module to improve robustness in dim and blurred scenes, and adopts an Adaptive Threshold Focal Loss (ATFL) that dynamically reweights easy and hard samples to alleviate class imbalance. Experiments on the RAF-DB and Low Light Dataset show that YOLO-FER improves mAP@0.5 over the YOLO12n baseline by 3.8% and 5.0%, respectively, while preserving real-time detection speed, improving generalization and robustness, and locating key expression regions more accurately, making it suitable for practical expression recognition scenarios.

Abstract:
Objective Facial Expression Recognition (FER) is a key technology in affective computing and intelligent human–computer interaction. However, in practical scenarios, recognition performance is significantly degraded by low resolution, complex illumination, partial occlusion, and class imbalance. Although deep learning-based methods have achieved considerable progress, lightweight models such as YOLO12n still suffer from insufficient feature extraction capability and weak robustness under degraded imaging conditions. To address these limitations, this paper proposes an improved FER model, termed YOLO-FER, aiming to enhance feature representation, improve discrimination of similar expressions, and maintain real-time detection performance in low-quality environments.

Methods Based on the YOLO12n framework, the proposed YOLO-FER model introduces several targeted improvements. First, a C3k2_star module is constructed by embedding the NewStarBlock into the original bottleneck structure, which enhances high-dimensional nonlinear feature representation and alleviates feature loss during fusion, as illustrated in Fig. 2 and Fig. 3. Second, a Multidimensional Collaborative Attention (MCA) mechanism is incorporated into the backbone to form the A2C2f_MCA module, enabling joint modeling across the channel, height, and width dimensions to capture fine-grained facial features (Fig. 4). Third, a Low Resolution Feature Extractor (LRFE) module is introduced at the end of the backbone to enhance pixel-level feature representation under low-resolution and low-light conditions through dilated convolution and pixel attention (Fig. 5). Finally, an Adaptive Threshold Focal Loss (ATFL) is adopted to dynamically adjust the contribution of easy and hard samples, effectively addressing class imbalance and improving discrimination of similar expressions. The model structure is shown in Fig. 1.
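The reweighting idea behind ATFL can be sketched in isolation. Below is a minimal pure-Python illustration of a focal loss with a hard/easy split at a threshold; the threshold and the `lam` exponent here are hypothetical stand-ins for illustration only, since the exact ATFL formulation follows the cited EFLNet work:

```python
import math

def focal_loss(p, gamma=2.0):
    """Standard focal loss for a positive sample with predicted probability p:
    the factor (1 - p)**gamma down-weights easy samples (p near 1)."""
    return -((1.0 - p) ** gamma) * math.log(p)

def adaptive_threshold_focal_loss(p, threshold, lam=2.0):
    """Sketch of a threshold-based variant (hypothetical form): samples whose
    confidence falls below the threshold are treated as hard and keep a strong
    loss contribution; samples above it are suppressed as in focal loss."""
    if p < threshold:  # hard sample: amplify its contribution
        return -lam * math.log(p)
    return -((1.0 - p) ** lam) * math.log(p)  # easy sample: focal suppression

# A hard sample (p = 0.3) contributes far more loss than an easy one (p = 0.9)
hard = adaptive_threshold_focal_loss(0.3, threshold=0.5)
easy = adaptive_threshold_focal_loss(0.9, threshold=0.5)
```

In ATFL proper, the threshold adapts during training rather than being fixed; the sketch only shows why such a split rebalances easy and hard samples.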
Experiments are conducted on the RAF-DB and Low Light Dataset (LLD), and evaluated using Precision (P), Recall (R), F1-score, and mAP@0.5.

Results and Discussions Extensive experiments demonstrate that the proposed YOLO-FER model achieves superior performance compared with the baseline YOLO12n and other YOLO-series models. As shown in Table 2, on the RAF-DB dataset, YOLO-FER achieves P = 81.8%, R = 81.9%, and mAP@0.5 = 87.6%, improving mAP@0.5 by 3.8% over the baseline. Similarly, on the LLD dataset (Table 3), YOLO-FER achieves an mAP@0.5 of 95.9%, a 5.0% improvement, indicating strong robustness under low-light conditions. The ablation studies in Table 2 and Table 3 verify that each proposed module contributes positively to performance. Specifically, the introduction of C3k2_star, A2C2f_MCA, LRFE, and ATFL leads to consistent gains in detection accuracy, and their combination achieves the best results with only a slight increase in parameters. The comparison with other YOLO variants in Table 5 further shows that YOLO-FER achieves the best trade-off between accuracy and model complexity. In addition, the mAP@0.5 curves in Fig. 8 illustrate the consistent performance improvement of the proposed model during training. The confusion matrix analysis in Fig. 9 and Table 4 demonstrates that the MCA module effectively improves the discrimination of similar expressions such as "Angry" and "Disgust," reducing misclassification rates. Visualization results based on Grad-CAM (Fig. 13) indicate that YOLO-FER focuses more accurately on key facial regions, such as the eyes, eyebrows, and mouth, compared with the baseline model. Furthermore, experiments under degraded conditions (Fig. 14 and Table 13) show that YOLO-FER maintains higher detection performance than YOLO12n, with a smaller overall performance drop, confirming its robustness in low-quality scenarios.
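The mAP@0.5 metric used throughout counts a detection as a true positive when its Intersection-over-Union (IoU) with a ground-truth box is at least 0.5. A minimal sketch of that matching criterion, with boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted face box covering half of the ground-truth box: IoU = 0.5,
# so it is just barely counted as a true positive at the 0.5 threshold.
gt = (0, 0, 100, 100)
pred = (0, 0, 100, 50)
match = iou(gt, pred) >= 0.5
```

Average precision is then computed per class from the precision–recall curve of these matches and averaged over the seven expression classes.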
Although the model parameters increase slightly from 2.5M to 3.0M, the inference speed remains competitive, as shown in Table 7, demonstrating that the proposed method preserves real-time capability.

Conclusions This paper proposes YOLO-FER, an improved facial expression recognition model based on YOLO12n, which effectively enhances feature extraction capability and robustness in low-quality image scenarios. By integrating the C3k2_star module, the MCA attention mechanism, the LRFE module, and the ATFL loss function, the model significantly improves recognition accuracy and generalization ability. Experimental results on the RAF-DB and LLD datasets confirm that YOLO-FER achieves state-of-the-art performance while maintaining efficient inference speed. The proposed method provides a practical solution for real-time FER applications in complex environments. Future work will focus on improving performance under extreme low-resolution conditions and exploring cross-domain generalization and micro-expression recognition.
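The per-class recall and pairwise confusion rates reported for the Angry–Disgust pair (Table 4) can each be read off a single row of a confusion matrix. A minimal pure-Python sketch, using hypothetical prediction counts (not the paper's raw data) chosen to mirror the baseline's Angry row:

```python
def row_stats(row, true_idx, other_idx):
    """Given one confusion-matrix row (prediction counts for a single true
    class), return (recall of the true class, confusion rate toward the
    class at other_idx)."""
    total = sum(row)
    recall = row[true_idx] / total
    confusion = row[other_idx] / total
    return recall, confusion

# Hypothetical counts for true class "Angry" over the columns
# [Angry, Disgust, all other classes]: 69 correct, 10 predicted as
# Disgust, 21 predicted elsewhere
recall_angry, angry_to_disgust = row_stats([69, 10, 21], true_idx=0, other_idx=1)
# recall_angry = 0.69, angry_to_disgust = 0.10
```

Comparing these two numbers before and after adding A2C2f_MCA is exactly the Δ row reported in Table 4.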
Table 1 Expression class distribution in the RAF-DB and LLD datasets
| Class | RAF-DB samples | LLD samples |
| --- | --- | --- |
| Angry | 867 | 1156 |
| Disgust | 877 | 1726 |
| Fear | 355 | / |
| Happy | 5957 | 2064 |
| Neutral | 3204 | 1748 |
| Sad | 2460 | 1896 |
| Surprise | 1619 | 2158 |

Table 2 Ablation study on the RAF-DB dataset
| Baseline | C3k2_star | A2C2f_MCA | LRFE | ATFL | P(%) | R(%) | F1 | mAP@0.5(%) | Params/M |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| √ |  |  |  |  | 78.4 | 78.6 | 0.78 | 83.8 | 2.5 |
| √ | √ |  |  |  | 78.8 | 80.7 | 0.79 | 85.9 | 2.5 |
| √ |  | √ |  |  | 78.6 | 82.7 | 0.81 | 86.9 | 2.5 |
| √ |  |  | √ |  | 81.2 | 80.8 | 0.81 | 86.8 | 2.8 |
| √ |  |  |  | √ | 78.7 | 81.3 | 0.80 | 86.7 | 2.5 |
| √ | √ | √ |  |  | 80.1 | 81.5 | 0.81 | 87.0 | 2.8 |
| √ | √ | √ | √ |  | 80.6 | 81.7 | 0.81 | 87.3 | 3.0 |
| √ | √ | √ | √ | √ | 81.8 | 81.9 | 0.82 | 87.6 | 3.0 |

Table 3 Ablation study on the LLD dataset
| Baseline | C3k2_star | A2C2f_MCA | LRFE | ATFL | P(%) | R(%) | F1 | mAP@0.5(%) | Params/M |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| √ |  |  |  |  | 87.3 | 82.8 | 0.85 | 90.9 | 2.5 |
| √ | √ |  |  |  | 91.0 | 87.5 | 0.89 | 94.0 | 2.5 |
| √ |  | √ |  |  | 89.6 | 87.3 | 0.88 | 93.2 | 2.5 |
| √ |  |  | √ |  | 92.4 | 87.4 | 0.90 | 94.4 | 2.8 |
| √ |  |  |  | √ | 90.1 | 87.9 | 0.89 | 93.8 | 2.5 |
| √ | √ | √ |  |  | 89.2 | 89.6 | 0.90 | 94.2 | 2.8 |
| √ | √ | √ | √ |  | 92.3 | 89.4 | 0.91 | 95.0 | 3.0 |
| √ | √ | √ | √ | √ | 91.9 | 91.2 | 0.92 | 95.9 | 3.0 |

Table 4 Discrimination of similar expressions (Angry–Disgust)
| Model | R (Angry) | R (Disgust) | Confusion (Angry→Disgust) | Confusion (Disgust→Angry) |
| --- | --- | --- | --- | --- |
| YOLO12n | 0.69 | 0.45 | 0.10 | 0.10 |
| YOLO12n + A2C2f_MCA | 0.75 | 0.59 | 0.09 | 0.06 |
| Δ | +0.06 | +0.14 | −0.01 | −0.04 |

Table 5 Comparison of YOLO-series models
| Model | P(%) | R(%) | F1 | mAP@0.5(%) | mAP@0.5 std | Params/M |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv8n | 81.4 | 77.9 | 0.79 | 84.1 | 0.0035 | 3.0 |
| YOLOv10n | 71.3 | 77.8 | 0.74 | 81.6 | 0.0062 | 2.7 |
| YOLO11n | 81.5 | 79.4 | 0.80 | 85.9 | 0.0052 | 2.6 |
| YOLO12n | 78.4 | 78.6 | 0.78 | 83.8 | 0.0041 | 2.5 |
| YOLO-FER | 81.8 | 81.9 | 0.82 | 87.6 | 0.0029 | 3.0 |

Table 6 Performance of mainstream methods on RAF-DB
Table 7 Computational complexity and inference speed of different models on an RTX 4090
| Model | Input resolution | GFLOPs | FPS | Params/M |
| --- | --- | --- | --- | --- |
| YOLOv8n | 640×640 | 8.2 | 603.86 | 3.0 |
| YOLOv10n | 640×640 | 8.4 | 781.71 | 2.7 |
| YOLO11n | 640×640 | 6.4 | 677.36 | 2.6 |
| YOLO12n | 640×640 | 5.8 | 619.49 | 2.5 |
| YOLO-FER | 640×640 | 7.7 | 503.53 | 3.0 |

Table 8 Computational complexity and inference speed of YOLO-FER at different input resolutions
| Input resolution | GFLOPs | FPS |
| --- | --- | --- |
| 320×320 | 7.7 | 1291.76 |
| 480×480 | 7.7 | 803.41 |
| 640×640 | 7.7 | 503.53 |

Table 9 Hyperparameter value ranges
| Module | Hyperparameter | Value range | Baseline value |
| --- | --- | --- | --- |
| NewStarBlock | Activation function | ReLU6, SiLU | SiLU |
| LRFE | Dilation rate | 1, 2, 3 | 2 |
| ATFL | λ (loss modulation coefficient) | 1.5, 2.0, 2.5 | 2.0 |

Table 10 Sensitivity analysis of the activation function
| Activation function | P(%) | R(%) | F1 | mAP@0.5(%) |
| --- | --- | --- | --- | --- |
| ReLU6 | 88.5 | 86.3 | 0.87 | 93.0 |
| SiLU | 91.9 | 91.2 | 0.92 | 95.9 |

Table 11 Sensitivity analysis of the dilation rate
| Dilation rate | P(%) | R(%) | F1 | mAP@0.5(%) |
| --- | --- | --- | --- | --- |
| 1 | 88.9 | 87.8 | 0.88 | 93.7 |
| 2 | 91.9 | 91.2 | 0.92 | 95.9 |
| 3 | 91.8 | 91.1 | 0.91 | 95.7 |

Table 12 Sensitivity analysis of λ
| λ | P(%) | R(%) | F1 | mAP@0.5(%) |
| --- | --- | --- | --- | --- |
| 1.5 | 91.9 | 91.1 | 0.91 | 95.7 |
| 2.0 | 91.9 | 91.2 | 0.92 | 95.9 |
| 2.5 | 89.8 | 90.0 | 0.90 | 95.0 |

Table 13 Performance on the original and randomly degraded test sets (YOLO-FER/YOLO12n)
| Test set | Angry | Disgust | Fear | Happy | Neutral | Sad | Surprise | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original | 91.1/85.7 | 67.1/59.2 | 75.4/66.3 | 98.6/98.2 | 92.7/90.8 | 93.6/92.9 | 94.6/93.5 | 87.6/83.8 |
| Degraded | 85.6/80.9 | 66.4/54.2 | 70.4/58.8 | 98.1/97.9 | 88.7/88.2 | 91.6/90.2 | 92.5/90.2 | 84.8/80.1 |
| Δ | −5.5/−4.8 | −0.7/−5.0 | −5.0/−7.5 | −0.5/−0.5 | −4.0/−2.6 | −2.0/−2.7 | −2.1/−3.3 | −2.8/−3.7 |
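The dilation sweep in Table 11 trades receptive field against spatial detail: a dilated convolution enlarges its effective kernel without adding parameters. The general effective-kernel-size formula, evaluated at the rates tested, can be sketched as:

```python
def effective_kernel(k, dilation):
    """Effective receptive-field size of one dilated convolution layer:
    a k x k kernel with dilation d spans k + (k - 1) * (d - 1) pixels."""
    return k + (k - 1) * (dilation - 1)

# A 3x3 kernel at the dilation rates swept in the sensitivity study
spans = {d: effective_kernel(3, d) for d in (1, 2, 3)}  # {1: 3, 2: 5, 3: 7}
```

This is consistent with the sensitivity result: dilation 2 widens the context enough to help low-resolution faces, while dilation 3 brings no further gain.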
[1] ADYAPADY R R and ANNAPPA B. A comprehensive review of facial expression recognition techniques[J]. Multimedia Systems, 2023, 29(1): 73–103. doi: 10.1007/s00530-022-00984-w.
[2] LI Shan and DENG Weihong. Deep facial expression recognition: A survey[J]. IEEE Transactions on Affective Computing, 2022, 13(3): 1195–1215. doi: 10.1109/TAFFC.2020.2981446.
[3] ZHANG Guoxiang and SUN Yunzhuo. CNN facial recognition method based on local binary pattern in complex light environment[J]. Journal of Hubei Normal University: Natural Science, 2023, 43(4): 49–55. doi: 10.3969/j.issn.2096-3149.2023.04.007.
[4] LI Rui, LIU Pengyu, and JIA Kebin. Facial expression recognition under partial occlusion[J]. Computer Applications and Software, 2016, 33(9): 147–150, 175. doi: 10.3969/j.issn.1000-386x.2016.09.035.
[5] LI Shan and DENG Weihong. Deep facial expression recognition: A survey[J]. Journal of Image and Graphics, 2020, 25(11): 2306–2320. doi: 10.11834/jig.200233.
[6] WANG Kai, PENG Xiaojiang, YANG Jianfei, et al. Region attention networks for pose and occlusion robust facial expression recognition[J]. IEEE Transactions on Image Processing, 2020, 29: 4057–4069. doi: 10.1109/TIP.2019.2956143.
[7] MAO Jiawei, XU Rui, YIN Xuesong, et al. POSTER++: A simpler and stronger facial expression recognition network[J]. Pattern Recognition, 2025, 157: 110951. doi: 10.1016/j.patcog.2024.110951.
[8] ZHAO Minghua, DONG Shuangshuang, HU Jing, et al. Attention-guided three-stream convolutional neural network for micro-expression recognition[J]. Journal of Image and Graphics, 2024, 29(1): 111–122. doi: 10.11834/jig.230053.
[9] YANG Qiaohe, HE Yueshun, CHEN Hongmao, et al. A novel lightweight facial expression recognition network based on deep shallow network fusion and attention mechanism[J]. Algorithms, 2025, 18(8): 473. doi: 10.3390/a18080473.
[10] WEN Zhengyao, LIN Wenzhong, WANG Tao, et al. Distract your attention: Multi-head cross attention network for facial expression recognition[J]. Biomimetics, 2023, 8(2): 199. doi: 10.3390/biomimetics8020199.
[11] LAI Zhenyi, CHEN Renhe, JIA Jinlu, et al. Real-time micro-expression recognition based on ResNet and atrous convolutions[J]. Journal of Ambient Intelligence and Humanized Computing, 2023, 14(11): 15215–15226. doi: 10.1007/s12652-020-01779-5.
[12] XUE Peiyun, DAI Shutao, BAI Jing, et al. Emotion recognition with speech and facial images[J]. Journal of Electronics & Information Technology, 2024, 46(12): 4542–4552. doi: 10.11999/JEIT240087.
[13] ZHANG Jiahao, LIU Feng, and QI Jiayin. Lightweight micro-expression recognition architecture based on Bottleneck Transformer[J]. Computer Science, 2022, 49(6A): 370–377. doi: 10.11896/jsjkx.210500023.
[14] ZHANG Peng, KONG Weiwei, and TENG Jinbao. Facial expression recognition based on multi-scale feature attention mechanism[J]. Computer Engineering and Applications, 2022, 58(1): 182–189. doi: 10.3778/j.issn.1002-8331.2106-0174.
[15] SHAO Yanhua, ZHANG Duo, CHU Hongyu, et al. A review of YOLO object detection based on deep learning[J]. Journal of Electronics & Information Technology, 2022, 44(10): 3697–3708. doi: 10.11999/JEIT210790.
[16] MA Xu, DAI Xiyang, BAI Yue, et al. Rewrite the stars[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 5694–5703. doi: 10.1109/CVPR52733.2024.00544.
[17] YU Yang, ZHANG Yi, CHENG Zeyu, et al. MCA: Multidimensional collaborative attention in deep convolutional neural networks for image recognition[J]. Engineering Applications of Artificial Intelligence, 2023, 126: 107079. doi: 10.1016/j.engappai.2023.107079.
[18] YANG Bo, ZHANG Xinyu, ZHANG Jian, et al. EFLNet: Enhancing feature learning network for infrared small target detection[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5906511. doi: 10.1109/TGRS.2024.3365677.
[19] LI Shan and DENG Weihong. Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition[J]. IEEE Transactions on Image Processing, 2019, 28(1): 356–370. doi: 10.1109/TIP.2018.2868382.
[20] Emotiscore. Low light dataset computer vision model[EB/OL]. https://universe.roboflow.com/emotiscore/low-light-dataset, 2025.