Cross-Domain Deepfake Detection with Dynamic Artifacts Tracking and Spatial-Frequency Interaction Analysis
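Abstract: Cross-domain deepfake detection suffers from three problems: reliance on static artifacts and fixed frequency bands, confinement to a single analysis domain, and insufficient global association capability. To address them, this paper proposes a Pyramidal Interactive Dual-Stream Network (PIDSNet) that integrates dynamic artifact tracking with spatial-frequency interaction analysis. First, a multi-branch feature extraction module and a spectral convolution module mine artifact features dynamically. This reduces dependence on fixed parameters and frequency bands and markedly improves the adaptive capture of artifact features. Second, a pyramid squeeze attention module is fused with multi-head self-attention to balance global and local feature extraction. Finally, a Gaussian pyramid and a Laplacian pyramid are constructed in the spatial and frequency domains, respectively. They extract high- and low-frequency information at multiple levels and fuse features across domains, forming a new dynamic spatial-frequency feature interaction mechanism. The experiments cover forgery datasets from 25 generative adversarial networks and diffusion models. On them, average accuracy improves by 7.4 percentage points over the strongest compared method, offering a new approach to cross-domain generalization in deepfake detection.
Abstract: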
Objective The rapid development of generative adversarial networks (GANs) and diffusion models has led to a sharp increase in the number of fake images. Their wide dissemination poses a potential and unpredictable threat to individuals, societies, and nations. Efficient and highly generalizable deepfake detection methods are therefore needed, and cross-domain detection capability has become a core task in current forgery detection research. However, existing detection methods still suffer from three problems. Feature extraction relies on specific artifacts or fixed parameters. The spatial and frequency modalities are often learned in isolation, without a dynamic interaction mechanism. Global feature association is insufficient. To address these limitations, a Pyramidal Interactive Dual-Stream Network (PIDSNet) integrating dynamic artifact tracking and spatial-frequency interaction analysis is proposed.
Methods PIDSNet is built around two branches, one in the spatial domain and one in the frequency domain (Fig. 1). Four modules work collaboratively: the Multi-Branch Feature Extraction (MBFE) module, the Frequency Domain Feature Extraction (FDFE) module, the Pyramid Spatial-Frequency Interaction (PSFI) module, and the Multi-Head Pyramid Squeezing Attention (MHPSA) module. The MBFE module (Fig. 2) is the basic unit of the spatial branch. It constructs multi-level, multi-branch dilated convolutions so that enlarging the receptive field does not cause information loss, enabling collaborative extraction of global and local features. The FDFE module, the core of the frequency branch, fuses the MBFE module with spectral convolutions to mine frequency-domain artifact features dynamically. This reduces the dependence of traditional frequency-domain methods on fixed parameters and frequency bands. It also significantly improves the model's ability to adaptively capture artifact features from different generative models.
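The FDFE description combines two operations: a spectral convolution and the MBFE-style dilated branches. The NumPy sketch below chains them for a single-channel map. It is an illustration under stated assumptions rather than the authors' implementation. The dilation rates (1, 2, 3), averaging the branches with a residual path, the 3×3 kernels, and all function names are hypothetical. The complex spectral weights, which would be learned during training, are fixed here.

```python
import numpy as np


def dilated_conv2d(x, kernel, dilation):
    """Same-size 2D cross-correlation of a single-channel map with an odd-sized dilated kernel."""
    k = kernel.shape[0]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad, mode="reflect")  # reflect padding keeps the output the same size as x
    h, w = x.shape
    out = np.zeros((h, w))
    for i in range(k):
        for j in range(k):
            # Tap (i, j) reads the input at offset ((i - k // 2) * dilation, (j - k // 2) * dilation).
            out += kernel[i, j] * xp[i * dilation:i * dilation + h, j * dilation:j * dilation + w]
    return out


def multi_branch(x, kernels, dilations):
    """Hypothetical MBFE-style block: parallel dilated branches averaged, plus a residual path."""
    branches = [dilated_conv2d(x, kern, d) for kern, d in zip(kernels, dilations)]
    return x + np.mean(branches, axis=0)


def spectral_conv(x, weights):
    """Global frequency-domain filter: real FFT, element-wise complex weights, inverse FFT."""
    spectrum = np.fft.rfft2(x)  # shape (H, W // 2 + 1)
    return np.fft.irfft2(spectrum * weights, s=x.shape)


def fdfe_sketch(x, weights, kernels, dilations=(1, 2, 3)):
    """Hypothetical FDFE-style path: spectral filtering followed by multi-branch dilated convs."""
    return multi_branch(spectral_conv(x, weights), kernels, dilations)


rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
weights = np.ones((32, 17), dtype=complex)  # identity filter; would be learned in training
kernels = [0.1 * rng.standard_normal((3, 3)) for _ in range(3)]
y = fdfe_sketch(x, weights, kernels)
print(y.shape)  # (32, 32)
```

The spectral weights touch every frequency bin, so a trained filter is not tied to a hand-picked band, which is the property claimed for FDFE. The dilated branches enlarge the receptive field without pooling or striding, so no spatial resolution is lost.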
The PSFI module (Fig. 3) is the key to interaction between the spatial and frequency branches. It builds a spatial Gaussian pyramid and a frequency Laplacian pyramid to capture low-frequency global information and high-frequency detail. Dynamic weight enhancement at each pyramid level fuses spatial and frequency features adaptively, yielding a dynamic spatial-frequency feature interaction mechanism. The MHPSA module (Fig. 4) combines multi-head self-attention (MHSA) with dilated convolution. It retains the local detail capture of the Pyramid Squeeze Attention (PSA) module while strengthening global feature modeling, which improves the model's adaptability and robustness.
Results and Discussions To verify the cross-domain detection capability of PIDSNet across generative paradigms, the model is trained on the ProGAN dataset and tested on multiple GAN and diffusion model datasets. First, for GANs: on the ForenSynths test set covering four GANs (Table 3), the average Acc. reaches 95.2%, 5.3 and 5.2 percentage points higher than LGrad and FreqNet, respectively. On the GANGen dataset covering nine GANs (Tables 4 and 5), the average Acc. reaches 95.5%, 20.1 points higher than F3Net. The average Acc. and average A.P. are 4.1 and 1.3 points higher than those of FreqNet. Second, for diffusion models, tests are conducted on the DiffusionForensics and Ojha datasets. On DiffusionForensics (Table 6), the average Acc. reaches 95.4%, 4.8 and 13.2 points higher than LGrad and FreqNet, respectively. On the Ojha dataset (Table 7), the average Acc. and average A.P. reach 96.1% and 99.4%, the highest among the compared methods. Moreover, with only 2.4M parameters (Table 8), PIDSNet achieves an average Acc. of 95.7% and an average A.P. of 98.7% across the 25 datasets, surpassing the other methods.
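Of the four modules, MHPSA is described most briefly: MHSA combined with dilated convolution on top of PSA [24]. The NumPy sketch below is one plausible reading under stated assumptions, not the authors' design. A stack of branch maps stands in for dilated convolutions at different rates. PSA-style fusion (global average pooling, then a softmax across branches for each channel) supplies multi-scale local detail; the SE fully connected layers of the original PSA are omitted. Standard multi-head scaled dot-product self-attention over all spatial positions then supplies global context. The projection weights and all function names are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def psa_fuse(branches):
    """PSA-style fusion of S branches shaped (S, C, H, W): global average pooling gives an
    (S, C) descriptor, and a softmax across S yields per-channel branch weights."""
    desc = branches.mean(axis=(2, 3))                        # (S, C)
    attn = softmax(desc, axis=0)                             # weights sum to 1 over branches
    return (attn[:, :, None, None] * branches).sum(axis=0)   # (C, H, W)


def mhsa(tokens, wq, wk, wv, wo, heads):
    """Standard multi-head scaled dot-product self-attention over tokens shaped (N, C)."""
    n, c = tokens.shape
    d = c // heads

    def split(t):  # (N, C) -> (heads, N, d)
        return t.reshape(n, heads, d).transpose(1, 0, 2)

    q, k, v = split(tokens @ wq), split(tokens @ wk), split(tokens @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)           # (heads, N, N)
    out = softmax(scores, axis=-1) @ v                       # (heads, N, d)
    return out.transpose(1, 0, 2).reshape(n, c) @ wo


def mhpsa_sketch(branches, weights, heads=4):
    """Hypothetical MHPSA-style block: PSA fusion for local multi-scale detail, then MHSA over
    all spatial positions for global context, with a residual connection."""
    local = psa_fuse(branches)                               # (C, H, W)
    c, h, w = local.shape
    tokens = local.reshape(c, h * w).T                       # (H*W, C): one token per position
    glob = mhsa(tokens, *weights, heads=heads)
    return local + glob.T.reshape(c, h, w)


rng = np.random.default_rng(0)
S, C, H, W = 3, 16, 8, 8
branches = rng.standard_normal((S, C, H, W))  # stand-in for dilated-conv outputs at 3 rates
weights = tuple(rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(4))
out = mhpsa_sketch(branches, weights, heads=4)
print(out.shape)  # (16, 8, 8)
```

The softmax across branches lets each channel choose its own mix of receptive fields. Self-attention relates every position to every other, which is the local/global balance the module is meant to strike.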
Taken together, the experiments show that PIDSNet, trained only on ProGAN, adapts to multiple types of GAN. It also effectively detects diffusion-generated images, even though their spatial- and frequency-domain artifacts differ markedly from those of GANs, demonstrating strong generalization across models and generative paradigms. Grad-CAM visualizations (Fig. 5) further show that PIDSNet performs well on face images even though it was not trained on them.
Conclusions This paper addresses the weak domain adaptability and poor generalization of current GAN and diffusion model detection methods. These weaknesses stem from feature extraction that relies on domain-specific artifacts or fixed parameters and from weak interaction between modalities. To solve these problems, a spatial-frequency collaborative learning framework and a dynamic artifact mining mechanism are constructed. They loosen the dependence of traditional methods on domain-specific artifacts and fixed parameters and strengthen the extraction of general forgery features. The model's effectiveness is validated on images generated by 25 different GAN and diffusion models. Compared with current state-of-the-art models, its average Acc. and A.P. are significantly higher, confirming good performance on cross-domain forgery detection. However, the experiments also reveal limitations. For generative models whose high-frequency energy distribution is very close to that of real images (such as S3GAN), there is still room for improvement, and the frequency-domain feature mining mechanism needs further optimization.
Future work will therefore focus on two aspects. First, the frequency-domain feature extraction mechanism will be further optimized to better identify forged samples whose high-frequency energy is close to that of real images. Second, detection of low-quality forged images affected by compression distortion and noise will be improved. In addition, methods for separating and detecting artifacts in images produced by multiple generative models will be studied, so as to improve the model's adaptability in complex real-world environments.
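Both the S3GAN limitation and the first future-work item are phrased in terms of high-frequency energy. The paper does not define a specific measure, so the NumPy sketch below assumes a common one. It computes the fraction of an image's spectral power, with the DC term removed, at normalized radial frequencies at or above a cutoff (a hypothetical 0.25 cycles/pixel here). Under the paper's observation, real images and S3GAN outputs would yield similar values of a statistic like this, leaving little frequency-domain contrast for a detector to exploit.

```python
import numpy as np


def high_freq_energy_ratio(img, cutoff=0.25):
    """Fraction of spectral power (DC removed) at radial frequency >= cutoff, in cycles/pixel.

    Radial frequency ranges from 0 to about 0.707 (the corner of the spectrum)."""
    img = np.asarray(img, dtype=float)
    img = img - img.mean()                               # remove the DC component
    power = np.abs(np.fft.fft2(img)) ** 2
    fy = np.fft.fftfreq(img.shape[0])[:, None]           # vertical frequencies
    fx = np.fft.fftfreq(img.shape[1])[None, :]           # horizontal frequencies
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = power.sum()
    return float(power[radius >= cutoff].sum() / total) if total > 0 else 0.0


n = 64
yy, xx = np.mgrid[0:n, 0:n]
smooth = np.cos(2 * np.pi * 2 * xx / n)                  # low-frequency content only
checker = (-1.0) ** (xx + yy)                            # Nyquist-rate checkerboard pattern
print(high_freq_energy_ratio(smooth))                    # close to 0
print(high_freq_energy_ratio(checker))                   # close to 1
print(high_freq_energy_ratio(smooth + 0.1 * checker))    # about 0.02
```

The checkerboard illustrates the kind of periodic, Nyquist-rate trace that transposed-convolution upsampling can leave, which a frequency branch picks up easily. The difficulty described above arises when a generator's output shows no such excess energy.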
Table 1 Evaluation results on the ProGAN test set (%)
Table 2 Evaluation on the ForenSynths test set, 2-class training setting (%)
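| Methods | StyleGAN Acc. | StyleGAN A.P. | StyleGAN2 Acc. | StyleGAN2 A.P. | CycleGAN Acc. | CycleGAN A.P. | StarGAN Acc. | StarGAN A.P. | Mean Acc. | Mean A.P. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wang[26] | 52.8 | 82.8 | 75.7 | 96.6 | 58.6 | 81.5 | 51.2 | 74.3 | 59.6 | 83.8 |
| F3Net[16] | 84.5 | 99.5 | 82.2 | 99.8 | 81.2 | 89.7 | 100.0 | 100.0 | 87.0 | 97.3 |
| Frank[27] | 73.1 | 68.5 | 75.0 | 70.9 | 86.5 | 80.8 | 85.0 | 77.0 | 79.9 | 74.3 |
| BiHPF[28] | 71.6 | 74.1 | 77.0 | 81.1 | 86.0 | 86.6 | 93.8 | 80.8 | 82.1 | 80.7 |
| FrePGAN[29] | 80.8 | 92.0 | 72.2 | 94.0 | 69.1 | 70.3 | 98.5 | 100.0 | 80.2 | 89.1 |
| LGrad[13] | 88.3 | 97.6 | 87.5 | 97.5 | 82.7 | 90.7 | 96.5 | 97.7 | 88.8 | 95.9 |
| FreqNet[15] | 88.4 | 98.9 | 85.8 | 98.1 | 88.7 | 99.8 | 95.5 | 100.0 | 89.6 | 99.2 |
| PIDSNet | 90.5 | 98.4 | 92.7 | 99.3 | 90.4 | 99.2 | 97.1 | 100.0 | 92.7 | 99.2 |

Table 3 Evaluation on the ForenSynths test set, 4-class training setting (%)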
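| Methods | StyleGAN Acc. | StyleGAN A.P. | StyleGAN2 Acc. | StyleGAN2 A.P. | CycleGAN Acc. | CycleGAN A.P. | StarGAN Acc. | StarGAN A.P. | Mean Acc. | Mean A.P. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wang[26] | 63.8 | 91.4 | 76.4 | 97.5 | 72.7 | 88.6 | 63.8 | 90.8 | 69.2 | 92.1 |
| F3Net[16] | 92.6 | 99.7 | 88.0 | 99.8 | 76.4 | 84.3 | 99.5 | 99.8 | 89.1 | 95.9 |
| Frank[27] | 74.5 | 72.0 | 73.1 | 71.4 | 75.5 | 71.2 | 99.5 | 99.5 | 80.7 | 78.5 |
| BiHPF[28] | 76.9 | 75.1 | 76.2 | 74.7 | 81.9 | 78.9 | 94.4 | 94.4 | 82.4 | 80.8 |
| FrePGAN[29] | 80.7 | 89.6 | 84.1 | 98.6 | 71.1 | 74.4 | 99.9 | 100.0 | 84.0 | 90.7 |
| LGrad[13] | 89.5 | 94.8 | 90.6 | 97.5 | 84.5 | 94.0 | 94.8 | 97.6 | 89.9 | 96.0 |
| FreqNet[15] | 90.2 | 99.7 | 88.0 | 99.5 | 95.8 | 99.6 | 85.7 | 99.8 | 90.0 | 99.7 |
| PIDSNet | 91.4 | 99.9 | 97.8 | 99.8 | 91.5 | 98.8 | 100.0 | 100.0 | 95.2 | 99.6 |

Table 4 Evaluation on the GANGen dataset, 4-class training setting (%)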
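| Methods | AttGAN Acc. | AttGAN A.P. | BEGAN Acc. | BEGAN A.P. | CramerGAN Acc. | CramerGAN A.P. | InfoMaxGAN Acc. | InfoMaxGAN A.P. | MMDGAN Acc. | MMDGAN A.P. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wang[26] | 51.1 | 83.7 | 50.2 | 44.9 | 81.5 | 97.5 | 71.1 | 94.7 | 72.9 | 94.4 |
| F3Net[16] | 85.2 | 94.8 | 87.1 | 97.5 | 89.5 | 99.8 | 67.1 | 83.1 | 73.7 | 99.6 |
| LGrad[13] | 68.6 | 93.8 | 69.9 | 89.2 | 50.3 | 54.0 | 71.1 | 82.0 | 57.5 | 67.3 |
| FreqNet[15] | 88.6 | 98.1 | 97.5 | 99.4 | 93.5 | 97.8 | 92.1 | 96.6 | 91.7 | 97.6 |
| PIDSNet | 95.9 | 99.3 | 99.5 | 100.0 | 99.0 | 99.9 | 98.1 | 99.6 | 98.0 | 99.6 |

Table 5 Evaluation on the GANGen dataset, 4-class training setting (%)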
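| Methods | RelGAN Acc. | RelGAN A.P. | S3GAN Acc. | S3GAN A.P. | SNGAN Acc. | SNGAN A.P. | STGAN Acc. | STGAN A.P. | 9 GANs Mean Acc. | 9 GANs Mean A.P. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wang[26] | 53.3 | 82.1 | 55.2 | 66.1 | 62.7 | 90.4 | 63.0 | 92.7 | 62.3 | 82.9 |
| F3Net[16] | 98.8 | 100.0 | 65.4 | 70.0 | 51.6 | 93.6 | 60.3 | 99.9 | 75.4 | 93.1 |
| LGrad[13] | 89.1 | 99.1 | 78.5 | 86.0 | 78.0 | 87.4 | 54.8 | 68.0 | 68.6 | 80.8 |
| FreqNet[15] | 97.9 | 99.5 | 84.5 | 88.7 | 81.7 | 95.3 | 95.4 | 99.1 | 91.4 | 96.9 |
| PIDSNet | 100.0 | 100.0 | 80.5 | 87.1 | 94.5 | 98.4 | 94.1 | 100.0 | 95.5 | 98.2 |

Table 6 Evaluation on the DiffusionForensics dataset, 4-class training setting (%)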
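| Methods | LDM Acc. | LDM A.P. | PNDM Acc. | PNDM A.P. | VQ-Diffusion Acc. | VQ-Diffusion A.P. | Stable Diffusion V1 Acc. | Stable Diffusion V1 A.P. | Stable Diffusion V2 Acc. | Stable Diffusion V2 A.P. | Mean Acc. | Mean A.P. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| F3Net[16] | 100.0 | 100.0 | 72.8 | 99.5 | 99.9 | 99.9 | 73.4 | 97.2 | 99.8 | 100.0 | 89.2 | 99.3 |
| LGrad[13] | 99.7 | 100.0 | 69.5 | 98.5 | 96.2 | 100.0 | 90.4 | 99.4 | 97.1 | 100.0 | 90.6 | 99.6 |
| Ojha[11] | 82.2 | 97.1 | 75.3 | 92.5 | 83.5 | 97.7 | 56.4 | 90.4 | 71.5 | 92.4 | 73.8 | 94.0 |
| FreqNet[15] | 94.2 | 99.2 | 78.8 | 99.7 | 97.8 | 100.0 | 62.4 | 93.4 | 78.0 | 92.1 | 82.2 | 96.9 |
| PIDSNet | 99.7 | 100.0 | 79.3 | 99.9 | 98.0 | 99.8 | 99.4 | 100.0 | 99.8 | 100.0 | 95.4 | 98.2 |

Table 7 Evaluation on the Ojha dataset, 4-class training setting (%)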
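| Methods | Glide100_10 Acc. | Glide100_10 A.P. | Glide100_27 Acc. | Glide100_27 A.P. | Glide50_27 Acc. | Glide50_27 A.P. | LDM100 Acc. | LDM100 A.P. | LDM200 Acc. | LDM200 A.P. | LDM200_cfg Acc. | LDM200_cfg A.P. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| F3Net[16] | 88.3 | 95.4 | 87.0 | 94.5 | 88.5 | 95.4 | 74.1 | 84.0 | 73.4 | 83.3 | 80.7 | 89.1 |
| LGrad[13] | 89.4 | 94.9 | 87.4 | 93.2 | 90.7 | 95.1 | 94.8 | 99.2 | 94.2 | 99.1 | 95.9 | 99.2 |
| Ojha[11] | 90.1 | 97.0 | 90.7 | 97.2 | 91.1 | 97.4 | 90.5 | 97.0 | 90.2 | 97.1 | 77.3 | 88.6 |
| FreqNet[15] | 78.8 | 89.3 | 76.2 | 86.9 | 78.0 | 88.5 | 94.3 | 99.3 | 93.8 | 99.2 | 93.0 | 98.8 |
| PIDSNet | 96.5 | 99.5 | 95.1 | 99.1 | 95.9 | 99.2 | 96.5 | 99.6 | 96.7 | 99.5 | 96.0 | 99.4 |

Table 8 Evaluation on all datasets, 4-class training setting (%)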
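| Methods | Params | ForenSynths Acc. | ForenSynths A.P. | GANGen Acc. | GANGen A.P. | DiffusionForensics Acc. | DiffusionForensics A.P. | Ojha Acc. | Ojha A.P. | Mean Acc. | Mean A.P. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wang[26] | 18.6M | 73.5 | 93.5 | 62.3 | 82.9 | - | - | - | - | - | - |
| F3Net[16] | 48.9M | 91.0 | 96.7 | 75.4 | 93.1 | 89.2 | 99.3 | 82.0 | 90.3 | 82.9 | 94.4 |
| LGrad[13] | 46.6M | 91.6 | 96.8 | 68.6 | 80.8 | 90.6 | 99.6 | 92.1 | 96.8 | 83.2 | 91.6 |
| FreqNet[15] | 2.1M | 91.9 | 99.7 | 91.4 | 96.9 | 82.2 | 96.9 | 85.7 | 93.7 | 88.3 | 96.7 |
| PIDSNet | 2.4M | 96.1 | 99.7 | 95.5 | 98.2 | 95.4 | 98.2 | 96.1 | 99.4 | 95.7 | 98.7 |

Table 9 Ablation study on the ForenSynths dataset (%)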
| Modules used (MBFE, FDFE, PSFI, MHPSA) | mean Acc. |
| --- | --- |
| √ √ √ | 91.3 |
| √ √ √ | 90.9 |
| √ √ | 87.1 |
| √ √ √ | 92.7 |
| √ √ √ | 92.8 |
| √ √ √ √ | 95.2 |

Table 10 Ablation study on pyramid type (%)
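| Spatial domain | Frequency domain | mean Acc. |
| --- | --- | --- |
| | | 92.7 |
| Gaussian pyramid | | 93.1 |
| | Laplacian pyramid | 93.9 |
| Gaussian pyramid | Laplacian pyramid | 95.2 |

Table 11 Ablation study on the number of pyramid levels (%)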
| Number of pyramid levels | mean Acc. |
| --- | --- |
| 1 | 90.3 |
| 2 | 90.9 |
| 3 | 91.9 |
| 4 | 95.2 |
| 5 | 92.6 |
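Tables 10 and 11 ablate the pyramid type and depth used by PSFI. To make the Gaussian and Laplacian pyramids concrete, the following NumPy sketch builds both and fuses them level by level. It is a minimal sketch under assumptions the paper does not state. It uses a 5-tap binomial blur and factor-2 subsampling. As a stand-in for the "dynamic weight enhancement", it applies a per-level softmax over the mean absolute activations of the two maps. Following the Methods description, the Gaussian pyramid is built from a spatial-branch map and the Laplacian pyramid from a frequency-branch map. All function names are hypothetical, and `levels=4` simply mirrors the best setting in Table 11.

```python
import numpy as np

BINOMIAL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0  # 5-tap Gaussian approximation


def blur(x):
    """Separable 5-tap binomial low-pass filter with reflect padding (output same size as x)."""
    h, w = x.shape
    xp = np.pad(x, 2, mode="reflect")
    rows = sum(BINOMIAL[k] * xp[:, k:k + w] for k in range(5))    # filter along width
    return sum(BINOMIAL[k] * rows[k:k + h, :] for k in range(5))  # filter along height


def downsample(x):
    return blur(x)[::2, ::2]


def upsample(x, shape):
    """Zero-insertion upsampling to `shape`, then blur; the factor 4 restores the intensity scale."""
    up = np.zeros(shape)
    up[::2, ::2] = x
    return 4.0 * blur(up)


def gaussian_pyramid(x, levels):
    pyr = [x]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr


def laplacian_pyramid(x, levels):
    g = gaussian_pyramid(x, levels)
    # Each band-pass level keeps the detail lost between two Gaussian levels;
    # the last level keeps the coarsest low-pass image.
    return [g[i] - upsample(g[i + 1], g[i].shape) for i in range(levels - 1)] + [g[-1]]


def collapse(lap):
    """Invert the Laplacian pyramid by upsampling and adding bands from coarse to fine."""
    x = lap[-1]
    for band in reversed(lap[:-1]):
        x = band + upsample(x, band.shape)
    return x


def fuse_levels(spatial_feat, freq_feat, levels=4):
    """Hypothetical PSFI-style fusion: per level, softmax weights from the mean absolute
    activation of the Gaussian (spatial) and Laplacian (frequency-branch) maps."""
    fused = []
    for g, l in zip(gaussian_pyramid(spatial_feat, levels), laplacian_pyramid(freq_feat, levels)):
        scores = np.array([np.abs(g).mean(), np.abs(l).mean()])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        fused.append(w[0] * g + w[1] * l)
    return fused


rng = np.random.default_rng(0)
spatial = rng.standard_normal((64, 64))
freq = rng.standard_normal((64, 64))
fused = fuse_levels(spatial, freq, levels=4)
print([f.shape for f in fused])  # [(64, 64), (32, 32), (16, 16), (8, 8)]
```

Each Laplacian level is a band-pass residual (a Gaussian level minus the upsampled next-coarser level), so collapsing the pyramid recovers the input exactly. This lets the Laplacian side isolate high-frequency detail without discarding information, while the Gaussian side carries progressively coarser, more global content.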
[1] LIU Zhian, LI Maomao, ZHANG Yong, et al. Fine-grained face swapping via regional GAN inversion[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023: 8578–8587. doi: 10.1109/CVPR52729.2023.00829.
[2] ZHAO Wenliang, RAO Yongming, SHI Weikang, et al. DiffSwap: High-fidelity and controllable face swapping via 3D-aware masked diffusion[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023: 8568–8577. doi: 10.1109/CVPR52729.2023.00828.
[3] BALIAH S, LIN Qinliang, LIAO Shengcai, et al. Realistic and efficient face swapping: A unified approach with diffusion models[C]. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, USA, 2025: 1062–1071. doi: 10.1109/WACV61041.2025.00112.
[4] YUAN Shuaiwei, DONG Junyu, and LI Yuezun. Where the devil hides: Deepfake detectors can no longer be trusted[C]. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, USA, 2025: 8764–8774. doi: 10.1109/CVPR52734.2025.00819.
[5] HUANG Zhenglin, HU Jinwei, LI Xiangtai, et al. SIDA: Social media image deepfake detection, localization and explanation with large multimodal model[C]. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, USA, 2025: 28831–28841. doi: 10.1109/CVPR52734.2025.02685.
[6] DING Feng, KUANG Rensheng, ZHOU Yue, et al. A survey of deepfake and related digital forensics[J]. Journal of Image and Graphics, 2024, 29(2): 295–317. doi: 10.11834/jig.230088.
[7] CAO Junyi, MA Chao, YAO Taiping, et al. End-to-end reconstruction-classification learning for face forgery detection[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, USA, 2022: 4113–4122. doi: 10.1109/CVPR52688.2022.00408.
[8] SHIOHARA K and YAMASAKI T. Detecting deepfakes with self-blended images[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, USA, 2022: 18720–18729. doi: 10.1109/CVPR52688.2022.01816.
[9] ZHANG Jing, XU Pan, LIU Wenjun, et al. Negative instance generation for cross-domain facial forgery detection[J]. Journal of Image and Graphics, 2025, 30(2): 421–434. doi: 10.11834/jig.240160.
[10] YAN Zhiyuan, LUO Yuhao, LYU Siwei, et al. Transcending forgery specificity with latent space augmentation for generalizable deepfake detection[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, 2024: 8984–8994. doi: 10.1109/CVPR52733.2024.00858.
[11] OJHA U, LI Yuheng, and LEE Y J. Towards universal fake image detectors that generalize across generative models[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023: 24480–24489. doi: 10.1109/CVPR52729.2023.02345.
[12] KASHIANI H, TALEMI N A, and AFGHAH F. FreqDebias: Towards generalizable deepfake detection via consistency-driven frequency debiasing[C]. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, USA, 2025: 8775–8785. doi: 10.1109/CVPR52734.2025.00820.
[13] TAN Chuangchuang, ZHAO Yao, WEI Shikui, et al. Learning on gradients: Generalized artifacts representation for GAN-generated images detection[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023: 12105–12114. doi: 10.1109/CVPR52729.2023.01165.
[14] TAN Chuangchuang, LIU Huan, ZHAO Yao, et al. Rethinking the up-sampling operations in CNN-based generative network for generalizable deepfake detection[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, 2024: 28130–28139. doi: 10.1109/CVPR52733.2024.02657.
[15] TAN Chuangchuang, ZHAO Yao, WEI Shikui, et al. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning[C]. Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, 2024: 5052–5060. doi: 10.1609/aaai.v38i5.28310.
[16] QIAN Yuyang, YIN Guojun, SHENG Lu, et al. Thinking in frequency: Face forgery detection by mining frequency-aware clues[C]. Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 2020: 86–103. doi: 10.1007/978-3-030-58610-2_6.
[17] BINH L M and WOO S. ADD: Frequency attention and multi-view based knowledge distillation to detect low-quality compressed deepfake images[C]. Proceedings of the 36th AAAI Conference on Artificial Intelligence, Washington, USA, 2022: 122–130. doi: 10.1609/aaai.v36i1.19886.
[18] WANG Bo, WU Xiaohan, WANG Fei, et al. Spatial-frequency feature fusion based deepfake detection through knowledge distillation[J]. Engineering Applications of Artificial Intelligence, 2024, 133: 108341. doi: 10.1016/j.engappai.2024.108341.
[19] SUN Lei, ZHANG Hongmeng, MAO Xiuqing, et al. Super-resolution reconstruction detection method for deepfake hard compressed videos[J]. Journal of Electronics & Information Technology, 2021, 43(10): 2967–2975. doi: 10.11999/JEIT200531.
[20] WANG Yan, SUN Qindong, RONG Dongzhu, et al. Deepfake video detection on social networks using multi-domain aware driven by common mechanism analysis between artifacts[J]. Journal of Electronics & Information Technology, 2024, 46(9): 3713–3721. doi: 10.11999/JEIT240025.
[21] WANG Zhendong, BAO Jianmin, ZHOU Wengang, et al. DIRE for diffusion-generated image detection[C]. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023: 22445–22455. doi: 10.1109/ICCV51070.2023.02051.
[22] HOODA A, MANGAOKAR N, FENG R, et al. D4: Detection of adversarial diffusion deepfakes using disjoint ensembles[C]. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, USA, 2024: 3812–3822. doi: 10.1109/WACV57701.2024.00377.
[23] LIU Baoping, LIU Bo, DING Ming, et al. Detection of diffusion model-generated faces by assessing smoothness and noise tolerance[C]. 2024 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), Toronto, Canada, 2024: 1–6. doi: 10.1109/BMSB62888.2024.10608232.
[24] ZHANG Hu, ZU Keke, LU Jian, et al. EPSANet: An efficient pyramid squeeze attention block on convolutional neural network[C]. Proceedings of the 16th Asian Conference on Computer Vision, Macao, China, 2023: 1161–1177. doi: 10.1007/978-3-031-26313-2_33.
[25] COOLEY J W, LEWIS P A W, and WELCH P D. The fast Fourier transform and its applications[J]. IEEE Transactions on Education, 1969, 12(1): 27–34. doi: 10.1109/TE.1969.4320436.
[26] WANG Shengyu, WANG O, ZHANG R, et al. CNN-generated images are surprisingly easy to spot... for now[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, 2020: 8695–8704. doi: 10.1109/CVPR42600.2020.00872.
[27] FRANK J, EISENHOFER T, SCHÖNHERR L, et al. Leveraging frequency analysis for deep fake image recognition[C]. Proceedings of the 37th International Conference on Machine Learning, 2020: 3247–3258.
[28] JEONG Y, KIM D, MIN S, et al. BiHPF: Bilateral high-pass filters for robust deepfake detection[C]. 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, USA, 2022: 48–57. doi: 10.1109/WACV51458.2022.00293.
[29] JEONG Y, KIM D, RO Y, et al. FrePGAN: Robust deepfake detection using frequency-level perturbations[C]. Proceedings of the 36th AAAI Conference on Artificial Intelligence, Washington, USA, 2022: 1060–1068. doi: 10.1609/aaai.v36i1.19990.