Image Deraining Driven by CLIP Visual Embedding
Abstract: Image deraining is a fundamental task in computer vision. Existing methods rely heavily on assumed rain models or synthetic rain datasets, which limits their generalization to real-world scenes. This paper observes that the image encoder of the Contrastive Language-Image Pre-training (CLIP) model is robust to rain-streak interference, reformulates deraining as a pixel-level regression problem guided by visual semantics, and proposes FCLIP-UNet, an image deraining model built on a Frozen CLIP (FCLIP) strategy. The model adopts a symmetric encoder-decoder (U-Net) structure: the encoder truncates the first four stages of the CLIP-RN50 image encoder to automatically decouple rain streaks from image-content semantics; the decoder combines ConvNeXt-T blocks in series with an Upsampling Depthwise Convolution Block (UpDWBlock) and embeds a layer-wise differentiated feature perturbation strategy into the skip connections, jointly enhancing detail restoration under high-level semantic guidance and generalization. Quantitative and qualitative experiments show that FCLIP-UNet achieves the best or competitive performance on public synthetic and real rain datasets, and generalizes well across multiple independent datasets containing real rain images.
Keywords:
- Image deraining
- Contrastive Language-Image Pre-training
- Convolutional neural network
- U-Net architecture
- Generalization
Abstract: Objective Rain streaks introduce visual distortions that degrade image quality and significantly impair downstream vision tasks such as feature extraction and object detection. This work addresses single-image rain streak removal. Existing methods often rely heavily on restrictive priors or synthetic datasets; this dependence limits robustness and generalization because such data differ from complex, unstructured real-world scenes. Contrastive Language-Image Pre-training (CLIP) demonstrates strong zero-shot generalization through large-scale image-text contrastive learning. Motivated by this property, this study proposes FCLIP-UNet, a visual-semantic-driven deraining architecture designed to improve rain removal and generalization in real-world rainy environments.

Methods FCLIP-UNet adopts a U-Net encoder-decoder architecture and formulates deraining as pixel-level detail regression guided by high-level semantic features. During encoding, textual queries are omitted; instead, the first four layers of a frozen CLIP-RN50 are employed to extract robust features that are decoupled from the rain distribution. These features exploit the semantic representation capability of CLIP to suppress diverse rain patterns. To guide accurate image restoration, the decoder integrates ConvNeXt-T blocks with an Upsampling Depthwise Convolution Block (UpDWBlock). ConvNeXt-T replaces conventional convolution modules to expand the receptive field and capture global contextual information, parsing rain streak patterns with the semantic priors extracted by the encoder. Under the constraint of these priors, UpDWBlock reduces information loss during upsampling and reconstructs fine-grained image details, while multi-level skip connections compensate for information lost during encoding. In addition, a Layer-wise Differentiated Feature Perturbation Strategy (LDFPS) is incorporated to enhance robustness and adaptability in complex real-world rainy scenes.

Results and Discussions Comprehensive evaluations are conducted on the Rain13K composite dataset against ten state-of-the-art deraining algorithms. FCLIP-UNet performs consistently well across all five Rain13K test subsets. In particular, it outperforms the second-best method on two subsets: on Test100 by 0.32 dB in Peak Signal-to-Noise Ratio (PSNR) and 0.006 in Structural Similarity Index Measure (SSIM), and on Test2800 by 0.14 dB and 0.002, respectively. On Rain100H and Rain100L, FCLIP-UNet achieves competitive results, including the best SSIM on Rain100H (Table 3). To evaluate generalization, the Rain13K-pretrained FCLIP-UNet is further tested on three datasets with different rainfall distribution characteristics: SPA-Data, HQ-RAIN, and MPID (Table 4, Fig. 7). Qualitative and quantitative evaluations are also conducted on the real-world NTURain-R dataset (Table 5, Figs. 8-10). These results consistently demonstrate the strong generalization capability of FCLIP-UNet. Ablation experiments on Rain100H validate the encoder design and confirm the effectiveness of both UpDWBlock and LDFPS (Tables 6-8). Additional ablations show that LDFPS, combined with a 1:1 weighting between the L1 loss and the perceptual loss, yields the best performance (Tables 9-11).

Conclusions This study proposes FCLIP-UNet, a deraining network designed for real-world generalization by leveraging the CLIP paradigm. Three main contributions are presented. First, image deraining is formulated as a pixel-level regression task that reconstructs rain-free images from high-level semantic features; a frozen CLIP image encoder extracts representations that remain stable across different rain distributions, reducing domain shifts caused by diverse rain models. Second, a decoder integrating ConvNeXt-T with UpDWBlock is designed, and an LDFPS is proposed to improve robustness to unseen rain distributions. Third, a composite loss function jointly optimizes pixel-level accuracy and perceptual consistency. Experiments on both synthetic and real-world rainy datasets show that FCLIP-UNet effectively removes rain streaks, preserves fine image details, and achieves strong deraining performance with reliable generalization.
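The composite objective described above, a pixel-level L1 term plus a perceptual term weighted 1:1, can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the function names are illustrative, and the perceptual term here is a stand-in computed on precomputed feature maps, since the exact feature extractor used for the perceptual loss is not specified in this abstract.

```python
import numpy as np

def l1_loss(pred, target):
    # Pixel-level L1 distance, averaged over all elements.
    return np.mean(np.abs(pred - target))

def perceptual_loss(pred_feats, target_feats):
    # L1 distance in a feature space (stand-in for a pretrained-network
    # perceptual metric); averages over a list of feature maps.
    return np.mean([np.mean(np.abs(p - t))
                    for p, t in zip(pred_feats, target_feats)])

def composite_loss(pred, target, pred_feats, target_feats, lambda_p=1.0):
    # Total objective: L = L1 + lambda_p * Lp. The reported ablation
    # finds lambda_p = 1 (a 1:1 weighting) to perform best.
    return l1_loss(pred, target) + lambda_p * perceptual_loss(
        pred_feats, target_feats)
```

With `lambda_p = 1` the two terms contribute equally, matching the best-performing setting in the λp ablation (Table 10).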
Table 1  Matching between images and fine-grained text labels on the Test1200 dataset

| Image type | light rain | moderate rain | heavy rain |
|---|---|---|---|
| Low-density rain-streak images | 196 | 97 | 107 |
| Medium-density rain-streak images | 68 | 188 | 144 |
| High-density rain-streak images | 95 | 157 | 148 |

Table 2  Composition of the Rain13K dataset
| | Rain800 | Rain100H | Rain100L | Rain14000 | Rain1200 | Rain12 | Total |
|---|---|---|---|---|---|---|---|
| Training samples | 700 | 1800 | 0 | 11200 | 0 | 12 | 13712 |
| Test samples | 100 | 100 | 100 | 2800 | 1200 | 0 | 4300 |
| Test set name | Test100 | Rain100H | Rain100L | Test2800 | Test1200 | - | - |

Table 3  Comparison results on the Rain13K dataset
| Algorithm | Test100 PSNR | Test100 SSIM | Rain100H PSNR | Rain100H SSIM | Rain100L PSNR | Rain100L SSIM | Test2800 PSNR | Test2800 SSIM | Test1200 PSNR | Test1200 SSIM | Avg. PSNR | Avg. SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DerainNet[4] | 22.77 | 0.810 | 14.92 | 0.592 | 27.03 | 0.884 | 24.31 | 0.861 | 23.38 | 0.835 | 22.48 | 0.796 |
| DID-MDN[24] | 22.56 | 0.818 | 17.35 | 0.524 | 25.23 | 0.741 | 28.13 | 0.867 | 29.95 | 0.901 | 24.58 | 0.770 |
| RESCAN[26] | 25.00 | 0.835 | 26.36 | 0.785 | 29.80 | 0.881 | 31.29 | 0.904 | 30.51 | 0.882 | 28.59 | 0.857 |
| MSPFN[27] | 27.50 | 0.876 | 28.66 | 0.860 | 32.40 | 0.933 | 32.82 | 0.930 | 32.39 | 0.916 | 30.75 | 0.903 |
| MPRNet[23] | 30.27 | 0.897 | 30.41 | 0.890 | 36.40 | 0.965 | 33.64 | 0.938 | 32.91 | 0.916 | 32.73 | 0.921 |
| Uformer-B[28] | 29.90 | 0.906 | 30.31 | 0.900 | 36.86 | 0.972 | 33.53 | 0.939 | 29.45 | 0.903 | 32.01 | 0.924 |
| IDT[11] | 29.69 | 0.905 | 29.95 | 0.898 | 37.01 | 0.971 | 33.38 | 0.937 | 31.38 | 0.908 | 32.28 | 0.924 |
| DCTR[1] | 30.91 | 0.912 | 30.74 | 0.892 | 38.19 | 0.974 | 33.89 | 0.941 | 33.57 | 0.926 | 33.46 | 0.929 |
| AFENet[29] | 30.51 | 0.918 | 31.22 | 0.901 | 37.66 | 0.978 | 33.13 | 0.925 | 33.82 | 0.944 | 33.27 | 0.933 |
| DPCNet[30] | 30.59 | 0.914 | 30.73 | 0.899 | 37.96 | 0.974 | 33.23 | 0.928 | 33.87 | 0.941 | 33.28 | 0.931 |
| FCLIP-UNet (ours) | 31.23 | 0.924 | 30.82 | 0.903 | 38.06 | 0.972 | 34.03 | 0.943 | 33.27 | 0.928 | 33.48 | 0.934 |

Note: In the paper's tables, bold marks the best value in each column and underline the second best (the same convention applies to the other tables).

Table 4  Cross-dataset comparison results
| Algorithm | SPA-Data PSNR | SPA-Data SSIM | HQ-RAIN PSNR | HQ-RAIN SSIM | MPID PSNR | MPID SSIM |
|---|---|---|---|---|---|---|
| DID-MDN | 31.12 | 0.937 | 23.62 | 0.640 | 23.09 | 0.794 |
| RESCAN | 34.57 | 0.958 | 23.79 | 0.519 | 26.74 | 0.823 |
| MSPFN | 34.55 | 0.961 | 23.99 | 0.572 | 27.48 | 0.849 |
| MPRNet | 35.16 | 0.954 | 26.36 | 0.681 | 31.53 | 0.896 |
| Uformer-B | 35.03 | 0.948 | 26.67 | 0.685 | 31.47 | 0.893 |
| IDT | 35.61 | 0.957 | 26.88 | 0.679 | 31.63 | 0.899 |
| DCTR | 35.87 | 0.963 | 27.33 | 0.684 | 31.81 | 0.905 |
| DPCNet | 35.64 | 0.958 | 28.56 | 0.786 | 31.59 | 0.894 |
| FCLIP-UNet (ours) | 36.39 | 0.967 | 30.36 | 0.858 | 31.96 | 0.913 |

Table 5  No-reference image quality metrics on NTURain-R
| Metric | Input | DerainNet | DID-MDN | RESCAN | MSPFN | MPRNet | Uformer-B | IDT | DCTR | DPCNet | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NIQE | 6.211 | 6.432 | 5.988 | 5.745 | 4.873 | 4.834 | 4.473 | 4.352 | 4.533 | 4.378 | 4.286 |
| BRISQUE | 33.156 | 31.167 | 30.896 | 31.766 | 29.866 | 30.651 | 28.768 | 27.378 | 26.245 | 26.509 | 25.676 |

Table 6  Ablation results of different CLIP encoders on Rain100H
| Encoder | PSNR (dB) | SSIM |
|---|---|---|
| ResNet50 | 16.53 | 0.546 |
| CLIP-ViT-B/32 | 29.78 | 0.878 |
| CLIP-ViT-B/16 | 30.25 | 0.885 |
| CLIP-RN50 | 30.82 | 0.903 |

Table 7  FLOPs and inference speed of different CLIP encoders
| CLIP encoder | RN50 | ViT-B/32 | ViT-B/16 |
|---|---|---|---|
| FLOPs | 2.36×10^10 | 2.43×10^11 | 9.76×10^11 |
| Inference speed (s/frame) | 0.23 | 0.56 | 1.06 |

Table 8  Ablation results on Rain100H
| Network | UpDWBlock | LDFPS | PSNR (dB) | SSIM |
|---|---|---|---|---|
| N1 | - | - | 30.02 | 0.879 |
| N2 | ✓ | - | 30.42 | 0.893 |
| N3 | - | ✓ | 30.29 | 0.887 |
| N4 | ✓ | ✓ | 30.82 | 0.903 |

Table 9  Ablation of different loss functions
| Loss function | Lmse | L1 | Lmse+Lp | L1+Lp |
|---|---|---|---|---|
| PSNR (dB) | 29.44 | 29.76 | 29.39 | 30.82 |
| SSIM | 0.880 | 0.892 | 0.885 | 0.903 |

Table 10  Ablation of different λp values
| λp | 0.1 | 1 | 10 |
|---|---|---|---|
| PSNR (dB) | 29.51 | 30.82 | 29.66 |
| SSIM | 0.884 | 0.903 | 0.887 |

Table 11  Ablation of perturbation strategies with different strengths (σi)
| Perturbation strategy | PSNR (dB) | SSIM |
|---|---|---|
| Uniform low strength (σ1=σ2=σ3=σ4=0.01) | 30.63 | 0.892 |
| Uniform high strength (σ1=σ2=σ3=σ4=0.1) | 30.58 | 0.887 |
| Strength decreasing with depth (σ1=0.1, σ2=0.06, σ3=0.03, σ4=0.01) | 30.11 | 0.880 |
| Strength increasing with depth (σ1=0.01, σ2=0.03, σ3=0.06, σ4=0.1) | 30.82 | 0.903 |
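The best-performing strategy above perturbs skip-connection features with zero-mean Gaussian noise whose strength increases with layer depth, so shallow detail-carrying features are disturbed least and deep semantic features most. Below is a minimal NumPy sketch of this layer-wise perturbation: the function name and the list-of-feature-maps interface are illustrative assumptions; only the σ schedule (0.01, 0.03, 0.06, 0.1) comes from the table, and the noise is applied only during training.

```python
import numpy as np

# Per-layer noise strengths, ordered shallow -> deep. The ablation finds
# strength increasing with depth to perform best.
SIGMAS = (0.01, 0.03, 0.06, 0.1)

def perturb_skip_features(features, sigmas=SIGMAS, training=True, rng=None):
    """Add zero-mean Gaussian noise to each skip-connection feature map.

    features: list of arrays, one per skip connection, shallow to deep.
    The perturbation is a training-time regularizer; at inference the
    features are returned unchanged.
    """
    if not training:
        return features
    rng = np.random.default_rng() if rng is None else rng
    return [f + rng.normal(0.0, sigma, size=f.shape)
            for f, sigma in zip(features, sigmas)]
```

Disabling the noise at inference keeps the restoration deterministic while still benefiting from the robustness learned under perturbation.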
[1] LI Yufeng, LU Jiyang, CHEN Hongming, et al. Dilated convolutional transformer for high-quality image deraining[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, Canada, 2023: 4199–4207. doi: 10.1109/CVPRW59228.2023.00442.
[2] KANG Liwei, LIN C W, and FU Y H. Automatic single-image-based rain streaks removal via image decomposition[J]. IEEE Transactions on Image Processing, 2012, 21(4): 1742–1755. doi: 10.1109/TIP.2011.2179057.
[3] ZHU Lei, FU C W, LISCHINSKI D, et al. Joint bi-layer optimization for single-image rain streak removal[C]. The IEEE International Conference on Computer Vision, Venice, Italy, 2017: 2545–2553. doi: 10.1109/ICCV.2017.276.
[4] FU Xueyang, HUANG Jiabin, DING Xinghao, et al. Clearing the skies: A deep network architecture for single-image rain removal[J]. IEEE Transactions on Image Processing, 2017, 26(6): 2944–2956. doi: 10.1109/TIP.2017.2691802.
[5] FU Xueyang, HUANG Jiabin, ZENG Delu, et al. Removing rain from single images via a deep detail network[C]. The IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 1715–1723. doi: 10.1109/CVPR.2017.186.
[6] MEI Tiancan, CAO Min, YANG Hong, et al. Two-stage rain image removal based on density guidance[J]. Journal of Electronics & Information Technology, 2023, 45(4): 1383–1390. doi: 10.11999/JEIT220157.
[7] REN Dongwei, ZUO Wangmeng, HU Qinghua, et al. Progressive image deraining networks: A better and simpler baseline[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 3932–3941. doi: 10.1109/CVPR.2019.00406.
[8] WEI Wei, MENG Deyu, ZHAO Qian, et al. Semi-supervised transfer learning for image rain removal[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 3872–3881. doi: 10.1109/CVPR.2019.00400.
[9] YASARLA R, SINDAGI V A, and PATEL V M. Syn2Real transfer learning for image deraining using Gaussian processes[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 2723–2733. doi: 10.1109/CVPR42600.2020.00280.
[10] JIANG Kui, WANG Zhongyuan, CHEN Chen, et al. Magic ELF: Image deraining meets association learning and transformer[C]. The 30th ACM International Conference on Multimedia, Lisboa, Portugal, 2022: 827–836. doi: 10.1145/3503161.3547760.
[11] XIAO Jie, FU Xueyang, LIU Aiping, et al. Image de-raining transformer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(11): 12978–12995. doi: 10.1109/TPAMI.2022.3183612.
[12] CUI Yuning, REN Wenqi, CAO Xiaochun, et al. Revitalizing convolutional network for image restoration[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(12): 9423–9438. doi: 10.1109/TPAMI.2024.3419007.
[13] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]. The 38th International Conference on Machine Learning, 2021: 8748–8763.
[14] MA Wenxin, ZHANG Xu, YAO Qingsong, et al. AA-CLIP: Enhancing zero-shot anomaly detection via anomaly-aware CLIP[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2025: 4744–4754. doi: 10.1109/CVPR52734.2025.00447.
[15] SUN Zeyi, FANG Ye, WU Tong, et al. Alpha-CLIP: A CLIP model focusing on wherever you want[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 13019–13029. doi: 10.1109/CVPR52733.2024.01237.
[16] WANG Mengmeng, XING Jiazheng, JIANG Boyuan, et al. A multimodal, multi-task adapting framework for video action recognition[C]. The 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, 2024: 5517–5525. doi: 10.1609/aaai.v38i6.28361.
[17] LUO Ziwei, GUSTAFSSON F K, ZHAO Zheng, et al. Controlling vision-language models for multi-task image restoration[C]. The 12th International Conference on Learning Representations, Vienna, Austria, 2024.
[18] LIN Jingbo, ZHANG Zhilu, WEI Yuxiang, et al. Improving image restoration through removing degradations in textual representations[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 2866–2878. doi: 10.1109/CVPR52733.2024.00277.
[19] WEN Yuanbo, GAO Tao, AN Yisheng, et al. Weather-degraded image restoration based on visual prompt learning[J]. Chinese Journal of Computers, 2024, 47(10): 2401–2416. doi: 10.11897/SP.J.1016.2024.02401.
[20] WANG Ruiyi, LI Wenhao, LIU Xiaohong, et al. HazeCLIP: Towards language guided real-world image dehazing[C]. 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025: 1–5. doi: 10.1109/ICASSP49660.2025.10889509.
[21] CHENG Jun, LIANG Dong, and TAN Shan. Transfer CLIP for generalizable image denoising[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 25974–25984. doi: 10.1109/CVPR52733.2024.02454.
[22] LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 11966–11976. doi: 10.1109/CVPR52688.2022.01167.
[23] ZAMIR S W, ARORA A, KHAN S, et al. Multi-stage progressive image restoration[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 14816–14826. doi: 10.1109/CVPR46437.2021.01458.
[24] ZHANG He and PATEL V M. Density-aware single image de-raining using a multi-stream dense network[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 695–704. doi: 10.1109/CVPR.2018.00079.
[25] ZHOU Tianfei, YUAN Ye, WANG Binglu, et al. Federated feature augmentation and alignment[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. doi: 10.1109/TPAMI.2024.3457751.
[26] LI Xia, WU Jianlong, LIN Zhouchen, et al. Recurrent squeeze-and-excitation context aggregation net for single image deraining[C]. The 15th European Conference on Computer Vision (ECCV), Munich, Germany, 2018: 262–277. doi: 10.1007/978-3-030-01234-2_16.
[27] JIANG Kui, WANG Zhongyuan, YI Peng, et al. Multi-scale progressive fusion network for single image deraining[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 8346–8355. doi: 10.1109/CVPR42600.2020.00837.
[28] WANG Zhendong, CUN Xiaodong, BAO Jianmin, et al. Uformer: A general U-shaped transformer for image restoration[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 17662–17672. doi: 10.1109/CVPR52688.2022.01716.
[29] YAN Fei, HE Yuhong, CHEN Keyu, et al. Adaptive frequency enhancement network for single image deraining[C]. 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Kuching, Malaysia, 2024: 4534–4541. doi: 10.1109/SMC54092.2024.10831025.
[30] HE Yuhong, JIANG Aiwen, JIANG Lingfang, et al. Dual-path coupled image deraining network via spatial-frequency interaction[C]. 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 2024: 1452–1458. doi: 10.1109/ICIP51287.2024.10647753.