{"id":943,"date":"2023-11-10T09:11:22","date_gmt":"2023-11-10T01:11:22","guid":{"rendered":"http:\/\/www.tamanegi.xyz\/?p=943"},"modified":"2023-11-10T09:11:22","modified_gmt":"2023-11-10T01:11:22","slug":"modality-synergy-complement-learning-with-cascaded-aggregation-for-visible-infrared-person-re-identification","status":"publish","type":"post","link":"http:\/\/tamanegi.xyz\/?p=943","title":{"rendered":"Modality Synergy Complement Learning with Cascaded Aggregation for Visible-Infrared Person Re-Identification"},"content":{"rendered":"<h1>Abstract.<\/h1>\n<p>Visible-Infrared Re-Identification (VI-ReID) is challenging in image retrievals. The modality discrepancy will easily make huge intraclass variations. Most existing methods either bridge different modalities through modality-invariance or generate the intermediate modality for better performance. Differently, this paper proposes a novel framework, named Modality Synergy Complement Learning Network (MSCLNet) with Cascaded Aggregation. Its basic idea is to synergize two modalities to construct diverse representations of identity-discriminative semantics and less noise. Then, we complement synergistic representations under the advantages of the two modalities. Furthermore, we propose the Cascaded Aggregation strategy for fine-grained optimization of the feature distribution, which progressively aggregates feature embeddings from the subclass, intra-class, and inter-class. Extensive experiments on SYSU-MM01 and RegDB datasets show that MSCLNet outperforms the state-of-the-art by a large margin. On the large-scale SYSU-MM01 dataset, our model can achieve 76.99% and 71.64% in terms of Rank-1 accuracy and mAP value.<\/p>\n<p>\u53ef\u89c1\u7ea2\u5916\u91cd\u8bc6\u522b\uff08VI-ReID\uff09\u5728\u56fe\u50cf\u68c0\u7d22\u4e2d\u5177\u6709\u6311\u6218\u6027\u3002\u4e0d\u540c\u6a21\u6001\u4e4b\u95f4\u7684\u5dee\u5f02\u5f88\u5bb9\u6613\u4ea7\u751f\u5927\u7684\u7c7b\u5185\u53d8\u5316\u3002\u5927\u591a\u6570\u73b0\u6709\u65b9\u6cd5\u8981\u4e48\u901a\u8fc7\u6a21\u6001\u4e0d\u53d8\u6027\u6765\u8fde\u63a5\u4e0d\u540c\u6a21\u6001\uff0c\u8981\u4e48\u751f\u6210\u4e2d\u95f4\u6a21\u6001\u4ee5\u83b7\u5f97\u66f4\u597d\u7684\u6027\u80fd\u3002\u4e0e\u4e4b\u4e0d\u540c\u7684\u662f\uff0c\u672c\u6587\u63d0\u51fa\u4e86\u4e00\u79cd\u65b0\u7684\u6846\u67b6\uff0c\u540d\u4e3aModality Synergy Complement Learning Network\uff08MSCLNet\uff09\u4e0eCascaded Aggregation\u3002\u5176\u57fa\u672c\u601d\u60f3\u662f\u534f\u540c\u4f7f\u7528\u4e24\u79cd\u6a21\u6001\u6784\u5efa\u8eab\u4efd\u5224\u522b\u8bed\u4e49\u548c\u8f83\u5c11\u566a\u58f0\u7684\u591a\u6837\u5316\u8868\u8fbe\u3002\u7136\u540e\uff0c\u6211\u4eec\u5728\u4e24\u79cd\u6a21\u6001\u7684\u4f18\u52bf\u4e0b\u8865\u5145\u534f\u540c\u8868\u8fbe\u3002\u6b64\u5916\uff0c\u6211\u4eec\u63d0\u51fa\u4e86\u7ea7\u8054\u805a\u5408\u7b56\u7565\uff0c\u7528\u4e8e\u5bf9\u7279\u5f81\u5206\u5e03\u8fdb\u884c\u7ec6\u7c92\u5ea6\u4f18\u5316\uff0c\u8be5\u7b56\u7565\u9010\u6b65\u805a\u5408\u4e86\u5b50\u7c7b\u3001\u7c7b\u5185\u548c\u7c7b\u95f4\u7684\u7279\u5f81\u5d4c\u5165\u3002SYSU-MM01\u548cRegDB\u6570\u636e\u96c6\u4e0a\u7684\u5927\u91cf\u5b9e\u9a8c\u8868\u660e\uff0cMSCLNet\u5728\u6027\u80fd\u4e0a\u8fdc\u8fdc\u8d85\u8fc7\u4e86\u73b0\u6709\u6280\u672f\u6c34\u5e73\u3002\u5728\u5927\u89c4\u6a21SYSU-MM01\u6570\u636e\u96c6\u4e0a\uff0c\u6211\u4eec\u7684\u6a21\u578b\u5728Rank-1\u51c6\u786e\u7387\u548cmAP\u503c\u65b9\u9762\u5206\u522b\u8fbe\u5230\u4e8676.99\uff05\u548c71.64\uff05\u3002<\/p>\n<h1>Introduction<\/h1>\n<p>In conclusion, the main contributions of 
our work can be summarized as follows: We propose a novel framework named Modality Synergy Complement Learning Network (MSCLNet) with Cascaded Aggregation for VI-ReID. To fetch more discriminative semantics, it learns enhanced feature representations by diverse semantics and specific advantages of visible and infrared modalities. And we propose a Modality Synergy module (MS) which innovatively mines the modality-specific diverse semantics and a Modality Complement module (MC) which further enhances the feature representations by two parallel guidances of modality-specific advantages. They provide a reference for further high-level identity representation. Then we design a Cascaded Aggregation strategy (CA) to optimize the distribution of feature embeddings on a fine-grained level. It progressively aggregates the overall instances in a cascaded manner and enhances the discrimination of identities. Extensive experimental results show that our proposed framework outperforms the state-of-the-art methods by a large margin on two mainstream benchmarks of VI-ReID.<\/p>\n<p>\u603b\u4e4b\uff0c\u6211\u4eec\u7684\u5de5\u4f5c\u7684\u4e3b\u8981\u8d21\u732e\u53ef\u4ee5\u5f52\u7eb3\u5982\u4e0b\uff1a\u6211\u4eec\u63d0\u51fa\u4e86\u4e00\u4e2a\u540d\u4e3a\u6a21\u6001\u534f\u540c\u8865\u5145\u5b66\u4e60\u7f51\u7edc\uff08MSCLNet\uff09\u7684\u65b0\u9896\u6846\u67b6\uff0c\u901a\u8fc7\u4f20\u64ad\u805a\u5408\u6765\u5b9e\u73b0VI-ReID\u3002\u4e3a\u4e86\u83b7\u53d6\u66f4\u5177\u533a\u5206\u5ea6\u7684\u8bed\u4e49\u4fe1\u606f\uff0c\u5b83\u901a\u8fc7\u4e0d\u540c\u7684\u8bed\u4e49\u548c\u53ef\u89c1\u5149\u4e0e\u7ea2\u5916\u5149\u6a21\u6001\u7684\u7279\u5b9a\u4f18\u52bf\u6765\u5b66\u4e60\u589e\u5f3a\u7279\u5f81\u8868\u793a\u3002\u6211\u4eec\u63d0\u51fa\u4e86\u6a21\u6001\u534f\u540c\u6a21\u5757\uff08MS\uff09\uff0c\u521b\u65b0\u5730\u6316\u6398\u6a21\u6001\u7279\u5b9a\u7684\u4e0d\u540c\u8bed\u4e49\uff0c\u4ee5\u53ca\u6a21\u6001\u8865\u5145\u6a21\u5757\uff08MC\uff09\uff0c\u901a\u8fc7\u4e24\u4e2a\u6a21\u6001\u7279\u5b9a\u4f18\u52bf\u7684\u5e73\u884c\u5f15\u5bfc\u8fdb\u4e00\u6b65\u589e\u5f3a\u7279\u5f81\u8868\u793a\u3002\u5b83\u4eec\u4e3a\u8fdb\u4e00\u6b65\u7684\u9ad8\u7ea7\u8eab\u4efd\u8868\u793a\u63d0\u4f9b\u4e86\u53c2\u8003\u3002\u7136\u540e\uff0c\u6211\u4eec\u8bbe\u8ba1\u4e86\u7ea7\u8054\u805a\u5408\u7b56\u7565\uff08CA\uff09\uff0c\u5728\u7ec6\u7c92\u5ea6\u7ea7\u522b\u4e0a\u4f18\u5316\u7279\u5f81\u5d4c\u5165\u7684\u5206\u5e03\u3002\u5b83\u4ee5\u7ea7\u8054\u7684\u65b9\u5f0f\u9010\u6e10\u805a\u5408\u6240\u6709\u5b9e\u4f8b\uff0c\u5e76\u589e\u5f3a\u8eab\u4efd\u7684\u533a\u5206\u5ea6\u3002\u5e7f\u6cdb\u7684\u5b9e\u9a8c\u7ed3\u679c\u8868\u660e\uff0c\u6211\u4eec\u63d0\u51fa\u7684\u6846\u67b6\u5728\u4e24\u4e2a\u4e3b\u6d41\u7684VI-ReID\u57fa\u51c6\u4e0a\u5927\u5e45\u4f18\u4e8e\u73b0\u6709\u7684\u65b9\u6cd5\u3002<\/p>\n<h1>Related Work<\/h1>\n<h2>Single-Modality Person Re-Identification<\/h2>\n<p>retrieves pedestrians in the set of visible images. Visible person ReID is a reliable technique which plays an important role in daily life. These methods mainly solved the single-modal ReID problem via ranking [2,29], local and global attention [38,57], camera style [3,55,59], person key-points [36], siamese network [58], similarity graph [22], network architecture searching [18], .etc. Some works attempted domain adaptation [8,59]. And Some research dealt with the misalignment of human parts, such as cascaded convolutional module [39], refined part pooling [34], transformer [19] and so on. 
Beside, single-modality person re-identification contains several subdivided areas, for example, video person re-identification [26,44,60], unsupervised person re-identification which tackles pseudo labels [46,54], unsupervised domain adaption [1,31] and generalized person re-identification [16]. Due to the tremendous discrepancy between visible and infrared images, single-modal solutions are not suitable for cross-modality person re-identification, which creates a demand for the development of VI-ReID solutions.<\/p>\n<p>\u5728\u53ef\u89c1\u56fe\u50cf\u96c6\u4e2d\u68c0\u7d22\u884c\u4eba\u3002\u53ef\u89c1\u4eba\u5458\u91cd\u65b0\u8bc6\u522b\u662f\u4e00\u79cd\u53ef\u9760\u7684\u6280\u672f\uff0c\u5728\u65e5\u5e38\u751f\u6d3b\u4e2d\u8d77\u7740\u91cd\u8981\u4f5c\u7528\u3002\u8fd9\u4e9b\u65b9\u6cd5\u4e3b\u8981\u901a\u8fc7\u6392\u540d[2,29]\uff0c\u5c40\u90e8\u548c\u5168\u5c40\u6ce8\u610f\u529b[38,57]\uff0c\u6444\u50cf\u673a\u98ce\u683c[3,55,59]\uff0c\u4eba\u5458\u5173\u952e\u70b9[36]\uff0c\u8fde\u4f53\u7f51\u7edc[58]\uff0c\u76f8\u4f3c\u6027\u56fe[22]\uff0c\u7f51\u7edc\u67b6\u6784\u641c\u7d22[18]\u7b49\u6765\u89e3\u51b3\u5355\u6a21\u6001\u91cd\u65b0\u8bc6\u522b\u95ee\u9898\u3002\u4e00\u4e9b\u7814\u7a76\u5c1d\u8bd5\u8fdb\u884c\u57df\u9002\u5e94[8,59]\u3002\u8fd8\u6709\u4e00\u4e9b\u7814\u7a76\u5904\u7406\u4eba\u4f53\u90e8\u5206\u7684\u9519\u4f4d\uff0c\u4f8b\u5982\u7ea7\u8054\u5377\u79ef\u6a21\u5757[39]\uff0c\u7cbe\u7ec6\u90e8\u5206\u6c60\u5316[34]\uff0c\u8f6c\u6362\u5668[19]\u7b49\u3002\u6b64\u5916\uff0c\u5355\u6a21\u6001\u4eba\u5458\u91cd\u65b0\u8bc6\u522b\u5305\u542b\u51e0\u4e2a\u7ec6\u5206\u9886\u57df\uff0c\u4f8b\u5982\u89c6\u9891\u4eba\u5458\u91cd\u65b0\u8bc6\u522b[26,44,60]\uff0c\u65e0\u76d1\u7763\u4eba\u5458\u91cd\u65b0\u8bc6\u522b\u5904\u7406\u4f2a\u6807\u7b7e[46,54]\uff0c\u65e0\u76d1\u7763\u57df\u9002\u5e94[1,31]\u548c\u5e7f\u4e49\u4eba\u5458\u91cd\u65b0\u8bc6\u522b[16]\u3002\u7531\u4e8e\u53ef\u89c1\u56fe\u50cf\u548c\u7ea2\u5916\u56fe\u50cf\u4e4b\u95f4\u5b58\u5728\u5de8\u5927\u5dee\u5f02\uff0c\u5355\u6a21\u6001\u89e3\u51b3\u65b9\u6848\u4e0d\u9002\u7528\u4e8e\u8de8\u6a21\u6001\u4eba\u5458\u91cd\u65b0\u8bc6\u522b\uff0c\u8fd9\u5bf9VI-ReID\u89e3\u51b3\u65b9\u6848\u7684\u5f00\u53d1\u63d0\u51fa\u4e86\u9700\u6c42\u3002<\/p>\n<h2>Visible-Infrared Person Re-Identification<\/h2>\n<p>Visible-Infrared Person Re-Identification focuses on narrowing the gap between visible and infrared modalities and learning appropriate representations for pedestrian retrieval across modalities. [43] proposed a deep zero-fill network to extract useful embedded features to reduce cross-modal variation. Dual-stream networks [21,48,49,50,51] simultaneously learned modal-shared and modal-specific features. [30] used Gaussian-based variational auto-encoder to distinguish the subspace of cross-modal features. [15] exploited samples similarity within modalities. A modality-aware learning approach [47] processed modality differences on the classifier level. Some works generated images of intermediate or the corresponding modality [7,17,35,37,40] to mitigate the effect of modality discrepancy. 
However, extracting modality-shared features causes the loss of semantics related to identity discrimination, and GAN-based methods bring computational burden and non-original noise.<\/p>\n<p>\u53ef\u89c1\u7ea2\u5916\u4eba\u7269\u518d\u8bc6\u522b\u81f4\u529b\u4e8e\u7f29\u5c0f\u53ef\u89c1\u5149\u548c\u7ea2\u5916\u6a21\u6001\u4e4b\u95f4\u7684\u5dee\u8ddd\uff0c\u5e76\u5b66\u4e60\u8de8\u6a21\u6001\u884c\u4eba\u68c0\u7d22\u7684\u9002\u5f53\u8868\u793a\u3002[43]\u63d0\u51fa\u4e86\u4e00\u79cd\u6df1\u5ea6\u96f6\u586b\u5145\u7f51\u7edc\u6765\u63d0\u53d6\u6709\u7528\u7684\u5d4c\u5165\u7279\u5f81\u4ee5\u51cf\u5c11\u8de8\u6a21\u6001\u53d8\u5316\u3002\u53cc\u6d41\u7f51\u7edc[21,48,49,50,51]\u540c\u65f6\u5b66\u4e60\u4e86\u6a21\u6001\u5171\u4eab\u548c\u6a21\u6001\u7279\u5b9a\u7684\u7279\u5f81\u3002[30]\u4f7f\u7528\u57fa\u4e8e\u9ad8\u65af\u7684\u53d8\u5206\u81ea\u7f16\u7801\u5668\u6765\u533a\u5206\u8de8\u6a21\u6001\u7279\u5f81\u7684\u5b50\u7a7a\u95f4\u3002[15]\u5229\u7528\u4e86\u6a21\u6001\u5185\u7684\u6837\u672c\u76f8\u4f3c\u6027\u3002\u4e00\u79cd\u6a21\u6001\u611f\u77e5\u7684\u5b66\u4e60\u65b9\u6cd5[47]\u5728\u5206\u7c7b\u5668\u5c42\u9762\u5904\u7406\u6a21\u6001\u5dee\u5f02\u3002\u4e00\u4e9b\u4f5c\u54c1\u751f\u6210\u4e2d\u95f4\u6216\u5bf9\u5e94\u6a21\u6001\u7684\u56fe\u50cf[7,17,35,37,40]\u4ee5\u51cf\u8f7b\u6a21\u6001\u4e0d\u4e00\u81f4\u7684\u5f71\u54cd\u3002\u7136\u800c\uff0c\u63d0\u53d6\u6a21\u6001\u5171\u4eab\u7279\u5f81\u4f1a\u5bfc\u81f4\u4e0e\u8eab\u4efd\u9274\u522b\u76f8\u5173\u7684\u8bed\u4e49\u4e22\u5931\uff0c\u57fa\u4e8eGAN\u7684\u65b9\u6cd5\u4f1a\u5e26\u6765\u8ba1\u7b97\u8d1f\u62c5\u548c\u975e\u539f\u59cb\u566a\u58f0\u3002<\/p>\n<p>Differently, our work pays more attention to deep supervised knowledge synergy [32], which explores explicit information interaction between the supervised branches. We propose to make the most use of the intrinsic information of visible and infrared modalities, which learns diverse semantics and enhances feature representations by a modality synergy and complement learning scheme. To better discriminate identities, we introduce a cascaded feature aggregation strategy<\/p>\n<p>\u4e0e\u4f20\u7edf\u65b9\u6cd5\u4e0d\u540c\uff0c\u6211\u4eec\u7684\u5de5\u4f5c\u66f4\u52a0\u5173\u6ce8\u6df1\u5ea6\u76d1\u7763\u77e5\u8bc6\u878d\u5408[32]\uff0c\u63a2\u7d22\u53d7\u76d1\u7763\u5206\u652f\u4e4b\u95f4\u7684\u663e\u5f0f\u4fe1\u606f\u4ea4\u4e92\u3002\u6211\u4eec\u5efa\u8bae\u5145\u5206\u5229\u7528\u53ef\u89c1\u5149\u548c\u7ea2\u5916\u6a21\u6001\u7684\u5185\u5728\u4fe1\u606f\uff0c\u901a\u8fc7\u6a21\u6001\u534f\u540c\u548c\u4e92\u8865\u5b66\u4e60\u65b9\u6848\u6765\u5b66\u4e60\u591a\u6837\u7684\u8bed\u4e49\u5e76\u589e\u5f3a\u7279\u5f81\u8868\u793a\u3002\u4e3a\u4e86\u66f4\u597d\u5730\u533a\u5206\u8eab\u4efd\uff0c\u6211\u4eec\u5f15\u5165\u4e86\u7ea7\u8054\u7279\u5f81\u805a\u5408\u7b56\u7565\u3002<\/p>\n<h1>Modality Synergy Complement Learning<\/h1>\n<p>In this section, we formulate the VI-ReID problem and introduce the framework of our proposed MSCLNet (\u00a7 3.1). It mainly contains three major components: Modality Synergy module (MS, \u00a7 3.2), Modality Complement module (MC, \u00a7 3.3), and Cascaded Aggregation strategy (CA, \u00a7 3.4). We utilize MS to synergize modality-specific diverse semantics from the extractors, and then use MC to enhance feature representations under the guidance of advantages from the two modalities. 
To optimize the feature distribution and aggregate instances of the same identity, we exploit CA to constrain the feature distribution in a fine-grained and progressive way. Finally, we summarize the proposed loss function (§ 3.5).<\/p>\n
<h2>Problem Formulation<\/h2>\n
<p>Fig. 3 illustrates the framework of the Modality Synergy Complement Learning Network (MSCLNet) with Cascaded Aggregation. It adopts a dual-stream network as the feature extractor. Firstly, based on the feature representations f^v and f^r extracted from the visible and infrared images, MSCLNet constructs synergistic representations f^s by constraining the diversity of the feature distributions between the two modalities. The synergistic feature is further enhanced by modality complement guidance. The visible modality provides fine-grained discriminative semantics, while the infrared modality supplies stable global pedestrian statistics. 
Then we aggregate feature embeddings of the same class via Cascaded Aggregation strategy which optimizes the comprehensive distribution of feature embeddings progressively on three aspects.<\/p>\n<p>\u56fe3\u5c55\u793a\u4e86\u5e26\u6709\u7ea7\u8054\u805a\u5408\u7684\u6a21\u6001\u534f\u540c\u8865\u5145\u5b66\u4e60\u7f51\u7edc\uff08MSCLNet\uff09\u7684\u6846\u67b6\u3002\u5b83\u91c7\u7528\u53cc\u6d41\u7f51\u7edc\u4f5c\u4e3a\u7279\u5f81\u63d0\u53d6\u5668\u3002\u9996\u5148\uff0c\u57fa\u4e8e\u53ef\u89c1\u5149\u548c\u7ea2\u5916\u56fe\u50cf\u63d0\u53d6\u7684\u7279\u5f81\u8868\u793af v\u548cf r\uff0cMSCLNet\u901a\u8fc7\u7ea6\u675f\u4e24\u79cd\u6a21\u6001\u4e4b\u95f4\u7279\u5f81\u5206\u5e03\u7684\u591a\u6837\u6027\u6765\u6784\u9020\u534f\u540c\u8868\u793af s\u3002\u534f\u540c\u7279\u5f81\u8fd8\u5c06\u901a\u8fc7\u6a21\u6001\u8865\u5145\u5f15\u5bfc\u8fdb\u884c\u8fdb\u4e00\u6b65\u589e\u5f3a\u3002\u53ef\u89c1\u5149\u6a21\u6001\u63d0\u4f9b\u4e86\u7ec6\u7c92\u5ea6\u7684\u5224\u522b\u8bed\u4e49\uff0c\u800c\u7ea2\u5916\u6a21\u6001\u5219\u63d0\u4f9b\u4e86\u7a33\u5b9a\u7684\u5168\u5c40\u884c\u4eba\u7edf\u8ba1\u6570\u636e\u3002\u7136\u540e\uff0c\u6211\u4eec\u901a\u8fc7\u7ea7\u8054\u805a\u5408\u7b56\u7565\u5bf9\u540c\u4e00\u7c7b\u522b\u7684\u7279\u5f81\u5d4c\u5165\u8fdb\u884c\u805a\u5408\uff0c\u8be5\u7b56\u7565\u5728\u4e09\u4e2a\u65b9\u9762\u9010\u6e10\u4f18\u5316\u7279\u5f81\u5d4c\u5165\u7684\u7efc\u5408\u5206\u5e03\u3002<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/prod-files-secure.s3.us-west-2.amazonaws.com\/e9d07118-f13a-4e79-a274-ee99b3771efd\/e5010865-abc5-48d7-aa17-eb0e83e84655\/Untitled.png\" alt=\"Untitled\" \/><\/p>\n<h2>Modality Synergy Module<\/h2>\n<p>According to the differences in imaging principles and the heterogeneity of the image contents, visible and infrared images reveal quite different semantics to depict the same person. In our work, we design the network to learn and synergize the diverse semantics of the two modalities. Given a pair of visible and infrared images xv i \u2208 V, xr i \u2208 R, the dual-stream network extracts their features f v i and fr i . With the prerequisite of precise pedestrian re-identification, we concentrate on acquiring the semantic diversity to the largest extent. Features f v i and f r i are normalized by the following operations.<\/p>\n<p>\u6839\u636e\u6210\u50cf\u539f\u7406\u7684\u5dee\u5f02\u548c\u56fe\u50cf\u5185\u5bb9\u7684\u5f02\u8d28\u6027\uff0c\u53ef\u89c1\u5149\u548c\u7ea2\u5916\u56fe\u50cf\u5448\u73b0\u51fa\u5b8c\u5168\u4e0d\u540c\u7684\u8bed\u4e49\u6765\u63cf\u7ed8\u540c\u4e00\u4eba\u7269\u3002\u5728\u6211\u4eec\u7684\u5de5\u4f5c\u4e2d\uff0c\u6211\u4eec\u8bbe\u8ba1\u4e86\u4e00\u4e2a\u7f51\u7edc\u6765\u5b66\u4e60\u548c\u534f\u540c\u4e24\u79cd\u6a21\u5f0f\u7684\u591a\u6837\u8bed\u4e49\u3002\u7ed9\u5b9a\u4e00\u5bf9\u53ef\u89c1\u5149\u548c\u7ea2\u5916\u56fe\u50cf xv i \u2208 V\uff0cxr i \u2208 R\uff0c\u53cc\u6d41\u7f51\u7edc\u63d0\u53d6\u5b83\u4eec\u7684\u7279\u5f81 f v i \u548c f r i\u3002\u5728\u786e\u4fdd\u7cbe\u786e\u7684\u884c\u4eba\u518d\u8bc6\u522b\u7684\u524d\u63d0\u4e0b\uff0c\u6211\u4eec\u4e13\u6ce8\u4e8e\u5c3d\u53ef\u80fd\u83b7\u53d6\u8bed\u4e49\u591a\u6837\u6027\u3002\u7279\u5f81 f v i \u548c f r i \u901a\u8fc7\u4ee5\u4e0b\u64cd\u4f5c\u8fdb\u884c\u5f52\u4e00\u5316\u3002<\/p>\n<p>We utilize Mogrifier LSTM [25] as a synergistic feature encoder to maximize the effect of modality synergy learning, and the synergistic feature f s i is encoded with visible and infrared features with their shared ground-truth label. 
To construct f^s_i with diverse semantics, we exploit the KL-divergence to constrain the logit distributions of the visible and infrared features f^v_i and f^r_i, which can be formulated as follows:<\/p>\n
<p>N denotes the number of samples in a batch. θv and θr are the learned feature extractors of the visible and infrared modalities respectively, which aim to maximize the diversity of semantic representation across modalities. f^v and f^r are first learned in their own representation spaces to maximize modality-specific discrimination among identities. Then, the synergistic feature extractor θs projects the normalized features into a shared representation space and constructs the synergistic features f^s_i.<\/p>\n
<p>$$<br \/>\nVar[f^v_i]=\frac{1}{HW}\sum^W_{l=1}\sum^H_{m=1}\left(f^v_{i,l,m}-E[f^v_i]\right)^2<br \/>\n$$<\/p>\n
<p>$$<br \/>\nL_t=-\frac{1}{N}\sum^N_{i=1}\left[\hat y_i\log\hat p^v_i(\hat f^v_i,\theta_v)\right]-\frac{1}{N}\sum^N_{i=1}\left[\hat y_i\log\hat p^r_i(\hat f^r_i,\theta_r)\right]<br \/>\n$$<\/p>\n
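<p>To make the synergy objective concrete, the following is a minimal PyTorch sketch of the two terms above. The identity term follows the cross-entropy form of L_t; the exact formula of the diversity term L_div is not reproduced in the text, so the symmetric KL-divergence between the per-modality logit distributions used here is an assumption, the weights follow the λdiv = 0.5, λt = 1.25 setting reported in the implementation details, and all function names are illustrative.<\/p>\n
<pre><code>
# Hedged sketch of the synergy objective; the exact form of L_div is assumed.
import torch
import torch.nn.functional as F

def identity_loss(logits_v, logits_r, labels):
    # L_t: cross-entropy of each modality branch against the shared identity labels.
    return F.cross_entropy(logits_v, labels) + F.cross_entropy(logits_r, labels)

def diversity_loss(logits_v, logits_r):
    # Assumed L_div: symmetric KL between the two modality-specific logit
    # distributions; the negative sign turns "maximize diversity" into a minimization.
    log_pv = F.log_softmax(logits_v, dim=1)
    log_pr = F.log_softmax(logits_r, dim=1)
    kl_vr = F.kl_div(log_pv, log_pr, log_target=True, reduction="batchmean")
    kl_rv = F.kl_div(log_pr, log_pv, log_target=True, reduction="batchmean")
    return -(kl_vr + kl_rv)

def synergy_objective(logits_v, logits_r, labels, lam_div=0.5, lam_t=1.25):
    return lam_div * diversity_loss(logits_v, logits_r) + lam_t * identity_loss(logits_v, logits_r, labels)
<\/code><\/pre>\n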
<h2>Modality Complement Module<\/h2>\n
<p>Although the synergistic representation contains more identity-relevant diverse semantics, it is uncertain whether the synergistic feature outperforms the simple combination of visible and infrared features, Concat(f^v_i, f^r_i). Since infrared images contain global pedestrian statistics with less noise and visible images contain fine-grained discriminative semantics, we enhance the representational effectiveness of the synergistic feature f^s_i from two aspects. For fine-grained semantics, we enhance the synergistic features with the local-part advantages of the visible features f^v_i; for coarse-grained semantics, we enhance them with the global advantages of the infrared features f^r_i.<\/p>\n
<p>On the fine-grained level, we split the visible and synergistic features into n = 6 parts following MPANet [45] and obtain separate feature blocks f^v_i = [b^v_1, b^v_2, ..., b^v_n] and f^s_i = [b^s_1, b^s_2, ..., b^s_n]. The local discrimination of the synergistic features can thus be boosted by the nuanced regions of the visible modality. Cosine similarity cos(·, ·) is used in the optimization:<\/p>\n
<p>$$<br \/>\nL_{local}=\frac{1}{N}\sum^N_{i=1}\sum^n_{j=1}\left(\cos(b^v_j,b^s_j)+\sqrt{2-2\cos(b^v_j,b^s_j)}\right)<br \/>\n$$<\/p>\n
<p>In parallel, on the coarse-grained level, we supervise f^s_i by keeping the statistical centers of the synergistic features consistent with those of the infrared features f^r_i. The global statistics of the synergistic features are thus optimized through center consistency with the infrared modality:<\/p>\n
<p>$$<br \/>\nL_{global}=\frac{1}{N}\sum^N_{i=1}||C^s_{y_i}-C^r_{y_i}||^2_2<br \/>\n$$<\/p>\n
<p>where C^s_{y_i} and C^r_{y_i} denote the centers of the y_i-th class for the synergistic features f^s_i and the infrared features f^r_i. L_global helps coordinate the semantics of the synergistic and infrared features and filter out identity-irrelevant components of the synergistic representation.<\/p>\n
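<p>The two complement terms can be sketched directly from the equations above. The sketch below is a simplified batch-level version: it assumes the part blocks are concatenated along the channel dimension (so they can be recovered with a chunk), averages the center-consistency term over the identities present in the batch, and uses illustrative function names rather than names from the paper.<\/p>\n
<pre><code>
# Minimal sketch of L_local (part-wise cosine alignment) and L_global
# (center consistency with the infrared modality); simplified batch version.
import torch
import torch.nn.functional as F

def local_complement_loss(f_v, f_s, n_parts=6):
    # Split into n part blocks (assumed concatenated along dim 1) and apply
    # the cos + sqrt(2 - 2*cos) term of L_local to each corresponding pair.
    blocks_v = f_v.chunk(n_parts, dim=1)
    blocks_s = f_s.chunk(n_parts, dim=1)
    loss = 0.0
    for bv, bs in zip(blocks_v, blocks_s):
        c = F.cosine_similarity(bv, bs, dim=1)
        loss = loss + (c + torch.sqrt(torch.clamp(2.0 - 2.0 * c, min=1e-12))).mean()
    return loss

def global_complement_loss(f_s, f_r, labels):
    # Pull the per-identity center of the synergistic features toward the
    # corresponding infrared center, averaged over the identities in the batch.
    dists = []
    for y in labels.unique():
        mask = labels.eq(y)
        c_s = f_s[mask].mean(dim=0)
        c_r = f_r[mask].mean(dim=0)
        dists.append((c_s - c_r).pow(2).sum())
    return torch.stack(dists).mean()
<\/code><\/pre>\n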
<p>In the Modality Complement module, we update the parameters of the synergistic feature extractor θs, aiming to construct features with less noise and a more diverse, more precise semantic description for each identity. θs is optimized as follows:<\/p>\n
<p>$$<br \/>\nL_{Com}(\theta_s)=\lambda_{local}\cdot L_{local}+\lambda_{global}\cdot L_{global},\quad \hat\theta_s=\mathop{\arg\min}_{\theta_s}L_{Com}(\theta_s)<br \/>\n$$<\/p>\n
<h2>Cascaded Aggregation Strategy<\/h2>\n
<p>Due to factors such as shooting perspective, clothing, and occlusion, person retrieval results are easily affected [53,33]. To cope with this problem, the center loss [23] and the triplet loss [14] are widely adopted in ReID to simultaneously learn centralized feature embeddings and mine hard samples. The center loss Lc and the triplet loss Ltri can be formulated as:<\/p>\n
<p>$$<br \/>\nL_c=\frac{1}{N}\sum^N_{i=1}||f_i-C_{y_i}||^2_2 \\ L_{tri}=\sum^N_{i=1}\left[||f(x^a_i)-f(x^{pos}_i)||^2_2-||f(x^a_i)-f(x^{neg}_i)||^2_2+\alpha\right]_+<br \/>\n$$<\/p>\n
<p>where x_i denotes the i-th input sample, C_{y_i} is the y_i-th class center, f_i is the feature embedding, and x^a_i is the anchor. The center loss focuses on aggregating feature embeddings but neglects the intrinsic differences and diverse semantics of the visible and infrared modalities. The triplet loss handles hard samples separately rather than considering the comprehensive distribution across modalities, which limits performance.<\/p>\n
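<p>For reference, these two classical baselines can be written out in a few lines of PyTorch; the margin value and the learnable-centers bookkeeping below are illustrative, not taken from the paper.<\/p>\n
<pre><code>
# Reference sketch of the center loss and the triplet loss from the equations above.
import torch

def center_loss(features, labels, centers):
    # centers: (num_classes, dim) learnable class centers C_{y}; L_c pulls each
    # embedding toward the center of its own identity.
    return (features - centers[labels]).pow(2).sum(dim=1).mean()

def triplet_loss(anchor, positive, negative, margin=0.3):
    # Hinge on the gap between the anchor-positive and anchor-negative
    # squared distances, summed over the batch as in L_tri.
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).sum()
<\/code><\/pre>\n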
<p>Considering the diverse semantics and structural distribution across modalities, we propose Cascaded Aggregation to progressively optimize the feature distribution, as shown in Fig. 4.<\/p>\n
<p><img decoding=\"async\" src=\"https:\/\/prod-files-secure.s3.us-west-2.amazonaws.com\/e9d07118-f13a-4e79-a274-ee99b3771efd\/e2a53c8a-94c5-49f1-8444-fdedb39d289a\/Untitled.png\" alt=\"Untitled\" \/><\/p>\n
<p>1) Aggregation on the sub-class level. We use the identity of the shooting camera of each image as a natural sub-class, since images of the same person captured by the same camera are highly similar to each other. C_{s_i} denotes the s_i-th sub-class center:<\/p>\n
<p>$$<br \/>\nL_{sub}=\frac{1}{N}\sum^N_{i=1}||f^s_i-C_{s_i}||^2_2<br \/>\n$$<\/p>\n
<hr \/>\n
<p>2) Aggregation on the intra-class level, which keeps the structural priors of the features during the training process. The aggregation can be formulated as follows, where N_s denotes the number of sub-classes of each identity:<\/p>\n
<p>$$<br \/>\nL_{intra}=\frac{1}{N}\sum^N_{i=1}\sum^{N_s}_{j=1}||C_{s_j}-C_{y_i}||^2_2<br \/>\n$$<\/p>\n
<p>3) Aggregation on the inter-class level. Our aggregation not only maximizes the similarity of intra-class instances but also maximizes the dissimilarity of inter-class instances as a whole. The dispersion between different identities and the two types of aggregation of the same identity in 1) and 2) are independent of each other. 
Formally, the dispersion between different identities can be represented as:<\/p>\n
<p>$$<br \/>\nL_{inter}=-\frac{2}{N}\sum^N_{i=1}\sum^N_{j \neq i}||C_{y_i}-C_{y_j}||^2_2<br \/>\n$$<\/p>\n
<p>The loss function of CA for metric learning can be represented as:<\/p>\n
<p>$$<br \/>\nL_{cascade}=L_{sub}+L_{intra}+L_{inter}<br \/>\n$$<\/p>\n
<p>Compared with the center loss, our method starts from only a few highly similar samples captured by the same camera, which makes it much easier to learn sub-center representations.<\/p>\n
<p>Compared with the triplet loss, our method handles negative samples by guiding them toward their corresponding sub-classes instead of simply pushing them away along the gradient.<\/p>\n
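<p>Putting the three levels together, a batch-level sketch of the cascaded aggregation loss might look as follows. It assumes each image in the batch carries an identity label and a camera id (the camera serving as the natural sub-class), averages each term over the sub-classes or identities present in the batch rather than reproducing the exact per-sample normalizations of the equations, and uses illustrative names.<\/p>\n
<pre><code>
# Sketch of L_sub + L_intra + L_inter computed on one mini-batch.
import torch

def cascaded_aggregation_loss(f_s, labels, cam_ids):
    l_sub, l_intra, id_centers = [], [], []
    for y in labels.unique():
        id_mask = labels.eq(y)
        c_y = f_s[id_mask].mean(dim=0)                 # identity center C_{y}
        id_centers.append(c_y)
        for cam in cam_ids[id_mask].unique():
            sub_mask = id_mask.logical_and(cam_ids.eq(cam))
            c_sub = f_s[sub_mask].mean(dim=0)          # sub-class (camera) center
            # 1) pull samples toward their sub-class center
            l_sub.append((f_s[sub_mask] - c_sub).pow(2).sum(dim=1).mean())
            # 2) pull sub-class centers toward the identity center
            l_intra.append((c_sub - c_y).pow(2).sum())
    id_centers = torch.stack(id_centers)
    # 3) push the centers of different identities apart (negative sign)
    diff = id_centers.unsqueeze(0) - id_centers.unsqueeze(1)
    l_inter = -diff.pow(2).sum(dim=2).mean()
    return torch.stack(l_sub).mean() + torch.stack(l_intra).mean() + l_inter
<\/code><\/pre>\n
<p>In training, this term is added to the synergy and complement losses above, matching the total objective summarized in the Objective Function subsection.<\/p>\n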
<h2>Objective Function<\/h2>\n
<p>Firstly, we utilize the Synergy Loss to enrich the representation with diverse semantics. The parameters of the feature extractors θv and θr are updated as:<\/p>\n
<p>$$<br \/>\nL_{synergistic}=L(\theta_v,\theta_r)=\lambda_{div}\cdot L_{div}+\lambda_t\cdot L_t<br \/>\n$$<\/p>\n
<p>Then, we enhance the synergistic feature representation with the advantages of the two modalities, namely the discriminative local parts from the visible feature and the global identity statistics from the infrared feature. We utilize the Complementary Loss LCom to update the modality synergy feature extractor θs:<\/p>\n
<p>$$<br \/>\nL_{Com}=L(\theta_s)=\lambda_{local}\cdot L_{local}+\lambda_{global}\cdot L_{global}<br \/>\n$$<\/p>\n
<p>Finally, we constrain the distribution of the visible, infrared, and synergistic features f^v, f^r, f^s with the cascaded aggregation strategy Lcascaded:<\/p>\n
<p>$$<br \/>\nL_{cascaded}=L(\theta_v,\theta_r,\theta_s)=L_{sub}+L_{intra}+L_{inter}<br \/>\n$$<\/p>\n
<p>Overall, the objective function of our MSCLNet can be summarized as follows:<\/p>\n
<p>$$<br \/>\nL_{total}=\lambda_{div}\cdot L_{div}+\lambda_t\cdot L_t+\lambda_{local}\cdot L_{local}+\lambda_{global}\cdot L_{global}+L_{sub}+L_{intra}+L_{inter}<br \/>\n$$<\/p>\n
<h1>Experiment<\/h1>\n
<h2>Datasets and Evaluation Protocol<\/h2>\n
<h3>SYSU-MM01<\/h3>\n
<h3>RegDB<\/h3>\n
<h3>Evaluation Protocol<\/h3>\n
<h2>Implementation Details<\/h2>\n
<h3>Training<\/h3>\n
<p>We implement MSCLNet with PyTorch on a single NVIDIA RTX 2080 Ti GPU. Each mini-batch contains 64 images of 8 identities, consisting of 32 visible and 32 infrared images obtained by randomly selecting 4 visible and 4 infrared images per identity. Our baseline is AGW*, i.e., AGW [51] with Random Erasing. We adopt ResNet-50 [13] pre-trained on ImageNet as the backbone network. Each image is rescaled to 288 × 144 and augmented with random cropping with zero-padding, random horizontal flipping, and random erasing (80% probability, 80% max area, 20% min area). During training, we optimize the feature extractors θv, θr and the modality synergy module θs with the SGD optimizer. We set the initial learning rate η = 0.1 and the momentum parameter p = 0.9. The learning rate is set to η = 0.05 for epochs 21-50, η = 0.01 for epochs 51-100, and η = 0.001 for epochs 101-200. The hyper-parameters λdiv, λt, λlocal, λglobal are set to 0.5, 1.25, 0.8, and 1.5, respectively. 
We synergize visible and infrared instances to train a concise end-to-end network, which retrieves specific person across modalities.<\/p>\n<p>\u6211\u4eec\u5728\u5355\u4e2aNVIDIA RTX 2080 Ti GPU\u4e0a\u4f7f\u7528PyTorch\u5b9e\u73b0\u4e86MSCLNet\uff0c\u5e76\u901a\u8fc7\u968f\u673a\u9009\u62e9\u6bcf\u4e2a\u8eab\u4efd\u76844\u5f20\u53ef\u89c1\u5149\u56fe\u50cf\u548c4\u5f20\u7ea2\u5916\u56fe\u50cf\uff0c\u5904\u7406\u4e86\u753132\u4e2a\u53ef\u89c1\u5149\u56fe\u50cf\u548c32\u4e2a\u7ea2\u5916\u56fe\u50cf\u7ec4\u6210\u768464\u5f20\u56fe\u50cf\u7684\u5c0f\u6279\u91cf\u3002\u6211\u4eec\u7684\u57fa\u7ebf\u662fAGW *\uff0c\u8fd9\u610f\u5473\u7740\u4f7f\u7528Random Erasing\u7684AGW [51]\u3002\u6211\u4eec\u91c7\u7528\u5728ImageNet\u4e0a\u9884\u5148\u8bad\u7ec3\u7684ResNet-50 [13]\u4f5c\u4e3a\u9aa8\u5e72\u7f51\u7edc\u3002\u7136\u540e\uff0c\u6211\u4eec\u901a\u8fc7\u91cd\u65b0\u7f29\u653e\u81f3288\u00d7144\u548c\u968f\u673a\u88c1\u526a\uff08\u96f6\u586b\u5145\uff09\uff0c\u968f\u673a\u6c34\u5e73\u7ffb\u8f6c\u548c\u968f\u673a\u64e6\u9664\uff0880\uff05\u6982\u7387\uff0c80\uff05\u6700\u5927\u9762\u79ef\uff0c20\uff05\u6700\u5c0f\u9762\u79ef\uff09\u6765\u9884\u5904\u7406\u6bcf\u4e2a\u56fe\u50cf\u3002\u5728\u8bad\u7ec3\u8fc7\u7a0b\u4e2d\uff0c\u6211\u4eec\u4f7f\u7528SGD\u4f18\u5316\u5668\u4f18\u5316\u7279\u5f81\u63d0\u53d6\u5668\u03b8v\uff0c\u03b8r\u548c\u6a21\u6001\u534f\u540c\u6a21\u5757\u03b8s\u3002\u6211\u4eec\u5c06\u521d\u59cb\u5b66\u4e60\u7387\u03b7\u8bbe\u7f6e\u4e3a0.1\uff0c\u52a8\u91cf\u53c2\u6570p\u8bbe\u7f6e\u4e3a0.9\u3002\u5b66\u4e60\u7387\u572821-50\u4e2a\u65f6\u671f\u65f6\u66f4\u6539\u4e3a\u03b7= 0.05\uff0c\u572851-100\u4e2a\u65f6\u671f\u65f6\u66f4\u6539\u4e3a\u03b7= 0.01\uff0c\u5e76\u5728101-200\u4e2a\u65f6\u671f\u65f6\u66f4\u6539\u4e3a\u03b7= 0.001\u3002\u8d85\u53c2\u6570\u03bbdiv\uff0c\u03bbt\uff0c\u03bblocal\uff0c\u03bbglobal\u5206\u522b\u8bbe\u7f6e\u4e3a0.5\uff0c1.25\uff0c0.8\u548c1.5\u3002\u6211\u4eec\u5c06\u53ef\u89c1\u5149\u548c\u7ea2\u5916\u5b9e\u4f8b\u5408\u5e76\u5728\u4e00\u8d77\uff0c\u8bad\u7ec3\u4e00\u4e2a\u7b80\u660e\u7684\u7aef\u5230\u7aef\u7f51\u7edc\uff0c\u53ef\u8de8\u6a21\u6001\u68c0\u7d22\u7279\u5b9a\u7684\u4eba\u7269\u3002<\/p>\n<h3>Testing<\/h3>\n<p>For testing, the model works in Single-shot mode by extracting the query and the gallery features from a single modality by the feature extractor \u03b8v or \u03b8r. 
Besides, the MS and MC modules do not participate in the testing stage.<\/p>\n
<h2>Ablation Study<\/h2>\n
<p>In this subsection, we conduct an ablation study to evaluate the effect of each component of MSCLNet, as summarized in Eq. 17. The results are shown in Table 1. We evaluate how much improvement each component brings under the all-search mode of the SYSU-MM01 dataset.<\/p>\n
<p><img decoding=\"async\" src=\"https:\/\/prod-files-secure.s3.us-west-2.amazonaws.com\/e9d07118-f13a-4e79-a274-ee99b3771efd\/302b45c1-b638-4085-9250-53dfcded140f\/Untitled.png\" alt=\"Untitled\" \/><\/p>\n
<h1>Conclusion and Discussion<\/h1>\n
<p>In this paper, we propose a novel VI-ReID framework that makes full use of the semantics of the visible and infrared modalities and learns discriminative identity representations by synergizing and complementing instances of the two modalities. Different from existing methods that pursue modality-shared information at the risk of losing identity-relevant semantics, MSCLNet provides an innovative approach that explores high-level unity in the VI-ReID task. Meanwhile, we propose the Cascaded Aggregation strategy to optimize the distribution of feature embeddings in a fine-grained and progressive manner, which helps the network discriminate identities and extract more precise and comprehensive features. Experimental results validate the merit of the framework, as well as the effectiveness of each component in this framework. 
In the future work, we plan to explore background scenes, gender, and appearances to construct better different sub-classes.<\/p>\n<p>\u5728\u8fd9\u7bc7\u8bba\u6587\u4e2d\uff0c\u6211\u4eec\u63d0\u51fa\u4e86\u4e00\u79cd\u65b0\u9896\u7684VI-ReID\u6846\u67b6\uff0c\u8be5\u6846\u67b6\u80fd\u591f\u5145\u5206\u5229\u7528\u53ef\u89c1\u5149\u548c\u7ea2\u5916\u5149\u8bed\u4e49\uff0c\u5e76\u901a\u8fc7\u534f\u540c\u548c\u8865\u5145\u53ef\u89c1\u5149\u548c\u7ea2\u5916\u5149\u7684\u5b9e\u4f8b\u6765\u5b66\u4e60\u8bc6\u522b\u8eab\u4efd\u7684\u5224\u522b\u6027\u8868\u793a\u3002\u4e0e\u73b0\u6709\u65b9\u6cd5\u4e0d\u540c\uff0c\u73b0\u6709\u65b9\u6cd5\u8ffd\u6c42\u6a21\u6001\u5171\u4eab\u4fe1\u606f\uff0c\u53ef\u80fd\u5bfc\u81f4\u8eab\u4efd\u76f8\u5173\u8bed\u4e49\u7684\u4e27\u5931\uff0cMSCLNet\u63d0\u4f9b\u4e86\u4e00\u79cd\u521b\u65b0\u7684\u65b9\u6cd5\u6765\u63a2\u7d22VI-ReID\u4efb\u52a1\u4e2d\u7684\u9ad8\u7ea7\u7edf\u4e00\u6027\u3002\u540c\u65f6\uff0c\u6211\u4eec\u63d0\u51fa\u4e86\u7ea7\u8054\u805a\u5408\u7b56\u7565\uff0c\u4ee5\u7ec6\u5316\u548c\u9010\u6b65\u4f18\u5316\u7279\u5f81\u5d4c\u5165\u7684\u5206\u5e03\uff0c\u4ece\u800c\u5e2e\u52a9\u7f51\u7edc\u533a\u5206\u8eab\u4efd\u5e76\u63d0\u53d6\u66f4\u7cbe\u786e\u548c\u66f4\u5168\u9762\u7684\u7279\u5f81\u3002\u5b9e\u9a8c\u7ed3\u679c\u9a8c\u8bc1\u4e86\u8be5\u6846\u67b6\u7684\u4f18\u70b9\uff0c\u4ee5\u53ca\u8be5\u6846\u67b6\u4e2d\u6bcf\u4e2a\u7ec4\u4ef6\u7684\u6709\u6548\u6027\u3002\u5728\u672a\u6765\u7684\u5de5\u4f5c\u4e2d\uff0c\u6211\u4eec\u8ba1\u5212\u63a2\u7d22\u80cc\u666f\u573a\u666f\u3001\u6027\u522b\u548c\u5916\u89c2\uff0c\u6765\u6784\u5efa\u66f4\u597d\u7684\u4e0d\u540c\u5b50\u7c7b\u3002<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Abstract. Visible-Infrared Re-Identification (VI-ReID)  [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[105,11],"tags":[],"_links":{"self":[{"href":"http:\/\/tamanegi.xyz\/index.php?rest_route=\/wp\/v2\/posts\/943"}],"collection":[{"href":"http:\/\/tamanegi.xyz\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/tamanegi.xyz\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/tamanegi.xyz\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/tamanegi.xyz\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=943"}],"version-history":[{"count":1,"href":"http:\/\/tamanegi.xyz\/index.php?rest_route=\/wp\/v2\/posts\/943\/revisions"}],"predecessor-version":[{"id":944,"href":"http:\/\/tamanegi.xyz\/index.php?rest_route=\/wp\/v2\/posts\/943\/revisions\/944"}],"wp:attachment":[{"href":"http:\/\/tamanegi.xyz\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=943"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/tamanegi.xyz\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=943"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/tamanegi.xyz\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=943"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}