Modality Synergy Complement Learning with Cascaded Aggregation for Visible-Infrared Person Re-Identification
Abstract.
Visible-Infrared Person Re-Identification (VI-ReID) is a challenging image retrieval task: the modality discrepancy easily causes huge intra-class variations. Most existing methods either bridge the two modalities through modality-invariant features or generate an intermediate modality for better performance. Differently, this paper proposes a novel framework, named Modality Synergy Complement Learning Network (MSCLNet) with Cascaded Aggregation. Its basic idea is to synergize the two modalities to construct diverse representations with identity-discriminative semantics and less noise. Then, we complement the synergistic representations with the respective advantages of the two modalities. Furthermore, we propose the Cascaded Aggregation strategy for fine-grained optimization of the feature distribution, which progressively aggregates feature embeddings at the sub-class, intra-class, and inter-class levels. Extensive experiments on the SYSU-MM01 and RegDB datasets show that MSCLNet outperforms the state-of-the-art by a large margin. On the large-scale SYSU-MM01 dataset, our model achieves 76.99% Rank-1 accuracy and 71.64% mAP.
Introduction
In conclusion, the main contributions of our work can be summarized as follows. We propose a novel framework named Modality Synergy Complement Learning Network (MSCLNet) with Cascaded Aggregation for VI-ReID. To capture more discriminative semantics, it learns enhanced feature representations from the diverse semantics and specific advantages of the visible and infrared modalities. We propose a Modality Synergy module (MS), which innovatively mines modality-specific diverse semantics, and a Modality Complement module (MC), which further enhances the feature representations with two parallel guidances derived from modality-specific advantages; together they provide a reference for further high-level identity representation. We then design a Cascaded Aggregation strategy (CA) to optimize the distribution of feature embeddings at a fine-grained level; it progressively aggregates the overall instances in a cascaded manner and enhances the discrimination of identities. Extensive experimental results show that our proposed framework outperforms the state-of-the-art methods by a large margin on two mainstream VI-ReID benchmarks.
Related Work
Single-Modality Person Re-Identification
Single-modality person re-identification retrieves pedestrians from a set of visible images. Visible person ReID is a reliable technique that plays an important role in daily life. These methods mainly solve the single-modality ReID problem via ranking [2,29], local and global attention [38,57], camera style [3,55,59], person key-points [36], siamese networks [58], similarity graphs [22], network architecture search [18], etc. Some works attempted domain adaptation [8,59], and some research dealt with the misalignment of human parts, e.g., cascaded convolutional modules [39], refined part pooling [34], and transformers [19]. Besides, single-modality person re-identification contains several subdivided areas, for example, video person re-identification [26,44,60], unsupervised person re-identification with pseudo labels [46,54], unsupervised domain adaptation [1,31], and generalized person re-identification [16]. Due to the tremendous discrepancy between visible and infrared images, single-modality solutions are not suitable for cross-modality person re-identification, which creates a demand for dedicated VI-ReID solutions.
Visible-Infrared Person Re-Identification
Visible-Infrared Person Re-Identification focuses on narrowing the gap between the visible and infrared modalities and learning appropriate representations for pedestrian retrieval across modalities. [43] proposed a deep zero-padding network to extract useful embedded features and reduce cross-modality variation. Dual-stream networks [21,48,49,50,51] simultaneously learn modality-shared and modality-specific features. [30] used a Gaussian-based variational auto-encoder to distinguish the subspaces of cross-modality features. [15] exploited sample similarity within modalities. A modality-aware learning approach [47] processed modality differences at the classifier level. Some works generated images of an intermediate or the corresponding modality [7,17,35,37,40] to mitigate the effect of the modality discrepancy. However, extracting modality-shared features causes the loss of semantics related to identity discrimination, and GAN-based methods introduce extra computational burden and non-original noise.
Differently, our work pays more attention to deep supervised knowledge synergy [32], which explores explicit information interaction between supervised branches. We propose to make the most of the intrinsic information of the visible and infrared modalities, learning diverse semantics and enhancing feature representations through a modality synergy and complement learning scheme. To better discriminate identities, we introduce a cascaded feature aggregation strategy.
Modality Synergy Complement Learning
In this section, we formulate the VI-ReID problem and introduce the framework of our proposed MSCLNet (§ 3.1). It contains three major components: the Modality Synergy module (MS, § 3.2), the Modality Complement module (MC, § 3.3), and the Cascaded Aggregation strategy (CA, § 3.4). We utilize MS to synergize modality-specific diverse semantics from the extractors, and then use MC to enhance the feature representations under the guidance of the advantages of the two modalities. To optimize the feature distribution and aggregate instances of the same identity, we exploit CA to constrain the feature distribution in a fine-grained and progressive way. Finally, we summarize the proposed loss function (§ 3.5).
Problem Formulation
Fig. 3 illustrates the framework of the Modality Synergy Complement Learning Network (MSCLNet) with Cascaded Aggregation. It adopts a dual-stream network as the feature extractor. Firstly, based on the feature representations $f^v$ and $f^r$ extracted from visible and infrared images, MSCLNet constructs synergistic representations $f^s$ by constraining the diversity of the feature distributions between the two modalities. The synergistic feature is further enhanced by modality complement guidance: the visible modality provides fine-grained discriminative semantics, while the infrared modality supplies stable global pedestrian statistics. Then we aggregate feature embeddings of the same class via the Cascaded Aggregation strategy, which progressively optimizes the comprehensive distribution of feature embeddings at three levels.
Modality Synergy Module
According to the differences in imaging principles and the heterogeneity of image contents, visible and infrared images reveal quite different semantics when depicting the same person. In our work, we design the network to learn and synergize the diverse semantics of the two modalities. Given a pair of visible and infrared images $x^v_i \in V$, $x^r_i \in R$, the dual-stream network extracts their features $f^v_i$ and $f^r_i$. Under the prerequisite of precise pedestrian re-identification, we concentrate on acquiring semantic diversity to the largest extent. Features $f^v_i$ and $f^r_i$ are normalized by the following operations.
We utilize Mogrifier LSTM [25] as the synergistic feature encoder to maximize the effect of modality synergy learning; the synergistic feature $f^s_i$ is encoded from the visible and infrared features under their shared ground-truth label. To construct $f^s_i$ with diverse semantics, we exploit the KL-Divergence to constrain the logit distributions of the visible and infrared features $f^v_i$, $f^r_i$, which can be formulated as follows:
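The diversity term itself is not reproduced here; a plausible form, assuming the KL-Divergence between the two logit distributions is negated so that minimizing it maximizes cross-modality diversity (the paper's exact normalization may differ), is:
$$
L_{div}=-\frac{1}{N}\sum^{N}_{i=1}D_{KL}\left(\hat p^v_i(\hat f^v_i,\theta_v)\,\Big\|\,\hat p^r_i(\hat f^r_i,\theta_r)\right)
$$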
$N$ denotes the number of samples in a batch. $\theta_v$ and $\theta_r$ act as the learned feature extractors of the visible and infrared modalities respectively, which aim to maximize the diversity of semantic representations across modalities. $f^v$ and $f^r$ are first optimized in their own representation spaces to maximize the modality-specific discrimination among identities. Then, the synergistic feature extractor $\theta_s$ projects $\hat f^v_i$, $\hat f^r_i$ to a shared representation space and constructs the synergistic features $f^s_i$.
$$
Var[f^v_i]=\frac{1}{HW}\sum^{W}_{l=1}\sum^{H}_{m=1}\left(f^v_{ilm}-E[f^v_i]\right)^2
$$
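Given the variance above and the corresponding mean $E[f^v_i]$, the normalized feature used below is presumably the standard score (the infrared feature is normalized analogously):
$$
\hat f^v_i=\frac{f^v_i-E[f^v_i]}{\sqrt{Var[f^v_i]}}
$$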
$$
L_t=-\frac{1}{N}\sum^{N}_{i=1}\left[\hat y_i\log\hat p^v_i(\hat f^v_i,\theta_v)\right]-\frac{1}{N}\sum^{N}_{i=1}\left[\hat y_i\log\hat p^r_i(\hat f^r_i,\theta_r)\right]
$$
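To make the two synergy terms concrete, the following is a minimal PyTorch sketch under our reading of the equations; the function name, the direction of the KL term, and the batch-mean reduction are our assumptions rather than the paper's released code.

```python
import torch
import torch.nn.functional as F

def synergy_losses(logits_v, logits_r, labels):
    """Sketch of the Modality Synergy terms described above.

    logits_v, logits_r: (N, C) classifier outputs for the normalized
    visible/infrared features; labels: (N,) ground-truth identity labels.
    """
    # L_t: identity cross-entropy on both modality branches.
    l_t = F.cross_entropy(logits_v, labels) + F.cross_entropy(logits_r, labels)
    # L_div: negated KL-divergence between the two logit distributions,
    # so that minimizing l_div maximizes cross-modality semantic diversity.
    l_div = -F.kl_div(F.log_softmax(logits_v, dim=1),
                      F.softmax(logits_r, dim=1), reduction="batchmean")
    return l_div, l_t
```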
Modality Complement Module
Although the synergistic representation contains more identity-relevant diverse semantics, it is uncertain whether the synergistic feature outperforms the simple combination of visible and infrared features, $\mathrm{Concat}(f^v_i, f^r_i)$. Since infrared images contain global pedestrian statistics with less noise and visible images contain fine-grained discriminative semantics, we enhance the representational effectiveness of the synergistic feature $f^s_i$ from two aspects. Considering fine-grained semantics, we enhance the synergistic features with the advantages of the visible features $f^v_i$ in terms of local parts; considering coarse-grained semantics, we enhance them with the advantages of the infrared features $f^r_i$ with respect to global parts.
On the fine-grained level, following MPANet [45], we split the visible and synergistic features into $n=6$ parts and obtain the separate feature blocks $f^v_i=[b^v_1, b^v_2, \cdots, b^v_n]$ and $f^s_i=[b^s_1, b^s_2, \cdots, b^s_n]$. The local discrimination of the synergistic features can be boosted by the nuanced regions of the visible modality. Cosine similarity $\cos(\cdot,\cdot)$ is utilized in the optimization process.
$$
L_{local}=\frac{1}{N}\sum^{N}_{i=1}\sum^{n}_{j=1}\left(\cos(b^v_j,b^s_j)+\sqrt{2-2\cos(b^v_j,b^s_j)}\right)
$$
In parallel, on the coarse-grained level, we supervise $f^s_i$ by keeping the statistical centers of the synergistic features consistent with those of the infrared features $f^r_i$. The global statistics of the synergistic features are thus optimized with the center consistency of the infrared modality.
$$
L_{global}=\frac{1}{N}\sum^{N}_{i=1}\left\|C^s_{y_i}-C^r_{y_i}\right\|^2_2
$$
where $C^s_{y_i}$ and $C^r_{y_i}$ denote the centers of the $y_i$-th class for the synergistic features $f^s_i$ and the infrared features $f^r_i$, respectively. $L_{global}$ helps coordinate the semantics of the synergistic and infrared features and filters out identity-irrelevant components of the synergistic representation.
In the Modality Complement module, we update the parameters of the synergistic feature extractor $\theta_s$, aiming to construct features with less noise and a more diverse and precise semantic description for each identity. $\theta_s$ is optimized as follows:
$$
L_{Com}(\theta_s)=\lambda_{local}L_{local}+\lambda_{global}L_{global},\qquad \hat\theta_s=\underset{\theta_s}{\arg\min}\,L_{Com}(\theta_s)
$$
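The two complement terms could be sketched in PyTorch as below. This is a sketch under stated assumptions: we split along the channel dimension for simplicity, whereas the paper splits spatial parts following MPANet, and the class centers are computed per batch.

```python
import torch

def complement_losses(f_v, f_s, f_r, labels, n_parts=6):
    """Sketch of L_local / L_global; f_v, f_s, f_r: (N, D) visible,
    synergistic, and infrared embeddings; labels: (N,) identity labels.
    D must be divisible by n_parts for the simple channel split used here.
    """
    n = f_v.size(0)
    # L_local: part-wise terms between visible and synergistic blocks,
    # mirroring the cos(.) + sqrt(2 - 2cos(.)) form of the equation above.
    b_v = f_v.view(n, n_parts, -1)
    b_s = f_s.view(n, n_parts, -1)
    cos = torch.cosine_similarity(b_v, b_s, dim=2)          # (N, n_parts)
    l_local = (cos + torch.sqrt((2 - 2 * cos).clamp(min=0))).sum(1).mean()
    # L_global: align per-identity centers of f_s with those of f_r.
    l_global = f_v.new_zeros(())
    for y in labels.unique():
        m = labels == y
        l_global = l_global + (f_s[m].mean(0) - f_r[m].mean(0)).pow(2).sum()
    return l_local, l_global / n
```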
Cascaded Aggregation Strategy
Due to factors such as shooting perspective, clothing, and occlusion, the results of person retrieval are easily affected [53,33]. To cope with this problem, center loss [23] and triplet loss [14] are widely adopted in ReID to simultaneously learn centralized representations of feature embeddings and mine hard samples. The center loss $L_c$ and triplet loss $L_{tri}$ can be formulated as:
$$
L_c=\frac{1}{N}\sum^{N}_{i=1}\left\|f_i-C_{y_i}\right\|^2_2,\qquad
L_{tri}=\sum^{N}_{i}\left[\left\|f(x^a_i)-f(x^{pos}_i)\right\|^2_2-\left\|f(x^a_i)-f(x^{neg}_i)\right\|^2_2+\alpha\right]_+
$$
where $x_i$ denotes the $i$-th input sample, $C_{y_i}$ is the $y_i$-th class center, $f_i$ is the feature embedding, and $x^a_i$ is the anchor. Center loss focuses on aggregating feature embeddings but neglects the intrinsic differences and diverse semantics of the visible and infrared modalities. Triplet loss specializes in handling hard samples separately rather than considering the comprehensive distribution across modalities, which limits performance. Considering the diverse semantics and structural distribution across modalities, we propose Cascaded Aggregation to progressively optimize the feature distribution, as shown in Fig. 4.
1) Aggregation on the sub-class level. We use the camera ID of each image as a natural sub-class, since images of the same person shot by the same camera are highly similar to each other, where $C_{s_i}$ denotes the $s_i$-th sub-class center:
$$
L_{sub}=\frac{1}{N}\sum^{N}_{i=1}\left\|f^s_i-C_{s_i}\right\|^2_2
$$
2) Aggregation on the intra-class level, which keeps the structural priors of the features during training. The aggregation can be formulated as follows, where $N_s$ denotes the number of sub-classes of each identity.
$$
L_{intra}=\frac{1}{N}\sum^{N}_{i=1}\sum^{N_s}_{j=1}\left\|C_{s_j}-C_{y_i}\right\|^2_2
$$
3) Aggregation on the inter-class level. Our aggregation not only maximizes the similarity of intra-class instances but also maximizes the dissimilarity of inter-class instances overall. The dispersion between different identities and the two types of same-identity aggregation in 1) and 2) are independent of each other. Formally, the dispersion between different identities can be represented as:
$$
L_{inter}=-\frac{2}{N}\sum^{N}_{i=1}\sum^{N}_{j \neq i}\left\|C_{y_i}-C_{y_j}\right\|^2_2
$$
The loss function of CA for metric learning can be represented as:
$$
L_{cascade}=L_{sub}+L_{intra}+L_{inter}
$$
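A compact PyTorch sketch of the three cascaded terms as we interpret them is given below; centers are computed per batch here, whereas an implementation could also maintain running centers, and the normalizations follow our reading of the equations above.

```python
import torch

def cascaded_aggregation_loss(feats, ids, cams):
    """Sketch of L_sub + L_intra + L_inter; feats: (N, D) embeddings,
    ids: (N,) identity labels, cams: (N,) camera ids (natural sub-classes).
    """
    n = feats.size(0)
    # 1) Sub-class level: pull each feature toward its (identity, camera) center.
    keys = ids * (int(cams.max()) + 1) + cams
    l_sub = feats.new_zeros(())
    sub_centers, sub_owner = [], []
    for k in keys.unique():
        m = keys == k
        c = feats[m].mean(0)
        l_sub = l_sub + (feats[m] - c).pow(2).sum()
        sub_centers.append(c)
        sub_owner.append(int(ids[m][0]))
    l_sub = l_sub / n
    # 2) Intra-class level: pull sub-class centers toward their identity center.
    id_centers = {int(y): feats[ids == y].mean(0) for y in ids.unique()}
    l_intra = sum((c - id_centers[y]).pow(2).sum()
                  for c, y in zip(sub_centers, sub_owner)) / n
    # 3) Inter-class level: push identity centers apart (negative dispersion).
    centers = torch.stack(list(id_centers.values()))
    l_inter = -torch.cdist(centers, centers).pow(2).sum() / max(n // 2, 1)
    return l_sub + l_intra + l_inter
```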
Compared with center loss, our method starts from only a few highly similar samples shot by the same camera, which makes sub-class center representations much easier to learn.
Compared with triplet loss, our method handles negative samples simultaneously by guiding them toward their corresponding sub-classes instead of simply pushing them away along the gradient.
Objective Function
Firstly, we utilize the Synergy Loss $L_{Synergy}$ to enrich the representation with diverse semantics. The parameters of the feature extractors $\theta_v$ and $\theta_r$ are updated as:
$$
L_{Synergy}=L(\theta_v,\theta_r)=\lambda_{div}\cdot L_{div}+\lambda_t\cdot L_t
$$
Then, we enhance the synergistic feature representation with the advantages of the two modalities, namely the discriminative local parts from the visible feature and the global identity statistics from the infrared feature. We utilize the Complementary Loss $L_{Com}$ to update the modality synergy feature extractor $\theta_s$:
$$
L_{Com}=L(\theta_s)=\lambda_{local}\cdot L_{local}+\lambda_{global}\cdot L_{global}
$$
Finally, we constrain the distributions of the visible, infrared, and synergistic features $f^v$, $f^r$, $f^s$ with the cascaded aggregation loss $L_{cascaded}$:
$$
L_{cascaded}=L(\theta_v,\theta_r,\theta_s)=L_{sub}+L_{intra}+L_{inter}
$$
Overall, the objective function of our MSCLNet can be summarized as follows:
$$
L_{total}=\lambda_{div}\cdot L_{div}+\lambda_t\cdot L_t+\lambda_{local}\cdot L_{local}+\lambda_{global}\cdot L_{global}+L_{sub}+L_{intra}+L_{inter}
$$
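As a sketch, the overall objective can be assembled from the individual terms; total_loss is a hypothetical helper, and the lambda defaults are the values reported in the implementation details below.

```python
def total_loss(l_div, l_t, l_local, l_global, l_sub, l_intra, l_inter,
               lam_div=0.5, lam_t=1.25, lam_local=0.8, lam_global=1.5):
    # Weighted sum of all objective terms; lambda defaults follow Section 4.2.
    return (lam_div * l_div + lam_t * l_t
            + lam_local * l_local + lam_global * l_global
            + l_sub + l_intra + l_inter)
```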
Experiment
Datasets and Evaluation Protocol
SYSU-MM01
RegDB
Evaluation Protocol.
Implementation Details
Training
We implement MSCLNet with PyTorch on a single NVIDIA RTX 2080 Ti GPU. Each mini-batch contains 64 images (32 visible and 32 infrared) of 8 identities, obtained by randomly selecting 4 visible and 4 infrared images per identity. Our baseline is AGW*, i.e., AGW [51] with Random Erasing. We adopt ResNet-50 [13] pre-trained on ImageNet as the backbone network. Each image is re-scaled to 288 × 144 and augmented with random cropping with zero-padding, random horizontal flipping, and random erasing (80% probability, 80% max-area, 20% min-area). During training, we optimize the feature extractors $\theta_v$, $\theta_r$ and the modality synergy module $\theta_s$ with the SGD optimizer, with initial learning rate $\eta=0.1$ and momentum $p=0.9$. The learning rate is changed to $\eta=0.05$ at epochs 21-50, $\eta=0.01$ at epochs 51-100, and $\eta=0.001$ at epochs 101-200. The hyper-parameters $\lambda_{div}$, $\lambda_t$, $\lambda_{local}$, $\lambda_{global}$ are set to 0.5, 1.25, 0.8, and 1.5, respectively. We synergize visible and infrared instances to train a concise end-to-end network that retrieves specific persons across modalities.
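For reference, the optimizer and the step-wise schedule described above could be set up as in the following sketch; the function names are placeholders, not the paper's released code.

```python
import torch

def build_optimizer(modules, base_lr=0.1, momentum=0.9):
    """SGD over the extractors (theta_v, theta_r) and the synergy module (theta_s)."""
    params = [p for m in modules for p in m.parameters()]
    return torch.optim.SGD(params, lr=base_lr, momentum=momentum)

def adjust_lr(optimizer, epoch):
    """Step schedule from the text: 0.1 / 0.05 / 0.01 / 0.001."""
    if epoch <= 20:
        lr = 0.1
    elif epoch <= 50:
        lr = 0.05
    elif epoch <= 100:
        lr = 0.01
    else:
        lr = 0.001
    for group in optimizer.param_groups:
        group["lr"] = lr
```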
Testing
For testing, the model works in single-shot mode: the query and gallery features are extracted from a single modality by the feature extractor $\theta_v$ or $\theta_r$. The MS and MC modules do not participate in the testing stage.
Ablation Study
In this subsection, we conduct an ablation study to evaluate the effect of each component of MSCLNet, as summarized in Eq. 17. The results are presented in Table 1. We evaluate how much improvement each component brings in the all-search mode on the SYSU-MM01 dataset.
Conclusion and Discussion
In this paper, we propose a novel VI-ReID framework that makes full use of the visible and infrared modality semantics and learns discriminative identity representations by synergizing and complementing instances of the visible and infrared modalities. Different from existing methods that pursue modality-shared information at the risk of losing identity-relevant semantics, MSCLNet provides an innovative approach that explores high-level unity in the VI-ReID task. Meanwhile, we propose the Cascaded Aggregation strategy to optimize the distribution of feature embeddings in a fine-grained and progressive manner, which helps the network discriminate identities and extract more precise and comprehensive features. Experimental results validate the merit of the framework, as well as the effectiveness of each of its components. In future work, we plan to explore background scenes, gender, and appearance to construct better sub-classes.