A Reliable Multimodal Method Considering Modality-Specific Subspace Learning

Abstract

Representation learning is critical for multimodal methods. Traditional consistency-based multimodal methods constrain the disagreement among modality embeddings or predictions as an extra regularization term. However, these methods may suffer performance degradation in open environments. This is mainly attributed to the interference of asymmetric information: the information carried by different modalities diverges, whereas consistency regularization simply minimizes this divergence instead of seeking optimal classifiers. Therefore, it is unsafe to use consistency regularization directly. To this end, we propose modality-specific subspace learning (MSSL), which learns modality-specific subspace representations by treating modality divergence and consistency separately. In particular, MSSL is a semi-supervised framework that maps the feature embeddings of different modalities into shared and independent subspaces. The shared subspace applies reliable consistency regularization by measuring intermodality structural similarities. The independent subspace uses a discriminative modality-separation network to emphasize modality-complementary information. Finally, labeled instances from different modalities are classified with weighted predictions over concatenated embeddings. Consequently, MSSL improves both the single-modal and ensemble classification results and acquires a more robust mapping among modalities. Empirical studies show the superior performance of MSSL on real-world datasets.

Keywords

multimodal learning, semi-supervised learning, subspace learning, reliable regularization

1. Introduction

Multimodal methods aim to fuse information from multiple correlated sources. This improves classification accuracy and strengthens intermodality correlations. As a result, multimodal learning has attracted increasing attention and has been applied in many areas, such as healthcare (Ma et al., 2015), recommendation (Liu et al., 2019), and information retrieval (Liu et al., 2017; Yang et al., 2024b). Formally, the basic assumption behind existing approaches is that each modality has sufficient information for classification on its own. Following this idea, previous consistency-based methods (Brefeld et al., 2006; Farquhar et al., 2005; Yang et al., 2018b) enforced consistency among modalities. The constraints were applied at different levels, including models, features, and predictions. For example, co-regularization style methods (Brefeld et al., 2006; Yang et al., 2018b) aimed to improve single-modal performance by constraining the predictions or similarity matrices of unlabeled multimodal data. In contrast, canonical correlation analysis (CCA)-style methods (Andrew et al., 2013; Wang et al., 2016b; Xie et al., 2020) built cross-modal connections by minimizing embedding discrepancies. Consequently, the consistency principle improves not only single-modality performance but also cross-modal relationships, supporting tasks such as classification and retrieval. However, in open environments, multimodal data often contains asymmetric information. Each modality may provide both shared and independent parts, and the independent parts cause inconsistency. Traditional consistency constraints suppress these inconsistencies, which interfere with joint optimization (Nie et al., 2022; Wang & Zhou, 2013; Yang et al., 2024a). As a result, such methods may fail and even suffer from performance degradation.

Therefore, robust consistency and diversity regularizations are vital. A robust consistency constraint aims to reduce the influence of asymmetric instances and thereby achieve more reliable consistency. For example, Muslea et al. (2003) considered single-track teaching, which employed a predetermined strong modality to assist the weak modality; Yang et al. (2019b) developed robust consistency regularization to eliminate interference from inconsistent instances. In contrast, a diversity measure aims to highlight the complementary, that is, independent, information of each modality. Similar to heterogeneous ensemble learning (Zhou, 2009), the models of different modalities can be regarded as base classifiers, and a diversity measure can improve the voting results in the PAC-learning framework (Li et al., 2012). Accordingly, Wang et al. (2017) and Yang et al. (2019a) combined consistency and diversity constraints on the predictions of different modalities to improve the final classification performance. However, these methods measure both consistency and diversity in a single embedding subspace or in the label prediction distribution, which makes it difficult to learn the shared and independent information simultaneously.

To address this problem, we propose modality-specific subspace learning (MSSL), which explores both consistency and diversity within a unified framework. Specifically, MSSL designs a feature embedding network for each modality. The embeddings are mapped into two subspaces: shared and independent. Shared subspaces capture information consistent across modalities. Independent subspaces capture complementary information, which may appear inconsistent. For classification, we concatenate the shared and independent embeddings of labeled data. The shared subspace is regularized with reliable consistency, while the independent subspace is refined by a modality-separation discriminator. In return, MSSL can effectively learn more discriminative embeddings for each modality. To summarize, the main contributions of our work are:

We establish a unified semi-supervised multimodal deep learning framework, MSSL, which effectively learns modality-specific embeddings to capture both consistency and diversity.

We develop two reliable metrics: a reliable consistency metric on the shared subspace, which captures the correlated information of different modalities, and a diversity metric on the independent subspace, which learns distinguishable information for better classification. As a result, MSSL can be effectively applied to reliable multimodal representation learning.

We conduct extensive experiments on real-world datasets, and our results demonstrate the superior performance.

2. Related Work

This paper focuses on exploring reliable, consistent, and complementary information simultaneously, and realizes it by measuring modal-specific embeddings for different modalities. Therefore, our work is related to the following two aspects: traditional multimodal methods and reliable multimodal methods.

Multimodal learning improves performance by leveraging heterogeneous multisource data, and modal consistency is one of its important principles. Most multimodal methods make full use of unlabeled data by constraining the consistency of different modal predictions, and thereby improve the performance of tasks such as classification and clustering. For example, Brefeld et al. (2006) considered the consistency of different modal predictions, taking the aligned modal predictions of unlabeled instances as pseudo information; Wang et al. (2013a) proposed to utilize the consistency of manifold structure among different modalities and applied it to multimodal clustering. On the other hand, feature consistency constraints can build cross-modal relationships and improve the performance of tasks such as retrieval and captioning. For example, Hotelling (1936) proposed the CCA method, which maximizes the correlation between two modal representations in a shared subspace to acquire correlated feature representations; Andrew et al. (2013) and Ngiam et al. (2011) combined it with deep learning techniques to learn more discriminative feature embeddings; and Zhen et al. (2019) introduced intra-modal and inter-modal consistency measures to improve cross-modal retrieval tasks. Therefore, correct alignment among modalities is a prerequisite for the success of multimodal approaches. However, multimodal data has asymmetric information in an open environment, which interferes with both feature and prediction consistency. Consequently, the traditional co-regularization method can perform even worse than a single modality, as shown in Table 1.

Table 1.
Preliminary investigations: (1) unlabeled data may improve performance, yet it is unsafe. For example, LU$_{mean}$/LUC only performs better than L$_{mean}$ on partial criteria; (2) the traditional co-regularization method may reduce the performance. For example, CoReg performs worse than Img/Text on both datasets.

	FLICKR25K						NUS-WIDE					
Methods	Average Precision ↑	Coverage ↓	Example AUC ↑	Macro AUC ↑	Micro AUC ↑	Rank Loss ↓	Average Precision ↑	Coverage ↓	Example AUC ↑	Macro AUC ↑	Micro AUC ↑	Rank Loss ↓
Img	.807	8.991	.947	.918	.943	.053	.822	2.822	.943	.917	.948	.057
Text	.614	13.401	.863	.751	.827	.137	.764	3.562	.919	.848	.903	.081
L$_{mean}$	.791	9.505	.939	.898	.932	.061	.853	2.398	.958	.930	.958	.042
L$_{max}$	.778	9.696	.936	.890	.928	.064	.848	2.432	.957	.928	.957	.043
LC	.803	8.923	.947	.919	.945	.054	.841	2.606	.951	.927	.955	.049
Img$_{U}$	.810	8.953	.947	.917	.944	.053	.832	2.806	.945	.915	.948	.055
Text$_{U}$	.612	13.414	.862	.740	.821	.138	.767	3.589	.917	.842	.896	.083
LU$_{mean}$	.793	9.487	.940	.897	.931	.060	.856	2.443	.958	.924	.955	.042
LU$_{max}$	.785	9.661	.937	.889	.929	.063	.854	2.468	.957	.922	.954	.043
LUC	.809	8.944	.947	.920	.943	.053	.849	2.628	.952	.921	.953	.048
CoReg	.576	14.537	.863	.801	.862	.137	.813	2.811	.944	.911	.947	.056

Note. AUC = area under the curve.

Therefore, reliable multimodal learning has been studied to eliminate the inconsistency. Robust consistency regularization was first used to eliminate divergent instances. For example, Iwata and Yamada (2016) proposed probabilistic latent variable models for inconsistent multimodal anomaly detection, which assumed that nonanomalous instances are generated from a single latent vector; Yang et al. (2018b) calculated modal weights using a square root loss to eliminate inconsistent instances. However, these methods focus on removing inconsistent outlier instances and do not consider divergence comprehensively. To solve this problem, Wang et al. (2017) combined exclusivity and consistency terms to build complementary representations with a common indicator; Yang et al. (2019a) combined prediction divergence and a robust consistency metric to build a reliable multimodal network. Nevertheless, these methods concentrate on measuring a single embedding subspace or the prediction distributions of different modalities, which makes it difficult to learn consistent information and local independent information among modalities.

Our method differs clearly from several related approaches. For example, CMML (Yang et al., 2019a) mitigates modality insufficiency with instance-level attention and balances consistency with diversity, but it does not explicitly separate shared and modality-specific subspaces. EXMV (Wang et al., 2017) combines consistency and diversity constraints at the prediction level, yet it lacks a unified representation framework. DSCMR (Zhen et al., 2019) learns a common representation space for cross-modal retrieval and enforces global alignment, but it ignores local inconsistencies and modality-specific information. In contrast, our MSSL explicitly disentangles shared and independent subspaces, filters globally inconsistent instances, and integrates modality separation, reliable consistency, and weighted classification into a unified framework.

3. Proposed Method

3.1. Notations

Suppose there are $N$ multimodal instances. Among them, $N_{l}$ are labeled, that is, $X_{l} = \{(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{N_{l}}, y_{N_{l}})\}$. The remaining $N_{u}$ are unlabeled, that is, $X_{u} = \{x_{N_{l} + 1}, \dots, x_{N}\}$. Thus, $N = N_{l} + N_{u}$. Each instance $x_{i} = \{x_{i}^{1}, x_{i}^{2}, \dots, x_{i}^{M}\}$ contains $M$ modalities. The $m$-th modality is represented as a $d_{m}$-dimensional vector. The label $y \in \{0, 1\}^{C}$, where $C$ denotes the number of classes. Without loss of generality, we focus on two modalities: image and text. This is because most large multimodal datasets mainly contain these two types. Our goal is to learn an independent model for each modality, $f_{m} : R^{d_{m}} \to R^{C}$, and to obtain better single-modal and ensemble results. Meanwhile, we can also construct a reliable relationship between the two modalities via robust consistency constraints.

3.2. Preliminaries

In this section, we present the performance of traditional multimodal methods on real multimodal data to verify whether inconsistent data affects performance. Table 1 records the results on two real-world multimodal datasets, namely FLICKR25K (Huiskes & Lew, 2008) and NUS-WIDE (Chua et al., 2009). The compared methods include: (1) each modal classifier trained separately on labeled data, denoted as “Img/Text,” with max/mean voting used to obtain ensemble results, denoted as “L$_{max}$/L$_{mean}$”; (2) a single classifier trained on the concatenated modal feature embeddings of labeled data, denoted as “LC”; (3) each modality trained separately with a semi-supervised method, denoted as “Img$_{U}$/Text$_{U}$,” followed by max/mean voting, denoted as “LU$_{max}$/LU$_{mean}$”; (4) a semi-supervised classifier trained on the concatenated modal feature embeddings of all data, denoted as “LUC”; and (5) the traditional co-regularization method, denoted as “CoReg.” All base models use the same deep architecture. For images, we adopt ResNet18 (He et al., 2015). For text, we use a three-layer fully connected network. For the semi-supervised deep method, we use Tri-Net (Chen et al., 2018).
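The mean and max voting used for the ensemble baselines can be illustrated with a minimal sketch. It assumes each modality outputs one score per label in [0, 1]; the function names `mean_vote` and `max_vote` are hypothetical and are not taken from the original implementation.

```python
def mean_vote(p_img, p_txt):
    """Average the per-label scores of the two modalities."""
    return [(a + b) / 2.0 for a, b in zip(p_img, p_txt)]


def max_vote(p_img, p_txt):
    """Keep, for each label, the larger of the two modality scores."""
    return [max(a, b) for a, b in zip(p_img, p_txt)]


# Toy per-label scores for one instance (illustrative values only).
img_scores = [0.8, 0.2, 0.5]
txt_scores = [0.4, 0.6, 0.5]
mean_scores = mean_vote(img_scores, txt_scores)
max_scores = max_vote(img_scores, txt_scores)
```

Both rules treat the two modalities identically for every instance, which is the limitation that the learned weights in Section 3.4 are meant to address.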

It is notable that performance across modalities varies. For example, in FLICKR25K, the gap between image and text is large, whereas in NUS-WIDE it is relatively small. We observed several notable phenomena:

(1) On the FLICKR25K dataset, models using concatenated features perform better than the ensemble methods, that is, LC and LUC perform better than L$_{mean}$ and LU$_{mean}$. On NUS-WIDE, the results are reversed, that is, L$_{mean}$ and LU$_{mean}$ perform better than LC and LUC. The reversal suggests that a large performance gap between modalities interferes with the ensemble.
(2) Semi-supervised methods may improve single-modality performance, yet their overall performance is worse than that of the supervised methods on some criteria, for example, coverage, macro area under the curve (AUC), and micro AUC. This is because unlabeled data may introduce additional label or structural noise.
(3) The traditional co-regularization method performs even worse than a single modality on some criteria, for example, average precision (AP), macro AUC, and micro AUC. This is attributed to the interference of inconsistent instances in cross-iterative training.

Therefore, we should design a reliable semi-supervised multimodal model that safely utilizes multimodal data and highlights both consistent and divergent information, rather than applying the consensus constraint directly. As a result, the model acquires a more reliable feature representation for each modality.

3.3. Algorithm Overview

In this section, we will introduce the proposed MSSL in detail. Consistency-based methods are often disturbed by inconsistent multimodal instances. These inconsistencies fall into two types: global inconsistency (e.g., mismatched image–text pairs in Figure 1(a)) and local inconsistency (e.g., mismatched blocks in Figure 1(b)). Therefore, MSSL aims to carefully consider these issues by hierarchically measuring inconsistency. First, MSSL computes correlations between feature embeddings of different modalities. A relatively high threshold is applied to filter out globally inconsistent instances. Next, the selected embeddings are mapped into shared and independent subspaces. The shared subspace captures consistent representations, while the independent subspace learns divergent features specific to each modality. Thereby, we can make full use of multimodal consistent information and highlight the modal complementary information in return. With the learned discriminative feature representation, the model can improve the performance of each modality steadily and construct robust relationships between modalities.

Figure 1.

Illustration of the inconsistent multimodal instances: (a) globally inconsistent instance; (b) locally inconsistent instance; and (c) consistent instance.

The detailed algorithmic process is shown in Figure 2. First, each modality is encoded by a separate network (e.g., a convolutional neural network for images and a deep neural network for text). The resulting embeddings $x_{l_{p}}^{1}$ and $x_{l_{p}}^{2}$ are set to the same dimension. If the output dimensions are inconsistent, we can add a mapping layer. Next, the semantic embeddings are used to calculate cross-modal correlations (trapezoid box in Figure 2). Instances with low similarity are removed as globally inconsistent. This step defines the Gated Correlation Objective ($T_{g}$). For the remaining instances, embeddings are mapped into shared and independent subspaces. The shared representation $z^{c}$ captures consistent information across modalities. The independent representations $z^{1}$ and $z^{2}$ capture complementary, locally inconsistent information of each modality. To emphasize divergence, $z^{1}$ and $z^{2}$ are passed into a modality-separation module (red dotted box in Figure 2). The module forces them to be mapped into distinguishable spaces, that is, $z^{1}$ and $z^{2}$ are pushed far apart. This defines the Modal Separation Objective ($L_{sep}$). Meanwhile, the concatenation of the independent and shared representations, that is, $z^{1}$ or $z^{2}$ together with $z^{c}$, is also input to the feature transfer module $g : Z \to R$ (green dotted box in Figure 2; written as $h$ in equation (4)). This module maps the concatenated representation of each modality to a scalar score, from which the weights of $x_{i}^{1}, x_{i}^{2}$ are calculated. Hence, we introduce the corresponding Weight Objective, that is, $T_{w}$. Finally, we combine two parts: the weighted prediction loss on labeled data and the reliable structural consistency regularization on shared representations. Together with the separation term, this yields the Overall Objective ($L$ in equation (7)).

Figure 2.

Illustration of the proposed MSSL. MSSL inputs two different modal feature embeddings into the gated correlation module, which aims to exclude globally inconsistent instances. Then each selected modal embedding maps to a shared and independent subspace, which aims to find local consistent and divergent information. Different modal independent embeddings are input to the modality-separation network, thus to learn complementary information with divergent representation, and shared embeddings are constrained by reliable structure consistency. MSSL also combines additional weighting networks for weighted ensemble prediction using each modal’s concatenated features. Consequently, MSSL acquires a more discriminative representation for each modality. Note. MSSL = modality-specific subspace learning.

3.4. Training Objectives

We first analyze the objective function of each module, then provide the overall objective function.

Gated Correlation Objective.

Mathematically, we first project each modal feature to a low-dimensional space, that is, $x_{l_{p}}^{1}, x_{l_{p}}^{2}$ , and then compute the cross-modal affinity measurement:

$$
s_{i} = \frac{{x_{i, l_{p}}^{1}}^{⊤} x_{i, l_{p}}^{2}}{\sqrt{{x_{i, l_{p}}^{1}}^{⊤} x_{i, l_{p}}^{1}} \sqrt{{x_{i, l_{p}}^{2}}^{⊤} x_{i, l_{p}}^{2}}}, \tag{1}
$$

where $s_{i}$ measures the global similarity of the $i$-th instance across modalities. If $s_{i}$ is smaller than a threshold $σ$, the instance is considered globally inconsistent and is filtered out. Then, each modal embedding is mapped into two subspaces, that is, the shared subspace $z^{c}$ and the independent subspaces $z^{1}, z^{2}$:

$$
\begin{aligned}
z_{i}^{c_{1}} & = {1_{s_{i} \geq σ}} x_{i, l_{p}}^{1} W^{1}, & z_{i}^{1} & = {1_{s_{i} \geq σ}} x_{i, l_{p}}^{1} {\bar{W}}^{1}, \\
z_{i}^{c_{2}} & = {1_{s_{i} \geq σ}} x_{i, l_{p}}^{2} W^{2}, & z_{i}^{2} & = {1_{s_{i} \geq σ}} x_{i, l_{p}}^{2} {\bar{W}}^{2},
\end{aligned} \tag{2}
$$

where $W^{1}, W^{2} \in R^{d \times d_{c}}$ are projection matrices that map embeddings into a $d_{c}$-dimensional shared subspace, which captures consistent representations, and ${\bar{W}}^{1}, {\bar{W}}^{2} \in R^{d \times d_{s}}$ map embeddings into a $d_{s}$-dimensional independent subspace, which captures complementary (locally inconsistent) information. Note that complementary information only accounts for a small percentage, thus $d_{s} ≪ d_{c}$; specifically, we set $d_{s} = 16$ and $d_{c} = 128$. Therefore, consistent instances will not be affected too much by the independent subspace representation. Through the above operations, MSSL can roughly filter highly inconsistent instances, while learning consistent and complementary information for locally inconsistent instances.
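As an illustration of equations (1) and (2), the following minimal sketch computes the cosine gate for one instance and applies the four projections. It uses plain Python lists and toy sizes ($d = 3$, $d_{c} = 2$, $d_{s} = 1$) instead of the paper's dimensions; the helper names (`cosine`, `project`, `gate_and_project`) and the toy matrices are assumptions made for illustration, and the embeddings are assumed to be nonzero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors, as in equation (1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def project(x, W):
    """Row vector x (length d) times matrix W (d rows, k columns)."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]


def gate_and_project(x1, x2, W1, W1_bar, W2, W2_bar, sigma=0.6):
    """Apply the indicator 1[s_i >= sigma] and map into both subspaces (equation (2))."""
    s = cosine(x1, x2)
    gate = 1.0 if s >= sigma else 0.0
    zc1 = [gate * v for v in project(x1, W1)]      # shared embedding, modality 1
    z1 = [gate * v for v in project(x1, W1_bar)]   # independent embedding, modality 1
    zc2 = [gate * v for v in project(x2, W2)]      # shared embedding, modality 2
    z2 = [gate * v for v in project(x2, W2_bar)]   # independent embedding, modality 2
    return s, zc1, z1, zc2, z2


# Toy projection matrices (hypothetical values, not learned parameters).
W_shared = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_indep = [[0.0], [0.0], [1.0]]

kept = gate_and_project([1.0, 0.2, 0.5], [0.9, 0.3, 0.1],
                        W_shared, W_indep, W_shared, W_indep)
dropped = gate_and_project([1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                           W_shared, W_indep, W_shared, W_indep)
```

In this toy run the first pair is similar enough to pass the gate, whereas the second pair is orthogonal and all of its subspace embeddings are zeroed out. In a trained model the projection matrices would be learned rather than fixed.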

Modal Separation Objective.

To highlight the divergence among the independent information of different modalities, we distinguish the modalities with a divergence measure. Without loss of generality, we adopt a discriminator-based modality-separation network following Goodfellow et al. (2014). Specifically, let $D : Z \to \{0, 1\}$ be the modal discriminator, where $Z$ is the independent cross-modal latent semantic feature space (that is, $z^{1}, z^{2}$ in Figure 2) and $\{0, 1\}$ indexes the two modalities: 0 denotes the image modality and 1 denotes the text modality. $D$ is a two-layer fully connected network trained to distinguish the two modal representation subspaces, and it highlights the complementary features of each modality through the gradients it produces. To this end, we formulate the objective as:

$$
L_{sep} = E_{z^{1} \in Z^{1}} \log D (z^{1}) + E_{z^{2} \in Z^{2}} \log (1 - D (z^{2})) . \tag{3}
$$

Note that $L_{sep}$ pushes $z^{1}$ apart from $z^{2}$. Therefore, with the learned $z^{1}, z^{2}$ representation, we can eliminate the locally inconsistent information of each instance, and enhance the complementary information to assist in the following classification.

Weight Objective.

The importance of each modality may vary across instances, especially in the presence of local inconsistencies. Previous methods often used mean or max voting, which ignores this variability. Thus, MSSL turns to utilize an extra weight mechanism to automatically learn the weights for different modalities. The weight mechanism can be formulated as follows:

$$
\begin{aligned}
α_{i, k} & = h ({\hat{z}}_{i}^{k}), \\
{\hat{z}}_{i}^{k} & = z_{i}^{k} \oplus z_{i}^{c_{k}},
\end{aligned} \tag{4}
$$

where $α_{i, k}$ represents the weight for the $i$-th instance on the $k$-th modality, $h (\cdot)$ is an extra neural network (here, a shallow two-layer fully connected network), and $\oplus$ denotes the concatenation operation. Note that the weights $α$ are normalized so that $\sum_{k} α_{i, k} = 1$ after each round. Therefore, we can effectively integrate different modal information for better classification results.
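The concatenation and normalization in equation (4) can be sketched as follows. The text only states that the weights sum to one; the softmax used here is one possible normalization and is an assumption, as are the function names `concat` and `normalize_weights` and the raw scores standing in for the output of $h(\cdot)$.

```python
import math


def concat(z_indep, z_shared):
    """z_hat = z^k (+) z^{c_k}: concatenate independent and shared embeddings."""
    return list(z_indep) + list(z_shared)


def normalize_weights(scores):
    """Map raw per-modality scores to weights that sum to 1 (softmax, assumed)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


z_hat_img = concat([0.3], [1.0, 0.2])        # toy d_s = 1, d_c = 2
alphas = normalize_weights([2.0, -1.0])      # hypothetical outputs of h for image, text
equal = normalize_weights([0.0, 0.0])
```

With equal scores the two modalities receive weight 0.5 each, which recovers mean voting as a special case.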

Overall Objective.

The loss can be separated into two parts: classification loss and consistency regularization. With the calculated weights, the classification loss can be denoted as:

$$
L_{s} = \sum_{i = 1}^{N_{l}} ℓ \left( f \left( \sum_{k = 1}^{2} α_{i, k} {\hat{z}}_{i}^{k} \right), y_{i} \right), \tag{5}
$$

where $f (\cdot)$ is the fully connected layer for prediction and $ℓ (\cdot)$ can be any convex loss function; we use cross-entropy here.
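To make equation (5) concrete, the sketch below fuses the two concatenated embeddings with given weights, applies a linear prediction layer, and evaluates a per-label cross-entropy for one labeled instance. Treating the multilabel case with a sigmoid cross-entropy per label is an assumption, and the names `fuse`, `linear`, and `bce_with_logits` as well as the toy parameters are hypothetical.

```python
import math


def fuse(alphas, z_hats):
    """Weighted sum over modalities: sum_k alpha_k * z_hat^k."""
    dim = len(z_hats[0])
    return [sum(a * z[j] for a, z in zip(alphas, z_hats)) for j in range(dim)]


def linear(x, W, b):
    """Prediction layer f: one row of W and one bias per label."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def bce_with_logits(logits, targets):
    """Mean sigmoid cross-entropy over labels, in a numerically stable form."""
    total = 0.0
    for s, y in zip(logits, targets):
        total += max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))
    return total / len(logits)


fused = fuse([0.5, 0.5], [[2.0, 0.0], [0.0, 2.0]])
logits = linear(fused, W=[[1.0, -1.0], [0.5, 0.5]], b=[0.0, 0.0])
loss = bce_with_logits(logits, targets=[1, 0])
```

Summing this per-instance loss over the $N_{l}$ labeled instances gives the classification term $L_{s}$.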

Beyond traditional consensus regularization, we also minimize inter-modal discrepancies, inspired by Zhen et al. (2019). This encourages embeddings to produce more reliable mappings, as formulated below:

$$
\begin{aligned}
L_{c} & = \frac{1}{N} \sum_{i = 1}^{N} ‖ z_{i}^{c_{1}} - z_{i}^{c_{2}} ‖_{F} + \frac{1}{N_{l}^{2}} \sum_{i, j = 1}^{N_{l}} (\log (1 + e^{p_{i j}}) - s_{i j} p_{i j}), \\
p_{i j} & = \cos (z_{i}^{c_{1}}, z_{j}^{c_{2}}),
\end{aligned} \tag{6}
$$

where $z_{i}^{c_{1}}$ and $z_{i}^{c_{2}}$ are the shared subspace embeddings of the two modalities, $\cos (\cdot, \cdot)$ is the cosine similarity between embeddings of the two modalities, and $s_{i j}$ is 1 if the $i$-th and $j$-th instances are from the same class and 0 otherwise. The first term of $L_{c}$ minimizes the cross-modal distance between the shared embeddings of every instance, labeled or unlabeled. Meanwhile, the second term regularizes the inter-modal structure with the shared representation of labeled examples. Therefore, we can acquire a more discriminative shared representation.
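The two terms of equation (6) can be sketched on a toy batch as follows. The sketch assumes that the labeled instances come first in the batch, so that the class-agreement matrix `s` has one row and column per labeled instance; this ordering and the function names are assumptions made for illustration.

```python
import math


def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def consistency_loss(zc1, zc2, s):
    """L_c: per-instance alignment over all N instances plus a structural
    term over the first N_l = len(s) labeled instances."""
    n, n_l = len(zc1), len(s)
    align = sum(l2_distance(a, b) for a, b in zip(zc1, zc2)) / n
    struct = 0.0
    for i in range(n_l):
        for j in range(n_l):
            p = cosine(zc1[i], zc2[j])
            struct += math.log(1.0 + math.exp(p)) - s[i][j] * p
    return align + struct / (n_l ** 2)


# Two labeled instances from different classes, perfectly aligned across modalities.
s = [[1, 0], [0, 1]]
aligned = consistency_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], s)
# Swapping the second modality's embeddings breaks both alignment and structure.
swapped = consistency_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]], s)
```

The swapped case has a larger loss because its alignment term is nonzero and the same-class pairs no longer have high cosine similarity.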

In conclusion, combining equations (3), (5), and (6), we generate the whole formulation as:

$$
L = L_{sep} + λ_{1} L_{c} + λ_{2} L_{s} . \tag{7}
$$

For optimizing MSSL, we sample a mini-batch at each iteration, and calculate the objective according to equation (7). The model parameters are updated via the Adam optimizer (Kingma & Ba, 2015). With the learned model, we conduct inductive learning. The procedure of training the MSSL model can be summarized as Algorithm ??.
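A minimal sketch of how the three loss values could be combined according to equation (7), together with a simple stopping check on successive objective values, is shown below. The default values $λ_{1} = 1$, $λ_{2} = 10^{-1}$ and the tolerance $10^{-4}$ are the settings reported in Section 4.2; the function names are hypothetical, and the loss values in the example are made up.

```python
def total_objective(l_sep, l_c, l_s, lam1=1.0, lam2=0.1):
    """L = L_sep + lambda_1 * L_c + lambda_2 * L_s, as in equation (7)."""
    return l_sep + lam1 * l_c + lam2 * l_s


def has_converged(history, tol=1e-4):
    """True once the last two recorded objective values differ by less than tol."""
    return len(history) >= 2 and abs(history[-1] - history[-2]) < tol


# Illustrative per-iteration (L_sep, L_c, L_s) values, not experimental results.
history = []
for l_sep, l_c, l_s in [(-1.0, 0.5, 2.0), (-0.8, 0.4, 1.5), (-0.8, 0.40005, 1.5)]:
    history.append(total_objective(l_sep, l_c, l_s))
    if has_converged(history):
        break
```

In an actual training loop each objective value would come from a sampled mini-batch, and the parameter update itself would be performed by the Adam optimizer.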

3.5. Optimization & Theoretical Guarantees

To validate the soundness of our optimization in equation (7), we provide two concise analyses: (i) a convergence guarantee for the modality-separation module under a two-time-scale update, and (ii) an upper bound showing that subspace separation does not increase task risk under mild conditions.

3.5.1. Modality-Separation Under TTUR

Let $ϕ$ be the discriminator parameters for $L_{sep}$ and $θ$ the feature/fusion parameters for $L_{c}, L_{s}$. We adopt a two-time-scale update rule (TTUR) with step sizes $\{β_{t}\}$ for $ϕ$ and $\{α_{t}\}$ for $θ$ such that $\sum_{t} α_{t} = \sum_{t} β_{t} = \infty$, $\sum_{t} α_{t}^{2} < \infty$, $\sum_{t} β_{t}^{2} < \infty$, and $α_{t} / β_{t} \to 0$, so that the discriminator evolves on the faster time scale. Assume: (a) $L_{sep} (θ, ϕ)$ is locally concave in $ϕ$ with Lipschitz gradients; (b) $L_{c} (θ), L_{s} (θ)$ are smooth; (c) parameter trajectories and gradient noise are bounded.

Proposition 1 (TTUR convergence)

Under the above conditions, the ascent–descent iterates

$$
\begin{aligned}
ϕ_{t + 1} & = ϕ_{t} + β_{t} \nabla_{ϕ} L_{sep} (θ_{t}, ϕ_{t}), \\
θ_{t + 1} & = θ_{t} - α_{t} \nabla_{θ} (L_{sep} + λ_{1} L_{c} + λ_{2} L_{s}) (θ_{t}, ϕ_{t + 1}),
\end{aligned} \tag{8}
$$

converge almost surely to a stationary point of $L_{sep} + λ_{1} L_{c} + λ_{2} L_{s}$. If $L_{sep}$ is locally convex–concave near the limit, the limit is a local Nash equilibrium.

Proof.

With $α_{t} / β_{t} \to 0$, the fast-time-scale discriminator tracks $ϕ^{⋆} (θ) \in {\arg \max}_{ϕ} L_{sep} (θ, ϕ)$. The slow-time-scale parameters then descend on $L_{sep} (θ, ϕ^{⋆} (θ)) + λ_{1} L_{c} (θ) + λ_{2} L_{s} (θ)$. Standard two-time-scale stochastic approximation with Lipschitz drifts and bounded noise yields almost-sure convergence; local convex–concavity implies a local Nash equilibrium.

Implication

Updating the separator faster than the encoder/fusion yields stable optimization under TTUR; our update schedule thus aligns with the convergence guarantee.
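The ascent–descent scheme of equation (8) can be illustrated on a toy problem. The sketch below is not the MSSL objective: it uses the scalar function $L(θ, ϕ) = θϕ - ϕ^{2}/2 + θ^{2}/2$, which is concave in $ϕ$ and has its saddle point at the origin, and polynomially decaying step sizes chosen so that the separator step $β_{t}$ dominates the encoder step $α_{t}$. The exponents, constants, and function names are all assumptions made for illustration.

```python
def step_sizes(t, a0=0.5, b0=0.5):
    """Hypothetical schedules: both are non-summable but square-summable,
    and alpha_t / beta_t = t ** -0.3 tends to 0 (phi is the fast variable)."""
    alpha = a0 * t ** -0.9   # slow step for theta (encoder/fusion)
    beta = b0 * t ** -0.6    # fast step for phi (separator)
    return alpha, beta


def run_toy(steps=5000, theta=1.0, phi=0.0):
    """Two-time-scale gradient ascent (phi) and descent (theta) on
    L(theta, phi) = theta * phi - phi**2 / 2 + theta**2 / 2."""
    for t in range(1, steps + 1):
        alpha, beta = step_sizes(t)
        phi = phi + beta * (theta - phi)        # ascent: dL/dphi = theta - phi
        theta = theta - alpha * (phi + theta)   # descent uses the updated phi
    return theta, phi


theta_final, phi_final = run_toy()
```

On this toy function both variables approach the saddle point at zero. The real objective is stochastic and nonconvex, so the sketch only illustrates the update order and the relative step sizes.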

3.5.2. Risk Gap of Shared/Specific Subspaces

Let $f$ be the prediction head in $L_{s}$, and assume that $f$ is $L_{f}$-Lipschitz and that the supervised loss $ℓ$ is bounded and $1$-Lipschitz. Denote the shared-alignment error by $ε_{c} = E ‖ z^{c_{1}} - z^{c_{2}} ‖_{2}$ and the residual correlation between specific subspaces by $ρ_{+} = \max \{0, E [\cos (z^{1}, z^{2})]\}$, with $d_{s} ≪ d_{c}$ the (small) specific versus shared dimensionalities.

Theorem 1 (Risk gap bound)

Let $R_{full}$ be the baseline risk without explicit separation and $R_{split}$ the risk under equation (7) with separated specific subspaces. Then

$$
R_{split} - R_{full} \leq L_{f} ε_{c} + C \sqrt{d_{s}} ρ_{+} + Gen (n), \tag{9}
$$

where $C > 0$ does not depend on $n$ and $Gen (n) = O (n^{- 1 / 2})$ is a lower-order generalization term. In particular, if the shared space is well aligned ($ε_{c}$ small) and the specifics are weakly correlated with small dimension ($ρ_{+}$ small, $d_{s} ≪ d_{c}$), then $R_{split} \leq R_{full} + o (1)$ (no-regret).

Proof.

Couple predictions with and without separation and bound their loss difference via Lipschitz stability of $f$ and $ℓ$ . The representation deviation decomposes into a shared mismatch term ( $ε_{c}$ ) controlled by $L_{c}$ and a private interaction term scaling as $\sqrt{d_{s}} ρ_{+}$ controlled by $L_{sep}$ , plus a uniform-convergence remainder $Gen (n)$ .

Implication

The bound shows separation is risk-nonincreasing when $L_{c}$ ensures small shared mismatch and $L_{sep}$ yields low-dimensional, weakly correlated specifics ( $d_{s} ≪ d_{c}$ , $ρ_{+}$ small), supporting our dimensionality and loss choices.

4. Experiments

4.1. Datasets and Compared Methods

Most existing large-scale multimodal datasets focus on two-modal multilabel classification with image–text pairs. Therefore, we first experiment on two public datasets, namely FLICKR25K (Huiskes & Lew, 2008) and NUS-WIDE (Chua et al., 2009). FLICKR25K consists of 25,000 images collected from the Flickr website, and each image is associated with several textual tags. The text for each instance is represented as a 1386-dimensional bag-of-words vector. Each point is manually annotated with 39 labels. NUS-WIDE contains 260,648 web images, and each image is associated with textual tags. Each point is annotated with 81 concept labels, and we select the 21 most frequent concepts, following Yang et al. (2019a). The text is represented as a 1000-dimensional bag-of-words vector. We also experiment on one real-world complex article dataset, WKG Game-Hub (WKG for short; Yang et al., 2018a), which consists of 25,694 image–text pairs collected from the Game-Hub of “Strike of King” with 54 concept labels. The text is represented as a 512-dimensional vector using Chinese BERT (Cui et al., 2019).

For the compared methods, we first conduct an ablation study to verify the effectiveness of each module. Specifically, we define three variants of MSSL: (1) no modal separation module, denoted as MSSL-S; (2) no weight module, denoted as MSSL-W; and (3) no reliable consistency module, denoted as MSSL-R. Moreover, we compare MSSL with state-of-the-art multimodal methods: Co-trade (Zhang & Zhou, 2011), WNH (Wang et al., 2013a), SLIM (Yang et al., 2018b), ICo-train (Guo & Wang, 2019), EXMV (Wang et al., 2017), CMML (Yang et al., 2019a), M3DN (Yang et al., 2018a), and TagCLIP (Lin et al., 2024). For multilabel classification, we treat each label independently, that is, for each label, we train a corresponding classifier using different modalities.

Moreover, MSSL also yields shared subspace embeddings, so in addition to classification tasks, we use the shared subspace representations for cross-modal retrieval. For this task, we compare MSSL with six state-of-the-art cross-modal methods: CCA (Hotelling, 1936), LCFS (Wang et al., 2013b), JFSSL (Wang et al., 2016a), DCCA (Andrew et al., 2013), VSE++ (Faghri et al., 2018), and DSCMR (Zhen et al., 2019).

4.2. Implementation Details

All experiments are implemented in PyTorch and conducted on a single NVIDIA RTX 3090 GPU with 24 GB of memory. The image encoder is a ResNet-18 pretrained on ImageNet, and the text encoder is a three-layer fully connected network. Unless otherwise specified, the embedding dimension of the shared subspace is set to $d_{c} = 128$ and that of the independent subspace to $d_{s} = 16$. For optimization, we adopt the Adam optimizer (Kingma & Ba, 2015) with an initial learning rate of $1 \times 10^{-3}$ and a weight decay of $5 \times 10^{-4}$. The batch size is fixed at 128, and the models are trained for 100 epochs. Early stopping is applied if the validation loss does not improve for 10 consecutive epochs. We empirically set the threshold parameter $σ$ for filtering inconsistent instances to $0.6$. The tradeoff parameters $λ_{1}$ and $λ_{2}$ in equation (7) are tuned over $\{10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}$, where $λ_{1} = 1$ and $λ_{2} = 10^{-1}$ yield the most stable results across datasets. Code and detailed configurations will be made publicly available upon acceptance.
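The settings above can be collected into a small configuration object, and the early-stopping rule can be written as a short function. This is a minimal sketch, not the released code: the names `TrainConfig` and `stopping_epoch` are hypothetical, and "does not improve" is assumed to mean that the validation loss is not strictly lower than the best value seen so far.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the implementation details (field names are hypothetical)."""
    shared_dim: int = 128        # d_c, shared subspace dimension
    independent_dim: int = 16    # d_s, independent subspace dimension
    learning_rate: float = 1e-3  # initial Adam learning rate
    weight_decay: float = 5e-4
    batch_size: int = 128
    max_epochs: int = 100
    patience: int = 10           # early-stopping window in epochs
    sigma: float = 0.6           # threshold for filtering inconsistent instances
    lambda_1: float = 1.0
    lambda_2: float = 0.1


def stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs (assumption: strict improvement).
    If that never happens, all recorded epochs are used.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)
```

For instance, a loss curve that reaches its minimum at epoch 2 and then stays flat would stop at epoch 12 under a patience of 10.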

For each dataset, we randomly select $33\%$ of the data for testing and use the remaining instances for training. Of the training data, we randomly choose $10\%$ as labeled data and treat the rest as unlabeled. We record six common multilabel classification criteria, namely coverage, ranking loss, average precision (AP), macro AUC, example AUC, and micro AUC, and one common multilabel retrieval criterion, NDCG. For deep learning methods, we report only the best results, following previous baselines (LeCun et al., 2015), because multiple runs are computationally expensive. For traditional methods, which are cheaper to run, we additionally report the mean and standard deviation. The image encoder is implemented with ResNet-18 (He et al., 2015), and the text encoder is a three-layer fully connected network. The parameters $λ_{1}$ and $λ_{2}$ are tuned in $\{10^{-3}, \dots, 10^{1}, 10^{2}\}$. We consider MSSL to have converged when the change in the objective value of equation (7) between iterations falls below $10^{-4}$. For the compared methods, the parameters are tuned according to the original papers.
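The data protocol and the convergence test described above can be sketched as follows. The function names, the fixed seed, and the rounding of split sizes are assumptions made for illustration; the text specifies only the proportions and the $10^{-4}$ tolerance.

```python
import random


def split_indices(n, test_frac=0.33, labeled_frac=0.10, seed=0):
    """Randomly split n instance indices into test, labeled, and unlabeled sets.

    test_frac of all instances go to the test set; labeled_frac of the
    remaining training instances are treated as labeled.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    n_labeled = round(len(train) * labeled_frac)
    labeled, unlabeled = train[:n_labeled], train[n_labeled:]
    return test, labeled, unlabeled


def has_converged(prev_objective, curr_objective, tol=1e-4):
    """True when the objective changed by less than tol between iterations."""
    return abs(prev_objective - curr_objective) < tol
```

With 1,000 instances this yields 330 test, 67 labeled, and 603 unlabeled indices, which are mutually disjoint.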

4.3. Multilabel Classification

We first report the multilabel classification results. Single-modal and ensemble results are listed in Tables 2 to 4. The notation “N/A” means that a method cannot produce a result within 60 h. The best performance for each criterion is bolded, and $↑ / ↓$ indicates that larger/smaller values of the corresponding criterion are better. The results on the benchmark datasets reveal that:

Some traditional methods, for example, Co-Trade, WNH, and ICo-train, fail to improve single-modal performance, and on most criteria their ensemble results are even worse than those of a single modality. We attribute this to the drawbacks of applying the consistency principle directly.

The methods that consider both divergence and consistency constraints, that is, EXMV and CMML, perform much better than traditional multimodal methods. Meanwhile, CMML outperforms EXMV on all criteria of both public datasets, which we attribute to CMML using reliable consistency constraints to eliminate inconsistent instances.

MSSL is superior on most criteria. Notably, MSSL achieves the best single-modal and ensemble results on most performance measures, except for the ensemble results of coverage on FLICKR25K and AP on NUS-WIDE.

In the ablation experiments, performance decreases after removing any one of the modules, except for AP on the NUS-WIDE dataset, which validates the effectiveness of the weighted ensemble and modality-specific subspace learning.
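To make the direction of the $↓$ criteria concrete, the sketch below computes ranking loss and coverage for a small multilabel example. It follows one common definition of each criterion and is an assumption about the exact variant: tied scores are counted as mis-ordered, instances without relevant (or without irrelevant) labels are skipped, and coverage is reported as the ranking depth, whereas some definitions subtract one.

```python
def ranking_loss(y_true, scores):
    """Average fraction of (relevant, irrelevant) label pairs that are mis-ordered."""
    total, counted = 0.0, 0
    for labels, s in zip(y_true, scores):
        pos = [s[j] for j, y in enumerate(labels) if y == 1]
        neg = [s[j] for j, y in enumerate(labels) if y == 0]
        if not pos or not neg:
            continue  # the criterion is undefined for this instance
        bad = sum(1 for p in pos for q in neg if p <= q)
        total += bad / (len(pos) * len(neg))
        counted += 1
    return total / counted if counted else 0.0


def coverage(y_true, scores):
    """Average depth in the ranked label list needed to cover all relevant labels."""
    depths = []
    for labels, s in zip(y_true, scores):
        pos = [s[j] for j, y in enumerate(labels) if y == 1]
        if not pos:
            continue
        lowest = min(pos)  # score of the worst-ranked relevant label
        depths.append(sum(1 for v in s if v >= lowest))
    return sum(depths) / len(depths) if depths else 0.0


y_true = [[1, 0, 0], [0, 1, 1]]
scores = [[0.9, 0.2, 0.1], [0.8, 0.6, 0.3]]
print(ranking_loss(y_true, scores), coverage(y_true, scores))
```

In this example the first instance is ranked perfectly and the second is ranked as badly as possible, so the ranking loss averages to 0.5 and the coverage to 2.0.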

Table 2.

Multilabel Classification Results on Two Public Datasets. Three Common Criteria are Recorded. The Best Performance for Each Criterion is Bolded.

	Coverage ( $\times 10^{1}$ ) $↓$
	FLICKR25K			NUS-WIDE
Methods	Image	Text	Ensemble	Image	Text	Ensemble
Co-Trade	2.013 $\pm$ .135	1.991 $\pm$ .162	2.006 $\pm$ .138	.701 $\pm$ .061	.687 $\pm$ .071	.713 $\pm$ .072
WNH	2.023 $\pm$ .022	1.209 $\pm$ 0	.843 $\pm$ 0	.668 $\pm$ .009	.350 $\pm$ 0	.638 $\pm$ .016
SLIM	.919 $\pm$ 0	2.725 $\pm$ .024	2.286 $\pm$ .041	.275 $\pm$ 0	.797 $\pm$ .007	.250 $\pm$ 0
ICo-Train	3.157	3.157	3.157	1.151	1.151	1.151
EXMV	1.336	1.656	1.459	.309	.525	.347
CMML	.830	1.267	.790	.261	.325	.243
M3DN	3.201	3.047	3.201	1.081	1.152	1.081
TagCLIP	1.950	2.015	1.980	1.813	1.11	1.863
MSSL-S	.824	1.207	.862	.260	.303	.228
MSSL-W	.861	1.206	.883	.268	.302	.233
MSSL-R	.829	1.202	.880	.259	.301	.229
MSSL	.800	1.184	.795	.236	.286	.228
	Example AUC $↑$
Methods	Image	Text	Ensemble	Image	Text	Ensemble
Co-Trade	.745 $\pm$ .029	.753 $\pm$ .033	.747 $\pm$ .030	.764 $\pm$ .028	.773 $\pm$ .034	.758 $\pm$ .034
WNH	.772 $\pm$ .006	.884 $\pm$ 0	.946 $\pm$ 0	.798 $\pm$ .007	.917 $\pm$ 0	.813 $\pm$ .009
SLIM	.937 $\pm$ 0	.565 $\pm$ .007	.712 $\pm$ .012	.944 $\pm$ 0	.743 $\pm$ .009	.952 $\pm$ 0
ICo-train	.437 $\pm$ 0	.437 $\pm$ 0	.437 $\pm$ 0	.632 $\pm$ 0	.653 $\pm$ 0	.661 $\pm$ 0
EXMV	.877	.816	.856	.934	.854	.922
CMML	.952	.877	.960	.949	.920	.956
M3DN	.529	.491	.529	.393	.379	.393
TagCLIP	.772	.413	.740	.869	.382	.817
MSSL-S	.957	.891	.955	.952	.936	.963
MSSL-W	.951	.892	.951	.947	.936	.960
MSSL-R	.956	.893	.952	.952	.937	.963
MSSL	.959	.897	.960	.957	.939	.963
	Macro AUC $↑$
Methods	Image	Text	Ensemble	Image	Text	Ensemble
Co-Trade	.503 $\pm$ .037	.511 $\pm$ .009	.508 $\pm$ .039	.463 $\pm$ .008	.439 $\pm$ .013	.450 $\pm$ .009
WNH	.695 $\pm$ .009	.790 $\pm$ 0	.926 $\pm$ 0	.658 $\pm$ .014	.894 $\pm$ 0	.748 $\pm$ .008
SLIM	.910 $\pm$ 0	.592 $\pm$ .012	.663 $\pm$ .006	.925 $\pm$ 0	.689 $\pm$ .004	.848 $\pm$ 0
ICo-train	.621 $\pm$ 0	.539 $\pm$ 0	.556 $\pm$ 0	.632 $\pm$ 0	.653 $\pm$ 0	.661 $\pm$ 0
EXMV	.826	.623	.813	.905	.745	.899
CMML	.937	.811	.935	.934	.896	.935
M3DN	.558	.499	.558	.577	.498	.577
TagCLIP	.721	.550	.763	.815	.533	.853
MSSL-S	.934	.807	.925	.929	.893	.940
MSSL-W	.929	.807	.923	.927	.893	.938
MSSL-R	.935	.812	.922	.930	.895	.940
MSSL	.937	.818	.937	.939	.903	.938

Note. AUC = area under the curve.

Table 3.

Multilabel Classification Results on Two Public Datasets. Three Common Evaluation Criteria are Reported. The Best Results for Each Criterion are Highlighted in Bold. The Notation “N/A” Indicates that the Method Could not Produce Results Within 60 Hours.

	Micro AUC $↑$
	FLICKR25K			NUS-WIDE
Methods	Image	Text	Ensemble	Image	Text	Ensemble
Co-Trade	.720 $\pm$ .033	.737 $\pm$ .030	.727 $\pm$ .032	.726 $\pm$ .018	.740 $\pm$ .022	.725 $\pm$ .021
WNH	.769 $\pm$ .007	.877 $\pm$ 0	.943 $\pm$ 0	.765 $\pm$ .007	.912 $\pm$ 0	.781 $\pm$ .005
SLIM	.934 $\pm$ 0	.547 $\pm$ .007	.704 $\pm$ .012	.948 $\pm$ 0	.712 $\pm$ .005	.955 $\pm$ 0
ICo-train	.563	.531	.539	.595	.595	.606
EXMV	.866	.809	.852	.936	.856	.927
CMML	.953	.882	.959	.953	.923	.960
M3DN	.495	.532	.495	.634	.558	.635
TagCLIP	.781	.625	.793	.874	.615	.873
MSSL-S	.956	.878	.952	.957	.936	.966
MSSL-W	.951	.880	.950	.952	.937	.963
MSSL-R	.955	.886	.951	.958	.938	.966
MSSL	.959	.891	.959	.963	.943	.966
	Rank loss $↓$
Methods	Image	Text	Ensemble	Image	Text	Ensemble
Co-Trade	.275 $\pm$ .027	.262 $\pm$ .032	.267 $\pm$ .029	.254 $\pm$ .032	.242 $\pm$ .039	.256 $\pm$ .039
WNH	.215 $\pm$ .006	.113 $\pm$ 0	.052 $\pm$ 0	.201 $\pm$ .007	.082 $\pm$ 0	.186 $\pm$ .009
SLIM	.061 $\pm$ 0	.418 $\pm$ .019	.273 $\pm$ .023	.055 $\pm$ 0	.256 $\pm$ .009	.047 $\pm$ 0
ICo-train	.563 $\pm$ 0	.563 $\pm$ 0	.563 $\pm$ 0	.392 $\pm$ 0	.392 $\pm$ 0	.392 $\pm$ 0
EXMV	.123	.184	.144	.066	.146	.078
CMML	.048	.123	.040	.051	.080	.044
M3DN	.471	.509	.471	.607	.621	.607
TagCLIP	.697	.705	.544	.581	.701	.541
MSSL-S	.043	.109	.045	.048	.064	.037
MSSL-W	.049	.108	.049	.053	.064	.040
MSSL-R	.044	.107	.048	.048	.063	.037
MSSL	.041	.103	.040	.043	.061	.037
	Average precision $↑$
Methods	Image	Text	Ensemble	Image	Text	Ensemble
Co-Trade	N/A	N/A	N/A	.432 $\pm$ .071	.445 $\pm$ .086	.428 $\pm$ .081
WNH	.387 $\pm$ .040	N/A	N/A	.348 $\pm$ .010	.738 $\pm$ 0	.371 $\pm$ .004
SLIM	N/A	.239 $\pm$ .011	.352 $\pm$ .033	.820 $\pm$ 0	.308 $\pm$ .004	.832 $\pm$ 0
ICo-train	.258 $\pm$ 0	.258 $\pm$ 0	.258 $\pm$ 0	.252 $\pm$ 0	.252 $\pm$ 0	.252 $\pm$ 0
EXMV	.587	.461	.544	.787	.595	.762
CMML	.825	.598	.845	.831	.688	.850
M3DN	.261	.332	.261	.321	.320	.321
TagCLIP	.447	.318	.690	.602	.538	.753
MSSL-S	.833	.661	.834	.844	.807	.868
MSSL-W	.812	.661	.817	.822	.807	.852
MSSL-R	.837	.663	.832	.847	.808	.871
MSSL	.849	.665	.851	.854	.809	.869

Note. AUC = area under the curve.

Table 4.

Multilabel Classification Results of WKG Game-Hub Dataset. Six Common Criteria are Recorded. The Best Performance for Each Criterion is Bolded.

	WKG Game-Hub
	Coverage ( $\times 10^{1}$ ) $↓$			Micro AUC $↑$
Methods	Image	Text	Ensemble	Image	Text	Ensemble
Co-Trade	1.165 $\pm$ .357	1.152 $\pm$ .368	1.157 $\pm$ .364	.546 $\pm$ .118	.552 $\pm$ .125	.550 $\pm$ .119
WNH	.718 $\pm$ .017	1.631 $\pm$ .021	1.340 $\pm$ .054	.746 $\pm$ .003	.327 $\pm$ .019	.380 $\pm$ .029
SLIM	.382 $\pm$ 0	.701 $\pm$ 0	.390 $\pm$ 0	.906 $\pm$ 0	.745 $\pm$ 0	.902 $\pm$ 0
ICo-train	9.324	9.324	9.324	.468	.516	.507
EXMV	.424	.740	.491	.888	.731	.859
CMML	.359	.752	.341	.918	.713	.923
M3DN	.931	.752	.868	.640	.659	.664
MSSL-S	.337	.716	.421	.926	.743	.887
MSSL-W	.335	.715	.418	.926	.743	.887
MSSL-R	.340	.716	.428	.925	.745	.883
MSSL	.338	.706	.335	.926	.747	.926
	Example AUC $↑$			Rank loss $↓$
Methods	Image	Text	Ensemble	Image	Text	Ensemble
Co-Trade	.549 $\pm$ .182	.555 $\pm$ .185	.555 $\pm$ .184	.477 $\pm$ .188	.465 $\pm$ .188	.464 $\pm$ .188
WNH	.768 $\pm$ .007	.284 $\pm$ .018	.427 $\pm$ .038	.231 $\pm$ .007	.717 $\pm$ .018	.573 $\pm$ .038
SLIM	.913 $\pm$ 0	.778 $\pm$ 0	.907 $\pm$ 0	.086 $\pm$ 0	.213 $\pm$ 0	.092 $\pm$ 0
ICo-train	.479	.508	.502	.506	.506	.506
EXMV	.896	.770	.873	.104	.230	.127
CMML	.919	.750	.926	.081	.250	.074
M3DN	.325	.262	.286	.675	.738	.714
MSSL-S	.928	.779	.903	.072	.221	.097
MSSL-W	.928	.779	.903	.072	.221	.097
MSSL-R	.927	.780	.900	.073	.220	.100
MSSL	.928	.783	.928	.072	.217	.072
	Macro AUC $↑$			Average precision $↑$
Methods	Image	Text	Ensemble	Image	Text	Ensemble
Co-Trade	.489 $\pm$ .019	.503 $\pm$ .002	.488 $\pm$ .013	.363 $\pm$ .191	.368 $\pm$ .198	.366 $\pm$ .194
WNH	.700 $\pm$ .007	.493 $\pm$ .011	.282 $\pm$ .007	.347 $\pm$ .002	.118 $\pm$ .009	.173 $\pm$ .018
SLIM	.862 $\pm$ 0	.549 $\pm$ 0	.862 $\pm$ 0	.783 $\pm$ 0	.626 $\pm$ 0	.747 $\pm$ 0
ICo-train	.479	.508	.502	.197	.197	.197
EXMV	.840	.516	.815	.747	.624	.712
CMML	.882	.505	.884	.801	.421	.810
M3DN	.571	.498	.571	.498	.577	.580
MSSL-S	.892	.542	.817	.822	.631	.803
MSSL-W	.891	.542	.818	.819	.632	.801
MSSL-R	.891	.544	.812	.825	.632	.800
MSSL	.893	.545	.888	.825	.633	.813

Note. AUC = area under the curve.

Moreover, as shown in Table 4, MSSL achieves similarly strong performance on real-world complex article classification. It obtains the best ensemble result on every criterion, although it ranks second in the text results of some criteria, that is, example AUC and rank loss. The results verify the effectiveness of MSSL and suggest an interesting direction for multimodal learning: multimodal information should be fused reliably. That is, beyond safely exploiting the consistency among modalities, modality independence should also be considered to improve ensemble performance.

4.3.1. Investigation on Stability of Parameter

Furthermore, we conduct additional experiments to explore the influence of the parameters $λ_{1}$ and $λ_{2}$. Specifically, we first fix $λ_{1}$ while tuning $λ_{2}$ in $\{10^{-3}, 10^{-2}, \dots, 10^{1}, 10^{2}\}$. We then fix $λ_{2}$ while tuning $λ_{1}$ in $\{10^{-3}, 10^{-2}, \dots, 10^{1}, 10^{2}\}$, and record the performance in Figure 3. Owing to the page limit, we report only the average precision (AP), micro AUC (Mi-AUC), and macro AUC (Ma-AUC) on the three datasets. The figure color changes from blue to yellow, indicating a transition from low to high performance. The figure shows that MSSL achieves stable performance on each dataset with a relatively small $λ_{2}$, which indicates that most multimodal instances are consistent and inconsistent instances are relatively few.
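The one-at-a-time sweep described above can be sketched as below. The `evaluate` callback stands in for training MSSL and returning a criterion such as AP; `toy_evaluate` is an arbitrary placeholder and not a model of the reported results. The values at which the other parameter is held fixed ($λ_{1} = 1$, $λ_{2} = 10^{-1}$) are taken from the implementation details and are an assumption about this experiment.

```python
import math

# Candidate values 1e-3, 1e-2, ..., 1e1, 1e2.
GRID = [10.0 ** e for e in range(-3, 3)]


def sweep(evaluate, fixed_lambda_1=1.0, fixed_lambda_2=0.1, grid=GRID):
    """Vary one trade-off weight at a time while holding the other fixed.

    Returns two dicts mapping the varied value to the score from `evaluate`.
    """
    vary_l2 = {l2: evaluate(fixed_lambda_1, l2) for l2 in grid}
    vary_l1 = {l1: evaluate(l1, fixed_lambda_2) for l1 in grid}
    return vary_l1, vary_l2


def toy_evaluate(lambda_1, lambda_2):
    """Placeholder score; a real run would train the model and return e.g. AP."""
    return 1.0 / (1.0 + abs(math.log10(lambda_1)) + abs(math.log10(lambda_2) + 1.0))


vary_l1, vary_l2 = sweep(toy_evaluate)
best_l1 = max(vary_l1, key=vary_l1.get)
best_l2 = max(vary_l2, key=vary_l2.get)
```

Each returned dictionary corresponds to one axis of a sensitivity plot such as Figure 3.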

Figure 3.

Influence of the parameters $λ_{1}$ , $λ_{2}$ on AP, Mi-AUC, and Ma-AUC criteria of three datasets. AP = average precision; Mi-AUC = micro area under the curve; Ma-AUC = macro area under the curve.

4.4. Case Study of Tri-Modal Extension

We evaluate our method on the tri-modal IEMOCAP dataset (Busso et al., 2008) using the standard splits and evaluation protocol, and compare it against the tri-modal baseline HRG-SSA (Ji et al., 2025). We report class-wise accuracies (Happy/Sad/Neutral). As shown in Table 5, our tri-modal variant surpasses HRG-SSA on all three classes, demonstrating the method’s effectiveness and its scalability beyond two modalities.

Table 5.
IEMOCAP Tri-Modal Results. We Report Per-Class Accuracies for Happy, Sad, and Neutral; Best Results are in Bold.

Method	Happy	Sad	Neutral
HRG-SSA	71.27	84.79	74.50
MSSL (Ours)	72.10	85.12	74.59

Note. MSSL = modality-specific subspace learning.
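For reference, class-wise accuracy can be computed as in the sketch below. It reads "class-wise accuracy" as the percentage of instances of each true class that are predicted correctly; this reading is an assumption, and the standard protocol referenced above may instead use a one-vs-rest binary accuracy per class. The function name and the toy labels are illustrative only.

```python
def per_class_accuracy(y_true, y_pred):
    """Percentage of correctly predicted instances within each true class."""
    correct, total = {}, {}
    for t, p in zip(y_true, y_pred):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (1 if t == p else 0)
    return {c: 100.0 * correct[c] / total[c] for c in total}


acc = per_class_accuracy(
    ["happy", "happy", "sad", "neutral"],
    ["happy", "sad", "sad", "neutral"],
)
```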

4.5. Case Study of Inconsistent Pairs

Because MSSL can effectively filter out inconsistent instances, we conduct a further experiment that sorts the image–text pairs by their similarity $s$ in ascending order and inspects the pairs with the lowest similarity. Figure 4 shows several illustrative examples. Qualitatively, we find that the globally inconsistent instances are mostly pairs whose image and text are irrelevant to each other, and their images tend to be more complicated; the sorted retrieval examples exhibit a similar phenomenon.
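Locating the least consistent pairs amounts to sorting paired embeddings by similarity in ascending order. The sketch below uses cosine similarity between the two shared-subspace embeddings as a stand-in for $s$; this is an assumption, since the similarity used by MSSL is defined in the method section, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def least_consistent_pairs(image_emb, text_emb, k=5):
    """Return (index, similarity) for the k image-text pairs with the lowest similarity."""
    sims = [cosine(i, t) for i, t in zip(image_emb, text_emb)]
    order = sorted(range(len(sims)), key=lambda n: sims[n])
    return [(n, sims[n]) for n in order[:k]]


image_emb = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
text_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
worst = least_consistent_pairs(image_emb, text_emb, k=2)
```

In this toy input, pair 1 has orthogonal embeddings and is returned first, followed by pair 2.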

Figure 4.

Examples of inconsistent instances using the reliable consistency gated module from the WKG Game-Hub dataset.

4.5.1. Cross-Modal Retrieval

MSSL also learns a shared subspace representation for each modality, so we perform cross-modal retrieval to verify its effectiveness. Cross-modal retrieval is evaluated on two tasks: (1) Image $\to$ Text: retrieve relevant texts from the training set using an image query. (2) Text $\to$ Image: retrieve relevant images from the training set using a text query. For the multilabel datasets, we adopt Normalized Discounted Cumulative Gain (NDCG@500; Jarvelin & Kekalainen, 2000) as the performance metric. From Table 6, we observe that MSSL outperforms the compared methods on most datasets:

Deep learning-based methods improve on traditional methods, for example, DCCA outperforms CCA on all datasets.

Label information improves feature learning, for example, the two best linear methods, LCFS and JFSSL, which use deep output features, outperform the unsupervised deep method DCCA on all datasets.

MSSL outperforms both the unsupervised methods and the deep learning-based methods on most datasets, with FLICKR25K as the exception. The reason is that the label concepts of the FLICKR25K dataset are fine-grained, and the deep method DSCMR additionally considers interclass/intraclass structure information, so it performs better on $I \to T$ and Avg.

Table 6.
Comparison of Different Real-Valued Cross-Modal Learning Methods on Three Datasets (NUS/FLICKR Denotes NUS-WIDE/FLICKR25K). NDCG@500 is Recorded. The Best Performance is Bolded.

Dataset	Metric	CCA	LCFS	JFSSL	DCCA	VSE++	DSCMR	MSSL
NUS	I $\to$ T	.445	.581	.560	.526	.597	.606	.617
	T $\to$ I	.429	.603	.577	.529	.600	.606	.613
	Avg	.437	.592	.568	.527	.599	.606	.615
FLICKR	I $\to$ T	.395	.673	.600	.584	.682	.733	.725
	T $\to$ I	.376	.673	.591	.580	.672	.722	.730
	Avg	.386	.673	.595	.582	.677	.728	.727
WKG	I $\to$ T	.512	.634	.611	.580	.642	.741	.752
	T $\to$ I	.498	.642	.618	.575	.633	.743	.756
	Avg	.505	.638	.614	.577	.637	.742	.754

Note. CCA = canonical correlation analysis; MSSL = modality-specific subspace learning.
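The NDCG@k criterion reported in Table 6 can be sketched as follows. Two choices here are assumptions rather than details confirmed by the text: the graded relevance of a retrieved item is taken to be the number of labels it shares with the query, and the gain uses the common $2^{r} - 1$ form with a $\log_{2}$ position discount.

```python
import math


def shared_labels(query_labels, item_labels):
    """Graded relevance: number of labels shared by the query and the item."""
    return len(set(query_labels) & set(item_labels))


def dcg(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum((2 ** r - 1) / math.log2(pos + 2)
               for pos, r in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=500):
    """NDCG@k for one query.

    `ranked_relevances` holds the relevance of every candidate in the order
    returned by the retrieval model; the ideal ranking sorts them descending.
    """
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([3, 2, 1, 0], k=4)
reversed_order = ndcg_at_k([0, 1, 2, 3], k=4)
```

A perfectly ordered list scores 1, and any mis-ordering lowers the score, which is why the criterion rewards placing items with more shared labels earlier.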

4.5.2. Case Study of Cross-Modal Retrieval

To explore the quality of cross-modal retrieval, we also analyze successful and failed retrieval examples. Because the text modality of the NUS-WIDE and FLICKR25K datasets contains only tag information, we present retrieval examples of our method on the real-world WKG dataset, as shown in Figures 5 and 6. Specifically, Figure 5 shows successful retrieval results for $Text \to Image$ and $Image \to Text$, and Figure 6 shows failed retrieval results for the same two tasks. The results reveal that:

Correctly retrieved examples usually carry simple modal information, for example, images depicting a game hero label concept (e.g., Li Bai, Di Renjie, and Guan Yu) and texts describing a skill label (e.g., Assassin). Although several retrieved results do not match the true image/text, they are still highly relevant to the query, for example, the retrieved text contains the hero’s name (i.e., the first case in Figure 5 includes “Li Bai” in the failure examples), or the retrieved image belongs to the game (i.e., the failure examples of the fourth case in Figure 5 belong to the WKG game).

Failure examples are usually more complex and require professional judgment. The reasons are as follows: (1) the modality itself carries insufficient information and contains considerable noise, which increases the difficulty of retrieval, and the image–text correlation of these examples is low, as indicated in Figure 4; and (2) the retrieved samples have few training examples, for example, news or other game examples.

Figure 5.

Successful retrieval results of our method on the WKG dataset. Given a text/image description as a query, we retrieve the most relevant images/texts, ranked from left to right.

Figure 6.

Failed retrieval results of our method on the WKG dataset. Given a text/image description as a query, we retrieve the most relevant images/texts, ranked from left to right.

5. Conclusion

In open environments, multimodal data contains both shared and independent information, which often undermines traditional consistency-based approaches. We proposed MSSL, a unified semi-supervised framework that explicitly models both kinds of information. MSSL integrates three components: modality separation, reliable consistency regularization, and weighted classification. This design expands complementarity at the representation level, enhances ensemble prediction through automatically learned weights, and enforces robust structure-aware co-supervision. Extensive experiments across diverse benchmarks demonstrate that MSSL achieves significant improvements under multiple evaluation criteria.

Footnotes

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

ORCID iD

Yinan Han

References

Andrew, Arora, Bilmes, J. A., & Livescu (2013). Deep canonical correlation analysis. In Proceedings of the international conference on machine learning, ICML’13, Atlanta, GA, USA (pp. 1247–1255). JMLR.

Brefeld, Gartner, Scheffer, & Wrobel (2006). Efficient co-regularised least squares regression. In Proceedings of the international conference on machine learning, ICML’06, Pittsburgh, Pennsylvania, USA (pp. 137–144). Association for Computing Machinery.

Busso, Bulut, Lee, C. C., Kazemzadeh, Mower, Kim, Chang, J. N., Lee, & Narayanan, S. S. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4), 335–359.

Chen, Wang, Gao, & Zhou (2018). Tri-net for semi-supervised deep learning. In Proceedings of the international joint conference on artificial intelligence, IJCAI’18, Stockholm, Sweden (pp. 2014–2020). AAAI Press.

Chua, Tang, Hong, Luo, & Zheng (2009). NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of ACM international conference on image and video retrieval, CIVR’09, Santorini Island, Greece (Article No. 48, pp. 1–9). Association for Computing Machinery.

Cui, Che, Liu, Qin, Yang, & Wang (2019). Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101.

Faghri, Fleet, D. J., Kiros, J. R., & Fidler (2018). VSE++: Improving visual-semantic embeddings with hard negatives. In Proceedings of the British machine vision conference (p. 12). BMVA Press.

Farquhar, J. D. R., Hardoon, D. R., Meng, Shawe-Taylor, & Szedmak (2005). Two view learning: SVM-2K, theory and practice. In Advances in neural information processing systems (pp. 355–362). MIT Press.

Goodfellow, I. J., Pouget-Abadie, Mirza, Warde-Farley, Ozair, Courville, A. C., & Bengio (2014). Generative adversarial networks. CoRR abs/1406.2661.

Guo & Wang (2019). Towards making co-training suffer less from insufficient views. Frontiers of Computer Science, 13(1), 99–105.

He, Zhang, Ren, & Sun (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.

Hotelling (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.

Huiskes, M. J., & Lew, M. S. (2008). The MIR Flickr retrieval evaluation. In Proceedings of ACM international conference on multimedia (pp. 39–43). Association for Computing Machinery.

Iwata & Yamada (2016). Multi-view anomaly detection via robust probabilistic latent variable models. In Advances in neural information processing systems (pp. 1136–1144). Curran Associates Inc.

Jarvelin & Kekalainen (2000). IR evaluation methods for retrieving highly relevant documents. In Proceedings of the annual international ACM SIGIR conference on research and development in information retrieval (pp. 41–48). Association for Computing Machinery.

Ji, Zhao, & Gao (2025). Hybrid relational graphs with sentiment-laden semantic alignment for multimodal emotion recognition in conversation. In IJCAI (pp. 2973–2981). ijcai.org.

Kingma, D. P., & Ba (2015). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA.

LeCun, Bengio, & Hinton, G. E. (2015). Deep learning. Nature, 521(7553), 436–444.

Li, Yu, & Zhou (2012). Diversity regularized ensemble pruning. In Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (pp. 330–345). Springer.

Lin, Chen, Zhang, Yang, Lin, Liu, & Cai (2024). TagCLIP: A local-to-global framework to enhance open-vocabulary multi-label classification of CLIP without training. In AAAI conference on artificial intelligence (pp. 3513–3521). AAAI Press.

Liu, Huang, & Zhang (2017). Cross-modality binary code learning via fusion similarity hashing. In Proceedings of the conference on computer vision and pattern recognition (pp. 6345–6353). IEEE.

Liu, Tong, Zhang, Duan, & Xiong (2019). Hydra: A personalized and context-aware multi-modal transportation recommendation system. In Proceedings of the international conference on knowledge discovery and data mining (pp. 2314–2324). Association for Computing Machinery.

Ma, Zhang, Wan, Zhang, & Pan (2015). Robot and cloud-assisted multi-modal healthcare system. Cluster Computing, 18(3), 1295–1306.

Muslea, Minton, & Knoblock (2003). Active learning with strong and weak views: A case study on wrapper induction. In Proceedings of the international joint conference on artificial intelligence (pp. 415–420). Morgan Kaufmann Publishers Inc.

Ngiam, Khosla, Kim, Nam, Lee, & Ng, A. Y. (2011). Multimodal deep learning. In Proceedings of the international conference on machine learning (pp. 689–696). Omnipress.

Nie, Cao, Ding, & Zhou (2022). A total variation with joint norms for infrared and visible image fusion. IEEE Transactions on Multimedia, 24, 1460–1472.

Wang, Nie, & Huang (2013a). Multi-view clustering and feature learning via structured sparsity. In Proceedings of the international conference on machine learning (pp. 352–360). PMLR.

Wang & Tan (2016a). Joint feature selection and subspace learning for cross-modal retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10), 2010–2023.

Wang & Tan (2013b). Learning coupled feature spaces for cross-modal matching. In Proceedings of the IEEE international conference on computer vision (pp. 2088–2095). IEEE.

Wang, Yin, & Wang (2016b). A comprehensive survey on cross-modal retrieval. CoRR abs/1607.06215.

Wang & Zhou (2013). Co-training with insufficient views. In Proceedings of the Asian conference on machine learning (pp. 467–482). PMLR.

Wang, Guo, Lei, Zhang, & Li, S. Z. (2017). Exclusivity-consistency regularized multi-view subspace clustering. In Proceedings of the conference on computer vision and pattern recognition (pp. 1–9). IEEE.

Xie, Deng, Liu, & Tao (2020). Multi-task consistency-preserving adversarial hashing for cross-modal retrieval. IEEE Transactions on Image Processing, 29, 3626–3637.

Yang, Wan, & Jiang (2024a). Facilitating multimodal classification via dynamically learning modality gap. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, & C. Zhang (Eds.), Advances in neural information processing systems (pp. 62108–62122). Curran Associates, Inc.

Yang, Wang, Zhan, Xiong, & Jiang (2019a). Comprehensive semi-supervised multi-modal learning. In Proceedings of the international joint conference on artificial intelligence (pp. 4092–4098). AAAI Press.

Yang, Zhan, Liu, & Jiang (2018a). Complex object classification: A multi-modal multi-instance multi-label deep network with optimal transport. In Proceedings of the international conference on knowledge discovery and data mining (pp. 2594–2603). Association for Computing Machinery.

Yang, Zhan, Liu, & Jiang (2019b). Deep robust unsupervised multi-modal network. In Proceedings of the AAAI conference on artificial intelligence (pp. 5652–5659). AAAI Press.

Yang, Zhou, & Tang (2024b). Rebalanced vision-language retrieval considering structure-aware distillation. IEEE Transactions on Image Processing, 33, 6881–6892.

Yang, Zhan, Sheng, & Jiang (2018b). Semi-supervised multi-modal learning with incomplete modalities. In Proceedings of the international joint conference on artificial intelligence (pp. 2998–3004). AAAI Press.

Zhang & Zhou (2011). Co-trade: Confident co-training with data editing. IEEE Transactions on Systems, Man, and Cybernetics, 41(6), 1612–1626.

Zhen, Wang, & Peng (2019). Deep supervised cross-modal retrieval. In IEEE conference on computer vision and pattern recognition (pp. 10394–10403). IEEE.

Zhou (2009). Ensemble learning. In Encyclopedia of biometrics (pp. 270–273). Springer US.

	FLICKR25K						NUS-WIDE
	Average	Coverage	Example	Macro	Micro	Rank	Average	Coverage	Example	Macro	Micro	Rank
Methods	Precision $↑$	$↓$	AUC $↑$	AUC $↑$	AUC $↑$	Loss $↓$	Precision $↑$	$↓$	AUC $↑$	AUC $↑$	AUC $↑$	Loss $↓$
Img	.807	8.991	.947	.918	.943	.053	.822	2.822	.943	.917	.948	.057
Text	.614	13.401	.863	.751	.827	.137	.764	3.562	.919	.848	.903	.081
L $_{mean}$	.791	9.505	.939	.898	.932	.061	.853	2.398	.958	.930	.958	.042
L $_{max}$	.778	9.696	.936	.890	.928	.064	.848	2.432	.957	.928	.957	.043
LC	.803	8.923	.947	.919	.945	.054	.841	2.606	.951	.927	.955	.049
Img $_{U}$	.810	8.953	.947	.917	.944	.053	.832	2.806	.945	.915	.948	.055
Text $_{U}$	.612	13.414	.862	.740	.821	.138	.767	3.589	.917	.842	.896	.083
LU $_{mean}$	.793	9.487	.940	.897	.931	.060	.856	2.443	.958	.924	.955	.042
LU $_{max}$	.785	9.661	.937	.889	.929	.063	.854	2.468	.957	.922	.954	.043
LUC	.809	8.944	.947	.920	.943	.053	.849	2.628	.952	.921	.953	.048
CoReg	.576	14.537	.863	.801	.862	.137	.813	2.811	.944	.911	.947	.056
