Learning hierarchical embedding space for image-text matching

Abstract

There are currently two mainstream strategies for image-text matching. The first, joint embedding learning, models the semantic information of both images and sentences in a shared feature subspace, which facilitates the measurement of semantic similarity but focuses only on the global alignment relationship. To explore the local semantic relationship more fully, the second, metric learning, learns a complex similarity function that directly outputs a score for each image-text pair. However, it incurs a significantly heavier computational burden at the retrieval stage. In this paper, we propose a hierarchical joint embedding model that incorporates the local semantic relationship into a joint embedding learning framework. The proposed method learns the shared local and global embedding spaces simultaneously, and models the joint local embedding space with respect to specific local similarity labels, which are easy to obtain from the lexical information of the corpus. Unlike methods based on metric learning, we can prepare fixed representations of both images and sentences by concatenating the normalized local and global representations, which makes efficient retrieval feasible. Experiments show that the proposed model achieves competitive performance compared with existing joint embedding learning models on two publicly available datasets, Flickr30k and MS-COCO.

Keywords

Information retrieval, cross-modal representation, hierarchical embedding, local alignment

1. Introduction

Image-text matching aims to retrieve semantically relevant images given a text query, and vice versa. It is a basic task underlying many multimedia machine learning tasks, e.g., image captioning, cross-modal information retrieval and visual question answering. However, the natural discrepancy between the representations of text and images makes it challenging to capture the semantic relationships between them. As illustrated on the left side of Fig. 1, to bridge the heterogeneous gap, joint embedding learning models [1, 2, 3, 4, 5, 6, 7, 8, 9] apply separate encoders to project both sentences and images into a shared feature subspace in which semantic similarity can be measured with conventional metrics, e.g., Euclidean distance. Because the image and text encoders are mutually independent, joint embedding learning turns the retrieval problem into a vector-based ranking problem, which makes it feasible to efficiently retrieve targets among a huge number of candidate items by precomputing the semantic feature vectors of all database items. Kiros et al. made the first attempt to project both images and sentences into a common space, and learned cross-modal representations with a hinge-based bidirectional triplet rank loss [10]. Liu et al. introduced a recurrent residual block to progressively enhance the feature embedding [8], and applied a fusion module to integrate the output representations at all time steps into a powerful representation, which is mapped into the shared feature space using a fully connected layer. Faghri et al. further improved retrieval performance by computing the triplet rank loss with respect to the hard negatives in each mini-batch [4]. Semedo et al. proposed an adaptive maximum-margin strategy that adopts a dynamic margin value when computing the rank loss function at the training stage [11]. Some methods [12, 13] introduced an instance-based classification loss to capture intra-modal discriminative information. Zhang et al. proposed a cross-modal projection matching loss and a classification loss, which respectively minimize a KL divergence and an intra-modal classification loss with respect to the projection onto the other modality [5], thereby incorporating an inter-modal correlation constraint into the intra-modal classification loss.

Additionally, some multi-modal learning tasks are also utilized to further bridge the heterogeneous gap, such as cross-modal generation tasks [1, 3], adversarial learning tasks [9] and reconstruction tasks [14, 15]. Gu et al. proposed a model for learning a global feature embedding space and a grounded feature embedding space [6], and introduced a sentence generative module and an image generative adversarial module to supervise the learning of the grounded embedding. Hu et al. proposed a scalable multi-modal retrieval model [15] which projects each modality into a predefined common label space using an individual auto-encoder. Therefore, the cross-modal representation of data from a new modality can be learned without retraining the whole model. Besides, given a set of detected salient object regions in an image, Huang et al. learned a semantic concept and order representation by fusing a semantic unit representation and a referred global representation [7], and introduced a sentence generation module conditioned on the fusion vector to supervise the fused representation learning. Hong et al. first pretrained a general embedding model on several large multi-modal datasets, and then fine-tuned the pretrained model for a specific downstream task [2]. To enhance the discriminative power of the learned joint representations, existing models generally focus on introducing either novel loss functions as explicit constraints or additional auxiliary tasks as implicit constraints. However, these models only exploit the semantic relationships between holistic images and sentences when aligning cross-modal representations in the shared feature subspace, and thus fail to explore local semantic relationships, which are important for better understanding and matching semantic information in the cross-modal retrieval task. In other words, due to the complexity of semantic concepts in practice, it is almost impossible to find a textual description that matches all the semantic information potentially expressed in an image, and vice versa. This means that some mismatched information generally exists between an image and its semantically relevant sentence, which is not fully in line with the loss functions widely adopted in the joint embedding learning framework, e.g., the hinge-based triplet rank loss. Therefore, it is necessary to introduce local semantic information to alleviate this problem.

Figure 1.

The demonstrations of joint embedding learning (left) and metric learning (right).

As illustrated on the right side of Fig. 1, metric learning models learn a similarity function which directly outputs the similarity scores of all image-text pairs. In addition to the modality-specific encoders, metric learning introduces a specifically designed cross-modal operation to measure semantic similarity, so it is infeasible to find a feature space shared across all visual and textual database items. Many existing metric learning methods explore fine-grained semantic relationships at the level of image regions and words through attention mechanisms [16, 17, 18, 19, 20]. Nam et al. proposed the Dual Attentional Network [17] to capture the fine-grained interplay between image and sentence through multiple time steps. Similarly, Chen et al. proposed matching image and sentence with a recurrent attention memory through multiple time steps [21]. Lee et al. proposed the stacked cross attention mechanism [18] to capture the semantic similarity between all possible pairs of image regions and words, generating powerful representations for the semantic components of both image and sentence. Ji et al. deployed an asymmetrical saliency-guided network [20] for generating powerful textual representations corresponding to each specific image. Zhang et al. proposed a context-aware network [19] to simultaneously capture inter-modal semantic correspondence and intra-modal semantic correlation. Yu et al. proposed a heterogeneous attention model [16] with an adaptive-weighted hard negative rank loss.

Beyond the attention mechanism, some methods [22] designed novel neural network architectures to adaptively control the information flow, which can help models attend to objects with rich semantic information. Besides, some methods [23, 24] transformed image and textual data into scene graphs, and learned to measure the semantic similarity between the heterogeneous graph data. In contrast to joint embedding learning, metric learning models generally involve complex interaction operations over multi-modal data, such as cross-modal attention mechanisms and fusion strategies. In other words, it is impractical to find a fixed representation for each database item, which means that the scores of all possible image-sentence pairs have to be computed exhaustively from scratch. When dealing with a huge amount of data, e.g., in social media and search engines, metric learning based methods introduce an unbearable computational burden and time cost at the retrieval stage.

In this paper, we propose a novel hierarchical joint embedding model that incorporates the local semantic relationship into a joint embedding framework. The proposed model captures both local and global inter-modal semantic correlations by learning the joint local and global embedding spaces simultaneously. Concretely, the proposed model is a cascade of a local and a global semantic embedding module. The former takes a given image-sentence pair as input and outputs their local feature vectors in the shared local semantic subspace, in which the local similarity is measured. The latter then takes the set of local feature vectors as input and outputs the corresponding global semantic representations of images and sentences in the shared global semantic subspace. At the training stage, the local similarity labels derived from the lexical information of the corpus are used to supervise the training of the local embedding module, which allows us to incorporate local semantic correlation into the joint embedding learning framework. At the retrieval stage, the fixed semantic representation of each database item is generated by concatenating its normalized local and global representations. Our contributions are summarized as follows:

We propose a novel framework that introduces local semantic alignment information into a joint embedding learning framework. By concatenating the normalized local and global semantic representations, the proposed model still supports efficient retrieval in the same way as existing joint embedding learning methods.

In contrast to existing methods, we design a novel local semantic alignment strategy for modeling the local feature subspace. The local similarity labels are easy to obtain from the lexical information of the corpus at low time cost.

2. Learning hierarchical embedding space

The proposed model aims to embed both local and global semantic information into the cross-modal representations. Concretely, the joint local and global embedding spaces are learned from the training data simultaneously. We derive the local similarity labels from the distribution of words in the corpus, and use them to optimize the parameters of the local embedding modules. By concatenating the normalized representations in the local and global embedding spaces, we obtain fixed representations of all database items for efficient retrieval. In this section, we introduce the proposed model from two aspects: the model architecture in Section 2.1 and the training strategy in Section 2.2.

Figure 2.

An overview of the proposed hierarchical embedding model.

2.1 Model architecture

An overview of the proposed model is presented in Fig. 2. As illustrated there, our model encodes the local and global semantic information of images and sentences with two mutually independent embedding branches, each of which is a cascade of local and global embedding modules. The local embedding module extracts a set of semantic feature vectors for each database item. An attention-based pooling strategy is proposed, and the average of all feature vectors is normalized as the final local semantic representation, which is further used to match semantic information at the level of image regions and words. Given the set of local semantic feature vectors, the global embedding module outputs the final global representation for measuring semantic similarity at the level of the holistic image and sentence. Finally, the local and global semantic representations are normalized and concatenated as the universal representation of each database item. We detail the image embedding branch in Section 2.1.1 and the text embedding branch in Section 2.1.2.

2.1.1 Image embedding branch

Owing to their great success in computer vision, Convolutional Neural Networks (CNNs) [25, 26, 27] have been widely applied to encode images in various image-text processing tasks. We first apply a pretrained CNN model to represent an image as a sequence of feature vectors, each of which corresponds to a specific region in the image. The precomputed feature vectors are fixed during both the training and retrieval stages. We then project each feature vector into the shared local embedding space with the multi-head self-attention mechanism [28]. The core of the self-attention mechanism is the scaled dot-product attention described in Eq. (1), where $Q\in R^{n_{q}\times d_{k}}$ , $K\in R^{n_{k}\times d_{k}}$ and $V\in R^{n_{k}\times d_{v}}$ are the row-wise packed queries, keys and values, respectively.

$\displaystyle\text{Attention}(Q,K,V)=\textit{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$ (1)
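As a minimal NumPy sketch of Eq. (1) (an illustration under our own choice of toy shapes, not the implementation used in the experiments), the scaled dot-product attention over row-wise packed queries, keys and values can be written as follows. The helper name `attention` and the random inputs are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (1).

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v).
    """
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (n_q, n_k), rows sum to 1
    return weights @ V

# Toy inputs (hypothetical sizes).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 4))
out = attention(Q, K, V)  # shape (3, 4)
```

Each output row is a convex combination of the rows of $V$ ; with all-zero queries the weights become uniform and every output row reduces to the mean of the values.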

Then, as illustrated in Eq. (2), the multi-head mechanism concatenates the outputs of multiple parallel scaled dot-product attention operations and transforms the result with a single linear map. $W^{Q}_{i}$ , $W^{K}_{i}$ and $W^{V}_{i}$ are the learned weight matrices that map the queries, keys and values into the view of the $i$ -th attention head. Following the rule of the self-attention mechanism, $Q$ , $K$ and $V$ are all equal to the input sequence of feature vectors. The local encoder can be formulated as $v^{l}=\text{MultiHead}(I,I,I)$ , where $I\in R^{M\times d_{p}}$ is the precomputed image feature matrix, $M$ is the number of regions and $d_{p}$ is the dimension of the precomputed representations. $v^{l}\in R^{M\times d_{l}}$ is the local feature matrix and $d_{l}$ is the dimension of the shared local embedding space. Specifically, in the local encoder the weight matrix $W^{O}$ is removed and the linear map $W^{Q}$ in the query branch is followed by a normalization layer [29].

$\displaystyle\text{MultiHead}(Q,K,V)=\textit{Concat}(\textit{head}_{1},\ldots,\textit{head}_{h})W^{O}$

where

$\displaystyle\textit{head}_{i}=\text{Attention}(QW^{Q}_{i},KW^{K}_{i},VW^{V}_{i}),\quad i=1,\ldots,h$ (2)
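The local image encoder $v^{l}=\text{MultiHead}(I,I,I)$ can be sketched as below. This is a hedged illustration rather than the exact implementation: we assume the normalization layer is layer normalization without learned gain and bias, that it is applied per head to the projected queries, and that weights act by right-multiplication on row-wise packed inputs. The function name `local_image_encoder` and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Assumption: layer normalization without learned gain/bias.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def local_image_encoder(I, WQ, WK, WV):
    """v^l = MultiHead(I, I, I) with the output map W^O removed.

    I: (M, d_p); WQ, WK, WV: (h, d_p, d_head). Returns (M, h * d_head).
    """
    heads = []
    for Wq, Wk, Wv in zip(WQ, WK, WV):
        q = layer_norm(I @ Wq)            # query projection followed by normalization
        k, v = I @ Wk, I @ Wv
        w = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
        heads.append(w @ v)
    return np.concatenate(heads, axis=-1)  # concatenation only, no W^O

# Toy example: 36 regions, hypothetical small dimensions.
rng = np.random.default_rng(0)
M, d_p, h, d_head = 36, 16, 4, 3
I = rng.normal(size=(M, d_p))
WQ, WK, WV = (rng.normal(size=(h, d_p, d_head)) * 0.1 for _ in range(3))
v_l = local_image_encoder(I, WQ, WK, WV)   # shape (36, 12), i.e. d_l = h * d_head
```

Because no position encoding is added at this stage, permuting the input regions simply permutes the rows of the output, so the set of local feature vectors does not depend on region order.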

To generate a powerful global semantic representation, we first apply the multi-head self-attention mechanism to further refine the local feature vectors. We view the global information as the fusion of the natural semantic order and the local semantic features. Therefore, it is pivotal to incorporate the semantic order information into the generation of the global representation. To make the global module position-sensitive, we add a learnable position encoding to the input sequence before feeding it into the attention module. Besides, we adopt the average of the local feature vectors as the query, which yields a single enhanced feature vector for each image. Then we apply a two-layer feedforward network to project the enhanced feature vector into the shared global semantic subspace. Between the two fully connected layers we insert a normalization layer and a Gaussian error linear unit (GELU) [30] activation, applied in the order given in Eq. (3). The GELU activation function is defined as $f(x)=x\Phi(x)$ , where $\Phi(x)$ is the cumulative distribution function of the standard Gaussian distribution. Unlike the rectified linear unit (ReLU), GELU is differentiable everywhere, which makes it smoother in the neighborhood of 0. The feedforward network is formulated in Eq. (3), where $W_{v,g}^{n}$ and $b_{v,g}^{n}$ are the weight matrix and bias vector of the $n$ -th fully connected layer.

$\displaystyle\text{FFN}(x)=W_{v,g}^{2}(\text{GELU}(\text{LayerNorm}[W_{v,g}^{1}x+b_{v,g}^{1}]))+b_{v,g}^{2}$ (3)

$\displaystyle v^{g}=\text{FFN}(\text{MultiHead}(\text{Avg}(v^{l}+W_{pe}),v^{l},v^{l}))$ (4)

As in prior work [28], we apply the combination of a residual connection and a normalization layer after both the attention and FFN layers. For simplicity, the global encoder is formulated as Eq. (4), where residual connections and layer normalization are omitted, $v^{g}\in R^{d_{g}}$ is the global representation of the image $v$ , and $d_{g}$ is the dimension of the shared global semantic embedding space. $W_{pe}$ is the position encoding matrix, which is randomly initialized and learned in an end-to-end manner.
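The following sketch illustrates Eqs (3) and (4) under simplifying assumptions: a single attention head with identity projections, no residual connections, and layer normalization without learned parameters. It follows Eq. (4) literally, so the position encoding enters only through the averaged query. All names and sizes are hypothetical.

```python
import math
import numpy as np

def gelu(x):
    # f(x) = x * Phi(x), with Phi the standard normal CDF expressed via erf.
    erf = np.vectorize(math.erf)
    return x * 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def ffn(x, W1, b1, W2, b2):
    # Eq. (3) in row-vector form: linear -> LayerNorm -> GELU -> linear.
    return gelu(layer_norm(x @ W1 + b1)) @ W2 + b2

def global_encoder(v_l, W_pe, W1, b1, W2, b2):
    # Eq. (4): the query is the average of the position-encoded local features.
    query = (v_l + W_pe).mean(axis=0)                      # (d_l,)
    weights = softmax(v_l @ query / np.sqrt(len(query)))   # (M,) attention over regions
    pooled = weights @ v_l                                 # single enhanced vector
    return ffn(pooled, W1, b1, W2, b2)                     # (d_g,)

rng = np.random.default_rng(0)
M, d_l, d_ff, d_g = 5, 8, 16, 6
v_l = rng.normal(size=(M, d_l))
W_pe = rng.normal(size=(M, d_l)) * 0.1
W1, b1 = rng.normal(size=(d_l, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_g)) * 0.1, np.zeros(d_g)
v_g = global_encoder(v_l, W_pe, W1, b1, W2, b2)
```

Using a single averaged query is what collapses the $M$ local vectors into one vector per image before the feedforward projection.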

2.1.2 Text embedding branch

Given a sentence $T$ , we embed its words into a low-dimensional feature space through an embedding matrix $W_{e}\in R^{d_{e}\times|V|}$ , where $|V|$ is the vocabulary size and $d_{e}$ is the dimension of the word embedding. Among the text encoders that have achieved great success in natural language processing, we adopt a bidirectional GRU [31] to further project the word embeddings into the shared local feature subspace. The computation in the forward direction is described in Eqs (5)–(8). Recurrent models can capture long-term semantic order and dependency information by sharing parameters across all time steps and maintaining a memory state. Apart from the reversed order of the input sequence, the backward computation is the same as in the forward direction. In our work, the forward and backward hidden states are concatenated as the local representation of the $t$ -th word in the sentence. The local encoder can be formulated as $t^{l}=f_{t,l}(t;W_{e},\theta_{\textit{GRU}})$ , where $\theta_{\textit{GRU}}$ is the set of parameters of the GRU encoder.

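$\displaystyle r_{t}=\sigma(W_{hr}h_{t-1}+W_{ir}x_{t}+b_{r})$ (5)

$\displaystyle z_{t}=\sigma(W_{hz}h_{t-1}+W_{iz}x_{t}+b_{z})$ (6)

$\displaystyle n_{t}=\text{tanh}(W_{in}x_{t}+b_{in}+r_{t}*(W_{hn}h_{t-1}+b_{hn}))$ (7)

$\displaystyle h_{t}=(1-z_{t})*n_{t}+z_{t}*h_{t-1}$ (8)

Here $\sigma$ denotes the logistic sigmoid applied to the reset and update gates, as in the standard GRU formulation, and $*$ denotes the element-wise product.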
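For illustration, the recurrence of Eqs (5)–(8) and the bidirectional concatenation can be sketched in NumPy as follows. The reset and update gates pass through a logistic sigmoid, as in the standard GRU formulation; the parameter names, initialization scale and sizes are hypothetical, and the initial hidden state is assumed to be zero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    r = sigmoid(p["W_hr"] @ h_prev + p["W_ir"] @ x_t + p["b_r"])   # reset gate, Eq. (5)
    z = sigmoid(p["W_hz"] @ h_prev + p["W_iz"] @ x_t + p["b_z"])   # update gate, Eq. (6)
    n = np.tanh(p["W_in"] @ x_t + p["b_in"]
                + r * (p["W_hn"] @ h_prev + p["b_hn"]))            # candidate, Eq. (7)
    return (1.0 - z) * n + z * h_prev                              # new state, Eq. (8)

def run_gru(X, p, d_h):
    h = np.zeros(d_h)  # assumption: zero initial state
    states = []
    for x_t in X:
        h = gru_step(x_t, h, p)
        states.append(h)
    return np.stack(states)

def init_params(rng, d_in, d_h, scale=0.1):
    p = {}
    for g in ("r", "z", "n"):
        p[f"W_i{g}"] = rng.normal(size=(d_h, d_in)) * scale
        p[f"W_h{g}"] = rng.normal(size=(d_h, d_h)) * scale
    for b in ("b_r", "b_z", "b_in", "b_hn"):
        p[b] = np.zeros(d_h)
    return p

def bigru(X, p_fwd, p_bwd, d_h):
    fwd = run_gru(X, p_fwd, d_h)
    bwd = run_gru(X[::-1], p_bwd, d_h)[::-1]    # re-align to the original word order
    return np.concatenate([fwd, bwd], axis=-1)  # (T, 2 * d_h)

rng = np.random.default_rng(0)
T, d_e, d_h = 6, 10, 4
X = rng.normal(size=(T, d_e))                   # word embeddings of one sentence
t_l = bigru(X, init_params(rng, d_e, d_h), init_params(rng, d_e, d_h), d_h)
```

With this layout, the local feature dimension is twice the hidden size of each direction.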

To generate the global representation of sentences, we adopt the same encoding process as in the image embedding branch, which comprises a multi-head attention layer and a two-layer feedforward network. Similarly, the global encoder is formulated as $t^{g}=\text{FFN}(\text{MultiHead}(\text{Avg}(t^{l}+W_{pe}),t^{l},t^{l}))$ , where FFN and MultiHead are defined in Eqs (3) and (2), and $t^{g}\in R^{d_{g}}$ is the global semantic representation of the sentence. $W_{pe}$ is the position encoding matrix, which is randomly initialized and learned in an end-to-end manner. Note that we use independent position embedding matrices for the image and text branches.

2.2 Embedding space learning

The proposed model requires optimizing the local and global embedding spaces simultaneously. We first define some necessary notation. Assume that a batch of training samples consists of $N_{v}$ images and $N_{t}$ sentences. We refer to the set of images as $V_{b}={\{v_{i}\}}_{i=1}^{N_{v}}$ , and to the set of sentences as $T_{b}={\{t_{j}\}}_{j=1}^{N_{t}}$ . Because each image may be relevant to more than one sentence, a batch of samples is denoted as $O_{b}=\{(v_{i},t_{j},y_{ij})|v_{i}\in V_{b},t_{j}\in T_{b}\}$ , where $y_{ij}$ is 1 if image $v_{i}$ is relevant to sentence $t_{j}$ , and $-1$ otherwise. In the following, we index variables belonging to images and sentences in a batch with the subscripts $i$ and $j$ , and indicate local and global representations with the superscripts $l$ and $g$ . We introduce the local and global embedding learning strategies in Section 2.2.1 and Section 2.2.2, respectively.

2.2.1 Learning local embedding

In our work, the local embedding modules transform images and sentences into the corresponding sequences of feature vectors in the shared local embedding space. Considering that the spatial layout and word order should have limited effect on the local semantic similarity of an image-sentence pair, we propose a position-independent local alignment strategy to model the local embedding space. Instead of exploiting global alignment labels to implicitly infer local semantic relationships, we extract a set of motif tags from the lexical information of the corpus (i.e., all sentences in the dataset) to promote fine-grained local semantic alignment. Concretely, we first utilize the Stanford natural language toolkit to generate the part-of-speech tags of the corpus and build a collection of all nouns and adjectives occurring in the sentences. We then clean the collection by merging semantically related words into groups according to lexical relationships, e.g., synonyms, comparatives and plurals. Finally, we remove the groups that occur with low frequency across all sentences to obtain a set of motif tags, and assign a label vector $q\in R^{L}$ to each database item. Here $L$ is the number of motif tags, and the $l$ -th element of $q$ is 1 if the $l$ -th motif tag is present in the database item and 0 otherwise. We take the motif tags of an image to be the union of the motif tag sets of all its relevant sentences.
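The label construction described above can be illustrated with a small sketch. The tag vocabulary, the word-to-group mapping and the captions are hypothetical toy data, and the part-of-speech tagging and frequency filtering steps are assumed to have been carried out beforehand.

```python
import numpy as np

# Hypothetical motif tags that survived frequency filtering.
motif_tags = ["dog", "ball", "grass", "red"]
tag_index = {t: i for i, t in enumerate(motif_tags)}

# Hypothetical merging of lexically related words (plurals, synonyms) into groups.
group_of = {"dogs": "dog", "puppy": "dog", "balls": "ball"}

def sentence_tags(tokens):
    """Set of motif tags present in a tokenized, lowercased sentence."""
    tags = set()
    for tok in tokens:
        tok = group_of.get(tok, tok)
        if tok in tag_index:
            tags.add(tok)
    return tags

def label_vector(tags):
    """Multi-hot label vector q of length L."""
    q = np.zeros(len(motif_tags))
    for t in tags:
        q[tag_index[t]] = 1.0
    return q

# Two captions of the same image.
captions = [["a", "puppy", "chases", "a", "red", "ball"],
            ["dogs", "on", "the", "grass"]]
sent_tags = [sentence_tags(c) for c in captions]
image_tags = set().union(*sent_tags)   # image tags = union over its captions
q_image = label_vector(image_tags)
```

Since an image inherits the tags of all its captions, its label vector always covers the label vector of each of its own captions.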

$\displaystyle p_{i}=\frac{1}{M}\sum_{m=1}^{M}\text{softmax}(v^{l}_{i,m}A)\quad\text{s.t.}\quad{\|A_{\cdot,l}\|}_{2}^{2}=1$ (9)

Given a batch of samples equipped with label vectors $O_{b}=\{(v_{i},t_{j},y_{ij},q_{i},q_{j})\}$ , we treat local semantic alignment as a multi-label classification task and compute the probability of the image local feature matrix $v^{l}_{i}$ matching the motif tags as in Eq. (9), where $v^{l}_{i,m}$ is the $m$ -th local feature row vector of $v_{i}^{l}$ and $A\in R^{d_{l}\times L}$ is the embedding matrix in which each column corresponds to a unique motif tag. As in Eq. (10), the local classification loss is defined as the cross-entropy between the estimated probability $p_{i}$ and the normalized label vector $\overline{q_{i}}$ . The local classification loss $F_{l,t}$ for sentences is computed in a similar way. Both $F_{l,v}$ and $F_{l,t}$ aim at aligning images and sentences with their motif tags in the shared local embedding space, which also indirectly matches the local feature vectors of the two modalities with each other.

$\displaystyle F_{l,v}=-\sum_{i=1}^{N_{v}}\sum_{l=1}^{L}\overline{q_{i}}(l)\log{p_{i}(l)}$ (10)
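A minimal sketch of Eqs (9) and (10) is given below. Two points are assumptions of the sketch rather than statements of the method: the unit-norm constraint on the columns of $A$ is enforced by explicit renormalization, and the normalized label vector $\overline{q_{i}}$ is taken to be the L1-normalized multi-hot vector (each item is assumed to carry at least one tag). The data are random toy values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def tag_probabilities(v_l, A):
    """Eq. (9): average over regions of the softmax over motif tags."""
    A_unit = A / np.linalg.norm(A, axis=0, keepdims=True)  # unit-norm columns
    return softmax(v_l @ A_unit, axis=-1).mean(axis=0)     # (L,)

def local_classification_loss(V_l, Q, A):
    """Eq. (10): cross-entropy between normalized labels and tag probabilities."""
    loss = 0.0
    for v_l, q in zip(V_l, Q):
        q_bar = q / q.sum()                  # assumption: L1-normalized labels
        p = tag_probabilities(v_l, A)
        loss -= np.sum(q_bar * np.log(p))
    return float(loss)

rng = np.random.default_rng(0)
d_l, L = 8, 4
A = rng.normal(size=(d_l, L))
V_l = [rng.normal(size=(5, d_l)), rng.normal(size=(7, d_l))]  # two images, 5 and 7 regions
Q = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
p0 = tag_probabilities(V_l[0], A)
F_lv = local_classification_loss(V_l, Q, A)
```

The same functions apply to sentences by passing word-level local features instead of region-level ones.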

In addition to the intra-modal classification losses, we introduce a matching loss to further explore the inter-modal semantic relationship. However, it is non-trivial to obtain sufficient fine-grained labels indicating semantic relevance at the instance level. Many prior works utilize the global alignment relationship to guide the learning of local semantic information. For example, most attention-based methods learn local semantic relationships by controlling the information flow with a specifically designed architecture and a global rank loss. In this paper, we compute the local semantic similarity labels of all possible image-sentence pairs using Eq. (11), where $C_{v_{i}}$ and $C_{t_{j}}$ are the sets of motif tags belonging to image $v_{i}$ and sentence $t_{j}$ , respectively, and $|C|$ is the number of elements in set $C$ .

$\displaystyle s_{ij}^{l}=|C_{v_{i}}\cap C_{t_{j}}|/|C_{t_{j}}|$ (11)
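Eq. (11) amounts to a set-overlap ratio, as the short sketch below shows. The tag sets are hypothetical, and returning 0 for a sentence without any motif tag is our own convention, since the source does not specify this case.

```python
def local_similarity_label(image_tags, sentence_tags):
    """Eq. (11): fraction of the sentence's motif tags that also belong to the image."""
    if not sentence_tags:
        return 0.0  # assumption: undefined case mapped to 0
    return len(image_tags & sentence_tags) / len(sentence_tags)

s_match = local_similarity_label({"dog", "ball", "grass"}, {"dog", "grass"})
s_partial = local_similarity_label({"dog", "ball"}, {"dog", "cat"})
s_none = local_similarity_label({"dog"}, {"cat", "sofa"})
```

Because the tags of an image are the union of the tags of its relevant sentences, every ground-truth pair receives the label 1, while unrelated pairs can still receive a nonzero label when they share motif tags.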

We estimate the local similarity of the image-sentence pair $(v^{l}_{i},t^{l}_{j})$ using Eq. (12), i.e., the cosine similarity between the averages of the normalized local feature vectors of the image and the sentence, denoted $\overline{v_{i}^{l}}$ and $\overline{t_{j}^{l}}$ . Because the local feature vectors are normalized before averaging, the inner product in the numerator of Eq. (12) equals the average of the cosine similarity scores of all possible region-word pairs, and the denominator rescales it by the norms of the two averaged vectors. In other words, we view the feature vectors as points in the local embedding space and assume that the centers of two semantically relevant sets of local feature points should lie in directions that are as close as possible.

$\displaystyle\hat{s}^{l}_{ij}=\frac{\langle\overline{v_{i}^{l}},\overline{t_{j}^{l}}\rangle}{\|\overline{v_{i}^{l}}\|\cdot\|\overline{t_{j}^{l}}\|}$ (12)
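The estimate in Eq. (12) can be sketched as follows, using random toy features. The sketch also exposes the intermediate quantities so that the relation between the inner product of the two averaged vectors and the pairwise cosine similarities can be checked numerically; the function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def local_centroid(x_l):
    """Average of the L2-normalized local feature vectors."""
    return l2_normalize(x_l).mean(axis=0)

def local_similarity(v_l, t_l):
    """Eq. (12): cosine similarity between the two averaged vectors."""
    v_bar, t_bar = local_centroid(v_l), local_centroid(t_l)
    return float(v_bar @ t_bar / (np.linalg.norm(v_bar) * np.linalg.norm(t_bar)))

rng = np.random.default_rng(0)
v_l = rng.normal(size=(36, 8))   # 36 region features
t_l = rng.normal(size=(7, 8))    # 7 word features
s_hat = local_similarity(v_l, t_l)

# Inner product of the centroids vs. mean cosine over all region-word pairs.
centroid_dot = local_centroid(v_l) @ local_centroid(t_l)
mean_pairwise_cos = (l2_normalize(v_l) @ l2_normalize(t_l).T).mean()
```

Here `centroid_dot` and `mean_pairwise_cos` agree up to floating-point error, while `s_hat` additionally divides by the norms of the two averaged vectors.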

From Eq. (11), we can see that the local similarity label $s_{ij}^{l}$ lies in $[0,1]$ , whereas the semantic similarity defined in Eq. (12) lies in $[-1,1]$ . To align images and sentences in the local embedding space, a simple approach is to truncate the similarity score defined in Eq. (12) with a ReLU function, and then optimize the local embedding module by minimizing the smooth L1 loss function described in Eq. (13).

$\displaystyle F_{l,x}=\frac{1}{|V_{b}|\times|T_{b}|}\sum_{i=1}^{|V_{b}|}\sum_{j=1}^{|T_{b}|}\textit{smooth}_{L1}(\hat{s}_{ij}^{l}-s_{ij}^{l})$ (13)
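A sketch of the truncated smooth L1 objective of Eq. (13) follows. The exact form of the smooth L1 function is not spelled out in the text, so we assume the common definition with threshold 1 (quadratic below the threshold, linear above it); the matrices are toy values.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    # Assumed standard form: 0.5 * x^2 / beta if |x| < beta, else |x| - 0.5 * beta.
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax ** 2 / beta, ax - 0.5 * beta)

def local_matching_loss_l1(S_hat, S):
    """Eq. (13): mean smooth L1 between ReLU-truncated estimates and labels.

    S_hat: (N_v, N_t) estimated local similarities in [-1, 1].
    S:     (N_v, N_t) local similarity labels in [0, 1].
    """
    S_trunc = np.maximum(S_hat, 0.0)   # ReLU truncation to match the label range
    return float(smooth_l1(S_trunc - S).mean())

S_hat = np.array([[0.5, -0.3], [0.1, 0.9]])
S = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = local_matching_loss_l1(S_hat, S)
```

Each pair contributes independently to this loss, which is the property contrasted with the KL-based objective in the text.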

We observe that the objective in Eq. (13) only imposes a target value on each image-sentence pair, which may not be sufficiently compatible with the local similarity label because of the different ranges of $\hat{s}^{l}_{ij}$ and $s_{ij}^{l}$ . Therefore, we also experiment with a bidirectional KL divergence loss function, which imposes a relative similarity constraint on all image-sentence pairs. The KL divergence loss is given in Eq. (14), where $s_{ij}^{I}$ and $s_{ij}^{T}$ are the entries of the label similarity matrix after L1 normalization along the image and sentence dimensions, respectively, and $\hat{s}_{ij}^{I}$ and $\hat{s}_{ij}^{T}$ are obtained likewise from the estimated similarity matrix.

$\displaystyle F_{l,x}=\frac{1}{|O_{b}|}\sum_{i}\sum_{j}\left(\hat{s}_{ij}^{I}\log{\frac{\hat{s}_{ij}^{I}}{s_{ij}^{I}}}\right)+\frac{1}{|O_{b}|}\sum_{i}\sum_{j}\left(\hat{s}_{ij}^{T}\log{\frac{\hat{s}_{ij}^{T}}{s_{ij}^{T}}}\right)$ (14)

$\displaystyle F_{l}=\beta_{1}(F_{l,v}+F_{l,t})+\beta_{2}F_{l,x}$ (15)

In contrast to the objective in Eq. (13), the KL divergence loss models the semantic relationships between one query and multiple candidate samples. In other words, unlike the smooth L1 loss function, whose gradients with respect to the individual pairs of image and sentence embeddings are mutually independent, the gradients of the KL divergence loss depend on all pairs of image and sentence embeddings in the batch. As in Eq. (15), we finally define the whole local alignment loss as the weighted sum of the intra-modal and inter-modal alignment losses, where $\beta_{1}$ and $\beta_{2}$ are predefined trade-off coefficients.
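The bidirectional KL objective of Eq. (14) can be sketched as below. Several details are assumptions of the sketch: estimated similarities are truncated at zero and a small constant is added before L1 normalization so that the logarithm is defined, the first axis of the matrix indexes images and the second indexes sentences, and $|O_{b}|$ is taken to be the number of image-sentence pairs in the batch.

```python
import numpy as np

def l1_normalize(S, axis, eps=1e-8):
    # Assumption: truncate at zero and smooth with eps so every entry is positive.
    S = np.maximum(S, 0.0) + eps
    return S / S.sum(axis=axis, keepdims=True)

def bidirectional_kl_loss(S_hat, S):
    """Eq. (14): KL terms after normalizing along the image and sentence dimensions.

    S_hat, S: (N_v, N_t) estimated similarities and similarity labels.
    """
    n_pairs = S.size
    loss = 0.0
    for axis in (0, 1):   # 0: image dimension, 1: sentence dimension
        P_hat = l1_normalize(S_hat, axis)
        P = l1_normalize(S, axis)
        loss += np.sum(P_hat * np.log(P_hat / P)) / n_pairs
    return float(loss)

rng = np.random.default_rng(0)
S_hat = rng.uniform(-1.0, 1.0, size=(4, 6))
S = rng.uniform(0.0, 1.0, size=(4, 6))
F_lx = bidirectional_kl_loss(S_hat, S)
```

Because of the normalization, changing the estimated similarity of one pair changes the normalized scores of every pair sharing its row or column, which is why the gradients are coupled across the batch.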

2.2.2 Learning global embedding

We define the semantic similarity score ${\hat{s}}_{ij}^{g}$ between the global representations of image $v_{i}$ and sentence $t_{j}$ as their cosine similarity. As in Eq. (16), the final similarity score $\hat{s}_{ij}$ is defined as the weighted sum of the local and global scores, where $\epsilon$ is a fixed trade-off coefficient between 0 and 1.

$\displaystyle\hat{s}_{ij}=\epsilon\times{\hat{s}}_{ij}^{l}+(1-\epsilon)\times{\hat{s}}_{ij}^{g}$ (16)
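One way to see why Eq. (16) remains compatible with retrieval over fixed vectors is sketched below. The scaling of the two normalized parts by $\sqrt{\epsilon}$ and $\sqrt{1-\epsilon}$ before concatenation is our own assumption for making a single dot product reproduce the weighted sum; the text only states that the normalized local and global representations are concatenated. All vectors are random toy data.

```python
import numpy as np

def unit(x):
    return x / np.linalg.norm(x)

def fixed_representation(local_mean, global_vec, eps):
    """Concatenate normalized local and global parts (hypothetical sqrt weighting)."""
    return np.concatenate([np.sqrt(eps) * unit(local_mean),
                           np.sqrt(1.0 - eps) * unit(global_vec)])

rng = np.random.default_rng(0)
# Averaged local vectors and global vectors for one image and one sentence.
v_loc, v_glo, t_loc, t_glo = rng.normal(size=(4, 8))
eps = 0.3

s_l = unit(v_loc) @ unit(t_loc)          # local cosine similarity, Eq. (12)
s_g = unit(v_glo) @ unit(t_glo)          # global cosine similarity
s_fused = eps * s_l + (1 - eps) * s_g    # Eq. (16)

# The same score obtained from precomputed fixed vectors with one dot product.
s_dot = fixed_representation(v_loc, v_glo, eps) @ fixed_representation(t_loc, t_glo, eps)
```

Under this assumption, ranking a query against a database reduces to a matrix-vector product over precomputed vectors, with no pairwise cross-modal computation at retrieval time.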

Given a batch of samples $O_{b}$ , we optimize the whole model with the bidirectional triplet rank loss using hard negative samples [4] in the batch, which is described as follows.

$\displaystyle F_{g}=\sum_{i\in V_{b}}\max_{j^{-}}[\alpha-\hat{s}_{ij}+\hat{s}_{ij^{-}}]_{+}+\sum_{j\in T_{b}}\max_{i^{-}}[\alpha-\hat{s}_{ij}+\hat{s}_{i^{-}j}]_{+}$ (17)

$\displaystyle F=\lambda F_{l}+(1-\lambda)F_{g}$ (18)

Here $\alpha$ is a predefined margin value and $[\cdot]_{+}$ is the rectified linear function. The objective function aims to make the similarity score of a positive image-sentence pair higher than that of a negative one, and attends only to the hardest triplet term in the batch. As the model gradually converges to a stable state, hard negatives generally carry more valuable information than other negatives. As in Eq. (18), we define the whole loss function as the weighted sum of the local and global alignment loss functions, where $\lambda$ is a fixed trade-off coefficient. The proposed model can be trained in an end-to-end manner.
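Eqs (17) and (18) can be sketched as follows. For simplicity the sketch assumes a square batch in which image $i$ is paired with sentence $i$ , so that the positives lie on the diagonal of the similarity matrix; this is a simplification of the general batch notation with $N_{v}$ images and $N_{t}$ sentences. Function names and values are hypothetical.

```python
import numpy as np

def hard_negative_triplet_loss(S, alpha=0.2):
    """Eq. (17) for a square batch whose positives lie on the diagonal.

    S: (n, n) fused similarity matrix, rows are images and columns are sentences.
    """
    n = S.shape[0]
    pos = np.diag(S)
    neg = np.where(np.eye(n, dtype=bool), -np.inf, S)    # mask out the positives
    hardest_sentence = neg.max(axis=1)   # hardest negative sentence for each image
    hardest_image = neg.max(axis=0)      # hardest negative image for each sentence
    loss_i2t = np.maximum(alpha - pos + hardest_sentence, 0.0)
    loss_t2i = np.maximum(alpha - pos + hardest_image, 0.0)
    return float(loss_i2t.sum() + loss_t2i.sum())

def total_loss(F_l, F_g, lam):
    """Eq. (18): weighted sum of the local and global alignment losses."""
    return lam * F_l + (1.0 - lam) * F_g

S_easy = np.array([[0.9, 0.1], [0.2, 0.8]])
S_hard = np.array([[0.5, 0.6], [0.1, 0.9]])
F_g_easy = hard_negative_triplet_loss(S_easy)
F_g_hard = hard_negative_triplet_loss(S_hard)
```

In `S_hard`, only the first image violates the margin (its negative sentence scores 0.6 against a positive score of 0.5), so it is the only term that contributes to the loss.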

2.2.3 Image-to-text generation task

To investigate the generalization of the shared feature space, we design a common image-to-text generation task in this subsection. We implement the task with a classical attentive encoder-decoder model. In our model, the encoder is the visual local embedding module, and the decoder is a one-layer GRU. The input to the decoder at each time step consists of two parts: the output of the attention module and the prediction at the previous time step. The attention module aggregates the outputs of the encoder into a single vector at each time step, which helps the decoder selectively attend to different parts of the encoder outputs. Given the outputs of the visual local embedding module $v^{l}\in R^{M\times d_{l}}$ , the attention-based decoder is formulated as Eqs (19)–(23), where $w$ , $W_{1}$ and $W_{2}$ are learnable parameters and $\alpha_{m,t}$ is the importance coefficient of the $m$ -th encoder output at time step $t$ , i.e., $g_{t}$ is the weighted sum of all visual local feature vectors. We predict the next word by transforming the hidden state with a linear map $W^{O}$ , which is also referred to as the output embedding matrix. As shown in Eq. (22), we concatenate $g_{t}$ and the embedding of the last predicted word $f(\hat{y}_{t-1})$ as the input to the GRU model. Note that the output embedding matrix $W^{O}$ and the input embedding $f$ share the same weights.

$\displaystyle\beta_{m,t}=w^{T}(W_{1}v^{l}_{m}+W_{2}s_{t-1}+b)$ (19)

$\displaystyle\alpha_{m,t}=\exp(\beta_{m,t})\left/\sum_{m=1}^{M}\exp(\beta_{m,t})\right.$ (20)

$\displaystyle g_{t}=\sum_{m=1}^{M}\alpha_{m,t}v^{l}_{m}$ (21)

$\displaystyle h_{t}=\textit{GRU}(s_{t-1},[g_{t},f(\hat{y}_{t-1})])$ (22)

$\displaystyle\hat{y}_{t}=\arg\max_{i}W^{O}_{i}h_{t}$ (23)
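The attention step of Eqs (19)–(21) can be sketched as follows, with random toy parameters. The dimension `d_a` of the intermediate attention space and the function name `attend` are hypothetical, and the recurrent update of Eq. (22) is left out.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(v_l, s_prev, w, W1, W2, b):
    """One decoder attention step.

    v_l: (M, d_l) encoder outputs; s_prev: (d_s,) previous decoder state.
    W1: (d_a, d_l), W2: (d_a, d_s), b: (d_a,), w: (d_a,).
    """
    beta = (v_l @ W1.T + W2 @ s_prev + b) @ w   # Eq. (19), one score per region
    alpha = softmax(beta)                       # Eq. (20), importance coefficients
    g = alpha @ v_l                             # Eq. (21), weighted sum of regions
    return alpha, g

rng = np.random.default_rng(0)
M, d_l, d_s, d_a = 5, 8, 6, 4
v_l = rng.normal(size=(M, d_l))
s_prev = rng.normal(size=d_s)
w, b = rng.normal(size=d_a), np.zeros(d_a)
W1, W2 = rng.normal(size=(d_a, d_l)), rng.normal(size=(d_a, d_s))
alpha, g = attend(v_l, s_prev, w, W1, W2, b)
```

The context vector `g` is a convex combination of the local visual features, so it stays inside the range spanned by them in every dimension.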

As shown in Eq. (24), we apply the negative log-likelihood function to train the decoder, where $p(y_{t})$ is the predicted probability of the correct word at step $t$ . We view the decoder as a constraint with learnable parameters, and train the encoder and decoder alternately for several epochs. In detail, we first train both the local and global embedding modules for several epochs until the model reaches a plateau. We then fix the parameters of all embedding modules and train the decoder for several epochs. This first cycle serves to warm up the model. Next, we alternately train the encoder and decoder until the parameters converge. Note that the loss function in Eq. (24) is also used to train the embedding modules, i.e., the final loss function is the weighted sum of Eqs (18) and (24).

$\displaystyle L_{\textit{gen}}=\frac{1}{T}\sum_{t=1}^{T}-\log p(y_{t})$ (24)
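Eq. (24) is the average negative log-likelihood of the correct words, which the short sketch below computes from a matrix of predicted word distributions. The probabilities and target indices are toy values.

```python
import numpy as np

def generation_loss(probs, targets):
    """Eq. (24): mean negative log-likelihood of the correct words.

    probs:   (T, |V|) predicted distribution over the vocabulary at each step.
    targets: (T,) indices of the correct words.
    """
    p_correct = probs[np.arange(len(targets)), targets]
    return float(-np.log(p_correct).mean())

# A decoder that is uniformly uncertain over a 4-word vocabulary.
uniform_probs = np.full((3, 4), 0.25)
targets = np.array([0, 1, 2])
L_gen = generation_loss(uniform_probs, targets)
```

For a uniform prediction over $|V|$ words the loss equals $\log|V|$ , which gives a simple reference point when monitoring training.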

3. Experiments and analysis

In this section, we perform several experiments on two publicly available datasets to investigate and demonstrate the effectiveness of the proposed model.

3.1 Datasets and evaluation metric

We evaluate the proposed model on the Flickr30k Entities benchmark [32] and the MS-COCO dataset [33]. Flickr30k contains 31,783 images collected from the Flickr website, and each image is annotated with five sentences. As is common in existing work [34], we use 1,000 images for validation, 1,000 images for testing and the rest for training. MS-COCO contains 123,287 images, and each image is annotated with five sentences. Following the work [4], the dataset is split into 113,287 training images, 5,000 validation images and 5,000 test images. We report the retrieval results both as the average over 5 folds of 1,000 test images and on the full 5,000 test images. The recall rate at top K (R@K) is used to measure the performance of the sentence retrieval (image-to-sentence) and image retrieval (sentence-to-image) tasks, and is defined as the fraction of queries for which a correct item is retrieved among the top K items. We also report the sum of all recall rates to illustrate the overall performance.
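The R@K metric defined above can be computed as in the following sketch, where each query comes with the set of indices of its correct candidates (for sentence retrieval, the five captions of the query image). The similarity matrix and ground truth are toy values, and the function name is hypothetical.

```python
import numpy as np

def recall_at_k(S, relevant, k):
    """Fraction of queries with at least one correct item among the top k.

    S:        (n_queries, n_candidates) similarity scores.
    relevant: list of sets with the correct candidate indices for each query.
    """
    hits = 0
    for scores, rel in zip(S, relevant):
        top_k = np.argsort(-scores)[:k]          # indices of the k highest scores
        hits += bool(rel & set(top_k.tolist()))
    return hits / len(relevant)

S = np.array([[0.9, 0.1, 0.5],
              [0.2, 0.3, 0.8]])
relevant = [{0}, {1}]
r1 = recall_at_k(S, relevant, 1)
r2 = recall_at_k(S, relevant, 2)
```

In this toy case the second query ranks its correct item second, so it counts as a miss for R@1 and as a hit for R@2.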

3.2 Experiment setup

We carry out experiments to evaluate the hierarchical embedding space model with the PyTorch [35] framework. For fair comparison, we experiment with various pretrained visual encoders, including VGG-19 [25], ResNet-152 [26], Faster R-CNN [27] and ViT models. For the first two encoders, we first rescale the shorter edge of each image to 256 pixels and then generate ten $3\times 224\times 224$ feature tensors from ten crops (four corners, center and their horizontal flips). The average output over the ten crops is used as the precomputed local representations. The feature vector formed by concatenating the values across all feature maps at the same spatial position is the representation of the corresponding convolved region. For the third encoder, we select the top 36 regions and extract the average of the corresponding pooling features as the precomputed local representations. For the ViT model, we first rescale the shorter edge of each image to 224 pixels, and then split the center region of size $224\times 224$ into a sequence of patches of size $16\times 16$ , which is fed into the pretrained ViT model to obtain the precomputed visual representations. Given a raw sentence, we first transform it into a sequence of lowercase words, and then tokenize it with the WordPiece algorithm [36].

The dimensions of both the local and global semantic representations are set to 768, i.e. $d_{l}=d_{g}=768$. The intermediate dimensions of the fully connected layers are set to 2048, and the number of self-attention heads is 12. The input to the one-layer GRU encoder is a 512-d word embedding learned from scratch. The bidirectional KL divergence loss is adopted as the default local alignment objective function. We refer to the model with the default configuration as HES. The model is optimized using the Adam [37] algorithm for 30 epochs. The initial learning rate is set to 0.0001 and decays by a factor of 0.1 every 15 epochs. The margin $\alpha$ is set to 0.2 and the coefficients are $\beta_{1}=\beta_{2}=0.5$. $\lambda$ is set to 0.32, and $\epsilon$ is set to the value that achieves the best performance. The training batch size is 128. We also notice that existing methods are generally built on various precomputed textual representations. For fair comparison, in addition to the baseline model, which learns the textual representations from scratch, we also experiment with another baseline model which takes as input the representations generated by the pretrained BERT model. We refer to the latter as BHES.
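The learning-rate schedule described above can be written as a simple step decay. This is a framework-independent sketch; the function name `step_lr` and the zero-based epoch index are our assumptions. In PyTorch, a schedule of this kind would typically be obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)`, although we do not claim this is the exact call used in the experiments.

```python
def step_lr(epoch: int, base_lr: float = 1e-4,
            gamma: float = 0.1, step_size: int = 15) -> float:
    """Learning rate at a zero-based epoch index under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate for each of the 30 training epochs.
schedule = [step_lr(epoch) for epoch in range(30)]
```

Epochs 0 to 14 therefore use the initial rate of 0.0001, and epochs 15 to 29 use a rate ten times smaller.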

3.3 Comparison with state-of-the-art methods

In this section, we evaluate the proposed hierarchical embedding model to demonstrate its effectiveness. We compare it with several state-of-the-art joint embedding models which use the same data splits. The compared models include CMPM [5], LTBN [13], VSE $++$ [4], Two-way Nets [14], DualPath [12], DSPE [38], DVSA [34], RRF [8], GXN [6], SCO [7], TIMAM [9], VSRN [3], SGM [24], SAEM [39], RRTC [40] and ABGR [41].

The retrieval results on Flickr30k are shown in Table 1. Compared with other joint embedding models, BHES achieves the best R@10 on both sentence retrieval and image retrieval, and obtains competitive results on the other metrics. Among the models with the VGG-19 backbone, our model surpasses all models except SCO in terms of the Sum. Compared with the multi-classification labels used in the SCO model, the local similarity labels proposed in this paper are easier to obtain and introduce no extra computational burden at the retrieval stage. With the ResNet-152 encoder, our model also achieves competitive performance on all tasks. The comparison results on MS-COCO are shown in Table 2 (1000 test images) and Table 3 (5000 test images). Similar to the results on Flickr30k, the BHES model achieves competitive performance on both the 1000 and 5000 test images. Compared with the VSRN model, our model gains 1.5 points in the Sum on the 1000 test images, although it performs worse on the 5000 test images. Like the ResNet-152 and VGG-19 encoders, the ViT model takes as input adjacent image patches of the same size, instead of the regions of interest detected by the Faster RCNN model. Compared with the models built on either ResNet-152 or VGG-19, the model combined with the pretrained ViT achieves better performance in all tasks on both the Flickr30k and MS-COCO datasets. As shown in prior work, owing to the large amount of training data and the powerful attention-based architecture, the ViT model generally performs better than common CNN models in some downstream visual tasks. However, it is still inferior to the model built on the regions of interest detected by the Faster RCNN model, which suggests that segmenting an image into semantically meaningful regions is likely beneficial for image-text matching.
Overall, the results on both Flickr30k and MSCOCO show that high-quality precomputed image features significantly improve the retrieval performance on all tasks. In addition, our model generally performs better than other models on sentence retrieval.

Table 1
Comparison results with state-of-the-art methods on Flickr30k

Models	Backbone	Sentence retrieval			Image retrieval			Sum
		R@1	R@5	R@10	R@1	R@5	R@10
DVSA	VGG-19	22.2	48.2	61.4	15.2	37.7	50.5	235.2
DSPE	VGG-19	40.3	68.9	79.9	29.7	60.1	72.1	351.0
VSE $++$	VGG-19	38.6	64.6	74.6	26.8	54.9	66.8	326.3
LTBN	VGG-19	40.5	70.7	80.9	30.7	61.1	72.3	356.2
Two-way Nets	VGG-19	49.8	67.5	–	36.0	55.6	–	–
DualPath	VGG-19	37.5	66.0	75.6	27.2	55.4	67.6	329.3
SCO	VGG-19	44.2	74.1	83.6	32.8	64.3	74.9	373.9
VSE $++$	ResNet-152	43.7	71.9	82.1	32.3	60.9	72.1	363.0
CMPM	ResNet-152	49.6	76.8	86.1	37.3	65.7	75.5	391.0
DualPath	ResNet-152	44.2	70.2	79.7	30.7	59.2	70.8	354.8
RRF	ResNet-152	47.6	77.4	87.1	35.4	68.3	79.9	395.7
SCO	ResNet-152	55.5	82.0	89.3	41.1	70.5	80.1	418.5
TIMAM	ResNet-152	53.1	78.8	87.6	42.6	71.6	81.9	415.6
SAEM	Faster RCNN	69.1	91.0	95.1	52.4	81.1	88.1	476.8
VSRN	Faster RCNN	71.3	90.6	96.0	54.7	81.8	88.2	482.6
RRTC	Faster RCNN	72.7	93.8	96.8	54.2	79.4	86.1	483.0
ABGR	Faster RCNN	72.3	91.8	95.1	53.7	80.1	87.2	480.2
HES	VGG-19	42.8	71.0	81.3	31.8	61.3	72.4	360.6
HES	ResNet-152	50.7	79.4	86.9	37.7	67.0	77.0	398.8
HES	ViT	63.3	82.1	87.7	41.7	69.1	79.9	423.8
HES	Faster RCNN	66.1	88.9	93.6	49.4	76.7	84.4	459.1
BHES	Faster RCNN	72.1	90.3	97.0	53.3	79.6	88.5	480.8

Table 2

Comparison results with state-of-the-art methods on MS-COCO (1000 images)

Models	Backbone	Sentence retrieval			Image retrieval			Sum
		R@1	R@5	R@10	R@1	R@5	R@10
DVSA	VGG-19	38.4	69.9	80.5	27.4	60.2	74.8	351.2
DSPE	VGG-19	50.1	79.7	89.2	39.6	75.2	86.9	420.7
VSE $++$	VGG-19	51.9	81.5	90.4	39.5	74.1	85.6	423.0
LTBN	VGG-19	54.0	84.0	91.2	43.3	76.8	87.6	436.7
Two-way Nets	VGG-19	55.8	75.2	–	39.7	63.3	–	–
DualPath	VGG-19	46.0	75.6	85.3	34.4	66.6	78.7	386.6
VSE $++$	ResNet-152	58.3	86.1	93.3	43.6	77.6	87.8	446.7
CMPM	ResNet-152	56.1	86.3	92.9	44.6	78.8	89.0	447.7
DualPath	ResNet-152	52.2	80.4	88.7	37.2	69.5	80.6	408.6
RRF	ResNet-152	56.4	85.3	91.5	43.9	78.1	88.6	443.8
GXN	ResNet-152	68.5	–	97.9	56.6	–	94.5	317.5
SGM	Faster RCNN	73.4	93.8	97.8	57.5	87.3	94.3	504.1
VSRN	Faster RCNN	76.2	94.8	98.2	62.8	89.7	95.1	516.8
SAEM	Faster RCNN	71.2	94.1	97.7	57.8	88.6	94.9	504.3
RRTC	Faster RCNN	76.2	96.3	98.9	61.6	89.3	94.6	516.9
ABGR	Faster RCNN	73.0	94.7	98.3	59.5	89.4	95.2	510.1
HES	VGG-19	55.0	84.3	92.3	42.9	76.8	86.9	438.3
HES	ResNet-152	59.6	88.5	94.3	47.5	80.7	90.1	460.7
HES	ViT	68.4	91.5	95.1	53.4	81.9	92.2	482.5
HES	Faster RCNN	75.2	95.6	98.5	59.5	88.3	94.1	511.2
BHES	Faster RCNN	75.3	96.1	98.4	63.1	90.5	94.9	518.3

Table 3

Comparison results with state-of-the-art methods on MS-COCO (5000 images)

Models	Backbone	Sentence retrieval			Image retrieval			Sum
		R@1	R@5	R@10	R@1	R@5	R@10
DVSA	VGG-19	16.5	39.2	52.0	10.7	29.6	42.2	190.2
DualPath	VGG-19	35.5	63.2	75.6	21.0	47.5	60.9	303.7
SCO	VGG-19	40.2	70.1	81.3	31.3	61.5	73.9	358.3
CMPM	ResNet-152	31.1	60.7	73.9	22.9	50.2	63.8	302.6
DualPath	ResNet-50	41.2	70.5	81.1	25.3	53.4	66.4	337.9
GXN	ResNet-152	42.0	–	84.7	31.7	–	74.6	–
SCO	ResNet-152	42.8	72.3	83.0	33.1	62.9	75.5	369.6
SGM	Faster RCNN	50.0	79.3	87.9	35.3	64.9	76.5	393.9
VSRN	Faster RCNN	53.0	81.1	89.4	40.5	70.6	81.1	415.7
HES	VGG-19	34.5	65.0	77.4	25.1	53.1	66.0	321.1
HES	ResNet-152	38.8	69.5	81.7	28.2	57.4	70.2	345.7
HES	ViT	42.9	73.6	84.5	33.2	64.1	74.9	373.2
HES	Faster RCNN	45.6	78.2	88.9	35.0	65.4	77.8	390.9
BHES	Faster RCNN	47.4	82.3	92.0	37.3	69.0	80.9	408.9

3.4 Ablation study

In this section, we perform several ablation experiments on the Flickr30k dataset to better understand the effect of individual components of the proposed model. All ablation models use ResNet-152 as the image encoder. The ablated versions are described as follows. To investigate the effect of the local alignment module, we experiment with only the global alignment module (denoted as $\text{HES}_{g}$), i.e. only the global similarity score is used to compute the loss function and retrieve samples. Similarly, to observe the effect of the global alignment module, we remove the whole global alignment module to construct a lightweight model (denoted as $\text{HES}_{l}$). Additionally, we experiment with two uni-directional KL divergence losses, i.e., image-to-sentence (denoted as $\text{HES}_{it}$) and sentence-to-image (denoted as $\text{HES}_{ti}$). Observing that some words may occur frequently but carry little visual semantic information, we further refine the local similarity labels by truncating them at a predefined threshold of 0.3, i.e. clamping all labels less than 0.3 to 0; the corresponding model is denoted as $\text{HES}_{\textit{reg}}$. We construct the model $\text{HES}_{nc}$ by removing the intra-modal classification losses in the local alignment module. We also remove the positional encoding in both the image and sentence branches before their outputs are fed into the global alignment module, which is denoted as $\text{HES}_{np}$. Besides removing submodules, we also report the results of a model $\text{HES}_{L1}$ with an alternative configuration, in which the KL divergence loss is replaced with the smooth L1 loss. Finally, we refer to the combination of the HES model and an image-to-text generation task as $\text{HES}_{\textit{gen}}$.
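To make the label truncation and the two loss directions concrete, the sketch below clamps a toy label matrix at 0.3 and evaluates a KL divergence in each direction. The exact loss is defined with the model itself; here we assume, purely for illustration, that each row (image-to-sentence) or column (sentence-to-image) of the label matrix is normalized into a target distribution and compared with a softmax over the predicted scores. All names and the toy matrices are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows: images, columns: sentences (assumed layout)


def clamp_labels(labels: Matrix, threshold: float = 0.3) -> Matrix:
    """Set every label below `threshold` to 0 and keep the others unchanged."""
    return [[v if v >= threshold else 0.0 for v in row] for row in labels]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]


def row_kl(labels: Matrix, scores: Matrix) -> float:
    """Mean over rows of KL(normalized labels || softmax(scores))."""
    total, used = 0.0, 0
    for y_row, s_row in zip(labels, scores):
        mass = sum(y_row)
        if mass == 0.0:  # no positive label left in this row: skip it
            continue
        pred = softmax(s_row)
        total += sum((y / mass) * math.log((y / mass) / p)
                     for y, p in zip(y_row, pred) if y > 0.0)
        used += 1
    return total / used if used else 0.0


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def bidirectional_kl(labels: Matrix, scores: Matrix) -> float:
    """Sum of the image-to-sentence and sentence-to-image terms."""
    i2t = row_kl(labels, scores)
    t2i = row_kl(transpose(labels), transpose(scores))
    return i2t + t2i


raw_labels = [[1.0, 0.2], [0.4, 0.9]]  # toy local similarity labels
scores = [[2.0, 0.0], [0.0, 2.0]]      # toy predicted similarity scores

clamped = clamp_labels(raw_labels)     # labels below 0.3 become 0
loss = bidirectional_kl(clamped, scores)
uniform_loss = bidirectional_kl([[0.5, 0.5], [0.5, 0.5]],
                                [[0.0, 0.0], [0.0, 0.0]])
```

In this sketch, using only `row_kl(labels, scores)` or only the transposed term corresponds to the two uni-directional variants, and the loss vanishes when the predicted distributions match the normalized labels.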

Table 4
Comparison results of ablation study on Flickr30k dataset

Models	Sentence retrieval			Image retrieval			Sum
	R@1	R@5	R@10	R@1	R@5	R@10
HES	50.7	79.4	86.9	37.7	67.0	77.0	398.8
$\text{HES}_{nc}$	47.8	76.0	86.3	36.8	66.3	77.4	390.7
$\text{HES}_{L1}$	48.9	76.5	86.1	36.7	65.7	77.5	391.4
$\text{HES}_{\textit{gen}}$	47.1	77.3	87.4	34.1	66.2	77.9	390.0
$\text{HES}_{g}$	35.7	60.0	64.3	25.7	53.9	66.3	305.9
$\text{HES}_{l}$	17.0	30.3	42.3	23.0	46.8	58.0	217.4
$\text{HES}_{np}$	47.0	74.6	86.4	35.8	64.9	75.3	384.0
$\text{HES}_{it}$	48.0	76.0	85.6	35.6	65.3	75.2	385.7
$\text{HES}_{ti}$	46.9	76.9	85.6	36.4	65.0	76.4	387.2
$\text{HES}_{\textit{reg}}$	49.3	79.0	86.5	37.1	66.8	76.8	395.5

Table 5

The results of ablation experiments with various testing database

Dataset	Var	Sentence retrieval			Image retrieval			Sum
		R@1	R@5	R@10	R@1	R@5	R@10
Flickr30k	–	66.1	88.9	93.6	49.4	76.7	84.4	459.1
Flickr30k	$\textit{MASK}_{V}$	62.1	86.3	91.9	48.2	75.4	83.3	447.2
Flickr30k	$\textit{MASK}_{T}$	57.8	83.7	89.6	40.9	69.0	77.7	418.7
Flickr30k	$\textit{MASK}_{VT}$	56.4	83.5	90.1	40.1	68.1	77.2	415.4
Flickr30k	$\textit{NOISE}_{V}$	63.5	86.6	92.4	48.6	76.2	83.9	451.2
Flickr30k	$\textit{NOISE}_{T}$	54.2	82.5	89.0	39.9	68.9	78.2	412.7
Flickr30k	$\textit{NOISE}_{VT}$	56.5	81.2	88.1	39.5	68.1	77.8	411.2
MS-COCO	–	45.6	78.2	88.9	35.0	65.4	77.8	390.9
MS-COCO	$\textit{MASK}_{V}$	44.5	76.3	86.1	33.4	64.0	75.9	380.2
MS-COCO	$\textit{MASK}_{T}$	40.4	71.6	83.2	27.3	55.2	66.8	344.5
MS-COCO	$\textit{MASK}_{VT}$	39.4	70.8	81.6	26.6	54.1	66.1	338.6
MS-COCO	$\textit{NOISE}_{V}$	46.1	77.2	87.1	34.0	64.8	76.7	385.9
MS-COCO	$\textit{NOISE}_{T}$	37.0	70.1	82.2	26.8	54.8	67.2	338.1
MS-COCO	$\textit{NOISE}_{VT}$	37.6	69.0	79.9	26.4	55.1	67.5	335.5

The comparison results of the ablation study are shown in Table 4. From the results of model $\text{HES}_{nc}$, we can see that the Gaussian mask improves the performance by an average of 2.7% by aligning the local feature vectors of various modalities to the motif embeddings. Replacing the KL divergence loss with the smooth L1 loss ($\text{HES}_{L1}$) decreases the performance by an average of 1.2%. From the results of $\text{HES}_{\textit{gen}}$, we can see that introducing the cross-modal generation task slightly improves R@10 on both image and sentence retrieval. In contrast, it decreases R@1 to some extent. In our view, the cross-modal generation task helps improve the generalization of the model, but has little effect on distinguishing fine-grained semantic information. Comparing the results of $\text{HES}_{g}$ with HES, we can see that the local alignment module significantly improves the results on sentence retrieval by an average of 19.0% and on image retrieval by an average of 11.9%, especially in R@5 and R@10. The results of $\text{HES}_{l}$ indicate that the local alignment module is able to learn some useful information for retrieval, but on its own it does not achieve satisfactory performance. Comparing $\text{HES}_{np}$ with HES, adding positional encoding to the local embedding improves the performance on all retrieval tasks, with an average improvement of 3.0% on sentence retrieval and 1.9% on image retrieval. Comparing the results of the various KL divergence loss functions, we can see that the two uni-directional losses achieve almost the same performance, and both reduce the performance to some extent when compared to HES. This is consistent with prior work on cross-modal retrieval, in which a bidirectional loss generally performs better than a uni-directional loss because it captures richer semantically relevant information.
From the results of model $\text{HES}_{\textit{reg}}$, we can see that the regularized local similarity labels slightly decrease the performance. In our view, this may be because entropy-based loss functions generally increase intra-class variance, especially when the local representations are intermediate outputs of the proposed model.
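The average gains quoted for $\text{HES}_{g}$ and $\text{HES}_{np}$ can be reproduced from the rows of Table 4. The sketch assumes that each average is the mean of the absolute R@1, R@5 and R@10 differences within one retrieval direction; this reading is our assumption, chosen because it matches the reported figures. The helper name `mean_gain` is hypothetical.

```python
from typing import List, Tuple

# Rows copied from Table 4: sentence retrieval R@1/5/10, then image retrieval R@1/5/10.
hes = [50.7, 79.4, 86.9, 37.7, 67.0, 77.0]
hes_g = [35.7, 60.0, 64.3, 25.7, 53.9, 66.3]
hes_np = [47.0, 74.6, 86.4, 35.8, 64.9, 75.3]


def mean_gain(full: List[float], ablated: List[float]) -> Tuple[float, float]:
    """Average R@K difference for sentence retrieval and for image retrieval."""
    diffs = [f - a for f, a in zip(full, ablated)]
    return sum(diffs[:3]) / 3, sum(diffs[3:]) / 3


gain_g = mean_gain(hes, hes_g)    # effect of the local alignment module
gain_np = mean_gain(hes, hes_np)  # effect of the positional encoding
```

Rounded to one decimal place, `gain_g` gives 19.0 and 11.9 and `gain_np` gives 3.0 and 1.9, in agreement with the values stated above.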

4. Discussion and visualization

4.1 Data masking and noise

Besides studying the effectiveness of models with various architectures, we also investigate the impact of different types of noise in the testing database. We perform the following experiments with the baseline model HES (Faster RCNN). Concretely, we first randomly mask some words in a sentence with a predefined probability, and refer to this modification as $\textit{MASK}_{T}$. Likewise, we randomly mask some region-based feature vectors in an image and refer to this modification as $\textit{MASK}_{V}$. Note that both textual and visual MASK tokens are set to zeros. Finally, we randomly mask both words and regions with a fixed probability, which is referred to as $\textit{MASK}_{VT}$. We implement the masking operation with a dropout layer which independently sets each element to zero with probability 0.1. Besides the masking operation, which is similar to occlusion, we also randomly select part of the input and add independent Gaussian noise with mean 0 to each element of the testing data to mimic dirty data. Likewise, we construct three types of testing database by adding noise to the textual data, the visual data and both, which are respectively referred to as $\textit{NOISE}_{T}$, $\textit{NOISE}_{V}$ and $\textit{NOISE}_{VT}$. To account for their different amplitudes, the standard deviations of the visual and textual noise distributions are set to 0.1 and 0.05 respectively. We report the average results over 20 independent random experiments on both Flickr30k and MS-COCO in Table 5. From the results on masked testing data, we can see that masking the visual data only slightly decreases the performance on both Flickr30k and MS-COCO, i.e., a total drop of only 11.9 and 10.7 points in the Sum, respectively. In contrast, masking the textual data significantly decreases the performance on both image and sentence retrieval tasks. Moreover, the masked textual data is the dominant factor that determines the retrieval performance when both visual and textual data are masked. We find similar results in all experiments with Gaussian noise, i.e., our model is more fragile when the textual data is polluted. In our view, the reason may be that the precomputed visual data contains more redundant semantic information, since the regions of interest overlap considerably.
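The two perturbations can be sketched on toy feature matrices as follows. This is an illustration under stated assumptions rather than the code used for Table 5: masked elements are zeroed without any rescaling, the fraction of inputs selected for noise is exposed as a hypothetical `select_prob` parameter because it is not specified above, and the feature sizes are arbitrary.

```python
import random
from typing import List

Features = List[List[float]]  # one row per word or region, one column per dimension


def mask_elements(x: Features, p: float, rng: random.Random) -> Features:
    """Zero each element independently with probability p (no rescaling)."""
    return [[0.0 if rng.random() < p else v for v in row] for row in x]


def add_gaussian_noise(x: Features, std: float, rng: random.Random,
                       select_prob: float = 1.0) -> Features:
    """Add N(0, std^2) noise to each element chosen with probability select_prob."""
    return [[v + rng.gauss(0.0, std) if rng.random() < select_prob else v
             for v in row] for row in x]


rng = random.Random(0)
regions = [[1.0] * 8 for _ in range(36)]  # toy stand-in for 36 region features
words = [[1.0] * 8 for _ in range(12)]    # toy stand-in for 12 word features

masked_regions = mask_elements(regions, 0.1, rng)      # MASK_V
noisy_regions = add_gaussian_noise(regions, 0.1, rng)  # NOISE_V, std 0.1
noisy_words = add_gaussian_noise(words, 0.05, rng)     # NOISE_T, std 0.05
```

Applying the same functions to both modalities gives the combined $\textit{MASK}_{VT}$ and $\textit{NOISE}_{VT}$ settings.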

Table 6
The examples of sentence retrieval

Image query

Results

A man in a horse drawn carriage parked in front of a stone building.
A white horse pulling a carriage with a man on it.
White horse carrying a man in a black buggy.
A horse, buggy and driver sitting in front of a building.
A horse is standing with a black carriage.

A bicycle is in a living room next to some shelves.
Room containing corner of entertainment equipment and a bicycle
A bike that is sitting in a living room on the carpet.
A picture of a living room that has a computer and a bicycle.
This is a living room with a bike and a desktop computer.

Commercial airplanes are sitting on the run way.
A group of planes sit on the tarmac at an airport.
The airplanes are lined up on the tarmac.
There are several airplanes parked on the tarmac.
A group of air planes sitting on a runway.

A group of yellow and blue umbrellas near a building clock.
Yellow-orange pecial purpose train engine with American flag painted in the side.
A very bright colored train near a big building.
A lot of blue and yellow umbrellas sitting under a clock.
Many blue and yellow umbrellas are shown next to a building.

Figure 3. Visualization of local semantic embedding distribution using the t-SNE algorithm.

Table 7

The examples of image retrieval

Text query

Results

A man standing in front of an open refrigerator filled with bottles of beer.

A tropical hotel room holds a bed with a white mosquito netting canopy.

A bathroom featuring a walk in shower, mirror, sink and toilet.

A man and woman are looking intently at a computer screen.

4.2 Embedding visualization

To better understand the effect of the proposed local alignment loss, we visualize the internal local semantic representations in a 2-D subspace using the t-SNE algorithm [42] in Fig. 3. Note that we compute the average of all normalized local feature vectors of a sample as the corresponding local representation. We select 100 images and the 500 corresponding sentences from the test set of the Flickr30k dataset for visualization analysis. The distributions of the local semantic representations learned by the models HES and $\text{HES}_{g}$ are illustrated on the left and right sides of Fig. 3, respectively. Images and sentences are marked as crosses and dots respectively, and points with the same color represent samples from the same semantic category. From the comparison in Fig. 3, we can see that introducing the local alignment loss yields the desired distribution in the local embedding space, where the average vectors summarize the local semantic components and relevant items are located close together.
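The per-sample summary used for this visualization can be sketched as follows. We assume L2 normalization of each local vector before averaging, which the text does not state explicitly; the function names and the toy two-dimensional vectors are hypothetical.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else list(v)


def mean_local_representation(local_vectors: List[List[float]]) -> List[float]:
    """Average of the normalized local feature vectors of one sample."""
    normalized = [l2_normalize(v) for v in local_vectors]
    n = len(normalized)
    return [sum(col) / n for col in zip(*normalized)]


# Toy sample with two local feature vectors (e.g. two regions or two words).
local_vectors = [[3.0, 4.0], [0.0, 2.0]]
summary = mean_local_representation(local_vectors)
```

One such summary vector per image or sentence is then projected to two dimensions by t-SNE.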

4.3 Retrieval examples

Next we provide several retrieval examples from the test set of the Flickr30k dataset to illustrate the effect of the proposed model. The results of sentence retrieval are shown in Table 6, where the top-5 samples are ranked from top to bottom and the negatives are marked in red. We can see that all positive samples are ranked in the top 5 in the first two examples, and the negative samples in the last two examples also present partially relevant visual semantic information. Table 7 shows several examples of image retrieval, where the samples are ranked from left to right and the positive samples are marked with a green bounding box. We can see that the positive samples are ranked first in the first three examples, and the samples that follow generally present similar visual information.

5. Conclusion

In this paper, we focus on learning both local and global embedding spaces for image-sentence matching. To incorporate the local cross-modal correspondence into a classical joint embedding framework, we organize the local and global embedding modules as a cascade structure, and optimize the local embedding module with respect to the local similarity labels derived from the lexical information of the corpus. Owing to the cascade structure, the global relevance rank loss also guides the parameter updates of the local embedding module. Compared with the methods based on joint embedding learning, the proposed model achieves competitive performance on both image and sentence retrieval tasks.

The proposed model can be seen as a compromise between joint embedding learning and similarity metric learning. However, the proposed method still has some limitations. Firstly, the proposed model is sensitive to hyper-parameters due to the cascade structure; in particular, the trade-off coefficients for the global rank loss and the local alignment loss are important for its success. Secondly, it only leverages simple statistical information for the local embedding alignment, and directly extracts the output of the pooling layer as the local semantic feature for images, which is not sufficient to eliminate the noise from the background. Therefore, we will further explore two directions: one is to extract more robust and accurate local similarity labels, and the other is to find a better strategy for modeling the local embedding space and fusing the local and global similarity scores.

Data availability

Flickr30k and MSCOCO are publicly available datasets described in published papers [32, 33]. The code of the current study is available from the corresponding author on reasonable request.

Author contributions statement

Hao Sun wrote the main manuscript text and the experiment code. Xiaolin Qin and Xiaojing Liu prepared all the figures and tables. All authors reviewed the manuscript.

Acknowledgments

The research was supported by The National Natural Science Foundation of China (grant nos. 61728204, 61802182).

Conflict of interest

The authors declare that they have no conflict of interest.

References

1. Duan, Fang, Gong and Jiang, Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 11336–11344.

2. Hong, Liu, Wang, Chen and Chu, GilBERT: Generative Vision-Language Pre-Training for Image-Text Retrieval, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1379–1388.

3. Zhang and Fu, Visual semantic reasoning for image-text matching, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4654–4662.

4. Faghri, Fleet D.J., Kiros J.R. and Fidler, Vse++: Improving visual-semantic embeddings with hard negatives, arXiv preprint arXiv:1707.05612, 2017.

5. Zhang and Lu, Deep cross-modal projection learning for image-text matching, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 686–701.

6. Cai, Joty S.R., Niu and Wang, Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7181–7189.

7. Huang, Song and Wang, Learning semantic concepts and order for image and sentence matching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6163–6171.

8. Liu, Guo, Bakker E.M. and Lew M.S., Learning a Recurrent Residual Fusion Network for Multimodal Matching, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4127–4136. doi: 10.1109/ICCV.2017.442.

9. Sarafianos and Kakadiaris I.A., Adversarial representation learning for text-to-image matching, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5814–5824.

10. Kiros, Salakhutdinov and Zemel R.S., Unifying visual-semantic embeddings with multimodal neural language models, arXiv preprint arXiv:1411.2539, 2014.

11. Semedo and Magalhães, Cross-Modal Subspace Learning with Scheduled Adaptive Margin Constraints, in: Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 75–83.

12. Zheng, Garrett, Yang and Shen Y.-D., Dual-path convolutional image-text embeddings with instance loss, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16(2) (2020), 1–23.

13. Wang, Huang and Lazebnik, Learning two-branch neural networks for image-text matching tasks, IEEE Transactions on Pattern Analysis and Machine Intelligence 41(2) (2018), 394–407.

14. Eisenschtat and Wolf, Linking image and text with 2-way nets, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4601–4611.

15. Zhen, Peng and Liu, Scalable deep multimodal learning for cross-modal retrieval, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 635–644.

16. Yang, Liu, Fei and Li, Heterogeneous attention network for effective and efficient cross-modal retrieval, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1146–1156.

17. Nam, Ha J.-W. and Kim, Dual attention networks for multimodal reasoning and matching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 299–307.

18. Lee K.-H., Chen, Hua and He, Stacked cross attention for image-text matching, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 201–216.

19. Zhang, Lei, Zhang and Li S.Z., Context-aware attention network for image-text retrieval, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3536–3545.

20. Wang, Han and Pang, Saliency-guided attention network for image-sentence matching, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5754–5763.

21. Chen, Ding, Liu, Lin, Liu and Han, Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12655–12663.

22. Liu, Gao and Nie, Dynamic modality interaction modeling for image-text retrieval, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1104–1113.

23. Liu, Mao, Zhang, Xie, Wang and Zhang, Graph structured network for image-text matching, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10921–10930.

24. Wang, Yao, Shan and Chen, Cross-modal scene graph matching for relationship-aware image-text retrieval, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 1508–1517.

25. Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.

26. Zhang, Ren and Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

27. Girshick, Fast r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.

28. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser and Polosukhin, Attention Is All You Need, arXiv, 2017.

29. Ba J.L., Kiros J.R. and Hinton G.E., Layer normalization, arXiv preprint arXiv:1607.06450, 2016.

30. Hendrycks and Gimpel, Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units, CoRR, abs/1606.08415, 2016. http://arxiv.org/abs/1606.08415.

31. Cho, Van Merriënboer, Bahdanau and Bengio, On the properties of neural machine translation: Encoder-decoder approaches, arXiv preprint arXiv:1409.1259, 2014.

32. Plummer B.A., Wang, Cervantes C.M., Caicedo J.C., Hockenmaier and Lazebnik, Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2641–2649.

33. Lin T.-Y., Maire, Belongie, Hays, Perona, Ramanan, Dollár and Zitnick C.L., Microsoft coco: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.

34. Karpathy and Fei-Fei, Deep visual-semantic alignments for generating image descriptions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.

35. Collobert, Kavukcuoglu and Farabet, Torch7: A Matlab-like Environment for Machine Learning, in: BigLearn NIPS Workshop, 2011.

36. Schuster, Chen, Le Q.V., Norouzi, Macherey, Krikun, Cao, Gao, Macherey et al., Google’s neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144, 2016.

37. Kingma and Ba, Adam: A Method for Stochastic Optimization, Computer Science, 2014.

38. Wang and Lazebnik, Learning deep structure-preserving image-text embeddings, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5005–5013.

39. Wang, Song and Huang, Learning fragment self-attention embeddings for image-text matching, in: Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 2088–2096.

40. Wang and Cui, Region reinforcement network with topic constraint for image-text matching, IEEE Trans. Cir. and Sys. for Video Technol. 32(1) (2022), 388–397. doi: 10.1109/TCSVT.2021.3060713.

41. Zhong, Yang, Huang, Yuan and Lin C.-W., Auxiliary Bi-Level Graph Representation for Cross-Modal Image-Text Retrieval, in: 2021 IEEE International Conference on Multimedia and Expo (ICME), 2021, pp. 1–6. doi: 10.1109/ICME51207.2021.9428380.

42. Van Der Maaten, Accelerating t-SNE using tree-based algorithms, The Journal of Machine Learning Research 15(1) (2014), 3221–3245.

43. Vendrov, Kiros, Fidler and Urtasun, Order-embeddings of images and language, arXiv preprint arXiv:1511.06361, 2015.

44. Chen, Jose J.M. and Liu, Structured Multi-Modal Feature Embedding and Alignment for Image-Sentence Retrieval, in: Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 5185–5193. ISBN 9781450386517. doi: 10.1145/3474085.3475634.
