Fast retrieval of multi-modal embeddings for e-commerce applications

Abstract

In this paper, we introduce a retrieval framework designed for e-commerce applications, which employs a multi-modal approach to represent items of interest. This approach incorporates both textual descriptions and images of products, alongside a locality-sensitive hashing (LSH) indexing scheme for rapid retrieval of potentially relevant products. Our focus is on a data-independent methodology, where the indexing mechanism remains unaffected by the specific dataset, while the multi-modal representation is learned beforehand. Specifically, we utilize a multi-modal architecture, CLIP, to learn a latent representation of items by combining text and images in a contrastive manner. The resulting item embeddings encapsulate both the visual and textual information of the products, which are then subjected to various types of LSH for balancing between result quality and retrieval speed. We present the findings of our experiments conducted on two real-world datasets sourced from e-commerce platforms, comprising both product images and textual descriptions. Promising results have been achieved, demonstrating favorable retrieval time and average precision. These results were obtained through testing the approach with a specifically selected set of queries and with synthetic queries generated using a Large Language Model.

Keywords

Multi-modal embeddings, e-commerce applications, locality-sensitive hashing

1. Introduction

Product descriptions in e-commerce applications are increasingly multi-modal, with textual descriptions and product images as the main modalities. A promising avenue is therefore to leverage both modalities during item search. This approach not only offers enhanced flexibility in user-system interactions, but also presents a potential solution to the data sparsity challenge commonly encountered in the development of product recommendation strategies and during deployment in e-commerce applications [1]. However, as different modalities often entail a substantial volume of data, typically of high dimensionality, multi-modal retrieval demands extensive storage space and prolonged retrieval times. Consequently, efficient retrieval techniques based on hashing are commonly employed [2].

In this paper, we expand upon the work presented in [3] and introduce a retrieval framework tailored for e-commerce applications. This framework relies on a multi-modal representation of the items of interest, encompassing textual descriptions and product images, combined with a Locality-Sensitive Hashing (LSH) indexing scheme [4] to expedite product retrieval. The objective is to facilitate the specification of desired products through both images and text, enabling efficient retrieval of similar products. While cross-modal retrieval has been extensively explored in the literature [5], our approach makes a novel contribution by addressing two key aspects:

the exploitation of multi-modal retrieval in the context of e-commerce applications concerning the recommendation of products through both product pictures and textual descriptions;

the choice and the analysis of a suitable indexing strategy (LSH) directly working on the joint embedded representation of the product information (textual description and pictures) in an efficient way.

We focus our idea on a data-independent approach [6], where no training data is used to learn the indexing function, and the multi-modal representation is learned upstream. Specifically, in this paper, we utilize the Contrastive Language-Image Pre-Training architecture (CLIP) [7], a widely used deep multi-modal framework that learns a joint embedding of images and textual descriptions in a contrastive manner. In the context of e-commerce, this approach enables the exploitation of a unified representation for both item descriptions and images, facilitating the search for the most suitable products while considering multi-modal input. Subsequently, the unified embedding space can be explored using an appropriate indexing scheme. We propose adopting LSH to balance the precision of results against retrieval time.

A main direction we aim to investigate concerns the relevance of the retrieved results and the response time of the architecture, which are influenced by the type and quantity of resources used: the type of LSH method, the number of hash tables, the number of hash functions, and the type of visual encoder. Investigating these issues is essential for optimizing the performance of the retrieval architecture, ensuring that it delivers relevant results promptly while using computational resources efficiently.

In our experiments, we considered two real-world datasets, one in a technical domain (photography-related products) and one in the fashion domain (clothing and apparel). The indexing scheme provided by LSH is parametric, relying on hyper-parameters such as the hash family, the number of hash functions and the number of hash tables. We therefore tested different settings of different LSH schemes, and we extended some of the analyses presented in [3] by incorporating a series of synthetic queries. These queries were generated by a Large Language Model using specific prompts, designed to emulate the queries made by a typical user of an e-commerce platform. This addition makes the assessment more relevant, since the approach is evaluated on queries akin to those typically made by customers of an e-commerce platform.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 presents the architecture for generating multi-modal embeddings, Section 4 describes the retrieval strategy based on LSH, Section 5 describes the experimental framework and the results, and Section 6 reports the conclusions.

2. Related works

Research on multi-modal retrieval, a field of growing interest, aims to bridge the semantic gap between different modalities such as images and text. This is important in several domains and application tasks, and it affects the choice of the technique used to deal with such multi-modal information. Several works concerning product search focus on image retrieval only [8], text to image retrieval [9], cross-view image retrieval [10] or image retrieval augmented with text modification input [11, 12]. Feature combination or composition between text and images has also been extensively studied for VQA (Visual Question Answering) [13], which is again a form of text to image retrieval.

Recently, specific techniques for cross-modal retrieval between images and text modalities have received a lot of attention. The objective is to create suitable representations from diverse modalities (text and images in this instance) within a shared space. This enables the application of newly generated features in computing distance metrics [14, 15]. However, when aiming to establish a versatile text-to-image or image-to-text retrieval framework, challenges arise regarding generalizability. This entails ensuring system functionality even in cases of weak correlation between text and image [16], or addressing issues related to local alignment between text and images. This involves leveraging spatial relationships among objects depicted in the image [17, 18].

In our application, we can assume a strong correlation between text and images, since the text is a simple (usually compact) description of a product represented in the image; moreover, local alignment is also a minor problem, since the targeted pictures are representing specific items or products whose global characteristics are usually described in the corresponding text. In addition, the impact of spatial relationships among the objects in the image is very limited in the application we address, making unnecessary the introduction of complex architectures as those proposed in [17], where a stacked series of attention layers has to be properly devised, or in [18], where an additional graph structure is needed to capture spatial relationships after the generation of the multi-modal embedding.

Regarding indexing techniques, methods of cross-modal hashing aim to condense multi-modal data into compact and concise binary codes, while preserving the “semantic similarity” across different modalities [19, 20, 21]. However, differently from our approach, in such works the emphasis is on learning (either in a supervised or unsupervised fashion) the hash coding function, usually through deep learning models. None of them focus on the exploitation of a pre-defined indexing/hashing strategy working on the combined multi-modal embedding of the target objects. An exception is represented by the work in [22], where however, the text and the image representations are used in a rather independent way, since retrieval is based on a first step where text descriptions are compared with a set of keywords, and then the result is further filtered by considering image features. Moreover, image features are manually extracted using visual mappings for color, texture and shape. The only common aspect with our work is the use of an LSH strategy in order to retrieve most similar images from the query.

Our goal in the present paper is to exploit a standard hashing method, such as LSH, to approximate similarity-based retrieval, without the need to build a specific architecture for learning the hash codes. In addition, we want to exploit the power of deep learning methods for multi-modal embeddings (such as the CLIP-based architecture), in such a way that the representation of the target objects can keep all the similarity information from both text and image modalities.

Indeed, when handling data presented in various modalities, such as in e-commerce applications, it is crucial to leverage the information from each modality comprehensively and consistently. Comprehensive utilization ensures that no part of the information is overlooked, while consistency ensures that different formats of the same information are fused in a coherent way. Integration of deep learning and hashing methods for multi-modal retrieval can significantly enhance the efficiency of the retrieval step. Features extracted from deep models contain richer semantic information and are more adept at expressing the original data.

Two distinct primary approaches are available for consideration: data-dependent and data-independent [6]. The data-dependent category concerns approaches that learn both a representation and an indexing scheme from the original multi-modal data, frequently through supervised learning methodologies. Methods like Deep Cross Modal Hashing (DCMH) [23] and Pairwise Relationship Deep Hashing (PRDH) [24] integrate feature and hash learning directly, considering modal similarity and preserving it in hash code generation. Adversarial strategies like Self-Supervised Adversarial Hashing (SSAH) [25] and Attention-aware Deep Adversarial Hashing (ADAH) [26] further refine this idea by incorporating discriminators and attention mechanisms.

Our proposal focuses on the data-independent approach. Rather than training a deep model to simultaneously obtain the multi-modal representation and the hashing, we separately tackle learning the latent representation and employing an appropriate data-independent indexing method for retrieval. Through this architecture, we demonstrate the feasibility of constructing a multi-modal retrieval system that adopts LSH to achieve rapid retrieval times and high-quality retrieved items. While the number of hash functions and tables may depend on the data, we point out that with a relatively small number of such resources excellent results, measured by mean Average Precision (mAP), can be achieved in reasonable time. The only data-dependent aspect is the number of probes used, which is dynamically adjusted based on the required retrieval precision. Experimental results reveal that employing high-quality embeddings, specifically tailored to accurately represent cosine similarity among inputs, yields favorable outcomes across a variety of LSH configurations.

3. Multi-modal embedding for images and text

Given the multi-modal nature of e-commerce data and the need for a general architecture usable in different contexts without any modifications, the first task to address is the generation of specific multi-modal embeddings. To this end, we propose the use of CLIP, a model trained on a diverse range of images and texts without being specialized for any specific task (task-agnostic model).

The approach consists in training the model in a contrastive manner [27], such that it predicts, from a batch of $N$ pairs $\langle \mathit{image}, \mathit{text} \rangle$, the correct pairs out of the $N^{2}$ possible pairs. To achieve this, the model learns a multi-modal embedding space by training both an image and a text encoder concurrently. The objective is to maximize the cosine similarity of correct pairs while minimizing the similarity of incorrect ones (see Fig. 1). These characteristics, coupled with the large size of the dataset used for training, allow the model to keep overfitting under control and to generate embeddings that correctly map the inputs onto the multi-modal space.

Figure 1.

Contrastive pre-training in CLIP (from [7]).
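The contrastive objective described above can be illustrated with a minimal NumPy sketch. It is not the CLIP training code: the function name `clip_contrastive_loss` is hypothetical, and the temperature is fixed at an assumed value of 0.07 rather than learned. The sketch computes the $N \times N$ cosine-similarity matrix for a batch and averages the image-to-text and text-to-image cross-entropy losses, whose targets are the diagonal (correct) pairs.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the N x N cosine-similarity matrix.

    `temperature` is an assumed fixed value for illustration only.
    """
    # L2-normalise so that dot products are cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # shape (N, N); diagonal = correct pairs

    def cross_entropy_diag(z):
        # Row-wise log-softmax, then keep the diagonal (true pair) entries.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average of the image-to-text and text-to-image losses.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

With this formulation, a batch whose image and text embeddings are aligned pair by pair yields a lower loss than the same batch with the pairs shuffled.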

We examined various pre-trained versions of CLIP and ultimately opted for the following: RN50 and ViT-L/14. These options primarily differ in the image encoder utilized: RN50 employs ResNet50 [28], while ViT-L/14 utilizes a Vision Transformer (ViT) [29]. The selected CLIP architectures represent a compromise between the time taken to produce an embedding and its size. Specifically, RN50 generated embeddings are larger but require less time compared to those generated by the ViT-L/14 version.

The RN50 architecture is derived from ResNet50 with a modification known as ResNet-D [30]. This tweak involves altering the downsampling block of ResNet50 by adjusting the convolution stride in path A to prevent information loss and adding average pooling on path B before convolution (refer to Fig. 2b for details).

Figure 2.

Model tweak on ResNet50 [30].

Additionally, we incorporated low-pass filtering to anti-alias for maintaining shift-invariance [31]. Furthermore, we replaced the global average pooling layer with attention pooling, implemented using a single attention layer akin to the transformer architecture. Finally, a scaling strategy inspired by the methodology outlined in [32] was adopted to achieve a better balance among depth, width, and resolution of the image encoder.

For the transformer-based architecture ViT-L/14, we implemented the version described in [29], which includes the following parameters: a 14 $\times$ 14 image patch size, 24 layers, a latent embedding size of 1024, an MLP size of 4096, and 16 attention heads, amounting to a total of 307 million learnable parameters. As detailed in Section 5, our experiments demonstrate that the embeddings provided by the two architectures exhibit similar performance. Notably, both architectures utilize the same text encoder, which is essentially a Transformer with the architectural modifications proposed in [33].

4. Locality sensitive hashing for fast embedding retrieval

Once the multi-modal embeddings are generated, as described in the preceding section, a retrieval mechanism needs to be established to identify the embeddings most similar to a given query. This task can be tackled by exploiting some form of hashing to index the embeddings for search and retrieval. In this study, we adopt Locality-Sensitive Hashing or LSH [4] as the indexing scheme, as it offers a balance between the precision (and recall) of retrieved results and the time needed to retrieve them. This balance is crucial in our context, as we aim to construct an architecture that provides the flexibility to prioritize either precision or retrieval speed based on the configuration used. LSH is a technique where similar data points are hashed into the same “buckets” with high probability. This enables the implementation of (approximate) nearest-neighbor queries by detecting collisions in a set of hash tables properly devised for the task [34]. The main idea is to build a hash table by concatenating a suitable number $k$ of properly chosen hash functions. This makes the probability of collision between distant points in the embedding space sufficiently low. Since concatenation also reduces the probability of collision between close points, LSH builds more than one hash table, so that the probability that close points collide in at least one table is higher than the probability that they collide in a single table.
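The trade-off between $k$ and the number of tables can be made concrete with a small worked example. The sketch below assumes the hyperplane hash family, for which a single random hyperplane assigns two vectors at angle $\theta$ to the same side with probability $1 - \theta / \pi$; the function names are hypothetical and multi-probe querying is not modeled. With $k$ concatenated functions the per-table collision probability is $p^{k}$, and with $L$ independent tables the probability of colliding in at least one table is $1 - (1 - p^{k})^{L}$.

```python
import math

def hyperplane_collision_prob(cos_sim):
    """Probability that one random hyperplane puts two vectors on the same side."""
    theta = math.acos(max(-1.0, min(1.0, cos_sim)))
    return 1.0 - theta / math.pi

def retrieval_prob(cos_sim, k, L):
    """Probability of sharing a bucket in at least one of L tables of k bits each."""
    p = hyperplane_collision_prob(cos_sim)
    return 1.0 - (1.0 - p ** k) ** L

# Compare a single table with 30 tables for close, medium and distant pairs.
for sim in (0.95, 0.8, 0.3):
    single = retrieval_prob(sim, k=16, L=1)
    multi = retrieval_prob(sim, k=16, L=30)
    print(f"cos={sim:.2f}  1 table: {single:.4f}  30 tables: {multi:.4f}")
```

Raising $k$ suppresses collisions for every pair, while raising $L$ restores them mostly for pairs with high cosine similarity, which is the behavior the paragraph above describes.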

LSH is a parametric indexing method and relies on two primary hyper-parameters: the number $k$ of hash functions to be used, and the number $L > 1$ of hash tables. Additionally, we adopted a multi-probe LSH approach [35] to limit the number of hash tables. Standard LSH methods often face a challenge where $L$ must be sufficiently large to ensure good retrieval quality. In the multi-probe approach, several buckets presumed to hold potential query results are systematically examined (“probed”) within a given hash table. A critical element of LSH lies in the proximity of buckets that contain similar objects: if an object is near a query but not mapped to the same bucket, it is likely to reside in a nearby bucket, given the slight difference in hash values between the two buckets. The multi-probe technique entails devising a probing sequence to explore a cluster of buckets in close proximity to the one in which the query is indexed. The number of probes, indicating the additional hash buckets to inspect, can be dynamically adjusted to attain the desired level of retrieval accuracy [35].
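To illustrate the probing idea, the following is a minimal single-table sketch of hyperplane LSH with a simplified multi-probe query. It is not the algorithm of [35] nor the implementation used in the experiments: the class `HyperplaneTable` is hypothetical, and the probing sequence is an assumption that simply flips first the bits whose projections lie closest to the hyperplane.

```python
import itertools
import numpy as np

class HyperplaneTable:
    """One hash table of k random hyperplanes with a simplified multi-probe query."""

    def __init__(self, dim, k, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(k, dim))  # one random hyperplane per bit
        self.buckets = {}

    def _bits(self, v):
        # The sign of each projection gives one bit of the k-bit bucket key.
        return tuple((self.planes @ v > 0).astype(int))

    def add(self, idx, v):
        self.buckets.setdefault(self._bits(v), []).append(idx)

    def query(self, q, num_probes=1):
        proj = self.planes @ q
        base = (proj > 0).astype(int)
        # Bits with projections closest to zero are the most likely to differ
        # for a near neighbor, so they are flipped first (an assumed ordering).
        order = np.argsort(np.abs(proj))

        def flip_sets():
            # The empty flip set comes first: it is the query's own bucket.
            for r in range(len(order) + 1):
                yield from itertools.combinations(order, r)

        candidates = []
        for flips in itertools.islice(flip_sets(), num_probes):
            key = base.copy()
            for bit in flips:
                key[bit] ^= 1
            candidates.extend(self.buckets.get(tuple(key), []))
        return candidates
```

With `num_probes=1` only the query's own bucket is inspected; larger values visit neighboring buckets of the same table, which is how multi-probe trades extra lookups for fewer tables.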

For implementation, we opted for Falconn: FAst Lookups of Cosine and Other Nearest Neighbors [36], a well-tested and efficient library that implements LSH-based algorithms. Falconn supports two primary hash families: hyperplane LSH [37] and cross-polytope LSH [38]. Both families offer theoretical guarantees for cosine similarity, which is particularly relevant in our case since CLIP generates multi-modal embeddings that maximize input’s cosine similarity. Multi-probe is employed in both LSH families to minimize memory usage.

Figure 3.

Indexing architecture for multi-modal retrieval.

Figure 3 shows the final architecture we propose, highlighting the role of the joint embeddings provided by CLIP, and the final hashing determined by LSH.

5. Experimental framework and analysis

We conducted an experimental analysis using two different datasets:

FC: this dataset comprises photographic products from a medium-sized Italian e-commerce site managed by Inferendo. We utilized product images and names for this dataset. However, some items lacked associated images, so their embeddings were generated solely from text. We processed 11,500 images and 342,000 names.

AF: this dataset consists of approximately 190,000 Amazon fashion products, including images, reviews and item metadata. Similarly, we considered only the item image and product name. As many products had multiple images associated with them, we processed over 280,000 images in total.

As previously outlined, our proposed architecture is based on a deep neural network pretrained with a contrastive language-image approach (CLIP), coupled with an LSH indexing scheme to expedite retrieval of the generated multi-modal embeddings.

We evaluated the architecture using the aforementioned datasets in three phases:

Embedding Generation. Utilizing CLIP, we generated embeddings using two underlying visual encoder architectures: RN50 and ViT-L/14.

LSH Indexing. We indexed the embeddings using LSH, considering various parameters: the LSH algorithm (random hyperplane or cross polytope), the number of hash functions ( $k$ ), the number of tables ( $L$ ), and finally the number of probes (dynamically chosen to maintain a retrieval precision $> 0.9$ ).

Evaluation. We assessed a set of multi-modal queries on the selected configurations.
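The dynamic choice of the number of probes in the indexing phase can be sketched as follows. The source does not describe the search schedule, so the doubling strategy below is an assumption, and `choose_num_probes`, `search`, `start` and `max_probes` are hypothetical names: `search(query, num_probes)` stands for any LSH query routine returning candidate identifiers, and `ground_truth` holds the exact neighbors of each query.

```python
def choose_num_probes(search, queries, ground_truth, start, target=0.9, max_probes=4096):
    """Double the probe count until the fraction of true neighbors found exceeds `target`."""

    def precision(num_probes):
        hits = 0
        total = 0
        for query, truth in zip(queries, ground_truth):
            found = set(search(query, num_probes))
            hits += len(found & set(truth))
            total += len(truth)
        return hits / total

    num_probes = start
    # Stop as soon as the threshold is exceeded, or when the budget is exhausted.
    while precision(num_probes) <= target and num_probes < max_probes:
        num_probes *= 2
    return num_probes
```

A finer schedule (for instance, a binary search between the last failing and the first passing value) would return smaller probe counts at the cost of more evaluation rounds.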

For each dataset, we examined 16 distinct experimental configurations, encompassing different combinations of LSH algorithm (hyperplane or cross polytope), number of tables (30 or 50), number of hash functions (16 or 17), and the image encoder network (RN50 or ViT-L/14). These configurations were carefully selected as the most representative after testing several other combinations.
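The 16 configurations correspond to the full grid over the four binary choices listed above. A short sketch of the enumeration follows; the dictionary keys and string labels are illustrative names, not identifiers from the experimental code.

```python
from itertools import product

lsh_families = ["hyperplane", "cross_polytope"]
num_tables = [30, 50]
num_hash_functions = [16, 17]
image_encoders = ["RN50", "ViT-L/14"]

# Full Cartesian product of the four choices: 2 * 2 * 2 * 2 configurations.
configurations = [
    {"family": f, "tables": t, "hash_functions": k, "encoder": e}
    for f, t, k, e in product(lsh_families, num_tables, num_hash_functions, image_encoders)
]
print(len(configurations))
```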

We set up two different querying frameworks, one with a set of specifically selected queries, and one exploiting a Large Language Model (LLM) for query generation. For each query, given the number $l$ of products to retrieve, we compute Precision@k ( $P @ k$ ) and Recall@k ( $R @ k$ ), from which we compute the Average Precision ( $A P$ ):

$A P = \sum_{k = 1}^{l} P @ k \, (R @ k - R @ [k - 1]), \quad R @ 0 = 0$

A linear scan of the item embeddings is used to get the items that are relevant for a query. We select the $l$ items corresponding to embeddings with the highest cosine similarity to the query.
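The exact ground truth obtained by linear scan can be sketched in a few lines of NumPy. The function name `top_l_by_cosine` is hypothetical; the sketch assumes that the item embeddings are stored as the rows of a dense array and that no embedding is the zero vector.

```python
import numpy as np

def top_l_by_cosine(query, embeddings, l):
    """Exact top-l neighbors of `query` by cosine similarity (linear scan)."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q
    # argsort is ascending, so negate the similarities to rank the best first.
    return np.argsort(-sims)[:l]
```

The cost of this scan grows linearly with the number of items, which is why it serves only as the reference against which the LSH results are scored.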

Finally, the Mean Average Precision (mAP) is calculated over all the $N$ queries:

$\mathit{mAP} = \frac{1}{N} \sum_{i = 1}^{N} A P_{i}$

where $A P_{i}$ is the average precision of the $i$-th query. The next subsections report the details.
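The two metrics can be computed as in the following sketch, which uses the usual convention where each term weights $P @ k$ by the recall gained at rank $k$. The function names are hypothetical; `retrieved` is the ranked list returned by the index and `relevant` is the non-empty set of items selected by the linear scan.

```python
def average_precision(retrieved, relevant):
    """AP as the sum over ranks k of P@k times the recall gained at rank k."""
    relevant = set(relevant)
    hits = 0
    prev_recall = 0.0  # R@0 = 0
    ap = 0.0
    for k, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
        precision_at_k = hits / k
        recall_at_k = hits / len(relevant)
        ap += precision_at_k * (recall_at_k - prev_recall)
        prev_recall = recall_at_k
    return ap

def mean_average_precision(runs):
    """`runs` is a list of (retrieved, relevant) pairs, one per query."""
    return sum(average_precision(ret, rel) for ret, rel in runs) / len(runs)
```

Ranks at which no relevant item is retrieved contribute nothing, since the recall does not change there; only the positions of the hits affect the score.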

5.1 Specifically selected queries

Regarding the initial phase of the experimentation, we configured $N = 75$ queries for each dataset, balanced among the following categories (25 queries configured for each category):

a set of queries corresponding to actual items within the considered dataset;

a set of textual queries from item textual descriptions;

a set of visual (image-based) queries sourced from the world wide web.

The task we focused on is the recommendation of the top-5 most relevant items, where relevance is characterized by the similarity of the item with respect to the query. Table 1 presents the results for dataset FC, while Table 3 displays the results for dataset AF.

Table 1

Mean average precision achieved and number of probes used by the various configurations on FC

Hash family	n. tables	n. hash	RN50 mAP	RN50 n. probes	ViT mAP	ViT n. probes
Hyperplane	30	16	0.919	82	0.913	207
Hyperplane	30	17	0.945	64	0.914	178
Hyperplane	50	16	0.951	50	0.932	203
Hyperplane	50	17	0.947	73	0.933	50
Cross polytope	30	16	0.930	34	0.930	46
Cross polytope	30	17	0.960	54	0.919	30
Cross polytope	50	16	0.963	50	0.943	50
Cross polytope	50	17	0.932	114	0.930	197

Table 2

Average number of unique candidates per query and average retrieval time for the various configurations on AF

Hash family	n. tables	n. hash	RN50 n. cand.	RN50 time (s)	ViT n. cand.	ViT time (s)
Hyperplane	30	16	3727	0.00328	4983	0.00419
Hyperplane	30	17	4525	0.00432	3466	0.00334
Hyperplane	50	16	6413	0.00693	5451	0.00472
Hyperplane	50	17	4718	0.00447	4318	0.00425
Cross polytope	30	16	20099	0.01329	2939	0.00352
Cross polytope	30	17	5601	0.00652	7879	0.00532
Cross polytope	50	16	32078	0.02304	19238	0.01217
Cross polytope	50	17	4718	0.0045	4318	0.00423

Table 3

Mean average precision achieved and number of probes used by the various configurations on AF

Hash family	n. tables	n. hash	RN50 mAP	RN50 n. probes	ViT mAP	ViT n. probes
Hyperplane	30	16	0.981	43	0.983	118
Hyperplane	30	17	0.986	119	0.974	156
Hyperplane	50	16	0.984	137	0.997	84
Hyperplane	50	17	0.987	88	0.986	88
Cross polytope	30	16	0.998	30	0.980	37
Cross polytope	30	17	1.000	78	0.970	30
Cross polytope	50	16	0.998	50	0.985	50
Cross polytope	50	17	0.987	88	0.986	105

For each configuration, we also provided the dynamically computed number of probes specific to the situation. Across all tested configurations, the performance in terms of mAP of LSH retrieval on the generated multi-modal embeddings proved to be consistently strong, with no clear indication of one configuration being definitively superior to others. The qualitative performance slightly improved for the AF dataset, which is larger in size (more item images and descriptions contribute to a better final multi-modal representation). The number of probes necessary to achieve results remained relatively limited.

Additionally, Table 2 reports the average number of unique candidates which are examined by the retrieval procedure, along with the corresponding response time for queries, specifically for the AF dataset (the largest one). Across various configurations, only a few thousand candidates were selected, and the response was typically delivered within a few milliseconds on average. Notably, the cross polytope version of LSH tended to select a larger number of candidates compared to the hyperplane-based algorithm. However, since qualitative results in terms of mAP were comparable for both approaches, this suggests that employing a simple hyperplane LSH algorithm can be highly effective in the tested scenarios.

Finally, by way of illustration, Fig. 4 showcases the top-1 results (i.e., the most similar item) obtained for three sample queries from the AF dataset.

Figure 4.

Some results on AF dataset.

5.2 Synthetic data produced by a large language model

To further expand the set of experiments, we focused on the larger of the two datasets considered (AF) and produced a set of synthetic textual queries generated from the product titles (names) contained in the corresponding catalog. The idea is to automatically produce textual queries similar to those that a person would reasonably use when searching for fashion products. Since we are not aware of any collection of human queries on the Amazon fashion repository, we decided to exploit a Large Language Model (LLM) to create a synthetic dataset, somewhat similar to a real collection of human-generated queries. This approach harnesses the power of LLMs to automatically generate textual queries by just looking at the product titles.

After extensive experiments with models of increasing complexity, we decided to use the Falcon-40b model from TII (Technology Innovation Institute, Abu Dhabi, www.tii.ae). It is the largest of the Falcon family of models, with 40 billion parameters trained on one trillion tokens; it has been trained on a custom dataset called RefinedWeb [39], which was created by using a variety of techniques to extract high-quality content from the web. Furthermore, TII released a fine-tuned version of these models called Instruct; it exploits a specific prompting technique built on other well-established prompting methods such as “let’s think step by step” [40] and “chain-of-thought” [41], in order to improve the baseline models’ ability to reason and solve more complex problems. The model exploits in-context learning [42] to generate a suitable textual query. The idea is to show the model an example of what we want to achieve, followed by the request for the current product query.

The following is an example of a prompt we used ({product} is a placeholder for each product name in the AF dataset):

Context: To find the following product: “COCOLEGGINGS Adventure Time Print 3/4 Length Strechy Capri Legging L” on amazon.com the following text was used: “Adventure Time leggings”.
Request: Given this other product: {product} a text to find it on amazon.com would be:

For example, if product $=$ “Flycro Men’s Star Wars Logo Fans Hoddies”, then the generated textual description for retrieval would be: “Star Wars hoodie”.
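Filling the prompt for each product can be sketched as below. This only illustrates the one-shot template substitution: the names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical, straight quotes replace the typographic ones, and the call to the LLM itself is deliberately omitted because the source does not specify the inference setup.

```python
PROMPT_TEMPLATE = (
    "Context: To find the following product: "
    '"COCOLEGGINGS Adventure Time Print 3/4 Length Strechy Capri Legging L" '
    'on amazon.com the following text was used: "Adventure Time leggings". '
    "Request: Given this other product: {product} "
    "a text to find it on amazon.com would be:"
)

def build_prompt(product_title):
    """Fill the one-shot template with a product title from the catalog."""
    return PROMPT_TEMPLATE.format(product=product_title)

# Example with the product title quoted in the text above.
print(build_prompt('"Flycro Men\'s Star Wars Logo Fans Hoddies"'))
```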

We randomly selected 1000 products from the AF dataset and prompted the LLM model with them; we collected the generated textual descriptions and we submitted each one, complemented with a picture of the product, to the system for evaluation. Table 4 reports the results in terms of mAP and number of probes, while Table 5 shows the average retrieval time together with the number of candidates checked.

Table 4

Mean average precision achieved and number of probes used by the various configurations on LLM queries

Hash family	n. tables	n. hash	RN50 mAP	RN50 n. probes	ViT mAP	ViT n. probes
Hyperplane	30	16	0.8316	177	0.8230	406
Hyperplane	30	17	0.8361	383	0.8242	755
Hyperplane	50	16	0.8259	153	0.8215	303
Hyperplane	50	17	0.8318	248	0.8034	530
Cross polytope	30	16	0.8335	204	0.8176	148
Cross polytope	30	17	0.8284	294	0.8169	229
Cross polytope	50	16	0.8383	167	0.8339	146
Cross polytope	50	17	0.8493	30	0.8220	57

Table 5

Average number of unique candidates per query and average retrieval time for the various configurations on LLM queries

Hash family	n. tables	n. hash	RN50 n. cand.	RN50 time (s)	ViT n. cand.	ViT time (s)
Hyperplane	30	16	1354	0.00127	956	0.00102
Hyperplane	30	17	697	0.000809	643	0.000803
Hyperplane	50	16	1074	0.00112	808	0.000903
Hyperplane	50	17	707	0.00087	631	0.000827
Cross polytope	30	16	1304	0.00196	826	0.00142
Cross polytope	30	17	1113	0.00190	693	0.00147
Cross polytope	50	16	1242	0.00228	833	0.00197
Cross polytope	50	17	5135	0.00356	776	0.00157

As in the previous case, we observe good performance in terms of mAP (always above 0.8); the smaller values with respect to the previous set of tests can be explained by the fact that LLM-generated queries have a more generic textual description than in the previous case. Also in this situation, there is no clearly prevalent architectural configuration (results in terms of mAP are very similar, independently of the considered configuration). The number of probes is still very limited, and at most a few thousand candidates are evaluated, always within a few milliseconds. We note that the number of candidates is lower for hyperplane LSH when using embeddings generated through RN50, while no clear difference can be noticed in this respect when using ViT. We can conclude that, in both the tested situations, the average performance in terms of mAP, number of candidates and probes, and retrieval time is satisfactory. Concerning the choice of a proper LSH method, we can conclude that hyperplane LSH, which is simpler to tune than cross-polytope LSH (fewer parameters) and faster in setting up the tables, can be regarded as effective.

Figure 5.

mAP vs size of retrieval set: RN50.

Figure 6.

mAP vs size of retrieval set: ViT.

Finally, we also tested the behavior of the different configurations with respect to the size of the retrieval set $l$ (the number of retrieved products). Figures 5 and 6 show the behavior of the mAP score with respect to the number of retrieved products (ranging from $l = 1$ to $l = 10$ ), in case of RN50 and ViT embeddings respectively.

Table 6

Key for plots of Fig. 5

Line #	Hash families	n. hash	n. tables	n. probes
1	CrossPolytope	16	30	143
2	CrossPolytope	17	30	214
3	CrossPolytope	16	50	133
4	CrossPolytope	17	50	30
5	Hyperplane	16	30	118
6	Hyperplane	17	30	307
7	Hyperplane	16	50	123
8	Hyperplane	17	50	200

Table 7

Key for plots of Fig. 6

Line #	Hash families	n. hash	n. tables	n. probes
9	CrossPolytope	16	30	118
10	CrossPolytope	17	30	154
11	CrossPolytope	16	50	97
12	CrossPolytope	17	50	42
13	Hyperplane	16	30	361
14	Hyperplane	17	30	590
15	Hyperplane	16	50	249
16	Hyperplane	17	50	461

Tables 6 and 7 show the key for reading the different configurations in the plots. We notice that there is a significant difference in terms of mAP among the configurations when the number of relevant products is limited (small values of $l$ ). When this number increases, the differences seem to be less significant. However, some configurations produce peculiar results. In the case of RN50 embeddings, configuration 4 (CrossPolytope 17/50) shows a very stable behavior for $l ⩾ 2$ , fixing the mAP score around 0.85. For $l ⩾ 7$ the best is configuration 5 (Hyperplane 16/30). In the case of ViT embeddings, configuration 16 (Hyperplane 17/50) needs a value of $l ⩾ 9$ to reach its best value around 0.86, while configuration 11 (CrossPolytope 16/50) is the one performing better on average. We also notice that RN50 configurations have a more stable behavior for $l ⩾ 5$ than those related to ViT; this can be explained by the larger size of the embeddings produced by RN50, which keep more information about the different modalities.

6. Concluding remarks

This paper has presented a retrieval architecture tailored for e-commerce applications, based on a multi-modal representation of the items of interest (textual descriptions and product images). The architecture is complemented by LSH indexing, which enables fast retrieval of candidate products. The user can specify the desired products both visually (by showing or selecting a picture) and textually (by providing a description), so that products similar to what the user is looking for can be retrieved efficiently. We have performed an in-depth analysis of the possible architectural choices for embedding generation, as well as of the features of the most suitable LSH scheme. Promising results have been obtained when testing the architecture on real-world datasets and on synthetic queries generated by an LLM that mimics user interaction with the system.

In conclusion, we have presented a comprehensive exploration of a multi-modal architecture for retrieving products described through images and text. The experiments provide insights into the effectiveness and promise of the proposed approach, and the results underscore the potential of leveraging both textual and visual modalities for product retrieval in e-commerce applications. As future work, we plan to test the architecture extensively on larger datasets. Additionally, we will evaluate the robustness of the approach when datasets containing data expressed in only one modality are combined, in various relative proportions, with datasets containing data expressed in both textual and visual modalities.



1 www.tii.ae.
