Balancing preference and compatibility: A multiobjective optimization framework for outfit recommendation

Abstract

Garment recommendation in fashion e-commerce requires a practical balance between personalized user preference and outfit compatibility. Existing recommendation methods usually combine these two goals into a single scalar objective, which limits their ability to represent diverse trade-offs in apparel selection. This study proposes a multiobjective outfit recommendation (MOOR) framework that formulates outfit recommendation as a multiobjective optimization problem and searches for Pareto-optimal outfit solutions. To estimate the two objectives, the framework incorporates a heterogeneous graph-based user preference model for capturing sparse and higher-order preference signals, and a multimodal outfit compatibility model for assessing visual, textual, and attribute-level coherence between garments. A task-specific evolutionary search strategy is further introduced to explore candidate top–bottom combinations while preserving recommendation diversity. Experiments on the IQON3000 dataset show that the proposed framework provides strong preference and compatibility estimation and generates diverse trade-off solutions with favorable outfit-level quality under the reported evaluation protocol. These findings support the value of explicitly modeling the preference–compatibility trade-off in garment recommendation systems.

Keywords

Outfit recommendation, heterogeneous graph neural network, multimodal fusion, multiobjective optimization

Within the domain of fashion recommendation, constructing complete and coherent outfit compositions, rather than recommending isolated items, is a practically important yet technically challenging task. Unlike single-item recommendation, outfit recommendation must simultaneously satisfy two inherently distinct requirements. On the one hand, the recommended items should align with user-specific preferences, reflecting individual style, historical behaviors, and contextual needs. On the other hand, the outfit as a whole should maintain aesthetic compatibility in terms of style consistency, color harmony, and visual balance. These two requirements are often not fully aligned: emphasizing personalization alone may result in stylistically inconsistent combinations, whereas prioritizing compatibility alone tends to suppress individual expression and produce generic recommendations. Therefore, effectively balancing user preference and outfit compatibility remains a fundamental challenge in modern outfit recommendation systems.

Existing studies on outfit recommendation can be broadly categorized into two paradigms. The first relies on predefined or template-based outfit construction. For instance, Zhan et al.¹ constructed an attribute-aware fashion knowledge graph to model user–outfit relationships, while Verma et al.² incorporated visual preference signals to generate occasion-specific outfit suggestions. Ding et al.³ further proposed the TOG framework, which leverages category combination templates to guide outfit generation. The second paradigm focuses on item completion, such as predicting a compatible bottom for a given top.^4,5 Although effective in specific scenarios, these methods generally model personalization and compatibility in an implicit and entangled manner. In particular, most approaches map heterogeneous signals into a shared embedding space and optimize a single scalar objective (e.g., via weighted summation), which enforces a fixed and globally shared trade-off between the two objectives. Such scalarized formulations fail to capture the inherently nonlinear and user-specific conflicts between preference and compatibility, thereby limiting the diversity and flexibility of the generated outfit recommendations.

To address these limitations, we reformulate outfit recommendation as a multiobjective optimization problem, where user preference and outfit compatibility are explicitly treated as two separate but interacting objectives. Based on this formulation, we propose a multiobjective outfit recommendation (MOOR) framework that combines dedicated objective estimation with Pareto-based search over candidate outfits. Instead of collapsing multiple objectives into a single scalar score, the framework preserves a set of trade-off solutions, each reflecting a different balance between personalization and compatibility. This design provides a more flexible recommendation mechanism for garment selection than fixed-weight scalarization.

Realizing this formulation requires reliable estimators for both objectives. For preference modeling, previous graph-based approaches primarily focus on direct user–item interactions and often overlook higher-order semantic relations such as category and attribute dependencies.^6,7 To address this issue, we construct a heterogeneous graph that jointly represents users, items, and category-level semantics, and further enhance it with similarity-based edge augmentation and multihop information fusion to capture latent preference signals under sparse interaction data. For compatibility modeling, prior studies typically rely on multimodal feature fusion of visual and textual information,^8,9 yet many of them adopt relatively simple fusion strategies that are insufficient for fine-grained cross-modal alignment and higher-order item interaction modeling. To overcome these limitations, we design a multimodal compatibility evaluator that integrates visual, textual, and attribute information to learn robust and semantically consistent representations for outfit compatibility assessment. Our main contributions are summarized as follows.

We formulate outfit recommendation as a task-specific multiobjective recommendation problem by explicitly disentangling personalized preference and outfit compatibility, instead of collapsing them into a fixed scalar objective.

We instantiate the two objectives using task-adapted objective estimators: a heterogeneous graph-based estimator for user-specific preference and a multimodal estimator for outfit-level compatibility. These estimators provide the objective signals required for Pareto-based outfit generation, rather than serving as standalone generic recommendation models.

We design a task-specific search framework for outfit construction, including user-conditioned candidate generation and phenotype-aware environmental selection, to improve the representativeness and diversity of the approximated Pareto solutions.

Related work

Prior studies related to MOOR can be grouped into three methodological streams: user preference modeling, fashion compatibility modeling, and multiobjective recommendation. These streams correspond to the three technical questions addressed in this study: how to estimate a user's garment-level preference, how to measure the coherence between constituent garments, and how to balance preference and compatibility without imposing a fixed scalar weighting scheme.

User preference modeling

User preference modeling is a fundamental component of fashion recommendation, aiming to infer personalized apparel tastes from historical user–item interactions and item information. Early recommendation methods primarily relied on collaborative filtering, matrix factorization methods, and hybrid recommendation strategies,^10–13 which learn preference patterns from explicit or implicit interaction signals. In fashion recommendation, user feedback is often sparse and highly implicit, making it difficult for interaction-only models to capture the subjective and dynamic nature of apparel preference. Pairwise ranking methods such as Bayesian personalized ranking (BPR)¹⁴ improve implicit-feedback learning by directly optimizing personalized ranking from observed interactions. Domain-oriented fashion models additionally incorporate item content and visual semantics. For example, FashionDNA¹⁵ maps fashion items and customer style preferences into a shared latent space using product images, tags, and sales data, while VECF¹⁶ combines visual features with user reviews through a multimodal attention mechanism for explainable fashion recommendation. Although these methods enrich preference representation, they still have limited ability to model higher-order semantic dependencies among users, items, and attributes.

Recent deep learning methods further exploit sequence structures and graph relations for personalized fashion recommendation. Transformer-based models such as POG¹⁷ connect user preferences over individual items and outfits for personalized outfit generation. Graph neural networks (GNNs) provide a natural framework for modeling relational recommendation data.¹⁸ For example, HFGN¹⁹ constructs a hierarchical graph to jointly model users, outfits, and items, while visual-aware graph models incorporate image-derived item features into graph-based recommendation.²⁰ These studies show the value of structural representation learning in fashion recommendation. Nevertheless, models based mainly on user–item interactions, outfit–item mappings, or visual item graphs may still be insufficient for representing heterogeneous semantic relations among users, items, categories, and attributes.

Heterogeneous graph networks (HGNs) have therefore attracted increasing attention for modeling multityped nodes and relations. For example, A3-FKG¹ introduces an attentive attribute-aware fashion knowledge graph for outfit preference prediction, showing the importance of attribute-level semantic relations in fashion preference modeling. Related heterogeneous graph studies^21,22 further demonstrate the potential of heterogeneous representation learning in handling complex semantic structures in recommendation and fashion-retail scenarios. Overall, user preference modeling in fashion recommendation has progressed from interaction-based latent factor models to graph-based and heterogeneous semantic graph models. This progression suggests that effective preference estimation requires not only collaborative signals from user–item interactions but also relation-aware use of item attributes and higher-order semantic neighborhoods. However, sparse feedback and insufficient cross-layer semantic aggregation remain important limitations, motivating a preference estimator that strengthens graph structure and adaptively integrates heterogeneous neighborhood information.

Fashion compatibility modeling

Fashion compatibility modeling aims to determine whether multiple garments can form a coherent and aesthetically acceptable outfit. Early studies primarily learned visual compatibility from item co-occurrence or image-based representations. For example, Veit et al.²³ learned visual clothing compatibility across heterogeneous categories using a Siamese convolutional neural network (CNN) framework, while later visual-centric models improved compatibility prediction through richer convolutional feature extraction and feature fusion.²⁴ These studies show that visual appearance is essential for compatibility learning, but visual information alone may be insufficient because outfit coherence also depends on category, texture, color, style, and semantic attributes. To address this limitation, subsequent studies incorporated multimodal information, including textual descriptions and structured attributes. Category-aware multimodal attention models and neural fashion expert models exploit visual and textual information to improve complementary clothing matching.^25,26 Related fashion recommendation studies also integrate higher-level style and occasion semantics. De Divitiis et al.²⁷ used Kobayashi-derived color-style semantics for style-based outfit recommendation, and Becattini et al.²⁸ combined style cues with social-event information for event-aware fashion recommendation. Attribute-augmented models further introduce explicit attribute interactions to improve explainability in compatibility prediction.²⁹ These studies indicate that compatibility learning has gradually shifted from isolated visual representation toward multimodal semantic modeling. However, fine-grained alignment among visual, textual, and attribute modalities remains challenging, especially when different modalities provide noisy or incomplete signals.

Another line of work focuses on relation-aware compatibility modeling. For example, NGNN³⁰ represents an outfit through a category-level fashion graph and learns item interactions with node-wise graph neural networks. Type-aware embedding methods learn compatibility in type-specific embedding spaces, distinguishing item similarity from cross-type compatibility.³¹ More recent studies further explore cross-modal attention and color-compatibility modeling to capture subtle item-pair relations.^32,33 Overall, fashion compatibility modeling has evolved from visual-style similarity learning to multimodal, type-aware, and relation-aware interaction modeling. The central requirement is to learn a compatibility-oriented representation space in which stylistically coherent garments are close and mismatched garments are separable. Existing methods still face challenges in fine-grained multimodal alignment and discriminative handling of hard negative combinations, motivating a compatibility estimator that better calibrates visual, textual, and attribute information.

Multiobjective optimization recommendation

Traditional recommendation models are typically designed to optimize a single objective, such as accuracy or ranking relevance. However, real recommendation scenarios often involve multiple criteria, including relevance, diversity, novelty, compatibility, and personalization.³⁴ In the fashion domain, multiobjective formulations have been explored in related tasks such as capsule wardrobe generation,^35,36 where the goal is to select a compact set of mutually compatible and versatile garments under practical constraints. These studies demonstrate the value of considering multiple fashion-related criteria simultaneously. In broader recommendation research, Pareto-efficient optimization has been introduced to address conflicts among multiple objectives. Lin et al.³⁷ proposed a Pareto-efficient framework for multiobjective e-commerce recommendation, and Xie et al.³⁸ further studied personalized approximate Pareto-efficient recommendation. These studies show that Pareto-based formulations can provide a principled way to handle objective conflicts. Meanwhile, multiobjective learning studies also indicate that fixed weighted scalarization can be limited when objectives compete, because a single predefined weighting scheme may force a global compromise and obscure alternative trade-off solutions.³⁹

For personalized outfit recommendation, the key challenge is not only to identify compatible garment combinations, but also to preserve candidate outfits that reflect different balances between user preference and outfit compatibility. Existing personalized outfit models often combine these factors into a single scoring or training objective, which makes the final recommendation sensitive to predefined weights and may collapse diverse trade-off solutions into one compromise. Motivated by this limitation, we formulate outfit recommendation as a multiobjective optimization problem and use Pareto-based search to approximate a set of representative nondominated outfit candidates. This formulation enables the recommendation process to retain diverse preference–compatibility trade-offs rather than imposing a fixed scalar objective.

Our approach

Problem definition

We formulate personalized outfit recommendation as a multiobjective optimization problem. Let $U$ denote the set of users and $I$ denote the set of fashion items. The item set $I$ is partitioned into two disjoint subsets, namely tops $I_{t}$ and bottoms $I_{b}$, such that $I_{t} \cap I_{b} = \emptyset$ and $I_{t} \cup I_{b} = I$. An outfit $o$ is defined as a top–bottom pair $o = (i_{t}, i_{b})$, where $i_{t} \in I_{t}$ and $i_{b} \in I_{b}$. For a given user $u \in U$, the goal is to identify a set of candidate outfits that jointly optimize two objectives: personalized preference and outfit compatibility.

Personalized preference ($f_{pref}$): The preference objective measures the degree to which the constituent garments of an outfit match the individualized taste of a target user. Let $f_{pref} (u, i)$ denote the predicted preference score of user $u$ for item $i$. For a top–bottom outfit $o = (i_{t}, i_{b})$, the outfit-level preference score is defined as a set-level aggregation of the user–item preference scores of its constituent garments: $f_{pref} (u, o) = AGG (f_{pref} (u, i_{t}), f_{pref} (u, i_{b}))$, where $AGG (\cdot)$ is an order-insensitive aggregation function. Following the additive utility view in set-level recommendation studies, in which the preference for a set can be estimated from the utilities or ratings of its constituent items,⁴⁰ $AGG (\cdot)$ is instantiated as the arithmetic mean. Since the two item-level scores are produced by the same preference estimator and lie on the same scale, the arithmetic mean provides a scale-preserving equal-weight aggregation of the user’s preference for the top and bottom items without introducing additional role-specific weighting parameters. This objective captures the user-specific appeal of the selected garments at the item level.

Outfit compatibility ( $f_{comp}$ ): The compatibility objective evaluates the intrinsic coherence of an outfit independent of any specific user. It measures whether the items in an outfit form a stylistically coordinated combination in terms of visual appearance, semantic attributes, and category relations. Formally, the compatibility score of an outfit is defined as $f_{comp} (o) = f_{comp} (i_{t}, i_{b})$ , where $f_{comp} (i_{t}, i_{b})$ is a learned function that measures the coherence between the top $i_{t}$ and the bottom $i_{b}$ in a multimodal compatibility space. This space is learned from visual, textual, and attribute information, so the resulting score reflects compatibility-oriented semantic coherence rather than raw visual or textual similarity. Unlike the preference objective, which is user-dependent, the compatibility objective characterizes outfit-level coherence as an item-combination property.

The two objectives $f_{pref}$ and $f_{comp}$ are inherently conflicting. An outfit composed of highly preferred items may exhibit poor stylistic coherence, while an outfit with strong compatibility may not align with a user's personal preference. Consequently, optimizing one objective in isolation often leads to suboptimal solutions with respect to the other.

To address this trade-off, we adopt the standard formulation of multiobjective optimization.⁴¹ An outfit $o^{*}$ is Pareto optimal if there does not exist another outfit $o$ such that $f_{pref} (u, o) \geq f_{pref} (u, o^{*}), f_{comp} (o) \geq f_{comp} (o^{*})$ , with at least one strict inequality. The set of all Pareto-optimal solutions forms the Pareto front, which represents different trade-offs between personalized preference and compatibility. Accordingly, the outfit recommendation task is formulated as

\underset{o \in I_{t} \times I_{b}}{maximize} F (u, o) = [f_{pref} (u, o), f_{comp} (o)]

(1)

where the goal is to approximate the Pareto-optimal solution set rather than a single optimal solution.
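The dominance relation and the resulting nondominated set can be illustrated with a minimal Python sketch. It assumes a tiny candidate space that can be enumerated exhaustively, which is not how the framework searches in practice; the item identifiers and all score values are hypothetical and serve only to show how the arithmetic-mean preference aggregation and the Pareto filter interact.

```python
from typing import Dict, List, Tuple

Outfit = Tuple[str, str]  # (top_id, bottom_id)


def outfit_preference(pref: Dict[str, float], outfit: Outfit) -> float:
    """Arithmetic-mean aggregation of the two item-level preference scores."""
    top, bottom = outfit
    return 0.5 * (pref[top] + pref[bottom])


def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(scores: Dict[Outfit, Tuple[float, float]]) -> List[Outfit]:
    """Outfits whose (f_pref, f_comp) vector is not dominated by any other outfit."""
    return [
        o for o, s in scores.items()
        if not any(dominates(t, s) for p, t in scores.items() if p != o)
    ]


# Hypothetical item-level preference scores for one user.
pref = {"t1": 0.9, "t2": 0.4, "b1": 0.8, "b2": 0.2}
# Hypothetical user-independent compatibility scores.
comp = {
    ("t1", "b1"): 0.30,
    ("t1", "b2"): 0.60,
    ("t2", "b1"): 0.25,
    ("t2", "b2"): 0.90,
}

scores = {o: (outfit_preference(pref, o), c) for o, c in comp.items()}
front = pareto_front(scores)
```

In this toy case, the outfit (t2, b1) is dominated by (t1, b1), which is better on both objectives, while the remaining three outfits are mutually nondominated and represent different preference–compatibility balances.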

As illustrated in Figure 1, the proposed MOOR framework instantiates the above formulation using two dedicated evaluators for preference and compatibility, respectively, followed by a task-specific evolutionary search procedure that explores the solution space and approximates the Pareto front. From the obtained nondominated solutions, a subset of representative outfits is finally selected for recommendation.

Figure 1.

Architectural overview of the proposed multiobjective outfit recommendation (MOOR) framework. The framework consists of three components. (1) A heterogeneous graph-based preference model estimates user-specific garment preference from multirelational interaction data. (2) A multimodal compatibility model evaluates the coherence of garment combinations using visual, textual, and attribute information. (3) The preference and compatibility scores are treated as two explicit objectives in a task-specific evolutionary search procedure, which approximates a Pareto set of nondominated outfit candidates. From these candidates, a subset of representative outfits is finally selected for recommendation.

Heterogeneous graph-based user preference modeling

To instantiate the personalized preference objective $f_{pref} (u, i)$ , a heterogeneous graph-based user preference model (HGUP) estimates user–item preference scores from sparse multirelational fashion interactions. In the MOOR pipeline, these scores serve as the preference objective for Pareto-based outfit generation. Since the downstream search is directly guided by the estimated preference scores, HGUP is designed to incorporate heterogeneous relations among users, items, and categories and to capture higher-order semantic dependencies relevant to fashion preference.

We construct a heterogeneous graph $G = (V, E)$, where the node set $V = V_{u} \cup V_{i} \cup V_{c}$ contains users, items, and item categories. The edge set $E = E_{ui} \cup E_{uc} \cup E_{ic}$ encodes multiple relation types, including user–item interactions, user–category associations, and item–category affiliations, along with their inverse relations. Each item node is initialized by combining an item identity embedding with visual features extracted from a pretrained CLIP encoder through a learnable gating mechanism. This initialization enables the model to incorporate both collaborative signals and content-based information.

Semantic structure augmentation

To alleviate data sparsity, we augment the graph topology with similarity-based edges that capture implicit collaborative relationships. For user nodes, we define the similarity between two users $u_{i}$ and $u_{j}$ using the Jaccard coefficient over their interaction sets:

Sim (u_{i}, u_{j}) = \frac{| R_{u_{i}} \cap R_{u_{j}} |}{| R_{u_{i}} \cup R_{u_{j}} |}

(2)

where $R_{u_{i}}$ denotes the set of items previously interacted with by user $u_{i}$ . Only edges with similarity above a predefined threshold are retained to avoid introducing noisy connections.
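As a minimal sketch of this augmentation step, the following Python snippet computes the Jaccard coefficient over interaction sets and keeps only user pairs above a threshold. The function names, the toy interaction data, and the threshold value of 0.3 are assumptions for illustration; the source does not report the threshold used. The all-pairs loop is quadratic in the number of users and is meant only to make the definition concrete.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b|; taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def user_similarity_edges(
    interactions: Dict[str, Set[str]], threshold: float
) -> List[Tuple[str, str, float]]:
    """Return (user, user, similarity) triples whose similarity exceeds the threshold."""
    edges = []
    for u, v in combinations(sorted(interactions), 2):
        sim = jaccard(interactions[u], interactions[v])
        if sim > threshold:
            edges.append((u, v, sim))
    return edges


# Hypothetical interaction sets R_u for three users.
interactions = {
    "u1": {"i1", "i2", "i3"},
    "u2": {"i2", "i3", "i4"},
    "u3": {"i5"},
}

# The threshold is an assumed value, not one taken from the source.
edges = user_similarity_edges(interactions, threshold=0.3)
```

Here only the pair (u1, u2) shares items (two out of four distinct items), so it is the only user–user edge added to the graph.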

For item nodes, we design a hybrid similarity function that combines behavioral and attribute-based signals:

Sim (i_{m}, i_{n}) = α \cdot Sim_{behav} (i_{m}, i_{n}) + (1 - α) \cdot Sim_{context} (i_{m}, i_{n})

(3)

Here, $Sim_{behav} (i_{m}, i_{n})$ is computed from co-interaction statistics, and $Sim_{context} (i_{m}, i_{n})$ measures similarity based on shared categorical attributes. The coefficient $α$ balances these two components. This augmentation densifies the graph and improves the reliability of preference estimation.
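The hybrid item similarity can be sketched as follows. The source does not specify the exact form of the behavioral and contextual components, so this example assumes both are Jaccard coefficients: one over the sets of users who interacted with each item, the other over the sets of categorical attribute values. The data, the function names, and the value of the balancing coefficient are hypothetical.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient; taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_item_similarity(
    item_m: str,
    item_n: str,
    users_of: Dict[str, Set[str]],
    attrs_of: Dict[str, Set[str]],
    alpha: float,
) -> float:
    """alpha * behavioral similarity + (1 - alpha) * attribute-based similarity."""
    # Assumption: co-interaction statistics are summarized by user-set overlap.
    sim_behav = jaccard(users_of[item_m], users_of[item_n])
    # Assumption: shared categorical attributes are summarized by attribute-set overlap.
    sim_context = jaccard(attrs_of[item_m], attrs_of[item_n])
    return alpha * sim_behav + (1.0 - alpha) * sim_context


# Hypothetical toy data.
users_of = {"i1": {"u1", "u2"}, "i2": {"u2", "u3"}}
attrs_of = {
    "i1": {"shirt", "white", "casual"},
    "i2": {"shirt", "white", "formal"},
}

sim = hybrid_item_similarity("i1", "i2", users_of, attrs_of, alpha=0.5)
```

Setting the coefficient to 1 recovers a purely behavioral similarity and setting it to 0 a purely attribute-based one, which is useful for items with very few interactions.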

Heterogeneity-aware representation learning

We adopt the Heterogeneous Graph Transformer (HGT) to model multirelational dependencies.⁴² For each target node $j$ at layer $l$, where $l \in \{1, \dots, L\}$, messages from its topological neighbors are aggregated across all relation types $r \in R$. The aggregation utilizes attention weights that explicitly account for semantic heterogeneity. To prevent information degradation from previous layers and facilitate deep network training, we formulate the message passing with a residual connection (for $l \geq 1$):

h_{j}^{(l)} = ϕ (Linear (\sum_{r \in R} \sum_{s \in N_{r} (j)} Attn (s, r, j) \cdot M (s, r)) + h_{j}^{(l - 1)})

(4)

where $N_{r} (j)$ denotes the neighbors of node $j$ under relation $r$ , $Attn (s, r, j)$ is the relation-specific attention score, $M (s, r)$ is the projected message vector from source node $s$ , and $ϕ$ is the LeakyReLU activation function.

However, deep GNNs often suffer from over-smoothing, where the representations of different nodes become increasingly indistinguishable as the number of layers grows. To effectively capture both local neighborhood signals and global structural semantics, we propose an adaptive multilayer fusion mechanism (as shown in Figure 2). Specifically, we introduce a trainable weight parameter vector and apply the Softmax function to obtain the normalized importance score $β_{l}$ for each layer:

H_{j} = \sum_{l = 1}^{L} β_{l} \cdot h_{j}^{(l)}

(5)

Figure 2.

The process of message aggregation for HGUP.

This allows the model to dynamically balance local and global structural information for each node $j$. After $L$ layers, we obtain the final embeddings $H_{u}$ for user $u$ and $H_{i}$ for item $i$, from which the user’s preference score is predicted via a nonlinear interaction function:

{\hat{y}}_{ui} = MLP (H_{u} \oplus H_{i})

(6)

where ⊕ denotes vector concatenation and ${\hat{y}}_{ui}$ is the final predicted probability that user $u$ will interact with item $i$ .
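The adaptive layer fusion of Eq. (5) and the scoring step of Eq. (6) can be sketched in plain Python as follows. This is an illustrative sketch only: the per-layer embeddings are hypothetical, the MLP is replaced by a single linear layer with a sigmoid output (an assumption, since the source does not give the MLP architecture), and the layer logits stand in for the trainable weight vector.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(
    layer_embeddings: Sequence[Sequence[float]], layer_logits: Sequence[float]
) -> List[float]:
    """H_j = sum_l beta_l * h_j^(l), with beta = softmax(layer_logits)."""
    betas = softmax(layer_logits)
    dim = len(layer_embeddings[0])
    return [
        sum(b * h[d] for b, h in zip(betas, layer_embeddings)) for d in range(dim)
    ]


def score(
    h_user: Sequence[float],
    h_item: Sequence[float],
    weights: Sequence[float],
    bias: float,
) -> float:
    """One-layer stand-in for the MLP: sigmoid(w . [H_u concat H_i] + b)."""
    x = list(h_user) + list(h_item)  # vector concatenation
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical embeddings from L = 2 layers, dimension 2.
h_user_layers = [[1.0, 0.0], [0.0, 1.0]]
h_item_layers = [[0.2, 0.4], [0.6, 0.8]]
layer_logits = [0.0, 0.0]  # equal logits give equal layer weights

H_u = fuse_layers(h_user_layers, layer_logits)
H_i = fuse_layers(h_item_layers, layer_logits)
y_hat = score(H_u, H_i, weights=[0.0, 0.0, 0.0, 0.0], bias=0.0)
```

With equal logits, each layer contributes half of the fused embedding; during training, the logits would shift weight toward shallower or deeper layers depending on which neighborhoods are more informative.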

Since real-world fashion data typically consist of implicit feedback rather than explicit ratings, we formulate preference learning as a pairwise ranking task. We adopt the BPR framework, which assumes that a user prefers observed interactions over unobserved ones. For each observed positive interaction $(u, i) \in D^{+}$, we sample a negative item $i^{-}$ that the user has not interacted with. To ensure efficient training, we employ a dynamic uniform negative sampling strategy. The model parameters $Θ$ are learned by maximizing the posterior probability that the user prefers item $i$ over item $i^{-}$. This is equivalent to minimizing the following BPR objective function:

L_{HGUP} = - \sum_{(u, i) \in D^{+}} \ln σ ({\hat{y}}_{ui} - {\hat{y}}_{u i^{-}}) + λ ‖ Θ ‖_{2}^{2}

(7)

where $σ (\cdot)$ is the sigmoid function, and $λ$ controls the $L_{2}$ regularization strength to prevent overfitting.
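A minimal sketch of the objective in Eq. (7) is given below, assuming the positive and negative scores have already been computed for each sampled pair. The function names, the score values, and the flat parameter list are hypothetical; a real implementation would compute the loss on batched tensors and rely on automatic differentiation.

```python
import math
from typing import Iterable, Sequence, Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable ln(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bpr_loss(
    score_pairs: Iterable[Tuple[float, float]],
    params: Sequence[float],
    lam: float,
) -> float:
    """-sum ln sigmoid(y_pos - y_neg) + lam * ||params||_2^2."""
    ranking = -sum(log_sigmoid(pos - neg) for pos, neg in score_pairs)
    l2 = lam * sum(p * p for p in params)
    return ranking + l2


# Hypothetical (positive score, sampled negative score) pairs.
pairs = [(2.0, 0.0), (0.0, 0.0)]
loss = bpr_loss(pairs, params=[], lam=0.0)

# A wider margin between positive and negative scores lowers the loss.
loss_wider = bpr_loss([(4.0, 0.0), (2.0, 0.0)], params=[], lam=0.0)
```

The loss depends only on score differences, so the estimator is trained to rank observed items above sampled negatives rather than to match absolute rating values.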

The trained model HGUP provides the preference estimator $f_{pref} (u, i) = {\hat{y}}_{ui}$ , which is subsequently used as one of the objective functions in the multiobjective optimization framework.

Multimodal outfit compatibility learning

To instantiate the outfit compatibility objective $f_{comp} (o)$ , we design a multimodal outfit compatibility estimator, denoted as MMOC. MMOC estimates compatibility scores for top–bottom pairs using visual, textual, and attribute-level evidence. The estimated compatibility scores are then used as the second objective in the MOOR search stage. Since the quality of this objective directly affects the preference–compatibility trade-off, MMOC combines multimodal representation learning, cross-modal semantic calibration, and uncertainty-aware modality fusion to provide stable compatibility signals for outfit construction. In contrast to the preference objective, which is conditioned on user–item interaction history, the compatibility objective characterizes outfit-level coordination as an inherent property of the garment combination itself.

Multimodal representation construction

Given an outfit $o = (i_{t}, i_{b})$ , each item is represented using three complementary modalities: visual appearance, textual description, and structured attributes. For the visual modality, we extract multiscale representations from a pretrained CLIP encoder by aggregating intermediate feature maps across multiple layers. The resulting visual representation is defined as

v = \sum_{m = 1}^{M} ω_{m} \cdot Proj_{v} (L_{m}^{v}) + ω_{g} \cdot Proj_{g} (E_{global}^{v})

(8)

where $L_{m}^{v}$ denotes the local feature map from the $m$th layer, $E_{global}^{v}$ is the global embedding, and $\{ω_{m}, ω_{g}\}$ are learnable weights normalized via Softmax. This formulation captures both local discriminative patterns and global semantic cues.

Textual features are constructed analogously to obtain representation $t$ . In addition, structured attributes (e.g., category, color, and style) are embedded and projected into a latent space:

a = σ (W_{a} \cdot [e_{cat} \oplus e_{col} \oplus e_{sty}] + b_{a})

(9)

where ⊕ denotes concatenation. This attribute representation provides complementary semantic signals that are not directly observable from raw modalities.

Cross-modal semantic calibration

To model interactions across modalities, we construct a unified representation by stacking modality-specific features $X^{(0)} = [v, t, a]$ and applying multihead self-attention:

{\tilde{X}}^{(l)} = MHSA (X^{(l - 1)}, X^{(l - 1)}, X^{(l - 1)})

(10)

where ${\tilde{X}}^{(l)} = [{\tilde{v}}^{(l)}, {\tilde{t}}^{(l)}, {\tilde{a}}^{(l)}]$ captures cross-modal interactions at layer $l$ .

In a standard Transformer, fused features are typically subjected to a shared layer normalization, which may cause modality collapse, i.e., the statistics of the dominant modality overwhelm those of weaker modalities. We therefore adopt a disentangled residual strategy: for each modality-specific channel, we apply an independent, learnable normalization layer to enforce that each modality preserves its original feature distribution while absorbing cross-modal context. The updates are formulated as

v^{(l)} = LN_{v} (v^{(l - 1)} + {\tilde{v}}^{(l)} + FFN ({\tilde{v}}^{(l)}))

(11)

t^{(l)} = LN_{t} (t^{(l - 1)} + {\tilde{t}}^{(l)} + FFN ({\tilde{t}}^{(l)}))

(12)

a^{(l)} = LN_{a} (a^{(l - 1)} + {\tilde{a}}^{(l)} + FFN ({\tilde{a}}^{(l)}))

(13)

where $LN_{v}$, $LN_{t}$, and $LN_{a}$ denote independent learnable normalization parameters. This design explicitly preserves modality heterogeneity and mitigates representation homogenization during gradient-based optimization.
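The modality-specific update in Eqs. (11) to (13) can be sketched in plain Python as below. The sketch assumes standard layer normalization with a per-modality gain and bias, uses a ReLU as a hypothetical stand-in for the shared feed-forward network, and takes the attended features as given inputs rather than computing self-attention. All numeric values are illustrative.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def layer_norm(
    x: Sequence[float], gain: Sequence[float], bias: Sequence[float], eps: float = 1e-5
) -> Vector:
    """Standard layer normalization with elementwise learnable gain and bias."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (xi - mean) * inv_std + b for xi, g, b in zip(x, gain, bias)]


def modality_update(
    prev: Sequence[float],
    attended: Sequence[float],
    ffn: Callable[[Sequence[float]], Vector],
    gain: Sequence[float],
    bias: Sequence[float],
) -> Vector:
    """x^(l) = LN_x(x^(l-1) + x_tilde^(l) + FFN(x_tilde^(l))) for one modality."""
    transformed = ffn(attended)
    summed = [p + a + f for p, a, f in zip(prev, attended, transformed)]
    return layer_norm(summed, gain, bias)


def relu_ffn(x: Sequence[float]) -> Vector:
    """Hypothetical stand-in for the shared feed-forward network."""
    return [max(0.0, xi) for xi in x]


# Each modality keeps its own (gain, bias); the values are illustrative.
norm_params: Dict[str, Tuple[Vector, Vector]] = {
    "v": ([1.0] * 4, [0.0] * 4),
    "t": ([2.0] * 4, [0.0] * 4),
    "a": ([1.0] * 4, [0.5] * 4),
}
prev = {"v": [1.0, 2.0, 3.0, 4.0], "t": [0.0, 1.0, 0.0, 1.0], "a": [2.0, 0.0, 1.0, 3.0]}
attended = {"v": [0.5, -0.5, 0.5, -0.5], "t": [0.1, 0.2, 0.3, 0.4], "a": [0.0, 1.0, 0.0, -1.0]}

updated = {
    m: modality_update(prev[m], attended[m], relu_ffn, *norm_params[m]) for m in prev
}
```

Because each modality is normalized with its own parameters, the output statistics of one channel (for example, the shifted mean of the attribute channel here) are not forced to match those of the others.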

Uncertainty-aware modality fusion

After $L$ interaction layers, we obtain modality-aligned representations $v^{(L)}, t^{(L)}, a^{(L)}$. To aggregate them into a unified embedding, we introduce a learnable weighting mechanism:

z = w_{v} \cdot v^{(L)} + w_{t} \cdot t^{(L)} + w_{a} \cdot a^{(L)}

(14)

where $\{w_{v}, w_{t}, w_{a}\}$ are normalized weights. This formulation adaptively balances modality contributions under varying signal quality.

Given an anchor item representation $z_{t}$, a compatible positive $z_{b}^{+}$, and a set of negatives $\{z_{b, j}^{-}\}_{j = 1, \dots, J}$, we define the compatibility objective using a contrastive loss:

L_{comp} = - \log \frac{\exp (S (z_{t}, z_{b}^{+}) / τ)}{\exp (S (z_{t}, z_{b}^{+}) / τ) + \sum_{j = 1}^{J} \exp (S (z_{t}, z_{b, j}^{-}) / τ)}

(15)

where $S (\cdot, \cdot)$ denotes the cosine similarity, and $τ$ is a learnable temperature parameter that controls the sharpness of the contrastive distribution. Since $τ$ is optimized jointly with the representation encoders, an excessively small $τ$ may amplify minor similarity differences and lead to an over-sharpened contrastive distribution. To avoid this degenerate temperature scaling, we introduce a hinge-style lower-bound regularization term:

L_{reg} = δ \cdot max (0, τ_{min} - τ)

(16)

where $τ_{min}$ is the preset minimum temperature and $δ$ is the balancing coefficient. This term is zero when $τ \geq τ_{min}$ and increases linearly when $τ$ falls below $τ_{min}$. It therefore preserves the flexibility of a learnable temperature while preventing the contrastive logits from becoming excessively sharp. The final compatibility learning objective is defined as

L_{MMOC} = L_{comp} + L_{reg}

(17)
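The combined objective of Eqs. (15) to (17) can be sketched for a single anchor as follows. The embeddings and the values of the temperature, its lower bound, and the balancing coefficient are hypothetical, and the vectors are assumed to be nonzero so that the cosine similarity is defined. A practical implementation would operate on batches and treat the temperature as a trainable parameter.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumes both vectors are nonzero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmoc_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    tau: float,
    tau_min: float,
    delta: float,
) -> float:
    """Contrastive loss plus hinge-style lower-bound penalty on the temperature."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, neg) / tau for neg in negatives]
    # Log-sum-exp trick keeps the computation stable for small temperatures.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(logit - m) for logit in logits))
    l_comp = log_denominator - logits[0]
    l_reg = delta * max(0.0, tau_min - tau)
    return l_comp + l_reg


# Hypothetical embeddings: one top (anchor), one matching bottom, one mismatched bottom.
z_top = [1.0, 0.0]
z_pos = [1.0, 0.0]
z_negs = [[0.0, 1.0]]

loss_ok = mmoc_loss(z_top, z_pos, z_negs, tau=0.5, tau_min=0.05, delta=1.0)
# Temperature below the lower bound: the hinge term becomes active.
loss_sharp = mmoc_loss(z_top, z_pos, z_negs, tau=0.01, tau_min=0.05, delta=10.0)
```

In the second call, the contrastive term is already close to zero, so the total loss is dominated by the penalty, which grows linearly as the temperature drops further below its lower bound.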

After the multimodal compatibility model is trained, each item is represented in the learned compatibility space. During the multiobjective recommendation phase, the compatibility score $f_{comp} (o)$ of an outfit is computed as the cosine similarity between the learned representations of its top and bottom items:

f_{comp} (o) = \frac{z_{t} \cdot z_{b}}{‖ z_{t} ‖ ‖ z_{b} ‖}

(18)

A higher cosine value indicates stronger learned coherence between the two garments in the compatibility space. This formulation converts the learned outfit compatibility into a bounded scalar objective for the downstream Pareto optimization.

Multiobjective optimization outfit recommendation

The user preference estimator and the outfit compatibility estimator developed previously provide the two objective functions required by the formulation. Based on these two objectives, the recommendation stage is organized as a Pareto-based search over the combinatorial space of top–bottom garment pairs. For each target user, MOOR first constructs user-specific top and bottom candidate item subsets, and their Cartesian product defines the feasible outfit search space. As the number of feasible outfits increases with the product of the two candidate subset sizes, MOOR evaluates only a fixed number of top–bottom pairs during the search, rather than exhaustively scoring every pair in their Cartesian product. The evaluated pairs are iteratively updated through evolutionary search to approximate representative trade-off solutions. In this way, MOOR generates a set of nondominated outfit candidates that reflect different balances between personalized preference and outfit compatibility. MOOR builds on the Pareto approximation ability of multiobjective evolutionary algorithms (MOEAs) and adapts the search process to the structure of outfit recommendation. Specifically, user-conditioned item-level candidate construction guides the search toward garments that are relevant to the target user, structure-aware evolutionary operators preserve the validity of top–bottom combinations, and phenotype-aware environmental selection promotes diversity at the item-combination level. These components jointly support efficient exploration of the outfit search space under the preference–compatibility objective formulation. The overall optimization pipeline is illustrated in Figure 3.

Figure 3.

Flowchart of the proposed multiobjective outfit recommendation procedure.

Preference-stratified candidate pool construction

To guide the evolutionary search, we construct a user-specific candidate pool $C_{u}$ based on the preference scores predicted by HGUP. Instead of sampling uniformly from the entire item set, top and bottom items are separately stratified into multiple preference regimes. Let $T_{u}$ and $B_{u}$ denote the top set and bottom set for user $u$ , respectively. According to the predicted preference scores, the two sets are partitioned into $M$ preference strata. In practice, these strata correspond to high-preference, medium-preference, and exploratory regions. We then construct user-specific top and bottom candidate item subsets by stratified sampling from each regime. The outfit-combination search space induced by these two item subsets is defined as

C_{u} = \{o = \{i_{t}, i_{b}\} \mid i_{t} \in T_{u}^{*}, i_{b} \in B_{u}^{*}\}

(19)

where $T_{u}^{*}$ and $B_{u}^{*}$ represent the user-specific top and bottom candidate item subsets, respectively, and $C_{u}$ represents their Cartesian product in the form of feasible top–bottom outfits. This preference-stratified construction reduces the global item space to a user-relevant item-combination space while retaining candidates from the high-preference, medium-preference, and exploratory regions. Compared with a pure Top-K preference filter, it keeps the search from being restricted to only the most preferred items and leaves room for potentially compatible combinations involving less obvious garments. The resulting search space therefore supports both preference-guided exploitation and compatibility-oriented exploration in the subsequent evolutionary procedure.
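The construction can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: strata are taken to be equal-size rank bands of the predicted preference scores, the per-stratum quotas are hypothetical, and an outfit is stored as a `(top, bottom)` tuple.

```python
import random

def stratified_candidates(scores, quotas, rng):
    """Sample a candidate subset for one garment type.

    scores: dict mapping item id -> predicted preference score.
    quotas: number of items to draw per stratum, ordered from the
            highest-preference stratum to the exploratory one (assumed).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    m = len(quotas)
    # Equal-size rank bands are an assumption; the text only states M strata.
    bounds = [(k * len(ranked)) // m for k in range(m + 1)]
    chosen = []
    for k, quota in enumerate(quotas):
        stratum = ranked[bounds[k]:bounds[k + 1]]
        chosen.extend(rng.sample(stratum, min(quota, len(stratum))))
    return chosen

def candidate_space(top_subset, bottom_subset):
    """Eq. (19): Cartesian product of the top and bottom candidate subsets."""
    return [(t, b) for t in top_subset for b in bottom_subset]
```

In practice the product would not be materialized in full, since the search evaluates only a fixed number of pairs; `candidate_space` is shown only to make the definition explicit.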

Structure-aware initialization and evolutionary operators

Given that an outfit is represented as a structured pair $o = \{i_{t}, i_{b}\}$, standard genetic operators designed for unconstrained vectors are not directly applicable. We therefore design structure-aware initialization and variation strategies. The initial population is generated by sampling valid top–bottom outfits from the induced combination space $C_{u}$:

P_{0} = \{o_{1}^{(0)}, o_{2}^{(0)}, \dots, o_{N_{p}}^{(0)}\}, \quad o_{k}^{(0)} \in C_{u}

(20)

where $N_{p}$ is the population size. To encourage broad stylistic coverage, the initial population is sampled across diverse category pairs whenever possible. During evolution, crossover exchanges either the top item or the bottom item between two parent outfits. Given two parent solutions $p_{1} = \{t_{1}, b_{1}\}$ and $p_{2} = \{t_{2}, b_{2}\}$, their offspring can be generated as ${\tilde{p}}_{1} = \{t_{1}, b_{2}\}$ and ${\tilde{p}}_{2} = \{t_{2}, b_{1}\}$. Mutation replaces a single garment in the current outfit with another item sampled from the corresponding user-specific top or bottom candidate subset. These operators preserve the compositional validity of outfits while enabling flexible local exploration of the solution space. In this way, the evolutionary procedure remains compatible with the structured nature of garment combinations rather than treating outfits as unconstrained vectors.
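The two operators can be sketched as below, with an outfit stored as a `(top, bottom)` tuple. Choosing the mutated slot with equal probability is an assumption made for illustration; the text does not state how the slot is chosen.

```python
import random

def crossover(p1, p2):
    """Swap bottoms between two parents; every offspring is still one top plus one bottom."""
    (t1, b1), (t2, b2) = p1, p2
    return (t1, b2), (t2, b1)

def mutate(outfit, top_subset, bottom_subset, rng):
    """Resample exactly one slot from the matching candidate subset."""
    top, bottom = outfit
    if rng.random() < 0.5:  # assumed: both slots are equally likely to mutate
        return (rng.choice(top_subset), bottom)
    return (top, rng.choice(bottom_subset))
```

Because each operator only ever places a top in the top slot and a bottom in the bottom slot, no repair step is needed after variation. Note that this sketch may resample the item already in place, in which case the outfit is unchanged.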

Phenotype-aware environmental selection

Standard MOEAs maintain diversity using metrics such as crowding distance in objective space. However, in outfit recommendation, diversity in the objective space does not necessarily correspond to diversity in the item space. Solutions that are well separated in terms of objective values may still correspond to visually or semantically similar outfits. To address this issue, we introduce a phenotype-aware environmental selection strategy. Specifically, structurally redundant outfits are filtered according to item-level similarity before selection, and higher priority is assigned to solutions that improve category-level diversity within the retained candidate set. In this way, the selection procedure explicitly bridges the gap between objective-space optimization and user-perceived diversity, preventing the approximated Pareto front from being dominated by redundant solutions.
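The following sketch illustrates the idea under two simplifying assumptions that are not taken from the text: structural redundancy is reduced to exact duplicate item pairs (the paper filters by item-level similarity, whose measure is not specified here), and solutions are ranked by how many other solutions dominate them. The dictionary keys are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def phenotype_aware_select(population, n_keep):
    """population: list of dicts with keys 'top', 'bottom', 'cats' (category pair)
    and 'f' = (f_pref, f_comp). Returns at most n_keep survivors."""
    # Step 1: drop structurally redundant outfits (assumed: identical item pairs).
    seen, unique = set(), []
    for o in population:
        key = (o["top"], o["bottom"])
        if key not in seen:
            seen.add(key)
            unique.append(o)
    # Step 2: rank each outfit by the number of solutions dominating it (0 = nondominated).
    ranks = [sum(dominates(q["f"], o["f"]) for q in unique) for o in unique]
    order = sorted(range(len(unique)), key=lambda i: ranks[i])
    # Step 3: within the best remaining rank, prefer a category pair not yet selected.
    survivors, used_cats = [], set()
    while order and len(survivors) < n_keep:
        tier = [i for i in order if ranks[i] == ranks[order[0]]]
        pick = next((i for i in tier if unique[i]["cats"] not in used_cats), tier[0])
        survivors.append(unique[pick])
        used_cats.add(unique[pick]["cats"])
        order.remove(pick)
    return survivors
```

The point of the sketch is the ordering of concerns: dominance decides which tier a solution belongs to, and category novelty breaks ties inside a tier, so diversity in the item space never overrides Pareto quality.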

Through iterative evolution, MOOR produces an approximation of the Pareto-optimal solution set defined over $f_{pref}$ and $f_{comp}$ . From this set, a subset of $K$ representative solutions is selected to form the final recommendation list. Each selected outfit corresponds to a distinct trade-off between personalized preference and outfit compatibility, enabling the system to provide diverse and flexible recommendations rather than a single deterministic result.

Experiments

Settings

Dataset and preprocessing

We conduct experiments on the IQON3000 dataset, a widely adopted benchmark for fashion outfit recommendation. The original dataset contains 308,747 outfits from 3568 users and 672,335 unique fashion items, each associated with a visual image, a textual description, and attribute metadata. Following prior studies,⁴³ we focus on top–bottom outfit recommendation. To ensure sufficient interaction signals for user preference modeling, only users with at least four outfit interactions are retained. For each selected user, we further filter their interaction history to include only items belonging to top and bottom categories. This preprocessing step reduces noise from irrelevant categories while preserving meaningful outfit composition patterns. The final dataset statistics are reported in Table 1.

Table 1.

Dataset statistics.

Dataset	Interactions	Users	Items	Outfits
Training	129,732	1769	79,477	44,218
Validation	14,988	1769	13,133	4500
Test	19,394	1769	16,548	4356
Total	164,114	1769	94,223	53,074

To enrich the dataset with high-level aesthetic signals, we further engineer a style attribute for each outfit as an auxiliary feature. Following Kobayashi's Color Image Scale and related style-based fashion recommendation studies,²⁷,²⁸,⁴⁴ each outfit is deterministically categorized into one of six color harmony types: Complementary, Contrasting, Analogous, Similar, Monochromatic, and Neutral (see Figure 4 for illustrations). This style attribute is incorporated as a structured attribute input to the outfit compatibility model (MMOC) and is consistently provided to all comparative methods that utilize attribute information.

Figure 4.

Color wheel matching principle.

Parameter settings

To ensure a rigorous and leakage-free evaluation, we adopt a user-stratified data splitting strategy. For each user, their interaction history is randomly partitioned into training, validation, and test sets with a ratio of 8:1:1. Each complete outfit is assigned exclusively to a single split, preventing information leakage across different stages of the framework. This splitting strategy is particularly critical for our multistage framework, as it ensures that both the preference model (HGUP) and the compatibility model (MMOC) are trained only on training-split data, while the multiobjective optimization stage operates exclusively on unseen test outfits. The HGUP and MMOC models are trained independently on the training split, utilizing the validation set for hyperparameter tuning. Crucially, during the final recommendation stage, the parameters of both HGUP and MMOC are frozen and serve as deterministic objective functions. MOOR operates in a pure inference mode without parameter updates, performing inference-time optimization to explore the Pareto frontier and generate recommendations. This design ensures that the optimization process does not access test interactions and is guided solely by learned representations.
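A per-user split of this kind can be sketched as follows. The rounding rule is an assumption (integer division, with the remainder going to the test split), so for very short histories the validation split may be empty; the text does not state how such cases are handled.

```python
import random

def split_user_outfits(outfit_ids, rng, parts=(8, 1, 1)):
    """Randomly assign each outfit of one user to exactly one split (train, val, test)."""
    ids = list(outfit_ids)
    rng.shuffle(ids)
    total = sum(parts)
    n_train = len(ids) * parts[0] // total
    n_val = len(ids) * parts[1] // total
    # Whole outfits are assigned, so no outfit is shared across splits.
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```

Applying the function independently to each user keeps every user represented in the training data while preventing any single outfit from appearing in two splits.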

All models were implemented in PyTorch 2.1.1 with CUDA 11.8 and trained on an NVIDIA A800 (80 GB) GPU. The key hyperparameter configurations are summarized in Table 2. For the HGUP model, we adopted a pairwise ranking formulation with a 1:1 positive-to-negative sample ratio to establish a clear decision boundary. For the MMOC model's contrastive learning task, a 1:2 ratio was used to provide more negative examples, enhancing the model's ability to learn a discriminative embedding space. This setting forces the model to not only pull compatible items closer but also to push incompatible distractors farther away, thereby sharpening the decision boundaries in the high-dimensional embedding space.

Table 2.

Hyperparameter settings for model components.

Component	Hyperparameter	Symbol	Value	Hyperparameter	Symbol	Value
HGUP	Hidden dimension	$d_{h}$	64	HGT layers	$L$	3
	Attention heads	A	4	Dropout rate	-	0.2
	Learning rate	$lr$	1 × 10^–3	Batch size	$B_{HGUP}$	512
	Epochs	-	30	L2 strength	$λ$	1 × 10^–4
MMOC	Hidden dimension	$d_{m}$	512	TMSC layers	-	2
	Attention heads	A	4	Dropout rate	-	0.2
	Learning rate	$lr$	1 × 10^–4	Weight decay	-	5 × 10^–5
	Epochs	-	30	Batch size	$B_{MMOC}$	32
MOOR	Max generations	$T_{max}$	30	Population size	$N_{p}$	50
	Crossover probability	$p_{c}$	0.9	Mutation probability	$p_{m}$	0.2
	Candidate subset size per garment type	-	300	-	-	-

Evaluation metrics

To comprehensively evaluate different components of the framework, we adopt task-specific metrics aligned with the three evaluation tasks. For user preference prediction (HGUP), we use hit ratio@K (HR@K) and normalized discounted cumulative gain (NDCG@K) with $K \in \{10, 20\}$. These metrics evaluate retrieval accuracy and the position-aware ranking quality, respectively. For outfit compatibility prediction (MMOC), we employ area under the receiver operating characteristic (ROC) curve (AUC) and average precision (AP), which assess the model’s ability to distinguish compatible and incompatible item combinations.
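The two ranking metrics can be sketched for a single user as below, assuming binary relevance and the common log2 position discount. The exact variant used in the experiments is not specified, so this is the textbook form rather than the paper's evaluation code.

```python
import math

def hit_ratio_at_k(ranked_items, relevant, k):
    """1.0 if at least one relevant item appears in the top-k list, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked_items[:k]) else 0.0

def ndcg_at_k(ranked_items, relevant, k):
    """Binary-relevance NDCG@k with a 1 / log2(position + 1) discount (1-based positions)."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

HR@K only registers whether a held-out item is retrieved, whereas NDCG@K also rewards placing it near the top, which is why the two can rank models differently.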

For multiobjective recommendation (MOOR), we evaluate the quality of the generated solution set using three standard indicators: hypervolume (HV), spacing (SP), and convergence. HV measures the dominated volume in the objective space, SP quantifies the uniformity of solution distribution, and convergence measures the distance to the ideal point after normalizing objective values to [0,1]. A lower convergence value indicates closer proximity to the optimal trade-off.
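The three indicators can be sketched for the two-objective maximization case as below. Several choices here are assumptions, since the text does not fix them: the hypervolume reference point is (0, 0), spacing follows Schott's formulation with L1 nearest-neighbor distances, and convergence is the mean Euclidean distance to the ideal point (1, 1). The reference point and scaling used for Table 5 are not stated, so values from this sketch are not directly comparable to those reported.

```python
import math

def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by the points and bounded below by ref (both objectives maximized)."""
    area, best_y = 0.0, ref[1]
    # Sweep from the largest first objective; only points raising the second add area.
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        if x > ref[0] and y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

def spacing(points):
    """Schott's spacing: sample standard deviation of L1 nearest-neighbor distances."""
    if len(points) < 2:
        return 0.0
    d = [min(abs(p[0] - q[0]) + abs(p[1] - q[1])
             for j, q in enumerate(points) if j != i)
         for i, p in enumerate(points)]
    mean_d = sum(d) / len(d)
    return math.sqrt(sum((mean_d - di) ** 2 for di in d) / (len(d) - 1))

def convergence(points, ideal=(1.0, 1.0)):
    """Assumed definition: mean Euclidean distance from each solution to the ideal point."""
    return sum(math.dist(p, ideal) for p in points) / len(points)
```

A perfectly evenly spaced front gives a spacing of zero, and dominated points leave the hypervolume unchanged, which matches the intended reading of HV as higher-is-better and SP as lower-is-better.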

Accordingly, the experiments are designed to answer three questions: whether the two estimators are reliable, whether the proposed search strategy is effective, and whether the resulting outfits exhibit favorable trade-off quality under a surrogate evaluation protocol.

Comparison of user preference models

To evaluate whether HGUP can serve as a reliable estimator of the preference objective in the proposed MOOR framework, we first examine its user–item preference prediction performance before the multiobjective search stage. The purpose of this experiment is to verify whether the learned preference scores are sufficiently accurate and discriminative to guide Pareto-based outfit generation. Since the optimization stage relies directly on $f_{pref}$ , inaccurate preference estimates would distort the trade-off structure between preference and compatibility and reduce the quality of the final recommendation set. We compare HGUP with representative preference-modeling baselines that cover latent factor collaborative filtering, neural collaborative filtering, graph collaborative filtering, and heterogeneous graph representation learning.

MF⁴⁵: A classical matrix factorization model that estimates user–item preference through latent factor interactions.

NCF⁴⁶: A neural collaborative filtering model that captures nonlinear user–item interactions using multilayer perceptrons.

NGCF⁶: A graph collaborative filtering model that propagates user and item embeddings over the user–item bipartite graph.

LightGCN⁴⁷: A simplified graph collaborative filtering model that focuses on linear neighborhood propagation without feature transformation or nonlinear activation.

HGCL⁴⁸: A heterogeneous graph contrastive learning model that learns robust representations from multitype relational structures.

These baselines progressively incorporate collaborative, structural, and heterogeneous relational information, enabling us to assess whether HGUP can provide a reliable preference signal for the MOOR objective under the same evaluation protocol. The results in Table 3 indicate that HGUP achieves the best performance on all four metrics, suggesting that it provides more discriminative preference estimates than the compared baselines under the reported protocol. MF and NCF generally perform worst because they learn preferences solely from the interaction matrix: MF is limited by its bilinear scoring function, and although NCF introduces nonlinearity via an MLP, it still lacks an explicit relational inductive bias and cannot effectively exploit semantic side information under sparse implicit feedback. Graph-based recommenders improve performance on most metrics by leveraging higher-order collaborative signals through neighborhood aggregation. NGCF improves over MF on all metrics and over NCF on HR@10 and both NDCG metrics, although NCF attains a higher HR@20 (0.5085 vs. 0.4556); its propagation on the user–item bipartite graph remains restricted to interaction edges and therefore cannot explicitly encode the attribute semantics that are crucial for fashion preference. LightGCN further strengthens performance by simplifying graph propagation (removing feature transformations and nonlinear activations), which often yields more stable representation learning on interaction graphs and reduces overfitting to noisy transformations.

Table 3.

Performance comparison of different user preference prediction models.

Baseline	HR@10	HR@20	NDCG@10	NDCG@20
MF	0.2104	0.3252	0.0380	0.0443
NCF	0.2998	0.5085	0.0530	0.0720
NGCF	0.3544	0.4556	0.0731	0.0747
LightGCN	0.5192	0.6459	0.1166	0.1243
HGCL	0.5390	0.6454	0.1224	0.1283
HGUP	0.5577	0.6595	0.1575	0.1605

Notably, HGCL and HGUP constitute the strongest group of methods on HR@10 and both NDCG metrics (on HR@20, HGCL is marginally below LightGCN, 0.6454 vs. 0.6459), underscoring the value of modeling heterogeneous relations beyond user–item interactions. The remaining gap between them further indicates that representation learning driven primarily by contrastive invariance does not necessarily translate into optimal performance under sparse implicit feedback. The observed gains are consistent with the use of semantic structure augmentation and adaptive multilayer fusion, which may improve information propagation under sparse fashion interactions. Collectively, these results show that the learned preference scores are sufficiently informative and well separated to serve as one of the two objective functions in the Pareto search process, rather than merely acting as an auxiliary recommendation model. Therefore, HGUP is used in the subsequent MOOR experiments as the estimator for the preference objective, because it provides relatively reliable and discriminative user–item preference scores under the reported protocol.

Comparison of outfit compatibility models

This section evaluates the effectiveness of MMOC as an estimator for the outfit compatibility objective. Unlike user preference modeling, compatibility estimation is intended to characterize the intrinsic coherence of garment combinations independently of any specific user. Since the estimated compatibility scores are used as the second objective in downstream multiobjective search, the purpose of this experiment is to examine whether MMOC can provide sufficiently stable and discriminative compatibility signals. Therefore, we compare MMOC with representative baselines that cover sequence-based outfit modeling, type-aware metric learning, graph-based multimodal relation modeling, and Transformer-based global item interaction.

Bi-LSTM⁴⁹: A sequence-based compatibility model that represents an outfit as an ordered item sequence and learns compatibility through bidirectional recurrent encoding.

Type-Aware-Net³¹: A type-aware metric learning model that learns category-specific embedding spaces for cross-type compatibility matching.

MOCM⁵⁰: A graph-based multimodal compatibility model that constructs modality-oriented graphs to capture intramodal and intermodal garment relations.

OutfitTransformer⁵¹: A Transformer-based model that jointly encodes outfit items to learn global interitem interactions.

The performance comparison for outfit compatibility prediction is reported in Table 4. MMOC achieves the best results on both AUC and AP, indicating its effectiveness in learning a discriminative compatibility scoring function. Compared with Bi-LSTM and Type-Aware-Net, MMOC yields consistently stronger results, suggesting that compatibility estimation based only on sequential dependence or type-specific metric matching is insufficient for capturing the complex semantic relationships involved in outfit formation. Compared with MOCM, which also models multimodal relational information, MMOC further improves compatibility prediction, indicating that more explicit cross-modal calibration before compatibility assessment may be beneficial. OutfitTransformer achieves competitive performance by modeling global interitem interactions through self-attention. However, its direct fusion of heterogeneous modalities within a shared attention space may introduce noise due to modality imbalance. In contrast, MMOC performs cross-modal interaction followed by disentangled normalization and uncertainty-aware fusion, which helps preserve modality-specific characteristics while reducing the influence of noisy signals. Overall, the results support the use of MMOC as the compatibility estimator for the objective function $f_{comp}$ in the proposed multiobjective recommendation framework.

Table 4.

Performance of outfit compatibility prediction models.

Baseline	AUC	AP
Bi-LSTM	0.8523	0.7678
Type-Aware-Net	0.8678	0.7845
MOCM	0.8891	0.8012
OutfitTransformer	0.9012	0.8151
MMOC	0.9278	0.8456

Multiobjective optimization outfit recommendation

This section empirically evaluates the proposed MOOR framework from four aspects. First, we examine whether the proposed search strategy provides favorable Pareto-approximation quality in comparison with representative MOEA baselines. Subsequently, we assess the outfit-level quality of the final recommendation results under an external-expert evaluation protocol. An extended ablation study is then conducted to systematically investigate not only the multiobjective formulation but also the individual contributions of MOOR's specific structural components. Finally, we present a qualitative case study to intuitively illustrate the practical advantages of our model in addressing user heterogeneity.

Efficacy of the MOOR search framework

The purpose of this experiment is to examine whether the proposed search design provides favorable Pareto-approximation quality under the multiobjective outfit recommendation formulation defined previously. To evaluate this aspect, we compare MOOR with two representative MOEA baselines under the same problem formulation, the same user-specific candidate item subsets, and the same evaluation budget. This setting allows us to assess how different evolutionary search strategies approximate the preference–compatibility trade-off in the top–bottom outfit-combination space.

SPEA2⁵²: A dominance- and archive-based MOEA that assigns fitness by combining Pareto strength and density information.

MOEA/D⁵³: A decomposition-based MOEA that transforms a multiobjective problem into a set of scalar subproblems and evolves them collaboratively.

The results in Table 5 indicate that MOOR achieves better Pareto-approximation performance than the two representative MOEA baselines under the reported setting. The higher HV value indicates that the obtained solutions dominate a larger region of the objective space, whereas the lower SP and convergence values suggest a more uniformly distributed and closer approximation to the target trade-off region. These results indicate that combining user-specific candidate item construction with phenotype-aware environmental selection is beneficial for exploring the preference–compatibility trade-off in outfit recommendation. MOOR also achieves a slightly lower average optimization time (21.17 s versus 22.32 s for MOEA/D and 23.45 s for SPEA2) under the same evaluation setting. This runtime corresponds to the complete evolutionary search procedure for generating a Pareto candidate set, rather than the latency of a single online recommendation request. Overall, the results support the effectiveness of the proposed task-specific evolutionary search strategy for approximating representative nondominated outfit candidates under the reported experimental protocol.

Table 5.

Comparison of the experimental results of three MOEAs.

Model	HV ↑	SP ↓	Convergence ↓	Average time (s) ↓
SPEA2	2.3044	0.0486	0.1120	23.45
MOEA/D	3.3163	0.0396	0.0612	22.32
MOOR	5.9869	0.0365	0.0243	21.17

For practical deployment, MOOR can be implemented in a nearline generation and online reranking paradigm. The evolutionary search can periodically generate Pareto candidate sets for target users or user groups, while the online stage only performs lightweight selection or reranking from the precomputed nondominated candidates according to the serving context. This separates the computationally heavier Pareto-search stage from real-time request handling and makes the framework more suitable for practical recommendation scenarios. Further reducing the search budget or accelerating objective evaluation remains an important direction for deployment-oriented optimization.

Figure 5 presents the joint density distribution of Pareto solutions for all users in the test set. The global view of the hexagonal heatmap reveals a distinct curved high-density band (dark red region) that spans from the upper left to the lower right of the plot and corresponds to a subset of high-quality candidate outfits. Within this high-quality subspace, the user preference score and the compatibility score exhibit a clear negative correlation. This pattern indicates that placing more emphasis on individual preference typically requires sacrificing a certain level of general compatibility. Conversely, enforcing very strong compatibility tends to limit the extent to which personalized preferences can be satisfied. The observed negative dependence highlights the intrinsic limitation of simple linear weighting schemes in outfit generation and recommendation. Such schemes usually converge to a single compromise point and fail to explore the diverse set of high-quality solutions that lie along the Pareto front. At the same time, the spread of the solutions suggests that the proposed search framework can retain multiple feasible trade-off candidates within the explored outfit-combination space instead of collapsing to a narrow compromise region.

Figure 5.

Global trade-off density analysis: user preference versus compatibility.

Outfit-level recommendation quality

To further assess the quality of the final recommendation results, we evaluate whether the proposed framework can generate outfit candidates that maintain a favorable balance between user preference and outfit compatibility. Because direct human evaluation is not available in the current setting, we adopt an external-expert protocol to provide complementary evidence beyond the internal objective values. Specifically, for each user, an approximate Pareto-optimal solution set is first generated by MOOR. From this set, the top-K candidate outfits, with $K \in \{5, 10, 20\}$, are selected according to MOOR’s internal Pareto-based ranking strategy, which prioritizes solutions with higher crowding distance and balanced objective contributions so that the selected outfits are both high in quality and diverse. We then use the baseline models HGCL and OutfitTransformer as external experts to score these outfits and obtain their predicted preference and compatibility scores, respectively. For the top-K outfits of each user, we compute the Average-preference-score@K and the Average-compatibility-score@K. Under the setting $λ = 0.5$, these two scores are combined with equal weights to form the overall metric AverageOutfitScore@K, which reflects a balanced assessment of personalization and outfit coherence. These three metrics are reported as AvgPref@K, AvgComp@K, and AvgOutfit@K in Table 6. This evaluation design serves as a proxy for the outfit quality that MOOR would present to users in real recommendation scenarios. At the same time, it reduces the reliance of evaluation on the internal objective functions through external expert scoring, which enables more consistent and comparable quantitative assessment of personalization and compatibility in generative outfit recommendation settings where comprehensive ground truth labels are unavailable. In addition to preference consistency and outfit compatibility, personalized outfit recommendation should provide users with a sufficiently diverse set of candidate outfits. We therefore introduce the average category diversity metric AvgDiv@K to measure the coverage of category combinations within the recommendation list. AvgDiv@K ranges from 0 to 1, with higher values indicating that the recommended outfits span more diverse category combinations and offer users a broader choice space.
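The outfit-level metrics can be sketched as follows. The combined score is read here as a convex combination with weight λ, which reduces to a simple average at λ = 0.5. The diversity function is only one plausible reading of "coverage of category combinations" (the share of distinct category pairs in the list) and should be treated as an assumption, since the exact formula is not given.

```python
def average_outfit_score(avg_pref, avg_comp, lam=0.5):
    """Weighted combination of the external preference and compatibility scores."""
    return lam * avg_pref + (1.0 - lam) * avg_comp

def category_diversity(category_pairs):
    """Assumed definition: fraction of distinct (top category, bottom category) pairs."""
    if not category_pairs:
        return 0.0
    return len(set(category_pairs)) / len(category_pairs)
```

As a check against Table 6, the K = 5 row gives 0.5 × 0.8587 + 0.5 × 0.9895 = 0.9241, which matches the reported AvgOutfit@5.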

Table 6 reports the outfit-level evaluation results under different top-K settings. Overall, MOOR maintains high and relatively stable external evaluation scores across all K values, indicating that the generated results are not only of high quality at the top-ranked positions but also preserve favorable preference consistency and outfit compatibility at deeper recommendation positions. As K increases from 5 to 20, the preference and compatibility scores show only limited fluctuations (AvgOutfit@K decreases from 0.9241 to 0.9096), suggesting that MOOR does not obtain only a few isolated high-quality solutions but produces a candidate solution set with stable overall quality. AvgDiv@K, in contrast, declines from 0.2500 to 0.1735 as the list grows. Because HGCL and OutfitTransformer adopt representation mechanisms and scoring logic different from the HGUP and MMOC modules used within MOOR, the stable high-quality evaluation under the external proxy protocol indicates that the outputs of MOOR are not merely locally optimized for its internal objective functions. Instead, the results suggest that MOOR captures outfit matching patterns that can be consistently recognized across different model architectures. Overall, the results in Table 6 support the effectiveness of MOOR in terms of final outfit-level recommendation quality and show that the generated nondominated solution set can provide candidate outfits with relatively high external credibility for practical personalized fashion recommendation.

Table 6.

Outfit-level evaluation of recommendation results.

Top-K	AvgOutfit@K	AvgPref@K	AvgComp@K	AvgDiv@K
K = 5	0.9241	0.8587	0.9895	0.2500
K = 10	0.9207	0.8563	0.9850	0.2045
K = 20	0.9096	0.8517	0.9676	0.1735

Ablation study

To examine how different objective signals, optimization formulations, and search components affect MOOR, we conduct an ablation study. The variants are organized into three groups: objective-signal ablations, which remove either the preference or compatibility objective; a scalarization baseline, which combines the two objectives into a single weighted objective; and a search-mechanism ablation, which removes the phenotype-aware selection strategy. This design evaluates whether the two estimated objectives provide complementary signals for Pareto-based outfit generation, whether explicit multiobjective optimization is preferable to fixed scalarization, and whether the task-specific selection mechanism contributes to recommendation diversity.

Preference-only (without MMOC): This variant removes the compatibility-objective signal provided by MMOC and optimizes only the outfit-level preference objective. The same search procedure is retained, but the fitness evaluation is based only on the preference score.

Compatibility-only (without HGUP): This variant removes the preference-objective signal provided by HGUP and optimizes only the outfit-compatibility objective. The same search procedure is retained, but the fitness evaluation is based only on the compatibility score.

Linear scalarization (LS): This baseline retains both preference and compatibility scores but converts the multiobjective formulation into a single weighted objective.

MOOR without PAS: A variant of the full model where the proposed phenotype-aware selection (PAS) strategy is ablated. Environmental selection reverts to standard NSGA-II crowding distance, relying solely on diversity in the objective space without considering structural item-level redundancy.

The experimental results in Figure 6 show that removing either objective signal leads to an imbalanced recommendation outcome. When only a single objective is optimized, the resulting solutions exhibit an extreme bias. Specifically, the preference-only (without MMOC) variant maximizes the preference score but incurs a severe degradation in compatibility. Conversely, the compatibility-only (without HGUP) variant achieves high compatibility but fails to capture personalized preferences, resulting in a substantially lower preference score. These single-objective formulations consistently lead to suboptimal overall recommendation quality (AvgOutfit). This empirically demonstrates that maximizing one objective inherently compromises the conflicting objective, thereby substantiating the necessity of a multiobjective formulation.

Figure 6.

Ablation study of our proposed MOOR.

Beyond standard performance metrics, the evaluation of recommendation diversity provides deeper insights into the generated solution space. Although the LS baseline achieves a competitive overall score, its diversity metric collapses to a very low value, indicating that fixed-weight scalarization drives the optimization process toward a limited region of the Pareto front and produces highly homogeneous solutions. In contrast, MOOR achieves substantially higher diversity while simultaneously attaining superior overall performance. This result suggests that the proposed multiobjective framework not only improves recommendation quality but also yields a well-structured and diverse set of trade-off solutions, which is essential for practical outfit recommendation scenarios requiring both quality and user choice flexibility.
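The tendency of fixed-weight scalarization to concentrate on a limited region of the front can be illustrated with a toy example. The three objective vectors below are hypothetical values chosen for illustration, not results from the experiments; they are mutually nondominated, and the middle one lies in a nonconvex part of the front.

```python
def scalarized_best(front, w):
    """Return the solution maximizing w * f_pref + (1 - w) * f_comp."""
    return max(front, key=lambda f: w * f[0] + (1.0 - w) * f[1])

# Hypothetical (f_pref, f_comp) values; none of the three dominates another.
front = [(0.9, 0.1), (0.45, 0.45), (0.1, 0.9)]

# Sweep the weight from 0 to 1 and collect every solution the weighted sum can select.
picked = {scalarized_best(front, w / 10) for w in range(11)}
```

In this toy case `picked` contains only the two extreme solutions: the balanced outfit `(0.45, 0.45)` scores 0.45 under every weight, while the better of the two extremes always scores at least 0.5. A Pareto-based search retains all three, which is consistent with the diversity gap between LS and MOOR described above.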

Furthermore, the comparison between the full MOOR framework and the MOOR without PAS variant validates the specific contribution of the proposed selection mechanism. Although standard multiobjective selection can preserve distribution in objective space, it does not necessarily prevent structural redundancy in the generated outfit candidates. The improved diversity of the full model suggests that incorporating item-level similarity filtering and category-level diversity considerations helps translate objective-space spread into more perceptually meaningful recommendation diversity. Therefore, the proposed selection strategy appears beneficial for retaining representative outfit candidates with reduced redundancy.

User case study

To further examine the practical behavior of the proposed framework, we conduct a qualitative case study on six randomly selected users. For each user, we analyze the approximated Pareto solutions, the trade-off patterns between user preference and outfit compatibility, and the visual characteristics of the recommended outfits.

Visualization of the Pareto front

Figure 7 displays the three-dimensional Pareto fronts for the selected users, plotted over top preference, bottom preference, and compatibility scores. Across users, the nondominated solutions trace an approximately continuous, concave surface. The solutions are widely distributed along these surfaces rather than clustered in a localized region, which illustrates the nonconvex nature of the objective space and the trade-off between compatibility and the decomposed item-level contributions to the preference objective. This pattern indicates that improving one objective generally requires a compromise in another. Furthermore, the density and even spread of the points suggest that the evolutionary search mechanism within MOOR avoids premature convergence to local optima and approximates a diverse spectrum of high-quality trade-off solutions in a discrete combinatorial space.
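For readers who want to reproduce this kind of plot from their own score triples, the following minimal sketch extracts the nondominated points from a candidate set under maximization of all three axes. The score values are made up for illustration, and the function names are assumptions rather than part of MOOR.

```python
def dominates(a, b):
    """True if a is at least as good as b on every axis and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def nondominated(points):
    """Return the points that no other point dominates (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Toy (top preference, bottom preference, compatibility) triples.
scores = [
    (0.9, 0.8, 0.2),
    (0.7, 0.6, 0.6),
    (0.4, 0.3, 0.9),
    (0.6, 0.5, 0.5),  # dominated by (0.7, 0.6, 0.6)
    (0.3, 0.3, 0.1),  # dominated by several points
]
pareto = nondominated(scores)
print(pareto)
```

The pairwise check is quadratic in the number of candidates, which is adequate for plotting a few hundred solutions per user but would need a faster sorting scheme inside a search loop.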

Figure 7.

Pareto front of six target users generated by the MOOR model.

Analysis of objective trade-offs

To further illustrate how the trade-off structure varies across users, Figure 8 provides parallel coordinate visualizations of the same solution sets. Each line corresponds to one candidate outfit and shows its compatibility score together with the diagnostic decomposition of the preference objective. A recurring pattern is that improvements in user preference are not always accompanied by improvements in outfit compatibility, which is consistent with the quantitative motivation for multiobjective optimization.

Figure 8.

Scores of the solution sets for six users across compatibility and the two decomposed preference scores.

At the same time, the degree of this trade-off differs across users. For some users, relatively favorable solutions can be found in which preference and compatibility remain jointly high, whereas for others, increasing preference is associated with a more visible reduction in compatibility. These user-specific differences suggest that a fixed scalarization strategy may be insufficient for capturing the full range of feasible recommendation trade-offs.
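The user-specific differences described above can be summarized with simple indicators. The sketch below is one possible way to do so, under the assumptions that preference has been aggregated into a single score per outfit and that both scores are normalized to [0, 1]. The two indicator names and all numbers are hypothetical and are not metrics reported in the experiments.

```python
def balance_score(solutions):
    """Largest value v such that some outfit has both scores >= v."""
    return max(min(pref, comp) for pref, comp in solutions)


def compatibility_drop(solutions):
    """Compatibility given up when moving from the most compatible outfit
    to the most preferred one."""
    most_compatible = max(solutions, key=lambda s: s[1])
    most_preferred = max(solutions, key=lambda s: s[0])
    return most_compatible[1] - most_preferred[1]


# Toy (preference, compatibility) solution sets for two hypothetical users.
user_a = [(0.95, 0.70), (0.90, 0.80), (0.80, 0.85)]  # mild trade-off
user_b = [(0.90, 0.30), (0.50, 0.60), (0.20, 0.90)]  # sharp trade-off

for name, sols in (("user_a", user_a), ("user_b", user_b)):
    print(name, balance_score(sols), round(compatibility_drop(sols), 2))
```

A high balance score with a small compatibility drop corresponds to users for whom both criteria can stay jointly high, whereas a low balance score with a large drop corresponds to users for whom a single fixed weight would discard most of the feasible trade-offs.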

Recommendation visualization and case analysis

Figure 9 presents a representative case study illustrating the practical outcomes of different optimization paradigms. The user's historical interactions show a consistent preference for minimalist, casual wear in neutral tones. As shown in Figures 9(b) and 9(c), single-objective optimization has clear limitations: preference-only optimization captures individual item appeal (e.g., highly preferred tops and bottoms) but fails to ensure aesthetic coordination, resulting in disjointed ensembles, whereas compatibility-only optimization produces visually harmonious outfits but sacrifices personalization, generating generic recommendations disconnected from the user's preferences. In contrast, the proposed MOOR framework retains candidates with different balances between these two criteria. As shown in Figure 9(d), the final recommendations span a range of feasible trade-off solutions, from more preference-oriented candidates to more compatibility-oriented alternatives.

Figure 9.

Comparative visualization of single-objective and multiobjective optimization recommendation results.

Conclusion and future work

This study has investigated outfit recommendation from the perspective of balancing user preference and outfit compatibility. To move beyond fixed scalarization-based recommendation, we have proposed MOOR, a multiobjective recommendation framework that combines a heterogeneous graph-based user preference estimator, a multimodal outfit compatibility estimator, and a task-specific evolutionary search procedure. Within this formulation, HGUP is used to provide the preference objective, MMOC is used to provide the compatibility objective, and the search stage is designed to approximate representative nondominated outfit candidates. The experimental results provide support for the proposed framework in several respects. First, the component-level evaluations indicate that HGUP and MMOC can provide effective estimators for user preference and outfit compatibility under the reported protocol. Second, the optimization-level analyses suggest that the proposed search procedure achieves favorable Pareto-approximation quality in the discrete outfit recommendation space. Third, the outfit-level and ablation results provide complementary evidence that the framework can preserve useful trade-off solutions and recommendation diversity. Taken together, these findings support the value of explicitly modeling the trade-off between user preference and outfit compatibility in outfit recommendation.

Despite these promising results, several directions remain for future work. First, temporal modeling may be incorporated to better capture the dynamic evolution of user preference. Second, richer contextual factors, such as usage occasion, seasonal variation, and emerging fashion trends, may further improve compatibility estimation and recommendation relevance. Third, direct user-centered evaluation would be valuable for assessing the practical usefulness of the generated recommendation sets beyond surrogate expert-based metrics. Fourth, improving Pareto candidate generation efficiency through nearline caching, reduced-budget search, and parallel objective evaluation remains an important direction for practical deployment. These directions may help further strengthen the applicability of multiobjective recommendation methods in fashion-related decision support scenarios.

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was partly supported by the project "Research and application of unmanned water plant operation management and control system based on artificial intelligence" and by the Beijing Educational Science Planning 2025 Priority Focus Project (grant number CEAA25008).

ORCID iD

Xiaoyi Wang
