Abstract
Garment recommendation in fashion e-commerce requires a practical balance between personalized user preference and outfit compatibility. Existing recommendation methods usually combine these two goals into a single scalar objective, which limits their ability to represent diverse trade-offs in apparel selection. This study proposes a multiobjective outfit recommendation (MOOR) framework that formulates outfit recommendation as a multiobjective optimization problem and searches for Pareto-optimal outfit solutions. To estimate the two objectives, the framework incorporates a heterogeneous graph-based user preference model for capturing sparse and higher-order preference signals, and a multimodal outfit compatibility model for assessing visual, textual, and attribute-level coherence between garments. A task-specific evolutionary search strategy is further introduced to explore candidate top–bottom combinations while preserving recommendation diversity. Experiments on the IQON3000 dataset show that the proposed framework provides strong preference and compatibility estimation and generates diverse trade-off solutions with favorable outfit-level quality under the reported evaluation protocol. These findings support the value of explicitly modeling the preference–compatibility trade-off in garment recommendation systems.
Introduction
Within the domain of fashion recommendation, constructing complete and coherent outfit compositions, rather than recommending isolated items, is a practically important yet technically challenging task. Unlike single-item recommendation, outfit recommendation must simultaneously satisfy two inherently distinct requirements. On the one hand, the recommended items should align with user-specific preferences, reflecting individual style, historical behaviors, and contextual needs. On the other hand, the outfit as a whole should maintain aesthetic compatibility in terms of style consistency, color harmony, and visual balance. These two requirements are often not fully aligned: emphasizing personalization alone may result in stylistically inconsistent combinations, whereas prioritizing compatibility alone tends to suppress individual expression and produce generic recommendations. Therefore, effectively balancing user preference and outfit compatibility remains a fundamental challenge in modern outfit recommendation systems.
Existing studies on outfit recommendation can be broadly categorized into two paradigms. The first relies on predefined or template-based outfit construction. For instance, Zhan et al. 1 constructed an attribute-aware fashion knowledge graph to model user–outfit relationships, while Verma et al. 2 incorporated visual preference signals to generate occasion-specific outfit suggestions. Ding et al. 3 further proposed the TOG framework, which leverages category combination templates to guide outfit generation. The second paradigm focuses on item completion, such as predicting a compatible bottom for a given top.4,5 Although effective in specific scenarios, these methods generally model personalization and compatibility in an implicit and entangled manner. In particular, most approaches map heterogeneous signals into a shared embedding space and optimize a single scalar objective (e.g., via weighted summation), which enforces a fixed and globally shared trade-off between the two objectives. Such scalarized formulations fail to capture the inherently nonlinear and user-specific conflicts between preference and compatibility, thereby limiting the diversity and flexibility of the generated outfit recommendations.
To address these limitations, we reformulate outfit recommendation as a multiobjective optimization problem, where user preference and outfit compatibility are explicitly treated as two separate but interacting objectives. Based on this formulation, we propose a multiobjective outfit recommendation (MOOR) framework that combines dedicated objective estimation with Pareto-based search over candidate outfits. Instead of collapsing multiple objectives into a single scalar score, the framework preserves a set of trade-off solutions, each reflecting a different balance between personalization and compatibility. This design provides a more flexible recommendation mechanism for garment selection than fixed-weight scalarization.
Realizing this formulation requires reliable estimators for both objectives. For preference modeling, previous graph-based approaches primarily focus on direct user–item interactions and often overlook higher-order semantic relations such as category and attribute dependencies.6,7 To address this issue, we construct a heterogeneous graph that jointly represents users, items, and category-level semantics, and further enhance it with similarity-based edge augmentation and multihop information fusion to capture latent preference signals under sparse interaction data. For compatibility modeling, prior studies typically rely on multimodal feature fusion of visual and textual information,8,9 yet many of them adopt relatively simple fusion strategies that are insufficient for fine-grained cross-modal alignment and higher-order item interaction modeling. To overcome these limitations, we design a multimodal compatibility evaluator that integrates visual, textual, and attribute information to learn robust and semantically consistent representations for outfit compatibility assessment. Overall, our main contributions can be summarized as follows.
We formulate outfit recommendation as a task-specific multiobjective recommendation problem by explicitly disentangling personalized preference and outfit compatibility, instead of collapsing them into a fixed scalar objective.
We instantiate the two objectives using task-adapted objective estimators: a heterogeneous graph-based estimator for user-specific preference and a multimodal estimator for outfit-level compatibility. These estimators provide the objective signals required for Pareto-based outfit generation, rather than serving as standalone generic recommendation models.
We design a task-specific search framework for outfit construction, including user-conditioned candidate generation and phenotype-aware environmental selection, to improve the representativeness and diversity of the approximated Pareto solutions.
Related work
Prior studies related to MOOR can be grouped into three methodological streams: user preference modeling, fashion compatibility modeling, and multiobjective recommendation. These streams correspond to the three technical questions addressed in this study: how to estimate a user's garment-level preference, how to measure the coherence between constituent garments, and how to balance preference and compatibility without imposing a fixed scalar weighting scheme.
User preference modeling
User preference modeling is a fundamental component of fashion recommendation, aiming to infer personalized apparel tastes from historical user–item interactions and item information. Early recommendation methods primarily relied on collaborative filtering, matrix factorization methods, and hybrid recommendation strategies,10–13 which learn preference patterns from explicit or implicit interaction signals. In fashion recommendation, user feedback is often sparse and highly implicit, making it difficult for interaction-only models to capture the subjective and dynamic nature of apparel preference. Pairwise ranking methods such as Bayesian personalized ranking (BPR) 14 improve implicit-feedback learning by directly optimizing personalized ranking from observed interactions. Domain-oriented fashion models additionally incorporate item content and visual semantics. For example, FashionDNA 15 maps fashion items and customer style preferences into a shared latent space using product images, tags, and sales data, while VECF 16 combines visual features with user reviews through a multimodal attention mechanism for explainable fashion recommendation. Although these methods enrich preference representation, they still have limited ability to model higher-order semantic dependencies among users, items, and attributes.
Recent deep learning methods further exploit sequence structures and graph relations for personalized fashion recommendation. Transformer-based models such as POG 17 connect user preferences over individual items and outfits for personalized outfit generation. Graph neural networks (GNNs) provide a natural framework for modeling relational recommendation data. 18 For example, HFGN 19 constructs a hierarchical graph to jointly model users, outfits, and items, while visual-aware graph models incorporate image-derived item features into graph-based recommendation. 20 These studies show the value of structural representation learning in fashion recommendation. Nevertheless, models based mainly on user–item interactions, outfit–item mappings, or visual item graphs may still be insufficient for representing heterogeneous semantic relations among users, items, categories, and attributes.
Heterogeneous graph networks (HGNs) have therefore attracted increasing attention for modeling multityped nodes and relations. For example, A3-FKG 1 introduces an attentive attribute-aware fashion knowledge graph for outfit preference prediction, showing the importance of attribute-level semantic relations in fashion preference modeling. Related heterogeneous graph studies21,22 further demonstrate the potential of heterogeneous representation learning in handling complex semantic structures in recommendation and fashion-retail scenarios. Overall, user preference modeling in fashion recommendation has progressed from interaction-based latent factor models to graph-based and heterogeneous semantic graph models. This progression suggests that effective preference estimation requires not only collaborative signals from user–item interactions but also relation-aware use of item attributes and higher-order semantic neighborhoods. However, sparse feedback and insufficient cross-layer semantic aggregation remain important limitations, motivating a preference estimator that strengthens graph structure and adaptively integrates heterogeneous neighborhood information.
Fashion compatibility modeling
Fashion compatibility modeling aims to determine whether multiple garments can form a coherent and aesthetically acceptable outfit. Early studies primarily learned visual compatibility from item co-occurrence or image-based representations. For example, Veit et al. 23 learned visual clothing compatibility across heterogeneous categories using a Siamese convolutional neural network (CNN) framework, while later visual-centric models improved compatibility prediction through richer convolutional feature extraction and feature fusion. 24 These studies show that visual appearance is essential for compatibility learning, but visual information alone may be insufficient because outfit coherence also depends on category, texture, color, style, and semantic attributes. To address this limitation, subsequent studies incorporated multimodal information, including textual descriptions and structured attributes. Category-aware multimodal attention models and neural fashion expert models exploit visual and textual information to improve complementary clothing matching.25,26 Related fashion recommendation studies also integrate higher-level style and occasion semantics. De Divitiis et al. 27 used Kobayashi-derived color-style semantics for style-based outfit recommendation, and Becattini et al. 28 combined style cues with social-event information for event-aware fashion recommendation. Attribute-augmented models further introduce explicit attribute interactions to improve explainability in compatibility prediction. 29 These studies indicate that compatibility learning has gradually shifted from isolated visual representation toward multimodal semantic modeling. However, fine-grained alignment among visual, textual, and attribute modalities remains challenging, especially when different modalities provide noisy or incomplete signals.
Another line of work focuses on relation-aware compatibility modeling. For example, NGNN 30 represents an outfit through a category-level fashion graph and learns item interactions with node-wise graph neural networks. Type-aware embedding methods learn compatibility in type-specific embedding spaces, distinguishing item similarity from cross-type compatibility. 31 More recent studies further explore cross-modal attention and color-compatibility modeling to capture subtle item-pair relations.32,33 Overall, fashion compatibility modeling has evolved from visual-style similarity learning to multimodal, type-aware, and relation-aware interaction modeling. The central requirement is to learn a compatibility-oriented representation space in which stylistically coherent garments are close and mismatched garments are separable. Existing methods still face challenges in fine-grained multimodal alignment and discriminative handling of hard negative combinations, motivating a compatibility estimator that better calibrates visual, textual, and attribute information.
Multiobjective optimization recommendation
Traditional recommendation models are typically designed to optimize a single objective, such as accuracy or ranking relevance. However, real recommendation scenarios often involve multiple criteria, including relevance, diversity, novelty, compatibility, and personalization. 34 In the fashion domain, multiobjective formulations have been explored in related tasks such as capsule wardrobe generation,35,36 where the goal is to select a compact set of mutually compatible and versatile garments under practical constraints. These studies demonstrate the value of considering multiple fashion-related criteria simultaneously. In broader recommendation research, Pareto-efficient optimization has been introduced to address conflicts among multiple objectives. Lin et al. 37 proposed a Pareto-efficient framework for multiobjective e-commerce recommendation, and Xie et al. 38 further studied personalized approximate Pareto-efficient recommendation. These studies show that Pareto-based formulations can provide a principled way to handle objective conflicts. Meanwhile, multiobjective learning studies also indicate that fixed weighted scalarization can be limited when objectives compete, because a single predefined weighting scheme may force a global compromise and obscure alternative trade-off solutions. 39
For personalized outfit recommendation, the key challenge is not only to identify compatible garment combinations, but also to preserve candidate outfits that reflect different balances between user preference and outfit compatibility. Existing personalized outfit models often combine these factors into a single scoring or training objective, which makes the final recommendation sensitive to predefined weights and may collapse diverse trade-off solutions into one compromise. Motivated by this limitation, we formulate outfit recommendation as a multiobjective optimization problem and use Pareto-based search to approximate a set of representative nondominated outfit candidates. This formulation enables the recommendation process to retain diverse preference–compatibility trade-offs rather than imposing a fixed scalar objective.
Our approach
Problem definition
We formulate personalized outfit recommendation as a multiobjective optimization problem. Let
Personalized preference (
Outfit compatibility (
The two objectives
To address this trade-off, we adopt the standard formulation of multiobjective optimization. 41
An outfit
where the goal is to approximate the Pareto-optimal solution set rather than a single optimal solution.
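The dominance relation behind this formulation can be illustrated with a minimal sketch. It assumes that both objectives are maximized and that each outfit is summarized by a (preference, compatibility) pair; the names `dominates` and `nondominated` and the example scores are hypothetical and are not taken from the MOOR implementation.

```python
from typing import List, Tuple

# Assumption: an outfit is summarized by (preference, compatibility), both maximized.
Objectives = Tuple[float, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def nondominated(points: List[Objectives]) -> List[Objectives]:
    """Keep the points that no other point dominates (input order is preserved)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative scores only.
scores = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]
front = nondominated(scores)
```

In this toy example the first three points survive because each is better than the others on one objective, which is the kind of trade-off set the framework aims to approximate.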
As illustrated in Figure 1, the proposed MOOR framework instantiates the above formulation using two dedicated evaluators for preference and compatibility, respectively, followed by a task-specific evolutionary search procedure that explores the solution space and approximates the Pareto front. From the obtained nondominated solutions, a subset of representative outfits is finally selected for recommendation.

Figure 1. Architectural overview of the proposed multiobjective outfit recommendation (MOOR) framework. The framework consists of three components. (1) A heterogeneous graph-based preference model estimates user-specific garment preference from multirelational interaction data. (2) A multimodal compatibility model evaluates the coherence of garment combinations using visual, textual, and attribute information. (3) The preference and compatibility scores are treated as two explicit objectives in a task-specific evolutionary search procedure, which approximates a Pareto set of nondominated outfit candidates. From these candidates, a subset of representative outfits is finally selected for recommendation.
Heterogeneous graph-based user preference modeling
To instantiate the personalized preference objective
We construct a heterogeneous graph
Semantic structure augmentation
To alleviate data sparsity, we augment the graph topology with similarity-based edges that capture implicit collaborative relationships. For user nodes, we define the similarity between two users
where
For item nodes, we design a hybrid similarity function that combines behavioral and attribute-based signals:
Here,
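As a rough illustration of similarity-based edge augmentation, the sketch below uses Jaccard overlap as a stand-in similarity measure. This is an assumption made purely for illustration: the similarity functions, the mixing weight `alpha`, and the edge threshold used in the paper may differ, and all function names here are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_item_similarity(users_i: Set[str], users_j: Set[str],
                           attrs_i: Set[str], attrs_j: Set[str],
                           alpha: float = 0.5) -> float:
    """Assumed hybrid form: weighted mix of behavioral and attribute overlap."""
    return alpha * jaccard(users_i, users_j) + (1.0 - alpha) * jaccard(attrs_i, attrs_j)


def similarity_edges(neighbors: Dict[str, Set[str]], threshold: float) -> List[Tuple[str, str]]:
    """Add an undirected edge between nodes whose neighbor sets overlap enough."""
    nodes = sorted(neighbors)
    return [(u, v)
            for i, u in enumerate(nodes)
            for v in nodes[i + 1:]
            if jaccard(neighbors[u], neighbors[v]) >= threshold]


# Illustrative data only.
user_items = {"u1": {"t1", "b1", "t2"}, "u2": {"t1", "b1"}, "u3": {"b9"}}
user_edges = similarity_edges(user_items, threshold=0.5)
item_sim = hybrid_item_similarity({"u1", "u2"}, {"u1"},
                                  {"denim", "blue"}, {"denim", "black"})
```

Thresholding keeps the augmented graph sparse, so that only sufficiently similar users or items exchange messages during propagation.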
Heterogeneity-aware representation learning
We adopt the Heterogeneous Graph Transformer (HGT) to model multirelational dependencies. 42
For each target node
where
However, deep GNNs often suffer from over-smoothing, where node representations across different layers become indistinguishable. To effectively capture both local neighborhood signals and global structural semantics, we propose an adaptive multilayer fusion mechanism (as shown in Figure 2). Specifically, we introduce a trainable weight parameter vector and apply the Softmax function to obtain the normalized importance score

Figure 2. The process of message aggregation for HGUP.
This allows the model to dynamically balance local and global structural information for each node
where ⊕ denotes vector concatenation and
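The adaptive multilayer fusion idea can be sketched in plain Python as a softmax-weighted combination of per-layer node embeddings. The sketch assumes one trainable logit per layer and covers only the weighted combination, not the concatenation step; `fuse_layers` and the example values are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_embeddings: List[List[float]], layer_logits: List[float]) -> List[float]:
    """Combine one node's embeddings from each layer using softmax-normalized weights."""
    weights = softmax(layer_logits)
    dim = len(layer_embeddings[0])
    return [sum(w * h[d] for w, h in zip(weights, layer_embeddings)) for d in range(dim)]


# Three layers, two-dimensional embeddings, equal logits (uniform weights).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_layers(layers, layer_logits=[0.0, 0.0, 0.0])
```

With equal logits the fusion reduces to a plain average; during training, the logits would shift weight toward shallow layers (local signals) or deep layers (global structure) as needed.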
Since real-world fashion data typically consists of implicit feedback rather than explicit ratings, we formulate preference learning as a pairwise ranking task. We adopt the BPR framework, which assumes that a user prefers observed interactions over unobserved ones. For each observed positive interaction
where
The trained model HGUP provides the preference estimator
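For reference, the standard BPR objective penalizes cases where a sampled negative item scores higher than an observed positive item. The sketch below computes the mean of the negative log-sigmoid of the score difference over (positive, negative) score pairs; it omits any regularization term, and the function names are hypothetical rather than taken from the HGUP code.

```python
import math
from typing import List, Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bpr_loss(score_pairs: List[Tuple[float, float]]) -> float:
    """Mean BPR loss over (positive_score, negative_score) pairs."""
    return -sum(log_sigmoid(pos - neg) for pos, neg in score_pairs) / len(score_pairs)


# A correctly ordered pair gives a smaller loss than a mis-ordered one.
loss_correct = bpr_loss([(2.0, 0.0)])
loss_wrong = bpr_loss([(0.0, 2.0)])
```

The loss depends only on the score difference, so it trains the ranking order of items for a user rather than the absolute score values.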
Multimodal outfit compatibility learning
To instantiate the outfit compatibility objective
Multimodal representation construction
Given an outfit
where
Textual features are constructed analogously to obtain representation
where ⊕ denotes concatenation. This attribute representation provides complementary semantic signals that are not directly observable from raw modalities.
Cross-modal semantic calibration
To model interactions across modalities, we construct a unified representation by stacking modality-specific features
where
In a standard Transformer, fused features are typically subjected to a shared layer normalization, which may cause modality collapse, i.e., the statistics of the dominant modality overwhelm those of weaker modalities. We therefore adopt a disentangled residual strategy: for each modality-specific channel, we apply an independent, learnable normalization layer to enforce that each modality preserves its original feature distribution while absorbing cross-modal context. The updates are formulated as
where
Uncertainty-aware modality fusion
After L interaction layers, we obtain modality-aligned representations
where
Given an anchor item representation
where
where
After the multimodal compatibility model is trained, each item is represented in the learned compatibility space. During the multiobjective recommendation phase, the compatibility score
A higher cosine value indicates stronger learned coherence between the two garments in the compatibility space. This formulation converts the learned outfit compatibility into a bounded scalar objective for the downstream Pareto optimization.
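A cosine-based compatibility score of this kind can be computed as in the following minimal sketch. The embeddings are placeholder vectors, not outputs of the trained compatibility model, and `cosine_compatibility` is a hypothetical name.

```python
import math
from typing import Sequence


def cosine_compatibility(top_vec: Sequence[float], bottom_vec: Sequence[float]) -> float:
    """Cosine similarity between two item embeddings; returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(top_vec, bottom_vec))
    norm = math.sqrt(sum(a * a for a in top_vec)) * math.sqrt(sum(b * b for b in bottom_vec))
    return dot / norm if norm > 0 else 0.0


# Placeholder embeddings for one top and two candidate bottoms.
top = [0.6, 0.8]
aligned_bottom = [0.6, 0.8]
orthogonal_bottom = [-0.8, 0.6]
score_aligned = cosine_compatibility(top, aligned_bottom)
score_orthogonal = cosine_compatibility(top, orthogonal_bottom)
```

Because cosine similarity always lies in [-1, 1], the score has a fixed range regardless of embedding magnitude, which simplifies normalization in the downstream search.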
Multiobjective optimization outfit recommendation
The user preference estimator and the outfit compatibility estimator developed previously provide the two objective functions required by the formulation. Based on these two objectives, the recommendation stage is organized as a Pareto-based search over the combinatorial space of top–bottom garment pairs. For each target user, MOOR first constructs user-specific top and bottom candidate item subsets, and their Cartesian product defines the feasible outfit search space. As the number of feasible outfits increases with the product of the two candidate subset sizes, MOOR evaluates only a fixed number of top–bottom pairs during the search, rather than exhaustively scoring every pair in their Cartesian product. The evaluated pairs are iteratively updated through evolutionary search to approximate representative trade-off solutions. In this way, MOOR generates a set of nondominated outfit candidates that reflect different balances between personalized preference and outfit compatibility. MOOR builds on the Pareto approximation ability of multiobjective evolutionary algorithms (MOEAs) and adapts the search process to the structure of outfit recommendation. Specifically, user-conditioned item-level candidate construction guides the search toward garments that are relevant to the target user, structure-aware evolutionary operators preserve the validity of top–bottom combinations, and phenotype-aware environmental selection promotes diversity at the item-combination level. These components jointly support efficient exploration of the outfit search space under the preference–compatibility objective formulation. The overall optimization pipeline is illustrated in Figure 3.

Figure 3. Flowchart of the proposed multiobjective outfit recommendation procedure.
Preference-stratified candidate pool construction
To guide the evolutionary search, we construct a user-specific candidate pool
where
Structure-aware initialization and evolutionary operators
Given that an outfit is represented as a structured pair
where
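One way to keep every individual a valid top–bottom pair is to restrict each operator to act within a slot. The sketch below is an illustrative assumption, not the exact operators used in MOOR: initialization samples one item per slot, crossover exchanges bottoms between parents, and mutation resamples a slot from the candidate list of the same type. All names and the mutation rate are hypothetical.

```python
import random
from typing import List, Tuple

Outfit = Tuple[str, str]  # (top_id, bottom_id)


def init_population(tops: List[str], bottoms: List[str], size: int,
                    rng: random.Random) -> List[Outfit]:
    """Sample random top-bottom pairs from the user's candidate lists."""
    return [(rng.choice(tops), rng.choice(bottoms)) for _ in range(size)]


def crossover(p1: Outfit, p2: Outfit) -> Tuple[Outfit, Outfit]:
    """Exchange bottoms so that each child is still one top plus one bottom."""
    return (p1[0], p2[1]), (p2[0], p1[1])


def mutate(outfit: Outfit, tops: List[str], bottoms: List[str],
           rng: random.Random, rate: float = 0.2) -> Outfit:
    """Resample each slot with probability `rate`, always from the same item type."""
    top, bottom = outfit
    if rng.random() < rate:
        top = rng.choice(tops)
    if rng.random() < rate:
        bottom = rng.choice(bottoms)
    return top, bottom


rng = random.Random(0)
tops, bottoms = ["t1", "t2", "t3"], ["b1", "b2"]
population = init_population(tops, bottoms, size=6, rng=rng)
child_a, child_b = crossover(("t1", "b1"), ("t2", "b2"))
mutant = mutate(("t1", "b1"), tops, bottoms, rng, rate=1.0)
```

Since no operator ever moves an item across slots, offspring never need a repair step to restore the top–bottom structure.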
Phenotype-aware environmental selection
Standard MOEAs maintain diversity using metrics such as crowding distance in objective space. However, in outfit recommendation, diversity in the objective space does not necessarily correspond to diversity in the item space. Solutions that are well separated in terms of objective values may still correspond to visually or semantically similar outfits. To address this issue, we introduce a phenotype-aware environmental selection strategy. Specifically, structurally redundant outfits are filtered according to item-level similarity before selection, and higher priority is assigned to solutions that improve category-level diversity within the retained candidate set. In this way, the selection procedure explicitly bridges the gap between objective-space optimization and user-perceived diversity, preventing the approximated Pareto front from being dominated by redundant solutions.
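The selection idea can be sketched as a two-step greedy procedure. As simplifying assumptions, redundancy is reduced here to exact duplicate pairs, and category-level diversity is measured by how many fine-grained categories an outfit adds to the already selected set; the actual similarity criterion and priority rule in MOOR may be richer. The function name and example data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Outfit = Tuple[str, str]  # (top_id, bottom_id)


def select_diverse(candidates: List[Outfit], category: Dict[str, str], k: int) -> List[Outfit]:
    """Drop duplicate outfits, then greedily pick those adding the most new categories."""
    remaining = list(dict.fromkeys(candidates))  # order-preserving de-duplication
    selected: List[Outfit] = []
    covered: Set[str] = set()
    while remaining and len(selected) < k:
        # max() returns the first best candidate, so ties keep the input order.
        best = max(remaining,
                   key=lambda o: len({category[o[0]], category[o[1]]} - covered))
        selected.append(best)
        covered |= {category[best[0]], category[best[1]]}
        remaining.remove(best)
    return selected


# Illustrative candidates, assumed to be already ranked by the Pareto search.
category = {"t1": "shirt", "t2": "shirt", "t3": "knit", "b1": "jeans", "b2": "skirt"}
candidates = [("t1", "b1"), ("t1", "b1"), ("t2", "b1"), ("t3", "b2")]
picked = select_diverse(candidates, category, k=2)
```

Here the second shirt-and-jeans outfit is passed over in favor of the knit-and-skirt outfit, even though the two shirt outfits could sit far apart in objective space.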
Through iterative evolution, MOOR produces an approximation of the Pareto-optimal solution set defined over
Experiments
Settings
Dataset and preprocessing
We conduct experiments on the IQON3000 dataset, a widely adopted benchmark for fashion outfit recommendation. The original dataset contains 308,747 outfits from 3568 users and 672,335 unique fashion items, each associated with a visual image, a textual description, and attribute metadata. Following prior studies, 43 we focus on top–bottom outfit recommendation. To ensure sufficient interaction signals for user preference modeling, only users with at least four outfit interactions are retained. For each selected user, we further filter their interaction history to include only items belonging to top and bottom categories. This preprocessing step reduces noise from irrelevant categories while preserving meaningful outfit composition patterns. The final dataset statistics are reported in Table 1.
Table 1. Dataset statistics.
To enrich the dataset with high-level aesthetic signals, we further engineer a style attribute for each outfit as an auxiliary feature. Following Kobayashi's Color Image Scale and related style-based fashion recommendation studies,27,28,44 each outfit is deterministically categorized into one of six color harmony types: Complementary, Contrasting, Analogous, Similar, Monochromatic, and Neutral (see Figure 4 for illustrations). This style attribute is incorporated as a structured attribute input to the outfit compatibility model (MMOC) and is consistently provided to all comparative methods that utilize attribute information.

Figure 4. Color wheel matching principle.
Parameter settings
To ensure a rigorous and leak-free evaluation, we adopt a user-stratified data splitting strategy. For each user, their interaction history is randomly partitioned into training, validation, and test sets with a ratio of 8:1:1. Each complete outfit is assigned exclusively to a single split, preventing information leakage across different stages of the framework. This splitting strategy is particularly critical for our multistage framework, as it ensures that both the preference model (HGUP) and the compatibility model (MMOC) are trained only on historical data, while the multiobjective optimization stage operates exclusively on unseen test outfits. The HGUP and MMOC models are trained independently on the training split, utilizing the validation set for hyperparameter tuning. Crucially, during the final recommendation stage, the parameters of both HGUP and MMOC are frozen and serve as deterministic objective functions. MOOR operates in a pure inference mode without parameter updates, performing inference-time optimization to explore the Pareto frontier and generate recommendations. This design ensures that the optimization process does not access test interactions and is guided solely by learned representations.
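A per-user split in which each outfit lands in exactly one partition might look like the sketch below. The rounding rule for small histories is an assumption, since an exact 8:1:1 ratio is impossible for a user with only four outfits; here users with at least three outfits receive at least one validation and one test outfit. The function name and seed are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Split = Tuple[List[str], List[str], List[str]]  # (train, validation, test)


def split_user_outfits(outfits_by_user: Dict[str, List[str]], seed: int = 0) -> Dict[str, Split]:
    """Shuffle each user's outfits and cut them into roughly 8:1:1 partitions."""
    rng = random.Random(seed)
    splits: Dict[str, Split] = {}
    for user, outfits in outfits_by_user.items():
        shuffled = outfits[:]
        rng.shuffle(shuffled)
        n = len(shuffled)
        # Assumed rounding rule: at least one held-out outfit per split when n >= 3.
        held_out = max(1, n // 10) if n >= 3 else 0
        n_train = n - 2 * held_out
        splits[user] = (shuffled[:n_train],
                        shuffled[n_train:n_train + held_out],
                        shuffled[n_train + held_out:])
    return splits


demo = {"u1": [f"o{i}" for i in range(10)], "u2": ["a", "b", "c", "d"]}
splits = split_user_outfits(demo, seed=42)
```

Splitting at the outfit level, rather than at the item-pair level, is what keeps a test outfit's top and bottom from appearing together in the training data.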
All models were implemented in PyTorch 2.1.1 with CUDA 11.8 and trained on an NVIDIA A800 (80 GB) GPU. The key hyperparameter configurations are summarized in Table 2. For the HGUP model, we adopted a pairwise ranking formulation with a 1:1 positive-to-negative sample ratio to establish a clear decision boundary. For the MMOC model's contrastive learning task, a 1:2 ratio was used to provide more negative examples, enhancing the model's ability to learn a discriminative embedding space. This setting forces the model to not only pull compatible items closer but also to push incompatible distractors farther away, thereby sharpening the decision boundaries in the high-dimensional embedding space.
Table 2. Hyperparameter settings for model components.
Evaluation metrics
To comprehensively evaluate different components of the framework, we adopt task-specific metrics aligned with the three evaluation tasks. For user preference prediction (HGUP), we use hit ratio@K (HR@K) and normalized discounted cumulative gain (NDCG@K) with K ∈ {10, 20}. These metrics evaluate retrieval accuracy and position-aware ranking quality, respectively. For outfit compatibility prediction (MMOC), we employ area under the receiver operating characteristic (ROC) curve (AUC) and average precision (AP), which assess the model’s ability to distinguish compatible and incompatible item combinations.
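The two ranking metrics can be sketched per user as follows, using common binary-relevance definitions. The exact variants used in the experiments (for example, how multiple held-out items are handled) are not specified here, so this should be read as one plausible instantiation with hypothetical function names.

```python
import math
from typing import List, Set


def hit_ratio_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if any held-out item appears in the top-k list, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# One user: the held-out item "b" is ranked second.
ranking = ["a", "b", "c", "d"]
held_out = {"b"}
hr_at_1 = hit_ratio_at_k(ranking, held_out, k=1)
hr_at_2 = hit_ratio_at_k(ranking, held_out, k=2)
ndcg_at_2 = ndcg_at_k(ranking, held_out, k=2)
```

Dataset-level HR@K and NDCG@K are then obtained by averaging these per-user values.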
For multiobjective recommendation (MOOR), we evaluate the quality of the generated solution set using three standard indicators: hypervolume (HV), spacing (SP), and convergence. HV measures the dominated volume in the objective space, SP quantifies the uniformity of solution distribution, and convergence measures the distance to the ideal point after normalizing objective values to [0,1]. A lower convergence value indicates closer proximity to the optimal trade-off.
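For a two-objective maximization problem with objectives normalized to [0,1], the three indicators can be sketched as below. Several details are assumptions: the hypervolume reference point is taken as (0, 0), spacing follows Schott's definition with L1 nearest-neighbor distances, and convergence is taken as the mean Euclidean distance to the ideal point (1, 1). The function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (preference, compatibility), normalized to [0, 1]


def hypervolume_2d(front: List[Point], ref: Point = (0.0, 0.0)) -> float:
    """Area dominated by the front relative to `ref` (both objectives maximized)."""
    area, prev_f2 = 0.0, ref[1]
    for f1, f2 in sorted(front, key=lambda p: (-p[0], -p[1])):
        if f2 > prev_f2:  # dominated or duplicate points add no area
            area += (f1 - ref[0]) * (f2 - prev_f2)
            prev_f2 = f2
    return area


def spacing(front: List[Point]) -> float:
    """Schott's spacing: spread of nearest-neighbor L1 distances (0 means uniform)."""
    if len(front) < 2:
        return 0.0
    nearest = [min(abs(p[0] - q[0]) + abs(p[1] - q[1])
                   for j, q in enumerate(front) if j != i)
               for i, p in enumerate(front)]
    mean = sum(nearest) / len(nearest)
    return math.sqrt(sum((mean - d) ** 2 for d in nearest) / (len(nearest) - 1))


def convergence(front: List[Point], ideal: Point = (1.0, 1.0)) -> float:
    """Mean Euclidean distance from the front to the ideal point (lower is better)."""
    return sum(math.hypot(ideal[0] - p[0], ideal[1] - p[1]) for p in front) / len(front)


front = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9)]
hv, sp, conv = hypervolume_2d(front), spacing(front), convergence(front)
```

For this toy front the hypervolume is approximately 0.48 (0.18 + 0.24 + 0.06 from the three rectangular strips), and the spacing is approximately zero because every point is equally far from its nearest neighbor.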
Accordingly, the experiments are designed to answer three questions: whether the two estimators are reliable, whether the proposed search strategy is effective, and whether the resulting outfits exhibit favorable trade-off quality under a surrogate evaluation protocol.
Comparison of user preference models
To evaluate whether HGUP can serve as a reliable estimator of the preference objective in the proposed MOOR framework, we first examine its user–item preference prediction performance before the multiobjective search stage. The purpose of this experiment is to verify whether the learned preference scores are sufficiently accurate and discriminative to guide Pareto-based outfit generation. Since the optimization stage relies directly on
MF 45 : A classical matrix factorization model that estimates user–item preference through latent factor interactions.
NCF 46 : A neural collaborative filtering model that captures nonlinear user–item interactions using multilayer perceptrons.
NGCF 6 : A graph collaborative filtering model that propagates user and item embeddings over the user–item bipartite graph.
LightGCN 47 : A simplified graph collaborative filtering model that focuses on linear neighborhood propagation without feature transformation or nonlinear activation.
HGCL 48 : A heterogeneous graph contrastive learning model that learns robust representations from multitype relational structures.
These baselines progressively incorporate collaborative, structural, and heterogeneous relational information, enabling us to assess whether HGUP can provide a reliable preference signal for the MOOR objective under the same evaluation protocol. The results in Table 3 indicate that HGUP consistently achieves the best performance across all metrics. These results suggest that HGUP provides more discriminative preference estimates than the compared baselines under the reported protocol. MF and NCF perform worst because they learn preferences solely from the interaction matrix: MF is limited by its bilinear scoring function, and while NCF introduces nonlinearity via an MLP, it still lacks explicit relational inductive bias and cannot effectively exploit semantic side information under sparse implicit feedback. Graph-based recommenders substantially improve performance by leveraging higher-order collaborative signals through neighborhood aggregation. NGCF improves over MF/NCF by propagating messages on the user–item bipartite graph, but its formulation remains restricted to interaction edges and therefore cannot explicitly encode attribute semantics that are crucial for fashion preference. LightGCN further strengthens performance by simplifying graph propagation (removing feature transformations and nonlinear activations), which often yields more stable representation learning on interaction graphs and reduces overfitting to noisy transformations.
Table 3. Performance comparison of different user preference prediction models.
Notably, HGCL and HGUP constitute the strongest group of methods, underscoring the value of modeling heterogeneous relations beyond user–item interactions. The remaining gap between them further indicates that representation learning driven primarily by contrastive invariance does not necessarily translate into optimal performance under sparse implicit feedback. The observed gains are consistent with the use of semantic structure augmentation and adaptive multilayer fusion, which may improve information propagation under sparse fashion interactions. Collectively, these results show that the learned preference scores are sufficiently informative and well separated to serve as one of the two objective functions in the Pareto search process, rather than merely acting as an auxiliary recommendation model. Therefore, HGUP is used in the subsequent MOOR experiments as the estimator for the preference objective, because it provides relatively reliable and discriminative user–item preference scores under the reported protocol.
Comparison of outfit compatibility models
This section evaluates the effectiveness of MMOC as an estimator for the outfit compatibility objective. Unlike user preference modeling, compatibility estimation is intended to characterize the intrinsic coherence of garment combinations independently of any specific user. Since the estimated compatibility scores are used as the second objective in downstream multiobjective search, the purpose of this experiment is to examine whether MMOC can provide sufficiently stable and discriminative compatibility signals. Therefore, we compare MMOC with representative baselines that cover sequence-based outfit modeling, type-aware metric learning, graph-based multimodal relation modeling, and Transformer-based global item interaction.
Bi-LSTM [49]: A sequence-based compatibility model that represents an outfit as an ordered item sequence and learns compatibility through bidirectional recurrent encoding.
Type-Aware-Net [31]: A type-aware metric learning model that learns category-specific embedding spaces for cross-type compatibility matching.
MOCM [50]: A graph-based multimodal compatibility model that constructs modality-oriented graphs to capture intramodal and intermodal garment relations.
OutfitTransformer [51]: A Transformer-based model that jointly encodes outfit items to learn global interitem interactions.
The performance comparison for outfit compatibility prediction is reported in Table 4. MMOC achieves the best results on both AUC and AP, indicating its effectiveness in learning a discriminative compatibility scoring function. Compared with Bi-LSTM and Type-Aware-Net, MMOC yields consistently stronger results, suggesting that compatibility estimation based only on sequential dependence or type-specific metric matching is insufficient for capturing the complex semantic relationships involved in outfit formation. Compared with MOCM, which also models multimodal relational information, MMOC further improves compatibility prediction, indicating that more explicit cross-modal calibration before compatibility assessment may be beneficial. OutfitTransformer achieves competitive performance by modeling global interitem interactions through self-attention. However, its direct fusion of heterogeneous modalities within a shared attention space may introduce noise due to modality imbalance. In contrast, MMOC performs cross-modal interaction followed by disentangled normalization and uncertainty-aware fusion, which helps preserve modality-specific characteristics while reducing the influence of noisy signals. Overall, the results support the use of MMOC as the compatibility estimator for the objective function.
Table 4. Performance of outfit compatibility prediction models.
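For readers less familiar with the two reported metrics, the sketch below computes AUC and AP for a binary compatibility task (compatible = 1, incompatible = 0) in plain Python. It follows the standard definitions; the exact evaluation code, negative-sampling scheme, and tie handling used in the paper are not specified here, so the function names and the toy data are assumptions.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("AP needs at least one positive sample")
    return precision_sum / hits


# Toy example: four outfits, two of them labeled compatible.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)            # 3 of 4 pairs ordered correctly
toy_ap = average_precision(toy_labels, toy_scores)   # (1/1 + 2/3) / 2
```

AUC measures pairwise ranking quality over all positive–negative pairs, whereas AP weights the top of the ranking more heavily, so the two can disagree when errors concentrate near the top.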
Multiobjective optimization outfit recommendation
This section empirically evaluates the proposed MOOR framework from four aspects. First, we examine whether the proposed search strategy provides favorable Pareto-approximation quality in comparison with representative MOEA baselines. Subsequently, we assess the outfit-level quality of the final recommendation results under an external-expert evaluation protocol. An extended ablation study is then conducted to systematically investigate not only the multiobjective formulation but also the individual contributions of MOOR's specific structural components. Finally, we present a qualitative case study to intuitively illustrate the practical advantages of our model in addressing user heterogeneity.
Efficacy of the MOOR search framework
The purpose of this experiment is to examine whether the proposed search design provides favorable Pareto-approximation quality under the multiobjective outfit recommendation formulation defined previously. To evaluate this aspect, we compare MOOR with two representative MOEA baselines under the same problem formulation, the same user-specific candidate item subsets, and the same evaluation budget. This setting allows us to assess how different evolutionary search strategies approximate the preference–compatibility trade-off in the top–bottom outfit-combination space.
SPEA2 [52]: A dominance- and archive-based MOEA that assigns fitness by combining Pareto strength and density information.
MOEA/D [53]: A decomposition-based MOEA that transforms a multiobjective problem into a set of scalar subproblems and evolves them collaboratively.
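As a brief illustration of the decomposition idea behind MOEA/D, the sketch below builds evenly spread weight vectors for two maximized objectives and evaluates a Tchebycheff scalar subproblem. The choice of the Tchebycheff form, the ideal point, and the toy objective values are assumptions; the configuration actually used for the baseline is not stated in this section.

```python
def weight_vectors(n):
    """n evenly spaced two-objective weight vectors (requires n >= 2)."""
    return [(i / (n - 1), 1.0 - i / (n - 1)) for i in range(n)]


def tchebycheff(objectives, weights, ideal):
    """Weighted Chebyshev distance to the ideal point; smaller is better."""
    return max(w * abs(z - f) for f, w, z in zip(objectives, weights, ideal))


# Toy (preference, compatibility) scores for three candidate outfits.
toy_points = [(0.9, 0.1), (0.45, 0.45), (0.1, 0.9)]
toy_ideal = (0.9, 0.9)  # best value observed for each objective

# Each weight vector defines one scalar subproblem with its own best outfit.
best_per_subproblem = [
    min(toy_points, key=lambda p: tchebycheff(p, w, toy_ideal))
    for w in weight_vectors(3)
]
```

Each subproblem prefers a different region of the trade-off curve, which is how a decomposition-based method spreads its population along the front.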
The results in Table 5 indicate that MOOR achieves better Pareto-approximation performance than the two representative MOEA baselines under the reported setting. The higher HV value indicates that the obtained solutions dominate a larger region of the objective space, whereas the lower SP and convergence values suggest a more uniformly distributed and closer approximation to the target trade-off region. These results suggest that combining user-specific candidate item construction with phenotype-aware environmental selection is beneficial for exploring the preference–compatibility trade-off in outfit recommendation. MOOR also achieves a slightly lower average optimization time under the same evaluation setting. This runtime corresponds to the complete evolutionary search procedure for generating a Pareto candidate set, rather than the latency of a single online recommendation request. Overall, the results support the effectiveness of the proposed task-specific evolutionary search strategy for approximating representative nondominated outfit candidates under the reported experimental protocol.
Table 5. Comparison of the experimental results of three MOEAs.
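The HV and SP indicators discussed above can be sketched for the two-objective case as follows. This is a minimal illustration that assumes both objectives are maximized, a reference point at the origin, and Schott's spacing metric with Manhattan nearest-neighbor distances; the reference point and the exact SP variant used in the experiments are not given in this section.

```python
from math import sqrt


def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by `points` relative to `ref` (both objectives maximized)."""
    # Only points that improve on the reference in both objectives contribute.
    candidates = sorted(
        (p for p in points if p[0] > ref[0] and p[1] > ref[1]),
        key=lambda p: p[0],
        reverse=True,
    )
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in candidates:
        if f2 > best_f2:  # dominated points add no new area
            area += (f1 - ref[0]) * (f2 - best_f2)
            best_f2 = f2
    return area


def spacing(points):
    """Schott's spacing: spread of nearest-neighbor distances (0 means uniform)."""
    n = len(points)
    if n < 2:
        return 0.0
    nearest = [
        min(sum(abs(a - b) for a, b in zip(p, q))
            for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    mean_d = sum(nearest) / n
    return sqrt(sum((mean_d - d) ** 2 for d in nearest) / (n - 1))
```

A larger hypervolume indicates that the solution set covers more of the objective space, while a spacing value near zero indicates that neighboring solutions are roughly equidistant.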
For practical deployment, MOOR can be implemented in a nearline generation and online reranking paradigm. The evolutionary search can periodically generate Pareto candidate sets for target users or user groups, while the online stage only performs lightweight selection or reranking from the precomputed nondominated candidates according to the serving context. This separates the computationally heavier Pareto-search stage from real-time request handling and makes the framework more suitable for practical recommendation scenarios. Further reducing the search budget or accelerating objective evaluation remains an important direction for deployment-oriented optimization.
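The nearline generation and online reranking pattern described above can be sketched as a small cache. This is a hypothetical illustration: the class name `ParetoCache`, the `preference_weight` context parameter, and the in-memory dictionary are assumptions, not part of the reported system.

```python
class ParetoCache:
    """Stores precomputed nondominated outfits and reranks them at request time."""

    def __init__(self):
        self._store = {}

    def refresh(self, user_id, candidates):
        """Nearline step: save (outfit, preference, compatibility) tuples.

        In a real pipeline the candidates would come from the periodic
        evolutionary search; here they are passed in directly.
        """
        self._store[user_id] = list(candidates)

    def serve(self, user_id, preference_weight=0.5, k=3):
        """Online step: lightweight reranking of the cached candidates only."""
        cached = self._store.get(user_id, [])
        w = min(max(preference_weight, 0.0), 1.0)  # clamp to [0, 1]
        ranked = sorted(
            cached,
            key=lambda c: w * c[1] + (1.0 - w) * c[2],
            reverse=True,
        )
        return [outfit for outfit, _, _ in ranked[:k]]


cache = ParetoCache()
cache.refresh("u1", [
    (("t1", "b1"), 0.90, 0.30),
    (("t2", "b2"), 0.60, 0.60),
    (("t3", "b3"), 0.20, 0.95),
])
preference_heavy = cache.serve("u1", preference_weight=0.9, k=2)
compatibility_heavy = cache.serve("u1", preference_weight=0.1, k=1)
```

Because the weighting is applied only to an already nondominated set, the online stage chooses among trade-off candidates rather than restricting what the search can discover, and its cost is a single sort over a small list.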
Figure 5 presents the joint density distribution of Pareto solutions for all users in the test set. The global view of the hexagonal heatmap reveals a distinct curved high-density band (dark red region) that spans from the upper left to the lower right of the plot and corresponds to a subset of high-quality candidate outfits. Within this high-quality subspace, the user preference score and the compatibility score exhibit a clear negative correlation. This pattern indicates that placing more emphasis on individual preference typically requires sacrificing a certain level of general compatibility. Conversely, enforcing very strong compatibility tends to limit the extent to which personalized preferences can be satisfied. The observed negative dependence highlights the intrinsic limitation of simple linear weighting schemes in outfit generation and recommendation. Such schemes usually converge to a single compromise point and fail to explore the diverse set of high-quality solutions that lie along the Pareto front. At the same time, the spread of the solutions suggests that the proposed search framework can retain multiple feasible trade-off candidates within the explored outfit-combination space instead of collapsing to a narrow compromise region.

Figure 5. Global trade-off density analysis: user preference versus compatibility.
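The limitation of linear weighting noted above can be illustrated with a small numerical example. The sketch below filters nondominated points and then sweeps the weight of a linear scalarization; the toy scores are invented for illustration and are not taken from the experiments.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def nondominated(points):
    """Pareto filter for maximized objectives."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]


def weighted_sum_winners(points, steps=101):
    """Set of points that win a linear scalarization for some weight in [0, 1]."""
    winners = set()
    for s in range(steps):
        w = s / (steps - 1)
        winners.add(max(points, key=lambda p: w * p[0] + (1.0 - w) * p[1]))
    return winners


# Toy (preference, compatibility) scores; (0.3, 0.3) is dominated.
toy_outfits = [(0.9, 0.1), (0.45, 0.45), (0.1, 0.9), (0.3, 0.3)]
pareto_set = nondominated(toy_outfits)
reachable = weighted_sum_winners(toy_outfits)
```

In this toy case the balanced outfit (0.45, 0.45) is Pareto-optimal but is never selected by any weight, because it lies in a nonconvex part of the front. A fixed weight returns only one of the extreme points, whereas the Pareto filter keeps all three trade-off candidates.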
Outfit-level recommendation quality
To further assess the quality of the final recommendation results, we evaluate whether the proposed framework can generate outfit candidates that maintain a favorable balance between user preference and outfit compatibility. Because direct human evaluation is not available in the current setting, we adopt an external-expert protocol to provide complementary evidence beyond the internal objective values. Specifically, for each user, a Pareto-optimal solution set is first generated by MOOR. From this set, top-K candidate outfits are selected according to MOOR’s internal Pareto-based ranking strategy, which prioritizes solutions with higher crowding distance and balanced objective contributions, ensuring both quality and diversity among the selected outfits, where
Table 6 reports the outfit-level evaluation results under different top-K settings. Overall, MOOR maintains high and relatively stable external evaluation scores across all K values, indicating that the generated results are not only of high quality at the top-ranked positions but also preserve favorable preference consistency and outfit compatibility at deeper recommendation positions. As K increases, the evaluation metrics show only limited fluctuations, suggesting that MOOR does not obtain only a few isolated high-quality solutions but produces a candidate solution set with stable overall quality. Because HGCL and OutfitTransformer adopt representation mechanisms and scoring logic different from the HGUP and MMOC modules used within MOOR, the stable high-quality evaluation under the external proxy protocol indicates that the outputs of MOOR are not merely locally optimized for its internal objective functions. Instead, the results suggest that MOOR captures outfit matching patterns that can be consistently recognized across different model architectures. Overall, the results in Table 6 support the effectiveness of MOOR in terms of final outfit-level recommendation quality and show that the generated nondominated solution set can provide candidate outfits with relatively high external credibility for practical personalized fashion recommendation.
Table 6. Outfit-level evaluation of recommendation results.
Ablation study
To examine how different objective signals, optimization formulations, and search components affect MOOR, we conduct an ablation study. The variants are organized into three groups: objective-signal ablations, which remove either the preference or compatibility objective; a scalarization baseline, which combines the two objectives into a single weighted objective; and a search-mechanism ablation, which removes the phenotype-aware selection strategy. This design evaluates whether the two estimated objectives provide complementary signals for Pareto-based outfit generation, whether explicit multiobjective optimization is preferable to fixed scalarization, and whether the task-specific selection mechanism contributes to recommendation diversity.
Preference-only (without MMOC): This variant removes the compatibility-objective signal provided by MMOC and optimizes only the outfit-level preference objective. The same search procedure is retained, but the fitness evaluation is based only on the preference score.
Compatibility-only (without HGUP): This variant removes the preference-objective signal provided by HGUP and optimizes only the outfit-compatibility objective. The same search procedure is retained, but the fitness evaluation is based only on the compatibility score.
Linear scalarization (LS): This baseline retains both preference and compatibility scores but converts the multiobjective formulation into a single weighted objective.
MOOR without PAS: A variant of the full model where the proposed phenotype-aware selection (PAS) strategy is ablated. Environmental selection reverts to standard NSGA-II crowding distance, relying solely on diversity in the objective space without considering structural item-level redundancy.
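For reference, the standard NSGA-II crowding distance that the "MOOR without PAS" variant falls back to can be sketched as follows. The function name and the toy front are illustrative; the sketch operates purely in objective space, which is the limitation this variant is designed to expose.

```python
def crowding_distance(points):
    """NSGA-II crowding distance for one nondominated front.

    points: list of objective tuples. Boundary solutions receive infinity;
    interior solutions accumulate the normalized gap between their neighbors.
    """
    n = len(points)
    if n == 0:
        return []
    distance = [0.0] * n
    for m in range(len(points[0])):
        order = sorted(range(n), key=lambda i: points[i][m])
        low, high = points[order[0]][m], points[order[-1]][m]
        distance[order[0]] = distance[order[-1]] = float("inf")
        if high == low:
            continue  # this objective cannot separate the solutions
        for pos in range(1, n - 1):
            gap = points[order[pos + 1]][m] - points[order[pos - 1]][m]
            distance[order[pos]] += gap / (high - low)
    return distance


# Toy front: the middle solution has evenly spaced neighbors on both objectives.
toy_front = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
toy_distances = crowding_distance(toy_front)
```

Two outfits that share the same top or bottom can still be far apart in objective space, so crowding distance alone does not penalize item-level redundancy.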
The experimental results in Figure 6 show that removing either objective signal leads to an imbalanced recommendation outcome. When only a single objective is optimized, the resulting solutions exhibit an extreme bias. Specifically, the preference-only (without MMOC) variant maximizes the preference score but incurs a severe degradation in compatibility. Conversely, the compatibility-only (without HGUP) variant achieves high compatibility but fails to capture personalized preferences, resulting in a substantially lower preference score. These single-objective formulations consistently lead to suboptimal overall recommendation quality (AvgOutfit). This empirically demonstrates that maximizing one objective inherently compromises the conflicting objective, thereby substantiating the necessity of a multiobjective formulation.

Figure 6. Ablation study of our proposed MOOR.
Beyond standard performance metrics, the evaluation of recommendation diversity provides deeper insights into the generated solution space. Although the LS baseline achieves a competitive overall score, its diversity metric collapses to a very low value, indicating that fixed-weight scalarization drives the optimization process toward a limited region of the Pareto front and produces highly homogeneous solutions. In contrast, MOOR achieves substantially higher diversity while simultaneously attaining superior overall performance. This result suggests that the proposed multiobjective framework not only improves recommendation quality but also yields a well-structured and diverse set of trade-off solutions, which is essential for practical outfit recommendation scenarios requiring both quality and user choice flexibility.
Furthermore, the comparison between the full MOOR framework and the MOOR without PAS variant validates the specific contribution of the proposed selection mechanism. Although standard multiobjective selection can preserve distribution in objective space, it does not necessarily prevent structural redundancy in the generated outfit candidates. The improved diversity of the full model suggests that incorporating item-level similarity filtering and category-level diversity considerations helps translate objective-space spread into more perceptually meaningful recommendation diversity. Therefore, the proposed selection strategy appears beneficial for retaining representative outfit candidates with reduced redundancy.
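One way to picture how item-level redundancy control differs from objective-space selection is the greedy filter below. It is a hypothetical sketch and not the PAS procedure itself: the `priority` value (for example, a crowding distance), the `max_item_reuse` cap, and the function name are all assumptions introduced for illustration.

```python
from collections import Counter


def select_diverse_outfits(candidates, k, max_item_reuse=1):
    """Greedily pick up to k outfits while capping how often any item repeats.

    candidates: list of (top_id, bottom_id, priority) tuples, where a higher
    priority means the outfit is more desirable in objective space.
    """
    if k <= 0:
        return []
    usage = Counter()
    selected = []
    for top, bottom, _ in sorted(candidates, key=lambda c: c[2], reverse=True):
        top_key, bottom_key = ("top", top), ("bottom", bottom)
        # Skip outfits whose top or bottom already appears too often.
        if usage[top_key] >= max_item_reuse or usage[bottom_key] >= max_item_reuse:
            continue
        selected.append((top, bottom))
        usage[top_key] += 1
        usage[bottom_key] += 1
        if len(selected) == k:
            break
    return selected


toy_candidates = [
    ("t1", "b1", 0.9),
    ("t1", "b2", 0.8),  # reuses top t1
    ("t2", "b1", 0.7),  # reuses bottom b1
    ("t2", "b2", 0.6),
]
picked = select_diverse_outfits(toy_candidates, k=2)
```

With a pure priority ranking, the top two outfits would both contain `t1`; the filter instead returns two outfits with no shared items, which is the kind of structural diversity that objective-space spread alone does not guarantee.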
User case study
To further examine the practical behavior of the proposed framework, we conduct a qualitative case study on six randomly selected users. For each user, we analyze the approximated Pareto solutions, the trade-off patterns between user preference and outfit compatibility, and the visual characteristics of the recommended outfits.
Visualization of the Pareto front
Figure 7 displays the three-dimensional Pareto fronts for the selected users, mapped across top preference, bottom preference, and compatibility scores. The visualizations consistently reveal a continuous, concave surface across different users. Rather than clustering in a localized region, the widely distributed solutions along these surfaces empirically illustrate the highly nonconvex nature of the objective space and the trade-off between compatibility and the decomposed item-level contributions to the preference objective. This pattern indicates that maximizing one objective generally necessitates a compromise in another. Furthermore, the density and uniform spread of the points suggest that the evolutionary search mechanism within MOOR avoids collapsing into a narrow region and approximates a diverse spectrum of high-quality trade-off solutions in a discrete combinatorial space.

Figure 7. Pareto front of six target users generated by the MOOR model.
Analysis of objective trade-offs
To further illustrate how the trade-off structure varies across users, Figure 8 provides parallel coordinate visualizations of the same solution sets. Each line corresponds to one candidate outfit and shows its compatibility score together with the diagnostic decomposition of the preference objective. A recurring pattern is that improvements in user preference are not always accompanied by improvements in outfit compatibility, which is consistent with the quantitative motivation for multiobjective optimization.

Figure 8. Scores of six users across three objectives.
At the same time, the degree of this trade-off differs across users. For some users, relatively favorable solutions can be found in which preference and compatibility remain jointly high, whereas for others, increasing preference is associated with a more visible reduction in compatibility. These user-specific differences suggest that a fixed scalarization strategy may be insufficient for capturing the full range of feasible recommendation trade-offs.
Recommendation visualization and case analysis
Figure 9 presents a representative case study illustrating the practical outcomes of different optimization paradigms. The user’s historical interactions demonstrate a consistent preference for minimalist, casual wear in neutral tones. As shown in Figures 9(b) and 9(c), single-objective optimization exhibits fundamental limitations: preference-only optimization captures individual item appeal (e.g., highly preferred tops and bottoms) but fails to ensure aesthetic coordination, resulting in disjointed ensembles. Compatibility-only optimization produces visually harmonious outfits but sacrifices personalization, generating generic recommendations disconnected from the user’s preference. The proposed MOOR framework provides a broader set of recommendation candidates that retain different balances between these two criteria. As shown in Figure 9(d), the final recommendations range from more preference-oriented candidates to more compatibility-oriented alternatives.

Figure 9. Comparative visualization of single-objective and multiobjective optimization recommendation results.
Conclusion and future work
This study has investigated outfit recommendation from the perspective of balancing user preference and outfit compatibility. To move beyond fixed scalarization-based recommendation, we have proposed MOOR, a multiobjective recommendation framework that combines a heterogeneous graph-based user preference estimator, a multimodal outfit compatibility estimator, and a task-specific evolutionary search procedure. Within this formulation, HGUP is used to provide the preference objective, MMOC is used to provide the compatibility objective, and the search stage is designed to approximate representative nondominated outfit candidates. The experimental results provide support for the proposed framework in several respects. First, the component-level evaluations indicate that HGUP and MMOC can provide effective estimators for user preference and outfit compatibility under the reported protocol. Second, the optimization-level analyses suggest that the proposed search procedure achieves favorable Pareto-approximation quality in the discrete outfit recommendation space. Third, the outfit-level and ablation results provide complementary evidence that the framework can preserve useful trade-off solutions and recommendation diversity. Taken together, these findings support the value of explicitly modeling the trade-off between user preference and outfit compatibility in outfit recommendation.
Despite these promising results, several directions remain for future work. First, temporal modeling may be incorporated to better capture the dynamic evolution of user preference. Second, richer contextual factors, such as usage occasion, seasonal variation, and emerging fashion trends, may further improve compatibility estimation and recommendation relevance. Third, direct user-centered evaluation would be valuable for assessing the practical usefulness of the generated recommendation sets beyond surrogate expert-based metrics. Fourth, improving Pareto candidate generation efficiency through nearline caching, reduced-budget search, and parallel objective evaluation remains an important direction for practical deployment. These directions may help further strengthen the applicability of multiobjective recommendation methods in fashion-related decision support scenarios.
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was partly supported by the Research and application of unmanned water plant operation management and control system based on artificial intelligence and Beijing Educational Science Planning 2025 Priority Focus Project: CEAA25008.
