A large language model persona-based framework for optimizing text-to-image prompts in fashion design applications

Abstract

This study presents a systematic framework that optimizes text-to-image generation prompts through Large Language Model (LLM) personas in fashion design applications. While generative models like Stable Diffusion show significant creative potential, prompt engineering remains challenging for domain experts lacking technical expertise. The framework follows a five-stage methodology: we generate prompts using LLM personas with varying expertise, create images from those prompts, evaluate the images against an objective checklist, identify weaknesses, and optimize the prompts accordingly. The Multi-expert persona achieved the highest baseline performance (9.11/11 points), which our optimization process raised to 10.05 points, a statistically significant 10.3% improvement (p<0.01). Optimized prompts significantly outperformed all persona approaches in requirement implementation and showed superior performance in human preference assessments. They also produced the highest maximum CLIP score (0.9043) and the highest maximum ImageReward score (1.7452) observed in the study. In head-to-head comparisons, optimized prompts secured first-place rankings in 50% of ImageReward-based human preference evaluations, well above the 20% random expectation. This framework bridges the gap between language models and image generation systems, enabling fashion professionals to achieve consistent, high-quality AI-generated designs without prompt engineering expertise, thereby accelerating creative workflows and reducing design iteration time.

Keywords

prompt engineering; LLM personas; text-to-image generation; multi-metric optimization; Stable Diffusion; fashion design applications

1. Introduction

Text-to-image generative models like Stable Diffusion, DALL-E, and Midjourney represent significant advances in artificial intelligence.^1,2 These systems transform natural language descriptions into corresponding visual outputs, revolutionizing creative workflows across visually-oriented industries.^3–5 Fashion design, product development, and interior design increasingly adopt open-source models due to their accessibility and creative potential.^6,7 Architecture exemplifies successful implementation, where generative AI enables innovative facade design through effective human-AI collaboration, advancing both conceptual development and visualization practices.^8,9

A critical gap separates these models’ technical capabilities from practical use by domain experts who lack prompting experience. Prompts—the text instructions given to AI systems—fundamentally determine output quality and whether the results meet specific requirements.^10,11 Effective prompt construction demands technical expertise in structure, terminology, and weight application—knowledge that extends beyond simple descriptive writing.^12,13 This expertise barrier significantly limits adoption among domain professionals, despite the technology’s transformative potential.^14,15

This expertise gap creates opportunities for bridging technical and domain expertise through Large Language Models (LLMs). Systems like GPT-4 and Claude demonstrate sophisticated text generation capabilities and can adopt specialized personas for domain-specific content creation.^16–19 By leveraging these persona-based approaches, we can bridge prompt engineering expertise gaps and enhance human-AI collaboration in design workflows.

We propose a systematic framework that couples LLMs with text-to-image models to achieve high-quality visual generation. Our data-driven methodology establishes iterative feedback loops connecting persona-based prompt generation, systematic evaluation, and optimization processes. This approach addresses critical accessibility needs in creative domains, particularly fashion design, where visual complexity demands both technical and domain expertise.

This study pursues four objectives. First, we establish distinct LLM personas and analyze their prompt generation patterns. Second, we evaluate prompt effectiveness through systematic image generation and assessment protocols. Third, we develop data-driven optimization methods that integrate persona strengths while addressing weaknesses. Fourth, we validate framework performance using objective checklists, semantic alignment measures, and human preference metrics.

2. Related works

2.1. Text-to-image generation technologies

Text-to-image generation has progressed from early GAN-based systems to sophisticated diffusion models. Generative Adversarial Networks (GANs) pioneered competitive learning between generator and discriminator networks, establishing foundational principles for the field.²⁰ GAN-CLS achieved the first successful text-conditioned image generation,²¹ while StackGAN introduced two-stage high-resolution generation²² and AttnGAN incorporated attention mechanisms for fine-grained text-image alignment.²³ The Contrastive Language-Image Pre-training (CLIP) model marked a paradigm shift, learning semantic associations between text and images from massive datasets.²⁴ CLIP’s foundation enabled DALL-E’s autoregressive approach¹ and GLIDE’s diffusion-based text conditioning.²⁵

Diffusion models emerged as a new paradigm, generating high-quality images through iterative noise addition and removal processes. Google’s Imagen demonstrated exceptional realism using powerful language models for text encoding.²⁶ Stable Diffusion achieved computational efficiency through latent diffusion architectures.² The system employs CLIP text encoders to transform prompts into embeddings, conditioning the latent space denoising process for high-quality generation. Recent advances include DALL-E 3 and Stable Diffusion XL, further pushing generation capabilities. This evolution highlights prompt engineering’s growing importance—our study’s focus. However, systematic optimization methodologies remain underdeveloped, particularly for specialized domains like fashion design.

2.2. Prompt engineering methodologies

Prompt engineering research predominantly employs empirical approaches, lacking systematic methodologies for optimization.^13,14 Current practices depend on trial-and-error experimentation and community-derived guidelines instead of data-driven frameworks.^11,12 This methodological gap creates substantial barriers for non-expert adoption.

Recent systematic optimization efforts incorporate Kansei engineering and knowledge graphs for UI generation, achieving measurable designer-model alignment improvements.²⁷ Automated systems like BeautifulPrompt transform simple descriptions into sophisticated prompts,²⁸ while PRISM enables black-box prompt identification.²⁹ Yet these approaches struggle with semantic consistency and cross-model transferability.

LLM integration for prompt generation offers promising solutions. Research confirms that LLM personas effectively simulate domain expertise^30,31 and reliably reproduce expert characteristics under controlled configurations.³² Nevertheless, consistency and authenticity limitations persist,^33,34 emphasizing the need for systematic evaluation and optimization frameworks.

2.3. AI applications in fashion design

Fashion design actively integrates AI across applications spanning trend prediction to generative design. Text-to-image generation emerges as particularly valuable, allowing designers to create visual outputs from textual descriptions of their design ideas.³⁵ Early fashion-specific approaches emphasized pose-preserving garment generation³⁶ and natural language-to-visual feature mapping.³⁷ Advanced systems now achieve end-to-end fashion image retrieval and generation.³⁸

Diffusion models dramatically enhance fashion image generation quality. Recent research integrates fashion domain knowledge with Stable Diffusion, achieving superior attribute preservation in generated designs.³⁹ Yet semantic gaps persist between textual descriptions and fashion visuals, especially for nuanced elements like texture, drape, and styling details.

Fashion AI generation faces key challenges: accurately representing detailed design elements, producing consistent outputs when using the same design specifications, and effectively communicating design requirements through prompts. Despite demonstrated technical capabilities, fashion professionals struggle with adoption due to prompt engineering complexity. This usability gap drives our systematic framework development, bridging domain expertise with technical requirements.

3. Material and methods

We used a comparative approach to test our AI image generation framework, systematically evaluating the impact of LLM personas on prompt quality and image generation outcomes. Fashion design provides an ideal experimental context because it demands both creative expression and technical precision, requirements that comprehensively test the framework’s capabilities.

3.1. Framework overview

Our framework employs five sequential stages that optimize text-to-image generation through systematic persona-based prompt engineering (Figure 1). Each stage contributes to a data-driven feedback loop connecting initial prompt generation with optimized output derivation:

1. Prompt generation: Various LLM personas generate domain-specific prompts from identical requirements

2. Image generation: Text-to-image models generate image sets using persona-specific prompts

3. Systematic evaluation: Objective checklist assessment of requirement implementation

4. Optimization analysis: Data-driven identification of persona strengths and weaknesses

5. Validation: Performance verification of optimized prompts through comprehensive evaluation

Figure 1.

LLM persona-based optimization framework for AI image generation.

This iterative approach systematically integrates diverse expertise while maintaining objective evaluation standards, establishing data-driven feedback mechanisms that generate improved prompts from evaluation results. The framework leverages multiple persona strengths while compensating for individual limitations.

3.2. Persona configuration and prompt generation

We designed four persona types to systematically examine how different types of expertise affect prompt generation quality. Each persona represents a distinct knowledge domain: No-persona (baseline with no specialized role), AI Expert (technical prompt engineering expertise), Fashion Designer (domain-specific fashion knowledge), and Multi-expert (combined technical and fashion expertise). This experimental design enables controlled testing of whether domain knowledge, technical knowledge, or their combination produces superior prompting performance.

To create each persona, we began conversations with Claude 3.5 Sonnet by providing specific role assignments. No-persona received standard instructions without any specialized role. AI Expert was assigned “You are an expert in AI image generation,” focusing on technical optimization without fashion knowledge. Fashion Designer received “You are a world-renowned fashion designer,” emphasizing garment expertise without technical prompting skills. Multi-expert used a two-step approach: first generating fashion descriptions as a designer, then converting these into technical prompts as an AI expert.

All personas received identical fashion design requirements and were asked to generate Stable Diffusion-optimized prompts. The specific instructions and requirements given to each persona are shown in Table 1. These requirements included both objective specifications (orange primary color, summer blue secondary color, check patterns, front view, plain background, dramatic lighting) and subjective elements (funky style), allowing us to test how different expertise types handle various specification challenges.

Table 1.

Fashion design requirements for persona-based prompt generation experiment.

Prompts for three different personas
1. No-persona: No input
2. AI Expert: You are an expert in AI image generation.
3. Fashion Designer: You are a world-renowned fashion designer.
[Input persona.] You are going to perform fashion design using Stable Diffusion. Please write prompts for generating fashion design images according to the following requirements.
1. Design specifications:
- garment type: dress
- style: funky
- color: primary color: orange, secondary colors: summer blue
- detail: check patterns
2. Photography settings:
- view: front
- background: plain
- lighting: dramatic
3. Output requirements:
- positive prompt:
* prompt token limit: within 75 tokens
* clothing description (including Quality Modifiers)
* style and mood
* detailed specifications
* parameters for image quality enhancement (e.g., highly detailed, 8k uhd, professional fashion photography)

Prompts for Multi-expert
Step 1:
You are a world-renowned fashion designer. Please provide a detailed description of a fashion design according to the following requirements.
Design specifications:
Identical design specifications to the three personas above.
Step 2:
You are an expert in AI image generation. You are going to perform fashion design using Stable Diffusion. Please write prompts for generating fashion design images according to the following requirements.
1. Design specifications:
[Input the step 1 output.]
2. Photography settings:
Identical photography settings to the three personas above.
3. Output requirements:
Identical output requirements to the three personas above.
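
The instruction templates in Table 1 can be assembled programmatically. The following is a minimal sketch, not the authors' tooling: the function and constant names are hypothetical, and the exact whitespace and line breaks sent to the LLM are assumptions, since Table 1 gives only the wording.

```python
# Role sentences from Table 1; the empty string stands for "No input".
PERSONA_ROLES = {
    "No-persona": "",
    "AI Expert": "You are an expert in AI image generation.",
    "Fashion Designer": "You are a world-renowned fashion designer.",
}

TASK = (
    "You are going to perform fashion design using Stable Diffusion. "
    "Please write prompts for generating fashion design images "
    "according to the following requirements."
)
DESIGN_SPEC = "\n".join([
    "- garment type: dress",
    "- style: funky",
    "- color: primary color: orange, secondary colors: summer blue",
    "- detail: check patterns",
])
PHOTO_SETTINGS = "\n".join([
    "- view: front",
    "- background: plain",
    "- lighting: dramatic",
])
OUTPUT_REQUIREMENTS = "\n".join([
    "- positive prompt:",
    "  * prompt token limit: within 75 tokens",
    "  * clothing description (including Quality Modifiers)",
    "  * style and mood",
    "  * detailed specifications",
    "  * parameters for image quality enhancement (e.g., highly detailed, "
    "8k uhd, professional fashion photography)",
])


def build_instruction(role: str, design_spec: str = DESIGN_SPEC) -> str:
    """Single-step instruction (No-persona, AI Expert, Fashion Designer)."""
    header = f"{role} {TASK}".strip()  # strip() drops the gap for No-persona
    return "\n".join([
        header,
        "1. Design specifications:", design_spec,
        "2. Photography settings:", PHOTO_SETTINGS,
        "3. Output requirements:", OUTPUT_REQUIREMENTS,
    ])


def build_multi_expert_step1() -> str:
    """Step 1 of Multi-expert: the designer describes the garment."""
    return "\n".join([
        PERSONA_ROLES["Fashion Designer"] + " Please provide a detailed "
        "description of a fashion design according to the following "
        "requirements.",
        "Design specifications:", DESIGN_SPEC,
    ])


def build_multi_expert_step2(step1_output: str) -> str:
    """Step 2 of Multi-expert: the AI expert converts the step 1 text."""
    return build_instruction(PERSONA_ROLES["AI Expert"], design_spec=step1_output)
```

In this sketch the Multi-expert condition differs from the AI Expert condition only in what fills the design specification slot: the designer's free-text description replaces the four-line requirement list.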

3.3. Image generation protocol

We generated images using Stable Diffusion under controlled parameters, standardizing generation parameters to focus on prompt-specific effects. Our experimental setup included AMD Ryzen 7 CPU, NVIDIA GeForce RTX 4060 GPU (8GB VRAM), 32GB RAM, and AUTOMATIC1111’s Stable Diffusion WebUI (v1.6.0). Realistic Vision v5.1, a fine-tuned checkpoint based on Stable Diffusion v1.5 and optimized for realistic image generation, served as the base model throughout all experiments.

We standardized generation parameters across conditions: CFG scale 5, 20 sampling steps, DPM++ SDE Karras sampler, 512×768 initial resolution with 1.5× upscaling via 4×-UltraSharp (denoising strength 0.45). Adetailer enhanced anatomical accuracy using face_yolo8n and hand_yolo8n models for faces and hands respectively.

Each persona generated 40 images using consecutive seeds (1-40), ensuring reproducibility and adequate statistical power. Persona-specific prompts occupied the positive prompt field, while all conditions used Realistic Vision v5.1’s recommended negative prompt template to prevent anatomical deformities, unrealistic rendering styles, and technical artifacts. We systematically organized generated images by persona and seed number, creating a comprehensive dataset for evaluation and analysis.
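
For readers who want to script an equivalent batch, the settings above can be collected into one request per seed. This is a hedged sketch: the paper does not say whether the WebUI was driven through its interface or its HTTP API. The field names follow the WebUI's txt2img API as commonly documented, the upscaler string depends on how the model file is registered locally, and the ADetailer step is omitted because its arguments vary by extension version.

```python
from typing import Any


def build_payload(prompt: str, negative_prompt: str, seed: int) -> dict[str, Any]:
    """One txt2img request mirroring the standardized parameters."""
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "seed": seed,
        "steps": 20,                        # 20 sampling steps
        "cfg_scale": 5,                     # CFG scale 5
        "sampler_name": "DPM++ SDE Karras",
        "width": 512,                       # initial resolution 512x768
        "height": 768,
        "enable_hr": True,                  # high-resolution pass
        "hr_scale": 1.5,                    # 1.5x upscaling
        "hr_upscaler": "4x-UltraSharp",     # assumed local model name
        "denoising_strength": 0.45,
    }


def build_batch(prompt: str, negative_prompt: str,
                seeds: range = range(1, 41)) -> list[dict[str, Any]]:
    """Payloads for consecutive seeds 1-40, one image per seed."""
    return [build_payload(prompt, negative_prompt, s) for s in seeds]


def final_resolution(payload: dict[str, Any]) -> tuple[int, int]:
    """Output size after the upscaling pass."""
    scale = payload["hr_scale"]
    return int(payload["width"] * scale), int(payload["height"] * scale)
```

Under these settings each image is upscaled from 512×768 to 768×1152. Holding every field except `prompt` and `seed` fixed is what lets differences between conditions be attributed to the prompt.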

3.4. Evaluation methodology

We developed a systematic checklist that converts subjective design requirements into measurable criteria, focusing exclusively on objectively observable elements in generated images. The checklist deliberately excluded highly subjective aspects like “funky style” assessment due to difficulties in establishing reliable inter-rater agreement for such aesthetic judgments.

Our evaluation checklist comprised 11 items across four categories (Table 2): Basic Requirements (female figure, dress garment), Design Elements (orange presence, blue presence, check patterns, color proportions), Image Composition (front view, plain background, lighting), and Garment Presentation (complete visibility, texture expression). Categories captured requirement implementation from fundamental recognition to sophisticated design execution.

Table 2.

Objective evaluation checklist for fashion design requirement implementation.

1. Basic requirements (0-2 points)

□ Female figure is identifiable (0/1)

□ Dress-type garment is worn (0/1)

2. Design elements (0-4 points)

□ Orange color is present in the garment (0/1)

□ Blue color is present in the garment (0/1)

□ Orange covers a larger area than blue (0/1)

□ Check pattern (intersecting lines) is identifiable in the garment (0/1)

3. Image composition (0-3 points)

□ Figure’s body is facing forward (0/1)

□ Background has no complex elements (single color or gradient) (0/1)

□ At least 3 identifiable shadow/highlight areas on the figure and garment (0/1)

4. Garment presentation (0-2 points)

□ Entire shape of the garment is visible (not cropped) (0/1)

□ Folds, wrinkles, and texture of the garment are identifiable (0/1)

Total Score: /11 points

Binary scoring (1: present/clear, 0: absent/unclear) maximized objectivity and minimized subjective interpretation, avoiding Likert scale biases. Two independent researchers experienced in fashion design and AI evaluation assessed images blindly, without persona attribution knowledge.
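
The checklist in Table 2 maps onto a simple scoring routine. The sketch below is illustrative only. The item keys are hypothetical shorthand for the Table 2 wording, and it scores one rater's marks for one image, since how the two raters' marks were combined is not described here.

```python
# Categories and items follow Table 2; the keys are shorthand labels.
CHECKLIST = {
    "Basic requirements": ["female_figure", "dress_garment"],
    "Design elements": ["orange_present", "blue_present",
                        "orange_larger_than_blue", "check_pattern"],
    "Image composition": ["front_view", "plain_background",
                          "dramatic_lighting"],
    "Garment presentation": ["not_cropped", "texture_visible"],
}


def score_image(marks: dict[str, int]) -> dict[str, int]:
    """Return category subtotals and the 11-point total for one image."""
    expected = {item for items in CHECKLIST.values() for item in items}
    if set(marks) != expected:
        raise ValueError("marks must cover exactly the 11 checklist items")
    if any(value not in (0, 1) for value in marks.values()):
        raise ValueError("each item is scored 0 (absent) or 1 (present)")
    result = {
        category: sum(marks[item] for item in items)
        for category, items in CHECKLIST.items()
    }
    result["Total"] = sum(result.values())
    return result
```

For example, an image that satisfies everything except the two blue-related items would score 2 out of 4 on Design elements and 9 out of 11 overall.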

3.5. Prompt optimization algorithm

We established a data-driven optimization procedure analyzing checklist results across personas. The procedure prioritizes weakness identification and strength balancing, systematically integrating optimal elements from each persona approach (Algorithm 1).

The process begins by selecting the highest-scoring persona prompt as base structure, establishing a performance-proven foundation. Systematic weakness analysis follows, identifying and prioritizing lowest-scoring checklist items to focus optimization on critical improvement areas.

Cross-persona analysis is conducted for each weakness. When other personas score higher on specific items, we examine their expressions, structure, and positioning to identify effectiveness factors. For universally low-scoring items, we identify common challenges and develop novel solutions while considering how improving one checklist item might affect others.

Strength balancing parallels weakness analysis, examining exceptionally high-scoring items. We consider weight reductions for perfectly implemented elements, adjust over-emphasized expressions hindering other items, and simplify redundancies. This reallocates attention from over-performing to under-performing areas.

This analysis yields specific improvement strategies: borrowing effective expressions from superior personas, replacing ambiguous terms with specific alternatives, repositioning important elements, modifying weights for balanced attention, and grouping related elements strategically.

Final integration sequentially applies improvements while maintaining prompt coherence. This process balances length against complexity, evaluates inter-item impacts, preserves base prompt strengths, and ensures natural consistency. Final review confirms comprehensive requirement coverage with all improvements implemented.
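
The first steps of this procedure (base selection, weakness ranking, and the cross-persona search) can be sketched in code using the per-item means reported in Table 4. This is a simplified illustration with hypothetical function names. The later steps, such as rewording, repositioning, and weight changes, depend on judgment and are not automated here.

```python
ITEMS = ["female", "dress", "orange", "blue", "orange_gt_blue", "check",
         "front_view", "background", "lighting", "not_cropped", "texture"]

# Per-item mean scores for the four personas, copied from Table 4.
TABLE4 = {
    "No-persona":       [1, 1, 1, 0.10, 0.10, 0.94, 1, 0.95, 1, 0.86, 1],
    "AI Expert":        [1, 1, 1, 0.06, 0.06, 0.95, 1, 0.98, 1, 0.90, 1],
    "Fashion Designer": [1, 1, 1, 0.18, 0.16, 0.73, 1, 0.34, 1, 1.00, 1],
    "Multi-expert":     [1, 1, 1, 0.40, 0.20, 1.00, 1, 0.70, 1, 0.81, 1],
}
SCORES = {persona: dict(zip(ITEMS, vals)) for persona, vals in TABLE4.items()}


def select_base(scores: dict) -> str:
    """Pick the persona with the highest total as the base prompt."""
    return max(scores, key=lambda p: sum(scores[p].values()))


def rank_weaknesses(scores: dict, base: str) -> list[str]:
    """Items the base persona does not fully satisfy, weakest first."""
    weak = [item for item, s in scores[base].items() if s < 1.0]
    return sorted(weak, key=lambda item: scores[base][item])


def find_donor(scores: dict, base: str, item: str):
    """Best other persona on `item`, or None if nobody beats the base.

    None marks a universally difficult item that needs a new strategy
    rather than borrowed wording.
    """
    others = {p: s[item] for p, s in scores.items() if p != base}
    best = max(others, key=others.get)
    return best if others[best] > scores[base][item] else None
```

With the Table 4 values, this selects Multi-expert as the base and ranks the two color items as its weakest, with no persona to borrow from. Background and cropping each have a higher-scoring persona, though the best scorer is not automatically a suitable donor if its wording has side effects on other items.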

3.6. Multi-metric validation

We validated optimized prompts using multiple metrics to assess different dimensions of improvement. Checklist assessment provided objective evaluation of requirement implementation through systematic analysis of 40 new images generated with optimized prompts under identical parameters, enabling direct quantitative comparison.

CLIP scores assessed prompt-image semantic alignment. Using ViT-B/32 architecture, we calculated weighted cosine similarity:

\mathrm{CLIP}_S(c, v) = w \times \max(\cos(c, v), 0) \qquad (1)

where w = 2.5, c = prompt embedding, v = visual embedding.⁴⁰ Two reference prompts enabled separate assessment: comprehensive (“A woman wearing funky orange dress with summer blue check patterns, standing front view, plain background with dramatic lighting”) and design-focused (“A woman wearing funky orange dress with summer blue check patterns”), isolating design versus compositional elements.
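
Equation (1) is straightforward to compute once the two embeddings are available. The sketch below assumes the embeddings have already been extracted from a CLIP ViT-B/32 model and are passed in as plain sequences of floats. The function names are illustrative, and the embedding extraction itself is not shown.

```python
import math
from typing import Sequence


def cosine(c: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between a prompt and a visual embedding."""
    if len(c) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(c, v))
    norm = math.sqrt(sum(a * a for a in c)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def clip_s(c: Sequence[float], v: Sequence[float], w: float = 2.5) -> float:
    """Equation (1): rescaled cosine similarity, clipped at zero."""
    return w * max(cosine(c, v), 0.0)
```

Because of the factor w = 2.5, scores range from 0 to 2.5 rather than 0 to 1. For instance, the No-persona comprehensive mean of 0.7786 in Table 5 corresponds to a raw cosine similarity of about 0.31.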

ImageReward evaluation measured human preference alignment through a model trained on expert rankings. Trained on 137,000 expert comparisons, ImageReward captures visual quality and appeal beyond text-image alignment, outperforming BLIP, CLIP, and Aesthetic methods.⁴¹ The same reference prompts used for CLIP evaluation enabled direct comparison between semantic alignment and human preference metrics.

This multi-faceted validation approach allows comprehensive assessment of optimization effectiveness across technical accuracy, semantic alignment, and human preference dimensions, providing robust evaluation beyond single-metric limitations.

4. Results

4.1. Persona-based prompt generation analysis

Each LLM persona produced distinct prompts from identical fashion design requirements, revealing characteristic patterns in structure, terminology, and technical elements (Table 3). These variations directly reflect assigned expertise domains, demonstrating how knowledge types shape AI-assisted prompt generation.

Table 3.

Prompts generated by each LLM persona for fashion design requirements.

Persona	Positive prompts
No	masterpiece, ultra detailed, fashion photography of a funky summer dress, (orange as main color:1.3), (light blue accents:0.8), checkered pattern, sleeveless A-line dress, geometric design, retro futuristic style, front view, plain white background, dramatic studio lighting, professional fashion shoot, 8k uhd, highly detailed, sharp focus, fashion model pose
AI Expert	masterpiece, professional fashion photography, (funky orange dress:1.2), summer blue checkered pattern, front view, dramatic lighting, plain background, highly detailed, sharp focus, 8k uhd, full body shot, fashion design, editorial photography, professional lighting setup, crisp details, perfect composition, designer clothing, high fashion, studio photography, garment focus, intricate fabric texture, seamless pattern
Fashion Designer	a funky style dress, vibrant orange as main color with summer blue accents, checkered pattern details, fashion design sketch, front view presentation, plain background, dramatic studio lighting, highly detailed, crisp lines, fashion illustration, professional fashion photography, 8k uhd, elegant draping, impeccable tailoring, haute couture design, sleek silhouette, master craftsmanship
Multi-expert	masterpiece, professional fashion photography, front view, (one-shoulder A-line dress:1.2), irregular check pattern in (vibrant orange:1.3) and (summer blue:1.2), asymmetrical ruffle detail, bias-cut layers, contemporary design, dramatic lighting, plain background, highly detailed, 8k uhd, fashion editorial style

Component analysis revealed distinct persona patterns. No-persona balanced 19 components across quality enhancement, garment specification, and photographic composition, strategically weighting orange as primary and light blue as accent. AI Expert’s 24-component prompt emphasized technical excellence through “editorial photography,” “professional lighting setup,” and “sharp focus” terminology. Fashion Designer’s 18 components concentrated on garment terminology and haute couture references, uniquely avoiding all technical weights. Multi-expert efficiently integrated domain and technical knowledge in 16 components, combining professional construction terminology with strategic weight application.
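
The use of explicit attention weights can be checked mechanically. The sketch below extracts `(term:weight)` groups from the Table 3 prompts with a regular expression. It covers only this one syntax and does not reproduce the component counts above, whose segmentation rule is not specified.

```python
import re

# Matches "(some term:1.3)"; nested parentheses are not handled.
WEIGHT_RE = re.compile(r"\(([^():]+):([0-9]*\.?[0-9]+)\)")

PROMPTS = {  # positive prompts copied from Table 3
    "No-persona": (
        "masterpiece, ultra detailed, fashion photography of a funky summer "
        "dress, (orange as main color:1.3), (light blue accents:0.8), "
        "checkered pattern, sleeveless A-line dress, geometric design, retro "
        "futuristic style, front view, plain white background, dramatic "
        "studio lighting, professional fashion shoot, 8k uhd, highly "
        "detailed, sharp focus, fashion model pose"
    ),
    "AI Expert": (
        "masterpiece, professional fashion photography, (funky orange "
        "dress:1.2), summer blue checkered pattern, front view, dramatic "
        "lighting, plain background, highly detailed, sharp focus, 8k uhd, "
        "full body shot, fashion design, editorial photography, professional "
        "lighting setup, crisp details, perfect composition, designer "
        "clothing, high fashion, studio photography, garment focus, "
        "intricate fabric texture, seamless pattern"
    ),
    "Fashion Designer": (
        "a funky style dress, vibrant orange as main color with summer blue "
        "accents, checkered pattern details, fashion design sketch, front "
        "view presentation, plain background, dramatic studio lighting, "
        "highly detailed, crisp lines, fashion illustration, professional "
        "fashion photography, 8k uhd, elegant draping, impeccable tailoring, "
        "haute couture design, sleek silhouette, master craftsmanship"
    ),
    "Multi-expert": (
        "masterpiece, professional fashion photography, front view, "
        "(one-shoulder A-line dress:1.2), irregular check pattern in "
        "(vibrant orange:1.3) and (summer blue:1.2), asymmetrical ruffle "
        "detail, bias-cut layers, contemporary design, dramatic lighting, "
        "plain background, highly detailed, 8k uhd, fashion editorial style"
    ),
}


def extract_weights(prompt: str) -> dict[str, float]:
    """Map each explicitly weighted term to its weight."""
    return {term.strip(): float(w) for term, w in WEIGHT_RE.findall(prompt)}


WEIGHTS = {persona: extract_weights(p) for persona, p in PROMPTS.items()}
```

On these prompts the extraction finds no weighted terms for Fashion Designer, which matches the observation that this persona avoided technical weights. Multi-expert is the only persona that weights both required colors.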

4.2. Image generation results

Persona-specific prompts yielded 160 images (40 per persona) exhibiting distinct visual characteristics. Representative samples reveal systematic variations in style, composition, and design elements across personas (Figure 2). No-persona and AI Expert produced clean backgrounds with straightforward garment presentation. Fashion Designer unexpectedly generated runway contexts without explicit prompt instructions. Multi-expert achieved rich clothing details and superior asymmetrical design implementation.

Figure 2.

Representative fashion design images generated by each LLM persona (selected seed numbers: 8, 18, 28, 38).

4.3. Checklist evaluation results

Checklist evaluation revealed significant performance differences in requirement implementation across personas (Table 4). One-way ANOVA confirmed these differences (F=19.7, p<0.001), with post-hoc Tukey HSD tests showing distinct performance tiers. Multi-expert achieved the highest baseline score (9.11±1.01), significantly outperforming Fashion Designer but not differing significantly from No-persona (8.95±0.84) and AI Expert (8.95±0.58). Fashion Designer scored significantly lower (8.41±0.90) than all other persona approaches.

Table 4.

Checklist evaluation results by LLM persona and optimized prompt.

Category	Item	No-persona	AI Expert	Fashion Designer	Multi-expert	Optimized
Basic	Female	1	1	1	1	1
	Dress	1	1	1	1	1
	Subtotal	2	2	2	2	2
Design	Orange color	1	1	1	1	0.98
	Blue color	0.10	0.06	0.18	0.40	0.78
	Orange>Blue	0.10	0.06	0.16	0.20	0.40
	Check pattern	0.94	0.95	0.73	1	1
	Subtotal	2.14	2.07	2.07	2.60	3.16
Image	Front view	1	1	1	1	0.98
	Background	0.95	0.98	0.34	0.70	0.98
	Lighting	1	1	1	1	1
	Subtotal	2.95	2.98	2.34	2.70	2.96
Garment	Not cropped	0.86	0.90	1	0.81	0.93
	Texture	1	1	1	1	1
	Subtotal	1.86	1.90	2	1.81	1.93
Total (Mean ± SD)		8.95±0.84	8.95±0.58	8.41±0.90	9.11±1.01	10.05±0.81

Note. ANOVA confirmed significant differences (F=19.7, p<0.001) with optimized prompts outperforming all persona approaches and Fashion Designer scoring lowest.
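
As a rough plausibility check, a one-way ANOVA F statistic can be approximated from the summary row of Table 4 alone. The sketch below assumes 40 independent scores per approach and equal group sizes. The reported F was presumably computed from the raw scores, so the rounded means and standard deviations used here can only give an approximate value.

```python
def anova_from_summary(means: list[float], sds: list[float], n: int):
    """One-way ANOVA F from group means and SDs with equal group size n.

    Returns (F, df_between, df_within).
    """
    k = len(means)
    grand_mean = sum(means) / k                # valid because n is equal
    ss_between = n * sum((m - grand_mean) ** 2 for m in means)
    ms_between = ss_between / (k - 1)
    ms_within = sum(sd ** 2 for sd in sds) / k  # pooled variance, equal n
    return ms_between / ms_within, k - 1, k * (n - 1)


# Total (Mean ± SD) row of Table 4, in column order.
MEANS = [8.95, 8.95, 8.41, 9.11, 10.05]
SDS = [0.84, 0.58, 0.90, 1.01, 0.81]

F_APPROX, DF_BETWEEN, DF_WITHIN = anova_from_summary(MEANS, SDS, n=40)
```

Under these assumptions the approximation comes out close to 20 with 4 and 195 degrees of freedom, in line with the reported F=19.7. This suggests the reported statistic covers all five columns, including the optimized prompt.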

All personas perfectly executed Basic Requirements, generating female dress-wearing models without explicit gender prompts. Design Elements revealed maximum variation—Multi-expert excelled in blue color and check pattern implementation. Image Composition exposed background control differences: AI Expert and No-persona succeeded while Fashion Designer failed due to runway generation. Paradoxically, Fashion Designer achieved perfect Garment Presentation scores because runway contexts naturally feature full-body model presentations, preventing any garment cropping despite failing background requirements.

4.4. Prompt optimization process

Based on checklist evaluation analysis, the Multi-expert prompt served as the optimization base structure due to its highest total performance. Systematic weakness analysis identified color-related items as primary optimization targets: “orange covers larger area than blue” and “blue color presence” showed the lowest implementation scores, followed by “background” and “not cropped” items that demonstrated room for improvement.

Strategic modifications addressed each weakness through cross-persona analysis. Blue color implementation proved universally challenging—most images showed minimal blue despite explicit requirements. We increased summer blue weight from 1.2 to 1.3 while reducing orange from 1.3 to 1.1, rebalancing color dominance to enhance blue expression.

Structural optimization resolved background issues by repositioning “front view, plain background, dramatic lighting” early in the prompt and specifying “dramatic studio lighting” based on No-persona and AI Expert’s successful background implementation strategies. For garment visibility, we adopted “fashion model pose, full body shot” from high-performing personas, avoiding Fashion Designer’s runway context triggers. Style refinements transformed “one-shoulder A-line dress” to “A-line funky dress,” better capturing original requirements while adjusting structural weights for balanced attention distribution.

The systematic integration of all improvements resulted in the final optimized prompt that maintained overall coherence while addressing identified weaknesses:

“masterpiece, professional fashion photography, front view, plain background, dramatic studio lighting, (A-line funky dress:1.1), (irregular check pattern:1.1) in (vibrant orange:1.1) and (summer blue:1.3), asymmetrical ruffle detail, bias-cut layers, contemporary design, highly detailed, 8k uhd, sharp focus, full body shot, fashion model pose, fashion editorial style”
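
The color weight changes described above amount to rewriting `(term:weight)` groups in the base prompt. A minimal sketch of that single step follows. The helper name is hypothetical, and the repositioning and wording changes in the final prompt were editorial decisions that this code does not reproduce.

```python
import re


def set_weight(prompt: str, term: str, weight: float) -> str:
    """Replace the weight of an already weighted term, e.g. (term:1.2)."""
    pattern = re.compile(r"\(" + re.escape(term) + r":[0-9]*\.?[0-9]+\)")
    updated, count = pattern.subn(lambda _m: f"({term}:{weight})", prompt)
    if count == 0:
        raise KeyError(f"no weighted term {term!r} in prompt")
    return updated


# Color fragment of the Multi-expert prompt from Table 3.
base_fragment = (
    "irregular check pattern in (vibrant orange:1.3) and (summer blue:1.2)"
)

# Rebalance as described: blue 1.2 -> 1.3, orange 1.3 -> 1.1.
rebalanced = set_weight(base_fragment, "summer blue", 1.3)
rebalanced = set_weight(rebalanced, "vibrant orange", 1.1)
```

Raising an error for a term that is not already weighted keeps the rewrite explicit: adding a new weight, as the final prompt does for the dress and check pattern, is a separate decision from adjusting an existing one.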

4.5. Optimized prompt performance validation

4.5.1. Checklist evaluation

We generated 40 new images using the optimized prompt under identical parameters, with representative samples in Figure 3. Checklist evaluation demonstrated statistically significant improvement over all persona-based approaches, achieving the highest total score (10.05±0.81) compared to the best baseline performance of Multi-expert (9.11±1.01) (Table 4). One-way ANOVA confirmed significant differences between approaches (F=19.7, p<0.001), with post-hoc Tukey HSD tests showing optimized prompts significantly outperformed all individual persona approaches (p<0.01 for all comparisons), representing a substantial 10.3% improvement over the Multi-expert baseline.

Figure 3.

Representative fashion design images generated by the optimized prompt (selected seed numbers: 8, 18, 28, 38).

Design Elements achieved maximum improvement, especially blue color presence and proportions. Increasing summer blue weight to 1.3 enhanced visibility but caused overcorrection in some images where blue dominated orange, contradicting the ‘orange primary color’ requirement—revealing inherent limitations in weight-based color control. Image Composition scores improved through strategic repositioning and specification refinements. Early placement of “front view, plain background, dramatic studio lighting” with enhanced specificity resolved Multi-expert’s background implementation challenges. Garment Presentation advanced through strategic full-body expression additions. Incorporating “fashion model pose, full body shot” improved garment completeness while avoiding Fashion Designer’s problematic runway context generation.

4.5.2. CLIP score evaluation

Unlike the clear superiority shown in checklist evaluation, CLIP evaluation revealed a different pattern of performance differences across approaches (Table 5). One-way ANOVA confirmed these differences (Comprehensive: F=4.41, p<0.01; Design-focused: F=9.83, p<0.001). Post-hoc Tukey HSD tests showed that Fashion Designer performed significantly lower than other approaches, while No-persona, AI Expert, Multi-expert, and Optimized prompts showed no significant differences in mean performance.

Table 5.

CLIP score evaluation results for persona-based and optimized prompts.

Reference prompt	Statistic	No-persona	AI Expert	Fashion Designer	Multi-expert	Optimized
Comprehensive	Mean	0.7786	0.7608	0.7452	0.7691	0.7779
	SD	0.0370	0.0378	0.0532	0.0367	0.0415
	Max	0.8691	0.8193	0.8501	0.8442	0.8760
Design-focused	Mean	0.8286	0.8064	0.7702	0.8212	0.8241
	SD	0.0362	0.0455	0.0720	0.0409	0.0360
	Max	0.8857	0.8813	0.8789	0.8979	0.9043

Note. One-way ANOVA confirmed significant differences between approaches (p<0.01 for both criteria). Tukey HSD post-hoc tests showed Fashion Designer performing significantly lower than other top-performing approaches in pairwise comparisons.

Despite similar average performance among top approaches, optimized prompts achieved the highest maximum scores in both criteria (0.8760 comprehensive, 0.9043 design-focused) versus No-persona (0.8691, 0.8857). This distinction proves critical for practical applications, as fashion practitioners select the best outputs from multiple generations rather than using average results, making peak quality more relevant than average consistency.

Top-performing image analysis (n=200) revealed evaluation criteria impacts (Figure 4). Comprehensive evaluation selected two Optimized and three No-persona images, while design-focused evaluation chose one each from Optimized and No-persona plus three Multi-expert images. Among the top-5 images, only the first-ranked image was selected by both evaluation criteria, with the remaining four positions showing completely different images—highlighting how reference prompts significantly affect CLIP’s image ranking.

Figure 4.

Top 5 highest-scoring images in CLIP score evaluation across all personas. (a) Comprehensive; (b) Design-focused.

Figure 4(a) favored simple backgrounds and prominent shadows, aligning with “plain background” and “dramatic lighting” requirements. However, these images featured simpler garments—only two included frills or lace—suggesting “plain” inadvertently simplified garment design. Conversely, Figure 4(b) prioritized garment complexity with four of five images featuring decorative elements, reflecting “funky” style interpretation while tolerating non-plain backgrounds. Color assessment limitations emerged across criteria. Figure 4(a) selected three images entirely lacking blue, while Figure 4(b) better captured requirements with three images displaying blue check patterns. These findings reveal CLIP’s limitations in evaluating fashion design requirements. Despite high semantic alignment scores, top-ranked images often failed to implement critical design elements, questioning CLIP’s suitability for complex creative evaluation where multiple requirements must be simultaneously satisfied.

4.5.3. ImageReward evaluation

ImageReward, a model trained on human preference rankings, produced results that contrasted sharply with CLIP (Table 6). Unlike CLIP’s modest mean differences, ImageReward revealed significant performance variations (Comprehensive: F=47.17, p<0.001; Design-focused: F=32.23, p<0.001), with clear patterns favoring optimized prompts.

Table 6.

ImageReward evaluation results for persona-based and optimized prompts.

Reference prompt	Statistic	Image-generation prompt
		No-persona	AI expert	Fashion designer	Multi-expert	Optimized
Comprehensive	Mean	0.0612	-0.1474	-1.3192	0.0306	0.4330
	SD	0.5397	0.5366	0.4979	0.6270	0.8145
	Max	1.2333	0.8647	-0.4301	1.6940	1.7452
	Frequency¹	9	4	0	7	20
Design-focused	Mean	-0.2691	-0.5410	-1.3656	-0.2335	0.1519
	SD	0.5474	0.5209	0.5960	0.6484	0.8078
	Max	0.9306	0.3046	-0.1568	1.5660	1.5078
	Frequency¹	10	2	0	8	20

Note. ANOVA confirmed significant differences between approaches (p<0.001 for both criteria). Post-hoc tests revealed Fashion Designer consistently performing lowest, with optimized prompts showing superior performance in multiple pairwise comparisons.

¹Frequency indicates the number of images that ranked first when comparing images with identical seed numbers across all personas (out of 40 seed comparisons).
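The reported F statistics can be cross-checked against the summary rows of Table 6. The sketch below is a consistency check rather than the authors' analysis code: it assumes 40 images per approach (one per seed, as the frequency footnote suggests) and that the SD row holds sample standard deviations. The function name is hypothetical.

```python
def anova_f_from_summary(means, sds, n):
    """One-way ANOVA F statistic from group means and SDs with equal group size n."""
    k = len(means)
    grand_mean = sum(means) / k
    # Between-group mean square: spread of group means around the grand mean.
    ms_between = n * sum((m - grand_mean) ** 2 for m in means) / (k - 1)
    # Within-group mean square: with equal n this is the average group variance.
    ms_within = sum(sd ** 2 for sd in sds) / k
    return ms_between / ms_within

N_PER_APPROACH = 40  # assumption: 40 seeds, one image per seed and approach

# Means and SDs copied from Table 6 (No-persona, AI expert, Fashion designer,
# Multi-expert, Optimized).
f_comprehensive = anova_f_from_summary(
    [0.0612, -0.1474, -1.3192, 0.0306, 0.4330],
    [0.5397, 0.5366, 0.4979, 0.6270, 0.8145],
    N_PER_APPROACH,
)
f_design = anova_f_from_summary(
    [-0.2691, -0.5410, -1.3656, -0.2335, 0.1519],
    [0.5474, 0.5209, 0.5960, 0.6484, 0.8078],
    N_PER_APPROACH,
)
print(round(f_comprehensive, 1), round(f_design, 1))
```

Under these assumptions the recomputed values come out at roughly 47.2 and 32.2, close to the reported F=47.17 and F=32.23; small differences are expected because the table values are rounded.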

Tukey HSD post-hoc tests showed that optimized prompts outperformed AI Expert and Multi-expert in the comprehensive evaluation and exceeded No-persona and AI Expert in the design-focused evaluation (p<0.01 to p<0.05). Fashion Designer scored lower than all other approaches under both criteria (p<0.01), confirming poor alignment with human preferences despite domain expertise. These statistical findings agree with the frequency analysis, in which optimized prompts took first place in 50% of head-to-head comparisons (20/40 cases), substantially exceeding the 20% expected by chance. Mean scores and ranking frequency favored optimized prompts under both criteria, and the comprehensive maximum (1.7452) was also the highest; under the design-focused criterion, Multi-expert’s maximum (1.5660) slightly exceeded that of the optimized prompts (1.5078). Overall, these results indicate that human preferences capture quality dimensions beyond semantic alignment that the CLIP evaluation missed.
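How far 20 wins out of 40 sits above the 20% chance level can be gauged with an exact binomial tail probability. This is an illustrative sanity check, not a test reported in the study: it assumes the 40 seed comparisons are independent and that, under the null hypothesis, each of the five approaches is equally likely to rank first. The function name is hypothetical.

```python
from math import comb

def binomial_upper_tail(successes, trials, p):
    """Exact P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, i) * p**i * (1 - p) ** (trials - i)
        for i in range(successes, trials + 1)
    )

# First-place counts from the comprehensive "Frequency" row of Table 6.
first_place = {"No-persona": 9, "AI expert": 4, "Fashion designer": 0,
               "Multi-expert": 7, "Optimized": 20}

trials = sum(first_place.values())         # 40 seed-matched comparisons
chance = 1 / len(first_place)              # 0.2 if all approaches were equal
share = first_place["Optimized"] / trials  # observed first-place share
p_value = binomial_upper_tail(first_place["Optimized"], trials, chance)
print(f"share={share:.2f}, chance={chance:.2f}, p={p_value:.2e}")
```

Under these assumptions the tail probability falls well below 0.001, consistent with the claim that the optimized prompts' first-place share exceeds random expectation.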

ImageReward’s top-performing images (n=200) showed remarkable consistency between criteria (Figure 5). Both the comprehensive (Figure 5(a)) and design-focused (Figure 5(b)) evaluations selected four Optimized images and one Multi-expert image. The same four images appeared in both panels, though at different rank positions; only the fifth-place image differed.

Figure 5.

Top 5 highest-scoring images in ImageReward evaluation across all personas. (a) Comprehensive; (b) Design-focused.

All selected images incorporated the essential requirements: orange/blue check patterns, plain backgrounds, and funky styling with frills or lace. This comprehensive implementation contrasts with CLIP’s tolerance for missing elements. The consistency across different reference prompts is particularly noteworthy. The design-focused evaluation favored plain backgrounds even though its prompt contained no background specification, suggesting that ImageReward implicitly enforces fashion photography standards. The model prioritizes professional presentation even when prompts specify only garment design, which likely reflects a training bias toward clean, uncluttered fashion imagery.

5. Discussion

5.1. Prompt-image generation dynamics and implicit biases

Our framework reveals systematic biases in AI-generated imagery. All 160 images featured female models despite no gender specification—only “funky-style dress” appeared in prompts. This consistent bias reflects Stable Diffusion’s training data associations, raising concerns about gender representation in AI-assisted fashion design. Such implicit assumptions could limit creative diversity in applications requiring gender-neutral or inclusive representations.

Fashion Designer’s results exposed how domain language activates unintended model behaviors. Professional terminology like “haute couture design” and “elegant draping” triggered runway contexts in 65% of outputs without explicit mentions. While this enhanced garment visibility scores, it violated background requirements—demonstrating the double-edged nature of domain expertise. This phenomenon reveals how specialized vocabulary can activate latent biases in diffusion models, producing technically superior but contextually inappropriate results.

Performance gaps between personas challenge assumptions about the value of expertise in AI systems. Multi-expert’s superiority (9.11/11) over Fashion Designer (8.41/11) suggests that effective AI collaboration requires understanding how the system interprets language, not just domain knowledge. The hybrid approach balanced technical precision with fashion understanding, while pure domain expertise activated unwanted behaviors. These emergent dynamics, invisible in traditional prompt engineering, underscore the need for systematic evaluation when developing predictable AI design tools.

5.2. Multi-metric evaluation challenges

Multi-metric evaluation revealed distinct patterns across assessment approaches, highlighting fundamental differences in how AI-generated creative content should be evaluated. Statistical analysis confirmed significant performance differences across all three metrics, but with varying patterns that illuminate the complexity of creative AI assessment.

Checklist and ImageReward evaluations demonstrated clear optimization advantages, with optimized prompts significantly outperforming persona approaches in objective requirement implementation (p<0.01) and human preference alignment. These metrics’ sensitivity to systematic improvement reflects their focus on measurable design specifications and perceptual quality. However, CLIP evaluation revealed more modest mean differences, with statistical significance emerging primarily from Fashion Designer’s consistently lower performance rather than clear optimization advantages among top-performing approaches.

These divergent results indicate that different metrics capture distinct quality dimensions—checklist measures objective requirement implementation, ImageReward approximates human aesthetic preferences, while CLIP focuses on semantic text-image alignment. For practical creative applications, evaluation method selection significantly influences perceived optimization effectiveness, with checklist and human preference metrics providing the most relevant assessment for design workflows where practitioners prioritize requirement satisfaction and aesthetic appeal.

5.3. Model limitations and technical constraints

Our framework analysis reveals both the interpretability advantages of systematic approaches and fundamental limitations in current text-to-image architectures. Unlike black-box optimization systems, our methodology enables tracing how linguistic patterns influence generation outcomes. This transparency proved essential for understanding how specialized vocabulary activates unexpected model behaviors, producing contextually inappropriate results that compromise design requirements.

However, optimization effectiveness encounters architectural constraints that transcend prompt engineering solutions. Color control challenges exemplify these limitations—despite systematic weight adjustments increasing blue emphasis to 1.3 while reducing orange to 1.1, precise color balance remained elusive. Some images showed overcorrection with blue dominating orange, revealing how diffusion models process color semantically rather than through direct RGB control.
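Term weights of this kind are often written with the `(term:weight)` attention syntax accepted by some Stable Diffusion front ends, such as the AUTOMATIC1111 web UI. The sketch below assumes that syntax; the helper names and the base prompt are hypothetical, and only the 1.3 and 1.1 weights come from the text above.

```python
def weighted(term, weight):
    """Format a prompt term with an explicit attention weight, e.g. (blue:1.3)."""
    return f"({term}:{weight:g})"

def build_prompt(base, weights):
    """Append weighted terms to a base prompt, separated by commas."""
    return ", ".join([base] + [weighted(t, w) for t, w in weights.items()])

# Hypothetical base prompt; weights follow the adjustment described in the text.
prompt = build_prompt(
    "funky-style dress, check pattern",
    {"blue": 1.3, "orange": 1.1},
)
print(prompt)
# funky-style dress, check pattern, (blue:1.3), (orange:1.1)
```

Such weights scale how strongly the text encoder's output for a term influences generation; they do not set pixel colors directly, which is consistent with the overcorrection described above.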

These constraints emphasize that while prompt optimization achieves significant improvements in requirement implementation and human preference alignment, fundamental model limitations necessitate hybrid approaches. Effective creative AI systems require combining systematic prompt optimization with complementary techniques to address architectural constraints, suggesting directions for future research in controllable generation methods.

5.4. Implications for creative AI systems

Our findings redefine expertise requirements for AI-assisted creative work. While systematic optimization demonstrated clear advantages in objective requirement implementation and human preference metrics, effectiveness varies across different assessment approaches. Fashion professionals must develop “AI translation” skills alongside traditional expertise, learning how their specialized language triggers specific model behaviors. This paradigm shift transforms professional development needs, as success depends on bridging human creative intent with machine understanding.

Human-in-the-loop optimization preserves creative agency while leveraging AI capabilities. Unlike automated systems risking output homogenization, our method maintains designer control through transparent modification steps. Each adjustment—from weight modifications to terminology changes—remains interpretable and reversible, crucial for subjective qualities where human judgment guides optimization beyond metrics. This transparency empowers designers to develop intuition about AI behavior, gradually building prompt engineering expertise through understanding rather than memorization.

The framework enables organizational transformation of AI capabilities. Optimized prompts become reusable templates, converting individual prompt engineering skills into institutional knowledge. Organizations can develop style-specific prompt libraries aligned with brand aesthetics or seasonal themes. Different evaluation approaches serve distinct creative purposes—practical applications benefit from checklist and human preference optimization for requirement satisfaction and market appeal, while semantic alignment assessment provides insights for experimental design contexts. This flexibility allows the same framework to serve varied purposes from mainstream fashion to haute couture experimentation.
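A prompt library of this kind could be as simple as a set of named templates with placeholders. The following is a minimal sketch using Python's standard `string.Template`; the library structure, the template name, and the wording are hypothetical illustrations, not the prompts used in this study.

```python
from string import Template

# Hypothetical library: template names and wording are illustrative only.
PROMPT_LIBRARY = {
    "studio_lookbook": Template(
        "$garment, $pattern, plain background, dramatic lighting"
    ),
}

def render(name, **fields):
    """Fill a stored template; raises KeyError if a placeholder is missing."""
    return PROMPT_LIBRARY[name].substitute(**fields)

prompt = render(
    "studio_lookbook",
    garment="funky-style dress",
    pattern="orange and blue check pattern",
)
print(prompt)
```

Keeping the fixed, already-optimized wording in the template and exposing only the garment-specific fields is one way to share a validated prompt across a team without requiring each designer to re-derive it.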

5.5. Limitations and future research directions

Several limitations remain. Color balance challenges reveal architectural constraints that prompt engineering alone cannot overcome, necessitating hybrid approaches that combine optimization with complementary techniques. The difficulty of assessing subjective aesthetics mirrors a broader challenge in computational creativity: how to evaluate creative quality systematically remains contested in research on AI-generated content.

The four-persona selection, while systematically designed, introduces potential biases. The personas represent Western-centric perspectives of expertise—AI Expert reflects Silicon Valley technical culture, while Fashion Designer embodies haute couture traditions. Alternative persona configurations incorporating diverse cultural perspectives, emerging design philosophies, or interdisciplinary expertise might yield different optimization patterns. Additionally, the binary distinction between technical and domain expertise oversimplifies real-world knowledge integration, where practitioners often possess hybrid competencies.

Fashion-specific validation limits generalizability claims. Creative domains possess unique vocabularies and aesthetic criteria affecting framework transferability. Future work must test cross-domain applicability and explore coupling diverse multimodal AI systems beyond LLM-image generation pairs. Workflow integration through user experience studies constitutes the critical implementation step. Real-world adoption requires understanding how designers adapt creative processes to systematic AI collaboration.

6. Conclusions

We developed and validated a systematic framework optimizing text-to-image generation through LLM personas, bridging critical gaps in AI-assisted creative workflows. Our five-stage methodology demonstrates that data-driven expertise integration achieves measurable improvements over individual approaches across multiple assessment dimensions. Optimized prompts significantly enhanced objective requirement implementation and human preference alignment and achieved the highest peak scores on nearly all evaluation criteria (the design-focused ImageReward maximum, where Multi-expert scored slightly higher, was the exception), though effectiveness patterns varied with the assessment approach employed.

Three key contributions emerge: First, systematic prompt optimization through weakness prioritization and strength balancing transcends trial-and-error methods. Second, empirical evidence confirms synergistic effects when combining domain and technical expertise via LLM personas. Third, comprehensive evaluation integrating objective requirements with human preferences enables robust creative content validation.

Practically, our framework democratizes high-quality AI image generation by removing expertise barriers and enabling consistent creative outputs. Fashion professionals gain systematic AI integration approaches that enhance design workflow efficiency while maintaining quality standards. The methodology enables organizations to transform individual prompt engineering insights into institutional knowledge, supporting scalable creative production across diverse application contexts.

Future research must extend cross-domain applicability, develop systematic aesthetic evaluation methods, and integrate real-world design workflows. This foundation for systematic human-AI collaboration demonstrates how effective LLM-image model coupling enhances creative capabilities while preserving human agency and professional expertise.

Footnotes

Author note

Minsuk Kim, Senior Researcher, Safety Convergence Technology R&D department, Research Institute of Human-Centric Manufacturing Technology, KITECH, Republic of Korea.

Seungju Lim, Senior Researcher, Regional Industrial Innovation Department (Manufacturing Robot), Research Institute of Human-Centric Manufacturing Technology, KITECH, Republic of Korea.

ORCID iD

Minsuk Kim

Author contributions

Minsuk Kim: Conceptualization, Methodology, Writing—original draft.

Seungju Lim: Supervision, Writing—review & editing.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Korea Institute of Industrial Technology (SE240044) and the Technology development Program (RS-2022-00141433) funded by the Ministry of SMEs and Startups (MSS, Korea).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets generated during the current study are available from the corresponding author upon request.

References

1. Ramesh, Pavlov, Goh, et al. Zero-shot text-to-image generation. In: The 38th international conference on machine learning (eds. Marina, Tong), 18-24 July 2021, pp. 8821–8831. PMLR.

2. Rombach, Blattmann, Lorenz, et al. High-resolution image synthesis with latent diffusion models. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), New Orleans, USA, 18–24 June 2022, pp. 10674–10685.

3. Bie, Yang, Zhou, et al. RenAIssance: A survey into AI text-to-image generation in the era of large model. IEEE Trans Pattern Anal Mach Intell 2025; 47: 2212–2231. https://doi.org/10.1109/TPAMI.2024.3522305

4. Alcaide-Marzal, Diego-Mas. Computers as co-creative assistants. A comparative study on the use of text-to-image AI models for computer aided conceptual design. Comput Ind 2025; 164: paper no. 104168. https://doi.org/10.1016/j.compind.2024.104168

5. Chen, Zhang, Han, et al. A foundation model enhanced approach for generative design in combinational creativity. J Eng Des 2024; 35: 1394–1420. https://doi.org/10.1080/09544828.2024.2356707

6. Liu. Application potential of stable diffusion in different stages of industrial design. In: Artificial intelligence in HCI: 4th International Conference (eds. Degen, Ntoa), Copenhagen, Denmark, 23–28 July 2023, pp. 590–609. Springer Nature Switzerland.

7. Zhu. Using stable diffusion with python: leverage python to control and automate high-quality AI image generation using stable diffusion. 1st ed. Packt Publishing, 2024, p. 10.

8. Lee J-K, Lee Y-C, et al. Generative artificial intelligence and building design: Early photorealistic render visualization of façades using local identity-trained models. J Comput Des Eng 2024; 11: 85–105. https://doi.org/10.1093/jcde/qwae017

9. Shi, Seo, Cha, et al. Generative AI-powered architectural exterior conceptual design based on the design intent. J Comput Des Eng 2024; 11: 125–142. https://doi.org/10.1093/jcde/qwae077

10. Oppenlaender. The creativity of text-to-image generation. In: The 25th International Academic Mindtrek Conference, Tampere, Finland, 16-18 November 2022, pp. 192–202. Association for Computing Machinery.

11. Song, Zeng, Liu, et al. Fashion customization: image generation based on editing clue. IEEE Trans Circuits Syst Video Technol 2024; 34: 4434–4444. https://doi.org/10.1109/TCSVT.2023.3338459

12. Liu, Chilton. Design guidelines for prompt engineering text-to-image generative models. In: The 2022 CHI Conference on Human Factors in Computing Systems, New Orleans, USA, 30 April-5 May 2022, pp. 1–23. Association for Computing Machinery.

13. Oppenlaender, Linder, Silvennoinen. Prompting AI art: an investigation into the creative skill of prompt engineering. Int J Hum-Comput Interact 2024; 41: 1–23. https://doi.org/10.1080/10447318.2024.2431761

14. Oppenlaender. A taxonomy of prompt modifiers for text-to-image generation. Behav Inf Technol 2024; 43: 3763–3776. https://doi.org/10.1080/0144929X.2023.2286532

15. Wong, et al. Multi2Human: Controllable human image generation with multimodal controls. Neurocomput 2024; 587: paper no. 127682. https://doi.org/10.1016/j.neucom.2024.127682

16. OpenAI, Adler, Agarwal, et al. GPT-4 technical report. 2024. Preprint, arXiv, paper no. 2303.08774. https://doi.org/10.48550/arXiv.2303.08774

17. Templeton, Conerly, Marcus, et al. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. https://transformer-circuits.pub/2024/scaling-monosemanticity/ (2024, accessed 22 July 2025).

18. Barambones, Moral, Antonio, et al. ChatGPT for learning HCI techniques: A case study on interviews for personas. IEEE Trans Learn Technol 2024; 17: 1460–1475. https://doi.org/10.1109/TLT.2024.3386095

19. Choi. PICLe: Eliciting diverse behaviors from large language models with persona in-context learning, 2024. Preprint, arXiv, paper no. 2405.02501. https://doi.org/10.48550/arXiv.2405.02501

20. Goodfellow, Pouget-Abadie, Mirza, et al. Generative adversarial nets. In: The 28th International Conference on Neural Information Processing Systems - Volume 2, Montreal, Canada, 8-13 December 2014, pp. 2672–2680. MIT Press.

21. Reed, Akata, Yan, et al. Generative adversarial text to image synthesis. In: The 33rd International Conference on Machine Learning (eds. Maria Florina, Kilian), New York, USA, 19-24 June 2016, pp. 1060–1069. PMLR.

22. Zhang, et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017, pp. 5908–5916. IEEE.

23. Zhang, Huang, et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018, pp. 1316–1324. IEEE.

24. Radford, Kim, Hallacy, et al. Learning transferable visual models from natural language supervision. In: The 38th International Conference on Machine Learning (eds. Marina, Tong), 18-24 July 2021, pp. 8748–8763. PMLR.

25. Nichol, Dhariwal, Ramesh, et al. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In: The 39th International Conference on Machine Learning (eds. Kamalika, Stefanie), Baltimore, USA, 17-23 July 2022, pp. 16784–16804. PMLR.

26. Saharia, Chan, Saxena, et al. Photorealistic text-to-image diffusion models with deep language understanding. In: The 36th International Conference on Neural Information Processing Systems. New Orleans, USA, 28 November-9 December 2022. Curran Associates Inc.

27. Feng, et al. Crafting user-centric prompts for UI generations based on Kansei engineering and knowledge graph. Advanced Engineering Informatics 2025; 65: paper no. 103217. https://doi.org/10.1016/j.aei.2025.103217

28. Cao, Wang, Liu, et al. BeautifulPrompt: Towards automatic prompt engineering for text-to-image synthesis. In: Conference on Empirical Methods in Natural Language Processing: Industry Track (eds. Wang, Zitouni), Singapore, 6-10 December 2023, pp. 1–11. Association for Computational Linguistics.

29. Robey, Murata, et al. Automated black-box prompt engineering for personalized text-to-image generation, 2025. Preprint, arXiv, paper no. 2403.19103. https://doi.org/10.48550/arXiv.2403.19103

30. Hwang, Majumder, Tandon. Aligning Language Models to User Opinions. In: Bouamor, Pino, Bali (eds). Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, 2023, pp. 5906–5919.

31. Chan, Wang, et al. Scaling synthetic data creation with 1,000,000,000 personas, 2025. Preprint, arXiv, paper no. 2406.20094. https://doi.org/10.48550/arXiv.2406.20094

32. Serapio-García, Safdari, Crepy, et al. Personality traits in large language models, 2025. Preprint, arXiv, paper no. 2307.00184. https://doi.org/10.48550/arXiv.2307.00184

33. Miotto, Rossberg, Kleinberg, et al. Who is GPT-3? An exploration of personality, values and demographics. In: Bamman, Hovy, Jurgens (eds). The Fifth Workshop on Natural Language Processing and Computational Social Science. Association for Computational Linguistics, 2022, pp. 218–227.

34. Santurkar, Durmus, Ladhak, et al. Whose opinions do language models reflect? In: The 40th International Conference on Machine Learning (eds. Andreas, Emma, Kyunghyun), Honolulu, USA, 23-29 July 2023, pp. 29971–30004. JMLR.

35. Guo, Zhu, et al. AI assisted fashion design: A review. IEEE Access 2023; 11: 88403–88415. https://doi.org/10.1109/ACCESS.2023.3306235

36. Zhu, Fidler, Urtasun, et al. Be your own prada: Fashion synthesis with structural coherence, 2017. Preprint, arXiv, paper no. 1710.07346. https://doi.org/10.48550/arXiv.1710.07346

37. Günel, Erdem. Language guided fashion image manipulation with feature-wise transformations, 2018. Preprint, arXiv, paper no. 1808.04000. https://doi.org/10.48550/arXiv.1808.04000

38. Zhang. Tell, imagine, and search: End-to-end learning for composing text and image to image retrieval. ACM Trans Multimedia Comput Commun Appl 2022; 18: 1–23. https://doi.org/10.1145/3478642

39. Chen. An intelligent generative method of fashion design combining attribute knowledge and stable diffusion model. Text Res J 2025; 95: 1231–1254. https://doi.org/10.1177/00405175241289578

40. Hessel, Holtzman, Forbes, et al. CLIPScore: A reference-free evaluation metric for image captioning. In: The 2021 Conference on Empirical Methods in Natural Language Processing (eds. Moens M-F, Huang, Specia), Punta Cana, Dominican Republic, 7-11 November 2021, pp. 7514–7528. Association for Computational Linguistics.

41. Liu, et al. ImageReward: learning and evaluating human preferences for text-to-image generation. In: The 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 10-16 December 2023, pp. 15903–15935.

, et al. ImageReward: learning and evaluating human preferences for text-to-image generation. In: The 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 10-16 December 2023, pp. 15903–15935.