IAM-Edit: Localized Image Editing via Instruction Attention Maps

Abstract

Diffusion models have demonstrated impressive performance in text-to-image generation and image editing. However, in instruction-based image editing, they often encounter two challenges: (1) inaccurate localization of the editing targets and (2) unintended modifications in nontarget regions. These issues stem from the global processing of diffusion models due to attention mechanisms. To address these limitations, we conduct a systematic analysis of attention maps under editing instructions and design localization instructions to obtain the desired attention. We propose Instruction Attention Maps (IAM)-Edit, a localized image editing framework that explicitly decouples an editing pipeline into two stages: region localization followed by region-aware editing. Specifically, to localize the editing region, a mask is generated by clustering patches of self-attention maps and combining them with the focal points of cross-attention maps under the editing instruction. To preserve nonediting regions, we apply an attention modulation method that adjusts cross-attention weights at each denoising step based on the generated mask, enabling the denoising process to focus on the editing region. Experiments show that IAM-Edit outperforms state-of-the-art methods both qualitatively and quantitatively.

Keywords

diffusion models, localized image editing, attention maps

1. Introduction

Text-to-image diffusion models (Balaji et al., 2022; Podell et al., 2023; Ramesh et al., 2022; Rombach et al., 2022; Saharia et al., 2022; Xue et al., 2023) have made remarkable progress in image generation. Trained on large-scale visual and textual data, these models can accurately model the relationship between text and image, thereby enabling the generation of high-quality images from textual prompts. Image editing extends the capabilities of generative models by enabling direct modifications of images, thereby improving their applicability to real-world scenarios such as personalized content creation, product design, medical imaging, and virtual try-on systems. However, image editing methods based on Stable Diffusion (Rombach et al., 2022), such as Prompt-to-Prompt (P2P) (Hertz et al., 2022), require both source and target prompts, which must be structurally aligned. In contrast, InstructPix2Pix (IP2P) is a pioneering model in instruction-guided image editing (Brooks et al., 2022), which performs image editing based solely on natural language instructions, thereby simplifying the editing pipeline. Moreover, cross-attention maps in IP2P exhibit strong implicit localization capabilities (Guo & Lin, 2024; Li et al., 2024). During denoising, the key nouns in the editing instruction align spatially with their corresponding image regions via a cross-attention mechanism. This alignment enables the model to effectively encode the semantics of the instructions and conduct region-specific edits. However, when dealing with complex images containing multiple objects, diffusion models’ global processing can lead to unwanted modifications. This influences not only the target region but also other objects that are semantically or visually connected to it, as shown in Figure 1.

Figure 1.

IAM-Edit: localized image editing via Instruction Attention Maps. IAM-Edit effectively addresses two challenges in instruction-based image editing: (1) localizing the editing region and (2) preserving the nonediting regions. The four examples illustrate common issues observed in InstructPix2Pix (IP2P) (Brooks et al., 2022). In (a) and (b), IP2P fails to identify the editing region, resulting in incorrect edits, while (c) and (d) reveal that unnecessary modifications occur in nonediting regions, even when the editing targets are correctly identified.

Recent studies have sought to enhance IP2P’s capabilities in local editing. Methods such as ZONE (Li et al., 2024) and WYS (Mirzaei et al., 2024) locate editing regions by capturing feature differences during the denoising process. However, these feature differences can be unreliable in complex scenes, leading to two characteristic limitations, as shown in Figure 2. When the feature differences are too small, the estimated region may become overly constrained, resulting in only slight changes between the original and edited images (Figure 2, WYS column). Conversely, IP2P often produces overexpanded edits due to the lack of an explicit editing region, causing unwanted modifications in nonediting regions (Figure 2, IP2P column). Importantly, even with an explicit mask (e.g., ZONE derives a mask from the difference between early and late cross-attention maps), the region-restricted editing and foreground–background composition may still fail to induce a faithful change within the target region, leading to unsuccessful results (Figure 2, ZONE column). These observations suggest that the key difficulty lies not only in obtaining a mask from denoising differences, but also in how the editing instruction guides the model during denoising—that is, how attention maps respond to the instruction to support instruction-aligned localization. Therefore, a research gap remains: existing methods lack an explicit view of how editing instructions shape attention maps during denoising, which makes it difficult to obtain an instruction-aligned editing region.

Figure 2.

Two typical limitations in localized instruction-based editing: overly constrained regions (WYS) and overexpanded edits affecting nonediting regions (IP2P/ZONE).

Motivated by this gap, we conduct a systematic investigation of the interaction between editing instructions and the attention mechanism throughout the denoising process. We further introduce Instruction Attention Maps (IAM), which refer to all attention maps generated under editing instructions during denoising. We design localization instructions to generate specific attention maps, enabling the model to support three types of edits: change, add, and remove. In self-attention maps, pixels with similar visual features tend to aggregate, forming patches that effectively localize all instruction-referenced objects. Meanwhile, the cross-attention maps of key nouns assign high attention weights to focused points within the editing region. To compensate for the global processing of diffusion models and prevent overediting, all tokens’ cross-attention maps are constrained within the editing region based on the identified patches and focal points.

Based on the above findings, we propose IAM-Edit, a localized image editing method based on instruction attention maps, which is designed to preserve the integrity of nonediting regions while adhering to instruction semantics. In particular, IAM-Edit performs both localization and edit application by first identifying the editing region and then restricting edits to that region. Our contributions are summarized as follows:

We introduce IAM, referring to the self-attention and cross-attention maps generated under editing instructions during the denoising process, and propose IAM-Edit, a training-free localized image editing method.

We design an efficient mask generation method, which converts different types of instructions into localization instructions and leverages both self-attention and cross-attention maps to locate the editing region.

We propose an attention modulation method that focuses token-wise cross-attention on the editing region, reducing unwanted changes in nonediting regions while preserving instruction-attention consistency.

Based on extensive experiments and user studies, IAM-Edit achieves superior performance in both localization accuracy and content preservation compared to state-of-the-art methods.

2. Related Work

2.1 Text-Guided Image Editing

Text-guided image editing methods can be classified into three categories: training-free methods (Brack et al., 2024; Cao et al., 2023; Hertz et al., 2022; Mokady et al., 2023; Patashnik et al., 2023; Tang et al., 2024; Tumanyan et al., 2023; Wang et al., 2023), fine-tuning methods (Feng et al., 2024; Gal, Alaluf, et al., 2022; Ruiz et al., 2022; Wei et al., 2023), and training-based methods (Brooks et al., 2022; Fu et al., 2024; Huang et al., 2024; Ren et al., 2024; Sheynin et al., 2024; Zhang et al., 2023). Training-free methods include P2P (Hertz et al., 2022), which performs image editing by manipulating cross-attention layers. Null-text Inversion (Mokady et al., 2023) reconstructs images by optimizing null-text embeddings, thus enabling the application of P2P to real image editing. PnP (Tumanyan et al., 2023) performs image-to-image translation by manipulating spatial features and their self-attention within the diffusion model (Rombach et al., 2022). Pix2pix-zero (Parmar et al., 2023) discovers editing directions in the text embedding space while preserving image structure through cross-attention maps. MasaCtrl (Cao et al., 2023) improves local consistency by converting self-attention into mutual self-attention. Fine-tuning methods such as Textual Inversion (Gal, Alaluf, et al., 2022) and DreamBooth (Ruiz et al., 2022) fine-tune pretrained diffusion models to associate a unique identifier with a specific subject. Training-based methods, including IP2P (Brooks et al., 2022), utilize GPT-3 (Brown et al., 2020) and Stable Diffusion (Rombach et al., 2022) to synthesize training data, thereby constructing conditional diffusion models for instruction-driven image editing. Emu Edit (Sheynin et al., 2024) enhances editing precision through task embeddings and multitask joint training. SmartEdit (Huang et al., 2024) and MGIE (Fu et al., 2024) incorporate multimodal large language models (MLLMs) to address complex image editing tasks.
Compared with training-based or fine-tuning methods, training-free methods do not require updating model parameters, thereby significantly reducing deployment costs.

2.2 Localized Image Editing

Early local editing methods relied on user-provided masks (Avrahami et al., 2022; Lugmayr et al., 2022; Nichol et al., 2022; Yang et al., 2023) or the use of pretrained models (Kirillov et al., 2023) for mask prediction (Yu et al., 2023), which increases operational complexity and time costs. Current research focuses on the automatic identification and manipulation of target regions. DiffEdit (Couairon et al., 2023) and WYS (Mirzaei et al., 2024) generate masks by computing the difference between noise maps with and without text guidance. Cross-attention maps have recently shown significant value in local editing tasks. DPL (Wang et al., 2023) addresses the “attention leakage” problem by optimizing the word embeddings of nouns in the source prompt, ensuring that cross-attention maps remain focused on the target regions. ZONE (Li et al., 2024) captures the differences in cross-attention maps at the beginning and end stages of the denoising process. LEDITS++ (Brack et al., 2024) combines cross-attention maps with noise guidance vectors to generate masks. Other studies integrate cross-attention maps with intermediate features to identify editing regions. LIME utilizes semantic information from intermediate U-Net (Ronneberger et al., 2015) features and combines it with cross-attention scores to determine the editing area. LPM (Patashnik et al., 2023) clusters the self-attention map at specific timesteps and combines them with cross-attention maps to label each patch, identifying the final editing area. Compared to LPM (Patashnik et al., 2023), which clusters self-attention maps during the denoising process guided by the image caption, directly clustering self-attention maps guided by editing instructions can more effectively capture semantic information. In our method, we leverage IAM to perform localized image editing.

3. Instruction Attention Maps

3.1. InstructPix2Pix

For an image $x$ , the diffusion process adds noise to the encoded latent $z = E (x)$ , generating a noisy latent $z_{t}$ whose noise level increases with each time step $t$ , where $E$ is an image encoder. IP2P (Brooks et al., 2022) trains a U-Net-based network, denoted as $ϵ_{θ}$ , which predicts the noise added to $z_{t}$ conditioned on the input image $c_{I}$ and the textual instruction $c_{T}$ . The model minimizes the following latent diffusion objective:

\begin{aligned} L = \mathbb{E}_{E (x), E (c_{I}), c_{T}, ϵ \sim N (0, 1), t} \left[ {‖ ϵ - ϵ_{θ} (z_{t}, t, E (c_{I}), c_{T}) ‖}_{2}^{2} \right] . \end{aligned}

(1)

During inference, guidance scales $s_{I}$ and $s_{T}$ are introduced to balance the influence of image conditions and textual instructions. The modified score estimate in IP2P is as follows:

\begin{aligned} {\tilde{ϵ}}_{θ} (z_{t}, t, I, T) = {} & ϵ_{θ} (z_{t}, t, \emptyset_{I}, \emptyset_{T}) \\ & + s_{I} \cdot \left( ϵ_{θ} (z_{t}, t, I, \emptyset_{T}) - ϵ_{θ} (z_{t}, t, \emptyset_{I}, \emptyset_{T}) \right) \\ & + s_{T} \cdot \left( ϵ_{θ} (z_{t}, t, I, T) - ϵ_{θ} (z_{t}, t, I, \emptyset_{T}) \right) . \end{aligned}

(2)

As shown in equation (2), the final predicted noise ${\tilde{ϵ}}_{θ}$ is derived by incorporating multiple conditions, including both the image condition $c_{I}$ and the text condition $c_{T}$. See Supplemental material B for more details on $s_{I}$ and $s_{T}$.
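The combination in equation (2) can be sketched in a few lines. This is a minimal illustration, not the IP2P implementation: the function name `guided_noise`, the argument names, and the scale values are assumptions, and random arrays stand in for the three noise predictions of $ϵ_{θ}$.

```python
import numpy as np

def guided_noise(eps_uncond, eps_img, eps_full, s_I, s_T):
    """Combine three noise predictions following equation (2).

    eps_uncond: prediction with neither image nor text condition
    eps_img:    prediction with the image condition only
    eps_full:   prediction with both image and text conditions
    """
    return (eps_uncond
            + s_I * (eps_img - eps_uncond)   # image guidance term
            + s_T * (eps_full - eps_img))    # text guidance term

# Random stand-ins for the three network outputs (hypothetical latent shape).
rng = np.random.default_rng(0)
e0, e1, e2 = (rng.normal(size=(4, 8, 8)) for _ in range(3))
out = guided_noise(e0, e1, e2, s_I=1.5, s_T=7.5)  # illustrative scale values
```

With $s_{I} = s_{T} = 1$ the expression collapses to the fully conditioned prediction, and with both scales at 0 it returns the unconditional one, which is a quick sanity check on the signs of the terms.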

3.2. Attention Mechanism

In IP2P (Brooks et al., 2022) and related models, each layer of the U-Net (Ronneberger et al., 2015) incorporates an attention mechanism. Editing instructions are encoded into text embeddings and integrated with image features through cross-attention mechanisms:

Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d}}) V .

(3)

The attention map is as follows:

A = softmax (\frac{Q K^{T}}{\sqrt{d}}),

(4)

where $Q$, $K$, and $V$ are projections of the input features, and $d$ is the dimension of $K$.

Self-attention arises when $Q$ , $K$ , and $V$ are all taken from image features, capturing long-range dependencies. Cross-attention occurs when $Q$ is from image features and $K / V$ are from text features, allowing textual information to guide image generation.
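The difference between the two attention types in equation (4) comes down to where $K$ is taken from. The sketch below assumes a single head and uses random arrays in place of the learned projections; the names `attention_map`, `img_q`, `img_k`, and `txt_k`, and the feature width of 64, are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(Q, K):
    """A = softmax(Q K^T / sqrt(d)), with d the feature dimension of K."""
    d = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d), axis=-1)

rng = np.random.default_rng(0)
n_pix, n_tok, d = 16 * 16, 77, 64        # 16x16 positions, 77 text tokens
img_q = rng.normal(size=(n_pix, d))      # queries from image features
img_k = rng.normal(size=(n_pix, d))      # keys from image features
txt_k = rng.normal(size=(n_tok, d))      # keys from text features

A_self = attention_map(img_q, img_k)     # image-to-image: (256, 256)
A_cross = attention_map(img_q, txt_k)    # image-to-text:  (256, 77)
```

Each row of either map is a probability distribution: over spatial positions for self-attention, and over text tokens for cross-attention.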

To better capture instruction semantics, we define two types of instruction-aware attention maps derived from the above attention mechanism. The Instruction Self-Attention Map (ISAM) aggregates self-attention tensors across layers, heads, and timesteps:

\begin{aligned} ISAM & = \frac{1}{| L_{N} | H T} \sum_{ℓ \in L_{N}} \sum_{h = 1}^{H} \sum_{t = 1}^{T} A_{self}^{(ℓ, h, t)}, \end{aligned}

(5)

\begin{aligned} ISAM \in R^{B \times N \times N \times D_{self}} . \end{aligned}

Here, $A_{self}^{(ℓ, h, t)} \in R^{B \times N \times N \times D_{self}}$ denotes the self-attention tensor at layer $ℓ$, head $h$, and timestep $t$; $L_{N}$ contains the attention layers whose resolution is $N \times N$; $H$ is the number of attention heads; $T$ is the number of diffusion timesteps (here $T = 100$); and $D_{self}$ denotes the feature dimension of the self-attention representation at each spatial location (i.e., the size of the last tensor dimension).

Similarly, the Instruction Cross-Attention Map (ICAM) is defined as:

\begin{aligned} ICAM & = \frac{1}{| L_{N} | H T} \sum_{ℓ \in L_{N}} \sum_{h = 1}^{H} \sum_{t = 1}^{T} A_{cross}^{(ℓ, h, t)}, \end{aligned}

(6)

\begin{aligned} ICAM \in R^{B \times N \times N \times M} . \end{aligned}

Here, $A_{cross}^{(ℓ, h, t)} \in R^{B \times N \times N \times M}$ denotes the cross-attention tensor, and $M$ is the number of text tokens ($M = 77$ for CLIP; Radford et al., 2021). Since CLIP uses a Byte-Pair Encoding (BPE) tokenizer, each instruction word may map to one or multiple subword tokens. Thus, the cross-attention tensor $A_{cross}^{(ℓ, h, t)}$ provides a separate attention map for each BPE token, resulting in $M$ token-wise maps. The ICAM in equation (6) is obtained by averaging these token-level maps across selected layers, heads, and timesteps.

Finally, the IAM are defined as:

IAM = (ISAM, ICAM) .

(7)

These maps provide instruction-aware spatial features that serve as the foundation of our localization and editing mechanisms.
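Equations (5) and (6) are plain averages over layers, heads, and timesteps. A minimal sketch follows, assuming the per-step tensors have already been collected and stacked along three leading axes; the stacking layout, the function name `aggregate_maps`, and the small sizes are assumptions made for illustration.

```python
import numpy as np

def aggregate_maps(maps):
    """Average attention tensors stacked as (layers, heads, timesteps, ...).

    Taking the mean over the first three axes equals the normalized
    triple sum 1 / (|L_N| * H * T) in equations (5) and (6).
    """
    return np.asarray(maps).mean(axis=(0, 1, 2))

rng = np.random.default_rng(0)
n_layers, H, T, B, N, M = 2, 4, 5, 1, 16, 77   # toy sizes, not the paper's
A_cross = rng.random(size=(n_layers, H, T, B, N, N, M))
ICAM = aggregate_maps(A_cross)                 # shape (B, N, N, M)
```

The same function applies to stacked self-attention tensors, yielding ISAM with the last dimension $D_{self}$ instead of $M$.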

3.3. Instruction Self-Attention Maps

The text condition $c_{T}$ directly influences the estimation of the noise ${\tilde{ϵ}}_{θ}$ in the denoising network. At each denoising step, $z_{t}$ integrates textual information via the cross-attention mechanism. Since the $Q$ , $K$ , and $V$ in the self-attention mechanism are all derived from $z_{t}$ , the self-attention maps inherently reflect instruction semantics.

IP2P employs the editing instruction as the textual condition for denoising, whereas LPM (Patashnik et al., 2023) is conditioned on the image caption. As a result, self-attention maps in IP2P evolve in accordance with the instruction semantics as the diffusion process progresses. We perform principal component analysis (PCA) dimensionality reduction to analyze self-attention maps across different timesteps. Specifically, for a self-attention representation at spatial resolution $N \times N$ with feature dimension $D$ , we reshape it into a two-dimensional (2D) matrix of size $(N \times N) \times D$ , where each row corresponds to one spatial location. PCA is then applied to project each $D$ -dimensional row vector into a low-dimensional space (three components in our implementation). The projected results are reshaped back to $N \times N$ for visualization at different timesteps. This PCA reduction is solely used for visualization and is not involved in generating the editing-region mask.
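The PCA visualization step described above can be sketched with an SVD of the centered feature matrix. This is one possible realization under stated assumptions: the function name `pca_project` is hypothetical, and the per-component min-max rescaling to $[0, 1]$ (so the three components can be shown as RGB) is a display choice not specified in the text.

```python
import numpy as np

def pca_project(attn, n_components=3):
    """Project an (N, N, D) self-attention representation to (N, N, n_components)."""
    N = attn.shape[0]
    X = attn.reshape(N * N, -1)               # one row per spatial location
    X = X - X.mean(axis=0, keepdims=True)     # center features before PCA
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ Vt[:n_components].T            # (N*N, n_components)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    proj = (proj - lo) / (hi - lo + 1e-12)    # assumed rescaling for display
    return proj.reshape(N, N, n_components)

rng = np.random.default_rng(0)
rgb = pca_project(rng.normal(size=(16, 16, 64)))  # random stand-in features
```

Applying this at several timesteps and comparing the resulting images is what reveals objects emerging or fading as denoising proceeds.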

To summarize self-attention information over time, we average self-attention maps from selected timesteps to obtain ISAM. ISAM is reshaped into per-location vectors and clustered by $K$ -means to form coherent patches, where the patch corresponding to the editing target is used as the editing-region mask.

In Figure 3(a), birds that are not present in the original image gradually emerge in the self-attention maps as the timestep increases (e.g., additional highlighted regions appear at $t = 60$ and $t = 100$ ). In Figure 3(b), under the “remove” instruction, the ball in the original image gradually disappears from the self-attention map.

Figure 3.

Localized image editing via instruction attention maps.

3.4. Instruction Cross-Attention Maps

We categorize editing instructions into three types: “remove,” “change,” and “add.” The “change” instructions are further divided into the replacement of semantically related objects and the replacement of semantically unrelated objects. We then analyze two key properties of ICAM—globality and focus—to understand the impact of different instruction types.

Globality

$A_{cross}^{k}$ denotes the cross-attention map of token $k$, arranged column-wise. Each row in the map encodes the attention weights between the image feature at a given position and all tokens. $A_{cross}^{k} [i, j]$ represents the semantic alignment between the image feature $Q [i]$ and token $j$. For a fixed position $i$, the weighted distribution $\{A_{cross}^{k} [i, j]\}_{j = 1}^{n_{k}}$ indicates the semantic relevance across $n_{k}$ tokens. Thus, updating the image feature $Q [i]$ does not depend solely on a single token, but rather on the collective influence of all tokens, with the weight determined by $A_{cross}^{k} [i, j]$.

The globality of ICAM is analyzed from both semantic and structural perspectives.

Semantically, each token in the instruction interacts with the other tokens, resulting in cross-attention maps that are inherently interconnected.

Structurally, the tokens $⟨ | startoftext | ⟩$ and $⟨ | endoftext | ⟩$ serve as markers that encode structural information of the instruction. Alterations in the cross-attention maps of these two structural tokens can disrupt the integrity of the image, as shown in Figure 4. By maintaining the cross-attention maps of $⟨ | startoftext | ⟩$ and $⟨ | endoftext | ⟩$ , the structure of the image is preserved.

Figure 4.

The top row presents the editing instruction. (a) Original image. (b) Editing result after altering the cross-attention map of $⟨ | startoftext | ⟩$ . (c) Editing result after altering cross-attention maps of both $⟨ | startoftext | ⟩$ and $⟨ | endoftext | ⟩$ . (d) Original result edited by IP2P.

However, globality can lead to overexpanded edits, with modifications extending beyond the editing region. We therefore restrict the cross-attention weights of all tokens to the editing region, so that edits remain confined to it.

Focus

We argue that instruction semantics directly influence the focal points of ICAM. This effect on the attention distribution of target tokens can be summarized as follows.

Remove instructions: When an object is removed, the cross-attention maps suppress attention in the corresponding region. The attention weight of the token of this object significantly decreases, exhibiting a defocusing effect. As shown in Figure 5(a), the attention weight of the deleted object (“bee”) decays, and the model dynamically reconfigures its attention distribution. Consequently, the corresponding region is no longer attended to in the cross-attention map.

Add instructions: When adding an object, the focal points of the cross-attention map for this token concentrate on the area where the target object is about to appear. As shown in Figure 5(d), the added object (“sunglasses”) exhibits a strong focus in the cross-attention map. This indicates that the attention mechanism effectively guides the model to the target region where the new object is to be generated.

Change instructions: In object replacement tasks, the focal points of the cross-attention map depend on the semantic similarity between the target and original objects—including their visual features such as shape, color, or texture. When the two objects are semantically related (e.g., “replacing a dog with a cat”), the model tends to balance attention across comparable candidates. As a result, the cross-attention map fails to focus on the editing target. For example, in Figure 5(b), the cross-attention map does not properly focus on the “bee.” In Figure 5(e), attention is misdirected to the headscarf instead of the clothes. In contrast, when replacing semantically unrelated objects (e.g., “replacing an apple with a lightbulb”), the model can allocate attention without ambiguity, allowing the cross-attention map to focus on specific areas. In Figure 5(c), the cross-attention map correctly focuses on the “bee,” and in Figure 5(f), it attends to the “clothes,” aligning well with the instruction.

Figure 5.

The visualization of the cross-attention maps of the target tokens under different semantic instructions. The original image is on the far left, with the cross-attention maps of specific tokens under different instructions on the right. (a) Corresponds to a remove instruction: “Remove the bee.” (d) Corresponds to an add instruction: “Give her sunglasses.” (b) and (e) Represent replacing objects with semantically similar objects, with (b) having the instruction “Change the bee to a bird,” and (e) having the instruction “Change her clothes to a dress.” (c) and (f) Represent replacing objects with semantically unrelated objects, with (c) having the instruction “Change the bee to granada,” and (f) having the instruction “Change her clothes to a car.”

Therefore, we convert different types of editing instructions into localization instructions by formulating them as replacements of the object with a semantically unrelated one.

4. Method

Framework of IAM-Edit

Given the initial editing instruction $T$ , the source image $x$ , and the editing object $o$ , we propose IAM-Edit, which leverages IP2P (Brooks et al., 2022) to implement localized image editing via IAM. As shown in Figure 6, IAM-Edit consists of three modules: localization instruction (Figure 6(a)), mask generation (Figure 6(b)), and attention modulation (Figure 6(c)). This method is implemented through two forward passes: the mask generation phase $P_{1}$ and the image generation phase $P_{2}$ .

Figure 6.

Framework of IAM-Edit. IAM-Edit is designed for instruction-based localized image editing. IAM = instruction attention maps.

We convert $T$ into a localization instruction $T^{'}$ (Section 4.1), where $o$ in $T^{'}$ is indexed as $u_{o}$ . The pair $(T^{'}, x)$ is then fed into $P_{1}$ . After denoising, we generate the mask $M_{o}$ from the cross-attention map of $o$ conditioned on the localization instruction $T^{'}$ and from patches obtained by clustering the self-attention map (Section 4.2). The triplet $(T, x, M_{o})$ is then fed into $P_{2}$ .

At each timestep, we modulate the cross-attention maps of all tokens in $T$ using $M_{o}$ . Finally, the edited image $x^{'}$ is obtained, achieving localized image editing (Section 4.3). Our full procedure is summarized in Algorithm 1.
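The data flow between the two passes can be summarized as follows. This is only a schematic: the four callables (`localize`, `run_p1`, `make_mask`, `run_p2`) are hypothetical placeholders for the modules described in the text, not an existing API, and the stub lambdas in the usage example return dummy values.

```python
import numpy as np

def iam_edit(T, x, o, localize, run_p1, make_mask, run_p2):
    """Two-pass flow: P1 produces the mask, P2 edits with it."""
    T_loc, u_o = localize(T, o)            # localization instruction T' and index of o in T'
    isam, icam = run_p1(T_loc, x)          # attention maps collected during P1
    M_o = make_mask(isam, icam[..., u_o])  # self-attention patches + cross-attention of o
    return run_p2(T, x, M_o)               # P2 is driven by the original instruction T

# Usage with stubs, just to show which inputs reach which stage.
result = iam_edit(
    "Remove the bee", "image", "bee",
    localize=lambda T, o: (f"change {o} to lightbulb", 2),
    run_p1=lambda T_loc, x: (np.zeros((4, 4, 8)), np.ones((4, 4, 5))),
    make_mask=lambda isam, a_obj: a_obj > 0.5,
    run_p2=lambda T, x, M: (T, x, int(M.sum())),
)
```

Note that the localization instruction is used only in the first pass; the second pass receives the original instruction together with the mask.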

4.1. Localization Instruction

Two types of instructions can be used as localization instructions. For change instructions that replace the object with a semantically unrelated one, the cross-attention map $A_{cross}^{u_{o}}$ focuses on the area where object $o$ is located. Similarly, for add instructions, the cross-attention map of the added object inherently possesses strong localization capability.

Given $T$ and $o$ , we convert $T$ into a localization instruction $T^{'}$ :

T^{'} = \begin{cases} \text{change } ⟨ o ⟩ \text{ to } ⟨ l ⟩, & T \in \{\text{remove}, \text{change}\}, \\ T, & T \in \{\text{add}\}, \end{cases}

(8)

where $l$ is a word with low semantic similarity to $o$. An action classifier (Li et al., 2024) is used to recognize the type of $T$, and its accuracy is evaluated in Section E of the Supplemental materials.
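Equation (8) amounts to a small case split on the instruction type. The sketch below assumes the type has already been predicted by the action classifier and is passed in as a string; the function name and argument names are illustrative.

```python
def localization_instruction(instruction, action, obj, unrelated):
    """Build T' from T following equation (8).

    action:    instruction type, assumed to come from an action classifier
    obj:       the editing object o
    unrelated: a word l with low semantic similarity to o
    """
    if action in ("remove", "change"):
        return f"change {obj} to {unrelated}"
    if action == "add":
        return instruction                 # add instructions are used as-is
    raise ValueError(f"unknown action type: {action!r}")

t_remove = localization_instruction("Remove the bee", "remove", "the bee", "a lightbulb")
t_add = localization_instruction("Give her sunglasses", "add", "sunglasses", "a lightbulb")
```

Here `t_remove` becomes "change the bee to a lightbulb", while `t_add` is returned unchanged.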

To define the selection rule for the unrelated replacement word $l$ , we choose $l$ from a vocabulary $V$ by minimizing its semantic similarity to the target object $o$ in the CLIP text-embedding space (Radford et al., 2021). We apply the following three-step procedure:

Candidate filtering. Construct a candidate pool $V_{o} \subset V$ by removing words identical to $o$ , containing $o$ as a substring, or being simple morphological variants of $o$ (e.g., plural forms), and discarding nonobject tokens.

Embedding and similarity. For each $n \in V_{o}$ , encode CLIP-style phrases “a photo of $⟨ o ⟩$ ” and “a photo of $⟨ n ⟩$ ” with the CLIP text encoder (Radford et al., 2021) to obtain $v_{o}$ and $v_{n}$ , and compute cosine similarity $\cos (v_{n}, v_{o})$ .

Unrelated replacement. Select the least similar candidate:

l = \arg\min_{n \in V_{o}} \cos (v_{n}, v_{o}) .

(9)

This yields an “unrelated” $l$ in the CLIP text-embedding space (Radford et al., 2021), encouraging the localization instruction change $⟨ o ⟩$ to $⟨ l ⟩$ to keep cross-attention focused on the region of $o$ while reducing semantic overlap with $o$ .
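The three-step selection of $l$ can be sketched as below. Several simplifications are assumed: `embed` is a hypothetical callable standing in for the CLIP text encoder, the filter handles only substring matches and a trailing "s" plural, and the removal of nonobject tokens is omitted. The toy three-dimensional embeddings are made up for the example.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_unrelated(obj, vocab, embed):
    """Pick the candidate least similar to obj, as in equation (9)."""
    o = obj.lower()
    # Step 1: drop identical words, words containing o, and the singular of o.
    candidates = [w for w in vocab
                  if o not in w.lower() and w.lower() + "s" != o]
    # Step 2: embed the prompt phrases and compute cosine similarities.
    v_o = embed(f"a photo of {obj}")
    sims = [cosine(embed(f"a photo of {w}"), v_o) for w in candidates]
    # Step 3: return the least similar candidate.
    return candidates[int(np.argmin(sims))]

# Toy embeddings (hypothetical values, not CLIP outputs).
toy = {
    "a photo of bee": np.array([1.0, 0.0, 0.0]),
    "a photo of wasp": np.array([0.9, 0.1, 0.0]),
    "a photo of bird": np.array([0.6, 0.8, 0.0]),
    "a photo of lightbulb": np.array([0.0, 0.1, 1.0]),
}
l = select_unrelated("bee", ["bee", "bees", "wasp", "bird", "lightbulb"], toy.__getitem__)
```

In this toy vocabulary, "bee" and "bees" are filtered out in step 1, and "lightbulb" is selected because its embedding is orthogonal to that of "bee".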

We obtain the cross-attention map of $o$ focused on the target region under the localization instruction, as shown in the cross-attention map column of Figure 7.

Figure 7.

Mask generation. We perform clustering on the self-attention map and combine it with the cross-attention map of the editing object to determine the editing region. From left to right: editing object, input image, clustering result of the self-attention map, cross-attention map of the editing object, and the mask of the editing object.

4.2. Mask Generation

$ISAM \in R^{B \times N \times N \times D}$ is aggregated from all timesteps during $P_{1}$, guided by $T^{'}$ and $x$. Here, $N \times N$ is the spatial resolution of the attention map (e.g., $32 \times 32 = 1{,}024$ spatial positions), and $D$ is the dimension of the attention features ($1{,}024$). These maps are reshaped into a 2D matrix $A_{self}^{flat}$ of shape $(N \times N, D) = (1{,}024, 1{,}024)$.

We then apply $K$ -means clustering to semantically group $A_{self}^{flat}$ , obtaining $K$ patches. Each patch defines a binary mask $M_{k}$ :

\{M_{k}\}_{k = 1}^{K} = \text{K-means} (A_{self}^{flat}, K) .

(10)

According to the index $u_{o}$ of object $o$ in $T^{'}$, its cross-attention map is obtained as:

A_{cross}^{u_{o}} = A_{cross} [:, :, u_{o}] .

(11)

We then perform element-wise multiplication of $A_{cross}^{u_{o}}$ with $M_{k}$ and obtain the region $s_{k}$ with the highest cross-attention response, whose index is denoted as $k_{0}$:

k_{0} = \arg\max_{k} (\max (A_{cross}^{u_{o}} ⊙ M_{k})) .

(12)

Finally, the mask of $k_{0}$ is selected as the target region of $o$:

M_{o} = M_{k_{0}} .

(13)

Figure 7 illustrates the effectiveness of our method in generating the mask for the editing region of $o$ .
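Equations (10) to (13) can be sketched end to end on synthetic data. The code below is an illustration under assumptions, not the authors' implementation: it uses a small hand-written Lloyd's $K$-means with farthest-point initialization (chosen here only to keep the example deterministic), a single image without the batch dimension, and a synthetic self-attention representation in which one square region has distinct features.

```python
import numpy as np

def kmeans(X, K, n_iter=20):
    """Minimal Lloyd's K-means with farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, K):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(d2.argmax())])      # farthest point so far
    centers = np.stack(centers).astype(float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)               # assign to nearest center
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def generate_mask(isam, a_cross_obj, K):
    """isam: (N, N, D); a_cross_obj: (N, N). Returns a binary (N, N) mask."""
    N = isam.shape[0]
    labels = kmeans(isam.reshape(N * N, -1), K).reshape(N, N)   # eq. (10)
    masks = [labels == k for k in range(K)]
    scores = [(a_cross_obj * m).max() for m in masks]           # eq. (12)
    return masks[int(np.argmax(scores))].astype(np.uint8)       # eq. (13)

rng = np.random.default_rng(0)
N, D = 16, 8
isam = 0.01 * rng.normal(size=(N, N, D))
isam[4:10, 4:10] += 5.0                      # synthetic "object" patch
a_cross = np.zeros((N, N))
a_cross[6, 7] = 1.0                          # focal point inside the patch
M_o = generate_mask(isam, a_cross, K=2)
```

In this toy case the focal point of the cross-attention map falls inside the 6 x 6 patch, so the selected cluster covers exactly those 36 positions. A library clustering routine could replace the hand-written `kmeans`.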

4.3. Attention Modulation

To enable localized image editing, we modulate the cross-attention maps so that the model modifies only the editing region while keeping other regions unchanged.

At each timestep of $P_{2}$, we keep the cross-attention maps of $⟨ | startoftext | ⟩$ and $⟨ | endoftext | ⟩$ unchanged to maintain the image structure. For all tokens in $T$, we set the pre-softmax cross-attention scores outside $M_{o}$ to negative infinity ($- \infty$), and a new cross-attention map is computed accordingly:

MaskAttn (A_{cross}, M_{o}) = \begin{cases} softmax \left( \frac{Q K^{⊤}}{\sqrt{d}} + M_{o}^{inf} \right), & \text{if } u \in U, \\ softmax \left( \frac{Q K^{⊤}}{\sqrt{d}} \right), & \text{otherwise}, \end{cases}

(14)

where the first row applies if $u \in U$, and the second applies otherwise.

$ICAM \in R^{B \times N \times N \times M}$, where $B$ is the batch size, $N$ denotes the spatial resolution of the attention maps (e.g., $N = 16$ for $16 \times 16$ maps), and $M$ is the maximum number of text tokens ($M = 77$). Accordingly, the token-wise cross-attention map is indexed as $A_{cross}^{u} \in R^{B \times N \times N}$ for each token $u \in \{1, \dots, M\}$.

The index set of the tokens in $T$ is $U = \{u_{1}, u_{2}, \dots, u_{k}\}$. We reshape the editing mask to $M_{o} \in R^{B \times N \times N \times 1}$ and broadcast it to match $A_{cross}^{u}$, so that the modulation in equation (14) is applied element-wise on each token-wise spatial attention map.

We define the attention modulation mask as:

M_{o}^{inf} = \begin{cases} 0, & \text{if } M_{o} = 1, \\ - \infty, & \text{if } M_{o} = 0 . \end{cases}

(15)

After applying softmax, the results outside $M_{o}$ approach zero, meaning that all tokens in the instruction have almost no contribution to the area outside $M_{o}$.

Then we perform $P_{2}$ by taking $T$ , $x$ , and $M_{o}$ as inputs. For the cross-attention maps, the new attention calculation $MaskAttn$ replaces the original attention computation at all timesteps.

As shown in Figure 8, after modulation, the new cross-attention map of each token concentrates on the editing region, ensuring that the editing occurs only within $M_{o}$ and preserving the integrity of the nonediting region.

Figure 8.

Cross-attention maps before and after the application of attention modulation. The cross-attention scores of all tokens in the instruction are restricted to the editing region; the modulated cross-attention maps are shown in the second row.
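A minimal sketch of the modulation in equations (14) and (15) is given below. It reflects one reading of the piecewise definition: the additive mask is applied only to the score columns of the tokens in $U$, and a single softmax is then taken over all tokens, so the untouched structural tokens absorb the weight outside $M_{o}$. The function name `mask_attn`, the single-head layout, and the toy sizes are assumptions.

```python
import numpy as np

def mask_attn(scores, mask, token_idx):
    """scores: (N*N, M) pre-softmax values Q K^T / sqrt(d);
    mask: (N, N) binary M_o; token_idx: indices U of the instruction tokens."""
    m_inf = np.where(mask.reshape(-1, 1) == 1, 0.0, -np.inf)  # eq. (15), shape (N*N, 1)
    out = scores.astype(float).copy()
    out[:, token_idx] += m_inf              # broadcast over the tokens in U
    out -= out.max(axis=1, keepdims=True)   # structural tokens keep each row finite
    e = np.exp(out)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, M = 4, 6
scores = rng.normal(size=(N * N, M))
mask = np.zeros((N, N), dtype=int)
mask[1:3, 1:3] = 1                          # toy editing region
U = [1, 2, 3, 4]                            # tokens 0 and 5 play the structural roles
A_new = mask_attn(scores, mask, U)
```

Outside the mask, the columns in `U` receive exactly zero weight, while inside the mask the result equals the ordinary softmax. If `U` covered every token, rows outside the mask would have no finite score left, which is one practical reason to leave the structural tokens unmodified.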

5. Experiments

5.1. Experimental Settings

Baseline

We compare IAM-Edit against five image editing methods: LEDITS++ (Brack et al., 2024), IP2P (Brooks et al., 2022), MagicBrush (Zhang et al., 2023), WYS (Mirzaei et al., 2024), and ZONE (Li et al., 2024). LEDITS++ is an inversion-based image editing method. IP2P is the base model for instruction-based image editing methods. MagicBrush fine-tunes IP2P using a large-scale, manually annotated dataset. WYS determines the editing mask by leveraging the absolute difference between the image-and-instruction-conditioned noise and the image-only noise in IP2P.

ZONE determines the editing mask from the difference between the first and last cross-attention maps, and further incorporates SAM (Kirillov et al., 2023) to refine the mask. The resulting mask is then dilated, and its boundaries are refined using fast Fourier transform. Finally, the edited image layers are composited onto the original image at the pixel level, producing the final result.

Datasets

We evaluate IAM-Edit on three representative benchmarks: PIE-Bench (Ju et al., 2023), ZONE_testset (Li et al., 2024), and OIR-Bench (Yang et al., 2024). Table 1 summarizes the dataset composition and sample counts. PIE-Bench contains 700 instruction-driven editing samples and groups them into 10 editing tasks, covering: (1) object addition, (2) object removal, (3) object replacement, (4) attribute/color change, (5) material/texture change, (6) style transfer, (7) background change, (8) scene/layout change, (9) spatial/position change, and (10) quantity/number change. ZONE_testset contains 100 images grouped into three action types (add/change/remove). OIR-Bench consists of 308 text–image editing pairs, including 208 single-object pairs and 100 multiobject pairs. We define IDLE-Bench (Instruction Driven Local Editing Benchmark) as the union of these three datasets, resulting in 1,108 samples in total. For a unified evaluation protocol, we standardize each sample into a consistent set of fields: source image, editing instruction, and editing object; prompts and masks are used when available from the original benchmark.

Table 1.
Dataset Composition Used in this Paper.

Dataset	Samples	Task/Type
PIE-Bench (Ju et al., 2023)	700	10 editing tasks
ZONE_testset (Li et al., 2024)	100	add/change/remove
OIR-Bench (Yang et al., 2024)	308	single-object (208)/multiobject (100)
IDLE-Bench (ours)	1,108	union of the above
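As a rough illustration of the unified sample format described above, the sketch below models one standardized record and recomputes the IDLE-Bench size from Table 1. The class name `EditSample` and all field names are hypothetical; the paper lists only the kinds of fields (source image, editing instruction, editing object, optional prompts and masks), not their names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EditSample:
    # Required fields shared by all three benchmarks.
    source_image: str          # path to the source image
    instruction: str           # natural-language editing instruction
    edit_object: str           # object targeted by the edit
    # Optional fields, used only when the original benchmark provides them.
    source_prompt: Optional[str] = None
    target_prompt: Optional[str] = None
    mask_path: Optional[str] = None
    benchmark: str = ""


# Sample counts taken from Table 1.
BENCHMARK_SIZES = {"PIE-Bench": 700, "ZONE_testset": 100, "OIR-Bench": 308}
IDLE_BENCH_SIZE = sum(BENCHMARK_SIZES.values())

example = EditSample(
    source_image="images/0001.png",  # hypothetical path
    instruction="change the basket to a bottle",
    edit_object="basket",
    benchmark="ZONE_testset",
)
```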

Metrics

CLIP-I (Radford et al., 2021) is used to quantify the visual consistency between the original and edited images. DINO-I, the DINOv2 image similarity (Oquab et al., 2023), is used to assess the degree of spatial structure preservation during the editing process, serving as a measure of the stability in nonedited areas. LPIPS score (Zhang et al., 2018) is used as a perceptual metric to evaluate the retention of high-level semantics and details between the edited image and the original image. CLIP-T (Gal, Patashnik, et al., 2022) is used to measure whether the image modifications before and after editing align with the editing instructions. Higher CLIP-I, DINO-I, and CLIP-T scores indicate better semantic consistency, spatial structure preservation, and instruction execution accuracy, respectively. A lower LPIPS score indicates higher perceptual similarity.
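Embedding-based scores such as CLIP-I and DINO-I are commonly computed as the cosine similarity between the feature vectors of the original and edited images. The text does not spell out the exact formula, so the following is a sketch under that assumption. It operates on precomputed embeddings (plain lists of floats) and does not load any CLIP or DINOv2 model; the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_image_similarity(embedding_pairs):
    """Average similarity over (original, edited) embedding pairs."""
    scores = [cosine_similarity(orig, edit) for orig, edit in embedding_pairs]
    return sum(scores) / len(scores)
```

Under this assumption, a benchmark-level CLIP-I or DINO-I value would be the mean over all evaluated samples, with 1.0 meaning the edited image embeds identically to the original.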

Implementation Details

We use the IP2P (Brooks et al., 2022) implementation from the Diffusers library on the HuggingFace platform. IP2P is a latent diffusion model for instruction-based image editing, and we employ the Euler ancestral sampler with 100 denoising steps (Karras et al., 2022). During the mask-generation phase, self-attention and cross-attention maps are collected from all timesteps. Based on parameter studies, the number of clusters for the self-attention maps is set to 5. During the image-generation phase, we modulate cross-attention maps at all timesteps. The image guidance scale $s_{I}$ and the text guidance scale $s_{T}$ are set to 0.9 and 8.0, respectively, following the parameter settings used in WYS (Mirzaei et al., 2024). A fixed random seed of 50 is used for all experiments to ensure reproducibility and consistent comparison across methods. All experiments are conducted on an NVIDIA RTX 4090 GPU.

Hyperparameters and Configurations

All IP2P-based methods (IP2P, MagicBrush, WYS, and ZONE) share the same IP2P backbone, the same number of denoising steps, the same guidance scales, and the same fixed random seed as in our Implementation Details.

For LEDITS++, which is not IP2P-based, we follow the settings reported in the original paper. Specifically, we use 20 inversion steps and 20 generation steps, set the implicit-mask threshold to $\lambda = 0.1$, and apply the skip operation at $t = 0.75T$. For multiprompt edits, we follow the recommended parameter ranges: a skip fraction between 0.2 and 0.3, a guidance scale (edit strength) between 10.0 and 15.0, and a masking threshold between 0.7 and 0.9.

Hyperparameter Selection Protocol

Unless otherwise specified, we start from the default/recommended settings of the original implementations. For the method-specific hyperparameters in IAM-Edit, we determine a unified setting based on the ablation results reported in the Supplemental materials, and keep the same setting across all benchmarks and all compared methods to ensure reproducibility and fairness. In particular, we choose $K$-means for ISAM clustering due to its favorable stability–efficiency trade-off, and set the number of clusters to $K = 5$ according to the IoU/CLIP-T trends in the parameter study. We also ablate the token scope and the resolution for cross-attention modulation, as well as the guidance scales $(s_{T}, s_{I})$, and report the full comparisons and selection curves in the Supplemental materials.
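The clustering choice described above can be sketched as follows. This is a simplified illustration, not the paper's mask-generation procedure: it treats each spatial position's self-attention row as a feature vector, runs plain Lloyd's $K$-means with a deterministic initialisation (an assumption made here for repeatability), and then keeps the single cluster that contains the peak of the cross-attention map. How IAM-Edit actually combines clusters with cross-attention focal points is defined in the method section and may differ. All function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points, k, max_iters=100):
    """Lloyd's algorithm with evenly spaced initial centroids."""
    step = len(points) / k
    centroids = [list(points[int(i * step)]) for i in range(k)]
    labels = None
    for _ in range(max_iters):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def mask_from_attention(self_attn_rows, cross_attn, k=5):
    """Binary mask: 1 for positions in the cluster holding the cross-attention peak."""
    labels = kmeans_labels(self_attn_rows, k)
    peak = max(range(len(cross_attn)), key=lambda i: cross_attn[i])
    return [1 if lab == labels[peak] else 0 for lab in labels]


# Toy example with four positions and two clusters.
rows = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
toy_mask = mask_from_attention(rows, cross_attn=[0.1, 0.2, 0.9, 0.3], k=2)
```

With the paper's setting, `k` would be 5 and the rows would come from the collected self-attention maps; the toy call uses `k=2` only to keep the example readable.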

5.2. Main Results

Qualitative Evaluation

Figure 9 shows qualitative examples of various editing tasks. We adopt a fixed protocol for qualitative comparison and select six representative image–instruction pairs from IDLE-Bench that cover the three operation types (“change,” “add,” and “remove”). The “change” category further includes multiregion editing, object replacement, large-area editing and style transfer. For each example, LEDITS++ follows its original paper settings since it is not IP2P-based, whereas all remaining methods share the same IP2P backbone, denoising steps, guidance scales, and random seed. Figure 9 shows, from left to right, the input image, the results of LEDITS++, IP2P, MagicBrush, WYS, ZONE, our IAM-Edit, and the editing mask predicted by IAM-Edit. The change edits in LEDITS++ are implemented by adding or removing objects, which does not fully capture the semantics of the change instructions, resulting in incomplete modifications or inaccurate localization. In Figure 9, column “LEDITS++,” rows (c), (d), (e), and (f) all show localization errors, such as no modification between the original and edited images in row (c), blurred edits in row (d), and incorrect edits in rows (e) and (f).

Figure 9.

Qualitative comparison. We tested IAM-Edit on different tasks: (a) multiregion editing, (b) single-region editing, (c) large-area editing, (d) style change, (e) add, and (f) remove. The editing results of IAM-Edit are compared with the state-of-the-art methods. From left to right: input image, LEDITS++ (Brack et al., 2024), IP2P (Brooks et al., 2022), MagicBrush (Zhang et al., 2023), WYS (Mirzaei et al., 2024), ZONE (Li et al., 2024), and IAM-Edit. The last column shows the editing region mask obtained by our method and the cross-attention map of the editing object. All editing instructions are displayed below the image. IAM = instruction attention maps.

IP2P and MagicBrush often produce overexpanded edits, even though they can localize the editing area based on the instructions. In Figure 9, columns “IP2P” and “MagicBrush,” row (b) shows the basket being modified into a “bottle,” and row (d) shows multiple areas turning blue in addition to the blanket.

WYS suffers from inaccurate localization, occasionally leading to cases where the image remains unedited. For example, in Figure 9, column “WYS,” row (a), the image shows no visible change.

ZONE still exhibits overexpanded edits. In Figure 9, column “ZONE,” row (b), the entire image is modified, making the result appear as if it were generated directly from the edit instruction.

We compare the masks generated by WYS, ZONE, and IAM-Edit in Figure 10. WYS relies on noise differences to generate masks, which struggles to handle visually similar objects (e.g., different types of flowers). While ZONE successfully locates the area of “rose,” it only locates one flower. During rose generation, the mask dilation is uncontrolled, extending into nonedited areas, leading to overexpanded edits, as shown in Figure 9, column “ZONE,” row (a).

Figure 10.

Mask analysis. For WYS and ZONE, which address a task similar to IAM-Edit, we show the editing region masks and editing results for Figure 9(a). IAM = instruction attention maps.

IAM-Edit generates the mask of the editing region using attention maps, ensuring accurate edits without affecting unrelated regions. This demonstrates the effectiveness of IAM in improving local editing capabilities.

Quantitative Evaluation

As shown in Table 2, IAM-Edit achieves the best results on the metrics CLIP-I, DINO-I, and CLIP-T, indicating that it not only effectively retains the overall semantics (CLIP-I) and visual structure (DINO-I) of the original image, but it also accurately captures and reflects the semantic intent of the editing instructions (CLIP-T), achieving an effective balance between editing accuracy and visual fidelity. ZONE performs slightly better than IAM-Edit on the LPIPS metric due to its stronger pixel-level statistical consistency. However, as evidenced by the qualitative results, ZONE often produces complete modification of the entire image or incorrect localization, which deviates from the goal of image editing.

Table 2.

Quantitative Comparisons.

Method	CLIP-I	DINO-I	LPIPS	CLIP-T
LEDITS++	0.8859	0.9222	0.2200	0.2370
IP2P	0.7603	0.8903	0.5936	0.2312
MagicBrush	0.8145	0.9211	0.4219	0.2454
WYS	0.9124	0.9834	0.1777	0.2394
ZONE	0.9155	0.9735	0.1135	0.2465
IAM-Edit	0.9237	0.9916	0.1703	0.2471

Note. IAM = instruction attention maps.

Bold values indicate the best performance among all compared methods.

User Study

To assess the perceptual quality of localized editing, we conduct a user preference study on 48 image–instruction pairs sampled from IDLE-Bench. The 48 examples are balanced across the three operation types (“change,” “add,” “remove”) and cover both single-object and multiobject edits, as well as large-area and style-change edits. For each example, we generate six edited results using LEDITS++, IP2P, MagicBrush, WYS, ZONE, and our IAM-Edit with the same hyperparameter settings as in the quantitative evaluation.

We recruit five volunteers from diverse academic backgrounds, including computer science and technology, finance, electrical engineering and automation, visual communication design, and business administration. All participants are unfamiliar with the compared methods and are not informed which result is produced by which method. For each question, the interface (see Figure 14 in the Supplemental materials) displays the original image at the top and the six anonymized edited results below it in a randomly shuffled order. The textual editing instruction is shown together with the original image. Participants are asked to select a single image that best satisfies two criteria: (i) alignment with the editing instruction and (ii) preservation of the content and visual quality of nonedited regions. Each participant evaluates all 48 examples, resulting in $5 \times 48 = 240$ selections in total.

We calculate the selection ratio of each method, termed the user preference rate (UPR):

$$UPR(i) = \frac{N_{i}}{N_{total}}, \tag{16}$$

where $N_{i}$ is the number of times method $i$ is selected, and $N_{total}$ is the total number of selections over all methods. Thus, the five participants made 240 choices in total (48 questions $\times$ 5 users). IAM-Edit was selected 81 times, corresponding to a UPR of $33.75\%$, whereas the strongest baseline MagicBrush received 45 selections ($18.75\%$). The 95% confidence interval of IAM-Edit’s UPR is $[27.8\%, 39.7\%]$, which does not overlap with that of MagicBrush ($[13.8\%, 23.7\%]$). Compared to a uniform random choice among six methods ($16.67\%$), IAM-Edit is preferred almost twice as often and improves over the strongest baseline by 15 percentage points, an 80% relative gain. A chi-square goodness-of-fit test against the uniform baseline yields $\chi^{2} = 68.9$ with 5 degrees of freedom and a $p$-value below $10^{-10}$, indicating that user selections are highly nonuniform. A two-proportion $z$-test between IAM-Edit and MagicBrush further confirms the statistical significance of this advantage ($p < 0.001$). Overall, these results demonstrate that users consistently favor IAM-Edit, perceiving its outputs as better aligned with the instructions while more faithfully preserving nonediting regions (Table 3).

Table 3.

User Preference Rate Calculation Results.

Method	UPR (%)
LEDITS++	17.92
IP2P	6.25
MagicBrush	18.75
WYS	15.00
ZONE	8.33
IAM-Edit (ours)	33.75

Note. IAM = instruction attention maps.

Bold values indicate the best performance among all compared methods.
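The user-study statistics can be re-derived with a few lines of standard-library Python. The sketch below is a reader-side check, not the authors' analysis script. Only the counts for IAM-Edit (81) and MagicBrush (45) are stated in the text; the other four counts are assumptions obtained by multiplying the Table 3 rates by 240 and rounding. The interval uses the normal-approximation (Wald) formula, which is also an assumption, since the paper does not name the interval type.

```python
import math

# Selection counts out of 240; values other than 81 and 45 are derived from Table 3.
counts = {"LEDITS++": 43, "IP2P": 15, "MagicBrush": 45,
          "WYS": 36, "ZONE": 20, "IAM-Edit": 81}


def upr(counts):
    """User preference rate of equation (16) for every method."""
    total = sum(counts.values())
    return {method: n / total for method, n in counts.items()}


def wald_ci(k, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion k / n."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def chi_square_uniform(counts):
    """Goodness-of-fit statistic against equal expected counts."""
    expected = sum(counts.values()) / len(counts)
    return sum((n - expected) ** 2 / expected for n in counts.values())


def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z statistic and its two-sided p-value."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (k1 / n1 - k2 / n2) / se
    return z, math.erfc(abs(z) / math.sqrt(2))


rates = upr(counts)
iam_ci = wald_ci(81, 240)
magic_ci = wald_ci(45, 240)
chi2 = chi_square_uniform(counts)
z_stat, p_value = two_proportion_z(81, 240, 45, 240)
```

Under these assumptions the script reproduces the reported rates, both confidence intervals (after rounding to one decimal place), and the chi-square statistic of 68.9, and the two-proportion test gives a $p$-value below 0.001.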

6. Conclusion

In this work, we introduced IAM-Edit, a localized image editing framework built on instruction attention maps (IAM). Conceptually, our contributions are threefold. First, we define IAM by jointly analyzing instruction-aware self-attention (ISAM) and cross-attention (ICAM), and show that these maps encode rich semantics about both the editing instruction and the spatial layout of the image. Second, we propose a localization-instruction mechanism and an attention-driven mask generation strategy that clusters ISAM and combines it with ICAM to automatically recover the editing region without additional supervision. Third, we design a generic attention modulation operator that gates the cross-attention maps of all tokens with the editing mask at every denoising step, thereby constraining edits to the target region while preserving nonediting areas.

Although our instantiation of IAM-Edit uses IP2P as the backbone, the proposed concepts and operators are defined purely at the level of attention maps. As such, they are in principle compatible with a broad family of instruction- or text-guided diffusion models that expose self- and cross-attention tensors, including IP2P variants and other backbones such as SDXL-based editors or MLLM-assisted editing pipelines. We therefore expect that the localization and preservation gains provided by IAM-Edit can transfer to these architectures with minimal adaptation, and we regard a systematic evaluation on additional backbones as an important direction for future work. Extensive experiments and user studies on IDLE-Bench demonstrate that IAM-Edit achieves superior localization accuracy and content preservation compared to state-of-the-art baselines.

Looking ahead, we plan to extend IAM-Edit to other modalities such as video and 3D/NeRF-based editing, and to investigate more efficient implementations of IAM that reduce the computational overhead of attention collection and modulation. Another promising direction is to combine IAM-Edit with learned safety filters and content-aware constraints, enabling controllable yet responsible deployment of localized editing in real-world applications.

Limitations

Our current implementation and experiments focus on IP2P as the underlying diffusion backbone. Consequently, in challenging cases where IP2P itself fails to faithfully render the target object, IAM-Edit may still produce unsatisfactory edits even when the editing region is correctly localized. Moreover, IAM-Edit relies on several key dependencies: (i) it requires access to intermediate self-/cross-attention maps inside the diffusion U-Net, and thus is not directly applicable to pipelines that do not expose attention tensors; (ii) its localization cues depend on token-wise cross-attention under the CLIP text encoder with BPE tokenization, where long or complex instructions and word-to-subword splitting may weaken the focus of target token maps; and (iii) it uses an action classifier and an “unrelated replacement” word selection to construct localization instructions, so misclassification or suboptimal replacement selection may reduce localization reliability in rare cases. In addition, IAM-Edit requires two forward passes and similarity computations during the mask generation phase, which introduces extra inference overhead compared to directly applying IP2P. Exploring more efficient implementations and validating IAM-Edit on a wider range of diffusion backbones are left to future work.

Supplemental Material

sj-pdf-1-eai-10.1177_30504554261450837 - Supplemental material for IAM-Edit: Localized Image Editing via Instruction Attention Maps

Supplemental material, sj-pdf-1-eai-10.1177_30504554261450837 for IAM-Edit: Localized Image Editing via Instruction Attention Maps by Shucheng Mao, Hua Cheng, Zehong Qian, Yingying Ding, Shibo Luo and Yiquan Fang in The European Journal on Artificial Intelligence

Footnotes

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Shucheng Mao

Hua Cheng

Yiquan Fang

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online.

References

Avrahami, Lischinski, & Fried (2022). Blended diffusion for text-driven editing of natural images. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 18187–18197).

Balaji, Nah, Huang, Vahdat, Song, Zhang, Kreis, Aittala, Aila, Laine, Catanzaro, Karras, & Liu, M.-Y. (2022). eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv. https://doi.org/10.48550/arXiv.2211.01324

Brack, Friedrich, Kornmeier, Tsaban, Schramowski, Kersting, & Passos (2024). LEDITS++: Limitless image editing using text-to-image models. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 846–856).

Brooks, Holynski, & Efros, A. A. (2022). InstructPix2Pix: Learning to follow image editing instructions. arXiv. https://doi.org/10.48550/arXiv.2211.09800

Brown, T. B., Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, D. M., Winter, & Amodei (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 33, 1877–1901. https://doi.org/10.48550/arXiv.2005.14165

Cao, Wang, Shan, Qie, & Zheng (2023). MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 22560–22570).

Couairon, Verbeek, Schwenk, & Cord (2023). DiffEdit: Diffusion-based semantic image editing with mask guidance. Proceedings of the international conference on learning representations (ICLR).

Feng, Qiu, Bai, Zhang, Dong, Zhou, Ying, & Tassiulas (2024). An item is worth a prompt: Versatile image editing with disentangled control. arXiv. https://doi.org/10.48550/arXiv.2403.04880

Fu, T.-J., Wang, W. Y., Yang, & Gan (2024). Guiding instruction-based image editing via multimodal large language models. International Conference on Learning Representations (ICLR).

Gal, Alaluf, Atzmon, Patashnik, Bermano, A. H., Chechik, & Cohen-Or (2022). An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv. https://doi.org/10.48550/arXiv.2208.01618

Gal, Patashnik, Maron, Bermano, A. H., Chechik, & Cohen-Or (2022). StyleGAN-NADA: CLIP-guided domain adaptation of image generators. ACM Transactions on Graphics, 41(4), Article 69. https://doi.org/10.1145/3528223.3530164

Guo & Lin (2024). Focus on your instruction: Fine-grained and multi-instruction image editing by attention modulation. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 6986–6996).

Hertz, Mokady, Tenenbaum, Aberman, Pritch, & Cohen-Or (2022). Prompt-to-prompt image editing with cross attention control. arXiv. https://doi.org/10.48550/arXiv.2208.01626

Huang, Xie, Wang, Yuan, Cun, Zhou, Dong, Huang, Zhang, & Shan (2024). SmartEdit: Exploring complex instruction-based image editing with multimodal large language models. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 8362–8371).

Ju, Zeng, Bian, & Liu (2023). Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv. https://doi.org/10.48550/arXiv.2310.01506

Karras, Aittala, Aila, & Laine (2022). Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35, 26565–26577. https://arxiv.org/abs/2206.00364

Kirillov, Mintun, Ravi, Mao, Rolland, Gustafson, Xiao, Whitehead, Berg, A. C., Lo, W.-Y., Dollár, & Girshick (2023). Segment anything. arXiv. https://doi.org/10.48550/arXiv.2304.02643

Li, Zeng, Feng, Gao, Liu, Tang, Liu, & Zhang (2024). ZONE: Zero-shot instruction-guided local editing. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 6254–6263).

Lugmayr, Danelljan, Romero, Timofte, & Van Gool (2022). RePaint: Inpainting using denoising diffusion probabilistic models. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 11461–11471).

Mirzaei, Aumentado-Armstrong, Brubaker, M. A., Kelly, Levinshtein, Derpanis, K. G., & Gilitschenski (2024). Watch your steps: Local image and scene editing by text instructions. Proceedings of the European conference on computer vision (ECCV) (pp. 111–129).

Mokady, Hertz, Aberman, Pritch, & Cohen-Or (2023). Null-text inversion for editing real images using guided diffusion models. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 6038–6047).

Nichol, Dhariwal, Ramesh, Shyam, Mishkin, McGrew, Sutskever, & Chen (2022). GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. Proceedings of the 39th international conference on machine learning (ICML) (pp. 16784–16804).

Oquab, Darcet, Moutakanni, Vo, H. V., Szafraniec, Khalidov, Fernandez, Haziza, Massa, El-Nouby, Assran, Ballas, Galuba, Howes, Huang, P.-Y., Li, S.-W., Misra, Rabbat, Sharma, & Bojanowski (2023). DINOv2: Learning robust visual features without supervision. arXiv. https://doi.org/10.48550/arXiv.2304.07193

Parmar, Singh, K. K., Zhang, & Zhu, J.-Y. (2023). Zero-shot image-to-image translation. arXiv. https://doi.org/10.48550/arXiv.2302.03027

Patashnik, Garibi, Azuri, Averbuch-Elor, & Cohen-Or (2023). Localizing object-level shape variations with text-to-image diffusion models. Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 22994–23004).

Podell, English, Lacey, Blattmann, Dockhorn, Müller, Penna, & Rombach (2023). SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv. https://doi.org/10.48550/arXiv.2307.01952

Radford, Kim, J. W., Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, Krueger, & Sutskever (2021). Learning transferable visual models from natural language supervision. Proceedings of the 38th international conference on machine learning (ICML) (pp. 8748–8763).

Ramesh, Dhariwal, Nichol, Chu, & Chen (2022). Hierarchical text-conditional image generation with CLIP latents. arXiv. https://doi.org/10.48550/arXiv.2204.06125

Ren, Kuang, Xia, Wang, Zhu, Xie, Wang, Xiao, Wang, & Zheng (2024). ByteEdit: Boost, comply and accelerate generative image editing. Proceedings of the European conference on computer vision (ECCV) (pp. 184–200).

Rombach, Blattmann, Lorenz, Esser, & Ommer (2022). High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 10684–10695).

Ronneberger, Fischer, & Brox (2015). U-Net: Convolutional networks for biomedical image segmentation. Proceedings of the international conference on medical image computing and computer-assisted intervention (MICCAI) (pp. 234–241).

Ruiz, Jampani, Pritch, Rubinstein, & Aberman (2022). DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv. https://doi.org/10.48550/arXiv.2208.12242

Saharia, Chan, Saxena, Whang, Denton, Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., Salimans, Fleet, D. J., & Norouzi (2022). Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35, 36479–36494. https://doi.org/10.52202/068431-2643

Sheynin, Polyak, Singer, Kirstain, Zohar, Ashual, Parikh, & Taigman (2024). Emu Edit: Precise image editing via recognition and generation tasks. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 8871–8879).

Tang, Wang, Yang, & van de Weijer (2024). LocInv: Localization-aware inversion for text-guided image editing. arXiv. https://doi.org/10.48550/arXiv.2405.01496

Tumanyan, Geyer, Bagon, & Dekel (2023). Plug-and-play diffusion features for text-driven image-to-image translation. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 1921–1930).

Wang, Yang, Butt, M. A., & van de Weijer (2023). Dynamic prompt learning: Addressing cross-attention leakage for text-based image editing. Advances in Neural Information Processing Systems, 36, 26291–26303. https://doi.org/10.48550/arXiv.2309.15664

Wei, Zhang, Bai, Zhang, & Zuo (2023). ELITE: Encoding visual concepts into textual embeddings for customized text-to-image generation. Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 15897–15907).

Xue, Song, Guo, Liu, Zong, Liu, & Luo (2023). RAPHAEL: Text-to-image generation via large mixture of diffusion paths. arXiv. https://doi.org/10.48550/arXiv.2305.18295

Yang, Zhang, Chen, Sun, Chen, & Wen (2023). Paint by example: Exemplar-based image editing with diffusion models. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 18381–18391).

Yang, Ding, Wang, Chen, Zhuang, & Shen (2024). Object-aware inversion and reassembly for image editing. Proceedings of the twelfth international conference on learning representations (ICLR).

Yu, Feng, Liu, Jin, Zeng, & Chen (2023). Inpaint anything: Segment anything meets image inpainting. arXiv. https://doi.org/10.48550/arXiv.2304.06790

Zhang, Chen, & Sun (2023). MagicBrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems (NeurIPS), 36, 31428–31449. https://doi.org/10.52202/075280-1365

Zhang, Isola, Efros, A. A., Shechtman, & Wang (2018). The unreasonable effectiveness of deep features as a perceptual metric. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 586–595).
