Abstract
When brands share images on social media, they often pair these visuals with textual captions. How can brands craft image captions that effectively engage consumers? This research introduces the construct of caption extension—defined as the extent to which a caption goes beyond directly describing the visible elements of its paired image (e.g., by adding new perspectives, interpretation, or imagination)—and investigates its impact on consumer engagement on social media. An empirical analysis of 95,155 single-image social media posts from 402 brands across 33 industry sectors, combined with an online experiment (N = 1,200), reveals that greater caption extension increases consumer interest, which in turn enhances engagement. However, this positive effect weakens as the brand’s psychological distance (i.e., the extent to which the brand feels remote or unrelatable to consumers) increases and may even reverse when the distance is especially high. These findings highlight two critical factors for driving engagement on social media: (1) the relationship between different content formats (e.g., visual and textual) and (2) the alignment of content with the brand’s unique characteristics.
Keywords
Social media has become a crucial channel for brands to connect with consumers. Unlike traditional marketing channels (e.g., TV ads), where consumers receive brand information passively, social media enables active consumer interaction through behaviors such as liking, sharing, and commenting on brand-generated content (Evans, Bratton, and McKee 2010). Consumers’ social media engagement offers numerous benefits to brands, including stronger consumer relationships, increased brand awareness, and higher product sales (Babić Rosario et al. 2016; Colicev, Kumar, and O’Connor 2019; Kumar et al. 2016). Consequently, creating engagement-worthy content has become a central priority for today’s brands (Abell and Biswas 2023; Eigenraam, Eelen, and Verlegh 2021; Shahbaznezhad, Dolan, and Rashidirad 2021).
With advances in visual creation tools and faster internet speeds, brands are increasingly relying on visual content to communicate online (Abell and Biswas 2023; Venngage 2022). The market for visual content marketing is rapidly expanding, with projections indicating it could reach a value of $10.6 billion by 2024 (Sproutworth 2023). Among various forms of visual content, images represent one of the most fundamental and widely used formats. As of 2024, it is estimated that more than 95 million photos are uploaded daily on social media platforms (Lister 2024). Notably, brands rarely post images alone; instead, they typically include a textual caption alongside the image. For instance, in this research’s secondary dataset, 98.28% of single-image tweets are accompanied by a caption. This widespread practice raises an important question: How can brands craft image captions that effectively engage consumers?
Existing research in other marketing contexts (e.g., advertising, packaging, consumer reviews) has examined how the relationship between images and text influences consumer responses. These studies generally find that when images and text are mismatched, consumer reactions tend to be less favorable, leading to negative outcomes such as lower ad attitudes (Lee and Mason 1999), poorer product evaluations (Jaud and Melnyk 2020), and reduced perceived helpfulness of reviews (Ceylan, Diehl, and Proserpio 2023). Unlike prior research that focuses on image–text mismatch or incongruence, this research introduces a related but distinct construct: caption extension. Caption extension refers to the extent to which a caption goes beyond describing the visible elements of an image, while remaining contextually appropriate and meaningfully connected. Although brands may not deliberately use extended captions, real-world social media data show that this strategy is already in practice (e.g., as shown in Table 1, caption extension can vary when promoting a white car model from the same brand). Yet whether, when, and how greater caption extension influences consumer engagement remains largely unexplored. This research directly investigates these questions.
Table 1. Social Media Posts Categorized by Smaller Versus Greater Caption Extension.
Notes: Smaller (vs. greater) caption extension refers to captions with extension scores more than one standard deviation below (vs. above) the sample mean. See the “Measurement of Variables” section for calculation details.
Both tweets are from the same brand (Nissan); nonessential elements (e.g., URLs) have been removed for clarity.
Using a mixed-methods approach, this research shows that greater caption extension generally boosts consumer engagement by increasing consumer interest. However, this effect weakens and can even reverse as brand psychological distance increases. For example, for brands with closer psychological distance, such as those that consumers are familiar with or frequently purchase from (e.g., Coca-Cola, H&M), greater caption extension tends to spark interest and enhance engagement. In contrast, for brands with greater psychological distance, such as those whose offerings are far removed from everyday consumer experience (e.g., chemical corporation BASF, aerospace and arms producer BAE Systems), greater caption extension can backfire, reducing both interest and engagement.
This research makes several contributions. First, it challenges the conventional view that image and text should be closely aligned to enhance content effectiveness (e.g., Jaud and Melnyk 2020). Instead, it shows that strategically extending captions beyond direct visual descriptions can boost consumer interest and engagement, especially for brands perceived as psychologically closer to consumers.
Second, this research emphasizes the multimodal nature of social media content (e.g., images paired with text, videos paired with text, or images paired with videos; Holiday et al. 2023) and shows that the interaction between content formats significantly shapes consumer engagement. It encourages future research to move beyond single-format analyses and explore how format relationships influence engagement.
Third, this research introduces a novel construct, caption extension, which captures a distinct type of image–text relationship. Unlike image–text mismatch or incongruence, which have been widely studied (e.g., Heckler and Childers 1992; Lee and Mason 1999), caption extension remains underexplored. This research provides a clear conceptualization of caption extension, distinguishing it from related constructs (e.g., image–text incongruence, caption creativity, caption abstractness). It also develops and validates a machine-learning-based method to measure caption extension at scale. Together, the construct, measurement approach, and empirical findings offer a meaningful contribution to content marketing research. Future studies may explore how caption extension interacts with brand characteristics, consumer traits, or platform dynamics across different contexts.
Fourth, while prior research has identified content features that drive engagement, it often overlooks brand-specific factors that may limit or reverse these effects (Montaguti, Valentini, and Vecchioni 2023). This research highlights the importance of brand–content alignment as a key driver of social media engagement, suggesting that brands should carefully assess their own characteristics (e.g., psychological distance from consumers) when tailoring communication strategies for effective engagement.
This research is structured as follows. First, the literature is reviewed to identify gaps and define the core construct of caption extension. Next, research hypotheses are developed, and a theoretical framework is presented. The dataset, variable measurements, and empirical findings are then outlined, followed by a series of robustness checks. An online experiment further tests causality and underlying mechanisms. The article concludes with a general discussion of its contributions, implications, and limitations.
Existing Literature and Research Gap
Content Features That Drive Social Media Engagement
Social media engagement (i.e., the extent of consumer interaction with brands on social media) is vital for driving brand success. It has been linked to increased brand awareness (Hughes, Swaminathan, and Brooks 2019), stronger consumer relationships (Hollebeek, Glynn, and Brodie 2014), and higher sales (Babić Rosario et al. 2016; Kumar et al. 2016). As a result, fostering engagement has become a core objective in digital marketing strategies (Appel et al. 2020; Hollebeek and Macky 2019; Kumar et al. 2016; Lee, Hosanagar, and Nair 2018). Given that engagement is largely shaped by brand-generated content (Hernández-Ortega et al. 2022; Lu, Dinner, and Grewal 2022), a growing body of research has investigated which content features are most effective in eliciting consumer response.
Most research to date has focused on how text content, one of the simplest content formats, shapes engagement. For instance, findings indicate that text content that arouses emotion (Berger and Milkman 2012), expresses certainty (Pezzuti, Leonhardt, and Warren 2021), or incorporates emojis (McShane et al. 2021) tends to increase engagement. More recently, the vibrant and attention-grabbing nature of visual content has increased its popularity among both brands and consumers (Li and Xie 2020). Researchers have increasingly examined how variations in images’ visual features impact engagement. For example, image content that is colorfully complex (Kanuri, Hughes, and Hodges 2024) or emotionally expressive (Rietveld et al. 2020) has been found to drive higher social media engagement.
Definition of Caption Extension
While research on single-format content (e.g., text-only or image-only posts) has advanced understanding of social media engagement, it often overlooks a key reality: Most social media content is multimodal (Holiday et al. 2023; Mazloom et al. 2016). For example, when brands post images on social media, they often pair visuals with textual captions that vary from a few words to a few sentences. More specifically, captions for a given image can range from direct references to visible elements to more extended text that goes beyond what is visually present. In this research, caption extension is thus defined as the extent to which a caption’s content extends beyond the visible elements of its paired image while remaining contextually appropriate and meaningfully connected to the image. To clarify the concept of caption extension, it is helpful to compare it with related but distinct constructs from prior research, including image–text mismatch (or image–text incongruence), caption creativity, and caption abstractness (see Table 2 for an overview of the comparison).
Table 2. Comparison of Caption Extension with Related Constructs Based on Key Characteristics.
Note: Constructs in the table are considered at their highest levels of expression.
First, caption extension differs from image–text mismatch or image–text incongruence (e.g., Ceylan, Diehl, and Proserpio 2023; Heckler and Childers 1992; Lee and Mason 1999). By definition, greater caption extension does not compromise the caption’s appropriateness for the image. For example, pairing an image of a rose with the caption “a symbol of romance and love” reflects greater caption extension, because the caption does not directly reference the visible elements, yet it remains meaningfully connected to the image’s content and contextually appropriate. In contrast, image–text mismatch occurs when the caption lacks a meaningful relationship with the image, such as pairing an image of a rose with the caption “our newest SUV model.” This incongruence renders the caption irrelevant, unhelpful to the viewer, and contextually inappropriate.
Second, caption extension is not equivalent to caption creativity (e.g., Hofstetter et al. 2021; Smith and Yang 2004). Prior research demonstrates that for content to be considered creative, it must meet the criterion of originality (i.e., being new and novel; Rosengren et al. 2020). However, caption extension does not necessarily imply originality, as a caption can extend beyond an image’s visible elements by drawing on familiar associations rather than introducing something entirely new. For example, the caption “a symbol of romance and love” for an image of a rose demonstrates a greater level of extension. Yet, because the association between roses and the concept of romance is widely recognized (and somewhat cliché), the caption lacks originality. Conversely, a caption can be creative without extending beyond the image’s visible elements. For instance, a caption such as “the rose is having a beautiful dream full of red petals” demonstrates creativity through surreal and unexpected imagination, even though it remains anchored in the image’s visible elements (e.g., “rose” and “petals”), reflecting low caption extension. As such, the two constructs are conceptually distinct and should not be used interchangeably.
Finally, caption extension is different from caption abstractness. Caption abstractness refers to the degree to which abstract language is used in a caption (Brysbaert, Warriner, and Kuperman 2014). Abstract captions typically include intangible concepts (e.g., “happiness,” “freedom,” “pragmatism”) rather than concrete objects (e.g., “apple,” “pen,” “table”). In contrast, greater caption extension does not necessarily reflect greater abstractness; extended captions can still feature concrete details. For example, consider again an image of a rose with the caption “A gift left on the table by her husband.” This caption demonstrates greater extension by avoiding direct reference to the image’s visual features (“rose” and “petals”) while including several concrete elements (“gift,” “table,” and “husband”). Moreover, captions with greater extension, as defined in this research, are expected to remain contextually appropriate and meaningfully connected to the image, whereas highly abstract captions may not.
Research Gaps Addressed in the Current Research
How does caption extension impact consumer engagement on social media? While existing research has not fully addressed this question, studies on image–text interplay in traditional marketing contexts (e.g., advertising, packaging, consumer reviews) offer some insights (see Table 3 for a summary).
Table 3. Literature on the Impact of Image–Text Interplay on Consumer Responses.
The current research addresses two primary gaps in the existing literature. First, prior studies have largely focused on image–text mismatch or incongruence, cases where the accompanying text is inappropriate or irrelevant to the image (e.g., Ceylan, Diehl, and Proserpio 2023; Heckler and Childers 1992; Jaud and Melnyk 2020; Lee and Mason 1999). For example, experiments in this research line have often paired images with entirely unrelated text, such as labeling a heron image with “falcon” (Jaud and Melnyk 2020) or describing an old-fashioned hotel as “modern” and “trendy” (Van Rompay, De Vries, and Van Venrooij 2010). In contrast, the current research focuses on caption extension—a construct that, unlike mismatch or incongruence, reflects the degree to which a caption avoids directly referencing visible elements in the image, without compromising contextual appropriateness. How this relatively new construct influences consumer responses remains largely underexplored.
Second, the social media environment differs significantly from the contexts examined in prior research (e.g., advertising, packaging, consumer reviews). In these conventional marketing contexts, consumers rely on brand-generated content (e.g., product descriptions in ads or on packaging) or consumer-generated content (e.g., product reviews) primarily to evaluate product attributes and quality in support of purchasing decisions (Chevalier and Mayzlin 2006; Silayoi and Speece 2007; Vakratsas and Ambler 1999). Consequently, consumers would expect content in these domains to be objective (MacKenzie and Lutz 1989), informative (Ducoffe 1995), and credible (Erdem and Swait 1998). In contrast, social media platforms are mainly used for social connection, self-expression, or entertainment (Appel et al. 2020; Gensler et al. 2013), where consumers are not typically in a buying mindset. They are more likely to expect content that piques their interest and provides an enjoyable experience (Dolan et al. 2016; McShane et al. 2021). Consequently, insights from traditional marketing contexts may not always be applicable to social media.
To address these gaps, this research investigates how caption extension influences consumer engagement with brand-generated social media posts. The next section introduces the theoretical framework and corresponding hypotheses.
Theoretical Development
Figure 1 presents a visual summary of the overall theoretical framework and associated hypotheses, serving as a roadmap for the theoretical reasoning and discussion that follow.

Figure 1. Theoretical Framework.
The Role of Consumer Interest in Driving Social Media Engagement
Prior research conceptualizes interest as a positive emotion that motivates individuals to explore and learn from novel and complex experiences (Fredrickson 1998; Silvia 2001, 2005; Weidman and Tracy 2020). Numerous studies have demonstrated that heightened interest generally increases individuals’ engagement with stimuli. For instance, people tend to spend more time viewing an interesting image (Berlyne 1963; Libby, Lacey, and Lacey 1973), devote more time to reading interesting text (Ainley, Hidi, and Berndorff 2002), and process interesting text more deeply at a cognitive level (Schiefele 1999). Accordingly, when a brand-generated social media post evokes consumer interest, it is expected to elicit higher levels of engagement.
In particular, on most contemporary social media platforms (e.g., X, Facebook, Instagram), consumer engagement with brand-generated content is manifested through three types of behaviors: liking, sharing, and commenting. Prior research frequently uses the total number of likes, shares, and comments received by a post as an indicator of its effectiveness in generating consumer engagement (e.g., Labrecque, Swani, and Stephen 2020; Pancer et al. 2019; Pezzuti, Leonhardt, and Warren 2021). Following this approach, the present research also uses these three metrics to capture the overall level of consumer engagement with a given brand post.
Heightened consumer interest is predicted to increase all three types of engagement simultaneously:
1. Liking. Research suggests that liking behavior is often driven by general positive affect or enjoyment of the content (De Vries, Gensler, and Leeflang 2012; McShane et al. 2021). Hence, when a brand post captures consumers’ interest (i.e., evoking a positive emotion), it is likely to generate likes.
2. Sharing. Sharing is a more public engagement behavior than liking, as shared content is likely to be visible to one’s followers and even to the public. Consequently, sharing behavior is often driven by social motivations, such as impression management (Leary and Kowalski 1990) and the desire to maintain or strengthen social bonds (Baumeister and Leary 1995). Prior research suggests that sharing interesting (vs. dull) content enables individuals to project a warm and socially appealing image to others (Berger 2014). Moreover, emotionally arousing content has been shown to facilitate social connection when shared (Berger 2011, 2014). Thus, content that generates greater interest among consumers should be more likely to be shared.
3. Commenting. Commenting enables people to express their opinions and engage in discussions. Research suggests that to encourage commenting, content creators should incorporate elements that invite conversation (Rooderkerk and Pauwels 2016). Unlike repetitive or mundane social media content, interesting content is more thought-provoking, stimulating both emotional and cognitive engagement, and offering fresh perspectives for discussion, thereby driving higher levels of commenting.
In summary, the following hypothesis is proposed:
The Dual Determinants of Interest and Their Application to Caption Extension
Silvia (2005), building on prior theoretical perspectives, proposes and validates two core determinants of interest: novelty-complexity and coping potential. Novelty-complexity refers to the degree to which an experience feels new, unpredictable, or mysterious, sparking curiosity as viewers encounter elements that disrupt familiar patterns (Berlyne 1963; Scherer 2001). Coping potential refers to an individual’s confidence in their ability or willingness to comprehend or resolve the novelty and complexity they encounter (Bandura 1997; Lazarus 1991).
In the context of brand-generated content, greater caption extension occurs when the accompanying text moves beyond merely describing visible elements, instead offering unexpected interpretations, deeper insights, or novel perspectives that diverge from consumer expectations. In doing so, greater caption extension satisfies the first condition for eliciting interest by increasing the perceived novelty-complexity of the content. Although the image and caption remain meaningfully connected and appropriate for joint presentation, their integration may demand greater cognitive effort due to a less direct and less obvious connection. This additional mental effort may slightly reduce the content’s overall processing fluency (Alter and Oppenheimer 2009; Lee and Labroo 2004). Therefore, whether this increase in novelty-complexity (brought about by caption extension) translates into actual consumer interest should largely depend on the second determinant: coping potential. In other words, when coping potential is high (vs. low), consumers are more (vs. less) able to resolve the added novelty and complexity, leading to the experience of interest (vs. disinterest).
The Moderating Role of Brand Psychological Distance
The present research proposes that consumers’ coping potential (i.e., their willingness and ability to comprehend or resolve the novelty and complexity they encounter) stems from their perceived psychological distance from the message sender (i.e., the brand). Psychological distance refers to the perceived closeness or remoteness of an object, event, or entity from oneself (Trope and Liberman 2010; Trope, Liberman, and Wakslak 2007). Research indicates that perceived psychological distance varies among brands based on factors like product type, country of origin, past consumer experiences, and brand–consumer identity alignment (Connors et al. 2021; Kim and Song 2019). For example, brands that offer products not directly relevant to consumers’ daily lives (e.g., industrial chemicals) are typically perceived as more psychologically distant than brands offering everyday items (e.g., food and beverages).
When a brand is perceived as psychologically close, consumers are more likely to exhibit high coping potential in response to its content. This occurs through two primary mechanisms. First, psychological proximity often stems from familiarity, such as frequent use or past purchases, which boosts confidence in understanding brand messages (e.g., a McDonald’s post about new burgers is likely to be easily understood by most consumers). Second, content from psychologically close brands tends to feel more personally relevant. For example, a social media post from an apparel brand announcing a sale may align with consumers’ current needs (e.g., looking for discounts), thereby encouraging greater cognitive effort in processing the message.
Taken together, high coping potential resulting from a low brand psychological distance should facilitate not only the ability but also the motivation to process content with higher levels of novelty and complexity, increasing the likelihood of finding such content interesting. These considerations lead to the following hypothesis:
In conjunction with H1’s prediction that consumer interest drives social media engagement, the following hypothesis reflects the proposed moderated effect on engagement.
Conversely, as brand psychological distance increases, the brand and its products are likely to be further removed from consumers’ everyday lives. Consumers may lack the ability to comprehend highly novel or complex content from such distant brands. For example, a chemical company sharing an industry-specific joke may be difficult for the general public to understand. In addition, greater psychological distance implies reduced personal relevance of the brand’s messages (e.g., a mining company’s post about a new extraction breakthrough), which lowers consumers’ motivation to invest cognitive effort in processing the content. Together, these factors suggest that higher psychological distance reduces consumers’ coping potential for brand messages, making them less likely to experience interest in response to content with greater novelty and complexity. These considerations lead to the following hypothesis:
Given that consumer interest is a key driver of social media engagement, the following hypothesis is proposed:
Finally, considering the mediating role of consumer interest and the moderating influence of brand psychological distance, the following hypothesis reflects the proposed moderated mediation mechanism:
Dataset
This research aimed to compile a comprehensive set of brands suitable for assessing social media content and corresponding engagement metrics. To ensure both the breadth of brand representation and the generalizability of the findings, the Brand Finance Global 500 Report (Brand Finance 2022) was referenced to identify the world’s 500 most valuable brands across various sectors. The focus was placed on brand-generated posts on Twitter (now called X), given the platform’s global reach and diverse user base, which includes both brands and consumers (Hartmann et al. 2021; McShane et al. 2021). I selected brands that maintain active Twitter profiles and primarily communicate in English. The Twitter handles of these qualified brands were manually collected, resulting in a dataset of 402 brands representing 33 different sectors.
Based on these Twitter handles, I used the Twitter application programming interface (API; available at https://developer.x.com/en/docs/twitter-api) to access each brand’s most recent 3,200 tweets. Data were collected in February 2023, when 3,200 was the maximum number of posts the Twitter API allowed to be retrieved per account. To ensure that the dataset contained only original content intended for a general consumer audience, retweets and replies were excluded. Tweets posted in 2023 were also excluded, as they were considered too recent to have accumulated sufficient engagement. Furthermore, I omitted multi-image tweets and focused solely on single-image tweets to ensure that any observed engagement outcomes could be attributed to the interplay between the focal image and its accompanying text. Notably, 98.28% of these single-image tweets included meaningful textual captions (i.e., captions containing at least one English word). These single-image tweets with captions (N = 95,155) were retained and served as the primary data source for the analysis. See Table W1 in Web Appendix A for a summary of the sample dataset, showing the distribution of brands across various sectors, the number of tweets analyzed per sector, and the range of years during which the tweets were posted for each sector.
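The sample-selection rules described above can be sketched as a single filter function. This is a minimal illustration, not the pipeline used in this research: the record layout (`is_retweet`, `is_reply`, `created_at`, `image_urls`, `text`) is hypothetical and does not mirror the Twitter API's response schema, and the check for "at least one English word" is approximated by looking for any alphabetic token once URLs are stripped.

```python
import re
from datetime import datetime

URL_PATTERN = re.compile(r"https?://\S+")
WORD_PATTERN = re.compile(r"[A-Za-z]+")  # rough stand-in for "an English word"


def is_eligible(tweet: dict) -> bool:
    """Return True if a tweet passes the selection rules described in the text.

    The dictionary keys are hypothetical field names chosen for illustration.
    """
    if tweet["is_retweet"] or tweet["is_reply"]:
        return False  # keep only original brand content
    if tweet["created_at"].year >= 2023:
        return False  # too recent to have accumulated engagement
    if len(tweet["image_urls"]) != 1:
        return False  # single-image tweets only
    # Strip URLs first so that a link alone does not count as a caption.
    text_without_urls = URL_PATTERN.sub("", tweet["text"])
    return WORD_PATTERN.search(text_without_urls) is not None


example = {
    "is_retweet": False,
    "is_reply": False,
    "created_at": datetime(2022, 5, 1),
    "image_urls": ["photo.jpg"],
    "text": "Ready for the weekend https://t.co/x",
}
print(is_eligible(example))
```

A stricter implementation would validate tokens against an English dictionary rather than a character pattern.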
Measurement of Variables
Dependent Variable: Social Media Engagement
To capture the full spectrum of consumer interaction on Twitter, three types of engagement were considered: likes, shares (retweets), and comments. Each form of engagement represents a distinct dimension of consumer motivation. Likes signal appreciation of the content, shares reflect consumer-driven word of mouth, and comments indicate a willingness to participate in conversations and express opinions. Together, these metrics reflect the degree to which consumers are involved with and responsive to brand messages.
Accordingly, the dependent variable (i.e., social media engagement) was operationalized as the total number of likes, shares, and comments received by each tweet. This aggregated measure is consistent with prior research on social media engagement (similar to Cruz, Leonhardt, and Pezzuti [2017] and Pezzuti, Leonhardt, and Warren [2021]).
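Written out, the aggregate measure is a simple sum. The subscript notation below is introduced here for illustration only (i indexes tweets) and is not the article's own:

```latex
\[
\text{Engagement}_i = \text{Likes}_i + \text{Shares}_i + \text{Comments}_i
\]
```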
Explanatory Variable: Caption Extension
Machine-learning-based measurement of caption extension
To quantify caption extension at scale, a machine-learning-based approach was employed. Figure 2 presents an overview of the measurement process and serves as a visual guide to the methodology described next.

Figure 2. Measurement Process of Caption Extension.
The first step involves identifying the visible entities within each image. For this purpose, I used the automatic annotation function provided by the Google Cloud Vision API (https://cloud.google.com/vision/), which is powered by the Inception-v3 deep convolutional neural network. Recent academic research has validated the high accuracy of this API in processing visual content in images (Ceylan, Diehl, and Proserpio 2023; Li and Xie 2020). Hence, I applied the API’s “Detect Labels” function (https://cloud.google.com/vision/docs/labels) to annotate all visually recognizable entities in each image. For example, in an image of a person meditating in the mountains (see Figure 3), the API generated labels such as “mountain,” “sky,” and “people in nature.”

Figure 3. Example of Smaller and Greater Caption Extension.
Once the image labels were extracted, the next step was to assess the degree to which the accompanying caption diverged from the image’s visible content. This was operationalized as the semantic distance between the image labels and the caption. Conceptually, a greater semantic distance indicates greater caption extension. For example, if the caption for the image in Figure 3 includes words such as “mountain,” “tree,” and “sunlight,” which are semantically aligned with the image’s visible entities, it reflects smaller caption extension. In contrast, a caption referencing “emails,” “meetings,” and “phone calls”—terms semantically unrelated to the visible entities—would reflect greater caption extension.
To compute this semantic distance, I first cleaned the text by removing stop words, expanding contractions (e.g., converting “don’t” to “do not”), and retaining only English vocabulary. Both sets of words (i.e., those derived from image labels and those from the cleaned captions) were then transformed into their high-dimensional vector representations (i.e., embeddings). Specifically, I used the pretrained Google News Word2Vec model, which provides 300-dimensional embeddings trained on roughly 100 billion words and is particularly suited for analyzing discrete words and short phrases (Mikolov et al. 2013). The word embeddings in each set were averaged to create a single semantic vector for the image labels and another for the caption. Caption extension was then calculated as the cosine distance between these two averaged vectors:
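Assuming the standard definition of cosine distance (one minus cosine similarity), the calculation can be written as follows. The symbols are notation introduced here for illustration: the first vector is the average of the image-label embeddings and the second is the average of the caption-word embeddings.

```latex
\[
\text{CaptionExtension}
  = 1 - \frac{\bar{\mathbf{v}}_{\text{img}} \cdot \bar{\mathbf{v}}_{\text{cap}}}
             {\lVert \bar{\mathbf{v}}_{\text{img}} \rVert \,
              \lVert \bar{\mathbf{v}}_{\text{cap}} \rVert}
\]
```

Under this definition, the score is 0 when the two averaged vectors point in the same direction and grows as they diverge.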
Higher scores indicate greater levels of caption extension. For example, as shown in Figure 3, when a caption closely references visible entities in the image, the cosine distance is relatively low (e.g., .35), reflecting a smaller degree of caption extension. Conversely, when the caption diverges from a direct description of the visible entities, the cosine distance increases (e.g., .83), indicating greater caption extension.
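The averaging and distance steps can be sketched in a few lines. The example below is a minimal illustration that assumes cosine distance is computed as one minus cosine similarity. It uses a toy three-dimensional embedding table invented for the example, not the 300-dimensional Google News Word2Vec vectors, and the function names are hypothetical.

```python
import math


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def caption_extension(label_words, caption_words, embeddings):
    """Distance between the averaged label vector and averaged caption vector.

    Words missing from the embedding table are skipped; both word lists are
    assumed to contain at least one word that is present.
    """
    label_vecs = [embeddings[w] for w in label_words if w in embeddings]
    caption_vecs = [embeddings[w] for w in caption_words if w in embeddings]
    return cosine_distance(average(label_vecs), average(caption_vecs))


# Toy embeddings, invented purely for illustration.
TOY = {
    "mountain": [0.9, 0.1, 0.0],
    "sky": [0.8, 0.2, 0.1],
    "tree": [0.7, 0.2, 0.1],
    "emails": [0.05, 0.9, 0.3],
    "meetings": [0.1, 0.8, 0.4],
}

near = caption_extension(["mountain", "sky"], ["tree"], TOY)
far = caption_extension(["mountain", "sky"], ["emails", "meetings"], TOY)
print(near < far)
```

In this toy setup, the caption that names scenery yields a smaller score than the caption about office tasks, mirroring the contrast drawn in Figure 3.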
Validation studies: Human perception alignment and contextual appropriateness
I conducted two validation studies to assess the reliability of the computational measure of caption extension. The first study (see details in Web Appendix B) tested alignment with human perception. A total of 600 tweets, consisting of 200 each with smaller, moderate, and greater levels of caption extension, were randomly paired (smaller vs. moderate, moderate vs. greater) to form 400 tweet pairs. For each pair, ten participants from Amazon Mechanical Turk (MTurk) selected the caption they felt extended further beyond the paired image. The tweet chosen by the majority was treated as the collective human judgment. The algorithm’s selections matched human choices in 64.25% of cases, exceeding the 60% benchmark established in prior work (Overgoor et al. 2022), indicating acceptable alignment with human perception.
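The agreement statistic can be illustrated with a short sketch. The data layout and function names are hypothetical, and the tie-handling rule is an assumption, since the text does not say how evenly split votes among the ten raters were resolved.

```python
from collections import Counter


def majority_choice(votes):
    """Return the option chosen by most raters.

    Ties are resolved in favor of the option seen first; this rule is an
    assumption made for the sketch.
    """
    return Counter(votes).most_common(1)[0][0]


def agreement_rate(pairs):
    """Share of pairs where the algorithm's pick equals the human majority.

    Each element of `pairs` is (algorithm_choice, list_of_rater_votes).
    """
    matches = sum(1 for algo, votes in pairs if algo == majority_choice(votes))
    return matches / len(pairs)


# Four made-up tweet pairs, each judged by ten raters choosing "A" or "B".
pairs = [
    ("B", ["B"] * 7 + ["A"] * 3),
    ("B", ["A"] * 6 + ["B"] * 4),
    ("A", ["A"] * 9 + ["B"] * 1),
    ("B", ["B"] * 8 + ["A"] * 2),
]
print(agreement_rate(pairs))
```

Applied to the 400 tweet pairs in the study, the same calculation would produce the reported agreement percentage.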
The second study (see details in Web Appendix B) tested whether greater caption extension undermines contextual appropriateness. The same 600 tweets were rated individually by MTurk participants (10 per tweet) on a seven-point scale (1 = “very inappropriate,” and 7 = “very appropriate”). One-sample t-tests showed that appropriateness ratings for the smaller, moderate, and greater caption extension groups were all significantly above the midpoint of 4 (Msmaller = 6.16, t(1,999) = 106.60, p = .00; Mmoderate = 6.04, t(1,999) = 93.03, p = .00; Mgreater = 5.94, t(1,999) = 83.72, p = .00), confirming that even captions with greater extension are perceived as contextually appropriate.
Moderator: Brand Psychological Distance
The next step involved assessing consumer perceptions of each brand’s psychological distance. While direct ratings for each sampled brand would have been ideal, this approach proved problematic because many brands in the dataset (e.g., localized brands such as Électricité de France) were unfamiliar to Prolific respondents based primarily in the United Kingdom and United States. To overcome this issue, psychological distance was measured at the sector level instead. This adjustment rests on the assumption that brands within a given sector tend to share similar characteristics and are thus perceived as similarly proximal or distant (e.g., chemical sector brands are generally perceived as more distant than food brands).
Following the classification by Brand Finance (2022), all sampled brands were grouped into 33 sectors. To assess psychological distance, a crowdsourcing approach was implemented using a “pair rank” survey. This method transformed a list of sector brands into a series of head-to-head comparisons (similar to the approach in Lee [2021]). Participants were shown pairs of sector brands randomly drawn from the sample and asked to indicate which of the two they perceived as more psychologically distant. Each option included a brief description to support understanding and enable more informed judgments. Participants could skip comparisons if unsure. An example of the task is shown in Figure W3 in Web Appendix B. Each sector brand’s “rank score” (ranging from 0 to 100), based on the percentage of wins across its total pairwise comparisons, served as a proxy for perceived psychological distance. The survey was administered via Prolific to 500 participants (Mage = 31.75 years; 56.4% female, 42.8% male, and .8% others), with each participant completing ten comparisons. The resulting psychological distance scores for each sector brand are presented in Table W2 in Web Appendix B.
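The rank score described above is a win percentage over head-to-head comparisons. The following sketch shows one way to compute it. The sector names and outcomes are invented, the function name is hypothetical, and excluding skipped comparisons from the denominator is an assumption.

```python
from collections import defaultdict


def rank_scores(comparisons):
    """Compute a 0-100 rank score per sector from pairwise comparisons.

    Each comparison is (sector_a, sector_b, winner), where `winner` is the
    sector judged more psychologically distant, or None if the participant
    skipped the pair. Skipped pairs are ignored (an assumption).
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for sector_a, sector_b, winner in comparisons:
        if winner is None:
            continue
        appearances[sector_a] += 1
        appearances[sector_b] += 1
        wins[winner] += 1
    return {s: 100.0 * wins[s] / appearances[s] for s in appearances}


# Invented comparisons for illustration only.
toy_comparisons = [
    ("chemicals", "food", "chemicals"),
    ("chemicals", "airlines", "chemicals"),
    ("food", "airlines", "airlines"),
    ("food", "chemicals", None),  # skipped by the participant
]
scores = rank_scores(toy_comparisons)
```

A higher score means the sector was more often judged the more distant of the two options it appeared with.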
Control Variables
I identified and measured a comprehensive set of control variables, including image-specific features (i.e., presence of faces, logos, colorfulness, saturation, visual complexity, and sensitive content), caption-specific features (i.e., caption length, number of URLs, @-mentions, hashtags, emojis, emotional intensity and valence, certainty, social and sensory language, and concreteness), brand-specific features (i.e., log-transformed follower and following counts, posting frequency, and account verification), and timing-specific features (i.e., weekend indicator and monthly fixed effects). The selection of these control variables was guided by prior research linking them to social media engagement or by theoretical rationale (details are provided in Web Appendix B).
Descriptive statistics for noncategorical variables used in the analysis are provided in Table W3 in Web Appendix B.
Model Specifications and Primary Empirical Findings
The outcome variable, engagement volume, is a count variable consisting of nonnegative integers, and as shown in the descriptive statistics (Table W3 in Web Appendix B), it exhibits a pronounced long-tail distribution and substantial variability in magnitude. Given these distributional properties, a negative binomial regression model was employed, following the approach of prior studies (see Hartmann et al. 2021; Hughes, Swaminathan, and Brooks 2019; Ordenes et al. 2019; Overgoor et al. 2022). This model is an extension of the Poisson regression model, which assumes equal mean and variance, a constraint often too restrictive for real-world data. Negative binomial regression relaxes this assumption by allowing for a variance that exceeds the mean, making it well-suited for handling the overdispersion commonly observed in social media engagement volume (Rietveld et al. 2020).
To test the effect of caption extension and the moderating role of brand psychological distance, two models were estimated: a baseline model without controls for parsimony, and a full model with controls to account for potential confounds. The model estimation results are presented in Table 4 (full numerical results, including exact p-values, are presented in Table W4 of Web Appendix C). In the baseline model (Column 1), caption extension was positively associated with engagement, while the interaction term between caption extension and psychological distance was negative and significant. In the full model with all control variables (Column 2), the results remained consistent. Caption extension continued to show a positive and significant association with engagement, and the interaction term remained negative and significant. These findings suggest that while greater caption extension is associated with increased consumer engagement, this relationship tends to be weakened by increased brand psychological distance.
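For readers who prefer a formal statement, the full model can be written roughly as follows. The notation is an assumption introduced here for illustration: $Y_i$ is the engagement volume of tweet $i$, $\text{Ext}_i$ its caption extension, $\text{Dist}_{s(i)}$ the psychological distance score of the tweet's sector, and $\mathbf{X}_i$ the vector of controls. The quadratic variance function shown is the common NB2 parameterization, which the text does not explicitly confirm.

```latex
\begin{aligned}
Y_{i} &\sim \text{NegBin}(\mu_{i}, \alpha), \qquad
\operatorname{Var}(Y_{i}) = \mu_{i} + \alpha\,\mu_{i}^{2}, \\
\log \mu_{i} &= \beta_{0} + \beta_{1}\,\text{Ext}_{i} + \beta_{2}\,\text{Dist}_{s(i)}
  + \beta_{3}\,\bigl(\text{Ext}_{i} \times \text{Dist}_{s(i)}\bigr)
  + \boldsymbol{\gamma}^{\top}\mathbf{X}_{i}
\end{aligned}
```

Under this sketch, the baseline model drops the $\boldsymbol{\gamma}^{\top}\mathbf{X}_{i}$ term, the reported positive association corresponds to $\beta_{1} > 0$, and the negative moderation corresponds to $\beta_{3} < 0$. Setting $\alpha = 0$ recovers the Poisson case with equal mean and variance.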
Summary of Model Estimation Results: Main Analysis.
Notes: Standard errors and p-values in parentheses; detailed numerical results are presented in Table W4 in Web Appendix C.
To illustrate the moderating role of brand psychological distance, a margins plot (see Figure 4, Panel A) was generated based on the estimation results from the full regression model. This visualization reinforces the statistical findings: for brands perceived as psychologically proximal (represented by black and dark gray lines), greater caption extension is associated with higher consumer engagement, as indicated by upward-sloping lines. As psychological distance increases (indicated by progressively lighter shades of gray), this positive association diminishes, as reflected in the increasingly flatter slopes. Notably, when the brand psychological distance score exceeds approximately 50, the slopes begin to turn downward, indicating that for highly distant brands, greater caption extension is associated with lower engagement.

The Moderating Effect of Brand Psychological Distance.
These patterns reveal the boundary conditions of the caption extension–engagement relationship: while greater extension tends to boost engagement for psychologically proximal brands, the effect weakens and may reverse for distant ones.
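The reversal point can be reasoned about directly from a model with an interaction term: on the log scale of expected engagement, the slope of caption extension at a given distance is the main-effect coefficient plus the interaction coefficient times the distance. The sketch below uses hypothetical coefficients chosen only so that the slope changes sign at a distance of 50, in line with the pattern described for Figure 4; they are not the estimates from Table 4, and the function names are illustrative.

```python
def conditional_slope(b_extension, b_interaction, distance):
    """Slope of caption extension on log expected engagement at a given distance."""
    return b_extension + b_interaction * distance


def crossover_distance(b_extension, b_interaction):
    """Distance score at which the extension slope changes sign (None if no interaction)."""
    if b_interaction == 0:
        return None
    return -b_extension / b_interaction


# Hypothetical coefficients for illustration only (not the paper's estimates).
b_extension = 0.50
b_interaction = -0.01

slope_proximal = conditional_slope(b_extension, b_interaction, 20)
slope_distant = conditional_slope(b_extension, b_interaction, 80)
crossover = crossover_distance(b_extension, b_interaction)
```

With these illustrative values, the slope is positive for a brand with a distance score of 20 and negative for one with a score of 80.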
Robustness Checks
To assess the stability of the findings and address causality concerns, I conducted a series of robustness checks.
Analysis After Excluding Observations with Ultra-High Caption Extension
By definition, caption extension should not undermine contextual appropriateness. To further rule out potential inappropriateness at ultra-high levels of caption extension, tweets in the top 1%, 5%, and 10% of extension scores were excluded, and the models were reestimated using negative binomial regression with and without control variables.
As shown in Table 5 (detailed results in Table W5, Web Appendix D), the results remained consistent with the main findings. Margins plots (Figure 4, Panel B) also confirmed the same pattern.
Robustness Check by Excluding Observations with Ultra-High Caption Extension.
Notes: Standard errors and p-values in parentheses; detailed numerical results are presented in Table W5 in Web Appendix D.
Decomposing Analysis by Engagement Type
As proposed in the “Theoretical Development” section, greater caption extension is expected to influence overall engagement by simultaneously affecting three common forms of consumer interaction on social media: liking, sharing, and commenting. To test this, each form of engagement was analyzed separately using negative binomial regressions, both with and without control variables. As shown in Table 6 (detailed results in Table W6, Web Appendix D), greater caption extension is consistently associated with increases in all three engagement behaviors, while its interaction with brand psychological distance remains negative.
Robustness Check by Engagement Type.
Notes: Standard errors and p-values in parentheses; detailed numerical results are presented in Table W6 in Web Appendix D.
Margins plots (Figure 4, Panel C) showed that greater caption extension was associated with higher engagement for psychologically proximal brands, with this positive relationship weakening as psychological distance increased. Notably, while the effect reversed for likes and shares at higher distance levels, it only plateaued for comments. This suggests that extended captions may still encourage discussion, even when they reduce likability or shareability.
Analysis Using Alternative Analytical Approaches
To further ensure robustness, the analysis was replicated using three alternative models suitable for count-based, skewed engagement data. First, log-linear ordinary least squares regressions were estimated with log-transformed engagement (adding 1 to address 0s), offering interpretable multiplicative effects. Second, Poisson regressions were used as a benchmark, despite their restrictive assumption of equal mean and variance. Third, zero-inflated negative binomial (ZINB) models were employed to account for excess zeros, with LogFollower used as the inflate variable to reflect content visibility likelihood.
Across all three specifications, the results remained consistent (Table 7; detailed numerical results in Table W7 in Web Appendix D, and margins plots in Figure 4, Panel D).
Robustness Check Using Alternative Analytical Approaches.
Notes: Standard errors and p-values in parentheses; detailed numerical results are presented in Table W7 in Web Appendix D.
Validating Causality Using Propensity Score Matching
To address causality concerns, a propensity score matching approach was used to create two comparable groups of tweets: one with greater caption extension (above the mean) and one with smaller extension (below the mean). Tweets were matched on a broad set of covariates (image-, caption-, brand-, and timing-related), ensuring balance on all factors except caption extension. This quasi-experimental design approximates a randomized controlled trial and strengthens causal interpretation of the caption extension–engagement relationship.
Brands were categorized by psychological distance into low (<33), medium (33–66), and high (≥67) segments. For each segment, separate probit regressions were estimated to model the likelihood of a tweet receiving a greater level of caption extension, using all covariates from the main analysis (excluding month fixed effects). Full model outputs are presented in Tables W8–W10 (Web Appendix D). Nearest-neighbor matching without replacement was performed using a caliper of .01. Match quality was evaluated through covariate balance tests, which revealed no significant differences between treated and control groups across all three psychological distance segments, confirming effective matching. Detailed balance statistics are reported in Tables W11–W13 (Web Appendix D).
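The matching step can be sketched as a greedy one-to-one nearest-neighbor search on estimated propensity scores. This is a simplified illustration under several assumptions: the scores are taken as given (in the study they come from probit regressions), the caliper is applied on the raw propensity-score scale, treated tweets are processed in ascending score order, and the function name and tweet identifiers are hypothetical.

```python
def match_nearest_neighbor(treated, control, caliper=0.01):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    `treated` and `control` map tweet ids to propensity scores. A treated
    tweet is matched to the closest unused control tweet, provided the
    score gap does not exceed the caliper; otherwise it stays unmatched.
    """
    available = dict(control)
    matches = {}
    for t_id, t_score in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            matches[t_id] = c_id
            del available[c_id]  # without replacement: each control is used once
    return matches


# Invented propensity scores for illustration only.
treated_scores = {"t1": 0.300, "t2": 0.305, "t3": 0.700}
control_scores = {"c1": 0.302, "c2": 0.310, "c3": 0.500}
matched_pairs = match_nearest_neighbor(treated_scores, control_scores)
```

In this toy example, `t3` remains unmatched because no control tweet lies within the caliper, which illustrates how a tight caliper trades sample size for closer comparability.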
To estimate the treatment effect, log-transformed engagement outcomes were compared between the treated and control groups (see detailed results in Table W14 in Web Appendix D). For brands with low psychological distance, tweets with greater caption extension generated higher engagement, with a marginally significant average treatment effect on the treated (difference = .02, SE = .01, t = 1.54, one-tailed p = .06). At medium psychological distance, the effect weakened and became marginally negative (difference = −.01, SE = .01, t = −1.30, one-tailed p = .09). For highly distant brands, the effect turned significantly negative (difference = −.04, SE = .02, t = −2.52, one-tailed p = .01). These results provide causal evidence that the positive effect of greater caption extension on engagement weakens, and may reverse, as brand psychological distance increases.
Experimental Replication and Underlying Mechanisms
To establish a more robust causal inference between caption extension and consumer engagement on social media, and to examine the underlying mechanisms, a preregistered online experiment was conducted (preregistration link: https://aspredicted.org/6tkn-56z2.pdf). The experiment followed a 2 (caption extension level: smaller vs. greater) × 2 (brand psychological distance: low vs. high) between-subjects design.
In addition to testing the hypothesized mediating role of consumer interest in linking caption extension to consumer engagement, the experiment also examined two additional variables (i.e., processing fluency and entertainment) to rule out their potential roles as alternative mediators. First, as discussed in the “Theoretical Development” section, caption extension—by introducing content that goes beyond what is visually depicted in the image—may disrupt the perceptual match between image and text. Even if the caption remains conceptually aligned with the image, it may still reduce processing fluency (i.e., the ease with which information is processed; Alter and Oppenheimer 2009; Graf, Mayer, and Landwehr 2018). Lower fluency has been shown to negatively affect consumer responses (Ceylan, Diehl, and Proserpio 2023; Heckler and Childers 1992; Jaud and Melnyk 2020). Second, entertainment was analyzed as a possible mediator. Although both interest and entertainment are positive emotions, they are conceptually distinct (Weidman and Tracy 2020): Entertainment emphasizes amusement and enjoyment (Elpers, Wedel, and Pieters 2003), whereas interest is more closely associated with curiosity and the drive to seek information (Silvia 2005, 2001). Greater caption extension may enhance entertainment by introducing unexpected or thought-provoking elements (Smith and Yang 2004), which can increase engagement (De Vries, Gensler, and Leeflang 2012).
Material Preparation and Pretests
Three brand pairs were selected for the experiment: (1) automobile versus car rental service (both named “Velox”), (2) pharmaceutical versus chemical (both named “NovaCore”), and (3) airline versus aerospace and defense (both named “AeroNova”). These brand pairs were chosen for two key reasons: First, each brand pair consisted of one brand perceived as psychologically proximal and one perceived as psychologically distant. Based on survey ratings from the secondary data (see Table W2 in Web Appendix B), automobile, pharmaceutical, and airline brands were perceived as psychologically proximal (average scores below 50), while car rental, chemical, and aerospace and defense brands were seen as psychologically distant (scores above 50). Second, pairing brands within similar functional domains made it possible to use identical image content across caption extension conditions, which helped control for potential confounding effects due to image variation. For instance, both automobile and car rental brands could use the same image of a person driving with a scenic view ahead (see Table 8).
Brands, Images, and Captions Used in the Experiment.
To accompany each image, two sets of captions were developed to reflect smaller and greater levels of caption extension. Captions in the smaller-extension condition directly referred to visible elements in the image, whereas captions in the greater-extension condition did not explicitly mention these visual elements. The two caption sets were matched in length to control for potential confounding effects (see details in Table 8).
A pretest was conducted on Prolific (N = 120; Mage = 31.09 years; 50% female, 50% male), which showed no significant differences between the smaller- and greater-extension captions in terms of perceived content quality (i.e., being well-written; Msmaller ext. = 4.98, Mgreater ext. = 5.08; t(118) = −.34, p = .74), appropriateness of pairing with the image (Msmaller ext. = 5.81, Mgreater ext. = 5.48; t(118) = 1.21, p = .23), informativeness (Msmaller ext. = 3.95, Mgreater ext. = 3.62; t(118) = .97, p = .34), emotional intensity (Msmaller ext. = 3.32, Mgreater ext. = 3.72; t(118) = −1.22, p = .23), or emotional valence (Msmaller ext. = 5.53, Mgreater ext. = 5.72; t(118) = −.77, p = .44). A significant difference emerged in the degree of perceived caption extension between the two groups (Msmaller ext. = 2.92, Mgreater ext. = 4.03; t(118) = −3.13, p = .00), confirming that the manipulation was effective while holding other key attributes constant across conditions.
Experiment Procedures
In the main experiment, 1,200 participants were recruited through Prolific (Mage = 32.49 years; 47.92% female, 51.83% male, .25% others). The sample size was determined through a power analysis based on prior trials with smaller samples. All participants were regular Twitter users and proficient in English. Each participant was randomly assigned to one of the proximal brands (automobile, pharmaceutical, or airline) or one of the distant brands (car rental service, chemical, or aerospace and defense). To enhance ecological validity, participants read a brief description of the assigned brand and were asked to imagine following the brand on Twitter. They then viewed a tweet from the brand, with the level of caption extension randomly assigned to either smaller or greater. To reinforce the brand’s association with the tweet content, each caption included two hashtags related to the brand type. Details of all 12 image-featured tweets used in the experiment are provided in Table W15 in Web Appendix E.
After viewing the tweet, participants rated their willingness to engage with it using a three-item scale (e.g., “How likely would you be to [like/share (i.e., retweet)/comment on] this tweet?”; 1 = “very unlikely,” and 7 = “very likely”; α = .86; McShane et al. 2021). They also assessed their interest in the tweet using an eight-item scale (e.g., “I felt interested in the tweet”; 1 = “strongly disagree,” and 7 = “strongly agree”; α = .92; see the full scale in Web Appendix E; Weidman and Tracy 2020). Additionally, processing fluency was measured using a single item (“Processing this tweet’s content is…”; 1 = “very difficult,” and 7 = “very easy”; Stuppy, Landwehr, and McGraw 2024), and entertainment was assessed with a three-item scale (e.g., “This tweet is [exciting/enjoyable/entertaining]”; 1 = “not at all,” and 7 = “very much”; α = .92; Eigenraam, Eelen, and Verlegh 2021). Engagement likelihood, consumer interest, and entertainment value were calculated as the average of responses across items.
Two manipulation check questions followed. First, participants rated their perceived psychological distance from the brand using the Inclusion of Other in Self scale (see Figure W4 in Web Appendix E; Choi and Winterich 2013). Second, participants rated the caption’s level of extension by answering the question, “To what extent does the tweet’s text extend beyond what is visually depicted in the image?” (1 = “not at all,” and 7 = “very much”). An attention check was included at the end to ensure participant attentiveness, and responses from those who failed this check were excluded from the analysis. Recruitment continued until the target number of qualified submissions was reached.
Results
Manipulation checks
Between-group t-tests confirmed the effectiveness of the manipulations for both caption extension (Msmaller ext. = 2.91, Mgreater ext. = 4.06; t(1,198) = −12.60, p = .00) and brand psychological distance (Mlow dist. = 3.75, Mhigh dist. = 4.34; t(1,198) = −5.85, p = .00).
Main effect and moderation of brand psychological distance
A two-way ANOVA was conducted to examine the effects of caption extension, brand psychological distance, and their interaction on engagement, which revealed a significant main effect of caption extension on engagement (F(1, 1,196) = 8.61, p = .00), as well as a significant interaction between caption extension and brand psychological distance (F(1, 1,196) = 12.68, p = .00), indicating that the positive effect of caption extension on engagement is moderated by brand psychological distance.
To further explore this interaction, one-way ANOVAs were conducted separately for proximal and distant brand conditions (results visualized in Figure 5, Panel A). For psychologically proximal brands, greater caption extension significantly increased willingness to engage (Msmaller ext. = 3.61, Mgreater ext. = 3.97, F(1, 607) = 8.00, p = .00). In contrast, for psychologically distant brands, greater caption extension significantly reduced willingness to engage (Msmaller ext. = 3.85, Mgreater ext. = 3.58, F(1, 589) = 4.82, p = .03). Together, these findings suggest that while greater caption extension can enhance engagement for brands perceived as psychologically close, its effect diminishes with increasing brand psychological distance and ultimately turns negative when the distance is high.

Interaction Effects of Caption Extension and Brand Psychological Distance on Engagement and Interest.
Moderated mediation of consumer interest
A two-way ANOVA revealed a significant main effect of caption extension on consumer interest (F(1, 1,196) = 17.35, p = .00). The interaction between caption extension and brand psychological distance was also significant (F(1, 1,196) = 23.24, p = .00), confirming that the impact of caption extension on interest is moderated by psychological distance. Furthermore, one-way ANOVAs were conducted separately for each brand condition (results visualized in Figure 5, Panel B). For psychologically proximal brands, greater caption extension significantly increased consumer interest (Msmaller ext. = 4.49, Mgreater ext. = 4.94, F(1, 607) = 16.18, p = .00). Conversely, for distant brands, greater caption extension significantly reduced consumer interest (Msmaller ext. = 4.60, Mgreater ext. = 4.31, F(1, 589) = 7.67, p = .01).
To test the moderated mediation, Hayes’s (2017) PROCESS Model 8 was used, with caption extension as the independent variable, engagement likelihood as the dependent variable, consumer interest as the mediator, and brand psychological distance as the moderator. The model allowed distance to moderate both the direct effect of caption extension and its indirect effect via interest.
First, the decomposed mediation pathway through consumer interest was examined. For the a-path (from caption extension to interest), results revealed a significant positive effect of caption extension on interest (bextension = .45, SE = .11, p = .00), which was significantly moderated by brand psychological distance (bextension × distance = −.74, SE = .15, p = .00), indicating that the positive effect of caption extension on interest decreases as brand psychological distance increases. For the b-path (from interest to engagement), consumer interest had a strong positive effect on engagement (binterest = .89, SE = .02, p = .00).
Next, the conditional indirect effect of caption extension on engagement via consumer interest was tested, accounting for brand psychological distance. For psychologically proximal brands (distance dummy = 0), the indirect effect was positive and significant (bindirect = .40, 95% CI: [.20, .59]). In contrast, for psychologically distant brands (distance dummy = 1), the indirect effect was negative and significant (bindirect = −.26, 95% CI: [−.44, −.07]). The index of moderated mediation was also significant (bmoderation = −.66, 95% CI: [−.93, −.39]), providing evidence that brand psychological distance moderates the indirect effect of caption extension on engagement through consumer interest.
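The reported conditional indirect effects can be checked against the path coefficients. In a first-stage moderated mediation model with a dummy moderator, the indirect effect at moderator value W is the a-path at W multiplied by the b-path, and the index of moderated mediation is the a-path interaction multiplied by the b-path. The sketch below applies this arithmetic to the rounded coefficients reported in the text; the function and variable names are illustrative, and the bootstrap confidence intervals are not reproduced.

```python
def conditional_indirect_effect(a_extension, a_interaction, b_mediator, moderator):
    """Indirect effect via the mediator at a given moderator value.

    The a-path is a_extension + a_interaction * moderator; the b-path
    is assumed not to depend on the moderator.
    """
    return (a_extension + a_interaction * moderator) * b_mediator


# Rounded path coefficients as reported in the text.
a_extension = 0.45     # caption extension -> interest
a_interaction = -0.74  # extension x distance -> interest
b_mediator = 0.89      # interest -> engagement

indirect_proximal = conditional_indirect_effect(a_extension, a_interaction, b_mediator, 0)
indirect_distant = conditional_indirect_effect(a_extension, a_interaction, b_mediator, 1)
index_moderated_mediation = a_interaction * b_mediator
```

Rounded to two decimals, these products come to .40, −.26, and −.66, matching the reported point estimates. The index also equals the difference between the distant and proximal indirect effects.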
Ruling out alternative explanations
To rule out the potential mediating roles of processing fluency and entertainment, these variables, alongside consumer interest, were entered as three parallel mediators in Hayes’s (2017) PROCESS Model 8, with caption extension as the independent variable, engagement likelihood as the outcome variable, and brand psychological distance as a moderator.
Results showed that consumer interest significantly mediated the relationship between caption extension and engagement, with a positive indirect effect for psychologically proximal brands (bindirect = .28, 95% CI: [.14, .43]) and a negative indirect effect for distant brands (bindirect = −.18, 95% CI: [−.32, −.05]). The index of moderated mediation was significant (bmoderated mediation = −.47, 95% CI: [−.67, −.28]), confirming that the mediating effect of interest was contingent on psychological distance.
In contrast, the indirect effect through processing fluency was minimal and significant only for distant brands (proximal brands: bindirect = .01, 95% CI: [−.00, .03]; distant brands: bindirect = .02, 95% CI: [.00, .04]). The index of moderated mediation was not significant (bmoderated mediation = .01, 95% CI: [−.01, .03]), indicating that the difference across conditions was not statistically meaningful. In addition, although entertainment significantly mediated the relationship across both brand conditions (proximal brands: bindirect = .14, 95% CI: [.06, .23]; distant brands: bindirect = .15, 95% CI: [.07, .24]), the index of moderated mediation remained nonsignificant (bmoderated mediation = .01, 95% CI: [−.10, .13]). Overall, these findings suggest that the observed interaction between caption extension and brand psychological distance on engagement is specifically driven by consumer interest rather than by processing fluency or entertainment.
General Discussion
In summary, this research employs a mixed-methods approach to provide robust empirical support for the proposed hypotheses (H1–H6), clarifying when caption extension is most effective and uncovering the psychological mechanisms behind its impact. The following subsections discuss the research’s contributions, implications, and limitations.
Conceptual, Theoretical, and Methodological Contributions
Conceptually, this research introduces a relatively novel and underexplored construct: caption extension. This construct is conceptually distinct from previously studied notions such as image–text incongruence, caption creativity, and caption abstractness. The research distinguishes caption extension from these related constructs in detail and supports the distinction with concrete examples.
Centered on the construct of caption extension, this research proposes and empirically tests a set of theoretical hypotheses within the context of brand-generated social media posts. The findings contribute to several streams of research.
First, this work contributes to the literature on multimodal brand-generated content (e.g., image paired with text; Heckler and Childers 1992; Houston, Childers, and Heckler 1987; Jaud and Melnyk 2020; Lee and Choi 2019). Prior studies have predominantly examined brand-initiated image–text content within traditional marketing contexts (e.g., print advertisements, packaging), with a focus on how image–text mismatch or incongruence affects consumer attitudes and responses. These studies generally suggest that textual and visual elements should be as perceptually consistent as possible to promote processing fluency and elicit favorable consumer reactions. Contrary to this perspective, the present research finds that captions that do not perfectly match the image at a perceptual level can also enhance consumer engagement on social media, especially when the brand is perceived as psychologically close to consumers.
Second, the research advances understanding in the domains of social media marketing and consumer engagement (Hernández-Ortega et al. 2022; McShane et al. 2021; Pezzuti, Leonhardt, and Warren 2021). It highlights that social media content is often multimodal (e.g., combining images with text, videos with text, or even images with videos; Holiday et al. 2023). While prior studies in this area have largely focused on single-format content, relatively little attention has been given to how different content formats interact to shape consumer perceptions and behaviors. This research empirically demonstrates that the relationship between visual and textual elements significantly influences consumer engagement, suggesting that the interplay between modalities is a key driver of consumers’ online responses.
Third, this research identifies interest as a key emotional mechanism driving social media engagement. Drawing on Silvia’s (2005) work on the emotion of interest, it proposes and validates when and how consumer interest can be elicited through specific content strategies. While prior research has largely emphasized entertainment as the primary emotional driver of engagement (De Vries, Gensler, and Leeflang 2012), the findings presented here demonstrate that interest, although related to entertainment, is a distinct emotional response that also strongly predicts engagement.
Fourth, the findings highlight the importance of strategically aligning content with brand profiles to maximize engagement outcomes (Montaguti, Valentini, and Vecchioni 2023). While previous studies have examined the general impact of content features on engagement, they have rarely considered how these effects vary depending on brand type. The results of this research suggest that certain content features (e.g., caption extension) may enhance engagement for some brands while diminishing it for others. Future investigations may explore whether other known engagement drivers could have brand-specific limitations or even adverse effects.
Fifth, this research contributes to the branding literature, particularly in the area of consumer–brand relationships (Connors et al. 2021). The findings demonstrate that brand psychological distance significantly affects how consumers interpret brand-generated messages (e.g., whether they perceive the content as interesting) and, in turn, how they respond behaviorally.
Methodologically, this research introduces a machine-learning-based approach for measuring caption extension at scale. Validated through human evaluations, the method shows strong precision and representativeness in analyzing social media posts from established brands. While further validation is needed to confirm its applicability in other contexts (e.g., content from smaller brands), this approach offers a methodological foundation for future research to analyze multimodal content at scale.
Managerial Implications
To examine whether brands are already tailoring caption extension based on their own characteristics (i.e., psychological distance), an additional analysis was conducted using the secondary dataset. For each sector, the average caption extension score of brand-generated content was compared with the sector’s perceived psychological distance. A pairwise correlation analysis revealed no significant relationship between the two variables (r = .02, p = .90). This relationship is depicted in the scatterplot in Figure 6, where the trend line is nearly flat, indicating an absence of systematic variation. This indicates that even leading global brands may not yet be strategically adjusting their use of caption extension based on how psychologically proximal or distant they are perceived to be. It also highlights the timeliness and managerial relevance of the present research.

Real-World Application of Caption Extension by Brands (Grouped by Sector).
Since the effectiveness of caption extension depends on brand psychological distance, the recommendations offered here are context-sensitive. Brands should consider not only how captions relate to the visual content but also how psychologically proximal or distant they appear to consumers. For brands perceived as psychologically proximal (i.e., those consumers are familiar with and frequently purchase from), extended captions can be used more freely, as greater extension tends to enhance interest and boost engagement. Instead of merely describing the image, these brands can craft captions that expand the context, take a more imaginative or interpretive approach, and introduce new associations or prompt reflection.
By contrast, brands perceived as highly distant (i.e., those without direct consumer contact) should be cautious with extended captions. For these brands, captions that closely align with the visual, providing clear, direct, and descriptive messages, are likely more effective. Greater caption extension may fail to spark interest, potentially causing confusion or aversion, and ultimately weakening engagement.
Limitations and Future Research
While this research offers valuable insights, it has limitations in methodology, data scope, and context, each of which points to opportunities for future research.
First, this research focuses on social media engagement, operationalized through three widely used behavioral indicators: liking, sharing, and commenting. This approach is consistent with prior research (e.g., Pezzuti, Leonhardt, and Warren 2021). While the findings support the idea that greater caption extension enhances overall engagement by increasing consumer interest, the analysis does not unpack the distinct mechanisms behind each behavior. Given that liking, sharing, and commenting are likely driven by different psychological motivations, future research should explore the specific pathways through which caption extension affects each form of engagement.
Second, the analysis does not account for the communication purpose of individual captions (e.g., product announcements, event promotions, brand storytelling) as a potential moderator. The dataset includes brands from various sectors, such as automotive, beverages, airlines, chemicals, and apparel, each with distinct communication goals. Even within a sector, brand positioning can lead to different styles; for example, mainstream apparel brands may focus on affordability, while luxury brands highlight heritage or exclusivity. This variation made it difficult to create a standardized typology of communication purposes. Future research could examine how communication objectives interact with caption extension to influence engagement, possibly through sector-specific analyses.
Third, the empirical analysis focuses on Twitter, chosen for its popularity among brands and consumers and the availability of large-scale, structured data. The subsequent experiment also used brand tweets to replicate real-world dynamics in a controlled setting. While Twitter is a relevant platform, other social media sites such as Instagram, Facebook, or TikTok differ in content format, audience behavior, and engagement algorithms. Future research could test the generalizability of these findings across platforms to strengthen external validity. In addition, the dataset includes only large, established brands; examining smaller or lesser-known brands may reveal how brand scale influences consumer responses to caption extension.
Fourth, although the secondary dataset analysis includes a comprehensive set of control variables, the measure of brand popularity (i.e., log-transformed follower count at the time of data collection) does not reflect historical follower levels at the time each post was published. This variable serves as a practical proxy, consistent with previous work (e.g., Hartmann et al. 2021), yet it lacks temporal specificity. Access to more accurate historical data could enhance the robustness of the findings. Future research could explore whether changes in brand popularity over time influence the link between content characteristics and consumer engagement.
Supplemental Material
Supplemental material, sj-pdf-1-jnm-10.1177_10949968251352408 for How Do Image Captions Drive Consumer Engagement on Social Media? by Zitian Ada in Journal of Interactive Marketing
Footnotes
Ethical Considerations
All studies involving human participants were conducted in accordance with institutional ethical guidelines and approved by the appropriate ethics committee.
Consent to Participate
All participants provided informed consent prior to taking part in the studies. Participation was voluntary, and participants could withdraw at any time without consequence.
Editor
Arvind Rangaswamy
Associate Editor
Yakov Bart
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data that support the findings of this study are available from the author upon reasonable request.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.