Text summarization based on transformer using sentences ranking and time penalty

Abstract

Text summarization systems often struggle with selecting salient content, avoiding repetition, and handling out-of-vocabulary entities. We address these issues with a two-stage approach: a supervised sentence-ranking head (SRM-head) first selects the top- $N$ sentences, and a Transformer generator then produces the summary. The generator is augmented with a time penalty in encoder–decoder attention to discourage reattending to recently focused source positions, and with a pointer mechanism that copies salient spans, thereby improving entity and number fidelity. Experiments on CNN/DailyMail and WikiHow, plus an additional evaluation on XSum, show that our model attains competitive ROUGE scores against recent pretrained systems while using lightweight, modular components.

Keywords

text summarization; encoder–decoder; Seq–Seq; time penalty mechanism; pointer network

1. Introduction

Text summarization is the task of generating a short, refined text from an original article that captures the article's key information while reducing information overload. Current summarization techniques fall into two types: extractive and abstractive. Extractive methods build the summary by selecting important sentences or paragraphs from the source text.¹ They tend to produce grammatical, readable sentences that preserve the meaning of the original text, but the resulting summaries may be incoherent and contain redundant information. In contrast, abstractive methods are closer to how humans summarize, and the generated summary can contain words that never appear in the original text.²

Because abstractive summarization is difficult to implement, most early summarization techniques were extractive (e.g.,³,⁴), usually modeled as sentence ordering problems with length constraints. The neural sequence-to-sequence model has subsequently achieved good results in many text generation tasks (such as Deep Q-Network,⁵ Restricted Boltzmann Machine⁶), which makes abstractive summarization techniques feasible (e.g.,⁷,⁸). However, the input length of a sequence-to-sequence model is limited: when it is set too long, the model becomes difficult to train and its accuracy decreases. The common workaround is to delete the part of the article that exceeds the set length and use the remaining part as the model input. However, the deleted text may contain important information about the article. As shown in Figure 1, the 10 sentences selected by TextRank score higher than the text kept by the original truncation method, whose input contains only 5 important sentences. This shows that truncation greatly reduces the information retained from the article.

Figure 1.

An example from the CNN/Daily Mail dataset. The upper part of the figure shows the first 400 words of the article taken directly as the input of the summary generation model; the lower part shows the article ranked by TextRank, with its top 10 sentences selected as the input.

Besides, most sequence-to-sequence models are based on RNNs.⁹,¹⁰ While RNNs are very powerful at encoding sequences, their sequential nature makes training slow. The transformer¹¹ can process all words and symbols in a sequence in parallel, so it is worthwhile to study text summarization based on the transformer model. However, simply using the transformer to generate summaries produces many out-of-vocabulary (oov) words and repeated words. We therefore modify the transformer to make it more suitable for summary generation.

In view of the above problems, we divide the summary generation work into two sub-tasks: (1) Calculate the amount of information contained in each sentence in the original text, and then select N sentences with the highest amount of information to represent the original text in the order in which they appear in the original text. (2) Take the text obtained in (1) as the input of the summary generation model, and generate a summary of the original text based on the information carried by the N sentences.

Our contributions in this paper are summarized as follows:

We propose a supervised sentence ranking method for articles with titles, so that the text selected by this method retains more of the article's information.

We use the transformer model to generate summaries; the 32k-vocabulary model takes only 4.15 hours to train, whereas the traditional RNN-based summary generation model¹² usually takes 4.5 days.

We propose using a time penalty mechanism and a pointer network to improve the transformer model, alleviating the cases where the generated summary contains many oov words and repeated words. The time penalty mechanism lets different decoding steps focus on different parts of the original text, and the pointer network lets the model copy words from the original text into the summary.

2. Review

Drawing on the literature, this section briefly introduces sentence ranking models and summary generation models, and points out how they relate to and differ from our work.

2.1 Sentence ranking model

Sentence ranking reorders the sentences of the original text according to relevance indicators. Traditional sentence ranking methods are mainly divided into three types: statistical methods, graph-based methods, and center-based methods. Common statistical indicators are TF-IDF,¹³ word frequency,¹⁴ sentence position,¹⁵ and sentence length.¹⁶ Graph-based methods include TextRank,¹⁷ which re-ranks sentences following the PageRank algorithm.¹⁸ Zheng and Lapata¹⁹ regard centrality as a measure of sentence importance, where centrality refers to a set of words that can represent a series of articles on a topic.

In recent years, neural networks have begun to be used in sentence ranking models. The approach in²⁰ treats the sentence ordering task as a hierarchical regression process: it uses an RNN to automatically learn the ranking features of sentences and combines them with the original features of the sentences to perform hierarchical regression. The approach in²¹ uses an enhanced CNN to capture the prior features of summaries from variable-length phrases in order to rank sentences. The approach in²² uses two LSTMs to encode the article title and paragraphs, then uses a logistic regression model to obtain the similarity between the title and each paragraph, and finally sorts the paragraphs by this similarity. Our work is close to,²² but we use the self-attention mechanism of Vaswani et al.¹¹ to encode the title and paragraphs. We also use an attention mechanism to link the information of the title and the paragraphs.

2.2 Summary generation

Summary generation techniques can be divided into extractive methods and abstractive methods. Extractive methods select several important sentences from the original text and combine them into a summary. The commonly used extractive techniques are based on neural networks, which often model the summarization problem as sequence tagging²³ or sentence ordering.²⁴ Abstractive methods generate summaries through rewriting, and the resulting summaries are more refined. Most abstractive techniques are built on a sequence-to-sequence framework: the article is first encoded by an encoder, and the decoder then generates the summary step by step in an autoregressive manner. The sequence-to-sequence framework was first proposed in⁴ for the translation task. It was then applied to text summarization in,²⁵ but such models are prone to generating many oov words and duplicate words. To address this,²⁶ applied a hybrid pointer-generator network and a coverage mechanism to the summary generation model, so that it can copy words from the original text into the summary and suppress the generation of duplicate words.

Our work is close to²⁶, but we use the transformer model proposed in Vaswani et al.¹¹ to generate the summary, and apply a pointer network and a time penalty mechanism between the encoder and decoder. To avoid information redundancy, we use the attention distribution obtained by the last block of the decoder as a pointer of the pointer mechanism.

3. Our approach

In this section, we describe the overall framework of the text summarization model, shown in Figure 2. First, we choose a sentence ranking method based on whether the article has a title: for datasets with titles we use a supervised sentence ranking model, and for datasets without titles we use an unsupervised one. Then, we extract the top N sentences most relevant to the article topic from the ranked article and feed them to the summary generation model. The following sections describe the two main components of our model in detail, as shown in Figure 3.

Figure 2.

The overall framework.

Figure 3.

Text summarization model: (a) is used to rank the sentences of the article, from which the top N sentences are selected as the input of (b); the encoding result of each sentence encoder is recorded at the same time and is used to obtain the embedded representation, containing topic information, of the input of (b).

3.1. Sentence ranking model

Supervised sentence ranking model: We use two encoders to encode the title and the sentences of the article, and use an attention mechanism to combine them. Unlike,²² which uses LSTMs to encode the title and sentences, we choose an encoder based on the self-attention mechanism to encode the title T and each sentence of the source text, in view of computational complexity and long-term dependencies.

\begin{aligned} h_{1}^{t}, h_{2}^{t}, \dots, h_{m}^{t} &= \mathrm{selfattn}_{T} (x_{1}^{t}, x_{2}^{t}, \dots, x_{m}^{t}) \\ h_{1}^{s}, h_{2}^{s}, \dots, h_{n}^{s} &= \mathrm{selfattn}_{S} (x_{1}^{s}, x_{2}^{s}, \dots, x_{n}^{s}) \end{aligned}

(1)

where $x_{i}^{t}$ and $x_{j}^{s}$ are the tokens of the title and the sentence respectively, and $h_{i}^{t}$ and $h_{i}^{s}$ are their encodings produced by the title-level encoder and the sentence-level encoder. The structure of the title-level encoder is similar to that of the transformer encoder and consists of $n_{t}$ identical blocks. First, the word embeddings of the title are combined with the position encodings to obtain the input vector of the block. Second, the input vector is fed into the multi-head attention layer, followed by a residual connection and layer normalization to avoid vanishing gradients. Finally, the normalized result is fed into a position-wise feed-forward network composed of two convolutional transformations and one ReLU activation function, again followed by a residual connection and layer normalization, to obtain the output of this block, which is the input vector of the next block. The output of the last block is the final representation of the title T: $\{h_{1}^{t}, h_{2}^{t}, \dots, h_{m}^{t}\}$. For a detailed description of each module in the encoder, see Vaswani et al.¹¹

The main modules of the sentence-level encoder are the same as that of the title-level encoder, but the semantic information of the title needs to be added to it, so that its final output contains the information of the article.

h^{s} = \mathrm{selfattn}_{T} (x^{s} + T_{semantic})

(2)

The specific method is to use the output of the title-level encoder as the key and value, and the output of multi-head attention of each block in the sentence-level encoder as the query. That is, after performing self-attention, the sentence-level encoder executes the contextual attention with the title-level encoder to add the title information to the sentence vector and obtain the input of the position-wise feed-forward network.
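The title-conditioning step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single attention head with no learned projections, omits layer normalization, and the function names `softmax` and `title_context` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def title_context(sentence_states, title_states):
    """Add title information to every sentence token.

    Each sentence token vector acts as the query; the title-level encoder
    outputs act as both keys and values (scaled dot-product attention).
    Assumption: the attended title vector is added back to the query as a
    residual, without learned weights or layer normalization.
    """
    dim = len(title_states[0])
    enriched = []
    for query in sentence_states:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in title_states]
        weights = softmax(scores)
        context = [sum(w * value[d] for w, value in zip(weights, title_states))
                   for d in range(dim)]
        enriched.append([q + c for q, c in zip(query, context)])
    return enriched

# Toy example: two sentence tokens, one title token.
sentence = [[1.0, 0.0], [0.0, 1.0]]
title = [[1.0, 0.0]]
enriched_sentence = title_context(sentence, title)
```

With a single title token the attention weight is exactly 1, so every sentence token simply receives the title vector as an additive offset.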

To calculate the similarity between title and sentence, we use a max-pooling operation on the output of the sentence-level encoder to get the final vector that represents the sentence, and then use a linear mapping and sigmoid activation function to obtain the similarity.

\begin{aligned} \tilde{h} &= \mathrm{maxpooling} (\{h_{1}^{s}, h_{2}^{s}, \dots, h_{n}^{s}\}) \\ \mathrm{score} &= \mathrm{sigmoid} (W_{h} \tilde{h} + b_{h}) \end{aligned}

(3)

where $\mathrm{score}$ is the similarity between the title and the sentence, and also indicates the likelihood that the sentence is selected as input to the summary generator. We use this score to rank the sentences in the article, select the top N sentences as the input of the summary generation model, and use the sentence-level encoder outputs $\{h_{1}^{s}, h_{2}^{s}, \dots, h_{n}^{s}\}$ of these N sentences as their embedded representation, which contains the information of the article title.
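Equation (3) can be illustrated with a short sketch. The function names and the toy weights below are hypothetical; in the model, $W_{h}$ and $b_{h}$ are learned, and the token states come from the sentence-level encoder.

```python
import math

def max_pool(token_states):
    """Element-wise maximum over the token vectors h_1^s .. h_n^s."""
    return [max(column) for column in zip(*token_states)]

def sentence_score(token_states, w_h, b_h):
    """Sigmoid of a linear map applied to the max-pooled sentence vector."""
    pooled = max_pool(token_states)
    logit = sum(w * h for w, h in zip(w_h, pooled)) + b_h
    return 1.0 / (1.0 + math.exp(-logit))

# Toy token states for a two-token sentence with two hidden dimensions.
states = [[0.0, 2.0], [1.0, -1.0]]
pooled_vector = max_pool(states)
neutral_score = sentence_score(states, w_h=[0.0, 0.0], b_h=0.0)
```

Because the sigmoid maps any logit into (0, 1), the score can be read directly as the likelihood that a sentence is selected.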

Unsupervised sentence ranking model: For datasets without titles, we use an improved TextRank method to rank the sentences of the article. We treat each sentence as a node in a graph and use the similarity between two sentences as the weight of the edge between the corresponding nodes. We use the BERT model²⁷ to obtain contextualized sentence representations (BERT-base, uncased; no fine-tuning). The semantic similarity between two sentences is calculated as follows:

\begin{aligned} v_{s} &= \frac{1}{| s |} \sum_{k = 1}^{| s |} \mathrm{BERT} (s)_{k} \\ \mathrm{score}_{sim} (s_{i}, s_{j}) &= \cos (v_{i}, v_{j}) \end{aligned}

(4)

Each sentence is tokenized with WordPiece (maximum length 256 tokens; longer sentences are truncated), and its vector $v_{s}$ in equation (4) is computed by mean-pooling the last-layer token embeddings (i.e., the average of all token vectors in the sentence). We then compute the cosine similarity between sentence vectors to obtain edge weights for the TextRank graph, and rank sentences accordingly to select the top-$N$ as inputs to the summary generation model.
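The graph-based ranking above can be sketched as a weighted PageRank iteration over cosine similarities. This is a generic TextRank sketch rather than the improved variant used in the paper: the sentence vectors are toy values standing in for mean-pooled BERT embeddings, and the damping factor 0.85, the iteration count, the clipping of negative similarities to zero, and all function names are assumptions.

```python
import math

def mean_pool(token_vectors):
    """Average the token vectors of one sentence (v_s in equation (4))."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0.0 and norm_v > 0.0 else 0.0

def textrank(sentence_vectors, damping=0.85, iterations=50):
    """Weighted PageRank scores, one per sentence."""
    n = len(sentence_vectors)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # no self-loops
                sim = cosine(sentence_vectors[i], sentence_vectors[j])
                weights[i][j] = max(sim, 0.0)
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1.0 - damping) / n + damping * sum(
                weights[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0.0)
            for i in range(n)
        ]
    return scores

# Sentences 0 and 1 are near-duplicates; sentence 2 is almost unrelated.
vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
ranks = textrank(vectors)
```

In this toy graph the weakly connected sentence receives the lowest score, which matches the intuition that TextRank favors sentences similar to many others.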

3.2. Summary generation

Our summary generation model is composed of an encoder and a decoder. The encoder receives the most important sentences selected by the sentence ranking model and maps each word in the input to a semantic vector $H_{i}^{c}$. The decoder parses the semantic vectors to generate a summary.

The structure of the encoder is the same as that of the title-level encoder described in Section 3.1. It contains $n_{e}$ blocks, where $n_{e}$ is a hyperparameter and every block performs the same operation. The output of its last block is used by the decoder. The difference is that the word embeddings in the encoder of the summary generation model are provided directly by the sentence-level encoder, while the word embeddings of the encoders in the sentence ranking model are obtained by training Word2vec.

The structure of our decoder is similar to the transformer decoder, with $n_{d}$ blocks, each of which injects the source article information produced by the encoder into the decoder through an attention operation. The transformer model uses multi-head attention for this, which computes the attention distribution over the source text in the same way at every decoding step. We replace it with an attention mechanism based on a time penalty to discourage different decoding steps from attending to the same source information.

Time penalty mechanism: This mechanism computes the attention distribution over the source text after penalizing the information attended to in previous decoding steps, which reduces the generation of duplicate words.

First, when generating each summary word in the decoder, the way to calculate its attention distribution to the source text is:

e^{t} = \frac{V_{1}^{T} \tanh (W_{m} \mathrm{mul}_{out} [t] + W_{e} \mathrm{enc}_{out})}{\sqrt{d_{k}}}

(5)

where $W_{m}$, $W_{e}$ and $V_{1}$ are learnable parameters, $\mathrm{mul}_{out} [t]$ is the output of the multi-head attention layer at the t-th decoding step, $\mathrm{enc}_{out}$ is the output of the encoder, and $d_{k}$ is its dimension. At the t-th decoding step, the words that received high attention at step t-1 are penalized most heavily, and the penalty then decreases progressively from step t-2 down to step 1. This is reflected in the formula below, which divides the attention scores obtained at each earlier decoding step by the time difference. The resulting attention distribution is:

b_{t} = \begin{cases} \exp (e^{t}) & \text{if } t = 1, \\ \frac{\exp (e^{t})}{\sum_{k = 1}^{t - 1} \exp (\frac{e^{k}}{t - k})} & \text{otherwise.} \end{cases}

(6)

Equation (6) computes the attention weights at step $t$ by down-weighting source positions that were strongly attended in recent steps. Here $e^{t}$ is the score vector at step $t$ obtained from equation (5) (using $\mathrm{mul}_{out} [t]$, $\mathrm{enc}_{out}$, and the learnable parameters $W_{m}$, $W_{e}$, $V_{1}$ with scale $d_{k}$); $e^{k}$ denotes the same type of score vector computed at a previous step $k$ via equation (5). The $\exp (\cdot)$ and the division in equation (6) are applied element-wise, and the denominator sums over $k = 1, \dots, t - 1$ so that smaller time gaps $(t - k)$ impose stronger penalties on recently attended positions; when $t = 1$ no penalty is applied.
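The element-wise computation in equation (6) can be sketched as follows, taking the score vectors $e^{1}, \dots, e^{t}$ as given. The function names are hypothetical, and the final normalization to a probability distribution is an added assumption, since equation (6) as written leaves the weights unnormalized.

```python
import math

def time_penalty_weights(score_history):
    """Return b_t of equation (6) for t = len(score_history).

    score_history[k - 1] holds the score vector e^k (one float per source
    position) from equation (5); the last entry is the current e^t.
    """
    t = len(score_history)
    numerators = [math.exp(score) for score in score_history[-1]]
    if t == 1:
        return numerators  # no penalty at the first decoding step
    weights = []
    for position, numerator in enumerate(numerators):
        # Each past score is divided by its time gap (t - k). For positive
        # scores, the most recent step (gap 1) gives the largest term.
        penalty = sum(math.exp(score_history[k - 1][position] / (t - k))
                      for k in range(1, t))
        weights.append(numerator / penalty)
    return weights

def normalize(weights):
    """Rescale to sum to 1 (assumption: not part of equation (6))."""
    total = sum(weights)
    return [w / total for w in weights]

# Step 1 attended strongly to position 0; step 2 scores both positions equally.
history = [[2.0, 0.0], [1.0, 1.0]]
raw_b2 = time_penalty_weights(history)
b2 = normalize(raw_b2)
```

Although the current scores are tied, position 0 ends up with less weight than position 1 because it was strongly attended at the previous step.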

The source text information and the generated summary information are integrated according to the attention distribution obtained above, yielding a context matrix $C_{t}^{ed}$ that contains both the original text information and the summary information.

C_{t}^{ed} = b_{t} \times \mathrm{enc}_{out}

(7)

Finally, the context matrix $C^{ed}$ is fed into the position-wise feed-forward network $\mathrm{FFN}$ to obtain the output of the first decoder block, which serves as the input of the next decoder block. This operation is repeated over the $n_{d}$ blocks to obtain the final output of the decoder. The summary word information generated at each decoding step is then mapped to the vocabulary table through a fully connected layer $\mathrm{WPV}$. That is, the vocabulary distribution over all decoding steps is:

\begin{aligned} \mathrm{dec}_{out} &= \mathrm{FFN} (\mathrm{LN} (C^{ed}) + \mathrm{residual}) \\ P_{voc} &= \mathrm{WPV} (\mathrm{dec}_{out}) \end{aligned}

(8)

$P_{voc} [t]$ represents the probability distribution over the vocabulary table of the summary word generated at the t-th decoding step. From this distribution, the probability that the summary word generated at step t is w is:

p_{t} (w) = P_{voc} [t] (w)

(9)

When creating the vocabulary table, we set the word with the frequency less than 5 to unk (unknown word), and so some words may be marked as the id of unk when marking the reference abstract. Therefore, when the frequency of real summary word is less than 5, our original model will generate as many unks as possible to achieve low loss, which will make the generated summary less readable. In order to alleviate this problem, we process each sample to get its oov (out-of-vocabulary) table and input id table containing oov information. The oov table is composed of words belonging to the sample but not belonging to the vocabulary table, and the corresponding oov table is different for each sample. The input id table containing oov information is composed of the position in the vocabulary table of each word in the sample sequence, but the id of the word that does not belong to the vocabulary is set to the sum between its position in the oov table and the length of the vocabulary.
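The per-sample oov table and the input id table described above can be sketched as follows. The function name `build_oov_ids` and the toy vocabulary are hypothetical; ids are zero-based here, so oov ids start at the vocabulary length.

```python
def build_oov_ids(tokens, vocab):
    """Build the per-sample oov table and the id table with oov information.

    In-vocabulary words keep their vocabulary id. A word outside the
    vocabulary gets len(vocab) plus its position in this sample's oov table,
    so repeated oov words share one extended id.
    """
    oov_table = []
    extended_ids = []
    for token in tokens:
        if token in vocab:
            extended_ids.append(vocab[token])
        else:
            if token not in oov_table:
                oov_table.append(token)
            extended_ids.append(len(vocab) + oov_table.index(token))
    return extended_ids, oov_table

# Toy vocabulary of three entries; "zorblat" and "quix" are oov words.
toy_vocab = {"[UNK]": 0, "the": 1, "cat": 2}
sample = ["the", "zorblat", "cat", "zorblat", "quix"]
sample_ids, sample_oovs = build_oov_ids(sample, toy_vocab)
```

The oov table differs from sample to sample, so the same extended id can denote different words in different samples; decoding an id back to text therefore needs the oov table of the sample it came from.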

Then, based on the id table containing oov information, we add the pointer network of See et al.²⁶ to our model, so that the model can copy words directly from the source text into the summary; in code, this amounts to extracting the id of the summary word from the id table containing the oov information.

Pointer network: This mechanism, like the attention mechanism, needs to calculate attention distribution to the source text in each step of decoding and we use the attention distribution calculated by time penalty attention mechanism to be the pointer. It should be noted that the decoder has $n_{d}$ blocks, and every block will calculate the attention distribution to the source text. However, the input of each block is the output of the previous block, so the attention distribution obtained by the last block contains the attention distribution information obtained by all the previous blocks. Therefore, in order to avoid information redundancy, we use the attention distribution of the last decoder block as the pointer, telling the model to select which word in the original sequence to be the summary word.

We use a soft switch to balance the generation ability and the extraction ability of our model. Generation means that the summary word is obtained by sampling from $P_{voc}$, which is defined over the vocabulary table; extraction means that the summary word is copied from the source text based on the attention distribution of the last decoder block. The soft switch $p_{g}$ is calculated as follows:

p_{g} = σ (W_{x} x_{n_{d}} + W_{m} \mathrm{mul}_{n_{d}} + W_{c} C_{n_{d}})

(10)

where $W_{x}$, $W_{m}$ and $W_{c}$ are learnable parameters, and $σ$ is the sigmoid activation function, so $p_{g}$ ranges from 0 to 1. $x_{n_{d}}$ represents the input of the last decoder block, $\mathrm{mul}_{n_{d}}$ represents the output of the multi-head attention layer in the last block, and $C_{n_{d}}$ represents the context matrix between the last decoder block and the encoder.

Therefore the probability distribution of the abstract word obtained by our model in the extended vocabulary table is:

\begin{aligned} P_{voc}^{*} &= \mathrm{Expand} (P_{voc}) \\ \tilde{p}_{t} (w) &= p_{g} [t] \, P_{voc}^{*} [t] (w) \; \tilde{+} \; (1 - p_{g} [t]) \, D [t] (w) \end{aligned}

(11)

where $D$ refers to the attention distribution over the original text obtained by the last decoder block. The $\mathrm{Expand}$ operation extends the size of the second dimension of $P_{voc}$ to ($l_{vocab} + \max l_{oovs}$). We set the elements of $P_{voc}^{*}$ whose position is greater than $l_{vocab}$ to 0, because the words generated by the original model come from the vocabulary table and the probability of generating oov words is 0. $l_{oovs}$ is the collection of the lengths of the oov tables of all samples. The $\tilde{+}$ operation adds each element of $D$ to $P_{voc}^{*}$ at the position given by the id table with oov information of the source text. If $w$ is an oov word, the id of $w$ is greater than $l_{vocab}$ and then $P_{voc}^{*} [t] (w)$ is 0. If $w$ is a word not in the source text, $D [t] (w)$ is 0.
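For a single decoding step, equation (11) can be sketched as a scatter-add over the extended vocabulary. The function name and the toy numbers are hypothetical, and the sketch assumes that both `p_vocab` and `attention` are already normalized distributions and that ids are zero-based.

```python
def final_distribution(p_vocab, attention, source_ids, p_gen, max_oov):
    """Mix generation and copy probabilities for one decoding step.

    p_vocab    : probabilities over the fixed vocabulary (P_voc[t])
    attention  : last-block attention over source positions (D[t])
    source_ids : extended id of each source token (id table with oov info)
    p_gen      : soft switch p_g[t] in [0, 1]
    max_oov    : number of extra slots appended by Expand
    """
    # Expand: oov slots start with zero generation probability.
    extended = list(p_vocab) + [0.0] * max_oov
    mixed = [p_gen * p for p in extended]
    # Scatter-add: copy mass goes to the id of each attended source token,
    # so a word occurring several times in the source accumulates mass.
    for position, token_id in enumerate(source_ids):
        mixed[token_id] += (1.0 - p_gen) * attention[position]
    return mixed

# Vocabulary of 3 words, one oov slot; the source holds ids 1 and 3 (oov).
step_distribution = final_distribution(
    p_vocab=[0.5, 0.25, 0.25],
    attention=[0.5, 0.5],
    source_ids=[1, 3],
    p_gen=0.5,
    max_oov=1,
)
```

The oov word (id 3) receives probability only through the copy term, which is how the model can emit words outside the vocabulary table.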

Time penalty mechanism theoretical property: To justify the above design and explain why the time penalty reduces repetition while improving coherence, we formalize the mechanism and present a simple property showing that it diversifies cross-step attention.

Let $a_{t} \in R^{| X |}$ denote the encoder–decoder attention logits at decoding step $t$ over source tokens $X = {x_{i}}_{i = 1}^{| X |}$ , and let $α_{t} = softmax (a_{t})$ be the attention distribution. We define a history-aggregated footprint $c_{t - 1} \in R^{| X |}$ and apply a subtractive penalty to obtain

{\tilde{a}}_{t} = a_{t} - λ g (c_{t - 1}), {\tilde{α}}_{t} = softmax ({\tilde{a}}_{t}),

(12)

where $λ > 0$ is a penalty weight and $g$ is a nonnegative element-wise mapping (we use $g (u) = u$). The footprint is computed with an exponential time kernel:

c_{t - 1} = \sum_{k = 1}^{t - 1} γ^{t - 1 - k} α_{k}, \quad γ \in (0, 1),

(13)

so that more recent attentions are penalized more strongly.

3.2.1 Intuition

The penalty reduces the logit of source positions that were strongly attended in recent steps, preventing the decoder from fixating and thus curbing near-duplicate token emissions.

3.2.2 Proposition

Let $j^{⋆} = \arg max_{i} {c_{t - 1, i}}$ be the most recently emphasized source index. For any $λ > 0$ , the probability assigned to $j^{⋆}$ strictly decreases after the penalty:

{\tilde{α}}_{t, j^{⋆}} < α_{t, j^{⋆}} .

(14)

Moreover, the attention entropy increases:

Δ H_{t} \geq κ λ,

(15)

where $Δ H_{t} = H ({\tilde{α}}_{t}) - H (α_{t})$ and $κ > 0$ depends on $a_{t}$ and $c_{t - 1}$.

3.2.3 Sketch of proof

Since ${\tilde{a}}_{t, j^{⋆}} = a_{t, j^{⋆}} - λ g (c_{t - 1, j^{⋆}})$ and all other logits are unchanged or less penalized, the softmax ratio ${\tilde{α}}_{t, j^{⋆}} / {\tilde{α}}_{t, i} = \exp ({\tilde{a}}_{t, j^{⋆}} - {\tilde{a}}_{t, i})$ strictly decreases for every $i \neq j^{⋆}$, yielding ${\tilde{α}}_{t, j^{⋆}} < α_{t, j^{⋆}}$. By the concavity of $- \sum_{i} p_{i} \log p_{i}$ and the fact that mass is moved from a peak to the tails, the entropy increases; a first-order bound follows from the softmax Lipschitz property with respect to additive logit perturbations.
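The first claim can be written out in one more step. The following is a sketch under the stated choice $g (u) = u$, with the added assumption that the footprint $c_{t - 1}$ is not constant across source positions (otherwise the penalty shifts all logits equally and the softmax is unchanged).

```latex
\begin{aligned}
\tilde{\alpha}_{t, j^{\star}}
  &= \frac{1}{1 + \sum_{i \neq j^{\star}} \exp (\tilde{a}_{t, i} - \tilde{a}_{t, j^{\star}})}, \\
\tilde{a}_{t, i} - \tilde{a}_{t, j^{\star}}
  &= (a_{t, i} - a_{t, j^{\star}}) + \lambda \, (c_{t - 1, j^{\star}} - c_{t - 1, i})
  \;\geq\; a_{t, i} - a_{t, j^{\star}} .
\end{aligned}
```

The inequality holds because $j^{⋆}$ maximizes the footprint, and it is strict for every $i$ with $c_{t - 1, i} < c_{t - 1, j^{⋆}}$. Each term of the sum in the denominator therefore does not decrease and at least one strictly increases, which gives equation (14).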

3.2.4 Complexity

The update of $c_{t - 1}$ is $O (| X |)$ per step with negligible constant overhead; the asymptotic complexity of attention remains unchanged.
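The per-step cost can be seen from the recursive form of equation (13), $c_{t} = γ c_{t - 1} + α_{t}$, which follows by factoring $γ$ out of the sum. A minimal sketch of equations (12) and (13) with $g (u) = u$ is given below; the function names and the values $λ = 1$ and $γ = 0.5$ are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def penalized_attention(logits, footprint, lam):
    """Equation (12) with g(u) = u: subtract lam * footprint, then softmax."""
    return softmax([a - lam * c for a, c in zip(logits, footprint)])

def update_footprint(footprint, alpha, gamma):
    """Recursive form of equation (13): one pass over the source positions."""
    return [gamma * c + a for c, a in zip(footprint, alpha)]

# Two decoding steps with identical logits over three source positions.
logits = [2.0, 1.0, 0.0]
footprint = [0.0, 0.0, 0.0]
alpha_1 = softmax(logits)  # step 1: the footprint is empty, so no penalty
footprint = update_footprint(footprint, alpha_1, gamma=0.5)
alpha_2 = softmax(logits)  # what step 2 would attend to without the penalty
alpha_2_penalized = penalized_attention(logits, footprint, lam=1.0)
```

Following equation (13), the footprint accumulates the unpenalized distributions $α_{k}$. In this toy run, position 0 dominates the footprint after step 1, and its probability at step 2 drops once the penalty is applied.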

4. Datasets and baselines

We use two types of datasets: dataset with titles (WikiHow) and dataset without titles (CNN/Daily Mail).

CNN/Daily Mail is a collection of news articles, each paired with one or more human-written highlight points that are combined into its summary. The sizes of the training, validation, and test sets are 286,817, 13,368 and 11,487 respectively, about 312k articles in total. For this dataset, the average article length is about 758 tokens; we build a shared vocabulary with a minimum frequency of 5 ( $\approx$ 50k types) and truncate source and target to 400 and 200 tokens, respectively.

WikiHow was proposed by Koupaee and Wang²⁸ and is a collection of articles from the WikiHow knowledge base. Each article consists of a question and numerous solutions; each solution is one data sample, and the introduction of the solution serves as its title. The sizes of the training, validation and test sets are 180,000, 10,000 and 20,000 respectively. For this dataset, the average article length is about 460 tokens; the same shared vocabulary (min frequency 5; $\approx$ 50k types) is used, and source and target are truncated to 400 and 200 tokens, respectively.

We propose two models: a sentence ranking model and a summary generation model. The sentence ranking model is used to obtain the input of the summary generation model. They are as follows:

(1) SRM-head is our supervised sentence ranking model, in which the output of the sentence-level encoder with title information is used to calculate the similarity between the title and the sentence.
(2) Trans-Pointer-Time is our summary generator, which uses a pointer network and a time penalty mechanism to improve the transformer.

To evaluate the performance of our sentence ranking model, we use the following models for comparison:
(1) Lead selects the first k sentences of an article to form a summary without any processing.
(2) TextRank is the unsupervised sentence ranking model that we use to rank datasets without titles; its edge weights are obtained by calculating the semantic similarity between two sentences.
(3) LR-LSTM²² uses LSTMs to construct a logistic regression model that calculates the importance of the sentences in an article and then reorders the article.

In addition, we compare our summary generator with several variants of the sequence-to-sequence model to verify its performance:
(1) NN-ABS²⁹ is a neural summary generation model that uses an attention mechanism to link the encoder and decoder.
(2) PG-coverage²⁶ adds a pointer mechanism and a coverage mechanism to a basic sequence-to-sequence model to generate summaries.
(3) Transformer¹¹ uses only the attention mechanism to encode and decode articles to generate a summary.

5. Model performance

5.1. Implementation details

5.1.1. Data and preprocessing

We build a vocabulary with a minimum frequency threshold of 5 (rare words mapped to unk; final size $\approx$ 50k). For each document, the Sentences Ranking module selects the Top- $N$ sentences (we use $N = 10$ by default), which are concatenated and truncated to a maximum source length of 400 tokens. Target summaries are truncated to 200 tokens.
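The vocabulary threshold and the truncation step can be sketched as follows. The special token names `[PAD]` and `[UNK]`, the id ordering (by descending frequency), and the function names are assumptions made for illustration; only the minimum frequency of 5 and the length limits come from the setup above.

```python
from collections import Counter

def build_vocab(tokenized_docs, min_freq=5, specials=("[PAD]", "[UNK]")):
    """Map every token seen at least min_freq times to an integer id."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    vocab = {token: idx for idx, token in enumerate(specials)}
    # Sort by descending frequency, then alphabetically, for stable ids.
    for token, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab, max_len):
    """Truncate to max_len tokens and map rare words to the unk id."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(token, unk_id) for token in tokens[:max_len]]

# "b" occurs only 4 times, so it falls below the threshold.
docs = [["a"] * 5 + ["b"] * 4, ["c"] * 6]
small_vocab = build_vocab(docs, min_freq=5)
encoded = encode(["a", "b", "c", "a"], small_vocab, max_len=3)
```

In the setting described above, `max_len` would be 400 for the source and 200 for the target.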

5.1.2. Model configuration

The encoder–decoder follows a standard Transformer: 6 encoder layers and 6 decoder layers, model dimension 768, feed-forward dimension 2048, and 12 attention heads per layer. Dropout is set to 0.1 on attention and feed-forward sublayers. The pointer mechanism uses the cross-attention distribution from the last decoder block; the time penalty is applied in encoder–decoder attention as defined in Section 3.

5.1.3. Optimization and training

We train with Adagrad (learning rate $0.2$ ), batch size $64$ , and standard layer normalization and residual connections. Unless noted, all runs use the same training schedule and early stopping on the development set. Decoding uses beam search with the same settings across models to ensure fair comparison.

5.1.4. The effect of the input length

For each article with sentences ${s_{i}}_{i = 1}^{M}$ , SRM-head assigns a relevance score $score (s_{i})$ (equation (3)). We rank sentences by this score, select the top- $N$ indices $I_{N}$ , and concatenate the selected sentences in their original document order to form the generator input

X^{'} (N) = concat ({s_{i}}_{i \in I_{N}}^{doc-order}) .

The encoder then processes $X^{'} (N)$ (with token embeddings taken from the sentence-level encoder via a linear projection to the generator dimension), and the decoder proceeds as described above.
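The construction of $X^{'} (N)$ can be sketched as a rank-then-restore-order step. The function name `select_top_n` and the toy scores are hypothetical; the scores stand in for the SRM-head outputs of equation (3).

```python
def select_top_n(sentences, scores, n):
    """Keep the n highest-scoring sentences, in original document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n])  # restore document order
    return [sentences[i] for i in kept]

article = ["s1", "s2", "s3", "s4"]
relevance = [0.1, 0.9, 0.3, 0.8]
generator_input = " ".join(select_top_n(article, relevance, n=2))
```

Sorting the selected indices before concatenation is what keeps the generator input in the order in which the sentences appear in the article, even though selection is driven by score.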

We test the effect of the input length of the summary generation model on the WikiHow dataset, controlling the input length by selecting a different number of sentences from the article. As shown in Figure 4, we run experiments with NN-ABS, PG-coverage, and Trans-Pointer-Time, whose inputs are obtained by selecting sentences from the original articles using SRM-head. The number of sentences ranges from 1 to 20 with an interval of 2. Figure 4 indicates that, for these neural generation models, selecting 10 sentences as input tends to work better than the other settings.

Figure 4.

The effect of the number of selected sentences on the summary generation models (ROUGE F1).

5.1.5. The effect of the sentence ranking method

To evaluate the effectiveness of SRM-head, we apply it to the WikiHow dataset and compare it with several other ranking methods. Specifically, we rank the article's sentences with each method, extract the top $L^{'}$ sentences, and measure the amount of information they contain. Table 1 lists the results for the different sentence ranking models, using the ROUGE recall between the extracted $L^{'}$ sentences and the reference summary to represent the amount of information contained in the $L^{'}$ sentences. Table 1 shows that the ROUGE recall of SRM-head is nearly $28\%$ higher than that of the traditional unsupervised ranking method TextRank. SRM-head also outperforms the supervised ranking method LR-LSTM, with the amount of information contained increasing by $5\%$. Besides, training SRM-head took only 2.75 hours, much less than training LR-LSTM.
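The recall measure used here can be illustrated with a simplified unigram version. This sketch is not the official ROUGE toolkit: it assumes whitespace tokenization and applies no stemming or stopword handling, and the function name is hypothetical.

```python
from collections import Counter

def rouge1_recall(candidate_tokens, reference_tokens):
    """Fraction of reference unigrams that also occur in the candidate.

    Counts are clipped, so a reference word is matched at most as many
    times as it appears in the candidate.
    """
    candidate = Counter(candidate_tokens)
    reference = Counter(reference_tokens)
    overlap = sum(min(count, candidate[token])
                  for token, count in reference.items())
    total = sum(reference.values())
    return overlap / total if total else 0.0

extracted = "the cat sat on the mat".split()
reference = "the cat lay on a mat".split()
recall = rouge1_recall(extracted, reference)
```

Recall rewards extracted text that covers the reference summary regardless of its own length, which is why it serves here as a proxy for the amount of information retained by the selected sentences.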

Table 1.
Comparison of sentence ranking models on the WikiHow dataset (ROUGE recall).

Methods	ROUGE-1 ($L^{'}$ = 5)	ROUGE-1 ($L^{'}$ = 10)	ROUGE-1 ($L^{'}$ = 15)	ROUGE-L ($L^{'}$ = 5)	ROUGE-L ($L^{'}$ = 10)	ROUGE-L ($L^{'}$ = 15)
Lead	38.76	42.78	49.37	23.65	28.31	33.28
TextRank	39.14	43.52	51.93	28.62	33.53	39.47
LR-LSTM	50.23	56.83	61.49	38.43	42.39	47.33
SRM-head(ours)	52.71	60.61	64.5	40.12	43.55	50.62


In addition, we calculate the cosine similarity between the top n sentences (lead-n) extracted by each sentence ranking model and the human-written summary, which represents their semantic similarity. We use this similarity score to evaluate the different sentence ranking models. As shown in Figure 5, SRM-head scores $15\%$ higher than TextRank and $3\%$ higher than LR-LSTM, which shows that our supervised method is more effective and that the text it extracts better summarizes the original article.

Figure 5.

Evaluation of sentence-ranking methods on the WikiHow dataset, where semantic similarity is measured by cosine similarity.

5.1.6. Performance of the summary generation model

Table 2 reports the results of several generators on the CNN/Daily Mail and WikiHow datasets. For the models above the horizontal line in Table 2 (NN-ABS through Trans-Pointer-Time), the input is the article truncated to the input length set by the model; that is, the part of the article beyond the set length is simply deleted and the remainder is used as input. For the models below the horizontal line, the input consists of the top 10 sentences of the article after reordering by TextRank or SRM-head. The scores in row 2 of Table 2 are taken from ^30,31, while the scores in row 3 were obtained by training the model ourselves on an NVIDIA M40 GPU. Although our reproduced results are lower than those reported in the original paper, they still serve as a useful reference for our experimental analysis.

Table 2.
Comparison results of F1-score on CNN/Daily Mail and WikiHow.

Methods	CNN/Daily Mail ROUGE-1	CNN/Daily Mail ROUGE-2	CNN/Daily Mail ROUGE-L	WikiHow ROUGE-1	WikiHow ROUGE-2	WikiHow ROUGE-L
NN-ABS	23.19	6.35	13.97	17.53	4.35	12.74
PG+coverage	39.53	17.28	36.38	28.53	9.23	26.54
PG+coverage(by us)	38.91	16.53	35.14	27.13	8.76	26.41
Transformer	31.23	10.81	27.14	25.77	7.76	22.35
Trans-Pointer-Time(ours)	40.84	12.07	36.70	30.49	8.75	30.46
PG+coverage+TextRank	39.24	16.87	35.96	28.05	8.93	25.78
PG+coverage+SRM-head	–	–	–	29.77	9.18	27.57
Transformer+TextRank	32.76	9.23	29.62	30.98	8.86	29.01
Transformer+SRM-head	–	–	–	31.19	9.21	31.10
Trans-Pointer-Time+TextRank	41.53	10.98	37.48	32.04	9.75	31.94
Trans-Pointer-Time+SRM-head(ours)	–	–	–	32.76	10.15	32.71

Figure 6 shows the semantic similarity between the summaries generated by different summary generation models and the reference summaries. Figure 7 reports the proportion of novel words and unks in the generated summaries, where novel words are words that appear in the summary but not in the original article, and an unk is a generated summary word that is out of the dictionary.
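The two proportions defined above can be sketched as follows. This is a minimal unigram version under stated assumptions: inputs are already tokenized, the out-of-dictionary placeholder is written `<unk>`, and unks are not counted as novel words. The function names are hypothetical.

```python
def novel_word_ratio(summary_tokens, article_tokens, unk_token="<unk>"):
    """Share of summary tokens that never appear in the article (unks excluded)."""
    if not summary_tokens:
        return 0.0
    article_vocab = set(article_tokens)
    novel = sum(
        1 for tok in summary_tokens
        if tok != unk_token and tok not in article_vocab
    )
    return novel / len(summary_tokens)


def unk_ratio(summary_tokens, unk_token="<unk>"):
    """Share of summary tokens that are the out-of-dictionary placeholder."""
    if not summary_tokens:
        return 0.0
    return sum(1 for tok in summary_tokens if tok == unk_token) / len(summary_tokens)


article_tokens = "the cat sat on the mat".split()
summary_tokens = ["a", "cat", "sat", "<unk>"]
novel = novel_word_ratio(summary_tokens, article_tokens)
unks = unk_ratio(summary_tokens)
```

A model that copies from the source tends to lower both ratios at once: copied words are by definition not novel, and copying replaces many would-be unks.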

Figure 6.

The semantic similarity between the generated summary and the reference summary. (a) The effect of different summary generation models on different datasets. The model input is the first $max_{len}$ words of the original article, where $max_{len}$ is the input length set by the model. (b) The impact of different sentence-ranking models on different summary generation models. The input of the summary generation model is the 10 highest-scoring sentences after reordering the article with the sentence-ranking model.

Figure 7.

The proportion of novel n-grams and unks in summaries generated by different summary generation models. The proportions of novel words in Trans-Pointer-Time and PG+coverage are not high because these models sometimes copy words from the source text into the summary.

Comparing the first four rows of Table 2, we see that the Transformer generates better summaries than the traditional sequence-to-sequence model NN-ABS, but worse ones than PG+coverage, the sequence-to-sequence model with a pointer mechanism. Figures 6(a) and 7 show that the summaries generated by the Transformer are less semantically similar to the reference summaries than those of PG+coverage, and that they contain many more unks. This indicates that the Transformer still suffers from out-of-vocabulary (OOV) words in its generated summaries.

We therefore propose Trans-Pointer-Time, which adds a pointer mechanism and a time penalty mechanism to the Transformer. As the proportion of unks in Figure 7 shows, summaries generated by Trans-Pointer-Time contain fewer unks, which alleviates the OOV problem. Meanwhile, rows 3 to 5 of Table 2 show that the ROUGE F1 scores of Trans-Pointer-Time are 23% higher than those of the Transformer and 4% higher than those of PG+coverage. In addition, training Trans-Pointer-Time takes about 5 hours and 25 minutes, approximately one hour longer than training the Transformer but much faster than training PG+coverage.

In addition, to verify that our supervised sentence-ranking model improves the performance of the summary generation model, we conducted comparison experiments on the WikiHow dataset. We preprocess each article with TextRank or SRM-head and then feed the result to the summary generators. The results below the horizontal line in Table 2 show that SRM-head improves the summary generation models more than TextRank does, and that it raises the ROUGE-1 F1 of Trans-Pointer-Time by nearly 5%. Meanwhile, Figure 6(b) shows that applying SRM-head to the summary generators increases the semantic similarity between the generated summaries and their reference summaries.

5.1.7 Ablation: Time penalty and pointer mechanisms

We isolate the contributions of the time penalty and the pointer mechanism on the WikiHow dataset. Under the same training and decoding settings as in our main results, we evaluate four variants: (i) Transformer (no Time Penalty, no Pointer), (ii) Transformer–Time (Time Penalty only), (iii) Transformer–Pointer (Pointer only), and (iv) Trans–Pointer–Time (ours) (both enabled). We report ROUGE-1/2/L on the test set; the scores are summarized in Table 3.

Table 3.
Ablation study on the time penalty mechanism and the pointer mechanism (WikiHow).

Methods	ROUGE-1	ROUGE-2	ROUGE-L
Transformer	25.77	7.76	22.35
Transformer-Time	27.98	8.08	28.65
Transformer-Pointer	29.12	8.41	26.44
Trans-Pointer-Time(ours)	30.49	8.75	30.46

Relative to the plain Transformer, Transformer-Time substantially improves quality, with the largest gain on ROUGE-L (+6.30), indicating stronger structural coherence and reduced repetition. Transformer-Pointer delivers larger gains on ROUGE-1 (+3.35) and ROUGE-2 (+0.65), consistent with better lexical coverage via copying. Combining both, Trans-Pointer-Time achieves the best overall results, improving over the baseline by +4.72 (ROUGE-1), +0.99 (ROUGE-2), and +8.11 (ROUGE-L).
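These gains follow directly from the scores in Table 3. The short sketch below recomputes them; the dictionary layout and the function name are illustrative only, while the scores are copied from Table 3.

```python
# (ROUGE-1, ROUGE-2, ROUGE-L) on WikiHow, copied from Table 3.
ABLATION_SCORES = {
    "Transformer": (25.77, 7.76, 22.35),
    "Transformer-Time": (27.98, 8.08, 28.65),
    "Transformer-Pointer": (29.12, 8.41, 26.44),
    "Trans-Pointer-Time": (30.49, 8.75, 30.46),
}


def gains_over_baseline(variant, baseline="Transformer"):
    """Absolute ROUGE-1/2/L improvement of a variant over the baseline."""
    return tuple(
        round(v - b, 2)
        for v, b in zip(ABLATION_SCORES[variant], ABLATION_SCORES[baseline])
    )


time_gain = gains_over_baseline("Transformer-Time")
pointer_gain = gains_over_baseline("Transformer-Pointer")
full_gain = gains_over_baseline("Trans-Pointer-Time")
```

Laid out this way, the complementary pattern is easy to read off: the time penalty contributes most on ROUGE-L, the pointer on ROUGE-1 and ROUGE-2.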

The time penalty mechanism improves coherence by penalizing consecutive steps that focus on the same source positions, thereby reducing repetition and improving ordering (higher ROUGE-L). The pointer mechanism adds a gated copy distribution from cross-attention, which handles OOV words and preserves key entities and phrases, boosting ROUGE-1/2. The two are complementary: the time penalty guides where to attend over time, while the pointer influences what to emit. Together, they yield the best overall performance on WikiHow.
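To make the "where to attend over time" idea concrete, the following is a minimal sketch of one possible recency penalty on cross-attention. It is not the exact formulation of our model: the exponentially decayed footprint update and the `penalty` and `decay` hyperparameters are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def time_penalized_attention(logits, footprint, penalty=1.0, decay=0.5):
    """One decoding step of cross-attention with a recency penalty.

    logits    : raw cross-attention scores over the n source positions
    footprint : recency footprint carried over from the previous step
    penalty, decay : hypothetical hyperparameters for this sketch
    """
    # Positions attended recently get their logits lowered before softmax.
    penalized = [s - penalty * c for s, c in zip(logits, footprint)]
    attention = softmax(penalized)
    # Assumed update rule: decayed running sum of past attention weights.
    new_footprint = [decay * c + a for c, a in zip(footprint, attention)]
    return attention, new_footprint


logits = [2.0, 1.0, 0.5]          # same raw scores at both steps
footprint = [0.0, 0.0, 0.0]       # nothing attended yet
first, footprint = time_penalized_attention(logits, footprint)
second, footprint = time_penalized_attention(logits, footprint)
```

With identical raw scores at both steps, the second distribution puts less mass on the position that dominated the first step, which is the behavior that discourages repeated focus on the same source span.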

5.1.8 Comparison with other SOTA models and cross-dataset generalization

Table 4 compares our model with recent pretrained summarizers on both CNN/DailyMail and XSum. For BART and PEGASUS, we use the test scores reported in their original papers^32,33; for T5 we use its reported CNN/DailyMail score and a widely used public reproduction on XSum³⁴. Our model is trained/evaluated on XSum under the same preprocessing, length control, and ROUGE protocol as in CNN/DailyMail, with shorter targets (e.g., 64 tokens) and a smaller SRM-head top- $N$ (e.g., $N = 3$ – $5$ ) to match XSum’s highly abstractive style.

Table 4.
Comparison with other SOTA models and cross-dataset generalization.

Methods	CNN/Daily Mail ROUGE-1	CNN/Daily Mail ROUGE-2	CNN/Daily Mail ROUGE-L	XSum ROUGE-1	XSum ROUGE-2	XSum ROUGE-L
BART	44.16	21.28	40.90	45.14	22.27	37.25
PEGASUS	43.90	21.20	40.76	45.20	22.06	36.99
T5-11B	43.52	21.55	40.69	36.77	14.69	30.07
Trans-Pointer-Time+SRM-head(ours)	44.10	19.02	40.93	45.15	21.87	37.32

On CNN/DailyMail, our system is on par with BART in ROUGE-1 (44.10 vs 44.16) and achieves the best ROUGE-L (40.93), while trailing in ROUGE-2. On XSum, it matches BART/PEGASUS on ROUGE-1 (45.15 vs 45.14/45.20), is close on ROUGE-2 (21.87 vs 22.27/22.06), and attains the best ROUGE-L (37.32). These patterns of strong ROUGE-L and competitive ROUGE-1 across two distinct domains indicate that our approach generalizes beyond CNN/DailyMail and captures sentence-level coherence and ordering effectively, even under highly abstractive targets.

These outcomes highlight the effectiveness of our design: SRM-head provides explicit content selection, the pointer mechanism preserves lexical fidelity and handles OOV tokens, and the time penalty promotes cross-step attention diversity to curb repetition. Taken together, these lightweight, modular components plug into a standard Transformer and deliver competitive performance without relying on extremely large-scale pretraining.

5.1.9 Computational overhead and scalability

Let $n$ be the (post-selection) source length, $m$ the target length, $d$ the model dimension, and $L_{e} / L_{d}$ the encoder/decoder layer counts. A standard Transformer costs $O (L_{e} n^{2} d)$ for encoder self-attention, $O (L_{d} m^{2} d)$ for decoder self-attention, and $O (L_{d} m n d)$ for cross-attention.

Incremental cost of our additions: (i) Time penalty keeps a recency footprint $c_{t - 1} \in R^{n}$ and subtracts it from cross-attention logits once per step; both the footprint update and subtraction are $O (n)$ per decoding step, adding only a length- $n$ buffer. Asymptotic bounds remain unchanged. (ii) Pointer (copy) reuses the last decoder block’s cross-attention and mixes it with the vocabulary distribution via a scalar gate; beyond the gate, the per-step work is a sparse scatter from source positions, $O (n)$ . Memory adds one attention vector ( $O (n)$ ).
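The $O (n)$ cost of the copy step can be seen in a minimal sketch of the gated mixture. The convention that the gate weights the vocabulary distribution, the extended-vocabulary indexing for OOV source tokens, and the function name are assumptions made for this example rather than details fixed by the text above.

```python
def pointer_mix(vocab_probs, attention, source_ids, gate, extended_size):
    """Mix the vocabulary distribution with a copy distribution.

    vocab_probs   : probabilities over the fixed vocabulary
    attention     : cross-attention weights over the n source positions
    source_ids    : extended-vocabulary id of the token at each source position
    gate          : scalar in [0, 1]; here it weights the vocabulary distribution
    extended_size : fixed vocabulary size plus the number of distinct source OOVs
    """
    final = [0.0] * extended_size
    for idx, prob in enumerate(vocab_probs):
        final[idx] = gate * prob
    # Sparse scatter: one addition per source position, i.e. O(n) per step.
    for src_id, weight in zip(source_ids, attention):
        final[src_id] += (1.0 - gate) * weight
    return final


vocab_probs = [0.5, 0.3, 0.2]     # fixed vocabulary of size 3
attention = [0.2, 0.5, 0.3]       # three source positions
source_ids = [1, 3, 3]            # id 3 is an OOV word seen twice in the source
final = pointer_mix(vocab_probs, attention, source_ids, gate=0.6, extended_size=4)
```

The OOV word at id 3 has zero probability under the fixed vocabulary but receives mass through copying, which is how the pointer can emit entities and numbers instead of an unk.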

Effect of SRM-head (top- $N$ ): Sentence ranking reduces the effective input from raw length $n_{raw}$ to $n \ll n_{raw}$ (top- $N$ concatenation), cutting encoder self-attention cost to roughly $(n / n_{raw})^{2}$ of its original value. Empirically (Figure 4), $N = 10$ offers a favorable quality–efficiency trade-off on WikiHow.
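A small worked example makes the scaling tangible. The sketch below evaluates the three attention cost terms up to constant factors; the lengths, model dimension, and layer counts are hypothetical values chosen only to show the ratio.

```python
def attention_cost(n, m, d, enc_layers, dec_layers):
    """Leading-order attention cost terms of a standard Transformer (no constants)."""
    return {
        "encoder_self": enc_layers * n * n * d,
        "decoder_self": dec_layers * m * m * d,
        "cross": dec_layers * m * n * d,
    }


def encoder_cost_ratio(n, n_raw):
    """Fraction of encoder self-attention cost left after sentence selection."""
    return (n / n_raw) ** 2


# Hypothetical sizes: a 1000-token article reduced to 250 tokens by top-N selection.
raw = attention_cost(n=1000, m=64, d=512, enc_layers=6, dec_layers=6)
selected = attention_cost(n=250, m=64, d=512, enc_layers=6, dec_layers=6)
ratio = encoder_cost_ratio(250, 1000)
```

Under these assumed sizes, encoder self-attention drops to about 6% of its original cost, cross-attention drops linearly to 25%, and decoder self-attention is unchanged because it depends only on the target length.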

Both modules are lightweight: they introduce only $O (n)$ per-step work and $O (n)$ temporary memory on top of the base cross-attention, leaving Transformer’s asymptotic complexity intact; SRM-head further improves scalability by shrinking $n$ .

6. Conclusion

We presented a two-stage summarization framework in which a supervised sentence-ranking head (SRM-head) selects the top- $N$ sentences and a Transformer-based generator produces the summary. The generator is augmented with a time penalty in encoder-decoder attention to discourage reattending to recently focused content and a pointer mechanism to copy salient spans, thereby improving entity and number fidelity. Experiments on CNN/DailyMail and WikiHow, together with an additional evaluation on XSum, show that our approach delivers competitive ROUGE scores against recent pretrained systems.

6.1 Future work

We plan to (i) extend the pipeline to longer documents via hierarchical selection and long-context/sparse attention, (ii) incorporate structured evidence (tables/knowledge bases) and faithfulness constraints to further improve factual consistency, (iii) complement automatic metrics with targeted human evaluation protocols for fluency, coherence, and relevance, (iv) study domain and language transfer beyond news and how-to articles, and (v) explore joint training of SRM-head and the generator while plugging our modules into large pretrained backbones to combine strong pretraining with our lightweight control mechanisms.

Ethical considerations

This article does not contain any studies with human or animal participants.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Funding

The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work is supported by the National Natural Science Foundation of China (Grant Nos. 62373141, 62302157, 62532005), the Science and Technology Program of Changsha (kh2301011), the Major Science and Technology Research Projects of Hunan Province (Grant Nos. 2024QK2010, 2024QK2009), the Yunnan Provincial Major Science and Technology Special Plan Projects (No. 202502AD080009), the Yunnan Science and Technology Talent and Platform Program (No. 202605AK340003), the open Project Fund of the State Key Laboratory of Cyberspace Security Defense (2024-MS-04).

Declaration of conflicting interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

All data generated or analysed during this study are included in this published article.

Footnotes

ORCID iDs

Li Yang

Xiao Wang

Xiong Xiao

Zhuo Tang

References

1. Wang. Surveying the landscape of text summarization with deep learning: a comprehensive review. Discret Math Algorithms Appl 2024; 16: 2330004:1.
2. Wahab MHH, Ali, Hamid NAWA, et al. A review on optimization-based automatic text summarization approach. IEEE Access 2024; 12: 4892–4909.
3. Mendoza, Bonilla, Noguera, et al. Extractive single-document summarization based on genetic operators and guided local search. Expert Syst Appl 2014; 41: 4158–4169.
4. Sutskever, Vinyals. Sequence to sequence learning with neural networks. In: Proceedings of the 28th international conference on neural information processing systems-volume 2, 2014, pp.3104–3112. MIT Press.
5. Yao, Zhang, Luo, et al. Deep reinforcement learning for extractive document summarization. Neurocomputing 2018; 284: 52–62. DOI: 10.1016/J.NEUCOM.2018.01.020.
6. Verma, Nidhi. Extractive summarization using deep learning. Res Comput Sci 2018; 147: 107–117.
7. Joshi, Fidalgo, Alegre, et al. Deepsumm: exploiting topic models and sequence to sequence networks for extractive text summarization. Expert Syst Appl 2023; 211: 118442.
8. Kondath, Suseelan, Idicula. Extractive summarization of Malayalam documents using latent Dirichlet allocation: an experience. J Intell Syst 2022; 31: 393–406.
9. Brito, Lübbering, Biesner, et al. Towards supervised extractive text summarization via RNN-based sequence classification. CoRR abs/1911.06121, 2019. DOI: 10.1016/j.future.2019.04.045.
10. Chopra, Auli, Rush. Abstractive sentence summarization with attentive recurrent neural networks. In: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, 2016, pp.93–98. Association for Computational Linguistics.
11. Vaswani, Shazeer, Parmar, et al. Attention is all you need. In: Proceedings of the 31st international conference on neural information processing systems, 2017, pp.6000–6010. Curran Associates Inc.
12. Chen, Zhuge. Extractive summarization of documents with images based on multi-modal RNN. Future Gener Comput Syst 2019; 99: 186–196.
13. Madatov, Bekchanov, Vicic. Uzbek text summarization based on TF-IDF. CoRR abs/2303.00461, 2023, https://doi.org/10.48550/arXiv.2303.00461.
14. Extractive text summarization using word frequency algorithm for English text. In: Working notes of FIRE 2022 - forum for information retrieval evaluation, Kolkata, India, 9–13 December 2022, Volume 3395, 2022, pp.403–408. https://ceur-ws.org/Vol-3395/T6-4.pdf.
15. Manh, Thanh, Minh. Extractive multi-document summarization using k-means, centroid-based method, MMR, and sentence position. In: Proceedings of the tenth international symposium on information and communication technology, Ha Noi, Ha Long Bay, Vietnam, 4–6 December 2019, 2019, pp.29–35. DOI: 10.1145/3368926.3369688.
16. Schumann. Unsupervised abstractive sentence summarization using length controlled variational autoencoder. CoRR abs/1809.05233, 2018, http://arxiv.org/abs/1809.05233.
17. Luo, Chen, Jiang, et al. Gap sentences generation with TextRank for Chinese text summarization. In: Proceedings of the 5th international conference on algorithms, computing and artificial intelligence, ACAI 2022, Sanya, China, 23–25 December 2022. 2022, pp.67:1–67:5. DOI: 10.1145/3579654.3579725.
18. Akülker, Turhan. Extractive text summarization for Turkish: implementation of TF-IDF and PageRank algorithms. In: Arai K (ed.) Intelligent systems and applications - proceedings of the 2022 intelligent systems conference, IntelliSys 2022, Amsterdam, The Netherlands, 1–2 September 2022. Volume 3. Lecture Notes in Networks and Systems, volume 544, 2022, pp.688–704. DOI: 10.1007/978-3-031-16075-2_51.
19. Zheng, Lapata. Sentence centrality revisited for unsupervised summarization. In: Korhonen A, Traum DR and Màrquez L (eds.) Proceedings of the 57th conference of the association for computational linguistics, ACL 2019, Florence, Italy, 28 July–2 August 2019, Volume 1: Long Papers, 2019, pp.6236–6247. DOI: 10.18653/V1/P19-1628.
20. Ekmekci, Hagerman, Howald. Specificity-based sentence ordering for multi-document extractive risk summarization. CoRR abs/1909.10393, 2019, http://arxiv.org/abs/1909.10393.
21. Song, Huang, Ruan. Abstractive text summarization using LSTM-CNN based deep learning. Multim Tools Appl 2019; 78: 857–875.
22. Liu, Lapata. Hierarchical transformers for multi-document summarization. arXiv preprint arXiv:1905.13164, 2019.
23. Nallapati, Zhai, Zhou. SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017, pp.3075–3081. AAAI Press.
24. Zhou, Yang, Wei, et al. Neural document summarization by jointly learning to score and select sentences. arXiv preprint arXiv:1807.02305, 2018.
25. Sun, et al. Improving semantic relevance for sequence-to-sequence learning of Chinese social media text summarization. CoRR abs/1706.02459, 2017, http://arxiv.org/abs/1706.02459.
26. See, Liu, Manning. Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368, 2017.
27. Devlin, Chang, Lee, et al. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
28. Koupaee, Wang. Wikihow: a large scale text summarization dataset. arXiv preprint arXiv:1810.09305, 2018.
29. Rush, Chopra, Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015.
30. Liu, Litvak, et al. In conclusion not repetition: comprehensive abstractive summarization with diversified attention based on determinantal point processes. arXiv preprint arXiv:1909.10852, 2019.
31. Ailem, Zhang, Sha. Topic augmented generator for abstractive summarization. arXiv preprint arXiv:1908.07026, 2019.
32. Lewis, Liu, Goyal, et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th annual meeting of the association for computational linguistics, 2020, pp.7871–7880. Association for Computational Linguistics.
33. Zhang, Zhao, Saleh, et al. PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. In: Proceedings of the 37th international conference on machine learning, 2020, pp.11328–11339. PMLR.
34. Raffel, Shazeer, Roberts, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 2020; 21: 1–67.
