Abstract
Keyphrase generation (KG) aims to condense the content of a source text into concise target phrases. Although many KG algorithms have been proposed, most are tailored to deep learning settings with various specially designed strategies and may fail to solve the exposure bias problem. Reinforcement learning (RL), a class of control optimization techniques, is well suited to compensating for some of the limitations of deep learning methods. Nevertheless, RL methods typically suffer from four core difficulties in keyphrase generation: environment interaction and effective exploration, complex action control, reward design, and task-specific obstacles. To tackle this difficult but significant task, we present RegRL-KG, which combines actor-critic-based reinforcement learning control with L1 policy regularization on top of a sequence-to-sequence (Seq2Seq) deep learning model trained under the maximum likelihood estimation (MLE) criterion, for efficient keyphrase generation. The agent uses an actor-critic network to control the generated probability distribution and employs L1 policy regularization to alleviate the exposure bias problem. Extensive experiments show that our method improves the evaluation metrics on five scientific article benchmark datasets.
Introduction
Keyphrase generation is a natural language generation task. It partly preserves source text as present keyphrases or chooses the absent content from a pre-defined vocabulary. By distilling the key information of a source text into a set of succinct keyphrases, keyphrase generation serves as the basis for many downstream tasks, including document clustering [12, 8], information retrieval [27, 2], and text summarization [28].
Recently, keyphrase generation has been further advanced by training an encoder-decoder deep learning network with the maximum likelihood estimation (MLE) criterion on the final generated probability distribution [18]. Although widely used, MLE-based methods tend to generate generic content [34]. Most MLE-based techniques either focus on one limited aspect or rely on an additional loss to enforce constraints on the generation process. MLE-based models are also prone to the exposure bias problem [19, 13]: the model is trained conditioned on the ground-truth keyphrases but, at test time, predicts the next keyword conditioned on its own previously generated words. In addition, the MLE criterion operates at the keyword level, without considering instant feedback or the long-term effect on subsequent keywords in a keyphrase. Because a previously predicted word strongly influences the following predictions, the quality of the target keyphrase improves when an optimal word is selected at each key step. Conversely, a bad estimate at one step may cause the whole generation process to diverge. The error introduced by each keyword accumulates and degrades the subsequent keyphrases.
Unlike MLE, which works at the keyword level, reinforcement learning enables the policy to react rapidly to instant feedback and to maximize the expected long-term total reward. Another primary driving force behind the adoption of RL in keyphrase generation is its integration with powerful nonlinear function approximators such as deep neural networks. This partnership, often referred to as deep reinforcement learning (DRL), has enabled RL to extend successfully to tasks with high-dimensional input and action spaces. However, designing a reliable keyphrase generation agent is not trivial. As keyphrase generation methods must handle increasingly complex aspects, building a fully rule-based keyphrase agent becomes challenging and laborious, and requires heavy domain expertise. Rule-based reward models have two main disadvantages: 1) It is difficult to define a comprehensive list of rules. The literature contains a long list of keyphrase generation rules or evaluation criteria, such as F1@5, F1@M, duplication ratio and the average keyphrase number [32, 31], each of which focuses on a single aspect. 2) It is difficult to balance different rules. The objective evaluation of keyphrase generation is multi-criteria, and good ratings on one criterion do not necessarily imply good ratings on another. Instead of laboriously designing rule-based reward functions, researchers learn the reward model directly in a data-driven manner, which avoids unnecessary or incorrect constraints imposed by human specialists. Data-driven RL-based keyphrase generation methods have been proposed in recent studies [4, 24], but the models are often unstable and exhibit large training variance.
Moreover, widespread adoption of data-driven reinforcement techniques for natural language generation is still limited by four major challenges: environment interaction and effective exploration, complex action control, reward design, and task-specific obstacles. First, RL requires direct interaction with the environment and effective exploration to find good policies and to avoid converging prematurely to local optima. Environment interaction and effective exploration nevertheless remain key challenges for DRL in complex environments with large continuous state and action spaces. Second, the complexity of action control comes from generating the holistic probability distribution. The number of actions can grow exponentially with the average length of a source text and the vocabulary size. Such high-dimensional action spaces make target selection and policy control difficult. Third, an RL algorithm needs a predefined reward signal to guide the optimization process. Bootstrapping errors from an inappropriate reward function can cause the objective metrics to diverge or provide too little feedback, yielding poor performance. Designing a keyphrase reward is not trivial, as it is equivalent to building an evaluative criterion that judges how well the agent performs its task. A fitness metric can consolidate returns across an entire episode and make RL robust to long time horizons. Finally, although competing deep learning approaches have appeared for KG, adapting RL to the deep learning model remains a severe hindrance in the KG scenario, especially given the diverse properties of a keyphrase. We also note that, for a deep learning-based Seq2Seq model, training the prediction of the next keyword depends on the teacher forcing mechanism [30] rather than on the previously predicted keywords. During testing, however, the model has to make predictions based on its own previous predictions.
The exposure bias inherent in deep learning-based methods tends to degrade the subsequent generation quality over time, leading to significant drift and divergence of the objective metrics as generation goes on [13].
To tackle these issues, some reinforcement algorithms have been explored to coordinate with the supervised Seq2Seq model [4, 24]. Unlike reinforcement algorithms that use the metric scores of the entire set of target keyphrases as rewards, we design an actor-critic (AC) framework with a generalized advantage estimator (GAE) [22] to provide a reward for each keyword in the predicted keyphrase, and we add an L1 policy regularization during RL training, without explicit domain knowledge. The advantage is that the model can be trained on large volumes of available source text without requiring expensive domain expertise, making it more generalizable. With the advantage function from the actor-critic network and the L1 policy regularization, our algorithm updates the Seq2Seq model with a smooth gradient estimate and substantially reduces the training variance. Specifically, a backbone Seq2Seq network is trained with the MLE criterion to extract basic features and model keyword accuracy. We then apply AC to the keyphrase generation policy. The actor network selects its action conditioned on the varying action space, while the critic is trained with the guidance of the target keyphrase. Reinforcement learning (RL) methods can operate on the final generation policy, which effectively encodes specific attitude along the generation axis. Furthermore, because RL naturally separates the exploration and generation phases, a policy regularization loss is introduced to reduce the impact of the exposure bias problem. The intuition behind the network design is that high-accuracy, low-variance predictions should be preferred. For higher-confidence predictions, RegRL-KG introduces relative reward signals obtained by trial and error, which capture the global temporal return and the dependency between the iterative generation and exploration stages.
The global temporal return evaluates the improvement in keyphrase generation quality after each backward pass. The dependency is used to judge whether the policy learned in the training phase can choose a more robust and powerful action than in the exploration phase. RL provides a more stable and robust representation with the designed relative reward.
The distinctive points of our approach are summarized as follows:
We present an L1-regularized actor-critic-based keyphrase generation method, namely RegRL-KG, which directly manipulates the generated probability distribution in a deterministic environment. L1 policy regularization is introduced to alleviate the exposure bias problem and reduce the variance between the exploration and generation phases during training. We demonstrate that RegRL-KG boosts the task metrics and achieves state-of-the-art performance on five standard scientific article datasets.
In the rest of the paper, we discuss related keyphrase generation and reinforcement learning methods in Section 2. Section 3 gives a brief formalization of the reinforcement learning background. Section 4 presents the proposed keyphrase generation method RegRL-KG. Section 5 describes the experimental setup. Section 6 presents the experimental results and discusses the impact of the presented contributions. Section 7 concludes the paper and indicates potential future work.
Related work
The keyphrase generation task can be divided into keyphrase extraction and keyphrase generation. One solution is keyphrase extraction [33, 17, 10]. Other work on keyphrase generation incorporates additional techniques such as copy mechanisms, coverage mechanisms, orthogonal regularization, semantic coverage, and exclusion mechanisms to further improve generation ability [32, 7, 6, 31]. Another exciting thread of research is the extension of reinforcement learning to deep learning settings [4, 24].
Keyphrase extraction and generation
Early deep keyphrase extraction approaches selected relevant keywords from the source text [33, 17, 10]. Their drawback is that they cannot generate absent keyphrases that do not appear in the source text.
Keyphrase generation methods are trained with standard backpropagation under the MLE criterion [18, 26, 3, 6]; they generate target keyphrases from present content preserved from the source text and absent content drawn from a pre-defined vocabulary. Meng et al. [18] applied an attentional encoder-decoder architecture and learned the generated keyword distribution over present and absent representations through weight sharing and copy mechanisms. To improve model performance, various control strategies such as the coverage mechanism [26], orthogonal regularization [3], target encoding strategies and semantic coverage [3], and exclusion mechanisms [6] have been explored to bridge the gap between the predicted probabilities and the ground truth. Some alternative methods add an auxiliary component, e.g., a variational autoencoder [29] or a generative adversarial network [24], to assist keyphrase generation. Similar scalability and competitive results were demonstrated by Chen et al., who use title information to guide keyphrase generation [7]. The most recent keyphrase generation method is One2Set [31], whose authors propose control codes and a K-step target assignment mechanism via bipartite matching to generate keyphrases. In summary, existing deep learning methods approach keyphrase generation by designing additional loss functions. However, limited generation accuracy and the exposure bias problem persist in current deep learning methods. In this paper, RegRL-KG provides a framework that builds on these developments for further improvement.
Reinforcement learning
Several works have applied reinforcement algorithms to assist the Seq2Seq model in keyphrase generation [4, 24]. Observing that the number of generated keyphrases is smaller than the number of target keyphrases, a reinforcement learning method [4] with adaptive rewards guided by evaluation scores was proposed to generate keyphrases that are both sufficient and accurate. Close to our approach, the authors employ a reinforcement algorithm with adaptive rewards to directly optimize the task's evaluation metrics, such as the F1@5 and F1@M scores. These metrics provide a reasonable way to measure the quality of the generated keyphrases. However, a model that uses the prediction results as rewards requires considerable post-processing and thus exhibits large training variance. Furthermore, the limited exploration of their algorithm also undermines the accuracy of the predicted keyphrases. Recently, there has been a renewed push to use reinforcement learning algorithms as alternatives for keyphrase generation. Swaminathan et al. [24] use a generative adversarial network (GAN) to learn the rewards for the generator. Nevertheless, a GAN is unstable and difficult to train [21]. More importantly, the trade-off between exploration and generation is a key problem for reinforcement learning models. To avoid training instability, we adopt actor-critic-based reinforcement learning with a carefully designed reward to ensure a more stable and accurate generation environment.
Background
In this section, we introduce the Markov decision process (MDP) [23] and the actor-critic algorithm.
Markov decision process
A standard discounted Markov decision process can be formalized as
Actor-critic algorithm
Deep deterministic policy gradient (DDPG) is a model-free RL algorithm developed for working with continuous, high-dimensional action spaces [25]. DDPG uses an actor-critic architecture maintaining a deterministic policy (actor)
where
Additionally, a regularization term
This section introduces an L1-regularized reinforcement learning method, called RegRL-KG, which constructs a Markov decision process for the keyphrase generation task and applies the actor-critic algorithm with L1 policy regularization to improve the Seq2Seq model. In Section 4.1, the keyphrase generation problem is first formalized. The RL method is then presented in Section 4.2, the implementation details of the training and inference phases are given in Section 4.3, and the joint training loss follows in Section 4.4.
Architecture overview of RegRL-KG. The left part of the figure shows the deep learning part, which includes the actor network; the right part shows the reinforcement learning and L1 policy regularization. Blue lines denote the exploration phase, red lines the generation phase, and green lines the critic network.
Our task is keyphrase generation. Assuming a source text
Overview of the RegRL-KG
The overall pipeline of our method is illustrated in Fig. 1. We adopt a model-based reinforcement learning approach for keyphrase generation. Our framework can extend an existing deep learning model with reinforcement learning. The networks consist of Observation
[Algorithm: RegRL-KG for Keyphrase Generation. Recoverable steps: train on pairs of source text and ground-truth keyphrases; compute the RL loss.]
Observation, state and transition functions
The KG episode is modeled as a Markov Decision Process (MDP) defined by state space
Reward function
Given a source text, the agent learns to predict a sequence of target keyphrases. The reward function for the agent is usually learned using a relative MLE loss, which measures the similarity between the predicted and the target keyphrases. We want the generation phase to produce optimal keyphrases with a greedy policy, while the exploration phase can produce potentially better trajectories with a stochastic policy. This encourages training of the agent, as it requires better actions to follow from the same state vector in the exploration phase. Based on this, we use the disparity in generated probability between the exploration and training stages as a relative reward:
Keywords that reduce the MLE loss are rewarded by
Based on the MDP formulation and the designed reward, we use the AC algorithm with GAE along the steps of a keyphrase, with the final generation probability distribution as the action space. Our goal is to learn a stochastic policy
with
Besides the problem discussed above, the exposure bias problem between the exploration and generation phases remains. To ensure both training stability and convergence, the policy after an RL update needs to stay close to the policy in the exploration phase. This is achieved by the L1 policy regularization loss. In the keyphrase generation task, the main idea of the L1 policy regularization is to regularize the policy during policy optimization. As long as the output of the exploration phase is near the ground truth that is embedded in the generation phase, it is regarded as a correct prediction. We are therefore more concerned with the target samples in the exploration phase whose predictions are far from the true value. In other words, we expect that, after minimizing the exposure disparity, the probability distributions of the generation and exploration phases differ only slightly. Combining the L1 policy regularization, teacher forcing and RL objectives, RegRL-KG can be jointly optimized to further limit divergence of the policy and ensure its consistency in the generation phase.
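The regularization idea can be illustrated with a minimal sketch. It assumes that the penalty is the mean absolute (L1) difference between the per-step token distributions of the exploration and generation phases; the exact form of the loss is not recoverable from the text, so the function name `l1_policy_penalty` and the toy distributions below are hypothetical and serve only as an illustration.

```python
from typing import Sequence


def l1_policy_penalty(explore: Sequence[Sequence[float]],
                      generate: Sequence[Sequence[float]]) -> float:
    """Mean L1 distance between two sequences of token distributions.

    Each argument holds one probability vector per decoding step.
    This is an assumed form of the penalty, not the paper's definition.
    """
    if len(explore) != len(generate):
        raise ValueError("both phases must cover the same number of steps")
    total = 0.0
    for p, q in zip(explore, generate):
        if len(p) != len(q):
            raise ValueError("distributions must share one vocabulary")
        # L1 distance between the two distributions at this step.
        total += sum(abs(a - b) for a, b in zip(p, q))
    return total / len(explore) if explore else 0.0


# Toy example with a three-word vocabulary and two decoding steps.
explore_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
generate_probs = [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1]]
penalty = l1_policy_penalty(explore_probs, generate_probs)
```

In this toy case only the first step differs, so the penalty is small; identical distributions give a penalty of zero, which matches the stated aim of keeping the two phases close.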
Training, evaluation and testing
Exploration and generation phases in the Training stage
The trade-off between exploration and generation is a key problem for reinforcement learning models. The training process iterates between the exploration and the generation phases. During exploration, a stochastic policy is used to gather intermediate data and is more likely to discover better trajectories. To avoid exposure bias, the exploration model predicts the next keyword based on its previous predictions. In the generation phase, teacher forcing, L1 policy regularization, and a greedy sampling strategy are adopted to guarantee accuracy. In this manner, exploration and generation complement each other.
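The difference between the two phases at a single decoding step can be sketched as follows. This is a simplified illustration rather than the authors' implementation: `greedy_pick` and `stochastic_pick` are hypothetical helper names, and the three-entry probability vector stands in for the actor's output over a real vocabulary.

```python
import random
from typing import Sequence


def greedy_pick(probs: Sequence[float]) -> int:
    """Generation phase: deterministically choose the most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def stochastic_pick(probs: Sequence[float], rng: random.Random) -> int:
    """Exploration phase: sample a token index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)          # fixed seed so the sketch is reproducible
step_probs = [0.1, 0.6, 0.3]    # toy distribution over a three-word vocabulary
greedy_token = greedy_pick(step_probs)
explored = [stochastic_pick(step_probs, rng) for _ in range(1000)]
```

Greedy selection always returns the same index, while repeated sampling visits less likely tokens as well, which is what allows exploration to find trajectories the greedy policy would never produce.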
Training, evaluation and testing stage
Training stage
In the training stage, we design the exploration and the generation phases. At each step, we calculate
Evaluation and testing stages
In the evaluation and testing stages, after generating a keyword, we sequentially use the previously predicted keywords to generate the final keyword probability. Only the actor network is used to produce the actual keyphrase probability given the source text.
Training loss
We apply the actor-critic algorithm with an L1 policy regularization based on the MLE loss function
The details of each loss are summarized as follows:
MLE loss
Traditionally, deep learning-based methods employ maximum likelihood estimation with the teacher forcing technique [9, 30], which maximizes the log-likelihood by minimizing the cross-entropy loss during the training stage. Based on MLE, the KG task is to maximize the conditional probability given the source text
where
Policy regularization loss
We impose an L1 policy regularization loss to directly minimize the prediction logit of the exploration
RL loss
where
We introduce the benchmark datasets and baselines, evaluation metrics and the corresponding experimental settings for the experiments.
Datasets
Following [18], we train the model on the KP20k training dataset and evaluate on the KP20k test dataset. The trained model is also evaluated on four scientific article datasets, Inspec [11], NUS [20], Krapivin [16] and SemEval [14], which have different data characteristics from the source training dataset. Each sample from these datasets consists of a title, an abstract and some keyphrases. The released KP20k dataset consists of 567,830 high-quality scientific articles from various computer science domains, of which 527,830 are used for training, 20,000 for validation, and 20,000 for testing. In the training dataset, the title and abstract are concatenated as the source text. The text pre-processing steps are applied to both source texts and ground-truth keyphrases, including tokenization, lowercasing and replacing all digits with symbol
Statistics of the five scientific article testing datasets
Avg. #KP: the average number of keyphrases. Avg.
We compare against five state-of-the-art baselines that apply a deep learning model to the keyphrase generation task. We also compare with two RL training methods, catSeqTG-2RF1 [4] and GANMR [24].
catSeq [32]. The first baseline model is an RNN-based Seq2Seq model with a copy mechanism.
catSeqTG [7]. An extension of catSeq with additional title encoding and cross-attention.
catSeqTG-2RF1 [4]. An extension of catSeqTG that applies an RL-based fine-tuning algorithm with adaptive rewards to improve the performance of the Seq2Seq model.
GANMR [24]. Another extension of catSeq that uses a discriminator to produce rewards for RL.
ExHiRD-h [6]. A method based on catSeq that uses an exclusive hierarchical decoding framework with a hard exclusion mechanism to improve generation performance.
Transformer [31]. A Transformer-based Seq2Seq model with copy and attention mechanisms for keyphrase generation.
SETTRANS [31]. The most recent Transformer-based Seq2Seq model, with additional control codes and a K-step target assignment mechanism via bipartite matching to generate keyphrases.
Evaluation metrics
In line with prior work [31, 6], we evaluate both present and absent keyphrase predictions using macro-averaged F1@M, a metric with a variable cutoff, and macro-averaged F1@5, a metric with a fixed cutoff. F1@M compares all generated keyphrases with the ground-truth keyphrases. For F1@5, if fewer than five keyphrases are generated, we randomly append incorrect keyphrases until there are five predictions. The Porter stemmer is applied before determining whether two keyphrases match.
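The two metrics can be sketched in a few lines. The sketch assumes that keyphrases have already been stemmed and normalized, that duplicate predictions are removed before scoring, and that padding with incorrect keyphrases is equivalent to fixing the precision denominator at the cutoff; the function names `f1_at_k` and `macro_f1` and the sample phrases are hypothetical.

```python
from typing import Sequence


def f1_at_k(predicted: Sequence[str], gold: Sequence[str], k: int = 0) -> float:
    """F1 for one document; k=0 means F1@M (score every prediction)."""
    # Drop duplicate predictions while keeping their order (an assumption).
    unique = list(dict.fromkeys(predicted))
    if k > 0:
        unique = unique[:k]
    gold_set = set(gold)
    correct = sum(1 for phrase in unique if phrase in gold_set)
    # With a fixed cutoff, empty slots count as wrong predictions, which
    # has the same effect as padding with incorrect keyphrases.
    n_pred = k if k > 0 else len(unique)
    if correct == 0 or n_pred == 0 or not gold_set:
        return 0.0
    precision = correct / n_pred
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def macro_f1(pred_docs: Sequence[Sequence[str]],
             gold_docs: Sequence[Sequence[str]], k: int = 0) -> float:
    """Average the per-document F1 scores over all documents."""
    scores = [f1_at_k(p, g, k) for p, g in zip(pred_docs, gold_docs)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy document: three predictions, two of them correct, four gold keyphrases.
pred = ["neural network", "keyphras gener", "deep learn"]
gold = ["keyphras gener", "deep learn", "reinforc learn", "seq2seq"]
f1_m = f1_at_k(pred, gold)        # precision 2/3, recall 2/4
f1_5 = f1_at_k(pred, gold, k=5)   # precision 2/5, recall 2/4
```

For this toy document F1@5 is lower than F1@M because the two missing predictions are counted as errors, which illustrates why a model that generates too few keyphrases is penalized under the fixed cutoff.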
Implementation details
We implement our framework on the Transformers toolkit and PyTorch 1.1.0.
We train our framework on one Tesla V100 GPU with 32GB memory. The optimization uses Adam [15] with an initial learning rate 0.0001, and halves the learning rate if the perplexity on the validation set stops decreasing for one epoch. Early stopping is employed during training. Batch size is set to 24. For the reward function, we experimentally determine
Results and analysis
Present and absent keyphrase predictions are evaluated in Section 6.1 to demonstrate the effectiveness and generality of the approach. To illustrate the RL training, Section 6.2 analyzes the RL behavior on the KP20k training dataset. Section 6.3 provides a case study of the generated keyphrases. In Section 6.4, an ablation study on the KP20k testing dataset shows how each component influences the model.
Present keyphrase prediction results of all models. The best results are in bold. The subscript represents the corresponding standard deviation (e.g., 0.289 with subscript 3 indicates 0.289 ± 0.003).
Comparison with the state-of-the-art algorithms
The present and absent results are reported in Tables 2 and 3 for clear comparisons with RegRL-KG, respectively. Among the seven existing state-of-the-art baselines, our method performs best in both present and absent metrics on almost all of the datasets. As shown in Tables 2 and 3, RegRL-KG obtains 0.90%/1.10% present F1@5 and F1@M improvement over the best baseline (SETTRANS) on the KP20k dataset as well as a present F1@5/F1@M improvement of 0.40%/0.40% (Inspec), 1.40%/0.90% (NUS), 1.50%/2.00% (Krapivin) and
Absent keyphrase prediction results of all models. The best results are in bold. The subscript represents the corresponding standard deviation (e.g., 0.020 with subscript 1 indicates 0.020 ± 0.001).
As indicated in Tables 2 and 3, our results are close to those of SETTRANS on the present SemEval metrics and even decline on the absent Inspec metrics on average. A likely reason is the smaller average number of keyphrases in the training dataset. The average number of keyphrases in Inspec/SemEval is about 9.79/14.43, which is larger than the average of 5.26 in the KP20k dataset. This implies a narrower exploring scope, which cannot provide supportive evidence when generating longer keyphrase sequences. The smaller keyphrase number may cause partial observation, and the narrower scope leaves the actor with poor generative ability. We believe a better exploration strategy (in our case, increasing the length of trajectories) that broadens the scope of observation may help the agent master such harder or unseen datasets.
Reinforcement learning analysis
Reward and return (upper left), value and advantage (upper right), logits for exploration and training (lower left), and entropy for exploration and source (lower right).
In Fig. 2, we show how the metrics of RegRL-KG change from the perspective of reinforcement learning, using the KP20k training dataset as a case.
Reward and return
We observe that the reward of RegRL-KG in Fig. 2a is consistent with the return. RegRL-KG approximates the expected cumulative rewards and thus provides accurate returns. The reward and return scores grow with training and become relatively steady after about 2500 batches. Despite a small fluctuation from batch 27000 to 30000, AC and L1 policy regularization pull the model back to a stationary state. The sharp oscillation arises because RL randomly selects tokens to explore the action space in the exploration phase, which makes policy training difficult. As the number of batches increases, the reward aligns well with the return, indicating that the controlled policy works effectively with the environment and state. RegRL-KG interacts well with the reward and the accumulated return, which also implies that it can learn a reasonable policy.
Value and advantage
Figure 2b presents the intermediate process of reinforcement learning. Both the value and advantage scores are smooth and consistent during training. We evaluate RegRL-KG by measuring 1) whether it reduces the temporal difference (TD) error and 2) whether it approximates the expected value. As shown in Fig. 2b, the advantage initially changes drastically but quickly becomes much smoother as the number of iterations grows, and its convergence is consistent with the training of the value metric. As exploration expands, the TD errors become smoother while the approximation difference increases. This is because, as the exploration scale grows, the model facilitates TD learning while trying to reduce the training variance. The steady training advantage curve shows that it is also viable to use TD advantage policy gradients in AC. In addition, the value curve of the critic network increases during training, demonstrating the effectiveness of the state embedding. We conjecture that the key reason is that the critic of the RL agent already knows the values of certain states and actions, so it can be used to select which action to follow in a principled and globally consistent manner.
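The relation between TD errors and advantages discussed here can be made concrete with the standard generalized advantage estimator [22]. The following is a minimal sketch of that standard recursion, not the authors' code; the default discount `gamma` and smoothing factor `lam` are illustrative assumptions, since the values used in the experiments are not given in the text.

```python
from typing import List, Sequence


def gae_advantages(rewards: Sequence[float], values: Sequence[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimates for one keyphrase trajectory.

    `values` must hold one more entry than `rewards`: the critic's value
    for every visited state plus a bootstrap value for the final state.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values needs exactly one extra bootstrap entry")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        running = td_error + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy trajectory of two keywords with a terminal bootstrap value of zero.
toy_rewards = [1.0, 1.0]
toy_values = [0.5, 0.5, 0.0]
full = gae_advantages(toy_rewards, toy_values, gamma=1.0, lam=1.0)
one_step = gae_advantages(toy_rewards, toy_values, gamma=1.0, lam=0.0)
```

With `lam=0` the estimate reduces to the one-step TD error at each keyword, while larger values blend in later TD errors; this trade-off between bias and variance is one way to obtain the smoother advantage curves described above.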
The exploration is beneficial for the generability
We investigate the influence of exploration by comparing the logits (Fig. 2c) and the entropy (Fig. 2d) of the exploration and training phases. The keyword logit in the training phase is slightly higher than in exploration, which is generally consistent with the entropy results. Without exploration, the improvement made by AC alone is very limited, because it is difficult for AC to learn a complex word-level critic without exploration. As training converges, the keyword logit and entropy learning curves in Fig. 2c and d further demonstrate the benefit of exploration. The exploration phase forces the model to explore different words at key positions.
Case study
To better understand RL control and policy regularization, we follow the ExHiRD-h method and present the same sample with the keyphrases generated on the KP20k testing dataset. As Fig. 3 shows, RegRL-KG generates more accurate keyphrases than the best baseline, SETTRANS. RegRL-KG directly highlights the present keyphrases “co-verification” and “debugging” as well as the absent keyphrase “computer software”. In terms of prediction content, RegRL-KG and ExHiRD-h both rank first and predict more accurate keyphrases than the latest method, SETTRANS. Nevertheless, RegRL-KG predicts six keyphrases, which is closer to the target number of four than the nine keyphrases predicted by ExHiRD-h. RegRL-KG improves the number of unique keyphrases by a large margin, which benefits from the combination of RL control and L1 policy regularization. Correspondingly, RegRL-KG alleviates the overgeneration problem present in the baselines. Besides, fewer repeated keyphrases are generated with the RL agent and the policy regularization. For instance, ExHiRD-h produces the keyphrase “computer software” twice, whereas RegRL-KG generates it only once, which demonstrates that our method is more effective at avoiding duplicated keyphrases. The lower average duplication reported in Table 4 quantitatively confirms this phenomenon, indicating that RegRL-KG arranges the keyword probability distributions in a more precise and compact way.
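Counting duplicates in a prediction list, as in the comparison above, can be sketched as follows. The helper `duplication_stats` and the sample list are hypothetical; the list reuses phrases from the case study but is not the actual output of any model.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def duplication_stats(predicted: Sequence[str]) -> Tuple[int, int, Dict[str, int]]:
    """Return (unique count, duplicate count, times each phrase was generated)."""
    counts = Counter(predicted)
    unique = len(counts)
    # Every occurrence beyond the first of a phrase counts as a duplicate.
    duplicates = len(predicted) - unique
    return unique, duplicates, dict(counts)


# Hypothetical prediction list in which one keyphrase is generated twice.
sample = ["co-verification", "debugging", "computer software", "computer software"]
unique, duplicates, per_phrase = duplication_stats(sample)
```

The per-phrase counts correspond to the kind of number shown in parentheses after each keyphrase in a case-study figure, and averaging the duplicate count over documents gives one simple duplication measure.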
An example of generated keyphrases by baselines and RegRL-KG on the KP20k testing dataset. The correct predictions are bold and the present keyphrases are underlined. The digit in parentheses denotes the duplication number of the predicted keyphrases (e.g., “co-verification (1)” represents the keyphrase “co-verification” is generated one time).
To understand what makes our framework effective, we conduct an ablation study on the SemEval dataset. We ablate the RL control and the L1 policy regularization loss. The effect of each individual component is reported in Table 4. SETTRANS [31] is used as the baseline.
Full model
From Table 4, we can conclude that using the actor-critic algorithm and the L1 policy regularization strategy together learns a strong policy. After adding RL control and L1 policy regularization, the scores of many predicted keyphrases improve over the baselines. This indicates a positive relationship between the RL control (as well as the L1 policy regularization) and the ground truth, which is consistent with the motivation given in the introduction. Besides, the average present keyphrase number is the closest to the ground truth, and the predicted number of absent keyphrases ranks second among all the baselines. The average duplication of the predicted keyphrases is the lowest. Moreover, SETTRANS is built on Transformers, so our improvement does not rely on replacing any language model, but on the actor-critic algorithm and L1 policy regularization. The comprehensive metrics show that the combination of reinforcement control and L1 policy regularization is superior to each component alone.
RL control
When RL control is removed, the final performance decreases heavily compared to the full model and is even worse than the baseline SETTRANS. We believe that our AC network, which is designed to preserve the deep learning structure, allows the model to generate keyphrases stably and drives the performance improvement shown in Table 4. We find that RL plays a crucial role: after adding the RL network, RegRL-KG improves F1@5 and F1@M by 0.50% and 1.00%, respectively. This means that the actor is a good explorer for selecting proper actions conditioned on the distributional probability in the training phase. In addition, the AC setting is noticeably more stable across training. This suggests that the actor-critic algorithm can substantially improve the model’s performance and make RL more stable across training, owing to the stationarity of the environment.
Ablation study. The best results are bold. The subscript represents the corresponding standard deviation (e.g., 0.367_0 indicates 0.367 ± 0.000).
Avg. #PK: the average number of present keyphrases. Avg. #AK: the average number of absent keyphrases. Dup: the average duplication of the predicted keyphrases. The lower duplication ratio is better. Oracle: the gold average keyphrase number. The closest values to the oracles are bold.
Adding only L1 policy regularization leads to a higher F1@5 and a higher F1@M score than the baseline SETTRANS. Still, the full model RegRL-KG achieves the highest F1@5 and F1@M scores and obtains the highest task success. This is because L1 policy regularization limits the degree of exploration and generation, which lets the model learn the generation manner and enables the actions explored in the exploration phase to be analogously reflected in the target training phase.
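For readers unfamiliar with the two metrics discussed above, the following sketch shows one common way to compute them: F1@M scores all unique predictions against the gold keyphrases, while F1@5 scores only the top five. It is an illustrative simplification rather than the evaluation code used in the paper. The function name `f1_at_k` is hypothetical, matching is exact after lowercasing (no stemming), and some works in this area additionally pad the prediction list to five entries when fewer are produced, which is not done here.

```python
from typing import Optional


def f1_at_k(predictions: list[str], targets: list[str],
            k: Optional[int] = None) -> float:
    """F1 between predicted and gold keyphrases.

    k=None scores all unique predictions (F1@M);
    k=5 scores only the top five unique predictions (F1@5).
    """
    seen = set()
    unique = []
    for p in predictions:
        q = " ".join(p.lower().split())
        if q not in seen:  # drop duplicates, keep ranking order
            seen.add(q)
            unique.append(q)
    if k is not None:
        unique = unique[:k]
    gold = {" ".join(t.lower().split()) for t in targets}
    if not unique or not gold:
        return 0.0
    correct = sum(1 for p in unique if p in gold)
    if correct == 0:
        return 0.0
    precision = correct / len(unique)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: six predictions, four gold keyphrases, three correct overall.
preds = ["a", "b", "x", "y", "z", "c"]
gold = ["a", "b", "c", "d"]
f1_m = f1_at_k(preds, gold)        # all six predictions
f1_5 = f1_at_k(preds, gold, k=5)   # top five only
print(round(f1_m, 4), round(f1_5, 4))
```

In the toy example, the correct prediction ranked sixth counts towards F1@M but not F1@5, which is why the two metrics can respond differently to the same change in the model.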
Conclusion
Deep reinforcement learning methods offer an intuitive paradigm for exploring the goal-driven keyphrase generation task. The sheer size of the generation action space, however, has proven to be out of reach for existing algorithms. In this paper, we introduced RegRL-KG, a novel learning agent that demonstrates the feasibility of scaling reinforcement learning to keyphrase generation action spaces with hundreds of millions of actions. The key insight for efficiently exploring such a large space is the combination of the actor-critic algorithm and L1 policy regularization control. The actor-critic network serves as a means for the agent to accumulate information about the target distribution probability, while L1 policy regularization is adopted to ensure model convergence and training stability. Together they constrain the vast space of possible actions into a compact space of sensible ones. The RegRL-KG framework demonstrates that RL algorithms can work well on the keyphrase generation task when combined with DNN methods. We evaluate the generality of our approach on five datasets, where it achieves state-of-the-art results among comparable methods.
From a reinforcement learning perspective, RegRL-KG can be viewed as a form of ‘goal-driven guide’ that promotes exploration towards states with higher long-term returns, boosts the accuracy of explored policies, and introduces stability for policy control. The principal mechanism behind RegRL-KG is its capability to incorporate both modes of learning: learning directly from the high accuracy of deep learning methods while being aligned to maximize long-term returns by leveraging controllable RL optimization. In this paper, we use a standard AC to operate the backbone deep learning methods on the holistic distribution. Incorporating decoupled sub-distributions, e.g., the copy and generation probability distributions, is an exciting direction for future work.
Acknowledgments
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work was supported in part by the National Natural Science Foundation of China under Grant 62272100, the Consulting Project of Chinese Academy of Engineering under Grant 2023-XY-09, and in part by the Major Project of the National Social Science Fund of China under Grant 21ZD11 and the Fundamental Research Funds for the Central Universities.
