Complete Edition: Top-Conference Papers on Translation, Summarization, Dialogue, and Text Generation Tasks

Machine Translation

ACL

No. Conference/Journal Paper Main Technique Code Paper Link Abstract Abstract (Chinese Translation) Authors
1 ACL2019 Latent Variable Model for Multi-modal Translation https://github.com/iacercalixto/variational_mmt https://arxiv.org/pdf/1811.00357 In this work, we propose to model the interaction between visual and textual features for multi-modal neural machine translation (MMT) through a latent variable model. This latent variable can be seen as a multi-modal stochastic embedding of an image and its description in a foreign language. It is used in a target-language decoder and also to predict image features. Importantly, our model formulation utilises visual and textual inputs during training but does not require that images be available at test time. We show that our latent variable MMT formulation improves considerably over strong baselines, including a multi-task learning approach (Elliott and Kádár, 2017) and a conditional variational auto-encoder approach (Toyama et al., 2016). Finally, we show improvements due to (i) predicting image features in addition to only conditioning on them, (ii) imposing a constraint on the minimum amount of information encoded in the latent variable, and (iii) by training on additional target-language image descriptions (i.e. synthetic data). 在这项工作中,我们建议通过潜在变量模型为多模态神经机器翻译 (MMT) 的视觉和文本特征之间的交互建模。这个潜在变量可以看作是图像的多模态随机嵌入及其在外语中的描述。它用于目标语言解码器,也用于预测图像特征。重要的是,我们的模型公式在训练期间利用视觉和文本输入,但不需要在测试时提供图像。我们表明,我们的潜在变量 MMT 公式在强基线上有相当大的改进,包括多任务学习方法(Elliott 和 Kádár,2017 年)和条件变分自动编码器方法(Toyama 等人,2016 年)。最后,我们展示了由于(i)预测图像特征以及仅对它们进行调节,(ii)对潜在变量中编码的最小信息量施加约束,以及(iii)通过额外目标语言训练的改进图像描述(即合成数据)。 Iacer Calixto Miguel Rios Wilker Aziz
2 ACL2021 Rewriter-Evaluator Architecture for Neural Machine Translation https://arxiv.org/pdf/2012.05414 Encoder-decoder has been widely used in neural machine translation (NMT). A few methods have been proposed to improve it with multiple passes of decoding. However, their full potential is limited by a lack of appropriate termination policies. To address this issue, we present a novel architecture, Rewriter-Evaluator. It consists of a rewriter and an evaluator. Translating a source sentence involves multiple passes. At every pass, the rewriter produces a new translation to improve the past translation and the evaluator estimates the translation quality to decide whether to terminate the rewriting process. We also propose prioritized gradient descent (PGD) that facilitates training the rewriter and the evaluator jointly. Though incurring multiple passes of decoding, Rewriter-Evaluator with the proposed PGD method can be trained with a similar time to that of training encoder-decoder models. We apply the proposed architecture to improve the general NMT models (e.g., Transformer). We conduct extensive experiments on two translation tasks, Chinese-English and English-German, and show that the proposed architecture notably improves the performances of NMT models and significantly outperforms previous baselines. 编码器-解码器已广泛应用于神经机器翻译(NMT)。已经提出了一些方法来通过多次解码来改进它。然而,由于缺乏适当的终止政策,它们的全部潜力受到限制。为了解决这个问题,我们提出了一种新颖的架构,Rewriter-Evaluator。它由重写器和评估器组成。翻译源句子涉及多次传递。在每次通过时,重写者都会生成一个新的翻译来改进过去的翻译,而评估者则评估翻译质量以决定是否终止重写过程。我们还提出了优先梯度下降 (PGD),它有助于联合训练重写器和评估器。尽管会导致多次解码,但可以使用与训练编码器-解码器模型相似的时间来训练使用所提出的 PGD 方法的 Rewriter-Evaluator。我们应用所提出的架构来改进一般的 NMT 模型(例如,Transformer)。我们对汉语-英语和英语-德语这两个翻译任务进行了大量实验,结果表明所提出的架构显着提高了 NMT 模型的性能,并显着优于以前的基线。 Yangming Li Kaisheng Yao
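A minimal sketch of the rewrite-then-evaluate decoding loop described in this abstract, assuming hypothetical `rewriter` and `evaluator` stand-ins (they are placeholders, not the authors' models); the termination rule simply stops when the estimated quality no longer improves or crosses a threshold.

```python
# Illustrative Rewriter-Evaluator decoding loop (placeholders, not the paper's code).

def rewriter(source, prev_translation):
    # Placeholder: a real rewriter is an encoder-decoder conditioned on both
    # the source sentence and the previous-pass translation.
    return prev_translation + ["<improved>"]

def evaluator(source, translation):
    # Placeholder quality estimate in [0, 1]; a real evaluator is a learned model.
    return min(1.0, len(translation) / (len(source) + 1))

def rewrite_until_good(source, max_passes=5, threshold=0.9):
    translation, best_score = [], float("-inf")
    for _ in range(max_passes):
        candidate = rewriter(source, translation)
        score = evaluator(source, candidate)
        if score <= best_score:          # no improvement -> terminate
            break
        translation, best_score = candidate, score
        if best_score >= threshold:      # good enough -> terminate
            break
    return translation, best_score

if __name__ == "__main__":
    print(rewrite_until_good(["ein", "Beispiel", "Satz"]))
```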
3 ACL2021 Consistency Regularization for Cross-Lingual Fine-Tuning https://github.com/bozheng-hit/xTune https://arxiv.org/pdf/2106.08226 Fine-tuning pre-trained cross-lingual language models can transfer task-specific supervision from one language to the others. In this work, we propose to improve cross-lingual fine-tuning with consistency regularization. Specifically, we use example consistency regularization to penalize the prediction sensitivity to four types of data augmentations, i.e., subword sampling, Gaussian noise, code-switch substitution, and machine translation. In addition, we employ model consistency to regularize the models trained with two augmented versions of the same training set. Experimental results on the XTREME benchmark show that our method significantly improves cross-lingual fine-tuning across various tasks, including text classification, question answering, and sequence labeling. Bo Zheng Li Dong Shaohan Huang Wenhui Wang Zewen Chi Saksham Singhal Wanxiang Che Ting Liu Xia Song Furu Wei
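A minimal sketch of the example-consistency term (not the released xTune code): penalize the divergence between the model's predictions on an example and on an augmented copy of it. The symmetrized KL and the Gaussian-noise perturbation below are illustrative assumptions.

```python
# Illustrative example-consistency regularizer between two forward passes.
import torch
import torch.nn.functional as F

def example_consistency_loss(logits_orig, logits_aug):
    """Symmetrized KL between the two prediction distributions."""
    p = F.log_softmax(logits_orig, dim=-1)
    q = F.log_softmax(logits_aug, dim=-1)
    kl_pq = F.kl_div(q, p.exp(), reduction="batchmean")  # KL(P || Q)
    kl_qp = F.kl_div(p, q.exp(), reduction="batchmean")  # KL(Q || P)
    return 0.5 * (kl_pq + kl_qp)

# Toy usage: random logits stand in for the predictions on the original input
# and on a Gaussian-noise-augmented copy of it.
logits_a = torch.randn(8, 5)
logits_b = logits_a + 0.1 * torch.randn(8, 5)
print(float(example_consistency_loss(logits_a, logits_b)))
```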
4 ACL2021 Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment https://github.com/CZWin32768/XLM-Align https://arxiv.org/pdf/2106.06381 The cross-lingual language models are typically pretrained with masked language modeling on multilingual text or parallel sentences. In this paper, we introduce denoising word alignment as a new cross-lingual pre-training task. Specifically, the model first self-labels word alignments for parallel sentences. Then we randomly mask tokens in a bitext pair. Given a masked token, the model uses a pointer network to predict the aligned token in the other language. We alternately perform the above two steps in an expectation-maximization manner. Experimental results show that our method improves cross-lingual transferability on various datasets, especially on the token-level tasks, such as question answering, and structured prediction. Moreover, the model can serve as a pretrained word aligner, which achieves reasonably low error rates on the alignment benchmarks. The code and pretrained parameters are available at https://github.com/CZWin32768/XLM-Align. 跨语言语言模型通常使用多语言文本或平行句子的掩码语言模型进行预训练。在本文中,我们将去噪词对齐作为一种新的跨语言预训练任务引入。具体来说,该模型首先自我标记平行句子的词对齐。然后我们随机屏蔽一个 bittext 对中的令牌。给定一个掩码标记,该模型使用指针网络来预测其他语言中对齐的标记。我们以期望最大化的方式交替执行上述两个步骤。实验结果表明,我们的方法提高了各种数据集的跨语言迁移能力,尤其是在令牌级任务上,例如问答和结构化预测。此外,该模型可以作为预训练的词对齐器,在对齐基准上实现相当低的错误率。代码和预训练参数可从 https://github.com/CZWin32768/XLM-Align 获得。 Zewen Chi Li Dong Bo Zheng Shaohan Huang Xian-Ling Mao Heyan Huang Furu Wei
5 ACL2021 Improving Zero-Shot Translation by Disentangling Positional Information https://github.com/nlp-dke/NMTGMinor/tree/master/recipes/zero-shot https://arxiv.org/pdf/2012.15127 Multilingual neural machine translation has shown the capability of directly translating between language pairs unseen in training, i.e. zero-shot translation. Despite being conceptually attractive, it often suffers from low output quality. The difficulty of generalizing to new translation directions suggests the model representations are highly specific to those language pairs seen in training. We demonstrate that a main factor causing the language-specific representations is the positional correspondence to input tokens. We show that this can be easily alleviated by removing residual connections in an encoder layer. With this modification, we gain up to 18.5 BLEU points on zero-shot translation while retaining quality on supervised directions. The improvements are particularly prominent between related languages, where our proposed model outperforms pivot-based translation. Moreover, our approach allows easy integration of new languages, which substantially expands translation coverage. By thorough inspections of the hidden layer outputs, we show that our approach indeed leads to more language-independent representations. 多语言神经机器翻译已经显示出在训练中看不到的语言对之间直接翻译的能力,即零样本翻译。尽管在概念上很有吸引力,但它经常受到输出质量低的影响。推广到新的翻译方向的困难表明模型表示对于训练中看到的那些语言对是高度特定的。我们证明了导致语言特定表示的一个主要因素是与输入标记的位置对应。我们表明,通过删除编码器层中的残差连接可以轻松缓解这种情况。通过这种修改,我们在零样本平移上获得了高达 18.5 BLEU 点,同时在监督方向上保持了质量。相关语言之间的改进尤为突出,我们提出的模型优于基于枢轴的翻译。此外,我们的方法允许轻松集成新语言,从而大大扩展了翻译范围。通过对隐藏层输出的彻底检查,我们表明我们的方法确实导致了更多与语言无关的表示。 Danni Liu Jan Niehues James Cross Francisco Guzmán Xian Li
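A schematic re-implementation (not the authors' NMTGMinor recipe) of the modification this abstract describes: an encoder layer whose residual connection around self-attention is removed, so the layer output is no longer tied position-by-position to its input.

```python
# Illustrative encoder layer with the self-attention residual removed.
import torch
import torch.nn as nn

class EncoderLayerNoResidual(nn.Module):
    def __init__(self, d_model=64, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        h = self.norm1(attn_out)             # note: no `x +` residual here
        return self.norm2(h + self.ffn(h))   # the feed-forward residual is kept

layer = EncoderLayerNoResidual()
print(layer(torch.randn(2, 7, 64)).shape)    # torch.Size([2, 7, 64])
```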
6 ACL2021 Beyond Offline Mapping: Learning Cross-lingual Word Embeddings through Context Anchoring https://arxiv.org/pdf/2012.15715 Recent research on cross-lingual word embeddings has been dominated by unsupervised mapping approaches that align monolingual embeddings. Such methods critically rely on those embeddings having a similar structure, but it was recently shown that the separate training in different languages causes departures from this assumption. In this paper, we propose an alternative approach that does not have this limitation, while requiring a weak seed dictionary (e.g., a list of identical words) as the only form of supervision. Rather than aligning two fixed embedding spaces, our method works by fixing the target language embeddings, and learning a new set of embeddings for the source language that are aligned with them. To that end, we use an extension of skip-gram that leverages translated context words as anchor points, and incorporates self-learning and iterative restarts to reduce the dependency on the initial dictionary. Our approach outperforms conventional mapping methods on bilingual lexicon induction, and obtains competitive results in the downstream XNLI task. 最近关于跨语言词嵌入的研究主要由对齐单语嵌入的无监督映射方法主导。这些方法严重依赖于那些具有相似结构的嵌入,但最近表明,不同语言的单独训练会导致偏离这一假设。在本文中,我们提出了一种没有这种限制的替代方法,同时需要一个弱种子字典(例如,相同单词的列表)作为唯一的监督形式。我们的方法不是对齐两个固定的嵌入空间,而是通过修复目标语言嵌入,并为源语言学习一组与它们对齐的新嵌入来工作。为此,我们使用了skip-gram 的扩展,它利用翻译的上下文词作为锚点,并结合自学习和迭代重启来减少对初始字典的依赖。我们的方法在双语词典归纳方面优于传统的映射方法,并在下游 XNLI 任务中获得了有竞争力的结果。 Aitor Ormazabal Mikel Artetxe Aitor Soroa Gorka Labaka Eneko Agirre
7 ACL2021 Verb Knowledge Injection for Multilingual Event Processing https://arxiv.org/pdf/2012.15421 In parallel to their overwhelming success across NLP tasks, language ability of deep Transformer networks, pretrained via language modeling (LM) objectives has undergone extensive scrutiny. While probing revealed that these models encode a range of syntactic and semantic properties of a language, they are still prone to fall back on superficial cues and simple heuristics to solve downstream tasks, rather than leverage deeper linguistic knowledge. In this paper, we target one such area of their deficiency, verbal reasoning. We investigate whether injecting explicit information on verbs’ semantic-syntactic behaviour improves the performance of LM-pretrained Transformers in event extraction tasks — downstream tasks for which accurate verb processing is paramount. Concretely, we impart the verb knowledge from curated lexical resources into dedicated adapter modules (dubbed verb adapters), allowing it to complement, in downstream tasks, the language knowledge obtained during LM-pretraining. We first demonstrate that injecting verb knowledge leads to performance gains in English event extraction. We then explore the utility of verb adapters for event extraction in other languages: we investigate (1) zero-shot language transfer with multilingual Transformers as well as (2) transfer via (noisy automatic) translation of English verb-based lexical constraints. Our results show that the benefits of verb knowledge injection indeed extend to other languages, even when verb adapters are trained on noisily translated constraints. 在 NLP 任务中取得压倒性成功的同时,通过语言建模 (LM) 目标预训练的深度 Transformer 网络的语言能力也受到了广泛的审查。虽然探索表明这些模型编码了语言的一系列句法和语义属性,但它们仍然倾向于依靠表面线索和简单的启发式方法来解决下游任务,而不是利用更深层次的语言知识。在本文中,我们针对他们的不足之处之一,即语言推理。我们调查了注入关于动词语义句法行为的显式信息是否可以提高 LM 预训练 Transformer 在事件提取任务中的性能 - 准确的动词处理至关重要的下游任务。具体来说,我们将精选词汇资源中的动词知识传授给专用的适配器模块(称为动词适配器),使其在下游任务中补充 LM 预训练期间获得的语言知识。我们首先证明注入动词知识可以提高英语事件提取的性能。然后,我们探索了动词适配器在其他语言中用于事件提取的效用:我们研究了 (1) 使用多语言 Transformer 的零样本语言迁移以及 (2) 通过(嘈杂的自动)翻译基于英语动词的词汇约束的迁移。我们的结果表明,动词知识注入的好处确实扩展到其他语言,即使动词适配器在嘈杂的翻译约束上进行训练。 Olga Majewska Ivan Vulić Goran Glavaš Edoardo M. Ponti Anna Korhonen
8 ACL2021 Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning https://github.com/INK-USC/XCSR https://arxiv.org/pdf/2106.06937 Commonsense reasoning research has so far been limited to English. We aim to evaluate and improve popular multilingual language models (ML-LMs) to help advance commonsense reasoning (CSR) beyond English. We collect the Mickey Corpus, consisting of 561k sentences in 11 different languages, which can be used for analyzing and improving ML-LMs. We propose Mickey Probe, a language-agnostic probing task for fairly evaluating the common sense of popular ML-LMs across different languages. In addition, we also create two new datasets, X-CSQA and X-CODAH, by translating their English versions to 15 other languages, so that we can evaluate popular ML-LMs for cross-lingual commonsense reasoning. To improve the performance beyond English, we propose a simple yet effective method — multilingual contrastive pre-training (MCP). It significantly enhances sentence representations, yielding a large performance gain on both benchmarks. Bill Yuchen Lin Seyeon Lee Xiaoyang Qiao Xiang Ren
9 ACL2021 Bilingual Lexicon Induction via Unsupervised Bitext Construction and Word Alignment https://arxiv.org/pdf/2101.00148 Bilingual lexicons map words in one language to their translations in another, and are typically induced by learning linear projections to align monolingual word embedding spaces. In this paper, we show it is possible to produce much higher quality lexicons with methods that combine (1) unsupervised bitext mining and (2) unsupervised word alignment. Directly applying a pipeline that uses recent algorithms for both subproblems significantly improves induced lexicon quality and further gains are possible by learning to filter the resulting lexical entries, with both unsupervised and semi-supervised schemes. Our final model outperforms the state of the art on the BUCC 2020 shared task by 14 $F_1$ points averaged over 12 language pairs, while also providing a more interpretable approach that allows for rich reasoning of word meaning in context. Further analysis of our output and the standard reference lexicons suggests they are of comparable quality, and new benchmarks may be needed to measure further progress on this task. 双语词典将一种语言中的单词映射到另一种语言中的翻译,并且通常通过学习线性投影来对齐单语单词嵌入空间来诱导。在本文中,我们展示了使用结合 (1) 无监督双文本挖掘和 (2) 无监督词对齐的方法可以生成更高质量的词典。直接应用对两个子问题使用最新算法的管道可以显着提高诱导词典的质量,并且通过学习过滤生成的词条,可以使用无监督和半监督方案进一步提高。我们的最终模型在 BUCC 2020 共享任务上的表现优于现有技术,在 12 个语言对上平均提高了 14 $F_1$ 点,同时还提供了一种更具可解释性的方法,允许在上下文中对词义进行丰富的推理。对我们的输出和标准参考词典的进一步分析表明它们的质量相当,可能需要新的基准来衡量这项任务的进一步进展。 Haoyue Shi Luke Zettlemoyer Sida I. Wang
10 ACL2020 Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation https://github.com/bzhangGo/zero https://arxiv.org/pdf/2004.11867 Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations. In this paper, we explore ways to improve them. We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics, and overcome this bottleneck via language-specific components and deepening NMT architectures. We identify the off-target translation issue (i.e. translating into a wrong target language) as the major source of the inferior zero-shot performance, and propose random online backtranslation to enforce the translation of unseen training language pairs. Experiments on OPUS-100 (a novel multilingual dataset with 100 languages) show that our approach substantially narrows the performance gap with bilingual models in both one-to-many and many-to-many settings, and improves zero-shot performance by ~10 BLEU, approaching conventional pivot-based methods. 用于神经机器翻译 (NMT) 的大规模多语言模型在理论上很有吸引力,但通常表现不如双语模型并且提供糟糕的零样本翻译。在本文中,我们探索了改进它们的方法。我们认为多语言 NMT 需要更强的建模能力来支持具有不同类型特征的语言对,并通过特定于语言的组件和深化 NMT 架构来克服这一瓶颈。我们将脱靶翻译问题(即翻译成错误的目标语言)确定为较差的零样本性能的主要来源,并提出随机在线反向翻译来强制翻译看不见的训练语言对。在 OPUS-100(一个包含 100 种语言的新型多语言数据集)上的实验表明,我们的方法大大缩小了在一对多和多对多设置中与双语模型的性能差距,并将零样本性能提高了约 10 BLEU,接近传统的基于枢轴的方法。 Biao Zhang Philip Williams Ivan Titov Rico Sennrich
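A minimal sketch of random online backtranslation (ROBT) for zero-shot directions, as described in this abstract: for a training pair, sample a random intermediate language and translate the target side into it with the current model, creating a synthetic pair for an otherwise unseen direction. The `translate` function and language set below are placeholders.

```python
# Illustrative random online backtranslation (placeholder model call).
import random

LANGS = ["de", "fr", "zh", "ru"]

def translate(sentence, tgt_lang, model=None):
    # Placeholder for a forward pass of the current multilingual model.
    return f"<{tgt_lang}> " + sentence

def robt_batch(batch, p_robt=0.5):
    augmented = []
    for src_lang, src, tgt_lang, tgt in batch:
        augmented.append((src_lang, src, tgt_lang, tgt))
        if random.random() < p_robt:
            pivot = random.choice([l for l in LANGS if l not in (src_lang, tgt_lang)])
            synthetic_src = translate(tgt, pivot)      # on-the-fly backtranslation
            augmented.append((pivot, synthetic_src, tgt_lang, tgt))
    return augmented

batch = [("en", "hello world", "de", "hallo welt")]
print(robt_batch(batch, p_robt=1.0))
```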
11 ACL2020 Simultaneous Translation Policies: From Fixed to Adaptive https://arxiv.org/pdf/2004.13169 Adaptive policies are better than fixed policies for simultaneous translation, since they can flexibly balance the tradeoff between translation quality and latency based on the current context information. But previous methods on obtaining adaptive policies either rely on complicated training process, or underperform simple fixed policies. We design an algorithm to achieve adaptive policies via a simple heuristic composition of a set of fixed policies. Experiments on Chinese -> English and German -> English show that our adaptive policies can outperform fixed ones by up to 4 BLEU points for the same latency, and more surprisingly, it even surpasses the BLEU score of full-sentence translation in the greedy mode (and very close to beam mode), but with much lower latency. 对于同步翻译,自适应策略优于固定策略,因为它们可以根据当前上下文信息灵活地平衡翻译质量和延迟之间的权衡。但是以前获取自适应策略的方法要么依赖于复杂的训练过程,要么表现不佳。我们设计了一种算法,通过一组固定策略的简单启发式组合来实现自适应策略。中文 -> 英文和德文 -> 英文的实验表明,在相同的延迟下,我们的自适应策略可以比固定策略高出多达 4 个 BLEU 点,更令人惊讶的是,它甚至超过了贪婪模式下完整句子翻译的 BLEU 分数(并且非常接近光束模式),但延迟要低得多。 Baigong Zheng Kaibo Liu Renjie Zheng Mingbo Ma Hairong Liu Liang Huang
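A minimal sketch of composing fixed wait-k policies into an adaptive READ/WRITE schedule, in the spirit of the heuristic composition described above; the thresholds and the confidence values are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative composition of fixed wait-k policies into an adaptive policy.

def wait_k_action(num_read, num_written, k, src_len):
    """Fixed wait-k policy: READ until k source tokens ahead, then alternate."""
    if num_read < src_len and num_read - num_written < k:
        return "READ"
    return "WRITE"

def adaptive_action(num_read, num_written, src_len, confidence,
                    k_min=1, k_max=7, threshold=0.6):
    # Heuristic: when the model is confident, behave like an aggressive
    # (small-k) policy; otherwise fall back to a conservative (large-k) one.
    k = k_min if confidence >= threshold else k_max
    return wait_k_action(num_read, num_written, k, src_len)

# Toy trace over a 10-token source with a made-up confidence sequence.
reads = writes = 0
for step, conf in enumerate([0.9, 0.4, 0.8, 0.7, 0.3, 0.9, 0.9, 0.5, 0.8, 0.9]):
    act = adaptive_action(reads, writes, src_len=10, confidence=conf)
    reads += act == "READ"
    writes += act == "WRITE"
    print(step, act, reads, writes)
```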
12 ACL2020 Multiscale Collaborative Deep Models for Neural Machine Translation https://github.com/pemywei/MSC-NMT https://arxiv.org/pdf/2004.14021 Recent evidence reveals that Neural Machine Translation (NMT) models with deeper neural networks can be more effective but are difficult to train. In this paper, we present a MultiScale Collaborative (MSC) framework to ease the training of NMT models that are substantially deeper than those used previously. We explicitly boost the gradient back-propagation from top to bottom levels by introducing a block-scale collaboration mechanism into deep NMT models. Then, instead of forcing the whole encoder stack directly learns a desired representation, we let each encoder block learns a fine-grained representation and enhance it by encoding spatial dependencies using a context-scale collaboration. We provide empirical evidence showing that the MSC nets are easy to optimize and can obtain improvements of translation quality from considerably increased depth. On IWSLT translation tasks with three translation directions, our extremely deep models (with 72-layer encoders) surpass strong baselines by +2.2~+3.1 BLEU points. In addition, our deep MSC achieves a BLEU score of 30.56 on WMT14 English-German task that significantly outperforms state-of-the-art deep NMT models. 最近的证据表明,具有更深神经网络的神经机器翻译 (NMT) 模型可能更有效,但难以训练。在本文中,我们提出了一个多尺度协作 (MSC) 框架,以简化 NMT 模型的训练,这些模型比以前使用的模型要深得多。我们通过在深度 NMT 模型中引入块级协作机制,显式地提升了从上到下的梯度反向传播。然后,我们不是强制整个编码器堆栈直接学习所需的表示,而是让每个编码器块学习细粒度的表示,并通过使用上下文尺度协作对空间依赖性进行编码来增强它。我们提供的经验证据表明,MSC 网络易于优化,并且可以从显着增加的深度中获得翻译质量的改进。在具有三个翻译方向的 IWSLT 翻译任务上,我们极深的模型(具有 72 层编码器)超过强基线 +2.2~+3.1 BLEU 点。此外,我们的深度 MSC 在 WMT14 英德任务上获得了 30.56 的 BLEU 分数,明显优于最先进的深度 NMT 模型。 Xiangpeng Wei Heng Yu Yue Hu Yue Zhang Rongxiang Weng Weihua Luo
13 ACL2020 Character-Level Translation with Self-attention https://arxiv.org/pdf/2004.14788 We explore the suitability of self-attention models for character-level neural machine translation. We test the standard transformer model, as well as a novel variant in which the encoder block combines information from nearby characters using convolutions. We perform extensive experiments on WMT and UN datasets, testing both bilingual and multilingual translation to English using up to three input languages (French, Spanish, and Chinese). Our transformer variant consistently outperforms the standard transformer at the character-level and converges faster while learning more robust character-level alignments. 我们探索了自注意力模型对字符级神经机器翻译的适用性。我们测试了标准转换器模型,以及一种新的变体,其中编码器块使用卷积结合来自附近字符的信息。我们对 WMT 和 UN 数据集进行了大量实验,使用最多三种输入语言(法语、西班牙语和中文)测试双语和多语种翻译成英语。我们的转换器变体在字符级别始终优于标准转换器,并且在学习更强大的字符级别对齐的同时收敛速度更快。 Yingqiang Gao Nikola I. Nikolov Yuhuang Hu Richard H. R. Hahnloser
14 ACL2020 ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation https://github.com/lifu-tu/ENGINE https://arxiv.org/pdf/2005.00850 We propose to train a non-autoregressive machine translation model to minimize the energy defined by a pretrained autoregressive model. In particular, we view our non-autoregressive translation system as an inference network (Tu and Gimpel, 2018) trained to minimize the autoregressive teacher energy. This contrasts with the popular approach of training a non-autoregressive model on a distilled corpus consisting of the beam-searched outputs of such a teacher model. Our approach, which we call ENGINE (ENerGy-based Inference NEtworks), achieves state-of-the-art non-autoregressive results on the IWSLT 2014 DE-EN and WMT 2016 RO-EN datasets, approaching the performance of autoregressive models. 我们建议训练一个非自回归机器翻译模型,以最小化由预训练自回归模型定义的能量。特别是,我们将我们的非自回归翻译系统视为一个推理网络(Tu 和 Gimpel,2018),经过训练以最小化自回归教师能量。这与在由这种教师模型的波束搜索输出组成的蒸馏语料库上训练非自回归模型的流行方法形成对比。我们称为 ENGINE(基于能源的推理网络)的方法在 IWSLT 2014 DE-EN 和 WMT 2016 RO-EN 数据集上实现了最先进的非自回归结果,接近自回归模型的性能。 Lifu Tu Richard Yuanzhe Pang Sam Wiseman Kevin Gimpel
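One way to picture the energy-based objective is sketched below. The exact ENGINE formulation differs, but the core idea is the same in spirit: feed the non-autoregressive model's relaxed output distribution to a frozen autoregressive teacher and minimize the teacher's expected negative log-likelihood. All tensors here are random stand-ins.

```python
# Illustrative energy-style training signal for a non-autoregressive student.
import torch
import torch.nn.functional as F

vocab, tgt_len = 100, 6
nat_logits = torch.randn(tgt_len, vocab, requires_grad=True)  # student outputs
teacher_logits = torch.randn(tgt_len, vocab)                  # frozen teacher scores

def energy(nat_logits, teacher_logits):
    q = F.softmax(nat_logits, dim=-1)                  # relaxed student predictions
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # Expected teacher negative log-likelihood under the student's distribution.
    return -(q * teacher_logp).sum(dim=-1).mean()

loss = energy(nat_logits, teacher_logits)
loss.backward()                       # gradients flow into the student logits
print(float(loss))
```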
15 ACL2020 Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation https://arxiv.org/pdf/2005.00308 Machine translation (MT) has benefited from using synthetic training data originating from translating monolingual corpora, a technique known as backtranslation. Combining backtranslated data from different sources has led to better results than when using such data in isolation. In this work we analyse the impact that data translated with rule-based, phrase-based statistical and neural MT systems has on new MT systems. We use a real-world low-resource use-case (Basque-to-Spanish in the clinical domain) as well as a high-resource language pair (German-to-English) to test different scenarios with backtranslation and employ data selection to optimise the synthetic corpora. We exploit different data selection strategies in order to reduce the amount of data used, while at the same time maintaining high-quality MT systems. We further tune the data selection method by taking into account the quality of the MT systems used for backtranslation and lexical diversity of the resulting corpora. Our experiments show that incorporating backtranslated data from different sources can be beneficial, and that availing of data selection can yield improved performance. 机器翻译 (MT) 受益于使用源自翻译单语语料库的合成训练数据,这种技术称为反向翻译。与单独使用这些数据相比,将来自不同来源的反向翻译数据结合起来会产生更好的结果。在这项工作中,我们分析了使用基于规则、基于短语的统计和神经 MT 系统翻译的数据对新 MT 系统的影响。我们使用现实世界的低资源用例(临床领域的巴斯克语到西班牙语)以及高资源语言对(德语到英语)来测试不同的反向翻译场景,并采用数据选择来优化合成语料库。我们利用不同的数据选择策略来减少使用的数据量,同时保持高质量的 MT 系统。我们通过考虑用于反向翻译的 MT 系统的质量和结果语料库的词汇多样性来进一步调整数据选择方法。我们的实验表明,合并来自不同来源的反向翻译数据可能是有益的,并且利用数据选择可以提高性能。 Xabier Soto Dimitar Shterionov Alberto Poncelas Andy Way
16 ACL2020 Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation https://github.com/xlhex/dpe https://arxiv.org/pdf/2005.06606 This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units. We view the subword segmentation of output sentences as a latent variable that should be marginalized out for learning and inference. A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations with maximum posterior probability. DPE uses a lightweight mixed character-subword transformer as a means of pre-processing parallel data to segment output sentences using dynamic programming. Empirical results on machine translation suggest that DPE is effective for segmenting output sentences and can be combined with BPE dropout for stochastic segmentation of source sentences. DPE achieves an average improvement of 0.9 BLEU over BPE (Sennrich et al., 2016) and an average improvement of 0.55 BLEU over BPE dropout (Provilkov et al., 2019) on several WMT datasets including English <=> (German, Romanian, Estonian, Finnish, Hungarian). 本文介绍了动态编程编码 (DPE),这是一种将句子标记为子词单元的新分词算法。我们将输出句子的子词分割视为一个潜在变量,应该被边缘化以进行学习和推理。提出了一种混合字符-子字转换器,它能够进行精确的对数边际似然估计和精确的 MAP 推理,以找到具有最大后验概率的目标分段。 DPE 使用轻量级混合字符-子字转换器作为预处理并行数据的一种手段,以使用动态编程对输出句子进行分段。机器翻译的实证结果表明,DPE 对分割输出句子是有效的,并且可以与 BPE dropout 结合用于源句子的随机分割。 DPE 在包括英语 <=>(德语、罗马尼亚语、爱沙尼亚语、芬兰语、匈牙利语)。 Xuanli He Gholamreza Haffari Mohammad Norouzi
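The dynamic program at the heart of DPE-style segmentation can be sketched in a few lines: given a score for each candidate subword, find the maximum-scoring segmentation of a word. The toy unigram scores below are assumptions; the paper instead scores subwords with a mixed character-subword transformer and also marginalizes over segmentations during training.

```python
# Illustrative Viterbi-style dynamic program over subword segmentations.
import math

scores = {"un": -1.0, "relate": -2.0, "related": -2.5, "d": -3.0,
          "u": -4.0, "n": -4.0, "r": -4.0, "e": -4.0, "l": -4.0,
          "a": -4.0, "t": -4.0}

def best_segmentation(word, scores, max_len=8):
    n = len(word)
    best = [-math.inf] * (n + 1)     # best[i]: best score of word[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last subword
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = word[j:i]
            if piece in scores and best[j] + scores[piece] > best[i]:
                best[i], back[i] = best[j] + scores[piece], j
    pieces, i = [], n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return list(reversed(pieces)), best[n]

print(best_segmentation("unrelated", scores))   # (['un', 'related'], -3.5)
```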
17 ACL2020 Norm-Based Curriculum Learning for Neural Machine Translation https://github.com/NLP2CT/norm-nmt https://arxiv.org/pdf/2006.02014 A neural machine translation (NMT) system is expensive to train, especially with high-resource settings. As the NMT architectures become deeper and wider, this issue gets worse and worse. In this paper, we aim to improve the efficiency of training an NMT by introducing a novel norm-based curriculum learning method. We use the norm (aka length or module) of a word embedding as a measure of 1) the difficulty of the sentence, 2) the competence of the model, and 3) the weight of the sentence. The norm-based sentence difficulty takes the advantages of both linguistically motivated and model-based sentence difficulties. It is easy to determine and contains learning-dependent features. The norm-based model competence makes NMT learn the curriculum in a fully automated way, while the norm-based sentence weight further enhances the learning of the vector representation of the NMT. Experimental results for the WMT’14 English-German and WMT’17 Chinese-English translation tasks demonstrate that the proposed method outperforms strong baselines in terms of BLEU score (+1.17/+1.56) and training speedup (2.22x/3.33x). Xuebo Liu Houtim Lai Derek F. Wong Lidia S. Chao
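A minimal sketch of norm-based curriculum ordering: rank sentences by the average norm of their word embeddings (the difficulty proxy described above) and expose the model only to the easiest fraction allowed by its current competence. The toy embeddings and the square-root competence schedule below are illustrative assumptions.

```python
# Illustrative norm-based curriculum: difficulty ranking plus a competence gate.
import numpy as np

rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=16) * (1 + 0.1 * len(w)) for w in
              "the cat sat on a very unusual extraordinarily long mat".split()}

def sentence_difficulty(sentence):
    norms = [np.linalg.norm(embeddings[w]) for w in sentence.split()]
    return float(np.mean(norms))

def competence(step, total_steps, c0=0.1):
    # Square-root competence schedule: fraction of data available at `step`.
    return min(1.0, (c0 ** 2 + (1 - c0 ** 2) * step / total_steps) ** 0.5)

corpus = ["the cat sat", "a very unusual mat", "extraordinarily long mat"]
ranked = sorted(corpus, key=sentence_difficulty)
for step in (0, 500, 1000):
    frac = competence(step, total_steps=1000)
    usable = ranked[: max(1, int(frac * len(ranked)))]
    print(step, round(frac, 2), usable)
```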
18 ACL2020 Bilingual Dictionary Based Neural Machine Translation without Using Parallel Sentences https://github.com/mttravel/Dictionary-based-MT https://arxiv.org/pdf/2007.02671 In this paper, we propose a new task of machine translation (MT), which is based on no parallel sentences but can refer to a ground-truth bilingual dictionary. Motivated by the ability of a monolingual speaker learning to translate via looking up the bilingual dictionary, we propose the task to see how much potential an MT system can attain using the bilingual dictionary and large scale monolingual corpora, while is independent on parallel sentences. We propose anchored training (AT) to tackle the task. AT uses the bilingual dictionary to establish anchoring points for closing the gap between source language and target language. Experiments on various language pairs show that our approaches are significantly better than various baselines, including dictionary-based word-by-word translation, dictionary-supervised cross-lingual word embedding transformation, and unsupervised MT. On distant language pairs that are hard for unsupervised MT to perform well, AT performs remarkably better, achieving performances comparable to supervised SMT trained on more than 4M parallel sentences. 在本文中,我们提出了机器翻译 (MT) 的一项新任务,该任务不基于平行句,但可以参考真值双语词典。受单语说话者通过查找双语词典学习翻译的能力的启发,我们提出了一项任务,即在独立于平行句子的情况下,使用双语词典和大规模单语语料库查看 MT 系统可以实现多少潜力。我们建议锚定训练(AT)来解决这个任务。 AT 使用双语词典建立定位点,以缩小源语言和目标语言之间的差距。在各种语言对上的实验表明,我们的方法明显优于各种基线,包括基于字典的逐字翻译、字典监督的跨语言词嵌入转换和无监督的机器翻译。在无监督 MT 难以表现良好的远距离语言对上,AT 表现得非常好,达到了与在超过 4M 并行句子上训练的有监督 SMT 相当的性能。 Xiangyu Duan Baijun Ji Hao Jia Min Tan Min Zhang Boxing Chen Weihua Luo Yue Zhang
21 ACL2019 An Effective Approach to Unsupervised Machine Translation https://github.com/artetxem/monoses https://arxiv.org/pdf/1902.01313 While machine translation has traditionally relied on large amounts of parallel corpora, a recent research line has managed to train both Neural Machine Translation (NMT) and Statistical Machine Translation (SMT) systems using monolingual corpora only. In this paper, we identify and address several deficiencies of existing unsupervised SMT approaches by exploiting subword information, developing a theoretically well founded unsupervised tuning method, and incorporating a joint refinement procedure. Moreover, we use our improved SMT system to initialize a dual NMT model, which is further fine-tuned through on-the-fly back-translation. Together, we obtain large improvements over the previous state-of-the-art in unsupervised machine translation. For instance, we get 22.5 BLEU points in English-to-German WMT 2014, 5.5 points more than the previous best unsupervised system, and 0.5 points more than the (supervised) shared task winner back in 2014. 虽然机器翻译传统上依赖于大量的平行语料库,但最近的一项研究成功地仅使用单语语料库来训练神经机器翻译 (NMT) 和统计机器翻译 (SMT) 系统。在本文中,我们通过利用子字信息、开发理论上有充分根据的无监督调整方法并结合联合细化程序来识别和解决现有无监督 SMT 方法的几个缺陷。此外,我们使用改进的 SMT 系统来初始化双 NMT 模型,该模型通过动态反向翻译进一步微调。总之,我们在无监督机器翻译方面比以前的最新技术取得了很大的改进。例如,我们在 English-to-German WMT 2014 中获得 22.5 BLEU 分,比之前最好的无监督系统多 5.5 分,比 2014 年的(监督)共享任务获胜者多 0.5 分。 Mikel Artetxe Gorka Labaka Eneko Agirre
22 ACL2019 When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion https://github.com/lena-voita/good-translation-wrong-in-context https://arxiv.org/pdf/1905.05979 Though machine translation errors caused by the lack of context beyond one sentence have long been acknowledged, the development of context-aware NMT systems is hampered by several problems. Firstly, standard metrics are not sensitive to improvements in consistency in document-level translations. Secondly, previous work on context-aware NMT assumed that the sentence-aligned parallel data consisted of complete documents while in most practical scenarios such document-level data constitutes only a fraction of the available parallel data. To address the first issue, we perform a human study on an English-Russian subtitles dataset and identify deixis, ellipsis and lexical cohesion as three main sources of inconsistency. We then create test sets targeting these phenomena. To address the second shortcoming, we consider a set-up in which a much larger amount of sentence-level data is available compared to that aligned at the document level. We introduce a model that is suitable for this scenario and demonstrate major gains over a context-agnostic baseline on our new benchmarks without sacrificing performance as measured with BLEU. 尽管由于缺乏一个句子之外的上下文而导致机器翻译错误早已得到承认,但上下文感知 NMT 系统的发展受到几个问题的阻碍。首先,标准指标对文档级翻译一致性的改进不敏感。其次,之前关于上下文感知 NMT 的工作假设句子对齐的并行数据由完整的文档组成,而在大多数实际场景中,此类文档级数据仅构成可用并行数据的一小部分。为了解决第一个问题,我们对英俄字幕数据集进行了人类研究,并将指示符、省略号和词汇衔接确定为不一致的三个主要来源。然后我们创建针对这些现象的测试集。为了解决第二个缺点,我们考虑了一种设置,其中与在文档级别对齐的数据相比,可用的句子级别数据要多得多。我们引入了一个适用于这种情况的模型,并在不牺牲使用 BLEU 测量的性能的情况下,在我们的新基准测试中展示了相对于上下文无关基线的主要收益。 Elena Voita Rico Sennrich Ivan Titov
23 ACL2019 Syntactically Supervised Transformers for Faster Neural Machine Translation https://github.com/dojoteef/synst https://arxiv.org/pdf/1906.02780 Standard decoders for neural machine translation autoregressively generate a single target token per time step, which slows inference especially for long outputs. While architectural advances such as the Transformer fully parallelize the decoder computations at training time, inference still proceeds sequentially. Recent developments in non- and semi- autoregressive decoding produce multiple tokens per time step independently of the others, which improves inference speed but deteriorates translation quality. In this work, we propose the syntactically supervised Transformer (SynST), which first autoregressively predicts a chunked parse tree before generating all of the target tokens in one shot conditioned on the predicted parse. A series of controlled experiments demonstrates that SynST decodes sentences ~ 5x faster than the baseline autoregressive Transformer while achieving higher BLEU scores than most competing methods on En-De and En-Fr datasets. 用于神经机器翻译的标准解码器在每个时间步长自动回归生成单个目标标记,这会减慢推理速度,尤其是对于长输出。虽然诸如 Transformer 之类的架构进步在训练时完全并行化了解码器计算,但推理仍然按顺序进行。非自回归解码和半自回归解码的最新发展在每个时间步长生成多个标记,独立于其他标记,这提高了推理速度,但降低了翻译质量。在这项工作中,我们提出了句法监督的 Transformer (SynST),它首先自回归预测一个分块的解析树,然后在一次以预测解析为条件的镜头中生成所有目标标记。一系列受控实验表明,SynST 解码句子的速度比基线自回归 Transformer 快 5 倍,同时在 En-De 和 En-Fr 数据集上获得比大多数竞争方法更高的 BLEU 分数。 Nader Akoury Kalpesh Krishna Mohit Iyyer
24 ACL2019 Evaluating Gender Bias in Machine Translation https://arxiv.org/pdf/2106.08680 With language models being deployed increasingly in the real world, it is essential to address the issue of the fairness of their outputs. The word embedding representations of these language models often implicitly draw unwanted associations that form a social bias within the model. The nature of gendered languages like Hindi, poses an additional problem to the quantification and mitigation of bias, owing to the change in the form of the words in the sentence, based on the gender of the subject. Additionally, there is sparse work done in the realm of measuring and debiasing systems for Indic languages. In our work, we attempt to evaluate and quantify the gender bias within a Hindi-English machine translation system. We implement a modified version of the existing TGBI metric based on the grammatical considerations for Hindi. We also compare and contrast the resulting bias measurements across multiple metrics for pre-trained embeddings and the ones learned by our machine translation model. 随着语言模型在现实世界中越来越多地部署,解决其输出的公平性问题至关重要。这些语言模型的词嵌入表示通常会隐含地绘制不需要的关联,从而在模型内形成社会偏见。由于基于主题的性别,句子中单词的形式发生了变化,像印地语这样的性别化语言的性质给偏见的量化和缓解带来了额外的问题。此外,在印度语言的测量和去偏差系统领域完成的工作很少。在我们的工作中,我们尝试评估和量化印地语-英语机器翻译系统中的性别偏见。我们基于印地语的语法考虑实施了现有 TGBI 指标的修改版本。我们还比较和对比了预训练嵌入和我们的机器翻译模型学习的嵌入的多个指标的偏差测量结果。 Gauri Gupta Krithika Ramesh Sanjay Singh Gabriel Stanovsky Noah A. Smith Luke Zettlemoyer
25 ACL2019 Learning Deep Transformer Models for Machine Translation https://github.com/wangqiangneu/dlcl https://arxiv.org/pdf/1906.01787 Transformer is the state-of-the-art model in recent machine translation evaluations. Two strands of research are promising to improve models of this kind: the first uses wide networks (a.k.a. Transformer-Big) and has been the de facto standard for the development of the Transformer system, and the other uses deeper language representation but faces the difficulty arising from learning deep networks. Here, we continue the line of research on the latter. We claim that a truly deep Transformer model can surpass the Transformer-Big counterpart by 1) proper use of layer normalization and 2) a novel way of passing the combination of previous layers to the next. On WMT’16 English- German, NIST OpenMT’12 Chinese-English and larger WMT’18 Chinese-English tasks, our deep system (30/25-layer encoder) outperforms the shallow Transformer-Big/Base baseline (6-layer encoder) by 0.4-2.4 BLEU points. As another bonus, the deep model is 1.6X smaller in size and 3X faster in training than Transformer-Big. Transformer 是最近机器翻译评估中最先进的模型。有两方面的研究有望改进此类模型:一是使用宽网络(又名 Transformer-Big)并且已经成为 Transformer 系统开发的事实上的标准,另一方面使用更深层次的语言表示但面临困难源于学习深度网络。在这里,我们继续对后者进行研究。我们声称,一个真正的深度 Transformer 模型可以通过 1) 正确使用层归一化和 2) 一种将前一层的组合传递到下一层的新方法来超越 Transformer-Big 对应物。在 WMT’16 English-German、NIST OpenMT’12 Chinese-English 和更大的 WMT’18 Chinese-English 任务中,我们的深层系统(30/25 层编码器)优于浅层 Transformer-Big/Base 基线(6 层编码器) ) 0.4-2.4 BLEU 点。作为另一个好处,深度模型的大小比 Transformer-Big 小 1.6 倍,训练速度快 3 倍。 Qiang Wang Bei Li Tong Xiao Jingbo Zhu Changliang Li Derek F. Wong Lidia S. Chao
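A minimal sketch (illustrative, not the released dlcl code) of the "dynamic linear combination of layers" idea this abstract refers to: the input to layer k is a learned, normalized weighted sum of the outputs of all previous layers, which helps gradients flow in very deep pre-norm encoders. Layer sizes and the softmax normalization are assumptions of this sketch.

```python
# Illustrative deep encoder with a dynamic linear combination of layer outputs.
import torch
import torch.nn as nn

class DLCLEncoder(nn.Module):
    def __init__(self, num_layers=12, d_model=64):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True,
                                       norm_first=True)   # pre-norm, as in the paper
            for _ in range(num_layers))
        # One weight vector per layer over all earlier outputs (lower-triangular use).
        self.mix = nn.Parameter(torch.eye(num_layers + 1))

    def forward(self, x):
        outputs = [x]                                      # "layer 0" output = embeddings
        for k, layer in enumerate(self.layers, start=1):
            w = torch.softmax(self.mix[k, :k], dim=0)      # weights over layers 0..k-1
            layer_in = sum(wi * hi for wi, hi in zip(w, outputs))
            outputs.append(layer(layer_in))
        return outputs[-1]

model = DLCLEncoder()
print(model(torch.randn(2, 5, 64)).shape)   # torch.Size([2, 5, 64])
```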
26 ACL2019 Domain Adaptation of Neural Machine Translation by Lexicon Induction https://arxiv.org/pdf/1906.00376 It has been previously noted that neural machine translation (NMT) is very sensitive to domain shift. In this paper, we argue that this is a dual effect of the highly lexicalized nature of NMT, resulting in failure for sentences with large numbers of unknown words, and lack of supervision for domain-specific words. To remedy this problem, we propose an unsupervised adaptation method which fine-tunes a pre-trained out-of-domain NMT model using a pseudo-in-domain corpus. Specifically, we perform lexicon induction to extract an in-domain lexicon, and construct a pseudo-parallel in-domain corpus by performing word-for-word back-translation of monolingual in-domain target sentences. In five domains over twenty pairwise adaptation settings and two model architectures, our method achieves consistent improvements without using any in-domain parallel sentences, improving up to 14 BLEU over unadapted models, and up to 2 BLEU over strong back-translation baselines. 之前已经注意到神经机器翻译 (NMT) 对域转移非常敏感。在本文中,我们认为这是 NMT 高度词汇化性质的双重影响,导致具有大量未知单词的句子失败,以及缺乏对特定领域单词的监督。为了解决这个问题,我们提出了一种无监督的自适应方法,该方法使用伪域内语料库对预训练的域外 NMT 模型进行微调。具体来说,我们进行词典归纳以提取域内词典,并通过对单语域内目标句子进行逐字反向翻译来构建伪平行域内语料库。在超过 20 个成对适应设置和两个模型架构的五个域中,我们的方法在不使用任何域内并行语句的情况下实现了一致的改进,在未适应模型上提高了 14 个 BLEU,在强反向翻译基线上提高了 2 个 BLEU。 Junjie Hu Mengzhou Xia Graham Neubig Jaime Carbonell
27 ACL2019 Retrieving Sequential Information for Non-Autoregressive Neural Machine Translation https://github.com/ictnlp/RSI-NAT https://arxiv.org/pdf/1906.09444 Non-Autoregressive Transformer (NAT) aims to accelerate the Transformer model through discarding the autoregressive mechanism and generating target words independently, which fails to exploit the target sequential information. Over-translation and under-translation errors often occur for the above reason, especially in the long sentence translation scenario. In this paper, we propose two approaches to retrieve the target sequential information for NAT to enhance its translation ability while preserving the fast-decoding property. Firstly, we propose a sequence-level training method based on a novel reinforcement algorithm for NAT (Reinforce-NAT) to reduce the variance and stabilize the training procedure. Secondly, we propose an innovative Transformer decoder named FS-decoder to fuse the target sequential information into the top layer of the decoder. Experimental results on three translation tasks show that the Reinforce-NAT surpasses the baseline NAT system by a significant margin on BLEU without decelerating the decoding speed and the FS-decoder achieves comparable translation performance to the autoregressive Transformer with considerable speedup. Non-Autoregressive Transformer (NAT) 旨在通过丢弃自回归机制并独立生成目标词来加速 Transformer 模型,这无法利用目标序列信息。由于上述原因,经常会出现过度翻译和翻译不足的错误,尤其是在长句翻译场景中。在本文中,我们提出了两种方法来检索 NAT 的目标序列信息,以增强其翻译能力,同时保留快速解码特性。首先,我们提出了一种基于新的 NAT 强化算法(Reinforce-NAT)的序列级训练方法,以减少方差并稳定训练过程。其次,我们提出了一种名为 FS-decoder 的创新 Transformer 解码器,将目标序列信息融合到解码器的顶层。在三个翻译任务上的实验结果表明,在不降低解码速度的情况下,Reinforce-NAT 在 BLEU 上大大超过了基线 NAT 系统,并且 FS-decoder 以相当大的速度实现了与自回归 Transformer 相当的翻译性能。 Chenze Shao Yang Feng Jinchao Zhang Fandong Meng Xilin Chen Jie Zhou
28 ACL2019 Robust Neural Machine Translation with Joint Textual and Phonetic Embedding https://arxiv.org/pdf/1810.06729 Neural machine translation (NMT) is notoriously sensitive to noises, but noises are almost inevitable in practice. One special kind of noise is the homophone noise, where words are replaced by other words with similar pronunciations. We propose to improve the robustness of NMT to homophone noises by 1) jointly embedding both textual and phonetic information of source sentences, and 2) augmenting the training dataset with homophone noises. Interestingly, to achieve better translation quality and more robustness, we found that most (though not all) weights should be put on the phonetic rather than textual information. Experiments show that our method not only significantly improves the robustness of NMT to homophone noises, but also surprisingly improves the translation quality on some clean test sets. 众所周知,神经机器翻译 (NMT) 对噪音非常敏感,但在实践中噪音几乎是不可避免的。一种特殊的噪音是同音噪音,其中单词被其他具有相似发音的单词替换。我们建议通过 1) 联合嵌入源句子的文本和语音信息,以及 2) 用同音噪声增强训练数据集来提高 NMT 对同音噪声的鲁棒性。有趣的是,为了获得更好的翻译质量和更强的鲁棒性,我们发现大多数(尽管不是全部)权重应该放在语音信息而不是文本信息上。实验表明,我们的方法不仅显着提高了 NMT 对同音字噪声的鲁棒性,而且还令人惊讶地提高了一些干净测试集的翻译质量。 Hairong Liu Mingbo Ma Liang Huang Hao Xiong Zhongjun He
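A minimal sketch of joint textual-phonetic input embeddings: each source token is represented by a weighted sum of its word embedding and the embedding of its pronunciation (e.g. pinyin for Chinese). The tiny vocabulary, pinyin mapping and the weight `beta` below are illustrative; the abstract only reports that most of the weight should go to the phonetic side.

```python
# Illustrative joint textual + phonetic source embedding.
import torch
import torch.nn as nn

words = {"机器": 0, "翻译": 1}
pinyin = {"jiqi": 0, "fanyi": 1}
word_of = {"机器": "jiqi", "翻译": "fanyi"}

word_emb = nn.Embedding(len(words), 32)
phon_emb = nn.Embedding(len(pinyin), 32)
beta = 0.95   # weight on the phonetic embedding (illustrative value)

def embed(tokens):
    w_ids = torch.tensor([words[t] for t in tokens])
    p_ids = torch.tensor([pinyin[word_of[t]] for t in tokens])
    return (1 - beta) * word_emb(w_ids) + beta * phon_emb(p_ids)

print(embed(["机器", "翻译"]).shape)   # torch.Size([2, 32])
```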
29 ACL2019 Simple and Effective Paraphrastic Similarity from Parallel Translations https://arxiv.org/pdf/1909.13872 We present a model and methodology for learning paraphrastic sentence embeddings directly from bitext, removing the time-consuming intermediate step of creating paraphrase corpora. Further, we show that the resulting model can be applied to cross-lingual tasks where it both outperforms and is orders of magnitude faster than more complex state-of-the-art baselines. John Wieting Kevin Gimpel Graham Neubig Taylor Berg-Kirkpatrick
30 ACL2019 Unsupervised Question Answering by Cloze Translation https://github.com/facebookresearch/UnsupervisedQA https://arxiv.org/pdf/1906.04980 Obtaining training data for Question Answering (QA) is time-consuming and resource-intensive, and existing QA datasets are only available for limited domains and languages. In this work, we explore to what extent high quality training data is actually required for Extractive QA, and investigate the possibility of unsupervised Extractive QA. We approach this problem by first learning to generate context, question and answer triples in an unsupervised manner, which we then use to synthesize Extractive QA training data automatically. To generate such triples, we first sample random context paragraphs from a large corpus of documents and then random noun phrases or named entity mentions from these paragraphs as answers. Next we convert answers in context to “fill-in-the-blank” cloze questions and finally translate them into natural questions. We propose and compare various unsupervised ways to perform cloze-to-natural question translation, including training an unsupervised NMT model using non-aligned corpora of natural questions and cloze questions as well as a rule-based approach. We find that modern QA models can learn to answer human questions surprisingly well using only synthetic training data. We demonstrate that, without using the SQuAD training data at all, our approach achieves 56.4 F1 on SQuAD v1 (64.5 F1 when the answer is a Named entity mention), outperforming early supervised models. 获取问答 (QA) 的训练数据既耗时又耗费资源,现有的 QA 数据集仅适用于有限的领域和语言。在这项工作中,我们探索了 Extractive QA 在多大程度上需要高质量的训练数据,并研究了无监督 Extractive QA 的可能性。我们通过首先学习以无监督的方式生成上下文、问题和答案三元组来解决这个问题,然后我们将其用于自动合成提取 QA 训练数据。为了生成这样的三元组,我们首先从大型文档语料库中随机抽取上下文段落,然后从这些段落中随机抽取名词短语或命名实体作为答案。接下来,我们将上下文中的答案转换为“填空”完形填空题,最后将它们转换为自然问题。我们提出并比较了执行完形填空到自然问题翻译的各种无监督方法,包括使用自然问题和完形填空问题的非对齐语料库以及基于规则的方法训练无监督 NMT 模型。我们发现现代 QA 模型可以学习仅使用合成训练数据就出人意料地很好地回答人类问题。我们证明,在完全不使用 SQuAD 训练数据的情况下,我们的方法在 SQuAD v1 上达到了 56.4 F1(当答案是命名实体提及时为 64.5 F1),优于早期的监督模型。 Patrick Lewis Ludovic Denoyer Sebastian Riedel
31 ACL2019 Bilingual Lexicon Induction through Unsupervised Machine Translation https://github.com/artetxem/monoses https://arxiv.org/pdf/1907.10761 A recent research line has obtained strong results on bilingual lexicon induction by aligning independently trained word embeddings in two languages and using the resulting cross-lingual embeddings to induce word translation pairs through nearest neighbor or related retrieval methods. In this paper, we propose an alternative approach to this problem that builds on the recent work on unsupervised machine translation. This way, instead of directly inducing a bilingual lexicon from cross-lingual embeddings, we use them to build a phrase-table, combine it with a language model, and use the resulting machine translation system to generate a synthetic parallel corpus, from which we extract the bilingual lexicon using statistical word alignment techniques. As such, our method can work with any word embedding and cross-lingual mapping technique, and it does not require any additional resource besides the monolingual corpus used to train the embeddings. When evaluated on the exact same cross-lingual embeddings, our proposed method obtains an average improvement of 6 accuracy points over nearest neighbor and 4 points over CSLS retrieval, establishing a new state-of-the-art in the standard MUSE dataset. 最近的一条研究线通过对齐两种语言中独立训练的词嵌入并使用产生的跨语言嵌入通过最近邻或相关检索方法来诱导词翻译对,在双语词典归纳方面取得了很好的成果。在本文中,我们提出了一种替代方法来解决这个问题,该方法建立在最近关于无监督机器翻译的工作基础上。这样,我们不是直接从跨语言嵌入中归纳出双语词典,而是使用它们来构建短语表,将其与语言模型结合,并使用生成的机器翻译系统生成合成平行语料库,从中我们使用统计词对齐技术提取双语词典。因此,我们的方法可以与任何词嵌入和跨语言映射技术一起使用,除了用于训练嵌入的单语语料库之外,它不需要任何额外的资源。当对完全相同的跨语言嵌入进行评估时,我们提出的方法比最近邻平均提高了 6 个精度点,比 CSLS 检索平均提高了 4 个点,在标准 MUSE 数据集中建立了新的最新技术。 Mikel Artetxe Gorka Labaka Eneko Agirre
32 ACL2019 Soft Contextual Data Augmentation for Neural Machine Translation https://github.com/teslacool/SCA https://arxiv.org/pdf/1905.10523 While data augmentation is an important trick to boost the accuracy of deep learning methods in computer vision tasks, its study in natural language tasks is still very limited. In this paper, we present a novel data augmentation method for neural machine translation. Different from previous augmentation methods that randomly drop, swap or replace words with other words in a sentence, we softly augment a randomly chosen word in a sentence by its contextual mixture of multiple related words. More accurately, we replace the one-hot representation of a word by a distribution (provided by a language model) over the vocabulary, i.e., replacing the embedding of this word by a weighted combination of multiple semantically similar words. Since the weights of those words depend on the contextual information of the word to be replaced, the newly generated sentences capture much richer information than previous augmentation methods. Experimental results on both small scale and large scale machine translation datasets demonstrate the superiority of our method over strong baselines. 虽然数据增强是在计算机视觉任务中提高深度学习方法准确性的重要技巧,但它在自然语言任务中的研究仍然非常有限。在本文中,我们提出了一种新的神经机器翻译数据增强方法。与之前随机删除、交换或替换句子中的其他单词的增强方法不同,我们通过多个相关单词的上下文混合来轻柔地增强句子中随机选择的单词。更准确地说,我们用词汇表上的分布(由语言模型提供)替换一个词的 one-hot 表示,即用多个语义相似词的加权组合替换这个词的嵌入。由于这些词的权重取决于要替换的词的上下文信息,因此新生成的句子比以前的增强方法捕获了更丰富的信息。在小规模和大规模机器翻译数据集上的实验结果证明了我们的方法在强基线上的优越性。 Jinhua Zhu Fei Gao Lijun Wu Yingce Xia Tao Qin Wengang Zhou Xueqi Cheng Tie-Yan Liu
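A minimal sketch of the soft contextual augmentation described above: instead of the one-hot token, a randomly chosen position is represented by the expectation of the embedding matrix under a language model's predictive distribution for that position. The "language model" below is a random distribution purely for illustration, not the paper's released SCA code.

```python
# Illustrative soft word replacement via an LM distribution over the vocabulary.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
embedding = nn.Embedding(vocab_size, d_model)

def soft_augment(token_ids, lm_probs, position):
    """Return the embedded sequence with `position` softly replaced."""
    embedded = embedding(token_ids).clone()               # (seq_len, d_model)
    soft_vec = lm_probs @ embedding.weight                 # weighted sum over vocab
    embedded[position] = soft_vec
    return embedded

token_ids = torch.randint(0, vocab_size, (7,))
lm_probs = torch.softmax(torch.randn(vocab_size), dim=0)   # stand-in for an LM
print(soft_augment(token_ids, lm_probs, position=3).shape)  # torch.Size([7, 64])
```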
33 ACL2019 Generalized Data Augmentation for Low-Resource Translation https://arxiv.org/pdf/1906.03785 Translation to or from low-resource languages LRLs poses challenges for machine translation in terms of both adequacy and fluency. Data augmentation utilizing large amounts of monolingual data is regarded as an effective way to alleviate these problems. In this paper, we propose a general framework for data augmentation in low-resource machine translation that not only uses target-side monolingual data, but also pivots through a related high-resource language HRL. Specifically, we experiment with a two-step pivoting method to convert high-resource data to the LRL, making use of available resources to better approximate the true data distribution of the LRL. First, we inject LRL words into HRL sentences through an induced bilingual dictionary. Second, we further edit these modified sentences using a modified unsupervised machine translation framework. Extensive experiments on four low-resource datasets show that under extreme low-resource settings, our data augmentation techniques improve translation quality by up to~1.5 to~8 BLEU points compared to supervised back-translation baselines 与低资源语言 LRL 之间的翻译在充分性和流畅性方面对机器翻译提出了挑战。利用大量单语数据的数据增强被认为是缓解这些问题的有效方法。在本文中,我们提出了一种用于低资源机器翻译中数据增强的通用框架,该框架不仅使用目标端单语数据,而且还通过相关的高资源语言 HRL 进行支点。具体来说,我们尝试使用两步旋转方法将高资源数据转换为 LRL,利用可用资源更好地近似 LRL 的真实数据分布。首先,我们通过诱导双语词典将 LRL 词注入 HRL 句子中。其次,我们使用修改后的无监督机器翻译框架进一步编辑这些修改后的句子。对四个低资源数据集的大量实验表明,在极端低资源设置下,与有监督的反向翻译基线相比,我们的数据增强技术将翻译质量提高了 1.5 到 8 个 BLEU 点 Mengzhou Xia Xiang Kong Antonios Anastasopoulos Graham Neubig

EMNLP

No. Conference/Journal Paper Main Technique Code Paper Link Abstract Abstract (Chinese Translation) Authors
1 EMNLP2020 Fully Quantized Transformer for Machine Translation https://arxiv.org/pdf/1910.10485 State-of-the-art neural machine translation methods employ massive amounts of parameters. Drastically reducing computational costs of such methods without affecting performance has been up to this point unsuccessful. To this end, we propose FullyQT: an all-inclusive quantization strategy for the Transformer. To the best of our knowledge, we are the first to show that it is possible to avoid any loss in translation quality with a fully quantized Transformer. Indeed, compared to full-precision, our 8-bit models score greater or equal BLEU on most tasks. Comparing ourselves to all previously proposed methods, we achieve state-of-the-art quantization results. 最先进的神经机器翻译方法使用大量参数。在不影响性能的情况下大幅降低此类方法的计算成本到目前为止还没有成功。为此,我们提出了 FullQT:Transformer 的全包量化策略。据我们所知,我们是第一个证明使用完全量化的 Transformer 可以避免翻译质量下降的人。事实上,与全精度相比,我们的 8 位模型在大多数任务上的得分更高或等于 BLEU。与之前提出的所有方法相比,我们实现了最先进的量化结果。 Gabriele Prato Ella Charlaix Mehdi Rezagholizadeh
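A minimal sketch of uniform 8-bit "fake quantization" of a weight tensor, the basic building block of a fully quantized Transformer; this is a generic min-max scheme for illustration, not the exact FullyQT calibration strategy.

```python
# Illustrative uniform 8-bit fake quantization of a tensor.
import torch

def fake_quantize(x, num_bits=8):
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale      # dequantized values used in the forward pass

w = torch.randn(4, 4)
w_q = fake_quantize(w)
print((w - w_q).abs().max())             # quantization error stays small
```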
2 EMNLP2020 Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation https://arxiv.org/pdf/2002.10260 Transformer-based models have brought a radical change to neural machine translation. A key feature of the Transformer architecture is the so-called multi-head attention mechanism, which allows the model to focus simultaneously on different parts of the input. However, recent works have shown that most attention heads learn simple, and often redundant, positional patterns. In this paper, we propose to replace all but one attention head of each encoder layer with simple fixed — non-learnable — attentive patterns that are solely based on position and do not require any external knowledge. Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality and even increases BLEU scores by up to 3 points in low-resource scenarios. 基于 Transformer 的模型给神经机器翻译带来了根本性的变化。 Transformer 架构的一个关键特性是所谓的多头注意力机制,它允许模型同时关注输入的不同部分。然而,最近的工作表明,大多数注意力头学习简单的、通常是冗余的位置模式。在本文中,我们建议用简单的固定的——不可学习的——注意力模式替换每个编码器层的除一个注意力头之外的所有注意力模式,这些模式完全基于位置并且不需要任何外部知识。我们对不同数据大小和多语言对的实验表明,在训练时将注意力头固定在 Transformer 的编码器端不会影响翻译质量,甚至在低资源场景中将 BLEU 分数提高多达 3 分。 Alessandro Raganato Yves Scherrer Jörg Tiedemann
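A minimal sketch of fixed, non-learnable attention patterns: each "head" attends to a fixed relative position (previous, current, next token), encoded directly as an attention matrix. The three offsets below illustrate the kind of positional patterns the abstract describes, not the paper's full set of patterns.

```python
# Illustrative fixed positional attention patterns for encoder heads.
import torch

def fixed_attention(seq_len, offset):
    """Attention matrix where position i attends to position i + offset."""
    attn = torch.zeros(seq_len, seq_len)
    for i in range(seq_len):
        j = min(max(i + offset, 0), seq_len - 1)   # clamp at sentence boundaries
        attn[i, j] = 1.0
    return attn

seq_len, d = 5, 8
values = torch.randn(seq_len, d)
for offset in (-1, 0, 1):                          # previous, current, next token
    out = fixed_attention(seq_len, offset) @ values
    print(offset, out.shape)                       # each: torch.Size([5, 8])
```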
3 EMNLP2020 Adversarial Subword Regularization for Robust Neural Machine Translation https://github.com/dmis-lab/AdvSR https://arxiv.org/pdf/2004.14109 Exposing diverse subword segmentations to neural machine translation (NMT) models often improves the robustness of machine translation as NMT models can experience various subword candidates. However, the diversification of subword segmentations mostly relies on the pre-trained subword language models from which erroneous segmentations of unseen words are less likely to be sampled. In this paper, we present adversarial subword regularization (ADVSR) to study whether gradient signals during training can be a substitute criterion for exposing diverse subword segmentations. We experimentally show that our model-based adversarial samples effectively encourage NMT models to be less sensitive to segmentation errors and improve the performance of NMT models in low-resource and out-domain datasets. 将不同的子词分割暴露给神经机器翻译 (NMT) 模型通常会提高机器翻译的鲁棒性,因为 NMT 模型可以体验各种子词候选。然而,子词切分的多样化主要依赖于预训练的子词语言模型,从这些模型中不太可能对不可见词的错误切分进行采样。在本文中,我们提出了对抗性子词正则化(ADVSR)来研究训练期间的梯度信号是否可以作为暴露不同子词分割的替代标准。我们通过实验表明,我们基于模型的对抗样本有效地鼓励 NMT 模型对分割错误不那么敏感,并提高了 NMT 模型在低资源和域外数据集中的性能。 Jungsoo Park Mujeen Sung Jinhyuk Lee Jaewoo Kang
4 EMNLP2020 Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages https://github.com/masakhane-io/masakhane-mt https://arxiv.org/pdf/2010.02353 Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. “Low-resourced”-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), that plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released under https://github.com/masakhane-io/masakhane-mt. Wilhelmina Nekoto Vukosi Marivate Tshinondiwa Matsila Timi Fasubaa Tajudeen Kolawole Taiwo Fagbohungbe Solomon Oluwole Akinola Shamsuddeen Hassan Muhammad Salomon Kabongo Salomey Osei Sackey Freshia Rubungo Andre Niyongabo Ricky Macharm Perez Ogayo Orevaoghene Ahia Musie Meressa Mofe Adeyemi Masabata Mokgesi-Selinga Lawrence Okegbemi Laura Jane Martinus Kolawole Tajudeen Kevin Degila Kelechi Ogueji Kathleen Siminyu Julia Kreutzer
5 EMNLP2020 On Romanization for Model Transfer Between Scripts in Neural Machine Translation https://arxiv.org/pdf/2009.14824 Transfer learning is a popular strategy to improve the quality of low-resource machine translation. For an optimal transfer of the embedding layer, the child and parent model should share a substantial part of the vocabulary. This is not the case when transferring to languages with a different script. We explore the benefit of romanization in this scenario. Our results show that romanization entails information loss and is thus not always superior to simpler vocabulary transfer methods, but can improve the transfer between related languages with different scripts. We compare two romanization tools and find that they exhibit different degrees of information loss, which affects translation quality. Finally, we extend romanization to the target side, showing that this can be a successful strategy when coupled with a simple deromanization model. 迁移学习是提高低资源机器翻译质量的流行策略。为了优化嵌入层的转移,子模型和父模型应该共享词汇表的大部分。转换为具有不同脚本的语言时,情况并非如此。我们探讨了在这种情况下罗马化的好处。我们的结果表明,罗马化会导致信息丢失,因此并不总是优于更简单的词汇转移方法,但可以改善具有不同脚本的相关语言之间的转移。我们比较了两种罗马化工具,发现它们表现出不同程度的信息丢失,这会影响翻译质量。最后,我们将罗马化扩展到目标端,表明当与简单的去罗马化模型相结合时,这可能是一个成功的策略。 Chantal Amrhein Rico Sennrich
6 EMNLP2020 Graph-to-Tree Neural Networks for Learning Structured Input-Output Translation with Applications to Semantic Parsing and Math Word Problem https://github.com/IBM/Graph2Tree https://arxiv.org/pdf/2004.13781 The celebrated Seq2Seq technique and its numerous variants achieve excellent performance on many tasks such as neural machine translation, semantic parsing, and math word problem solving. However, these models either only consider input objects as sequences while ignoring the important structural information for encoding, or they simply treat output objects as sequence outputs instead of structural objects for decoding. In this paper, we present a novel Graph-to-Tree Neural Networks, namely Graph2Tree consisting of a graph encoder and a hierarchical tree decoder, that encodes an augmented graph-structured input and decodes a tree-structured output. In particular, we investigated our model for solving two problems, neural semantic parsing and math word problem. Our extensive experiments demonstrate that our Graph2Tree model outperforms or matches the performance of other state-of-the-art models on these tasks. 著名的 Seq2Seq 技术及其众多变体在神经机器翻译、语义解析和数学单词问题解决等许多任务上都取得了出色的表现。然而,这些模型要么只将输入对象视为序列而忽略编码的重要结构信息,要么将输出对象简单地视为序列输出而不是结构对象进行解码。在本文中,我们提出了一种新颖的图到树神经网络,即由图编码器和分层树解码器组成的 Graph2Tree,它对增强的图结构输入进行编码并解码树结构输出。特别是,我们研究了解决两个问题的模型,神经语义解析和数学单词问题。我们广泛的实验表明,我们的 Graph2Tree 模型在这些任务上的表现优于或匹配其他最先进模型的性能。 Shucheng Li Lingfei Wu Shiwei Feng Fangli Xu Fengyuan Xu Sheng Zhong
7 EMNLP2020 On Long-Tailed Phenomena in Neural Machine Translation https://github.com/vyraun/long-tailed https://arxiv.org/pdf/2010.04924 State-of-the-art Neural Machine Translation (NMT) models struggle with generating low-frequency tokens, tackling which remains a major challenge. The analysis of long-tailed phenomena in the context of structured prediction tasks is further hindered by the added complexities of search during inference. In this work, we quantitatively characterize such long-tailed phenomena at two levels of abstraction, namely, token classification and sequence generation. We propose a new loss function, the Anti-Focal loss, to better adapt model training to the structural dependencies of conditional text generation by incorporating the inductive biases of beam search in the training process. We show the efficacy of the proposed technique on a number of Machine Translation (MT) datasets, demonstrating that it leads to significant gains over cross-entropy across different language pairs, especially on the generation of low-frequency words. We have released the code to reproduce our results. 最先进的神经机器翻译 (NMT) 模型难以生成低频标记,这仍然是一个主要挑战。在结构化预测任务的上下文中对长尾现象的分析进一步受到推理期间搜索复杂性的阻碍。在这项工作中,我们在两个抽象层次上定量描述了这种长尾现象,即标记分类和序列生成。我们提出了一种新的损失函数,即 Anti-Focal 损失,通过在训练过程中结合波束搜索的归纳偏差,更好地使模型训练适应条件文本生成的结构依赖性。我们展示了所提出的技术在许多机器翻译 (MT) 数据集上的有效性,证明它在跨不同语言对的交叉熵上带来了显着的收益,尤其是在低频词的生成方面。我们已经发布了代码来重现我们的结果。 Vikas Raunak Siddharth Dalmia Vivek Gupta Florian Metze
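To make the Anti-Focal idea above concrete, here is a minimal PyTorch sketch that reweights token-level cross-entropy by a factor that grows with the model's confidence in the gold token. The (1 + p_t)^gamma form is an assumption chosen to mirror focal loss's (1 - p_t)^gamma factor, and the gamma value is arbitrary; the paper's exact weighting and its beam-search-aware variant are not reproduced here.

```python
import torch
import torch.nn.functional as F

def anti_focal_loss(logits, targets, gamma=1.0, ignore_index=-100):
    """Token-level cross-entropy with an anti-focal style weight.

    Assumption: weight = (1 + p_t) ** gamma, i.e. the mirror image of the
    focal-loss factor (1 - p_t) ** gamma; this is illustrative only.
    """
    log_probs = F.log_softmax(logits, dim=-1)                    # (tokens, vocab)
    nll = F.nll_loss(log_probs, targets, reduction="none",
                     ignore_index=ignore_index)                  # -log p_t per token
    p_t = (-nll).exp()                                           # prob of the gold token
    weight = (1.0 + p_t) ** gamma
    mask = (targets != ignore_index).float()
    return (weight * nll * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage: 4 target tokens over a 10-word vocabulary.
logits = torch.randn(4, 10)
targets = torch.tensor([1, 3, 5, 7])
print(anti_focal_loss(logits, targets, gamma=1.0))
```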
8 EMNLP2020 A Multilingual View of Unsupervised Machine Translation https://arxiv.org/pdf/2002.02955 We present a probabilistic framework for multilingual neural machine translation that encompasses supervised and unsupervised setups, focusing on unsupervised translation. In addition to studying the vanilla case where there is only monolingual data available, we propose a novel setup where one language in the (source, target) pair is not associated with any parallel data, but there may exist auxiliary parallel data that contains the other. This auxiliary data can naturally be utilized in our probabilistic framework via a novel cross-translation loss term. Empirically, we show that our approach results in higher BLEU scores over state-of-the-art unsupervised models on the WMT’14 English-French, WMT’16 English-German, and WMT’16 English-Romanian datasets in most directions. In particular, we obtain a +1.65 BLEU advantage over the best-performing unsupervised model in the Romanian-English direction. 我们提出了一个用于多语言神经机器翻译的概率框架,其中包括有监督和无监督设置,重点是无监督翻译。除了研究只有单语数据可用的普通情况外,我们还提出了一种新颖的设置,其中(源、目标)对中的一种语言与任何并行数据无关,但可能存在包含另一种语言的辅助并行数据.通过新的交叉翻译损失项,这些辅助数据可以自然地用于我们的概率框架中。根据经验,我们表明,我们的方法在 WMT’14 English-French、WMT’16 English-German 和 WMT’16 English-Romanian 数据集的大多数方向上比最先进的无监督模型获得更高的 BLEU 分数。特别是,我们在罗马尼亚语-英语方向上获得了优于性能最佳的无监督模型的 +1.65 BLEU 优势。 Xavier Garcia Pierre Foret Thibault Sellam Ankur P. Parikh
9 EMNLP2019 Explicit Cross-lingual Pre-training for Unsupervised Machine Translation https://arxiv.org/pdf/1909.00180 Pre-training has proven to be effective in unsupervised machine translation due to its ability to model deep context information in cross-lingual scenarios. However, the cross-lingual information obtained from shared BPE spaces is inexplicit and limited. In this paper, we propose a novel cross-lingual pre-training method for unsupervised machine translation by incorporating explicit cross-lingual training signals. Specifically, we first calculate cross-lingual n-gram embeddings and infer an n-gram translation table from them. With those n-gram translation pairs, we propose a new pre-training model called Cross-lingual Masked Language Model (CMLM), which randomly chooses source n-grams in the input text stream and predicts their translation candidates at each time step. Experiments show that our method can incorporate beneficial cross-lingual information into pre-trained models. Taking pre-trained CMLM models as the encoder and decoder, we significantly improve the performance of unsupervised machine translation. 预训练已被证明在无监督机器翻译中是有效的,因为它能够在跨语言场景中对深层上下文信息进行建模。然而,从共享 BPE 空间获得的跨语言信息是不明确和有限的。在本文中,我们通过结合明确的跨语言训练信号,提出了一种新的跨语言预训练方法,用于无监督机器翻译。具体来说,我们首先计算跨语言 n-gram 嵌入并从中推断出 n-gram 翻译表。利用这些 n-gram 翻译对,我们提出了一种称为跨语言掩码语言模型 (CMLM) 的新预训练模型,该模型在输入文本流中随机选择源 n-gram 并在每个时间步预测它们的翻译候选。实验表明,我们的方法可以将有益的跨语言信息整合到预先训练的模型中。以预训练的 CMLM 模型作为编码器和解码器,我们显着提高了无监督机器翻译的性能。 Shuo Ren Yu Wu Shujie Liu Ming Zhou Shuai Ma
10 EMNLP2019 Improving Back-Translation with Uncertainty-based Confidence Estimation https://github.com/THUNLP-MT/UCE4BT https://arxiv.org/pdf/1909.00157 While back-translation is simple and effective in exploiting abundant monolingual corpora to improve low-resource neural machine translation (NMT), the synthetic bilingual corpora generated by NMT models trained on limited authentic bilingual data are inevitably noisy. In this work, we propose to quantify the confidence of NMT model predictions based on model uncertainty. With word- and sentence-level confidence measures based on uncertainty, it is possible for back-translation to better cope with noise in synthetic bilingual corpora. Experiments on Chinese-English and English-German translation tasks show that uncertainty-based confidence estimation significantly improves the performance of back-translation. 虽然反向翻译在利用丰富的单语语料库来改进低资源神经机器翻译 (NMT) 方面简单而有效,但由在有限真实双语数据上训练的 NMT 模型生成的合成双语语料库不可避免地存在噪声。在这项工作中,我们建议基于模型不确定性量化 NMT 模型预测的置信度。通过基于不确定性的单词和句子级别的置信度度量,回译可以更好地应对合成双语语料库中的噪声。汉英和英德翻译任务的实验表明,基于不确定性的置信度估计显着提高了回译的性能。 Shuo Wang Yang Liu Chao Wang Huanbo Luan Maosong Sun
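As a rough illustration of uncertainty-based confidence, the sketch below estimates per-token confidence for a back-translated pair with Monte-Carlo dropout and uses it to weight the training loss. The model interface (a callable returning per-token probabilities of the target with dropout active) and the mean-minus-std confidence heuristic are assumptions for illustration, not the paper's exact estimator.

```python
import torch

def mc_dropout_confidence(model, src, tgt, n_samples=8):
    """Estimate word-level confidence of a synthetic pair via MC dropout.

    Assumes `model(src, tgt)` returns the probability of each target token,
    shape (tgt_len,), with dropout layers still active in train() mode.
    Confidence = mean - std across stochastic passes (a simple heuristic).
    """
    model.train()                                   # keep dropout on
    with torch.no_grad():
        probs = torch.stack([model(src, tgt) for _ in range(n_samples)])
    token_conf = (probs.mean(0) - probs.std(0)).clamp(0.0, 1.0)
    return token_conf, token_conf.mean()            # word- and sentence-level confidence

def confidence_weighted_nll(token_log_probs, token_conf):
    # Down-weight tokens the model is uncertain about when training on
    # noisy back-translated data.
    return -(token_conf * token_log_probs).sum() / token_conf.sum().clamp(min=1e-6)
```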
11 EMNLP2019 Iterative Dual Domain Adaptation for Neural Machine Translation https://arxiv.org/pdf/1912.07239 Previous studies on the domain adaptation for neural machine translation (NMT) mainly focus on the one-pass transferring out-of-domain translation knowledge to in-domain NMT model. In this paper, we argue that such a strategy fails to fully extract the domain-shared translation knowledge, and repeatedly utilizing corpora of different domains can lead to better distillation of domain-shared translation knowledge. To this end, we propose an iterative dual domain adaptation framework for NMT. Specifically, we first pre-train in-domain and out-of-domain NMT models using their own training corpora respectively, and then iteratively perform bidirectional translation knowledge transfer (from in-domain to out-of-domain and then vice versa) based on knowledge distillation until the in-domain NMT model convergences. Furthermore, we extend the proposed framework to the scenario of multiple out-of-domain training corpora, where the above-mentioned transfer is performed sequentially between the in-domain and each out-of-domain NMT models in the ascending order of their domain similarities. Empirical results on Chinese-English and English-German translation tasks demonstrate the effectiveness of our framework. 先前关于神经机器翻译(NMT)领域适应的研究主要集中在将域外翻译知识一次性转移到域内 NMT 模型上。在本文中,我们认为这种策略无法完全提取领域共享翻译知识,而重复利用不同领域的语料库可以更好地提炼领域共享翻译知识。为此,我们为 NMT 提出了一个迭代双域适应框架。具体来说,我们首先分别使用各自的训练语料对域内和域外 NMT 模型进行预训练,然后基于知识蒸馏迭代执行双向翻译知识转移(从域内到域外,再从域外到域内),直到域内 NMT 模型收敛。此外,我们将所提出的框架扩展到存在多个域外训练语料库的场景,其中上述知识转移按照域相似度的升序,在域内模型与各个域外 NMT 模型之间依次执行。汉英和英德翻译任务的实证结果证明了我们框架的有效性。 Jiali Zeng Yang Liu Jinsong Su Yubin Ge Yaojie Lu Yongjing Yin Jiebo Luo
12 EMNLP2019 Context-Aware Monolingual Repair for Neural Machine Translation https://github.com/lena-voita/good-translation-wrong-in-context https://arxiv.org/pdf/1909.01383 Modern sentence-level NMT systems often produce plausible translations of isolated sentences. However, when put in context, these translations may end up being inconsistent with each other. We propose a monolingual DocRepair model to correct inconsistencies between sentence-level translations. DocRepair performs automatic post-editing on a sequence of sentence-level translations, refining translations of sentences in context of each other. For training, the DocRepair model requires only monolingual document-level data in the target language. It is trained as a monolingual sequence-to-sequence model that maps inconsistent groups of sentences into consistent ones. The consistent groups come from the original training data; the inconsistent groups are obtained by sampling round-trip translations for each isolated sentence. We show that this approach successfully imitates inconsistencies we aim to fix: using contrastive evaluation, we show large improvements in the translation of several contextual phenomena in an English-Russian translation task, as well as improvements in the BLEU score. We also conduct a human evaluation and show a strong preference of the annotators to corrected translations over the baseline ones. Moreover, we analyze which discourse phenomena are hard to capture using monolingual data only. 现代句子级 NMT 系统通常会对孤立的句子产生合理的翻译。然而,当放在上下文中时,这些翻译最终可能会彼此不一致。我们提出了一种单语 DocRepair 模型来纠正句子级翻译之间的不一致。 DocRepair 对一系列句子级翻译执行自动后期编辑,在彼此的上下文中完善句子的翻译。对于训练,DocRepair 模型只需要目标语言的单语文档级数据。它被训练为单语序列到序列模型,将不一致的句子组映射到一致的句子组。一致组来自原始训练数据;不一致的组是通过对每个孤立句子的往返翻译进行采样来获得的。我们表明这种方法成功地模仿了我们旨在解决的不一致问题:使用对比评估,我们在英俄翻译任务中的几种上下文现象的翻译方面取得了很大的改进,以及 BLEU 分数的改进。我们还进行了人工评估,并显示出注释者对更正翻译的强烈偏好,而不是基线翻译。此外,我们分析了仅使用单语数据难以捕捉哪些话语现象。 Elena Voita Rico Sennrich Ivan Titov
13 EMNLP2019 Dynamic Past and Future for Neural Machine Translation https://github.com/zhengzx-nlp/dynamic-nmt https://arxiv.org/pdf/1904.09646 Previous studies have shown that neural machine translation (NMT) models can benefit from explicitly modeling translated (Past) and untranslated (Future) to groups of translated and untranslated contents through parts-to-wholes assignment. The assignment is learned through a novel variant of routing-by-agreement mechanism (Sabour et al., 2017), namely {\em Guided Dynamic Routing}, where the translating status at each decoding step {\em guides} the routing process to assign each source word to its associated group (i.e., translated or untranslated content) represented by a capsule, enabling translation to be made from holistic context. Experiments show that our approach achieves substantial improvements over both RNMT and Transformer by producing more adequate translations. Extensive analysis demonstrates that our method is highly interpretable, which is able to recognize the translated and untranslated contents as expected. 先前的研究表明,神经机器翻译 (NMT) 模型可以受益于通过部分到整体分配将已翻译(过去)和未翻译(未来)显式建模为已翻译和未翻译内容的组。该分配是通过协议路由机制的一种新变体(Sabour 等人,2017 年)学习的,即 {\em 引导动态路由},其中每个解码步骤的翻译状态{\em 引导}路由过程到将每个源词分配给由胶囊表示的相关组(即翻译或未翻译的内容),从而能够从整体上下文进行翻译。实验表明,我们的方法通过产生更充分的翻译,在 RNMT 和 Transformer 上取得了实质性的改进。广泛的分析表明,我们的方法具有高度的可解释性,能够按预期识别翻译和未翻译的内容。 Zaixiang Zheng Shujian Huang Zhaopeng Tu Xin-Yu Dai Jiajun Chen
14 EMNLP2019 Simpler and Faster Learning of Adaptive Policies for Simultaneous Translation https://arxiv.org/pdf/1909.01559 Simultaneous translation is widely useful but remains challenging. Previous work falls into two main categories: (a) fixed-latency policies such as Ma et al. (2019) and (b) adaptive policies such as Gu et al. (2017). The former are simple and effective, but have to aggressively predict future content due to diverging source-target word order; the latter do not anticipate, but suffer from unstable and inefficient training. To combine the merits of both approaches, we propose a simple supervised-learning framework to learn an adaptive policy from oracle READ/WRITE sequences generated from parallel text. At each step, such an oracle sequence chooses to WRITE the next target word if the available source sentence context provides enough information to do so, otherwise READ the next source word. Experiments on German<->English show that our method, without retraining the underlying NMT model, can learn flexible policies with better BLEU scores and similar latencies compared to previous work. 同声传译具有广泛的用途,但仍然具有挑战性。以前的工作分为两大类:(a)固定延迟策略,如 Ma 等人。 (2019) 和 (b) 适应性政策,如 Gu 等人。 (2017)。前者简单有效,但由于源目标词序不同,必须积极预测未来的内容;后者没有预期,而是遭受不稳定和低效的训练。为了结合这两种方法的优点,我们提出了一个简单的监督学习框架,从并行文本生成的 oracle READ/WRITE 序列中学习自适应策略。在每一步,如果可用的源句子上下文提供了足够的信息,这样的预言机序列选择写入下一个目标词,否则读取下一个源词。在德语<->英语上的实验表明,与之前的工作相比,我们的方法无需重新训练底层 NMT 模型,就可以学习具有更好 BLEU 分数和类似延迟的灵活策略。 Baigong Zheng Renjie Zheng Mingbo Ma Liang Huang
15 EMNLP2019 Unsupervised Domain Adaptation for Neural Machine Translation with Domain-Aware Feature Embeddings https://github.com/zdou0830/DAFE https://arxiv.org/pdf/1908.10430 The recent success of neural machine translation models relies on the availability of high quality, in-domain data. Domain adaptation is required when domain-specific data is scarce or nonexistent. Previous unsupervised domain adaptation strategies include training the model with in-domain copied monolingual or back-translated data. However, these methods use generic representations for text regardless of domain shift, which makes it infeasible for translation models to control outputs conditional on a specific domain. In this work, we propose an approach that adapts models with domain-aware feature embeddings, which are learned via an auxiliary language modeling task. Our approach allows the model to assign domain-specific representations to words and output sentences in the desired domain. Our empirical results demonstrate the effectiveness of the proposed strategy, achieving consistent improvements in multiple experimental settings. In addition, we show that combining our method with back translation can further improve the performance of the model. 最近神经机器翻译模型的成功依赖于高质量域内数据的可用性。当特定于域的数据稀缺或不存在时,需要域自适应。以前的无监督域适应策略包括使用域内复制的单语或反向翻译数据训练模型。然而,这些方法使用文本的通用表示而不考虑域转移,这使得翻译模型无法控制以特定域为条件的输出。在这项工作中,我们提出了一种方法,该方法可以通过辅助语言建模任务学习具有领域感知特征嵌入的模型。我们的方法允许模型将特定于域的表示分配给所需域中的单词和输出句子。我们的实证结果证明了所提出策略的有效性,在多个实验环境中实现了一致的改进。此外,我们表明将我们的方法与反向翻译相结合可以进一步提高模型的性能。 Zi-Yi Dou Junjie Hu Antonios Anastasopoulos Graham Neubig
16 EMNLP2019 Controlling Text Complexity in Neural Machine Translation https://github.com/sweta20/ComplexityControlledMT https://arxiv.org/pdf/1911.00835 This work introduces a machine translation task where the output is aimed at audiences of different levels of target language proficiency. We collect a high quality dataset of news articles available in English and Spanish, written for diverse grade levels and propose a method to align segments across comparable bilingual articles. The resulting dataset makes it possible to train multi-task sequence-to-sequence models that translate Spanish into English targeted at an easier reading grade level than the original Spanish. We show that these multi-task models outperform pipeline approaches that translate and simplify text independently. 这项工作引入了机器翻译任务,其中输出针对不同目标语言水平的受众。我们收集了一个高质量的英语和西班牙语新闻文章数据集,为不同年级编写,并提出了一种方法来对齐可比双语文章中的片段。由此产生的数据集可以训练多任务序列到序列模型,将西班牙语翻译成英语,目标是比原始西班牙语更容易阅读。我们表明,这些多任务模型优于独立翻译和简化文本的管道方法。 Sweta Agrawal Marine Carpuat
17 EMNLP2019 Simple and Effective Noisy Channel Modeling for Neural Machine Translation https://github.com/pytorch/fairseq https://arxiv.org/pdf/1908.05731 Previous work on neural noisy channel modeling relied on latent variable models that incrementally process the source and target sentence. This makes decoding decisions based on partial source prefixes even though the full source is available. We pursue an alternative approach based on standard sequence to sequence models which utilize the entire source. These models perform remarkably well as channel models, even though they have neither been trained on, nor designed to factor over incomplete target sentences. Experiments with neural language models trained on billions of words show that noisy channel models can outperform a direct model by up to 3.2 BLEU on WMT’17 German-English translation. We evaluate on four language-pairs and our channel models consistently outperform strong alternatives such right-to-left reranking models and ensembles of direct models. 先前关于神经噪声通道建模的工作依赖于增量处理源语句和目标语句的潜在变量模型。即使完整源可用,这也会基于部分源前缀做出解码决策。我们寻求一种基于标准序列到序列模型的替代方法,该模型利用整个源。这些模型作为通道模型的表现非常好,即使它们既没有接受过训练,也没有被设计为考虑不完整的目标句子。在数十亿单词上训练的神经语言模型的实验表明,在 WMT’17 德英翻译中,噪声通道模型的性能比直接模型高 3.2 BLEU。我们对四个语言对进行评估,我们的通道模型始终优于强大的替代方案,例如从右到左的重新排序模型和直接模型的集合。 Kyra Yee Nathan Ng Yann N. Dauphin Michael Auli
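The decoding rule sketched in the abstract above amounts to rescoring full candidate translations with a channel model and a language model. The snippet below shows that reranking step; the interpolation weights and the length normalization are illustrative assumptions rather than the paper's tuned values.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    tokens: list       # target-side tokens
    direct_lp: float   # log p(y|x) from the direct (source-to-target) model
    channel_lp: float  # log p(x|y) from the channel (target-to-source) model
    lm_lp: float       # log p(y) from a target-side language model

def noisy_channel_score(c: Candidate, lam_channel=1.0, lam_lm=0.3):
    """log p(y|x) + lam_channel * log p(x|y) + lam_lm * log p(y), length-normalized.
    The weights here are illustrative, not the paper's tuned values."""
    return (c.direct_lp + lam_channel * c.channel_lp + lam_lm * c.lm_lp) / max(len(c.tokens), 1)

def rerank(candidates):
    # Pick the candidate with the best combined noisy-channel score.
    return max(candidates, key=noisy_channel_score)
```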
18 EMNLP2019 Hint-Based Training for Non-Autoregressive Machine Translation https://github.com/zhuohan123/hint-nart https://arxiv.org/pdf/1909.06708 Due to the unparallelizable nature of the autoregressive factorization, AutoRegressive Translation (ART) models have to generate tokens sequentially during decoding and thus suffer from high inference latency. Non-AutoRegressive Translation (NART) models were proposed to reduce the inference time, but could only achieve inferior translation accuracy. In this paper, we proposed a novel approach to leveraging the hints from hidden states and word alignments to help the training of NART models. The results achieve significant improvement over previous NART models for the WMT14 En-De and De-En datasets and are even comparable to a strong LSTM-based ART baseline but one order of magnitude faster in inference. 由于自回归分解的无与伦比的性质,自回归翻译 (ART) 模型必须在解码期间按顺序生成令牌,因此会遭受高推理延迟。提出了非自回归翻译 (NART) 模型以减少推理时间,但只能实现较差的翻译准确度。在本文中,我们提出了一种利用隐藏状态和词对齐的提示来帮助训练 NART 模型的新方法。与 WMT14 En-De 和 De-En 数据集的先前 NART 模型相比,结果实现了显着改进,甚至可与基于 LSTM 的强大 ART 基线相媲美,但推理速度快了一个数量级。 Zhuohan Li Zi Lin Di He Fei Tian Tao Qin Liwei Wang Tie-Yan Liu

NAACL

序号 会议/期刊 论文 主要技术 代码 论文下载地址 摘要 摘要翻译 作者
1 NAACL2021 Self-Training for Unsupervised Neural Machine Translation in Unbalanced Training Data Scenarios https://arxiv.org/pdf/2004.04507 Unsupervised neural machine translation (UNMT) that relies solely on massive monolingual corpora has achieved remarkable results in several translation tasks. However, in real-world scenarios, massive monolingual corpora do not exist for some extremely low-resource languages such as Estonian, and UNMT systems usually perform poorly when there is not adequate training corpus for one language. In this paper, we first define and analyze the unbalanced training data scenario for UNMT. Based on this scenario, we propose UNMT self-training mechanisms to train a robust UNMT system and improve its performance in this case. Experimental results on several language pairs show that the proposed methods substantially outperform conventional UNMT systems. 仅依赖海量单语语料库的无监督神经机器翻译 (UNMT) 在多项翻译任务中取得了显著成果。然而,在现实世界的场景中,对于一些资源极少的语言(例如爱沙尼亚语),不存在大量的单语语料库,并且当一种语言没有足够的训练语料库时,UNMT 系统通常表现不佳。在本文中,我们首先定义和分析了 UNMT 的不平衡训练数据场景。基于这种情况,我们提出了 UNMT 自训练机制来训练一个强大的 UNMT 系统并在这种情况下提高其性能。在几个语言对上的实验结果表明,所提出的方法大大优于传统的 UNMT 系统。 Haipeng Sun Rui Wang Kehai Chen Masao Utiyama Eiichiro Sumita Tiejun Zhao
2 NAACL2021 Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages https://arxiv.org/pdf/2009.11201 Unsupervised translation has reached impressive performance on resource-rich language pairs such as English-French and English-German. However, early studies have shown that in more realistic settings involving low-resource, rare languages, unsupervised translation performs poorly, achieving less than 3.0 BLEU. In this work, we show that multilinguality is critical to making unsupervised systems practical for low-resource settings. In particular, we present a single model for 5 low-resource languages (Gujarati, Kazakh, Nepali, Sinhala, and Turkish) to and from English directions, which leverages monolingual and auxiliary parallel data from other high-resource language pairs via a three-stage training scheme. We outperform all current state-of-the-art unsupervised baselines for these languages, achieving gains of up to 14.4 BLEU. Additionally, we outperform a large collection of supervised WMT submissions for various language pairs as well as match the performance of the current state-of-the-art supervised model for Nepali-English. We conduct a series of ablation studies to establish the robustness of our model under different degrees of data quality, as well as to analyze the factors which led to the superior performance of the proposed approach over traditional unsupervised models. 无监督翻译在英语-法语和英语-德语等资源丰富的语言对上取得了令人瞩目的表现。然而,早期研究表明,在涉及资源匮乏、稀有语言的更现实环境中,无监督翻译表现不佳,达到 3.0 BLEU 以下。在这项工作中,我们表明多语言对于使无监督系统适用于低资源环境至关重要。特别是,我们为 5 种低资源语言(古吉拉特语、哈萨克语、尼泊尔语、僧伽罗语和土耳其语)与英语方向之间提供了一个单一模型,该模型利用来自其他高资源语言对的单语和辅助并行数据,通过三个-阶段训练计划。我们优于这些语言的所有当前最先进的无监督基线,实现了高达 14.4 BLEU 的增益。此外,我们在各种语言对的监督 WMT 提交中表现出色,并且与当前最先进的尼泊尔语-英语监督模型的性能相匹配。我们进行了一系列消融研究,以建立我们模型在不同数据质量程度下的稳健性,并分析导致所提出方法优于传统无监督模型的因素。 Xavier Garcia Aditya Siddhant Orhan Firat Ankur P. Parikh
3 NAACL2021 Towards Continual Learning for Multilingual Machine Translation via Vocabulary Substitution https://arxiv.org/pdf/2103.06799 We propose a straightforward vocabulary adaptation scheme to extend the language capacity of multilingual machine translation models, paving the way towards efficient continual learning for multilingual machine translation. Our approach is suitable for large-scale datasets, applies to distant languages with unseen scripts, incurs only minor degradation on the translation performance for the original language pairs and provides competitive performance even in the case where we only possess monolingual data for the new languages. 我们提出了一种简单的词汇适应方案来扩展多语言机器翻译模型的语言能力,为多语言机器翻译的高效持续学习铺平道路。我们的方法适用于大规模数据集,适用于具有看不见的脚本的远程语言,对原始语言对的翻译性能仅造成轻微的下降,并且即使在我们仅拥有新语言的单语数据的情况下也能提供有竞争力的性能。 Xavier Garcia Noah Constant Ankur P. Parikh Orhan Firat
4 NAACL2021 Probing Word Translations in the Transformer and Trading Decoder for Encoder Layers https://arxiv.org/pdf/2003.09586 Due to its effectiveness and performance, the Transformer translation model has attracted wide attention, most recently in terms of probing-based approaches. Previous work focuses on using or probing source linguistic features in the encoder. To date, the way word translation evolves in Transformer layers has not yet been investigated. Naively, one might assume that encoder layers capture source information while decoder layers translate. In this work, we show that this is not quite the case: translation already happens progressively in encoder layers and even in the input embeddings. More surprisingly, we find that some of the lower decoder layers do not actually do that much decoding. We show all of this in terms of a probing approach where we project representations of the layer analyzed to the final trained and frozen classifier level of the Transformer decoder to measure word translation accuracy. Our findings motivate and explain a Transformer configuration change: if translation already happens in the encoder layers, perhaps we can increase the number of encoder layers, while decreasing the number of decoder layers, boosting decoding speed, without loss in translation quality? Our experiments show that this is indeed the case: we can increase speed by up to a factor 2.3 with small gains in translation quality, while an 18-4 deep encoder configuration boosts translation quality by +1.42 BLEU (En-De) at a speed-up of 1.4. 由于其有效性和性能,Transformer 翻译模型引起了广泛关注,最近的关注点集中在基于探测的方法上。以前的工作侧重于在编码器中使用或探测源语言特征。迄今为止,尚未研究单词翻译在 Transformer 各层中的演变方式。人们可能会天真地假设编码器层捕获源信息,而解码器层负责翻译。在这项工作中,我们表明情况并非如此:翻译已经在编码器层甚至输入嵌入中逐步发生。更令人惊讶的是,我们发现一些较低的解码器层实际上并没有做那么多解码。我们通过一种探测方法展示了这一切:将被分析层的表示投影到 Transformer 解码器最终训练好并冻结的分类器层,以测量单词翻译的准确性。我们的发现激发并解释了一种 Transformer 配置的变化:如果翻译已经发生在编码器层,也许我们可以增加编码器层的数量,同时减少解码器层的数量,在不损失翻译质量的情况下提高解码速度?我们的实验表明情况确实如此:我们可以在翻译质量略有提升的情况下将速度提高至多 2.3 倍,而 18-4 的深编码器配置在 1.4 倍加速的同时将翻译质量提高了 +1.42 BLEU(En-De)。 Hongfei Xu Josef van Genabith Qiuhui Liu Deyi Xiong
5 NAACL2021 Active^2 Learning: Actively Reducing Redundancies in Active Learning Methods for Sequence Tagging and Machine Translation https://arxiv.org/pdf/1911.00234 While deep learning is a powerful tool for natural language processing (NLP) problems, successful solutions to these problems rely heavily on large amounts of annotated samples. However, manually annotating data is expensive and time-consuming. Active Learning (AL) strategies reduce the need for huge volumes of labeled data by iteratively selecting a small number of examples for manual annotation based on their estimated utility in training the given model. In this paper, we argue that since AL strategies choose examples independently, they may potentially select similar examples, all of which may not contribute significantly to the learning process. Our proposed approach, Active$\mathbf{^2}$ Learning (A$\mathbf{^2}$L), actively adapts to the deep learning model being trained to eliminate further such redundant examples chosen by an AL strategy. We show that A$\mathbf{^2}$L is widely applicable by using it in conjunction with several different AL strategies and NLP tasks. We empirically demonstrate that the proposed approach is further able to reduce the data requirements of state-of-the-art AL strategies by an absolute percentage reduction of $\approx\mathbf{3-25\%}$ on multiple NLP tasks while achieving the same performance with no additional computation overhead. 虽然深度学习是解决自然语言处理 (NLP) 问题的强大工具,但这些问题的成功解决方案在很大程度上依赖于大量带注释的样本。但是,手动注释数据既昂贵又耗时。主动学习 (AL) 策略通过迭代选择少量示例进行手动注释,基于它们在训练给定模型中的估计效用,减少了对大量标记数据的需求。在本文中,我们认为由于 AL 策略独立选择示例,它们可能会选择相似的示例,所有这些示例可能对学习过程没有显着贡献。我们提出的方法 Active$\mathbf{^2}$ Learning (A$\mathbf{^2}$L) 主动适应正在训练的深度学习模型,以进一步消除 AL 策略选择的此类冗余示例。我们通过将 A$\mathbf{^2}$L 与几种不同的 AL 策略和 NLP 任务结合使用,表明它具有广泛的适用性。我们凭经验证明,所提出的方法能够通过在多个 NLP 任务上减少 $\approx\mathbf{3-25\%}$ 的绝对百分比来进一步降低最先进的 AL 策略的数据需求,同时实现相同的性能,没有额外的计算开销。 Rishi Hazra Parag Dutta Shubham Gupta Mohammed Abdul Qaathir Ambedkar Dukkipati
6 NAACL2021 Neural Machine Translation without Embeddings https://github.com/UriSha/EmbeddinglessNMT https://arxiv.org/pdf/2008.09396 Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and heuristic subword induction algorithms. A simple universal alternative is to represent every computerized text as a sequence of bytes via UTF-8, obviating the need for an embedding layer since there are fewer token types (256) than dimensions. Surprisingly, replacing the ubiquitous embedding layer with one-hot representations of each byte does not hurt performance; experiments on byte-to-byte machine translation from English to 10 different languages show a consistent improvement in BLEU, rivaling character-level and even standard subword-level models. A deeper investigation reveals that the combination of embeddingless models with decoder-input dropout amounts to token dropout, which benefits byte-to-byte models in particular. Uri Shaham Omer Levy
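A minimal sketch of the byte-level, embedding-free input described above: text is mapped to UTF-8 bytes and each byte is represented as a one-hot vector, so no learned embedding matrix is required. The special-token ids are assumptions added for illustration.

```python
import torch
import torch.nn.functional as F

PAD, BOS, EOS = 256, 257, 258      # assumed special ids beyond the 256 byte values
VOCAB_SIZE = 259

def encode_bytes(text: str) -> list:
    """UTF-8 byte ids framed by (assumed) BOS/EOS specials."""
    return [BOS] + list(text.encode("utf-8")) + [EOS]

def one_hot_inputs(ids: list) -> torch.Tensor:
    # One-hot "embedding": with only 259 token types, the one-hot vectors are
    # already small enough to feed to the encoder without a lookup table.
    return F.one_hot(torch.tensor(ids), num_classes=VOCAB_SIZE).float()

ids = encode_bytes("Übersetzung")  # non-ASCII characters expand to multiple bytes
x = one_hot_inputs(ids)            # shape (seq_len, 259)
print(len(ids), tuple(x.shape))
```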
7 NAACL2021 From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding https://bitbucket.org/robvanderg/xsid https://arxiv.org/pdf/2105.07316 The lack of publicly available evaluation data for low-resource languages limits progress in Spoken Language Understanding (SLU). As key tasks like intent classification and slot filling require abundant training data, it is desirable to reuse existing data in high-resource languages to develop models for low-resource scenarios. We introduce xSID, a new benchmark for cross-lingual Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect. To tackle the challenge, we propose a joint learning approach, with English SLU training data and non-English auxiliary tasks from raw text, syntax and translation for transfer. We study two setups which differ by type and language coverage of the pre-trained embeddings. Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification. 缺乏对低资源语言的公开评估数据限制了口语理解 (SLU) 的进展。由于意图分类和槽填充等关键任务需要大量的训练数据,因此需要重用高资源语言的现有数据来开发低资源场景的模型。我们引入了 xSID,这是一种跨语言槽和意图检测的新基准,用于 6 个语言家族的 13 种语言,包括资源非常少的方言。为了应对这一挑战,我们提出了一种联合学习方法,使用英语 SLU 训练数据和来自原始文本、语法和翻译的非英语辅助任务进行迁移。我们研究了两种不同的设置,它们因预训练嵌入的类型和语言覆盖范围而异。我们的结果表明,使用掩码语言建模联合学习主要任务对槽有效,而机器翻译迁移最适合意图分类。 Rob van der Goot Ibrahim Sharaf Aizhan Imankulova Ahmet Üstün Marija Stepanović Alan Ramponi Siti Oryza Khairunnisa Mamoru Komachi Barbara Plank
8 NAACL2021 Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation https://github.com/alexandra-chron/lexical_xlm_relm https://arxiv.org/pdf/2103.10531 Successful methods for unsupervised neural machine translation (UNMT) employ crosslingual pretraining via self-supervision, often in the form of a masked language modeling or a sequence generation task, which requires the model to align the lexical- and high-level representations of the two languages. While cross-lingual pretraining works for similar languages with abundant corpora, it performs poorly in low-resource and distant languages. Previous research has shown that this is because the representations are not sufficiently aligned. In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings. Empirical results demonstrate improved performance both on UNMT (up to 4.5 BLEU) and bilingual lexicon induction using our method compared to a UNMT baseline. 无监督神经机器翻译 (UNMT) 的成功方法通过自我监督采用跨语言预训练,通常采用掩码语言建模或序列生成任务的形式,这需要模型对齐两者的词汇和高级表示语言。虽然跨语言预训练适用于具有丰富语料库的相似语言,但它在资源匮乏和距离较远的语言中表现不佳。先前的研究表明,这是因为表示没有充分对齐。在本文中,我们通过使用类型级别的跨语言子词嵌入来增强具有词汇级别信息的双语掩码语言模型预训练。实证结果表明,与 UNMT 基线相比,使用我们的方法在 UNMT(高达 4.5 BLEU)和双语词典归纳方面的性能都有所提高。 Alexandra Chronopoulou Dario Stojanovski Alexander Fraser
9 NAACL2021 Multi-Task Learning with Shared Encoder for Non-Autoregressive Machine Translation https://github.com/yongchanghao/multi-task-nat https://arxiv.org/pdf/2010.12868 Non-Autoregressive machine Translation (NAT) models have demonstrated significant inference speedup but suffer from inferior translation accuracy. The common practice to tackle the problem is transferring the Autoregressive machine Translation (AT) knowledge to NAT models, e.g., with knowledge distillation. In this work, we hypothesize and empirically verify that AT and NAT encoders capture different linguistic properties of source sentences. Therefore, we propose to adopt Multi-Task learning to transfer the AT knowledge to NAT models through encoder sharing. Specifically, we take the AT model as an auxiliary task to enhance NAT model performance. Experimental results on WMT14 English-German and WMT16 English-Romanian datasets show that the proposed Multi-Task NAT achieves significant improvements over the baseline NAT models. Furthermore, the performance on large-scale WMT19 and WMT20 English-German datasets confirm the consistency of our proposed method. In addition, experimental results demonstrate that our Multi-Task NAT is complementary to knowledge distillation, the standard knowledge transfer method for NAT. 非自回归机器翻译 (NAT) 模型已经证明了显着的推理加速,但翻译准确性较差。解决该问题的常见做法是将自回归机器翻译 (AT) 知识转移到 NAT 模型,例如,通过知识蒸馏。在这项工作中,我们假设并凭经验验证 AT 和 NAT 编码器捕获源句子的不同语言属性。因此,我们建议采用多任务学习,通过编码器共享将 AT 知识转移到 NAT 模型。具体来说,我们将 AT 模型作为辅助任务来增强 NAT 模型的性能。 WMT14 English-German 和 WMT16 English-Romanian 数据集的实验结果表明,所提出的多任务 NAT 比基线 NAT 模型取得了显着的改进。此外,大规模 WMT19 和 WMT20 英德数据集的性能证实了我们提出的方法的一致性。此外,实验结果表明,我们的多任务 NAT 是知识蒸馏的补充,这是 NAT 的标准知识转移方法。 Yongchang Hao Shilin He Wenxiang Jiao Zhaopeng Tu Michael Lyu Xing Wang
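The encoder-sharing idea above reduces, at training time, to summing a non-autoregressive loss and an autoregressive auxiliary loss computed from the same encoder output. The module interfaces and the auxiliary weight below are assumptions for illustration.

```python
import torch.nn as nn

class MultiTaskNAT(nn.Module):
    """One shared encoder, a NAT decoder (main task) and an AT decoder (auxiliary)."""

    def __init__(self, encoder, nat_decoder, at_decoder, at_weight=0.5):
        super().__init__()
        self.encoder = encoder
        self.nat_decoder = nat_decoder      # non-autoregressive decoder
        self.at_decoder = at_decoder        # autoregressive auxiliary decoder
        self.at_weight = at_weight          # illustrative mixing weight

    def forward(self, src_tokens, tgt_tokens):
        enc_out = self.encoder(src_tokens)              # shared source representation
        nat_loss = self.nat_decoder(enc_out, tgt_tokens)
        at_loss = self.at_decoder(enc_out, tgt_tokens)  # auxiliary signal used only in training
        return nat_loss + self.at_weight * at_loss
```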
10 NAACL2021 Assessing Reference-Free Peer Evaluation for Machine Translation https://arxiv.org/pdf/2104.05146 Reference-free evaluation has the potential to make machine translation evaluation substantially more scalable, allowing us to pivot easily to new languages or domains. It has been recently shown that the probabilities given by a large, multilingual model can achieve state of the art results when used as a reference-free metric. We experiment with various modifications to this model and demonstrate that by scaling it up we can match the performance of BLEU. We analyze various potential weaknesses of the approach and find that it is surprisingly robust and likely to offer reasonable performance across a broad spectrum of domains and different system qualities. Sweta Agrawal George Foster Markus Freitag Colin Cherry
11 NAACL2021 Generative Imagination Elevates Machine Translation https://arxiv.org/pdf/2009.09654 There are common semantics shared across text and images. Given a sentence in a source language, whether depicting the visual scene helps translation into a target language? Existing multimodal neural machine translation methods (MNMT) require triplets of bilingual sentence - image for training and tuples of source sentence - image for inference. In this paper, we propose ImagiT, a novel machine translation method via visual imagination. ImagiT first learns to generate visual representation from the source sentence, and then utilizes both source sentence and the “imagined representation” to produce a target translation. Unlike previous methods, it only needs the source sentence at the inference time. Experiments demonstrate that ImagiT benefits from visual imagination and significantly outperforms the text-only neural machine translation baselines. Further analysis reveals that the imagination process in ImagiT helps fill in missing information when performing the degradation strategy. 文本和图像之间共享通用语义。给定一个源语言的句子,描绘视觉场景是否有助于翻译成目标语言?现有的多模态神经机器翻译方法 (MNMT) 需要双语句子的三元组 - 用于训练的图像和用于推理的源句元组 - 图像。在本文中,我们提出了 ImagiT,一种通过视觉想象的新型机器翻译方法。 ImagiT 首先学习从源句生成视觉表示,然后利用源句和“想象的表示”来生成目标翻译。与以前的方法不同,它只需要推理时的源语句。实验表明,ImagiT 受益于视觉想象力,并显着优于纯文本神经机器翻译基线。进一步的分析表明,在执行退化策略时,ImagiT 中的想象过程有助于填补缺失的信息。 Quanyu Long Mingxuan Wang Lei Li
12 NAACL2021 Context-aware Decoder for Neural Machine Translation using a Target-side Document-Level Language Model https://arxiv.org/pdf/2010.12827 Although many context-aware neural machine translation models have been proposed to incorporate contexts in translation, most of those models are trained end-to-end on parallel documents aligned in sentence-level. Because only a few domains (and language pairs) have such document-level parallel data, we cannot perform accurate context-aware translation in most domains. We therefore present a simple method to turn a sentence-level translation model into a context-aware model by incorporating a document-level language model into the decoder. Our context-aware decoder is built upon only a sentence-level parallel corpora and monolingual corpora; thus no document-level parallel data is needed. In a theoretical viewpoint, the core part of this work is the novel representation of contextual information using point-wise mutual information between context and the current sentence. We show the effectiveness of our approach in three language pairs, English to French, English to Russian, and Japanese to English, by evaluation in \textsc{bleu} and contrastive tests for context-aware translation. 尽管已经提出了许多上下文感知神经机器翻译模型来将上下文结合到翻译中,但这些模型中的大多数都是在句子级别对齐的并行文档上进行端到端训练的。因为只有少数域(和语言对)有这样的文档级并行数据,我们无法在大多数域中执行准确的上下文感知翻译。因此,我们提出了一种简单的方法,通过将文档级语言模型合并到解码器中,将句子级翻译模型转变为上下文感知模型。我们的上下文感知解码器仅建立在句子级平行语料库和单语语料库上;因此不需要文档级并行数据。从理论的角度来看,这项工作的核心部分是使用上下文和当前句子之间的逐点互信息对上下文信息进行新颖的表示。我们通过 \textsc{bleu} 中的评估和上下文感知翻译的对比测试,展示了我们的方法在英语到法语、英语到俄语和日语到英语这三种语言对中的有效性。 Amane Sugiyama Naoki Yoshinaga
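The point-wise mutual information formulation above can be viewed, at decoding time, as adding to the sentence-level NMT score a bonus measuring how much a target-side document LM prefers the token given the document context over the current sentence alone. The interpolation weight below is an illustrative assumption.

```python
import math

def context_aware_token_score(nmt_logprob: float,
                              lm_logprob_doc_ctx: float,
                              lm_logprob_sent_only: float,
                              beta: float = 0.5) -> float:
    """Sentence-level NMT log-prob plus a PMI-style document-context bonus.

    pmi = log p_LM(y_t | doc context) - log p_LM(y_t | current sentence only);
    beta is an illustrative interpolation weight.
    """
    pmi = lm_logprob_doc_ctx - lm_logprob_sent_only
    return nmt_logprob + beta * pmi

# Toy example: a pronoun the document LM strongly prefers given earlier sentences.
print(context_aware_token_score(math.log(0.20), math.log(0.60), math.log(0.30)))
```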
13 NAACL2021 Restoring and Mining the Records of the Joseon Dynasty via Neural Language Modeling and Machine Translation https://arxiv.org/pdf/2104.05964 Understanding voluminous historical records provides clues on the past in various aspects, such as social and political issues and even natural science facts. However, it is generally difficult to fully utilize the historical records, since most of the documents are not written in a modern language and part of the contents are damaged over time. As a result, restoring the damaged or unrecognizable parts as well as translating the records into modern languages are crucial tasks. In response, we present a multi-task learning approach to restore and translate historical documents based on a self-attention mechanism, specifically utilizing two Korean historical records, ones of the most voluminous historical records in the world. Experimental results show that our approach significantly improves the accuracy of the translation task than baselines without multi-task learning. In addition, we present an in-depth exploratory analysis on our translated results via topic modeling, uncovering several significant historical events. 了解大量的历史记录可以从各个方面提供有关过去的线索,例如社会和政治问题,甚至自然科学事实。然而,要充分利用历史记录,一般来说是困难的,因为大多数文件不是用现代语言写成的,部分内容随着时间的推移而损坏。因此,修复损坏或无法识别的部分以及将记录翻译成现代语言是至关重要的任务。作为回应,我们提出了一种基于自注意力机制的多任务学习方法来恢复和翻译历史文献,特别是利用了两个韩国历史记录,这是世界上最丰富的历史记录。实验结果表明,与没有多任务学习的基线相比,我们的方法显着提高了翻译任务的准确性。此外,我们通过主题建模对我们的翻译结果进行了深入的探索性分析,揭示了几个重要的历史事件。 Kyeongpil Kang Kyohoon Jin Soyoung Yang Sujin Jang Jaegul Choo Youngbin Kim
14 NAACL2021 The Curious Case of Hallucinations in Neural Machine Translation https://github.com/vyraun/hallucinations https://arxiv.org/pdf/2104.06683 In this work, we study hallucinations in Neural Machine Translation (NMT), which lie at an extreme end on the spectrum of NMT pathologies. Firstly, we connect the phenomenon of hallucinations under source perturbation to the Long-Tail theory of Feldman (2020), and present an empirically validated hypothesis that explains hallucinations under source perturbation. Secondly, we consider hallucinations under corpus-level noise (without any source perturbation) and demonstrate that two prominent types of natural hallucinations (detached and oscillatory outputs) could be generated and explained through specific corpus-level noise patterns. Finally, we elucidate the phenomenon of hallucination amplification in popular data-generation processes such as Backtranslation and sequence-level Knowledge Distillation. 在这项工作中,我们研究了神经机器翻译 (NMT) 中的幻觉,它处于 NMT 病理范围的极端。首先,我们将源扰动下的幻觉现象与 Feldman (2020) 的长尾理论联系起来,并提出了一个经过实证验证的假设来解释源扰动下的幻觉。其次,我们考虑了语料库级别噪声(没有任何源扰动)下的幻觉,并证明可以通过特定的语料库级别噪声模式生成和解释两种主要类型的自然幻觉(分离输出和振荡输出)。最后,我们阐明了流行的数据生成过程(例如反向翻译和序列级知识蒸馏)中的幻觉放大现象。 Vikas Raunak Arul Menezes Marcin Junczys-Dowmunt
15 NAACL2021 Revisiting the Weaknesses of Reinforcement Learning for Neural Machine Translation https://github.com/samuki/reinforce-joey https://arxiv.org/pdf/2106.08942 Policy gradient algorithms have found wide adoption in NLP, but have recently become subject to criticism, doubting their suitability for NMT. Choshen et al. (2020) identify multiple weaknesses and suspect that their success is determined by the shape of output distributions rather than the reward. In this paper, we revisit these claims and study them under a wider range of configurations. Our experiments on in-domain and cross-domain adaptation reveal the importance of exploration and reward scaling, and provide empirical counter-evidence to these claims. 策略梯度算法在 NLP 中被广泛采用,但最近受到批评,怀疑它们是否适用于 NMT。 Choshen 等人。 (2020) 确定了多个弱点,并怀疑它们的成功取决于输出分布的形状而不是奖励。在本文中,我们重新审视这些主张并在更广泛的配置下研究它们。我们对域内和跨域适应的实验揭示了探索和奖励缩放的重要性,并为这些主张提供了实证反证。 Samuel Kiegeland Julia Kreutzer
16 NAACL2021 Cross-lingual Supervision Improves Unsupervised Neural Machine Translation https://arxiv.org/pdf/2004.03137 Neural machine translation~(NMT) is ineffective for zero-resource languages. Recent works exploring the possibility of unsupervised neural machine translation (UNMT) with only monolingual data can achieve promising results. However, there are still big gaps between UNMT and NMT with parallel supervision. In this work, we introduce a multilingual unsupervised NMT (\method) framework to leverage weakly supervised signals from high-resource language pairs to zero-resource translation directions. More specifically, for unsupervised language pairs \texttt{En-De}, we can make full use of the information from parallel dataset \texttt{En-Fr} to jointly train the unsupervised translation directions all in one model. \method is based on multilingual models which require no changes to the standard unsupervised NMT. Empirical results demonstrate that \method significantly improves the translation quality by more than 3 BLEU score on six benchmark unsupervised translation directions. 神经机器翻译~(NMT)对零资源语言无效。最近探索仅使用单语数据进行无监督神经机器翻译 (UNMT) 的可能性的工作可以获得有希望的结果。然而,并行监督的UNMT和NMT之间仍然存在很大差距。在这项工作中,我们引入了一种多语言无监督 NMT(\method)框架,以利用来自高资源语言对的弱监督信号到零资源翻译方向。更具体地说,对于无监督语言对 \texttt{En-De},我们可以充分利用来自并行数据集 \texttt{En-Fr} 的信息,在一个模型中联合训练无监督翻译方向。 \method 基于多语言模型,无需更改标准的无监督 NMT。实证结果表明,\method 在六个基准无监督翻译方向上显着提高了超过 3 BLEU 分数的翻译质量。 Mingxuan Wang Hongxiao Bai Hai Zhao Lei Li
17 NAACL2019 ReWE: Regressing Word Embeddings for Regularization of Neural Machine Translation Systems https://arxiv.org/pdf/1904.02461 Regularization of neural machine translation is still a significant problem, especially in low-resource settings. To mollify this problem, we propose regressing word embeddings (ReWE) as a new regularization technique in a system that is jointly trained to predict the next word in the translation (categorical value) and its word embedding (continuous value). Such a joint training allows the proposed system to learn the distributional properties represented by the word embeddings, empirically improving the generalization to unseen sentences. Experiments over three translation datasets have showed a consistent improvement over a strong baseline, ranging between 0.91 and 2.54 BLEU points, and also a marked improvement over a state-of-the-art system. 神经机器翻译的正则化仍然是一个重大问题,尤其是在资源匮乏的环境中。为了解决这个问题,我们提出回归词嵌入 (ReWE) 作为一种新的正则化技术,该系统经过联合训练以预测翻译中的下一个词(分类值)及其词嵌入(连续值)。这种联合训练允许所提出的系统学习由词嵌入表示的分布特性,从经验上改进对看不见的句子的泛化。在三个翻译数据集上的实验表明,在强大的基线上有持续的改进,范围在 0.91 到 2.54 BLEU 点之间,并且比最先进的系统也有显着的改进。 Inigo Jauregi Unanue Ehsan Zare Borzeshi Nazanin Esmaili Massimo Piccardi
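A compact sketch of the joint objective described above: the usual cross-entropy over the vocabulary plus a regression term that asks an extra output head to reproduce the gold word's embedding. The cosine-distance regression and the mixing weight are assumptions; the paper's exact configuration may differ.

```python
import torch
import torch.nn.functional as F

def rewe_loss(logits, pred_emb, targets, emb_table, lam=0.2):
    """Cross-entropy + word-embedding regression (ReWE-style, illustrative).

    logits:    (batch, vocab)  categorical prediction of the next word
    pred_emb:  (batch, dim)    continuous prediction of its embedding
    emb_table: (vocab, dim)    target-side embedding matrix
    lam:       illustrative mixing weight
    """
    ce = F.cross_entropy(logits, targets)
    gold_emb = emb_table[targets]                                     # (batch, dim)
    reg = 1.0 - F.cosine_similarity(pred_emb, gold_emb, dim=-1).mean()
    return ce + lam * reg
```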
18 NAACL2019 Lost in Machine Translation: A Method to Reduce Meaning Loss https://github.com/reubenharry/pragmatic-translation https://arxiv.org/pdf/1902.09514 A desideratum of high-quality translation systems is that they preserve meaning, in the sense that two sentences with different meanings should not translate to one and the same sentence in another language. However, state-of-the-art systems often fail in this regard, particularly in cases where the source and target languages partition the “meaning space” in different ways. For instance, “I cut my finger.” and “I cut my finger off.” describe different states of the world but are translated to French (by both Fairseq and Google Translate) as “Je me suis coupe le doigt.”, which is ambiguous as to whether the finger is detached. More generally, translation systems are typically many-to-one (non-injective) functions from source to target language, which in many cases results in important distinctions in meaning being lost in translation. Building on Bayesian models of informative utterance production, we present a method to define a less ambiguous translation system in terms of an underlying pre-trained neural sequence-to-sequence model. This method increases injectivity, resulting in greater preservation of meaning as measured by improvement in cycle-consistency, without impeding translation quality (measured by BLEU score). 高质量翻译系统的一个必要条件是它们能够保留意义,也就是说,两个意义不同的句子不应翻译成另一种语言的同一个句子。然而,最先进的系统在这方面经常失败,特别是在源语言和目标语言以不同方式划分“意义空间”的情况下。例如,“我割破了手指”。和“我切掉了我的手指。”描述世界的不同状态,但被翻译成法语(由 Fairseq 和谷歌翻译)为“Je me suis coupe le doigt.”,对于手指是否分离是模棱两可的。更一般地说,翻译系统通常是从源语言到目标语言的多对一(非内射)功能,这在许多情况下导致翻译中丢失意义的重要区别。基于信息性话语产生的贝叶斯模型,我们提出了一种方法,根据底层的预训练神经序列到序列模型定义一个不那么模糊的翻译系统。这种方法增加了注入性,从而在不影响翻译质量(由 BLEU 分数衡量)的情况下,通过改进循环一致性来衡量更好地保留意义。 Reuben Cohn-Gordon Noah Goodman
19 NAACL2019 Syntax-Enhanced Neural Machine Translation with Syntax-Aware Word Representations https://arxiv.org/pdf/1905.02878 Syntax has been demonstrated highly effective in neural machine translation (NMT). Previous NMT models integrate syntax by representing 1-best tree outputs from a well-trained parsing system, e.g., the representative Tree-RNN and Tree-Linearization methods, which may suffer from error propagation. In this work, we propose a novel method to integrate source-side syntax implicitly for NMT. The basic idea is to use the intermediate hidden representations of a well-trained end-to-end dependency parser, which are referred to as syntax-aware word representations (SAWRs). Then, we simply concatenate such SAWRs with ordinary word embeddings to enhance basic NMT models. The method can be straightforwardly integrated into the widely-used sequence-to-sequence (Seq2Seq) NMT models. We start with a representative RNN-based Seq2Seq baseline system, and test the effectiveness of our proposed method on two benchmark datasets of the Chinese-English and English-Vietnamese translation tasks, respectively. Experimental results show that the proposed approach is able to bring significant BLEU score improvements on the two datasets compared with the baseline, 1.74 points for Chinese-English translation and 0.80 point for English-Vietnamese translation, respectively. In addition, the approach also outperforms the explicit Tree-RNN and Tree-Linearization methods. 语法已被证明在神经机器翻译 (NMT) 中非常有效。以前的 NMT 模型通过表示来自训练有素的解析系统的 1-best 树输出来集成语法,例如代表性的 Tree-RNN 和 Tree-Linearization 方法,它们可能会受到错误传播的影响。在这项工作中,我们提出了一种为 NMT 隐式集成源端语法的新方法。基本思想是使用训练有素的端到端依赖解析器的中间隐藏表示,称为语法感知词表示 (SAWR)。然后,我们简单地将这些 SAWR 与普通的词嵌入连接起来,以增强基本的 NMT 模型。该方法可以直接集成到广泛使用的序列到序列 (Seq2Seq) NMT 模型中。我们从一个代表性的基于 RNN 的 Seq2Seq 基线系统开始,并分别在汉英和英越翻译任务的两个基准数据集上测试我们提出的方法的有效性。实验结果表明,与基线相比,所提出的方法能够在两个数据集上带来显着的 BLEU 分数改进,汉英翻译分别为 1.74 分和英越翻译 0.80 分。此外,该方法还优于显式 Tree-RNN 和 Tree-Linearization 方法。 Meishan Zhang Zhenghua Li Guohong Fu Min Zhang
20 NAACL2019 Improving Robustness of Machine Translation with Synthetic Noise https://github.com/MysteryVaibhav/robust_mtnt https://arxiv.org/pdf/1902.09508 Modern Machine Translation (MT) systems perform consistently well on clean, in-domain text. However most human generated text, particularly in the realm of social media, is full of typos, slang, dialect, idiolect and other noise which can have a disastrous impact on the accuracy of output translation. In this paper we leverage the Machine Translation of Noisy Text (MTNT) dataset to enhance the robustness of MT systems by emulating naturally occurring noise in otherwise clean data. Synthesizing noise in this manner we are ultimately able to make a vanilla MT system resilient to naturally occurring noise and partially mitigate loss in accuracy resulting therefrom. 现代机器翻译 (MT) 系统在干净的域内文本上始终表现良好。然而,大多数人工生成的文本,特别是在社交媒体领域,充满了拼写错误、俚语、方言、方言和其他噪音,这些噪音会对输出翻译的准确性产生灾难性的影响。在本文中,我们利用嘈杂文本机器翻译 (MTNT) 数据集通过在其他干净的数据中模拟自然发生的噪声来增强 MT 系统的鲁棒性。以这种方式合成噪声,我们最终能够使普通 MT 系统对自然发生的噪声具有弹性,并部分减轻由此导致的精度损失。 Vaibhav Vaibhav Sumeet Singh Craig Stewart Graham Neubig Vladimir Karpukhin Omer Levy Jacob Eisenstein Marjan Ghazvininejad
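A toy version of the noise-injection step described above: clean in-domain source sentences are corrupted with typo-like character swaps and drops before training. The noise types and rates are illustrative placeholders; the paper derives its noise profile from the MTNT data rather than from these fixed rates.

```python
import random

def add_synthetic_noise(sentence: str, swap_p=0.05, drop_p=0.02, seed=None) -> str:
    """Inject character-level noise: adjacent swaps and random drops (illustrative rates)."""
    rng = random.Random(seed)
    chars, out, i = list(sentence), [], 0
    while i < len(chars):
        if i + 1 < len(chars) and rng.random() < swap_p:
            out += [chars[i + 1], chars[i]]   # swap two adjacent characters
            i += 2
        elif rng.random() < drop_p:
            i += 1                            # drop this character
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)

print(add_synthetic_noise("this is a clean in-domain training sentence", seed=0))
```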
21 NAACL2019 Differentiable Sampling with Flexible Reference Word Order for Neural Machine Translation https://github.com/Izecson/saml-nmt https://arxiv.org/pdf/1904.04079 Despite some empirical success at correcting exposure bias in machine translation, scheduled sampling algorithms suffer from a major drawback: they incorrectly assume that words in the reference translations and in sampled sequences are aligned at each time step. Our new differentiable sampling algorithm addresses this issue by optimizing the probability that the reference can be aligned with the sampled output, based on a soft alignment predicted by the model itself. As a result, the output distribution at each time step is evaluated with respect to the whole predicted sequence. Experiments on IWSLT translation tasks show that our approach improves BLEU compared to maximum likelihood and scheduled sampling baselines. In addition, our approach is simpler to train with no need for sampling schedule and yields models that achieve larger improvements with smaller beam sizes. Weijia Xu Xing Niu Marine Carpuat
22 NAACL2019 Fluent Translations from Disfluent Speech in End-to-End Speech Translation https://arxiv.org/pdf/1906.00556 Spoken language translation applications for speech suffer due to conversational speech phenomena, particularly the presence of disfluencies. With the rise of end-to-end speech translation models, processing steps such as disfluency removal that were previously an intermediate step between speech recognition and machine translation need to be incorporated into model architectures. We use a sequence-to-sequence model to translate from noisy, disfluent speech to fluent text with disfluencies removed using the recently collected `copy-edited’ references for the Fisher Spanish-English dataset. We are able to directly generate fluent translations and introduce considerations about how to evaluate success on this task. This work provides a baseline for a new task, the translation of conversational speech with joint removal of disfluencies. 由于会话语音现象,特别是不流畅的存在,语音的口语翻译应用程序受到影响。随着端到端语音翻译模型的兴起,之前作为语音识别和机器翻译之间的中间步骤的不流畅去除等处理步骤需要纳入模型架构中。我们使用序列到序列模型将嘈杂、不流利的语音转换为流利的文本,并使用最近收集的 Fisher 西班牙语-英语数据集的“复制编辑”参考删除了不流利之处。我们能够直接生成流畅的翻译,并引入有关如何评估此任务成功的注意事项。这项工作为一项新任务提供了基线,即翻译会话语音并联合消除不流畅。 Elizabeth Salesky Matthias Sperber Alex Waibel
23 NAACL2019 Selective Attention for Context-aware Neural Machine Translation https://github.com/sameenmaruf/selective-attn https://arxiv.org/pdf/1903.08788 Despite the progress made in sentence-level NMT, current systems still fall short at achieving fluent, good quality translation for a full document. Recent works in context-aware NMT consider only a few previous sentences as context and may not scale to entire documents. To this end, we propose a novel and scalable top-down approach to hierarchical attention for context-aware NMT which uses sparse attention to selectively focus on relevant sentences in the document context and then attends to key words in those sentences. We also propose single-level attention approaches based on sentence or word-level information in the context. The document-level context representation, produced from these attention modules, is integrated into the encoder or decoder of the Transformer model depending on whether we use monolingual or bilingual context. Our experiments and evaluation on English-German datasets in different document MT settings show that our selective attention approach not only significantly outperforms context-agnostic baselines but also surpasses context-aware baselines in most cases. 尽管在句子级 NMT 方面取得了进展,但当前的系统仍然无法实现对完整文档的流畅、高质量的翻译。最近在上下文感知 NMT 中的工作只考虑前面的几个句子作为上下文,可能无法扩展到整个文档。为此,我们提出了一种新颖且可扩展的自上而下的上下文感知 NMT 分层注意力方法,该方法使用稀疏注意力来选择性地关注文档上下文中的相关句子,然后关注这些句子中的关键词。我们还提出了基于上下文中句子或单词级信息的单级注意力方法。由这些注意力模块产生的文档级上下文表示被集成到 Transformer 模型的编码器或解码器中,具体取决于我们使用的是单语还是双语上下文。我们在不同文档 MT 设置中对英德数据集的实验和评估表明,我们的选择性注意方法不仅显着优于上下文无关基线,而且在大多数情况下也超过了上下文感知基线。 Sameen Maruf André F. T. Martins Gholamreza Haffari

COLING

序号 会议/期刊 论文 主要技术 代码 论文下载地址 摘要 摘要翻译 作者
1 COLING2020 Emergent Communication Pretraining for Few-Shot Machine Translation https://github.com/cambridgeltl/ECNMT https://arxiv.org/pdf/2011.00890 While state-of-the-art models that rely upon massively multilingual pretrained encoders achieve sample efficiency in downstream applications, they still require abundant amounts of unlabelled text. Nevertheless, most of the world’s languages lack such resources. Hence, we investigate a more radical form of unsupervised knowledge transfer in the absence of linguistic data. In particular, for the first time we pretrain neural networks via emergent communication from referential games. Our key assumption is that grounding communication on images—-as a crude approximation of real-world environments—-inductively biases the model towards learning natural languages. On the one hand, we show that this substantially benefits machine translation in few-shot settings. On the other hand, this also provides an extrinsic evaluation protocol to probe the properties of emergent languages ex vitro. Intuitively, the closer they are to natural languages, the higher the gains from pretraining on them should be. For instance, in this work we measure the influence of communication success and maximum sequence length on downstream performances. Finally, we introduce a customised adapter layer and annealing strategies for the regulariser of maximum-a-posteriori inference during fine-tuning. These turn out to be crucial to facilitate knowledge transfer and prevent catastrophic forgetting. Compared to a recurrent baseline, our method yields gains of 59.0%~147.6% in BLEU score with only 500 NMT training instances and 65.1%~196.7% with 1,000 NMT training instances across four language pairs. These proof-of-concept results reveal the potential of emergent communication pretraining for both natural language processing tasks in resource-poor settings and extrinsic evaluation of artificial languages. Yaoyiran Li Edoardo M. Ponti Ivan Vulić Anna Korhonen
2 COLING2020 Investigating Catastrophic Forgetting During Continual Training for Neural Machine Translation https://arxiv.org/pdf/2011.00678 Neural machine translation (NMT) models usually suffer from catastrophic forgetting during continual training where the models tend to gradually forget previously learned knowledge and swing to fit the newly added data which may have a different distribution, e.g. a different domain. Although many methods have been proposed to solve this problem, we cannot get to know what causes this phenomenon yet. Under the background of domain adaptation, we investigate the cause of catastrophic forgetting from the perspectives of modules and parameters (neurons). The investigation on the modules of the NMT model shows that some modules have tight relation with the general-domain knowledge while some other modules are more essential in the domain adaptation. And the investigation on the parameters shows that some parameters are important for both the general-domain and in-domain translation and the great change of them during continual training brings about the performance decline in general-domain. We conduct experiments across different language pairs and domains to ensure the validity and reliability of our findings. 神经机器翻译 (NMT) 模型在持续训练期间通常会遭受灾难性遗忘,其中模型往往会逐渐忘记先前学到的知识并摆动以适应可能具有不同分布的新添加数据,例如不同的域。虽然已经提出了很多方法来解决这个问题,但我们还不能知道是什么导致了这种现象。在领域适应的背景下,我们从模块和参数(神经元)的角度研究了灾难性遗忘的原因。对 NMT 模型模块的调查表明,一些模块与通用领域知识关系密切,而另一些模块在领域适应中更为重要。对参数的调查表明,一些参数对通用域和域内翻译都很重要,并且在持续训练过程中它们的巨大变化导致通用域的性能下降。我们在不同的语言对和领域进行实验,以确保我们发现的有效性和可靠性。 Shuhao Gu Yang Feng
3 COLING2020 Layer-wise Multi-view Learning for Neural Machine Translation https://arxiv.org/pdf/2011.01482 Traditional neural machine translation is limited to the topmost encoder layer’s context representation and cannot directly perceive the lower encoder layers. Existing solutions usually rely on the adjustment of network architecture, making the calculation more complicated or introducing additional structural restrictions. In this work, we propose layer-wise multi-view learning to solve this problem, circumventing the necessity to change the model structure. We regard each encoder layer’s off-the-shelf output, a by-product in layer-by-layer encoding, as the redundant view for the input sentence. In this way, in addition to the topmost encoder layer (referred to as the primary view), we also incorporate an intermediate encoder layer as the auxiliary view. We feed the two views to a partially shared decoder to maintain independent prediction. Consistency regularization based on KL divergence is used to encourage the two views to learn from each other. Extensive experimental results on five translation tasks show that our approach yields stable improvements over multiple strong baselines. As another bonus, our method is agnostic to network architectures and can maintain the same inference speed as the original model. 传统的神经机器翻译仅限于最顶层编码器层的上下文表示,无法直接感知较低的编码器层。现有的解决方案通常依赖于网络架构的调整,使计算更加复杂或引入额外的结构限制。在这项工作中,我们提出了分层多视图学习来解决这个问题,避免了改变模型结构的必要性。我们将每个编码器层的现成输出(逐层编码的副产品)视为输入句子的冗余视图。这样,除了最顶层的编码器层(称为主视图),我们还合并了一个中间编码器层作为辅助视图。我们将两个视图提供给部分共享的解码器以保持独立预测。基于KL散度的一致性正则化用于鼓励两种观点相互学习。五项翻译任务的大量实验结果表明,我们的方法在多个强基线上产生了稳定的改进。作为另一个好处,我们的方法与网络架构无关,并且可以保持与原始模型相同的推理速度。 Qiang Wang Changliang Li Yue Zhang Tong Xiao Jingbo Zhu
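The consistency term described above can be sketched as follows: the decoder produces one distribution from the topmost encoder layer (primary view) and one from an intermediate layer (auxiliary view), and a symmetric KL term encourages the two to agree. The loss weight and temperature are illustrative assumptions.

```python
import torch.nn.functional as F

def multi_view_loss(logits_primary, logits_auxiliary, targets, alpha=0.5, tau=1.0):
    """Cross-entropy on both views plus a symmetric KL consistency term.

    logits_primary:   decoder output fed by the top encoder layer
    logits_auxiliary: decoder output fed by an intermediate encoder layer
    alpha, tau:       illustrative weight and temperature
    """
    ce = F.cross_entropy(logits_primary, targets) + F.cross_entropy(logits_auxiliary, targets)
    log_p = F.log_softmax(logits_primary / tau, dim=-1)
    log_q = F.log_softmax(logits_auxiliary / tau, dim=-1)
    kl = (F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
          + F.kl_div(log_q, log_p, log_target=True, reduction="batchmean"))
    return ce + alpha * kl
```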
4 COLING2020 Leveraging Discourse Rewards for Document-Level Neural Machine Translation https://arxiv.org/pdf/2010.03732 Document-level machine translation focuses on the translation of entire documents from a source to a target language. It is widely regarded as a challenging task since the translation of the individual sentences in the document needs to retain aspects of the discourse at document level. However, document-level translation models are usually not trained to explicitly ensure discourse quality. Therefore, in this paper we propose a training approach that explicitly optimizes two established discourse metrics, lexical cohesion (LC) and coherence (COH), by using a reinforcement learning objective. Experiments over four different language pairs and three translation domains have shown that our training approach has been able to achieve more cohesive and coherent document translations than other competitive approaches, yet without compromising the faithfulness to the reference translation. In the case of the Zh-En language pair, our method has achieved an improvement of 2.46 percentage points (pp) in LC and 1.17 pp in COH over the runner-up, while at the same time improving 0.63 pp in BLEU score and 0.47 pp in F_BERT. 文档级机器翻译侧重于将整个文档从源语言翻译成目标语言。它被广泛认为是一项具有挑战性的任务,因为文档中单个句子的翻译需要在文档级别保留话语的各个方面。然而,文档级翻译模型通常没有经过训练以明确确保话语质量。因此,在本文中,我们提出了一种训练方法,该方法通过使用强化学习目标明确优化两个已建立的话语指标,词汇衔接 (LC) 和连贯 (COH)。对四种不同语言对和三个翻译领域的实验表明,我们的训练方法比其他竞争方法能够实现更具凝聚力和连贯性的文档翻译,同时又不影响对参考翻译的忠实度。在 Zh-En 语言对上,我们的方法在 LC 上比亚军提高了 2.46 个百分点(pp),在 COH 上提高了 1.17 个百分点,同时在 BLEU 得分上提高了 0.63 个百分点,在 F_BERT 上提高了 0.47 个百分点。 Inigo Jauregi Unanue Nazanin Esmaili Gholamreza Haffari Massimo Piccardi
5 COLING2020 Optimized Transformer for Low-resource Neural Machine Translation https://arxiv.org/pdf/2011.02266 Language pairs with limited amounts of parallel data, also known as low-resource languages, remain a challenge for neural machine translation. While the Transformer model has achieved significant improvements for many language pairs and has become the de facto mainstream architecture, its capability under low-resource conditions has not been fully investigated yet. Our experiments on different subsets of the IWSLT14 training data show that the effectiveness of Transformer under low-resource conditions is highly dependent on the hyper-parameter settings. Our experiments show that using an optimized Transformer for low-resource conditions improves the translation quality up to 7.3 BLEU points compared to using the Transformer default settings. 并行数据量有限的语言对,也称为低资源语言,仍然是神经机器翻译的挑战。虽然 Transformer 模型在许多语言对上都取得了显着的改进,已经成为事实上的主流架构,但其在低资源条件下的能力尚未得到充分研究。我们对 IWSLT14 训练数据的不同子集进行的实验表明,Transformer 在低资源条件下的有效性高度依赖于超参数设置。我们的实验表明,与使用 Transformer 默认设置相比,在低资源条件下使用优化的 Transformer 可将翻译质量提高多达 7.3 BLEU 点。 Ali Araabi Christof Monz
6 COLING2020 Robust Unsupervised Neural Machine Translation with Adversarial Denoising Training https://arxiv.org/pdf/2002.12549 Unsupervised neural machine translation (UNMT) has recently attracted great interest in the machine translation community. The main advantage of the UNMT lies in its easy collection of required large training text sentences while with only a slightly worse performance than supervised neural machine translation which requires expensive annotated translation pairs on some translation tasks. In most studies, the UMNT is trained with clean data without considering its robustness to the noisy data. However, in real-world scenarios, there usually exists noise in the collected input sentences which degrades the performance of the translation system since the UNMT is sensitive to the small perturbations of the input sentences. In this paper, we first time explicitly take the noisy data into consideration to improve the robustness of the UNMT based systems. First of all, we clearly defined two types of noises in training sentences, i.e., word noise and word order noise, and empirically investigate its effect in the UNMT, then we propose adversarial training methods with denoising process in the UNMT. Experimental results on several language pairs show that our proposed methods substantially improved the robustness of the conventional UNMT systems in noisy scenarios. 无监督神经机器翻译(UNMT)最近引起了机器翻译社区的极大兴趣。 UNMT 的主要优势在于它可以轻松收集所需的大型训练文本句子,而其性能仅比监督神经机器翻译稍差,后者在某些翻译任务上需要昂贵的注释翻译对。在大多数研究中,UMNT 是用干净的数据训练的,而没有考虑它对噪声数据的鲁棒性。然而,在实际场景中,由于 UNMT 对输入句子的小扰动很敏感,因此收集的输入句子中通常存在噪声,这会降低翻译系统的性能。在本文中,我们第一次明确地将噪声数据考虑在内,以提高基于 UNMT 的系统的鲁棒性。首先,我们明确定义了训练句子中的两类噪声,即词噪声和词序噪声,并实证研究了其在 UNMT 中的影响,然后我们在 UNMT 中提出了具有去噪过程的对抗性训练方法。几个语言对的实验结果表明,我们提出的方法大大提高了传统 UNMT 系统在嘈杂场景中的鲁棒性。 Haipeng Sun Rui Wang Kehai Chen Xugang Lu Masao Utiyama Eiichiro Sumita Tiejun Zhao
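摘要中明确区分了两类训练噪声:词噪声与词序噪声。下面用纯 Python 给出这两类噪声注入的常见写法(示意实现,概率阈值与偏移幅度均为假设值,并非论文给定的超参数):

```python
import random

def add_word_noise(tokens, drop_prob=0.1, blank_prob=0.1, unk="<unk>"):
    """词噪声:随机删除词或将词替换为 <unk>(示意)。"""
    noised = []
    for tok in tokens:
        r = random.random()
        if r < drop_prob:
            continue                # 随机删除
        elif r < drop_prob + blank_prob:
            noised.append(unk)      # 随机替换为占位符
        else:
            noised.append(tok)
    return noised

def add_word_order_noise(tokens, max_shift=3):
    """词序噪声:给每个位置加一个有界随机偏移后重排,得到局部乱序。"""
    keys = [i + random.uniform(0, max_shift) for i in range(len(tokens))]
    return [tok for _, tok in sorted(zip(keys, tokens))]

sent = "this is a simple example sentence".split()
print(add_word_noise(sent))
print(add_word_order_noise(sent))
```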
7 COLING2020 Token Drop mechanism for Neural Machine Translation https://github.com/zhajiahe/Token_Drop https://arxiv.org/pdf/2010.11018 Neural machine translation with millions of parameters is vulnerable to unfamiliar inputs. We propose Token Drop to improve generalization and avoid overfitting for the NMT model. Similar to word dropout, whereas we replace dropped token with a special token instead of setting zero to words. We further introduce two self-supervised objectives: Replaced Token Detection and Dropped Token Prediction. Our method aims to force model generating target translation with less information, in this way the model can learn textual representation better. Experiments on Chinese-English and English-Romanian benchmark demonstrate the effectiveness of our approach and our model achieves significant improvements over a strong Transformer baseline. 具有数百万个参数的神经机器翻译容易受到陌生输入的影响。我们提出 Token Drop 来提高泛化能力并避免 NMT 模型的过度拟合。与 word dropout 类似,但我们用一个特殊标记替换被丢弃的词,而不是将其置零。我们进一步引入了两个自监督目标:替换词检测(Replaced Token Detection)和丢弃词预测(Dropped Token Prediction)。我们的方法旨在迫使模型在信息更少的情况下生成目标翻译,这样模型可以更好地学习文本表示。中文-英文和英文-罗马尼亚语基准上的实验证明了我们方法的有效性,我们的模型在强大的 Transformer 基线上取得了显着的改进。 Huaao Zhang Shigui Qiu Xiangyu Duan Min Zhang
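Token Drop 的做法是用一个特殊标记替换被丢弃的词(而不是置零),并在此之上构造两个自监督目标的标签。下面是按摘要理解写的草图(丢弃比例与标记名为假设):

```python
import random

def token_drop(tokens, drop_prob=0.15, drop_token="<drop>"):
    """返回:被替换后的序列、每个位置是否被替换的标签(用于 Replaced Token Detection)、
    以及被替换位置的原词(用于 Dropped Token Prediction)。"""
    corrupted, detection_labels, dropped_targets = [], [], []
    for tok in tokens:
        if random.random() < drop_prob:
            corrupted.append(drop_token)
            detection_labels.append(1)
            dropped_targets.append(tok)
        else:
            corrupted.append(tok)
            detection_labels.append(0)
            dropped_targets.append(None)
    return corrupted, detection_labels, dropped_targets

src = "neural machine translation needs regularization".split()
print(token_drop(src))
```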
8 COLING2020 Understanding Pure Character-Based Neural Machine Translation: The Case of Translating Finnish into English https://arxiv.org/pdf/2011.03469 Recent work has shown that deeper character-based neural machine translation (NMT) models can outperform subword-based models. However, it is still unclear what makes deeper character-based models successful. In this paper, we conduct an investigation into pure character-based models in the case of translating Finnish into English, including exploring the ability to learn word senses and morphological inflections and the attention mechanism. We demonstrate that word-level information is distributed over the entire character sequence rather than over a single character, and characters at different positions play different roles in learning linguistic knowledge. In addition, character-based models need more layers to encode word senses which explains why only deeper models outperform subword-based models. The attention distribution pattern shows that separators attract a lot of attention and we explore a sparse word-level attention to enforce character hidden states to capture the full word-level information. Experimental results show that the word-level attention with a single head results in 1.2 BLEU points drop. 最近的工作表明,更深层次的基于字符的神经机器翻译 (NMT) 模型可以胜过基于子词的模型。然而,目前尚不清楚是什么让更深层次的基于字符的模型成功。在本文中,我们在将芬兰语翻译成英语的情况下对纯基于字符的模型进行了调查,包括探索学习词义和形态变化的能力以及注意机制。我们证明了词级信息分布在整个字符序列而不是单个字符上,并且不同位置的字符在学习语言知识中扮演着不同的角色。此外,基于字符的模型需要更多层来编码词义,这解释了为什么只有更深的模型才能胜过基于子词的模型。注意力分布模式表明分隔符吸引了很多注意力,我们探索了一种稀疏的词级注意力来强制字符隐藏状态来捕获完整的词级信息。实验结果表明,单个头部的词级注意力导致 1.2 BLEU 点下降。 Gongbo Tang Rico Sennrich Joakim Nivre

会话/对话系统

ACL

序号 会议/期刊 论文 主要技术 代码 论文下载地址 摘要 摘要翻译 作者
1 ACL2021 TicketTalk: Toward human-level performance with end-to-end, transaction-based dialog systems https://arxiv.org/pdf/2012.12458 We present a data-driven, end-to-end approach to transaction-based dialog systems that performs at near-human levels in terms of verbal response quality and factual grounding accuracy. We show that two essential components of the system produce these results: a sufficiently large and diverse, in-domain labeled dataset, and a neural network-based, pre-trained model that generates both verbal responses and API call predictions. In terms of data, we introduce TicketTalk, a movie ticketing dialog dataset with 23,789 annotated conversations. The movie ticketing conversations range from completely open-ended and unrestricted to more structured, both in terms of their knowledge base, discourse features, and number of turns. In qualitative human evaluations, model-generated responses trained on just 10,000 TicketTalk dialogs were rated to “make sense” 86.5 percent of the time, almost the same as human responses in the same contexts. Our simple, API-focused annotation schema results in a much easier labeling task making it faster and more cost effective. It is also the key component for being able to predict API calls accurately. We handle factual grounding by incorporating API calls in the training data, allowing our model to learn which actions to take and when. Trained on the same 10,000-dialog set, the model’s API call predictions were rated to be correct 93.9 percent of the time in our evaluations, surpassing the ratings for the corresponding human labels. We show how API prediction and response generation scores improve as the dataset size incrementally increases from 5000 to 21,000 dialogs. Our analysis also clearly illustrates the benefits of pre-training. We are publicly releasing the TicketTalk dataset with this paper to facilitate future work on transaction-based dialogs. 我们为基于事务的对话系统提供了一种数据驱动的端到端方法,该方法在口头响应质量和事实基础准确性方面的表现接近人类水平。我们展示了系统的两个基本组成部分会产生这些结果:一个足够大且多样化的域内标记数据集,以及一个基于神经网络的预训练模型,该模型生成口头响应和 API 调用预测。在数据方面,我们引入了 TicketTalk,这是一个电影票务对话数据集,包含 23,789 个带注释的对话。电影票务对话的范围从完全开放和不受限制到更加结构化,无论是在知识基础、话语特征还是回合数方面。在定性的人类评估中,仅在 10,000 个 TicketTalk 对话上训练的模型生成的响应被评为“有意义”的时间为 86.5%,几乎与相同上下文中的人类响应相同。我们以 API 为中心的简单注释模式使标记任务变得更加简单,从而使其更快、更具成本效益。它也是能够准确预测 API 调用的关键组件。我们通过在训练数据中加入 API 调用来处理事实基础,让我们的模型了解要采取哪些行动以及何时采取行动。在相同的 10,000 个对话集上进行训练,模型的 API 调用预测在我们的评估中被评为 93.9% 的正确率,超过了相应人工标签的评分。我们展示了 API 预测和响应生成分数如何随着数据集大小从 5000 个对话逐渐增加到 21,000 个而提高。我们的分析还清楚地说明了预训练的好处。我们将随本文公开发布 TicketTalk 数据集,以促进基于事务的对话的未来工作。 Bill Byrne Karthik Krishnamoorthi Saravanan Ganesh Mihir Sanjay Kale
2 ACL2021 HERALD: An Annotation Efficient Method to Detect User Disengagement in Social Conversations https://github.com/Weixin-Liang/HERALD https://arxiv.org/pdf/2106.00162 Open-domain dialog systems have a user-centric goal: to provide humans with an engaging conversation experience. User engagement is one of the most important metrics for evaluating open-domain dialog systems, and could also be used as real-time feedback to benefit dialog policy learning. Existing work on detecting user disengagement typically requires hand-labeling many dialog samples. We propose HERALD, an efficient annotation framework that reframes the training data annotation process as a denoising problem. Specifically, instead of manually labeling training samples, we first use a set of labeling heuristics to label training samples automatically. We then denoise the weakly labeled data using the Shapley algorithm. Finally, we use the denoised data to train a user engagement detector. Our experiments show that HERALD improves annotation efficiency significantly and achieves 86% user disengagement detection accuracy in two dialog corpora. 开放域对话系统有一个以用户为中心的目标:为人类提供引人入胜的对话体验。用户参与度是评估开放域对话系统的最重要指标之一,也可以用作实时反馈以促进对话策略学习。检测用户脱离的现有工作通常需要手动标记许多对话样本。我们提出了 HERALD,这是一种高效的注释框架,可将训练数据注释过程重新构建为去噪问题。具体来说,我们首先使用一组标记启发式方法来自动标记训练样本,而不是手动标记训练样本。然后我们使用 Shapley 算法对弱标记数据进行去噪。最后,我们使用去噪数据来训练用户参与检测器。我们的实验表明,HERALD 显着提高了注释效率,并在两个对话语料库中实现了 86% 的用户脱离检测准确率。 Weixin Liang Kai-Hui Liang Zhou Yu
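HERALD 的第一步是用一组启发式规则对“用户是否已失去兴趣”做自动弱标注,再用 Shapley 算法对弱标签去噪。下面给出弱标注这一步的极简示意(规则与阈值均为示例假设,并非论文原文的启发式集合):

```python
def heuristic_disengaged(dialog_turn):
    """对单条用户回复给出弱标签:1 表示疑似失去兴趣,0 表示仍在参与(示意)。"""
    text = dialog_turn.lower().strip()
    rules = [
        len(text.split()) <= 2,                      # 极短、敷衍的回复
        text in {"ok", "fine", "whatever", "bye"},   # 常见的收尾/应付用语
        "stop" in text or "not interested" in text,  # 明确表示不想继续
    ]
    return int(any(rules))

turns = ["ok", "that sounds really interesting, tell me more!", "please stop"]
print([heuristic_disengaged(t) for t in turns])  # [1, 0, 1]
```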
3 ACL2021 Maria: A Visual Experience Powered Conversational Agent https://github.com/jokieleung/Maria https://arxiv.org/pdf/2105.13073 Arguably, the visual perception of conversational agents to the physical world is a key way for them to exhibit the human-like intelligence. Image-grounded conversation is thus proposed to address this challenge. Existing works focus on exploring the multimodal dialog models that ground the conversation on a given image. In this paper, we take a step further to study image-grounded conversation under a fully open-ended setting where no paired dialog and image are assumed available. Specifically, we present Maria, a neural conversation agent powered by the visual world experiences which are retrieved from a large-scale image index. Maria consists of three flexible components, i.e., text-to-image retriever, visual concept detector and visual-knowledge-grounded response generator. The retriever aims to retrieve a correlated image to the dialog from an image index, while the visual concept detector extracts rich visual knowledge from the image. Then, the response generator is grounded on the extracted visual knowledge and dialog context to generate the target response. Extensive experiments demonstrate Maria outperforms previous state-of-the-art methods on automatic metrics and human evaluation, and can generate informative responses that have some visual commonsense of the physical world. 可以说,会话代理对物理世界的视觉感知是他们展示类人智能的关键方式。因此,提出了基于图像的对话来应对这一挑战。现有的工作侧重于探索基于给定图像进行对话的多模态对话模型。在本文中,我们进一步研究在完全开放的设置下基于图像的对话,假设没有配对的对话和图像可用。具体来说,我们展示了 Maria,一种神经对话代理,由从大规模图像索引中检索的视觉世界体验提供支持。 Maria 由三个灵活的组件组成,即文本到图像检索器、视觉概念检测器和基于视觉知识的响应生成器。检索器旨在从图像索引中检索与对话相关的图像,而视觉概念检测器从图像中提取丰富的视觉知识。然后,响应生成器基于提取的视觉知识和对话上下文来生成目标响应。大量实验表明,Maria 在自动度量和人工评估方面优于以前最先进的方法,并且可以生成具有物理世界一些视觉常识的信息响应。 Zujie Liang Huang Hu Can Xu Chongyang Tao Xiubo Geng Yining Chen Fan Liang Daxin Jiang
4 ACL2021 Dialogue Response Selection with Hierarchical Curriculum Learning https://arxiv.org/pdf/2012.14756 We study the learning of a matching model for dialogue response selection. Motivated by the recent finding that models trained with random negative samples are not ideal in real-world scenarios, we propose a hierarchical curriculum learning framework that trains the matching model in an “easy-to-difficult” scheme. Our learning framework consists of two complementary curricula: (1) corpus-level curriculum (CC); and (2) instance-level curriculum (IC). In CC, the model gradually increases its ability in finding the matching clues between the dialogue context and a response candidate. As for IC, it progressively strengthens the model’s ability in identifying the mismatching information between the dialogue context and a response candidate. Empirical studies on three benchmark datasets with three state-of-the-art matching models demonstrate that the proposed learning framework significantly improves the model performance across various evaluation metrics. Yixuan Su Deng Cai Qingyu Zhou Zibo Lin Simon Baker Yunbo Cao Shuming Shi Nigel Collier Yan Wang
5 ACL2021 Diversifying Dialog Generation via Adaptive Label Smoothing https://github.com/lemon234071/AdaLabel https://arxiv.org/pdf/2105.14556 Neural dialogue generation models trained with the one-hot target distribution suffer from the over-confidence issue, which leads to poor generation diversity as widely reported in the literature. Although existing approaches such as label smoothing can alleviate this issue, they fail to adapt to diverse dialog contexts. In this paper, we propose an Adaptive Label Smoothing (AdaLabel) approach that can adaptively estimate a target label distribution at each time step for different contexts. The maximum probability in the predicted distribution is used to modify the soft target distribution produced by a novel light-weight bi-directional decoder module. The resulting target distribution is aware of both previous and future contexts and is adjusted to avoid over-training the dialogue model. Our model can be trained in an end-to-end manner. Extensive experiments on two benchmark datasets show that our approach outperforms various competitive baselines in producing diverse responses. 使用 one-hot 目标分布训练的神经对话生成模型存在过度自信的问题,这导致了文献中广泛报道的生成多样性较差。尽管标签平滑等现有方法可以缓解这个问题,但它们无法适应不同的对话上下文。在本文中,我们提出了一种自适应标签平滑 (AdaLabel) 方法,该方法可以针对不同的上下文在每个时间步自适应地估计目标标签分布。预测分布中的最大概率用于修改由新型轻量级双向解码器模块产生的软目标分布。由此产生的目标分布了解之前和未来的上下文,并进行调整以避免过度训练对话模型。我们的模型可以以端到端的方式进行训练。对两个基准数据集的大量实验表明,我们的方法在产生不同响应方面优于各种竞争基线。 Yida Wang Yinhe Zheng Yong Jiang Minlie Huang
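AdaLabel 的关键是用当前预测分布的最大概率来自适应地决定标签平滑的强度,并用一个轻量双向解码器的分布来分配平滑质量。下面是一个简化示意(非官方实现;eps 的构造方式与下限 eps_floor 均为假设):

```python
import torch
import torch.nn.functional as F

def adaptive_label_smoothing_loss(logits, aux_logits, targets, eps_floor=0.05):
    """自适应标签平滑的简化草图:目标词得到 1-eps 的概率,
    其余 eps 按辅助解码器的分布分配到非目标词上。"""
    with torch.no_grad():
        probs = F.softmax(logits, dim=-1)                  # [B, V]
        p_max = probs.max(dim=-1).values                   # 模型越自信,eps 越小、平滑越少
        eps = torch.clamp(1.0 - p_max, min=eps_floor).unsqueeze(-1)
        soft = F.softmax(aux_logits, dim=-1)               # 辅助(双向)解码器的分布
        soft = soft.scatter(1, targets.unsqueeze(1), 0.0)  # 目标词位置不参与平滑
        soft = soft / soft.sum(dim=-1, keepdim=True)
        target_dist = eps * soft
        target_dist.scatter_(1, targets.unsqueeze(1), 1.0 - eps)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target_dist * log_probs).sum(dim=-1).mean()

logits, aux_logits = torch.randn(4, 1000), torch.randn(4, 1000)
targets = torch.randint(0, 1000, (4,))
print(adaptive_label_smoothing_loss(logits, aux_logits, targets))
```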
6 ACL2021 BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data https://github.com/songhaoyu/BoB https://arxiv.org/pdf/2106.06169 Maintaining consistent personas is essential for dialogue agents. Although tremendous advancements have been brought, the limited-scale of annotated persona-dense data are still barriers towards training robust and consistent persona-based dialogue models. In this work, we show how the challenges can be addressed by disentangling persona-based dialogue generation into two sub-tasks with a novel BERT-over-BERT (BoB) model. Specifically, the model consists of a BERT-based encoder and two BERT-based decoders, where one decoder is for response generation, and another is for consistency understanding. In particular, to learn the ability of consistency understanding from large-scale non-dialogue inference data, we train the second decoder in an unlikelihood manner. Under different limited data settings, both automatic and human evaluations demonstrate that the proposed model outperforms strong baselines in response quality and persona consistency. 保持一致的角色对于对话代理至关重要。尽管已经带来了巨大的进步,但带注释的角色密集数据的规模有限仍然是训练强大且一致的基于角色的对话模型的障碍。在这项工作中,我们展示了如何通过使用新颖的 BERT-over-BERT (BoB) 模型将基于角色的对话生成分解为两个子任务来解决挑战。具体来说,该模型由一个基于 BERT 的编码器和两个基于 BERT 的解码器组成,其中一个解码器用于响应生成,另一个用于一致性理解。特别是,为了从大规模非对话推理数据中学习一致性理解能力,我们以非似然(unlikelihood)方式训练第二个解码器。在不同的有限数据设置下,自动和人工评估都表明,所提出的模型在响应质量和角色一致性方面优于强大的基线。 Haoyu Song Yan Wang Kaiyan Zhang Wei-Nan Zhang Ting Liu
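摘要中提到第二个解码器用非似然(unlikelihood)目标训练:对标注为“矛盾/不一致”的词,显式压低模型生成它们的概率。下面是非似然损失的一般性草图(非 BoB 官方代码):

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, negative_targets):
    """对负样本词最小化 -log(1 - p(token)),即降低其生成概率(示意)。"""
    probs = F.softmax(logits, dim=-1)
    p_neg = probs.gather(1, negative_targets.unsqueeze(1)).squeeze(1)
    return -torch.log((1.0 - p_neg).clamp_min(1e-8)).mean()

logits = torch.randn(4, 32000)
neg_tokens = torch.randint(0, 32000, (4,))
print(unlikelihood_loss(logits, neg_tokens))
```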
7 ACL2021 I like fish, especially dolphins: Addressing Contradictions in Dialogue Modeling https://arxiv.org/pdf/2012.13391 To quantify how well natural language understanding models can capture consistency in a general conversation, we introduce the DialoguE COntradiction DEtection task (DECODE) and a new conversational dataset containing both human-human and human-bot contradictory dialogues. We then compare a structured utterance-based approach of using pre-trained Transformer models for contradiction detection with the typical unstructured approach. Results reveal that: (i) our newly collected dataset is notably more effective at providing supervision for the dialogue contradiction detection task than existing NLI data including those aimed to cover the dialogue domain; (ii) the structured utterance-based approach is more robust and transferable on both analysis and out-of-distribution dialogues than its unstructured counterpart. We also show that our best contradiction detection model correlates well with human judgments and further provide evidence for its usage in both automatically evaluating and improving the consistency of state-of-the-art generative chatbots. 为了量化自然语言理解模型在一般对话中捕捉一致性的能力,我们引入了对话矛盾检测任务 (DECODE) 和一个包含人与人和人与机器人矛盾对话的新对话数据集。然后,我们将使用预训练的 Transformer 模型进行矛盾检测的基于结构化话语的方法与典型的非结构化方法进行比较。结果表明:(i)我们新收集的数据集在为对话矛盾检测任务提供监督方面比现有的 NLI 数据(包括旨在覆盖对话域的数据)更有效; (ii) 基于结构化话语的方法在分析和分布外对话上都比其非结构化方法更健壮和可转移。我们还表明,我们最好的矛盾检测模型与人类判断密切相关,并进一步为其在自动评估和提高最先进生成聊天机器人的一致性方面的使用提供了证据。 Yixin Nie Mary Williamson Mohit Bansal Douwe Kiela Jason Weston
8 ACL2021 A Sequence-to-Sequence Approach to Dialogue State Tracking https://github.com/sweetalyssum/Seq2Seq-DU https://arxiv.org/pdf/2011.09553 This paper is concerned with dialogue state tracking (DST) in a task-oriented dialogue system. Building a DST module that is highly effective is still a challenging issue, although significant progresses have been made recently. This paper proposes a new approach to dialogue state tracking, referred to as Seq2Seq-DU, which formalizes DST as a sequence-to-sequence problem. Seq2Seq-DU employs two BERT-based encoders to respectively encode the utterances in the dialogue and the descriptions of schemas, an attender to calculate attentions between the utterance embeddings and the schema embeddings, and a decoder to generate pointers to represent the current state of dialogue. Seq2Seq-DU has the following advantages. It can jointly model intents, slots, and slot values; it can leverage the rich representations of utterances and schemas based on BERT; it can effectively deal with categorical and non-categorical slots, and unseen schemas. In addition, Seq2Seq-DU can also be used in the NLU (natural language understanding) module of a dialogue system. Experimental results on benchmark datasets in different settings (SGD, MultiWOZ2.2, MultiWOZ2.1, WOZ2.0, DSTC2, M2M, SNIPS, and ATIS) show that Seq2Seq-DU outperforms the existing methods. 本文关注的是面向任务的对话系统中的对话状态跟踪(DST)。尽管最近取得了重大进展,但构建高效的 DST 模块仍然是一个具有挑战性的问题。本文提出了一种新的对话状态跟踪方法,称为 Seq2Seq-DU,它将 DST 形式化为序列到序列问题。 Seq2Seq-DU 使用两个基于 BERT 的编码器分别对对话中的话语和模式描述进行编码,一个参与器计算话语嵌入和模式嵌入之间的注意力,以及一个解码器来生成表示当前对话状态的指针. Seq2Seq-DU 具有以下优点。它可以联合建模意图、槽位和槽位值;它可以利用基于 BERT 的丰富的话语和模式表示;它可以有效地处理分类和非分类槽以及看不见的模式。此外,Seq2Seq-DU 还可以用于对话系统的 NLU(自然语言理解)模块。在不同设置(SGD、MultiWOZ2.2、MultiWOZ2.1、WOZ2.0、DSTC2、M2M、SNIPS 和 ATIS)的基准数据集上的实验结果表明,Seq2Seq-DU 优于现有方法。 Yue Feng Yang Wang Hang Li
9 ACL2021 Generating Relevant and Coherent Dialogue Responses using Self-Separated Conditional Variational AutoEncoders https://arxiv.org/pdf/2106.03410 Conditional Variational AutoEncoder (CVAE) effectively increases the diversity and informativeness of responses in open-ended dialogue generation tasks through enriching the context vector with sampled latent variables. However, due to the inherent one-to-many and many-to-one phenomena in human dialogues, the sampled latent variables may not correctly reflect the contexts’ semantics, leading to irrelevant and incoherent generated responses. To resolve this problem, we propose Self-separated Conditional Variational AutoEncoder (abbreviated as SepaCVAE) that introduces group information to regularize the latent variables, which enhances CVAE by improving the responses’ relevance and coherence while maintaining their diversity and informativeness. SepaCVAE actively divides the input data into groups, and then widens the absolute difference between data pairs from distinct groups, while narrowing the relative distance between data pairs in the same group. Empirical results from automatic evaluation and detailed analysis demonstrate that SepaCVAE can significantly boost responses in well-established open-domain dialogue datasets. 条件变分自动编码器 (CVAE) 通过用采样的潜在变量丰富上下文向量,有效地增加了开放式对话生成任务中响应的多样性和信息量。然而,由于人类对话中固有的一对多和多对一现象,采样的潜在变量可能无法正确反映上下文的语义,导致生成的响应不相关和不连贯。为了解决这个问题,我们提出了自分离条件变分自动编码器(SepaCVAE),它引入了组信息来规范潜在变量,通过提高响应的相关性和连贯性来增强 CVAE,同时保持它们的多样性和信息量。 SepaCVAE 主动将输入数据分组,然后扩大来自不同组的数据对之间的绝对差异,同时缩小同一组中数据对之间的相对距离。自动评估和详细分析的实证结果表明,SepaCVAE 可以显着提高在完善的开放域对话数据集中的响应。 Bin Sun Shaoxiong Feng Yiwei Li Jiamou Liu Kan Li
10 ACL2021 Intent Classification and Slot Filling for Privacy Policies https://github.com/wasiahmad/PolicyIE https://arxiv.org/pdf/2101.00123 Understanding privacy policies is crucial for users as it empowers them to learn about the information that matters to them. Sentences written in a privacy policy document explain privacy practices, and the constituent text spans convey further specific information about that practice. We refer to predicting the privacy practice explained in a sentence as intent classification and identifying the text spans sharing specific information as slot filling. In this work, we propose PolicyIE, an English corpus consisting of 5,250 intent and 11,788 slot annotations spanning 31 privacy policies of websites and mobile applications. PolicyIE corpus is a challenging real-world benchmark with limited labeled examples reflecting the cost of collecting large-scale annotations from domain experts. We present two alternative neural approaches as baselines, (1) intent classification and slot filling as a joint sequence tagging and (2) modeling them as a sequence-to-sequence (Seq2Seq) learning task. The experiment results show that both approaches perform comparably in intent classification, while the Seq2Seq method outperforms the sequence tagging approach in slot filling by a large margin. We perform a detailed error analysis to reveal the challenges of the proposed corpus. 了解隐私政策对用户至关重要,因为它使他们能够了解对他们而言重要的信息。写在隐私政策文件中的句子解释了隐私实践,构成文本的跨度传达了有关该实践的进一步具体信息。我们将预测在句子中解释的隐私实践称为意图分类,并将共享特定信息的文本跨度称为插槽填充。在这项工作中,我们提出了 PolicyIE,这是一个英语语料库,由 5,250 个意图和 11,788 个槽注释组成,涵盖 31 个网站和移动应用程序的隐私政策。 PolicyIE 语料库是一个具有挑战性的现实世界基准,其有限的标记示例反映了从领域专家那里收集大规模注释的成本。我们提出了两种替代神经方法作为基线,(1) 意图分类和槽填充作为联合序列标记,(2) 将它们建模为序列到序列 (Seq2Seq) 学习任务。实验结果表明,两种方法在意图分类方面的表现相当,而 Seq2Seq 方法在槽填充方面优于序列标记方法。我们进行了详细的错误分析,以揭示所提出的语料库的挑战。 Wasi Uddin Ahmad Jianfeng Chi Tu Le Thomas Norton Yuan Tian Kai-Wei Chang
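论文的基线之一是把意图分类与槽填充做成联合序列标注。下面给出这一类联合模型的通用骨架(示意,非 PolicyIE 官方模型;这里用 BiLSTM 代替预训练编码器以保持自包含):

```python
import torch
import torch.nn as nn

class JointIntentSlotTagger(nn.Module):
    """句级表示做意图分类,逐词表示做 BIO 槽标注(示意骨架)。"""
    def __init__(self, vocab_size, hidden, num_intents, num_slot_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.intent_head = nn.Linear(2 * hidden, num_intents)
        self.slot_head = nn.Linear(2 * hidden, num_slot_tags)

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))  # [B, T, 2H]
        intent_logits = self.intent_head(states[:, 0])   # 取首位置作为句级表示
        slot_logits = self.slot_head(states)             # 每个词一个槽标签
        return intent_logits, slot_logits

model = JointIntentSlotTagger(vocab_size=1000, hidden=64, num_intents=5, num_slot_tags=9)
intent_logits, slot_logits = model(torch.randint(0, 1000, (2, 12)))
print(intent_logits.shape, slot_logits.shape)  # [2, 5] / [2, 12, 9]
```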
11 ACL2021 Dual Slot Selector via Local Reliability Verification for Dialogue State Tracking https://github.com/guojinyu88/DSSDST https://arxiv.org/pdf/2107.12578 The goal of dialogue state tracking (DST) is to predict the current dialogue state given all previous dialogue contexts. Existing approaches generally predict the dialogue state at every turn from scratch. However, the overwhelming majority of the slots in each turn should simply inherit the slot values from the previous turn. Therefore, the mechanism of treating slots equally in each turn not only is inefficient but also may lead to additional errors because of the redundant slot value generation. To address this problem, we devise the two-stage DSS-DST which consists of the Dual Slot Selector based on the current turn dialogue, and the Slot Value Generator based on the dialogue history. The Dual Slot Selector determines each slot whether to update slot value or to inherit the slot value from the previous turn from two aspects: (1) if there is a strong relationship between it and the current turn dialogue utterances; (2) if a slot value with high reliability can be obtained for it through the current turn dialogue. The slots selected to be updated are permitted to enter the Slot Value Generator to update values by a hybrid method, while the other slots directly inherit the values from the previous turn. Empirical results show that our method achieves 56.93%, 60.73%, and 58.04% joint accuracy on MultiWOZ 2.0, MultiWOZ 2.1, and MultiWOZ 2.2 datasets respectively and achieves a new state-of-the-art performance with significant improvements. 对话状态跟踪(DST)的目标是在给定所有先前对话上下文的情况下预测当前对话状态。现有方法通常从头开始预测每一轮的对话状态。然而,每一回合中的绝大多数槽位应该简单地继承上一回合的槽位值。因此,在每一轮中平等对待槽的机制不仅效率低下,而且可能由于冗余槽值的生成而导致额外的错误。为了解决这个问题,我们设计了两阶段 DSS-DST,它由基于当前回合对话的双槽选择器和基于对话历史的槽值生成器组成。 Dual Slot Selector从两个方面决定每个槽是更新槽值还是继承上一回合的槽值:(1)是否与当前回合对话话语有很强的关系; (2) 是否可以通过当前回合对话为其获得高可靠性的槽值。选择更新的槽位被允许进入槽位值生成器以混合方式更新值,而其他槽位直接继承上一回合的值。实证结果表明,我们的方法在 MultiWOZ 2.0、MultiWOZ 2.1 和 MultiWOZ 2.2 数据集上分别实现了 56.93%、60.73% 和 58.04% 的联合精度,并实现了新的最先进的性能,并具有显着的改进。 Jinyu Guo Kai Shuang Jijie Li Zihan Wang
12 ACL2021 Learning from Perturbations: Diverse and Informative Dialogue Generation with Inverse Adversarial Training https://arxiv.org/pdf/2105.15171 In this paper, we propose Inverse Adversarial Training (IAT) algorithm for training neural dialogue systems to avoid generic responses and model dialogue history better. In contrast to standard adversarial training algorithms, IAT encourages the model to be sensitive to the perturbation in the dialogue history and therefore learning from perturbations. By giving higher rewards for responses whose output probability reduces more significantly when dialogue history is perturbed, the model is encouraged to generate more diverse and consistent responses. By penalizing the model when generating the same response given perturbed dialogue history, the model is forced to better capture dialogue history and generate more informative responses. Experimental results on two benchmark datasets show that our approach can better model dialogue history and generate more diverse and consistent responses. In addition, we point out a problem of the widely used maximum mutual information (MMI) based methods for improving the diversity of dialogue response generation models and demonstrate it empirically. 在本文中,我们提出了用于训练神经对话系统的反向对抗训练 (IAT) 算法,以避免通用响应并更好地建模对话历史。与标准的对抗训练算法相比,IAT 鼓励模型对对话历史中的扰动敏感,从而从扰动中学习。通过对对话历史受到扰动时输出概率降低更显着的响应给予更高的奖励,鼓励模型生成更多样化和一致的响应。通过在给定扰动的对话历史时生成相同响应时惩罚模型,该模型被迫更好地捕获对话历史并生成更多信息响应。在两个基准数据集上的实验结果表明,我们的方法可以更好地对对话历史进行建模,并生成更多样化和一致的响应。此外,我们指出了广泛使用的基于最大互信息(MMI)的方法用于提高对话响应生成模型的多样性的问题,并进行了实证证明。 Wangchunshu Zhou Qifei Li Chenle Li
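IAT 的奖励设计可以概括为:扰动对话历史后,回复的似然下降得越多,说明该回复越依赖历史,奖励越高。下面是这一奖励的极简示意(缩放与截断方式为假设):

```python
import torch

def inverse_adversarial_reward(logp_original, logp_perturbed):
    """奖励 = 原始历史下的对数似然 - 扰动历史下的对数似然(下限截到 0,示意)。"""
    return (logp_original - logp_perturbed).clamp_min(0.0)

# 两条回复分别在原始/扰动历史条件下的句级对数概率
print(inverse_adversarial_reward(torch.tensor([-10.0, -12.0]),
                                 torch.tensor([-15.0, -12.5])))  # tensor([5.0000, 0.5000])
```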
13 ACL2021 Modeling Bilingual Conversational Characteristics for Neural Chat Translation https://github.com/XL2248/CPCC https://arxiv.org/pdf/2107.11164 Neural chat translation aims to translate bilingual conversational text, which has a broad application in international exchanges and cooperation. Despite the impressive performance of sentence-level and context-aware Neural Machine Translation (NMT), there still remain challenges to translate bilingual conversational text due to its inherent characteristics such as role preference, dialogue coherence, and translation consistency. In this paper, we aim to promote the translation quality of conversational text by modeling the above properties. Specifically, we design three latent variational modules to learn the distributions of bilingual conversational characteristics. Through sampling from these learned distributions, the latent variables, tailored for role preference, dialogue coherence, and translation consistency, are incorporated into the NMT model for better translation. We evaluate our approach on the benchmark dataset BConTrasT (English-German) and a self-collected bilingual dialogue corpus, named BMELD (English-Chinese). Extensive experiments show that our approach notably boosts the performance over strong baselines by a large margin and significantly surpasses some state-of-the-art context-aware NMT models in terms of BLEU and TER. Additionally, we make the BMELD dataset publicly available for the research community. 神经聊天翻译旨在翻译双语会话文本,在国际交流与合作中有着广泛的应用。尽管句子级和上下文感知神经机器翻译 (NMT) 的表现令人印象深刻,但由于其固有的特性,例如角色偏好、对话连贯性和翻译一致性,翻译双语会话文本仍然存在挑战。在本文中,我们旨在通过对上述属性进行建模来提高会话文本的翻译质量。具体来说,我们设计了三个潜在的变分模块来学习双语会话特征的分布。通过从这些学习到的分布中采样,为角色偏好、对话连贯性和翻译一致性量身定制的潜在变量被纳入 NMT 模型中,以实现更好的翻译。我们在基准数据集 BConTrasT(英德)和自收集的双语对话语料库 BMELD(英汉)上评估我们的方法。大量实验表明,我们的方法显着提高了强基线的性能,并且在 BLEU 和 TER 方面显着超越了一些最先进的上下文感知 NMT 模型。此外,我们还向研究社区公开 BMELD 数据集。 Yunlong Liang Fandong Meng Yufeng Chen Jinan Xu Jie Zhou
14 ACL2020 Dynamic Fusion Network for Multi-Domain End-to-end Task-Oriented Dialog https://github.com/LooperXX/DF-Net https://arxiv.org/pdf/2004.11019 Recent studies have shown remarkable success in end-to-end task-oriented dialog system. However, most neural models rely on large training data, which are only available for a certain number of task domains, such as navigation and scheduling. This makes it difficult to scale for a new domain with limited labeled data. However, there has been relatively little research on how to effectively use data from all domains to improve the performance of each domain and also unseen domains. To this end, we investigate methods that can make explicit use of domain knowledge and introduce a shared-private network to learn shared and specific knowledge. In addition, we propose a novel Dynamic Fusion Network (DF-Net) which automatically exploit the relevance between the target domain and each domain. Results show that our model outperforms existing methods on multi-domain dialogue, giving the state-of-the-art in the literature. Besides, with little training data, we show its transferability by outperforming prior best model by 13.9% on average. 最近的研究表明,端到端的面向任务的对话系统取得了显着的成功。然而,大多数神经模型依赖于大量的训练数据,这些数据仅适用于一定数量的任务域,例如导航和调度。这使得难以针对具有有限标记数据的新域进行扩展。然而,关于如何有效地使用来自所有领域的数据来提高每个领域和不可见领域的性能的研究相对较少。为此,我们研究了可以明确使用领域知识的方法,并引入共享-私有网络来学习共享知识和特定知识。此外,我们提出了一种新颖的动态融合网络(DF-Net),它可以自动利用目标域和每个域之间的相关性。结果表明,我们的模型在多领域对话方面优于现有方法,达到了文献中的最新水平。此外,在训练数据很少的情况下,我们的模型比先前的最佳模型平均高出 13.9%,展示了其可迁移性。 Libo Qin Xiao Xu Wanxiang Che Yue Zhang Ting Liu
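DF-Net 的思路可以拆成“共享-私有表示”加“按输入动态加权各领域表示”两步。下面给出动态融合这一步的极简示意(非官方实现,门控结构为假设):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFusion(nn.Module):
    """用门控网络根据共享表示为各领域私有表示分配权重,再与共享表示融合(示意)。"""
    def __init__(self, hidden, num_domains):
        super().__init__()
        self.gate = nn.Linear(hidden, num_domains)

    def forward(self, shared_repr, domain_reprs):
        # shared_repr: [B, H];domain_reprs: [B, D, H]
        weights = F.softmax(self.gate(shared_repr), dim=-1)            # [B, D]
        fused_private = (weights.unsqueeze(-1) * domain_reprs).sum(1)  # [B, H]
        return shared_repr + fused_private

fusion = DynamicFusion(hidden=64, num_domains=3)
print(fusion(torch.randn(2, 64), torch.randn(2, 3, 64)).shape)  # torch.Size([2, 64])
```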
15 ACL2020 Learning Dialog Policies from Weak Demonstrations https://arxiv.org/pdf/2004.11054 Deep reinforcement learning is a promising approach to training a dialog manager, but current methods struggle with the large state and action spaces of multi-domain dialog systems. Building upon Deep Q-learning from Demonstrations (DQfD), an algorithm that scores highly in difficult Atari games, we leverage dialog data to guide the agent to successfully respond to a user’s requests. We make progressively fewer assumptions about the data needed, using labeled, reduced-labeled, and even unlabeled data to train expert demonstrators. We introduce Reinforced Fine-tune Learning, an extension to DQfD, enabling us to overcome the domain gap between the datasets and the environment. Experiments in a challenging multi-domain dialog system framework validate our approaches, and get high success rates even when trained on out-of-domain data. 深度强化学习是训练对话管理器的一种很有前途的方法,但当前的方法难以应对多域对话系统的大型状态和动作空间。基于演示中的深度 Q 学习 (DQfD),一种在困难的 Atari 游戏中得分很高的算法,我们利用对话数据来指导代理成功响应用户的请求。我们逐渐减少对所需数据的假设,使用标记、减少标记甚至未标记的数据来训练专家演示者。我们引入了强化微调学习,这是 DQfD 的扩展,使我们能够克服数据集和环境之间的领域差距。在具有挑战性的多域对话系统框架中进行的实验验证了我们的方法,即使在域外数据上进行训练时也能获得很高的成功率。 Gabriel Gordon-Hall Philip John Gorinski Shay B. Cohen
16 ACL2020 Span-ConveRT: Few-shot Span Extraction for Dialog with Pretrained Conversational Representations https://github.com/PolyAI-LDN/task-specific-datasets https://arxiv.org/pdf/2005.08866 We introduce Span-ConveRT, a light-weight model for dialog slot-filling which frames the task as a turn-based span extraction task. This formulation allows for a simple integration of conversational knowledge coded in large pretrained conversational models such as ConveRT (Henderson et al., 2019). We show that leveraging such knowledge in Span-ConveRT is especially useful for few-shot learning scenarios: we report consistent gains over 1) a span extractor that trains representations from scratch in the target domain, and 2) a BERT-based span extractor. In order to inspire more work on span extraction for the slot-filling task, we also release RESTAURANTS-8K, a new challenging data set of 8,198 utterances, compiled from actual conversations in the restaurant booking domain. 我们引入了 Span-ConveRT,这是一种用于对话槽填充的轻量级模型,它将任务框架为基于回合的跨度提取任务。这种公式允许简单地集成在大型预训练会话模型(如 ConveRT)中编码的会话知识(Henderson 等,2019)。我们表明,在 Span-ConveRT 中利用这些知识对于少样本学习场景特别有用:我们报告了一致的收益:1)一个跨度提取器,在目标域中从头开始训练表示,2)一个基于 BERT 的跨度提取器。为了激发更多关于槽位填充任务的跨度提取工作,我们还发布了 RESTAURANTS-8K,这是一个新的具有挑战性的数据集,包含 8,198 条话语,由餐厅预订领域的实际对话编译而成。 Sam Coope Tyler Farghly Daniela Gerz Ivan Vulić Matthew Henderson
17 ACL2020 Diversifying Dialogue Generation with Non-Conversational Text https://github.com/chin-gyou/Div-Non-Conv https://arxiv.org/pdf/2005.04346 Neural network-based sequence-to-sequence (seq2seq) models strongly suffer from the low-diversity problem when it comes to open-domain dialogue generation. As bland and generic utterances usually dominate the frequency distribution in our daily chitchat, avoiding them to generate more interesting responses requires complex data filtering, sampling techniques or modifying the training objective. In this paper, we propose a new perspective to diversify dialogue generation by leveraging non-conversational text. Compared with bilateral conversations, non-conversational text are easier to obtain, more diverse and cover a much broader range of topics. We collect a large-scale non-conversational corpus from multi sources including forum comments, idioms and book snippets. We further present a training paradigm to effectively incorporate these text via iterative back translation. The resulting model is tested on two conversational datasets and is shown to produce significantly more diverse responses without sacrificing the relevance with context. 当涉及到开放域对话生成时,基于神经网络的序列到序列 (seq2seq) 模型严重受到低多样性问题的影响。由于平淡和通用的话语通常在我们的日常聊天中占据频率分布的主导地位,因此避免它们产生更有趣的响应需要复杂的数据过滤、采样技术或修改训练目标。在本文中,我们提出了一个新的视角,通过利用非对话文本来使对话生成多样化。与双边对话相比,非对话文本更容易获取、更多样化并且涵盖的主题范围更广。我们从包括论坛评论、习语和书籍片段在内的多个来源收集了一个大规模的非会话语料库。我们进一步提出了一种训练范式,以通过迭代回译有效地合并这些文本。生成的模型在两个会话数据集上进行了测试,结果表明可以在不牺牲与上下文的相关性的情况下产生更加多样化的响应。 Hui Su Xiaoyu Shen Sanqiang Zhao Xiao Zhou Pengwei Hu Randy Zhong Cheng Niu Jie Zhou
18 ACL2020 Grounding Conversations with Improvised Dialogues https://github.com/wise-east/spolin https://arxiv.org/pdf/2004.09544 Effective dialogue involves grounding, the process of establishing mutual knowledge that is essential for communication between people. Modern dialogue systems are not explicitly trained to build common ground, and therefore overlook this important aspect of communication. Improvisational theater (improv) intrinsically contains a high proportion of dialogue focused on building common ground, and makes use of the yes-and principle, a strong grounding speech act, to establish coherence and an actionable objective reality. We collect a corpus of more than 26,000 yes-and turns, transcribing them from improv dialogues and extracting them from larger, but more sparsely populated movie script dialogue corpora, via a bootstrapped classifier. We fine-tune chit-chat dialogue systems with our corpus to encourage more grounded, relevant conversation and confirm these findings with human evaluations. 有效的对话涉及基础(grounding),即建立相互了解的过程,这对于人与人之间的交流至关重要。现代对话系统没有经过明确训练来建立共同点,因此忽略了沟通的这一重要方面。即兴戏剧(improv)本质上包含大量侧重于建立共同点的对话,并利用 yes-and 原则(一种强有力的建立共同基础的言语行为)来建立连贯性和可操作的客观现实。我们收集了一个包含超过 26,000 个 yes-and 回合的语料库,一部分从即兴表演对话中转录,一部分通过自举分类器从规模更大但相关样本更稀疏的电影剧本对话语料库中提取。我们使用该语料库微调闲聊对话系统,以鼓励更扎实、相关的对话,并通过人工评估确认这些发现。 Hyundong Cho Jonathan May
19 ACL2020 Designing Precise and Robust Dialogue Response Evaluators https://github.com/ZHAOTING/dialog-processing https://arxiv.org/pdf/2004.04908 Automatic dialogue response evaluator has been proposed as an alternative to automated metrics and human evaluation. However, existing automatic evaluators achieve only moderate correlation with human judgement and they are not robust. In this work, we propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained (masked) language models. Experimental results demonstrate that the proposed evaluator achieves a strong correlation (> 0.6) with human judgement and generalizes robustly to diverse responses and corpora. We open-source the code and data in https://github.com/ZHAOTING/dialog-processing. 自动对话响应评估器已被提议作为自动度量和人工评估的替代方案。然而,现有的自动评估器与人类判断仅实现中等程度的相关性,并且它们并不稳健。在这项工作中,我们建议构建一个无参考评估器,并利用半监督训练和预训练(屏蔽)语言模型的力量。实验结果表明,所提出的评估器与人类判断具有很强的相关性(> 0.6),并且可以稳健地推广到不同的响应和语料库。我们在 https://github.com/ZHAOTING/dialog-processing 中开源了代码和数据。 Tianyu Zhao Divesh Lala Tatsuya Kawahara
20 ACL2020 PLATO: Pre-trained Dialogue Generation Model with Discrete Latent Variable https://github.com/PaddlePaddle/Research https://arxiv.org/pdf/1910.07931 Pre-training models have been proved effective for a wide range of natural language processing tasks. Inspired by this, we propose a novel dialogue generation pre-training framework to support various kinds of conversations, including chit-chat, knowledge grounded dialogues, and conversational question answering. In this framework, we adopt flexible attention mechanisms to fully leverage the bi-directional context and the uni-directional characteristic of language generation. We also introduce discrete latent variables to tackle the inherent one-to-many mapping problem in response generation. Two reciprocal tasks of response generation and latent act recognition are designed and carried out simultaneously within a shared network. Comprehensive experiments on three publicly available datasets verify the effectiveness and superiority of the proposed framework. 预训练模型已被证明对广泛的自然语言处理任务有效。受此启发,我们提出了一种新颖的对话生成预训练框架,以支持各种对话,包括闲聊、基于知识的对话和对话问答。在这个框架中,我们采用灵活的注意力机制来充分利用双向上下文和语言生成的单向特性。我们还引入了离散潜在变量来解决响应生成中固有的一对多映射问题。响应生成和潜在行为识别这两个互惠任务是在共享网络中同时设计和执行的。对三个公开可用数据集的综合实验验证了所提出框架的有效性和优越性。 Siqi Bao Huang He Fan Wang Hua Wu Haifeng Wang
21 ACL2020 Zero-Shot Transfer Learning with Synthesized Data for Multi-Domain Dialogue State Tracking https://github.com/stanford-oval/zero-shot-multiwoz-acl2020 https://arxiv.org/pdf/2005.00891 Zero-shot transfer learning for multi-domain dialogue state tracking can allow us to handle new domains without incurring the high cost of data acquisition. This paper proposes new zero-shot transfer learning technique for dialogue state tracking where the in-domain training data are all synthesized from an abstract dialogue model and the ontology of the domain. We show that data augmentation through synthesized data can improve the accuracy of zero-shot learning for both the TRADE model and the BERT-based SUMBT model on the MultiWOZ 2.1 dataset. We show training with only synthesized in-domain data on the SUMBT model can reach about 2/3 of the accuracy obtained with the full training dataset. We improve the zero-shot learning state of the art on average across domains by 21%. 用于多域对话状态跟踪的零样本迁移学习可以让我们处理新域,而不会产生高昂的数据采集成本。本文提出了新的用于对话状态跟踪的零样本迁移学习技术,其中域内训练数据全部由抽象对话模型和域本体合成。我们表明,通过合成数据进行数据增强可以提高 MultiWOZ 2.1 数据集上的 TRADE 模型和基于 BERT 的 SUMBT 模型的零样本学习的准确性。我们表明,仅在 SUMBT 模型上使用合成的域内数据进行训练可以达到使用完整训练数据集获得的准确度的 2/3 左右。我们将跨领域的零样本学习技术平均提高了 21%。 Giovanni Campagna Agata Foryciarz Mehrad Moradshahi Monica S. Lam
22 ACL2020 Multi-Agent Task-Oriented Dialog Policy Learning with Role-Aware Reward Decomposition https://github.com/truthless11/MADPL https://arxiv.org/pdf/2004.03809 Many studies have applied reinforcement learning to train a dialog policy and show great promise these years. One common approach is to employ a user simulator to obtain a large number of simulated user experiences for reinforcement learning algorithms. However, modeling a realistic user simulator is challenging. A rule-based simulator requires heavy domain expertise for complex tasks, and a data-driven simulator requires considerable data and it is even unclear how to evaluate a simulator. To avoid explicitly building a user simulator beforehand, we propose Multi-Agent Dialog Policy Learning, which regards both the system and the user as the dialog agents. Two agents interact with each other and are jointly learned simultaneously. The method uses the actor-critic framework to facilitate pretraining and improve scalability. We also propose Hybrid Value Network for the role-aware reward decomposition to integrate role-specific domain knowledge of each agent in the task-oriented dialog. Results show that our method can successfully build a system policy and a user policy simultaneously, and two agents can achieve a high task success rate through conversational interaction. 许多研究已经应用强化学习来训练对话策略,并且这些年来显示出巨大的希望。一种常见的方法是使用用户模拟器为强化学习算法获取大量模拟用户体验。然而,对真实的用户模拟器建模是具有挑战性的。基于规则的模拟器需要大量的领域专业知识来处理复杂的任务,而数据驱动的模拟器需要大量数据,甚至不清楚如何评估模拟器。为了避免事先明确构建用户模拟器,我们提出了多代理对话策略学习,它将系统和用户都视为对话代理。两个代理相互交互并同时共同学习。该方法使用 actor-critic 框架来促进预训练并提高可扩展性。我们还提出了用于角色感知奖励分解的混合价值网络,以在面向任务的对话中集成每个代理的特定于角色的领域知识。结果表明,我们的方法可以成功地同时构建系统策略和用户策略,并且两个代理可以通过对话交互实现较高的任务成功率。 Ryuichi Takanobu Runze Liang Minlie Huang
23 ACL2020 Towards Conversational Recommendation over Multi-Type Dialogs https://github.com/PaddlePaddle/models https://arxiv.org/pdf/2005.03954 We propose a new task of conversational recommendation over multi-type dialogs, where the bots can proactively and naturally lead a conversation from a non-recommendation dialog (e.g., QA) to a recommendation dialog, taking into account user’s interests and feedback. To facilitate the study of this task, we create a human-to-human Chinese dialog dataset \emph{DuRecDial} (about 10k dialogs, 156k utterances), which contains multiple sequential dialogs for every pair of a recommendation seeker (user) and a recommender (bot). In each dialog, the recommender proactively leads a multi-type dialog to approach recommendation targets and then makes multiple recommendations with rich interaction behavior. This dataset allows us to systematically investigate different parts of the overall problem, e.g., how to naturally lead a dialog, how to interact with users for recommendation. Finally we establish baseline results on DuRecDial for future studies. Dataset and codes are publicly available at https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research/ACL2020-DuRecDial. 我们提出了多类型对话的对话推荐新任务,其中机器人可以主动且自然地将对话从非推荐对话(例如 QA)引导到推荐对话,同时考虑到用户的兴趣和反馈。为了促进这项任务的研究,我们创建了一个人对人的中文对话数据集 \emph{DuRecDial}(大约 10k 个对话,156k 个话语),其中包含每对推荐搜索者(用户)和一个推荐人(机器人)。在每个对话中,推荐者主动引导多类型对话接近推荐目标,然后做出具有丰富交互行为的多重推荐。该数据集允许我们系统地调查整个问题的不同部分,例如,如何自然地引导对话,如何与用户交互以进行推荐。最后,我们在 DuRecDial 上建立基线结果以供未来研究。数据集和代码可在 https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research/ACL2020-DuRecDial 公开获得。 Zeming Liu Haifeng Wang Zheng-Yu Niu Hua Wu Wanxiang Che Ting Liu
24 ACL2020 KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation https://github.com/thu-coai/KdConv https://arxiv.org/pdf/2004.04100 The research of knowledge-driven conversational systems is largely limited due to the lack of dialog data which consist of multi-turn conversations on multiple topics and with knowledge annotations. In this paper, we propose a Chinese multi-domain knowledge-driven conversation dataset, KdConv, which grounds the topics in multi-turn conversations to knowledge graphs. Our corpus contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0. These conversations contain in-depth discussions on related topics and natural transition between multiple topics. To facilitate the following research on this corpus, we provide several benchmark models. Comparative results show that the models can be enhanced by introducing background knowledge, yet there is still a large space for leveraging knowledge to model multi-turn conversations for further research. Results also show that there are obvious performance differences between different domains, indicating that it is worth to further explore transfer learning and domain adaptation. The corpus and benchmark models are publicly available. 由于缺乏由多个主题的多轮对话和知识注释组成的对话数据,知识驱动的对话系统的研究在很大程度上受到限制。在本文中,我们提出了一个中文多领域知识驱动对话数据集 KdConv,它将多轮对话中的主题基于知识图谱。我们的语料库包含来自三个领域(电影、音乐和旅行)的 4.5K 对话和 86K 话语,平均轮数为 19.0。这些对话包含对相关主题的深入讨论以及多个主题之间的自然过渡。为了便于对该语料库的后续研究,我们提供了几个基准模型。对比结果表明,可以通过引入背景知识来增强模型,但利用知识对多轮对话进行建模以供进一步研究仍有很大的空间。结果还表明,不同领域之间存在明显的性能差异,表明值得进一步探索迁移学习和领域适应。语料库和基准模型是公开的。 Hao Zhou Chujie Zheng Kaili Huang Minlie Huang Xiaoyan Zhu
25 ACL2019 Incremental Transformer with Deliberation Decoder for Document Grounded Conversations https://github.com/lizekang/ITDD https://arxiv.org/pdf/1907.08854 Document Grounded Conversations is a task to generate dialogue responses when chatting about the content of a given document. Obviously, document knowledge plays a critical role in Document Grounded Conversations, while existing dialogue models do not exploit this kind of knowledge effectively enough. In this paper, we propose a novel Transformer-based architecture for multi-turn document grounded conversations. In particular, we devise an Incremental Transformer to encode multi-turn utterances along with knowledge in related documents. Motivated by the human cognitive process, we design a two-pass decoder (Deliberation Decoder) to improve context coherence and knowledge correctness. Our empirical study on a real-world Document Grounded Dataset proves that responses generated by our model significantly outperform competitive baselines on both context coherence and knowledge relevance. Document Grounded Conversations 是一项在讨论给定文档的内容时生成对话响应的任务。显然,文档知识在基于文档的对话中起着至关重要的作用,而现有的对话模型并没有足够有效地利用这种知识。在本文中,我们提出了一种新颖的基于 Transformer 的架构,用于多轮文档基础对话。特别是,我们设计了一个增量转换器来编码多轮话语以及相关文档中的知识。受人类认知过程的启发,我们设计了一个两遍解码器(Deliberation Decoder)来提高上下文的一致性和知识的正确性。我们对真实世界文档接地数据集的实证研究证明,我们的模型生成的响应在上下文一致性和知识相关性方面都显着优于竞争基准。 Zekang Li Cheng Niu Fandong Meng Yang Feng Qian Li Jie Zhou
26 ACL2019 E3: Entailment-driven Extracting and Editing for Conversational Machine Reading https://github.com/vzhong/e3 https://arxiv.org/pdf/1906.05373 Conversational machine reading systems help users answer high-level questions (e.g. determine if they qualify for particular government benefits) when they do not know the exact rules by which the determination is made(e.g. whether they need certain income levels or veteran status). The key challenge is that these rules are only provided in the form of a procedural text (e.g. guidelines from government website) which the system must read to figure out what to ask the user. We present a new conversational machine reading model that jointly extracts a set of decision rules from the procedural text while reasoning about which are entailed by the conversational history and which still need to be edited to create questions for the user. On the recently introduced ShARC conversational machine reading dataset, our Entailment-driven Extract and Edit network (E3) achieves a new state-of-the-art, outperforming existing systems as well as a new BERT-based baseline. In addition, by explicitly highlighting which information still needs to be gathered, E3 provides a more explainable alternative to prior work. We release source code for our models and experiments at https://github.com/vzhong/e3. 当用户不知道做出决定的确切规则(例如,他们是否需要某些收入水平或退伍军人身份)时,对话式机器阅读系统可帮助用户回答高级问题(例如,确定他们是否有资格获得特定的政府福利)。关键的挑战在于,这些规则仅以程序文本的形式提供(例如政府网站上的指南),系统必须阅读这些文本才能弄清楚要问用户什么。我们提出了一种新的对话机器阅读模型,该模型从程序文本中联合提取一组决策规则,同时推理对话历史所包含的哪些规则以及哪些仍需要编辑以为用户创建问题。在最近推出的 ShARC 对话式机器阅读数据集上,我们的 Entailment-driven 提取和编辑网络 (E3) 实现了新的最先进的技术,优于现有系统以及新的基于 BERT 的基线。此外,通过明确突出显示哪些信息仍需要收集,E3 为之前的工作提供了一个更易于解释的替代方案。我们在 https://github.com/vzhong/e3 上发布了我们的模型和实验的源代码。 Victor Zhong Luke Zettlemoyer
27 ACL2019 Improving Multi-turn Dialogue Modelling with Utterance ReWriter https://arxiv.org/pdf/1906.07004 Recent research has made impressive progress in single-turn dialogue modelling. In the multi-turn setting, however, current models are still far from satisfactory. One major challenge is the frequently occurred coreference and information omission in our daily conversation, making it hard for machines to understand the real intention. In this paper, we propose rewriting the human utterance as a pre-process to help multi-turn dialgoue modelling. Each utterance is first rewritten to recover all coreferred and omitted information. The next processing steps are then performed based on the rewritten utterance. To properly train the utterance rewriter, we collect a new dataset with human annotations and introduce a Transformer-based utterance rewriting architecture using the pointer network. We show the proposed architecture achieves remarkably good performance on the utterance rewriting task. The trained utterance rewriter can be easily integrated into online chatbots and brings general improvement over different domains. Hui Su Xiaoyu Shen Rongzhi Zhang Fei Sun Pengwei Hu Cheng Niu Jie Zhou
28 ACL2019 Semantically Conditioned Dialog Response Generation via Hierarchical Disentangled Self-Attention https://github.com/wenhuchen/HDSA-Dialog https://arxiv.org/pdf/1905.12866 Semantically controlled neural response generation on limited-domain has achieved great performance. However, moving towards multi-domain large-scale scenarios are shown to be difficult because the possible combinations of semantic inputs grow exponentially with the number of domains. To alleviate such scalability issue, we exploit the structure of dialog acts to build a multi-layer hierarchical graph, where each act is represented as a root-to-leaf route on the graph. Then, we incorporate such graph structure prior as an inductive bias to build a hierarchical disentangled self-attention network, where we disentangle attention heads to model designated nodes on the dialog act graph. By activating different (disentangled) heads at each layer, combinatorially many dialog act semantics can be modeled to control the neural response generation. On the large-scale Multi-Domain-WOZ dataset, our model can yield a significant improvement over the baselines on various automatic and human evaluation metrics. 在有限域上语义控制的神经响应生成已经取得了很好的性能。然而,由于语义输入的可能组合随着域的数量呈指数增长,因此转向多域大规模场景是很困难的。为了缓解这种可扩展性问题,我们利用对话行为的结构来构建多层分层图,其中每个行为在图上表示为从根到叶的路线。然后,我们将这种先验图结构作为归纳偏置来构建分层解缠结的自注意力网络,在其中我们解开注意力头以模拟对话行为图上的指定节点。通过在每一层激活不同的(解开的)头,可以组合地对许多对话行为语义进行建模以控制神经响应的生成。在大规模多域 WOZ 数据集上,我们的模型可以在各种自动和人工评估指标的基线上产生显着改进。 Wenhu Chen Jianshu Chen Pengda Qin Xifeng Yan William Yang Wang
29 ACL2019 Observing Dialogue in Therapy: Categorizing and Forecasting Behavioral Codes https://github.com/utahnlp/therapist-observer https://arxiv.org/pdf/1907.00326 Automatically analyzing dialogue can help understand and guide behavior in domains such as counseling, where interactions are largely mediated by conversation. In this paper, we study modeling behavioral codes used to asses a psychotherapy treatment style called Motivational Interviewing (MI), which is effective for addressing substance abuse and related problems. Specifically, we address the problem of providing real-time guidance to therapists with a dialogue observer that (1) categorizes therapist and client MI behavioral codes and, (2) forecasts codes for upcoming utterances to help guide the conversation and potentially alert the therapist. For both tasks, we define neural network models that build upon recent successes in dialogue modeling. Our experiments demonstrate that our models can outperform several baselines for both tasks. We also report the results of a careful analysis that reveals the impact of the various network design tradeoffs for modeling therapy dialogue. 自动分析对话可以帮助理解和指导咨询等领域的行为,在这些领域,互动主要通过对话进行。在本文中,我们研究了用于评估称为动机访谈 (MI) 的心理治疗治疗方式的行为代码建模,该方式可有效解决药物滥用和相关问题。具体来说,我们解决了通过对话观察员向治疗师提供实时指导的问题,该对话观察员 (1) 对治疗师和客户 MI 行为代码进行分类,以及 (2) 预测即将出现的话语的代码,以帮助指导对话并可能提醒治疗师。对于这两个任务,我们定义了建立在对话建模最近成功基础上的神经网络模型。我们的实验表明,我们的模型在这两个任务上都可以胜过几个基线。我们还报告了仔细分析的结果,揭示了各种网络设计权衡对建模治疗对话的影响。 Jie Cao Michael Tanana Zac E. Imel Eric Poitras David C. Atkins Vivek Srikumar
30 ACL2019 Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems https://github.com/henryhungle/MTN https://arxiv.org/pdf/1907.01166 Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is conducted based on visual and audio aspects of a given video, is significantly more challenging than traditional image or text-grounded dialogue systems because (1) feature space of videos span across multiple picture frames, making it difficult to obtain semantic information; and (2) a dialogue agent must perceive and process information from different modalities (audio, video, caption, etc.) to obtain a comprehensive understanding. Most existing work is based on RNNs and sequence-to-sequence architectures, which are not very effective for capturing complex long-term dependencies (like in videos). To overcome this, we propose Multimodal Transformer Networks (MTN) to encode videos and incorporate information from different modalities. We also propose query-aware attention through an auto-encoder to extract query-aware features from non-text modalities. We develop a training procedure to simulate token-level decoding to improve the quality of generated responses during inference. We get state of the art performance on Dialogue System Technology Challenge 7 (DSTC7). Our model also generalizes to another multimodal visual-grounded dialogue task, and obtains promising performance. We implemented our models using PyTorch and the code is released at https://github.com/henryhungle/MTN. 开发基于视频的对话系统 (VGDS),其中基于给定视频的视觉和音频方面进行对话,比传统的图像或文本对话系统更具挑战性,因为 (1) 视频的特征空间跨越多个图片框,语义信息获取困难; (2) 对话代理必须感知和处理来自不同形式(音频、视频、字幕等)的信息,以获得全面的理解。大多数现有工作基于 RNN 和序列到序列架构,这对于捕获复杂的长期依赖关系(如视频)不是很有效。为了克服这个问题,我们提出了多模态变压器网络 (MTN) 来编码视频并合并来自不同模态的信息。我们还通过自动编码器提出了查询感知注意力,以从非文本模式中提取查询感知特征。我们开发了一个训练程序来模拟令牌级解码,以提高推理过程中生成响应的质量。我们在对话系统技术挑战 7 (DSTC7) 上获得了最先进的性能。我们的模型还推广到另一个多模态基于视觉的对话任务,并获得了有希望的性能。我们使用 PyTorch 实现了我们的模型,代码发布在 https://github.com/henryhungle/MTN。 Hung Le Doyen Sahoo Nancy F. Chen Steven C. H. Hoi
31 ACL2019 Persuasion for Good: Towards a Personalized Persuasive Dialogue System for Social Good https://gitlab.com/ucdavisnlp/persuasionforgood https://arxiv.org/pdf/1906.06725 Developing intelligent persuasive conversational agents to change people’s opinions and actions for social good is the frontier in advancing the ethical development of automated dialogue systems. To do so, the first step is to understand the intricate organization of strategic disclosures and appeals employed in human persuasion conversations. We designed an online persuasion task where one participant was asked to persuade the other to donate to a specific charity. We collected a large dataset with 1,017 dialogues and annotated emerging persuasion strategies from a subset. Based on the annotation, we built a baseline classifier with context information and sentence-level features to predict the 10 persuasion strategies used in the corpus. Furthermore, to develop an understanding of personalized persuasion processes, we analyzed the relationships between individuals’ demographic and psychological backgrounds including personality, morality, value systems, and their willingness for donation. Then, we analyzed which types of persuasion strategies led to a greater amount of donation depending on the individuals’ personal backgrounds. This work lays the ground for developing a personalized persuasive dialogue system. 开发智能的有说服力的对话代理来改变人们的意见和行为以促进社会利益是推进自动对话系统伦理发展的前沿。为此,第一步是了解人类说服对话中采用的战略披露和呼吁的复杂组织。我们设计了一项在线说服任务,要求一名参与者说服另一名参与者向特定慈善机构捐款。我们收集了一个包含 1,017 个对话的大型数据集,并从一个子集中注释了新兴的说服策略。基于注释,我们构建了一个具有上下文信息和句子级特征的基线分类器,以预测语料库中使用的 10 种说服策略。此外,为了了解个性化说服过程,我们分析了个人的人口统计学和心理背景之间的关系,包括个性、道德、价值体系和他们的捐赠意愿。然后,我们根据个人的个人背景分析了哪些类型的说服策略会导致更多的捐赠。这项工作为开发个性化的说服性对话系统奠定了基础。 Xuewei Wang Weiyan Shi Richard Kim Yoojung Oh Sijia Yang Jingwen Zhang Zhou Yu

EMNLP

序号 会议/期刊 论文 主要技术 代码 论文下载地址 摘要 摘要翻译 作者
1 EMNLP2020 Difference-aware Knowledge Selection for Knowledge-grounded Conversation Generation https://github.com/chujiezheng/DiffKS https://arxiv.org/pdf/2009.09378 In a multi-turn knowledge-grounded dialog, the difference between the knowledge selected at different turns usually provides potential clues to knowledge selection, which has been largely neglected in previous research. In this paper, we propose a difference-aware knowledge selection method. It first computes the difference between the candidate knowledge sentences provided at the current turn and those chosen in the previous turns. Then, the differential information is fused with or disentangled from the contextual information to facilitate final knowledge selection. Automatic, human observational, and interactive evaluation shows that our method is able to select knowledge more accurately and generate more informative responses, significantly outperforming the state-of-the-art baselines. The codes are available at https://github.com/chujiezheng/DiffKS. 在多轮基于知识的对话中,不同轮选择的知识之间的差异通常为知识选择提供了潜在的线索,而这在以前的研究中被很大程度上忽略了。在本文中,我们提出了一种差异感知知识选择方法。它首先计算当前回合提供的候选知识句子与前几回合选择的知识句子之间的差异。然后,差异信息与上下文信息融合或分离,以促进最终的知识选择。自动、人工观察和交互式评估表明,我们的方法能够更准确地选择知识并生成更多信息响应,显着优于最先进的基线。代码可在 https://github.com/chujiezheng/DiffKS 获得。 Chujie Zheng Yunbo Cao Daxin Jiang Minlie Huang
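差异感知知识选择的直觉是:候选知识既要与当前上下文相关,又要与前几轮已经用过的知识有足够差异。下面是这一打分思路的简化示意(非官方实现;余弦相似度与加权系数 alpha 均为假设的实现选择):

```python
import torch
import torch.nn.functional as F

def difference_aware_scores(cand_vecs, prev_vecs, ctx_vec, alpha=0.5):
    """cand_vecs: [K, H] 当前轮候选知识;prev_vecs: [M, H] 前几轮已选知识;ctx_vec: [H] 上下文。"""
    cand_n = F.normalize(cand_vecs, dim=-1)
    prev_n = F.normalize(prev_vecs, dim=-1)
    ctx_n = F.normalize(ctx_vec, dim=-1)
    relevance = cand_n @ ctx_n                              # 与上下文的相关性
    sim_to_prev = (cand_n @ prev_n.t()).max(dim=1).values   # 与已选知识最相似的程度
    difference = 1.0 - sim_to_prev                          # 差异越大越好
    return alpha * relevance + (1 - alpha) * difference

scores = difference_aware_scores(torch.randn(5, 64), torch.randn(2, 64), torch.randn(64))
print(scores.argmax().item())  # 得分最高的候选知识下标
```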
2 EMNLP2020 Few-shot Natural Language Generation for Task-Oriented Dialog https://github.com/pengbaolin/SC-GPT https://arxiv.org/pdf/2002.12328 As a crucial component in task-oriented dialog systems, the Natural Language Generation (NLG) module converts a dialog act represented in a semantic form into a response in natural language. The success of traditional template-based or statistical models typically relies on heavily annotated data, which is infeasible for new domains. Therefore, it is pivotal for an NLG system to generalize well with limited labelled data in real applications. To this end, we present FewShotWoz, the first NLG benchmark to simulate the few-shot learning setting in task-oriented dialog systems. Further, we develop the SC-GPT model. It is pre-trained on a large set of annotated NLG corpus to acquire the controllable generation ability, and fine-tuned with only a few domain-specific labels to adapt to new domains. Experiments on FewShotWoz and the large Multi-Domain-WOZ datasets show that the proposed SC-GPT significantly outperforms existing methods, measured by various automatic metrics and human evaluations. 作为面向任务的对话系统中的重要组成部分,自然语言生成 (NLG) 模块将以语义形式表示的对话行为转换为自然语言的响应。传统的基于模板或统计模型的成功通常依赖于大量注释的数据,这对于新领域是不可行的。因此,NLG 系统在实际应用中利用有限的标记数据很好地泛化是至关重要的。为此,我们提出了FewShotWoz,这是第一个在面向任务的对话系统中模拟小样本学习设置的NLG 基准测试。此外,我们开发了 SC-GPT 模型。它在大量带注释的 NLG 语料库上进行预训练以获得可控生成能力,并仅使用少数特定领域的标签进行微调以适应新领域。在FewShotWoz 和大型Multi-Domain-WOZ 数据集上的实验表明,所提出的SC-GPT 显着优于现有方法,通过各种自动指标和人工评估来衡量。 Baolin Peng Chenguang Zhu Chunyuan Li Xiujun Li Jinchao Li Michael Zeng Jianfeng Gao
3 EMNLP2020 Learning Knowledge Bases with Parameters for Task-Oriented Dialogue Systems https://github.com/HLTCHKUST/ke-dialogue https://arxiv.org/pdf/2009.13656 Task-oriented dialogue systems are either modularized with separate dialogue state tracking (DST) and management steps or end-to-end trainable. In either case, the knowledge base (KB) plays an essential role in fulfilling user requests. Modularized systems rely on DST to interact with the KB, which is expensive in terms of annotation and inference time. End-to-end systems use the KB directly as input, but they cannot scale when the KB is larger than a few hundred entries. In this paper, we propose a method to embed the KB, of any size, directly into the model parameters. The resulting model does not require any DST or template responses, nor the KB as input, and it can dynamically update its KB via fine-tuning. We evaluate our solution in five task-oriented dialogue datasets with small, medium, and large KB size. Our experiments show that end-to-end models can effectively embed knowledge bases in their parameters and achieve competitive performance in all evaluated datasets. 面向任务的对话系统要么采用单独的对话状态跟踪 (DST) 和管理步骤进行模块化,要么采用端到端可训练。无论哪种情况,知识库 (KB) 在满足用户请求方面都起着至关重要的作用。模块化系统依赖 DST 与 KB 交互,这在注释和推理时间方面很昂贵。端到端系统直接使用 KB 作为输入,但当 KB 大于几百个条目时,它们无法扩展。在本文中,我们提出了一种将任意大小的知识库直接嵌入模型参数的方法。生成的模型不需要任何 DST 或模板响应,也不需要 KB 作为输入,并且可以通过微调动态更新其 KB。我们在具有小、中和大 KB 大小的五个面向任务的对话数据集中评估我们的解决方案。我们的实验表明,端到端模型可以有效地将知识库嵌入其参数中,并在所有评估数据集中实现具有竞争力的性能。 Andrea Madotto Samuel Cahyawijaya Genta Indra Winata Yan Xu Zihan Liu Zhaojiang Lin Pascale Fung
4 EMNLP2020 Plug-and-Play Conversational Models https://github.com/andreamad8/PPCM https://arxiv.org/pdf/2010.04344 There has been considerable progress made towards conversational models that generate coherent and fluent responses; however, this often involves training large language models on large dialogue datasets, such as Reddit. These large conversational models provide little control over the generated responses, and this control is further limited in the absence of annotated conversational datasets for attribute specific generation that can be used for fine-tuning the model. In this paper, we first propose and evaluate plug-and-play methods for controllable response generation, which does not require dialogue specific datasets and does not rely on fine-tuning a large model. While effective, the decoding procedure induces considerable computational overhead, rendering the conversational model unsuitable for interactive usage. To overcome this, we introduce an approach that does not require further computation at decoding time, while also does not require any fine-tuning of a large language model. We demonstrate, through extensive automatic and human evaluation, a high degree of control over the generated conversational responses with regard to multiple desired attributes, while being fluent. 在产生连贯和流畅反应的对话模型方面取得了相当大的进展;然而,这通常涉及在大型对话数据集(例如 Reddit)上训练大型语言模型。这些大型对话模型对生成的响应几乎没有控制,并且这种控制在缺少可用于微调模型的特定属性生成的带注释对话数据集的情况下进一步受到限制。在本文中,我们首先提出并评估了用于可控响应生成的即插即用方法,该方法不需要对话特定的数据集,也不依赖于对大型模型进行微调。虽然有效,但解码过程会引起相当大的计算开销,使会话模型不适合交互式使用。为了克服这个问题,我们引入了一种在解码时不需要进一步计算的方法,同时也不需要对大型语言模型进行任何微调。我们通过广泛的自动和人工评估展示了对生成的关于多个所需属性的对话响应的高度控制,同时流畅。 Andrea Madotto Etsuko Ishii Zhaojiang Lin Sumanth Dathathri Pascale Fung
5 EMNLP2020 COSMIC: COmmonSense knowledge for eMotion Identification in Conversations https://github.com/declare-lab/conv-emotion https://arxiv.org/pdf/2010.02795 In this paper, we address the task of utterance level emotion recognition in conversations using commonsense knowledge. We propose COSMIC, a new framework that incorporates different elements of commonsense such as mental states, events, and causal relations, and build upon them to learn interactions between interlocutors participating in a conversation. Current state-of-the-art methods often encounter difficulties in context propagation, emotion shift detection, and differentiating between related emotion classes. By learning distinct commonsense representations, COSMIC addresses these challenges and achieves new state-of-the-art results for emotion recognition on four different benchmark conversational datasets. Our code is available at https://github.com/declare-lab/conv-emotion. 在本文中,我们使用常识知识解决对话中话语级情感识别的任务。我们提出了 COSMIC,这是一个新框架,它结合了不同的常识元素,例如心理状态、事件和因果关系,并以此为基础来学习参与对话的对话者之间的互动。当前最先进的方法在上下文传播、情感转移检测和相关情感类别之间的区分方面经常遇到困难。通过学习不同的常识表示,COSMIC 解决了这些挑战,并在四个不同的基准会话数据集上实现了情感识别的最新结果。我们的代码可在 https://github.com/declare-lab/conv-emotion 获得。 Deepanway Ghosal Navonil Majumder Alexander Gelbukh Rada Mihalcea Soujanya Poria
6 EMNLP2020 Generalizable and Explainable Dialogue Generation via Explicit Action Learning https://arxiv.org/pdf/2010.03755 Response generation for task-oriented dialogues implicitly optimizes two objectives at the same time: task completion and language quality. Conditioned response generation serves as an effective approach to separately and better optimize these two objectives. Such an approach relies on system action annotations which are expensive to obtain. To alleviate the need of action annotations, latent action learning is introduced to map each utterance to a latent representation. However, this approach is prone to over-dependence on the training data, and the generalization capability is thus restricted. To address this issue, we propose to learn natural language actions that represent utterances as a span of words. This explicit action representation promotes generalization via the compositional structure of language. It also enables an explainable generation process. Our proposed unsupervised approach learns a memory component to summarize system utterances into a short span of words. To further promote a compact action representation, we propose an auxiliary task that restores state annotations as the summarized dialogue context using the memory component. Our proposed approach outperforms latent action baselines on MultiWOZ, a benchmark multi-domain dataset. 面向任务的对话的响应生成同时隐式优化了两个目标:任务完成和语言质量。条件反应生成是分别和更好地优化这两个目标的有效方法。这种方法依赖于获取昂贵的系统动作注释。为了减轻对动作注释的需要,引入了潜在动作学习以将每个话语映射到潜在表示。然而,这种方法容易过度依赖训练数据,从而限制了泛化能力。为了解决这个问题,我们建议学习将话语表示为单词跨度的自然语言动作。这种显式的动作表示通过语言的组合结构促进了泛化。它还支持可解释的生成过程。我们提出的无监督方法学习了一个记忆组件,将系统的话语概括为一个短的单词跨度。为了进一步促进紧凑的动作表示,我们提出了一个辅助任务,该任务使用记忆组件将状态注释恢复为汇总的对话上下文。我们提出的方法优于基准多域数据集 MultiWOZ 上的潜在动作基线。 Xinting Huang Jianzhong Qi Yu Sun Rui Zhang
7 EMNLP2020 Effects of Naturalistic Variation in Goal-Oriented Dialog https://github.com/IBM/naturalistic-variation-goal-oriented-dialog-datasets https://arxiv.org/pdf/2010.02260 Existing benchmarks used to evaluate the performance of end-to-end neural dialog systems lack a key component: natural variation present in human conversations. Most datasets are constructed through crowdsourcing, where the crowd workers follow a fixed template of instructions while enacting the role of a user/agent. This results in straight-forward, somewhat routine, and mostly trouble-free conversations, as crowd workers do not think to represent the full range of actions that occur naturally with real users. In this work, we investigate the impact of naturalistic variation on two goal-oriented datasets: bAbI dialog task and Stanford Multi-Domain Dataset (SMD). We also propose new and more effective testbeds for both datasets, by introducing naturalistic variation by the user. We observe that there is a significant drop in performance (more than 60% in Ent. F1 on SMD and 85% in per-dialog accuracy on bAbI task) of recent state-of-the-art end-to-end neural methods such as BossNet and GLMP on both datasets. 用于评估端到端神经对话系统性能的现有基准缺乏一个关键组成部分:人类对话中存在的自然变化。大多数数据集是通过众包构建的,众包工作人员在扮演用户/代理角色的同时遵循固定的指令模板。这导致了直接的、有点常规的、并且大部分是无故障的对话,因为人群工作人员并不认为代表真实用户自然发生的全部动作。在这项工作中,我们研究了自然变化对两个面向目标的数据集的影响:bAbI 对话任务和斯坦福多域数据集 (SMD)。我们还通过引入用户的自然变化,为这两个数据集提出了新的、更有效的测试平台。我们观察到,最近最先进的端到端神经方法的性能显着下降(SMD 上的 Ent.F1 超过 60%,bAbI 任务的每个对话准确度下降了 85%),例如作为两个数据集上的 BossNet 和 GLMP。 Jatin Ganhotra Robert Moore Sachindra Joshi Kahini Wadhawan
8 EMNLP2019 DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation https://github.com/SenticNet/conv-emotion https://arxiv.org/pdf/1908.11540 Emotion recognition in conversation (ERC) has received much attention, lately, from researchers due to its potential widespread applications in diverse areas, such as health-care, education, and human resources. In this paper, we present Dialogue Graph Convolutional Network (DialogueGCN), a graph neural network based approach to ERC. We leverage self and inter-speaker dependency of the interlocutors to model conversational context for emotion recognition. Through the graph network, DialogueGCN addresses context propagation issues present in the current RNN-based methods. We empirically show that this method alleviates such issues, while outperforming the current state of the art on a number of benchmark emotion classification datasets. Deepanway Ghosal Navonil Majumder Soujanya Poria Niyati Chhaya Alexander Gelbukh
9 EMNLP2019 Modeling Multi-Action Policy for Task-Oriented Dialogues https://arxiv.org/pdf/1908.11546 Dialogue management (DM) plays a key role in the quality of the interaction with the user in a task-oriented dialogue system. In most existing approaches, the agent predicts only one DM policy action per turn. This significantly limits the expressive power of the conversational agent and introduces unwanted turns of interactions that may challenge users’ patience. Longer conversations also lead to more errors and the system needs to be more robust to handle them. In this paper, we compare the performance of several models on the task of predicting multiple acts for each turn. A novel policy model is proposed based on a recurrent cell called gated Continue-Act-Slots (gCAS) that overcomes the limitations of the existing models. Experimental results show that gCAS outperforms other approaches. The code is available at https://leishu02.github.io/ 在面向任务的对话系统中,对话管理 (DM) 在与用户交互的质量中起着关键作用。在大多数现有方法中,代理每回合仅预测一个 DM 策略操作。这显着限制了对话代理的表达能力,并引入了可能挑战用户耐心的不必要的交互转变。更长的对话也会导致更多的错误,系统需要更强大来处理它们。在本文中,我们比较了几种模型在预测每个回合的多个动作的任务上的性能。基于称为门控Continue-Act-Slots(gCAS)的循环单元提出了一种新颖的策略模型,该模型克服了现有模型的局限性。实验结果表明,gCAS 优于其他方法。代码可在 https://leishu02.github.io/ 获得 Lei Shu Hu Xu Bing Liu Piero Molino
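下面是一个粗略的示意性草图(非 gCAS 论文的官方实现,门控细节被大幅简化,模块与维度均为假设),用来说明这类"每轮预测多个动作"的循环解码思路:一个循环单元在每一步输出是否继续、对话动作及对应槽位,直到 continue 门指示停止。

```python
import torch
import torch.nn as nn

class GatedCASCell(nn.Module):
    """gCAS 思路的简化示意:GRU 状态接三个头——continue/stop、对话动作、槽位(多标签)。"""
    def __init__(self, d_state, n_acts, n_slots):
        super().__init__()
        self.cell = nn.GRUCell(n_acts + n_slots, d_state)
        self.continue_head = nn.Linear(d_state, 2)      # 是否继续生成动作
        self.act_head = nn.Linear(d_state, n_acts)
        self.slot_head = nn.Linear(d_state, n_slots)

    def forward(self, state, prev_act_slots):
        state = self.cell(prev_act_slots, state)
        return state, self.continue_head(state), self.act_head(state), self.slot_head(state)

n_acts, n_slots, d = 6, 8, 32
cell, state = GatedCASCell(d, n_acts, n_slots), torch.zeros(1, d)
prev = torch.zeros(1, n_acts + n_slots)
actions = []
for _ in range(4):                                       # 每轮最多 4 个动作
    state, cont, act, slots = cell(state, prev)
    if cont.argmax(-1).item() == 0:                      # "停止"门触发则终止
        break
    actions.append((act.argmax(-1).item(), (torch.sigmoid(slots) > 0.5).nonzero().tolist()))
    prev = torch.cat([torch.softmax(act, -1), torch.sigmoid(slots)], dim=-1)
print(actions)
```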
10 EMNLP2019 Automatically Learning Data Augmentation Policies for Dialogue Tasks https://github.com/WolfNiu/AutoAugDialogue https://arxiv.org/pdf/1909.12868 Automatic data augmentation (AutoAugment) (Cubuk et al., 2019) searches for optimal perturbation policies via a controller trained using performance rewards of a sampled policy on the target task, hence reducing data-level model bias. While being a powerful algorithm, their work has focused on computer vision tasks, where it is comparatively easy to apply imperceptible perturbations without changing an image’s semantic meaning. In our work, we adapt AutoAugment to automatically discover effective perturbation policies for natural language processing (NLP) tasks such as dialogue generation. We start with a pool of atomic operations that apply subtle semantic-preserving perturbations to the source inputs of a dialogue task (e.g., different POS-tag types of stopword dropout, grammatical errors, and paraphrasing). Next, we allow the controller to learn more complex augmentation policies by searching over the space of the various combinations of these atomic operations. Moreover, we also explore conditioning the controller on the source inputs of the target task, since certain strategies may not apply to inputs that do not contain that strategy’s required linguistic features. Empirically, we demonstrate that both our input-agnostic and input-aware controllers discover useful data augmentation policies, and achieve significant improvements over the previous state-of-the-art, including trained on manually-designed policies. 自动数据增强 (AutoAugment) (Cubuk et al., 2019) 通过使用目标任务采样策略的性能奖励训练的控制器搜索最佳扰动策略,从而减少数据级模型偏差。虽然是一种强大的算法,但他们的工作主要集中在计算机视觉任务上,在不改变图像语义的情况下应用不易察觉的扰动相对容易。在我们的工作中,我们采用 AutoAugment 自动发现自然语言处理 (NLP) 任务(如对话生成)的有效扰动策略。我们从一组原子操作开始,这些操作将微妙的语义保留扰动应用于对话任务的源输入(例如,不同的 POS 标签类型的停用词丢失、语法错误和释义)。接下来,我们允许控制器通过搜索这些原子操作的各种组合的空间来学习更复杂的增强策略。此外,我们还探索在目标任务的源输入上调节控制器,因为某些策略可能不适用于不包含该策略所需语言特征的输入。从经验上讲,我们证明了我们的输入不可知和输入感知控制器都发现了有用的数据增强策略,并且比以前的最先进技术取得了显着的改进,包括对手动设计的策略进行训练。 Tong Niu Mohit Bansal
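下面用一个极简草图示意"原子扰动操作 + 策略搜索"的思路(并非该论文的实现:论文中的控制器由验证集奖励训练,此处只是随机采样策略;STOPWORDS、char_swap_noise 等名称与实现均为示意)。

```python
import random

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is"}

def stopword_dropout(tokens, p=0.3):
    # 停用词随机删除:一种保持语义的细微扰动
    return [t for t in tokens if t.lower() not in STOPWORDS or random.random() > p]

def char_swap_noise(tokens, p=0.1):
    # 交换相邻字符,粗略模拟"语法/拼写错误"类扰动
    out = []
    for t in tokens:
        if len(t) > 3 and random.random() < p:
            i = random.randrange(len(t) - 1)
            t = t[:i] + t[i + 1] + t[i] + t[i + 2:]
        out.append(t)
    return out

ATOMIC_OPS = [stopword_dropout, char_swap_noise]

def sample_policy(n_ops=2):
    """一个策略 = 一小串原子操作;论文中由控制器学习有效组合,此处均匀随机采样。"""
    return random.sample(ATOMIC_OPS, k=n_ops)

def apply_policy(policy, utterance):
    tokens = utterance.split()
    for op in policy:
        tokens = op(tokens)
    return " ".join(tokens)

random.seed(0)
print(apply_policy(sample_policy(), "could you book a table for two at the italian place"))
```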
11 EMNLP2019 Dependency Parsing for Spoken Dialog Systems https://arxiv.org/pdf/1909.03317 Dependency parsing of conversational input can play an important role in language understanding for dialog systems by identifying the relationships between entities extracted from user utterances. Additionally, effective dependency parsing can elucidate differences in language structure and usage for discourse analysis of human-human versus human-machine dialogs. However, models trained on datasets based on news articles and web data do not perform well on spoken human-machine dialog, and currently available annotation schemes do not adapt well to dialog data. Therefore, we propose the Spoken Conversation Universal Dependencies (SCUD) annotation scheme that extends the Universal Dependencies (UD) (Nivre et al., 2016) guidelines to spoken human-machine dialogs. We also provide ConvBank, a conversation dataset between humans and an open-domain conversational dialog system with SCUD annotation. Finally, to demonstrate the utility of the dataset, we train a dependency parser on the ConvBank dataset. We demonstrate that by pre-training a dependency parser on a set of larger public datasets and fine-tuning on ConvBank data, we achieved the best result, 85.05% unlabeled and 77.82% labeled attachment accuracy. Sam Davidson Dian Yu Zhou Yu
12 EMNLP2019 Towards Knowledge-Based Recommender Dialog System https://github.com/THUDM/KBRD https://arxiv.org/pdf/1908.05391 In this paper, we propose a novel end-to-end framework called KBRD, which stands for Knowledge-Based Recommender Dialog System. It integrates the recommender system and the dialog generation system. The dialog system can enhance the performance of the recommendation system by introducing knowledge-grounded information about users’ preferences, and the recommender system can improve that of the dialog generation system by providing recommendation-aware vocabulary bias. Experimental results demonstrate that our proposed model has significant advantages over the baselines in both the evaluation of dialog generation and recommendation. A series of analyses show that the two systems can bring mutual benefits to each other, and the introduced knowledge contributes to both their performances. 在本文中,我们提出了一种名为 KBRD 的新型端到端框架,它代表基于知识的推荐对话系统。它集成了推荐系统和对话生成系统。对话系统可以通过引入关于用户偏好的基于知识的信息来提高推荐系统的性能,推荐系统可以通过提供推荐感知词汇偏差来改进对话生成系统的性能。实验结果表明,我们提出的模型在对话生成和推荐的评估方面均优于基线。一系列分析表明,这两个系统可以互惠互利,所引入的知识对两者的性能都有贡献。 Qibin Chen Junyang Lin Yichang Zhang Ming Ding Yukuo Cen Hongxia Yang Jie Tang
13 EMNLP2019 DyKgChat: Benchmarking Dialogue Generation Grounding on Dynamic Knowledge Graphs https://github.com/Pascalson/DyKGChat https://arxiv.org/pdf/1910.00610 Data-driven, knowledge-grounded neural conversation models are capable of generating more informative responses. However, these models have not yet demonstrated that they can zero-shot adapt to updated, unseen knowledge graphs. This paper proposes a new task about how to apply dynamic knowledge graphs in neural conversation model and presents a novel TV series conversation corpus (DyKgChat) for the task. Our new task and corpus aids in understanding the influence of dynamic knowledge graphs on responses generation. Also, we propose a preliminary model that selects an output from two networks at each time step: a sequence-to-sequence model (Seq2Seq) and a multi-hop reasoning model, in order to support dynamic knowledge graphs. To benchmark this new task and evaluate the capability of adaptation, we introduce several evaluation metrics and the experiments show that our proposed approach outperforms previous knowledge-grounded conversation models. The proposed corpus and model can motivate the future research directions. 数据驱动、以知识为基础的神经对话模型能够生成信息更丰富的响应。然而,这些模型尚未证明可以零样本地适应更新后的、未见过的知识图。本文提出了一个关于如何在神经对话模型中应用动态知识图的新任务,并为该任务构建了一个新的电视剧对话语料库(DyKgChat)。我们的新任务和语料库有助于理解动态知识图对响应生成的影响。此外,为了支持动态知识图,我们提出了一个初步模型,它在每个时间步从两个网络的输出中进行选择:一个序列到序列模型(Seq2Seq)和一个多跳推理模型。为了对这项新任务进行基准测试并评估模型的适应能力,我们引入了若干评估指标,实验表明我们提出的方法优于以往基于知识的对话模型。所提出的语料库和模型可以启发未来的研究方向。 Yi-Lin Tuan Yun-Nung Chen Hung-yi Lee
14 EMNLP2019 How to Build User Simulators to Train RL-based Dialog Systems https://github.com/wyshi/user-simulator https://arxiv.org/pdf/1909.01388 User simulators are essential for training reinforcement learning (RL) based dialog models. The performance of the simulator directly impacts the RL policy. However, building a good user simulator that models real user behaviors is challenging. We propose a method of standardizing user simulator building that can be used by the community to compare dialog system quality using the same set of user simulators fairly. We present implementations of six user simulators trained with different dialog planning and generation methods. We then calculate a set of automatic metrics to evaluate the quality of these simulators both directly and indirectly. We also ask human users to assess the simulators directly and indirectly by rating the simulated dialogs and interacting with the trained systems. This paper presents a comprehensive evaluation framework for user simulator study and provides a better understanding of the pros and cons of different user simulators, as well as their impacts on the trained systems. 用户模拟器对于训练基于强化学习 (RL) 的对话模型至关重要。模拟器的性能直接影响 RL 策略。然而,构建一个好的用户模拟器来模拟真实的用户行为是具有挑战性的。我们提出了一种标准化用户模拟器构建的方法,社区可以使用该方法来公平地比较使用同一组用户模拟器的对话系统质量。我们展示了使用不同对话规划和生成方法训练的六个用户模拟器的实现。然后我们计算一组自动指标来直接和间接评估这些模拟器的质量。我们还要求人类用户通过对模拟对话进行评级并与训练有素的系统交互来直接或间接评估模拟器。本文提出了一个用于用户模拟器研究的综合评估框架,并更好地了解不同用户模拟器的优缺点,以及它们对训练系统的影响。 Weiyan Shi Kun Qian Xuewei Wang Zhou Yu
15 EMNLP2019 Dual Attention Networks for Visual Reference Resolution in Visual Dialog https://github.com/gicheonkang/DAN-VisDial https://arxiv.org/pdf/1902.09368 Visual dialog (VisDial) is a task which requires an AI agent to answer a series of questions grounded in an image. Unlike in visual question answering (VQA), the series of questions should be able to capture a temporal context from a dialog history and exploit visually-grounded information. A problem called visual reference resolution involves these challenges, requiring the agent to resolve ambiguous references in a given question and find the references in a given image. In this paper, we propose Dual Attention Networks (DAN) for visual reference resolution. DAN consists of two kinds of attention networks, REFER and FIND. Specifically, REFER module learns latent relationships between a given question and a dialog history by employing a self-attention mechanism. FIND module takes image features and reference-aware representations (i.e., the output of REFER module) as input, and performs visual grounding via bottom-up attention mechanism. We qualitatively and quantitatively evaluate our model on VisDial v1.0 and v0.9 datasets, showing that DAN outperforms the previous state-of-the-art model by a significant margin. 视觉对话 (VisDial) 是一项需要 AI 代理回答基于图像的一系列问题的任务。与视觉问答 (VQA) 不同,这一系列问题应该能够从对话历史中捕获时间上下文并利用基于视觉的信息。称为视觉参考解析的问题涉及这些挑战,需要代理解决给定问题中的模糊参考并在给定图像中找到参考。在本文中,我们提出了用于视觉参考分辨率的双注意力网络(DAN)。 DAN 由两种注意力网络组成,REFER 和 FIND。具体来说,REFER 模块通过采用自注意力机制来学习给定问题和对话历史之间的潜在关系。 FIND 模块以图像特征和参考感知表示(即 REFER 模块的输出)作为输入,并通过自下而上的注意力机制执行视觉接地。我们在 VisDial v1.0 和 v0.9 数据集上定性和定量地评估了我们的模型,表明 DAN 的性能明显优于之前的最先进模型。 Gi-Cheon Kang Jaeseo Lim Byoung-Tak Zhang
16 EMNLP2019 Dialog Intent Induction with Deep Multi-View Clustering https://github.com/asappresearch/dialog-intent-induction https://arxiv.org/pdf/1908.11487 We introduce the dialog intent induction task and present a novel deep multi-view clustering approach to tackle the problem. Dialog intent induction aims at discovering user intents from user query utterances in human-human conversations such as dialogs between customer support agents and customers. Motivated by the intuition that a dialog intent is not only expressed in the user query utterance but also captured in the rest of the dialog, we split a conversation into two independent views and exploit multi-view clustering techniques for inducing the dialog intent. In particular, we propose alternating-view k-means (AV-KMEANS) for joint multi-view representation learning and clustering analysis. The key innovation is that the instance-view representations are updated iteratively by predicting the cluster assignment obtained from the alternative view, so that the multi-view representations of the instances lead to similar cluster assignments. Experiments on two public datasets show that AV-KMEANS can induce better dialog intent clusters than state-of-the-art unsupervised representation learning methods and standard multi-view clustering approaches. 我们介绍了对话意图归纳任务,并提出了一种新颖的深度多视图聚类方法来解决该问题。对话意图归纳旨在从人与人对话(例如客户支持代理和客户之间的对话)中的用户查询话语中发现用户意图。受对话意图不仅在用户查询话语中表达而且还在对话的其余部分中表达的直觉的启发,我们将对话分成两个独立的视图,并利用多视图聚类技术来诱导对话意图。特别是,我们提出了交替视图 k-means (AV-KMEANS) 用于联合多视图表示学习和聚类分析。关键创新是通过预测从替代视图获得的集群分配来迭代更新实例视图表示,以便实例的多视图表示导致相似的集群分配。在两个公共数据集上的实验表明,与最先进的无监督表示学习方法和标准多视图聚类方法相比,AV-KMEANS 可以诱导更好的对话意图聚类。 Hugh Perkins Yi Yang
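下面是交替视图 k-means(AV-KMEANS)核心迭代的一个玩具化草图(省略了论文中与神经编码器联合的表示学习部分,数据与参数均为示意):在一个视图上得到的聚类分配被用来在另一个视图上重新估计质心并重新分配,从而促使两个视图收敛到一致的聚类。

```python
import numpy as np

def av_kmeans(view_a, view_b, k, n_iter=10, seed=0):
    """交替视图 k-means 的玩具版本:两个视图轮流使用来自对方的聚类分配。"""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=len(view_a))
    for it in range(n_iter):
        view = view_a if it % 2 == 0 else view_b          # 轮流选取视图
        centroids = np.stack([
            view[assign == c].mean(axis=0) if np.any(assign == c)
            else view[rng.integers(len(view))]            # 空簇时随机重置质心
            for c in range(k)
        ])
        dists = ((view[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)                      # 该分配将传给另一个视图使用
    return assign

rng = np.random.default_rng(1)
a = np.concatenate([rng.normal(0, 1, (20, 5)), rng.normal(4, 1, (20, 5))])  # 视图一
b = a + rng.normal(0, 0.5, a.shape)                                          # 视图二
print(av_kmeans(a, b, k=2))
```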
17 EMNLP2019 Multi-Task Learning for Conversational Question Answering over a Large-Scale Knowledge Base https://github.com/taoshen58/MaSP https://arxiv.org/pdf/1910.05069 We consider the problem of conversational question answering over a large-scale knowledge base. To handle huge entity vocabulary of a large-scale knowledge base, recent neural semantic parsing based approaches usually decompose the task into several subtasks and then solve them sequentially, which leads to following issues: 1) errors in earlier subtasks will be propagated and negatively affect downstream ones; and 2) each subtask cannot naturally share supervision signals with others. To tackle these issues, we propose an innovative multi-task learning framework where a pointer-equipped semantic parsing model is designed to resolve coreference in conversations, and naturally empower joint learning with a novel type-aware entity detection model. The proposed framework thus enables shared supervisions and alleviates the effect of error propagation. Experiments on a large-scale conversational question answering dataset containing 1.6M question answering pairs over 12.8M entities show that the proposed framework improves overall F1 score from 67% to 79% compared with previous state-of-the-art work. 我们考虑在大规模知识库上进行对话式问答的问题。为了处理大规模知识库的庞大实体词汇表,最近的基于神经语义解析的方法通常将任务分解为几个子任务,然后依次解决它们,这导致以下问题:1)早期子任务中的错误将被传播并产生负面影响下游的; 2)每个子任务不能自然地与其他人共享监督信号。为了解决这些问题,我们提出了一种创新的多任务学习框架,其中设计了一个配备指针的语义解析模型来解决对话中的共指问题,并自然地通过一种新颖的类型感知实体检测模型来授权联合学习。因此,所提出的框架能够实现共享监督并减轻错误传播的影响。在包含超过 1280 万个实体的 160 万个问答对的大规模会话问答数据集上的实验表明,与之前的最先进工作相比,所提出的框架将整体 F1 分数从 67% 提高到 79%。 Tao Shen Xiubo Geng Tao Qin Daya Guo Duyu Tang Nan Duan Guodong Long Daxin Jiang

NAACL

序号 会议/期刊 论文 主要技术 代码 论文下载地址 摘要 摘要翻译 作者
1 NAACL2021 Open-Domain Question Answering Goes Conversational via Question Rewriting https://github.com/apple/ml-qrecc https://arxiv.org/pdf/2010.04898 We introduce a new dataset for Question Rewriting in Conversational Context (QReCC), which contains 14K conversations with 80K question-answer pairs. The task in QReCC is to find answers to conversational questions within a collection of 10M web pages (split into 54M passages). Answers to questions in the same conversation may be distributed across several web pages. QReCC provides annotations that allow us to train and evaluate individual subtasks of question rewriting, passage retrieval and reading comprehension required for the end-to-end conversational question answering (QA) task. We report the effectiveness of a strong baseline approach that combines the state-of-the-art model for question rewriting, and competitive models for open-domain QA. Our results set the first baseline for the QReCC dataset with F1 of 19.10, compared to the human upper bound of 75.45, indicating the difficulty of the setup and a large room for improvement. Raviteja Anantha Svitlana Vakulenko Zhucheng Tu Shayne Longpre Stephen Pulman Srinivas Chappidi
2 NAACL2021 Action-Based Conversations Dataset: A Corpus for Building More In-Depth Task- Oriented Dialogue Systems https://github.com/asappresearch/abcd https://arxiv.org/pdf/2104.00783 Existing goal-oriented dialogue datasets focus mainly on identifying slots and values. However, customer support interactions in reality often involve agents following multi-step procedures derived from explicitly-defined company policies as well. To study customer service dialogue systems in more realistic settings, we introduce the Action-Based Conversations Dataset (ABCD), a fully-labeled dataset with over 10K human-to-human dialogues containing 55 distinct user intents requiring unique sequences of actions constrained by policies to achieve task success. We propose two additional dialog tasks, Action State Tracking and Cascading Dialogue Success, and establish a series of baselines involving large-scale, pre-trained language models on this dataset. Empirical results demonstrate that while more sophisticated networks outperform simpler models, a considerable gap (50.8% absolute accuracy) still exists to reach human-level performance on ABCD. 现有的面向目标的对话数据集主要侧重于识别槽和值。然而,现实中的客户支持交互通常也涉及代理遵循源自明确定义的公司政策的多步骤程序。为了在更现实的环境中研究客户服务对话系统,我们引入了基于动作的对话数据集 (ABCD),这是一个完全标记的数据集,其中包含超过 10K 人与人之间的对话,其中包含 55 个不同的用户意图,需要受策略约束的独特动作序列以实现任务成功。我们提出了两个额外的对话任务,Action State Tracking 和 Cascading Dialogue Success,并在这个数据集上建立了一系列涉及大规模、预训练语言模型的基线。实证结果表明,虽然更复杂的网络优于更简单的模型,但在 ABCD 上达到人类水平的表现仍然存在相当大的差距(50.8% 的绝对准确率)。 Derek Chen Howard Chen Yi Yang Alex Lin Zhou Yu
3 NAACL2021 Self-Supervised Contrastive Learning for Efficient User Satisfaction Prediction in Conversational Agents https://arxiv.org/pdf/2010.11230 Turn-level user satisfaction is one of the most important performance metrics for conversational agents. It can be used to monitor the agent’s performance and provide insights about defective user experiences. Moreover, a powerful satisfaction model can be used as an objective function that a conversational agent continuously optimizes for. While end-to-end deep learning has shown promising results, having access to a large number of reliable annotated samples required by these methods remains challenging. In a large-scale conversational system, there is a growing number of newly developed skills, making the traditional data collection, annotation, and modeling process impractical due to the required annotation costs as well as the turnaround times. In this paper, we suggest a self-supervised contrastive learning approach that leverages the pool of unlabeled data to learn user-agent interactions. We show that the pre-trained models using the self-supervised objective are transferable to the user satisfaction prediction. In addition, we propose a novel few-shot transfer learning approach that ensures better transferability for very small sample sizes. The suggested few-shot method does not require any inner loop optimization process and is scalable to very large datasets and complex models. Based on our experiments using real-world data from a large-scale commercial system, the suggested approach is able to significantly reduce the required number of annotations, while improving the generalization on unseen out-of-domain skills. 回合级用户满意度是会话代理最重要的性能指标之一。它可用于监控代理的性能并提供有关有缺陷的用户体验的见解。此外,强大的满意度模型可以用作对话代理不断优化的目标函数。虽然端到端深度学习已显示出有希望的结果,但访问这些方法所需的大量可靠的带注释的样本仍然具有挑战性。在大型对话系统中,新开发的技能越来越多,由于所需的注释成本和周转时间,使得传统的数据收集、注释和建模过程变得不切实际。在本文中,我们提出了一种自监督对比学习方法,该方法利用未标记数据池来学习用户-代理交互。我们表明使用自监督目标的预训练模型可转移到用户满意度预测。此外,我们提出了一种新颖的少样本迁移学习方法,可确保对非常小的样本量具有更好的迁移能力。建议的小样本方法不需要任何内循环优化过程,并且可扩展到非常大的数据集和复杂模型。基于我们使用来自大规模商业系统的真实世界数据的实验,所建议的方法能够显着减少所需的注释数量,同时提高对看不见的域外技能的泛化。 Mohammad Kachuee Hao Yuan Young-Bum Kim Sungjin Lee
4 NAACL2021 Human-like informative conversations: Better acknowledgements using conditional mutual information https://github.com/AshwinParanjape/human-like-informative-conversations https://arxiv.org/pdf/2104.07831 This work aims to build a dialogue agent that can weave new factual content into conversations as naturally as humans. We draw insights from linguistic principles of conversational analysis and annotate human-human conversations from the Switchboard Dialog Act Corpus to examine humans strategies for acknowledgement, transition, detail selection and presentation. When current chatbots (explicitly provided with new factual content) introduce facts into a conversation, their generated responses do not acknowledge the prior turns. This is because models trained with two contexts - new factual content and conversational history - generate responses that are non-specific w.r.t. one of the contexts, typically the conversational history. We show that specificity w.r.t. conversational history is better captured by Pointwise Conditional Mutual Information ($\text{pcmi}_h$) than by the established use of Pointwise Mutual Information ($\text{pmi}$). Our proposed method, Fused-PCMI, trades off $\text{pmi}$ for $\text{pcmi}_h$ and is preferred by humans for overall quality over the Max-PMI baseline 60% of the time. Human evaluators also judge responses with higher $\text{pcmi}_h$ better at acknowledgement 74% of the time. The results demonstrate that systems mimicking human conversational traits (in this case acknowledgement) improve overall quality and more broadly illustrate the utility of linguistic principles in improving dialogue agents. 这项工作旨在构建一个能够像人类一样自然地将新的事实内容编织到对话中的对话代理。我们从会话分析的语言学原则中汲取见解,并对 Switchboard Dialog Act Corpus 中的人与人对话进行标注,以考察人类在承接确认、话题转换、细节选择和呈现方面的策略。当当前的聊天机器人(显式提供了新的事实内容)把事实引入对话时,它们生成的响应并不会对之前的对话轮次做出承接确认。这是因为以两种上下文(新的事实内容和对话历史)为条件训练的模型,生成的响应往往对其中一种上下文(通常是对话历史)缺乏针对性。我们表明,相对于对话历史的针对性用逐点条件互信息(Pointwise Conditional Mutual Information,$\text{pcmi}_h$)来刻画,比使用常用的逐点互信息($\text{pmi}$)效果更好。我们提出的方法 Fused-PCMI 以牺牲部分 $\text{pmi}$ 换取更高的 $\text{pcmi}_h$,在 60% 的情况下其整体质量被人类评为优于 Max-PMI 基线。人类评估者也在 74% 的情况下认为 $\text{pcmi}_h$ 较高的响应在承接确认方面更好。结果表明,模仿人类对话特征(在这里是承接确认)的系统能够提高整体质量,并更广泛地说明了语言学原则在改进对话代理方面的效用。 Ashwin Paranjape Christopher D. Manning
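按照逐点(条件)互信息的一般定义,$\text{pmi}$ 与 $\text{pcmi}_h$ 都可以由同一响应在不同条件下的对数概率之差得到。下面的小例子用假设的对数概率值示意这一计算方式(并非论文代码,Fused-PCMI 的具体加权方式此处未实现):

```python
def pmi(logp_full, logp_uncond):
    """pmi(r; h, f) = log p(r | history, fact) - log p(r)"""
    return logp_full - logp_uncond

def pcmi_h(logp_full, logp_fact_only):
    """pcmi_h(r; h | f) = log p(r | history, fact) - log p(r | fact)"""
    return logp_full - logp_fact_only

# 假设由同一个 seq2seq 模型在三种条件下对同一候选响应打分得到的对数概率(示意数值)
logp_full, logp_fact_only, logp_uncond = -12.3, -18.1, -25.0
print("pmi    =", pmi(logp_full, logp_uncond))        # ≈ 12.7
print("pcmi_h =", pcmi_h(logp_full, logp_fact_only))  # ≈ 5.8
# Fused-PCMI 风格的重排序会在候选响应中偏好 pcmi_h 较高者(以牺牲部分 pmi 为代价)
```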
5 NAACL2021 DiSCoL: Toward Engaging Dialogue Systems through Conversational Line Guided Response Generation https://arxiv.org/pdf/2102.02191 Having engaging and informative conversations with users is the utmost goal for open-domain conversational systems. Recent advances in transformer-based language models and their applications to dialogue systems have succeeded to generate fluent and human-like responses. However, they still lack control over the generation process towards producing contentful responses and achieving engaging conversations. To achieve this goal, we present \textbf{DiSCoL} (\textbf{Di}alogue \textbf{S}ystems through \textbf{Co}versational \textbf{L}ine guided response generation). DiSCoL is an open-domain dialogue system that leverages conversational lines (briefly \textbf{convlines}) as controllable and informative content-planning elements to guide the generation model produce engaging and informative responses. Two primary modules in DiSCoL’s pipeline are conditional generators trained for 1) predicting relevant and informative convlines for dialogue contexts and 2) generating high-quality responses conditioned on the predicted convlines. Users can also change the returned convlines to \textit{control} the direction of the conversations towards topics that are more interesting for them. Through automatic and human evaluations, we demonstrate the efficiency of the convlines in producing engaging conversations. 与用户进行引人入胜且信息丰富的对话是开放域对话系统的最大目标。基于转换器的语言模型及其在对话系统中的应用的最新进展已经成功地产生了流畅和类似人类的响应。然而,他们仍然缺乏对生成内容的响应和实现引人入胜的对话的生成过程的控制。为了实现这一目标,我们提出了 \textbf{DiSCoL}(\textbf{Di}alogue \textbf{S}ystems through \textbf{Co}versational \textbf{L}ine 引导响应生成)。 DiSCoL 是一个开放域对话系统,它利用对话线(简称 \textbf{convlines})作为可控和信息丰富的内容规划元素来指导生成模型产生引人入胜和信息丰富的响应。 DiSCoL 管道中的两个主要模块是条件生成器,用于 1) 预测对话上下文的相关和信息丰富的 convlines 和 2) 生成以预测的 convlines 为条件的高质量响应。用户还可以将返回的 convlines 更改为 \textit{control} 将对话的方向转向对他们更感兴趣的主题。通过自动和人工评估,我们展示了 convlines 在产生引人入胜的对话方面的效率。 Sarik Ghazarian Zixi Liu Tuhin Chakrabarty Xuezhe Ma Aram Galstyan Nanyun Peng
6 NAACL2021 Adding Chit-Chat to Enhance Task-Oriented Dialogues https://arxiv.org/pdf/2010.12757 Existing dialogue corpora and models are typically designed under two disjoint motives: while task-oriented systems focus on achieving functional goals (e.g., booking hotels), open-domain chatbots aim at making socially engaging conversations. In this work, we propose to integrate both types of systems by Adding Chit-Chat to ENhance Task-ORiented dialogues (ACCENTOR), with the goal of making virtual assistant conversations more engaging and interactive. Specifically, we propose a Human <-> AI collaborative data collection approach for generating diverse chit-chat responses to augment task-oriented dialogues with minimal annotation effort. We then present our new chit-chat-based annotations to 23.8K dialogues from two popular task-oriented datasets (Schema-Guided Dialogue and MultiWOZ 2.1) and demonstrate their advantage over the originals via human evaluation. Lastly, we propose three new models for adding chit-chat to task-oriented dialogues, explicitly trained to predict user goals and to generate contextually relevant chit-chat responses. Automatic and human evaluations show that, compared with the state-of-the-art task-oriented baseline, our models can code-switch between task and chit-chat to be more engaging, interesting, knowledgeable, and humanlike, while maintaining competitive task performance. 现有的对话语料库和模型通常是在两个不相交的动机下设计的:面向任务的系统专注于实现功能目标(例如,预订酒店),而开放域聊天机器人旨在进行社交互动。在这项工作中,我们建议通过将 Chit-Chat 添加到增强面向任务的对话 (ACCENTOR) 来集成两种类型的系统,目的是使虚拟助手对话更具吸引力和互动性。具体来说,我们提出了一种 Human <-> AI 协作数据收集方法,用于生成各种闲聊响应,以最少的注释工作来增强面向任务的对话。然后,我们向来自两个流行的面向任务的数据集(Schema-Guided Dialogue 和 MultiWOZ 2.1)的 23.8K 对话展示了我们新的基于闲聊的注释,并通过人工评估证明了它们相对于原始数据的优势。最后,我们提出了三个新模型,用于将闲聊添加到面向任务的对话中,经过明确训练以预测用户目标并生成上下文相关的闲聊响应。自动和人工评估表明,与最先进的面向任务的基线相比,我们的模型可以在任务和闲聊之间进行代码切换,使其更具吸引力、有趣、知识渊博和人性化,同时保持竞争性任务表现。 Kai Sun Seungwhan Moon Paul Crook Stephen Roller Becka Silvert Bing Liu Zhiguang Wang Honglei Liu Eunjoon Cho Claire Cardie
7 NAACL2021 Fast and Scalable Dialogue State Tracking with Explicit Modular Decomposition https://arxiv.org/pdf/2004.10663 We present a fast and scalable architecture called Explicit Modular Decomposition (EMD), in which we incorporate both classification-based and extraction-based methods and design four modules (for classification and sequence labelling) to jointly extract dialogue states. Experimental results based on the MultiWoz 2.0 dataset validates the superiority of our proposed model in terms of both complexity and scalability when compared to the state-of-the-art methods, especially in the scenario of multi-domain dialogues entangled with many turns of utterances. 我们提出了一种称为显式模块化分解 (EMD) 的快速且可扩展的架构,其中我们结合了基于分类和基于提取的方法,并设计了四个模块(用于分类和序列标记)来联合提取对话状态。与最先进的方法相比,基于 MultiWoz 2.0 数据集的实验结果验证了我们提出的模型在复杂性和可扩展性方面的优越性,尤其是在多域对话与多轮话语纠缠的情况下. Dingmin Wang Chenghua Lin Li Zhong Kam-Fai Wong
8 NAACL2019 Cross-lingual Transfer Learning for Multilingual Task Oriented Dialog https://arxiv.org/pdf/1810.13327 One of the first steps in the utterance interpretation pipeline of many task-oriented conversational AI systems is to identify user intents and the corresponding slots. Since data collection for machine learning models for this task is time-consuming, it is desirable to make use of existing data in a high-resource language to train models in low-resource languages. However, development of such models has largely been hindered by the lack of multilingual training data. In this paper, we present a new data set of 57k annotated utterances in English (43k), Spanish (8.6k) and Thai (5k) across the domains weather, alarm, and reminder. We use this data set to evaluate three different cross-lingual transfer methods: (1) translating the training data, (2) using cross-lingual pre-trained embeddings, and (3) a novel method of using a multilingual machine translation encoder as contextual word representations. We find that given several hundred training examples in the the target language, the latter two methods outperform translating the training data. Further, in very low-resource settings, multilingual contextual word representations give better results than using cross-lingual static embeddings. We also compare the cross-lingual methods to using monolingual resources in the form of contextual ELMo representations and find that given just small amounts of target language data, this method outperforms all cross-lingual methods, which highlights the need for more sophisticated cross-lingual methods. Sebastian Schuster Sonal Gupta Rushin Shah Mike Lewis
9 NAACL2019 Evaluating Coherence in Dialogue Systems using Entailment https://github.com/nouhadziri/DialogEntailment https://arxiv.org/pdf/1904.03371 Evaluating open-domain dialogue systems is difficult due to the diversity of possible correct answers. Automatic metrics such as BLEU correlate weakly with human annotations, resulting in a significant bias across different models and datasets. Some researchers resort to human judgment experimentation for assessing response quality, which is expensive, time consuming, and not scalable. Moreover, judges tend to evaluate a small number of dialogues, meaning that minor differences in evaluation configuration may lead to dissimilar results. In this paper, we present interpretable metrics for evaluating topic coherence by making use of distributed sentence representations. Furthermore, we introduce calculable approximations of human judgment based on conversational coherence by adopting state-of-the-art entailment techniques. Results show that our metrics can be used as a surrogate for human judgment, making it easy to evaluate dialogue systems on large-scale datasets and allowing an unbiased estimate for the quality of the responses. 由于可能正确答案的多样性,评估开放域对话系统很困难。 BLEU 等自动指标与人工注释的相关性较弱,导致不同模型和数据集之间存在显着偏差。一些研究人员采用人工判断实验来评估响应质量,这是昂贵、耗时且不可扩展的。此外,评委倾向于评价少量对话,这意味着评价配置的微小差异可能导致不同的结果。在本文中,我们通过使用分布式句子表示,提出了用于评估主题连贯性的可解释指标。此外,我们通过采用最先进的蕴含技术,基于对话连贯性引入了人类判断的可计算近似值。结果表明,我们的指标可以用作人类判断的替代品,从而可以轻松评估大规模数据集上的对话系统,并允许对响应质量进行无偏估计。 Nouha Dziri Ehsan Kamalloo Kory W. Mathewson Osmar Zaiane

COLING

序号 会议/期刊 论文 主要技术 代码 论文下载地址 摘要 摘要翻译 作者
1 COLING2020 A Taxonomy of Empathetic Response Intents in Human Social Conversations https://github.com/anuradha1992/EmpatheticIntents https://arxiv.org/pdf/2012.04080 Open-domain conversational agents or chatbots are becoming increasingly popular in the natural language processing community. One of the challenges is enabling them to converse in an empathetic manner. Current neural response generation methods rely solely on end-to-end learning from large scale conversation data to generate dialogues. This approach can produce socially unacceptable responses due to the lack of large-scale quality data used to train the neural models. However, recent work has shown the promise of combining dialogue act/intent modelling and neural response generation. This hybrid method improves the response quality of chatbots and makes them more controllable and interpretable. A key element in dialog intent modelling is the development of a taxonomy. Inspired by this idea, we have manually labeled 500 response intents using a subset of a sizeable empathetic dialogue dataset (25K dialogues). Our goal is to produce a large-scale taxonomy for empathetic response intents. Furthermore, using lexical and machine learning methods, we automatically analysed both speaker and listener utterances of the entire dataset with identified response intents and 32 emotion categories. Finally, we use information visualization methods to summarize emotional dialogue exchange patterns and their temporal progression. These results reveal novel and important empathy patterns in human-human open-domain conversations and can serve as heuristics for hybrid approaches. 开放域对话代理或聊天机器人在自然语言处理社区中变得越来越流行。挑战之一是使他们能够以善解人意的方式交谈。当前的神经响应生成方法仅依赖于从大规模对话数据中进行端到端学习来生成对话。由于缺乏用于训练神经模型的大规模质量数据,这种方法可能会产生社会上无法接受的反应。然而,最近的工作显示了将对话行为/意图建模和神经反应生成相结合的前景。这种混合方法提高了聊天机器人的响应质量,并使它们更加可控和可解释。对话意图建模的一个关键要素是分类法的开发。受这个想法的启发,我们使用一个相当大的移情对话数据集(25K 对话)的子集手动标记了 500 个响应意图。我们的目标是为移情反应意图生成大规模分类法。此外,使用词汇和机器学习方法,我们自动分析了整个数据集的说话者和听者的话语,并确定了响应意图和 32 种情感类别。最后,我们使用信息可视化方法来总结情感对话交流模式及其时间进展。这些结果揭示了人与人开放域对话中新颖而重要的移情模式,可以作为混合方法的启发式方法。 Anuradha Welivita Pearl Pu
2 COLING2020 CEREC: A Corpus for Entity Resolution in Email Conversations https://github.com/paragdakle/emailcoref https://arxiv.org/pdf/2105.10606 We present the first large scale corpus for entity resolution in email conversations (CEREC). The corpus consists of 6001 email threads from the Enron Email Corpus containing 36,448 email messages and 60,383 entity coreference chains. The annotation is carried out as a two-step process with minimal manual effort. Experiments are carried out for evaluating different features and performance of four baselines on the created corpus. For the task of mention identification and coreference resolution, a best performance of 59.2 F1 is reported, highlighting the room for improvement. An in-depth qualitative and quantitative error analysis is presented to understand the limitations of the baselines considered. 我们展示了第一个用于电子邮件对话中实体解析的大规模语料库 (CEREC)。该语料库由来自安然电子邮件语料库的 6001 个电子邮件线程组成,其中包含 36,448 封电子邮件和 60,383 个实体共指链。注释作为两步过程执行,手动操作最少。进行实验以评估四个基线在创建的语料库上的不同特征和性能。对于提及识别和共指解析的任务,报告了 59.2 F1 的最佳性能,突出了改进的空间。提供了深入的定性和定量误差分析,以了解所考虑基线的局限性。 Parag Pravin Dakle Dan I. Moldovan
3 COLING2020 Conversational Machine Comprehension: a Literature Review https://arxiv.org/pdf/2006.00671 Conversational Machine Comprehension (CMC), a research track in conversational AI, expects the machine to understand an open-domain natural language text and thereafter engage in a multi-turn conversation to answer questions related to the text. While most of the research in Machine Reading Comprehension (MRC) revolves around single-turn question answering (QA), multi-turn CMC has recently gained prominence, thanks to the advancement in natural language understanding via neural language models such as BERT and the introduction of large-scale conversational datasets such as CoQA and QuAC. The rise in interest has, however, led to a flurry of concurrent publications, each with a different yet structurally similar modeling approach and an inconsistent view of the surrounding literature. With the volume of model submissions to conversational datasets increasing every year, there exists a need to consolidate the scattered knowledge in this domain to streamline future research. This literature review attempts at providing a holistic overview of CMC with an emphasis on the common trends across recently published models, specifically in their approach to tackling conversational history. The review synthesizes a generic framework for CMC models while highlighting the differences in recent approaches and intends to serve as a compendium of CMC for future researchers. Conversational Machine Comprehension (CMC) 是对话式 AI 的一个研究方向,它希望机器能够理解开放领域的自然语言文本,然后通过多轮对话回答与文本相关的问题。虽然机器阅读理解 (MRC) 的大部分研究都围绕单轮问答 (QA) 展开,但得益于 BERT 等神经语言模型带来的自然语言理解进展,以及 CoQA、QuAC 等大规模对话数据集的引入,多轮 CMC 近来受到越来越多的关注。然而,兴趣的增加也带来了大量同期发表的工作,它们采用各不相同却在结构上相似的建模方法,并且对相关文献的梳理并不一致。随着每年向对话数据集提交的模型数量不断增加,有必要整合该领域的零散知识以便利未来的研究。这篇文献综述试图对 CMC 给出整体概述,重点关注近期发表模型的共同趋势,特别是它们处理对话历史的方式。综述归纳了 CMC 模型的通用框架,同时突出近期方法之间的差异,旨在为未来的研究者提供一份 CMC 纲要。 Somil Gupta Bhanu Pratap Singh Rawat Hong Yu
4 COLING2020 Improving Conversational Question Answering Systems after Deployment using Feedback-Weighted Learning https://github.com/jjacampos/FeedbackWeightedLearning https://arxiv.org/pdf/2011.00615 The interaction of conversational systems with users poses an exciting opportunity for improving them after deployment, but little evidence has been provided of its feasibility. In most applications, users are not able to provide the correct answer to the system, but they are able to provide binary (correct, incorrect) feedback. In this paper we propose feedback-weighted learning based on importance sampling to improve upon an initial supervised system using binary user feedback. We perform simulated experiments on document classification (for development) and Conversational Question Answering datasets like QuAC and DoQA, where binary user feedback is derived from gold annotations. The results show that our method is able to improve over the initial supervised system, getting close to a fully-supervised system that has access to the same labeled examples in in-domain experiments (QuAC), and even matching in out-of-domain experiments (DoQA). Our work opens the prospect to exploit interactions with real users and improve conversational systems after deployment. 对话系统与用户的交互为部署后改进它们提供了一个令人兴奋的机会,但几乎没有提供其可行性的证据。在大多数应用中,用户无法向系统提供正确答案,但他们能够提供二元(正确、不正确)反馈。在本文中,我们提出了基于重要性采样的反馈加权学习,以改进使用二元用户反馈的初始监督系统。我们对文档分类(用于开发)和对话式问答数据集(如 QuAC 和 DoQA)进行模拟实验,其中二进制用户反馈来自黄金注释。结果表明,我们的方法能够改进初始监督系统,接近完全监督系统,该系统可以在域内实验 (QuAC) 中访问相同的标记示例,甚至可以在域外进行匹配实验(DoQA)。我们的工作开辟了利用与真实用户的交互并在部署后改进对话系统的前景。 Jon Ander Campos Kyunghyun Cho Arantxa Otegi Aitor Soroa Gorka Azkune Eneko Agirre
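下面给出一个基于重要性采样思想的反馈加权损失的极简 PyTorch 草图(并非论文的精确加权方案,仅示意"对用户给出正反馈的答案、按旧模型采样概率的倒数加权"这一类做法;张量与数值均为示意):

```python
import torch
import torch.nn.functional as F

def feedback_weighted_loss(logits, sampled_answers, feedback, sampling_probs):
    """反馈加权学习的粗略示意:答案由旧模型以 sampling_probs 采样得到,
    用户给出二元反馈;正反馈样本的负对数似然按 1 / sampling_prob 加权。"""
    nll = F.cross_entropy(logits, sampled_answers, reduction="none")
    weights = feedback.float() / sampling_probs.clamp_min(1e-6)   # 重要性权重
    return (weights * nll).mean()

logits = torch.randn(4, 10, requires_grad=True)      # 当前模型对 10 个候选答案的打分
answers = torch.tensor([3, 7, 1, 3])                  # 展示给用户的答案
feedback = torch.tensor([1, 0, 1, 1])                 # 二元反馈:正确 / 错误
probs = torch.tensor([0.4, 0.2, 0.1, 0.5])            # 旧模型采样这些答案的概率
loss = feedback_weighted_loss(logits, answers, feedback, probs)
loss.backward()
print(float(loss))
```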
5 COLING2020 Towards Topic-Guided Conversational Recommender System https://github.com/RUCAIBox/TG-ReDial https://arxiv.org/pdf/2010.04125 Conversational recommender systems (CRS) aim to recommend high-quality items to users through interactive conversations. To develop an effective CRS, the support of high-quality datasets is essential. Existing CRS datasets mainly focus on immediate requests from users, while lack proactive guidance to the recommendation scenario. In this paper, we contribute a new CRS dataset named \textbf{TG-ReDial} (\textbf{Re}commendation through \textbf{T}opic-\textbf{G}uided \textbf{Dial}og). Our dataset has two major features. First, it incorporates topic threads to enforce natural semantic transitions towards the recommendation scenario. Second, it is created in a semi-automatic way, hence human annotation is more reasonable and controllable. Based on TG-ReDial, we present the task of topic-guided conversational recommendation, and propose an effective approach to this task. Extensive experiments have demonstrated the effectiveness of our approach on three sub-tasks, namely topic prediction, item recommendation and response generation. TG-ReDial is available at https://github.com/RUCAIBox/TG-ReDial. 会话推荐系统(CRS)旨在通过交互式会话向用户推荐高质量的项目。要开发有效的 CRS,高质量数据集的支持必不可少。现有的 CRS 数据集主要关注用户的即时请求,而缺乏对推荐场景的主动指导。在本文中,我们贡献了一个名为 \textbf{TG-ReDial} 的新 CRS 数据集(\textbf{Re}commendation through \textbf{T}opic-\textbf{G}uided \textbf{Dial}og)。我们的数据集有两个主要特征。首先,它结合了主题线程来强制向推荐场景进行自然语义转换。其次,采用半自动方式创建,人工标注更加合理可控。基于TG-ReDial,我们提出了主题引导的对话推荐任务,并提出了一种有效的方法来完成这项任务。大量实验证明了我们的方法在三个子任务上的有效性,即主题预测、项目推荐和响应生成。 TG-ReDial 可在 https://github.com/RUCAIBox/TG-ReDial 获得。 Kun Zhou Yuanhang Zhou Wayne Xin Zhao Xiaoke Wang Ji-Rong Wen
6 COLING2020 Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems https://github.com/vitouphy/usl_dialogue_metric https://arxiv.org/pdf/2011.00483 Many automatic evaluation metrics have been proposed to score the overall quality of a response in open-domain dialogue. Generally, the overall quality is comprised of various aspects, such as relevancy, specificity, and empathy, and the importance of each aspect differs according to the task. For instance, specificity is mandatory in a food-ordering dialogue task, whereas fluency is preferred in a language-teaching dialogue system. However, existing metrics are not designed to cope with such flexibility. For example, BLEU score fundamentally relies only on word overlapping, whereas BERTScore relies on semantic similarity between reference and candidate response. Thus, they are not guaranteed to capture the required aspects, i.e., specificity. To design a metric that is flexible to a task, we first propose making these qualities manageable by grouping them into three groups: understandability, sensibleness, and likability, where likability is a combination of qualities that are essential for a task. We also propose a simple method to composite metrics of each aspect to obtain a single metric called USL-H, which stands for Understandability, Sensibleness, and Likability in Hierarchy. We demonstrated that USL-H score achieves good correlations with human judgment and maintains its configurability towards different aspects and metrics. 已经提出了许多自动评估指标来对开放域对话中响应的整体质量进行评分。一般来说,整体素质由相关性、特异性、同理心等多个方面组成,每个方面的重要性因任务而异。例如,在订餐对话任务中,特定性是强制性的,而在语言教学对话系统中,流畅性是首选。然而,现有的指标并不是为了应对这种灵活性而设计的。例如,BLEU 分数从根本上只依赖于单词重叠,而 BERTScore 依赖于参考和候选响应之间的语义相似性。因此,它们不能保证捕获所需的方面,即特异性。为了设计一个对任务灵活的度量,我们首先建议通过将这些品质分为三组来使这些品质易于管理:可理解性、合理性和可爱性,其中可爱性是任务必不可少的品质的组合。我们还提出了一种简单的方法来组合每个方面的指标,以获得一个称为 USL-H 的单一指标,它代表层次结构中的可理解性、敏感性和可爱性。我们证明 USL-H 分数与人类判断具有良好的相关性,并保持其对不同方面和指标的可配置性。 Vitou Phy Yang Zhao Akiko Aizawa
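下面用一个假设性的打分函数示意"按层级组合各方面指标"的思路(论文中 USL-H 的具体组合方式可能不同,此处仅演示:低层指标作为门控,可配置的权重体现任务偏好):

```python
def usl_h(understandable, sensible, likable, weights=(1.0, 1.0, 1.0)):
    """假设性的层级组合:下层指标门控上层指标,
    即只有可理解、合理的响应才能获得好感度得分。"""
    u, s, l = understandable, sensible, likable
    scores = [u, u * s, u * s * l]                 # 层级:U -> S -> L
    total = sum(w * x for w, x in zip(weights, scores))
    return total / sum(weights)

# 各方面得分取值 [0, 1],可由三个独立的子指标给出
print(usl_h(0.9, 0.8, 0.6))                        # 流畅、合理、较讨喜
print(usl_h(0.2, 0.9, 0.9))                        # 难以理解时整体得分被拉低
print(usl_h(0.9, 0.8, 0.6, weights=(1, 1, 3)))     # 更看重好感度的任务配置
```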
7 COLING2020 Recent Neural Methods on Slot Filling and Intent Classification for Task-Oriented Dialogue Systems: A Survey https://arxiv.org/pdf/2011.00564 In recent years, fostered by deep learning technologies and by the high demand for conversational AI, various approaches have been proposed that address the capacity to elicit and understand user’s needs in task-oriented dialogue systems. We focus on two core tasks, slot filling (SF) and intent classification (IC), and survey how neural-based models have rapidly evolved to address natural language understanding in dialogue systems. We introduce three neural architectures: independent model, which model SF and IC separately, joint models, which exploit the mutual benefit of the two tasks simultaneously, and transfer learning models, that scale the model to new domains. We discuss the current state of the research in SF and IC and highlight challenges that still require attention. 近年来,在深度学习技术和对对话式 AI 的高需求的推动下,已经提出了各种方法来解决在面向任务的对话系统中引发和理解用户需求的能力。我们专注于两个核心任务,槽填充 (SF) 和意图分类 (IC),并调查基于神经的模型如何快速发展以解决对话系统中的自然语言理解问题。我们引入了三种神经架构:独立模型,分别对 SF 和 IC 进行建模,联合模型,同时利用两个任务的互利,以及迁移学习模型,将模型扩展到新领域。我们讨论了 SF 和 IC 的研究现状,并强调了仍然需要关注的挑战。 Samuel Louvan Bernardo Magnini

文本生成

ACL

序号 会议/期刊 论文 主要技术 代码 论文下载地址 摘要 摘要翻译 作者
1 ACL2021 Generalising Multilingual Concept-to-Text NLG with Language Agnostic Delexicalisation https://arxiv.org/pdf/2105.03432 Concept-to-text Natural Language Generation is the task of expressing an input meaning representation in natural language. Previous approaches in this task have been able to generalise to rare or unseen instances by relying on a delexicalisation of the input. However, this often requires that the input appears verbatim in the output text. This poses challenges in multilingual settings, where the task expands to generate the output text in multiple languages given the same input. In this paper, we explore the application of multilingual models in concept-to-text and propose Language Agnostic Delexicalisation, a novel delexicalisation method that uses multilingual pretrained embeddings, and employs a character-level post-editing model to inflect words in their correct form during relexicalisation. Our experiments across five datasets and five languages show that multilingual models outperform monolingual models in concept-to-text and that our framework outperforms previous approaches, especially for low resource languages. 概念到文本的自然语言生成是用自然语言表达输入意义表示的任务。此前的方法通过对输入进行去词化(delexicalisation)来泛化到罕见或未见过的实例。但是,这通常要求输入在输出文本中逐字出现。这给多语言场景带来了挑战:在相同输入下,任务被扩展为以多种语言生成输出文本。在本文中,我们探索了多语言模型在概念到文本任务中的应用,并提出了 Language Agnostic Delexicalisation,这是一种新的去词化方法,它使用多语言预训练嵌入,并采用字符级后编辑模型,在重新词化(relexicalisation)过程中将单词变形为正确的形式。我们在五个数据集和五种语言上的实验表明,多语言模型在概念到文本任务中优于单语模型,并且我们的框架优于以往方法,尤其是对低资源语言。 Giulio Zhou Gerasimos Lampouras
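下面是去词化 / 重新词化这一通用流程的极简示意(并非论文提出的语言无关去词化及字符级后编辑模型,仅基于字符串替换;意义表示 mr 与示例文本均为假设):

```python
import re

def delexicalise(text, mr):
    """把意义表示中的属性值替换为槽位占位符,让生成器看到诸如 '<name> serves <food> food' 的模板。"""
    for slot, value in mr.items():
        text = re.sub(re.escape(value), f"<{slot}>", text, flags=re.IGNORECASE)
    return text

def relexicalise(template, mr):
    """把属性值填回模板;论文在此处还用字符级后编辑模型处理词形变化,这里省略。"""
    for slot, value in mr.items():
        template = template.replace(f"<{slot}>", value)
    return template

mr = {"name": "The Eagle", "food": "Italian"}
ref = "The Eagle serves Italian food in the city centre."
tpl = delexicalise(ref, mr)
print(tpl)                      # '<name> serves <food> food in the city centre.'
print(relexicalise(tpl, mr))    # 还原为原始表层形式
```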
2 ACL2021 Prefix-Tuning: Optimizing Continuous Prompts for Generation https://arxiv.org/pdf/2101.00190 Fine-tuning is the de facto way to leverage large pretrained language models to perform downstream tasks. However, it modifies all the language model parameters and therefore necessitates storing a full copy for each task. In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix). Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were “virtual tokens”. We apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. We find that by learning only 0.1\% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training. 微调是利用大型预训练语言模型执行下游任务的事实上的方法。然而,它修改了所有语言模型参数,因此需要为每个任务存储一个完整的副本。在本文中,我们提出了前缀调整,这是自然语言生成任务微调的轻量级替代方案,它保持语言模型参数冻结,但优化了一个小的连续任务特定向量(称为前缀)。前缀调整从提示中汲取灵感,允许后续标记关注这个前缀,就好像它是“虚拟标记”一样。我们将前缀调整应用于 GPT-2 以生成表格到文本,并应用于 BART 以进行汇总。我们发现,通过仅学习 0.1% 的参数,前缀调整在全数据设置中获得了可比的性能,在低数据设置中优于微调,并且可以更好地外推到在训练期间未见过的主题的示例。 Xiang Lisa Li Percy Liang
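下面是前缀调整核心思想的一个极简 PyTorch 草图:冻结的语言模型参数保持不变,只优化一段前置的"虚拟词元"嵌入。这里用一个小型 GRU 语言模型充当被冻结的预训练模型(真实的 Prefix-Tuning 是把前缀注入 GPT-2/BART 每一层注意力的 key/value,细节与此不同;模型结构与维度均为示意):

```python
import torch
import torch.nn as nn

vocab, d = 100, 32
# 充当"大型预训练语言模型"的替身:嵌入 + GRU + 输出层,全部冻结
embed = nn.Embedding(vocab, d)
lm = nn.GRU(d, d, batch_first=True)
head = nn.Linear(d, vocab)
for module in (embed, lm, head):
    for p in module.parameters():
        p.requires_grad_(False)

# 唯一可训练的参数:一小段"虚拟词元"前缀嵌入
prefix_len = 5
prefix = nn.Parameter(torch.randn(1, prefix_len, d) * 0.02)

def forward(input_ids):
    x = embed(input_ids)                                   # (B, T, d)
    x = torch.cat([prefix.expand(x.size(0), -1, -1), x], dim=1)
    h, _ = lm(x)
    return head(h)[:, prefix_len:, :]                      # 只取真实词元位置的 logits

opt = torch.optim.Adam([prefix], lr=1e-3)                  # 只优化极小比例的参数
ids = torch.randint(0, vocab, (4, 10))
logits = forward(ids[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), ids[:, 1:].reshape(-1))
loss.backward(); opt.step()
print(float(loss))
```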
3 ACL2021 Polyjuice: Generating Counterfactuals for Explaining, Evaluating, and Improving Models https://github.com/tongshuangwu/polyjuice https://arxiv.org/pdf/2101.00288 While counterfactual examples are useful for analysis and training of NLP models, current generation methods either rely on manual labor to create very few counterfactuals, or only instantiate limited types of perturbations such as paraphrases or word substitutions. We present Polyjuice, a general-purpose counterfactual generator that allows for control over perturbation types and locations, trained by finetuning GPT-2 on multiple datasets of paired sentences. We show that Polyjuice produces diverse sets of realistic counterfactuals, which in turn are useful in various distinct applications: improving training and evaluation on three different tasks (with around 70% less annotation effort than manual generation), augmenting state-of-the-art explanation techniques, and supporting systematic counterfactual error analysis by revealing behaviors easily missed by human experts. 虽然反事实示例对于 NLP 模型的分析和训练很有用,但当前的生成方法要么依靠手工劳动来创建很少的反事实,要么仅实例化有限类型的扰动,例如释义或单词替换。我们展示了 Polyjuice,一种通用的反事实生成器,允许控制扰动类型和位置,通过在多个成对句子数据集上微调 GPT-2 进行训练。我们展示了 Polyjuice 产生了多种真实的反事实,这些反事实反过来又可用于各种不同的应用:改进对三个不同任务的训练和评估(比手动生成减少大约 70% 的注释工作),增强最先进的技术解释技术,并通过揭示人类专家容易遗漏的行为来支持系统的反事实错误分析。 Tongshuang Wu Marco Tulio Ribeiro Jeffrey Heer Daniel S. Weld
4 ACL2021 Conditional Generation of Temporally-ordered Event Sequences https://arxiv.org/pdf/2012.15786 Models of narrative schema knowledge have proven useful for a range of event-related tasks, but they typically do not capture the temporal relationships between events. We propose a single model that addresses both temporal ordering, sorting given events into the order they occurred, and event infilling, predicting new events which fit into an existing temporally-ordered sequence. We use a BART-based conditional generation model that can capture both temporality and common event co-occurrence, meaning it can be flexibly applied to different tasks in this space. Our model is trained as a denoising autoencoder: we take temporally-ordered event sequences, shuffle them, delete some events, and then attempt to recover the original event sequence. This task teaches the model to make inferences given incomplete knowledge about the events in an underlying scenario. On the temporal ordering task, we show that our model is able to unscramble event sequences from existing datasets without access to explicitly labeled temporal training data, outperforming both a BERT-based pairwise model and a BERT-based pointer network. On event infilling, human evaluation shows that our model is able to generate events that fit better temporally into the input events when compared to GPT-2 story completion models. Shih-Ting Lin Nathanael Chambers Greg Durrett
5 ACL2021 Writing by Memorizing: Hierarchical Retrieval-based Medical Report Generation https://arxiv.org/pdf/2106.06471 Medical report generation is one of the most challenging tasks in medical image analysis. Although existing approaches have achieved promising results, they either require a predefined template database in order to retrieve sentences or ignore the hierarchical nature of medical report generation. To address these issues, we propose MedWriter that incorporates a novel hierarchical retrieval mechanism to automatically extract both report and sentence-level templates for clinically accurate report generation. MedWriter first employs the Visual-Language Retrieval~(VLR) module to retrieve the most relevant reports for the given images. To guarantee the logical coherence between sentences, the Language-Language Retrieval~(LLR) module is introduced to retrieve relevant sentences based on the previous generated description. At last, a language decoder fuses image features and features from retrieved reports and sentences to generate meaningful medical reports. We verified the effectiveness of our model by automatic evaluation and human evaluation on two datasets, i.e., Open-I and MIMIC-CXR. 医学报告生成是医学图像分析中最具挑战性的任务之一。尽管现有方法已经取得了可喜的成果,但它们要么需要一个预定义的模板数据库来检索句子,要么忽略医疗报告生成的层次性。为了解决这些问题,我们提出了 MedWriter,它结合了一种新颖的分层检索机制,可以自动提取报告和句子级模板,以生成临床准确的报告。 MedWriter 首先使用 Visual-Language Retrieval~(VLR) 模块来检索给定图像的最相关报告。为了保证句子之间的逻辑连贯性,引入了Language-Language Retrieval~(LLR)模块,根据之前生成的描述检索相关句子。最后,语言解码器将图像特征与检索到的报告和句子的特征融合,以生成有意义的医学报告。我们通过对两个数据集(即 Open-I 和 MIMIC-CXR)的自动评估和人工评估来验证我们模型的有效性。 Xingyi Yang Muchao Ye Quanzeng You Fenglong Ma
6 ACL2021 Factorising Meaning and Form for Intent-Preserving Paraphrasing https://github.com/tomhosking/separator https://arxiv.org/pdf/2105.15053 We propose a method for generating paraphrases of English questions that retain the original intent but use a different surface form. Our model combines a careful choice of training objective with a principled information bottleneck, to induce a latent encoding space that disentangles meaning and form. We train an encoder-decoder model to reconstruct a question from a paraphrase with the same meaning and an exemplar with the same surface form, leading to separated encoding spaces. We use a Vector-Quantized Variational Autoencoder to represent the surface form as a set of discrete latent variables, allowing us to use a classifier to select a different surface form at test time. Crucially, our method does not require access to an external source of target exemplars. Extensive experiments and a human evaluation show that we are able to generate paraphrases with a better tradeoff between semantic preservation and syntactic novelty compared to previous methods. 我们提出了一种生成保留原始意图但使用不同表面形式的英语问题释义的方法。我们的模型将精心选择的训练目标与原则性的信息瓶颈相结合,以诱导潜在的编码空间,将意义和形式分开。我们训练一个编码器-解码器模型,以从具有相同含义的释义和具有相同表面形式的示例中重建问题,从而导致分离的编码空间。我们使用矢量量化变分自编码器将表面形式表示为一组离散的潜在变量,允许我们在测试时使用分类器选择不同的表面形式。至关重要的是,我们的方法不需要访问目标示例的外部源。广泛的实验和人工评估表明,与以前的方法相比,我们能够生成在语义保留和句法新颖性之间具有更好权衡的释义。 Tom Hosking Mirella Lapata
7 ACL2021 Improving Formality Style Transfer with Context-Aware Rule Injection https://arxiv.org/pdf/2106.00210 Models pre-trained on large-scale regular text corpora often do not work well for user-generated data where the language styles differ significantly from the mainstream text. Here we present Context-Aware Rule Injection (CARI), an innovative method for formality style transfer (FST). CARI injects multiple rules into an end-to-end BERT-based encoder and decoder model. It learns to select optimal rules based on context. The intrinsic evaluation showed that CARI achieved the new highest performance on the FST benchmark dataset. Our extrinsic evaluation showed that CARI can greatly improve the regular pre-trained models’ performance on several tweet sentiment analysis tasks. 在大规模常规文本语料库上预训练的模型通常不适用于语言风格与主流文本显着不同的用户生成数据。在这里,我们提出了上下文感知规则注入 (CARI),这是一种形式风格转移 (FST) 的创新方法。 CARI 将多个规则注入基于端到端 BERT 的编码器和解码器模型。它学习根据上下文选择最佳规则。内在评估表明,CARI 在 FST 基准数据集上取得了新的最高性能。我们的外部评估表明,CARI 可以极大地提高常规预训练模型在多项推文情感分析任务上的性能。 Zonghai Yao Hong Yu
8 ACL2021 DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling https://arxiv.org/pdf/2107.01875 Rap generation, which aims to produce lyrics and corresponding singing beats, needs to model both rhymes and rhythms. Previous works for rap generation focused on rhyming lyrics but ignored rhythmic beats, which are important for rap performance. In this paper, we develop DeepRapper, a Transformer-based rap generation system that can model both rhymes and rhythms. Since there is no available rap dataset with rhythmic beats, we develop a data mining pipeline to collect a large-scale rap dataset, which includes a large number of rap songs with aligned lyrics and rhythmic beats. Second, we design a Transformer-based autoregressive language model which carefully models rhymes and rhythms. Specifically, we generate lyrics in the reverse order with rhyme representation and constraint for rhyme enhancement and insert a beat symbol into lyrics for rhythm/beat modeling. To our knowledge, DeepRapper is the first system to generate rap with both rhymes and rhythms. Both objective and subjective evaluations demonstrate that DeepRapper generates creative and high-quality raps with rhymes and rhythms. Code will be released on GitHub. Rap 生成旨在生成歌词和相应的歌唱节拍,需要对韵律和节奏进行建模。以前的说唱创作专注于押韵歌词,但忽略了对说唱表演很重要的节奏节拍。在本文中,我们开发了 DeepRapper,这是一种基于 Transformer 的说唱生成系统,可以对韵律和节奏进行建模。由于没有可用的有节奏节拍的说唱数据集,我们开发了一个数据挖掘管道来收集大规模的说唱数据集,其中包括大量具有对齐歌词和节奏节拍的说唱歌曲。其次,我们设计了一个基于 Transformer 的自回归语言模型,它仔细地对韵律和节奏进行建模。具体来说,我们以相反的顺序生成带有韵律表示和韵律增强约束的歌词,并在歌词中插入节拍符号以进行节奏/节拍建模。据我们所知,DeepRapper 是第一个同时生成韵律和节奏的说唱系统。客观和主观的评价都表明,DeepRapper 能够创作出具有韵律和节奏感的创造性和高质量的说唱。代码将在 GitHub 上发布。 Lanqing Xue Kaitao Song Duocai Wu Xu Tan Nevin L. Zhang Tao Qin Wei-Qiang Zhang Tie-Yan Liu
9 ACL2021 Generating Landmark Navigation Instructions from Maps as a Graph-to-Text Problem https://arxiv.org/pdf/2012.15329 Car-focused navigation services are based on turns and distances of named streets, whereas navigation instructions naturally used by humans are centered around physical objects called landmarks. We present a neural model that takes OpenStreetMap representations as input and learns to generate navigation instructions that contain visible and salient landmarks from human natural language instructions. Routes on the map are encoded in a location- and rotation-invariant graph representation that is decoded into natural language instructions. Our work is based on a novel dataset of 7,672 crowd-sourced instances that have been verified by human navigation in Street View. Our evaluation shows that the navigation instructions generated by our system have similar properties as human-generated instructions, and lead to successful human navigation in Street View. 以汽车为中心的导航服务基于命名街道的转弯和距离,而人类自然使用的导航指令则以称为地标的物理对象为中心。我们提出了一个神经模型,该模型将 OpenStreetMap 表示作为输入,并学习生成包含来自人类自然语言指令的可见和显着地标的导航指令。地图上的路线以位置和旋转不变的图形表示进行编码,该图形表示被解码为自然语言指令。我们的工作基于一个包含 7,672 个众包实例的新数据集,这些实例已通过街景中的人工导航进行验证。我们的评估表明,我们系统生成的导航指令与人工生成的指令具有相似的特性,并导致街景中的人工导航成功。 Raphael Schumann Stefan Riezler
10 ACL2021 One2Set: Generating Diverse Keyphrases as a Set https://github.com/jiacheng-ye/kg_one2set https://arxiv.org/pdf/2105.11134 Recently, the sequence-to-sequence models have made remarkable progress on the task of keyphrase generation (KG) by concatenating multiple keyphrases in a predefined order as a target sequence during training. However, the keyphrases are inherently an unordered set rather than an ordered sequence. Imposing a predefined order will introduce wrong bias during training, which can highly penalize shifts in the order between keyphrases. In this work, we propose a new training paradigm One2Set without predefining an order to concatenate the keyphrases. To fit this paradigm, we propose a novel model that utilizes a fixed set of learned control codes as conditions to generate a set of keyphrases in parallel. To solve the problem that there is no correspondence between each prediction and target during training, we propose a K-step target assignment mechanism via bipartite matching, which greatly increases the diversity and reduces the duplication ratio of generated keyphrases. The experimental results on multiple benchmarks demonstrate that our approach significantly outperforms the state-of-the-art methods. 最近,序列到序列模型通过在训练期间以预定义的顺序连接多个关键短语作为目标序列,在关键短语生成 (KG) 任务上取得了显着进展。然而,关键短语本质上是一个无序的集合,而不是一个有序的序列。强加预定义的顺序会在训练期间引入错误的偏差,从而对关键短语之间顺序的变化施加过重的惩罚。在这项工作中,我们提出了一种新的训练范式 One2Set,无需预先定义关键短语的连接顺序。为了适应这种范式,我们提出了一种新模型,该模型利用一组固定的、学习得到的控制代码作为条件来并行生成一组关键短语。为了解决训练过程中每个预测和目标之间没有对应关系的问题,我们提出了一种基于二部匹配的 K 步目标分配机制,大大增加了生成关键短语的多样性并降低了重复率。多个基准的实验结果表明,我们的方法明显优于最先进的方法。 Jiacheng Ye Tao Gui Yichao Luo Yige Xu Qi Zhang
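以下是对“通过二部匹配为各预测槽位分配目标关键短语”这一思想的最小示意。假设每个槽位对每个真实关键短语的匹配得分已由模型给出(此处用手写矩阵代替),并非论文完整的 K 步分配实现:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def assign_targets(scores, targets):
        """scores: (num_slots, num_targets) 槽位-目标匹配得分,越大越好。
        返回每个槽位匹配到的目标短语;未匹配到的槽位为 None,应学习预测“空”。"""
        rows, cols = linear_sum_assignment(-scores)  # 匈牙利算法求总得分最大的匹配
        assignment = [None] * scores.shape[0]
        for r, c in zip(rows, cols):
            assignment[r] = targets[c]
        return assignment

    scores = np.array([[0.9, 0.1, 0.2],
                       [0.2, 0.8, 0.3],
                       [0.1, 0.2, 0.7],
                       [0.3, 0.1, 0.1]])  # 4 个控制代码槽位、3 个真实关键短语
    print(assign_targets(scores, ["neural network", "keyphrase generation", "seq2seq"]))

按摘要的描述,论文中的匹配得分来自各槽位前若干解码步的预测,这里将其抽象为已给定的得分矩阵。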
11 ACL2020 Distilling Knowledge Learned in BERT for Text Generation https://github.com/ChenRocks/Distill-BERT-Textgen https://arxiv.org/pdf/1911.03829 Large-scale pre-trained language model such as BERT has achieved great success in language understanding tasks. However, it remains an open question how to utilize BERT for language generation. In this paper, we present a novel approach, Conditional Masked Language Modeling (C-MLM), to enable the finetuning of BERT on target generation tasks. The finetuned BERT (teacher) is exploited as extra supervision to improve conventional Seq2Seq models (student) for better text generation performance. By leveraging BERT’s idiosyncratic bidirectional nature, distilling knowledge learned in BERT can encourage auto-regressive Seq2Seq models to plan ahead, imposing global sequence-level supervision for coherent text generation. Experiments show that the proposed approach significantly outperforms strong Transformer baselines on multiple language generation tasks such as machine translation and text summarization. Our proposed model also achieves new state of the art on IWSLT German-English and English-Vietnamese MT datasets. Code is available at https://github.com/ChenRocks/Distill-BERT-Textgen. BERT 等大规模预训练语言模型在语言理解任务中取得了巨大成功。然而,如何利用 BERT 进行语言生成仍然是一个悬而未决的问题。在本文中,我们提出了一种新方法,即条件掩码语言建模 (C-MLM),以实现 BERT 在目标生成任务上的微调。微调的 BERT(教师)被用作额外的监督来改进传统的 Seq2Seq 模型(学生)以获得更好的文本生成性能。通过利用 BERT 的特殊双向性质,提炼在 BERT 中学到的知识可以鼓励自回归 Seq2Seq 模型提前计划,为连贯文本生成施加全局序列级监督。实验表明,所提出的方法在机器翻译和文本摘要等多语言生成任务上明显优于强 Transformer 基线。我们提出的模型还在 IWSLT 德语-英语和英语-越南语 MT 数据集上达到了最新的技术水平。代码可在 https://github.com/ChenRocks/Distill-BERT-Textgen 获得。 Yen-Chun Chen Zhe Gan Yu Cheng Jingzhou Liu Jingjing Liu
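下述代码只演示“把教师分布蒸馏给学生”这一通用损失形式(温度缩放的 KL 散度),用随机张量代替真实的 C-MLM 教师与 Seq2Seq 学生的 logits,并非论文实现:

    import torch
    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, temperature=2.0):
        """student_logits, teacher_logits: (batch, seq_len, vocab)"""
        t = temperature
        teacher_prob = F.softmax(teacher_logits / t, dim=-1)
        student_logprob = F.log_softmax(student_logits / t, dim=-1)
        # KL(teacher || student),再按温度平方缩放(常见的蒸馏写法)
        return F.kl_div(student_logprob, teacher_prob, reduction="batchmean") * (t * t)

    student = torch.randn(2, 5, 1000, requires_grad=True)  # 学生(自回归 Seq2Seq)的占位 logits
    teacher = torch.randn(2, 5, 1000)                       # 教师(微调后的 BERT)的占位 logits
    loss = distill_loss(student, teacher)
    loss.backward()
    print(float(loss))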
12 ACL2020 Rigid Formats Controlled Text Generation https://github.com/lipiji/SongNet https://arxiv.org/pdf/2004.08022 Neural text generation has made tremendous progress in various tasks. One common characteristic of most of the tasks is that the texts are not restricted to some rigid formats when generating. However, we may confront some special text paradigms such as Lyrics (assume the music score is given), Sonnet, SongCi (classical Chinese poetry of the Song dynasty), etc. The typical characteristics of these texts are in three folds: (1) They must comply fully with the rigid predefined formats. (2) They must obey some rhyming schemes. (3) Although they are restricted to some formats, the sentence integrity must be guaranteed. To the best of our knowledge, text generation based on the predefined rigid formats has not been well investigated. Therefore, we propose a simple and elegant framework named SongNet to tackle this problem. The backbone of the framework is a Transformer-based auto-regressive language model. Sets of symbols are tailor-designed to improve the modeling performance especially on format, rhyme, and sentence integrity. We improve the attention mechanism to impel the model to capture some future information on the format. A pre-training and fine-tuning framework is designed to further improve the generation quality. Extensive experiments conducted on two collected corpora demonstrate that our proposed framework generates significantly better results in terms of both automatic metrics and the human evaluation. 神经文本生成在各种任务中取得了巨大的进步。大多数任务的一个共同特征是文本在生成时不受某些严格格式的限制。但是,我们可能会遇到一些特殊的文本范式,例如歌词(假设给定乐谱)、十四行诗、宋词(宋代中国古典诗歌)等。这些文本的典型特征有三方面:(1)它们必须完全符合严格的预定义格式。 (2) 他们必须遵守一些押韵方案。 (3) 虽然限于某些格式,但必须保证句子的完整性。据我们所知,基于预定义的刚性格式的文本生成尚未得到很好的研究。因此,我们提出了一个简单而优雅的框架 SongNet 来解决这个问题。该框架的主干是一个基于 Transformer 的自回归语言模型。符号集是量身定制的,以提高建模性能,尤其是在格式、韵律和句子完整性方面。我们改进了注意力机制以促使模型捕获有关格式的一些未来信息。预训练和微调框架旨在进一步提高生成质量。在两个收集的语料库上进行的大量实验表明,我们提出的框架在自动度量和人工评估方面都产生了明显更好的结果。 Piji Li Haisong Zhang Xiaojiang Liu Shuming Shi
13 ACL2020 Semantic Graphs for Generating Deep Questions https://github.com/WING-NUS/SG-Deep-Question-Generation https://arxiv.org/pdf/2004.12704 This paper proposes the problem of Deep Question Generation (DQG), which aims to generate complex questions that require reasoning over multiple pieces of information of the input passage. In order to capture the global structure of the document and facilitate reasoning, we propose a novel framework which first constructs a semantic-level graph for the input document and then encodes the semantic graph by introducing an attention-based GGNN (Att-GGNN). Afterwards, we fuse the document-level and graph-level representations to perform joint training of content selection and question decoding. On the HotpotQA deep-question centric dataset, our model greatly improves performance over questions requiring reasoning over multiple facts, leading to state-of-the-art performance. The code is publicly available at https://github.com/WING-NUS/SG-Deep-Question-Generation. 本文提出了深度问题生成 (DQG) 问题,旨在生成需要对输入段落的多条信息进行推理的复杂问题。为了捕获文档的全局结构并促进推理,我们提出了一种新颖的框架,该框架首先为输入文档构建语义级图,然后通过引入基于注意力的 GGNN(Att-GGNN)对语义图进行编码。之后,我们融合文档级和图形级表示来执行内容选择和问题解码的联合训练。在 HotpotQA 以深度问题为中心的数据集上,我们的模型大大提高了需要对多个事实进行推理的问题的性能,从而达到最先进的性能。该代码可在 https://github.com/WING-NUS/SG-Deep-Question-Generation 上公开获得。 Liangming Pan Yuxi Xie Yansong Feng Tat-Seng Chua Min-Yen Kan
14 ACL2020 Politeness Transfer: A Tag and Generate Approach https://arxiv.org/pdf/2004.14257 This paper introduces a new task of politeness transfer which involves converting non-polite sentences to polite sentences while preserving the meaning. We also provide a dataset of more than 1.39 million instances automatically labeled for politeness to encourage benchmark evaluations on this new task. We design a tag and generate pipeline that identifies stylistic attributes and subsequently generates a sentence in the target style while preserving most of the source content. For politeness as well as five other transfer tasks, our model outperforms the state-of-the-art methods on automatic metrics for content preservation, with a comparable or better performance on style transfer accuracy. Additionally, our model surpasses existing methods on human evaluations for grammaticality, meaning preservation and transfer accuracy across all the six style transfer tasks. The data and code is located at https://github.com/tag-and-generate. 本文介绍了一种礼貌迁移的新任务,即在保留语义的同时将非礼貌句子转换为礼貌句子。我们还提供了一个包含超过 139 万个自动标注礼貌程度实例的数据集,以鼓励对这项新任务进行基准评估。我们设计了一个“标记并生成”(tag and generate)流水线,先识别风格属性,随后在保留大部分源内容的同时以目标风格生成句子。对于礼貌以及其他五个迁移任务,我们的模型在内容保留的自动指标上优于最先进的方法,在风格迁移准确性上具有可比或更好的性能。此外,在所有六种风格迁移任务上,我们的模型在语法性、意义保留和迁移准确性的人工评估中均超越了现有方法。数据和代码位于 https://github.com/tag-and-generate。 Aman Madaan Amrith Setlur Tanmay Parekh Barnabas Poczos Graham Neubig Yiming Yang Ruslan Salakhutdinov Alan W Black Shrimai Prabhumoye
15 ACL2020 GPT-too: A language-model-first approach for AMR-to-text generation https://github.com/IBM/GPT-too-AMR2text https://arxiv.org/pdf/2005.09123 Abstract Meaning Representations (AMRs) are broad-coverage sentence-level semantic graphs. Existing approaches to generating text from AMR have focused on training sequence-to-sequence or graph-to-sequence models on AMR annotated data only. In this paper, we propose an alternative approach that combines a strong pre-trained language model with cycle consistency-based re-scoring. Despite the simplicity of the approach, our experimental results show these models outperform all previous techniques on the English LDC2017T10 dataset, including the recent use of transformer architectures. In addition to the standard evaluation metrics, we provide human evaluation experiments that further substantiate the strength of our approach. 抽象意义表示(AMR)是覆盖面广的句子级语义图。现有的从 AMR 生成文本的方法侧重于仅在 AMR 标注数据上训练序列到序列或图到序列模型。在本文中,我们提出了一种替代方法,将强大的预训练语言模型与基于循环一致性的重新评分相结合。尽管该方法很简单,但我们的实验结果表明,这些模型在英语 LDC2017T10 数据集上的表现优于之前的所有技术,包括近来采用的 Transformer 架构。除了标准的评估指标外,我们还提供了人工评估实验,进一步证实了我们方法的优势。 Manuel Mager Ramon Fernandez Astudillo Tahira Naseem Md Arafat Sultan Young-Suk Lee Radu Florian Salim Roukos
16 ACL2020 Posterior Control of Blackbox Generation https://github.com/XiangLi1999/PosteriorControl-NLG https://arxiv.org/pdf/2005.04560 Text generation often requires high-precision output that obeys task-specific rules. This fine-grained control is difficult to enforce with off-the-shelf deep learning models. In this work, we consider augmenting neural generation models with discrete control states learned through a structured latent-variable approach. Under this formulation, task-specific knowledge can be encoded through a range of rich, posterior constraints that are effectively trained into the model. This approach allows users to ground internal model decisions based on prior knowledge, without sacrificing the representational power of neural generative models. Experiments consider applications of this approach for text generation. We find that this method improves over standard benchmarks, while also providing fine-grained control. 文本生成通常需要遵守特定任务规则的高精度输出。这种细粒度的控制很难用现成的深度学习模型来实施。在这项工作中,我们考虑使用通过结构化潜在变量方法学习的离散控制状态来增强神经生成模型。在这个公式下,特定任务的知识可以通过一系列丰富的后验约束进行编码,这些约束被有效地训练到模型中。这种方法允许用户基于先验知识进行内部模型决策,而不会牺牲神经生成模型的表示能力。实验考虑了这种方法在文本生成中的应用。我们发现这种方法比标准基准有所改进,同时还提供了细粒度的控制。 Xiang Lisa Li Alexander M. Rush
17 ACL2020 Parallel Data Augmentation for Formality Style Transfer https://github.com/lancopku/Augmented_Data_for_FST https://arxiv.org/pdf/2005.07522 The main barrier to progress in the task of Formality Style Transfer is the inadequacy of training data. In this paper, we study how to augment parallel data and propose novel and simple data augmentation methods for this task to obtain useful sentence pairs with easily accessible models and systems. Experiments demonstrate that our augmented parallel data largely helps improve formality style transfer when it is used to pre-train the model, leading to the state-of-the-art results in the GYAFC benchmark dataset. 形式风格迁移任务进展的主要障碍是训练数据的不足。在本文中,我们研究如何增强并行数据并为此任务提出新颖且简单的数据增强方法,以获得具有易于访问的模型和系统的有用句子对。实验表明,我们的增强并行数据在用于预训练模型时在很大程度上有助于改进形式风格转移,从而在 GYAFC 基准数据集中获得最先进的结果。 Yi Zhang Tao Ge Xu Sun
18 ACL2020 Neural Data-to-Text Generation via Jointly Learning the Segmentation and Correspondence https://arxiv.org/pdf/2005.01096 The neural attention model has achieved great success in data-to-text generation tasks. Though usually excelling at producing fluent text, it suffers from the problem of information missing, repetition and “hallucination”. Due to the black-box nature of the neural attention architecture, avoiding these problems in a systematic way is non-trivial. To address this concern, we propose to explicitly segment target text into fragment units and align them with their data correspondences. The segmentation and correspondence are jointly learned as latent variables without any human annotations. We further impose a soft statistical constraint to regularize the segmental granularity. The resulting architecture maintains the same expressive power as neural attention models, while being able to generate fully interpretable outputs with several times less computational cost. On both E2E and WebNLG benchmarks, we show the proposed model consistently outperforms its neural attention counterparts. Xiaoyu Shen Ernie Chang Hui Su Jie Zhou Dietrich Klakow
19 ACL2020 BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension https://arxiv.org/pdf/1910.13461 We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance. 我们提出了 BART,一种用于预训练序列到序列模型的去噪自动编码器。BART 通过 (1) 使用任意噪声函数破坏文本和 (2) 学习模型来重建原始文本进行训练。它使用标准的基于 Transformer 的神经机器翻译架构,尽管结构简单,但可以看作是对 BERT(双向编码器)、GPT(从左到右的解码器)以及许多其他较新的预训练方案的泛化。我们评估了多种加噪方法,发现同时随机打乱原始句子的顺序并使用一种新颖的填充方案(将文本片段替换为单个掩码标记)能获得最佳性能。BART 在针对文本生成进行微调时特别有效,但也适用于理解任务。它在 GLUE 和 SQuAD 上以相当的训练资源达到了与 RoBERTa 相当的性能,并在一系列抽象式对话、问答和摘要任务上取得了新的最先进结果,增益高达 6 ROUGE。在仅使用目标语言预训练的情况下,BART 在机器翻译上也比反向翻译系统提高了 1.1 BLEU。我们还报告了在 BART 框架内复现其他预训练方案的消融实验,以更好地衡量哪些因素对最终任务性能影响最大。 Mike Lewis Yinhan Liu Naman Goyal Marjan Ghazvininejad Abdelrahman Mohamed Omer Levy Ves Stoyanov Luke Zettlemoyer
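下面是对摘要中“文本填充”(text infilling)加噪的一个简化示意:随机抽取若干连续片段,把每个片段整体替换为单个掩码标记,片段长度按泊松分布采样。这是按摘要描述写的草图(超参数为假设值),并非 BART 官方的数据处理代码:

    import random
    import numpy as np

    def text_infilling(tokens, mask_token="<mask>", mask_ratio=0.3, poisson_lambda=3.0):
        """对一个词序列做简化的 text infilling 加噪。"""
        tokens = list(tokens)
        n_to_mask = int(len(tokens) * mask_ratio)
        masked = 0
        while masked < n_to_mask and len(tokens) > 1:
            span = max(1, min(int(np.random.poisson(poisson_lambda)), len(tokens) - 1))
            start = random.randrange(0, len(tokens) - span + 1)
            tokens[start:start + span] = [mask_token]  # 整个片段只保留一个掩码符号
            masked += span
        return tokens

    random.seed(0); np.random.seed(0)
    sentence = "the quick brown fox jumps over the lazy dog".split()
    print(text_infilling(sentence))

按摘要所述,实际预训练还会配合随机打乱句子顺序等加噪方式,此处从略。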
20 ACL2020 Pre-train and Plug-in: Flexible Conditional Text Generation with Variational Auto-Encoders https://github.com/WHUIR/PPVAE https://arxiv.org/pdf/1911.03882 Conditional Text Generation has drawn much attention as a topic of Natural Language Generation (NLG) which provides the possibility for humans to control the properties of generated contents. Current conditional generation models cannot handle emerging conditions due to their joint end-to-end learning fashion. When a new condition is added, these techniques require full retraining. In this paper, we present a new framework named Pre-train and Plug-in Variational Auto-Encoder (PPVAE) towards flexible conditional text generation. PPVAE decouples the text generation module from the condition representation module to allow “one-to-many” conditional generation. When a fresh condition emerges, only a lightweight network needs to be trained and works as a plug-in for PPVAE, which is efficient and desirable for real-world applications. Extensive experiments demonstrate the superiority of PPVAE against the existing alternatives with better conditionality and diversity but less training effort. 条件文本生成作为自然语言生成 (NLG) 的一个主题备受关注,它为人类提供了控制生成内容属性的可能性。由于采用端到端的联合学习方式,当前的条件生成模型无法处理新出现的条件:当添加新条件时,这些技术需要完全重新训练。在本文中,我们提出了一个名为 Pre-train and Plug-in Variational Auto-Encoder (PPVAE) 的新框架,用于灵活的条件文本生成。PPVAE 将文本生成模块与条件表示模块解耦,以支持“一对多”的条件生成。当新的条件出现时,只需训练一个轻量级网络并将其作为 PPVAE 的插件,这对于现实世界的应用是高效且可取的。大量实验证明了 PPVAE 相对于现有替代方案的优越性:条件性和多样性更好,而训练开销更少。 Yu Duan Canwen Xu Jiaxin Pei Jialong Han Chenliang Li
21 ACL2020 Two Birds, One Stone: A Simple, Unified Model for Text Generation from Structured and Unstructured Data https://github.com/h-shahidi/2birds-gen https://arxiv.org/pdf/1909.10158 A number of researchers have recently questioned the necessity of increasingly complex neural network (NN) architectures. In particular, several recent papers have shown that simpler, properly tuned models are at least competitive across several NLP tasks. In this work, we show that this is also the case for text generation from structured and unstructured data. We consider neural table-to-text generation and neural question generation (NQG) tasks for text generation from structured and unstructured data, respectively. Table-to-text generation aims to generate a description based on a given table, and NQG is the task of generating a question from a given passage where the generated question can be answered by a certain sub-span of the passage using NN models. Experimental results demonstrate that a basic attention-based seq2seq model trained with the exponential moving average technique achieves the state of the art in both tasks. Code is available at https://github.com/h-shahidi/2birds-gen. 许多研究人员最近质疑日益复杂的神经网络 (NN) 架构的必要性。特别是,最近的几篇论文表明,更简单、经过适当调整的模型至少在多个 NLP 任务中具有竞争力。在这项工作中,我们表明这也是从结构化和非结构化数据生成文本的情况。我们分别考虑从结构化和非结构化数据生成文本的神经表格到文本生成和神经问题生成 (NQG) 任务。表到文本生成旨在基于给定的表生成描述,而 NQG 是从给定的段落中生成问题的任务,其中生成的问题可以使用 NN 模型通过段落的某个子跨度来回答。实验结果表明,使用指数移动平均技术训练的基于注意力的基本 seq2seq 模型在这两个任务中都达到了最先进的水平。代码可在 https://github.com/h-shahidi/2birds-gen 获得。 Hamidreza Shahidi Ming Li Jimmy Lin
22 ACL2020 Unsupervised Opinion Summarization as Copycat-Review Generation https://github.com/ixlan/CopyCat-abstractive-opinion-summarizer https://arxiv.org/pdf/1911.02247 Opinion summarization is the task of automatically creating summaries that reflect subjective information expressed in multiple documents, such as product reviews. While the majority of previous work has focused on the extractive setting, i.e., selecting fragments from input reviews to produce a summary, we let the model generate novel sentences and hence produce abstractive summaries. Recent progress in summarization has seen the development of supervised models which rely on large quantities of document-summary pairs. Since such training data is expensive to acquire, we instead consider the unsupervised setting, in other words, we do not use any summaries in training. We define a generative model for a review collection which capitalizes on the intuition that when generating a new review given a set of other reviews of a product, we should be able to control the “amount of novelty” going into the new review or, equivalently, vary the extent to which it deviates from the input. At test time, when generating summaries, we force the novelty to be minimal, and produce a text reflecting consensus opinions. We capture this intuition by defining a hierarchical variational autoencoder model. Both individual reviews and the products they correspond to are associated with stochastic latent codes, and the review generator (“decoder”) has direct access to the text of input reviews through the pointer-generator mechanism. Experiments on Amazon and Yelp datasets, show that setting at test time the review’s latent code to its mean, allows the model to produce fluent and coherent summaries reflecting common opinions. 意见摘要是自动创建反映在多个文档中表达的主观信息(例如产品评论)的摘要的任务。虽然以前的大部分工作都集中在提取设置上,即从输入评论中选择片段以生成摘要,但我们让模型生成新颖的句子,从而生成抽象摘要。摘要的最新进展见证了依赖大量文档摘要对的监督模型的发展。由于获取此类训练数据的成本很高,因此我们转而考虑无监督设置,换句话说,我们在训练中不使用任何摘要。我们为评论集合定义了一个生成模型,该模型利用了一种直觉,即在给定一组产品的其他评论生成新评论时,我们应该能够控制进入新评论的“新颖性”,或者等效地,改变它偏离输入的程度。在测试时,在生成摘要时,我们将新颖性降至最低,并生成反映共识意见的文本。我们通过定义分层变分自编码器模型来捕捉这种直觉。个人评论和它们对应的产品都与随机潜在代码相关联,评论生成器(“解码器”)可以通过指针生成器机制直接访问输入评论的文本。在 Amazon 和 Yelp 数据集上的实验表明,在测试时将评论的潜在代码设置为其平均值,允许模型生成反映共同意见的流畅和连贯的摘要。 Arthur Bražinskas Mirella Lapata Ivan Titov
23 ACL2020 Evidence-Aware Inferential Text Generation with Vector Quantised Variational AutoEncoder https://github.com/microsoft/EA-VQ-VAE https://arxiv.org/pdf/2006.08101 Generating inferential texts about an event in different perspectives requires reasoning over different contexts that the event occurs. Existing works usually ignore the context that is not explicitly provided, resulting in a context-independent semantic representation that struggles to support the generation. To address this, we propose an approach that automatically finds evidence for an event from a large text corpus, and leverages the evidence to guide the generation of inferential texts. Our approach works in an encoder-decoder manner and is equipped with a Vector Quantised-Variational Autoencoder, where the encoder outputs representations from a distribution over discrete variables. Such discrete representations enable automatically selecting relevant evidence, which not only facilitates evidence-aware generation, but also provides a natural way to uncover rationales behind the generation. Our approach provides state-of-the-art performance on both Event2Mind and ATOMIC datasets. More importantly, we find that with discrete representations, our model selectively uses evidence to generate different inferential texts. 从不同角度生成关于事件的推理文本需要对事件发生的不同上下文进行推理。现有作品通常会忽略未明确提供的上下文,从而导致难以支持生成的上下文无关语义表示。为了解决这个问题,我们提出了一种从大型文本语料库中自动寻找事件证据的方法,并利用这些证据来指导推理文本的生成。我们的方法以编码器-解码器的方式工作,并配备了矢量量化变分自动编码器,其中编码器输出离散变量分布的表示。这种离散表示能够自动选择相关证据,这不仅有利于证据意识的生成,而且还提供了一种自然的方式来揭示生成背后的基本原理。我们的方法在 Event2Mind 和 ATOMIC 数据集上都提供了最先进的性能。更重要的是,我们发现对于离散表示,我们的模型有选择地使用证据来生成不同的推理文本。 Daya Guo Duyu Tang Nan Duan Jian Yin Daxin Jiang Ming Zhou
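下面给出摘要中“编码器输出离散变量分布的表示”所依赖的矢量量化(VQ)一步的极简示意:把连续编码映射到最近的码本向量,并用 straight-through 估计把梯度传回编码器。这是通用 VQ 的草图(维度、码本大小均为假设值),并非论文的完整模型:

    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        def __init__(self, num_codes=64, dim=32):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, z_e):                                     # z_e: (batch, dim) 编码器输出
            dist = torch.cdist(z_e, self.codebook.weight)           # 与每个码本向量的欧氏距离
            idx = dist.argmin(dim=-1)                                # 最近的码本索引,即离散潜变量
            z_q = self.codebook(idx)                                 # 量化后的向量
            z_st = z_e + (z_q - z_e).detach()                        # straight-through:反向梯度穿透到 z_e
            commit_loss = ((z_q.detach() - z_e) ** 2).mean() + ((z_q - z_e.detach()) ** 2).mean()
            return z_st, idx, commit_loss

    vq = VectorQuantizer()
    z_e = torch.randn(4, 32, requires_grad=True)
    z_st, idx, loss = vq(z_e)
    loss.backward()
    print(idx.tolist())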
24 ACL2020 BLEURT: Learning Robust Metrics for Text Generation https://github.com/google-research/bleurt https://arxiv.org/pdf/2004.04696 Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution. 文本生成在过去几年中取得了重大进展。然而,评估指标已经落后,因为最流行的选择(例如,BLEU 和 ROUGE)可能与人类判断的相关性很差。我们提出了 BLEURT,这是一种基于 BERT 的学习评估指标,可以用几千个可能有偏见的训练示例对人类判断进行建模。我们方法的一个关键方面是一种新颖的预训练方案,它使用数百万个合成示例来帮助模型泛化。 BLEURT 提供了过去三年 WMT 指标共享任务和 WebNLG 竞赛数据集的最新结果。与基于 BERT 的普通方法相比,即使训练数据稀缺且分布不均时,它也能产生出色的结果。 Thibault Sellam Dipanjan Das Ankur P. Parikh
25 ACL2019 Automatic Generation of High Quality CCGbanks for Parser Domain Adaptation https://arxiv.org/pdf/1906.01834 We propose a new domain adaptation method for Combinatory Categorial Grammar (CCG) parsing, based on the idea of automatic generation of CCG corpora exploiting cheaper resources of dependency trees. Our solution is conceptually simple, and not relying on a specific parser architecture, making it applicable to the current best-performing parsers. We conduct extensive parsing experiments with detailed discussion; on top of existing benchmark datasets on (1) biomedical texts and (2) question sentences, we create experimental datasets of (3) speech conversation and (4) math problems. When applied to the proposed method, an off-the-shelf CCG parser shows significant performance gains, improving from 90.7% to 96.6% on speech conversation, and from 88.5% to 96.8% on math problems. 我们提出了一种用于组合分类语法(CCG)解析的新域适应方法,基于利用依赖树的更便宜资源自动生成 CCG 语料库的思想。我们的解决方案在概念上很简单,不依赖于特定的解析器架构,使其适用于当前性能最佳的解析器。我们进行了广泛的解析实验并进行了详细的讨论;在 (1) 生物医学文本和 (2) 问题句子的现有基准数据集之上,我们创建了 (3) 语音对话和 (4) 数学问题的实验数据集。当应用于所提出的方法时,现成的 CCG 解析器显示出显着的性能提升,语音对话从 90.7% 提高到 96.6%,数学问题从 88.5% 提高到 96.8%。 Masashi Yoshikawa Hiroshi Noji Koji Mineshima Daisuke Bekki
26 ACL2019 PaperRobot: Incremental Draft Generation of Scientific Ideas https://github.com/EagleW/PaperRobot https://arxiv.org/pdf/1905.07870 We present a PaperRobot who performs as an automatic research assistant by (1) conducting deep understanding of a large collection of human-written papers in a target domain and constructing comprehensive background knowledge graphs (KGs); (2) creating new ideas by predicting links from the background KGs, by combining graph attention and contextual text attention; (3) incrementally writing some key elements of a new paper based on memory-attention networks: from the input title along with predicted related entities to generate a paper abstract, from the abstract to generate conclusion and future work, and finally from future work to generate a title for a follow-on paper. Turing Tests, where a biomedical domain expert is asked to compare a system output and a human-authored string, show PaperRobot generated abstracts, conclusion and future work sections, and new titles are chosen over human-written ones up to 30%, 24% and 12% of the time, respectively. 我们提出了 PaperRobot,它作为自动研究助理:(1) 深入理解目标领域的大量人工撰写论文,并构建全面的背景知识图谱(KG);(2) 结合图注意力与上下文文本注意力,通过预测背景知识图谱中的链接来产生新想法;(3) 基于记忆注意力网络增量式地撰写新论文的若干关键部分:由输入标题及预测的相关实体生成论文摘要,由摘要生成结论与未来工作,最后由未来工作生成后续论文的标题。在图灵测试中,生物医学领域专家被要求比较系统输出与人工撰写的文本;结果显示,PaperRobot 生成的摘要、结论与未来工作部分以及新标题分别在最高 30%、24% 和 12% 的情况下被优先选中,而非人工撰写的版本。 Qingyun Wang Lifu Huang Zhiying Jiang Kevin Knight Heng Ji Mohit Bansal Yi Luan
27 ACL2019 Data-to-text Generation with Entity Modeling https://github.com/ratishsp/data2text-entity-py https://arxiv.org/pdf/1906.03221 Recent approaches to data-to-text generation have shown great promise thanks to the use of large-scale datasets and the application of neural network architectures which are trained end-to-end. These models rely on representation learning to select content appropriately, structure it coherently, and verbalize it grammatically, treating entities as nothing more than vocabulary tokens. In this work we propose an entity-centric neural architecture for data-to-text generation. Our model creates entity-specific representations which are dynamically updated. Text is generated conditioned on the data input and entity memory representations using hierarchical attention at each time step. We present experiments on the RotoWire benchmark and a (five times larger) new dataset on the baseball domain which we create. Our results show that the proposed model outperforms competitive baselines in automatic and human evaluation. 由于大规模数据集的使用和端到端训练的神经网络架构的应用,最近的数据到文本生成方法显示出了巨大的希望。这些模型依靠表示学习来适当地选择内容、连贯地构建内容并按语法表达,将实体视为词汇标记。在这项工作中,我们提出了一种以实体为中心的神经架构,用于数据到文本的生成。我们的模型创建了动态更新的特定于实体的表示。文本是根据数据输入和实体内存表示在每个时间步使用分层注意力生成的。我们展示了 RotoWire 基准测试和我们创建的棒球域上的(大五倍)新数据集。我们的结果表明,所提出的模型在自动和人工评估中优于竞争基线。 Ratish Puduppully Li Dong Mirella Lapata
28 ACL2019 Key Fact as Pivot: A Two-Stage Model for Low Resource Table-to-Text Generation https://github.com/lancopku/Pivot https://arxiv.org/pdf/1908.03067 Table-to-text generation aims to translate the structured data into the unstructured text. Most existing methods adopt the encoder-decoder framework to learn the transformation, which requires large-scale training samples. However, the lack of large parallel data is a major practical problem for many domains. In this work, we consider the scenario of low resource table-to-text generation, where only limited parallel data is available. We propose a novel model to separate the generation into two stages: key fact prediction and surface realization. It first predicts the key facts from the tables, and then generates the text with the key facts. The training of key fact prediction needs much fewer annotated data, while surface realization can be trained with pseudo parallel corpus. We evaluate our model on a biography generation dataset. Our model can achieve 27.34 BLEU score with only 1,000 parallel data, while the baseline model only obtains 9.71 BLEU score. 表到文本生成旨在将结构化数据转换为非结构化文本。大多数现有方法采用编码器-解码器框架来学习这种转换,这需要大规模的训练样本。然而,缺乏大规模并行数据是许多领域的主要实际问题。在这项工作中,我们考虑了低资源表到文本生成的场景,其中只有有限的并行数据可用。我们提出了一种新颖的模型,将生成分为两个阶段:关键事实预测和表层实现。它首先从表格中预测关键事实,然后基于关键事实生成文本。关键事实预测的训练需要的标注数据少得多,而表层实现可以用伪平行语料库训练。我们在一个传记生成数据集上评估我们的模型。我们的模型仅用 1,000 条并行数据即可达到 27.34 的 BLEU 分数,而基线模型仅获得 9.71 的 BLEU 分数。 Shuming Ma Pengcheng Yang Tianyu Liu Peng Li Jie Zhou Xu Sun
29 ACL2019 Reinforced Dynamic Reasoning for Conversational Question Generation https://github.com/ZJULearning/ReDR https://arxiv.org/pdf/1907.12667 This paper investigates a new task named Conversational Question Generation (CQG) which is to generate a question based on a passage and a conversation history (i.e., previous turns of question-answer pairs). CQG is a crucial task for developing intelligent agents that can drive question-answering style conversations or test user understanding of a given passage. Towards that end, we propose a new approach named Reinforced Dynamic Reasoning (ReDR) network, which is based on the general encoder-decoder framework but incorporates a reasoning procedure in a dynamic manner to better understand what has been asked and what to ask next about the passage. To encourage producing meaningful questions, we leverage a popular question answering (QA) model to provide feedback and fine-tune the question generator using a reinforcement learning mechanism. Empirical results on the recently released CoQA dataset demonstrate the effectiveness of our method in comparison with various baselines and model variants. Moreover, to show the applicability of our method, we also apply it to create multi-turn question-answering conversations for passages in SQuAD. Boyuan Pan Hao Li Ziyu Yao Deng Cai Huan Sun
30 ACL2019 Neural Keyphrase Generation via Reinforcement Learning with Adaptive Rewards https://github.com/kenchan0226/keyphrase-generation-rl https://arxiv.org/pdf/1906.04106 Generating keyphrases that summarize the main points of a document is a fundamental task in natural language processing. Although existing generative models are capable of predicting multiple keyphrases for an input document as well as determining the number of keyphrases to generate, they still suffer from the problem of generating too few keyphrases. To address this problem, we propose a reinforcement learning (RL) approach for keyphrase generation, with an adaptive reward function that encourages a model to generate both sufficient and accurate keyphrases. Furthermore, we introduce a new evaluation method that incorporates name variations of the ground-truth keyphrases using the Wikipedia knowledge base. Thus, our evaluation method can more robustly evaluate the quality of predicted keyphrases. Extensive experiments on five real-world datasets of different scales demonstrate that our RL approach consistently and significantly improves the performance of the state-of-the-art generative models with both conventional and new evaluation methods. 生成总结文档要点的关键短语是自然语言处理中的一项基本任务。尽管现有的生成模型能够预测输入文档的多个关键短语以及确定要生成的关键短语的数量,但它们仍然存在生成的关键短语太少的问题。为了解决这个问题,我们提出了一种用于生成关键短语的强化学习 (RL) 方法,该方法具有自适应奖励功能,可以鼓励模型生成足够且准确的关键短语。此外,我们引入了一种新的评估方法,该方法使用维基百科知识库结合了真实关键词的名称变体。因此,我们的评估方法可以更稳健地评估预测的关键短语的质量。对五个不同规模的真实世界数据集进行的大量实验表明,我们的 RL 方法通过传统和新的评估方法一致且显着地提高了最先进的生成模型的性能。 Hou Pong Chan Wang Chen Lu Wang Irwin King
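下面用纯 Python 示意“自适应奖励”的一种可能写法:当生成的关键短语数量不足时以召回率为奖励,鼓励模型多生成;数量足够后改用 F1,兼顾准确性。这是按摘要思路写的草图,阈值与细节均为假设,并非论文实现:

    def adaptive_reward(predicted, gold):
        pred, gold = set(predicted), set(gold)
        if not pred:
            return 0.0
        correct = len(pred & gold)
        recall = correct / len(gold)
        precision = correct / len(pred)
        if len(pred) < len(gold):          # 生成数量不足:用召回率作奖励
            return recall
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)   # 数量足够:用 F1 作奖励

    print(adaptive_reward(["neural network"],
                          ["neural network", "keyphrase generation"]))
    print(adaptive_reward(["neural network", "keyphrase generation", "noise"],
                          ["neural network", "keyphrase generation"]))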
31 ACL2019 Topic-Aware Neural Keyphrase Generation for Social Media Language https://github.com/yuewang-cuhk/TAKG https://arxiv.org/pdf/1906.03889 A huge volume of user-generated content is daily produced on social media. To facilitate automatic language understanding, we study keyphrase prediction, distilling salient information from massive posts. While most existing methods extract words from source posts to form keyphrases, we propose a sequence-to-sequence (seq2seq) based neural keyphrase generation framework, enabling absent keyphrases to be created. Moreover, our model, being topic-aware, allows joint modeling of corpus-level latent topic representations, which helps alleviate the data sparsity that widely exhibited in social media language. Experiments on three datasets collected from English and Chinese social media platforms show that our model significantly outperforms both extraction and generation models that do not exploit latent topics. Further discussions show that our model learns meaningful topics, which interprets its superiority in social media keyphrase generation. Yue Wang Jing Li Hou Pong Chan Irwin King Michael R. Lyu Shuming Shi
32 ACL2019 Argument Generation with Retrieval, Planning, and Realization https://arxiv.org/pdf/1906.03717 Automatic argument generation is an appealing but challenging task. In this paper, we study the specific problem of counter-argument generation, and present a novel framework, CANDELA. It consists of a powerful retrieval system and a novel two-step generation model, where a text planning decoder first decides on the main talking points and a proper language style for each sentence, then a content realization decoder reflects the decisions and constructs an informative paragraph-level argument. Furthermore, our generation model is empowered by a retrieval system indexed with 12 million articles collected from Wikipedia and popular English news media, which provides access to high-quality content with diversity. Automatic evaluation on a large-scale dataset collected from Reddit shows that our model yields significantly higher BLEU, ROUGE, and METEOR scores than the state-of-the-art and non-trivial comparisons. Human evaluation further indicates that our system arguments are more appropriate for refutation and richer in content. 自动论点生成是一项有吸引力但具有挑战性的任务。在本文中,我们研究了反论点生成这一具体问题,并提出了一个新颖的框架 CANDELA。它由一个强大的检索系统和一个新颖的两步生成模型组成:文本规划解码器首先为每个句子确定主要论点和合适的语言风格,然后内容实现解码器根据这些决定构建信息丰富的段落级论点。此外,我们的生成模型由一个索引了从维基百科和流行英语新闻媒体收集的 1200 万篇文章的检索系统提供支持,可访问多样化的高质量内容。在从 Reddit 收集的大规模数据集上的自动评估表明,我们的模型取得的 BLEU、ROUGE 和 METEOR 分数明显高于最先进方法及其他有力的对比基线。人工评估进一步表明,我们系统生成的论点更适合用于反驳,且内容更丰富。 Xinyu Hua Zhe Hu Lu Wang
33 ACL2019 Semantically Conditioned Dialog Response Generation via Hierarchical Disentangled Self-Attention https://github.com/wenhuchen/HDSA-Dialog https://arxiv.org/pdf/1905.12866 Semantically controlled neural response generation on limited-domain has achieved great performance. However, moving towards multi-domain large-scale scenarios are shown to be difficult because the possible combinations of semantic inputs grow exponentially with the number of domains. To alleviate such scalability issue, we exploit the structure of dialog acts to build a multi-layer hierarchical graph, where each act is represented as a root-to-leaf route on the graph. Then, we incorporate such graph structure prior as an inductive bias to build a hierarchical disentangled self-attention network, where we disentangle attention heads to model designated nodes on the dialog act graph. By activating different (disentangled) heads at each layer, combinatorially many dialog act semantics can be modeled to control the neural response generation. On the large-scale Multi-Domain-WOZ dataset, our model can yield a significant improvement over the baselines on various automatic and human evaluation metrics. 在有限域上语义控制的神经响应生成已经取得了很好的性能。然而,由于语义输入的可能组合随着域的数量呈指数增长,因此转向多域大规模场景是很困难的。为了缓解这种可扩展性问题,我们利用对话行为的结构来构建多层分层图,其中每个行为在图上表示为从根到叶的路线。然后,我们将这种先验图结构作为归纳偏置来构建分层解缠结的自注意力网络,在其中我们解开注意力头以模拟对话行为图上的指定节点。通过在每一层激活不同的(解开的)头,可以组合地对许多对话行为语义进行建模以控制神经响应的生成。在大规模多域 WOZ 数据集上,我们的模型可以在各种自动和人工评估指标的基线上产生显着改进。 Wenhu Chen Jianshu Chen Pengda Qin Xifeng Yan William Yang Wang
34 ACL2019 Coherent Comments Generation for Chinese Articles with a Graph-to-Sequence Model https://github.com/lancopku/Graph-to-seq-comment-generation https://arxiv.org/pdf/1906.01231 Automatic article commenting is helpful in encouraging user engagement and interaction on online news platforms. However, the news documents are usually too long for traditional encoder-decoder based models, which often results in general and irrelevant comments. In this paper, we propose to generate comments with a graph-to-sequence model that models the input news as a topic interaction graph. By organizing the article into graph structure, our model can better understand the internal structure of the article and the connection between topics, which makes it better able to understand the story. We collect and release a large scale news-comment corpus from a popular Chinese online news platform Tencent Kuaibao. Extensive experiment results show that our model can generate much more coherent and informative comments compared with several strong baseline models. 自动文章评论有助于鼓励用户在在线新闻平台上参与和互动。然而,对于传统的基于编码器-解码器的模型来说,新闻文档通常太长,这通常会导致一般和不相关的评论。在本文中,我们建议使用图到序列模型生成评论,该模型将输入新闻建模为主题交互图。通过将文章组织成图结构,我们的模型可以更好地理解文章的内部结构和主题之间的联系,从而更好地理解故事。我们从中国流行的在线新闻平台腾讯快报收集并发布了一个大规模的新闻评论语料库。广泛的实验结果表明,与几个强大的基线模型相比,我们的模型可以生成更加连贯和信息丰富的评论。 Wei Li Jingjing Xu Yancheng He Shengli Yan Yunfang Wu Xu sun
35 ACL2019 Cross-Lingual Training for Automatic Question Generation https://github.com/vishwajeet93/clqg https://arxiv.org/pdf/1906.02525 Automatic question generation (QG) is a challenging problem in natural language understanding. QG systems are typically built assuming access to a large number of training instances where each instance is a question and its corresponding answer. For a new language, such training instances are hard to obtain making the QG problem even more challenging. Using this as our motivation, we study the reuse of an available large QG dataset in a secondary language (e.g. English) to learn a QG model for a primary language (e.g. Hindi) of interest. For the primary language, we assume access to a large amount of monolingual text but only a small QG dataset. We propose a cross-lingual QG model which uses the following training regime: (i) Unsupervised pretraining of language models in both primary and secondary languages and (ii) joint supervised training for QG in both languages. We demonstrate the efficacy of our proposed approach using two different primary languages, Hindi and Chinese. We also create and release a new question answering dataset for Hindi consisting of 6555 sentences. 自动问题生成(QG)是自然语言理解中的一个具有挑战性的问题。 QG 系统通常假设访问大量训练实例而构建,其中每个实例都是一个问题及其相应的答案。对于一种新语言,很难获得这样的训练实例,这使得 QG 问题更具挑战性。以此为动机,我们研究了在第二语言(例如英语)中重用可用的大型 QG 数据集来学习感兴趣的主要语言(例如印地语)的 QG 模型。对于主要语言,我们假设可以访问大量单语文本,但只能访问一个小的 QG 数据集。我们提出了一种跨语言 QG 模型,它使用以下训练机制:(i)主要和次要语言的语言模型的无监督预训练和(ii)两种语言的 QG 联合监督训练。我们使用两种不同的主要语言印地语和中文证明了我们提出的方法的有效性。我们还为印地语创建并发布了一个新的问答数据集,由 6555 个句子组成。 Vishwajeet Kumar Nitish Joshi Arijit Mukherjee Ganesh Ramakrishnan Preethi Jyothi
36 ACL2019 Graph Neural Networks with Generated Parameters for Relation Extraction https://arxiv.org/pdf/1902.00756 Recently, progress has been made towards improving relational reasoning in machine learning field. Among existing models, graph neural networks (GNNs) is one of the most effective approaches for multi-hop relational reasoning. In fact, multi-hop relational reasoning is indispensable in many natural language processing tasks such as relation extraction. In this paper, we propose to generate the parameters of graph neural networks (GP-GNNs) according to natural language sentences, which enables GNNs to process relational reasoning on unstructured text inputs. We verify GP-GNNs in relation extraction from text. Experimental results on a human-annotated dataset and two distantly supervised datasets show that our model achieves significant improvements compared to baselines. We also perform a qualitative analysis to demonstrate that our model could discover more accurate relations by multi-hop relational reasoning. 最近,在改进机器学习领域的关系推理方面取得了进展。在现有模型中,图神经网络 (GNN) 是多跳关系推理最有效的方法之一。事实上,在关系抽取等很多自然语言处理任务中,多跳关系推理是必不可少的。在本文中,我们建议根据自然语言句子生成图神经网络 (GP-GNN) 的参数,这使 GNN 能够处理非结构化文本输入的关系推理。我们在从文本中提取关系中验证了 GP-GNN。在人工注释数据集和两个远程监督数据集上的实验结果表明,与基线相比,我们的模型取得了显着的改进。我们还进行了定性分析,以证明我们的模型可以通过多跳关系推理发现更准确的关系。 Hao Zhu Yankai Lin Zhiyuan Liu Jie Fu Tat-seng Chua Maosong Sun
37 ACL2019 Learning to Select, Track, and Generate for Data-to-Text https://github.com/aistairc/rotowire-modified https://arxiv.org/pdf/1907.09699 We propose a data-to-text generation model with two modules, one for tracking and the other for text generation. Our tracking module selects and keeps track of salient information and memorizes which record has been mentioned. Our generation module generates a summary conditioned on the state of tracking module. Our model is considered to simulate the human-like writing process that gradually selects the information by determining the intermediate variables while writing the summary. In addition, we also explore the effectiveness of the writer information for generation. Experimental results show that our model outperforms existing models in all evaluation metrics even without writer information. Incorporating writer information further improves the performance, contributing to content planning and surface realization. 我们提出了一种具有两个模块的数据到文本生成模型,一个用于跟踪,另一个用于文本生成。我们的跟踪模块选择并跟踪显着信息并记住提到的记录。我们的生成模块根据跟踪模块的状态生成摘要。我们的模型被认为是模拟类人的写作过程,通过在撰写摘要时确定中间变量来逐步选择信息。此外,我们还探索了作者信息对生成的有效性。实验结果表明,即使没有作者信息,我们的模型在所有评估指标上都优于现有模型。结合作者信息进一步提高了性能,有助于内容规划和表面实现。 Hayate Iso Yui Uehara Tatsuya Ishigaki Hiroshi Noji Eiji Aramaki Ichiro Kobayashi Yusuke Miyao Naoaki Okazaki Hiroya Takamura
38 ACL2019 Predicting Human Activities from User-Generated Content https://arxiv.org/pdf/1907.08540 The activities we do are linked to our interests, personality, political preferences, and decisions we make about the future. In this paper, we explore the task of predicting human activities from user-generated content. We collect a dataset containing instances of social media users writing about a range of everyday activities. We then use a state-of-the-art sentence embedding framework tailored to recognize the semantics of human activities and perform an automatic clustering of these activities. We train a neural network model to make predictions about which clusters contain activities that were performed by a given user based on the text of their previous posts and self-description. Additionally, we explore the degree to which incorporating inferred user traits into our model helps with this prediction task. 我们所做的活动与我们的兴趣、个性、政治偏好以及我们对未来所做的决定有关。在本文中,我们探索了从用户生成的内容中预测人类活动的任务。我们收集了一个数据集,其中包含社交媒体用户撰写的一系列日常活动的实例。然后我们使用最先进的句子嵌入框架来识别人类活动的语义并执行这些活动的自动聚类。我们训练一个神经网络模型,以根据给定用户之前帖子的文本和自我描述来预测哪些集群包含由给定用户执行的活动。此外,我们探索了将推断的用户特征合并到我们的模型中有助于此预测任务的程度。 Steven R. Wilson Rada Mihalcea

EMNLP

序号 会议/期刊 论文 主要技术 代码 论文下载地址 摘要 摘要翻译 作者
1 EMNLP2020 Few-shot Natural Language Generation for Task-Oriented Dialog https://github.com/pengbaolin/SC-GPT https://arxiv.org/pdf/2002.12328 As a crucial component in task-oriented dialog systems, the Natural Language Generation (NLG) module converts a dialog act represented in a semantic form into a response in natural language. The success of traditional template-based or statistical models typically relies on heavily annotated data, which is infeasible for new domains. Therefore, it is pivotal for an NLG system to generalize well with limited labelled data in real applications. To this end, we present FewShotWoz, the first NLG benchmark to simulate the few-shot learning setting in task-oriented dialog systems. Further, we develop the SC-GPT model. It is pre-trained on a large set of annotated NLG corpus to acquire the controllable generation ability, and fine-tuned with only a few domain-specific labels to adapt to new domains. Experiments on FewShotWoz and the large Multi-Domain-WOZ datasets show that the proposed SC-GPT significantly outperforms existing methods, measured by various automatic metrics and human evaluations. 作为面向任务的对话系统中的重要组成部分,自然语言生成 (NLG) 模块将以语义形式表示的对话行为转换为自然语言的响应。传统的基于模板或统计模型的成功通常依赖于大量注释的数据,这对于新领域是不可行的。因此,NLG 系统在实际应用中利用有限的标记数据很好地泛化是至关重要的。为此,我们提出了FewShotWoz,这是第一个在面向任务的对话系统中模拟小样本学习设置的NLG 基准测试。此外,我们开发了 SC-GPT 模型。它在大量带注释的 NLG 语料库上进行预训练以获得可控生成能力,并仅使用少数特定领域的标签进行微调以适应新领域。在FewShotWoz 和大型Multi-Domain-WOZ 数据集上的实验表明,所提出的SC-GPT 显着优于现有方法,通过各种自动指标和人工评估来衡量。 Baolin Peng Chenguang Zhu Chunyuan Li Xiujun Li Jinchao Li Michael Zeng Jianfeng Gao
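SC-GPT 的关键输入是把语义形式的对话动作(dialog act)线性化为文本,再由 GPT 类模型续写自然语言回复。下面是一个极简的线性化示意(具体分隔符与格式为假设,并非论文或其代码库的确切格式):

    def linearize_dialog_act(intent, slots):
        """把意图与槽值对拼成一个文本前缀,例如 inform ( name = Blue Spice ; food = Chinese )"""
        slot_str = " ; ".join(f"{k} = {v}" for k, v in slots.items())
        return f"{intent} ( {slot_str} )"

    act = linearize_dialog_act("inform", {"name": "Blue Spice", "food": "Chinese", "area": "centre"})
    prompt = act + " & "   # “ & ”为假设的分隔符,占位表示 dialog act 与回复之间的边界
    print(prompt)
    # 随后可将 prompt 送入微调后的 GPT-2 生成回复,例如 model.generate(tokenizer(prompt, ...))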
2 EMNLP2020 How Decoding Strategies Affect the Verifiability of Generated Text https://arxiv.org/pdf/1911.03587 Recent progress in pre-trained language models led to systems that are able to generate text of an increasingly high quality. While several works have investigated the fluency and grammatical correctness of such models, it is still unclear to which extent the generated text is consistent with factual world knowledge. Here, we go beyond fluency and also investigate the verifiability of text generated by state-of-the-art pre-trained language models. A generated sentence is verifiable if it can be corroborated or disproved by Wikipedia, and we find that the verifiability of generated text strongly depends on the decoding strategy. In particular, we discover a tradeoff between factuality (i.e., the ability of generating Wikipedia corroborated text) and repetitiveness. While decoding strategies such as top-k and nucleus sampling lead to less repetitive generations, they also produce less verifiable text. Based on these finding, we introduce a simple and effective decoding strategy which, in comparison to previously used decoding strategies, produces less repetitive and more verifiable text. 预训练语言模型的最新进展导致系统能够生成质量越来越高的文本。虽然有几项工作调查了此类模型的流畅性和语法正确性,但仍不清楚生成的文本在多大程度上与事实世界知识一致。在这里,我们超越了流畅性,还研究了由最先进的预训练语言模型生成的文本的可验证性。如果生成的句子可以被维基百科证实或反驳,那么它就是可验证的,我们发现生成文本的可验证性在很大程度上取决于解码策略。特别是,我们发现了事实性(即生成维基百科确证文本的能力)和重复性之间的权衡。虽然诸如 top-k 和核采样之类的解码策略会减少重复生成,但它们也会产生较少的可验证文本。基于这些发现,我们引入了一种简单有效的解码策略,与以前使用的解码策略相比,该策略产生的文本重复性更低且可验证性更高。 Luca Massarelli Fabio Petroni Aleksandra Piktus Myle Ott Tim Rocktäschel Vassilis Plachouras Fabrizio Silvestri Sebastian Riedel
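摘要中讨论的 top-k 与 nucleus(top-p)采样都会截断词分布的尾部,从而影响重复性与可验证性。下面用随机 logits 演示这两种截断采样的常见实现方式(仅为背景示意,不是论文提出的新解码策略):

    import torch
    import torch.nn.functional as F

    def sample_next_token(logits, top_k=0, top_p=1.0):
        logits = logits.clone()
        if top_k > 0:
            kth = torch.topk(logits, top_k).values[-1]
            logits[logits < kth] = float("-inf")             # 只保留得分最高的 k 个词
        if top_p < 1.0:
            sorted_logits, sorted_idx = torch.sort(logits, descending=True)
            probs = F.softmax(sorted_logits, dim=-1)
            cum = torch.cumsum(probs, dim=-1)
            remove = cum - probs > top_p                      # 累积概率已超过 p 的尾部词全部剔除
            logits[sorted_idx[remove]] = float("-inf")
        return torch.multinomial(F.softmax(logits, dim=-1), 1).item()

    torch.manual_seed(0)
    logits = torch.randn(100)                                 # 假设词表大小为 100
    print(sample_next_token(logits, top_k=10))
    print(sample_next_token(logits, top_p=0.9))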
3 EMNLP2020 Control, Generate, Augment: A Scalable Framework for Multi-Attribute Text Generation https://arxiv.org/pdf/2004.14983 We introduce CGA, a conditional VAE architecture, to control, generate, and augment text. CGA is able to generate natural English sentences controlling multiple semantic and syntactic attributes by combining adversarial learning with a context-aware loss and a cyclical word dropout routine. We demonstrate the value of the individual model components in an ablation study. The scalability of our approach is ensured through a single discriminator, independently of the number of attributes. We show high quality, diversity and attribute control in the generated sentences through a series of automatic and human assessments. As the main application of our work, we test the potential of this new NLG model in a data augmentation scenario. In a downstream NLP task, the sentences generated by our CGA model show significant improvements over a strong baseline, and a classification performance often comparable to adding same amount of additional real data. 我们引入了 CGA,一种有条件的 VAE 架构,来控制、生成和增加文本。 CGA 能够通过将对抗性学习与上下文感知损失和循环词丢失例程相结合,生成控制多个语义和句法属性的自然英语句子。我们在消融研究中展示了各个模型组件的价值。我们的方法的可扩展性是通过单个鉴别器来确保的,与属性的数量无关。我们通过一系列自动和人工评估在生成的句子中展示了高质量、多样性和属性控制。作为我们工作的主要应用,我们在数据增强场景中测试了这种新的 NLG 模型的潜力。在下游 NLP 任务中,我们的 CGA 模型生成的句子在强大的基线上显示出显着的改进,并且分类性能通常与添加相同数量的额外真实数据相当。 Giuseppe Russo Nora Hollenstein Claudiu Musat Ce Zhang
4 EMNLP2020 Pretrained Language Models for Dialogue Generation with Multiple Input Sources https://github.com/caoyu-noob/Multi-GPT2 https://arxiv.org/pdf/2010.07576 Large-scale pretrained language models have achieved outstanding performance on natural language understanding tasks. However, it is still under investigating how to apply them to dialogue generation tasks, especially those with responses conditioned on multiple sources. Previous work simply concatenates all input sources or averages information from different input sources. In this work, we study dialogue models with multiple input sources adapted from the pretrained language model GPT2. We explore various methods to fuse multiple separate attention information corresponding to different sources. Our experimental results show that proper fusion methods deliver higher relevance with dialogue history than simple fusion baselines. 大规模预训练语言模型在自然语言理解任务上取得了出色的表现。然而,它仍在研究如何将它们应用于对话生成任务,尤其是那些以多个来源为条件的响应。以前的工作只是连接所有输入源或平均来自不同输入源的信息。在这项工作中,我们研究了从预训练语言模型 GPT2 改编而来的具有多个输入源的对话模型。我们探索了各种方法来融合对应于不同来源的多个单独的注意力信息。我们的实验结果表明,与简单的融合基线相比,适当的融合方法与对话历史的相关性更高。 Yu Cao Wei Bi Meng Fang Dacheng Tao
5 EMNLP2020 Logic2Text: High-Fidelity Natural Language Generation from Logical Forms https://github.com/czyssrs/Logic2Text https://arxiv.org/pdf/2004.14579 Previous works on Natural Language Generation (NLG) from structured data have primarily focused on surface-level descriptions of record sequences. However, for complex structured data, e.g., multi-row tables, it is often desirable for an NLG system to describe interesting facts from logical inferences across records. If only provided with the table, it is hard for existing models to produce controllable and high-fidelity logical generations. In this work, we formulate logical level NLG as generation from logical forms in order to obtain controllable, high-fidelity, and faithful generations. We present a new large-scale dataset, Logic2Text, with 10,753 descriptions involving common logic types paired with the underlying logical forms. The logical forms show diversified graph structure of free schema, which poses great challenges on the model’s ability to understand the semantics. We experiment on (1) Fully-supervised training with the full datasets, and (2) Few-shot setting, provided with hundreds of paired examples; We compare several popular generation models and analyze their performances. We hope our dataset can encourage research towards building an advanced NLG system capable of natural, faithful, and human-like generation. The dataset and code are available at https://github.com/czyssrs/Logic2Text. 以前从结构化数据生成自然语言 (NLG) 的工作主要集中在记录序列的表层描述上。然而,对于复杂的结构化数据(例如多行表格),NLG 系统通常需要描述由跨记录逻辑推理得出的有趣事实。如果只提供表格,现有模型很难产生可控且高保真的逻辑生成结果。在这项工作中,我们将逻辑层面的 NLG 表述为从逻辑形式进行生成,以获得可控、高保真且忠实的生成结果。我们提出了一个新的大规模数据集 Logic2Text,其中包含 10,753 条描述,涉及常见逻辑类型并与底层逻辑形式配对。这些逻辑形式呈现出自由模式的多样化图结构,对模型理解语义的能力提出了很大挑战。我们在两种设置下进行实验:(1) 使用完整数据集的全监督训练;(2) 仅提供数百个配对示例的小样本设置。我们比较了几种流行的生成模型并分析了它们的性能。我们希望该数据集能够推动构建具有自然、忠实、类人生成能力的先进 NLG 系统的研究。数据集和代码可从 https://github.com/czyssrs/Logic2Text 获得。 Zhiyu Chen Wenhu Chen Hanwen Zha Xiyou Zhou Yunkai Zhang Sairam Sundaresan William Yang Wang
6 EMNLP2020 Composed Variational Natural Language Generation for Few-shot Intents https://arxiv.org/pdf/2009.10056 In this paper, we focus on generating training examples for few-shot intents in the realistic imbalanced scenario. To build connections between existing many-shot intents and few-shot intents, we consider an intent as a combination of a domain and an action, and propose a composed variational natural language generator (CLANG), a transformer-based conditional variational autoencoder. CLANG utilizes two latent variables to represent the utterances corresponding to two different independent parts (domain and action) in the intent, and the latent variables are composed together to generate natural examples. Additionally, to improve the generator learning, we adopt the contrastive regularization loss that contrasts the in-class with the out-of-class utterance generation given the intent. To evaluate the quality of the generated utterances, experiments are conducted on the generalized few-shot intent detection task. Empirical results show that our proposed model achieves state-of-the-art performances on two real-world intent detection datasets. 在本文中,我们专注于为现实不平衡场景中的小样本意图生成训练示例。为了在现有的多样本意图和少样本意图之间建立联系,我们将意图视为领域和动作的组合,并提出了一种组合变分自然语言生成器 (CLANG),一种基于 Transformer 的条件变分自编码器。CLANG 利用两个潜在变量来表示意图中两个不同独立部分(领域和动作)对应的话语,并将潜在变量组合在一起以生成自然示例。此外,为了改进生成器学习,我们采用了对比正则化损失,将给定意图下的类内话语生成与类外话语生成进行对比。为了评估生成话语的质量,我们在广义小样本意图检测任务上进行了实验。实证结果表明,我们提出的模型在两个真实世界的意图检测数据集上实现了最先进的性能。 Congying Xia Caiming Xiong Philip Yu Richard Socher
7 EMNLP2020 Continual Learning for Natural Language Generation in Task-oriented Dialog Systems https://arxiv.org/pdf/2010.00910 Natural language generation (NLG) is an essential component of task-oriented dialog systems. Despite the recent success of neural approaches for NLG, they are typically developed in an offline manner for particular domains. To better fit real-life applications where new data come in a stream, we study NLG in a “continual learning” setting to expand its knowledge to new domains or functionalities incrementally. The major challenge towards this goal is catastrophic forgetting, meaning that a continually trained model tends to forget the knowledge it has learned before. To this end, we propose a method called ARPER (Adaptively Regularized Prioritized Exemplar Replay) by replaying prioritized historical exemplars, together with an adaptive regularization technique based on Elastic Weight Consolidation. Extensive experiments to continually learn new domains and intents are conducted on MultiWoZ-2.0 to benchmark ARPER with a wide range of techniques. Empirical results demonstrate that ARPER significantly outperforms other methods by effectively mitigating the detrimental catastrophic forgetting issue. 自然语言生成 (NLG) 是面向任务的对话系统的重要组成部分。尽管最近 NLG 的神经方法取得了成功,但它们通常是针对特定领域以离线方式开发的。为了更好地适应新数据流入的现实生活应用程序,我们在“持续学习”设置中研究 NLG,以逐步将其知识扩展到新的领域或功能。实现这一目标的主要挑战是灾难性遗忘,这意味着不断训练的模型往往会忘记它之前学到的知识。为此,我们提出了一种称为 ARPER(Adaptively Regularized Prioritized Exemplar Replay)的方法,通过重放优先的历史样本,以及基于 Elastic Weight Consolidation 的自适应正则化技术。在 MultiWoZ-2.0 上进行了大量实验以不断学习新的领域和意图,以使用各种技术对 ARPER 进行基准测试。实证结果表明,ARPER 通过有效减轻有害的灾难性遗忘问题显着优于其他方法。 Fei Mi Liangwei Chen Mengjie Zhao Minlie Huang Boi Faltings
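为便于理解 ARPER 这类“样本重放 + 自适应正则”的持续学习思路,下面给出一个极简的 PyTorch 风格示意(假设性写法,并非论文官方实现);其中 fisher、old_params、replay_batch 等均为假设的输入,分别对应上一阶段估计的 Fisher 信息、参数快照和被重放的历史样本。

```python
import torch
import torch.nn.functional as F

def continual_nlg_loss(model, new_batch, replay_batch, fisher, old_params, lam=1.0):
    """持续学习式 NLG 训练损失的常见写法(示意):
    新域数据损失 + 重放样本损失 + EWC 型权重正则。"""
    loss_new = F.cross_entropy(model(new_batch["x"]), new_batch["y"])        # 新域样本
    loss_replay = F.cross_entropy(model(replay_batch["x"]), replay_batch["y"])  # 重放的历史样本
    reg = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            # 参数越重要(Fisher 越大),偏离旧参数的代价越高
            reg = reg + (fisher[name] * (p - old_params[name]).pow(2)).sum()
    return loss_new + loss_replay + lam * reg
```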
8 EMNLP2020 Dual Inference for Improving Language Understanding and Generation https://github.com/MiuLab/DuaLUG https://arxiv.org/pdf/2010.04246 Natural language understanding (NLU) and Natural language generation (NLG) tasks hold a strong dual relationship, where NLU aims at predicting semantic labels based on natural language utterances and NLG does the opposite. The prior work mainly focused on exploiting the duality in model training in order to obtain the models with better performance. However, regarding the fast-growing scale of models in the current NLP area, sometimes we may have difficulty retraining whole NLU and NLG models. To better address the issue, this paper proposes to leverage the duality in the inference stage without the need of retraining. The experiments on three benchmark datasets demonstrate the effectiveness of the proposed method in both NLU and NLG, providing the great potential of practical usage. 自然语言理解 (NLU) 和自然语言生成 (NLG) 任务具有很强的双重关系,其中 NLU 旨在根据自然语言表达预测语义标签,而 NLG 则相反。先前的工作主要集中在利用模型训练中的二元性以获得性能更好的模型。然而,对于当前 NLP 领域快速增长的模型规模,有时我们可能难以重新训练整个 NLU 和 NLG 模型。为了更好地解决这个问题,本文建议在推理阶段利用二元性,而无需重新训练。在三个基准数据集上的实验证明了该方法在 NLU 和 NLG 中的有效性,提供了巨大的实际使用潜力。 Shang-Yu Su Yung-Sung Chuang Yun-Nung Chen
9 EMNLP2019 Neural data-to-text generation: A comparison between pipeline and end-to-end architectures https://github.com/ThiagoCF05/webnlg https://arxiv.org/pdf/1908.09022 Traditionally, most data-to-text applications have been designed using a modular pipeline architecture, in which non-linguistic input data is converted into natural language through several intermediate transformations. In contrast, recent neural models for data-to-text generation have been proposed as end-to-end approaches, where the non-linguistic input is rendered in natural language with much less explicit intermediate representations in-between. This study introduces a systematic comparison between neural pipeline and end-to-end data-to-text approaches for the generation of text from RDF triples. Both architectures were implemented making use of state-of-the art deep learning methods as the encoder-decoder Gated-Recurrent Units (GRU) and Transformer. Automatic and human evaluations together with a qualitative analysis suggest that having explicit intermediate steps in the generation process results in better texts than the ones generated by end-to-end approaches. Moreover, the pipeline models generalize better to unseen inputs. Data and code are publicly available. 传统上,大多数数据到文本应用程序都是使用模块化管道架构设计的,其中非语言输入数据通过几个中间转换转换为自然语言。相比之下,最近用于数据到文本生成的神经模型被提出作为端到端方法,其中非语言输入被直接转换成自然语言,中间几乎不使用显式的中间表示。本研究对神经管道方法与端到端数据到文本方法进行了系统比较,用于从 RDF 三元组生成文本。这两种架构都基于最先进的深度学习方法实现,采用门控循环单元 (GRU) 和 Transformer 作为编码器-解码器。自动和人工评估以及定性分析表明,在生成过程中具有明确的中间步骤会产生比端到端方法生成的文本更好的文本。此外,管道模型可以更好地泛化到未见过的输入。数据和代码是公开的。 Thiago Castro Ferreira Chris van der Lee Emiel van Miltenburg Emiel Krahmer
10 EMNLP2019 MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance https://arxiv.org/pdf/1909.02622 A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms. In this paper we investigate strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality. We validate our new metric, namely MoverScore, on a number of text generation tasks including summarization, machine translation, image captioning, and data-to-text generation, where the outputs are produced by a variety of neural and non-neural systems. Our findings suggest that metrics combining contextualized representations with a distance measure perform the best. Such metrics also demonstrate strong generalization capability across tasks. For ease-of-use we make our metrics available as web service. 一个强大的评估指标对文本生成系统的发展有着深远的影响。理想的度量根据语义而不是表面形式将系统输出与引用进行比较。在本文中,我们研究了对系统和参考文本进行编码的策略,以设计出与人类对文本质量的判断高度相关的度量。我们在许多文本生成任务上验证了我们的新指标,即 MoverScore,包括摘要、机器翻译、图像字幕和数据到文本生成,其中输出由各种神经和非神经系统生成。我们的研究结果表明,将上下文表示与距离度量相结合的指标表现最佳。这些指标还展示了跨任务的强大泛化能力。为了便于使用,我们将我们的指标作为 Web 服务提供。 Wei Zhao Maxime Peyrard Fei Liu Yang Gao Christian M. Meyer Steffen Eger
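下面是一个简化示意,说明“上下文词向量 + Earth Mover 距离”这类指标的基本计算方式:假设已经用某个预训练编码器(例如 BERT)得到系统输出与参考文本的逐词向量,这里只实现一个松弛版的 Word Mover 距离;真实的 MoverScore 还包含 IDF 加权与更完整的最优传输求解,此处仅作思路示意。

```python
import numpy as np

def relaxed_word_mover_distance(sys_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """sys_emb / ref_emb: 形状为 (词数, 维度) 的上下文词向量矩阵(假设的输入)。
    返回一个松弛版 Word Mover 距离:每个词只流向对方句子中代价最小的词。"""
    # 归一化后用 1 - 余弦相似度作为词之间的搬运代价
    a = sys_emb / np.linalg.norm(sys_emb, axis=1, keepdims=True)
    b = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                      # (n_sys, n_ref) 代价矩阵
    # 松弛最优传输:两个方向分别取最近邻代价的平均,再取较大者
    d_sys2ref = cost.min(axis=1).mean()
    d_ref2sys = cost.min(axis=0).mean()
    return float(max(d_sys2ref, d_ref2sys))
```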
11 EMNLP2019 Select and Attend: Towards Controllable Content Selection in Text Generation https://github.com/chin-gyou/controllable-selection https://arxiv.org/pdf/1909.04453 Many text generation tasks naturally contain two steps: content selection and surface realization. Current neural encoder-decoder models conflate both steps into a black-box architecture. As a result, the content to be described in the text cannot be explicitly controlled. This paper tackles this problem by decoupling content selection from the decoder. The decoupled content selection is human interpretable, whose value can be manually manipulated to control the content of generated text. The model can be trained end-to-end without human annotations by maximizing a lower bound of the marginal likelihood. We further propose an effective way to trade-off between performance and controllability with a single adjustable hyperparameter. In both data-to-text and headline generation tasks, our model achieves promising results, paving the way for controllable content selection in text generation. 许多文本生成任务自然包含两个步骤:内容选择和表面实现。当前的神经编码器-解码器模型将这两个步骤合并为一个黑盒架构。因此,无法明确控制文本中要描述的内容。本文通过将内容选择与解码器解耦来解决这个问题。解耦的内容选择是人类可解释的,其值可以手动操作以控制生成文本的内容。通过最大化边际似然的下限,可以在没有人工注释的情况下端到端地训练模型。我们进一步提出了一种有效的方法,可以通过单个可调超参数在性能和可控性之间进行权衡。在数据到文本和标题生成任务中,我们的模型取得了可喜的结果,为文本生成中的可控内容选择铺平了道路。 Xiaoyu Shen Jun Suzuki Kentaro Inui Hui Su Dietrich Klakow Satoshi Sekine
12 EMNLP2019 Knowledge Aware Conversation Generation with Explainable Reasoning over Augmented Graphs https://github.com/PaddlePaddle/Research/tree/master/NLP/EMNLP2019-AKGCM https://arxiv.org/pdf/1903.10245 Two types of knowledge, triples from knowledge graphs and texts from documents, have been studied for knowledge aware open-domain conversation generation, in which graph paths can narrow down vertex candidates for knowledge selection decision, and texts can provide rich information for response generation. Fusion of a knowledge graph and texts might yield mutually reinforcing advantages, but there is less study on that. To address this challenge, we propose a knowledge aware chatting machine with three components, an augmented knowledge graph with both triples and texts, knowledge selector, and knowledge aware response generator. For knowledge selection on the graph, we formulate it as a problem of multi-hop graph reasoning to effectively capture conversation flow, which is more explainable and flexible in comparison with previous work. To fully leverage long text information that differentiates our graph from others, we improve a state of the art reasoning algorithm with machine reading comprehension technology. We demonstrate the effectiveness of our system on two datasets in comparison with state-of-the-art models. 已经研究了两种类型的知识,知识图谱中的三元组和文档中的文本,用于知识感知开放域对话生成,其中图路径可以缩小知识选择决策的顶点候选,文本可以为响应生成提供丰富的信息。知识图谱和文本的融合可能会产生相辅相成的优势,但对此的研究较少。为了应对这一挑战,我们提出了一种具有三个组件的知识感知聊天机,一个具有三元组和文本的增强知识图,知识选择器和知识感知响应生成器。对于图上的知识选择,我们将其表述为多跳图推理问题,以有效捕获对话流,与以前的工作相比,它更具可解释性和灵活性。为了充分利用将我们的图与其他图区分开来的长文本信息,我们使用机器阅读理解技术改进了最先进的推理算法。与最先进的模型相比,我们证明了我们的系统在两个数据集上的有效性。 Zhibin Liu Zheng-Yu Niu Hua Wu Haifeng Wang
13 EMNLP2019 Autoregressive Text Generation Beyond Feedback Loops https://github.com/schmiflo/crf-generation https://arxiv.org/pdf/1908.11658 Autoregressive state transitions, where predictions are conditioned on past predictions, are the predominant choice for both deterministic and stochastic sequential models. However, autoregressive feedback exposes the evolution of the hidden state trajectory to potential biases from well-known train-test discrepancies. In this paper, we combine a latent state space model with a CRF observation model. We argue that such autoregressive observation models form an interesting middle ground that expresses local correlations on the word level but keeps the state evolution non-autoregressive. On unconditional sentence generation we show performance improvements compared to RNN and GAN baselines while avoiding some prototypical failure modes of autoregressive models. 自回归状态转换,其中预测以过去的预测为条件,是确定性和随机序列模型的主要选择。然而,自回归反馈将隐藏状态轨迹的演变暴露于众所周知的训练测试差异的潜在偏差。在本文中,我们将潜在状态空间模型与 CRF 观察模型相结合。我们认为,这种自回归观察模型形成了一个有趣的中间立场,它在单词级别上表达了局部相关性,但保持状态演化非自回归。在无条件句子生成方面,我们展示了与 RNN 和 GAN 基线相比的性能改进,同时避免了自回归模型的一些原型故障模式。 Florian Schmidt Stephan Mandt Thomas Hofmann
14 EMNLP2019 ARAML: A Stable Adversarial Training Framework for Text Generation https://github.com/kepei1106/ARAML https://arxiv.org/pdf/1908.07195 Most of the existing generative adversarial networks (GAN) for text generation suffer from the instability of reinforcement learning training algorithms such as policy gradient, leading to unstable performance. To tackle this problem, we propose a novel framework called Adversarial Reward Augmented Maximum Likelihood (ARAML). During adversarial training, the discriminator assigns rewards to samples which are acquired from a stationary distribution near the data rather than the generator’s distribution. The generator is optimized with maximum likelihood estimation augmented by the discriminator’s rewards instead of policy gradient. Experiments show that our model can outperform state-of-the-art text GANs with a more stable training process. 大多数现有的用于文本生成的生成对抗网络(GAN)都受到强化学习训练算法(如策略梯度)的不稳定性的影响,导致性能不稳定。为了解决这个问题,我们提出了一种称为对抗性奖励增强最大似然(ARAML)的新框架。在对抗性训练期间,鉴别器将奖励分配给从数据附近的平稳分布而不是生成器的分布中获得的样本。生成器通过最大似然估计进行优化,由鉴别器的奖励而不是策略梯度来增强。实验表明,我们的模型可以通过更稳定的训练过程超越最先进的文本 GAN。 Pei Ke Fei Huang Minlie Huang Xiaoyan Zhu
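下面给出奖励增强最大似然这一训练思路的示意代码(假设性的简化写法,并非 ARAML 官方实现):对从数据附近平稳分布中采得的样本,用判别器奖励经 softmax 归一化后的权重来加权其负对数似然,从而避免直接使用策略梯度。

```python
import torch
import torch.nn.functional as F

def reward_augmented_ml_loss(generator, discriminator, samples, temperature=1.0):
    """samples: 假设为形状 (B, T) 的样本序列张量,取自数据附近的平稳分布。
    generator / discriminator 的调用接口均为假设:前者返回 (B, T-1, V) 的 logits,
    后者返回每个样本的标量奖励 (B,)。"""
    with torch.no_grad():
        rewards = discriminator(samples)                    # 判别器奖励(假设接口)
        weights = F.softmax(rewards / temperature, dim=0)   # 奖励越高,样本权重越大
    logits = generator(samples[:, :-1])                     # 教师强制下的下一词 logits
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        samples[:, 1:].reshape(-1),
        reduction="none",
    ).view(samples.size(0), -1).sum(dim=1)                  # 每个样本的负对数似然
    return (weights * nll).sum()
```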
15 EMNLP2019 Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation https://github.com/Crista23/JudgeTheJudges https://arxiv.org/pdf/1901.00398 We conduct a large-scale, systematic study to evaluate the existing evaluation methods for natural language generation in the context of generating online product reviews. We compare human-based evaluators with a variety of automated evaluation procedures, including discriminative evaluators that measure how well machine-generated text can be distinguished from human-written text, as well as word overlap metrics that assess how similar the generated text compares to human-written references. We determine to what extent these different evaluators agree on the ranking of a dozen of state-of-the-art generators for online product reviews. We find that human evaluators do not correlate well with discriminative evaluators, leaving a bigger question of whether adversarial accuracy is the correct objective for natural language generation. In general, distinguishing machine-generated text is challenging even for human evaluators, and human decisions correlate better with lexical overlaps. We find lexical diversity an intriguing metric that is indicative of the assessments of different evaluators. A post-experiment survey of participants provides insights into how to evaluate and improve the quality of natural language generation systems. 我们进行了大规模、系统的研究,以在生成在线产品评论的背景下评估现有的自然语言生成评估方法。我们将基于人类的评估器与各种自动评估程序进行比较,包括衡量机器生成文本与人类撰写文本可区分程度的判别式评估器,以及衡量生成文本与人工撰写的参考文本相似程度的词重叠指标。我们考察了这些不同的评估者在十几个最先进的在线产品评论生成器的排名上达成一致的程度。我们发现人类评估者与判别性评估者的相关性不佳,这留下了一个更大的问题,即对抗性准确性是否是自然语言生成的正确目标。一般来说,即使对于人类评估者来说,区分机器生成的文本也是一项挑战,而人类决策与词汇重叠的相关性更好。我们发现词汇多样性是一个有趣的指标,能够反映不同评估者的评估倾向。对参与者的实验后调查提供了有关如何评估和提高自然语言生成系统质量的见解。 Cristina Garbacea Samuel Carton Shiyan Yan Qiaozhu Mei

NAACL

序号 会议/期刊 论文 主要技术 代码 论文下载地址 摘要 摘要翻译 作者
1 NAACL2021 APo-VAE: Text Generation in Hyperbolic Space https://arxiv.org/pdf/2005.00054 Natural language often exhibits inherent hierarchical structure ingrained with complex syntax and semantics. However, most state-of-the-art deep generative models learn embeddings only in Euclidean vector space, without accounting for this structural property of language. In this paper, we investigate text generation in a hyperbolic latent space to learn continuous hierarchical representations. An Adversarial Poincare Variational Autoencoder (APo-VAE) is presented, where both the prior and variational posterior of latent variables are defined over a Poincare ball via wrapped normal distributions. By adopting the primal-dual formulation of KL divergence, an adversarial learning procedure is introduced to empower robust model training. Extensive experiments in language modeling and dialog-response generation tasks demonstrate the winning effectiveness of the proposed APo-VAE model over VAEs in Euclidean latent space, thanks to its superb capabilities in capturing latent language hierarchies in hyperbolic space. 自然语言通常表现出根深蒂固的复杂语法和语义的固有层次结构。然而,大多数最先进的深度生成模型仅在欧几里得向量空间中学习嵌入,而没有考虑语言的这种结构特性。在本文中,我们研究了双曲潜在空间中的文本生成,以学习连续的分层表示。提出了对抗性庞加莱变分自动编码器 (APo-VAE),其中潜在变量的先验和变分后验都通过包裹正态分布在庞加莱球上定义。通过采用 KL 散度的原始对偶公式,引入了对抗性学习程序以增强稳健的模型训练。语言建模和对话响应生成任务中的大量实验证明了所提出的 APo-VAE 模型在欧几里德潜在空间中的 VAE 上的获胜有效性,这要归功于其在双曲线空间中捕获潜在语言层次结构的卓越能力。 Shuyang Dai Zhe Gan Yu Cheng Chenyang Tao Lawrence Carin Jingjing Liu
2 NAACL2021 FUDGE: Controlled Text Generation With Future Discriminators https://github.com/yangkevin2/naacl-2021-fudge-controlled-generation https://arxiv.org/pdf/2104.05218 We propose Future Discriminators for Generation (FUDGE), a flexible and modular method for controlled text generation. Given a pre-existing model G for generating text from a distribution of interest, FUDGE enables conditioning on a desired attribute a (for example, formality) while requiring access only to G’s output logits. FUDGE learns an attribute predictor operating on a partial sequence, and uses this predictor’s outputs to adjust G’s original probabilities. We show that FUDGE models terms corresponding to a Bayesian decomposition of the conditional distribution of G given attribute a. Moreover, FUDGE can easily compose predictors for multiple desired attributes. We evaluate FUDGE on three tasks — couplet completion in poetry, topic control in language generation, and formality change in machine translation — and observe gains in all three tasks. 我们提出了未来生成判别器(FUDGE),这是一种用于受控文本生成的灵活且模块化的方法。给定一个用于从感兴趣的分布生成文本的预先存在的模型 G,FUDGE 可以对所需的属性 a(例如形式)进行调节,同时只需要访问 G 的输出 logits。 FUDGE 学习对部分序列进行操作的属性预测器,并使用该预测器的输出来调整 G 的原始概率。我们展示了 FUDGE 模型项对应于给定属性 a 的 G 的条件分布的贝叶斯分解。此外,FUDGE 可以轻松地为多个所需属性组合预测器。我们在三项任务上评估 FUDGE——诗歌中的对联完成、语言生成中的主题控制和机器翻译中的形式变化——并观察所有三项任务的收益。 Kevin Yang Dan Klein
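下面用一小段代码示意 FUDGE 式解码的核心一步(非官方实现,attr_predictor 等接口均为假设):按贝叶斯分解 log P(x|a) ∝ log P(x) + log P(a|前缀, x),只对基座语言模型的 top-k 候选词逐一加上属性预测器给出的对数概率,再从调整后的分布中采样。

```python
import torch
import torch.nn.functional as F

def attribute_guided_step(lm_logits, prefix_ids, attr_predictor, top_k=200, weight=1.0):
    """lm_logits: 基座语言模型对下一词的 logits,形状 (V,)。
    attr_predictor: 假设的接口,输入"前缀 + 候选词"的 id 序列,
    返回该序列满足目标属性 a 的对数概率(标量)。"""
    log_p = F.log_softmax(lm_logits, dim=-1)        # 基座模型的下一词对数分布
    topk_logp, topk_ids = log_p.topk(top_k)         # 只重排 top-k 候选,控制开销
    adjusted = topk_logp.clone()
    for i, tok in enumerate(topk_ids.tolist()):
        cand = torch.cat([prefix_ids, torch.tensor([tok])])
        adjusted[i] = topk_logp[i] + weight * attr_predictor(cand)  # 叠加属性项
    probs = F.softmax(adjusted, dim=-1)
    return topk_ids[torch.multinomial(probs, 1)]    # 采样下一个词的 id
```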
3 NAACL2021 NeuroLogic Decoding: (Un)supervised Neural Text Generation with Predicate Logic Constraints https://arxiv.org/pdf/2010.12884 Conditional text generation often requires lexical constraints, i.e., which words should or shouldn’t be included in the output text. While the dominant recipe for conditional text generation has been large-scale pretrained language models that are finetuned on the task-specific training data, such models do not learn to follow the underlying constraints reliably, even when supervised with large amounts of task-specific examples.
We propose NeuroLogic Decoding, a simple yet effective algorithm that enables neural language models — supervised or not — to generate fluent text while satisfying complex lexical constraints. Our approach is powerful yet efficient. It handles any set of lexical constraints that is expressible under predicate logic, while its asymptotic runtime is equivalent to conventional beam search.
Empirical results on four benchmarks show that NeuroLogic Decoding outperforms previous approaches, including algorithms that handle a subset of our constraints. Moreover, we find that unsupervised models with NeuroLogic Decoding often outperform supervised models with conventional decoding, even when the latter is based on considerably larger networks. Our results suggest the limit of large-scale neural networks for fine-grained controllable generation and the promise of inference-time algorithms. 条件文本生成通常需要词法约束,即哪些词应该或不应该包含在输出文本中。虽然条件文本生成的主要方法是在特定任务的训练数据上进行微调的大规模预训练语言模型,但这些模型并不能可靠地学习遵循潜在的约束,即使在有大量特定任务示例的监督下.
我们提出了 NeuroLogic Decoding,这是一种简单而有效的算法,它使神经语言模型(无论是否受监督)都能生成流畅的文本,同时满足复杂的词汇约束。我们的方法强大而高效。它处理在谓词逻辑下可表达的任何词法约束集,而其渐近运行时等效于传统的波束搜索。
在四个基准上的实证结果表明,NeuroLogic Decoding 优于以往的方法,包括那些只能处理我们约束子集的算法。此外,我们发现使用 NeuroLogic Decoding 的无监督模型通常优于采用传统解码的有监督模型,即使后者基于规模大得多的网络。我们的结果揭示了大规模神经网络在细粒度可控生成上的局限,也展示了推理时算法的前景。 Ximing Lu Peter West Rowan Zellers Ronan Le Bras Chandra Bhagavatula Yejin Choi
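作为参考,下面是一个受 NeuroLogic Decoding 启发的打分函数示意(非官方实现):把词汇约束表示为合取范式(CNF),在常规 beam search 打分的基础上,对尚未满足的子句施加惩罚;实际算法对子句满足状态的跟踪与剪枝要精细得多,这里只保留最核心的思想。

```python
def constraint_aware_score(logprob, hypothesis_tokens, cnf_clauses, alpha=1.0):
    """cnf_clauses: 合取范式约束,每个子句是若干 (word, polarity) 字面量的析取;
    polarity=True 表示该词应当出现,False 表示不得出现(均为假设的表示方式)。
    返回在语言模型对数概率基础上扣除未满足子句惩罚后的分数。"""
    tokens = set(hypothesis_tokens)
    unsatisfied = 0
    for clause in cnf_clauses:
        satisfied = any(
            (word in tokens) if polarity else (word not in tokens)
            for word, polarity in clause
        )
        if not satisfied:
            unsatisfied += 1
    return logprob - alpha * unsatisfied
```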
4 NAACL2021 Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation https://github.com/PlusLabNLP/Plot-guided-Coherence-Evaluation https://arxiv.org/pdf/2104.05801 With the recent advances of open-domain story generation, the lack of reliable automatic evaluation metrics becomes an increasingly imperative issue that hinders the fast development of story generation. According to conducted researches in this regard, learnable evaluation metrics have promised more accurate assessments by having higher correlations with human judgments. A critical bottleneck of obtaining a reliable learnable evaluation metric is the lack of high-quality training data for classifiers to efficiently distinguish plausible and implausible machine-generated stories. Previous works relied on heuristically manipulated plausible examples to mimic possible system drawbacks such as repetition, contradiction, or irrelevant content in the text level, which can be unnatural and oversimplify the characteristics of implausible machine-generated stories. We propose to tackle these issues by generating a more comprehensive set of implausible stories using plots, which are structured representations of controllable factors used to generate stories. Since these plots are compact and structured, it is easier to manipulate them to generate text with targeted undesirable properties, while at the same time maintain the grammatical correctness and naturalness of the generated sentences. To improve the quality of generated implausible stories, we further apply the adversarial filtering procedure presented by Zellers et al. (2018) to select a more nuanced set of implausible texts. Experiments show that the evaluation metrics trained on our generated data result in more reliable automatic assessments that correlate remarkably better with human judgments compared to the baselines. 随着开放域故事生成的最新进展,缺乏可靠的自动评估指标成为阻碍故事生成快速发展的日益紧迫的问题。根据这方面已有的研究,可学习的评估指标通过与人类判断具有更高的相关性,有望实现更准确的评估。获得可靠的可学习评估指标的一个关键瓶颈是缺乏高质量的训练数据,供分类器有效区分可信与不可信的机器生成故事。以前的工作依靠启发式修改过的合理示例来模拟可能的系统缺陷,例如文本层面的重复、矛盾或不相关内容,这类做法可能不够自然,并且过度简化了不可信的机器生成故事的特征。我们建议通过使用情节(plot)生成一组更全面的不可信故事来解决这些问题,情节是用于生成故事的可控因素的结构化表示。由于这些情节紧凑且结构化,因此更容易操纵它们以生成具有针对性不良属性的文本,同时保持所生成句子的语法正确性和自然性。为了提高所生成不可信故事的质量,我们进一步应用了 Zellers et al. (2018) 提出的对抗性过滤程序来选择一组更细微的不可信文本。实验表明,在我们生成的数据上训练的评估指标会产生更可靠的自动评估,与基线相比,与人类判断的相关性明显更好。 Sarik Ghazarian Zixi Liu Akash SM Ralph Weischedel Aram Galstyan Nanyun Peng
5 NAACL2021 Progressive Generation of Long Text with Pretrained Language Models https://github.com/tanyuqian/progressive-generation https://arxiv.org/pdf/2006.15720 Large-scale language models (LMs) pretrained on massive corpora of text, such as GPT-2, are powerful open-domain text generators. However, as our systematic examination reveals, it is still challenging for such models to generate coherent long passages of text (e.g., 1000 tokens), especially when the models are fine-tuned to the target domain on a small corpus. Previous planning-then-generation methods also fall short of producing such long text in various domains. To overcome the limitations, we propose a simple but effective method of generating text in a progressive manner, inspired by generating images from low to high resolution. Our method first produces domain-specific content keywords and then progressively refines them into complete passages in multiple stages. The simple design allows our approach to take advantage of pretrained LMs at each stage and effectively adapt to any target domain given only a small set of examples. We conduct a comprehensive empirical study with a broad set of evaluation metrics, and show that our approach significantly improves upon the fine-tuned large LMs and various planning-then-generation methods in terms of quality and sample efficiency. Human evaluation also validates that our model generations are more coherent. 在海量文本语料上预训练的大规模语言模型(例如 GPT-2)是强大的开放域文本生成器。然而,正如我们的系统检查所揭示的那样,这些模型生成连贯的长文本段落(例如 1000 个标记)仍然具有挑战性,尤其是当模型在小型语料库上针对目标域进行微调时。以往的先规划后生成方法也无法在各个领域生成如此长的文本。为了克服这些限制,我们提出了一种以渐进方式生成文本的简单而有效的方法,其灵感来自于图像生成中从低分辨率到高分辨率的过程。我们的方法首先生成特定领域的内容关键词,然后在多个阶段逐步将它们细化为完整的段落。这种简单的设计使我们的方法能够在每个阶段利用预训练的 LM,并在只给定一小部分示例的情况下有效地适应任何目标域。我们使用广泛的评估指标进行了全面的实证研究,并表明我们的方法在质量和样本效率方面显着优于微调的大型 LM 和各种先规划后生成的方法。人工评估还验证了我们的模型生成更加连贯。 Bowen Tan Zichao Yang Maruan Al-Shedivat Eric P. Xing Zhiting Hu
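下面的示意代码展示“由粗到细”的渐进式生成流程(假设性写法,stage_models 为一组假设的条件生成函数,并非论文官方实现):第一阶段先产出领域关键词式的草稿,后续每个阶段都以原始提示和上一阶段结果为条件逐步细化。

```python
def progressive_generate(prompt, stage_models):
    """stage_models: 由粗到细排列的条件生成函数列表(假设的接口),
    每个函数接收原始提示与上一阶段的草稿,返回更完整的文本。"""
    draft = prompt
    for generate_stage in stage_models:
        # 每一阶段都在"原始提示 + 当前草稿"的条件下进一步细化文本
        draft = generate_stage(prompt=prompt, previous=draft)
    return draft
```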
6 NAACL2021 OodGAN: Generative Adversarial Network for Out-of-Domain Data Generation https://arxiv.org/pdf/2104.02484 Detecting an Out-of-Domain (OOD) utterance is crucial for a robust dialog system. Most dialog systems are trained on a pool of annotated OOD data to achieve this goal. However, collecting the annotated OOD data for a given domain is an expensive process. To mitigate this issue, previous works have proposed generative adversarial networks (GAN) based models to generate OOD data for a given domain automatically. However, these proposed models do not work directly with the text. They work with the text’s latent space instead, enforcing these models to include components responsible for encoding text into latent space and decoding it back, such as auto-encoder. These components increase the model complexity, making it difficult to train. We propose OodGAN, a sequential generative adversarial network (SeqGAN) based model for OOD data generation. Our proposed model works directly on the text and hence eliminates the need to include an auto-encoder. OOD data generated using OodGAN model outperforms state-of-the-art in OOD detection metrics for ROSTD (67% relative improvement in FPR 0.95) and OSQ datasets (28% relative improvement in FPR 0.95) (Zheng et al., 2020). 检测域外 (OOD) 话语对于强大的对话系统至关重要。大多数对话系统都在带注释的 OOD 数据池上进行训练以实现这一目标。但是,为给定域收集带注释的 OOD 数据是一个昂贵的过程。为了缓解这个问题,以前的工作提出了基于生成对抗网络 (GAN) 的模型来自动生成给定域的 OOD 数据。然而,这些提议的模型并不直接与文本一起工作。相反,它们使用文本的潜在空间,强制这些模型包含负责将文本编码到潜在空间并将其解码回来的组件,例如自动编码器。这些组件增加了模型的复杂性,使其难以训练。我们提出了 OodGAN,这是一种基于序列生成对抗网络 (SeqGAN) 的模型,用于生成 OOD 数据。我们提出的模型直接作用于文本,因此不需要包含自动编码器。使用 OodGAN 模型生成的 OOD 数据在 ROSTD(FPR 0.95 的相对改进 67%)和 OSQ 数据集(FPR 0.95 的相对改进 28%)(Zheng 等人,2020)的 OOD 检测指标方面优于最新技术。 Petr Marek Vishal Ishwar Naik Vincent Auvray Anuj Goyal
7 NAACL2019 Jointly Optimizing Diversity and Relevance in Neural Response Generation https://arxiv.org/pdf/1902.11205 Although recent neural conversation models have shown great potential, they often generate bland and generic responses. While various approaches have been explored to diversify the output of the conversation model, the improvement often comes at the cost of decreased relevance. In this paper, we propose a SpaceFusion model to jointly optimize diversity and relevance that essentially fuses the latent space of a sequence-to-sequence model and that of an autoencoder model by leveraging novel regularization terms. As a result, our approach induces a latent space in which the distance and direction from the predicted response vector roughly match the relevance and diversity, respectively. This property also lends itself well to an intuitive visualization of the latent space. Both automatic and human evaluation results demonstrate that the proposed approach brings significant improvement compared to strong baselines in both diversity and relevance. 尽管最近的神经对话模型显示出巨大的潜力,但它们通常会产生平淡而一般的反应。虽然已经探索了各种方法来使对话模型的输出多样化,但改进通常以降低相关性为代价。在本文中,我们提出了一种 SpaceFusion 模型来联合优化多样性和相关性,该模型通过利用新的正则化项基本上融合了序列到序列模型的潜在空间和自动编码器模型的潜在空间。结果,我们的方法引入了一个潜在空间,其中与预测响应向量的距离和方向分别大致匹配相关性和多样性。此属性也非常适合潜在空间的直观可视化。自动和人工评估结果都表明,与多样性和相关性的强大基线相比,所提出的方法带来了显着的改进。 Xiang Gao Sungjin Lee Yizhe Zhang Chris Brockett Michel Galley Jianfeng Gao Bill Dolan
8 NAACL2019 Neural Text Generation from Rich Semantic Representations https://github.com/shlurbee/dmrs-text-generation-naacl2019 https://arxiv.org/pdf/1904.11564 We propose neural models to generate high-quality text from structured representations based on Minimal Recursion Semantics (MRS). MRS is a rich semantic representation that encodes more precise semantic detail than other representations such as Abstract Meaning Representation (AMR). We show that a sequence-to-sequence model that maps a linearization of Dependency MRS, a graph-based representation of MRS, to English text can achieve a BLEU score of 66.11 when trained on gold data. The performance can be improved further using a high-precision, broad coverage grammar-based parser to generate a large silver training corpus, achieving a final BLEU score of 77.17 on the full test set, and 83.37 on the subset of test data most closely matching the silver data domain. Our results suggest that MRS-based representations are a good choice for applications that need both structured semantics and the ability to produce natural language text as output. 我们提出神经模型,从基于最小递归语义 (MRS) 的结构化表示生成高质量文本。MRS 是一种丰富的语义表示,与抽象意义表示 (AMR) 等其他表示相比,它编码了更精确的语义细节。我们表明,将依存 MRS(一种基于图的 MRS 表示)的线性化映射到英文文本的序列到序列模型,在金标数据上训练时可以达到 66.11 的 BLEU 分数。使用高精度、覆盖面广的基于语法的解析器生成大规模银标训练语料,可以进一步提升性能,在完整测试集上取得 77.17 的最终 BLEU 分数,在与银标数据领域最接近的测试数据子集上取得 83.37。我们的结果表明,对于既需要结构化语义又需要生成自然语言文本作为输出的应用,基于 MRS 的表示是一个不错的选择。 Valerie Hajdik Jan Buys Michael W. Goodman Emily M. Bender
9 NAACL2019 Text Generation from Knowledge Graphs with Graph Transformers https://github.com/rikdz/GraphWriter https://arxiv.org/pdf/1904.02342 Generating texts which express complex ideas spanning multiple sentences requires a structured representation of their content (document plan), but these representations are prohibitively expensive to manually produce. In this work, we address the problem of generating coherent multi-sentence texts from the output of an information extraction system, and in particular a knowledge graph. Graphical knowledge representations are ubiquitous in computing, but pose a significant challenge for text generation techniques due to their non-hierarchical nature, collapsing of long-distance dependencies, and structural variety. We introduce a novel graph transforming encoder which can leverage the relational structure of such knowledge graphs without imposing linearization or hierarchical constraints. Incorporated into an encoder-decoder setup, we provide an end-to-end trainable system for graph-to-text generation that we apply to the domain of scientific text. Automatic and human evaluations show that our technique produces more informative texts which exhibit better document structure than competitive encoder-decoder methods. 生成表达跨越多个句子的复杂想法的文本需要对其内容进行结构化表示(文档计划),但这些表示手动生成的成本高得令人望而却步。在这项工作中,我们解决了从信息提取系统的输出,尤其是知识图谱中生成连贯的多句文本的问题。图形知识表示在计算中无处不在,但由于其非层次性、长距离依赖的崩溃和结构多样性,对文本生成技术构成了重大挑战。我们引入了一种新颖的图转换编码器,它可以利用此类知识图的关系结构,而无需施加线性化或分层约束。结合到编码器 - 解码器设置中,我们提供了一个端到端的可训练系统,用于我们应用于科学文本领域的图到文本生成。自动和人工评估表明,我们的技术产生了更多信息文本,与竞争性编码器 - 解码器方法相比,这些文本表现出更好的文档结构。 Rik Koncel-Kedziorski Dhanush Bekal Yi Luan Mirella Lapata Hannaneh Hajishirzi
10 NAACL2019 Text Generation with Exemplar-based Adaptive Decoding https://arxiv.org/pdf/1904.04428 We propose a novel conditioned text generation model. It draws inspiration from traditional template-based text generation techniques, where the source provides the content (i.e., what to say), and the template influences how to say it. Building on the successful encoder-decoder paradigm, it first encodes the content representation from the given input text; to produce the output, it retrieves exemplar text from the training data as “soft templates,” which are then used to construct an exemplar-specific decoder. We evaluate the proposed model on abstractive text summarization and data-to-text generation. Empirical results show that this model achieves strong performance and outperforms comparable baselines. 我们提出了一种新颖的条件文本生成模型。它从传统的基于模板的文本生成技术中汲取灵感,其中源提供内容(即说什么),而模板影响如何说。基于成功的编码器-解码器范例,它首先对给定输入文本的内容表示进行编码;为了产生输出,它从训练数据中检索示例文本作为“软模板”,然后用于构建特定于示例的解码器。我们在抽象文本摘要和数据到文本生成方面评估了所提出的模型。实证结果表明,该模型实现了强大的性能并优于可比较的基线。 Hao Peng Ankur P. Parikh Manaal Faruqui Bhuwan Dhingra Dipanjan Das
11 NAACL2019 Accelerated Reinforcement Learning for Sentence Generation by Vocabulary Prediction https://github.com/hassyGo/NLG-RL https://arxiv.org/pdf/1809.01694 A major obstacle in reinforcement learning-based sentence generation is the large action space whose size is equal to the vocabulary size of the target-side language. To improve the efficiency of reinforcement learning, we present a novel approach for reducing the action space based on dynamic vocabulary prediction. Our method first predicts a fixed-size small vocabulary for each input to generate its target sentence. The input-specific vocabularies are then used at supervised and reinforcement learning steps, and also at test time. In our experiments on six machine translation and two image captioning datasets, our method achieves faster reinforcement learning (~2.7x faster) with less GPU memory (~2.3x less) than the full-vocabulary counterpart. The reinforcement learning with our method consistently leads to significant improvement of BLEU scores, and the scores are equal to or better than those of baselines using the full vocabularies, with faster decoding time (~3x faster) on CPUs. 基于强化学习的句子生成的一个主要障碍是巨大的动作空间,其大小等于目标端语言的词表大小。为了提高强化学习的效率,我们提出了一种基于动态词表预测来缩减动作空间的新方法。我们的方法首先为每个输入预测一个固定大小的小词表,用于生成其目标句子。然后在监督学习和强化学习步骤以及测试时使用这些特定于输入的小词表。在六个机器翻译和两个图像字幕数据集上的实验中,与使用完整词表的对应方法相比,我们的方法实现了更快的强化学习(约 2.7 倍),且占用的 GPU 内存更少(约 2.3 倍)。使用我们的方法进行强化学习始终能显著提升 BLEU 分数,其分数不低于使用完整词表的基线,同时在 CPU 上的解码速度更快(约 3 倍)。 Kazuma Hashimoto Yoshimasa Tsuruoka
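该方法加速强化学习的关键一步可以用下面的小例子示意(假设性的简化写法,非官方实现):在得到词表预测器针对当前输入给出的小词表之后,把其余词的 logit 置为负无穷,使后续的 softmax、采样以及强化学习的动作空间都只落在这个小词表内。

```python
import torch

def restrict_logits_to_predicted_vocab(logits, small_vocab_ids):
    """logits: 对全词表的打分,形状 (..., V);
    small_vocab_ids: 词表预测器给出的小词表 id 列表(假设的输入)。
    返回仅保留小词表的 logits,其余位置为 -inf。"""
    mask = torch.full_like(logits, float("-inf"))
    mask[..., small_vocab_ids] = 0.0       # 小词表内不惩罚,词表外置为 -inf
    return logits + mask
```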
12 NAACL2019 Pre-trained language model representations for language generation https://github.com/pytorch/fairseq https://arxiv.org/pdf/1903.09722 Pre-trained language model representations have been successful in a wide range of language understanding tasks. In this paper, we examine different strategies to integrate pre-trained representations into sequence to sequence models and apply it to neural machine translation and abstractive summarization. We find that pre-trained representations are most effective when added to the encoder network which slows inference by only 14%. Our experiments in machine translation show gains of up to 5.3 BLEU in a simulated resource-poor setup. While returns diminish with more labeled data, we still observe improvements when millions of sentence-pairs are available. Finally, on abstractive summarization we achieve a new state of the art on the full text version of CNN/DailyMail. 预训练的语言模型表示已在广泛的语言理解任务中取得成功。在本文中,我们研究了将预训练表示集成到序列到序列模型中的不同策略,并将其应用于神经机器翻译和抽象摘要。我们发现,将预训练的表示添加到编码器网络时最有效,仅将推理速度降低 14%。我们的机器翻译实验表明,在模拟资源匮乏的设置中,增益高达 5.3 BLEU。虽然回报随着更多标记数据而减少,但当有数百万个句子对可用时,我们仍然观察到改进。最后,在抽象摘要方面,我们在 CNN/DailyMail 的全文版本上实现了最新的技术水平。 Sergey Edunov Alexei Baevski Michael Auli
13 NAACL2019 Pragmatically Informative Text Generation https://arxiv.org/pdf/1904.01301 We improve the informativeness of models for conditional text generation using techniques from computational pragmatics. These techniques formulate language production as a game between speakers and listeners, in which a speaker should generate output text that a listener can use to correctly identify the original input that the text describes. While such approaches are widely used in cognitive science and grounded language learning, they have received less attention for more standard language generation tasks. We consider two pragmatic modeling methods for text generation: one where pragmatics is imposed by information preservation, and another where pragmatics is imposed by explicit modeling of distractors. We find that these methods improve the performance of strong existing systems for abstractive summarization and generation from structured meaning representations. 我们使用计算语用学的技术提高了条件文本生成模型的信息量。这些技术将语言生成表述为说话者和听者之间的游戏,其中说话者应该生成输出文本,听者可以使用该输出文本来正确识别文本描述的原始输入。虽然这些方法广泛用于认知科学和基础语言学习,但它们在更标准的语言生成任务中受到的关注较少。我们考虑了两种用于文本生成的语用建模方法:一种是通过信息保存来施加语用,另一种是通过干扰项的显式建模来施加语用。我们发现这些方法提高了强大的现有系统的性能,用于从结构化含义表示中进行抽象摘要和生成。 Sheng Shen Daniel Fried Jacob Andreas Dan Klein
14 NAACL2019 Stochastic Wasserstein Autoencoder for Probabilistic Sentence Generation https://github.com/HareeshBahuleyan/probabilistic_nlg https://arxiv.org/pdf/1806.08462 The variational autoencoder (VAE) imposes a probabilistic distribution (typically Gaussian) on the latent space and penalizes the Kullback—Leibler (KL) divergence between the posterior and prior. In NLP, VAEs are extremely difficult to train due to the problem of KL collapsing to zero. One has to implement various heuristics such as KL weight annealing and word dropout in a carefully engineered manner to successfully train a VAE for text. In this paper, we propose to use the Wasserstein autoencoder (WAE) for probabilistic sentence generation, where the encoder could be either stochastic or deterministic. We show theoretically and empirically that, in the original WAE, the stochastically encoded Gaussian distribution tends to become a Dirac-delta function, and we propose a variant of WAE that encourages the stochasticity of the encoder. Experimental results show that the latent space learned by WAE exhibits properties of continuity and smoothness as in VAEs, while simultaneously achieving much higher BLEU scores for sentence reconstruction. Hareesh Bahuleyan Lili Mou Hao Zhou Olga Vechtomova
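作为背景,Wasserstein 自编码器的通用训练目标可以写成下式(此处给出的是 WAE 的一般形式,并非该论文的完整目标):与 VAE 对每个样本的后验施加 KL 惩罚不同,WAE 只要求聚合后验 q_phi(z) 整体接近先验 p(z),惩罚项 D 可取 MMD 等散度,c 为重构代价,g_theta 为解码器。

```latex
\min_{\phi,\theta}\;
\mathbb{E}_{p_{\mathrm{data}}(x)}\,\mathbb{E}_{q_{\phi}(z\mid x)}
\big[\, c\big(x,\, g_{\theta}(z)\big) \,\big]
\;+\; \lambda\, D\!\big(q_{\phi}(z),\, p(z)\big),
\qquad
q_{\phi}(z) \;=\; \mathbb{E}_{p_{\mathrm{data}}(x)}\big[\, q_{\phi}(z\mid x) \,\big]
```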

COLING

序号 会议/期刊 论文 主要技术 代码 论文下载地址 摘要 摘要翻译 作者
1 COLING2020 Affective Text Generation https://github.com/ishikasingh/Affective-text-gen https://arxiv.org/pdf/2011.04000 Human use language not just to convey information but also to express their inner feelings and mental states. In this work, we adapt the state-of-the-art language generation models to generate affective (emotional) text. We posit a model capable of generating affect-driven and topic-focused sentences without losing grammatical correctness as the affect intensity increases. We propose to incorporate emotion as prior for the probabilistic state-of-the-art text generation model such as GPT-2. The model gives a user the flexibility to control the category and intensity of emotion as well as the topic of the generated text. Previous attempts at modelling fine-grained emotions fall out on grammatical correctness at extreme intensities, but our model is resilient to this and delivers robust results at all intensities. We conduct automated evaluations and human studies to test the performance of our model and provide a detailed comparison of the results with other models. In all evaluations, our model outperforms existing affective text generation models. 人类使用语言不仅是为了传达信息,也是为了表达内心的感受和心理状态。在这项工作中,我们调整最先进的语言生成模型来生成情感文本。我们假设一个模型能够生成情感驱动且聚焦主题的句子,并且随着情感强度的增加不会失去语法正确性。我们提出将情感作为先验,融入 GPT-2 等最先进的概率文本生成模型。该模型使用户可以灵活地控制情绪的类别和强度以及生成文本的主题。以往对细粒度情绪建模的尝试在情感强度很高时会损失语法正确性,而我们的模型对此具有鲁棒性,在所有强度下都能给出稳健的结果。我们进行自动评估和人工研究来测试模型的性能,并提供与其他模型结果的详细比较。在所有评估中,我们的模型优于现有的情感文本生成模型。 Ishika Singh Ahsan Barkati Tushar Goswamy Ashutosh Modi
2 COLING2020 Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale https://arxiv.org/pdf/2010.13588 Automatic evaluation of language generation systems is a well-studied problem in Natural Language Processing. While novel metrics are proposed every year, a few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation, despite their known limitations. This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them. In this paper, we urge the community for more careful consideration of how they automatically evaluate their models by demonstrating important failure cases on multiple datasets, language pairs and tasks. Our experiments show that metrics (i) usually prefer system outputs to human-authored texts, (ii) can be insensitive to correct translations of rare words, (iii) can yield surprisingly high scores when given a single sentence as system output for the entire test set. 语言生成系统的自动评估是自然语言处理中一个经过充分研究的问题。虽然每年都会提出新的指标,但一些流行的指标仍然是评估图像字幕和机器翻译等任务的事实标准,尽管它们有已知的局限性。这部分是由于易于使用,部分是因为研究人员希望看到它们并知道如何解释它们。在本文中,我们通过在多个数据集、语言对和任务上展示重要的失败案例,敦促社区更仔细地考虑如何自动评估他们的模型。我们的实验表明,这些指标 (i) 通常更偏好系统输出而非人工撰写的文本,(ii) 可能对罕见词的正确翻译不敏感,(iii) 当整个测试集的系统输出都是同一个句子时,仍可能给出出人意料的高分。 Ozan Caglayan Pranava Madhyastha Lucia Specia
3 COLING2020 Facts2Story: Controlling Text Generation by Key Facts https://arxiv.org/pdf/2012.04332 Recent advancements in self-attention neural network architectures have raised the bar for open-ended text generation. Yet, while current methods are capable of producing a coherent text which is several hundred words long, attaining control over the content that is being generated — as well as evaluating it — are still open questions. We propose a controlled generation task which is based on expanding a sequence of facts, expressed in natural language, into a longer narrative. We introduce human-based evaluation metrics for this task, as well as a method for deriving a large training dataset. We evaluate three methods on this task, based on fine-tuning pre-trained models. We show that while auto-regressive, unidirectional Language Models such as GPT2 produce better fluency, they struggle to adhere to the requested facts. We propose a plan-and-cloze model (using fine-tuned XLNet) which produces competitive fluency while adhering to the requested content. 自注意力神经网络架构的最新进展提高了开放式文本生成的门槛。然而,虽然当前的方法能够生成几百字长的连贯文本,但对正在生成的内容进行控制——以及对其进行评估——仍然是悬而未决的问题。我们提出了一个受控生成任务,该任务基于将用自然语言表达的一系列事实扩展为更长的叙述。我们为此任务引入了基于人类的评估指标,以及一种用于导出大型训练数据集的方法。我们基于微调预训练模型评估了针对此任务的三种方法。我们表明,虽然 GPT2 等自回归、单向语言模型产生更好的流畅性,但它们难以坚持要求的事实。我们提出了一个计划和完形填空模型(使用微调的 XLNet),该模型在坚持要求的内容的同时产生有竞争力的流畅度。 Eyal Orbach Yoav Goldberg
4 COLING2020 Vec2Sent: Probing Sentence Embeddings with Natural Language Generation https://arxiv.org/pdf/2011.00592 We introspect black-box sentence embeddings by conditionally generating from them with the objective to retrieve the underlying discrete sentence. We perceive of this as a new unsupervised probing task and show that it correlates well with downstream task performance. We also illustrate how the language generated from different encoders differs. We apply our approach to generate sentence analogies from sentence embeddings. 我们通过以黑盒句子嵌入为条件进行生成来对其进行内省,目的是还原其背后的离散句子。我们将其视为一项新的无监督探测任务,并表明它与下游任务性能有很好的相关性。我们还说明了从不同编码器生成的语言有何不同。我们应用我们的方法从句子嵌入生成句子类比。 Martin Kerscher Steffen Eger

摘要

ACL

序号 会议/期刊 论文 主要技术 代码 论文下载地址 摘要 摘要翻译 作者
1 ACL2021 Cross-Lingual Abstractive Summarization with Limited Parallel Resources https://github.com/WoodenWhite/MCLAS https://arxiv.org/pdf/2105.13648 Parallel cross-lingual summarization data is scarce, requiring models to better use the limited available cross-lingual resources. Existing methods to do so often adopt sequence-to-sequence networks with multi-task frameworks. Such approaches apply multiple decoders, each of which is utilized for a specific task. However, these independent decoders share no parameters, hence fail to capture the relationships between the discrete phrases of summaries in different languages, breaking the connections in order to transfer the knowledge of the high-resource languages to low-resource languages. To bridge these connections, we propose a novel Multi-Task framework for Cross-Lingual Abstractive Summarization (MCLAS) in a low-resource setting. Employing one unified decoder to generate the sequential concatenation of monolingual and cross-lingual summaries, MCLAS makes the monolingual summarization task a prerequisite of the cross-lingual summarization (CLS) task. In this way, the shared decoder learns interactions involving alignments and summary patterns across languages, which encourages attaining knowledge transfer. Experiments on two CLS datasets demonstrate that our model significantly outperforms three baseline models in both low-resource and full-dataset scenarios. Moreover, in-depth analysis on the generated summaries and attention heads verifies that interactions are learned well using MCLAS, which benefits the CLS task under limited parallel resources. 并行的跨语言摘要数据稀缺,需要模型更好地利用有限的可用跨语言资源。现有的方法通常采用具有多任务框架的序列到序列网络。此类方法应用多个解码器,每个解码器用于特定任务。然而,这些独立的解码器没有共享参数,因此无法捕捉不同语言摘要的离散短语之间的关系,从而打破了联系,以将高资源语言的知识转移到低资源语言。为了弥合这些联系,我们提出了一种新的多任务框架,用于在低资源环境中进行跨语言抽象摘要(MCLAS)。 MCLAS 采用一个统一的解码器来生成单语和跨语言摘要的顺序连接,使单语摘要任务成为跨语言摘要 (CLS) 任务的先决条件。通过这种方式,共享解码器学习涉及跨语言对齐和摘要模式的交互,从而鼓励实现知识转移。在两个 CLS 数据集上的实验表明,我们的模型在低资源和全数据集场景中都明显优于三个基线模型。此外,对生成的摘要和注意力头的深入分析验证了使用 MCLAS 可以很好地学习交互,这有利于有限并行资源下的 CLS 任务。 Yu Bai Yang Gao Heyan Huang
2 ACL2021 Improving Factual Consistency of Abstractive Summarization via Question Answering https://arxiv.org/pdf/2105.04623 A commonly observed problem with the state-of-the art abstractive summarization models is that the generated summaries can be factually inconsistent with the input documents. The fact that automatic summarization may produce plausible-sounding yet inaccurate summaries is a major concern that limits its wide application. In this paper we present an approach to address factual consistency in summarization. We first propose an efficient automatic evaluation metric to measure factual consistency; next, we propose a novel learning algorithm that maximizes the proposed metric during model training. Through extensive experiments, we confirm that our method is effective in improving factual consistency and even overall quality of the summaries, as judged by both automatic metrics and human evaluation. 最先进的抽象摘要模型的一个常见问题是生成的摘要可能与输入文档实际上不一致。自动摘要可能会产生看似合理但不准确的摘要,这一事实是限制其广泛应用的主要问题。在本文中,我们提出了一种解决摘要中事实一致性的方法。我们首先提出了一种有效的自动评估指标来衡量事实一致性;接下来,我们提出了一种新颖的学习算法,可以在模型训练期间最大化提出的度量。通过广泛的实验,我们确认我们的方法在提高事实一致性甚至摘要的整体质量方面是有效的,这通过自动指标和人工评估来判断。 Feng Nan Cicero Nogueira dos Santos Henghui Zhu Patrick Ng Kathleen McKeown Ramesh Nallapati Dejiao Zhang Zhiguo Wang Andrew O. Arnold Bing Xiang
3 ACL2021 Long-Span Summarization via Local Attention and Content Selection https://github.com/potsawee/longsum0 https://arxiv.org/pdf/2105.03801 Transformer-based models have achieved state-of-the-art results in a wide range of natural language processing (NLP) tasks including document summarization. Typically these systems are trained by fine-tuning a large pre-trained model to the target task. One issue with these transformer-based models is that they do not scale well in terms of memory and compute requirements as the input length grows. Thus, for long document summarization, it can be challenging to train or fine-tune these models. In this work, we exploit large pre-trained transformer-based models and address long-span dependencies in abstractive summarization using two methods: local self-attention; and explicit content selection. These approaches are compared on a range of network configurations. Experiments are carried out on standard long-span summarization tasks, including Spotify Podcast, arXiv, and PubMed datasets. We demonstrate that by combining these methods, we can achieve state-of-the-art results on all three tasks in the ROUGE scores. Moreover, without a large-scale GPU card, our approach can achieve comparable or better results than existing approaches. 基于 Transformer 的模型在包括文档摘要在内的各种自然语言处理 (NLP) 任务中取得了最先进的结果。通常,这些系统是通过针对目标任务对大型预训练模型进行微调来训练的。这些基于转换器的模型的一个问题是,随着输入长度的增加,它们在内存和计算要求方面不能很好地扩展。因此,对于长文档摘要,训练或微调这些模型可能具有挑战性。在这项工作中,我们利用大型预训练的基于 Transformer 的模型,并使用两种方法解决抽象摘要中的长跨度依赖性:局部自注意力;和明确的内容选择。这些方法在一系列网络配置上进行了比较。实验是在标准的大跨度摘要任务上进行的,包括 Spotify Podcast、arXiv 和 PubMed 数据集。我们证明,通过结合这些方法,我们可以在 ROUGE 分数中的所有三个任务上获得最先进的结果。此外,在没有大规模 GPU 卡的情况下,我们的方法可以获得与现有方法相当或更好的结果。 Potsawee Manakul Mark J. F. Gales
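文中提到的局部自注意力,核心只是把注意力限制在一个固定窗口内,从而使长文档的内存与计算开销随序列长度近似线性增长。下面给出一个构造局部注意力掩码的小示意(假设性写法,并非该论文的官方实现):

```python
import torch

def local_attention_mask(seq_len, window):
    """返回形状 (seq_len, seq_len) 的布尔掩码:
    仅当两个位置的距离不超过 window 时才允许相互注意。"""
    idx = torch.arange(seq_len)
    dist = (idx[None, :] - idx[:, None]).abs()   # 任意两位置之间的距离
    return dist <= window                         # True 表示允许注意
```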
4 ACL2021 TWAG: A Topic-Guided Wikipedia Abstract Generator https://github.com/THU-KEG/TWAG https://arxiv.org/pdf/2106.15135 Wikipedia abstract generation aims to distill a Wikipedia abstract from web sources and has met significant success by adopting multi-document summarization techniques. However, previous works generally view the abstract as plain text, ignoring the fact that it is a description of a certain entity and can be decomposed into different topics. In this paper, we propose a two-stage model TWAG that guides the abstract generation with topical information. First, we detect the topic of each input paragraph with a classifier trained on existing Wikipedia articles to divide input documents into different topics. Then, we predict the topic distribution of each abstract sentence, and decode the sentence from topic-aware representations with a Pointer-Generator network. We evaluate our model on the WikiCatSum dataset, and the results show that TWAG outperforms various existing baselines and is capable of generating comprehensive abstracts. Our code and dataset can be accessed at https://github.com/THU-KEG/TWAG 维基百科摘要生成旨在从网络资源中提取维基百科摘要,并通过采用多文档摘要技术取得了重大成功。然而,以前的工作一般将摘要视为纯文本,忽略了它是对某个实体的描述、可以分解为不同主题这一事实。在本文中,我们提出了一个两阶段模型 TWAG,它用主题信息指导摘要生成。首先,我们使用在现有维基百科文章上训练的分类器检测每个输入段落的主题,以将输入文档划分为不同的主题。然后,我们预测每个摘要句子的主题分布,并使用指针生成器网络从主题感知表示中解码句子。我们在 WikiCatSum 数据集上评估我们的模型,结果表明 TWAG 优于各种现有基线,并且能够生成全面的摘要。我们的代码和数据集可以在 https://github.com/THU-KEG/TWAG 访问 Fangwei Zhu Shangqing Tu Jiaxin Shi Juanzi Li Lei Hou Tong Cui
5 ACL2021 Language Model as an Annotator: Exploring DialoGPT for Dialogue Summarization https://github.com/xcfcode/PLM_annotator https://arxiv.org/pdf/2105.12544 Current dialogue summarization systems usually encode the text with a number of general semantic features (e.g., keywords and topics) to gain more powerful dialogue modeling capabilities. However, these features are obtained via open-domain toolkits that are dialog-agnostic or heavily relied on human annotations. In this paper, we show how DialoGPT, a pre-trained model for conversational response generation, can be developed as an unsupervised dialogue annotator, which takes advantage of dialogue background knowledge encoded in DialoGPT. We apply DialoGPT to label three types of features on two dialogue summarization datasets, SAMSum and AMI, and employ pre-trained and non pre-trained models as our summarizes. Experimental results show that our proposed method can obtain remarkable improvements on both datasets and achieves new state-of-the-art performance on the SAMSum dataset. Xiachong Feng Xiaocheng Feng Libo Qin Bing Qin Ting Liu
6 ACL2021 BASS: Boosting Abstractive Summarization with Unified Semantic Graph https://arxiv.org/pdf/2105.12041 Abstractive summarization for long-document or multi-document remains challenging for the Seq2Seq architecture, as Seq2Seq is not good at analyzing long-distance relations in text. In this paper, we present BASS, a novel framework for Boosting Abstractive Summarization based on a unified Semantic graph, which aggregates co-referent phrases distributing across a long range of context and conveys rich relations between phrases. Further, a graph-based encoder-decoder model is proposed to improve both the document representation and summary generation process by leveraging the graph structure. Specifically, several graph augmentation methods are designed to encode both the explicit and implicit relations in the text while the graph-propagation attention mechanism is developed in the decoder to select salient content into the summary. Empirical results show that the proposed architecture brings substantial improvements for both long-document and multi-document summarization tasks. 长文档或多文档的抽象摘要对于 Seq2Seq 架构仍然具有挑战性,因为 Seq2Seq 不擅长分析文本中的长距离关系。在本文中,我们提出了 BASS,这是一种基于统一语义图的增强抽象摘要的新框架,该框架聚合了分布在广泛上下文中的共同指涉短语,并传达了短语之间的丰富关系。此外,提出了一种基于图的编码器-解码器模型,以通过利用图结构来改进文档表示和摘要生成过程。具体来说,设计了几种图增强方法来对文本中的显式和隐式关系进行编码,同时在解码器中开发图传播注意机制以将显着内容选择到摘要中。实证结果表明,所提出的架构为长文档和多文档摘要任务带来了实质性的改进。 Wenhao Wu Wei Li Xinyan Xiao Jiachen Liu Ziqiang Cao Sujian Li Hua Wu Haifeng Wang
7 ACL2021 Focus Attention: Promoting Faithfulness and Diversity in Summarization https://arxiv.org/pdf/2105.11921 Professional summaries are written with document-level information, such as the theme of the document, in mind. This is in contrast with most seq2seq decoders which simultaneously learn to focus on salient content, while deciding what to generate, at each decoding step. With the motivation to narrow this gap, we introduce Focus Attention Mechanism, a simple yet effective method to encourage decoders to proactively generate tokens that are similar or topical to the input document. Further, we propose a Focus Sampling method to enable generation of diverse summaries, an area currently understudied in summarization. When evaluated on the BBC extreme summarization task, two state-of-the-art models augmented with Focus Attention generate summaries that are closer to the target and more faithful to their input documents, outperforming their vanilla counterparts on ROUGE and multiple faithfulness measures. We also empirically demonstrate that Focus Sampling is more effective in generating diverse and faithful summaries than top-k or nucleus sampling-based decoding methods. 专业摘要是在考虑文档级信息(例如文档的主题)的情况下撰写的。这与大多数 seq2seq 解码器形成对比,后者在每个解码步骤中同时学习关注显着内容,同时决定生成什么。为了缩小这一差距,我们引入了焦点注意力机制,这是一种简单而有效的方法,可以鼓励解码器主动生成与输入文档相似或切题的标记。此外,我们提出了一种焦点采样方法,以生成多样化的摘要,这是目前摘要领域研究不足的方向。当在 BBC 极端摘要任务上进行评估时,两个用焦点注意力增强的最先进模型生成的摘要更接近目标、更忠实于输入文档,在 ROUGE 和多项忠实度度量上优于未增强的对应模型。我们还凭经验证明,焦点采样在生成多样化和忠实的摘要方面比基于 top-k 或核采样的解码方法更有效。 Rahul Aralikatte Shashi Narayan Joshua Maynez Sascha Rothe Ryan McDonald
8 ACL2021 Generating Query Focused Summaries from Query-Free Resources https://github.com/yumoxu/marge https://arxiv.org/pdf/2012.14774 The availability of large-scale datasets has driven the development of neural models that create generic summaries from single or multiple documents. In this work we consider query focused summarization (QFS), a task for which training data in the form of queries, documents, and summaries is not readily available. We propose to decompose QFS into (1) query modeling (i.e., finding supportive evidence within a set of documents for a query) and (2) conditional language modeling (i.e., summary generation). We introduce MaRGE, a Masked ROUGE Regression framework for evidence estimation and ranking which relies on a unified representation for summaries and queries, so that summaries in generic data can be converted into proxy queries for learning a query model. Experiments across QFS benchmarks and query types show that our model achieves state-of-the-art performance despite learning from weak supervision. 大规模数据集的可用性推动了神经模型的发展,该模型从单个或多个文档创建通用摘要。在这项工作中,我们考虑以查询为中心的摘要 (QFS),这是一项不容易获得查询、文档和摘要形式的训练数据的任务。我们建议将 QFS 分解为 (1) 查询建模(即在一组文档中为查询找到支持证据)和(2)条件语言建模(即摘要生成)。我们引入了 MaRGE,这是一种用于证据估计和排序的 Masked ROUGE 回归框架,它依赖于摘要和查询的统一表示,因此可以将通用数据中的摘要转换为代理查询以学习查询模型。跨 QFS 基准测试和查询类型的实验表明,尽管从弱监督中学习,我们的模型仍实现了最先进的性能。 Yumo Xu Mirella Lapata
9 ACL2021 ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining https://github.com/Yale-LILY/ConvoSumm https://arxiv.org/pdf/2106.00829 While online conversations can cover a vast amount of information in many different formats, abstractive text summarization has primarily focused on modeling solely news articles. This research gap is due, in part, to the lack of standardized datasets for summarizing online discussions. To address this gap, we design annotation protocols motivated by an issues—viewpoints—assertions framework to crowdsource four new datasets on diverse online conversation forms of news comments, discussion forums, community question answering forums, and email threads. We benchmark state-of-the-art models on our datasets and analyze characteristics associated with the data. To create a comprehensive benchmark, we also evaluate these models on widely-used conversation summarization datasets to establish strong baselines in this domain. Furthermore, we incorporate argument mining through graph construction to directly model the issues, viewpoints, and assertions present in a conversation and filter noisy input, showing comparable or improved results according to automatic and human evaluations. 虽然在线对话可以涵盖多种不同格式的大量信息,但抽象文本摘要主要侧重于对新闻文章进行建模。这种研究差距部分是由于缺乏用于总结在线讨论的标准化数据集。为了解决这一差距,我们设计了由问题-观点-断言框架驱动的注释协议,以众包新闻评论、讨论论坛、社区问答论坛和电子邮件线程等各种在线对话形式的四个新数据集。我们在我们的数据集上对最先进的模型进行基准测试并分析与数据相关的特征。为了创建一个全面的基准,我们还在广泛使用的对话摘要数据集上评估这些模型,以在该领域建立强大的基线。此外,我们通过图构建结合参数挖掘,直接对对话中存在的问题、观点和断言进行建模,并过滤嘈杂的输入,根据自动和人工评估显示可比较或改进的结果。 Alexander R. Fabbri Faiaz Rahman Imad Rizvi Borui Wang Haoran Li Yashar Mehdad Dragomir Radev
10 ACL2020 On Faithfulness and Factuality in Abstractive Summarization https://github.com/google-research-datasets/xsum_hallucination_annotations https://arxiv.org/pdf/2005.00661 It is well known that the standard likelihood training and approximate decoding objectives in neural text generation models lead to less human-like responses for open-ended tasks such as language modeling and story generation. In this paper we have analyzed limitations of these models for abstractive document summarization and found that these models are highly prone to hallucinate content that is unfaithful to the input document. We conducted a large scale human evaluation of several neural abstractive summarization systems to better understand the types of hallucinations they produce. Our human annotators found substantial amounts of hallucinated content in all model generated summaries. However, our analysis does show that pretrained models are better summarizers not only in terms of raw metrics, i.e., ROUGE, but also in generating faithful and factual summaries as evaluated by humans. Furthermore, we show that textual entailment measures better correlate with faithfulness than standard metrics, potentially leading the way to automatic evaluation metrics as well as training and decoding criteria. 众所周知,神经文本生成模型中的标准似然训练和近似解码目标,在语言建模和故事生成等开放式任务中会导致生成的回复不那么像人类。在本文中,我们分析了这些模型在抽象文档摘要方面的局限性,发现这些模型很容易产生对输入文档不忠实的幻觉内容。我们对几个神经抽象摘要系统进行了大规模的人工评估,以更好地了解它们产生的幻觉类型。我们的人工标注者在所有模型生成的摘要中发现了大量幻觉内容。然而,我们的分析确实表明,预训练模型不仅在原始指标(即 ROUGE)上是更好的摘要器,而且在生成由人类评估的忠实且符合事实的摘要方面也是如此。此外,我们表明,与标准指标相比,文本蕴涵度量与忠实度的相关性更好,这可能为自动评估指标以及训练和解码标准指引方向。 Joshua Maynez Shashi Narayan Bernd Bohnet Ryan McDonald
11 ACL2020 FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization https://github.com/esdurmus/summary-faithfulness https://arxiv.org/pdf/2005.03754 Neural abstractive summarization models are prone to generate content inconsistent with the source document, i.e. unfaithful. Existing automatic metrics do not capture such mistakes effectively. We tackle the problem of evaluating faithfulness of a generated summary given its source document. We first collected human annotations of faithfulness for outputs from numerous models on two datasets. We find that current models exhibit a trade-off between abstractiveness and faithfulness: outputs with less word overlap with the source document are more likely to be unfaithful. Next, we propose an automatic question answering (QA) based metric for faithfulness, FEQA, which leverages recent advances in reading comprehension. Given question-answer pairs generated from the summary, a QA model extracts answers from the document; non-matched answers indicate unfaithful information in the summary. Among metrics based on word overlap, embedding similarity, and learned language understanding models, our QA-based metric has significantly higher correlation with human faithfulness scores, especially on highly abstractive summaries. 神经抽象摘要模型容易生成与源文档不一致的内容,即不忠实的。现有的自动指标不能有效地捕捉此类错误。我们解决了在给定源文件的情况下评估生成摘要的忠实度的问题。我们首先从两个数据集上的众多模型中收集了人类忠实度的人工注释。我们发现当前的模型表现出抽象性和忠实性之间的权衡:与源文档的单词重叠较少的输出更有可能是不忠实的。接下来,我们提出了一种基于自动问答 (QA) 的忠诚度指标,FEQA,它利用了阅读理解方面的最新进展。给定从摘要生成的问答对,QA 模型从文档中提取答案;不匹配的答案表示摘要中的信息不真实。在基于单词重叠、嵌入相似性和学习语言理解模型的指标中,我们基于 QA 的指标与人类忠诚度得分的相关性显着更高,尤其是在高度抽象的摘要上。 Esin Durmus He He Mona Diab
13 ACL2020 Knowledge Graph-Augmented Abstractive Summarization with Semantic-Driven Cloze Reward https://github.com/luyang-huang96/GraphAugmentedSum https://arxiv.org/pdf/2005.01159 Sequence-to-sequence models for abstractive summarization have been studied extensively, yet the generated summaries commonly suffer from fabricated content, and are often found to be near-extractive. We argue that, to address these issues, the summarizer should acquire semantic interpretation over input, e.g., via structured representation, to allow the generation of more informative summaries. In this paper, we present ASGARD, a novel framework for Abstractive Summarization with Graph-Augmentation and semantic-driven RewarD. We propose the use of dual encoders—-a sequential document encoder and a graph-structured encoder—-to maintain the global context and local characteristics of entities, complementing each other. We further design a reward based on a multiple choice cloze test to drive the model to better capture entity interactions. Results show that our models produce significantly higher ROUGE scores than a variant without knowledge graph as input on both New York Times and CNN/Daily Mail datasets. We also obtain better or comparable performance compared to systems that are fine-tuned from large pretrained language models. Human judges further rate our model outputs as more informative and containing fewer unfaithful errors. 用于抽象摘要的序列到序列模型已被广泛研究,但生成的摘要通常受到捏造的内容的影响,并且经常被发现接近于提取。我们认为,为了解决这些问题,摘要器应该获得对输入的语义解释,例如,通过结构化表示,以允许生成更多信息摘要。在本文中,我们提出了 ASGARD,这是一种具有图增强和语义驱动的 RewarD 的抽象摘要框架。我们建议使用双编码器——顺序文档编码器和图结构编码器——来维护实体的全局上下文和局部特征,相互补充。我们进一步设计了基于多项选择完形填空测试的奖励,以驱动模型更好地捕获实体交互。结果表明,我们的模型在纽约时报和 CNN/每日邮报数据集上产生的 ROUGE 分数明显高于没有知识图作为输入的变体。与从大型预训练语言模型微调的系统相比,我们还获得了更好或相当的性能。人类法官进一步评价我们的模型输出信息更多,包含更少的不忠实错误。 Luyang Huang Lingfei Wu Lu Wang
14 ACL2020 The Summary Loop: Learning to Write Abstractive Summaries Without Examples https://github.com/cannylab/summary_loop https://arxiv.org/pdf/2105.05361 This work presents a new approach to unsupervised abstractive summarization based on maximizing a combination of coverage and fluency for a given length constraint. It introduces a novel method that encourages the inclusion of key terms from the original document into the summary: key terms are masked out of the original document and must be filled in by a coverage model using the current generated summary. A novel unsupervised training procedure leverages this coverage model along with a fluency model to generate and score summaries. When tested on popular news summarization datasets, the method outperforms previous unsupervised methods by more than 2 R-1 points, and approaches results of competitive supervised methods. Our model attains higher levels of abstraction with copied passages roughly two times shorter than prior work, and learns to compress and merge sentences without supervision. 这项工作提出了一种新的无监督抽象摘要方法,该方法基于在给定长度约束下最大化覆盖率和流畅度的组合。它引入了一种新方法,鼓励将原始文档中的关键术语包含在摘要中:关键术语从原始文档中被屏蔽,并且必须由使用当前生成的摘要的覆盖模型填充。一种新颖的无监督训练程序利用此覆盖模型和流畅性模型来生成和评分摘要。在流行的新闻摘要数据集上进行测试时,该方法比以前的无监督方法高出 2 个 R-1 点以上,并且接近竞争性监督方法的结果。我们的模型通过复制的段落比以前的工作短大约两倍,获得了更高的抽象水平,并学会了在没有监督的情况下压缩和合并句子。 Philippe Laban Andrew Hsi John Canny Marti A. Hearst
15 ACL2020 Leveraging Graph to Improve Abstractive Multi-Document Summarization https://github.com/PaddlePaddle/Research https://arxiv.org/pdf/2005.10043 Graphs that capture relations between textual units have great benefits for detecting salient information from multiple documents and generating overall coherent summaries. In this paper, we develop a neural abstractive multi-document summarization (MDS) model which can leverage well-known graph representations of documents such as similarity graph and discourse graph, to more effectively process multiple input documents and produce abstractive summaries. Our model utilizes graphs to encode documents in order to capture cross-document relations, which is crucial to summarizing long documents. Our model can also take advantage of graphs to guide the summary generation process, which is beneficial for generating coherent and concise summaries. Furthermore, pre-trained language models can be easily combined with our model, which further improve the summarization performance significantly. Empirical results on the WikiSum and MultiNews dataset show that the proposed architecture brings substantial improvements over several strong baselines. 捕获文本单元之间关系的图对于从多个文档中检测显着信息和生成整体连贯的摘要有很大的好处。在本文中,我们开发了一种神经抽象多文档摘要 (MDS) 模型,该模型可以利用众所周知的文档图表示(例如相似图和话语图)来更有效地处理多个输入文档并生成抽象摘要。我们的模型利用图对文档进行编码以捕获跨文档关系,这对于总结长文档至关重要。我们的模型还可以利用图来指导摘要生成过程,这有利于生成连贯简洁的摘要。此外,预训练的语言模型可以很容易地与我们的模型结合,这进一步显着提高了摘要性能。 WikiSum 和 MultiNews 数据集上的实证结果表明,所提出的架构比几个强大的基线带来了实质性的改进。 Wei Li Xinyan Xiao Jiachen Liu Hua Wu Haifeng Wang Junping Du
16 ACL2019 Scoring Sentence Singletons and Pairs for Abstractive Summarization https://github.com/ucfnlp/summarization-sing-pair-mix https://arxiv.org/pdf/1906.00077 When writing a summary, humans tend to choose content from one or two sentences and merge them into a single summary sentence. However, the mechanisms behind the selection of one or multiple source sentences remain poorly understood. Sentence fusion assumes multi-sentence input; yet sentence selection methods only work with single sentences and not combinations of them. There is thus a crucial gap between sentence selection and fusion to support summarizing by both compressing single sentences and fusing pairs. This paper attempts to bridge the gap by ranking sentence singletons and pairs together in a unified space. Our proposed framework attempts to model human methodology by selecting either a single sentence or a pair of sentences, then compressing or fusing the sentence(s) to produce a summary sentence. We conduct extensive experiments on both single- and multi-document summarization datasets and report findings on sentence selection and abstraction. 在撰写摘要时,人们倾向于从一两个句子中选择内容并将它们合并为一个摘要句子。然而,选择一个或多个源句子背后的机制仍然知之甚少。句子融合假设多句输入;然而,句子选择方法仅适用于单个句子,而不适用于它们的组合。因此,在句子选择和融合之间存在一个关键的差距,以支持通过压缩单个句子和融合对来进行总结。本文试图通过在统一空间中对句子单例和句子对进行排序来弥合这一差距。我们提出的框架试图通过选择单个句子或一对句子来模拟人类方法,然后压缩或融合句子以生成摘要句子。我们对单文档和多文档摘要数据集进行了广泛的实验,并报告了关于句子选择和抽象的发现。 Logan Lebanoff Kaiqiang Song Franck Dernoncourt Doo Soon Kim Seokhwan Kim Walter Chang Fei Liu
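The MaRGE framework in entry 8 above scores candidate evidence sentences by regressing a masked ROUGE value against a proxy query built from a generic summary. Below is a minimal, illustrative sketch of how such regression targets might be assembled; plain unigram recall stands in for ROUGE, random token masking stands in for the paper's masking scheme, and all names are ours rather than from the released code.

```python
# Minimal sketch of Masked-ROUGE regression targets in the spirit of MaRGE (entry 8).
# Real MaRGE fine-tunes a pretrained encoder on these targets; here ROUGE-1 recall is
# approximated with unigram recall and masking is a simple random token drop.
import random
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def unigram_recall(candidate, reference):
    """Rough stand-in for ROUGE-1 recall: fraction of reference tokens covered."""
    ref = tokenize(reference)
    cand = set(tokenize(candidate))
    if not ref:
        return 0.0
    return sum(1 for tok in ref if tok in cand) / len(ref)

def mask_summary(summary, mask_rate=0.3, mask_token="[MASK]", seed=0):
    """Turn a generic-data summary into a proxy query by masking some of its tokens."""
    rng = random.Random(seed)
    return " ".join(mask_token if rng.random() < mask_rate else t for t in summary.split())

def build_regression_examples(doc_sentences, summary):
    """Produce (proxy_query, sentence, target_score) triples for a regressor."""
    proxy_query = mask_summary(summary)
    return [(proxy_query, sent, unigram_recall(sent, summary)) for sent in doc_sentences]

if __name__ == "__main__":
    doc = ["The committee approved the new budget on Monday.",
           "Local traffic was heavy because of a parade.",
           "The budget increases school funding by ten percent."]
    summary = "The committee approved a budget that raises school funding."
    for query, sent, score in build_regression_examples(doc, summary):
        print(f"{score:.2f}  {sent}")
```

Entry 11 (FEQA) measures faithfulness by asking questions generated from the summary and checking whether a QA model recovers the same answers from the source document. The sketch below covers only the final scoring step, assuming the summary-side and document-side answers are already available; question generation and the QA model itself are out of scope here, and the answer strings are illustrative placeholders.

```python
# Sketch of the FEQA scoring step (entry 11): faithfulness is the average token-level F1
# between the answer implied by the summary and the answer a QA model extracts from the
# source document for the same question.
from collections import Counter

def token_f1(pred, gold):
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def feqa_score(qa_pairs):
    """qa_pairs: list of (answer_from_summary, answer_from_document)."""
    if not qa_pairs:
        return 0.0
    return sum(token_f1(doc_ans, summ_ans) for summ_ans, doc_ans in qa_pairs) / len(qa_pairs)

if __name__ == "__main__":
    pairs = [("the prime minister", "the prime minister"),  # matched -> faithful
             ("in 2019", "in 2015")]                        # mismatched -> likely hallucinated
    print(round(feqa_score(pairs), 3))
```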

EMNLP

序号 会议/期刊 论文 主要技术 代码 论文下载地址 摘要 摘要翻译 作者
1 EMNLP2020 A Hierarchical Network for Abstractive Meeting Summarization with Cross- Domain Pretraining https://github.com/microsoft/HMNet https://arxiv.org/pdf/2004.02016 With the abundance of automatic meeting transcripts, meeting summarization is of great interest to both participants and other parties. Traditional methods of summarizing meetings depend on complex multi-step pipelines that make joint optimization intractable. Meanwhile, there are a handful of deep neural models for text summarization and dialogue systems. However, the semantic structure and styles of meeting transcripts are quite different from articles and conversations. In this paper, we propose a novel abstractive summary network that adapts to the meeting scenario. We design a hierarchical structure to accommodate long meeting transcripts and a role vector to depict the difference among speakers. Furthermore, due to the inadequacy of meeting summary data, we pretrain the model on large-scale news summary data. Empirical results show that our model outperforms previous approaches in both automatic metrics and human evaluation. For example, on ICSI dataset, the ROUGE-1 score increases from 34.66% to 46.28%. 由于有大量的自动会议记录,会议摘要对参与者和其他各方都非常感兴趣。总结会议的传统方法依赖于复杂的多步骤管道,这使得联合优化变得难以处理。同时,还有一些用于文本摘要和对话系统的深度神经模型。然而,会议记录的语义结构和风格与文章和对话有很大不同。在本文中,我们提出了一种适用于会议场景的新型抽象摘要网络。我们设计了一个层次结构来容纳长会议记录和一个角色向量来描述演讲者之间的差异。此外,由于会议摘要数据不足,我们在大规模新闻摘要数据上预训练模型。实证结果表明,我们的模型在自动度量和人工评估方面都优于以前的方法。例如,在 ICSI 数据集上,ROUGE-1 分数从 34.66% 增加到 46.28%。 Chenguang Zhu Ruochen Xu Michael Zeng Xuedong Huang
2 EMNLP2020 Conditional Neural Generation using Sub-Aspect Functions for Extractive News Summarization https://arxiv.org/pdf/2004.13983 Much progress has been made in text summarization, fueled by neural architectures using large-scale training corpora. However, in the news domain, neural models easily overfit by leveraging position-related features due to the prevalence of the inverted pyramid writing style. In addition, there is an unmet need to generate a variety of summaries for different users. In this paper, we propose a neural framework that can flexibly control summary generation by introducing a set of sub-aspect functions (i.e. importance, diversity, position). These sub-aspect functions are regulated by a set of control codes to decide which sub-aspect to focus on during summary generation. We demonstrate that extracted summaries with minimal position bias is comparable with those generated by standard models that take advantage of position preference. We also show that news summaries generated with a focus on diversity can be more preferred by human raters. These results suggest that a more flexible neural summarization framework providing more control options could be desirable in tailoring to different user preferences, which is useful since it is often impractical to articulate such preferences for different applications a priori. 在使用大规模训练语料库的神经架构的推动下,文本摘要取得了很大进展。然而,在新闻领域,由于倒金字塔写作风格的盛行,神经模型很容易通过利用与位置相关的特征来过度拟合。此外,还存在为不同用户生成各种摘要的需求未得到满足。在本文中,我们提出了一个神经框架,可以通过引入一组子方面函数(即重要性、多样性、位置)来灵活控制摘要生成。这些子方面功能由一组控制代码调节,以决定在摘要生成期间关注哪个子方面。我们证明了具有最小位置偏差的提取摘要与利用位置偏好的标准模型生成的摘要具有可比性。我们还表明,人工评估者可能更喜欢以多样性为重点生成的新闻摘要。这些结果表明,在针对不同的用户偏好进行定制时,可能需要提供更多控制选项的更灵活的神经摘要框架,这很有用,因为先验地阐明不同应用程序的此类偏好通常是不切实际的。 Zhengyuan Liu Ke Shi Nancy F. Chen
3 EMNLP2020 Unsupervised Extractive Summarization by Pre-training Hierarchical Transformers https://github.com/xssstory/STAS https://arxiv.org/pdf/2010.08242 Unsupervised extractive document summarization aims to select important sentences from a document without using labeled summaries during training. Existing methods are mostly graph-based with sentences as nodes and edge weights measured by sentence similarities. In this work, we find that transformer attentions can be used to rank sentences for unsupervised extractive summarization. Specifically, we first pre-train a hierarchical transformer model using unlabeled documents only. Then we propose a method to rank sentences using sentence-level self-attentions and pre-training objectives. Experiments on CNN/DailyMail and New York Times datasets show our model achieves state-of-the-art performance on unsupervised summarization. We also find in experiments that our model is less dependent on sentence positions. When using a linear combination of our model and a recent unsupervised model explicitly modeling sentence positions, we obtain even better results. 无监督提取文档摘要旨在在训练期间不使用标记摘要从文档中选择重要句子。现有的方法大多是基于图的,以句子为节点,边权重由句子相似度衡量。在这项工作中,我们发现 Transformer attention 可用于对无监督提取摘要的句子进行排名。具体来说,我们首先仅使用未标记的文档预训练分层转换器模型。然后我们提出了一种使用句子级自我注意和预训练目标对句子进行排序的方法。在 CNN/DailyMail 和纽约时报数据集上的实验表明,我们的模型在无监督摘要方面达到了最先进的性能。我们还在实验中发现我们的模型对句子位置的依赖性较小。当使用我们的模型和最近的无监督模型的线性组合显式建模句子位置时,我们获得了更好的结果。 Shusheng Xu Xingxing Zhang Yi Wu Furu Wei Ming Zhou
4 EMNLP2020 Corpora Evaluation and System Bias detection in Multi Document Summarization https://github.com/LCS2-IIITD/summarization_bias https://arxiv.org/pdf/2010.01786 Multi-document summarization (MDS) is the task of reflecting key points from any set of documents into a concise text paragraph. In the past, it has been used to aggregate news, tweets, product reviews, etc. from various sources. Owing to no standard definition of the task, we encounter a plethora of datasets with varying levels of overlap and conflict between participating documents. There is also no standard regarding what constitutes summary information in MDS. Adding to the challenge is the fact that new systems report results on a set of chosen datasets, which might not correlate with their performance on the other datasets. In this paper, we study this heterogeneous task with the help of a few widely used MDS corpora and a suite of state-of-the-art models. We make an attempt to quantify the quality of summarization corpus and prescribe a list of points to consider while proposing a new MDS corpus. Next, we analyze the reason behind the absence of an MDS system which achieves superior performance across all corpora. We then observe the extent to which system metrics are influenced, and bias is propagated due to corpus properties. The scripts to reproduce the experiments in this work are available at https://github.com/LCS2-IIITD/summarization_bias.git. Alvin Dey Tanya Chowdhury Yash Kumar Atri Tanmoy Chakraborty
5 EMNLP2020 An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems https://github.com/zide05/CDEvalSumm https://arxiv.org/pdf/2010.05139 Neural network-based models augmented with unsupervised pre-trained knowledge have achieved impressive performance on text summarization. However, most existing evaluation methods are limited to an in-domain setting, where summarizers are trained and evaluated on the same dataset. We argue that this approach can narrow our understanding of the generalization ability for different summarization systems. In this paper, we perform an in-depth analysis of characteristics of different datasets and investigate the performance of different summarization models under a cross-dataset setting, in which a summarizer trained on one corpus will be evaluated on a range of out-of-domain corpora. A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effect of model architectures and generation ways (i.e. abstractive and extractive) on model generalization ability. Further, experimental results shed light on the limitations of existing summarizers. Brief introduction and supplementary code can be found in https://github.com/zide05/CDEvalSumm. Yiran Chen Pengfei Liu Ming Zhong Zi-Yi Dou Danqing Wang Xipeng Qiu Xuanjing Huang
6 EMNLP2019 Contrastive Attention Mechanism for Abstractive Sentence Summarization https://github.com/travel-go/Abstractive-Text-Summarization https://arxiv.org/pdf/1910.13114 We propose a contrastive attention mechanism to extend the sequence-to-sequence framework for abstractive sentence summarization task, which aims to generate a brief summary of a given source sentence. The proposed contrastive attention mechanism accommodates two categories of attention: one is the conventional attention that attends to relevant parts of the source sentence, the other is the opponent attention that attends to irrelevant or less relevant parts of the source sentence. Both attentions are trained in an opposite way so that the contribution from the conventional attention is encouraged and the contribution from the opponent attention is discouraged through a novel softmax and softmin functionality. Experiments on benchmark datasets show that, the proposed contrastive attention mechanism is more focused on the relevant parts for the summary than the conventional attention mechanism, and greatly advances the state-of-the-art performance on the abstractive sentence summarization task. We release the code at https://github.com/travel-go/Abstractive-Text-Summarization 我们提出了一种对比注意机制来扩展抽象句子摘要任务的序列到序列框架,旨在生成给定源句子的简短摘要。所提出的对比注意力机制包含两类注意力:一种是关注源句相关部分的常规注意力,另一种是关注源句中不相关或不太相关部分的对手注意力。两种注意力都以相反的方式进行训练,从而通过新颖的 softmax 和 softmin 功能鼓励传统注意力的贡献,并阻止对手注意力的贡献。在基准数据集上的实验表明,所提出的对比注意机制比传统的注意机制更关注摘要的相关部分,并且大大提高了抽象句子摘要任务的最新性能。我们在 https://github.com/travel-go/Abstractive-Text-Summarization 发布代码 Xiangyu Duan Hoongfei Yu Mingming Yin Min Zhang Weihua Luo Yue Zhang
7 EMNLP2019 Concept Pointer Network for Abstractive Summarization https://github.com/wprojectsn/codes https://arxiv.org/pdf/1910.08486 A quality abstractive summary should not only copy salient source texts as summaries but should also tend to generate new conceptual words to express concrete details. Inspired by the popular pointer generator sequence-to-sequence model, this paper presents a concept pointer network for improving these aspects of abstractive summarization. The network leverages knowledge-based, context-aware conceptualizations to derive an extended set of candidate concepts. The model then points to the most appropriate choice using both the concept set and original source text. This joint approach generates abstractive summaries with higher-level semantic concepts. The training model is also optimized in a way that adapts to different data, which is based on a novel method of distantly-supervised learning guided by reference summaries and testing set. Overall, the proposed approach provides statistically significant improvements over several state-of-the-art models on both the DUC-2004 and Gigaword datasets. A human evaluation of the model’s abstractive abilities also supports the quality of the summaries produced within this framework. 高质量的抽象摘要不仅应该复制突出的源文本作为摘要,还应该倾向于生成新的概念词来表达具体细节。受流行的指针生成器序列到序列模型的启发,本文提出了一个概念指针网络,用于改进抽象摘要的这些方面。该网络利用基于知识的、上下文感知的概念化来推导出一组扩展的候选概念。然后,模型使用概念集和原始源文本指出最合适的选择。这种联合方法生成具有更高级别语义概念的抽象摘要。训练模型也以适应不同数据的方式进行了优化,这是基于一种以参考摘要和测试集为指导的远程监督学习的新方法。总体而言,与 DUC-2004 和 Gigaword 数据集上的几个最先进模型相比,所提出的方法在统计上有显着改进。对模型抽象能力的人工评估也支持在该框架内生成的摘要的质量。 Wang Wenbo Gao Yang Huang Heyan Zhou Yuxiang
9 EMNLP2019 Neural Extractive Text Summarization with Syntactic Compression https://arxiv.org/pdf/1902.00863 Recent neural network approaches to summarization are largely either selection-based extraction or generation-based abstraction. In this work, we present a neural model for single-document summarization based on joint extraction and syntactic compression. Our model chooses sentences from the document, identifies possible compressions based on constituency parses, and scores those compressions with a neural model to produce the final summary. For learning, we construct oracle extractive-compressive summaries, then learn both of our components jointly with this supervision. Experimental results on the CNN/Daily Mail and New York Times datasets show that our model achieves strong performance (comparable to state-of-the-art systems) as evaluated by ROUGE. Moreover, our approach outperforms an off-the-shelf compression module, and human and manual evaluation shows that our model’s output generally remains grammatical. 最近的神经网络总结方法主要是基于选择的提取或基于生成的抽象。在这项工作中,我们提出了一种基于联合提取和句法压缩的单文档摘要神经模型。我们的模型从文档中选择句子,根据选区解析识别可能的压缩,并使用神经模型对这些压缩进行评分以生成最终摘要。对于学习,我们构建 oracle 提取压缩摘要,然后在此监督下共同学习我们的两个组件。在 CNN/Daily Mail 和纽约时报数据集上的实验结果表明,我们的模型获得了 ROUGE 评估的强大性能(可与最先进的系统相媲美)。此外,我们的方法优于现成的压缩模块,人工和人工评估表明我们模型的输出通常保持语法。 Jiacheng Xu Greg Durrett
10 EMNLP2019 Text Summarization with Pretrained Encoders https://github.com/nlpyang/PreSumm https://arxiv.org/pdf/1908.08345 Bidirectional Encoder Representations from Transformers (BERT) represents the latest incarnation of pretrained language models which have recently advanced a wide range of natural language processing tasks. In this paper, we showcase how BERT can be usefully applied in text summarization and propose a general framework for both extractive and abstractive models. We introduce a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences. Our extractive model is built on top of this encoder by stacking several inter-sentence Transformer layers. For abstractive summarization, we propose a new fine-tuning schedule which adopts different optimizers for the encoder and the decoder as a means of alleviating the mismatch between the two (the former is pretrained while the latter is not). We also demonstrate that a two-staged fine-tuning approach can further boost the quality of the generated summaries. Experiments on three datasets show that our model achieves state-of-the-art results across the board in both extractive and abstractive settings. Our code is available at https://github.com/nlpyang/PreSumm Transformers 的双向编码器表示 (BERT) 代表了预训练语言模型的最新化身,这些模型最近推进了广泛的自然语言处理任务。在本文中,我们展示了 BERT 如何有效地应用于文本摘要,并为提取和抽象模型提出了一个通用框架。我们引入了一种基于 BERT 的新型文档级编码器,它能够表达文档的语义并获得其句子的表示。我们的提取模型建立在这个编码器之上,通过堆叠几个句间 Transformer 层。对于抽象摘要,我们提出了一种新的微调计划,它对编码器和解码器采用不同的优化器作为减轻两者之间不匹配的手段(前者是预训练的,而后者不是)。我们还证明了两阶段微调方法可以进一步提高生成摘要的质量。在三个数据集上的实验表明,我们的模型在提取和抽象设置中都取得了全面的最新结果。我们的代码可在 https://github.com/nlpyang/PreSumm 获得 Yang Liu Mirella Lapata
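Entry 2 above controls extractive generation with sub-aspect functions for importance, diversity, and position. The following sketch illustrates the idea with hand-written stand-ins for those functions (lexical centrality, inverse position, an MMR-style redundancy penalty) and a weight vector acting as the control code; the paper's actual sub-aspects are learned and applied inside a neural model.

```python
# Toy control-code-style extractive selection: each candidate sentence is scored by a
# weighted mix of importance, position, and diversity sub-aspect functions.
import re

def toks(s):
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def overlap(a, b):
    ta, tb = toks(a), toks(b)
    return len(ta & tb) / (len(ta | tb) or 1)

def select(sentences, control=(1.0, 0.0, 0.0), k=2):
    w_imp, w_pos, w_div = control      # the "control code"
    chosen = []
    candidates = list(enumerate(sentences))
    while candidates and len(chosen) < k:
        def score(item):
            i, s = item
            importance = sum(overlap(s, o) for o in sentences) / len(sentences)
            position = 1.0 / (1 + i)
            redundancy = max((overlap(s, c) for c in chosen), default=0.0)
            return w_imp * importance + w_pos * position - w_div * redundancy
        best = max(candidates, key=score)
        chosen.append(best[1])
        candidates.remove(best)
    return chosen

if __name__ == "__main__":
    doc = ["The senate passed the climate bill on Tuesday.",
           "The bill funds solar and wind projects.",
           "Senators debated the climate bill for two weeks.",
           "Opponents argued the bill costs too much."]
    print(select(doc, control=(1.0, 0.5, 0.0)))   # position-leaning summary
    print(select(doc, control=(1.0, 0.0, 1.0)))   # diversity-leaning summary
```

Entry 6 trains a conventional attention head with softmax and an opponent head with softmin over the same relevance scores. A small numpy sketch of that pairing is given below with toy values, implementing softmin as softmax of the negated scores, which is one standard formulation.

```python
# Softmax/softmin pairing behind the contrastive attention mechanism (entry 6): the
# conventional attention concentrates on relevant source tokens, the opponent attention
# on the least relevant ones.
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def softmin(x, axis=-1):
    return softmax(-x, axis=axis)

scores = np.array([2.0, 0.5, -1.0, 0.0])   # relevance of 4 source tokens
print(softmax(scores).round(3))             # conventional attention
print(softmin(scores).round(3))             # opponent attention
```

Entry 10 (PreSumm) fine-tunes the pretrained encoder and the randomly initialized decoder with separate optimizers and warmup schedules to reduce the mismatch between the two. The sketch below shows one plausible PyTorch arrangement; the module sizes, learning rates, and warmup steps are placeholders rather than the repository's exact configuration.

```python
# Two-optimizer fine-tuning schedule in the spirit of PreSumm (entry 10): the pretrained
# encoder gets a smaller learning rate and a longer warmup than the fresh decoder.
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model=512, nhead=8), num_layers=6)

opt_enc = torch.optim.Adam(encoder.parameters(), lr=2e-3, betas=(0.9, 0.999))
opt_dec = torch.optim.Adam(decoder.parameters(), lr=1e-1, betas=(0.9, 0.999))

def noam_factor(step, warmup):
    # Noam-style schedule; a longer warmup makes the encoder change more slowly early on.
    step = max(step, 1)
    return min(step ** -0.5, step * warmup ** -1.5)

sched_enc = torch.optim.lr_scheduler.LambdaLR(opt_enc, lambda s: noam_factor(s, warmup=20000))
sched_dec = torch.optim.lr_scheduler.LambdaLR(opt_dec, lambda s: noam_factor(s, warmup=10000))

# In the training loop both optimizers step on the same summarization loss:
#   loss.backward(); opt_enc.step(); opt_dec.step(); sched_enc.step(); sched_dec.step()
```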

NAACL

序号 会议/期刊 论文 主要技术 代码 论文下载地址 摘要 摘要翻译 作者
1 NAACL2021 GSum: A General Framework for Guided Neural Abstractive Summarization https://github.com/neulab/guided_summarization https://arxiv.org/pdf/2010.08014 Neural abstractive summarization models are flexible and can produce coherent summaries, but they are sometimes unfaithful and can be difficult to control. While previous studies attempt to provide different types of guidance to control the output and increase faithfulness, it is not clear how these strategies compare and contrast to each other. In this paper, we propose a general and extensible guided summarization framework (GSum) that can effectively take different kinds of external guidance as input, and we perform experiments across several different varieties. Experiments demonstrate that this model is effective, achieving state-of-the-art performance according to ROUGE on 4 popular summarization datasets when using highlighted sentences as guidance. In addition, we show that our guided model can generate more faithful summaries and demonstrate how different types of guidance generate qualitatively different summaries, lending a degree of controllability to the learned models. 神经抽象摘要模型很灵活,可以产生连贯的摘要,但它们有时不可靠并且难以控制。虽然以前的研究试图提供不同类型的指导来控制输出和增加忠诚度,但尚不清楚这些策略如何相互比较和对比。在本文中,我们提出了一个通用且可扩展的引导式总结框架(GSum),它可以有效地将不同种类的外部引导作为输入,并在几个不同的品种上进行实验。实验表明,该模型是有效的,在使用突出显示的句子作为指导时,根据 ROUGE 在 4 个流行的摘要数据集上实现了最先进的性能。此外,我们展示了我们的引导模型可以生成更忠实的摘要,并展示不同类型的引导如何生成质量不同的摘要,从而为学习模型提供一定程度的可控性。 Zi-Yi Dou Pengfei Liu Hiroaki Hayashi Zhengbao Jiang Graham Neubig
2 NAACL2021 Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine- tuning and Data Augmentation https://arxiv.org/pdf/2010.12836 Models pretrained with self-supervised objectives on large text corpora achieve state-of-the-art performance on English text summarization tasks. However, these models are typically fine-tuned on hundreds of thousands of data points, an infeasible requirement when applying summarization to new, niche domains. In this work, we introduce a novel and generalizable method, called WikiTransfer, for fine-tuning pretrained models for summarization in an unsupervised, dataset-specific manner. WikiTransfer fine-tunes pretrained models on pseudo-summaries, produced from generic Wikipedia data, which contain characteristics of the target dataset, such as the length and level of abstraction of the desired summaries. WikiTransfer models achieve state-of-the-art, zero-shot abstractive summarization performance on the CNN-DailyMail dataset and demonstrate the effectiveness of our approach on three additional diverse datasets. These models are more robust to noisy data and also achieve better or comparable few-shot performance using 10 and 100 training examples when compared to few-shot transfer from other summarization datasets. To further boost performance, we employ data augmentation via round-trip translation as well as introduce a regularization term for improved few-shot transfer. To understand the role of dataset aspects in transfer performance and the quality of the resulting output summaries, we further study the effect of the components of our unsupervised fine-tuning data and analyze few-shot performance using both automatic and human evaluation. 在大型文本语料库上使用自监督目标进行预训练的模型在英语文本摘要任务上取得了最先进的性能。然而,这些模型通常会在数十万个数据点上进行微调,这在将汇总应用于新的利基领域时是不可行的。在这项工作中,我们引入了一种新颖且可推广的方法,称为 WikiTransfer,用于微调预训练模型,以无监督的、特定于数据集的方式进行汇总。 WikiTransfer 微调伪摘要上的预训练模型,这些模型由通用维基百科数据生成,其中包含目标数据集的特征,例如所需摘要的长度和抽象级别。 WikiTransfer 模型在 CNN-DailyMail 数据集上实现了最先进的零样本抽象摘要性能,并证明了我们的方法在另外三个不同数据集上的有效性。与来自其他摘要数据集的小样本传输相比,这些模型对嘈杂的数据更加稳健,并且使用 10 和 100 个训练示例也能实现更好或可比的小样本性能。为了进一步提高性能,我们通过往返翻译使用数据增强,并引入了一个正则化项来改进小样本传输。为了了解数据集方面在传输性能和结果输出摘要质量中的作用,我们进一步研究了无监督微调数据组件的影响,并使用自动和人工评估分析了小样本性能。 Alexander R. Fabbri Simeng Han Haoyuan Li Haoran Li Marjan Ghazvininejad Shafiq Joty Dragomir Radev Yashar Mehdad
3 NAACL2021 Structure-Aware Abstractive Conversation Summarization via Discourse and Action Graphs https://github.com/GT-SALT/Structure-Aware-BART https://arxiv.org/pdf/2104.08400 Abstractive conversation summarization has received much attention recently. However, these generated summaries often suffer from insufficient, redundant, or incorrect content, largely due to the unstructured and complex characteristics of human-human interactions. To this end, we propose to explicitly model the rich structures in conversations for more precise and accurate conversation summarization, by first incorporating discourse relations between utterances and action triples (“who-doing-what”) in utterances through structured graphs to better encode conversations, and then designing a multi-granularity decoder to generate summaries by combining all levels of information. Experiments show that our proposed models outperform state-of-the-art methods and generalize well in other domains in terms of both automatic evaluations and human judgments. We have publicly released our code at https://github.com/GT-SALT/Structure-Aware-BART. Jiaao Chen Diyi Yang
4 NAACL2021 AdaptSum: Towards Low-Resource Domain Adaptation for Abstractive Summarization https://github.com/TysonYu/AdaptSum https://arxiv.org/pdf/2103.11332 State-of-the-art abstractive summarization models generally rely on extensive labeled data, which lowers their generalization ability on domains where such data are not available. In this paper, we present a study of domain adaptation for the abstractive summarization task across six diverse target domains in a low-resource setting. Specifically, we investigate the second phase of pre-training on large-scale generative models under three different settings: 1) source domain pre-training; 2) domain-adaptive pre-training; and 3) task-adaptive pre-training. Experiments show that the effectiveness of pre-training is correlated with the similarity between the pre-training data and the target domain task. Moreover, we find that continuing pre-training could lead to the pre-trained model’s catastrophic forgetting, and a learning method with less forgetting can alleviate this issue. Furthermore, results illustrate that a huge gap still exists between the low-resource and high-resource settings, which highlights the need for more advanced domain adaptation methods for the abstractive summarization task. 最先进的抽象摘要模型通常依赖于广泛的标记数据,这降低了它们在这些数据不可用的域上的泛化能力。在本文中,我们提出了在低资源环境中跨六个不同目标域的抽象摘要任务的域适应研究。具体来说,我们研究了在三种不同设置下对大规模生成模型进行预训练的第二阶段:1)源域预训练; 2)领域自适应预训练; 3) 任务自适应预训练。实验表明,预训练的有效性与预训练数据与目标域任务之间的相似性相关。此外,我们发现继续预训练可能导致预训练模型的灾难性遗忘,而减少遗忘的学习方法可以缓解这个问题。此外,结果表明低资源和高资源设置之间仍然存在巨大差距,这突出了抽象摘要任务需要更高级的域适应方法。 Tiezheng Yu Zihan Liu Pascale Fung
5 NAACL2021 A New Approach to Overgenerating and Scoring Abstractive Summaries https://github.com/ucfnlp/varying-length-summ https://arxiv.org/pdf/2104.01726 We propose a new approach to generate multiple variants of the target summary with diverse content and varying lengths, then score and select admissible ones according to users’ needs. Abstractive summarizers trained on single reference summaries may struggle to produce outputs that achieve multiple desirable properties, i.e., capturing the most important information, being faithful to the original, grammatical and fluent. In this paper, we propose a two-staged strategy to generate a diverse set of candidate summaries from the source text in stage one, then score and select admissible ones in stage two. Importantly, our generator gives a precise control over the length of the summary, which is especially well-suited when space is limited. Our selectors are designed to predict the optimal summary length and put special emphasis on faithfulness to the original text. Both stages can be effectively trained, optimized and evaluated. Our experiments on benchmark summarization datasets suggest that this paradigm can achieve state-of-the-art performance. 我们提出了一种新方法来生成具有不同内容和不同长度的目标摘要的多个变体,然后根据用户的需求评分并选择可接受的变体。在单一参考摘要上训练的抽象摘要者可能难以产生实现多种理想属性的输出,即捕获最重要的信息、忠实于原文、语法和流利。在本文中,我们提出了一个两阶段的策略,在第一阶段从源文本生成一组不同的候选摘要,然后在第二阶段评分并选择可接受的摘要。重要的是,我们的生成器可以精确控制摘要的长度,这尤其适用于空间有限的情况。我们的选择器旨在预测最佳摘要长度,并特别强调对原文的忠实度。这两个阶段都可以有效地训练、优化和评估。我们在基准摘要数据集上的实验表明,这种范式可以实现最先进的性能。 Kaiqiang Song Bingqing Wang Zhe Feng Fei Liu
6 NAACL2021 Improving Faithfulness in Abstractive Summarization with Contrast Candidate Generation and Selection https://arxiv.org/pdf/2104.09061 Despite significant progress in neural abstractive summarization, recent studies have shown that the current models are prone to generating summaries that are unfaithful to the original context. To address the issue, we study contrast candidate generation and selection as a model-agnostic post-processing technique to correct the extrinsic hallucinations (i.e. information not present in the source text) in unfaithful summaries. We learn a discriminative correction model by generating alternative candidate summaries where named entities and quantities in the generated summary are replaced with ones with compatible semantic types from the source document. This model is then used to select the best candidate as the final output summary. Our experiments and analysis across a number of neural summarization systems show that our proposed method is effective in identifying and correcting extrinsic hallucinations. We analyze the typical hallucination phenomenon by different types of neural summarization systems, in hope to provide insights for future work on the direction. 尽管在神经抽象摘要方面取得了重大进展,但最近的研究表明,当前的模型容易生成与原始上下文不相符的摘要。为了解决这个问题,我们研究了对比候选生成和选择作为一种与模型无关的后处理技术,以纠正不可靠摘要中的外在幻觉(即源文本中不存在的信息)。我们通过生成替代候选摘要来学习判别校正模型,其中生成的摘要中的命名实体和数量被替换为源文档中具有兼容语义类型的实体和数量。然后使用该模型选择最佳候选者作为最终输出摘要。我们对许多神经摘要系统的实验和分析表明,我们提出的方法在识别和纠正外在幻觉方面是有效的。我们通过不同类型的神经摘要系统分析了典型的幻觉现象,希望为未来的方向工作提供见解。 Sihao Chen Fan Zhang Kazoo Sone Dan Roth
7 NAACL2021 Enriching Transformers with Structured Tensor-Product Representations for Abstractive Summarization https://github.com/jiangycTarheel/TPT-Summ https://arxiv.org/pdf/2106.01317 Abstractive summarization, the task of generating a concise summary of input documents, requires: (1) reasoning over the source document to determine the salient pieces of information scattered across the long document, and (2) composing a cohesive text by reconstructing these salient facts into a shorter summary that faithfully reflects the complex relations connecting these facts. In this paper, we adapt TP-TRANSFORMER (Schlag et al., 2019), an architecture that enriches the original Transformer (Vaswani et al., 2017) with the explicitly compositional Tensor Product Representation (TPR), for the task of abstractive summarization. The key feature of our model is a structural bias that we introduce by encoding two separate representations for each token to represent the syntactic structure (with role vectors) and semantic content (with filler vectors) separately. The model then binds the role and filler vectors into the TPR as the layer output. We argue that the structured intermediate representations enable the model to take better control of the contents (salient facts) and structures (the syntax that connects the facts) when generating the summary. Empirically, we show that our TP-TRANSFORMER outperforms the Transformer and the original TP-TRANSFORMER significantly on several abstractive summarization datasets based on both automatic and human evaluations. On several syntactic and semantic probing tasks, we demonstrate the emergent structural information in the role vectors and improved syntactic interpretability in the TPR layer outputs. Code and models are available at https://github.com/jiangycTarheel/TPT-Summ. 抽象摘要,即生成输入文档的简明摘要的任务,需要:(1) 对源文档进行推理以确定散布在长文档中的显着信息,以及 (2) 通过重构这些显着事实来组成一个有凝聚力的文本成一个简短的总结,忠实地反映了连接这些事实的复杂关系。在本文中,我们采用了 TP-TRANSFORMER (Schlag et al., 2019),一种用显式组合张量积表示 (TPR) 丰富原始 Transformer (Vaswani et al., 2017) 的架构,用于抽象总结的任务.我们模型的关键特征是结构偏差,我们通过为每个标记编码两个单独的表示来分别表示句法结构(使用角色向量)和语义内容(使用填充向量)来引入结构偏差。然后模型将角色和填充向量绑定到 TPR 作为层输出。我们认为,结构化的中间表示使模型能够在生成摘要时更好地控制内容(显着事实)和结构(连接事实的语法)。根据经验,我们表明我们的 TP-TRANSFORMER 在基于自动和人工评估的几个抽象摘要数据集上显着优于 Transformer 和原始 TP-TRANSFORMER。在几个句法和语义探测任务中,我们展示了角色向量中的紧急结构信息和 TPR 层输出中改进的句法可解释性。代码和模型可在 https://github.com/jiangycTarheel/TPT-Summ 获得。 Yichen Jiang Asli Celikyilmaz Paul Smolensky Paul Soulos Sudha Rao Hamid Palangi Roland Fernandez Caitlin Smith Mohit Bansal Jianfeng Gao
8 NAACL2021 Attention Head Masking for Inference Time Content Selection in Abstractive Summarization https://arxiv.org/pdf/2104.02205 How can we effectively inform content selection in Transformer-based abstractive summarization models? In this work, we present a simple-yet-effective attention head masking technique, which is applied on encoder-decoder attentions to pinpoint salient content at inference time. Using attention head masking, we are able to reveal the relation between encoder-decoder attentions and content selection behaviors of summarization models. We then demonstrate its effectiveness on three document summarization datasets based on both in-domain and cross-domain settings. Importantly, our models outperform prior state-of-the-art models on CNN/Daily Mail and New York Times datasets. Moreover, our inference-time masking technique is also data-efficient, requiring only 20% of the training samples to outperform BART fine-tuned on the full CNN/DailyMail dataset. 我们如何在基于 Transformer 的抽象摘要模型中有效地告知内容选择?在这项工作中,我们提出了一种简单而有效的注意力头掩蔽技术,该技术应用于编码器-解码器注意力以在推理时精确定位显着内容。使用注意力头屏蔽,我们能够揭示编码器-解码器注意力与摘要模型的内容选择行为之间的关系。然后,我们在基于域内和跨域设置的三个文档摘要数据集上证明了它的有效性。重要的是,我们的模型在 CNN/《每日邮报》和《纽约时报》数据集上的表现优于先前最先进的模型。此外,我们的推理时间屏蔽技术也是数据高效的,只需要 20% 的训练样本就可以胜过在完整的 CNN/DailyMail 数据集上微调的 BART。 Shuyang Cao Lu Wang
9 NAACL2021 Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics https://github.com/artidoro/frank https://arxiv.org/pdf/2104.13346 Modern summarization models generate highly fluent but often factually unreliable outputs. This motivated a surge of metrics attempting to measure the factuality of automatically generated summaries. Due to the lack of common benchmarks, these metrics cannot be compared. Moreover, all these methods treat factuality as a binary concept and fail to provide deeper insights into the kinds of inconsistencies made by different systems. To address these limitations, we devise a typology of factual errors and use it to collect human annotations of generated summaries from state-of-the-art summarization systems for the CNN/DM and XSum datasets. Through these annotations, we identify the proportion of different categories of factual errors in various summarization models and benchmark factuality metrics, showing their correlation with human judgment as well as their specific strengths and weaknesses. 现代摘要模型生成高度流畅但通常实际上不可靠的输出。这引发了大量试图衡量自动生成的摘要的真实性的指标。由于缺乏共同的基准,这些指标无法进行比较。此外,所有这些方法都将事实性视为一个二元概念,无法更深入地了解不同系统造成的各种不一致。为了解决这些限制,我们设计了一种事实错误的类型学,并使用它从 CNN/DM 和 XSum 数据集的最先进的摘要系统中收集生成的摘要的人工注释。通过这些注释,我们确定了各种摘要模型和基准事实性指标中不同类别的事实错误的比例,显示了它们与人类判断的相关性以及它们的具体优势和劣势。 Artidoro Pagnoni Vidhisha Balachandran Yulia Tsvetkov
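Entry 6 above post-processes unfaithful summaries by generating contrast candidates, i.e. copies of the summary with entities or quantities swapped for compatible ones from the source document, and letting a trained selector keep the best candidate. The sketch below handles numeric quantities only and replaces the learned selector with a trivial "does the number occur in the source" count, purely to make the pipeline concrete.

```python
# Simplified contrast candidate generation and selection (entry 6): swap quantities in
# the summary with quantities found in the source, then keep the candidate a (here
# trivial) faithfulness scorer prefers.
import re

def quantities(text):
    return re.findall(r"\b\d+(?:\.\d+)?\b", text)

def generate_candidates(summary, source):
    cands = [summary]
    for q_summary in quantities(summary):
        for q_source in set(quantities(source)):
            if q_source != q_summary:
                cands.append(summary.replace(q_summary, q_source))
    return cands

def faithfulness_proxy(candidate, source):
    src_q = set(quantities(source))
    return sum(q in src_q for q in quantities(candidate))

def correct(summary, source):
    return max(generate_candidates(summary, source), key=lambda c: faithfulness_proxy(c, source))

if __name__ == "__main__":
    source = "The company laid off 120 workers in March and closed 3 plants."
    summary = "The company laid off 200 workers and closed 3 plants."
    print(correct(summary, source))   # the hallucinated "200" is swapped for "120"
```

Entry 8 pinpoints salient source content at inference time by masking all but a few encoder-decoder attention heads. The numpy sketch below applies such a mask to a random cross-attention tensor and reads off a per-token saliency; in the paper the mask is applied inside a fine-tuned Transformer (e.g. BART) during decoding, and which heads to keep is determined empirically.

```python
# Inference-time attention head masking (entry 8): zero out all but the chosen heads of
# one layer's encoder-decoder attention and use the remaining mass as token saliency.
import numpy as np

rng = np.random.default_rng(0)
num_heads, tgt_len, src_len = 8, 5, 12
cross_attention = rng.random((num_heads, tgt_len, src_len))
cross_attention /= cross_attention.sum(-1, keepdims=True)   # rows are distributions

def mask_heads(attn, keep_heads):
    mask = np.zeros((attn.shape[0], 1, 1))
    mask[list(keep_heads)] = 1.0
    return attn * mask

masked = mask_heads(cross_attention, keep_heads={2, 5})      # heads assumed salient
saliency = masked.sum(axis=(0, 1))                           # per-source-token mass
print(np.argsort(saliency)[::-1][:3])                        # top source positions
```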

COLING

序号 会议/期刊 论文 主要技术 代码 论文下载地址 摘要 摘要翻译 作者
1 COLING2020 How Domain Terminology Affects Meeting Summarization Performance https://github.com/ucfnlp/meeting-domain-terminology https://arxiv.org/pdf/2011.00692 Meetings are essential to modern organizations. Numerous meetings are held and recorded daily, more than can ever be comprehended. A meeting summarization system that identifies salient utterances from the transcripts to automatically generate meeting minutes can help. It empowers users to rapidly search and sift through large meeting collections. To date, the impact of domain terminology on the performance of meeting summarization remains understudied, despite that meetings are rich with domain knowledge. In this paper, we create gold-standard annotations for domain terminology on a sizable meeting corpus; they are known as jargon terms. We then analyze the performance of a meeting summarization system with and without jargon terms. Our findings reveal that domain terminology can have a substantial impact on summarization performance. We publicly release all domain terminology to advance research in meeting summarization. Jia Jin Koay Alexander Roustai Xiaojin Dai Dillon Burns Alec Kerrigan Fei Liu
2 COLING2020 Fact-level Extractive Summarization with Hierarchical Graph Mask on BERT https://github.com/Ruifeng-paper/FactExsum-coling2020 https://arxiv.org/pdf/2011.09739 Most current extractive summarization models generate summaries by selecting salient sentences. However, one of the problems with sentence-level extractive summarization is that there exists a gap between the human-written gold summary and the oracle sentence labels. In this paper, we propose to extract fact-level semantic units for better extractive summarization. We also introduce a hierarchical structure, which incorporates the multi-level of granularities of the textual information into the model. In addition, we incorporate our model with BERT using a hierarchical graph mask. This allows us to combine BERT’s ability in natural language understanding and the structural information without increasing the scale of the model. Experiments on the CNN/DailyMail dataset show that our model achieves state-of-the-art results. 大多数当前的提取摘要模型通过选择显着句子来生成摘要。然而,句子级提取摘要的问题之一是人工编写的黄金摘要与oracle句子标签之间存在差距。在本文中,我们建议提取事实级别的语义单元以更好地提取摘要。我们还引入了一种层次结构,它将文本信息的多级粒度合并到模型中。此外,我们使用分层图掩码将我们的模型与 BERT 结合起来。这使我们能够在不增加模型规模的情况下,将 BERT 在自然语言理解方面的能力与结构信息结合起来。在 CNN/DailyMail 数据集上的实验表明,我们的模型达到了最先进的结果。 Ruifeng Yuan Zili Wang Wenjie Li
3 COLING2020 WSL-DS: Weakly Supervised Learning with Distant Supervision for Query Focused Multi-Document Abstractive Summarization https://github.com/tahmedge/WSL-DS-COLING-2020 https://arxiv.org/pdf/2011.01421 In the Query Focused Multi-Document Summarization (QF-MDS) task, a set of documents and a query are given where the goal is to generate a summary from these documents based on the given query. However, one major challenge for this task is the lack of availability of labeled training datasets. To overcome this issue, in this paper, we propose a novel weakly supervised learning approach via utilizing distant supervision. In particular, we use datasets similar to the target dataset as the training data where we leverage pre-trained sentence similarity models to generate the weak reference summary of each individual document in a document set from the multi-document gold reference summaries. Then, we iteratively train our summarization model on each single-document to alleviate the computational complexity issue that occurs while training neural summarization models in multiple documents (i.e., long sequences) at once. Experimental results in Document Understanding Conferences (DUC) datasets show that our proposed approach sets a new state-of-the-art result in terms of various evaluation metrics. 在 Query Focused Multi-Document Summarization (QF-MDS) 任务中,给出了一组文档和一个查询,其目标是根据给定的查询从这些文档中生成摘要。然而,这项任务的一个主要挑战是缺乏标记的训练数据集。为了克服这个问题,在本文中,我们提出了一种利用远程监督的新型弱监督学习方法。特别是,我们使用与目标数据集相似的数据集作为训练数据,我们利用预训练的句子相似性模型从多文档黄金参考摘要中生成文档集中每个单独文档的弱参考摘要。然后,我们在每个单文档上迭代训练我们的摘要模型,以减轻在一次在多个文档(即长序列)中训练神经摘要模型时出现的计算复杂性问题。文档理解会议 (DUC) 数据集中的实验结果表明,我们提出的方法在各种评估指标方面取得了新的最新成果。 Md Tahmid Rahman Laskar Enamul Hoque Jimmy Xiangji Huang
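Entry 2 above connects fact-level units, sentences, and the document through a hierarchical graph mask applied to BERT's self-attention. The sketch below builds one plausible boolean mask of that shape (facts attend within their sentence and to their sentence node; sentence nodes attend to their facts and to a document node); the exact node layout and connectivity in the paper may differ.

```python
# Toy hierarchical graph mask (entry 2): a boolean matrix over [fact nodes, sentence
# nodes, document node] that can be turned into an additive bias on attention logits.
import numpy as np

fact_to_sentence = [0, 0, 1, 1, 1]        # 5 fact nodes grouped into 2 sentences
num_facts = len(fact_to_sentence)
num_sents = max(fact_to_sentence) + 1
size = num_facts + num_sents + 1          # facts + sentence nodes + 1 document node
mask = np.zeros((size, size), dtype=bool)

for i, si in enumerate(fact_to_sentence):
    for j, sj in enumerate(fact_to_sentence):
        if si == sj:
            mask[i, j] = True                                    # facts within one sentence
    mask[i, num_facts + si] = mask[num_facts + si, i] = True     # fact <-> its sentence node

doc = size - 1
for s in range(num_sents):
    mask[num_facts + s, doc] = mask[doc, num_facts + s] = True   # sentence <-> document node
np.fill_diagonal(mask, True)

attention_bias = np.where(mask, 0.0, -np.inf)   # add to self-attention logits
print(mask.astype(int))
```

Entry 3 (WSL-DS) creates weak, per-document reference summaries by picking the sentences most similar to the multi-document gold summary. The sketch below uses bag-of-words cosine similarity as a stand-in for the pretrained sentence-similarity models the paper relies on; all example text is illustrative.

```python
# Distant-supervision step in the spirit of entry 3: select the document sentences most
# similar to the cluster-level gold summary as that document's weak reference summary.
import math
import re
from collections import Counter

def bow(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    ca, cb = bow(a), bow(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def weak_reference(doc_sentences, gold_summary, k=2):
    return sorted(doc_sentences, key=lambda s: cosine(s, gold_summary), reverse=True)[:k]

if __name__ == "__main__":
    doc = ["Floods hit the coastal region over the weekend.",
           "A local festival was postponed.",
           "Thousands were evacuated as rivers overflowed."]
    gold = "Severe flooding forced thousands of evacuations in the coastal region."
    print(weak_reference(doc, gold))
```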