Paper Collection || NLP Paper Collection (Powered by ChatGPT)

Published: 2023-05-30 17:29  Author: 沃恩智慧

【1】 To Asymmetry and Beyond: Structured Pruning of Sequence to Sequence Models for Improved Inference Efficiency

链接: https://arxiv.org/abs/2304.02721

ChatGPT Summary

This paper studies the effect of structured pruning on sequence-to-sequence (Seq2Seq) language models with the goal of improving their inference efficiency. The authors examine the relationship between model size, structured pruning, inference efficiency, and summarization accuracy, showing that model accuracy is tied to encoder size while inference efficiency is tied to the decoder. Asymmetric pruning reduces inference latency by nearly 3x at an average cost of only about 1 point in Rouge-2, and the effect of asymmetry is consistent across model sizes and dataset variations.

(1) The paper studies how structured pruning affects Seq2Seq language models, aiming to improve inference efficiency.

(2) It proposes an asymmetric pruning strategy that lowers inference latency to make Seq2Seq language models more efficient.

(3) Experiments show that asymmetric pruning reduces inference latency with only a modest loss in accuracy, and that this behavior holds consistently across model sizes and dataset variants.

Abstract: Sequence-to-sequence language models can be used to produce abstractive summaries which are coherent, relevant, and concise. Still, model sizes can make deployment in latency-sensitive or web-scale implementations difficult. This paper studies the relationship between model size, structured pruning, inference efficiency, and summarization accuracy on widely used summarization datasets. We show that model accuracy is tied to the encoder size while inference efficiency is connected to the decoder. Using asymmetric pruning can lead to nearly 3x improvement in inference latency with ~1 point loss in Rouge-2. Moreover, we find both the average degradation and the role of asymmetry to be consistent across model sizes and variations in datasets.
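A minimal sketch of the asymmetric-pruning idea described above, assuming a Hugging Face BART summarizer: the encoder is left intact while decoder layers are structurally removed. The model name and the keep-every-third-layer schedule are illustrative choices, not the paper's exact recipe.

```python
# Hedged sketch: layer-level structured pruning applied asymmetrically,
# keeping the full encoder and thinning only the decoder.
import torch.nn as nn
from transformers import BartForConditionalGeneration

def prune_decoder_layers(model, keep_every=3):
    """Keep every `keep_every`-th decoder layer; the encoder is untouched."""
    layers = model.model.decoder.layers
    kept = nn.ModuleList(layers[i] for i in range(0, len(layers), keep_every))
    model.model.decoder.layers = kept
    model.config.decoder_layers = len(kept)
    return model

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model = prune_decoder_layers(model)  # 12 decoder layers -> 4
```

In practice the pruned model would be fine-tuned on the summarization data before latency and Rouge-2 are measured.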

【2】 Core Challenges in Embodied Vision-Language Planning

链接: https://arxiv.org/abs/2304.02738

ChatGPT Summary

This paper discusses the core challenges of Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that combine vision, natural language processing, and robotics. The authors propose a taxonomy that unifies these tasks and provide an in-depth analysis and comparison of current and new algorithmic approaches, metrics, simulators, and datasets used for EVLP tasks. Finally, they present the core challenges and advocate for task construction that enables model generalizability and furthers real-world deployment.

(1): What problem or task does this paper address?

Answer: It discusses the core challenges of Embodied Vision-Language Planning (EVLP) tasks and identifies the challenges that new EVLP work should seek to address.

(2): What are the key innovations or contributions of this work?

Answer: The key contribution is a taxonomy that unifies EVLP tasks, together with an in-depth analysis and comparison of current and new algorithmic approaches, metrics, simulators, and datasets. The authors also present the core challenges and advocate for task construction that enables model generalizability and advances real-world deployment.

(3): What are the strengths and weaknesses of this paper?

Answer: Its strength is a comprehensive account of the current state of EVLP that identifies the core open challenges, providing a valuable reference for follow-up research. A possible weakness is that some specific algorithmic methods are not elaborated in depth, which may make the survey hard to follow for newcomers.


Abstract: Recent advances in the areas of Multimodal Machine Learning and Artificial Intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Robotics. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly leverage computer vision and natural language for interaction in physical environments. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the current and new algorithmic approaches, metrics, simulators, and datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalisability and furthers real-world deployment.

【3】 Bengali Fake Review Detection using Semi-supervised Generative Adversarial Networks

链接: https://arxiv.org/abs/2304.02739

ChatGPT Summary

This paper fine-tunes pretrained language models with semi-supervised Generative Adversarial Networks (GANs) to classify Bengali fake reviews versus real reviews from very little annotated data, exploring the potential of GANs for this task. It shows that even with only 1024 annotated samples, BanglaBERT with semi-supervised GAN (SSGAN) under the proposed GAN-LM architecture (a generative adversarial network on top of a pretrained language model) outperforms other pretrained language models (the BanglaBERT generator, Bangla BERT Base, and Bangla-Electra) by almost 3%, 4%, and 10% respectively in terms of accuracy. Researchers short of labeled data for classification problems such as fake-review detection may find a solution in semi-supervised GANs.

(1) What problem does this paper try to solve?

It tries to classify Bengali fake and real reviews from very little annotated data by fine-tuning pretrained language models with a semi-supervised Generative Adversarial Network (GAN).

(2) What are the innovations or contributions of this work?

It proposes a semi-supervised GAN approach to classification with very few labels, and achieves better results than other pretrained language models on Bengali fake-review classification.

(3) What are the strengths and weaknesses of this paper?

Its strength is an effective remedy for scarce labeled data, with good results on Bengali fake-review classification. Its weakness is that some annotated data is still required for training; with extremely little data the method may not classify well.

Abstract: This paper investigates the potential of semi-supervised Generative Adversarial Networks (GANs) to fine-tune pretrained language models in order to classify Bengali fake reviews from real reviews with a few annotated data. With the rise of social media and e-commerce, the ability to detect fake or deceptive reviews is becoming increasingly important in order to protect consumers from being misled by false information. Any machine learning model will have trouble identifying a fake review, especially for a low resource language like Bengali. We have demonstrated that the proposed semi-supervised GAN-LM architecture (generative adversarial network on top of a pretrained language model) is a viable solution in classifying Bengali fake reviews as the experimental results suggest that even with only 1024 annotated samples, BanglaBERT with semi-supervised GAN (SSGAN) achieved an accuracy of 83.59% and a f1-score of 84.89% outperforming other pretrained language models - BanglaBERT generator, Bangla BERT Base and Bangla-Electra by almost 3%, 4% and 10% respectively in terms of accuracy. The experiments were conducted on a manually labeled food review dataset consisting of total 6014 real and fake reviews collected from various social media groups. Researchers that are experiencing difficulty recognizing not just fake reviews but other classification issues owing to a lack of labeled data may find a solution in our proposed methodology.
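A minimal sketch of the SSGAN idea, under stated assumptions: a small generator produces fake "CLS-like" feature vectors, and a discriminator head on top of the pretrained LM's features predicts the K real classes plus one extra "fake" class. The hidden sizes and generator design are illustrative, not the paper's configuration.

```python
import torch.nn as nn

class Generator(nn.Module):
    """Maps noise to fake feature vectors shaped like LM [CLS] embeddings."""
    def __init__(self, noise_dim=100, hidden=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden),
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Classifies LM features into K real classes + 1 'fake' class."""
    def __init__(self, hidden=768, num_classes=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2), nn.Dropout(0.1),
        )
        self.head = nn.Linear(hidden, num_classes + 1)  # last index = fake

    def forward(self, feats):
        return self.head(self.body(feats))

# Training (not shown): unlabeled real features are pushed away from the
# fake class; labeled features also get cross-entropy on the K real classes.
```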

【4】 Behavioral estimates of conceptual structure are robust across tasks in humans but not large language models

链接: https://arxiv.org/abs/2304.02754

ChatGPT Summary

This paper treats language AIs as a tool for studying conceptual representation in the human mind and brain. Using two methods from cognitive psychology, the authors estimate and compare the structure of conceptual representations in humans and a well-known AI model, GPT-3. The results show that conceptual structure is robust in humans, whereas in the language model it depends on the particular task used to elicit behavioral responses: structures estimated from the model's responses across different tasks cohere less with one another than human structure estimates do. This points to an important way in which the knowledge embodied in contemporary AIs differs from human cognition.

(1): What problem or task does this paper address?

It compares the conceptual representation structures of humans and the well-known language AI GPT-3, examining the robustness and consistency of conceptual structure.

(2): What are the key innovations or contributions of this study?

It uses two common cognitive-psychology methods to compare human and GPT-3 conceptual structure, revealing an important difference between the knowledge in contemporary AIs and human cognition.

(3): What are the strengths and weaknesses of this paper?

Strengths: it applies effective methods to compare conceptual structure across different kinds of subjects, and the results draw attention to the similarities and differences between contemporary AI and human cognition.

Weaknesses: given its limited choice of subject, it cannot cover all contemporary AI models; the paper's raw data are not publicly released.


Abstract: Neural network models of language have long been used as a tool for developing hypotheses about conceptual representation in the mind and brain. For many years, such use involved extracting vector-space representations of words and using distances among these to predict or understand human behavior in various semantic tasks. In contemporary language AIs, however, it is possible to interrogate the latent structure of conceptual representations using methods nearly identical to those commonly used with human participants. The current work uses two common techniques borrowed from cognitive psychology to estimate and compare lexical-semantic structure in both humans and a well-known AI, the DaVinci variant of GPT-3. In humans, we show that conceptual structure is robust to differences in culture, language, and method of estimation. Structures estimated from AI behavior, while individually fairly consistent with those estimated from human behavior, depend much more upon the particular task used to generate behavior responses--responses generated by the very same model in the two tasks yield estimates of conceptual structure that cohere less with one another than do human structure estimates. The results suggest one important way that knowledge inhering in contemporary AIs can differ from human cognition.

【5】 Application of Transformers based methods in Electronic Medical Records: A Systematic Literature Review

链接: https://arxiv.org/abs/2304.02768

ChatGPT Summary

This paper is a systematic literature review of transformer-based methods applied to electronic medical records (EMRs). Because medical-record data have grown enormously and are largely unstructured, and thus unsuitable for statistical analysis, natural language processing techniques have been widely applied in this field. The review systematically examines recent research on transformer-based methods for NLP tasks in the EMR domain: 99 articles were selected in the initial query and filtered down to 65 for detailed analysis. The papers are analyzed with respect to business problem, NLP task, models and techniques, dataset availability, reproducibility of modeling, language, and exchange format. The paper closes with some limitations of current research and recommendations for further work.

(1) What problem does this paper try to address?

It reviews the latest research on transformer-based methods for NLP tasks in the electronic medical record (EMR) domain.

(2) What are the key innovations or contributions of this work?

It provides a comprehensive, systematic review of recent research on transformer-based NLP methods applied to EMRs.

(3) What are the strengths and weaknesses of this paper?

Strength: a comprehensive and systematic review of transformer-based NLP methods in the EMR domain.

Weakness: as a literature review, it only summarizes published work and may therefore miss unpublished advances and innovations. It also offers no precise quantitative analysis, such as performance evaluation or quantitative comparison.


Abstract: The combined growth of available data and their unstructured nature has received increased interest in natural language processing (NLP) techniques to make value of these data assets since this format is not suitable for statistical analysis. This work presents a systematic literature review of state-of-the-art advances using transformer-based methods on electronic medical records (EMRs) in different NLP tasks. To the best of our knowledge, this work is unique in providing a comprehensive review of research on transformer-based methods for NLP applied to the EMR field. In the initial query, 99 articles were selected from three public databases and filtered into 65 articles for detailed analysis. The papers were analyzed with respect to the business problem, NLP task, models and techniques, availability of datasets, reproducibility of modeling, language, and exchange format. The paper presents some limitations of current research and some recommendations for further research.

【6】 Performance of Data Augmentation Methods for Brazilian Portuguese Text Classification

链接: https://arxiv.org/abs/2304.02785

ChatGPT Summary

This paper aims to improve the performance and generalization of machine learning models: the researchers analyze how data augmentation methods perform when applied to text classification problems on Brazilian Portuguese corpora. The analysis shows that some augmentation methods yield putative improvements, while also suggesting that language bias and the scarcity of non-English text data deserve further study.

(1) What problem or task does this paper address?

It explores whether data augmentation methods improve performance on Brazilian Portuguese text classification.

(2) What are the key innovations or contributions of this work?

It analyzes existing data augmentation methods on Brazilian Portuguese text classification and offers recommendations on which techniques can improve performance.

(3) What are the strengths and weaknesses of this paper?

Strength: the analysis of augmentation methods on Brazilian Portuguese text classification yields recommendations for improving performance on this task.

Weakness: the finding that language bias and the scarcity of non-English text data need more study is only raised by the paper, not explored in depth.

Abstract: Improving machine learning performance while increasing model generalization has been a constantly pursued goal by AI researchers. Data augmentation techniques are often used towards achieving this target, and most of its evaluation is made using English corpora. In this work, we took advantage of different existing data augmentation methods to analyze their performances applied to text classification problems using Brazilian Portuguese corpora. As a result, our analysis shows some putative improvements in using some of these techniques; however, it also suggests further exploitation of language bias and non-English text data scarcity.
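Since the summary does not name the exact augmentation methods evaluated, here is a minimal sketch of one common family, EDA-style random swap and deletion, as a stand-in; the Portuguese example sentence is illustrative only.

```python
import random

def random_swap(tokens, n=1):
    """Swap two randomly chosen token positions n times."""
    tokens = tokens[:]
    for _ in range(n):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]

frase = "o filme foi surpreendentemente bom".split()
print(random_swap(frase), random_deletion(frase))
```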

【7】 Context-Aware Classification of Legal Document Pages

链接: https://arxiv.org/abs/2304.02787

ChatGPT Summary

This paper addresses the page-classification needs of business applications that process, index, and retrieve professional documents, such as legal documents in PDF format. The authors note that most existing work on document image classification either targets single-page documents or treats the pages of multi-page documents independently. To address this, the paper proposes a simple but effective approach: the input is augmented with extra tokens carrying sequential information about the previous pages, which allows pre-trained Transformer models such as BERT to perform context-aware page classification.

(1): The problem solved is how to classify the pages of professional documents effectively so the documents can be processed, indexed, and retrieved.

(2): The main innovation is an input-augmentation method that introduces tokens carrying sequential information, so that every page of a given document is classified in a context-aware way.

(3): The strength of this paper is an effective method for improving document page classification. However, the experiments use only two legal datasets, in English and Portuguese, so the generalizability of the approach needs further study.


Abstract: For many business applications that require the processing, indexing, and retrieval of professional documents such as legal briefs (in PDF format etc.), it is often essential to classify the pages of any given document into their corresponding types beforehand. Most existing studies in the field of document image classification either focus on single-page documents or treat multiple pages in a document independently. Although in recent years a few techniques have been proposed to exploit the context information from neighboring pages to enhance document page classification, they typically cannot be utilized with large pre-trained language models due to the constraint on input length. In this paper, we present a simple but effective approach that overcomes the above limitation. Specifically, we enhance the input with extra tokens carrying sequential information about previous pages - introducing recurrence - which enables the usage of pre-trained Transformer models like BERT for context-aware page classification. Our experiments conducted on two legal datasets in English and Portuguese respectively show that the proposed approach can significantly improve the performance of document page classification compared to the non-recurrent setup as well as the other context-aware baselines.
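A minimal sketch of the recurrence idea under stated assumptions: the predicted labels of previous pages are injected as extra special tokens in front of the current page's text. The `[PREV=...]` token format, the label set, and the window size are hypothetical illustrations, not the paper's exact scheme.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
labels = ["cover", "pleading", "exhibit"]  # assumed page types
# Register one special token per (assumed) page-type label.
tokenizer.add_special_tokens(
    {"additional_special_tokens": [f"[PREV={l}]" for l in labels]}
)

def build_input(page_text, prev_labels, window=2):
    """Prefix the page text with tokens naming the last `window` page labels."""
    prefix = " ".join(f"[PREV={l}]" for l in prev_labels[-window:])
    return tokenizer(prefix + " " + page_text, truncation=True,
                     max_length=512, return_tensors="pt")

enc = build_input("IN THE UNITED STATES DISTRICT COURT ...",
                  ["cover", "pleading"])
```

If these tokens are fed to a model, its embedding matrix must be resized with `model.resize_token_embeddings(len(tokenizer))` after they are added.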

【8】 Pragmatically Appropriate Diversity for Dialogue Evaluation

链接: https://arxiv.org/abs/2304.02812

ChatGPT Summary

This paper tackles a challenge dialogue systems face when generating diverse replies: current natural language generation techniques struggle to produce multiple meaningfully different responses. The authors propose the notion of Pragmatically Appropriate Diversity, defined as the extent to which a conversation creates and constrains the creation of multiple diverse responses, and support this speech-act-based diversity measure with a human-created multi-response dataset. The innovation is that the underlying speech acts are taken into account when evaluating response diversity. The studies find that creative writers' judgments of how strongly a conversation inspires diverse responses align well with its Pragmatically Appropriate Diversity. The strength of the work is a new way of measuring dialogue-system diversity that takes a conversation's possible follow-up responses into account; the weakness is that the effectiveness of this new measure has not yet been tested more rigorously.

Abstract: Linguistic pragmatics state that a conversation's underlying speech acts can constrain the type of response which is appropriate at each turn in the conversation. When generating dialogue responses, neural dialogue agents struggle to produce diverse responses. Currently, dialogue diversity is assessed using automatic metrics, but the underlying speech acts do not inform these metrics.   To remedy this, we propose the notion of Pragmatically Appropriate Diversity, defined as the extent to which a conversation creates and constrains the creation of multiple diverse responses. Using a human-created multi-response dataset, we find significant support for the hypothesis that speech acts provide a signal for the diversity of the set of next responses. Building on this result, we propose a new human evaluation task where creative writers predict the extent to which conversations inspire the creation of multiple diverse responses. Our studies find that writers' judgments align with the Pragmatically Appropriate Diversity of conversations. Our work suggests that expectations for diversity metric scores should vary depending on the speech act.

【9】 GPT detectors are biased against non-native English writers

链接: https://arxiv.org/abs/2304.02819

ChatGPT Summary

This paper examines the limitations of Generative Pretrained Transformer (GPT) detectors, finding that they misclassify text by non-native English writers as AI-generated while accurately identifying text by native writers. The authors further show that simple prompting strategies can mitigate this bias and also effectively bypass GPT detectors, suggesting that such detectors risk penalizing writers with constrained linguistic expression. The study indicates that the fairness and robustness of GPT detectors need further research, and cautions against their use in evaluative or educational settings, especially where they may exclude or penalize non-native English writers from the global discourse.

(1): What problem or task does this paper address?

Answer: It studies how GPT detectors misjudge text written by non-native English writers as AI-generated while identifying text by native English writers with high accuracy.

(2): What are the key innovations or contributions of this work?

Answer: Through experiments, the paper reveals that GPT detectors misclassify text from non-native English writers, and shows that simple prompting strategies both mitigate this bias and effectively bypass the detectors. The findings indicate that detector fairness and robustness need further study, and warn of the risks of using such detectors in evaluative or educational settings, particularly where they may exclude or penalize non-native English writers.

(3): What are the strengths and weaknesses of this paper?

Answer: Its strength is that it experimentally uncovers the limitations of existing GPT detectors and offers a way to mitigate the bias. Its weakness is that detector fairness and robustness are not fully explored and require more supporting research.

Abstract: The rapid adoption of generative language models has brought about substantial advancements in digital communication, while simultaneously raising concerns regarding the potential misuse of AI-generated content. Although numerous detection methods have been proposed to differentiate between AI and human-generated content, the fairness and robustness of these detectors remain underexplored. In this study, we evaluate the performance of several widely-used GPT detectors using writing samples from native and non-native English writers. Our findings reveal that these detectors consistently misclassify non-native English writing samples as AI-generated, whereas native writing samples are accurately identified. Furthermore, we demonstrate that simple prompting strategies can not only mitigate this bias but also effectively bypass GPT detectors, suggesting that GPT detectors may unintentionally penalize writers with constrained linguistic expressions. Our results call for a broader conversation about the ethical implications of deploying ChatGPT content detectors and caution against their use in evaluative or educational settings, particularly when they may inadvertently penalize or exclude non-native English speakers from the global discourse.

【10】 Can Large Language Models Play Text Games Well? Current State-of-the-Art and Open Questions

链接: https://arxiv.org/abs/2304.02868

ChatGPT Summary

This paper explores how well large language models play text games, i.e., whether they can understand an environment and respond to situations through dialogue with the game world. The study finds that ChatGPT is competitive with existing systems but still exhibits a low level of intelligence: it cannot build a world model by playing the game or reading the game manual, may fail to leverage world knowledge it already has, and cannot infer the goal of each step as the game progresses. These results open new research questions at the intersection of artificial intelligence, machine learning, and natural language processing.

(1): What problem or task does this paper try to solve?

It investigates how large language models perform in text games and whether they can understand the environment and respond to situations through dialogue.

(2): What are the key innovations or contributions of this work?

It finds that ChatGPT is competitive with existing systems in text games while still showing a low level of intelligence, and it opens new research questions at the intersection of AI, machine learning, and NLP.

(3): What are the strengths and weaknesses of this paper?

The question it explores is novel: the finding that large language models do not yet excel at text games opens a new research direction. However, more work is needed to establish evaluation standards for this behavior and to make models perform better on the task.

Abstract: Large language models (LLMs) such as ChatGPT and GPT-4 have recently demonstrated their remarkable abilities of communicating with human users. In this technical report, we take an initiative to investigate their capacities of playing text games, in which a player has to understand the environment and respond to situations by having dialogues with the game world. Our experiments show that ChatGPT performs competitively compared to all the existing systems but still exhibits a low level of intelligence. Precisely, ChatGPT can not construct the world model by playing the game or even reading the game manual; it may fail to leverage the world knowledge that it already has; it cannot infer the goal of each step as the game progresses. Our results open up new research questions at the intersection of artificial intelligence, machine learning, and natural language processing.

【11】 Automatic ICD-10 Code Association: A Challenging Task on French Clinical Texts

链接: https://arxiv.org/abs/2304.02886

ChatGPT Summary

This paper addresses the task of automatically associating ICD-10 codes with medical data, focusing on French clinical texts. The authors adapt pre-trained Transformer-based language models to the task and propose a model that combines recent advances in NLP and multi-label classification to handle the large sets of both input tokens and labels to be guessed. Experiments on a French clinical dataset show that the approach improves the F1 metric by more than 55% over state-of-the-art results.

1. The problem addressed is automatically associating ICD-10 codes with medical data.

2. The innovation is adapting pre-trained Transformer language models to this task and proposing a new model that combines NLP and multi-label classification techniques to handle the large numbers of input tokens and candidate labels.

3. The strength is that the proposed method significantly outperforms the state of the art on a French clinical dataset. However, the paper does not explore multilingual settings.

Abstract: Automatically associating ICD codes with electronic health data is a well-known NLP task in medical research. NLP has evolved significantly in recent years with the emergence of pre-trained language models based on Transformers architecture, mainly in the English language. This paper adapts these models to automatically associate the ICD codes. Several neural network architectures have been experimented with to address the challenges of dealing with a large set of both input tokens and labels to be guessed. In this paper, we propose a model that combines the latest advances in NLP and multi-label classification for ICD-10 code association. Fair experiments on a Clinical dataset in the French language show that our approach increases the $F_1$-score metric by more than 55\% compared to state-of-the-art results.
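A minimal sketch of the multi-label setup, assuming a Transformer encoder whose pooled output feeds a sigmoid head with one logit per ICD-10 code; the French model name and the code count are placeholders, not the paper's architecture.

```python
import torch.nn as nn
from transformers import AutoModel

class ICDCoder(nn.Module):
    def __init__(self, encoder_name="camembert-base", num_codes=2000):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_codes)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]   # [CLS]-style pooling
        return self.classifier(pooled)         # one logit per ICD-10 code

# Multi-label training treats each code as an independent sigmoid decision:
loss_fn = nn.BCEWithLogitsLoss()
```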

【12】 Affect as a proxy for literary mood

链接: https://arxiv.org/abs/2304.02894

ChatGPT Summary

The theme of this paper is affect as a proxy for literary mood: the authors argue that affect can serve as a proxy for mood in literary texts. The study explores the differences between computationally detecting tone and detecting mood. Methodologically, the authors use affective word embeddings to examine how affect is distributed across different text segments. They also present a simple yet effective method of enhancing emotion lexicons that accounts for semantic shift and the domain of the text, producing real-world results that closely match contemporary qualitative analyses.

(1): What problem or task does the paper address?

Answer: It proposes to use affect as a proxy for mood in literary works.

(2): What are the key innovations or contributions of this work?

Answer: The authors use affective word embeddings and propose a simple, effective method for enhancing emotion lexicons, so that computational measurements better match the mood information found in modern qualitative analyses.

(3): What are the strengths and weaknesses of this paper?

Answer: The paper offers a new method that addresses the shortcomings of existing emotion lexicons while matching modern qualitative analysis; however, the method's feasibility and range of application need further study.


Abstract: We propose to use affect as a proxy for mood in literary texts. In this study, we explore the differences in computationally detecting tone versus detecting mood. Methodologically we utilize affective word embeddings to look at the affective distribution in different text segments. We also present a simple yet efficient and effective method of enhancing emotion lexicons to take both semantic shift and the domain of the text into account producing real-world congruent results closely matching both contemporary and modern qualitative analyses.

【13】 SpanRE: Entities and Overlapping Relations Extraction Based on Spans and Entity Attention

链接: https://arxiv.org/abs/2304.02901

ChatGPT Summary

This paper, "SpanRE: Entities and Overlapping Relations Extraction Based on Spans and Entity Attention," proposes a new method for the overlapping problem in entity and relation extraction. First, candidate subjects are extracted with a standard span mechanism; then a labeled span mechanism extracts objects and relations simultaneously. An entity attention mechanism enhances the information fusion between subject and sentence while objects and relations are extracted. Tested on two public datasets, the method achieves the best performance on both.

(1) What problem or task does this paper try to solve?

It tries to solve the overlapping problem in entity and relation extraction. Previous methods either did not address overlapping or solved it only partially; this paper proposes a new method that handles triplet overlapping completely.

(2) What are the key innovations or contributions of this work?

The key innovations are: extracting candidate subjects with a standard span mechanism and then extracting objects and relations simultaneously with a labeled span mechanism; and designing an entity attention mechanism to strengthen the information fusion between subject and sentence. Together these form an effective new treatment of the overlapping problem.

(3) What are the strengths and weaknesses of this paper?

The strength is that the method handles the overlapping problem in entity and relation extraction completely and achieves the best performance on two public datasets. The weakness is that the paper does not compare directly against certain other methods, so it is unclear whether its performance exceeds theirs.

Abstract: Extracting entities and relations is an essential task of information extraction. Triplets extracted from a sentence might overlap with each other. Previous methods either did not address the overlapping issues or solved overlapping issues partially. To tackle triplet overlapping problems completely, firstly we extract candidate subjects with a standard span mechanism. Then we present a labeled span mechanism to extract the objects and relations simultaneously, we use the labeled span mechanism to generate labeled spans whose start and end positions indicate the objects, and whose labels correspond to relations of subject and objects. Besides, we design an entity attention mechanism to enhance the information fusion between subject and sentence during extracting objects and relations. We test our method on two public datasets, our method achieves the best performances on these two datasets.

【14】 Multi-label classification of open-ended questions with BERT

链接: https://arxiv.org/abs/2304.02945

ChatGPT Summary

This paper applies the transformer-based model BERT to multi-label classification, where multiple labels can be assigned to a single text answer to an open-ended survey question. Compared with traditional multi-label algorithms (Binary Relevance, Label Powerset, ECC), the innovation is that BERT achieves better results on a German social science survey (the smallest 0/1 loss, 13.1%). By automating multi-label classification, the work extends the semi-automatic coding approaches used in the social sciences, making multi-label classification more accurate and more automatic. The strength of the paper is its demonstration that BERT can effectively automate multi-label classification, with comparisons against other methods. The weakness is that answers carrying multiple labels remain difficult to classify.

(1) The paper studies multi-label classification of text data, in particular of answers to open-ended survey questions.

(2) The innovation is an effective BERT-based solution to the multi-label coding problem in a German social science survey, along with a path toward fully automatic classification.

(3) The strength of the method is that BERT automates multi-label classification, delivers more accurate results, and is compared rigorously against alternatives. The shortcoming is that the complexity of multi-label data is not fully handled and needs further improvement.


Abstract: Open-ended questions in surveys are valuable because they do not constrain the respondent's answer, thereby avoiding biases. However, answers to open-ended questions are text data which are harder to analyze. Traditionally, answers were manually classified as specified in the coding manual. Most of the effort to automate coding has gone into the easier problem of single label prediction, where answers are classified into a single code. However, open-ends that require multi-label classification, i.e., that are assigned multiple codes, occur frequently. This paper focuses on multi-label classification of text answers to open-ended survey questions in social science surveys. We evaluate the performance of the transformer-based architecture BERT for the German language in comparison to traditional multi-label algorithms (Binary Relevance, Label Powerset, ECC) in a German social science survey, the GLES Panel (N=17,584, 55 labels). We find that classification with BERT (forcing at least one label) has the smallest 0/1 loss (13.1%) among methods considered (18.9%-21.6%). As expected, it is much easier to correctly predict answer texts that correspond to a single label (7.1% loss) than those that correspond to multiple labels ($\sim$50% loss). Because BERT predicts zero labels for only 1.5% of the answers, forcing at least one label, while recommended, ultimately does not lower the 0/1 loss by much. Our work has important implications for social scientists: 1) We have shown multi-label classification with BERT works in the German language for open-ends. 2) For mildly multi-label classification tasks, the loss now appears small enough to allow for fully automatic classification (as compared to semi-automatic approaches). 3) Multi-label classification with BERT requires only a single model. The leading competitor, ECC, iterates through individual single label predictions.
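A minimal sketch of the "forcing at least one label" rule mentioned in the abstract, under the usual multi-label convention of one sigmoid score per label; the 0.5 threshold is an assumed default, not the paper's tuned value.

```python
import torch

def predict_labels(logits, threshold=0.5):
    """Threshold sigmoid scores; rows with no label fall back to the argmax."""
    probs = torch.sigmoid(logits)
    preds = probs > threshold
    empty = preds.sum(dim=1) == 0            # answers predicted with zero labels
    preds[empty, probs[empty].argmax(dim=1)] = True
    return preds

logits = torch.randn(4, 55)                  # 4 answers, 55 labels (as in GLES)
print(predict_labels(logits).sum(dim=1))     # every row now has >= 1 label
```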

【15】 Leveraging Social Interactions to Detect Misinformation on Social Media

链接: https://arxiv.org/abs/2304.02983

ChatGPT Summary

This paper shows how social interactions can be leveraged to detect misinformation on social media. The authors use a dataset created during the COVID-19 pandemic containing cascades of tweets discussing information weakly labeled as reliable or unreliable, based on a prior evaluation of the information source. The key innovation is incorporating social-network information into the detection of unreliable cascades, using the social context to judge reliability, and testing several methods to learn representations of the social interactions within the cascades. By also keeping track of the timing of the interactions, the paper improves over existing models.

(1): What problem does this paper try to solve?

Answer: It addresses how to leverage social interactions to detect misinformation on social media.

(2): What are the main innovations or contributions of this work?

Answer: The paper incorporates social-network information into the detection of unreliable tweet cascades, uses the social context to judge reliability, and tests several methods for learning representations of social interactions within cascades. By tracking the sequence of interactions over time, it also improves on existing models.

(3): What are the strengths and weaknesses of this paper?

Answer: Its strength is that it leverages social-network information to detect misinformation on social media and improves over prior models. Its weakness is that the approach still needs further testing and validation, and application in broader settings.


Abstract: Detecting misinformation threads is crucial to guarantee a healthy environment on social media. We address the problem using the data set created during the COVID-19 pandemic. It contains cascades of tweets discussing information weakly labeled as reliable or unreliable, based on a previous evaluation of the information source. The models identifying unreliable threads usually rely on textual features. But reliability is not just what is said, but by whom and to whom. We additionally leverage on network information. Following the homophily principle, we hypothesize that users who interact are generally interested in similar topics and spreading similar kind of news, which in turn is generally reliable or not. We test several methods to learn representations of the social interactions within the cascades, combining them with deep neural language models in a Multi-Input (MI) framework. Keeping track of the sequence of the interactions during the time, we improve over previous state-of-the-art models.

【16】 Natural Language Robot Programming: NLP integrated with autonomous robotic grasping

链接: https://arxiv.org/abs/2304.02993

ChatGPT Summary

This paper presents a grammar-based natural-language programming framework for controlling an autonomous robot in pick-and-place tasks. The framework uses a custom dictionary of action words that stores together words sharing meaning, so the vocabulary can be extended easily by adding more action words from a lexical database. The authors validate their Natural Language Robot Programming (NLRP) framework in simulation and real-world experiments with a Franka Panda robotic arm equipped with a calibrated camera-in-hand and a microphone. Participants completed pick-and-place tasks using verbal commands, which were converted to text and processed through the NLRP framework to obtain motion trajectories for the robot. The results indicate that the approach has a high system-usability score, and the framework's dictionary can be extended easily without relying on transfer learning or large datasets. The authors plan a comprehensive user study comparing the framework with other approaches to human-assisted pick-and-place tasks.

(1): What problem does this paper solve?

-- Controlling an autonomous robot in pick-and-place tasks through a grammar-based natural-language programming framework.

(2): What are the key innovations or contributions of this work?

-- An effective natural-language robot-programming framework whose custom action dictionary lets the robot's vocabulary of actions be extended easily.

(3): What are the strengths and weaknesses of this paper?

-- Strengths: the framework is easy to extend, does not depend on transfer learning or large datasets, and achieves a high system-usability score.

-- Weaknesses: the work is validated only on pick-and-place tasks; broader application scenarios remain to be tested.


Abstract: In this paper, we present a grammar-based natural language framework for robot programming, specifically for pick-and-place tasks. Our approach uses a custom dictionary of action words, designed to store together words that share meaning, allowing for easy expansion of the vocabulary by adding more action words from a lexical database. We validate our Natural Language Robot Programming (NLRP) framework through simulation and real-world experimentation, using a Franka Panda robotic arm equipped with a calibrated camera-in-hand and a microphone. Participants were asked to complete a pick-and-place task using verbal commands, which were converted into text using Google's Speech-to-Text API and processed through the NLRP framework to obtain joint space trajectories for the robot. Our results indicate that our approach has a high system usability score. The framework's dictionary can be easily extended without relying on transfer learning or large data sets. In the future, we plan to compare the presented framework with different approaches of human-assisted pick-and-place tasks via a comprehensive user study.
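A minimal sketch of the action-dictionary idea under stated assumptions: synonymous command verbs map to one canonical robot action, and new synonyms can be appended without retraining. The verb groups and action names are hypothetical illustrations, not NLRP's actual dictionary.

```python
# Hypothetical action dictionary: groups of synonymous verbs -> canonical action.
ACTIONS = {
    "pick": {"pick", "grab", "take", "lift"},
    "place": {"place", "put", "drop", "set"},
}

def parse_command(text):
    """Return (canonical action, remaining words) for the first known verb."""
    words = text.lower().split()
    for i, w in enumerate(words):
        for action, synonyms in ACTIONS.items():
            if w in synonyms:
                return action, words[i + 1:]
    raise ValueError(f"no known action word in: {text!r}")

# Extending the vocabulary is a set update, no retraining involved:
ACTIONS["pick"].add("fetch")
print(parse_command("please fetch the red cube"))  # ('pick', ['the', 'red', 'cube'])
```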

【17】 Compression of enumerations and gain

链接: https://arxiv.org/abs/2304.03030

ChatGPT Summary

This paper studies the compressibility of enumerations of computably enumerable sets and its role in their relative Kolmogorov complexity with respect to density. The authors consider a strong and a weak form of compression and examine the gain: the amount of auxiliary information embedded in a compressed enumeration. Strong compression and weak gainless compression are shown for every computably enumerable set, and a positional game is studied toward understanding strong gainless compression.

(1): What problem or task does this paper address?

It studies the compressibility of enumerations of computably enumerable sets and its role in relative Kolmogorov complexity, examining both a strong and a weak form of compression.

(2): What are the key innovations or contributions of this work?

The key contribution is a detailed study of the compressibility of enumerations of computably enumerable sets, introducing strong and weak forms of compression and analyzing the gain, i.e., the auxiliary information embedded in a compressed enumeration, with respect to density.

(3): What are the strengths and weaknesses of this paper?

Its strength is a thorough treatment of enumeration compressibility, covering the strong and weak forms of compression as well as the strong gainless case. It is, however, a demanding read that requires a solid mathematical background and strong logical reasoning.


Abstract: We study the compressibility of enumerations, and its role in the relative Kolmogorov complexity of computably enumerable sets, with respect to density. With respect to a strong and a weak form of compression, we examine the gain: the amount of auxiliary information embedded in the compressed enumeration. Strong compression and weak gainless compression is shown for any computably enumerable set, and a positional game is studied toward understanding strong gainless compression.

【18】 ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments

链接: https://arxiv.org/abs/2304.03047

ChatGPT Summary

This paper addresses vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN agent, it proposes ETPNav, a navigation framework based on topological mapping that focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) obstacle-avoiding control in continuous environments. ETPNav builds a topological map of the environment online by self-organizing predicted waypoints along the traversed path, without prior environmental experience. It decomposes navigation into high-level planning and low-level control, using a transformer-based cross-modal planner to generate navigation plans from topological maps and instructions, then executing the plans through an obstacle-avoiding controller. Experiments show that ETPNav improves over the prior state of the art by more than 10% and 20% on the R2R-CE and RxR-CE datasets, respectively.

(1): What is the research goal of this paper?

It studies vision-language navigation in continuous environments, aiming to develop a robust VLN agent.

(2): What are the key innovations or contributions of this paper?

The key innovation is ETPNav, a topological-mapping-based navigation framework that splits navigation into high-level planning and low-level control, generates plans with a transformer-based cross-modal planner, and executes them through an obstacle-avoiding controller.

(3): What are the strengths and weaknesses of this paper?

Its strength is a topological-mapping navigation framework, combining a Transformer planner with an obstacle-avoiding controller, that performs well on the VLN-CE task. Its weakness is that the experiments are confined to specific datasets and tasks, so the results may not generalize.

Abstract: Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It becomes increasingly crucial in the field of embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we propose to address a more practical yet challenging counterpart setting - vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along a traversed path, without prior environmental experience. It privileges the agent to break down the navigation procedure into high-level planning and low-level control. Concurrently, ETPNav utilizes a transformer-based cross-modal planner to generate navigation plans based on topological maps and instructions. The plan is then performed through an obstacle-avoiding controller that leverages a trial-and-error heuristic to prevent navigation from getting stuck in obstacles. Experimental results demonstrate the effectiveness of the proposed method. ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets, respectively. Our code is available at https://github.com/MarSaKi/ETPNav.

【19】 ChatGPT for Shaping the Future of Dentistry: The Potential of Multi-Modal Large Language Model

链接: https://arxiv.org/abs/2304.03086

ChatGPT Summary

This paper discusses the application prospects of large language models (LLMs) in dentistry. The authors introduce LLMs from the NLP field and analyze two primary deployment methods for dental diagnosis and treatment: automated dental diagnosis and cross-modal dental diagnosis. In particular, they present a use case of a fully automatic Multi-Modal LLM AI system that processes multi-source data with a cross-modal encoder and performs complex clinical operations via natural-language reasoning. While LLMs offer significant potential benefits, challenges such as data privacy, data quality, and model bias need further study. Overall, LLMs have great potential in dental diagnosis and treatment, indicating a promising avenue for clinical application and research in dentistry.

(1) What is the problem or task of this paper?

- It discusses the prospects and potential of large language models in dentistry, introducing LLMs from NLP and the two primary deployment methods for dental diagnosis and treatment.

(2) What are the key innovations or contributions of this work?

- It proposes processing multi-source data with a cross-modal encoder and using natural-language reasoning to perform complex clinical operations, presenting a fully automatic multi-modal LLM AI system for dental diagnosis and treatment that demonstrates the great potential of LLMs in dentistry.

(3) What are the strengths and weaknesses of this paper?

- Strengths: it offers a comprehensive perspective on frontier applications of large language models in dental diagnosis and treatment, suggesting directions for future application and research.

- Weaknesses: no specific application scenario is studied in depth, and no experimental results or performance evaluations are provided. The authors also propose no solutions for data privacy, data quality, or model bias.


Abstract: The ChatGPT, as a lite and conversational variant of Generative Pretrained Transformer 4 (GPT-4) developed by OpenAI, is one of the milestone Large Language Models (LLMs) with billions of parameters. LLMs, in fact, have stirred up a lot of interest among researchers and practitioners by their impressive skills in natural language processing tasks, which have a profound impact on a wide range of fields. This paper mainly discusses the future applications of LLMs in dentistry. We introduce two primary LLM deployment methods in dentistry, including automated dental diagnosis and cross-modal dental diagnosis, and examine their potential applications. Especially, equipped with a cross-modal encoder, a single LLM can manage multi-source data and conduct advanced natural language reasoning to perform complex clinical operations. A use case is presented to demonstrate the potential of a fully automatic Multi-Modal LLM AI system for dentistry clinical application. While LLMs offer significant potential benefits, the challenges, such as data privacy, data quality, and model bias, need further study. Overall, LLMs have the potential to revolutionize dental diagnosis and treatment, which indicates a promising avenue for clinical application and research in dentistry.

【20】 Investigating Chain-of-thought with ChatGPT for Stance Detection on Social Media

链接: https://arxiv.org/abs/2304.03087

ChatGPT Summary

This paper investigates the Chain-of-Thought (CoT) approach, applied to large pre-trained language models represented by ChatGPT, for stance detection on social media. The study finds that CoT achieves higher accuracy on this task and is an effective alternative to traditional methods.

(1) What problem or task does this paper solve?

It addresses stance detection on social media, exploring the Chain-of-Thought (CoT) approach, which requires no backpropagation training.

(2) What are the key innovations or contributions of this work?

The study shows that on stance detection, the Chain-of-Thought (CoT) approach achieves higher accuracy than conventional machine learning, early deep neural networks, and pre-trained fine-tuned models.

(3) What are the strengths and weaknesses of this paper?

Strength: it offers a new approach that handles stance detection on social media well, with good experimental results.

Weakness: the scope of the study is narrow, covering only the application of chain-of-thought to stance detection.

Abstract: Stance detection predicts attitudes towards targets in texts and has gained attention with the rise of social media. Traditional approaches include conventional machine learning, early deep neural networks, and pre-trained fine-tuning models. However, with the evolution of very large pre-trained language models (VLPLMs) like ChatGPT (GPT-3.5), traditional methods face deployment challenges. The parameter-free Chain-of-Thought (CoT) approach, not requiring backpropagation training, has emerged as a promising alternative. This paper examines CoT's effectiveness in stance detection tasks, demonstrating its superior accuracy and discussing associated challenges.
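A minimal sketch of what a CoT stance-detection prompt could look like, using the legacy (pre-1.0) `openai` chat API; the prompt wording and the example tweet are hypothetical, not the paper's templates.

```python
import openai  # assumes openai.api_key is set; legacy ChatCompletion API

def cot_stance(tweet, target):
    """Ask the model to reason step by step before labeling the stance."""
    prompt = (
        f"Tweet: {tweet}\n"
        f"Target: {target}\n"
        "Question: What is the tweet's stance toward the target "
        "(favor, against, or none)?\n"
        "Let's think step by step before answering."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"]

print(cot_stance("Wind farms ruin every landscape they touch.",
                 "renewable energy"))
```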

【21】 Static Fuzzy Bag-of-Words: a lightweight sentence embedding algorithm

链接: https://arxiv.org/abs/2304.03098

ChatGPT Summary

This paper proposes Static Fuzzy Bag-of-Words (SFBoW), a lightweight sentence-embedding algorithm. The model is a refinement of the Fuzzy Bag-of-Words approach and provides sentence embeddings with a predefined dimension. It delivers competitive performance on Semantic Textual Similarity benchmarks while requiring few computational resources. The contribution is a fast and efficient solution to the sentence-embedding problem.

(1): What problem or task does this paper try to solve?

It tries to provide a sentence-level embedding algorithm that works under low computational budgets, for use on Semantic Textual Similarity benchmarks.

(2): What are the key innovations or contributions of this work?

The key contribution is the Static Fuzzy Bag-of-Words model, a fast and efficient solution to the sentence-embedding problem.

(3): What are the strengths and weaknesses of this paper?

The model is efficient and accurate for sentence embedding and performs competitively on Semantic Textual Similarity benchmarks. A possible drawback is that, being static, it is less suited to dynamic text that requires continual updating.

Abstract: The introduction of embedding techniques has pushed forward significantly the Natural Language Processing field. Many of the proposed solutions have been presented for word-level encoding; anyhow, in the last years, new mechanism to treat information at an higher level of aggregation, like at sentence- and document-level, have emerged. With this work we address specifically the sentence embeddings problem, presenting the Static Fuzzy Bag-of-Word model. Our model is a refinement of the Fuzzy Bag-of-Words approach, providing sentence embeddings with a predefined dimension. SFBoW provides competitive performances in Semantic Textual Similarity benchmarks, while requiring low computational resources.
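A minimal sketch of the fuzzy bag-of-words idea under stated assumptions: each word's vector is softly assigned, by similarity, to a fixed basis of cluster vectors, and the sentence embedding is the sum of those memberships, yielding a predefined dimension. The cosine-similarity membership and random basis here are illustrative choices, not SFBoW's exact formulation.

```python
import numpy as np

def fuzzy_bow(tokens, word_vecs, basis):
    """Sum of each word's similarity profile against a fixed basis.

    word_vecs: dict token -> (d,) array; basis: (k, d) array.
    Returns a (k,) sentence embedding, with k fixed in advance.
    """
    emb = np.zeros(basis.shape[0])
    for t in tokens:
        if t not in word_vecs:
            continue
        v = word_vecs[t]
        sims = basis @ v / (np.linalg.norm(basis, axis=1)
                            * np.linalg.norm(v) + 1e-9)
        emb += np.maximum(sims, 0.0)   # non-negative fuzzy memberships
    return emb

# Toy usage with random vectors standing in for pretrained embeddings.
rng = np.random.default_rng(0)
vecs = {w: rng.normal(size=50) for w in "the cat sat on the mat".split()}
B = rng.normal(size=(10, 50))          # 10-dimensional sentence space
print(fuzzy_bow("the cat sat".split(), vecs, B).shape)  # (10,)
```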

【22】 Evaluating the Robustness of Machine Reading Comprehension Models to Low Resource Entity Renaming

链接: https://arxiv.org/abs/2304.03145

ChatGPT Summary

This paper studies the robustness of machine reading comprehension (MRC) models to entity renaming, proposing EntSwap, a method for perturbing test data. On a test set whose entities are renamed to entities from low-resource regions of Africa, large models perform comparatively better than base models. An analysis by entity type shows that entities of type person are highly challenging for MRC models.

(1) What problem or task does this paper try to solve?

It studies the robustness of machine reading comprehension models to entity renaming.

(2) What are the key innovations or contributions of this work?

It proposes EntSwap, a method for test-time perturbation, uses it to generate the AfriSQuAD2 test set, and evaluates the robustness of three popular MRC models.

(3) What are the strengths and weaknesses of this paper?

Strength: it proposes a new test-set generation method, EntSwap, capable of evaluating the robustness of MRC models, which is of clear value to MRC research.

Weakness: it studies only robustness to entity renaming, which may limit its scope.

Abstract: Question answering (QA) models have shown compelling results in the task of Machine Reading Comprehension (MRC). Recently these systems have proved to perform better than humans on held-out test sets of datasets e.g. SQuAD, but their robustness is not guaranteed. The QA model's brittleness is exposed when evaluated on adversarial generated examples by a performance drop. In this study, we explore the robustness of MRC models to entity renaming, with entities from low-resource regions such as Africa. We propose EntSwap, a method for test-time perturbations, to create a test set whose entities have been renamed. In particular, we rename entities of type: country, person, nationality, location, organization, and city, to create AfriSQuAD2. Using the perturbed test set, we evaluate the robustness of three popular MRC models. We find that compared to base models, large models perform well comparatively on novel entities. Furthermore, our analysis indicates that entity type person highly challenges the MRC models' performance.
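A minimal sketch of test-time entity renaming in the spirit of EntSwap; the renaming table and the use of plain string replacement (instead of a proper NER pipeline) are simplifying assumptions, not the paper's implementation.

```python
# Hypothetical renaming table: original entity -> African counterpart of the
# same type (city, country, person), applied consistently across an example.
RENAMES = {
    "London": "Kampala",         # city
    "France": "Senegal",         # country
    "John Smith": "Amina Okoro", # person
}

def entswap(example):
    """Apply the renaming to context, question, and answer consistently."""
    def sub(text):
        for old, new in RENAMES.items():
            text = text.replace(old, new)
        return text
    return {k: sub(v) for k, v in example.items()}

ex = {"context": "John Smith moved to London.",
      "question": "Where does John Smith live?",
      "answer": "London"}
print(entswap(ex))
```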

【23】 Zero-Shot Next-Item Recommendation using Large Pretrained Language Models

链接: https://arxiv.org/abs/2304.03153

ChatGPT Summary

This paper, "Zero-Shot Next-Item Recommendation using Large Pretrained Language Models," explores how to use large pretrained language models (LLMs) for next-item recommendation in the zero-shot setting, i.e., without training on the recommendation data. The paper identifies two major challenges: the recommendation space is extremely large, and the LLM does not know the target user's past interactions and preferences. To address them, it proposes a Zero-Shot Next-Item Recommendation (NIR) prompting strategy in which an external module generates candidate items via user filtering or item filtering, and a 3-step prompt guides GPT-3 to capture the user's preferences, select representative previously watched movies, and recommend a ranked list of 10 movies. Evaluated on the MovieLens 100K dataset, the approach achieves strong zero-shot performance, even outperforming some strong sequential recommendation models trained on the entire training set, demonstrating ample research opportunities for using LLMs as recommenders. The code is available at https://github.com/AGI-Edgerunners/LLM-Next-Item-Rec.

(1): What is the problem or task of this paper?

The problem is how to use large pretrained language models for zero-shot next-item recommendation.

(2): What are the key innovations or contributions of this work?

The key innovation is the Zero-Shot Next-Item Recommendation (NIR) prompting strategy: an external module generates candidates via user filtering or item filtering, and a 3-step prompt guides GPT-3 to capture the user's preferences, select representative previously watched movies, and recommend a ranked top-10 list. The method achieves strong zero-shot performance, even outperforming some strong sequential recommenders trained on the full training set.

(3): What are the strengths and weaknesses of this paper?

Strengths:

1. It uses large pretrained language models for zero-shot next-item recommendation, demonstrating LLMs' ability to reason without training examples.

2. The proposed NIR prompting strategy effectively addresses the large recommendation space and the LLM's lack of knowledge about the user's past interactions and preferences.

3. It shows strong zero-shot performance on the MovieLens 100K dataset, even surpassing some strong sequential recommenders trained on the full training set.

Weaknesses:

1. The method depends on an external module to generate candidates, which may affect recommendation quality.

2. Its performance on larger datasets remains to be studied.


Abstract: Large language models (LLMs) have achieved impressive zero-shot performance in various natural language processing (NLP) tasks, demonstrating their capabilities for inference without training examples. Despite their success, no research has yet explored the potential of LLMs to perform next-item recommendations in the zero-shot setting. We have identified two major challenges that must be addressed to enable LLMs to act effectively as recommenders. First, the recommendation space can be extremely large for LLMs, and LLMs do not know about the target user's past interacted items and preferences. To address this gap, we propose a prompting strategy called Zero-Shot Next-Item Recommendation (NIR) prompting that directs LLMs to make next-item recommendations. Specifically, the NIR-based strategy involves using an external module to generate candidate items based on user-filtering or item-filtering. Our strategy incorporates a 3-step prompting that guides GPT-3 to carry subtasks that capture the user's preferences, select representative previously watched movies, and recommend a ranked list of 10 movies. We evaluate the proposed approach using GPT-3 on MovieLens 100K dataset and show that it achieves strong zero-shot performance, even outperforming some strong sequential recommendation models trained on the entire training dataset. These promising results highlight the ample research opportunities to use LLMs as recommenders. The code can be found at https://github.com/AGI-Edgerunners/LLM-Next-Item-Rec.
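A minimal sketch of the 3-step NIR prompting described above; the prompt wording is a hypothetical paraphrase of the strategy (capture preferences, pick representative movies, output a ranked top-10), not the paper's exact templates.

```python
def nir_prompt(watched, candidates):
    """Compose the 3-step prompt; `candidates` come from an external
    user-/item-filtering module, as the method requires."""
    history = ", ".join(watched)
    cands = ", ".join(candidates)
    return (
        f"The user has watched: {history}.\n"
        f"Candidate movies: {cands}.\n"
        "Step 1: Summarize the user's preferences from the watched movies.\n"
        "Step 2: Select the five watched movies most representative of "
        "those preferences.\n"
        "Step 3: From the candidates, recommend a ranked list of 10 movies "
        "the user is most likely to watch next."
    )

print(nir_prompt(["Star Wars", "Alien", "Blade Runner"],
                 ["Dune", "The Matrix", "Titanic", "Gattaca"]))
```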

【24】 CoT-MAE v2: Contextual Masked Auto-Encoder with Multi-view Modeling for Passage Retrieval

链接: https://arxiv.org/abs/2304.03158

ChatGPT Summary

This paper, "CoT-MAE v2: Contextual Masked Auto-Encoder with Multi-view Modeling for Passage Retrieval," proposes a multi-view pretraining method for text representation that improves passage retrieval. CoT-MAE v2 uses both dense and sparse vectors as multi-view representations to capture sentence semantics from different aspects. Experiments show that CoT-MAE v2 is effective and robust on large-scale passage retrieval benchmarks and out-of-domain zero-shot benchmarks.

(1): What problem or task does this paper try to solve?

- The goal is a multi-view text-representation pretraining method, CoT-MAE v2, that improves performance on passage retrieval.

(2): What are the key innovations or contributions of this work?

- CoT-MAE v2 uses dense and sparse vectors as multi-view representations to capture different aspects of sentence semantics. During multi-view decoding pretraining, it employs both autoencoding and auto-regressive decoders, providing reconstructive and generative signals for better contextual representation pretraining.

(3): What are the strengths and weaknesses of this paper?

- Strength: it proposes a novel text-representation method, CoT-MAE v2, whose experiments show strong and robust performance on passage retrieval and zero-shot benchmarks.

- Weakness: the paper explains the technical method itself but does not further explore its application scenarios; more research and experimental validation are needed.


Abstract: Growing techniques have been emerging to improve the performance of passage retrieval. As an effective representation bottleneck pretraining technique, the contextual masked auto-encoder utilizes contextual embedding to assist in the reconstruction of passages. However, it only uses a single auto-encoding pre-task for dense representation pre-training. This study brings multi-view modeling to the contextual masked auto-encoder. Firstly, multi-view representation utilizes both dense and sparse vectors as multi-view representations, aiming to capture sentence semantics from different aspects. Moreover, multiview decoding paradigm utilizes both autoencoding and auto-regressive decoders in representation bottleneck pre-training, aiming to provide both reconstructive and generative signals for better contextual representation pretraining. We refer to this multi-view pretraining method as CoT-MAE v2. Through extensive experiments, we show that CoT-MAE v2 is effective and robust on large-scale passage retrieval benchmarks and out-of-domain zero-shot benchmarks.

【25】 Bridging the Language Gap: Knowledge Injected Multilingual Question Answering

链接: https://arxiv.org/abs/2304.03159

ChatGPT Summary

Paper title: Bridging the Language Gap: Knowledge Injected Multilingual Question Answering

This paper targets cross-lingual extractive question answering. It proposes a generalized cross-lingual knowledge-injection framework that enriches multilingual knowledge via link-prediction techniques. Experiments on the real-world MLQA dataset show that the method improves performance significantly, outperforming the baseline by 13.18% / 12.00% F1 / EM on average.

1. What problem or task does this paper try to solve?

Answer: It addresses cross-lingual extractive question answering by proposing a generalized cross-lingual knowledge-injection framework.

2. What are the key innovations or contributions of this work?

Answer: The key innovation is the generalized cross-lingual knowledge-injection framework, which uses link-prediction techniques to enrich multilingual knowledge and thereby improve model performance.

3. What are the strengths and weaknesses of this paper?

Answer: The strength is the generalized cross-lingual knowledge-injection framework, validated on the real-world MLQA dataset with significant performance gains. A weakness may be that, although suited to multilingual QA, the method may offer no advantage on some monolingual QA tasks.

Abstract: Question Answering (QA) is the task of automatically answering questions posed by humans in natural languages. There are different settings to answer a question, such as abstractive, extractive, boolean, and multiple-choice QA. As a popular topic in natural language processing tasks, extractive question answering task (extractive QA) has gained extensive attention in the past few years. With the continuous evolvement of the world, generalized cross-lingual transfer (G-XLT), where question and answer context are in different languages, poses some unique challenges over cross-lingual transfer (XLT), where question and answer context are in the same language. With the boost of corresponding development of related benchmarks, many works have been done to improve the performance of various language QA tasks. However, only a few works are dedicated to the G-XLT task. In this work, we propose a generalized cross-lingual transfer framework to enhance the model's ability to understand different languages. Specifically, we first assemble triples from different languages to form multilingual knowledge. Since the lack of knowledge between different languages greatly limits models' reasoning ability, we further design a knowledge injection strategy via leveraging link prediction techniques to enrich the model storage of multilingual knowledge. In this way, we can profoundly exploit rich semantic knowledge. Experiment results on real-world datasets MLQA demonstrate that the proposed method can improve the performance by a large margin, outperforming the baseline method by 13.18%/12.00% F1/EM on average.

【26】 Selective Data Augmentation for Robust Speech Translation

链接: https://arxiv.org/abs/2304.03169

ChatGPT Summary

This paper concerns selective data augmentation for robust speech translation (ST). The authors propose an end-to-end (e2e) English-Hindi (en-hi) ST architecture and use two imperfect machine translation (MT) services to translate Libri-trans English text into Hindi text. While each MT service individually provides MT data for generating parallel ST data, the authors add a noisy MT data-augmentation strategy; the resulting mix yields better ST quality (BLEU score), with an absolute improvement of up to 1.59 BLEU over brute-force augmentation of the MT data.

(1) What is the problem or task of this paper?

The task is to improve the quality of speech translation with a selective data-augmentation method that copes with the variability and uncertainty of the MT data.

(2) What are the main innovations or contributions of this work?

The main contribution is a data-augmentation strategy that selectively incorporates noisy MT data, improving the ST BLEU score. Compared with brute-force augmentation of the MT data, the strategy achieves better results.

(3) What are the strengths and weaknesses of this paper?

The paper proposes a new selective data-augmentation method that improves speech-translation quality. However, since the approach is an end-to-end model, it demands more computation and resources, which may make it unsuitable for low-end devices or users with limited compute.


Abstract: Speech translation (ST) systems translate speech in one language to text in another language. End-to-end ST systems (e2e-ST) have gained popularity over cascade systems because of their enhanced performance due to reduced latency and computational cost. Though resource intensive, e2e-ST systems have the inherent ability to retain para and non-linguistic characteristics of the speech unlike cascade systems. In this paper, we propose to use an e2e architecture for English-Hindi (en-hi) ST. We use two imperfect machine translation (MT) services to translate Libri-trans en text into hi text. While each service gives MT data individually to generate parallel ST data, we propose a data augmentation strategy of noisy MT data to aid robust ST. The main contribution of this paper is the proposal of a data augmentation strategy. We show that this results in better ST (BLEU score) compared to brute force augmentation of MT data. We observed an absolute improvement of 1.59 BLEU score with our approach.

【27】 Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster

链接: https://arxiv.org/abs/2304.03208

ChatGPT Summary

This paper discusses research advances that improve large language models through efficient pre-training and scaling, open datasets, and tools, and introduces Cerebras-GPT, a family of open compute-optimal language models scaled from 111M to 13B parameters. The authors pretrain on the Eleuther Pile dataset following DeepMind's Chinchilla scaling rules for efficient pre-training, and show that Cerebras-GPT achieves state-of-the-art training efficiency. They also describe how Maximal Update Parameterization ($\mu$P) can further improve large-model scaling. The pretrained models and code are released openly, making this the first open and reproducible work comparing compute-optimal model scaling with models trained on fixed dataset sizes. Cerebras-GPT models are available on HuggingFace.

(1) The paper explores how efficient pre-training and scaling can improve large language models, and presents Cerebras-GPT, a family of open compute-optimal language models.

(2) The innovation lies in combining compute-optimal scaling rules with $\mu$P so that language models use additional data and hardware resources more effectively.

(3) The strengths are a new family of open compute-optimal language models with very high training efficiency, released openly and reproducibly, which advances the deep-learning field. A weakness is that results are not reported for some human-curated datasets, so the experimental evaluation is not fully systematic.


Abstract: We study recent research advances that improve large language models through efficient pre-training and scaling, and open datasets and tools. We combine these advances to introduce Cerebras-GPT, a family of open compute-optimal language models scaled from 111M to 13B parameters. We train Cerebras-GPT models on the Eleuther Pile dataset following DeepMind Chinchilla scaling rules for efficient pre-training (highest accuracy for a given compute budget). We characterize the predictable power-law scaling and compare Cerebras-GPT with other publicly-available models to show all Cerebras-GPT models have state-of-the-art training efficiency on both pre-training and downstream objectives. We describe our learnings including how Maximal Update Parameterization ($\mu$P) can further improve large model scaling, improving accuracy and hyperparameter predictability at scale. We release our pre-trained models and code, making this paper the first open and reproducible work comparing compute-optimal model scaling to models trained on fixed dataset sizes. Cerebras-GPT models are available on HuggingFace: https://huggingface.co/cerebras.
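For reference, the Chinchilla rule the models follow fits pre-training loss as a function of parameter count N and token count D; a sketch of the published parametric form, with coefficients quoted from memory of Hoffmann et al. and worth checking against the source:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad \alpha \approx 0.34, \quad \beta \approx 0.28
```

Minimizing this under a fixed compute budget C ≈ 6ND gives compute-optimal allocations N*(C) ∝ C^0.5 and D*(C) ∝ C^0.5, i.e., roughly 20 training tokens per parameter at the optimum.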

【28】 On the Pareto Front of Multilingual Neural Machine Translation

链接: https://arxiv.org/abs/2304.03216

ChatGPT Summary

This paper studies how the generalization performance of a given translation direction in Multilingual Neural Machine Translation (MNMT) changes with its sampling ratio. Training models of various sizes, directions, and total numbers of tasks, the authors find that when the training corpus is imbalanced, scalarization leads to a multitask trade-off front that deviates from the traditional Pareto front: the performance of certain directions does not improve as their weight in the multi-task objective increases, which makes it harder to improve the overall performance of all directions. Based on these observations, the authors propose the Double Power Law to predict the unique performance trade-off front in MNMT, which is robust across languages, data adequacy, and numbers of tasks. Finally, they formulate sample-ratio selection in MNMT as an optimization problem based on the Double Power Law, which achieves better performance than temperature searching and gradient manipulation while using up to half of the total training budget in their experiments.

(1): The paper studies how a direction's generalization performance varies with its sampling ratio in MNMT, addressing the challenge of improving the overall performance of all directions under data imbalance.

(2): The innovation is the Double Power Law for predicting the unique performance trade-off front, robust across languages, data adequacy, and task counts, together with an optimization-based formulation of sample-ratio selection that outperforms temperature searching and gradient manipulation.

(3): Strengths: the novel Double Power Law for the MNMT problem and the resulting method for optimizing sample-ratio selection. Weaknesses: none are discussed in this summary.


Abstract: In this work, we study how the generalization performance of a given direction changes with its sampling ratio in Multilingual Neural Machine Translation (MNMT). By training over 200 multilingual models with various model sizes, directions, and total numbers of tasks, we find that scalarization leads to a multitask trade-off front that deviates from the traditional Pareto front when there exists data imbalance in the training corpus. That is, the performance of certain translation directions does not improve with the increase of its weight in the multi-task optimization objective, which poses greater challenge to improve the overall performance of all directions. Based on our observations, we propose the Double Power Law to predict the unique performance trade-off front in MNMT, which is robust across various languages, data adequacy and number of tasks. Finally, we formulate sample ratio selection in MNMT as an optimization problem based on the Double Power Law, which achieves better performance than temperature searching and gradient manipulation methods using up to half of the total training budget in our experiments.

【29】 FedBot: Enhancing Privacy in Chatbots with Federated Learning

链接: https://arxiv.org/abs/2304.03228

ChatGPT Summary

This paper, "FedBot: Enhancing Privacy in Chatbots with Federated Learning," presents Fedbot, a proof-of-concept privacy-preserving chatbot trained with federated learning on large-scale customer support data. It combines Deep Bidirectional Transformer models with federated learning algorithms to protect customer data privacy during collaborative model training. The paper showcases a new privacy-preserving chatbot solution with the potential to transform the customer support industry by delivering personalized, efficient service that complies with data-privacy regulations and legal requirements. Fedbot is also designed to improve its performance and accuracy over time by learning from previous interactions.

(1): What problem or task does this paper try to solve?

Answer: It addresses the data-privacy problem in chatbot training, using a federated learning algorithm to protect customers' data.

(2): What are the key innovations or contributions of this work?

Answer: By applying federated learning to chatbot training, the paper provides a new privacy-preserving solution for chatbots that enables better customer service.

(3): What are the strengths and weaknesses of this paper?

Answer: The strength is an innovative and feasible privacy-preserving solution, supported by substantial data demonstrating its effectiveness. The weakness is that, as an application of deep learning to the privacy domain, it may be limited by data quality and implementation difficulty.


Abstract: Chatbots are mainly data-driven and usually based on utterances that might be sensitive. However, training deep learning models on shared data can violate user privacy. Such issues have commonly existed in chatbots since their inception. In the literature, there have been many approaches to deal with privacy, such as differential privacy and secure multi-party computation, but most of them need to have access to users' data. In this context, Federated Learning (FL) aims to protect data privacy through distributed learning methods that keep the data in its location. This paper presents Fedbot, a proof-of-concept (POC) privacy-preserving chatbot that leverages large-scale customer support data. The POC combines Deep Bidirectional Transformer models and federated learning algorithms to protect customer data privacy during collaborative model training. The results of the proof-of-concept showcase the potential for privacy-preserving chatbots to transform the customer support industry by delivering personalized and efficient customer service that meets data privacy regulations and legal requirements. Furthermore, the system is specifically designed to improve its performance and accuracy over time by leveraging its ability to learn from previous interactions.
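A minimal sketch of the federated-averaging step such a system might rely on (FedAvg is a standard FL aggregation rule, named here as an assumption since the summary does not specify which algorithm Fedbot uses):

```python
def fedavg(client_states, client_sizes):
    """Average client model weights, weighted by local dataset size.

    Raw conversations never leave the clients; only weights are shared.
    Sketch assumes floating-point parameters only (integer buffers such as
    BatchNorm counters would need special handling).
    """
    total = sum(client_sizes)
    return {
        key: sum(state[key] * (n / total)
                 for state, n in zip(client_states, client_sizes))
        for key in client_states[0]
    }

# One communication round: each client fine-tunes locally, then the server
# aggregates and redistributes the averaged weights, e.g.:
# global_model.load_state_dict(fedavg(states, sizes))
```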

【30】 Large language models effectively leverage document-level context for literary translation, but critical errors persist

链接: https://arxiv.org/abs/2304.03245

ChatGPT Summary

This paper studies the effective use of large language models for literary translation, where critical errors nevertheless persist. Through a rigorous human evaluation across 18 language pairs, the authors find that asking the Gpt-3.5 LLM to translate an entire literary paragraph at once yields higher-quality translations than standard sentence-by-sentence translation. The evaluation relied on hired translators providing span-level error annotations and preference judgments, and shows that, apart from occasional content omissions, discourse-level LLM translation commits fewer mistranslations, grammar errors, and stylistic inconsistencies than sentence-level approaches. The authors publicly release their dataset and annotations to spur future research on literary translation.

(1) What is the problem or task of this paper?

It investigates how well large language models perform literary translation, comparing paragraph-level with sentence-level translation.

(2) What are the key innovations or contributions of this work?

For literary translation, the authors show that discourse-level LLM translation produces higher-quality results, and they release a new dataset and annotations as resources for future research.

(3) What are the strengths and weaknesses of this paper?

Strengths: compared with sentence-by-sentence translation, paragraph-level LLM translation performs better on literary texts, avoiding some grammar errors and stylistic inconsistencies.

Weaknesses: critical errors such as content omissions persist and require human intervention, and the evaluation demands extensive annotation and analysis, which is costly.

Abstract: Large language models (LLMs) are competitive with the state of the art on a wide range of sentence-level translation datasets. However, their ability to translate paragraphs and documents remains unexplored because evaluation in these settings is costly and difficult. We show through a rigorous human evaluation that asking the Gpt-3.5 (text-davinci-003) LLM to translate an entire literary paragraph (e.g., from a novel) at once results in higher-quality translations than standard sentence-by-sentence translation across 18 linguistically-diverse language pairs (e.g., translating into and out of Japanese, Polish, and English). Our evaluation, which took approximately 350 hours of effort for annotation and analysis, is conducted by hiring translators fluent in both the source and target language and asking them to provide both span-level error annotations as well as preference judgments of which system's translations are better. We observe that discourse-level LLM translators commit fewer mistranslations, grammar errors, and stylistic inconsistencies than sentence-level approaches. With that said, critical errors still abound, including occasional content omissions, and a human translator's intervention remains necessary to ensure that the author's voice remains intact. We publicly release our dataset and error annotations to spur future research on evaluation of document-level literary translation.

【31】 Instruction Tuning with GPT-4

链接: https://arxiv.org/abs/2304.03277

ChatGPT Summary

This paper, "Instruction Tuning with GPT-4," builds on prior work showing that fine-tuning large language models (LLMs) on machine-generated instruction-following data gives them remarkable zero-shot abilities on new tasks, with no human-written instructions needed. The paper presents the first attempt to use GPT-4 to generate instruction-following data for LLM fine-tuning. Early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following examples generated by GPT-4 lead to better zero-shot performance on new tasks than data generated by previous state-of-the-art models. The authors also collect feedback and comparison data from GPT-4 for comprehensive evaluation and reward-model training, and they release the GPT-4-generated data along with their codebase.

(1) The paper aims to achieve zero-shot capability on new tasks in LLM fine-tuning by using GPT-4 to generate instruction-following data.

(2) The innovation is the first use of GPT-4 to generate instruction-following data for LLM fine-tuning, achieving better zero-shot performance than data generated by previous state-of-the-art models.

(3) The strengths are that the GPT-4-generated data yields better model performance, and that the collected feedback and comparison data enable comprehensive evaluation and reward-model training. The weaknesses are that generating data with GPT-4 may have inherent limitations, and the experiments do not cover instruction-data generation for languages beyond English and Chinese.


Abstract: Prior work has shown that finetuning large language models (LLMs) using machine-generated instruction-following data enables such models to achieve remarkable zero-shot capabilities on new tasks, and no human-written instructions are needed. In this paper, we present the first attempt to use GPT-4 to generate instruction-following data for LLM finetuning. Our early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following data generated by GPT-4 leads to superior zero-shot performance on new tasks to the instruction-following data generated by previous state-of-the-art models. We also collect feedback and comparison data from GPT-4 to enable a comprehensive evaluation and reward model training. We make our data generated using GPT-4 as well as our codebase publicly available.

【32】 Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

链接: https://arxiv.org/abs/2304.03279

ChatGPT Summary

This paper studies how artificial agents trade off rewards against ethical behavior, and introduces the MACHIAVELLI benchmark, a suite of social decision-making scenarios for measuring agent behavior. The authors observe some tension between maximizing reward and behaving ethically, and to improve this trade-off they investigate LM-based methods for steering agents toward less harmful behaviors. Experiments show that agents can act both competently and morally, opening a path for concrete progress in machine ethics.

--

(1): What problem or task does this paper try to solve?

It studies how AI agents trade off behavior and reward, and proposes the MACHIAVELLI benchmark as a standard for evaluating agent behavior.

(2): What are the key innovations or contributions of this work?

The key innovation is the new MACHIAVELLI benchmark for testing agent behavior, together with LM-based methods that steer agents toward less harmful behaviors.

(3): What are the strengths and weaknesses of this paper?

The strengths are the introduction of a new standard for studying agent behavior across rich scenarios and experimental evidence for the approach's feasibility. The weakness is that the measures for improving agent behavior are proposed mainly in principle, without concrete implementation details.

Abstract: Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents' towards less harmful behaviors. Our results show that agents can both act competently and morally, so concrete progress can currently be made in machine ethics--designing agents that are Pareto improvements in both safety and capabilities.

