Sapienza NLP @ ACL 2024
8 papers at ACL!
NounAtlas: Filling the Gap in Nominal Semantic Role Labeling
by R. Navigli, M. Lo Pinto, P. Silvestri, D. Rotondi, S. Ciciliano, and A. Scirè
Despite the significant advances made in Semantic Role Labeling (SRL), much work in this field has been carried out with a focus on verbal predicates, with research on nominal SRL lagging behind. In many contexts, however, nominal predicates are often as informative as verbal ones, thus calling for proper treatment. In this paper we aim to fill this gap and make nominal SRL a first-class citizen. We introduce a novel approach to create the first large-scale, high-quality inventory of nominal predicates and organize them into semantically coherent frames. Although it is automatically created, NounAtlas – our nominal frame inventory – is subsequently fully validated. We then put forward a technique for generating silver training data for nominal SRL and show that a state-of-the-art SRL model can achieve good performance. Interestingly, thanks to our design choices, which enable seamless integration of our predicate inventory with its verbal counterpart, i.e., VerbAtlas, we can mix verbal and nominal data and perform robust SRL on both types of predicate.
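To make the integration with VerbAtlas concrete, here is a minimal sketch of how a shared frame inventory lets verbal and nominal predicates evoke the same frame; the frame names follow VerbAtlas conventions, but the specific entries and the lookup function are illustrative, not taken from NounAtlas.

```python
# Illustrative sketch: a shared frame inventory maps verbal and nominal
# predicates to the same semantically coherent frame, so one SRL model
# can be trained on a mix of both. Entries are hypothetical examples.
FRAME_INVENTORY = {
    "destroy.v": "DESTROY",      # verbal predicate (VerbAtlas side)
    "destruction.n": "DESTROY",  # nominal predicate (NounAtlas side)
    "arrive.v": "ARRIVE",
    "arrival.n": "ARRIVE",
}

def frame_of(predicate: str) -> str:
    """Look up the frame of a predicate, regardless of its part of speech."""
    return FRAME_INVENTORY.get(predicate, "UNKNOWN")

# Both surface forms evoke the same frame, which is what makes mixing
# verbal and nominal training data possible.
assert frame_of("destroy.v") == frame_of("destruction.n") == "DESTROY"
```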
Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In!
by S. Perrella, L. Proietti, A. Scirè, E. Barba, R. Navigli
Annually, the organizers of the Metrics Shared Task at the Conference on Machine Translation (WMT) conduct the meta-evaluation of Machine Translation (MT) metrics, ranking them according to their correlation with human judgments. Their results guide researchers toward enhancing the next generation of metrics and MT systems. With the recent introduction of neural metrics, the field has seen notable advancements. Nevertheless, the inherent opacity of these metrics has posed substantial challenges to the meta-evaluation process. This work highlights two issues with the meta-evaluation framework currently employed in WMT and assesses their impact on the metric rankings. We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the accuracy, robustness, and fairness of the meta-evaluation process. By employing sentinel metrics, we aim to validate our findings and to shed light on, and monitor, potential biases or inconsistencies in the rankings. We discover that the present meta-evaluation framework favors two categories of metrics: i) those explicitly trained to mimic human quality assessments, and ii) continuous metrics. Ultimately, we raise concerns regarding the evaluation capabilities of state-of-the-art metrics, highlighting that they might be basing their assessments on spurious correlations found in their training data.
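For intuition, the simplest possible sentinels are scorers that, by construction, carry no information about translation quality: if the meta-evaluation ranks them anywhere but at the bottom, the ranking procedure itself is suspect. The paper's actual sentinel metrics are trained models; the degenerate variants below are our own illustration.

```python
# Illustration only: degenerate sentinel metrics with zero discriminative
# power. A sound meta-evaluation must rank these at the very bottom.
import random

def constant_sentinel(source: str, candidate: str, reference: str) -> float:
    """Gives every translation the same score, ignoring all inputs."""
    return 0.5

def random_sentinel(source: str, candidate: str, reference: str) -> float:
    """Gives every translation a random score, ignoring all inputs."""
    return random.random()
```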
Maverick: Efficient and Accurate Coreference Resolution Defying Recent Trends
by G. Martinelli, E. Barba, R. Navigli
Large autoregressive generative models have emerged as the cornerstone for achieving the highest performance across several Natural Language Processing tasks. However, the urge to attain superior results has, at times, led to the premature replacement of carefully designed task-specific approaches without exhaustive experimentation. The Coreference Resolution task is no exception; all recent state-of-the-art solutions adopt large generative autoregressive models that outperform encoder-based discriminative systems. In this work, we challenge this recent trend by introducing Maverick, a carefully designed, yet simple, pipeline that enables running a state-of-the-art Coreference Resolution system within the constraints of an academic budget, outperforming models with up to 13 billion parameters while using as few as 500 million parameters. Maverick achieves state-of-the-art performance on the CoNLL-2012 benchmark, training with up to 0.006x the memory resources and achieving 170x faster inference compared to previous state-of-the-art systems. We extensively validate the robustness of the Maverick framework with an array of diverse experiments, reporting improvements over prior systems in data-scarce, long-document, and out-of-domain settings. We release our code and models for research purposes.
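Since the code and models are released, using Maverick should reduce to a few lines; the sketch below assumes the interface of the public maverick-coref package (package, class, and method names are our best guess and may differ from the released API).

```python
# Hypothetical usage sketch -- assumes `pip install maverick-coref` exposes
# this interface; consult the released repository for the actual API.
from maverick import Maverick

model = Maverick()  # loads a pretrained encoder-based checkpoint (~500M params)

text = "Barack Obama met Angela Merkel. He greeted her warmly."
clusters = model.predict(text)
# Expected shape of the output (illustrative): clusters of coreferent
# mentions, e.g. [["Barack Obama", "He"], ["Angela Merkel", "her"]]
print(clusters)
```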
FENICE: Factuality Evaluation of Summarization based on Natural Language Inference and Claim Extraction
by A. Scirè, K. Ghonim, R. Navigli
Recent advancements in text summarization, particularly with the advent of Large Language Models (LLMs), have shown remarkable performance. However, a notable challenge persists, as a substantial number of automatically generated summaries exhibit factual inconsistencies, such as hallucinations. In response to this issue, various approaches for evaluating the factual consistency of summaries have emerged. Yet, these newly introduced metrics face several limitations, including a lack of interpretability, a focus on short-document summaries (e.g., news articles), and computational impracticality, especially for LLM-based metrics. To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural Language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric achieves state-of-the-art performance on AGGREFACT, the de facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process for long-form summarization.
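The core scoring step can be approximated with an off-the-shelf NLI model: each extracted claim is scored by the maximum entailment probability over source passages, and the summary score is the average over claims. This is a simplified sketch of the idea using the roberta-large-mnli checkpoint, not the official FENICE implementation (which also handles claim extraction and finer-grained alignment).

```python
# Simplified sketch of NLI-based claim verification, not the official
# FENICE code: score each claim by its maximum entailment probability
# against any source passage, then average over claims.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def claim_score(claim: str, passages: list[str]) -> float:
    """Max entailment probability of the claim given any source passage."""
    best = 0.0
    for passage in passages:
        preds = nli({"text": passage, "text_pair": claim}, top_k=None)
        entailment = next(p["score"] for p in preds if p["label"] == "ENTAILMENT")
        best = max(best, entailment)
    return best

def factuality_score(claims: list[str], passages: list[str]) -> float:
    """Average claim-level entailment: higher means more consistent."""
    return sum(claim_score(c, passages) for c in claims) / len(claims)
```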
Word Sense Linking: Disambiguating Outside the Sandbox
by S. Bejgu, E. Barba, L. Procopio, A. Fernàndez-Castro, R. Navigli
Word Sense Disambiguation (WSD) is the task of associating a word in a given context with its most suitable meaning among a set of possible candidates. While the task has recently witnessed renewed interest, with systems achieving performance above the estimated inter-annotator agreement, at the time of writing it still struggles to find downstream applications. We argue that one of the reasons behind this is the difficulty of applying WSD to plain text. Indeed, in the standard formulation, models work under the assumptions that a) all the spans to disambiguate have already been identified, and b) all the possible candidate senses of each span are provided, both of which are requirements that are far from trivial. In this work, we present a new task called Word Sense Linking (WSL) where, given an input text and a reference sense inventory, systems have to both identify which spans to disambiguate and then link them to their most suitable meaning. We put forward a transformer-based architecture for the task and thoroughly evaluate both its performance and that of state-of-the-art WSD systems scaled to WSL, iteratively relaxing the assumptions of WSD. We hope that our work will foster easier integration of lexical semantics into downstream applications.
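In terms of interface, the difference from WSD is that a WSL system receives only raw text and a sense inventory, and must decide by itself what to disambiguate. The sketch below illustrates this input/output contract; the type and function names are ours, not the paper's released code.

```python
# Hypothetical interface sketch for Word Sense Linking (WSL): the system
# gets raw text plus a reference sense inventory, and must both find the
# spans to disambiguate and link them to senses. Names are illustrative.
from dataclasses import dataclass

@dataclass
class LinkedSpan:
    start: int  # character offset where the span begins
    end: int    # character offset where the span ends
    text: str   # the surface form of the span
    sense: str  # sense identifier from the reference inventory

def word_sense_linking(text: str, inventory: dict[str, list[str]]) -> list[LinkedSpan]:
    """1) identify which spans to disambiguate; 2) link each span to its
    most suitable sense in the inventory."""
    ...

# WSD, in contrast, assumes the spans and their candidate senses are
# given in advance -- assumptions that rarely hold on plain text.
```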
ReLiK: Retrieve and LinK, Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget
by R. Orlando, P. Huguet-Cabot, E. Barba, R. Navigli
Entity Linking (EL) and Relation Extraction (RE) are fundamental tasks in Natural Language Processing, serving as critical components in a wide range of applications. In this paper, we propose ReLiK, a Retriever-Reader architecture for both EL and RE, where, given an input text, the Retriever module identifies candidate entities or relations that could potentially appear within the text. Subsequently, the Reader module is tasked with discerning the pertinent retrieved entities or relations and establishing their alignment with the corresponding textual spans. Notably, we put forward an innovative input representation that incorporates the candidate entities or relations alongside the text, making it possible to link entities or extract relations in a single forward pass and to fully leverage pre-trained language models' contextualization capabilities, in contrast with previous Retriever-Reader-based methods, which require a forward pass for each candidate. Our formulation of EL and RE achieves state-of-the-art performance on both in-domain and out-of-domain benchmarks while training on an academic budget and with up to 40x faster inference than competitors. Finally, we show how our architecture can be used seamlessly for closed Information Extraction (cIE), i.e., EL + RE, setting a new state of the art with a shared Reader that simultaneously extracts entities and relations.
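The design choice that enables the single forward pass is the input representation: all retrieved candidates are appended to the text, so the Reader contextualizes text and candidates jointly. A simplified sketch of the idea follows; the separator tokens and formatting are illustrative, not ReLiK's exact format.

```python
# Simplified sketch of a Reader input that packs all retrieved candidates
# next to the text, enabling one forward pass over everything. The special
# tokens below are illustrative, not ReLiK's actual format.
def build_reader_input(text: str, candidates: list[str]) -> str:
    candidate_block = " ".join(f"[C] {c}" for c in candidates)
    return f"{text} [SEP] {candidate_block}"

text = "Michael Jordan played for the Chicago Bulls."
candidates = [
    "Michael Jordan (basketball player)",
    "Michael I. Jordan (computer scientist)",
    "Chicago Bulls",
]
print(build_reader_input(text, candidates))
# Previous Retriever-Reader methods would instead run one forward pass
# per candidate; here a single pass scores all of them jointly.
```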
Dissecting Biases in Relation Extraction: A Cross-Dataset Analysis on People’s Gender and Origin
by M. Stranisci, P. Huguet Cabot, E. Bassignana, R. Navigli
Relation Extraction (RE) is at the core of many Natural Language Understanding tasks, including knowledge-base population and Question Answering. However, any Natural Language Processing system is exposed to biases, and the analysis of these has not received much attention in RE. We propose a new method for inspecting bias in the RE pipeline, which is completely transparent in terms of interpretability. Specifically, in this work we analyze biases related to gender and place of birth. Our methodology includes (i) obtaining semantic triplets (subject, object, semantic relation) involving ‘person’ entities from RE resources, (ii) collecting meta-information (‘gender’ and ‘place of birth’) using Entity Linking technologies, and (iii) analyzing the distribution of triplets across different groups (e.g., men versus women). We investigate bias at two levels: in the training data of three commonly used RE datasets (SREDFM, CrossRE, NYT), and in the predictions of a state-of-the-art RE approach (ReLiK). To enable cross-dataset analysis, we introduce a taxonomy of relation types mapping the label sets of different RE datasets to a unified label space. Our findings reveal that bias is a compounded issue affecting underrepresented groups within both the data and the predictions of RE systems.
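To make step (iii) concrete: once triplets are enriched with meta-information, the analysis reduces to comparing relation distributions across groups. A minimal sketch with made-up data follows; the variable names and example triplets are ours.

```python
# Minimal sketch of step (iii): compare how semantic relations are
# distributed across groups. The triplets below are made up.
from collections import Counter

# (subject, relation, object, subject_gender)
triplets = [
    ("A", "spouse", "B", "female"),
    ("C", "employer", "D", "male"),
    ("E", "spouse", "F", "female"),
    ("G", "award received", "H", "male"),
]

def relation_distribution(triplets, group):
    """Relative frequency of each relation within one group."""
    counts = Counter(rel for _, rel, _, g in triplets if g == group)
    total = sum(counts.values())
    return {rel: n / total for rel, n in counts.items()}

print(relation_distribution(triplets, "female"))  # {'spouse': 1.0}
print(relation_distribution(triplets, "male"))    # {'employer': 0.5, ...}
# A relation like 'spouse' being over-represented for one group is the
# kind of skew the cross-dataset analysis surfaces.
```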
Mitigating Data Scarcity in Semantic Parsing across Languages with the Multilingual Semantic Layer and its Dataset
by A. C. Martinez Lorenzo, P. Huguet-Cabot, K. Ghonim, L. Xu, H. Choi, A. Fernández-Castro, R. Navigli
Data scarcity is a prevalent challenge in the era of Large Language Models (LLMs). The insatiable hunger of LLMs for large corpora becomes even more pronounced when dealing with non-English and low-resource languages. The issue is particularly exacerbated in Semantic Parsing (SP), i.e., the task of converting text into a formal representation. The complexity of semantic formalisms makes training human annotators and subsequent data annotation unfeasible on a large scale, especially across languages. To mitigate this, we first introduce the Multilingual Semantic Layer (MSL), a conceptual evolution of previous formalisms that decouples from disambiguation and external inventories and simplifies the task. MSL provides the necessary tools to encode meaning across languages, paving the way for developing a high-quality semantic parsing dataset across different languages through a semi-automatic strategy. Subsequently, we manually refine a portion of this dataset and fine-tune GPT-3.5 to propagate these refinements across the dataset. Then, we manually annotate 1,100 sentences in eleven languages, including low-resource ones. Finally, we assess our dataset's quality, showcasing a reduced performance gap across languages in Semantic Parsing.