We strongly believe that scientific advances can be made by sharing common resources and standardized datasets with the research community. On this page you can find many of the resources that were developed as part of our research work at LCL.

BabelNet


A very large multilingual semantic network with millions of concepts, obtained by integrating WordNet and Wikipedia and by gathering translations from Wikipedia's cross-language links and a state-of-the-art machine translation system.
________
Website

EuroSense


EuroSense is a multilingual sense-annotated resource, automatically built via the joint disambiguation of the Europarl parallel corpus, with almost 123 million sense annotations for over 155 thousand distinct concepts and entities in 21 languages, drawn from the language-independent unified sense inventory of BabelNet. EuroSense's disambiguation pipeline couples a graph-based multilingual disambiguation system, Babelfy, with a language-independent vector representation of concepts and entities, Nasari, without relying on word alignments against a pivot language.
________
Website

SEW


SEW (Semantically Enriched Wikipedia) is a sense-annotated corpus, automatically built from Wikipedia, in which the overall number of linked mentions has been more than tripled solely by exploiting the hyperlink structure of Wikipedia pages and categories, along with the wide-coverage sense inventory of BabelNet. As a result, SEW constitutes both a large-scale Wikipedia-based semantic network and a sense-tagged dataset with more than 200 million annotations of over 4 million different concepts and named entities.
________
Website

BabelDomains


BabelDomains is a unified resource that provides domain information for lexical items included in different lexical resources (BabelNet, Wikipedia and WordNet). Each synset is associated with a pre-defined domain of knowledge drawn from the Wikipedia featured articles page. The latest version of BabelDomains has been integrated into BabelNet, both in the online interface and in the API.
________
Website

Word Sense Disambiguation: a Unified Evaluation Framework


We have gathered and unified five standard all-words Word Sense Disambiguation datasets. We hope this unified framework will ease the evaluation of WSD models and enable a fair comparison among all systems. The framework is currently available for English, and all sense annotations belong to the WordNet 3.0 sense inventory.
________
Website

SemEval-2017 task #02: Multilingual and Cross-lingual Semantic Word Similarity


Datasets and resources for the SemEval-2017 task #02 on multilingual and cross-lingual semantic word similarity.
________
Website

Outlier Detection


Given a group of words, the goal of the outlier detection task is to identify the word that does not belong in the group. For example, book would be an outlier for the set of words apple, banana, lemon, book, orange, as it is not a fruit like the others. The task is intended to test interesting properties of word embeddings that common intrinsic evaluation benchmarks, such as word similarity, have not fully addressed to date. Although the task is well-defined and humans achieve near-perfect performance, it remains challenging for state-of-the-art word embeddings; in fact, the evaluation clearly highlights some of the shortcomings of current word embeddings.
________
Website
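The intuition behind the task can be sketched in a few lines: the outlier is the word whose average similarity to the rest of the group is lowest. The toy vectors and helper names below are purely illustrative assumptions; the official evaluation uses pretrained word embeddings and the compactness score defined in the accompanying paper.

```python
from math import sqrt

# Toy 2-d vectors (illustrative only; real experiments use pretrained embeddings).
vectors = {
    "apple":  [0.9, 0.1],
    "banana": [0.8, 0.2],
    "lemon":  [0.85, 0.15],
    "orange": [0.9, 0.2],
    "book":   [0.1, 0.9],
}

def cos(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def outlier(words):
    # The word with the lowest average similarity to the rest of the group.
    def avg_sim(w):
        others = [x for x in words if x != w]
        return sum(cos(vectors[w], vectors[x]) for x in others) / len(others)
    return min(words, key=avg_sim)

print(outlier(["apple", "banana", "lemon", "book", "orange"]))  # book
```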

MultiWiBi (Multilingual Wikipedia Bitaxonomy)


MultiWiBi (Multilingual Wikipedia Bitaxonomy) is a unified, three-phase approach to the automatic creation of a bitaxonomy for Wikipedia, that is, an integrated taxonomy of Wikipedia pages and categories, in arbitrary languages. We leverage the information available in either one of the two taxonomies to reinforce the creation of the other.
________
Website

Monolingual and Cross-lingual Word Similarity Datasets


We release two monolingual word similarity datasets for Spanish and Farsi (Persian), obtained from the standard RG-65 dataset (Rubenstein and Goodenough, 1965). The guidelines for the creation of these datasets (in Spanish and Farsi) are also provided. Additionally, we provide fifteen cross-lingual datasets covering English (EN), French (FR), German (DE), Spanish (ES), Portuguese (PT) and Farsi (FA).
________
Website
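Systems are typically scored on such datasets with the Spearman rank correlation between model similarities and the gold human judgements. A minimal self-contained sketch follows; the word pairs, gold scores and model scores are invented for illustration (RG-65-style judgements range from 0 to 4), and the helper functions are not part of the released resource.

```python
# Hypothetical gold data: (word1, word2, human similarity score).
gold  = [("cord", "smile", 0.02), ("noon", "string", 0.04),
         ("car", "automobile", 3.92), ("gem", "jewel", 3.84)]
model = [0.10, 0.05, 0.95, 0.90]   # hypothetical model similarities, same order

def ranks(xs):
    # Rank positions of each value (0 = smallest); assumes no ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(a, b):
    # Spearman rho = Pearson correlation computed on the ranks.
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = (sum((x - ma) ** 2 for x in ra) * sum((y - mb) ** 2 for y in rb)) ** 0.5
    return num / den

rho = spearman([g[2] for g in gold], model)
print(round(rho, 2))  # 0.8
```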

A Large-Scale Multilingual Disambiguation of Glosses


A multilingual large-scale corpus of disambiguated glosses drawn from different resources integrated in BabelNet, such as Wikipedia, Wiktionary, WordNet, OmegaWiki and Wikidata. Almost 250 million sense annotations are provided for both concepts and named entities. In total, over 35 million definitions have been disambiguated for 256 languages. All definitions have been automatically disambiguated by fully exploiting their cross-language and cross-resource complementarities, using Babelfy, a state-of-the-art multilingual Word Sense Disambiguation and Entity Linking system, and the semantic vector representations of NASARI.
________
Website

Babelfied Wikipedia


Babelfied Wikipedia is an automatic annotation of the Wikipedia dump, with both word senses (i.e. concepts) and named entities using Babelfy.
________
Website

SemEval-2015 task #13: Multilingual All-Words Sense Disambiguation and Entity Linking


Datasets and resources for the SemEval-2015 task #13 on multilingual all-words Sense Disambiguation and Entity Linking.
________
Website

Similarity-based Pseudowords


Similarity-based Pseudowords is a dataset of pseudowords that model ambiguous nouns in WordNet 3.0. Each ambiguous noun is modeled by a pseudoword built by selecting, for each of its senses, the most suitable monosemous representative of that sense.
________
Website
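The construction can be illustrated with a toy example: pick one monosemous representative per sense, join them into a pseudoword token, and replace the representatives' occurrences in a corpus while keeping the original sense as the gold label. The sense labels, representative words and sentences below are invented for illustration and are not taken from the released dataset.

```python
# Map each monosemous representative to the sense of the ambiguous noun it stands for
# (hypothetical example for the two senses of "crane": bird vs. lifting machine).
senses = {"heron": "crane.n.bird", "hoist": "crane.n.machine"}
pseudoword = "_".join(sorted(senses))          # "heron_hoist"

corpus = ["the heron waded in the marsh", "the hoist lifted the beam"]
labeled = []
for sent in corpus:
    for rep, sense in senses.items():
        if rep in sent.split():
            # Replace the representative with the pseudoword; the original
            # sense becomes the gold answer a WSD system must recover.
            labeled.append((sent.replace(rep, pseudoword), sense))

print(labeled[0])  # ('the heron_hoist waded in the marsh', 'crane.n.bird')
```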

WiSeNet


WiSeNet is a Wikipedia-based semantic network. We extract semantic relations from Wikipedia sentences and ontologize them by, first, creating sets of synonymous relational phrases, called relation synsets; second, assigning semantic classes to the arguments of these relation synsets; and, third, disambiguating the initial relation instances with relation synsets.
________
Website

SemEval-2013 task #12: Multilingual Word Sense Disambiguation


Datasets and resources for the SemEval-2013 task #12 on multilingual Word Sense Disambiguation.
________
Website

SPred


SPred is an approach for large-scale harvesting of Semantic Predicates.
________
Website

MORESQUE


A dataset for the evaluation of subtopic information retrieval. The dataset contains 114 topics (i.e., queries): each topic is further categorized into subtopics and contains 100 top-ranking documents.
________
Website

TaxoLearn


A graph-based approach aimed at learning a lexical taxonomy automatically starting from a domain corpus and the Web.
________
Website

WordNet++


An extension of WordNet comprising millions of new semantic pointers between WordNet synsets, harvested from co-occurring Wikipedia links via BabelNet's mapping from Wikipedia pages to WordNet synsets.
________
Website

Word-Class Lattices (WCL)


A generalization of word lattices to model textual definitions. Our classifiers, based on two variants of WCLs, identify definitions and extract hypernyms with high accuracy.
________
Website

WikiTax2WordNet


A dataset of mappings from Wikipedia categories to WordNet synsets that were automatically generated from WikiTaxonomy.
________
Website

SemEval-2007 task #7: Coarse-grained English all-words


Datasets and resources for the SemEval-2007 task #7 on coarse-grained all-words WSD for English.
________
Website

SemEval-2007 task #10: English Lexical Substitution


Datasets and resources for the SemEval-2007 task #10 on English lexical substitution.
________
Website