NASARI provides semantic vector representations for BabelNet synsets* and Wikipedia pages in several languages. Three vector types are currently available: lexical, unified and embedded. NASARI offers broad coverage of concepts and named entities and has proved useful in many Natural Language Processing tasks, such as multilingual semantic similarity, sense clustering and word sense disambiguation, where it has contributed to state-of-the-art results on standard benchmarks.

*Please note that BabelNet covers WordNet and Wikipedia, among other resources, so our vectors can be used to represent concepts and named entities in each of these resources.

Downloads

The vectors are currently available for English, Spanish, French, German and Italian. Please find more information in the README file. Stay tuned for the release of NASARI representations in other languages! The NASARI-embed vectors below share the same space as the pre-trained Word2Vec vectors for English (trained on the Google News corpus) or, for Spanish, as the Word2Vec word embeddings trained on the Spanish Billion Words Corpus (more information in the main reference paper). You can download the Spanish word embeddings here.

New (July 2017): You can now also download the 300-dimensional NASARI-embed concept and entity vectors together with the Word2Vec word embeddings trained on the UMBC corpus, both in the same vector space. These vectors tend to perform better than the NASARI-embed vectors trained on Google News below. Download both NASARI-embed and the UMBC word embeddings in a single compressed bin file here: [7GB] (note that in this version all word embeddings are lowercased).
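Because the word embeddings and the NASARI-embed concept vectors live in the same vector space, a word vector can be compared directly with concept vectors. The following is only a minimal sketch of that idea using cosine similarity. The three-dimensional vectors and the keys (`bank`, `toy_concept_finance`, `toy_concept_river`) are made up for illustration and are not taken from the released files, whose vectors have 300 dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_concepts(word_vec, concept_vecs, k=3):
    """Return up to k (concept_key, similarity) pairs, most similar first."""
    scored = [(cosine(word_vec, vec), key) for key, vec in concept_vecs.items()]
    scored.sort(reverse=True)
    return [(key, score) for score, key in scored[:k]]


# Toy data (assumption): hypothetical keys and 3-dimensional vectors.
word_vectors = {"bank": [0.9, 0.1, 0.3]}
concept_vectors = {
    "toy_concept_finance": [0.8, 0.0, 0.4],
    "toy_concept_river": [0.1, 0.9, 0.2],
}

ranking = nearest_concepts(word_vectors["bank"], concept_vectors)
for key, score in ranking:
    print(key, round(score, 3))
```

With real data, `word_vectors` and `concept_vectors` would be filled from the downloaded files instead of the toy dictionaries above.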

Note: the first three rows of the table below contain the NASARI vector representations for all English Wikipedia pages (Wikipedia dump of November 2014). In the remaining files, each vector is tagged with its corresponding BabelNet synset and Wikipedia page.

Language  Type            # of BabelNet synsets  # of Wikipedia pages  Download size
English   Lexical (Wiki)  -                      4.40M                 4.7GB
English   Embed (Wiki)    -                      4.40M                 5.9GB
English   Unified (Wiki)  -                      2.85M                 341MB
English   Lexical         4.42M                  4.40M                 4.7GB
English   Embed           4.42M                  4.40M                 5.9GB
English   Unified         2.87M                  2.85M                 352MB
Spanish   Lexical         1.07M                  1.05M                 705MB
Spanish   Unified         657K                   635K                  60MB
Spanish   Embed           1.07M                  1.05M                 1.4GB
French    Lexical         1.48M                  1.45M                 1.1GB
French    Unified         882K                   861K                  96MB
German    Lexical         1.51M                  1.49M                 1.4GB
German    Unified         857K                   836K                  59MB
Italian   Lexical         1.10M                  1.08M                 843MB
Italian   Unified         631K                   610K                  69MB

The NASARI-embed vector representations can also be downloaded in binary format: [bin:4.8GB] (compatible with Word2Vec). The English NASARI lexical vectors can also be downloaded as a tar.bz2 archive: [tar.bz2:3.6GB].
Please note that you can use the BabelNet API to get the most from these vectors, e.g., to access the corresponding WordNet synsets or lexicalizations.
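For readers unfamiliar with the Word2Vec-compatible binary format mentioned above, here is a minimal reader sketch using only the Python standard library. It assumes the usual Word2Vec binary layout: a text header line with the vocabulary size and the dimensionality, then for each entry a key terminated by a space followed by that many little-endian 32-bit floats. The function name `read_word2vec_binary` and the demo keys are hypothetical, and the demo reads an in-memory buffer rather than a real download. If gensim is installed, `KeyedVectors.load_word2vec_format(path, binary=True)` reads the same layout and is likely the more practical choice for multi-gigabyte files.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Read Word2Vec-style binary vectors from a binary stream into a dict.

    Assumed layout: b"<vocab_size> <dim>\n", then per entry the key bytes,
    a single space, and dim little-endian float32 values. A newline after
    each vector is optional and is skipped if present.
    """
    header = stream.readline().decode("utf-8").split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        key_bytes = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of data while reading a key")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline left over from the previous entry
                key_bytes.extend(ch)
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of data while reading a vector")
        vectors[key_bytes.decode("utf-8", errors="replace")] = struct.unpack(
            "<%df" % dim, raw
        )
    return vectors


# Demo on a tiny in-memory file (assumption: hypothetical keys, 3 dimensions).
buffer = io.BytesIO()
buffer.write(b"2 3\n")
buffer.write(b"bank " + struct.pack("<3f", 0.5, -1.0, 0.25) + b"\n")
buffer.write(b"toy_concept_key " + struct.pack("<3f", 1.0, 0.0, 2.0) + b"\n")
buffer.seek(0)

loaded = read_word2vec_binary(buffer)
print(sorted(loaded))
```

To read a downloaded file instead, open it with `open(path, "rb")` and pass the file object to the function. Loading everything into a dictionary of tuples is simple but memory-hungry, so for the full files an array-backed loader is preferable.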

Release history

Current version: 3.0

Release version  Date          Reference paper  Features
1.0              April 2015    NAACL 2015       English lexical and unified vectors for WordNet synsets and Wikipedia pages.
2.0              August 2015   ACL 2015         Multilingual extension through BabelNet. Available in English, Spanish, French, German and Italian.
2.1              October 2015  -                Minor bug fixing and updated format.
3.0              March 2016    AIJ 2016         Improved lexical and unified vectors. Integration of embedding vector representations.

Main reference

If you use any of the resources available on this website, please cite the following article [bib]:

José Camacho-Collados, Mohammad Taher Pilehvar and Roberto Navigli.
Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities.
Artificial Intelligence 240, Elsevier, 2016, pp. 36-64.

Previous reference papers

José Camacho-Collados, Mohammad Taher Pilehvar and Roberto Navigli.
NASARI: a Novel Approach to a Semantically-Aware Representation of Items.
In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL 2015), Denver, USA, pp. 567-577, 2015.

José Camacho-Collados, Mohammad Taher Pilehvar and Roberto Navigli.
A Unified Multilingual Semantic Representation of Concepts.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), Beijing, China, July 27-29, pp. 741-751, 2015.

Contact

Should you have any enquiries about any of the resources, please contact us:


NASARI is an output of the MultiJEDI ERC Starting Grant No. 259234. NASARI is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.