Find the word that does not belong:
A Framework for an Intrinsic Evaluation of Word Vector Representations

Test your own word embeddings on the outlier detection task!

Given a group of words, the goal of the outlier detection task is to identify the word that does not belong in the group.
For example, book would be an outlier for the set of words apple, banana, lemon, book, orange, as it is not a fruit like the others.
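
As a rough illustration of the task, the sketch below predicts the outlier as the word with the lowest mean cosine similarity to the other words in the group. This is a simplified heuristic chosen for illustration, and the scoring used in the reference paper and in the evaluation script may differ. The function names and the two-dimensional toy vectors are invented for this example and are not part of the package.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def find_outlier(words, vectors):
    """Return the word with the lowest mean similarity to the rest of the group."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)

    return min(words, key=mean_similarity)


# Toy vectors (assumption: made up for illustration, not real embeddings).
toy_vectors = {
    "apple": [0.9, 0.1],
    "banana": [0.8, 0.2],
    "lemon": [0.85, 0.15],
    "orange": [0.9, 0.2],
    "book": [0.1, 0.9],
}

group = ["apple", "banana", "lemon", "book", "orange"]
predicted = find_outlier(group, toy_vectors)
print(predicted)
```

With real embeddings, the same function can be applied to any group of words whose vectors are available.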

This task is intended to test interesting properties of word embeddings not fully addressed to date in common intrinsic evaluation benchmarks such as word similarity.
Although the task is well-defined and humans achieve near-perfect performance, it remains challenging for state-of-the-art word embeddings.
In fact, some of the shortcomings of current word embeddings are clearly highlighted as part of the evaluation.

Please find more information about the dataset and the outlier detection task in the reference paper.

Download

Download the whole package [<1MB] including the following files:

A README file.

The 8-8-8 outlier detection dataset (including the guidelines given to the annotators to create it). Check some examples included in the dataset.

An easy-to-use Python script to test your word embeddings on the outlier detection dataset (it only needs your embeddings in a standard txt format for testing).

A small sample of word embeddings trained on the Wikipedia corpus by using the Skip-Gram model of Word2Vec.

The reference paper (see below).
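
For readers preparing their own embeddings, the following is a minimal sketch of reading vectors from a plain-text file. It assumes the common word2vec-style text layout (one word per line followed by whitespace-separated float values, optionally preceded by a "vocabulary size, dimension" header line); the README in the package is the authoritative description of the expected format. The function name `load_embeddings` is hypothetical and not part of the provided script.

```python
def load_embeddings(lines):
    """Parse word vectors from an iterable of text lines.

    Assumed layout (not confirmed by the package): "<word> <v1> <v2> ... <vn>".
    Lines with fewer than three fields are skipped, which covers blank lines
    and an optional "<vocab_size> <dimension>" header. As a side effect,
    one-dimensional vectors would also be skipped.
    """
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Small in-memory sample standing in for a file (values are made up).
sample = ["2 3", "apple 0.1 0.2 0.3", "book -0.4 0.5 0.6"]
embeddings = load_embeddings(sample)
print(sorted(embeddings))
```

Since an open file object is itself an iterable of lines, the same function can be given a file opened with `open(path, encoding="utf-8")`, where `path` points to your own embeddings file.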

Reference paper

When using these resources, please refer to the following paper:

José Camacho-Collados and Roberto Navigli. Find the word that does not belong: A Framework for an Intrinsic Evaluation of Word Vector Representations.
In Proceedings of the ACL Workshop on Evaluating Vector Space Representations for NLP, Berlin, Germany, August 12, 2016.

@inproceedings{camacho2016find,
  title={Find the word that does not belong: A framework for an intrinsic evaluation of word vector representations},
  author={Camacho-Collados, Jos{\'e} and Navigli, Roberto},
  booktitle={Proceedings of the ACL Workshop on Evaluating Vector Space Representations for NLP},
  pages={43--50},
  year={2016}
}

New: A large outlier detection dataset for five languages, automatically constructed from Wikidata and Wikipedia, has been released by Blair et al. (2017). Download the dataset here.

Contact

Should you have any enquiries about any of the resources, please contact José Camacho Collados (collados [at] di.uniroma1 [dot] it).

Last update: 25 Jun. 2016 by José Camacho Collados