Similarity-based Pseudowords

Similarity-based Pseudowords is a dataset of pseudowords that model the ambiguous nouns in WordNet 3.0. Each ambiguous noun is modeled by a pseudoword, built by selecting, for each of the noun's senses, the most suitable monosemous representative of that sense (a pseudosense).

Datasets

Download the full package, including the readme file.

Or, alternatively, download a specific set of pseudowords:

English Gigaword

noPOS	noMWE	minFreq
✗	✗	100	500	1000
✔	✗	100	500	1000
✔	✔	100	500	1000

English Gigaword Fifth Edition

noPOS	noMWE	minFreq
✗	✗	100	500	1000	2000
✔	✗	100	500	1000	2000
✔	✔	100	500	1000	2000

British National Corpus

noPOS	noMWE	minFreq
✗	✗	100
✔	✗	100
✔	✔	100

minFreq : the occurrence frequency of pseudosenses in the corresponding corpus is at least that given by the minFreq value. In other words, given a set of pseudowords and its corresponding corpus, minFreq denotes the minimum number of sentences that can be pseudosense-tagged for each individual pseudosense.
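
As a rough illustration of the minFreq condition, the sketch below counts, for each pseudosense of one pseudoword, how many sentences of a corpus contain it, and checks that every count reaches the threshold. This is one reading of the definition above, not the procedure used to build the dataset. The function names, the whitespace tokenization, the lowercasing, and the toy sentences are all assumptions made for the example.

```python
def sentence_counts(sentences, pseudosenses):
    """For each pseudosense, count the sentences that contain it at least once."""
    counts = {sense: 0 for sense in pseudosenses}
    for sentence in sentences:
        # Assumption: whitespace tokenization and lowercase pseudosenses.
        tokens = set(sentence.lower().split())
        for sense in counts:
            if sense in tokens:
                counts[sense] += 1
    return counts


def satisfies_min_freq(sentences, pseudosenses, min_freq):
    """True if every pseudosense can be tagged in at least min_freq sentences."""
    counts = sentence_counts(sentences, pseudosenses)
    return all(count >= min_freq for count in counts.values())


if __name__ == "__main__":
    # Toy corpus and toy pseudosenses, invented for illustration only.
    corpus = [
        "The river flooded .",
        "A lender refused the loan .",
        "The river froze .",
    ]
    print(sentence_counts(corpus, ["lender", "river"]))
    print(satisfies_min_freq(corpus, ["lender", "river"], 2))
```

Single-token matching like this only fits the noMWE sets; multi-word pseudosenses would need phrase matching instead.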

Filters:
- noMWE : pseudosenses are single-token words only (no multi-word expressions allowed).
- noPOS (posFilter) : intended for corpora that are not POS-tagged or whose tagging is unreliable. In this set of pseudowords, pseudosenses are nouns only: they have no other part of speech (verb, adverb, or adjective) in WordNet 3.0.

Dataset Format

Each txt file consists of 15,935 pseudowords, one per line, corresponding to the 15,935 ambiguous nouns in WordNet 3.0. Each line has the following format:
<ambiguous_noun> corresponding_pseudoword avgRankScore
where "corresponding_pseudoword" is the list of space-separated pseudosenses (pseudosense_1 pseudosense_2 ... pseudosense_n), given in the same order as their corresponding senses in WordNet 3.0, and "avgRankScore" is the average position of the selected pseudosenses (please refer to the paper for details). The lower the value of this score, the more confidence we have in the preservation of meaning. These scores can help in selecting more reliable pseudowords if only a subset of the dataset is needed.
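
The line format above can be read with a few lines of Python. The following is a minimal sketch, not an official loader. It assumes that fields are separated by whitespace, that a multi-word pseudosense contains no internal spaces, that the noun may or may not be wrapped in literal angle brackets, and that the score parses as a float. The names Pseudoword, parse_line, and most_reliable are hypothetical, and the sample lines are invented rather than taken from the dataset.

```python
from typing import NamedTuple, Tuple


class Pseudoword(NamedTuple):
    noun: str                      # the ambiguous WordNet 3.0 noun
    pseudosenses: Tuple[str, ...]  # one per sense, in WordNet sense order
    avg_rank_score: float          # lower means more reliable


def parse_line(line):
    """Parse one '<ambiguous_noun> pseudosense_1 ... pseudosense_n avgRankScore' line."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected a noun, at least one pseudosense, and a score")
    # Assumption: strip angle brackets in case they appear literally in the file.
    noun = fields[0].strip("<>")
    return Pseudoword(noun, tuple(fields[1:-1]), float(fields[-1]))


def most_reliable(entries, max_score):
    """Keep entries whose avgRankScore is at most max_score, best first."""
    kept = [entry for entry in entries if entry.avg_rank_score <= max_score]
    return sorted(kept, key=lambda entry: entry.avg_rank_score)


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        "<bank> lender riverbank 1.5",
        "<crane> wader derrick 4.0",
    ]
    entries = [parse_line(line) for line in sample]
    for entry in most_reliable(entries, max_score=2.0):
        print(entry.noun, entry.pseudosenses, entry.avg_rank_score)
```

Filtering on a maximum avgRankScore is only one way to pick a subset; taking the k lowest-scoring entries would serve equally well.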

Reference

When citing Similarity-based Pseudowords, please refer to the following paper:

Mohammad Taher Pilehvar and Roberto Navigli. Paving the Way to a Large-scale Pseudosense-annotated Dataset. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), pages 1100-1109, Atlanta, USA, June 10-12, 2013.

Last update: 1 Oct. 2013 by Mohammad Taher Pilehvar