Training data - Word Sense Disambiguation: A Unified Evaluation Framework

Training data

SemCor (Miller et al., 1994). SemCor is a manually sense-annotated corpus divided into 352 documents, with a total of 226,040 sense annotations. To our knowledge, SemCor is the largest corpus manually annotated with WordNet senses, and it is the main corpus used in the literature to train supervised WSD systems (Agirre et al., 2010b; Zhong and Ng, 2010).

OMSTI (Taghipour and Ng, 2015a). OMSTI (One Million Sense-Tagged Instances) is a large corpus annotated with senses from the WordNet 3.0 inventory. It was constructed automatically by applying an alignment-based WSD approach (Chan and Ng, 2005) to a large English-Chinese parallel corpus, the MultiUN corpus (Eisele and Chen, 2010). OMSTI has already shown its potential as a training corpus: supervised systems that add OMSTI to their training data improve their performance (Taghipour and Ng, 2015a; Iacobacci et al., 2016).

Download sense-annotated training data here [162MB].
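
Once the data is downloaded, a quick sanity check is to count sense annotations per document. The sketch below is illustrative only. It assumes a plain-text key file in which each line holds an instance identifier followed by one or more WordNet sense keys, and it assumes identifiers of the form document.sentence.token. The function names and the sample lines are hypothetical and not taken from the distribution; the layout of the actual files should be checked before relying on this.

```python
from collections import Counter


def parse_gold_keys(lines):
    """Map each instance id to its list of sense keys.

    Assumed line format (not confirmed by the source):
    '<instance_id> <sense_key> [<sense_key> ...]'
    """
    gold = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        gold[parts[0]] = parts[1:]
    return gold


def annotations_per_document(gold):
    """Count annotated instances per document.

    Assumes ids look like 'd000.s000.t000', where the first
    dot-separated field names the document.
    """
    return Counter(instance_id.split(".")[0] for instance_id in gold)


# Made-up sample lines for illustration only.
sample = """\
d000.s000.t000 long%3:00:02::
d000.s000.t001 be%2:42:03::
d001.s000.t000 art%1:09:00:: art%1:04:00::
"""

gold = parse_gold_keys(sample.splitlines())
counts = annotations_per_document(gold)
print(counts)
```

For a real file, the same functions could be applied to an open file handle instead of the sample string, since iterating over a file yields its lines.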

References:

- George A. Miller, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert G. Thomas. 1994. Using a semantic concordance for sense identification. In Proceedings of the Workshop on Human Language Technology, pages 240–243. Association for Computational Linguistics.

- Kaveh Taghipour and Hwee Tou Ng. 2015a. One million sense-tagged instances for word sense disambiguation and induction. In Proceedings of CoNLL 2015, pages 338–344.