========================================================================================
SPred: Large-scale Harvesting of Semantic Predicates (Resource and annotated datasets)
========================================================================================
Tiziano Flati, Roberto Navigli
SAPIENZA Universita' di Roma
========================================================================================

===================
=== The Package ===
===================

The package contains 3 .txt files, 2 .tar.gz packages, 3 annotated datasets and a PDF copy of the paper.

A) the 3 .txt files contain the lists of predicates used as input to the system, namely:
   (i)   oxford.predicates.txt: the Oxford lexical predicates
   (ii)  oxford.sample.predicates.txt: the Oxford lexical predicate sample
   (iii) kh.patterns.txt: the 24 Kozareva & Hovy lexical patterns

B) the 2 packages contain the semantic predicates for the Oxford Advanced Learner's Dictionary
   and for the 24 Kozareva & Hovy 2010 lexical patterns.

C) the 3 annotated datasets contain the annotations for
   (i)   the semantic class ranking evaluation using the sample of 50 random Oxford lexical predicates
   (ii)  the semantic class ranking evaluation using the predicates from Kozareva & Hovy
   (iii) the argument classification evaluation performed on the Oxford sample.

D) a PDF copy of the paper

A description of the datasets and a comprehensive overview of the system are provided in:

   Tiziano Flati, Roberto Navigli.
   SPred: Large-scale Harvesting of Semantic Predicates.
   Proc. of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013),
   Sofia, Bulgaria, August 4-9, 2013

Please cite this paper when referring to the datasets and the SPred system.

========================================
=== A) Lexical Predicate List Format ===
========================================

Fields are tab-delimited and each file contains 3 columns:

1. a letter (A or B, standing for After and Before) which determines the type of the predicate,
   i.e., whether the argument comes after the predicate (type A, e.g. design *) or before it
   (type B, e.g. * design). This information is also implicit in the next field, but it is kept
   as a separate column for ease of processing.
2. the lexical predicate ID. It is composed of the name of the lexical predicate (at the lemma
   level) and its type.
3. a list of space-delimited tokens. Each token is a triple whose components are separated by '/'
   (Stanford-like format): the word, the lemma and the part of speech of the token, in this order.
   Any of the components can equal '.*', meaning "match everything".

========================================
=== B) Semantic Class Ranking Format ===
========================================

Within each *.predicates.tar.gz archive you will find a directory named after the set of
predicates mentioned in the archive's name. Each file contained in the directory is associated
with one semantic predicate and contains the semantic class ranking for that predicate.
Predicate names have been hashed into subdirectories; for instance, the file corresponding to
the lexical predicate "accumulate *" is "ac/cu/accumulated_A_ALL.classes.WORDNET_CORE.txt"

Each file is tab-delimited and contains 3 columns:

1. the importance score of the semantic class, defined as the number of argument occurrences
   linked to the semantic class
2. the name of the semantic class
3. a list of comma-separated arguments

========================================
==== C) Annotated Datasets Format =====
========================================

== (i + ii) Semantic class ranking evaluation ==

Each file is divided into records, with each new record introduced by an empty line.
Each record starts with one line containing the predicate identifier (either "predicate [*]"
or "[*] predicate").
The predicate identifier line is followed by a variable number of lines (at most 20, possibly
fewer), each representing a semantic class. Each line contains:

   (i)   the judgement (OK or empty field)
   (ii)  the semantic class importance score
   (iii) the WordNet gloss for the semantic class
   (iv)  the list of arguments linked to the semantic class, separated by '|' and enclosed in
         '"' symbols.

== (iii) Argument classification evaluation ==

The file is tab-delimited and contains 7 columns:

   (i)   the argument ID (domain#oxford word#integer)
   (ii)  the word of the Oxford dictionary from which the usage note was extracted
   (iii) the domain (name of the Oxford usage note)
   (iv)  the lexical predicate
   (v)   the argument
   (vi)  the semantic class (the first sense in the WordNet semantic class if the system could
         give an answer; NONE otherwise)
   (vii) the judgement (either OK or FAILED)
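
== Reading the files (illustrative sketch) ==

The two tab-delimited formats described above (the lexical predicate lists of section A and the
argument classification file of section C (iii)) could be read along the following lines. This is
only a minimal sketch and not part of the SPred distribution: the class and function names are
hypothetical, the two sample lines are made up for illustration and are not taken from the
datasets, and it is assumed that lemmas and part-of-speech tags never contain a '/'.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str   # any component may be '.*' ("match everything")
    lemma: str
    pos: str


class LexicalPredicate(NamedTuple):
    kind: str            # 'A' (After) or 'B' (Before)
    predicate_id: str
    tokens: List[Token]


class ArgumentJudgement(NamedTuple):
    argument_id: str        # domain#oxford word#integer
    oxford_word: str
    domain: str
    lexical_predicate: str
    argument: str
    semantic_class: str     # 'NONE' when the system gave no answer
    judgement: str          # 'OK' or 'FAILED'


def parse_predicate_line(line: str) -> LexicalPredicate:
    """Parse one line of a lexical predicate list (section A)."""
    kind, predicate_id, token_field = line.rstrip("\n").split("\t")
    if kind not in ("A", "B"):
        raise ValueError(f"unexpected predicate type: {kind!r}")
    tokens = []
    for raw in token_field.split(" "):
        # Assumption: only the word component may contain '/',
        # so the triple is split from the right.
        word, lemma, pos = raw.rsplit("/", 2)
        tokens.append(Token(word, lemma, pos))
    return LexicalPredicate(kind, predicate_id, tokens)


def parse_argument_line(line: str) -> ArgumentJudgement:
    """Parse one line of the argument classification file (section C, iii)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, found {len(fields)}")
    return ArgumentJudgement(*fields)


# Made-up sample lines, for illustration only.
sample_predicate = "A\tdesign_A\t.*/design/VB .*/.*/NN"
sample_row = "cooking#bake#1\tbake\tcooking\tbake *\tcake\tcake\tOK"

pred = parse_predicate_line(sample_predicate)
row = parse_argument_line(sample_row)
print(pred)
print(row)
```

Named tuples are used here only to keep the column order of the README visible in the code; a
plain list of fields per line would serve equally well.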