=================================================================================================================================
BabelDomains: Large-Scale Domain Labeling of Lexical Resources
Jose Camacho Collados and Roberto Navigli
=================================================================================================================================

This package contains six files and two directories:

* README.txt (this file)

* domain_list.txt : List of the 34 domains included in BabelDomains. It consists of 32 domains from the Wikipedia featured articles page (https://en.wikipedia.org/wiki/Wikipedia:Featured_articles) plus "Farming" and "Textile and Clothing".

* domain_seeds_Wikipedia.txt : List of seeds (Wikipedia articles) associated with each domain from the Wikipedia featured articles page. It also includes seeds for "Farming" and "Textile and Clothing". The file is tab-separated, one domain per line:
  domain seed1 seed2 ...

* domain_vectors.txt : Lexical vectors (sorted weighted lists of words representing each domain). Weights are computed using lexical specificity (Camacho-Collados et al., AIJ 2016). Each line corresponds to a domain and is tab-separated (each word and its weight are separated by a space):
  domain word1 score1 word2 score2 ...

* BabelDomains (Directory) : Directory containing the BabelDomains resource for BabelNet (babeldomains_bn.txt), WordNet (babeldomains_wn.txt) and Wikipedia (babeldomains_wiki.txt). See below for more information on the format of the files.

* Evaluation_datasets (Directory) : Two gold standard datasets, for WordNet ("wordnet_dataset_gold.txt", 1540 instances) and BabelNet ("babelnet_dataset_gold.txt", 200 instances).

* evaluate_domains.py : Python script to test domain annotations against the gold standard datasets. See below for instructions on how to use it.

* EACL2017_BabelDomains.pdf (Reference paper) : Please read this paper for more detailed information about BabelDomains.

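As an illustration of the domain_vectors.txt layout described above, the following minimal Python sketch reads one line into a domain name and a list of (word, weight) pairs. It assumes that entries are separated by tabs and that, within an entry, the weight is the last space-separated token. The function name and the sample line (including its words and weights) are made up for illustration and are not taken from the package.

```python
def parse_domain_vector(line):
    """Split one domain_vectors.txt line into (domain, [(word, weight), ...]).

    Assumed layout: domain<TAB>word1 score1<TAB>word2 score2 ...
    """
    fields = line.rstrip("\n").split("\t")
    domain = fields[0]
    weights = []
    for field in fields[1:]:
        if not field.strip():
            continue  # skip empty trailing fields
        # Split on the last space so that multi-word entries stay intact.
        word, score = field.rsplit(" ", 1)
        weights.append((word, float(score)))
    return domain, weights


# Hypothetical sample line; the words and weights are invented.
example = "Music\tguitar 153.2\talbum 120.75\trock band 98.0"
domain, weights = parse_domain_vector(example)
print(domain, weights)
```

Splitting on the last space rather than the first is a defensive choice in case an entry contains more than one word; if all entries are single words, both give the same result.
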
=================================================================================================================================
BABELDOMAINS: FORMAT
=================================================================================================================================

The resource is released in a unified format for BabelNet 3.0 (babeldomains_bn.txt), WordNet 3.0 (babeldomains_wn.txt) and Wikipedia (dump of November 2014) (babeldomains_wiki.txt). All files are sorted by confidence score. Each file is tab-separated, with a minimum of three columns per line, as follows:

  instance main_domain confidence_score second_domain? third_domain?

- instance: The lexical item (a synset for BabelNet and WordNet, a Wikipedia page title for Wikipedia).

- main_domain: The domain annotation of the instance. See "list_domains.txt" for the full list of domains used in this version of BabelDomains.

- confidence_score: The confidence score of the domain annotation ("main_domain"). The confidence score for each synset's domain label is computed as the relative number of neighbours in the BabelNet semantic network sharing the same domain.

  *Note1*: If an instance is isolated in the BabelNet semantic network (or none of its neighbours has a domain label), we have included the confidence score of the corresponding step in the domain annotation process. In this case, the confidence score is preceded by a star (*).

  *Note2*: We have performed a quality check of the confidence score by validating 20 synsets on five intervals and the isolated synsets (i.e. those with a star * on the confidence score). The precision of the domain annotation for each interval was as follows: [0.0,0.2): 60%, [0.2,0.4): 80%, [0.4,0.6): 90%, [0.6,0.8): 95%, [0.8,1.0]: 100%, and isolated synsets (*): 90%.

- second_domain? and third_domain?: These columns are optional (i.e. not all instances have a second or third domain associated).
  Some instances may arguably belong to more than one domain. As additional information, we have included a second domain (and in some cases a third domain) for instances whose score in the distributional step for these domains was close to that of the main domain.

=================================================================================================================================
INSTRUCTIONS FOR EVALUATING DOMAIN LABELS ON THE EVALUATION DATASETS
=================================================================================================================================

Input:

- Gold standard dataset (e.g. "babelnet_dataset_gold.txt"). It should be tab-separated: synset domain.

- Domain annotations (e.g. "babeldomains_babelnet.txt"). Like the gold standard dataset, this file should be tab-separated (synset domain).

Output: Precision, recall and F-Measure results.

Instructions to run the Python script "evaluate_domains.py":

The script takes the following parameters:

  path_goldstandard : Path of the gold standard dataset.
  path_babeldomains : Path of the domain annotations.
  evaluation_mode (optional) : "all" or "onlytest" (default: "onlytest").
    - "onlytest" -> considers only the domain annotations from the input file whose domain also appears in the gold standard. This may be used when the domains do not exactly correspond to the ones used in the gold standard.
    - "all" -> evaluates all domain annotations of the synsets in the gold standard.

Run it in the terminal by typing the following command:

$ python evaluate_domains.py path_goldstandard path_babeldomains ?evaluation_mode?

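To make the reported measures and the two evaluation modes more concrete, here is a minimal Python sketch of one plausible way to compute them. It is not the code of evaluate_domains.py: the function names, the exact definitions (precision over the gold synsets that received an annotation, recall over all gold synsets) and the sample synset identifiers are assumptions made for illustration only.

```python
def load_pairs(lines):
    """Read tab-separated 'synset<TAB>domain' lines; extra columns are ignored."""
    pairs = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            pairs[cols[0]] = cols[1]
    return pairs


def score(gold, predicted, mode="onlytest"):
    """Return (precision, recall, f_measure) under the assumed definitions."""
    if mode == "onlytest":
        # Assumption: drop annotations whose domain never occurs in the gold standard.
        gold_domains = set(gold.values())
        predicted = {s: d for s, d in predicted.items() if d in gold_domains}
    attempted = [s for s in gold if s in predicted]
    correct = sum(1 for s in attempted if predicted[s] == gold[s])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical toy data; the synset identifiers are invented.
gold = {"s1": "Music", "s2": "Music", "s3": "Sport", "s4": "Sport"}
predicted = {"s1": "Music", "s2": "Sport", "s3": "Farming"}
for mode in ("onlytest", "all"):
    print(mode, score(gold, predicted, mode))
```

In this toy example the "Farming" annotation is ignored in "onlytest" mode because that domain does not occur in the gold data, whereas in "all" mode it counts as an incorrect annotation.
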
Examples of usage:

$ python evaluate_domains.py Evaluation_datasets/babelnet_dataset_gold.txt BabelDomains/babeldomains_babelnet.txt

$ python evaluate_domains.py Evaluation_datasets/wordnet_dataset_gold.txt BabelDomains/babeldomains_wordnet.txt

NOTE: Since this version of BabelDomains includes two additional domains, the results may differ slightly from the ones reported in the reference paper.

=================================================================================================================================
REFERENCE PAPER
=================================================================================================================================

When using these resources, please cite the following paper (included in the package as "EACL2017_BabelDomains.pdf"):

Jose Camacho-Collados and Roberto Navigli. BabelDomains: Large-Scale Domain Labeling of Lexical Resources. Proceedings of EACL (short). Valencia, Spain, 2017.

=================================================================================================================================
CONTACT
=================================================================================================================================

If you have any enquiries about any of the resources, please contact Jose Camacho Collados (collados [at] di.uniroma1 [dot] it).

=================================================================================================================================