=================================================================================================================================
				BabelDomains: Large-Scale Domain Labeling of Lexical Resources

				     Jose Camacho Collados and Roberto Navigli
=================================================================================================================================


This package contains six files and two directories:

* README.txt (this file)

* domain_list.txt : List of 34 domains included in BabelDomains. It consists of 32 domains from the Wikipedia featured articles page
		     (https://en.wikipedia.org/wiki/Wikipedia:Featured_articles) plus "Farming" and "Textile and Clothing".
		   
* domain_seeds_Wikipedia.txt: List of seeds (Wikipedia articles) associated with each domain from the Wikipedia featured articles. 
			      It also includes seeds for "Farming" and "Textile and Clothing". The file is tab-separated, one domain 
			      per line: domain <tab> seed1 <tab> seed2 <tab>...
			      

* domain_vectors.txt : Lexical vectors (sorted weighted list of words representing the domain). Weights are given using lexical 
		       specificity (Camacho-Collados et al., AIJ 2016). Each line corresponds to a domain and is tab-separated 
		       (words and weights separated by space): domain <tab> word1 score1 <tab> word2 score2...  

* BabelDomains (Directory): Directory including the BabelDomains resource for BabelNet (babeldomains_bn.txt), WordNet 
			    (babeldomains_wn.txt) and Wikipedia (babeldomains_wiki.txt). See below for more information on the format 
			    of the files.

* Evaluation_datasets (Directory) : Two gold standard datasets for WordNet ("wordnet_dataset_gold.txt", 1540 instances) and BabelNet
				    ("babelnet_dataset_gold.txt", 200 instances).

* evaluate_domains.py : Python script to test domain annotations in the gold standard datasets. See below for instructions on how
			to use it.

* EACL2017_BabelDomains.pdf (Reference paper): Please read this paper for more detailed information about BabelDomains. 
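
As an illustration of the two tab-separated formats described above (domain_seeds_Wikipedia.txt and domain_vectors.txt), the 
following is a minimal Python sketch for reading one line of each file. The function names and the sample lines are hypothetical 
and are not part of the package; the sketch only assumes the layouts stated above.

```python
def parse_seeds_line(line):
    """Split 'domain<TAB>seed1<TAB>seed2...' into (domain, [seeds])."""
    fields = line.rstrip("\n").split("\t")
    domain = fields[0]
    # Skip empty fields that a trailing tab would produce.
    seeds = [seed for seed in fields[1:] if seed]
    return domain, seeds


def parse_vector_line(line):
    """Split 'domain<TAB>word1 score1<TAB>word2 score2...' into
    (domain, [(word, score), ...]), keeping the order of the file."""
    fields = line.rstrip("\n").split("\t")
    domain = fields[0]
    entries = []
    for field in fields[1:]:
        field = field.strip()
        if not field:
            continue
        # The weight is the last space-separated token of the field.
        word, score = field.rsplit(" ", 1)
        entries.append((word, float(score)))
    return domain, entries


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    print(parse_seeds_line("Music\tMozart\tJazz\n"))
    print(parse_vector_line("Music\tsong 12.5\talbum 9.0\n"))
```

Splitting each word/weight pair from the right keeps the weight separate even if a word were to contain a space.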
						         

=================================================================================================================================
BABELDOMAINS: FORMAT
=================================================================================================================================

The resource is released in a unified format for BabelNet 3.0 (babeldomains_bn.txt), WordNet 3.0 (babeldomains_wn.txt) and 
Wikipedia (dump of November 2014) (babeldomains_wiki.txt). All files are sorted by confidence score. 

The format of each file is tab-separated, with each line having a minimum of three columns, as follows:

		instance <tab> main_domain <tab> confidence_score <tab> second_domain? <tab> third_domain?


 - instance: This corresponds to the lexical item (a synset for BabelNet and WordNet, or a page title for Wikipedia). 

 - main_domain: The domain annotation of the instance. See "domain_list.txt" for the full list of domains used in this version of 
		 BabelDomains.

 - confidence_score: This refers to the confidence score of the domain annotation ("main_domain"). The confidence score for each 
	             synset's domain label is computed as the relative number of neighbours in the BabelNet semantic network sharing 
		     the same domain.

		     *Note1*: If an instance is isolated in the BabelNet semantic network (or none of its neighbours has a domain 
			   label), we have included the confidence score of the corresponding step in the domain annotation process.
			   In this case, the confidence score is preceded by a star (*). 

		     *Note2*: We have performed a quality check of the confidence score by validating 20 synsets for each of five 
			      intervals and for the isolated synsets (i.e. those with a star * on the confidence score). The precision 
			      of the domain annotation for each interval was as follows: [0.0,0.2): 60%, [0.2,0.4): 80%, 
			      [0.4,0.6): 90%, [0.6,0.8): 95%, [0.8,1.0]: 100%, and isolated synsets (*): 90%.
			      

 - second_domain? and third_domain?: These columns are optional (i.e. not all instances have a second or third domain associated).
				    Some instances may arguably belong to more than one domain. As additional information, we have 
				    included a second domain (and, in some cases, a third domain) for instances whose score on the 
				    distributional step for these domains was close to that of the main domain.
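
The line format described above can be read with a few lines of Python. The following is a minimal sketch, not code shipped with 
the package: the function name, the returned keys and the sample line are hypothetical. It only assumes what is stated above, 
i.e. at least three tab-separated columns, a star (*) in front of the confidence score for isolated instances, and up to two 
optional extra domains.

```python
def parse_babeldomains_line(line):
    """Parse 'instance<TAB>main_domain<TAB>confidence_score[<TAB>second[<TAB>third]]'."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least 3 tab-separated columns: %r" % line)
    instance, main_domain, raw_score = fields[0], fields[1], fields[2]
    # A leading star marks an isolated instance (see Note1).
    isolated = raw_score.startswith("*")
    confidence = float(raw_score.lstrip("*"))
    # Optional second and third domains; empty fields are ignored.
    other_domains = [domain for domain in fields[3:5] if domain]
    return {
        "instance": instance,
        "main_domain": main_domain,
        "confidence": confidence,
        "isolated": isolated,
        "other_domains": other_domains,
    }


if __name__ == "__main__":
    # Made-up sample line, for illustration only.
    print(parse_babeldomains_line("bn:00000001n\tMusic\t*0.9\tMedia\n"))
```

Keeping the star as a separate flag makes it possible to filter on the numeric score while still treating isolated instances 
separately, since their score comes from a different step of the annotation process.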

=================================================================================================================================
INSTRUCTIONS FOR EVALUATING DOMAIN LABELS ON THE EVALUATION DATASETS
=================================================================================================================================

Input: 
	- Gold standard dataset (e.g. "babelnet_dataset_gold.txt"). It should be formatted tab-separated: synset <tab> domain.
	
	- Domain annotations (e.g. "babeldomains_bn.txt"). Similarly to the gold standard dataset, this file should be
	  tab-separated (synset <tab> domain).

Output: Precision, recall and F-Measure results.

Instructions to run the Python script "evaluate_domains.py": 

The script takes the following parameters:
	path_goldstandard	  : Path of the gold standard dataset.
	path_babeldomains	  : Path of the domain annotations.
	evaluation_mode (optional): "all" or "onlytest" (default: "onlytest"). 
					- "onlytest" -> considers only the domain annotations from the input file whose domains 
							also appear in the gold standard. This may be used when the domains do not 
							exactly correspond to the ones used in the gold standard. 
					- "all"      -> evaluates all domain annotations of the synsets in the gold standard.

Run it in the terminal by typing the following expression: 
	$ python evaluate_domains.py path_goldstandard path_babeldomains ?evaluation_mode?

Examples of usage: 
	$ python evaluate_domains.py Evaluation_datasets/babelnet_dataset_gold.txt BabelDomains/babeldomains_bn.txt

	$ python evaluate_domains.py Evaluation_datasets/wordnet_dataset_gold.txt BabelDomains/babeldomains_wn.txt


NOTE: Since this version of BabelDomains includes two additional domains, the results from BabelDomains may be slightly different 
      from the ones reported in the reference paper.
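
For readers who want to see how the three measures relate, the following Python 3 sketch shows one common way to compute them. 
It is NOT the code of "evaluate_domains.py" and its definitions are assumptions: precision is taken as correct annotations over 
evaluated annotations, recall as correct annotations over gold instances, and "onlytest" is assumed to skip annotations whose 
domain does not appear in the gold standard. The function name and the toy data are hypothetical.

```python
def evaluate(gold, predicted, mode="onlytest"):
    """gold and predicted map synset -> domain.
    Returns (precision, recall, f_measure) under the assumptions stated above."""
    gold_domains = set(gold.values())
    correct = 0
    evaluated = 0
    for synset, gold_domain in gold.items():
        domain = predicted.get(synset)
        if domain is None:
            continue  # no annotation for this gold synset
        if mode == "onlytest" and domain not in gold_domains:
            continue  # assumed behaviour: ignore domains unseen in the gold standard
        evaluated += 1
        if domain == gold_domain:
            correct += 1
    precision = correct / evaluated if evaluated else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Toy data, for illustration only.
    gold = {"a": "Music", "b": "Sport", "c": "Music"}
    predicted = {"a": "Music", "b": "Music", "d": "Sport"}
    print(evaluate(gold, predicted, mode="all"))
```

Under these definitions, synsets that are missing from the annotation file lower recall but leave precision unchanged.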

=================================================================================================================================
REFERENCE PAPER
=================================================================================================================================

When using these resources, please refer to the following paper (included in the package as "EACL2017_BabelDomains.pdf"):

	Jose Camacho-Collados and Roberto Navigli. 
	BabelDomains: Large-Scale Domain Labeling of Lexical Resources.
	Proceedings of EACL (short).
	Valencia, Spain, 2017. 


=================================================================================================================================
CONTACT
=================================================================================================================================
 
If you have any enquiries about any of the resources, please contact Jose Camacho Collados (collados [at] di.uniroma1 [dot] it).

=================================================================================================================================