TASS-2018-Task 3. eHealth Knowledge Discovery

Corpora description

“MedlinePlus is the National Institutes of Health's Website for patients and their families and friends. Produced by the National Library of Medicine, the world’s largest medical library, it brings you information about diseases, conditions, and wellness issues in language you can understand. MedlinePlus offers reliable, up-to-date health information, anytime, anywhere, for free.” [3] This platform freely provides a large amount of health-related textual data, from which we have made a selection to constitute the eHealth-KD corpus. The selection was made by sampling specific XML files from the collection available at https://medlineplus.gov/xml.html.

These files contain several entries related to health and medicine topics and were processed to remove all XML markup and extract the textual content. Only Spanish-language items were considered. Once cleaned, each individual item was converted to a plain text document, and further post-processing was applied to remove unwanted sentences, such as headers, footers and similar elements, and to flatten HTML lists into plain sentences. The resulting documents were manually tagged with Brat by a group of annotators. After tagging, a post-processing step was applied to Brat’s output files (ANN format) to obtain the output files in the formats described in this document.
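
For illustration, the snippet below sketches this kind of cleaning in Python. It is not the organizers' actual pipeline: the element name health-topic, the language attribute and the full-summary element are assumptions about the MedlinePlus XML schema and may differ from the real files.

```python
# A minimal sketch of the cleaning step described above (assumed schema).
import re
from html import unescape
from xml.etree import ElementTree


def spanish_summaries(xml_path):
    """Yield the raw (HTML-escaped) summaries of Spanish-language items."""
    root = ElementTree.parse(xml_path).getroot()
    for topic in root.iter("health-topic"):        # assumed element name
        if topic.get("language") == "Spanish":     # assumed attribute
            summary = topic.findtext("full-summary")
            if summary:
                yield summary


def to_sentences(raw_html):
    """Strip markup from one entry and flatten <li> items into sentences."""
    text = unescape(raw_html)
    text = re.sub(r"<li[^>]*>", "\n", text)   # each list item on its own line
    text = re.sub(r"<[^>]+>", " ", text)      # drop every remaining tag
    return [re.sub(r"\s+", " ", line).strip()
            for line in text.splitlines() if line.strip()]
```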

The resulting documents and output files are distributed along with the Task. There is no need for participants to download extra data from MedlinePlus servers, since all the input is already distributed.

Test data

The corpus is split into two sets: one is distributed for training and development, and the remaining documents are kept for blind evaluation (test). The test set is further divided into three subsets, one for each of the evaluation scenarios described below. Of these subsets, only the input files and the relevant output files are distributed (see below); the remaining output files are kept for evaluating participants’ submissions.

After the test evaluation is completed and results are disclosed, the full corpus will be provided, along with all the configuration files and utility scripts used to create the corpus and perform the annotation, to encourage future researchers to build upon these resources.

Additional Resources

Besides the training data, some additional resources will be provided, such as the evaluation scripts (i.e., score_training.py and score_test.py) and the example files (i.e., trial) shown in this document. Participants may use any other external resources as long as they are declared at the time of submission (e.g., WordNet, other corpora, software libraries). However, participants are not allowed to manually annotate the test data prior to submission.

All these resources (trial and training data, test files, scripts) are available in the following GitHub project and will be updated as new resources become available:

https://github.com/tass18-task3/data

Training Set version 1.0

As of April 5th, 2018, an initial training set has been published at https://github.com/tass18-task3/data. Follow the instructions in the README file to understand how to use the dataset.

This dataset contains a total of 559 sentences with 5673 annotations. All sentences are fully annotated. The distribution of annotations is summarized in the following table:

Annotation        Count
Entity             3276
   Action           849
   Concept         2427
Relation           1012
   Is-a             434
   Part-of          149
   Property-of      399
   Same-as           30
Roles              1385
   Subjects         599
   Targets          786
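
For reference, counts like the ones above can be reproduced from Brat's standoff (.ann) files. The sketch below assumes the standard Brat conventions (text-bound annotations on lines starting with "T", relations on lines starting with "R"); the files distributed with the task may encode roles differently, so treat it only as a starting point.

```python
# Minimal sketch for tallying annotation labels from Brat .ann files.
from collections import Counter
from pathlib import Path


def count_annotations(folder):
    counts = Counter()
    for ann_file in Path(folder).glob("*.ann"):
        for line in ann_file.read_text(encoding="utf-8").splitlines():
            if line and line[0] in ("T", "R"):
                # e.g. "T1<TAB>Concept 0 4<TAB>asma" or "R1<TAB>is-a Arg1:T1 Arg2:T2"
                label = line.split("\t")[1].split()[0]
                counts[label] += 1
    return counts


# Hypothetical usage, assuming a folder with one .ann file per document:
# print(count_annotations("training"))
```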

An additional 285 sentences are included in the develop folder. These sentences are also fully tagged and are meant to be used for model selection and parameter tuning. We encourage participants to try different models, algorithms, and parameter settings. Each of these variants should be trained on the training corpus only, and its performance measured on the development corpus in order to select the best one. This separation first ensures a fair comparison among participants. Furthermore, comparing different models on a development corpus independent from the training corpus also helps reduce the risk of overfitting and will give you a more accurate estimate of the actual performance of your models.
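
A minimal sketch of this workflow is shown below. The functions train_model and evaluate_on_dev are placeholders for the participant's own system and for the provided evaluation script, and the candidate parameter settings are purely illustrative.

```python
# Sketch of the recommended model-selection loop: fit every variant on the
# training folder only and compare the variants on the develop folder.

def train_model(train_docs, **params):
    """Placeholder: fit the participant's model on the training documents."""
    return params


def evaluate_on_dev(model, dev_docs):
    """Placeholder: score the model's predictions on the development
    documents, e.g. by calling the provided evaluation script."""
    return 0.0


candidates = [{"variant": "baseline"}, {"variant": "tuned", "window": 5}]
train_docs, dev_docs = [], []  # load from the training and develop folders

scores = {c["variant"]: evaluate_on_dev(train_model(train_docs, **c), dev_docs)
          for c in candidates}
best = max(scores, key=scores.get)
print("Best variant on the development set:", best, scores[best])
```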