Theses

The publication of a PhD thesis is the result of intensive research in a particular area of science. This hard work always advances the field under study. One of the objectives of SEPLN is to promote research on Natural Language Processing (NLP). To disseminate the advances in NLP made in Spain, SEPLN publishes all NLP PhD theses.

Title
Negation and Speculation Detection in Medical and Review Texts
Author
Noa Patricia Cruz Díaz
Supervisor
Manuel J. Maña López
Abstract

Negation and speculation detection has been an active research area in the Natural Language Processing community in recent years, and has been the subject of several Shared Tasks at relevant conferences. It is a challenging problem, and many applications can benefit from identifying this kind of information (e.g., interaction detection, information extraction, sentiment analysis). This thesis aims to contribute to the ongoing research on negation and speculation in the Language Technology community through the development of machine-learning systems which detect speculation and negation cues and resolve their scope (i.e., identify at sentence level which tokens are affected by the cues). It focuses on the two domains in which negation and hedging have drawn the most attention: the biomedical and the review domains. In the first, the proposed method improves on the results reported to date for the sub-collection of clinical documents of the BioScope corpus. In the second, the novelty of the contribution lies in the fact that, to the best of our knowledge, this is the first system trained and tested on the SFU Review corpus annotated with negative and speculative information. It is also the first attempt to detect speculation in the review domain. Additionally, because of the tokenization problems encountered during the preprocessing of the BioScope corpus and the small number of works in the literature which propose solutions to them, this thesis describes the issue in detail and provides a comprehensive overview, analysis, and evaluation of a set of tokenization tools. This constitutes the first comparative evaluation of tokenizers in the biomedical domain, which could help Natural Language Processing developers choose the best tokenizer to use.

Title
The relational discourse structure in pragmatics: description and evaluation in Computational Linguistics
Author
Mikel Iruskieta
Supervisors
Arantza Díaz de Ilarraza, Mikel Lersundi
Abstract

Written human communications usually consist of more than one sentence, and the coherence relations that exist between these sentences cannot be explained in terms of a successive sequence of phrases (van Dijk 1997). Normally, coherent texts have a structure that is much more complex than mere juxtaposition, provided, of course, that the author wishes to explain him or herself clearly and take into account all the different sides (even the opposing ones) of the issue at hand. This structure is called relational discourse structure, and its description belongs to the field of pragmatics known as discourse analysis.

A review of the literature on relational discourse structure shows that, although the scientific community has made a concerted effort to describe the two main phenomena of relational discourse structure theory (hierarchical structure and the rhetorical relations between text segments), hardly any work has been carried out in this field for the Basque language, and implicit coherence relations have not been taken into account. This thesis describes how we annotated scientific abstracts from different domains with the relational discourse structures found in them. It also describes how we overcame the most important problem encountered when annotating texts at this level, namely inter-annotator subjectivity. To this end, we used Rhetorical Structure Theory (RST) (Mann and Thompson 1987), the most widely accepted theory for describing relational discourse structure phenomena in the field of computational linguistics.

As stated above, for the Basque language, coherence relations have only been partially analyzed to date, with almost all of the focus placed on explicit coherence relations. This thesis seeks to redress this situation by describing coherence relations (both explicit and implicit) at different levels (micro-structure and macro-structure), based on semantic-pragmatic criteria. Moreover, thanks to an innovative annotation method that is also presented here, the main claim of the thesis is that inter-annotator subjectivity is not always present to the same degree in the backbone of hierarchical structures, at the different levels of the discourse structure tree, or indeed in certain coherence relations between different text segments. To demonstrate this, we propose an innovative qualitative-quantitative relational discourse structure evaluation system. Although we have used this system here to evaluate the reliability of an annotated text in the Basque language, we will also demonstrate that it can be used to compare structures in parallel corpora. Moreover, in order both to avoid circularity problems between rhetorical relations and their signals that may arise as the result of a training phase designed to increase inter-annotator agreement, and to enhance the reliability of discourse structures, we first established the criteria to be followed by the super annotator within RST. The principal outcome of this proposal is a set of characteristics of the first reference corpus in the Basque language annotated with relational discourse structure. We will also outline some innovative search tools for consulting the contents of the tagged corpus and will describe the work carried out to disseminate the corpus and make it available to the scientific community at large. The files of the corpus annotated at different language levels have been made available to any interested party, in the hope that they will prove useful for certain tasks involved in the processing of the Basque language, including automatic segmentation, information retrieval, automatic summarization, and machine translation, among others.

The addresses of the corpus annotated with relational discourse structure, the electronic version of the thesis in Basque, and the abbreviated translation of the thesis are as follows: