TASS-2017: Workshop on Semantic Analysis at SEPLN

About

The workshop and shared task "Sentiment Analysis at SEPLN (TASS)" has been held since 2012 under the umbrella of the International Conference of the Spanish Society for Natural Language Processing (SEPLN). TASS was the first shared task on sentiment analysis of Twitter in Spanish. Spanish is the second most used language on Facebook and Twitter [1], which calls for the development and availability of language-specific methods and resources for sentiment analysis. The initial aim of TASS was to further research on sentiment analysis in Spanish, with a special interest in the language used on Twitter.

Although sentiment analysis is still an open problem, the Organization Committee would like to foster research on other tasks related to the processing of the semantics of texts written in Spanish. Consequently, the name of the workshop/shared task has been changed to "Workshop on Semantic Analysis at SEPLN (TASS)".

As in previous years, TASS-2017 proposes two evaluation tasks related to polarity classification at tweet level and at aspect level. The novelty of this year is the proposal of a new dataset for the task of sentiment analysis at document (tweet) level.

Moreover, the Organization Committee appeals to the research community to propose and organize evaluation tasks related to other semantic tasks in the Spanish language. New tasks provide an opportunity to create linguistic resources, evaluate their usefulness, and promote the consolidation of a community of researchers interested in the addressed topics. Thus, we encourage the semantic processing community to propose and submit evaluation tasks, with the support of the Organization Committee of TASS.

TASS-2017 will be the 6th event of the series and will be held in conjunction with the 33rd International Conference of the Spanish Society for Natural Language Processing (SEPLN), in Murcia, Spain, on September 19th, 2017.

A Google Group has been set up for this year’s TASS Shared Task, where announcements will be made. Please send your questions and feedback to tass-tasks@googlegroups.com.

Proposal of Tasks

Semantic analysis has given rise to new tasks that attempt to further improve natural language understanding systems. In the context of sentiment analysis, some such tasks are cross- and multi-domain sentiment analysis, as well as aspect-based sentiment analysis. Outside the sentiment analysis arena, other tasks attracting the interest of the research community are stance classification, negation handling, rumour identification, fake news identification, open information extraction, argumentation mining, classification of semantic relations, and question answering of non-factoid questions, to name a few. We encourage the research community to propose evaluation tasks related to such semantic analysis processes in Spanish. The above list is by no means closed, so feel free to submit any evaluation task proposal that you consider interesting for the research community.

Proposals must include the following:

  • Title of the task
  • Description of the evaluation task
  • Linguistic resources available or resources to be created
  • Important dates
  • Organization committee
  • Contact person (name and email)

Proposals must be sent to tass-sepln@googlegroups.com by April 15, 2017. Notification of acceptance: April 21, 2017.

Tasks

Although TASS-2017 will include tasks covering several types of semantic processing, sentiment analysis is still the main target of the workshop. Two tasks address the performance of polarity classification systems on tweets written in Spanish.

Task 1: Sentiment Analysis at Tweet level

This task focuses on the evaluation of polarity classification systems at tweet level in Spanish. Training, development and test datasets will be provided to train and evaluate the systems. The new dataset is called InterTASS and is composed of tweets written in Spanish. For more details, read the Datasets section.

InterTASS is annotated with 4 different polarity labels (P, N, NEU, NONE), and the submitted systems will have to identify the intensity of the opinion expressed in each tweet.

The submitted systems can use any set of data as training data, i.e. the training set of InterTASS, training sets from previous editions, or other sets of tweets. However, using the test set of InterTASS or the test sets of previous editions as training data is forbidden. Participants can use any kind of linguistic resource to develop their classification model. The systems must be evaluated on the test set of InterTASS and on the General Corpus of TASS (see previous editions). Participants are expected to submit three experiments per evaluation set, so each participating team can submit a maximum of 6 systems.

Accuracy and the macro-averaged versions of Precision, Recall and F1 will be used as evaluation measures. Systems will be ranked by the Macro-F1 and Accuracy measures.
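
As an illustration of these measures, the following Python sketch computes Accuracy and the macro-averaged Precision, Recall and F1 over the four Task 1 labels. It is not the official scorer: the function name macro_scores is hypothetical, and Macro-F1 is assumed here to be the mean of the per-class F1 values, whereas an official script may derive it differently (for instance from Macro-Precision and Macro-Recall).

```python
from collections import Counter

# Label spelling follows the corpus annotation (assumption for the sketch).
LABELS = ("P", "NEU", "N", "NONE")


def macro_scores(gold, pred, labels=LABELS):
    """Return accuracy and macro-averaged precision, recall and F1.

    gold and pred are equally long sequences of label strings.
    A class with an empty denominator contributes 0.0 to the average.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        prec = tp[label] / p_den if p_den else 0.0
        rec = tp[label] / r_den if r_den else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(gold) if gold else 0.0,
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }
```

Because every class weighs the same in the macro average, a system that ignores a rare label such as NEU is penalised more under Macro-F1 than under Accuracy.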

Results must be submitted in a plain text file with the following format:

tweet_id \t polarity

Where polarity can be: P, NEU, N, None
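
Before submitting, a results file can be checked against this format with a few lines of Python. The sketch below is only an assumption about what a sensible check looks like: the function name check_run is hypothetical, numeric tweet identifiers are assumed from the corpus example, and the label comparison is case-insensitive because this page writes the fourth label both as NONE and as None.

```python
ALLOWED = {"P", "NEU", "N", "NONE"}


def check_run(lines):
    """Return (line_number, message) pairs for problems in a Task 1 run.

    lines is any iterable of text lines, e.g. an open file object.
    """
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            problems.append((number, "expected 2 tab-separated fields"))
            continue
        tweet_id, polarity = (field.strip() for field in fields)
        if not tweet_id.isdigit():
            problems.append((number, "tweet_id is not numeric"))
        if polarity.upper() not in ALLOWED:
            problems.append((number, "unknown polarity label"))
        if tweet_id in seen:
            problems.append((number, "duplicate tweet_id"))
        seen.add(tweet_id)
    return problems
```

An empty list means that every line has two tab-separated fields, a numeric identifier that appears only once, and one of the four labels.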

Task 2: Aspect-based Sentiment Analysis

The second task proposes the development of aspect-based polarity classification systems. Two datasets are provided to evaluate the systems: Social-TV and STOMPOL. Both datasets are annotated with the aspect, the main category of the aspect, and the polarity of the opinion about the aspect. The systems have to classify the opinion about the given aspect on a three-level scale: Positive, Neutral and Negative.

Participants are expected to submit up to 3 experiments for each corpus, each in a plain text file with the following format:

tweetid \t aspect \t polarity

Allowed polarity values are P, NEU and N.

For evaluation, a single label combining "aspect-polarity" will be considered. As in Task 1, the macro-averaged version of Precision, Recall and F1, and Accuracy are the evaluation measures, and Macro-F1 will be used for ranking the systems.
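
The combined label can be pictured with a short Python sketch. The helper names are hypothetical, and the sketch assumes that gold and predicted rows are aligned one to one in the same order, which this page does not specify; it only shows that a prediction counts as correct when both the aspect and the polarity match.

```python
def parse_run_line(line):
    """Split one 'tweetid \\t aspect \\t polarity' line into its three fields."""
    tweet_id, aspect, polarity = (
        field.strip() for field in line.rstrip("\n").split("\t")
    )
    return tweet_id, aspect, polarity


def combined_label(aspect, polarity):
    """Join aspect and polarity into the single label used for scoring."""
    return f"{aspect}-{polarity}"


def combined_accuracy(gold_rows, pred_rows):
    """Accuracy over combined labels, assuming position-aligned rows."""
    if len(gold_rows) != len(pred_rows):
        raise ValueError("gold and predicted rows must be aligned")
    hits = sum(
        g_id == p_id and combined_label(g_a, g_p) == combined_label(p_a, p_p)
        for (g_id, g_a, g_p), (p_id, p_a, p_p) in zip(gold_rows, pred_rows)
    )
    return hits / len(gold_rows) if gold_rows else 0.0
```

The same combined labels can be fed to any macro-averaging routine in place of the plain polarity labels of Task 1.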

Datasets

The participants of TASS-2017 will use the following corpora for developing their systems.

InterTASS Corpus

International TASS Corpus (InterTASS) is a new corpus released this year for Task 1.

The sentiment of the tweets in the corpus is annotated on a scale of 4 polarity levels: P, NEU, N and NONE. The corpus has three datasets:

  • Training: it is composed of 1008 tweets.
  • Development: it is composed of 506 tweets.
  • Test: it is composed of 1920 tweets.

The three datasets of the corpus are distributed as three XML files. An example of an InterTASS tweet is the following:

<tweet>
	<tweetid>768224728049999872</tweetid>
	<user>caval100</user>
	<content>Se ha terminado #Rio2016 Lamentablemente no arriendo las ganancias al pueblo brasileño por la penuria que les espera Suerte y solidaridad</content>
	<date>2016-08-23 23:13:42</date>
	<lang>es</lang>
	<sentiment>
		<polarity><value>N</value></polarity>
	</sentiment>
</tweet>
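
A file in this layout can be read with Python's standard library. The sketch below is an assumption-laden example rather than an official loader: the function name read_intertass is hypothetical, and since the name of the element that wraps the tweets is not shown here, the code simply looks for every tweet element wherever it sits in the tree.

```python
import xml.etree.ElementTree as ET


def read_intertass(xml_text):
    """Return one dict per <tweet> with its id, text and polarity label."""
    root = ET.fromstring(xml_text)
    rows = []
    # iter() also matches the root itself, so a single bare <tweet> works too.
    for tweet in root.iter("tweet"):
        rows.append(
            {
                "tweetid": tweet.findtext("tweetid"),
                "content": tweet.findtext("content"),
                "polarity": tweet.findtext("sentiment/polarity/value"),
            }
        )
    return rows
```

For the test set, which is released without annotation, the polarity entry would presumably come back as None or empty.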
						

The General Corpus of TASS is still available. Please visit this link for details on how to obtain it.

Social-TV Corpus

This corpus was collected during the 2014 Copa del Rey final in Spain between Real Madrid and F.C. Barcelona, played on 16 April 2014 at Mestalla Stadium in Valencia. Over 1 million tweets were collected from 15 minutes before to 15 minutes after the match. Irrelevant tweets were filtered out and a subset of 2,773 was selected.

All tweets were manually annotated at aspect level, and a tweet may contain more than one aspect. The list of aspects is:

  • Afición
  • Árbitro
  • Autoridades
  • Entrenador
  • Teams: Equipo-Atlético_de_Madrid, Equipo-Barcelona, Equipo-Real_Madrid, Equipo (any other team)
  • Players: Jugador-Alexis_Sánchez, Jugador-Alvaro_Arbeloa, Jugador-Andrés_Iniesta, Jugador-Angel_Di_María, Jugador-Asier_Ilarramendi, Jugador-Carles_Puyol, Jugador-Cesc_Fábregas, Jugador-Cristiano_Ronaldo, Jugador-Dani_Alves, Jugador-Dani_Carvajal, Jugador-Fábio_Coentrão, Jugador-Gareth_Bale, Jugador-Iker_Casillas, Jugador-Isco, Jugador-Javier_Mascherano, Jugador-Jesé_Rodríguez, Jugador-José_Manuel_Pinto, Jugador-Karim_Benzema, Jugador-Lionel_Messi, Jugador-Luka_Modric, Jugador-Marc_Bartra, Jugador-Neymar_Jr., Jugador-Pedro_Rodríguez, Jugador-Pepe, Jugador-Sergio_Busquets, Jugador-Sergio_Ramos, Jugador-Xabi_Alonso, Jugador-Xavi_Hernández, Jugador (any other player)
  • Partido
  • Retransmisión

Sentiment polarity was annotated from the point of view of the Twitter user, using 3 tags: P, NEU and N. No distinction is made between cases where the author does not express any sentiment and cases where the sentiment expressed is neither positive nor negative.

The Social-TV corpus was randomly divided into two sets: training (1,773 tweets) and test (1,000 tweets), with a similar distribution of both aspects and sentiments. The training set will be released so that participants may train and validate their models. The test corpus will be provided without any annotation and will be used to evaluate the results provided by the different systems.

Three sample tweets from the training set are shown here:

<tweet id="456544898791907328">
	<sentiment aspect="Equipo-Real_Madrid" polarity="P">#HalaMadrid</sentiment> ganamos sin <sentiment aspect="Jugador-Cristiano_Ronaldo" polarity="NEU">Cristiano</sentiment>. .perdéis con <sentiment aspect="Jugador-Lionel_Messi" polarity="N">Messi</sentiment>. Hala <sentiment aspect="Equipo-Real_Madrid" polarity="P">Madrid</sentiment>! !!!!!
</tweet>
<tweet id="456544898942906369">
	@nevermind2192 <sentiment aspect="Equipo-Barcelona" polarity="P">Barça</sentiment> por siempre!!
</tweet>
<tweet id="456544898951282688">
	<sentiment aspect="Partido" polarity="NEU">#FinalCopa</sentiment> Hala <sentiment aspect="Equipo-Real_Madrid" polarity="P">Madrid</sentiment>, hala <sentiment aspect="Equipo-Real_Madrid" polarity="P">Madrid</sentiment>, campeón de la <sentiment aspect="Partido" polarity="P">copa del rey</sentiment>
</tweet>
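
Since the annotations are embedded in the tweet text, both the plain text and the list of opinions can be recovered from the same element. The following Python sketch shows one way to do this with the standard library; the function name read_aspect_tweets is hypothetical, and the element that wraps the tweets in the released file is assumed, not documented here.

```python
import xml.etree.ElementTree as ET


def read_aspect_tweets(xml_text):
    """Return id, plain text and (aspect, polarity, span) triples per tweet."""
    root = ET.fromstring(xml_text)
    tweets = []
    for tweet in root.iter("tweet"):
        # itertext() yields the text inside and between the inline tags.
        plain = " ".join("".join(tweet.itertext()).split())
        opinions = [
            (s.get("aspect"), s.get("polarity"), (s.text or "").strip())
            for s in tweet.iter("sentiment")
        ]
        tweets.append({"id": tweet.get("id"), "text": plain, "opinions": opinions})
    return tweets
```

Keeping the annotated span next to the aspect makes it easy to see that one tweet can carry several opinions, even about the same aspect.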

STOMPOL

STOMPOL (corpus of Spanish Tweets for Opinion Mining at aspect level about POLitics) is a corpus of tweets written in Spanish and annotated at aspect level. The topic of the tweets is the political campaign of the 2015 regional and local elections in Spain. The tweets were gathered on April 23-24, and each is related to one of the following political aspects:

  • Economía (Economy): taxes, infrastructure, markets, labor policy...
  • Sanidad (Health System): hospitals, public/private health system, drugs, doctors...
  • Educación (Education): state school, private school, scholarships...
  • Propio_partido (Political party): anything good (speeches, electoral programme...) or bad (corruption, criticism) related to the entity
  • Otros_aspectos (Other aspects): electoral system, environmental policy...

Each aspect is related to one or several entities (separated by the pipe symbol |), each of which corresponds to one of the main political parties in Spain:

  • Partido_Popular (PP)
  • Partido_Socialista_Obrero_Español (PSOE)
  • Izquierda_Unida (IU)
  • Podemos
  • Ciudadanos (Cs)
  • Unión_Progreso_y_Democracia (UPyD)

Each tweet in the corpus was manually annotated by two different annotators, plus a third one in case of disagreement, with the sentiment polarity at aspect level. Sentiment polarity was annotated from the point of view of the Twitter user, using 3 levels: P, NEU and N. No difference is made between no sentiment and neutral sentiment (neither positive nor negative).

Each political aspect is linked to its corresponding political party and its polarity.

Some examples are shown below:

<tweet id="591267548311769088">
	@ahorapodemos @Pablo_Iglesias_ @SextaNocheTV Que alguien pregunte si habrá cambios en las <sentiment aspect="Educacion" entity="Podemos" polarity="NEU">becas</sentiment> MEC para universitarios, por favor.
</tweet>

<tweet id="591192167944736769">
	#Arroyomolinos lo que le interesa al ciudadano son Políticos cercanos que se interesen y preocupen por sus problemas <sentiment aspect="Propio_partido" entity="Union_Progreso_y_Democracia" polarity="P">@UPyD</sentiment> VECINOS COMO TU
</tweet>
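
The entity attribute is what distinguishes this layout from plain aspect annotation. As a minimal Python sketch, with the hypothetical function name read_stompol_opinions and an assumed wrapping element around the tweets, the pipe-separated entities can be split into a list while reading each opinion:

```python
import xml.etree.ElementTree as ET


def read_stompol_opinions(xml_text):
    """Return (tweet_id, aspect, entities, polarity) for every opinion."""
    root = ET.fromstring(xml_text)
    rows = []
    for tweet in root.iter("tweet"):
        for s in tweet.iter("sentiment"):
            # Several entities may share one opinion, separated by "|".
            entities = [e for e in (s.get("entity") or "").split("|") if e]
            rows.append((tweet.get("id"), s.get("aspect"), entities, s.get("polarity")))
    return rows
```

A missing entity attribute yields an empty list rather than an error, which keeps the reader usable on files that only carry aspect and polarity.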

The corpus is made up of 1,284 tweets and has been divided into a training set (784 tweets), which is provided for building and validating the systems, and a test set (500 tweets), which will be used for evaluation.

Licence

Downloading any of these datasets requires that you sign the TASS Corpus Licence Agreement, and send it to tass-sepln@googlegroups.com. The TASS team will send you the password to download all of the datasets. You can find the Licence Agreement here.

If you use the corpus for your research (papers, articles, presentations for conferences or educational purposes), please cite one of the following publications:

Shared Task

Registration

To register for TASS 2017, you have to fill in the Registration Form.

Evaluation

The evaluation web page will be published after the submission deadline.

Datasets downloads

Participants must use the following datasets for developing and evaluating their systems. The Licence has to be signed in order to download the data.

Task 1

InterTASS corpus

Task 2

Social-TV corpus

STOMPOL corpus

Proceedings

The Organization Committee of TASS encourages participants to submit a description paper of their systems. Submitted papers will be reviewed by a scientific committee, and only accepted papers will be published at CEUR, as in previous years (2015 and 2016).

The manuscripts must satisfy the following rules:

  • Up to 6 pages including references and figures, formatted according to the SEPLN template.
  • Articles can be written in English or Spanish. The title, abstract and keywords must be written in both languages.
  • The document format must be Word or LaTeX, but the submission must be in PDF format.
  • Instead of describing the task and/or the corpus, you should focus on the description of your experiments and the analysis of your results, and include a citation to the Overview paper.

Depending on the final number of participants and the time allocated for the workshop, all or a selected group of papers will be presented and discussed in the Workshop session.

Important dates

  • Task proposal deadline: April 15, 2017
  • Decision notification: April 21, 2017
  • Release of training and development corpora: May 1, 2017
  • Release of test corpora: June 20, 2017
  • Registration deadline: June 30, 2017
  • Experiment submission and evaluation: July 1, 2017
  • Paper submission: July 15, 2017
  • Review notification: July 31, 2017
  • Camera ready submission: August 31, 2017
  • Publication: September 15, 2017
  • Workshop: September 19, 2017

Organizing Committee

Program Committee

  • Edgar Casasola Murillo (University of Costa Rica, Costa Rica)
  • Fermín Cruz Mata (University of Sevilla, Spain)
  • Yoan Gutiérrez Vázquez (University of Alicante, Spain)
  • Lluís F. Hurtado (Polytechnic University of Valencia, Spain)
  • Salud María Jiménez Zafra (University of Jaén, Spain)
  • Mª. Teresa Martín Valdivia (University of Jaén, Spain)
  • Manuel Montes Gómez (National Institute of Astrophysics, Optics and Electronics, Mexico)
  • Antonio Moreno Ortíz (University of Málaga, Spain)
  • Preslav Nakov (Qatar Computing Research Institute, Qatar)
  • José Manuel Perea Ortega (University of Extremadura, Spain)
  • Ferrán Pla (Universidad Politécnica de Valencia, Spain)
  • Sara Rosenthal (IBM Research, U.S.A.)
  • Maite Taboada (Simon Fraser University, Canada)
  • L. Alfonso Ureña López (University of Jaén, Spain)

References

  1. Instituto Cervantes. 2016. El español: una lengua viva.