Corpus

TASS 2013 experiments will be based on two different corpora.

General corpus

The general corpus contains over 68 000 Twitter messages written in Spanish by about 150 well-known personalities and celebrities from the worlds of politics, economy, communication, mass media and culture, between November 2011 and March 2012. Although the extraction context has a Spain-focused bias, the authors' diverse nationalities (Spain, Mexico, Colombia, Puerto Rico, USA and many other countries) give the corpus global coverage of the Spanish-speaking world. This makes it possible, for instance, to run experiments on how users from different regions use different varieties of Spanish.

Each Twitter message includes its ID (tweetid), the creation date (date) and the user ID (user). The Twitter API Terms of Service forbid redistributing a corpus that includes text contents or information about users. Redistribution is allowed, however, if those fields are removed and only IDs (Tweet IDs and user IDs) are provided. The actual message content can be easily obtained by querying the Twitter API with the tweetid.

The general corpus has been divided into two sets: training (about 10%) and test (90%). The training set will be released so that participants may train and validate their models for classification and sentiment analysis. The test corpus will be provided without any tagging and will be used to evaluate the results provided by the different systems.

Each message in both the training and test set is tagged with its global polarity, indicating whether the text expresses a positive, negative or neutral sentiment, or no sentiment at all. Five levels have been defined: strong positive (P+), positive (P), neutral (NEU), negative (N) and strong negative (N+), plus one additional no-sentiment tag (NONE).
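As a rough illustration of the tag set, the six labels can be kept in a small lookup table. The Python sketch below uses hypothetical names (`POLARITY_LABELS`, `collapse_polarity`), and the collapse onto a coarser scale (P+ to P, N+ to N) is an assumption about how one might simplify the labels, not something defined by the corpus.

```python
# The six global polarity tags: five levels plus the no-sentiment tag.
POLARITY_LABELS = {
    "P+": "strong positive",
    "P": "positive",
    "NEU": "neutral",
    "N": "negative",
    "N+": "strong negative",
    "NONE": "no sentiment",
}


def collapse_polarity(tag: str) -> str:
    """Map a tag onto a coarser scale (assumed mapping: P+ -> P, N+ -> N)."""
    if tag not in POLARITY_LABELS:
        raise ValueError(f"unknown polarity tag: {tag!r}")
    # Only the strong tags end in "+", so stripping it leaves the others as-is.
    return tag.rstrip("+")
```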

In addition, each polarity tag indicates the level of agreement or disagreement of the sentiment expressed within the content, with two possible values: AGREEMENT and DISAGREEMENT. This is especially useful for telling whether a neutral sentiment comes from neutral keywords or from a text that contains positive and negative sentiments at the same time.
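One possible use of this flag, sketched in Python under the reading given above: a neutral message tagged DISAGREEMENT is treated as mixing positive and negative sentiment, while a neutral message tagged AGREEMENT is treated as plainly neutral. The function name `is_mixed_neutral` is hypothetical.

```python
def is_mixed_neutral(value: str, agreement: str) -> bool:
    """True when a neutral tag comes with conflicting sentiments in the text."""
    return value == "NEU" and agreement == "DISAGREEMENT"
```

For example, `is_mixed_neutral("NEU", "DISAGREEMENT")` is true, while any non-neutral value or an AGREEMENT flag yields false.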

Moreover, the polarity at entity level, i.e., the polarity values related to the entities mentioned in the text, is also included where applicable. These values are likewise divided into 5 levels and include the level of agreement for each entity.

In addition, a set of topics has been selected based on the thematic areas covered by the corpus, such as "política" ("politics"), "fútbol" ("soccer"), "literatura" ("literature") or "entretenimiento" ("entertainment"). Each message in both the training and test set has been assigned to one or several of these topics (most messages are associated with just one topic, due to the short length of the text).

All tagging has been done semiautomatically: a baseline machine learning model is run first, and then all tags are manually checked by human experts. For the polarity at entity level, due to the high volume of data to check, tagging has only been done for the training set.

Format

The corpus is written in XML as defined by the following tweets.xsd schema, in which the text of the content entity has been removed to follow the Twitter restrictions.

The following figure shows the information of two sample tweets. The first tweet is tagged only with the global polarity, as the text contains no mention of any entity. The second one is tagged with both the global polarity of the message and the polarity associated with each of the entities that appear in the text (UPyD and Foro Asturias).

		<tweet>
			<tweetid>0000000000</tweetid>
			<user>usuario0</user>
			<content><![CDATA['Conozco a alguien q es adicto al drama! Ja ja ja te suena d algo!]]></content>
			<date>2011-12-02T02:59:03</date>
			<lang>es</lang>
			<sentiments>
				<polarity><value>P+</value><type>AGREEMENT</type></polarity>
			</sentiments>
			<topics>
				<topic>entretenimiento</topic>
			</topics>
		</tweet>

		<tweet>
			<tweetid>0000000001</tweetid>
			<user>usuario1</user>
			<content><![CDATA['UPyD contará casi seguro con grupo gracias al Foro Asturias.]]></content>
			<date>2011-12-02T00:21:01</date>
			<lang>es</lang>
			<sentiments>
				<polarity><value>P</value><type>AGREEMENT</type></polarity>
				<polarity><entity>UPyD</entity><value>P</value><type>AGREEMENT</type></polarity>
				<polarity><entity>Foro_Asturias</entity><value>P</value><type>AGREEMENT</type></polarity>
			</sentiments>
			<topics>
				<topic>política</topic>
			</topics>
		</tweet>				
	

The general-tweets-sample.xml file contains a subset of the corpus with around 30 tagged tweets.
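A minimal Python sketch of how such a file might be read with the standard library's `xml.etree.ElementTree`. The `<tweets>` root element, the `SAMPLE` string and the `parse_tweets` helper are assumptions made for illustration (the figure only shows individual `<tweet>` elements), and the content element is left empty, as in the distributed corpus. A `<polarity>` without an `<entity>` child is taken to be the global polarity.

```python
import xml.etree.ElementTree as ET

# Sample based on the second tweet shown above. The <tweets> wrapper is an
# assumption; the text content is empty, as in the distributed corpus.
SAMPLE = """<tweets>
  <tweet>
    <tweetid>0000000001</tweetid>
    <user>usuario1</user>
    <content></content>
    <date>2011-12-02T00:21:01</date>
    <lang>es</lang>
    <sentiments>
      <polarity><value>P</value><type>AGREEMENT</type></polarity>
      <polarity><entity>UPyD</entity><value>P</value><type>AGREEMENT</type></polarity>
      <polarity><entity>Foro_Asturias</entity><value>P</value><type>AGREEMENT</type></polarity>
    </sentiments>
    <topics>
      <topic>política</topic>
    </topics>
  </tweet>
</tweets>"""


def parse_tweets(xml_text):
    """Return one dict per <tweet>, separating global and entity-level polarity."""
    root = ET.fromstring(xml_text)
    tweets = []
    for tweet in root.iter("tweet"):
        record = {
            "tweetid": tweet.findtext("tweetid"),
            "user": tweet.findtext("user"),
            "date": tweet.findtext("date"),
            "global": None,   # (value, type) of the message-level polarity
            "entities": {},   # entity name -> (value, type)
            "topics": [t.text for t in tweet.iterfind("topics/topic")],
        }
        for pol in tweet.iterfind("sentiments/polarity"):
            tag = (pol.findtext("value"), pol.findtext("type"))
            entity = pol.findtext("entity")
            if entity is None:
                record["global"] = tag
            else:
                record["entities"][entity] = tag
        tweets.append(record)
    return tweets
```

Calling `parse_tweets(SAMPLE)` yields a single record whose `entities` dictionary has one entry for UPyD and one for Foro_Asturias. Tweet IDs are kept as strings so that leading zeros and large values survive unchanged.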

Politics corpus

The Politics corpus contains 2 500 tweets gathered during the electoral campaign of the 2011 general elections in Spain (Elecciones a Cortes Generales de 2011), selected from Twitter messages that mention any of the four main national-level political parties: Partido Popular (PP), Partido Socialista Obrero Español (PSOE), Izquierda Unida (IU) and Unión, Progreso y Democracia (UPyD).

As in the General corpus, the global polarity and the polarity at entity level for those four entities have been manually tagged for all messages. In this case, however, only 3 levels are used: positive (P), neutral (NEU) and negative (N), plus one additional no-sentiment tag (NONE).

Format

The format is the same as for the General corpus: XML as defined by the same tweets.xsd schema, where the text of the content entity has been removed to follow the Twitter restrictions. The only difference is that the entity element includes a source attribute that indicates the political party to which the entity refers: PP, PSOE, IU or UPyD.

The following figure shows the information of one sample tweet.

		<tweet>
			<tweetid>137231808990412800</tweetid>
			<user>marianarajay</user>
			<content><![CDATA['@marianorajoy Por favor, amigosh, no me votéish que me lo he penshado mejor con este tshunami que se me viene encima.]]></content>
			<date>2011-10-17T19:13:07</date>
			<lang>es</lang>
			<sentiments>
				<polarity><value>N</value><type>AGREEMENT</type></polarity>
				<polarity><entity source="PP">@marianorajoy</entity><value>N</value><type>AGREEMENT</type></polarity>
			</sentiments>
			<topics>
				<topic>política</topic>
			</topics>
		</tweet>	
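A short Python sketch of reading the source attribute from an entity-level polarity, again with `xml.etree.ElementTree`. The `POLARITY_XML` string is copied from the sample above; the helper name `entity_polarity` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Entity-level <polarity> element copied from the sample tweet above.
POLARITY_XML = (
    '<polarity><entity source="PP">@marianorajoy</entity>'
    "<value>N</value><type>AGREEMENT</type></polarity>"
)


def entity_polarity(polarity_xml):
    """Return (party, mention, value, agreement) for an entity-level <polarity>."""
    pol = ET.fromstring(polarity_xml)
    entity = pol.find("entity")
    if entity is None:
        # A <polarity> without <entity> is the global polarity of the message.
        raise ValueError("global polarity: no <entity> child")
    return (
        entity.get("source"),   # party: PP, PSOE, IU or UPyD
        entity.text,            # the mention as it appears in the tweet
        pol.findtext("value"),
        pol.findtext("type"),
    )
```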

We would be very grateful if you told us about any problem you detect or any correction you make to the corpus, so that we can make it available to the community.

Downloads

Public files


Password protected area


Request a password

The corpus will be made freely available to the community after the workshop. Please send an email with your email address, affiliation (institution, company or any kind of organization) and a brief description of your research objectives, and you will be given a password to download the files in the password-protected area.

Citing TASS

If you use the corpus in your research, please include a citation to the paper and/or the website:

Proceedings of the TASS workshop at SEPLN 2013. Actas del XXIX Congreso de la Sociedad Española de Procesamiento de Lenguaje Natural. IV Congreso Español de Informática. 17-20 September 2013, Madrid, Spain. Díaz Esteban, Alberto; Alegría, Iñaki; Villena Román, Julio (eds). ISBN: 978-84-695-8349-4. http://www.congresocedi.es/images/site/actas/ActasSEPLN.pdf.

Villena-Román, Julio, Lana-Serrano, Sara, Martínez-Cámara, Eugenio, González-Cristobal, José Carlos. 2013. TASS (2012) - Workshop on Sentiment Analysis at SEPLN. Revista de Procesamiento del Lenguaje Natural, 50, pp 37-44. http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/4657.

TASS (Taller de Análisis de Sentimientos en la SEPLN) website. http://www.daedalus.es/TASS.

