Datasets

English

Voice Likability database

The likability database is a subset of the AGender database; both were created at the Telekom Innovation Laboratories. The AGender data contains sentences from German speakers recorded over the telephone, equally distributed across seven age-gender groups. From it, 800 utterances were taken (one utterance per speaker, excluding the children), and the likability of each voice was rated on a 7-point scale by at least 16 different people.

The database is described in this paper:

F. Burkhardt, B. Schuller, B. Weiss, F. Weninger: "Would You Buy A Car From Me?" - On the Likability of Telephone Voices, Interspeech, 2011

Rada Mihalcea's Resources

URL: http://web.eecs.umich.edu/~mihalcea/downloads.html
Several resources: alignment between lexical databases, semantic relatedness, subjectivity, sentiment, ...

Emotions datasets by Media Core @ UFL

http://csea.phhp.ufl.edu/media.html
- International Affective Picture System (IAPS)
- International Affective Digitized Sounds (IADS)
- Affective Norms for English Words (ANEW). ANEW provides a set of normative emotional ratings for a large number of words in the English language. This set of verbal materials has been rated in terms of pleasure, arousal, and dominance to complement the existing International Affective Picture System (IAPS; Lang, Bradley, & Cuthbert, 1999) and International Affective Digitized Sounds (IADS; Bradley & Lang, 1999), which are collections of picture and sound stimuli, respectively, that also include these affective ratings.
- Affective Norms for English Text (ANET). ANET provides normative ratings of emotion (pleasure, arousal, dominance) for a large set of brief texts in the English language for use in experimental investigations of emotion and attention. ANET is being developed and distributed by the Center for the Study of Emotion and Attention (CSEA) at the University of Florida.
- The Self-Assessment Manikin (SAM). SAM is a non-verbal pictorial assessment technique that directly measures the pleasure, arousal, and dominance associated with a person's affective reaction to a wide variety of stimuli.

SenTube Corpus

Source: "Sentiment analysis of Youtube videos with joint models of text and speech" http://ikernels-portal.disi.unitn.it/projects/sentube/
Description: The SenTube corpus is available for research and commercial purposes. The comments corpus (16MB) can be downloaded from the project page. Video files are available on request.
Restrictions: None. Author contacted (Olga Uryupina, uryupina@gmail.com).

Multiclass Twitter Emotion Corpus

Source: "Feature Specific Sentiment Analysis for Product Review" http://people.mpi-inf.mpg.de/~smukherjee/
Description

1. Dataset1 (1257 reviews from different domains annotated in 2 classes - positive or negative)
2. Dataset2 (3834 reviews from different domains annotated in 2 classes - positive or negative)
3. Dataset3 (425 reviews from 3 domains annotated in 2 classes - positive or negative)

Restrictions: None. Author contacted (Smukherjee, smukherjee@mpi-inf.mpg.de); contact the author if we use it.

LoughranMcDonald

Source: http://provalisresearch.com/products/content-analysis-software/wordstat-dictionary/sentiment-dictionaries/
Description: Financial sentiment dictionary
Restrictions: Academic / citation

SentiWordNet

Source: http://sentiwordnet.isti.cnr.it/
Description: Sentiment dictionary based on WordNet
Restrictions: Academic / citation
Note: A Python interface is available: Sentiwordnet-BC, a binding class for SentiWordNet (English and Spanish linked) to perform sentiment analysis and opinion mining. http://github.com/rmaestre/Sentiwordnet-BC

Sentiment and Emotion lexicons

http://saifmohammad.com/WebPages/lexicons.html
It includes:
- NRC Word-Emotion Association Lexicon. Sentiments: negative, positive. Emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, trust.
- NRC Hashtag Emotion Lexicon and Corpus.
- MaxDiff General Domain Sentiment Lexicon
- NRC Hashtag Sentiment Lexicon
- Sentiment140 Lexicon
- Yelp Restaurant Sentiment Lexicon
- Amazon Sentiment Lexicon
- NRC Word-Colour Association Lexicon

Various corpora for Aspect Based Sentiment Analysis

SEMEVAL 2015: http://alt.qcri.org/semeval2015/task12/

SEMEVAL 2014: http://alt.qcri.org/semeval2014/task4/

Both tasks only have pos/neg/neutral classifications, which we believe is not enough for many applications. We would like to see the full continuous values for a number of reasons (including the way our annotators work), but at least a 5 class system (highly_neg, neg, neutral, pos, highly_pos) would be nice.
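As an illustration of the 5-class scheme suggested above, here is a minimal sketch of how a continuous polarity score could be bucketed into those five labels. The [-1, 1] score range and the 0.2 / 0.6 cut-offs are illustrative assumptions, not values taken from SemEval or from any annotation guideline.

```python
def to_five_classes(score, weak=0.2, strong=0.6):
    """Map a continuous polarity score in [-1, 1] to one of five labels.

    `weak` and `strong` are hypothetical thresholds chosen for illustration;
    they should be tuned to the annotation scheme actually in use.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie in [-1, 1]")
    if score <= -strong:
        return "highly_neg"
    if score <= -weak:
        return "neg"
    if score < weak:
        return "neutral"
    if score < strong:
        return "pos"
    return "highly_pos"


labels = [to_five_classes(s) for s in (-0.9, -0.3, 0.0, 0.3, 0.9)]
```

Keeping the continuous score and deriving classes on demand means the thresholds can be changed later without re-annotating anything.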

Multilingual

Multilingual Sentiment Analysis Twitter Corpus

URL: http://www.win.tue.nl/~mpechen/projects/smm/ http://www.win.tue.nl/~mpechen/projects/pdfs/Tromp2011.pdf
Source: "Feature Specific Sentiment Analysis for Product Review" http://people.mpi-inf.mpg.de/~smukherjee/
Description

- CINLP_datasets.zip (description.txt): Preprocessed labeled Twitter datasets, one automatically annotated and two manually annotated, as used in Tromp et al., 2013, submission to the CINLP special issue.
- Turkish_Movie_Sentiment.zip and Turkish_Products_Sentiment.zip (description.txt): Movie reviews and multi-domain product reviews (both in Turkish), as used in Demirtas & Pechenizkiy, WISDOM@KDD'13 (cross-lingual polarity detection with machine translation).
- LIGA_Benelearn11_dataset.zip (description.txt): Preprocessed labeled Twitter data in six languages, as used in Tromp & Pechenizkiy, Benelearn 2011.
- SA_Datasets_Thesis.zip (description.txt): All preprocessed datasets as used in Tromp 2011, MSc Thesis.

Restrictions: None.

Multilingual sentiment lexicons

Source: https://sites.google.com/site/datascienceslab/projects/multilingualsentiment
Description: Sentiment lexicons for multiple languages
Restrictions: Academic / citation

Sentistrength

Source: http://sentistrength.wlv.ac.uk/
Description: Lexicon in English and Spanish with annotated tweets
Restrictions: Academic / citation

Emoticon Sentiment Lexicon

Source: http://people.few.eur.nl/hogenboom/files/EmoticonSentimentLexicon.zip
Description: This emoticon sentiment lexicon was created by:

- Alexander Hogenboom (hogenboom@ese.eur.nl, Erasmus University Rotterdam)
- Daniella Bal (daniella.bal@xs4all.nl, Erasmus University Rotterdam)
- Flavius Frasincar (frasincar@ese.eur.nl, Erasmus University Rotterdam)
- Malissa Bal (malissa.bal@xs4all.nl, Erasmus University Rotterdam)
- Franciska de Jong (f.m.g.dejong@utwente.nl, Universiteit Twente/Erasmus University Rotterdam)
- Uzay Kaymak (u.kaymak@ieee.org, Eindhoven University of Technology)

Each line in the lexicon file (EmoticonSentimentLexicon.txt) contains an emoticon and its associated sentiment according to our human annotators, separated by a tab. The sentiment of an emoticon is either -1 (negative), 0 (neutral), or 1 (positive).

Restrictions: Academic / citation
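
Given the file format described for this entry (one emoticon and its sentiment per line, separated by a tab), a minimal loading sketch could look like the following. The function names and the UTF-8 default are assumptions made for illustration; the actual encoding of EmoticonSentimentLexicon.txt should be checked before use.

```python
def parse_emoticon_lexicon(lines):
    """Parse 'emoticon<TAB>sentiment' lines into a dict of emoticon -> int."""
    lexicon = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so the emoticon itself is kept intact.
        emoticon, sep, score = line.rpartition("\t")
        if not sep or not emoticon:
            continue  # skip lines that do not match the expected format
        lexicon[emoticon] = int(score)  # -1, 0, or 1
    return lexicon


def load_emoticon_lexicon(path, encoding="utf-8"):
    """Read the lexicon file from disk (the encoding is an assumption)."""
    with open(path, encoding=encoding) as handle:
        return parse_emoticon_lexicon(handle)


# Small in-memory sample in the same format, used instead of the real file.
sample = [":-)\t1\n", ":-(\t-1\n", ":-|\t0\n", "\n"]
lexicon = parse_emoticon_lexicon(sample)
```

With the downloaded file in place, `load_emoticon_lexicon("EmoticonSentimentLexicon.txt")` would return the same kind of dictionary.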

Twitter Sentiment Analysis Training Corpus

Source: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
Description: The Twitter Sentiment Analysis Dataset contains 1,578,627 classified tweets. Each row is marked as 1 for positive sentiment or 0 for negative sentiment.
Restrictions: None.

TBOD Feature-based Sentiment Corpus

Source: http://www.lsi.us.es/~fermin/index.php/Datasets
Description: TBOD corpus for feature-based sentiment analysis in three domains: cars, headphones and hotels.
Restrictions: Academic / citation

Spanish

Corpus Cine (Spanish cinema)

Source: http://www.lsi.us.es/~fermin/index.php/Datasets
Description: Annotated reviews from www.muchocine.net
Restrictions: Academic / citation

ElhPolar dictionary

Source: http://komunitatea.elhuyar.org/ig/files/2013/10/ElhPolar_esV1.lex
Description: The ElhPolar polarity lexicon for Spanish was created from different sources, and includes both negative and positive words. A detailed description of the content, as well as the way the lexicon was built, can be found in the following publication: Saralegi X., San Vicente I. 2013. "Elhuyar at TASS 2013". In Proceedings of "XXIX Congreso de la Sociedad Española de Procesamiento de Lenguaje Natural". Workshop on Sentiment Analysis at SEPLN (TASS2013). Madrid. pp. 143-150. ISBN: 978-84-695-8349-4
Restrictions: none

ISOL

Source: http://timm.ujaen.es/recursos/isol/
Description: iSOL is a domain-independent list of opinion-indicating words in Spanish.

The resource was built starting from the word list maintained by Professor Bing Liu (Bing Liu's Opinion Lexicon). The list was translated automatically using the Reverso translator and then corrected manually. It consists of 2,509 positive words and 5,626 negative words. For more information on how the list was developed, see the article: Bilingual Experiments on an Opinion Comparable Corpus

Restrictions: academic/citation

Sentiment Spanish Lexicon

Source: http://web.eecs.umich.edu/~mihalcea/downloads.html#SPANISH_SENT_LEXICONS
Description: This resource contains two polarity lexicons in Spanish. The lexicons have been automatically or semi-automatically generated (April 3, 2012).
Restrictions: academic/citation

ML-SentiCON Sentiment Spanish Lexicon

Source: http://www.lsi.us.es/~fermin/index.php/Datasets
Description: Multilingual, layered sentiment lexicons at lemma level. This resource contains lemma-level sentiment lexicons for English, Spanish, Catalan, Basque and Galician. For each lemma, it provides an estimate of polarity (from very negative, -1.0, to very positive, +1.0) and a standard deviation (related to the ambiguity of the polarity estimate; please refer to the paper for further details).

Restrictions: academic/citation

TASS - Twitter Sentiment Analysis in Spanish

Source

TASS 2013 http://www.daedalus.es/TASS2013/corpus.php
TASS 2014 http://www.daedalus.es/TASS2014/tass2014.php
TASS 2015 http://www.daedalus.es/TASS2015/tass2015.php#corpus

Description

- The general corpus contains over 68 000 Twitter messages, written in Spanish by about 150 well-known personalities and celebrities of the world of politics, economy, communication, mass media and culture, between November 2011 and March 2012. Although the context of extraction has a Spain-focused bias, the diverse nationality of the authors, including people from Spain, Mexico, Colombia, Puerto Rico, USA and many other countries, makes the corpus reach a global coverage in the Spanish-speaking world.
- Tasks 2013: Task 1: Sentiment Analysis at global level, Task 2: Topic classification, Task 3: Sentiment Analysis at entity level, Task 4: Political tendency identification
- Tasks 2014: (legacy) Task 1: Sentiment Analysis at global level, (legacy) Task 2: Topic classification, (new) Task 3: Aspect detection, (new) Task 4: Aspect-based sentiment analysis
- Tasks 2015: Task 1: Sentiment Analysis at global level and Task 2: Aspect-based sentiment analysis

Restrictions: Private access. We should include citation.

German

German SentiWortschatz

Source: http://asv.informatik.uni-leipzig.de/download/sentiws.html
Description: SentimentWortschatz, or SentiWS for short, is a publicly available German-language resource for sentiment analysis, opinion mining, etc. It lists positive and negative polarity-bearing words weighted within the interval [-1; 1], plus their part-of-speech tag and, if applicable, their inflections. The current version of SentiWS (v1.8b) contains 1,650 positive and 1,818 negative words, which sum up to 15,649 positive and 15,632 negative word forms including their inflections, respectively. It contains not only adjectives and adverbs that explicitly express a sentiment, but also nouns and verbs that implicitly carry one.
Restrictions: none