Datasets

From Linked Data Models for Emotion and Sentiment Analysis Community Group

English

Voice Likability database

The likability database is a subset of the AGender database; both were created at the Telekom Innovation Laboratories. The AGender data contains sentences from German speakers, recorded over the telephone and equally distributed across seven age-gender groups. From it, 800 utterances were taken (one utterance per speaker, excluding the children), and the likability of each voice was rated on a 7-point scale by at least 16 different people.

The following paper describes the database:

F. Burkhardt, B. Schuller, B. Weiss, F. Weninger: "Would You Buy A Car From Me?" - On the Likability of Telephone Voices, Interspeech, 2011

Rada Mihalcea's Resources

URL
http://web.eecs.umich.edu/~mihalcea/downloads.html
Several resources
alignment between lexical databases, semantic relatedness, subjectivity, sentiment, ...

Emotions datasets by Media Core @ UFL

  • http://csea.phhp.ufl.edu/media.html
    • International Affective Picture System (IAPS)
    • International Affective Digitized Sounds (IADS)
    • Affective Norms for English Words (ANEW). The Affective Norms for English Words (ANEW) provides a set of normative emotional ratings for a large number of words in the English language. This set of verbal materials has been rated in terms of pleasure, arousal, and dominance to complement the existing International Affective Picture System (IAPS; Lang, Bradley, & Cuthbert, 1999) and International Affective Digitized Sounds (IADS; Bradley & Lang, 1999), which are collections of picture and sound stimuli, respectively, that also include these affective ratings.
    • Affective Norms for English Text (ANET). The Affective Norms for English Text (ANET) provides normative ratings of emotion (pleasure, arousal, dominance) for a large set of brief texts in the English language for use in experimental investigations of emotion and attention. The ANET is being developed and distributed by the Center for the Study of Emotion and Attention (CSEA) at the University of Florida.
    • The Self-Assessment Manikin (SAM). The Self-Assessment Manikin (SAM) is a non-verbal pictorial assessment technique that directly measures the pleasure, arousal, and dominance associated with a person's affective reaction to a wide variety of stimuli.


Sentube Corpus

Source
"Sentiment analysis of Youtube videos with joint models of text and speech" http://ikernels-portal.disi.unitn.it/projects/sentube/
Description
The SenTube corpus is available for research and commercial purposes. The comments corpus can be downloaded from here (16MB). Video files are available on request.
Restrictions
None. Author contacted (Olga Uryupina uryupina@gmail.com)

Multiclass Twitter Emotion Corpus

Source
"Feature Specific Sentiment Analysis for Product Review" http://people.mpi-inf.mpg.de/~smukherjee/
Description
  • 1. Dataset1 (1257 reviews from different domains annotated in 2 classes - positive or negative)
  • 2. Dataset2 (3834 reviews from different domains annotated in 2 classes - positive or negative)
  • 3. Dataset3 (425 reviews from 3 domains annotated in 2 classes - positive or negative)
Restrictions
None. Author contacted (Smukherjee smukherjee@mpi-inf.mpg.de); contact the author if we use it.

LoughranMcDonald

Source
http://provalisresearch.com/products/content-analysis-software/wordstat-dictionary/sentiment-dictionaries/
Description
Financial sentiment dictionary
Restrictions
Academic / citation

SentiWordNet

Source
http://sentiwordnet.isti.cnr.it/
Description
Sentiment dictionary based on WordNet
Restrictions
Academic / citation
Note
A Python interface is available: Sentiwordnet-BC, a SentiWordNet binding class (English and Spanish linked) for sentiment analysis and opinion mining. http://github.com/rmaestre/Sentiwordnet-BC

Sentiment and Emotion lexicons

Emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, trust

    • NRC Hashtag Emotion Lexicon and Corpus
    • MaxDiff General Domain Sentiment Lexicon
    • NRC Hashtag Sentiment Lexicon
    • Sentiment140 Lexicon
    • Yelp Restaurant Sentiment Lexicon
    • Amazon Sentiment Lexicon
    • NRC Word-Colour Association Lexicon

Various corpora for Aspect Based Sentiment Analysis

SEMEVAL 2015: http://alt.qcri.org/semeval2015/task12/


SEMEVAL 2014: http://alt.qcri.org/semeval2014/task4/


Both tasks only have pos/neg/neutral classifications, which we believe is not enough for many applications. We would like to see the full continuous values for a number of reasons (including the way our annotators work), but at least a 5 class system (highly_neg, neg, neutral, pos, highly_pos) would be nice.
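As a rough illustration of the 5 class system mentioned above, the sketch below bins a continuous polarity score into the five labels. The score range of [-1.0, 1.0] and the cut points are assumptions chosen for the example; neither SemEval task defines them.

```python
from bisect import bisect_right

# Hypothetical cut points on an assumed [-1.0, 1.0] polarity scale.
# They are illustrative only and are not defined by SemEval.
THRESHOLDS = (-0.6, -0.2, 0.2, 0.6)
LABELS = ("highly_neg", "neg", "neutral", "pos", "highly_pos")


def to_five_class(score: float) -> str:
    """Map a continuous polarity score to one of five class labels."""
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score {score!r} is outside the assumed range [-1.0, 1.0]")
    # bisect_right counts how many thresholds are <= score,
    # which is the index of the bin the score falls into.
    return LABELS[bisect_right(THRESHOLDS, score)]


if __name__ == "__main__":
    for value in (-0.9, -0.4, 0.0, 0.5, 0.95):
        print(value, to_five_class(value))
```

Keeping the continuous value alongside the label lets applications choose their own cut points later.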

Multilingual

Multilingual Sentiment Analysis Twitter Corpus

URL
http://www.win.tue.nl/~mpechen/projects/smm/ http://www.win.tue.nl/~mpechen/projects/pdfs/Tromp2011.pdf
Source
"Feature Specific Sentiment Analysis for Product Review" http://people.mpi-inf.mpg.de/~smukherjee/
Description
  • CINLP_datasets.zip (description.txt) Preprocessed labeled Twitter datasets, one automatically annotated and two manually annotated as used in Tromp et al, 2013, submission to CINLP special issue.
  • Turkish_Movie_Sentiment.zip and Turkish_Products_Sentiment.zip (description.txt): Movie reviews and multi-domain product reviews (both in Turkish) dataset as used in Demirtas & Pechenizkiy, WISDOM@KDD'13 (cross-lingual polarity detection with machine translation).
  • LIGA_Benelearn11_dataset.zip (description.txt) Preprocessed labeled Twitter data in six languages, used in Tromp & Pechenizkiy, Benelearn 2011
  • SA_Datasets_Thesis.zip (description.txt) All preprocessed datasets as used in Tromp 2011, MSc Thesis

Restrictions
None.

Multilingual sentiment lexicons

Source
https://sites.google.com/site/datascienceslab/projects/multilingualsentiment
Description
Financial sentiment dictionary
Restrictions
Academic / citation

Sentistrength

Source
http://sentistrength.wlv.ac.uk/
Description
Lexicon in English and Spanish with annotated tweets
Restrictions
Academic / citation


Emoticon Sentiment Lexicon

Source
http://people.few.eur.nl/hogenboom/files/EmoticonSentimentLexicon.zip
Description
This emoticon sentiment lexicon was created by:

  • Alexander Hogenboom (hogenboom@ese.eur.nl, Erasmus University Rotterdam)
  • Daniella Bal (daniella.bal@xs4all.nl, Erasmus University Rotterdam)
  • Flavius Frasincar (frasincar@ese.eur.nl, Erasmus University Rotterdam)
  • Malissa Bal (malissa.bal@xs4all.nl, Erasmus University Rotterdam)
  • Franciska de Jong (f.m.g.dejong@utwente.nl, Universiteit Twente/Erasmus University Rotterdam)
  • Uzay Kaymak (u.kaymak@ieee.org, Eindhoven University of Technology)

Each line in the lexicon file (EmoticonSentimentLexicon.txt) contains an emoticon and its associated sentiment according to our human annotators, separated by a tab. The sentiment of an emoticon is either -1 (negative), 0 (neutral), or 1 (positive).

Restrictions
Academic / citation
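
The file format described for the Emoticon Sentiment Lexicon (one emoticon and its sentiment per line, separated by a tab) could be read along the following lines. This is a minimal sketch: the function name is made up for illustration, the sample lines are invented rather than taken from the lexicon, and the file encoding is not stated by the source.

```python
import io
from typing import Dict, TextIO

VALID_SENTIMENTS = {-1, 0, 1}  # negative, neutral, positive


def load_emoticon_lexicon(handle: TextIO) -> Dict[str, int]:
    """Read 'emoticon<TAB>sentiment' lines into a dictionary."""
    lexicon: Dict[str, int] = {}
    for line_number, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so the emoticon itself is left untouched.
        emoticon, separator, value = line.rpartition("\t")
        if not separator:
            raise ValueError(f"line {line_number}: no tab separator found")
        sentiment = int(value)
        if sentiment not in VALID_SENTIMENTS:
            raise ValueError(f"line {line_number}: unexpected sentiment {sentiment}")
        lexicon[emoticon] = sentiment
    return lexicon


# Invented sample lines in the described layout (not copied from the lexicon).
sample = io.StringIO(":-)\t1\n:-(\t-1\n:-|\t0\n")
lexicon = load_emoticon_lexicon(sample)
print(lexicon)
```

For the real file, the same function can be given an open handle to EmoticonSentimentLexicon.txt; the encoding to pass to open() is an assumption to verify against the download.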

Twitter Sentiment Analysis Training Corpus

URL
http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
Source
http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
Description
The Twitter Sentiment Analysis Dataset contains 1,578,627 classified tweets; each row is marked as 1 for positive sentiment and 0 for negative sentiment.
Restrictions
None.

TBOD Feature-based Sentiment Corpus

Source
http://www.lsi.us.es/~fermin/index.php/Datasets
Description
TBOD Corpus for feature-based sentiment analysis in three domains: cars, headphones and hotels.
Restrictions
Academic / citation


Spanish

Corpus Cine (Spanish cinema)

Source
http://www.lsi.us.es/~fermin/index.php/Datasets
Description
Annotated reviews from www.muchocine.net
Restrictions
Academic / citation

ElhPolar dictionary

Source
http://komunitatea.elhuyar.org/ig/files/2013/10/ElhPolar_esV1.lex
Description
The ElhPolar polarity lexicon for Spanish was created from different sources and includes both negative and positive words. A detailed description of the content, as well as the way the lexicon was built, can be found in the following publication: Saralegi X., San Vicente I. 2013. "Elhuyar at TASS 2013". In Proceedings of "XXIX Congreso de la Sociedad Española de Procesamiento de lenguaje natural". Workshop on Sentiment Analysis at SEPLN (TASS2013). Madrid. pp. 143-150. ISBN: 978-84-695-8349-4
Restrictions
none

ISOL

Source
http://timm.ujaen.es/recursos/isol/
Description
iSOL is a domain-independent list of opinion-indicating words in Spanish.

The resource was built from the word list maintained by Professor Bing Liu (Bing Liu's Opinion Lexicon). The list was translated automatically with the Reverso translator and then corrected manually. It consists of 2,509 positive words and 5,626 negative words. For more information on how the list was developed, see the paper: Bilingual Experiments on an Opinion Comparable Corpus

Restrictions
Academic / citation

Sentiment Spanish Lexicon

Source
http://web.eecs.umich.edu/~mihalcea/downloads.html#SPANISH_SENT_LEXICONS
Description
This resource contains two polarity lexicons in Spanish. The lexicons have been automatically or semi-automatically generated (April 3, 2012).
Restrictions
academic/citation


ML-SentiCON Sentiment Spanish Lexicon

Source
http://www.lsi.us.es/~fermin/index.php/Datasets
Description
Multilingual, layered sentiment lexicons at lemma level. This resource contains lemma-level sentiment lexicons for English, Spanish, Catalan, Basque and Galician. For each lemma, it provides an estimation of polarity (from very negative, -1.0, to very positive, +1.0) and a standard deviation (related to the ambiguity of the polarity estimation; please refer to the paper for further details).
Restrictions
academic/citation

TASS - Twitter Sentiment Analysis in Spanish

Source
Description
    • The general corpus contains over 68 000 Twitter messages, written in Spanish by about 150 well-known personalities and celebrities of the world of politics, economy, communication, mass media and culture, between November 2011 and March 2012. Although the context of extraction has a Spain-focused bias, the diverse nationality of the authors, including people from Spain, Mexico, Colombia, Puerto Rico, USA and many other countries, makes the corpus reach a global coverage in the Spanish-speaking world.
    • Tasks 2013: Task 1: Sentiment Analysis at global level, Task 2: Topic classification, Task 3: Sentiment Analysis at entity level, Task 4: Political tendency identification
    • Tasks 2014: (legacy) Task 1: Sentiment Analysis at global level, (legacy) Task 2: Topic classification, (new) Task 3: Aspect detection, (new) Task 4: Aspect-based sentiment analysis
    • Tasks 2015: Task 1: Sentiment Analysis at global level and Task 2: Aspect-based sentiment analysis
Restrictions
Private access. We should include citation.

German

German SentiWortschatz

URL
http://asv.informatik.uni-leipzig.de/download/sentiws.html
Source
http://asv.informatik.uni-leipzig.de/download/sentiws.html
Description
SentimentWortschatz, or SentiWS for short, is a publicly available German-language resource for sentiment analysis, opinion mining, etc. It lists positive and negative polarity-bearing words weighted within the interval [-1; 1], plus their part-of-speech tag and, if applicable, their inflections. The current version of SentiWS (v1.8b) contains 1,650 positive and 1,818 negative words, which sum up to 15,649 positive and 15,632 negative word forms, respectively, including their inflections. It contains not only adjectives and adverbs explicitly expressing a sentiment, but also nouns and verbs implicitly containing one.
Restrictions
none
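
SentiWS lists each word with a weight in [-1; 1], a part-of-speech tag and, where applicable, its inflections. The sketch below parses one such entry under an assumed line layout (word|POS, a tab, the weight, a tab, comma-separated inflections). That layout, the names used and the sample lines are assumptions for illustration and should be checked against the documentation shipped with the download.

```python
from typing import List, NamedTuple


class SentiWSEntry(NamedTuple):
    word: str
    pos: str
    weight: float
    inflections: List[str]


def parse_sentiws_line(line: str) -> SentiWSEntry:
    """Parse one entry, assuming 'word|POS<TAB>weight<TAB>infl1,infl2,...'."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least 'word|POS' and a weight")
    word, separator, pos = fields[0].partition("|")
    if not separator:
        raise ValueError("expected 'word|POS' in the first field")
    weight = float(fields[1])
    if not -1.0 <= weight <= 1.0:
        raise ValueError(f"weight {weight} is outside [-1, 1]")
    # The inflection field is optional; drop empty items if it is present.
    inflections = [form for form in fields[2].split(",") if form] if len(fields) > 2 else []
    return SentiWSEntry(word, pos, weight, inflections)


# Invented sample lines in the assumed layout (values are not from SentiWS).
entry = parse_sentiws_line("gut|ADJX\t0.37\tgute,guter,gutes")
bare = parse_sentiws_line("Beispiel|NN\t-0.5")
print(entry)
print(bare)
```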