Datasets
English
Voice Likability database
The likability database is a subset of the AGender database; both were created at the Telekom Innovation Laboratories. The AGender data contains sentences spoken over the telephone by German speakers, equally distributed across seven age-gender groups. From it, 800 utterances were taken (one utterance per speaker, excluding the children), and at least 16 different people judged the likability of each voice on a 7-point scale.
This paper describes the database:
F. Burkhardt, B. Schuller, B. Weiss, F. Weninger: "Would You Buy A Car From Me?" - On the Likability of Telephone Voices, Interspeech, 2011
Rada Mihalcea's Resources
- URL
- http://web.eecs.umich.edu/~mihalcea/downloads.html
- Several resources
- alignment between lexical databases, semantic relatedness, subjectivity, sentiment, ...
Emotions datasets by Media Core @ UFL
- http://csea.phhp.ufl.edu/media.html
- International Affective Picture System (IAPS)
- International Affective Digitized Sounds (IADS)
- Affective Norms for English Words (ANEW). ANEW provides a set of normative emotional ratings for a large number of words in the English language. This set of verbal materials has been rated in terms of pleasure, arousal, and dominance to complement the existing International Affective Picture System (IAPS; Lang, Bradley, & Cuthbert, 1999) and International Affective Digitized Sounds (IADS; Bradley & Lang, 1999), which are collections of picture and sound stimuli, respectively, that also include these affective ratings.
- Affective Norms for English Text (ANET). ANET provides normative ratings of emotion (pleasure, arousal, dominance) for a large set of brief texts in the English language for use in experimental investigations of emotion and attention. The ANET is being developed and distributed by the Center for the Study of Emotion and Attention (CSEA) at the University of Florida.
- The Self-Assessment Manikin (SAM)
The Self-Assessment Manikin (SAM) is a non-verbal pictorial assessment technique that directly measures the pleasure, arousal, and dominance associated with a person's affective reaction to a wide variety of stimuli.
Sentube Corpus
- Source
- "Sentiment analysis of Youtube videos with joint models of text and speech" http://ikernels-portal.disi.unitn.it/projects/sentube/
- Description
- The SenTube corpus is available for research and commercial purposes. The comments corpus (16 MB) can be downloaded from the project page. Video files are available on request.
- Restrictions
- None. Author contacted (Olga Uryupina, uryupina@gmail.com).
Multiclass Twitter Emotion Corpus
- Source
- "Feature Specific Sentiment Analysis for Product Review" http://people.mpi-inf.mpg.de/~smukherjee/
- Description
- 1. Dataset1 (1257 reviews from different domains annotated in 2 classes - positive or negative)
- 2. Dataset2 (3834 reviews from different domains annotated in 2 classes - positive or negative)
- 3. Dataset3 (425 reviews from 3 domains annotated in 2 classes - positive or negative)
- Restrictions
- None. Author contacted (Smukherjee, smukherjee@mpi-inf.mpg.de); contact the author if we use it.
LoughranMcDonald
- Source
- http://provalisresearch.com/products/content-analysis-software/wordstat-dictionary/sentiment-dictionaries/
- Description
- Financial sentiment dictionary
- Restrictions
- Academic / citation
SentiWordNet
- Source
- http://sentiwordnet.isti.cnr.it/
- Description
- Sentiment dictionary based on WordNet (WN)
- Restrictions
- Academic / citation
- Note
- A Python interface is available: Sentiwordnet-BC, a binding class for SentiWordNet (English and Spanish linked) for sentiment analysis and opinion mining: http://github.com/rmaestre/Sentiwordnet-BC
Sentiment and Emotion lexicons
- http://saifmohammad.com/WebPages/lexicons.html It includes:
- NRC Word-Emotion Association Lexicon. Sentiments: negative, positive
Emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, trust
- NRC Hashtag Emotion Lexicon and Corpus.
- MaxDiff General Domain Sentiment Lexicon
- NRC Hashtag Sentiment Lexicon
- Sentiment140 Lexicon
- Yelp Restaurant Sentiment Lexicon
- Amazon Sentiment Lexicon
- NRC Word-Colour Association Lexicon
Various corpora for Aspect Based Sentiment Analysis
SEMEVAL 2015: http://alt.qcri.org/semeval2015/task12/
SEMEVAL 2014: http://alt.qcri.org/semeval2014/task4/
Both tasks only have pos/neg/neutral classifications, which we believe is not enough for many applications.
We would like to see the full continuous values for a number of reasons (including the way our annotators work), but at least a 5 class system (highly_neg, neg, neutral, pos, highly_pos) would be nice.
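As an illustration of the 5 class system mentioned above, a continuous polarity score could be bucketed as follows. This is only a sketch: the score range of [-1, 1], the function name `to_five_class`, and the threshold values are all assumptions made for the example, not part of either SemEval task.

```python
def to_five_class(score, weak=0.2, strong=0.6):
    """Map a continuous polarity score to one of five class labels.

    Assumes scores lie in [-1, 1]; `weak` and `strong` are hypothetical
    cut-off values chosen only for illustration.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie in [-1, 1]")
    if score <= -strong:
        return "highly_neg"
    if score < -weak:
        return "neg"
    if score <= weak:
        return "neutral"
    if score < strong:
        return "pos"
    return "highly_pos"


if __name__ == "__main__":
    for value in (-0.9, -0.3, 0.0, 0.3, 0.9):
        print(value, to_five_class(value))
```

Keeping the thresholds as parameters makes it possible to tune the class boundaries against annotator data instead of hard-coding them.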
Multilingual
Multilingual Sentiment Analysis Twitter Corpus
- URL
- http://www.win.tue.nl/~mpechen/projects/smm/ http://www.win.tue.nl/~mpechen/projects/pdfs/Tromp2011.pdf
- Source
- "Feature Specific Sentiment Analysis for Product Review" http://people.mpi-inf.mpg.de/~smukherjee/
- Description
- CINLP_datasets.zip (description.txt) Preprocessed labeled Twitter datasets, one automatically annotated and two manually annotated as used in Tromp et al, 2013, submission to CINLP special issue.
- Turkish_Movie_Sentiment.zip and Turkish_Products_Sentiment.zip (description.txt): Movie reviews and multi-domain product reviews (both in Turkish) dataset as used in Demirtas & Pechenizkiy, WISDOM@KDD'13 (cross-lingual polarity detection with machine translation).
- LIGA_Benelearn11_dataset.zip (description.txt) Preprocessed labeled Twitter data in six languages, used in Tromp & Pechenizkiy, Benelearn 2011
- SA_Datasets_Thesis.zip (description.txt) All preprocessed datasets as used in Tromp 2011, MSc Thesis
- Restrictions
- None.
Multilingual sentiment lexicons
- Source
- https://sites.google.com/site/datascienceslab/projects/multilingualsentiment
- Description
- Sentiment lexicons for multiple languages
- Restrictions
- Academic / citation
Sentistrength
- Source
- http://sentistrength.wlv.ac.uk/
- Description
- Lexicon in English and Spanish with annotated tweets
- Restrictions
- Academic / citation
Emoticon Sentiment Lexicon
- Source
- http://people.few.eur.nl/hogenboom/files/EmoticonSentimentLexicon.zip
- Description
- This emoticon sentiment lexicon was created by:
- Alexander Hogenboom (hogenboom@ese.eur.nl, Erasmus University Rotterdam)
- Daniella Bal (daniella.bal@xs4all.nl, Erasmus University Rotterdam)
- Flavius Frasincar (frasincar@ese.eur.nl, Erasmus University Rotterdam)
- Malissa Bal (malissa.bal@xs4all.nl, Erasmus University Rotterdam)
- Franciska de Jong (f.m.g.dejong@utwente.nl, Universiteit Twente/Erasmus University Rotterdam)
- Uzay Kaymak (u.kaymak@ieee.org, Eindhoven University of Technology)
Each line in the lexicon file (EmoticonSentimentLexicon.txt) contains an emoticon and its associated sentiment according to our human annotators, separated by a tab. The sentiment of an emoticon is either -1 (negative), 0 (neutral), or 1 (positive).
- Restrictions
- Academic / citation
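The file format described for the Emoticon Sentiment Lexicon (one emoticon and its sentiment per line, separated by a tab) could be read along these lines. This is a minimal sketch: the function names are hypothetical, the sample entries are invented for the demonstration and are not taken from the real file, and summing the scores of whitespace-separated tokens is just one simple way to use the lexicon.

```python
import io


def load_emoticon_lexicon(lines):
    """Parse lines of the form '<emoticon>\t<sentiment>' into a dict.

    The sentiment must be -1, 0, or 1, as described for the lexicon file.
    """
    lexicon = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so the emoticon itself is kept intact.
        emoticon, sep, value = line.rpartition("\t")
        if not sep:
            raise ValueError(f"missing tab separator: {line!r}")
        score = int(value)
        if score not in (-1, 0, 1):
            raise ValueError(f"unexpected sentiment value: {value!r}")
        lexicon[emoticon] = score
    return lexicon


def score_text(text, lexicon):
    """Sum the sentiment of every whitespace-separated token found in the lexicon."""
    return sum(lexicon.get(token, 0) for token in text.split())


if __name__ == "__main__":
    # Invented sample data; with the real resource one would pass an open
    # handle to EmoticonSentimentLexicon.txt instead (encoding not verified).
    sample = io.StringIO(":-)\t1\n:-(\t-1\n:-|\t0\n")
    lexicon = load_emoticon_lexicon(sample)
    print(score_text("what a day :-) :-)", lexicon))
```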
Twitter Sentiment Analysis Training Corpus
- URL
- http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
- Source
- http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
- Description
- The Twitter Sentiment Analysis Dataset contains 1,578,627 classified tweets; each row is marked 1 for positive sentiment or 0 for negative sentiment.
- Restrictions
- None.
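A quick sanity check on the class balance of the Twitter Sentiment Analysis Training Corpus might look like the sketch below. The description above only states that each row carries 1 (positive) or 0 (negative); the CSV layout, the column names `Sentiment` and `SentimentText`, and the function name are assumptions made for this example and should be checked against the downloaded file.

```python
import csv
import io
from collections import Counter

# Label encoding as given in the corpus description.
LABELS = {"1": "positive", "0": "negative"}


def count_labels(csv_file, label_column="Sentiment"):
    """Count positive and negative rows in a CSV file with a header row.

    `label_column` is a hypothetical column name; adjust it to the real file.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[LABELS[row[label_column].strip()]] += 1
    return counts


if __name__ == "__main__":
    # Invented sample rows standing in for the real corpus.
    sample = io.StringIO(
        "Sentiment,SentimentText\n"
        "1,what a lovely morning\n"
        "0,missed my train again\n"
        "1,great coffee\n"
    )
    print(count_labels(sample))
```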
TBOD Feature-based Sentiment Corpus
- Source
- http://www.lsi.us.es/~fermin/index.php/Datasets
- Description
- TBOD Corpus for feature-based sentiment analysis in three domains: cars, headphones and hotels.
- Restrictions
- Academic / citation
Spanish
Corpus Cine (Spanish cinema)
- Source
- http://www.lsi.us.es/~fermin/index.php/Datasets
- Description
- Annotated reviews from www.muchocine.net
- Restrictions
- Academic / citation
ElhPolar dictionary
- Source
- http://komunitatea.elhuyar.org/ig/files/2013/10/ElhPolar_esV1.lex
- Description
- The ElhPolar polarity lexicon for Spanish was created from different sources and includes both negative and positive words. A detailed description of its content, and of how the lexicon was built, can be found in the following publication: Saralegi X., San Vicente I. 2013. "Elhuyar at TASS 2013". In Proceedings of "XXIX Congreso de la Sociedad Española de Procesamiento de lenguaje natural". Workshop on Sentiment Analysis at SEPLN (TASS2013). Madrid. pp. 143-150. ISBN: 978-84-695-8349-4
- Restrictions
- none
ISOL
- Source
- http://timm.ujaen.es/recursos/isol/
- Description
- iSOL is a domain-independent list of opinion-indicator words in Spanish.
The resource was built starting from the word list maintained by Professor Bing Liu (Bing Liu's Opinion Lexicon). The list was translated automatically with the Reverso translator and then corrected manually. It consists of 2,509 positive words and 5,626 negative words. For more information on how the list was developed, see the article: Bilingual Experiments on an Opinion Comparable Corpus
- Restrictions
- academic/citation
Sentiment Spanish Lexicon
- Source
- http://web.eecs.umich.edu/~mihalcea/downloads.html#SPANISH_SENT_LEXICONS
- Description
- This resource contains two polarity lexicons in Spanish. The lexicons have been automatically or semi-automatically generated. (April 3, 2012)
- Restrictions
- academic/citation
ML-SentiCON Sentiment Spanish Lexicon
- Source
- http://www.lsi.us.es/~fermin/index.php/Datasets
- Description
- Multilingual, layered sentiment lexicons at lemma level. This resource contains lemma-level sentiment lexicons for English, Spanish, Catalan, Basque and Galician. For each lemma, it provides an estimate of polarity (from very negative, -1.0, to very positive, +1.0) and a standard deviation (related to the ambiguity of the polarity estimate; please refer to the paper for further details).
- Restrictions
- academic/citation
TASS - Twitter Sentiment Analysis in Spanish
- Source
- TASS 2013 http://www.daedalus.es/TASS2013/corpus.php
- TASS 2014 http://www.daedalus.es/TASS2014/tass2014.php
- TASS 2015 http://www.daedalus.es/TASS2015/tass2015.php#corpus
- Description
- The general corpus contains over 68 000 Twitter messages, written in Spanish by about 150 well-known personalities and celebrities of the world of politics, economy, communication, mass media and culture, between November 2011 and March 2012. Although the context of extraction has a Spain-focused bias, the diverse nationality of the authors, including people from Spain, Mexico, Colombia, Puerto Rico, USA and many other countries, makes the corpus reach a global coverage in the Spanish-speaking world.
- Tasks 2013: Task 1: Sentiment Analysis at global level, Task 2: Topic classification, Task 3: Sentiment Analysis at entity level, Task 4: Political tendency identification
- Tasks 2014: (legacy) Task 1: Sentiment Analysis at global level, (legacy) Task 2: Topic classification, (new) Task 3: Aspect detection, (new) Task 4: Aspect-based sentiment analysis
- Tasks 2015: Task 1: Sentiment Analysis at global level and Task 2: Aspect-based sentiment analysis
- Restrictions
- Private access. We should include citation.
German
German SentiWortschatz
- URL
- http://asv.informatik.uni-leipzig.de/download/sentiws.html
- Source
- http://asv.informatik.uni-leipzig.de/download/sentiws.html
- Description
- SentimentWortschatz, or SentiWS for short, is a publicly available German-language resource for sentiment analysis, opinion mining, etc. It lists positive and negative polarity-bearing words weighted within the interval of [-1; 1], plus their part-of-speech tag and, if applicable, their inflections. The current version of SentiWS (v1.8b) contains 1,650 positive and 1,818 negative words, which sum up to 15,649 positive and 15,632 negative word forms incl. their inflections, respectively. It contains not only adjectives and adverbs explicitly expressing a sentiment, but also nouns and verbs implicitly containing one.
- Restrictions
- none