Improving Sentiment Analysis Performance on Morphologically Rich Languages: Language and Domain Independent Approach


Journal Paper

KINCL, Tomáš, NOVÁK, Michal, PŘIBIL, Jiří

Computer Speech & Language, Vol. 56, pp. 36-51, ISSN: 0885-2308

Publication year: 2019

Sentiment analysis has grown rapidly with the proliferation of social media and the popularity of opinion-rich resources such as online reviews and blogs. Although the field has advanced significantly, major challenges remain, such as sentiment analysis across multiple languages or thematic domains. Only a few studies have focused on minor or morphologically rich languages. Moreover, it remains an open question whether sentiment analysis results can be further improved by incorporating the surrounding context (local or chronological) of the analyzed document. This paper presents a language- and domain-independent sentiment analysis model based on character n-grams, which improves the classifier's performance by utilizing the surrounding context.
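To illustrate what character n-gram features look like, here is a minimal Python sketch. The function name `char_ngrams`, the n-gram length range, and the lowercasing and whitespace normalization are assumptions made for this illustration; the abstract does not specify the paper's preprocessing or n-gram lengths.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Count all character n-grams of length n_min..n_max in text.

    Hypothetical helper: the normalization below (lowercasing, collapsing
    whitespace) is an illustrative choice, not taken from the paper.
    """
    normalized = " ".join(text.lower().split())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the string, one character at a time.
        for i in range(len(normalized) - n + 1):
            counts[normalized[i:i + n]] += 1
    return counts


# Works on any script without a tokenizer or a morphological analyzer.
features = char_ngrams("Skvělý film", n_min=3, n_max=3)
```

Because the features are plain substrings, inflected forms of a word in a morphologically rich language still share most of their n-grams, and no language-specific tooling is needed.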

Four experiments on various datasets were conducted to validate the model. The datasets included a reference corpus of English movie reviews; Czech movie reviews; reviews of the novel Fifty Shades of Grey, Amazon's bestselling book of 2012, collected from three Amazon language versions (English, German, and French); another reference corpus of Amazon reviews in multiple languages (German, French, and Japanese); and a multi-domain dataset (movies, books, and product categories ranging from electronics and home appliances to sports gear and supplies for hobbies and pets).

The experiments confirmed that incorporating the surrounding context is effective for datasets from various languages and domains, and they suggest that a character n-gram based model also performs strongly on multi-domain and multilingual datasets. A simple all-in-one classifier, which uses a mixture of labeled data from multiple languages (or domains) to train a sentiment classification model, may rival more sophisticated domain/language adaptation techniques. Such an approach reflects the needs of companies: in today's interconnected world, most companies operate across multiple markets and would find it difficult to obtain a dedicated sentiment analysis solution for each market they serve.
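The all-in-one idea can be sketched as follows: labeled examples from several languages are pooled into one training set, and a single classifier is trained on character n-grams. This is a toy illustration under stated assumptions, not the paper's implementation. The multinomial Naive Bayes classifier, the class name `NgramNaiveBayes`, the trigram length, and the six example sentences are all chosen here for illustration; the abstract does not name the classifier or its settings.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the list of character n-grams of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character n-grams (add-one smoothing)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()         # label -> number of training documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.counts[label].values()) + len(self.vocab)
            for gram in char_ngrams(text, self.n):
                score += math.log((self.counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy pooled training set mixing English, German, and French (invented examples).
pooled = [
    ("a great and wonderful movie", "pos"),
    ("terrible, boring and bad", "neg"),
    ("ein wunderbarer, großartiger Film", "pos"),
    ("schrecklich und langweilig", "neg"),
    ("un film magnifique et superbe", "pos"),
    ("horrible et ennuyeux", "neg"),
]
texts, labels = zip(*pooled)
model = NgramNaiveBayes(n=3).fit(texts, labels)
```

The point of the design is that nothing in the model is tied to one language: the same feature extractor and the same classifier serve every market, at the cost of no language-specific tuning.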

Tomáš Kincl

Faculty of Management

Prague University of Economics and Business