Sentiment analysis on review texts using category of words information and string kernels

Abstract: With millions of opinions written every day around the internet, review sentiment analysis has been shown to be an interesting and relevant problem. Support vector machines offer an excellent alternative when the amount of available data makes other models, such as deep learning, infeasible. A common way to detect sentiment hidden in textual data is to exploit mutual information across a corpus with a support vector machine or another sophisticated classification algorithm. Approaches that are able to extract information from sequences of words, such as string kernels, have the potential for better performance. However, finding similarities can be difficult given the long texts used to express opinions and the wide variety of vocabulary. To solve that problem, we suggest using clustering methods to automatically group words into categories based on their word vectors, replacing the words in the dataset with their corresponding categories, and then using these categories to find mutual information in the text with support vector machines that use string kernels. This approach significantly reduces the token space and enhances the efficiency of the kernel methods. Different models were tested against our proposal on a set of real-world problems, where the proposed method showed better performance than state-of-the-art approaches for this task. Results show that the proposed method is able to extract useful information from opinions in long texts and remains an interesting option for review sentiment analysis in general, even outperforming other state-of-the-art methods on certain datasets. It also opens the possibility of applying the same philosophy to deep learning and similar models.
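
The core idea of the abstract, replacing words with categories before applying a string kernel, can be illustrated with a minimal sketch. This is not the authors' implementation: the word-to-category table below is hand-written and hypothetical (in the described approach it would come from clustering word vectors), and a simple n-gram spectrum kernel stands in for whichever string kernel the paper actually uses. All function and variable names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical word-to-category table. In the described approach this mapping
# would be produced by clustering word vectors, not written by hand.
WORD_TO_CATEGORY = {
    "great": "C1", "excellent": "C1", "wonderful": "C1",
    "awful": "C2", "terrible": "C2",
    "movie": "C3", "film": "C3",
    "plot": "C4", "story": "C4",
}


def to_categories(text, mapping, unknown="UNK"):
    """Replace each whitespace-separated word with its category label."""
    return [mapping.get(word, unknown) for word in text.lower().split()]


def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def spectrum_kernel(tokens_a, tokens_b, n=2):
    """Spectrum kernel: inner product of the two n-gram count vectors."""
    counts_a = ngram_counts(tokens_a, n)
    counts_b = ngram_counts(tokens_b, n)
    return sum(count * counts_b[gram] for gram, count in counts_a.items())


review_a = "great movie wonderful plot"
review_b = "excellent film great story"

# On raw words the two reviews share no bigram, so the kernel finds no overlap.
raw_similarity = spectrum_kernel(review_a.split(), review_b.split())

# After mapping to categories both reviews become the same category sequence.
doc_a = to_categories(review_a, WORD_TO_CATEGORY)
doc_b = to_categories(review_b, WORD_TO_CATEGORY)
category_similarity = spectrum_kernel(doc_a, doc_b)

print(raw_similarity, category_similarity)
```

In this toy case the two reviews use entirely different words, so the kernel on raw tokens sees no shared bigrams, while the kernel on category tokens does. A Gram matrix built from such kernel values could then be passed to a support vector machine that accepts a precomputed kernel.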

J. M. Cuevas-Muñoz, A. de Haro-García and N. García-Pedrajas (2025) “Sentiment analysis on review texts using category of words information and string kernels”, submitted.

Source code: