Student: Felipe Oliveira Feder
Title: Comparative study between textual representation approaches and algorithms used in classification
Advisor: Gustavo Paiva Guedes e Silva
Committee: Gustavo Paiva Guedes e Silva (CEFET/RJ), Eduardo Bezerra (CEFET/RJ), Geraldo Xexéo (Coppe/UFRJ)
Day/Time: November 22, 2022 / 9 a.m.
Abstract: We have been experiencing an unprecedented technological revolution in recent years. The way we relate to each other has been – and will continue to be – impacted in different ways. Alongside the evolution of hardware and technologies that allow us to produce and store data in previously unthinkable volumes, algorithmic and methodological advances are also observed that allow us to move toward an entirely new world, even when dealing with old, typically human issues. The frontier of human-machine understanding has been constantly pushed forward. Natural language processing is the bridge that connects human speech to previously unimaginable possibilities for a machine to properly interpret and process it. The means of textual representation have evolved consistently in recent decades. Bag-of-Words (BOW), linked to the use of numerical representations for words, has been successfully used in textual representation. To overcome the deficiencies of BOW, however, complex numerical representations generated by deep neural networks have emerged that are capable of preserving the semantic and syntactic relationships between words: the Word Embeddings (WE). The frontier was pushed forward; new evolutions, new applications, new uses. The use of Neural Language Models (MLN), with WE, has reached the state of the art in different text-processing tasks. This research compares these two word representation methods, BOW and WE, and their uses in a binary polarity classification task. Two groups of classifiers were set up and four data sets were used. The first group, formed by n-gram models, here called Traditional Machine Learning Models (MAMT), dealt with textual representations that used BOW with TF-IDF and BOW with LSA. The second group, formed by MLNs (models based on deep neural networks that handle text-processing tasks), used WE and Contextual WE to represent the texts to be processed.
In the experiments carried out, the superiority of the semantic text classification models over the n-gram models was observed. Even so, the choice of textual representation technique (BOW or WE) and type of language model (n-gram or MLN) depends on the context, since n-gram models show satisfactory predictive performance even when compared with the most recent approaches and can be useful in many usage contexts.
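As an illustration of the first group of classifiers described in the abstract, the sketch below builds the two BOW-based representations (BOW with TF-IDF, and BOW with LSA, i.e., TF-IDF followed by truncated SVD) and feeds each into a simple binary polarity classifier. This is a minimal, hypothetical example using scikit-learn; the tiny corpus, the logistic regression classifier, and the number of LSA components are illustrative assumptions, not the thesis's actual four data sets or experimental setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy polarity corpus (illustrative only): 1 = positive, 0 = negative.
texts = [
    "great movie, loved it",
    "wonderful acting and a touching story",
    "terrible plot, waste of time",
    "boring and badly acted",
]
labels = [1, 1, 0, 0]

# BOW with TF-IDF: sparse term-weight vectors fed to the classifier.
tfidf_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
tfidf_clf.fit(texts, labels)

# BOW with LSA: TF-IDF vectors reduced to a low-dimensional latent
# semantic space via truncated SVD before classification.
lsa_clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),
    LogisticRegression(),
)
lsa_clf.fit(texts, labels)

print(tfidf_clf.predict(["loved the story"]))
print(lsa_clf.predict(["terrible and boring"]))
```

Both pipelines expose the same `fit`/`predict` interface, which is what makes the comparison in the study straightforward: only the textual representation step changes between the two n-gram (MAMT) variants.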