Student: Gabriel Nascimento dos Santos
Title: Handling Out-of-Vocabulary Words in Lexicon-Based Sentiment Analysis Tasks
Advisors: Gustavo Paiva Guedes e Silva (advisor)
Committee: Gustavo Paiva Guedes e Silva (chair), Eduardo Bezerra da Silva (CEFET/RJ), Fellipe Ribeiro Duarte (UFRRJ/RJ), Ronaldo Ribeiro Goldschmidt (IME-RJ)
Day/Time: July 12 / 14:00
Room: Auditorium 5
The number of Internet users on social networks, microblogs, and review sites has grown significantly in recent years. As a result, users increasingly express their opinions and feelings about services, products, and a wide range of topics, such as politics. This has attracted the interest of natural language processing researchers, especially those working in Sentiment Analysis, who explore techniques to extract and understand the opinions that users post on opinion-oriented services. Sentiment Analysis has three approaches: machine learning-based, lexicon-based, and hybrid. The lexicon-based and hybrid approaches suffer from the out-of-vocabulary word problem when applied to social network texts. Such texts are challenging because they range from well-written passages to completely meaningless sentences. This happens for several reasons, such as character limits (as on Twitter) and even intentional misspellings. This work proposes an algorithm that uses word embeddings to handle out-of-vocabulary words in Sentiment Analysis tasks that follow lexicon-based or hybrid approaches. The strategy of the proposed algorithm rests on the hypothesis that words occurring in similar contexts tend to have similar meanings. The algorithm selects the words most semantically similar to the out-of-vocabulary word and uses the features of the closest one that is contained in the lexicon. The experiments were conducted on three datasets in Brazilian Portuguese. Three classifiers were used, and improvements of up to 3.3% in F1 score were observed after applying the proposed algorithm.
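The lookup strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function name lexicon_features, the top_k cutoff, the use of cosine similarity, the toy two-dimensional embeddings, and the polarity labels standing in for lexicon features are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def lexicon_features(word, embeddings, lexicon, top_k=5):
    """Return lexicon features for `word`, falling back to its nearest
    embedding neighbour that the lexicon contains (hypothetical helper)."""
    if word in lexicon:
        return lexicon[word]      # in-vocabulary: nothing to resolve
    if word not in embeddings:
        return None               # no vector, so no neighbours to search
    target = embeddings[word]
    # Rank every other word by similarity to the out-of-vocabulary word
    # and keep the top_k candidates (top_k is an assumed parameter).
    neighbours = sorted(
        ((cosine(target, vec), other)
         for other, vec in embeddings.items() if other != word),
        reverse=True,
    )[:top_k]
    # Use the closest candidate that the lexicon actually contains.
    for _, other in neighbours:
        if other in lexicon:
            return lexicon[other]
    return None


# Toy data, invented for this sketch (not taken from the experiments).
EMBEDDINGS = {
    "otimo": [0.90, 0.10],
    "otimoo": [0.89, 0.12],    # misspelling, absent from the lexicon
    "otimooo": [0.88, 0.15],   # misspelling, absent from the lexicon
    "bom": [0.80, 0.30],
    "ruim": [-0.90, 0.10],
    "pessimo": [-0.95, 0.05],
}
LEXICON = {
    "otimo": "positive",
    "bom": "positive",
    "ruim": "negative",
    "pessimo": "negative",
}

if __name__ == "__main__":
    print(lexicon_features("otimooo", EMBEDDINGS, LEXICON))
```

In this toy setup the nearest neighbour of "otimooo" is "otimoo", which is also missing from the lexicon, so the search continues to the next closest word. A real system would more likely query pretrained embeddings through a library's nearest-neighbour search than scan the whole vocabulary as done here.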