Student: Ivair Nobrega Luques
Title: Inteligência Computacional Aplicada à Detecção Intrínseca de Plágio em Documentos Textuais (Computational Intelligence Applied to Intrinsic Plagiarism Detection in Textual Documents)
Advisors: Eduardo Bezerra (advisor), Pedro Henrique González Silva (co-advisor)
Committee: Eduardo Bezerra (president), Pedro Henrique González Silva (CEFET/RJ), Jorge de Abreu Soares (CEFET/RJ), Igor Machado Coelho (UFF)
Day/Time: January 31, 2020 / 10h
Room: Auditorium 5
Access to information has been fostered by open-access movements and digital libraries, which make large collections of textual documents available. However, misuse of these documents is contributing to a growing number of plagiarism cases. Machine Learning has aided in detecting plagiarism in many kinds of textual documents, such as published theses, dissertations, and scientific articles. One particular technique is intrinsic plagiarism detection, in which potentially plagiarized sentences in a document are highlighted using only the document's content as input (that is, no external information source is used). In this task, an essential step is identifying stylistic differences between plagiarized and original sentences inside a suspicious document. Deep Neural Networks have achieved state-of-the-art results on several Natural Language Processing problems in recent years. Inspired by that, in this work we apply a simple but effective combination of Deep Learning techniques to the task of intrinsic plagiarism detection. In particular, we use Skip-Thoughts, an embedding model, to represent each sentence of a document as a multi-dimensional vector. We then train a Siamese neural network on a collection of sentence pairs (each sentence represented as a Skip-Thoughts vector) extracted from documents in the PAN11 corpus. Next, we model each document as a weighted, undirected graph to enable the application of the cluster correlation algorithm, which makes it possible to identify potentially plagiarized passages. Our computational experiments show that the resulting Siamese neural network model is capable of recognizing stylistic differences between sentences in a document. In addition, identifying potentially plagiarized passages through the cluster correlation approach yields results comparable to those in the literature.
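The graph-based step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation, and every name in it is hypothetical. It makes three assumptions: plain cosine similarity stands in for the score produced by the trained Siamese network, toy 2-D vectors stand in for Skip-Thoughts sentence vectors, and a simple greedy pivot clustering stands in for the cluster correlation algorithm. The threshold of 0.5 is likewise an arbitrary illustrative value.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity; an assumed stand-in for the Siamese network's score."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(vectors, similarity=cosine):
    """Weighted undirected graph: one node per sentence, one edge per sentence pair.

    Edges are keyed by (i, j) with i < j, so each pair is stored once.
    """
    return {
        (i, j): similarity(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


def pivot_clusters(n, graph, threshold=0.5):
    """Greedy pivot clustering (assumed stand-in for the clustering step).

    Each pivot absorbs every unassigned node whose edge weight to the pivot
    is at least `threshold`; the process repeats on the remaining nodes.
    """
    unassigned = list(range(n))
    clusters = []
    while unassigned:
        pivot = unassigned.pop(0)
        members = [pivot]
        for node in unassigned[:]:  # iterate over a copy while removing
            if graph[(min(pivot, node), max(pivot, node))] >= threshold:
                members.append(node)
                unassigned.remove(node)
        clusters.append(members)
    return clusters


def flag_suspicious(vectors, threshold=0.5):
    """Return indices of sentences that fall outside the largest cluster."""
    graph = build_graph(vectors)
    clusters = pivot_clusters(len(vectors), graph, threshold)
    main = max(clusters, key=len)  # assumed to reflect the document's dominant style
    return sorted(i for c in clusters if c is not main for i in c)


# Toy example: sentence 3 points in a different direction from the others.
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.1, 1.0], [0.95, 0.15]]
suspicious = flag_suspicious(toy_vectors)
print(suspicious)  # [3]
```

The sketch treats the largest cluster as the author's own style and flags everything else, which is one simple reading of intrinsic detection: no external source is consulted, only similarities between sentences of the same document.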