Dissertation Defense (05/02/2019): Rodrigo Tavares de Souza

Student: Rodrigo Tavares de Souza

Title: Appraisal-Spark: An approach to large-scale imputation

Advisor: Jorge Abreu Soares

Committee: Jorge Abreu Soares (CEFET/RJ) (chair), Eduardo Soares Ogasawara (CEFET/RJ), Ronaldo Ribeiro Goldschmidt (IME)

Date/Time: 5 February, 10:00

Room: Auditorium V

Abstract:

The volume of stored data, and the demand for integrating it, keeps growing. This scenario increases the occurrence of a problem well known to data scientists: the many possible kinds of inconsistency. One of the most common kinds, missing data, can impair the analysis and the results of any technique that produces information. Imputation is the area that studies methods for bringing the imputed value as close as possible to the real one. The composite imputation technique applies machine learning tasks to this process. It relies on the concept of an imputation plan, a logical sequence of strategies and algorithms used to produce the final imputed value. In this work, we extend this technique by complementing it with the bagging ensemble classifier. In this method, the data is divided into random groups, each assigned to classifiers called base learners. For each subset generated by bagging, the score (percentage of correct imputations) of each imputation plan is returned. The plan with the highest score across all subsets is suggested as the imputation plan for the complete dataset. The work is implemented in a system developed for Spark, called Appraisal-Spark, which aims to generate values with higher accuracy and predictive performance in large-scale environments. With it, various high-performance imputation plans can be composed, their strategies evaluated, and their results compared.
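
The plan-selection step described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the Appraisal-Spark implementation and does not use Spark. It assumes a single numeric column, treats each "plan" as a hypothetical function that returns one imputed value from the known values, and assumes a score defined as the fraction of deliberately hidden values recovered within a tolerance. The names `score_plan` and `suggest_plan` and all parameters are illustrative assumptions.

```python
import random
import statistics
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical stand-in for an imputation plan: a function that receives
# the known values and returns a single imputed value.
Plan = Callable[[Sequence[float]], float]


def score_plan(plan: Plan, sample: Sequence[float], rng: random.Random,
               hide_fraction: float = 0.3, tolerance: float = 1.0) -> float:
    """Hide part of the sample, impute it with the plan, and return the
    fraction of hidden values recovered within `tolerance` (assumed metric)."""
    shuffled = list(sample)
    rng.shuffle(shuffled)
    n_hidden = max(1, int(len(shuffled) * hide_fraction))
    hidden, known = shuffled[:n_hidden], shuffled[n_hidden:]
    guess = plan(known)
    hits = sum(1 for true_value in hidden if abs(true_value - guess) <= tolerance)
    return hits / n_hidden


def suggest_plan(values: Sequence[float], plans: Dict[str, Plan],
                 n_subsets: int = 10, seed: int = 0) -> Tuple[str, float]:
    """Score every plan on every bagged subset and return the name and
    score of the plan with the highest score across all subsets."""
    if len(values) < 2:
        raise ValueError("need at least two values to hide one and keep one")
    rng = random.Random(seed)
    best_name, best_score = "", -1.0
    for _ in range(n_subsets):
        # Bagging draws each subset with replacement (a bootstrap sample).
        subset = rng.choices(values, k=len(values))
        for name, plan in plans.items():
            score = score_plan(plan, subset, rng)
            if score > best_score:
                best_name, best_score = name, score
    return best_name, best_score
```

A call such as `suggest_plan(column, {"mean": statistics.mean, "median": statistics.median})` would return the suggested plan name and its score. In a large-scale setting the subsets would presumably be scored in parallel, which this sequential sketch does not attempt.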
