Student: Leandro Maia Gonçalves
Title: Hot-deck imputation: A systematic review of literature
Advisor: Jorge de Abreu Soares
Committee: Jorge de Abreu Soares (president), Eduardo Soares Ogasawara (CEFET/RJ), and José Maria da Silva Monteiro Filho (UFC)
Day/time: January 29, 2021 / 10:00.
Abstract: Organizations have realized that investing in transforming data into information to support decision-making can bring competitive advantages. In the current scenario, in which data grows in volume, velocity, and variety, this expansion is accompanied by an increase in missing data, which creates interpretation problems for analysts and researchers. Excluding the affected cases is not necessarily a solution, regardless of data volume, because it risks introducing bias or trends into the data. Data imputation is therefore a fundamental data pre-processing task, capable of improving subsequent analysis. Hot-deck imputation stands out in this context for its ability to estimate missing values more accurately and to preserve individual differences between subjects during imputation. In this study, a systematic review of hot-deck imputation techniques, conducted on the Scopus database, evaluates how research on this topic has evolved over the years. This work also proposes a taxonomy that aims to classify, order, and establish hierarchies for imputation techniques. The results show that 63% of the investigated articles did not adequately identify the missing-data mechanisms in their experiments; 72% of the clustering algorithms used with the hot-deck approach belonged to the Partitioning Based category; and the random hot-deck, K-Nearest-Neighbor, and K-means algorithms together accounted for 75% of the algorithms used. Regarding the reproducibility of the experiments, 30% of the articles presented pseudocode for the algorithms used, 42% used public data sets, and 45% compared the imputation results with the original data set. Notably, only 1% of the articles made source code available in an open repository, which leaves a significant gap in the reproducibility of experiments in this area.
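For readers unfamiliar with the technique, the following is a minimal sketch of random hot-deck imputation, one of the algorithms named in the abstract. It is an illustration under stated assumptions, not code from the reviewed articles or from this work: the function name `random_hot_deck`, the field names `group` and `income`, and the sample records are all hypothetical. Each missing value is filled with an observed value drawn at random from a donor record in the same group.

```python
import random


def random_hot_deck(records, target, group_key, seed=0):
    """Return a copy of `records` in which each missing `target` value
    (represented as None) is replaced by the observed value of a randomly
    chosen donor from the same group. Hypothetical helper for illustration."""
    rng = random.Random(seed)

    # Build the donor pool: observed target values, keyed by group.
    donors = {}
    for rec in records:
        if rec[target] is not None:
            donors.setdefault(rec[group_key], []).append(rec[target])

    completed = []
    for rec in records:
        new_rec = dict(rec)  # copy so the input records are not modified
        if new_rec[target] is None:
            pool = donors.get(new_rec[group_key])
            if pool:  # a group without donors keeps its missing value
                new_rec[target] = rng.choice(pool)
        completed.append(new_rec)
    return completed


# Hypothetical sample data: None marks a missing value.
records = [
    {"group": "A", "income": 10},
    {"group": "A", "income": None},
    {"group": "A", "income": 12},
    {"group": "B", "income": 30},
    {"group": "B", "income": None},
    {"group": "C", "income": None},
]

completed = random_hot_deck(records, target="income", group_key="group")
```

Because every imputed value is copied from a real record, the completed data contain only values that actually occur in the data set, in contrast to, for example, imputing a group mean. In this sketch, group "C" has no donor, so its value stays missing.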