Student: João Antônio de Ferreira
Title: An Algebraic framework for Data Analysis Workflows in Apache Spark
Advisors: Eduardo Soares Ogasawara (advisor),
Rafaelli de Carvalho Coutinho (co-advisor)
Committee: Eduardo Soares Ogasawara (CEFET/RJ) (president),
Rafaelli de Carvalho Coutinho (CEFET/RJ), Jorge de Abreu Soares (CEFET/RJ), Fabio Andre Machado Porto (LNCC), Leonardo Gresta Paulino Murta (UFF)
Day/Time: February 25, 2 pm
The typical activity of a data scientist involves implementing various processes that characterize data analysis experiments. These analyses require running code written in different programming languages (Python, R, C, Java, Kotlin, and Scala) in different parallel and distributed processing environments. Depending on the complexity of the process and the many options for distributed execution, considerable effort may go into alternative implementations, which draws the data scientist away from the ultimate goal of producing knowledge from large volumes of data. In this context, this work addresses this difficulty by proposing the WfF framework, conceived from an algebraic approach that isolates process modeling from the difficulty of executing such workflows optimally. An agnostic language was also created as an embedded domain-specific language (eDSL), inspired by Model Driven Architecture (MDA) concepts, for dataflow (data-centric workflow) execution, together with a Scala code generator for deployment in the Apache Spark environment. Spark ecosystem functionalities were evaluated for optimizing filters (filter operator) and mappings (map operator) that operate on user-defined functions (UDFs) using the Spark SQL Catalyst API, and the experiments demonstrated the feasibility of this approach.
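
The kind of filter and map optimization mentioned above can be illustrated with a minimal sketch in plain Python. It is not the WfF framework, its eDSL, or the Catalyst API: the `Map` and `Filter` classes, the `writes` and `reads` annotations, and the `push_filters_early` rule are all hypothetical names introduced here. The sketch assumes each operator wraps an opaque user-defined function and declares which fields it touches, so that a rewrite rule can move a filter ahead of a map when the two are independent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Union

Row = Dict[str, float]


@dataclass(frozen=True)
class Map:
    """Hypothetical map operator wrapping an opaque UDF."""
    fn: Callable[[Row], Row]
    writes: FrozenSet[str]  # fields the UDF adds, changes, or removes


@dataclass(frozen=True)
class Filter:
    """Hypothetical filter operator wrapping an opaque predicate UDF."""
    pred: Callable[[Row], bool]
    reads: FrozenSet[str]  # fields the predicate inspects


Operator = Union[Map, Filter]


def push_filters_early(plan: List[Operator]) -> List[Operator]:
    """Swap a Map followed by a Filter whenever the filter does not read
    anything the map writes. Repeats until no further swap applies."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            first, second = plan[i], plan[i + 1]
            if (isinstance(first, Map) and isinstance(second, Filter)
                    and first.writes.isdisjoint(second.reads)):
                plan[i], plan[i + 1] = second, first
                changed = True
    return plan


def run(plan: List[Operator], rows: List[Row]) -> List[Row]:
    """Execute the operators in order on an in-memory list of rows."""
    out = list(rows)
    for op in plan:
        if isinstance(op, Map):
            out = [op.fn(r) for r in out]
        else:
            out = [r for r in out if op.pred(r)]
    return out


# Illustrative data and UDFs (all made up for this sketch).
rows = [{"id": 1, "temp_c": 20.0}, {"id": 2, "temp_c": 35.0}]
to_f = Map(lambda r: {**r, "temp_f": r["temp_c"] * 9 / 5 + 32},
           frozenset({"temp_f"}))
keep_even = Filter(lambda r: r["id"] % 2 == 0, frozenset({"id"}))
hot = Filter(lambda r: r["temp_f"] > 90, frozenset({"temp_f"}))

plan = [to_f, keep_even, hot]
optimized = push_filters_early(plan)
```

In this sketch `keep_even` moves ahead of `to_f`, so the map UDF runs on fewer rows, while `hot` stays after it because it reads the field the map produces. Both plans return the same rows. The correctness of the swap rests entirely on the declared `writes` and `reads` sets being accurate, which is the information an optimizer cannot infer from an opaque UDF on its own.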