August 27th, 2018
Room: Segóvia 1

09:00-09:15 Opening
09:15-10:30 Keynote Speaker: Claudia Bauzer Medeiros
Data Science, Open Science – when and how shall the twain meet?
10:30-11:00 Coffee Break
11:00-12:30

Technical Session 1

1.     Scientific Data Analysis Using Data-Intensive Scalable Computing: the SciDISC Project (invited paper)
Patrick Valduriez, Marta Mattoso, Reza Akbarinia, Heraldo Borges, José Camata, Alvaro Coutinho, Daniel Gaspar, Noel Lemus, Ji Liu, Hermano Lustosa, Florent Masseglia, Fabricio Nogueira Da Silva, Vítor Silva, Renan Souza, Kary Ocaña, Eduardo Ogasawara, Daniel de Oliveira, Esther Pacitti, Fabio Porto and Dennis Shasha
2.     An I/O Performance Evaluation Tool for Distributed Data-Intensive Scientific Applications
Eduardo Camilo Inacio and Mario Antonio Ribeiro Dantas
3.     A Comparative Study on Streaming Frameworks for Big Data
Wissem Inoubli, Sabeur Aridhi, Haithem Mezni, Mondher Maddouri and Engelbert Mephu Nguifo
4.     Big Data Analytics Technologies and Platforms: a brief review
Ticiana Coelho Da Silva, Regis Pires, Igo Ramalho Brilhante, Jose Macedo, David Araújo, Paulo Rego and Aloisio Vieira Lira Neto
12:30-14:00 Lunch Break
14:00-15:30

Technical Session 2

5.     Urban Data Consistency in RDF: A Case Study of Curitiba Transportation System
Mirian Halfeld Ferrari, Carmem Hara, Nadia Kozievitch and Flavio Uber
6.     Business Activity Clustering: A Use Case in Curitiba
Yuri Bichibichi, Nádia Kozievitch, Ricardo Dutra and Artur Ziviani
7.     Influence of Virtual Road Traffic Sensors of Oporto for Origin-Destination Matrix Estimation
Luciano Urgal Pando, Ricardo Lüders, Keiko Veronica Ono Fonseca and Marcelo de Oliveira Rosa
8.     A Multilayer and Time-varying Structural Analysis of the Brazilian Air Transportation Network
Klaus Wehmuth, Bernardo Costa, João Victor Bechara and Artur Ziviani
15:30-16:00 Coffee Break
16:00-17:10

Technical Session 3

9.     Applying term frequency-based indexing to improve scalability and accuracy of probabilistic data linkage
Robespierre Pita, Luan Menezes and Marcos Barreto
10.  ATAnalysis – Toward a psycholinguistic method to analyze video textual information
Helder Yukio Okuno, Flavio Carvalho, Gustavo Paiva Guedes and Marcelle Torres Alves Okuno
Short papers:
11.  Computation of PDFs on Big Spatial Data: Problem & Architecture
Ji Liu, Noel M. Lemus, Esther Pacitti, Fabio Porto and Patrick Valduriez
12.  Towards a Human-in-the-Loop Library for Tracking Hyperparameter Tuning in Deep Learning Development
Renan Souza, Liliane Neves, Leonardo Azeredo, Ricardo Luiz, Elaine Tady, Paulo Cavalin and Marta Mattoso
13.  A Method to build a Geolocalized Food Price Time Series Knowledge Base analyzable by Everyone
Johyn Papin, Frederic Andres and Laurent D’Orazio
17:15-17:30 Closing remarks

Detailed Program

09:00-09:15 – Opening

09:15-10:30: Keynote Speaker: Claudia Bauzer Medeiros

Data Science, Open Science – when and how shall the twain meet?

Unicamp – Campinas, SP, Brazil

11:00-12:30: Technical Session 1

11:00-11:20 – Scientific Data Analysis Using Data-Intensive Scalable Computing: the SciDISC Project (invited paper)

Patrick Valduriez1, Marta Mattoso2, Reza Akbarinia1, Heraldo Borges3, José Camata2, Alvaro Coutinho2, Daniel Gaspar4, Noel Lemus4, Ji Liu1, Hermano Lustosa4, Florent Masseglia1, Fabricio Nogueira Da Silva5, Vítor Silva2, Renan Souza2, Kary Ocaña4, Eduardo Ogasawara3, Daniel de Oliveira5, Esther Pacitti1, Fabio Porto4 and Dennis Shasha6

1Inria, LIRMM and University Montpellier – France
2COPPE/UFRJ, Rio de Janeiro – RJ – Brazil
3CEFET/RJ, Rio de Janeiro – RJ – Brazil
4LNCC, Rio de Janeiro – RJ – Brazil
5UFF, Rio de Janeiro – RJ – Brazil
6NYU, New York – NY – USA

Abstract: Data-intensive science requires the integration of two fairly different paradigms: high-performance computing (HPC) and data-intensive scalable computing (DISC), as exemplified by frameworks such as Hadoop and Spark. In this context, the SciDISC project addresses the grand challenge of scientific data analysis using DISC by developing architectures and methods to combine simulation and data analysis. SciDISC is an ongoing project between Inria, several research institutions in Rio de Janeiro, and NYU. This paper introduces the motivations and objectives of the project and reports on the first results achieved so far.

11:20-11:40 – An I/O Performance Evaluation Tool for Distributed Data-Intensive Scientific Applications

Eduardo Camilo Inacio1 and Mario Antonio Ribeiro Dantas1

1UFSC – Universidade Federal de Santa Catarina, Florianópolis, SC – Brazil

Abstract: I/O performance arises as a major bottleneck in today's data-intensive scientific applications. To identify execution parameters that provide improved I/O performance, experimental efforts relying on synthetic I/O workload generators are widely employed. To address the limitations of current workload generators, and to provide a more flexible, unified, and user-friendly approach to parallel I/O performance analysis and optimization, we have proposed a differentiated tool called IORE. In this paper, we demonstrate the applicability of IORE for I/O performance analysis of a dataset-based workload derived from a real-world scientific application. Beyond giving insight into the best-performing configurations for this application workload, our results indicate the potential of IORE as a parallel I/O experimentation tool.

11:40-12:00 – A Comparative Study on Streaming Frameworks for Big Data

Wissem Inoubli1, Sabeur Aridhi2, Haithem Mezni3, Mondher Maddouri4 and Engelbert Mephu Nguifo5

1University of Tunis El Manar, Faculty of Sciences of Tunis, LIPAH
2University of Lorraine, CNRS, Inria, LORIA
3University of Jendouba, SMART Lab
4College of Business, University of Jeddah
5Clermont University, Blaise Pascal University, LIMOS

Abstract: Increasingly large amounts of data are being generated from a variety of data sources, and existing data processing technologies are not suitable to cope with these volumes. Consequently, many research works focus on streaming in Big Data, a task referring to the processing of massive volumes of structured and/or unstructured streaming data. Recently proposed streaming frameworks for Big Data applications help to store, analyze, and process this continuously captured data. In this paper, we discuss the challenges of streaming Big Data and survey existing streaming frameworks for Big Data. We also present an experimental evaluation and a comparative study of the most popular streaming frameworks.

12:00-12:20 – Big Data Analytics Technologies and Platforms: a brief review

Ticiana Coelho Da Silva1, Regis Pires1, Igo Ramalho Brilhante1, Jose Macedo1, David Araújo1, Paulo Rego1 and Aloisio Vieira Lira Neto2

1Federal University of Ceara, Brazil
2Brazilian Federal Highway Police

Abstract: A plethora of Big Data Analytics technologies and platforms have been proposed in recent years. However, as of 2017, only 53% of companies had adopted such tools. It seems that industry is not convinced by the promises of Big Data, or perhaps choosing the right technology or platform requires in-depth knowledge of the capabilities of all these tools. Before deciding which technology or platform to adopt, organizations have to investigate the needs of their applications and algorithms as well as the advantages and drawbacks of each technology and platform. In this paper, we aim to help organizations select the technologies and platforms most appropriate to their analytic processes by offering a short review organized by categories of Big Data problems: processing (streaming and batch), storage, data integration, analytics, data governance, and monitoring.

14:00-15:30: Technical Session 2

14:00-14:20 – Urban Data Consistency in RDF: A Case Study of Curitiba Transportation System

Mirian Halfeld Ferrari, Carmem Hara, Nadia Kozievitch and Flavio Uber

1Universite d’Orleans, INSA CVL, Orleans, France
2Universidade Federal do Parana, Curitiba, PR, Brazil
3Universidade Tecnologica Federal do Parana, Curitiba, PR, Brazil
4Universidade Estadual de Maringa, Maringa, PR, Brazil

Abstract: Urban Computing has an important role in providing new tools for urban mobility. In this paper, integrity constraints and blank nodes are used in an RDF database to minimize the extra updates (called side effects) needed to guarantee consistency during required updates. A case study using a real scenario from the Curitiba (Brazil) transportation database is presented. Experimental results show that our approach performs better and produces more meaningful results than a similar strategy.

14:20-14:40 – Business Activity Clustering: A Use Case in Curitiba

Yuri Bichibichi1, Nádia Kozievitch1, Ricardo Dutra1 and Artur Ziviani2

1Universidade Tecnológica Federal do Parana (UTFPR), Curitiba, PR, Brazil
2National Laboratory for Scientific Computing (LNCC), Petropolis, RJ, Brazil

Abstract: In the context of smart cities, business license information has the potential to reveal economic characteristics of the observed urban environment. This work performs an initial analysis of business activity clustering using the k-means algorithm with data from the granting of business licenses (from 1980 to 2016) in the city of Curitiba, Brazil.

14:40-15:00 – Influence of Virtual Road Traffic Sensors of Oporto for Origin-Destination Matrix Estimation

Luciano Urgal Pando1, Ricardo Lüders1, Keiko Veronica Ono Fonseca1 and Marcelo de Oliveira Rosa1

1Federal University of Technology – Parana (UTFPR)

Abstract: Knowledge of urban mobility patterns is important for maintaining good public services as well as for improving city planning. These mobility patterns can be characterized through expensive fieldwork or by analyzing the huge amount of data available from services and environmental monitoring in smart cities. Origin-destination matrix estimation (ODME) aims to estimate the vehicle traffic between particular origin and destination areas of the city from traffic observed by sensors installed on roads. This estimation is stated as an optimization problem and solved here by linear programming. The results obtained for sensor data from Porto, Portugal show that the number and location of sensors are important issues to be considered.

15:00-15:20 – A Multilayer and Time-varying Structural Analysis of the Brazilian Air Transportation Network

Klaus Wehmuth, Bernardo Costa, João Victor Bechara and Artur Ziviani

1 LNCC – National Laboratory for Scientific Computing

Abstract: This paper provides a multilayer and time-varying structural analysis of an air transportation network, taking the Brazilian air transportation network as a case study. Using a single mathematical object called a MultiAspect Graph (MAG) for this analysis, the multilayer perspective enables the unveiling of the particular strategies each airline adopts to establish its specific flight network and to adapt it in a moment of crisis.

16:00-17:10: Technical Session 3

16:00-16:20 – Applying term frequency-based indexing to improve scalability and accuracy of probabilistic data linkage

Robespierre Pita1,2, Luan Menezes1,2 and Marcos Barreto1,2

1UFBA – Federal University of Bahia, Salvador, BA, Brazil
2FIOCRUZ – Oswaldo Cruz Foundation, Salvador, BA, Brazil

Abstract: Record or data linkage is a technique frequently used in diverse domains to aggregate data stored in different sources that presumably pertain to the same real-world entity. Data linkage can be implemented with deterministic (key-based) or probabilistic (rule-based) methods, the latter being suitable when no common link attributes exist among the data sources involved. Depending on the volume of data being linked, indexing (or blocking) techniques should be used to reduce the number of pairwise comparisons needed to decide whether a given pair of records matches. In this paper, we discuss a new indexing scheme, based on term-frequency counts, deployed in our data linkage tool (AtyImo). We present our algorithm design and some metrics related to accuracy and efficiency (the reduction ratio achieved during block construction), as well as a comparative analysis with a predicate-based technique also used in AtyImo. Our results show a very high level of accuracy and a substantial reduction in pairwise comparison tasks.

16:20-16:40 – ATAnalysis – Toward a psycholinguistic method to analyze video textual information

Helder Yukio Okuno, Flavio Carvalho, Gustavo Paiva Guedes and Marcelle Torres Alves Okuno

1CEFET/RJ, Rio de Janeiro – RJ – Brazil
2EGN – Escola de Guerra Naval, Rio de Janeiro – RJ – Brazil

Abstract: Political statements of world leaders may affect many lives, so it is important to study what they express through language. We propose a method for the psycholinguistic analysis of statements extracted from videos. To show its relevance, we conducted experiments on video subtitles of world leaders Donald Trump and Kim Jong-un amid an imminent agreement that could lead to peace on the Korean peninsula. Results suggest less security in the statements of the North Korean leader while threatening to unleash an "unimaginable strike" on US territory, and less honesty in those of the US president when saying he hopes never to use the nuclear arsenal. This approach may be useful in future studies to reveal what the language used by candidates can show.

16:40-16:50 – Computation of PDFs on Big Spatial Data: Problem & Architecture

Ji Liu1, Noel M. Lemus2, Esther Pacitti1, Fabio Porto2 and Patrick Valduriez1

1Inria and LIRMM, University of Montpellier, France
2LNCC Petropolis, Brazil

Abstract: Big spatial data can be produced by observation or by numerical simulation programs, and correspond to points that represent a 3D soil cube area. However, errors in signal processing and modeling create some uncertainty, and thus a lack of accuracy in identifying geological or seismic phenomena. To analyze this uncertainty, the main solution is to compute a Probability Density Function (PDF) for each point in the spatial cube area, which can be very time consuming. In this paper, we analyze the problem and discuss the use of Spark to compute PDFs efficiently.

16:50-17:00 – Towards a Human-in-the-Loop Library for Tracking Hyperparameter Tuning in Deep Learning Development

Renan Souza1,2, Liliane Neves1, Leonardo Azeredo1, Ricardo Luiz, Elaine Tady1, Paulo Cavalin2 and Marta Mattoso1

1COPPE/Federal University of Rio de Janeiro
2IBM Research

Abstract: The development lifecycle of Deep Learning (DL) models requires humans (the model trainers) to analyze and steer the training evolution. They analyze intermediate data, fine-tune hyperparameters, and stop when the resulting model is satisfactory. The problem is that existing DL solutions do not track the trainer's actions: there is no explicit data relationship linking trainer actions, input data, and hyperparameters to the output performance results throughout the training process. This jeopardizes online training data analyses as well as post-hoc reproducibility, reusability, and understanding of results. This paper presents DL-Steer, our first prototype to aid trainers in fine-tuning hyperparameters and to track trainer steering actions. Tracked data are stored in a relational database for online and post-hoc data analyses.

17:00-17:10 – A Method to build a Geolocalized Food Price Time Series Knowledge Base analyzable by Everyone

Johyn Papin1, Frederic Andres2 and Laurent D’Orazio1

1Univ Rennes, France
2NII – National Institute of Informatics, Japan

Abstract: Time-series analysis is a very challenging topic in Data Science for companies and industries. Harvesting the prices of agricultural products (e.g. vegetables, fruit, milk) as time series is key to operating reliable dish cost prediction at scale, for example to ensure that a market price is valid. In this paper, we describe the initial stakeholder needs; the application and engineering contexts in which the challenge of time-series harvesting and management arose; and the theoretical and architectural choices we made to implement a solution for historical food prices that demonstrates its feasibility. For this, we use scrapers running through the Tor network. We also propose a knowledge map approach to make the data accessible to any type of user.

17:15-17:30 – Closing remarks