Increasingly, organizations seek to analyze the growing volume of available data to develop actions that bring competitive advantage and distinguish them in their field of activity. This process ranges from the correct collection and storage of data to its integration with information obtained on the Web. Data is associated with the planning, management, and performance of an organization, and can be structured, semi-structured, or unstructured. It therefore needs to be treated and transformed into information and knowledge.
To assist this process in a significant way, this project analyzes different research opportunities. First, there is a need to process large volumes of heterogeneous data in parallel and distributed environments. This is a typical scenario in large projects in several areas of knowledge, such as bioinformatics, astronomy, engineering, and medicine, where workflows have been widely adopted. Many of these workflows are large-scale and require high-performance computing environments (such as clusters, supercomputers, and clouds) and parallelism techniques to run in a viable time. Beyond these environments, recent years have seen frequent use of large-scale data-centric computing frameworks such as Apache Spark, which provides efficient in-memory processing. One goal of this project is to develop workflows for large-scale data management and analysis using these frameworks and to optimize their execution in parallel and distributed environments. Finally, the project also aims to study conceptual modeling techniques with workflows and ontologies applied to Big Data, as well as preprocessing, indexing, and querying in Big Data, including approaches based on distributed file systems (HDFS), object-relational database management systems, NoSQL, and NewSQL.
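As a toy illustration of the data-parallel, map-reduce style of processing that frameworks such as Apache Spark provide, the sketch below uses only Python's standard-library `multiprocessing` (not Spark itself); the partition contents and function names are purely illustrative. Each worker counts words in one partition of records in parallel, and the partial counts are then merged:

```python
from multiprocessing import Pool

def count_words(partition):
    """Map step: count word occurrences in one partition of records."""
    counts = {}
    for record in partition:
        for word in record.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def merge(partials):
    """Reduce step: combine the partial counts from all workers."""
    total = {}
    for part in partials:
        for word, n in part.items():
            total[word] = total.get(word, 0) + n
    return total

if __name__ == "__main__":
    # Illustrative partitions; in a real workflow these would be
    # blocks of a large dataset distributed across cluster nodes.
    partitions = [["big data big"], ["data workflow"], ["big workflow data"]]
    with Pool(processes=3) as pool:
        partials = pool.map(count_words, partitions)
    print(merge(partials))  # → {'big': 3, 'data': 3, 'workflow': 2}
```

In Spark, the same pattern would be expressed over resilient distributed datasets, with partitioning, scheduling, and fault tolerance handled by the framework rather than by hand.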
- Eduardo Ogasawara (Leader)
- Jorge Soares
- Kele Belloze
- Rafaelli Coutinho