Dissertation defense (September 13, 2023): Jéssica da Silva Costa

Student: Jéssica da Silva Costa

Title: Methods Based on Homology and Machine Learning for Identification of Essential Proteins

Advisor: Kele Teixeira Belloze

Committee: Kele Teixeira Belloze (CEFET/RJ), Eduardo Bezerra (CEFET/RJ), Diogo Antonio Tschoeke (Coppe/UFRJ), Victor Ströele de Andrade Menezes (UFJF)

Day/Time: September 13, 2023 / 8 a.m.

Room: https://teams.microsoft.com/l/meetup-join/19%3a8bd040fc5e004447b6a1fa09484d81d0%40thread.tacv2/1694208843103?context=%7b%22Tid%22%3a%228eeca404-a47d-4555-a2d4-0f3619041c9c%22%2c%22Oid%22%3a%22d0ca0ae9-1955-4759-a7ad-0b2fa49dbe55%22%7d

Abstract:

Drug development is often a complex and time-consuming process. Especially in the initial phase, selecting a target for drug development can take many years. Essential genes and proteins are biological entities responsible for the processes that sustain the survival and reproduction of organisms. Genes and proteins related by common ancestry in organisms of different species usually retain their function. Furthermore, studies indicate that essential genes tend to have higher expression and to encode proteins that engage in more protein-protein interactions (PPI). All these characteristics make such proteins potential drug targets. Many works in the literature propose biological and computational approaches for identifying essentiality. In this context, this work presents two workflows for identifying essentiality characteristics in proteins, aimed at finding drug targets in the target organism S. mansoni. For this, a method based on homology and a method based on machine learning were applied, using the model organisms S. cerevisiae, C. elegans and D. melanogaster. The homology-based method classified about 11 proteins as essential candidates shared by the whole group of model organisms and S. mansoni. Among the pairwise comparisons, the largest number was obtained with S. cerevisiae, where 726 candidate essential proteins were identified. For the machine learning-based method, experiments carried out with three tree-based algorithms, using context-based (PPI) and sequence-based features, showed better recall values when the undersampling technique was used. In quantitative terms, about 4000 proteins were predicted as essential by the XGBoost and Gradient Boosting algorithms and 3800 proteins by the Random Forest algorithm. About 3300 proteins were predicted as essential by all three algorithms, which indicates a degree of agreement between their results.
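As a rough illustration of two ideas mentioned in the abstract, random undersampling of the majority class and agreement between the predictions of three algorithms, the following Python sketch uses toy data. The function names, labels, and protein IDs are hypothetical and do not come from the dissertation; the actual workflow may differ in every detail.

```python
import random


def undersample(labels, seed=0):
    """Return sorted indices of a class-balanced subset.

    Every example of the smaller class is kept, and an equally sized
    random subset of the larger class is drawn without replacement.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))
    return sorted(minority + kept)


def consensus(*predicted_sets):
    """Items predicted as essential by every model (set intersection)."""
    return set.intersection(*map(set, predicted_sets))


# Toy labels, purely illustrative: 2 essential (1) vs 6 non-essential (0).
labels = [1, 0, 0, 1, 0, 0, 0, 0]
balanced = undersample(labels)
balanced_labels = [labels[i] for i in balanced]

# Hypothetical protein IDs predicted as essential by three models.
rf = {"P1", "P2", "P3"}
xgb = {"P2", "P3", "P4"}
gb = {"P2", "P3", "P5"}
agreed = consensus(rf, xgb, gb)
```

Balancing the training set in this way tends to trade some precision for recall on the rarer class, which is consistent with the recall improvement the abstract reports; the intersection step corresponds to counting the proteins on which all three algorithms agree.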