BIOINFORMATICS 2021 Abstracts


Full Papers
Paper Nr: 1
Title:

Coordinate Systems for Pangenome Graphs based on the Level Function and Minimum Path Covers

Authors:

Thomas Büchler, Caroline Räther, Pascal Weber and Enno Ohlebusch

Abstract: The Computational Pan-Genomics Consortium (Consortium, 2016) described the role of coordinate systems in genomics as follows: “A pan-genome defines the space in which (pan-)genomic analyses take place. It should provide a ‘coordinate system’ to unambiguously identify genetic loci and (potentially nested) genetic variants.” The most natural representations of pangenomes are graphs. The Computational Pan-Genomics Consortium identified desirable properties of the linear reference genome model that graphical frameworks should attempt to preserve: spatiality, monotonicity, and readability. In this paper, we introduce a coordinate system for DAGs that has these properties. It is based on the level function and a minimum path cover of the graph. Moreover, we describe a new method for finding a minimum path cover in a DAG, which works very well in practice.
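The two ingredients named in this abstract can be illustrated on a toy DAG. The sketch below is not the authors' new method: it assumes the level of a node is the length of the longest path reaching it from a source, and it counts paths with the classic reduction of a vertex-disjoint minimum path cover to bipartite matching. The function names `levels` and `min_disjoint_path_cover` are hypothetical.

```python
from collections import defaultdict

def levels(n, edges):
    """Assumed definition: level[v] = length of the longest path from a source to v."""
    succ = defaultdict(list)
    indeg = [0] * n
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    level = [0] * n
    ready = [v for v in range(n) if indeg[v] == 0]
    while ready:                          # Kahn-style topological sweep
        u = ready.pop()
        for v in succ[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return level

def min_disjoint_path_cover(n, edges):
    """Size of a minimum vertex-disjoint path cover: n minus a maximum matching."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    match_to = [-1] * n                   # match_to[v] = u means edge u -> v is chosen

    def augment(u, seen):
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_to[v] == -1 or augment(match_to[v], seen):
                match_to[v] = u
                return True
        return False

    matching = sum(augment(u, set()) for u in range(n))
    return n - matching

# A single "bubble": two alternative alleles (nodes 1 and 2) between nodes 0 and 3.
bubble = [(0, 1), (0, 2), (1, 3), (2, 3)]
print(levels(4, bubble), min_disjoint_path_cover(4, bubble))
```

In the bubble, both alleles share a level, which is what makes a level-based coordinate monotone along every path; the path index then distinguishes nodes on the same level.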

Paper Nr: 4
Title:

The Extension of the Standard Genetic Code via Optimal Codon Blocks Division

Authors:

Kuba Nowak, Paweł Błażej, Małgorzata Wnetrzak, Dorota Mackiewicz and Paweł Mackiewicz

Abstract: The standard genetic code (SGC) is a crucial biological system that allows genetic information to be transmitted from DNA sequences to the protein world. The idea of an optimal extension of the SGC with new information is especially interesting in the context of successful experimental achievements in reprogramming this code. The aim of this code engineering is to incorporate non-canonical amino acids (ncAAs) into synthesised artificial proteins with novel functions. Such molecules open new perspectives in medicine, chemistry and biotechnology. Several methods for extending the canonical coding system have been proposed. Here, we investigate the problem of optimal genetic code extension using graph theory methodology. We measured the quality of the considered coding systems using set conductance, which determines the level of robustness against point mutations of individual codon blocks encoding the same information. Thanks to that, we were able to find several possible optimal extensions of the SGC based on the utilization of codons that are redundant in the original code. We found codes that could encode up to 16 ncAAs and simultaneously code for 20 canonical amino acids and one stop translation signal. One of these codes was the most balanced, i.e. it consisted of a canonical set and an extended set characterized by the same level of robustness against point mutations. The proposed codes could be helpful in the experimental construction of artificial genetic codes that can encode new amino acids with new useful properties.
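To make the set conductance measure concrete, here is a minimal sketch. It assumes an unweighted codon graph in which two codons are adjacent when they differ by one point mutation (so every codon has 9 neighbours), and it assumes the conductance of a block is the number of edges leaving the block divided by 9 times the block size. The weighting actually used in the paper may differ, and `set_conductance` is a hypothetical name.

```python
from itertools import product

BASES = "ACGU"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]   # all 64 codons

def neighbours(codon):
    """Yield the 9 codons reachable from `codon` by a single point mutation."""
    for i, base in enumerate(codon):
        for other in BASES:
            if other != base:
                yield codon[:i] + other + codon[i + 1:]

def set_conductance(block):
    """Edges leaving the block divided by its volume (assumed: 9 * |block|)."""
    block = set(block)
    crossing = sum(1 for c in block for nb in neighbours(c) if nb not in block)
    return crossing / (9 * len(block))

# The four-codon alanine block GCN: only third-position mutations stay inside it.
print(set_conductance(["GCA", "GCC", "GCG", "GCU"]))
```

A lower value means that fewer point mutations change the encoded information, i.e. the block is more robust under this assumed measure.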

Paper Nr: 5
Title:

LigityScore: Convolutional Neural Network for Binding-affinity Predictions

Authors:

Joseph Azzopardi and Jean P. Ebejer

Abstract: Scoring functions are at the heart of structure-based drug design and are used to estimate the binding of ligands to a target. Finding a scoring function that can accurately predict the binding affinity is key for successful virtual screening methods. Deep learning approaches have recently seen a rise in popularity as a means to improve scoring functions, with the key advantage of automatic feature extraction and the creation of a complex representation without feature engineering or expert knowledge. In this study we present LigityScore1D and LigityScore3D, which are rotationally invariant scoring functions based on convolutional neural networks (CNNs). LigityScore descriptors are extracted directly from the structural and interacting properties of the protein-ligand complex and are input to a CNN for automatic feature extraction and binding affinity prediction. This representation uses the spatial distribution of Pharmacophoric Interaction Points, derived from interaction features of the protein-ligand complex based on pharmacophoric features conformant to specific family types and distance thresholds. The data representation component and the CNN architecture together constitute the LigityScore scoring function. The main contribution of this study is a novel protein-ligand representation for use as a CNN-based scoring function (SF) for binding affinity prediction. LigityScore models are evaluated for scoring power on the latest two CASF benchmarks. The Pearson correlation coefficient and the standard deviation in linear regression were used to compare and rank LigityScore against the benchmark model and against other models recently published in the literature. LigityScore3D achieved better overall results and showed similar performance on both CASF benchmarks. LigityScore3D ranked 5th on the CASF-2013 benchmark and 8th on CASF-2016, with an average R-score performance of 0.713 and 0.725, respectively. LigityScore1D ranked 8th on CASF-2013 and 7th on CASF-2016, with an R-score performance of 0.635 and 0.741, respectively. Our methods show relatively good performance when compared to the Pafnucy model (one of the best-performing CNN-based scoring functions) on the CASF-2013 benchmark, using a less computationally complex model that can be trained 16 times faster.

Paper Nr: 8
Title:

Backward Pattern Matching on Elastic Degenerate Strings

Authors:

Petr Procházka, Ondřej Cvacho, Luboš Krčál and Jan Holub

Abstract: Recently, the concept of Elastic Degenerate Strings (EDS) was introduced as a way of representing a sequenced population of the same species. Several on-line Elastic Degenerate String Matching (EDSM) algorithms have been presented so far, and some of them come with practical implementations. We propose a new on-line EDSM algorithm, BNDM-EDS. Our algorithm combines two traditional algorithms, BNDM and Shift-And, adapted to the specifics of Elastic Degenerate Strings. BNDM-EDS runs in O(Nm⌈m/w⌉) worst-case time. This implies O(Nm) time for small patterns, where m is the length of the searched pattern, N is the size of the EDS, and w is the size of the computer word. The algorithm uses O(N + n) space, where n is the length of the EDS. BNDM-EDS requires a simple preprocessing step with time and space O(m). Experimental results on real genomic data show the superiority of BNDM-EDS over state-of-the-art algorithms.
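For readers unfamiliar with the bit-parallel building block, the following is a minimal sketch of the classical Shift-And algorithm on an ordinary (non-degenerate) text. It is not BNDM-EDS and does not handle elastic degenerate segments; `shift_and` is a hypothetical name. Python integers are unbounded, so the machine-word size w that drives the ⌈m/w⌉ factor does not appear here.

```python
def shift_and(text, pattern):
    """Return the start positions of all occurrences of `pattern` in `text`."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # Preprocessing in O(m): bit i of masks[ch] is set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)       # bit that signals a full match
    state = 0                   # bit i set: pattern[:i + 1] matches a suffix of the text read so far
    hits = []
    for j, ch in enumerate(text):
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits

print(shift_and("ACGTACGA", "ACG"))
```

Each text character costs one shift, one OR and one AND as long as the pattern fits into a single word, which is where the O(Nm) bound for small patterns comes from once the same idea is applied across the segments of an EDS.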

Paper Nr: 13
Title:

Mining Biomedical Texts for Pediatric Information

Authors:

Tian Yun, Deepti Garg and Natalia Khuri

Abstract: To perform a comprehensive and detailed analysis of the gaps in knowledge about drugs’ safety and effectiveness in neonates, infants, children, and adolescents, large collections of complex and unstructured texts need to be analyzed. In this work, machine learning algorithms have been used to implement classifiers of biomedical texts and to extract information about safety and efficacy of drugs in pediatric populations. Models were trained using approved drug product labels and computational experiments were conducted to evaluate the accuracy of the models. A Support Vector Machine with a radial kernel had the best performance by classifying short texts with an accuracy of 94% and an excellent precision. Results show that classifiers perform better when trained using features comprising multiple words rather than single words. The proposed text classifier may be used to mine other sources of biomedical information, such as research publications and electronic health records.

Paper Nr: 21
Title:

Applying PySCMGroup to Breast Cancer Biomarkers Discovery

Authors:

Mazid A. Osseni, Prudencio Tossou, Jacques Corbeil and François Laviolette

Abstract: Background. The identification of biomarkers associated with triple-negative breast cancer (TNBC) is still an active area of research due to the complexity of finding robust biomarkers associated with the disease. Previous methods have attempted to tackle the problem from a mono-perspective view by analyzing each omics individually in the search for biomarkers. The majority of these methods focus on gene expression analysis, since its impact on the phenotype is easier to measure and possibly more direct. However, it is commonly understood that genes belong to pathways and tend to work together within various metabolic, regulatory, and signalling pathways. Hence, in this work, we tackled the TNBC biomarker discovery problem as a multi-omic pathway-based problem by efficiently combining the biological knowledge from multiple pathways using a novel machine learning algorithm. The proposed algorithm, called GroupSCM, is an extension of the Set Covering Machine (SCM) that incorporates the pathway features as priors. Results. Although the GroupSCM performed similarly to the SCM metric-wise, it helped identify new biomarkers not previously found by the SCM. By leveraging the pathway priors, the GroupSCM was able to uncover two miRNAs, hsa-mir-18a and hsa-mir-190b, which are already known to be associated with various cancers including breast cancer but have yet to be linked to the Triple-Negative Breast Cancer phenotype. Conclusion. The addition of priors to the SCM leads to interpretable, complete and sparser models, which are easier to analyze in in vivo settings. It also provides insight into the omics interaction by highlighting the contribution of the miRNAs and the epigenome to the prediction task. Code Availability: The code is available at: https://github.com/dizam92/BRCA experiments and paper

Short Papers
Paper Nr: 2
Title:

TotemBioNet Enrichment Methodology: Application to the Qualitative Regulatory Network of the Cell Metabolism

Authors:

Laetitia Gibart, Gilles Bernot, Hélène Collavizza and Jean-Paul Comet

Abstract: When designing a biological regulatory network, new information or wet experiments can require adding variables or interactions inside a previously validated model. These additions can result in complete reconsiderations of established behaviours. Fortunately, formal methods allow for fully automated verification of properties, and TotemBioNet is an efficient software tool integrating a collection of formal approaches for regulatory networks. It allowed us to develop a multidisciplinary methodology for designing large dynamical models in an incremental way, including non-regression proofs (preservation of important biological properties).

Paper Nr: 3
Title:

Non-coding DNA: A Methodology for Detection and Analysis of Pseudogenes

Authors:

Gabriella Trucco and Vittorio Cerioli

Abstract: It is well known that elements lying outside the coding regions of the human genome are involved in many human diseases. Therefore, efforts to detect and characterize functional elements in the non-coding regions are rapidly increasing. Among the many types of non-coding DNA, pseudogenes are sequences that share some similarities with their parental genes but have lost their ability to code for proteins. In this paper, we propose a methodology for the detection and analysis of pseudogenes, based on the transition probabilities of the nucleotides and their occurrences. The 1000-base-pair downstream region of each detected pseudogene is analyzed in order to find a polyA tail and a polyadenylation signal. We implemented a Hidden Markov Model with the Viterbi algorithm to decode the upstream regions of the previously detected pseudogenes in order to search for CpG islands. To identify motif signals in the selected pseudogenes, we implemented the Gibbs sampling algorithm and executed it on the flanking regions of some pseudogenes. Results demonstrate that the proposed methodology is an efficacious solution for detecting new potential loci, especially when the query coverage of the alignment is shorter than the coding strand. These loci can be classed as pseudogene fragments.
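The Viterbi decoding step mentioned above can be sketched with a toy two-state model. Everything numeric below is a made-up assumption for illustration (state "I" for island, state "B" for background, and all probabilities); the abstract does not give the authors' model, and a realistic CpG-island HMM usually has more states.

```python
import math

STATES = ("I", "B")                      # hypothetical: island / background
START = {"I": 0.5, "B": 0.5}
TRANS = {"I": {"I": 0.9, "B": 0.1}, "B": {"I": 0.1, "B": 0.9}}
EMIT = {
    "I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},   # C/G-rich state
    "B": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def viterbi(seq):
    """Most probable state path for a non-empty `seq`, computed in log space."""
    score = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for ch in seq[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][ch])
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    path = [max(STATES, key=lambda s: score[-1][s])]
    for ptr in reversed(back):           # follow back-pointers from the last position
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

print(viterbi("CGCGCGCG"), viterbi("ATATATAT"))
```

Working in log space avoids numerical underflow on upstream regions that are hundreds of bases long.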

Paper Nr: 9
Title:

Finding Potential Inhibitors of COVID-19

Authors:

Angela Kralevska, Marija Velichkovska, Viktor Cicimov, Tome Eftimov and Monika Simjanoska

Abstract: COVID-19 is an infectious disease caused by the virus SARS-CoV-2, which spread globally due to its highly contagious nature and became an ongoing pandemic. The lack of vaccines and drugs to treat infected patients is a great problem in the fight against this pandemic. Molecular docking is one of the best approaches for searching for potential drugs in real time, and it can be applied to COVID-19. In this experiment, molecular docking studies of fourteen ligands were carried out with three important proteins related to SARS-CoV-2 infection, i.e. main protease, ACE2, and spike glycoprotein. From the obtained results, we observed that many of the tested molecules showed better docking scores than remdesivir and dexamethasone, drugs that are claimed to be effective against COVID-19. Combining the docking score and other properties, we believe that auranetin can be further explored for potential use against COVID-19.

Paper Nr: 10
Title:

Flower Pollination Algorithm for Detection of Epistasis Associated with a Phenotype

Authors:

Jozef Sitarčík, Mária Lucká and Tibor Krajčovič

Abstract: Detecting associations of SNPs with traits such as complex diseases can provide valuable insights. However, due to epistasis (complex interactions between SNPs), SNP combinations need to be evaluated for their association with a trait. As the number of possible SNP combinations grows rapidly with the number of SNPs, great computational challenges have to be tackled. In this paper, we propose FPepi, an epistasis detection tool based on the flower pollination algorithm with multiple objectives. Two variants of the algorithm are proposed: one uses the Gini score and the K2 score as objectives, while the second uses the K2 score and the mutual information score. The flower pollination algorithm selects a small subset of potential SNP combinations that are then evaluated by the G-test. The proposed tool showed better detection power than other similar tools.
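The final filtering step relies on the G-test, whose statistic is G = 2 Σ O ln(O/E) over the cells of a contingency table. The sketch below computes only the statistic for a table whose rows are genotype combinations and whose columns are cases and controls; the table layout and the name `g_statistic` are assumptions, and the conversion to a p-value (chi-squared with (rows - 1)(columns - 1) degrees of freedom) is left out.

```python
import math

def g_statistic(table):
    """G = 2 * sum(O * ln(O / E)) over all cells with a non-zero observed count."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:                       # 0 * ln(0) is taken as 0
                expected = row_totals[i] * col_totals[j] / n
                g += observed * math.log(observed / expected)
    return 2.0 * g

# Rows: two hypothetical genotype combinations; columns: cases, controls.
print(g_statistic([[30, 10], [10, 30]]))
print(g_statistic([[10, 10], [20, 20]]))           # independent table, G = 0
```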

Paper Nr: 14
Title:

Possibilities of using Neural Networks to Blood Flow Modelling

Authors:

Katarína Buzáková, Katarína Bachratá, Hynek Bachratý and Michal Chovanec

Abstract: Computer simulation of the flow of blood or other fluids helps to reduce the various costs of biological experiments in microfluidics. It turns out that, like biological experiments, simulations also have limitations. However, data from both types of experiments can be further processed by machine learning methods in order to improve them and thus contribute to the optimization of microfluidic devices. This article describes the possibilities of using neural networks for blood flow modelling. We focus mainly on the prediction of red blood cell movement. We also propose other possible uses of neural networks with regard to the needs of further research in simulation modelling.

Paper Nr: 16
Title:

Voxelized Breast Phantoms for Dosimetry in Mammography

Authors:

R. M. Tucciariello, P. Barca, D. D. Sarto, R. Lamastra, G. Mettivier, A. Retico, P. Russo, A. Sarno, A. C. Traino and M. E. Fantacci

Abstract: X-ray breast imaging techniques are an essential part of breast cancer screening programs, and their improvement leads to gains in performance and accuracy. Radiation dose estimation and control play an important role in digital mammography and digital breast tomosynthesis investigations, since the risk of radioinduced cancer to the gland must be contained and the dose delivered to the gland must be declared in the medical report. Current dosimetric protocols suggest assessing the radiation dose by means of Monte Carlo calculations on digital breast phantoms, under the assumption of a homogeneous mixture of glandular and adipose tissues within the breast, which is a drastic approximation. In line with the trend of other research groups, and with the aim of improving the Monte Carlo model, the current work proposes a new heterogeneous digital breast model that involves a voxelized approach and departs from the concept of a homogeneous phantom. The proposed model is based on new findings in the literature and, after a validation process, is adopted to evaluate mean glandular dose discrepancies with respect to the traditional model, which has been adopted in clinical practice for decades.

Paper Nr: 19
Title:

Fitting Personalized Mechanistic Mathematical Models of Acute Myeloid Leukaemia to Clinical Patient Data

Authors:

Dennis Görlich

Abstract: In this position paper, we discussed the potential to fit mechanistic mathematical models of acute myeloid leukaemia to patient data. The overarching aim was to estimate personalized models. We briefly introduced one selected mechanistic ODE model to illustrate the approach. The outcome measures usually available, e.g. in clinical datasets, were aligned with the model’s prediction capabilities. Among the most relevant outcomes (blast load, complete remission, and survival), only blast load turned out to be well suited for use in the model fitting process. We formulated an optimization problem that, finally, resulted in personalized model parameters. The degree of personalization could be chosen by selecting only a subset of parameters within the optimization problem. To illustrate the fitness landscape for individual patients, we performed a grid search and calculated the fitness values for each grid point. The grid search revealed that an optimum exists, but that the fitness landscape can be very noisy. In these cases, gradient-based solvers will perform poorly and other algorithms need to be chosen. Finally, we believe that personalized model fitting will be a promising approach to integrate mechanistic mathematical models into clinical research.

Paper Nr: 20
Title:

Machine Learning Studies of Non-coding RNAs based on Artificially Constructed Training Data

Authors:

Mirele F. Costa, João A. Oliveira, Waldeyr M. C. da Silva, Rituparno Sen, Jörg Fallmann, Peter F. Stadler and Maria T. Walter

Abstract: Machine learning (ML) methods are often used to identify members of non-coding RNA classes such as microRNAs or snoRNAs. However, ML methods have not been successfully used for homology search tasks. A systematic evaluation of ML in homology search requires large, controlled, and known ground truth test sets, and thus, methods to construct large realistic artificial data sets. Here we describe a method for producing sets of arbitrarily large and diverse snoRNA sequences based on artificial evolution. These are then used to evaluate supervised ML methods (Support Vector Machine, Artificial Neural Network, and Random Forest) for snoRNA detection in a chordate genome. Our results indicate that ML approaches can indeed be competitive also for homology search.

Paper Nr: 6
Title:

A Deep Learning Method to Impute Missing Values and Compress Genome-wide Polymorphism Data in Rice

Authors:

Tanzila Islam, Chyon H. Kim, Hiroyoshi Iwata, Hiroyuki Shimono, Akio Kimura, Hein Zaw, Chitra Raghavan, Hei Leung and Rakesh K. Singh

Abstract: Missing value imputation and compression of genome-wide DNA polymorphism data are considered challenging tasks in genomic data analysis. Missing data, that is, the lack of information in a dataset, directly influences data analysis performance. The aim is to develop a deep learning model named Autoencoder Genome Imputation and Compression (AGIC), which can impute missing values and compress genome-wide polymorphism data using a separated neural network model to reduce the computational time. This research takes on the challenge of constructing an Autoencoder-based model for genomic analysis, in other words, research that fuses agriculture and information sciences. Moreover, nothing is yet known about missing value imputation and genome-wide polymorphism data compression using a Separated Stacking Autoencoder Model. The main contributions are: (1) missing value imputation of genome-wide polymorphism data, (2) genome-wide polymorphism data compression of rice DNA. To demonstrate the usage of the AGIC model, real genome-wide polymorphism data from a rice MAGIC population has been used.

Paper Nr: 11
Title:

Modeling Haptic Data Transfer Processes through a Thermal Interface using an Equivalent Electric Circuit Approach

Authors:

Yosef Y. Shani and Simon Lineykin

Abstract: Many activities and scenarios today require human-computer interactions (HCI), and since traditional communication channels such as vision and hearing are often overloaded or irrelevant, there is an increasing interest in haptic interfaces, specifically thermal ones. Designing and optimizing an effective tactile interface requires an easy-to-use simulation tool to reduce the time spent on empirical experiments. An original modeling tool was developed in this study to support cutting-edge research on human response to thermal stimuli. The human skin tissue model is developed as an equivalent electrical circuit for simultaneous simulation with a thermal display scheme and its control circuitry. The simulator enables monitoring of heat flows and temperature variations at any location in the system, including inside the skin tissue (for instance, at the depth of the thermoreceptors), without intervening in the process itself. The other generic advantage of performing tests with a simulator is the ability to adjust the parameters according to the variety of skin types, test conditions, or thermal display characteristics, and to simulate the response to different generated stimuli. This report presents the methodology and structure of the model along with an initial empirical validation and suggests directions for further research and future implementation.

Paper Nr: 12
Title:

Comprehensive Statistical Analysis on Estimated Errors of Averagine Model for Intact Proteins

Authors:

Yuanxi Che

Abstract: The Averagine Model (AM) is a very popular and practical computing tool in top-down proteomics, usually employed to predict the monoisotopic mass of an unknown protein or peptide of interest. However, with the significant advancement of high-resolution and high-accuracy mass spectrometry (MS) instrumentation, the limits of AM’s accuracy have become more and more significant. Here we statistically studied AM’s mass errors using all proteins in the Human protein database. The errors of both the estimated monoisotopic mass and the average mass are analysed comprehensively in this paper. From the results obtained, we determined the difference in error range between these two types of mass error, and we further analysed the error contributions at the level of the individual elements C, H, N, O and S that constitute the proteins. Our analysis provides an experimental basis for further improving the averagine model in top-down proteomics.
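As a rough illustration of how an averagine estimate is formed, the sketch below scales the commonly cited averagine unit (C4.9384 H7.7583 N1.3577 O1.4773 S0.0417, average mass 111.1254 Da) to a given average mass, rounds to integer atom counts, and sums monoisotopic element masses. Whether the paper uses exactly these constants or this rounding scheme is an assumption, and the function names are hypothetical.

```python
# Commonly cited averagine unit (assumed here, not taken from the abstract).
AVERAGINE = {"C": 4.9384, "H": 7.7583, "N": 1.3577, "O": 1.4773, "S": 0.0417}
AVERAGINE_MASS = 111.1254            # average mass of one averagine unit, in Da

# Monoisotopic masses of the most abundant isotopes, in Da.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146, "S": 31.9720707}

def averagine_composition(average_mass):
    """Scale the averagine unit to `average_mass` and round to whole atoms."""
    units = average_mass / AVERAGINE_MASS
    return {element: round(count * units) for element, count in AVERAGINE.items()}

def monoisotopic_mass(composition):
    """Sum of monoisotopic element masses for an integer composition."""
    return sum(MONO[element] * count for element, count in composition.items())

composition = averagine_composition(1111.254)      # ten averagine units
print(composition, monoisotopic_mass(composition))
```

The rounding to whole atoms and the fixed elemental ratios are two places where an estimate of this kind can drift from a real protein's mass, which is the sort of error the study quantifies per element.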

Paper Nr: 15
Title:

Machine Learning Algorithms for Predicting Chronic Obstructive Pulmonary Disease from Gene Expression Data with Class Imbalance

Authors:

Kunti R. Mahmudah, Bedy Purnama, Fatma Indriani and Kenji Satou

Abstract: Chronic obstructive pulmonary disease (COPD) is a progressive inflammatory lung disease that causes breathlessness and leads to serious illness, including lung cancer. It is estimated that COPD caused 5% of all deaths globally in 2015, placing COPD among the three leading causes of death worldwide. This study proposes methods that use gene expression data from microarrays to predict the presence or absence of COPD. The proposed method assists in determining better treatments to lower the fatality rates. In this study, microarray data of small airway epithelium cells, comprising 135 samples from 23 smokers with COPD (9 GOLD stage I, 12 GOLD stage II, and 2 GOLD stage III), 59 healthy smokers, and 53 healthy nonsmokers, were selected from a GEO dataset. The machine learning and regression algorithms applied in this study included Random Forest, Support Vector Machine, Naïve Bayes, Gradient Boosting Machines, Elastic Net Regression, and Multiclass Logistic Regression. After reducing the effect of class imbalance using SMOTE, the classification algorithms were run using 825 of the selected features. High AUC scores were achieved by elastic net regression and multiclass logistic regression, with AUCs of 89% and 90%, respectively. Both classifiers also outperformed the others in terms of accuracy, specificity, and sensitivity.
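The oversampling idea behind SMOTE can be sketched in a few lines: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its k nearest minority neighbours. This is a simplified illustration with a hypothetical function name and toy data, not the implementation or the settings used in the study.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate `n_new` synthetic samples from at least two minority-class points."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (itself excluded)
        nearest = sorted((p for p in minority if p is not base),
                         key=lambda p: dist2(base, p))[:k]
        neighbour = rng.choice(nearest)
        gap = rng.random()               # position along the segment, in [0, 1)
        synthetic.append([x + gap * (y - x) for x, y in zip(base, neighbour)])
    return synthetic

# Toy two-feature "expression profiles" for the under-represented class.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for sample in smote(minority, n_new=5):
    print(sample)
```

Oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the evaluation data.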

Paper Nr: 18
Title:

Anomalies Detection in Gene Expression Matrices: Towards a New Approach

Authors:

Nicoletta D. Buono, Flavia Esposito, Laura Selicato and Maria C. Vegliante

Abstract: One of the main problems in analyzing real data is often related to the presence of anomalies. Anomalous cases may, in fact, spoil the resulting analysis, but they may also contain valuable information. In both cases, the ability to detect these occurrences is very important. In the biomedical field in particular, proper identification of outliers allows the development of novel biological hypotheses that are not taken into consideration when experimental biological data are considered. In this paper, we address the problem of detecting outlier samples in gene expression data. We propose an ensemble approach for anomaly detection in gene expression matrices, based on hierarchical clustering and Robust Principal Component Analysis, which allows us to derive a novel pseudo-mathematical classification of anomalies.

Paper Nr: 22
Title:

Adaptive Learning Control and Monitoring of Oxygen Saturation for COVID-19 Patients

Authors:

Lubna Farhi, Rija Rehman and Muhammad A. Khan

Abstract: This paper proposes adaptive learning control and monitoring of oxygen for patients with breathing complexities and respiratory diseases. By recording oxygen saturation levels in real time, the system uses an adaptive learning controller (ALC) to vary the oxygen delivered to the patient and maintain it in an optimum range. In the presented approach, the PID controller gain is tuned with the learning technique to provide improved response time and a proactive approach to oxygen control for the patient. A case study is performed by monitoring the time-varying health vitals across different age groups to gain a better understanding of the relationship between these parameters for COVID-19 patients. This information is then used to improve the standard of care supplied to patients and to reduce the time to recovery. Results show that the ALC kept the oxygen saturation within the target range of 90% to 94% SpO2 for 77% and 80.1% of the time in patients aged 40 to 50 and 50 to 60 years, respectively. It also had a faster time to recovery to the target SpO2 range when the concentration dropped rapidly or when the patient became hypoxic, compared with manual control of the oxygen saturation by healthcare staff.

Paper Nr: 24
Title:

A Methodology based on Formal Methods for Thermal Ablation Area Detection

Authors:

Luca Brunese, Francesco Mercaldo, Antonella Santone and Giuseppe P. Vanoli

Abstract: Thermal ablation is the destruction of tissue by elevated or depressed tissue temperatures. The machine used to perform this process is called a thermal ablator, and it requires as input the area of the tissue to be treated. With the aim of assisting doctors in detecting the area targeted by the thermal ablator, we propose a methodology based on formal methods that represents medical images in formal, mathematical terms in order to detect this area.

Paper Nr: 25
Title:

MSL-ST: Development of Mass Spectral Library Search Tool to Enhance Compound Identification

Authors:

Teodora Gerasimoska, Milka Ljoncheva and Monika Simjanoska

Abstract: Identification of new organic compounds through suspect screening (SS) and non-targeted analysis (NTA) has become the most challenging task in environmental and metabolomics research in the last two decades. Thousands of organic compounds are identified using recent technological advances in chromatography-mass spectrometry as the core analytical platform, assisted by a multitude of cheminformatics approaches. As many of those approaches rely on searching mass spectral libraries (MSLs), the availability of comprehensive MSLs with engines for batch search and export of MS data, and of batch search engines for simultaneous search and export of MS data from multiple MSLs, is of crucial importance. In the absence of such tools, analysts perform this step in a laborious, time-consuming manual manner, introducing a significant risk of compound misidentification. This paper presents MSL-ST, the first tool for automated batch search and storage of MS spectra that uses two of the largest publicly available MSLs as data sources, the MassBank of North America (MoNa) and the MassBank of Europe. MSL-ST assembles a large amount of MS data in an automated, time- and cost-effective manner, in a format that allows further processing, especially for the purpose of compound identification. The tool, accompanied by a user manual, is publicly available on GitHub.