BIOINFORMATICS 2020 Abstracts


Full Papers
Paper Nr: 11
Title:

Efficient Computation of Base-pairing Probabilities in Multi-strand RNA Folding

Authors:

Ronny Lorenz, Christoph Flamm, Ivo L. Hofacker and Peter F. Stadler

Abstract: RNA folding algorithms, including McCaskill’s partition function algorithm for computing base pairing probabilities, can be extended to N ≥ 2 interacting strands by considering all permutations π of the N strands. For each π, the inside dynamic programming recursion for connected structures needs to be extended by only a single extra case corresponding to a base pair connecting exactly two connected substructures. This leaves the cubic running time unchanged. A straightforward implementation of the corresponding outside recursion, however, results in a quartic algorithm. We show here how cubic running time asymptotically equal to McCaskill’s partition function algorithm can be achieved by introducing linear-size auxiliary arrays. The algorithm is implemented within the framework of the ViennaRNA package and conforms to the theoretical performance bounds.
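
For orientation, the cubic inside recursion that the abstract takes as its starting point can be sketched for a single strand. This is a toy illustration only: it assumes every base pair contributes the same Boltzmann weight (`pair_weight`, a hypothetical parameter) and ignores loop energies. It is neither the multi-strand algorithm of the paper nor the ViennaRNA API.

```python
# Toy single-strand partition function in the style of McCaskill's recursion.
# Assumption: each base pair has the same Boltzmann weight; no loop energies.

CANONICAL_PAIRS = {
    ("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G"),
}


def partition_function(seq, pair_weight=1.0, min_loop=3):
    """Sum of Boltzmann weights over all non-crossing structures of seq."""
    n = len(seq)
    # z[i][j] is the partition function of seq[i:j]; empty intervals count as 1.
    z = [[1.0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            last = j - 1
            # Case 1: the last base of the interval is unpaired.
            total = z[i][last]
            # Case 2: the last base pairs with k, enclosing >= min_loop bases.
            for k in range(i, last - min_loop):
                if (seq[k], seq[last]) in CANONICAL_PAIRS:
                    total += z[i][k] * pair_weight * z[k + 1][last]
            z[i][j] = total
    return z[0][n]


if __name__ == "__main__":
    # With pair_weight=1.0 the function simply counts admissible structures.
    print(partition_function("GGAAACC"))
```

The three nested loops make the cubic running time visible. The multi-strand extension described in the abstract adds one further case to this inside recursion.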

Paper Nr: 18
Title:

Prediction of Dynamical Properties of Biochemical Pathways with Graph Neural Networks

Authors:

Pasquale Bove, Alessio Micheli, Paolo Milazzo and Marco Podda

Abstract: Biochemical pathways are often represented as graphs, in which nodes and edges give a qualitative description of the modeled reactions, while node and edge labels provide quantitative details such as kinetic and stoichiometric parameters. Dynamical properties of biochemical pathways are usually assessed by performing numerical (ODE-based) or stochastic simulations in which quantitative parameters are essential. These simulation methods are often computationally very expensive, in particular when property assessment requires varying parameters such as initial concentrations of molecules. In this paper we propose the use of a Deep Neural Network (DNN) to predict such dynamical properties relying only on the graph structure. In particular, our model is based on Graph Neural Networks. We focus on the dynamical property of concentration robustness, which is the ability of the pathway to maintain the concentration of some molecules within certain intervals despite perturbations in the initial concentrations of other molecules. The use of DNNs can allow robustness to be predicted without the burden of performing a huge number of numerical or stochastic simulations. Moreover, once trained, the model could be applied to predicting robustness properties for pathways in which quantitative parameters are not available.

Paper Nr: 24
Title:

Properties of the Standard Genetic Code and Its Alternatives Measured by Codon Usage from Corresponding Genomes

Authors:

Małgorzata Wnetrzak, Paweł Błażej and Paweł Mackiewicz

Abstract: The standard genetic code (SGC) and its modifications, i.e. alternative genetic codes (AGCs), are coding systems responsible for decoding genetic information from DNA into proteins. The SGC is thought to be universal for almost all organisms, whereas alternative genetic codes operate mainly in organelles and some specific microorganisms that usually have reduced genomes. Previous analyses showed that the AGCs minimize the consequences of amino acid replacements due to point mutations better than the SGC. However, these studies did not take into account the potential differences in codon usage between the genomes on which given codes operate. The previous analyses assumed a uniform distribution of codons, even though we can observe significant codon bias in genomes. Therefore, we developed a new measure involving codon usage as an additional parameter, which allowed us to assess the quality of a given genetic code. We tested our approach on the SGC and its 13 alternatives. For each AGC we applied an appropriate codon usage characteristic of a genome on which this code operates. This approach is more reliable for testing the impact of codon reassignments observed in the AGCs on their robustness to point mutations. The results indicate that the AGCs are generally more robust to point mutations than the SGC, especially when we consider the codon usages characteristic of their corresponding genomes. Moreover, we did not find a genetic code optimal for all considered codon usages, which indicates that the alternative variants of the SGC evolved in specific conditions.

Paper Nr: 38
Title:

BROGUE: A Platform for Constructing and Visualizing “Gene-Mutation-Disease” Relation Knowledge Graphs to Support Biomedical Research and Clinical Decisions

Authors:

Dongsheng Zhao, Fan Tong, Zheheng Luo, Sheng Liu and Wei Song

Abstract: In the era of precision medicine, clinicians need intensive and comprehensive evidence to conduct research and make decisions. However, current knowledge bases are isolated and lack integration with information from other databases or literature, which makes it difficult for clinicians to locate and understand the relations they are interested in. In this paper, we design a platform development methodology to construct and visualize a biomedical knowledge graph combining text mining tools and knowledge fusion models with web interface libraries. The platform thereby provides the functions of knowledge acquisition, integration, storage, search and visualization, where each concept in the relation is described by its properties, each relation in the database is located to sentences and each paragraph in the article is translated into Chinese. To further validate the feasibility and practicability, we applied the methodology to the “gene-mutation-disease” field and built a Biomedical Relation of Gene-mUtation-diseasE (BROGUE) platform. The platform included 590 high-quality gene-mutation-disease relations covering a wide range of commonly-used gene (286), mutation (525) and disease (347) concepts by October 2019. Two tests demonstrated that BROGUE has the potential to be useful for supporting biomedical research and clinical decisions. The platform has been deployed and is publicly available at http://brogue.medmdt.net/.

Paper Nr: 42
Title:

Gene Co-expression Analysis for Lung Cancer Biomarkers Detection

Authors:

Stefani Kostadinovska, Slobodan Kalajdziski and Monika Simjanoska

Abstract: Cancer is one of the most widespread diseases. Its complexity makes it difficult to analyze and to detect biomarkers that would facilitate targeted treatments. This study presents a methodology based on gene expression data that provides promising results in terms of revealing potential biomarkers associated with lung cancer. To accomplish this, gene networks are built that represent the correlation among the genes. These networks are further analyzed and specific modules are created. Representative genes are then detected for each of the modules, leading to the identification of potential biomarkers for lung cancer. The reliability of the revealed biomarkers is supported by the literature.
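
The network-building step mentioned in the abstract can be sketched as follows. This is a minimal illustration that assumes Pearson correlation and a fixed absolute-correlation cutoff; the abstract does not state which correlation measure or threshold the authors used, and the gene names and `threshold` value below are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.8):
    """Return (gene_a, gene_b, r) for every gene pair with |r| >= threshold.

    `expression` maps a gene name to its expression values across samples.
    """
    genes = sorted(expression)
    edges = []
    for index, gene_a in enumerate(genes):
        for gene_b in genes[index + 1:]:
            r = pearson(expression[gene_a], expression[gene_b])
            if abs(r) >= threshold:
                edges.append((gene_a, gene_b, r))
    return edges


if __name__ == "__main__":
    # Hypothetical expression profiles over four samples.
    toy = {
        "g1": [1.0, 2.0, 3.0, 4.0],
        "g2": [2.0, 4.0, 6.0, 8.0],
        "g3": [1.0, -1.0, 1.0, -1.0],
    }
    print(coexpression_edges(toy))
```

Modules could then be derived from the resulting edge list, for example by clustering or by taking connected components. The module-detection method actually used in the study is not specified in the abstract.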

Short Papers
Paper Nr: 2
Title:

Minimal Complexity Requirements for Proteins and Other Combinatorial Recognition Systems

Authors:

George D. Montañez, Laina Sanders and Howard Deshong

Abstract: How complex do proteins (and other multi-part recognition systems) need to be? Using an information-theoretic framework, we characterize the information costs of recognition tasks and the information capacity of combinatorial recognition systems, to determine minimum complexity requirements for systems performing such tasks. Reducing the recognition task to a finite set of binary constraints, we determine the sizes of minimal equivalent constraint sets using a form of distinguishability, and show how the representation of constraint sets as binary circuits or decision trees also results in minimum constraint set size requirements. We upper-bound the number of configurations a recognition system can distinguish between as a function of the number of parts it contains, which we use to determine the minimum number of parts needed to accomplish a given recognition task. Lastly, we apply our framework to DNA-binding proteins and derive estimates for the minimum number of amino acids needed to accomplish binding tasks of a given complexity.

Paper Nr: 4
Title:

Expanding Polygenic Risk Scores to Include Automatic Genotype Encodings and Gene-gene Interactions

Authors:

Trang T. Le, Hoyt Gong, Patryk Orzechowski, Elisabetta Manduchi and Jason H. Moore

Abstract: Polygenic Risk Scores (PRS) are aggregations of genetic risk factors of specific diseases and have been successfully used to identify groups of individuals who are more susceptible to those diseases. While several studies have focused on identifying the correct genetic variants to include in PRS, most existing statistical models focus on the marginal effect of the variants on the phenotypic outcome but do not account for the effect of gene-gene interactions. Here, we propose a novel calculation of the risk score that expands beyond the marginal effect of individual variants on the outcome. The Multilocus Risk Score (MRS) method effectively selects alternative genotype encodings and captures epistatic gene-gene interactions by utilizing an efficient implementation of the model-based Multifactor Dimensionality Reduction technique. On a diverse collection of simulated datasets, MRS outperforms the standard PRS in the majority of the cases, especially when at least two-way interactions between the variants are present. Our findings suggest that models incorporating epistatic interactions are necessary and will yield more accurate and effective risk profiling.
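
As a point of reference, the standard additive PRS that the abstract contrasts with MRS is commonly computed as a weighted sum of risk-allele counts. The sketch below assumes additive genotype encoding (0, 1 or 2 copies of the risk allele) and uses hypothetical variant names and effect sizes. It does not implement MRS.

```python
def polygenic_risk_score(genotype, weights):
    """Additive PRS: sum of effect size times risk-allele count per variant.

    `genotype` maps variant id -> allele count (0, 1 or 2).
    `weights` maps variant id -> per-allele effect size (e.g. a log odds ratio).
    Variants missing from `genotype` are skipped.
    """
    return sum(
        weights[variant] * genotype[variant]
        for variant in weights
        if variant in genotype
    )


if __name__ == "__main__":
    # Hypothetical effect sizes and one individual's genotype.
    weights = {"variant_1": 0.5, "variant_2": -0.2, "variant_3": 0.1}
    genotype = {"variant_1": 2, "variant_2": 1, "variant_3": 0}
    print(polygenic_risk_score(genotype, weights))
```

Because each variant contributes independently, a score of this form cannot represent an effect that appears only for a particular combination of genotypes at two loci, which is the limitation the abstract addresses.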

Paper Nr: 5
Title:

Influence of Data Similarity on the Scoring Power of Machine-learning Scoring Functions for Docking

Authors:

Kam-Heung Sze, Zhiqiang Xiong, Jinlong Ma, Gang Lu, Wai-Yee Chan and Hongjian Li

Abstract: Inconsistent conclusions have been drawn from recent studies exploring the influence of data similarity on the scoring power of machine-learning scoring functions, but they were all based on the PDBbind v2007 refined set whose data size is limited to just 1300 protein-ligand complexes. Whether these conclusions can be generalized to substantially larger and more diverse datasets warrants further examination. Besides, the previous definition of protein structure similarity, which relied on aligning monomers, might not truly reflect what it was supposed to be. Moreover, the impact of binding pocket similarity has not been investigated either. Here we have employed the updated refined set v2013 providing 2959 complexes and utilized not only protein structure and ligand fingerprint similarity but also a novel measure based on binding pocket topology dissimilarity to systematically control how similar or dissimilar complexes are incorporated for training predictive models. Three empirical scoring functions (X-Score, AutoDock Vina and Cyscore) and their random forest counterparts were evaluated. Results have confirmed that dissimilar training complexes may be valuable if allied with appropriate machine learning algorithms and informative descriptor sets. Machine-learning scoring functions acquire their remarkable scoring power through mining more data to advance performance persistently, whereas classical scoring functions lack such learning ability. The software code and data used in this study and supplementary results are available at https://GitHub.com/HongjianLi/MLSF.

Paper Nr: 8
Title:

A Novel Method for the Inverse QSAR/QSPR based on Artificial Neural Networks and Mixed Integer Linear Programming with Guaranteed Admissibility

Authors:

Naveed A. Azam, Rachaya Chiewvanichakorn, Fan Zhang, Aleksandar Shurbevski, Hiroshi Nagamochi and Tatsuya Akutsu

Abstract: Inverse QSAR/QSPR is a well-known approach for computer-aided drug design. In this study, we propose a novel method for inverse QSAR/QSPR using artificial neural networks (ANNs) and mixed integer linear programming. In this method, we introduce a feature function f that converts each chemical compound G into a vector f(G) of several descriptors of G. Next, given a set of chemical compounds along with their chemical properties, we construct a prediction function Ψ with an ANN so that Ψ(f(G)) takes a value nearly equal to a given chemical property for many chemical compounds G in the set. Then, given a target value y* of the chemical property, we conversely infer a chemical structure G* having the desired property y* in the following way. We formulate the problem of finding a vector x* such that (i) Ψ(x*) = y* and (ii) there exists a chemical compound G* such that f(G*) = x* (if one exists over all vectors x* in (i)) as a mixed integer linear programming problem (MILP). In an existing method for the inverse QSAR/QSPR, the second condition (ii) was not guaranteed. For acyclic chemical compounds and some chemical properties such as heat of formation, boiling point, and retention time, we conducted computational experiments.

Paper Nr: 9
Title:

A Grid-based Simulation Model for the Evolution of Influenza A Viruses

Authors:

Hsin-Ting Chung and Yuh-Jyh Hu

Abstract: We propose a simulation approach for analyzing and predicting the evolution of influenza A viruses (IAVs). The simulation is conducted in a sequence-based space to constrain the evolutionary trends within a grid of clusters of protein sequences. The simulated trajectories enable the investigation into point mutations on a protein strain of IAVs in evolution, which cannot be accomplished easily by analyses of phylogenetic trees. We tested the simulation model on three IAV internal proteins, NP, PB1 and PB2. The produced evolutionary pathways were consistent with previous studies of the reassortment history of the 2009 human pandemic. In addition, the chronological analysis of host-associated signature mutations on NP, PB1 and PB2 also agreed with the previous findings.

Paper Nr: 12
Title:

Evaluation of Phenotyping Errors on Polygenic Risk Score Predictions

Authors:

Ruowang Li, Jiayi Tong, Rui Duan, Yong Chen and Jason H. Moore

Abstract: Accurate disease risk prediction is essential in healthcare to provide personalized disease prevention and treatment strategies not only to the patients, but also to the general population. In addition to demographic and environmental factors, advancements in genomic research have revealed that genetics play an important role in determining the susceptibility of diseases. However, for most complex diseases, individual genetic variants are only weakly to moderately associated with the diseases. Thus, they are not clinically informative in determining disease risks. Nevertheless, recent findings suggest that the combined effects from multiple disease-associated variants, or polygenic risk score (PRS), can stratify disease risk similar to that of rare monogenic mutations. The development of polygenic risk score provides a promising tool to evaluate the genetic contribution of disease risk; however, the quality of the risk prediction depends on many contributing factors including the precision of the target phenotypes. In this study, we evaluated the impact of phenotyping errors on the accuracies of PRS risk prediction. We utilized electronic Medical Records and Genomics Network (eMERGE) data to simulate various types of disease phenotypes. For each phenotype, we quantified the impact of phenotyping errors generated from the differential and non-differential mechanism by comparing the prediction accuracies of PRS on the independent testing data. In addition, our results showed that the rate of accuracy degradation depended on both the phenotype and the mechanism of phenotyping error.

Paper Nr: 33
Title:

Variable Selection based on a Two-stage Projection Pursuit Algorithm

Authors:

Shu Jiang and Yijun Xie

Abstract: Dimension reduction methods have gained popularity in the modern era due to exponential growth in data collection. Extracting key information and learning from all available data is a crucial step. Principal component analysis (PCA) is a popular dimension reduction technique due to its simplicity and flexibility. We stress that PCA is solely based on maximizing the proportion of total variance of the explanatory variables and does not directly take the outcome of interest into account. Variable selection under such an unsupervised setting may thus be inefficient. In this note, we propose a novel two-stage projection pursuit based algorithm which simultaneously considers the loss in the outcome variable when doing variable selection. We believe that when one is keen on variable selection in relation to the outcome of interest, the proposed method may be more efficient compared to existing methods.

Paper Nr: 35
Title:

Multi-state Models for the Analysis of Survival Studies in Biomedical Research: An Alternative to Composite Endpoints

Authors:

Alicia Quirós, Armando Pérez de Prado, Natalia Montoya and José T. Hernández

Abstract: Primary endpoints of survival studies in biomedical research are usually composite endpoints, which indicate whether any of a list of events is observed. They are practical for increasing the power of studies and in the presence of competing risks, although they are restrictive. In this work, we propose a more sophisticated modeling of the evolution of the disease for a patient with multi-state models, which allow relationships between adverse events to be defined by a state structure. Each transition between states may depend on different covariates, which provides a personalized prediction for patients, considering their characteristics, treatment and observed disease evolution. In order to illustrate their performance, we analyze a study in interventional cardiology including 1008 patients with acute coronary syndrome who underwent percutaneous revascularization between 2013 and 2019. The results show the great potential of multi-state models for analyzing survival studies in biomedical research.

Paper Nr: 39
Title:

Advancements in Red Blood Cell Detection using Convolutional Neural Networks

Authors:

František Kajánek and Ivan Cimrák

Abstract: Extraction of data from video sequences of experiments is necessary for the acquisition of high volumes of data. The process requires Red Blood Cell detection to be of sufficient quality, so that the tracking algorithm has enough information for connecting frames and positions together. When holes occur in the detection, the tracking algorithm is only capable of fixing a certain amount of errors before it fails. In this work we iterate on existing frameworks and we attempt to improve upon the existing results of Convolutional Neural Network solutions.

Paper Nr: 40
Title:

Detection of Lattice-points inside Triangular Mesh for Variable-viscosity Model of Red Blood Cells

Authors:

Tibor Poštek, František Kajánek and Mariana Ondrušová

Abstract: In this work we introduce the extension of an existing computational model for red blood cells that enables modelling of different viscosity inside and outside the cells. The extension is based on an algorithm for detection of fluid lattice-points inside the cell given the membrane of the cell as a closed triangular mesh. This algorithm enables the setting of variable viscosity in the underlying lattice-Boltzmann method for computation of fluid dynamics.
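
One common way to decide whether a lattice point lies inside a closed triangular mesh is ray casting: shoot a ray from the point and count how many triangles it crosses, with an odd count meaning inside. The sketch below uses the Möller-Trumbore ray-triangle test as a stand-in. The abstract does not say which detection algorithm the authors use, so this is an assumption, and the function names and the default ray direction are illustrative.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def ray_hits_triangle(origin, direction, triangle, eps=1e-12):
    """Moller-Trumbore test: does the ray cross the triangle at t > 0?"""
    v0, v1, v2 = triangle
    edge1 = _sub(v1, v0)
    edge2 = _sub(v2, v0)
    p = _cross(direction, edge2)
    det = _dot(edge1, p)
    if abs(det) < eps:  # ray is parallel to the triangle plane
        return False
    inv_det = 1.0 / det
    t_vec = _sub(origin, v0)
    u = _dot(t_vec, p) * inv_det
    if u < 0.0 or u > 1.0:
        return False
    q = _cross(t_vec, edge1)
    v = _dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return False
    return _dot(edge2, q) * inv_det > eps


def point_inside_mesh(point, vertices, triangles,
                      direction=(0.577, 0.211, 0.789)):
    """Odd number of ray crossings means the point is inside the closed mesh."""
    hits = sum(
        ray_hits_triangle(point, direction, tuple(vertices[i] for i in tri))
        for tri in triangles
    )
    return hits % 2 == 1


if __name__ == "__main__":
    # A unit tetrahedron as a minimal closed triangular mesh.
    verts = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
    tris = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
    print(point_inside_mesh((0.1, 0.1, 0.1), verts, tris))
```

In a lattice-Boltzmann setting the test would be applied to each lattice node within the bounding box of the cell membrane. Rays that pass exactly through an edge or vertex can be miscounted by this simple version, which is why an oblique default direction is used.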

Paper Nr: 43
Title:

Metagenomic Clustering in Search of Common Origin

Authors:

Jolanta Kawulok and Michal Kawulok

Abstract: Analysis of metagenomic samples is aimed at extracting relevant information on these samples, including their composition and origin. To determine where a sample comes from, it is commonly compared with a set of reference samples extracted from known locations. However, if such reference samples are unavailable or when the origins of the investigated samples are not covered by the reference set, it may be helpful to identify groups of similar samples that may have a common origin. In this paper, we tackle this problem with hierarchical clustering applied to analyse a matrix of mutual similarities obtained using the Mash and our CoMeta programs. We report initial, yet encouraging results of our experimental study performed for the metagenomic data extracted from two large metropolises, downloaded from the Sequence Read Archive repository. The obtained results indicate that the proposed approach is effective, which justifies further exploration of the topic using more extensive data.
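
The clustering step can be illustrated with a small agglomerative procedure on a similarity matrix. The sketch assumes similarities in [0, 1] that are converted to distances as 1 minus similarity, and it uses single linkage. The abstract does not state the linkage criterion or how the Mash and CoMeta outputs are scaled, so both choices are assumptions, and the matrix values are made up.

```python
def single_linkage_clusters(similarity, n_clusters):
    """Agglomerative clustering (single linkage) on a similarity matrix.

    `similarity` is a symmetric list of lists with values in [0, 1].
    Returns a list of clusters, each a sorted list of sample indices.
    """
    n = len(similarity)
    # Convert similarities to distances.
    dist = [[1.0 - similarity[i][j] for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


if __name__ == "__main__":
    # Hypothetical mutual similarities between four metagenomic samples.
    sim = [
        [1.00, 0.90, 0.10, 0.20],
        [0.90, 1.00, 0.15, 0.10],
        [0.10, 0.15, 1.00, 0.80],
        [0.20, 0.10, 0.80, 1.00],
    ]
    print(single_linkage_clusters(sim, 2))
```

Samples that end up in the same cluster would be candidates for a common origin. Stopping at a fixed number of clusters is a simplification, since a full dendrogram allows the cut level to be chosen afterwards.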

Paper Nr: 48
Title:

PIACAN: Pathway Integration and Analysis of Cancer Networks

Authors:

Adrian Quintana, Vinh Nguyen, Tommy Dang and Chiquito Crasto

Abstract: We developed a web-based software tool, Pathway Integration and Analysis of Cancer Networks (PIACAN), to identify key cancer genes, pathways and sub-pathways that are implicated in more than one type of cancer. PIACAN is the result of merging biological pathways associated with 15 different human cancer types mined from the Kyoto Encyclopaedia of Genes and Genomes (KEGG). The Cytoscape software was used to port the mined information for pathway merging and subsequent analysis. Web-determined visualization of the merged networks was achieved by programming using the JavaScript library Data-Driven Documents (D3). The results of PIACAN allow us a mechanistic glimpse into the potential development of secondary cancers spreading to distant tissues, following the primary tumour-localization in a specific tissue, via traversal of the blood-brain barrier. Given the similarities in biological networks between different cancers, PIACAN allows us a glimpse into the similarities in cancer development in remote tissues. PIACAN is a free, public, web-accessible resource (https://adrquint.github.io/integrated-cancer-networks/), where users can identify how and where biological pathways and/or sub-pathways, depending on the cancer type. A video-demonstration of the preliminary work can be found at: https://www.youtube.com/watch?v=tOJ-EOY33fU. PIACAN is also developed as a knowledge-dissemination tool. In its current iteration, for each gene in the pathway, the system links to cancer gene information in KEGG, GeneCards, Gene Ontology, NCBI AceView, and Ensembl.

Paper Nr: 49
Title:

Using ASAR for Analysis of Electrogenic and Human Gut Microbial Communities

Authors:

Igor Goryanin, Anatoly Sorokin and Olga Vasieva

Abstract: In this paper we describe applications of our ASAR package to functional, taxonomic and pathway analysis of metagenomes and propose future plans and perspectives. To illustrate the analytical potential of ASAR, we discuss outcomes of several projects. The main focus is on the metabolic plasticity of electrochemically active microbial communities and a potential role of integrated symbiotic bacterial interactions; antipathogenic properties of BES, manifested in its capacity to remove some pathogens from waste streams; and medical applications of this technology. As an example, we present an ASAR-based metagenome analysis of an evolving bacterial community from distillery waste over a period of 36 months in a BES environment. Application of ASAR to personalised analyses of the gut microbiome (GM) and data interpretation based on publicly available association studies are also discussed in this publication.

Paper Nr: 51
Title:

Adding Value to Translational Informatics through the Semantic Management of Drug to Drug Interaction

Authors:

Radmila Juric

Abstract: Translational informatics, aimed at bridging the gap between biomedical scientific knowledge and clinical practice, has changed the way we use rapidly growing information from biomedical research and bring it closer to clinical practice. Software technologies play an important role in this process, particularly if they help in understanding and manipulating the meaning of data and information generated in biomedical research and translate it into semantics suitable for clinical practice. In this paper, we propose software architectural and conceptual computational models, which use semantic technologies in order to explore the meaning of the relationships between drugs when they interact in clinical practice. The data about drug to drug interactions, available from biomedical research, is reusable in instances where they are decisive factors in drug administration in clinical practice. We explore the power of semantic web technologies and SWRL enabled OWL ontologies to demonstrate the applicability and feasibility of our proposal in translational informatics.

Paper Nr: 6
Title:

SIRA-HIV: A User-friendly System to Evaluate HIV-1 Drug Resistance from Next-generation Sequencing Data

Authors:

Letícia M. Raposo, Mônica B. Arruda, Rodrigo M. Brindeiro and Flavio F. Nobre

Abstract: Evaluating next-generation sequencing (NGS) data requires an extensive knowledge of bioinformatics and programming commands, which could limit the studies in this area. We propose a user-friendly system to analyse raw NGS data from HIV-1 patient samples to identify amino acid variants and the virus susceptibility to antiretrovirals. SIRA-HIV was developed as an R Shiny web application. The software Segminator II was applied to analyse viral data. Four genotypic interpretation systems were implemented in R language to classify the HIV susceptibility: the French National Agency for AIDS Research (ANRS), the Stanford HIV Drug Resistance Database (HIVdb), the Rega Institute (Rega) and the Brazilian Network for HIV-1 Genotyping (Brazilian Algorithm). SIRA-HIV was structured in two analysis components. The Drug Resistance Positions module shows the resistance positions, their frequencies, and the coverage. In the Genotypic Resistance Interpretation Algorithms module, the rule-based systems are available to interpret HIV-1 drug resistance genotyping results. SIRA-HIV exhibited comparable results to Deep Gen HIV, HyDRA, and PASeq. As an advantage, the proposed application shows susceptibility levels from the most widely used rule-based systems and works locally, so the analysis does not rely on the internet. SIRA-HIV could be a promising system to aid in HIV-1 patient data analysis.

Paper Nr: 10
Title:

Measuring the Similarity of Proteomes using Grammar-based Compression via Domain Combinations

Authors:

Morihiro Hayashida, Hitoshi Koyano and Jose C. Nacher

Abstract: Revealing the evolution of organisms is one of the important topics of biological research, and is also useful for understanding the origin of organisms. Hence, genomic sequences have been compared and aligned for finding conserved and functional regions. A protein can contain several domains, which are known as structural and functional units. In previous work, a proteome, the entire set of proteins in an organism, was regarded as a set of sequences of protein domains, and a grammar-based compression algorithm was developed for a proteome, where production rules in the grammar represented the evolutionary processes of mutation and duplication. In this paper, we propose a similarity measure based on the grammar-based compression, and apply it to hierarchical clustering of seven organisms, Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Arabidopsis thaliana, and Escherichia coli. The results suggest that our similarity measure could classify the organisms very well.

Paper Nr: 15
Title:

A Machine Learning Approach to Select the Type of Intermittent Fasting in Order to Improve Health by Effects on Type 2 Diabetes

Authors:

Shula Shazman

Abstract: Intermittent fasting (IF) is the cycling between periods of eating and fasting. The main types of IF are: complete alternate-day fasting; time-restricted feeding (eating within specific time frames such as the most prevalent 16:8 fast, with 16 hours of fasting and 8 hours for eating); and religious fasting such as Ramadan (which occurs one month per year, with eating taking place only after nightfall). IF can be effective in reducing metabolic disorders and age-related diseases by bringing about changes in metabolic parameters associated with type 2 diabetes. Questions do remain, however, about the effects of the different types of IF as a function of the age at which fasting begins, gender and severity of type 2 diabetes. In this paper we describe a machine learning approach to selecting the best type of IF to improve health in type 2 diabetes. For the purposes of this research, the health outcomes of interest are changes in fasting glucose and insulin. The different types of intermittent fasting offer promising non-pharmacological approaches to improving health at the population level, with multiple public health benefits.

Paper Nr: 19
Title:

Classification of Respiratory Sounds with Convolutional Neural Network

Authors:

A. A. Saraiva, D. S. Santos, A. A. Francisco, Jose M. Sousa, N. F. Ferreira, Salviano Soares and Antonio Valente

Abstract: Noting recent advances in the field of image classification, where convolutional neural networks (CNNs) are used to classify images with high precision, this paper proposes a method of classifying breathing sounds using a CNN, which is trained and tested. To do this, a visual representation of each audio sample was created that allows features to be identified for classification, using the same techniques used to classify images with high precision. For this we used the technique known as Mel Frequency Cepstral Coefficients (MFCCs). For each audio file in the dataset, we extracted features with MFCCs, which means we have an image representation for each audio sample. The method proposed in this article obtained results above 74% in the classification of respiratory sounds into the four classes available in the database used (Normal, crackles, wheezes, Both).

Paper Nr: 21
Title:

EasyModel 1.1: User-friendly Stochastic and Deterministic Simulations for Systems Biology Models

Authors:

Jordi Bartolome, Rui Alves and Francesc Solsona

Abstract: EasyModel is a user-friendly web application that uses Wolfram webMathematica for performing simulations and analysis of systems biology models. EasyModel lets users create new models, load models from the BioModels database, and import preexisting models from SBML files. EasyModel mainly targets students of bioinformatics or systems biology and does not require Mathematica programming knowledge. In addition, expert programmers may find it useful as a tool for quickly implementing new models in Mathematica, which can then be downloaded as Mathematica notebooks to be tailored locally for more advanced simulation and analysis. The version described in this manuscript introduces the stochastic simulation feature. EasyModel is freely available at https://easymodel.udl.cat

Paper Nr: 22
Title:

A Machine Learning-based Approach for the Categorization of MicroRNAs to Their Species of Origin

Authors:

Luise Odenthal, Jens Allmer and Malik Yousef

Abstract: Many diseases are driven by dysregulated gene expression. MicroRNAs are key players for post-transcriptional gene regulation. miRBase contains microRNAs (miRNAs) from about 200 species organized into about 70 clades. It has been shown that not all miRNAs collected in the database are likely to be real and, therefore, novel routes to delineate between correct and false miRNAs should be explored. Here, a novel approach allowing the assignment of an unknown miRNA to its most likely clade/species of origin is presented. A simple way to filter new data would be to ensure that the novel miRNA categorizes closely to the species it is said to originate from. The approach presented here automatically assigns a miRNA sample to its clade/species of origin. For that, an ensemble classifier of multiple two-class random forests was designed, where each random forest was trained on one species/clade pair. The approach was tested with different sampling methods on a dataset that was taken from miRBase and it was evaluated using a hierarchical F-measure. The approach predicted 81% to 94% of the test data correctly, depending on the sampling method. This is the first classifier that can classify miRNAs to their species of origin.

Paper Nr: 23
Title:

RuleDSD: A Rule-based Modelling and Simulation Tool for DNA Strand Displacement Systems

Authors:

Vinay Gautam, Shiting Long and Pekka Orponen

Abstract: RuleDSD is a tool to support the rule-based modelling and simulation of DNA Strand Displacement (DSD) systems. It constitutes a software pipeline programmed in Python and integrated with PySB, a standard framework for rule-based modelling of biochemical systems. The input to RuleDSD is a domain-level model of a DSD system, where each initial DNA complex is described at the level of named pairing domains. The RuleDSD pipeline converts these domain-level descriptions into a canonical graph representation, and based on this performs a full state-space enumeration of DNA species reachable by applying the basic rules of DNA strand displacement reactions to the ensemble of initial species. The resulting chemical reaction network is then converted into a BioNetGen model and imported into the PySB framework for deterministic or stochastic simulation and analysis. Altogether, RuleDSD thus provides a customised front-end for rule-based modelling and simulation of DNA Strand Displacement systems using the BioNetGen simulation engine, and opens up further possibilities for harnessing the well-established rule-based modelling methods and tools that can easily be utilised through the PySB wrapper.

Paper Nr: 29
Title:

Classification of Optical Coherence Tomography using Convolutional Neural Networks

Authors:

A. A. Saraiva, D. S. Santos, Pimentel Pedro, Jose M. Sousa, N. F. Ferreira, J. B. Neto, Salviano Soares and Antonio Valente

Abstract: This article describes a model for classifying optical coherence tomography images using a convolutional neural network. The dataset used was the Labeled Optical Coherence Tomography dataset provided by Kermany et al. (2018), with a total of 84495 images in 4 classes: normal, drusen, diabetic macular edema and choroidal neovascularization. To evaluate the generalization capacity of the models, k-fold cross-validation was used. The classification models performed well, with an average accuracy of 94.35%.

Paper Nr: 30
Title:

Validity of the Michaelis-Menten Approximation for the Stability Analysis in Regulatory Reaction Networks

Authors:

Takashi Naka

Abstract: Cellular signalling systems are composed of enzymatic reaction cascades organized as regulatory reaction networks. The primary building block of such a network is an enzymatic activation-inactivation cyclic reaction, such as phosphoryl modification. We have investigated the effects of network architecture and kinetic parameter values on stability, such as the emergence of bistability or oscillations, employing the canonical Michaelis-Menten equation as the approximation for the Michaelis-Menten-type reaction mechanism in each enzymatic cyclic reaction. Although the Michaelis-Menten approximation is known to work well under the assumption of a large excess of substrate over enzyme, which is usually satisfied for metabolic pathways, the approximation might not suit regulatory reaction networks, in which this assumption might be violated. In this study, the validity of the Michaelis-Menten approximation was examined for regulatory reaction networks over the possible network architectures and kinetic parameter values by comparing the stabilities predicted by the model with the Michaelis-Menten approximation and by the model with the full set of reaction equations derived only from the law of mass action. The comparison shows that employing the Michaelis-Menten approximation might not be valid even in steady-state analyses such as stability analysis.
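The excess-substrate assumption can be illustrated on a much simpler system than the cyclic networks studied in the paper. The sketch below is not the authors' model: it assumes a single irreversible reaction E + S <-> C -> E + P with illustrative rate constants (all set to 1) and compares the product formed under full mass-action kinetics with the product predicted by the Michaelis-Menten rate law, using a hand-written fourth-order Runge-Kutta integrator.

```python
def rk4(f, y, t_end, dt):
    """Integrate dy/dt = f(y) from t = 0 to t_end with classical RK4."""
    for _ in range(int(round(t_end / dt))):
        k1 = f(y)
        k2 = f([yi + 0.5 * dt * ki for yi, ki in zip(y, k1)])
        k3 = f([yi + 0.5 * dt * ki for yi, ki in zip(y, k2)])
        k4 = f([yi + dt * ki for yi, ki in zip(y, k3)])
        y = [yi + dt / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
    return y


def product_full(s0, e0, t_end, k_on=1.0, k_off=1.0, k_cat=1.0, dt=1e-3):
    """Product formed under full mass-action kinetics (S and C tracked)."""
    def rhs(y):
        s, c = y
        e = e0 - c                       # enzyme conservation
        bind = k_on * e * s - k_off * c  # net complex formation
        return [-bind, bind - k_cat * c]
    s, c = rk4(rhs, [s0, 0.0], t_end, dt)
    return s0 - s - c                    # substrate conservation


def product_mm(s0, e0, t_end, k_on=1.0, k_off=1.0, k_cat=1.0, dt=1e-3):
    """Product formed under the Michaelis-Menten rate law."""
    km = (k_off + k_cat) / k_on
    def rhs(y):
        s = y[0]
        return [-k_cat * e0 * s / (km + s)]
    (s,) = rk4(rhs, [s0], t_end, dt)
    return s0 - s


# Substrate in large excess over enzyme: the two models nearly agree.
print(product_full(10.0, 0.1, 10.0), product_mm(10.0, 0.1, 10.0))
# Enzyme in excess over substrate: the approximation overestimates product.
print(product_full(1.0, 10.0, 0.5), product_mm(1.0, 10.0, 0.5))
```

With these assumed parameters the two models differ by roughly one percent when substrate is in excess, while with enzyme in excess the Michaelis-Menten rate law predicts far more product than the mass-action model.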

Paper Nr: 31
Title:

Development of HIV-1 Coreceptor Tropism Classifiers: An Approach to Improve X4 and R5X4 Viruses Prediction

Authors:

José A. Rodrigues, Letícia M. Raposo and Flavio F. Nobre

Abstract: The pathway of human immunodeficiency virus (HIV) infection depends on the composition of a 35-amino acid variable region in its envelope, known as the V3 loop. Since this discovery, many tools have been developed to diagnose and predict viral tropism, from biochemical tests to various computational algorithms. To date, the greatest difficulty is the correct prediction of X4- or R5X4-tropic virions. In this study, we evaluated some of the recommended criteria and proposed a random forest-based approach for better prediction of X4-capable viruses (i.e., either X4-only or R5X4 dual/mixed). All methods achieved a specificity higher than 87%, with geno2pheno 2.5% showing the best performance (98.2%). Nevertheless, its sensitivity (73.3%) was lower than that of the other approaches. The highest sensitivity was attained by our Complete Model with an undersampling strategy (90.1%). The accuracy of all approaches ranged from 87.4% to 93.0%. The Complete Model with oversampling and the Reduced Model with no balancing showed the highest MCC value (0.796 for both). Considering error rates and the number of explanatory variables, our main objective of increasing the ability to predict viral specimens with X4-tropism was achieved.
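The four measures reported above follow directly from a binary confusion matrix. As a minimal sketch, the function below computes them with X4-capable viruses assumed to be the positive class; the function name and the counts in the usage line are illustrative only and are not results from the paper.

```python
from math import sqrt


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Made-up counts, for illustration only.
print(binary_metrics(tp=90, fn=10, tn=80, fp=20))
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which makes it a more informative summary when the two tropism classes are imbalanced.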

Paper Nr: 36
Title:

Chikungunya Virus Inhibitor Study based on Molecular Docking Experiments

Authors:

A. A. Saraiva, Soares Jeferson, Castro Miranda, Jose M. Sousa, N. F. Ferreira, J. B. Neto, Salviano Soares and Antonio Valente

Abstract: Chikungunya virus disease, transmitted by the bite of the mosquito Aedes aegypti, is epidemic in some regions. To support early diagnosis and better treatment, this work studies candidate inhibitors through molecular docking as a basis for the laboratory development of a drug. The docking results indicate that Suramin is the most promising candidate, followed by Silibin.

Paper Nr: 44
Title:

Sharing Bioinformatic Data for Machine Learning: Maximizing Interoperability through License Selection

Authors:

Alexander Bernier and Adrian Thorogood

Abstract: Efficient machine learning in bioinformatics requires a large volume of data from different sources. Bioinformatics is shifting from a paradigm of siloed analysis of individual datasets by researchers to the aggregation and analysis of disparate sets of health and biomedical data from academic, healthcare and commercial settings. Data-generating organizations must give thought to selecting legal terms for dataset release that will promote compatibility with other datasets. In releasing bioinformatic data for open use, care must be taken that the terms of the selected licenses ensure maximum interoperability. The following technical elements should inform the choice of license: license hybridity; waivers of liability, warranties and guarantees; commercial/non-commercial use; attribution and copyleft; granular permission; and bilateral or multilateral licensing. Licenses are compared to inform optimal license selection and enable data integration and analysis; consideration is given to an eventual standard license for open sharing of bioinformatic data.

Paper Nr: 45
Title:

3D Spatial Dependencies Study in the Hawk and Dove Model

Authors:

Andrzej Swierniak, Marek Bonk and Damian Borys

Abstract: The aim of the research was to check spatial dependencies in evolutionary games in 3D grids and compare them with simulation results (2D) and theoretical or analytical considerations obtained from the replicator dynamics equations. In order to compare the results, the classic Hawk and Dove model was used and a series of simulations for both v < c and v > c cases was performed using our own software. The results are almost the same as the theoretical analysis of this model, but some small differences were observed and discussed. It seems, however, that the 3D model better reflects the behaviour of the population than 2D simulations.
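The replicator-dynamics baseline mentioned above can be sketched for the well-mixed (non-spatial) case. The abstract does not give the payoff matrix, so the code assumes the textbook Hawk and Dove payoffs: a hawk meeting a hawk receives (v - c)/2, a hawk meeting a dove receives v, a dove meeting a hawk receives 0, and a dove meeting a dove receives v/2. The function name, initial share, and step size are illustrative choices.

```python
def hawk_share(v, c, x0=0.1, t_end=50.0, dt=0.01):
    """Euler integration of the replicator equation for the hawk share x."""
    x = x0
    for _ in range(int(round(t_end / dt))):
        f_hawk = x * (v - c) / 2 + (1 - x) * v  # expected hawk payoff
        f_dove = (1 - x) * v / 2                # expected dove payoff
        x += dt * x * (1 - x) * (f_hawk - f_dove)
    return x


# v < c: mixed equilibrium with hawk share v / c.
print(hawk_share(v=2.0, c=4.0))
# v > c: hawks take over the population.
print(hawk_share(v=4.0, c=2.0))
```

Under these assumed payoffs the payoff difference reduces to (v - xc)/2, so the hawk share settles at v/c when v < c and approaches 1 when v > c. These are the analytical reference values against which grid simulations can be compared.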

Paper Nr: 46
Title:

Theoretical Study of the Fidelity of Transcription

Authors:

Yao-Gen Shu, Ming Li and Zhong-Can Ou-Yang

Abstract: This year we celebrate the 50th anniversary of the discovery of the three eukaryotic RNA polymerases. Ever since this seminal discovery by Robert Roeder in 1969 (Roeder and Rutter, 1969), researchers have investigated the intricate mechanisms of gene transcription with great dedication. However, there has still been no breakthrough in the study of the fidelity of transcription. Here, we propose a minimal model with first-order neighbor effects, based on a first-passage approach, to theoretically investigate the fidelity of gene transcription.

Paper Nr: 47
Title:

A Deep-learning based Method for the Classification of the Cellular Images

Authors:

Caleb Vununu, Suk-Hwan Lee and Ki-Ryong Kwon

Abstract: The present work proposes a classification method for Human Epithelial type 2 (HEp-2) cell images using an unsupervised deep feature learning method. Unlike most state-of-the-art methods in the literature, which utilize deep learning in a strictly supervised way, we propose the use of a deep convolutional autoencoder (DCAE) as the principal feature extractor for classifying the different types of HEp-2 cellular images. The network takes the original cellular images as inputs and learns to reconstruct them through an encoding-decoding process in order to capture the features related to the global shape of the cells. A final feature vector is constructed from the latent representations extracted from the DCAE, giving a highly discriminative feature representation. The resulting features are then fed to a nonlinear classifier whose output gives the predicted type of the cell image. We have tested the discriminability of the proposed features on one of the most popular HEp-2 cell classification datasets, the SNPHEp-2 dataset, and the results show that the proposed features capture the distinctive characteristics of the different cell types while performing at least as well as some of the current deep learning-based state-of-the-art methods.

Paper Nr: 50
Title:

Predictive Technologies and Biomedical Semantics: A Study of Endocytic Trafficking

Authors:

Radmila Juric, Elisabetta Ronchieri, Gordana B. Zagorac, Hana Mahmutefendic and Pero Lucin

Abstract: Predictive technologies with increased uptake of machine learning algorithms have changed the landscape of computational models across problem domains and research disciplines. With the abundance of data available for computations, we started looking at the efficiency of predictive inference as the answer to many problems we wish to address using computational power. However, the real picture of the effectiveness and suitability of predictive and learning technologies in particular is far from promising. This study addresses these concerns and illustrates them through biomedical experiments that evaluate Tf/TfR endosomal recycling as a part of the cellular processes by which cells internalise substances from their environment. The outcome of the study is mixed. The observed data play an important role in answering biomedical research questions because it was feasible to perform ML classifications and feature selection using the semantics stored in the observed data set. However, the process of preparing the data set for ML classifications proved the opposite. Precise algorithmic predictions, which are the ultimate goal when using learning technologies, are not the only criterion for measuring the success of predictive inference. It is the semantics of the observed data set, which should become a training data set for ML, that becomes the weak link in the process. The recognised practices of data science do not guarantee that the important semantics of the observed data set and experiments are preserved. These semantics could be distorted and misinterpreted and might not contribute towards correct inference. The study can be seen as an illustration of hidden problems in using predictive technologies in biomedicine and is relevant to both computer and biomedical scientists.