BIOINFORMATICS 2016 Abstracts


Full Papers
Paper Nr: 7
Title:

Highly Robust Classification: A Regularized Approach for Omics Data

Authors:

Jan Kalina and Jaroslav Hlinka

Abstract: Various regularized approaches to linear discriminant analysis suffer from sensitivity to outlying measurements in the data. This work aims to propose new versions of regularized linear discriminant analysis suitable for high-dimensional data contaminated by outliers. We use principles of robust statistics to propose classification methods suitable for data in which the number of variables exceeds the number of observations. In particular, we propose two robust regularized versions of linear discriminant analysis with a high breakdown point. For this purpose, we propose a regularized version of the minimum weighted covariance determinant estimator, one of the highly robust estimators of multivariate location and scatter. It assigns implicit weights to individual observations and represents a unique attempt to combine regularization with high robustness. Algorithms for the efficient computation of the new classification methods are proposed, and their performance is illustrated on real data sets.

Paper Nr: 18
Title:

Combinatorial Identification of Broad Association Regions with ChIP-seq Data

Authors:

Jieun Jeong, Mudit Gupta, Andrey Poleshko and Jonathan A. Epstein

Abstract: Motivation: Differentiation of cells into different cell types involves many types of chromatin modifications, and mapping these modifications is a key computational task as researchers uncover different aspects of that process. Modifications associated with heterochromatin formation pose new challenges in this context because we must define very broad regions that have only a moderately stronger signal than the rest of the chromatin. Lamin-associated domains (LADs) are a prime example of such regions. Results: We present Combinatorial Identification of Broad Association Regions (CIBAR), a new method to identify these types of broad regions. CIBAR is based on an efficient solution to a natural combinatorial problem, which adapts to widely variable yields of reads from ChIP-seq data and the associated controls and performs competitively with previous methods, including DamID, which has been used in many publications on LADs but cannot be applied in most in vivo situations.

Paper Nr: 23
Title:

Integer Linear Programming Approach to Median and Center Strings for a Probability Distribution on a Set of Strings

Authors:

Morihiro Hayashida and Hitoshi Koyano

Abstract: We address the problems of finding median and center strings for a probability distribution on a set of strings under Levenshtein distance, which are known to be NP-hard in a special case. There are many applications in various research fields, for instance, finding functional motifs in protein amino acid sequences, and recognizing shapes and characters in image processing. In this paper, we propose novel integer linear programming-based methods for finding median and center strings for a probability distribution on a set of strings under Levenshtein distance. Furthermore, we restrict several variables in the formulation to a region near the diagonal, and thereby propose integer linear programming-based methods for finding approximate median and center strings as well. To evaluate the proposed methods, we perform several computational experiments and show that the restricted formulation reduces the execution time.
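The objective these formulations optimize can be stated concretely: a median string minimizes the expected Levenshtein distance to a string drawn from the distribution. A minimal Python sketch of that objective, brute-forcing over a candidate set, purely for illustration (the paper solves the problem exactly via integer linear programming):

```python
def levenshtein(s, t):
    # Classic dynamic-programming edit distance (insert/delete/substitute, unit costs).
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete from s
                           cur[j - 1] + 1,              # insert into s
                           prev[j - 1] + (cs != ct)))   # substitute
        prev = cur
    return prev[-1]

def expected_distance(candidate, distribution):
    # Objective a median string minimizes: sum_i p_i * d_Lev(candidate, s_i).
    return sum(p * levenshtein(candidate, s) for s, p in distribution.items())

def median_over(candidates, distribution):
    # Brute-force median over a small candidate set (illustration only).
    return min(candidates, key=lambda c: expected_distance(c, distribution))
```

A center string would instead minimize the maximum (rather than expected) distance, swapping `sum` for `max` in the objective.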

Paper Nr: 28
Title:

Controlling the Cost of Prediction in using a Cascade of Reject Classifiers for Personalized Medicine

Authors:

Blaise Hanczar and Avner Bar-Hen

Abstract: Supervised learning in bioinformatics is a major tool to diagnose a disease, identify the best therapeutic strategy, or establish a prognosis. The main objective in classifier construction is to maximize accuracy in order to obtain a reliable prediction system. However, a second objective is to minimize the cost of using the classifier on new patients. Although controlling the classification cost is highly important in the medical domain, it has been studied very little. We point out that some patients are easy to predict: only a small subset of medical variables is needed to obtain a reliable prediction. The prediction for these patients can be cheaper than for the others. Based on this idea, we propose a cascade approach that decreases the classification cost of the basic classifiers without reducing their accuracy. Our cascade system is a sequence of classifiers with reject option, of increasing cost. At each stage, a classifier receives all patients rejected by the previous classifier, makes a prediction for each patient, and rejects to the next classifier the patients with low-confidence predictions. The performance of our methods is evaluated on four real medical problems.
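The cascade described above can be sketched in a few lines. The stage functions, costs, and confidence thresholds below are hypothetical toy examples, not the paper's classifiers:

```python
def cascade_predict(stages, x):
    """Run x through a sequence of (predict, cost, threshold) stages of
    increasing cost. Each `predict` returns (label, confidence); a patient
    is rejected to the next, more expensive stage when confidence falls
    below the current stage's threshold. Returns (label, total_cost_spent)."""
    spent = 0.0
    label = None
    for predict, cost, threshold in stages:
        spent += cost
        label, confidence = predict(x)
        if confidence >= threshold:   # confident enough: stop early, save cost
            return label, spent
    return label, spent               # the last stage never rejects

# Toy stages: a cheap test on one variable, then an expensive test on all.
cheap = lambda x: (int(x[0] > 0.5), abs(x[0] - 0.5) * 2)
rich  = lambda x: (int(sum(x) / len(x) > 0.5), 1.0)
stages = [(cheap, 1.0, 0.8), (rich, 10.0, 0.0)]
```

An easy patient is resolved by the cheap stage alone, while an ambiguous one pays for both stages, which is exactly the cost/accuracy trade-off the abstract describes.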

Paper Nr: 31
Title:

BioMed Xplorer - Exploring (Bio)Medical Knowledge using Linked Data

Authors:

Mohammad Shafahi, Hayo Bart and Hamideh Afsarmanesh

Abstract: Developing an effective model for predicting the risks of a disease requires exploring a vast body of (bio)medical knowledge. Furthermore, the continuous growth of this body of knowledge poses extra challenges. Numerous research efforts have attempted to address these issues by developing a variety of approaches and support tools. Most of these tools, however, do not sufficiently address the needed dynamism, lack intuitiveness in their use, and present a rather scarce amount of information, usually obtained from a single source. This research aims to address the aforementioned gaps through the development of a dynamic model for (bio)medical knowledge, represented as a network of interrelated (bio)medical concepts, and integrating disparate sources. To this end, this paper introduces BioMed Xplorer, a model and a tool that enables researchers to explore biomedical knowledge, organized in an information graph, through a user-friendly and intuitive interface. Furthermore, BioMed Xplorer provides concept-related information from a multitude of sources while preserving and presenting their provenance data. For this purpose, an RDF knowledge base has been created based on a core ontology which we have introduced. Results are validated with domain experts and contrasted against the state of the art.

Paper Nr: 35
Title:

Statistical Characterization, Modelling and Classification of Morphological Changes in imp Mutant Drosophila Gamma Neurons

Authors:

A. Razetti, X. Descombes, C. Medioni and F. Besse

Abstract: In the Drosophila brain, gamma neurons in the mushroom body are involved in higher functions such as olfactory learning and memory. During metamorphosis, they undergo remodelling, after which they adopt their adult shape. Some mutations alter remodelling, and therefore the final neuronal morphology, causing behavioural dysfunctions. The RNA binding protein Imp, for example, was shown to control this remodelling process at least partly by regulating profilin expression. This work aims at precisely characterizing the morphological changes observed upon imp knockdown in order to further understand the role of this protein. We develop a methodological framework that consists of the selection of relevant morphological features, their modelling, and parameter estimation. We then perform a statistical comparison and a likelihood analysis to quantify similarities and differences between wild-type and mutant neurons. We show that imp mutant neurons can be classified into two phenotypic groups (called Imp L and Imp Sh) that differ in several morphological aspects. We also demonstrate that, although Imp L and wild-type neurons show similarities, branch length distribution discriminates between these populations. Finally, we study biological samples in which Profilin was reintroduced in imp mutant neurons, and show that defects in main axon and branch lengths are partially suppressed.

Paper Nr: 37
Title:

A Sampling Approach for Multiple RNA Interaction - Finding Sub-optimal Solutions Fast

Authors:

Saad Mneimneh and Syed Ali Ahmed

Abstract: The interaction of two RNA molecules involves a complex interplay between folding and binding that has warranted recent developments in RNA-RNA interaction algorithms. These algorithms cannot be used to predict interaction structures when the number of RNAs is more than two. Our recent formulation of the multiple RNA interaction problem is based on a combinatorial optimization called Pegs and Rubber Bands, and has been successful in predicting structures that involve more than two RNAs. Even then, however, the optimal solution obtained does not necessarily correspond to the actual biological structure. Moreover, a structure produced by interacting RNAs may not be unique to start with. Multiple solutions (thus sub-optimal ones) are needed. Our previous approach to generating multiple sub-optimal solutions was based on exhaustive enumeration; here, we develop a sampling approach for multiple RNA interaction. Since not too many samples are needed to reveal solutions that are sufficiently different, sampling provides a much faster alternative. By clustering the sampled solutions, we are able to obtain representatives that correspond to the biologically observed structures. Specifically, our results for the U2-U6 complex and its introns in the spliceosome of yeast, and the CopA-CopT complex in E. coli, are consistent with published biological structures.

Paper Nr: 42
Title:

Inference of Predictive Phospho-regulatory Networks from LC-MS/MS Phosphoproteomics Data

Authors:

Sebastian Vlaic, Robert Altwasser, Peter Kupfer, Carol L. Nilsson, Mark Emmett, Anke Meyer-Baese and Reinhard Guthke

Abstract: In the field of transcriptomics, the automated inference of predictive gene regulatory networks from high-throughput data is a common approach for the identification of novel genes with potential therapeutic value. Sophisticated methods have been developed that make extensive use of diverse sources of prior knowledge to obtain biologically relevant hypotheses. Transferring such concepts to the field of phosphoproteomics has the potential to reveal new insights into phosphorylation-related signaling mechanisms. In this study we conceptually adapt the TILAR network inference algorithm for the inference of a phospho-regulatory network. To this end, we use published phosphoproteomics data of WP1193-treated and IL6-stimulated glioblastoma stem cells under normoxic and hypoxic conditions. Peptides corresponding to 21 differentially phosphorylated proteins were used for network inference. Topological analysis of the phospho-regulatory network suggests lamin B2 (LMNB2) and spectrin, beta, non-erythrocytic 1 (SPTBN1) as potential hub-proteins associated with the alteration of phosphorylation under the observed conditions. Altogether, our results show that the inference of phospho-regulatory networks can aid in the understanding of complex molecular mechanisms and cellular processes of biological systems.

Paper Nr: 44
Title:

On the Impact of Granularity in Extracting Knowledge from Bioinformatics Data

Authors:

Sean West and Hesham Ali

Abstract: With the rapidly increasing amount of various types of biological data currently available to researchers, the focus of the biomedical research community has been shifting from pure data generation towards the development of new methodologies for data analytics. Although many researchers continue to focus on approaches developed for analyzing single types of biological data, recent attempts have been made to utilize the availability of heterogeneous data sets that contain various types of data and to establish tools for data integration and analysis in many bioinformatics applications. Such attempts are expected to increase significantly in the coming decade. While this can be viewed as a positive step towards advancing big data analytics in bioinformatics, it is critical that these integration methodologies be meticulously studied to ensure high quality of the knowledge extracted from the integrated data. In this work, we employ data integration methods to analyze biological data obtained from protein interaction networks and gene expression data. We conduct a study to show that potential problems can arise from integrating or fusing data obtained at different granularity levels, and highlight the importance of developing advanced data fusion techniques to integrate various types of biological data for analytical purposes. Further, we explore the impact of granularity through a more formalized approach and show that granularity levels significantly impact the quality of knowledge extracted from the integrated data.

Short Papers
Paper Nr: 9
Title:

Search for Latent Periodicity in Amino Acid Sequences with Insertions and Deletions

Authors:

Valentina Pugacheva, Alexander Korotkov and Eugene Korotkov

Abstract: The aim of this study was to show that amino acid sequences have a latent periodicity with insertions and deletions of amino acids at unknown positions of the analyzed sequence. A genetic algorithm, dynamic programming, and random weight matrices were used to develop a new mathematical algorithm for latent periodicity search. The method directly optimizes the position-weight matrix for multiple sequence alignment without using pairwise alignments. The developed algorithm was applied to analyze the amino acid sequences of a small number of proteins. This study showed the presence of latent periodicity with insertions and deletions in the amino acid sequences of proteins for which such periodicity was not previously known. The origin of latent periodicity with insertions and deletions is discussed.

Paper Nr: 10
Title:

Comparison of GPU-based and CPU-based Algorithms for Determining the Minimum Distance between a CUSA Scalper and Blood Vessels

Authors:

Hiroshi Noborio, Takahiro Kunii and Kiminori Mizushino

Abstract: In this study, we have designed a GPGPU (General-Purpose Graphics Processing Unit)-based algorithm for determining the minimum distance from the tip of a CUSA (Cavitron Ultrasonic Surgical Aspirator) scalpel to the closest point around three types of blood vessel STLs (STereo-Lithographies). The algorithm consists of the following two functions: First, we use z-buffering (depth buffering) as the classic matured function of the GPU in order to effectively obtain depths corresponding to image pixels. Second, we use multiple cores of the GPU for parallel processing so as to calculate the minimum Euclidean distance from the scalpel tip to the closest z-values of the depths. Therefore, the complexity of the GPU-based algorithm does not depend on the shape complexity (e.g., patch, edge, and vertex numbers) of the blood vessels.
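The second step of the algorithm, the per-pixel distance reduction, can be illustrated on the CPU. The sketch below assumes the z-buffer pass has already produced one depth value per pixel; the back-projection is deliberately simplified (orthographic view, square pixels) and is not the paper's GPU implementation:

```python
import math

def min_distance_to_depth_map(tip, depth_map, pixel_size):
    """CPU analogue of the GPU reduction: given a z-buffer with one depth
    value per image pixel, the minimum Euclidean distance from the scalpel
    tip to the visible surface is a reduction over all pixels.
    `depth_map[r][c]` is the z-value at pixel (r, c); None marks empty pixels.
    """
    best = math.inf
    for r, row in enumerate(depth_map):
        for c, z in enumerate(row):
            if z is None:
                continue
            # Back-project the pixel to a 3-D surface point (simplified).
            x, y = c * pixel_size, r * pixel_size
            best = min(best, math.dist(tip, (x, y, z)))
    return best
```

Because each pixel's distance computation is independent, the loop body maps directly onto one GPU thread per pixel, which is why the method's complexity does not depend on the mesh's patch, edge, or vertex counts.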

Paper Nr: 13
Title:

Robustness to Sub-optimal Temperatures of the Processes of Tsr Cluster Formation and Positioning in Escherichia coli

Authors:

Teppo Annila, Ramakanth Neeli-Venkata and Andre S. Ribeiro

Abstract: The clustering and positioning of chemotaxis-associated proteins are believed to be essential steps for their proper functioning. We investigate the robustness of these processes to sub-optimal temperatures by studying the size and location of clusters of Tsr-Venus proteins in live cells. We find that the degree of clustering of Tsr proteins is maximal at the optimal temperature. The data further suggest that the weakening of the clustering process at lower- and higher-than-optimal temperatures is not due to the same cause. Meanwhile, the location of the clusters is found to be only weakly temperature dependent within the range tested. We conclude that while the clustering of Tsr is heavily temperature dependent, the localization is only weakly dependent, suggesting that the functionality of the proteins responsible for retaining Tsr clusters at the cell poles, such as the Tol-Pal complex, is robust to suboptimal temperatures.

Paper Nr: 15
Title:

Temporal Logic based Framework to Model and Analyse Gene Networks with Alternative Splicing

Authors:

Sohei Ito

Abstract: Toward a system-level understanding of biological systems, we need a formalism to model and analyse them. Due to incomplete knowledge about quantitative parameters and molecular mechanisms, qualitative methods have been useful alternatives. We have been working on a temporal logic-based approach for the qualitative modelling and analysis of gene regulatory networks. Although our framework is well established for modelling several aspects of gene regulation, it still lacks a treatment of alternative splicing, which contributes to the proteomic diversity of eukaryotic organisms. In this paper we extend our logic-based qualitative framework to capture alternative splicing, which is crucial for modelling gene regulatory networks in eukaryotic organisms. We study the mechanisms of alternative splicing and propose how to model each mechanism, then demonstrate the modelling method by analysing the regulatory network of sex determination in Drosophila and verifying that the network ensures sex determination.

Paper Nr: 21
Title:

Discovering New Proteins in Plant Mitochondria by RNA Editing Simulation

Authors:

Fabio Fassetti, Claudia Giallombardo, Ofelia Leone, Luigi Palopoli, Simona E. Rombo and Adolfo Saiardi

Abstract: In plant mitochondria, an essential mechanism for gene expression is RNA editing, which often influences the synthesis of functional proteins. RNA editing alters the linearity of genetic information transfer. Indeed, it causes differences between RNAs and their coding DNA sequences that hinder both experimental and computational searches for genes. Therefore, common software tools for gene search, successfully applied to find canonical genes, often fail to discover genes encrypted in the genome of plants. Here we propose a novel strategy to identify candidate coding sequences resulting from possible editing substitutions. In particular, we consider c→u substitutions leading to the creation of new start and stop codons in the mitochondrial DNA of a given input organism. We try to mimic the natural RNA editing mechanism in order to generate candidate Open Reading Frame sequences that could code for novel, uncharacterized proteins. Results obtained by analyzing the mtDNA of Oryza sativa support this approach, since we identified thirteen Open Reading Frame sequences transcribed in Oryza that do not correspond to already known proteins. Five of the corresponding amino acid sequences show high homology with proteins already discovered in other organisms, whereas, for the remaining ones, no such homology was detected.
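The editing-simulation idea can be sketched as follows: apply candidate c→u substitutions (C→T on the coding strand) and scan each edited sequence for Open Reading Frames absent from the unedited one. This is a toy illustration of the principle, not the paper's pipeline, which specifically targets substitutions creating start and stop codons in mitochondrial DNA:

```python
def orfs(seq, min_len=6):
    # Collect open reading frames: ATG ... first in-frame stop codon.
    stops = {"TAA", "TAG", "TGA"}
    found = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] == "ATG":
            for j in range(i + 3, len(seq) - 2, 3):
                if seq[j:j + 3] in stops:
                    if j + 3 - i >= min_len:
                        found.append(seq[i:j + 3])
                    break
    return found

def edited_orfs(seq):
    # Mimic c->u editing: try every single C->T substitution on the coding
    # strand and report ORFs absent from the unedited sequence.
    base = set(orfs(seq))
    new = set()
    for i, ch in enumerate(seq):
        if ch == "C":
            variant = seq[:i] + "T" + seq[i + 1:]
            new.update(o for o in orfs(variant) if o not in base)
    return new
```

For instance, editing the C in `ACG...` yields an `ATG` start codon, exposing an ORF invisible to a scan of the unedited genomic sequence.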

Paper Nr: 29
Title:

Prediction of Cancer using Network Topological Features

Authors:

Fernanda Brito Correia, Joel P. Arrais and José Luis Oliveira

Abstract: Several data mining methods have been applied to explore biological data and understand the mechanisms that regulate genetic and metabolic diseases. The underlying hypothesis is that the identification of signatures can help the clinical identification of diseased tissues. Under this principle, many different methodologies have been tested, mostly using unsupervised methods. A common trend consists in combining the information obtained from gene expression and protein-protein interaction network analyses or, more recently, building series of complex networks to model system dynamics. Despite the positive results these works present, they typically fail to generalize to out-of-sample datasets. In this paper we describe a supervised classification approach, with a new methodology for extracting the network topology dynamics embedded in a disease system, to improve the capacity for cancer prediction using exclusively the topological properties of biological networks as features. Four microarray datasets were used for testing and validation: three from breast cancer experiments and one from a liver cancer experiment. The obtained results corroborate the potential of the proposed methodology to predict a certain type of cancer and the necessity of applying different classification models to different types of cancer.

Paper Nr: 33
Title:

Feature Selection for MicroRNA Target Prediction - Comparison of One-Class Feature Selection Methodologies

Authors:

Malik Yousef, Jens Allmer and Waleed Khalifa

Abstract: Traditionally, machine learning algorithms build classification models from positive and negative examples. Recently, one-class classification (OCC) has received increasing attention in machine learning for problems where the negative class cannot be defined unambiguously. This is specifically problematic in bioinformatics since, for some important biological problems, the target class (positive class) is easy to obtain while the negative one cannot be measured. Artificially generating the negative class data can be based on unreliable assumptions. Several studies have applied two-class machine learning to predict microRNAs (miRNAs) and their targets. Different approaches for the generation of an artificial negative class have been applied, but they may lead to a biased performance estimate. Feature selection has been well studied for the two-class classification problem, while fewer methods are available for feature selection with respect to OCC. In this study, we present a feature selection approach for applying one-class classification to the prediction of miRNA targets. A comparison between one-class and two-class approaches is presented to highlight that their performance is similar, while one-class classification is not based on questionable artificial data for training and performance evaluation. We further show that the feature selection method we tried works to a degree, but needs improvement in the future; perhaps it could be combined with other approaches.
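For readers unfamiliar with OCC, a minimal one-class classifier can be built from the positive class alone, for example by thresholding the distance to the class centroid. This centroid sketch is only a stand-in for the established OCC algorithms (such as one-class SVM) that studies of this kind typically use:

```python
def fit_one_class(positives, quantile=0.95):
    """Minimal one-class classifier: model the target class by its centroid
    and accept points whose distance lies within a training quantile.
    `positives`: list of feature tuples. Returns a predicate: True means
    the point is accepted as belonging to the target (positive) class."""
    n, dim = len(positives), len(positives[0])
    centroid = [sum(p[d] for p in positives) / n for d in range(dim)]
    dist = lambda x: sum((a - b) ** 2 for a, b in zip(x, centroid)) ** 0.5
    radii = sorted(dist(p) for p in positives)
    threshold = radii[min(int(quantile * n), n - 1)]
    return lambda x: dist(x) <= threshold
```

Note that training uses no negative examples at all, which is precisely the property that avoids the artificial-negative-class bias discussed above.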

Paper Nr: 34
Title:

Strategies for Phylogenetic Reconstruction - For the Maximum Parsimony Problem

Authors:

Karla E. Vazquez-Ortiz, Jean-Michel Richer and David Lesaint

Abstract: Phylogenetic reconstruction is considered a central underpinning of diverse fields of biology such as ecology, molecular biology, and physiology; the main example is modeling patterns and processes of evolution. Maximum Parsimony (MP) is an important approach to phylogenetic reconstruction that minimizes the total number of genetic transformations; under this approach, different metaheuristics, such as tabu search and genetic and memetic algorithms, have been implemented to cope with the combinatorial nature of the problem. In this paper we review different strategies that could be added to existing implementations to improve their efficiency and accuracy. First we present two different techniques to evaluate the objective function using CPU and GPU technology, then we show a Path-Relinking implementation to compare tree topologies, and finally we introduce the application of these techniques in a Simulated Annealing algorithm searching for an optimal solution.

Paper Nr: 38
Title:

Ab initio Splice Site Prediction with Simple Domain Adaptation Classifiers

Authors:

Nic Herndon and Doina Caragea

Abstract: Next-generation sequencing (NGS) technologies have made it affordable to sequence any organism, opening the door to assembling new genomes and annotating them, even for non-model organisms. One option for annotating a genome is to assemble RNA-Seq reads into a transcriptome and align the transcriptome to the genome assembly to identify the protein-encoding genes. However, there are a couple of problems with this approach. RNA-Seq is error prone, and therefore the gene models generated with this technique need to be validated. In addition, this method can only capture the genes expressed at the time of sequencing. Machine learning can help address both of these problems by generating ab initio gene models that can provide supporting evidence for the models generated with RNA-Seq, as well as predict additional genes that were not expressed during sequencing. However, machine learning algorithms need large amounts of labeled data to learn accurate classifiers, and newly sequenced, non-model organisms have insufficient labeled data. This can be addressed by leveraging the abundant labeled data from a related model organism (the source domain) and using it in conjunction with the little labeled data from the organism of interest (the target domain) to train a classifier in a domain adaptation setting. The method we propose uses this approach and generates accurate classifications on the task of splice site prediction - a difficult and essential step in gene prediction. It is simple: it combines source and target labeled data, with different weights, into one dataset, and then trains a supervised classifier on the combined dataset. Despite its simplicity it is surprisingly accurate, with highest areas under the precision-recall curve between 53.33% and 83.57%. Out of the domain adaptation classifiers evaluated (SVM, naïve Bayes, and logistic regression), this method produced the best results in 12 of the 16 cases studied.
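The weighted-combination idea is simple enough to sketch directly. Below, a weighted nearest-centroid model stands in for the supervised classifiers evaluated in the paper (SVM, naïve Bayes, logistic regression), and the source weight is an arbitrary placeholder:

```python
def weighted_nearest_centroid(source, target, source_weight=0.3):
    """Domain adaptation in the spirit described above: pool source- and
    target-domain labeled examples with different weights and train one
    supervised classifier on the combination.
    `source`/`target`: lists of (feature_tuple, label)."""
    data = [(x, y, source_weight) for x, y in source] + \
           [(x, y, 1.0) for x, y in target]
    # Accumulate a weighted sum of features per class.
    sums = {}
    for x, y, w in data:
        s, tot = sums.get(y, ([0.0] * len(x), 0.0))
        sums[y] = ([a + w * b for a, b in zip(s, x)], tot + w)
    centroids = {y: [v / tot for v in s] for y, (s, tot) in sums.items()}
    def predict(x):
        # Assign the class whose weighted centroid is nearest.
        return min(centroids, key=lambda y: sum((a - b) ** 2
                   for a, b in zip(x, centroids[y])))
    return predict
```

Down-weighting the source domain lets the abundant model-organism data shape the decision boundary without overwhelming the scarce target-organism labels.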

Paper Nr: 41
Title:

An Extension to Local Network Alignment using Hidden Markov Models (HMMs)

Authors:

Hakan Gündüz and İbrahim Süzer

Abstract: Local alignment is performed on biological networks to find common conserved substructures belonging to different organisms. Algorithms such as PathBLAST (Kelley et al., 2003) and NetworkBLAST (Scott et al., 2006) are used to align networks locally, and they are generally good at finding small common substructures. However, these algorithms share the same shortcomings in finding larger substructures because of complexity issues. To overcome these issues, Hidden Markov Models (HMMs) are used. The study by (Qian and Yoon, 2009) uses HMMs to find optimal conserved paths of constant length in two biological networks. In this paper, we aim to extend the local network alignment procedure of (Qian and Yoon, 2009) to find common substructures of varying lengths between the biological networks. We again use the same algorithm to find k-length exact matches in the networks, and we use them to find common substructures in two forms: sub-graphs and extended paths. These structures do not need to have the same number of nodes but should satisfy a predefined similarity threshold (s0). The other parameter is the length of the exact paths (k) formed from the biological networks; choosing a lower k value is faster, but larger values might be needed in order to balance the number of matching paths below s0.

Paper Nr: 45
Title:

Enumerating Naphthalene Isomers of Tree-like Chemical Graphs

Authors:

Fei He, Akiyoshi Hanai, Hiroshi Nagamochi and Tatsuya Akutsu

Abstract: In this paper, we consider the problem of enumerating naphthalene isomers, where the enumeration of isomers is important for drug design. A chemical graph G with no cycles other than naphthalene rings is called tree-like, and becomes a tree T, possibly with multiple edges, if we contract each naphthalene ring into a single virtual atom of valence 8. We call T the tree representation of G. There may be more than one tree-like chemical graph whose tree representation equals T; these are called the naphthalene isomers of T. We present an efficient algorithm that enumerates all naphthalene isomers of a given tree representation. Our algorithm first counts the number of naphthalene isomers using dynamic programming, and then, for each k, generates the k-th isomer by backtracking through the counting computation. In computational experiments, we compare our method with MolGen, a state-of-the-art enumeration tool, and observe that our program enumerates the same number of naphthalene isomers in a much shorter time, demonstrating the effectiveness of our algorithm.

Paper Nr: 47
Title:

Memory Efficient de novo Assembly Algorithm using Disk Streaming of K-mers

Authors:

Yuki Endo, Fubito Toyama, Chikafumi Chiba, Hiroshi Mori and Kenji Shoji

Abstract: Sequencing the whole genome of various species has many applications, not only in understanding biological systems, but also in medicine, pharmacy, and agriculture. In recent years, the emergence of high-throughput next-generation sequencing technologies has dramatically reduced the time and cost of whole genome sequencing. These new technologies provide ultrahigh throughput at a lower per-unit data cost. However, the data are generated from very short fragments of DNA, so it is very important to develop algorithms for merging these fragments. One method of merging these fragments without using a reference dataset is called de novo assembly. Many algorithms for de novo assembly have been proposed in recent years. Velvet and SOAPdenovo2 are well-known assembly algorithms with good performance in terms of memory and time consumption. However, their memory consumption increases dramatically as the input grows larger, so it is necessary to develop an alternative algorithm with low memory usage. In this paper, we propose an algorithm for de novo assembly with lower memory consumption. The proposed method adopts the memory-efficient DSK (disk streaming of k-mers) approach to count k-mers. Moreover, the memory needed to construct the de Bruijn graph is reduced by not keeping edge information in the graph. In our experiment using human chromosome 14, the average maximum memory consumption of the proposed method was approximately 7.5-8.8% of that of popular assemblers.
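The DSK idea the authors adopt can be sketched as a two-pass procedure: hash k-mers into on-disk partition files, then count one partition at a time, so peak memory is bounded by the largest partition rather than the whole k-mer set. This is an illustration of the principle, not the DSK implementation (a real tool would also write per-partition counts back to disk rather than merge them in memory as done here):

```python
import os, tempfile
from collections import Counter

def count_kmers(reads, k, partitions=4):
    # Pass 1: stream every k-mer to one of several on-disk partition files,
    # chosen by hash, so no k-mer table is held in memory.
    tmpdir = tempfile.mkdtemp()
    paths = [os.path.join(tmpdir, "part%d" % i) for i in range(partitions)]
    files = [open(p, "w") for p in paths]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            files[hash(kmer) % partitions].write(kmer + "\n")
    for f in files:
        f.close()
    # Pass 2: count one partition at a time; because equal k-mers always
    # hash to the same partition, each partition can be counted independently.
    counts = Counter()
    for p in paths:
        with open(p) as f:
            counts.update(line.strip() for line in f)
        os.remove(p)
    os.rmdir(tmpdir)
    return counts
```

With more partitions, each in-memory counting step shrinks proportionally, which is the trade of disk I/O for memory that the abstract describes.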

Paper Nr: 49
Title:

On using Longer RNA-seq Reads to Improve Transcript Prediction Accuracy

Authors:

Anna Kuosmanen, Ahmed Sobih, Romeo Rizzi, Veli Mäkinen and Alexandru I. Tomescu

Abstract: Over the past decade, sequencing read length has increased from tens to hundreds and then to thousands of bases. Current cDNA synthesis methods prevent RNA-seq reads from being long enough to entirely capture all RNA transcripts, but long reads can still provide connectivity information on chains of multiple exons that are included in transcripts. We demonstrate that exploiting full connectivity information leads to significantly higher prediction accuracy, as measured by the F-score. For this purpose we implemented the solution to the Minimum Path Cover with Subpath Constraints problem introduced in (Rizzi et al., 2014), which is an extension of the classical Minimum Path Cover problem and was shown solvable by min-cost flows. We show that, under hypothetical conditions of perfect sequencing, our approach is able to use long reads more effectively than two state-of-the-art tools, StringTie and FlipFlop. Even in this setting the problem is not trivial, and errors introduced into the underlying flow graph by sequencing and alignment errors complicate the problem further. As such, our work also demonstrates the need for the development of a good spliced-read aligner for long reads. Our proof-of-concept implementation is available at http://www.cs.helsinki.fi/en/gsa/traphlor.

Paper Nr: 51
Title:

Sloppy/Stiff Parameters Rankings in Sensitivity Analysis of Signaling Pathways

Authors:

Malgorzata Kardynska, Jaroslaw Smieja, Anna Naumowicz, Patryk Janus, Piotr Widlak and Marek Kimmel

Abstract: Sensitivity analysis methods have been developed for over half a century. However, their application to systems biology is a relatively new concept and has not been fully investigated. In this paper we focus on creating parameter rankings based on sloppy/stiff parameter sensitivity analysis. These rankings can be used to find the most important parameters and processes (those that have the greatest impact on the system output), and subsequently to reduce the number of experiments needed to precisely estimate parameter values or to indicate molecular targets for new drugs. In order to test the proposed procedure, we performed a sensitivity analysis of the HSF/NF-κB pathway model - a model combining two signaling pathways essential for cell survival.

Posters
Paper Nr: 2
Title:

Sequence-based MicroRNA Clustering

Authors:

Kübra Narcı, Hasan Oğul and Mahinur Akkaya

Abstract: MicroRNAs (miRNAs) play important roles in post-transcriptional gene regulation. Understanding their integrative and co-operative activities in gene regulation is therefore closely tied to the identification of miRNA families. In current applications, such groups of miRNAs are identified only through their expression patterns and, consequently, their functional relations. Considering the fact that miRNA regulation is mediated through the mature sequence by recognition of the target mRNA sequences in the RISC (RNA-induced silencing complex) binding regions, we argue here that relevant miRNA groups can be obtained by clustering miRNAs de novo solely on the basis of their sequence information. In this way, a new study can be guided by a set of previously annotated miRNA groups without any preliminary experimentation or literature evidence. In this report, we present the results of a computational study that considers only mature miRNA sequences to obtain relevant miRNA clusters using various machine learning methods employed with different sequence representation schemes. Both statistical and biological evaluations encourage the use of this approach for in silico assessment of functional miRNA groups.
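As a minimal sketch of one possible sequence representation (the paper evaluates several schemes and clustering methods), two mature sequences can be compared by the cosine similarity of their 3-mer count vectors; the sequences below are invented examples, not miRBase entries:

```python
# Hedged sketch: k-mer bag-of-words representation of mature miRNA
# sequences plus cosine similarity, one plausible input to clustering.
from math import sqrt

def kmer_vector(seq, k=3):
    """Count overlapping k-mers in an RNA sequence."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Two invented mature miRNA sequences differing near the 3' end.
s1 = "UGAGGUAGUAGGUUGUAUAGUU"
s2 = "UGAGGUAGUAGGUUGUGUGGUU"
print(round(cosine(kmer_vector(s1), kmer_vector(s2)), 2))
```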

Paper Nr: 14
Title:

Gene-gene Interaction Analysis by IAC (Interaction Analysis by Chi-Square) - A Novel Biological Constraint-based Interaction Analysis Framework

Authors:

Sidney K. Chu, Samuel Guanglin Xu, Feng Xu and Nelson L. S. Tang

Abstract: In recent years of the GWAS era, large-scale genotyping of millions of polymorphisms (SNPs) among thousands of patients has identified new disease predisposition loci. However, conventional GWAS statistical models analyse SNPs only individually and cannot detect significant SNP-SNP (gene-gene) interactions. Studies of interacting genetic variants (SNPs) are useful for elucidating a disease's underlying biological pathway. Therefore, a powerful and efficient statistical model to detect SNP-SNP interactions is urgently needed. We hypothesize that among all the exhaustive patterns of interaction (>100), only a limited number are plausible based on the principles of protein-protein interaction (in the context of GWAS data analysis). The production of proteins by translation predicts that gene-gene interactions resulting in a phenotype should occur only in classical genetic epistasis models, such as dominant-dominant and recessive-recessive models. We developed a statistical analysis model, IAC (Interaction Analysis by Chi-Square), to examine such interactions. We then exhaustively varied population and statistical parameters over a total of 532 simulated case-control experiments to study the effects of these parameters on the statistical power and type I error of interaction versus single-SNP analysis. Our method has also detected potential pairwise interactions associated with Parkinson's disease that previously went undetected by conventional methods. We showed that the detection of SNP-SNP interactions is feasible using typical sample sizes found in common GWAS studies. This approach may be applied complementarily with other models in two-stage association tests to efficiently detect candidate SNPs for further study.
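A minimal sketch of the kind of 2x2 chi-square computation such an analysis rests on, comparing carriers of an interacting genotype combination (e.g. recessive-recessive at two SNPs) between cases and controls; the counts are invented, and this is not the IAC implementation itself:

```python
# Hedged sketch: Pearson chi-square statistic for a 2x2 case-control
# table of carriers vs non-carriers of an epistatic genotype pair.

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for obs, row, col in ((a, row1, col1), (b, row1, col2),
                          (c, row2, col1), (d, row2, col2)):
        exp = row * col / n  # expected count under independence
        stat += (obs - exp) ** 2 / exp
    return stat

#                     carriers  non-carriers   (rows: cases, controls)
print(chi_square_2x2(30, 70, 10, 90))  # → 12.5
```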

Paper Nr: 17
Title:

A Coding Theoretical Approach to Predict Sequence Changes in H5N1 Influenza A Virus Hemagglutinin

Authors:

Keiko Sato, Toshihide Hara and Masanori Ohya

Abstract: Changes in the receptor binding domain of influenza A virus hemagglutinin lead to the appearance of new viral strains that evade the immune system. To prepare for the future emergence of potentially dangerous outbreaks caused by divergent influenza strains, including human-adapted H5N1 strains, it is imperative that we understand the rules stored in the sequence of the receptor binding domain. The information of life is stored as a sequence of nucleotides, and a sequence composed of four nucleotides can be regarded as a code. It is important to determine the code structure of such sequences: once we know the code structure, we can apply mathematical results from coding theory to research in the life sciences. In this study, we applied various codes from coding theory to sequence analysis of the 220 loop in the receptor binding domain of H1, H3, H5 and H7 subtype viruses isolated from humans. Sequence diversity in the 220 loop has been observed even within the same hemagglutinin subtype. However, we found that the code structure of the 220 loop from the same subtype remains unchanged. Our results indicate that the sequences at the 220 loop have the structure of subtype-specific codes. In addition, in view of these findings, we predicted possible amino acid changes in the 220 loop of H5N1 strains that will emerge in the future. Our method will facilitate understanding of the evolutionary patterns of influenza A viruses and further help the development of new antiviral drugs and vaccines.

Paper Nr: 19
Title:

ZK DrugResist - Automatic Extraction of Drug Resistance Mutations and Expression Level Changes from Medline Abstracts

Authors:

Zoya Khalid and Ugur Sezerman

Abstract: Drugs are small molecules that generally work by binding to their targets, which are often proteins. This ligand binding helps in the treatment of various diseases. A major obstacle to treating complex diseases is the phenomenon of drug resistance, whose mechanisms are not yet fully understood. Previous literature has identified a few of the drivers behind this complex mechanism, dominated by protein missense mutations and changes in the expression levels of certain genes. A better understanding of these mechanisms is becoming crucial for researchers. Retrieving information on these processes can be challenging, as the scientific literature contains a huge pool of data and extracting the required information has always been a laborious task. We developed an online pipeline, ZK DrugResist, that automatically extracts PubMed abstracts pairing drug resistance with either mutation or expression for a given disease. Our classifier showed 97.7% accuracy with 93.5% recall and 96.5% F-measure. This system saves a great deal of time in data mining and reduces the effort of retrieving information from online resources.
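For reference, the F-measure quoted above is the harmonic mean of precision and recall; a quick sketch with illustrative values (not the paper's figures):

```python
# The balanced F-measure (F1) combines precision and recall into a
# single score via their harmonic mean. Values below are invented.

def f_measure(precision, recall):
    """F1 = 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.90, 0.935), 3))  # → 0.917
```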

Paper Nr: 20
Title:

Techniques to Control Robot Action Consisting of Multiple Segmented Motions using Recurrent Neural Network with Butterfly Structure

Authors:

Wataru Torii, Shinpei Fujimoto, Masahiro Furukawa, Hideyuki Ando and Taro Maeda

Abstract: In the field of robot control, there have been several studies on humanoid robots operating in remote areas. We propose a methodology to control a robot using input from an operator with fewer degrees of freedom than the robot. This method is based on the concept that time-continuous actions can be segmented because human intentions are discrete in the time domain. Additionally, machine learning is used to determine components with a high correlation to input data that are often complex or large in quantity. In this study, we implemented a new structure on a conventional neural network to manipulate a robot using a fast Fourier transform. The neural network was expected to acquire robustness to amplitude and phase variations. Thus, our model can reflect a fluctuating operator input to control a robot. We applied the proposed neural network to manipulate a robot and verified its validity and performance compared with traditional models.
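One generic property that plausibly underlies such phase robustness, sketched here independently of the paper's butterfly-structured network: the magnitude spectrum of a discrete Fourier transform is unchanged by a circular time shift of the input, so a magnitude front end discards timing jitter in an operator's motion:

```python
# Illustrative sketch (not the paper's model): DFT magnitudes are
# invariant under a circular time shift of the input signal.
from cmath import exp, pi

def dft_magnitudes(x):
    """Naive O(n^2) DFT, returning the magnitude of each frequency bin."""
    n = len(x)
    return [abs(sum(x[t] * exp(-2j * pi * k * t / n) for t in range(n)))
            for k in range(n)]

signal = [0.0, 1.0, 0.0, -1.0, 0.5, 0.0, -0.5, 0.0]
shifted = signal[3:] + signal[:3]  # same motion, different timing
m1, m2 = dft_magnitudes(signal), dft_magnitudes(shifted)
print(all(abs(a - b) < 1e-9 for a, b in zip(m1, m2)))  # → True
```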

Paper Nr: 22
Title:

Gene Selection using a Hybrid Memetic and Nearest Shrunken Centroid Algorithm

Authors:

Vinh Quoc Dang and Chiou-Peng Lam

Abstract: High-throughput technologies such as microarrays and mass spectrometry produce high-dimensional biological datasets in abundance and with increasing complexity. Prediction Analysis for Microarrays (PAM) is a well-known implementation of the Nearest Shrunken Centroid (NSC) method which has been widely used for the classification of biological data. In this paper, a hybrid approach incorporating the Nearest Shrunken Centroid (NSC) and a Memetic Algorithm (MA) is proposed to automatically search for an optimal range of shrinkage threshold values for the NSC to improve feature selection and classification accuracy. Evaluation of the approach involved nine biological datasets, and the results showed improved feature-selection stability over existing evolutionary approaches as well as improved classification accuracy.
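A minimal sketch of the soft-thresholding step at the heart of the NSC method, in which per-class centroid deviations from the overall centroid are shrunk by a threshold delta, zeroing out uninformative features; the MA search over threshold values and the within-class standardization are omitted, and the numbers are invented:

```python
# Hedged sketch of NSC soft-thresholding: deviations with magnitude
# below delta collapse to zero, so those features stop discriminating.

def shrink(class_centroid, overall_centroid, delta):
    """Soft-threshold each class-centroid deviation by delta."""
    shrunk = []
    for c, o in zip(class_centroid, overall_centroid):
        d = c - o
        d = max(abs(d) - delta, 0.0) * (1 if d >= 0 else -1)
        shrunk.append(o + d)
    return shrunk

overall = [1.0, 2.0, 3.0]   # overall centroid over all samples
class_a = [1.8, 2.1, 0.5]   # centroid of one class
# Feature 2's small deviation (0.1) is shrunk away entirely.
print([round(v, 2) for v in shrink(class_a, overall, 0.5)])  # → [1.3, 2.0, 1.0]
```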

Paper Nr: 25
Title:

PAPAyA: A Highly Scalable Cloud-based Framework for Genomic Processing

Authors:

Francois Andry, Nevenka Dimitrova, Alexander Mankovich, Vartika Agrawal, Anas Bder and Ariel David

Abstract: The PAPAyA platform has been designed to ingest, store and process large genomics datasets in silico, using analysis algorithms based on pre-defined knowledge databases, with the goal of offering personalized therapy guidance to physicians, in particular for cancers and infectious diseases. This new highly scalable, secure and extensible framework is deployed on a cloud-based digital health platform that provides generic provisioning and hosting services, identity and access management, workflow orchestration, device cloud capabilities, notifications, scheduling, logging, auditing and metering, as well as specific patient demographic, clinical and wellness data services that can be combined with the genomics analytics results.

Paper Nr: 36
Title:

A Structure-based Approach for Accurate Prediction of Protein Interaction Networks

Authors:

Hafeez Ur Rehman, Usman Zafar, Alfredo Benso and Naveed Islam

Abstract: In recent years, an extraordinary revolution in genome sequencing technologies has produced an overwhelming number of genes that code for proteins, resulting in a deluge of proteomics data. Since proteins are involved in almost every biological activity, this rapid uncovering of biological “facts” places the field of Systems Biology on the doorstep of considerable theoretical and practical advancements. A precise understanding of proteins, especially their functional associations or interactions, is indispensable to explicate how complex biological processes occur at the molecular level, as well as to understand how these processes are controlled and modified in different disease states. In this paper, we present a novel protein-structure-based method to precisely predict the interaction of a putative protein pair. We also utilize the interspecies relationships of proteins, i.e., sequence homology, which is crucial in cases of limited information from other sources of biological data. We further enhance our model to account for protein binding sites by linking individual residues in structural templates that bind to other residues. Finally, we evaluate our model by combining the different sources of information using Naive Bayes classification. The proposed model provides substantial improvements in terms of accuracy, precision and recall when compared with previous approaches. We report an accuracy of 90% when tested on a protein interaction network of the yeast proteome.
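A minimal sketch of combining independent evidence sources through a naive Bayes log-odds sum, in the spirit of fusing structural, homology and binding-site evidence for an interaction call; the prior and likelihood ratios below are invented, not taken from the paper:

```python
# Hedged sketch: under naive Bayes, independent evidence sources
# combine additively in log-odds space. All numbers are illustrative.
from math import log

def naive_bayes_log_odds(prior_odds, likelihood_ratios):
    """Posterior log-odds = log prior odds + sum of log likelihood ratios."""
    return log(prior_odds) + sum(log(r) for r in likelihood_ratios)

# Invented evidence: structural template match, sequence homology,
# binding-site residue link, each as a likelihood ratio.
score = naive_bayes_log_odds(0.01, [8.0, 4.0, 5.0])
print(score > 0)  # positive log-odds -> predict "interacting"  → True
```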

Paper Nr: 53
Title:

Use of GeIS for Early Diagnosis of Alcohol Sensitivity

Authors:

José Fabián Reyes Román and Óscar Pastor López

Abstract: This study focuses on the importance of Genomic Information Systems (GeIS) today; the results of this research provide great benefits to the medical community through technological potential. The application of SILE (Search-Identification-Load-Exploitation) to GeIS improves database management with curated data. The studies focus on improving data quality and on time optimization. With SILE we perform a selective loading of the genes and variations found for a specific disease from different data sources such as NCBI, dbSNP and others. Working with a selected group of genes/variations makes it possible to guarantee a more reliable diagnosis, thus sustaining the increased accuracy of the results with respect to data quality and improvements over time. We also integrate the association of genes/variations with population studies, thereby providing an early diagnosis for diseases of genetic origin.

Paper Nr: 55
Title:

Directional Cellular Dynamics for Tissue Morphogenesis and Tumour Characterization by Aggressive Cancer Cells Identification

Authors:

Abdoulaye Sarr, Petra Miglierini, Alexandra Fronville and Vincent Rodin

Abstract: Due to the availability of large amounts of medical data and the improvement of computing capacities, an increase in tools for medical applications has been noted. In the case of cancer, this has resulted in some application and treatment successes in radiotherapy. However, on the one hand, high therapeutic results are yet to be seen, and on the other hand, unpleasant side effects are still widely observed. In the first case, this may arise from avoiding any damage to healthy structures, implying ineffective treatment; in the second case, it may be due to lethal doses deposited in the tumour, leading to unacceptable damage to one or more healthy structures. Thus, it would be useful to simulate the effects of any treatment prior to its application. We therefore focus on proposing computational methods that give insights for decision-aid tools in radiotherapy. In this paper, we provide algorithms for tissue growth prediction where cells are elements of a 2D cellular-automaton-oriented multi-agent system. We then propose a novel method to predict and characterize the evolution of a pathological tissue under cell irradiation. We show that the more the cells destroyed during radiotherapy are linked to aggressive cancer cells, the more the treatment leads to an impaired result in terms of growth. By contrast, we highlight that there exist cells less linked to these aggressive cancer cells that are more suitable targets for an effective and efficient radiotherapy. Based on the dominant cells (linked or not linked to aggressive cancer cells), we introduce a novel method to classify tumours.
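A toy sketch of one growth step in a 2D cellular-automaton tissue model, to illustrate the modelling style only; the paper's oriented multi-agent system, cell linkage and irradiation model are far richer than this:

```python
# Toy sketch (not the paper's model): each occupied site divides into
# its first free von Neumann neighbour inside a bounded grid.

def grow(cells, width, height):
    """One synchronous division step; returns the enlarged cell set."""
    new = set(cells)
    for (x, y) in cells:
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in new:
                new.add((nx, ny))  # daughter cell occupies the free site
                break
    return new

tissue = {(2, 2)}          # a single seed cell on a 5x5 grid
for _ in range(3):
    tissue = grow(tissue, 5, 5)
print(len(tissue))  # → 8
```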