BIOINFORMATICS 2015 Abstracts


Full Papers
Paper Nr: 11
Title:

Modeling Genetical Data with Forests of Latent Trees for Applications in Association Genetics at a Large Scale - Which Clustering Method should Be Chosen?

Authors:

D.-T. Phan, P. Leray and C. Sinoquet

Abstract: Association genetics, and in particular genome-wide association studies (GWASs), aim at elucidating the etiology of complex genetic diseases. In the domain of association genetics, machine learning provides an appealing alternative framework to standard statistical approaches. Pioneering works (Mourad et al., 2011) have proposed the forest of latent trees (FLTM) to model genetical data at the genome scale. The FLTM is a hierarchical Bayesian network with latent variables. A key to FLTM construction is the recursive clustering of variables, in a bottom-up subsuming process. In this paper, we study the impact of the choice of the clustering method to be plugged in the FLTM learning algorithm, in a GWAS context. Using a real GWAS data set describing 41400 variables for each of 3004 controls and 2005 individuals affected by Crohn’s disease, we compare the influence of three clustering methods. Data dimension reduction and ability to split or group putative causal SNPs in agreement with the underlying biological reality are analyzed. To assess the risk of missing significant association results through subsumption, we also compare the methods through the corresponding FLTM-driven GWASs. In the GWAS context and in this framework, the choice of the clustering method does not impact the satisfying performance of the downstream application, both in power and detection of false positive associations.

Paper Nr: 20
Title:

Essential Proteins and Functional Modules in the Host-Pathogen Interactions from Innate to Adaptive Immunity - C. albicans-zebrafish Infection Model

Authors:

Chia-Chou Wu and Bor-Sen Chen

Abstract: Both the host and the pathogen are indispensable in infectious diseases. Besides studying the host defensive and pathogen invasive mechanisms individually, the cross-species interactions, i.e., the host-pathogen interactions, have become a novel and intensively studied research subject in infectious diseases. In this study, two host-pathogen interaction networks are constructed for innate and adaptive immunity based on the time course microarray data of a C. albicans-zebrafish infection model. The interaction variations in the host, pathogen, and host-pathogen regions are evaluated by comparing the two constructed networks. Proteins with larger interaction variations are taken to play more pivotal roles in the transition from innate to adaptive immunity. Moreover, in the host-pathogen region, four significantly enriched functional modules are identified. Meanwhile, the interaction variations of these four functional groups imply the corresponding strategy shifts of the host and pathogen from innate to adaptive immunity. In view of these results, this study gives a systematic explanation of the transition from innate to adaptive immunity from a functional-module perspective. Thus, this study provides potential targets for developing efficient therapies for infectious diseases.

Paper Nr: 25
Title:

Constructing Structural Profiles for Protein Torsion Angle Prediction

Authors:

Zafer Aydin, David Baker and William Stafford Noble

Abstract: Structural frequency profiles provide important constraints on structural aspects of a protein and are receiving growing interest in the structure prediction community. In this paper, we introduce new techniques for scoring templates that are later combined to form structural profiles of 7-state torsion angles. By employing various parameters of target-template alignments we improve the quality and accuracy of structural profiles considerably. The most effective technique is the scaling of templates by integer powers of the sequence identity score, in which the power parameter is adjusted with respect to the similarity interval of the target. Incorporating other alignment scores as multiplicative factors further improves the accuracy of profiles. After analyzing the individual strengths of various structural profile methods, we combine them with ab-initio predictions of 7-state torsion angles by a linear committee approach. We show that incorporating template information improves the accuracy of ab-initio predictions significantly at all levels of target-template similarity, even when templates are distant from the target. Template scaling methods developed in this work can be applied in many other prediction tasks and in more advanced methods designed for computing structural profiles.

Paper Nr: 29
Title:

Identifying Aging Genes in the Aging Mouse Hypothalamus Using Gateway Node Analysis of Correlation Networks

Authors:

Kathryn M. Cooper, Stephen Bonasera and Hesham Ali

Abstract: High-throughput studies continue to produce volumes of data, providing a wealth of information that can be used to better guide biological research. However, models that can readily identify true biological signals from this data have not been developed at the same rate, due in part to a lack of well-developed algorithms that can handle the magnitude, variability and veracity of the data. One promising and effective solution to this complex issue is network modeling, due to its capabilities for representing biological elements and relationships en masse. In this research, we use correlation networks for analysis where genes are represented as nodes and indirect relationships (derived from expression patterns) are represented as edges. Here, we define “gateway” nodes as elements representing genes that change in co-expression and possibly co-regulation between states. We use the gateway node approach to identify critical genes in the aging mouse brain and perform a cursory investigation of the robustness of these gateway nodes according to network structure. Our results highlight the power of the gateway nodes approach and show how it can be used to limit search space and determine candidate genes for targeted studies. The novelty of this approach lies in application of the gateway node approach on novel mouse datasets, and the investigation into robustness of network structures.
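
The correlation-network construction described above (genes as nodes, expression-derived relationships as edges) can be sketched as follows. This is a minimal illustration that assumes Pearson correlation and a fixed absolute threshold; the abstract specifies neither, and the gene names and expression values are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def correlation_network(expression, threshold=0.9):
    """Return the set of gene pairs whose |r| meets the threshold.

    `expression` maps a gene name to its expression profile.
    The threshold value is an assumption chosen for illustration.
    """
    edges = set()
    for g1, g2 in combinations(sorted(expression), 2):
        if abs(pearson(expression[g1], expression[g2])) >= threshold:
            edges.add((g1, g2))
    return edges


# Toy data (hypothetical, not from the paper).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with geneA
    "geneC": [4.0, 1.0, 3.0, 2.0],  # only weakly related to the others
}
edges = correlation_network(expression, threshold=0.9)
```

Building one such network per state (for example, young and aged) and comparing the neighbours of each gene is one way to look for genes whose co-expression changes between states.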

Paper Nr: 31
Title:

De-Novo Assembly of Short Reads in Minimal Overlap Model

Authors:

Shashank Sharma and Ankit Singhal

Abstract: Next Generation Sequencing (NGS) technologies produce millions of short reads that provide high coverage of the genome at a much lower cost than Sanger sequencing-based technologies. The advent of NGS technologies has led to various developments in assembling techniques. Our focus is on adapting overlap graph based algorithms to work with millions of NGS reads. Due to the high coverage of the genome by NGS reads, we show that it is feasible to perform assembly while working with small overlaps. This strategy gives us a significant computational and space advantage over the existing approaches. Our method finds alternate paths in an overlap graph to construct an assembly. We compare the performance of our tool, MOBS, with some of the widely used assemblers on ideal datasets (error-free reads, distributed uniformly over the genome), for which finished genomes are available. We show that, in most cases, MOBS gives better results than the other assemblers with respect to quality of assemblies, running time and genome coverage.
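
As a rough illustration of an overlap graph with a minimum overlap length, the sketch below links read i to read j when a suffix of i matches a prefix of j. It is a naive quadratic construction for explanation only and does not reproduce the MOBS algorithm; the reads and the `min_overlap` value are hypothetical.

```python
def overlap_length(a, b, min_overlap):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when the longest match is shorter than `min_overlap`.
    """
    max_len = min(len(a), len(b))
    shortest = max(min_overlap, 1)  # never test a zero-length overlap
    for length in range(max_len, shortest - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def overlap_graph(reads, min_overlap):
    """Map each ordered pair of read indices (i, j) to its overlap length."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            length = overlap_length(a, b, min_overlap)
            if length > 0:
                graph[(i, j)] = length
    return graph


# Toy error-free reads (hypothetical).
reads = ["ATGCGT", "GCGTAC", "TACGGA"]
graph = overlap_graph(reads, min_overlap=3)
```

Here `graph` contains an edge from read 0 to read 1 (overlap `GCGT`) and from read 1 to read 2 (overlap `TAC`), so following the edges spells out `ATGCGTACGGA`.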

Paper Nr: 32
Title:

Machine Reading of Biological Texts - Bacteria-Biotope Extraction

Authors:

Wouter Massa, Parisa Kordjamshidi, Thomas Provoost and Marie-Francine Moens

Abstract: The tremendous amount of scientific literature available about bacteria and their biotopes underlines the need for efficient mechanisms to automatically extract this information. This paper presents a system to extract the bacteria and their habitats, as well as the relations between them. We investigate to what extent current techniques are suited for this task and test a variety of models in this regard. To detect entities in a biological text we use a linear chain Conditional Random Field (CRF). For the prediction of relations between the entities, a model based on logistic regression is built. Designing a system upon these techniques, we explore several improvements for both the generation and selection of good candidates. One contribution to this lies in the extended flexibility of our ontology mapper, allowing for a more advanced boundary detection. Furthermore, we discover value in the combination of several distinct candidate generation rules. Using these techniques, we show results that significantly improve upon the state of the art for the BioNLP Bacteria Biotopes task.

Paper Nr: 35
Title:

Wave Equation Model of Soft Tissue for a Virtual Reality Laparoscopy Training System - A Validation Study

Authors:

Sneha Patel and Jackrit Suthakorn

Abstract: Laparoscopic procedures have various benefits for the patients but come with environmental limitations for the surgeons. Therefore, to prevent serious complications, surgeons require intensive and repetitive training to acquire essential techniques, skills or tasks. There are various training systems used in surgical programs; a recent technology that shows promise is virtual reality (VR) training. An important aspect of these training systems is the realism of the soft tissue model and the user interface, which allow effective transference of skills from the training system to the operating room. This paper discusses a novel method to model soft tissue in virtual reality training systems and the validation of this model. The wave equation is used as the mathematical model of the interaction between the soft tissue and the laparoscopic tools. This model is validated using finite element analysis, which is used to compare the mechanical properties of the resulting material and human skin. The model discussed in this paper will be applied to a novel surgical training system, which trains the user in laparoscopic suturing techniques.

Paper Nr: 36
Title:

Outlier Detection in Survival Analysis based on the Concordance C-index

Authors:

João Diogo Pinto, Alexandra M. Carvalho and Susana Vinga

Abstract: Outlier detection is an important task in many data-mining applications. In this paper, we present two parametric outlier detection methods for survival data. Both methods propose to perform outlier detection in a multivariate setting, using the Cox regression as the model and the concordance c-index as a measure of goodness of fit. The first method is a single-step procedure that presents a delete-1 statistic based on bootstrap hypothesis testing for the increase in the concordance c-index. The second method is based on a sequential procedure that maximizes the c-index of the model using a greedy one-step-ahead search. Finally, we use both methods to perform robust estimation for the Cox regression, removing from the regression a fraction of the data by their measure of outlyingness. Our preliminary results on three different data sets show improved estimation of the Cox regression coefficients and improved predictive ability of the model.
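
For readers unfamiliar with the concordance c-index, the sketch below computes it in the usual Harrell form for right-censored data. It takes precomputed risk scores as input rather than fitting a Cox model, and it is not the authors' outlier detection procedure; the sample values are made up, and pairs with tied times are simply skipped (a simplifying assumption).

```python
def concordance_index(times, events, risk_scores):
    """Harrell-style concordance index for right-censored survival data.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] is truthy) and a strictly shorter time than subject j.
    The pair is concordant when i also has the higher risk score;
    tied risk scores count as half a concordant pair.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical data: follow-up times, event indicators (1 = event,
# 0 = censored) and model risk scores (higher = expected earlier event).
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.3, 0.5, 0.1]
c_index = concordance_index(times, events, risk_scores)
```

With these values there are five comparable pairs, four of them concordant, so `c_index` is 0.8. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.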

Short Papers
Paper Nr: 5
Title:

A Nonlinear Mixture Model based Unsupervised Variable Selection in Genomics and Proteomics

Authors:

Ivica Kopriva

Abstract: Typical scenarios occurring in genomics and proteomics involve a small number of samples and a large number of variables. Thus, variable selection is necessary for creating disease prediction models robust to overfitting. We propose an unsupervised variable selection method based on sparseness-constrained decomposition of a sample. The decomposition is based on a nonlinear mixture model comprising a test sample and a reference sample representing the negative (healthy) class. The geometry of the model enables automatic selection of the component comprising disease-related variables. The proposed unsupervised variable selection method is compared with 3 supervised and 1 unsupervised variable selection methods on two-class problems using 3 genomic and 2 proteomic data sets. The results obtained suggest that the proposed method could perform better than supervised methods on unseen data of the same cancer type.

Paper Nr: 10
Title:

Approximate Analysis of Homeostasis of Gene Networks by Linear Temporal Logic using Network Motifs

Authors:

Sohei Ito, Shigeki Hagihara and Naoki Yonezaki

Abstract: We proposed a novel framework to analyse homeostasis of gene networks using linear temporal logic. We formulate a kind of homeostasis as strong satisfiability of reactive system specifications. Both behaviours and properties of gene networks are specified in linear temporal logic and homeostasis of the network is checked by strong satisfiability checkers. Though this framework is simple and applicable to many networks, the computational complexity is heavy and large networks cannot be directly analysed. In this paper we present an approximate analysis method to mitigate this computational difficulty. We approximately specify a network specification using fewer propositions such that approximated specifications guarantee homeostasis of the network. However, it is difficult to find such safely approximated specifications for any gene network. Thus we present approximate specifications for network motifs, which are common patterns appearing in many gene networks. We demonstrate our approximate method and show that it is quite efficient in analysing large networks.

Paper Nr: 16
Title:

PERFECTOS-APE - Predicting Regulatory Functional Effect of SNPs by Approximate P-value Estimation

Authors:

Ilya E. Vorontsov, Ivan V. Kulakovskiy, Grigory Khimulya, Daria D. Nikolaeva and Vsevolod J. Makeev

Abstract: Single nucleotide polymorphisms (SNPs) and variants (SNVs) are often found in regulatory regions of the human genome. Nucleotide substitutions in promoter and enhancer regions may affect transcription factor (TF) binding and alter gene expression regulation. Nowadays binding patterns are known for hundreds of human TFs. Thus one can assess possible functional effects of allele variations or mutations in TF binding sites using sequence analysis. We present PERFECTOS-APE, the software to PrEdict Regulatory Functional Effect of SNPs by Approximate P-value Estimation. Using a predefined collection of position weight matrices (PWMs) representing TF binding patterns, PERFECTOS-APE identifies transcription factors whose binding sites can be significantly affected by given nucleotide substitutions. PERFECTOS-APE supports both classic PWMs under the position independency assumption, and dinucleotide PWMs accounting for the dinucleotide composition and correlations between nucleotides in adjacent positions within binding sites. PERFECTOS-APE uses dynamic programming to calculate the PWM score distribution and convert the scores to P-values, with an optional binary search mode using a precomputed P-value list to speed up the computations. The software is written in Java and is freely available as a standalone program and an online tool: http://opera.autosome.ru/perfectosape/. We have tested our algorithm on several disease associated SNVs as well as on a set of cancer somatic mutations occurring in intronic regions of the human genome.
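
The idea of obtaining a P-value from a PWM score distribution by dynamic programming can be sketched as below. This is a simplified illustration, not the PERFECTOS-APE implementation (which is written in Java): it assumes a mononucleotide PWM with integer scores, an independent background model, and a single fixed site position. The matrix, the background and the two alleles are invented for the example.

```python
from collections import defaultdict


def score_distribution(pwm, background):
    """Exact distribution of PWM scores over random background words.

    `pwm` is a list of columns, each mapping a nucleotide to an integer
    score. `background` maps a nucleotide to its probability.
    Column by column, the probability mass of each reachable partial
    score is propagated to the scores reachable one position later.
    """
    dist = {0: 1.0}
    for column in pwm:
        next_dist = defaultdict(float)
        for score, prob in dist.items():
            for nucleotide, weight in column.items():
                next_dist[score + weight] += prob * background[nucleotide]
        dist = dict(next_dist)
    return dist


def p_value(dist, threshold):
    """Probability that a random word scores at least `threshold`."""
    return sum(prob for score, prob in dist.items() if score >= threshold)


def site_score(pwm, word):
    """Score of one word of the same length as the PWM."""
    return sum(column[nucleotide] for column, nucleotide in zip(pwm, word))


# Hypothetical two-column PWM and uniform background.
pwm = [
    {"A": 2, "C": 0, "G": 0, "T": -1},
    {"A": 0, "C": 1, "G": 0, "T": 0},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
dist = score_distribution(pwm, background)

# Compare the two alleles of a made-up substitution A>T at position 1.
p_ref = p_value(dist, site_score(pwm, "AC"))
p_alt = p_value(dist, site_score(pwm, "TC"))
```

In this toy case `p_ref` is 0.0625 and `p_alt` is 0.8125, so the substitution would be read as weakening the predicted binding site. Real-valued PWMs would first need their scores discretized for this kind of table to stay small.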

Paper Nr: 21
Title:

Multi-Algorithmic Approaches to Gene Expression Binarization

Authors:

Jaime Seguel

Abstract: A basic problem in the construction of network representations of gene interactions is deciding whether a gene is or is not expressed at a time instant. This problem, referred to here as the gene expression decision problem, has been approached with statistical and numerical algorithms. Numerical methods are based on different intuitions about what signals a gene expression threshold and, as a consequence, they often return different answers. Consequently, the choice of a particular gene expression decision algorithm influences the gene interaction model. This article proposes an aggregation methodology for numerical gene expression decision algorithms that is based on voting. The result is thus the expression decision made by the majority of the algorithms, provided that that decision is consistent with an underlying logical law referred to as the doctrine. The proposed method is compared with some non-voting aggregation algorithms.
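
A majority vote over several thresholding rules might look like the sketch below. The three rules (mean, median, mid-range) are hypothetical stand-ins for the numerical algorithms the paper aggregates, and the consistency check against the "doctrine" is omitted because the abstract does not define it.

```python
from statistics import mean, median


def midrange(series):
    """Midpoint between the smallest and largest value."""
    return (min(series) + max(series)) / 2


def binarize_by_vote(series, threshold_rules):
    """Binarize an expression time series by majority vote.

    Each rule maps the series to a threshold; a time point is called
    "expressed" (True) when a strict majority of rules place its value
    above their threshold.
    """
    thresholds = [rule(series) for rule in threshold_rules]
    decisions = []
    for value in series:
        votes = sum(value > t for t in thresholds)
        decisions.append(2 * votes > len(thresholds))
    return decisions


# Hypothetical expression values at four time points.
series = [1.0, 2.0, 5.0, 10.0]
decisions = binarize_by_vote(series, [mean, median, midrange])
```

For this series the thresholds are 4.5, 3.5 and 5.5, so the value 5.0 is called expressed by two of the three rules and `decisions` is `[False, False, True, True]`, whereas the mid-range rule alone would have rejected it.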

Paper Nr: 30
Title:

Biomechanical Effects of the Geometry of Ball-and-Socket Intervertebral Prosthesis on Lumbar Spine Using Finite Element Method

Authors:

Jisu Choi, Dong Ah Shin and Sohee Kim

Abstract: The purpose of this study was to analyze the biomechanical effects of three different types of ball-and-socket geometry of a lumbar artificial disc using finite element method. A three dimensional linear finite element (FE) model was developed, and the lumbar artificial disc was inserted at L3-L4 level. The height of implant was fixed and location of implant was also center-fixed. Three different curvatures of ball-and-socket geometry were modeled (radius of curvature: 50.5mm for C1, 26mm for C2, 18.17mm for C3). The biomechanical effects including range of motion (ROM), stress of intervertebral disc, facet contact force and stress on implant were compared among different geometries. As the radius of curvature decreased, the result shows that ROM increased at the surgical level and the stress on implant decreased. The change in stress within intervertebral disc was not significant. The facet contact force at surgical level was maximum with C2 while C1 and C3 had similar facet contact force. We confirmed that the geometry of artificial disc can cause remarkable biomechanical changes at surgical level.

Paper Nr: 38
Title:

Algorithms for Regularized Linear Discriminant Analysis

Authors:

Jan Kalina and Jurjen Duintjer Tebbens

Abstract: This paper is focused on regularized versions of classification analysis and their computation for high-dimensional data. A variety of regularized classification methods has been proposed and we critically discuss their computational aspects. We formulate several new algorithms for regularized linear discriminant analysis, which exploit a covariance matrix estimator regularized towards a regular target matrix. Numerical linear algebra considerations are used to propose tailor-made algorithms for specific choices of the target matrix. Further, we propose a new classification method based on L2-regularization of group means and the pooled covariance matrix, and accompany it with an efficient algorithm for its computation.
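
As background, regularizing a covariance estimator towards a target matrix is commonly written as a convex combination, which then enters the usual linear discriminant score. The notation below is the generic textbook form and is an assumption; the paper may parameterize the estimator differently.

```latex
% S: pooled sample covariance matrix (possibly singular when p > n)
% T: regular target matrix, assumed symmetric positive definite
% \lambda: regularization parameter
\[
  S^{*} = \lambda S + (1 - \lambda)\, T, \qquad \lambda \in [0, 1)
\]
% Linear discriminant score of observation x for class k,
% with class mean \bar{x}_k and prior probability \pi_k:
\[
  \delta_k(x) = \bar{x}_k^{\top} (S^{*})^{-1} x
              - \tfrac{1}{2}\, \bar{x}_k^{\top} (S^{*})^{-1} \bar{x}_k
              + \log \pi_k
\]
```

Because T is positive definite and S is positive semidefinite, S* is invertible for every λ below 1, which is what makes the rule usable when the number of variables exceeds the number of samples.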

Paper Nr: 39
Title:

Computer Annotation of Nucleic Acid Sequences in Bacterial Genomes Using Phylogenetic Profiles

Authors:

Mikhail A. Golyshev and Eugene V. Korotkov

Abstract: Over the last years a great number of bacterial genomes were sequenced. Now one of the most important challenges of computational genomics is the functional annotation of nucleic acid sequences. In this study we present a computational method and an annotation system for predicting biological functions using phylogenetic profiles. The phylogenetic profile of a gene was created by searching for similarities between the nucleotide sequence of the gene and 1204 reference genomes, with further estimation of the statistical significance of the similarities found. The profiles of the genes with known functions were used for prediction of possible functions and functional groups for the new genes. We conducted the functional annotation for genes from 104 bacterial genomes and compared the functions predicted by our system with the already known functions. For the genes that have already been annotated, the known function matched the function we predicted 63% of the time, and 86% of the time the known function was found within the top five predicted functions. In addition, our system increased the share of annotated genes by 19%. The developed system may be used as an alternative or complementary system to the current annotation systems.

Paper Nr: 40
Title:

Patterns of Codon Usage in Plastidial Genomes of Ancient Plants Provide Insights into Evolution

Authors:

Manju Yadav, Suresh Babu and Gitanjali Yadav

Abstract: Basal angiosperms are the first flowering plants that diverged from ancestral angiosperms, while magnoliids represent the oldest known angiosperms and are considered to retain the characteristics of more primitive angiosperms. Availability of the plastidial genomes from several members of both these classes of plants provides an opportunity to identify and understand large-scale genomic patterns in organelles of early angiosperms. In this work, chloroplast genomes from nine AT-rich basal angiosperm and magnoliid species were analyzed to unearth patterns, if any, in terms of codon bias and to identify factors responsible for the detected patterns. We were able to distinguish nine optimal codons in basal angiosperm chloroplasts and 18 in the case of magnoliids. Our findings suggest mutational bias as the predominant factor shaping codon usage patterns among the genomes examined, while gene expression, hydrophobicity and aromaticity were found to have a limited but important effect on pattern determination.

Paper Nr: 41
Title:

An Enhanced DNA-based Steganography Technique with a Higher Hiding Capacity

Authors:

Samiha Marwan, Ahmed Shawish and Khaled Nagaty

Abstract: DNA-based Steganography is one of the promising techniques to secure data exchange, where data is hidden into a real DNA sequence. For the sake of security, some steganography techniques encrypt data before hiding it, which strengthens the technique against steganalysis. One of the widely used encryption techniques is the DNA-based Playfair cipher. This technique requires a long list of preprocessing steps in addition to extra bits which must be added to guarantee successful decryption. Moreover, the succeeding hiding step suffers from a limited capacity, which turns this current DNA-based Steganography technique into a complex, inefficient, and time-consuming process. In this paper, we propose a new DNA-based Steganography algorithm to simplify the current technique as well as achieve higher hiding capacity. In the proposed algorithm, we enhance the commonly used Playfair cipher by defining a novel short sequence of preprocessing steps and getting rid of the extra overhead bits. We also utilize a more efficient technique to enhance the hiding phase. The proposed approach is not only simple and fast but also provides a significantly higher hiding capacity with high security. The extensive experimental studies conducted confirm the outstanding performance of the proposed algorithm.

Paper Nr: 42
Title:

A Graph-based Pattern Recognition for Chemical Molecule Matching

Authors:

Yunus Gökçer, M. Fatih Demirci and Mehmet Tan

Abstract: In this paper we present a new method that uses graph-based pattern recognition to compute the similarity between chemical molecules. Our method is used for prediction of the activity of chemical molecules, that is, the prediction of carcinogenicity of molecules. In our method, molecules are depicted as edge-weighted graphs, where each atom corresponds to a vertex and the bonds between the atoms are depicted as edges. The framework performs graph embedding by representing vertices as points in a geometric space. The similarity measure (distance) between the embedded points is computed using the Earth Mover’s Distance (EMD) method, which is based on a distribution-based transportation algorithm. Our method shows promising results on the PTC dataset compared to the existing kernels.

Paper Nr: 43
Title:

Bioinformatics Strategies for Identifying Regions of Epigenetic Deregulation Associated with Aberrant Transcript Splicing and RNA-editing

Authors:

Mia D. Champion, Ryan A. Hlady, Huihuang Yan, Jared Evans, Jeff Nie, Jeong-Heon Lee, James M. Bogenberger, Kannabiran Nandakumar, Jaime Davila, Raymond Moore, Asha Nair, Daniel O'Brien, Yuan-Xiao Zhu, K. Martin Kortüm, Tamas Ordog, Zhiguo Zhang, Richard W. Joseph, A. Keith Stewart, Jean-Pierre Kocher, Eric Jonasch, Keith D. Robertson, Raoul Tibes and Thai H. Ho

Abstract: Epigenetic modifications are associated with the regulation of co/post-transcriptional processing, and differential transcript isoforms are known to be important during cancer progression. It remains unclear how disruptions of chromatin-based modifications contribute to tumorigenesis and how this knowledge can be leveraged to develop more potent treatment strategies that target specific isoforms or other products of the co/post-transcriptional regulation pathway. Rapid developments in all areas of next-generation sequencing (DNA, RNA-seq, ChIP-seq, Methyl-CpG, etc.) have provided new opportunities to develop novel integration and data-mining approaches, and also allow for exciting hypothesis-driven bioinformatics and computational studies. Here, we present a program that we developed and summarize the results of applying our methods to analyze datasets from patient-matched tumor or normal (T/N) paired samples, as well as cell lines that were either sensitive or resistant (S/R) to treatment with an anti-cancer drug, 5-Azacytidine (http://sourceforge.net/projects/chiprnaseqpro/). We discuss additional options for user-defined approaches and general guidelines for simultaneously analyzing and annotating epigenetic and RNA-seq datasets in order to identify and rank significant regions of epigenetic deregulation associated with aberrant splicing and RNA-editing.

Paper Nr: 46
Title:

Fast Alignment-free Comparison for Regulatory Sequences using Multiple Resolution Entropic Profiles

Authors:

Matteo Comin and Morris Antonelli

Abstract: Enhancers are stretches of DNA (100-1000 bp) that play a major role in development, gene expression, evolution and disease. It has been recently shown that in high-level eukaryotes enhancers rarely work alone; instead, they collaborate by forming clusters of cis-regulatory modules (CRMs). Even if the binding of transcription factors is sequence-specific, the identification of functionally similar enhancers is very difficult and it cannot be carried out with traditional alignment-based techniques. In this paper we study the use of alignment-free measures for the classification of CRMs. However, alignment-free measures are generally tied to a fixed resolution k. Here we propose an alignment-free statistic that is based on multiple resolution patterns derived from Entropic Profiles. An Entropic Profile is a function of the genomic location that captures the importance of that region with respect to the whole genome. We evaluate several alignment-free statistics on simulated data and real mouse ChIP-seq sequences. The new statistic is highly successful in discriminating functionally related enhancers and, in almost all experiments, it outperforms fixed-resolution methods.
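
To make the phrase "tied to a fixed resolution k" concrete, the sketch below computes the classic D2 statistic, a standard fixed-resolution alignment-free measure that counts shared k-mers between two sequences. It is shown only as a baseline of the kind the abstract contrasts against, not as the entropic-profile statistic the paper proposes; the two sequences are made up.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def d2(x, y, k):
    """D2 statistic: sum over k-mers of count in x times count in y."""
    counts_x = kmer_counts(x, k)
    counts_y = kmer_counts(y, k)
    return sum(n * counts_y[word] for word, n in counts_x.items()
               if word in counts_y)


# Hypothetical sequences.
x = "ACGTACGT"
y = "CGTACG"
d2_k3 = d2(x, y, 3)
d2_k2 = d2(x, y, 2)
```

The same pair of sequences scores 6 at k = 3 and 9 at k = 2, which shows how the result depends on the single resolution chosen in advance.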

Paper Nr: 53
Title:

In Silico Analysis of Interactions Between NFkB and HSF Pathways

Authors:

Jaroslaw Smieja, Malgorzata Kardynska, Anna Naumowicz, Patryk Janus, Piotr Widlak and Marek Kimmel

Abstract: Motivation: Inhibition of NFkB pathway is known to promote apoptosis and therefore may constitute one of the goals in anticancer therapies. Experimental results show that heat shock induces such inhibition in cancer cells. However, the mechanisms of interactions between heat shock and NFkB pathways are not fully understood yet. Development of a combined mathematical model of these pathways and its subsequent computational analysis should help to uncover these mechanisms and determine the time window in which heat shock treatment preceding chemotherapy would be the most efficient. Results: An original mathematical model has been developed, allowing for computational testing of various hypotheses concerning main sources of interplay between HSF and NFkB pathways. Computational analysis strongly suggests that the competition for IKK, known from literature, cannot be the only mechanism. Two plausible hypotheses are that either a kinase activating IKK can misfold due to heat shock or that heat shock affects TNF receptors, blocking activation of NFkB pathway at the cell membrane.

Paper Nr: 55
Title:

Identifying Strong Statistical Bias in the Local Structure of Metabolic Networks - The Metabolic Network of Saccharomyces Cerevisiae as a Test Case

Authors:

Paulo A. N. Dias, Marco Seabra dos Reis, Pedro Martins and Armindo Salvador

Abstract: The detection of strong statistical bias in metabolic networks is of much interest for highlighting potential selective preferences. However, previous approaches to this problem have relied on ambiguous representations of the coupling among chemical reactions or on physically unrealizable null models, which raise interpretation problems. Here we present an approach that avoids these problems. It relies on a bipartite-graph representation of chemical reactions, and it prompts a near-comprehensive examination of statistical bias in the relative frequencies of topologically related metabolic structures within a predefined scope. It also lends itself naturally to a comprehensive visualization of such statistical relationships. The approach was applied to the metabolic network of Saccharomyces cerevisiae, where it highlighted a preference for sparse local structures and flagged strong context-dependences of the reversibility of reactions and of the presence/absence of some types of reactions.

Paper Nr: 61
Title:

A Novel Technique of Feature Extraction Based on Local and Global Similarity Measure for Protein Classification

Authors:

Neha Bharill and Aruna Tiwari

Abstract: The paper proposes a novel approach for extracting features from protein sequences. This approach extracts only 6 features for each protein sequence, which are computed by globally considering the probabilities of occurrence of the amino acids in different positions of the sequences within the superfamily that locally belong to the six exchange groups. These features are then used as input to a neural network learning algorithm named the Boolean-Like Training Algorithm (BLTA). The BLTA classifier is used to classify the protein sequences obtained from the Protein Information Resource (PIR). To investigate the efficacy of the proposed feature extraction approach, experiments are performed on two superfamilies, namely Ras and Globin. Across tenfold cross validation, the highest Classification Accuracy achieved by the proposed approach is 94.32±3.52 with a Computational Time of 6.54±0.10 s, which is remarkably better than the Classification Accuracies achieved by other approaches. The experimental results demonstrate that the proposed approach extracts the minimum number of features for each protein sequence. It therefore yields a considerable improvement in Classification Accuracy and takes less Computational Time for protein sequence classification in comparison with other well-known feature extraction approaches.

Paper Nr: 63
Title:

Multifactorial Dimensionality Reduction for Disordered Trait

Authors:

Alexander Rakitko

Abstract: We develop our recent works concerning the identification of the factors associated with a certain complex disease. The case of disordered discrete trait is studied. We build two models (3D and 2D) for the range of response variable indicating the state of the health of a patient. In this work we consider the problem of optimal forecast for response variable depending on a finite collection of factors with values in arbitrary finite set. The quality of prediction is described by the error function involving a penalty function. The estimation of the error requires some cross-validation procedure. The developed approach provides the basis to identify the set of significant factors. Such problem arises naturally, e.g., in the genome-wide association study. Using simulated data we illustrate the efficiency of our method.

Paper Nr: 65
Title:

CyanoFactory Knowledge Base & Synthetic Biology - A Plea for Human Curated Bio-databases

Authors:

Gabriel Kind, Eric Zuchantke and Röbbe Wünschiers

Abstract: Nowadays, life science research is dominated by two conditions: interdisciplinarity and high-throughput. The former leads to highly diverse datasets from a data type point of view, while the latter yields massive amounts of data. Both aspects are reflected by the byte-growth of public bio-databases and the sheer number of specialised databases or databases of databases (i.e. data warehouses). We provide an insight into the development of a biodata knowledge base (dubbed CyanoFactory KB) targeted at bio-engineers in the field of synthetic biology and exemplify the need for data type specific data curation and cross-linking. CyanoFactory KB is unique in incorporating experimental data from a broad range of scientific methods that are based on one strain of Synechocystis sp. PCC 6803. The knowledge base can be accessed upon request via cyanofactory.hs-mittweida.de.

Posters
Paper Nr: 34
Title:

Coupling of Self-activating Genes Induces Spontaneous Synchronized Oscillations in Cells

Authors:

Jesus Miro-Bueno

Abstract: Genetic oscillators are present in a wide range of organisms from bacteria to neurons and coordinate important biological functions. Current models of genetic oscillators are based on auto-repressed genes. In these models a gene produces a repressor protein that binds to the promoter of its own gene, repressing transcription. Different versions of these models have been studied in living organisms and for engineering synthetic clocks. Synchronization of genetic clocks based on this model has also been studied. However, genes with positive feedback are also present in natural and synthetic genetic clocks. These self-activating genes provide robustness and frequency tuning to genetic clocks. In this paper we show a novel role of self-activating genes. We demonstrate that the coupling of self-activating genes by small molecules in a cell population produces synchronized oscillations. Our model could be useful for engineering new robust multicellular clocks and for better understanding natural genetic oscillators.

Paper Nr: 47
Title:

Alpha Complexes in Protein Structure Prediction

Authors:

Pawel Winter and Rasmus Fonseca

Abstract: Reducing the computational effort and increasing the accuracy of potential energy functions is of utmost importance in modeling biological systems, for instance in protein structure prediction, docking or design. Evaluating interactions between nonbonded atoms is the bottleneck of such computations. It is shown that local properties of α-complexes (subcomplexes of Delaunay tessellations) make it possible to identify nonbonded pairs of atoms whose contributions to the potential energy are not marginal and cannot be disregarded. Computational experiments indicate that, using the local properties of α-complexes, the relative error (when compared to the potential energy contributions of all nonbonded pairs of atoms) is well within 2%. Furthermore, the computational effort (assuming that α-complexes are given) is comparable to even the simplest and therefore also fastest cutoff approaches. Determining α-complexes from scratch for every configuration encountered during the search for the native structure would make this approach hopelessly slow. However, it is argued that kinetic α-complexes can be used to reduce the computational effort of determining the potential energy when "moving" from one configuration to a neighboring one. As a consequence, the relatively expensive (initial) construction of an α-complex is expected to be compensated by subsequent fast kinetic updates during the search process. Computational results presented in this paper are limited. However, they suggest that the applicability of α-complexes and kinetic α-complexes in protein-related problems (e.g., protein structure prediction and protein-ligand docking) deserves further investigation.
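
For orientation, the "simplest cutoff approach" used as the baseline above can be sketched as follows: keep every nonbonded pair of atoms whose distance does not exceed a fixed cutoff. This is only the baseline, not the α-complex method of the paper; the function name, the `bonded` argument and the brute-force pair enumeration are illustrative assumptions.

```python
from itertools import combinations
from math import dist


def nonbonded_pairs_within_cutoff(coords, cutoff, bonded=frozenset()):
    """Return index pairs (i, j), i < j, of nonbonded atoms at most `cutoff` apart.

    coords: list of (x, y, z) tuples, one per atom.
    bonded: set of (i, j) pairs with i < j that are covalently bonded and skipped.
    """
    pairs = []
    for i, j in combinations(range(len(coords)), 2):
        if (i, j) in bonded:
            continue  # bonded pairs are handled by other energy terms
        if dist(coords[i], coords[j]) <= cutoff:
            pairs.append((i, j))
    return pairs
```

This brute-force version examines all pairs and is quadratic in the number of atoms; practical cutoff implementations typically add a spatial grid or neighbor list.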

Paper Nr: 48
Title:

Can Evolutionary Rate Matrices be Estimated from Allele Frequencies?

Authors:

Conrad J. Burden

Abstract: This paper is a work in progress which aims to combine the principles of population genetics and continuous-time Markovian evolutionary models to estimate evolutionary rate matrices from the current observed state of a single genome. A model is proposed in which sections of the genome which are not susceptible to natural selection are considered to be a statistical ensemble of individual genomic sites. Each site is a representative from a stationary distribution of allele frequencies, taking values between 0 and 1, within the population. Simulations of this distribution via a finite-state Markov model based on a finite effective population size are compared with the stationary solution to the continuum Fokker-Planck equation. Parameters of the evolutionary rate matrix introduced via mutation rates within the Fokker-Planck equation are estimated for simulated data in a number of exploratory examples.
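
To make the idea of a finite-state Markov model over allele frequencies concrete, here is a minimal sketch that assumes a neutral two-allele Wright-Fisher model with mutation. The abstract does not state which finite-population model the author uses, so the model choice, the function names and the parameters `u` and `v` are all assumptions made for illustration.

```python
from math import comb


def wright_fisher_matrix(n_copies, u, v):
    """Transition matrix for the count of allele A among n_copies gene copies.

    u: per-generation mutation probability A -> a (assumed parameter)
    v: per-generation mutation probability a -> A (assumed parameter)
    Entry [i][j] is the probability of moving from i to j copies of A.
    """
    matrix = []
    for i in range(n_copies + 1):
        x = i / n_copies
        # Probability that a sampled parental copy is A after mutation.
        p = x * (1 - u) + (1 - x) * v
        row = [
            comb(n_copies, j) * p**j * (1 - p) ** (n_copies - j)
            for j in range(n_copies + 1)
        ]
        matrix.append(row)
    return matrix


def stationary_distribution(matrix, iterations=2000):
    """Approximate the stationary distribution by repeated left-multiplication."""
    size = len(matrix)
    distribution = [1.0 / size] * size
    for _ in range(iterations):
        distribution = [
            sum(distribution[i] * matrix[i][j] for i in range(size))
            for j in range(size)
        ]
    return distribution
```

Dividing the state index by `n_copies` gives the allele frequency, so the resulting vector is a discrete distribution over frequencies between 0 and 1 that could be compared with a continuum solution.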

Paper Nr: 50
Title:

Balanced Sampling Method for Imbalanced Big Data Using AdaBoost

Authors:

Hong Gu and Tao Song

Abstract: With the arrival of the era of big data, processing large volumes of data at much faster rates has become more urgent and has attracted more and more attention. Furthermore, many real-world data applications present severe class distribution skews, and the underrepresented classes are usually of concern to researchers. Variants of the boosting algorithm have been developed to cope with the class imbalance problem. However, due to the inherent sequential nature of boosting, these methods cannot be directly applied to efficiently handle large-scale data. In this paper, we propose a new parallelized version of boosting, AdaBoost.Balance, to deal with imbalanced big data. It adopts a new balanced sampling method which combines undersampling methods with oversampling methods and can be simultaneously calculated by multiple computing nodes to construct a final ensemble classifier. Consequently, it is easily implemented on parallel processing platforms for big data such as the MapReduce framework.
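
The general idea of combining undersampling with oversampling can be sketched as below. This is not the AdaBoost.Balance procedure, whose details the abstract does not give; it only shows one plausible way to draw a class-balanced subset, and the function name, the `per_class` parameter and the use of a seed per draw are assumptions.

```python
import random
from collections import defaultdict


def balanced_sample(samples, labels, per_class, seed=0):
    """Draw `per_class` examples from every class.

    Classes with at least `per_class` examples are undersampled without
    replacement; smaller classes are oversampled with replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= per_class:
            chosen = rng.sample(items, per_class)  # undersampling
        else:
            chosen = [rng.choice(items) for _ in range(per_class)]  # oversampling
        out_samples.extend(chosen)
        out_labels.extend([label] * per_class)
    return out_samples, out_labels
```

In a parallel setting, each computing node could hypothetically call such a function with a different seed and train its own weak learner on the resulting balanced subset.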

Paper Nr: 52
Title:

Application of Ant Colony Optimization for Mapping the Combinatorial Phylogenetic Search Space

Authors:

Alexander Safatli and Christian Blouin

Abstract: In bioinformatics, landscapes of phylogenetic trees for an alignment of sequence data are defined by a discrete state combinatorial space. The optimal solution in such a space is the best-fitting tree, which provides insight into the evolutionary relationship between taxonomic groups. The underlying structure of this space is poorly understood. The Ant Colony Optimization (ACO) algorithm is applied in a novel manner to sample phylogenetic tree landscapes in order to understand more about this structure. The proposed implementation provides a probabilistic model for exploring this combinatorial space. This probabilistic model allows us to circumvent the complexity that arises from increasing the number of sequences. To evaluate its performance, quantities of resultant solutions were judged in order to determine how much of the space can be sampled. The results show that the algorithm is robust to the starting location and consistently samples a majority of the search space.

Paper Nr: 58
Title:

A Web-based Computer Aided Detection System for Automated Search of Lung Nodules in Thoracic Computed Tomography Scans

Authors:

M. E. Fantacci, S. Bagnasco, N. Camarlinghi, E. Fiorina, E. Lopez Torres, F. Pennanzio, C. Peroni, A. Retico, M. Saletta, C. Sottocornola, A. Traverso and P. Cerello

Abstract: M5L, a Web-based fully automated Computer-Aided Detection (CAD) system for the detection of lung nodules in thoracic Computed Tomography (CT), is based on a multi-thread analysis with two independent CAD subsystems, the lung Channeler Ant Model (lungCAM) and the Voxel-Based Neural Analysis (VBNA), and on the combination of their results. The lungCAM subsystem is based on a model of the capabilities that ants show in nature in finding structures, defining shapes and acting according to local information. The VBNA subsystem is based on a multi-scale filter for spherical structures when searching for internal nodules and on the analysis of the intersections of surface normals when searching for pleural nodules. The M5L performance, extensively validated on 1043 CT scans from 3 independent datasets, including the full LIDC/IDRI database, is homogeneous across the databases: the sensitivity is about 0.8 at 6-8 False Positive findings per scan, despite the different annotation criteria and acquisition and reconstruction conditions. A prototype service based on M5L is hosted on a server operated by INFN in Torino. Preliminary validation tests of the system have recently started in several Italian radiological institutes.

Paper Nr: 62
Title:

Diagonal Consistency Problem Resolution in DIALIGN Algorithm

Authors:

Ibrahim Chegrane, Athman Sighier, Chahrazed Ighilaza and Aicha Boutorh

Abstract: DIALIGN is a well-known algorithm for pairwise as well as multiple alignment of nucleic acid and protein sequences. It combines local and global alignment features. In this paper we present a new method that better solves the problem of diagonal consistency in the DIALIGN algorithm using graph theory modeling, and we describe a new implementation of the method, from the extraction of diagonals to the final alignment process. We show the power of our proposed approach by comparing it with DIALIGN 2.2 using benchmarks from the "BAliBASE" and "SMART" databases.

Paper Nr: 66
Title:

Is the Identification of SNP-miRNA Interactions Supporting the Prediction of Human Lymphocyte Transcriptional Radiation Responses?

Authors:

Marzena Dolbniak, Joanna Zyla, Sylwia Kabacik, Grainne Manning, Christophe Badie, Ghazi Alsbeih and Joanna Polanska

Abstract: Genome-Wide Association Studies (GWAS) are of great importance in identifying the genetic variants associated with traits/diseases. Due to the high number of candidate SNPs, some filtering techniques need to be applied. The aim of the study was to develop a comprehensive approach allowing for detailed analysis of both SNP-gene and SNP-miRNA-gene relations. We elaborated and optimized a novel signal analysis pipeline that significantly improves the results of the analysis of genotype-phenotype interplay. Direct links between genotype results and gene expression levels were enriched by detailed analysis of SNP-miRNA-gene interactions at both the mature miRNA structure/seed region and the target binding site level. The proposed technique was applied to data on lymphocyte radiation response and increased the number of potential functional SNPs by almost 100%.

Paper Nr: 67
Title:

Towards a Unified Named Entity Recognition System - Disease Mention Identification

Authors:

Tsendsuren Munkhdalai, Meijing Li, Khuyagbaatar Batsuren and Keun Ho Ryu

Abstract: Named Entity Recognition (NER) is an essential prerequisite task before effective text mining can begin for biomedical text data. Exploiting unlabeled text data to improve system performance has been an active and challenging research topic in text mining due to the recent growth in the amount of biomedical literature. In this study, we take a step towards a unified NER system for the biomedical, chemical and medical domains. We evaluate word representation features automatically learnt from a large unlabeled corpus for disease NER. The word representation features include Brown cluster labels and Word Vector Classes (WVC) built by applying k-means clustering to the continuous-valued word vectors of a Neural Language Model (NLM). The experimental evaluation using the Arizona Disease Corpus (AZDC) showed that these word representation features boost system performance significantly, as a manually tuned domain dictionary does. BANNER-CHEMDNER, a chemical and biomedical NER system, has been extended with a disease mention recognition model that achieves a 77.84% F-measure on AZDC when evaluated with 10-fold cross-validation. BANNER-CHEMDNER is freely available at: https://bitbucket.org/tsendeemts/banner-chemdner.