BIOINFORMATICS 2014 Abstracts


Full Papers
Paper Nr: 16
Title:

A Qualitative Framework for Analysing Homeostasis in Gene Networks

Authors:

Sohei Ito, Shigeki Hagihara and Naoki Yonezaki

Abstract: Toward a system-level understanding of the mechanisms contributing to homeostasis in organisms, a computational framework to model a system and analyse its properties is indispensable. The purpose of this work is to provide a framework which enables testing and validating homeostatic properties on gene regulatory networks in silico. Based on a qualitative analysis framework for gene networks using temporal logic, we propose a novel formulation of homeostasis by the notion of realisability. This formulation of homeostasis yields a qualitative method to analyse homeostasis of gene networks. In this formulation, homeostasis is captured by a response not just to an instantaneous stimulation, as in dose-response relationships, but to any input scenario, e.g. oscillating or continuous inputs, which is difficult to capture with quantitative models. Moreover, we can consider any number of inputs from an environment without difficulty. Such flexibility is a notable advantage of our framework. We demonstrate the usefulness of our framework in analysing a number of small but tricky networks.

Paper Nr: 17
Title:

Probabilistic Neural Network for Predicting Resistance to HIV-Protease Inhibitor Nelfinavir

Authors:

Letícia Martins Raposo, Mônica Barcellos Arruda, Rodrigo de Moraes Brindeiro and Flavio Fonseca Nobre

Abstract: Resistance to antiretroviral drugs has been a major obstacle for a long-lasting treatment of HIV infected patients. The development of models to predict drug resistance is already recognized as useful for helping the decision making process regarding the best therapy for each individual HIV+. The aim of this study was to develop classifiers for predicting resistance to HIV protease inhibitor Nelfinavir using probabilistic neural network (PNN). The data were provided by the Molecular Virology Laboratory of the Health Sciences Center, Federal University of Rio de Janeiro (CCS-UFRJ/Brazil). Using a combination of bootstrap and cross-validation to develop the classifiers, four features were selected to be used as input for the network. Additionally, this approach was also used to define the spread parameter of the PNN networks. Final modelling strategy involved the development of four PNN networks with balanced data and evaluation of the models was done using a separate test set. The accuracies on the test set of the classifiers ranged from 71.2 to 76.0% and the area under the receiver operating characteristic (ROC) curve (AUC) ranged from 0.70 to 0.73. For the two best classifiers the sensitivity and specificity were 66.7% and 78.9% respectively, and the accuracy and AUC were 76.0% and 0.73 for both classifiers. The classifiers showed performances very close to two existing expert-based interpretation systems (IS), the Stanford HIV db and the Rega algorithms. The analysis also illustrates the use of a computational approach for feature selection and model parameters estimation that can be used in other settings.
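As a rough illustration of the classifier family named in this abstract, a probabilistic neural network can be sketched as a Gaussian kernel density estimate per class, with the spread parameter controlling kernel width. The function name, the toy feature vectors and the class labels below are hypothetical and are not taken from the paper; the authors' feature selection and bootstrap procedure are not reproduced.

```python
import math

def pnn_predict(train_x, train_y, x, spread=1.0):
    """Classify x with a minimal probabilistic neural network (illustrative only).

    Each training sample contributes a Gaussian kernel centred on itself;
    the class with the highest average kernel activation wins.
    """
    scores = {}
    counts = {}
    for features, label in zip(train_x, train_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(features, x))
        activation = math.exp(-sq_dist / (2 * spread ** 2))
        scores[label] = scores.get(label, 0.0) + activation
        counts[label] = counts.get(label, 0) + 1
    # Average per class so that unbalanced classes are not favoured by size alone.
    return max(scores, key=lambda c: scores[c] / counts[c])

# Hypothetical two-feature toy data, not real genotype data.
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
train_y = ["susceptible", "susceptible", "resistant", "resistant"]
near_origin = pnn_predict(train_x, train_y, [0.2, 0.3], spread=1.0)
near_far_cluster = pnn_predict(train_x, train_y, [5.1, 5.4], spread=1.0)
```

A smaller spread makes each kernel narrower, so predictions depend more strongly on the nearest training samples.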

Paper Nr: 24
Title:

Exploring a Sub-optimal Hidden Markov Model Sampling Approach for De Novo Peptide Structure Modeling

Authors:

Pierre Thevenet and Pierre Tufféry

Abstract: Peptides have, in recent years, become plausible candidate therapeutics. However, their structural characterization at a large scale, necessary for their identification and optimization, still remains an open in silico challenge. We introduce a new procedure for the rapid generation of 3D models of peptides. It is based on the concept of a Hidden Markov Model derived structural alphabet, a generalization of the secondary structure. Based on this concept we have previously set up an approach to the de novo modeling of peptide structure based on a greedy algorithm. Here, we explore a new strategy that relies on the sampling of the sub-optimal sequences of states in terms of a Hidden Markov Model derived structural alphabet. Our results suggest such a procedure is able to identify the native conformation of peptides at a very low algorithmic complexity, while having a performance similar to the former greedy approach. On average, peptide models approximate the experimental structure at less than 3 Å RMSD, for a processing cost of only a few minutes on a workstation. As a result, peptide de novo modeling becomes tractable at a large scale.

Paper Nr: 25
Title:

Knowledge-based Subtractive Integration of mRNA and miRNA Expression Profiles to Differentiate Myelodysplastic Syndrome

Authors:

Jiří Kléma, Jan Zahálka, Michael Anděl and Zdeněk Krejčík

Abstract: The goal of our work is to integrate conventional mRNA expression profiles with miRNA expressions using the knowledge of their validated or predicted interactions in order to improve class prediction in genetically determined diseases. The raw mRNA and miRNA expression features become enriched or replaced by new aggregated features that model the mRNA-miRNA interaction. The proposed subtractive integration method is directly motivated by the inhibition/degradation models of gene expression regulation. The method aggregates mRNA and miRNA expressions by subtracting a proportion of miRNA expression values from their respective target mRNAs. The method is used to model the outcome or development of myelodysplastic syndrome, a blood cell production disease often progressing to leukemia. The reached results demonstrate that the integration improves classification performance when dealing with mRNA and miRNA profiles of comparable predictive power.
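The subtractive aggregation described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not give the proportion or say how several miRNAs targeting one mRNA are combined, so the `alpha` parameter, the summation over regulators, and all gene and miRNA names are hypothetical.

```python
def subtractive_integration(mrna, mirna, targets, alpha=0.5):
    """Return mRNA values reduced by a proportion of their targeting miRNAs.

    mrna:    dict gene -> expression value
    mirna:   dict miRNA -> expression value
    targets: dict gene -> list of miRNAs known or predicted to target it
    alpha:   assumed proportion of miRNA expression to subtract
    """
    adjusted = {}
    for gene, level in mrna.items():
        regulators = targets.get(gene, [])
        # Assumption: inhibition from several miRNAs is simply summed.
        inhibition = sum(mirna.get(m, 0.0) for m in regulators)
        adjusted[gene] = level - alpha * inhibition
    return adjusted

# Hypothetical toy profiles.
mrna = {"GENE_A": 8.0, "GENE_B": 5.0}
mirna = {"miR-x": 2.0, "miR-y": 4.0}
targets = {"GENE_A": ["miR-x", "miR-y"]}
adjusted = subtractive_integration(mrna, mirna, targets, alpha=0.5)
```

Genes without any listed regulator keep their raw value, so the aggregated features can either replace or sit alongside the original mRNA features.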

Paper Nr: 33
Title:

SuperPhy - A Pilot Resource for Integrated Phylogenetic and Epidemiological Analysis of Pathogens

Authors:

Matthew Whiteside, Chad R. Laing, Akiff Manji and Victor P. J. Gannon

Abstract: Advances in DNA sequencing technology have created new opportunities in fields such as clinical medicine and epidemiology, where performing real-time, genome-based surveillance and identification of phenotypic characteristics of bacterial pathogens is now possible. New analytical tools and infrastructure are needed to analyze these genomic datasets, store the data, and provide the essential biological information to end-users. We have implemented an online whole-genome analysis platform called SuperPhy that uses Panseq as an engine to compare bacterial genomes, Fisher’s exact test to identify sub-group specific loci, and FastTree to create maximum-likelihood trees. SuperPhy facilitates the upload of genomes for both private and public use. Analyses include: 1) genomic comparisons of clinical isolates, and identification of virulence and antimicrobial resistance genes in silico; 2) associations between specific genotypes and phenotypic meta-data (e.g., geospatial distribution, host, source); 3) identification of group-specific genome markers (presence/absence of specific genomic regions, and single-nucleotide polymorphisms) in bacterial populations; 4) the ability to manipulate the display of phylogenetic trees; 5) identification of statistically significant clade-specific markers. The SuperPhy pilot database currently contains genome sequences for 1063 Escherichia coli strains. Future work will extend SuperPhy to include multiple pathogens.
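To make the statistical step concrete, the sketch below computes a one-sided Fisher's exact test for a 2x2 table of locus presence versus group membership using only the standard library. It is not SuperPhy code; the function name, the one-sided (enrichment) alternative and the toy counts are assumptions made for illustration.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Rows: in group / not in group. Columns: locus present / locus absent.
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b          # genomes in the group
    col1 = a + c          # genomes carrying the locus
    n = a + b + c + d     # all genomes
    total = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x)
               for x in range(a, upper + 1))
    return tail / total

# Toy table: the locus is present in all 3 group genomes and in none of the other 3.
p_enriched = fisher_exact_greater(3, 0, 0, 3)
# Toy table with the opposite pattern: no evidence of enrichment in the group.
p_depleted = fisher_exact_greater(0, 3, 3, 0)
```

When many loci are tested, the resulting p-values would normally need a multiple-testing correction, which is outside this sketch.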

Paper Nr: 35
Title:

A Computational Study to Identify TP53 and SREBF2 as Regulation Mediators of miR-214 in Melanoma Progression

Authors:

Gianfranco Politano, Alfredo Benso, Stefano Di Carlo, Francesca Orso, Alessandro Savino and Daniela Taverna

Abstract: In the complex world of post-transcriptional regulation, miR-214 is known to control in vitro tumor cell movement and survival to anoikis, as well as in vivo malignant cell extravasation from blood vessels and lung metastasis formation. miR-214 has also been found to be highly expressed in human melanomas, and to directly and indirectly regulate several genes involved in tumor progression and in the establishment of distant metastases (Penna et al., 2011). In this work, we exploit a computational pipeline integrating data from multiple online data repositories to identify the presence of transcriptional or post-transcriptional regulatory modules involving miR-214 and a set of 73 previously identified miR-214 regulated genes. We identified 27 putative regulatory modules involving miR-214, NFKB1, SREBF2, miR-33a and 9 out of the 73 miR-214 modulated genes (ALCAM, POSTN, TFAP2A, ADAM9, NCAM1, SEMA3A, PVRL2, JAG1, EGFR1). As a preliminary experimental validation we focused on 9 out of the 27 identified regulatory modules that involve two main players, miR-33a and SREBF2. The results confirm the importance of the predictions obtained with the presented computational approach.

Paper Nr: 47
Title:

Empirical Study of Domain Adaptation with Naïve Bayes on the Task of Splice Site Prediction

Authors:

Nic Herndon and Doina Caragea

Abstract: For many machine learning problems, training an accurate classifier in a supervised setting requires a substantial volume of labeled data. While large volumes of labeled data are currently available for some of these problems, little or no labeled data exists for others. Manually labeling data can be costly and time consuming. An alternative is to learn classifiers in a domain adaptation setting in which existing labeled data can be leveraged from a related problem, referred to as source domain, in conjunction with a small amount of labeled data and a large amount of unlabeled data for the problem of interest, or target domain. In this paper, we propose two similar domain adaptation classifiers based on a naïve Bayes algorithm. We evaluate these classifiers on the difficult task of splice site prediction, essential for gene prediction. Results show that the algorithms correctly classified instances, with highest average area under precision-recall curve (auPRC) values between 18.46% and 78.01%.

Paper Nr: 50
Title:

A Novel Feature Generation Method for Sequence Classification - Mutated Subsequence Generation

Authors:

Hao Wan, Carolina Ruiz and Joseph Beck

Abstract: In this paper, we present a new feature generation algorithm for sequence data sets called Mutated Subsequence Generation (MSG). Given a data set of sequences, the MSG algorithm generates features from these sequences by incorporating mutative positions in subsequences. We compare this algorithm with other sequence-based feature generation algorithms, including position-based, k-grams, and k-gapped pairs. Our experiments show that the MSG algorithm outperforms these other algorithms in domains in which presence, not specific location, of sequential patterns discriminate among classes in a data set.

Paper Nr: 51
Title:

Uneven Distribution of Potential Triplex Sequences in the Human Genome - In Silico Study using the R/Bioconductor Package Triplex

Authors:

Matej Lexa, Tomáš Martínek and Marie Brázdová

Abstract: Eukaryotic genomes are rich in sequences capable of forming non-B DNA structures. These structures are expected to play important roles in natural regulatory processes at levels above those of individual genes, such as whole genome dynamics or chromatin organization, as well as in processes leading to the loss of these functions, such as cancer development. Recently, a number of authors have mapped the occurrence of potential quadruplex sequences in the human genome and found them to be associated with promoters. In this paper, we set out to map the distribution and characteristics of potential triplex-forming sequences (PTS) in the human genome sequence. Using the R/Bioconductor package triplex, we found these sequences to be excluded from exons, while present mostly in a small number of repetitive sequence classes, especially short sequence tandem repeats (microsatellites), Alu and combined elements, such as SVA. We also introduce a novel way of classifying potential triplex sequences, using a lexicographically minimal rotation of the most frequent k-mer to assign class membership automatically. Members of such classes typically have different propensities to form parallel and antiparallel intramolecular triplexes (H-DNA). We observed an interesting pattern, where the predicted third strands of antiparallel H-DNA were much less likely to contain a deletion than their duplex structural counterpart than were their parallel versions.
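The classification idea mentioned in this abstract, taking the lexicographically minimal rotation of the most frequent k-mer as a class label, can be sketched in a few lines. This is an illustration rather than the triplex package's implementation; the function names, the default k and the tie-breaking rule are assumptions.

```python
from collections import Counter

def minimal_rotation(s):
    """Return the lexicographically smallest rotation of s."""
    return min(s[i:] + s[:i] for i in range(len(s)))

def classify_by_kmer(seq, k=3):
    """Label a sequence by the minimal rotation of its most frequent k-mer."""
    if len(seq) < k:
        raise ValueError("sequence shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Assumed tie-break: among equally frequent k-mers, take the smallest one.
    top = min(counts, key=lambda kmer: (-counts[kmer], kmer))
    return minimal_rotation(top)

# Two phase-shifted copies of the same repeat fall into the same class.
label_1 = classify_by_kmer("GAAGAAGAAGAA", k=3)
label_2 = classify_by_kmer("AGAAGAAGAAGA", k=3)
```

Using the minimal rotation means that a tandem repeat receives the same label regardless of the phase at which the repeat unit happens to start.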

Short Papers
Paper Nr: 12
Title:

Search of Possible Insertions in Bacterial Genes

Authors:

Eugene Korotkov, Yulia Suvorova and Maria Korotkova

Abstract: It is known that nucleotide sequences are not homogeneous and from this heterogeneity arises the task of segmentation of a sequence into a set of homogeneous parts by the points called change points. In the work we investigated a special case of change points in genes – paired change points (PCP). We used a well-known property of coding sequences – triplet periodicity. The sequence that we are especially interested in consists of three successive parts: the first and the last parts have similar triplet periodicity (TP) and the middle part is of another TP type. We aimed to find genes with PCP and provide explanation for the phenomenon. We developed a mathematical method for PCP detection based on new measure of similarity between TP matrixes. Among 66936 studied genes we found 2700 genes with PCP and 6459 genes with single change point (SCP). We suppose that PCP could be associated with double fusion or insertion events.
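Triplet periodicity is commonly summarised as a 4x3 table counting each nucleotide at each of the three codon positions. The sketch below builds such a table for a sequence read in frame from its first base; the function name is hypothetical, and the authors' new similarity measure between TP matrices is not reproduced here.

```python
def tp_matrix(seq):
    """Count each nucleotide at codon positions 0, 1 and 2.

    Returns a dict base -> [count at pos 0, count at pos 1, count at pos 2].
    Characters other than A, C, G, T are ignored.
    """
    counts = {base: [0, 0, 0] for base in "ACGT"}
    for i, base in enumerate(seq):
        if base in counts:
            counts[base][i % 3] += 1
    return counts

# A perfectly periodic toy sequence: every base sits at a single codon position.
matrix = tp_matrix("ATGATGATG")
```

Comparing such matrices computed on successive windows of a gene is one way a change in periodicity type along the sequence could be detected.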

Paper Nr: 37
Title:

Identifying Sub-Network Functional Modules in Protein Undirected Networks

Authors:

Massimo Natale, Alfredo Benso, Stefano Di Carlo and Elisa Ficarra

Abstract: Protein networks are usually used to describe the interacting behaviours of complex biosystems. Bioinformatics must be able to provide methods to mine protein undirected networks and to infer subnetworks of interacting proteins for identifying relevant biological pathways. Here we present FunMod, an innovative Cytoscape version 2.8 plugin able to identify biologically significant sub-networks within informative protein networks, enabling new opportunities for elucidating pathways involved in diseases. Moreover, FunMod calculates three topological coefficients for each subnetwork, for a better understanding of the cooperative interactions between proteins and for discriminating the role played by each protein within a functional module. FunMod is the first Cytoscape plugin able to combine pathway and topological analysis, allowing the identification of the key proteins within sub-network functional modules.

Paper Nr: 40
Title:

Methods for Quality Control of Low-resolution MALDI-ToF Spectra

Authors:

Michal Marczyk and Joanna Polanska

Abstract: Protein profiling of human blood serum or plasma using MALDI-ToF mass spectrometry may be used for identification of candidate disease biomarkers. Due to the many biological and technical difficulties emerging during sample preparation and spectra measurement, a quality control step is becoming important. In this study we compared different methods for finding low quality spectra based on the Pearson correlation coefficient and proposed two novel solutions. The first utilizes information about the area under the measured spectrum and the other incorporates modeling of the signal-to-noise ratio of spectra intensity by a mixture of Gaussians. The results show that removing outlying samples increases the similarity of spectra obtained under the same experimental conditions. More importantly, it increases the reproducibility of peak detection by decreasing the coefficient of variation of peak intensities within a group and increasing their prevalence. This work shows that appropriate identification and removal of low quality spectra is a necessary step in the analysis of mass spectrometry data and that the proposed tools are appropriate for quality control of MALDI-ToF data.

Paper Nr: 41
Title:

Application of RotaSVM for HLA Class II Protein-Peptide Interaction Prediction

Authors:

Shib Sankar Bhowmick, Indrajit Saha, Giovanni Mazzocco, Ujjwal Maulik, Luis Rato, Debotosh Bhattacharjee and Dariusz Plewczynski

Abstract: In this article, the recently developed RotaSVM is used for accurate prediction of binding peptides to Human Leukocyte Antigens class II (HLA class II) proteins. The HLA II - peptide complexes are generated in the antigen presenting cells (APC) and transported to the cell membrane to elicit an immune response via T-cell activation. The understanding of HLA class II protein-peptide binding interaction facilitates the design of peptide-based vaccine, where the high rate of polymorphisms in HLA class II molecules poses a big challenge. To determine the binding activity of 636 non-redundant peptides, a set of 27 HLA class II proteins are considered in the present study. The prediction of HLA class II - peptide binding is carried out by an ensemble classifier called RotaSVM. In RotaSVM, the feature selection scheme generates bootstrap samples that are further used to create a diverse set of features using Principal Component Analysis. Thereafter, Support Vector Machines are trained with these bootstrap samples with the integration of their original feature values. The effectiveness of the RotaSVM for HLA class II protein-peptide binding prediction is demonstrated in comparison with other traditional classifiers by evaluating several validity measures with the visual plot of ROC curves. Finally, Friedman test is conducted to judge the statistical significance of RotaSVM in prediction of peptides binding to HLA class II proteins.

Paper Nr: 46
Title:

Generating Features using Burrows Wheeler Transformation for Biological Sequence Classification

Authors:

Karthik Tangirala and Doina Caragea

Abstract: Recent advancements in biological sciences have resulted in the availability of large amounts of sequence data (both DNA and protein sequences). The annotation of biological sequence data can be approached using machine learning techniques. Such techniques require that the input data is represented as a vector of features. In the absence of biologically known features, a common approach is to generate k-mers using a sliding window. A larger k value usually results in better features; however, the number of k-mer features is exponential in k, and many of the k-mers are not informative. Feature selection techniques can be used to identify the most informative features, but are computationally expensive when used over the set of all k-mers, especially over the space of variable length k-mers (which presumably capture better the information in the data). Instead of working with all k-mers, we propose to generate features using an approach based on the Burrows Wheeler Transformation (BWT). Our approach generates variable length k-mers that represent a small subset of k-mers. Experimental results on both DNA (alternative splicing prediction) and protein (protein localization) sequences show that the BWT features combined with feature selection result in models which are better than models learned directly from k-mers. This shows that the BWT-based approach to feature generation can be used to obtain informative variable length features for DNA and protein prediction problems.
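The two building blocks named in this abstract, sliding-window k-mers and the Burrows Wheeler Transformation, can be sketched as below. The abstract does not say how features are derived from the transformed string, so only the textbook transform is shown; the function names and the "$" terminator are assumptions, and the naive sorted-rotations construction is chosen for clarity rather than speed.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq using a sliding window."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (quadratic, for illustration).

    The terminator must not occur in text and must sort before every
    character of the alphabet ("$" does for upper- and lower-case letters).
    """
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

windowed = kmers("ACGTA", 3)
transformed = bwt("banana")
```

The transform tends to group identical characters that share a following context, which is why repeated substrings become easy to spot in its output.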

Paper Nr: 49
Title:

Modelling of Genetic Interactions in GWAS Reveals More Complex Relations between Genotype and Phenotype

Authors:

Joanna Zyla, Christophe Badie, Ghazi Alsbeih and Joanna Polanska

Abstract: The aim of this work is to present the complete methodology useful in GWAS analysis with small sample size, where comprehension of interaction between the genotype and phenotype is a main issue. By including all possible models of interaction into the process of model building, we were able to significantly increase the number of candidate polymorphisms and decrease the false discovery ratio.

Paper Nr: 55
Title:

Control of the p53 Protein - mdm2 Inhibitor System using Nonlinear Kalman Filtering

Authors:

Gerasimos G. Rigatos and Efthymia G. Rigatou

Abstract: A nonlinear feedback control scheme for the p53 protein - mdm2 inhibitor system is developed with the use of differential flatness theory and of nonlinear Kalman Filtering. It is shown that by applying differential flatness theory the protein synthesis model can be transformed into the canonical form. This enables the design of a feedback control law that maintains the concentration of the p53 protein at the desirable levels. To estimate the non-measurable elements of the state vector describing the p53-mdm2 system dynamics and to compensate for modeling uncertainties and external disturbances that affect the p53-mdm2 system, the nonlinear Kalman Filter is re-designed as a disturbance observer. The proposed nonlinear feedback control and perturbations compensation method for the p53-mdm2 system can result in more efficient chemotherapy schemes where the infusion of medication will be better administered.

Paper Nr: 60
Title:

Mathematics of the Design of a Parallel Mapping Assembly Algorithm - Combining Smith-Waterman and Hirschberg’s LCS Methods

Authors:

Jaime Seguel

Abstract: This paper focuses on mathematical definitions and results that prove the correctness of a parallel algorithm for mapping assembly. The mathematical concepts and facts discussed here establish the reach and limitations of a combination of Smith-Waterman local alignment method and Hirschberg’s divide-and-conquer longest common subsequence determination method. The parallel algorithm, whose correctness is proved, is a general method that works best for solving the problem of the local alignment of a short and a very large sequence, such as an entire genome. The method is thus, suitable for mapping assembly, where millions of short sequence segments, the so-called reads, are aligned with a whole genome.
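For orientation, the Smith-Waterman recurrence referred to in this abstract can be written so that only two rows of the dynamic programming table are kept at a time, which is the same memory-saving observation that Hirschberg's method builds on. The sketch below computes only the best local alignment score; it is not the paper's parallel algorithm, and the scoring values and function name are assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b with a linear gap penalty.

    Keeps only the previous and current DP rows, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores are floored at zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# A short read that occurs exactly inside a longer sequence.
exact_hit = smith_waterman_score("ACGT", "TTACGTTT")
# Sequences with no character in common cannot score above zero.
no_hit = smith_waterman_score("AAAA", "TTTT")
```

Recovering the alignment itself, rather than just its score, is where a divide-and-conquer scheme in the style of Hirschberg would come in.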

Paper Nr: 62
Title:

Genome Mapping by a 60-core Processor

Authors:

Tomohiro Yasuda and Asako Koike

Abstract: Next-generation sequencing (NGS) has drastically changed research based on DNA sequencing with its high throughput and low costs. Mapping sequences generated by NGS onto reference genomes is an indispensable step to find useful knowledge for biological research or clinical applications. To accelerate genome mapping by using a new many-core processor, Xeon Phi, two major mapping programs, BWA and Bowtie2, were ported to Xeon Phi in this study. Although vector operations of Xeon Phi are not compatible with those of x86 processors, these incompatibilities were successfully circumvented. In a computational experiment where the ported programs were evaluated, the performances of the ported BWA and Bowtie2 peaked when 120 and 60 threads were used, respectively. These results imply that performances of BWA and Bowtie2 can be improved by using tens of processing cores.

Paper Nr: 64
Title:

Fast and Accurate cDNA Mapping and Splice Site Identification

Authors:

Michaël Vyverman, Dieter De Smedt, Yao-Cheng Lin, Lieven Sterck, Bernard De Baets, Veerle Fack and Peter Dawyndt

Abstract: Mapping and alignment of cDNA sequences containing splice sites is an algorithmically and computationally challenging task. Most recently developed spliced aligners are designed for mapping short reads and sacrifice sensitivity for increased performance. We present mesalina, a highly accurate spliced aligner, that can also be used to detect novel non-canonical splice sites and whose performance is more robust with respect to increasing read length. Mesalina utilizes the seed-extend strategy, combining fast retrieval of maximal exact matches with a sensitive sandwich dynamic programming algorithm. Preliminary results indicate that mesalina is accurate and very fast, especially for mapping longer reads. In particular, it is more than ten times faster than mappers with a comparable accuracy. Mesalina is available from https://github.ugent.be/ComputationalBiology/mesalina.

Paper Nr: 65
Title:

ProRank+ - A Method for Detecting Protein Complexes in Protein Interaction Networks

Authors:

Eileen Marie Hanna and Nazar Zaki

Abstract: The course of developing effective medical treatments is typically based on the identification of disease-triggering protein complexes. In this paper, we present ProRank+, an effective method for detecting protein complexes in protein interaction networks. By assuming that complexes may overlap, the method uses a ranking algorithm to order proteins based on their importance in the network. In addition, a novel merging procedure is introduced to refine the predicted complexes in terms of their members. The experimental studies and results showed that ProRank+ outperforms several state-of-the-art methods in terms of the number of correctly-detected protein complexes using numerous quality measures.

Paper Nr: 69
Title:

The Possibilities of Filtering Pairs of SNPs in GWAS Studies - Exploratory Study on Public Protein-interaction and Pathway Data

Authors:

Matej Lexa and Stanislav Stefanic

Abstract: Genome-wide association studies have become a standard way of discovering novel causative alleles by looking for statistically significant associations in patient genotyping data. The present challenge for these methods is to discover associations involving multiple interacting loci, a common phenomenon in diseases often related to epistasis. The main problem is the exponential increase in necessary computational power for every additional interacting locus considered in association tests. Several approaches have been proposed to manage this problem, including limiting analysis to interacting pairs and filtering SNPs according to external biological knowledge. Here we explore the possibilities of using public protein interaction data and pathway maps to filter out only pairs of SNPs that are likely to interact, perhaps because of epistatic mechanisms working at the protein level. After filtering all possible pairs of SNPs by their presence in common protein-protein interactions or proteins sharing a metabolic or signalling pathway, we calculate the possible reduction in computational requirements under different scenarios. We discuss these exploratory results in the context of the so-called “lost heredity” and the usefulness of this approach for similar scenarios.

Paper Nr: 70
Title:

Rational Identification of Prognostic Markers of Breast Cancer

Authors:

Maysson Al-Haj Ibrahim, Joanne L Selway, Kian Chin, Sabah Jassim, Michael A. Cawthorne and Kenneth Langlands

Abstract: Accurate prognostication is central to the management of breast cancer, and traditional clinical and histochemical-based assessments are increasingly augmented by genetic tests. In particular, the use of microarray data has allowed the creation of molecular disease signatures for the early identification of individuals at elevated risk of relapse. However, tailoring therapy on the basis of a molecular assay is only recommended in certain cases, and the identification of a minimal set of genes whose expression allows informed decision-making in a broader spectrum of disease remains challenging. Finding an optimal solution is, however, an intractable computational task (i.e. retrieving the smallest group of genes with the greatest prognostic power). Our solution was to reduce the genetic search-space by using two filtering steps that enriched by biological function those genes whose expression discriminated disease states. In this way, we were able to identify a new molecular signature, the expression characteristics of which facilitated the classification of intermediate risk disease. We went on to create a statistical test that confirmed the relevance of our approach by comparing the performance of our signature to that of 1000 random signatures.

Posters
Paper Nr: 4
Title:

Bayesian Prognostic Model for Genomic Discovery in Bipolar Disorder

Authors:

Swetha S. Bobba, Amin Zollanvari and Gil Alterovitz

Abstract: Integrative approaches that incorporate multiple experiments have shown a potential application in the discovery of disease-related attributes. This study presents a unique, data-driven, integrative, Bayesian approach to merge gene expression data from various experiments into prognostic models and evaluate them for the discovery of bipolar-related attributes. Two prognostic models were constructed: a singly-structured Bayesian and a Bayesian multi-net model, which differentiated Bipolar disease state at a higher level of abstraction. These prognostic models were evaluated to find the most common attributes responsible for the disease and their AUROC, using external cross-validation. The multi-net model achieved an AUROC of 0.907, significantly outperforming the single-structured model with an AUROC of 0.631. The study found six new genes and five chromosomal regions associated with the bipolar state. Enrichment analysis performed in this study revealed biological concepts and proteins responsible for the disease. We anticipate this method and results will be used in the future to integrate information from multiple experiments for the same or related phenotypes of various diseases and also to predict the disease state earlier.

Paper Nr: 18
Title:

Statistical Identification of Co-regulatory Gene Modules using Multiple ChIP-Seq Experiments

Authors:

Xi Chen, Xu Shi, Ayesha N. Shajahan-Haq, Leena Hilakivi-Clarke, Robert Clarke and Jianhua Xuan

Abstract: ChIP-Seq experiments provide accurate measurements of the regulatory roles of transcription factors (TFs) under specific conditions. Downstream target genes can be detected by analyzing the enriched TF binding sites (TFBSs) in genes’ promoter regions. The location and statistical information of TFBSs make it possible to evaluate the relative importance of each binding. Based on the assumption that the TFBSs of one ChIP-Seq experiment follow the same specific location distribution, a statistical model is first proposed using both location and significance information of peaks to weigh target genes. With genes’ binding scores from different TFs, we merge them into a weighted binding matrix. A Markov Chain Monte Carlo (MCMC) based approach is then applied to the binding matrix for co-regulatory module identification. We demonstrate the efficiency of our statistical model on an ER-α ChIP-Seq dataset and further identify co-regulatory modules by using eleven breast cancer related TFs from ENCODE ChIP-Seq datasets. The results show that the TFs in individual modules regulate common high score target genes; the association of TFs is biologically meaningful, and the functional roles of TFs and target genes are consistent.

Paper Nr: 21
Title:

Towards a Large Integrated Model of Signal Transduction and Gene Regulation Events in Mammalian Cells

Authors:

Liam G. Fearnley, Mark A. Ragan and Lars K. Nielsen

Abstract: Recent work has generated whole-cell and whole-process models capable of predicting phenotype in simple organisms. The approaches used are hindered in higher organisms and more-complex cells by a lack of kinetic parameters for reactions and events, and the difficulty of measuring and estimating these. Here, we outline a large, two-process model capable of predicting the effects of gene expression on a signal transduction network. Our method models signal transduction and the processes involved in gene expression as two separate systems, solved iteratively. We show that this approach is sufficient to capture functionally significant behaviour resulting from common network motifs. We further demonstrate that our method is scalable and efficient to the size of the largest signal transduction databases currently available. This approach enables analysis and prediction in the absence of kinetic data, but is itself held back by the lack of detailed large-scale gene expression models. However, research consortia such as ENCODE and FANTOM are rapidly adding to the knowledge of transcriptional regulation, and we anticipate that incorporating this data into our regulatory model could allow the modelling of complex cellular phenomena such as the structured progression seen in cellular differentiation.

Paper Nr: 27
Title:

Impact of Single Amino Acid Substitution Upon Protein Structure

Authors:

Mark Livingstone, Lukas Folkman and Bela Stantic

Abstract: In the biological sciences, one of the most fundamental operations is that of comparison. As we strive to further understand the constituent parts of living tissue, we need to examine proteins and their many mutations. Indeed, characterising mutations is an important part of proteomics, because a seemingly trivial mutation can sometimes stand between creating a life-saving drug on one hand, or blocking a vital receptor, inactivating that same drug, on the other. In this work we examined single point mutations to characterise their effects on outwardly expanding neighbourhood ranges. As the shape of a protein is very important, we examined how mutations can make subtle changes to the protein shape and investigated the implications for both backbone and side-chain residues. Our findings suggest that structural changes upon a mutation are significantly influenced by the protein shape, which allows the impact brought about by the mutation to be predicted by looking only at the protein shape. Surprisingly, we found that there was very little variation between wild type and mutant protein structures close to the mutation site. Also, in contrast with what was expected, the largest structural variations were found when deleted and introduced residues had similar hydrophobicity.

Paper Nr: 29
Title:

Summarizing Genome-wide Phased Genotypes using Phased PC Plots

Authors:

Sergio Torres-Sánchez, Nuria Medina-Medina and María M. Abad-Grau

Abstract: Ordination methods in reduced space, such as principal component (PC) analysis, and their visual representation in PC plots may help to uncover important patterns among samples in high-dimensional data sets. When used with data sets obtained from genome-wide genotyping, they may show biologically relevant relationships among populations, such as population structure and admixture. Extending the PC analysis to genome-wide phased genotypes may help to reveal different levels of inbreeding between or within populations as well as to evaluate the quality of the haplotyping technique used. We have developed a method to perform PC analysis on a data set of genome-wide phased genotypes and to plot the results while keeping information about individuals. The method has been implemented in the computer program PCPhaser. To increase the method's applicability and reduce development time, PCPhaser implements the method by transforming the input data set, segregating haplotypes, and using the EIGENSOFT software to perform PC analysis. Given this transformation, the proposed method can be applied through any other software able to perform PCA, although PCPhaser will still be required to draw the phased PC plots. PCPhaser is Linux-based software that can be downloaded from http://bios.ugr.es/PCPhaser.
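
The haplotype segregation step described in this abstract can be illustrated with a minimal sketch. It is not PCPhaser itself: the input layout (an array of shape individuals x SNPs x 2 with 0/1 alleles), the function names, and the use of a plain NumPy SVD in place of EIGENSOFT are all assumptions made for illustration.

```python
import numpy as np

def segregate_haplotypes(phased):
    """Split phased genotypes into one row per haplotype.

    phased: array of shape (n_individuals, n_snps, 2) holding 0/1 alleles
    (hypothetical input layout). Returns a (2 * n_individuals, n_snps)
    matrix plus (individual, haplotype) labels so that each row can still
    be traced back to its individual when plotting.
    """
    n, m, _ = phased.shape
    haplotypes = phased.transpose(0, 2, 1).reshape(2 * n, m)
    labels = [(i, h) for i in range(n) for h in range(2)]
    return haplotypes, labels

def pca_scores(matrix, n_components=2):
    """Project rows onto the leading principal components via SVD."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy data: 3 individuals, 4 SNPs, 2 haplotypes each.
phased = np.array([
    [[0, 1], [1, 1], [0, 0], [1, 0]],
    [[1, 1], [0, 1], [0, 1], [0, 0]],
    [[0, 0], [1, 0], [1, 1], [1, 1]],
])
haplotypes, labels = segregate_haplotypes(phased)
scores = pca_scores(haplotypes)
```

Each individual contributes two points to the resulting plot, and the labels make it possible to join or colour the pair, which is the information a phased PC plot is meant to keep.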

Paper Nr: 31
Title:

Validation of Numerical Simulation for Subdural Cortical Stimulation - Using Spherical Phantoms and Anatomically Realistic Head Phantom

Authors:

Jinmo Jeong, Donghyeon Kim, Sangdo Jeong, Euiheon Chung, Sung Chan Jun, Jonghyun Lee and Sohee Kim

Abstract: The purpose of this study is to investigate the accuracy of numerical simulation for electric brain stimulation. For this, we modelled brains using simple computational models with 2 and 3 shells, with and without realistic head geometry, and performed numerical simulations using the finite element method (FEM). The corresponding head phantoms were constructed for the validation of simulation results. We implanted stimulation electrodes in the head phantom, and measured the electric potential induced by the electrodes. When comparing the electric potential obtained from numerical simulations and phantom experiments, both results showed similar trends and amplitudes, with a relative difference of 13.64% on average in the realistic head model study. This result demonstrates that predicting the electric potential and its gradient (current density) using computational simulation is reliable, with a reasonably small deviation from the actual measurement.
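
An average relative difference of this kind can be computed as in the sketch below. The abstract does not state the exact formula, so the definition used here (absolute difference divided by the measured value, averaged over measurement points) is an assumption, as are the function name and the toy values.

```python
def mean_relative_difference(simulated, measured):
    """Average of |simulated - measured| / |measured|, in percent.

    Assumed definition; the measured (phantom) value is taken as the
    reference, so it must be non-zero at every point.
    """
    if len(simulated) != len(measured):
        raise ValueError("both series must have the same length")
    diffs = [abs(s - m) / abs(m) for s, m in zip(simulated, measured)]
    return 100.0 * sum(diffs) / len(diffs)

# Toy potentials (arbitrary units) at two measurement points.
result = mean_relative_difference([1.1, 1.8], [1.0, 2.0])
```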

Paper Nr: 34
Title:

A Novel Pipeline for Identification and Prioritization of Gene Fusions in Patient-derived Xenografts of Metastatic Colorectal Cancer

Authors:

Paciello Giulia, Andrea Acquaviva, Consalvo Petti, Claudio Isella, Enzo Medico and Elisa Ficarra

Abstract: Metastatic spread to the liver is a frequent complication of colorectal cancer (CRC), occurring in almost half of the cases, for which personalized treatment strategies are highly desirable. To this end, it has been shown that patient-derived mouse xenografts (PDX) of liver-metastatic CRC can be used to discover new therapeutic targets and determinants of drug resistance. To identify gene fusions in RNA-Seq data obtained from such PDX samples, we propose a novel pipeline that tackles the following issues: (i) discriminating human from murine RNA, to filter out transcripts contributed by the mouse stroma that supports the PDX; (ii) increasing sensitivity in case of suboptimal RNA-Seq coverage; (iii) prioritizing the detected chimeric transcripts by molecular features of the fusion and by functional relevance of the involved genes; (iv) providing appropriate sequence information for subsequent validation of the identified fusions. The pipeline, built on top of the Chimerascan (R. Iyer, 2011) and deFuse (McPherson, 2011) aligner tools, was successfully applied to RNA-Seq data from 11 PDX samples. Among the 299 fusion genes identified by these tools, five passed all the filtering stages implemented in the proposed pipeline and were selected as biologically relevant fusions. Three of them were experimentally confirmed.

Paper Nr: 38
Title:

BioMetaDB: Ontology-based Classification and Extension of Biodatabases

Authors:

Ching-Fen Chang, Chang-Hsien Lin and Chuan-Hsiung Chang

Abstract: The recent rapid increase in high-throughput biological data and computational tools has facilitated the establishment of numerous biodatabases as repositories of biodata and bioinformatics analysis tools. Because database categorization is inefficient, searching for all available information on a research interest costs researchers a lot of time and effort. We have established BioMetaDB so that users can systematically identify all the available databases of interest to them and extend databases with relevant biomedical content. To provide a semantically annotated corpus for marking up instances of biomedical ontologies, BioMetaDB comprises three main tasks: (1) retrieving biological information from public databases; (2) creating an integrated ontology repository for biological and medical studies based on an expert-tagged corpus; (3) establishing web services that enable users to access all their desired databases through systematic ontology queries. Based on biomedical ontologies, we indexed all the databases by their relevant biological features, and further evaluated the relevance among the databases. BioMetaDB, a comprehensive compendium of biological databases, currently integrates over 1,500 digital sources.

Paper Nr: 39
Title:

Modeling the Serial Position Effect - Using the Emergent Neural Network Simulation System

Authors:

Katherine Goodman and John K. Bennett

Abstract: The Serial Position Effect (SPE) is a well-studied phenomenon in experimental psychology. SPE captures the idea that, when subjects are asked to recall list items, they are more likely to remember the first items and the last items, whether those items are numbers, non-words or elements of a story. Until recently, SPE has been generally considered to rely upon a two-store memory model, i.e., primacy (remembering initial items) and recency (remembering later items) were thought to be the work of long-term memory and short-term memory, respectively. This paper reports the results of a basic hippocampus simulation study using the Leabra algorithm within the Emergent Neural Network Simulation System to model the SPE. Simulation results demonstrate that both primacy and recency of the SPE in a serial recall task can be replicated using only the hippocampus, suggesting that a one-store model of memory for this recall task is sufficient. It remains to be seen if this simulation mirrors the actual biological mechanism utilized.

Paper Nr: 44
Title:

On the Robustness of the Biological Correlation Network Model

Authors:

Kathryn M. Dempsey and Hesham H. Ali

Abstract: Recent progress in high-throughput technology has resulted in a significant data overload. Determining how to obtain valuable knowledge from such massive raw data has become one of the most challenging issues in biomedical research. As a result, bioinformatics researchers continue to look for advanced data analysis tools to analyze and mine the available data. Correlation network models obtained from various biological assays, such as those measuring gene expression levels, are a powerful method for representing correlated expression. Although correlation does not always imply causation, the correlation network has been shown to be effective in identifying elements of interest in various bioinformatics applications. While these models have found success, little to no investigation has been made into the robustness of relationships in the correlation network with regard to the vulnerability of the model to manipulation of sample values. In particular, reservations about the correlation network model stem from a lack of testing of the reliability of the model. In this work, we probe the robustness of the model by manipulating samples to create six different expression networks and find a slight inverse relationship between sample count and network size/density. When samples are iteratively removed during model creation, the results suggest that network edges may or may not remain within the statistical parameters of the model, suggesting that there is room for improvement in the filtering of these networks. A cursory investigation into a secondary robustness threshold using these measures confirms the existence of a positive relationship between sample size and edge robustness. This work represents an important step toward better understanding of the critical noise versus signal issue in the correlation network model.
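
The idea of testing edge robustness by removing samples can be sketched as follows. This is only an illustration under stated assumptions: the Pearson threshold of 0.9, the leave-one-sample-out scheme, and the function names are hypothetical and are not taken from the paper, which may filter edges and remove samples differently.

```python
import numpy as np

def correlation_edges(expr, threshold=0.9):
    """Edges (i, j), i < j, between genes whose |Pearson r| >= threshold.

    expr: array of shape (n_genes, n_samples). The threshold value is an
    assumption chosen for illustration.
    """
    r = np.corrcoef(expr)
    n = r.shape[0]
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(r[i, j]) >= threshold}

def edge_retention(expr, threshold=0.9):
    """Fraction of leave-one-sample-out networks that keep each edge."""
    full = correlation_edges(expr, threshold)
    n_samples = expr.shape[1]
    counts = {edge: 0 for edge in full}
    for s in range(n_samples):
        reduced = np.delete(expr, s, axis=1)  # drop one sample (column)
        kept = correlation_edges(reduced, threshold)
        for edge in full:
            if edge in kept:
                counts[edge] += 1
    return {edge: c / n_samples for edge, c in counts.items()}

# Toy expression matrix: genes 0 and 1 are perfectly correlated.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],
    [5.0, 1.0, 4.0, 2.0, 3.0],
])
retention = edge_retention(expr)
```

An edge with a retention close to 1 survives the removal of any single sample, whereas a low value flags an edge that depends on one or a few samples.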

Paper Nr: 58
Title:

De Novo Short Read Assembly Algorithm with Low Memory Usage

Authors:

Yuki Endo, Fubito Toyama, Chikafumi Chiba, Hiroshi Mori and Kenji Shoji

Abstract: Determining whole genome sequences of various species has many applications not only in biological systems, but also in medicine, pharmacy and agriculture. In recent years, the emergence of high-throughput next generation sequencing technologies has dramatically reduced the time and cost of whole genome sequencing. These new technologies provide ultrahigh throughput with lower unit data cost. However, the data are very short fragments of DNA. Thus, developing algorithms for merging these fragments is very important. Merging these fragments without reference data is called de novo assembly. Many algorithms for de novo assembly have been proposed in recent years. Velvet, one of these algorithms, is well known because it performs well in terms of memory and time consumption. However, its memory consumption increases dramatically when the number of input fragments is huge. Therefore, it is necessary to develop an algorithm with low memory usage. In this paper, we propose an algorithm for de novo assembly with lower memory usage. In our experiments using E. coli K-12 strain MG1655, the memory consumption of the proposed algorithm was one-third of that of Velvet.
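
To make the notion of merging short fragments concrete, the sketch below builds a small de Bruijn graph, the data structure on which Velvet is based, and walks an unambiguous path through it. It is a toy illustration, not the low-memory algorithm proposed in the paper; the function names and the example reads are assumptions.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while each node has one successor.

    Stops at a branch, a dead end, or when a node would be revisited.
    """
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig

# Two overlapping toy reads that share the 3-mer "GTT".
graph = de_bruijn_graph(["ACGTT", "GTTCA"], k=3)
contig = walk_unambiguous_path(graph, "AC")
```

Storing every k-mer of a large read set in such a graph is what drives memory consumption up, which is the cost the paper aims to reduce.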

Paper Nr: 66
Title:

Studies of Mutation Accumulation in Three Codon Positions using Monte Carlo Simulations and Metropolis-Hastings Algorithm

Authors:

Małgorzata Grabińska, Pawel Blazej and Paweł Mackiewicz

Abstract: Protein coding sequences are characterized by a specific nucleotide composition in the three codon positions as a result of mutational and selection pressures. To analyse the impact of mutations and of different transition/transversion ratios on the three codon positions in protein coding sequences, we elaborated a model of genome evolution based on Monte Carlo simulation. Selection was applied against stop translation codons, and a modified Metropolis-Hastings algorithm was used to maintain the typical nucleotide composition of particular codon positions. The simulations were performed on genomes consisting of bacterial gene sequences. We used a series of nucleotide substitution matrices assuming different transition/transversion ratios and a nucleotide stationary distribution characteristic of the real mutational pressure. The simulations showed an exponential decrease in the number of eliminated genomes with the growth of the transition/transversion ratio. The same trend was also observed for accepted mutations and, to a lesser extent, for rejected mutations. The third codon positions accepted many more mutations than they rejected because their composition is very similar to the mutational stationary distribution, whereas the first positions accumulated the smallest number of mutations and rejected the most as a result of strong selection on their nucleotide composition. The obtained results showed different responses of the three codon positions to mutational pressure, related to their characteristic nucleotide composition.

Paper Nr: 68
Title:

Grammatical Evolution Association Rule Mining to Detect Gene-Gene Interaction

Authors:

Aicha Boutorh and Ahmed Guessoum

Abstract: An important goal of human genetics is to identify DNA sequence variations that increase or decrease susceptibility to specific diseases. Complex interactions among genes and environmental factors are known to play a role in common human disease etiology. Methods for association rule mining (ARM) are highly successful, especially because they produce rules that are easily interpretable. This has made them widely used in various domains. During the different stages of the knowledge discovery process, several problems are faced. It turns out that the search characteristics of Evolutionary Algorithms make them well suited to solving this kind of problem. In this study, we introduce GEARM, a novel approach for discovering association rules using Grammatical Evolution. We present the approach and evaluate it on simulated data that represents epistasis models. We show that this method improves the performance of gene-gene interaction detection.

Paper Nr: 71
Title:

Flow Index based Characterization of Next Generation Sequencing Errors - Visualizing Pyrosequencing and Semiconductor Sequencing to Cope with Homopolymer Errors

Authors:

Peter Sarkozy, Márton Enyedi and Peter Antal

Abstract: We characterized the error sources of multiple resequencing measurements performed on the Ion Torrent Personal Genome Machines and the Roche 454 sequencing platforms. Homopolymer insertions and deletions are the most common error types for these platforms, and there are many underlying factors which define their occurrence patterns. In the paper we investigate the effect of flow order, specifically the difference in the average value of the flow values for each homopolymer run length, based on the position in the flow cycle.
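
The relationship between flow order, position in the flow cycle, and homopolymer run length can be illustrated with an idealised, noise-free flowgram. This is a sketch under assumptions: the cyclic flow order "TACG", the function name, and the output layout are chosen for illustration, and real flow values are noisy signals rather than exact integers.

```python
def ideal_flowgram(sequence, flow_order="TACG", max_flows=400):
    """Return (flow index, position in cycle, base, run length) tuples.

    Each flow offers one nucleotide; the ideal signal equals the length
    of the homopolymer run incorporated in that flow (0 if none).
    The flow order and the max_flows cap are assumptions.
    """
    flows = []
    pos = 0
    flow_index = 0
    while pos < len(sequence) and flow_index < max_flows:
        cycle_pos = flow_index % len(flow_order)
        base = flow_order[cycle_pos]
        run = 0
        while pos < len(sequence) and sequence[pos] == base:
            run += 1
            pos += 1
        flows.append((flow_index, cycle_pos, base, run))
        flow_index += 1
    return flows

flowgram = ideal_flowgram("TTACCCG")
```

Grouping measured flow values by the last element (run length) and the second element (position in the cycle) gives the kind of per-position averages the abstract refers to.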

Paper Nr: 72
Title:

MicroRNA Prioritization based on Target Profile Similarities

Authors:

Péter Marx, Bence Bolgár, András Gézsi, Attila Gulyás-Kovács and Péter Antal

Abstract: microRNAs form a complex regulatory network with thousands of target genes. This network is known to suffer specific, but largely elusive, genetic perturbations in various types of disease. Accurate prioritization of microRNAs for each disease type would elucidate those perturbations and so facilitate therapeutic and diagnostic design. The multiple target profiles of microRNAs stemming from various experimental and in silico methods allow the definition of a wide range of similarities over microRNAs, but the combined use of these heterogeneous similarities has not been exploited in the gene prioritization approach. Using microRNAs as bases, prioritization with a disease-specific query set of microRNAs is straightforward once microRNA-microRNA similarity matrices have been derived. Here we demonstrate the application of a one-class version of the multiple kernel learning framework in order to fuse heterogeneous characteristics of microRNAs. We evaluate the method with breast cancer-specific queries, illustrate its technological aspects, and validate our results not only by standard leave-one-out cross validation, but also with a prospective evaluation.