BIOINFORMATICS 2013 Abstracts


Full Papers
Paper Nr: 9
Title:

2D-PAGE Texture Classification using Support Vector Machines and Genetic Algorithms - An Hybrid Approach for Texture Image Analysis

Authors:

Carlos Fernandez-Lozano, Jose A. Seoane, Pablo Mesejo, Youssef S. G. Nashed, Stefano Cagnoni and Julian Dorado

Abstract: In this paper, a novel method for texture classification of two-dimensional electrophoresis gel images is presented. The method makes use of textural features that are reduced to a more compact and efficient subset of characteristics by means of a Genetic Algorithm-based feature selection technique. The selected features are then used as inputs for a classifier, in this case a Support Vector Machine. The accuracy of the proposed method is around 94%, and it has been shown to yield statistically better performance than classification based on the entire feature set. We found that the most decisive and representative features for the textural classification of proteins are those related to the second order co-occurrence matrix. This classification step can be very useful for discarding over-segmented areas after a protein segmentation or identification process.

Paper Nr: 11
Title:

Qualitative Analysis of Gene Regulatory Networks using Network Motifs

Authors:

Sohei Ito, Takuma Ichinose, Masaya Shimakawa, Naoko Izumi, Shigeki Hagihara and Naoki Yonezaki

Abstract: We developed a method for analysing gene regulatory networks in a purely qualitative fashion. Behaviours of networks are captured as transition systems using propositions for gene states (ON or OFF), and those related to threshold values for gene activation/inhibition. Possible behaviours of networks are specified by logical formulae in Linear Temporal Logic (LTL). With this specification, it is possible to check whether some/all behaviours satisfy a biological property, which is difficult for quantitative analyses like an ordinary differential equation approach. Our method uses satisfiability checking of LTL. Due to the complexity of LTL satisfiability checking, analyses of large networks are generally intractable with this method. To tackle this issue, in this paper we propose an approximate analysis method in which we specify behaviours using simpler formulae which compress/expand the possible behaviours of networks. We present approximate specifications for some network patterns called network motifs.

Paper Nr: 28
Title:

The Distribution of Short Word Match Counts between Markovian Sequences

Authors:

Conrad J. Burden, Paul Leopardi and Sylvain Forêt

Abstract: The D2 statistic, which counts the number of word matches between two given sequences, has long been proposed as a measure of similarity for biological sequences. Much of the mathematically rigorous work carried out to date on the properties of the D2 statistic has been restricted to the case of ‘Bernoulli’ sequences composed of identically and independently distributed letters. Here the properties of the distribution of this statistic for the biologically more realistic case of Markovian sequences are studied. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulae for the mean and variance to be derived. The formulae are confirmed using numerical simulations, and asymptotic approximations to the full distribution are tested.
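
Illustrative sketch (not taken from the paper): the word-match count described above can be computed by tallying the k-letter words in each sequence and multiplying the counts of shared words. The function name `d2_statistic` is hypothetical, and this version treats the sequences as linear strings; it does not implement the periodic boundary conditions used in the paper.

```python
from collections import Counter


def d2_statistic(seq_a: str, seq_b: str, k: int) -> int:
    """Count k-word matches between two linear sequences.

    Every occurrence of a word in seq_a is matched against every
    occurrence of the same word in seq_b (overlapping words included).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Tally all overlapping words of length k in each sequence.
    words_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    words_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))
    # A word seen a times in A and b times in B contributes a * b matches.
    return sum(count * words_b[word] for word, count in words_a.items())


if __name__ == "__main__":
    print(d2_statistic("ACGT", "ACGT", 2))
```

Counting words first keeps the cost linear in the sequence lengths rather than comparing every pair of positions.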

Paper Nr: 57
Title:

Detecting Interacting Mutation Clusters in HIV-1 Drug Resistance

Authors:

Yu Zhang

Abstract: Understanding the genetic basis of HIV-1 drug resistance is essential for antiretroviral drug development. We analyzed drug resistant mutations in HIV-1 protease and reverse transcriptase under 18 drug treatments. The analysis is challenging because there is a large number of possible mutation combinations that may jointly affect drug resistance. The mutations are also strongly correlated, imposing inference difficulties such as multicollinearity. We applied a novel Bayesian algorithm to the drug resistance data. Our method efficiently identified clusters of mutations in HIV-1 protease and reverse transcriptase that are strongly and directly associated with drug resistance. In addition to marginal associations, we detected strong interactions among mutations at distant protein locations. Most identified protein positions are cross-resistant to several drugs of the same type. The effects of interactions are mostly negative, suggesting a threshold mechanism for the genetics underlying HIV drug resistance. Our method is among the first to produce detailed structures of marginal and interactive associations in HIV-1 drug resistance studies, and is generally suitable for detecting high-order interactions in large-scale datasets with complex dependencies.

Paper Nr: 66
Title:

Systematic Analysis of Structure of Multiple Tandem Repeat Arrays in the Human Genome

Authors:

Woo-Chan Kim and Dong-Ho Cho

Abstract: Repetitive elements constitute the vast majority of the human genome and form many complex but highly ordered patterns. Tandem repeats, whose repeat units are placed next to each other, form particularly highly structured patterns in the human genome when homologous multiple tandem repeats are close together. In this paper, the structure of the multiple tandem repeat array (MTRA) is analyzed systematically. The proposed system for analyzing MTRAs derives the original tandem repeat units by exploiting the homology within an MTRA and presents a diagram model that shows the structure of the MTRA clearly. The analysis results for four MTRAs in the human genome are shown, and the proposed algorithm is shown to be very efficient for analyzing MTRAs by comparison with three conventional algorithms.

Paper Nr: 68
Title:

Classification of HEp-2 Staining Patterns in ImmunoFluorescence Images - Comparison of Support Vector Machines and Subclass Discriminant Analysis Strategies

Authors:

Ihtesham Ul Islam, Santa Di Cataldo, Andrea Bottino, Elisa Ficarra and Enrico Macii

Abstract: The anti-nuclear antibody test is based on the visual evaluation of the intensity and staining pattern in HEp-2 cell slides by means of indirect immunofluorescence (IIF) imaging, revealing the presence of autoantibodies responsible for important immune pathologies. In particular, the categorization of the staining pattern is crucial for differential diagnosis, because it provides information about the autoantibody type. Manual classification of staining patterns is very time-consuming and not very reliable, since it depends on the subjectivity and experience of the specialist. This motivates the growing demand for computer-aided solutions able to perform staining pattern classification in a fully automated way. In this work we compare two classification techniques, based respectively on Support Vector Machines and Subclass Discriminant Analysis. A set of textural features characterizing the available samples is first extracted. Then, a feature selection scheme is applied in order to produce different datasets, containing a limited number of image attributes that are best suited to the classification purpose. Experiments on IIF images showed that our computer-aided method is able to identify staining patterns with an average accuracy of about 91% and demonstrated, in this specific problem, a better performance of Subclass Discriminant Analysis with respect to Support Vector Machines.

Paper Nr: 71
Title:

Naïve Bayes Domain Adaptation for Biological Sequences

Authors:

Nic Herndon and Doina Caragea

Abstract: The increased volume of biological data requires automatic computation tools to analyze it. Although machine learning methods have been successfully used with biological sequences in a supervised framework, their accuracy usually suffers when a classifier is learned on a source domain and applied to a different, less studied domain, in a domain adaptation framework. To address this issue, we propose to use an algorithm that combines labeled sequences from a well studied organism, the source domain, with labeled and unlabeled sequences from a related, less studied organism, the target domain. Our experimental results show that this algorithm has high classifying accuracy on the target domain.

Paper Nr: 81
Title:

A Methodology for Optimizing the Cost Matrix in Cost Sensitive Learning Models applied to Prediction of Molecular Functions in Embryophyta Plants

Authors:

S. García-López, J. A. Jaramillo-Garzón, L. Duque-Muñoz and C. G. Castellanos-Domínguez

Abstract: Due to the large amount of data generated by genomics and proteomics research, computational methods have become an important support tool for its analysis. However, tools based on machine learning face several problems associated with the nature of the data, one of which is the class-imbalance problem. Several balancing techniques exist to improve prediction performance, such as boosting and resampling, but they have multiple weaknesses in difficult data spaces. Cost sensitive learning is an alternative solution, yet obtaining an appropriate cost matrix to induce a good prediction model is complex and remains an open problem. In this paper, a methodology to obtain an optimal cost matrix to train models based on cost sensitive learning is proposed. The results show that cost sensitive learning with a proper cost matrix can be very competitive, and can even outperform many state-of-the-art class-balancing strategies. Tests were applied to the prediction of molecular functions in Embryophyta plants.
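
Illustrative sketch (not taken from the paper): one common way a cost matrix enters a cost sensitive classifier is through the minimum expected cost decision rule, shown below. The function name `min_expected_cost_class`, the convention `cost[true][predicted]`, and the example numbers are assumptions made for illustration; the paper's procedure for optimizing the matrix itself is not reproduced here.

```python
from typing import Sequence


def min_expected_cost_class(
    probs: Sequence[float], cost: Sequence[Sequence[float]]
) -> int:
    """Return the class index with the lowest expected misclassification cost.

    probs[t]   : estimated probability that the true class is t
    cost[t][p] : cost of predicting class p when the true class is t
    """
    n = len(probs)
    # Expected cost of each candidate prediction p, averaged over true classes.
    expected = [sum(probs[t] * cost[t][p] for t in range(n)) for p in range(n)]
    return min(range(n), key=expected.__getitem__)


if __name__ == "__main__":
    # Missing the rare class (index 1) is assumed to be ten times more costly.
    print(min_expected_cost_class([0.8, 0.2], [[0, 1], [10, 0]]))
```

With symmetric costs the rule reduces to picking the most probable class; a higher cost on the minority class shifts decisions toward it without resampling the data.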

Paper Nr: 82
Title:

Mining Association Rules that Incorporate Transcription Factor Binding Sites and Gene Expression Patterns in C. elegans

Authors:

Hao Wan, Gregory Barrett, Carolina Ruiz and Elizabeth F. Ryder

Abstract: Gene expression in different cells is regulated by different sets of transcription factors. How the combinations of transcription factors required to achieve specificity of expression are encoded by regulatory regions of DNA is a long-standing problem in biology. In the model system C. elegans, gene regulatory regions are relatively compact, and much work has been done to describe gene expression patterns in a number of cell types. In this work, we collected the promoter regions of genes with known expression patterns in a limited number of neuronal cell types, and annotated any DNA motifs in the promoters that corresponded to putative binding sites of known C. elegans transcription factors, using position weight matrices. We used association rule mining to identify rules relating the presence of particular motifs with expression of particular genes. We used metrics including confidence, support, lift, and p-value to mine and assess rules. We examined the effect on the rules of multiple vs. single transcription factors, and the effect of distance from transcription factor binding sites to the start of transcription. The mined association rules were filtered by Benjamini and Hochberg’s approach, and the most interesting rules were selected. We also validated our approach by generating association rules corresponding to gene expression patterns which have been already revealed in biological research. We conclude that our system allows the identification of interesting putative gene expression rules involving known transcription factors. These rules can be further validated using biological techniques.
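
Illustrative sketch (not taken from the paper): the support, confidence, and lift of a rule "antecedent implies consequent" can be computed from simple counts, as below. The function name `rule_metrics`, the item labels, and the toy data are hypothetical; each transaction stands for one gene described by the set of motifs found in its promoter plus its expression labels.

```python
from typing import Iterable, List, Set, Tuple


def rule_metrics(
    transactions: List[Set[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)          # antecedent present
    n_c = sum(1 for t in transactions if c <= t)          # consequent present
    n_ac = sum(1 for t in transactions if (a | c) <= t)   # both present
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    # Lift compares the confidence with the baseline frequency of the consequent.
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


if __name__ == "__main__":
    genes = [
        {"motifA", "motifB", "neuronX"},
        {"motifA", "neuronX"},
        {"motifA"},
        {"motifB"},
    ]
    print(rule_metrics(genes, {"motifA"}, {"neuronX"}))
```

A lift above 1 indicates that the motif and the expression pattern co-occur more often than expected if they were independent.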

Short Papers
Paper Nr: 14
Title:

Performance of Beta-Binomial SGoF Multitesting Method for Dependent Gene Expression Levels - A Simulation Study

Authors:

Irene Castro-Conde and Jacobo de Uña-Álvarez

Abstract: In a recent paper (de Uña-Álvarez, 2012, Statistical Applications in Genetics and Molecular Biology Vol. 11, Iss. 3, Article 14) a correction of the SGoF multitesting method for possibly dependent tests was introduced. This correction extended the field of application of the SGoF methodology, initially restricted to the independent setting, to decisions on which genes are differentially expressed in group comparisons when the gene expression levels are correlated. In this work we investigate through an intensive Monte Carlo simulation study the performance of that correction, called BB-SGoF (from Beta-Binomial), in practical settings. In the simulations, gene expression levels are correlated inside a number of blocks, while the blocks are independent. Different numbers of blocks, within-block correlation values, proportions of true effects, and effect levels are considered. The allocation of the true effects is taken to be random. False discovery rate, power, and conservativeness of the method with respect to the number of existing effects with p-values below the given significance threshold are computed along the Monte Carlo trials. A comparison with the classical Benjamini-Hochberg adjustment is provided. Conclusions from the simulation study and practical recommendations are reported.
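
Illustrative sketch (not taken from the paper): the classical Benjamini-Hochberg step-up procedure used as the comparison baseline can be written in a few lines. The function name `benjamini_hochberg` and the default level of 0.05 are assumptions; BB-SGoF itself is not implemented here.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return a rejection flag per hypothesis using the BH step-up rule."""
    m = len(p_values)
    # Indices of the hypotheses ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= alpha * k / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank
    # Reject the k hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    print(benjamini_hochberg([0.001, 0.02, 0.04, 0.5]))
```

Because the rule is step-up, a hypothesis can be rejected even if its own p-value misses its threshold, provided a larger p-value further down the ordering meets its threshold.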

Paper Nr: 17
Title:

Using a Random Forest Classifier to Find Nuclear Export Signals in Proteins of Arabidopsis thaliana

Authors:

Claudia Rubiano, Thomas Merkle and Tim W. Nattkemper

Abstract: This paper presents a new computational strategy for predicting Nuclear Export Signals (NESs) in proteins of the model plant Arabidopsis thaliana based on a random forest classifier. NESs are amino acid sequences that enable a protein to interact with a nuclear receptor and in this way to be exported from the nucleus to the cytoplasm. The proposed classifier uses two kinds of features, the sequence of the NESs expressed as the score obtained from a HMM profile and physicochemical properties of the amino acid residues expressed as amino acid index values. Around 5000 proteins from the total of protein sequences from Arabidopsis were predicted as containing NESs. A small group of these proteins was experimentally tested for the actual presence of an NES. 11 out of 13 tested proteins showed positive interaction with the receptor Exportin 1 (XPO1a) from Arabidopsis in yeast two-hybrid assays, which indicates they contain NESs. The experimental validation of the nuclear export activity in a selected group of proteins is an indicator of the potential usefulness of the tool. From the biological perspective, the nuclear export activity observed in those proteins strongly suggests that nucleo-cytoplasmic partitioning could be involved in regulation of their functions.

Paper Nr: 19
Title:

Assignment of Orthologous Genes by Utilization of Multiple Databases - The Orthology Package in R

Authors:

Steffen Priebe and Uwe Menzel

Abstract: The assignment of orthologous genes between species is a key issue when multiple-species approaches are conducted. This has become even more relevant over the past years, triggered by the development of high-throughput genome sequencing technologies, which enable access to complete genomes in a rapid and cost-effective way. In this paper, we present new software that allows the user to access orthology relationships across multiple species in an easy, fast, and flexible manner. The tool collects data from three prominent freely available databases, and presents it to the user in a convenient, easily accessible way. Once the package is installed, the software works on the local computer, thereby avoiding runtime delays caused by network traffic, which is often a critical performance bottleneck when large datasets are studied or many organisms are investigated simultaneously. Through the consistent internal use of unique identifiers, the software relieves the user of problems connected with synonyms or ambiguous gene names, which often hamper a clear-cut assignment of orthologs. The software is able to display frequently occurring, complicated many-to-many orthology relationships in a visual manner. It is written in the R programming language and is freely available.

Paper Nr: 21
Title:

Computational Study of the Electrostatic Coupling of Membrane-spanning α-Helices Controlled by Dielectric Media

Authors:

Tarunendu Mapder and Lipika Adhya

Abstract: Voltage-gated potassium ion-channels (Kv) play a key role in neurons. The ion-channel is a tetramer, each subunit having two domains: a voltage sensor-domain (VSD) and a pore-domain (PD), with four (S1-S4) and two (S5-S6) α-helices respectively, which behave like macrodipoles. The VSD appears capable of adopting different orientations relative to the pore-domain of Kv channels in response to the variable (-70 mV to +30 mV) transmembrane-voltage controlling the passage-way of the K+ ions across the membrane. There has been immense progress in the study of voltage-gated channels; however, the molecular mechanism underlying voltage sensing is still a matter of debate. Here, we have used a novel theoretical approach based on electrostatic theory to identify the possible stable conformation of the voltage-gated potassium ion-channel of Aeropyrum pernix (KvAP) at zero transmembrane-voltage by computing the minimum potential energy of the system embedded in a hybrid dielectric environment. We have set up an algorithm to generate data, which is presented graphically and then analyzed to study the configuration of the biological system of KvAP. It is observed that in the ion-channel protein two adjacent α-helices behaving like macrodipoles adopt an antiparallel arrangement, and that the interaction of the charged residues with the multidielectric environment gives the ion-channel protein different conformations.

Paper Nr: 24
Title:

Visualization of Bioinformatics Workflows for Ease of Understanding and Design Activities

Authors:

H. V. Byelas and M. A. Swertz

Abstract: Bioinformatics analyses are growing in size and complexity. They are often described as workflows, with the workflow specifications also becoming more complex due to the diversity of data, tools, and computational resources involved. A number of workflow management systems (WMS) have been developed recently to help bioinformaticians in their workflow design activities. Many of these WMS visualize workflows as graphs, where the nodes are analysis steps and the edges are interactions and constraints between analysis steps. These graphs usually represent the data flow of the analysis. In software visualization, similar graphs are used to show data flow in software systems. However, WMS do not use any widely accepted standards for workflow visualization, particularly not in the bioinformatics domain. As a result, workflows describing the same analysis look different in different WMS. Furthermore, the visualization techniques used in WMS for bioinformatics are quite limited. Here, we argue that applying some of the visual analytics methods and techniques used in the software field, such as UML (unified modelling language) diagrams combined with quality metrics, can help to enhance understanding and sharing of workflows, and ease workflow analysis and design activities.

Paper Nr: 35
Title:

Alternative PPM Model for Quality Score Compression

Authors:

Mete Akgün and Mahmut Şamil Sağıroğlu

Abstract: Next Generation Sequencing (NGS) platforms generate header data and quality information for each nucleotide sequence. These platforms may produce gigabyte-scale datasets. The storage of these datasets is one of the major bottlenecks of NGS technology. Information produced by NGS is stored in FASTQ format. In this paper, we propose an algorithm to compress quality score information stored in a FASTQ file. We try to find a model that gives the lowest entropy on quality score data. We combine our statistical model with arithmetic coding to compress the quality score data as compactly as possible. We compare its performance to text compression utilities such as bzip2, gzip and ppmd and to existing compression algorithms for quality scores. We show that the performance of our compression algorithm is superior to that of both groups of tools.
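
Illustrative sketch (not taken from the paper): the entropy criterion mentioned above can be made concrete with the simplest possible model, an order-0 (context-free) symbol distribution. The function name `order0_entropy` is hypothetical, and the PPM context model proposed in the paper is not implemented here; an ideal arithmetic coder driven by this order-0 model would need roughly the returned number of bits per quality symbol.

```python
import math
from collections import Counter


def order0_entropy(quality: str) -> float:
    """Empirical Shannon entropy, in bits per symbol, of a quality string."""
    counts = Counter(quality)
    n = len(quality)
    # Sum of p * log2(1 / p) over the observed symbols, with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


if __name__ == "__main__":
    # A hypothetical FASTQ quality line with two distinct symbols.
    print(order0_entropy("IIIIHHHH"))
```

Context models such as PPM aim to lower this figure by conditioning each symbol's probability on the symbols that precede it.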

Paper Nr: 37
Title:

Data Mining based Methodologies for Cardiac Risk Patterns Identification

Authors:

V. G. Almeida, J. Borba, T. Pereira, H. C. Pereira, J. Cardoso and C. Correia

Abstract: Cardiovascular diseases (CVDs) are the leading cause of death in the world. Pulse wave analysis provides new insight into these pathologies, while data mining techniques can contribute to an efficient diagnostic method. Amongst the various available techniques, artificial neural networks (ANNs) are well established in biomedical applications and have numerous successful classification applications. Also, clustering procedures have proven to be very useful in assessing different risk groups in terms of cardiovascular function in healthy populations. In this paper, a robust data mining approach was applied to the identification of cardiac risk patterns. Eight classifiers were tested: C4.5, Random Forest, RIPPER, Naïve Bayes, Bayesian Network, Multi-layer perceptron (MLP) (1 and 2 hidden layers) and radial basis function (RBF). As for clustering procedures, k-means clustering (using Euclidean distance) and expectation-maximization (EM) were the chosen algorithms. Two datasets were used as case studies to perform classification and clustering analysis. The accuracy values are good, ranging from 88.05% to 97.15%. The clustering techniques were essential in the analysis of a dataset where little information was available, allowing the identification of different clusters that represent different risk groups in terms of cardiovascular function. The three-cluster analysis allowed the characterization of distinctive features for each of the clusters. Reflected wave time (T_RP) and systolic wave time (T_SP) were the features selected for cluster visualization. Data mining methodologies have proven their usefulness in screening studies due to their descriptive and predictive power.

Paper Nr: 41
Title:

Radial Basis Function Neural-fuzzy Model for Microarray Signature Identification

Authors:

Julio De Alejandro Montalvo, George Panoutsos, Mahdi Mahfouf and James W. Catto

Abstract: This paper introduces a Fuzzy entropy-based method for the problem of feature selection. For the first time Fuzzy-Entropy is used to directly link the relative input relevance of a Radial-Basis-Function Neural-Fuzzy modelling structure. This embedded feature selection method uses the model performance as a criterion for the feature selection. The resulting model maintains its simplicity and transparency in the form of a linguistic Fuzzy-Logic rule-base. The proposed methodology is validated using a real biomedical case-study, which concerns the signature selection for the identification of the stage of bladder cancer. The signature selection and predictive modelling results are compared to previous research work on the same dataset, and it is shown that the RBF-NF model outperforms the previous modelling attempts by achieving high predictive accuracy (>90%). The model is shown to maintain its good performance even when using just 10 genes in the gene based signature.

Paper Nr: 48
Title:

Predicting Molecular Functions in Plants using Wavelet-based Motifs

Authors:

G. Arango-Argoty, A. F. Giraldo-Forero, J. A. Jaramillo-Garzón, L. Duque-Muñoz and G. Castellanos-Dominguez

Abstract: Predicting molecular functions of proteins is a fundamental challenge in bioinformatics. Commonly used algorithms are based on sequence alignments and fail when the training sequences have low percentages of identity with query proteins, as is the case for non-model organisms such as land plants. On the other hand, machine learning-based algorithms offer a good alternative for prediction, but most of them ignore that molecular functions are conditioned by functional domains instead of global features of the whole sequence. This work presents a novel application of the Wavelet Transform in order to detect discriminant sub-sequences (motifs) and use them as input for a pattern recognition classifier. The results show that the continuous wavelet transform is a suitable tool for the identification and characterization of motifs. Also, the proposed classification methodology shows good prediction capabilities for datasets with low percentage of identity among sequences, outperforming BLAST2GO by about 11.5% and PEPSTATS-SVM by 16.4%. In addition, it offers greater interpretability of the obtained results.

Paper Nr: 54
Title:

Learning Advanced TFBS Models from Chip-Seq Data - diChIPMunk: Effective Construction of Dinucleotide Positional Weight Matrices

Authors:

Ivan V. Kulakovskiy, Victor G. Levitsky, Dmitry G. Oschepkov, Ilya E. Vorontsov and Vsevolod J. Makeev

Abstract: Identification and subsequent analysis of DNA sequence motifs recognized by transcription factors is an important component in studying transcriptional regulation in higher eukaryotes. In particular, motif discovery methods are applied to construct transcription factor binding site (TFBS) models. The TFBS models are then used for prediction of putative binding sites in genomic regions of interest. The most popular TFBS model is a positional weight matrix (PWM). The PWM is usually constructed from nucleotide positional frequencies estimated from a gapless multiple local alignment of experimentally identified TFBS sequences. Modern high-throughput experiments, like ChIP-Seq, provide enough data for careful training of more advanced models having more parameters. Until now, the majority of existing tools for TFBS prediction in ChIP-Seq data still rely on PWMs with independent positions. This is partly explained by the only marginal improvement in specificity and sensitivity of TFBS recognition for advanced models over those based on traditional PWMs when trained on ChIP-Seq data. Here we present a novel computational tool, diChIPMunk (http://autosome.ru/dichipmunk/), which can construct dinucleotide PWMs accounting for neighboring nucleotide correlations in input sequences. diChIPMunk retains advantages of the published ChIPMunk algorithm, including usage of ChIP-Seq peak shape and overall computational efficiency. Using public ChIP-Seq data for several TFs we show that carefully trained dinucleotide PWMs perform significantly better than PWMs based on mononucleotide frequencies.

Paper Nr: 55
Title:

Translation Efficiency of Synaptic Proteins and Its Coding Sequence Determinants

Authors:

Shelly Mahlab, Itai Linial and Michal Linial

Abstract: The synapse is an organized structure that contains synaptic vesicles, mitochondria, receptors, transporters and stored proteins. About 10% of the mRNAs that are expressed in mammalian neurons are delivered to synaptic sites, where they are subjected to local translation. While neuronal plasticity, learning and memory occur at the synapse, the mechanisms that regulate post-transcriptional events and local translation are mostly unknown. We hypothesized that evolutionary signals that govern translational efficiency are encoded in the mRNA of synaptic proteins. Specifically, we applied the tRNA adaptation index (tAI) as an indirect proxy for translation rate and showed that ionic channels and ligand-binding receptors are characterized by globally low tAI values. In contrast, the genuine proteins of the synaptic vesicles exhibit significantly higher tAI values. The expression of many of these proteins actually accompanied synaptic plasticity. Furthermore, in human, the local tAI values for the initial segment of the mRNA coding sequence differ for synaptic proteins compared with the rest of the human proteome. We propose that the translation of synaptic proteins is a robust solution for complying with the high metabolic demands of the synapse.

Paper Nr: 58
Title:

A Hybrid Local Search for Simplified Protein Structure Prediction

Authors:

Swakkhar Shatabda, M. A. Hakim Newton, Duc Nghia Pham and Abdul Sattar

Abstract: Protein structure prediction based on the Hydrophobic-Polar energy model essentially becomes a search for a conformation having a compact hydrophobic core at the center. The hydrophobic core minimizes the interaction energy between the amino acids of the given protein. Local search algorithms can quickly find very good conformations by moving repeatedly from the current solution to its “best” neighbor. However, once such a compact hydrophobic core is found, the search stagnates and spends enormous effort in quest of an alternative core. In this paper, we attempt to restructure segments of a conformation with such a compact core. We select one large segment or a number of small segments and apply exhaustive local search. We also apply a mix of heuristics so that one heuristic can help escape the local minima of another. We evaluated our algorithm using the Face Centered Cubic (FCC) lattice on a set of standard benchmark proteins and obtained significantly better results than those of the state-of-the-art methods.
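
Illustrative sketch (not taken from the paper): in the standard Hydrophobic-Polar model, the energy of a conformation is minus the number of contacts between hydrophobic (H) residues that are lattice neighbours but not consecutive in the chain. The sketch below evaluates that energy on an FCC lattice, assuming integer coordinates in which the 12 neighbour vectors have two entries equal to plus or minus 1 and one entry equal to 0. The names `FCC_NEIGHBOURS` and `hp_energy` are hypothetical, and the local search itself is not shown.

```python
from itertools import product
from typing import List, Tuple

Coord = Tuple[int, int, int]

# The 12 FCC neighbour offsets: two coordinates are +/-1 and one is 0.
FCC_NEIGHBOURS = {
    v for v in product((-1, 0, 1), repeat=3) if sum(abs(c) for c in v) == 2
}


def hp_energy(sequence: str, coords: List[Coord]) -> int:
    """Return minus the number of non-consecutive H-H lattice contacts."""
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    energy = 0
    for i in range(len(sequence)):
        # Start at i + 2 so that chain neighbours are never counted.
        for j in range(i + 2, len(sequence)):
            if sequence[i] == "H" and sequence[j] == "H":
                diff = tuple(a - b for a, b in zip(coords[i], coords[j]))
                if diff in FCC_NEIGHBOURS:
                    energy -= 1
    return energy


if __name__ == "__main__":
    # A three-residue chain folded so that residues 0 and 2 touch.
    print(hp_energy("HPH", [(0, 0, 0), (1, 1, 0), (1, 0, 1)]))
```

A search procedure would call such an evaluation repeatedly while proposing moves, so practical implementations usually update the contact count incrementally instead of recomputing it.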

Paper Nr: 65
Title:

Regularized Least Squares Applied to Heartbeat Classification using Transform-based and RR Intervals Features

Authors:

Hamza Baali, Rini Akmeliawati and Momoh J. E. Salami

Abstract: An algorithm for arrhythmia classification is presented with emphasis on the discrimination between normal and premature ventricular contraction (PVC) conditions. We derived new features from the transformed ECG signal resulting from the linear predictive analysis of the ECG heartbeats and from the LPC filter impulse response matrix. These features, in conjunction with the residual error energy and RR-intervals, are fed into the Regularized Least Squares Classifier (RLSC) with a radial basis kernel. The proposed features show an acceptable separation capability between the two classes. Two scenarios, intra-patient and inter-patient classification, are investigated using selected records taken from the MIT-Arrhythmia database. The achieved results are, on average, 98.18% sensitivity and 99.02% specificity for the first scenario (intra-patient) and 95.18% sensitivity and 96.92% specificity for the second scenario (inter-patient).

Paper Nr: 67
Title:

A Predictive Alignment-free Method based on Logistic Regression for Feature Selection and Classification of Protein Sequences

Authors:

Braulio Roberto Goncalves Marinho Couto, Marcelo Matos Santoro, Ana Paula Ladeira and Marcos A. dos Santos

Abstract: Most current methods for predicting the type of protein that a new sequence encodes are based on alignments. We present a method that encodes sequences as dipeptide frequency vectors in a 400-dimensional space and uses information from known protein databases to build logistic regression models for protein prediction. In addition to calculating the probability that an unknown sequence belongs to a specific class of protein, the method performs feature selection, identifying the dipeptides important to each protein group. We tested the method on 16 randomly chosen groups of proteins from Swiss-Prot. The fit of the logistic regression models was assessed on an independent dataset and by comparing the discriminant results with BLAST, the Basic Local Alignment Search Tool. The overall rate of correct protein classification ranged from 87% to 99%, and the sensitivity ranged from 61% to 99%, similar to or better than BLAST. We observed that BLAST had difficulty identifying short sequences such as venom peptides, correctly classifying only 18% of this group of proteins, whereas the logistic model reached 96% in this case. Areas under the ROC curves were higher than 0.90 for all models. Once the logistic models are built, predicting the type of protein that a new sequence encodes becomes very simple, and the same analysis can be carried out for any other protein group. In addition to results better than those of the BLAST program, the proposed method has two important characteristics. First, the modeling phase is carried out as a case-control study that does not use the entire database, but only samples for each target protein, which makes the modeling fast and adaptable to very large problems. Second, and most important, after the modeling phase the entire system reduces to a small amount of source code: an interface to receive queries, a subroutine to encode amino acid sequences as frequency vectors, and the logistic equations to predict probabilities. After the model is built, there is no further database searching or comparison between the new sequence and known proteins.
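
The encoding and prediction steps described in this abstract can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function names, the handling of non-standard residues, and the use of relative (rather than absolute) frequencies are assumptions, and the weights would have to come from a previously fitted logistic regression model.

```python
from itertools import product
from math import exp

# The 20 standard amino acids give 20 x 20 = 400 possible dipeptides.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
INDEX = {dipeptide: i for i, dipeptide in enumerate(DIPEPTIDES)}


def dipeptide_frequencies(sequence):
    """Encode a protein sequence as a 400-dimensional vector of
    relative dipeptide frequencies (overlapping windows of length 2).

    Assumption: dipeptides containing non-standard residues are skipped.
    """
    counts = [0] * len(DIPEPTIDES)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - 1):
        idx = INDEX.get(seq[i:i + 2])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [c / total for c in counts]


def logistic_probability(features, weights, intercept):
    """Evaluate a fitted logistic model: p = 1 / (1 + exp(-(b0 + w.x)))."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + exp(-z))
```

Once the weights for a protein group are known, classifying a new sequence amounts to one call to `dipeptide_frequencies` followed by one call to `logistic_probability`, with no database search involved.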

Paper Nr: 70
Title:

Automatic Feature Selection for Sleep/Wake Classification with Small Data Sets

Authors:

J. Foussier, P. Fonseca, X. Long and S. Leonhardt

Abstract: This paper describes an automatic feature selection algorithm integrated into a classification framework developed to discriminate between sleep and wake states during the night. The proposed feature selection algorithm uses the Mahalanobis distance and Spearman’s rank-order correlation as selection criteria to restrict the search in a large feature space. The algorithm was tested using a leave-one-subject-out cross-validation procedure on 15 single-night PSG recordings of healthy sleepers and then compared to the results of a standard Sequential Forward Search (SFS) algorithm. It achieved comparable performance in terms of Cohen’s kappa (κ = 0.62) and the Area under the Precision-Recall curve (AUCPR = 0.59), but reduced the computation time significantly, by a factor of nearly 10. The feature selection procedure, applied on each iteration of the cross-validation, was found to be stable, consistently selecting a similar list of features. It selected an average of 10.33 features per iteration, nearly half of the 21 features selected by SFS. In addition, learning curves show that the training and testing performances converge faster than for SFS and that the final training-testing performance difference is smaller, suggesting that the new algorithm is better suited to data sets with a small number of subjects.
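
For readers unfamiliar with the agreement measure reported in this abstract, Cohen's kappa can be computed as sketched below. The function name and the example labels are illustrative assumptions; the abstract does not describe how the authors computed it.

```python
def cohens_kappa(true_labels, predicted_labels):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the agreement expected by chance."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(true_labels)
    labels = set(true_labels) | set(predicted_labels)
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n
    expected = sum(
        (true_labels.count(label) / n) * (predicted_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Unlike raw accuracy, kappa corrects for the agreement expected by chance, which matters when one class (here, sleep epochs during the night) dominates the recording.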

Paper Nr: 73
Title:

A Novel Pipeline for V(D)J Junction Identification using RNA-Seq Paired-end Reads

Authors:

Giulia Paciello, Elisa Ficarra, Alberto Zamò, Chiara Pighi, Carmelo Foti, Francesco Abate, Enrico Macii and Andrea Acquaviva

Abstract: Immunoglobulin heavy and light chains are assembled from germline V, D, J and V, J segments, respectively, in a process called V(D)J recombination, which is involved in the development of T and B lymphocytes. The discovery that abnormal antibodies are often related to a wide range of pathologies has led in recent years to many studies of immunoglobulin features. In particular, identifying the functional V(D)J sequence of an antibody is considered fundamental, since it could make it possible to understand the link between a particular disease and a specific recombination in a certain tissue, and to promote the engineering of therapeutic antibodies. The objective of the implemented pipeline is to identify the so-called ’main clone’ that characterizes a neoplastic tissue using paired-end RNA-Sequencing (RNA-Seq) reads.

Paper Nr: 75
Title:

Comparison of Four Ab Initio MicroRNA Prediction Tools

Authors:

Müşerref Duygu Saçar and Jens Allmer

Abstract: MicroRNAs are small RNA sequences of 18-24 nucleotides in length, which serve as templates to drive post-transcriptional gene silencing. The canonical microRNA pathway starts with transcription from DNA and is followed by processing by the Microprocessor complex, yielding a hairpin structure. This hairpin is then exported into the cytosol, where it is processed by Dicer and subsequently incorporated into the RNA-induced silencing complex. All of these biogenesis steps add to the overall specificity of miRNA production and effect. Unfortunately, experimental detection of miRNAs is cumbersome, and computational tools are therefore necessary. Homology-based miRNA prediction tools are limited by fast miRNA evolution and by the fact that they are template driven. Ab initio miRNA prediction methods have been proposed, but they have not been benchmarked against one another, so their relative performance is largely unknown. Here we implement the features proposed in four ab initio miRNA studies and evaluate them on two data sets. Using the features described in Bentwich 2008 leads to the highest accuracy, but still does not provide enough confidence in the results to warrant experimental validation of all predictions in a larger genome such as the human genome.

Paper Nr: 78
Title:

Comparing Viral (HIV) and Bacterial (Staphylococcus aureus) Infection of the Bone Tissue

Authors:

Mohammad Ali Moni, Pietro Liò and Luciano Milanesi

Abstract: This paper focuses on the differences between S. aureus bacterial and HIV viral infection of the bone tissue. Both of these infections alter the RANK/RANKL/OPG signalling dynamics that regulate osteoblast and osteoclast behavior in bone remodelling. These infections rapidly lead to severe bone loss and may even spread to other parts of the body. Since both HIV and osteomyelitis cause loss of bone mass, we focused on comparing the dynamics of these diseases by means of computational models. First, we performed a meta-analysis of gene expression data from normal, HIV and osteomyelitis bone conditions and compared the effects of HIV and S. aureus infection. We mainly focused on RANKL/OPG signalling, the TNF and TNF receptor superfamilies and the NF-kB pathway. Using information from the gene expression data, we estimated parameters for a novel model of osteomyelitis. We then developed a multi-strain HIV ODE model incorporating HAART therapy. Our ODE modelling aims to investigate the dynamics of the effects of osteomyelitis and HIV infection on bone remodelling.

Paper Nr: 84
Title:

RqPCRAnalysis: Analysis of Quantitative Real-time PCR Data

Authors:

Frédérique Hilliou and Trang Tran

Abstract: We propose RqPCRAnalysis, a statistical tool for quantitative real-time PCR data analysis that supports several normalization genes as well as biological and technical replicates, and provides statistically validated results. RqPCRAnalysis improves on methods developed in the Genorm and qBASE programs. The algorithm was developed in the R language and is freely available. The main contributions of the RqPCRAnalysis tool are: (1) determining the most stable reference genes (REF), i.e. housekeeping genes, across biological and technical replicates; (2) computing the normalization factor based on REF; (3) computing the normalized expression of the genes of interest (GOI), as well as rescaling the normalized expression across biological replicates; (4) comparing expression levels between samples across biological replicates by means of a statistical significance test. In this paper we describe and demonstrate the available statistical functions for practical analysis of quantitative real-time PCR data. RqPCRAnalysis is user-friendly and should help biologists with no prior training in R programming to analyze their quantitative PCR data.
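
Steps (2) and (3) of this abstract can be illustrated with a small sketch. The abstract does not give the formulas used by RqPCRAnalysis, so the code below assumes the commonly used convention of taking the normalization factor as the geometric mean of the relative quantities of the reference genes; the function names are hypothetical and are not part of the RqPCRAnalysis tool, which is written in R.

```python
from math import prod


def normalization_factor(reference_quantities):
    """Geometric mean of the relative quantities of the reference genes
    for one sample (assumed convention, not taken from the abstract)."""
    if not reference_quantities:
        raise ValueError("at least one reference gene is required")
    if any(q <= 0 for q in reference_quantities):
        raise ValueError("relative quantities must be positive")
    return prod(reference_quantities) ** (1.0 / len(reference_quantities))


def normalized_expression(goi_quantity, reference_quantities):
    """Relative quantity of a gene of interest divided by the
    sample's normalization factor."""
    return goi_quantity / normalization_factor(reference_quantities)
```

A geometric rather than arithmetic mean is the usual choice here because it is less sensitive to a single reference gene with an outlying quantity.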

Paper Nr: 86
Title:

Structural Analysis of Nuclear Magnetic Resonance Spectroscopy Data

Authors:

Alejandro Chinea and José L. González Mora

Abstract: From the clinical diagnosis point of view, in vivo nuclear magnetic resonance (NMR) spectroscopy has proven to be a valuable tool for performing non-invasive quantitative assessments of brain tumour glucose metabolism. Brain tumours are considered fast-growing tumours because of their high rate of proliferation. Therefore, there is strong interest from the clinical investigator’s point of view in the development of early tumour detection techniques. Unfortunately, current diagnosis techniques ignore the dynamic aspects of these signals. It is widely believed that temporal variations of NMR spectra are simply due to noise or do not carry enough information to be exploited by any reliable diagnosis procedure. Thus, current diagnosis procedures are mainly based on empirical observations extracted from single averaged spectra. In this paper, a machine learning framework for the analysis of NMR spectroscopy signals is introduced. The proposed framework is characterized by a set of structural parameters that are shown to be very sensitive to metabolic changes such as those exhibited by tumour cells. Furthermore, they are able to cope not only with the high-dimensional characteristics of NMR data but also with the dynamic aspects of these signals.

Paper Nr: 90
Title:

Can Software Transactional Memory Make Concurrent Programs Simple and Safe?

Authors:

Ketil Malde

Abstract: Parallel programs are key to exploiting the performance of modern computers, but traditional facilities for synchronizing threads of execution are notoriously difficult to use correctly, especially for problems with a non-trivial structure. Software transactional memory (STM) is a different approach to managing the complexity of interacting threads. By eliminating locking, many of the complexities of concurrency are eliminated, and the resulting programs are composable, which simplifies refactoring and other modifications. Here, we investigate STM in the context of genome assembly, and demonstrate that a program using STM is able to successfully parallelize the genome scaffolding process with a near-linear speedup.

Paper Nr: 91
Title:

Generalized Association Rules for Connecting Biological Ontologies

Authors:

Fernando Benites and Elena Sapozhnikova

Abstract: The constantly increasing volume and complexity of available biological data require new methods for managing and analyzing them. An important challenge is the integration of information from different sources in order to discover possible hidden relations between already known data. In this paper we introduce a data mining approach which relates biological ontologies by mining generalized association rules connecting their categories. To select only the most important rules, we propose a new interestingness measure especially well-suited for hierarchically organized rules. To demonstrate this approach, we applied it to the bioinformatics domain and, more specifically, to the analysis of data from Gene Ontology, Cell type Ontology and GPCR databases. Association rules found in this way, connecting two biological ontologies, can provide the user with new knowledge about underlying biological processes. The preliminary results show that the produced rules represent meaningful and quite reliable associations among the ontologies and help infer new knowledge.

Paper Nr: 92
Title:

Directional Variation of Trabecular Bone in the Femoral Head, a μ-CT based Approach

Authors:

Varitis Emmanouil, Sagris Dimitrios, David Constantine and Lontos Antonios

Abstract: The structural characteristics of bone are described by features of high complexity, defining the directional anisotropy of its mechanical properties. This phenomenon originates in the orientation of collagen fibers and osteons within the cortical tissue and in the trabecular morphology of cancellous bone. The purpose of this study was to examine the geometrical anisotropy of cancellous bone in the femoral head. In total, 28 femoral heads, harvested during hip replacement of 17 women and 11 men, were studied. Cylindrical specimens of 11 mm in diameter were extracted perpendicular to the fovea capitis femoris and subjected to micro Computed Tomography (μ-CT). An 11 mm sphere was isolated from all samples and the cross-sectional area of the sphere was studied for 8 predefined regions, corresponding to planes perpendicular to the principal loading directions of the hip joint. Significant topographical variations of trabecular bone structure in different subchondral regions were determined. In the superior region, the trabecular bone strength was the highest, while the inferior region exhibited the lowest bone strength, and the medial and lateral regions had intermediate magnitudes. No significant difference in anisotropy was found between male and female samples, although the absolute values were greater in males. The obtained results agree with recent literature data from osteopenetration experiments in these directions.

Paper Nr: 95
Title:

Multiple RNA Interaction - Formulations, Approximations, and Heuristics

Authors:

Saad Mneimneh, Syed Ali Ahmed and Nancy L. Greenbaum

Abstract: The interaction of two RNA molecules involves a complex interplay between folding and binding that has warranted recent developments in RNA-RNA interaction algorithms. However, there exist biological mechanisms in which more than two RNAs take part in an interaction. Therefore, we formulate multiple RNA interaction as a computational problem, which not surprisingly turns out to be NP-complete. Our experiments with approximation algorithms and heuristics for the problem suggest that this formulation is indeed useful to determine interaction patterns of multiple RNAs when information about which RNAs interact is not necessarily available (as opposed to the case of two RNAs, where one must interact with the other), and because the resulting RNA structure often cannot be predicted by existing algorithms when RNAs are simply handled in pairs.

Posters
Paper Nr: 5
Title:

New Algorithm for Analysis of Off-target Effects in siRNA Screens

Authors:

Karol Kozak, Sandra Kaestner, Thomas Wild, Andreas Vonderheit, Benjamin Misselwitz, Ulrike Kutay and Gabor Csucs

Abstract: The occurrence of RNAi side effects called “off-target effects” is still a challenging aspect in the interpretation of data from large-scale RNA interference screens. To reduce off-target effects, improved algorithms have been developed for small interfering RNA (siRNA) design, and various commercial providers have also introduced chemical modifications of double-stranded RNA molecules. To aid the analysis of large-scale screens, we present a new algorithm and tool for the prediction of potential off-target effects that can be applied to RNAi experimental data. Our approach provides different possibilities to search for homologies between individual siRNAs and cellular mRNAs. We demonstrate our approach on a ribosomal RNAi screening dataset.

Paper Nr: 12
Title:

Structure Prediction with FAMS for Proteins Screened Critically to Autoimmune Diseases based upon Bioinformatics

Authors:

Shigeharu Ishida, Hideaki Umeyama, Mitsuo Iwadate and Y-h. Taguchi

Abstract: Drug discovery for autoimmune diseases has recently been recognized as an important task. In this study, we perform structure prediction of proteins whose gene promoter regions were previously reported to be specifically methylated or demethylated in common across three autoimmune diseases: systemic lupus erythematosus, rheumatoid arthritis, and dermatomyositis. FAMS was employed for this purpose, and we were able to predict three-dimensional structures with sufficiently small P-values. Most of the proteins are suggested to be self-immunology-related proteins and will be important drug target candidates. We also found some proteins that form complexes with each other. This suggests the possibility of a new kind of drug target, i.e., suppression of protein complex formation.

Paper Nr: 15
Title:

A Literature Evaluation of CUDA Compatible Sequence Aligners

Authors:

Yang Liu, Jiang-Yu Li, Yi-Qing Mao, Xiao-Lei Wang and Dong-Sheng Zhao

Abstract: The rapidly accumulating biological data generated by next-generation sequencers motivate the development of improved tools for sequence alignment. Many technologies have been proposed for this purpose, and one of them is GPU computing. Existing work on accelerating sequence aligners with GPU computing overemphasizes speed. However, other factors such as accuracy, performance per watt, price-performance and programming complexity are also important and need to be considered. Based on the existing literature on GPU-based sequence aligners, this paper evaluates these aligners from the above perspectives, in order to determine the usability of the large number of GPU-based sequence aligners.

Paper Nr: 18
Title:

Evolution of Bacterial Genome under Changing Mutational Pressure - Computer Simulation Studies

Authors:

Paweł Błażej, Paweł Mackiewicz, Małgorzata Wańczyk and Stanisław Cebrat

Abstract: The main force shaping the structure of bacterial chromosomes is the replication-associated mutational pressure, which is characterized by distinct nucleotide substitution patterns acting on differently replicated DNA strands (leading and lagging). Therefore, the composition of DNA strands is asymmetric, and it matters on which strand a gene is located and to which strand it could be translocated. Thus, the mutational pressure also restricts intragenomic translocations. To analyze this effect, we have developed a simulation model of bacterial genome evolution assuming translocation of protein coding genes and different types of selection acting on their sequences. The ’negative’ selection eliminated individuals if the coding signal of any gene in their genome dropped below the acceptable range, whereas the ’stabilizing’ selection did not allow the coding signal of any gene to decrease below its original value. Under the ’negative’ selection more genes stayed on or were translocated to the lagging strand, whereas under the ’stabilizing’ selection more genes preferred the leading strand. The ’stabilizing’ selection eliminated more individuals because of coding signal loss and slightly fewer because of stop codon generation. The ’stabilizing’ selection also allowed far fewer gene translocations between strands than the ’negative’ selection.

Paper Nr: 22
Title:

Optimisation and Validation of a Minimum Data Set for the Identification and Quality Control of EST Expression Libraries

Authors:

A. T. Milnthorpe and Mikhail Soloviev

Abstract: Several bioinformatics tools, such as dbEST, DDD, GEPIS, cDNA xProfiler and cDNA DGED, have been widely used to retrieve and analyse EST expression data and to compare gene expression levels, e.g. between cancer and normal tissues. The outcome of any such comparison depends on the EST libraries' annotations and assumes that the actual expression data (EST counts) are correct. None of the existing tools provides a quality control method for the selection and evaluation of the original EST expression libraries. Here we report the selection, optimisation and evaluation of a minimal gene expression data set using CGAP cDNA DGED. Our approach relies solely on the expression data itself and is independent of the libraries' annotations. The reported approach allows tissue typing of expression libraries of different sizes, containing from as few as 249 total EST counts up to 13,929 total EST counts (the highest tested so far).

Paper Nr: 26
Title:

Different Stimuli for Inference of Gene Regulatory Network in Rheumatoid Arthritis

Authors:

Peter Kupfer, Sebastian Vlaic, René Huber, Raimund W. Kinne and Reinhard Guthke

Abstract: Since genetic and epigenetic factors are known to be involved in the pathogenesis of rheumatoid arthritis, the search for key players in this disease is one of the most important challenges. For this purpose, gene regulatory networks are one way to reveal the underlying interactions for different stimuli. In this study we analyzed the cellular response of synovial fibroblasts to 4 different stimuli. We inferred a gene regulatory network that is able to explain the observed data for stimulation by TNF-a, TGF-b1, IL-1 and PDGF-D simultaneously.

Paper Nr: 33
Title:

QUASI: A Pipeline for the Quality Assessment and Statistical Inference on Next Generation Sequencing Data from Pooled shRNA Library Screens

Authors:

Mark Onyango, Carsten Ade, Franz Cemič and Jürgen Hemberger

Abstract: With the development of next generation high-throughput sequencing solutions to expression profiling, the efficient and effortless handling of such profiling data became a key challenge for bioinformaticians and biologists alike. We therefore present a "fire and forget" style pipeline implemented in C and R, named QUASI. It is capable of quality assessments, sequence alignments, shRNA quantification and statistically inferring significant differential sequence abundance from datasets presented to it. Through blackboxing the often complex and laborious steps, QUASI presents itself as a user-friendly and time-efficient solution to handle pooled shRNA library screening data.

Paper Nr: 51
Title:

A Non-linear Finite Element Model for Assessment of Lumbar Spinal Injury Due to Dynamic Loading

Authors:

Alexander Tsouknidas, Savvas Savvakis, Nikolaos Tsirelis, Antonios Lontos and Nikolaos Michailidis

Abstract: In this paper a highly detailed model of an adult lumbar spine (L1-L5) was recreated based on Computed Tomography. In addition to the viscoelastic deformation of the intervertebral discs, cortical and cancellous bone anisotropy was considered, while seven types of ligaments were simulated either by solid or cable elements. The dynamic behaviour of the spine segment was assessed through stress-strain curves, provoking a non-linear response of all implicated tissues’ material properties. The model was subjected to dynamic loading to determine abnormalities in the anatomy’s stress equilibrium that could provoke gait disturbances. The results indicate that the introduced methodology is an effective alternative to in vitro investigations, capable of providing valuable insight into critical movements and loads of potential patients, as the model can be employed to optimize therapeutic training or threshold kinematics of any given lumbar spine pathology.

Paper Nr: 59
Title:

Hadoop-RINS - A Hadoop Accelerated Pipeline for Rapid Nonhuman Sequence Identification

Authors:

Li Jiangyu, Liu Yang, Wang Xiaolei, Mao Yiqing, Wang Yumin and Zhao Dongsheng

Abstract: The volume of sequencing data has increased rapidly in recent years with the development of high-throughput sequencing technology. Using parallel computing to accelerate the computation is an important way to process this large volume of sequence data. RINS is a pipeline used to identify nonhuman sequences in deep sequencing datasets. It uses user-provided microbial reference genomes to reduce the number of reads to be processed and improve the processing speed. However, all of its steps run serially. As a result, the processing speed of RINS slows down sharply as the sequencing data and reference genomes grow. In this article, we report a pipeline that processes sequencing data in parallel through Hadoop. A comparison of runtimes on the same dataset shows that Hadoop-RINS is significantly faster than RINS while producing the same computation result.

Paper Nr: 60
Title:

A Semantic-based Similarity of Human Drug Target Proteins

Authors:

Eduardo C. dos Santos, Marcos A. dos Santos, Bráulio R. G. M. Couto and Julio C. D. Lopes

Abstract: The study of drug target similarity can provide valuable guidelines on target identification, drug repurposing, and rational drug development strategies. Here, we develop a measure of similarity by using singular value decomposition and rank reduction over a vector space originally described by the functional annotation of known drug targets. We show that although the measure was constructed strictly with general protein properties, it discriminates between druggable and non-druggable proteins and can be used to predict druggable targets with sensitivity and specificity of 88%. Furthermore, it makes it possible to find hidden similarities that have no prima facie relationship and are hardly recognized by sequence alignment.

Paper Nr: 61
Title:

A Dynamic Whole-genome Database for Comparative Analyses, Molecular Epidemiology and Phenotypic Summary of Bacterial Pathogens

Authors:

Chad R. Laing, Eduardo Taboada, Peter Kruczkiewicz, James E. Thomas and Victor P. J. Gannon

Abstract: Background. Recent outbreaks caused by bacterial contaminants in food, including the contamination of sprouts by E. coli O104:H4 in Germany and of processed meats by Listeria in Canada, highlight the need for rapid and accurate characterization of bacterial pathogens. Current sequencing platforms have revolutionized the amount and quality of data available to epidemiologists, public health officials and microbiologists, who now require powerful yet intuitive tools to make sense of the underlying biology in these large datasets. In this study, we developed bioinformatics tools to automate whole-genome analyses, make the data broadly accessible via novel reporting functions, and provide a dynamic computational platform for genomic analyses online at http://76.70.11.198/bacpath. Methods. A PHP-based web front end and PostgreSQL database display the pre-computed data. Genomic comparisons are performed using updates to our previously created pan-genomic software suite, Panseq (http://lfz.corefacility.ca/panseq/). New genomic sequences are analyzed and added to the database without the need to recompute previous analyses. Phylogenetic trees are created with MrBayes. Statistical calculations are performed using R. Results. A pathogen-specific genomic database encompassing all publicly available E. coli strains was created as a proof of concept. Pre-computed comparisons for the hundreds of bacterial genomes, including phylogeny, presence/absence of virulence markers, group-specific biomarkers and geospatial information, were generated. Data reporting tools were created to summarize the complexity of the data and to provide biologically pertinent results including genotype, phenotype (e.g. anti-microbial resistance), and geospatial information. Discussion. The database provides rapid and accurate identification and characterization of E. coli. Output is formatted specifically for end users, describing virulence, phylogeny and group-specific markers. Uptake of a global surveillance system with near real-time analysis will provide an effective early warning system and allow for a faster response to pathogen-related outbreaks.

Paper Nr: 64
Title:

Infrequent, Unexpected, and Contrast Pattern Discovery from Bacterial Genomes by Genome-wide Comparative Analysis

Authors:

Daisuke Ikeda, Osamu Maruyama and Satoru Kuhara

Abstract: With sequences now plentiful, comparative genomics is becoming increasingly important. Its basic approach is to find similar subsequences in the sequences of different species and then examine in detail the differences among the similar parts found. Instead of focusing on similar parts, this paper is devoted to finding different parts directly from whole DNA sequences. This is challenging because the large size of the data prohibits computationally expensive methods and there exist so many differences in a genome-wide comparison. To cope with this, we exploit the algorithm in (Ikeda and Suzuki, 2009), which finds unexpected, infrequent patterns. However, the patterns found had not been evaluated from a biological viewpoint. In this paper, we show that patterns discovered by the algorithm from bacterial genome sequences match biological features, such as RNA and transposons, well. Therefore, treating these features as relevant regions, we compute F-measure values and show that some species achieve about 90%, which is one order of magnitude better than patterns found by an existing method. Thus, we conclude that the algorithm can find these infrequent but biologically meaningful patterns in genome-wide sequences.
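
The F-measure used for evaluation in this abstract is the harmonic mean of precision and recall. The sketch below assumes, purely for illustration, that discovered patterns and relevant regions are both represented as sets of genome positions; the abstract does not specify how matches were counted, and the function name is hypothetical.

```python
def f_measure(predicted_positions, relevant_positions):
    """Harmonic mean of precision and recall over two sets of positions."""
    predicted = set(predicted_positions)
    relevant = set(relevant_positions)
    true_positives = len(predicted & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(relevant)
    return 2.0 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a high F-measure requires that the discovered patterns both cover the relevant regions and rarely fall outside them.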

Paper Nr: 69
Title:

Development of Prediction Models under Multiple Imputation for Coronary Heart Disease in Type 2 Diabetes Mellitus

Authors:

Guozhi Jiang, Eric S. Lau, Ying Wang, Andrea O. Luk, Claudia H. Tam, Janice S. Ho, Vincent K. Lam, Heung M. Lee, Xiaodan Fan, Wing-Yee So, Juliana C. Chan and Ronald C. Ma

Abstract: The objectives of this study were to develop prediction models for coronary heart disease (CHD) in type 2 diabetes mellitus (T2DM) based on imputed data sets, to compare them with a model based on the complete-case (C-C) data set, and to identify novel genes associated with CHD among T2DM-related genes. A prospective cohort of 5526 patients with T2DM and without known CHD or heart failure at baseline was used in this analysis. During a median follow-up time of 8.8 years, 406 (7.3%) patients developed incident CHD. Multiple imputation (MI) was performed to handle missing values for 26 clinical variables and 40 genetic variables, while Cox proportional hazards regression with backward variable selection was applied to bootstrap samples. Five different MI or C-C models were compared, and their performance in terms of C-index, 5-year AUC and the slope of the prognostic index was similar. Three SNPs located at NEGR1, CDKAL1 and ADAMTS9 were found to be significant after adjusting for clinical variables. In conclusion, multiple imputation and bootstrapping can benefit the development of prediction models, and a stable risk factor set for CHD was successfully identified from our dataset containing clinical and genetic variables.

Paper Nr: 77
Title:

MIST: A Tool for Rapid in silico Generation of Molecular Data from Bacterial Genome Sequences

Authors:

Peter Krukczkiewicz, Steven Mutschall, Dillon Barker, James Thomas, Gary Van Domselaar, Victor P. J. Gannon, Catherine D. Carrillo and Eduardo N. Taboada

Abstract: Whole-genome sequence (WGS) data can, in principle, resolve bacterial isolates that differ by a single base pair, thus providing the highest level of discriminatory power for epidemiologic subtyping. Nonetheless, because the capability to perform whole-genome sequencing in the context of epidemiological investigations involving priority pathogens has only recently become practical, fewer isolates have WGS data available relative to traditional subtyping methods. It will be important to link these WGS data to data in traditional typing databases such as PulseNet and PubMLST in order to place them into proper historical and epidemiological context, thus enhancing investigative capabilities in response to public health events. We present MIST (Microbial In Silico Typer), a bioinformatics tool for rapidly generating in silico typing data (e.g. MLST, MLVA) from draft bacterial genome assemblies. MIST is highly customizable, allowing the analysis of existing typing methods along with novel typing schemes. Rapid in silico typing provides a link between historical typing data and WGS data, while also providing a framework for the assessment of molecular typing methods based on WGS analysis.

Paper Nr: 79
Title:

AgED: Extraction and Evaluation of Elliptic Fourier Descriptors from Image Data in Phenotype Assessment Applications

Authors:

Jörgen Brandt and Alexander Heyl

Abstract: In biological experiments, phenotype evaluation is a common challenge. In a wide variety of applications, the phenotypic features of organisms have to be measured and statistically assessed. This is especially important as differences between wild-type and mutant or treated and untreated organisms are often very subtle. Here, we propose a set of digital image transformations that implement preprocessing, feature extraction and statistical analysis of image data that is typically generated in a biological experiment. Moreover, we present AgED (Analysis given Experimental Data), a software toolkit that facilitates the process of phenotypic feature evaluation from digital image data in an automated fashion. Suitable statistical analysis and visualization are performed and controlled via a Graphical User Interface. Furthermore, the use of open data structures allows for the convenient reuse of the acquired feature data with miscellaneous data-mining software and scientific workflow systems. The functionality of this software tool is demonstrated and validated by repeating a phytohormone response experiment carried out on the freshwater alga Coleochaete scutata. The results showed that the timely and automatic processing of digital image data aids the researcher and rationalizes the formerly lengthy and, at times, error-prone data evaluation in spreadsheet documents. Furthermore, the software toolkit AgED establishes a comparable evaluation standard and provides ready-to-publish graphic export facilities.

Paper Nr: 93
Title:

Community Detection within Clusters Helps Large Scale Protein Annotation - Preliminary Results of Modularity Maximization for the BAR+ Database

Authors:

Giuseppe Profiti, Damiano Piovesan, Pier Luigi Martelli, Piero Fariselli and Rita Casadio

Abstract: Given the exponentially increasing amount of available data, electronic annotation procedures for protein sequences are a core topic in bioinformatics. In this paper we present the refinement of an already published procedure that allows a fine-grained level of detail in the annotation results. This enhancement is based on a graph representation of the similarity relationship between sequences within a cluster, followed by the application of community detection algorithms. These algorithms identify groups of highly connected nodes inside a bigger graph. The core idea is that sequences belonging to the same community share more features with each other than with the other sequences in the same graph.
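
The title of this paper refers to modularity maximization. As a hedged illustration of the quantity being maximized, the sketch below computes the standard (Newman) modularity of a given partition of an unweighted, undirected graph. The function name, the edge-list representation and the toy graph in the usage note are assumptions for illustration and are not taken from the BAR+ procedure.

```python
def modularity(edges, community_of):
    """Modularity Q = sum over communities c of
    (L_c / m) - (d_c / (2 * m)) ** 2,
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("the graph must contain at least one edge")
    internal_edges = {}
    degree_sum = {}
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        # Each edge contributes one unit of degree to each endpoint.
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal_edges[cu] = internal_edges.get(cu, 0) + 1
    return sum(
        internal_edges.get(c, 0) / m - (d / (2.0 * m)) ** 2
        for c, d in degree_sum.items()
    )
```

For two triangles joined by a single edge, assigning each triangle to its own community gives a positive modularity, whereas placing all six nodes in one community gives zero; a community detection algorithm based on modularity maximization searches for the partition with the highest such score.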