BIOINFORMATICS 2019 Abstracts


Full Papers
Paper Nr: 10
Title:

Indexing k-mers in Linear-space for Quality Value Compression

Authors:

Yoshihiro Shibuya and Matteo Comin

Abstract: Many bioinformatics tools rely heavily on k-mer dictionaries to describe the composition of sequences and allow for faster reference-free algorithms or look-ups. Unfortunately, naive k-mer dictionaries are very memory inefficient, requiring a very large amount of storage space to save each k-mer. This problem is generally worsened by the necessity of an index for fast queries. In this work we discuss how to build an indexed linear reference containing a set of input k-mers, and its application to the compression of quality scores in FASTQ files. Most of the entropy of sequencing data lies in the quality scores, and thus they are difficult to compress. Here, we present an application to improve the compressibility of quality values while preserving the information for SNP calling. We show how a dictionary of significant k-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality values. Availability: the software is freely available at https://github.com/yhhshb/yalff.
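
To make the notion of a naive k-mer dictionary concrete, the following Python sketch counts every k-mer of a sequence in a plain hash map. It illustrates only the memory-hungry baseline the abstract criticizes, not the indexed linear-space structure of the paper; the function name count_kmers and the toy sequence are assumptions made for this example.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of seq in a plain dictionary."""
    if k <= 0:
        raise ValueError("k must be positive")
    # One dictionary entry per distinct k-mer: simple, but costly for large inputs.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy input (hypothetical): six overlapping 3-mers, four of them distinct.
counts = count_kmers("ACGTACGT", 3)
```

Each distinct k-mer is stored as its own key, which is why such a dictionary grows quickly with k and with the size of the input.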

Paper Nr: 12
Title:

Detection of Gene-gene Interactions: Methodological Comparison on Real-World Data and Insights on Synergy between Methods

Authors:

Hugo Boisaubert and Christine Sinoquet

Abstract: In this paper, we report three contributions in the field of gene-gene interaction (epistasis) detection. Our first contribution is the comparative analysis of five approaches designed to tackle epistasis detection, on real-world datasets. The aim is to help address the lack of feedback on the behaviors of published methods in real-life epistasis detection. We focus on four state-of-the-art approaches encompassing random forests, Bayesian inference, optimization techniques and Markov blanket learning. In addition, a recently developed approach, SMMB-ACO (Stochastic Multiple Markov Blankets with Ant Colony Optimization), is included in the comparison. Thus, our second contribution is an assessment of the behavior of SMMB-ACO on real-world data, whereas SMMB-ACO has so far been evaluated mainly through small-scale simulations. We used a published case-control dataset related to Crohn’s disease. Focusing on pairwise interactions, we report great heterogeneity across the methods in running times, memory occupancies, numbers of interactions output, and distributions of p-values and odds ratios characterizing the interactions. Then, our third contribution is a proof-of-concept study in the context of genetic association interaction studies, to foster alternatives to analyses driven by prior biological knowledge. The principle is to cross-reference the results of several machine learning methods whose intrinsic mechanisms greatly differ, to provide a prioritized list of interactions to be validated experimentally. Focusing on the interactions identified in common by at least two methods, we obtained a prioritized list of 56 interactions, from which we could infer one interaction network of size 7, four networks of size 4 and six of size 3.

Paper Nr: 13
Title:

Method Choice in Gene Set Analysis Has Important Consequences for Analysis Outcome

Authors:

Farhad Maleki, Katie L. Ovens, Elham Rezaei, Alan M. Rosenberg and Anthony J. Kusalik

Abstract: Gene set enrichment analysis is a well-established approach for gaining biological insight from expression data. With many gene set analysis methods available, a question is raised about the consistency of the results of these methods. In this paper, we answer this question with a systematic analysis of ten commonly used gene set analysis methods when applied to microarray data. The statistical analysis suggests that there is a significant difference between the results of these methods. Comparison of the 20 most statistically significant gene sets reported by these methods showed little to no agreement regardless of the dataset being used. This observation suggests that the outcome of a study can be highly dependent on the choice of the gene set analysis method. Comparing the 100 most statistically significant gene sets also led to the same conclusion. Furthermore, biological evaluation using a juvenile idiopathic arthritis dataset agreed with the results of the statistical analysis. The 20 most statistically significant gene sets for some methods showed relevance to the biology of juvenile arthritis, supporting their utility, while most methods led to results that were irrelevant or marginally relevant to the known biology of the disease.

Paper Nr: 17
Title:

The Impact of the Transversion/Transition Ratio on the Optimal Genetic Code Graph Partition

Authors:

Daniyah A. Aloqalaa, Dariusz R. Kowalski, Paweł Błażej, Małgorzata Wnetrzak, Dorota Mackiewicz and Paweł Mackiewicz

Abstract: The standard genetic code (SGC) is a system of rules ascribing 20 amino acids and a stop translation signal to 64 codons, i.e., triplets of nucleotides. It was proposed that the structure of the SGC evolved to minimize harmful consequences of mutations and translational errors. To study this problem, we described the SGC structure by a graph, in which codons are vertices and edges correspond to single nucleotide mutations occurring between the codons. We also introduced weights (W) for mutation types to distinguish transversions from transitions. Using this representation, the SGC is a partition of the set of vertices into 21 disjoint subsets. In this case, the question about the potential robustness of the genetic code to mutations can be reformulated as an optimal graph clustering task. To investigate this problem, we applied an appropriate clustering algorithm, which searched for the codes characterized by the minimum average of the set W-conductances of the codon groups. Our algorithm found three best codes for various ranges of the applied weights. The average W-conductance of the SGC was the most similar to that of the best codes in the range of weights corresponding to the observed transversion/transition ratio in natural mutational pressures. However, it should be noted that the SGC is not as well optimized as the best codes. This implies that the evolution of the SGC was driven not only by selection for robustness against mutations or mistranslations but also by other factors, e.g. subsequent addition of amino acids to the code according to the expansion of amino acid metabolic pathways.
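
The graph representation described in this abstract can be sketched in a few lines of Python. The abstract does not give the exact formula, so the sketch assumes one common definition of set conductance: the total weight of edges leaving a codon group divided by the sum of the weighted degrees of its codons. It further assumes that transitions have weight 1 and transversions have weight w. All function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"
PURINES = {"A", "G"}
CODONS = ["".join(p) for p in product(BASES, repeat=3)]  # the 64 vertices


def is_transition(x: str, y: str) -> bool:
    """A transition swaps purine for purine or pyrimidine for pyrimidine."""
    return (x in PURINES) == (y in PURINES)


def edge_weight(c1: str, c2: str, w: float) -> float:
    """Weight of the edge between two codons, or 0 if they are not neighbours.

    Assumption: codons are neighbours when they differ at exactly one
    position; transitions weigh 1 and transversions weigh w.
    """
    diffs = [(a, b) for a, b in zip(c1, c2) if a != b]
    if len(diffs) != 1:
        return 0.0
    a, b = diffs[0]
    return 1.0 if is_transition(a, b) else w


def conductance(group, w: float) -> float:
    """Assumed definition: weight leaving the group / weighted degree sum."""
    group = set(group)
    cut = 0.0
    volume = 0.0
    for c in group:
        for d in CODONS:
            wt = edge_weight(c, d, w)
            volume += wt
            if d not in group:
                cut += wt
    return cut / volume


# The four glycine codons (GGN) form one group of the standard genetic code.
glycine = ["GGA", "GGC", "GGG", "GGT"]
phi = conductance(glycine, 1.0)
```

Under these assumptions, a lower conductance means that fewer weighted single-nucleotide mutations lead out of the codon group, which is the sense in which a code is robust in this formulation.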

Paper Nr: 18
Title:

On-line Searching in IUPAC Nucleotide Sequences

Authors:

Petr Procházka and Jan Holub

Abstract: We propose a novel pattern matching algorithm for consensus nucleotide sequences over the IUPAC alphabet, called BADPM (Byte-Aligned Degenerate Pattern Matching). The consensus nucleotide sequences represent a consensus obtained by sequencing a population of the same species and they are considered as so-called degenerate strings. BADPM works at the level of single bytes and it achieves sublinear search time on average. The algorithm is based on tabulating all possible factors of the searched pattern. It needs an O(m + mα^2 log m) space data structure and O(mα^2) time for preprocessing, where m is the length of the pattern and α represents the maximum number of variants implied by a 4-gram over the IUPAC alphabet. The worst-case locate time is bounded by O(nm^2α^4) for BADPM, where n is the length of the input text. However, the experiments performed on real genomic data confirmed the sublinear search time. BADPM can easily cooperate with the block q-gram inverted index and thus achieve even better locate time. We implemented two other pattern matching algorithms for IUPAC nucleotide sequences as a baseline: Boyer-Moore-Horspool (BMH) and Parallel Naive Search (PNS). PNS in particular proves efficient regardless of the length m of the searched pattern. BADPM proved clearly superior for searching medium-length and long patterns.
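
For readers unfamiliar with degenerate strings, the Python sketch below shows a naive O(nm) search over IUPAC nucleotide codes. It is not BADPM, BMH or PNS; it only illustrates the matching semantics under the assumption that two symbols match when the sets of bases they denote intersect. The function names and the toy text are hypothetical.

```python
# Standard IUPAC nucleotide codes and the bases each one denotes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
SETS = {sym: frozenset(bases) for sym, bases in IUPAC.items()}


def symbols_match(a: str, b: str) -> bool:
    """Assumed semantics: two symbols match if they share at least one base."""
    return not SETS[a].isdisjoint(SETS[b])


def naive_search(text: str, pattern: str) -> list:
    """Return every start position where pattern matches text symbol by symbol."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    return [
        i
        for i in range(len(text) - m + 1)
        if all(symbols_match(text[i + j], pattern[j]) for j in range(m))
    ]


# Toy degenerate text (hypothetical): R stands for A or G, N for any base.
hits = naive_search("ACRTNACGT", "ACGT")
```

Every alignment is checked symbol by symbol here, which is the cost that tabulation-based approaches such as the one in the abstract aim to avoid.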

Paper Nr: 20
Title:

Graph-based Characterisations of Cell Types and Functionally Related Modules in Promoter Capture Hi-C Data

Authors:

Lelde Lace, Gatis Melkus, Peteris Rucevskis, Edgars Celms, Karlis Cerans, Paulis Kikusts, Mārtiņš Opmanis, Darta Rituma and Juris Viksna

Abstract: Current technologies for chromosome conformation capture, notably Hi-C, make it possible to study a broad spectrum of functional interactions between genome elements. Although significant progress has been made, there are still many open questions regarding the best approaches for analysis of Hi-C data to identify biologically significant features. In this paper we approach this problem by focusing strictly on the topological properties of Hi-C interaction graphs. Graph topological properties were analysed from the perspective of two research questions: 1) are topological properties alone able to distinguish between different cell types and assign biologically meaningful distances between them; 2) what is a typical structure of Hi-C interaction graphs and can we assign a biological significance to structural elements or features? The analysis was applied to a set of Hi-C interactions in 17 human haematopoietic cell types. Promising results have been obtained in answering both questions. Firstly, we propose a concrete set Base11 of 11 topology-based metrics that provide good discriminatory power between cell types. Secondly, we have explored the topological features of connected components of Hi-C interaction graphs and demonstrate that such components tend to be well conserved within particular cell type subgroups and can be well associated with known biological processes.

Paper Nr: 21
Title:

Identifying and Resolving Genome Misassembly Issues Important for Biomarker Discovery in the Protozoan Parasite, Cryptosporidium

Authors:

Arthur Morris, Justin Pachebat, Guy Robinson, Rachel Chalmers and Martin Swain

Abstract: Cryptosporidium is a protozoan parasite that causes a diarrhoeal disease in humans, and which may be spread by swimming pools or infected municipal water supplies. It can be a serious health risk for individuals with weakened immune systems. Genomics has the potential to help control this pathogen, but until recently, it has not been possible to perform whole genome sequencing directly from human stool samples. This is no longer the case, and there are now at least a dozen high quality genomes available via resources like CryptoDB and NCBI, with other isolates being sequenced. The analysis of these genomes will improve current approaches for tracking sources of contamination and routes of transmission by allowing the identification of biomarkers, such as multiple-locus variable tandem repeat regions (VNTRs). However, problems remain due to highly uneven sequence coverage, which causes serious errors and artefacts in the genome assemblies produced by a number of popular assemblers. Here we discuss these assembly issues, and describe our strategy to generate genome assemblies of sufficient quality to enable the discovery of new VNTR biomarkers.

Paper Nr: 31
Title:

Pattern Matching in Discrete Models for Ecosystem Ecology

Authors:

Cinzia Di Giusto, Cédric Gaucherel, Hanna Klaudel and Franck Pommereau

Abstract: In this paper we consider discrete qualitative models of ecosystems viewed as collections of interacting living (animals, plants, ...) and nonliving entities (air, water, soil, ...), whose conditions of appearance/disappearance are controlled by a set of formal rules (i.e., processes). We present here a rule-based method for comparing ecosystems. The method relies on a measure of similarity and on an optimization algorithm. In addition, the proposed method allows the detection of patterns (i.e., ecological processes or sets of processes) in ecosystems. We have validated the method by applying it to a set of models and patterns provided by ecologists.

Paper Nr: 37
Title:

Application of Artificial Intelligence in Microwave Radiometry (MWR)

Authors:

Christoforos Galazis, Sergey Vesnin and Igor Goryanin

Abstract: Microwave radiometry has been developed more actively in recent years for medical applications. One such application is the diagnosis or monitoring of cancer. Medical radiometry presents a strong alternative to other methods of diagnosis, especially with recent gains in its accuracy. In addition, it is safe to use, noninvasive and has a relatively low cost of use. Temperature readings were taken from the mammary glands for the purpose of detecting cancer and evaluating the effectiveness of radiometry. Building a diagnostic system to automate classification of new samples requires an adequate machine learning model. The models explored were random forest, XGBoost, k-nearest neighbors, support vector machines, variants of the cascade correlation neural network, a deep neural network and a convolutional neural network. Of all the models evaluated, the best performing on the test set was the deep neural network, with a significant difference from the rest.

Paper Nr: 45
Title:

Construct Semantic Type of “Gene-mutation-disease” Relation by Computer-aided Curation from Biomedical Literature

Authors:

Dongsheng Zhao, Fan Tong and Zheheng Luo

Abstract: Background: The current semantic type of the “gene-mutation-disease” relation lacks fine-grained classification and corresponding relation signal words, which limits its usage in relation extraction from biomedical literature using text mining approaches. Methods: We propose a computer-aided curation pipeline in which open relation extraction, signal word clustering, and relation type mapping are used to analyze biomedical abstracts in order to construct the semantic type of the “gene-mutation-disease” relation. Coverage metrics are used to evaluate the defined relation type, while ClinVar is chosen as a target to test our semantic type’s usability and performance in guiding relation extraction from biomedical literature. Results: We have constructed a 5-layer and 16-category semantic type of the “gene-mutation-disease” relation with a vocabulary list containing 58 commonly used relation signal words. The vocabulary list has coverage of 95.08% and the semantic type has coverage of 94.12%. From 25 abstracts linked to 30 ClinVar records, 15 relations are correctly mapped and 8 novel relations are additionally discovered. Conclusion: The results show that our semantic type can cover the main relations between “gene”, “mutation” and “disease” and can achieve good performance in guiding relation extraction from biomedical text even using relatively out-of-date dictionary-based text mining methods.

Paper Nr: 47
Title:

Constraint Maximal Inter-molecular Helix Lengths within RNA-RNA Interaction Prediction Improves Bacterial sRNA Target Prediction

Authors:

Rick Gelhausen, Sebastian Will, Ivo L. Hofacker, Rolf Backofen and Martin Raden

Abstract: Efficient computational tools for the identification of putative target RNAs regulated by prokaryotic sRNAs rely on thermodynamic models of RNA secondary structures. While they typically predict RNA–RNA interaction complexes accurately, they yield many highly-ranked false positives in target screens. One obvious source of this low specificity appears to be the inability of current secondary-structure-based models to reflect steric constraints, which nevertheless govern the kinetic formation of RNA–RNA interactions. For example, extensions of short initial kissing hairpin interactions are often kinetically prohibited, even when thermodynamically favorable, since they would require unwinding of intra-molecular helices as well as sterically impossible bending of the interaction helix. In consequence, the efficient prediction methods, which do not consider such effects, predict over-long helices. To increase the prediction accuracy, we devise a dynamic programming algorithm that length-restricts the runs of consecutive inter-molecular base pairs (perfect canonical stackings), which we hypothesize to implicitly model the steric and kinetic effects. The novel method is implemented by extending the state-of-the-art tool INTARNA. Our comprehensive bacterial sRNA target prediction benchmark demonstrates significant improvements in prediction accuracy together with 3-4 times faster computations. These results support our hypothesis that length limitations on inter-molecular subhelices increase the accuracy of interaction prediction models compared to the current state-of-the-art approach.

Short Papers
Paper Nr: 1
Title:

Big Data Scalability of BayesPhylogenies on Harvard’s Ozone 12k Cores

Authors:

M. Manjunathaiah, A. Meade, R. Thavarajan, P. Protopapas and R. Dave

Abstract: Computational Phylogenetics is classed as a grand challenge data driven problem in the fourth paradigm of scientific discovery due to the exponential growth in genomic data, the computational challenge and the potential for vast impact on data driven biosciences. Petascale and Exascale computing offer the prospect of scaling Phylogenetics to big data levels. However, the computational complexity of even approximate Bayesian methods for phylogenetic inference requires scalable analysis for big data applications. There has been limited study of the scalability characteristics of existing computational models for petascale class massively parallel computers. In this paper we present strong and weak scaling performance analysis of BayesPhylogenies on Harvard’s Ozone 12k cores. We perform evaluations on multiple data sizes to infer the scaling complexity and find that strong scaling techniques along with novel methods for communication reduction are necessary if computational models are to overcome limitations on emerging complex parallel architectures with multiple levels of concurrency. The results of this study can guide the design and implementation of scalable MCMC based computational models for Bayesian inference on emerging petascale and exascale systems.

Paper Nr: 3
Title:

Vectorized Character Counting for Faster Pattern Matching

Authors:

Roman Snytsar

Abstract: Many modern sequence alignment tools implement fast string matching using the space-efficient data structure called an FM-index. The succinct nature of this data structure presents unique challenges for algorithm designers. In this paper, we explore the opportunities for parallelization of exact and inexact matching, and present an efficient solution for the Occ portion of the algorithm that utilizes the instruction-level parallelism of modern CPUs. Our implementation computes all eight Occ values required for the inexact match algorithm step in a single pass. We showcase the algorithm performance in a multi-core genome aligner and discuss the effects of memory prefetching.
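
In an FM-index, Occ(c, i) is the number of occurrences of symbol c in the first i characters of the Burrows-Wheeler transform. The scalar Python sketch below shows what computing eight such values in one pass can look like, assuming that the eight values are the counts of the four nucleotides at the two bounds of a suffix-array interval. It is not vectorized and is not the author's implementation; the function name occ_pair and the toy string are hypothetical.

```python
def occ_pair(bwt: str, lo: int, hi: int, alphabet: str = "ACGT"):
    """Return (occ_lo, occ_hi) in a single left-to-right pass over bwt[:hi].

    occ_lo[c] is the count of c in bwt[:lo]; occ_hi[c] is the count in bwt[:hi].
    """
    if not 0 <= lo <= hi <= len(bwt):
        raise ValueError("need 0 <= lo <= hi <= len(bwt)")
    counts = dict.fromkeys(alphabet, 0)
    occ_lo = None
    for i, ch in enumerate(bwt[:hi]):
        if i == lo:
            occ_lo = dict(counts)  # snapshot before counting position lo
        if ch in counts:           # skip symbols such as the sentinel
            counts[ch] += 1
    if occ_lo is None:             # happens only when lo == hi
        occ_lo = dict(counts)
    return occ_lo, dict(counts)


# Toy BWT-like string (hypothetical), queried at interval bounds 2 and 6.
occ_lo, occ_hi = occ_pair("ACGTACGT", 2, 6)
```

Real implementations avoid scanning from the start by storing sampled checkpoints and counting only within one block, which is where instruction-level parallelism becomes relevant.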

Paper Nr: 9
Title:

In Silico Validation of ncRNA-ncRNA Interaction Sites with ncRNAs Represented by k-mers Features

Authors:

Malik Yousef, Walid Khaleifa and Tugba Onal-Suzek

Abstract: A recent catalogue of the human transcriptome, namely the CHESS database, assembled from RNA sequencing experiments as a part of the Genotype-Tissue Expression (GTEx) Project, reported more non-coding RNA genes (21,856) than protein-coding genes (21,306), revealing an unexpectedly vast amount of transcriptional noise (Pertea et al., 2018). In this study, we introduce a workflow coded in KNIME that computationally distinguishes less reliable ncRNA-ncRNA interaction sites, which contain fewer experimentally validated binding sites, from interaction sites with more experimental validation. Duplex structure and k-mer features of the ncRNA-ncRNA binding sites with experimental verification were used as input to the classification workflow. In our analysis, we observed that although duplex structure features had no positive effect on the success rate of the classification, using just the k-mer features, ~80% success could be achieved in categorizing the confidence of the ncRNA-ncRNA binding sites. Our result confirmed the classification performance obtained for miRNA-mRNA targets using only k-mer features in our previous study (Yousef et al., 2018).

Paper Nr: 15
Title:

Gene Set Overlap: An Impediment to Achieving High Specificity in Over-representation Analysis

Authors:

Farhad Maleki and Anthony J. Kusalik

Abstract: Gene set analysis methods are widely used to analyze data from high-throughput “omics” technologies. One drawback of these methods is their low specificity or high false positive rate. Over-representation analysis is one of the most commonly used gene set analysis methods. In this paper, we propose a systematic approach to investigate the hypothesis that gene set overlap is an underlying cause of low specificity in over-representation analysis. We quantify gene set overlap and show that it is a ubiquitous phenomenon across gene set databases. Statistical analysis indicates a strong negative correlation between gene set overlap and the specificity of over-representation analysis. We conclude that gene set overlap is an underlying cause of the low specificity. This result highlights the importance of considering gene set overlap in gene set analysis and explains the lack of specificity of methods that ignore gene set overlap. This research also establishes the direction for developing new gene set analysis methods.
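
As background for the two quantities this abstract relates, the Python sketch below computes an over-representation p-value with the standard one-sided hypergeometric test and measures the overlap of two gene sets with the Jaccard index. The abstract does not state how the authors quantify overlap, so the Jaccard index is an assumption, as are the function names and toy numbers.

```python
from math import comb


def ora_p_value(universe: int, gene_set: int, hits: int, overlap: int) -> float:
    """One-sided hypergeometric p-value P(X >= overlap).

    universe: number of genes in the background
    gene_set: number of background genes that belong to the gene set
    hits:     number of differentially expressed genes
    overlap:  number of differentially expressed genes inside the gene set
    """
    upper = min(gene_set, hits)
    total = comb(universe, hits)
    favourable = sum(
        comb(gene_set, x) * comb(universe - gene_set, hits - x)
        for x in range(overlap, upper + 1)
    )
    return favourable / total


def jaccard(a, b) -> float:
    """Assumed overlap measure: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy numbers (hypothetical): all 3 hits fall inside a 3-gene set out of 10 genes.
p = ora_p_value(universe=10, gene_set=3, hits=3, overlap=3)
overlap_score = jaccard({"g1", "g2", "g3"}, {"g2", "g3", "g4"})
```

When two gene sets share many genes, a hit list enriched in one tends to look enriched in the other as well, which is the mechanism the abstract proposes for the loss of specificity.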

Paper Nr: 16
Title:

A General Framework for Exact Partially Local Alignments

Authors:

Falco Kirchner, Nancy Retzlaff and Peter F. Stadler

Abstract: Multiple sequence alignments are a crucial intermediate step in a plethora of data analysis workflows in computational biology. While multiple sequence alignments are usually constructed with the help of heuristic approximations, exact pairwise alignments are readily computed by dynamic programming algorithms. In the pairwise case, local, global, and semi-global alignments are distinguished, with key applications in pattern discovery, gene comparison, and homology search, respectively. With increasing computing power, exact alignments of triples and even quadruples of sequences have become feasible and recent applications e.g. in the context of breakpoint discovery have shown that mixed local/global multiple alignments can be of practical interest. vaPLA is the first implementation of partially local multiple alignments of a few sequences and provides convenient access to this family of specialized alignment algorithms.
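
The exact pairwise case mentioned in this abstract can be illustrated with the classic dynamic programming recurrence for global alignment. The Python sketch below computes only the optimal score and keeps a single row of the table in memory; the scoring values (match 1, mismatch -1, linear gap -1) and the function name are assumptions, and the sketch does not reproduce vaPLA or its partially local variants.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score of a and b under a linear gap cost."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


score = global_alignment_score("ACGT", "AGT")
```

Local and semi-global variants change only the initialization of the table and where the final score is read; extending the same recurrence to three or four sequences multiplies the number of table dimensions, which is why exact multiple alignments are feasible only for a few sequences.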

Paper Nr: 25
Title:

Ontology-powered Semantic Similarity of Biological and Biomedical Entities - The Story So Far, and The Road Ahead

Authors:

Prashanti Manda

Abstract: The widespread use of ontologies in biology and biomedicine has led to the creation of ontology-powered data stores and has paved the way for large-scale computational analyses. Semantic similarity, the assessment of “relatedness” between biological objects such as genes, diseases, and phenotypes, has become a crucial tool for many applications. Over the years, a number of similarity metrics have been developed and used for applications such as identifying functionally similar proteins, comparing human disease phenotypes to model organism models for disease diagnosis, connecting evolutionary phenotypes to model organism gene phenotypes, etc. The vast variety of semantic similarity metrics and the diverse applications make it daunting for new adopters to select and apply an appropriate metric. While semantic similarity metrics abound, critical issues such as standardized performance evaluations, robustness, sensitivity, effect of parametric choices, and computational complexity of these metrics remain largely unexplored. This study presents a position on several critical issues that impact the accuracy and confidence of semantic similarity results. A comprehensive review of similarity metrics in four categories, along with applications in biological and biomedical domains, is also included.

Paper Nr: 28
Title:

The Composition of Dense Neural Networks and Formal Grammars for Secondary Structure Analysis

Authors:

Semyon Grigorev and Polina Lunina

Abstract: We propose a way to combine formal grammars and artificial neural networks for processing biological sequences. Formal grammars encode the secondary structure of the sequence and neural networks deal with mutations and noise. In contrast to the classical approach, in which probabilistic grammars are used for secondary structure modeling, we propose to use arbitrary (not probabilistic) grammars, which simplifies grammar creation. Instead of modeling the structure of the whole sequence, we create a grammar which only describes features of the secondary structure. Then we use undirected matrix-based parsing to extract features: the fact that some substring can be derived from some nonterminal is a feature. After that, we use a dense neural network to process features. In this paper, we describe in detail all the parts of our recipe: a grammar, parsing algorithm, and network architecture. We discuss possible improvements and future work. Finally, we provide the results of tRNA and 16s rRNA processing, which show the applicability of our idea to real problems.

Paper Nr: 33
Title:

Towards an Efficient Verification Method for Monotonicity Properties of Chemical Reaction Networks

Authors:

Roberta Gori, Paolo Milazzo and Lucia Nasti

Abstract: One of the main goals of systems biology is to understand the behaviour of (bio)chemical reaction networks, which can be very complex and difficult to analyze. Often, dynamical properties of reaction networks are studied by performing simulations based on the Ordinary Differential Equations (ODEs) models of the reactions’ kinetics. For some kinds of dynamical properties (e.g. robustness) simulations have to be repeated many times by varying the initial concentration of some components of interest. In this work, we propose sufficient conditions that guarantee the existence of monotonicity relationships between the variation of the initial concentration of an “input” biochemical species and the concentration (at all times) of an “output” species involved in the same reaction network. Our sufficient conditions allow monotonicity properties to be verified efficiently by exploring a dependency graph constructed on the set of species of the reaction network. Once established, monotonicity allows us to drastically restrict the number of simulations required to prove dynamical properties of the chemical reaction network.

Paper Nr: 34
Title:

Residual Convolutional Neural Networks for Breast Density Classification

Authors:

Francesca Lizzi, Stefano Atzori, Giacomo Aringhieri, Paolo Bosco, Carolina Marini, Alessandra Retico, Antonio C. Traino, Davide Caramella and M. E. Fantacci

Abstract: In this paper, we propose a data-driven method to classify mammograms according to breast density in the BI-RADS standard. About 2000 mammographic exams have been collected from the “Azienda Pisana” (AOUP, Pisa, IT). The dataset has been classified according to breast density in the BI-RADS standard. Once the dataset had been labeled by a radiologist, we built a Residual Neural Network in order to classify breast density in two ways. First, we classified mammograms using two “super-classes”, dense and non-dense breast. Second, we trained the residual neural network to classify mammograms according to the four classes of the BI-RADS standard. We evaluated the performance in terms of accuracy and we obtained very good results compared to other works on similar classification tasks. In the near future, we are going to improve the results by increasing the computing power, by improving the quality of the ground truth and by increasing the number of samples in the dataset.

Paper Nr: 36
Title:

Applying Deep Learning Models to Action Recognition of Swimming Mice with the Scarcity of Training Data

Authors:

Ngoc G. Nguyen, Mera K. Delimayanti, Bedy Purnama, Kunti R. Mahmudah, Mamoru Kubo, Makiko Kakikawa, Yoichi Yamada and Kenji Satou

Abstract: Deep learning models have shown their ability to model complicated problems more efficiently than other machine learning techniques in many application fields. For human action recognition tasks, the current state-of-the-art models are deep learning models. However, they have not been well studied for animal behaviour recognition due to the lack of data required for training these models. Therefore, in this research, we propose a method to apply deep learning models to recognize the behaviours of a swimming mouse in two mouse forced swim tests with a limited amount of training data. We used deep learning models which are used in human action recognition tasks and fine-tuned them on the largest publicly available mouse behaviour dataset to give the models knowledge about mouse behaviour recognition tasks. Then we fine-tuned the models one more time using the small amount of data that we had annotated for our swimming mouse behaviour recognition tasks. The good performance of these models in the new tasks demonstrated the efficiency of our approach.

Paper Nr: 38
Title:

Prediction of Subnuclear Location for Nuclear Protein

Authors:

Kenji Satou, Yoshiki Shimaguchi, Kunti R. Mahmudah, Ngoc G. Nguyen, Mera K. Delimayanti, Bedy Purnama, Mamoru Kubo, Makiko Kakikawa and Yoichi Yamada

Abstract: To perform its biomolecular function, a protein must be transported to a specific location in the cell. Within the nucleus, too, a nuclear protein has its own location where it fulfils its role. In this study, the subnuclear location of nuclear proteins was predicted from protein sequence using a deep learning algorithm. As a dataset for experiments, 319 non-homologous protein sequences with class labels corresponding to 13 classes of subcellular localization (e.g. "Nuclear envelope") were selected from public databases. In order to achieve better performance, various combinations of feature generation methods, classification algorithms, parameter tuning, and feature selection were tested. Among 17 methods for generating features of protein sequences, Composition/Transition/Distribution (CTD) generated the most effective features. They were further selected by the randomForest package for R. Using the selected features, quite high accuracy (99.91%) was achieved by a deep neural network with seven hidden layers, maxout activation function, and RMSprop optimization algorithm.

Paper Nr: 42
Title:

Loop Grammars to Identify RNA Structural Patterns

Authors:

Michela Quadrini, Emanuela Merelli and Riccardo Piergallini

Abstract: The biological functions of an RNA molecule are largely determined by its molecular configuration. Understanding the link between the structure and the biological functions has been considered one of the challenges in biology. In this study, we address the problem of identifying a given structural pattern in an RNA pseudoknot-free secondary structure. We introduce a context-free grammar, Loop Grammar, that formalizes the primary and secondary structure of an RNA molecule as a composition of loops. Such a composition is expressed as the concatenation or nesting of the simplest structural elements, hairpins, generated during the folding process when a bond between two nonconsecutive nucleotides is established. Then, we formalize the concatenation and nesting on Fatgraphs, oriented surfaces with boundary, and we define a Surface Loop Grammar, whose algebraic expressions uniquely identify such surfaces associated with given RNA structures. The terms of the Loop Grammar allow us to address the problem of identifying substructures considering both the primary and secondary structures, while the strings generated by the Surface Loop Grammar make it possible to identify a given structural pattern in a secondary structure in terms of relations among hairpins. Both rely on string pattern matching.

Paper Nr: 50
Title:

Study of Dipeptidil Peptidase 4 Inhibitors based on Molecular Docking Experiments

Authors:

A. A. Saraiva, J. N. Soares, Nator C. Costa, José M. Sousa, N. F. Ferreira, Antonio Valente and Salviano Soares

Abstract: Lack of physical activity and poor nutrition trigger various diseases, among them diabetes. In this context, many studies seek ways to mitigate these diseases and provide a better quality of life for people. The present work therefore aims to analyze possible inhibitors of the enzyme Dipeptidil Peptidase 4, in order to formulate hypotheses for the creation of new drugs through molecular docking techniques, that is, computational simulation of combinations of drugs of the gliptin family with other antidiabetics (metformin, glyburide and cucurbitacin). Among the results, it was observed that the antidiabetic cucurbitacin combined with the gliptins obtained greater energy during the process.

Paper Nr: 4
Title:

Prediction of Malaria Vaccination Outcomes from Gene Expression Data

Authors:

Ahmad Shayaan, Indu Ilanchezian and Shrisha Rao

Abstract: Vaccine development is a laborious and time-consuming process and can benefit from statistical machine learning techniques, which can produce general outcomes based on the patterns observed in the limited available empirical data. In this paper, we show how limited gene expression data from a small sample of subjects can be used to predict the outcomes of a malaria vaccine. We also draw inferences from the gene expression data, which has over 22000 columns (or features), by visualizing the data, and we reduce the data dimensions based on these inferences for efficient model training. Our methods are general and reliable and can be extended to vaccines developed against any pathogen. Given the gene expression data from a sample of subjects who were administered a novel vaccine, our methods can be used to test the outcome of that vaccine, without the need for empirical observations on a larger population. By carefully tuning the available data and the machine learning models, we are able to achieve greater than 98% accuracy, with sensitivity and specificity of 0.93 and 1 respectively, in predicting the outcomes of the malaria vaccine in developing immunogenicity against the malaria pathogen.
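For readers less familiar with the reported metrics, the following minimal sketch shows how accuracy, sensitivity and specificity are derived from binary predictions. The labels are hypothetical toy values, not data from the paper, and the function name `binary_metrics` is an assumption made for illustration.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute accuracy, sensitivity and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # fraction of positives recovered
        "specificity": tn / (tn + fp),  # fraction of negatives recovered
    }


# Hypothetical labels: 1 = responder, 0 = non-responder.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0]
print(binary_metrics(truth, predicted))
```

A specificity of 1 alongside a lower sensitivity, as in the abstract, means the classifier produced no false positives but missed some true positives.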

Paper Nr: 6
Title:

Artificial Neural Network Approach to Prediction of Protein-RNA Residue-base Contacts

Authors:

Morihiro Hayashida, Jose Nacher and Hitoshi Koyano

Abstract: Protein-RNA complexes play essential roles in a cell, and are involved in the post-transcriptional regulation of gene expression. Therefore, it is important to analyze and elucidate the structures of protein-RNA complexes and also the contacts between residues and bases in their interactions. A method based on conditional random fields (CRFs) was developed for predicting residue-base contacts using evolutionary relationships between individual positions of a residue and a base. Further, the probabilistic model was modified to improve the prediction accuracy. Recently, many researchers have focused on deep neural networks because of their classification performance. In this paper, we develop a neural network with five layers for predicting residue-base contacts. In computational experiments, in terms of the area under the receiver operating characteristic curve (AUC), the predictive performance of our proposed method was comparable to or better than that of the CRF-based methods.

Paper Nr: 11
Title:

Latest Advances in Solving the All-Pairs Suffix Prefix Problem

Authors:

Maan H. Rachid

Abstract: Finding the overlaps between sequences that are generated by Next Generation Sequencing (NGS) technology is a time- and space-consuming step in building a string graph in genome assembly. The problem is known in computer science as all-pairs suffix prefix (APSP). It has been studied since 1992 and several solutions have been presented. While some of them achieve optimal theoretical time complexity, they have very high space consumption and are slow in practice because of a large constant factor. Some other recent solutions consume much less space and time in practice to solve APSP, even though they adopt techniques and data structures that do not have optimal worst-case asymptotic complexity. A few other studies tackled the approximate version of the overlap problem, hoping to avoid error-detecting stages in genome assembly. These solutions used the same data structures that were employed to solve APSP, in addition to some advanced techniques that address the complexity of approximate matching. In this work, we evaluate these recent algorithms, in terms of time and space, in both exact and approximate formats. Our results show that FastAPSP has the best time consumption unless the data set is large. The high space demand of such large data sets would favor the use of SOF and Readjoiner. Our experiments also show that AOF is, in general, faster than FM unless the data set is small and repetitive. In addition, it can handle large data sets that cannot be processed by FM.
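To make the problem statement concrete, here is a deliberately naive sketch of exact APSP: for every ordered pair of reads, find the longest suffix of the first that is a prefix of the second. It is a brute-force baseline for illustration only and is not any of the tools evaluated in the paper (FastAPSP, SOF, Readjoiner, AOF, FM); the function names and the toy reads are hypothetical.

```python
def longest_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def all_pairs_suffix_prefix(reads: list[str]) -> dict[tuple[int, int], int]:
    """Overlap length for every ordered pair (i, j) with i != j."""
    return {
        (i, j): longest_overlap(reads[i], reads[j])
        for i in range(len(reads))
        for j in range(len(reads))
        if i != j
    }


reads = ["ACGTAC", "TACGGA", "GGACGT"]  # toy reads, not real sequencing data
for (i, j), length in sorted(all_pairs_suffix_prefix(reads).items()):
    print(f"{reads[i]} -> {reads[j]}: overlap {length}")
```

This baseline performs a quadratic number of pair comparisons, each of which may scan up to the full read length, which is why practical solutions rely on indexed data structures instead.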

Paper Nr: 22
Title:

Data-driven Autism Biomarkers Selection by using Signal Processing and Machine Learning Techniques

Authors:

Antonio Antovski, Stefani Kostadinovska, Monika Simjanoska, Tome Eftimov, Nevena Ackovska and Ana M. Bogdanova

Abstract: To analyze microarray gene expression data from a homogeneous group of children diagnosed with classic autism, a synergy of signal processing and machine learning techniques is proposed. The main focus of the paper is gene expression preprocessing, which relies on the Fractional Fourier Transformation; the resulting data are then used for biomarker selection with an entropy-based method. This is a crucial step for identifying the most informative genes (biomarkers) in terms of their discriminative power between the autistic and the control (healthy) group. The relevance of the selected biomarkers is tested using discriminative and generative machine learning classification algorithms. Furthermore, a data-driven approach is used to evaluate the performance of the classifiers with two performance measures (sensitivity and specificity). The evaluation showed that the model learned by Naive Bayes provides the best results. Finally, a reliable biomarker set is obtained and each gene is analyzed in terms of its chromosomal location and compared with the critical chromosomes published in the literature.

Paper Nr: 23
Title:

Effect of Database Size in the Genetic Variants Calling

Authors:

Sunhee Kim, Young-Suk Lee and Chang-Yong Lee

Abstract: Base quality score recalibration (BQSR) is an important step in variant calling from high-throughput sequence data. Motivated by the fact that BQSR necessarily requires a database of known variants such as dbSNP, we present an extensive analysis of BQSR results for the human and rice genomes. We showed that the recalibration results depended on the size of the database: the more variants there are in the database, the larger the average value of the recalibrated base quality scores. This implies that the recalibrated quality score is lower than it should be when the number of variants in the database is not large enough. Based on the finding that the size of the database plays a crucial role in BQSR, we proposed a method to create a database when the existing one is not large enough for BQSR results to be reliable. We demonstrated that, in the case of human, the database constructed by the proposed method generated almost the same results as the human dbSNP. In the case of rice, however, we showed that the proposed database is more reasonable than the rice dbSNP.
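The dependence on database size can be illustrated with a simplified model, which is an assumption for illustration and not the BQSR procedure used in the paper. Mismatches against the reference are counted as sequencing errors unless they fall on a known variant site, and the empirical quality is the Phred-scaled error rate, Q = -10 log10(p). The pseudo-counts, the function name `empirical_quality` and the toy positions are all hypothetical.

```python
import math


def empirical_quality(mismatches: set[int], known_sites: set[int], n_bases: int) -> float:
    """Phred-scaled empirical quality after masking known variant sites.

    Simplifying assumptions: every position 0..n_bases-1 is observed once,
    all known sites lie in that range, and +1/+2 pseudo-counts avoid log(0).
    """
    errors = len(mismatches - known_sites)    # mismatches still counted as errors
    observations = n_bases - len(known_sites)  # masked sites are not observed
    p_error = (errors + 1) / (observations + 2)
    return -10 * math.log10(p_error)


mismatches = {10, 250, 400, 777}       # toy mismatch positions
small_db = {250}                       # few known variants
large_db = {250, 400, 777}             # more known variants
print(round(empirical_quality(mismatches, small_db, 1000), 2))
print(round(empirical_quality(mismatches, large_db, 1000), 2))
```

With the larger database, true variants are no longer mistaken for sequencing errors, so the estimated error rate falls and the recalibrated quality rises, which matches the direction of the effect reported in the abstract.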

Paper Nr: 27
Title:

Mean and Variability in RNA Polymerase Numbers Are Correlated to the Mean but Not the Variability in Size and Composition of Escherichia Coli Cells

Authors:

Bilena Almeida, Vatsala Chauhan, Vinodh Kandavalli and Andre Ribeiro

Abstract: Cell morphology differs with cell physiology in general and with gene expression in particular. We investigate the degree to which these relationships differ with medium richness. Using Escherichia coli cells with fluorescently tagged β’ subunits, flow cytometry, and statistical analysis, we study at the single-cell level the correlation between parameters associated with cell morphology and composition (FSC, SSC, and Width channels) and GFP-tagged RNA polymerase (RNAp) levels (FITC channel). From measurements in three media differing in richness (M63, LB, and TB) and, thus, in cell growth rates, we find that the mean and cell-to-cell variability in RNAp levels are correlated with the mean values of FSC, SSC, and/or Width. Further, in all growth conditions considered, RNAp levels are positively correlated with FSC, SSC, and Width at the single-cell level, with the correlation decreasing as medium richness increases. Overall, the results suggest that the mean and cell-to-cell variability in levels of RNAp, a master regulator of gene expression, are correlated with the mean values of the parameters assessing cellular morphology and composition, as measured by flow cytometry, but not with the degree of variability of these parameter values.

Paper Nr: 29
Title:

Monte Carlo Methods for Assessment of the Mean Glandular Dose in Mammography: Simulations in Homogeneous Phantoms

Authors:

R. M. Tucciariello, P. Barca, D. Caramella, R. Lamastra, C. Traino and M. E. Fantacci

Abstract: The aim of this study is to perform personalized dosimetry in digital mammography using Monte Carlo simulations. We developed a GEANT4-based application that reproduces mammographic examinations and can be configured for different setups and conditions. Mean Glandular Dose (MGD) is estimated for different compressed breast sizes and compositions. Breast compositions are obtained as homogeneous mixtures of glandular and adipose tissues. The simulated setup reproduces the Hologic Selenia® Dimensions® Mammography System, and the TASMIPM tool has been employed to derive the photon fluence from the X-ray source. The influence of different skin models is also investigated, deriving the mean glandular dose in the breast using adipose tissue for different skin thicknesses, from 2 mm to 5 mm, and a dedicated composition found in the literature with a specific thickness of 1.45 mm. We observed the effect of the different photon shielding properties on the MGD values.

Paper Nr: 35
Title:

Clustering and Classification of Breathing Activities by Depth Image from Kinect

Authors:

Mera K. Delimayanti, Bedy Purnama, Ngoc G. Nguyen, Kunti R. Mahmudah, Mamoru Kubo, Makiko Kakikawa, Yoichi Yamada and Kenji Satou

Abstract: This paper describes a new approach to non-contact capture of breathing activities using the Kinect depth sensor. To process the data, we applied feature extraction to the time series of mean depth values, followed by an optional feature reduction step. A machine learning algorithm was then used to cluster the resulting data. Classification was carried out on four different subjects, using 10-fold cross-validation and a Support Vector Machine (SVM) classifier. The most effective classifier was the SVM with a radial kernel tuned by grid search, which reached the best accuracy for all of the subjects.

Paper Nr: 39
Title:

Human Ovulation Hidden Hints and It’s Effects on Fluctuant Assymetry Studies

Authors:

Mahsa Kiazadeh, Gabriela Goncalves and Hamid R. Shahbazkia

Abstract: This work investigates concealed ovulation in humans solely by analysing possible facial changes. Ordinarily, human ovulation remains concealed; in other words, there is no visible external sign of the monthly period in humans. Such external signs are clearly visible in many animals such as baboons, dogs or elephants. Some are visual (baboons) and others are biochemical (dogs). Insects use pheromones, and other animals can use sounds to inform partners of their fertile period. The objective is not only to study the visual signs of female ovulation but also to understand and explain automatic image processing methods that could be used to extract precise landmarks from facial pictures. This could later be applied to studies of fluctuant asymmetry. Fluctuant asymmetry is a growing field in evolutionary biology, but it cannot be developed easily because of the time needed to extract landmarks manually. In this work we examined whether such signs, if present in the human face during ovulation, could be detected either by computer vision or by human observers. We photographed 50 girls for 32 days, taking many photos each day. At the end we chose a set of 600 photos, 15 photos per girl, representing the whole monthly cycle of 40 women. The photos were organized in rating software that allowed human raters to view them and choose the 2 best-looking pictures of each girl. These results were then checked to highlight the relation between the chosen photos and the ovulation period in the cycle. The results indicated that there are indeed some clues in the human face that could give a hint about ovulation. Later, different automatic landmark detection methods were applied to the pictures to detect landmarks that could show changes in the face during the period. Although the precision of the methods tested is far from perfect, comparing these measurements with state-of-the-art indexes of beauty shows a slight modification of the face towards a prettier face during ovulation. The automatic methods tested were the Active Appearance Model (AAM), deep neural learning and regression trees. For this kind of application, the best method was found to be regression trees. Future work is needed to confirm these findings: the number of human raters should be increased and a proper training database should be developed to allow a learning process specific to this problem. We also think that low-level image processing will be necessary to achieve the final precision that could reveal more details of possible changes in human faces.

Paper Nr: 40
Title:

On the Problem about Optical Flow Prediction for Silhouette Image

Authors:

Bedy Purnama, Mera K. Delimayanti, Ngoc G. Nguyen, Kunti R. Mahmudah, Mamoru Kubo, Makiko Kakikawa, Yoichi Yamada and Kenji Satou

Abstract: We address the problem of finding an algorithm to handle unclear flows in the inner area of a silhouette image. In this study, after obtaining the region of interest from an implementation of the dual TV-L1 optical flow algorithm, we conducted experiments to predict the new flow in the unclear inner areas. The experiments include perspective transform methods. Five experiments were performed using these methods. As a result, an algorithm that uses double refining of the perspective transform method obtained the best result. It might be useful for analyzing the motion of a unicolored animal (e.g. a black cat).

Paper Nr: 41
Title:

Towards Multi-UAV and Human Interaction Driving System Exploiting Human Mental State Estimation

Authors:

Gaganpreet Singh, Raphaëlle N. Roy and Caroline C. Chanel

Abstract: This paper addresses the growing issue of human-multi-UAV interaction. Current active approaches towards a reliable multi-UAV system are reviewed. This brings us to the conclusion that the control paradigm for multiple Unmanned Aerial Vehicles (UAVs) is segmented into two main scopes: i) autonomous control and coordination within the group of UAVs, and ii) a human-centered approach with helping agents and overt behavior monitoring. Therefore, to move forward on the human-multi-UAV interaction problem, a new perspective is put forth. In the following sections, a brief overview of the system is provided, followed by the current state of multi-UAV research and how taking the human pilot’s physiology into account could improve the interaction. This idea is developed first by detailing what physiological computing is, including mental states of interest and their associated physiological markers. Second, the article concludes with the proposed approach for human-multi-UAV interaction control and future plans.

Paper Nr: 46
Title:

A Proposal for a Language Combining Biochemical Rules and Topological Structure for Systems Biology

Authors:

Anasthasie J. Compaore and Pascale L. Gall

Abstract: For about twenty years, rule-based modelling has been widely used for Systems Biology issues. Most existing languages focus primarily on biochemical reactions and, to a lesser extent, on the structure of the cell in compartments. BIOCHAM and Pathway Logic Assistant (PLA) are representative examples of such rule-based languages. They are equipped with tools providing great analysis capabilities. We propose to provide such biochemical languages with annotations relating to the compartments in which the biochemical reactions take place. We ensure that biochemical rules always indicate the nature of the compartments involved and the neighbourhood relations between them. In the end, it suffices to specialize the generic rules according to a particular topological structure in order to obtain sets of localized rules. The resulting models can thus be analysed using either BIOCHAM or PLA.

Paper Nr: 49
Title:

The Importance of Considering Natural Isotopes in Improving Protein Identification Accuracy

Authors:

Sara El Jadid, Raja Touahni and Ahmed Moussa

Abstract: Many tools in proteomics are based on accurate identification of the peptides contained in a sample. In fact, identification is the foundation of the entire proteomics workflow, where all subsequent steps depend on the quality of the data generated at the beginning. The accuracy of the data generated not only allows good results but also ensures consistency at the end of the analysis. There is a consensus about the factors that affect this accuracy. It is commonly assumed that exploiting the physics and chemistry of peptides deduced from sequences can improve identification accuracy. In fact, considering natural isotopes when quantifying peptides will considerably improve results. This paper presents findings that support this view. We explored the difference between the nominal mass (which considers the most abundant isotope of each element) and the mean mass (which considers the abundance of each element). We noticed that within a biomolecule, the larger the number of elements, the less negligible this difference becomes. Accordingly, peptide misidentification is due to this variance. These findings reveal that including natural isotopes during quantification will play a key role in improving identification accuracy. This study could lead us to design alternative identification tools combining better sensitivity and specificity.
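The growth of the mass difference with molecule size can be checked with a short calculation. The sketch below is an illustration and not the authors' method. It assumes that the abstract's "nominal mass" corresponds to the mass computed from the most abundant isotope of each element (commonly called the monoisotopic mass) and that the "mean mass" corresponds to the mass computed from standard average atomic weights. The rounded atomic masses and the names `mass` and `mass_gap` are choices made for this example.

```python
# Mass of the most abundant isotope of each element, in daltons (rounded).
MOST_ABUNDANT_ISOTOPE = {"C": 12.0, "H": 1.00782503, "N": 14.0030740,
                         "O": 15.9949146, "S": 31.9720707}
# Standard average atomic weights, in daltons (rounded).
AVERAGE_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007,
                  "O": 15.999, "S": 32.06}


def mass(composition: dict[str, int], table: dict[str, float]) -> float:
    """Sum atomic masses over an elemental composition such as {'C': 2, ...}."""
    return sum(table[element] * count for element, count in composition.items())


def mass_gap(composition: dict[str, int]) -> float:
    """Average-weight mass minus most-abundant-isotope mass."""
    return mass(composition, AVERAGE_WEIGHT) - mass(composition, MOST_ABUNDANT_ISOTOPE)


glycine = {"C": 2, "H": 5, "N": 1, "O": 2}          # C2H5NO2
decaglycine = {"C": 20, "H": 32, "N": 10, "O": 11}  # ten glycine residues plus water

for name, formula in [("glycine", glycine), ("decaglycine", decaglycine)]:
    print(f"{name}: gap = {mass_gap(formula):.4f} Da")
```

The gap is a few hundredths of a dalton for a single amino acid but grows roughly in proportion to the number of atoms, so for peptide-sized molecules it is no longer negligible.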