David Hoksza - Homepage

Density-Based Classification of Protein Structures Using Iterative TM-score

Finding similarity between a pair of protein structures is one of the fundamental tasks in many areas of bioinformatical research such as protein structure prediction, function mapping, etc. We propose a method for finding pairing of amino acids based on densities of the structures and we also propose a modification to the original TM-score rotation algorithm that assess similarity score to this alignment. Proposed modification is faster than TM and comparably robust according to non-optimal parts in the alignment. We measure the qualities of the algorithm in terms of SCOP classification accuracy. Regarding the accuracy, our solution outperforms the contemporary solutions at two out of three tested levels of the SCOP hierarchy.

On the Effectiveness of Distances Measuring Protein Structure Similarity

Abstract—Determining similarity between two protein structures is one of the most fundamental problems in contemporary structural bioinformatics. With the increasing complexity of the measures, their effectiveness increases as well. However, other important observables, such as the degree of metric properties fulfilment, could rather deteriorate than improve. In this paper we introduce an effective measure and study its degree of metric properties fulfilment.

DDPIn - Distance and Density Based Protein Indexing

Methods for protein structure classification have many applications in protein function prediction and associated fields i.e. drug discovery and others. In this paper we propose a new method for representing protein structures in order to quickly search and classify big collections of structures. In our approach, each protein structure is represented by number of vectors (based on histogram of distances) equivalent to the number of its C-alpha residues. Each C-alpha residue represents a viewpoint from which distances to each of the other residues are computed. Consequently, we use serveral methods to convert these distances into a n-dimensional feature vector which is indexed using a metric indexing structure (M-tree is the structure of our choice). While searching, we use single or multi-step approach which provides us classification accuracy and speed comparable with the best contemporary classification methods.

An Application of the Metric Access Methods to the Mass Spectrometry Data

Mass spectrometry is a very popular method for protein and peptide identification nowadays. Abundance of data generated in this way grows exponentially every year and although there exist algorithms for interpreting mass spectra, demand for faster and more accurate approaches remains.

We propose an approach for preprocessing the protein sequence database based on metric access methods. This approach allows to select only a small set of suitable peptide sequence candidates, which can be then compared with experimental spectra using more sophisticated algorithms. We define logarithmic distance for selecting peptide sequence candidates and also outline possibilities of using the interval query for searching posttranslational modifications.

The experimental results show that our approach is comparable in precision with nowadays most widely used public tools and outline possible directions for further resarch.

Methods of Protein Structure Alignment

One of the most crucial discoveries in proteomics is that similar proteins (in both sequence and structure) handle similar biological functions. This knowledge together with growing size of protein structures with known function is the reason why field of protein structural alignment has been very active in the last twenty years. To handle increasing amount of structures with unknown function, many algorithms for pairwise structural alignment have emerged. Together with tools for optimal superposition of protein structures, also measures for quantifying similarity and ways to represent features of individual structures has been proposed. In this paper we focus on categorizing measures, algorithms and presenting their main ideas.

Native Multidimensional Indexing in Relational Databases

In current database systems there is a strong need for searching data according to many attributes. In commercial database platforms the standard search over multiple attributes is provided by B$^{+}$-tree (or it's variants) with compound keys. On the other hand, the systems provide also multidimensional indexing, however, just for spatial purposes (such as GIS or CAD systems) and use special data types and syntax. In this paper we propose a native multidimensional method for indexing tables with simple attributes, such that multi-attribute queries can be processed (with standard SQL queries) more efficiently than by simple B+-tree with compound keys. For implementation, we have used the PostgreSQL and R-tree-based index. With this combination we outperformed commercial platforms by an order of magnitude in the number of accesses to index. As a by-product, a framework for easy implementation of external indexing methods into PostgreSQL was designed.

Improved Alignment of Protein Sequences Based on Common Parts

Searching databases of protein sequences has been an important task in bioinformatics in the last twenty years and it's importance has grown with increasing size of protein sequence databases. Because of this growth, there have been efforts to speed-up the original quadratic complexity searching algorithm that became too slow with regard to the volume of the databases. To do so, heuristic approaches have been proposed that trade accuracy of the search for speed due to the linear complexity of such methods. During the time, accuracy of heuristic methods has grown but exact search is still needed in some cases. Because of this need there have been attempts to speed-up the search while maintaining the rigorous nature of the search (i.e. index protein sequences, modification the algorithm without loss of accuracy, ...). These attempts were not successful (at least not enough successful to beat heuristic approaches) and the only exact search algorithm stays SSEARCH (or it's clones) which sequentially scans database of protein sequences and performs full alignment against each of the sequence.
Since exact search is still needed we focus on improving the sequential search algorithm. Unlike methods that benefit from decreasing number of computations of the distance function, we try to decrease costs needed to compute the distance function itself, when used with large datasets. This is achieved by reusing alignment calculations of common parts of the protein sequences without loss of accuracy.
With this method we reduced the computational costs by up to 20% depending of the database size and subset used. We also implemented approximate search which further reduced computational for the price of some accuracy loss.

Metric indexing of protein databases and promising approaches

Most widely used biological databases nowadays are nucleotide and protein ones. These databases are crucial for determination of biological functions of living organisms with respect to their DNA structure. The biological function of a protein can be derived from the similarity with another protein with known function which is stored in a database and therefore the chance of finding the biological function of given protein or DNA sequence grows with size of the database. Because of this fact, the growth is exponential which in turn calls for sublinear methods of searching these databases. Optimal solution is aligning the query sequence with all sequences in the queried database. Since aligning of two sequences is computationally expensive, fast heuristic methods (e.g. BLAST [Altschul et al., 1997]) are used although they can only approximate the optimal solution without restricting the resulting error. In this paper we try to use metric access methods (MAMs) for exact and approximate searching through protein databases. As experiments show, such a straightforward use of MAMs is not very suitable, therefore we also show possible further directions in the area of indexing protein sequences based on the so far learned facts.

Improving the Performance of M-tree Family by Nearest-Neighbor Graphs

The M-tree and its variants have been proved to perform efficient similarity search in database environments. In order to further improve their performance, in this paper we propose an extension of M-tree family, which makes use of nearest-neighbor (NN) graphs. Each tree node maintains its own NN-graph, a structure that stores for each entry in the node the reference (and distance) to its nearest neighbor, considering just the entries in thucleotide or protein sequences, finding a local alignment of two sequences is one of the main tasks. Since the sizes of available databases grow constantly, the efficiency of retrieval methods becomes the critical issue. The sequence retrieval relies on finding sequences in the database which align best with the query sequence. However, an optimal alignment can be found in quadratic time (by use of dynamic programming) while this is infeasible when dealing with large databases. The existing solutions use fast heuristic methods (like BLAST, FASTA) which produce only an uncontrolled approximation of the best alignment and even do not provide any information about the alignment approximation error. In this paper we propose an approach of exact and approximate indexing using several metric access methods (MAMs) in combination with the TriGen algorithm, in order to reduce the number of alignments (distance computations) needed. The experimental results have shown that a straightforward adoption of MAMs to sequence retrieval cannot outperform the specialized heuristic algorithms (at least at the moment). On the other side, the results show MAMs could provide a basis for specialized access methods capable of precision/efficiency trade-off control.

Index-based approach to similarity search in protein and nucleotide databases

Construction ofd in quadratic time (by use of dynamic programming) while this is infeasible when dealing with large databases. The existing solutions use fast heuristic methods (like BLAST, FASTA) which produce only an uncontrolled approximation of the best alignment and even do not provide any information about the alignment approximation error. In this paper we propose an approach of exact and approximate indexing using several metric access methods (MAMs) in combination with the TriGen algorithm, in order to reduce the number of alignments (distance computations) needed. The experimental results have shown that a straightforward adoption of MAMs to sequence retrieval cannot outperform the specialized heuristic algorithms (at least at the moment). On the other side, the results show MAMs could provide a basis for specialized access methods capable of precision/efficiency trade-off control.

Construction of Tree-based Indexes for Level-Contiguous Buffering Support

In multimedia databases, the spatial index structures based on trees (like R-tree, M-tree) have been proved to be efficient and scalable for low-dimensional data retrieval. However, if the data dimensionality is too high, the hierarchy of nested regions (represented by the tree nodes) becomes spatially indistinct. Hence, the query processing deteriorates to inefficient index traversal (in terms of random-access I/O costs) and in such case the tree-based indexes are less efficient than the sequential search. This is mainly due to repeated access to many nodes at the top levels of the tree. In this paper we propose a modified storage layout of tree-based indexes, such that nodes belonging to the same tree level are stored together. Such level-ordered storage allows to prefetch several top levels of the tree to the buffer pool by only a few or even a single contiguous I/O operation (i.e. one-seek read). The experimental results show that our approach can speedup the tree-based search significantly.