Density-Based Classification of Protein Structures Using Iterative TM-score |
Finding similarity between a pair of protein
structures is one of the fundamental tasks in many areas of
bioinformatical research such as protein structure prediction,
function mapping, etc. We propose a method for finding
pairing of amino acids based on densities of the structures
and we also propose a modification to the original TM-score
rotation algorithm that assess similarity score to this alignment.
Proposed modification is faster than TM and comparably
robust according to non-optimal parts in the alignment. We
measure the qualities of the algorithm in terms of SCOP
classification accuracy. Regarding the accuracy, our solution
outperforms the contemporary solutions at two out of three
tested levels of the SCOP hierarchy.
On the Effectiveness of Distances Measuring Protein Structure Similarity |
Abstract—Determining similarity between two protein structures
is one of the most fundamental problems in contemporary
structural bioinformatics. With the increasing complexity of
the measures, their effectiveness increases as well. However,
other important observables, such as the degree of metric
properties fulfilment, could rather deteriorate than improve.
In this paper we introduce an effective measure and study its
degree of metric properties fulfilment.
DDPIn - Distance and Density Based Protein Indexing |
Methods for protein structure classification have many applications
in protein function prediction and associated fields i.e. drug
discovery and others. In this paper we propose a new method for
representing protein structures in order to quickly search and
classify big collections of structures. In our approach, each
protein structure is represented by number of vectors (based
on histogram of distances) equivalent to the number of its
C-alpha residues. Each C-alpha residue represents a viewpoint
from which distances to each of the other residues are computed.
Consequently, we use serveral methods to convert these distances
into a n-dimensional feature vector which is indexed using a metric
indexing structure (M-tree is the structure of our choice).
While searching, we use single or multi-step approach which
provides us classification accuracy and speed comparable with the
best contemporary classification methods.
An Application of the Metric Access Methods to the Mass Spectrometry Data |
Mass spectrometry is a very popular method for protein and peptide
identification nowadays. Abundance of data generated in this way grows
exponentially every year and although there exist algorithms for interpreting mass spectra,
demand for faster and more accurate approaches remains.
We propose an approach for preprocessing the protein sequence database
based on metric access methods. This approach allows to select only a
small set of suitable peptide sequence candidates, which can be then
compared with experimental spectra using more sophisticated algorithms.
We define logarithmic distance for selecting peptide sequence candidates
and also outline possibilities of using the interval query for searching
posttranslational modifications.
The experimental results show that our approach is comparable in
precision with nowadays most widely used public
tools and outline possible directions for further resarch.
Methods of Protein Structure Alignment |
One of the most crucial discoveries in proteomics is that similar proteins
(in both sequence and structure) handle similar biological functions.
This knowledge together with growing size of protein structures with
known function is the reason why field of protein structural alignment
has been very active in the last twenty years. To handle increasing
amount of structures with unknown function, many algorithms for pairwise
structural alignment have emerged. Together with tools for optimal
superposition of protein structures, also measures for quantifying
similarity and ways to represent features of individual structures
has been proposed. In this paper we focus on categorizing measures,
algorithms and presenting their main ideas.
Native Multidimensional Indexing in Relational Databases |
In current database systems there is a strong need for searching data according to many attributes.
In commercial database platforms the standard search over multiple attributes is
provided by B$^{+}$-tree (or it's variants) with compound keys. On the other hand,
the systems provide also multidimensional indexing, however, just for spatial purposes
(such as GIS or CAD systems) and use special data types and syntax.
In this paper we propose a native multidimensional method for indexing
tables with simple attributes, such that multi-attribute queries can be
processed (with standard SQL queries) more efficiently than by simple
B+-tree with compound keys. For implementation, we have used the PostgreSQL
and R-tree-based index. With this combination we outperformed commercial platforms
by an order of magnitude in the number of accesses to index.
As a by-product, a framework for easy implementation of external
indexing methods into PostgreSQL was designed.
Improved Alignment of Protein Sequences Based on Common Parts |
Searching databases of protein sequences has been an important task in
bioinformatics in the last twenty years and it's importance has grown
with increasing size of protein sequence databases. Because of this
growth, there have been efforts to speed-up the original quadratic
complexity searching algorithm that became too slow with regard to
the volume of the databases. To do so, heuristic approaches have been
proposed that trade accuracy of the search for speed due to the linear
complexity of such methods. During the time, accuracy of heuristic methods
has grown but exact search is still needed in some cases. Because of this
need there have been attempts to speed-up the search while maintaining
the rigorous nature of the search (i.e. index protein sequences,
modification the algorithm without loss of accuracy, ...). These attempts
were not successful (at least not enough successful to beat heuristic approaches)
and the only exact search algorithm stays SSEARCH (or it's clones) which
sequentially scans database of protein sequences and performs full alignment
against each of the sequence.
Since exact search is still needed we focus on improving the sequential search
algorithm. Unlike methods that benefit from decreasing number of computations
of the distance function, we try to decrease costs needed to compute the
distance function itself, when used with large datasets. This is
achieved by reusing alignment calculations of common parts of the
protein sequences without loss of accuracy.
With this method we reduced the computational costs by up to 20%
depending of the database size and subset used. We also implemented
approximate search which further reduced computational for the price
of some accuracy loss.
Metric indexing of protein databases and promising approaches |
Most widely used biological databases nowadays are nucleotide and
protein ones. These databases are crucial for determination of biological functions
of living organisms with respect to their DNA structure. The biological function
of a protein can be derived from the similarity with another protein with known
function which is stored in a database and therefore the chance of finding the
biological function of given protein or DNA sequence grows with size of the
database. Because of this fact, the growth is exponential which in turn calls
for sublinear methods of searching these databases. Optimal solution is aligning
the query sequence with all sequences in the queried database. Since aligning of
two sequences is computationally expensive, fast heuristic methods (e.g. BLAST
[Altschul et al., 1997]) are used although they can only approximate the optimal
solution without restricting the resulting error. In this paper we try to use metric
access methods (MAMs) for exact and approximate searching through protein
databases. As experiments show, such a straightforward use of MAMs is not very
suitable, therefore we also show possible further directions in the area of indexing
protein sequences based on the so far learned facts.
Improving the Performance of M-tree Family by
Nearest-Neighbor Graphs |
The M-tree and its variants
have been proved to perform efficient similarity search in database
environments. In order to further improve their performance, in this paper
we propose an extension of M-tree family, which makes use of
nearest-neighbor (NN) graphs. Each tree node maintains its own NN-graph, a
structure that stores for each entry in the node the reference (and
distance) to its nearest neighbor, considering just the entries in thucleotide or protein sequences, finding a local alignment of two
sequences is one of the main tasks. Since the sizes of available databases
grow constantly, the efficiency of retrieval methods becomes the critical
issue. The sequence retrieval relies on finding sequences in the database
which align best with the query sequence. However, an optimal alignment
can be found in quadratic time (by use of dynamic programming) while this
is infeasible when dealing with large databases. The existing solutions
use fast heuristic methods (like BLAST, FASTA) which produce only an
uncontrolled approximation of the best alignment and even do not provide
any information about the alignment approximation error. In this paper we
propose an approach of exact and approximate indexing using several metric
access methods (MAMs) in combination with the TriGen algorithm, in order
to reduce the number of alignments (distance computations) needed. The
experimental results have shown that a straightforward adoption of MAMs to
sequence retrieval cannot outperform the specialized heuristic algorithms
(at least at the moment). On the other side, the results show MAMs could
provide a basis for specialized access methods capable of
precision/efficiency trade-off control.
Index-based approach to similarity search in protein and nucleotide databases |
Construction ofd in quadratic time (by use of dynamic programming) while this
is infeasible when dealing with large databases. The existing solutions
use fast heuristic methods (like BLAST, FASTA) which produce only an
uncontrolled approximation of the best alignment and even do not provide
any information about the alignment approximation error. In this paper we
propose an approach of exact and approximate indexing using several metric
access methods (MAMs) in combination with the TriGen algorithm, in order
to reduce the number of alignments (distance computations) needed. The
experimental results have shown that a straightforward adoption of MAMs to
sequence retrieval cannot outperform the specialized heuristic algorithms
(at least at the moment). On the other side, the results show MAMs could
provide a basis for specialized access methods capable of
precision/efficiency trade-off control.
Construction of Tree-based Indexes for
Level-Contiguous Buffering Support |
In multimedia databases, the
spatial index structures based on trees (like R-tree, M-tree) have been
proved to be efficient and scalable for low-dimensional data retrieval.
However, if the data dimensionality is too high, the hierarchy of nested
regions (represented by the tree nodes) becomes spatially indistinct.
Hence, the query processing deteriorates to inefficient index traversal
(in terms of random-access I/O costs) and in such case the tree-based
indexes are less efficient than the sequential search. This is mainly due
to repeated access to many nodes at the top levels of the tree. In this
paper we propose a modified storage layout of tree-based indexes, such
that nodes belonging to the same tree level are stored together. Such
level-ordered storage allows to prefetch several top levels of the tree to
the buffer pool by only a few or even a single contiguous I/O operation
(i.e. one-seek read). The experimental results show that our approach can
speedup the tree-based search significantly.