Similarity search is widely used in many areas of computing, including multimedia databases, data mining, and bioinformatics. For a long time, database approaches to similarity search assumed that the similarity is a metric distance. The metric properties (above all the triangle inequality) make it possible to index a database so that it can be queried efficiently, that is, without comparing the query against every object. However, as data across various domains has grown more complex, many similarity measures have appeared in recent years that are not metrics (i.e., nonmetrics). Database research has so far largely overlooked the considerable potential of nonmetric similarity search and still recognizes only the metric space model.
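To illustrate why the metric postulates matter for indexing, the following minimal sketch shows pivot-based filtering for a range query. It relies only on the triangle inequality, which gives the lower bound |d(q, p) - d(o, p)| <= d(q, o) for any pivot p. The class `PivotIndex`, the helper `euclidean`, and the single-pivot layout are assumptions made for this illustration and are not the access methods developed in the project.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance, used here as an example of a metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class PivotIndex:
    """Hypothetical single-pivot index (illustration only)."""

    def __init__(self, objects: Sequence[Point], pivot: Point,
                 dist: Callable[[Point, Point], float]) -> None:
        self.objects = list(objects)
        self.pivot = pivot
        self.dist = dist
        # Precompute the distance of every object to the pivot at build time.
        self.to_pivot = [dist(o, pivot) for o in self.objects]
        # Number of distance computations performed by the last query.
        self.computations = 0

    def range_query(self, query: Point, radius: float) -> List[Point]:
        d_qp = self.dist(query, self.pivot)
        self.computations = 1
        result = []
        for obj, d_op in zip(self.objects, self.to_pivot):
            # Triangle inequality: |d(q,p) - d(o,p)| is a lower bound on d(q,o).
            # If the bound already exceeds the radius, skip the object
            # without computing the (possibly expensive) distance d(q,o).
            if abs(d_qp - d_op) > radius:
                continue
            self.computations += 1
            if self.dist(query, obj) <= radius:
                result.append(obj)
        return result
```

If the distance function violates the triangle inequality, the lower bound no longer holds and the filter may discard relevant objects, which is the core difficulty of nonmetric similarity search.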
This project aims to propose formal models, followed by the design of access methods, for efficient nonmetric similarity search, that is, search in databases where the similarity is not restricted by the metric postulates. Achieving this goal would bring an efficient database solution to domain experts who need to perform large-scale content-based retrieval tasks in complex databases, such as multimedia retrieval, similarity-based data mining, complex pattern matching, and classification and prediction in bioinformatics.
Team members: David Hoksza, Jakub Lokoc, Jiri Novak, Juraj Mosko, Tomas Bartos
In recent years, the volume of gene and protein banks (databases) has been growing rapidly. The main reason for storing huge volumes of gene and protein sequences in one place is not just browsing the sequences themselves but, above all, searching for similarities among the stored sequences. Similar sequences indicate similar functionality, which helps in determining the functions of unknown genes.
Current techniques for finding similar sequences scan the whole gene or protein database and compute the similarity between the query and every sequence in it. As the databases grow, the time needed to find similar sequences increases linearly with their size.
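The sequential scan described above can be sketched as follows. The sketch uses the Levenshtein edit distance purely as a stand-in similarity measure; this is an assumption for illustration, and real sequence search tools typically rely on alignment scores instead. The function names `edit_distance` and `sequential_scan` are hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sequential_scan(query: str, database: Sequence[str],
                    k: int) -> List[Tuple[int, str]]:
    """Return the k sequences closest to the query.

    Every sequence in the database is compared with the query, so the
    number of distance computations grows linearly with the database size.
    """
    scored = [(edit_distance(query, seq), seq) for seq in database]
    scored.sort()
    return scored[:k]
```

An index aims to answer the same query while computing the distance for only a fraction of the stored sequences.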
Hence, the goal of the project is to apply multimedia indexing methods to speed up searching in biological databases (primarily genome and protein databases). In the project, we will examine the ability of existing indexing methods, or of their modifications, to index different types of biological data in a way that is best suited to such data. This examination is planned primarily for the first year; the plan for future work is given in later sections.