Indexing Similarity

The general paradigm for content-based retrieval is the similarity search model, which consists of three key components. First, given a database of complex data objects (e.g., multimedia documents), a set of feature descriptors must be extracted from the actual data objects. Second, a distance function must be defined on the descriptors that mimics the similarity between the respective objects. Third, a query is given using the query-by-example concept: distances are evaluated between a query (example) descriptor and all the descriptors in the database, and those sufficiently close (similar) to the example are returned to the user as the result.
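
The three components above can be sketched as a naive sequential scan. This is only an illustration: the descriptors are toy 2D vectors, the Euclidean distance stands in for whatever distance function an application defines, and the names range_query and knn_query are hypothetical.

```python
import heapq
import math

def euclidean(a, b):
    """Stand-in distance function on descriptors (an assumption of this sketch)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(database, query, radius, distance=euclidean):
    """Return indices of all descriptors within `radius` of the query example."""
    return [i for i, d in enumerate(database) if distance(query, d) <= radius]

def knn_query(database, query, k, distance=euclidean):
    """Return indices of the k descriptors closest to the query example."""
    scored = ((distance(query, d), i) for i, d in enumerate(database))
    return [i for _, i in heapq.nsmallest(k, scored)]

# Toy database of already extracted feature descriptors.
database = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (0.5, 0.5)]
query = (0.0, 0.0)

print(range_query(database, query, radius=1.0))
print(knn_query(database, query, k=2))
```

Both queries evaluate the distance to every descriptor in the database, which is exactly the cost that indexing tries to avoid.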

If the naive implementation of the query process outlined above (a sequential search of the entire database) were acceptable, there would be no problem and no constraints on the distance function. However, the distance function is often computationally expensive, and the databases are too large for a sequential search to be efficient. Hence, various models for indexing similarity have been developed, most of which follow the metric space model, which assumes a metric distance function. The metric postulates, in particular the triangle inequality, make it possible to partition the descriptor space so that query processing visits only the promising partitions, making the search efficient. However, the restriction to metric distances is quite serious, because real-world applications often require non-metric distances or even dynamic distances that change with evolving user preferences. The SRG aims to investigate general techniques for indexing metric, non-metric and dynamic distance functions at large scale.
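
To illustrate how the metric postulates save distance computations, here is a minimal sketch of pivot-based filtering for a range query. It is not a description of any particular SRG index: the class name PivotIndex, the single pivot, and the Euclidean distance are assumptions made for the example. The triangle inequality gives the lower bound |d(q, p) - d(p, o)| <= d(q, o), so an object whose lower bound exceeds the query radius can be discarded without evaluating d(q, o).

```python
import math

def euclidean(a, b):
    """Stand-in metric distance (an assumption of this sketch)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class PivotIndex:
    """Hypothetical one-pivot index: stores d(pivot, o) for every object o."""

    def __init__(self, objects, pivot, distance=euclidean):
        self.objects = list(objects)
        self.pivot = pivot
        self.distance = distance
        # Precomputed once at indexing time.
        self.pivot_dists = [distance(pivot, o) for o in self.objects]

    def range_query(self, query, radius):
        """Return (hits, number of object distances actually computed)."""
        dq = self.distance(query, self.pivot)
        hits, computed = [], 0
        for obj, dp in zip(self.objects, self.pivot_dists):
            # Lower bound on d(query, obj) from the triangle inequality.
            if abs(dq - dp) > radius:
                continue  # safely filtered out, no distance computation
            computed += 1
            if self.distance(query, obj) <= radius:
                hits.append(obj)
        return hits, computed

objects = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (10.0, 10.0), (0.5, 0.5)]
index = PivotIndex(objects, pivot=(0.0, 0.0))
hits, computed = index.range_query((1.0, 1.0), radius=1.0)
print(hits, computed)
```

The filtering is only correct when the distance satisfies the triangle inequality, which is why non-metric and dynamic distances need different techniques.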


Multimedia data (images, audio, video) have already confirmed their dominant role within the flood of data available over the Internet. With the exponential growth of multimedia data volumes, multimedia retrieval can no longer rely solely on conventional keyword-search technology, which requires each object to be annotated with a set of keywords. Not only are annotations mostly unavailable for multimedia data at such a large scale, but even the available annotations usually suffer from subjectivity and incompleteness. Thus, content-based multimedia retrieval systems need to be designed that employ similarity search models and techniques considering the actual multimedia content rather than the keywords. The SRG is involved in two research directions concerning indexing in content-based multimedia retrieval: multimedia exploration access methods and indexing adaptive similarity. The outcomes of this research will be incorporated into our web-based Smart image retrieval system (SIR).


While traditional content-based retrieval approaches provide query-driven access under the assumption that the users' needs are clearly specified, modern content-based exploration approaches support users in browsing and navigating through multimedia databases when only an imprecise retrieval intent, or none at all, is given. By means of interactive graphical user interfaces, exploration approaches offer convenient and intuitive access to unknown multimedia databases, which becomes even more important with the arrival of powerful mobile devices, such as the iPad and iPhone.

When determining content-based similarity between two multimedia objects, the distance is evaluated on feature representations which aggregate the inherent properties of the multimedia objects. The conventional feature representations aggregate and store these properties in feature histograms, which can be compared by vectorial distances. Recent feature representations adaptively aggregate and store individual object properties in more flexible feature signatures, which can be compared by adaptive similarity measures, such as the quadratic form distance or the Earth mover's distance.
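
As a rough sketch of an adaptive similarity measure, the following computes a signature quadratic form distance between two feature signatures of different sizes. Each signature is assumed to be a list of (centroid, weight) pairs. The Gaussian similarity function and the parameter alpha are choices made for this example, not something fixed by the text above.

```python
import math

def l2(a, b):
    """Euclidean distance between two centroids."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sqfd(sig_a, sig_b, alpha=1.0):
    """Signature quadratic form distance between two feature signatures.

    The weights of sig_b are negated and concatenated with those of sig_a,
    and the similarity matrix is built on the fly from the centroids of
    both signatures (Gaussian similarity is an assumption of this sketch).
    """
    centroids = [c for c, _ in sig_a] + [c for c, _ in sig_b]
    weights = [w for _, w in sig_a] + [-w for _, w in sig_b]
    total = 0.0
    for ci, wi in zip(centroids, weights):
        for cj, wj in zip(centroids, weights):
            total += wi * wj * math.exp(-alpha * l2(ci, cj) ** 2)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(total, 0.0))

# Two toy signatures with a different number of centroids.
sig_a = [((0.0, 0.0), 0.6), ((1.0, 1.0), 0.4)]
sig_b = [((0.1, 0.0), 0.5), ((1.0, 0.9), 0.3), ((3.0, 3.0), 0.2)]

print(sqfd(sig_a, sig_b))
print(sqfd(sig_a, sig_a))
```

Unlike a fixed-length histogram compared bin by bin, the two signatures here need not share the same centroids or even the same number of them.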

Bioinformatics & Cheminformatics

Bioinformatics applies computer science and mathematical techniques to the field of molecular biology in order to solve complex biological problems. Most areas of bioinformatics rely heavily on similarity search. The SRG puts considerable effort into implementing efficient similarity search methods in the bioinformatics domain. We are especially interested in the following applications, and we have developed tools that help to solve the corresponding problems.

It was believed for many years that the only function of RNA is to transfer genetic information to ribosomes, where it is translated into proteins. However, it has been revealed recently that ribosomes themselves are formed by RNA molecules with a catalytic function. Therefore, RNA is not merely a transportation medium; it can also be involved in many enzymatic processes in living organisms. It has also been shown recently that RNA can act as a gene regulator, thus influencing whether a given gene will be translated and in what amount. Clearly, a deep understanding of the mechanisms behind RNA function can have a great impact on areas related to drug discovery.

Proteins perform many different biological functions and thus ensure most of the vital processes in living organisms. Proteins can fold into various 3D structures, which is the reason for their huge functional diversity. A better understanding of protein function can result in more effective drugs or various industrial products (e.g., laundry detergents containing enzymes).

Proteins, organic molecules made of amino acids, are essential for the construction of cells and for their proper function. Mass spectrometry is a widely used method for determining protein sequences from a biological (wet) sample. The sequences are not determined directly; they must be interpreted from the mass spectra, which are the output of the mass spectrometer.