GAČR 17-22224S
User Preference Analytics in Multimedia Exploration Models
2017 - 2019
Principal investigator : Tomas Skopal
GAČR 15-08916S
Efficient subgraph discovery for petabyte-scale web analysis
2015 - 2017

The study of network behavior without packet content inspection is of increasing concern in the context of network administration and security. Recent years have seen an increasing demand for machine learning algorithms on graphs, since modeling interactions between entities by graphs is natural in the context of large computer networks. A promising approach to modeling graphs that leverages the advantages of machine learning techniques is based on so-called "graphlets", which provide an embedding of graph fragments into vector spaces. However, wider adoption of graphlets is hindered by the cost of embedding and the limitation to unweighted undirected graphs. In this project, we would like to focus on the design of generalized graphlet-based models and the respective vocabularies, and thus to increase the variety of applications potentially benefiting from graphlet-based descriptors. The proposed methodology will be verified within the domain of network security. In particular, malicious web communities will be searched for in a petabyte-scale network traffic database available to Cisco.
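
To illustrate the idea of a graphlet-based descriptor, the following minimal sketch counts the two connected 3-node graphlets (open wedges and triangles) in an unweighted undirected graph and returns the counts as a small vector. It is an illustration only, not the project's method: the function name and the toy edge list are assumptions, and the brute-force enumeration of node triples is exactly the kind of cost that limits graphlets on large graphs.

```python
from itertools import combinations


def three_node_graphlet_counts(edges):
    """Return (wedges, triangles) for an unweighted undirected graph.

    Naive O(n^3) enumeration of node triples; for illustration only.
    """
    # Build an adjacency map from the edge list.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    wedges = 0
    triangles = 0
    for a, b, c in combinations(sorted(adj), 3):
        # Number of edges present among the three nodes.
        k = int(b in adj[a]) + int(c in adj[a]) + int(c in adj[b])
        if k == 3:
            triangles += 1
        elif k == 2:
            wedges += 1
    return (wedges, triangles)


# Hypothetical toy graph: a triangle 1-2-3 with a pendant node 4 attached to 3.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
descriptor = three_node_graphlet_counts(toy_edges)
```

The resulting tuple can serve as a (very coarse) vector-space embedding of the graph; practical descriptors would use larger graphlets and more efficient counting.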

Principal investigator : Jakub Lokoc
Team member : Tomas Skopal, Premek Cech
GAČR 15-00885S
Novel methods for computational prediction and visualization of secondary structures of ribosomal ribonucleic acids - an integrated solution
2015 - 2017
Ribosomal ribonucleic acid (rRNA) is essential for protein synthesis, one of the most basic biological processes. To understand its mechanisms, knowledge of the rRNA structure is required, as it forms the structural core of the protein-synthesizing unit, the ribosome. Experimental identification of rRNA structure is technically extremely difficult. Secondary structure, a simplified structural model, can be predicted, but prediction for rRNAs is hindered by the extreme length of rRNA sequences. Thus, only a few eukaryotic rRNA structures are available so far. We will employ information about evolutionarily conserved segments of eukaryotic rRNA sequences for secondary structure prediction. An algorithmic workflow integrating the secondary structure prediction pipeline, a visualization algorithm and a database of the predicted structures will be developed. The predicted rRNA structures will be used for a novel bioinformatic identification of evolutionarily conserved structural motifs in eukaryotic rRNAs that may bring new insights into the role of rRNA in protein synthesis.
Co-Investigator : David Hoksza
GAUK 174615
Adaptive virtual screening
2015 - 2017

Biological screening is used to detect the ability of small molecules to trigger a response in a macromolecular target by binding to it. The main disadvantages of physical screening are its price and the need to own the tested molecules. An alternative to physical screening is its in-silico variant, virtual screening (VS). VS commonly takes place in the early stages of drug discovery as a molecular filter before physical screening. One type of VS based on the similarity principle is ligand-based VS (LBVS). The similarity principle correlates the function of a molecule with its structural and physico-chemical properties. If known active molecules (ligands) exist for a given target, we can utilize molecular similarity to identify novel ligands. LBVS requires a suitable molecular representation and similarity function. The candidate molecules can then be sorted by similarity to the known ligand(s) and thus, by association, by activity. The parameters of LBVS (similarity, representation and its parametrization) greatly influence its effectiveness. The parameters are often static despite the fact that they are target dependent. A wrong parametrization results in sub-optimal efficiency. Our goal is to develop a modular LBVS framework with a generic representation and automated parameterization based on existing information about the target.
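
The ranking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework proposed in the project: molecules are represented as sets of made-up integer feature identifiers, the similarity function is the Tanimoto (Jaccard) coefficient, and each candidate is scored by its best similarity to any known ligand. All names and data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(known_ligands, candidates):
    """Sort candidates by their best similarity to any known ligand, descending."""
    scored = []
    for name, fingerprint in candidates.items():
        score = max(tanimoto(fingerprint, ligand) for ligand in known_ligands)
        scored.append((name, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy fingerprints: sets of made-up feature identifiers.
known_ligands = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5})]
candidates = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({2, 3, 6}),
    "mol_c": frozenset({7, 8}),
}
ranking = rank_candidates(known_ligands, candidates)
```

Both the representation (here, plain feature sets) and the similarity function are the interchangeable parameters that the abstract argues should be chosen per target rather than fixed.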

Principal investigator : Petr Škoda
Co-Investigator : Jan Jelinek, Radoslav Krivak
GAUK 201515
Using metric indexes for efficient content-based multimedia exploration
2015 - 2017
Multimedia exploration techniques are becoming one of the most promising trends in multimedia retrieval, because they allow an intuitive way of accessing the content of multimedia databases. Exploration techniques are especially useful in cases where traditional methods of data filtering fail or are insufficient. Furthermore, multimedia exploration is fun! This is beneficial for modern web portals and e-shops containing visually interesting products and advertising, so the portals can keep users engaged for a longer period. An intuitive and fun exploration model is usually simple. However, a simple model cannot address the diversity of users' needs. Therefore, in this project we plan to create a complex multimedia exploration framework that is able to intuitively combine multiple similarity models in order to satisfy a particular user's intent. Additionally, we plan to use existing metric indexing structures as the native exploration structures in order to achieve computationally inexpensive and thus fluent exploration. Furthermore, we plan to implement effective multi-query evaluation using specific query objects. This approach is especially viable for cumulative exploration using sophisticated methods of narrowing results and for utilizing exploration movements over extensive collections (zoom in/out, side movements, etc.).
GAČR 14-29032P
Efficient chemical space exploration using multi-objective optimization
2014 - 2016
Recently, we have developed a method for the systematic generation of the chemical space lying between a given pair of small organic molecules. The intended use follows the similar property principle, which states that "similar compounds have similar properties". Thus, if a pair of molecules shows a similar function, molecules close to them in the chemical space should behave alike.
 
Our approach has two major drawbacks which we would like to tackle in the proposed project: 1) a path (subspace) between the input molecules is not guaranteed to be found and 2) the exploration process is driven purely by structure and does not take into account physicochemical and biological properties of the generated compounds. In order to solve the first problem we propose to use an approach inspired by scaffold hopping. We will utilize multiple scaffold types retaining different levels of structural information to reduce the complexity of the chemical space. We propose that a path in a less complex chemical space is more likely to be identified. In the second part of the project we will modify the exploration process so that it is performed not in the structural space but in the biologically more meaningful space of features such as, e.g., ADME/Tox properties. This will be done by projecting the starting molecules into a multidimensional space of physicochemical and/or biological properties, and using multi-objective optimization to drive the exploration towards the desired optima. Finally, we propose a way to combine scaffold hopping and bioactivity-based exploration into a novel chemical space exploration approach.
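
As a minimal sketch of the multi-objective idea mentioned above, the following code selects the Pareto-optimal (non-dominated) candidates when every objective is to be minimized. It is an assumption-laden illustration rather than the project's optimization procedure: the two objectives and all numeric values are hypothetical placeholders for property scores of generated compounds.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one.

    All objectives are assumed to be minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (objective_1, objective_2) scores of five generated compounds.
scores = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
front = pareto_front(scores)
```

An exploration driven this way would prefer to expand compounds on the current front, since no other candidate is better in all objectives at once.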
Principal investigator : David Hoksza
GAUK 550214
rRNA Secondary Structure Prediction
2014 - 2015

Gene translation is the process of implementation of genetic information, which forms a living organism. The unit central to translation is the ribosome.

The "skeleton" (and major part) of the ribosome consists of ribosomal ribonucleic acids (rRNA), which are critical for its function. Because the function of biological molecules is mostly determined by their spatial structure, understanding the role of rRNA in translation depends on understanding rRNA structures. While rRNA nucleotide sequences can be obtained relatively easily, determining their three-dimensional structure is very demanding: sequences are known for hundreds of eukaryotic organisms, while spatial structures are known for only four.

Secondary structures are an intermediate step between sequences and three-dimensional structures, and understanding them enables at least a partial study of rRNA behavior. Unlike three-dimensional structures, secondary structures can be predicted from sequences (to an extent). The project will develop a method for predicting secondary structures of rRNA and a software infrastructure "rPredictor" (database, web interface, integrated bioinformatic tools) for making this method and an associated database of predicted structures available to the scientific community.

The rPredictor infrastructure is integral to the project because the existing database of rRNAs, SILVA, is, from a programming standpoint, insufficient and not extensible with a structure prediction module.

GAUK 910913
Real-time Exploration Queries in Multimedia Databases
2013 - 2015

Nowadays, similarity search in multimedia databases is performed through similarity queries explicitly specified by users. The queries return a certain part of the database that is relevant to the user-specified query parameters. However, this approach falls short when the user does not know how to specify the query, or only wants to know what the database contains as a whole. In such cases a non-standard way of accessing the data is more appropriate, e.g., the exploration of a multimedia database.
During the exploration process the user gains a complex idea of all the stored data rather than of a particular part of the database returned as the result of some similarity query. In the complex view the user is supported in browsing the space of multimedia data (typically on a multi-touch device, e.g., an iPad), which results in a stream of similarity queries. For convenient, user-friendly browsing, the exploration system has to evaluate these queries promptly, which is not guaranteed in the case of standard query processing (even approximate). Hence, the goal of this project is to propose and implement access methods that provide real-time similarity retrieval, thus laying the foundations for user-friendly exploration of multimedia databases.

Principal investigator : Juraj Mosko
Team member : Tomas Skopal, Tomas Bartos, Jakub Lokoc, Tomas Grosup
GAUK 154613
Efficient Molecular Representation
2013 - 2014

Many areas of chemical biology, e.g. drug discovery, largely rely on libraries of molecules for production processes. Since the space of all molecules, called the chemical space, is huge, those libraries contain only a small subspace of the whole space. An alternative to static libraries is dynamic exploration of the space. The exploration generates molecules dynamically based on paths between given molecules in the chemical space. One of the few tools for exploring the chemical space is Molpher, which has been developed at Charles University in Prague. Molpher’s drawback is the limited possibility of influencing the exploration process. A possible way to affect the exploration is to minimize the exploration of subspaces that are not interesting. This calls for an effective molecular representation allowing interesting molecules to be distinguished based on their structure. Here, by effectiveness we understand both accuracy and low computational complexity. Because a lot of molecules are generated during the exploration process, it is crucial to be able to compare two molecules very fast, which helps to restrict the chemical space that needs to be explored. Our goal is to develop a new molecular representation and related algorithms enabling its use in Molpher and similar projects.

Principal investigator : Petr Škoda
CISCO 2013
Finding similar events within IDS
2013 - 2014
Principal investigator : Tomas Skopal
Team member : Jakub Lokoc, Juraj Mosko
GAČR P202/12/P297
Synergistic Modeling of Adaptive Similarities for Multimedia Retrieval
2012 - 2014

Smart similarity search in huge multimedia databases still remains a big challenge for image retrieval research. Domain experts try to design more effective retrieval models, which often utilize complex and expensive similarity measures. At the same time, database experts developing similarity-based indexes have to cope with less "indexable" similarity spaces induced by the more complex similarity measures. Moreover, since the provided similarity measures are commonly treated as black-box algorithms, only general approaches enabling efficiency tuning can be employed (e.g., the TriGen algorithm). However, the general approaches are not sufficient, and so database experts can no longer stand aside from similarity modeling. In this project, we would like to "open the Pandora's box" and enter the world of domain experts, in order to design "indexable" similarity spaces. In general, we would like to describe a new similarity space modeling schema considering not only the retrieval quality but also the efficiency issues. In other words, we plan to introduce indexability measures to the similarity modeling process and investigate more variants of popular multimedia distance spaces. As we have recently shown [Beecks et al. 2011] in cooperation with RWTH Aachen, this approach can result in interesting tradeoffs, where a speedup of at least two orders of magnitude can be achieved for the price of only slightly decreased retrieval quality.

Principal investigator : Jakub Lokoc
GAUK 567312
Algorithmic exploration of axiom spaces for efficient similarity search at large scale
2012 - 2014

Similarity search is becoming popular in ever more disciplines, such as multimedia databases, bioinformatics, data mining, or social networks. The large-scale search engines for such data are mostly based on models involving low-level features and simple similarity functions. There also exist complex models employing local features and higher-level similarities which provide higher retrieval effectiveness. An application of complex models, however, is not feasible at large scale due to an insufficient portfolio of indexing techniques enabling fast search.

 
The existing techniques assume the metric space model that is too restrictive. In this project we revisit assumptions which persist in the mainstream research of content-based retrieval. Leaving the traditional indexing paradigms such as the metric space model, our goal is to propose alternative methods for indexing that shall lead to high-performance similarity search. We intend to develop an algorithmic framework for exploration of axioms (analytical properties) useful for indexing that hold in a given complex similarity space but were not discovered so far. Consequently, the known axioms will be localized as a small subset within the universe of all axioms suitable for indexing. The discovery of new axioms valid in some similarity space might have a huge impact in the database community.
 
Principal investigator : Tomas Bartos
Team member : Tomas Skopal, Juraj Mosko
FEBS Short-Term Fellowships
Similarity Retrieval in Protein Structure Databases
2012 - 2012

Similarity retrieval is widely used in many bioinformatics tasks. A typical case is similarity retrieval in databases of protein structures. However, this problem has not yet been satisfactorily solved, especially in comparison with similarity retrieval in databases of protein sequences, which has been solved successfully (e.g., by BLAST).

Thus, the goal of our proposed research is to develop a tool for similarity retrieval in protein structure databases. The tool will be accessible online as a web application that will support not only protein structures but also protein sequences as a query input. In the case that a query is the sequence of a protein, the system should select such structures that are similar to the (unknown) structure of the query protein. In general, such behavior can be achieved by use of so-called sequence-structure similarity measures.

The guest laboratory (AG Porto) has extensive experience with the design of protein structure similarities and also with sequence-structure similarities. Our research group (SIRET) develops efficient and effective methods for similarity retrieval in huge databases. We also have experience with the application of these methods to biological data. Hence, we hope that the joint research can result in a tool solving the problem with higher efficiency and effectiveness.

Principal investigator : Jakub Galgonek
FEBS Short-Term Fellowships
Fast Similarity Retrieval in Mass Spectra Databases
2012 - 2012
Principal investigator : Jiri Novak
GAČR P202/11/0968
Large-scale Nonmetric Similarity Search in Complex Domains
2011 - 2014

Similarity search is popular in various areas of computing, including multimedia databases, data mining, bioinformatics, etc. For a long time, the database approaches to similarity search assumed the similarity to be a metric distance. Due to its properties, metric similarity allows a database to be indexed such that it can be queried efficiently (quickly). However, together with the increasing complexity of data across various domains, many similarities have appeared in recent years that are not metrics (i.e., nonmetrics). Database research, however, is still not aware of the huge potential market for nonmetric similarity search, recognizing just the metric space model.
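
Why metric properties enable efficient querying can be shown with a minimal pivot-filtering sketch. It assumes a single pivot, Euclidean distance on 2D points, and a range query; these choices, the function names, and the toy data are illustrative assumptions, not a specific access method from the project. The triangle inequality gives the lower bound |d(q, p) - d(o, p)| <= d(q, o), so an object whose lower bound already exceeds the query radius can be discarded without computing its distance to the query.

```python
import math


def range_query(db, pivot_dists, pivot, query, radius, dist=math.dist):
    """Range query with single-pivot filtering.

    pivot_dists[i] must hold the precomputed distance dist(db[i], pivot).
    Returns (results, number_of_distance_computations_to_db_objects).
    """
    d_qp = dist(query, pivot)
    results = []
    computed = 0
    for obj, d_op in zip(db, pivot_dists):
        # Lower bound on dist(query, obj) from the triangle inequality.
        if abs(d_qp - d_op) > radius:
            continue  # cannot be within the radius, skip the real distance
        computed += 1
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, computed


# Hypothetical toy database of 2D points with one pivot.
db = [(0, 0), (1, 0), (5, 5), (10, 10)]
pivot = (0, 0)
pivot_dists = [math.dist(o, pivot) for o in db]  # computed once, at index time
hits, n_computed = range_query(db, pivot_dists, pivot, (1, 1), 1.5)
```

With a nonmetric similarity the lower bound no longer holds, so this kind of filtering may discard true results, which is the difficulty the project addresses.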

This project aims to propose formal models followed by a design of access methods for efficient nonmetric similarity search, that is, search in databases where the similarity is not restricted by the metric postulates. Such a goal would bring an efficient database solution to the domain experts that need to pursue large-scale content-based retrieval tasks in complex databases, like multimedia retrieval, similarity-based data mining, complex pattern matching, classification and prediction in bioinformatics, etc.

Principal investigator : Tomas Skopal
Team member : David Hoksza, Jakub Lokoc, Jiri Novak, Juraj Mosko, Tomas Bartos
GAUK 430711
Application of Metric and Non-metric Indexing Methods in Computational Proteomics
2011 - 2012

The volume of unstructured databases grows rapidly, whereas their annotation is problematic. The similarity search concept, based on a similarity function defined for each pair of database objects, is more suitable for this kind of data. The similarity is usually modelled by a distance function satisfying the metric axioms, which allows efficient indexing. However, the metric axioms can be very restrictive for domain experts, who may prefer non-metric functions. Hence, database experts have to solve this problem by converting non-metric functions to metric ones or by developing new types of non-metric indexing methods.

One of the areas where metric/non-metric similarity searching is used is computational proteomics. When determining the biological function of an "unknown" protein, retrieval of "known" proteins with similar structures (and thus probably with a similar function) is very useful. Moreover, fast and cheap determination of protein structures is also an open problem. From a database point of view, it is possible to use databases of known protein structures to address this problem. In this approach, sequence-structure similarity functions are used to obtain structures that may be similar to the sought structure of the protein.

Our goal is to develop high-quality structure and sequence-structure similarity functions and methods for their indexing.

Principal investigator : Jakub Galgonek
Team member : Tomas Skopal, Jiri Novak, Jakub Lokoc
GAČR 201/09/0683
Similarity Searching in Very Large Multimedia Databases
2009 - 2011

Finished, rated as excellent

Co-Investigator : Tomas Skopal
GAUK 18208
Distributed and parallel metric indexing in multimedia databases
2008 - 2009

Current data processing applications use data with considerably less structure and much less precise queries than traditional database systems. Multimedia data, like images or videos, that offer query-by-example search are a typical example. Such data can neither be ordered in a canonical manner nor meaningfully searched by precise database queries that would return exact matches. This novel situation is what has given rise to similarity searching. The most general approach to similarity search that still allows the construction of index structures is modeled in metric space. Here an important issue is efficiency: we need to achieve fast query response over huge volumes of data. During the last two decades many metric access methods and indexing structures have been developed; however, they mostly cannot scale up with the exponential growth of multimedia data volumes encountered in recent years. A way to cope with this enormous growth is to design parallel and distributed solutions, either as an extension of the traditional centralized indexing techniques, or as completely new ones, where parallelism and distribution are inherent indexing properties. Hence, the goal of the proposed project is the design and implementation of parallel and distributed indexing techniques and their comparison with existing centralized solutions.

Principal investigator : Jakub Lokoc
Team member : Tomas Skopal
GAUK 57907
Similarity search in biological databases
2007 - 2008

In recent years, the volume of gene and protein banks (databases) has been growing rapidly. The reason for storing huge volumes of gene and protein sequences in one place is not only to browse the sequences themselves but, in the first place, to search for similarities among the stored sequences. Similar sequences indicate similar functionality, which helps in finding the functions of unknown genes.

Current techniques for finding similarity among data sequences go through whole databases of genes and proteins and examine the similarity between the query and every sequence in the database. As the volume of databases grows, the time for finding similar sequences increases linearly.

Hence, the goal of the project is an application of multimedia indexing methods to speed up searching in biological databases (primarily genome and protein databases). In the project, we will examine (primarily in the first year; the plan of future work is in further sections) the ability of existing indexing methods, or their modifications, to index different types of biological data in a way that is optimal for such data.

Principal investigator : David Hoksza
Team member : Tomas Skopal
GAČR 201/05/P036
Efficient metric search in large multimedia databases
2005 - 2007

Finished, rated as excellent

Principal investigator : Tomas Skopal