TWIC Dataset

Thematic Web Images Collection (TWIC) is a new dataset comprising 11,555 images divided into 200 classes. To create the dataset, keywords from various topics were selected, and for each keyword the results returned by the Google image search engine were manually filtered (i.e., we used only keyword search to create the candidate classes). Each keyword thus yielded one class. Each class was checked by several people, and only classes containing more than fifty objects were finally selected. The dataset consists mostly of visually similar objects placed on heterogeneous backgrounds.

We manually selected 12 themes and picked around 50 keywords for each theme. We set the threshold for a "visual-search" interesting class to at least 50 objects, which would hypothetically yield 12 * 50 * 50 = 30,000 objects. However, the number of "visual-search" interesting classes obtained from the text-based image search engine varied significantly between themes (see the list below). Moreover, to build a collection of at least 10,000 objects, we had to add some "outlying" objects to several classes to reach the limit of 50 objects per class. By an "outlying" object we mean an object that represents the keyword but is visually dissimilar to the rest of the objects in its class. Nevertheless, the amount of this heterogeneity is not significant (at most 5 such objects added to a small class).

  • Buildings - 30
  • Clothes - 20
  • Flags - 52
  • Flora - 15
  • Food - 2
  • Fruits - 10
  • House - 7
  • Insects - 5
  • Mammals - 21
  • Sea - 14
  • Space - 9
  • Transport - 15
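The figures above can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the per-theme counts are copied from the list, the totals come from the text, and all variable names are hypothetical and not part of the dataset distribution. The value of 50 keywords per theme is approximate ("around 50" in the text), so the per-theme yield is a rough estimate.

```python
# Number of "visual-search" interesting classes per theme (from the list above).
classes_per_theme = {
    "Buildings": 30, "Clothes": 20, "Flags": 52, "Flora": 15,
    "Food": 2, "Fruits": 10, "House": 7, "Insects": 5,
    "Mammals": 21, "Sea": 14, "Space": 9, "Transport": 15,
}

THEMES = 12
KEYWORDS_PER_THEME = 50      # assumption: the text says "around 50"
MIN_OBJECTS_PER_CLASS = 50   # threshold stated in the text
TOTAL_IMAGES = 11_555        # dataset size stated in the text

# Hypothetical upper bound if every keyword had produced a usable class.
hypothetical_objects = THEMES * KEYWORDS_PER_THEME * MIN_OBJECTS_PER_CLASS

# Classes that actually survived the manual filtering.
total_classes = sum(classes_per_theme.values())

# Rough share of keywords per theme that produced a usable class.
yield_per_theme = {
    theme: count / KEYWORDS_PER_THEME
    for theme, count in classes_per_theme.items()
}

mean_images_per_class = TOTAL_IMAGES / total_classes

print(f"hypothetical objects: {hypothetical_objects}")
print(f"classes obtained:     {total_classes}")
print(f"mean images / class:  {mean_images_per_class:.1f}")
for theme, share in sorted(yield_per_theme.items(), key=lambda kv: -kv[1]):
    print(f"{theme:<10} {share:.0%}")
```

The per-theme counts sum to the 200 classes reported for the dataset. Because the keyword count per theme is only approximate, a yield slightly above 100% (as for Flags) simply indicates that more than 50 keywords were tried for that theme.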

Query objects (the first link in each list) were selected manually. We focused on objects that visually represent the majority of their class (i.e., the query object is not an outlier).

The collection should be used only for scientific/comparison purposes and should not be redistributed. We would like to provide the collection in a more convenient form; however, because of copyright laws, we provide only links to the selected images. Moreover, in the case of any negative reaction from the author of any referenced image, we will immediately remove the image from the list.

Extracted feature signatures based on the 7-dimensional {(L,a,b), (x,y), (c,e)} space - http://siret.ms.mff.cuni.cz/lokoc/data/TWICSignatures.zip

Five extracted MPEG-7 descriptors - http://siret.ms.mff.cuni.cz/lokoc/data/TWICMPEG7.zip

To reference this collection, please cite the following paper:

J. Lokoč, D. Novák, M. Batko, and T. Skopal. Visual image search: feature signatures or/and global descriptors. In Proceedings of the 5th International Conference on Similarity Search and Applications, SISAP 2012, pages 177-191, Berlin, Heidelberg, 2012. Springer-Verlag.