P. Ciaccia and M. Patella. Searching in Metric Spaces with User-Defined and Approximate Distances. In ACM Transactions on Database Systems, 27(4), December 2002.
Novel database applications, such as multimedia, data mining, e-commerce, and many others, make intensive use of similarity queries in order to retrieve the objects that better fit a user request. Since the effectiveness of such queries improves when the user is allowed to personalize the similarity criterion according to which database objects are evaluated and ranked, the development of access methods able to efficiently support user-defined similarity queries becomes a basic requirement. In this article we introduce the first index structure, called the QIC-M-tree, that can process user-defined queries in generic metric spaces, that is, where the only information about indexed objects is their relative distances. The QIC-M-tree is a metric access method that can deal with several distinct distances at a time: (1) a query (user-defined) distance, (2) an index distance (used to build the tree), and (3) a comparison (approximate) distance (used to quickly discard from the search uninteresting parts of the tree). We develop an analytical cost model that accurately characterizes the performance of the QIC-M-tree and validate such model through extensive experimentation on real metric data sets. In particular, our analysis is able to predict the best evaluation strategy (i.e., which distances to use) under a variety of configurations, by properly taking into account relevant factors such as the distribution of distances, the cost of computing distances, and the actual index structure. We also prove that the overall saving in CPU search costs when using an approximate distance can be estimated by using information on the data set only (thus such measure is independent of the underlying access method) and show that performance results are closely related to a novel "indexing" error measure.
I. Bartolini, P. Ciaccia, and M. Patella. String Matching with Metric Trees Using an Approximate Distance. In Proceedings of the 9th International Symposium on String Processing and Information Retrieval (SPIRE 2002), Lisbon, Portugal, September 2002.
Searching in a large data set those strings that are more similar, according to the edit distance, to a given one is a time-consuming process. In this paper we investigate the performance of metric trees, namely the M-tree, when they are extended using a cheap approximate distance function as a filter to quickly discard irrelevant strings. Using the bag distance as an approximation of the edit distance, we show an improvement in performance up to 90% with respect to the basic case. This, along with the fact that our solution is independent on both the distance used in the pre-test and on the underlying metric index, demonstrates that metric indices are a powerful solution, not only for many modern application areas, as multimedia, data mining and pattern recognition, but also for the string matching problem.
P. Ciaccia and M. Patella. PAC Nearest Neighbor Queries: Approximate and Controlled Search in High-Dimensional and Metric Spaces. In Proceedings of the 16th International Conference on Data Engineering (ICDE 2000), San Diego, California, USA, March 2000.
In high-dimensional and complex metric spaces, determining the nearest neighbor (NN) of a query object q can be a very expensive task, because of the poor partitioning operated by index structures - the so-called "curse of dimensionality". This also affects approximately correct (AC) algorithms, which return as result a point whose distance from q is less than (1+e) times the distance between q and its true NN. In this paper we introduce a new approach to approximate similarity search, called PAC-NN queries, where the error bound e can be exceeded with probability d, where both e and d parameters can be tuned at query time to trade off cost of the search and quality of the result. We describe sequential and index-based PAC-NN algorithms that exploit the distance distribution of the query object in order to determine a stopping condition which respects the error bound. Analysis and experimental evaluation of the sequential algorithm confirms that, for moderately large datasets and suitable e and d values, PAC-NN queries can be efficiently solved and error controlled. Then, we provide experimental evidence that indexing can further speed-up the retrieval process by up to 1-2 orders of magnitude without giving up accuracy.
P. Ciaccia and M. Patella. Using the Distance Distribution for Approximate Similarity Queries in High-Dimensional Metric Spaces. In Proceedings of the 1st International Workshop on Similarity Search (IWOSS'99), Florence, Italy, September 1999.
We investigate the problem of approximate similarity (nearest neighbor) search in high-dimensional metric spaces, and describe how the distance distribution of the query object can be exploited so as to provide probabilistic guarantees on the quality of the result. This leads to a new paradigm for similarity search, called PAC-NN (probably approximately correct nearest neighbor) queries, aiming to break the "dimensionality curse". PAC-NN queries return, with probability at least 1-delta, a (1+eps)-approximate NN - an object whose distance from the query q is less than (1+eps) times the distance between q and its NN. Analytical and experimental results obtained for sequential and index-based algorithms show that PAC-NN queries can be efficiently processed even on very high-dimensional spaces and that control can be exerted in order to tradeoff the accuracy of the result and the cost.
P. Zezula, P. Savino, P. Ciaccia, and F. Rabitti. On the Region Proximity in Metric Spaces. In Proceedings of the 1st International Workshop on Similarity Search (IWOSS'99), Florence, Italy, September 1999.
The problem of defining and measuring proximity of generic metric space regions is investigated. Though the proposed probabilistic approach is valid for arbitrary regions, specific ready-to-use formulas are developed for the important case of ball regions, respecting arbitrary distance distributions. Given a measure of proximity, the concepts of similarity and selectivity of regions are also discussed.
P. Ciaccia and M. Patella. PAC Nearest Neighbor Queries: Using the Distance Distribution for Searching in High-Dimensional Metric Spaces. In Atti del Settimo Convegno Nazionale SEBD, Como, Italy, June 1999.
In this paper we introduce a new paradigm for similarity search, called PAC-NN (probably approximately correct nearest neighbor) queries, aiming to break the "dimensionality curse" which inhibits current approaches to be applied in high-dimensional spaces. PAC-NN queries return, with probability at least 1-delta, a (1+eps)-approximate NN - an object whose distance from the query q is less than (1+eps) times the distance between q and its NN. We describe how the distance distribution of the query object can be used to determine a suitable stopping condition with probabilistic guarantees on the quality of the result, and then analyze performance of both sequential and index-based PAC-NN algorithms. This shows that PAC-NN queries can be efficiently processed even on very high-dimensional spaces and that control can be exerted in order to tradeoff between the accuracy of the result and the cost.
P. Ciaccia, A. Nanni, and M. Patella. A Query-sensitive Cost Model for Similarity Queries with M-tree. In Proceedings of th 10th Australasian Database Conference (ADC'99), Auckland, New Zealand, January 1999.
We introduce a cost model for the M-tree access method which provides estimates of CPU (distance computations) and I/O costs for the execution of similarity queries as a function of each single query. This model is said to be query-sensitive, since it takes into account, by relying on the novel notion of "witness", the "position" of the query point inside the metric space indexed by the M-tree. We describe the basic concepts underlying the model along with different methods which can be used for its implementation; finally, we experimentally validate the model over both real and synthetic datasets.
P. Ciaccia, M. Patella, and P. Zezula. A Cost Model for Similarity Queries in Metric Spaces. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS '98), Seattle, Washington, USA, June, 1998.
We consider the problem of estimating CPU (distance computations) and I/O costs for processing range and k-nearest neighbors queries over metric spaces. Unlike the specific case of vector spaces, where information on data distribution has been exploited to derive cost models for predicting the performance of multi-dimensional access methods, in a generic metric space there is no such a possibility, which makes the problem quite different and requires a novel approach. We insist that the distance distribution of objects can be profitably used to solve the problem, and consequently develop a concrete cost model for the M-tree access method. Our results rely on the assumption that the indexed dataset comes from a metric space which is "homogeneous" enough (in a probabilistic sense) to allow reliable cost estimations even if the distance distribution with respect to a specific query object is unknown. We experimentally validate the model over both real and synthetic datasets, and show how the model can be used to tune the M-tree in order to minimize a combination of CPU and I/O costs. Finally, we sketch how the same approach can be applied to derive a cost model for the vp-tree index structure.
P. Zezula, P. Savino, G. Amato, and F. Rabitti. Approximate Similarity Retrieval with M-Trees. VLDB Journal 7(4), 1998.
Motivated by the urgent need to improve the efficiency of similarity queries, approximate similarity retrieval is investigated in the environment of a metric tree index called the M-tree. Three different approximation techniques are proposed, which show how to forsake query precision for improved performance. Measures are defined that can quantify the improvements in performance efficiency and the quality of approximations. The proposed approximation techniques are then tested on various synthetic and real-life files. The evidence obtained from the experiments confirms our hypothesis that a high-quality approximated similarity search can be performed at a much lower cost than that needed to obtain the exact results. The proposed approximation techniques are scalable and appear to be independent of the metric used. Extensions of these techniques to the environments of other similarity search indexes are also discussed.
P. Ciaccia, M. Patella, and P. Zezula. Processing complex similarity queries with distance-based access methods. In Proceedings of the 6th EDBT International Conference, Valencia, Spain, March, 1998.
Efficient evaluation of similarity queries is one of the basic requirements for advanced multimedia applications. In this paper, we consider the relevant case where complex similarity queries are defined through a generic language L and whose predicates refer to a single feature F. Contrary to the language level which deals only with similarity scores, the proposed evaluation process is based on distances between feature values - known spatial or metric indexes use distances to evaluate predicates. The proposed solution suggests that the index should process complex queries as a whole, thus evaluating multiple similarity predicates at a time. The flexibility of our approach is demonstrated by considering three different similarity languages, and showing how the M-tree access method has been extended to this purpose. Experimental results clearly show that performance of the extended M-tree version is consistently better than that of state-of-the-art search algorithms.
P. Zezula, P. Savino, F. Rabitti, G. Amato, and P. Ciaccia. Processing M-trees with Parallel Resources. In Eighth International Workshop on Research Issues in Data Engineering (RIDE'98), Orlando, Florida, February 1998.
The problem of the design and implementation of parallel metric tree indexes, called M-trees, is elaborated. Four different object declustering techniques are proposed and tested in order to get a sufficient evidence needed for specifying the pros and cons of their application. In general, the obtained I/O speedup and scaleup levels are high. A way how to deal with the CPU parallelism is also proposed and its speedup and scaleup experimentally tested.
P. Ciaccia, and M. Patella. Bulk loading the M-tree. In Proceedings of th 9th Australasian Database Conference (ADC'98), Perth, Australia, February 1998.
The M-tree is a dynamic paged structure that can be effectively used to index multimedia databases, where objects are represented by means of complex features, and similarity queries require the computation of time-consuming distance functions. The initial loading of the M-tree, however, can be very expensive. In this paper we propose a fast (bulk) loading algorithm to speed-up the creation of the tree on a given dataset. Experimental results show that our
BulkLoadingalgorithm can significantly improve the index' performance with respect to M-tree insertion methods, and its performance is comparable to that of other static metric trees.
P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd VLDB International Conference, Athens, Greece, September 1997.
A new access method, called M-tree, is proposed to organize and search large data sets from a generic "metric space", i.e. where object proximity is only defined by a distance function satisfying the positivity, symmetry, and triangle inequality postulates. We detail algorithms for insertion of objects and split management, which keep the M-tree always balanced - several heuristic split alternatives are considered and experimentally evaluated. Algorithms for similarity (range and k-nearest neighbors) queries are also described. Results from extensive experimentation with a prototype system are reported, considering as the performance criteria the number of page I/O's and the number of distance computations. The results demonstrate that the M-tree indeed extends the domain of applicability beyond the traditional vector spaces, performs reasonably well in high-dimensional data spaces, and scales well in case of growing files.
P. Ciaccia, M. Patella, F. Rabitti, and P. Zezula. Indexing metric spaces with M-tree. In Atti del Quinto Convegno Nazionale SEBD, Verona, Italy, June 1997.
M-tree is a dynamic access method suitable to index generic "metric spaces", where the function used to compute the distance between any two objects satisfies the positivity, symmetry, and triangle inequality postulates. The M-tree design fulfills typical requirements of multimedia applications, where objects are indexed using complex features, and similarity queries can require application of time-consuming distance functions. In this paper we describe the basic search and management algorithms of M-tree, introduce several heuristic split policies, and experimentally evaluate them, considering both I/O and CPU costs. Results also show that M-tree performs better than R*-tree on high-dimensional vector spaces.
M. Patella. Similarity Search in Multimedia Databases. PhD thesis, Dipartimento di Elettronica Informatica e Sistemistica, Università degli Studi di Bologna, Bologna, Italy, February 1999.
P. Ciaccia, M. Patella, F. Rabitti, and P. Zezula. Performance of M-tree, an access method for similarity search in metric spaces. Technical Report 13, HERMES ESPRIT LTR Projects, 1996.
A new access method, called M-tree, is proposed to organize and search large data sets from a generic "metric space", i.e. where object proximity is only defined by a distance function satisfying the positivity, symmetry, and triangle inequality postulates. The M-tree design has been motivated by retrieval requirements from typical multimedia database applications, where objects, such as text, image, and video, are indexed using complex feature representations, and search for objects similar to a query object can involve application of time-consuming distance functions. We detail algorithms for insertion of objects and split management, which keep the M-tree always balanced - several heuristic split alternatives are considered and experimentally evaluated. Algorithms for similarity (range and k-nearest neighbors) queries are also described. Results from extensive experimentation with a prototype system are reported, considering as the performance criteria the number of page I/O's and the number of distance computations. The results demonstrate that the M-tree indeed extends the domain of applicability beyond the traditional vector spaces, performs reasonably well in high-dimensional data spaces, and scales well in case of growing files.
P. Zezula, P. Ciaccia, and F. Rabitti. M-tree: A dynamic index for similarity queries in multimedia databases. Technical Report 7, HERMES ESPRIT LTR Projects, 1996.
The problem of indexing dynamic collections of persistent objects in order to support efficient execution of similarity queries, is elaborated. Motivated by urgent needs of multimedia applications, non vector-based representations of object features are considered - existing index structures either do not apply to this case at all or do not guarantee stable performance in dynamic environments. Our solution leads to the definition of a new dynamic secondary-memory paged tree structure, called M-tree, which always remains balanced and can be applied whenever the distance function used to quantify object similarity is a metric. Basic principles of M-tree are presented, and algorithms for search and insertion are outlined. Implementation performance issues are also discussed.