Approximate Searching

It is a fact that, depending on the characteristics of the dataset at hand, indexing might not be the best solution. Indeed, the performance of index trees has been repeatedly observed to deteriorate in high-D spaces, so that, even for D as low as 10-15, a linear scan of the dataset would perform (much) better. Furthermore, recent mathematical studies demonstrate that this unpleasant phenomenon, known as "the curse of dimensionality", is not peculiar to vector spaces, but can also affect more complex metric spaces, and it is tightly related to the distribution of distances between the indexed objects and the query object. Intuitively, the more such distances are all similar each other, i.e. their variance is low, the more searching is difficult.

On the other hand, when objects are naturally organized into clusters or the intrinsic (or fractal) dimensionality of the dataset is low, NN search can be efficiently solved. In this case, a multi-step filter-and-refine approach has been proposed, the idea being to initially use an easy-to-compute distance function which lower bounds the original one, and then to compute the actual result by evaluating the original distance function only on the set of candidates returned by the filter step. This is also the basic idea underlying the use of dimensionality-reduction techniques.

In our ICDE 2000 paper, we pursue a different, yet complementary, direction which extends previous work on approximate NN search, i.e. where one does not require that the result has to be the "correct" NN of the query object. Approximate queries are suitable to a variety of scenarios, especially when the query specification is itself a "guess". This is the case in exploratory data analysis, in content-based image retrieval, and in many other real-life situations. Furthermore, in many cases the difference between the NN and a "good" approximation could be indistinguishable from a practical point of view.

For approximate queries the two conflicting requirements to be satisfied are low processing cost and high accuracy of the result, i.e. low error. The approach undertaken by what here we call approximately correct nearest neighbor (AC-NN) queries is to specify the maximum relative error to be tolerated, ε > 0, thus one is guaranteed to obtain a result whose distance from the query object does not exceed (1+ε) times the distance between the query object and its NN. However, experimental results show that AC-NN algorithms are still plagued by the dimensionality curse and become unpractical when D is intrinsically high, regardless of ε.

We propose a probabilistic approach to approximate NN search, which allows two parameters to be specified at query time: the accuracy ε allows for a certain relative error in the result, and the confidence δ guarantees, with probability at least (1-δ), that ε will not be exceeded. This generalizes both AC-NN queries, obtained when δ=0, as well as correct (C-NN) queries ε=δ=0).

The basic information used by our PAC (probably approximately correct) NN algorithms is the distance distribution of the query object, which is exploited to derive a stopping condition with provable quality guarantees, the basic idea being to avoid searching "too close" to the query object.

Since the complexity of the PAC-NN sequential algorithm is still linear in the dataset size, we introduce a PAC-NN index-based algorithm which we have implemented in the M-tree, and experimentally demonstrate that performance can improve by 1-2 orders of magnitude. Although we use the M-tree for practical reasons, our algorithm and results apply to any multi-dimensional or metric index tree. We also demonstrated that, for any value of the ε accuracy parameter, the δ confidence parameter can be chosen in such a way that the actual average relative error stays indeed very close to ε. This implies that an user can indeed exert an effective control on the quality of the result, trading off between accuracy and cost. Details can be found in our ICDE 2000 paper

Software

The code for PAC queries is not included into the M-tree library. If you are interested in the code, please send a message to marco.patella@unibo.it