Cost Models for Metric Trees

Predicting the performance of access methods is essential for query optimization, database design, and index tuning purposes. Since, unlike vector spaces, in a generic metric space no information on data distribution can be used, cost models developed for multi-dimensional (spatial) access methods cannot be applied to the M-tree and, more in general, to metric trees. To obviate this we have pursued a more general approach, which can be characterized by the following position:

The only information derivable from the analysis of a metric dataset concerns the distance distribution of objects. Further, we claim that the distance distribution is the correct counterpart of data distribution used for vector spaces, it being the "natural" way to characterize a metric dataset.

The Distance Distribution

We define the (overall) distribution of distances over a feature domain D as:

F(x) = Pr{ d(O₁, O₂) <= x }

where O₁ and O₂ are two (independent) random points of D. By setting O₁=O_i in above equation, the relative distance distribution of object O_i is obtained, i.e:

F_{O_i}(x) = Pr{ d(O_i, O₂) <= x }

which represents the expected fraction of objects in D whose distance from O_i does not exceed x.

Based on the observation that, given a query object q, its relative distance distribution, F_q(x), is the key information needed to predict both I/O and CPU query processing costs, the two cost models we have derived differ in the way how they estimate F_q(x).

The Average-Case Cost Model

The average-case cost model, introduced in PODS 98, relies on the assumption that the relative distance distribution of query object q can be reliably estimated by directly using (a sample of) the overall distance distribution. Experimental results (see our PODS 98 paper) show that this assumption leads to quite accurate cost estimates, averaged over a set of queries (for range- and k-NN queries).

The Query-Sensitive Cost Model

When dealing with specific queries, the accuracy of the average-case model decreases due to the "loose" approximation of F_q(x) as provided by the overall distance distribution. The query-sensitive model solves this problem by making the estimate of F_q(x) dependent on the "position" of the query object q. This is obtained by maintaining the relative distance distributions of several "witnesses" (chosen in some way among the objects in the indexed dataset, in order to reflect their "distribution"); such distributions are then combined in order to "reconstruct" F_q(x), by taking into account the distances between q and the witnesses.

Starting from the notion of witness, the basic problems to address are: (1) How witnesses have to be chosen among all the objects of the dataset, and (2) how their relative distributions have to be combined. In our ADC 99 paper we have investigated different solutions to such problems, showing how the query-sensitive model is indeed very effective in predicting query costs for the M-tree, both on synthetic and on real datasets.