Ranking

Ranking

Relaxations on both semantics and structure are expected to affect the score of results. Ranking approaches based on the number of query transformations, such as the Tree Edit Distance, appear to be too coarse-grained.

Consider Figure 3:

Figure 3: Query q1 and q2 require the same transformations to match the document.

Assuming a Tree Edit Distance approach, the satisfiability of the two queries would

have been considered equivalent. Indeed, although we are comparing different queries,

and not different results, it clearly appears that the score they have been assigned

does not provide any information on the rate of query satisfied by the document.

Query q1 is indeed satisfied after modification of 3 out of 7 nodes (high percentage),

whereas query q2 is satisfied after modification of 3 out of 11 nodes (low percentage).

This emphasizes that the current document satisfies better query q2 than query q1,

and also provides information on the quality of results.

Thus, an effective ranking function should take into account the query rate satisfied by an answer, i.e. the completeness of the anwer, so as to tune better its relevance.

Moreover, the use of wildcards flattens the score of results, which are considered relevant no matter the grade of sparseness of the data retrieved. Thus, a further key element to be considered is the cohesion of the aswers.

We propose a ranking function, called SATES (Scored Approximate Tree Embedding Set function), that captures all these features:

Given a query tree t_q , a document tree t_d, and an approximate tree embedding e[t_q,t_d] of t_q in t_d , SATES is a function that returns a score s in S=[0,1] such that, with respect to e, s is obtained as a combination of the following components:

s₁, that indicates how much e is semantically complete with respect to the given query
s₂, which denotes semantic correctness of e, in that it states how well the embedding satisfies semantic requirements
s₃, that represents the structural completeness of e with respect to the given query; it denotes the structural coverage of the query
s₄, that expresses the structural correctness of e, in that it is a measure of how well constraints on structure are respected in the embedding
s₅, which specifies cohesion of e, by providing the grade of fragmentation of the retrieved embedding

Then, s = F(s₁,s_2,s_3,s_4,s₅) is given by the combination of the above components, with F (possibly weighable) combine function. s states the overall similarity score of the embedding e.

More details can be found in [CP02] [CP02a] [Pen02].