Recovery Mapping:

from RTED to ZS

March 20, 2020

 

Xingyu Xie

Tsinghua University

Question in Tree Matching

After building mappings between isomorphic subtrees, "recover" mappings between similar but not isomorphic subtrees.

This raises the question: how do we find recovery mappings?

A simpler edit script

A simple edit script between two ordered labeled trees consists of three kinds of edit actions:

  • delete
  • insert
  • rename => recovery mapping!

A new question!

Best recovery mapping <=> Best edit script

 

How to characterize best edit script?

Assume that every kind of action has a cost:

  • cost(delete) = 1
  • cost(insert) = 1
  • cost(rename(u, v)) = \left\{ \begin{array}{ll} similarity(value(u), value(v)) &, type(u) = type(v) \\ \infty &, type(u) \neq type(v) \end{array} \right.

The edit script with minimum cost is taken to be the best edit script.

Finding the minimum-cost edit script between two ordered labeled trees (the tree edit distance) is a well-studied problem.
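The cost model above can be sketched in Python as follows. This is only an illustration: the `Node` class and the three function names are hypothetical, and the rename cost is assumed to be one minus a string similarity ratio (so that identical values cost nothing), which the slides do not specify.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Node:
    type: str   # e.g. the kind of AST node
    value: str  # e.g. the identifier or literal text


def delete_cost(u: Node) -> float:
    return 1.0


def insert_cost(v: Node) -> float:
    return 1.0


def rename_cost(u: Node, v: Node) -> float:
    # Nodes of different types must never be mapped to each other.
    if u.type != v.type:
        return math.inf
    # Assumption: cost = 1 - similarity ratio, so equal values cost 0.
    return 1.0 - SequenceMatcher(None, u.value, v.value).ratio()
```

With this choice, renaming a node to an identical value is free, while mapping nodes of different types is ruled out by the infinite cost.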

Algorithms for Tree Edit Distance

Algorithm     Time complexity   Characteristics
ZS [3]        O(n^4)            only O(n^2 \log^2 n) when the tree is almost balanced
Demaine [4]   O(n^3)            worst-case optimal; usually runs worst
RTED [2]      O(n^3)            not worse than ZS and Demaine in any case

ZS Algorithm:

Dynamic Programming

Question: the minimum cost of editing one tree into another

Let's think from the perspective of dynamic programming!

Consider the roots of the trees...

Three cases

The question becomes the minimum edit cost between forests, which looks much harder...

The magic of DFS ordering

Renumber the nodes in DFS postorder:

forest => interval

subtree => interval

root => the last element of the interval

subtree without its root => interval
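The correspondence above can be checked with a small Python sketch. The `Tree` class and the `postorder` function are hypothetical names introduced for illustration; `lml[i]` stores the postorder index of the leftmost leaf below node `i`, so the subtree rooted at `i` is exactly the interval `lml[i]..i`.

```python
from dataclasses import dataclass, field


@dataclass
class Tree:
    label: str
    children: list["Tree"] = field(default_factory=list)


def postorder(root: Tree):
    """Return (labels, lml) using 1-based postorder indices.

    labels[i] is the label of node i; lml[i] is the index of the leftmost
    leaf below node i. Index 0 is unused padding.
    """
    labels, lml = [None], [0]

    def visit(node: Tree) -> int:
        first = None
        for child in node.children:
            idx = visit(child)
            if first is None:
                first = lml[idx]  # leftmost leaf of the first child
        labels.append(node.label)
        me = len(labels) - 1      # postorder index of this node
        lml.append(first if first is not None else me)
        return me

    visit(root)
    return labels, lml


# Example tree: f(d(a, c(b)), e)
example = Tree("f", [Tree("d", [Tree("a"), Tree("c", [Tree("b")])]), Tree("e")])
labels, lml = postorder(example)
```

In this example the subtree rooted at `d` (index 4) is the interval 1..4, and removing its root leaves the interval 1..3, which is the forest made of `a` and the subtree of `c`.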

Just restate the question

Question: the minimum-cost edit script from one interval to another.

Let's consider the root of the rightmost subtree:

Bingo!

All three cases can be handled.

 

Time complexity and space complexity are both O(n^4).
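A minimal sketch of this interval formulation, written as a memoized recursion in Python. It is an illustration under assumptions, not the implementation from the talk: trees are given as nested `(label, children)` tuples, `flatten` and `tree_edit_distance` are hypothetical names, and rename is given a unit cost (0 for equal labels, 1 otherwise).

```python
from functools import lru_cache


def flatten(tree):
    """Postorder labels and leftmost-leaf indices (1-based) of (label, children)."""
    labels, lml = [None], [0]

    def visit(node):
        label, children = node
        idxs = [visit(c) for c in children]
        labels.append(label)
        lml.append(lml[idxs[0]] if idxs else len(labels) - 1)
        return len(labels) - 1

    visit(tree)
    return labels, lml


def tree_edit_distance(t1, t2):
    a, la = flatten(t1)
    b, lb = flatten(t2)

    @lru_cache(maxsize=None)
    def dist(l1, i, l2, j):
        """Edit distance between the intervals a[l1..i] and b[l2..j]."""
        if i < l1 and j < l2:
            return 0                              # both intervals are empty
        if j < l2:
            return dist(l1, i - 1, l2, j) + 1     # only deletions remain
        if i < l1:
            return dist(l1, i, l2, j - 1) + 1     # only insertions remain
        return min(
            dist(l1, i - 1, l2, j) + 1,           # delete node i
            dist(l1, i, l2, j - 1) + 1,           # insert node j
            dist(l1, la[i] - 1, l2, lb[j] - 1)    # everything left of both subtrees
            + dist(la[i], i - 1, lb[j], j - 1)    # subtree of i vs subtree of j, roots removed
            + (0 if a[i] == b[j] else 1),         # rename i to j (unit cost, an assumption)
        )

    return dist(1, len(a) - 1, 1, len(b) - 1)


T1 = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
T2 = ("f", [("c", [("d", [("a", []), ("b", [])])]), ("e", [])])
```

Each state is a pair of intervals, which is where the quartic bound comes from. For `T1` and `T2` above, one optimal script deletes `c` from the first tree and inserts it again above `d`.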

Optimization: quartic => quadratic

Space: when subtrees are processed in postorder, only the edit distances between pairs of subtrees and the table for the subtree pair currently being computed need to be memoized, which takes O(n^2) space.

Time: for a balanced tree, the sum of the sizes of the red subtrees is O(n \log n), so the total time is reduced to O(n^2 \log^2 n).
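The optimized scheme can be sketched as below, following the usual key-root formulation of the ZS algorithm [3]. The names `flatten`, `keyroots`, and `zhang_shasha`, the nested `(label, children)` tuple format, and the unit rename cost are all assumptions made for this illustration.

```python
def flatten(tree):
    """Postorder labels and leftmost-leaf indices (1-based) of (label, children)."""
    labels, lml = [None], [0]

    def visit(node):
        label, children = node
        idxs = [visit(c) for c in children]
        labels.append(label)
        lml.append(lml[idxs[0]] if idxs else len(labels) - 1)
        return len(labels) - 1

    visit(tree)
    return labels, lml


def keyroots(lml):
    """For each leftmost leaf, the highest node sharing it: the root plus
    every node that has a left sibling."""
    last = {}
    for i in range(1, len(lml)):
        last[lml[i]] = i
    return sorted(last.values())


def zhang_shasha(t1, t2):
    a, la = flatten(t1)
    b, lb = flatten(t2)
    n, m = len(a) - 1, len(b) - 1
    td = [[0] * (m + 1) for _ in range(n + 1)]  # subtree distances, O(n*m) space
    for i in keyroots(la):
        for j in keyroots(lb):
            li, lj = la[i], lb[j]
            rows, cols = i - li + 2, j - lj + 2
            # fd[x][y]: distance between a[li..li+x-1] and b[lj..lj+y-1]
            fd = [[0] * cols for _ in range(rows)]
            for x in range(1, rows):
                fd[x][0] = fd[x - 1][0] + 1
            for y in range(1, cols):
                fd[0][y] = fd[0][y - 1] + 1
            for x in range(1, rows):
                for y in range(1, cols):
                    p, q = li + x - 1, lj + y - 1
                    if la[p] == li and lb[q] == lj:
                        # Both prefixes are whole subtrees: record the result.
                        fd[x][y] = min(fd[x - 1][y] + 1, fd[x][y - 1] + 1,
                                       fd[x - 1][y - 1] + (a[p] != b[q]))
                        td[p][q] = fd[x][y]
                    else:
                        # Reuse the subtree distance computed for an earlier key root.
                        fd[x][y] = min(fd[x - 1][y] + 1, fd[x][y - 1] + 1,
                                       fd[la[p] - li][lb[q] - lj] + td[p][q])
    return td[n][m]


T1 = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
T2 = ("f", [("c", [("d", [("a", []), ("b", [])])]), ("e", [])])
```

Only `td` lives across iterations; each `fd` table is discarded once its key-root pair is finished, which is what keeps the memory quadratic.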

RTED Algorithm: Basic Idea

Question: which subtree's root should be removed at each step of the recursion?

ZS: the rightmost subtree

Demaine: the largest subtree

RTED: the optimal subtree

RTED in GumTree

Actual implementation:

  • Not used
  • Not finished

 

Possible reasons:

  • Tooooooooo complicated to implement and debug: 1765 LoC
  • No prominent advantage over the ZS algorithm on small inputs

Summary

  • Problem Transformation: recovery mapping => minimum edit distance
  • ZS Algorithm: simple and efficient in practice
  • RTED: not practical at all

Rethink Again

  • The recovery mappings do obey monotonicity
  • More accurate similarity index
  • Handle unordered children: an assignment problem, which can be solved by the Hungarian algorithm in O(n^3)
  • Does best edit script really characterize the best recovery mapping?
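To make the assignment-problem bullet concrete, here is a tiny Python sketch that matches the children of one node to the children of another so that the total cost is minimal. Note that it swaps in brute force over all permutations, which is O(n!) and only suitable for very small inputs; the Hungarian algorithm solves the same problem in O(n^3). The function name and the cost matrix are hypothetical.

```python
from itertools import permutations


def best_child_assignment(cost):
    """Minimum-cost one-to-one matching for a square cost matrix.

    cost[i][j] is the assumed cost of matching child i of one node
    with child j of the other. Returns (total_cost, assignment), where
    assignment[i] is the column matched to row i.
    """
    n = len(cost)

    def total(perm):
        return sum(cost[i][perm[i]] for i in range(n))

    best = min(permutations(range(n)), key=total)
    return total(best), list(best)


example_cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
result = best_child_assignment(example_cost)
```

In a tree matcher, the entries of the matrix would be edit distances between pairs of child subtrees.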

Reference

[1] J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus. Fine-grained and accurate source code differencing. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering (ASE '14). ACM, 2014.

[2] M. Pawlik and N. Augsten. RTED: a robust algorithm for the tree edit distance. PVLDB, 5(4):334–345, 2011.

[3] K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput., 18(6):1245–1262, 1989.

Tree Edit Distance: from ZS to RTED

By namasikanam TA
