Recovery Mapping:
from RTED to ZS
March 20, 2020
Xingyu Xie
Tsinghua University
Question in Tree Matching
After building mappings between isomorphic subtrees, "recover" mappings between similar but not isomorphic subtrees.
This raises the question: how do we find recovery mappings?
A simpler edit script
A simple edit script between two ordered labeled trees consists of three kinds of edit actions:
- delete
- insert
- rename => recovery mapping!
A new question!
Best recovery mapping <=> Best edit script
How to characterize best edit script?
Assume that every kind of action has a cost:
- cost(delete) = 1
- cost(insert) = 1
The edit script with minimum cost is regarded as the best edit script.
Finding the minimum-cost edit script between two ordered labeled trees (the tree edit distance) is a well-studied problem.
Algorithms for Tree Edit Distance
algorithm | time complexity | characteristics |
---|---|---|
ZS[3] | O(n^4) worst case; O(n^2 log^2 n) for balanced trees | efficient only when the tree is almost balanced |
Demaine[4] | O(n^3) | worst-case optimal; usually runs into its worst case |
RTED[2] | O(n^3) | never worse than ZS or Demaine on any input |
ZS Algorithm:
Dynamic Programming
Question: the minimum cost of editing one tree into another
Let's think from the perspective of dynamic programming!
Consider the roots of the trees...
Three cases
The question becomes the minimum edit cost between forests, which looks much harder...
The magic of DFS ordering
Renumber the nodes in DFS post-order:
Forest => Interval
subtree => interval
root => the last element of interval
subtree without root => interval
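The correspondence above can be checked with a small sketch. The tuple representation `(label, [children])`, the function name `postorder_intervals`, and the sample tree are assumptions made for illustration, not part of the slides. Each node `i` gets a 1-based post-order index, and `left[i]` is the index of the leftmost leaf below it, so the subtree rooted at `i` is exactly the interval `left[i]..i` and the subtree without its root is `left[i]..i-1`.

```python
def postorder_intervals(tree):
    """Number the nodes of `tree` in DFS post-order (1-based).

    `tree` is a (label, [children]) tuple (an assumed representation).
    Returns (labels, left): labels[i] is the label of node i, and left[i]
    is the post-order index of the leftmost leaf in the subtree of node i.
    Index 0 is a placeholder so that indices start at 1.
    """
    labels, left = [None], [0]

    def visit(node):
        label, children = node
        first = None
        for child in children:
            child_left = visit(child)
            if first is None:
                first = child_left  # leftmost leaf comes from the first child
        labels.append(label)
        index = len(labels) - 1
        left.append(first if first is not None else index)  # a leaf is its own leftmost leaf
        return left[index]

    visit(tree)
    return labels, left


# Hypothetical sample tree: f(d(a, c(b)), e)
sample = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
labels, left = postorder_intervals(sample)
for i in range(1, len(labels)):
    print(f"subtree rooted at {labels[i]}: interval {left[i]}..{i}")
```

In this sample the subtree rooted at `d` occupies the interval 1..4, and the forest left after removing the root `f` is the interval 1..5.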
Just restate the question
Question: the minimum-cost edit script from one interval into another.
Let's consider the root of the rightmost subtree:
Bingo!
All three cases can be handled appropriately.
Time complexity and space complexity are both O(n^4).
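The three cases for the rightmost roots can be written as a recurrence. This is a sketch under assumed notation that does not appear in the slides: $T[a..i]$ is the forest formed by post-order nodes $a$ through $i$, $l(i)$ is the leftmost leaf of the subtree rooted at $i$, and $\gamma(i, j)$ is the rename cost (zero when the labels match). The slides fix only the delete and insert costs at 1, so $\gamma$ is left symbolic.

```latex
\begin{aligned}
d(\emptyset, \emptyset) &= 0 \\
d(T_1[a..i], \emptyset) &= d(T_1[a..i-1], \emptyset) + 1 \\
d(\emptyset, T_2[b..j]) &= d(\emptyset, T_2[b..j-1]) + 1 \\
d(T_1[a..i], T_2[b..j]) &= \min
\begin{cases}
d(T_1[a..i-1], T_2[b..j]) + 1 & \text{(delete } i\text{)} \\
d(T_1[a..i], T_2[b..j-1]) + 1 & \text{(insert } j\text{)} \\
d(T_1[a..l(i)-1], T_2[b..l(j)-1]) + d(T_1[l(i)..i-1], T_2[l(j)..j-1]) + \gamma(i, j) & \text{(rename } i \to j\text{)}
\end{cases}
\end{aligned}
```

Every state is a pair of intervals, one per tree, which is where the quartic number of states comes from.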
Optimization: quartic => quadratic
Space: when processing in post-order, only the edit distances between pairs of subtrees and the table for the subtree pair currently being computed need to be memoized.
Time: for a balanced tree, the sum of the sizes of the red subtrees is O(n log n), so the total time is reduced to O(n^2 log^2 n).
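The optimized dynamic program can be sketched as follows. This follows the standard keyroot formulation from [3], but the tuple tree representation, the function names, and the unit rename cost are assumptions for illustration; the slides fix only the delete and insert costs. The permanent table `td` holds subtree-to-subtree distances, while the temporary table `fd` is rebuilt for each pair of keyroots.

```python
def postorder(tree):
    """Return 1-based post-order labels and leftmost-leaf indices of a (label, [children]) tree."""
    labels, left = [None], [0]

    def visit(node):
        label, children = node
        first = None
        for child in children:
            child_left = visit(child)
            if first is None:
                first = child_left
        labels.append(label)
        left.append(first if first is not None else len(labels) - 1)
        return left[-1]

    visit(tree)
    return labels, left


def keyroots(left):
    """For each distinct leftmost-leaf value, keep the largest node index that has it."""
    last = {}
    for i in range(1, len(left)):
        last[left[i]] = i  # later (larger) indices overwrite earlier ones
    return sorted(last.values())


def tree_edit_distance(t1, t2, cost_del=1, cost_ins=1, cost_ren=1):
    """Edit distance between two ordered labeled trees (cost_ren=1 is an assumption)."""
    lab1, l1 = postorder(t1)
    lab2, l2 = postorder(t2)
    n, m = len(lab1) - 1, len(lab2) - 1
    # td[x][y]: distance between the subtree rooted at x and the subtree rooted at y.
    td = [[0] * (m + 1) for _ in range(n + 1)]
    for i in keyroots(l1):
        for j in keyroots(l2):
            # fd[a][b]: distance between the first a nodes of interval l1[i]..i
            # and the first b nodes of interval l2[j]..j (index 0 = empty forest).
            rows, cols = i - l1[i] + 2, j - l2[j] + 2
            fd = [[0] * cols for _ in range(rows)]
            for a in range(1, rows):
                fd[a][0] = fd[a - 1][0] + cost_del
            for b in range(1, cols):
                fd[0][b] = fd[0][b - 1] + cost_ins
            for a in range(1, rows):
                x = l1[i] + a - 1  # post-order index of the rightmost root in T1
                for b in range(1, cols):
                    y = l2[j] + b - 1  # post-order index of the rightmost root in T2
                    delete = fd[a - 1][b] + cost_del
                    insert = fd[a][b - 1] + cost_ins
                    if l1[x] == l1[i] and l2[y] == l2[j]:
                        # Both prefixes are whole trees: rename x into y.
                        rename = 0 if lab1[x] == lab2[y] else cost_ren
                        fd[a][b] = min(delete, insert, fd[a - 1][b - 1] + rename)
                        td[x][y] = fd[a][b]
                    else:
                        # Map subtree x onto subtree y; the rest are smaller forests.
                        pa, pb = l1[x] - l1[i], l2[y] - l2[j]
                        fd[a][b] = min(delete, insert, fd[pa][pb] + td[x][y])
    return td[n][m]


# Hypothetical example: f(d(a, c(b)), e) versus f(c(d(a, b)), e)
t1 = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
t2 = ("f", [("c", [("d", [("a", []), ("b", [])])]), ("e", [])])
print(tree_edit_distance(t1, t2))
```

For this pair the distance is 2: delete `c` from the first tree, then insert `c` above `d`.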
RTED Algorithm: Basic Idea
Question: which subtree's root should be removed at each step of the recursion?
ZS: the rightmost subtree
Demaine: the largest subtree
RTED: the optimal subtree
RTED in GumTree
Actual implementation:
- Not used
- Not finished
Possible reasons:
- Too complicated to implement and debug: 1765 LoC
- No clear advantage over the ZS algorithm on small inputs
Summary
- Problem Transformation: recovery mapping => minimum edit distance
- ZS Algorithm: simple and efficient in practice
- RTED: not practical at all
Rethink Again
- The recovery mappings do obey monotonicity
- More accurate similarity index
- Handle the unordered children: Assignment problem, which could be solved by Hungarian algorithm in O(n^3)
- Does best edit script really characterize the best recovery mapping?
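The assignment problem mentioned for unordered children can be illustrated with a small sketch. This is a brute-force stand-in that tries every permutation, not the Hungarian algorithm itself, which would find the same optimum in O(n^3). The function name and the cost matrix are hypothetical; `cost[r][c]` stands for an assumed dissimilarity between child `r` of one node and child `c` of the other.

```python
from itertools import permutations


def best_assignment(cost):
    """Return (total, pairs) for the cheapest one-to-one matching of rows to columns.

    Brute force over all permutations of a square cost matrix; exponential,
    so only suitable for illustrating the problem on a few children.
    """
    n = len(cost)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[r][perm[r]] for r in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(enumerate(best_perm))


# Hypothetical dissimilarities between three children on each side.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, pairs = best_assignment(cost)
print(total, pairs)
```

Here the cheapest matching pairs child 0 with child 1, child 1 with child 0, and child 2 with child 2, for a total cost of 5.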
References
[1] J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus. Fine-grained and accurate source code differencing. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering (ASE '14). ACM, 2014.
[2] M. Pawlik and N. Augsten. RTED: a robust algorithm for the tree edit distance. PVLDB, 5(4):334–345, 2011.
[3] K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput., 18(6):1245–1262, 1989.
Tree Edit Distance: from ZS to RTED
By namasikanam TA