An Optimal and Progressive Algorithm for Skyline Queries

Outline

Problem Statement
Example
Nearest Neighbor
Branch and Bound Algorithm
Evaluation

Problem Statement

Given: a set of points p_1,...,p_N, where p_i \in \mathbb{R}^d

Wanted: the subset of points that are not dominated by any other point

Problem Statement

p_j dominates p_i if and only if

(p_j)_k \leqslant (p_i)_k \forall k \in \{1,...,d\}

that is, p_j is at least "as good" as p_i with respect to every property
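The dominance test can be written down directly. The following is a minimal sketch, assuming that points are plain tuples of numbers and that smaller values are better; the function names are illustrative and not taken from the slides. Under the definition as stated on the slide, a point dominates itself and identical points dominate each other, so a second variant additionally requires the two points to differ, which is the form the skyline definition needs when comparing a point against "any other point".

```python
def dominates(p_j, p_i):
    """Slide definition: p_j is at least as good as p_i in every dimension."""
    if len(p_j) != len(p_i):
        raise ValueError("points must have the same dimensionality")
    return all(a <= b for a, b in zip(p_j, p_i))


def strictly_dominates(p_j, p_i):
    """Assumed refinement: additionally require p_j to differ from p_i,
    i.e. p_j is strictly better in at least one dimension."""
    return dominates(p_j, p_i) and tuple(p_j) != tuple(p_i)
```

For example, `dominates((1, 2), (2, 2))` holds, while `(1, 3)` and `(2, 2)` do not dominate each other.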

Problem Statement

We are thus looking for points that are optimal with respect to the combination of their properties

Example

A list of hotels (points) a,...,n with 2 properties (dimensions) of interest
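To make the example concrete, a brute-force skyline can be computed by comparing every hotel against every other one. This is only a sketch for illustration: the coordinates below are made up (the slides do not list the hotel data), the two properties are assumed to be "smaller is better", and `naive_skyline` is a hypothetical helper, not part of the presented algorithms.

```python
def dominates(p, q):
    """p is at least as good as q in every dimension and differs from q."""
    return p != q and all(a <= b for a, b in zip(p, q))


def naive_skyline(points):
    """Return the names of all points not dominated by any other point (O(N^2))."""
    return {
        name
        for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    }


# Hypothetical hotel data: name -> (property 1, property 2), smaller is better.
hotels = {
    "a": (1, 9), "b": (2, 10), "c": (4, 8), "g": (5, 6),
    "h": (4, 3), "i": (3, 2), "k": (9, 1),
}
print(sorted(naive_skyline(hotels)))
```

The quadratic number of comparisons is what the index-based methods on the following slides try to avoid.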

Nearest Neighbor (NN)

Find the point with minimal distance (e.g. L_1 norm) to the origin
All points dominated by this point can be excluded
Partition the remaining space according to the coordinates of the point found

Nearest Neighbor (NN)

Insert the resulting partitions into a to-do list
Repeat this process recursively on all partitions until the to-do list is empty
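The NN steps can be sketched for the two-dimensional case as follows. This is a simplified illustration under stated assumptions: the nearest neighbor is found by a linear scan instead of an R-tree query, coordinates are non-negative with smaller values being better, and each partition is represented only by its exclusive upper bounds `(x_max, y_max)`. The function name `nn_skyline_2d` is hypothetical.

```python
import math


def nn_skyline_2d(points):
    """Skyline of 2D points via repeated nearest-neighbor search (L1 norm)."""
    skyline = []
    todo = [(math.inf, math.inf)]  # to-do list of partitions (exclusive upper bounds)
    while todo:
        x_max, y_max = todo.pop()
        candidates = [p for p in points if p[0] < x_max and p[1] < y_max]
        if not candidates:
            continue  # empty partition, nothing to report
        nn = min(candidates, key=lambda p: p[0] + p[1])  # closest to the origin
        skyline.append(nn)
        # Everything with x >= nn.x and y >= nn.y is dominated by nn and dropped.
        # The rest is split along the coordinates of nn into two partitions.
        todo.append((nn[0], y_max))  # points with smaller x
        todo.append((x_max, nn[1]))  # points with smaller y
    return skyline


points = [(1, 9), (3, 2), (9, 1), (4, 3), (5, 6), (2, 10)]
print(sorted(nn_skyline_2d(points)))
```

In two dimensions the two partitions cannot share a point, because any point smaller than `nn` in both coordinates would have been the nearest neighbor itself.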

Nearest Neighbor (NN)

Problem in higher dimensions (d>2): overlapping partitions \rightarrow redundant accesses

Number of node accesses (s: number of skyline points, h: height of the R-tree):

NA_{NN} \geqslant s \cdot h \cdot d

Improvement via...

Branch and Bound Skyline Algorithm (BBS)

Idea: partition the space into MBRs (minimum bounding rectangles) and prune regions that are dominated by a skyline point \rightarrow disjoint partitioning

Branch and Bound Skyline Algorithm

Action        Heap contents                                                  S
access root   <e_7, 4> <e_6, 6>                                              ∅
expand e_7    <e_3, 5> <e_6, 6> <e_5, 8> <e_4, 10>                           ∅
expand e_3    <\textbf{i, 5}> <e_6, 6> <h, 7> <e_5, 8> <e_4, 10> <g, 11>     {i}
expand e_6    <h, 7> <e_5, 8> <e_1, 9> <e_4, 10> <g, 11>                     {i}
expand e_1    <\textbf{a, 10}> <e_4, 10> <g, 11> <b, 12>                     {i, a}
expand e_4    <\textbf{k, 10}> <g, 11> <b, 12> <c, 12>                       {i, a, k}
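The loop behind this trace can be sketched with a priority queue. The following is a minimal illustration, not the authors' implementation. It assumes a toy tree in which an inner node is a Python list of entries and a data point is a tuple, that the heap key is the L_1 distance of an entry's lower-left MBR corner to the origin, and that smaller values are better. The names `lower_corner` and `bbs` as well as the sample tree are hypothetical.

```python
import heapq
from itertools import count


def dominates(p, q):
    """p is at least as good as q in every dimension and differs from q."""
    return p != q and all(a <= b for a, b in zip(p, q))


def lower_corner(entry):
    """Lower-left corner of an entry: the point itself, or the corner of a node's MBR."""
    if isinstance(entry, tuple):
        return entry
    corners = [lower_corner(child) for child in entry]
    return tuple(min(c) for c in zip(*corners))


def bbs(root):
    """Yield skyline points progressively, in ascending order of mindist."""
    skyline = []
    tie = count()  # tie-breaker so that heap entries never compare nodes
    heap = [(sum(lower_corner(root)), next(tie), root)]
    while heap:
        _, _, entry = heapq.heappop(heap)
        corner = lower_corner(entry)
        if any(dominates(s, corner) for s in skyline):
            continue  # whole entry is dominated: prune it
        if isinstance(entry, tuple):
            skyline.append(entry)
            yield entry  # reported immediately (progressive output)
        else:
            for child in entry:  # expand the node
                c = lower_corner(child)
                if not any(dominates(s, c) for s in skyline):
                    heapq.heappush(heap, (sum(c), next(tie), child))


# Hypothetical two-level tree over six made-up points.
tree = [[(1, 9), (2, 10)], [[(3, 2), (4, 3)], [(5, 6), (9, 1)]]]
print(list(bbs(tree)))
```

Writing `bbs` as a generator mirrors the progressive behaviour: the first skyline point is available as soon as it is popped from the heap, before the rest of the tree has been examined.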

Branch and Bound Skyline Algorithm

All points in S are part of the skyline, and there are no false hits
Progressiveness: the first results are returned to the user immediately and extended successively
Fairness: points that are particularly "good" in a single dimension are not favored
Incorporation of preferences
Universality with respect to data distribution and dimensionality

Branch and Bound Skyline Algorithm

The number of node accesses of BBS is optimal (i.e. no node in the R-tree is visited twice)
Actual number of accesses:

NA_i = number of node accesses at level i

P_{intr-i} = probability that e_i \cap SSR \neq \emptyset for a node e_i at level i


\frac{N}{f^{i+1}} = number of nodes at level i (f is the average node capacity)

Then

NA_i = \frac{N}{f^{i+1}} P_{intr-i}

and

NA_{BBS} = \sum_{i=1}^h{NA_i} \leqslant s \cdot h
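The estimate can be evaluated numerically once the intersection probabilities are known. The sketch below simply sums the per-level terms N / f^{i+1} \cdot P_{intr-i} for i = 1,...,h, using the indexing of the slide. The function name and the sample values are assumptions for illustration only; the slides do not state how P_{intr-i} is obtained.

```python
def estimated_node_accesses(n_points, fanout, p_intr):
    """Sum of NA_i = N / f^(i+1) * P_intr-i over the levels i = 1..h.

    p_intr[i - 1] holds the (assumed) probability P_intr-i for level i,
    so the number of levels h is len(p_intr).
    """
    total = 0.0
    for i, p in enumerate(p_intr, start=1):
        nodes_at_level = n_points / fanout ** (i + 1)
        total += nodes_at_level * p
    return total


# Hypothetical values: N = 1000 points, f = 10, two levels.
print(estimated_node_accesses(1000, 10, [0.5, 1.0]))
```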

Branch and Bound Skyline Algorithm and Nearest Neighbor Compared

\frac{NA_{NN}}{NA_{BBS}} \geqslant d
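The ratio follows from the two bounds stated on the earlier slides, NA_{NN} \geqslant s \cdot h \cdot d and NA_{BBS} \leqslant s \cdot h. A short derivation, assuming s, h > 0 and NA_{BBS} > 0:

```latex
\frac{NA_{NN}}{NA_{BBS}}
  \geqslant \frac{s \cdot h \cdot d}{NA_{BBS}}
  \geqslant \frac{s \cdot h \cdot d}{s \cdot h}
  = d
```

The first step uses the lower bound on NA_{NN}, the second the upper bound on NA_{BBS}.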

Evaluation

The number of node accesses is indeed optimal GIVEN this data preparation, BUT:
the computation of the MBRs, which the paper does not account for, may be expensive
Scalability to other data structures (not R-trees)?
Problems in higher-dimensional data spaces

Branch and Bound Skyline Algorithm

By cirquit
