Ahcène Boubekki
Ulf Brefeld
Cláudio L. Lucchesi
Wolfgang Stille
DIPF - Leuphana University
Leuphana University
UFMS, Brazil
ULB - TU Darmstadt
Hello everyone,
I will talk about how to propagate capacities in a graph in order to make recommendations.
This is joint work with ...
<SPACE>
Let's build a Problem
<SPACE>
5 items:
|   | A | B | C | D | E |
|---|---|---|---|---|---|
| A | - | 1 | 3 | 4 | 0 |
| B | 1 | - | 3 | 0 | 5 |
| C | 3 | 3 | - | 0 | 0 |
| D | 4 | 0 | 0 | - | 5 |
| E | 0 | 5 | 0 | 5 | - |
Introduction
Item-based Collaborative Filtering
[Figure: Adjacency Matrix and Item Graph: nodes A-E connected by edges with weights 1, 3, 3, 4, 5, 5]
How to recommend E?
What weight?
Let's suppose we want to build a recommender system for 5 items
The first thing to look at is the adjacency matrix
<SPACE>
The same information can be seen on the item-graph
<SPACE>
If you run an item-based CF on this graph to recommend an item from A
<SPACE>
you would only be able to recommend B, C and D.
What about E?
<SPACE>
How can we recommend E?
And what would be its similarity with A?
The missing edges are characteristic of a cold-start situation, either of the whole data set or of specific items.
<SPACE>
Basically, we want to tackle the cold-start problem by building a transitive closure of the graph, but how do we weight the new edges?
<SPACE>
This is the Weight Propagation problem.
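To make the problem concrete, here is a minimal sketch of item-based scoring on the toy adjacency matrix above (my own illustration, not the system from the talk); note that E never shows up among the candidates for A:

```python
# Minimal item-based scoring on the toy adjacency matrix (illustration only).
adj = {
    "A": {"B": 1, "C": 3, "D": 4},
    "B": {"A": 1, "C": 3, "E": 5},
    "C": {"A": 3, "B": 3},
    "D": {"A": 4, "E": 5},
    "E": {"B": 5, "D": 5},
}

def recommend(source, k=3):
    # Rank the direct neighbours of `source` by edge weight (similarity).
    neighbours = adj.get(source, {})
    return sorted(neighbours, key=neighbours.get, reverse=True)[:k]

print(recommend("A"))  # ['D', 'C', 'B'] -- E is not reachable in one hop
```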
<SPACE>
Weight Propagation
[Figure: a path A-D-E-F-G with edge weights 5, 4, 2, 2. Addition: A-E = 9. Multiplication: A-E = 20, A-F = 40, A-G = 80.]
Consider this situation with 3 items.
The first idea would be to add the weights.
No, you don't do that: these are similarities, not distances, so you cannot add them.
<SPACE>
Why not multiply them?
Because if you consider new nodes always further from A,
<SPACE>
the products will be bigger
<SPACE>
and bigger.
In the end, items that are very far away will have a very high similarity to A and will be ranked very high.
We don't want this.
<SPACE>
Weight Propagation
[Figure: Cosine Multiplication: the same path A-D-E-F-G with cosine-normalized weights .5, .4, .2, .2; the products are A-E = .2, A-F = .04, A-G = .008.]
Normalizing the weights, for example using cosine, does not help.
It just leads to the opposite situation,
<SPACE>
where far items are assigned a very low weight,
<SPACE>
and hence will never appear in a top-N recommendation.
We don't want this either.
Our solution comes from network theory. It is called the capacity.
<SPACE>
Weight Propagation
[Figure: Capacity: in the item graph (nodes A-E), the path A-D-E with edge weights 4 and 5 has capacity 4. The path may not be unique.]
The capacity of a path is defined as the lowest weight of the path's edges.
In our case it is 4.
However
<SPACE>
there might not be a unique path connecting two nodes.
<SPACE>
Here there are three
<SPACE>
We review two points of view to handle this issue
<SPACE>
Weight Propagation
[Figure: the item graph again (nodes A-E, edge weights 1, 3, 3, 4, 5, 5), annotated with the BCSP and MaxCap values 3 and 4.]
The first idea is to search for a balance between capacity and path length
This is what Bi-Criterion Shortest Path
<SPACE>
(or BCSP) does. It was studied by Malucelli et al. It requires computing all the paths connecting the two nodes, which is a very tedious task.
The approach we chose is simpler. It is called the MaxCapacity
<SPACE>
The max capacity is defined as the biggest capacity among all the paths connecting the two nodes.
<SPACE>
In theory, this would also require computing all the paths.
In practice, this can be avoided.
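Written out (the notation here is mine, not from the slides: P(a,b) is the set of paths from a to b, and w(u,v) the edge weight), the two quantities are:

```latex
% Capacity of a path p, and max capacity between two nodes a and b
\operatorname{cap}(p) = \min_{(u,v) \in p} w(u,v),
\qquad
\operatorname{MaxCap}(a,b) = \max_{p \in P(a,b)} \operatorname{cap}(p)
```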
<SPACE>
F. Malucelli, P. Cremonesi, and B. Rostami. An application of bicriterion shortest paths to collaborative filtering.
In Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS), 2012.
How to compute max Capacities?
<SPACE>
We propose 2 completely different algorithms
<SPACE>
The first one makes use of buckets
<SPACE>
The second one transforms the item graph into a tree
more precisely: a forest
<SPACE>
Let's start with the bucket-based one.
<SPACE>
Algorithms : Buckets
[Figure: the item graph (nodes A-E, edge weights 1, 3, 3, 4, 5, 5) next to the sequence of sub-graphs G_0 to G_5; G_5 is empty, and G_4 is the first sub-graph missing A, so MaxCap(A, E) = 4.]
The idea is to build a sequence of sub-graphs G_alpha
<SPACE>
containing the edges with a weight strictly bigger than alpha.
In our case this will be 6 sub-graphs
<SPACE>
G_0 to G_5
<SPACE>
G_0 is a copy of the item graph,
<SPACE>
G_1 doesn't have the edge between A and B but still has all the vertices.
<SPACE> The same for G_2 <SPACE>
G_3 loses the node C
<SPACE> G_4 loses A <SPACE>
And G_5 is empty as 5 is the biggest weight.
When we update the weight of AB to set it to 3,
<SPACE>
we just add the edge to G_1 <SPACE>and G_2.
Computing the MaxCapacity between A and E
<SPACE>
consists of <SPACE> looking for the first sub-graph missing at least one of the nodes.
<SPACE>
Here it is G_4, so the MaxCapacity between A and E is 4
<SPACE>
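A minimal sketch of this bucket idea, assuming integer weights and testing connectivity inside each sub-graph (this is my own illustration, not the exact implementation from the paper):

```python
# Bucket-based MaxCapacity sketch (illustration only; integer weights assumed).
class Buckets:
    def __init__(self, max_weight):
        # subgraphs[alpha] is G_alpha: it keeps the edges with weight strictly bigger than alpha.
        self.subgraphs = [dict() for _ in range(max_weight + 1)]

    def update(self, u, v, w):
        # Add (or raise the weight of) the edge u-v: it belongs to every G_alpha with alpha < w.
        for alpha in range(w):
            g = self.subgraphs[alpha]
            g.setdefault(u, set()).add(v)
            g.setdefault(v, set()).add(u)

    def max_cap(self, u, v):
        # The first G_alpha in which u and v are no longer connected gives MaxCap(u, v) = alpha.
        for alpha, g in enumerate(self.subgraphs):
            if not self._connected(g, u, v):
                return alpha
        return len(self.subgraphs)

    def _connected(self, g, u, v):
        # Iterative graph search restricted to one sub-graph.
        if u not in g or v not in g:
            return False
        seen, stack = {u}, [u]
        while stack:
            node = stack.pop()
            for nxt in g.get(node, ()):
                if nxt == v:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

# Toy item graph from the slides:
b = Buckets(max_weight=5)
for u, v, w in [("A", "B", 1), ("A", "C", 3), ("A", "D", 4),
                ("B", "C", 3), ("B", "E", 5), ("D", "E", 5)]:
    b.update(u, v, w)
print(b.max_cap("A", "E"))  # 4
```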
Algorithms : Tree
def Dijkstra(Graph, source):

    create vertex set Q
    # Initialization
    for each vertex v in Graph:
        dist[v] ← INFINITY
        prev[v] ← UNDEFINED
        add v to Q
    # Distance from source to source
    dist[source] ← 0

    while Q is not empty:
        # Node with the least distance will be selected first
        u ← vertex in Q with min dist[u]
        remove u from Q

        # where v is still in Q
        for each neighbor v of u:
            alt ← dist[u] + length(u, v)
            # If a shorter path to v has been found
            if alt < dist[v]:
                dist[v] ← alt
                prev[v] ← u

    return dist[], prev[]

(Pseudocode from https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm)
Modifications for MaxCapacity (the replacements highlighted on the slide):
    dist[v] ← INFINITY               becomes   cap[v] ← -INFINITY
    dist[source] ← 0                 becomes   cap[source] ← INFINITY
    vertex in Q with min dist[u]     becomes   vertex in Q with max cap[u]   (biggest capacity)
    alt ← dist[u] + length(u, v)     becomes   alt ← min( cap[u] , weight(u,v) )
    if alt < dist[v]: dist[v] ← alt  becomes   if alt > cap[v]: cap[v] ← alt
    return dist[]                    becomes   return cap[]
This approach was developed to be able to use Dijkstra, so first we show how it can be modified to compute the MaxCapacity between two nodes.
First, in the initialization, the capacities are set
<SPACE>
to -Inf instead of +Inf
<SPACE>
The capacity from the source to the source is set to +inf
The next node to be extended is the one that has the biggest capacity from the source instead of the least distance
<SPACE>
the capacity of the path from the source to a neighbor
<SPACE>
is the minimum of the capacity of the current node and of the weight of the link to the neighbor.
<SPACE>
If the value is bigger than the stored one, it is updated.
<SPACE>
We finish by returning capacities instead of distances
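Put together, a runnable sketch of this max-capacity variant of Dijkstra could look as follows (my own Python rendering of the modifications above, using a heap for the max-capacity extraction):

```python
import heapq

def max_capacity_dijkstra(graph, source):
    # graph: dict mapping node -> {neighbor: edge weight (similarity)}.
    # Returns cap[v], the capacity of the best (max-capacity) path from source to v.
    cap = {v: float("-inf") for v in graph}      # -INFINITY instead of +INFINITY
    cap[source] = float("inf")                   # capacity from the source to itself
    heap = [(-cap[source], source)]              # max-heap simulated with negated capacities
    done = set()
    while heap:
        _, u = heapq.heappop(heap)               # node with the biggest capacity first
        if u in done:
            continue
        done.add(u)
        for v, w in graph[u].items():
            alt = min(cap[u], w)                 # alt = min( cap[u] , weight(u, v) )
            if alt > cap[v]:                     # keep the biggest capacity found so far
                cap[v] = alt
                heapq.heappush(heap, (-alt, v))
    return cap

# Toy item graph from the slides:
graph = {
    "A": {"B": 1, "C": 3, "D": 4},
    "B": {"A": 1, "C": 3, "E": 5},
    "C": {"A": 3, "B": 3},
    "D": {"A": 4, "E": 5},
    "E": {"B": 5, "D": 5},
}
print(max_capacity_dijkstra(graph, "A"))  # {'A': inf, 'B': 4, 'C': 3, 'D': 4, 'E': 4}
```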
The algorithm we propose aims at optimizing Dijkstra by building trees on-the-fly.
<SPACE>
Algorithms : Tree
Transactions read in order: AB, ED, AD, AD, EB (with weights including 1, 5 and 2)
[Figure: the forest built on-the-fly from these transactions: nodes A-E, their root pointers, and the tree edge weights updated step by step.]
At the beginning, every node is the root of its own tree.
While reading the transactions, there are three cases to handle, which we will review.
<SPACE>
First we read AB. The trees are disconnected,
<SPACE>so we connect them
<SPACE>B is not a root anymore
<SPACE> The same for ED
The order of the roots is not very important at this point
<SPACE> But later <SPACE>
it can influence the depth of the final tree <SPACE>
Now we want to update the weight of an existing edge. If the new weight is smaller, nothing happens <SPACE>
if it is bigger, we simply modify the value <SPACE>
<SPACE> Adding EB would create a cycle <SPACE>
To handle this we compute the path between E and B, <SPACE>
and look for the edge with the smallest weight.
If the weight of the edge to be added is smaller, we leave things as they are <SPACE>
if it is bigger, <SPACE>
the new edge is added <SPACE>
and the smallest edge is removed <SPACE>
If there are several edges with the smallest weight, one of them is removed. Again, this can be optimized to keep the tree tidy.
<SPACE>
The MaxCapacities can now be computed using a tree search.
Keep in mind that even if the resulting tree can differ, the MaxCapacities remain the same.
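A minimal sketch of these three cases, written as an incrementally maintained maximum spanning forest (my own illustration; I assume each call to update passes the current weight of the pair, and MaxCap is read off as the smallest weight on the unique tree path):

```python
# Tree-based MaxCapacity sketch (illustration only).
class MaxCapForest:
    def __init__(self):
        self.adj = {}                                  # node -> {neighbor: weight}, tree edges only

    def _path(self, u, v):
        # Path between u and v inside the forest (unique if it exists), found by an iterative search.
        if u not in self.adj or v not in self.adj:
            return None
        stack, seen = [(u, [u])], {u}
        while stack:
            node, path = stack.pop()
            if node == v:
                return path
            for nxt in self.adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, path + [nxt]))
        return None

    def update(self, u, v, w):
        self.adj.setdefault(u, {})
        self.adj.setdefault(v, {})
        if v in self.adj[u]:                           # case 2: the edge is already in the tree
            if w > self.adj[u][v]:                     # only weight increases change anything
                self.adj[u][v] = self.adj[v][u] = w
            return
        path = self._path(u, v)
        if path is None:                               # case 1: two disjoint trees, just connect them
            self.adj[u][v] = self.adj[v][u] = w
            return
        # case 3: adding u-v would create a cycle; find the weakest edge on the existing path
        a, b = min(zip(path, path[1:]), key=lambda e: self.adj[e[0]][e[1]])
        if w > self.adj[a][b]:                         # replace it only if the new edge is heavier
            del self.adj[a][b], self.adj[b][a]
            self.adj[u][v] = self.adj[v][u] = w

    def max_cap(self, u, v):
        # MaxCap = smallest weight along the unique tree path between u and v.
        path = self._path(u, v)
        if path is None:
            return None
        return min(self.adj[a][b] for a, b in zip(path, path[1:]))

f = MaxCapForest()
for u, v, w in [("A", "B", 1), ("E", "D", 5), ("A", "D", 2)]:
    f.update(u, v, w)
print(f.max_cap("A", "E"))  # 2: the tree path A-D-E has weights 2 and 5
```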
What is the difference between the two algorithms?
<SPACE>
Let's compare them in a controlled environment
<SPACE>
Synthetic Experiments
Settings:
50k transactions
21k items
700k edges
for each transaction (u, v, w):
    Update(u, v, w)
    Request MaxCap(u, v)
CC = Co-occurrence
MCb = Bucket-based Max Capacity
MCt = Tree-based Max Capacity
The dataset is made of 50k transactions between 21k items. At the end, the item graph contains up to 70k edges. The protocol consists of:
* reading a transaction
* updating the buckets or tree
* finally requesting the MaxCap between the two nodes
<SPACE>
The evolution of the number of edges is almost linear.
We compare the two algorithms to CC that just builds the adjacency matrix.
MCb refers to the bucket approach, and MCt to the tree-based one.
<SPACE>
The first results concern memory usage. The buckets consume a lot of memory compared to the others.
Note that in a production system, the adjacency matrix is not needed to compute the MaxCap, so we could subtract the yellow line from the two others. MCt would then be almost 0.
<SPACE>
Learning and evaluation times.
The bucket approach is slower to learn but appears faster when evaluating the capacities. Note the scale difference: in both cases the biggest gap is about 5 seconds.
Even if evaluation with the tree approach is slower, it allows the use of real-valued weights such as cosine. Buckets cannot handle that, as it would theoretically require an infinite number of buckets.
<SPACE>
Before continuing with the experiments,
I would like to talk about an issue that we have not mentioned yet.
<SPACE>
Ties
<SPACE>
[Figure: the item graph again (nodes A-E, edge weights 1, 3, 3, 4, 5, 5)]
Max Capacity and Ties
Max Capacity
|   | A | B | C | D | E |
|---|---|---|---|---|---|
| A | - | 4 | 3 | 4 | 4 |
| B | 4 | - | 3 | 5 | 5 |
| C | 3 | 3 | - | 3 | 3 |
| D | 4 | 5 | 3 | - | 5 |
| E | 4 | 5 | 3 | 5 | - |

MaxCap: Rec (A) = B D E C or B E D C or D B E C …
MaxCap + dist: Rec (A) = D E B C
!! TIES !!

Max Capacity (row A):
|   | A | B | C | D | E |
|---|---|---|---|---|---|
| A | - | 4 | 3 | 4 | 4 |

Tree Distance (row A):
|   | A | B | C | D | E |
|---|---|---|---|---|---|
| A | - | 3 | 1 | 1 | 2 |
For our running example, this is the MaxCapacity Matrix.
We want to recommend items from A.
<SPACE>
The problem is that we have ties
<SPACE>
In what order should we sort items with the same MaxCapacity?
Should the recommendation be BDEC? BEDC? DBEC?
<SPACE>
We propose to handle this problem using the tree-based approach
<SPACE>
We add as a second criterion the distance in the tree.
<SPACE>
This reduces the number of ties, and in our example we end up with a unique ranking.
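As a small illustration of this tie-breaking (the numbers are the rows of the two matrices above; the code is my own sketch):

```python
# Rank the candidates for A by MaxCap (descending), breaking ties by tree distance (ascending).
max_cap   = {"B": 4, "C": 3, "D": 4, "E": 4}   # row A of the Max Capacity matrix
tree_dist = {"B": 3, "C": 1, "D": 1, "E": 2}   # row A of the Tree Distance matrix

rec = sorted(max_cap, key=lambda item: (-max_cap[item], tree_dist[item]))
print(rec)  # ['D', 'E', 'B', 'C']
```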
<SPACE>
Max Capacity and Cold-start
MovieLens 1M
for each (user, movie, rating, date) chronologically:
    if movie rated between 1 and 20 times:
        if first rating of user:
            SKIP
        else:
            Rec(user, movie, rating)
    if 1000 recommendations have been computed:
        STOP
Let's go back to the experimental results and look at how the different algorithms behave in a cold-start situation simulated on the MovieLens dataset.
<SPACE>
The cold-start situation is obtained by reading few transactions or ratings and computing recommendations of movies with few ratings (here between 1 and 20).
Ties are handled by considering that the expected item is in the middle of the tie-block
We stop after 1000 recommendations, as BCSP was too slow.
Cross-validation is performed by running the protocol on 6 different parts of the dataset.
<SPACE>
Firstly, BCSP and MC behave similarly with an advantage for MC.
Note that the poor results of Cosine are systematic in cold-start situations.
MC+dist outperforms the other baselines on both criteria, and also in terms of incertitude.
<SPACE>
Max Capacity and Cold-start
Incertitude = size of the tie-blocks
We define the incertitude of a recommendation as the number of items sharing the same similarity as the expected item.
In the plot we see that using a second criterion dramatically decreases the incertitude of the recommendation.
Even though cosine uses real numbers, the average incertitude of its recommendations is higher.
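One possible reading of this protocol in code (my interpretation, not the authors' script): the expected item is placed in the middle of its tie-block, and the incertitude is the size of that block.

```python
# Tie-aware rank and incertitude of one recommendation (illustrative interpretation).
def tie_aware_rank_and_incertitude(scores, expected):
    # scores: item -> similarity to the query item; expected: the held-out item.
    s = scores[expected]
    better    = sum(1 for v in scores.values() if v > s)
    tie_block = sum(1 for v in scores.values() if v == s)
    rank = better + (tie_block + 1) / 2        # expected item sits in the middle of its tie-block
    return rank, tie_block

scores = {"B": 4, "C": 3, "D": 4, "E": 4}      # MaxCap row for A in the toy example
print(tie_aware_rank_and_incertitude(scores, "E"))  # (2.0, 3)
```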
<SPACE>
Summary
Double Rank : Cooc + Popularity