using votes to combine rankings
about me
- background in mathematics
- post-doc researcher at biomedical genomics lab (IRB)
- focus: mutational processes and tumor evolution
- interests: math models, statistics, programming
- long term goal in life: Advent of Code in Haskell
from neighbor politics to cancer genomics
the story:
2 yrs ago
Today
neighbors
A. Paint corridor
B. Fix Front of Building
C. Pipe Maintenance
Either pipe maintenance or fix the front
A. Paint corridor
B. Fix Front of Building
C. Pipe Maintenance
neighbors
Fix the front
A. Paint corridor
B. Fix Front of Building
C. Pipe Maintenance
neighbors
I don't really care, I hate these meetings
A. Paint corridor
B. Fix Front of Building
C. Pipe Maintenance
neighbors
B>A=C
B=C>A
A>B=C
A=B>C
A=B=C
B=C>A
A>B=C
B>A=C
A=B>C
A=B=C
A. Paint corridor
B. Fix Front of Building
C. Pipe Maintenance
neighbors
reward the good!
A=B=C
A=B>C
B>A=C
B=C>A
A>B=C
...
voting
?
update
voting rights
+3%
+2%
-2%
-1%
-2%
impact
- definition of choices
- feasibility of impact evaluation
- ethics of mutable voting rights
many caveats in real-life social choice
... but, does it make sense in other contexts?
cancer genomics problem:
discovery of genes that drive tumor evolution
genomics
Living cells run on operating systems known as genomes
Genomes are written in a suitable extension of the ACGT-language
cancer genomics
- In a healthy multicellular organism, genomes have evolved to cooperate
- Cancer arises when genome modifications lead to unhealthy growth and expansion of a cell population
Specific genes are hacked
cancer driver genes
cancer genes
www.intogen.org
mutations
print('hello, world')
Substitution: print('hella, world')
Translocation: print('worlo, helld')
Copies: print('hello, world, world')
Deletion: print('hell, world')
statistical methods guess which genes drive
statistical model
ranking genes:
1. TP53
2. PIK3CA
3. PTEN
4. GATA3
5. RUNX1
6. ...
Cohort
p-values
what the model expects
(background model)
what we observe
- p-values
- how embarrassed is the model after observing the data
- the lower the p-value, the higher the embarrassment!
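A minimal sketch of the "model expects vs. we observe" comparison, assuming (purely for illustration) a background model in which the number of mutations in a gene follows a Poisson distribution. The function name `poisson_upper_tail` and the counts are hypothetical; the actual driver discovery methods use richer background models.

```python
import math

def poisson_upper_tail(observed: int, expected: float) -> float:
    """p-value P(X >= observed) for X ~ Poisson(expected)."""
    # Sum the lower tail P(X <= observed - 1) and take the complement.
    lower = sum(
        math.exp(-expected) * expected**k / math.factorial(k)
        for k in range(observed)
    )
    return max(0.0, 1.0 - lower)

# Hypothetical gene: the background model expects 3 mutations in the cohort.
p_surprising = poisson_upper_tail(observed=12, expected=3.0)  # model is embarrassed
p_unremarkable = poisson_upper_tail(observed=3, expected=3.0)  # model is comfortable
```

The further the observation sits above the expectation, the smaller the p-value, which is the "embarrassment" reading used on the slide.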
statistical methods guess which genes drive
1. TP53
2. PIK3CA
3. PTEN
4. GATA3
5. RUNX1
6. MAP2K4
...
1. TP53
2. MLL3
3. CDH1
4. FOXA1
5. MAP2K4
...
1. TP53
2. MLL3
3. CDH1
4. FOXA1
5. MAP2K4
...
1. TP53
2. PIK3CA
3. CDH1
...
1. PIK3CA
2. MAP2K4
3. TP53
4. SETD2
5. MLL3
6. CDH1
...
1. TP53
2. PIK3CA
3. CDH1
4. MAP3K1
5. ARID1A
...
combining p-values
Fisher
Stouffer-Liptak
Brown
...
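For reference, Fisher's and Stouffer-Liptak's combinations can be sketched with the standard library alone. This is an illustrative sketch, assuming independent p-values strictly between 0 and 1; the function names and the sample p-values are hypothetical and not taken from the talk's package.

```python
import math
from statistics import NormalDist

def fisher_combine(pvalues):
    """Fisher: X = -2 * sum(ln p) follows a chi-square with 2k dof under the null."""
    k = len(pvalues)
    half_x = -sum(math.log(p) for p in pvalues)  # this is X / 2
    # Closed-form chi-square survival function, valid for even dof 2k.
    return math.exp(-half_x) * sum(half_x**i / math.factorial(i) for i in range(k))

def stouffer_combine(pvalues, weights=None):
    """Stouffer-Liptak: weighted sum of z-scores, rescaled to unit variance."""
    std = NormalDist()
    weights = weights or [1.0] * len(pvalues)
    z = [std.inv_cdf(1.0 - p) for p in pvalues]
    combined = sum(w * zi for w, zi in zip(weights, z))
    combined /= math.sqrt(sum(w * w for w in weights))
    return 1.0 - std.cdf(combined)

# Hypothetical p-values that three methods assign to the same gene.
pvalues = [0.01, 0.04, 0.20]
p_fisher = fisher_combine(pvalues)
p_stouffer = stouffer_combine(pvalues)
```

With a single p-value both functions return it unchanged, which is a quick sanity check on the formulas.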
combining p-values:
a few caveats
- Inconsistent rankings
- Use of different scales of embarrassment
- Many false positives as the number of methods increases
- Real data does not follow the assumptions
- consistent ranking
- systematic allocation of credibility
- interpretable and statistically sound
we want a consensus of driver discovery...
ranking consistency:
let them vote!
Markus Schulze
Social Choice and Welfare, 2011, 36 (2), 267–303
how it works
- Ranking consistency essentially means "Condorcet"
- ... yet it remains fast to compute
TP53 = PIK3CA > PTEN > GATA3 > ...
PIK3CA > MAP2K4 > TP53 > SETD2 > ...
TP53 > MLL3 > CDH1 = FOXA1 > MAP2K4 > ...
...
TP53 > PIK3CA > MAP2K4 > PTEN > ...
how it works
step 1
voters = {v1, v2, v3, v4}
candidates = {c1, c2, c3, c4, c5}
- candidates are given ranks by voters
- not every rank assignment is valid
Valid Ballots
- at least one candidate ranks 1st
- rank(c) = # {s | rank(s) < rank(c)} + 1
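The two validity conditions can be checked directly. The sketch below assumes a ballot is a dict mapping candidate to rank; `competition_ranks` and `is_valid_ballot` are hypothetical helper names, and the scores are made up.

```python
def competition_ranks(scores):
    """Turn scores (lower = better, e.g. p-values) into ranks where ties share a rank."""
    return {
        c: 1 + sum(1 for s in scores.values() if s < scores[c])
        for c in scores
    }

def is_valid_ballot(ranks):
    """Check rank(c) == #{s | rank(s) < rank(c)} + 1 for every candidate c."""
    values = list(ranks.values())
    if 1 not in values:  # at least one candidate must rank 1st
        return False
    return all(r == 1 + sum(1 for s in values if s < r) for r in values)

ballot = competition_ranks({"c1": 0.001, "c2": 0.001, "c3": 0.02, "c4": 0.3, "c5": 0.3})
# c1 and c2 tie for 1st, so c3 gets rank 3; c4 and c5 tie at rank 4.
```

A ballot such as `{"c1": 1, "c2": 1, "c3": 2}` is rejected: two candidates rank above c3, so its rank must be 3.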
how it works
step 2
weight matrix
M(x, y) = how many voters prefer x over y?
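A small sketch of the pairwise count behind the weight matrix, assuming each ballot is a dict from candidate to rank and every voter ranks every candidate. The function name `weight_matrix` and the four ballots are illustrative only.

```python
def weight_matrix(ballots, candidates):
    """M[i][j] = number of ballots ranking candidates[i] strictly above candidates[j]."""
    n = len(candidates)
    M = [[0] * n for _ in range(n)]
    for ballot in ballots:
        for i, x in enumerate(candidates):
            for j, y in enumerate(candidates):
                if ballot[x] < ballot[y]:  # lower rank number = preferred
                    M[i][j] += 1
    return M

candidates = ["A", "B", "C"]
ballots = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 1, "C": 3},  # a tie between A and B counts for neither
    {"B": 1, "C": 2, "A": 3},
    {"C": 1, "A": 2, "B": 3},
]
M = weight_matrix(ballots, candidates)
# M == [[0, 2, 2],
#       [1, 0, 3],
#       [2, 1, 0]]
```

Ties contribute to neither direction, so M(x, y) + M(y, x) can be smaller than the number of voters.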
how it works
step 3
M defines a directed weighted graph G
Max
Min
core idea by Schulze
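A path in the weights graph is a sequence of nodes $c_1, \dots, c_n$
It has strength $s$ if $s$ is the maximum satisfying: $M(c_i, c_{i+1}) \geq s$ for all $i = 1, \dots, n-1$
The strength between candidates x, y is the max strength among all paths joining them: $\mathrm{strength}(x, y) = \max \{ \mathrm{strength}(P) \mid P \text{ is a path from } x \text{ to } y \}$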
core idea by Schulze
Theorem:
The set of candidates equipped with the relation $x \succ y \iff \mathrm{strength}(x, y) > \mathrm{strength}(y, x)$ gives a partially ordered set.
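A pure-Python sketch of how path strengths yield an ordering, starting from a small hypothetical weight matrix. It follows the same initialization as the Cython chunk later in the deck (a direct link counts only where x beats y head-to-head). Sorting candidates by number of pairwise wins is just one assumed way of flattening the partial order into a list.

```python
def strongest_paths(M):
    """Widest-path variant of Floyd's algorithm on the weight matrix M."""
    n = len(M)
    # Direct links: keep M[i][j] only where i beats j head-to-head.
    p = [[M[i][j] if i != j and M[i][j] > M[j][i] else 0 for j in range(n)]
         for i in range(n)]
    for i in range(n):          # i is the intermediate node
        for j in range(n):
            if j == i:
                continue
            for k in range(n):
                if k != i and k != j:
                    # A path is as strong as its weakest link.
                    p[j][k] = max(p[j][k], min(p[j][i], p[i][k]))
    return p

def schulze_order(candidates, p):
    """x beats y when strength(x, y) > strength(y, x); sort by number of wins."""
    n = len(candidates)
    wins = {
        candidates[i]: sum(1 for j in range(n) if j != i and p[i][j] > p[j][i])
        for i in range(n)
    }
    return sorted(candidates, key=lambda c: -wins[c]), wins

candidates = ["A", "B", "C"]
M = [[0, 2, 2],
     [1, 0, 3],
     [2, 1, 0]]  # hypothetical pairwise counts from 4 voters
p = strongest_paths(M)
order, wins = schulze_order(candidates, p)
```

In this example A and C are tied head-to-head (2 vs. 2), yet A ends up above C because the path A, B, C has strength 2 while no path leads back from C to A.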
allocation of credibility:
We want to give higher voting rights to methods that contribute more to a good outcome (!)
+3%
+2%
-2%
-1%
-2%
https://cancer.sanger.ac.uk/census
manually curated dataset of bona fide known cancer genes
enrichment score
Given a single ranking $R$, define an enrichment score $E(R)$ in terms of:
$p_i$: proportion of CGC genes up to rank $i$
$w_i$: weighting for rank $i$
Enrichment of bona fide known drivers in the top positions of the consensus ranking
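The exact formula is not recoverable from the slide text, so the sketch below assumes one plausible form: a weighted average, over ranks, of the proportion of known drivers found up to each rank, with a hypothetical weighting of 1/rank that favours the top positions. The gene names `g1`..`g4` and the `known` set are toy stand-ins for a ranking and the CGC.

```python
def enrichment_score(ranking, known_drivers, weight=lambda i: 1.0 / i):
    """Weighted average over ranks i of the fraction of known drivers in the top i."""
    hits = 0
    weighted_sum = 0.0
    total_weight = 0.0
    for i, gene in enumerate(ranking, start=1):
        if gene in known_drivers:
            hits += 1
        weighted_sum += weight(i) * hits / i  # weight times proportion up to rank i
        total_weight += weight(i)
    return weighted_sum / total_weight if total_weight else 0.0

known = {"g1", "g2"}  # toy stand-in for the curated list of known drivers
good = enrichment_score(["g1", "g2", "g3", "g4"], known)
bad = enrichment_score(["g3", "g4", "g1", "g2"], known)
```

Under this assumed form the score lies between 0 and 1 and rewards rankings that place known drivers early.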
voting rights
scale rankings with weights
(voting rights or credibility)
...
...
...
step 1: Schulze
step 2: enrichment score
step 1 + step 2 together define a function: voting rights -> enrichment score of the consensus ranking
Optimize (with constraints) to find most credible voting rights
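A toy end-to-end sketch of the optimization, under loud assumptions: the enrichment formula (weights 1/rank), the constraint (voting rights sum to 1 and each is at least 0.1) and the exhaustive grid search are all hypothetical choices made for illustration, and the talk's package may use a different objective and optimizer. Gene names and ballots are invented.

```python
import itertools

def consensus(ballots, candidates, rights):
    """Schulze consensus where each ballot counts with its voting right."""
    n = len(candidates)
    M = [[sum(w for b, w in zip(ballots, rights) if b[x] < b[y])
          for y in candidates] for x in candidates]
    p = [[M[i][j] if M[i][j] > M[j][i] else 0.0 for j in range(n)] for i in range(n)]
    # product() varies i slowest, so i plays the intermediate-node role.
    for i, j, k in itertools.product(range(n), repeat=3):
        if i != j and i != k and j != k:
            p[j][k] = max(p[j][k], min(p[j][i], p[i][k]))
    wins = [sum(p[i][j] > p[j][i] for j in range(n)) for i in range(n)]
    return sorted(candidates, key=lambda c: -wins[candidates.index(c)])

def enrichment(ranking, known):
    """Assumed score: 1/rank-weighted average of known-driver proportions."""
    hits, num, den = 0, 0.0, 0.0
    for i, gene in enumerate(ranking, start=1):
        hits += gene in known
        num += hits / i / i  # weight 1/i times proportion hits/i
        den += 1.0 / i
    return num / den

def simplex_grid(n_methods, steps, min_units):
    """All voting-right vectors on a grid: sum to 1, each at least min_units/steps."""
    for units in itertools.product(range(min_units, steps + 1), repeat=n_methods):
        if sum(units) == steps:
            yield [u / steps for u in units]

candidates = ["g1", "g2", "g3", "g4"]
known = {"g1", "g2"}  # toy stand-in for the known drivers
ballots = [
    {"g1": 1, "g2": 2, "g3": 3, "g4": 4},  # method that ranks known drivers first
    {"g4": 1, "g3": 2, "g2": 3, "g1": 4},
    {"g3": 1, "g4": 2, "g1": 3, "g2": 4},
]

def objective(rights):
    return enrichment(consensus(ballots, candidates, rights), known)

best_rights = max(simplex_grid(len(ballots), steps=10, min_units=1), key=objective)
best_ranking = consensus(ballots, candidates, best_rights)
```

The grid search is only workable for a handful of methods; it is used here because it makes the constraint explicit and needs no external optimizer.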
what is left?
gene selection
Composite rule based on:
- Each gene ranked by ranking combination
- Credibility leads to more accurate p-value combination
the implementation:
Schulze voting:
numpy: http://www.numpy.org/
cython: http://cython.org/
Graph representation:
networkx: https://networkx.github.io/
key chunk of code
code that computes the strongest (widest) path between every pair of nodes of the weighted directed graph, using a variant of Floyd's algorithm
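def strongest_path(long size, double[:] pref, double[:] spath):
    """Fill spath with strongest-path strengths (flat, row-major size x size arrays)."""
    cdef Py_ssize_t i, j, k
    # direct links: keep a preference only where i beats j head-to-head
    for i in range(size):
        for j in range(size):
            if i != j:
                if pref[i*size + j] > pref[j*size + i]:
                    spath[i*size + j] = pref[i*size + j]
                else:
                    spath[i*size + j] = 0  # do not rely on spath being zero-initialized
    # Floyd-style pass: try routing j -> k through the intermediate node i
    for i in range(size):
        for j in range(size):
            if i != j:
                for k in range(size):
                    if (i != k) and (j != k):
                        # a path is as strong as its weakest link
                        spath[j*size + k] = max(spath[j*size + k],
                                                min(spath[j*size + i], spath[i*size + k]))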
package
Python package to experiment with Schulze's voting algorithm
https://bitbucket.org/ferran_muinos/consensus
features:
- random ballot generator
- computes consensus ranking with Schulze
- with customizable voting rights
- computation of weights and strength
- graph plots
special dependencies:
- cython, networkx
summary:
Schulze
*
update voting rights
optimization strategy
CGC enrichment
from neighbor politics to driver discovery
+3%
+2%
-2%
-1%
-2%
credit and thanks
Iker
Inés
Jordi
Núria
Carlota
Loris
Fran
Abel
Oriol
IntOGen
www.intogen.org
references
Robert W. Floyd. Algorithm 97: Shortest Path. Communications of the ACM, 1962, 5 (6), 345
Markus Schulze. A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. Social Choice and Welfare, 2011, 36 (2), 267–303
Ranking combination (lite)
By Ferran Muiños
Presenting a ranking combination method that makes use of a voting system alongside optimization. Schulze voting, p-value combination statistics and cancer genomics featuring in the same talk. Presented at the PyCon Nove meeting (April 2018).