using votes to combine rankings

Ferran Muiños @fmuinos

Institute for Research in Biomedicine (IRB Barcelona)

about me

  • background in mathematics

  • post-doc researcher at biomedical genomics lab (IRB)

  • focus: mutational processes and tumor evolution

  • interests: math models, statistics, programming

  • long term goal in life: Advent of Code in Haskell

from neighbor politics to cancer genomics

the story:

2 yrs ago

Today

 neighbors

A. Paint corridor

B. Fix Front of Building

C. Pipe Maintenance

Either pipe maintenance or fix the front

A. Paint corridor

B. Fix Front of Building

C. Pipe Maintenance

 neighbors

Fix the front

A. Paint corridor

B. Fix Front of Building

C. Pipe Maintenance

 neighbors

I don't really care, I hate these meetings

A. Paint corridor

B. Fix Front of Building

C. Pipe Maintenance

 neighbors

B>A=C

B=C>A

A>B=C

A=B>C

A=B=C

B=C>A

A>B=C

B>A=C

A=B>C

A=B=C

A. Paint corridor

B. Fix Front of Building

C. Pipe Maintenance

 neighbors

reward the good!

A=B=C

A=B>C

B>A=C

B=C>A

A>B=C

...

voting

?

update

voting rights

+3%

+2%

-2%

-1%

-2%

impact

  • definition of choices
  • feasibility of impact evaluation
  • ethics of mutable voting rights

many caveats in real-life social choice

... but, does it make sense in other contexts?

cancer genomics problem:

discovery of genes that drive tumor evolution

genomics

Living cells run on operating systems known as genomes

Genomes are written in a suitable extension of the ACGT-language

cancer genomics

  • In healthy multicellular organisms, genomes have evolved to cooperate

 

  • Cancer arises when genome modifications lead to unhealthy growth and expansion of a cell population

Specific genes are hacked

cancer driver genes

cancer genes

www.intogen.org

mutations

print('hello, world')

Translocation

Copies

print('hella, world')
print('worlo, helld')
print('hello, world, world')
print('hell, world')

Substitution

Deletion

statistical methods guess which genes drive

statistical model

ranking genes:

 

1. TP53

2. PIK3CA

3. PTEN

4. GATA3

5. RUNX1

6. ...

p = 10^{-10}
p = 10^{-5}
p = 10^{-3}
p = 10^{-2}
p = 1.2\cdot 10^{-2}

Cohort

p-values

what the model expects

(background model)

what we observe

  • p-values \in [0, 1]
  • how embarrassed is the model after observing the data
  • the lower the p-value, the higher the embarrassment!
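A toy illustration of the "embarrassment" idea: the sketch below computes a one-sided p-value under a hypothetical binomial background model, where each tumor in the cohort carries a mutation in the gene with probability `background_rate`. The model, the function name and all numbers are assumptions for illustration only, not the background models used by actual driver discovery methods.

```python
from math import comb

def binomial_pvalue(observed, n_samples, background_rate):
    """P(X >= observed) for X ~ Binomial(n_samples, background_rate)."""
    return sum(
        comb(n_samples, k)
        * background_rate**k
        * (1 - background_rate) ** (n_samples - k)
        for k in range(observed, n_samples + 1)
    )

# Hypothetical cohort of 100 tumors; the background model expects ~2% mutated.
p_expected = binomial_pvalue(2, 100, 0.02)     # about what the model expects
p_surprising = binomial_pvalue(30, 100, 0.02)  # far more mutations than expected

print(p_expected, p_surprising)
```

The second p-value is many orders of magnitude smaller than the first: the model is far more "embarrassed" by 30 mutated tumors than by 2.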

TP53

TP53

statistical methods guess which genes drive

1. TP53

2. PIK3CA

3. PTEN

4. GATA3

5. RUNX1

6. MAP2K4

...

1. TP53

2. MLL3

3. CDH1

4. FOXA1

5. MAP2K4

...

1. TP53

2. MLL3

3. CDH1

4. FOXA1

5. MAP2K4

...

1. TP53

2. PIK3CA

3. CDH1

...

1. PIK3CA

2. MAP2K4

3. TP53

4. SETD2

5. MLL3

6. CDH1

...

1. TP53

2. PIK3CA

3. CDH1

4. MAP3K1

5. ARID1A

...

combining p-values

p = 0.1
p = 0.02
p = 0.3

Fisher

Stouffer-Liptak

Brown

...

P
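As a sketch of how such combinations work, the block below implements Fisher's method and the unweighted form of Stouffer's method with the standard library only, applied to the three p-values above. Both assume independent p-values; Brown's method and the weighted Stouffer-Liptak variant are not covered. The function names are illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def fisher(pvalues):
    """Fisher: -2 * sum(log p) follows a chi-square with 2k degrees of freedom."""
    k = len(pvalues)
    half_stat = -sum(log(p) for p in pvalues)  # test statistic divided by 2
    # Closed-form chi-square survival function for an even number (2k) of
    # degrees of freedom: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return exp(-half_stat) * total

def stouffer(pvalues):
    """Unweighted Stouffer: sum the z-scores and rescale by sqrt(k)."""
    norm = NormalDist()
    z = sum(norm.inv_cdf(1 - p) for p in pvalues) / sqrt(len(pvalues))
    return 1 - norm.cdf(z)

pvals = [0.1, 0.02, 0.3]
p_fisher = fisher(pvals)
p_stouffer = stouffer(pvals)
print(p_fisher, p_stouffer)
```

The two methods return different combined values for the same inputs, which is one reason why the choice of combination method matters.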

combining p-values:

a few caveats

  • Inconsistent rankings

  • Use of different scales of embarrassment

  • Many false positives as the number of methods increases

  • Real data does not follow assumptions

  • consistent ranking
  • systematic allocation of credibility
  • interpretable and statistically sound

we want a consensus of driver discovery...

ranking consistency:

let them vote!

how it works

  • Ranking consistency essentially means "Condorcet"
  • ... yet it remains fast to compute

TP53 = PIK3CA > PTEN > GATA3 > ...

PIK3CA > MAP2K4 > TP53 > SETD2 > ...

TP53 > MLL3 > CDH1 = FOXA1 > MAP2K4 > ...

...

TP53 > PIK3CA > MAP2K4 > PTEN > ...

how it works

step 1

  voters = {v1, v2, v3, v4}

candidates = {c1, c2, c3, c4, c5}

  • candidates are given ranks by voters
  • not every rank assignment is valid

 

Valid Ballots

  • at least one candidate ranks 1st
  • rank(c) = # {s | rank(s) < rank(c)} + 1
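The validity rule can be checked mechanically. The sketch below represents a ballot as a dictionary from candidate to rank (an assumed representation, with hypothetical function names), tests the rule rank(c) = # {s | rank(s) < rank(c)} + 1, and shows one way to turn arbitrary scores such as p-values into a valid ballot with ties.

```python
def is_valid_ballot(ranks):
    """True when rank(c) == #{s : rank(s) < rank(c)} + 1 for every candidate."""
    values = list(ranks.values())
    return all(r == sum(1 for other in values if other < r) + 1 for r in values)

def to_valid_ballot(scores):
    """Convert scores (lower = better, e.g. p-values) into a valid ballot."""
    values = list(scores.values())
    return {c: sum(1 for other in values if other < s) + 1
            for c, s in scores.items()}

ok = is_valid_ballot({"c1": 1, "c2": 1, "c3": 3, "c4": 4, "c5": 4})   # True
bad = is_valid_ballot({"c1": 1, "c2": 2, "c3": 2, "c4": 3, "c5": 5})  # False

# Two candidates tied in second place push the next one down to rank 4.
ballot = to_valid_ballot({"TP53": 1e-10, "PIK3CA": 1e-5, "PTEN": 1e-5, "GATA3": 1e-2})
print(ok, bad, ballot)
```

The first condition on the slide (at least one candidate ranks 1st) follows from the second: the best-ranked candidate has nobody strictly ahead of it.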

how it works

step 2

weight matrix M = (m_{ij})

m_{ij} = how many voters prefer c_i over c_j?
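A minimal sketch of the weight matrix computation, using the five neighbor ballots from the beginning of the talk. Ballots are dictionaries from candidate to rank (lower is better); this representation and the function name are assumptions for illustration.

```python
def weight_matrix(ballots, candidates):
    """m[i][j] = number of ballots ranking candidates[i] strictly above candidates[j]."""
    n = len(candidates)
    m = [[0] * n for _ in range(n)]
    for ballot in ballots:
        for i, ci in enumerate(candidates):
            for j, cj in enumerate(candidates):
                if ballot[ci] < ballot[cj]:  # lower rank means preferred
                    m[i][j] += 1
    return m

candidates = ["A", "B", "C"]
ballots = [
    {"A": 2, "B": 1, "C": 2},  # B > A = C
    {"A": 3, "B": 1, "C": 1},  # B = C > A
    {"A": 1, "B": 2, "C": 2},  # A > B = C
    {"A": 1, "B": 1, "C": 3},  # A = B > C
    {"A": 1, "B": 1, "C": 1},  # A = B = C
]
M = weight_matrix(ballots, candidates)
print(M)  # [[0, 1, 2], [2, 0, 2], [1, 0, 0]]
```

Ties contribute to neither direction, so m_{ij} + m_{ji} can be smaller than the number of voters.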

how it works

step 2

How many voters prefer c_2 over c_1?

how it works

step 3

M defines a directed weighted graph G

Max

Min

core idea by Schulze

A path P: x \to y in the weights graph is a sequence of nodes

P = \{C_1 = x, \ldots, C_n = y\}

P has strength s if s is the maximum satisfying:

  • w(C_i, C_{i+1}) \geq w(C_{i+1}, C_i)

  • \forall i\;\; w(C_i, C_{i+1}) \geq s

The strength between candidates x, y is the max strength among all paths joining them:

S(x, y) = \max\{s_P \;|\; P: x \to y\}
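The definition above can be sketched in pure Python with a Floyd-Warshall style triple loop. The weights are a made-up three-candidate example containing a cycle (A beats B, B beats C, C beats A), and the sketch keeps an edge only when it strictly wins the pairwise comparison, which is an assumption about how ties are treated.

```python
def strongest_paths(m):
    """s[x][y] = strength of the strongest path from x to y in the weights graph."""
    n = len(m)
    # Keep the edge i -> j only when i wins the pairwise comparison (strictly).
    s = [[m[i][j] if m[i][j] > m[j][i] else 0 for j in range(n)] for i in range(n)]
    for k in range(n):          # intermediate node
        for i in range(n):      # path start
            for j in range(n):  # path end
                if i != j and i != k and j != k:
                    # A path through k is as strong as its weakest half.
                    s[i][j] = max(s[i][j], min(s[i][k], s[k][j]))
    return s

# Hypothetical pairwise weights for candidates A, B, C (rows beat columns).
weights = [
    [0, 6, 4],
    [3, 0, 7],
    [5, 2, 0],
]
strength = strongest_paths(weights)
print(strength)  # [[0, 6, 6], [5, 0, 7], [5, 5, 0]]
```

Although C beats A directly (5 against 4), the indirect path A, B, C has strength 6, so S(A, C) = 6 exceeds S(C, A) = 5 and the cycle is resolved in favor of A.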

core idea by Schulze

Theorem:

The set of candidates equipped with the relation \prec_S gives a partially ordered set.

x \prec_S y \Leftrightarrow S(x, y) \leq S(y, x)
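One way to read a consensus ranking off the strengths is to count, for each candidate, how many others beat it. In the sketch below x is placed ahead of y when S(x, y) > S(y, x), and equal strengths are treated as a tie; this tie handling, the function name and the strength values are assumptions for illustration.

```python
def consensus_ranks(strength, candidates):
    """rank(x) = 1 + number of candidates y with S(y, x) > S(x, y)."""
    n = len(candidates)
    return {
        candidates[x]: 1 + sum(1 for y in range(n) if strength[y][x] > strength[x][y])
        for x in range(n)
    }

# Hypothetical strength matrix for four candidates; C and D are tied (4 vs 4).
S = [
    [0, 6, 6, 6],
    [5, 0, 7, 7],
    [5, 5, 0, 4],
    [5, 5, 4, 0],
]
ranks = consensus_ranks(S, ["A", "B", "C", "D"])
print(ranks)  # {'A': 1, 'B': 2, 'C': 3, 'D': 3}
```

Because the strict relation is transitive, a candidate that beats another is always beaten by fewer candidates, so the counts never contradict the pairwise outcomes.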

allocation of credibility:

We want to give higher voting rights to methods that contribute more to a good outcome (!)

+3%

+2%

-2%

-1%

-2%

Cancer Gene Census (CGC)

https://cancer.sanger.ac.uk/census

manually curated dataset of bona fide known cancer genes

enrichment score

Given a single ranking \mathcal{R}, define an enrichment score:

S(\mathcal{R}) = \sum_{i=1}^n \lambda_i\cdot P_i

P_i: proportion of CGC genes up to rank i

\lambda_i: weighting for rank i
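A minimal sketch of the enrichment score. The talk does not specify the weights \lambda_i, so the default \lambda_i = 1/i used here is a hypothetical choice that favors the top of the ranking; the gene set is a toy stand-in for the CGC and `GENE_X` is a placeholder name.

```python
def enrichment_score(ranking, known_drivers, weights=None):
    """S(R) = sum_i lambda_i * P_i, with P_i the share of known drivers in the top i."""
    n = len(ranking)
    if weights is None:
        weights = [1.0 / i for i in range(1, n + 1)]  # hypothetical lambda_i = 1/i
    hits, score = 0, 0.0
    for i, gene in enumerate(ranking, start=1):
        hits += gene in known_drivers
        score += weights[i - 1] * hits / i  # hits / i is P_i
    return score

known = {"TP53", "PIK3CA", "PTEN"}  # toy stand-in for the CGC
good = enrichment_score(["TP53", "PIK3CA", "PTEN", "GENE_X"], known)
poor = enrichment_score(["GENE_X", "PTEN", "PIK3CA", "TP53"], known)
print(good, poor)
```

A ranking that places known drivers first scores higher than one that places them last, which is the property the optimization of voting rights relies on.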

Enrichment of bona fide known drivers in the top positions of the consensus ranking

voting rights

scale rankings with weights \omega_k (voting rights or credibility)

\mathcal{R}_1, \mathcal{R}_2, \mathcal{R}_3, ... weighted by \omega_1, \omega_2, \omega_3, ...

step 1: Schulze gives the consensus ranking \mathcal{R}

step 2: enrichment score gives S(\mathcal{R})

step 1 + step 2 together define a function:

f: \Delta(\omega_1, \ldots, \omega_n) \subset \mathbb{R}^n \to \mathbb{R}

Optimize (with constraints) to find most credible voting rights
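A rough end-to-end sketch of the function f and its optimization. Each method's ballot contributes its weight \omega_k to the pairwise matrix, Schulze produces a consensus order, and the enrichment score evaluates it. The exhaustive grid search over the simplex is a simple stand-in for whatever constrained optimizer is actually used; the weights \lambda_i = 1/i, the ballots, the placeholder genes `GENE_X` and `GENE_Y`, and all function names are assumptions.

```python
from itertools import product

def weighted_pairwise(ballots, weights, candidates):
    """m[i][j] = total voting rights of methods ranking candidate i above j."""
    n = len(candidates)
    m = [[0.0] * n for _ in range(n)]
    for ballot, w in zip(ballots, weights):
        for i, ci in enumerate(candidates):
            for j, cj in enumerate(candidates):
                if ballot[ci] < ballot[cj]:
                    m[i][j] += w
    return m

def schulze_order(m, candidates):
    """Candidates sorted by how many others beat them on strongest paths."""
    n = len(m)
    s = [[m[i][j] if m[i][j] > m[j][i] else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if i != j and i != k and j != k:
                    s[i][j] = max(s[i][j], min(s[i][k], s[k][j]))
    beaten_by = [sum(s[y][x] > s[x][y] for y in range(n)) for x in range(n)]
    return [c for _, c in sorted(zip(beaten_by, candidates))]

def enrichment(order, known):
    """Enrichment score with hypothetical weights lambda_i = 1/i."""
    hits, score = 0, 0.0
    for i, gene in enumerate(order, start=1):
        hits += gene in known
        score += (1.0 / i) * hits / i
    return score

def f(weights, ballots, candidates, known):
    """Step 1 (Schulze) followed by step 2 (enrichment score)."""
    m = weighted_pairwise(ballots, weights, candidates)
    return enrichment(schulze_order(m, candidates), known)

def grid_search(ballots, candidates, known, steps=10):
    """Try every weight vector on a regular grid of the simplex (weights sum to 1)."""
    best = None
    for combo in product(range(steps + 1), repeat=len(ballots)):
        if sum(combo) != steps:
            continue
        w = [c / steps for c in combo]
        score = f(w, ballots, candidates, known)
        if best is None or score > best[0]:
            best = (score, w)
    return best

candidates = ["TP53", "PIK3CA", "GENE_X", "GENE_Y"]
known = {"TP53", "PIK3CA"}  # toy stand-in for the CGC
ballots = [
    {"TP53": 1, "PIK3CA": 2, "GENE_X": 3, "GENE_Y": 4},  # method 1
    {"GENE_X": 1, "GENE_Y": 2, "TP53": 3, "PIK3CA": 4},  # method 2
    {"GENE_Y": 1, "GENE_X": 2, "PIK3CA": 3, "TP53": 4},  # method 3
]
equal_score = f([1 / 3, 1 / 3, 1 / 3], ballots, candidates, known)
best_score, best_weights = grid_search(ballots, candidates, known)
print(equal_score, best_score, best_weights)
```

In this toy setting, equal voting rights let the two poor methods outvote the good one, while the search finds weights under which the known drivers reach the top of the consensus.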

what is left?

gene selection

 

Composite rule based on:

  • Each gene ranked by ranking combination
  • Credibility leads to more accurate p-value combination

the implementation:

Schulze voting:

 

numpy: http://www.numpy.org/

cython: http://cython.org/

 

Graph representation:

 

networkx: https://networkx.github.io/

key chunk of code

code that computes the strongest path between every pair of nodes of the weighted directed graph: a variant of Floyd's algorithm

\mathcal{O}(n^3)

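# Cython: strongest paths between all pairs of candidates (Floyd-style triple loop).
# pref and spath are flattened size x size matrices stored in row-major order.
def strongest_path(long size, double [:] pref, double [:] spath):
    cdef long i, j, k

    # initialization: keep only the edges that win the pairwise comparison
    for i in range(size):
        for j in range(size):
            if i != j:
                if pref[i*size + j] > pref[j*size + i]:
                    spath[i*size + j] = pref[i*size + j]
                else:
                    spath[i*size + j] = 0

    # relaxation: try to strengthen the path j -> k by going through node i
    for i in range(size):
        for j in range(size):
            if i != j:
                for k in range(size):
                    if (i != k) and (j != k):
                        spath[j*size + k] = max(spath[j*size + k],
                                                min(spath[j*size + i], spath[i*size + k]))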

package

Python package to experiment with Schulze's voting algorithm

https://bitbucket.org/ferran_muinos/consensus

features:

  • random ballot generator

  • computes consensus ranking with Schulze

  • with customizable voting rights

  • computation of weights and strength

  • graph plots

special dependencies:

  • cython, networkx                                            

summary:

Schulze

*

update voting rights

optimization strategy

CGC enrichment

from neighbor politics to driver discovery

+3%

+2%

-2%

-1%

-2%

credit and thanks

Iker

Inés

Jordi

Núria

Carlota

Loris

Fran

Abel

Oriol

IntOGen

www.intogen.org

references

Robert W. Floyd. Algorithm 97: Shortest Path. Communications of the ACM, 5(6), 1962, p. 345

Ranking combination (lite)

By Ferran Muiños

Presenting a ranking combination method that makes use of a voting system alongside optimization. Schulze voting, p-value combination statistics and cancer genomics featuring in the same talk. Presented at the PyCon Nove meeting (April 2018).
