using votes to combine rankings

Ferran Muiños @fmuinos

Institute for Research in Biomedicine (IRB Barcelona)

about me

background in mathematics
post-doc researcher at biomedical genomics lab
focus: mutational processes and tumor evolution
interests: modelling, statistics, programming
long-term goal: complete Advent of Code (AoC) in Haskell

the story:

2 yrs ago

Today

neighbors

A. Paint corridor

B. Fix Front of Building

C. Pipe Maintenance

Either pipe maintenance or fix the front

A. Paint corridor

B. Fix Front of Building

C. Pipe Maintenance

neighbors

Fix the front

A. Paint corridor

B. Fix Front of Building

C. Pipe Maintenance

neighbors

Paint corridor or fix the front

A. Paint corridor

B. Fix Front of Building

C. Pipe Maintenance

neighbors

paint the corridor

A. Paint corridor

B. Fix Front of Building

C. Pipe Maintenance

neighbors

I don't really care, I hate these meetings

A. Paint corridor

B. Fix Front of Building

C. Pipe Maintenance

neighbors

B>A=C

B=C>A

A>B=C

A=B>C

A=B=C

B=C>A

A>B=C

B>A=C

A=B>C

A=B=C

A. Paint corridor

B. Fix Front of Building

C. Pipe Maintenance

neighbors

reward the good!

A=B=C

A=B>C

B>A=C

B=C>A

A>B=C

...

voting

update

voting rights

+3%

+2%

-2%

-1%

-2%

impact

definition of choices
feasibility of impact evaluation
ethics of mutable voting rights

many caveats in real-life social choice

does it make sense in other contexts?

cancer genomics problem:

discovery of genes that drive tumor evolution

genomics

Living cells run on operating systems known as genomes

Genomes are written in a suitable extension of the ACGT-language

cancer genomics

In healthy multicellular organisms, genomes have evolved to cooperate

Cancer arises when genome modifications lead to unhealthy growth and expansion of a cell population

Genomes change

tumor evolution: massive trial and error

time

cell population

Specific changes in specific genes

known as cancer driver genes

cancer genes

www.intogen.org

genome modifications

print('hello, world')

Translocation

Copies

print('hella, world')

print('worlo, helld')

print('hello, world, world')

print('hell, world')

Substitution

Deletion

statistical methods guess which genes drive

statistical model

ranking genes:

1. TP53      p=10^{-10}
2. PIK3CA    p=10^{-5}
3. PTEN      p=10^{-3}
4. GATA3     p=10^{-2}
5. RUNX1     p=1.2\cdot 10^{-2}
6. ...

Cohort

p-values

what the model expects

what we observe

p-value \in [0, 1]
how embarrassed the model is after observing the data
the lower the p-value, the higher the embarrassment!

TP53

statistical methods guess which genes drive

1. TP53

2. PIK3CA

3. PTEN

4. GATA3

5. RUNX1

6. MAP2K4

...

1. TP53

2. MLL3

3. CDH1

4. FOXA1

5. MAP2K4

...

1. TP53

2. MLL3

3. CDH1

4. FOXA1

5. MAP2K4

...

1. TP53

2. PIK3CA

3. CDH1

...

1. PIK3CA

2. MAP2K4

3. TP53

4. SETD2

5. MLL3

6. CDH1

...

1. TP53

2. PIK3CA

3. CDH1

4. MAP3K1

5. ARID1A

...

combining p-values

p = 0.1

p = 0.02

p = 0.3

Fisher

Stouffer-Liptak

Brown

...
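As a rough illustration of what such methods do with the three p-values above, here is a minimal standard-library sketch of Fisher's and Stouffer's (equal-weight) combinations. The function names are made up for this example, both formulas assume independent p-values, and Brown's method (which accounts for dependence) is not covered.

```python
import math
from statistics import NormalDist


def fisher_combine(pvalues):
    """Fisher's method: X = -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom if the p-values are independent."""
    k = len(pvalues)
    half_x = -sum(math.log(p) for p in pvalues)  # this is X / 2
    # For an even number of degrees of freedom (2k) the chi-square survival
    # function has the closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!
    return math.exp(-half_x) * sum(half_x ** i / math.factorial(i) for i in range(k))


def stouffer_combine(pvalues):
    """Stouffer's method (equal weights): sum the z-scores, rescale by sqrt(k)."""
    norm = NormalDist()
    z = sum(norm.inv_cdf(1.0 - p) for p in pvalues) / math.sqrt(len(pvalues))
    return 1.0 - norm.cdf(z)


pvalues = [0.1, 0.02, 0.3]
p_fisher = fisher_combine(pvalues)
p_stouffer = stouffer_combine(pvalues)
print(f"Fisher: {p_fisher:.4f}  Stouffer: {p_stouffer:.4f}")
```

The two methods already disagree on the same three inputs, which is one face of the "different scales of embarrassment" issue.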

combining p-values:

a few caveats

Inconsistent rankings
Use of different scales of embarrassment
Many false positives as the number of methods increases
Real data does not follow assumptions

consistent ranking
systematic allocation of credibility
interpretable and statistically sound

we want a consensus of driver discovery...

ranking consistency: Schulze voting

Markus Schulze

A new monotonic, clone-independent, reversal symmetric, and condorcet-consistent single-winner election method

Social Choice and Welfare, 2011, 36 (2), 267–303

how it works

Ranking consistency essentially means "Condorcet"
...yet it remains fast to compute

TP53 = PIK3CA > PTEN > GATA3 > ...

PIK3CA > MAP2K4 > TP53 > SETD2 > ...

TP53 > MLL3 > CDH1 = FOXA1 > MAP2K4 > ...

...

TP53 > PIK3CA > MAP2K4 > PTEN > ...

how it works

step 1

voters = {v1, v2, v3, v4}

candidates = {c1, c2, c3, c4, c5}

candidates are given ranks by voters
not every rank assignment is valid

Valid Ballots

some candidate gets 1st
rank(c) = # {s | rank(s) < rank(c)} + 1
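The validity rule can be checked mechanically. The sketch below is only an illustration: `parse_ballot` and `is_valid_ballot` are hypothetical helper names, and the `B>A=C` string syntax is borrowed from the neighbors example rather than from any package.

```python
def parse_ballot(text):
    """Turn a string such as 'B>A=C' into a dict candidate -> rank."""
    ranks = {}
    placed = 0  # number of candidates strictly preferred so far
    for tier in text.split(">"):
        names = [name.strip() for name in tier.split("=")]
        for name in names:
            ranks[name] = placed + 1
        placed += len(names)
    return ranks


def is_valid_ballot(ranks):
    """Check rank(c) == #{s | rank(s) < rank(c)} + 1 for every candidate c."""
    return all(
        rank == sum(1 for other in ranks.values() if other < rank) + 1
        for rank in ranks.values()
    )


ballot = parse_ballot("B>A=C")
print(ballot, is_valid_ballot(ballot))
# Ranks 1, 3, 3 skip rank 2 although only one candidate is ahead: not valid
print(is_valid_ballot({"A": 1, "B": 3, "C": 3}))
```

Note that the rule implies the first condition: whichever candidate has nobody ranked above it necessarily gets rank 1.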

how it works

step 2

weight matrix M = (m_{ij})

m_{ij} = how many voters prefer c_i over c_j?
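A minimal sketch of this counting step, applied to the five neighbors' ballots from the story. The function name `weight_matrix` and the dict-based ballot representation are assumptions made for this example.

```python
def weight_matrix(ballots, candidates):
    """m[i][j] = number of ballots ranking candidates[i] strictly above candidates[j]."""
    n = len(candidates)
    m = [[0] * n for _ in range(n)]
    for ranks in ballots:
        for i, ci in enumerate(candidates):
            for j, cj in enumerate(candidates):
                if ranks[ci] < ranks[cj]:  # lower rank number = more preferred
                    m[i][j] += 1
    return m


candidates = ["A", "B", "C"]
# The neighbors' ballots, written as candidate -> rank
ballots = [
    {"A": 3, "B": 1, "C": 1},  # B=C>A
    {"A": 2, "B": 1, "C": 2},  # B>A=C
    {"A": 1, "B": 1, "C": 3},  # A=B>C
    {"A": 1, "B": 2, "C": 2},  # A>B=C
    {"A": 1, "B": 1, "C": 1},  # A=B=C
]
M = weight_matrix(ballots, candidates)
for row in M:
    print(row)
```

Ties contribute to neither direction, so the indifferent neighbor (A=B=C) adds nothing to the matrix.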

how it works

step 2

How many voters prefer c_2 over c_1?

how it works

step 3

M defines a directed weighted graph G

Max

Min

allocation of credibility:

We want to give higher voting rights to methods that contribute more to a better outcome (!)

+3%

+2%

-2%

-1%

-2%

Cancer Gene Census (CGC)

https://cancer.sanger.ac.uk/census

manually curated dataset of bona fide known cancer genes

enrichment score

Given a single ranking \mathcal{R}, define an enrichment score:

S(\mathcal{R}) = \sum_{i=1}^n \lambda_i\cdot P_i

P_i: proportion of CGC genes up to rank i

\lambda_i: weighting for rank i
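A small sketch of the enrichment score under two stated assumptions: P_i is read as the fraction of the top-i genes that belong to the reference set, and the weights default to 1/i, a hypothetical choice since the slides do not specify them. The reference set below is a toy stand-in, not the actual CGC.

```python
def enrichment_score(ranking, known_drivers, weights=None):
    """S(R) = sum_i lambda_i * P_i, with P_i the fraction of the top-i genes
    that belong to the reference set (an assumed reading of the definition)."""
    n = len(ranking)
    if weights is None:
        weights = [1.0 / i for i in range(1, n + 1)]  # hypothetical lambda_i
    hits = 0
    score = 0.0
    for i, gene in enumerate(ranking, start=1):
        if gene in known_drivers:
            hits += 1
        score += weights[i - 1] * hits / i
    return score


reference = {"TP53", "PIK3CA", "PTEN"}  # toy stand-in for a curated gene list
good = ["TP53", "PIK3CA", "PTEN", "GENE_X", "GENE_Y"]
bad = ["GENE_X", "GENE_Y", "TP53", "PIK3CA", "PTEN"]
score_good = enrichment_score(good, reference)
score_bad = enrichment_score(bad, reference)
print(score_good, score_bad)
```

With decreasing weights, a ranking that places the reference genes at the top scores higher than one that buries them.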

Enrichment of bona fide known drivers in the top positions of the consensus ranking

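voting rights

preferences of voter k can be scaled with a factor \omega_k

[diagram: rankings \mathcal{R}_1, \mathcal{R}_2, \mathcal{R}_3, ... with voting rights \omega_1, \omega_2, \omega_3, ... are combined into a consensus ranking \mathcal{R}, which is scored by S(\mathcal{R})]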
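One plausible way to apply the scaling, sketched here as an assumption rather than as the talk's implementation: when counting pairwise preferences, each voter adds its voting right instead of 1. The name `weighted_matrix` and the toy ballots are hypothetical.

```python
def weighted_matrix(ballots, voting_rights, candidates):
    """Pairwise preference weights where voter k contributes omega_k instead of 1."""
    n = len(candidates)
    m = [[0.0] * n for _ in range(n)]
    for ranks, omega in zip(ballots, voting_rights):
        for i, ci in enumerate(candidates):
            for j, cj in enumerate(candidates):
                if ranks[ci] < ranks[cj]:  # lower rank number = more preferred
                    m[i][j] += omega
    return m


genes = ["TP53", "PIK3CA", "MAP2K4"]
# Three toy "methods", each ranking the same three genes
method_ballots = [
    {"TP53": 1, "PIK3CA": 2, "MAP2K4": 3},
    {"TP53": 3, "PIK3CA": 1, "MAP2K4": 2},
    {"TP53": 1, "PIK3CA": 2, "MAP2K4": 3},
]
rights = [0.5, 0.3, 0.2]  # hypothetical voting rights, summing to 1
W = weighted_matrix(method_ballots, rights, genes)
for row in W:
    print([round(x, 2) for x in row])
```

With all voting rights equal to 1 this reduces to the plain count of voters preferring one candidate over another.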

step 1: Schulze

step 2: enrichment score

step 1 + step 2 together define a function:

f: \Delta(\omega_1, \ldots, \omega_n) \subset \mathbb{R}^n \to \mathbb{R}

allocation of credibility

formulated as an

optimization problem

\hat{\omega} = \textrm{argmax}_{(\omega_1, \ldots, \omega_n)} f
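To make the argmax concrete, here is a brute-force sketch that assumes \Delta is the simplex of non-negative weights summing to 1. It is a stand-in for a real constrained optimizer, and the objective `toy_f` is a made-up function; in the actual setting f would run the voting step and then the enrichment score.

```python
from itertools import product


def simplex_grid(n, steps):
    """Yield every weight vector with entries a_k / steps, a_k >= 0, summing to 1."""
    for head in product(range(steps + 1), repeat=n - 1):
        rest = steps - sum(head)
        if rest >= 0:
            yield tuple(a / steps for a in head + (rest,))


def argmax_on_simplex(f, n, steps=20):
    """Return the grid point of the simplex where f is largest."""
    return max(simplex_grid(n, steps), key=f)


# Toy objective (hypothetical): peaks at omega = (0.5, 0.3, 0.2)
target = (0.5, 0.3, 0.2)


def toy_f(omega):
    return -sum((w - t) ** 2 for w, t in zip(omega, target))


best = argmax_on_simplex(toy_f, n=3, steps=10)
print(best)
```

The grid grows quickly with the number of voters and the resolution, which is why a proper optimizer is preferable beyond a handful of methods.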

in practice:

what is left?

gene selection

Composite rule based on:

Each gene ranked by ranking combination
Credibility leads to more accurate p-value combination

the implementation:

Schulze voting:

numpy: http://www.numpy.org/

cython: http://cython.org/

Graph representation:

networkx: https://networkx.github.io/

key chunk of code

code that computes the strongest paths between all pairs of candidates in the weighted directed graph: a variant of Floyd's algorithm

\mathcal{O}(n^3)

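def strongest_path(long size, double [:] pref, double [:] spath):
    """Fill spath (flat size*size array) with strongest-path strengths.

    pref[i*size + j] is the weight of the preference of candidate i over j.
    spath is updated in place and is assumed to start filled with zeros.
    """
    cdef Py_ssize_t i, j, k

    # Initialization: keep only the links where i beats j head-to-head
    for i in range(size):
        for j in range(size):
            if i != j:
                if pref[i*size + j] > pref[j*size + i]:
                    spath[i*size + j] = pref[i*size + j]

    # Floyd-style relaxation: a path is as strong as its weakest link,
    # so the detour j -> i -> k is worth min(spath[j, i], spath[i, k])
    for i in range(size):
        for j in range(size):
            if i != j:
                for k in range(size):
                    if (i != k) and (j != k):
                        spath[j*size + k] = max(spath[j*size + k],
                                                min(spath[j*size + i], spath[i*size + k]))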
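For readers without a Cython toolchain, the following pure-Python sketch mirrors the routine above on a flat list and then derives an ordering from the result. The helper `schulze_order`, which sorts candidates by the number of rivals they beat on strongest paths, is an assumption made for this example and not necessarily how the package builds its consensus ranking. The input is the pairwise matrix of the neighbors story (A: paint corridor, B: fix front, C: pipe maintenance).

```python
def strongest_path(size, pref):
    """Pure-Python counterpart of the routine above; returns a new flat list."""
    spath = [0.0] * (size * size)
    # Keep only the links where i beats j head-to-head
    for i in range(size):
        for j in range(size):
            if i != j and pref[i * size + j] > pref[j * size + i]:
                spath[i * size + j] = pref[i * size + j]
    # Relax every pair (j, k) through every intermediate candidate i
    for i in range(size):
        for j in range(size):
            if i != j:
                for k in range(size):
                    if i != k and j != k:
                        spath[j * size + k] = max(
                            spath[j * size + k],
                            min(spath[j * size + i], spath[i * size + k]),
                        )
    return spath


def schulze_order(names, spath):
    """Sort candidates by how many rivals they beat on strongest paths."""
    size = len(names)
    wins = {
        names[i]: sum(
            spath[i * size + j] > spath[j * size + i] for j in range(size) if j != i
        )
        for i in range(size)
    }
    return sorted(names, key=lambda name: -wins[name])


names = ["A", "B", "C"]
pref = [0, 1, 2,
        2, 0, 2,
        1, 0, 0]
spath = strongest_path(len(names), pref)
order = schulze_order(names, spath)
print(order)
```

The three nested loops over the candidates are what gives the \mathcal{O}(n^3) cost quoted above.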

the implementation:

Optimization with constraints:

scipy: https://www.scipy.org/

scipy.optimize

...array of different optimization methods

Overkill attempts:

pyopt: http://www.pyopt.org/

ALPSO (Augmented Lagrangian Particle Swarm Optimizer)

scikit-optimize: https://scikit-optimize.github.io/

Bayesian optimization

package

Python package to experiment with these ideas

https://bitbucket.org/ferran_muinos/

features:

random ballot generator
computes consensus ranking with Schulze
with customizable voting rights
computation of weights and strength
graph plots
enrichment-based voting rights optimization

requires:

cython, networkx, scipy

TO BE RELEASED

SOON!

IntOGen

www.intogen.org

summary:

Schulze

update voting rights

optimization strategy

CGC enrichment

from neighbor politics to driver discovery

+3%

+2%

-2%

-1%

-2%

credit and thanks

Joint work in close collaboration with: Francisco Martínez-Jiménez

IntOGen working group: Loris Mularoni, Carlota Rubio-Perez, Jordi Deu-Pons, Inés Sentís, Iker Reyes-Salazar, David Tamborero, Abel Gonzalez-Perez, Núria López-Bigas

Iker

Inés

Jordi

Núria

Carlota

Loris

Fran

Abel

credit and thanks

Loris Mularoni

references

Robert W. Floyd. Algorithm 97: Shortest Path. Communications of the ACM, 1962, 5 (6), 345

Markus Schulze. A new monotonic, clone-independent, reversal symmetric, and condorcet-consistent single-winner election method. Social Choice and Welfare, 2011, 36 (2), 267–303

how it works: backup

A path P:x\to y with strength S(P) is any sequence of candidates P = \{C_1, \ldots, C_n\} satisfying:

C_1 = x

C_n = y

w(C_i,C_{i+1}) \geq w(C_{i+1},C_i)

\forall i\;\; w(C_i,C_{i+1}) \geq S(P)

The strength between two candidates x and y is the max strength over all paths P:x\to y joining them:

S(x,y) = \max\{S(P)\;|\; P:x\to y\}

ranking combination

By Ferran Muiños

Presenting a ranking combination method that makes use of a voting system alongside optimization. Schulze voting, p-value combination statistics and cancer genomics featuring in the same talk. Presented at the PyCon Nove meeting (April 2018).
