VNN-COMP 2022

Rules Discussion Meeting

April 12, 2022

Stanley Bak

Changliu Liu

Taylor Johnson

Christopher Brix

Mark Müller

Open-Loop Neural Network Verification

Input Set: \(i_1 \in [0, 1]\), \(i_2 \in [0, 1]\), \(\ldots\), \(i_n \in [0, 1]\)

Output Set: \(o_1 \geq o_2\), \(o_1 \geq o_3\), \(\ldots\), \(o_1 \geq o_m\)
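Read as a single verification query (one common formulation, sketched here with the network written as \(N\), which the slide leaves implicit), this asks whether every input in the box maps to an output satisfying all of the listed constraints:

\[
\forall\, i \in [0, 1]^n :\quad \bigwedge_{j=2}^{m} o_1 \geq o_j, \qquad \text{where } o = N(i).
\]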

VNN-COMP is once again co-hosted with the Workshop on Formal Methods for ML-Enabled Autonomous Systems (FoMLAS) at CAV, July 31-Aug 1, 2022.

 

We have a website: sites.google.com/view/vnn2022

The main communication channel is the github site: github.com/stanleybak/vnncomp2022/

 

Three phases:
1. Determine Rules

2. Submit Your Benchmarks

3. Run Your Tools on the Benchmarks

Overview

Rules will be similar to last year. If you didn't participate last year, you can look through last year's rule document.

 

Goal of this meeting: finalize any discussion points on changes for this year's rules. A rules document will be produced and posted as a GitHub issue for final comments.

Meeting Goals

Terminology: People submit benchmarks, which are collections of instances (specific network / specification / timeout in seconds)

 

Neural Networks are given in .onnx format

Specifications are given in .vnnlib format

Output result written to a text file
 

You must provide several scripts for your tool to automate the evaluation process:
1. Install your tool

2. Convert a single instance into the correct format

3. Analyze the instance and produce a result
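For illustration, a minimal Python driver for this script interface might look like the sketch below. The script names follow last year's setup (install_tool.sh, prepare_instance.sh, run_instance.sh), but the exact arguments shown here are assumptions and will be fixed in this year's rules document.

# Sketch only: install_tool.sh is run once per tool; the pipeline then calls
# the prepare/run scripts for every instance. Argument order is assumed.
import subprocess

def evaluate_instance(tool_dir, category, onnx_file, vnnlib_file, result_file, timeout_s):
    # Convert the instance into the tool's own format (e.g., compile the network).
    subprocess.run(
        ["bash", f"{tool_dir}/prepare_instance.sh", category, onnx_file, vnnlib_file],
        check=True)
    # Analyze the instance; the tool writes its result to result_file.
    subprocess.run(
        ["bash", f"{tool_dir}/run_instance.sh", category, onnx_file, vnnlib_file,
         result_file, str(timeout_s)],
        check=True)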

Benchmarks

You submit benchmarks. Some are used for scoring. Last year, each participant could submit one benchmark.

 

Proposed change: non-participants can submit benchmarks too. Each participant can choose two benchmarks that get used for scoring (so they can nominate someone else's benchmark if they don't have their own).

 

Non-scored benchmarks: these were used to provide a year-to-year comparison.

Where do benchmarks come from?

Alternatives: organizers choose which benchmarks should be scored ("need larger / more control benchmarks"). Discuss.

We use AWS cloud services (EC2) to run the competition. Last year we had two hardware types, a CPU and a GPU instance, each roughly costing $3/hour.

 

Changes: some feedback was that the GPU instance's CPU was too slow. We can have more instance types now, although not all advertised instance types are available.

Huan Zhang: "On my side, I tried to evaluate the new g5.4xlarge and g5.8xlarge GPU instances. However, it seems AWS currently has availability issues on these (g5 is a quite new type released last November)"

Evaluation Hardware

Proposal: Continue to try to get better hardware availability. Finalize 3-5 instance types within a month. Discuss.

Old result: property "holds" / "violated"

 

Suggestion: move closer to SAT-solver terminology by changing this to "sat" and "unsat"

 

In the case of "sat", the output file should also include the concrete assignment to inputs and outputs.

 

This helps when tools disagree on results. Providing the assignment will be optional, but tools will be penalized if they omit it and there is a mismatch. Numerical precision issues will not be penalized. The ground-truth result is what we obtain using onnxruntime.
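As an illustration of how a mismatch could be checked, the sketch below re-runs a claimed counterexample input through the network with onnxruntime and compares the result against the tool-reported outputs; the flat input shape and the tolerance are assumptions for this example, not part of the rules.

import numpy as np
import onnxruntime as ort

def check_witness(onnx_path, x_values, y_claimed, atol=1e-4):
    # Evaluate the network on the claimed input and compare with the claimed outputs.
    sess = ort.InferenceSession(onnx_path)
    inp = sess.get_inputs()[0]
    # Assumption: a flat input reshaped to (1, n); real benchmarks may need
    # other shapes or dtypes, taken from inp.shape / inp.type.
    x = np.asarray(x_values, dtype=np.float32).reshape(1, -1)
    y_ref = np.asarray(sess.run(None, {inp.name: x})[0]).flatten()
    return np.allclose(y_ref, np.asarray(y_claimed, dtype=np.float32), atol=atol)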

Counterexamples

; Property with label: 2.

(declare-const X_0 Real)
(declare-const X_1 Real)

(declare-const Y_0 Real)
(declare-const Y_1 Real)
(declare-const Y_2 Real)

; Input constraints:
(assert (<= X_0  0.05000000074505806))
(assert (>= X_0  0.0))

(assert (<= X_1  1.00))
(assert (>= X_1  0.95))

; Output constraints:
(assert (or
    (and (>= Y_0 Y_2))
    (and (>= Y_1 Y_2))
))

Example Input .vnnlib file

sat
((X_0 0.02500000074505806)
 (X_1 0.97500000000000000)
 (Y_0 -0.03500000023705806)
 (Y_1  0.32500000072225301)
 (Y_2 0.02500000094505020))

or

unsat / timeout / error

Example Suggested Output File

Discuss.
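Since the output format is still a proposal, the following is only an illustrative sketch of how the "sat" witness block above could be parsed into name/value pairs:

import re

def parse_result(text):
    # Split a result file into a status ("sat", "unsat", "timeout", ...) and,
    # for "sat", a dict mapping X_i / Y_j names to their assigned values.
    status, _, rest = text.strip().partition("\n")
    values = {}
    if status.strip() == "sat":
        for name, val in re.findall(r"\(\s*([XY]_\d+)\s+([-+0-9.eE]+)\s*\)", rest):
            values[name] = float(val)
    return status.strip(), values

For the example above, this returns ("sat", {"X_0": 0.025..., "X_1": 0.975..., "Y_0": -0.035..., "Y_1": 0.325..., "Y_2": 0.025...}).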

Last year, we subtracted tool overhead and only measured runtime.

 

Other option: "provide multiple verification instances at once." Cons: this creates dependencies among instances and rewards instance-selection strategies.

 

Count overhead as runtime... also not ideal, as acquiring the GPU can take several seconds (one tool's Julia compilation was ~40 seconds last year).

Overhead Measurement

I don't have any great ideas here. Discuss.

People seemed happy about randomizing images for evaluation. The seed was chosen after tools were finalized.

 

Can we require this? For control benchmarks ("ACAS Xu"), this seems less feasible, although we could still randomize input set sizes.

Benchmark Randomization

Should we require this for all benchmarks? Discuss.
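As a sketch of what required randomization could look like for an image benchmark (the dataset size, epsilon, and timeout below are placeholder values), instances would only be drawn once the seed is fixed:

import random

def sample_image_instances(seed, num_instances, dataset_size, epsilon, timeout_s):
    # Draw random image indices for a benchmark after the seed is fixed.
    # dataset_size, epsilon and timeout_s are placeholder parameters.
    rng = random.Random(seed)
    indices = rng.sample(range(dataset_size), num_instances)
    return [(idx, epsilon, timeout_s) for idx in indices]

# e.g. sample_image_instances(seed=2022, num_instances=100,
#                             dataset_size=10000, epsilon=0.03, timeout_s=300)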

Should we try to get more complicated specifications?
For example, nonlinear constraints over outputs.

 

Do we still want to allow per-benchmark tuning? A per-benchmark hardware option (CPU/GPU)?

 

Raise the 6-hour per-benchmark timeout? Is there any value in this?

 

Changes to scoring? Each benchmark counts as 100 percent and the overall winner has the highest sum of percentages (see the sketch below).

 

Other Issues Not Discussed Online

Is there anything else about the rules we need to change? Discuss.
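For reference, a minimal sketch of the current scheme, assuming per-benchmark points are normalized so that the top-scoring tool earns 100 percent and percentages are then summed (the per-instance point rules themselves are in last year's rules document):

def overall_scores(points):
    # points: {benchmark: {tool: raw_points}} -> {tool: sum of percentages}.
    # Each benchmark is normalized so the top-scoring tool earns 100 percent;
    # the overall winner has the highest sum across benchmarks.
    totals = {tool: 0.0 for bench in points.values() for tool in bench}
    for tool_points in points.values():
        best = max(tool_points.values())
        if best <= 0:
            continue  # no tool earned points on this benchmark
        for tool, p in tool_points.items():
            totals[tool] += 100.0 * p / best
    return totals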

Prepare your benchmarks. A GitHub issue will be created and emailed out shortly to the organizer discussion list.

 

Once your benchmark is in the right format, you can check it automatically at https://vnncomp.christopher-brix.de/. Then post the URL to the validated benchmark on the GitHub issue.

 

Deadline: ~1 month. We may have another organization meeting after benchmarks are finalized, if needed.

Next Steps

Thanks!

Stanley Bak

Changliu Liu

Taylor Johnson

Christopher Brix

Mark Müller
