April 12, 2022
Stanley Bak, Changliu Liu, Taylor Johnson, Christopher Brix, Mark Müller
Input Set: \(i_1 \in [0, 1]\), \(i_2 \in [0, 1]\), \(\ldots\), \(i_n \in [0, 1]\)
Output Set: \(o_1 \geq o_2\), \(o_1 \geq o_3\), \(\ldots\), \(o_1 \geq o_m\)
VNN-COMP is once again co-hosted with the Workshop on Formal Methods for ML-Enabled Autonomous Systems (FoMLAS) at CAV, July 31-Aug 1, 2022.
We have a website: sites.google.com/view/vnn2022
The main communication channel is the github site: github.com/stanleybak/vnncomp2022/
Three phases:
1. Determine Rules
2. Submit Your Benchmarks
3. Run Your Tools on the Benchmarks
Rules will be similar to last year. If you didn't participate last year, you can look through last year's rule document.
Goal of this meeting: finalize any discussion points on changes for this year's rules. A rules document will be produced and posted on the github issue for any final comments.
Terminology: People submit benchmarks, which are collections of instances (specific network / specification / timeout in seconds)
Neural Networks are given in .onnx format
Specifications are given in .vnnlib format
Output result written to a text file
You must provide several scripts for your tool to automate the evaluation process (see the sketch after this list):
1. Install your tool
2. Convert single instance into the correct format
3. Analyze the instance and produce a result
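For concreteness, a minimal sketch of the third step, assuming a hypothetical command-line tool named my_verifier; the actual script names and interface are defined in the rules document, so everything below is illustrative only:

import subprocess
import sys

def run_instance(onnx_path, vnnlib_path, timeout_s, result_path):
    # Analyze one (network, specification, timeout) instance and write the result file.
    try:
        proc = subprocess.run(
            ["my_verifier", "--onnx", onnx_path, "--vnnlib", vnnlib_path],  # hypothetical CLI
            capture_output=True, text=True, timeout=timeout_s)
        result = proc.stdout.strip() or "error"  # expected: sat / unsat / error
    except subprocess.TimeoutExpired:
        result = "timeout"
    with open(result_path, "w") as f:
        f.write(result + "\n")

if __name__ == "__main__":
    run_instance(sys.argv[1], sys.argv[2], float(sys.argv[3]), sys.argv[4])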
You submit benchmarks. Some are used for scoring. Last year, each participant could submit one benchmark.
Proposed change: non-participants can submit benchmarks too. Each participant can choose two benchmarks that get used for scoring (so they can nominate someone else's benchmark if they don't have their own).
Non-scored benchmarks: last year these were used to provide a year-to-year comparison.
Alternatives: organizers choose which benchmarks should be scored ("need larger / more control benchmarks"). Discuss.
We use AWS cloud services (EC2) to run the competition. Last year we had two hardware types, a CPU instance and a GPU instance, each costing roughly $3/hour.
Changes: some feedback was that the GPU instance's CPU was too slow. More instance types are possible now, although not every advertised instance type is actually available.
Huan Zhang: "On my side, I tried to evaluate the new g5.4xlarge and g5.8xlarge GPU instances. However, it seems AWS currently has availability issues on these (g5 is a quite new type released last November)"
Proposal: Continue to try to get better hardware availability. Finalize 3-5 instance types within a month. Discuss.
Old result: property "holds" / "violated"
Suggestion: move closer to SAT-solver terminology, so change this to "sat" and "unsat".
In the case of "sat", the output file should also include the concrete assignment to inputs and outputs.
This helps when tools disagree on results. Including the assignment will be optional, but tools will be penalized if they omit it and there is a mismatch. Numerical precision issues will not be penalized. The ground-truth result is what we obtain by running the network with onnxruntime.
; Property with label: 2.
(declare-const X_0 Real)
(declare-const X_1 Real)
(declare-const Y_0 Real)
(declare-const Y_1 Real)
(declare-const Y_2 Real)
; Input constraints:
(assert (<= X_0 0.05000000074505806))
(assert (>= X_0 0.0))
(assert (<= X_1 1.00))
(assert (>= X_1 0.95))
; Output constraints:
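; (a counterexample requires some other output to be at least Y_2, i.e., label 2 is not the maximum)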
(assert (or
(and (>= Y_0 Y_2))
(and (>= Y_1 Y_2))
))
The result file contains either a single line with unsat / timeout / error, or sat followed by the concrete assignment, for example:
sat
((X_0 0.02500000074505806)
(X_1 0.97500000000000000)
(Y_0 -0.03500000023705806)
(Y_1 0.32500000072225301)
(Y_2 0.02500000094505020))
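A minimal sketch of how such a reported assignment could be cross-checked against the network with onnxruntime; the input-name handling, dynamic-dimension treatment, and tolerance below are assumptions, not the official checking procedure:

import numpy as np
import onnxruntime as ort

def check_witness(onnx_path, x_values, y_reported, tol=1e-4):
    # Run the reported inputs through the network and compare against the reported outputs.
    sess = ort.InferenceSession(onnx_path)
    inp = sess.get_inputs()[0]
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]  # replace dynamic dims with 1
    x = np.array(x_values, dtype=np.float32).reshape(shape)
    y = sess.run(None, {inp.name: x})[0].flatten()
    return np.allclose(y, np.array(y_reported, dtype=np.float32), atol=tol)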
Discuss.
Last year, we subtracted tool overhead and only measured runtime.
Other option: provide multiple verification instances at once. Cons: this creates dependencies among instances or rewards the instance-selection process rather than verification itself.
Count overhead as runtime: also not ideal, since acquiring a GPU can take several seconds (one tool's Julia compilation took ~40 seconds last year).
I don't have any great ideas here. Discuss.
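One possible compromise, sketched below under the assumption that each tool has a near-trivial warm-up instance: estimate the fixed startup cost (GPU acquisition, Julia/JIT compilation) from that instance and subtract it from every measured runtime. All names here are illustrative.

import time

def measure_overhead(run_tool, trivial_instance, repeats=3):
    # Median wall-clock time on a near-instant instance approximates fixed startup cost.
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_tool(trivial_instance)
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

def scored_runtime(wall_time, overhead):
    # Runtime used for ranking: wall-clock time minus estimated overhead, floored at zero.
    return max(0.0, wall_time - overhead)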
People seemed happy with randomizing images for evaluation. The seed was chosen after tools were finalized.
Can we require this? For control benchmarks ("ACAS Xu"), this seems less feasible, although we could still randomize input set sizes.
Should we require this for all benchmarks? Discuss.
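A minimal sketch of seeded randomization for an image-classification benchmark; the dataset size, epsilon handling, and file layout are assumptions for illustration, not a required generator:

import numpy as np

def select_images(num_images, dataset_size, seed):
    # Pick image indices reproducibly; the seed is only fixed after tools are frozen.
    rng = np.random.default_rng(seed)
    return rng.choice(dataset_size, size=num_images, replace=False)

def write_vnnlib(path, image, label, epsilon, num_classes):
    # Emit an L-infinity robustness spec in the vnnlib style shown earlier.
    with open(path, "w") as f:
        for i in range(len(image)):
            f.write(f"(declare-const X_{i} Real)\n")
        for j in range(num_classes):
            f.write(f"(declare-const Y_{j} Real)\n")
        for i, x in enumerate(image):
            f.write(f"(assert (<= X_{i} {min(x + epsilon, 1.0)}))\n")
            f.write(f"(assert (>= X_{i} {max(x - epsilon, 0.0)}))\n")
        f.write("(assert (or\n")
        for j in range(num_classes):
            if j != label:
                f.write(f"  (and (>= Y_{j} Y_{label}))\n")  # counterexample: another class wins
        f.write("))\n")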
Should we try to get more complicated specifications?
For example, nonlinear constraints over outputs.
Do we still want to allow per-benchmark tuning? A per-benchmark hardware option (CPU/GPU)?
Raise the 6 hour timeout per benchmark? Is there any value in this?
Changes to scoring? Currently, each benchmark counts as 100 percent and the overall winner is the tool with the highest sum of percentages.
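For reference, a sketch of that scheme, assuming each benchmark is normalized by the best point total any tool achieved on it (the exact point values and normalization are defined in the rules document):

def benchmark_percent(points, best_points):
    # A tool's score on one benchmark as a percentage of the best total achieved on it.
    return 100.0 * points / best_points if best_points > 0 else 0.0

def overall_score(points_per_benchmark, best_per_benchmark):
    # Each benchmark contributes equally; the winner has the highest sum of percentages.
    return sum(benchmark_percent(p, best_per_benchmark[name])
               for name, p in points_per_benchmark.items())

# Example: 80/100 on one benchmark and 45/50 on another gives 80 + 90 = 170.
print(overall_score({"cifar_bench": 80, "acasxu": 45}, {"cifar_bench": 100, "acasxu": 50}))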
Is there anything else about the rules we need to change? Discuss.
Prepare your benchmarks. A github issue will be created and emailed out shortly for organizer discussion.
Once things are in the right format, you can automatically validate your benchmark at https://vnncomp.christopher-brix.de/, then post the URL of the validated benchmark on the github issue.
Deadline: ~1 month. May have another organization meeting after benchmarks are finalized if needed.