When micro-optimisations matter

Me @arnaudroger

  • Wrote my first line of Java in 1997
  • Worked in 3 countries Paris-London-Belfast
  • Media, Finance, Telco, Charity, Security
  • 3 Children ... 7-4-2
  • Identical twin brother, a Java dev too
  • SimpleFlatMapper
  • Blog https://arnaudroger.github.io/blog/
  • Working at Rapid7 on the Attacker Behavior Analytics
    • up to 110 000 events per second
    • and replays at 6M

You

  • 10 tps or less?
  • 10 tps to 1000 tps?
  • 1000 tps to 100 000 tps?
  • 1 million tps?

Usual argument against

  • hardware is cheap
  • a few milliseconds do not matter
  • readability cost
  • premature optimisation is EVIL!

Many small streams make one big river

1 second has only 1 000 000 microseconds

It all adds up

Cost model

today's CPUs are not your grandparents' CPUs

Cost model

the 1900s cost model

  • instruction based
  • Big-O complexity is a good approximation

Cost model

today's cost model

  • cache misses
  • core occupation
  • instruction dependency
  • predictable patterns
  • Big-O complexity is misleading
    • think LinkedList vs ArrayList
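A minimal sketch of the LinkedList vs ArrayList point. Both traversals below are O(n), yet the linked version chases a pointer per element while the array version walks contiguous references. The class and method names are illustrative only, and the timing in `main` is deliberately crude; use JMH for real numbers.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ListTraversal {
    // Sums every element; same O(n) for both lists, only the memory layout differs.
    static long sum(List<Integer> list) {
        long total = 0;
        for (int value : list) {
            total += value;
        }
        return total;
    }

    // Appends 0..n-1 to the given list and returns it.
    static List<Integer> fill(List<Integer> list, int n) {
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return list;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        List<Integer> array = fill(new ArrayList<>(), n);
        List<Integer> linked = fill(new LinkedList<>(), n);

        // Crude wall-clock timing, no warm-up: indicative only.
        long t0 = System.nanoTime();
        long a = sum(array);
        long t1 = System.nanoTime();
        long l = sum(linked);
        long t2 = System.nanoTime();

        System.out.println("ArrayList  sum=" + a + " in " + (t1 - t0) / 1_000 + " us");
        System.out.println("LinkedList sum=" + l + " in " + (t2 - t1) / 1_000 + " us");
    }
}
```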

Java Cost model

What does that mean for Java?

  • pointer chasing is expensive, bye bye LinkedList
  • allocation is cheap but the next load is not: cache miss
  • the CPU does not execute bytecode, be aware of the JIT

 

 

Java Cost model

What kinds of optimisation does the JIT provide?

  • inlining - the mother of all optimisations -
  • escape analysis
  • loop unrolling
  • bounds check elimination
  • virtual call optimisation
  • dead code elimination....

 

 

Cost model is not enough

  • measure
  • measure
  • measure!

 

 

Not all profilers are the same

  • safepoint bias
  • interference with the JIT
  • JIT code to Java code interpolation
  • skidding

 

 

My performance toolkit

  • JMC/Flight Recorder
  • perf-map-agent, Linux only
  • JMH
  • JITWatch

 

Also worth considering Solaris Studio.

 

 

Workflow

  1. identify hot path - measure in prod if possible
    • perf-map-agent flamegraph
    • honest profiler
  2. produce jmh benchmark
  3. improve perf
  4. validate with benchmark
  5. go back to 1

Real-life example: Re2j

Why

  • Java regexes are faster most of the time
  • but terribly slower at the 99.9th percentile
  • we process 100 000s of regexes per second.

"If such choices are deeply nested, this strategy requires an exponential number of passes over the input data before it can detect whether the input matches. If the input is large, it is easy to construct a pattern whose running time would exceed the lifetime of the universe."

 

Regex Dos

https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS

 

 

        
        // run the following on Java 8; fixed in Java 9
        // needs: import java.util.regex.Pattern;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 80; i++) sb.append("a");
        sb.append("!"); // the trailing '!' makes the match fail and forces backtracking

        long start = System.currentTimeMillis();
        boolean b = Pattern.compile("^(a+)+$").matcher(sb.toString()).matches();
        long t = System.currentTimeMillis() - start; // elapsed time in ms
        System.out.println("t = " + t + " " + b);

All the following modifications are available at

https://github.com/arnaudroger/re2j/tree/jug

 

2 old PRs in re2j, not totally in line with the changes in this presentation

https://github.com/google/re2j/pull/35/ - Merged!

https://github.com/google/re2j/pull/36/

 

All the benchmarks run 20 iterations and 20 forks on a box with a fixed CPU speed, but can still be noisy

Flame graph

    -XX:+PreserveFramePointer

you'll need to add that to your start script

then on the instance

 cd /tmp
 sudo apt-get --yes install git linux-tools-generic linux-tools-x.y.z-w-generic cmake && \
 wget https://github.com/jvm-profiling-tools/perf-map-agent/archive/master.zip && \
 unzip master.zip && \
 cd perf-map-agent-master && \
 cmake . && \
 make  && \
 git clone https://github.com/brendangregg/FlameGraph
 export FLAMEGRAPH_DIR=/tmp/perf-map-agent-master/FlameGraph
 export PERF_RECORD_SECONDS=60

 bin/perf-java-flames <pid>

Flame graph

Re2j

String EXP1 = "\\\\.*(documents|\\$documents\\.user)\\\\";
String EXP2 = "abcdef.exe|foooooo.exe|bargoo.exe|ratatouille.exe|orleans.exe";
String[] DATA = {
    "bargoo.exe", 
    "\\SystemRoot\\System32\\bargoo.exe",
    "somefile.exe",
    "C:\\WINDOWS\\system32\\somefile.exe",
    "cmd.exe",
    "\"C:\\WINDOWS\\system32\\cmd.exe\" ",
    "powershell.exe",
    "powershell.exe -Command function Main {\n    $lorem = \\\"Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur\\\"\n  }\n \nMain\n"
    };

Re2j

    @Benchmark
    public void testExp1(Blackhole blackhole) {
        for(String str : data) {
            blackhole.consume(exp1.matcher(str).find());
        }
    }
    
    @Benchmark
    public void testExp2(Blackhole blackhole) {
        for (String str : data) {
            blackhole.consume(exp2.matcher(str).find());
        }
    }

    @Benchmark
    public void testCombine(Blackhole blackhole) {
        for (String str : data) {
            blackhole.consume(exp1.matcher(str).find());
            blackhole.consume(exp2.matcher(str).find());
        }
    }

Re2j

Benchmark - jmh

 

 

 


Java 8 vs Re2j 1.1

Benchmark                    Mode  Cnt      Score      Error  Units
JavaRegex.testCombine       thrpt  200  16553.594 ± 397.406  ops/s
Re2jFindRegex.testCombine   thrpt  200   1504.195 ±   9.869  ops/s  11x slower 😞

JavaRegex.testExp1          thrpt  200  64284.475 ± 308.279  ops/s
Re2jFindRegex.testExp1      thrpt  200   4842.201 ±  56.966  ops/s  13x slower 😞

JavaRegex.testExp2          thrpt  200  28048.060 ± 110.140  ops/s
Re2jFindRegex.testExp2      thrpt  200   2195.518 ±  34.807  ops/s  13x slower 😞


Flight Recorder

java -jar target/benchmarks.jar -f 1 -wi 1 -i 1000 Re2jRegex.testExp2 \
  -jvmArgs "-XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
            -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints"
  • part of Java Mission Control, ported from JRockit
  • free for use in dev - needs a license for prod -
  • will be open sourced in the near future - Java 11 -
  • sampling without safepoint bias
  • accurate memory profiling that does not interfere with escape analysis (EA)

Hot Methods

Hot Methods

  • simple fold - potential big impact
    • but what is that?
  • Enum.ordinal() ? - surely should not matter?
    • use int
  • Inst.op()
    • ??
  • Lots of ArrayList access - could we gain here
    • use array directly

 

Enum.ordinal()

Enum.ordinal()

Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   1645.880 ±  17.718  ops/s  9%
Re2jFindRegex.testExp1      thrpt  200   5713.571 ±  61.726  ops/s 18%
Re2jFindRegex.testExp2      thrpt  200   2495.788 ±  24.896  ops/s 14%

9% for a 60-second change, not bad!

the 18% looks inflated compared to Re2jMatchRegex
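A sketch of the kind of change meant by "use int", not the actual re2j patch. The opcode names and the `describe` methods are made up for illustration. javac typically compiles a switch on an enum through a synthetic int[] indexed by `ordinal()`, so every dispatch pays for an extra call and array load; switching on a plain int avoids that.

```java
class OpDispatch {
    // Hypothetical subset of opcodes as an enum ...
    enum Op { ALT, MATCH, RUNE }

    // ... and the same opcodes as plain int constants.
    static final int OP_ALT = 0;
    static final int OP_MATCH = 1;
    static final int OP_RUNE = 2;

    // Enum switch: dispatch goes through op.ordinal() and a lookup table.
    static String describe(Op op) {
        switch (op) {
            case ALT:   return "alt";
            case MATCH: return "match";
            case RUNE:  return "rune";
            default:    return "unknown";
        }
    }

    // Int switch: dispatch on the value itself.
    static String describe(int op) {
        switch (op) {
            case OP_ALT:   return "alt";
            case OP_MATCH: return "match";
            case OP_RUNE:  return "rune";
            default:       return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(Op.RUNE) + " " + describe(OP_RUNE));
    }
}
```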

Inst.op()

Inst.op()

Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   1691.707 ±  16.574  ops/s  3%
Re2jFindRegex.testExp1      thrpt  200   5899.987 ±  58.776  ops/s  3%
Re2jFindRegex.testExp2      thrpt  200   2544.503 ±  28.540  ops/s  2%

and another 3%.

3% could easily just be noise, but here it is consistent across benchmarks

ArrayList

ArrayList

ArrayList

Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   1885.322 ±  26.392  ops/s  11%
Re2jFindRegex.testExp1      thrpt  200   6465.808 ± 142.104  ops/s  10%
Re2jFindRegex.testExp2      thrpt  200   2791.424 ±  28.951  ops/s  10%

Really?

 

ArrayList.get

  • check against size
  • array boundary check
  • checkcast - erasure .... -


 

 

Array

  • no check against size
  • array boundary check
  • no checkcast: arrays are typed!
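The two access patterns side by side, as a sketch with illustrative names. The list version pays for the index check against size, the array bounds check and a checkcast at the call site on every `get`; the array version keeps only the bounds check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ArrayVsList {
    // items.get(i): check against size + array bounds check + checkcast to String.
    static int totalLength(List<String> items) {
        int total = 0;
        for (int i = 0; i < items.size(); i++) {
            total += items.get(i).length();
        }
        return total;
    }

    // items[i]: a single bounds check, and the element type is already known.
    static int totalLength(String[] items) {
        int total = 0;
        for (int i = 0; i < items.length; i++) {
            total += items[i].length();
        }
        return total;
    }

    public static void main(String[] args) {
        String[] array = {"ab", "cde", "f"};
        List<String> list = new ArrayList<>(Arrays.asList(array));
        System.out.println(totalLength(list) + " " + totalLength(array));
    }
}
```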

Interlude

testCombine 25%
testExp1 34%
testExp2 27%

Unicode.simpleFold()

// https://github.com/google/re2j/blob/master/java/com/google/re2j/Inst.java#L64
 if ((arg & RE2.FOLD_CASE) != 0) {
    for (int r1 = Unicode.simpleFold(r0); 
         r1 != r0; // loop until folded on the original code point A -> a -> A over!
         r1 = Unicode.simpleFold(r1)) {
      if (r == r1) {
        return true;
      }
    }
 }

"simpleFold iterates over Unicode code points equivalent under the Unicode-defined simple case folding"

  • A -> a
  • K 75,k 107,K 8490
  • Θ 920,θ 952,ϑ 977,ϴ 1012

Unicode.simpleFold()

// https://github.com/google/re2j/blob/master/java/com/google/re2j/Unicode.java#L203
  static int simpleFold(int r) {
    // Consult caseOrbit table for special cases.
    int lo = 0;
    int hi = UnicodeTables.CASE_ORBIT.length;
    while (lo < hi) {
      int m = lo + (hi - lo) / 2;
      if (UnicodeTables.CASE_ORBIT[m][0] < r) {
        lo = m + 1;
      } else {
        hi = m;
      }
    }
    if (lo < UnicodeTables.CASE_ORBIT.length &&
        UnicodeTables.CASE_ORBIT[lo][0] == r) {
      return UnicodeTables.CASE_ORBIT[lo][1];
    }

    // No folding specified.  This is a one- or two-element
    // equivalence class containing rune and toLower(rune)
    // and toUpper(rune) if they are different from rune.
    int l = toLower(r);
    if (l != r) {
      return l;
    }
    return toUpper(r);
  }

Unicode.simpleFold()

  • very inefficient
    • called a minimum of twice on every code point
    • binary search every time!
  • but we can precompute the list of code points
  • there is a max of 4
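A minimal sketch of the precomputation idea, not the actual re2j change. `simpleFold` here is a simplified stand-in built on `Character.toLowerCase`/`toUpperCase` (it ignores the CASE_ORBIT table), and `foldOrbit` and `RuneInst` are hypothetical names. The orbit is walked once when the instruction is built, so matching becomes a scan over at most 4 ints.

```java
import java.util.Arrays;

class PrecomputedFold {
    // Simplified stand-in for Unicode.simpleFold: lower/upper case only.
    static int simpleFold(int r) {
        int l = Character.toLowerCase(r);
        if (l != r) {
            return l;
        }
        return Character.toUpperCase(r);
    }

    // Walks the fold orbit once; r0 comes first, capped at 4 entries.
    static int[] foldOrbit(int r0) {
        int[] orbit = new int[4];
        int n = 0;
        orbit[n++] = r0;
        for (int r1 = simpleFold(r0); r1 != r0 && n < orbit.length; r1 = simpleFold(r1)) {
            orbit[n++] = r1;
        }
        return Arrays.copyOf(orbit, n);
    }

    static final class RuneInst {
        private final int[] runes; // r0 plus its folded variants, computed once

        RuneInst(int r0, boolean foldCase) {
            this.runes = foldCase ? foldOrbit(r0) : new int[] {r0};
        }

        // No simpleFold call and no binary search on the hot path.
        boolean matches(int r) {
            for (int candidate : runes) {
                if (candidate == r) {
                    return true;
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        RuneInst inst = new RuneInst('A', true);
        System.out.println(inst.matches('a') + " " + inst.matches('b'));
    }
}
```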

Unicode.simpleFold()

Unicode.simpleFold()

Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   2973.766 ±  32.765  ops/s  58% total  98% 👍
Re2jFindRegex.testExp1      thrpt  200   7970.696 ±  99.064  ops/s  23% total  65% 👍
Re2jFindRegex.testExp2      thrpt  200   5315.117 ±  71.776  ops/s  90% total 142% 👍

Flight recorder

  • Prog.getInst()
  • Queue.* / Machine.free()

PS: used to be less than 2%

Prog.getInst

  • use Inst directly
  • pre-link Inst to each other
  • store pc in Inst
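A sketch of the pre-linking idea under simplified assumptions: a program is just an array of "next pc" indices. `outInst` matches the field name visible in the assembly later on; `link` and `follow` are illustrative names. After linking, following an edge is a field load instead of a lookup in the program's instruction list.

```java
class PreLinkedProgram {
    static final class Inst {
        final int pc;   // own index, stored so no reverse lookup is needed
        final int out;  // index of the next instruction (original representation)
        Inst outInst;   // direct reference, resolved once after compilation

        Inst(int pc, int out) {
            this.pc = pc;
            this.out = out;
        }
    }

    // Builds the instructions, then resolves every index into a reference.
    static Inst[] link(int[] outs) {
        Inst[] prog = new Inst[outs.length];
        for (int i = 0; i < outs.length; i++) {
            prog[i] = new Inst(i, outs[i]);
        }
        for (Inst inst : prog) {
            inst.outInst = prog[inst.out];
        }
        return prog;
    }

    // Follows 'steps' edges without touching the program array again.
    static int follow(Inst start, int steps) {
        Inst current = start;
        for (int i = 0; i < steps; i++) {
            current = current.outInst;
        }
        return current.pc;
    }

    public static void main(String[] args) {
        Inst[] prog = link(new int[] {1, 2, 0});
        System.out.println(follow(prog[0], 2));
    }
}
```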

Prog.getInst

Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   3301.226 ±  22.490  ops/s  11%/119%
Re2jFindRegex.testExp1      thrpt  200   8701.832 ±  52.752  ops/s   9%/ 80%
Re2jFindRegex.testExp2      thrpt  200   5497.766 ±  75.145  ops/s   3%/150%

Queue/Machine.free

Used to store the currently active matching threads

  • allow iteration in order of active threads
  • store if inst has already been processed for iteration
  • use a dense/sparse array with Entry struct
    • http://research.swtch.com/2008/03/using-uninitialized-memory-for-fun-and.html
    • Uninitialized????

 

Queue/Machine.free

We can

  • split the contains from dense storage
  • use a bit mask for contains when pc < 64, and a boolean[] otherwise
  • optimise Queue free by using arraycopy
    • no more null

Queue/Machine.free

Queue/Machine.free

Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   3755.242 ±  39.526  ops/s  14%/150%
Re2jFindRegex.testExp1      thrpt  200   9335.040 ± 177.798  ops/s   7%/ 93%
Re2jFindRegex.testExp2      thrpt  200   6313.462 ±  88.504  ops/s  15%/188%

Interlude

testCombine 150%
testExp1 93%
testExp2 188%

So where are we

Flight Recorder got us this far

from 12x to 4-5x slower

a perf improvement of between 93% and 188%

 

now time to....

Look at the x86 assembly

JMH has multiple profilers integrated,

including perfasm.

  • uses perf
  • maps code addresses to Java source

As good as it can get, but it needs to run on Linux...

or Windows

Look at the x86 assembly

- needs the hsdis library in your JVM

see https://wiki.openjdk.java.net/display/HotSpot/PrintAssembly

echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid # needed only once per boot
java -jar target/benchmarks.jar -f 1 Re2jRegex.testExp2  -prof perfasm

perfasm

 27.59%   29.33%         C2, level 4  com.google.re2j.Machine::add, version 487 (373 bytes) 
 22.30%   20.97%         C2, level 4  com.google.re2j.Machine::add, version 487 (231 bytes) 
 19.77%   18.44%         C2, level 4  com.google.re2j.Machine::step, version 490 (301 bytes) 
 11.40%   13.05%         C2, level 4  com.google.re2j.Machine::match, version 538 (764 bytes) 
  8.37%    7.57%         C2, level 4  com.google.re2j.Machine::step, version 490 (348 bytes) 
  5.79%    6.73%        runtime stub  StubRoutines::jint_disjoint_arraycopy (128 bytes) 

perfasm

  0.11%    0.09%  │  0x00007fd8dd2104ba: mov    0x38(%rsp),%r10
  0.55%    0.53%  │  0x00007fd8dd2104bf: mov    0xc(%r10),%r10d    ;*getfield op
                  │                                                ; - com.google.re2j.Machine::add@23 (line 343)
  0.80%    0.54%  │  0x00007fd8dd2104c3: or     %r11,%r8
  0.61%    0.53%  │  0x00007fd8dd2104c6: mov    %r8,0x10(%rdx)     ;*putfield pcsl
                  │                                                ; - com.google.re2j.Machine$Queue::add@15 (line 57)
                  │                                                ; - com.google.re2j.Machine::add@19 (line 342)
  0.13%    0.17%  │  0x00007fd8dd2104ca: mov    %r10d,%r11d
  0.47%    0.55%  │  0x00007fd8dd2104cd: dec    %r11d
  0.76%    0.56%  │  0x00007fd8dd2104d0: cmp    $0xc,%r11d
                  │  0x00007fd8dd2104d4: jae    0x00007fd8dd21070e  ;*tableswitch
                  │                                                ; - com.google.re2j.Machine::add@26 (line 343)
  0.55%    0.53%  │  0x00007fd8dd2104da: mov    0x38(%rsp),%r11
  0.15%    0.24%  │  0x00007fd8dd2104df: mov    0x14(%r11),%r8d    ;*getfield arg
                  │                                                ; - com.google.re2j.Machine::add@141 (line 357)
  0.56%    0.54%  │  0x00007fd8dd2104e3: mov    0x30(%r11),%r11d
  0.77%    0.76%  │  0x00007fd8dd2104e7: movslq %r10d,%r9
  0.64%    0.71%  │  0x00007fd8dd2104ea: mov    %r11,%rcx
  0.10%    0.18%  │  0x00007fd8dd2104ed: shl    $0x3,%rcx          ;*getfield outInst
                  │                                                ; - com.google.re2j.Machine::add@176 (line 363)
  0.55%    0.60%  │  0x00007fd8dd2104f1: movabs $0x7fd8dd2103e0,%r10  ;   {section_word}
  0.73%    0.67%  │  0x00007fd8dd2104fb: jmpq   *-0x8(%r10,%r9,8)  ;*tableswitch
                  │                                                ; - com.google.re2j.Machine::add@26 (line 343)
           0.00%  ↘  0x00007fd8dd210500: mov    0x70(%rsp),%rax
  0.01%              0x00007fd8dd210505: jmpq   0x00007fd8dd2106e5
                     0x00007fd8dd21050a: andn   %r8d,%edi,%r10d
                     0x00007fd8dd21050f: test   %r10d,%r10d

perfasm

Machine.add and Machine.step are very costly.

The methods are big, and inlining is very limited.

 

switch(inst.op) is very much like virtual call dispatch

-> use polymorphism instead

-> better profiling information, JIT might do better than the switch
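A before/after sketch of replacing the opcode switch with polymorphism. The opcodes and class names are a hypothetical subset chosen for illustration. With one small method per subclass, the JIT gets a type profile per call site and can inline the common receiver.

```java
class InstDispatch {
    static final int OP_RUNE = 0;
    static final int OP_ANY = 1;

    // Before: one big method dispatching on the opcode.
    static boolean matchesSwitch(int op, int arg, int rune) {
        switch (op) {
            case OP_RUNE: return arg == rune;
            case OP_ANY:  return true;
            default:      return false;
        }
    }

    // After: each instruction type carries its own small implementation.
    abstract static class Inst {
        abstract boolean matches(int rune);
    }

    static final class RuneInst extends Inst {
        private final int expected;

        RuneInst(int expected) {
            this.expected = expected;
        }

        @Override
        boolean matches(int rune) {
            return rune == expected;
        }
    }

    static final class AnyInst extends Inst {
        @Override
        boolean matches(int rune) {
            return true;
        }
    }

    public static void main(String[] args) {
        Inst[] prog = {new RuneInst('a'), new AnyInst()};
        for (Inst inst : prog) {
            System.out.println(inst.matches('b'));
        }
    }
}
```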

 

Inst polymorphism

Inst polymorphism

Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   4272.693 ±  71.341  ops/s  14%/184% 👍
Re2jFindRegex.testExp1      thrpt  200  11552.934 ± 199.890  ops/s  24%/139% 👍
Re2jFindRegex.testExp2      thrpt  200   7973.784 ± 113.688  ops/s  26%/263% 👍

perfasm 2

 29.17%   26.09%         C2, level 4  com.google.re2j.Machine::step, version 504 (1200 bytes) 
 25.99%   27.85%         C2, level 4  com.google.re2j.Machine::step, version 504 (605 bytes) 
 22.36%   24.75%         C2, level 4  com.google.re2j.Machine::match, version 557 (1126 bytes) 
  7.44%    8.03%        runtime stub  StubRoutines::jint_disjoint_arraycopy (128 bytes) 
  7.39%    7.60%         C2, level 4  com.google.re2j.Machine::step, version 504 (365 bytes) 
  • array copy - only used in 2 places

 

perfasm 2

  • optimized path for no capture?
    • expose a find method that does not capture

 

Pattern.find

Pattern.find

Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   5139.382 ±  37.840  ops/s  20%/242%
Re2jFindRegex.testExp1      thrpt  200  14420.410 ± 193.187  ops/s  25%/198%
Re2jFindRegex.testExp2      thrpt  200   9594.136 ± 173.792  ops/s  20%/337%

perfasm 3

 62.67%   60.66%         C2, level 4  com.google.re2j.Machine::step, version 500 (1434 bytes) 
 22.73%   25.48%         C2, level 4  com.google.re2j.Machine::match, version 550 (979 bytes) 
  4.77%    4.96%         C2, level 4  com.google.re2j.Machine::step, version 500 (381 bytes) 
  3.68%    4.65%         C2, level 4  com.google.re2j.Machine::step, version 500 (111 bytes) 
  1.31%    0.24%         C2, level 4  com.google.re2j.Machine::init, version 541 (312 bytes) 
  0.56%    0.60%         C2, level 4  com.google.re2j.Machine::match, version 550 (267 bytes) 
  0.55%    0.54%   [kernel.kallsyms]  [unknown] (5 bytes) 
  0.19%    0.05%         C2, level 4  com.google.re2j.Machine::init, version 541 (61 bytes) 

No more arraycopy!

  0.88%    0.58%   │ │  0x00007f242c3896fc: lea    (%r12,%r8,8),%r11
  0.55%    0.44%   │ │  0x00007f242c389700: mov    0x10(%r11,%r10,4),%r14d  ;*aaload
                   │ │                                                ; - com.google.re2j.Machine::step@27 (line 278)
  0.31%    0.41%   │ │  0x00007f242c389705: mov    0x10(%r12,%r14,8),%ebp  ;*getfield inst
                   │ │                                                ; - com.google.re2j.Machine::step@78 (line 283)
                   │ │                                                ; implicit exception: dispatches to 0x00007f242c38aded
  3.40%    3.11%   │ │  0x00007f242c38970a: mov    0x8(%r12,%rbp,8),%r8d  ; implicit exception: dispatches to 0x00007f242c38adfd
  7.00%    6.89%   │ │  0x00007f242c38970f: cmp    $0xf8019992,%r8d   ;   {metadata('com/google/re2j/Inst$RuneInst')}
                   │ │  0x00007f242c389716: jne    0x00007f242c389f11
  1.79%    1.36%   │ │  0x00007f242c38971c: lea    (%r12,%rbp,8),%r11  ;*invokevirtual isMatch
                   │ │                                                ; - com.google.re2j.Machine::step@85 (line 285)
  • virtual call to isMatch

isMatch

Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   5024.247 ±  76.465  ops/s  -2%/234%
Re2jFindRegex.testExp1      thrpt  200  14606.561 ± 111.745  ops/s   1%/202%
Re2jFindRegex.testExp2      thrpt  200  10152.329 ±  85.381  ops/s   6%/362%

isMatch

  0.14%    0.12%   ││  0x00007fcab81811ee: mov    0x10(%r11,%r10,4),%r14d  ;*aaload
                   ││                                                ; - com.google.re2j.Machine::step@27 (line 278)
  0.87%    0.66%   ││  0x00007fcab81811f3: mov    0x10(%r12,%r14,8),%r11d  ;*getfield inst
                   ││                                                ; - com.google.re2j.Machine::step@78 (line 283)
                   ││                                                ; implicit exception: dispatches to 0x00007fcab8182849
  3.71%    3.24%   ││  0x00007fcab81811f8: mov    0xc(%r12,%r11,8),%ebp  ;*getfield op
                   ││                                                ; - com.google.re2j.Machine::step@85 (line 285)
                   ││                                                ; implicit exception: dispatches to 0x00007fcab8182859
  6.46%    6.71%   ││  0x00007fcab81811fd: cmp    $0x6,%ebp
  1.53%    1.75%   ││  0x00007fcab8181200: je     0x00007fcab8181a6d  ;*if_icmpne
                   ││                                                ; - com.google.re2j.Machine::step@90 (line 285)
  1.86%    1.80%   ││  0x00007fcab8181206: mov    0x8(%r12,%r11,8),%r9d
                   ││  0x00007fcab818120b: cmp    $0xf8019992,%r9d   ;   {metadata('com/google/re2j/Inst$RuneInst')}
                   ││  0x00007fcab8181212: jne    0x00007fcab81819c5  ;*invokevirtual matchRune
                   ││                                                ; - com.google.re2j.Machine::step@189 (line 299)
  0.00%    0.01%   ││  0x00007fcab8181218: mov    0x20(%rsp),%r8

still high cost around inst access, cache miss?

 

What else?

  • flatten AltInst
    • Instead of a tree use an array/specialised type

 

  0.65%    0.55%  │     0x00007f36b9225d3f: mov    %rbx,%rsi
  0.05%    0.06%  │     0x00007f36b9225d42: and    %r9,%rsi           ;*land
                  │                                                   ; - com.google.re2j.Machine$Queue::contains@13 (line 47)
                  │                                                   ; - com.google.re2j.Inst$AltInst::add@5 (line 187)
                  │                                                   ; - com.google.re2j.Inst$AltInst::add@-1 (line 187)
  0.07%    0.11%  │     0x00007f36b9225d45: test   %rsi,%rsi
                  │     0x00007f36b9225d48: jne    0x00007f36b9226481  ;*ifeq
                  │                                                   ; - com.google.re2j.Machine$Queue::contains@16 (line 47)
                  │                                                   ; - com.google.re2j.Inst$AltInst::add@5 (line 187)
                  │                                                   ; - com.google.re2j.Inst$AltInst::add@-1 (line 187)
  0.30%    0.49%  │     0x00007f36b9225d4e: cmp    $0x40,%ecx
                  │     0x00007f36b9225d51: jge    0x00007f36b92264cd  ;*if_icmpge
                  │                                                   ; - com.google.re2j.Machine$Queue::add@3 (line 56)
                  │                                                   ; - com.google.re2j.Inst$AltInst::add@19 (line 190)
                  │                                                   ; - com.google.re2j.Inst$AltInst::add@-1 (line 187)
  0.25%    0.21%  │     0x00007f36b9225d57: mov    0x1c(%r10),%ebp    ;*getfield outInst
                  │                                                   ; - com.google.re2j.Inst$AltInst::add@23 (line 192)
                  │                                                   ; - com.google.re2j.Inst$AltInst::add@-1 (line 187)
  0.05%    0.05%  │     0x00007f36b9225d5b: or     %r9,%rbx           ;*lor  ; - com.google.re2j.Machine$Queue::add@14 (line 57)

AltInst

Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   5467.610 ±  60.482  ops/s  9%/263%
Re2jFindRegex.testExp1      thrpt  200  14808.181 ± 134.538  ops/s  1%/206%
Re2jFindRegex.testExp2      thrpt  200  11636.934 ±  91.307  ops/s 15%/430%

AltInst

  0.38%    0.33%    │  0x00007f220122449e: mov    %ebx,0xac(%rsp)
  0.00%             │  0x00007f22012244a5: vmovd  %eax,%xmm3
                    │  0x00007f22012244a9: mov    %rcx,%r14
  0.13%    0.12%    │  0x00007f22012244ac: mov    0xc(%rcx),%r10d    ;*getfield size
                    │                                                ; - com.google.re2j.Machine$Queue::addThread@6 (line 65)
                    │                                                ; - com.google.re2j.Inst$MatchInst::add@74 (line 106)
                    │                                                ; - com.google.re2j.Inst$Alt2Inst::add@35 (line 193)
                    │                                                ; - com.google.re2j.Machine::step@-1 (line 276)
  0.29%    0.27%    │  0x00007f22012244b0: mov    %r10d,0x28(%rsp)
                    │  0x00007f22012244b5: mov    0x20(%rcx),%r10d   ;*getfield denseThreads
                    │                                                ; - com.google.re2j.Machine$Queue::addThread@1 (line 65)
                    │                                                ; - com.google.re2j.Inst$MatchInst::add@74 (line 106)
                    │                                                ; - com.google.re2j.Inst$Alt2Inst::add@35 (line 193)
                    │                                                ; - com.google.re2j.Machine::step@-1 (line 276)
                    │  0x00007f22012244b9: vmovd  %r10d,%xmm2
  0.16%    0.11%    │  0x00007f22012244be: mov    0x28(%rsp),%r10d
  0.29%    0.27%    │  0x00007f22012244c3: inc    %r10d              ;*iadd
                    │                                                ; - com.google.re2j.Machine$Queue::addThread@11 (line 65)
                    │                                                ; - com.google.re2j.Inst$MatchInst::add@74 (line 106)
                    │                                                ; - com.google.re2j.Inst$Alt2Inst::add@35 (line 193)
                    │                                                ; - com.google.re2j.Machine::step@-1 (line 276)
                    │  0x00007f22012244c6: vmovd  %r10d,%xmm4
                    │  0x00007f22012244cb: mov    %r10d,0xc(%rcx)    ;*putfield size
                    │                                                ; - com.google.re2j.Machine$Queue::addThread@12 (line 65)
                    │                                                ; - com.google.re2j.Inst$MatchInst::add@74 (line 106)
                    │                                                ; - com.google.re2j.Inst$Alt2Inst::add@35 (line 193)
                    │                                                ; - com.google.re2j.Machine::step@-1 (line 276)

thread removal

Thread pooling takes a lot of time in different places

  • the pool is only necessary for capture - avoids the int[] allocation
  • the non-capturing path gets faster, but the capturing path is impacted...

 

thread removal

Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   7818.190 ± 108.756  ops/s  43%/420%
Re2jFindRegex.testExp1      thrpt  200  20212.245 ± 387.433  ops/s  36%/317%
Re2jFindRegex.testExp2      thrpt  200  16208.431 ± 189.066  ops/s  39%/638%
Re2jMatchRegex.testCombine  thrpt  200   4078.958 ±  36.783  ops/s  -5%
Re2jMatchRegex.testExp1     thrpt  200  10905.446 ± 143.010  ops/s  -8%
Re2jMatchRegex.testExp2     thrpt  200   7200.117 ±  62.800  ops/s -14%

thread removal

more asm

  1.47%    1.51%  │     │    0x00007f2da121b3e3: mov    0x20(%r9),%ebp     ;*getfield denseThreadsInstructions
                  │     │                                                  ; - com.google.re2j.Machine::step@78 (line 294)
  0.39%    0.32%  │     │    0x00007f2da121b3e7: mov    0xc(%r12,%rbp,8),%r10d  ; implicit exception: dispatches to 0x00007f2da121bd85
  0.90%    0.70%  │     │    0x00007f2da121b3ec: cmp    %r10d,%r8d
                  │     │    0x00007f2da121b3ef: jae    0x00007f2da121b6d3
  0.91%    0.92%  │     │    0x00007f2da121b3f5: lea    (%r12,%rbp,8),%r10
  1.29%    1.40%  │     │    0x00007f2da121b3f9: mov    0x10(%r10,%r8,4),%ebp  ;*aaload
                  │     │                                                  ; - com.google.re2j.Machine::step@83 (line 294)
  0.34%    0.30%  │     │    0x00007f2da121b3fe: mov    0xc(%r12,%rbp,8),%r11d  ; implicit exception: dispatches to 0x00007f2da121bd99
  • array boundary check on each thread
for (int j = 0; j < runq.size; ++j) {

The JIT cannot prove that runq.size will not change; it actually does.

          if (!longest) {
            // First-match mode: cut off all lower-priority threads.
            freeQueue(runq, j + 1); // calls queue.clear(), which sets the size to 0
            // which will trigger an exit from the loop
          }
          matched = true;
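One possible way to give the JIT a stable loop bound, shown as a sketch; the slides do not show the actual change, so this is an assumption. The size is copied into a local and the early exit becomes an explicit `break` instead of relying on `clear()` resetting the field. `Queue` here is a stripped-down stand-in.

```java
class LoopBound {
    static final class Queue {
        final int[] pcs;
        int size;

        Queue(int... pcs) {
            this.pcs = pcs;
            this.size = pcs.length;
        }

        void clear() {
            size = 0;
        }
    }

    // Bound re-read from the field: the loop ends because clear() sets size to 0.
    static int sumFieldBound(Queue q, int stopAt) {
        int sum = 0;
        for (int j = 0; j < q.size; ++j) {
            int pc = q.pcs[j];
            if (pc == stopAt) {
                q.clear();
            } else {
                sum += pc;
            }
        }
        return sum;
    }

    // Bound held in a local: the exit is explicit and the bound cannot change.
    static int sumLocalBound(Queue q, int stopAt) {
        int sum = 0;
        final int size = q.size;
        for (int j = 0; j < size; ++j) {
            int pc = q.pcs[j];
            if (pc == stopAt) {
                q.clear();
                break;
            }
            sum += pc;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumFieldBound(new Queue(1, 2, 9, 4), 9));
        System.out.println(sumLocalBound(new Queue(1, 2, 9, 4), 9));
    }
}
```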

runq.size

runq.size

Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   8061.101 ± 111.423  ops/s  3%/436%
Re2jFindRegex.testExp1      thrpt  200  20755.356 ± 356.537  ops/s  3%/329%
Re2jFindRegex.testExp2      thrpt  200  17957.057 ± 107.874  ops/s 11%/718%

runq.size

  0.79%    1.09%  │  0x00007f0c9921b4fa: mov    0x20(%rax),%ebp    ;*getfield denseThreadsInstructions
                  │                                                ; - com.google.re2j.Machine::step@82 (line 295)
  0.22%    0.25%  │  0x00007f0c9921b4fd: mov    0xc(%r12,%rbp,8),%r8d  ; implicit exception: dispatches to 0x00007f0c9921c661
  1.10%    1.20%  │  0x00007f0c9921b502: cmp    %r8d,%r10d
                  │  0x00007f0c9921b505: jae    0x00007f0c9921baa9
  1.80%    1.41%  │  0x00007f0c9921b50b: lea    (%r12,%rbp,8),%r8
  0.61%    0.60%  │  0x00007f0c9921b50f: mov    0x10(%r8,%r10,4),%ecx  ;*aaload
                  │                                                ; - com.google.re2j.Machine::step@87 (line 295)

Did not eliminate the boundary check in exp1 though

but did in exp2, will come back to that one later

 

  0.26%    0.16%  │          0x00007f5f41204509: mov    0x20(%rdx),%r11d   ;*getfield denseThreadsInstructions
                  │                                                        ; - com.google.re2j.Machine::step@82 (line 295)
  0.15%    0.10%  │          0x00007f5f4120450d: mov    0xc(%r12,%r11,8),%r10d  ;*aaload
                  │                                                        ; - com.google.re2j.Machine::step@87 (line 295)
                  │                                                        ; implicit exception: dispatches to 0x00007f5f4120494d

invokevirtual matchRune

  1.35%    1.70%  │ │   │    0x00007f5f41204557: mov    0x8(%r12,%r10,8),%ecx
  1.03%    1.25%  │ │   │    0x00007f5f4120455c: cmp    $0xf8019993,%ecx   ;   {metadata('com/google/re2j/Inst$RuneInst')}
                  │ │   │    0x00007f5f41204562: jne    0x00007f5f412047cd
  0.86%    0.89%  │ │   │    0x00007f5f41204568: shl    $0x3,%r10          ;*invokevirtual matchRune
                  │ │   │                                                  ; - com.google.re2j.Machine::step@181 (line 312)

the instance type check is costly, even though matchRune is only ever called on a RuneInst.

-> move matchRune method to Inst as final

-> no need for virtual dispatch
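A sketch of that move, with simplified stand-in classes. Because `matchRune` is declared `final` on the base type, there is exactly one implementation, so the call can be bound statically and needs no receiver type guard. The `runes` field is an illustrative simplification.

```java
class FinalMatchRune {
    static class Inst {
        final int[] runes; // empty for instructions that never match a rune

        Inst(int... runes) {
            this.runes = runes;
        }

        // final: a single implementation for every subclass, no virtual dispatch.
        final boolean matchRune(int r) {
            for (int candidate : runes) {
                if (candidate == r) {
                    return true;
                }
            }
            return false;
        }
    }

    static final class RuneInst extends Inst {
        RuneInst(int... runes) {
            super(runes);
        }
    }

    static final class MatchInst extends Inst {
        MatchInst() {
            super();
        }
    }

    public static void main(String[] args) {
        Inst inst = new RuneInst('a', 'b');
        System.out.println(inst.matchRune('b') + " " + new MatchInst().matchRune('b'));
    }
}
```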

 

matchRune

matchRune

Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   9082.066 ±  73.694  ops/s  13%/504%
Re2jFindRegex.testExp1      thrpt  200  21453.043 ± 332.811  ops/s   3%/343%
Re2jFindRegex.testExp2      thrpt  200  20037.627 ± 253.445  ops/s  12%/813%

captures

  0.53%    0.48%      0x00007f6ad9217394: mov    0x8(%rsp),%r8
  1.20%    0.96%      0x00007f6ad9217399: movzbl 0x11(%r8),%r8d     ;*getfield captures
                                                                    ; - com.google.re2j.Machine::step@26 (line 285)
  2.20%    2.40%      0x00007f6ad921739e: test   %r8d,%r8d

captures

Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   9307.324 ±  63.348  ops/s  2%/519%
Re2jFindRegex.testExp1      thrpt  200  22421.399 ± 230.262  ops/s  5%/363%
Re2jFindRegex.testExp2      thrpt  200  20411.229 ± 228.717  ops/s  2%/830%

matched/anchored

 

  0.04%    0.02%      │││    ││││↘│ ││││ ││││   0x00007f2e7923aab0: mov    %r10d,0x5c(%rsp)   ;*aload_0
                      │││    ││││ │ ││││ ││││                                                 ; - com.google.re2j.Machine::match@267 (line 237)
  0.09%    0.18%      │││    ││││ ↘ ││││ ││││   0x00007f2e7923aab5: test   %eax,%eax
                      │││    ││││   ││││ ││││   0x00007f2e7923aab7: jne    0x00007f2e7923b27d  ;*ifne
                      │││    ││││   ││││ ││││                                                 ; - com.google.re2j.Machine::match@271 (line 237)
  0.50%    0.40%      │││    ││││   ││││ ││││   0x00007f2e7923aabd: mov    0x64(%rsp),%r11d
  0.12%    0.12%      │││    ││││   ││││ ││││   0x00007f2e7923aac2: test   %r11d,%r11d
                      │││    ││││  ╭││││ ││││   0x00007f2e7923aac5: je     0x00007f2e7923ac65  ;*ifeq
                      │││    ││││  │││││ ││││                                                 ; - com.google.re2j.Machine::match@275 (line 237)
Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   9307.737 ± 144.002  ops/s 0%
Re2jFindRegex.testExp1      thrpt  200  22643.527 ± 276.346  ops/s 1%
Re2jFindRegex.testExp2      thrpt  200  20796.029 ± 175.146  ops/s 2%


a waste of time, as can be seen from the perfasm output

originally included after perf indicated a higher cost.

boundary elim

boundary elim

Benchmark                    Mode  Cnt      Score     Error  Units
Re2jFindRegex.testCombine   thrpt  200   9195.517 ± 118.936  ops/s  -1%
Re2jFindRegex.testExp1      thrpt  200  22150.490 ±  84.177  ops/s  -2%
Re2jFindRegex.testExp2      thrpt  200  20539.044 ± 121.331  ops/s  -1%


no impact ;(

boundary elim

  0.34%    0.31%  ││  0x00007f4a39235d73: mov    %r11,%rax          ;*iload
                  ││                                                ; - com.google.re2j.Machine::step@37 (line 287)
  0.61%    0.49%  ││  0x00007f4a39235d76: mov    0x10(%rbx,%r10,4),%r8d  ;*aaload
                  ││                                                ; - com.google.re2j.Machine::step@95 (line 297)
  1.79%    1.69%  ││  0x00007f4a39235d7b: mov    0xc(%r12,%r8,8),%r11d  ;*getfield op
                  ││                                                ; - com.google.re2j.Machine::step@100 (line 299)
                  ││                                                ; implicit exception: dispatches to 0x00007f4a3923701d
  • No more boundary check in exp2!
  • no perf change in exp2 ...

Interlude

testCombine 511%
testExp1 357%
testExp2 835%

The great stagnation!

Another 3 tries

  • local startInst 1%/0%/3%
  • passthrough Inst 0%/4%/-1%, the only increase is in exp1 because Capture is now simpler
  • replace contains/add with a single containsOrAdd 6%/1%/2%

Summary

  • between 4.8 and 9.8 times faster at the end!
    • 5 times fewer instances needed!
  • small changes can have a big impact!
  • always measure - theorise - implement - validate - analyse
  • the JIT is very good, it just needs a bit of help sometimes
  • beware, things change all the time.
    • Math.min/max used to be slow, now uses CMOV
    • but CMOV is slower than a branch if the branch is highly predictable

What else?

  • replace regex with contains/startsWith/equals
    • run through the Inst
    • Identify if regex can be transformed to a contains call
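A rough sketch of the idea. The slide proposes walking the compiled Inst program; this example takes a cruder route and inspects the pattern text instead, treating it as a literal only when it contains no regex metacharacter. `LiteralShortcut`, `isPlainLiteral` and `finder` are hypothetical names, and flags such as case-insensitivity are ignored.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

class LiteralShortcut {
    // Characters that give a pattern regex meaning; anything else is literal text.
    private static final String META = "\\[](){}.*+?^$|";

    static boolean isPlainLiteral(String regex) {
        for (int i = 0; i < regex.length(); i++) {
            if (META.indexOf(regex.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    // Returns a "find" predicate: String.contains when possible, a regex otherwise.
    static Predicate<String> finder(String regex) {
        if (isPlainLiteral(regex)) {
            return input -> input.contains(regex);
        }
        Pattern pattern = Pattern.compile(regex);
        return input -> pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(finder("powershell").test("powershell.exe -Command"));
        System.out.println(finder("cmd|bargoo").test("bargoo.exe"));
    }
}
```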

Other related perf resources

 

  • https://shipilev.net/jvm-anatomy-park/
  • https://arnaudroger.github.io/blog/2017/02/28/java-performance-puzzle-part2.html
  • http://psy-lob-saw.blogspot.co.uk/
  • https://mechanical-sympathy.blogspot.co.uk/

When micro optimisation matters

By Arnaud Roger

how to get re2j to be x times faster