When micro-optimisation matters
Me @arnaudroger
- Wrote my first line of Java in 1997
- Worked in 3 countries, Paris-London-Belfast
- Media, Finance, Telco, Charity, Security
- 3 Children ... 7-4-2
- Identical Twin brother, a Java dev too
- SimpleFlatMapper
- Blog https://arnaudroger.github.io/blog/
- Working at Rapid7 on the Attacker Behavior Analytics
- up to 110 000 events per second
- and replays at 6M
You
- 10 tps or less?
- 10 tps to 1000 tps?
- 1000 tps to 100 000 tps?
- 1 million tps?
Usual arguments against
- hardware is cheap
- a few milliseconds do not matter
- readability cost
- premature optimisation is EVIL!
Many small streams make one big river
1 second has only 1 000 000 microseconds
It all adds up
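A quick sketch of that arithmetic, using the 110 000 events per second quoted earlier. The class and method names are made up for illustration, and the model assumes a single core handling every event one after the other, which is a simplification.

```java
class LatencyBudget {
    static final long MICROS_PER_SECOND = 1_000_000L;

    // Time available per event, in microseconds, if one core handles every event sequentially.
    static double budgetMicros(long eventsPerSecond) {
        return (double) MICROS_PER_SECOND / eventsPerSecond;
    }

    // Fraction of that budget consumed by an extra cost of costMicros per event.
    static double shareOfBudget(double costMicros, long eventsPerSecond) {
        return costMicros / budgetMicros(eventsPerSecond);
    }

    public static void main(String[] args) {
        System.out.printf("budget at 110 000 events/s: %.2f us%n", budgetMicros(110_000));
        System.out.printf("1 us of overhead uses %.0f%% of it%n",
                100 * shareOfBudget(1.0, 110_000));
    }
}
```

At that rate the budget is roughly 9 microseconds per event, so a single extra microsecond is already a double-digit share of it.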
Cost model
today's CPUs are not your grandparents' CPUs
Cost model
the 1900s cost model
- instruction based
- Big-O complexity is a good approximation
Cost model
today's cost model
- cache misses
- core occupation
- instruction dependency
- predictable patterns
- Big-O complexity is misleading
- think LinkedList vs ArrayList
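A minimal sketch of the LinkedList vs ArrayList point. Both traversals below are O(n) and return the same sum; what differs is the memory layout each one walks. The class name and sizes are illustrative, and no timing is claimed here: wrap the `sum` calls in a JMH benchmark to measure on your own hardware.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ListTraversal {
    // Same O(n) loop for both lists; only the memory layout differs.
    static long sum(List<Integer> values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    static List<Integer> fill(List<Integer> target, int n) {
        for (int i = 0; i < n; i++) {
            target.add(i);
        }
        return target;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // ArrayList: the references sit in one contiguous array, friendly to the prefetcher.
        List<Integer> array = fill(new ArrayList<>(n), n);
        // LinkedList: every element lives in its own node, reached by chasing a pointer.
        List<Integer> linked = fill(new LinkedList<>(), n);
        System.out.println(sum(array) == sum(linked));
    }
}
```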
Java Cost model
What does that mean for Java?
- pointer chasing is expensive, bye bye LinkedList
- allocation is cheap, but the next load is not: cache miss
- the CPU does not execute bytecode, be aware of the JIT
Java Cost model
What kind of optimisations does the JIT provide?
- inlining - the mother of all optimisations -
- escape analysis
- loop unrolling
- boundary check elimination
- virtual call optimisation
- dead code elimination....
Cost model is not enough
- measure
- measure
- measure!
Not all profilers are the same
- safepoint bias
- interference with the JIT
- JIT code to Java code interpolation
- skidding
My performance toolkit
- JMC/Flight Recorder
- perf-map-agent, linux only
- jmh
- jit-watch
Also worth considering Solaris Studio.
Workflow
1. identify hot path - measure in prod if possible
   - perf-map-agent flamegraph
   - honest profiler
2. produce jmh benchmark
3. improve perf
4. validate with benchmark
5. go back to 1
Real-life example: re2j
Why
- Java regexes are faster most of the time
- but terribly slower at the 99.9th percentile
- we process 100 000s of regexes per second.
"If such choices are deeply nested, this strategy requires an exponential number of passes over the input data before it can detect whether the input matches. If the input is large, it is easy to construct a pattern whose running time would exceed the lifetime of the universe."
Regex Dos
https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS
// run the following on Java 8 (fixed in 9)
// import java.util.regex.Pattern;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 80; i++) sb.append("a");
sb.append("!");
long start = System.currentTimeMillis();
boolean b = Pattern.compile("^(a+)+$").matcher(sb.toString()).matches();
long t = System.currentTimeMillis() - start; // elapsed milliseconds
System.out.println("t = " + t + " " + b);
All the following modifications are available at
https://github.com/arnaudroger/re2j/tree/jug
2 old PRs in re2j, not totally in line with the changes in this presentation
https://github.com/google/re2j/pull/35/ - Merged!
https://github.com/google/re2j/pull/36/
All the benchmarks run 20 iterations and 20 forks on a box with a fixed CPU speed, but can still be noisy
Flame graph
cd /tmp
sudo apt-get --yes install git linux-tools-generic linux-tools-x.y.z-w-generic cmake && \
wget https://github.com/jvm-profiling-tools/perf-map-agent/archive/master.zip && \
unzip master.zip && \
cd perf-map-agent-master && \
cmake . && \
make && \
git clone https://github.com/brendangregg/FlameGraph
export FLAMEGRAPH_DIR=/tmp/perf-map-agent-master/FlameGraph
export PERF_RECORD_SECONDS=60
bin/perf-java-flames <pid>
-XX:+PreserveFramePointer
you'll need to add that to your start script,
then run the commands above on the instance
Flame graph
Re2j
String EXP1 = "\\\\.*(documents|\\$documents\\.user)\\\\";
String EXP2 = "abcdef.exe|foooooo.exe|bargoo.exe|ratatouille.exe|orleans.exe";
String[] DATA = {
"bargoo.exe",
"\\SystemRoot\\System32\\bargoo.exe",
"somefile.exe",
"C:\\WINDOWS\\system32\\somefile.exe",
"cmd.exe",
"\"C:\\WINDOWS\\system32\\cmd.exe\" ",
"powershell.exe",
"powershell.exe -Command function Main {\n $lorem = \\\"Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur\\\"\n }\n \nMain\n"
};
Re2j
@Benchmark
public void testExp1(Blackhole blackhole) {
for(String str : data) {
blackhole.consume(exp1.matcher(str).find());
}
}
@Benchmark
public void testExp2(Blackhole blackhole) {
for (String str : data) {
blackhole.consume(exp2.matcher(str).find());
}
}
@Benchmark
public void testCombine(Blackhole blackhole) {
for (String str : data) {
blackhole.consume(exp1.matcher(str).find());
blackhole.consume(exp2.matcher(str).find());
}
}
Re2j
Benchmark - jmh
Java 8 vs Re2j 1.1
Benchmark Mode Cnt Score Error Units
JavaRegex.testCombine thrpt 200 16553.594 ± 397.406 ops/s
Re2jFindRegex.testCombine thrpt 200 1504.195 ± 9.869 ops/s 11 x 😞
JavaRegex.testExp1 thrpt 200 64284.475 ± 308.279 ops/s
Re2jFindRegex.testExp1 thrpt 200 4842.201 ± 56.966 ops/s 13 x 😞
JavaRegex.testExp2 thrpt 200 28048.060 ± 110.140 ops/s
Re2jFindRegex.testExp2 thrpt 200 2195.518 ± 34.807 ops/s 13 x 😞
Flight Recorder
java -jar target/benchmarks.jar -f 1 -wi 1 -i 1000 Re2jRegex.testExp2 \
-jvmArgs "-XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints"
- part of Java Mission Control, ported from JRockit
- free for use in dev - needs a license for prod -
- is open sourced - Java 11 -
- no safepoint bias sampling
- accurate memory profiling that does not interfere with EA
Hot Methods
- simple fold - potential big impact
- but what is that?
- Enum.ordinal() ? - surely should not matter?
- use int
- Inst.op()
- ??
- Lots of ArrayList access - could we gain here
- use array directly
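A minimal sketch of the two cheap changes listed above: int opcodes instead of an enum, and a plain array instead of an ArrayList. The opcode names and the `Inst` shape are illustrative assumptions, not the actual re2j classes.

```java
class IntOpcodes {
    // Plain int constants instead of an enum: no ordinal() call,
    // and the switch works directly on the field value.
    static final int OP_RUNE = 1;
    static final int OP_MATCH = 2;
    static final int OP_FAIL = 3;

    static final class Inst {
        final int op;   // read as a field, not through an accessor
        final int rune;

        Inst(int op, int rune) {
            this.op = op;
            this.rune = rune;
        }
    }

    // Program stored as Inst[] rather than ArrayList<Inst>:
    // no get() call and no cast on each access.
    static int countRuneInsts(Inst[] prog) {
        int count = 0;
        for (int pc = 0; pc < prog.length; pc++) {
            switch (prog[pc].op) {
                case OP_RUNE:
                    count++;
                    break;
                default:
                    break;
            }
        }
        return count;
    }
}
```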
Enum.ordinal()
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 1645.880 ± 17.718 ops/s 9%
Re2jFindRegex.testExp1 thrpt 200 5713.571 ± 61.726 ops/s 18%
Re2jFindRegex.testExp2 thrpt 200 2495.788 ± 24.896 ops/s 14%
9% for a 60-second change, not bad!
the 18% looks inflated compared to Re2jMatchRegex
Inst.op()
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 1691.707 ± 16.574 ops/s 3%
Re2jFindRegex.testExp1 thrpt 200 5899.987 ± 58.776 ops/s 3%
Re2jFindRegex.testExp2 thrpt 200 2544.503 ± 28.540 ops/s 2%
and another 3%.
3% could easily just be noise, but here it is consistent across benchmarks
ArrayList
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 1885.322 ± 26.392 ops/s 11%
Re2jFindRegex.testExp1 thrpt 200 6465.808 ± 142.104 ops/s 10%
Re2jFindRegex.testExp2 thrpt 200 2791.424 ± 28.951 ops/s 10%
Interlude
cumulative gain so far: testCombine 25% | testExp1 34% | testExp2 27%
Unicode.simpleFold()
// https://github.com/google/re2j/blob/master/java/com/google/re2j/Inst.java#L64
if ((arg & RE2.FOLD_CASE) != 0) {
for (int r1 = Unicode.simpleFold(r0);
r1 != r0; // loop until folded on the original code point A -> a -> A over!
r1 = Unicode.simpleFold(r1)) {
if (r == r1) {
return true;
}
}
}
"simpleFold iterates over Unicode code points equivalent under the Unicode-defined simple case folding"
- A -> a
- K 75,k 107,K 8490
- Θ 920,θ 952,ϑ 977,ϴ 1012
Unicode.simpleFold()
// https://github.com/google/re2j/blob/master/java/com/google/re2j/Unicode.java#L203
static int simpleFold(int r) {
// Consult caseOrbit table for special cases.
int lo = 0;
int hi = UnicodeTables.CASE_ORBIT.length;
while (lo < hi) {
int m = lo + (hi - lo) / 2;
if (UnicodeTables.CASE_ORBIT[m][0] < r) {
lo = m + 1;
} else {
hi = m;
}
}
if (lo < UnicodeTables.CASE_ORBIT.length &&
UnicodeTables.CASE_ORBIT[lo][0] == r) {
return UnicodeTables.CASE_ORBIT[lo][1];
}
// No folding specified. This is a one- or two-element
// equivalence class containing rune and toLower(rune)
// and toUpper(rune) if they are different from rune.
int l = toLower(r);
if (l != r) {
return l;
}
return toUpper(r);
}
Unicode.simpleFold()
- very inefficient
- called a minimum of twice on every code point
- binary search every time!
- but we can precompute the list of code points
- there is a max of 4
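A minimal sketch of that precomputation: walk the fold orbit once when the instruction is built, then match against a tiny array. The `simpleFold` here is a simplified stand-in (a three-entry orbit table covering only K, plus `Character.toLowerCase`/`toUpperCase`), and the class and method names are illustrative, not the re2j ones.

```java
import java.util.Arrays;

class FoldOrbit {
    // Tiny stand-in for re2j's CASE_ORBIT table: only the K orbit (K, k, Kelvin sign).
    private static final int[][] CASE_ORBIT = {
        {0x4B, 0x6B}, {0x6B, 0x212A}, {0x212A, 0x4B}
    };

    static int simpleFold(int r) {
        for (int[] entry : CASE_ORBIT) {
            if (entry[0] == r) {
                return entry[1];
            }
        }
        int lower = Character.toLowerCase(r);
        if (lower != r) {
            return lower;
        }
        return Character.toUpperCase(r);
    }

    // Done once per rune at pattern compile time, not once per input character.
    static int[] foldOrbit(int r0) {
        int[] buf = new int[4];           // the slides state an orbit has at most 4 members
        int n = 0;
        buf[n++] = r0;
        for (int r1 = simpleFold(r0); r1 != r0 && n < buf.length; r1 = simpleFold(r1)) {
            buf[n++] = r1;
        }
        return Arrays.copyOf(buf, n);
    }

    // Matching is now a scan over at most 4 ints, with no binary search.
    static boolean matches(int[] orbit, int r) {
        for (int candidate : orbit) {
            if (candidate == r) {
                return true;
            }
        }
        return false;
    }
}
```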
Unicode.simpleFold()
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 2973.766 ± 32.765 ops/s 58% total 98% 👍
Re2jFindRegex.testExp1 thrpt 200 7970.696 ± 99.064 ops/s 23% total 65% 👍
Re2jFindRegex.testExp2 thrpt 200 5315.117 ± 71.776 ops/s 90% total 142% 👍
Flight recorder
- Prog.getInst()
- Queue.* / Machine.free()
PS: used to be less than 2%
Prog.getInst
- use Inst directly
- pre-link Inst to each other
- store pc in Inst
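A minimal sketch of the pre-linking idea, assuming a simplified `Inst` with a single `out` edge. Field and method names are illustrative, not the actual re2j code.

```java
class PreLinkedProg {
    static final class Inst {
        final int pc;    // own position, stored so nobody has to look it up
        final int out;   // index of the next instruction, as emitted by the compiler
        Inst outInst;    // direct reference, filled in by link()

        Inst(int pc, int out) {
            this.pc = pc;
            this.out = out;
        }
    }

    // One pass after compilation turns every index into a reference.
    static void link(Inst[] prog) {
        for (Inst inst : prog) {
            inst.outInst = prog[inst.out];
        }
    }

    // Following the chain no longer goes through the program array at all.
    static int pcAfter(Inst start, int steps) {
        Inst current = start;
        for (int i = 0; i < steps; i++) {
            current = current.outInst;
        }
        return current.pc;
    }
}
```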
Prog.getInst
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 3301.226 ± 22.490 ops/s 11%/119%
Re2jFindRegex.testExp1 thrpt 200 8701.832 ± 52.752 ops/s 9%/ 80%
Re2jFindRegex.testExp2 thrpt 200 5497.766 ± 75.145 ops/s 3%/150%
Queue/Machine.free
Used to store the currently active matching threads
- allow iteration in order of active threads
- store if inst has already been processed for iteration
- use a dense/sparse array with Entry struct
- http://research.swtch.com/2008/03/using-uninitialized-memory-for-fun-and.html
- Uninitialized????
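For reference, a minimal sketch of the dense/sparse set described in the linked article, reduced to ints. The class is illustrative and is not the re2j Queue. One thing worth noting: Java always zero-fills new arrays, so the "uninitialized memory" saving the trick was designed for does not exist on the JVM, although the O(1) clear still does.

```java
class SparseSet {
    private final int[] dense;   // members, in insertion order
    private final int[] sparse;  // sparse[v] = position of v in dense, if v is a member
    private int size;

    SparseSet(int capacity) {
        dense = new int[capacity];
        sparse = new int[capacity];
    }

    boolean contains(int v) {
        int i = sparse[v];
        return i < size && dense[i] == v;
    }

    void add(int v) {
        if (contains(v)) {
            return;
        }
        dense[size] = v;
        sparse[v] = size++;
    }

    // O(1): stale entries simply fail the check in contains().
    void clear() {
        size = 0;
    }

    int size() {
        return size;
    }

    int get(int i) {
        return dense[i];
    }
}
```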
Queue/Machine.free
We can
- split the contains from dense storage
- use a bit mask for contains when < 64 instructions, and a boolean[] otherwise
- optimise Queue free by using arraycopy
- no more null
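A minimal sketch of splitting membership from dense storage, assuming the program counter is a small non-negative int. The class name and layout are hypothetical: a `long` mask serves programs of up to 64 instructions and a `boolean[]` is the fallback for larger ones.

```java
class InstSet {
    private final boolean[] large;  // membership, used only above 64 instructions
    private long mask;              // membership, one bit per pc, when everything fits in a long
    private final int[] densePcs;   // iteration order, kept apart from membership
    private int size;

    InstSet(int instCount) {
        large = instCount > 64 ? new boolean[instCount] : null;
        densePcs = new int[instCount];
    }

    boolean contains(int pc) {
        return large == null ? (mask & (1L << pc)) != 0 : large[pc];
    }

    // The caller is expected to check contains(pc) first, as in the slides.
    void add(int pc) {
        if (large == null) {
            mask |= 1L << pc;
        } else {
            large[pc] = true;
        }
        densePcs[size++] = pc;
    }

    void clear() {
        if (large == null) {
            mask = 0;
        } else {
            for (int i = 0; i < size; i++) {
                large[densePcs[i]] = false;   // reset only the entries that were set
            }
        }
        size = 0;
    }

    int size() {
        return size;
    }
}
```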
Queue/Machine.free
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 3755.242 ± 39.526 ops/s 14%/150%
Re2jFindRegex.testExp1 thrpt 200 9335.040 ± 177.798 ops/s 7%/ 93%
Re2jFindRegex.testExp2 thrpt 200 6313.462 ± 88.504 ops/s 15%/188%
Interlude
cumulative gain so far: testCombine 150% | testExp1 93% | testExp2 188%
So where are we
Flight Recorder got us this far:
from 12x to 4-5x slower,
a perf improvement of between 93% and 188%
now time to....
Look at the x86 assembly
JMH has multiple profilers integrated,
including perfasm.
- uses perf
- maps code addresses to Java source
As good as it can get, but it needs to run on Linux...
or Windows
Look at the x86 assembly
- needs the hsdis library in your VM
see https://wiki.openjdk.java.net/display/HotSpot/PrintAssembly
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid # needed once per boot
java -jar target/benchmarks.jar -f 1 Re2jRegex.testExp2 -prof perfasm
perfasm
27.59% 29.33% C2, level 4 com.google.re2j.Machine::add, version 487 (373 bytes)
22.30% 20.97% C2, level 4 com.google.re2j.Machine::add, version 487 (231 bytes)
19.77% 18.44% C2, level 4 com.google.re2j.Machine::step, version 490 (301 bytes)
11.40% 13.05% C2, level 4 com.google.re2j.Machine::match, version 538 (764 bytes)
8.37% 7.57% C2, level 4 com.google.re2j.Machine::step, version 490 (348 bytes)
5.79% 6.73% runtime stub StubRoutines::jint_disjoint_arraycopy (128 bytes)
perfasm
0.11% 0.09% │ 0x00007fd8dd2104ba: mov 0x38(%rsp),%r10
0.55% 0.53% │ 0x00007fd8dd2104bf: mov 0xc(%r10),%r10d ;*getfield op
│ ; - com.google.re2j.Machine::add@23 (line 343)
0.80% 0.54% │ 0x00007fd8dd2104c3: or %r11,%r8
0.61% 0.53% │ 0x00007fd8dd2104c6: mov %r8,0x10(%rdx) ;*putfield pcsl
│ ; - com.google.re2j.Machine$Queue::add@15 (line 57)
│ ; - com.google.re2j.Machine::add@19 (line 342)
0.13% 0.17% │ 0x00007fd8dd2104ca: mov %r10d,%r11d
0.47% 0.55% │ 0x00007fd8dd2104cd: dec %r11d
0.76% 0.56% │ 0x00007fd8dd2104d0: cmp $0xc,%r11d
│ 0x00007fd8dd2104d4: jae 0x00007fd8dd21070e ;*tableswitch
│ ; - com.google.re2j.Machine::add@26 (line 343)
0.55% 0.53% │ 0x00007fd8dd2104da: mov 0x38(%rsp),%r11
0.15% 0.24% │ 0x00007fd8dd2104df: mov 0x14(%r11),%r8d ;*getfield arg
│ ; - com.google.re2j.Machine::add@141 (line 357)
0.56% 0.54% │ 0x00007fd8dd2104e3: mov 0x30(%r11),%r11d
0.77% 0.76% │ 0x00007fd8dd2104e7: movslq %r10d,%r9
0.64% 0.71% │ 0x00007fd8dd2104ea: mov %r11,%rcx
0.10% 0.18% │ 0x00007fd8dd2104ed: shl $0x3,%rcx ;*getfield outInst
│ ; - com.google.re2j.Machine::add@176 (line 363)
0.55% 0.60% │ 0x00007fd8dd2104f1: movabs $0x7fd8dd2103e0,%r10 ; {section_word}
0.73% 0.67% │ 0x00007fd8dd2104fb: jmpq *-0x8(%r10,%r9,8) ;*tableswitch
│ ; - com.google.re2j.Machine::add@26 (line 343)
0.00% ↘ 0x00007fd8dd210500: mov 0x70(%rsp),%rax
0.01% 0x00007fd8dd210505: jmpq 0x00007fd8dd2106e5
0x00007fd8dd21050a: andn %r8d,%edi,%r10d
0x00007fd8dd21050f: test %r10d,%r10d
perfasm
Machine.add and Machine.step are very costly.
The methods are big, and inlining is very limited.
switch(inst.op) is very much like a virtual call dispatch
-> use polymorphism instead
-> better profiling information, JIT might do better than the switch
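A minimal sketch of replacing the opcode switch with polymorphism. The instruction types and the `step` signature are simplified assumptions and do not reproduce the real re2j hierarchy.

```java
class PolymorphicInst {
    abstract static class Inst {
        // Returns true when the thread survives this rune.
        abstract boolean step(int rune);
    }

    static final class RuneInst extends Inst {
        private final int expected;

        RuneInst(int expected) {
            this.expected = expected;
        }

        @Override
        boolean step(int rune) {
            return rune == expected;
        }
    }

    static final class AnyInst extends Inst {
        @Override
        boolean step(int rune) {
            return true;
        }
    }

    // Before: switch (inst.op) { case RUNE: ...; case ANY: ...; } inside one large method.
    // After: a virtual call whose receiver types the JIT can profile per call site.
    static int countSurvivors(Inst[] threads, int rune) {
        int survivors = 0;
        for (Inst inst : threads) {
            if (inst.step(rune)) {
                survivors++;
            }
        }
        return survivors;
    }
}
```

Each `step` body is small, which also gives the inliner something it can work with, unlike one big switch method.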
Inst polymorphism
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 4272.693 ± 71.341 ops/s 14%/184% 👍
Re2jFindRegex.testExp1 thrpt 200 11552.934 ± 199.890 ops/s 24%/139% 👍
Re2jFindRegex.testExp2 thrpt 200 7973.784 ± 113.688 ops/s 26%/263% 👍
perfasm 2
29.17% 26.09% C2, level 4 com.google.re2j.Machine::step, version 504 (1200 bytes)
25.99% 27.85% C2, level 4 com.google.re2j.Machine::step, version 504 (605 bytes)
22.36% 24.75% C2, level 4 com.google.re2j.Machine::match, version 557 (1126 bytes)
7.44% 8.03% runtime stub StubRoutines::jint_disjoint_arraycopy (128 bytes)
7.39% 7.60% C2, level 4 com.google.re2j.Machine::step, version 504 (365 bytes)
- array copy - only used in 2 places
perfasm 2
- optimized path for no capture?
- expose a find method that does not capture
Pattern.find
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 5139.382 ± 37.840 ops/s 20%/242%
Re2jFindRegex.testExp1 thrpt 200 14420.410 ± 193.187 ops/s 25%/198%
Re2jFindRegex.testExp2 thrpt 200 9594.136 ± 173.792 ops/s 20%/337%
perfasm 3
62.67% 60.66% C2, level 4 com.google.re2j.Machine::step, version 500 (1434 bytes)
22.73% 25.48% C2, level 4 com.google.re2j.Machine::match, version 550 (979 bytes)
4.77% 4.96% C2, level 4 com.google.re2j.Machine::step, version 500 (381 bytes)
3.68% 4.65% C2, level 4 com.google.re2j.Machine::step, version 500 (111 bytes)
1.31% 0.24% C2, level 4 com.google.re2j.Machine::init, version 541 (312 bytes)
0.56% 0.60% C2, level 4 com.google.re2j.Machine::match, version 550 (267 bytes)
0.55% 0.54% [kernel.kallsyms] [unknown] (5 bytes)
0.19% 0.05% C2, level 4 com.google.re2j.Machine::init, version 541 (61 bytes)
No more
0.88% 0.58% │ │ 0x00007f242c3896fc: lea (%r12,%r8,8),%r11
0.55% 0.44% │ │ 0x00007f242c389700: mov 0x10(%r11,%r10,4),%r14d ;*aaload
│ │ ; - com.google.re2j.Machine::step@27 (line 278)
0.31% 0.41% │ │ 0x00007f242c389705: mov 0x10(%r12,%r14,8),%ebp ;*getfield inst
│ │ ; - com.google.re2j.Machine::step@78 (line 283)
│ │ ; implicit exception: dispatches to 0x00007f242c38aded
3.40% 3.11% │ │ 0x00007f242c38970a: mov 0x8(%r12,%rbp,8),%r8d ; implicit exception: dispatches to 0x00007f242c38adfd
7.00% 6.89% │ │ 0x00007f242c38970f: cmp $0xf8019992,%r8d ; {metadata('com/google/re2j/Inst$RuneInst')}
│ │ 0x00007f242c389716: jne 0x00007f242c389f11
1.79% 1.36% │ │ 0x00007f242c38971c: lea (%r12,%rbp,8),%r11 ;*invokevirtual isMatch
│ │ ; - com.google.re2j.Machine::step@85 (line 285)
- virtual call to isMatch
isMatch
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 5024.247 ± 76.465 ops/s -2%/234%
Re2jFindRegex.testExp1 thrpt 200 14606.561 ± 111.745 ops/s 1%/202%
Re2jFindRegex.testExp2 thrpt 200 10152.329 ± 85.381 ops/s 6%/362%
isMatch
0.14% 0.12% ││ 0x00007fcab81811ee: mov 0x10(%r11,%r10,4),%r14d ;*aaload
││ ; - com.google.re2j.Machine::step@27 (line 278)
0.87% 0.66% ││ 0x00007fcab81811f3: mov 0x10(%r12,%r14,8),%r11d ;*getfield inst
││ ; - com.google.re2j.Machine::step@78 (line 283)
││ ; implicit exception: dispatches to 0x00007fcab8182849
3.71% 3.24% ││ 0x00007fcab81811f8: mov 0xc(%r12,%r11,8),%ebp ;*getfield op
││ ; - com.google.re2j.Machine::step@85 (line 285)
││ ; implicit exception: dispatches to 0x00007fcab8182859
6.46% 6.71% ││ 0x00007fcab81811fd: cmp $0x6,%ebp
1.53% 1.75% ││ 0x00007fcab8181200: je 0x00007fcab8181a6d ;*if_icmpne
││ ; - com.google.re2j.Machine::step@90 (line 285)
1.86% 1.80% ││ 0x00007fcab8181206: mov 0x8(%r12,%r11,8),%r9d
││ 0x00007fcab818120b: cmp $0xf8019992,%r9d ; {metadata('com/google/re2j/Inst$RuneInst')}
││ 0x00007fcab8181212: jne 0x00007fcab81819c5 ;*invokevirtual matchRune
││ ; - com.google.re2j.Machine::step@189 (line 299)
0.00% 0.01% ││ 0x00007fcab8181218: mov 0x20(%rsp),%r8
still high cost around inst access, cache miss?
What else?
- flatten AltInst
- Instead of a tree use an array/specialised type
0.65% 0.55% │ 0x00007f36b9225d3f: mov %rbx,%rsi
0.05% 0.06% │ 0x00007f36b9225d42: and %r9,%rsi ;*land
│ ; - com.google.re2j.Machine$Queue::contains@13 (line 47)
│ ; - com.google.re2j.Inst$AltInst::add@5 (line 187)
│ ; - com.google.re2j.Inst$AltInst::add@-1 (line 187)
0.07% 0.11% │ 0x00007f36b9225d45: test %rsi,%rsi
│ 0x00007f36b9225d48: jne 0x00007f36b9226481 ;*ifeq
│ ; - com.google.re2j.Machine$Queue::contains@16 (line 47)
│ ; - com.google.re2j.Inst$AltInst::add@5 (line 187)
│ ; - com.google.re2j.Inst$AltInst::add@-1 (line 187)
0.30% 0.49% │ 0x00007f36b9225d4e: cmp $0x40,%ecx
│ 0x00007f36b9225d51: jge 0x00007f36b92264cd ;*if_icmpge
│ ; - com.google.re2j.Machine$Queue::add@3 (line 56)
│ ; - com.google.re2j.Inst$AltInst::add@19 (line 190)
│ ; - com.google.re2j.Inst$AltInst::add@-1 (line 187)
0.25% 0.21% │ 0x00007f36b9225d57: mov 0x1c(%r10),%ebp ;*getfield outInst
│ ; - com.google.re2j.Inst$AltInst::add@23 (line 192)
│ ; - com.google.re2j.Inst$AltInst::add@-1 (line 187)
0.05% 0.05% │ 0x00007f36b9225d5b: or %r9,%rbx ;*lor ; - com.google.re2j.Machine$Queue::add@14 (line 57)
AltInst
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 5467.610 ± 60.482 ops/s 9%/263%
Re2jFindRegex.testExp1 thrpt 200 14808.181 ± 134.538 ops/s 1%/206%
Re2jFindRegex.testExp2 thrpt 200 11636.934 ± 91.307 ops/s 15%/430%
AltInst
0.38% 0.33% │ 0x00007f220122449e: mov %ebx,0xac(%rsp)
0.00% │ 0x00007f22012244a5: vmovd %eax,%xmm3
│ 0x00007f22012244a9: mov %rcx,%r14
0.13% 0.12% │ 0x00007f22012244ac: mov 0xc(%rcx),%r10d ;*getfield size
│ ; - com.google.re2j.Machine$Queue::addThread@6 (line 65)
│ ; - com.google.re2j.Inst$MatchInst::add@74 (line 106)
│ ; - com.google.re2j.Inst$Alt2Inst::add@35 (line 193)
│ ; - com.google.re2j.Machine::step@-1 (line 276)
0.29% 0.27% │ 0x00007f22012244b0: mov %r10d,0x28(%rsp)
│ 0x00007f22012244b5: mov 0x20(%rcx),%r10d ;*getfield denseThreads
│ ; - com.google.re2j.Machine$Queue::addThread@1 (line 65)
│ ; - com.google.re2j.Inst$MatchInst::add@74 (line 106)
│ ; - com.google.re2j.Inst$Alt2Inst::add@35 (line 193)
│ ; - com.google.re2j.Machine::step@-1 (line 276)
│ 0x00007f22012244b9: vmovd %r10d,%xmm2
0.16% 0.11% │ 0x00007f22012244be: mov 0x28(%rsp),%r10d
0.29% 0.27% │ 0x00007f22012244c3: inc %r10d ;*iadd
│ ; - com.google.re2j.Machine$Queue::addThread@11 (line 65)
│ ; - com.google.re2j.Inst$MatchInst::add@74 (line 106)
│ ; - com.google.re2j.Inst$Alt2Inst::add@35 (line 193)
│ ; - com.google.re2j.Machine::step@-1 (line 276)
│ 0x00007f22012244c6: vmovd %r10d,%xmm4
│ 0x00007f22012244cb: mov %r10d,0xc(%rcx) ;*putfield size
│ ; - com.google.re2j.Machine$Queue::addThread@12 (line 65)
│ ; - com.google.re2j.Inst$MatchInst::add@74 (line 106)
│ ; - com.google.re2j.Inst$Alt2Inst::add@35 (line 193)
│ ; - com.google.re2j.Machine::step@-1 (line 276)
thread removal
Thread pooling takes a lot of time in different places
- pool only necessary for capture - avoids int[] alloc
- non-capture is faster, but it impacts the capturing path...
thread removal
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 7818.190 ± 108.756 ops/s 43%/420%
Re2jFindRegex.testExp1 thrpt 200 20212.245 ± 387.433 ops/s 36%/317%
Re2jFindRegex.testExp2 thrpt 200 16208.431 ± 189.066 ops/s 39%/638%
Re2jMatchRegex.testCombine thrpt 200 4078.958 ± 36.783 ops/s -5%
Re2jMatchRegex.testExp1 thrpt 200 10905.446 ± 143.010 ops/s -8%
Re2jMatchRegex.testExp2 thrpt 200 7200.117 ± 62.800 ops/s -14%
thread removal
more asm
1.47% 1.51% │ │ 0x00007f2da121b3e3: mov 0x20(%r9),%ebp ;*getfield denseThreadsInstructions
│ │ ; - com.google.re2j.Machine::step@78 (line 294)
0.39% 0.32% │ │ 0x00007f2da121b3e7: mov 0xc(%r12,%rbp,8),%r10d ; implicit exception: dispatches to 0x00007f2da121bd85
0.90% 0.70% │ │ 0x00007f2da121b3ec: cmp %r10d,%r8d
│ │ 0x00007f2da121b3ef: jae 0x00007f2da121b6d3
0.91% 0.92% │ │ 0x00007f2da121b3f5: lea (%r12,%rbp,8),%r10
1.29% 1.40% │ │ 0x00007f2da121b3f9: mov 0x10(%r10,%r8,4),%ebp ;*aaload
│ │ ; - com.google.re2j.Machine::step@83 (line 294)
0.34% 0.30% │ │ 0x00007f2da121b3fe: mov 0xc(%r12,%rbp,8),%r11d ; implicit exception: dispatches to 0x00007f2da121bd99
- array boundary check on each thread
for (int j = 0; j < runq.size; ++j) {
Cannot prove runq.size will not change, it actually does.
if (!longest) {
// First-match mode: cut off all lower-priority threads.
freeQueue(runq, j + 1); // calls queue.clear(), which sets the size to 0
// which will trigger an exit from the loop
}
matched = true;
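One way the loop above could be restructured, as a sketch under assumptions: the bound is read once into a local and the loop is left explicitly instead of relying on `size` dropping to 0. The `Queue` here is a stripped-down stand-in, not the re2j class, and whether the JIT then removes the bounds check depends on the surrounding code.

```java
class RunQueueLoop {
    static final class Queue {
        int size;
        final int[] pcs;

        Queue(int... pcs) {
            this.pcs = pcs;
            this.size = pcs.length;
        }

        void clear() {
            size = 0;
        }
    }

    // Returns the index at which matchPc was found, or -1.
    static int step(Queue runq, int matchPc) {
        final int[] pcs = runq.pcs;
        final int size = runq.size;   // read once: the bound can no longer change inside the loop
        for (int j = 0; j < size; ++j) {
            if (pcs[j] == matchPc) {
                runq.clear();         // still resets the queue for the next step...
                return j;             // ...but the loop is left explicitly, not via size == 0
            }
        }
        return -1;
    }
}
```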
runq.size
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 8061.101 ± 111.423 ops/s 3%/436%
Re2jFindRegex.testExp1 thrpt 200 20755.356 ± 356.537 ops/s 3%/329%
Re2jFindRegex.testExp2 thrpt 200 17957.057 ± 107.874 ops/s 11%/718%
runq.size
0.79% 1.09% │ 0x00007f0c9921b4fa: mov 0x20(%rax),%ebp ;*getfield denseThreadsInstructions
│ ; - com.google.re2j.Machine::step@82 (line 295)
0.22% 0.25% │ 0x00007f0c9921b4fd: mov 0xc(%r12,%rbp,8),%r8d ; implicit exception: dispatches to 0x00007f0c9921c661
1.10% 1.20% │ 0x00007f0c9921b502: cmp %r8d,%r10d
│ 0x00007f0c9921b505: jae 0x00007f0c9921baa9
1.80% 1.41% │ 0x00007f0c9921b50b: lea (%r12,%rbp,8),%r8
0.61% 0.60% │ 0x00007f0c9921b50f: mov 0x10(%r8,%r10,4),%ecx ;*aaload
│ ; - com.google.re2j.Machine::step@87 (line 295)
Did not eliminate the boundary check in exp1 though
but did in exp2, will come back to that one later
0.26% 0.16% │ 0x00007f5f41204509: mov 0x20(%rdx),%r11d ;*getfield denseThreadsInstructions
│ ; - com.google.re2j.Machine::step@82 (line 295)
0.15% 0.10% │ 0x00007f5f4120450d: mov 0xc(%r12,%r11,8),%r10d ;*aaload
│ ; - com.google.re2j.Machine::step@87 (line 295)
│ ; implicit exception: dispatches to 0x00007f5f4120494d
invokevirtual matchRune
1.35% 1.70% │ │ │ 0x00007f5f41204557: mov 0x8(%r12,%r10,8),%ecx
1.03% 1.25% │ │ │ 0x00007f5f4120455c: cmp $0xf8019993,%ecx ; {metadata('com/google/re2j/Inst$RuneInst')}
│ │ │ 0x00007f5f41204562: jne 0x00007f5f412047cd
0.86% 0.89% │ │ │ 0x00007f5f41204568: shl $0x3,%r10 ;*invokevirtual matchRune
│ │ │ ; - com.google.re2j.Machine::step@181 (line 312)
the instance type check is costly, even though matchRune is only ever called on RuneInst.
-> move matchRune method to Inst as final
-> no need for virtual dispatch
matchRune
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 9082.066 ± 73.694 ops/s 13%/504%
Re2jFindRegex.testExp1 thrpt 200 21453.043 ± 332.811 ops/s 3%/343%
Re2jFindRegex.testExp2 thrpt 200 20037.627 ± 253.445 ops/s 12%/813%
captures
0.53% 0.48% 0x00007f6ad9217394: mov 0x8(%rsp),%r8
1.20% 0.96% 0x00007f6ad9217399: movzbl 0x11(%r8),%r8d ;*getfield captures
; - com.google.re2j.Machine::step@26 (line 285)
2.20% 2.40% 0x00007f6ad921739e: test %r8d,%r8d
captures
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 9307.324 ± 63.348 ops/s 2%/519%
Re2jFindRegex.testExp1 thrpt 200 22421.399 ± 230.262 ops/s 5%/363%
Re2jFindRegex.testExp2 thrpt 200 20411.229 ± 228.717 ops/s 2%/830%
matched/anchored
0.04% 0.02% │││ ││││↘│ ││││ ││││ 0x00007f2e7923aab0: mov %r10d,0x5c(%rsp) ;*aload_0
│││ ││││ │ ││││ ││││ ; - com.google.re2j.Machine::match@267 (line 237)
0.09% 0.18% │││ ││││ ↘ ││││ ││││ 0x00007f2e7923aab5: test %eax,%eax
│││ ││││ ││││ ││││ 0x00007f2e7923aab7: jne 0x00007f2e7923b27d ;*ifne
│││ ││││ ││││ ││││ ; - com.google.re2j.Machine::match@271 (line 237)
0.50% 0.40% │││ ││││ ││││ ││││ 0x00007f2e7923aabd: mov 0x64(%rsp),%r11d
0.12% 0.12% │││ ││││ ││││ ││││ 0x00007f2e7923aac2: test %r11d,%r11d
│││ ││││ ╭││││ ││││ 0x00007f2e7923aac5: je 0x00007f2e7923ac65 ;*ifeq
│││ ││││ │││││ ││││ ; - com.google.re2j.Machine::match@275 (line 237)
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 9307.737 ± 144.002 ops/s 0%
Re2jFindRegex.testExp1 thrpt 200 22643.527 ± 276.346 ops/s 1%
Re2jFindRegex.testExp2 thrpt 200 20796.029 ± 175.146 ops/s 2%
a waste of time, as you can see from the perfasm output
originally included after perf indicated a higher cost.
boundary elim
Benchmark Mode Cnt Score Error Units
Re2jFindRegex.testCombine thrpt 200 9195.517 ± 118.936 ops/s -1%
Re2jFindRegex.testExp1 thrpt 200 22150.490 ± 84.177 ops/s -2%
Re2jFindRegex.testExp2 thrpt 200 20539.044 ± 121.331 ops/s -1%
no impact ;(
boundary elim
0.34% 0.31% ││ 0x00007f4a39235d73: mov %r11,%rax ;*iload
││ ; - com.google.re2j.Machine::step@37 (line 287)
0.61% 0.49% ││ 0x00007f4a39235d76: mov 0x10(%rbx,%r10,4),%r8d ;*aaload
││ ; - com.google.re2j.Machine::step@95 (line 297)
1.79% 1.69% ││ 0x00007f4a39235d7b: mov 0xc(%r12,%r8,8),%r11d ;*getfield op
││ ; - com.google.re2j.Machine::step@100 (line 299)
││ ; implicit exception: dispatches to 0x00007f4a3923701d
- No more boundary check in exp2!
- no perf change in exp2 ...
Interlude
cumulative gain so far: testCombine 511% | testExp1 357% | testExp2 835%
The great stagnation!
Another 3 tries
- local startInst 1%/0%/3%
- passthrough Inst 0%/4%/-1%, the only increase is in exp1 because Capture is now simpler
- replace contains/add with a single containsOrAdd 6%/1%/2%
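A minimal sketch of the last item, assuming membership is tracked in a single `long` mask with fewer than 64 instructions. The class name is illustrative.

```java
class PcMask {
    private long mask;

    // Single call replacing "if (!q.contains(pc)) q.add(pc)":
    // returns true if pc was already present, otherwise records it and returns false.
    boolean containsOrAdd(int pc) {
        long bit = 1L << pc;   // assumes 0 <= pc < 64
        if ((mask & bit) != 0) {
            return true;
        }
        mask |= bit;
        return false;
    }
}
```

The bit is computed once and the field is touched by a single method, rather than once in `contains` and again in `add`.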
Summary
- between 4.8 and 9.8 times faster at the end!
- 5 times fewer instances needed!
- small change can have big impact!
- always measure - theorise - implement - validate - analyse
- the JIT is very good, just need a bit of help sometimes
- beware things change, all the time.
- Math.min/max used to be slow, now use CMOV
- but CMOV is slower than a branch if the branch is highly predictable
What else?
- replace regex with contains/startsWith/equals
- run through the Inst
- Identify if regex can be transformed to a contains call
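A rough sketch of that idea. The slides suggest inspecting the compiled Inst list; this version takes a shortcut and inspects the pattern string instead, which is an assumption made here for brevity. It uses java.util.regex as the fallback and ignores flags such as case-insensitivity.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

class LiteralShortcut {
    // Characters that give a pattern regex meaning; anything else is a plain literal.
    private static final String META = "\\^$.|?*+()[]{}";

    static boolean isPlainLiteral(String regex) {
        for (int i = 0; i < regex.length(); i++) {
            if (META.indexOf(regex.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    // find() semantics: a literal pattern matches anywhere in the input, i.e. String.contains.
    static Predicate<String> finder(String regex) {
        if (isPlainLiteral(regex)) {
            return input -> input.contains(regex);
        }
        Pattern pattern = Pattern.compile(regex);
        return input -> pattern.matcher(input).find();
    }
}
```

A pattern such as `cmd` takes the `contains` path, while `bargoo.exe` still goes through the regex engine because of the unescaped dot.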
Other related perf resource
- https://shipilev.net/jvm-anatomy-park/
- https://arnaudroger.github.io/blog/2017/02/28/java-performance-puzzle-part2.html
- http://psy-lob-saw.blogspot.co.uk/
- https://mechanical-sympathy.blogspot.co.uk/
When micro optimisation matters
By Arnaud Roger
how to get re2j to be x times faster