Chris Birchall
Scala Days Berlin 2016
WARNING: Bytecode ahead!
Remove a function call
by copying the function body into the caller
def target(a: Int, b: Int) = {
(a + b) * 2
}
def caller = {
val x = 1
val y = 2
target(x, y)
}
def caller = {
val x = 1
val y = 2
(x + y) * 2
}
inlining
// def target
0: iload_1
1: iload_2
2: iadd
3: iconst_2
4: imul
5: ireturn
// def caller
0: aload_0
1: iconst_1
2: iconst_2
3: invokevirtual #24
6: ireturn
0: iconst_1
1: iconst_2
2: iadd
3: iconst_2
4: imul
5: ireturn
If the resolved method is not signature polymorphic (§2.9), then the invokevirtual instruction proceeds as follows.
Let C be the class of objectref. The actual method to be invoked is selected by the following lookup procedure:
If C contains a declaration for an instance method m that overrides (§5.4.5) the resolved method, then m is the method to be invoked, and the lookup procedure terminates.
Otherwise, if C has a superclass, this same lookup procedure is performed recursively using the direct superclass of C; the method to be invoked is the result of the recursive invocation of this lookup procedure.
Otherwise, an AbstractMethodError is raised.
The objectref must be followed on the operand stack by nargs argument values, where the number, type, and order of the values must be consistent with the descriptor of the selected instance method.
If the method is synchronized, the monitor associated with objectref is entered or reentered as if by execution of a monitorenter instruction (§monitorenter) in the current thread.
If the method is not native, the nargs argument values and objectref are popped from the operand stack. A new frame is created on the Java Virtual Machine stack for the method being invoked. The objectref and the argument values are consecutively made the values of local variables of the new frame, with objectref in local variable 0, arg1 in local variable 1 (or, if arg1 is of type long or double, in local variables 1 and 2), and so on. Any argument value that is of a floating-point type undergoes value set conversion (§2.8.3) prior to being stored in a local variable. The new frame is then made current, and the Java Virtual Machine pc is set to the opcode of the first instruction of the method to be invoked. Execution continues with the first instruction of the method.
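The lookup procedure quoted above can be sketched as a small recursive function. This is only an illustrative model, not how any JVM implements it: `Klass`, `VirtualLookup` and `select` are hypothetical names, and "overrides the resolved method" is simplified to "declares a method with the same name".

```scala
import scala.annotation.tailrec

// Hypothetical model of a loaded class: its name, its direct superclass (if any),
// and the names of the instance methods it declares.
final case class Klass(name: String, superclass: Option[Klass], declared: Set[String])

object VirtualLookup {
  // Start at the class of objectref and walk up the superclass chain
  // until some class declares the method.
  @tailrec
  def select(c: Klass, method: String): Either[String, String] =
    if (c.declared.contains(method)) Right(c.name + "." + method)
    else c.superclass match {
      case Some(parent) => select(parent, method)   // recurse on the direct superclass
      case None         => Left("AbstractMethodError") // ran out of superclasses
    }
}
```

Even in this toy form, every call pays for a walk up the hierarchy before any frame is created, which is the work that inlining removes.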
class A(x: Int) {
def plusOne() = x + 1
}
def two: Int = {
val a = new A(1)
a.plusOne()
}
def two: Int = {
val a = new A(1)
a.x + 1
}
def two: Int = {
1 + 1
}
escape
analysis
inlining
So...
Small
-XX:InlineSmallCode (default 1000 bytes of assembly)
-XX:MaxInlineSize (default 35 bytes of bytecode)
-XX:MaxTrivialSize (default 6 bytes of bytecode)
Hot
-XX:MinInliningThreshold (default 250 invocations?)
Caller not already too big (default 325 bytes of bytecode)
Not a native method
...
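One way to watch these heuristics at work is to run a tiny hot loop with HotSpot's diagnostic flags `-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining`. The sketch below is an assumption-laden example rather than a benchmark: `InlineDemo` and `hotLoop` are hypothetical names, and whether `target` is reported as inlined depends on the JVM version and settings.

```scala
object InlineDemo {
  // Only a few bytes of bytecode, so it should count as "small".
  def target(a: Int, b: Int): Int = (a + b) * 2

  // Calls target often enough for the call site to become "hot".
  def hotLoop(iterations: Int): Long = {
    var sum = 0L
    var i = 0
    while (i < iterations) {
      sum += target(i, 1)
      i += 1
    }
    sum
  }

  def main(args: Array[String]): Unit =
    println(hotLoop(1000000))
}
```

Running the compiled class with the flags above should print the JIT's inlining decisions alongside the program output.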
import scala.annotation._
object Test {
@inline
def inlineMe(a: Int, b: Int) = (a + b) * 2
@noinline
def dontInlineMe(a: Int, b: Int) = (a + b) * 2
def foo = inlineMe(1, 2)
def bar = dontInlineMe(1, 2)
}
$ scalac -optimise -Yinline-warnings Test.scala
only inline @inline-marked methods,
and always inline them,
including under separate-compilation
def foo() = {
for (i <- 1 until 10) {
println(i)
}
}
def foo() = {
val range = new Range(1, 10, 1)
val f = new $anonfun$foo$1() // println(i)
range.foreach(f)
}
desugar
def foo() = {
val range = new Range(1, 10, 1)
val f = new $anonfun$foo$1()
if (!range.isEmpty) {
var i = range.start
while (true) {
f.apply(i)
if (i == range.lastElement) return
i += range.step
}
}
}
inline
Range.foreach
def foo() = {
val range = new Range(1, 10, 1)
val f = new $anonfun$foo$1()
if (!range.isEmpty) {
var i = range.start
while (true) {
println(i)
if (i == range.lastElement) return
i += range.step
}
}
}
inline
closure
def foo() = {
val range = new Range(1, 10, 1)
// f is no longer referenced, so its allocation has been eliminated
if (!range.isEmpty) {
var i = range.start
while (true) {
println(i)
if (i == range.lastElement) return
i += range.step
}
}
}
eliminate
dead code
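The three forms in that pipeline (the `for` loop, the desugared `foreach` call, and the hand-written `while` loop the optimiser is aiming for) can be compared side by side. This is a minimal sketch with hypothetical method names. It sums the range instead of printing it so the results are easy to check.

```scala
object LoopForms {
  // What the programmer writes.
  def sumWithFor(n: Int): Int = {
    var total = 0
    for (i <- 1 until n) total += i
    total
  }

  // What the compiler desugars it to: a closure passed to Range.foreach.
  def sumWithForeach(n: Int): Int = {
    var total = 0
    (1 until n).foreach(i => total += i)
    total
  }

  // Roughly what is left after inlining foreach and the closure body.
  def sumWithWhile(n: Int): Int = {
    var total = 0
    var i = 1
    while (i < n) {
      total += i
      i += 1
    }
    total
  }
}
```

All three compute the same value. They differ only in how many calls and allocations stand between the loop and its body.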
Like most benchmarks,
this one is probably wrong
Cooley-Tukey algorithm
final case class Complex(r: Double, i: Double) {
@inline def +(x: Complex) = Complex(r + x.r, i + x.i)
@inline def -(x: Complex) = Complex(r - x.r, i - x.i)
@inline def *(x: Complex) = Complex(r * x.r - i * x.i, ...)
}
| | HotSpot inlining disabled | HotSpot inlining enabled |
|---|---|---|
| @inline | 1208 ± 11 | 360 ± 10 |
| @noinline | 1226 ± 14 | 355 ± 4 |
Scala 2.11.8, GenASM
| | HotSpot inlining disabled | HotSpot inlining enabled |
|---|---|---|
| @inline | 1237 ± 12 | 330 ± 4 |
| @noinline | 1243 ± 13 | 329 ± 4 |
Scala 2.12.0-M4
JMH settings: 20 warmup, 20 iterations, 10 forks
Oracle slides about HotSpot optimisations, including inlining
Blogposts about Java performance tuning
Explanation of initial prototype of the new optimizer in Scala 2.12
public class Generic {
<A> void foo(A a) {
return;
}
void test() {
foo("hello");
foo(123);
}
}
<A> void foo(A);
descriptor: (Ljava/lang/Object;)V
Code:
0: return
void test();
descriptor: ()V
Code:
0: aload_0
1: ldc #2 // String hello
3: invokevirtual #3 // Method foo:(Ljava/lang/Object;)V
6: aload_0
7: bipush 123
9: invokestatic #4 // Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
12: invokevirtual #3 // Method foo:(Ljava/lang/Object;)V
15: return
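The same boxing is visible from Scala without reading bytecode. In this small sketch (`BoxingDemo` and `runtimeClass` are hypothetical names), a generic method is erased to take `Object`, so an `Int` argument arrives as a `java.lang.Integer`.

```scala
object BoxingDemo {
  // After erasure this takes a java.lang.Object, so primitives must be boxed
  // before the call.
  def runtimeClass[A](a: A): Class[_] = a.asInstanceOf[AnyRef].getClass
}
```

For example, `BoxingDemo.runtimeClass(123)` reports `java.lang.Integer`, while `BoxingDemo.runtimeClass("hello")` reports `java.lang.String` with no boxing needed.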
Any
AnyVal
Int
Double
...
AnyRef
j.l.Object
JVM primitives
import scala.collection.mutable
object StdlibMapExample {
def foo(): Unit = {
val map = mutable.Map.empty[String, Int]
map.put("key", 123)
}
}
public void foo();
Code:
0: getstatic #18 // Field scala/collection/mutable/Map$.MODULE$:Lscala/collection/mutable/Map$;
3: invokevirtual #22 // Method scala/collection/mutable/Map$.empty:()Lscala/collection/mutable/Map;
6: astore_1
7: aload_1
8: ldc #24 // String key
10: bipush 123
12: invokestatic #30 // Method scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
15: invokeinterface #36, 3 // InterfaceMethod scala/collection/mutable/Map.put:(Ljava/lang/Object;Ljava/lang/Object;)Lscala/Option;
20: pop
21: return
to remove boxing overhead
MySpecialMap$mcB$sp.class // byte
MySpecialMap$mcC$sp.class // char
MySpecialMap$mcD$sp.class // double
MySpecialMap$mcF$sp.class // float
MySpecialMap$mcI$sp.class // int
MySpecialMap$mcJ$sp.class // long
MySpecialMap$mcS$sp.class // short
MySpecialMap$mcV$sp.class // unit (void)
MySpecialMap$mcZ$sp.class // boolean
MySpecialMap.class // AnyRef
class MySpecialMap[@specialized A] {
def put(key: String, value: A): Unit = ...
def get(key: String): Option[A] = ...
}
class MySpecialMap[@specialized A] {
def put(key: String, value: A): Unit = {}
def get(key: String): Option[A] = None
}
object Test {
def foo(): Unit = {
val map1 = new MySpecialMap[Int]
map1.put("key", 123)
}
}
0: new #15 // class MySpecialMap$mcI$sp
3: dup
4: invokespecial #16 // Method MySpecialMap$mcI$sp."<init>":()V
7: astore_1
8: aload_1
9: ldc #18 // String key
11: bipush 123
13: invokevirtual #24 // Method MySpecialMap.put$mcI$sp:(Ljava/lang/String;I)V
16: return
Constant pool:
...
#7 = Utf8 Lscala/reflect/ScalaSignature;
#8 = Utf8 bytes
#9 = Utf8 ??e2A!??? \taQ*_*qK?L?\r\'ba*\t1!A?=K6?H/
??U?a?F\n?? ?\"?C???%Q?AC??g? G.Y???%?a!?8z%?4?\"???\t?y?A?
j]&$h?F??!\r\t?AE???A?1????\t%)??)A?? ?aCA?B#\t9\"???\t1%??$
??? >$?.?8h!\tA1$?? ?\t??I\=)?Qq?C?? ?\t??BA?ta? ?.?7ju?$?\"
???\t???a?9viR?Ae\n???!)?B??\n??)f.?;\t !\n??A???-,????+[9??
bK??Y%\ta??:fI?4?B??0???FO]5oO*?A&???c??\rAE??m?dW/???g?!\t?
N??O?$HCA?9!\rAaGE??o%?aa?9uS>t?\"??3??I?
^^^ somewhere in there! ^^^
class MySpecialMap[@specialized(Int, Long, Double) A] {
...
}
Tuple1: @specialized(Int, Long, Double)
Tuple2: @specialized(Int, Long, Double, Char, Boolean)
Tuple3+: BOXING!
Option: BOXING!
Function{0,1,2}: Various combinations of @specialized
Immutable collections: BOXING!
Mutable collections: BOXING!
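A quick way to see which standard library classes are specialised is to look at the runtime class of a value. The sketch below assumes the Scala 2 standard library. The `$mcII$sp` name is compiler-generated and is shown only for illustration, and `SpecialisedTuples` is a hypothetical name.

```scala
object SpecialisedTuples {
  // Tuple2 is specialised for Int, so the compiler picks the Int/Int variant.
  val pairClass: String = (1, 2).getClass.getName

  // Tuple3 has no specialised variants, so its Int fields are boxed.
  val tripleClass: String = (1, 2, 3).getClass.getName

  def main(args: Array[String]): Unit = {
    println(pairClass)
    println(tripleClass)
  }
}
```

On Scala 2 the first name should carry the `$mcII$sp` suffix while the second is plain `scala.Tuple3`.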
class BloomFilter[@specialized(Int) A](m: Int, k: Int)
(implicit hashFunctions: HashFunctions[A]) {
def add(value: A): Unit = ...
def query(value: A): Boolean = ...
}
trait HashFunctions[@specialized(Int) A] {
def alpha(value: A): Int
def beta(value: A): Int
}
| | Average time taken |
|---|---|
| With specialisation | 0.163 ± 0.001 |
| Without specialisation | 0.165 ± 0.001 |
JMH settings: 20 warmup, 20 iterations, 10 forks
Units = μs/op, smaller is better
class Buffer[@specialized(Int) A](initialCapacity: Int) {
def append(value: A): Unit = ...
def foreach(f: A => Unit): Unit = ...
}
Call foreach on a buffer containing 1 million Ints
| | Average time taken |
|---|---|
| With specialisation | 4539 ± 67 |
| Without specialisation | 9168 ± 86 |
JMH settings: 20 warmup, 20 iterations, 10 forks
Units = μs/op, smaller is better
An annotation for methods whose bodies may be excluded from compiler-generated bytecode
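That description is the Scaladoc for `scala.annotation.elidable`. A minimal usage sketch, assuming Scala 2 and using the hypothetical names `ElidableSketch`, `trace` and `work`:

```scala
import scala.annotation.elidable
import scala.annotation.elidable.FINE

object ElidableSketch {
  // Calls to trace (and its body) may be dropped from the bytecode when the
  // code is compiled with an -Xelide-below level higher than FINE.
  @elidable(FINE)
  def trace(msg: String): Unit = println("TRACE: " + msg)

  def work(n: Int): Int = {
    trace("work called with " + n)
    n * 2
  }
}
```

The result of `work` is the same whether or not `trace` is elided. Only the logging side effect disappears.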