maandag 23 januari 2017

Devirtualization

When a non static method is called on an interface or class, an `invokeinterface` or `invokevirtual` bytecode is emitted. This makes it possible to do polymorphic calls; so at runtime determining which method implementation should be executed.

The following useless method is used in this post.

    public static int sizePlusOne(Collection c) {
        return c.size() + 1;
    }

The sizePlusOne function calls the size method on the passed collection and increments the result by one.

If we view the bytecode using 'javap -c SomeClass' then we get the following output:

  public static int sizePlusOne(java.util.Collection);
    Code:
       0: aload_0
       1: invokeinterface #18,  1           // InterfaceMethod java/util/Collection.size:()I
       6: iconst_1
       7: iadd
       8: ireturn
The problem is that a `invokeinterface` or `invokevirtual` is a dynamic dispatch instead of a static dispatch. A dynamic dispatch involves more complexity because a vtable lookup is required and it prevents inlining since the concrete method implementation isn't known.

What the JIT does is speculate that there will only be 1 or 2 types at most and instead of doing a dynamic dispatch, there are some cheap guards and static dispatching. To test this we'll be using Hotspot 1.8.0_91. Keep in mind that in the future the optimizations could be implemented differently.

For this post we assume that:

  • sizePlusOne method is called equally with the different types. There is not a dominant type.
  • there are many implementations loaded for interface. Having only a single loaded implementation deserves a post on its own.
  • only the C2 code is relevant

Monomorphic callsite

The simplest version of the optimization is a monomorphic call site: so only a single type being used.

To demonstrate the monomorphic callsite, we use the following example:
public class Devirtualization_MonoMorphic {

    public static void main(String[] args) {
        ArrayList arrayList = new ArrayList();

        int result = 0;

        for (int k = 0; k < 100_000; k++) {
            result += sizePlusOne(arrayList);
        }

        System.out.println("result:" + result);
    }

    public static int sizePlusOne(Collection c) {
        return c.size() + 1;
    }
}
So we loop 100k times and execute the 'sizePlusOne' method with an ArrayList.

And we run it with the following settings to get the Assembly:

-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:-BackgroundCompilation
-XX:CompileCommand=print,*Devirtualization_MonoMorphic.sizePlusOne
Inlining is disabled so we can focus on the content of the sizePlusOne method. This will not prevent inlining of simple accessors like ArrayList/LinkedList.size. For more information see '-XX:-InlineAccessors' and credits go to Andrew Haley for pointing this out. Background compilation is disabled so that compilation isn't done asynchronous and tiered compilation disabled since for this blogpost only the C2 emitted Assembly will be discussed.

When the program is run, the following Assembly is printed.

CompilerOracle: print *Devirtualization_MonoMorphic.sizePlusOne
Compiled method (c2)     247    9             com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne (9 bytes)
 total in heap  [0x000000010be7e3d0,0x000000010be7e678] = 680
 relocation     [0x000000010be7e4f0,0x000000010be7e508] = 24
 main code      [0x000000010be7e520,0x000000010be7e580] = 96
 stub code      [0x000000010be7e580,0x000000010be7e598] = 24
 oops           [0x000000010be7e598,0x000000010be7e5a0] = 8
 metadata       [0x000000010be7e5a0,0x000000010be7e5b0] = 16
 scopes data    [0x000000010be7e5b0,0x000000010be7e5e0] = 48
 scopes pcs     [0x000000010be7e5e0,0x000000010be7e660] = 128
 dependencies   [0x000000010be7e660,0x000000010be7e668] = 8
 nul chk table  [0x000000010be7e668,0x000000010be7e678] = 16
Loaded disassembler from /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib
Decoding compiled method 0x000000010be7e3d0:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x000000011417a560} 'sizePlusOne' '(Ljava/util/Collection;)I' in 'com/devirtualization/Devirtualization_MonoMorphic'
  # parm0:    rsi:rsi   = 'java/util/Collection'
  #           [sp+0x20]  (sp of caller)
  0x000000010be7e520: mov    DWORD PTR [rsp-0x14000],eax
  0x000000010be7e527: push   rbp
  0x000000010be7e528: sub    rsp,0x10           ;*synchronization entry
                                                ; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@-1 (line 21)

  0x000000010be7e52c: mov    r11d,DWORD PTR [rsi+0x8]  ; implicit exception: dispatches to 0x000000010be7e55d
  0x000000010be7e530: cmp    r11d,0xf8003231    ;   {metadata('java/util/ArrayList')}
  0x000000010be7e537: jne    0x000000010be7e54a  ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@1 (line 21)

  0x000000010be7e539: mov    eax,DWORD PTR [rsi+0x10]
  0x000000010be7e53c: inc    eax                ;*iadd
                                                ; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@7 (line 21)

  0x000000010be7e53e: add    rsp,0x10
  0x000000010be7e542: pop    rbp
  0x000000010be7e543: test   DWORD PTR [rip+0xffffffffff6fdab7],eax        # 0x000000010b57c000
                                                ;   {poll_return}
  0x000000010be7e549: ret    
  0x000000010be7e54a: mov    rbp,rsi
  0x000000010be7e54d: mov    esi,0xffffffde
  0x000000010be7e552: nop
  0x000000010be7e553: call   0x000000010be47120  ; OopMap{rbp=Oop off=56}
                                                ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@1 (line 21)
                                                ;   {runtime_call}
  0x000000010be7e558: call   0x000000010a9a2e44  ;   {runtime_call}
  0x000000010be7e55d: mov    esi,0xfffffff6
  0x000000010be7e562: nop
  0x000000010be7e563: call   0x000000010be47120  ; OopMap{off=72}
                                                ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@1 (line 21)
                                                ;   {runtime_call}
  0x000000010be7e568: call   0x000000010a9a2e44  ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@1 (line 21)
                                                ;   {runtime_call}
  0x000000010be7e56d: hlt    
  0x000000010be7e56e: hlt    
  0x000000010be7e56f: hlt    
  0x000000010be7e570: hlt    
  0x000000010be7e571: hlt    
  0x000000010be7e572: hlt    
  0x000000010be7e573: hlt    
  0x000000010be7e574: hlt    
  0x000000010be7e575: hlt    
  0x000000010be7e576: hlt    
  0x000000010be7e577: hlt    
  0x000000010be7e578: hlt    
  0x000000010be7e579: hlt    
  0x000000010be7e57a: hlt    
  0x000000010be7e57b: hlt    
  0x000000010be7e57c: hlt    
  0x000000010be7e57d: hlt    
  0x000000010be7e57e: hlt    
  0x000000010be7e57f: hlt    
[Exception Handler]
[Stub Code]
  0x000000010be7e580: jmp    0x000000010be6bf60  ;   {no_reloc}
[Deopt Handler Code]
  0x000000010be7e585: call   0x000000010be7e58a
  0x000000010be7e58a: sub    QWORD PTR [rsp],0x5
  0x000000010be7e58f: jmp    0x000000010be46d00  ;   {runtime_call}
  0x000000010be7e594: hlt    
  0x000000010be7e595: hlt    
  0x000000010be7e596: hlt    
  0x000000010be7e597: hlt    
OopMapSet contains 2 OopMaps

#0 
OopMap{rbp=Oop off=56}
#1 
OopMap{off=72}
That is a lot output. Lets remove everything not relevant and add some comments:
  0x000000010be7e52c: mov    r11d,DWORD PTR [rsi+0x8] 
    ;; Loads the address of the class of 'c' into register r11
  0x000000010be7e530: cmp    r11d,0xf8003231 
    ;; Compare that address with the address of the ArrayList class
  0x000000010be7e537: jne    0x000000010be7e54a 
    ;; if the classes are different, jump to an uncommon trap 
  0x000000010be7e539: mov    eax,DWORD PTR [rsi+0x10]
    ;; copies the size field of the ArrayList object into eax
  0x000000010be7e53c: inc    eax                ;*iadd   
    ;; Add one to eax. 
If this would be converted to pseudo Java code, we would get something like this:
  if(collection.class != ArrayList.class){
    uncommonTrap()
  }

  int result = c.size;
  result++;

Conclusion

With only a single type at the call site, the Assembly only has a simple type-guard and then a static dispatch. So the dynamic dispatch is replaced by a static dispatch and inlining of the method is possible.

Bimorphic callsite

The previous example has only 1 concrete type at the callsite. But what happens when a second type is added.
public class Devirtualization_BiMorphic {

    public static void main(String[] args) {
        ArrayList arrayList = new ArrayList();
        LinkedList linkedList = new LinkedList();

        int result = 0;

        for (int k = 0; k < 100_000; k++) {
            result += sizePlusOne(arrayList);
        }

        for (int k = 0; k < 100_000; k++) {
            result += sizePlusOne(linkedList);
        }

        System.out.println("result:" + result);
    }

    public static int sizePlusOne(Collection c) {
        return c.size() + 1;
    }
}
And we run with the following settings:
-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:-BackgroundCompilation
-XX:CompileCommand=print,*Devirtualization_BiMorphic.sizePlusOne
First the Assembly of the the monomorphic version is emited, and as soon as the LinkedList is encountered, the uncommon trap is executed and eventually new Assembly emitted:
Compiled method (c2)     291   11             com.devirtualization.Devirtualization_BiMorphic::sizePlusOne (9 bytes)
 total in heap  [0x000000010c339510,0x000000010c3397f8] = 744
 relocation     [0x000000010c339630,0x000000010c339648] = 24
 main code      [0x000000010c339660,0x000000010c3396c0] = 96
 stub code      [0x000000010c3396c0,0x000000010c3396d8] = 24
 oops           [0x000000010c3396d8,0x000000010c3396e0] = 8
 metadata       [0x000000010c3396e0,0x000000010c339700] = 32
 scopes data    [0x000000010c339700,0x000000010c339730] = 48
 scopes pcs     [0x000000010c339730,0x000000010c3397e0] = 176
 dependencies   [0x000000010c3397e0,0x000000010c3397e8] = 8
 nul chk table  [0x000000010c3397e8,0x000000010c3397f8] = 16
Decoding compiled method 0x000000010c339510:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x00000001146395f0} 'sizePlusOne' '(Ljava/util/Collection;)I' in 'com/devirtualization/Devirtualization_BiMorphic'
  # parm0:    rsi:rsi   = 'java/util/Collection'
  #           [sp+0x20]  (sp of caller)
  0x000000010c339660: mov    DWORD PTR [rsp-0x14000],eax
  0x000000010c339667: push   rbp
  0x000000010c339668: sub    rsp,0x10           ;*synchronization entry
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@-1 (line 28)

  0x000000010c33966c: mov    r10d,DWORD PTR [rsi+0x8]  ; implicit exception: dispatches to 0x000000010c3396ad
  0x000000010c339670: cmp    r10d,0xf8003231    ;   {metadata('java/util/ArrayList')}
  0x000000010c339677: je     0x000000010c339687
  0x000000010c339679: cmp    r10d,0xf8008c83    ;   {metadata('java/util/LinkedList')}
  0x000000010c339680: jne    0x000000010c339698  ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)

  0x000000010c339682: mov    eax,DWORD PTR [rsi+0x10]  ;*getfield size
                                                ; - java.util.LinkedList::size@1 (line 326)
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)

  0x000000010c339685: jmp    0x000000010c33968a  ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)

  0x000000010c339687: mov    eax,DWORD PTR [rsi+0x10]  ;*synchronization entry
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@-1 (line 28)

  0x000000010c33968a: inc    eax                ;*iadd
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@7 (line 28)

  0x000000010c33968c: add    rsp,0x10
  0x000000010c339690: pop    rbp
  0x000000010c339691: test   DWORD PTR [rip+0xfffffffffe841969],eax        # 0x000000010ab7b000
                                                ;   {poll_return}
  0x000000010c339697: ret    
  0x000000010c339698: mov    rbp,rsi
  0x000000010c33969b: mov    esi,0xffffffc6
  0x000000010c3396a0: data32 xchg ax,ax
  0x000000010c3396a3: call   0x000000010c306120  ; OopMap{rbp=Oop off=72}
                                                ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)
                                                ;   {runtime_call}
  0x000000010c3396a8: call   0x000000010b840e44  ;   {runtime_call}
  0x000000010c3396ad: mov    esi,0xfffffff6
  0x000000010c3396b2: nop
  0x000000010c3396b3: call   0x000000010c306120  ; OopMap{off=88}
                                                ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)
                                                ;   {runtime_call}
  0x000000010c3396b8: call   0x000000010b840e44  ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)
                                                ;   {runtime_call}
  0x000000010c3396bd: hlt    
  0x000000010c3396be: hlt    
  0x000000010c3396bf: hlt    
[Exception Handler]
[Stub Code]
  0x000000010c3396c0: jmp    0x000000010c32af60  ;   {no_reloc}
[Deopt Handler Code]
  0x000000010c3396c5: call   0x000000010c3396ca
  0x000000010c3396ca: sub    QWORD PTR [rsp],0x5
  0x000000010c3396cf: jmp    0x000000010c305d00  ;   {runtime_call}
  0x000000010c3396d4: hlt    
  0x000000010c3396d5: hlt    
  0x000000010c3396d6: hlt    
  0x000000010c3396d7: hlt    
OopMapSet contains 2 OopMaps

#0 
OopMap{rbp=Oop off=72}
#1 
OopMap{off=88}
Again a lot of output. Lets trim it down a bit:
  0x000000010c33966c: mov    r10d,DWORD PTR [rsi+0x8]
     ;; Loads the address of the class of 'c' into register r10d
  0x000000010c339670: cmp    r10d,0xf8003231   
     ;; Compare that address with the address of the ArrayList class
  0x000000010c339677: je     0x000000010c339687
      ;; if it is an ArrayList, jump to 0x000000010c339687
  0x000000010c339679: cmp    r10d,0xf8008c83    
      ;; it was not an ArrayList, lets compare with address of LinkedList class
  0x000000010c339680: jne    0x000000010c339698  
      ;; if jump to uncommon trap 
  0x000000010c339682: mov    eax,DWORD PTR [rsi+0x10] 
      ;; copies the size field of the LinkedList instance into eax
  0x000000010c339685: jmp    0x000000010c33968a  
  0x000000010c339687: mov    eax,DWORD PTR [rsi+0x10]  ;*synchronization entry
      ;; copes the size field of the ArrayList instance into eax
  0x000000010c33968a: inc    eax                
      ;; Add one to eax.
If we would convert this back pseudo Java we would get something like this:
int result
if(c.class == ArrayList.class){
    result = c.size;
} else if(c.class == LinkedList.class){
    result = c.size;
} else {
    uncommonTrap();
}
result++;

Conclusion

In case of a bimorphic callsite, we don't need to pay the price for a dynamic dispatch and inlining isn't prevented. However we do get 1 type-guard for one of the types and 2 type-guards for the other.

Megamorpic callsite

Lets add a third type of Collection and see what happens:
public class Devirtualization_MegaMorphic {

    public static void main(String[] args) {
        ArrayList arrayList = new ArrayList();
        LinkedList linkedList = new LinkedList();
        HashSet hashSet = new HashSet();

        int result = 0;

        for (int k = 0; k < 100_000; k++) {
            result += sizePlusOne(arrayList);
        }

        for (int k = 0; k < 100_000; k++) {
            result += sizePlusOne(linkedList);
        }

        for (int k = 0; k < 100_000; k++) {
            result += sizePlusOne(hashSet);
        }

        System.out.println("result:" + result);
    }

    public static int sizePlusOne(Collection c) {
        return c.size() + 1;
    }
}
If we run with the following settings
-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:-BackgroundCompilation
-XX:CompileCommand=print,*Devirtualization_MegaMorphic.sizePlusOne
Then the following Assembly is eventually emitted:
Compiled method (c2)     268   14             com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne (9 bytes)
 total in heap  [0x000000010ea7d550,0x000000010ea7d788] = 568
 relocation     [0x000000010ea7d670,0x000000010ea7d680] = 16
 main code      [0x000000010ea7d680,0x000000010ea7d6c0] = 64
 stub code      [0x000000010ea7d6c0,0x000000010ea7d6d8] = 24
 oops           [0x000000010ea7d6d8,0x000000010ea7d6e0] = 8
 metadata       [0x000000010ea7d6e0,0x000000010ea7d6e8] = 8
 scopes data    [0x000000010ea7d6e8,0x000000010ea7d708] = 32
 scopes pcs     [0x000000010ea7d708,0x000000010ea7d768] = 96
 dependencies   [0x000000010ea7d768,0x000000010ea7d770] = 8
 handler table  [0x000000010ea7d770,0x000000010ea7d788] = 24
Decoding compiled method 0x000000010ea7d550:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x0000000116d7a678} 'sizePlusOne' '(Ljava/util/Collection;)I' in 'com/devirtualization/Devirtualization_MegaMorphic'
  # parm0:    rsi:rsi   = 'java/util/Collection'
  #           [sp+0x20]  (sp of caller)
  0x000000010ea7d680: mov    DWORD PTR [rsp-0x14000],eax
  0x000000010ea7d687: push   rbp
  0x000000010ea7d688: sub    rsp,0x10           ;*synchronization entry
                                                ; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@-1 (line 33)

  0x000000010ea7d68c: nop
  0x000000010ea7d68d: movabs rax,0xffffffffffffffff
  0x000000010ea7d697: call   0x000000010ea45f60  ; OopMap{off=28}
                                                ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@1 (line 33)
                                                ;   {virtual_call}
  0x000000010ea7d69c: inc    eax                ;*iadd
                                                ; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@7 (line 33)

  0x000000010ea7d69e: add    rsp,0x10
  0x000000010ea7d6a2: pop    rbp
  0x000000010ea7d6a3: test   DWORD PTR [rip+0xffffffffff742957],eax        # 0x000000010e1c0000
                                                ;   {poll_return}
  0x000000010ea7d6a9: ret                       ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@1 (line 33)

  0x000000010ea7d6aa: mov    rsi,rax
  0x000000010ea7d6ad: add    rsp,0x10
  0x000000010ea7d6b1: pop    rbp
  0x000000010ea7d6b2: jmp    0x000000010ea6ece0  ;   {runtime_call}
  0x000000010ea7d6b7: hlt    
  0x000000010ea7d6b8: hlt    
  0x000000010ea7d6b9: hlt    
  0x000000010ea7d6ba: hlt    
  0x000000010ea7d6bb: hlt    
  0x000000010ea7d6bc: hlt    
  0x000000010ea7d6bd: hlt    
  0x000000010ea7d6be: hlt    
  0x000000010ea7d6bf: hlt    
[Exception Handler]
[Stub Code]
  0x000000010ea7d6c0: jmp    0x000000010ea6bf60  ;   {no_reloc}
[Deopt Handler Code]
  0x000000010ea7d6c5: call   0x000000010ea7d6ca
  0x000000010ea7d6ca: sub    QWORD PTR [rsp],0x5
  0x000000010ea7d6cf: jmp    0x000000010ea46d00  ;   {runtime_call}
  0x000000010ea7d6d4: hlt    
  0x000000010ea7d6d5: hlt    
  0x000000010ea7d6d6: hlt    
  0x000000010ea7d6d7: hlt    
Again a lot of output. If we extract only the relevant part, we get:
  0x000000010ea7d68d: movabs rax,0xffffffffffffffff
  0x000000010ea7d697: call   0x000000010ea45f60  ; OopMap{off=28}
                                                ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@1 (line 33)
                                                ;   {virtual_call}
  0x000000010ea7d69c: inc    eax                ;*iadd
                                                ; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@7 (line 33)
The code above contains the expensive virtual call to Collection.size.

Conclusion

The JIT has given up on devirtualizing the callsite after it has 3 encountered different implementations.

Recommended reads

I would start out with this excellent post of Aleksey Shipilëv. He goes into much more detail.

Very good post from Richard Warburton.

And very informative post from Martin Thompson.

Geen opmerkingen:

Een reactie posten

How does hyperthreading work.

Introduction In the last few months, there were 2 occurrences where people were talking about the implementation of hyperthreading; the In...