Become a sponsor to see this post

X64 JIT written in #kotlin: Making KPspEmu 10x faster on K/N 0.9 with Dynarek2

Sep 12 2018

The bits

Note: This is a proof of concept for MacOS and Windows X64, so do not expect it to run too much stuff or to have further improvements. The demo is known to run minifire and HelloWorldPsp demos, that are included in the 7z file. But it won’t run other programs that for example, use floating point at this point. But you can try the bundled demos and switch interpreter and dynarek back and forth and see the performance difference between the interpreted version and the dynarek one.

Download proof of concept from GitHub for Win64 and MacOS (~5MB)

TL;DR;

The start point

Last week I was able to compile kpspemu with K/N 0.9. It was a huge achievement since previous versions still had bugs that prevented me to compile and run it properly.

The main “problem” was the overhead per function call and field and array access related to reference counting, bound checking and freezing checking (though at this stage shouldn’t be a problem since K/N has not reached 1.0 and has zero optimizations).

And don’t be fooled: in normal programs this is not that big issue. But in the kpspemu interpreter I have one function for decoding instructions and functions per each instruction. And instead of executing several big function per frame, potentially with the required stuff in arguments, I am (trying) to execute billions of small functions per second so the smallest overhead is a big one in this particular case.

This is common code, so inlining each instruction would cause JVM to generate too big methods and would reach the method limit much before including half of the instructions I have to emulate. Also in JS bigger methods are likely to be much less optimized, and to be slower, so it is better to split in smaller functions. Also, right now Kotlin Common doesn’t support expect class + actual inline class which would help here to implement an expected class with an actual inline type using pointers.

Not everything is lost here

Instead of waiting to a future K/N version with optimizations, I wanted to see if I could get it faster by implementing dynarek in K/N and doing some adjustments, which at any rate would be required in later versions to get the best performance possible even with a super optimized K/N. And yeah! It helped: A lot. And would help even more with a better method generation on my side like the ones I implemented for jspspemu. Doing a bit of CFG + trampolines would make it even faster, instead of doing small chunks and returning to a loop.

What I made

I made a X64 codegen from scratch in Kotlin and used it to generate code on the fly (dynamic recompilation/dynarec) by effectively converting MIPS code from the PSP into X64 code. It is not a JIT strictly, since it doesn’t recompile things as required just yet (for example with optimistic optimizations or to patch new functions after cache invalidation). I could have used libjit, llvm or similar, but I wanted to keep dependencies at the bare minimum, and I wanted to learn practice in the process and explore K/N possibilities as much as possible, which is the main purpose of all this in the end.

This is my first time generating native code directly: with my other approaches to dynarec I just generated JS, JMV bytecode or .NET IL code. But this time it was X64 initially and in a later stage ARM64 too.

I has been pretty interesting since I learned to program using QBASIC and x86 assembly and I was able to recall some stuff from the x86 architecture (in addition to MIPS).

Dynarek2

Porting Dynarek to Native was not an easy task. And indeed I decided to start from scratch. Dynarek allowed to generate arbitrary functions with several arities. But I decided to simplify it and just support a single specific function arity, with well known arguments, that would allow fast performance and predictability in native.

I initially decided to set 4 arguments (though 3 would be enough without the temp one):

generatedFunction(regs: D2Memory, mem: D2Memory, temp: D2Memory, external: Any): Int

Description:

regs // contains a chunk of faster memory with the registers
mem // contains a chunk of memory with the physical memory to be emulated
temp // contains temporal memory to be used (I will remove this since I can use stack frames for this)
external // contains an arbitrary object that can be used

And returns an int. If I want generate a complex result, I can set some regs or call a function with the external object to set something more complex. If I want to return a float, I can return the bits of the float instead.

Function references (and staticCFunction limitations + inline reified for detecting signature)

When building the IR (AST-based) for dynarek2, there are times when you need to call other functions.