X64 JIT written in #kotlin: Making KPspEmu 10x faster on K/N 0.9 with Dynarek2

PSPEMU , KOTLIN

September 12, 2018

The bits

Note: This is a proof of concept for MacOS and Windows X64, so do not expect it to run too much stuff or to have further improvements. The demo is known to run minifire and HelloWorldPsp demos, that are included in the 7z file. But it won't run other programs that for example, use floating point at this point. But you can try the bundled demos and switch interpreter and dynarek back and forth and see the performance difference between the interpreted version and the dynarek one.

`Download proof of concept from GitHub for Win64 and MacOS (~5MB)`

TL;DR;

The start point

Last week I was able to compile kpspemu with K/N 0.9. It was a huge achievement since previous versions still had bugs that prevented me to compile and run it properly.

The main "problem" was the overhead per function call and field and array access related to reference counting, bound checking and freezing checking (though at this stage shouldn't be a problem since K/N has not reached 1.0 and has zero optimizations).

And don't be fooled: in normal programs this is not that big issue. But in the kpspemu interpreter I have one function for decoding instructions and functions per each instruction. And instead of executing several big function per frame, potentially with the required stuff in arguments, I am (trying) to execute billions of small functions per second so the smallest overhead is a big one in this particular case.

This is common code, so inlining each instruction would cause JVM to generate too big methods and would reach the method limit much before including half of the instructions I have to emulate. Also in JS bigger methods are likely to be much less optimized, and to be slower, so it is better to split in smaller functions. Also, right now Kotlin Common doesn't support expect class + actual inline class which would help here to implement an expected class with an actual inline type using pointers.

Not everything is lost here

Instead of waiting to a future K/N version with optimizations, I wanted to see if I could get it faster by implementing dynarek in K/N and doing some adjustments, which at any rate would be required in later versions to get the best performance possible even with a super optimized K/N. And yeah! It helped: A lot. And would help even more with a better method generation on my side like the ones I implemented for jspspemu. Doing a bit of CFG + trampolines would make it even faster, instead of doing small chunks and returning to a loop.

What I made

I made a X64 codegen from scratch in Kotlin and used it to generate code on the fly (dynamic recompilation/dynarec) by effectively converting MIPS code from the PSP into X64 code. It is not a JIT strictly, since it doesn't recompile things as required just yet (for example with optimistic optimizations or to patch new functions after cache invalidation). I could have used libjit, llvm or similar, but I wanted to keep dependencies at the bare minimum, and I wanted to learn practice in the process and explore K/N possibilities as much as possible, which is the main purpose of all this in the end.

This is my first time generating native code directly: with my other approaches to dynarec I just generated JS, JMV bytecode or .NET IL code. But this time it was X64 initially and in a later stage ARM64 too.

I has been pretty interesting since I learned to program using QBASIC and x86 assembly and I was able to recall some stuff from the x86 architecture (in addition to MIPS).

Dynarek2

Porting Dynarek to Native was not an easy task. And indeed I decided to start from scratch. Dynarek allowed to generate arbitrary functions with several arities. But I decided to simplify it and just support a single specific function arity, with well known arguments, that would allow fast performance and predictability in native.

I initially decided to set 4 arguments (though 3 would be enough without the temp one):

generatedFunction(regs: D2Memory, mem: D2Memory, temp: D2Memory, external: Any): Int

Description:

regs // contains a chunk of faster memory with the registers
mem // contains a chunk of memory with the physical memory to be emulated
temp // contains temporal memory to be used (I will remove this since I can use stack frames for this)
external // contains an arbitrary object that can be used

And returns an int. If I want generate a complex result, I can set some regs or call a function with the external object to set something more complex. If I want to return a float, I can return the bits of the float instead.

Function references (and `staticCFunction` limitations + inline reified for detecting signature)

When building the IR (AST-based) for dynarek2, there are times when you need to call other functions.

The old dynarek allowed to call plain member methods, and it worked for JS and JVM for any object. But the problem is that K/N doesn't allow you to get the pointer of functions directly. Instead, there is an intrinsic function called staticCFunction to get a function that has to be declared in the global scope.

There are several "problems" with staticCFunction:

One problem is that you cannot use plain objects: Though the function can have pointers, so you have to get a StablePointer first, and register a function that takes a pointer, and then converts that pointer into a proper object.
Other problem is that it has to be top-level. So you are not able to call instance methods directly, and you have to define a global function instead.
And the other problem (as for 0.9) is that you have to call the staticCFunction function directly. You cannot use it in an inline fun function. So you cannot create convenient wrappers around it. And what is most important, you cannot create a expect/actual set to create a similar function in common to register all the methods that you are going to use.

The workaround I had to used is to create a expect fun D2Context.registerPspemuMethods with an actual en JVM and JS that do nothing, and an actual in native that actually registers the available methods when constructing the context. That's rather inconvenient. But is the best thing I came with at this point. I would like to be able to use in inline fun. This way I could totally remove the requirement of registering the methods, and doing it only on native also. The other limitations do not prevent the DRY, so are less inconvenient.

In native I used the reified + inline trick when registering functions to obtain the function information (arity, type of arguments and return type) so I can set the right registers to call the function right (either plain integral registers or XMM floating point registers):

In JS you don't need type information, just the function reference. And you can call it directly since you have the arity.

In JVM you need type information, and the owner of the function. KFunction doesn't expose this information directly, so I had to do some reflective tricks to get that information:

Tools: ./gradlew_wine, Parallels, CLion, NASM, GCC, IDA FREE, Instruments/Counters, X64 instruction tables and Google

In this quest, I have used several tools.

The first one, to test on Windows without having to switch to a machine or use a slower emulator and having to switch mentaly contexts, I have used gradlew_wine (a gradle wrapper I have done that uses wine instead)

In some cases I totally needed to check the actual Windows API on Kotlin/Native. In those cases I used Parallels and CLion autocompletion and went back to IntelliJ IDEA on Mac. In some cases I also needed autocompletion for Native, and used CLion for that purpose.

For knowing how to encode some instructions I used NASM and IDA for disassembling it:

nano demo.asm && nasm demo.asm

When I needed more context to see how C is generating some things or which instructions are using, I used GCC + IDA again:

nano democ.c && gcc democ.c -o democ

I could try to use X64 encoding tables. But I find them confusing. What I usually did is to do things like:

use64

global    _start

section   .text
_start:

push rbp
mov rbp, rsp

mov rax, [rax+0x77777777]
mov rdi, [rax+0x77777777]
mov r8, [rax+0x77777777]
mov r15, [rax+0x77777777]

mov [rax+0x77777777], rax
mov [rax+0x77777777], rdi
mov [rax+0x77777777], r8
mov [rax+0x77777777], r15

pop rbp
retn

I usually use RAX and RDI as sample registers because I can compare those instructions and see which bits are used for which registers. RAX=0b00 and RDI=0b111 (first and last registers in the normal register bank).

In some cases I also used Google, StackOverflow and X64 instruction tables to decide which one to use or see the actual behaviour of that instruction.

But in general the other tools were good enough for my purpose.

D2Memory

The first tool I needed for dynarek2 was a class that I could use for the fastest memory access possible in each platform.

In the case of JS, D2Memory wraps an ArrayBuffer and preallocates several views: Int8Array, Int16Array, Int32Array and Float32Array.

In the case of JVM, D2Memory wraps a direct ByteBuffer and preallocates several additional views: ShortBuffer, IntBuffer, FloatBuffer

In the case of Native, D2Memory wraps a raw CPointer. Would be better if I could actually use actual inline class so the D2Memory would be an actual Pointer, greatly reducing the overhead since because of the current reference counting and property access, there is a pretty big overhead for any non-primitive field access. But right now, it is not possible. So this is the best I can do with common code at this point. At least it totally removes bound and freezing checking for this kind of memory. And allows me to grab the actual pointer inside dynarek2.

Inline classes wrapping D2Memory to the rescue of the registes

To simplify a lot of things and making it faster, instead of storing the registers as plain fields + arrays in the case of indexed registers (like GPR, FPR and VFPR), now

I'm using a CpuRegisters inline class wrapping D2Memory. Then defined extension properties that actual access specific areas and effectively acts as a manually-managed object for primitives.

Then my CpuState exposes the CpuRegisters as a field, and exposes some views using it to temproarly keep the same interface as before, but the CpuRegisters.pointer is passed to the dynarek2 code generator making super fast access to the MIPS registers inside X64.

mmap, munmap, mprotect and executable memory

It happens that a plain malloc is not enough for a dynarec to work because that memory is allocated without execution privileges.

To create memory that allows executing code I had to use:

mmap and munmap on posix, with the right flags
VirtualAlloc and VirtualFree on windows, with the right flags

Initially this was not obvious. So the first thing to generate is a retn instruction and check that you can actually execute the generated code (if it doesn't crash).

X64 ABIs

Microsoft likes to do thing different in a non-standard way. Maybe because they did it first and by then things were not standarized yet, and obviously that's not something you can change later.

But in the end, when working with X64 code, you have to take this into account. When calling a function, there are several things taken for granted. Some registers are used as parameters, while others have to be preserved before returning the function.

Microsoft ABI uses some registers as arguments when calling functions, while the standard one uses others.

Dynarek2 detects the operating system by having an expect/actual val isNativeWindows: Boolean property, that sets to true when compiling on windows and false otherwise. Depending on this, it will keep and use certain X64 registers.

I think it would be nice that Kotlin Common had some global APIs to detect things like platform running the code, the operating system and the architecture. But you can do it manually in your libraries with different source sets and expect/actual.

fastinvoke: Cinterop with the multiplatform plugin

Calling a staticCFunction with the built-in mechanism is slow. Maybe not slow for a normal application, but super slow for an emulator that has to call these things potentially millions of times per second.

Fortunately you can use cinterop and create a plain C function with a fixed signature that acts as a dispatcher for your function itself.

I decided to use longs (int64) for the parameters since the size is big enough for holding pointers and being integers the overhead should be small even with little optimizations done:

Expect D2Runner to reduce calling overhead and caching functions in the dynarek executor loop

Since at this point the generated functions are small (simple Basic Blocks) and do not contain loops or other CFG, the dispatcher must be as fast as possible.

Initially I started with the original loop, that was calling the dynarek function with the expected parameters. But for each call, I had to create a stable pointer, and extract the pointer from D2Memory instances, and that was slow.

But in the end, the only thing that changed in the loop for a step is the function. So I created an expect class D2Runner that allows to set the parameters and the function separately and execute it.

The other thing that was costly, specially for debug builds was to lookup for the function in a map. So I kept a local cache of functions for a fast lookup up to the last 4 different generated "functions"/basic blocks.

Deferring exceptions and Caching them

You cannot throw exceptions inside functions that have been called from the dynarek2 generated code. Probably those functions require a prefix/suffix in all the calling stack for it to be supported.

Because of that, when calling syscall right now, that might throw an exception to ask the interpreter or the dynarec to stop executing code returning to the threadmanager/interruptmanager I'm catching exceptions inside the syscall method, and using a return convention to later rethrow the exception outside the generated code.

Later I hope to get the prefix/suffix working to not require this.

Also since I explained in the previous post, instantiating exceptions in K/N is too slow for an emulator to use in normal flow. So what I did was to instantiate the exception instances once and rethrowing them (and doing that, I loose the stacktrace). But I was not using it at any rate in the cases I was using it.

Interpreter

Though not written yet, this new dynarek2 should allow to write a faster interpreter of the AST for full AOT targets (not allowing dynamic code execution for security). Also some design decisions should allow to faster interpreting of the normal MIPS interpreter.

Repository

Instead of hosting the dynarek2 as part of korlibs, I have moved to the kpspemu repo until it is matured to prevent having to build in one project/repo and use in another. Once matured I will move back to korlibs or potentially its own repo since it doesn't have dependencies and can be useful for other projects.

Conclusion

And that's it.

The good thing here is that the emulator can be developed in Kotlin, my favourite language, targeting the JVM with its safety, error handling, tooling and fast incremental compilation times. Then run in any browser compiling it to JS. And then compile to native executables even if it takes ~5-10 minutes to do so, since it will usually work (modulo strange bugs).

And it doesn't do strange things like using emscripten or webassembly for JS. It is fully native to each platform. Which is super great.

Code generation is tested in the dynarek2 module (that do not have dependencies), so when used in kpspemu, should work most flawlessly. And you can run native tests in around ~20 seconds. So developing the dynarek2 is not too time consuming in relation to testing.