Evolution of Kotlin and KorGE (value classes)

KOTLIN , KORGE

February 12, 2021

Kotlin is one of the most beloved languages, thanks to its inline classes, inline functions, tail lambdas, suspending functions, DSLs and receivers, scope-aware extension members, elliding types, non mandatory semicolons, non requiring the new operator, data classes, primary constructors, companion objects implementing interfaces, three-leter declarations, smart casts...

All this while keeping a syntax that is clean yet statically typed and puts and gives names higher priority when reading and writing than types. But since it targets JVM-8/Android (valhalla took way too long) and JS that doesn't support structures/value types, there were some issues when writing high performant code in some use-cases like videogames.

Value classes in Kotlin, that will be available at some point, that work similar to inline classes, will allow vectorial high performant yet clean code on all the Kotlin targets.

Right now:

// Potentially 3 allocations
val sum = Point(1, 2) + Point(3, 4)

And temporarily preallocating for critical paths:

// Either here or from a pool
val temp1 = Point()
val temp2 = Point()
val temp3 = Point()
fun demo() {
    // 0 allocations, since we have all three points preallocated
    temp1.setTo(1, 2)
    temp2.setTo(3, 4)
    temp3.add(temp1, temp2)
}

But in the future, we will be able to do this:

// 0 allocations
val sum = Point(1, 2) + Point(3, 4)

Only one pre-allocation per thread, per different value class for each running application.

Introduction:

Last week, reading the Kotlin blog, I found this post:

https://blog.jetbrains.com/kotlin/2021/02/new-language-features-preview-in-kotlin-1-4-30/#inline-value-classes-stabilization

And in the comments section Svetlana pointed me here:

https://github.com/Kotlin/KEEP/blob/master/notes/value-classes.md#multifield-value-classes-before-valhalla

This is something super exciting that will allow to write high performance vectorial code in critical paths without having to sacrify readability, and designing error-prone mutable APIs so it works reasonably well on all the targets: JVM, JS and Kotlin/Native.

Now that the IR backend is almost ready, the Kotlin team will be able to bring new features in a much easier and consistent way. They will be able to make transformations at the IR level that applies all the targets at once highly reducing bugs in different targets, and I believe they will be able to implement this feature mostly at the IR level.

How could this work for parameters, return values and fields?

So, how could this work to avoid allocating objects?

Let's think on this Point implementation:

value class Point(val x: Float, val y: Float) {
    operator fun plus(other: Point) = Point(x + other.x, y + other.y)
}

val sum = Point(1f, 2f) + Point(3f, 4f)

Right now, it could be converted into ES6 like this:

class Point {
    constructor(x, y) {
        this.x = x
        this.y = y
    }
    plus(other) {
        return new Point(this.x + other.x, this.y + other.y)
    }
}

const sum = new Point(1f, 2f).plus(new Point(3f, 4f))

In some cases, an escape analysis could decide to not allocate the Point instances, or allocate them into the stack, since they are not stored on the heap. But that wouldn't happen when stored into a field. And that optimization might not happen on all the JS runtimes, initially, or before starting doing optimistic optimizations based on input shapes.

How could that be implemented this without allocating, if you have full control on the generated code?

class Point {
    static $threadLocalTemp = new Point(0f, 0f) // This should be a thread local on targets sharing threads with the same heap

    constructor(x, y) {
        this.x = x
        this.y = y
    }

    static plus(this$x, this$y, other$x, other$y, out = Point.$threadLocalTemp) {
        out.x = this$x + other$x
        out.y = this$y + other$y
        return out
    }
}

const temp = Point.plus(1f, 2f, 3f, 4f))
const sum$x = temp.x
const sum$y = temp.y

In the end, for each local and stack element, you can have as many locals/elements as fields your value class have, then use an intermediate and per thread preallocated object to store the results. That should work on all the targets, and won't rely on escape analysis being availalble or working.

There, we are still writting to a temporary object, that the runtime might not be able to optimize, so that wouldn't be optimal, yet should work. But we could use inline functions for those operations. For example, marking the operator fun plus as inline, then the generated code could be something like:

const temp$1 = 1f
const temp$2 = 2f
const temp$3 = 3f
const temp$4 = 4f
const sum$x = temp$1 + temp$3
const sum$y = temp$2 + temp$4

When the runtime converts this into the SSA form (and by the way, in this case, it is already in SSA), this is trivial for most of the runtimes on any language to optimize. And they could even perform autovectorization. Though this won't work on JS since the variables are doubles, and most SIMD coprocessors have instructions for Float32, unless they finally manage to optimize Math.fround. But still will be as performant as possible limited by the runtime.

When storing to fields in objects, it could work similar to locals: as many locals as fields have the structure.

Custom arrays:

And to have for example lists or arrays of Points, we can do something like:

inline class PointArray internal constructor(val data: FloatArray {
    constructor(count: Int) : this(FloatArray(count * 2))

    // inline could help here to avoid using a globally allocated instance
    operator fun get(index: Int) = Point(data[index * 2 + 0], data[index * 2 + 1])
    operator fun set(index: Int, value: Point) {
        data[index * 2 + 0] = value.x
        data[index * 2 + 1] = value.y
    }
}

val points = PointArray(16)
points[2] = points[0] + points[1]

This would store all the fields sequentially, making possible to fully take advantages of CPU cache levels.

A possible implementation in JS/ES6 could be something similar to:

class PointArray {
    static create(count) {
        return new Float32Array(count * 2)
    }

    static get($this, index, out = Point.$threadLocalTemp) {
        out.x = $this[index * 2 + 0]
        out.y = $this[index * 2 + 1]
        return out
    }
    static set($this, index, value$x, value$y) {
        $this[index * 2 + 0] = value$x
        $this[index * 2 + 1] = value$y
    }
}

const points = PointArray.create(16) // new Float32Array(16 * 2)
let temp$Point
temp$Point = PointArray.get(points, 0)
const temp$1$x = temp$Point.x
const temp$1$y = temp$Point.y
temp$Point = PointArray.get(points, 1)
const temp$2$x = temp$Point.x
const temp$2$y = temp$Point.y
temp$Point = Point.plus(temp$1$x, temp$1$y, temp$2$x, temp$2$y)
const result$x = temp$Point.x
const result$y = temp$Point.y
Point.set(points, 2, result$x, result$y)

And (inline fun'd):

const points = new Float32Array(16 * 2)
const temp$00 = 0
const temp$1$x = points[temp$00 * 2 + 0]
const temp$1$y = points[temp$00 * 2 + 1]
const temp$01 = 1
const temp$2$x = points[temp$01 * 2 + 0]
const temp$2$y = points[temp$01 * 2 + 1]
const result$x = temp$1$x + temp$2$x
const result$y = temp$1$y + temp$2$y
const temp$02 = 2
points[temp$02 * 2 + 0] = result$x
points[temp$02 * 2 + 1] = result$y

Which is pretty straight-forward and fast for any engine and runtime to optimize it quickly.

And the generated code could be optimized, after constant propagation, side-effect analysis (with a heap per thread/worker, the array could not be modified externally) and local inlining, as:

const points = new Float32Array(32)
points[4] = points[0] + points[2]
points[5] = points[1] + points[3]

So yeah, I'm pretty excited for this to happen. Once this is ready, I will start preparing a new major version of my libraries supporting this, tentatively and hopefully the 3.x version of the libraries, and a lot of the engine code will be greatly simplified, and user-level code will be able to use this too.