KorGE 2.1: From 100K to 800K fast sprites

KORGE

April 30, 2021

KorGE 2.1 that will be released soon, will include potential new performance improvements for high performance use cases.

In fact originally the sprites10k sample was able to handle 10k at over 100fps, after that FastSprite was introduced and we included a bunnymark that can reach 100k at near 100fps, and now with KorGE 2.1 we will be able to handle 800k sprites at over 100fps (limited by GPU), and for the first time reach around 30fps on JS (where original Pixi.JS bunnymark can reach only 2fps) and around 50fps with 800k sprites in Kotlin/Native too with new Kotlin/Native 1.5.0 optimizations!

But how did I manage to reach that level of performance? Let's dig into it:

Bottle neck in bunnymark

Bunnymark was memory-bound, and that means that the performance was limited by memory accesses. Each sprite was an object, and that object was allocated in arbitrary memory addresses, so when iterating all those objects CPU cache hitting was not optimal.

But not only that: we had to iterate all the different sprite objects, then compute all the coordinates, then write them 4 vertices for every sprite into another memory address, then upload all those vertices to the GPU.

New FSprites container

What if we would be able to store all the main sprite properties packed together in memory, so when iterating they were packed and near in memory, so the CPU caches would hit properly?

You can normally do that in C/C++/C# by creating an array of structures, so all the elements are packed together without memory indirections. But you cannot do that on the JVM and on Kotlin targets.

But what if we could make a trick to do that, while using object-oriented programming? That's exactly what FSprites does. To do that we use inline classes.

The idea is to define an inline class that wraps an integer with an index:

inline class FSprite(val id: Int) {
    inline val offset get() = id * STRIDE
}

Then we have a container that defines properties for FSprite and uses a buffer to store the values there:

open class FSprites(val maxSize: Int) {
    var size = 0
    val data = FBuffer(maxSize * STRIDE)
    private val i32 = data.i32
    private val f32 = data.f32
    fun alloc() = FSprite(size++)

    // Properties for Fsprites
    var FSprite.x: Float get() = f32[offset + 0] ; set(value) { f32[offset + 0] = value }
    var FSprite.y: Float get() = f32[offset + 1] ; set(value) { f32[offset + 1] = value }
}

And we can extend the FSprites container to define new properties backed with a separate buffer:

class BunnyContainer(maxSize: Int) : FSprites(maxSize) {
    val speeds = FBuffer(maxSize * Float.SIZE_BYTES * 2).f32
    var FSprite.speedXf: Float get() = speeds[index * 2 + 0] ; set(value) { speeds[index * 2 + 0] = value }
    var FSprite.speedYf: Float get() = speeds[index * 2 + 1] ; set(value) { speeds[index * 2 + 1] = value }
}

Then we have a way to iterate over each sprite from the FSprites container:

inline fun <T : FSprites> T.fastForEach(callback: T.(sprite: FSprite) -> Unit)

And we can use that fastForEach to tick all the sprites like this:

bunnys.fastForEach { bunny ->
    bunny.x += bunny.speedXf
    bunny.y += bunny.speedYf
    bunny.speedYf += gravity

    if (bunny.x > maxX) {
        bunny.speedXf *= -1
        bunny.x = maxX
    } else if (bunny.x < minX) {
        bunny.speedXf *= -1
        bunny.x = minX
    }

    if (bunny.y > maxY) {
        bunny.speedYf *= -0.85f
        bunny.y = maxY
        bunny.radiansf = (random.nextFloat() - 0.5f) * 0.2f
        if (random.nextFloat() > 0.5) {
            bunny.speedYf -= random.nextFloat() * 6
        }
    } else if (bunny.y < minY) {
        bunny.speedYf = 0f
        bunny.y = minY
    }
}

The drawback here is that we have to have the FSprites instance injected as this for this to work, so when iterating it is directly with the fastForEach, but if we want to access a specific sprite, we have to wrap the code around a fsprites.apply { ... } so you can access the properties of FSprite defined in your container, for critical code, this is a pretty reasonable tradeoff, since you can continue using properties as plain OOP.

AG/KorGW improvements

With the previous example, we can have all the sprites data together in memory, improving the performance a lot. But we still needed to compute all the vertices, duplicate and adjusting it four times (one per vertex per sprite), then copy that info 4 times duplicated it into the GPU memory.

Multiple Buffers and Layouts

One of the problems was that since in KorGW's AG drawing only supported one Buffer and Layout for vertices, you had to pack all your vertex information in that buffer. But we could maybe precompute some vertex attributes that won't change, upload it once, and then reuse them for several calls.

So the next thing I did was to allow to specify several buffer+vertex layouts when drawing. So:

data class AG.Batch(
    var vertices: Buffer = Buffer(Buffer.Kind.VERTEX),
    var vertexLayout: VertexLayout = VertexLayout(),
    var program: Program,
    // ...
)

was converted into:

data class VertexData(
    var buffer: Buffer = Buffer(Buffer.Kind.VERTEX),
    var layout: VertexLayout = VertexLayout()
)

data class AG.Batch(
    var vertexData: List<VertexData> = listOf(VertexData()),
    var program: Program,
    // ...
)

So now we have full control on where to store stuff. Maybe we can store all the X values into a buffer, then all the Y values into other, and color, scale and etc. packed together in another buffer. So in the case some buffers doesn't change, we won't have to upload it.

This improvement also helped to support instanced rendering, that is the final optimization.

Instanced Rendering

Instanced Rendering allows to render a set of primitives multiples times, and with attribute divisors, you can define some vertex attributes to change on each instance.

It is something that is not supported out of the box by OpenGL ES 2.0, or WebGL 1.0. But it is available as OpenGL extensions, and it is also available out of the box starting with OpenGL ES 3.0 and WebGL 2.0.

So I updated AG and related APIs to support this:

data class Batch constructor(
    // ...
    var instances: Int = 1
)

And updated shader Attribute to include a divisor property. When divisor is 0, the attribute is not instanced and will use the same values for every instance. While when the divisor is 1 or greater, the attribute will be changed for each instance.

And when rendering, we updated this:

if (indices != null) {
    gl.drawElements(type.glDrawMode, vertexCount, indexType.glIndexType, offset)
} else {
    gl.drawArraysInstanced(type.glDrawMode, offset, vertexCount, instances)
}

to this:

if (indices != null) {
    if (instances != 1) {
        gl.drawElementsInstanced(type.glDrawMode, vertexCount, indexType.glIndexType, offset, instances)
    } else {
        gl.drawElements(type.glDrawMode, vertexCount, indexType.glIndexType, offset)
    }
} else {
    if (instances != 1) {
        gl.drawArraysInstanced(type.glDrawMode, offset, vertexCount, instances)
    } else {
        gl.drawArrays(type.glDrawMode, offset, vertexCount)
    }
}

Detect GPU features

Since instanced rendering is not always available (though it will be available on most cases), we need a way to detect if it is available, to use it, or to fallback to a version not using instanced rendering. For that I created an AGFeatures interface, that looks like this:

interface AGFeatures {
    val graphicExtensions: Set<String> get() = emptySet()
    val isInstancedSupported: Boolean get() = false
    val isFloatTextureSupported: Boolean get() = false
}

And AG implements it:

abstract class AG : AGFeatures

So from KorGE you can check if instanced rendering is supported.

Make FSpritesView to use instanced rendering

FSprites is storing already the x, y, width, height, rotation and alpha together in memory. With non-instanced rendering we have to put 4 times each property into memory, and that forces to have a separate memory region where you have to manually copy all the data.

But with instance rendering, vertex divisors, and the new combined layouts we can do this uploading our data directly to the GPU.

How do we do this?

Create a 1 by 1 sprite

First of all, we create a buffer with four vertices defining x and y.

val a_xy = Attribute("a_xy", VarType.Float2, false)
val RenderContext.xyBuffer by Extra.PropertyThis<RenderContext, AG.VertexData> { ag.createVertexData(a_xy) }

private val xyData = floatArrayOf(
    0f, 0f,
    1f, 0f,
    1f, 1f,
    0f, 1f
)

With that, we can render that using a TRIANGLE_FAN:

ctx.ag.drawV2(
    // ...
    type = AG.DrawType.TRIANGLE_FAN,
    vertexCount = 4,
    // ...

and with instanced rendering, we can render that sprite many times:

ctx.ag.drawV2(
    // ...
    type = AG.DrawType.TRIANGLE_FAN,
    vertexCount = 4,
    instances = sprites.size,
    // ...

But that sprite is not at the right position, it is not rotated, doesn't have texture information etc. instanced attributes (attributes with divisor >= 1) to the rescue!

val a_pos = Attribute("a_rxy", VarType.Float2, false).withDivisor(1)
val a_scale = Attribute("a_scale", VarType.Float2, true).withDivisor(1)
val a_angle = Attribute("a_rangle", VarType.Float1, false).withDivisor(1)
val a_anchor = Attribute("a_axy", VarType.SShort2, true).withDivisor(1)
val a_uv0 = Attribute("a_uv0", VarType.UShort2, false).withDivisor(1)
val a_uv1 = Attribute("a_uv1", VarType.UShort2, false).withDivisor(1)
val RenderContext.fastSpriteBuffer by Extra.PropertyThis<RenderContext, AG.VertexData> {
    ag.createVertexData(a_pos, a_scale, a_angle, a_anchor, a_uv0, a_uv1)
}

With that code, all those attributes will be the same for each 4 vertices, and different for each instance. So we can upload that information directly to the GPU, and those attributes will be available to the instance.

After that, we can perform Vertex Shader computations to place the sprite with the right position, scale and rotation on the screen, like this:

val vprogram = Program(VertexShader {
    DefaultShaders.apply {
        val baseSize = t_Temp1["xy"]
        SET(baseSize, a_uv1 - a_uv0)

        SET(v_Tex, vec2(
            mix(a_uv0.x, a_uv1.x, a_xy.x),
            mix(a_uv0.y, a_uv1.y, a_xy.y),
        ) * u_i_texSize)
        val cos = t_Temp0["x"]
        val sin = t_Temp0["y"]
        SET(cos, cos(a_angle))
        SET(sin, sin(a_angle))
        SET(t_TempMat2, mat2(
            cos, -sin,
            sin, cos,
        ))
        val size = t_Temp0["zw"]
        val localPos = t_Temp0["xy"]

        SET(size, baseSize)
        SET(localPos, t_TempMat2 * (a_xy - a_anchor) * size)
        SET(out, (u_ProjMat * u_ViewMat) * vec4(localPos + vec2(a_pos.x, a_pos.y), 0f.lit, 1f.lit))
    }
}, FragmentShader {
    DefaultShaders.apply {
        SET(out, texture2D(u_Tex, v_Tex["xy"]))
        //SET(out, vec4(1f.lit, 0f.lit, 1f.lit, .5f.lit))
        IF(out["a"] le 0f.lit) { DISCARD() }
        SET(out["rgb"], out["rgb"] / out["a"])
    }
})

Final words

And voilá: we are storing all the properties for the rendering together into an array thanks to FSprites and inline classes, thanks to new multiple buffers+layouts and instanced rendering and attribute divisors we are able to upload 1/4 of the information to the GPU, while not requiring extra copy or computations on the CPU. Amazing!

Ah! And since Kotlin 1.5.0, performance of some functions is much better, specially on Windows! And after circumventing a few issues, I was able to get a really good performance on the bunnymark-fast sample. Check the details here: https://github.com/korlibs/kotlin-native-performance-experiment

That's it. Thank you for reading!