The quest for the perfect 2D sprite pipeline

When it comes to sprite rendering, I have mostly used XNA SpriteBatch directly, or a SpriteBatch ported to C++, to draw sprites in the past. However, I always found XNA SpriteBatch limiting at times, especially if you wanted to pass different parameters to your pixel shader or use a different vertex input layout.

XNA was designed around Direct3D 9, and GPU hardware has progressed a lot since then; we could use modern GPU features to render lots of sprites more efficiently. But which method will give the best result for our needs? To answer this question, I decided to sacrifice some time during the development of our first game BioMech Catalyst to explore various options, find the best way for our engine c0ld to render lots and lots of sprites, and share my findings.

Requirements

Before listing the approaches I’m going to try, let’s explore the requirements I need for rendering sprites.

Sub-indexing rectangles into a texture atlas

Combining images into a single texture is a great way to save memory and reduce texture state changes on the GPU. Usually you combine images used together into a single texture that we call an atlas. This is very useful to store all the animations of a single character, for instance.

/blog/sprite_pipeline/atlas_animation.gif

Horizontal and Vertical Flip

Being able to flip the UV coordinates of the sprite horizontally and/or vertically, which changes the orientation of the sprite while reusing the same source texture.

/blog/sprite_pipeline/flip_animation.gif

Origin/Anchor point

An origin/anchor point is used so that an animation can vary its size while staying at the same place in game. This makes placing sprites in a world far easier and simplifies creating the collision boxes and attack boxes for physics and collision.

When the origin point is at the top left (0,0), the sprite position is not stable between animation frames.

/blog/sprite_pipeline/origin_point_top_left.gif

For a platformer character, placing the origin point vertically at their feet makes positioning the sprite far more predictable.

/blog/sprite_pipeline/origin_point_feet.gif
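To make the idea concrete, here is a minimal Python sketch (hypothetical helper, not engine code) showing how an anchor point keeps differently-sized frames of the same animation planted at the same spot:

```python
def anchored_top_left(position, origin):
    # Hypothetical helper: subtract the origin (in frame pixels) so the
    # anchor point, not the top-left corner, lands at `position`.
    return (position[0] - origin[0], position[1] - origin[1])

pos = (100.0, 200.0)
# 32x32 frame anchored at the feet (bottom-center), origin = (16, 32)
tl_a = anchored_top_left(pos, (16, 32))   # (84.0, 168.0)
# 48x48 frame of the same animation, feet at (24, 48)
tl_b = anchored_top_left(pos, (24, 48))   # (76.0, 152.0)
# Both frames keep their feet at y=200 even though their sizes differ:
assert tl_a[1] + 32 == 200.0 and tl_b[1] + 48 == 200.0
```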

Affine transformations

/blog/sprite_pipeline/sprite_rotation.gif /blog/sprite_pipeline/sprite_scaling.gif

Being able to translate, scale and rotate the sprite.

Color tint

/blog/sprite_pipeline/color_tint_no_tint.gif /blog/sprite_pipeline/color_tint.gif

Being able to tint the sprite, basically multiplying a color with the texture pixels. While it may look useful for replicating palette-swapped sprites, it doesn’t give you enough control to swap to the colors you actually want.

Color overlay

/blog/sprite_pipeline/color_overlay.gif

Being able to overlay a color on top of a sprite, for hit flashes and more. The alpha of the overlay controls how much the overlay color is superimposed on the resulting sprite.
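One plausible way for a pixel shader to combine the tint and the overlay is a multiply followed by an alpha lerp. This Python sketch uses normalized 0..1 channels; the engine’s exact blend may differ:

```python
def shade(texel, tint, overlay):
    # Hypothetical shader math: tint multiplies the texel, then the overlay
    # color is blended on top, weighted by the overlay's alpha.
    # texel/tint are RGB triples, overlay is RGBA, all in 0..1 floats.
    out = []
    for c in range(3):
        tinted = texel[c] * tint[c]
        out.append(tinted * (1.0 - overlay[3]) + overlay[c] * overlay[3])
    return tuple(out)

# Full-strength white overlay -> solid hit flash regardless of the texel.
assert shade((0.2, 0.5, 0.8), (1, 1, 1), (1, 1, 1, 1.0)) == (1.0, 1.0, 1.0)
# Zero-alpha overlay leaves only the tint multiply.
assert shade((0.5, 0.5, 0.5), (1.0, 0.0, 0.0), (0, 0, 0, 0.0)) == (0.5, 0.0, 0.0)
```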

Palette

/blog/sprite_pipeline/palette_blue.gif /blog/sprite_pipeline/palette_red.gif

Some sprites may require different color palettes at runtime, instead of pre-generating color variations of the same sprite, which would take more disk space and more texture memory.

The original image is stored using a single color channel (red), and each color value is an index into the palette that the shader samples, with index 0 reserved for transparency.

/blog/sprite_pipeline/palette_sprite_sheet.gif /blog/sprite_pipeline/palette_example.gif

For better visualization here, the index is stored in the high nibble of the color (0x00, 0x10, 0x20, 0x30, …), which only allows indexing 15 colors, but if we store it using the full range we are able to index 255 colors plus transparency.
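As a sketch of the idea (Python, hypothetical names, not the actual shader): the red channel is read, used as a palette index, and index 0 is treated as transparent.

```python
# 256-entry palette; entry 0 is fully transparent, as in the article.
# Here the other entries are just a red ramp for illustration.
palette = [(0, 0, 0, 0)] + [(i, 0, 0, 255) for i in range(1, 256)]

def sample_paletted(index_image, x, y, palette):
    # Hypothetical shader logic: the single (red) channel is an index
    # into the palette texture.
    idx = index_image[y][x]
    return palette[idx]

image = [[0, 1],
         [2, 255]]
assert sample_paletted(image, 0, 0, palette) == (0, 0, 0, 0)      # transparent
assert sample_paletted(image, 1, 1, palette) == (255, 0, 0, 255)  # last entry
```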

Benchmark methods and hardware

  • CPU time is measured using the Tracy profiler.
  • GPU time is measured using NVIDIA Nsight on Windows. The time measured is the execution time of the graphics command list.
  • CPU memory is measured via Windows process memory.
  • GPU memory is measured using the in-game report info.

The Windows results use the D3D12 rendering backend of our engine, while the Steam Deck uses the Vulkan rendering backend.

Both Debug and ReleaseFast configurations are measured because, during development, the Debug configuration should be usable as much as possible.

Hardware

2019 Laptop

  • Intel Core i7 9750H @ 2.60 GHz, 6 cores 12 threads
  • 32 GB DDR4 Memory @ 1329 MHz
  • NVIDIA GeForce RTX 2060 Mobile, 6 GB GDDR6
  • Windows 11 23H2 (22631.4460)

Steam Deck

  • 7 nm AMD APU Zen 2 @ 2.4-3.5GHz, 4 cores 8 threads
  • 16 GB LPDDR5 on-board RAM (5500 MT/s quad 32-bit channels)
  • 8 RDNA 2 CUs, 1.6GHz (1.6 TFlops FP32)

2013 Desktop PC

  • Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz, 4 cores 4 threads
  • 16 GB DDR3 @ 800.7 MHz
  • NVIDIA GeForce GTX 660, 2 GB GDDR5
  • Windows 10 22H2 (19045.5131)

Resources

Here are the element counts of our test scene for the measurements we want to make.

Resource                              | Value
--------------------------------------|------------------
Sprites rendered                      | 100 000
Palette sprites rendered              | 4
Draw calls                            | 219
Max sprites                           | 131 072
Max palette sprites                   | 256
Unique textures loaded and drawn      | 211
Max GPU memory for GPU-only buffers   | 64 Mb
Max GPU memory for render targets     | 4 Mb
Max GPU memory for textures           | 572 Mb
Max GPU memory for dynamic data (*1)  | 256 Mb
Target GPU memory budget              | 1 Gb
Target CPU render time in ReleaseFast | 8.33 ms (120 FPS)

The max sprite count is just a huge number to set an upper bound for our measurement tests. It could be higher, but I fail to see our games ever drawing that many 2D elements on screen.

*1 Dynamic data includes the vertex buffers, index buffers and constant buffers that are always visible to the CPU for writing. This is the most common size of CPU memory that could be DMA’d to the GPU over the PCI bus before the advent of Resizable BAR.

Here’s how the test scene looks in action:

Approaches

These are the approaches I want to try, measuring the CPU time and memory usage of each.

  • Traditional/CPU: Generate the vertex and index data on the CPU.
  • Vertex Pulling: Generate the vertices in the vertex shader using the sprite draw data stored in a structured buffer.
  • Compute Shader: Using sprite draw data stored in a structured buffer to populate vertex and index buffer data in a compute shader.

Before doing any testing, I want to share my guess that Vertex Pulling is going to be the best compromise between CPU time, GPU time and memory usage.

For easier A/B testing, the sprite/pixel render module has been duplicated into 4 files that I can switch in code to redo the measurements of the test scene.

The pixel_render module has two draw methods:

drawTexture()

const DrawTextureArgs = struct {
    texture: gpu.TextureResource,
    position: math.float2,
    layer: u32 = 0,
    sub_rect: math.Rect = .{},
    origin: math.int2 = @splat(0),
    scale: math.float2 = @splat(1.0),
    rotation: f32 = 0.0,
    flip: Flip = .{},
    tint_color: zigimg.color.Rgba32 = zigimg.Colors(zigimg.color.Rgba32).White,
    overlay_color: zigimg.color.Rgba32 = .{ .r = 0, .g = 0, .b = 0, .a = 0 },
};

pub fn drawTexture(args: DrawTextureArgs) !void {
    if (context.sprite_data.len >= context.max_sprite_data) {
        std.log.debug("[pixel_render] Max sprites data ({}) per frame reached!", .{context.max_sprite_data});
        return;
    }

    context.sprite_data.len += 1;
    const new_sprite = &context.sprite_data[context.sprite_data.len - 1];

    new_sprite.texture = args.texture;
    new_sprite.position = math.float3{ args.position[0], args.position[1], @as(f32, @floatFromInt(args.layer)) / @as(f32, TEMP_MAX_LAYERS) };
    new_sprite.origin = math.float2{ @floatFromInt(args.origin[0]), @floatFromInt(args.origin[1]) };
    new_sprite.flip = args.flip;
    new_sprite.tint_color = args.tint_color;
    new_sprite.overlay_color = args.overlay_color;
    new_sprite.scale = args.scale;
    new_sprite.rotation = args.rotation;

    if (args.sub_rect.isEmpty()) {
        // Draw the whole texture
        const texture_resource = try gpu.getTextureResource(args.texture);

        new_sprite.sub_rect = .{
            .right = @intCast(texture_resource.width),
            .bottom = @intCast(texture_resource.height),
        };
    } else {
        new_sprite.sub_rect = args.sub_rect;
    }
}

and drawPalettedTexture()

const DrawPalettedTextureArgs = struct {
    texture: gpu.TextureResource,
    palette: PaletteResource,
    position: math.float2,
    layer: u32 = 0,
    sub_rect: math.Rect = .{},
    origin: math.int2 = @splat(0),
    scale: math.float2 = @splat(1.0),
    rotation: f32 = 0.0,
    flip: Flip = .{},
    tint_color: zigimg.color.Rgba32 = zigimg.Colors(zigimg.color.Rgba32).White,
    overlay_color: zigimg.color.Rgba32 = .{ .r = 0, .g = 0, .b = 0, .a = 0 },
};

pub fn drawPalettedTexture(args: DrawPalettedTextureArgs) !void {
    if (context.palette_sprite_data.len >= context.max_palette_sprite_data) {
        std.log.debug("[pixel_render] Max palette sprite data ({}) per frame reached!", .{context.max_palette_sprite_data});
        return;
    }

    context.palette_sprite_data.len += 1;

    const new_palette_sprite = &context.palette_sprite_data[context.palette_sprite_data.len - 1];

    new_palette_sprite.sprite_draw.texture = args.texture;
    new_palette_sprite.sprite_draw.position = math.float3{ args.position[0], args.position[1], @as(f32, @floatFromInt(args.layer)) / @as(f32, TEMP_MAX_LAYERS) };
    new_palette_sprite.sprite_draw.origin = math.float2{ @floatFromInt(args.origin[0]), @floatFromInt(args.origin[1]) };
    new_palette_sprite.sprite_draw.flip = args.flip;
    new_palette_sprite.sprite_draw.tint_color = args.tint_color;
    new_palette_sprite.sprite_draw.overlay_color = args.overlay_color;
    new_palette_sprite.sprite_draw.scale = args.scale;
    new_palette_sprite.sprite_draw.rotation = args.rotation;
    new_palette_sprite.palette = args.palette;

    if (args.sub_rect.isEmpty()) {
        // Draw the whole texture
        const texture_resource = try gpu.getTextureResource(args.texture);

        new_palette_sprite.sprite_draw.sub_rect = .{
            .right = @intCast(texture_resource.width),
            .bottom = @intCast(texture_resource.height),
        };
    } else {
        new_palette_sprite.sprite_draw.sub_rect = args.sub_rect;
    }
}

Each draw texture and draw paletted texture command is queued to be processed later in the frame.

The pixel_render module draws the whole game content into an off-screen texture of 480x270 and then upscales it to the render resolution. An off-center orthographic matrix transforms positions from (0,0) (top-left) to (480,270) (bottom-right) into normalized coordinates (-1.0 to 1.0) in the vertex shader.
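The mapping that matrix performs can be sketched in a few lines. This is a hypothetical Python helper (only the scale/translate terms, depth and full matrix layout omitted), not the engine’s math code:

```python
def ortho_off_center(left, right, top, bottom):
    # Off-center orthographic projection terms: maps (left, top) -> (-1, +1)
    # and (right, bottom) -> (+1, -1), matching a top-left pixel origin.
    sx = 2.0 / (right - left)
    sy = 2.0 / (top - bottom)  # negative: pixel Y grows down, NDC Y grows up
    tx = -(right + left) / (right - left)
    ty = -(top + bottom) / (top - bottom)
    return sx, sy, tx, ty

def project(m, x, y):
    sx, sy, tx, ty = m
    return (x * sx + tx, y * sy + ty)

m = ortho_off_center(0.0, 480.0, 0.0, 270.0)
assert project(m, 0.0, 0.0) == (-1.0, 1.0)  # top-left pixel -> top-left NDC
center = project(m, 240.0, 135.0)
assert abs(center[0]) < 1e-9 and abs(center[1]) < 1e-9  # center -> NDC origin
```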

Traditional/CPU

This is closest to the XNA SpriteBatch design. Every vertex is generated on the CPU, and we batch per texture to reduce texture state changes on the GPU.

This is the vertex structure:

const SpriteVertex = struct {
    position: [4]f32,
    texture_uv: [2]f32,
    tint_color: zigimg.color.Rgba32,
    overlay_color: zigimg.color.Rgba32,
};

zigimg.color.Rgba32 is four u8 values.

Before rendering, we sort the sprite and palette sprite data per texture ID so that we can batch as many vertices as possible for each texture.

On render, we use a simple quad batcher to write 4 vertices and 6 indices per sprite. Each vertex is scaled, transformed and rotated manually, and UV coordinates are flipped depending on the flip state of the sprite. Each texture change results in a draw call. We process the sprites and the palette sprites separately to group the draw calls per Pipeline State Object (PSO).

fn batchSpriteDrawData(batcher: *QuadBatcher, sprite_draw: SpriteDrawData) !void {
    try batcher.setTexture(SpriteVertex, sprite_draw.texture);

    const texture_size: math.float2 = .{
        @floatFromInt(batcher.last_texture_width),
        @floatFromInt(batcher.last_texture_height),
    };

    const sub_rect_top_left: math.float2 = .{
        @floatFromInt(sprite_draw.sub_rect.left),
        @floatFromInt(sprite_draw.sub_rect.top),
    };

    const sub_rect_bottom_right: math.float2 = .{
        @floatFromInt(sprite_draw.sub_rect.right),
        @floatFromInt(sprite_draw.sub_rect.bottom),
    };

    const sub_rect_size: math.float2 = sub_rect_bottom_right - sub_rect_top_left;

    const sub_rect_top_left_uv: math.float2 = sub_rect_top_left / texture_size;
    const sub_rect_size_uv: math.float2 = sub_rect_size / texture_size;

    const normalized_origin: math.float2 = sprite_draw.origin / sub_rect_size;

    const destination_size: math.float2 = sub_rect_size * sprite_draw.scale;

    const cos_value = @cos(sprite_draw.rotation);
    const sin_value = @sin(sprite_draw.rotation);

    const rotation_matrix1: math.float2 = .{ cos_value, -sin_value };
    const rotation_matrix2: math.float2 = .{ sin_value, cos_value };

    const mirror_bits: u8 = (@as(u8, @intFromBool(sprite_draw.flip.horizontal)) << @as(u8, 1)) | (@as(u8, @intFromBool(sprite_draw.flip.vertical)) << @as(u8, 0));

    const tint_color = sprite_draw.tint_color;
    const overlay_color = sprite_draw.overlay_color;

    var quad: [4]SpriteVertex = undefined;

    for (&quad, 0..) |*vertex, index| {
        const corner_offset: math.float2 = (QUAD_CORNERS[index] - normalized_origin) * destination_size;

        const position_float3 = sprite_draw.position + math.float3{
            @reduce(.Add, corner_offset * rotation_matrix1),
            @reduce(.Add, corner_offset * rotation_matrix2),
            0,
        };

        vertex.position = .{ position_float3[0], position_float3[1], position_float3[2], 1.0 };

        // Flip horizontal case
        // 00 ^ 10 = 10 (2) // Bottom Left uses Bottom-Right UV
        // 01 ^ 10 = 11 (3) // Top Left uses Top Right UV
        // 10 ^ 10 = 00 (0) // Bottom Right uses Bottom Left UV
        // 11 ^ 10 = 01 (1) // Top Right uses Top Left UV

        // Flip vertical case
        // 00 ^ 01 = 01 (1) // Bottom Left uses Top Left UV
        // 01 ^ 01 = 00 (0) // Top Left uses Bottom Left UV
        // 10 ^ 01 = 11 (3) // Bottom Right uses Top Right UV
        // 11 ^ 01 = 10 (2) // Top Right uses Bottom Right UV

        vertex.texture_uv = sub_rect_top_left_uv + QUAD_CORNERS[index ^ mirror_bits] * sub_rect_size_uv;
        vertex.tint_color = tint_color;
        vertex.overlay_color = overlay_color;
    }

    try batcher.writeQuad(SpriteVertex, quad[0..]);
}

The quad is laid out in a clockwise winding order. A quad is made of 2 triangles: the first triangle uses vertices 0, 1 and 2, the second uses vertices 1, 3 and 2.

(1)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (3)
XX                           X
X XX                         X
X   XX                       X
X     XX                     X
X       XX                   X
X         XX                 X
X           XX               X
X             XX             X
X               XX           X
X                 XX         X
X                   XX       X
X                     XX     X
X                       XX   X
X                         XX X
X                           XX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (2)
(0)
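The per-quad index pattern described above can be sketched like this (Python, illustrative only, not the engine's batcher):

```python
def quad_indices(quad_count):
    # The article's pattern: triangles (0, 1, 2) and (1, 3, 2) per quad,
    # offset by 4 vertices per quad.
    indices = []
    for q in range(quad_count):
        base = q * 4
        indices += [base + 0, base + 1, base + 2,
                    base + 1, base + 3, base + 2]
    return indices

assert quad_indices(1) == [0, 1, 2, 1, 3, 2]
# 100 000 sprites -> 400 000 vertices and 600 000 indices per frame.
assert len(quad_indices(100_000)) == 600_000
```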

The vertex and index buffers are stored in GPU buffers that are CPU writable since the contents change every frame. The vertex and index buffers are dynamically allocated in 15 or 16 Mb chunks. Once a chunk is filled, the current batch is flushed to a draw call and a new buffer chunk is allocated. The code that manages those buffers is included in the code shared below.

The actual draw call looks like this.

pub fn flush(self: *QuadBatcher) !void {
    const vertex_batch_info = self.vertex_writer.end();
    const index_batch_info = self.index_writer.end();

    if (vertex_batch_info.written > 0 and index_batch_info.written > 0) {
        if (!self.last_vertex_buffer.data.equals(vertex_batch_info.view.buffer.data) or self.last_vertex_stride != vertex_batch_info.view.stride_in_bytes) {
            self.draw_stream.setVertexBufferView(vertex_batch_info.view);

            self.start_vertex_location = 0;
            self.last_vertex_buffer = vertex_batch_info.view.buffer;
            self.last_vertex_stride = vertex_batch_info.view.stride_in_bytes;
        }

        if (!self.last_index_buffer.data.equals(index_batch_info.view.buffer.data)) {
            self.draw_stream.setIndexBufferView(index_batch_info.view);

            self.start_index_location = 0;
            self.last_index_buffer = index_batch_info.view.buffer;
        }

        self.draw_stream.drawIndexedInstanced(.{
            .index_count_per_instance = index_batch_info.written,
            .instance_count = 1,
            .start_index_location = self.start_index_location,
            .base_vertex_location = @intCast(self.start_vertex_location),
        });

        try self.draw_stream.flush();

        self.start_vertex_location += vertex_batch_info.written;
        self.start_index_location += index_batch_info.written;
    }
}

We first check if the vertex buffer view needs to be updated, which happens when the dynamic vertex writer has changed the backing GPU buffer or when the vertex type changed. Same for the index buffer if its backing GPU buffer has changed. We draw the batch of quads by treating them as a list of triangles, with no instancing.

While the other approaches use pre-allocated GPU buffers, I didn’t find the need to backport the pre-allocated buffer method to this approach for the benchmark measurements because, to be honest, using the power of the GPU performs better. More optimizations could be done for this approach, but for the sake of time, I’ll stop here.

Traditional/CPU shader and pixel_render code

Benchmark results

Resource                     | Value
-----------------------------|-------
GPU Buffer memory            | 0 Mb
GPU Render Target memory     | 1 Mb
Max dynamic data memory used | 62 Mb
Texture memory               | 13 Mb

Resource                                 | 2019 Laptop       | Steam Deck (Desktop Mode) | 2013 Desktop PC
-----------------------------------------|-------------------|---------------------------|------------------
CPU Time (Debug)                         | 39.17 ms (26 FPS) | 59.59 ms (17 FPS)         | 62.07 ms (16 FPS)
CPU Time (ReleaseFast)                   | 7.35 ms (136 FPS) | 17.22 ms (58 FPS)         | 9.33 ms (107 FPS)
gpu.waitForPreviousFrame() (Debug)       | 2.7 ms            | 5.07 ms                   | 15.44 ms
gpu.waitForPreviousFrame() (ReleaseFast) | 1.98 ms           | 4.95 ms                   | 3.41 ms
pixel_render.render() (Debug)            | 18.52 ms          | 29.42 ms                  | 24.77 ms
pixel_render.render() (ReleaseFast)      | 3.06 ms           | 8.09 ms                   | 3.55 ms
GPU Time Graphics (Debug)                | 1.90 ms           | n/a                       | 3.20 ms
GPU Time Graphics (ReleaseFast)          | 1.79 ms           | n/a                       | 2.53 ms
CPU Memory                               | 400 Mb            | 129 Mb                    | 381 Mb
GPU Memory                               | 76 Mb             | 65 Mb                     | 76 Mb
Resolution                               | 2560 x 1440       | 2560 x 1440               | 1920 x 1080

Vertex Pulling

Vertex pulling means that we pull the vertex data out of external data, bypassing the vertex buffer completely. We store a batch of sprite draw data in a structured buffer and generate the vertices in the vertex shader from that draw data.

Here is the main attraction: the vertex shader that generates the vertices on the fly from the sprite draw data.

#include "sprite_types.hlsli"
#include "common.hlsli"

struct FrameConstants
{
    float4x4 view_projection_matrix;
};

struct SpriteDrawData
{
    float3 position;
    float rotation;

    float2 scale;
    float2 origin;

    int4 sub_rect; // left, top, right, bottom

    uint tint_color;
    uint overlay_color;
    uint flip_and_texture_size;
    uint padding;
};

#define FLIP_MASK 0x3

#define TEXTURE_WIDTH_SHIFT 2
#define TEXTURE_HEIGHT_SHIFT 16
#define TEXTURE_SIZE_MASK 16383

ConstantBuffer<FrameConstants> Frame : register(b0, space0);

StructuredBuffer<SpriteDrawData> DrawData : register(t0, space1);

SpritePixelInput main(uint sprite_id: SV_InstanceID, uint vertex_id: SV_VertexID)
{
    SpriteDrawData draw_data = DrawData[sprite_id];

    // Get only the flip data
    uint flip = draw_data.flip_and_texture_size & FLIP_MASK;

    // Get the texture size
    uint2 texture_size = uint2(
        (draw_data.flip_and_texture_size >> TEXTURE_WIDTH_SHIFT) & TEXTURE_SIZE_MASK,
        (draw_data.flip_and_texture_size >> TEXTURE_HEIGHT_SHIFT) & TEXTURE_SIZE_MASK
    );

    float2 sub_rect_size = draw_data.sub_rect.zw - draw_data.sub_rect.xy;
    float2 sub_rect_top_left_uv = (float2)draw_data.sub_rect.xy / texture_size;
    float2 sub_rect_size_uv = (float2)sub_rect_size / texture_size;

    float4 normalized_origin = float4(draw_data.origin / sub_rect_size, 0.0f, 0.0f);

    float cos_value = cos(draw_data.rotation);
    float sin_value = sin(draw_data.rotation);

    float4 destination_size = float4(draw_data.scale * sub_rect_size, 1.0, 1.0);

    float4 position_corner = float4(
        (vertex_id >> 1) & 1,
        (vertex_id & 1) ^ 1,
        0.0f,
        1.0f
    );

    float2 uv_corner = float2(
        ((vertex_id ^ flip) >> 1) & 1,
        ((vertex_id ^ flip) & 1) ^ 1
    );

    float4 corner_offset = (position_corner - normalized_origin) * destination_size;
    float4 position = float4(
        dot(corner_offset.xy, float2(cos_value, -sin_value)),
        dot(corner_offset.xy, float2(sin_value, cos_value)),
        0.0,
        0.0
    ) + float4(draw_data.position.x, draw_data.position.y, draw_data.position.z, 1.0f);

    SpritePixelInput output;
    output.position = mul(position, Frame.view_projection_matrix);
    output.texture_uv = sub_rect_top_left_uv + uv_corner * sub_rect_size_uv;
    output.tint_color = unpackRgbaColor(draw_data.tint_color);
    output.overlay_color = unpackRgbaColor(draw_data.overlay_color);
    return output;
}

Unlike the other approaches, we use a triangle strip as the primitive instead of a triangle list. This approach is pretty much instanced rendering with the instance data provided in a structured buffer. By using a triangle strip with instancing, we only need to generate 4 vertices in total, and there is no need for an index buffer to create the 2 triangles of the quad.

We store the data required to generate the vertices in a StructuredBuffer of SpriteDrawData that is indexed like an array of structs. To know which sprite to draw, we use the instance ID semantic SV_InstanceID, and we handle 4 vertices per instance. To know which vertex we are generating, we use the SV_VertexID semantic. Only vertex IDs 0 through 3 are passed to the shader.

The quad is laid out in a clockwise winding order, with the first vertex at the bottom left. By using some bit manipulation, we can generate the X and Y coordinates in the 0.0 to 1.0 range. The logic is that vertex IDs 2 and greater are always on the right of the quad, and the bottom vertices are always even numbers. We get the X coordinate by shifting right by 1, and the Y coordinate by flipping the least significant bit.

(1)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (3)
XX                           X
X XX                         X
X   XX                       X
X     XX                     X
X       XX                   X
X         XX                 X
X           XX               X
X             XX             X
X               XX           X
X                 XX         X
X                   XX       X
X                     XX     X
X                       XX   X
X                         XX X
X                           XX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (2)
(0)

0, 0b00, x=0, y=1, Bottom Left
1, 0b01, x=0, y=0, Top Left
2, 0b10, x=1, y=1, Bottom Right
3, 0b11, x=1, y=0, Top Right
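The bit tricks above, including the XOR-based flipping, can be checked in a few lines (Python sketch of the shader logic):

```python
def corner(vid):
    # Corner coordinates from the vertex ID: X is bit 1, Y is the
    # inverted bit 0 (so even IDs are at the bottom, y=1).
    return ((vid >> 1) & 1, (vid & 1) ^ 1)

def uv_corner(vid, flip_bits):
    # Flipping XORs the vertex ID: bit 1 = horizontal flip, bit 0 = vertical.
    return corner(vid ^ flip_bits)

# Triangle-strip order: bottom left, top left, bottom right, top right.
assert [corner(v) for v in range(4)] == [(0, 1), (0, 0), (1, 1), (1, 0)]
# Horizontal flip (0b10): the bottom-left vertex samples the bottom-right UV.
assert uv_corner(0, 0b10) == corner(2)
# Vertical flip (0b01): the bottom-left vertex samples the top-left UV.
assert uv_corner(0, 0b01) == corner(1)
```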

Once we have the position and UV coordinates from that bit manipulation, we offset the position by the sprite’s normalized origin and then scale that value by the destination size, which is the sub-rectangle size times the scaling factor. The sub-rectangle is a region inside the texture atlas. We then apply the rotation to the X and Y coordinates manually using dot products (basically applying the 2D rotation matrix by hand), and finally we do the translation.

Note that the flip bits and the texture size are packed into the same variable to save memory; each texture dimension is stored within 14 bits, enough for texture sizes up to 16383 pixels.
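A sketch of that packing (Python, mirroring the shader's masks and shifts):

```python
FLIP_MASK = 0x3
TEXTURE_SIZE_MASK = 16383  # 14 bits per dimension

def pack(flip_bits, width, height):
    # 2 flip bits, then 14 bits of width (bits 2..15) and
    # 14 bits of height (bits 16..29), as in the shader's #defines.
    assert width <= TEXTURE_SIZE_MASK and height <= TEXTURE_SIZE_MASK
    return flip_bits | (width << 2) | (height << 16)

def unpack(value):
    return (value & FLIP_MASK,
            (value >> 2) & TEXTURE_SIZE_MASK,
            (value >> 16) & TEXTURE_SIZE_MASK)

# Round-trips for a typical atlas and the maximum representable size.
assert unpack(pack(0b10, 512, 270)) == (0b10, 512, 270)
assert unpack(pack(0b11, 16383, 16383)) == (0b11, 16383, 16383)
```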

The colors that we pass to the pixel shader need to be unpacked manually, since we are not using a vertex buffer that would interpret these color values as normalized floats.
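Here is a hypothetical Python counterpart of what a helper like unpackRgbaColor could do; the byte order (R in the lowest byte, matching a little-endian RGBA8 layout) is an assumption, not taken from the engine:

```python
def unpack_rgba(packed):
    # Split a packed 32-bit color into normalized floats, assuming the
    # red channel occupies the lowest byte (little-endian RGBA8).
    r = (packed >> 0) & 0xFF
    g = (packed >> 8) & 0xFF
    b = (packed >> 16) & 0xFF
    a = (packed >> 24) & 0xFF
    return (r / 255.0, g / 255.0, b / 255.0, a / 255.0)

assert unpack_rgba(0xFFFFFFFF) == (1.0, 1.0, 1.0, 1.0)
assert unpack_rgba(0x000000FF) == (1.0, 0.0, 0.0, 0.0)  # only the red byte set
```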

On the CPU side before rendering, we sort the sprite and palette sprite data per texture ID so that we can batch as much draw data as possible for each texture.

On render, each sprite and paletted sprite draw command is transformed into GPU-specific draw data (ShaderSpriteDrawData) to be consumed by the sprite vertex shader. The GPU draw data is 64 bytes, a power of two as required by the SPIR-V spec for structured buffers, which is also faster to index on the GPU. Transforming the CPU draw data into the GPU draw data is quite trivial. Because we need to sort the CPU draw data first, we can’t store the draw data directly in the GPU draw data buffer.

const ShaderSpriteDrawData = extern struct {
    position: [3]f32 = @splat(0.0),
    rotation: f32 = 0.0,
    scale: [2]f32 = @splat(1.0),
    origin: [2]f32 = @splat(0.0),
    sub_rect: [4]i32 = @splat(0),
    tint_color: u32 = 0,
    overlay_color: u32 = 0,
    flip_and_texture_size: u32 = 0,
    padding: u32 = 0,
};

for (context.sprite_data) |sprite_draw| {
    try batcher.setTexture(sprite_draw.texture);

    const flip_and_texture_size: u32 = @as(u32, @as(u2, @bitCast(sprite_draw.flip))) | (batcher.last_texture_width << 2) | (batcher.last_texture_height << 16);

    draw_data_slice[batcher.draw_data_start_offset + batcher.written] = .{
        .position = sprite_draw.position,
        .rotation = sprite_draw.rotation,
        .origin = sprite_draw.origin,
        .scale = sprite_draw.scale,
        .sub_rect = .{
            sprite_draw.sub_rect.left,
            sprite_draw.sub_rect.top,
            sprite_draw.sub_rect.right,
            sprite_draw.sub_rect.bottom,
        },
        .tint_color = @bitCast(sprite_draw.tint_color),
        .overlay_color = @bitCast(sprite_draw.overlay_color),
        .flip_and_texture_size = flip_and_texture_size,
    };
    batcher.written += 1;
}

try batcher.flush();

The actual drawing of the current batch can occur either when changing texture or when all sprites have been processed. First we need to allocate a binding group to pass the structured buffer to the vertex shader. I use a pool of at most 1024 binding groups that are created on the fly and recycled every frame once they are no longer used by the GPU. A binding group is a set of binding parameters used by a shader.

Then we issue the drawInstanced command with 4 vertices per instance, the instance count being the number of sprite draw data entries written in the current batch.

pub fn flush(self: *SpriteDrawDataBatcher) !void {
    if (self.written > 0) {
        // TODO: use a better temp allocator
        const temp_allocator = memory.gpa.allocator();

        const batch_byte_size = @sizeOf(ShaderSpriteDrawData) * self.written;

        const draw_binding_group = try self.dynamic_binding_group_allocator.allocate(temp_allocator, .{
            .layout = context.draw_data_binding_layout,
            .entries = &.{
                .{
                    .structured_buffer = .{
                        .buffer = self.draw_buffer,
                        .offset = self.buffer_start_offset + self.draw_data_start_offset * @sizeOf(ShaderSpriteDrawData),
                        .size = batch_byte_size,
                    },
                },
            },
        });

        self.draw_stream.setBindingGroup(PIXEL_BINDING_GROUP_DRAW_DATA, draw_binding_group);

        self.draw_stream.drawInstanced(.{
            .vertex_count_per_instance = 4,
            .instance_count = self.written,
        });

        self.draw_data_start_offset += self.written;
        self.written = 0;

        try self.draw_stream.flush();
    }
}

At first I was using dynamic chunk allocation for the shader draw data's GPU buffer memory, like in the traditional approach. But when I reduced the draw data size for the compute shader approach, the compute shader got way faster, and I started optimizing the vertex pulling approach. So now I pre-allocate the GPU buffer for the draw data using the max sprite budget passed to the pixel_render module, with one allocation per back buffer (3 in our engine). This was first done to simplify the compute shader implementation, but I backported it to vertex pulling to try to match the performance. It does use more GPU memory, but still a pretty small amount of the overall budget (1 Gb) I’ve allocated for GPU memory.

I tested 2 memory access patterns for the StructuredBuffer: first having only the CPU writable buffer, and then adding a GPU-optimized buffer that is copied from the CPU writable buffer. As you can see in the results below, using a GPU-optimized buffer on a PC architecture with non-unified memory has a big impact on vertex shader performance. The buffers are copied using a separate copy queue.

On the Steam Deck during the first rounds of tests, I was getting 24.08 ms (42 FPS) of CPU time. But after looking at the problem on Windows with the Vulkan backend under Intel VTune, I found out that the issue was due to frequent memory allocations in the code that updates descriptor sets (binding groups) in the Vulkan backend. By using the stack instead of heap-allocated memory, vertex pulling became faster than compute on the Steam Deck as well. The difference between having the structured buffer GPU-optimized or not is very minimal there, due to the Steam Deck being an APU with unified memory.

Vertex Pulling shader and pixel_render code

Benchmark results

With the Structured Buffer read directly from the CPU writable buffer:

Resource                     | Value
-----------------------------|-------
GPU Buffer memory            | 0 Mb
GPU Render Target memory     | 1 Mb
Max dynamic data memory used | 71 Mb
Texture memory               | 13 Mb

Resource                                 | 2019 Laptop       | Steam Deck (Desktop Mode) | 2013 Desktop PC
-----------------------------------------|-------------------|---------------------------|------------------
CPU Time (Debug)                         | 21.67 ms (46 FPS) | 39.18 ms (26 FPS)         | 28.63 ms (35 FPS)
CPU Time (ReleaseFast)                   | 3.95 ms (253 FPS) | 9.9 ms (101 FPS)          | 6.86 ms (146 FPS)
gpu.waitForPreviousFrame() (Debug)       | 3.68 ms           | 4.74 ms                   | 5.25 ms
gpu.waitForPreviousFrame() (ReleaseFast) | 1.41 ms           | 4.72 ms                   | 3.65 ms
pixel_render.render() (Debug)            | 2.72 ms           | 5.9 ms                    | 3.26 ms
pixel_render.render() (ReleaseFast)      | 0.82 ms           | 1.58 ms                   | 0.99 ms
GPU Time Graphics (Debug)                | 1.58 ms           | n/a                       | 2.65 ms
GPU Time Graphics (ReleaseFast)          | 1.56 ms           | n/a                       | 2.62 ms
CPU Memory                               | 401 Mb            | 221 Mb                    | 381 Mb
GPU Memory                               | 85 Mb             | 85 Mb                     | 85 Mb
Resolution                               | 2560 x 1440       | 2560 x 1440               | 1920 x 1080

With the Structured Buffer in GPU optimized memory copied from the CPU writable buffer:

Resource                     | Value
-----------------------------|-------
GPU Buffer memory            | 24 Mb
GPU Render Target memory     | 1 Mb
Max dynamic data memory used | 71 Mb
Texture memory               | 13 Mb

Resource                                 | 2019 Laptop       | Steam Deck (Desktop Mode) | 2013 Desktop PC
-----------------------------------------|-------------------|---------------------------|------------------
CPU Time (Debug)                         | 21.4 ms (47 FPS)  | 42.99 ms (23 FPS)         | 28.79 ms (35 FPS)
CPU Time (ReleaseFast)                   | 3.12 ms (312 FPS) | 10.1 ms (100 FPS)         | 6.47 ms (155 FPS)
gpu.waitForPreviousFrame() (Debug)       | 3.22 ms           | 4.74 ms                   | 5.44 ms
gpu.waitForPreviousFrame() (ReleaseFast) | 0.68 ms           | 4.68 ms                   | 2.9 ms
pixel_render.render() (Debug)            | 2.7 ms            | 6.28 ms                   | 3.32 ms
pixel_render.render() (ReleaseFast)      | 0.79 ms           | 1.61 ms                   | 0.98 ms
GPU Time Copy (Debug)                    | 1.58 ms           | n/a                       | n/a
GPU Time Copy (ReleaseFast)              | 0.89 ms           | n/a                       | n/a
GPU Time Graphics (Debug)                | 1.58 ms           | n/a                       | 2.65 ms
GPU Time Graphics (ReleaseFast)          | 0.93 ms           | n/a                       | 2.65 ms
CPU Memory                               | 401 Mb            | 221 Mb                    | 381 Mb
GPU Memory                               | 109 Mb            | 109 Mb                    | 109 Mb
Resolution                               | 2560 x 1440       | 2560 x 1440               | 1920 x 1080

Compute Shader

The compute shader approach uses GPU compute to generate the vertices and write them directly into the vertex and index buffers. According to the SDL3 GPU implementors and their ComputeSpriteBatch example, it provides quite a performance boost over generating the vertices on the CPU.

The compute shader requires the sprite draw data and the batch size as inputs, and the vertex and index buffers as outputs. The batch size is passed via push constants, which are constant data set directly on the command list rather than stored in a GPU buffer.

The sprite draw data is allocated in a CPU-writable GPU buffer and read directly by the compute shader, since the data changes every frame. The vertex and index buffers, however, are GPU-optimized buffers: the CPU never needs to access them, and keeping them in GPU-local memory also makes rasterization faster.

The resources are defined like this in the shader:

```hlsl
struct SpriteVertex
{
    float4 position;
    float2 texture_uv;
    uint tint_color;
    uint overlay_color;
};

struct ComputeConstants
{
    uint batch_size;
};

StructuredBuffer<SpriteDrawData> DrawData : register(t0, space0);
RWStructuredBuffer<SpriteVertex> VertexBuffer : register(u1, space0);
RWStructuredBuffer<uint> IndexBuffer : register(u2, space0);

[[vk::push_constant]]
ConstantBuffer<ComputeConstants> Constants : register(b0, space1);
```

Note that SpriteVertex has exactly the same layout as the vertex buffer used by the traditional approach.

At first I used this SpriteDrawData struct, which weighed in at 80 bytes per draw entry:

```hlsl
struct SpriteDrawData
{
    float3 position;
    float rotation;

    float2 scale;
    float2 origin;

    int4 sub_rect;

    uint tint_color;
    uint overlay_color;
    uint flip_and_extra_data;
    uint padding;

    uint2 texture_size;
    uint2 padding1;
};
```

but then I decided to compact the data as much as possible and reduced it to 64 bytes:

```hlsl
struct SpriteDrawData
{
    float3 position;
    float rotation;

    float2 scale;
    float2 origin;

    int4 sub_rect;

    uint tint_color;
    uint overlay_color;
    uint flip_and_texture_size;
    uint padding;
};
```

By doing so, the compute shader approach became competitive again, demonstrating that padding and alignment matter a lot in GPU programming.

The main function of the compute shader looks like this:

```hlsl
[numthreads(64,1,1)]
void main(uint3 GlobalInvocationID : SV_DispatchThreadID)
{
    uint draw_index = GlobalInvocationID.x;

    if (draw_index >= Constants.batch_size) {
        return;
    }

    SpriteDrawData draw_data = DrawData[draw_index];

    // Get only the flip data
    uint flip = draw_data.flip_and_texture_size & FLIP_MASK;

    // Get the texture size
    uint2 texture_size = uint2(
        (draw_data.flip_and_texture_size >> TEXTURE_WIDTH_SHIFT) & TEXTURE_SIZE_MASK,
        (draw_data.flip_and_texture_size >> TEXTURE_HEIGHT_SHIFT) & TEXTURE_SIZE_MASK
    );

    float2 sub_rect_size = draw_data.sub_rect.zw - draw_data.sub_rect.xy;
    float2 sub_rect_top_left_uv = (float2)draw_data.sub_rect.xy / texture_size;
    float2 sub_rect_size_uv = (float2)sub_rect_size / texture_size;

    float4 normalized_origin = float4(draw_data.origin / sub_rect_size, 0.0f, 0.0f);

    float cos_value = cos(draw_data.rotation);
    float sin_value = sin(draw_data.rotation);

    float2 destination_size = draw_data.scale * sub_rect_size;

    float4x4 scale_matrix = float4x4(
        float4(destination_size.x, 0.0f, 0.0f, 0.0f),
        float4(0.0f, destination_size.y, 0.0f, 0.0f),
        float4(0.0f, 0.0f, 1.0f, 0.0f),
        float4(0.0f, 0.0f, 0.0f, 1.0f)
    );

    float4x4 rotation_matrix = float4x4(
        float4(cos_value, sin_value, 0.0f, 0.0f),
        float4(-sin_value, cos_value, 0.0f, 0.0f),
        float4(0.0f, 0.0f, 1.0f, 0.0f),
        float4(0.0f, 0.0f, 0.0f, 1.0f)
    );

    float4x4 translation_matrix = float4x4(
        float4(1.0f, 0.0f, 0.0f, 0.0f),
        float4(0.0f, 1.0f, 0.0f, 0.0f),
        float4(0.0f, 0.0f, 1.0f, 0.0f),
        float4(draw_data.position.x, draw_data.position.y, draw_data.position.z, 1.0f)
    );

    float4x4 affine_matrix = mul(scale_matrix, mul(rotation_matrix, translation_matrix));

    float4 QUAD_CORNERS[4] = {
        float4(0.0f, 1.0f, 0.0f, 1.0f), // Bottom Left
        float4(0.0f, 0.0f, 0.0f, 1.0f), // Top Left
        float4(1.0f, 1.0f, 0.0f, 1.0f), // Bottom Right
        float4(1.0f, 0.0f, 0.0f, 1.0f), // Top Right
    };

    // Output vertex data
    [unroll]
    for(int vertex_index = 0; vertex_index < 4; ++vertex_index)
    {
        float4 corner_offset = (QUAD_CORNERS[vertex_index] - normalized_origin);

        VertexBuffer[draw_index * 4u + vertex_index].position = mul(corner_offset, affine_matrix);
        VertexBuffer[draw_index * 4u + vertex_index].texture_uv = sub_rect_top_left_uv + (QUAD_CORNERS[vertex_index ^ flip].xy * sub_rect_size_uv);
        VertexBuffer[draw_index * 4u + vertex_index].tint_color = draw_data.tint_color;
        VertexBuffer[draw_index * 4u + vertex_index].overlay_color = draw_data.overlay_color;
    }

    // Output index data
    uint start_vertex = draw_index * 4u;
    IndexBuffer[draw_index * 6u] = start_vertex;
    IndexBuffer[draw_index * 6u + 1] = start_vertex + 1;
    IndexBuffer[draw_index * 6u + 2] = start_vertex + 2;
    IndexBuffer[draw_index * 6u + 3] = start_vertex + 1;
    IndexBuffer[draw_index * 6u + 4] = start_vertex + 3;
    IndexBuffer[draw_index * 6u + 5] = start_vertex + 2;
}
```

One thing about compute shaders that took me a little while to grasp is the relationship between the thread group size and the amount of work to process. It is up to the programmer to decide how the work is divided between the compute units, using a 3D vector to indicate the number of threads per group. In our case I used:

  • 64 threads on the X axis
  • 1 thread on the Y axis
  • 1 thread on the Z axis

But then, how do you tell the compute shader to process a number of entries that is not divisible by 64? With some research, I found out that you can simply pass the number of entries to the shader as a constant and early-out of any invocation whose index is greater than or equal to the batch size.

The rest of the code is quite similar to the traditional CPU approach, except that we use the affine transformation matrices directly here.

On the CPU side during the render process, the sprite and palette sprite draw commands are sorted by texture ID. The compute pass does not care about data order at all, but the graphics draw calls still need to be grouped by texture to reduce GPU state changes.

To simplify the implementation and reduce the number of bind groups to create and track, I preallocated the buffers for the maximum sprite count budget. This is why the memory usage is larger than the other approaches in the benchmark results. Only a single bind group per frame is required to manage the compute shader resources.

The draw loop iterates over the sprite draw commands, emitting draw calls while also preparing the data for the compute shader. setTexture() calls flush() internally if the texture has changed, and flush() generates the draw calls.

```zig
for (context.sprite_data) |sprite_draw| {
    try batcher.setTexture(sprite_draw.texture);

    const flip_and_texture_size: u32 = @as(u32, @as(u2, @bitCast(sprite_draw.flip))) | (batcher.last_texture_width << 2) | (batcher.last_texture_height << 16);

    compute_draw_data_slice[context.compute_draw_data_allocation.written] = .{
        .position = sprite_draw.position,
        .rotation = sprite_draw.rotation,
        .scale = sprite_draw.scale,
        .origin = sprite_draw.origin,
        .sub_rect = .{
            sprite_draw.sub_rect.left,
            sprite_draw.sub_rect.top,
            sprite_draw.sub_rect.right,
            sprite_draw.sub_rect.bottom,
        },
        .tint_color = @bitCast(sprite_draw.tint_color),
        .overlay_color = @bitCast(sprite_draw.overlay_color),
        .flip_and_texture_size = flip_and_texture_size,
    };
    context.compute_draw_data_allocation.written += 1;

    batcher.addQuad();
}

try batcher.flush();
```

After all the sprite and palette sprite draw commands have been processed, the compute shader is dispatched using the compute command stream.

```zig
// Finish compute stream
const batch_size: [1]u32 = .{context.compute_draw_data_allocation.written};
compute_stream.pushConstants(.{
    .layout = context.compute_batch_size_binding_layout,
    .offset = 0,
    .data = batch_size[0..],
});

compute_stream.dispatch(.{
    .thread_group_count_x = try std.math.divCeil(u32, context.compute_draw_data_allocation.written, 64),
    .thread_group_count_y = 1,
    .thread_group_count_z = 1,
});

try compute_stream.flush();
```

Compute uses a separate command stream that is executed on a different queue than graphics. The graphics queue waits for the compute queue to finish before starting its own execution.

The draw call code is very similar to the traditional approach, minus the vertex and index buffer management.

```zig
pub fn addQuad(self: *QuadBatcher) void {
    self.vertex_count += 4;
    self.index_count += 6;
}

pub fn flush(self: *QuadBatcher) !void {
    if (self.vertex_count > 0 and self.index_count > 0) {
        self.draw_stream.drawIndexedInstanced(.{
            .index_count_per_instance = self.index_count,
            .instance_count = 1,
            .start_index_location = self.start_index,
            .base_vertex_location = 0,
        });

        self.start_vertex += self.vertex_count;
        self.start_index += self.index_count;

        self.vertex_count = 0;
        self.index_count = 0;

        try self.draw_stream.flush();
    }
}
```

If you sum the Compute and Graphics queue timings from the benchmark results below, you'll find that together they take more time than the other approaches. However, since the engine is built around waiting on the graphics queue for the previous frame to complete before beginning the next frame, only the execution time of the graphics queue matters in the end when comparing the approaches.

Compute shader and pixel_render code

Benchmark results

| Resource | Value |
| --- | --- |
| GPU Buffer memory | 57 Mb |
| GPU Render Target Memory | 1 Mb |
| Max dynamic data memory used | 77 Mb |
| Texture memory | 13 Mb |

| Resource | 2019 Laptop | Steam Deck (Desktop Mode) | 2013 Desktop PC |
| --- | --- | --- | --- |
| CPU Time (Debug) | 21.38 ms (47 FPS) | 45.39 ms (22 FPS) | 28.82 ms (35 FPS) |
| CPU Time (ReleaseFast) | 3.39 ms (295 FPS) | 11.65 ms (86 FPS) | 6.17 ms (162 FPS) |
| gpu.waitForPreviousFrame() (Debug) | 2.85 ms | 4.95 ms | 5.5 ms |
| gpu.waitForPreviousFrame() (ReleaseFast) | 0.64 ms | 4.89 ms | 2.75 ms |
| pixel_render.render() (Debug) | 3.09 ms | 5.8 ms | 3.63 ms |
| pixel_render.render() (ReleaseFast) | 0.95 ms | 1.56 ms | 1.01 ms |
| GPU Time Graphics (Debug) | 0.90 ms | n/a | 2.37 ms |
| GPU Time Graphics (ReleaseFast) | 0.90 ms | n/a | 2.44 ms |
| GPU Time Compute (Debug) | 1.35 ms | n/a | 1.75 ms |
| GPU Time Compute (ReleaseFast) | 1.60 ms | n/a | 1.59 ms |
| CPU Memory | 401 Mb | 221 Mb | 381 Mb |
| GPU Memory | 148 Mb | 148 Mb | 148 Mb |
| Resolution | 2560 x 1440 | 2560 x 1440 | 1920 x 1080 |

Conclusion

Here are the 2019 Laptop numbers side by side:

| Resource | Traditional | Vertex Pulling (CPU buffer) | Vertex Pulling (GPU buffer) | Compute |
| --- | --- | --- | --- | --- |
| CPU Time (Debug) | 39.17 ms (26 FPS) | 21.67 ms (46 FPS) | 21.4 ms (47 FPS) | 21.38 ms (47 FPS) |
| CPU Time (ReleaseFast) | 7.35 ms (136 FPS) | 3.95 ms (253 FPS) | 3.12 ms (312 FPS) | 3.39 ms (295 FPS) |
| gpu.waitForPreviousFrame() (Debug) | 2.7 ms | 3.68 ms | 3.22 ms | 2.85 ms |
| gpu.waitForPreviousFrame() (ReleaseFast) | 1.98 ms | 1.41 ms | 0.68 ms | 0.64 ms |
| pixel_render.render() (Debug) | 18.52 ms | 2.72 ms | 2.70 ms | 3.09 ms |
| pixel_render.render() (ReleaseFast) | 3.06 ms | 0.82 ms | 0.79 ms | 0.95 ms |
| GPU Time Graphics (Debug) | 1.90 ms | 1.58 ms | 1.58 ms | 0.90 ms |
| GPU Time Graphics (ReleaseFast) | 1.79 ms | 1.56 ms | 0.93 ms | 0.90 ms |
| CPU Memory | 400 Mb | 401 Mb | 401 Mb | 401 Mb |
| GPU Memory | 76 Mb | 85 Mb | 109 Mb | 148 Mb |

On the CPU side, I had to include the wait on the previous frame's GPU fence to illustrate why the Vertex Pulling approach with the CPU-writable buffer was slower in overall CPU time even though its pixel_render.render() function was faster.

The Vertex Pulling and Compute approaches are quite close if you read the vertex pulling data from a GPU-optimized buffer on a non-uniform memory architecture (NUMA), like the majority of PCs. However, storing the maximum possible number of vertices and indices for the Compute approach takes its toll on the GPU memory budget, because in addition to the vertex and index data, it needs to store the sprite draw data used to generate the vertices.

You can also see that the execution times of the two Vertex Pulling variants differ greatly because of the memory access patterns. When using a GPU-optimized buffer, the GPU time is slightly slower than the Compute approach, but since the CPU side of Vertex Pulling is faster, it wins on my 2019 laptop in overall CPU time.

If you are wondering why the traditional approach's GPU execution time is slow, it's because its vertex and index buffers are marked CPU-writable.

As I predicted, Vertex Pulling is the best compromise for our engine when handling lots of sprites. Compute is quite close and is also a good choice if you prefer that approach. However, I prefer Vertex Pulling because it requires less shader code and uses less GPU memory overall, even when you pay for the buffer memory twice to get a GPU-optimized copy for rasterization.

BioMech Catalyst will never need to render 100,000 sprites per frame, far from it. With this test scene, the Debug configuration is quite unusable for development, but as long as I can maintain 60 FPS in Debug during development I will be quite happy. The target frame rate for release is at least 120 FPS.

It was fun going on this journey to find the perfect sprite rendering pipeline for our 2D pixel art engine and challenging my assumptions about modern CPU and GPU hardware. I hope you enjoyed the journey as well!

If you want to discuss this article or follow the development of our game, please join our Discord!

Bonus - Bindless

With Vertex Pulling as the winning approach, I decided to test a bindless design for the texture parameters, so that we issue even fewer draw commands to the GPU and save on binding group changes.

The idea is that instead of binding one texture at a time in the pixel shader:

```hlsl
SamplerState TextureSampler : register(s0, space1);

Texture2D Texture : register(t0, space2);

float4 main(SpritePixelInput input) : SV_TARGET
{
    float4 pixel = Texture.Sample(TextureSampler, input.texture_uv) * input.tint_color;

    // Discard transparent pixels from the depth buffer
    if (pixel.a <= 0.01) {
        discard;
    }

    return pixel;
}
```
we use an unbounded texture array, allowing the shader to access any texture currently bound inside that array:

```hlsl
SamplerState TextureSampler : register(s0, space2);

Texture2D Textures[] : register(t0, space3);

float4 main(SpritePixelInputBindless input) : SV_TARGET
{
    int texture_id = input.texture_and_palette_id & 0xFFFF;

    float4 pixel = Textures[NonUniformResourceIndex(texture_id)].Sample(TextureSampler, input.texture_uv) * input.tint_color;

    // Discard transparent pixels from the depth buffer
    if (pixel.a <= 0.01) {
        discard;
    }

    return pixel;
}
```

On NVIDIA GPUs, NonUniformResourceIndex() was not required to access the texture, but on AMD GPUs it was required to get bindless working. See the Resource types and arrays section in the HLSL documentation for more information.

In the render() loop that processes the sprite draw commands, we no longer need to flush the batch state on every texture change and can issue a single draw call per pipeline state object (PSO). We only need to bind the group once per PSO.

```zig
pub fn setTexture(self: *SpriteDrawDataBatcher, texture: gpu.TextureResource) !void {
    if (!self.last_texture.data.equals(texture.data)) {
        try self.flush();

        const texture_resource = try gpu.getTextureResource(texture);

        self.draw_stream.setBindingGroup(PIXEL_BINDING_GROUP_TEXTURE, texture_resource.binding_group);
        self.last_texture = texture;
        self.last_texture_width = texture_resource.width;
        self.last_texture_height = texture_resource.height;
    }
}
```

```zig
pub fn setTextureBindless(self: *SpriteDrawDataBatcher, texture: gpu.TextureResource) !void {
    if (!self.last_texture.data.equals(texture.data)) {
        const texture_resource = try gpu.getTextureResource(texture);

        self.last_texture = texture;
        self.last_texture_width = texture_resource.width;
        self.last_texture_height = texture_resource.height;
    }
}
```

We still need to fetch the texture resource to obtain the texture ID, width, and height.

To be honest, I was expecting a small improvement in frame time, but it turns out the impact is negligible, giving a similar frame time to the Vertex Pulling with GPU buffer approach. The improvement is on the code side, where it simplifies the resource binding. It just shows that draw call count does not matter as much as it used to with older graphics APIs, at least for a 2D pixel art game.

Bindless shader and pixel_render code

Benchmark results

| Resource | Value |
| --- | --- |
| GPU Buffer memory | 24 Mb |
| GPU Render Target Memory | 1 Mb |
| Max dynamic data memory used | 71 Mb |
| Texture memory | 13 Mb |
| Draw calls | 8 |

| Resource | 2019 Laptop | Steam Deck (Desktop Mode) | 2013 Desktop PC |
| --- | --- | --- | --- |
| CPU Time (Debug) | 18.06 ms (55 FPS) | 40.35 ms (25 FPS) | 25.92 ms (39 FPS) |
| CPU Time (ReleaseFast) | 3.22 ms (311 FPS) | 10.46 ms (96 FPS) | 6.49 ms (154 FPS) |
| gpu.waitForPreviousFrame() (Debug) | 3.03 ms | 5.26 ms | 5.3 ms |
| gpu.waitForPreviousFrame() (ReleaseFast) | 0.70 ms | 4.81 ms | 2.95 ms |
| pixel_render.render() (Debug) | 2.54 ms | 4.81 ms | 3.15 ms |
| pixel_render.render() (ReleaseFast) | 0.76 ms | 1.43 ms | 0.97 ms |
| GPU Time Copy (Debug) | 0.71 ms | n/a | n/a |
| GPU Time Copy (ReleaseFast) | 0.83 ms | n/a | n/a |
| GPU Time Graphics (Debug) | 1.07 ms | n/a | 3.33 ms |
| GPU Time Graphics (ReleaseFast) | 1.11 ms | n/a | 2.74 ms |
| CPU Memory | 400 Mb | 129 Mb | 381 Mb |
| GPU Memory | 109 Mb | 109 Mb | 109 Mb |
| Resolution | 2560 x 1440 | 2560 x 1440 | 1920 x 1080 |

Art sources