Texture Streamer

Overview

The TextureStreamer is a core system component that provides asynchronous texture loading for the Waxed compositor. It enables butter-smooth GUI experiences by offloading texture loading operations to a dedicated worker thread, ensuring the main render thread never blocks on disk I/O or GPU uploads.

Key Design Principles:

Non-blocking: All texture operations happen off the main thread
Zero-GPU-wait: Textures are marked ready immediately after command submission, not after GPU completion
Implicit synchronization: Queue submission order plus the pipeline barriers recorded with each upload handle GPU-side ordering, so the CPU never waits
Ring buffer architecture: Pre-allocated slots minimize runtime allocations
Ownership transfer: Consumers take ownership of ready textures, enabling efficient resource management

TextureSlot Structure

The TextureSlot represents a single texture’s lifecycle in the streaming system. Each slot contains all Vulkan resources needed for a texture and maintains its own state machine.

Members

Member	Type	Description
`image`	`vk::raii::Image`	The Vulkan image handle
`view`	`vk::raii::ImageView`	Image view for shader access
`memory`	`vk::raii::DeviceMemory`	Device-local memory backing
`descriptor_set`	`vk::raii::DescriptorSet`	Optional descriptor set for binding
`width`	`uint32_t`	Texture width (post-resize)
`height`	`uint32_t`	Texture height (post-resize)
`stride`	`uint32_t`	Row stride in bytes (width * 4)
`format`	`VkFormat`	Pixel format (always `VK_FORMAT_R8G8B8A8_UNORM`)
`sequence`	`uint64_t`	Unique request identifier
`state`	`atomic<State>`	Current slot state (see below)
`on_ready`	`function`	Optional per-slot callback
`dma_buf_fd`	`UniqueFd`	Exported DMA-BUF file descriptor
`dma_buf_modifier`	`uint64_t`	DRM format modifier (0 = LINEAR)

State Machine

Each TextureSlot progresses through four states. Empty, Loading, and Ready are the ones referenced throughout this document.

Thread Safety

state is std::atomic<State> and can be read safely from any thread
All other members are only accessed by the worker thread while in Loading state
Once Ready, the slot is read-only for consumer threads until take_texture() moves its resources out
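
The rules above amount to a publish/observe protocol on the atomic state. The following is a minimal sketch of that idea, not the actual TextureSlot: `SlotSketch`, `publish_ready`, and `try_read` are hypothetical names, only three states are modelled, and the memory orderings shown are an assumption about how the real code could be written.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for TextureSlot with only the fields needed here.
enum class State : std::uint8_t { Empty, Loading, Ready };

struct SlotSketch {
    std::uint32_t width = 0;
    std::uint32_t height = 0;
    std::atomic<State> state{State::Empty};
};

// Worker side: write the plain fields first, then publish with a release store.
inline void publish_ready(SlotSketch& slot, std::uint32_t w, std::uint32_t h) {
    assert(slot.state.load(std::memory_order_relaxed) == State::Loading);
    slot.width = w;
    slot.height = h;
    slot.state.store(State::Ready, std::memory_order_release);
}

// Consumer side: an acquire load that observes Ready also makes the worker's
// earlier writes to width and height visible to this thread.
inline bool try_read(const SlotSketch& slot, std::uint32_t& w, std::uint32_t& h) {
    if (slot.state.load(std::memory_order_acquire) != State::Ready) {
        return false;
    }
    w = slot.width;
    h = slot.height;
    return true;
}
```

The release/acquire pair is what allows the non-atomic members to stay plain fields: a consumer only touches them after it has seen Ready.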

Ring Buffer Architecture

The TextureStreamer uses a fixed-size ring buffer of TextureSlot objects. This design provides:

Predictable memory usage: Pre-allocated at initialization
No fragmentation: Slots are reused rather than reallocated
Cache efficiency: Contiguous memory layout
Simple lifecycle: Slots cycle through states independently

Ring Buffer Layout

Slot Allocation

find_empty_slot(): Linear search for the first slot in the Empty state
Returns nullptr if no slot is available (the caller must retry or handle the error)
Slots are not pre-reserved; the first Empty slot found wins
After take_texture(), the slot is reset to its default state and becomes Empty again
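
As an illustration of the allocation rules above, here is a small sketch of a linear search over a fixed ring. `RingSlot` and `SlotState` are hypothetical stand-ins, and claiming the slot with a compare-exchange is an assumption made for the sketch; the real find_empty_slot() may only locate the slot and leave the state change to its caller.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

enum class SlotState { Empty, Loading, Ready };

// Hypothetical slot carrying only the state field.
struct RingSlot {
    std::atomic<SlotState> state{SlotState::Empty};
};

// Linear search: the first Empty slot wins and is moved to Loading.
// Returns nullptr when every slot is busy.
template <std::size_t N>
RingSlot* find_empty_slot(std::array<RingSlot, N>& ring) {
    for (RingSlot& slot : ring) {
        SlotState expected = SlotState::Empty;
        if (slot.state.compare_exchange_strong(expected, SlotState::Loading)) {
            return &slot;
        }
    }
    return nullptr;
}
```

With a ring of three slots, three consecutive calls hand out three distinct slots and the fourth returns nullptr until some slot is reset to Empty.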

Worker Thread Operation

The worker thread runs continuously from construction until destruction, processing load requests from a thread-safe queue.

Worker Loop Flow

Thread Synchronization

Component	Type	Purpose
`request_mutex_`	`std::mutex`	Protects `requests_` queue
`request_cv_`	`std::condition_variable`	Wakes worker for new requests
`shutdown_requested_`	`std::atomic<bool>`	Signals worker to exit
`slot.state`	`std::atomic<State>`	Cross-thread state visibility
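
The primitives in the table combine into a conventional condition-variable worker loop. The sketch below is one way such a loop could look, under stated assumptions: `WorkerSketch`, `push`, and `processed` are hypothetical, requests are reduced to bare sequence numbers, and draining the queue before exit is a choice made for the sketch rather than documented behaviour.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>

class WorkerSketch {
public:
    WorkerSketch() : worker_([this] { run(); }) {}
    ~WorkerSketch() { shutdown(); }

    void push(std::uint64_t sequence) {
        {
            std::lock_guard<std::mutex> lock(request_mutex_);
            requests_.push(sequence);
        }
        request_cv_.notify_one();
    }

    void shutdown() {
        {
            // Set the flag under the lock so the worker cannot miss the wakeup.
            std::lock_guard<std::mutex> lock(request_mutex_);
            shutdown_requested_.store(true);
        }
        request_cv_.notify_all();
        if (worker_.joinable()) worker_.join();
    }

    std::uint64_t processed() const { return processed_.load(); }

private:
    void run() {
        for (;;) {
            std::uint64_t sequence = 0;
            {
                std::unique_lock<std::mutex> lock(request_mutex_);
                request_cv_.wait(lock, [this] {
                    return shutdown_requested_.load() || !requests_.empty();
                });
                if (requests_.empty()) return;  // shutdown requested, queue drained
                sequence = requests_.front();
                requests_.pop();
            }
            // Decode and upload would happen here, outside the lock.
            (void)sequence;
            processed_.fetch_add(1);
        }
    }

    std::mutex request_mutex_;
    std::condition_variable request_cv_;
    std::queue<std::uint64_t> requests_;
    std::atomic<bool> shutdown_requested_{false};
    std::atomic<std::uint64_t> processed_{0};
    std::thread worker_;  // declared last so the members above exist before run() starts
};
```

Holding the mutex only while touching the queue keeps request_load() callers from stalling behind a long decode or upload.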

Public API

Initialization

TextureStreamer(
    vk::raii::Instance& instance,
    vk::raii::PhysicalDevice& physical_device,
    vk::raii::Device& device,
    vk::raii::Queue& transfer_queue,
    uint32_t transfer_queue_family,
    const Config& config
);

Note: The constructor takes references to the caller's vk::raii objects and does not take ownership of them. The caller must ensure these objects outlive the TextureStreamer.

Configuration

struct Config {
    size_t ring_buffer_size = 3;              // Number of pre-allocated slots
    size_t min_staging_buffer_size = 32 * 1024 * 1024;  // Minimum staging buffer size in bytes (32 MiB)
    bool enable_callback = true;              // Enable callback notification
    TextureReadyCallback on_texture_ready;    // Global callback
};

Requesting a Texture Load

auto request_load(std::string_view path,
                  uint32_t target_width,
                  uint32_t target_height,
                  int mode) -> uint64_t;

Parameters:

path: Filesystem path to image file (PNG, JPEG, etc. via stb_image)
target_width: Desired output width (0 = use original)
target_height: Desired output height (0 = use original)
mode: Resize mode (0=contain, 1=cover, 2=stretch, 3=tile)

Returns: Sequence number for tracking this request

Behavior:

Generates sequence number via atomic fetch-add
Pushes LoadRequest to queue
Notifies worker thread
Returns immediately (non-blocking)
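
The producer side of these steps can be sketched as follows. This is an illustration only: `RequestQueueSketch` and `LoadRequestSketch` are hypothetical, the starting sequence value of 1 is an assumption, and copying the path into a std::string is a choice made here because a std::string_view may not outlive the call.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <string>
#include <string_view>

struct LoadRequestSketch {
    std::string path;  // owned copy of the caller's string_view
    std::uint32_t target_width;
    std::uint32_t target_height;
    int mode;
    std::uint64_t sequence;
};

class RequestQueueSketch {
public:
    std::uint64_t request_load(std::string_view path, std::uint32_t target_width,
                               std::uint32_t target_height, int mode) {
        // fetch_add gives every caller a unique number without taking the lock.
        const std::uint64_t sequence =
            next_sequence_.fetch_add(1, std::memory_order_relaxed);
        {
            std::lock_guard<std::mutex> lock(request_mutex_);
            requests_.push_back(LoadRequestSketch{
                std::string(path), target_width, target_height, mode, sequence});
        }
        request_cv_.notify_one();  // wake the worker; the caller returns at once
        return sequence;
    }

    std::size_t pending_count() const {
        std::lock_guard<std::mutex> lock(request_mutex_);
        return requests_.size();
    }

private:
    mutable std::mutex request_mutex_;
    std::condition_variable request_cv_;
    std::deque<LoadRequestSketch> requests_;
    std::atomic<std::uint64_t> next_sequence_{1};  // starting value is an assumption
};
```

Two back-to-back calls return consecutive sequence numbers, and pending_count() reports both requests until a worker removes them.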

Checking Status

auto get_ready_texture(uint64_t sequence) -> const TextureSlot*;

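Returns a pointer to the slot if it is Ready, nullptr otherwise. The pointer refers to the streamer's internal slot, so it should not be kept past a take_texture() call for the same sequence.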

auto get_freshest_ready() -> const TextureSlot*;

Returns the ready texture with the highest sequence number, or nullptr.

auto is_ready(uint64_t sequence) const -> bool;

Simple boolean check for readiness.
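
The two lookup functions can be pictured as scans over the ring. The sketch below assumes a hypothetical `SeqSlot` holding only a sequence number and state, and passes the ring explicitly instead of using a member; the real functions take only the arguments shown above.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class ReadyState { Empty, Loading, Ready };

struct SeqSlot {
    std::uint64_t sequence = 0;
    std::atomic<ReadyState> state{ReadyState::Empty};
};

// Returns the slot for `sequence` only if it has reached Ready.
template <std::size_t N>
const SeqSlot* get_ready_texture(const std::array<SeqSlot, N>& ring,
                                 std::uint64_t sequence) {
    for (const SeqSlot& slot : ring) {
        if (slot.state.load(std::memory_order_acquire) == ReadyState::Ready &&
            slot.sequence == sequence) {
            return &slot;
        }
    }
    return nullptr;
}

// Returns the Ready slot with the highest sequence number, or nullptr.
template <std::size_t N>
const SeqSlot* get_freshest_ready(const std::array<SeqSlot, N>& ring) {
    const SeqSlot* best = nullptr;
    for (const SeqSlot& slot : ring) {
        if (slot.state.load(std::memory_order_acquire) != ReadyState::Ready) {
            continue;
        }
        if (best == nullptr || slot.sequence > best->sequence) {
            best = &slot;
        }
    }
    return best;
}
```

Slots that are still Loading are skipped even when their sequence number is higher, which is why "freshest" means the newest finished load rather than the newest request.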

Taking Ownership

auto take_texture(uint64_t sequence, TextureSlot& out_slot) -> bool;

Ownership Transfer Semantics:

Moves all Vulkan resources from the internal slot to out_slot
Resets internal slot to default state (Empty)
Consumer now owns the texture and is responsible for its lifetime

Usage Pattern:

TextureSlot my_texture;
if (streamer.take_texture(sequence, my_texture)) {
    // my_texture now owns the Vulkan resources
    // Use my_texture.image, my_texture.view, etc.
} // my_texture's RAII members release the Vulkan resources when it goes out of scope
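
For intuition about what the transfer involves internally, here is a sketch using a move-only stand-in for the Vulkan handles. `OwnedSlot` is hypothetical, a std::unique_ptr<int> plays the role of vk::raii::Image, and the free-function signature with an explicit internal slot differs from the member function documented above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

enum class OwnState { Empty, Loading, Ready };

struct OwnedSlot {
    std::unique_ptr<int> image;  // stand-in for a move-only vk::raii handle
    std::uint64_t sequence = 0;
    std::atomic<OwnState> state{OwnState::Empty};
};

// Moves the resources out of `internal` into `out` and frees the internal slot.
// Fails without side effects if the slot is not Ready or holds another request.
inline bool take_texture(OwnedSlot& internal, std::uint64_t sequence, OwnedSlot& out) {
    if (internal.state.load(std::memory_order_acquire) != OwnState::Ready) {
        return false;
    }
    if (internal.sequence != sequence) {
        return false;
    }
    out.image = std::move(internal.image);
    out.sequence = std::exchange(internal.sequence, 0);
    out.state.store(OwnState::Ready, std::memory_order_relaxed);
    // Publishing Empty last lets the worker reuse the slot only after the move.
    internal.state.store(OwnState::Empty, std::memory_order_release);
    return true;
}
```

A second take_texture() for the same sequence fails, because the first call already returned the slot to Empty.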

Queue Status

auto pending_count() const -> size_t;

Returns number of requests currently waiting in the queue (not yet started).

Shutdown

void shutdown();

Called automatically by the destructor. Signals the worker thread to exit, joins it, and waits for outstanding GPU operations to finish before resources are released.

Async Load Pipeline

request_load() Flow

The “No GPU Wait” Optimization

The critical design decision: the slot is marked Ready immediately after queue.submit(), not after GPU completion.

Why this works:

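Vulkan orders work by submission order within a queue, and the pipeline barrier recorded after the copy makes the uploaded data available before any later shader read
The consumer can bind the texture immediately; the GPU stalls the dependent work as needed, not the CPU
No vkQueueWaitIdle() or CPU-side fence wait is required before marking the slot Ready
This holds when the consumer submits to the same queue as the upload; if rendering uses a different queue than transfer_queue, a semaphore is also needed, because a pipeline barrier only orders work within one queue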

Latency comparison:

Traditional (wait):
  submit -> waitIdle -> callback: ~5-15ms

TextureStreamer (fire-and-forget):
  submit -> callback: ~0.1ms

Staging Buffer Management

The staging buffer is a CPU-visible, GPU-accessible buffer used as an intermediate step for texture uploads.

Lifecycle

Created on-demand: First call to load_and_submit()
Dynamically resized: Grows if current size is insufficient
Host-coherent: No manual flushing required
Persisted: Reused across all loads (not recreated per texture)

Sizing Logic

required_size = image_width * image_height * 4 (RGBA)

if (required_size < min_staging_size):
    required_size = min_staging_size

actual_size = required_size * 1.25  (25% headroom)

The 25% headroom reduces reallocations when loading images of varying sizes.
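
The sizing rule above can be written as a small function. This is a sketch: `staging_buffer_size` is a hypothetical name, and integer arithmetic is used for the 25% headroom to avoid floating-point rounding.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes to allocate for the staging buffer for a width x height RGBA8 image.
inline std::size_t staging_buffer_size(std::uint32_t width, std::uint32_t height,
                                       std::size_t min_staging_size) {
    const std::size_t pixel_bytes =
        static_cast<std::size_t>(width) * static_cast<std::size_t>(height) * 4;
    const std::size_t required = std::max(pixel_bytes, min_staging_size);
    return required + required / 4;  // add 25% headroom
}
```

For a 1920x1080 image with the default 32 MiB minimum, the image needs 8,294,400 bytes, the minimum applies, and the allocation is 41,943,040 bytes (40 MiB). A 4096x4096 image needs 64 MiB, so the allocation grows to 83,886,080 bytes (80 MiB).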

Memory Properties

vk::MemoryPropertyFlagBits::eHostVisible |   // CPU can access
vk::MemoryPropertyFlagBits::eHostCoherent   // Writes auto-visible to GPU

DMA-BUF Export Capability

The TextureStreamer supports exporting textures as DMA-BUF file descriptors for zero-copy sharing with other system components (e.g., video encoders, display servers).

Export Process

auto export_slot_to_dma_buf(uint64_t sequence) -> bool;

Requirements:

Slot must be in Ready state
Image must be created with VK_IMAGE_TILING_LINEAR so the exported buffer matches DRM_FORMAT_MOD_LINEAR (already done)
Image memory must be allocated with export info (already done)

After Export:

slot.dma_buf_fd contains the exported file descriptor
slot.dma_buf_modifier is set to DRM_FORMAT_MOD_LINEAR (0)
FD ownership is managed by UniqueFd (RAII)
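
The document does not show UniqueFd itself. As a rough picture of what such an RAII wrapper typically looks like on POSIX systems, here is a sketch; `UniqueFdSketch` and all of its members are assumptions, not the actual class.

```cpp
#include <cassert>
#include <utility>

#include <unistd.h>  // ::close (POSIX)

// Move-only owner of a file descriptor; -1 means "no descriptor".
class UniqueFdSketch {
public:
    UniqueFdSketch() = default;
    explicit UniqueFdSketch(int fd) : fd_(fd) {}
    ~UniqueFdSketch() { reset(); }

    UniqueFdSketch(const UniqueFdSketch&) = delete;
    UniqueFdSketch& operator=(const UniqueFdSketch&) = delete;

    UniqueFdSketch(UniqueFdSketch&& other) noexcept : fd_(other.release()) {}
    UniqueFdSketch& operator=(UniqueFdSketch&& other) noexcept {
        if (this != &other) {
            reset(other.release());
        }
        return *this;
    }

    int get() const { return fd_; }
    bool valid() const { return fd_ >= 0; }

    // Gives up ownership without closing; the caller must close the result.
    int release() { return std::exchange(fd_, -1); }

    // Closes the current descriptor, if any, and adopts `fd`.
    void reset(int fd = -1) {
        if (fd_ >= 0) {
            ::close(fd_);
        }
        fd_ = fd;
    }

private:
    int fd_ = -1;
};
```

Because the wrapper is move-only, moving a TextureSlot also moves the descriptor, and exactly one owner closes it.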

DMA-BUF Creation Details

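The final image is created with external memory capabilities, which requires the VK_KHR_external_memory_fd and VK_EXT_external_memory_dma_buf device extensions to be enabled: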

vk::ExternalMemoryImageCreateInfo ext_mem_info;
ext_mem_info.setHandleTypes(vk::ExternalMemoryHandleTypeFlagBits::eDmaBufEXT);

vk::ImageCreateInfo image_info;
image_info.setPNext(&ext_mem_info);
image_info.setTiling(vk::ImageTiling::eLinear);  // Linear layout, matching DRM_FORMAT_MOD_LINEAR on export

Memory allocation includes export info:

vk::ExportMemoryAllocateInfo export_alloc_info;
export_alloc_info.setHandleTypes(vk::ExternalMemoryHandleTypeFlagBits::eDmaBufEXT);

vk::MemoryAllocateInfo alloc_info;
alloc_info.setPNext(&export_alloc_info);

Integration with BackgroundService

The BackgroundService uses TextureStreamer to load and cycle through background images smoothly.

Integration Pattern

// At initialization
TextureStreamer streamer(instance, physical_device, device,
                         transfer_queue, transfer_queue_family,
                         {.ring_buffer_size = 3,
                          .on_texture_ready = [&background](auto seq, auto& slot) {
                              background.on_texture_ready(seq, slot);
                          }});

// Request next background
streamer.request_load(next_image_path, width, height, mode);

// In callback
void BackgroundService::on_texture_ready(uint64_t seq, const TextureSlot& slot) {
    // Take ownership when ready
    if (streamer.take_texture(seq, pending_texture_)) {
        // Schedule swap for next frame. If this callback runs on the worker
        // thread, texture_pending_ needs to be a std::atomic<bool>.
        texture_pending_ = true;
    }
}

// During render
if (texture_pending_) {
    current_texture_ = std::move(pending_texture_);
    texture_pending_ = false;
}

This pattern ensures:

Multiple images can be in flight simultaneously
No blocking on the render thread
Smooth transitions between backgrounds

Thread Safety Guarantees

Safe Operations from Any Thread

request_load() - lock-protected queue push
is_ready() - reads atomic state
pending_count() - lock-protected queue size

Worker-Thread-Only Operations

All TextureSlot mutation (except state)
load_and_submit() - entire pipeline
ensure_staging_buffer() - buffer management

Consumer-Thread Operations

get_ready_texture() - reads slot state
get_freshest_ready() - reads slot states
take_texture() - moves from slot (state must be Ready)

Atomicity Notes

slot.state is always std::atomic<State> for cross-thread visibility
next_sequence_ uses atomic fetch-add for unique IDs
shutdown_requested_ is atomic for clean termination

Summary Diagram

Complete Request Lifecycle

Key Takeaways

Zero blocking: Main thread never waits for disk I/O or GPU operations
Fire-and-forget: Submit commands and mark ready immediately
Ring buffer: Fixed slots for predictable memory usage
Ownership transfer: take_texture() moves resources to consumer
DMA-BUF ready: Can export for zero-copy sharing
Thread-safe: Careful atomic and mutex usage for correctness
Vulkan RAII: All resources managed automatically