333x Faster: Moving Selection Bounding Rect from CPU to GPU
One of the most dramatic performance wins in Mojave Paint came from a seemingly simple operation: finding the bounding rectangle of a selection.
The selection mask is a single-channel r16Float texture where white means selected and black means unselected. To crop to a selection, you need the bounding rect — the smallest rectangle containing all selected pixels. The naive approach reads the entire texture to the CPU and scans every pixel:
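// mask: the selection pixels copied into a CPU-side array, row-major.
// Start with an "empty" rect so the first selected pixel replaces every bound.
var minX = width, minY = height, maxX = -1, maxY = -1
for y in 0..<height {
    for x in 0..<width {
        if mask[y * width + x] > 0 {
            minX = min(minX, x)
            minY = min(minY, y)
            maxX = max(maxX, x)
            maxY = max(maxY, y)
        }
    }
}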
On a 5640x3752 image, this took 2.7 seconds in a debug build. Swift debug builds are notoriously slow for tight loops: no optimizations, and full safety checks on every array access. Even in a release build it's not great, because you still have to read back roughly 21 million pixels from GPU memory before the scan can start.
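To put the readback cost in numbers, here is a quick back-of-the-envelope calculation. It assumes the mask is tightly packed at 2 bytes per pixel (one 16-bit channel) and that a GPU-side version hands back four 32-bit unsigned integers; the constant names are only for illustration.

```swift
// Rough readback sizes for the 5640x3752 example (assumes no row padding).
let imageWidth = 5640
let imageHeight = 3752
let bytesPerPixel = 2  // r16Float: a single 16-bit float channel

let pixelCount = imageWidth * imageHeight                 // 21_161_280 pixels
let fullReadbackBytes = pixelCount * bytesPerPixel        // 42_322_560 bytes, about 42 MB
let boundsReadbackBytes = 4 * MemoryLayout<UInt32>.size   // 16 bytes: minX, minY, maxX, maxY

print("CPU scan reads back \(fullReadbackBytes) bytes")
print("GPU bounds read back \(boundsReadbackBytes) bytes")
```

So even before any scanning happens, the CPU path moves about 42 MB across the bus, while a GPU-side result fits in 16 bytes.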
The fix was a Metal compute shader using atomic operations:
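#include <metal_stdlib>
using namespace metal;

// bounds = {minX, minY, maxX, maxY}. The host must seed it before each
// dispatch so the mins start high and the maxes start at 0,
// e.g. {UINT_MAX, UINT_MAX, 0, 0}.
kernel void selectionBoundingRect(
    texture2d<float, access::read> mask [[texture(0)]],
    device atomic_uint* bounds [[buffer(0)]],
    uint2 gid [[thread_position_in_grid]]
) {
    // Skip threads outside the texture in case the grid was rounded up to whole threadgroups.
    if (gid.x >= mask.get_width() || gid.y >= mask.get_height()) return;

    // One thread per pixel: any nonzero mask value counts as selected.
    if (mask.read(gid).r > 0) {
        atomic_fetch_min_explicit(&bounds[0], gid.x, memory_order_relaxed);
        atomic_fetch_min_explicit(&bounds[1], gid.y, memory_order_relaxed);
        atomic_fetch_max_explicit(&bounds[2], gid.x, memory_order_relaxed);
        atomic_fetch_max_explicit(&bounds[3], gid.y, memory_order_relaxed);
    }
}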
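Every GPU thread checks one pixel. If it's selected, it atomically updates the min/max bounds. The bounds buffer has to be initialized before the dispatch (minimums high, maximums at zero) so that the first selected pixel always wins; if the minimums are still at their initial value afterwards, the selection is empty. After the dispatch, you read back just 4 integers. The result: ~8ms, roughly a 333x speedup over the 2.7-second debug-build scan. Crop to Selection went from painfully slow to instant.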
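For completeness, here is a minimal sketch of what the host side of such a dispatch might look like in Swift. This is not the app's actual code: the function names `selectionRect(fromBounds:)` and `computeSelectionBounds` are hypothetical, the UInt32.max / 0 seed values are one reasonable choice rather than a confirmed detail, and it assumes the pipeline was built from the `selectionBoundingRect` kernel on a GPU that supports non-uniform threadgroup sizes.

```swift
import Metal

/// Interprets the kernel's four values as (minX, minY, maxX, maxY).
/// Returns nil when the min slots still hold the seed, i.e. nothing was selected.
func selectionRect(fromBounds b: [UInt32]) -> (x: Int, y: Int, width: Int, height: Int)? {
    precondition(b.count == 4, "expected minX, minY, maxX, maxY")
    guard b[0] != UInt32.max, b[1] != UInt32.max else { return nil }
    let minX = Int(b[0]), minY = Int(b[1]), maxX = Int(b[2]), maxY = Int(b[3])
    // Max values are inclusive pixel coordinates, hence the +1.
    return (minX, minY, maxX - minX + 1, maxY - minY + 1)
}

/// Hypothetical wrapper: runs the kernel over the mask and reads back 16 bytes.
func computeSelectionBounds(mask: MTLTexture,
                            pipeline: MTLComputePipelineState,
                            queue: MTLCommandQueue) -> (x: Int, y: Int, width: Int, height: Int)? {
    // Assumed seed: mins start at UInt32.max, maxes at 0.
    let seed: [UInt32] = [.max, .max, 0, 0]
    let byteCount = seed.count * MemoryLayout<UInt32>.stride
    guard let bounds = queue.device.makeBuffer(bytes: seed, length: byteCount,
                                               options: .storageModeShared),
          let commandBuffer = queue.makeCommandBuffer(),
          let encoder = commandBuffer.makeComputeCommandEncoder() else { return nil }

    encoder.setComputePipelineState(pipeline)
    encoder.setTexture(mask, index: 0)
    encoder.setBuffer(bounds, offset: 0, index: 0)

    // One thread per pixel. dispatchThreads handles grids that are not a
    // multiple of the threadgroup size (needs non-uniform threadgroup support).
    let groupWidth = pipeline.threadExecutionWidth
    let groupHeight = pipeline.maxTotalThreadsPerThreadgroup / groupWidth
    encoder.dispatchThreads(MTLSize(width: mask.width, height: mask.height, depth: 1),
                            threadsPerThreadgroup: MTLSize(width: groupWidth,
                                                           height: groupHeight,
                                                           depth: 1))
    encoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()  // blocks; fine for a sketch

    // The only readback: four UInt32 values from shared memory.
    let values = bounds.contents().bindMemory(to: UInt32.self, capacity: 4)
    return selectionRect(fromBounds: Array(UnsafeBufferPointer(start: values, count: 4)))
}
```

A caller would build the pipeline once (for example from `makeDefaultLibrary()` and `makeComputePipelineState(function:)`) and reuse it for every crop. In a real app you would probably replace `waitUntilCompleted()` with a completion handler so the main thread is never blocked.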
The lesson is broader than this one kernel: never scan large textures on the CPU when the data is already on the GPU. Even "simple" operations benefit from running one thread per pixel and combining the results on the GPU. This pattern, atomic min/max into a tiny buffer followed by a single small readback, works for any bounding-rect-style query.