Hard Bugs

A few years ago I stumbled upon my first hard bug. I was working on my custom C++ 2D game engine and had just added the ability to render to an off-screen image. But it was not working well and most of the time I would get a black image. After stepping through the code for a long time, I inspected the camera data again and noticed that the projection matrix looked quite odd. Investigating a little further, I found the root cause: the projection matrix was uninitialized, rendering all further draw commands useless.

But how did this not affect the core render code responsible for creating the actual on-screen image, which was using the same camera code? It turns out the core camera was a global variable, which ensured the projection matrix would be initialized to zeros, while the off-screen camera was created on the stack, meaning the projection matrix would contain whatever values were left at these particular memory locations by the previously executed code. Then the camera code would only update particular entries in the matrix, wrongly expecting the others to already contain zeros.

After figuring all this out, the fix was rather simple: always initialize the projection matrix by adding = {}; after its declaration in the camera struct. I had also learned my lesson, spent some time internalizing C++’s crazy initialization rules and made sure to always initialize my variables from then on.
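To make the difference concrete, here is a minimal sketch. The `CameraUnsafe` and `CameraSafe` structs and the helper functions are hypothetical stand-ins, not the engine's actual code. The sketch only illustrates that a global object is zero-initialized, while a stack object is zeroed only if the member carries an initializer.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>

// Hypothetical stand-ins for the engine's camera struct.
struct CameraUnsafe {
    float projection[16];       // no initializer: indeterminate for stack objects
};

struct CameraSafe {
    float projection[16] = {};  // default member initializer: always zeroed
};

// Static storage duration: zero-initialized before the program starts,
// which is why the on-screen camera happened to work.
CameraUnsafe globalCamera;

// Returns true if every entry of the matrix is exactly zero.
inline bool isAllZero(const float (&m)[16]) {
    return std::all_of(std::begin(m), std::end(m),
                       [](float v) { return v == 0.0f; });
}

// A stack-allocated CameraSafe is zeroed thanks to the member initializer.
// Doing the same with CameraUnsafe would read indeterminate values, which is
// undefined behavior, so the sketch deliberately does not try it.
inline bool stackCameraIsZeroed() {
    CameraSafe camera;
    return isAllZero(camera.projection);
}

inline void checkCameras() {
    assert(isAllZero(globalCamera.projection));
    assert(stackCameraIsZeroed());
}
```

Calling `checkCameras()` from `main` should pass silently; the interesting case is the one the sketch cannot safely show, namely a `CameraUnsafe` declared as a local variable.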

It took me 3 hours to solve that bug and it was unlike anything I had seen before in terms of time spent and cluelessness while debugging. That is why I called it the hardest bug I had ever debugged myself, at least until now. As of today, we have a new champion, which took a total of 8 hours of staring at Visual Studio in disbelief to beat and made me question whether the compiler was working at all. For your enjoyment, I will start by describing the symptoms I encountered while debugging and you can see if you can guess what the bug was, before I present my analysis.

To set some context, I was experimenting with Vulkan and the NVIDIA Ray Tracing API. For this purpose I had built a little graphical application that uses Vulkan to render a bunch of objects (following the great Vulkan Tutorial). The next step was to start using ray tracing and I was following the NVIDIA Vulkan Ray Tracing Tutorial. The tutorial provides a sample application that also renders a model using the rasterization pipeline provided by Vulkan, and it shows how to extend the application to use the ray tracing pipeline as well. The sample application uses almost the same libraries as I do (GLFW, GLM, tiny_obj_loader, stb_image, among others), so I figured the applications were similar enough that I could start working on my application right away and just adapt the code from the tutorial appropriately.

The first steps went quite well. The tutorial also provides some utility classes that attempt to abstract away some of the verbosity Vulkan has earned some fame for. I copied the relevant files over and added them to my project. Then I had to enable the ray tracing extension and query some device properties, which also worked without problems. At this point my application compiled and ran fine.

Now it was time to build the acceleration structures (AS) necessary for ray tracing. I won’t go into too much detail here about ray tracing in general and NVIDIA’s Vulkan extension in particular, but for fast ray tracing you need to preprocess your geometry and store it inside such a structure. Specifically, the NVIDIA extension requires you to build one or more bottom-level acceleration structures from your vertex buffers and then combine them into one top-level acceleration structure. This is the first real step towards ray tracing and also the largest section of the tutorial. Anyway, I created a new file for all the ray tracing code, worked my way through the tutorial and adapted the code to my application. This involved replacing the GeometryInstance struct defined by the tutorial with the Object struct I had already been using in my application to store the relevant information about the objects rendered to the screen (i.e. different representations of the geometry for processing on the CPU as well as some handles to data buffers on the GPU). The struct definitions are shown in the following code listing, the important part being that the Object struct contains all the required information (index and vertex buffers and a transformation matrix), so it does not make much sense to duplicate that data.

struct GeometryInstance {
    VkBuffer vertexBuffer;
    uint32_t vertexCount;
    VkDeviceSize vertexOffset;
    VkBuffer indexBuffer;
    uint32_t indexCount;
    VkDeviceSize indexOffset;
    glm::mat4x4 transform;
};

struct Object {
    Shading   shading   = Shading::VERTEX;
    glm::mat4 transform = glm::mat4(1);

    // The following data represents the unprocessed geometry using convex
    // polygons as faces.
    std::vector<VertexOnlyData> vertexOnlyData;
    std::vector<VertexFaceData> vertexFaceData;
    std::vector<FaceIndices>    faceIndices;
    std::vector<uint32_t>       faceOffsets;

    // The following arrays represent the processed geometry data consisting of
    // triangles only and should always be in sync with the buffers on the GPU.
    std::vector<Vertex>   vertices;
    std::vector<uint32_t> indices; // Indexes into vertices.

    VkBuffer vertexBuffer             = VK_NULL_HANDLE;
    VkDeviceMemory vertexBufferMemory = VK_NULL_HANDLE;

    VkBuffer indexBuffer              = VK_NULL_HANDLE;
    VkDeviceMemory indexBufferMemory  = VK_NULL_HANDLE;
};

After finishing the acceleration structure section of the tutorial, I wanted to test the code I had so far to see that it would at least run correctly albeit not producing any visual output yet. As one might expect after writing a bunch of new code, it crashed. I did not necessarily expect that it would work on the first try, but I also did not anticipate how long it would take me to figure this out.

I received a VK_ERROR_DEVICE_LOST error, which was unsurprisingly generated by the new code. The communication between CPU and GPU is asynchronous and in this case the code was structured in such a way that it records some commands into a buffer, then sends the buffer to the GPU and waits for the GPU to execute all of the commands. It was at this point that the error got reported, but, because of the asynchronous nature of the communication, it was not clear which of the commands actually caused the error. I disabled the top-level acceleration structure code for the moment, which removed some commands from the buffer. Then, by digging around in the utility classes from the tutorial, I found out that there were only two commands left and I had a strong suspicion which one was causing the error. Yet, without any further information about the error, I was stuck for a bit.

While researching debugging methods for this particular error, I found this article about using checkpoints in command buffers to locate errors. However, while certainly useful in more complicated scenarios, since I had already isolated the offending command, I figured it would not help my current situation. For debugging and developing Vulkan applications in general, the use of so-called “validation layers” is recommended. These validation layers perform additional checks, but are not enabled by default for performance reasons. I had already set up validation layers in my application and at this point I was wondering why they did not report anything. After checking that the validation layers were indeed enabled and working, I decided to update my Vulkan SDK installation, which turned out to be a good idea, because after the update I was getting more detailed error reports with actual information about the concrete error!

ERROR[validation]: VkAccelerationStructureInfoNV: The total number of triangles in all geometries must be less than or equal to VkPhysicalDeviceRayTracingPropertiesNV::maxTriangleCount. The Vulkan spec states: The total number of triangles in all geometries must be less than or equal to VkPhysicalDeviceRayTracingPropertiesNV::maxTriangleCount (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkAccelerationStructureInfoNV-maxTriangleCount-02424)

The relevant piece of code looked like this. BottomLevelASGenerator is one of the utility classes from the tutorial, which stores the arguments to AddVertexBuffer and passes them to the Vulkan API inside Generate (which adds the command to the buffer):

Result createBottomLevelAS(VulkanApplication& app, VkCommandBuffer commandBuffer, Accelerator& accelerator, Object* object) {
    nv_helpers_vk::BottomLevelASGenerator generator;

    auto vertexCount = static_cast<uint32_t>(object->vertices.size());
    auto indexCount = static_cast<uint32_t>(object->indices.size());

    generator.AddVertexBuffer(object->vertexBuffer, 0, vertexCount, sizeof(Vertex), object->indexBuffer, 0, indexCount, VK_NULL_HANDLE, 0);

    // ...

    generator.Generate(app.device, commandBuffer, /* ... */);

    return SUCCESS;
}

I was using a comparatively small model, but for testing purposes I switched to a cube consisting of the unbelievably high number of 12 triangles. Same error. Quickly querying the mentioned maximum triangle count returned 536 870 911, i.e. roughly 536 million triangles. Something did not add up.

Let’s quickly go over some of the dead ends I investigated:

  • The main application code was passing a const std::vector<Object*>& but with only one object in it (for testing). I was afraid I was messing something up there and getting garbage pointers (although this would have likely resulted in a memory access violation), so I changed it to a plain Object*. No effect.
  • Maybe it was uninitialized memory again? I explicitly initialized all the vectors in the Object struct just to be safe (which does not make a difference) and doubted for a moment whether new Object() would initialize the object (which it does, of course).
  • I also wondered whether I was managing the object pointers correctly, which I verified by providing a custom constructor and destructor that logged their invocation.

At some point during this process I added a bunch of checks between the code that created the objects in the main.cpp file and the code above, which is located in the ray_tracing.cpp file:

if (object->vertices.size() >= 100 || object->indices.size() >= 100) {
    std::cerr << "FAILURE!!!!" << std::endl;
}

I made sure that the object pointers in main and in the ray tracing file contained the same value, so these checks verified again that the objects were created correctly but somehow got corrupted before being passed to the generator.

Have you figured out what the bug is yet? Here’s a strong hint: All the checks in the main.cpp file passed, while all the checks in the ray_tracing.cpp file failed. There was a passing check in main.cpp, right before the function was called, and the first statement inside the function was the first check that failed.

While stepping through the code in the debugger, even crazier things happened. In this image you can see that the debugger is going to execute the print statement next (yellow arrow), which means the condition of the if statement must have evaluated to true. Yet, according to the watch window, the condition must have been false.

At this point I started to doubt the compiler was working correctly and started looking at the disassembly in a desperate attempt to find something odd-looking. I started to doubt my understanding of operator precedence and put parentheses everywhere. I started to doubt my ability to do boolean logic in my head and verified false || false == false in the watch window.

Well, now is probably your last chance to take a guess. Here’s how I figured this one out. After being stumped for several hours, I came to the following conclusion: assuming the compiler is not completely broken, the code in ray_tracing.cpp must “see” different data than the code in main.cpp and the debugger itself. I checked that with the following code:

// In main.cpp.
std::cerr << "MAIN: " << sizeof(Object) << std::endl;

// In ray_tracing.cpp.
std::cerr << "NOT MAIN: " << sizeof(Object) << std::endl;

And sure enough, the numbers were different! The actual debug output:

INFO: ray tracing max recursion depth: 31
INFO: ray tracing max triangle count: 536870911
INFO: ray tracing shader group handle size: 16
MAIN: 304
FAILURE3.3!!!!
FAILURE3.2!!!!
FAILURE3.1!!!!
FAILURE3!!!!
NOT MAIN: 296
FAILURE4!!!!
ERROR[validation]: VkAccelerationStructureInfoNV: The total number of triangles in all geometries must be less than or equal to VkPhysicalDeviceRayTracingPropertiesNV::maxTriangleCount. The Vulkan spec states: The total number of triangles in all geometries must be less than or equal to VkPhysicalDeviceRayTracingPropertiesNV::maxTriangleCount (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkAccelerationStructureInfoNV-maxTriangleCount-02424)

Now it is clear what “see” means above: the different files expect a different data layout. Specifically, the alignment of the transform matrix was different, which caused the code in the ray_tracing.cpp file to expect 8 fewer padding bytes somewhere at the beginning of the struct, changing the offsets for all other members.
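The effect can be reproduced in a single file. The following is only a sketch under stated assumptions: `Mat4Default`, `Mat4Aligned` and `ObjectT` are hypothetical, simplified stand-ins for the two views of glm::mat4 and Object, so the concrete sizes differ from the 304 and 296 bytes above. It shows how the same bytes yield a different handle when read through the other layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for the matrix type as seen by the two files.
struct Mat4Default { float m[16]; };              // natural float alignment
struct alignas(16) Mat4Aligned { float m[16]; };  // forced 16-byte alignment

// Simplified object: a small field, the matrix, then a handle.
template <typename Mat>
struct ObjectT {
    std::int32_t  shading;
    Mat           transform;
    std::uint64_t vertexBuffer;
};

using ObjectDefault = ObjectT<Mat4Default>;  // what one file expects
using ObjectAligned = ObjectT<Mat4Aligned>;  // what the other file creates

// Same member order, but extra padding before the matrix shifts everything.
static_assert(offsetof(ObjectDefault, transform) < offsetof(ObjectAligned, transform),
              "alignment adds padding before the matrix");
static_assert(sizeof(ObjectDefault) < sizeof(ObjectAligned),
              "the two layouts differ in size");

// Reads the handle from the offset the *other* layout assigns to it, the way
// code compiled against ObjectDefault would. memcpy keeps the read well-defined.
inline std::uint64_t readHandleWithOtherLayout(const ObjectAligned& object) {
    static_assert(offsetof(ObjectDefault, vertexBuffer) + sizeof(std::uint64_t)
                      <= sizeof(ObjectAligned),
                  "the read stays inside the object");
    std::uint64_t value = 0;
    std::memcpy(&value,
                reinterpret_cast<const unsigned char*>(&object)
                    + offsetof(ObjectDefault, vertexBuffer),
                sizeof(value));
    return value;
}

inline void demoLayoutMismatch() {
    ObjectAligned object{};
    object.vertexBuffer = 12;
    assert(object.vertexBuffer == 12);               // correct layout
    assert(readHandleWithOtherLayout(object) != 12); // wrong layout reads matrix bytes
}
```

In the sketch the wrong offset lands inside the matrix, so the reader gets matrix bytes instead of the handle. In the real program the same kind of shift made the vectors' sizes come out as garbage.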

How exactly did this happen? In my code I defined GLM_FORCE_DEFAULT_ALIGNED_GENTYPES before including GLM (the math library that provides the matrix type glm::mat4), which changes the alignment. In ray_tracing.cpp I included the NVIDIA utility classes (one of which also includes GLM, but without the define) before object.hpp (which in turn includes GLM correctly). Since GLM's include guards make the second include a no-op, GLM was configured differently for the ray_tracing.cpp file only. The fix for this bug was a one-liner again: just include GLM, with the define, before the utility classes.
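A cheap guard against this class of bug is to let the header that depends on the configuration check it. The following is a sketch, not what my code does: `MYMATH_FORCE_ALIGNED`, `Mat4` and `objectLayoutMatches` are hypothetical names, with the macro standing in for GLM_FORCE_DEFAULT_ALIGNED_GENTYPES, and the assumption that the aligned configuration yields a 16-byte alignment is mine.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical configuration macro; it has to be defined before the math
// header is included for the first time in a translation unit.
#define MYMATH_FORCE_ALIGNED

// --- contents of a hypothetical "mymath.hpp" ---
#ifdef MYMATH_FORCE_ALIGNED
struct alignas(16) Mat4 { float m[16]; };
#else
struct Mat4 { float m[16]; };
#endif

// --- contents of a hypothetical "object.hpp" ---
// Compile-time guard: fails the build in any translation unit that pulled in
// the math header without the define.
static_assert(alignof(Mat4) == 16,
              "Mat4 is not 16-byte aligned; check the include order");

struct Object {
    int  shading   = 0;
    Mat4 transform = {};
};

// Runtime guard: the caller passes the layout it was compiled with. In a real
// project this function would be defined in exactly one .cpp file (not
// inline), so the comparison happens against that file's view of Object.
inline bool objectLayoutMatches(std::size_t callerSize,
                                std::size_t callerTransformOffset) {
    return callerSize == sizeof(Object)
        && callerTransformOffset == offsetof(Object, transform);
}

// Expands in the calling file, so sizeof and offsetof use the caller's layout.
#define CHECK_OBJECT_LAYOUT() \
    assert(objectLayoutMatches(sizeof(Object), offsetof(Object, transform)))
```

Placing `CHECK_OBJECT_LAYOUT();` at the top of a function like createBottomLevelAS would have turned eight hours of confusion into an immediate assertion failure, and the static_assert would have caught it even earlier, at compile time.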

Interestingly, it seems to be random which layout is added to the debug information from build to build, since I sometimes saw corrupt data in the watch window as well, which confused me even more.

As with any complex failure, multiple factors came together to cause the different data layouts, beginning with both code bases choosing to use the same math library (albeit configured slightly differently) and culminating in the arrangement of fields inside a struct and the more or less random order of includes.

At the language level, beloved C++ “features” like separate compilation units, textual includes and the C preprocessor contributed to this bug, yet they will probably never be removed from C++ for compatibility reasons. After having solved this mystery, I for one welcome all the new systems programming languages and hope that they will be able to fully replace C++ eventually.
