Both gcc and clang generate strange/inefficient code

I ran into some surprisingly weird output of both Clang and gcc on a simple code snippet, and I thought I'd share it.

Consider the following C++ function which, in a roundabout way, checks whether an std::array passed as an argument only contains zeros:

#include <array>

static constexpr int arraySize = 1;

bool isAllZeros (const std::array<int, arraySize> &array) {
    std::array<int, arraySize> allZeros {};

    return array == allZeros;
}

In case you're wondering why this is correct: initializing an std::array with {} value-initializes each element, which for int means setting it to zero.
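As a quick sanity check, the same all-zeros test can be written by comparing against a value-initialized temporary directly; a minimal sketch (the helper name is mine):

```cpp
#include <array>

// Comparing against a value-initialized temporary: {} zeroes every
// element, so this is equivalent to checking that all elements are 0.
bool isAllZeros3(const std::array<int, 3> &array) {
    return array == std::array<int, 3>{};
}
```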

What happens if we compile this with gcc? Using godbolt and the latest gcc version (15.2), with optimizations on ("-O3") we get the following x86-64 Assembly code:

isAllZeros(std::array<int, 1ul> const&):
        mov     eax, DWORD PTR [rdi]
        test    eax, eax
        sete    al
        ret

Already here we get somewhat non-intuitive output! We have set arraySize to 1, so we're effectively checking whether a single integer value is 0. The generated code does this by fetching the integer value, test'ing it against itself (a bitwise AND that discards its result and only updates the CPU flags), and then setting the return value of the function from the zero flag. It may look roundabout, but "test reg, reg" is in fact the standard idiom for a zero check: it encodes more compactly than "cmp reg, 0" while setting the zero flag identically. The only arguable waste is that gcc loads the value into a register first instead of comparing against the memory operand directly.

Let's see what happens if we set arraySize to 2:

isAllZeros(std::array<int, 2ul> const&):
        cmp     QWORD PTR [rdi], 0
        sete    al
        ret

That's more like it! Now we're simply fetching a QWORD-sized block of memory (8 bytes, which corresponds to two integers), comparing it to 0 and setting the return value to be the result of the comparison operation. This is a lot more direct than the arraySize = 1 case, and it's not obvious why gcc picks different instruction sequences for the two sizes.
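At the source level, gcc's trick amounts to treating the two adjacent ints as one 64-bit value; roughly this sketch (the helper name is mine, and memcpy is the well-defined way to do the type pun):

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Treat the 8 bytes of the two-element array as a single 64-bit
// integer and compare it to zero in one go, mirroring gcc's single
// QWORD cmp.
bool isAllZeros2Wide(const std::array<int, 2> &array) {
    std::uint64_t combined;
    static_assert(sizeof(combined) == sizeof(array), "size mismatch");
    std::memcpy(&combined, array.data(), sizeof(combined));
    return combined == 0;
}
```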

How about arraySize = 3, meaning a 12-byte block?

isAllZeros(std::array<int, 3ul> const&):
        cmp     QWORD PTR [rdi], 0
        je      .L5
.L2:
        mov     eax, 1
        test    eax, eax
        sete    al
        ret
.L5:
        mov     eax, DWORD PTR [rdi+8]
        test    eax, eax
        jne     .L2
        xor     eax, eax
        test    eax, eax
        sete    al
        ret

Now things are getting really hectic. This time gcc decided to use a mixture of both strategies, using a cmp instruction to check whether the first 8 bytes are zero, and the test instruction to check the remaining 4 bytes.

That's not the weirdest part though. The strangest bit is the block between the ".L2" and ".L5" labels, which as far as I can tell uses a bafflingly roundabout sequence (load 1 into eax, test it against itself, sete) simply to set al — the return value — to 0. A nearly identical sequence at the end of the code (xor, test, sete) sets the return value to 1. In both cases a single mov would have done the job.
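Translated back into C++, gcc's two-step strategy for the 12-byte case looks roughly like this sketch (the helper name is mine):

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// First compare the leading 8 bytes as one 64-bit value (the cmp on
// QWORD PTR [rdi]), then check the remaining element separately (the
// test on DWORD PTR [rdi+8]).
bool isAllZeros3Mixed(const std::array<int, 3> &array) {
    std::uint64_t first8;
    std::memcpy(&first8, array.data(), sizeof(first8));
    if (first8 != 0)
        return false;
    return array[2] == 0;
}
```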

How about clang? Surely we won't see two different compilers behaving oddly here? Again we use "-O3" and the latest version on godbolt, which is Clang 21.1.0.

With arraySize = 1: 

isAllZeros(std::array<int, 1ul> const&):
        cmp     dword ptr [rdi], 0
        sete    al
        ret

Phew! Looks good and normal.

How about with size 2?

isAllZeros(std::array<int, 2ul> const&):
        mov     qword ptr [rsp - 8], 0
        cmp     qword ptr [rdi], 0
        sete    al
        ret

Mmmm. The actual comparison code looks just as good, but there's a new inefficiency here which gcc avoided: the first instruction writes a zero to the stack. This is the allZeros variable being initialized on the stack, even though the rest of the Assembly code never reads that value back — a classic dead store.
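If the dead store bothers you, one reformulation never materializes an allZeros object at all; a sketch using std::all_of (though whether this actually compiles to better code will vary by compiler and version):

```cpp
#include <algorithm>
#include <array>

// No allZeros object is ever created here, so there is nothing for
// the compiler to (redundantly) store to the stack.
bool isAllZerosLoop(const std::array<int, 2> &array) {
    return std::all_of(array.begin(), array.end(),
                       [](int x) { return x == 0; });
}
```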

Lastly and for completeness, here is the output from Clang with size 3:

isAllZeros(std::array<int, 3ul> const&):
        mov     dword ptr [rsp - 8], 0
        mov     qword ptr [rsp - 16], 0
        mov     eax, dword ptr [rdi + 8]
        or      rax, qword ptr [rdi]
        sete    al
        ret

Here we see Clang using bit manipulation: the or instruction combines the input elements into a single value that is nonzero if and only if at least one element is nonzero, and sete then turns the resulting zero flag into the return value. Clever really! However, the unnecessary initialization of allZeros on the stack is still present — two dead stores this time. One last puzzle remains: why did clang not emit these unnecessary writes when arraySize was just 1?
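The source-level analogue of clang's OR-combining trick is a sketch like this (the helper name is mine):

```cpp
#include <array>
#include <cstdint>

// OR all elements into one accumulator; the accumulator is nonzero
// exactly when at least one element is nonzero, so a single final
// comparison suffices.
bool isAllZerosOr(const std::array<int, 3> &array) {
    std::uint32_t acc = 0;
    for (int x : array)
        acc |= static_cast<std::uint32_t>(x);
    return acc == 0;
}
```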

Moral of the story? As advanced as compilers are, we certainly can't trust them to generate optimal code, or even to behave predictably across seemingly trivial changes to the source (such as changing the size of an array).

Note: If I missed anything I apologise in advance - please send any feedback to: the name of this blog (see the URL) @gmail.com
