Unaligned accesses in C/C++: what, why and solutions to do it properly

JonChesterfield · on Nov 30, 2023

I expected this to be an aliasing violation and was not disappointed. Dereferencing a uint8_t through a uint64_t is undefined behaviour whatever the target architecture is. That it works for the author on x64 just means the compiler didn't break it this time around.

In particular, an 8 byte aligned uint8_t also can't be dereferenced via a uint64_t. It isn't UB because the alignment is wrong, it's UB by definition and also the alignment might be wrong.

This is a woeful state of affairs propagated by an ill founded and widespread confidence that the C and C++ standards probably aren't as hostile to type aliasing as they actually are.

gpderetta · on Nov 30, 2023

If uint8_t is a char type (and invariably it is), then it is fully defined.

Similarly, it is perfectly safe to cast a correctly aligned uint8_t* to uint64_t* and dereference it as long as you originally stored an uint64_t into it (or the compiler can't prove you didn't). Remember aliasing UB is always on derefencing with the wrong dynamic type, not on casting pointers.

JonChesterfield · on Nov 30, 2023

There's an old gcc bug report about whether uint8_t should have the magic aliasing properties of char or not. I don't remember what they concluded.

You can cast any pointer you like to any other pointer. When you dereference it, if the declared type is not the type of the underlying object, bad times for you.

An interface which takes a uint8_t* and immediately invokes UB if you pass it anything other than a pointer to a uint64_t can indeed be written, but one should expect people to pass it things other than a cast uint64_t*.

Further, the magic aliasing char doesn't work in both directions. If this code was changed to take a char instead of a uint8_t it would be exactly as undefined as it is currently. You can aim a char* into a uint64_t and deref the char, but you can't aim a uint64_t* into a char and deref the uint64_t.*

quietbritishjim · on Nov 30, 2023

They ended up making uint8_t a typedef for unsigned char rather than a distinct type (which is permitted but not required by the standard) so the semantics are now clear.

Edit: That's what I remember but I can't find evidence of it. The docs say this is ABI dependent so perhaps the above statement is true on some specific ABI (if I haven't misremembered it entirely). But I can't track down wheere a relevant ABI would be documented.

https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gcc/Architecture-i...

Edit 2: this header has a typedef of uint8_t to unsigned char, but it's part of glibc rather than gcc

https://android.googlesource.com/platform/prebuilts/gcc/linu...

kzrdude · on Nov 30, 2023

I thought if uint8_t exists, it logically must be the same as unsigned char? You put some of the statements of the standards together and that's their combined consequence.

quietbritishjim · on Nov 30, 2023

No it doesn't have to be, and it wasn't. That was the whole problem. They were both unsigned integer types with 8 bit widths but they were distinct and, in particular, uint8_t didn't have the aliasing exception that unsigned char does.

leni536 · on Nov 30, 2023

No, the relationship is directional. You can reinterpret a pointer to an integral type to a pointer to unsigned char and read the underlying bytes. The other way around is UB.

gpderetta · on Dec 1, 2023

What I mean is that this is perfectly valid code:

   uint64_t x=0;
   uint8_t * p = (uint8_t*)&x;
   uint64_t y = *(uint64_t*)p;

It is not the casting that is the problem, it is the dereferencing. And as long as that matches the dynamic type, it is ok.

tom_ · on Nov 30, 2023

I'm surprised the gcc and clang people have never noticed that uint8_t is an integer type, and therefore does not need to be a character type.

gpderetta · on Nov 30, 2023

Well, if we were starting from scratch {u,}int8_t should definitely not be character type; instead we would have byte_t (plus an explicitly sized variant, octect_t?) as the general universally aliasable byte type, and all of these would be guaranteed to be distinct types.

But today, uint8_t is in practice always a char type. You can static_assert if you want to be completely sure. Or just use std::byte.

Chabsff · on Nov 30, 2023

You are correct in principle, but the code in the original post definitely exhibits strict aliasing violations. Not only that, it didn't need to.

gpderetta · on Nov 30, 2023

The original code computing the hash by casting uint8_t* to uint64_t* for wide leads is generally wrong, but the memcpy one is fine.

pkkm · on Nov 30, 2023

> This is a woeful state of affairs propagated by an ill founded and widespread confidence that the C and C++ standards probably aren't as hostile to type aliasing as they actually are.

Yep. Honestly, while I get the historical reasons for strict aliasing, I wish it hadn't been created. Just make -fno-strict-aliasing the default and normalize the abundant use of restrict, maybe shorten it to res or rest just like we use const instead of typing out constant. These implicit rules that usually, but not always, do what you want can be a bigger footgun than having to do things explicitly.

wahern · on Nov 30, 2023

How historical is the importance of strict aliasing? AFAIU, strict aliasing survives not because of alignment issues when type punning, but because optimizing compilers heavily rely on non-aliasing to guarantee semantic correctness when reordering code, especially in and around loops. And the performance benefit of reordering has grown exponentially in lock-step with the emergence and increasing sophistication of superscalar architectures, pipelining, and cache hierarchies.

Strict aliasing, IOW, is as a practical matter mostly about when and to what extent compilers (and sometimes processors) can optimize code through automatic reordering, much like wrt parallelism and memory barriers. In this sense, the role of strict aliasing is more important than it ever has been, and while technically "merely" a matter of performance (correctness could be obtained by completely abstaining from reordering), the performance of modern processors plummets precipitously when implicit hardware parallelism can't be fully leveraged.

pkkm · on Nov 30, 2023

You're right that disabling strict aliasing removes some optimization opportunities for the compiler, but you can always get them back by adding restrict here and there. So it's really about what the default behavior should be. I'm thinking that it probably would've been better for the safe and intuitive behavior to be the default and for the extra optimization opportunities to be opt-in. A set of rules that usually does the right thing, but not always, can lull programmers into a false sense of security.

Also, I'm not sure how true your argument is in practice. Modern compilers have other, more sophisticated ways to prove that things don't alias. A number of projects compile with -fno-strict-aliasing without using restrict very often, for example the Linux kernel, and they don't seem to suffer much from it. Linus has this to say about aliasing in kernel code:

> In x86, I doubt _any_ amount of alias analysis makes a huge difference (as long as the compiler at least doesn't think that local variable spills can alias with anything else). Not enough registers, and generally pretty aggressively OoO (with alias analysis in hardware) makes for a much less sensitive platform.

JonChesterfield · on Nov 30, 2023

It's a hack from the days of separate compilation. Like lots of the "no diagnostic required" sucks to be you UB which can only be diagnosed at whole program link time. That is, the semantics and usability of modern C++ are compromised by the hardware capabilities of the machines on which C was first developed.

tialaramex · on Nov 30, 2023

> just like we use const instead of typing out constant.

Those aren't constants†, they're immutable variables which are quite different. Because of the "as if" rule and provenance in some cases you get the same benefit as with a real constant, but that's not what this is.

† In C or C++. In some other languages const does, logically, get you an actual constant.

pkkm · on Nov 30, 2023

Unless you think that the abbreviation const stands for some other word than constant, I'm not sure how that's relevant here?

tialaramex · on Dec 1, 2023

Because these aren't actually constants it would actually be even worse if the keyword was "constant" rather than "const". Jargon that's identical to ordinary words but is just plain wrong if taken that way makes it harder to be understood.

In a language where the const keyword gets you an actual constant this condition wouldn't arise.

AnimalMuppet · on Nov 30, 2023

Well, yes, if you're playing this kind of game, the compiler cannot type-check what you're doing (or sanity-check it in any other way). You're outside the area covered by the standard. You're coding without a net.

But C and C++ are explicitly intended for this kind of situation. You know enough about how types are represented in memory to fiddle with their internals, and you have sufficient reason to do so? Go for it. Just don't expect the type checking to save you, because it won't.

JonChesterfield · on Nov 30, 2023

I'd be totally happy with C or C++ here if the attitude to type checking was "you've done weird things so you're on your own, be careful". Things like mutating a vtable could be fair game, just do it carefully.

However that's not the setup. C++ in particular, and increasingly C copying from it, optimises assuming you never do anything weird with pointers. This makes application code that doesn't do such things faster in a separate compilation world. It makes system code that does do such things undefined, forcing either compiler flags (fno-strict-aliasing and friends, aka writing in a different language that looks like the nominal one) or taking care around separate compilation to make sure the compiler never sees both sides of the boundary at the same time.

trealira · on Nov 30, 2023

If I had the skill, I would want to create my own C compiler that tried not to optimize things like this, e.g. don't optimize away infinite loops, don't optimize away signed overflow checks or do incorrect algebraic simplifications because signed overflow is undefined behavior, and by default act as though -fno-strict-aliasing were on. Add options for trapping on overflow or optimizing away overflow checks, signed or unsigned. Just do register allocation and minor constant folding (but again, don't act like signed overflow doesn't exist, and don't do constant folding across pointer writes if its address has been taken).

There exist open source C compilers that aren't GCC or Clang, such as TCC and Zig, but to my knowledge they all produce really, really bad assembly. There existed proprietary C compilers in the 1980s and 1990s that were written by one or two people that still produced solid assembly, so it's possible, just no one has done it, so Clang and GCC are the only free and open source options for decent assembly for C. Those two also have the downside of being C++ compilers, making them more complex than they need to be.

Maybe another option would be to improve an already existing open source C compiler's codegen than to write a new one from scratch again, assuming they'd agree with the above ideals.

JonChesterfield · on Nov 30, 2023

It's on my to-do list for C. There's some mess around the preprocessor and parsing typedefs but it's not too horrendous. The guix bootstrap had a compiler for a largish subset of C written in scheme, that's probably a good start point if you don't want to go from scratch.

I like C as a notation for programming computers. I don't like the slight misstep -> UB -> everything is ruined philosophy of WG14.

(edit: it should be possible to emit LLVM IR with the right metadata to avoid the optimiser mangling it. In practice there would be a long tail of accidentally assuming C++ semantics that aren't justified by the IR, where bug fixes for those would be valid upstream. If one wanted to take on the probably thankless task of trying to remove ISO C/C++ assumptions from llvm.)

adrianN · on Nov 30, 2023

Rust already encountered lots of bugs when they tried to add all the aliasing info they had to the IR output.

nlewycky · on Nov 30, 2023

> Things like mutating a vtable could be fair game, just do it carefully. However that's not the setup.

You can mutate a vtable without UB. A method may call the destructor on its this pointer and use placement-new to create a new object in place. The fact that any method of a class may do this combined with the fact that the placement-new object might be a more derived version of the destroyed object (so an existing Foo* continues to be valid) means that compiler can't cache vtable lookups for consecutive method calls, making nearly any optimization of virtual function calls impossible because you don't know the type or called function, unless you see the object being constructed (when the vptr is assigned) and inline each called function in turn.

(At some point C++ added a rule that basically reads "you're allowed to cache the vptr, if the accesses were written using the same pointer variable name". This doesn't work well for optimizing compilers because they'll quickly fold two equal values into a single variable in their internal languages and lose track of whether the user wrote two distinct variable names or not.)

xscott · on Nov 30, 2023

I think we're getting to the point where C++ users should define two classes:

    class Integer { 
        // uses unsigned integers for all operations
        // but converts to signed when you need output
    };
    class Pointer {
        // uses void* and memcpy for all loads or stores
    };

Put all of the correct operator overloading on those to get the ergonomics we had in the 90s.

Then we can safely ignore all the undefined behavior which has been so eagerly embraced by the compiler writers, and we can go back to treating the hardware as though it is sane. (Which the hardware is desperately trying to fake being anyways)

After that, we can create some sane container types which don't offer up dangling references or pointers, and Bob's your uncle.

celrod · on Nov 30, 2023

If my signed integers overflow, that is wrong. I want to be able to detect if it ever happens, not have that be defined as perfectly valid wrapping behavior. If it's undefined, it's perfectly legal for the compiler to make overflow trap, so I can catch it, or confirm it never happens when running tests our debugging. `-ftrapv` or ``-fsanitize=undefined` do this.

xscott · on Nov 30, 2023

To each their own. I want my 64 bit integers to do what the CPU does, and I could bore you with useful examples where two's complement wrap around is exactly the right behavior.

But more importantly, I don't ever want to worry about stumbling into undefined behavior, and compiler extensions like -ftrapv or -fwrapv are not part of the standard. I can't count on those being there when I switch compilers, and if I write a library for others, I can't count on the application developer using the flags I need.

derf_ · on Nov 30, 2023

You run into these problems on x86, also, with SIMD intrinsics like _mm_cvtepi8_epi32() (which converts four 8-bit integers to 32-bit integers with sign extension). The underlying instruction, PMOVSXBD, when used with a (4-byte) memory operand, imposes no alignment restrictions. But the intrinsic takes an __m128i value (not a pointer), and __m128i (the C type) requires 16-byte alignment.

If you try casting your pointer to (__m128i *) and dereferencing, sometimes the compiler will optimize it correctly (to PMOVSXBD with a memory operand, which really only loads 4 bytes), and sometimes (for example, when optimizations are disabled) it will emit MOVDQA, which does a 16-byte aligned load. Tools like UBSan will also (correctly) complain about it.

Eventually __mm_loadu_epi32() was added, but it was released in a broken state in the initial gcc implementation: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754 so you can't rely on it. There is __mm_load_ss(), but gcc's implementation derefs a (float *), so it still requires 4-byte alignment: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508

The workarounds described in the article (combined with _mm_cvtsi32_si128()) are the only reliable solution I have found. Various compilers still generate an extra MOVD instruction instead of using a memory operand, and reportedly ICC will even do an extra load into a general-purpose register before moving the value to a SIMD register: https://stackoverflow.com/questions/72837929/mm-loadu-si32-n...

Those extra moves are cheap on x86, so it is not the end of the world, but the whole situation is less than ideal.

ack_complete · on Nov 30, 2023

Intel's intrinsics are just a total mess. Another example is _mm_loadl_epi64(), which loads 64-bits from memory and maps to MOVQ x,m -- but it takes an __m128i pointer. The result is that you frequently have to reinterpret cast pointers to use the intrinsics, in ways that would be blatantly broken in any other situation. ARM does a much better job of this in properly using void pointers or providing typed loads and store intrinsics.

It's also fun how intrinsics are mixed between ISA extension levels with confusingly similar names that make it really easy to use them in the wrong code path. Arithmetic shift right immediate for int16 (_mm_srai_epi16) and int32 (_mm_srai_epi32) are SSE2. But _mm_srai_epi64 for int64 is AVX512.

One thing that Intel does do better is that they have a standardized, OS-independent way of testing for ISA extensions through CPUID. ARM doesn't and you are at the mercy of the OS to provide you APIs to test whether, for instance, Crypto, CRC32, CAS, and UDOT instructions are available.

rwmj · on Nov 30, 2023

Particularly insidious if you don't stick to the 16 byte stack alignment. We had one case where we calling from assembly code back to C, but the asm didn't honour the C stack alignment. The C code was trying to access some vector variable on the stack and crashed because the compiler generates code assuming %rsp is 16 byte aligned when you enter the function. The stack was only unaligned some of the time and the crash happened very distant from the wrapper, making it a pain to track down. (Asm wrapper has since been fixed).

derf_ · on Nov 30, 2023

It gets worse than that. On x86-32, you are only guaranteed 8-byte stack alignment, and if you try to use __attribute__((aligned(16)) on a stack variable, gcc just silently fails to enforce the alignment. Or it did 15 years ago when x86-32 was relevant. A lot of times the code works anyway by luck. Until it doesn't. No idea if this was ever fixed.

Am4TIfIsER0ppos · on Nov 30, 2023

Wasn't that a Windows limitation? I recall that trouble from ffmpeg having to align the stack for functions that might be called from microsoft compiled code. I recall gcc assuming alignment but no guarantee that you were actually called like that.

CoastalCoder · on Nov 30, 2023

Really interesting, thanks!

I'm a little surprised about the icc thing, since it has/had a reputation for really good Intel x86 codegen.

mlochbaum · on Nov 30, 2023

A solution used in the Linux kernel[0] is to use a struct with the packed attribute, which implies the fields aren't necessarily aligned. Simplifying, I think this macro should work the same in gcc and similar: apply to any pointer to get something that behaves as *p but will read or write as unaligned.

    #define unaligned(P) (((struct { typeof(*(P)) x; } __attribute__((packed))*)(P))->x)

[0] https://github.com/torvalds/linux/blob/7b9e664beb237d90bc600...

JonChesterfield · on Nov 30, 2023

Uses of pointers to members of packed structs are prone to being miscompiled by GCC so I suspect that only works in combination with the various other not-actually-C flags the kernel has to pass around.

el_pollo_diablo · on Nov 30, 2023

Here is another solution for loading a potentially unaligned 64-bit integer in the CPU's byte order, that does not violate strict aliasing:

  uint64_t load64(uint8_t const *b) {
      union {
          uint64_t u64;
          uint8_t u8[8];
      } u = {.u8 = {
          b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7]
      }};
      return u.u64;
  }

Type punning through a union is explicitly supported by the C standard, insofar as we do not evaluate uninitialized memory or trap representations. We are fine here, as uint64_t has no trap representations, and must have exactly the size of 8 uint8_ts, as those types have no padding bits.

GCC appears to compile this function to a plain unaligned load at optimization level 2, when supported, e.g. on arm64 (see https://godbolt.org/z/frEox67sG).

el_pollo_diablo · on Dec 1, 2023

Furthermore, one advantage of this solution over using memcpy is that the compiler can still understand what it does (and aggressively optimize it) in a freestanding environment.

amluto · on Nov 30, 2023

The “multiple loads” solution is IMO correct. It does precisely what it looks like it does, it doesn’t rely on any assumptions about how integers are represented, and, IMO best of all, it completely gets rid of the “native endian” crap that has plagued C programs for decades.

(And you computer can convert endianness in a cycle or two. There is very little excuse for optimizing based on “native endian”.)

edit: about the only good thing I can say about “native endian” is that it maybe justified in cases where the value never leaves the machine. Hash tables in memory come to mind, and the article seems to be about that.

wyldfire · on Nov 30, 2023

"-fsanitize=undefined" helps uncover this kind of UB and more. While you're there, pile on with ASan: "-fsanitize=address,undefined".

kaycebasques · on Nov 30, 2023

I'm in the process of updating the docs for Pigweed's pw_alignment lib. Would that lib help here or are we talking about a different problem? https://pigweed.dev/pw_alignment/

(I'm not that familiar with this problem so I figured I would take this opportunity on a seemingly related problem to get more perspective.)

tralarpa · on Nov 30, 2023

With gcc, the option -Wcast-align=strict is useful.

Edit: And if the buffer to convert is very large, one could first check the alignment of the pointer (with alignof(T)) in C++

12345ieee · on Nov 30, 2023

As other comments say, this is missing packed unions, which is the best solution I've found while writing a memory editor.

See: https://github.com/scanmem/scanmem/blob/c6045a8677f37a51b976...

dataflow · on Nov 30, 2023

Q: How do you do atomic unaligned accesses from C++? (and I'm sure someone will say "align your data". But that's not always possible.)

Edit: I mean on x86, where it's possible. I couldn't even find any intrinsics for the corresponding lock instructions. And yes I'm aware of the performance penalty.

cesarb · on Nov 30, 2023

AFAIK, in general, it's not possible; it's a hardware limitation of the CPU. Some architectures like the x86 do allow unaligned atomic accesses, but in a very heavy handed way (doing an unaligned atomic access on the x86 locks the whole bus, stalling all processor cores at the same time). The Linux kernel recently gained the ability to detect this situation and, depending on configuration by the administrator, it could either log a kernel warning or kill the offending process. Some discussion about this on LWN found on a quick web search: https://lwn.net/Articles/790464/ https://lwn.net/Articles/806466/ https://lwn.net/Articles/911219/

JonChesterfield · on Nov 30, 2023

C++ cannot represent an atomic type with less than natural alignment, and you cannot access that which does not exist, so that's the end of your road. Consider a language with lower intrinsic overhead.

Usually the answer is to ignore the atomic abstractions of C++ and use the GCC style atomic intrinsics (they take a void* instead of some atomic qualified thing) but I'm not totally confident they'll do the right thing if the target memory has less than natural alignment.

Beware crossing cache lines with your operation. I'm not sure what the x64 instructions do in such a case.

dataflow · on Nov 30, 2023

> Usually the answer is to ignore the atomic abstractions of C++ and use the GCC style atomic intrinsics

I'm saying I don't see any such intrinsics for this.

I do believe x86 and x64 LOCK works correctly across cache lines.

cjensen · on Nov 30, 2023

Portably? You don't. That's a CPU limitation for some architectures.

knorker · on Nov 30, 2023

The right way to cast stuff in memory since C++20 is std::bit_cast, right?

https://en.cppreference.com/w/cpp/numeric/bit_cast

Ono-Sendai · on Nov 30, 2023

Have been dealing with this issue recently. memcpy was quite slow in my tests however, it was ~twice as fast to check the alignment and load as floats (was dealing with floats in my case).

pengaru · on Nov 30, 2023

There have been several memcpy performance regressions affecting modern CPUs in recent years, might want to investigate if that's part of what you were observing.

epx · on Nov 30, 2023

Had a problem like that while porting an IEEE 11073 parser library to Emscriptem. Funnily enough, the code worked happily in x86 and ARM environments, but failed in Emscriptem.

Night_Thastus · on Nov 30, 2023

>and optimizes it correclty, as alignment rules are still valid

I can't tell if this is a joke or not, haha.

AnimalMuppet · on Nov 30, 2023

Something something Duff's Device...