As I understand it, this mode gains performance by sticking with 32-bit pointers in user-space instead of 64-bit pointers, while still taking advantage of all the new registers etc. available in x86-64 ("amd64") CPUs. (Many programs don't really need more than 4 GB of virtual address space, and using full 64-bit pointers can cause quite a bit of overhead - imagine doubly-linked lists with small data structures, for example.)
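To put a rough number on that overhead, here's a minimal C sketch (the struct and its field names are just made up for illustration; build it with gcc -m64 and, if your toolchain has x32 support, gcc -mx32 to compare):

    /* The same doubly-linked list node under the two ABIs. */
    #include <stdio.h>

    struct node {
        struct node *prev;   /* 4 bytes on x32, 8 on x86_64 */
        struct node *next;   /* 4 bytes on x32, 8 on x86_64 */
        int payload;         /* 4 bytes either way */
    };

    int main(void)
    {
        /* Roughly 12 bytes per node on x32 vs. 24 on x86_64
         * (alignment pads out the LP64 version), so a big list
         * needs about half the cache footprint with 32-bit pointers. */
        printf("sizeof(struct node) = %zu\n", sizeof(struct node));
        printf("sizeof(void *)      = %zu\n", sizeof(void *));
        return 0;
    }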
x32 is clearly a good technical solution. If you're running on a 64 bit kernel, x32 is superior to i686 in every measurable way.
But it's also yet another architecture. It's not binary compatible with either i686 or x86_64. You need your whole userspace compiled to use it. Middleware with embedded assembly (there's a surprising amount of this in glibc, and of course things like ffmpeg, libjpeg, etc...) needs to be ported. You can't run 32 bit binaries from proprietary sources.
And frankly the benefit over straight x86_64 is quite modest. I don't see x32 taking off. It's just not worth the hassle.
I thought the work that goes into multiarch support would let you run a single kernel and mix and match x32 and x86_64 binaries on the same system, but I might be wrong. (Of course, that would require a separate set of every library/dependency.)
Some numbers mentioned on the x32abi page hint at anywhere between 4% and 40% performance gains; if that is true, then I'd think the benefits would outweigh the hassle of another architecture.
(Edit: Most middleware also ships straight-C versions of the routines; whether or not an x32 C compiler can measure up to handcrafted x86 or x86_64 assembly I don't know - but I'm guessing the much higher register count would help a lot. Regarding proprietary software: there are a great many server configurations that do not need anything beyond the standard open source packages available in Debian.)
A standalone x32 binary will run fine on an x64 machine. But if you want to link to any libraries, the library will also have to be x32. So an x64 system, which probably has 32-bit legacy libs as well as normal 64-bit ones, will also need a complete set of x32 libs for x32 to be practical.
Sure. But after the porting work has been done by the distribution vendor, it's done. The package manager software should be able to do whatever is necessary to almost transparently ensure any necessary dependencies are installed for the required sub-architecture. So I would imagine that in most cases, end-users won't notice any hassle except having the option to choose between x32 and x86_64 per package during installation. I think that sounds kind of neat :)
OK, modulo taking up extra space on already-cramped CD distros, taking more time to download updates, taking up more space on production hard disks, and having to download a new version of the software if your dataset grows over 4GB, it sounds good. :)
And extra memory taken up at runtime by having to load the other versions of the libraries. And the I/O costs of reading them in. I'd think that'd outweigh the performance benefits many times over in nearly all cases.
Note that there is no need to port every single program. You can still run a 64-bit userspace (or a 32-bit one, for that matter), but use x32 for the apps where there is a major benefit. This also means you don't have to port every library etc.
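If you want to try that mix-and-match setup, one quick sanity check is to build the same trivial program twice and run both binaries under one kernel (assuming a gcc with x32 support and a kernel with the x32 syscall ABI enabled; the file name here is made up):

    /* abi_check.c - build twice:
     *     gcc -m64  abi_check.c -o abi_check_64
     *     gcc -mx32 abi_check.c -o abi_check_x32
     * Both binaries run side by side on the same x86_64 kernel;
     * only the pointer and long sizes differ. */
    #include <stdio.h>

    int main(void)
    {
        printf("sizeof(void *) = %zu\n", sizeof(void *)); /* 8 on x86_64, 4 on x32 */
        printf("sizeof(long)   = %zu\n", sizeof(long));   /* 8 on x86_64, 4 on x32 */
        return 0;
    }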
I wonder if there could be an alternative in hardware.
Could the cpu have a mode/flag where all pointers in cpu registers were treated as 32 bits (high 32 bits ignored)?
Alignment issues might make this impossible (loading a 32-bit address in a compatible way from memory/cache...), but it would be neat if it could work. For instance, if the memory layout in 32-bit chunks were A... B..., there would have to be a way to load both A and B into the low 32 bits of a register.
I am not sure this is true generally: the JVM has been able to use "Compressed OOPs" (i.e. 32-bit pointers to objects managed by the VM) on 64-bit platforms for a few years now, even without any support from the OS.
Of course it's better when the language/VM implementers take this issue into account, but for those that don't, just building the VM/interpreter for the x32 architecture can be the only workaround.
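For the curious, here's a rough C sketch of the compressed-pointer idea mentioned above. It's nothing like the real HotSpot implementation (which also scales offsets by object alignment to cover more than 4 GB), and all the names are invented: objects live in one pre-reserved arena and references are stored as 32-bit offsets from its base.

    #include <stdint.h>
    #include <stdlib.h>

    static char *arena_base;            /* base of the "managed heap"    */
    typedef uint32_t ref_t;             /* compressed reference: 4 bytes */

    static ref_t compress(void *p)   { return (ref_t)((char *)p - arena_base); }
    static void *decompress(ref_t r) { return arena_base + r; }

    struct obj {
        ref_t next;                     /* 4 bytes instead of an 8-byte pointer */
        int   value;
    };

    int main(void)
    {
        arena_base = malloc(1u << 20);  /* one big allocation up front */
        if (!arena_base)
            return 1;

        struct obj *a = (struct obj *)arena_base;           /* "allocate" at offset 0  */
        struct obj *b = (struct obj *)(arena_base + 64);    /* "allocate" at offset 64 */
        a->next = compress(b);
        ((struct obj *)decompress(a->next))->value = 42;    /* follow the 32-bit ref   */

        free(arena_base);
        return 0;
    }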
Flat real mode was a hack on old 32-bit Intel systems to enable flat memory addressing from 16-bit code. You'd jump into protected mode, load segment descriptors covering the full 4 GB address space, then jump back into 16-bit real mode; the CPU kept the cached 4 GB segment limits, so the whole space stayed addressable.
Afterwards you could access all your memory in a flat address space rather than using 64K segment:offsets. You had to prefix your memory-access instructions with an address-size override byte (0x67), though. Also, you had to write your own runtime library.
This is the exact opposite: it gives you all the registers, but not the full memory address space.
I'm curious what you mean when you say "write your own runtime library" - wasn't the whole point of this "unreal mode" and staying in 16-bit that you would still have access to DOS and its API? (As opposed to switching to full 32-bit mode, where DOS went out the window unless you implemented or embedded a "DOS extender".)
Yes, you're correct about the purpose, but many of the runtime functions would still be limited to the old 64K segments.
I rewrote all the string-handling, memory-management, etc. functions in the Borland Pascal and C++ runtimes to let me use all the memory without worrying about segments/offsets.
So you got the DOS functions, but you also got printf, port I/O, memory functions, a bunch of stuff I can't remember, etc.
Basically you got the primary benefits (for the needs of the time) of pmode - flat memory, without most of the pain. Seems like the same situation here - you get all the new registers, but without the pointer overhead.
EDIT: One point to note is that your Code, Data and Heap segments all stayed 64K. This bit me in the ass once when I was presenting: two bits of code that had been tested separately went over the 64K code and heap limits when brought together. That was embarrassing.
Yup, the x32 ABI is pretty neat. For most architectures the transition from 32 bit to 64 bit pointers resulted in reduced performance from the extra cache pressure the bigger pointers caused. In x86-64, though, the extra registers and guaranteed SSE2 meant that you actually saw a speed increase.
Of course, there are performance advantages to bigger pointers too, sometimes. For instance, it can be easier for a garbage collector to identify pointers if the ratio of memory addresses in use to total addresses is small.
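A hedged sketch of that point, in C terms: a conservative collector scanning raw words only treats a word as a possible pointer if it lands inside the heap's address range. With 64-bit words and a heap covering a tiny slice of the address space, stray integers almost never pass that test; with a nearly full 32-bit space, false positives are much more likely. (heap_start/heap_end and the rest are assumptions for illustration, not any particular collector's API.)

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static uintptr_t heap_start, heap_end;   /* bounds of the managed heap */

    /* A word is only *possibly* a pointer if it falls inside the heap. */
    static int looks_like_pointer(uintptr_t word)
    {
        return word >= heap_start && word < heap_end;
    }

    int main(void)
    {
        char *heap = malloc(1 << 16);
        if (!heap)
            return 1;
        heap_start = (uintptr_t)heap;
        heap_end   = heap_start + (1 << 16);

        /* Pretend these words were found on a thread's stack. */
        uintptr_t words[] = {
            (uintptr_t)(heap + 128),   /* a real pointer into the heap       */
            0x12345678u,               /* an integer that could be mistaken  */
            (uintptr_t)&heap_start     /* a pointer, but not into the heap   */
        };

        for (int i = 0; i < 3; i++)
            printf("word %d: %s\n", i,
                   looks_like_pointer(words[i]) ? "maybe a pointer" : "definitely not");

        free(heap);
        return 0;
    }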