Well, this is not completely true. As I said, JITs employ absolute addresses a lot; for instance, you can use `lea` to compute the absolute address of a variable at runtime (with a PC-relative addressing mode). That said, I still see `movabs` of a hardcoded address (not necessarily a hardware register) into a register followed by a call through it (and this is still position-independent code). Also, if I remember correctly, the __TEXT segment is not randomized by default on Linux.
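For illustration, here's a minimal sketch of that second pattern, assuming Linux/x86-64 and that the kernel still allows RWX anonymous mappings (hardened setups may not); the `hello` target function is made up:

```c
/* Minimal sketch: emit "movabs rax, imm64; call rax; ret" into an
   executable page and run it, the way a JIT bakes in an absolute
   target address. Linux/x86-64 assumed. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void hello(void) { puts("reached via movabs + call rax"); }

int main(void) {
    uint8_t code[16];
    size_t n = 0;

    code[n++] = 0x48; code[n++] = 0xB8;   /* movabs rax, imm64 */
    uint64_t addr = (uint64_t)(uintptr_t)&hello;
    memcpy(code + n, &addr, sizeof addr); n += sizeof addr;
    code[n++] = 0xFF; code[n++] = 0xD0;   /* call rax */
    code[n++] = 0xC3;                     /* ret */

    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;
    memcpy(buf, code, n);

    /* The emitted code is still position-independent: the absolute
       address lives in the instruction stream as immediate data, not
       as a RIP-relative displacement. */
    ((void (*)(void))buf)();
    return munmap(buf, 4096);
}
```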
I think it largely depends on the definition you use. If you require RISC to be a load/store architecture, x86 is not even close to being one. Also, aarch64 is a variable-length instruction set and includes complex instructions (such as those that perform AES operations). And compiler optimizations are meant to benefit all architectures, regardless of RISC/CISC.
Personally, I think the RISC/CISC "question" isn't really meaningful anymore, and it's not the right lens with which to compare modern architectures. Partially, this is because the modern prototypes of RISC and CISC--ARM/AArch64 and x86-64, respectively--show a lot more convergent evolution and blurriness than the architectures at the time the terms were first coined.
Instead, the real question is microarchitectural. First, what are the actual capabilities of your ALUs, how are they pipelined, and how many of them are there? Next, how good are you at moving stuff into and out of them--the memory subsystem, branch prediction, reorder buffers, register renaming, etc. The ISA only matters insofar as it controls how well you can dispatch into your microarchitecture.
It's important to note how many of the RISC ideas haven't caught on. The actual "small" part of the instruction set, for example, is discarded by modern architectures (bring on the MUL and DIV instructions!). Designing your ISA to let you avoid pipeline complexity (e.g., branch delay slots) also fell out of favor. The general notion of "let's push hardware complexity to the compiler" tends to fail because it turns out that hardware complexity lets you take advantage of dynamic opportunities that the compiler fundamentally cannot see statically.
The RISC/CISC framing of the debate is unhelpful in that it draws people's attention to rather more superficial aspects of processor design instead of the aspects that matter more for performance.
> It's important to note how many of the RISC ideas haven't caught on.
2-in, 1-out didn't, either. Nowadays all floating-point units support 3-in, 1-out via fused multiply-add. SVE provides a mask argument to almost everything.
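As a concrete (if trivial) illustration, C's `fma()` is exactly a 3-in, 1-out operation: it computes a*b + c with a single rounding, and on hardware with FMA units compilers typically lower it to one instruction (you may need `-mfma` or a suitable `-march`; otherwise it falls back to a libm call):

```c
/* fma(): three inputs, one output, single rounding.
   Compile with: cc fma_demo.c -lm */
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.5, b = 2.0, c = 0.25;
    /* 1.5 * 2.0 + 0.25 = 3.25, rounded once */
    printf("%f\n", fma(a, b, c));
    return 0;
}
```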
Unless you're using a definition I'm not familiar with, aarch64 isn't a variable-length instruction set. Here's Richard Grisenthwaite, Arm's lead architect, introducing ARMv8; the slide here confirms "New Fixed Length Instruction Set":
I understand that they refer to it as a fixed-length instruction set, and that's correct; note, though, that not all ARMv8 instructions are effectively 4 bytes long. Some instructions that appear together are fused into a single one, and SVE, for instance, introduces a prefix instruction (MOVPRFX); so in practice an instruction can sometimes be 8 bytes long.
Macro-op fusion of the MOVW/MOVT family doesn't count. At the time of that presentation, SVE didn't exist. Even now, the masked move instruction in SVE can also stand on its own as a single instruction and sometimes it does get emitted as its own uop.
Thanks, yes, of course. I guess it's probably fair to say that philosophically it's fixed-length, in the way that the original Arm was RISC, i.e. with some very non-RISC-y instructions. Very different from x86, though.
The main difference is x86 decode is hell to parallelize, as you have no idea where instructions start or end. It's a linear dependency chain of instruction lengths, an antipattern in the modern parallel processing world. Modern x86 CPUs have to use a large number of tricks and silicon to deal with this decently.
Whereas even with Thumb-2, you can at worst just try decoding an instruction at every halfword, and then throw away the half of the results that turn out to be the second halfword of an instruction that was already taken care of. If you tried to do the same thing with x86 you'd throw away many more results, trying to decode (much more complex encodings) at every byte.
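A toy sketch of why that's cheap for Thumb-2: length determination is local to a single halfword (per the ARM ARM, an instruction is 32-bit iff bits [15:11] of its first halfword are 0b11101, 0b11110, or 0b11111, and 16-bit otherwise), so every halfword lane can decide its length independently. The serial walk at the end only picks which speculative decodes to keep:

```c
/* Simplified Thumb-2 front end: decide instruction length per halfword,
   then discard the lanes that are the tail of a 32-bit instruction. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static int thumb_len_halfwords(uint16_t hw1) {
    unsigned top5 = hw1 >> 11;  /* bits [15:11] of the first halfword */
    return (top5 == 0x1D || top5 == 0x1E || top5 == 0x1F) ? 2 : 1;
}

int main(void) {
    /* MOVS r0,#1 (16-bit), then MOV.W r1,#0 (32-bit: F04F 0100) */
    uint16_t stream[3] = {0x2001, 0xF04F, 0x0100};

    int len[3];
    for (size_t i = 0; i < 3; i++)   /* independent per halfword */
        len[i] = thumb_len_halfwords(stream[i]);

    int keep[3] = {0};
    for (size_t i = 0; i < 3; i += len[i])
        keep[i] = 1;                 /* mark real instruction starts */

    for (size_t i = 0; i < 3; i++)
        printf("halfword %zu: %s\n", i,
               keep[i] ? "instruction start" : "discarded (tail)");
    return 0;
}
```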
Is it really so hard to find instruction lengths in x86? State machines are associative, and therefore you can build a reduction tree for parallel processing of them. And the state machine itself isn't too bad: it's mostly prefixes, plus figuring out whether the opcode uses a ModR/M byte (which most do) or has an immediate operand. And while x86 does have a nasty habit of packing multiple instructions into a single opcode (via specific register values in the ModR/M byte), I believe all of them share the same immediate-operand behavior.
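One classic way to realize that idea in parallel is pointer jumping, a cousin of the reduction tree. The sketch below assumes a hypothetical per-byte length oracle: `len[i]` is the length of the instruction that *would* start at byte i, computable independently for every byte (the values here are made up). Boundaries then form a chain 0 -> len[0] -> ..., which doubling resolves in O(log n) rounds instead of one long serial dependency:

```c
/* Resolve instruction boundaries in a 16-byte window by pointer jumping. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define N 16  /* one 16-byte fetch window */

int main(void) {
    /* made-up per-byte instruction lengths for illustration */
    int len[N] = {3, 2, 1, 1, 4, 2, 3, 1, 2, 5, 1, 3, 2, 2, 1, 1};

    /* each byte independently computes its successor (parallelizable) */
    int jump[N + 1];
    for (int i = 0; i < N; i++)
        jump[i] = (i + len[i] > N) ? N : i + len[i];
    jump[N] = N;  /* sentinel: end of window */

    bool start[N + 1] = {false};
    start[0] = true;

    /* log2(16) = 4 doubling rounds; each round is conceptually parallel */
    for (int r = 0; r < 4; r++) {
        bool s[N + 1];
        int  j[N + 1];
        memcpy(s, start, sizeof s);
        memcpy(j, jump, sizeof j);
        for (int i = 0; i <= N; i++) {
            if (s[i]) start[j[i]] = true;  /* propagate "is a start" */
            jump[i] = j[j[i]];             /* double the hop distance */
        }
    }

    for (int i = 0; i < N; i++)
        if (start[i]) printf("instruction starts at byte %d\n", i);
    return 0;
}
```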
I suspect that in one pipeline stage, you could at least resolve the entire cacheline into the individual instruction boundaries that can be simultaneously issued into uops, if not having the entire instruction decoded into the hardware fields. You wouldn't know if register 7 referred to a general purpose register, or a debug register, or an xmm reg, or whatnot, but you'd probably know that it was a register 7.
And after you know each instruction boundary, now you have to do a massive mux from positions in the cache line to separate decoders. As I understand, that's a big part of the problem, and essentially costs more than a single pipeline stage.
Even if the same key is shared by all processes, the keys are stored in registers that are meant to be inaccessible to userspace, so a disclosure vulnerability would not help here anyway.
True, but driver methods are exposed through the IOKit framework, which clearly does not even attempt to reduce the attack surface, and instead makes exploitation easier. Thus, its design appears to be fundamentally broken.
Just out of curiosity: did they let you use the MacBook Pro without problems? Can you guys @ Microsoft decide to use whatever OS is most suitable for what you do?
Microsoft has had Macs in use for a very long time, even pre-dating the famous Gates investment. There are some infamous pics of delivery vans unloading tons of Apple boxes in Redmond. I would expect them to be even more liberal now under Nadella, but to be honest, they probably get great prices on Surfaces, which are very nice machines now.
Just saw this. The answer is it depends on the team, but unless there is a hardware/software reason tied to your job for a specific platform, people can choose what they want.
Many of my colleagues use macOS, some use Windows, some Linux. I have a work-issued Mac and a work-issued Surface Book, because it’s important to test compatibility, especially when it comes to CLI stuff, across different platforms. I have Linux VMs and docker containers and WSL configured too.
Sorry, what kind of work would it need? Anyway, I can understand that this may not seem like real benchmarking; while I was writing it, the benchmark eventually became almost an excuse. The very original purpose was to verify whether my professor's thesis was really right or not. My intention was never to cover benchmarking or shared memory exhaustively (though I did try to touch on both).
First of all, your article was fine: you had a question, you thought about it a little and got some data, and wrote it up. More people should do this.
As an article it's not particularly clear in method or presentation. It's not especially clear about what it's measuring, the process, or the isolation of variables. But it's not attempting to be a journal article, so that's not what I meant by "needs work" (though it is hard to pull the message out of the wording).
But I don't think it supports your thesis. Again, that's part of what the web is about (not every posting should be a well-polished pearl), so I'm glad you posted this. If you cared, though, I'd improve your process. This post might not be worth rewriting though -- that's up to you.
Yes, I got what you meant, and I know that some parts, especially those that cover measuring, should perhaps be investigated more deeply. I'll outline the objectives better next time.
I just wanted to note down the numbers I got from timing. As I said, I wanted to remark on how interesting it is that they are used complementarily, and I had fun implementing them.