Reverse-engineering the conditional jump circuitry in the 8086 processor (righto.com)
143 points by picture on Jan 23, 2023 | 38 comments


Yes, another look inside the 8086. Any suggestions on what part of the 8086 I should write about next?


Unless you did it already and I missed it, my vote would be for the external bus sequencing. The 8086's multiplexed A/D bus and its 4-clock (!) cycle have always seemed needlessly complicated to me, especially when compared to the 68k's async bus or the 6502's "?!$!@ it, just expose the internal latches" designs. Clearly Intel was thinking it was getting value here somehow (well, something more than pure pin count optimization), but it really just seems like they were adding behavior they didn't need.


Yes, the bus sequencing is very complex. Memory accesses turn out to have a 6-clock cycle, but two cycles are overlapped with the end of the previous access, so it's even more of a mess than you'd think.

I think a large part of it was pin count optimization; Intel had weird beliefs about keeping the pin count down. (E.g. Intel was really resistant to using even 18 pins for the 8008, which was way too few.) Another issue was that the 8086 supported two different bus protocols: "minimum" and "maximum" mode, where "minimum" was straightforward and easy to use, while "maximum" provided much more information but needed another chip to decode the signals. (I don't know if anyone used "minimum" mode.) Finally, the 8087 needed a lot of information about what was going on inside the 8086's prefetch queue, so that needed more bus signals.


> Intel had weird beliefs about keeping the pin count down.

They might seem weird by today's standards, but by late-1970s standards a '40 pin' package was significantly cheaper for others to incorporate into their designs than other sizes. There were a lot fewer "standard socket sizes" available back then, and anything that wasn't already available entailed a lot of manufacturing expense, resulting in a comparatively high initial cost to recover the capital expenditure of bringing the new size to production.


Intel's beliefs about pin count were weird even for the time. See Federico Faggin's oral history (p55-56) for a discussion about how making everything 16 pins was like a religion at Intel, damaging functionality, at a time when other companies had 40 or 48 pin packages. The Motorola 68000 had 64 pins.

http://archive.computerhistory.org/resources/text/Oral_Histo...


Speaking of interfacing the 8087, here's a fresh post from today: http://www.os2museum.com/wp/learn-something-old-every-day-pa...


I remember seeing my first 68k -- it looked like an aircraft carrier.


One of the magazine ads for the original Macintosh highlighted this. Iirc, a 68k in all its glory was shown next to a plastic DIP 8088. It made enough of an impression on my ~12 y/o self to still picture it today.

Well, look at that. It’s this ad: https://www.flickr.com/photos/mwichary/3236356042 and I was 11 at the time.


That X-ray view of the ball mouse seemed so high-tech and mysterious back then, but now that I'm all grown up it looks obvious and even old-fashioned.


Seeing that view, I thought for the first time in twenty years about the curious stickiness of a well-used mouse ball, and about cleaning the gubbins off the counter wheels whenever the cursor started to stall and jump on the screen.


Have address calculations been covered yet? Those are pretty complex on the '86.

Alternatively, the 'rep' prefix would be interesting.


I vaguely remember that it had trouble remembering more than one prefix past an interrupt -- so if you wanted to combine REP with a segment override prefix you should put the REP first, then the segment override prefix, then the string instruction... and then some conditional jump stuff that jumped back to the REP if CX != 0.

So basically:

    repeat: rep                  ; repeat prefix first...
            cs:                  ; ...segment override last, so it's the one kept across an interrupt
            scasb                ; the string instruction itself
            jcxz   done          ; CX = 0: nothing left to do
            jmp    short repeat  ; otherwise an interrupt cut the REP short; resume it
    done:


Michael Abrash's book covers this. It may just be better to swap out the segment registers at that point, though.


I always wondered why x86 has LEA when its functionality can be replicated by ADD. It has to do with LEA and ADD being able to run in parallel because LEA uses a separate ALU in the address calculation part of the chip, not the main ALU.


LEA only got more powerful in later models as the restrictions on registers were removed and more addressing modes were added. Now it can do several additions and a multiplication in a single operation. Reusing that memory hardware/instruction format is a clever ISA decision.
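For example, a single LEA can fold a base register, a scaled index and a constant into one operation without touching the flags (an illustrative sketch in 32-bit Intel syntax; the register choices are arbitrary):

    lea  eax, [ebx + ecx*4 + 12]   ; eax = ebx + ecx*4 + 12, flags untouched

    ; the same computation with plain ALU instructions takes several steps
    mov  eax, ecx
    shl  eax, 2                    ; eax = ecx*4
    add  eax, ebx
    add  eax, 12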


Wait, so if you were imaginative enough with how you use registers to calculate addresses you could abuse LEA as a DSP MAC (multiply/accumulate) instruction?


Compilers do it all the time, for example GCC compiles "x*4 + y" to a single LEA instruction: https://godbolt.org/z/TvdW5sK4b


Aha, playing with it, it looks like the multiply is just a shift: if you try to do anything other than powers of two, it has to break things down into more instructions.

Still clever, though. I guess it's to make it easier and quicker to index over words or multiples of words?


The addressing logic on the 80386+ can add an index shifted left by 0..3 bits (unscaled, x2, x4, x8) to a base register plus an immediate offset.

By using the same register for base and index you can also multiply one register by 3, 5 or 9.

Earlier (16-bit) x86 chips did not have the scaling feature and were limited to certain combinations of base and index (BX/BP as base, SI/DI as index), so LEA was less useful. If the registers were carefully assigned, it could still be used to do an addition and put the result into another register; normal ALU operations always use one of the operands as their destination.
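Illustrative sketches of both points (register choices are arbitrary):

    ; 386+: using the same register as base and index multiplies by 3, 5 or 9
    lea  eax, [eax + eax*2]   ; eax *= 3
    lea  eax, [eax + eax*4]   ; eax *= 5
    lea  eax, [eax + eax*8]   ; eax *= 9

    ; 8086/186/286: no scaling, only BX/BP plus SI/DI (plus a displacement)
    lea  ax, [bx + si + 8]    ; ax = bx + si + 8, flags untouched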


The system might choose to use relocations for LEA and not for ADD -- this is of course not relevant for stack relative addressing and struct member addressing on the heap. I think I ran into this when coding assembler for DOS in the late 80's.

LEA also gets the "register + offset" thing done in a single instruction instead of two (MOV + ADD). It's also really easy for both assembler programmers and (dumb) compilers.

The "run in parallel" stuff is you looking at modern(ish) CPUs and thinking the original 8086 looked anything like that inside.


Those are good suggestions. I'll see if I can figure out how to explain them :-)


Have you posted anything explaining how the chip senses, and responds to, the external interrupt pins? I.e., the circuitry that handles detecting a level change plus how the microcode reacts when an interrupt signal occurs.


Not exactly the 8086, but how about the 80286 bits apart from the Execution Unit?

If one ignores the extra instructions, one gets the impression that most of the differences will be in the BIU, and some in the address adder block added to the system. But possibly that additional adder ALU can be ignored?


The 80186 already added dedicated logic for address computation, as well as multiply/divide and repeated shift/rotate (no barrel shifter, but only one cycle for each shifted bit instead of several as on the 8086). It also had the extended instructions, except for those related to protected mode.

However the microcode format remained essentially the same[1], so I don't think there was a fundamental redesign in either the EU or BIU.

The '286 is different, and not just in the BIU (which now has to enforce segment limits etc). From what I've pieced together looking at die shots, and US patent 4442484:

There are three 6-bit fields to select registers for each micro-instruction. ALU operations can apparently take any register as operand, only immediate values have to be first loaded into a temporary register[2]. The microcode is also organized more like a conventional ROM instead of being addressed directly by opcodes.

Bytes from the prefetch queue first go through a separate decoding stage. An "entry point PLA" translates the opcode (with additional inputs for 0Fh and REP prefixes, real/protected mode, and "modr/m extended opcodes") into a microcode address. That address, any operands including a 16 bit immediate and 17(?) bit displacement field, and other flags are placed into a "decoded instruction queue" holding up to three instructions.

From what I've read the 386 was actually very similar despite adding 32 bit registers and paging. The next major changes to the microarchitecture came in the 486 and Pentium.

[1] https://news.ycombinator.com/item?id=34334799

[2] https://rep-lodsb.mataroa.blog/blog/the-286s-internal-regist...


hlt handling would be neat to know. IIRC from your earlier posts (and Andrew Jenner's), that was random logic, when I kind of expected it to be handled by ucode (but I guess it makes sense that it's probably one-off logic anyway and doesn't really make sense to abstract into ucode for that one instruction). Do parts of the core get clock gated? Particularly the dynamic pieces that one would expect to drain out. What does recovery from the halted state look like internal to the core, and is there any other cleanup that needs to occur when starting back up?


What do you mean "hit handling"? Edit: I guess I need a bigger font :-)

There's no clock gating; everything gets the clock. There's not a lot to say about HALT. It stops memory operations and the prefetch queue, so the processor stops processing until it gets an interrupt.
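For context, about all software does with it is an idle loop; a minimal sketch, assuming interrupts are enabled so something can wake the CPU:

    idle:   sti                ; make sure interrupts can get through
            hlt                ; stop executing until an interrupt arrives
            jmp    short idle  ; after the handler returns, go back to sleep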


> What do you mean "hit handling"?

Not hit, hlt instruction.

> There's no clock gating; everything gets the clock. There's not a lot to say about HALT. It stops memory operations and the prefetch queue, so the processor stops processing until it gets an interrupt.

Oh, ok. So the ucode is just sitting in that state waiting for Q to fill for an extra-long time, is all? Is the prefetch queue flushed, or are the next bytes sitting in Q prefetched, just not presented to the ucode?


Do as many as you like. These are fascinating.


The CX-related conditionals, including LOOP (which is a forward-porting of DJNZ).
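Roughly what those do, as a sketch (LOOP behaves like DEC CX plus a conditional jump except that it leaves the flags alone, and JCXZ branches when CX is already zero):

    top:    ; ...loop body...
            loop   top        ; cx = cx - 1; jump to top if cx != 0, flags untouched

    ; near-equivalent spelled out (but DEC does modify the flags):
    top2:   ; ...loop body...
            dec    cx
            jnz    top2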


The 8087!


I've written a few articles about the 8087: https://www.righto.com/search?q=8087


I'm absolutely interested in this mysterious 108 micro op! Hopefully you'll find it.


Probably SALC (set AL to 0xff if the carry flag is set, zero if clear). NEC's chips ignore the low bit of the opcode and run the documented XLAT instruction instead.
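In effect (a sketch; ignoring the fact that SBB also updates the flags while SALC does not):

    salc               ; undocumented: al = 0FFh if CF is set, 00h otherwise
    sbb    al, al      ; documented near-equivalent: al = al - al - CF = -CF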


108 appears to be a paragraph number.


>The main advantage of microcode is that it turns design of control circuitry into a programming task instead of a difficult logic design task.

But then you need circuitry to process the microcode. Why is that desirable?

Is microcode basically an abstraction layer that allows a larger machine-code instruction set to be supported on top of a smaller native instruction set -- and if it is, why is that better than just using the smaller instruction set directly?


Yes, there is less you have to implement directly in logic. A complicated instruction can be implemented as multiple microcode instructions, even including branches and loops of micro-ops. And RISC basically came about by observing exactly what you describe: cutting out the middleman actually worked better.

But CISC complexity and microcode were significantly driven by the lack of good optimizing compilers. A lot of performance-critical code had to be written in assembly, and writing assembly was difficult and time-consuming, so it was nice to have more expressive instructions.

Early CPUs also had little or no cache and instruction fetch bandwidth was a very significant bottleneck. This was another significant driver for CISC.

So it wasn't that CPU designers were idiots from the start; they had sound reasons for the choices they made at the time. RISC required a certain confluence of hardware and software advancement to happen before it became the obviously better alternative.

Interestingly, things swung back the other way a decade or so later: as CPUs got vastly more complicated and capable, the ISA became relatively less important, and CISCs were mostly able to catch back up to RISCs.


Good answer. I'll also point out that RAM was expensive in the olden days, so you wanted your instructions to be as dense as possible. It made sense to have one instruction do as much as possible. It also made sense to have instruction lengths ranging from 1 byte to 4 bytes or more, as on the 8086, even though that made decoding complicated.
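To make the range concrete, a few 8086 encodings as a sketch (byte values in the comments):

    inc   ax                            ; 40                 -- 1 byte
    mov   ax, 1234h                     ; B8 34 12           -- 3 bytes
    add   word ptr [bx+si+100h], 1234h  ; 81 80 00 01 34 12  -- 6 bytes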

Also, for the original question, you really don't want to program in microcode directly, as it's kind of a mess. Micro-instructions expose a lot of ugly hardware details. Moreover, it locks you into a fixed architecture and you can't upgrade, because the microcode will change.


Very cool. I had no idea that microcode existed as far back as 1978. Gives you a sense of how complex it must be inside a modern Intel CPU.



