Which Machines Do Computer Architects Admire? (2013) (clemson.edu)
174 points by dhotson on Jan 8, 2020 | 156 comments


I'm not a computer architect (so my opinion shouldn't count in this thread), but as someone who did a lot of numerical programming over the years, I really thought Itanium looked super promising. The idea that you can indicate that a whole ton of instructions can be run in parallel seemed really scalable for FFTs and linear algebra. Instead of more cores, give me more ALUs. I know "most" software doesn't have enough work between branches to fill up that kind of pipeline, but machine learning and signal processing can certainly use long branchless basic blocks if you can fit them in icache.

At the time, it seemed (to me at least) that it really only died because the backwards compatibility mode was slow. (I think some of the current perception of Itanium is revisionist history.) It's tough to say what it could've become if AMD64 hadn't eaten its lunch by running precompiled software better. It would've been interesting if Intel and compiler writers could've kept focus on it.

Nowadays, it's obvious GPUs are the winners for horsepower, and it's telling that we're willing to use new languages and strategies to get that win. However, GPU programming really feels like you're locked outside of the box - you shuffle the data back and forth to it. I like to imagine a C-like language (analogous to CUDA) that would pump a lot of instructions to the "Explicitly Parallel" architecture.

Now we're all stuck with the AMD64 ISA for our compatibility processor, and it seems like another example where the computing world isn't as good as it should be.


There's no free parallelism™️ though.

> Author: AgnerDate: 2015-12-28 01:46

> Ethan wrote:

> > Agner, what's your opinion on the Itanium instruction set in isolation, assuming a compiler is written and backwards compatibility do not matter?

> The advantage of the Itanium instruction set was of course that decoding was easy. The biggest problem with the Itanium instruction set was indeed that it was almost impossible to write a good compiler for it. It is quite inflexible because the compiler always has to schedule instructions 3 at a time, whether this fits the actual amount of parallelism in the code or not. Branching is messy when all instructions are organized into triplets. The instruction size is fixed at 41 bits and 5 bits are wasted on a template. If you need more bits and make an 82 bit instruction then it has to be paired with a 41 bit instruction.

(https://www.agner.org/optimize/blog/read.php?i=425)
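For concreteness, the numbers Agner quotes add up to the fixed 128-bit IA-64 bundle:

    3 \times 41\ \text{bit slots} + 5\ \text{bit template} = 128\ \text{bits} = 16\ \text{bytes per bundle}

Every fetch is a 16-byte triplet, which is why the compiler is forced to schedule three instructions at a time whether the code has that much parallelism or not.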

Besides, the memory consistency model of Itanium is also a brain teaser used in interviews as a counterexample to poorly-synchronized solutions.


Agner is obviously brilliant, but I think maybe he's looking at general purpose applications.

If I'm doing a million point FFT, I can easily give you 2 million operations in a row without a loop/branch. Maybe 1000 of those at a time could be run in parallel before the results needed to commit for the next 1000. I'd be willing to pay for 1 or 2 nops in the last bundle of every 1000 operations. I admit, the idea might not be awesome for a word processor or spreadsheet, but I did specify signal processing and machine learning.
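Just to make "2 million operations in a row" concrete, here's a rough C sketch (the layout is simplified and the twiddles are all 1+0i, so treat it as an illustration rather than a real FFT). Every butterfly within a stage is independent of the others, which is exactly the kind of work you'd want to hand an explicitly parallel machine:

    /* One radix-2 butterfly: ~10 scalar ops, no branches. Butterflies on
       disjoint indices within a stage don't depend on each other. */
    #define BFLY(re, im, i, j, wr, wi) do {                    \
            float tr = (wr) * (re)[j] - (wi) * (im)[j];        \
            float ti = (wr) * (im)[j] + (wi) * (re)[j];        \
            (re)[j] = (re)[i] - tr;  (im)[j] = (im)[i] - ti;   \
            (re)[i] += tr;           (im)[i] += ti;            \
        } while (0)

    /* Four independent butterflies over an 8-point block, fully unrolled:
       ~40 ops, zero branches. A million-point FFT is ~20 such stages of
       half a million butterflies each. */
    void fft_stage(float *re, float *im)
    {
        BFLY(re, im, 0, 4, 1.0f, 0.0f);
        BFLY(re, im, 1, 5, 1.0f, 0.0f);
        BFLY(re, im, 2, 6, 1.0f, 0.0f);
        BFLY(re, im, 3, 7, 1.0f, 0.0f);
    }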


A common complaint about Itanium was that it was an overpriced DSP masquerading as a general purpose CPU.


I actually kind of like that characterization :-)


For signal processing and machine learning, maybe you'd be better off with a systolic array processor or at least a bunch of deep Cray-style vector pipelines? And, like you said, GPUs seem to be doing better at those, in a vaguely Tera-like way, than the Itanic ever could have.


I never got to play on a Cray, but I remember working on a Convex for a couple semesters in school. I had no idea what I was doing back then.

Nowadays, it's pretty clear GPUs are the winner, but like I said, they just don't feel like you actually live and breathe inside of them the way you do with CPU code (you shuffle your data over and shuffle it back), and I'm kind of just imagining an alternative timeline where Intel and compiler writers got a chance to run with the EPIC idea.


The dark silicon era will presumably also be the heterogeneous hardware era.


> I admit, the idea might not be awesome for a word processor or spreadsheet,

Or a web server or browser. In fact, it pretty much only helps for your use case. Which is why people are converging on specialised hardware for it, and why Itanium was a commercial failure.


The computationally heavy part of a web browser is the rendering/video engine, which is standard super parallel graphics stuff. Web servers are dominantly I/O-bound (or running wastefully slow scripting languages).

I know it's all a fantasy now, but I wonder what the world might have been like if there wasn't such a split between what you can only do on a CPU and what you can only do on a GPU. Maybe you like heterogeneous computing and gluing C++ to CUDA, but I think it's ugly. Stretch outside of the current box a little bit and imagine a hybrid somewhere in the middle ground of CPUs and GPUs. I think a variable sized VLIW could've gotten there if the market had any more imagination than it does. It's Ford's "faster horses" problem.


Check out Xeon Phi (not VLIW though), which executes unmodified amd64 instructions on many cores. However, it never became a mainstream product and Intel recently killed it. There are many reasons behind its sunset - tooling, programming difficulty to reach max throughput, perf/cost ratio, ...

Related discussion: https://news.ycombinator.com/item?id=17606037


Itanium is essentially a VLIW architecture and... well, as the bottom of the page mentions, VLIW architectures tend to turn out to be bad ideas in practice.

GPUs showed two things: one, you can relegate kernels to accelerators instead of having to maximize performance in the CPU core; and two, you can convince people to rewrite their code, if the gains are sufficiently compelling.


Specifically, the promise of VLIWs (scalar parallelism) was overhyped--the Multiflow compiler couldn't find enough scalar parallelism in practice to keep all 28 ALUs busy (in the widest model we built), or even all 7 ALUs (in the narrowest).

(I ended up running the OS group at Multiflow before bailing right before they hit the wall.)


We need new programming models that make it easier to expose static parallelism to the compiler. Doing it all in plain old C/C++, or even in "managed" VM-based languages, cannot possibly work - and even conventional multi-threading is way too coarse-grained by comparison to what's most likely needed. Something based on a dataflow-oriented description of the code would probably work well, and be possible to integrate well enough with modern functional-like paradigms.


I reached the same conclusion. We need a way to be able to explicitly say "this piece of code here is a standalone unit that is independent of everything else until we have to coalesce results of computation", or "this unit manages those other units and coalesces their results".

Erlang's OTP captures this pretty okay-ish with their concept of processes (preemptively scheduled green threads that get aggressively multiplexed on all CPU cores) but I feel we can go a little bit further than that and have some sort of shorter markers in the code, say `actor { ... }` or `supervises(a1, a2, a3) { ... }` or something.
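The `actor { ... }` / `supervises(...)` syntax is hypothetical, but the shape it describes - independent units that only touch shared state at a coalescing point - can at least be sketched with plain threads today (much coarser-grained than what's being asked for, and all the names below are made up):

    #include <pthread.h>
    #include <stdio.h>

    /* Two standalone units that share nothing while running. */
    static void *unit_a(void *out) { *(int *)out = 40; return NULL; }
    static void *unit_b(void *out) { *(int *)out = 2;  return NULL; }

    int main(void)
    {
        pthread_t ta, tb;
        int ra = 0, rb = 0;

        /* "actor"-ish: kick off the independent units */
        pthread_create(&ta, NULL, unit_a, &ra);
        pthread_create(&tb, NULL, unit_b, &rb);

        /* "supervises"-ish: the parent coalesces their results */
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        printf("coalesced result: %d\n", ra + rb);
        return 0;
    }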


VISC seems at least potentially promising.

https://www.anandtech.com/show/10025/examining-soft-machines...


Yes, but VLIW's premise was that all that coordination that the VISC architecture is doing at runtime in hardware could be computed at compile-time in software.


Not in general; VLIWs work great in DSP architectures like the Hexagon. They tend to fall down in the presence of unpredictable memory accesses, though, and while Itanium had facilities to try to mitigate that, they didn't work well enough.


I think that another issue with VLIW as a general purpose ISA is that for it to be worthwhile the compiler has to have deep understanding of the underlying implementation (deep enough to always generate hazard-free code) such that the CPU does not have to contain any scheduling logic. This is the case for most embedded/DSP VLIW architectures. The issue with that is that there cannot reasonably be any kind of backward compatibility on the machine code level.


15 years ago I thought Itanium was the coolest thing ever. As a compilers student, a software scheduled superscalar processor was kind of like a wet dream. The only problem is that that dream never materialized, due to a number of reasons.

First, compilers just could never seem to find enough (static) ILP in programs to fill up all the instructions in a VLIW bundle. Integer and pointer-chasing programs are just too full of branches and loops can't be unrolled enough before register pressure kills you (which, btw, is why Itanium had a stupidly huge register file).

Second, it exposes microarchitectural details that can (and maybe should) change quite rapidly. The width of the VLIW is baked into the ISA. Processors these days have 8 or even 10 execution ports; no way one could even have space for that many instructions in a bundle.

Third, all those wasted slots in VLIW words and huge 6-bit register indices take up a lot of space in instruction encodings. That means I-cache problems, fetch bandwidth problems, etc. Fetch bandwidth is one of the big bottlenecks these days, which is why processors now have big u-op caches and loop stream detectors.

Fourth, there are just too many dynamic data dependencies through memory and too many cache misses to statically schedule code. Code in VLIW is scheduled for the best case, which means a cache miss completely stalls out the carefully constructed static schedule. So the processor fundamentally needs to go out of order to find some work (from the future) to do right now, otherwise all those execution units are idle. If you are going out of order with a huge number of execution ports, there is almost no point in bothering with static instruction scheduling at all. (E.g. our advice from Intel in deploying instruction scheduling for TurboFan was to not bother for big cores--it only makes sense on Core and Atom that don't have (as) fancy OOO engines).

There is one exception though, and that is floating point code. There, kernels are so much different from integer/pointer programs that one can do lots of tricks from SIMD to vectors to lots of loop transforms. The code is dense with operations and far easier to parallelize. The Itanium was a real superstar for floating point performance. But even there I think a lot of the scheduling was done by hand with hand-written assembly.
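On the loop-unrolling / register-pressure point, a small hedged sketch: unrolling a reduction with separate accumulators is what exposes independent operations to a static scheduler, and every extra accumulator is another live register. Push the unroll factor far enough to feed a wide machine and you spill.

    /* Dot product unrolled 4x with independent accumulators, assuming
       n is a multiple of 4. The four multiply-adds per iteration don't
       depend on each other; the price is four live accumulators (plus
       the loaded values), and it grows with the unroll factor. */
    float dot4(const float *a, const float *b, int n)
    {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        for (int i = 0; i < n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }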


> The Itanium was a real superstar for floating point performance.

No. Itanium was never a superstar - it was merely competitive at its best, and even that was only if you ignored price/performance; some of its versions were pretty bad in absolute performance and nowhere near competitive, and it was plain abysmal if you consider price/performance. Also, the majority of practical, important numerical computations were memory bandwidth bound, and thus it didn't matter as much whether you could pack the loop perfectly. And Itanium was almost never the highest memory bandwidth machine during its lifetime, partially due to many of its delays.

Itanium would be my choice for the worst architecture, as it successfully killed other CPUs and produced a lot of not very useful research.

MIPS (and Hennessy and Patterson) would be my first choice, for upending architecture design. Honorable mentions from me would be the IBM 801 (it led to a lot of research), the Intel iAPX 432 (for capability architecture, which I think will come back at some point), the Z80 and 8051 for ushering in computing power everywhere, and x86-64 for "the best enduring hack".

Anyway, the article itself is great, and I wish I had asked the same question to some of those folks mentioned in the article, and many other architects and researchers when I met them...


> MIPS (and Hennessy and Patterson) would be my first choice, for upending the architecture design.

I agree that it was a revolutionary design for the time. From today's perspective, some of the choices made did not age all that well (in particular branch and load delay slots).

For assembly level programming/debugging, my favorite architecture by far was POWER/PowerPC. Looking at x86 code vs PowerPC after Apple's switch made me almost cry, although the performance benefits were undeniable.


David Patterson didn't work on MIPS.


> There is one exception though, and that is floating point code.

I wish I had read to the end before replying (and then deleting) responses to each of your items. :-)

I think we're mostly in agreement, except for the following minor tidbits:

> The width of the VLIW is baked into the ISA [...] no way one could even have space for that many instructions in a bundle.

My understanding was that your software/compiler could indicate as many instructions as possible in parallel, and then stuff a "stop" into the 5 bit "template" to indicate the previous batch needed to commit before proceeding. So, you wouldn't be limited to 3 instructions per bundle, and if done well (hopefully), your software would automatically run faster as the next generation comes out and has more parallel execution units.

> all those wasted slots in VLIW words and huge 6-bit register indices take up a lot of space in instruction encodings

128 registers, so I'd think it'd be 7-bit indices. Each instruction was 41 bits, which is roughly 5 bytes. Most SSE/AVX instructions end up being 4-5 bytes (assuming no immediates or displacements), and that's with just 4-bit register indices. So it doesn't seem much worse than we have now.


> Instead of more cores, give me more ALUs.

It kinda didn't work that way though.

In practice, all of your ALUs, including your extra ones, were waiting on cache fetches or latencies from previous ALU instructions.

Modern x86 CPUs have 2-4 ALUs that are dispatched to in parallel, 4-5 instructions wide, and these dispatches are aware of cache fetches and previous latencies in real time. VLIW can't compete here.

VLIW made sense when main memory was as fast as the CPU and all instructions shared the same latency. History hasn't been kind to those assumptions. I doubt we'll see another VLIW arch anytime soon.

I accept the idea that x86 is a local minimum, but it's a deep, wide one. Itanium or other VLIW architectures like it were never deep enough to disrupt it.


> In practice, all of your ALUs, including your extra ones, were waiting on cache fetches

I think if anyone could get around memory bandwidth problems they would, but for some very interesting and useful algorithms, I can tell you way in advance exactly when I'll need each piece of memory. For these problems, VLIW/EPIC with prefetch instructions would be a win over all the speculation and cleverness.
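A hedged sketch of the "I know exactly when I'll need each piece of memory" case, using the GCC/Clang `__builtin_prefetch` builtin; the prefetch distance of 64 elements is a made-up number you'd have to tune per machine:

    /* Streaming pass with software prefetch: ask for src[i + 64] while
       still working on src[i], so the line is (hopefully) already in
       cache by the time we get there. */
    void scale(float *dst, const float *src, int n, float k)
    {
        for (int i = 0; i < n; i++) {
            if (i + 64 < n)
                __builtin_prefetch(&src[i + 64], 0 /* read */, 1 /* low temporal locality */);
            dst[i] = k * src[i];
        }
    }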

> Itanium or other VLIW architectures like it were never deep enough to disrupt it.

History is what it is, but I'm just imagining an alternative timeline where all the effort spent making Pentium/AMD64 fast was pumped into Itanium instead, and compiler writers and language creators got to target an architecture that didn't act like a 64 bit PDP-11.


Are these sorts of prefetch instructions widely supported in VLIW, though? AIUI, one of the "tricks" the Mill folks had to come up with as part of designing a generally usable VLIW-like architecture is making RAM accesses inherently "async", making it easy to statically schedule other instructions in case the RAM access stalls.


Rather than individual-memory-address prefetch instructions, how would you feel about sending a DMA program to a controller on-board a memory DIMM, that would then enable you to send short external commands to the memory that would be translated by the DMA program into custom “vector requests”, to which the memory could respond with long streams of fetch responses—shaped somewhat like the output of a CCD’s shift-register—where this stream of fetch responses would then entirely overwrite the calling CPU’s cache lines with the retrieved values?


Random thought: what’s stopping main memory from being as fast as CPUs (in throughput, not necessarily latency)? TDP? The unwillingness to pay $100s per DIMM?


As far as throughput is concerned, mostly the idea that there is a DIMM at all. That means that there has to be some wiring between the CPU/MC and the actual DRAM array that has a manageable number of wires and reasonable RF/EMC characteristics.


The bandwidth between the CPU and the DIMM (and any overhead from ser/des narrowing in the physical signalling layer used to connect the two) is only a constraint if the DIMMs are, to put it in a funny way, RISC—if you have to send them a stream of low-level retrieval requests to describe a high-level piece of state you’d like to know. Which does describe most modern DIMMs, but not all of them.

https://en.wikipedia.org/wiki/Content-addressable_memory (CAM), as used in network switches, isn’t under the same constraints as regular RAM. The requests you make to CAM are CISC—effectively search queries—putting the whole memory-cell array to work at 100% utilization on each bus cycle.

But even CAM is still slower than the CPU. Even when it’s on the same SoC package as the CPU, it’s still clocked in such a way that it takes multiple CPU cycles to answer a query. So, at least in this case, bus bandwidth is not “the” constraint.
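To make the RAM-vs-CAM contrast concrete, here's a toy software model (the table entries are made-up values). RAM answers "what's at address A?"; CAM answers "which entry equals key K?", with every cell doing the comparison at once in hardware:

    #include <stdint.h>
    #include <stdio.h>

    #define CAM_ENTRIES 8

    static const uint32_t cam[CAM_ENTRIES] = {
        0x0a000001, 0x0a000002, 0xc0a80001, 0xc0a800fe,
        0x08080808, 0x01010101, 0x7f000001, 0xffffffff,
    };

    /* In hardware all CAM_ENTRIES comparisons happen in the same cycle;
       software can only fake it with a loop. Returns match index or -1. */
    int cam_lookup(uint32_t key)
    {
        for (int i = 0; i < CAM_ENTRIES; i++)
            if (cam[i] == key)
                return i;
        return -1;
    }

    int main(void)
    {
        printf("%d\n", cam_lookup(0xc0a80001));  /* 2  */
        printf("%d\n", cam_lookup(0x12345678));  /* -1 */
        return 0;
    }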


The whole idea of DRAM is about making the whole thing cheaper by limiting the outside bandwidth (it is not that DRAM chips have a multiplexed address bus just to save pins; supplying the address in two phases is inherent to how a DRAM array works).

There is nothing that prevents you from making an SRAM/CAM array running at the same or even higher clock speed than a CPU made with the same semiconductor technology, except the cost of the thing. And in fact, an n-way associative L1 cache (for n>1) is exactly such a CAM array.


The problem is latency, not throughput. Memories do have the bandwidth to keep the processor caches full, but they cannot handle random access.

Anyway, you increase throughput by just adding more hardware. That's easy and widely done.


DRAM is pretty fast in throughput. The problem is the "random access" nature; every piece of indirection or pointer chaining is an unpredictable access. Every time you have a "." in your favourite object-oriented language, every step in a linked list: if that's a cache miss, you have to wait for it to come back ... and then potentially cache-miss the next lookup as well.


I'm not sure that "." does it, at least in C/C++. What's on the right should be very close in address to what's on the left, unless it's a reference.

In Java... not so much. I believe that "." in Java is the same as "->" in C/C++, unless what's on the right is a primitive data type rather than an object.
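A tiny C illustration of that distinction (types made up for the example). The "." accesses land at fixed offsets inside one object, usually on the same cache line; every "->" hop in the list walk is a separate, unpredictable load:

    struct point { double x, y; };

    struct node { int value; struct node *next; };

    double width(struct point p)          /* p.x and p.y sit side by side */
    {
        return p.y - p.x;
    }

    long sum_list(const struct node *n)   /* each n->next can be a miss   */
    {
        long s = 0;
        while (n) {
            s += n->value;
            n = n->next;                  /* serialized pointer chase      */
        }
        return s;
    }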


Several times in the last few months we've seen that various AMD CPUs performed better on certain benchmarks with overclocked RAM. I am not sure RAM that can help the CPU saturate the transfer channels even exists today.


I worked on a project with machines that used the PA-RISC CPUs. The importance of optimized compilers (and math libraries) can’t be overstated; they made those machines really shine. My understanding was the Itanium (which basically replaced PA-RISC in HP's Unix machine lineup) never got the compiler support to realize the architecture’s strengths, so everyone looked to the safer bet in 64bit computing.

It’s hard to compete with the scale of x86. Like software, I feel the industry tends toward one architecture (the more people use the architecture, the better the compilers, the more users ...). Even Apple abandoned PowerPC chips.


> so everyone looked to the safer bet in 64bit computing.

Itanic was the safer bet in 64-bit computing. It just sucked. Intel didn't switch to AMD64 until underdog AMD was already eating their lunch.

Today there are probably more aarch64 CPUs being sold every month than amd64 CPUs (including Intel's).


If anyone stood a chance to compete against x86, I'd think Intel would be it :-)


But Intel failed here with the same mistake they've made many times before: the people who actually buy Intel kit basically want a faster 8088/80386/Pentium, not a novel, cleaner, sexier new architecture. See: the iAPX 432, i860, i960MX, and recently Itanium.

Linus Torvalds has an interesting take on this (whether or not you agree): https://yarchive.net/comp/linux/x86.html


I think Linus is very focussed on the stuff he cares about, and that shows by his complaint about PAE, which is pretty invisible to us userland folk.

> the people who actually buy Intel kit basically want a faster 8088/80386/Pentium

My company is small fries compared to most folks, but what we really wanted at the time was cheaper DEC Alphas.


> I think Linus is very focussed on the stuff he cares about

I think he'd be the first to agree with that.

> PAE, which is pretty invisible to us userland folk

Perhaps, but the original conversation is about architecture, and PAE was a pretty grungy architectural wart, and the sort of thing that's very visible to the OS folks (as you say, what Linus cares about).

> we really wanted at the time was cheaper DEC Alphas

There were 'cheap' Alphas: the 21066/21068 like in the Multia. But they were dogs; to be cheap, they had to give up big cache and wide paths to main memory. Expensive system support level stuff was required for fast Alphas (complex support chipsets, 128-bit wide (later 256-bit) memory buses). More commodity inertia would have fixed that over time, but they never got there. Intel on the other hand was way down the road reaping commodity benefits, and it ran the software commodity folks wanted.


> Perhaps, but the original conversation is about architecture, and PAE was a pretty grungy architectural wart, and the sort of thing that's very visible to the OS folks (as you say, what Linus cares about).

I suspect if he was into writing compilers (or graphics, or numerics, or ...), some of the other grungy architectural warts of x86 might annoy him too.


I suspect if you read the linked thread, you'd see exactly what he thought. He was at Transmeta at the time, and isn't exactly unfamiliar with what compiler writers get annoyed with. You may or may not agree with what he says, but he has a cogent, interesting perspective.


Ahh, I only read the one post. My bad.


Well, I read the rest of his comments. There are little tidbits which are good observations, but mostly I continue to think he just disregards things which aren't in his field of interest. If I was more cynical, I might think he wanted the continuance of x86 specifically because he was working at Transmeta.


I don't think x86 compatibility mattered. When it launched in 2001 it was supposed to replace HP's PA-RISC architecture which is totally different anyway. Sun and its SPARC processors were very much alive, Google was only three years old, and AWS was five years away - the idea that a massive array of cheap x86 processors will outperform enterprise-class servers simply hadn't occurred to most people yet.

Of course, the joke is that cheap x86 processors did outperform Itanium (and every other architectures, eventually).


> When it launched in 2001...the idea that a massive array of cheap x86 processors will outperform enterprise-class servers simply hadn't occurred to most people yet

When do you mean? In 2001?

It occurred to Yahoo, whose site had been run that way since almost the beginning, on FreeBSD. It occurred to Google, whose site was run that way since the beginning. It occurred to anyone who was watching the Top500 list, which was already crawling with Beowulfs — admittedly, not at the top of the list yet. It should have occurred to Intel, who were presumably the ones selling those servers to Yahoo and Penguin Computing and VA Research (who had IPOed in 1999 under the symbol LNUX). It had occurred to Intergraph, who had switched from their own high-performance graphics chips to Intel's by 1998. It had occurred to Jim Clark, who had jumped off the sinking MIPS ship at SGI. In 1994. It occurred to the rest of SGI by 1998, when they launched the SGI Visual Workstation, then announced they were going to give up on MIPS and board the Itanic.

I mean, yes, it hadn't occurred to most people yet. Because most people are stupid, and most of the ones who weren't stupid weren't paying attention. But it hadn't occurred to most people at Intel? You'd think they'd have a pretty good handle on how much ass they were already kicking.

> the joke is that cheap x86 processors did outperform Itanium (and every other architectures, eventually).

Even on my Intel laptop, more of the computrons come from the Intel integrated GPU. In machines with ATI (cough) and NVIDIA cards, it's no contest; the GPU is an order of magnitude beefier.


> When it launched in 2001...the idea that a massive array of cheap x86 processors will outperform enterprise-class servers simply hadn't occurred to most people yet

The idea was certainly around in the early-to-mid 1980s, when some former Intel engineers founded Sequent.

The Balance 8000, released in 1984, supported up to 12 processors on dual-CPU boards, while the Balance 21000, released in 1986, supported up to 30.

I interviewed the founder, Casey Powell, and he was explicit about multiple Intel microprocessors replacing large systems. He was targeting minicomputers at the time, of course, but we all anticipated that bigger sets of more powerful CPUs would eventually surpass even the biggest "big iron".

Powell was a great guy. However, his company got taken over by IBM. In the end, he didn't get to change the world.

"It's hard to be the little guy on the block and have really great technology and get beaten, just because the other guy is big." https://www.cnet.com/news/sequent-was-overmatched-ceo-says/


Right! Some friends of mine spent a lot of time programming a Symmetry in the early 1990s. Also around the same time, 1988, Sun introduced the Sun386i, which could even run multiple MS-DOS programs at once — but it wasn't a huge success, and they stuck with SPARC. I think Sequent and the Sun386i were just too early, say by about six or seven years.

An interesting question is: what are the structural advantages of bigness? When Control Data produced the world's fastest computer, some people at IBM wondered how it could happen that a much smaller company could beat them to the punch that way; others believed that that smallness was precisely the reason.


The advantage of smallness is that you can be faster than the big guys. You also can go into smaller niches.

The advantages of bigness are that you can use scale to make the same thing less expensive, that you can make at least one mistake without it killing you, and that you can chase more than one "next big thing" at once.


> I don't think x86 compatibility mattered.

It did. At that time Linux and open source were not the clear winners in the server space they are now, and people were not used to (or able to) recompiling their code.

Windows took a long time to support Itanium and companies wouldn't buy it because they had nothing to run on those machines. They got x86 machines instead and amd64 when it became available.


> I don't think x86 compatibility mattered.

I'll admit I had a limited worldview, but not running Excel as quickly seemed like the kind of criticisms I saw in the trade rags at the time.

> the idea that a massive array of cheap x86 processors will outperform enterprise-class servers simply hadn't occurred to most people yet

Oh, I don't know. There was a really common Slashdot cliche running around at that time: "Can you imagine a Beowulf cluster of these?"


Yeah, it had clearly occurred to everyone on Slashdot. On the other hand, we liked to "imagine Beowulf clusters" of all kinds of recondite hardware. Elbrus 2000 is real power!

HOT GRITS!


/. was another world, one that I miss sometimes. I'm probably viewing it through rose-tinted nostalgia, but I don't remember the vitriol and hatred that so many social sites are soaked in these days.

The Mozilla open source announcement, the Microsoft anti-trust case, Linux exploding in popularity, it all felt like we were changing the world for the better.


/. was great... and then it wasn't. The troll population went way up, the "first post" thing was just noise, and eventually I just quit going there.

HN is also less than it used to be, but I think that the mods have kept it better than /. became - so far, at least.


That seems right on such important sorts of computation, disregarding other factors. On the history, actual HPC numbers for Itanium have appeared in the Infamous Annual Martyn Guest Presentation over the years. An example that came to hand from 15 years ago to compare with bald statements is https://www.researchgate.net/profile/Martyn_Guest/publicatio...

Regarding GPUs, Fujitsu may not agree (for Fugaku and spin-offs) depending on the value of "horsepower" relevant for HPC systems, even if an A64FX doesn't have the peak performance of a V100. They have form from the K Computer, and if they basically did it themselves again, there was presumably "co-design" for the hardware and software which may be relevant here; I haven't seen anything written about that, though.


That C like language is called "Verilog". (Yes I know it's a HDL but the point still stands. FPGAs are commodity these days.)


Verilog isn't really very C-like. It is slightly more C-like than its main competitor, VHDL. It is more C-like than Lisp. But really, neither of those is saying much.

Ways it is not C-like:

- begin...end instead of curly braces

- parallelism with assign, always, initial

- tasks and functions and their subtle differences

- non-blocking assignment

- bit-oriented variables and operations

- 4-state logic (0, 1, X, Z)

- other constructs for modeling hardware like time delays, tri-state wires, drive strengths, etc.

Not to mention that SystemVerilog has taken over Verilog and adds OOP with classes, a complicated (sorry, powerful) assertion mini-language, constrained randomization, a streaming operator, and so on.


Heh, I suspect hardware folks would like a CPU which programmed well with Verilog or VHDL, and I know that trying to make a hardware description language accessible to software folks has been a pipe dream for at least 25 years. However, I don't think Verilog was the solution for Itanium, and the useful niche for FPGAs seems increasingly limited to low Size Weight And Power realms.

The FPGA projects I've seen (using very high end FPGAs, not commodity/cheap ones) seem like they're always bumping up against clock rates and struggling to make timing as soon as they try to do anything approaching what you can do on a CPU or GPU. Of course there are exceptions where the FPGA does really simple and parallel things, but FPGAs aren't a panacea.


There was a company called reconfigure.io that came the closest to an accessible HDL (they compiled Go to VHDL/Verilog) but seems to have died and their founder is now with ARM.


I recently worked with another company trying to solve the same problem, but I suspect I should keep my mouth shut due to NDA crap. Regardless, I suspect we're easily another decade out before anyone makes FPGAs accessible to the masses, and I can't think of many problems where I would take an FPGA over a GPU, but there are still some.


Or you could prototype the algorithm in matlab then convert to HDL. https://www.mathworks.com/products/hdl-coder.html


I think one of the most influential designs of recent times has been the DEC Alpha lineage of 64-bit RISC processors[1]. Originally introduced in 1992, with a superscalar design, branch prediction, instruction and data caches, register renaming, speculative execution, etc. My understanding is that when these came out, they were way ahead of any other CPU out there, both in terms of innovative design and performance.

Looking at this chip, it seems to me that almost all the innovations Intel brought to the Pentium lines of CPU over many years were basically reimplementing features pioneered by the DEC Alpha, just over a decade later, and bringing these innovations to consumer-grade CPUs.

[1]: https://en.wikipedia.org/wiki/DEC_Alpha


I loved working on DEC Alphas. They seemed to me like the best of breed conventional 64 bit machines, and it was sad when we quit buying them because x86 boxes were cheaper.

> it seems to me that almost all the innovations Intel brought to the Pentium lines of CPU over many years were basically reimplementing features pioneered by the DEC Alpha

I can't find a strong source to link, but I thought most of the Alpha team ended up at Intel. If so, that would explain the trickling in of re-implementations.


> I can't find a strong source to link, but I thought most of the Alpha team ended up at Intel.

DEC was bought by Compaq, who sold the team to Intel at the same time it was bought by HP. Intel's Massachusetts site is a former DEC facility.


DEC sold its StrongARM (ARM-based) business to Intel in 1997, before DEC was taken over by Compaq in 1998. It resulted in Intel's XScale business, which it later sold to Marvell.

Compaq abandoned DEC's Alpha for the HP/Intel Itanic, so Intel didn't get an Alpha business. However, Compaq sold the Alpha IP to Intel in 2001, before the HP takeover in 2002.

I'd be interested to know what happened to the Alpha architects, Richard L. Sites and Richard T. Witek. A quick search doesn't find anything interesting.

I remember there was a breakaway of DEC engineers founding a small chip design company, but can't remember what it was called.


Dirk Meyer ended up at AMD, it was why the likes of Hypertransport borrowed/adapted the Alpha EV6 protocol and kicked serious arse.

Always loved Alpha's, their influence cast a long shadow.


K7 Athlons actually just used the EV6 bus, to the point where motherboards were compatible with each other if you made a connector adaptor and replaced the firmware.

I have heard a very unconfirmed rumour that K8 involved so much DEC stuff (starting with its obviously EV7-derived design) that at one point it still had support for VAX floating point...


That's class! Thanks for the info.


One of my CS professors used to work at DEC. His lectures were starkly clear and pretty intense.

Jim Keller and other folks from PA Semi worked at DEC earlier in their careers.


I worked at DEC- one thing that I really miss is the high quality of documentation they produced. A lot of it has been archived here:

https://archive.org/details/bitsavers_dec


Through Keller they defined the last decades.


We got an Alpha, a PPC and a MIPS for testing 64 bit compatibility for our app. I believe they were donated, to make sure our code (Mosaic) ran on Windows NT.

I used the DEC as my personal machine. Hardly anyone ever touched the other two, except to verify bug reports.


Superscalar dates to the CDC 6600, a machine I don't understand very well; branch prediction I think dates to Stretch, but the 70s at latest; instruction and data caches were commonplace in high-performance computers by the 1970s, and as the article points out, the 6600 had I$, and the 360/85 had a cache too (not sure if split). I'm not sure about register renaming and speculative execution, but I'd be surprised if they date from as late as the αXP.


Register renaming dates to the IBM 360/91 (1967):

https://en.wikipedia.org/wiki/Tomasulo_algorithm


Thank you!


Alpha is also unique in that it's the only architecture that doesn't preserve data dependency ordering.
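A hedged sketch of the classic example behind that remark (C11 atomics used just for illustration): a writer fills a struct and publishes a pointer to it; the reader then does a load that depends on the pointer value. On other mainstream CPUs that dependent load is ordered after the pointer load by the hardware itself; Alpha could return stale data, which is why things like `memory_order_consume` and Linux's old read-depends barrier exist.

    #include <stdatomic.h>

    struct msg { int payload; };

    static struct msg m;
    static _Atomic(struct msg *) ptr;   /* starts out null */

    void writer(void)
    {
        m.payload = 42;
        atomic_store_explicit(&ptr, &m, memory_order_release);
    }

    int reader(void)
    {
        /* Relaxed load of the pointer; *p below is a data-dependent load. */
        struct msg *p = atomic_load_explicit(&ptr, memory_order_relaxed);
        return p ? p->payload : -1;
    }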


It's the only CPU architecture. I think many accelerators (e.g., GPUs) don't preserve it either.


I think by the definition I was using, data dependency ordering is preserving order on a single thread. All GPUs I'm aware of enforce this.


Cray built a massively parallel machine out of a bunch of Alphas (Cray Research Inc, iirc, not Seymour Cray). The T3D/T3E, I think.

DEC had bragged about the features of the cpu useful for parallelization. Cray engineering complained about the features missing for parallelization. It was all described in a glossy Cray monthly magazine description of the new machine but I've been unable to locate a copy.

I disliked the Alpha floating point, it was always signalling exceptions for underflow. Otherwise a fine set of machines.


> I disliked the Alpha floating point, it was always signalling exceptions for underflow.

Alpha floating point wasn't the problem.

The problem was that a whole bunch of "clever" folks used "underflow" for all manner of weird reasons on x86. So, whenever Alpha either ran ported or emulated x86 code, it ran into underflows with far greater frequency than any actual numerical applications ever would.


Intel ended up buying Digital Equipment's chip-making business.

There were patent disputes and such.

According to this article, which describes the sale, Windows NT ran on Alpha.

https://www.latimes.com/archives/la-xpm-1997-oct-28-fi-47463...


As I remember, they were sold with NT (4.0?) in Germany by a long-gone PC retailer called Vobis, I think for about 5000 DM.



I remember getting an MSDN CD wallet that included a build of Windows NT for Alpha. Of course, I never had an Alpha on which to run it.


One of the nice things was that you could run Windows NT on a DUAL PROCESSOR PC with two Alpha chips. This was such an attractive idea I thought of buying one. However, when I did a very quick trial, it didn't make any noticeable difference to my very simple workloads (mostly Microsoft Office).


Yeah, I used NT 4.0 on Alpha when NT was still new.


AMD actually licensed a good bit of Alpha technology.


AMD hired a bunch of Alpha engineers to develop the revolutionary Athlon line of CPUs.


I remember, way back in the 90s, working on NT 3.5/4 on some DEC Alphas in Sony Broadcast at Oxford, UK. The sysadmin there was a cool dude who I remember was amazed at how insane network speeds were getting. I think those guys at Oxford were responsible for a very nice recording studio mixer that Sony made.

I remember DEC Alphas absolutely stomped all over the x86 stuff that everyone else was using, but the flexibility and price of the commodity PCs was just too attractive. Pity, really.


I don't know if it's relevant for computer architects, but one great thing about Alphas (at least the ones I operated) was their relatively huge memory and cache. They were much admired generally by users processing data. An individual crystallography image might fit in cache, and a typical complete dataset in memory -- not that that stopped the maintainer of the main analysis program retaining the disk-based sort bottleneck originating on PDP-11s...


There are some really great designers on the list, like Sophie Wilson and Gordon Bell, but the list of admirable machines comes up really short — and missing a lot of really significant and admirable machines.

Maybe these are the machines bad computer architects, like Alpert, admire. Alpert is notable mostly for leading the computer industry's most expensive and embarrassing failure, the Itanic (formally known as the Itanium), despite the presence on his team of many of the world's best CPU designers, who had just come from designing the HP-PA --- a niche CPU architecture nevertheless so successful that HP's workstation competitors, such as NeXT, started using it. Earlier in his career he sunk the already-struggling 32000, the machine that by rights should have been the 68000. (And maybe if they'd funded GCC it could have been.)

What about the Tera MTA, with its massive hardware multithreading and its packet-switched RAM, which was gorgeous and prefigured significant features of the GPU explosion?

What about the DG Nova, with its bitslice ALU chips and horizontal-microcode instructions? What about the MuP21, with its radical on-chip dual circular stacks?

What about the HP 9100, with its dual stacks and PCB-inductance microcode, where the instruction set was the user interface?

What about the LGP-30, which managed to deliver a usable von Neumann computer with only 113 vacuum tubes (for amplification, inversion, and sequencing)?

What about the 26-bit ARM, with its conditional execution on every instruction, and packing the program status register into the program counter so it automatically gets restored by subroutine return, and, more importantly, interrupt return?

What about Thumb-2 with its unequaled code density?

What about the CM-1? Anyone can see that AVX-512 (or for that matter modern timing-attack-resistant AES implementations!) owe everything to the CM-1.

And the conspicuous omission of the Burroughs 5000 has already been noted by others.

I mean, there are some good designs on the list! But it hardly seems like a very comprehensive list of admirable designs.


It sounds like they just went around the room and asked some folks to list off some systems. I don't think a terrible amount of thought was put into this.

I'd add the Tandem NonStop to my personal list. I don't know why I overlooked the LGP-30 [1], I'll have to find a schematic. 113 vacuum tubes is really impressive, I wonder if there is any overlap with this design and System Hyper Pipelining [2]. Do you know of other architectures that use time multiplexing to reduce part count?

What bit serial computers do you like?

Ahh, it is the Story of Mel computer, awesome.

[1] https://en.wikipedia.org/wiki/LGP-30

[2] https://arxiv.org/abs/1508.07139


NonStop is super interesting! HP has most of the old Tandem papers and manuals online still, I think, and you can see how the software and hardware co-evolved. It's mind-boggling the extent to which they designed the operating system around transactions; with things like TIP, the Transaction Internet Protocol, they did try to get wider adoption for that approach, but it's largely been forgotten. A shame, since we spend so much of our time debugging highly-concurrent distributed systems these days.

Delay-line and drum computers (like the LGP-30, the HP 9100, and the grandmama of them all, the Pilot ACE) all sort of had to do a sort of time multiplexing; the Tera MTA I mentioned, as well as the CDC 6600's PPs (FEPs), worked that way too, time-sharing a single ALU and control unit among many register sets. That's also one of the things going in modern GPUs, but it's hard to say it's to reduce part count. Still, they'd need a lot more parts to do the same thing if they didn't do it.

This CSR/SHP thing sounds really interesting! Thank you!


I should have provided a link to the Tandem tech reports [1]; they make for great reading, great for the Little Gray Cells. I do think hardware-supported distributed transactions would make many problems go away. Fusing what was once modular has unlocked lots of gains (ZFS); with the rise of hypervisors and abstract VMs, we are getting there in baby steps. Modularity incurs a cost that is much higher than most people realize: if we see something that could be modular, we usually take it, but those decisions force future decisions we aren't aware of. I think someday we will view it in a similar light to OO.

I got to tour the Tera offices in Seattle in the late 90s, about all I remember is that it was a torus and it used some finicky silicon process that was leading to manufacturing delays. I was all into Beowulf and Mosix [2] at the time using Alpha or x86, so I wasn't drawn to it that much.

[1] https://www.hpl.hp.com/hplabs/index/Tandem

[2] https://en.wikipedia.org/wiki/MOSIX


The other machines that I think of as being the siblings of the Tandem are the Nova (a lovely instruction set crippled by its shitty OS), the HP 3000, and of course the byte-addressed PDP-11. Despite their differences they all have a very similar flavor, reflecting a CISCy Zeitgeist when minicomputers were just beginning to cut their umbilical cords to PDP-10s and the like.

Tera’s first machine was bipolar (ECL I assume) and they finally squeezed out a CMOS successor with a lot of assistance from their EDA vendor. Never knew the story of why moving to CMOS was so urgent.

Amusingly, it was the Beowulf list where someone converted me to the Tera religion (rgb I think). I was convinced that was the way all computers would work soon. And, well, it's how GPUs work, kind of. But mostly I was wrong.

Not sure I agree about modularity. Galaxies, mammalian bodies, trees, bacterial films, cars, books, and river systems are modular. It would be surprising if we could make software non-modular. But we could make it only as modular as a tree.


I am happy you don't agree on modularity. I don't want to be correct, I want to arrive at correct conclusions. :)

Composition is great; scale-free self-similarity is probably the basis for the universe.

Modularity is a great design technique, but it can also make things weaker and force other (unknowable) design choices, because the module boundary prevents the flow of information/force. Overly constrained modular systems encourage globals; under-constrained modular systems are asymptotic to mud/clay.

I don't want to use K8S as a strawman to attack modularity, but I think it is an example of using this powerful design tool to solve the wrong problem using mis-applied methods all the while being more complex and using more resources. In the case of designing systems, modules/objects/processes (Erlang sense) are critical, but not so much in building/engineering them. Demodularizing or fusing a design can make it more robust and more efficient.

I don't dislike modularity, I just think it is a bigger, more complex topic than most give it credit for. Unix is highly non-modular and has very poor composition. It sits on a molehill of a local maximum, itself sitting in the bottom of a Caldera, a sort of Wizard Mt on Wizard Island.

Other things you might like: the research around "Collapsing Towers of Interpreters" [1]

Or Dave Ackley's T2 Tile Project and Robust First Computing [2]

Would love to chat more, but internet access is spotty for the next week, non-replies are not ignores.

[1] https://lobste.rs/s/yj31ty/collapsing_towers_interpreters

[2] https://www.youtube.com/watch?v=7hwO8Q_TyCA https://www.youtube.com/watch?v=Z5RUVyPKkUg


My first full time paid developer job was to be a Developer on the Spooler/Spooler Plus components which were part of the HP Tandem NSK.

I had a really good mentor when working on this product and I learned the importance of modularity in design. I was just off my internship doing a D2K/Oracle implementation for an airline in-cabin inventory application, so working on Spooler was a breath of fresh air.

[1] https://support.hpe.com/hpsc/doc/public/display?docId=emr_na...


I can think of many ways you could've phrased this without being downright aggressive.


[flagged]


Can you read the date at the top of the page? 2001. Thumb-2? And an ARM creator was interviewed as well; it's like you looked at the first name on the list and just had a tantrum.


The page dates from 2010 and includes input from Gordon Bell in 2008. I see Sophie's interviewed in the article, but somehow her masterpiece — likely the most widespread architecture in the world by now — isn't included in the list at the top.


I'd be careful about giving too much credit to the original ARM design. In retrospect, like many early RISC designs, it was over-optimised for its original application. Most of the unique/novel features of the original ARM architecture turned out to be bad ideas in the long run. Most of them were later removed from the architecture, or persist only in backwards compatibility modes.

ARM is ubiquitous today more due to business models and historical accident than to inherent superiority of the design. (See also x86.)


That's interesting! Clearly that's what happened to packing the PSW into the PC†, but what other features are you thinking of?

† though there are an awful lot of ARM processors out there today that have less than 64 MiB of program memory and need their interrupts to be fast, so I'd argue it might be a reasonable idea for many applications today if it didn't involve breaking toolchain compatibility in a subtle way


> What about Thumb-2 with its unequaled code density?

It came out in 2003, and most of the people were queried for their opinions in 2001.


It was never intended as a comprehensive list. Best if you actually read the article. It's been floating around for more than a decade.

Alpert is a bad architect...funny.


It's disappointing that most machines today suck so badly. How did that become the state of the industry, with so many smart people working so hard and nobody likes their latest designs?

The last high-performance design I actually liked was the DEC Alpha. You could write a useful JIT compiler in a couple hundred lines.

I suspect that nVidia's recent GPUs are wonderfully clever inside, but they don't publish their ISA and the drivers are super-clunky. So I can't admire them.

I appreciate the performance of intel Core chips, but there's so much to dislike. The ISA is literally too big to fully document. The kernel needs 1000s of workarounds for CPU weirdnesses. You have to apply security patches to microcode, FFS.

RISC-V would be great if we had fast servers and laptops.


What's wrong with Power 8 & 9? What's wrong with ARM64? What was wrong with Sparc64 until Oracle screwed it up (well...register windows...ok). How is RISC-V intrinsically better than those architectures, considering it doesn't exist in a form that performs anywhere near as fast?


> How did that become the state of the industry, with so many smart people working so hard and nobody likes their latest designs?

The average consumer doesn't buy a fantastically well designed CPU if it doesn't run the software they care about. x86, externally, is horrifically ugly primarily because of backwards compatibility (I've legitimately had a nightmare once from writing an x86 JIT compiler). Internally, I'm almost certain it's an incredible feat of engineering. People who admire architecture aren't a powerful market force, I'm sad to say.


Mass market doesn't value admiration


Surprised nobody picked the Atari 400/800 and Amiga 500 computers (which are the 8-bit and 16-bit spiritual parent/child machines by the same people).

On the other end, pure CPU only machines are kind of interesting as a study in economy, like the ZX Spectrum, a horrible, limited architecture that managed to hit the market at an unreasonably cheap price, make money, and end up with tens of thousands of games.


OMG did I love my Atari 800 back in 1985 when it was clearly one of the best price/performance machines available at that time.

Overlooked by many (but not all) were its built-in MIDI ports and its ability to control all those early model beatboxes and synths... unfortunately I forget the name of the rather crappy software that I used to get things talking and synced up, but it did work, and the bitmapped color graphics were way ahead of their time.

Too bad Atari self-destructed with the cartridge business, and who knows what other poor business decisions it made, but that computer was one of my favorite things of my late teens.


Are you mixing up the Atari 800 and the ST? The 800 didn't have MIDI ports.


Yeah, I'm surprised that 6502 makes the list but Z80 and Motorola 68000 don't.


I read recently that Calvin Harris (music celebrity) made his first album on an Amiga.


Interesting that the B5000 didn't make this list. Berkeley CS252 has been reading the Design of the B5000 System paper for years. The lecture slides don't criticize it but Computer Organization and Design sorta does:

The Burroughs B5000 was the commercial fountainhead of this philosophy (High-Level-Language Computer Architectures), but today there is no significant commercial descendant of this 1960s radical.


I was also surprised - but I wonder if those are the computer architectures language designers like, not computer architects.


The list seems biased towards pre-2001, so I’ll toss one in: Cell. I hold that it was so ahead of its time, it dragged game devs, kicking and screaming, into the future ahead of schedule when they were forced to support the PS3 for the extended console cycle. :)

Larrabee was cute, but to this day I still have no idea what their target workload was.


Yup. Most of this was culled from a 2001 conference (so small but distinguished sample set), and you really need to read the detail to understand what they were appreciative of. It's not a good/bad thing and probably represented what they were thinking about at the time (e.g. Alpert calls out Multiflow because it influenced a processor he built). Sites even includes a backhand at VAX by calling it the example of what they didn't do in Alpha; damning with faint praise.

I haven't fired up my Cell dev board (Mercury) in a while. Prolly should do that. :-)


I always thought of the 6809 as the Chrysler Cordoba of 8 bit microprocessors, with soft Corinthian Leather upholstery and a luxurious automatic multiply instruction.

https://www.youtube.com/watch?v=Vsg97bxuJnc


The CDC-6000 and Cray-1 designed by Seymour Cray are the most admired, hands down.

It is also notable that quite a bit of R&D was done in Chippewa Falls, WI, which is just a regular old town in America's Dairyland.


I'm no architect but I loved the Cray-2. I took over an old datacenter that had one just sitting there and a sense of awe hit me every time I saw it. What cost 12M brand new and was a marvel of engineering (12x faster than Cray-1!) was just sitting there collecting dust. Crazy world this is. They eventually sold it to a collector I think.


The Y/MP I got to use was pretty cool too...a multiprocessor Cray. But then that's a Chen design and I don't imagine Seymour would approve of everything that went into it.


In college our ACM chapter named its DEC machine (I think it was a Tru64, but it's so long ago I don't remember) "cray-ymp." If you were a member of the ACM you got an account on it, and I used to MUD from it.

One of the MUD admins was astounded when he noticed I was playing from what he thought was a real Cray Y/MP. If I was smarter I would have played along. Alas...


Missed opportunity for amusement.

Funny enough, the Y/MP-48 was running Unicos (Cray's mostly-Unix) and someone had compiled nethack or rogue or something similar and was playing it until he had sucked up all the funny-money the department had budgeted for the year (yeah...we got accounted for CPU time, and yeah...someone screwed up the quotas). There was a kerfuffle....


Yeah, both the CDC 6600 and the Cray 1 did things that nobody even imagined were possible before. There were a bunch of attempts to do what the Cray-1 did, but none of them worked. Kind of like how C (about the same time) was the first portable language you could write an OS in.


Surprised the PDP-6/10 didn’t make the list as it was the dominant research architecture for a certain period. Another Gordon Bell jewel.


Alas...so little respect for the 36-bitters these days. The PDP-10 especially was hugely influential.


I wrote my first non-BASIC program ever on a PDP-10 or 11, and as I remember, it was one of those numerical programming for engineers classes where we had to figure out why, when using floating point vs integer math, 2 != 2.0 (because 2.0 was actually 2.0000000000001, of course).

The funny thing was...no one told me the secret and I musta spent 5 long hours pulling out my hair.

If I remember right, one of the RAs running the computer center finally had pity on me and showed me what was going on...

I'm not really sure I can call them the "good ole days of programming", but that was how things were done back then.
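For anyone who hasn't been bitten by this, the classic demonstration in C (the exact digits printed depend on the platform's rounding, so read the comments as "typically"):

    #include <stdio.h>

    int main(void)
    {
        double sum = 0.0;
        for (int i = 0; i < 20; i++)
            sum += 0.1;                   /* 0.1 has no exact binary representation */

        printf("sum = %.17g\n", sum);     /* typically a hair off from 2 */
        printf("sum == 2.0  -> %d\n", sum == 2.0);
        printf("0.1 + 0.2 == 0.3 -> %d\n", 0.1 + 0.2 == 0.3);  /* 0 on IEEE-754 doubles */
        return 0;
    }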


As far as processors are concerned I loved the Zilog Z80 and the Motorola 68000. Oddly enough I really disliked the MOS 6502 and the Intel 8086.

As total systems I loved the HP 41CX, the Sinclair ZX Spectrum, the Symbolics Lisp Machine and the Apple Mac IIcx (or really just any Mac before the PowerPC debacle).

After that era, I just started home-building x86 machines, and while there was the odd preferred component, it never went beyond the 'A is better than B' stage.


I started on the 6502, but outgrew it.

Still, cult chip! I mean, something like the following shows obsessive dedication to the thing:

http://www.visual6502.org/JSSim/


Anyone admire forthchips? Such as the 144-core chip from http://www.greenarraychips.com


> Processor design pitfalls - Designing a high-level ISA to support a specific language or language domain

Is there an equivalent pitfall in designing the ISA to support a specific Virtual Machine?

For example, wouldn't the performance of a server processor when running the Java Virtual Machine be a key factor in determining its commercial success? I've always wondered whether the failure of Itanium wasn't at least partly caused by the shift from binary executables to bytecode with the contemporary success of the Java language. Even when JIT compilers were used, they were probably too simple to take advantage of the VLIW architecture.


I don't feel that's the core reason, but you do bring up a good point; some technologies are too good for their time and get swept aside in the history books because nobody had a clue how to utilise them properly.

Not sure if that's the exact case for Itanium but your argument fired a neuron. :)


The machines I most admire are mechanical computers like the ones used in WWII era battle ships for targeting their long guns. Those machines performed differentiation and curve matching using cams and gears.


There is a fascinating series of videos on YouTube that describe the US Navy analog fire control computers. I had no idea such things existed until I came across those videos.


A friend's father back in the early 90s was convinced that analogue computers were going to come back and kick the asses of these newfangled digital pretenders :)

He was a little strange, but I loved seeing his workshop and attempts at building such computers. It's a pity I'm not in touch any more, I would like to learn a bit more about what he was actually doing and if it was effective at all.


I have a small IBM 390 about which I haven't been able to find out much, but I did spot while searching that my 1999 S/390 has a 256-byte cache line. That's 4x over a 2020 i7.


The ones listed by 4 or more people (not including Bell) were:

- CDC-6600 and 7600 - listed by Fisher, Sites, Smith, Worley

- Cray-1 - listed by Hill, Patterson, Sites, Smith, Sohi, Wallach (also Bell, sorta)

- IBM S/360 and S/370 - listed by Alpert, Hill, Patterson, Sites (also Bell)

- MIPS - listed by Alpert, Hill, Patterson, Sohi, Worley

Special mention:

- 6502 - only listed by Wilson, but she was the chief architect of ARM so I think her choice is important to note

- Itanium - mentioned in the top-ranked comment in this HN discussion

- DEC Alpha - mentioned in the second-ranked comment in this HN discussion


Pentium Pro should be on the list. The out of order execution, especially with the micro-op translation was a huge breakthrough.


The Pentium Pro pioneered neither of those concepts.


Very true, but I believe it was the first to deliver them at consumer price points and volumes.


M68k and Z80 IMO deserved to be in that list much more than x86.


The Z80 and its implementation in the ZX Spectrum 48k will always have a place in my heart. So many BASIC games typed out, so many magazine tapes loading strange programs and games.

And the M68K powered my Atari ST, my friends' Amigas, and so much more.

Makes me wish I could get a small ZX Spectrum, Atari ST, and Amiga hardware kit that would interface with USB HID stuff and output displayport/hdmi. (rather than software emulation on a Raspberry Pi)


I was always partial to the DEC-10 architecture. That said my first exposure to a machine that had been really well thought out was the IBM 360.


https://en.m.wikipedia.org/wiki/VAX-11

32 bit system from the late 1970s.


The VAX was, I think, co-designed alongside VMS. The two together were an innovative design: distinguishing architecture from implementation, a comprehensive ISA, a roadmap for the future, etc. etc. VAXcluster was an amazing integration of both.

I believe the design was influenced by Dijkstra's Structured Programming book but have no evidence.

My epiphany on the issues with the ISA came when I discovered that the checksum calculation used by the VMS backup utility was faster when done in a short instruction loop than with the microcoded instruction. MicroVAX II. Microcode was a huge barrier between the speed potential of the electronics and the actual visible ISA. Duh!

Cray knew this, but he didn't build product lines, just single point products. Sun built product lines with RISC and ate Digital Equipment's lunch.


I'm tempted to suggest Babbage's Analytical Engine, on the basis of sheer audacity alone. Babbage was just amazingly ahead of his time.


Well, pretty much anything will run AutoCAD these days.


same goes for CATIA V5...


Are there any notable/not-just-academic "clean sheet" CPU architecture efforts other than what Mill computing is doing?


I admire AMD's Zen architecture.


For internal engineering beauty, or because it apparently made good trade offs and is taking the market to town right now? :-)


No mention of the SuperH.


I've been into computers for decades. The first time in ages that a computer blew my mind again was when I got deep into k8s.

If you haven't yet, do it now: check k8s out and get knee-deep into it, not from a devops perspective but as someone who admires computers.


More software than hardware, but I share the same wonder at Kubernetes. I love the whole concept of: "this is a big blob that runs your stuff, don't worry about networking and some other server-type considerations, just run your apps and there's a bunch of magic"

It's a little sad to see my previous career disappear, I'd love nothing more than to manage some VMware clusters running Linux VMs for the rest of my life, but technology moves on. It's forcing me to become a coder and manager rather than sysadmin.



