The final RISCs vs. CISCs – 4: How RISCs became CISCs

The mudslinging against CISCs in the propaganda set up by RISC evangelists, which we have already discussed in the previous article, has been very careful to point the finger incessantly in one direction, while silently avoiding the issues that later proved fatal for their beloved architectures.

It is not that RISCs have completely disappeared from the landscape of computer architectures, but we can certainly state, with definition and history in hand, that almost all existing processors are not RISCs and must necessarily be classified as CISCs.

The reasons for this have already been laid out in previous articles (it is enough, trivially, to apply the RISC definition to the letter), but I think it is particularly important to understand how this transformation came about.

That is, how, as a result of advances in technology, RISCs have in fact become CISCs (despite the proclamations which, unfortunately, continue to this day), now that most of the pillars on which they were founded have miserably collapsed (even though the fall of just one of them would have been enough).

The fourth pillar falls: ‘Instructions must be simple -> executed in a single clock cycle’

The rapid increase in processor clock frequencies was the first of the technological advances that caught RISC proponents off guard. The adoption of increasingly advanced and efficient manufacturing processes (well illustrated by the famous Moore’s Law) started a race towards ever higher processor frequencies, which was not matched by a corresponding increase in memory frequencies, since memories advanced far less than processors did.

This led to a decoupling of the processor’s interface towards memory: the buses connecting to memory ran at lower frequencies than the processors, which first introduced the concept of a multiplier applied to the bus frequency to obtain the processor frequency (as in Intel’s famous 486DX2), and later made the two frequencies totally independent.

This misalignment between the two operating frequencies therefore meant that a memory access by the processor (or by other devices) required more clock cycles to complete (something which, in any case, had already been happening for some time with the infamous wait states).
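
To make the effect concrete, here is a minimal sketch (the latencies, wait states and multipliers are purely illustrative assumptions, not measurements of any specific processor) of how a core-to-bus multiplier inflates the cost of a memory access when counted in processor clock cycles:

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions: a bus-level memory access that takes one bus
     * cycle plus a number of wait states, as seen from cores running at
     * different multiples of the bus frequency (as with the 486DX2's 2x). */
    const int bus_cycles_per_access = 1;
    const int wait_states = 2;
    const int multipliers[] = {1, 2, 4, 8};
    const int count = sizeof(multipliers) / sizeof(multipliers[0]);

    for (int i = 0; i < count; i++) {
        int core_cycles = (bus_cycles_per_access + wait_states) * multipliers[i];
        printf("multiplier %dx -> %d core cycles per memory access\n",
               multipliers[i], core_cycles);
    }
    return 0;
}
```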

All this had practical repercussions on the execution of instructions, whenever they depended on memory accesses either directly (the famous load/store) or indirectly (to fetch the instructions to be executed; not only after a jump to a different address).

The introduction of memory virtualization (and not only that; but I will avoid complicating the matter, as it is not strictly relevant to the purpose of this article) then added further ‘levels’ to memory accesses, which may therefore require additional clock cycles to complete.

Last but not least, the implementation of instruction pipelining, whereby execution was divided into several simpler ‘steps’ / ‘stages’ so as to run the processor at higher frequencies (the achievable frequency is dominated by the speed of the slowest circuit element), also added more clock cycles per instruction.
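
As a minimal sketch (the stage delays below are invented, purely for illustration), the cycle time of a pipelined processor is set by its slowest stage, which is exactly why designers split execution into more, simpler stages, at the cost of each instruction spending more cycles in flight:

```c
#include <stdio.h>

int main(void)
{
    /* Invented propagation delays (in nanoseconds) for a classic 5-stage
     * pipeline: fetch, decode, execute, memory, write-back. */
    const double stage_delay_ns[] = {0.8, 0.6, 1.0, 1.2, 0.5};
    const int stages = 5;

    double slowest = 0.0;
    for (int i = 0; i < stages; i++)
        if (stage_delay_ns[i] > slowest)
            slowest = stage_delay_ns[i];

    /* The clock period cannot be shorter than the slowest stage... */
    printf("max frequency: %.2f GHz\n", 1.0 / slowest);
    /* ...but each single instruction now needs (at least) 'stages' cycles
     * from fetch to completion. */
    printf("minimum latency per instruction: %d cycles\n", stages);
    return 0;
}
```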

This certainly worked very well as long as an instruction did not have to access memory (otherwise the pipeline would stall until the data finally arrived) and did not have to change the execution address (jumping to another part of the program), because both cases, of course, meant introducing additional clock cycles before execution could resume.

Mechanisms have been introduced to try to mitigate all these problems (L1/L2/L3 caches, TLBs and branch-target buffers among the most famous), but obviously they are unable to guarantee the completion of instructions in one clock cycle, even when all of them are taken together.
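
A minimal sketch of why caches help but cannot guarantee single-cycle completion (the hit rates and latencies are illustrative assumptions, not figures for any real processor) is the classic average-memory-access-time calculation:

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions: latencies in core clock cycles and hit rates
     * for a three-level cache hierarchy in front of system memory. */
    const double l1_hit = 4,  l1_rate = 0.95;
    const double l2_hit = 12, l2_rate = 0.80;
    const double l3_hit = 40, l3_rate = 0.70;
    const double dram   = 200;

    /* Average memory access time, working outwards from DRAM. */
    double amat = l1_hit
                + (1 - l1_rate) * (l2_hit
                + (1 - l2_rate) * (l3_hit
                + (1 - l3_rate) * dram));

    /* Even with three levels of cache and good hit rates, the average access
     * is well above a single clock cycle (about 5.6 cycles here). */
    printf("average access time: %.1f core cycles\n", amat);
    return 0;
}
```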

Considering that modern processors are able, on paper, to execute several instructions per clock cycle (the infamous IPC: Instructions Per Cycle), executing every instruction in a single clock cycle would mean travelling at maximum speed, with the IPC matching the maximum possible value. But it is clear that, because of what has already been said, this is essentially impossible.
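
To see how far real execution falls from that ceiling, here is a minimal sketch (all the instruction mixes, rates and penalties are illustrative assumptions) of an effective-IPC estimate for a hypothetical 4-wide core:

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions for a hypothetical 4-wide core. */
    const double issue_width     = 4.0;   /* max instructions per cycle */
    const double load_fraction   = 0.25;  /* loads per instruction      */
    const double l1_miss_rate    = 0.05;
    const double miss_penalty    = 30.0;  /* cycles per L1 miss         */
    const double branch_fraction = 0.15;  /* branches per instruction   */
    const double mispredict_rate = 0.05;
    const double mispredict_cost = 15.0;  /* cycles per misprediction   */

    /* Average stall cycles added per instruction. */
    double stalls = load_fraction   * l1_miss_rate    * miss_penalty
                  + branch_fraction * mispredict_rate * mispredict_cost;

    /* Cycles per instruction = ideal issue cost + stalls; IPC is its inverse. */
    double cpi = 1.0 / issue_width + stalls;
    printf("effective IPC: %.2f (out of a maximum of %.0f)\n",
           1.0 / cpi, issue_width);
    return 0;
}
```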

The consequence of all this is that the fourth pillar on which RISCs were founded has, in fact, collapsed.

The first pillar falls: ‘There must be a reduced set of instructions’

The aforementioned Moore’s Law has, over time, whetted the appetite of processor manufacturers, always looking to improve performance. This took the form of adding new, more powerful instructions, made possible by the availability of an exponentially growing number of transistors (remember that, according to this law, the number packed into chips doubles roughly every two years).

Thus, for example, RISCs gained multiplication and even division (!) instructions, despite the fact that these require multiple clock cycles (many, in the case of divisions!) to complete. But the fourth pillar had already fallen, so desperately trying to preserve it at all costs no longer made any sense.
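
To appreciate what a hardware multiply instruction replaces, here is a minimal sketch (the helper name and code are purely illustrative, not taken from any specific RISC runtime) of the classic shift-and-add loop that software had to run, one simple instruction at a time, when no multiply instruction existed:

```c
#include <stdint.h>
#include <stdio.h>

/* Classic shift-and-add multiplication: roughly the work a processor without
 * a multiply instruction had to perform with a software loop, i.e. dozens of
 * simple instructions instead of one (multi-cycle) hardware multiply. */
static uint32_t mul_shift_add(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1)          /* lowest bit of b set: add the shifted a */
            result += a;
        a <<= 1;            /* a * 2 */
        b >>= 1;            /* move on to the next bit of b */
    }
    return result;
}

int main(void)
{
    printf("%u\n", mul_shift_add(1234, 5678)); /* prints 7006652 */
    return 0;
}
```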

So, many other instructions became part of the ISAs of RISC processors (of CISCs too, of course, but… they have been doing this since they were born!): those for floating-point calculations (well known for their much longer latencies compared with ‘integer’ ones), SIMD (some with particularly complicated instructions), lock/synchronisation primitives, cryptography, digital signatures, bit and bit-field manipulation, and so on, with a plethora of them requiring more and more clock cycles to complete (and, thus, continuing to inflict damage on the aforementioned fourth pillar).

Needless to say, as new ones were added all the time, the instruction set kept growing, dismantling the first pillar of RISCs as well.

It must be emphasised that CISCs have always been the object of ridicule (still today!) for integrating instructions such as those for handling BCD data types, manipulating complicated stack frames (to better support programming languages that allowed the definition and use of nested functions, such as Pascal and its derivatives), strings/memory blocks, and so on, which were added for exactly the same reason: to improve performance in the common cases (of the time)!
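
As a concrete (and purely illustrative) example of what one of these ‘ridiculed’ instructions actually does, here is a simplified sketch in C of adding two packed-BCD bytes, i.e. the decimal-adjust work that a single CISC instruction (x86’s DAA is the classic case) performs in one go; carry/auxiliary flags are deliberately omitted to keep it short:

```c
#include <stdint.h>
#include <stdio.h>

/* Add two packed-BCD bytes (two decimal digits each) and apply the decimal
 * adjustment that an instruction like x86's DAA performs after a binary add.
 * Simplified: the carry out of the byte is simply dropped here. */
static uint8_t bcd_add(uint8_t a, uint8_t b)
{
    unsigned sum = a + b;                  /* plain binary addition               */
    if (((a & 0x0F) + (b & 0x0F)) > 9)     /* low digit produced a decimal carry  */
        sum += 0x06;
    if (sum > 0x99)                        /* high digit produced a decimal carry */
        sum += 0x60;
    return (uint8_t)sum;
}

int main(void)
{
    /* 0x38 + 0x27 represent decimal 38 + 27 = 65 -> 0x65 in packed BCD. */
    printf("0x%02X\n", bcd_add(0x38, 0x27));
    return 0;
}
```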

RISCs did, and continue to do, exactly the same, and today we have processors that likewise integrate hundreds of instructions. But nobody blames them for this! Quite the opposite: such extensions are viewed favourably and are sold & publicised as great innovations.

Consistency? Not at all!

Perhaps in the future, current processors/architectures will in turn be pilloried, as certain functionalities are offloaded onto external coprocessors or chips (as is already happening with the explosion of AI, which feeds heavily on highly specialised chips/units), and the instructions for such tasks come to be seen as an intolerable burden and legacy. The circle of life repeats itself…

The third pillar falls: ‘Instructions must have a fixed length -> no variable-length instructions’

Last but not least, RISC manufacturers have had to come to terms with a congenital flaw that this family of processors has always carried: low code density (for which they are well known, and certainly not in positive terms!).

That’s because it has several implications: more space required (memory has its costs), more memory bandwidth consumed (throughout the entire hierarchy: from the caches to system memory), larger caches (again, costs) and, finally, higher power consumption (all those additional transistors have to be fed).

Adding new instructions that do more ‘useful work’ helped in this respect (fewer instructions to perform the same tasks means less memory is required for the code), but it was not enough to solve the general problem.

So these manufacturers finally decided to ‘borrow’ the last, heretical (!) feature of many CISCs: variable-length instructions! And thus the third pillar disintegrated as well…

One important thing to point out on this subject is that many people claim that, since a lot of memory is available these days (and a lot of memory bandwidth too), code density is no longer important and there is therefore no need to save anything in this area.

Processor manufacturers and architects, however, resoundingly disagree with such stances (finding them laughable, to say the least), and continue to define new architectures with variable-length instructions, or to reserve a (sometimes conspicuous!) part of their ISAs’ opcode space specifically for this purpose.

For example, one of the newest and most modern architectures, RISC-V, has reserved as much as 75% of its entire opcode space for so-called ‘compact’ instructions (16 bits wide instead of the canonical 32), plus another part (of the remaining 25%) for even longer instructions (up to 22 bytes, plus a slice reserved for instructions… longer still!). In comparison, x86/x64, which top out at 15 bytes per instruction, look like amateurs…
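
To give an idea of how such a scheme can be handled, here is a minimal sketch of a length decoder based on the expanded length encoding published in the RISC-V specification (the longer encodings were later made optional, so treat the exact patterns as an illustration of the idea rather than a normative decoder): the length of every instruction is determined by the low bits of its first 16-bit parcel.

```c
#include <stdint.h>
#include <stdio.h>

/* Length (in bytes) of a RISC-V instruction, determined solely from the low
 * bits of its first 16-bit parcel, following the expanded length-encoding
 * scheme of the RISC-V ISA spec. Returns 0 for the encodings reserved for
 * even longer (>= 192-bit) instructions. */
static int riscv_insn_length(uint16_t p)
{
    if ((p & 0x03) != 0x03) return 2;      /* bits [1:0] != 11  -> 16-bit 'compact' */
    if ((p & 0x1c) != 0x1c) return 4;      /* bits [4:2] != 111 -> standard 32-bit  */
    if ((p & 0x20) == 0)    return 6;      /* bit 5 == 0        -> 48-bit           */
    if ((p & 0x40) == 0)    return 8;      /* bit 6 == 0        -> 64-bit           */
    {
        int nnn = (p >> 12) & 0x7;         /* bits [14:12] select 80..176 bits      */
        if (nnn != 0x7)
            return (80 + 16 * nnn) / 8;    /* i.e. 10 to 22 bytes                   */
    }
    return 0;                              /* reserved for even longer instructions */
}

int main(void)
{
    printf("%d %d %d\n",
           riscv_insn_length(0x4501),   /* c.li a0, 0: a 16-bit compressed instruction */
           riscv_insn_length(0x0513),   /* low parcel of addi a0, a0, 0: 32-bit        */
           riscv_insn_length(0x607f));  /* a pattern in the 176-bit (22-byte) slot     */
    return 0;
}
```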

ARM is a singular exception, as it has not (yet) defined any ‘compact’ extension to its 64-bit ISA (ARMv8, and now ARMv9 too). Which sounds really strange, considering the enormous effort put into Thumb and Thumb-2 (for the 32-bit architecture), which also contributed strongly to its success (so much so that several ARM processors implement only Thumb-2).

There is an inordinate amount of research on the importance of code density, but having already addressed this topic in the series on Intel’s new APX architecture, I prefer to quote an excerpt from one of the most recent and interesting publications on the subject, from the thesis of one of the RISC-V designers:

Waterman shows that RVC fetches 25%-30% fewer instruction bits, which reduces instruction cache misses by 20%-25%, or roughly the same performance impact as doubling the instruction cache size.

As is evident, code density has very serious implications at the microarchitectural level, which is why, beyond the flood of publications, new architectures either explicitly add support for it or are designed around it from the start (like ARM with the Thumb ISA, but also RISC-V itself).

This necessarily involves the adoption of a variable-length instruction set, as it is the architectural solution that offers the greatest benefits in this area (albeit at the expense of processors’ frontends, which become more complicated; here, a lot also depends on how the instruction opcodes are designed).

Conclusions

I would say that there is not much more to add, except that it all brings me back to Animal Farm: ‘Four legs good, two legs better!’

Because it is truly impressive how perfectly this novel describes the transformation of RISCs into CISCs, and how this came about. It was prophetic, to say the least…

The next article will focus on CISCs and how they… stayed the same!
