APX: Intel’s new architecture – 4 – Advantages & flaws

Having finished reviewing all the new features of APX (with the last article analysing the new instructions), it is now time to take stock of the advantages (of which we have, however, already had a substantial overview) and the flaws (it is not, in fact, without issues or shortcomings).

Advantages of APX

Let us begin with the advantages that APX brings to the ancient (but still alive!) x64 architecture (which in turn extended the even older and more famous x86). This task is quite easy because, in my humble opinion, there are four innovations worthy of merit (plus one that I will discuss in the last part), although only the first two are truly important and incisive.

The first is the extension of the most common binary instructions into ternary, and unary into binary. With APX we have, at last, the possibility of using three operands or two, respectively, which avoids having to use several instructions to perform the same operation and, in many cases, also avoids tying up an additional register for some intermediate operation.

An example again from FFMPEG (x64):

MOV R8D, ECX
XOR R8D, 0x1

which with APX would become instead:

XOR R8D, ECX, 0x1

Or even better:

MOV EBX, [RBP+0x30]
AND EBX, 0x80

which with APX would become:

AND EBX, [RBP+0x30], 0x80

where the ternary operation is combined with the possibility, for one of its two arguments, of directly referencing the data source in memory.
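The AND example above can be checked with a toy model. The following Python sketch is only an illustration, not an emulator: the `Machine` class, its method names and the string used as a memory address are all hypothetical, and only the instruction count and the final register value are modelled.

```python
class Machine:
    """Toy register machine (hypothetical), used only to count instructions."""

    def __init__(self, mem=None):
        self.regs = {}
        self.mem = dict(mem or {})
        self.executed = 0  # number of instructions executed so far

    def mov_load(self, dst, addr):
        """MOV dst, [addr]: load a memory operand into a register."""
        self.regs[dst] = self.mem[addr]
        self.executed += 1

    def and2(self, dst, imm):
        """Two-operand AND dst, imm: dst is both source and destination."""
        self.regs[dst] &= imm
        self.executed += 1

    def and3_mem(self, dst, addr, imm):
        """Three-operand AND dst, [addr], imm in the style of APX."""
        self.regs[dst] = self.mem[addr] & imm
        self.executed += 1


def run_x64(mem):
    """The two-instruction x64 sequence from the example."""
    m = Machine(mem)
    m.mov_load("EBX", "RBP+0x30")
    m.and2("EBX", 0x80)
    return m


def run_apx(mem):
    """The single ternary instruction that replaces it."""
    m = Machine(mem)
    m.and3_mem("EBX", "RBP+0x30", 0x80)
    return m
```

Both functions leave the same value in EBX, but the second one gets there with a single instruction.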

The performance benefits are, therefore, immediately evident and need no further praise. This new mechanism for promoting certain instructions has a further advantage: an operand in memory can be used not only as a source (which is quite rare, if not completely absent, in various other architectures) but, moreover, as either the first or the second source (if the particular instruction allows it).

Intel claims that, according to preliminary data from the very popular SPEC2017 benchmark, about 10 per cent fewer instructions are executed with APX than with x64, and I have no doubt that the biggest contribution here comes from the ‘promotion’ of these instructions. This is a distinctive feature of CISC processors, which is further enhanced in this case and which, to paraphrase the well-known singer Madonna, makes me break out in a “CISCs do it better!”.

The second notable innovation is the extension of the general-purpose registers to 32 in total: 16 more registers are definitely handy in various fields (virtual machines and emulators come to mind, first and foremost, but also compilers/parsers, databases, etc.).

And it is precisely thanks to the 16 additional registers that Intel claims to have reduced load operations by 10 per cent and store operations by over 20 per cent, again with obvious performance benefits (less use of data caches and, in general, the entire memory hierarchy benefits).

This, let me tell you, clashes heavily with the statements made by AMD following the introduction of its x86-64/x64 architecture: as I mentioned in the previous article, AMD claimed that the move from 16 to 32 registers would offer little advantage (not enough to justify the greater implementation complexity). We do not know, however, how the company planned to add all these registers, compared to the mechanism that Intel has now implemented with APX.

Also of interest is the addition of the new PUSH2 and POP2 instructions, which operate on two registers at a time and thus almost halve the number of PUSHs and POPs in the code. As we saw in the previous article, which showed a portion of real code, PUSH and POP sequences are commonly used in the prologue and epilogue of a routine (a not at all unusual scenario). Here, too, we speak of performance benefits (fewer instructions executed).
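Why ‘almost’ halved rather than exactly halved? A minimal sketch of the arithmetic, assuming that a prologue saving an odd number of registers still needs one ordinary PUSH for the register left without a partner (the function name is hypothetical):

```python
def push_instruction_count(num_regs, use_push2):
    """Instructions needed to save num_regs registers in a prologue."""
    if not use_push2:
        return num_regs  # one PUSH per register
    pairs, leftover = divmod(num_regs, 2)
    # Each pair costs one PUSH2; an unpaired register needs a plain PUSH.
    return pairs + leftover


if __name__ == "__main__":
    for n in (2, 5, 6):
        print(n, push_instruction_count(n, False), push_instruction_count(n, True))
```

The same count applies, mirrored, to the POP2 and POP instructions in the epilogue.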

Slightly less useful, however, is the ability of ‘promoted’ instructions to suppress the generation of flags, as discussed in detail above. The scenarios are not as common as those listed for the other points above, but they are relevant enough to merit the introduction of this feature (especially if the implementation cost is insignificant).
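As a rough illustration of what flag suppression buys, here is a toy Python model of a 32-bit ADD with an optional ‘no flags’ switch. It is a sketch under simplifying assumptions: only ZF and CF are modelled, and the function name and the `nf` parameter are hypothetical stand-ins for the NF bit.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b, flags, nf=False):
    """32-bit ADD returning (result, flags).

    With nf=True the incoming flags are returned untouched, so a result
    produced by an earlier comparison survives the addition.
    """
    total = (a & MASK32) + (b & MASK32)
    result = total & MASK32
    if nf:
        return result, flags
    # Simplified flag model: only zero and carry are tracked here.
    return result, {"ZF": result == 0, "CF": total > MASK32}
```

The typical use is an address or counter update placed between a comparison and the conditional instruction that consumes its flags.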

Flaws of APX

Turning to criticism and problems (it’s not all wonderful), and picking up from the last feature mentioned above, I don’t understand why the suppression of flag generation is the prerogative of only some instructions (always among the ‘promoted’ ones, in any case) and not of all of them: it doesn’t make any sense!

Examples are the ADC and SBB instructions (addition and subtraction using the carry flag): the NF bit cannot be used (it must be left at 0), while the traditional ADD and SUB (which do not use the carry) can enable it. These are not the only cases: there are several ‘promoted’ instructions that normally generate flags and for which flag generation cannot be suppressed.

Continuing in the same vein, the PUSH2 and POP2 instructions, useful as they are, could also have been avoided by delegating to the microarchitecture alone the task of identifying the pairs to be ‘joined in marriage’, exploiting the already present macro-op fusion mechanism. Besides, they are decidedly longer (but I’ll talk more about code density in the next article).

Fusing PUSH and POP pairs in hardware is certainly more complicated to implement, but it would also be transparent to existing applications (and, therefore, exploitable even on 32-bit / x86 code, which abounds even more in PUSH and POP sequences due to the ABI being stack-based instead of register-based).

Frankly, I don’t find the new CMOVcc-based instructions interesting. Apart from the ternary extension (which, by the way, is natural/obvious: it falls under the extension of ‘promoted’ instructions from binary to ternary, so CMOVcc would have benefited from this new functionality anyway), there is little use for CFCMOVcc.

Yes, being able to possibly suppress exceptions in the case of an unsatisfied condition (as I’ve already made clear in the previous article) is a sacrosanct improvement, but in fact all conditional instructions should have been implemented in this way, ever since their introduction with the Pentium Pro! The criticism, therefore, lies right here: such a mechanism should have been extended to many more instructions instead of being relegated to MOV alone (I will talk more about this in the future article on possible improvements to APX).

Similarly, the fact that only certain instructions can be ‘promoted’, and thus take advantage of the new and interesting features that allow their uses to be expanded, is a major limitation that complicates both compilers (which prefer orthogonal instruction sets) and the implementation (more on this in a future article).

In addition, as it is designed, the new map 4 (in which these promoted instructions reside, as already illustrated in the first article) could in future run out of room for the instructions encoded in it, forcing another map to be created and further complicating the implementation of the architecture. In that case, among other things, it will no longer be possible to make use of the REX2 prefix (because it can only map the opcodes of maps 0 and 1), forcing one to always use EVEX and pay the price in terms of lower code density.

And then there are CCMP and CTEST!

Finally, the new CCMP and CTEST instructions deserve a separate discussion, as they have advantages and flaws, which is why I have preferred to devote a separate section to them instead of fragmenting the discussion in the above two.

Let us start with the advantages, which do not seem at all obvious, since their mechanism of operation appears, in fact, rather convoluted and, at first sight, downright useless. It is not easy, in fact, to get clear in one’s mind what the hell they can be used for and in what real scenario they can bring tangible benefits.

The goal explicitly stated by the company (in the APX presentation) is to reduce the number of conditional jumps by using conditional instructions instead, so as to mitigate as much as possible the problems related to pipelines, which become longer and longer over time (and, therefore, a mispredicted conditional jump carries a very high price).

Intel had already gone ahead several years ago with the introduction of SETcc and CMOVcc (which I have already discussed at length), although they represent a very timid approach compared to what other architectures offer, and in particular ARM, which allows conditional execution on any of its instructions (which has made it one of its distinguishing features; but it is not the only ISA that works this way, although it is the best known).

What may come as a great surprise is that, for its new 64-bit architecture, called AArch64 or ARM64, ARM decided to completely remove conditional execution on all instructions, falling back on the same approach followed by Intel, i.e. making only certain instructions available to be executed conditionally.

This is not so surprising if we take into account the fact that, having to encode instructions (in 32-bit opcodes) that work with a bank of 32 registers, ARM had to make a virtue of necessity by eliminating conditional execution and, thus, being able to reuse four precious bits to better model everything else. On the other hand, the calculator doesn’t lie: ternary instructions require 3 x 5 = 15 bits, and using another 4 bits for the condition to be checked would eventually leave 32 – 15 – 4 = 13 bits, which may seem like a lot to encode the opcodes of all the instructions, but it runs out quickly.
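The bit budget above is easy to reproduce. The following Python sketch simply restates the arithmetic from the paragraph; the function name and its parameters are hypothetical, and it ignores every other field a real encoding needs (immediates, shift amounts and so on).

```python
def remaining_opcode_bits(instr_bits=32, num_regs=32, operands=3, cond_bits=4):
    """Bits left for the opcode after register fields and a condition field."""
    # Bits needed to address one register: 5 for a bank of 32, 4 for 16.
    reg_bits = (num_regs - 1).bit_length()
    return instr_bits - operands * reg_bits - cond_bits


if __name__ == "__main__":
    # 32 registers with a 4-bit condition field: 32 - 15 - 4
    print(remaining_opcode_bits())
    # Same layout without the condition field: 32 - 15
    print(remaining_opcode_bits(cond_bits=0))
```

Dropping the condition field is what returns the four bits mentioned above to the rest of the encoding.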

Apologising for this brief digression and returning to APX, this time it was Intel that copied ARM’s approach with AArch64, and in particular the CCMP instruction (what a coincidence!) which works in exactly the same way (a description was given in the previous article).

The reason for this is very simple: thanks to this instruction, it is possible to emulate the control of more complex conditional expressions, making use of boolean AND or OR operators to concatenate simpler single conditions. The issue is not easy to understand, but fortunately there is a splendid article by Raymond Chen on the Microsoft blog that explains in detail and with examples all the conditional instructions introduced by ARM in AArch64.

What is important, in this context, is that what may seem to be a pointless contortionism turns out, instead, to be not only very useful, but also extremely efficient (the examples in the aforementioned blog are more than eloquent) in terms of performance, greatly fulfilling the purpose for which this instruction was created.
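To see how a conditional compare chains simple conditions without intermediate jumps, here is a deliberately reduced Python model. It is a sketch under assumptions: only the zero flag and the ‘equal’ / ‘not equal’ conditions are modelled, and all names (`cmp_flags`, `ccmp`, `default_flags`) are hypothetical. The rule it implements is the one described for this kind of instruction: if the condition holds on the incoming flags, perform the comparison; otherwise load the flag values supplied by the instruction itself.

```python
def cmp_flags(a, b):
    """Flags produced by CMP a, b (only ZF is modelled)."""
    return {"ZF": a == b}


CONDITIONS = {
    "e": lambda flags: flags["ZF"],       # equal
    "ne": lambda flags: not flags["ZF"],  # not equal
}


def ccmp(flags, cond, a, b, default_flags):
    """Conditional compare on top of previously computed flags."""
    if CONDITIONS[cond](flags):
        return cmp_flags(a, b)
    return dict(default_flags)


def both_equal(a, b):
    """a == 1 AND b == 2: compare b only if the first test passed."""
    flags = cmp_flags(a, 1)
    flags = ccmp(flags, "e", b, 2, {"ZF": False})  # default makes the test fail
    return CONDITIONS["e"](flags)


def either_equal(a, b):
    """a == 1 OR b == 2: compare b only if the first test failed."""
    flags = cmp_flags(a, 1)
    flags = ccmp(flags, "ne", b, 2, {"ZF": True})  # default makes the test pass
    return CONDITIONS["e"](flags)
```

In both cases a single conditional jump (or conditional move) on the final flags is enough, whereas the classic translation needs one jump per condition.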

This is a truly ingenious trick by ARM’s engineers, who deserve all my respect, and one that Intel did very well to copy, while offering more flexibility for the second argument of the comparison, which can reference data in memory (whereas the first argument must always be in a register).

Here, however, the first wrinkle also arises since, being in memory, such an operand could generate an exception. CCMP behaves, therefore, exactly like CMOVcc: an exception is generated in any case, even if the condition fails, when the referenced operand is in memory and is not accessible.

This choice, of course, remains patently wrong. The rationale behind ‘short-circuited’ conditional expressions is that, if the first condition is sufficient to determine the result of the entire expression, then everything that follows (the other conditions to be checked) does not need to be evaluated (and, in fact, is not), and execution goes directly to the block of code to be executed.
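The difference is easiest to see with the classic ‘check the pointer, then dereference it’ test. The Python sketch below is a toy model under assumptions: memory is a dictionary, `AccessFault` stands in for a hardware exception, and `eager_ccmp_style` reproduces the behaviour described above, in which the memory operand is accessed regardless of the condition. All names are hypothetical.

```python
class AccessFault(Exception):
    """Stands in for a page fault or access violation in this toy model."""


def load(mem, addr):
    """Read one value from the toy memory, faulting on unmapped addresses."""
    if addr not in mem:
        raise AccessFault(hex(addr))
    return mem[addr]


def short_circuit(mem, ptr, expected):
    """ptr != 0 and [ptr] == expected, evaluated left to right like && in C."""
    return ptr != 0 and load(mem, ptr) == expected


def eager_ccmp_style(mem, ptr, expected):
    """Same test, but the memory operand is read before the condition counts."""
    first = ptr != 0
    value = load(mem, ptr)  # unconditional access: may fault even if first is False
    return first and value == expected
```

With a null pointer the first function returns False, while the second raises `AccessFault`: exactly the situation in which a compiler cannot use the conditional compare and has to fall back on a jump.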

The paradox is that, as previously reported, Intel introduced the new CFCMOVcc instruction specifically to support this widespread way of evaluating conditional expressions (without generating exceptions when the condition is not satisfied), but decided not to do the same where it was most needed: with the new CCMP, which was introduced precisely to help in these situations! It is a decision that leaves me absolutely baffled, as it is totally meaningless…

Instead, Intel thought of introducing CTEST, the use cases of which are far rarer than the widespread comparison operations. If it really wanted to keep CCMP working as it is, it could have added a CFCCMP as well and solved the problem, just as it did with CMOVcc.

Still on this aspect, it must be said that, even with these new additions, the number of conditional instructions that x64 makes available remains meagre. A comparison with those of AArch64 (all of which can be found in the above-mentioned blog) shows how poorly it fares: there are only SETcc, CMOVcc, CCMP and CTEST. With APX, Intel could have taken the opportunity to catch up by adding more, or by implementing another solution (as will be explained in the future article on possible improvements).

Last but not least, in order to introduce CCMP and CTEST, Intel decided to complicate the implementation of its processors, as a different implementation of the EVEX prefix is required, specifically and exclusively for these two new instructions (more details in a future article which will also deal with the implementation costs of APX).

These are decisions, in short, that leave a bitter taste in the mouth for a job that could certainly have been done much better and in a much more useful way. But perhaps there is still room for some changes, considering that there is still no processor marketed with these extensions.

In the next article, as already mentioned, there will be some thoughts on the possible impact of APX with regard to code density. In fact, it has already been mentioned so many times that it now deserves to be discussed in depth, as it is a topic of fundamental importance when talking about processor architectures, and one that has made and will continue to make history and literature in this area.
