Having finished reviewing all the new features of
APX (with the last article analysing the new instructions), it is now time to take stock of the advantages (of which we have, however, already had a substantial overview) and the flaws (it is not, in fact, without issues or shortcomings).
Let us begin, then, with the advantages that
APX brings to the ancient (but still alive!)
x64 architecture (which in turn extended the even older and more famous
x86). This task is quite easy because, in my humble opinion, there are four innovations worthy of merit (plus one that I will discuss in the last part), although only the first two are truly important and incisive.
The first is the extension of the most common binary instructions into ternary ones, and of unary instructions into binary ones. With
APX we finally have the possibility of using three operands (or two, in the respective case), which makes it possible to avoid using several instructions to perform the same operation and, in many cases, even to avoid consuming an additional register for some intermediate result.
An example, again starting from plain x64 code:
MOV R8D, ECX
XOR R8D, 0x1
With APX, this would instead become:
XOR R8D, ECX, 0x1
Or, even better:
MOV EBX, [RBP+0x30]
AND EBX, 0x80
With APX it would become:
AND EBX, [RBP+0x30], 0x80
where the ternary operation is combined with the possibility, for one of its two source arguments, of directly referencing the data source in memory.
The performance benefits are, therefore, immediately palpable and need no further praise. It is worth adding that this new mechanism of promoting certain instructions also has the additional advantage of being able not only to use an operand in memory as a source (which is quite rare, if not completely absent, in various other architectures) but, moreover, to use it indifferently as the first or second source (if the particular instruction allows it).
Intel claims that, with
APX, about 10 per cent fewer instructions are executed than with
x64 on preliminary data using the very popular
SPEC2017 benchmark, and I have no doubt that the biggest contribution here comes from the ‘promotion’ of these instructions. This is a distinctive feature of
CISC processors, which is further enhanced in this case and which, to paraphrase the well-known singer Madonna, makes me break out in a “CISCs do it better!”.
The second notable innovation is the extension of the general-purpose registers to 32 in total: 16 more registers are definitely handy in various fields (virtual machines and emulators come to mind, first and foremost, but also compilers/parsers, databases, etc.).
And it is precisely thanks to the 16 additional registers that Intel claims to have reduced load operations by 10 per cent and store operations by over 20 per cent, again with obvious performance benefits (less use of data caches and, in general, the entire memory hierarchy benefits).
Which, let me tell you, clashes heavily with the statements made by AMD following the introduction of its
x64 architecture, which, as I had already mentioned in the previous article, claimed that the move from 16 to 32 registers would offer little advantage (not justifying the greater implementation complexity). We do not know, however, how that company planned to add all these registers, compared to the mechanism that Intel has now implemented with APX.
Also of interest is the addition of the new
PUSH2 and POP2 instructions, which operate on two registers at a time and thus almost halve the number of the usual
PUSHes and POPs in the code, as we have already seen in the previous article, which also showed a portion of real code demonstrating how such sequences are used in the prologue and epilogue of a routine (a not at all unusual scenario). Here, too, we can speak of performance benefits (fewer instructions executed).
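To make the saving concrete, here is a small sketch of my own (not taken from Intel's material; the operand order shown is an assumption of mine, so check the official specification before relying on it) of a routine epilogue rewritten with the paired instruction:

```asm
; Traditional x64 epilogue: four restores, four instructions
POP R15
POP R14
POP R13
POP R12

; With APX the same restores collapse into two instructions
; (operand order is illustrative only):
POP2 R15, R14
POP2 R13, R12
```

The prologue benefits symmetrically with PUSH2, and in both cases the front end has half as many instructions to fetch and decode.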
Slightly less useful, however, is the possibility, for the instructions that can be ‘promoted’, of suppressing the generation of flags, as discussed in detail above. The scenarios are not as common as those listed for the other points, but they are relevant enough to merit the introduction of this feature (especially if the implementation cost is insignificant).
Turning to criticisms and problems (it is not all wonderful), and picking up directly from the last feature mentioned just above, I do not understand why, for some absurd reason, the suppression of flag generation is the prerogative of only some instructions (always, in any case, among the ‘promoted’ ones) and not of all: it makes no sense!
An example are the
ADC and SBB instructions (addition and subtraction using the carry flag): for them the
NF bit cannot be used (it must be left at
0), while the traditional
ADD and SUB (which do not use the carry) can enable it. Obviously, these are not the only cases: there are several ‘promoted’ instructions that normally generate flags for which it is not possible to suppress them.
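As an illustration of the asymmetry, here is a sketch of mine (the {nf} marker for the EVEX NF bit follows the notation assemblers are converging on, but the exact syntax may differ across tools):

```asm
; Promoted SUB: the NF bit may be set, so the flags stay untouched
{nf} SUB EBX, ECX, EDX   ; EBX = ECX - EDX, no flags written

; Promoted SBB: NF must remain 0, so the flags are always written
SBB EBX, ECX, EDX        ; EBX = ECX - EDX - CF, flags updated
```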
Continuing in the same vein, the (albeit useful)
PUSH2 and POP2 instructions could also have been avoided, delegating to the microarchitecture alone the task of identifying the pairs to be ‘joined in marriage’ by exploiting the macro-op fusion mechanism already present, besides the fact that they are decidedly longer (but I will talk more about code density in the next article).
It is certainly more complicated to implement, but it would also be transparent to existing applications (and, therefore, exploitable even on 32-bit /
x86 code, which abounds even more in
PUSH and POP sequences due to the
ABI being stack-based instead of register-based).
Frankly, I do not see the new
CMOVcc-based instructions as particularly interesting. Apart from the ternary extension (which, by the way, is natural/obvious, since it falls under the extension of ‘promoted’ instructions from binary to ternary, so they would have benefited from this new functionality anyway), there is little use for them.
Yes, the possibility of suppressing exceptions when the condition is not satisfied (as I already made clear in the previous article) is a sacrosanct improvement, but in fact all conditional instructions should have been implemented this way, ever since their introduction with the Pentium Pro! The criticism, therefore, lies right here: such a mechanism should have been extended to many more instructions instead of being relegated to
MOV alone (I will talk more about this in the future article on possible improvements to APX).
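To show what exception suppression buys in practice, here is a branchless guarded load sketched by me (the register choices are mine, not from any official example):

```asm
; r = (ptr != NULL) ? *ptr : 0, with no conditional jump
; (pointer assumed in RDI, result in EAX)
XOR  EAX, EAX            ; default result: 0
TEST RDI, RDI            ; is the pointer null?
CFCMOVNZ EAX, [RDI]      ; load only if RDI != 0; if the condition
                         ; fails, the memory access is suppressed
                         ; and no page fault can occur
```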
Similarly, the fact that only certain instructions can be ‘promoted’, and thus take advantage of the new and interesting features that allow their uses to be expanded, is a major limitation that complicates both compilers (which prefer orthogonal instruction sets) and the implementation (more on this in a future article).
In addition, as it is designed, the new map
4 (in which these promoted instructions reside, as already illustrated in the first article) could in future run out of encoding space, forcing another map to be created and further complicating the implementation of the architecture. In that case, among other things, it will no longer be possible to make use of the
REX2 prefix (because it can only map the opcodes of maps
0 and 1), forcing one to always use
EVEX and pay the higher costs in terms of lower code density.
Finally, the new
CCMP and CTEST instructions deserve a separate discussion, as they have both advantages and flaws, which is why I have preferred to devote a separate section to them instead of fragmenting the discussion across the two above.
Let us start with the advantages, which do not seem at all obvious, since their mechanism of operation appears, in fact, rather convoluted and, at first sight, downright useless. It is not easy, in fact, to get clear in one’s mind what the hell they can be used for and in what real scenario they can bring tangible benefits.
The goal explicitly stated by the company (in the
APX presentation) is to reduce the number of conditional jumps by exploiting conditional instructions instead, so as to mitigate as much as possible the problems related to pipelines, which over time have become longer and longer (so a conditional jump whose prediction has failed carries a very high price).
Intel had already moved in this direction several years ago with the introduction of
CMOVcc (which I have already discussed at length), although these represent a very timid approach compared to what other architectures offer, and in particular ARM, which allows conditional execution of any of its instructions (which has made this one of its distinguishing features; it is not the only ISA that works this way, but it is certainly the best known).
What has left me extremely surprised is the discovery that, for its new 64-bit architecture, called
ARM64, ARM decided to completely remove conditional execution from its instructions, falling back on the same approach followed by Intel, i.e. making only certain instructions available for conditional execution.
This is not so surprising if we take into account that, having to encode instructions (in 32-bit opcodes) that work with a bank of 32 registers, ARM had to make a virtue of necessity, eliminating conditional execution and thus reusing four precious bits to better model everything else. The arithmetic does not lie: ternary instructions require 3 x 5 = 15 bits for the registers alone, and spending another 4 bits on the condition to be checked would leave 32 – 15 – 4 = 13 bits, which may seem a lot for encoding the opcodes of all the instructions, but it runs out quickly.
Apologising for this brief digression and returning to
APX, this time it was Intel that copied ARM’s approach with
AArch64, and in particular the
CCMP instruction (what a coincidence!) which works in exactly the same way (a description was given in the previous article).
The reason for this is very simple: thanks to this instruction, it is possible to emulate the evaluation of more complex conditional expressions, using the boolean
AND and OR operators to chain simpler single conditions. The issue is not easy to understand, but fortunately there is a splendid article by Raymond Chen on the Microsoft blog that explains in detail, and with examples, all the conditional instructions introduced by ARM in AArch64.
What is important, in this context, is that what may seem to be a pointless contortionism turns out, instead, to be not only very useful, but also extremely efficient (the examples in the aforementioned blog are more than eloquent) in terms of performance, greatly fulfilling the purpose for which this instruction was created.
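A minimal AArch64 sketch of my own, in the spirit of the examples in that blog, showing how two chained conditions are evaluated without any intermediate branch:

```asm
// if (x == 2 && y == 3) { ... }, with x in x0 and y in x1
CMP  x0, #2           // first condition: flags reflect x == 2
CCMP x1, #3, #0, eq   // if EQ holds, compare y with 3;
                      // otherwise force NZCV = 0 (a "not equal")
B.NE skip             // one branch decides the whole expression
                      // ... body of the if ...
skip:
```

The third operand of CCMP is the flag value to impose when the incoming condition fails, which is what makes the short-circuit chaining possible.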
A truly ingenious gimmick by the ARM engineers, who deserve all my respect, and which Intel did very well to copy, while offering more flexibility for the second argument of the comparison, which can reference an operand in memory (whereas the first argument must always be in a register).
Here, however, the first wrinkle arises: being in memory, such an operand could generate an exception.
CCMP behaves, in this respect, exactly like
CMOVcc: the exception is generated in any case, even if the condition fails, when the referenced memory operand is not accessible.
A choice which remains patently wrong, since the rationale behind the evaluation of ‘short-circuited’ conditional expressions is that, if the first condition is sufficient to determine the result of the entire expression, then everything that follows (the other conditions to be checked) need not be processed (and, indeed, is not), and execution proceeds directly to the appropriate block of code.
The paradoxical thing is that, as previously reported, Intel introduced the new
CFCMOVcc instruction specifically to support this widespread pattern of evaluating conditional expressions (without generating exceptions when the condition is not satisfied), but decided not to do the same where it was most needed: with the new
CCMP, which was introduced precisely to help in these situations! It is a decision that leaves me absolutely baffled, as it is totally meaningless…
Instead, it thought of introducing a
CTEST, the use cases of which are far rarer than the widespread comparison operations. If it really wanted to keep
CCMP working as it is, it could have added a
CFCCMP instead and solved the problem, just as it did with CMOVcc.
Still on this aspect, it must be said that, even with these new additions, the number of conditional instructions that
x64 makes available remains meagre. A comparison with those of
AArch64 (all of which can be found in the above-mentioned blog) shows how badly it fares: there are only a few. With
APX, Intel could have taken the opportunity to catch up by adding more, or implemented another solution (as will be explained in the future article on possible improvements).
Last but not really least, in order to introduce
CCMP and CTEST, Intel decided to complicate the implementation of its processors, as a different handling of the
EVEX prefix is required, specifically and exclusively for these two new instructions (more details in a future article, which will also deal with the implementation costs of APX).
These are decisions, in short, that leave a bitter taste in the mouth for a job that could certainly have been done much better and in a much more useful way. But perhaps there is still room for some changes, considering that there is still no processor marketed with these extensions.
In the next article, as already mentioned, there will be some thoughts on the possible impact of
APX with regard to code density. In fact, it has already been mentioned so many times that it now deserves to be discussed in depth, as it is a topic of fundamental importance when talking about processor architectures, and one that has made and will continue to make history and literature in this area.