Having discussed the innovative features of APX, let us turn to the new instructions that have been added by this extension.
Calling convention (routines)
Doubling the number of general-purpose registers means that the new ones, too, have to be saved to and restored from the stack when they are used across calls to routines (whether functions or methods), according to the calling convention adopted by the specific platform (which is part of the so-called ABI).
Intel has proposed defining the new registers as volatile, i.e. they are freely usable by the routine that has been called (callee, in jargon). It will, therefore, be up to the calling routine (the caller) to store their values before invoking the routine, and then restore them immediately afterwards (this convention is called caller-saved).
There are pros and cons to every such choice. Here, since saving and restoring the new registers is entirely the caller's responsibility, code density suffers quite a bit: these operations have to be repeated at every single point where the routine is called (so if 100 places in the program call it, the save and restore operations for the new registers in use appear 100 times).
If, on the other hand, the opposite convention (callee-saved) had been adopted, code density would have benefited considerably (because there would have been only one point in the program where these operations were performed: at the beginning and end of the called routine), but the performance of the routine would have suffered (because the new registers would have had to be saved before they could be used and, vice versa, they would have had to be restored before returning any results or, in any case, returning control to the caller).
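The code-density side of this trade-off can be made concrete with a rough count. The following Python sketch is only a back-of-the-envelope model: the function names, and the assumption that each register costs exactly one save and one restore instruction, are illustrative and do not come from any ABI document.

```python
def caller_saved_cost(call_sites: int, live_regs: int) -> int:
    """Static instruction count when every call site saves and restores the registers."""
    # One save and one restore per register, repeated at each call site.
    return call_sites * live_regs * 2


def callee_saved_cost(used_regs: int) -> int:
    """Static instruction count when the called routine saves and restores the registers."""
    # One save in the prologue and one restore in the epilogue, written only once.
    return used_regs * 2


if __name__ == "__main__":
    # Hypothetical figures: 100 call sites, 4 of the new registers involved.
    print(caller_saved_cost(100, 4))  # 800
    print(callee_saved_cost(4))       # 8
```

The model deliberately ignores execution time, which is where the callee-saved convention pays its price.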
It is not possible to establish a priori which convention is best, since this clearly depends on the type of code being executed. But an ABI has to set a convention anyway, because it must be valid for, and used by, all the applications that run on the system, so a choice had to be made.
In my opinion, perhaps it would have been better to choose a middle way: a hybrid solution in which the first eight new registers could have been used freely by the caller (and, therefore, saved and restored by the called party, should it need to use them in turn), while the other eight would have been available to the called party (and, therefore, the caller would have had to preserve their values).
This is because a routine rarely uses all the registers at its disposal, so often some of the registers would have been used, but without any need for the caller or the called party to retain their values, with obvious advantages on both sides (including the infamous code density).
Coming back to the new instructions, dealing with 32 registers means potentially having to execute several PUSH and POP instructions every time one of the above situations occurs. And this should happen quite frequently: if 16 new registers have been added, it is precisely because they are meant to be used, and often too (though not always all of them)! Otherwise, there would have been no point in making all these changes.
This sounds rather strange to me, since I still remember very well how AMD claimed, when introducing x64, to have evaluated extending x86 to 32 registers instead of 16, but to have given up because the advantages did not prove significant (unlike the switch from 8 to 16 registers, where the differences were quite tangible, as we have seen for ourselves) and did not justify the greater implementation complexity of such a solution.
In any case, going back to the topic, Intel thought of mitigating the situation a bit by adding a couple of new instructions, PUSH2 and POP2, which, as can be guessed from their mnemonics, push or pop two registers at a time to/from the stack, instead of just one (as is the case with the normal PUSH and POP). This can roughly halve the number of instructions that would normally be required, with obvious performance advantages (one instruction executed each time, instead of two).
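To illustrate the "roughly halve" claim, here is a small Python sketch that groups a list of registers to be saved into pairs. The helper `plan_saves` and the tuple layout are hypothetical, and the operand order within each pair is an assumption made purely for illustration; the sketch only counts instructions and does not model the actual stack layout.

```python
def plan_saves(regs):
    """Group registers into PUSH2 pairs, with a single PUSH for any leftover register."""
    plan = []
    # Walk the list two registers at a time; each full pair becomes one PUSH2.
    for i in range(0, len(regs) - 1, 2):
        plan.append(("PUSH2", regs[i], regs[i + 1]))
    # An odd register count leaves one register to be saved with a plain PUSH.
    if len(regs) % 2:
        plan.append(("PUSH", regs[-1]))
    return plan


if __name__ == "__main__":
    for step in plan_saves(["R16", "R17", "R18", "R19", "R20"]):
        print(step)
```

With five registers the plan contains three instructions instead of five, which is why the saving is "roughly", not exactly, one half.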
An example, taken from an old version of
SUB RSP, 0x68
LEA RBP, [RSP+0x80]
MOV ESI, [RIP+0x20f79f2]
TEST ESI, ESI
LEA RSP, [RBP-0x18]
easily shows how the PUSH and POP instructions could be halved by using the new PUSH2 and POP2.
Also on this subject, although not a new instruction as such, is the introduction of a so-called ‘hint’ for the PUSH and POP instructions (exclusively those that operate on registers and use the classic, and most widespread, encoding), which tells the processor that these instructions (executed in the appropriate sequence) are ‘balanced’. In this case, the processor can keep the values internally instead of passing them through memory, improving the performance of these two operations (and without stressing the memory hierarchy).
Finally, another new instruction is JMPABS, which, as the name suggests, jumps to a 64-bit absolute address. Evidently Intel has come across a fair number of cases in which this is necessary (after all, in 64-bit mode the classic direct JMP instructions can only reach targets within roughly ±2 GB of the instruction) and has decided to make up for it, even though I personally have not run into situations where such an operation was needed.
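The ±2 GB limit follows directly from the 32-bit signed displacement of the classic direct jump. As a minimal sketch (the function name is hypothetical), the check below assumes the 5-byte `E9` form of JMP, whose displacement is relative to the address of the following instruction:

```python
JMP_REL32_LENGTH = 5  # opcode E9 plus a 4-byte signed displacement


def reachable_with_rel32(jmp_address: int, target: int) -> bool:
    """True if `target` can be encoded as a rel32 displacement from a JMP at `jmp_address`."""
    # The displacement is measured from the end of the JMP instruction.
    displacement = target - (jmp_address + JMP_REL32_LENGTH)
    return -(1 << 31) <= displacement <= (1 << 31) - 1


if __name__ == "__main__":
    print(reachable_with_rel32(0x400000, 0x500000))        # nearby target
    print(reachable_with_rel32(0x400000, 0x7FFF00000000))  # far-away target
```

Any target for which this returns `False` needs either an indirect jump through a register or memory, or an absolute form such as the one JMPABS provides.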
New conditional instructions
Other new instructions introduced by APX are the so-called conditional instructions, for which the format of the EVEX prefix changes according to the last table shown in the first article (which introduces the fields SC3..SC0) and which, of course, check whether a certain condition (specified in SC3..SC0) is true in order to decide how to proceed (depending on the particular type of instruction).
In fact, the only two (new, of course) instructions that use this special format of EVEX are CCMPscc and CTESTscc, which differ only in the type of check that is made, if any (as with the CMP and TEST instructions, respectively), depending on whether the condition in SC3..SC0 is true.
Their operating logic can be briefly summarised as follows: if the condition in SC3..SC0 is satisfied, then the processor flags are updated by comparing the two operands, just as with CMP or TEST. If, on the other hand, it is not, then no comparison is made, and the OF, SF, ZF and CF flags are set by copying their values from the equivalent fields found in EVEX; in addition, the AF flag is always reset to zero.
It should be pointed out that not all the conditions normally available in x64 can be used: those that check the parity flag (P and NP) cannot. Their two encodings have been reused, respectively, to force the evaluation (thus always performing the check on the operands) or to skip it (avoiding the check and thus always copying the OF, SF, ZF and CF fields to their respective flags).
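The logic of the conditional compare can be sketched as a small Python model. This is an illustration of the behaviour described above, not an emulator: the names `cmp_flags`, `ccmp` and `default_flags` are hypothetical, the condition is passed in as an already-evaluated boolean (which also covers the two "always" and "never" encodings), and PF and AF are not computed for the ordinary comparison path.

```python
def cmp_flags(a: int, b: int, width: int = 64) -> dict:
    """Flags produced by CMP a, b (that is, a - b) on `width`-bit operands."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    result = (a - b) & mask
    sign = width - 1
    return {
        # Signed overflow: operands have different signs and the result's sign differs from a's.
        "OF": ((a ^ b) & (a ^ result)) >> sign & 1,
        "SF": result >> sign & 1,
        "ZF": int(result == 0),
        # Unsigned borrow.
        "CF": int(a < b),
    }


def ccmp(condition_holds: bool, a: int, b: int, default_flags: dict) -> dict:
    """Conditional compare: real comparison if the condition holds, default flags otherwise."""
    if condition_holds:
        return cmp_flags(a, b)
    # Condition not satisfied: no comparison, the flags come from the encoded defaults.
    flags = {name: default_flags.get(name, 0) for name in ("OF", "SF", "ZF", "CF")}
    flags["AF"] = 0
    return flags


if __name__ == "__main__":
    print(ccmp(True, 5, 5, {"CF": 1}))   # comparison performed
    print(ccmp(False, 5, 5, {"CF": 1}))  # defaults copied
```

The point of the default flags is that a chain of such instructions can evaluate a compound condition without any intermediate branch: a failed earlier test simply forces the flags to a value that makes the final check fail (or succeed) as required.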
An important thing to underline is that these instructions can always raise an exception if one of the operands is in memory and is not accessible (or, in general, causes any kind of fault). This happens regardless, even if the condition in SC3..SC0 is not satisfied and reading the memory operand is, therefore, completely useless. In this respect, the behaviour is identical to that of another conditional instruction present since the days of the Pentium Pro: the famous CMOVcc.
The latter is, incidentally, also the basis of four further new conditional instructions that APX makes available. The first is CMOVcc itself, which is extended using the NDD and, therefore, gains a destination register to store the result of the operation (the second source is copied if the condition cc is met, otherwise the first source is copied).
The other three instructions are called CFCMOVcc, because they all have the same thing in common: they raise no exception if the memory operand is not accessible and the condition is false (the exception is, of course, raised if the condition is true). The first of these is, therefore, identical to the CMOVcc above, but with exceptions suppressed (if the condition is not met). To which I would say: finally! This is what I have always expected from a conditional instruction: there should be no side effects if the condition is not fulfilled!
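The difference in fault behaviour can be shown with a toy Python model. Everything here is hypothetical scaffolding: memory is a plain dictionary, a missing address stands in for an inaccessible location, and both functions follow the three-operand form described above (the first source is kept when the condition is false).

```python
class MemoryFault(Exception):
    """Stands in for a page fault or any other memory access exception."""


def load(memory: dict, address: int) -> int:
    """Read a value, raising MemoryFault if the address is not mapped."""
    if address not in memory:
        raise MemoryFault(hex(address))
    return memory[address]


def cmov(condition: bool, first_source: int, memory: dict, address: int) -> int:
    # Classic behaviour: the memory operand is read unconditionally,
    # so an inaccessible address faults even when the value is discarded.
    value = load(memory, address)
    return value if condition else first_source


def cfcmov(condition: bool, first_source: int, memory: dict, address: int) -> int:
    # Fault-suppressing behaviour: memory is touched only when the condition holds.
    if not condition:
        return first_source
    return load(memory, address)


if __name__ == "__main__":
    memory = {0x1000: 42}
    print(cfcmov(False, 7, memory, 0xDEAD))  # no fault, first source kept
    try:
        cmov(False, 7, memory, 0xDEAD)
    except MemoryFault as fault:
        print("fault at", fault)
```

This is what makes the fault-suppressing form usable for guarding a possibly invalid pointer dereference, something the classic instruction cannot do safely.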
The other two CFCMOVcc instructions do not use the NDD and, therefore, have only two operands: the first always acts as both first source and destination. The difference between the two is that the operands are reversed: for the first, the first operand is a register and the second can reside in memory (or in a register), while for the second it is the exact opposite (the first operand can reside in memory and the second is always a register).
The peculiarity of these four new instructions is that they do not use the special format of EVEX at all (which, as I had already anticipated, is exploited exclusively by the new CCMPscc and CTESTscc): the condition to be checked is encoded directly in the opcode (as in the original instruction from which they derive).
SETcc: improved / new (operating beyond bytes)
Finally, the operation of the SETcc instruction (which I had mentioned in the previous article) has been extended (to sizes other than bytes only), giving the possibility (by exploiting the ND bit) of applying the clear logic (instead of the merge logic, which is the default) when the destination operand is a register rather than a memory location (in which case nothing changes). This is very useful, because it avoids having to add an instruction before SETcc to zero the contents of the register (which typically happens in real code, where the entire register is often used and not just its lowest 8 bits).
An example, also taken from
XOR EAX, EAX
CMP WORD [RCX+0x18], 0x20b
SETZ AL
where it can be seen that the EAX register is zeroed with the XOR instruction, and only then does the SETZ instruction set the value of the least significant byte (represented by the AL register) to 1 if the memory location of the CMP contained the value 0x20b; otherwise AL would remain at 0.
Another recurring pattern is the following:
CMP [RDI], EAX
SETZ AL
MOVZX EAX, AL
where, in this case, the comparison is carried out first to update the flags, then the SETZ instruction is executed to set, according to those new flags, the value of the least significant byte (again AL), and immediately afterwards all the other bytes of EAX are zeroed with the MOVZX instruction.
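Both patterns end up with the same register contents that a SETcc with clear logic would produce in a single instruction. The following Python sketch models this on a 64-bit register value; the function names are hypothetical, and the model relies on the fact that writing a 32-bit register in 64-bit mode also zeroes the upper 32 bits.

```python
MASK64 = (1 << 64) - 1


def setcc_merge(reg: int, condition: bool) -> int:
    """Default SETcc: only the lowest byte is written, the rest of the register is kept."""
    return (reg & ~0xFF & MASK64) | int(condition)


def setcc_clear(reg: int, condition: bool) -> int:
    """SETcc with clear logic: the whole register becomes 0 or 1."""
    return int(condition)


def xor_then_setcc(reg: int, condition: bool) -> int:
    """First pattern: zero the register, then SETcc on its lowest byte."""
    reg = 0  # XOR EAX, EAX
    return setcc_merge(reg, condition)


def setcc_then_movzx(reg: int, condition: bool) -> int:
    """Second pattern: SETcc on the lowest byte, then zero-extend that byte."""
    reg = setcc_merge(reg, condition)
    return reg & 0xFF  # MOVZX EAX, AL


if __name__ == "__main__":
    stale = 0xDEADBEEFCAFEF00D  # arbitrary leftover register contents
    print(hex(setcc_merge(stale, True)))       # upper bytes survive
    print(hex(setcc_clear(stale, True)))       # clean 0x1
    print(hex(setcc_then_movzx(stale, True)))  # clean 0x1, but two instructions
```

The merge form alone leaves stale upper bytes behind, which is exactly why compilers emit the extra XOR or MOVZX today.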
That is all for the moment. The next article will focus on analysing the advantages and flaws of