With the implementation costs covered in the previous article, the observations and criticisms come to an end. This article sets out possible improvements that could be made to APX before the first processors that implement it reach the market (assuming it is not already too late!).
One modification I would suggest is to treat conditional instructions in the same way as other processors do, allowing their effects to be totally ignored if the specific condition is not met. This also makes the implementation in the execution pipeline simpler (only the commit or retire of the instruction is performed).
Currently, on the other hand, if the condition is not met, the destination is reset (CFCMOVcc) if it is a register, while it remains unchanged otherwise. The original CMOVcc also has the flaw of generating exceptions if the memory location it references cannot be accessed, even when the condition is false; fortunately, APX provides a variant (CFCMOVcc) that suppresses exceptions in such cases.
All these individual differences, and the different behaviour depending on the instruction, benefit neither the decoder that has to decode them nor the backend that has to execute them. The same happens when only some instructions are given the ability to suppress flags generation, while others are not. This results in greater implementation complexity, also at the expense of compilers (which must take into account and handle all these special cases).
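The three behaviours contrasted above can be put side by side in a toy model. This is only a sketch: the function names, the `Fault` class and the dictionary used as memory are hypothetical, and the reset performed by CFCMOVcc is modelled here as writing zero.

```python
class Fault(Exception):
    """Stands in for a memory-access exception (hypothetical helper)."""


def load(memory, addr):
    # A missing key plays the role of an inaccessible memory location.
    if addr not in memory:
        raise Fault(addr)
    return memory[addr]


def cmovcc(cond, dest, memory, addr):
    # Original CMOVcc: the load happens even when the condition is false,
    # so an inaccessible location always faults.
    value = load(memory, addr)
    return value if cond else dest


def cfcmovcc(cond, dest, memory, addr):
    # CFCMOVcc as described in the text: no fault on a false condition,
    # but a register destination is reset (modelled here as zero).
    return load(memory, addr) if cond else 0


def proposed(cond, dest, memory, addr):
    # Proposed behaviour: a false condition turns the instruction into a
    # no-op, so the destination keeps its old value and nothing faults.
    return load(memory, addr) if cond else dest
```

With a false condition and an inaccessible address, `cmovcc` raises, `cfcmovcc` returns zero, and `proposed` returns the old destination value unchanged.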
REX2 (to add NF)
So the next concrete, as well as extremely simple, change would be to make the NF (No Flags) bit available to all instructions ‘promoted’ by this new extension, instead of just a few.
In reality, all the improvements proposed in this article involve the complete removal of the concept of ‘promotion’ (which currently applies only to certain instructions and led to the creation of map 4 using the EVEX prefix, as we have already seen in the first article), since the idea is to allow all general-purpose instructions to take advantage of the new features introduced with APX.
In order to achieve this (while at the same time giving code density a helping hand), a trivial modification to the REX2 prefix is required, which currently has the following structure:
|REX2 (2-byte REX)|
Which, by adding the NF bit to signal the possible suppression of flags generation, becomes:

| New REX2 (2-byte REX) | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|
| Byte 0 (0xD4, 0xD5) | 1 | 1 | 0 | 1 | 0 | 1 | 0 | M0 |
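As a rough illustration, the proposed two-byte prefix could be assembled as follows. The first byte follows the surviving table row (0xD4 or 0xD5 depending on M0). The order of the fields after NF in the second byte is an assumption: it mirrors the payload order of the existing REX2 (R4, X4, B4, W, R3, X3, B3). The function name is hypothetical.

```python
def encode_new_rex2(m0, nf, r4, x4, b4, w, r3, x3, b3):
    """Return the two bytes of the proposed REX2 prefix (sketch only)."""
    fields = (m0, nf, r4, x4, b4, w, r3, x3, b3)
    if any(bit not in (0, 1) for bit in fields):
        raise ValueError("every field is a single bit")
    # First byte: 1101010 followed by M0, i.e. 0xD4 (map 0) or 0xD5 (map 1).
    byte0 = 0xD4 | m0
    # Second byte: NF in the most significant bit, then the assumed order
    # R4, X4, B4, W, R3, X3, B3 taken from the current REX2 payload.
    byte1 = 0
    for bit in (nf, r4, x4, b4, w, r3, x3, b3):
        byte1 = (byte1 << 1) | bit
    return bytes([byte0, byte1])
```

For instance, `encode_new_rex2(1, 1, 0, 0, 0, 1, 0, 0, 0)` selects map 1 with NF and W set and yields the bytes D5 88.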
Now we use not only the opcode (D5 in hexadecimal) of the old AAD instruction (removed by x64 in 64-bit mode), but also that of AAM (D4), both of which allow us to set NF (in the MSB, the most significant bit, of the second byte) without any penalty other than that of using REX2 itself, which, however, occupies only two bytes (as opposed to EVEX, where four bytes would be needed!).
The reason why NF has taken the place that M0 had in the original REX2 will become clearer later with the other prefixes, but I anticipate that it serves to keep exactly the same format for the second byte everywhere. The way the map is selected, on the other hand, differs depending on the prefix (but this is the only variation).
REX3 (to add condition)
In the same vein, and as previously suggested, a condition could be applied to all general-purpose instructions, giving them the possibility of being totally ignored if it is not fulfilled, without any side effects (as also explained at length above).
This modification is extremely important precisely in order to address the point Intel itself makes in the APX presentation, namely that processor pipelines are becoming longer (and wider) as time goes by, and thus more susceptible to performance losses when the prediction of conditional jumps fails.
The solution I propose for this purpose is to introduce a new prefix, REX3, very similar to REX2 but with the addition of a byte in which it is possible to specify the condition that must be fulfilled for the instruction to be executed. The format of the new prefix is as follows:
|REX3 (3-byte REX)|
where, as we have already seen in the first article setting out the format of all the prefixes added or modified by APX, SC3..SC0 are four bits representing the code (modified, excluding the test for the parity bit P) of the condition that is used in conditional jumps, while NF is the No Flags bit we have already seen above with the new prefix REX2.
The three bits set to 0 in the third byte, which come before M0, leave room for further maps to be added (although using all of them for this purpose would give 16 maps, which would be too many) and/or for other features to be enabled in future extensions.
As can be seen, this new prefix (for which I have used opcode 1F, which corresponds to the old legacy POP DS instruction) is quite simple, flexible, and easier to implement than EVEX; it also has the not inconsiderable advantage of occupying one byte less than the latter, thus mitigating the impact on code density.
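A minimal sketch of how a REX3 prefix might be assembled is given below. The opcode 1F comes from the text. The exact bit positions in the third byte are an assumption made for illustration: three zero bits, then M0, then SC3..SC0 in the low nibble. The second byte is taken as an already-built value in the same format as the second byte of the new REX2. The function name is hypothetical.

```python
REX3_OPCODE = 0x1F  # former POP DS opcode, as stated in the text


def encode_rex3(payload, m0, sc):
    """Return the three bytes of a REX3 prefix (assumed bit layout).

    payload: second byte, same format as the second byte of the new REX2
    m0:      map selector bit (0 or 1)
    sc:      4-bit condition code SC3..SC0
    """
    if not 0 <= payload <= 0xFF:
        raise ValueError("payload must fit in one byte")
    if m0 not in (0, 1):
        raise ValueError("M0 is a single bit")
    if not 0 <= sc <= 0xF:
        raise ValueError("the condition code is four bits wide")
    # Assumed third byte: 0 0 0 M0 SC3 SC2 SC1 SC0.
    third = (m0 << 4) | sc
    return bytes([REX3_OPCODE, payload, third])
```

Whatever the real bit positions are, the prefix stays at three bytes, one fewer than EVEX.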
Taking advantage of REX3, it is also possible to (re)implement the new CCMP and CTEST instructions by exploiting the conditional jump opcodes of map 0 (the classic conditional jump instructions with an 8-bit offset) for the former and those of map 1 (the less famous conditional jumps with a 16- or 32-bit offset) for the latter. The first 4 bits (the least significant ones) of the opcode will be used to specify the values of the OF, SF, ZF and CF fields, to be copied to the respective flags in the event that the condition in REX3 is not met.
In this case the format of the instruction for CCMP becomes as follows:
|REX3 (3-byte REX) for CCMP|
|REX3 (3-byte REX) for CTEST|
The choice of reusing the opcodes of the conditional jump instructions is certainly the best one, because using the new REX3 to make conditional instructions that are already conditional in themselves would not make any sense. So we might as well reuse them, using the 4 bits of the condition to store the values of the flags.
This is a very simple implementation, as can be seen: it requires a couple of trivial comparisons in the presence of the new REX3 prefix to check whether we are in the special case of these two new instructions, and it also has the advantage of occupying one byte less than the current solution using EVEX, thus improving code density.
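The "couple of trivial comparisons" could look roughly like the sketch below, which a decoder would run only after having seen a REX3 prefix. The opcode ranges are those of the existing conditional jumps (70..7F in map 0 for the 8-bit offset form, 80..8F in map 1 for the 16/32-bit offset form). The assignment of the four low bits to OF, SF, ZF and CF in that order is an assumption, and the function name is hypothetical.

```python
def classify_under_rex3(opcode_map, opcode):
    """Detect the CCMP/CTEST special case after a REX3 prefix (sketch).

    Returns (mnemonic, flag_values) or None for any other instruction.
    """
    if opcode_map == 0 and 0x70 <= opcode <= 0x7F:
        mnemonic = "CCMP"    # reuses the short conditional jumps
    elif opcode_map == 1 and 0x80 <= opcode <= 0x8F:
        mnemonic = "CTEST"   # reuses the near conditional jumps
    else:
        return None
    low = opcode & 0x0F
    # Assumed bit order: OF, SF, ZF, CF from bit 3 down to bit 0.
    flag_values = {
        "OF": (low >> 3) & 1,
        "SF": (low >> 2) & 1,
        "ZF": (low >> 1) & 1,
        "CF": low & 1,
    }
    return mnemonic, flag_values
```

The returned flag values are the ones that would be copied to the flags when the REX3 condition is not met.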
VEX3 (for new registers)
In this regard, code density could also be trivially improved for instructions (AVX-2) that make use of the VEX3 prefix, should it become necessary to access the 16 general-purpose registers that APX has added, without having to resort to the longer (by one byte) and more complicated EVEX.
VEX3 currently has the following format:
|VEX3 (3-byte VEX)|
whereas with my proposal it would become:
|New VEX3 (3-byte VEX)|
This reuses bits m4..m2 to add the 3 bits needed to specify the new registers. It would reduce the selectable opcode maps from 32 to just 4, but this would not be a big problem, for a couple of reasons.
The first is that there are currently only four maps for all instructions (and there is still room to add more), so none would be missing. The second is that the current trend is to use AVX-512 to extend the SIMD instruction set, which always makes use of the EVEX prefix (which supports up to 8 maps, so there is plenty of room to add another thousand instructions).
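The trade-off on the map field can be checked with a few lines. The sketch below packs the second byte of the proposed VEX3 from three raw fields: the existing 3-bit R/X/B group, the 3 new register bits that take over m4..m2, and the remaining 2-bit map selector. The order and polarity of the 3 new bits are assumptions, and the function name is hypothetical.

```python
MAPS_WITH_5_BITS = 2 ** 5   # m4..m0 in the current VEX3
MAPS_WITH_2_BITS = 2 ** 2   # m1..m0 left after reusing m4..m2


def pack_new_vex3_byte1(rxb, new_reg_bits, map_select):
    """Pack the second byte of the proposed VEX3 (assumed layout)."""
    if not 0 <= rxb <= 0b111:
        raise ValueError("the R/X/B group is three bits wide")
    if not 0 <= new_reg_bits <= 0b111:
        raise ValueError("three bits are available for the new registers")
    if not 0 <= map_select < MAPS_WITH_2_BITS:
        raise ValueError("only maps 0..3 remain selectable")
    # Bits 7..5: existing R/X/B group, bits 4..2: new register bits,
    # bits 1..0: map selector.
    return (rxb << 5) | (new_reg_bits << 2) | map_select
```

The two constants reproduce the reduction from 32 to 4 selectable maps mentioned in the text.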
REXM0 and REXM1 to eliminate EVEX
With a similar approach, but copying what has already been done with the REX3 prefix that I proposed just above, one could avoid using EVEX altogether in order to ‘promote’ instructions from binary to ternary, and from unary to binary, which EVEX makes possible thanks to the new ND bit (which, set to 1, enables this new functionality) and the v̅4..v̅0 field, which allows one to specify the register to be used to store the result of the operation.
In this case, it would be a matter of reusing some opcodes that x64 has freed (by removing some legacy x86 instructions) to add the following two prefixes:
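| REXM0 (3-byte REX with NDD, for map 0) | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|
| Byte 0 (0x06, 0x16) | 0 | 0 | 0 | NDD4 | 0 | 1 | 1 | 0 |

| REXM1 (3-byte REX with NDD, for map 1) | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|
| Byte 0 (0x0E, 0x1E) | 0 | 0 | 0 | NDD4 | 1 | 1 | 1 | 0 |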
As can be seen, the two new prefixes (using opcodes 06, 16, 0E and 1E, corresponding to the old PUSH ES, PUSH SS, PUSH CS and PUSH DS instructions), REXM0 and REXM1, are very similar to REX3, but with some slight differences.
Firstly, it is possible to specify the destination register (NDD) via the new NDD4..NDD0 bits (without having to set the ND bit, which is implicitly specified). Then, the M0 bit has disappeared to make way for NDD0, as map 0 or map 1 is now selected using the appropriate prefix (REXM0 for map 0, REXM1 for map 1). Similarly, and if needed, other prefixes could be added to support new maps (there are still enough legacy instruction opcodes that are free in x64).
It should be emphasised that these two prefixes do not need to implement the new CCMP and CTEST instructions as well, since the new destination register is of no use in this case (there is no result to store: they are just flags-altering instructions). Their implementation using only REX3 is therefore sufficient, as explained above.
These two new prefixes are shorter (by one byte) than
EVEX, thus limiting the damage to code density caused by using such long prefixes, but they also have the added advantage of making conditional any general-purpose instruction that has been extended to ternary or binary.
An example is the following instruction:

; Add 1234567890 to the 64-bit value from memory and save it to RAX
; if the zero flag (Z) is set.
ADD.Z RAX,[RBX + RCX * 8 + 1234],1234567890

Its operation and potential should be self-evident; note in particular that the instruction would not generate any exception if the condition were not met and the memory element were inaccessible.
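The intended behaviour of this conditional, three-operand ADD can be modelled in a few lines. This is only a sketch of the semantics described in the text: registers and memory are plain dictionaries, the function name is hypothetical, and flag updates are left out.

```python
MASK64 = (1 << 64) - 1


def add_z(zero_flag, regs, memory, dest, base, index, scale, disp, imm):
    """Model of ADD.Z dest,[base + index * scale + disp],imm (sketch)."""
    if not zero_flag:
        # Condition not met: the instruction is ignored entirely, so the
        # memory operand is never touched and no exception can arise.
        return dict(regs)
    addr = regs[base] + regs[index] * scale + disp
    updated = dict(regs)
    # The sum goes to the separate destination register (three-operand form).
    updated[dest] = (memory[addr] + imm) & MASK64
    return updated
```

Calling it with the zero flag clear and an empty `memory` returns the registers unchanged, which mirrors the "no exception" point made above.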
Furthermore, and to close, REXM0 and REXM1 are also much simpler to implement (the mechanism is similar to that of REX3, which in turn is similar to REX) than the enormous complication of the new EVEX prefix.
EVEX (for new registers)
EVEX, having now become completely unnecessary for the ‘promotion’ of general-purpose instructions, only requires the trivial addition of the 3 bits to address the new APX registers, as already proposed for VEX3. So its new format will be this:
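| Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Payload bits |
|---|---|---|---|---|---|---|---|---|---|
| Byte 0 (62h) | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | |
| Byte 1 (P0) | R̅3 | X̅3 | B̅3 | R̅4 | B4 | m2 | m1 | m0 | P[7:0] |
| Byte 2 (P1) | W | v̅3 | v̅2 | v̅1 | v̅0 | X̅4 | p1 | p0 | P[15:8] |
| Byte 3 (P2) | z | L’ | L | b | v̅4 | a2 | a1 | a0 | P[23:16] |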
and would continue to function exactly as now: exclusively for AVX-512.
Summary of the proposed changes
Coming to a close, I think it is appropriate to recapitulate the benefits of the proposed changes to APX:
- simplified implementation (and, consequently, lower transistor count and power consumption);
- less impact on code density (25% to 50% less space occupied by the new prefixes, compared to the use of EVEX, for both general-purpose and AVX/VEX3 instructions), which in turn translates into lower power consumption (less pressure on caches and, in general, on the entire memory hierarchy);
- all general-purpose instructions that modify flags can suppress their generation (through the use of NF);
- all general-purpose instructions become conditional (with simplification of both the compilers and the execution pipeline, which now only has to decide whether or not to commit the instruction).
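The 25% to 50% figure in the list above follows directly from the prefix lengths given throughout the article (two bytes for REX2, three for REX3, REXM0/REXM1 and VEX3, four for EVEX). A quick check, with hypothetical names:

```python
EVEX_BYTES = 4

# Prefix lengths in bytes, as stated in the article.
PROPOSED_PREFIX_BYTES = {
    "REX2": 2,
    "REX3": 3,
    "REXM0/REXM1": 3,
    "VEX3": 3,
}


def saving_percent(prefix_bytes):
    """Space saved by a prefix relative to the four-byte EVEX."""
    return 100 * (EVEX_BYTES - prefix_bytes) / EVEX_BYTES


savings = {name: saving_percent(size)
           for name, size in PROPOSED_PREFIX_BYTES.items()}
```

Here `savings` maps REX2 to 50.0 and the three-byte prefixes to 25.0. The saving applies to the prefix only, not to the whole instruction.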
The advantages of these solutions should be obvious: the same amount of new functionality is made available, but with the not inconsiderable possibility of conditionally executing all general-purpose instructions (a new feature, therefore, in addition to what APX already offers).
Finally, it should be noted that the new prefixes are designed so that all the innovations can be used incrementally. The new REX2 offers, as a basic feature, access to the new registers and suppression of flags generation (NF). On top of that, REX3 adds the possibility of specifying the condition for instruction execution. REXM0 and REXM1 add, on top of that, the new destination register (NDD). All in a simple and ‘compiler-friendly’ manner.
The next article will be the last and will report the conclusions regarding APX.