APX: Intel’s new architecture – 7 – Possible improvements

With implementation costs covered in the previous article, the observations and criticisms come to an end. What follows are possible improvements that could be made to APX before the first processors implementing it reach the market (assuming it is not already too late!).

Various modifications

One modification I would suggest is to treat conditional instructions the way other processors do: if the specific condition is not met, the instruction’s effects are ignored entirely. This also simplifies the implementation in the execution pipeline (when the condition is false, only the commit, or retirement, of the instruction is performed).

Currently, on the other hand, if the condition is not met CFCMOVcc resets the destination whenever it is a register (and leaves it unchanged otherwise). The original CMOVcc also has the flaw of raising exceptions when the memory location it references cannot be accessed, even if the condition is false; fortunately, APX provides a variant (CFCMOVcc) that suppresses exceptions in such cases.

All these individual differences, with behaviour that changes from one instruction to the next, benefit neither the decoder that has to decode them nor the backend that has to execute them. The same happens when only some instructions are allowed to suppress flags generation while others are not. The result is greater implementation complexity, also at the expense of compilers (which must take into account and handle all these special cases).

Modifications to REX2 (to add NF)

So the next concrete, and extremely simple, change would be to make the NF (No Flags) bit available to all instructions ‘promoted’ by this new extension, instead of just a few.

In reality, all the improvements proposed in this article involve the complete removal of the concept of ‘promotion’, which currently applies only to certain instructions and led to the creation of map 4 under the EVEX prefix (as we saw in the first article). The idea is to allow all general-purpose instructions to take advantage of the new features introduced with APX.

In order to achieve this (while at the same time giving code density a helping hand), a trivial modification to the REX2 prefix is required, which currently has the following structure:

REX2 (2-byte REX)
Byte / Bit | 7  | 6  | 5  | 4  | 3 | 2  | 1  | 0
0 (0xD5)   | 1  | 1  | 0  | 1  | 0 | 1  | 0  | 1
1          | M0 | R4 | X4 | B4 | W | R3 | X3 | B3

Which, by adding the NF bit to signal the possible suppression of flags generation, becomes:

New REX2 (2-byte REX)
Byte / Bit     | 7  | 6  | 5  | 4  | 3 | 2  | 1  | 0
0 (0xD4, 0xD5) | 1  | 1  | 0  | 1  | 0 | 1  | 0  | M0
1              | NF | R4 | X4 | B4 | W | R3 | X3 | B3

Now we use not only the opcode of the old AAD instruction (D5 in hexadecimal, removed by x64 in 64-bit mode) but also that of AAM (D4). The least significant bit of the opcode takes over the role of M0, which frees the most significant bit (MSB) of the second byte for NF, with no penalty other than that of using REX2 itself, which occupies only two bytes (as opposed to EVEX, where four bytes would be needed!).
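As a rough illustration of this layout, the following Python sketch packs the proposed two-byte prefix. The function name encode_new_rex2 and its keyword arguments are my own, chosen only for this example; the bit positions follow the proposal (M0 in the least significant bit of the first byte, NF in the most significant bit of the second).

```python
def encode_new_rex2(m0, nf, r4, x4, b4, w, r3, x3, b3):
    """Pack the proposed REX2 prefix (illustrative helper, not a real API)."""
    bits = (m0, nf, r4, x4, b4, w, r3, x3, b3)
    if any(bit not in (0, 1) for bit in bits):
        raise ValueError("every field must be 0 or 1")
    # Byte 0: 1101010 followed by M0, i.e. 0xD4 (M0 = 0) or 0xD5 (M0 = 1).
    byte0 = 0xD4 | m0
    # Byte 1: NF R4 X4 B4 W R3 X3 B3, from bit 7 down to bit 0.
    byte1 = (
        (nf << 7) | (r4 << 6) | (x4 << 5) | (b4 << 4)
        | (w << 3) | (r3 << 2) | (x3 << 1) | b3
    )
    return bytes([byte0, byte1])


# Map 0, flags suppressed, 64-bit operand size, B4 set.
prefix = encode_new_rex2(m0=0, nf=1, r4=0, x4=0, b4=1, w=1, r3=0, x3=0, b3=0)
print(prefix.hex())
```

The second byte never changes shape: only the first byte differs between the two maps.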

Why NF has taken the place that M0 had in the original REX2 will become clearer later with the other prefixes, but I can anticipate that it keeps the format of the second byte exactly the same everywhere. The way the map is selected, on the other hand, differs from prefix to prefix (but this is the only variation).

New prefix REX3 (to add condition)

In the same vein, and as previously suggested, a condition could be applied to all general-purpose instructions, so that any of them can be totally ignored when the condition is not fulfilled, without any side effects (as explained at length above).

This modification is extremely important precisely because it supports the statement Intel made in the APX presentation: processor pipelines are becoming longer (and wider) as time goes by, and thus more susceptible to performance losses when the prediction of conditional jumps fails.

The solution I propose for this purpose is to introduce a new prefix, REX3, very similar to REX2 but with an additional byte that specifies the condition that must be fulfilled for the instruction to be executed. The format of the new prefix is as follows:

REX3 (3-byte REX)
Byte / Bit | 7  | 6  | 5  | 4  | 3   | 2   | 1   | 0
0 (0x1F)   | 0  | 0  | 0  | 1  | 1   | 1   | 1   | 1
1          | NF | R4 | X4 | B4 | W   | R3  | X3  | B3
2          | 0  | 0  | 0  | M0 | SC3 | SC2 | SC1 | SC0

where, as we have already seen in the first article (which set out the format of all the prefixes added or modified by APX), SC3..SC0 are four bits representing the code (modified, excluding the test for the parity bit P) of the condition used in conditional jumps, while NF is the No Flags bit already seen above with the new REX2 prefix.

The three bits set to 0 in the third byte, ahead of M0, leave room for more maps to be added (although using all of them for this purpose would give 16 maps, which would be too many) and/or for other features to be enabled by future extensions.

As can be seen, this new prefix (for which I have used opcode 1F, which corresponds to the legacy POP DS instruction) is quite simple, flexible, and easier to implement than EVEX. It also has the not inconsiderable advantage of occupying one byte less than EVEX, thus mitigating the impact on code density.
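To give an idea of how little logic the prefix needs, here is a minimal Python sketch that splits a REX3 prefix into its fields. The function name decode_rex3 and the dictionary keys are assumptions made for the example, and the three reserved bits of the third byte are simply rejected when set, which is only one possible policy.

```python
def decode_rex3(prefix):
    """Split a proposed REX3 prefix into named fields (illustrative only)."""
    if len(prefix) != 3 or prefix[0] != 0x1F:
        raise ValueError("not a REX3 prefix")
    byte1, byte2 = prefix[1], prefix[2]
    if byte2 & 0xE0:
        # Bits 7..5 of the third byte are reserved (0) in the proposal.
        raise ValueError("reserved bits must be 0")
    # Byte 1 has the same layout as in the new REX2: NF R4 X4 B4 W R3 X3 B3.
    names = ("NF", "R4", "X4", "B4", "W", "R3", "X3", "B3")
    fields = {name: (byte1 >> (7 - i)) & 1 for i, name in enumerate(names)}
    fields["M0"] = (byte2 >> 4) & 1  # map selector
    fields["SC"] = byte2 & 0x0F      # condition code SC3..SC0
    return fields


fields = decode_rex3(bytes([0x1F, 0x88, 0x14]))
print(fields["NF"], fields["W"], fields["M0"], fields["SC"])
```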

Taking advantage of REX3, it is also possible to (re)implement the new CCMP and CTEST instructions by exploiting opcodes 70-7F (map 0: the classic conditional jump instructions with an 8-bit offset) for the former and 80-8F (map 1: the less famous conditional jumps with a 16- or 32-bit offset) for the latter. The four least significant bits of the opcode specify the values of the OF, SF, ZF and CF fields, to be copied to the respective flags if the condition in REX3 is not met.

In this case the format of the instruction for CCMP becomes as follows:

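REX3 (3-byte REX) for CCMP
Byte / Bit | 7  | 6  | 5  | 4  | 3   | 2   | 1   | 0
0 (0x1F)   | 0  | 0  | 0  | 1  | 1   | 1   | 1   | 1
1          | NF | R4 | X4 | B4 | W   | R3  | X3  | B3
2          | 0  | 0  | 0  | 0  | SC3 | SC2 | SC1 | SC0
3          | 0  | 1  | 1  | 1  | OF  | SF  | ZF  | CF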

While for CTEST:

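REX3 (3-byte REX) for CTEST
Byte / Bit | 7  | 6  | 5  | 4  | 3   | 2   | 1   | 0
0 (0x1F)   | 0  | 0  | 0  | 1  | 1   | 1   | 1   | 1
1          | NF | R4 | X4 | B4 | W   | R3  | X3  | B3
2          | 0  | 0  | 0  | 1  | SC3 | SC2 | SC1 | SC0
3          | 1  | 0  | 0  | 0  | OF  | SF  | ZF  | CF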

Reusing the opcodes of the conditional jump instructions is certainly the best choice, because using the new REX3 to make conditional an instruction that is already conditional in itself would not make any sense. So we might as well reuse these opcodes, using the 4 bits of their condition to store the values of OF, SF, ZF and CF instead.

This is a very simple implementation, as can be seen: in the presence of the new REX3 prefix, a couple of trivial comparisons are enough to check whether we are in the special case of these two new instructions. It also has the advantage of occupying one byte less than the current solution using EVEX, thus improving code density.
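The couple of comparisons mentioned above could look roughly like the following Python sketch. It assumes the decoder has already extracted M0 from the REX3 prefix and read the opcode byte; the function name classify_rex3_opcode and the returned tuple are hypothetical.

```python
def classify_rex3_opcode(m0, opcode):
    """Detect the proposed CCMP/CTEST encodings behind a REX3 prefix.

    Returns (name, flags) for the two special cases, or None for any
    other opcode, which would simply be a conditionally executed
    instruction. Illustrative only.
    """
    high, low = opcode >> 4, opcode & 0x0F
    if m0 == 0 and high == 0x7:    # map 0, opcodes 70-7F
        name = "CCMP"
    elif m0 == 1 and high == 0x8:  # map 1, opcodes 80-8F
        name = "CTEST"
    else:
        return None
    # The low nibble carries the flag values used when the condition fails.
    flags = {
        "OF": (low >> 3) & 1,
        "SF": (low >> 2) & 1,
        "ZF": (low >> 1) & 1,
        "CF": low & 1,
    }
    return name, flags


print(classify_rex3_opcode(0, 0x75))
print(classify_rex3_opcode(1, 0x8A))
print(classify_rex3_opcode(0, 0x01))
```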

Changes to VEX3 (for new registers)

In this regard, code density could also be trivially improved for the instructions (AVX, AVX2) that use the VEX3 prefix, should they need to access the 16 general-purpose registers that APX has added, without having to resort to the longer (by one byte) and more complicated EVEX. VEX3 currently has the following format:

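VEX3 (3-byte VEX)
Byte / Bit | 7 | 6  | 5  | 4  | 3  | 2  | 1  | 0
0 (0xC4)   | 1 | 1  | 0  | 0  | 0  | 1  | 0  | 0
1          | R̅ | X̅  | B̅  | m4 | m3 | m2 | m1 | m0
2          | W | v̅3 | v̅2 | v̅1 | v̅0 | L  | p1 | p0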

whereas with my proposal it would become:

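New VEX3 (3-byte VEX)
Byte / Bit | 7  | 6  | 5  | 4  | 3  | 2  | 1  | 0
0 (0xC4)   | 1  | 1  | 0  | 0  | 0  | 1  | 0  | 0
1          | R̅3 | X̅3 | B̅3 | R4 | X4 | B4 | m1 | m0
2          | W  | v̅3 | v̅2 | v̅1 | v̅0 | L  | p1 | p0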

This reuses bits m4..m2 to add the 3 bits needed to specify the new registers. It would reduce the selectable opcode maps from 32 to just 4, but this would not be a big problem, for a couple of reasons.

The first is that there are currently only four maps for all instructions (and there is still room to add more), so none would be missing. The second is that the current trend is to use AVX-512 to extend the SIMD instruction set, and AVX-512 always uses the EVEX prefix, which supports up to 8 maps (so there is plenty of room to add another thousand instructions).
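A small Python sketch of how the second byte of the modified VEX3 could be packed follows. It rests on two assumptions of mine: that the R3, X3 and B3 bits stay stored inverted, as VEX does today, and that the new R4, X4 and B4 bits are stored as they are. The function name encode_new_vex3_byte1 is hypothetical.

```python
def encode_new_vex3_byte1(r3, x3, b3, r4, x4, b4, opcode_map):
    """Pack byte 1 of the proposed VEX3 layout (illustrative only).

    Assumed layout, bit 7 down to bit 0:
    ~R3 ~X3 ~B3 R4 X4 B4 m1 m0
    """
    if opcode_map not in range(4):
        raise ValueError("only maps 0..3 fit in m1 m0")
    bits = (r3, x3, b3, r4, x4, b4)
    if any(bit not in (0, 1) for bit in bits):
        raise ValueError("register bits must be 0 or 1")
    return (
        ((r3 ^ 1) << 7) | ((x3 ^ 1) << 6) | ((b3 ^ 1) << 5)  # stored inverted
        | (r4 << 4) | (x4 << 3) | (b4 << 2)                   # assumed non-inverted
        | opcode_map
    )


# No extended registers, map 1.
print(hex(encode_new_vex3_byte1(0, 0, 0, 0, 0, 0, 1)))
```

Under these assumptions, an instruction that uses none of the new registers gets the same byte it has with the current VEX3 (m4..m2 are 0 there for maps 0 to 3), so existing encodings would keep their meaning.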

New prefixes REXM0 and REXM1 to eliminate EVEX

With a similar approach, modelled on the REX3 prefix that I proposed just above, one could avoid using EVEX altogether to ‘promote’ instructions from binary to ternary and from unary to binary. EVEX makes this possible thanks to the new ND bit (which, set to 1, enables this functionality) and the v̅4..v̅0 field, which specifies the register used to store the result of the operation.

In this case, it would be a matter of reusing some opcodes that x64 has freed (by removing some legacy x86 instructions) to add the following two prefixes:

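REXM0 (3-byte REX with NDD, for map 0)
Byte / Bit     | 7    | 6    | 5    | 4    | 3   | 2   | 1   | 0
0 (0x06, 0x16) | 0    | 0    | 0    | NDD4 | 0   | 1   | 1   | 0
1              | NF   | R4   | X4   | B4   | W   | R3  | X3  | B3
2              | NDD3 | NDD2 | NDD1 | NDD0 | SC3 | SC2 | SC1 | SC0

REXM1 (3-byte REX with NDD, for map 1)
Byte / Bit     | 7    | 6    | 5    | 4    | 3   | 2   | 1   | 0
0 (0x0E, 0x1E) | 0    | 0    | 0    | NDD4 | 1   | 1   | 1   | 0
1              | NF   | R4   | X4   | B4   | W   | R3  | X3  | B3
2              | NDD3 | NDD2 | NDD1 | NDD0 | SC3 | SC2 | SC1 | SC0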

As can be seen, the two new prefixes (using opcodes 06, 16, 0E and 1E, corresponding to the old PUSH ES, PUSH SS, PUSH CS, PUSH DS instructions) REXM0 and REXM1 are very similar to REX3, but with some slight differences.

Firstly, the destination register (NDD) can be specified via the new NDD4..NDD0 bits (without having to set the ND bit, which is implicit). Secondly, the M0 bit has disappeared to make way for NDD0, as map 0 or map 1 is now selected by the prefix itself (REXM0 for map 0 and REXM1 for map 1). Similarly, if needed, other prefixes could be added to support new maps (there are still enough legacy instruction opcodes free in x64).
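The way NDD4 folds into the opcode byte can be sketched in a few lines of Python. The helper name encode_rexm and its parameters are made up for this example; the second byte is passed in already assembled (NF R4 X4 B4 W R3 X3 B3), since its layout is the same as in the other prefixes.

```python
# Base opcodes with NDD4 = 0: old PUSH ES for map 0, old PUSH CS for map 1.
REXM_BASE = {0: 0x06, 1: 0x0E}


def encode_rexm(opcode_map, ndd, sc, byte1):
    """Pack a proposed REXM0/REXM1 prefix (illustrative only)."""
    if opcode_map not in REXM_BASE:
        raise ValueError("only map 0 (REXM0) and map 1 (REXM1) exist")
    if not 0 <= ndd < 32:
        raise ValueError("NDD selects one of 32 registers")
    if not 0 <= sc < 16:
        raise ValueError("the condition code is 4 bits wide")
    if not 0 <= byte1 < 256:
        raise ValueError("byte 1 must fit in 8 bits")
    # NDD4 lands in bit 4 of the opcode: 0x06/0x16 or 0x0E/0x1E.
    byte0 = REXM_BASE[opcode_map] | ((ndd >> 4) << 4)
    # NDD3..NDD0 fill the high nibble of byte 2, SC3..SC0 the low nibble.
    byte2 = ((ndd & 0x0F) << 4) | sc
    return bytes([byte0, byte1, byte2])


# Map 0, destination register 17, condition code 4, W set in byte 1.
print(encode_rexm(0, 17, 4, 0x08).hex())
```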

It should be emphasised that these two prefixes do not need to implement the new CCMP and CTEST instructions as well, since there is no use of the new target register in this case (there is no result to store: they are just flags-altering instructions). Their implementation using only REX3 is therefore sufficient, as explained above.

These two new prefixes are shorter (by one byte) than EVEX, thus limiting the damage to code density caused by using such long prefixes, but they also have the added advantage of making conditional any general-purpose instruction that has been extended to ternary or binary.

For example:

; Add 1234567890 to the 64-bit value from memory and save it to RAX if the zero flag (Z) is set.
ADD.Z RAX,[RBX + RCX * 8 + 1234],1234567890

Its operation, as well as its potential, should be self-evident. The particular point to note is that the instruction would not generate any exception if the condition were not met and the memory location were inaccessible.
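The intended behaviour of the example can be modelled with a short Python sketch. This is only a toy model under my own assumptions: registers and memory are plain dictionaries, a missing key stands in for an inaccessible location, flags are not modelled, and the name add_if_zero is invented for the occasion.

```python
MASK64 = (1 << 64) - 1


def add_if_zero(regs, memory, zf, dest, address, imm):
    """Toy model of ADD.Z dest,[address],imm (illustrative only)."""
    if not zf:
        # Condition not met: no load, no fault, no register write.
        return False
    # Condition met: a missing address raises KeyError, standing in for a fault.
    regs[dest] = (memory[address] + imm) & MASK64
    return True


regs = {"RAX": 7, "RBX": 0x1000, "RCX": 2}
memory = {}  # nothing is mapped, so any load would "fault"
address = regs["RBX"] + regs["RCX"] * 8 + 1234

executed = add_if_zero(regs, memory, zf=False, dest="RAX",
                       address=address, imm=1234567890)
print(executed, regs["RAX"])
```

With the zero flag clear, the call returns without touching memory, so RAX keeps its old value and no error is raised.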

Finally, REXM0 and REXM1 are also much simpler to implement (the mechanism is similar to that of REX2 and REX3, which in turn are similar to REX) than the enormously complicated new EVEX prefix.

Changes to EVEX (for new registers)

EVEX, having now become completely unnecessary for the ‘promotion’ of general-purpose instructions, only requires the trivial addition of the 3 bits that address the new APX registers, as already proposed for VEX3. Its new format would therefore be:

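Byte / Bit   | 7  | 6  | 5  | 4  | 3  | 2  | 1  | 0  | Payload
Byte 0 (62h) | 0  | 1  | 1  | 0  | 0  | 0  | 1  | 0  |
Byte 1 (P0)  | R̅3 | X̅3 | B̅3 | R̅4 | B4 | m2 | m1 | m0 | P[7:0]
Byte 2 (P1)  | W  | v̅3 | v̅2 | v̅1 | v̅0 | X̅4 | p1 | p0 | P[15:8]
Byte 3 (P2)  | z  | L’ | L  | b  | v̅4 | a2 | a1 | a0 | P[23:16]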

and would continue to function exactly as now: exclusively for AVX-512 instructions.

Summary of the proposed changes

Coming to a close, I think it is appropriate to recapitulate the benefits of the proposed changes to APX:

  • simplified implementation (and, consequently, fewer transistors and lower power consumption);
  • less impact on code density (25% to 50% less space occupied by the new prefixes, compared to the use of EVEX, for both general-purpose and AVX/VEX3 instructions), which in turn translates into lower consumption (less pressure on caches and, in general, on the entire memory hierarchy);
  • all general-purpose instructions that modify flags can suppress their generation (the use of NF becomes orthogonal);
  • all general-purpose instructions become conditional (with simplification of both the compilers and the execution pipeline, which now only has to decide whether or not to commit their results).
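The 25% to 50% figure in the second point can be checked with a few lines of arithmetic. The prefix lengths are the ones given in this article (four bytes for EVEX, two for REX2, three for REX3, REXM0, REXM1 and VEX3); the variable names are only illustrative.

```python
EVEX_LEN = 4  # bytes taken by the EVEX prefix

# Lengths in bytes of the prefixes proposed (or modified) in this article.
PROPOSED = {"REX2": 2, "REX3": 3, "REXM0/REXM1": 3, "VEX3": 3}

# Percentage of prefix bytes saved compared with EVEX.
savings = {
    name: 100 * (EVEX_LEN - length) // EVEX_LEN
    for name, length in PROPOSED.items()
}

for name, percent in savings.items():
    print(f"{name}: {percent}% shorter than EVEX")
```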

The advantages of these solutions should be obvious: they provide the same new functionality as APX, plus the not inconsiderable possibility of conditionally executing all general-purpose instructions (a new feature, therefore, on top of what APX offers).

Finally, it should be noted that the new prefixes are designed so that the innovations can be adopted incrementally. The new REX2 offers, as a basic feature, access to the new registers and suppression of flags generation (NF). On top of that, REX3 adds the possibility of specifying the condition for instruction execution. REXM0 and REXM1 add, on top of that again, the new target register (NDD). All in a simple and ‘compiler-friendly’ manner.

The next article will be the last and will report the conclusions regarding APX.
