With the analysis of the pros and cons of
APX's features complete, we now move on to a series of analyses and reflections, starting with the familiar topic of code density, which is an extremely important factor.
This is because the space occupied in memory by instructions has implications for the entire memory hierarchy and, therefore, directly affects performance. The subject is complex (and has been on the academic and industrial agendas for a very long time), and it only takes a trivial search to realise how much material has been written on the subject, but I will quote below a summary from the thesis of one of the RISC-V designers to show how important this aspect is:
Waterman shows that RVC fetches 25%-30% fewer instruction bits, which reduces instruction cache misses by 20%-25%, or roughly the same performance impact as doubling the instruction cache size.
The highlighted parts (especially the last one) should be quite telling and, although they relate only to RISC-V, similar results can be found across all architectures, as the concept and the issues are general. They are not, however, directly relevant to this series of articles, so I will limit myself to reporting my observations regarding x64 and APX.
Intel claims that, with APX enabled, code density is ‘similar’ to x64 (which itself is not that brilliant!), based on preliminary results from compiling the aforementioned SPEC2017 test suite.
SPEC2017 test suite. If this were really confirmed, it would certainly be quite a coup (it would mean that all the innovations introduced have compensated for the considerable increase in instruction size).
At the moment, however, I harbour doubts about this, not least because Intel has not committed itself: it has not claimed that density is equal, slightly worse or slightly better, but only ‘similar’. This climate of uncertainty therefore deserves at least some consideration, by bringing in some numbers until the ‘official’ ones arrive.
Let us start immediately with an element that is certainly known to worsen code density: the new PUSH2 and POP2 instructions. Their encoding requires the use of the EVEX prefix, so four bytes are needed for that alone, plus one for the opcode and, finally, another for ModR/M, which would normally serve to reference memory (in reality, only the configuration specifying a register, rather than a location, is used). Total: six bytes (minimum).
For comparison, a classic POP instruction requires one or two bytes (depending on whether an original x86 register or a new x64 register is used). A pair of these therefore requires two to four bytes: in any case, much less than six. The deterioration brought by the new instructions is, in this case, very obvious (moreover, and as already mentioned, these instructions are used a lot).
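To make the comparison concrete, here is a tiny Python sketch using the byte counts estimated in the text (a simplified model, not a real instruction encoder):

```python
# Byte counts as estimated in the text (a simplified model, not an encoder).
POP2 = 4 + 1 + 1        # EVEX prefix + opcode + ModR/M = 6 bytes (minimum)
POP_X86 = 1             # POP with an original x86 register
POP_X64 = 2             # POP with a new x64 register (REX prefix needed)

pair_best = 2 * POP_X86     # two classic POPs, legacy registers only
pair_worst = 2 * POP_X64    # two classic POPs, new x64 registers

print(f"POP2: {POP2} bytes; POP pair: {pair_best}-{pair_worst} bytes")
# Under these estimates, POP2 always loses by 2-4 bytes against a POP pair.
```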
The new registers (using REX2)
Using at least one of the new registers requires the REX2 prefix, which alone takes up two bytes and which must be used every time such a register is referenced in an instruction. Conversely, emulating their operation with x64 would require a varying number of bytes, depending on the scenario.
For example, if a register were needed temporarily for certain operations, its current value would first have to be stored somewhere, and the stack is the most convenient and suitable place. A PUSH would store it, and a POP would restore it once the operations are complete. We know that the cost of each would vary from one to two bytes, so in total two to four bytes would be needed: at best we would break even with the use of REX2, but in the worst case we would double the cost.
But the advantage of x64 is that the operations performed between the PUSH and the POP would not require REX2 but, at most, REX (to reference the new x64 registers), which occupies only one byte; executing two or more instructions would therefore absorb the cost of the PUSH/POP pair, and at some point we would come out ahead in terms of space used. Whereas, as already mentioned, all instructions using the new APX registers would always require REX2, constantly paying two bytes each time.
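This break-even reasoning can be captured in a small model (again using the article's own byte estimates, not real encodings; n is the number of instructions that need the extra register):

```python
def x64_overhead(n, new_x64_reg=False):
    """Bytes of overhead when freeing a register via PUSH/POP (x64 route).

    PUSH and POP cost 1 byte each for an original x86 register, 2 bytes
    each (REX) for a new x64 register; each of the n instructions then
    pays at most 1 extra byte (REX) when an x64 register is referenced.
    """
    push_pop = 4 if new_x64_reg else 2
    per_insn = 1 if new_x64_reg else 0
    return push_pop + per_insn * n

def apx_overhead(n):
    """Every instruction touching a new APX register carries a 2-byte REX2."""
    return 2 * n

# With a legacy register, the fixed 2-byte PUSH/POP cost is already
# amortised from the second instruction onwards:
for n in (1, 2, 3, 5):
    print(n, x64_overhead(n), apx_overhead(n))
```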
Another scenario to emulate the operation of the new
APX registers would be to use the stack as a sort of bank of additional registers, directly referencing precise locations (e.g.
[SP+16] to emulate the new
R18, and so on).
In this case the costs would differ depending on the use. For example, copying from/to the stack requires a MOV instruction, which takes up three bytes (opcode + ModR/M + 8-bit offset) if only original x86 registers are used, and four bytes (the REX prefix is required) if at least one x64 register is involved.
Whereas an equivalent MOV using one of the new registers made available by APX will always need REX2, but not the 8-bit offset, so it will always require four bytes. In this case the stack-based solution (x64) would be more advantageous: one byte could be saved while, in the worst case, it would occupy the same space.
It would be a different matter if the new register were to be used in instructions operating only on registers. In this case the stack-based versions would always have to specify the 8-bit offset in ModR/M, while the APX equivalents would be forced to use REX2. The advantage would clearly lie with the stack solution, because a byte would always be saved.
If, on the other hand, the new register were to be used in instructions that also reference a memory location, then the stack solution would be far less efficient and take up much more space: it would be necessary to find a free register (whose value would first have to be stored away), load the value from the stack into it, perform the required operation, and finally restore the register that had been borrowed.
It would then take 2-4 bytes to ‘free’ the required register and 3-4 bytes to copy the value from the stack into it, whereas using the new register with APX requires only the two extra bytes of REX2. Furthermore, if the final value also had to be stored back on the stack, another instruction (3-4 bytes) would be needed for the purpose. The price to pay in such cases would be very steep!
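Lining up the three stack-as-register-bank scenarios numerically (illustrative overheads taken from the estimates above, not exact encodings):

```python
# Extra bytes paid per scenario by each solution (figures from the text).
REX2 = 2        # APX: every access to a new register costs a 2-byte REX2
OFFSET8 = 1     # stack: a stack operand adds an 8-bit offset to ModR/M

# 1) Plain copy: MOV to/from the stack vs an APX MOV.
stack_mov = (3, 4)      # x86-only registers / at least one x64 register
apx_mov = 4             # REX2 + opcode + ModR/M, no offset needed

# 2) Register-only instructions: stack pays the offset, APX pays REX2.
stack_regonly, apx_regonly = OFFSET8, REX2

# 3) Instructions that also reference memory: the stack solution must
#    free a register (2-4 bytes) and load the value from the stack
#    (3-4 bytes), while APX only pays the REX2.
stack_mem_min = 2 + 3
apx_mem = REX2

print(stack_mov, apx_mov, stack_regonly, apx_regonly, stack_mem_min, apx_mem)
```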
A hybrid solution between the two (as well as the preferable one) would be to PUSH the register to be used onto the stack, and then use it while taking into account where its saved value now sits within the stack. Eventually, when no longer useful, a POP would restore the contents of the temporarily borrowed register. This is a technique I used in the first half of the 1990s, when I tried my hand at writing an 80186 emulator for Amiga systems equipped with a 68020 processor (or higher); it has the virtue of combining the two previous scenarios, taking the best of each (minimising the cost of preserving and restoring the previous value on/from the stack).
As can be seen, the advantage goes to one solution or the other depending on the specific scenario, and on how, and how much, the new registers are used.
NF (No Flags. Using EVEX)
Turning to the NF functionality, used to suppress the generation of flags: it requires the use of the
EVEX prefix, which means that four bytes must always be added to the length of the instruction, except when it resides in map
1 (i.e. with a
0F prefix): in this case the
0F prefix is already incorporated in
EVEX, so the additional bytes would be reduced to three.
The increase in instruction length would therefore be decidedly substantial, if we consider that, to emulate this behaviour with x64, it would be sufficient to save the flags on the stack with the PUSHF instruction and restore them once the operations are finished (i.e. at the point where the flags actually need to be checked or used) with the POPF instruction. Total cost: just two bytes.
If we also consider that the block of instructions to be executed could change flags several times and, therefore, that it would always be necessary to use
NF, the cost for
APX would increase even more (whereas for
x64 it always remains fixed at two bytes).
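To see how quickly the EVEX overhead of NF adds up against the fixed PUSHF/POPF pair, a small model (byte counts from the text; n is the number of instructions in the block that must suppress flags):

```python
def nf_overhead(n, map1=False):
    """APX: each instruction that must suppress flags pays the EVEX prefix.

    EVEX adds 4 bytes, reduced to 3 for map-1 instructions (the 0F
    prefix is already incorporated in EVEX).
    """
    return (3 if map1 else 4) * n

PUSHF_POPF = 2  # x64 emulation: PUSHF (1 byte) + POPF (1 byte), fixed cost

for n in (1, 2, 4):
    print(n, nf_overhead(n), PUSHF_POPF)
# Even a single NF-prefixed instruction costs more than the whole
# PUSHF/POPF pair, and the gap grows linearly with n.
```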
The advantage of the current solution (x64), therefore, is considerable in terms of better code density compared with NF.
NDD (New Data Destination. Using EVEX)
The last feature to be considered, and one which has a major influence on code density, is NDD which, as we have already seen, allows binary instructions to become ternary, and unary instructions to become binary, giving the possibility of using a register as destination (with the current two operands in x64 both acting, at this point, as data sources).
Such instructions always require the use of the EVEX prefix, so the same considerations as for NF apply: four extra bytes are needed, except for instructions in map 1 (which require three), to which the opcode byte and the ModR/M byte are then added. In total, therefore, (at least) six bytes would always be needed.
To emulate this with x64, an extra instruction would always be required: a MOV copying the value of the first source into the (new) destination register. As we have seen, this requires 2-3 bytes. Then comes the instruction that performs the actual operation, which in turn requires 2-3 bytes. So a total of 4-6 bytes would be needed.
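The same kind of back-of-the-envelope model summarises the comparison (byte counts as estimated in the text):

```python
NDD_MIN = 4 + 1 + 1   # EVEX + opcode + ModR/M = 6 bytes (minimum)

def x64_ndd_emulation(mov_needs_rex=False, op_needs_rex=False):
    """x64 emulation of NDD: a MOV copying the first source into the
    destination (2-3 bytes), then the actual operation (2-3 bytes)."""
    mov = 3 if mov_needs_rex else 2
    op = 3 if op_needs_rex else 2
    return mov + op

best = x64_ndd_emulation()               # legacy registers only
worst = x64_ndd_emulation(True, True)    # REX needed on both instructions
print(NDD_MIN, best, worst)
# Under these estimates, the x64 emulation is never worse than NDD
# and can save up to 2 bytes.
```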
From these calculations I have excluded, to simplify things a bit, any offsets and/or immediate values, as they are invariant between the two solutions (they occupy exactly the same number of bytes).
What remains is the real difference between APX and x64 when it comes to executing a ternary or binary instruction (emulated, in the case of x64), and you can see how x64 would be more efficient or, at most, would cost the same.
Trying to draw some conclusions: if we take the individual new functionalities of APX, my opinion is that their use will, on average, have a decidedly negative impact on code density, which will decrease compared to x64 (which, as already mentioned, is certainly not in good shape in this respect compared to other architectures, including the x86 from which it derives).
On the other hand, having the possibility of combining/using several of these functionalities in the same instruction would bring advantages (the savings would accumulate). But I do not believe that such scenarios are common enough to heavily influence code density or, in my opinion, to bring APX, overall, even close to parity with x64.
We will see concretely in the future, when the first processors with APX and the corresponding binaries are released; what I have provided for the moment are personal evaluations, dictated only by my experience in this area and by my analysis of the costs of using the new features of this extension.
The next article will deal with the subject of
APX implementation costs.