17. DISCUSSION OF SPECIAL INSTRUCTIONS
======================================
17.1 TEST
---------
The TEST instruction with an immediate operand is only pairable if the
destination is AL, AX, or EAX, and only if it is coded in a certain way.
TEST register,register and TEST register,memory are always pairable.
TEST EAX,immediate can be coded in two ways:
a. Two bytes instruction code (0F7H, 0C0H) + 4 bytes data: not pairable
b. One byte instruction code (0A9H) + 4 bytes data: pairable
Unlike ADD, CMP, etc., the TEST instruction has no form with a sign extended
one-byte immediate, so a small constant does not give a shorter encoding.
An assembler will normally choose the shortest form of an instruction, which
is form b. If your assembler emits form a anyway, you can hard code form b,
e.g.:
DB 0A9H / DD const
or change it to TEST AL,const if possible.
For the destinations AL, AX, and EAX the shortest form is thus also the
pairable form.
Examples:
TEST ECX,ECX ; pairable
TEST [mem],EBX ; pairable
TEST EDX,256 ; not pairable
TEST DWORD PTR [EBX],8000H ; not pairable
To make it pairable, use any of the following methods:
MOV EAX,[EBX] / TEST EAX,8000H
MOV EDX,[EBX] / AND EDX,8000H
MOV AL,[EBX+1] / TEST AL,80H
MOV AL,[EBX+1] / TEST AL,AL ; (result in sign flag)
It is also possible to test a bit by shifting it into the carry flag:
MOV EAX,[EBX] / SHR EAX,16 ; (result in carry flag)
but this method has a penalty on the PentiumPro when the shift count is more
than one.
(The reason for this non-pairability is probably that the first byte of the
2-byte instruction is the same as for some other non-pairable instructions,
and the Pentium cannot afford to check the second byte too when determining
pairability.)
17.2 WAIT
---------
You can often increase speed by omitting the WAIT instruction.
The WAIT instruction has three functions:
a. The 8087 processor requires a WAIT before _every_ floating point
   instruction.
b. WAIT is used to coordinate memory access between the floating point
   unit and the integer unit. Examples:
   b.1.  FISTP [mem32]
         WAIT               ; wait for f.p. unit to write before..
         MOV EAX,[mem32]    ; reading the result with the integer unit
   b.2.  FILD [mem32]
         WAIT               ; wait for f.p. unit to read value before..
         MOV [mem32],EAX    ; overwriting it with integer unit
   b.3.  FLD DWORD PTR [ESP]
         WAIT               ; prevent an accidental hardware interrupt from..
         ADD ESP,4          ; overwriting value on stack before it is read
c. WAIT is sometimes used to check for exceptions. It will generate an
   interrupt if there is an unmasked exception bit in the f.p. status word
   set by a preceding floating point instruction.
Regarding a:
The function in point a is never needed on any processor other than the old
8087. Unless you want your code to be compatible with the 8087, you should
tell your assembler not to insert these WAITs by specifying a later
processor.
Regarding b:
WAIT instructions to coordinate memory access are definitely needed on the
8087 and 80287. A superscalar processor like the Pentium has special
circuitry to detect memory conflicts so you wouldn't need the WAIT for this
purpose on code that only runs on a Pentium or higher. I have made some
tests on other Intel processors and not been able to provoke any error by
omitting the WAIT on any 32 bit Intel processor, although Intel manuals say
that the WAIT is needed for this purpose except after FNSTSW and FNSTCW. If
you want to be certain that your code will work on any 32 bit processor
(including non-Intel processors) then I would recommend that you include the
WAIT here in order to be safe.
Regarding c:
The assembler automatically inserts a WAIT for this purpose before the
following instructions:
FCLEX, FINIT, FSAVE, FSTCW, FSTENV, FSTSW
You can omit the WAIT by writing FNCLEX, etc. My tests show that the WAIT is
unnecessary in most cases because these instructions without WAIT will
still generate an interrupt on exceptions except for FNCLEX and FNINIT on
the 80387. (There is some inconsistency about whether the IRET from the
interrupt points to the FN.. instruction or to the next instruction.)
Almost all other floating point instructions will also generate an interrupt
if a previous floating point instruction has set an unmasked exception bit,
so the exception is likely to be detected sooner or later anyway.
You may still need the WAIT if you want to know exactly where an exception
occurred in order to recover from the situation. Consider, for example, the
code under b.3 above: If you want to be able to recover from an exception
generated by the FLD here, then you need the WAIT because an interrupt after
ADD ESP,4 would overwrite the value to load.
17.3 FCOM + FSTSW AX
--------------------
The usual way of doing floating point comparisons is:
FLD [a]
FCOMP [b]
FSTSW AX
SAHF
JB ASmallerThanB
You may improve this code by using FNSTSW AX rather than FSTSW AX and
testing AH directly rather than using the non-pairable SAHF.
(TASM version 3.0 has a bug with the FNSTSW AX instruction.)
FLD [a]
FCOMP [b]
FNSTSW AX
SHR AH,1
JC ASmallerThanB
Testing for zero or equality:
FTST
FNSTSW AX
AND AH,40H
JNZ IsZero ; (the zero flag is inverted!)
Test if greater:
FLD [a]
FCOMP [b]
FNSTSW AX
AND AH,41H
JZ AGreaterThanB
Do not use TEST AH,41H as it is not pairable. Do not use TEST EAX,4100H as
it would produce a partial register stall on the PentiumPro. Do not test the
flags after multibit shifts, as this has a penalty on the PentiumPro.
It is often faster to use integer instructions for comparing floating point
values, as described in paragraph 18 below.
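The masks 01H, 40H, and 41H used above select the condition bits C0 and C3
of the status word as they appear in AH. As a rough illustration only, the
following C sketch decodes the same byte. The names classify_fcom and
fp_relation are made up for this example, and the unordered case (C2 set,
which FCOM reports when an operand is a NaN) is not treated in the text
above.

```c
#include <stdint.h>

/* Condition bits of the FPU status word as seen in AH after FNSTSW AX. */
#define FPU_C0 0x01u   /* bit 8 of the status word  */
#define FPU_C2 0x04u   /* bit 10 of the status word */
#define FPU_C3 0x40u   /* bit 14 of the status word */

/* Hypothetical result type for this sketch. */
typedef enum { FP_GREATER, FP_LESS, FP_EQUAL, FP_UNORDERED } fp_relation;

/* Interpret AH after "FLD [a] / FCOMP [b] / FNSTSW AX".
   Bits other than C0, C2 and C3 are ignored. */
static fp_relation classify_fcom(uint8_t ah)
{
    if (ah & FPU_C2)
        return FP_UNORDERED;                /* C3 = C2 = C0 = 1          */
    if ((ah & (FPU_C3 | FPU_C0)) == 0)
        return FP_GREATER;                  /* AND AH,41H / JZ           */
    if (ah & FPU_C0)
        return FP_LESS;                     /* SHR AH,1 / JC             */
    return FP_EQUAL;                        /* AND AH,40H / JNZ          */
}
```

The order of the tests matters: an unordered result sets C0 and C3 as well,
so it has to be filtered out before the other masks are applied.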
17.4 LEA
--------
The LEA instruction is useful for many purposes because it can do a shift,
two additions, and a move in just one pairable instruction taking one clock
cycle. Example:
LEA EAX,[EBX+8*ECX-1000]
is much faster than
MOV EAX,ECX / SHL EAX,3 / ADD EAX,EBX / SUB EAX,1000
The LEA instruction can also be used to do an add or shift without changing
the flags. The source and destination need not have the same word size, so
LEA EAX,[BX] is a useful replacement for MOVZX EAX,BX.
You must be aware, however, that the LEA instruction will suffer an AGI
stall if it uses a base or index register which has been changed in the
preceding clock cycle.
Since the LEA instruction is pairable in the V-pipe and shift instructions
are not, you may use LEA as a substitute for a SHL by 1, 2, or 3 if you want
the instruction to execute in the V-pipe.
The 32 bit processors have no documented addressing mode with a scaled index
register and nothing else, so an instruction like LEA EAX,[EAX*2] is
actually coded as LEA EAX,[EAX*2+00000000] with an immediate displacement
of 4 bytes. You may reduce the instruction size by instead writing LEA
EAX,[EAX+EAX] or even better ADD EAX,EAX. The latter code cannot have an
AGI delay. If you happen to have a register which is zero (like a loop
counter after a loop), then you may use it as a base register to reduce the
code size:
LEA EAX,[EBX*4] ; 7 bytes
LEA EAX,[ECX+EBX*4] ; 3 bytes
17.5 integer multiplication
---------------------------
An integer multiplication takes approximately 9 clock cycles. It is
therefore advantageous to replace a multiplication by a constant with a
combination of other instructions such as SHL, ADD, SUB, and LEA. Example:
IMUL EAX,10
can be replaced with
MOV EBX,EAX / ADD EAX,EAX / SHL EBX,3 / ADD EAX,EBX
or
LEA EAX,[EAX+4*EAX] / ADD EAX,EAX
Floating point multiplication is faster than integer multiplication on a
Pentium without MMX, but the time used to convert integers to float and
convert the product back again is usually more than the time saved by using
floating point multiplication, except when the number of conversions is low
compared with the number of multiplications.
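The two replacements for IMUL EAX,10 can be checked with a small C model.
This is only a sketch of the arithmetic, with the register names used as
variable names; uint32_t wraps around modulo 2^32 just like a 32 bit
register.

```c
#include <stdint.h>

/* MOV EBX,EAX / ADD EAX,EAX / SHL EBX,3 / ADD EAX,EBX */
static uint32_t mul10_shift_add(uint32_t x)
{
    uint32_t ebx = x;
    uint32_t eax = x + x;        /* 2*x */
    ebx <<= 3;                   /* 8*x */
    return eax + ebx;            /* 2*x + 8*x = 10*x */
}

/* LEA EAX,[EAX+4*EAX] / ADD EAX,EAX */
static uint32_t mul10_lea(uint32_t x)
{
    uint32_t eax = x + 4u * x;   /* 5*x */
    return eax + eax;            /* 10*x */
}
```

Both functions return the low 32 bits of 10*x for every input, which is the
same value that the two-operand IMUL leaves in EAX.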
17.6 division
-------------
Division is quite time consuming. The DIV instruction takes 17, 25, or 41
clock cycles for byte, word, and dword divisors respectively. The IDIV
instruction takes 5 clock cycles more. It is therefore preferable to use the
smallest operand size possible that won't generate an overflow, even if it
costs an operand size prefix, and use unsigned division if possible.
Unsigned division by a power of two can be done with SHR. Division of a
signed number by a power of two can be done with SAR, but the result with
SAR is rounded towards minus infinity, whereas the result with IDIV is
truncated towards zero.
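The difference in rounding only shows up for negative numbers that are not
exact multiples of the divisor. The C sketch below models SAR in a portable
way and shows one common correction, which is not part of the text above:
adding 2^count - 1 to a negative number before the shift gives the same
result as IDIV. The function names are chosen for this example, and count is
assumed to be in the range 0 to 30.

```c
#include <stdint.h>

/* Model of SAR: arithmetic shift right, rounds towards minus infinity.
   (A plain >> on a negative signed value is implementation-defined in C,
   so the negative case is done on the complemented, non-negative value.) */
static int32_t sar(int32_t x, unsigned count)
{
    return x < 0 ? ~(~x >> count) : x >> count;
}

/* Same result as IDIV by 2^count: bias negative numbers before shifting. */
static int32_t div_pow2_trunc(int32_t x, unsigned count)
{
    int32_t bias = x < 0 ? (int32_t)((1u << count) - 1u) : 0;
    return sar(x + bias, count);
}
```

For example, -7 shifted right by one gives -4, while -7 divided by 2 with
IDIV gives -3; with the bias both methods give -3.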
Floating point division takes 39 clock cycles. It is possible to do a
floating point division and an integer division in parallel to save time.
Example: A = A1 / A2; B = B1 / B2
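FILD [B1]         ; B1 and B2 are loaded as signed integers
FILD [B2]
MOV EAX,[A1]
MOV EBX,[A2]
CDQ               ; sign extend A1 into EDX:EAX (EDX = 0 only if A1 < 2^31)
FDIV              ; start B1 / B2 in the floating point unit
DIV EBX           ; A1 / A2 overlaps the FDIV (use IDIV if A1 or A2 may be negative)
FISTP [B]
MOV [A],EAX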
(make sure you set the floating point unit to the desired rounding method)
Obviously, you should always try to minimize the number of divisions. For
example: if (A/B > C)... can be rewritten as if (A > B*C)... when B is
positive, and the opposite when B is negative.
A/B + C/D can be rewritten as (A*D + C*B) / (B*D)
If you are using integer division, then you should be aware that the
rounding errors may be different when you rewrite the formulas.
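As a small illustration of both points, the C sketch below rewrites
A/B > C without a division and shows an integer case where the two forms
disagree. The function names are invented for this example, B is assumed to
be nonzero, and the floating point version can still differ from the
original in borderline cases because B*C is rounded differently from A/B.

```c
#include <stdbool.h>

/* Floating point: A/B > C without the division. The comparison is
   reversed when B is negative. B must not be zero. */
static bool quotient_greater(double a, double b, double c)
{
    return b > 0 ? a > b * c : a < b * c;
}

/* Integer division truncates, so these two forms are not equivalent. */
static bool int_with_division(int a, int b, int c)
{
    return a / b > c;
}

static bool int_without_division(int a, int b, int c)
{
    return a > b * c;
}
```

With A = 7, B = 2, C = 3 the integer division gives 7/2 = 3, which is not
greater than 3, whereas 7 > 2*3 is true.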
17.7 string instructions
------------------------
String instructions without a repeat prefix are too slow, and should always
be replaced by simpler instructions. The same applies to LOOP and JECXZ.
String instructions with repeat may be optimal. Always use the dword version
if possible, and make sure that both source and destination are aligned by 4.
REP MOVSD is the fastest way to move blocks of data when the destination is
in the cache. See section 19 for an alternative.
REP STOSD is optimal when the destination is in the cache.
REP LODS, REP SCAS, and REP CMPS are not optimal, and may be replaced by
loops. See section 16 example 10 for an alternative to REP SCASB.
17.8 XCHG
---------
The XCHG register,memory instruction is dangerous. By default this
instruction has an implicit LOCK prefix which prevents it from using the
cache. The instruction is therefore very time consuming, and should always
be avoided.
17.9 rotates through carry
--------------------------
RCR and RCL with a count different from one are slow and should be avoided.
17.10 bit scan
--------------
BSF and BSR are the most poorly optimized instructions on the Pentium, taking
approximately 11 + 2*n clock cycles, where n is the number of zeros skipped.
(On later processors they take only 1 clock cycle.)
The following code emulates BSF ECX,EAX:
        TEST    EAX,EAX
        JZ      SHORT BS6
        PUSH    EAX
        XOR     ECX,ECX
        TEST    EAX,0FFFFH      ; (only pairable if register is EAX)
        JNZ     SHORT BS1
        SHR     EAX,16
        ADD     ECX,16
BS1:    TEST    AL,AL
        JNZ     SHORT BS2
        MOV     AL,AH
        ADD     ECX,8
BS2:    TEST    AL,0FH
        JNZ     SHORT BS3
        SHR     AL,4
        ADD     ECX,4
BS3:    TEST    AL,3
        JNZ     SHORT BS4
        SHR     AL,2
        ADD     ECX,2
BS4:    TEST    AL,1
        JNZ     SHORT BS5
        INC     ECX
BS5:    POP     EAX
BS6:
The following code emulates BSR ECX,EAX:
        TEST    EAX,EAX
        JZ      SHORT BS7
        MOV     DWORD PTR [TEMP],EAX
        MOV     DWORD PTR [TEMP+4],0
        FILD    QWORD PTR [TEMP]
        FSTP    QWORD PTR [TEMP]
        WAIT                    ; WAIT only needed for compatibility with earlier processors
        MOV     ECX, DWORD PTR [TEMP+4]
        SHR     ECX,20
        SUB     ECX,3FFH
        TEST    EAX,EAX         ; clear zero flag
BS7:
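The ideas behind the two emulations may be easier to follow in C. The
sketch below is only a model, not a replacement for the assembly code: the
first function halves the search range in each step like the BSF emulation,
and the second reads the exponent field of a double like the BSR emulation.
The function names are made up, the argument must be nonzero, and double is
assumed to be a 64 bit IEEE 754 number.

```c
#include <stdint.h>
#include <string.h>

/* Index of the lowest set bit, found by binary search. x must be nonzero. */
static unsigned bsf_emulated(uint32_t x)
{
    unsigned n = 0;
    if ((x & 0xFFFFu) == 0) { x >>= 16; n += 16; }
    if ((x & 0x00FFu) == 0) { x >>= 8;  n += 8;  }
    if ((x & 0x000Fu) == 0) { x >>= 4;  n += 4;  }
    if ((x & 0x0003u) == 0) { x >>= 2;  n += 2;  }
    if ((x & 0x0001u) == 0) { n += 1; }
    return n;
}

/* Index of the highest set bit, taken from the exponent of a double.
   x must be nonzero. */
static unsigned bsr_emulated(uint32_t x)
{
    double d = (double)x;             /* corresponds to FILD / FSTP QWORD   */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);   /* read the stored image as an integer */
    /* The sign bit is zero, so bits 52-62 are the biased exponent.
       This matches SHR ECX,20 / SUB ECX,3FFH on the upper dword. */
    return (unsigned)(bits >> 52) - 0x3FFu;
}
```

A 32 bit integer always fits exactly in the 53 bit mantissa of a double, so
the conversion never rounds up into the next power of two.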
17.11 bit test
--------------
BT, BTC, BTR, and BTS instructions should preferably be replaced by
instructions like TEST, AND, OR, XOR, or shifts.
17.12 FPTAN
-----------
According to the manuals, FPTAN returns two values X and Y and leaves it to
the programmer to divide Y with X to get the result, but in fact it always
returns 1 in X so you can save the division. My tests show that on all 32
bit Intel processors with floating point unit or coprocessor, FPTAN always
returns 1 in X regardless of the argument. If you want to be sure that your
code will run correctly on all processors, then you may test if X is 1,
which is faster than dividing with X. The Y value may be very high, but
never infinity, so you don't have to test if Y contains a valid value.
18. USING INTEGER INSTRUCTIONS TO DO FLOATING POINT OPERATIONS
==============================================================
Integer instructions are generally faster than floating point instructions,
so it is often advantageous to use integer instructions for doing simple
floating point operations. The most obvious example is moving data. Example:
FLD QWORD PTR [ESI] / FSTP QWORD PTR [EDI]
Change to:
MOV EAX,[ESI] / MOV EBX,[ESI+4] / MOV [EDI],EAX / MOV [EDI+4],EBX
The former code takes 4 clocks, the latter takes 2.
Testing if a floating point value is zero:
The floating point value of zero is usually represented as 32 or 64 bits of
zero, but there is a pitfall here: The sign bit may be set! Minus zero is
regarded as a valid floating point number, and the processor may actually
generate a zero with the sign bit set, for example when multiplying a
negative number by zero. So if you want to test if a floating point number
is zero, you should not test the sign bit. Example:
FLD DWORD PTR [EBX] / FTST / FNSTSW AX / AND AH,40H / JNZ IsZero
Use integer instructions instead, and shift out the sign bit:
MOV EAX,[EBX] / ADD EAX,EAX / JZ IsZero
The former code takes 9 clocks, the latter takes only 2.
If the floating point number is double precision (QWORD) then you only have
to test bits 32-62. If they are zero, then the lower half will also be zero
if it is a normal (not denormal) floating point number.
Testing if negative:
A floating point number is negative if the sign bit is set and at least one
other bit is set. Example:
MOV EAX,[NumberToTest] / CMP EAX,80000000H / JA IsNegative
Manipulating the sign bit:
You can change the sign of a floating point number simply by flipping the sign
bit. Example:
XOR BYTE PTR [a] + (TYPE a) - 1, 80H
Likewise you may get the absolute value of a floating point number by simply
ANDing out the sign bit.
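The zero test, the sign test, and the sign manipulation above can be modelled
in C by copying the bits of a float into an integer. This is a sketch under
the assumption that float is a 32 bit IEEE 754 number; the helper names are
invented for this example. Note that a NaN with the sign bit set is also
reported as negative by this test.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* The 32 bit image of a float, as an integer load from memory sees it. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

static float bits_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* MOV EAX,[x] / ADD EAX,EAX / JZ IsZero: the addition shifts out the sign. */
static bool is_zero(float f)
{
    return (uint32_t)(float_bits(f) << 1) == 0;
}

/* CMP EAX,80000000H / JA IsNegative: sign bit set and some other bit set. */
static bool is_negative(float f)
{
    return float_bits(f) > 0x80000000u;
}

/* XOR with the sign bit changes the sign. */
static float negate(float f)
{
    return bits_float(float_bits(f) ^ 0x80000000u);
}

/* AND with everything except the sign bit gives the absolute value. */
static float absolute(float f)
{
    return bits_float(float_bits(f) & 0x7FFFFFFFu);
}
```

Minus zero is the case that makes the difference: is_zero accepts it, and
is_negative rejects it because no bit other than the sign bit is set.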
Comparing numbers:
Floating point numbers are stored in a unique format which allows you to use
integer instructions for comparing floating point numbers, except for the
sign bit. If you are certain that two floating point numbers both are
positive then you may simply compare them as integers. Example:
FLD [a] / FCOMP [b] / FNSTSW AX / AND AH,1 / JNZ ASmallerThanB
Change to:
MOV EAX,[a] / MOV EBX,[b] / CMP EAX,EBX / JB ASmallerThanB
This method only works if the two numbers have the same precision and you
are certain that none of the numbers have the sign bit set. If one or both
numbers may be negative or minus zero, then you have to take all
combinations into account which makes the code so complicated that you
probably would prefer to do a floating point compare.
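The integer comparison works because the exponent is stored above the
mantissa and with a bias, so a larger positive number always has a larger
bit pattern. A minimal C sketch, assuming 32 bit IEEE 754 floats and using a
function name made up for this example:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* a < b for two floats without the sign bit set, corresponding to
   MOV EAX,[a] / MOV EBX,[b] / CMP EAX,EBX / JB ASmallerThanB */
static bool less_nonnegative(float a, float b)
{
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    return ua < ub;       /* unsigned compare, like JB */
}
```

If the function is given a negative number, it gives the wrong answer: -1.0
has the bit pattern 0BF800000H, which is above the pattern 3F800000H of
+1.0 in an unsigned comparison.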
19. USING FLOATING POINT INSTRUCTIONS TO DO INTEGER OPERATIONS
==============================================================
19.1 Moving data
----------------
Floating point instructions can be used to move 8 bytes at a time:
FILD QWORD PTR [ESI] / FISTP QWORD PTR [EDI]
This is only an advantage if the destination is not in the cache. The
optimal way to move a block of data to uncached memory on the Pentium is:
TopOfLoop:
FILD QWORD PTR [ESI]
FILD QWORD PTR [ESI+8]
FXCH
FISTP QWORD PTR [EDI]
FISTP QWORD PTR [EDI+8]
ADD ESI,16
ADD EDI,16
DEC ECX
JNZ TopOfLoop
The source and destination should of course be aligned by 8. The extra time
used by the slow FILD and FISTP instructions is compensated for by the fact
that you only have to do half as many write operations. Note that this
method is only advantageous on the Pentium and only if the destination is
not in the cache. On all other processors the optimal way to move blocks of
data is REP MOVSD, or if you have a processor with MMX you may use the MMX
instructions instead to write 8 bytes at a time.
19.2 Integer multiplication
---------------------------
Floating point multiplication is faster than integer multiplication on the
Pentium without MMX, but the price for converting integer factors to float
and converting the result back to integer is high, so floating point
multiplication is only advantageous if the number of conversions needed is
low compared to the number of multiplications. Integer multiplication is
faster than floating point on other processors.
19.3 Integer division
---------------------
Floating point division is not faster than integer division, but you can do
other integer operations (including integer division, but not integer
multiplication) while the floating point unit is working on the division.
See paragraph 17.6 above for an example.
19.4 Converting binary to decimal numbers
-----------------------------------------
The FBSTP instruction converts a binary number to decimal faster than using
repeated division if you have more than a few digits.
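FBSTP stores its result as a 10 byte packed BCD number, which still has to
be unpacked into characters. The C sketch below shows one way to do the
unpacking. It assumes the packed format documented by Intel: bytes 0-8 hold
18 decimal digits, two per byte with the least significant digits in byte 0,
and bit 7 of byte 9 is the sign. The function name is invented for this
example.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert a 10 byte packed BCD image (as stored by FBSTP) to a string.
   out must have room for 20 characters: sign, 18 digits, terminator. */
static void packed_bcd_to_string(const uint8_t bcd[10], char out[20])
{
    size_t n = 0;
    int started = 0;                 /* becomes 1 at the first nonzero digit */
    int i;

    if (bcd[9] & 0x80u)
        out[n++] = '-';
    for (i = 8; i >= 0; i--) {       /* most significant byte first */
        unsigned hi = bcd[i] >> 4;
        unsigned lo = bcd[i] & 0x0Fu;
        if (hi || started) { out[n++] = (char)('0' + hi); started = 1; }
        if (lo || started) { out[n++] = (char)('0' + lo); started = 1; }
    }
    if (!started)
        out[n++] = '0';              /* the value is zero */
    out[n] = '\0';
}
```

Each byte yields two digits with only a shift and a mask, so no division is
needed after the FBSTP.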
20. LIST OF INTEGER INSTRUCTIONS
================================
Explanations:
Operands: r=register, m=memory, i=immediate data, sr=segment register
m32= 32 bit memory operand, etc.
Clock cycles:
The numbers are minimum values. Cache misses, misalignment, and exceptions
may increase the clock counts considerably.
Pairability:
u=pairable in U-pipe, v=pairable in V-pipe, uv=pairable in either pipe,
np=not pairable
Opcode                Operands            Clock cycles    Pairability
----------------------------------------------------------------------------
NOP                                       1               uv
MOV                   r/m, r/m/i          1               uv
MOV                   r/m, sr             1               np
MOV                   sr, r/m             >= 2 b)         np
XCHG                  (E)AX, r            2               np
XCHG                  r, r                3               np
XCHG                  r, m                >20             np
XLAT                                      4               np
PUSH                  r/i                 1               uv
POP                   r                   1               uv
PUSH                  m                   2               np
POP                   m                   3               np
PUSH                  sr                  1 b)            np
POP                   sr                  >= 3 b)         np
PUSHF                                     4               np
POPF                                      6               np
PUSHA POPA                                5               np
LAHF SAHF                                 2               np
MOVSX MOVZX           r, r/m              3 a)            np
LEA                   r/m                 1               uv
LDS LES LFS LGS LSS   m                   4 c)            np
ADD SUB AND OR XOR    r, r/i              1               uv
ADD SUB AND OR XOR    r, m                2               uv
ADD SUB AND OR XOR    m, r/i              3               uv
CMP                   r, r/i              1               uv
CMP                   m, r/i              2               uv
TEST                  r, r                1               uv
TEST                  m, r                2               uv
TEST                  r, i                1               f)
TEST                  m, i                2               np
ADC SBB               r/m, r/m/i          1/3             u
INC DEC               r                   1               uv
INC DEC               m                   3               uv
NEG NOT               r/m                 1/3             np
MUL IMUL              r8/r16/m8/m16       11              np
MUL IMUL              all other versions  9 d)            np
DIV                   r8/r16/r32          17/25/41        np
IDIV                  r8/r16/r32          22/30/46        np
CBW CWDE                                  3               np
CWD CDQ                                   2               np
SHR SHL SAR SAL       r, i                1               u
SHR SHL SAR SAL       m, i                3               u
SHR SHL SAR SAL       r/m, CL             4/5             np
ROR ROL RCR RCL       r/m, 1              1/3             u
ROR ROL               r/m, i(><1)         1/3             np
ROR ROL               r/m, CL             4/5             np
RCR RCL               r/m, i(><1)         8/10            np
RCR RCL               r/m, CL             7/9             np
SHLD SHRD             r, i/CL             4 a)            np
SHLD SHRD             m, i/CL             5 a)            np
BT                    r, r/i              4 a)            np
BT                    m, i                4 a)            np
BT                    m, r                9 a)            np
BTR BTS BTC           r, r/i              7 a)            np
BTR BTS BTC           m, i                8 a)            np
BTR BTS BTC           m, r                14 a)           np
BSF BSR               r, r/m              7-73 a)         np
SETcc                 r/m                 1 a)            np
JMP CALL              short/near          1               v
JMP CALL              far                 >= 3            np
conditional jump      short/near          1/4/5 e)        v
CALL JMP              r/m                 2               np
RETN                                      2               np
RETN                  i                   3               np
RETF                                      4               np
RETF                  i                   5               np
J(E)CXZ               short               5-10            np
LOOP                  short               5-10            np
BOUND                 r, m                8               np
CLC STC CMC CLD STD                       2               np
CLI STI                                   6-7             np
LODS                                      2               np
REP LODS                                  7+3*n g)        np
STOS                                      3               np
REP STOS                                  10+n g)         np
MOVS                                      4               np
REP MOVSB                                 12+1.8*n g)     np
REP MOVSW                                 12+1.5*n g)     np
REP MOVSD                                 12+n g)         np
SCAS                                      4               np
REP(N)E SCAS                              9+4*n g)        np
CMPS                                      5               np
REP(N)E CMPS                              8+5*n g)        np
BSWAP                                     1 a)            np
----------------------------------------------------------------------------
Notes:
a) this instruction has a 0FH prefix which takes one clock cycle extra to
decode on a Pentium without MMX unless preceded by a multicycle
instruction (see section 13 above).
b) versions with FS and GS have a 0FH prefix. see note a.
c) versions with SS, FS, and GS have a 0FH prefix. see note a.
d) versions with two operands and no immediate have a 0FH prefix, see note a.
e) see section 12 above
f) only certain versions are pairable. see paragraph 17.1 above
g) add one clock cycle for decoding the repeat prefix unless preceded by a
multicycle instruction (such as CLD. see section 13 above).
21. LIST OF FLOATING POINT INSTRUCTIONS
=======================================
Explanations:
Operands: r=register, m=memory, m32=32 bit memory operand, etc.
Clock cycles:
The numbers are minimum values. Cache misses, misalignment, and exceptions
may increase the clock counts considerably.
Pairability:
+=pairable with FXCH, np=not pairable
i-ov:
Overlap with integer instructions. i-ov = 4 means that the last four clock
cycles can overlap with subsequent integer instructions.
fp-ov:
Overlap with floating point instructions. fp-ov = 2 means that the last two
clock cycles can overlap with subsequent floating point instructions.
(WAIT is considered a floating point instruction here)
Opcode                Operand     Clock cycles  Pairability  i-ov    fp-ov
-----------------------------------------------------------------------------
FLD                   r/m32/m64   1             +            0       0
FLD                   m80         3             np           0       0
FBLD                  m80         49            np           0       0
FST(P)                r           1             np           0       0
FST(P)                m32/m64     2 h)          np           0       0
FST(P)                m80         3 h)          np           0       0
FBSTP                 m80         153           np           0       0
FILD                  m           3             np           2       2
FIST(P)               m           6             np           0       0
FLDZ FLD1                         2             np           0       0
FLDPI FLDL2E etc.                 5             np           0       0
FNSTSW                AX/m16      6             np           0       0
FLDCW                 m16         8             np           0       0
FNSTCW                m16         2             np           0       0
FADD(P)               r/m         3             +            2       2
FSUB(R)(P)            r/m         3             +            2       2
FMUL(P)               r/m         3             +            2       2 i)
FDIV(R)(P)            r/m         39            + j)         38 k)   2
FCHS FABS                         1             +            0       0
FCOM(P)(P) FUCOM      r/m         1             +            0       0
FIADD FISUB(R)        m           6             np           2       2
FIMUL                 m           6             np           2       2
FIDIV(R)              m           42            np           38 k)   2
FICOM                 m           4             np           0       0
FTST                              1             np           0       0
FXAM                              17            np           4       0
FPREM                             18-33         np           2       2
FPREM1                            20-49         np           2       2
FRNDINT                           19            np           0       0
FSCALE                            32            np           5       0
FXTRACT                           12-66         np           0       0
FSQRT                             70            np           69 k)   2
FSIN FCOS FSINCOS                 varies        np           2       2
F2XM1 FYL2X FYL2XP1               varies        np           2       2
FPATAN                            varies        np           2       2
FPTAN                             varies        np           36 k)   0
FNOP                              2             np           0       0
FXCH                  r           1             np           0       0
FINCSTP FDECSTP                   2             np           0       0
FFREE                 r           2             np           0       0
FNCLEX                            6-9           np           0       0
FNINIT                            22            np           0       0
FNSAVE                m           ca.300        np           0       0
FRSTOR                m           73            np           0       0
WAIT                              1             np           0       0
-----------------------------------------------------------------------------
Notes:
h) The value to store is needed one clock cycle in advance.
i) 1 if the overlapping instruction is also a FMUL.
j) If the FXCH is followed by an integer instruction then it will still
pair, but take an extra clock cycle so that the integer instruction will
begin in clock cycle 3.
k) Cannot overlap integer multiplication instructions.
22. TESTING SPEED
=================
The Pentium has an internal 64 bit clock counter which can be read into
EDX:EAX using the instruction RDTSC (read time stamp counter). This is very
useful for testing exactly how many clock cycles a piece of code takes.
The program below is useful for measuring the number of clock cycles a piece
of code takes. The program executes the code to test 10 times and stores the
10 clock counts. The program can be used in both 16 and 32 bit mode.
RDTSC       MACRO                       ; define RDTSC instruction
            DB      0FH,31H
            ENDM
ITER        EQU     10                  ; number of iterations
.DATA                                   ; data segment
            ALIGN   4
COUNTER     DD      0                   ; loop counter
TICS        DD      0                   ; temporary storage of clock
RESULTLIST  DD      ITER DUP (0)        ; list of test results
.CODE                                   ; code segment
BEGIN:      MOV     [COUNTER],0         ; reset loop counter
TESTLOOP:                               ; test loop
;**************** Do any initializations here: ************************
            FINIT
;**************** End of initializations ************************
            RDTSC                       ; read clock counter
            MOV     [TICS],EAX          ; save count
            CLD                         ; non-pairable filler
REPT 8
            NOP                         ; eight NOP's to avoid shadowing effect
ENDM
;**************** Put instructions to test here: ************************
            FLDPI                       ; this is only an example
            FSQRT
            RCR     EBX,10
            FSTP    ST
;********************* End of instructions to test ************************
            CLC                         ; non-pairable filler with shadow
            RDTSC                       ; read counter again
            SUB     EAX,[TICS]          ; compute difference
            SUB     EAX,15              ; subtract the clock cycles used by fillers
            MOV     EDX,[COUNTER]       ; loop counter
            MOV     [RESULTLIST][EDX],EAX   ; store result in table
            ADD     EDX,TYPE RESULTLIST ; increment counter
            MOV     [COUNTER],EDX       ; store counter
            CMP     EDX,ITER * (TYPE RESULTLIST)
            JB      TESTLOOP            ; repeat ITER times
; insert here code to read out the values in RESULTLIST
The 'filler' instructions before and after the piece of code to test are
critical. The CLD is a non-pairable instruction which has been inserted to
make sure the pairing is the same the first time as the subsequent times.
The eight NOP instructions are inserted to prevent any prefixes in the code
to test from being decoded in the shadow of the preceding instructions. Single
byte instructions are used here to obtain the same pairing the first time as
the subsequent times. The CLC after the code to test is a non-pairable
instruction which has a shadow under which the 0FH prefix of the RDTSC can
be decoded so that it is independent of any shadowing effect from the code
to test.
The RDTSC instruction cannot execute in virtual mode, so if you are running
under DOS you must skip the EMM386 (or any other memory manager) in your
CONFIG.SYS and not run under a DOS box in Windows.
The Pentium processor has special performance monitor counters which can
count events such as cache misses, misalignments, AGI stalls, etc. Details
about how to use the performance monitor counters are not covered by this
manual and must be sought elsewhere.
23. CONSIDERATIONS FOR OTHER MICROPROCESSORS
============================================
Most of the optimizations described in this document have little or no
negative effect on other microprocessors, including non-Intel processors,
but there are some problems to be aware of.
Using a full register after writing to part of the register will cause a
moderate delay on the 80486 and a severe delay on the PentiumPro. Example:
MOV AL,[EBX] / MOV ECX,EAX
On the PentiumPro you may avoid this penalty by zeroing the full register
first:
XOR EAX,EAX / MOV AL,[EBX] / MOV ECX,EAX
or by using MOVZX.
Scheduling floating point code for the Pentium often requires a lot of extra
FXCH instructions. This will slow down execution on earlier microprocessors,
but not on the PentiumPro and advanced non-Intel processors.
As mentioned in the introduction, Intel has announced new MMX versions of
the Pentium and PentiumPro chips with special instructions for integer
vector operations. These instructions will be very useful for massively
parallel integer calculations.
The PentiumPro chip is faster than the Pentium in some respects, but
inferior in other respects. Knowing the strong and weak sides of the
PentiumPro can help you make your code work well on both processors.
The most important advantage of the PentiumPro is that it does much of the
optimization for you: reordering instructions and splitting complex
instructions into simple ones. But for perfectly optimized code there is
less difference between the two processors.
The two processors have basically the same number of execution units, so the
throughput should be near the same. The PPro has separate units for memory
read and write so that it can do three operations simultaneously if one of
them is a memory read, but on the other hand it cannot do two memory reads
or two writes simultaneously as the Pentium can.
The PPro is better than the Pentium in the following respects:
- out of order execution
- one cache miss does not delay subsequent independent instructions
- splitting complex instructions into smaller micro-ops
- automatic register renaming to avoid unnecessary dependencies
- better jump prediction algorithm than Pentium without MMX
- many instructions which are unpairable and poorly optimized on the Pentium
perform better on the PPro, e.g. integer multiplication, movzx, cdq, bit
scan, bit test, shifts by cl, and floating point store
- floating point instructions and simple integer instructions can execute
simultaneously
- memory reads and writes do not occupy the ALU's
- indirect memory read instructions have no AGI stall
- new conditional move instructions can be used instead of branches in some
cases
- new FCOMI instruction eliminates the need for the slow FNSTSW AX
- higher maximum clock frequency
The PPro is inferior to the Pentium in the following respects:
- mispredicted jumps are very expensive (10-15 clock cycles!)
- poor performance on 16 bit code and segmented models
- prefixes are expensive (except 0F extended opcode)
- long stall when mixing 8, 16, and 32 bit registers
- fadd, fsub, fmul, fchs have longer latency
- cannot do two memory reads or two memory writes simultaneously
- some instruction combinations cannot execute in parallel, like
push+push, push+call, compare+conditional jump
As a consequence of this, the PentiumPro may actually be slower than the
Pentium on perfectly optimized code with a lot of unpredictable branches,
and a lot of floating point code with little or no natural parallelism.
Most of the drawbacks of each processor can be circumvented by careful
optimization and by running in 32 bit flat mode. But the problem with
mispredicted jumps on the PPro cannot be avoided except in the cases where
you can use a conditional move instead.
Taking advantage of the new instructions in the MMX and PentiumPro
processors will create problems if you want your code to be compatible with
earlier microprocessors. The solution may be to write several versions of
your code, each optimized for a particular processor. Your program should
automatically detect which processor it is running on and select the
appropriate version of code. Such a complicated approach is of course only
needed for the most critical parts of your program.
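The selection of a version at run time can be done once with a function
pointer, so that the critical code itself contains no extra branches. The C
sketch below only illustrates the structure. All names in it are
hypothetical, the three routines are plain C stand-ins for hand-optimized
versions, and the processor detection itself is not shown.

```c
#include <stddef.h>

/* Hypothetical result of a processor detection routine (not shown). */
typedef enum { CPU_OTHER, CPU_PENTIUM, CPU_PENTIUM_PRO } cpu_kind;

typedef long (*sum_fn)(const int *v, size_t n);

/* Stand-in for a generic version. */
static long sum_generic(const int *v, size_t n)
{
    long s = 0;
    size_t i;
    for (i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Stand-in for a Pentium version: two independent sums, which is the
   kind of structure that allows pairing in the two pipes. */
static long sum_pentium(const int *v, size_t n)
{
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n)
        s0 += v[i];
    return s0 + s1;
}

/* Stand-in for a PentiumPro version. */
static long sum_ppro(const int *v, size_t n)
{
    return sum_generic(v, n);
}

/* Called once at program start; the result is stored and used afterwards. */
static sum_fn select_sum(cpu_kind cpu)
{
    switch (cpu) {
    case CPU_PENTIUM:     return sum_pentium;
    case CPU_PENTIUM_PRO: return sum_ppro;
    default:              return sum_generic;
    }
}
```

All versions must of course give the same result, so that the rest of the
program does not need to know which one was selected.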