Pentium Discussion on Special Instructions

17. DISCUSSION OF SPECIAL INSTRUCTIONS
======================================

17.1 TEST
---------
The TEST instruction with an immediate operand is only pairable if the 
destination is AL, AX, or EAX, and only if it is coded in a certain way.
TEST register,register  and  TEST register,memory  are always pairable.

TEST EAX,immediate  can be coded in two ways:
a.  Two bytes instruction code (0F7H,0C0H) + 4 bytes data: not pairable
b.  One byte instruction code (0A9H)       + 4 bytes data: pairable
Unlike ADD, CMP, etc., TEST has no form with a sign extended byte constant,
so form b is always the shortest one. An assembler that chooses the shortest
form of an instruction will therefore pick the pairable form. If your
assembler or code generator emits form a, then you can hard code form b, e.g.:
DB 0A9H / DD const
or change it to  TEST AL,const  if only the low 8 bits need to be tested.

Examples:
TEST ECX,ECX     ; pairable
TEST [mem],EBX   ; pairable
TEST EDX,256     ; not pairable
TEST DWORD PTR [EBX],8000H ; not pairable
To make it pairable, use any of the following methods:
MOV EAX,[EBX] / TEST EAX,8000H
MOV EDX,[EBX] / AND  EDX,8000H
MOV AL,[EBX+1] / TEST AL,80H
MOV AL,[EBX+1] / TEST AL,AL  ; (result in sign flag)
It is also possible to test a bit by shifting it into the carry flag:
MOV EAX,[EBX] / SHR EAX,16   ; (result in carry flag)
but this method has a penalty on the PentiumPro when the shift count is more
than one.

(The reason for this non-pairability is probably that the first byte of the
2-byte instruction is the same as for some other non-pairable instructions,
and the Pentium cannot afford to check the second byte too when determining 
pairability.) 

17.2 WAIT
---------
You can often increase speed by omitting the WAIT instruction.
The WAIT instruction has three functions:

  a. The 8087 processor requires a WAIT before _every_ floating point 
     instruction.

  b. WAIT is used to coordinate memory access between the floating point
     unit and the integer unit. Examples:
     b.1.  FISTP [mem32]
           WAIT             ; wait for f.p. unit to write before..
           MOV EAX,[mem32]  ; reading the result with the integer unit

     b.2.  FILD [mem32]
           WAIT             ; wait for f.p. unit to read value before..
           MOV [mem32],EAX  ; overwriting it with integer unit

     b.3.  FLD DWORD PTR [ESP]
           WAIT             ; prevent an accidental hardware interrupt from..
           ADD ESP,4        ; overwriting value on stack before it is read

  c. WAIT is sometimes used to check for exceptions. It will generate an 
     interrupt if there is an unmasked exception bit in the f.p. status word 
     set by a preceding floating point instruction.  

Regarding a:
The function in point a is never needed on any processor other than the old 
8087. Unless you want your code to be compatible with the 8087, you should 
tell your assembler not to insert these WAITs by specifying a higher 
processor.

Regarding b:
WAIT instructions to coordinate memory access are definitely needed on the 
8087 and 80287. A superscalar processor like the Pentium has special 
circuitry to detect memory conflicts so you wouldn't need the WAIT for this 
purpose on code that only runs on a Pentium or higher. I have made some 
tests on other Intel processors and not been able to provoke any error by 
omitting the WAIT on any 32 bit Intel processor, although Intel manuals say 
that the WAIT is needed for this purpose except after FNSTSW and FNSTCW. If 
you want to be certain that your code will work on any 32 bit processor 
(including non-Intel processors) then I would recommend that you include the 
WAIT here in order to be safe.  

Regarding c:
The assembler automatically inserts a WAIT for this purpose before the 
following instructions: 
FCLEX, FINIT, FSAVE, FSTCW, FSTENV, FSTSW
You can omit the WAIT by writing FNCLEX, etc. My tests show that the WAIT is 
unnecessary in most cases because these instructions without WAIT will 
still generate an interrupt on exceptions except for FNCLEX and FNINIT on 
the 80387. (There is some inconsistency about whether the IRET from the 
interrupt points to the FN.. instruction or to the next instruction).  

Almost all other floating point instructions will also generate an interrupt 
if a previous floating point instruction has set an unmasked exception bit, 
so the exception is likely to be detected sooner or later anyway.  

You may still need the WAIT if you want to know exactly where an exception 
occurred in order to recover from the situation. Consider, for example, the 
code under b.3 above: If you want to be able to recover from an exception 
generated by the FLD here, then you need the WAIT because an interrupt after 
ADD ESP,4 would overwrite the value to load.  

17.3 FCOM + FSTSW AX
--------------------
The usual way of doing floating point comparisons is:
FLD [a]
FCOMP [b]
FSTSW AX
SAHF
JB ASmallerThanB

You may improve this code by using FNSTSW AX rather than FSTSW AX and 
testing AH directly rather than using the non-pairable SAHF.
(TASM version 3.0 has a bug with the FNSTSW AX instruction)

FLD [a]
FCOMP [b]
FNSTSW AX
SHR AH,1
JC ASmallerThanB

Testing for zero or equality:

FTST
FNSTSW AX
AND AH,40H
JNZ IsZero     ; (the zero flag is inverted!)

Test if greater:

FLD [a]
FCOMP [b]
FNSTSW AX
AND AH,41H
JZ AGreaterThanB

Do not use TEST AH,41H as it is not pairable. Do not use TEST EAX,4100H as 
it would produce a partial register stall on the PentiumPro. Do not test the 
flags after multibit shifts, as this has a penalty on the PentiumPro.  
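
The masks 1, 40H and 41H used above follow from where the condition bits C0, 
C2 and C3 of the status word end up in AH after FNSTSW AX. The following C 
sketch is not part of the original code; the function names are made up for 
illustration. It decodes a stored AX value the same way as the three 
assembly sequences, so the bit positions can be checked:

```c
#include <stdint.h>

/* Condition bits of the status word as they appear in AH after FNSTSW AX. */
enum { AH_C0 = 0x01, AH_C2 = 0x04, AH_C3 = 0x40 };

/* Each function takes the AX value stored by FNSTSW AX after
   FLD [a] / FCOMP [b].  All other status word bits are masked out. */

/* SHR AH,1 / JC : C0 set means a < b (or unordered) */
int a_below_b(uint16_t ax)
{
    uint8_t ah = (uint8_t)(ax >> 8);
    return (ah & AH_C0) != 0;
}

/* AND AH,40H / JNZ : C3 set means a = b (or unordered) */
int a_equals_b(uint16_t ax)
{
    uint8_t ah = (uint8_t)(ax >> 8);
    return (ah & AH_C3) != 0;
}

/* AND AH,41H / JZ : C3 and C0 both clear means a > b */
int a_greater_b(uint16_t ax)
{
    uint8_t ah = (uint8_t)(ax >> 8);
    return (ah & (AH_C3 | AH_C0)) == 0;
}
```

Note that an unordered compare (one operand is a NaN) sets C3, C2 and C0 all 
at once, so the "below" and "equal" tests both succeed in that case.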

It is often faster to use integer instructions for comparing floating point 
values, as described in section 18 below.

17.4 LEA
--------
The LEA instruction is useful for many purposes because it can do a shift, 
two additions, and a move in just one pairable instruction taking one clock 
cycle.  Example:
LEA EAX,[EBX+8*ECX-1000]
is much faster than
MOV EAX,ECX / SHL EAX,3 / ADD EAX,EBX / SUB EAX,1000
The LEA instruction can also be used to do an add or shift without changing 
the flags. The source and destination need not have the same word size, so  
LEA EAX,[BX]  is a useful replacement for  MOVZX EAX,BX.  

You must be aware, however, that the LEA instruction will suffer an AGI 
stall if it uses a base or index register which has been changed in the 
preceding clock cycle.

Since the LEA instruction is pairable in the V-pipe and shift instructions 
are not, you may use LEA as a substitute for a SHL by 1, 2, or 3 if you want 
the instruction to execute in the V-pipe.  

The 32 bit processors have no documented addressing mode with a scaled index 
register and nothing else, so an instruction like  LEA EAX,[EAX*2]  is 
actually coded as  LEA EAX,[EAX*2+00000000]  with an immediate displacement 
of 4 bytes.  You may reduce the instruction size by instead writing  LEA 
EAX,[EAX+EAX] or even better  ADD EAX,EAX.  The latter code cannot have an 
AGI delay. If you happen to have a register which is zero (like a loop 
counter after a loop), then you may use it as a base register to reduce the 
code size: 

LEA EAX,[EBX*4]     ; 7 bytes
LEA EAX,[ECX+EBX*4] ; 3 bytes

17.5 integer multiplication
---------------------------
An integer multiplication takes approximately 9 clock cycles. It is 
therefore advantageous to replace a multiplication by a constant with a 
combination of other instructions such as SHL, ADD, SUB, and LEA.  Example: 
IMUL EAX,10
can be replaced with
MOV EBX,EAX / ADD EAX,EAX / SHL EBX,3 / ADD EAX,EBX
or
LEA EAX,[EAX+4*EAX] / ADD EAX,EAX
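
The two replacement sequences compute 2*x + 8*x and (x + 4*x)*2 
respectively. The C sketch below is an illustration added here, with made-up 
function names. It mirrors each sequence instruction by instruction so that 
the arithmetic can be checked against a plain multiplication. Unsigned 
32 bit arithmetic is used so that overflow wraps around as it does in the 
registers.

```c
#include <stdint.h>

/* Mirrors: MOV EBX,EAX / ADD EAX,EAX / SHL EBX,3 / ADD EAX,EBX */
uint32_t mul10_shift_add(uint32_t eax)
{
    uint32_t ebx = eax;   /* MOV EBX,EAX            */
    eax += eax;           /* ADD EAX,EAX   ->  2*x  */
    ebx <<= 3;            /* SHL EBX,3     ->  8*x  */
    eax += ebx;           /* ADD EAX,EBX   -> 10*x  */
    return eax;
}

/* Mirrors: LEA EAX,[EAX+4*EAX] / ADD EAX,EAX */
uint32_t mul10_lea(uint32_t eax)
{
    eax = eax + 4u * eax; /* LEA EAX,[EAX+4*EAX] ->  5*x */
    eax += eax;           /* ADD EAX,EAX         -> 10*x */
    return eax;
}
```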

Floating point multiplication is faster than integer multiplication on a 
Pentium without MMX, but the time used to convert integers to float and 
convert the product back again is usually more than the time saved by using 
floating point multiplication, except when the number of conversions is low 
compared with the number of multiplications.  

17.6 division
-------------
Division is quite time consuming. The DIV instruction takes 17, 25, or 41 
clock cycles for byte, word, and dword divisors respectively. The IDIV 
instruction takes 5 clock cycles more. It is therefore preferable to use the 
smallest operand size possible that won't generate an overflow, even if it 
costs an operand size prefix, and use unsigned division if possible.  

Unsigned division by a power of two can be done with SHR.  Division of a 
signed number by a power of two can be done with SAR, but the result with 
SAR is rounded towards minus infinity, whereas the result with IDIV is 
truncated towards zero.  
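
The difference only shows up for negative numbers that are not a multiple of 
the divisor. The C sketch below is an added illustration with hypothetical 
function names. It models the SAR result and the IDIV result, and also shows 
a standard correction that is not stated in the text above: adding 2^n - 1 
to negative numbers before the shift makes SAR agree with IDIV.

```c
#include <stdint.h>

/* Result of SAR x,n (0 <= n <= 31): rounds towards minus infinity.
   Written without >> on negative values, because that is
   implementation-defined in C. */
int32_t sar(int32_t x, unsigned n)
{
    uint32_t magnitude;
    if (x >= 0)
        return x >> n;
    magnitude = (uint32_t)(-(x + 1));       /* non-negative, cannot overflow */
    return -(int32_t)(magnitude >> n) - 1;  /* floor(x / 2^n) */
}

/* Result of IDIV by 2^n (0 <= n <= 30): truncates towards zero. */
int32_t idiv_pow2(int32_t x, unsigned n)
{
    return x / (int32_t)(1 << n);
}

/* SAR made to agree with IDIV (0 <= n <= 30): bias negative numbers first. */
int32_t sar_truncating(int32_t x, unsigned n)
{
    int32_t bias = (x < 0) ? (int32_t)((1 << n) - 1) : 0;
    return sar(x + bias, n);
}
```

For example, -7 shifted right by one position gives -4 with SAR, while 
-7 divided by 2 gives -3 with IDIV.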

Floating point division takes 39 clock cycles. It is possible to do a
floating point division and an integer division in parallel to save time.
Example: A = A1 / A2;  B = B1 / B2
FILD [B1]
FILD [B2]
MOV EAX,[A1]
MOV EBX,[A2]
CDQ
FDIV
IDIV EBX
FISTP [B]
MOV [A],EAX
(make sure you set the floating point unit to the desired rounding method)

Obviously, you should always try to minimize the number of divisions. For 
example:  if (A/B > C)...  can be rewritten as  if (A > B*C)...  when B is 
positive, and the opposite when B is negative.

A/B + C/D  can be rewritten as  (A*D + C*B) / (B*D)

If you are using integer division, then you should be aware that the 
rounding errors may be different when you rewrite the formulas.  
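
A small example of such a difference, written as a C sketch with made-up 
function names: the integer quotient A/B is truncated before the compare, 
whereas the rewritten form in effect compares the exact quotient. The 
rewritten form assumes that B is positive and that B*C does not overflow.

```c
#include <stdbool.h>

/* Original form: one integer division, quotient truncated. */
bool quotient_exceeds(int a, int b, int c)
{
    return a / b > c;
}

/* Rewritten form for b > 0: no division. */
bool quotient_exceeds_nodiv(int a, int b, int c)
{
    return a > b * c;
}
```

With A = 8, B = 2, C = 3 both forms agree. With A = 7, B = 2, C = 3 the 
first form computes 7/2 = 3 and finds that 3 > 3 is false, while the second 
form finds that 7 > 6 is true.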

17.7 string instructions
------------------------
String instructions without a repeat prefix are too slow, and should always 
be replaced by simpler instructions. The same applies to LOOP and JECXZ.  

String instructions with repeat may be optimal. Always use the dword version 
if possible, and make sure that both source and destination are aligned by 4.

REP MOVSD is the fastest way to move blocks of data when the destination is 
in the cache. See section 19 for an alternative.  

REP STOSD is optimal when the destination is in the cache.

REP LODS, REP SCAS, and REP CMPS are not optimal, and may be replaced by 
loops. See section 16 example 10 for an alternative to REP SCASB.

17.8 XCHG
---------
The  XCHG register,memory  instruction is dangerous. By default this 
instruction has an implicit LOCK prefix which prevents it from using the 
cache. The instruction is therefore very time consuming, and should always 
be avoided.

17.9 rotates through carry
--------------------------
RCR and RCL with a count different from one are slow and should be avoided.

17.10 bit scan
--------------
BSF and BSR are the most poorly optimized instructions on the Pentium, taking 
11 + 2*n clock cycles, where n is the number of zeros skipped. (On later 
processors they take only 1 clock cycle.)

The following code emulates BSF ECX,EAX:
        TEST    EAX,EAX
        JZ      SHORT BS6
        PUSH    EAX
        XOR     ECX,ECX
        TEST    EAX,0FFFFH       ; (only pairable if register is EAX)
        JNZ     SHORT BS1
        SHR     EAX,16
        ADD     ECX,16
BS1:    TEST    AL,AL
        JNZ     SHORT BS2
        MOV     AL,AH
        ADD     ECX,8
BS2:    TEST    AL,0FH
        JNZ     SHORT BS3
        SHR     AL,4
        ADD     ECX,4
BS3:    TEST    AL,3
        JNZ     SHORT BS4
        SHR     AL,2
        ADD     ECX,2
BS4:    TEST    AL,1
        JNZ     SHORT BS5
        INC     ECX
BS5:    POP     EAX
BS6:
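
The BSF emulation above is a binary search: it tests the low half of the 
remaining bits and, if that half is empty, shifts it away and adds its width 
to the result. The same logic as a C sketch, added here for illustration 
(the function name and the return value -1 for a zero input are choices made 
for this sketch; the assembly version leaves ECX unchanged and sets the zero 
flag in that case):

```c
#include <stdint.h>

/* Index of the lowest set bit of x, or -1 if x is zero. */
int bsf_emulated(uint32_t x)
{
    int n = 0;
    if (x == 0)
        return -1;
    if ((x & 0xFFFFu) == 0) { x >>= 16; n += 16; }  /* TEST EAX,0FFFFH */
    if ((x & 0x00FFu) == 0) { x >>= 8;  n += 8;  }  /* TEST AL,AL      */
    if ((x & 0x000Fu) == 0) { x >>= 4;  n += 4;  }  /* TEST AL,0FH     */
    if ((x & 0x0003u) == 0) { x >>= 2;  n += 2;  }  /* TEST AL,3       */
    if ((x & 0x0001u) == 0) { n += 1; }             /* TEST AL,1       */
    return n;
}
```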

The following code emulates BSR ECX,EAX:
        TEST    EAX,EAX
        JZ      SHORT BS7
        MOV     DWORD PTR [TEMP],EAX
        MOV     DWORD PTR [TEMP+4],0
        FILD    QWORD PTR [TEMP]
        FSTP    QWORD PTR [TEMP]
        WAIT    ; WAIT only needed for compatibility with earlier processors
        MOV     ECX, DWORD PTR [TEMP+4]
        SHR     ECX,20
        SUB     ECX,3FFH
        TEST    EAX,EAX       ; clear zero flag
BS7:
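
The BSR emulation above works because converting the integer to a double 
precision float normalizes it, so that the position of the highest set bit 
ends up in the exponent field (biased by 3FFH). A C sketch of the same idea, 
added for illustration: the function name and the return value -1 for zero 
are choices made here, and the code assumes that double is the 64 bit 
IEEE 754 format that FSTP QWORD PTR stores.

```c
#include <stdint.h>
#include <string.h>

/* Index of the highest set bit of x, or -1 if x is zero. */
int bsr_emulated(uint32_t x)
{
    double d;
    uint64_t bits;
    if (x == 0)
        return -1;
    d = (double)x;                     /* FILD: exact, 32 bits fit in the mantissa */
    memcpy(&bits, &d, sizeof bits);    /* FSTP QWORD PTR [TEMP]                    */
    return (int)(bits >> 52) - 0x3FF;  /* SHR ECX,20 on upper dword / SUB ECX,3FFH */
}
```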

17.11 bit test
--------------
BT, BTC, BTR, and BTS instructions should preferably be replaced by 
instructions like TEST, AND, OR, XOR, or shifts.  

17.12 FPTAN
-----------
According to the manuals, FPTAN returns two values X and Y and leaves it to 
the programmer to divide Y by X to get the result, but in fact it always 
returns 1 in X so you can save the division. My tests show that on all 32 
bit Intel processors with floating point unit or coprocessor, FPTAN always 
returns 1 in X regardless of the argument. If you want to be sure that your 
code will run correctly on all processors, then you may test if X is 1, 
which is faster than dividing by X.  The Y value may be very high, but 
never infinity, so you don't have to test if Y contains a valid value.  


18. USING INTEGER INSTRUCTIONS TO DO FLOATING POINT OPERATIONS
==============================================================
Integer instructions are generally faster than floating point instructions, 
so it is often advantageous to use integer instructions for doing simple 
floating point operations. The most obvious example is moving data. Example: 

FLD QWORD PTR [ESI] / FSTP QWORD PTR [EDI]

Change to:

MOV EAX,[ESI] / MOV EBX,[ESI+4] / MOV [EDI],EAX / MOV [EDI+4],EBX

The former code takes 4 clocks, the latter takes 2.

Testing if a floating point value is zero:

The floating point value of zero is usually represented as 32 or 64 bits of 
zero, but there is a pitfall here: The sign bit may be set! Minus zero is 
regarded as a valid floating point number, and the processor may actually 
generate a zero with the sign bit set, for example when multiplying a 
negative number by zero. So if you want to test if a floating point number 
is zero, you should not test the sign bit. Example: 

FLD DWORD PTR [EBX] / FTST / FNSTSW AX / AND AH,40H / JNZ IsZero

Use integer instructions instead, and shift out the sign bit:

MOV EAX,[EBX] / ADD EAX,EAX / JZ IsZero

The former code takes 9 clocks, the latter takes only 2.
If the floating point number is double precision (QWORD) then you only have 
to test bit 32-62. If they are zero, then the lower half will also be zero 
if it is a valid floating point number.  
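
The zero test can be checked with the following C sketch, which is an added 
illustration with made-up function names. It assumes IEEE 754 single and 
double precision formats. Note that the double precision version, like the 
method described above, also reports zero for the very smallest denormal 
numbers that have bits set only in the lower half.

```c
#include <stdint.h>
#include <string.h>

/* MOV EAX,[EBX] / ADD EAX,EAX / JZ IsZero */
int float_is_zero(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);       /* MOV EAX,[EBX]                     */
    return (uint32_t)(bits + bits) == 0;  /* ADD EAX,EAX shifts out the sign   */
}

/* Double precision: test bits 32-62 only. */
int double_is_zero(double d)
{
    uint64_t bits;
    uint32_t upper;
    memcpy(&bits, &d, sizeof bits);
    upper = (uint32_t)(bits >> 32);       /* upper dword                       */
    return (uint32_t)(upper << 1) == 0;   /* drop the sign bit (bit 63)        */
}
```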

Testing if negative:
A floating point number is negative if the sign bit is set and at least one 
other bit is set. Example:
MOV EAX,[NumberToTest] / CMP EAX,80000000H / JA IsNegative
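
The unsigned compare against 80000000H is what excludes minus zero: minus 
zero is exactly 80000000H, and any other number with the sign bit set is 
above it. A C sketch of the same test, added for illustration (the function 
name is made up, and IEEE 754 single precision is assumed):

```c
#include <stdint.h>
#include <string.h>

/* MOV EAX,[NumberToTest] / CMP EAX,80000000H / JA IsNegative */
int float_is_negative(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits > 0x80000000u;   /* unsigned compare, as JA */
}
```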

Manipulating the sign bit:
You can change the sign of a floating point number simply by flipping the sign 
bit. Example:
XOR BYTE PTR [a] + (TYPE a) - 1, 80H

Likewise you may get the absolute value of a floating point number by simply 
ANDing out the sign bit.  
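
Both operations as a C sketch, added for illustration with made-up function 
names and assuming IEEE 754 single precision. The assembly example works on 
the highest byte of the number in memory; the sketch works on the whole 
dword, which touches the same bit.

```c
#include <stdint.h>
#include <string.h>

/* Change sign: flip bit 31. */
float float_negate(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits ^= 0x80000000u;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Absolute value: clear bit 31. */
float float_abs(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits &= 0x7FFFFFFFu;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```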

Comparing numbers:
Floating point numbers are stored in a unique format which allows you to use 
integer instructions for comparing floating point numbers, except for the 
sign bit. If you are certain that two floating point numbers both are 
positive then you may simply compare them as integers. Example: 

FLD [a] / FCOMP [b] / FNSTSW AX / AND AH,1 / JNZ ASmallerThanB

Change to:

MOV EAX,[a] / MOV EBX,[b] / CMP EAX,EBX / JB ASmallerThanB

This method only works if the two numbers have the same precision and you 
are certain that none of the numbers have the sign bit set. If one or both 
numbers may be negative or minus zero, then you have to take all 
combinations into account which makes the code so complicated that you 
probably would prefer to do a floating point compare.  
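
The integer compare can be tried out with the C sketch below, which is an 
added illustration with a made-up function name, assuming IEEE 754 single 
precision. It also shows why the method fails for negative numbers: with the 
sign bit set, a larger bit pattern means a smaller value.

```c
#include <stdint.h>
#include <string.h>

/* MOV EAX,[a] / MOV EBX,[b] / CMP EAX,EBX / JB ASmallerThanB
   Only valid when neither a nor b has the sign bit set. */
int positive_float_below(float a, float b)
{
    uint32_t ia, ib;
    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);
    return ia < ib;   /* unsigned compare, as JB */
}
```

For the positive pair 1.0 and 2.0 the function gives the right answer. For 
the negative pair -2.0 and -1.0 it returns 0 although -2.0 is the smaller 
number.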


19. USING FLOATING POINT INSTRUCTIONS TO DO INTEGER OPERATIONS
==============================================================
19.1 Moving data
----------------
Floating point instructions can be used to move 8 bytes at a time:
FILD QWORD PTR [ESI] / FISTP QWORD PTR [EDI]
This is only an advantage if the destination is not in the cache. The 
optimal way to move a block of data to uncached memory on the Pentium is: 

TopOfLoop:
FILD QWORD PTR [ESI]
FILD QWORD PTR [ESI+8]
FXCH
FISTP QWORD PTR [EDI]
FISTP QWORD PTR [EDI+8]
ADD ESI,16
ADD EDI,16
DEC ECX
JNZ TopOfLoop

The source and destination should of course be aligned by 8. The extra time 
used by the slow FILD and FISTP instructions is compensated for by the fact 
that you only have to do half as many write operations.  Note that this 
method is only advantageous on the Pentium and only if the destination is 
not in the cache. On all other processors the optimal way to move blocks of 
data is REP MOVSD, or if you have a processor with MMX you may use the MMX 
instructions instead to write 8 bytes at a time.  

19.2 Integer multiplication
---------------------------
Floating point multiplication is faster than integer multiplication on the 
Pentium without MMX, but the price for converting integer factors to float 
and converting the result back to integer is high, so floating point 
multiplication is only advantageous if the number of conversions needed is 
low compared to the number of multiplications. Integer multiplication is 
faster than floating point on other processors.  

19.3 Integer division
---------------------
Floating point division is not faster than integer division, but you can do 
other integer operations (including integer division, but not integer 
multiplication) while the floating point unit is working on the division. 
See paragraph 17.6 above for an example.  

19.4 Converting binary to decimal numbers
-----------------------------------------
The FBSTP instruction converts a binary number to decimal faster than using 
repeated division if you have more than a few digits.  


20. LIST OF INTEGER INSTRUCTIONS
================================
Explanations:
Operands: r=register, m=memory, i=immediate data, sr=segment register
m32= 32 bit memory operand, etc.

Clock cycles:
The numbers are minimum values. Cache misses, misalignment, and exceptions
may increase the clock counts considerably.

Pairability:
u=pairable in U-pipe, v=pairable in V-pipe, uv=pairable in either pipe,
np=not pairable 

Opcode                 Operands            Clock cycles        Pairability
----------------------------------------------------------------------------
NOP                                        1                   uv
MOV                    r/m, r/m/i          1                   uv
MOV                    r/m, sr             1                   np
MOV                    sr,  r/m            >= 2 b)             np
XCHG                   (E)AX, r            2                   np
XCHG                   r  ,   r            3                   np
XCHG                   r  ,   m            >20                 np
XLAT                                       4                   np
PUSH                   r/i                 1                   uv
POP                    r                   1                   uv
PUSH                   m                   2                   np
POP                    m                   3                   np
PUSH                   sr                  1 b)                np
POP                    sr                  >= 3 b)             np
PUSHF                                      4                   np
POPF                                       6                   np
PUSHA POPA                                 5                   np
LAHF SAHF                                  2                   np
MOVSX MOVZX            r, r/m              3 a)                np
LEA                    r  , m              1                   uv
LDS LES LFS LGS LSS    m                   4 c)                np
ADD SUB AND OR XOR     r  , r/i            1                   uv
ADD SUB AND OR XOR     r  , m              2                   uv
ADD SUB AND OR XOR     m  , r/i            3                   uv
CMP                    r  , r/i            1                   uv
CMP                    m  , r/i            2                   uv
TEST                   r  , r              1                   uv
TEST                   m  , r              2                   uv
TEST                   r  , i              1                   f)
TEST                   m  , i              2                   np
ADC SBB                r/m, r/m/i          1/3                 u
INC DEC                r                   1                   uv
INC DEC                m                   3                   uv
NEG NOT                r/m                 1/3                 np
MUL IMUL               r8/r16/m8/m16      11                   np
MUL IMUL               all other versions  9 d)                np
DIV                    r8/r16/r32          17/25/41            np
IDIV                   r8/r16/r32          22/30/46            np
CBW CWDE                                   3                   np
CWD CDQ                                    2                   np
SHR SHL SAR SAL        r  , i              1                   u
SHR SHL SAR SAL        m  , i              3                   u
SHR SHL SAR SAL        r/m, CL             4/5                 np
ROR ROL RCR RCL        r/m, 1              1/3                 u
ROR ROL                r/m, i(><1)         1/3                 np
ROR ROL                r/m, CL             4/5                 np
RCR RCL                r/m, i(><1)         8/10                np
RCR RCL                r/m, CL             7/9                 np
SHLD SHRD              r, i/CL             4 a)                np
SHLD SHRD              m, i/CL             5 a)                np
BT                     r, r/i              4 a)                np
BT                     m, i                4 a)                np
BT                     m, r                9 a)                np
BTR BTS BTC            r, r/i              7 a)                np
BTR BTS BTC            m, i                8 a)                np
BTR BTS BTC            m, r               14 a)                np
BSF BSR                r  , r/m            7-73 a)             np
SETcc                  r/m                 1 a)                np
JMP CALL               short/near          1                   v
JMP CALL               far                 >= 3                np
conditional jump       short/near          1/4/5 e)            v
CALL JMP               r/m                 2                   np
RETN                                       2                   np
RETN                   i                   3                   np
RETF                                       4                   np
RETF                   i                   5                   np
J(E)CXZ                short               5-10                np
LOOP                   short               5-10                np
BOUND                  r  , m              8                   np
CLC STC CMC CLD STD                        2                   np
CLI STI                                    6-7                 np
LODS                                       2                   np
REP LODS                                   7+3*n g)            np
STOS                                       3                   np
REP STOS                                   10+n  g)            np
MOVS                                       4                   np
REP MOVSB                                  12+1.8*n g)         np
REP MOVSW                                  12+1.5*n g)         np
REP MOVSD                                  12+n     g)         np
SCAS                                       4                   np
REP(N)E SCAS                               9+4*n    g)         np
CMPS                                       5                   np
REP(N)E CMPS                               8+5*n    g)         np
BSWAP                                      1 a)                np
----------------------------------------------------------------------------
Notes:
a) this instruction has a 0FH prefix which takes one clock cycle extra to 
   decode on a Pentium without MMX unless preceded by a multicycle 
   instruction (see section 13 above).
b) versions with FS and GS have a 0FH prefix. see note a.
c) versions with SS, FS, and GS have a 0FH prefix. see note a.
d) versions with two operands and no immediate have a 0FH prefix, see note a.
e) see section 12 above
f) only certain versions are pairable. see paragraph 17.1 above
g) add one clock cycle for decoding the repeat prefix unless preceded by a 
   multicycle instruction (such as CLD; see section 13 above).


21. LIST OF FLOATING POINT INSTRUCTIONS
=======================================
Explanations:
Operands: r=register, m=memory, m32=32 bit memory operand, etc.

Clock cycles:
The numbers are minimum values. Cache misses, misalignment, and exceptions 
may increase the clock counts considerably.  

Pairability:
+=pairable with FXCH, np=not pairable

i-ov:
Overlap with integer instructions. i-ov = 4 means that the last four clock 
cycles can overlap with subsequent integer instructions.  

fp-ov:
Overlap with floating point instructions. fp-ov = 2 means that the last two 
clock cycles can overlap with subsequent floating point instructions.
(WAIT is considered a floating point instruction here)
                                       
Opcode               Operand     Clock cycles    Pairability    i-ov   fp-ov
-----------------------------------------------------------------------------
FLD                  r/m32/m64         1         +              0      0
FLD                  m80               3         np             0      0
FBLD                 m80              49         np             0      0
FST(P)               r                 1         np             0      0
FST(P)               m32/m64           2 h)      np             0      0
FST(P)               m80               3 h)      np             0      0
FBSTP                m80             153         np             0      0
FILD                 m                 3         np             2      2
FIST(P)              m                 6         np             0      0
FLDZ FLD1                              2         np             0      0
FLDPI FLDL2E etc.                      5         np             0      0
FNSTSW               AX/m16            6         np             0      0
FLDCW                m16               8         np             0      0
FNSTCW               m16               2         np             0      0

FADD(P)              r/m               3         +              2      2
FSUB(R)(P)           r/m               3         +              2      2
FMUL(P)              r/m               3         +              2      2 i)
FDIV(R)(P)           r/m              39         + j)          38 k)   2
FCHS FABS                              1         +              0      0
FCOM(P)(P) FUCOM     r/m               1         +              0      0
FIADD FISUB(R)       m                 6         np             2      2
FIMUL                m                 6         np             2      2
FIDIV(R)             m                42         np            38 k)   2
FICOM                m                 4         np             0      0
FTST                                   1         np             0      0
FXAM                                  17         np             4      0
FPREM                              18-33         np             2      2
FPREM1                             20-49         np             2      2
FRNDINT                               19         np             0      0
FSCALE                                32         np             5      0
FXTRACT                            12-66         np             0      0

FSQRT                                 70         np            69 k)   2
FSIN FCOS FSINCOS                 varies         np             2      2
F2XM1 FYL2X FYL2XP1               varies         np             2      2
FPATAN                            varies         np             2      2
FPTAN                             varies         np            36 k)   0

FNOP                                   2         np             0      0
FXCH                 r                 1         np             0      0
FINCSTP FDECSTP                        2         np             0      0
FFREE                r                 2         np             0      0
FNCLEX                                 6-9       np             0      0
FNINIT                                22         np             0      0
FNSAVE               m            ca.300         np             0      0
FRSTOR               m                73         np             0      0
WAIT                                   1         np             0      0
-----------------------------------------------------------------------------
Notes:
h) The value to store is needed one clock cycle in advance.
i) 1 if the overlapping instruction is also a FMUL.
j) If the FXCH is followed by an integer instruction then it will still 
   pair, but take an extra clock cycle so that the integer instruction will 
   begin in clock cycle 3.
k) Cannot overlap integer multiplication instructions.


22. TESTING SPEED
=================
The Pentium has an internal 64 bit clock counter which can be read into 
EDX:EAX using the instruction RDTSC (read time stamp counter). This is very 
useful for testing exactly how many clock cycles a piece of code takes.  

The program below measures the number of clock cycles a piece of code 
takes. It executes the code to test 10 times and stores the 10 clock counts. 
The program can be used in both 16 and 32 bit mode.  

RDTSC   MACRO                   ; define RDTSC instruction
        DB      0FH,31H
ENDM

ITER    EQU     10              ; number of iterations

.DATA                           ; data segment
ALIGN   4
COUNTER DD      0               ; loop counter
TICS    DD      0               ; temporary storage of clock
RESULTLIST  DD  ITER DUP (0)    ; list of test results

.CODE                           ; code segment
BEGIN:  MOV     [COUNTER],0     ; reset loop counter
TESTLOOP:                       ; test loop
;****************   Do any initializations here:    ************************
        FINIT
;****************   End of initializations          ************************
        RDTSC                   ; read clock counter
        MOV     [TICS],EAX      ; save count
        CLD                     ; non-pairable filler
REPT    8
        NOP                     ; eight NOP's to avoid shadowing effect
ENDM

;****************   Put instructions to test here:  ************************
        FLDPI                   ; this is only an example
        FSQRT
        RCR     EBX,10
        FSTP    ST
;********************* End of instructions to test  ************************

        CLC                     ; non-pairable filler with shadow
        RDTSC                   ; read counter again
        SUB     EAX,[TICS]      ; compute difference
        SUB     EAX,15          ; subtract the clock cycles used by fillers
        MOV     EDX,[COUNTER]   ; loop counter
        MOV     [RESULTLIST][EDX],EAX   ; store result in table
        ADD     EDX,TYPE RESULTLIST     ; increment counter
        MOV     [COUNTER],EDX           ; store counter
        CMP     EDX,ITER * (TYPE RESULTLIST)
        JB      TESTLOOP                ; repeat ITER times

; insert here code to read out the values in RESULTLIST

The 'filler' instructions before and after the piece of code to test are 
critical. The CLD is a non-pairable instruction which has been inserted to 
make sure the pairing is the same the first time as the subsequent times. 
The eight NOP instructions are inserted to prevent any prefixes in the code 
to test from being decoded in the shadow of the preceding instructions. Single 
byte instructions are used here to obtain the same pairing the first time as 
the subsequent times. The CLC after the code to test is a non-pairable 
instruction which has a shadow under which the 0FH prefix of the RDTSC can 
be decoded so that it is independent of any shadowing effect from the code 
to test.  

The RDTSC instruction cannot execute in virtual mode, so if you are running 
under DOS you must skip the EMM386 (or any other memory manager) in your 
CONFIG.SYS and not run under a DOS box in Windows.  

The Pentium processor has special performance monitor counters which can 
count events such as cache misses, misalignments, AGI stalls, etc. Details 
about how to use the performance monitor counters are not covered by this 
manual and must be sought elsewhere.  


23. CONSIDERATIONS FOR OTHER MICROPROCESSORS
============================================
Most of the optimizations described in this document have little or no 
negative effect on other microprocessors, including non-Intel processors, 
but there are some problems to be aware of.  

Using a full register after writing to part of the register will cause a 
moderate delay on the 80486 and a severe delay on the PentiumPro. Example:
MOV AL,[EBX] / MOV ECX,EAX
On the PentiumPro you may avoid this penalty by zeroing the full register 
first:
XOR EAX,EAX / MOV AL,[EBX] / MOV ECX,EAX
or by using MOVZX.
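
A minimal sketch of the MOVZX alternative, assuming the same byte operand at 
[EBX] as in the example above:

```asm
        MOVZX   EAX,BYTE PTR [EBX]      ; load the byte, zero-extended to 32 bits
        MOV     ECX,EAX                 ; reads a register that was written in full
```

Note that MOVZX is listed below among the instructions which are poorly 
optimized on the Pentium, so the XOR form may be the better choice when the 
code must also run fast on that processor.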

Scheduling floating point code for the Pentium often requires a lot of extra 
FXCH instructions. This will slow down execution on earlier microprocessors, 
but not on the PentiumPro and advanced non-Intel processors.  

As mentioned in the introduction, Intel has announced new MMX versions of 
the Pentium and PentiumPro chips with special instructions for integer 
vector operations. These instructions will be very useful for massively 
parallel integer calculations.  

The Pentium Pro chip is faster than the Pentium in some respects, but 
inferior in other respects. Knowing the strong and weak sides of the 
PentiumPro can help you make your code work well on both processors.  

The most important advantage of the PentiumPro is that it does much of the 
optimization for you: reordering instructions and splitting complex 
instructions into simple ones. But for perfectly optimized code there is 
less difference between the two processors.  

The two processors have basically the same number of execution units, so the 
throughput should be nearly the same. The PPro has separate units for memory 
read and write so that it can do three operations simultaneously if one of 
them is a memory read, but on the other hand it cannot do two memory reads 
or two writes simultaneously as the Pentium can.  

The PPro is better than the Pentium in the following respects:

- out of order execution

- one cache miss does not delay subsequent independent instructions

- splitting complex instructions into smaller micro-ops

- automatic register renaming to avoid unnecessary dependencies

- better jump prediction algorithm than Pentium without MMX

- many instructions which are unpairable and poorly optimized on the Pentium
  perform better on the PPro, e.g. integer multiplication, movzx, cdq, bit
  scan, bit test, shifts by cl, and floating point store

- floating point instructions and simple integer instructions can execute
  simultaneously

- memory reads and writes do not occupy the ALUs

- indirect memory read instructions have no AGI stall

- new conditional move instructions can be used instead of branches in some 
  cases

- new FCOMI instruction eliminates the need for the slow FNSTSW AX 

- higher maximum clock frequency
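
As an illustration of the FCOMI item in the list above, the following sketch 
compares ST(0) with ST(1) and jumps if ST(0) is the smaller. The label name 
ST0_SMALLER is hypothetical, and the second form runs only on the PentiumPro 
and later processors:

```asm
; Pentium and earlier: the result has to go through AX
        FCOM    ST(1)           ; compare ST(0) with ST(1), result in status word
        FNSTSW  AX              ; copy status word to AX
        SAHF                    ; AH to flags: C0 -> CF, C2 -> PF, C3 -> ZF
        JB      ST0_SMALLER     ; taken if ST(0) < ST(1) (or unordered)

; PentiumPro: FCOMI sets ZF, PF and CF directly
        FCOMI   ST,ST(1)        ; compare ST(0) with ST(1)
        JB      ST0_SMALLER     ; taken if ST(0) < ST(1) (or unordered)
```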

The PPro is inferior to the Pentium in the following respects:

- mispredicted jumps are very expensive (10-15 clock cycles!)

- poor performance on 16 bit code and segmented models

- prefixes are expensive (except 0F extended opcode)

- long stall when mixing 8, 16, and 32 bit registers

- fadd, fsub, fmul, fchs have longer latency

- cannot do two memory reads or two memory writes simultaneously

- some instruction combinations cannot execute in parallel, like
  push+push, push+call, compare+conditional jump

As a consequence of this, the Pentium Pro may actually be slower than the 
Pentium on perfectly optimized code with a lot of unpredictable branches, 
and on floating point code with little or no natural parallelism.  

Most of the drawbacks of each processor can be circumvented by careful 
optimization and by running in 32 bit flat mode. But the problem with 
mispredicted jumps on the PPro cannot be avoided except in the cases where 
you can use a conditional move instead.  
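
A minimal sketch of such a replacement, assuming signed values in EAX and EBX 
and a hypothetical label name KEEP. Both fragments leave the larger of the two 
values in EAX:

```asm
; with a branch (expensive on the PPro if the outcome is unpredictable)
        CMP     EAX,EBX
        JGE     SHORT KEEP      ; EAX is already the larger value
        MOV     EAX,EBX
KEEP:

; with a conditional move (PentiumPro and later only)
        CMP     EAX,EBX
        CMOVL   EAX,EBX         ; EAX = EBX if EAX < EBX (signed)
```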

Taking advantage of the new instructions in the MMX and PentiumPro 
processors will create problems if you want your code to be compatible with 
earlier microprocessors. The solution may be to write several versions of 
your code, each optimized for a particular processor. Your program should 
automatically detect which processor it is running on and select the 
appropriate version of code. Such a complicated approach is of course only 
needed for the most critical parts of your program.
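
One way to do the detection is with the CPUID instruction. The fragment below 
is only a sketch under several assumptions: the code runs in 32 bit mode, the 
assembler knows the CPUID mnemonic (otherwise it can be coded as 
DB 0FH,0A2H), and the four label names stand for hypothetical versions of 
your code. Bits 15 (conditional move) and 23 (MMX) of EDX are the feature 
flags Intel defines for CPUID function 1:

```asm
        PUSHFD
        POP     EAX             ; EAX = current EFLAGS
        MOV     ECX,EAX         ; keep a copy
        XOR     EAX,200000H     ; try to toggle the ID flag (bit 21)
        PUSH    EAX
        POPFD
        PUSHFD
        POP     EAX             ; read EFLAGS back
        XOR     EAX,ECX         ; bit 21 set if the ID flag could be changed
        PUSH    ECX
        POPFD                   ; restore the original EFLAGS
        TEST    EAX,200000H
        JZ      GENERIC_VERSION ; no CPUID instruction on this processor
        XOR     EAX,EAX
        CPUID                   ; function 0: EAX = highest function supported
        CMP     EAX,1           ; (CPUID also changes EBX, ECX and EDX)
        JB      GENERIC_VERSION
        MOV     EAX,1
        CPUID                   ; function 1: EDX = feature flags
        TEST    EDX,800000H     ; bit 23: MMX instructions
        JNZ     MMX_VERSION     ; (hypothetical label)
        TEST    EDX,8000H       ; bit 15: conditional move instructions
        JNZ     CMOV_VERSION    ; (hypothetical label)
        JMP     PENTIUM_VERSION ; CPUID present, but no MMX or CMOV
```

A processor may have both feature bits set, so a real program would probably 
store the two test results in variables rather than jump on the first match.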
Date this article was posted to GameDev.net: 7/16/1999
(Note that this date does not necessarily correspond to the date the article was written)